---
title: "Calibration Testing"
type: concept
tags: [model-evaluation, probability-calibration, model-quality]
last_updated: 2026-04-25
---
## Definition
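Calibration testing (probability calibration) verifies whether the probabilities a model predicts agree with the frequencies actually observed. For a well-calibrated classifier, events predicted with a probability of 80% should occur about 80% of the time.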
## Core Methods
### Hosmer-Lemeshow Test
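- Sort the predicted probabilities into groups (10 by default) and compare the observed number of positives in each group with the expected number
- Statistic: $\chi^2 = \sum \frac{(observed - expected)^2}{expected(1 - expected/n)}$, summed over groups, where $n$ is the number of observations in the group
- Degrees of freedom: number of groups - 2; p-value < 0.05 → reject the null hypothesis (poor calibration)
- **Limitations**: sensitive to sample size, and the result changes with the grouping scheme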
### Brier Score
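- $BS = \frac{1}{N}\sum(p_i - y_i)^2$, ranging over [0, 1] for binary classification; a constant forecast of 0.5 scores 0.25, so a useful model should score below that
- Measures calibration and refinement (discrimination) jointly
- Lower is better; it decomposes as $BS = \text{Calibration} + \text{Refinement}$, where the calibration term is itself a weighted mean squared gap between the forecast and the observed frequency
- **Strengths**: needs no grouping, is robust to sample size, and can be compared across models evaluated on the same data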
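The split of the Brier score into a calibration term and a refinement term can be checked numerically. The sketch below is a minimal illustration, not a reference implementation: the function name `brier_decomposition` and the toy data are hypothetical, and it assumes the forecasts take only a few distinct values, so that grouping by forecast value makes the identity exact (with binned continuous forecasts it holds only approximately).

```python
import numpy as np

def brier_decomposition(y_true, y_prob):
    """Return (brier, calibration, refinement), grouping by distinct forecast value."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    brier = float(np.mean((y_prob - y_true) ** 2))

    calibration = 0.0
    refinement = 0.0
    for p in np.unique(y_prob):
        mask = y_prob == p
        weight = mask.sum() / n                # share of observations with this forecast
        observed_rate = y_true[mask].mean()    # empirical positive rate in the group
        calibration += weight * (p - observed_rate) ** 2
        refinement += weight * observed_rate * (1.0 - observed_rate)
    return brier, float(calibration), float(refinement)

# Hypothetical toy data: forecasts take only three distinct values.
y_prob = np.array([0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9])
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0])

brier, calibration, refinement = brier_decomposition(y_true, y_prob)
print(brier, calibration, refinement)
```

The calibration term shrinks to zero when every forecast value matches its observed rate, while the refinement term depends only on how sharply the groups separate positives from negatives.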
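### Reliability Diagram
- Bin the predicted probabilities and plot the observed positive rate against the predicted probability
- The ideal curve is the 45° diagonal; an S-shaped curve indicates under- or over-prediction
- A visual diagnostic, well suited to spotting systematic calibration bias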
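The points of a reliability diagram can be computed with a few lines of NumPy. This is a sketch under stated assumptions: `reliability_curve` is a hypothetical helper, the bins are equal-width on [0, 1], and the data are simulated so that the model is calibrated by construction.

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """Per non-empty bin: mean predicted probability, observed positive rate, count."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so that a probability of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, obs_rate, counts = [], [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            obs_rate.append(y_true[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(mean_pred), np.array(obs_rate), np.array(counts)

# Simulated data: outcomes are drawn with exactly the predicted probability.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.0, 1.0, size=20_000)
y_true = (rng.uniform(size=y_prob.size) < y_prob).astype(int)

mean_pred, obs_rate, counts = reliability_curve(y_true, y_prob)
max_gap = float(np.abs(obs_rate - mean_pred).max())
print(max_gap)  # small for a calibrated model; points lie near the diagonal
```

Plotting `obs_rate` against `mean_pred` next to the diagonal gives the diagram itself. scikit-learn's `sklearn.calibration.calibration_curve` returns comparable binned values.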
### Expected Calibration Error (ECE)
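- The weighted average, over bins, of the absolute difference between predicted probability and observed frequency
- $ECE = \sum_b \frac{|b|}{n} |acc(b) - conf(b)|$, where $|b|$ is the number of samples in bin $b$, $acc(b)$ the observed frequency, and $conf(b)$ the mean predicted probability
- Quantifies calibration error in a single number, which makes cross-model comparison easy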
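A minimal sketch of the ECE formula for a binary classifier follows. It assumes equal-width bins and treats $acc(b)$ as the observed positive rate and $conf(b)$ as the mean predicted probability of the positive class; the function name and the toy data are hypothetical.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins: sum over bins of |b|/n * |observed rate - mean prediction|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # interior edges: 1.0 falls in the last bin
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap            # mask.mean() equals |b| / n
    return float(ece)

# Hypothetical toy data: the low bin is calibrated, the high bin under-predicts.
y_prob = np.array([0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75])
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])

ece = expected_calibration_error(y_true, y_prob)
print(ece)  # 0.5 * |0.25 - 0.25| + 0.5 * |1.0 - 0.75| = 0.125
```

Like the Hosmer-Lemeshow test, the value depends on the binning scheme, so comparisons across models should use the same bins.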
## Usage
```python
import pandas as pd
from scipy.stats import chi2
from sklearn.metrics import brier_score_loss


# Hosmer-Lemeshow
def hosmer_lemeshow_test(y_true, y_pred, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test on quantile buckets of y_pred."""
    data = pd.DataFrame({"y": y_true, "p": y_pred})
    # Quantile buckets; duplicates="drop" may yield fewer than `groups` buckets.
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
    agg = data.groupby("bucket", observed=True).agg(
        n=("y", "count"), observed=("y", "sum"), expected=("p", "sum")
    )
    hl_stat = (((agg["observed"] - agg["expected"]) ** 2) /
               (agg["expected"] * (1 - agg["expected"] / agg["n"]))).sum()
    dof = len(agg) - 2
    p_value = chi2.sf(hl_stat, dof)  # survival function, equivalent to 1 - cdf
    return {
        "HL_statistic": round(float(hl_stat), 4),
        "p_value": round(float(p_value), 6),
        "calibrated": bool(p_value >= 0.05),
    }


# Brier Score (y_true: 0/1 labels, y_pred: predicted probabilities, defined by the caller)
bs = brier_score_loss(y_true, y_pred)
```
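## Application in Model QA
The Model QA Specialist performs the following calibration audits:
1. **Calibration across subgroups**: test separately on subgroups such as age, region, and income to uncover local calibration problems that aggregate metrics hide
2. **Stability across time windows**: test calibration across OOT (Out-of-Time) windows to identify temporal drift
3. **Calibration under distribution shift**: test under stress scenarios (population shift) to assess model robustness
4. **Calibration at the decision threshold**: given the business decision threshold (for example, p > 0.6 triggers an action), assess calibration quality at that threshold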
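The first audit step can be sketched as a per-subgroup comparison of the mean predicted probability with the observed positive rate. This is an illustration only: `subgroup_calibration`, the group labels, and the data are hypothetical, and the gap used here is the simplest possible check (a fuller audit would compute ECE or a Hosmer-Lemeshow test within each subgroup).

```python
import numpy as np

def subgroup_calibration(y_true, y_prob, groups):
    """Mean predicted probability minus observed positive rate, overall and per subgroup."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    groups = np.asarray(groups)

    def summarize(mask):
        return {
            "n": int(mask.sum()),
            "mean_predicted": float(y_prob[mask].mean()),
            "observed_rate": float(y_true[mask].mean()),
            "gap": float(y_prob[mask].mean() - y_true[mask].mean()),
        }

    report = {"ALL": summarize(np.ones(len(y_true), dtype=bool))}
    for g in np.unique(groups):
        report[str(g)] = summarize(groups == g)
    return report

# Hypothetical data: the overall gap is zero, yet each region is off by 25 points.
y_prob = np.full(8, 0.5)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
region = np.array(["north"] * 4 + ["south"] * 4)

report = subgroup_calibration(y_true, y_prob, region)
for name, row in report.items():
    print(name, row)
```

Here the model under-predicts in one region and over-predicts in the other, and the two errors cancel in the aggregate, which is exactly the situation the subgroup audit is meant to expose. The same function can be reused with time-window labels in place of `region` for the OOT stability check.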
## Relationship
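- **Depends on** [[Discrimination-Metrics]]: calibration is only worth discussing once the model has been shown to discriminate (AUC/Gini)
- **Depends on** [[SHAP]]: SHAP explains "which feature is driving the calibration error" and so guides the diagnosis
- **Depends on** [[Population-Stability-Index]]: PSI captures drift in feature distributions, and drift is one of the root causes of calibration failure
- **Supports** [[specialized-model-qa]] (Source): one of the core audit steps of the Model QA Specialist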
## Key Insights
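- **High AUC ≠ Well Calibrated**: a model can discriminate well yet be poorly calibrated (for example, logistic regression tends to be calibrated by construction, whereas neural networks are often overconfident)
- **Business impact**: a calibration error of 180 bps (0.018) in decile 10 can affect 12% of the portfolio
- **Regulatory requirements**: regulatory frameworks such as the Basel Accords, IFRS 9, and CCAR explicitly require probability calibration for credit-risk models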
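The first insight can be demonstrated directly: any strictly increasing distortion of the scores leaves the ranking, and hence the AUC, unchanged, while the Brier score gets worse. The sketch below uses simulated data and hypothetical helper names; the particular distortion is just one convenient way to mimic an overconfident model.

```python
import numpy as np

def auc_score(y_true, y_prob):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative score pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def brier_score(y_true, y_prob):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Simulated, calibrated-by-construction scores and outcomes.
rng = np.random.default_rng(42)
p_calibrated = rng.uniform(0.0, 1.0, size=2_000)
y = (rng.uniform(size=p_calibrated.size) < p_calibrated).astype(int)

# Strictly increasing distortion that pushes scores towards 0 and 1 (overconfidence).
p_overconfident = p_calibrated**3 / (p_calibrated**3 + (1.0 - p_calibrated) ** 3)

auc_cal, auc_over = auc_score(y, p_calibrated), auc_score(y, p_overconfident)
bs_cal, bs_over = brier_score(y, p_calibrated), brier_score(y, p_overconfident)
print(auc_cal, auc_over)  # same ranking, same AUC
print(bs_cal, bs_over)    # the overconfident scores have the higher Brier score
```

This is why discrimination metrics alone cannot certify a probability model: AUC and Gini see only the ordering of the scores, not their values.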