---
title: "Hosmer-Lemeshow Test"
type: concept
tags: [model-evaluation, calibration-testing, goodness-of-fit]
last_updated: 2026-04-25
---
## Definition
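The Hosmer-Lemeshow (HL) test is a goodness-of-fit test that assesses how well the predicted probabilities of a binary classification model are calibrated. It bins the samples by predicted probability and compares the observed number of positives in each bin with the expected number, to judge whether the model's predictions differ significantly from actual outcomes. When the p-value is below 0.05, the null hypothesis (that the model is well calibrated) is rejected and the model is considered to have a significant calibration error.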
## Algorithm
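1. Sort the samples by predicted probability in ascending order and split them into bins (10 bins by default, or a custom number of groups $G$).
2. For each bin $g$, compute:
   - **Observed positives** $O_g = \sum_{i \in \text{group } g} y_i$
   - **Expected positives** $E_g = \sum_{i \in \text{group } g} \hat{p}_i$
   - **Sample count** $n_g$
3. Compute the HL statistic:
   $$H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g (1 - E_g / n_g)}$$
4. Degrees of freedom: $df = G - 2$ (subtracting the estimated intercept and slope parameters).
5. Compare $H$ against the $\chi^2(df)$ distribution: $p = 1 - F_{\chi^2(df)}(H)$, where $F_{\chi^2(df)}$ is the cumulative distribution function of $\chi^2(df)$.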
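As a worked illustration of steps 2 to 5, the sketch below evaluates the statistic for four hand-written groups. The `(n_g, O_g, E_g)` values and the helper name `hl_from_groups` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from scipy.stats import chi2

# Hypothetical per-group summaries (n_g, O_g, E_g); illustrative values only.
groups = [
    (100, 4, 5.0),
    (100, 12, 10.0),
    (100, 22, 20.0),
    (100, 38, 40.0),
]


def hl_from_groups(groups):
    """Return (H, df, p) for a list of (n_g, O_g, E_g) tuples."""
    stat = 0.0
    for n_g, o_g, e_g in groups:
        # One term of the sum: (O_g - E_g)^2 / (E_g * (1 - E_g / n_g))
        stat += (o_g - e_g) ** 2 / (e_g * (1 - e_g / n_g))
    df = len(groups) - 2
    # chi2.sf is the survival function, i.e. 1 - CDF
    return stat, df, chi2.sf(stat, df)


stat, df, p = hl_from_groups(groups)
print(f"H = {stat:.4f}, df = {df}, p = {p:.4f}")  # H ≈ 1.0716, df = 2, p ≈ 0.585
```

With these numbers the four terms are roughly 0.211, 0.444, 0.250 and 0.167, so the test would not reject at the 0.05 level.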
## Interpretation
```python
import pandas as pd
from scipy.stats import chi2


def hosmer_lemshow_test(y_true: pd.Series, y_pred: pd.Series, groups: int = 10) -> dict:
    data = pd.DataFrame({"y": y_true, "p": y_pred})
    # Quantile bins on the predicted probability; duplicate edges are dropped,
    # so fewer than `groups` bins may be produced.
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
    agg = data.groupby("bucket", observed=True).agg(
        n=("y", "count"),         # n_g
        observed=("y", "sum"),    # O_g
        expected=("p", "sum"),    # E_g
    )
    hl_stat = (
        ((agg["observed"] - agg["expected"]) ** 2)
        / (agg["expected"] * (1 - agg["expected"] / agg["n"]))
    ).sum()
    dof = len(agg) - 2
    # Survival function = 1 - CDF, numerically more stable in the tail
    p_value = chi2.sf(hl_stat, dof)
    return {
        "HL_statistic": round(float(hl_stat), 4),
        "p_value": round(float(p_value), 6),
        "calibrated": bool(p_value >= 0.05),  # True = well calibrated
        "dof": dof,
        "groups_used": len(agg),
    }
```
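A usage sketch on synthetic data may help show how the output is read. Everything here is an assumption made for illustration: the helper `hl_test` is a compact restatement of the function above so that the block runs on its own, and the data, seed and the squared-probability distortion are invented.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def hl_test(y, p, groups=10):
    """Compact HL test (hypothetical helper); returns (statistic, p_value)."""
    data = pd.DataFrame({"y": y, "p": p})
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
    agg = data.groupby("bucket", observed=True).agg(
        n=("y", "count"), observed=("y", "sum"), expected=("p", "sum")
    )
    stat = (
        (agg["observed"] - agg["expected"]) ** 2
        / (agg["expected"] * (1 - agg["expected"] / agg["n"]))
    ).sum()
    return float(stat), float(chi2.sf(stat, len(agg) - 2))


rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=20_000)
y = rng.binomial(1, p_true)          # outcomes drawn from the true probabilities

# Scores equal to the true probabilities vs. a deliberately distorted score
stat_good, p_good = hl_test(y, p_true)
stat_bad, p_bad = hl_test(y, p_true ** 2)

print(f"true probabilities: H = {stat_good:.2f}, p = {p_good:.4f}")
print(f"distorted scores:   H = {stat_bad:.2f}, p = {p_bad:.4g}")
```

The distorted scores systematically under-predict, so their statistic is far larger and the test rejects. The result for the undistorted scores depends on the random draw and should not be read as a guarantee.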
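| p-value | Interpretation |
|---------|----------------|
| ≥ 0.05 | 🟢 Model is well calibrated; no significant evidence that predicted probabilities deviate from observed frequencies |
| < 0.05 | 🔴 Null hypothesis rejected; the model has a significant calibration error |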
## Limitations
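1. **Sensitive to grouping**: different bin counts or binning methods give different results; 10 equal-sized groups is the convention but not necessarily optimal.
2. **Sensitive to sample size**: with large samples, even a tiny deviation can produce a significant p-value, although its practical impact may be small.
3. **Masks subgroup problems**: passing the HL test overall does not mean every subgroup is well calibrated.
4. **Quantile grouping issue**: `qcut` may merge bins when there are many duplicate values, so check `groups_used`.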
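The sample-size sensitivity can be seen without any simulation. In the sketch below every group's observed rate exceeds its expected rate by the same one percentage point; the rates, the gap and the function name `hl_for_constant_gap` are hypothetical.

```python
from scipy.stats import chi2


def hl_for_constant_gap(n_per_group, rates=(0.1, 0.3, 0.5, 0.7, 0.9), gap=0.01):
    """HL statistic and p-value when observed rate = expected rate + gap in every group."""
    stat = 0.0
    for r in rates:
        expected = n_per_group * r
        observed = n_per_group * (r + gap)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_per_group))
    df = len(rates) - 2
    return stat, chi2.sf(stat, df)


stat_small, p_small = hl_for_constant_gap(1_000)
stat_large, p_large = hl_for_constant_gap(100_000)

print(f"n_g = 1,000:   H = {stat_small:.2f}, p = {p_small:.3f}")   # H ≈ 3.57
print(f"n_g = 100,000: H = {stat_large:.2f}, p = {p_large:.3g}")   # H ≈ 357.46
```

Each term simplifies to $n_g \cdot \text{gap}^2 / (r(1-r))$, so the statistic grows linearly with the group size while the calibration gap itself stays the same: the same deviation passes at 1,000 samples per group and fails at 100,000.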
## Alternatives
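- **Brier Score**: needs no grouping and is robust to sample size, but gives only the magnitude of the error and does not locate it.
- **Spiegelhalter's Z-test**: a statistical test based on the Brier Score.
- **Reliability Curves**: a visual diagnostic that provides more information than the HL test.
- **Expected Calibration Error (ECE)**: quantifies the average calibration error and is easier to interpret.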
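A minimal NumPy sketch of three of these alternatives follows. The function names, the equal-width binning used for ECE and the four-sample toy data are assumptions made for illustration; production code would more likely rely on an established library implementation.

```python
import numpy as np
from scipy.stats import norm


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((p - y) ** 2))


def expected_calibration_error(y, p, n_bins=10):
    """ECE with equal-width bins: weighted mean of |observed rate - mean prediction|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)


def spiegelhalter_z(y, p):
    """Spiegelhalter's Z statistic and its two-sided p-value."""
    num = np.sum((y - p) * (1 - 2 * p))
    den = np.sqrt(np.sum((1 - 2 * p) ** 2 * p * (1 - p)))
    z = num / den
    return float(z), float(2 * norm.sf(abs(z)))


# Toy data, illustrative only
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.6, 0.9])

bs = brier_score(y, p)                    # 0.085
ece = expected_calibration_error(y, p)    # 0.25
z, z_pvalue = spiegelhalter_z(y, p)       # z ≈ -0.873
print(bs, ece, z, z_pvalue)
```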
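## Application in Model QA
The Model QA Specialist uses the HL test for:
1. **Pre-deployment validation**: a new model must pass the HL test (p ≥ 0.05) before going live.
2. **Periodic monitoring**: rerun the test on out-of-time (OOT) windows to monitor how calibration degrades over time.
3. **Stratified subgroup testing**: run the test separately on key subgroups (high risk / low risk / new customers).
4. **Champion-Challenger**: compare the HL results of the champion model and the challenger model.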
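Stratified subgroup testing can be sketched as a loop over segments. The segment labels, the synthetic data, the simulated drift for new customers and the helper `hl_p_value` are all hypothetical; the helper uses simple equal-count groups rather than `qcut` to stay short.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def hl_p_value(y, p, groups=10):
    """HL p-value with equal-count groups formed after sorting by p (simplified helper)."""
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, groups):
        n, observed, expected = len(idx), y[idx].sum(), p[idx].sum()
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    return float(chi2.sf(stat, groups - 2))


rng = np.random.default_rng(42)
n = 6_000
frame = pd.DataFrame({
    "segment": rng.choice(["high_risk", "low_risk", "new_customer"], size=n),
    "p": rng.uniform(0.05, 0.95, size=n),
})
# Simulated drift: the true rate for new customers is 15 points above the prediction
true_rate = np.where(
    frame["segment"] == "new_customer",
    np.minimum(frame["p"] + 0.15, 0.99),
    frame["p"],
)
frame["y"] = rng.binomial(1, true_rate)

# One HL p-value per segment
report = {
    seg: hl_p_value(part["y"].to_numpy(), part["p"].to_numpy())
    for seg, part in frame.groupby("segment")
}
failing = [seg for seg, pv in report.items() if pv < 0.05]
print(report)
print("segments failing at 0.05:", failing)
```

Running many subgroup tests raises the chance of a false alarm, so a multiple-comparison adjustment may be worth considering when the number of segments is large.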
## Relationship
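- **Depended on by** [[Calibration-Testing]]: the HL test is one of the core statistical tools of Calibration Testing.
- **Depends on** [[Discrimination-Metrics]]: first confirm that the model has discriminatory power (AUC/Gini meets the threshold), then discuss calibration.
- **Depends on** [[Population-Stability-Index]]: PSI drift is often a precursor to HL test failure.
- **Depends on** [[SHAP]]: once the HL test reveals a calibration problem, use SHAP waterfall plots to diagnose the specific cause.
- **Supports** [[specialized-model-qa]] (Source): the core tool in the Model QA Specialist's calibration testing steps.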