Sync: add model evaluation and training notes
38
wiki/concepts/ADDIE-Model.md
Normal file
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: "ADDIE Model"
|
||||
type: concept
|
||||
tags: []
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
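The ADDIE model is a systematic framework for developing corporate training courses. It has five phases:

1. **Analysis**: training needs analysis, covering organizational diagnosis, capability-gap identification, training ROI estimation, and needs prioritization
2. **Design**: learning-objective design, defining measurable learning outcomes based on Bloom's taxonomy
3. **Development**: course content development, including micro-lessons, case studies, exercises, question banks, and slide decks
4. **Implementation**: training delivery through online, in-person, or blended learning
5. **Evaluation**: effectiveness evaluation based on the Kirkpatrick four-level model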
|
||||
|
||||
## Aliases
|
||||
- ADDIE
|
||||
- ADDIE Model
|
||||
- ADDIE 教学设计模型
|
||||
- 分析-设计-开发-实施-评估
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
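- **Clear deliverables for each phase**: needs-analysis report, instructional design document, course package, training delivery plan, and evaluation report
- **Iterative**: in practice the phases are usually cycled through repeatedly rather than executed strictly in sequence
- **Systematic**: ensures a complete closed loop from training needs to training results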
|
||||
|
||||
## Related Concepts
|
||||
|
||||
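- [[Kirkpatrick-四级评估]]: the concrete methodology for ADDIE's final phase (Evaluation)
- [[Bloom-认知分类]]: the cognitive-level framework used to design learning objectives in the ADDIE Design phase
- [[Kolb-体验式学习圈]]: another learning-design framework that sits alongside ADDIE, with an emphasis on the experiential cycle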
|
||||
|
||||
## Source
|
||||
|
||||
- [[corporate-training-designer]]
|
||||
40
wiki/concepts/Bloom-认知分类.md
Normal file
@@ -0,0 +1,40 @@
|
||||
---
|
||||
title: "Bloom's Taxonomy"
|
||||
type: concept
|
||||
tags: []
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
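Bloom's Taxonomy is a classification of educational objectives first published by Benjamin Bloom and colleagues in 1956. The six progressive levels listed here follow the 2001 revision by Anderson and Krathwohl, which renamed the original categories (Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation) as verbs and placed Create at the top:

1. **Remember**: recognize and recall basic facts (define, list, restate)
2. **Understand**: explain what a concept means (summarize, classify, explain causes)
3. **Apply**: use knowledge in a new situation (execute, operate, solve problems)
4. **Analyze**: break down complex structures (differentiate, organize, attribute)
5. **Evaluate**: make judgments against criteria (check, critique, argue)
6. **Create**: combine elements into a new structure (design, construct, invent)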
|
||||
|
||||
## Aliases
|
||||
- Bloom's Taxonomy
|
||||
- Bloom 认知分类
|
||||
- Bloom 教育目标分类
|
||||
- 布鲁姆认知分类
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
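- **Progressive**: moves from lower-order thinking (remember, understand) to higher-order thinking (analyze, evaluate, create)
- **Use in instructional design**: each level maps to different learning activities and assessment methods
  - Lower-order objectives → lectures, reading, quizzes
  - Higher-order objectives → case analysis, project work, presentations of created work
- **Backward design**: start from the target cognitive level, then design the learning activities and assessments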
|
||||
|
||||
## Related Concepts
|
||||
|
||||
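- [[ADDIE-Model]]: Bloom's taxonomy is the core tool for defining learning objectives in the ADDIE Design phase
- [[Kirkpatrick-四级评估]]: the cognitive level of a learning activity influences the choice of Level 2 assessment method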
|
||||
|
||||
## Source
|
||||
|
||||
- [[corporate-training-designer]]
|
||||
78
wiki/concepts/Calibration-Testing.md
Normal file
@@ -0,0 +1,78 @@
|
||||
---
|
||||
title: "Calibration Testing"
|
||||
type: concept
|
||||
tags: [model-evaluation, probability-calibration, model-quality]
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
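Calibration testing verifies that a model's predicted probabilities match observed frequencies. For a well-calibrated classifier, among the cases assigned a predicted probability of 80%, roughly 80% should turn out positive.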
|
||||
|
||||
## Core Methods
|
||||
|
||||
### Hosmer-Lemeshow Test
|
||||
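- Split the predictions into groups by predicted probability (10 by default) and compare observed with expected positives in each group
- Statistic: $\chi^2 = \sum_g \frac{(O_g - E_g)^2}{E_g (1 - E_g / n_g)}$, where $O_g$, $E_g$, and $n_g$ are the observed positives, expected positives, and sample count in group $g$
- Degrees of freedom: number of groups - 2; p-value < 0.05 → reject the null hypothesis of good calibration (the model is poorly calibrated)
- **Limitations**: sensitive to sample size; results change with the grouping scheme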
|
||||
|
||||
### Brier Score
|
||||
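- $BS = \frac{1}{N}\sum_i (p_i - y_i)^2$, ranging over [0, 1] for binary outcomes; a constant forecast of 0.5 scores exactly 0.25
- Measures calibration and refinement together
- Lower is better; decomposes as $BS = \text{Calibration} + \text{Refinement}$, or equivalently reliability - resolution + uncertainty in Murphy's three-term form
- **Advantages**: no grouping needed, robust to sample size, comparable across models scored on the same data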
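The Brier score decomposition can be checked numerically. The following is a minimal, dependency-free sketch rather than a reference implementation: the function name `brier_decomposition` is hypothetical, it uses Murphy's three-term form (reliability - resolution + uncertainty), and it groups samples by exact forecast value, which is the case in which the identity holds exactly.

```python
from collections import defaultdict


def brier_decomposition(y_true, y_prob):
    """Return (brier, reliability, resolution, uncertainty) for binary labels.

    Samples are grouped by exact forecast value, so
    brier == reliability - resolution + uncertainty up to rounding.
    """
    n = len(y_true)
    base_rate = sum(y_true) / n

    # Collect the outcomes observed for each distinct forecast value
    groups = defaultdict(list)
    for y, p in zip(y_true, y_prob):
        groups[p].append(y)

    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n
    return brier, reliability, resolution, uncertainty


labels = [0, 0, 1, 1, 1, 0]
forecasts = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8]
bs, rel, res, unc = brier_decomposition(labels, forecasts)
```

With continuous forecasts the samples would have to be binned first, and the identity then holds only approximately.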
|
||||
|
||||
### Reliability Diagram(可靠性图)
|
||||
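- Bin the predicted probabilities and plot the observed positive rate against the predicted probability in each bin
- The ideal is the 45° diagonal; points below it indicate over-prediction, points above it indicate under-prediction, and an S-shaped curve indicates systematic under- or over-confidence
- A visual diagnostic, well suited to spotting systematic calibration bias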
|
||||
|
||||
### Expected Calibration Error (ECE)
|
||||
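- Weighted average across bins of the absolute gap between predicted probability and observed frequency
- $ECE = \sum_b \frac{|b|}{n} |acc(b) - conf(b)|$, where $|b|$ is the number of samples in bin $b$, $acc(b)$ the observed frequency, and $conf(b)$ the mean predicted probability
- Quantifies calibration error as a single number, which makes cross-model comparison easier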
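The ECE formula can be sketched in a few lines. This is an illustrative, dependency-free version under stated assumptions: the function name `expected_calibration_error` is hypothetical, bins are equal-width over [0, 1], and for the binary case `acc(b)` is taken as the observed positive rate and `conf(b)` as the mean predicted probability.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins; empty bins contribute nothing."""
    n = len(y_true)
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(y for y, _ in members) / len(members)   # observed frequency
        conf = sum(p for _, p in members) / len(members)  # mean predicted probability
        ece += len(members) / n * abs(acc - conf)
    return ece


ece_example = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
ece_perfect = expected_calibration_error([1, 0], [0.5, 0.5])
```

Equal-width bins are only one choice; quantile bins give each bin the same weight and can change the value noticeably on skewed score distributions.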
|
||||
|
||||
## Usage
|
||||
|
||||
```python
# Hosmer-Lemeshow
import pandas as pd
from scipy.stats import chi2
from sklearn.metrics import brier_score_loss


def hosmer_lemeshow_test(y_true, y_pred, groups=10):
    data = pd.DataFrame({"y": y_true, "p": y_pred})
    # Quantile bins; duplicates="drop" may yield fewer than `groups` bins
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
    agg = data.groupby("bucket", observed=True).agg(
        n=("y", "count"), observed=("y", "sum"), expected=("p", "sum")
    )
    hl_stat = (((agg["observed"] - agg["expected"]) ** 2) /
               (agg["expected"] * (1 - agg["expected"] / agg["n"]))).sum()
    dof = len(agg) - 2
    p_value = chi2.sf(hl_stat, dof)  # survival function, equal to 1 - cdf
    return {"HL_statistic": round(hl_stat, 4), "p_value": round(p_value, 6), "calibrated": bool(p_value >= 0.05)}


# Brier Score (y_true: 0/1 labels, y_pred: predicted probabilities)
bs = brier_score_loss(y_true, y_pred)
```
|
||||
|
||||
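## Application in Model QA

The Model QA Specialist performs the following calibration audits:

1. **Calibration across subgroups**: test separately on subgroups such as age, region, and income to surface local calibration problems hidden by aggregate metrics
2. **Stability across time windows**: test calibration on out-of-time (OOT) windows to detect temporal drift
3. **Calibration under distribution shift**: test under stress scenarios (population shift) to assess model robustness
4. **Calibration at the decision threshold**: assess calibration quality near the business decision threshold (for example, p > 0.6 triggers an action)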
|
||||
|
||||
## Relationship
|
||||
|
||||
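- **Depends on** [[Discrimination-Metrics]]: calibration is only worth discussing once the model has been shown to discriminate (AUC/Gini)
- **Depends on** [[SHAP]]: SHAP helps explain which features drive a calibration bias, which points the diagnosis in the right direction
- **Depends on** [[Population-Stability-Index]]: PSI captures feature distribution drift, one of the root causes of calibration failure
- **Supports** [[specialized-model-qa]] (Source): one of the Model QA Specialist's core audit steps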
|
||||
|
||||
## Key Insights
|
||||
|
||||
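- **High AUC ≠ Well Calibrated**: a model can discriminate well and still be poorly calibrated (logistic regression tends to be well calibrated by construction, while neural networks are often overconfident)
- **Business impact**: a calibration error of 180 bps (0.018) in decile 10 may affect 12% of the portfolio
- **Regulatory requirements**: frameworks such as the Basel accords, IFRS 9, and CCAR explicitly require probability calibration for credit risk models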
|
||||
76
wiki/concepts/Discrimination-Metrics.md
Normal file
@@ -0,0 +1,76 @@
|
||||
---
|
||||
title: "Discrimination Metrics"
|
||||
type: concept
|
||||
tags: [model-evaluation, classification-metrics, model-performance]
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
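Discrimination metrics measure how well a model separates positives from negatives: given one random positive and one random negative, how likely is the model to score the positive higher? Unlike calibration, which measures whether the probabilities are accurate, discrimination measures whether the ranking is correct.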
|
||||
|
||||
## Core Metrics
|
||||
|
||||
### AUC (Area Under the ROC Curve)
|
||||
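- Area under the ROC curve; ranges over [0, 1], with useful models falling in [0.5, 1.0]
- 0.5 = random guessing, 1.0 = perfect separation; a value below 0.5 means the ranking is inverted
- Interpretation: the probability that a random positive receives a higher score than a random negative
- **Advantages**: threshold-independent and relatively robust to class imbalance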
|
||||
|
||||
### Gini Coefficient
|
||||
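- $Gini = 2 \times AUC - 1$
- Ranges over [-1, 1], and over [0, 1] for models no worse than random; a linear rescaling of AUC
- Widely used in finance (credit card scoring, credit risk control)
- A standard metric in regulatory reporting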
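The pairwise interpretation of AUC and its link to Gini can be made concrete with a brute-force computation. This is a small illustrative sketch, not an efficient implementation: `pairwise_auc` is a hypothetical helper that compares every positive with every negative and counts ties as half a win.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
gini = 2 * auc - 1
```

The double loop costs O(n_pos × n_neg), so it is only suitable for checking intuition on small samples; rank-based implementations scale far better.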
|
||||
|
||||
### KS Statistic (Kolmogorov-Smirnov)
|
||||
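- Maximum vertical distance between the two cumulative score distributions (positives vs negatives)
- $KS = \max_t |F_{pos}(t) - F_{neg}(t)|$
- Ranges over [0, 1]; KS > 0.2 is commonly taken as evidence of discriminatory power
- **Advantages**: no threshold has to be chosen in advance, and the maximizing $t$ indicates where the best cut-off lies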
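The KS definition can be evaluated directly from the two empirical CDFs. The sketch below is illustrative and dependency-free; `ks_statistic` is a hypothetical name, and it also returns the score at which the maximum gap occurs, which is the cut-off information mentioned above.

```python
def ks_statistic(y_true, y_score):
    """Return (KS, threshold) where KS = max_t |F_pos(t) - F_neg(t)|."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]

    best_gap, best_t = 0.0, None
    # The empirical CDFs only change at observed scores, so those are the candidates
    for t in sorted(set(y_score)):
        f_pos = sum(s <= t for s in pos) / len(pos)
        f_neg = sum(s <= t for s in neg) / len(neg)
        gap = abs(f_pos - f_neg)
        if gap > best_gap:
            best_gap, best_t = gap, t
    return best_gap, best_t


ks, cutoff = ks_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

When several thresholds tie for the maximum gap, this version reports the lowest one.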
|
||||
|
||||
### Additional Metrics
|
||||
| Metric | Formula | Use case |
|--------|---------|----------|
| F1 Score | $2 \times \frac{precision \times recall}{precision + recall}$ | Class imbalance |
| RMSE | $\sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$ | Regression models |
| Log Loss | $-\frac{1}{N}\sum[y_i \log p_i + (1-y_i)\log(1-p_i)]$ | Probability quality |
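As a worked instance of the Log Loss row, here is a minimal dependency-free sketch. The function name `binary_log_loss` and the clipping constant `eps` are assumptions for illustration; clipping is a common way to avoid taking the logarithm of zero.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


uninformative = binary_log_loss([1, 0], [0.5, 0.5])   # equals ln 2
confident_right = binary_log_loss([1, 0], [0.9, 0.1])
```

An uninformative forecast of 0.5 on every sample scores ln 2 ≈ 0.693, which is a convenient reference point on balanced data.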
|
||||
|
||||
## Usage
|
||||
|
||||
```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def discrimination_report(y_true, y_score):
    # Convert to arrays so boolean masking also works for lists and Series
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    auc = roc_auc_score(y_true, y_score)
    gini = 2 * auc - 1
    # Two-sample KS between the score distributions of positives and negatives
    ks_stat, ks_pval = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
    return {
        "AUC": round(auc, 4),
        "Gini": round(gini, 4),
        "KS": round(ks_stat, 4),
        "KS_pvalue": round(ks_pval, 6),
    }
```
|
||||
|
||||
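## Application in Model QA

The Model QA Specialist performs the following discrimination audits:

1. **Full data-slice analysis**: compute AUC/Gini/KS separately on the Train, Validation, Test, and OOT slices
2. **Subgroup performance**: test separately on protected attributes such as gender, age, and region to surface fairness risks
3. **Temporal stability**: track AUC/Gini trends across OOT windows to detect performance decay
4. **Champion-challenger comparison**: proposed model vs incumbent production model, quantifying the relative lift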
|
||||
|
||||
## Relationship
|
||||
|
||||
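- **Depended on by** [[Calibration-Testing]]: confirm discrimination first (KS > 0.2, AUC > 0.7), then test calibration
- **Depends on** [[Population-Stability-Index]]: PSI monitors input stability, discrimination metrics monitor output health
- **Depends on** [[SHAP]]: discrimination metrics answer "is it good?", SHAP explains "why?"
- **Supports** [[specialized-model-qa]] (Source): a core performance-evaluation step for the Model QA Specialist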
|
||||
|
||||
## Key Insights
|
||||
|
||||
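- **Discrimination vs calibration**: a high-AUC model can still be badly miscalibrated in specific probability ranges; the two must be evaluated together
- **KS vs AUC**: KS reflects separation at the single best cut-off, which suits catching bad accounts at one operating point; AUC summarizes ranking quality across all thresholds
- **Regulatory threshold**: financial risk control typically requires Gini > 0.4 (equivalent to AUC > 0.7) before a model goes live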
|
||||
91
wiki/concepts/Hosmer-Lemeshow-Test.md
Normal file
@@ -0,0 +1,91 @@
|
||||
---
|
||||
title: "Hosmer-Lemeshow Test"
|
||||
type: concept
|
||||
tags: [model-evaluation, calibration-testing, goodness-of-fit]
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
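The Hosmer-Lemeshow (HL) test is a goodness-of-fit test for the calibration of a binary classifier's predicted probabilities. It bins the predictions, compares observed with expected positives in each bin, and tests whether predictions and outcomes differ significantly. The null hypothesis is that the model is well calibrated; a p-value < 0.05 rejects it, indicating significant miscalibration.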
|
||||
|
||||
## Algorithm
|
||||
|
||||
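1. Sort samples by predicted probability and split them into bins (10 by default, or a custom number of groups $G$)
2. For each bin $g$, compute:
   - **Observed positives** $O_g = \sum_{i \in \text{group } g} y_i$
   - **Expected positives** $E_g = \sum_{i \in \text{group } g} \hat{p}_i$
   - **Sample count** $n_g$
3. Compute the HL statistic:

$$H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g (1 - E_g / n_g)}$$

4. Degrees of freedom $df = G - 2$ on the sample used to fit the model; on an external validation sample, $df = G$ is commonly used instead
5. Compare against the $\chi^2(df)$ distribution: $p = 1 - F_{\chi^2_{df}}(H)$, where $F_{\chi^2_{df}}$ is the CDF of the chi-square distribution with $df$ degrees of freedom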
|
||||
|
||||
## Interpretation
|
||||
|
||||
```python
import pandas as pd
from scipy.stats import chi2


def hosmer_lemeshow_test(y_true: pd.Series, y_pred: pd.Series, groups: int = 10) -> dict:
    data = pd.DataFrame({"y": y_true, "p": y_pred})
    # Quantile bins on predicted probability; tied values may reduce the bin count
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")

    # Per-bin sample count, observed positives, and expected positives
    agg = data.groupby("bucket", observed=True).agg(
        n=("y", "count"),
        observed=("y", "sum"),
        expected=("p", "sum"),
    )

    hl_stat = (
        ((agg["observed"] - agg["expected"]) ** 2) /
        (agg["expected"] * (1 - agg["expected"] / agg["n"]))
    ).sum()

    dof = len(agg) - 2
    p_value = chi2.sf(hl_stat, dof)  # survival function, equal to 1 - cdf

    return {
        "HL_statistic": round(hl_stat, 4),
        "p_value": round(p_value, 6),
        "calibrated": bool(p_value >= 0.05),  # True = no significant miscalibration
        "dof": dof,
        "groups_used": len(agg),
    }
```
|
||||
|
||||
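| p-value | Interpretation |
|---------|----------------|
| ≥ 0.05 | 🟢 Treated as well calibrated: no significant evidence that predicted probabilities deviate from observed frequencies |
| < 0.05 | 🔴 Null hypothesis rejected: the model shows significant miscalibration |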
|
||||
|
||||
## Limitations
|
||||
|
||||
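1. **Sensitive to grouping**: different bin counts or binning methods give different results; deciles are conventional, not optimal
2. **Sensitive to sample size**: with large samples even tiny deviations produce a significant p-value, although the practical impact may be small
3. **Masks subgroup problems**: passing the HL test overall does not mean every subgroup is well calibrated
4. **Tied predictions**: `qcut` may merge bins when many predicted values are identical, so check `groups_used`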
|
||||
|
||||
## Alternatives
|
||||
|
||||
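- **Brier Score**: no grouping needed and robust to sample size, but it reports only the magnitude of the error, not where it occurs
- **Spiegelhalter's Z-test**: a statistical test based on the Brier score
- **Reliability Curves**: a visual diagnostic that conveys more information than the HL test
- **Expected Calibration Error (ECE)**: quantifies the average calibration error and is easier to interpret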
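Of these alternatives, Spiegelhalter's Z-test is compact enough to sketch without binning. The version below is illustrative and dependency-free; the function name `spiegelhalter_z` is hypothetical. Under the null hypothesis of perfect calibration the statistic is approximately standard normal, so a two-sided p-value follows from the normal tail.

```python
import math


def spiegelhalter_z(y_true, y_prob):
    """Return (z, two_sided_p) for Spiegelhalter's calibration test."""
    numerator = sum((y - p) * (1 - 2 * p) for y, p in zip(y_true, y_prob))
    variance = sum((1 - 2 * p) ** 2 * p * (1 - p) for p in y_prob)
    if variance == 0:
        # Happens when every forecast is exactly 0, 0.5, or 1
        raise ValueError("test is undefined for these forecasts")
    z = numerator / math.sqrt(variance)
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = spiegelhalter_z([1, 0, 1, 0], [0.8, 0.2, 0.8, 0.2])
```

The toy sample is far too small for the normal approximation to be trustworthy; it is used only to make the arithmetic easy to follow.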
|
||||
|
||||
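## Application in Model QA

The Model QA Specialist uses the HL test for:

1. **Pre-deployment validation**: a new model must pass the HL test (p ≥ 0.05) before going live
2. **Periodic monitoring**: repeat the test on OOT windows to track calibration deterioration over time
3. **Stratified subgroup testing**: run it separately on key subgroups (high risk, low risk, new customers)
4. **Champion-Challenger**: compare the HL results of the champion model and the challenger model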
|
||||
|
||||
## Relationship
|
||||
|
||||
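- **Depended on by** [[Calibration-Testing]]: the HL test is one of the core statistical tools of Calibration Testing
- **Depends on** [[Discrimination-Metrics]]: confirm that the model discriminates (AUC/Gini meet the bar) before discussing calibration
- **Depends on** [[Population-Stability-Index]]: PSI drift is often a precursor of HL test failure
- **Depends on** [[SHAP]]: once the HL test finds a calibration problem, use SHAP waterfall plots to diagnose the specific cause
- **Supports** [[specialized-model-qa]] (Source): the core tool in the Model QA Specialist's calibration-testing step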
|
||||
32
wiki/concepts/Kirkpatrick-四级评估.md
Normal file
@@ -0,0 +1,32 @@
|
||||
---
|
||||
title: "Kirkpatrick Four-Level Evaluation"
|
||||
type: concept
|
||||
tags: []
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
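The Kirkpatrick four-level evaluation model is a standard framework for measuring the effectiveness of corporate training. Proposed by Donald Kirkpatrick in 1959, it has four levels:

- **Level 1 (Reaction)**: learner satisfaction with the training, measured by course ratings, instructor ratings, and NPS
- **Level 2 (Learning)**: mastery of knowledge and skills, measured by knowledge tests, hands-on skill assessments, and case-analysis assignments
- **Level 3 (Behavior)**: post-training behavior change, measured by 30/60/90-day behavior tracking, supervisor observation, and key-behavior checklists
- **Level 4 (Results)**: change in business metrics such as revenue, customer satisfaction, productivity, and employee retention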
|
||||
|
||||
## Aliases
|
||||
- Kirkpatrick Model
|
||||
- Kirkpatrick 四级评估
|
||||
- Kirkpatrick 四层次评估
|
||||
- 培训效果评估模型
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
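- **Progressive**: Levels 1-2 are relatively easy to measure; Levels 3-4 need a longer time horizon and more complex data collection
- **Business-oriented**: Levels 3-4 tie directly to business metrics and are the core evidence for training return on investment (ROI)
- **Minimum standard**: every training program should be evaluated to at least Level 2 (Learning)
- **High-investment standard**: high-investment programs such as leadership development and key-role training must be tracked to Level 3 (Behavior)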
|
||||
|
||||
## Source
|
||||
|
||||
- [[corporate-training-designer]]
|
||||
37
wiki/concepts/Kolb-体验式学习圈.md
Normal file
@@ -0,0 +1,37 @@
|
||||
---
|
||||
title: "Kolb's Experiential Learning Cycle"
|
||||
type: concept
|
||||
tags: []
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
Kolb's Experiential Learning Cycle, proposed by David Kolb in 1984, describes learning as a four-stage cycle:

1. **Concrete Experience**: full engagement in a real or simulated experience
2. **Reflective Observation**: reviewing the experience from different perspectives and thinking about what happened
3. **Abstract Conceptualization**: distilling theories, models, or frameworks from the experience
4. **Active Experimentation**: applying the concepts in new practical settings to test hypotheses
|
||||
|
||||
## Aliases
|
||||
- Kolb's Learning Cycle
|
||||
- Kolb 体验式学习
|
||||
- Kolb 学习圈
|
||||
- 体验式学习循环
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
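- **Closed loop**: the four stages connect end to end, forming a spiral of continuous improvement
- **Individualized**: learners favor different stages (some lean toward experience, others toward reflection)
- **Active learning**: emphasizes learning by doing rather than passively receiving knowledge
- **Typical uses**: business simulations, role play, murder-mystery-style scripted training, leadership development programs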
|
||||
|
||||
## Relationship to Other Concepts
|
||||
|
||||
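- **ADDIE model**: experiential learning can serve as a teaching method in the ADDIE Implementation phase
- **Kirkpatrick Level 3**: the closed-loop nature of experiential learning naturally supports tracking post-training behavior change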
|
||||
|
||||
## Source
|
||||
|
||||
- [[corporate-training-designer]]
|
||||
71
wiki/concepts/Partial-Dependence-Plots.md
Normal file
@@ -0,0 +1,71 @@
|
||||
---
|
||||
title: "Partial Dependence Plots"
|
||||
type: concept
|
||||
tags: [model-interpretability, feature-analysis, model-visualization]
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
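Partial dependence plots (PDPs) show the marginal relationship between one or two features and the model's prediction: the average prediction as the feature takes different values, averaged over the observed values of all other features. Core assumption: the plotted features are roughly independent of the remaining ones. When that fails, the averages include unrealistic feature combinations, and accumulated local effects (ALE) plots are a safer choice. ICE (Individual Conditional Expectation) curves complement PDPs by showing the per-sample variation that the average hides.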
|
||||
|
||||
## Core Types
|
||||
|
||||
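### 1D PDP (single feature)
- Keep the other features at their observed values and compute the model's average prediction across a grid of values for the target feature
- Visualization: feature value on the x-axis, partial dependence (marginal effect on the prediction) on the y-axis
- Use: verify that the feature's direction matches business expectations (monotonically increasing, decreasing, or U-shaped)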
|
||||
|
||||
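### 2D PDP (feature interaction)
- Vary two features together to show their joint effect on the prediction
- Use: detect unexpected feature interactions the model has learned (such as a nonlinear combination X₁ × X₂)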
|
||||
|
||||
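### ICE Curves (Individual Conditional Expectation)
- Each line is the prediction curve for one sample rather than the average
- Addresses PDP's tendency to hide individual heterogeneity
- Combined with PDP: overlaying the PDP on the ICE curves shows the average trend and the individual variation together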
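How the three plot types relate can be seen from a hand-rolled computation. This is a minimal sketch under stated assumptions, not how any plotting library implements it: `ice_curves`, `partial_dependence`, and `toy_predict` are hypothetical names, rows are plain dicts, and the toy model deliberately contains an interaction so that the ICE curves differ from their average.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per row: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def partial_dependence(predict, rows, feature, grid):
    """PDP = pointwise average of the ICE curves."""
    curves = ice_curves(predict, rows, feature, grid)
    return [sum(c[j] for c in curves) / len(curves) for j in range(len(grid))]


def toy_predict(row):
    # Hypothetical model with an x1 * x2 interaction
    return 2 * row["x1"] + row["x1"] * row["x2"]


rows = [{"x1": 0, "x2": -1}, {"x1": 0, "x2": 1}]
grid = [0, 1, 2]
ice = ice_curves(toy_predict, rows, "x1", grid)
pdp = partial_dependence(toy_predict, rows, "x1", grid)
```

Here the PDP slope is 2 while the individual slopes are 1 and 3, which is exactly the kind of heterogeneity an average-only plot would conceal.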
|
||||
|
||||
## Usage
|
||||
|
||||
```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Assumes a fitted estimator `model`, a feature matrix `X`, and the
# feature names `feature_name`, `feat_a`, `feat_b` are already defined.

# 1D PDP for single feature
fig, ax = plt.subplots(figsize=(8, 5))
PartialDependenceDisplay.from_estimator(
    model, X, [feature_name],
    grid_resolution=50, ax=ax
)
ax.set_title(f"Partial Dependence - {feature_name}")
fig.savefig(f"pdp_{feature_name}.png", dpi=150)

# 2D PDP for feature interaction
fig, ax = plt.subplots(figsize=(8, 6))
PartialDependenceDisplay.from_estimator(
    model, X, [(feat_a, feat_b)], ax=ax
)
fig.savefig(f"pdp_interact_{feat_a}_x_{feat_b}.png", dpi=150)
```
|
||||
|
||||
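## Application in Model QA

The Model QA Specialist uses PDPs for the following audits:

1. **Direction check**: verify that PDP curve direction matches domain knowledge (for example, "income ↑ → default probability ↓")
2. **Non-monotonicity detection**: identify counterintuitive non-monotonic relationships the model has learned in certain ranges
3. **Interaction detection**: use 2D PDPs to detect interaction effects among the top correlated feature pairs
4. **Stability over time**: compare Train vs OOT PDP curves to detect temporal drift in feature relationships
5. **Cross-check with SHAP**: PDP verifies the marginal direction, SHAP verifies the precise attribution; the two are complementary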
|
||||
|
||||
## Relationship
|
||||
|
||||
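- **Depends on** [[SHAP]]: SHAP provides precise feature attribution, PDP visualizes the trend; the shape of a PDP curve should agree with the distribution in the SHAP beeswarm plot
- **Depends on** [[Population-Stability-Index]]: PSI captures drift in feature distributions, PDP captures changes in feature effects; together they inform whether the model needs retraining
- **Supports** [[Calibration-Testing]]: nonlinear relationships revealed by PDP can be the root of calibration problems
- **Supports** [[specialized-model-qa]] (Source): a core feature-analysis tool for the Model QA Specialist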
|
||||
|
||||
## Key Limitations
|
||||
|
||||
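- **Correlated features**: when features are highly correlated, PDP can be misleading because it ignores the conditional distribution of the other features
- **Hidden heterogeneity**: gaps between individual ICE curves and the average PDP reflect heterogeneity; ignoring them can miss key subgroups
- **Categorical variables**: categories may need to be grouped first, and the grouping scheme affects how the results are read
- **High-dimensional interactions**: interactions among more than two features call for SHAP interaction values or ALE plots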
|
||||
102
wiki/concepts/Population-Stability-Index.md
Normal file
@@ -0,0 +1,102 @@
|
||||
---
|
||||
title: "Population Stability Index"
|
||||
type: concept
|
||||
tags: [model-monitoring, feature-drift, model-governance]
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
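The Population Stability Index (PSI) quantifies the difference between two distributions, typically the development sample and a recent production sample. It is widely used to monitor drift in model input features and output scores, and is a core monitoring tool in model lifecycle management.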
|
||||
|
||||
## Algorithm
|
||||
|
||||
$$\text{PSI} = \sum_{i=1}^{n} (act_i - exp_i) \times \ln\left(\frac{act_i}{exp_i}\right)$$
|
||||
|
||||
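Where:
- $act_i$ = share of the actual (current) sample in bin $i$
- $exp_i$ = share of the expected (baseline) sample in bin $i$
- **Laplace smoothing** (add-one smoothing on bin counts) avoids division by zero and $\ln 0$ for empty bins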
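A small worked example makes the formula tangible. The sketch below is illustrative: `psi_from_shares` is a hypothetical helper that takes bin shares directly and assumes every share is already strictly positive, so no smoothing is applied.

```python
import math


def psi_from_shares(expected_shares, actual_shares):
    """PSI from per-bin shares; each term (act - exp) * ln(act / exp) is non-negative."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )


baseline = [0.5, 0.5]
current = [0.6, 0.4]
psi_shifted = psi_from_shares(baseline, current)    # 0.1*ln(1.2) + (-0.1)*ln(0.8)
psi_identical = psi_from_shares(baseline, baseline)
psi_reversed = psi_from_shares(current, baseline)   # PSI is symmetric in its arguments
```

A ten-point move between two equal bins gives a PSI of about 0.04, which helps calibrate intuition about how large a shift has to be before the usual 0.10 and 0.25 thresholds are crossed.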
|
||||
|
||||
## Interpretation Thresholds
|
||||
|
||||
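| PSI Range | Interpretation | Recommended action |
|-----------|----------------|--------------------|
| < 0.10 | 🟢 No significant drift | No action needed |
| 0.10–0.25 | 🟡 Moderate drift | Investigate the cause and monitor closely |
| ≥ 0.25 | 🔴 Significant drift | **Act immediately** and consider retraining |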
|
||||
|
||||
## Implementation
|
||||
|
||||
```python
import numpy as np
import pandas as pd


def compute_psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    """
    Compute Population Stability Index between two distributions.

    Interpretation:
        < 0.10     → No significant shift (green)
        0.10–0.25  → Moderate shift, investigation recommended (amber)
        >= 0.25    → Significant shift, action required (red)
    """
    expected = expected.dropna()
    actual = actual.dropna()

    # Interior bin edges are baseline percentiles; duplicates from ties are dropped
    inner = np.unique(np.percentile(expected, np.linspace(0, 100, bins + 1)))[1:-1]
    # Open-ended outer bins so out-of-range current values are still counted
    edges = np.concatenate(([-np.inf], inner, [np.inf]))
    n_bins = len(edges) - 1

    expected_counts = np.histogram(expected, bins=edges)[0]
    actual_counts = np.histogram(actual, bins=edges)[0]

    # Laplace smoothing
    exp_pct = (expected_counts + 1) / (expected_counts.sum() + n_bins)
    act_pct = (actual_counts + 1) / (actual_counts.sum() + n_bins)

    psi = np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
    return round(float(psi), 6)


def variable_stability_report(
    df: pd.DataFrame,
    date_col: str,
    variables: list[str],
    psi_threshold: float = 0.25,
) -> pd.DataFrame:
    """Monthly stability report for model features."""
    periods = sorted(df[date_col].unique())
    baseline = df[df[date_col] == periods[0]]  # earliest period is the reference

    results = []
    for var in variables:
        for period in periods[1:]:
            current = df[df[date_col] == period]
            psi = compute_psi(baseline[var], current[var])
            results.append({
                "variable": var, "period": period, "psi": psi,
                "flag": "🔴" if psi >= psi_threshold else ("🟡" if psi >= 0.10 else "🟢"),
            })

    # One row per variable, one column per period; the flag column is not pivoted
    return pd.DataFrame(results).pivot_table(
        index="variable", columns="period", values="psi"
    ).round(4)
```
|
||||
|
||||
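## Application in Model QA

The Model QA Specialist applies PSI in the following scenarios:

1. **Feature stability monitoring**: compute PSI for all features every month as the earliest warning signal of drift
2. **Score distribution monitoring**: compute PSI on model output scores to detect changes in the overall prediction distribution
3. **Segmented PSI**: compute PSI separately on subgroups to find drift in specific segments that the overall PSI hides
4. **Retraining trigger**: treat PSI ≥ 0.25 as a hard trigger for automatic retraining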
|
||||
|
||||
## Relationship
|
||||
|
||||
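- **Depended on by** [[SHAP]]: PSI identifies distribution drift, SHAP analyzes how feature contributions change after the drift
- **Depended on by** [[Discrimination-Metrics]]: PSI drift usually appears before AUC/Gini decline, which makes it an early-warning indicator
- **Depended on by** [[Calibration-Testing]]: feature distribution drift (PSI) is one of the root causes of calibration failure
- **Supports** [[specialized-model-qa]] (Source): a core metric in the Model QA Specialist's monitoring framework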
|
||||
|
||||
## Key Insights
|
||||
|
||||
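- **Direction trap**: PSI reflects only the size of a distribution difference, not its direction (high→low and low→high both count as drift)
- **Threshold dependence**: the 0.1/0.25 thresholds are industry convention; the actual thresholds should be adjusted to business risk
- **Feature vs score PSI**: feature PSI often moves before score PSI, which makes it the more sensitive early warning
- **Monitoring frequency**: production models should be checked at least monthly; weekly or even daily is advisable for business-critical models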
|
||||
70
wiki/concepts/SHAP.md
Normal file
@@ -0,0 +1,70 @@
|
||||
---
|
||||
title: "SHAP (SHapley Additive exPlanations)"
|
||||
type: concept
|
||||
tags: [model-interpretability, feature-attribution, explainable-ai]
|
||||
last_updated: 2026-04-25
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
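SHAP (SHapley Additive exPlanations) is a model-interpretability framework built on Shapley values from game theory. It provides a unified quantitative measure of each feature's contribution: the weighted average of the feature's marginal contribution over all possible feature subsets. The resulting attribution is unique and fair in the sense of the Shapley axioms.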
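The averaging over feature subsets can be written out exactly for a tiny example. This is an illustrative sketch only: `shapley_values` and `toy_value` are hypothetical names, and the value function is a made-up payoff table rather than a real model. The SHAP library does not enumerate subsets this way for realistic feature counts, since the cost grows exponentially.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, features):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that excludes f
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {f}) - value(coalition))
        phi[f] = total
    return phi


def toy_value(coalition):
    # Hypothetical payoff: a contributes 10, b contributes 20, plus 6 when both are present
    payoff = 0.0
    if "a" in coalition:
        payoff += 10
    if "b" in coalition:
        payoff += 20
    if "a" in coalition and "b" in coalition:
        payoff += 6
    return payoff


phi = shapley_values(toy_value, ["a", "b"])
```

The interaction bonus of 6 is split evenly between the two features, and the attributions sum to the difference between the full coalition and the empty one, which is the efficiency property.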
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### Global Interpretability
|
||||
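- **SHAP Summary Plot (Beeswarm)**: a scatter plot showing both the direction and the magnitude of each feature's effect; SHAP value on the horizontal axis, features on the vertical axis, color encoding whether the feature value is high or low
- **SHAP Bar Plot**: features ranked by mean |SHAP|, showing overall feature importance
- **Use case**: compare against the documented feature rationale to identify "hidden features" that are not discussed in the methodology document but have material impact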
|
||||
|
||||
### Local Interpretability
|
||||
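- **SHAP Waterfall Plot**: explains a single prediction, starting from the base value and showing feature by feature the direction and size of each push
- **SHAP Force Plot**: visualizes feature contributions for a single prediction; often used to explain high-stakes decisions
- **Use case**: deep dives on edge-case predictions (top/bottom decile, misclassified records)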
|
||||
|
||||
### SHAP Interaction Values
|
||||
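- Detects dependence and interaction effects between features
- Splits the total SHAP contribution into main effects + interaction effects
- Used to identify unexpected feature interactions the model has learned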
|
||||
|
||||
## Usage
|
||||
|
||||
```python
import matplotlib.pyplot as plt
import shap

# Assumes a fitted tree model `model`, a DataFrame `X`, and a row position `idx`.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global: beeswarm
shap.summary_plot(shap_values, X, show=False)
plt.savefig("shap_beeswarm.png", dpi=150)
plt.close()  # start the next plot on a fresh figure

# Global: bar
shap.summary_plot(shap_values, X, plot_type="bar", show=False)
plt.savefig("shap_importance.png", dpi=150)
plt.close()

# Local: waterfall
explanation = explainer(X.iloc[[idx]])
shap.plots.waterfall(explanation[0], show=False)
plt.savefig(f"shap_waterfall_{idx}.png", dpi=150)
plt.close()
```
|
||||
|
||||
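## Application in Model QA

The Model QA Specialist uses SHAP for the following audits:

1. **Global analysis**: compare SHAP feature importance with the documented feature rationale to find undocumented high-contribution features
2. **Cross-check with PDP**: combine SHAP analysis with PDPs to verify that feature directions match expectations
3. **Local explanation**: SHAP waterfall plots for edge cases reveal how the model reaches its decision
4. **Stability monitoring**: changes in SHAP rankings across time windows reflect drift in feature importance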
|
||||
|
||||
## Relationship
|
||||
|
||||
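- **Depends on** [[Population-Stability-Index]]: PSI monitors drift in feature distributions, SHAP monitors changes in feature contributions; model health can only be assessed fully with both
- **Depends on** [[Calibration-Testing]]: SHAP explains why the model predicts what it does, calibration testing verifies how accurate those predictions are
- **Depends on** [[Discrimination-Metrics]]: SHAP contribution analysis is a detailed diagnosis carried out after AUC/Gini have shown the model to be usable overall
- **Supports** [[Partial-Dependence-Plots]]: PDP visualizes marginal effects, SHAP provides precise attribution; the two are complementary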
|
||||
|
||||
## Key Limitations
|
||||
|
||||
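- Computational cost: exact Shapley values are exponential in the number of features; TreeExplainer is efficient for tree models, but black-box models such as neural networks need KernelExplainer (a sampling approximation)
- Separating correlated effects: when features are highly correlated, Shapley attributions can be unstable
- Baseline dependence: the explanatory power of Shapley values depends on the choice of baseline (base value)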
|
||||