| title | type | tags | date |
|---|---|---|---|
| Model QA Specialist | source | | 2026-04-20 |
Source File
Summary
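- Core topic: an independent model QA specialist agent that performs end-to-end audits of machine learning and statistical models
- Problem domain: model lifecycle auditing, covering documentation, data, features, model construction, calibration, interpretability, fairness, and business impact
- Method/mechanism: a 10-stage audit workflow that includes PSI calculation, SHAP analysis, Hosmer-Lemeshow calibration testing, discrimination metrics, and Gini/KS statistics
- Conclusion/value: gives the organization an evidence-driven assessment of model quality, quantifies the severity of each issue, and proposes remediation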
Key Claims
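- Model QA Specialist ← performs end-to-end audits ← covers documentation governance, data reconstruction, feature analysis, model replication, calibration testing, and interpretability analysis
- PSI (Population Stability Index) ← quantifies feature distribution shift ← used to check the stability of input variables across time windows
- SHAP (SHapley Additive exPlanations) ← provides global and local interpretability ← analyzes feature contributions and prediction drivers
- Hosmer-Lemeshow test ← assesses probability calibration quality ← a p-value < 0.05 indicates statistically significant miscalibration
- Independence principle ← never audits models it built itself ← preserves objectivity and challenges every assumption with data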
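To make the SHAP claim concrete, the sketch below computes exact Shapley values for a tiny model by enumerating feature coalitions. It is not the `shap` library and not the agent's own implementation. `shapley_values`, `toy_model`, the baseline, and the input are all hypothetical names and values chosen for illustration. Features outside a coalition are replaced by baseline values, which is one common convention among several.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, measured against baseline."""
    n = len(x)

    def value(coalition):
        # Features in the coalition come from x; the rest come from baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical model: one additive term plus one interaction term.
def toy_model(z):
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # the contributions sum to toy_model(x) - toy_model(baseline)
```

The additive feature receives its full effect, while the interaction between the other two features is split equally between them. Exact enumeration grows exponentially with the number of features, so it is only practical for toy cases like this one.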
Key Quotes
- "You treat every model as guilty until proven sound." (Model QA Specialist core principle)
- "Every finding must include: observation, evidence, impact assessment, and recommendation." (evidence-driven findings requirement)
- "Never state 'the model is wrong' without quantifying the impact." (quantification principle)
Key Concepts
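- Population Stability Index (PSI): a measure of the difference between two distributions; < 0.10 means no significant shift, 0.10–0.25 a moderate shift, ≥ 0.25 a significant shift
- SHAP Analysis: a game-theoretic method for attributing predictions to features, offering global (beeswarm/bar) and local (waterfall/force) explanations
- Calibration Testing: uses the Hosmer-Lemeshow test, Brier score, and reliability diagrams to assess the accuracy of predicted probabilities
- Discrimination Metrics: measures of discriminatory power, including the Gini coefficient, KS statistic, and AUC, used to assess how well the model separates classes
- Partial Dependence Plots: show the marginal relationship between a feature and the prediction; used to verify monotonicity and detect nonlinear thresholds
- Fairness Audit: checks for discriminatory bias across protected attributes (demographic parity, equalized odds)
- Model Audit: a 10-stage methodology for systematically evaluating a model across its full lifecycle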
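The two fairness criteria named under Fairness Audit can be sketched in plain Python. This assumes binary labels and binary decisions. The function `fairness_gaps`, the helper `rate`, and the sample data are hypothetical. Reporting equalized odds as the larger of the TPR gap and the FPR gap is one common convention, not something the source specifies.

```python
def rate(values):
    # Mean of a list of 0/1 values; NaN when the list is empty.
    return sum(values) / len(values) if values else float("nan")

def fairness_gaps(y_true, y_pred, group):
    """Between-group gaps in selection rate, TPR, and FPR."""
    stats = {}
    for g in sorted(set(group)):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, group) if gg == g]
        stats[g] = {
            "selection_rate": rate([p for _, p in rows]),
            "tpr": rate([p for t, p in rows if t == 1]),
            "fpr": rate([p for t, p in rows if t == 0]),
        }

    def gap(key):
        vals = [s[key] for s in stats.values()]
        return max(vals) - min(vals)

    return {
        # Demographic parity: selection rates should match across groups.
        "demographic_parity_diff": gap("selection_rate"),
        # Equalized odds: both TPR and FPR should match across groups.
        "equalized_odds_diff": max(gap("tpr"), gap("fpr")),
    }

# Made-up decisions for two groups of four cases each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a"] * 4 + ["b"] * 4
gaps = fairness_gaps(y_true, y_pred, group)
print(gaps)
```

A gap of zero means the groups are treated identically on that criterion. What size of gap counts as a finding is a policy decision the audit would need to state explicitly.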
Key Entities
- Model QA Specialist: the subject; an independent model-audit specialist agent in The Agency project, with a skeptical but collaborative persona
Connections
- Model QA Specialist ← belongs to ← The Agency
- Model QA Specialist ← uses ← SHAP Analysis
- Model QA Specialist ← uses ← Population Stability Index (PSI)
- Model QA Specialist ← uses ← Calibration Testing
- Model QA Specialist ← produces ← Fairness Audit
- Model QA Specialist ← applies to ← ML Ops
Contradictions
- Versus other agent roles: Corporate Training Designer. Both belong to The Agency, but their domains do not conflict
Technical Deliverables
Population Stability Index (PSI) Calculation
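
    import numpy as np
    import pandas as pd

    def compute_psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
        # Bin edges are percentiles of the expected (baseline) sample; ties are merged.
        edges = np.unique(np.percentile(expected.dropna(), np.linspace(0, 100, bins + 1)))
        # Open the outer bins so actual values outside the baseline range are still counted.
        edges[0], edges[-1] = -np.inf, np.inf
        expected_counts = np.histogram(expected.dropna(), bins=edges)[0]
        actual_counts = np.histogram(actual.dropna(), bins=edges)[0]
        # Add-one smoothing keeps empty bins from producing log(0) or division by zero.
        k = len(expected_counts)
        exp_pct = (expected_counts + 1) / (expected_counts.sum() + k)
        act_pct = (actual_counts + 1) / (actual_counts.sum() + k)
        psi = np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
        return round(float(psi), 6)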
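The PSI thresholds listed under Key Concepts can be turned into a small interpretation helper. The sketch below is dependency-free and works on bin proportions that are already computed. `psi_from_proportions`, `psi_band`, and the example proportions are hypothetical illustrations, not part of the source.

```python
from math import log

def psi_from_proportions(expected_pct, actual_pct):
    """PSI from two lists of bin proportions (each non-zero and summing to 1)."""
    return sum((a - e) * log(a / e) for e, a in zip(expected_pct, actual_pct))

def psi_band(psi: float) -> str:
    # Bands follow the thresholds in Key Concepts: < 0.10, 0.10 to 0.25, >= 0.25.
    if psi < 0.10:
        return "no significant shift"
    if psi < 0.25:
        return "moderate shift"
    return "significant shift"

# Made-up two-bin example: the population moves from 50/50 to 60/40.
psi = psi_from_proportions([0.5, 0.5], [0.6, 0.4])
band = psi_band(psi)
print(round(psi, 6), band)
```

Each term of the sum is non-negative, because the difference and the log ratio always share a sign, so PSI is zero only when the two distributions match bin for bin.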
Discrimination Metrics (Gini & KS)
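
    import pandas as pd
    from scipy.stats import ks_2samp
    from sklearn.metrics import roc_auc_score

    def discrimination_report(y_true: pd.Series, y_score: pd.Series) -> dict:
        auc = roc_auc_score(y_true, y_score)
        gini = 2 * auc - 1  # Gini coefficient derived from AUC
        ks_stat, _ = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])  # max gap between class score CDFs
        return {"AUC": round(auc, 4), "Gini": round(gini, 4), "KS": round(ks_stat, 4)}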
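As a cross-check on what these metrics measure, the sketch below recomputes AUC, Gini, and KS from their definitions in plain Python, without scikit-learn or SciPy. The helper names and the six-row sample are hypothetical. The pairwise AUC is quadratic in sample size, so it is suited to small validation checks rather than production data.

```python
def split_scores(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    return pos, neg

def auc_pairwise(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = split_scores(y_true, y_score)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, y_score):
    """Largest absolute gap between the empirical score CDFs of the two classes."""
    pos, neg = split_scores(y_true, y_score)

    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)

    return max(abs(ecdf(pos, t) - ecdf(neg, t)) for t in set(y_score))

# Made-up labels and scores.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
auc = auc_pairwise(y_true, y_score)
gini = 2 * auc - 1
ks = ks_statistic(y_true, y_score)
print(round(auc, 4), round(gini, 4), round(ks, 4))
```

In this sample, eight of the nine positive-negative pairs are ordered correctly, which gives the AUC directly. The KS gap peaks at the score thresholds where the negatives have accumulated fastest relative to the positives.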
Hosmer-Lemeshow Calibration Test

    import pandas as pd
    from scipy.stats import chi2

    def hosmer_lemeshow_test(y_true: pd.Series, y_pred: pd.Series, groups: int = 10) -> dict:
        data = pd.DataFrame({"y": y_true, "p": y_pred})
        # Quantile buckets of predicted probability; tied edges may yield fewer than `groups`.
        data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
        agg = data.groupby("bucket", observed=True).agg(n=("y", "count"), observed=("y", "sum"), expected=("p", "sum"))
        # HL statistic: sum over buckets of (O - E)^2 / (E * (1 - E / n))
        hl_stat = float((((agg["observed"] - agg["expected"]) ** 2) / (agg["expected"] * (1 - agg["expected"] / agg["n"]))).sum())
        dof = len(agg) - 2
        p_value = float(chi2.sf(hl_stat, dof))  # survival function, equal to 1 - cdf
        return {"HL_statistic": round(hl_stat, 4), "p_value": round(p_value, 6), "calibrated": bool(p_value >= 0.05)}
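Calibration Testing also names the Brier score and reliability diagrams. A minimal plain-Python sketch of both follows, assuming binary outcomes and probabilities in [0, 1]. `brier_score`, `reliability_table`, the equal-width binning, and the sample data are illustrative assumptions rather than the agent's method.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def reliability_table(y_true, y_prob, n_bins=5):
    """Rows of (mean predicted, observed rate, count) per non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            obs_rate = sum(y for y, _ in members) / n
            rows.append((mean_pred, obs_rate, n))
    return rows

# Made-up outcomes and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.9]
score = brier_score(y_true, y_prob)
table = reliability_table(y_true, y_prob, n_bins=2)
print(round(score, 4), table)
```

Plotting mean predicted probability against observed rate for each row gives the reliability diagram, and a well-calibrated model stays close to the diagonal. Lower Brier scores are better, with 0 reached only by perfect and fully confident predictions.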