---
title: "Model QA Specialist"
type: source
tags: [agent, the-agency, ml-ops, model-audit]
date: 2026-04-20
---
## Source File
- [[raw/Agent/agency-agents/specialized/specialized-model-qa.md]]
## Summary
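- **Core topic**: An independent model QA specialist agent that audits machine learning and statistical models end to end
- **Problem domain**: Model lifecycle auditing, covering documentation, data, features, model construction, calibration, interpretability, fairness, and business impact
- **Method/mechanism**: A 10-phase audit process that includes PSI calculation, SHAP analysis, Hosmer-Lemeshow calibration testing, discrimination metrics, and Gini/KS statistics
- **Conclusion/value**: Gives the organization an evidence-driven assessment of model quality, quantifying the severity of each issue and recommending remediation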
## Key Claims
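- Model QA Specialist ← performs end-to-end audits ← covering documentation governance, data reconstruction, feature analysis, model replication, calibration testing, and interpretability analysis
- PSI (Population Stability Index) ← quantifies feature distribution shift ← used to check the stability of input variables across time windows
- SHAP (SHapley Additive exPlanations) ← provides global and local interpretability ← analyzes feature contributions and prediction drivers
- Hosmer-Lemeshow test ← assesses probability calibration quality ← p-value < 0.05 indicates significant miscalibration
- Independence principle ← never audits a model it built itself ← stays objective and challenges every assumption with data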
## Key Quotes
> "You treat every model as guilty until proven sound." - core principle of the Model QA Specialist
> "Every finding must include: observation, evidence, impact assessment, and recommendation." - requirement for evidence-driven findings
> "Never state 'the model is wrong' without quantifying the impact." - quantification principle
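
The four-part finding structure quoted above can be captured as a small record type. This is a minimal sketch, not part of the source agent definition: the `Finding` class, its field names, and the example text are hypothetical and chosen only to mirror the quote.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Finding:
    """One audit finding; every part named in the quote is mandatory."""
    observation: str      # what was seen
    evidence: str         # the data or test result backing it
    impact: str           # quantified consequence, never just "the model is wrong"
    recommendation: str   # proposed remediation

    def __post_init__(self) -> None:
        # Reject findings that leave any of the four parts blank.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"Finding is missing its {f.name}")


# Illustrative placeholder text only, not a result from a real audit.
example = Finding(
    observation="Input feature distribution shifted between windows",
    evidence="PSI above the significant-shift threshold",
    impact="Impact on model ranking power to be quantified before sign-off",
    recommendation="Re-bin or retrain on the recent window",
)
```

Making the record frozen and validating in `__post_init__` means an incomplete finding fails at construction time rather than surfacing later in a report.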
## Key Concepts
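- [[Population Stability Index (PSI)]]: A metric that quantifies the difference between two distributions; < 0.10 means no significant shift, 0.10 to 0.25 a moderate shift, and ≥ 0.25 a significant shift
- [[SHAP Analysis]]: A game-theoretic method for analyzing feature contributions, providing global (beeswarm/bar) and local (waterfall/force) explanations
- [[Calibration Testing]]: Hosmer-Lemeshow, Brier score, and reliability diagrams, used to assess the accuracy of probability predictions
- [[Discrimination Metrics]]: Measures of discriminatory power, including the Gini coefficient, KS statistic, and AUC, used to assess how well the model separates classes
- [[Partial Dependence Plots]]: Show the marginal relationship between a feature and the predicted outcome; used to verify monotonicity and detect nonlinear thresholds
- [[Fairness Audit]]: Checks for discriminatory bias across protected attributes (demographic parity, equalized odds)
- [[Model Audit]]: A 10-phase methodology for systematically evaluating a model across its full lifecycle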
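
The two fairness criteria named under Fairness Audit can be expressed as simple gap statistics over a protected attribute. The sketch below is an assumption about how such a check might look, not code from the source: `fairness_gaps` is a hypothetical helper, predictions are assumed to be binary 0/1, and the toy inputs are illustrative only.

```python
import pandas as pd


def fairness_gaps(y_true, y_pred, group) -> dict:
    """Largest between-group gaps for demographic parity and equalized odds."""
    df = pd.DataFrame({"y": list(y_true), "pred": list(y_pred), "g": list(group)})
    # Demographic parity compares P(pred = 1) across groups.
    selection_rate = df.groupby("g")["pred"].mean()
    # Equalized odds compares true-positive and false-positive rates across groups.
    tpr = df[df["y"] == 1].groupby("g")["pred"].mean()
    fpr = df[df["y"] == 0].groupby("g")["pred"].mean()
    return {
        "demographic_parity_diff": float(selection_rate.max() - selection_rate.min()),
        "tpr_gap": float(tpr.max() - tpr.min()),
        "fpr_gap": float(fpr.max() - fpr.min()),
    }


# Toy data: both groups are selected at the same rate, but error rates differ.
gaps = fairness_gaps(
    y_true=[1, 1, 0, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 1, 0, 1, 1, 0, 0],
    group=["a", "a", "a", "a", "b", "b", "b", "b"],
)
```

In the toy data the demographic parity gap is zero while both equalized-odds gaps are 0.5, which is why an audit would typically report the criteria side by side rather than rely on one.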
## Key Entities
- [[Model QA Specialist]] (**subject**): The independent model audit specialist agent in The Agency project, with a skeptical but collaborative persona
## Connections
- [[Model QA Specialist]] ← belongs to ← [[The Agency]]
- [[Model QA Specialist]] ← uses ← [[SHAP Analysis]]
- [[Model QA Specialist]] ← uses ← [[Population Stability Index (PSI)]]
- [[Model QA Specialist]] ← uses ← [[Calibration Testing]]
- [[Model QA Specialist]] ← produces ← [[Fairness Audit]]
- [[Model QA Specialist]] ← applied to ← [[ML Ops]]
## Contradictions
- Versus other agent roles: **Corporate Training Designer**. Both belong to The Agency, but their domains do not conflict.
## Technical Deliverables
### Population Stability Index (PSI) Calculation
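```python
import numpy as np
import pandas as pd


def compute_psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    # Bin edges are percentiles of the expected (baseline) distribution; tied edges are merged.
    breakpoints = np.linspace(0, 100, bins + 1)
    edges = np.unique(np.percentile(expected.dropna(), breakpoints))
    # Open the outer bins so actual values outside the baseline range are still counted.
    edges[0], edges[-1] = -np.inf, np.inf
    expected_counts = np.histogram(expected.dropna(), bins=edges)[0]
    actual_counts = np.histogram(actual.dropna(), bins=edges)[0]
    # Add-one smoothing keeps empty bins from producing log(0) or division by zero.
    n_bins = len(edges) - 1
    exp_pct = (expected_counts + 1) / (expected_counts.sum() + n_bins)
    act_pct = (actual_counts + 1) / (actual_counts.sum() + n_bins)
    psi = np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
    return round(float(psi), 6)
```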
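
To read a PSI value against the bands listed under Key Concepts (< 0.10, 0.10 to 0.25, ≥ 0.25), a small lookup helper is enough. This is a sketch under assumptions: `psi_band` and `rank_features_by_psi` are hypothetical names that do not appear in the source, and the PSI values in the usage line are made up for illustration.

```python
def psi_band(psi: float) -> str:
    """Map a PSI value to the bands quoted in Key Concepts."""
    if psi < 0.10:
        return "no significant shift"
    if psi < 0.25:
        return "moderate shift"
    return "significant shift"


def rank_features_by_psi(psi_by_feature: dict[str, float]) -> list[tuple[str, float, str]]:
    """Sort features from most to least shifted and attach the band label."""
    ranked = sorted(psi_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, value, psi_band(value)) for name, value in ranked]


# Illustrative values only, not results from any real audit.
report = rank_features_by_psi({"income": 0.31, "age": 0.04, "tenure": 0.12})
```

Sorting by PSI puts the most unstable inputs at the top of the report, which matches the audit's emphasis on quantifying severity.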
### Discrimination Metrics (Gini & KS)
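```python
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def discrimination_report(y_true: pd.Series, y_score: pd.Series) -> dict:
    auc = roc_auc_score(y_true, y_score)
    gini = 2 * auc - 1  # Gini is a linear rescaling of AUC
    # KS: largest gap between the score CDFs of the positive and negative classes
    ks_stat, _ = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
    return {"AUC": round(auc, 4), "Gini": round(gini, 4), "KS": round(ks_stat, 4)}
```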
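
To make the definitions behind these three numbers concrete, the same quantities can be computed from first principles with only NumPy and pandas. This is a cross-check sketch, not the source's method: `auc_gini_ks` is a hypothetical name, AUC is obtained from the rank-sum (Mann-Whitney) identity, and KS is taken as the largest gap between the two empirical score CDFs.

```python
import numpy as np
import pandas as pd


def auc_gini_ks(y_true, y_score) -> dict:
    y = np.asarray(y_true)
    s = np.asarray(y_score, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    # Rank-sum identity: AUC is the share of (positive, negative) pairs ranked correctly.
    ranks = pd.Series(s).rank().to_numpy()  # average ranks, so ties count as half
    auc = (ranks[y == 1].sum() - len(pos) * (len(pos) + 1) / 2) / (len(pos) * len(neg))
    # Empirical CDFs of both classes, evaluated at every observed score.
    thresholds = np.sort(s)
    cdf_pos = np.searchsorted(np.sort(pos), thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(np.sort(neg), thresholds, side="right") / len(neg)
    ks = np.max(np.abs(cdf_pos - cdf_neg))
    return {"AUC": float(auc), "Gini": float(2 * auc - 1), "KS": float(ks)}


# Four-row toy input: one negative outranks one positive.
toy = auc_gini_ks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On the toy input, three of the four positive/negative pairs are ordered correctly, so AUC is 0.75 and Gini is 0.5; the class CDFs differ by at most 0.5, which is the KS value.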
### Hosmer-Lemeshow Calibration Test
```python
import pandas as pd
from scipy.stats import chi2


def hosmer_lemeshow_test(y_true: pd.Series, y_pred: pd.Series, groups: int = 10) -> dict:
    data = pd.DataFrame({"y": y_true, "p": y_pred})
    # Bucket by predicted-probability quantiles; tied edges are merged, so fewer buckets may result.
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
    agg = data.groupby("bucket", observed=True).agg(n=("y", "count"), observed=("y", "sum"), expected=("p", "sum"))
    # HL statistic: sum over buckets of (O - E)^2 / (E * (1 - E / n))
    hl_stat = (((agg["observed"] - agg["expected"]) ** 2) / (agg["expected"] * (1 - agg["expected"] / agg["n"]))).sum()
    dof = len(agg) - 2
    p_value = chi2.sf(hl_stat, dof)  # survival function, equivalent to 1 - cdf
    return {"HL_statistic": round(hl_stat, 4), "p_value": round(p_value, 6), "calibrated": bool(p_value >= 0.05)}
```
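
Calibration Testing in Key Concepts also names the Brier score and reliability diagrams alongside Hosmer-Lemeshow. The sketch below shows one plausible way to compute both. The function names `brier_score` and `reliability_table` are hypothetical, and the equal-width binning on [0, 1] is an assumption rather than something the source specifies.

```python
import numpy as np
import pandas as pd


def brier_score(y_true, y_prob) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)
    return float(np.mean((p - y) ** 2))


def reliability_table(y_true, y_prob, n_bins: int = 10) -> pd.DataFrame:
    """Per-bin mean predicted probability vs. observed event rate (the data behind a reliability diagram)."""
    df = pd.DataFrame({
        "y": np.asarray(y_true, dtype=float),
        "p": np.asarray(y_prob, dtype=float),
    })
    # Equal-width probability bins; include_lowest keeps p == 0 in the first bin.
    df["bin"] = pd.cut(df["p"], bins=np.linspace(0, 1, n_bins + 1), include_lowest=True)
    table = df.groupby("bin", observed=True).agg(
        n=("y", "size"),
        mean_predicted=("p", "mean"),
        observed_rate=("y", "mean"),
    )
    return table.reset_index()


# Toy input: a well-calibrated model has observed_rate close to mean_predicted in each bin.
table = reliability_table([0, 1, 1, 1], [0.1, 0.9, 0.9, 0.9], n_bins=2)
```

Plotting `mean_predicted` against `observed_rate` gives the reliability diagram; points far from the diagonal mark the probability ranges where the model is over- or under-confident.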