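---
title: "Model QA Specialist"
type: source
tags: [agent, the-agency, ml-ops, model-audit]
date: 2026-04-20
---

## Source File

- [[raw/Agent/agency-agents/specialized/specialized-model-qa.md]]

## Summary

- **Core topic**: an independent model QA specialist agent that audits machine learning and statistical models end to end
- **Problem domain**: model lifecycle auditing, covering documentation, data, features, model construction, calibration, interpretability, fairness, and business impact
- **Method/mechanism**: a 10-phase audit workflow that includes PSI computation, SHAP analysis, Hosmer-Lemeshow calibration testing, discrimination metrics, and Gini/KS statistics
- **Conclusion/value**: gives organizations an evidence-driven assessment of model quality, quantifying the severity of each issue and recommending remediation

## Key Claims

- Model QA Specialist ← performs end-to-end audits ← covering documentation governance, data reconstruction, feature analysis, model replication, calibration testing, and interpretability analysis
- PSI (Population Stability Index) ← quantifies feature distribution shift ← used to check the stability of input variables across time windows
- SHAP (SHapley Additive exPlanations) ← provides global and local interpretability ← analyzes feature contributions and prediction drivers
- Hosmer-Lemeshow test ← assesses the quality of probability calibration ← a p-value < 0.05 indicates significant miscalibration
- Independence principle ← never audits a model it built itself ← stays objective and challenges every assumption with data

## Key Quotes

> "You treat every model as guilty until proven sound." (Model QA Specialist, core principle)

> "Every finding must include: observation, evidence, impact assessment, and recommendation." (requirement for evidence-driven findings)

> "Never state 'the model is wrong' without quantifying the impact." (quantification principle)

## Key Concepts

- [[Population Stability Index (PSI)]]: a measure of the difference between two distributions; < 0.10 means no significant shift, 0.10–0.25 a moderate shift, and ≥ 0.25 a significant shift
- [[SHAP Analysis]]: a game-theoretic method for attributing predictions to features, with global (beeswarm/bar) and local (waterfall/force) explanations
- [[Calibration Testing]]: uses the Hosmer-Lemeshow test, Brier score, and reliability diagrams to assess the accuracy of predicted probabilities
- [[Discrimination Metrics]]: measures of a model's ability to separate classes, including the Gini coefficient, KS statistic, and AUC
- [[Partial Dependence Plots]]: show the marginal relationship between a feature and the prediction; used to verify monotonicity and detect nonlinear thresholds
- [[Fairness Audit]]: checks for discriminatory bias across protected attributes (demographic parity, equalized odds)
- [[Model Audit]]: a 10-phase methodology for systematically evaluating a model across its full lifecycle

## Key Entities

- [[Model QA Specialist]]: **primary subject**, the independent model audit specialist agent in The Agency project, with a skeptical but collaborative persona

## Connections

- [[Model QA Specialist]] ← belongs to ← [[The Agency]]
- [[Model QA Specialist]] ← uses ← [[SHAP Analysis]]
- [[Model QA Specialist]] ← uses ← [[Population Stability Index (PSI)]]
- [[Model QA Specialist]] ← uses ← [[Calibration Testing]]
- [[Model QA Specialist]] ← produces ← [[Fairness Audit]]
- [[Model QA Specialist]] ← applies to ← [[ML Ops]]

## Contradictions

- Versus other agent roles: **Corporate Training Designer**. Both belong to The Agency, but their domains do not conflict.

## Technical Deliverables

### Population Stability Index (PSI) Calculation

```python
import numpy as np
import pandas as pd


def compute_psi(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    expected, actual = expected.dropna(), actual.dropna()
    # Bin edges are quantiles of the expected (baseline) sample;
    # np.unique collapses duplicate edges caused by heavily tied values.
    edges = np.unique(np.percentile(expected, np.linspace(0, 100, bins + 1)))
    # Open the outer bins so out-of-range actual values are still counted.
    edges[0], edges[-1] = -np.inf, np.inf
    expected_counts = np.histogram(expected, bins=edges)[0]
    actual_counts = np.histogram(actual, bins=edges)[0]
    # Add-one smoothing keeps empty bins from producing log(0) or a zero division.
    n_bins = len(edges) - 1
    exp_pct = (expected_counts + 1) / (expected_counts.sum() + n_bins)
    act_pct = (actual_counts + 1) / (actual_counts.sum() + n_bins)
    psi = np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
    return round(float(psi), 6)
```

### Discrimination Metrics (Gini & KS)

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def discrimination_report(y_true: pd.Series, y_score: pd.Series) -> dict:
    # Work on plain arrays so mismatched Series indexes cannot misalign the masks.
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    auc = roc_auc_score(y_true, y_score)
    gini = 2 * auc - 1  # Gini is a linear rescaling of AUC
    # KS: maximum distance between the score distributions of the two classes.
    ks_stat, _ = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
    return {"AUC": round(float(auc), 4), "Gini": round(float(gini), 4), "KS": round(float(ks_stat), 4)}
```

### Hosmer-Lemeshow Calibration Test

```python
import pandas as pd
from scipy.stats import chi2


def hosmer_lemeshow_test(y_true: pd.Series, y_pred: pd.Series, groups: int = 10) -> dict:
    data = pd.DataFrame({"y": y_true, "p": y_pred})
    # Bucket by quantiles of the predicted probability; ties may yield fewer buckets.
    data["bucket"] = pd.qcut(data["p"], groups, duplicates="drop")
    agg = data.groupby("bucket", observed=True).agg(
        n=("y", "count"), observed=("y", "sum"), expected=("p", "sum")
    )
    # Sum over buckets of (O - E)^2 / (n * pbar * (1 - pbar)), where pbar = E / n.
    variance = agg["expected"] * (1 - agg["expected"] / agg["n"])
    hl_stat = (((agg["observed"] - agg["expected"]) ** 2) / variance).sum()
    dof = len(agg) - 2
    p_value = chi2.sf(hl_stat, dof)  # survival function, i.e. 1 - cdf
    return {
        "HL_statistic": round(float(hl_stat), 4),
        "p_value": round(float(p_value), 6),
        "calibrated": bool(p_value >= 0.05),
    }
```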
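
### Fairness Audit (Demographic Parity & Equalized Odds)

The source names demographic parity and equalized odds under the fairness audit but gives no code for them. The block below is a minimal sketch of how those two checks might be computed, not the agent's actual implementation. The function name `fairness_gaps`, its parameters, and the returned keys are all hypothetical, and it assumes binary labels, binary decisions (scores already thresholded), and a single protected attribute.

```python
import numpy as np
import pandas as pd


def fairness_gaps(y_true, y_pred, group) -> dict:
    """Largest between-group gaps in selection rate, TPR, and FPR.

    y_true and y_pred are binary (0/1) labels and decisions; group holds the
    protected-attribute value for each row. All names are illustrative.
    """
    df = pd.DataFrame({
        "y": np.asarray(y_true),
        "pred": np.asarray(y_pred),
        "g": np.asarray(group),
    })
    rows = {}
    for name, part in df.groupby("g"):
        positives = part[part["y"] == 1]
        negatives = part[part["y"] == 0]
        rows[name] = {
            # Share of the group receiving a positive decision.
            "selection_rate": part["pred"].mean(),
            # True/false positive rates; NaN when a group lacks that class.
            "tpr": positives["pred"].mean() if len(positives) else np.nan,
            "fpr": negatives["pred"].mean() if len(negatives) else np.nan,
        }
    rates = pd.DataFrame.from_dict(rows, orient="index")
    gaps = rates.max() - rates.min()  # max/min skip NaN by default
    return {
        "demographic_parity_diff": round(float(gaps["selection_rate"]), 4),
        "tpr_gap": round(float(gaps["tpr"]), 4),
        "fpr_gap": round(float(gaps["fpr"]), 4),
        "by_group": rates,
    }
```

Demographic parity is satisfied when `demographic_parity_diff` is near zero, and equalized odds when both `tpr_gap` and `fpr_gap` are near zero. What counts as "near" is a policy choice the source does not specify, so the sketch reports raw gaps rather than a pass/fail flag.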