Update nexus wiki content

2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions
--- a/wiki/sources/engineering-incident-response-commander.md
+++ b/wiki/sources/engineering-incident-response-commander.md
@@ -0,0 +1,62 @@
+---
+title: "Incident Response Commander Agent Personality"
+type: source
+tags: []
+date: 2026-05-01
+---
+
+## Source File
+- [[Agent/agency-agents/engineering/engineering-incident-response-commander]]
+
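+## Summary
+- Core topic: an AI agent for reliability engineering, the Incident Response Commander, which turns the chaos of a production incident into a structured resolution
+- Problem domain: production incident management, on-call process design, post-incident review, and reliability engineering culture
+- Methods/mechanisms: SEV1–SEV4 severity classification framework, role assignment (IC/Comms/Tech Lead/Scribe), blameless culture, the SLO/SLI/SLA framework, chaos engineering, and 5 Whys root cause analysis
+- Conclusion/value: gives reliability engineering organizations a complete incident response SOP, lowers MTTD/MTTR, and protects the mental health of on-call engineers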
+
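+## Key Claims
+- The Incident Commander (IC) turns chaos into a structured response through fixed role assignments and a fixed update cadence
+- A blameless culture ensures that engineers report problems rather than hide them; it is the foundation of a reliability organization
+- SLOs must be binding: once the error budget is exhausted, feature development must pause in favor of reliability work
+- Runbooks must be tested once per quarter; an untested runbook gives a false sense of security
+- On-call engineers must have the authority to take emergency action without a multi-level approval chain
+- Every incident must produce a timeline, an impact assessment, and follow-up action items within 48 hours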
+
+## Key Quotes
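+> "Never frame findings as 'X person caused the outage' – frame as 'the system allowed this failure mode'" (the core principle of blameless culture: attribute failures to system defects, not individual mistakes)
+> "The gap is that we have no integration test for config validation – that's the systemic issue to fix" (in a post-mortem, focus on the systemic gap rather than assigning blame)
+> "A blameless post-mortem without follow-through is just a meeting" (a post-mortem without follow-up is just a waste of time)
+> "Chaos multiplies without coordination" (without coordination, chaos compounds)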
+
+## Key Concepts
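+- [[BlamelessPostMortem]]: blameless post-mortem; focuses on systemic root causes rather than individual mistakes, protecting psychological safety
+- [[ErrorBudget]]: error budget; the margin of unreliability the SLO tolerates. When the budget falls below 25%, all hands shift to reliability work
+- [[ServiceLevelObjective]] (SLO): service level objective; a binding reliability commitment, not a metric that exists only on paper
+- [[ServiceLevelIndicator]] (SLI): service level indicator; a concrete, measurable metric (such as error rate or latency)
+- [[FiveWhys]]: 5 Whys; asks "why" repeatedly to trace a failure back to its systemic root cause
+- [[FaultTreeAnalysis]]: fault tree analysis; a structured root cause analysis tool
+- [[ChaosEngineering]]: chaos engineering; verifies system resilience through controlled fault injection
+- [[GameDay]]: Game Day; a cross-team exercise that simulates cascading failures across multiple services
+- [[MeanTimeToDetect]] (MTTD): mean time from the onset of a failure to its detection; target < 5 minutes (SEV1/2)
+- [[MeanTimeToResolve]] (MTTR): mean time from detection to recovery; target < 30 minutes (SEV1)
+- [[IncidentSeverityMatrix]]: SEV1–SEV4 severity matrix; defines response times, escalation paths, and communication cadence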
+
+## Key Entities
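+- [[IncidentCommander]] (IC): Incident Commander; the sole decision-maker, responsible for managing the timeline and coordinating roles
+- [[CommunicationsLead]]: Communications Lead; sends status updates to stakeholders at the cadence set by the severity level
+- [[TechnicalLead]]: Technical Lead; leads diagnosis using runbooks and observability tools
+- [[Scribe]]: Scribe; records every action and finding in real time, with timestamps
+- [[OnCallEngineer]]: on-call engineer; responsible for detection and initial response
+- [[SiteReliabilityEngineering]] (SRE): site reliability engineering; the engineering domain this agent is grounded in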
+
+## Connections
+- [[SiteReliabilityEngineering]] ← depends on → [[ErrorBudget]]
+- [[BlamelessPostMortem]] ← depends on → [[FiveWhys]]
+- [[IncidentSeverityMatrix]] ← supports → [[IncidentCommander]]
+- [[ChaosEngineering]] ← validates → [[GameDay]]
+- [[ServiceLevelObjective]] ← contains → [[ServiceLevelIndicator]]
+- [[MeanTimeToDetect]] ← measures → [[OnCallEngineer]]
+- [[MeanTimeToResolve]] ← measures → [[IncidentCommander]]
+
+## Contradictions
+- (No conflicting pages detected yet)