Update nexus wiki content

2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions
--- a/wiki/concepts/AI-For-On-Call.md
+++ b/wiki/concepts/AI-For-On-Call.md
@@ -0,0 +1,55 @@
+---
+title: "AI For On-Call"
+type: concept
+tags: [sre, ai, on-call, incident-response, automation]
+last_updated: 2026-04-20
+---
+
+# AI For On-Call
+
+The best use of AI in on-call work is not autonomous remediation but giving on-call engineers enough context to fix incidents quickly.
+
+## Core Thesis
+> "AI's most valuable role in SRE isn't autonomous remediation. It's making sure on-call engineers have the context to fix incidents fast." — Heinrich Hartmann
+
+## Why Context Matters
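+The biggest challenge an on-call engineer faces during an incident is not knowing what to do, but rather:
+1. **Information overload**: too many logs, metrics, and alerts make it hard to locate the problem quickly
+2. **Missing context**: unfamiliar services or code take time to understand
+3. **Time pressure**: MTTR targets demand a fast response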
+
+## Key Scenarios for AI-Assisted On-Call
+
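+### 1. Context Aggregation
+AI aggregates relevant information from multiple sources:
+- Alert history and trends
+- Related incident reports
+- Recent change records
+- Status of dependent services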
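
The aggregation step listed above can be sketched as follows. This is a minimal illustration, not how any particular product works: `IncidentContext`, `aggregate_context`, and the stub sources are hypothetical names, and real sources would be calls to an alerting system, an incident tracker, a deploy log, and a service catalog.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IncidentContext:
    """Bundle of context shown to the on-call engineer next to an alert."""
    service: str
    alert_history: List[str] = field(default_factory=list)
    related_incidents: List[str] = field(default_factory=list)
    recent_changes: List[str] = field(default_factory=list)
    dependency_status: Dict[str, str] = field(default_factory=dict)
    failed_sources: List[str] = field(default_factory=list)


def aggregate_context(service: str, sources: Dict[str, Callable]) -> IncidentContext:
    """Fill an IncidentContext from hypothetical per-field fetchers.

    `sources` maps a field name to a callable that takes the service name.
    A fetcher that raises is recorded in `failed_sources` instead of
    aborting, so one broken backend does not hide the other sources.
    """
    ctx = IncidentContext(service=service)
    for name, fetch in sources.items():
        try:
            setattr(ctx, name, fetch(service))
        except Exception:
            ctx.failed_sources.append(name)
    return ctx


def broken_source(service: str):
    """Stub that simulates an unreachable backend."""
    raise TimeoutError("incident tracker unreachable")


# Stub sources standing in for real backends (all data is made up).
sources = {
    "alert_history": lambda s: [f"{s}: p99 latency high (3 times this week)"],
    "related_incidents": broken_source,
    "recent_changes": lambda s: [f"{s}: deploy 2 hours ago"],
    "dependency_status": lambda s: {"payments-db": "degraded"},
}
ctx = aggregate_context("checkout", sources)
```

Recording failed sources explicitly, rather than silently returning an empty list, lets the engineer tell "no related incidents" apart from "the incident tracker could not be reached".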
+
+### 2. Rapid Diagnosis
+- Summarize the root cause behind an alert
+- Recommend remediation steps that are likely to work
+- Identify similar known issues
+
+### 3. Enhanced On-Call Handoff
+- Generate on-call handoff summaries automatically
+- Highlight unresolved issues
+- Provide historical context
+
+## What AI Should NOT Do
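+- **Execute fixes automatically**: automated remediation without enough context can cause greater damage
+- **Bypass human approval**: critical changes need human confirmation
+- **Ignore uncertainty**: AI should state its confidence clearly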
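
One way to encode these three constraints is a gate that only ever returns text for a human to act on. The sketch below is an assumption-laden illustration: `Suggestion`, `present`, and the 0.7 threshold are hypothetical, and the confidence value is assumed to come from whatever model produces the suggestion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    action: str
    confidence: float  # 0.0 to 1.0, as reported by the hypothetical model
    critical: bool     # True if the action changes production state


def present(suggestion: Suggestion,
            approved_by: Optional[str] = None,
            min_confidence: float = 0.7) -> str:
    """Return a message for the engineer; this function never runs the action."""
    # Always surface the confidence so uncertainty is not hidden.
    label = f"{suggestion.action} (confidence {suggestion.confidence:.0%})"
    if suggestion.confidence < min_confidence:
        return f"LOW CONFIDENCE, verify manually: {label}"
    if suggestion.critical and approved_by is None:
        return f"NEEDS APPROVAL: {label}"
    if approved_by is not None:
        return f"APPROVED by {approved_by}, run manually: {label}"
    return f"SUGGESTED: {label}"


print(present(Suggestion("restart cache pod", 0.4, False)))
print(present(Suggestion("roll back deploy", 0.9, True)))
print(present(Suggestion("roll back deploy", 0.9, True), approved_by="alice"))
```

Even an approved suggestion is handed back to the engineer rather than executed, which keeps the tool in the context-providing role the page argues for.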
+
+## Related Products
+- [[RunLLM]]: an AI product focused on enriching on-call context
+
+## Related Concepts
+- [[Incident-Response]]
+- [[Observability]]
+- [[Resilience]]
+- [[Self-Healing]]
+
+## Source
+- SRE Weekly Issue #513 — [[sre-weekly-issue-513]]