---
title: "AI For On-Call"
type: concept
tags: [sre, ai, on-call, incident-response, automation]
last_updated: 2026-04-20
---
# AI For On-Call
The best use of AI in on-call work is not autonomous remediation but giving on-call engineers enough context to fix incidents quickly.
## Core Thesis
> "AI's most valuable role in SRE isn't autonomous remediation. It's making sure on-call engineers have the context to fix incidents fast." — Heinrich Hartmann
## Why Context Matters
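The biggest challenge for on-call engineers during an incident is not that they don't know what to do, but rather:
1. **Information overload**: there are too many logs, metrics, and alerts to pinpoint the problem quickly
2. **Lost context**: unfamiliar services or code take time to understand
3. **Time pressure**: MTTR targets demand a fast response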
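## Key Scenarios for AI-Assisted On-Call
### 1. Context Aggregation
AI aggregates relevant information from multiple sources:
- Alert history and trends
- Related incident reports
- Recent change records
- Status of dependent services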
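
The aggregation step above can be sketched as follows. This is a minimal illustration, not a description of any particular product: `IncidentContext`, `aggregate_context`, `render`, and the fetcher names are all hypothetical, and real sources (alerting, incident tracker, deploy log, dependency health) are replaced by stub functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IncidentContext:
    """Everything gathered for one alert, grouped by source."""
    alert_id: str
    sections: Dict[str, List[str]] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


# Hypothetical fetcher signature: takes a service name, returns text snippets.
Fetcher = Callable[[str], List[str]]


def aggregate_context(alert_id: str, service: str,
                      fetchers: Dict[str, Fetcher]) -> IncidentContext:
    """Call every source and keep going if one of them fails."""
    ctx = IncidentContext(alert_id=alert_id)
    for name, fetch in fetchers.items():
        try:
            ctx.sections[name] = fetch(service)
        except Exception as exc:  # a broken source must not hide the others
            ctx.errors[name] = str(exc)
    return ctx


def render(ctx: IncidentContext) -> str:
    """Format the context as a short plain-text briefing."""
    lines = [f"Context for alert {ctx.alert_id}"]
    for name, items in ctx.sections.items():
        lines.append(f"[{name}]")
        lines.extend(f"  - {item}" for item in items)
    for name, msg in ctx.errors.items():
        lines.append(f"[{name}] unavailable: {msg}")
    return "\n".join(lines)


def _failing_source(service: str) -> List[str]:
    """Stub that simulates a source timing out."""
    raise TimeoutError("deploy log timed out")


demo = aggregate_context(
    "ALERT-1", "checkout",
    {
        "alert_history": lambda s: [f"{s}: 3 pages in the last 24h"],
        "recent_changes": _failing_source,
    },
)
print(render(demo))
```

Recording a failed source as "unavailable" rather than dropping it silently keeps the briefing honest about what the engineer is not seeing.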
### 2. Rapid Diagnosis
- Summarize the likely root cause behind an alert
- Recommend remediation steps that are likely to work
- Identify similar known issues
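
As one illustration of "identify similar known issues", the sketch below ranks past incident titles by plain string similarity using the standard library's `difflib.SequenceMatcher`. This is a deliberately simple stand-in for whatever matching a real tool uses; the function name, the threshold of 0.6, and the sample titles are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similar_incidents(alert_text: str, known: List[str],
                      threshold: float = 0.6) -> List[Tuple[str, float]]:
    """Return past incidents whose title resembles the alert, best match first."""
    scored = []
    for title in known:
        # Case-insensitive similarity ratio in the range 0.0 to 1.0.
        ratio = SequenceMatcher(None, alert_text.lower(), title.lower()).ratio()
        if ratio >= threshold:
            scored.append((title, round(ratio, 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical incident titles, for illustration only.
known_issues = ["checkout latency above SLO", "disk full on db-replica"]
matches = similar_incidents("Checkout latency above SLO", known_issues)
print(matches)
```

Returning the score alongside each match lets the engineer judge how much weight to give the suggestion instead of treating it as a verdict.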
### 3. Enhanced On-Call Handoff
- Automatically generate on-call handoff summaries
- Highlight unresolved issues
- Provide historical context
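
A handoff summary along these lines could be assembled as in the sketch below. The `Incident` record and `handoff_summary` function are hypothetical, and the summary is built from structured fields rather than generated by a model, simply to show the shape of the output with unresolved items listed first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    title: str
    resolved: bool
    note: str = ""


def handoff_summary(shift: str, incidents: List[Incident]) -> str:
    """Build a plain-text handoff with unresolved items listed first."""
    open_items = [i for i in incidents if not i.resolved]
    closed_items = [i for i in incidents if i.resolved]
    lines = [f"On-call handoff: {shift}",
             f"Unresolved ({len(open_items)}):"]
    lines += [f"  ! {i.title}: {i.note or 'no notes'}" for i in open_items]
    lines.append(f"Resolved ({len(closed_items)}):")
    lines += [f"  - {i.title}" for i in closed_items]
    return "\n".join(lines)


# Hypothetical shift data, for illustration only.
summary = handoff_summary("night shift", [
    Incident("cache eviction storm", resolved=True),
    Incident("queue backlog growing", resolved=False,
             note="scaled consumers, still rising"),
])
print(summary)
```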
## What AI Should NOT Do
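- **Execute fixes automatically**: automated remediation without enough context can cause greater damage
- **Bypass human approval**: critical changes require human confirmation
- **Ignore uncertainty**: AI should state its confidence level clearly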
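
These three guardrails can be expressed as a small approval gate. The sketch below is one possible shape, assuming a hypothetical `Suggestion` record whose `confidence` value comes from the assistant: the suggestion is always displayed with its confidence, and nothing runs unless a named human has approved it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Suggestion:
    action: str
    confidence: float  # 0.0 to 1.0, as reported by the hypothetical assistant


def describe(s: Suggestion) -> str:
    """Always show the confidence alongside the suggested action."""
    return f"Suggested: {s.action} (confidence {s.confidence:.0%})"


def run_if_approved(s: Suggestion, approved_by: Optional[str],
                    execute: Callable[[str], None]) -> bool:
    """Execute only when a named human has approved; otherwise do nothing."""
    if not approved_by:
        return False
    execute(s.action)
    return True


executed: List[str] = []  # stands in for a real remediation runner
suggestion = Suggestion("restart checkout-worker", 0.7)
print(describe(suggestion))

ran_unapproved = run_if_approved(suggestion, None, executed.append)
ran_approved = run_if_approved(suggestion, "alice", executed.append)
```

Passing the executor in as a callable keeps the gate itself free of side effects, so the approval rule can be tested without touching production systems.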
## Related Products
- [[RunLLM]]: an AI product focused on enriching on-call context
## Related Concepts
- [[Incident-Response]]
- [[Observability]]
- [[Resilience]]
- [[Self-Healing]]
## Source
- SRE Weekly Issue #513 — [[sre-weekly-issue-513]]