---
title: "AI For On-Call"
type: concept
tags: [sre, ai, on-call, incident-response, automation]
last_updated: 2026-04-20
---

# AI For On-Call

The best use of AI in on-call work is not autonomous remediation but giving on-call engineers enough context to fix incidents quickly.

## Core Thesis
> "AI's most valuable role in SRE isn't autonomous remediation. It's making sure on-call engineers have the context to fix incidents fast." — Heinrich Hartmann

## Why Context Matters
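When an incident hits, the on-call engineer's biggest challenge is rarely not knowing what to do. It is:
1. **Information overload**: too many logs, metrics, and alerts to locate the problem quickly
2. **Missing context**: unfamiliar services or code take time to understand
3. **Time pressure**: MTTR targets demand a fast response

## Key Scenarios for AI-Assisted On-Call

### 1. Context Aggregation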
AI aggregates relevant information from multiple sources:
- Alert history and trends
- Related incident reports
- Recent change records
- Status of dependent services
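
The aggregation step above can be sketched as a function that pulls per-service data from each source into one bundle. This is a minimal illustration, not a description of any real product: `IncidentContext`, `aggregate_context`, the record fields, the 24-hour lookback, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class IncidentContext:
    """Everything the on-call engineer should see next to the alert."""
    service: str
    recent_alerts: list = field(default_factory=list)
    related_incidents: list = field(default_factory=list)
    recent_changes: list = field(default_factory=list)
    unhealthy_dependencies: list = field(default_factory=list)


def aggregate_context(service, now, alerts, incidents, changes, dependencies,
                      lookback=timedelta(hours=24)):
    """Collect context for one service from four hypothetical sources.

    alerts / changes: dicts with "service", "time", "summary"
    incidents: dicts with "service", "title"
    dependencies: {service: {dependency_name: "ok" | "degraded" | "down"}}
    """
    cutoff = now - lookback
    ctx = IncidentContext(service=service)
    # Keep only alerts and changes for this service inside the lookback window.
    ctx.recent_alerts = [a["summary"] for a in alerts
                         if a["service"] == service and a["time"] >= cutoff]
    ctx.recent_changes = [c["summary"] for c in changes
                          if c["service"] == service and c["time"] >= cutoff]
    # Past incident reports are matched by service name only (an assumption).
    ctx.related_incidents = [i["title"] for i in incidents
                             if i["service"] == service]
    # Report every dependency whose status is anything other than "ok".
    ctx.unhealthy_dependencies = sorted(
        name for name, status in dependencies.get(service, {}).items()
        if status != "ok")
    return ctx


# Made-up sample data, for illustration only.
now = datetime(2026, 4, 20, 12, 0)
ctx = aggregate_context(
    "checkout", now,
    alerts=[
        {"service": "checkout", "time": now - timedelta(hours=1),
         "summary": "p99 latency high"},
        {"service": "checkout", "time": now - timedelta(days=3),
         "summary": "disk usage high"},
    ],
    incidents=[{"service": "checkout", "title": "Connection pool exhaustion"}],
    changes=[{"service": "checkout", "time": now - timedelta(hours=2),
              "summary": "deploy v2 config"}],
    dependencies={"checkout": {"payments": "degraded", "inventory": "ok"}},
)
print(ctx)
```

In a real setup each argument would come from a different system (alerting, incident tracker, deploy log, service catalog); the point of the sketch is only that the engineer receives one filtered bundle instead of four separate queries.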

### 2. Rapid Diagnosis
- Summarize the likely root cause of an alert
- Recommend remediation steps that are likely to work
- Identify similar known issues

### 3. On-Call Handoff
- Automatically generate the handoff summary
- Highlight unresolved issues
- Provide historical context
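
To make the shape of a handoff summary concrete, here is a small sketch that lists unresolved items first and attaches a note to each. It uses plain templating rather than a model, and `build_handoff_summary`, the record fields, and the sample incidents are hypothetical; an AI-assisted version would presumably draft the notes instead of reading them from a field.

```python
def build_handoff_summary(shift_label, incidents):
    """Render a plain-text handoff note with unresolved incidents first.

    incidents: dicts with "title", "resolved" (bool), and "notes" (str).
    """
    open_items = [i for i in incidents if not i["resolved"]]
    closed_items = [i for i in incidents if i["resolved"]]
    lines = [
        f"On-call handoff: {shift_label}",
        f"Unresolved: {len(open_items)}, resolved: {len(closed_items)}",
    ]
    # Unresolved items go first so the incoming engineer sees them immediately.
    for item in open_items:
        lines.append(f"[OPEN] {item['title']} ({item['notes']})")
    for item in closed_items:
        lines.append(f"[done] {item['title']} ({item['notes']})")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
summary = build_handoff_summary("2026-04-20 day shift", [
    {"title": "Cache eviction storm", "resolved": True,
     "notes": "mitigated by raising memory limit"},
    {"title": "Intermittent checkout 502s", "resolved": False,
     "notes": "suspect payments dependency, still investigating"},
])
print(summary)
```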

## What AI Should NOT Do
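- **Execute fixes automatically**: automated remediation without sufficient context can cause greater damage
- **Bypass human approval**: critical changes require human confirmation
- **Ignore uncertainty**: AI should state its confidence clearly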
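
These three constraints can be expressed as a thin guardrail around any suggestion. The sketch below is one possible shape, not an established API: `Suggestion`, `present`, `execute_if_approved`, the `confidence` field, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    action: str
    confidence: float  # hypothetical model-reported score, 0.0 to 1.0


def present(suggestion, low_confidence_below=0.6):
    """Format a suggestion so its confidence is always visible."""
    if suggestion.confidence < low_confidence_below:
        label = "LOW CONFIDENCE"
    else:
        label = "suggested"
    return f"[{label}, {suggestion.confidence:.0%}] {suggestion.action}"


def execute_if_approved(suggestion, approved_by, run):
    """Run the action only when a named human approved it."""
    if not approved_by:
        # No approver means no execution, whatever the confidence is.
        return "skipped: no human approval"
    run(suggestion.action)
    return f"executed, approved by {approved_by}"


# Made-up example: `executed` stands in for a real remediation runner.
executed = []
s = Suggestion(action="roll back checkout deploy", confidence=0.42)
print(present(s))
print(execute_if_approved(s, approved_by=None, run=executed.append))
print(execute_if_approved(s, approved_by="alice", run=executed.append))
```

Note that confidence never unlocks execution on its own here: even a high-confidence suggestion is skipped without an approver, which keeps the human in the loop for every change.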

## Related Products
- [[RunLLM]]: an AI product focused on enriching on-call context

## Related Concepts
- [[Incident-Response]]
- [[Observability]]
- [[Resilience]]
- [[Self-Healing]]

## Source
- SRE Weekly Issue #513 — [[sre-weekly-issue-513]]