---
title: "Root Cause Analysis"
tags:
- devops
- troubleshooting
- ai
- observability
created: 2026-04-25
---
# Root Cause Analysis (RCA)
## Definition
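Root Cause Analysis (RCA) is the process of systematically tracing a problem back to its underlying cause rather than treating only its surface symptoms. Agentic AI correlates logs across layers (compute, network, application) to locate the root cause faster than manual investigation, which significantly speeds up incident resolution.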
## Traditional vs AI-Driven RCA
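| Dimension | Traditional RCA | AI-Driven RCA |
|-----------|-----------------|---------------|
| Analysis speed | Hours to days | Minutes |
| Data scope | Limited log samples | Full logs + cross-source correlation |
| Correlation | Relies on human experience | Automated cross-layer correlation analysis |
| Accuracy | Varies with the analyst's experience | Consistent, based on pattern matching |
| Knowledge retention | Mostly individual experience | Learnable organizational knowledge |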
## Agentic AI RCA Workflow
```
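1. Anomaly detection → a CloudWatch/Stackdriver/Azure Monitor alert fires
2. Data collection → automatically aggregate all logs from the relevant time window
3. Cross-layer correlation → correlate compute/networking/application logs
4. Pattern matching → match against historical failure patterns
5. Root cause output → emit a structured root cause report + remediation suggestions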
```
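Steps 2 through 5 can be sketched in a few lines of Python. This is a minimal illustration, not how any particular AIOps product works: `LogEvent`, `KNOWN_PATTERNS`, the 5-minute window, and the keyword-based matching are all assumptions made for the example, and a real agent would pull events from the monitoring backends rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    timestamp: datetime
    layer: str      # "compute", "networking", or "application"
    message: str


# Hypothetical historical failure patterns: name -> keywords that must all appear.
KNOWN_PATTERNS = {
    "connection pool exhaustion": {"timeout", "pool"},
    "disk pressure": {"disk", "evicted"},
}


def collect(events, alert_time, window=timedelta(minutes=5)):
    """Step 2: keep only events within the window around the alert."""
    return [e for e in events if abs(e.timestamp - alert_time) <= window]


def correlate(events):
    """Step 3: group messages by layer, ordered by time."""
    by_layer = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        by_layer.setdefault(e.layer, []).append(e.message)
    return by_layer


def match_pattern(by_layer):
    """Step 4: return the first known pattern whose keywords all appear."""
    text = " ".join(m.lower() for msgs in by_layer.values() for m in msgs)
    for name, keywords in KNOWN_PATTERNS.items():
        if all(k in text for k in keywords):
            return name
    return None


def analyze(events, alert_time):
    """Steps 2-5: produce a structured report for one alert."""
    by_layer = correlate(collect(events, alert_time))
    return {
        "alert_time": alert_time.isoformat(),
        "layers_involved": sorted(by_layer),
        "root_cause": match_pattern(by_layer) or "unknown",
    }
```

Calling `analyze(events, alert_time)` with the alert timestamp from step 1 returns a dictionary that can be serialized as the structured report; remediation suggestions would be attached per pattern in a fuller version.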
## AI-Driven RCA Example
> AI agent monitoring AWS EKS detects a spike in error rates. It correlates:
> - Kubernetes pod logs (application layer)
> - VPC flow logs (network layer)
> - RDS metrics (database layer)
> - → Identifies: External API timeout causing connection pool exhaustion
> - → Suggests: Implement retry strategy with exponential backoff
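
The suggested remediation, retrying with exponential backoff, might look like the sketch below. The function name `retry_with_backoff`, its default values, and the choice of random ("full") jitter are assumptions for illustration; the retry count is bounded so that retries do not themselves keep pooled connections busy indefinitely.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call `call()` until it succeeds or attempts run out.

    The wait before each retry is drawn from [0, base_delay * 2**attempt],
    capped at max_delay, so many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Here `call` is any zero-argument function that raises `TimeoutError` when the external API does not answer in time; the `sleep` parameter exists only so the waiting can be replaced in tests.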
## Relationship to [[AIOps]]
RCA is a core component of the [[AIOps]] capability matrix:
```python
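# Capability name -> short description of what it covers.
AIOps_Capabilities = {
    "Anomaly Detection": "Detect abnormal patterns",
    "Root Cause Analysis": "Automated diagnosis",  # ← this page
    "Predictive Maintenance": "Predict failures before they occur",
    "Smart Alerting": "Reduce alert fatigue",
    "Automated Remediation": "Apply fixes automatically",
    "Capacity Optimization": "Optimize capacity",
}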
```
## Related Concepts
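- [[Self-Healing Systems]]: once RCA identifies the root cause, it triggers automated remediation
- [[AIOps]]: RCA is a core AIOps capability
- [[MTTR]]: RCA speed directly affects MTTR
- [[Observability]]: RCA depends on observability data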
## Related Sources
- [[how-agentic-ai-can-help-for-cloud-devops]]
- [[what-i-know-about-cloud-service-delivery-1]]