nexus/wiki/concepts/Incident-Management.md

---
title: "Incident Management"
type: concept
tags: [itsm, operations, reliability]
date: 2025-03-01
---

## Definition

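Incident Management is one of the core processes of [[ITSM]]. It focuses on **restoring normal service operation as quickly as possible** and on minimizing the business impact of service outages or degradation.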

## Incident Lifecycle

```
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Event  │ →  │ Detect  │ →  │ Triage  │ →  │ Resolve │ →  │ Review  │
│ Occurs  │    │ & Alert │    │ & Prior │    │& Recover│    │ & Learn │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
```
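The lifecycle above can be read as a simple linear state machine. The following is a minimal sketch under that assumption; the `Stage` names are illustrative (with "& Prior" read as "prioritize"), and real processes often allow loops such as re-triage, which this sketch does not model.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Lifecycle stages, named after the boxes in the diagram (names are illustrative)."""
    EVENT_OCCURS = 1
    DETECT_AND_ALERT = 2
    TRIAGE_AND_PRIORITIZE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the final review."""
    if stage is Stage.REVIEW_AND_LEARN:
        return None
    return Stage(stage.value + 1)


def advance(current: Stage, target: Stage) -> Stage:
    """Move to `target` only if it is the immediate successor of `current`."""
    if next_stage(current) is not target:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Under this assumption, `advance(Stage.DETECT_AND_ALERT, Stage.TRIAGE_AND_PRIORITIZE)` succeeds, while skipping directly from `EVENT_OCCURS` to `RESOLVE_AND_RECOVER` raises a `ValueError`.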

## Modern Incident Management (ITSM 2.0)

In [[ITSM 2.0]], incident management is driven by [[AIOps]] and [[Self-Healing-Systems]]:

### Key Capabilities

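| Capability | Description | Technology |
|------------|-------------|------------|
| Real-time Observability | Observe service state in real time | Metrics, Logs, Traces |
| Automated Remediation | Remediate incidents automatically | AIOps, Runbooks |
| Dynamic Prioritization | Adjust incident priority dynamically | ML Models |
| Auto-escalation | Escalate incidents automatically | Alert Routing |
| Self-Healing | Recover without manual intervention | Automated Recovery |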

### AIOps-Powered Incident Response

```
Detection        → Classification   → Auto-Routing     → Auto-Remediation → SLA Monitoring
    ↓                  ↓                  ↓                  ↓                  ↓
AIOps              ML Models          Skill Routing      Runbooks           Alert Escalation
```
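The flow above could be sketched roughly as follows. Everything here is hypothetical: the keyword rules stand in for an ML classifier, and the routing table, runbook registry, and team names are invented for illustration. The SLA-monitoring stage is left out, so escalation in this sketch fires only when no runbook resolves the alert.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    service: str
    message: str
    category: str = "unknown"
    team: str = "unassigned"
    log: List[str] = field(default_factory=list)


# Hypothetical keyword rules standing in for the ML classifier.
CATEGORY_KEYWORDS = {"disk": "storage", "timeout": "network", "oom": "memory"}

# Hypothetical skill-based routing table: category -> owning team.
TEAM_BY_CATEGORY = {"storage": "infra", "network": "netops", "memory": "platform"}


def restart_service(alert: Alert) -> bool:
    """Placeholder runbook: records a restart and reports success."""
    alert.log.append(f"runbook: restarted {alert.service}")
    return True


# Hypothetical runbook registry: category -> remediation function.
RUNBOOKS: Dict[str, Callable[[Alert], bool]] = {"memory": restart_service}


def classify(alert: Alert) -> Alert:
    """Assign the first category whose keyword appears in the message."""
    text = alert.message.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in text:
            alert.category = category
            break
    return alert


def route(alert: Alert) -> Alert:
    """Pick the owning team, falling back to a generic on-call rotation."""
    alert.team = TEAM_BY_CATEGORY.get(alert.category, "on-call")
    return alert


def remediate(alert: Alert) -> bool:
    """Run the matching runbook, if one exists."""
    runbook = RUNBOOKS.get(alert.category)
    return runbook(alert) if runbook else False


def handle(alert: Alert) -> str:
    """Classify, route, and remediate; escalate when no runbook resolves the alert."""
    route(classify(alert))
    if remediate(alert):
        return "resolved"
    alert.log.append(f"escalated to {alert.team}")
    return "escalated"
```

With these made-up rules, `handle(Alert("api", "OOM killer invoked"))` returns `"resolved"`, whereas an alert about disk usage has no runbook and is escalated to the `infra` team.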

## Key Metrics

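| Metric | Description |
|--------|-------------|
| [[MTTR]] | Mean Time to Recovery: average time to restore service |
| [[MTTD]] | Mean Time to Detect: average time to detect an incident |
| MTTA | Mean Time to Acknowledge: average time to acknowledge an alert |
| Change Failure Rate | Share of changes that result in a failure |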
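As a rough illustration, these metrics can be computed from per-incident timestamps. The `Incident` fields below are assumptions, and so are the measurement windows: this sketch measures MTTD and MTTR from the start of the fault and MTTA from detection, while some teams measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started_at: datetime       # when the fault began (assumed to be known)
    detected_at: datetime      # when monitoring raised the alert
    acknowledged_at: datetime  # when a responder acknowledged it
    resolved_at: datetime      # when service was restored


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.detected_at - i.started_at for i in incidents])


def mtta(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.acknowledged_at - i.detected_at for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.resolved_at - i.started_at for i in incidents])


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of changes that led to a failure."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


# Made-up sample data: two incidents starting at the same time.
t0 = datetime(2025, 3, 1, 12, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=6), t0 + timedelta(minutes=34)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=12), t0 + timedelta(minutes=56)),
]
```

For this sample, `mttd(sample)` is 5 minutes, `mtta(sample)` is 4 minutes, and `mttr(sample)` is 45 minutes.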

## Priority Levels

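| Priority | Description | SLA |
|----------|-------------|-----|
| P1/Critical | Core service unavailable | 15 minutes |
| P2/High | Major functionality unavailable | 1 hour |
| P3/Medium | Minor functionality affected | 4 hours |
| P4/Low | Minimal impact | 24 hours |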
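A minimal sketch of how these SLA targets might be checked in code. The table does not say whether the SLA covers response or resolution, so this sketch simply treats it as a time budget counted from when the incident was opened; the function names are illustrative.

```python
from datetime import datetime, timedelta

# SLA targets copied from the priority table above.
SLA_BY_PRIORITY = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def sla_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the moment at which the SLA budget for this priority runs out."""
    return opened_at + SLA_BY_PRIORITY[priority]


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True once `now` is past the SLA deadline."""
    return now > sla_deadline(priority, opened_at)
```

For example, a P1 incident opened at 09:00 has a deadline of 09:15, so `is_breached` returns `True` at 09:16, while a P2 incident opened at the same time is still within its budget at 09:30.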

## Related Concepts

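- [[ITSM]]: parent framework
- [[Problem-Management]]: problem management
- [[AIOps]]: AI-driven operations capabilities
- [[Self-Healing-Systems]]: self-healing systems
- [[MTTR]]: mean time to recovery
- [[MTTD]]: mean time to detect
- [[Event-Correlation]]: event correlation
- [[Root-Cause-Analysis]]: root cause analysis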

## Sources

- [[understanding-complete-itsm]]: AIOps-driven Incident Management