---
title: "Incident Management"
type: concept
tags: [itsm, operations, reliability]
date: 2025-03-01
---
## Definition
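Incident Management is one of the core processes of [[ITSM]]. It focuses on **restoring normal service operation as quickly as possible** and minimizing the business impact of service outages or degradation.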
## Incident Lifecycle
```
┌──────────┐   ┌──────────┐   ┌──────────────┐   ┌───────────┐   ┌──────────┐
│  Event   │ → │  Detect  │ → │    Triage    │ → │  Resolve  │ → │  Review  │
│  Occurs  │   │ & Alert  │   │ & Prioritize │   │ & Recover │   │ & Learn  │
└──────────┘   └──────────┘   └──────────────┘   └───────────┘   └──────────┘
```
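
The lifecycle above is a strictly linear sequence, so it can be modeled as an ordered enumeration. The following is a minimal sketch, not a prescribed implementation; the names `Stage`, `next_stage`, and `walk_lifecycle` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """The five lifecycle stages, numbered in the order they occur."""
    EVENT_OCCURS = 1
    DETECT_AND_ALERT = 2
    TRIAGE_AND_PRIORITIZE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the final stage."""
    if stage is Stage.REVIEW_AND_LEARN:
        return None
    return Stage(stage.value + 1)


def walk_lifecycle() -> List[str]:
    """Walk the lifecycle from the first stage to the last and collect names."""
    names = []
    stage: Optional[Stage] = Stage.EVENT_OCCURS
    while stage is not None:
        names.append(stage.name)
        stage = next_stage(stage)
    return names
```

A real incident tracker would likely allow backward transitions (for example, reopening a resolved incident); this sketch assumes none.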
## Modern Incident Management (ITSM 2.0)
In [[ITSM 2.0]], incident management is driven by [[AIOps]] and [[Self-Healing-Systems]]:
### Key Capabilities
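| Capability | Description | Technology |
|------------|-------------|------------|
| Real-time Observability | Real-time visibility into service state | Metrics, Logs, Traces |
| Automated Remediation | Fixes applied without manual intervention | AIOps, Runbooks |
| Dynamic Prioritization | Priority adjusted dynamically | ML Models |
| Auto-escalation | Automatic escalation of alerts | Alert Routing |
| Self-Healing | Automatic recovery | Automated Recovery |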
### AIOps-Powered Incident Response
```
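Monitoring & Detection → Intelligent Classification → Automated Routing → Automated Remediation → SLA Monitoring
          ↓                          ↓                        ↓                     ↓                   ↓
        AIOps                    ML Models           Skill-based Routing        Runbooks        Alert Escalation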
```
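
As a rough illustration of how the stages in this pipeline chain together, here is a minimal sketch. Everything in it is an assumption made for the example: the `Alert` fields, the fixed error-rate threshold standing in for an ML classifier, the routing table, and the runbook entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    service: str
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


# Hypothetical skill-based routing table and runbook catalogue.
TEAM_BY_SERVICE = {"checkout": "payments-oncall"}
RUNBOOK_BY_SERVICE = {"checkout": "restart checkout pods"}


def classify(alert: Alert) -> str:
    # Stand-in for an ML model: a fixed threshold on the error rate.
    return "P1" if alert.error_rate >= 0.5 else "P3"


def route(alert: Alert) -> str:
    # Send the alert to the owning team, or to a default on-call rotation.
    return TEAM_BY_SERVICE.get(alert.service, "default-oncall")


def remediate(alert: Alert) -> Optional[str]:
    # Return the runbook action taken, or None if no runbook exists.
    return RUNBOOK_BY_SERVICE.get(alert.service)


def handle(alert: Alert, minutes_open: float, sla_minutes: float) -> dict:
    """Run one alert through classify, route, remediate, and the SLA check."""
    action = remediate(alert)
    return {
        "priority": classify(alert),
        "team": route(alert),
        "action": action,
        # SLA monitoring: escalate only if unremediated past the SLA window.
        "escalated": action is None and minutes_open > sla_minutes,
    }
```

For example, `handle(Alert("checkout", 0.8), minutes_open=5, sla_minutes=15)` classifies the alert as P1, routes it to `payments-oncall`, applies the runbook, and does not escalate.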
## Key Metrics
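| Metric | Description |
|--------|-------------|
| [[MTTR]] | Mean Time to Recovery: average time to restore service |
| [[MTTD]] | Mean Time to Detect: average time to detect an incident |
| MTTA | Mean Time to Acknowledge: average time to acknowledge an alert |
| Change Failure Rate | Share of changes that result in a failure |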
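
These metrics can be computed from per-incident timestamps. The sketch below assumes one common convention: MTTD runs from fault start to detection, MTTA from detection to acknowledgment, and MTTR from fault start to recovery. Other teams measure MTTR from detection instead. The `IncidentRecord` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Iterable, List


@dataclass
class IncidentRecord:
    started: datetime       # when the fault began
    detected: datetime      # when monitoring raised the alert
    acknowledged: datetime  # when a responder took ownership
    recovered: datetime     # when service was restored


def _mean_minutes(seconds: Iterable[float]) -> float:
    """Average a series of durations given in seconds; return minutes."""
    return mean(seconds) / 60


def mttd(incidents: List[IncidentRecord]) -> float:
    return _mean_minutes((i.detected - i.started).total_seconds() for i in incidents)


def mtta(incidents: List[IncidentRecord]) -> float:
    return _mean_minutes((i.acknowledged - i.detected).total_seconds() for i in incidents)


def mttr(incidents: List[IncidentRecord]) -> float:
    return _mean_minutes((i.recovered - i.started).total_seconds() for i in incidents)


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of deployed changes that led to a failure."""
    return failed_changes / total_changes
```

All three time metrics are returned in minutes; `statistics.mean` raises `StatisticsError` on an empty list, so callers should handle the no-incident case themselves.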
## Priority Levels
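| Priority | Description | SLA |
|----------|-------------|-----|
| P1/Critical | Core service unavailable | 15 minutes |
| P2/High | Major functionality unavailable | 1 hour |
| P3/Medium | Secondary functionality affected | 4 hours |
| P4/Low | Minor impact | 24 hours |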
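
The priority table maps directly to a lookup of time limits. The sketch below encodes the four SLA values from the table and checks whether an incident has exceeded its limit. The table does not say whether the SLA covers response or resolution, so the code treats it as a generic deadline measured from when the incident was opened; the function names are hypothetical.

```python
from datetime import datetime, timedelta

# SLA windows taken from the priority table.
SLA_BY_PRIORITY = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def sla_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the moment at which the SLA for this incident expires."""
    return opened_at + SLA_BY_PRIORITY[priority]


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True once `now` is strictly past the SLA deadline."""
    return now > sla_deadline(priority, opened_at)
```

For a P1 incident opened at 09:00, `sla_deadline` returns 09:15 the same day, and `is_breached` becomes true for any later time.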
## Related Concepts
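- [[ITSM]]: parent framework
- [[Problem-Management]]: problem management
- [[AIOps]]: AI-driven operations capabilities
- [[Self-Healing-Systems]]: self-healing systems
- [[MTTR]]: mean time to recovery
- [[MTTD]]: mean time to detect
- [[Event-Correlation]]: event correlation
- [[Root-Cause-Analysis]]: root cause analysis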
## Sources
- [[understanding-complete-itsm]]: AIOps-driven Incident Management