---
title: "Incident Management"
type: concept
tags: [itsm, operations, reliability]
date: 2025-03-01
---

## Definition
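
Incident Management is one of the core processes of [[ITSM]]. It focuses on **restoring normal service operation as quickly as possible** to minimize the business impact of service outages or degradation.
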
## Incident Lifecycle

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    Event     │ → │    Detect    │ → │    Triage    │ → │   Resolve    │ → │    Review    │
│    Occurs    │   │   & Alert    │   │ & Prioritize │   │  & Recover   │   │   & Learn    │
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
```

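The lifecycle can be read as a linear sequence of stages. The sketch below models it as a small state machine; the `Stage` enum and `next_stage` helper are illustrative names, not part of any ITSM tool, and real workflows often allow loops (for example, reopening an incident) that this version leaves out.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages in the order shown in the diagram."""
    EVENT_OCCURS = 1
    DETECT_AND_ALERT = 2
    TRIAGE_AND_PRIORITIZE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the final stage."""
    stages = list(Stage)  # Enum members iterate in definition order
    position = stages.index(stage)
    if position + 1 < len(stages):
        return stages[position + 1]
    return None


# Walk an incident through every stage, starting from the triggering event.
path = []
current = Stage.EVENT_OCCURS
while current is not None:
    path.append(current.name)
    current = next_stage(current)
```

After the loop, `path` holds the five stage names in lifecycle order, which can be handy for logging or for validating that a ticket never skips a stage.
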
## Modern Incident Management (ITSM 2.0)

In [[ITSM 2.0]], incident management is driven by [[AIOps]] and [[Self-Healing-Systems]].

### Key Capabilities

| Capability | Description | Technology |
|------------|-------------|------------|
| Real-time Observability | Live visibility into service health | Metrics, Logs, Traces |
| Automated Remediation | Fixes applied without manual intervention | AIOps, Runbooks |
| Dynamic Prioritization | Priority adjusted as conditions change | ML Models |
| Auto-escalation | Unresolved incidents escalated automatically | Alert Routing |
| Self-Healing | Systems recover on their own | Automated Recovery |

### AIOps-Powered Incident Response

```
Monitoring &      Intelligent       Automatic         Automated         SLA
detection      →  classification →  routing        →  remediation    →  monitoring
    ↓                 ↓                 ↓                 ↓                 ↓
AIOps             ML models         Skill-based       Runbooks          Alert
                                    routing                             escalation
```

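The flow above can be sketched as a chain of small functions. This is a minimal illustration under loud assumptions: the lookup tables stand in for the ML model, the skill-based router, and the runbook catalog, and every name here (`Incident`, `classify`, `route`, `remediate`, `handle`, the team and runbook strings) is hypothetical. SLA timers are omitted; escalation is triggered simply when no runbook applies.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables; a real deployment would call an ML classifier,
# a routing service, and a runbook engine instead.
CATEGORY_BY_SYMPTOM = {"disk_full": "storage", "high_latency": "network"}
TEAM_BY_CATEGORY = {"storage": "storage-oncall", "network": "network-oncall"}
RUNBOOK_BY_CATEGORY = {"storage": "purge_temp_files"}


@dataclass
class Incident:
    service: str
    symptom: str
    category: str = "unknown"
    team: str = "triage-queue"
    actions: list = field(default_factory=list)


def classify(incident):
    """Stand-in for intelligent classification: map a symptom to a category."""
    incident.category = CATEGORY_BY_SYMPTOM.get(incident.symptom, "unknown")


def route(incident):
    """Stand-in for skill-based routing: pick the team for the category."""
    incident.team = TEAM_BY_CATEGORY.get(incident.category, "triage-queue")


def remediate(incident):
    """Record the runbook to run if one exists; return True when one was found."""
    runbook = RUNBOOK_BY_CATEGORY.get(incident.category)
    if runbook is None:
        return False
    incident.actions.append(f"run:{runbook}")
    return True


def handle(incident):
    """Classify, route, then either remediate automatically or escalate."""
    classify(incident)
    route(incident)
    if not remediate(incident):
        incident.actions.append(f"escalate:{incident.team}")
    return incident
```

For example, `handle(Incident("db", "disk_full"))` records the storage runbook, while an incident with an unrecognized symptom falls through to the triage queue and is escalated there.
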
## Key Metrics

| Metric | Description |
|--------|-------------|
| [[MTTR]] | Mean Time to Recovery: average time to restore service after an incident |
| [[MTTD]] | Mean Time to Detect: average time to detect an incident |
| MTTA | Mean Time to Acknowledge: average time to acknowledge an alert |
| Change Failure Rate | Share of changes that result in a failure |

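As a rough illustration, these metrics can be computed from per-incident timestamps. The sketch assumes each incident records `started`, `detected`, `acknowledged`, and `recovered` times (field names chosen here for illustration) and that MTTD and MTTR are measured from the start of the incident while MTTA is measured from detection. Teams differ on these start points, so treat the choice as an assumption. The sample data is invented.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across all incidents."""
    gaps = [(i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents]
    return mean(gaps)


def change_failure_rate(failed_changes, total_changes):
    """Fraction of changes that led to a failure; 0.0 when nothing was changed."""
    return failed_changes / total_changes if total_changes else 0.0


t0 = datetime(2025, 3, 1, 12, 0)  # illustrative data, not from any real system
incidents = [
    {"started": t0,
     "detected": t0 + timedelta(minutes=4),
     "acknowledged": t0 + timedelta(minutes=6),
     "recovered": t0 + timedelta(minutes=30)},
    {"started": t0,
     "detected": t0 + timedelta(minutes=6),
     "acknowledged": t0 + timedelta(minutes=10),
     "recovered": t0 + timedelta(minutes=50)},
]

mttd = mean_minutes(incidents, "started", "detected")       # 5.0 minutes
mtta = mean_minutes(incidents, "detected", "acknowledged")  # 3.0 minutes
mttr = mean_minutes(incidents, "started", "recovered")      # 40.0 minutes
```

Keeping the start and end fields as parameters makes it easy to switch to a different convention, such as measuring MTTR from detection rather than from the start of the incident.
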
## Priority Levels

| Priority | Description | SLA |
|----------|-------------|-----|
| P1/Critical | Core service unavailable | 15 minutes |
| P2/High | Major functionality unavailable | 1 hour |
| P3/Medium | Minor functionality affected | 4 hours |
| P4/Low | Minimal impact | 24 hours |

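The SLA column translates directly into a deadline check. The sketch below is a minimal example; `SLA_TARGETS`, `sla_deadline`, and `is_breached` are illustrative names, and because the table does not say whether the SLA covers response or resolution, the code treats it as a generic target measured from the time the incident was opened.

```python
from datetime import datetime, timedelta

# SLA targets taken from the priority table, keyed by priority code.
SLA_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def sla_deadline(priority, opened_at):
    """Return the moment by which the SLA target must be met."""
    return opened_at + SLA_TARGETS[priority]


def is_breached(priority, opened_at, now):
    """True once `now` is past the SLA deadline for this priority."""
    return now > sla_deadline(priority, opened_at)


opened = datetime(2025, 3, 1, 9, 0)  # illustrative timestamp
p1_deadline = sla_deadline("P1", opened)  # 2025-03-01 09:15
```

A check like `is_breached` is the kind of signal an auto-escalation rule could act on, for instance by paging the next on-call tier when a P1 passes its 15-minute target.
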
## Related Concepts

- [[ITSM]]: parent framework
- [[Problem-Management]]: problem management
- [[AIOps]]: AI-driven operations capabilities
- [[Self-Healing-Systems]]: self-healing systems
- [[MTTR]]: mean time to recovery
- [[MTTD]]: mean time to detect
- [[Event-Correlation]]: event correlation
- [[Root-Cause-Analysis]]: root cause analysis

## Sources

- [[understanding-complete-itsm]]: AIOps-driven Incident Management