| title | type | tags | date |
|---|---|---|---|
| Incident Management | concept | itsm, operations, reliability | 2025-03-01 |
Definition
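Incident Management is one of the core ITSM processes. It focuses on restoring normal service operation as quickly as possible and on minimizing the business impact of service interruptions or degradation.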
Incident Lifecycle
Modern Incident Management (ITSM 2.0)
In ITSM 2.0, incident management is driven by AIOps and Self-Healing-Systems:
Key Capabilities
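| Capability | Description | Technology |
|---|---|---|
| Real-time Observability | Observe system state in real time | Metrics, Logs, Traces |
| Automated Remediation | Remediate incidents automatically | AIOps, Runbooks |
| Dynamic Prioritization | Adjust incident priority dynamically | ML Models |
| Auto-escalation | Escalate incidents automatically | Alert Routing |
| Self-Healing | Recover without manual intervention | Automated Recovery |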
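
As an illustration of the auto-escalation capability, the following is a minimal sketch rather than a description of any particular alert-routing product. The escalation chain, the 5-minute acknowledgement timeout, and the function name `current_escalation_target` are all assumptions introduced for this example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical escalation chain and timeout; neither comes from this note.
ESCALATION_CHAIN = ["on-call engineer", "team lead", "incident manager"]
ACK_TIMEOUT = timedelta(minutes=5)


def current_escalation_target(
    alerted: datetime, now: datetime, acknowledged: bool
) -> Optional[str]:
    """Return who should be paged now, or None once the alert is acknowledged."""
    if acknowledged:
        return None
    # Move one step up the chain for every full timeout that has elapsed.
    steps = max(0, (now - alerted) // ACK_TIMEOUT)
    # Stay on the last entry once the chain is exhausted.
    return ESCALATION_CHAIN[min(steps, len(ESCALATION_CHAIN) - 1)]
```

A real router would usually also record each page and stop escalating when the incident is resolved; the sketch only shows the timeout-driven step-up.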
AIOps-Powered Incident Response
Key Metrics
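| Metric | Description |
|---|---|
| MTTR | Mean Time to Recovery: average time to restore service |
| MTTD | Mean Time to Detect: average time to detect an incident |
| MTTA | Mean Time to Acknowledge: average time to acknowledge an incident |
| Change Failure Rate | Share of changes that result in a failure |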
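
These metrics can be computed from incident timestamps. The sketch below is one possible way to do it, not a canonical definition: the `Incident` fields are hypothetical, and the choice of start point for each interval is an assumption (here MTTD and MTTR are measured from the start of the failure, MTTA from detection). Teams often define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # All field names are hypothetical; map them to your own ticket schema.
    started: datetime       # when the service began to fail
    detected: datetime      # when monitoring raised the first alert
    acknowledged: datetime  # when a responder took ownership
    recovered: datetime     # when normal service was restored


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents: list[Incident]) -> dict[str, timedelta]:
    """Compute MTTD, MTTA and MTTR under the interval assumptions above."""
    return {
        "MTTD": mean_delta([i.detected - i.started for i in incidents]),
        "MTTA": mean_delta([i.acknowledged - i.detected for i in incidents]),
        "MTTR": mean_delta([i.recovered - i.started for i in incidents]),
    }


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of changes that led to a failure (0.0 when there were none)."""
    return failed_changes / total_changes if total_changes else 0.0
```

For example, two incidents detected after 2 and 4 minutes give an MTTD of 3 minutes.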
Priority Levels
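| Priority | Description | SLA |
|---|---|---|
| P1/Critical | Core service unavailable | 15 minutes |
| P2/High | Major functionality unavailable | 1 hour |
| P3/Medium | Secondary functionality affected | 4 hours |
| P4/Low | Minor impact | 24 hours |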
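
The priority table can be encoded as a lookup for SLA tracking. This is a minimal sketch: the durations are taken from the table, but the table does not say whether they are response or resolution targets, so treating them as deadlines measured from incident creation is an assumption. The names `SLA_TARGETS`, `sla_deadline`, and `is_breached` are illustrative.

```python
from datetime import datetime, timedelta

# SLA targets copied from the priority table. Treating them as deadlines
# measured from incident creation is an assumption.
SLA_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def sla_deadline(priority: str, created: datetime) -> datetime:
    """Return the time by which the SLA target must be met."""
    return created + SLA_TARGETS[priority]


def is_breached(priority: str, created: datetime, now: datetime) -> bool:
    """True once `now` is past the SLA deadline."""
    return now > sla_deadline(priority, created)
```

A P1 incident created at 09:00 would have a deadline of 09:15 under this reading.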
Related Concepts
Sources