nexus/wiki/concepts/Incident-Management.md at b40abbcd473a7093d8261e212e3d6de97c1e516a

Shen Wei f09834b5a5 Update nexus: fix conflicts and sync local changes

2026-04-26 12:06:50 +08:00

Definition

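Incident Management is one of the core ITSM processes. It focuses on restoring normal service operation as quickly as possible and on minimizing the business impact of service outages or degradation.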

Incident Lifecycle

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Event  │ →  │ Detect  │ →  │ Triage  │ →  │ Resolve │ →  │ Review  │
│ Occurs  │    │ & Alert │    │ & Prior │    │& Recover│    │ & Learn │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
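
The five stages in the diagram can be read as a strictly linear state machine. The following is a minimal sketch under that assumption; the names `Stage`, `next_stage`, and `walk_lifecycle` are illustrative and do not come from any ITSM tool.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Lifecycle stages, numbered in the order shown in the diagram."""
    EVENT_OCCURS = 1
    DETECT_AND_ALERT = 2
    TRIAGE_AND_PRIORITIZE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the final review."""
    if stage is Stage.REVIEW_AND_LEARN:
        return None
    return Stage(stage.value + 1)


def walk_lifecycle() -> list[str]:
    """Walk an incident through every stage and return the stage names in order."""
    names = []
    stage: Optional[Stage] = Stage.EVENT_OCCURS
    while stage is not None:
        names.append(stage.name)
        stage = next_stage(stage)
    return names
```

A real workflow would likely allow backward transitions as well (for example, a failed fix returning the incident to triage); this sketch deliberately leaves them out.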

Modern Incident Management (ITSM 2.0)

In ITSM 2.0, incident management is driven by AIOps and Self-Healing-Systems:

Key Capabilities

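Capability	Description	Technology
Real-time Observability	Observe system state in real time	Metrics, Logs, Traces
Automated Remediation	Apply fixes automatically	AIOps, Runbooks
Dynamic Prioritization	Adjust incident priority dynamically	ML Models
Auto-escalation	Escalate incidents automatically	Alert Routing
Self-Healing	Recover without manual intervention	Automated Recovery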

AIOps-Powered Incident Response

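Monitoring & detection → Intelligent classification → Automatic routing → Automated remediation → SLA monitoring
          ↓                          ↓                        ↓                     ↓                   ↓
        AIOps                    ML models           Skill-based routing        Runbooks        Alert escalation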
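
The flow above can be sketched as a small pipeline. This is an illustration only: the keyword lookup stands in for the ML classifier, and every name in it (`Alert`, `KEYWORD_TO_CATEGORY`, `CATEGORY_TO_TEAM`, `RUNBOOKS`, the team and runbook names) is a hypothetical placeholder rather than part of any AIOps product.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    message: str
    log: list[str] = field(default_factory=list)


# Hypothetical lookup tables; a real system would use an ML model and a CMDB.
KEYWORD_TO_CATEGORY = {"disk": "storage", "timeout": "network", "oom": "memory"}
CATEGORY_TO_TEAM = {"storage": "infra-team", "network": "net-team", "memory": "app-team"}
RUNBOOKS = {"storage": "clean_tmp_files", "memory": "restart_service"}


def classify(alert: Alert) -> str:
    """Stand-in for intelligent classification: first matching keyword wins."""
    text = alert.message.lower()
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in text:
            return category
    return "unknown"


def handle(alert: Alert) -> list[str]:
    """Classify, route, then either select a runbook or escalate."""
    category = classify(alert)
    alert.log.append(f"classified:{category}")

    # Skill-based routing: fall back to the service desk for unknown categories.
    team = CATEGORY_TO_TEAM.get(category, "service-desk")
    alert.log.append(f"routed:{team}")

    # Automated remediation if a runbook exists, otherwise alert escalation.
    runbook = RUNBOOKS.get(category)
    if runbook is not None:
        alert.log.append(f"runbook:{runbook}")
    else:
        alert.log.append("escalated:on-call")
    return alert.log
```

For example, `handle(Alert("db", "Disk usage at 95%"))` records a classification of `storage`, a route to `infra-team`, and the `clean_tmp_files` runbook. An alert with no matching runbook ends in escalation instead.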

Key Metrics

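Metric	Description
MTTR	Mean Time to Recovery: average time to restore service
MTTD	Mean Time to Detect: average time to detect an incident
MTTA	Mean Time to Acknowledge: average time to acknowledge an alert
Change Failure Rate	Percentage of changes that result in a failure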
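
As a worked example, the metrics in the table can be computed from per-incident timestamps. The sketch below assumes MTTD and MTTR are both measured from the moment the fault occurred and MTTA from detection to acknowledgement; teams define these start points differently, so treat the choice as an assumption. `IncidentRecord` and the function names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    occurred: datetime
    detected: datetime
    acknowledged: datetime
    recovered: datetime


def mean(deltas) -> timedelta:
    """Arithmetic mean of a non-empty iterable of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(records) -> timedelta:
    # Assumption: detection time is measured from fault occurrence.
    return mean(r.detected - r.occurred for r in records)


def mtta(records) -> timedelta:
    # Assumption: acknowledgement time is measured from detection.
    return mean(r.acknowledged - r.detected for r in records)


def mttr(records) -> timedelta:
    # Assumption: recovery time is measured from fault occurrence.
    return mean(r.recovered - r.occurred for r in records)


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of changes that resulted in a failure."""
    return failed_changes / total_changes
```

With two incidents detected after 5 and 15 minutes, `mttd` returns 10 minutes; the other metrics follow the same pattern.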

Priority Levels

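Priority	Description	SLA
P1/Critical	Core service unavailable	15 minutes
P2/High	Major functionality unavailable	1 hour
P3/Medium	Secondary functionality affected	4 hours
P4/Low	Minor impact	24 hours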
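
The priority table maps directly onto a deadline check. The sketch below assumes the SLA clock starts when the incident is opened; the table does not say whether the targets apply to response or resolution, so the code stays neutral on that point. `SLA_TARGETS`, `sla_deadline`, and `is_breached` are illustrative names.

```python
from datetime import datetime, timedelta

# SLA targets copied from the priority table.
SLA_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def sla_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the time by which the SLA target must be met."""
    # Assumption: the SLA clock starts when the incident is opened.
    return opened_at + SLA_TARGETS[priority]


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True once `now` is past the SLA deadline for this priority."""
    return now > sla_deadline(priority, opened_at)
```

A check like this could feed the alert escalation step: a P1 incident opened at 09:00 is breached at any time after 09:15.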

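Related Concepts

ITSM: parent framework
Problem-Management: problem management
AIOps: AI-driven operations capabilities
Self-Healing-Systems: self-healing systems
MTTR: mean time to recovery
MTTD: mean time to detect
Event-Correlation: event correlation
Root-Cause-Analysis: root cause analysis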

Sources

understanding-complete-itsm: AIOps-driven Incident Management
