Update nexus: fix conflicts and sync local changes
@@ -1,74 +1,74 @@
---
title: "Incident Management"
type: concept
tags: [itsm, operations, reliability]
date: 2025-03-01
---

## Definition

Incident Management is one of the core processes of [[ITSM]]. It focuses on **restoring normal service operation quickly** and minimizing the business impact of service outages or degradation.

## Incident Lifecycle

```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    Event     │ → │    Detect    │ → │    Triage    │ → │   Resolve    │ → │    Review    │
│    Occurs    │   │   & Alert    │   │ & Prioritize │   │  & Recover   │   │   & Learn    │
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
```

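The lifecycle can be read as a linear state machine. The following is a minimal Python sketch, assuming each stage can only advance to the one after it; the `Stage` enum and the `advance` helper are illustrative names and do not come from any ITSM tool.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages, numbered in the order the diagram shows them."""
    EVENT_OCCURS = 1
    DETECT_AND_ALERT = 2
    TRIAGE_AND_PRIORITIZE = 3
    RESOLVE_AND_RECOVER = 4
    REVIEW_AND_LEARN = 5


def advance(stage: Stage) -> Stage:
    """Return the stage that follows `stage` (assumed strictly linear)."""
    if stage is Stage.REVIEW_AND_LEARN:
        raise ValueError("incident is already in its final stage")
    return Stage(stage.value + 1)
```

Real processes often allow loops, such as reopening a resolved incident; this sketch leaves those out on purpose.
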
## Modern Incident Management (ITSM 2.0)

In [[ITSM 2.0]], incident management is driven by [[AIOps]] and [[Self-Healing-Systems]]:

### Key Capabilities

| Capability | Description | Technology |
|------------|-------------|------------|
| Real-time Observability | Observe system state in real time | Metrics, Logs, Traces |
| Automated Remediation | Apply fixes automatically | AIOps, Runbooks |
| Dynamic Prioritization | Adjust priority dynamically | ML Models |
| Auto-escalation | Escalate automatically | Alert Routing |
| Self-Healing | Recover without manual intervention | Automated Recovery |

### AIOps-Powered Incident Response

```
Monitoring & Detection → Intelligent Classification → Automated Routing → Automated Remediation → SLA Monitoring
↓                        ↓                            ↓                   ↓                       ↓
AIOps                    ML Models                    Skill-Based Routing Runbooks                Alert Escalation
```

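The five-step flow can be sketched as a chain of small functions that each enrich an incident record. This is a rough illustration under stated assumptions: a keyword rule stands in for the ML classifier, and the team names, runbook names, and dictionary keys (`summary`, `elapsed_min`, `sla_min`) are all hypothetical.

```python
from typing import Any, Callable

Incident = dict[str, Any]


def detect(incident: Incident) -> Incident:
    # Stand-in for AIOps detection: mark the incident as detected.
    return {**incident, "detected": True}


def classify(incident: Incident) -> Incident:
    # Stand-in for an ML model: a keyword rule picks the category.
    is_db = "db" in incident["summary"].lower()
    return {**incident, "category": "database" if is_db else "general"}


def route(incident: Incident) -> Incident:
    # Skill-based routing: map the category to a (hypothetical) team.
    teams = {"database": "dba-oncall", "general": "service-desk"}
    return {**incident, "assignee": teams[incident["category"]]}


def remediate(incident: Incident) -> Incident:
    # Stand-in for a runbook: record which runbook would be run.
    return {**incident, "runbook": f"restart-{incident['category']}"}


def monitor_sla(incident: Incident) -> Incident:
    # Escalate when elapsed time exceeds the SLA budget.
    breached = incident["elapsed_min"] > incident["sla_min"]
    return {**incident, "escalated": breached}


PIPELINE: list[Callable[[Incident], Incident]] = [
    detect, classify, route, remediate, monitor_sla,
]


def run(incident: Incident) -> Incident:
    """Pass the incident through every step in order."""
    for step in PIPELINE:
        incident = step(incident)
    return incident
```

Keeping each step a pure function makes it easy to swap the keyword rule for a real model later without touching the other steps.
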
## Key Metrics

| Metric | Description |
|--------|-------------|
| [[MTTR]] | Mean Time to Recovery: average time to restore service |
| [[MTTD]] | Mean Time to Detect: average time to detect an incident |
| MTTA | Mean Time to Acknowledge: average time to acknowledge an alert |
| Change Failure Rate | Percentage of changes that fail |

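These metrics reduce to averages over incident timestamps. The sketch below assumes one common set of definitions, since teams measure them differently: MTTD runs from occurrence to detection, MTTA from detection to acknowledgement, and MTTR from occurrence to resolution. The `IncidentRecord` fields are illustrative names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    occurred: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def _mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd(records: list[IncidentRecord]) -> float:
    # Assumed definition: occurrence -> detection.
    return _mean_minutes([r.detected - r.occurred for r in records])


def mtta(records: list[IncidentRecord]) -> float:
    # Assumed definition: detection -> acknowledgement.
    return _mean_minutes([r.acknowledged - r.detected for r in records])


def mttr(records: list[IncidentRecord]) -> float:
    # Assumed definition: occurrence -> resolution.
    return _mean_minutes([r.resolved - r.occurred for r in records])


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Fraction of changes that failed, between 0.0 and 1.0."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes
```

Whichever start and end points a team picks, they should be applied consistently so the numbers stay comparable over time.
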
## Priority Levels

| Priority | Description | SLA |
|----------|-------------|-----|
| P1/Critical | Core service unavailable | 15 minutes |
| P2/High | Major functionality unavailable | 1 hour |
| P3/Medium | Secondary functionality affected | 4 hours |
| P4/Low | Minor impact | 24 hours |

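The priority table maps directly to a lookup of SLA targets. The sketch below assumes the SLA clock starts when the incident is opened; the table does not say whether the target covers response or resolution, so the code leaves that open. `SLA_TARGETS`, `sla_deadline`, and `is_breached` are illustrative names.

```python
from datetime import datetime, timedelta

# SLA targets copied from the priority table.
SLA_TARGETS: dict[str, timedelta] = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}


def sla_deadline(priority: str, opened_at: datetime) -> datetime:
    """Return the time by which the SLA target must be met."""
    return opened_at + SLA_TARGETS[priority]


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True once `now` is past the deadline (assumes the clock starts at open)."""
    return now > sla_deadline(priority, opened_at)
```

A check like `is_breached("P1", opened_at, datetime.now())` could feed the auto-escalation step described earlier.
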
## Related Concepts

- [[ITSM]]: parent framework
- [[Problem-Management]]: problem management
- [[AIOps]]: AI-driven operations capabilities
- [[Self-Healing-Systems]]: self-healing systems
- [[MTTR]]: mean time to recovery
- [[MTTD]]: mean time to detect
- [[Event-Correlation]]: event correlation
- [[Root-Cause-Analysis]]: root cause analysis

## Sources

- [[understanding-complete-itsm]]: AIOps-driven incident management