---
title: "Event Correlation"
type: concept
tags: [aiops, monitoring, incident-management, operations]
date: 2025-03-01
---

## Definition

Event correlation is one of the core techniques of [[AIOps]]. It uses algorithms to group large volumes of scattered monitoring alerts and system events into a small number of meaningful event groups, which reduces alert noise and speeds up [[Incident-Management]] and [[Root-Cause-Analysis]].

## The Problem

```
Without Event Correlation:
─────────────────────────────
Alert #1: CPU High on Server A
Alert #2: Memory High on Server A
Alert #3: Disk I/O High on Server A
Alert #4: Network Latency on Server A
Alert #5: App Response Slow
Alert #6: Database Connection Pool Full
Alert #7: API Timeout
... (100+ alerts for ONE root cause)
```

## Event Correlation Techniques

### 1. Rule-Based Correlation

```
IF alerts occur within time window T
AND involve same source/host/service
THEN group as single incident
```

### 2. Statistical Correlation

- Time series analysis
- Pattern matching
- Anomaly detection

### 3. AI/ML Correlation

- Root cause inference
- Causal graph models
- Predictive correlation

## Benefits

| Benefit | Description |
|---------|-------------|
| Alert noise reduction | Cuts noise by 90%+ |
| Faster RCA | Locates the root cause quickly |
| Lower MTTR | Less time spent on manual analysis |
| SLA assurance | Faster response |

## In ITSM Context

Within [[Incident-Management]] in [[ITSM 2.0]], event correlation is a key capability:

```
Incident Management 2.0
├── Event Correlation (ML-enhanced)
│   ├── Alert deduplication
│   ├── Root cause inference
│   └── Correlation reasoning
├── AIOps-powered Analysis
│   ├── Anomaly detection
│   ├── Pattern recognition
│   └── Predictive analysis
└── Self-Healing Automation
    ├── Automated diagnosis
    └── Automated remediation
```

## Related Concepts

- [[AIOps]]: the AI engine behind event correlation
- [[Incident-Management]]: the application context for event correlation
- [[Root-Cause-Analysis]]: root cause analysis
- [[MTTR]]: mean time to recovery
- [[Self-Healing-Systems]]: self-healing systems

## Sources

- [[understanding-complete-itsm]]: ML-enhanced Event Correlation
- [[what-i-know-about-cloud-service-delivery-1]]: event correlation in AIOps
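
## Example: Rule-Based Correlation Sketch

The rule stated under Rule-Based Correlation (alerts from the same host within time window T become one incident) can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation. The names `Alert`, `Incident`, `correlate`, and `window_seconds` are hypothetical, and the choice to measure the window from the most recent alert in a group (a sliding window) is an assumption; the note itself does not specify how T is anchored.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds; hypothetical unit chosen for this sketch
    host: str
    message: str


@dataclass
class Incident:
    host: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300.0):
    """Group alerts by host; an alert joins the host's latest incident
    if it arrives within window_seconds of that incident's last alert."""
    latest_by_host = {}  # host -> most recently opened Incident
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_by_host.get(alert.host)
        if (current is not None
                and alert.timestamp - current.alerts[-1].timestamp <= window_seconds):
            # Same host and inside the window: same incident.
            current.alerts.append(alert)
        else:
            # New host, or the window has expired: open a new incident.
            current = Incident(host=alert.host, alerts=[alert])
            latest_by_host[alert.host] = current
            incidents.append(current)
    return incidents


alerts = [
    Alert(0, "server-a", "CPU High"),
    Alert(30, "server-a", "Memory High"),
    Alert(45, "server-b", "Disk I/O High"),
    Alert(60, "server-a", "Network Latency"),
    Alert(1000, "server-a", "CPU High"),
]
incidents = correlate(alerts, window_seconds=300)
for incident in incidents:
    print(incident.host, [a.message for a in incident.alerts])
```

With this sample input, the first three `server-a` alerts collapse into one incident, the `server-b` alert forms its own, and the late `server-a` alert opens a new incident because it falls outside the window. Keying on host alone is the simplest reading of "same source/host/service"; a real deployment would likely key on a combination of labels.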