---
title: "Alert Management"
type: concept
tags: [monitoring, alerting, devops, sre]
last_updated: 2026-04-26
---
## Alert Management
**Chinese name:** 告警管理
**Type:** Operations process and methodology
**Aliases:**
- 告警管理 (alert management)
- 告警分发 (alert dispatch)
- Alert Routing
---
## Definition
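Alert management is the end-to-end process covering the full alert lifecycle: **generate → receive → classify → dispatch → respond → close**. Its goal is to notify the right people promptly when a critical system misbehaves, while avoiding alert storms and alert fatigue.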
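**Alert lifecycle:**
1. **Generate:** The monitoring system (Prometheus) evaluates its alerting rules to decide whether an alert should fire.
2. **Forward:** Prometheus sends firing alerts to Alertmanager through the Alertmanager API.
3. **Process:** Alertmanager applies inhibition, grouping, and silencing.
4. **Route:** Alerts are routed to the matching notification channel by label or severity.
5. **Respond:** The on-call engineer receives the notification and handles the problem.
6. **Resolve:** Once the problem is fixed, the alert clears automatically.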
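
Step 2 above can be sketched in Python by building the JSON body that Alertmanager's v2 API accepts on `POST /api/v2/alerts`: an array of alerts, each with `labels`, `annotations`, and RFC 3339 timestamps. This is a minimal sketch, not how Prometheus itself is implemented; the helper `build_alert`, the alert name `HighCPU`, and the instance `node-1:9100` are hypothetical values chosen for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def build_alert(name, severity, instance, summary, starts_at, duration=None):
    """Return one alert in the JSON shape used by POST /api/v2/alerts."""
    alert = {
        "labels": {"alertname": name, "severity": severity, "instance": instance},
        "annotations": {"summary": summary},
        "startsAt": starts_at.isoformat(),
    }
    if duration is not None:
        # endsAt tells Alertmanager when to treat the alert as resolved.
        alert["endsAt"] = (starts_at + duration).isoformat()
    return alert


# Hypothetical alert used only to show the payload shape.
start = datetime(2026, 4, 26, 8, 0, tzinfo=timezone.utc)
payload = [
    build_alert("HighCPU", "warning", "node-1:9100", "CPU above 90%",
                start, timedelta(minutes=5))
]
body = json.dumps(payload)  # request body for the POST
```

In practice Prometheus sends this request on its own once Alertmanager is configured as a target; building the body by hand is mainly useful for testing routes and receivers.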
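**Alert governance best practices:**
- **SLO/SLA-driven:** Tie alerts to business-critical indicators rather than to infrastructure details.
- **Severity levels:** Use three levels (Critical / Warning / Info) so that not every alert is treated as equally urgent.
- **Inhibition rules:** When a root-cause alert fires, automatically suppress the alerts derived from it.
- **Silences:** Temporarily mute alerts during maintenance windows.
- **On-call rotation:** Rotate on-call duty so that someone can respond 24/7.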
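
The inhibition and silence practices above can be illustrated with a small pure-Python model. It is loosely patterned on Alertmanager's source/target/`equal` inhibition semantics but is only a sketch under simplifying assumptions: alerts are flat label dictionaries, matchers are exact equality only, and every alert name, instance, and time window is hypothetical.

```python
from datetime import datetime, timezone


def matches(alert, matchers):
    """True if the alert carries every label/value pair in `matchers`."""
    return all(alert.get(k) == v for k, v in matchers.items())


def is_inhibited(alert, firing, source_match, target_match, equal):
    """True if `alert` is a target and a firing source alert shares
    the same value for every label listed in `equal`."""
    if not matches(alert, target_match):
        return False
    return any(
        src is not alert
        and matches(src, source_match)
        and all(src.get(k) == alert.get(k) for k in equal)
        for src in firing
    )


def is_silenced(alert, silences, now):
    """True if an active silence window matches the alert."""
    return any(
        s["start"] <= now < s["end"] and matches(alert, s["matchers"])
        for s in silences
    )


alerts = [
    {"alertname": "NodeDown", "severity": "critical", "instance": "node-a"},
    {"alertname": "HighLatency", "severity": "warning", "instance": "node-a"},
    {"alertname": "HighLatency", "severity": "warning", "instance": "node-b"},
]
# Maintenance window on node-b from 02:00 to 04:00 UTC.
silences = [{
    "matchers": {"instance": "node-b"},
    "start": datetime(2026, 4, 26, 2, 0, tzinfo=timezone.utc),
    "end": datetime(2026, 4, 26, 4, 0, tzinfo=timezone.utc),
}]
now = datetime(2026, 4, 26, 3, 0, tzinfo=timezone.utc)

to_notify = [
    a for a in alerts
    if not is_inhibited(a, alerts, {"alertname": "NodeDown"},
                        {"severity": "warning"}, ["instance"])
    and not is_silenced(a, silences, now)
]
```

Here the root-cause `NodeDown` alert on `node-a` suppresses the derived latency warning on the same instance, and the maintenance silence mutes `node-b`, so only `NodeDown` is left to notify.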
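**Golden rule for evaluating alerts:** Every alert must have a clear handling procedure; an alert that nobody can act on immediately should be suppressed or downgraded.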
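
One way to enforce this rule is a review-time check that downgrades any alert rule lacking documented handling steps. The sketch below assumes a team convention, not stated in this page, that handling steps are linked from a `runbook_url` annotation; the function name `effective_severity` and the example URL are hypothetical.

```python
SEVERITIES = ["info", "warning", "critical"]  # ordered lowest to highest


def effective_severity(rule):
    """Keep the declared severity if the rule links a runbook;
    otherwise demote it by one level (never below "info")."""
    severity = rule["labels"]["severity"]
    if rule.get("annotations", {}).get("runbook_url"):
        return severity
    index = SEVERITIES.index(severity)
    return SEVERITIES[max(index - 1, 0)]


documented = {
    "labels": {"severity": "critical"},
    "annotations": {"runbook_url": "https://wiki.example.com/runbooks/disk-full"},
}
undocumented = {"labels": {"severity": "critical"}, "annotations": {}}

kept = effective_severity(documented)
demoted = effective_severity(undocumented)
```

With these inputs the documented rule stays critical, while the undocumented one is demoted to warning until someone writes down how to handle it.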
---
## Prometheus Alert Management Architecture
```
Prometheus (rule evaluation) → Alertmanager (inhibition/grouping/routing) → notification channels (email/Slack/PagerDuty/phone)
```
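
The grouping and routing stage in the middle of this pipeline can be modeled in a few lines of Python. This is a simplified sketch, assuming a flat first-match route list with a default receiver rather than Alertmanager's full routing tree; the receiver names and the severity-to-channel mapping are hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: first matching entry wins.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]
DEFAULT_RECEIVER = "email"


def route(labels):
    """Return the receiver of the first route whose matchers all fit."""
    for matchers, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the `group_by` labels,
    so each bucket can be sent as a single notification."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(k, "") for k in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "DiskFull", "severity": "critical", "instance": "node-a"},
    {"alertname": "DiskFull", "severity": "critical", "instance": "node-b"},
    {"alertname": "CertExpiry", "severity": "info", "instance": "node-a"},
]
groups = group_alerts(alerts, ["alertname"])
receivers = {key: route(members[0]) for key, members in groups.items()}
```

Grouping by `alertname` collapses the two `DiskFull` alerts into one notification sent to the critical channel, while the informational alert falls through to the default receiver.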
---
## Related Sources
- [[家庭监控方案-prometheus-grafana-node-exporter-cadvisor-blackbox]]
- [[ctp-topic-8-implementation-of-cloud-monitoring-using-micro-focus-operations-brid]]