chore: sync local project changes
@@ -1,54 +1,54 @@
---
title: "Alert Management"
type: concept
tags: [monitoring, alerting, devops, sre]
last_updated: 2026-04-26
---

## Alert Management

**Chinese name:** 告警管理

**Type:** Operations process and methodology

**Aliases:**
- 告警管理 (alert management)
- 告警分发 (alert dispatch)
- Alert Routing

---
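
## Definition

Alert management is the end-to-end lifecycle process for alerts: **generation → reception → classification → dispatch → response → closure**. Its goal is to notify the right people promptly when a critical system misbehaves, while avoiding alert storms and alert fatigue.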
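
**Alert lifecycle:**
1. **Generate:** The monitoring system (Prometheus) evaluates alerting rules to decide whether an alert fires.
2. **Forward:** Prometheus sends firing alerts to Alertmanager through the Alertmanager API.
3. **Process:** Alertmanager applies inhibition, grouping, and silencing.
4. **Route:** Alerts are routed to the matching notification channel by label and severity.
5. **Respond:** The on-call engineer receives the notification and handles the issue.
6. **Resolve:** Once the problem is fixed and the rule condition no longer holds, the alert resolves automatically.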
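
The Forward and Resolve steps can be illustrated with a minimal sketch of the payload a client might send to Alertmanager. The field names (`labels`, `annotations`, `startsAt`, `endsAt`) follow the Alertmanager v2 alerts API; the helper `build_alert`, the alert name, and the summary text are hypothetical and only serve the example. Nothing is sent over the network here.

```python
import json
from datetime import datetime, timedelta, timezone


def build_alert(name, severity, summary, starts_at, resolved_at=None):
    """Build one alert object in the shape used by Alertmanager's v2 alerts API.

    `build_alert` is a hypothetical helper for illustration, not part of any library.
    """
    alert = {
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": summary},
        "startsAt": starts_at.isoformat(),
    }
    if resolved_at is not None:
        # An endsAt that lies in the past marks the alert as resolved.
        alert["endsAt"] = resolved_at.isoformat()
    return alert


start = datetime(2026, 4, 26, 8, 0, tzinfo=timezone.utc)
firing = build_alert("HighErrorRate", "critical", "5xx ratio above SLO", start)
resolved = build_alert(
    "HighErrorRate", "critical", "5xx ratio above SLO",
    start, resolved_at=start + timedelta(minutes=5),
)

# The API expects a JSON array of alerts as the request body.
body = json.dumps([firing, resolved])
```

In a real setup Prometheus builds and sends this payload itself; the sketch only makes the data that crosses the Prometheus-to-Alertmanager boundary visible.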
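
**Alert governance best practices:**
- **SLO/SLA driven:** Tie alerts to business-critical indicators rather than infrastructure details.
- **Tiered severity:** Use three levels (Critical / Warning / Info) so that not every alert is treated as equally urgent.
- **Inhibition rules:** When a root-cause alert fires, automatically suppress the alerts derived from it.
- **Silences:** Temporarily mute alerts during maintenance windows.
- **On-call Rotation:** Rotate on-call duty so that someone can respond 24/7.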
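
The inhibition and silence practices can be sketched as a small filter over the set of firing alerts. This is a simplified model, not Alertmanager's implementation: the `Alert` class, the function names, and the alert names `NodeDown` and `ServiceUnreachable` are all assumptions made for the example. The `equal` parameter loosely mirrors the idea that a root-cause alert only suppresses derived alerts that share certain label values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)


def is_inhibited(alert, firing, source_name, target_names, equal):
    """True if a firing root-cause alert shares the `equal` labels with `alert`."""
    if alert.name not in target_names:
        return False
    return any(
        src.name == source_name
        and all(src.labels.get(k) == alert.labels.get(k) for k in equal)
        for src in firing
    )


def notifiable(firing, now, window, source_name, target_names, equal):
    """Drop everything inside the maintenance window, then drop inhibited alerts."""
    window_start, window_end = window
    if window_start <= now < window_end:
        return []
    return [
        a for a in firing
        if not is_inhibited(a, firing, source_name, target_names, equal)
    ]


firing = [
    Alert("NodeDown", {"instance": "a"}),
    Alert("ServiceUnreachable", {"instance": "a"}),  # derived from NodeDown on "a"
    Alert("ServiceUnreachable", {"instance": "b"}),  # unrelated host, must still notify
]
window = (
    datetime(2026, 4, 26, 2, 0, tzinfo=timezone.utc),
    datetime(2026, 4, 26, 4, 0, tzinfo=timezone.utc),
)
outside = datetime(2026, 4, 26, 8, 0, tzinfo=timezone.utc)
inside = datetime(2026, 4, 26, 3, 0, tzinfo=timezone.utc)

kept = notifiable(firing, outside, window, "NodeDown", {"ServiceUnreachable"}, ["instance"])
muted = notifiable(firing, inside, window, "NodeDown", {"ServiceUnreachable"}, ["instance"])
```

Outside the window, only the root-cause alert on host `a` and the unrelated alert on host `b` remain; inside the window, nothing is sent.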
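
**Golden rule for evaluating alerts:** Every alert must come with clear handling steps; an alert that nobody can act on immediately should be suppressed or downgraded.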
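
One way to enforce this rule is a small audit over the alerting rules before they are deployed. The sketch below assumes rules are already loaded as dictionaries shaped like Prometheus alerting rules (`alert`, `labels`, `annotations`). Using a `runbook_url` annotation as the marker for "clear handling steps" is a common convention but an assumption here, and `audit_rules` is a hypothetical helper.

```python
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def audit_rules(rules):
    """Return (rule name, problem) pairs for rules that break the golden rule."""
    findings = []
    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        severity = rule.get("labels", {}).get("severity")
        if severity not in ALLOWED_SEVERITIES:
            findings.append((name, "missing or unknown severity"))
        # Assumption: a non-empty runbook_url annotation stands for documented handling steps.
        if not rule.get("annotations", {}).get("runbook_url"):
            findings.append((name, "no runbook_url annotation"))
    return findings


rules = [
    {
        "alert": "HighErrorRate",
        "labels": {"severity": "critical"},
        "annotations": {"runbook_url": "https://example.com/runbooks/high-error-rate"},
    },
    {
        "alert": "DiskAlmostFull",
        "labels": {"severity": "urgent"},  # not one of the three agreed levels
        "annotations": {},
    },
]

findings = audit_rules(rules)
```

Rules that show up in `findings` are candidates for either adding handling steps or being downgraded or suppressed.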

---

## Prometheus Alert Management Architecture

```
Prometheus (rule evaluation) → Alertmanager (inhibition/grouping/routing) → Notification channels (email/Slack/PagerDuty/phone)
```
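
The grouping and routing stage in the middle of this pipeline can be sketched as follows. This is a deliberately flat simplification: Alertmanager itself uses a nested routing tree, whereas here the first matching entry wins. The route table, receiver names, and function names are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical route table: (label matchers, receiver name). First match wins.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]
DEFAULT_RECEIVER = "email"


def pick_receiver(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return default


def group_alerts(alerts, group_by=("alertname",)):
    """Batch alerts that share the same values for the `group_by` labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(k, "") for k in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "a"},
    {"alertname": "HighErrorRate", "severity": "critical", "instance": "b"},
    {"alertname": "DiskAlmostFull", "severity": "warning", "instance": "a"},
    {"alertname": "CertExpiringSoon", "severity": "info", "instance": "c"},
]

groups = group_alerts(alerts)
receivers = {key[0]: pick_receiver(members[0]) for key, members in groups.items()}
```

Grouping first means the two `HighErrorRate` alerts produce one notification instead of two, which is the main lever against alert storms.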

---

## Related Sources
- [[家庭监控方案-prometheus-grafana-node-exporter-cadvisor-blackbox]]
- [[ctp-topic-8-implementation-of-cloud-monitoring-using-micro-focus-operations-brid]]