---
title: "Prometheus Alerting Rules"
type: concept
aliases: [Prometheus Alert Rules, Prometheus告警规则YAML, alert_rules]
tags: [prometheus, alerting, monitoring, devops]
date: 2025-11-11
---

# Prometheus Alerting Rules

## Overview

Prometheus alerting rules (Alert Rules) are alert conditions defined in YAML and evaluated as PromQL expressions. When an expression returns a result and keeps returning it for at least the duration given by `for`, the alert moves from the `pending` state to the `firing` state, and Prometheus sends it to Alertmanager for routing and delivery.

## Rule Format

```yaml
groups:
  - name:              # Group name (must be unique within the rule file)
    interval:          # Evaluation interval (optional; defaults to the global evaluation_interval)
    rules:
      - alert:         # Alert name (exposed to Alertmanager as the alertname label)
        expr:          # PromQL expression for the trigger condition
        for:           # Duration the condition must hold before the alert becomes firing
        labels:        # Labels (used by Alertmanager for routing and classification)
          severity:    # e.g. critical / warning / info
        annotations:   # Annotations (human-readable alert details)
          summary:     # Short summary
          description: # Detailed description; supports template variables
```

## Template Variables (annotation templates)

Annotations such as `description` can use template variables, including `$labels` (the labels of the alerting series) and `$value` (the value of the expression):

```yaml
annotations:
  description: "Host {{ $labels.instance }} CPU usage is above 85% (current value: {{ $value }}%)"
```

## Home Server Alert Rules (complete alerts.yml example)

```yaml
groups:
  - name: system-alerts
    rules:
      - alert: HostHighCPU
        # Total usage is 100% minus idle time; mode="user" alone ignores system and iowait time.
        # Aggregating by instance keeps the instance label available to templates.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage above 85% (for 2 minutes)"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space remaining"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Available memory below 15%"

      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"

      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```

## Alert Lifecycle

```
Inactive (normal) → Pending (condition met, `for` timer running) → Firing (triggered, sent to Alertmanager)
```

## Prometheus Configuration

```yaml
# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

global:
  evaluation_interval: 30s   # Alerting rule evaluation interval
```

## Related Entities

- [[Prometheus]]: host of the alerting engine
- [[Alertmanager]]: alert reception and delivery

## Related Concepts

- [[PromQL]]: query language for alert conditions
- [[Alertmanager]]: alert delivery mechanism
- [[System Monitoring]]: upstream application domain
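
## Lifecycle Sketch

The Inactive → Pending → Firing lifecycle described in this note can be illustrated with a small simulation. This is a minimal sketch, not Prometheus code: the function `simulate_alert` and its parameters are hypothetical names chosen for illustration, and it assumes a simplified model in which an alert drops straight back to inactive as soon as the expression stops matching.

```python
from datetime import timedelta


def simulate_alert(results, interval, hold):
    """Return the alert state after each rule evaluation.

    results  -- booleans, True when the rule expression matched at that evaluation
    interval -- time between evaluations (evaluation_interval)
    hold     -- the rule's `for` duration
    """
    states = []
    active_since = None  # evaluation time at which the condition first matched
    now = timedelta(0)
    for matched in results:
        if not matched:
            # Condition no longer holds: the `for` timer is discarded.
            active_since = None
            states.append("inactive")
        else:
            if active_since is None:
                active_since = now
            # The alert fires once the condition has held for the full duration.
            if now - active_since >= hold:
                states.append("firing")
            else:
                states.append("pending")
        now += interval
    return states


if __name__ == "__main__":
    # evaluation_interval: 30s and for: 2m, as in the HTTPProbeFailed rule
    results = [False, True, True, True, True, True, False]
    states = simulate_alert(results, timedelta(seconds=30), timedelta(minutes=2))
    for step, state in enumerate(states):
        print(f"t={step * 30:>3}s  {state}")
```

With a 30-second evaluation interval and `for: 2m`, the condition has to match on five consecutive evaluations before the alert fires, so a short blip that clears within two minutes never reaches Alertmanager.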