Auto-sync: 2026-04-22 04:02
116
wiki/concepts/Prometheus告警规则.md
Normal file
@@ -0,0 +1,116 @@
---
title: "Prometheus Alert Rules"
type: concept
aliases: [Prometheus告警规则, Prometheus告警规则YAML, alert_rules]
tags: [prometheus, alerting, monitoring, devops]
date: 2025-11-11
---

# Prometheus Alert Rules
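
## Overview

Prometheus alert rules are alert conditions defined in YAML. Each rule uses a PromQL expression to decide whether a metric is in an alerting state. When the expression holds continuously for at least the duration specified by `for`, the alert moves from the `pending` state to the `firing` state, and Prometheus sends it to Alertmanager for routing and delivery.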

## Rule Format

```yaml
groups:
  - name: <group_name>          # Group name (must be unique within a file)
    interval: <duration>        # Evaluation interval (optional; defaults to global.evaluation_interval)
    rules:
      - alert: <alert_name>     # Alert name (becomes the `alertname` label)
        expr: <promql_expr>     # PromQL expression that defines the alert condition
        for: <duration>         # Minimum time the condition must hold before the alert becomes firing
        labels:                 # Labels (used by Alertmanager for routing and classification)
          severity: <level>     # e.g. critical / warning / info
        annotations:            # Annotations (human-readable alert information)
          summary: <text>       # Short summary
          description: <text>   # Detailed description; supports template variables
```
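
The `interval` and `for` fields take Prometheus duration strings such as `30s`, `2m`, or `1h30m`. As a minimal sketch of how such strings map to seconds, the hypothetical helper `parse_duration` below handles the documented unit suffixes (`ms`, `s`, `m`, `h`, `d`, `w`, `y`). It is a Python illustration, not Prometheus's own parser, and it does not enforce the largest-unit-first ordering that Prometheus requires.

```python
import re

# Unit sizes in seconds; a week is counted as 7 days and a year as 365 days.
_UNIT_SECONDS = {
    "ms": 0.001, "s": 1, "m": 60, "h": 3600,
    "d": 86400, "w": 604800, "y": 31536000,
}
# "ms" is listed before "m" and "s" so that "500ms" is not read as minutes.
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text: str) -> float:
    """Convert a duration string such as '2m' or '1h30m' to seconds."""
    if text == "0":
        return 0.0
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


print(parse_duration("2m"))     # 120.0
print(parse_duration("1h30m"))  # 5400.0
```

Converting both `for` and the evaluation interval to seconds makes it easy to see how many evaluations fall inside the waiting period, for example four evaluations for `for: 2m` at a 30-second interval.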

## Template Variables (annotation templates)

Annotation values such as `description` can use template variables, including `$labels` and `$value`:

```yaml
annotations:
  description: "CPU usage on host {{ $labels.instance }} is above 85% (current value: {{ $value }}%)"
```
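
To make the substitution concrete, here is a small Python imitation of what happens to such a template at alert time. Prometheus itself renders annotations with Go's template engine; the function `render_annotation`, its regular expression, and the sample label and value are all assumptions made for illustration, and the sketch covers only the `$labels.<name>` and `$value` placeholders.

```python
import re

# Matches {{ $labels.<name> }} and {{ $value }}, with optional inner spaces.
_PLACEHOLDER = re.compile(r"\{\{\s*\$(labels\.(\w+)|value)\s*\}\}")


def render_annotation(template: str, labels: dict, value: float) -> str:
    """Replace $labels.<name> and $value placeholders in an annotation string."""
    def substitute(match):
        if match.group(1) == "value":
            return str(value)
        # In this sketch an unknown label renders as an empty string.
        return labels.get(match.group(2), "")
    return _PLACEHOLDER.sub(substitute, template)


text = render_annotation(
    "CPU usage on host {{ $labels.instance }} is above 85% (current value: {{ $value }}%)",
    {"instance": "homeserver:9100"},  # hypothetical label set
    91.3,                             # hypothetical sample value
)
print(text)
# CPU usage on host homeserver:9100 is above 85% (current value: 91.3%)
```

`$labels` holds the label set of the series that triggered the alert, and `$value` holds the value the expression evaluated to for that series.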

## Home Server Alert Rules (complete alerts.yml example)

```yaml
groups:
  - name: system-alerts
    rules:

      - alert: HostHighCPU
        # Busy CPU percentage per instance, derived from the idle-mode rate
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage has been above 85% for 2 minutes"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% of disk space remaining"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Available memory is below 15%"

      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"

      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```
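
The disk, memory, and certificate thresholds in the rules above reduce to simple arithmetic on gauge values. The sketch below mirrors those comparisons in Python so the numbers are easier to check by hand. The helper names and all sample values are hypothetical and do not come from a real host.

```python
SECONDS_PER_DAY = 86400


def is_below_ratio(available: float, total: float, threshold: float) -> bool:
    """Mirror `available / total < threshold` (HostLowDisk, HostLowMemory)."""
    return available / total < threshold


def cert_expires_within(expiry_unix: float, now_unix: float, days: int) -> bool:
    """Mirror `probe_ssl_earliest_cert_expiry - time() < 86400 * days`."""
    return expiry_unix - now_unix < SECONDS_PER_DAY * days


GIB = 1024 ** 3
now = 1_700_000_000  # arbitrary Unix timestamp used as "current time"

print(is_below_ratio(40 * GIB, 500 * GIB, 0.10))  # True: 8% of disk is free
print(is_below_ratio(4 * GIB, 16 * GIB, 0.15))    # False: 25% of memory is available
print(cert_expires_within(now + 10 * SECONDS_PER_DAY, now, 14))  # True: 10 days left
```

The certificate threshold `86400 * 14` is 1,209,600 seconds, that is, 14 days expressed in the same unit as the Unix timestamps being subtracted.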

## Alert Lifecycle

```
Inactive (normal) → Pending (awaiting confirmation, for timer running) → Firing (triggered, sent to Alertmanager)
```
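
The transitions can be sketched as a small state machine. The Python below is a simplified model under stated assumptions, not Prometheus's implementation: the names `AlertState`, `evaluate`, and `simulate` are invented for illustration, one alert is tracked at a time, and the defaults mirror a 30-second evaluation interval with `for: 2m`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    state: str = "inactive"
    active_since: Optional[float] = None  # time the condition first became true


def evaluate(alert: AlertState, condition_true: bool, now: float, for_seconds: float) -> str:
    """Advance one alert by a single rule evaluation and return its new state."""
    if not condition_true:
        # A false evaluation resets the alert, including the `for` timer.
        alert.state, alert.active_since = "inactive", None
        return alert.state
    if alert.active_since is None:
        alert.active_since = now
    held = now - alert.active_since
    alert.state = "firing" if held >= for_seconds else "pending"
    return alert.state


def simulate(samples, interval=30.0, for_seconds=120.0):
    """Evaluate a sequence of condition results taken `interval` seconds apart."""
    alert = AlertState()
    return [evaluate(alert, ok, i * interval, for_seconds) for i, ok in enumerate(samples)]


print(simulate([False, True, True, True, True, True, False]))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

In this model the alert stays pending for four evaluations and fires on the fifth, once the condition has held for the full 120 seconds; a single false evaluation returns it to inactive and restarts the timer.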

## Prometheus Configuration

```yaml
# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

global:
  evaluation_interval: 30s    # How often alert rules are evaluated
```

## Related Entities

- [[Prometheus]]: hosts the alerting rule engine
- [[Alertmanager]]: receives and dispatches alerts

## Related Concepts

- [[PromQL]]: query language for alert conditions
- [[Alertmanager]]: alert dispatch mechanism
- [[System Monitoring]]: parent application domain