---
title: "Prometheus Alert Rules"
type: concept
aliases: [Prometheus Alert Rules, Prometheus告警规则YAML, alert_rules]
tags: [prometheus, alerting, monitoring, devops]
date: 2025-11-11
---

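# Prometheus Alert Rules

## Overview
Prometheus alert rules are alert conditions defined in YAML and expressed as PromQL queries over metric state. On each evaluation, every series returned by the expression becomes an active alert. If `for` is set, the alert stays in `pending` until the expression has kept returning that series for the full `for` duration (a time span, not a number of evaluation cycles), and then moves to `firing`. Without `for`, the alert fires on the first evaluation that matches. Firing alerts are sent to Alertmanager, which handles routing and notification.
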
## Rule Format
```yaml
groups:
  - name: <group_name>          # Group name (must be unique within a rule file)
    interval: <duration>        # Evaluation interval (optional; defaults to global.evaluation_interval)
    rules:
      - alert: <alert_name>     # Alert name (becomes the alertname label; not required to be unique)
        expr: <promql_expr>     # PromQL expression that defines the trigger condition
        for: <duration>         # How long the condition must hold before the alert goes from pending to firing
        labels:                 # Extra labels (used by Alertmanager for routing and classification)
          severity: <level>     # e.g. critical / warning / info
        annotations:            # Human-readable alert information
          summary: <text>       # Short summary
          description: <text>   # Detailed description; supports template variables
```

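## Template Variables (Annotation Templates)
Annotation values such as `description` can use template variables, including `$labels` (the labels of the alerting series) and `$value` (the value of the expression at evaluation time):
```yaml
annotations:
  description: "Host {{ $labels.instance }} CPU usage is above 85% (current value: {{ $value }}%)"
```
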
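The raw `$value` is an unformatted float, so annotations often read better when passed through a template function. The sketch below is illustrative rather than part of the original rule set: the group name is hypothetical, the expressions are borrowed from the `alerts.yml` example in this note, and it assumes node_exporter's `mountpoint` label is present on the filesystem series. `humanizePercentage` and `humanizeDuration` are Prometheus template functions.

```yaml
groups:
  - name: template-function-examples   # hypothetical group name, for illustration only
    rules:
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          # $value is the free-space ratio (0 to 1); humanizePercentage renders it as a percentage
          description: "{{ $labels.mountpoint }} on {{ $labels.instance }} has {{ $value | humanizePercentage }} free"

      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          # $value is the remaining certificate lifetime in seconds; humanizeDuration renders it in days/hours
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
```

Because the comparison operators keep the left-hand value for matching series, `$value` here is the ratio or the remaining seconds, not a boolean.
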
## Home Server Alert Rules (Complete alerts.yml Example)
```yaml
groups:
  - name: system-alerts
    rules:

      - alert: HostHighCPU
        # Total CPU usage = 100 - idle%; grouped by instance so $labels.instance is available
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage has been above 85% for 2 minutes"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space remaining"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Available memory is below 15%"

      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"

      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```

## Alert Lifecycle
```
Inactive (normal) → Pending (condition met, `for` timer running) → Firing (triggered, sent to Alertmanager)
```

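The pending-to-firing transition can be checked offline with a rule unit test run through `promtool test rules`. The following is a minimal sketch, not a definitive test suite: the file name `alerts_test.yml`, the probe target `https://example.com`, and the `job` label are assumptions, and the expected annotation strings assume the `HTTPProbeFailed` rule (with `for: 2m`) renders exactly this text; adjust them to match the annotations in your own rule file.

```yaml
# alerts_test.yml (hypothetical file name), placed next to alerts.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A probe that fails from t=0 onward, one sample per minute
      - series: 'probe_success{instance="https://example.com", job="blackbox"}'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      # 1m in: the condition holds but `for: 2m` has not elapsed, so the alert is pending, not firing
      - eval_time: 1m
        alertname: HTTPProbeFailed
        exp_alerts: []
      # 3m in: the condition has held for longer than 2m, so the alert is firing
      - eval_time: 3m
        alertname: HTTPProbeFailed
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: https://example.com
              job: blackbox
            exp_annotations:
              # Assumed annotation text; must equal what the rule file renders
              summary: "Site unreachable"
              description: "HTTP probe failed: https://example.com"
```

Running `promtool test rules alerts_test.yml` evaluates the rules against the synthetic series and reports any mismatch between expected and actual firing alerts.
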
## Prometheus Configuration
```yaml
# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

global:
  evaluation_interval: 30s   # How often alert rules are evaluated
```

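On the receiving side, the `severity` label set by the rules is what Alertmanager can route on. The fragment below is a rough sketch of an `alertmanager.yml`, assuming a recent Alertmanager that supports the `matchers` syntax; the receiver names `default` and `critical-notify` are hypothetical, and the receivers are left without notification integrations, which would need to be filled in for real use.

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical)
route:
  receiver: default                  # fallback for anything not matched below
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"        # matches HostLowDisk and HTTPProbeFailed in this note
      receiver: critical-notify

receivers:
  - name: default                    # no integrations configured: alerts are accepted but not delivered
  - name: critical-notify            # add an email/webhook/etc. config here
```
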
## Related Entities
- [[Prometheus]]: hosts the alerting engine
- [[Alertmanager]]: receives and dispatches alerts

## Related Concepts
- [[PromQL]]: query language for alert conditions
- [[Alertmanager]]: alert dispatch mechanism
- [[System Monitoring]]: upstream application domain