---
title: "Prometheus Alert Rules"
type: concept
aliases: [Prometheus Alert Rules, Prometheus告警规则, Prometheus告警规则YAML, alert_rules]
tags: [prometheus, alerting, monitoring, devops]
date: 2025-11-11
---
# Prometheus Alert Rules
## Overview
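Prometheus alert rules are alert conditions defined in YAML and expressed as PromQL queries over metrics. When an expression returns a result, the corresponding alert enters the `pending` state. Once the condition has held continuously for the duration given by `for` (a time duration, not a number of evaluation cycles), the alert moves to `firing` and Prometheus sends it to Alertmanager for routing and delivery.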
## Rule Format
```yaml
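groups:
  - name: <group_name>          # Group name (must be unique within a rule file)
    interval: <duration>        # Evaluation interval (optional; defaults to global evaluation_interval)
    rules:
      - alert: <alert_name>     # Alert name (passed to Alertmanager as the alertname label)
        expr: <promql_expr>     # PromQL expression for the alert condition
        for: <duration>         # How long the condition must hold before the alert becomes firing
        labels:                 # Labels (used by Alertmanager for routing and classification)
          severity: <level>     # e.g. critical / warning / info
        annotations:            # Annotations (human-readable alert text)
          summary: <text>       # Short summary
          description: <text>   # Detailed description; supports template variables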
```
## Template Variables (Annotation Templates)
Annotations such as `description` can use template variables, including `$labels` and `$value`:
```yaml
annotations:
  description: "Host {{ $labels.instance }} CPU usage is above 85% (current value: {{ $value }}%)"
```
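Raw `$value` output is often hard to read. As a minimal sketch, the annotations below apply Prometheus's built-in template functions `humanizePercentage` and `humanizeDuration` to two expressions used on this page. It assumes the filesystem series carry a `mountpoint` label (as node_exporter's normally do); the group name and annotation wording are illustrative only.

```yaml
groups:
  - name: template-demo   # hypothetical group name
    rules:
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        annotations:
          # $value is the free-space ratio (0 to 1); humanizePercentage renders it as a percentage
          description: "{{ $labels.instance }} {{ $labels.mountpoint }}: {{ $value | humanizePercentage }} free"
      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        annotations:
          # $value is the remaining certificate lifetime in seconds
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
```

In both rules `$value` is the left-hand side of the comparison, because a PromQL comparison filters series and keeps their original values.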
## Home Server Alert Rules (Complete alerts.yml Example)
```yaml
groups:
  - name: system-alerts
    rules:
      - alert: HostHighCPU
        # Counts user-mode CPU time only, averaged over all cores and instances
        expr: avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage has been above 85% for 2 minutes"
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space remaining"
      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Available memory is below 15%"
      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"
      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```
## Alert Lifecycle
```
Inactive (normal) → Pending (awaiting confirmation, for timer running) → Firing (triggered, sent to Alertmanager)
```
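To make the transitions concrete, here is a sketch of one rule annotated with a timeline. It assumes an `evaluation_interval` of 30s and a probe that starts failing just before the evaluation at t=0; the group name is hypothetical.

```yaml
groups:
  - name: lifecycle-demo   # hypothetical group name
    rules:
      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        # Timeline, assuming evaluation_interval: 30s
        #   t=0s     expr returns a series       -> Pending (for timer starts)
        #   t=30s    expr still returns it       -> Pending
        #   t=60s    expr still returns it       -> Pending
        #   t=90s    expr still returns it       -> Pending
        #   t=120s   condition has held for 2m   -> Firing (sent to Alertmanager)
        # If expr returns nothing at any evaluation before t=120s,
        # the alert drops back to Inactive and the timer starts over.
```

Because state only changes at evaluation time, the effective delay before firing is `for` rounded up to the next evaluation.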
## Prometheus Configuration
```yaml
# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
global:
  evaluation_interval: 30s   # How often alert rules are evaluated
```
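On the receiving side, the `severity` label set in the rules is what Alertmanager can route on. The following is a minimal `alertmanager.yml` sketch, assuming Alertmanager 0.22 or later (for the `matchers` syntax). The receiver names `default` and `pager` are hypothetical and have no notification integrations configured, so they would need something like `email_configs` or `webhook_configs` before anything is delivered.

```yaml
# alertmanager.yml (illustrative sketch, not part of the original setup)
route:
  receiver: default                     # fallback receiver for anything not matched below
  group_by: ['alertname', 'instance']   # batch alerts that share these labels
  routes:
    - matchers:
        - severity="critical"           # matches the label set in alerts.yml
      receiver: pager
receivers:
  - name: default                       # hypothetical receiver, no integrations yet
  - name: pager                         # hypothetical receiver, no integrations yet
```

With this layout, `warning` alerts fall through to `default` while `critical` alerts go to `pager`.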
## Related Entities
- [[Prometheus]]: hosts the alert rule engine
- [[Alertmanager]]: receives and dispatches alerts
## Related Concepts
- [[PromQL]]: query language used for alert conditions
- [[Alertmanager]]: alert dispatch mechanism
- [[System Monitoring]]: the broader application domain