nexus/wiki/concepts/Prometheus告警规则.md
2026-04-22 04:03:04 +08:00


title: Prometheus Alerting Rules
type: concept
aliases: Prometheus Alert Rules, Prometheus告警规则YAML, alert_rules
tags: prometheus, alerting, monitoring, devops
date: 2025-11-11

Prometheus Alerting Rules

Overview

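Prometheus alerting rules (Alert Rules) are alert conditions defined in YAML and expressed as PromQL queries over metrics. When a rule's expression returns a result, the alert enters the pending state. Once the condition has held continuously for the duration given in the for field (a time span, not a count of evaluation cycles), the alert moves from pending to firing, and Prometheus sends it to Alertmanager for routing and delivery.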

Rule Format

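groups:
- name: <group_name>          # group name (must be unique within a rule file)
  interval: <duration>        # evaluation interval (optional; defaults to global evaluation_interval)
  rules:
  - alert: <alert_name>       # alert name (becomes the alertname label)
    expr: <promql_expr>       # PromQL expression that defines the trigger condition
    for: <duration>           # how long the condition must hold before the alert becomes firing
    labels:                   # labels (used by Alertmanager for routing and grouping)
      severity: <level>       # e.g. critical / warning / info
    annotations:              # annotations (human-readable alert details)
      summary: <text>         # short summary
      description: <text>     # detailed description; supports template variables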

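Template Variables (Annotation Templates)

Annotations such as description can use template variables, including $labels (the label set of the alerting series) and $value (the value of the expression at evaluation time):

annotations:
  description: "Host {{ $labels.instance }} CPU usage is above 85% (current value: {{ $value }}%)"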
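The raw $value is an unformatted float, so it often renders with many decimal places. A minimal sketch of formatting it, assuming the expression yields a percentage between 0 and 100; printf is a standard Go template function available in Prometheus templates, and the wording and threshold are illustrative only:

```yaml
annotations:
  # printf "%.1f" rounds the float to one decimal place before it is interpolated.
  # Single quotes around the YAML string let the template use double quotes inside.
  description: 'Host {{ $labels.instance }} CPU usage is above 85% (current value: {{ printf "%.1f" $value }}%)'
```

If the expression returns a ratio between 0 and 1 instead, the built-in humanizePercentage function ({{ $value | humanizePercentage }}) is probably the better fit.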

Home Server Alert Rules (complete alerts.yml example)

groups:
- name: system-alerts
  rules:

  - alert: HostHighCPU
    expr: avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 > 85
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage"
      description: "Host CPU usage has been above 85% for 2 minutes"

  - alert: HostLowDisk
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space"
      description: "Less than 10% of disk space remains"

  - alert: HostLowMemory
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage"
      description: "Available memory is below 15%"

  - alert: HTTPProbeFailed
    expr: probe_success == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Site unreachable"
      description: "HTTP probe failed: {{ $labels.instance }}"

  - alert: TLSCertExpiring
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "TLS certificate expiring soon"
      description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
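One caveat about HostHighCPU: its expression counts only user-mode CPU time and, because avg() has no by clause, averages across every core of every scraped instance, so the result carries no instance label. A possible per-host variant is sketched below. It assumes node_exporter metrics as in the rules above; the group and alert names are hypothetical and not part of the original file.

```yaml
groups:
- name: system-alerts-per-instance      # hypothetical group name
  rules:
  - alert: HostHighCPUPerInstance       # hypothetical alert name
    # The idle share is averaged over each instance's cores; 100 minus that share
    # is the overall utilization (user, system, iowait, ...) in percent.
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: 'CPU usage has been above 85% for 2 minutes (current value: {{ printf "%.1f" $value }}%)'
```

Because the aggregation keeps the instance label, each host gets its own alert, and $labels.instance resolves in the annotations.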

Alert Lifecycle

Inactive (normal) → Pending (condition met, for timer running) → Firing (triggered, sent to Alertmanager)
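These transitions can be checked offline with promtool's rule unit tests. The sketch below exercises HTTPProbeFailed (for: 2m) against a synthetic series. The test file name, the example.com target, and the job label are assumptions. The exp_annotations strings are placeholders that must match the rendered annotations in the rule file under test exactly.

```yaml
# alerts_test.yml (hypothetical name); run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One sample per minute: the probe succeeds at 0m and 1m, then fails from 2m on.
      - series: 'probe_success{instance="https://example.com", job="blackbox"}'
        values: '1 1 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 1m            # probe still succeeding: Inactive
        alertname: HTTPProbeFailed
        exp_alerts: []
      - eval_time: 3m            # failing for only 1m: Pending, so nothing is firing yet
        alertname: HTTPProbeFailed
        exp_alerts: []
      - eval_time: 5m            # failing for longer than the 2m for duration: Firing
        alertname: HTTPProbeFailed
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: https://example.com
              job: blackbox
            exp_annotations:     # placeholders; copy the exact text from the rule file
              summary: "Site unreachable"
              description: "HTTP probe failed: https://example.com"
```

promtool compares expectations only against firing alerts, which is why the pending step expects an empty list.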

Prometheus Configuration

# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

global:
  evaluation_interval: 30s    # how often alerting rules are evaluated
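The rule_files entries are file path globs, so rules can also be split across several files instead of one alerts.yml. A minimal sketch, assuming a hypothetical /etc/prometheus/rules/ directory:

```yaml
# prometheus.yml (fragment)
rule_files:
  # Hypothetical directory; every file matching the glob is loaded as a rule file.
  - "/etc/prometheus/rules/*.yml"
```

Rule files are read at startup and on configuration reload, so a newly added file takes effect only after Prometheus is reloaded.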