Update nexus: fix conflicts and sync local changes

Shen Wei
2026-04-26 12:06:50 +08:00
parent 191797c01b
commit f09834b5a5
2443 changed files with 254323 additions and 255154 deletions

@@ -1,116 +1,116 @@
---
title: "Prometheus Alert Rules"
type: concept
aliases: [Prometheus Alert Rules, Prometheus告警规则YAML, alert_rules]
tags: [prometheus, alerting, monitoring, devops]
date: 2025-11-11
---
# Prometheus Alert Rules
## Overview
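Prometheus alert rules are alert conditions defined in YAML that use PromQL expressions to evaluate the state of metrics. When an expression returns a result and keeps doing so for at least the duration given by `for`, the alert moves from `pending` to `firing` and is sent to Alertmanager for routing and delivery.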
## Rule Format
```yaml
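groups:
  - name: <group_name>          # group name (must be unique within a rule file)
    interval: <duration>        # evaluation interval (optional; defaults to global.evaluation_interval)
    rules:
      - alert: <alert_name>     # alert name (passed to Alertmanager as the alertname label)
        expr: <promql_expr>     # PromQL expression that triggers the alert
        for: <duration>         # how long the condition must hold before the alert becomes firing
        labels:                 # labels (used by Alertmanager for routing and grouping)
          severity: <level>     # e.g. critical / warning / info
        annotations:            # annotations (human-readable alert details)
          summary: <text>       # short summary
          description: <text>   # detailed description; supports template variables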
```
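As an illustration of the format above, here is a minimal filled-in rule. This is a sketch only: the group name, alert name, and thresholds are hypothetical and not taken from the home server configuration; `up` is the metric Prometheus records for every scrape target.

```yaml
groups:
  - name: example-group              # hypothetical group name
    interval: 1m                     # overrides global.evaluation_interval for this group only
    rules:
      - alert: InstanceDown          # hypothetical alert name
        expr: up == 0                # the target failed its most recent scrape
        for: 5m                      # must stay down for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} (job {{ $labels.job }}) has been unreachable for more than 5 minutes."
```

Omitting `interval` would make the group fall back to the global evaluation interval.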
## Template Variables (Annotation Templates)
Annotations such as `description` can use template variables such as `$labels` and `$value`:
```yaml
annotations:
  description: "Host {{ $labels.instance }} CPU usage is above 85% (current value: {{ $value }}%)"
```
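Raw `$value` output is often hard to read (a ratio such as 0.08, or a number of seconds). The sketch below shows how Prometheus's built-in template functions `humanizePercentage` and `humanizeDuration` could be applied to two of the expressions used later in this note. The annotation wording is illustrative, and the `mountpoint` label is assumed to be present, as it normally is on node_exporter filesystem metrics.

```yaml
rules:
  - alert: HostLowDisk
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
    annotations:
      # $value is the available/size ratio; humanizePercentage renders it as a percentage
      description: 'Free space on {{ $labels.instance }} ({{ $labels.mountpoint }}): {{ $value | humanizePercentage }}'
  - alert: TLSCertExpiring
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
    annotations:
      # $value is the number of seconds until expiry; humanizeDuration renders it as days/hours/minutes
      description: 'Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}'
```

Single-quoted YAML strings are used here so that double quotes could appear inside a template without escaping.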
## Home Server Alert Rules (Complete alerts.yml Example)
```yaml
groups:
  - name: system-alerts
    rules:
      - alert: HostHighCPU
        expr: avg(rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage above 85% (for 2 minutes)"
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space remaining"
      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Available memory below 15%"
      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"
      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 14 days"
```
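One caveat about `HostHighCPU` as written: it counts only `mode="user"` time, and the bare `avg(...)` collapses all cores and all hosts into a single series, so `{{ $labels.instance }}` would be empty in its annotations. A possible per-host variant measuring total (non-idle) CPU is sketched below. The group and alert names are hypothetical, and this is an alternative to consider rather than the configuration actually in use.

```yaml
groups:
  - name: system-alerts-per-instance        # hypothetical group name
    rules:
      - alert: HostHighCPUPerInstance       # hypothetical alert name
        # 100 minus the idle percentage, averaged over the cores of each instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: 'CPU usage on {{ $labels.instance }} is {{ printf "%.1f" $value }}%'
```

Grouping with `by (instance)` keeps the `instance` label on the result, so each host raises its own alert.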
## Alert Lifecycle
```
Inactive (normal) → Pending (awaiting confirmation, `for` timer running) → Firing (triggered, sent to Alertmanager)
```
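The transitions above can be traced against evaluation times. The following sketch assumes `evaluation_interval: 30s` and uses a hypothetical demo rule; the timeline in the comments follows one target through the three states.

```yaml
groups:
  - name: lifecycle-demo             # hypothetical group name
    rules:
      - alert: ProbeFailedDemo       # hypothetical alert name
        expr: probe_success == 0
        for: 2m
# Timeline for one target, with evaluations every 30s:
#   t=0s    expr returns no series            -> inactive
#   t=30s   expr returns the series           -> pending (timer starts)
#   t=60s   still returned (30s elapsed)      -> pending
#   t=90s   still returned (60s elapsed)      -> pending
#   t=120s  still returned (90s elapsed)      -> pending
#   t=150s  still returned (120s >= for)      -> firing, sent to Alertmanager
#   t=180s  expr returns no series            -> resolved, back to inactive
# If the series disappears at any evaluation while pending, the timer resets.
```

Because state changes only at evaluation time, the actual delay before firing is `for` rounded up to the next evaluation.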
## Prometheus Configuration
```yaml
# prometheus.yml
rule_files:
  - "/etc/prometheus/alerts.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
global:
  evaluation_interval: 30s   # alert rule evaluation interval
```
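The `alerting` block only tells Prometheus where Alertmanager is; routing on labels such as `severity` happens in Alertmanager's own configuration file. A minimal sketch of that side follows, assuming hypothetical receiver names and a placeholder webhook URL that would need to be replaced with a real notification endpoint.

```yaml
# alertmanager.yml (sketch; receiver names and URL are placeholders)
route:
  receiver: default                  # fallback receiver for alerts that match no child route
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"        # matches the severity label set in the alert rules
      receiver: urgent
receivers:
  - name: default
  - name: urgent
    webhook_configs:
      - url: 'http://example.invalid/alert-hook'   # placeholder endpoint
```

With this layout, `critical` alerts such as `HostLowDisk` and `HTTPProbeFailed` would go to `urgent`, and everything else to `default`.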
## Related Entities
- [[Prometheus]]: hosts the alert rule engine
- [[Alertmanager]]: receives and dispatches alerts
## Related Concepts
- [[PromQL]]: query language for alert conditions
- [[Alertmanager]]: alert dispatch mechanism
- [[System Monitoring]]: upstream application domain