---
title: "Home Monitoring Stack: Prometheus + Grafana + Node Exporter + cAdvisor + Blackbox"
tags:
- "#prometheus"
- "#monitoring"
- "#grafana"
author:
- Wei Shen
created: 2025-11-11
published:
description: A Docker-deployable monitoring plan organized by priority and scenario, with deployment advice, key metrics, and alert examples
---
#prometheus #grafana #monitoring

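I've put together a Docker-deployable monitoring plan, organized by priority and scenario, along with deployment advice, key metrics to watch, alert examples, and a ready-to-copy `docker-compose` setup. The material is organized as checklists and step-by-step instructions so you can try it directly on a NAS or an Ubuntu Server.

# Core goals (the monitoring surface to cover)

1. Host layer: CPU, memory, disk, network, I/O, and inodes.
2. Container layer: running state, restart count, resource limits and usage, exit codes, and image versions.
3. Service (application) layer: HTTP(S) availability, response codes, latency, error rate, TLS certificate expiry, and DNS resolution failures.
4. Logs: application errors and exceptions, plus indexing of key business logs (full-text search is optional).
5. Compliance and visualization: centralized time-series storage, dashboards, and alerting/notification channels (email, Slack, phone, Teams).

![[IMG-20251229190624400.png]]
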
# Recommended tools (all can run in Docker)

Grouped by function, with what each tool is for and why it is recommended (official install/image docs are linked):

### Observability, time-series data, queries, and alerting

- **Prometheus (collection + alerting rules)**: scrapes metrics from exporters (node_exporter, cAdvisor, blackbox_exporter) and supports PromQL queries and alerting rules. A good fit as the primary time-series store and alerting engine. ([Prometheus](https://prometheus.io/?utm_source=chatgpt.com "Prometheus - Monitoring system & time series database"))

- **Alertmanager** (alert routing for Prometheus): inhibits and groups alerts, then sends them to email, Slack, webhooks, or PagerDuty.

### Visualization and log aggregation

- **Grafana**: dashboards and alerting on top of data sources such as Prometheus, VictoriaMetrics, and Loki. Supports dashboard templates and alert notifications. ([Grafana Labs](https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/?utm_source=chatgpt.com "Run Grafana Docker image | Grafana documentation"))

- **Grafana Loki + Promtail** (if you want log aggregation): lightweight and natively integrated with Grafana, a good fit for pulling application logs into the same UI.

### Host and container metrics (simple collection)

- **node_exporter** (host metrics; a Prometheus exporter)

- **cAdvisor** (container resource and performance metrics, scrapeable by Prometheus)

- **blackbox_exporter** (HTTP/HTTPS/TCP/ICMP probing of external and internal endpoints, used for synthetic monitoring)

### Synthetic, availability, and uptime checks (external and internal access)

- **Uptime Kuma**: a self-hosted, "Uptime Robot"-style tool that is easy to pick up. It runs synthetic availability probes (HTTP/TCP/DNS/TLS) against external or internal endpoints and includes history and notifications. Recommended for synthetic checks. ([uptimekuma.org](https://uptimekuma.org/install-uptime-kuma-docker/?utm_source=chatgpt.com "Install Uptime Kuma using Docker or Docker Compose"))

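### Lightweight single-host dashboard (recommended for a PoC)

- **Netdata**: a detailed, real-time host and container monitoring dashboard that works out of the box (default port 19999). Good for quickly diagnosing hotspots, and it can integrate with Prometheus for long-term storage. ([learn.netdata.cloud](https://learn.netdata.cloud/docs/netdata-agent/installation/docker?utm_source=chatgpt.com "Install Netdata with Docker"))

### Alternative time-series databases (optional, for larger scale)

- **VictoriaMetrics / Thanos / Cortex**: for when data volume is large or you want long-term storage with efficient ingestion. VictoriaMetrics is simple to configure and common in single-host or small-cluster setups.

### Management and operations view (container management)

- **Portainer**: a UI for managing Docker hosts and Swarm, with some monitoring and log features (not a replacement for Prometheus/Grafana, but handy for quick operational tasks).

---
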
# Recommended architecture

### Standard (common in production, suited to multiple hosts)

Purpose: long-term monitoring, alerting, and dashboards.
Components: Prometheus + node_exporter + cAdvisor + blackbox_exporter + Grafana + Alertmanager, with Loki (logs) and VictoriaMetrics (long-term storage) as options. Prometheus scrapes all host and container metrics, Grafana handles visualization, and Alertmanager handles notifications. ([Prometheus](https://prometheus.io/?utm_source=chatgpt.com "Prometheus - Monitoring system & time series database"))

---

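# Things you may not have considered but are worth doing (proactive suggestions)

1. **Pair coarse and fine-grained synthetic checks**: use Uptime Kuma for external/internal availability probes, and Prometheus blackbox_exporter for finer-grained HTTP/TLS/DNS probing (response codes, certificate validity, resolution latency).

2. **TLS certificate expiry alerts**: set an alert on remaining certificate days via blackbox_exporter, a dedicated Prometheus exporter, or Uptime Kuma.

3. **Monitor DNS resolution separately**: failed external access is often a DNS problem, so run a dedicated DNS probe (blackbox_exporter supports this).

4. **Tier short-term and long-term data**: use Netdata for short-term, high-resolution views, and Prometheus + VictoriaMetrics (via `remote_write`) for long-term aggregation.

5. **Automate onboarding of new hosts**: use Ansible or cloud-init to deploy node_exporter + cAdvisor + promtail (logs) on a new host and register it with Prometheus.

6. **Label containers for grouping and reporting**: make sure containers and services start with labels such as `service=xxx` and `env=prod`, which makes PromQL grouping and SLA reports easier.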
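
Items 2 and 3 depend on which blackbox_exporter modules are defined. The following is a minimal sketch of a module file, assuming it is mounted into the blackbox container and passed with `--config.file`. The file name, the module name `dns_example`, and the queried name `example.com` are placeholders, not part of the original setup.

```yaml
# blackbox.yml (hypothetical file name), passed to blackbox_exporter via --config.file
modules:
  # HTTP(S) probe; HTTPS targets also yield probe_ssl_earliest_cert_expiry,
  # which is what a certificate-expiry alert is built on.
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4

  # DNS probe; the Prometheus target for this module is the DNS server to query.
  dns_example:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"   # placeholder: the name you want resolved
      query_type: "A"
      valid_rcodes:
        - NOERROR
```

With a DNS module, the scrape target is the resolver's address (for instance a router or public resolver on port 53), so a failing probe points at name resolution rather than at the web service itself.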
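
---

# Recommended checks (can be written directly as PromQL / alert conditions)

Core metrics and alert suggestions (examples):

- Host disk: `node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10` → disk alert.

- CPU: 5-minute average CPU utilization > 85% (or adjusted for core count) → alert.

- Memory: `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15` → memory alert.

- Containers: alert on restarts. cAdvisor does not export a restart counter, so one approximation is `changes(container_start_time_seconds{name!=""}[1h]) > 0` (you may want to filter out restarts caused by restart-policy updates).

- HTTP: blackbox probe `probe_success == 0` for 3 consecutive probes → alert; `probe_duration_seconds` above a threshold → performance warning.

- TLS: fewer than 14 days of certificate validity remaining → alert.

(These can go straight into Prometheus alert rules, or be recreated as Grafana alerts.)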
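
As one illustration of turning a bullet above into a rule, here is a sketch of the probe-latency warning. The group name, alert name, and the 2-second threshold are assumptions chosen for the example; tune them to your own services.

```yaml
groups:
  - name: probe-latency            # hypothetical group name
    rules:
      - alert: HTTPProbeSlow       # hypothetical alert name
        # Average blackbox probe duration over the last 5 minutes.
        # 2s is an assumed threshold, not a recommendation from the exporter.
        expr: avg_over_time(probe_duration_seconds[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow HTTP probe"
          description: "{{ $labels.instance }} averaged more than 2s per probe over 5 minutes"
```

Averaging over a window before comparing keeps a single slow probe from firing the alert, and the `for` clause requires the condition to hold across several evaluations.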
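
---

# Security and operations notes (shortcuts and pitfalls)

- Minimize container privileges: avoid giving exporters more host access than they need (for example, Netdata needs `/proc`, `/sys`, and the Docker socket for full coverage). Be cautious about mounting the Docker socket: a container with access to it effectively has root on the host. ([learn.netdata.cloud](https://learn.netdata.cloud/docs/netdata-agent/installation/docker?utm_source=chatgpt.com "Install Netdata with Docker"))

- Network segmentation: put monitoring traffic and ports on a management VLAN, or restrict access with a firewall.

- Storage: Prometheus's local disk usage grows over time; for long retention, use remote storage or take periodic snapshots.

- Backups: export Grafana dashboards as JSON, and keep Prometheus rules and configuration in Git (GitOps).

- Certificates and reverse proxy: in production, terminate TLS at a reverse proxy (Caddy/Traefik/Nginx) and add Basic Auth or internal SSO.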
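
The storage and exposure points can be expressed directly in the Compose service definition. This is a sketch under assumptions: the 30-day retention and the loopback-only port binding are example values, not settings taken from this note.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Cap local retention so the data volume stops growing (30d is an assumed value).
      - '--storage.tsdb.retention.time=30d'
    ports:
      # Publish on loopback only; reach the UI through a reverse proxy or SSH tunnel.
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus

volumes:
  prometheus-data:
```

Binding to `127.0.0.1` keeps the unauthenticated Prometheus UI off the LAN, which fits the reverse-proxy approach in the last bullet.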

---

# Quick rollout steps (broken down into actionable steps)

1. On a test machine, bring up Netdata and Uptime Kuma with a small PoC Compose file and verify that both are reachable (ports 19999 and 3001).

2. Deploy the Prometheus Compose file and `prometheus.yml` to the main monitoring node, and configure `scrape_configs` to scrape node_exporter, cAdvisor, and blackbox.

3. On each host, deploy node_exporter with docker-compose or Ansible (optionally in host network mode) and add it as a target in Prometheus.

4. In Grafana, import a few prebuilt dashboards (node_exporter, cAdvisor, and blackbox panels), and configure alert routing in Alertmanager (Slack/email).

5. Create all Uptime Kuma monitors (internal and public domains) and set up notification channels (email, a webhook pointing at Alertmanager/Grafana, or Slack directly).
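
For step 3, a standalone Compose file for an additional host might look like the sketch below. It follows the pattern from the node_exporter README (host network, host PID namespace, root filesystem mounted read-only); the restart policy is an assumption.

```yaml
# docker-compose.yml on each extra host (sketch)
services:
  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    restart: unless-stopped        # assumed policy
    network_mode: host             # metrics are served on the host's port 9100
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      # Read host metrics from the mounted root filesystem instead of the container's own.
      - '--path.rootfs=/host'
```

Once the exporter is running, the host is registered by adding its `<ip>:9100` address to the targets file that Prometheus reads for the node job.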

---

# References (official sites and docs; the main URLs are listed here for further reading and deployment)

- Prometheus official site: [https://prometheus.io/](https://prometheus.io/)

- Prometheus Docker image (Docker Hub): [https://hub.docker.com/r/prom/prometheus](https://hub.docker.com/r/prom/prometheus)

- Grafana Docker installation docs: [https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/](https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/)

---

# Implementing option **A: a complete `docker-compose` + `prometheus.yml` + alert rules**

The goal:
Bring up a complete monitoring stack at home (NAS / Ubuntu Server) with a single Docker command.
**Coverage**:

- Host and container metrics (`node_exporter`, `cAdvisor`)

- Health probes for internal and external domains (`blackbox_exporter`)

- Metric collection and storage (`prometheus`)

- Visualization (`grafana`)

- Alert routing (`alertmanager`, optionally wired to email/Slack)

---

## 🧱 1. Suggested directory layout

Create a directory `/opt/monitoring/` on the host (any path works):

```
/opt/monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   ├── alerts.yml
│   └── targets/
│       ├── node.yml
│       ├── cadvisor.yml
│       └── blackbox.yml
├── alertmanager/
│   └── config.yml
└── grafana/
```

---

## 🐳 2. `docker-compose.yml`

This Compose file covers the core components and is meant to run as-is.

```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
      - ./prometheus/targets:/etc/prometheus/targets:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Enables config reload via HTTP POST to /-/reload
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    depends_on:
      - cadvisor
      - node_exporter
      - blackbox

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    ports:
      - "3000:3000"
    environment:
      # Demo defaults: change the admin password and reconsider anonymous
      # access before exposing Grafana beyond your LAN.
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_NAME=Main Org
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
      - GF_SECURITY_ALLOW_EMBEDDING=true
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: always
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml:ro
    command:
      - '--config.file=/etc/alertmanager/config.yml'

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    restart: always
    # Host networking: metrics are served on the host's port 9100
    network_mode: "host"
    pid: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      # Point node_exporter at the mounted host paths; without these flags
      # the mounts above are not used.
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: always
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox
    restart: always
    ports:
      - "9115:9115"

volumes:
  prometheus-data:
  grafana-storage:
```

---

## ⚙️ 3. Prometheus configuration file `prometheus.yml`

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 30s

rule_files:
  - "/etc/prometheus/alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/node.yml

  - job_name: 'cadvisor'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/cadvisor.yml

  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/blackbox.yml
    relabel_configs:
      # Pass the listed URL to blackbox_exporter as ?target=<url>
      - source_labels: [__address__]
        target_label: __param_target
      # Scrape the exporter itself rather than the probed URL
      - target_label: __address__
        replacement: blackbox:9115
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
```

---

## 🗂 4. Example targets files

### `targets/node.yml`

```yaml
- targets:
    - "192.168.3.47:9100"
  labels:
    env: home
    role: server
```

### `targets/cadvisor.yml`

```yaml
- targets:
    - "cadvisor:8080"
  labels:
    env: home
    role: docker
```

### `targets/blackbox.yml`

```yaml
- targets:
    - "https://pq2435887bh.vicp.fun"
    - "http://shenwei-nas.vip.cpolar.cn"
    - "https://transmission.vip.cpolar.cn"
  labels:
    env: home
    type: website
```

---

## 🚨 5. Prometheus alert rules `alerts.yml`

```yaml
groups:
  - name: system-alerts
    rules:

      - alert: HostHighCPU
        # Utilization = 100% minus idle time, so all non-idle modes are
        # counted (not only "user"), computed per host.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Host CPU usage has been above 85% for 2 minutes"

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space remaining"

      - alert: HostLowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Less than 15% memory available"

      - alert: ContainerNotSeen
        # container_last_seen is a timestamp gauge, so compare it with time()
        # instead of using increase(). Whether this fires depends on how long
        # cAdvisor keeps reporting a stopped container; 60s is an assumed threshold.
        expr: time() - container_last_seen{name!=""} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container exited or stopped reporting"
          description: "Container {{ $labels.name }} has not been seen by cAdvisor recently and may have exited"

      - alert: HTTPProbeFailed
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site unreachable"
          description: "HTTP probe failed: {{ $labels.instance }}"

      - alert: TLSCertExpiring
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon"
          description: "The certificate for {{ $labels.instance }} expires in less than 14 days"
```

---

## 📧 6. Alertmanager configuration `config.yml`

```yaml
global:
  resolve_timeout: 5m

route:
  receiver: default
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: default
    email_configs:
      - to: "youremail@example.com"
        from: "monitor@example.com"
        smarthost: "smtp.example.com:587"
        auth_username: "monitor@example.com"
        auth_password: "yourpassword"
```

> 💡 You can switch to Slack, Teams, Telegram, or other channels by replacing this with the corresponding `*_configs` block.
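
As a sketch of the Slack variant, the fragment below adds a child route and a second receiver to the existing configuration. The webhook URL, channel, and receiver name are placeholders, and sending only `severity="critical"` alerts to Slack is an assumed policy.

```yaml
# Fragment to merge into alertmanager/config.yml (sketch)
route:
  receiver: default
  routes:
    # Critical alerts go to Slack; everything else falls through to `default`.
    - matchers:
        - severity="critical"
      receiver: slack-critical     # hypothetical receiver name

receivers:
  # ... keep the existing `default` email receiver as-is ...
  - name: slack-critical
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder webhook
        channel: "#alerts"                                         # placeholder channel
        send_resolved: true
```

The `severity` label matched here is the one set in the `labels` section of each alert rule, so routing stays in sync with the rules file.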

---

## 🧩 7. Quick dashboard import in Grafana

After startup, open `http://localhost:3000` (log in with admin/admin):

- Add the Prometheus data source: `http://prometheus:9090`

- Import dashboards from grafana.com:

    - Node Exporter Full: `1860`

    - cAdvisor Container Metrics: `14282`

    - Blackbox Exporter Probe: `7587`

Enter these IDs in the Grafana UI under Dashboards → Import.
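
The data source can also be provisioned from a file rather than added by hand. This is a minimal sketch, assuming the file is mounted into the Grafana container at `/etc/grafana/provisioning/datasources/prometheus.yml` (for example from a `./grafana/provisioning` directory, which is a hypothetical path not created earlier in this note).

```yaml
# grafana/provisioning/datasources/prometheus.yml (hypothetical location)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                    # Grafana's backend makes the requests
    url: http://prometheus:9090      # service name on the Compose network
    isDefault: true
```

Provisioning from a file means the data source survives a recreated Grafana volume and can live in Git alongside the rest of the configuration.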

---

## 🚀 8. Start the stack

```bash
cd /opt/monitoring
docker compose up -d
```

Verify access:

- Prometheus: [http://192.168.3.47:9090](http://192.168.3.47:9090/)

- Grafana: [http://192.168.3.47:3000](http://192.168.3.47:3000/)

- Alertmanager: [http://192.168.3.47:9093](http://192.168.3.47:9093/)

- cAdvisor: [http://192.168.3.47:8080](http://192.168.3.47:8080/)

- Blackbox exporter: [http://192.168.3.47:9115](http://192.168.3.47:9115/)

- node_exporter: [http://192.168.3.47:9100/metrics](http://192.168.3.47:9100/metrics)

## 🧠 9. Possible extensions (to add later)

- Add **Loki + Promtail** for log collection.

- Add **Uptime Kuma** as a lightweight synthetic-probe UI on top (it also looks good).

- Use **Traefik/Caddy** to expose all of these components through one entry point with HTTPS.

- Add **VictoriaMetrics** as long-term storage (via Prometheus `remote_write`).
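
For the last item, the Prometheus side is a short addition to `prometheus.yml`. The sketch assumes a single-node VictoriaMetrics container named `victoriametrics` on the same Compose network, listening on its default port 8428; neither the service nor the name exists in the Compose file above.

```yaml
# prometheus.yml (fragment)
remote_write:
  # Forward every scraped sample to VictoriaMetrics for long-term storage.
  # `victoriametrics` is an assumed service name on the Compose network.
  - url: http://victoriametrics:8428/api/v1/write
```

Prometheus keeps serving recent data from its local TSDB, so local retention can be kept short while Grafana queries VictoriaMetrics for older ranges.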