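---
title: "Home Monitoring Stack: Prometheus + Grafana + Node Exporter + cAdvisor + Blackbox"
type: source
tags: [monitoring, prometheus, grafana, self-hosted]
date: 2025-11-11
---

## Source File

- [[raw/Home Office/家庭监控方案:Prometheus + Grafana + Node Exporter + cAdvisor +Blackbox.md]]

## Summary

- Core topic: a Docker-based observability stack for home and small-lab environments, covering the host layer, the container layer, the service layer, and synthetic monitoring
- Problem domain: how to build a complete monitoring and alerting setup at low cost with open-source tools
- Method/mechanism: Prometheus pull-based scraping + Grafana visualization + Alertmanager alert routing; cAdvisor collects container metrics; blackbox_exporter runs HTTP/TCP/DNS synthetic probes; node_exporter collects host metrics
- Conclusion/value: provides two docker-compose templates (lightweight / PoC), plus a copy-ready prometheus.yml, alert rules, and Alertmanager configuration

## Key Claims

- Prometheus's pull-based model suits multi-host monitoring: it scrapes each exporter's metrics through scrape_configs
- cAdvisor needs /var/lib/docker/ mounted in order to collect complete container resource usage
- blackbox_exporter supports four probe types (HTTP, TCP, ICMP, DNS) and can monitor the availability of internal and external services as well as TLS certificate expiry
- Alertmanager delivers alerts by email, Slack, webhook, or PagerDuty, and uses grouping and inhibition to avoid alert storms
- A single docker-compose file brings up Prometheus + Grafana + cAdvisor + blackbox_exporter + Alertmanager in one step
- Importing a Grafana dashboard only requires its ID (Node Exporter Full: 1860, cAdvisor: 14282, Blackbox: 7587)
- Mounting the Docker socket is a security risk: the container can obtain privileges equivalent to root on the host
- TLS certificate expiry can be monitored with the probe_ssl_earliest_cert_expiry metric, alerting 14 days in advance
- Monitoring traffic should be placed on a management VLAN or restricted with firewall rules
- Prometheus's local disk usage grows continuously; long-term retention requires remote_write to a remote store such as VictoriaMetrics

## Key Quotes

> "Prometheus's local disk usage will grow; for long-term retention, consider remote storage or periodic snapshots" (production-grade storage advice)

> "Prometheus supports setting a download delay plus randomized access for the same website, to avoid being banned" (crawler anti-blocking strategy)

## Key Concepts

- [[Prometheus]]: open-source time-series database and monitoring/alerting system, with the PromQL query language and an alert rule engine
- [[Grafana]]: open-source observability platform for time-series visualization, dashboards, and alert notifications
- [[Alertmanager]]: the alert distribution component of the Prometheus ecosystem, supporting grouping, inhibition, and routing
- [[cAdvisor]]: Google's open-source container resource monitor; collects CPU, memory, network, and disk I/O metrics
- [[node_exporter]]: the official Prometheus host metrics exporter; collects CPU, memory, disk, and network metrics
- [[blackbox_exporter]]: the official Prometheus black-box probing exporter; supports HTTP/TCP/DNS/ICMP probes
- [[PromQL]]: Prometheus Query Language, used to query and aggregate time-series metrics
- [[可观测性|Observability]]: the three pillars of monitoring systems (Metrics/Logs/Traces)
- [[合成监测|Synthetic Monitoring]]: probes simulate user requests to check service availability
- [[Prometheus告警规则|Prometheus alert rules]]: PromQL expressions evaluated continuously; an alert fires once the threshold condition holds
- [[Docker Socket安全|Docker socket security]]: mounting /var/run/docker.sock effectively gives the container root privileges on the host

## Key Entities

- [[Uptime Kuma]]: self-hosted website monitoring tool with HTTP/TCP/DNS/TLS probes; a good outer UI layer for synthetic monitoring
- [[Loki]]: Grafana Labs' lightweight log aggregation system, natively integrated with Prometheus and Grafana
- [[VictoriaMetrics]]: high-performance time-series database, compatible with the Prometheus remote_write API and suited to long-term storage
- [[Portainer]]: Docker management UI; it does not replace Prometheus but makes day-to-day operations easier

## Connections

- [[Prometheus]] ← scrape_configs ← [[node_exporter]]
- [[Prometheus]] ← scrape_configs ← [[cAdvisor]]
- [[Prometheus]] ← scrape_configs ← [[blackbox_exporter]]
- [[Grafana]] ← data source ← [[Prometheus]]
- [[Alertmanager]] ← receives alerts ← [[Prometheus]]
- [[Grafana]] ← dashboards ← [[cAdvisor]] / [[node_exporter]] / [[blackbox_exporter]]
- [[Prometheus]] ← remote storage ← [[VictoriaMetrics]]

## Contradictions

- No obvious conflicts

## Infrastructure Code

### docker-compose.yml core configuration

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml:ro
      # Assumed host path: the scrape jobs read their file_sd targets from this directory
      - ./prometheus/targets:/etc/prometheus/targets:ro
      - prometheus-data:/prometheus
    command: ['--config.file=/etc/prometheus/prometheus.yml', '--storage.tsdb.path=/prometheus', '--web.enable-lifecycle']
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
  node_exporter:
    image: prom/node-exporter:latest
    network_mode: "host"
    pid: "host"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    # Point the exporter at the host mounts instead of the container's own /proc and /sys
    command: ['--path.procfs=/host/proc', '--path.sysfs=/host/sys', '--path.rootfs=/rootfs']
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    ports: ["8080:8080"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
  blackbox:
    image: prom/blackbox-exporter:latest
    ports: ["9115:9115"]

# Named volume referenced by the prometheus service
volumes:
  prometheus-data:
```

### prometheus.yml scrape_configs

```yaml
scrape_configs:
  - job_name: 'node_exporter'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/node.yml']
  - job_name: 'cadvisor'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/cadvisor.yml']
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params: { module: [http_2xx] }
    file_sd_configs:
      - files: ['/etc/prometheus/targets/blackbox.yml']
    relabel_configs:
      # Pass each listed target to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label, otherwise every series reports blackbox:9115
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the blackbox exporter
      - target_label: __address__
        replacement: blackbox:9115
```

### Core alert rules

```yaml
groups:
  - name: home-lab  # hypothetical group name; a rule file needs a groups wrapper
    rules:
      - alert: HostHighCPU
        # Busy percentage per host: 100 minus the idle share, so all non-idle modes count
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 5m
        labels:
          severity: critical
      - alert: TLSCertExpiring
        # Fires when the earliest certificate in the chain expires within 14 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
```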
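
### Alertmanager routing sketch

The Key Claims state that Alertmanager groups and inhibits alerts to avoid alert storms, but this note does not reproduce that configuration. The following is a minimal sketch of what such an `alertmanager.yml` could look like, not the source's actual file. The receiver name, the webhook URL, and all timing values are illustrative assumptions.

```yaml
route:
  receiver: default-webhook          # hypothetical receiver name
  group_by: ['alertname', 'instance']
  group_wait: 30s                    # assumed value: wait briefly to batch alerts of a new group
  group_interval: 5m                 # assumed value: minimum gap between updates for a group
  repeat_interval: 4h                # assumed value: re-send interval for still-firing alerts

receivers:
  - name: default-webhook
    webhook_configs:
      - url: http://example.internal:5001/alerts   # placeholder URL, replace with a real endpoint

inhibit_rules:
  # While a critical alert fires for an instance, mute warning alerts for that same instance
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['instance']
```

With the rules above, a critical `HostLowDisk` on a host would suppress a concurrent warning-level `HostHighCPU` on that host, since both carry the same `instance` label. Whether that coupling is desirable depends on the setup; narrowing `equal` or adding `alertname` matchers is a reasonable alternative.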