Files
nexus/wiki/concepts/Self-Healing-Systems.md

3.0 KiB
Raw Blame History

title, type, tags, date, aliases
title type tags date aliases
Self-Healing Systems concept
aiops
automation
reliability
agentic-ai
2026-04-14
Self-Healing

Definition

自愈系统Self-Healing Systems是能够自动检测异常、诊断问题并执行修复操作的智能系统,无需人工干预即可恢复正常运行状态。这是Agentic AIAIOps的核心能力之一。

How It Works

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Anomaly    │ →  │   Diagnosis   │ →  │    Repair    │
│   Detection   │    │    & Root    │    │   Action     │
│              │    │   Cause       │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                  ↓                   ↓
   AI/ML Model       Decision Tree       Automated Script
   + Metrics         + Knowledge Base     + Runbooks
                                                  ↓
                    ┌──────────────┐    ┌──────────────┐
                    │  Monitoring  │ ←  │ Verification │
                    │    Close     │    │   & Report   │
                    └──────────────┘    └──────────────┘

Self-Healing Actions

动作类型 描述 示例
Restart 服务重启 Pod重启、进程重启
Scale 扩缩容 增加Pod数量、扩容资源
Evict 驱逐问题节点 Kubernetes节点驱逐
Cleanup 资源清理 清理磁盘、释放连接池
Rollback 版本回滚 回到上一个稳定版本
Reroute 流量切换 DNS切换、负载均衡调整

In ITSM Context

ITSM 2.0Incident-Management中,自愈是关键能力:

AIOps-Powered Self-Healing

  • Real-time observability drives rapid detection
  • ML models predict failure before it happens
  • Automated runbooks execute recovery
  • Continuous learning improves future responses

Kubernetes Self-Healing

Kubernetes提供原生自愈机制:

  • Liveness Probes — 自动重启不健康容器
  • Readiness Probes — 停止流量到不健康Pod
  • Node Failure Detection — 自动重新调度Pod

Sources