nexus/wiki/concepts/Self-Healing-Systems.md at 31d316b0967fdf0486ce318d8c9537cc575a7b21

ishenwei/nexus

weishen de096f2f88 Auto-sync: 2026-04-22 04:02

2026-04-22 04:03:04 +08:00

3.0 KiB

Definition

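A self-healing system is an intelligent system that automatically detects anomalies, diagnoses problems, and executes repairs, restoring normal operation without human intervention. It is one of the core capabilities of Agentic AI and AIOps.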

How It Works

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Anomaly    │ →  │  Diagnosis   │ →  │    Repair    │
│  Detection   │    │    & Root    │    │    Action    │
│              │    │    Cause     │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                   ↓                   ↓
   AI/ML Model       Decision Tree       Automated Script
   + Metrics         + Knowledge Base    + Runbooks
                                               ↓
                    ┌──────────────┐    ┌──────────────┐
                    │  Monitoring  │ ←  │ Verification │
                    │    Close     │    │   & Report   │
                    └──────────────┘    └──────────────┘
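
The flow in the diagram can be sketched as a single pass through detection, diagnosis, repair, and verification. This is a minimal illustration, not a production design: `heal_once`, `Diagnosis`, the threshold check, and the toy `restart` function are all hypothetical names introduced here, and a fixed threshold stands in for the AI/ML model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Diagnosis:
    cause: str   # hypothetical root-cause label
    action: str  # name of the repair action to run


def heal_once(
    read_metric: Callable[[], float],
    threshold: float,
    diagnose: Callable[[float], Optional[Diagnosis]],
    repairs: Dict[str, Callable[[], None]],
) -> str:
    """Run one detect -> diagnose -> repair -> verify pass and report the outcome."""
    value = read_metric()
    if value <= threshold:              # 1. anomaly detection (simple threshold)
        return "healthy"
    diagnosis = diagnose(value)         # 2. diagnosis and root cause
    if diagnosis is None or diagnosis.action not in repairs:
        return "escalate: no known repair"
    repairs[diagnosis.action]()         # 3. repair action
    if read_metric() <= threshold:      # 4. verification
        return f"healed: {diagnosis.cause} via {diagnosis.action}"
    return f"escalate: {diagnosis.action} did not fix {diagnosis.cause}"


# Toy usage: an in-memory "service" whose error rate drops after a restart.
state = {"error_rate": 0.30}


def restart() -> None:
    state["error_rate"] = 0.01


result = heal_once(
    read_metric=lambda: state["error_rate"],
    threshold=0.05,
    diagnose=lambda v: Diagnosis(cause="stuck worker", action="restart"),
    repairs={"restart": restart},
)
print(result)
```

Returning an "escalate" outcome when no repair is known, or when verification fails, keeps a human in the loop for cases the automation cannot handle.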

Self-Healing Actions

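Action type	Description	Examples
Restart	Restart a service	Pod restart, process restart
Scale	Scale out or in	Add Pod replicas, increase resource capacity
Evict	Evict workloads from a faulty node	Kubernetes node drain (Pod eviction)
Cleanup	Clean up resources	Free disk space, release connection pools
Rollback	Roll back a version	Revert to the last stable version
Reroute	Switch traffic	DNS switchover, load balancer adjustment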
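
One way to make the action catalog above machine-readable is an enumeration plus a lookup policy. The sketch below is illustrative only: the symptom names and the `POLICY` mapping are assumptions made for this example, not part of any real tool.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RESTART = "restart"
    SCALE = "scale"
    EVICT = "evict"
    CLEANUP = "cleanup"
    ROLLBACK = "rollback"
    REROUTE = "reroute"


# Hypothetical symptom-to-action policy; real systems would derive this
# from runbooks or a knowledge base.
POLICY = {
    "process_hung": Action.RESTART,
    "cpu_saturated": Action.SCALE,
    "node_not_ready": Action.EVICT,
    "disk_full": Action.CLEANUP,
    "bad_release": Action.ROLLBACK,
    "zone_outage": Action.REROUTE,
}


def choose_action(symptom: str) -> Optional[Action]:
    """Return the configured action, or None when the symptom is unknown."""
    return POLICY.get(symptom)


print(choose_action("disk_full"))
print(choose_action("unknown_symptom"))
```

A `None` result is a natural point to hand the incident to a human rather than guess at a repair.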

In ITSM Context

Within Incident-Management in ITSM 2.0, self-healing is a key capability:

AIOps-Powered Self-Healing

Real-time observability drives rapid detection
ML models predict failure before it happens
Automated runbooks execute recovery
Continuous learning improves future responses
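
As a rough illustration of metric-driven detection, the sketch below flags a reading that deviates strongly from recent history. It is a plain z-score check standing in for the ML models mentioned above, not a predictive model; `is_anomalous`, the threshold of 3.0, and the sample latencies are assumptions for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` when it lies more than `z_threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a constant series is unusual
    return abs(latest - mu) / sigma > z_threshold


latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100]
print(is_anomalous(latencies_ms, 101))  # False: within normal variation
print(is_anomalous(latencies_ms, 250))  # True: far outside recent history
```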

Kubernetes Self-Healing

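Kubernetes provides native self-healing mechanisms:

Liveness Probes: automatically restart unhealthy containers
Readiness Probes: stop routing traffic to Pods that are not ready
Node Failure Detection: automatically reschedule Pods from failed nodes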
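
To show how the two probe types are declared, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The probe field names follow the Kubernetes Pod API; the helper `pod_with_probes`, the image name, the `/healthz` and `/ready` paths, and the timing values are assumptions chosen for illustration.

```python
import json


def pod_with_probes(name: str, image: str, port: int) -> dict:
    """Build a Pod manifest with HTTP liveness and readiness probes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                    # Repeated failures here cause the kubelet to restart the container.
                    "livenessProbe": {
                        "httpGet": {"path": "/healthz", "port": port},  # assumed endpoint
                        "initialDelaySeconds": 10,
                        "periodSeconds": 5,
                        "failureThreshold": 3,
                    },
                    # Failures here remove the Pod from Service endpoints without restarting it.
                    "readinessProbe": {
                        "httpGet": {"path": "/ready", "port": port},  # assumed endpoint
                        "periodSeconds": 5,
                        "failureThreshold": 1,
                    },
                }
            ]
        },
    }


manifest = pod_with_probes("web", "example/web:1.0", 8080)
print(json.dumps(manifest, indent=2))
```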

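Related Concepts

Agentic AI: the driver of self-healing
AIOps: the analysis engine for self-healing
Incident-Management: the application scenario for self-healing
Kubernetes: the primary platform for self-healing
Root-Cause-Analysis: the diagnostic step that precedes self-healing
MTTR: the key metric that self-healing improves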

Sources

how-agentic-ai-can-help-for-cloud-devops: Agentic AI self-healing scenarios
understanding-complete-itsm: ITSM 2.0 self-healing capabilities
Agentic-AI: description of self-healing on the entity page
Kubernetes: Kubernetes self-healing mechanisms
