---
title: "Self-Healing Systems"
type: concept
tags: [aiops, automation, reliability, agentic-ai]
date: 2026-04-14
aliases:
  - Self-Healing
---

## Definition

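A self-healing system is an intelligent system that can **automatically detect anomalies, diagnose problems, and execute repair actions**, restoring normal operation without human intervention. It is one of the core capabilities of [[Agentic AI]] and [[AIOps]].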

## How It Works

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Anomaly    │ →  │  Diagnosis   │ →  │    Repair    │
│  Detection   │    │    & Root    │    │    Action    │
│              │    │    Cause     │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                  ↓                   ↓
   AI/ML Model       Decision Tree       Automated Script
   + Metrics         + Knowledge Base     + Runbooks
                                                  ↓
                    ┌──────────────┐    ┌──────────────┐
                    │  Monitoring  │ ←  │ Verification │
                    │    Close     │    │   & Report   │
                    └──────────────┘    └──────────────┘
```
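The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `Incident`, `run_healing_cycle`, the `error_rate` metric, the `service_hung` cause, and the `restart_service` runbook are all hypothetical names, and a fixed threshold stands in for the AI/ML detection model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Incident:
    metric: str
    value: float
    cause: Optional[str] = None
    action: Optional[str] = None
    resolved: bool = False


def run_healing_cycle(
    read_metric: Callable[[], float],
    threshold: float,
    diagnose: Callable[[float], str],
    runbooks: Dict[str, Callable[[], None]],
    metric_name: str = "error_rate",
) -> Optional[Incident]:
    """Run one detect -> diagnose -> repair -> verify pass."""
    value = read_metric()
    if value <= threshold:
        return None  # 1. Detection: nothing abnormal, keep monitoring.

    incident = Incident(metric=metric_name, value=value)
    incident.cause = diagnose(value)  # 2. Diagnosis: map the symptom to a cause.

    repair = runbooks.get(incident.cause)
    if repair is None:
        return incident  # No runbook for this cause: leave it unresolved for a human.

    incident.action = repair.__name__
    repair()  # 3. Repair: execute the automated runbook.

    # 4. Verification: re-read the metric and report whether the repair worked.
    incident.resolved = read_metric() <= threshold
    return incident


# Demo with a fake metric: the hypothetical runbook brings the error rate back down.
state = {"error_rate": 0.30}


def restart_service() -> None:
    state["error_rate"] = 0.01


incident = run_healing_cycle(
    read_metric=lambda: state["error_rate"],
    threshold=0.05,
    diagnose=lambda value: "service_hung",
    runbooks={"service_hung": restart_service},
)
print(incident)
```

Returning the unresolved `Incident` when no runbook matches keeps the escalation path explicit: anything the system cannot repair is handed to a person instead of being retried blindly.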

## Self-Healing Actions

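| Action type | Description | Examples |
|-------------|-------------|----------|
| Restart | Restart a service | Pod restart, process restart |
| Scale | Scale out or in | Add Pod replicas, increase resources |
| Evict | Move workloads off a faulty node | Kubernetes node eviction |
| Cleanup | Clean up resources | Free disk space, release pooled connections |
| Rollback | Roll back a release | Return to the last stable version |
| Reroute | Switch traffic | DNS switchover, load balancer adjustment |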
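One way to encode the action types in the table is a lookup from a diagnosed symptom to an action. The sketch below assumes a small, fixed set of symptoms. The symptom names (`crash_loop`, `disk_full`, and so on) and the `choose_action` helper are illustrative and do not come from any particular tool.

```python
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    RESTART = "restart"
    SCALE = "scale"
    EVICT = "evict"
    CLEANUP = "cleanup"
    ROLLBACK = "rollback"
    REROUTE = "reroute"


# Hypothetical symptom names; a real system would derive these from its own diagnosis step.
SYMPTOM_TO_ACTION: Dict[str, Action] = {
    "crash_loop": Action.RESTART,
    "cpu_saturated": Action.SCALE,
    "node_not_ready": Action.EVICT,
    "disk_full": Action.CLEANUP,
    "bad_release": Action.ROLLBACK,
    "zone_outage": Action.REROUTE,
}


def choose_action(symptom: str) -> Optional[Action]:
    """Return the action for a known symptom, or None to signal escalation."""
    return SYMPTOM_TO_ACTION.get(symptom)


print(choose_action("disk_full"))
print(choose_action("unknown_symptom"))
```

Keeping the mapping as plain data makes the policy easy to review and audit, which matters when the actions (rollback, eviction) have side effects of their own.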

## In ITSM Context

In [[ITSM 2.0]], self-healing is a key capability of [[Incident-Management]]:

### AIOps-Powered Self-Healing
- Real-time observability drives rapid detection
- ML models predict failure before it happens
- Automated runbooks execute recovery
- Continuous learning improves future responses
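As a rough illustration of the detection step, the sketch below flags metric samples that deviate sharply from a rolling baseline. It is a simple statistical stand-in (a rolling z-score), not a trained ML model and not a failure predictor. The class name `RollingZScoreDetector`, the window size, the warm-up count, and the threshold are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag samples that lie far from the mean of a sliding window."""

    def __init__(self, window: int = 30, threshold: float = 3.0, warmup: int = 5):
        self.samples = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the current window."""
        anomalous = False
        if len(self.samples) >= self.warmup:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            # A zero spread gives no usable baseline, so nothing is flagged.
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        if not anomalous:
            # Only normal samples update the baseline, so one spike does not mask the next.
            self.samples.append(value)
        return anomalous


detector = RollingZScoreDetector()
latencies_ms = [100, 101, 99, 100, 102, 98, 100, 101, 250]
flags = [detector.observe(v) for v in latencies_ms]
print(flags)
```

In this run only the final sample (250 ms) is flagged. Excluding anomalous samples from the window is a design choice: it keeps the baseline stable during an incident, at the cost of adapting more slowly if the metric shifts permanently.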

### Kubernetes Self-Healing
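[[Kubernetes]] provides native self-healing mechanisms:
- **Liveness Probes**: automatically restart containers that fail their health check
- **Readiness Probes**: stop routing traffic to Pods that are not ready
- **Node Failure Detection**: automatically reschedule Pods from failed nodes onto healthy ones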
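On the application side, HTTP probes only need endpoints that return a success status when the container is healthy. Kubernetes treats any response code from 200 to 399 as success. The sketch below uses only the Python standard library. The `/healthz` and `/readyz` paths, the `AppState` flags, and port 8080 are assumptions for illustration, and the Pod spec would have to point its `livenessProbe` and `readinessProbe` `httpGet` paths at whatever endpoints the service actually exposes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Hypothetical process-wide flags the application updates as it runs."""
    alive = True    # set to False if the process reaches an unrecoverable state
    ready = False   # set to True once startup work (caches, connections) is done


def probe_status(path: str, alive: bool, ready: bool) -> int:
    """Map a probe path to the HTTP status code to return."""
    if path == "/healthz":   # liveness: a failure leads to a container restart
        return 200 if alive else 503
    if path == "/readyz":    # readiness: a failure removes the Pod from traffic
        return 200 if (alive and ready) else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, AppState.alive, AppState.ready))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocking call; run in the service's main thread or a background thread."""
    HTTPServer(("", port), ProbeHandler).serve_forever()


# During startup the process is alive but not yet ready for traffic.
print(probe_status("/healthz", alive=True, ready=False))
print(probe_status("/readyz", alive=True, ready=False))
```

Separating the two endpoints lets a slow-starting service stay out of rotation without being restarted, which is the distinction the liveness and readiness bullets describe.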

## Related Concepts

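- [[Agentic AI]]: the driver of self-healing
- [[AIOps]]: the analytics engine behind self-healing
- [[Incident-Management]]: where self-healing is applied
- [[Kubernetes]]: the main platform for self-healing
- [[Root-Cause-Analysis]]: the diagnosis step that precedes healing
- [[MTTR]]: the key metric that self-healing improves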

## Sources

- [[how-agentic-ai-can-help-for-cloud-devops]]: Agentic AI self-healing scenarios
- [[understanding-complete-itsm]]: self-healing capabilities in ITSM 2.0
- [[Agentic-AI]]: description of self-healing on the entity page
- [[Kubernetes]]: Kubernetes self-healing mechanisms