---
title: "Self-Healing Systems"
type: concept
tags: [aiops, automation, reliability, agentic-ai]
date: 2026-04-14
aliases:
- Self-Healing
---
## Definition
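A self-healing system is an intelligent system that can **automatically detect anomalies, diagnose problems, and execute repair actions**, returning to a normal operating state without human intervention. It is one of the core capabilities of [[Agentic AI]] and [[AIOps]].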
## How It Works
```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Anomaly    │ → │  Diagnosis   │ → │    Repair    │
│  Detection   │   │   & Root     │   │    Action    │
│              │   │    Cause     │   │              │
└──────────────┘   └──────────────┘   └──────────────┘
       ↓                  ↓                  ↓
  AI/ML Model       Decision Tree     Automated Script
   + Metrics      + Knowledge Base      + Runbooks
┌──────────────┐   ┌──────────────┐
│  Monitoring  │ ← │ Verification │
│    Close     │   │   & Report   │
└──────────────┘   └──────────────┘
```
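The flow in the diagram can be sketched as a single pass through detect, diagnose, repair, and verify. This is a minimal illustration, not a reference implementation: a fixed threshold stands in for the AI/ML model, and the `Incident` class, the `KNOWLEDGE_BASE` contents, the metric names, and the `restart_service` runbook are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Incident:
    metric: str
    value: float
    cause: Optional[str] = None
    action: Optional[str] = None
    resolved: bool = False


# Hypothetical knowledge base: metric name -> (root cause, runbook name).
KNOWLEDGE_BASE = {
    "memory_pct": ("memory leak", "restart_service"),
    "disk_pct": ("log growth", "cleanup_disk"),
}


def detect(metrics: Dict[str, float], thresholds: Dict[str, float]) -> Optional[Incident]:
    """Anomaly detection: a static threshold stands in for an AI/ML model."""
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            return Incident(metric=name, value=value)
    return None


def diagnose(incident: Incident) -> Incident:
    """Diagnosis: look up a root cause and a runbook in the knowledge base."""
    cause, action = KNOWLEDGE_BASE.get(incident.metric, ("unknown", "escalate"))
    incident.cause, incident.action = cause, action
    return incident


def heal_once(
    read_metrics: Callable[[], Dict[str, float]],
    runbooks: Dict[str, Callable[[], None]],
    thresholds: Dict[str, float],
) -> Optional[Incident]:
    """One pass of the loop: detect, diagnose, repair, then verify."""
    incident = detect(read_metrics(), thresholds)
    if incident is None:
        return None  # nothing to heal
    diagnose(incident)
    runbook = runbooks.get(incident.action)
    if runbook is not None:
        runbook()  # repair action
    # Verification: re-read the metric and check it is back under its limit.
    after = read_metrics().get(incident.metric, 0.0)
    incident.resolved = after <= thresholds[incident.metric]
    return incident


# Simulated system state, used only for the demonstration below.
state = {"memory_pct": 95.0, "disk_pct": 40.0}
limits = {"memory_pct": 90.0, "disk_pct": 85.0}


def restart_service() -> None:
    state["memory_pct"] = 30.0  # pretend the restart freed memory


result = heal_once(lambda: dict(state), {"restart_service": restart_service}, limits)
print(result)
```

An incident whose runbook is missing, or whose metric stays above its limit, comes back with `resolved=False`, which is the natural point to report and hand over to a human.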
## Self-Healing Actions
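| Action type | Description | Example |
|-------------|-------------|---------|
| Restart | Restart the service | Pod restart, process restart |
| Scale | Scale out or in | Add Pod replicas, increase resource allocations |
| Evict | Evict workloads from a problem node | Kubernetes node eviction |
| Cleanup | Clean up resources | Clear disk space, release connection pools |
| Rollback | Roll back a version | Revert to the last stable version |
| Reroute | Switch traffic | DNS switchover, load balancer adjustment |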
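One way to wire these action types into code is an enum plus a handler registry. The sketch below is an assumption about how that might look: the `handler` decorator, the `execute` function, and the dry-run default are illustrative choices, and the handlers only return strings instead of touching real infrastructure.

```python
from enum import Enum
from typing import Callable, Dict


class Action(Enum):
    RESTART = "restart"
    SCALE = "scale"
    EVICT = "evict"
    CLEANUP = "cleanup"
    ROLLBACK = "rollback"
    REROUTE = "reroute"


# Registry of action -> handler; filled in by the decorator below.
HANDLERS: Dict[Action, Callable[[str], str]] = {}


def handler(action: Action):
    """Register a function as the handler for one action type."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[action] = fn
        return fn
    return register


@handler(Action.RESTART)
def restart(target: str) -> str:
    # Placeholder: a real handler would call the orchestrator here.
    return f"restarted {target}"


@handler(Action.ROLLBACK)
def rollback(target: str) -> str:
    return f"rolled back {target} to the last stable version"


def execute(action: Action, target: str, dry_run: bool = True) -> str:
    """Run the handler for an action, or describe it when dry_run is set."""
    fn = HANDLERS.get(action)
    if fn is None:
        return f"no handler for {action.value}; escalate to a human"
    if dry_run:
        return f"[dry-run] would {action.value} {target}"
    return fn(target)


print(execute(Action.RESTART, "pod/web-1"))
print(execute(Action.RESTART, "pod/web-1", dry_run=False))
print(execute(Action.EVICT, "node-3", dry_run=False))
```

Defaulting to dry-run keeps a new or misconfigured rule from acting on production until someone has reviewed what it would do.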
## In ITSM Context
Within [[Incident-Management]] in [[ITSM 2.0]], self-healing is a key capability:
### AIOps-Powered Self-Healing
- Real-time observability drives rapid detection
- ML models predict failure before it happens
- Automated runbooks execute recovery
- Continuous learning improves future responses
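As a rough illustration of predicting a failure before it happens, the sketch below fits a least-squares line to recent samples of a metric and estimates how long until it crosses a limit. The straight-line fit is a deliberately simple stand-in for an ML model, and the function name, the 60-second sampling interval, and the sample values are assumptions.

```python
from typing import Optional, Sequence


def time_to_threshold(
    samples: Sequence[float],
    threshold: float,
    interval_s: float = 60.0,
) -> Optional[float]:
    """Seconds until the fitted linear trend reaches `threshold`.

    Returns None when there are too few samples or the trend is not rising.
    Samples are assumed to be evenly spaced, `interval_s` seconds apart.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Ordinary least squares over sample indices 0..n-1.
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling: no crossing predicted
    intercept = mean_y - slope * mean_x
    crossing_index = (threshold - intercept) / slope
    remaining_steps = crossing_index - (n - 1)
    return max(remaining_steps, 0.0) * interval_s


# Disk usage in percent, sampled once a minute (made-up values).
disk_pct = [70.0, 72.0, 74.0, 76.0, 78.0]
eta = time_to_threshold(disk_pct, threshold=90.0)
print(eta)
```

A prediction like this could trigger a Cleanup or Scale action while there is still headroom, instead of waiting for the threshold alert.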
### Kubernetes Self-Healing
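[[Kubernetes]] provides native self-healing mechanisms:
- **Liveness Probes**: automatically restart unhealthy containers
- **Readiness Probes**: stop routing traffic to unhealthy Pods
- **Node Failure Detection**: automatically reschedule Pods from failed nodes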
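To make the two probe types concrete, the sketch below builds a Pod manifest as a Python dict and serializes it to JSON, which `kubectl apply -f` accepts alongside YAML. The field names (`livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`) are standard Pod spec fields; the Pod name, image, port, endpoint paths, and timing values are assumed for illustration.

```python
import json
from typing import Any, Dict


def http_probe(path: str, port: int, period_s: int, failure_threshold: int) -> Dict[str, Any]:
    """Build an HTTP GET probe in the shape the Pod spec expects."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_s,
        "failureThreshold": failure_threshold,
    }


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},  # hypothetical Pod name
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "example/web:1.0",  # hypothetical image
                "ports": [{"containerPort": 8080}],
                # Liveness: after 3 consecutive failures the container is restarted.
                "livenessProbe": http_probe("/healthz", 8080, period_s=10, failure_threshold=3),
                # Readiness: while failing, the Pod is removed from Service endpoints.
                "readinessProbe": http_probe("/readyz", 8080, period_s=5, failure_threshold=1),
            }
        ]
    },
}

manifest = json.dumps(pod, indent=2)
print(manifest)
```

Keeping the two endpoints separate lets a Pod report "alive but not ready", for example during warm-up, so it is taken out of rotation without being restarted.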
## Related Concepts
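- [[Agentic AI]]: the driver of self-healing
- [[AIOps]]: the analysis engine behind self-healing
- [[Incident-Management]]: where self-healing is applied
- [[Kubernetes]]: the main platform for self-healing
- [[Root-Cause-Analysis]]: the diagnostic step that precedes self-healing
- [[MTTR]]: the key metric that self-healing improves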
## Sources
- [[how-agentic-ai-can-help-for-cloud-devops]]: Agentic AI self-healing scenarios
- [[understanding-complete-itsm]]: ITSM 2.0 self-healing capabilities
- [[Agentic-AI]]: the self-healing description on the entity page
- [[Kubernetes]]: Kubernetes self-healing mechanisms