---
title: "Self-Healing Systems"
type: concept
tags: [aiops, automation, reliability, agentic-ai]
date: 2026-04-14
aliases:
  - Self-Healing
---

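## Definition

A self-healing system is an intelligent system that can **automatically detect anomalies, diagnose problems, and execute repair actions**, returning to a normal operating state without human intervention. It is one of the core capabilities of [[Agentic AI]] and [[AIOps]].
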
## How It Works

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Anomaly    │ →  │  Diagnosis   │ →  │    Repair    │
│  Detection   │    │    & Root    │    │    Action    │
│              │    │    Cause     │    │              │
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                   ↓                   ↓
  AI/ML Model        Decision Tree      Automated Script
  + Metrics          + Knowledge Base   + Runbooks
                                               ↓
                    ┌──────────────┐    ┌──────────────┐
                    │  Monitoring  │ ←  │ Verification │
                    │    Close     │    │   & Report   │
                    └──────────────┘    └──────────────┘
```

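The stages in the diagram can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: the `Incident` record, the static `limits`, the dictionary-based knowledge base, and the `clean_logs` runbook are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Incident:
    """Hypothetical record passed between pipeline stages."""
    metric: str
    value: float
    cause: Optional[str] = None
    action: Optional[str] = None
    resolved: bool = False


def detect(metrics: Dict[str, float], limits: Dict[str, float]) -> List[Incident]:
    """Anomaly detection: flag every metric above its (assumed) static limit."""
    return [Incident(name, value) for name, value in metrics.items()
            if name in limits and value > limits[name]]


def diagnose(incident: Incident, knowledge_base: Dict[str, str]) -> Incident:
    """Diagnosis: look up a likely root cause in a toy knowledge base."""
    incident.cause = knowledge_base.get(incident.metric, "unknown")
    return incident


def repair(incident: Incident, runbooks: Dict[str, Callable[[], str]]) -> Incident:
    """Repair: run the runbook registered for the diagnosed cause, if any."""
    runbook = runbooks.get(incident.cause)
    incident.action = runbook() if runbook else None
    return incident


def verify(incident: Incident, read_metric: Callable[[str], float],
           limits: Dict[str, float]) -> Incident:
    """Verification: re-read the metric and confirm it is back under the limit."""
    incident.resolved = (incident.action is not None
                         and read_metric(incident.metric) <= limits[incident.metric])
    return incident


def heal_once(metrics, limits, knowledge_base, runbooks, read_metric) -> List[Incident]:
    """One pass of detect -> diagnose -> repair -> verify; returns the report."""
    return [verify(repair(diagnose(i, knowledge_base), runbooks), read_metric, limits)
            for i in detect(metrics, limits)]


# Simulated environment (all values are made up for the example).
state = {"disk_used_pct": 95.0, "cpu_pct": 40.0}
limits = {"disk_used_pct": 90.0, "cpu_pct": 85.0}
kb = {"disk_used_pct": "log_growth"}


def clean_logs() -> str:
    state["disk_used_pct"] = 60.0  # simulated effect of the cleanup runbook
    return "cleanup"


report = heal_once(dict(state), limits, kb, {"log_growth": clean_logs},
                   state.__getitem__)
```

Here `report` holds one incident for `disk_used_pct`, diagnosed as `log_growth`, repaired by the `cleanup` runbook, and marked resolved after the metric is re-read. An incident with no matching runbook keeps `action=None` and `resolved=False`, which a real system would treat as a signal to escalate to a human.
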
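## Self-Healing Actions

| Action type | Description | Examples |
|-------------|-------------|----------|
| Restart | Restart a service | Pod restart, process restart |
| Scale | Scale out or in | Add Pod replicas, increase resources |
| Evict | Move workloads off a faulty node | Kubernetes node eviction |
| Cleanup | Clean up resources | Free disk space, release connection pools |
| Rollback | Roll back a version | Revert to the last stable version |
| Reroute | Switch traffic | DNS switchover, load balancer adjustment |
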
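One way to represent these action types in code is an enumeration plus a lookup from diagnosed symptom to action. The sketch below assumes a static playbook; the symptom names are hypothetical and chosen only to pair with the six rows of the table.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RESTART = "restart"
    SCALE = "scale"
    EVICT = "evict"
    CLEANUP = "cleanup"
    ROLLBACK = "rollback"
    REROUTE = "reroute"


# Hypothetical symptom names; a real system would derive these from diagnosis.
PLAYBOOK = {
    "process_hung": Action.RESTART,
    "sustained_high_load": Action.SCALE,
    "node_not_ready": Action.EVICT,
    "disk_nearly_full": Action.CLEANUP,
    "errors_after_deploy": Action.ROLLBACK,
    "zone_unreachable": Action.REROUTE,
}


def choose_action(symptom: str) -> Optional[Action]:
    """Return the mapped action, or None to signal escalation to a human."""
    return PLAYBOOK.get(symptom)


print(choose_action("errors_after_deploy"))  # Action.ROLLBACK
print(choose_action("unknown_symptom"))      # None
```

Returning `None` for an unmapped symptom keeps the automation conservative: anything the playbook does not recognize is left for manual handling rather than guessed at.
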
## In ITSM Context

Self-healing is a key capability of [[Incident-Management]] in [[ITSM 2.0]]:

### AIOps-Powered Self-Healing
- Real-time observability drives rapid detection
- ML models predict failure before it happens
- Automated runbooks execute recovery
- Continuous learning improves future responses

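As a rough illustration of the detection step, the sketch below flags metric samples that deviate strongly from a rolling window. It uses a plain z-score as a simple statistical stand-in for the ML models mentioned above; the class name, window size, warm-up count, and threshold are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class ZScoreDetector:
    """Flags samples far from the mean of a rolling window of normal samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the recent window."""
        anomalous = False
        if len(self.history) >= 5:  # need a few samples before judging
            mu = mean(self.history)
            sigma = pstdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        if not anomalous:
            self.history.append(value)  # learn only from normal samples
        return anomalous


detector = ZScoreDetector(window=10)
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 400, 101]
flags = [detector.observe(x) for x in latencies_ms]
print(flags)  # only the 400 ms sample is flagged
```

Keeping anomalous samples out of the window is a deliberate choice: it stops a single spike from inflating the baseline and masking the next one. It also means a genuine, lasting shift in the metric would keep being flagged until the detector is reset.
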
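### Kubernetes Self-Healing
[[Kubernetes]] provides native self-healing mechanisms:
- **Liveness Probes**: automatically restart unhealthy containers
- **Readiness Probes**: stop routing traffic to Pods that are not ready
- **Node Failure Detection**: automatically reschedule controller-managed Pods onto healthy nodes
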
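The difference between the two probe types can be shown with a small simulation. This is a toy model, not kubelet code: `ProbeState` and `SimulatedPod` are hypothetical names, and the only Kubernetes behaviour assumed is that a probe acts after `failureThreshold` consecutive failures (default 3) and that a single success clears the failure count.

```python
from dataclasses import dataclass, field


@dataclass
class ProbeState:
    failure_threshold: int = 3  # mirrors the Kubernetes default failureThreshold
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failure_threshold


@dataclass
class SimulatedPod:
    liveness: ProbeState = field(default_factory=ProbeState)
    readiness: ProbeState = field(default_factory=ProbeState)
    restarts: int = 0
    receives_traffic: bool = True

    def tick(self, alive: bool, ready: bool) -> None:
        """Apply one round of probe results."""
        if self.liveness.record(alive):
            self.restarts += 1                      # liveness failure: restart
            self.liveness.consecutive_failures = 0  # fresh container, fresh count
        # Readiness failure only removes the Pod from traffic; no restart.
        self.receives_traffic = not self.readiness.record(ready)


pod = SimulatedPod()
timeline = []
for alive, ready in [(True, False), (False, False), (False, False), (False, True)]:
    pod.tick(alive, ready)
    timeline.append((pod.restarts, pod.receives_traffic))
print(timeline)
```

In this run the Pod is taken out of traffic on the third round (three consecutive readiness failures) without being restarted, and the restart happens only on the fourth round, when the liveness probe has failed three times in a row.
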
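## Related Concepts

- [[Agentic AI]]: the driver of self-healing
- [[AIOps]]: the analysis engine behind self-healing
- [[Incident-Management]]: where self-healing is applied
- [[Kubernetes]]: the primary platform for self-healing
- [[Root-Cause-Analysis]]: the diagnostic step that precedes self-healing
- [[MTTR]]: the key metric that self-healing improves
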
## Sources

- [[how-agentic-ai-can-help-for-cloud-devops]]: Agentic AI self-healing scenarios
- [[understanding-complete-itsm]]: ITSM 2.0 self-healing capabilities
- [[Agentic-AI]]: self-healing as described on the entity page
- [[Kubernetes]]: Kubernetes self-healing mechanisms