---
title: "Self-Healing"
type: concept
tags: [Self-Healing, SRE, Automation, Resilience, Cloud-Native, Fault-Tolerance]
sources:
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
last_updated: 2026-04-29
---
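## Self-Healing
Self-healing is the ability of a software system to continuously monitor its own health and to detect failures and restore service automatically, without human intervention. It is the concrete, software-level implementation of the [[SRE]] and [[Recovery-Assurance]] philosophies.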
## Definition
> "Self-healing is the ability of a system to detect failures, diagnose the root cause, and restore service automatically without human intervention." — [[SRE]] Principles
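A self-healing system automates recovery through the following mechanisms:
1. **Fault detection**: identify anomalies from the telemetry collected through [[Observability]]
2. **Root-cause diagnosis**: analyze the anomaly pattern and classify the fault (transient vs. persistent)
3. **Recovery execution**: trigger a predefined remediation action (restart the service, switch nodes, scale out, or degrade)
4. **Verification and feedback**: after recovery, verify service availability and confirm the health status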
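
The four steps above can be sketched as a single control-loop iteration. This is a minimal illustration, not a production design: the `Service` class, the `failures >= threshold` rule for telling transient from persistent faults, and the choice of `restart` versus `fail_over` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FaultKind(Enum):
    TRANSIENT = "transient"
    PERSISTENT = "persistent"


@dataclass
class Service:
    """Toy stand-in for a monitored service (hypothetical, for illustration)."""
    name: str
    healthy: bool = True
    failures: int = 0
    log: List[str] = field(default_factory=list)

    def restart(self) -> None:
        self.log.append("restart")
        self.healthy = True

    def fail_over(self) -> None:
        self.log.append("fail_over")
        self.healthy = True


def detect(service: Service) -> bool:
    """Step 1: fault detection. True when the service looks unhealthy."""
    return not service.healthy


def diagnose(service: Service, threshold: int = 3) -> FaultKind:
    """Step 2: diagnosis. Repeated failures are assumed to be persistent."""
    if service.failures >= threshold:
        return FaultKind.PERSISTENT
    return FaultKind.TRANSIENT


def remediate(service: Service, kind: FaultKind) -> None:
    """Step 3: recovery. Restart for transient faults, fail over otherwise."""
    if kind is FaultKind.TRANSIENT:
        service.restart()
    else:
        service.fail_over()


def heal_once(service: Service) -> Optional[str]:
    """Run one loop iteration; return the diagnosed fault kind, or None if healthy."""
    if not detect(service):
        return None
    service.failures += 1
    kind = diagnose(service)
    remediate(service, kind)
    # Step 4: verification. If the service is still unhealthy, hand off to a human.
    if detect(service):
        raise RuntimeError(f"{service.name}: recovery failed, escalate")
    return kind.value
```

In a real system the health signal would come from telemetry rather than a boolean flag, and the failure counter would normally be scoped to a time window.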
## Self-Healing Mechanisms
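| Layer | Mechanism | Examples |
|------|------|------|
| **Infrastructure** | Automatic replacement of failed compute nodes | Kubernetes node auto-replacement, EC2 Auto Recovery |
| **Container/orchestration** | Automatic Pod restart and rescheduling | Kubernetes Liveness/Readiness Probes, automatic restart policy |
| **Application** | Self-healing logic embedded in the application | Circuit Breaker pattern, Graceful Degradation |
| **Data** | Automatic failover | Multi-AZ RDS automatic failover, DynamoDB automatic replication |
| **Network** | Automatic traffic routing | Route 53 Health Check + DNS Failover, NLB automatic removal of unhealthy targets |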
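
As an illustration of the application-layer row, the following is a minimal sketch of the Circuit Breaker pattern. The class name, the default thresholds, and the injectable `clock` parameter (used only to make the behavior testable) are assumptions for this example and do not come from any particular library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is short-circuited."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip at the threshold, or re-open at once if a half-open trial fails.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the breaker is open gives the downstream dependency time to recover, and the half-open trial call is what lets the caller resume normal traffic without operator action.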
## Relationship with SRE
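In [[SRE]] practice, self-healing is an important means of eliminating Toil (repetitive manual work):
- **Lower Mean Time To Recovery (MTTR)**: automated recovery can be 10-100x faster than a human response
- **Less Toil**: on-call engineers no longer have to handle predictable failure modes by hand
- **Error Budget protection**: faster automated recovery means higher availability, so the Error Budget is consumed more slowly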
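
The link between MTTR and Error Budget can be made concrete with a little arithmetic. The sketch below assumes a 99.9% availability SLO over a 30-day window and four incidents per window; the incident count and both MTTR values are hypothetical figures chosen for illustration, not measurements.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(incidents: int, mttr_minutes: float, slo: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget used if every incident lasts `mttr_minutes`."""
    return incidents * mttr_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    slo = 0.999  # hypothetical SLO
    print(f"budget: {error_budget_minutes(slo):.1f} min")
    print(f"manual, 30 min MTTR: {budget_consumed(4, 30, slo):.0%} of budget")
    print(f"automated, 2 min MTTR: {budget_consumed(4, 2, slo):.0%} of budget")
```

Under these assumptions the budget is about 43.2 minutes. Four incidents at a 30-minute MTTR cost 120 minutes, well over the budget, while the same four incidents at a 2-minute MTTR cost 8 minutes, which is under a fifth of it.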
## Connection to Recovery Assurance
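[[Recovery-Assurance]] requires that a system not only be able to recover but also be able to **guarantee** its ability to recover. Self-healing is one of the technical foundations of Recovery Assurance:
- **Continuous recoverability verification**: every self-healing test is itself a continuous verification of a recovery path
- **Less dependence on people**: manual coordination is the main cause of delays in DR testing, and self-healing removes that staffing bottleneck
- **A prerequisite for scale**: a system that cannot self-heal cannot guarantee recoverability at cloud-native scale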
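
One way to picture continuous recoverability verification is a scheduled drill that injects a fault and checks that the system heals itself within a target time. The sketch below uses simulated time in abstract ticks; `SimulatedService`, `restart_ticks`, and the drill function are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimulatedService:
    """Toy service whose supervisor heals it after `restart_ticks` ticks."""
    restart_ticks: int
    healthy: bool = True
    _down_for: int = 0

    def inject_fault(self) -> None:
        self.healthy = False
        self._down_for = 0

    def tick(self) -> None:
        """Advance simulated time by one tick; heal once the restart is due."""
        if not self.healthy:
            self._down_for += 1
            if self._down_for >= self.restart_ticks:
                self.healthy = True


def run_recovery_drill(service: SimulatedService, max_ticks: int) -> Optional[int]:
    """Inject a fault and return the ticks taken to recover.

    Returns None when the service does not recover within `max_ticks`,
    which is the signal that the recovery path needs attention.
    """
    service.inject_fault()
    for elapsed in range(1, max_ticks + 1):
        service.tick()
        if service.healthy:
            return elapsed
    return None
```

Running such a drill on a schedule turns "we believe this recovers" into a recorded pass or fail against a recovery-time target, with no manual coordination required.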
## Self-Healing vs. Chaos Engineering
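| Dimension | Self-Healing | Chaos Engineering |
|------|---------------------|---------------------------|
| **Purpose** | Recover automatically when a failure occurs | Proactively inject faults to verify system resilience |
| **Trigger** | Reactive (a failure occurs) | Proactive (an experiment injects a fault) |
| **Timing** | Runs during production failures | Routine experiments |
| **Relationship** | Complementary: chaos engineering finds weaknesses → self-healing repairs failures | Complementary: chaos engineering finds weaknesses → self-healing repairs failures |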
## Implementation Pattern
```yaml
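# Kubernetes self-healing example (fragments, not a complete manifest)

# Container-level fragment: liveness probe
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

# Pod-spec-level fragment
restartPolicy: Always              # restart containers automatically when they fail
terminationGracePeriodSeconds: 30  # graceful shutdown window
---
# HPA (Horizontal Pod Autoscaler); metadata and spec.scaleTargetRef are omitted here
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70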
```
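
To see what the HPA settings above imply, the following sketch applies the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of observed to target utilization, rounded up, then clamped to the configured bounds. The function name is hypothetical, the defaults mirror the manifest values, and the 10% tolerance is the documented controller default; the real controller also accounts for pod readiness, missing metrics, and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70.0,
                     min_replicas: int = 3, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * observed / target), clamped to bounds."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, three replicas running at 140% average CPU utilization against a 70% target would scale to six, while very low utilization never drops the count below `minReplicas`.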
## Related Concepts
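- [[SRE]]: self-healing is a core means by which SRE eliminates Toil and improves reliability
- [[Recovery-Assurance]]: self-healing is a technical foundation of Recovery Assurance
- [[Observability]]: self-healing depends on the telemetry that observability provides
- [[High-Availability]]: high availability is the infrastructure-level safeguard for self-healing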
## Sources
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]