Auto-sync: 2026-04-29 00:02
99
wiki/concepts/Self-Healing.md
Normal file
@@ -0,0 +1,99 @@
---
title: "Self-Healing"
type: concept
tags: [Self-Healing, SRE, Automation, Resilience, Cloud-Native, Fault-Tolerance]
sources:
  - public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
last_updated: 2026-04-29
---

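## Self-Healing

Self-healing is a software system's ability to monitor its own health continuously and to detect failures and restore service automatically, without human intervention. It is the concrete, software-level realization of the [[SRE]] and [[Recovery-Assurance]] principles.
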
## Definition

> "Self-healing is the ability of a system to detect failures, diagnose the root cause, and restore service automatically without human intervention." ([[SRE]] Principles)

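A self-healing system automates recovery through four mechanisms:

1. **Fault detection**: identify anomalies from the telemetry collected through [[Observability]]
2. **Root-cause diagnosis**: analyze the anomaly pattern to classify the fault (transient vs. persistent)
3. **Recovery execution**: trigger a predefined remediation action (restart the service, switch nodes, scale out, or degrade gracefully)
4. **Verification and feedback**: after recovery, verify service availability and confirm a healthy state
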
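
The four steps can be sketched as a small control loop. This is a minimal illustration, not an implementation taken from the source: the `SelfHealer` class, its callbacks, the thresholds, and the rule "a fault counts as persistent once restarts have failed `max_restarts` times" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class FaultType(Enum):
    TRANSIENT = "transient"    # expected to clear after a restart
    PERSISTENT = "persistent"  # restarts did not help; switch nodes instead


@dataclass
class SelfHealer:
    """Hypothetical controller; every name here is an illustrative assumption."""
    is_healthy: Callable[[], bool]  # health probe fed by telemetry
    restart: Callable[[], None]     # remediation for transient faults
    failover: Callable[[], None]    # remediation for persistent faults
    failure_threshold: int = 3      # consecutive failed probes before acting
    max_restarts: int = 2           # failed restarts before a fault counts as persistent
    failures: int = 0
    failed_restarts: int = 0
    log: List[str] = field(default_factory=list)

    def diagnose(self) -> FaultType:
        # Step 2: classify the fault from what has already been tried.
        if self.failed_restarts >= self.max_restarts:
            return FaultType.PERSISTENT
        return FaultType.TRANSIENT

    def tick(self) -> None:
        """Run one monitoring cycle."""
        # Step 1: detection.
        if self.is_healthy():
            self.failures = 0
            return
        self.failures += 1
        if self.failures < self.failure_threshold:
            return  # tolerate short blips below the threshold
        fault = self.diagnose()
        # Step 3: recovery.
        if fault is FaultType.TRANSIENT:
            self.restart()
        else:
            self.failover()
        # Step 4: verification.
        recovered = self.is_healthy()
        self.log.append(f"{fault.value}:{'ok' if recovered else 'failed'}")
        if recovered:
            self.failures = 0
            self.failed_restarts = 0
        elif fault is FaultType.TRANSIENT:
            self.failed_restarts += 1


def demo() -> List[str]:
    """Simulate a fault that only a failover can fix."""
    state = {"healthy": False}
    healer = SelfHealer(
        is_healthy=lambda: state["healthy"],
        restart=lambda: None,  # a restart does not fix this simulated fault
        failover=lambda: state.update(healthy=True),
        failure_threshold=1,
    )
    for _ in range(3):
        healer.tick()
    return healer.log
```

In `demo()`, two restarts fail verification, so the third cycle reclassifies the fault as persistent and fails over. Keeping the verification result in the diagnosis input is the design point: the loop escalates only on evidence that the cheaper action did not work.
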
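## Self-Healing Mechanisms

| Layer | Mechanism | Examples |
|------|------|------|
| **Infrastructure** | Automatic replacement of failed compute nodes | Kubernetes node auto-replacement, EC2 Auto Recovery |
| **Container / orchestration** | Automatic Pod restart and rescheduling | Kubernetes liveness probes (restart the container) and readiness probes (remove the Pod from Service endpoints), restart policies |
| **Application** | Self-healing logic embedded in the application | Circuit Breaker pattern, graceful degradation |
| **Data** | Automatic failover | Multi-AZ RDS automatic failover, DynamoDB automatic replication |
| **Network** | Automatic traffic routing | Route 53 health checks + DNS failover, NLB automatic removal of unhealthy targets |
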
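
To make the application-layer row concrete, here is a minimal Circuit Breaker sketch with a fallback for graceful degradation. It assumes a simple consecutive-failure threshold and a fixed reset timeout. The class name, parameters, and the injectable `clock` are hypothetical choices for illustration, not the API of any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures    # consecutive failures before tripping
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func: Callable[[], T],
             fallback: Optional[Callable[[], T]] = None) -> T:
        if self.state == "open":
            # Fail fast instead of calling the unhealthy dependency.
            if fallback is not None:
                return fallback()  # graceful degradation
            raise CircuitOpenError("circuit is open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit immediately;
            # otherwise it trips once the threshold is reached.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might wrap a remote request as `breaker.call(fetch_live, fallback=read_cache)`, where both function names are placeholders. While the circuit is open the dependency is not called at all, which gives it time to recover.
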
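## Relationship with SRE

In [[SRE]] practice, self-healing is an important way to eliminate toil (repetitive manual work):

- **Lower Mean Time To Recovery (MTTR)**: automated recovery can be 10 to 100 times faster than a manual response
- **Less toil**: on-call engineers no longer have to handle predictable failure modes by hand
- **Error budget protection**: faster automatic recovery shortens each outage, so availability is higher and the error budget is consumed more slowly
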
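
The link between MTTR and the error budget is simple arithmetic, sketched below. The SLO, incident count, and MTTR figures in the usage note are made-up inputs chosen only to illustrate the calculation; they do not come from the source.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, incidents: int, mttr_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after `incidents` outages that each last `mttr_minutes`."""
    return error_budget_minutes(slo, window_days) - incidents * mttr_minutes
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes. Four incidents at an assumed 30-minute manual MTTR overspend it (`budget_remaining(0.999, 4, 30.0)` is negative), while the same four incidents at an assumed 1-minute automated MTTR leave about 39.2 minutes.
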
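## Connection to Recovery Assurance

[[Recovery-Assurance]] requires that a system can not only recover but also **guarantee** its ability to recover. Self-healing is one of the technical foundations of Recovery Assurance:

- **Continuous recoverability validation**: every self-healing test is itself a continuous validation of a recovery path
- **Less reliance on people**: manual coordination is the main cause of delays in DR testing, and self-healing removes that staffing bottleneck
- **A prerequisite for scale**: a system that cannot heal itself cannot guarantee recoverability at cloud-native scale
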
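## Self-Healing vs. Chaos Engineering

| Dimension | Self-Healing | Chaos Engineering |
|------|---------------------|---------------------------|
| **Purpose** | Recover automatically when a failure occurs | Proactively inject failures to validate system resilience |
| **Trigger** | Reactive (a failure occurs) | Proactive (an experiment injects a fault) |
| **Timing** | Runs during production failures | Runs as routine experiments |
| **Relationship** | Complementary: chaos engineering finds weaknesses → self-healing repairs failures | Complementary: chaos engineering finds weaknesses → self-healing repairs failures |
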
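
The complementary relationship can be shown with a toy experiment: a chaos step removes replicas on purpose, and a self-healing step is then expected to restore the steady state. `ReplicaSet` here is a hypothetical in-memory stand-in, not the Kubernetes object of the same name, and the whole block is a simulation under that assumption.

```python
import random
from typing import List


class ReplicaSet:
    """Toy stand-in for an orchestrator that keeps N replicas running."""

    def __init__(self, desired: int) -> None:
        self.desired = desired
        self.running: List[str] = [f"replica-{i}" for i in range(desired)]
        self._next_id = desired

    def reconcile(self) -> int:
        """Self-healing: start replacements until the desired count is met."""
        started = 0
        while len(self.running) < self.desired:
            self.running.append(f"replica-{self._next_id}")
            self._next_id += 1
            started += 1
        return started


def chaos_experiment(rs: ReplicaSet, kills: int, rng: random.Random) -> bool:
    """Chaos engineering: inject failures, then check the steady state returns."""
    for _ in range(min(kills, len(rs.running))):
        victim = rng.choice(rs.running)
        rs.running.remove(victim)  # injected fault
    degraded = len(rs.running) < rs.desired
    rs.reconcile()                 # the mechanism under test
    # The experiment passes only if a fault was injected and then healed.
    return degraded and len(rs.running) == rs.desired
```

Passing a seeded `random.Random` keeps the experiment repeatable, which matters when the same fault has to be replayed after a fix.
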
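## Implementation Pattern

```yaml
# Kubernetes self-healing manifest example (names, image, and CPU request are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      restartPolicy: Always              # the kubelet restarts failed containers automatically
      terminationGracePeriodSeconds: 30  # graceful shutdown
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:1.0  # placeholder image
          resources:
            requests:
              cpu: 100m  # assumed value; CPU utilization is measured against requests
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
---
# HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
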
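
For the liveness probe above to be useful, the application has to answer on `/healthz`. The sketch below shows one possible endpoint using only the Python standard library. The handler name, the `healthy` flag, and the `make_server` helper are assumptions for illustration. An HTTP probe counts any status from 200 up to but not including 400 as success, so the 503 response here counts as a probe failure.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag; a real service would derive this from internal checks.
    healthy = True

    def do_GET(self) -> None:
        if self.path != "/healthz":
            self._reply(404, b"not found")
        elif HealthHandler.healthy:
            self._reply(200, b"ok")
        else:
            self._reply(503, b"unhealthy")  # the probe treats this as a failure

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the logs


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `make_server().serve_forever()` listens on port 8080, matching the probe configuration. With `periodSeconds: 10` and `failureThreshold: 3`, the container is restarted after roughly 30 seconds of consecutive failed probes.
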
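## Related Concepts

- [[SRE]]: self-healing is a core SRE technique for eliminating toil and improving reliability
- [[Recovery-Assurance]]: self-healing is a technical foundation of Recovery Assurance
- [[Observability]]: self-healing depends on the telemetry that observability provides
- [[High-Availability]]: high availability provides the infrastructure that self-healing builds on
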
## Sources

- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]