Update nexus wiki content
wiki/concepts/Resilience.md
@@ -0,0 +1,49 @@
---
title: "Resilience"
type: concept
tags: [sre, reliability, engineering, fault-tolerance]
last_updated: 2026-04-20
---

# Resilience
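
Resilience is a system's ability to keep its services available in the face of failures, stress, and change. Building and maintaining system resilience is one of the core goals of SRE.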
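
## Definition
Resilience is more than "not failing"; it comprises:
- **Fault absorption**: the system can absorb and mitigate the impact of failures
- **Rapid recovery**: normal service is restored quickly after a failure
- **Adaptive learning**: the organization learns from failures and keeps improving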

## The 5 Things About Resilience That Cannot Be Automated
Uptime Labs summarizes five elements of resilience that cannot be automated:
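
### 1. Learning
Extracting lessons from failures and near misses and turning them into organizational knowledge.

### 2. Decision-Making
Making sound judgments under high pressure and choosing the best response strategy.

### 3. Prioritization
Deciding the order in which to handle problems when several occur at once.

### 4. Communication
Coordinating the team, notifying stakeholders, and managing expectations.

### 5. Adaptation
Adjusting the strategy to new circumstances instead of rigidly following a predefined playbook.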

## SRE Practices for Resilience
- [[BlamelessPostMortem]]: learning from failures
- [[Self-Healing]]: automated recovery mechanisms
- [[Observability]]: understanding system state
- [[Organizational-Second-Hit-Syndrome]]: understanding resilience at the organizational level
- [[Chaos-Engineering]]: proactively uncovering weaknesses

## Relationship to Other Concepts
- **Reliability** is a component of resilience
- **Fault Tolerance** is one of the means of achieving resilience
- **Incident Response** is how the resilient response is carried out in practice

## Source
- SRE Weekly Issue #513: [[sre-weekly-issue-513]]