Auto-sync: 2026-04-26 16:02

This commit is contained in:
2026-04-26 16:02:45 +08:00
parent 1abf0d56f5
commit d2ae5b3948
20 changed files with 1656 additions and 1731 deletions

@@ -1,79 +1,63 @@
# Error Budget
## Definition
An error budget is the permissible rate of errors and failures that a system can tolerate within a defined period without violating its reliability targets. It represents the "budget" of allowed failures before the reliability target (SLO) is breached.
Error Budget = 100% - Reliability Target (SLO)
Example: If your target is 99.9% uptime, your error budget is 0.1% of the period, or about 43.8 minutes of downtime in an average month.
## Role in DevOps Maturity
The DevOps Maturity Model explicitly lists Error Budget as one of the key metrics for measuring DevOps maturity.
### Error Budget Across Maturity Levels
| Maturity | Error Budget Usage |
|----------|-------------------|
| Phase 1 | No error budget concept — reactive to failures as they occur |
| Phase 2 | Awareness growing — teams begin to understand the cost of failures |
| Phase 3 | Error budgets not explicitly managed; standardization helps, but the budget is not yet measured |
| Phase 4 | Error budgets tracked — continuous monitoring enables measurement |
| Phase 5 | Error budgets actively used to drive deployment decisions — balancing innovation vs reliability |
## How Error Budgets Work
### The Concept
If your availability target is:
- **99.9% uptime**: 8.76 hours of downtime allowed per year (43.8 minutes per average month)
- **99.99% uptime**: 52.6 minutes of downtime allowed per year (4.38 minutes per average month)
The error budget is this allowance of bad events. Once it is depleted, deployment velocity must slow down until reliability improves.
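The downtime figures above follow directly from the target percentage. A minimal Python sketch, assuming a 365-day year and an average month of one twelfth of a year (the name `allowed_downtime_minutes` is illustrative and does not come from the source):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def allowed_downtime_minutes(slo_percent: float, window_minutes: float) -> float:
    """Return the downtime permitted by an availability target over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    # The error budget is whatever the target leaves over: 100% - target.
    error_budget_fraction = (100 - slo_percent) / 100
    return error_budget_fraction * window_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.99):
        per_year = allowed_downtime_minutes(slo, MINUTES_PER_YEAR)
        per_month = allowed_downtime_minutes(slo, MINUTES_PER_YEAR / 12)
        print(f"{slo}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Changing the window (for example, a rolling 30-day window instead of an average month) changes the figures slightly, so the window should be stated alongside the target.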
### Error Budget Policy Example
- If more than 50% of the error budget remains: Deploy freely (encourage experimentation)
- If 25-50% of the error budget remains: Proceed with caution, require additional testing
- If less than 25% of the error budget remains: Pause non-critical deployments until the budget recovers
- If the error budget is exhausted: Stop all deployments, focus on reliability
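The tiers in this example policy can be expressed as a small decision function. This is a sketch only: the function name `deployment_policy` is hypothetical, and the handling of the exact 25% and 50% boundaries (both placed in the "caution" tier) is an assumption, since the policy text does not say which side they fall on.

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining (1.0 = untouched) to an action.

    Assumption: exactly 0.25 and exactly 0.50 belong to the "caution" tier.
    """
    if budget_remaining <= 0:
        return "stop all deployments, focus on reliability"
    if budget_remaining < 0.25:
        return "pause non-critical deployments until budget recovers"
    if budget_remaining <= 0.50:
        return "proceed with caution, require additional testing"
    return "deploy freely"


if __name__ == "__main__":
    for remaining in (0.80, 0.40, 0.10, 0.0):
        print(f"{remaining:.0%} remaining: {deployment_policy(remaining)}")
```

A function like this could gate a deployment pipeline, but the thresholds themselves are a team decision rather than a fixed rule.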
## Error Budget and SLOs
| Concept | Role |
|---------|------|
| **SLO (Service Level Objective)** | The target reliability level (e.g., 99.9%) |
| **Error Budget** | The allowable failure budget derived from the SLO |
| **SLI (Service Level Indicator)** | The actual reliability measured |
Error Budgets operationalize SLOs by creating concrete incentives for balancing innovation and reliability.
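To make the relationship concrete, the sketch below derives the remaining error budget from an SLO and a measured SLI. It assumes a request-based SLI (good requests divided by total requests), which is one common choice but is not specified by the source; the function name `budget_remaining` is illustrative.

```python
def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Return the fraction of error budget left for a request-based SLI.

    slo is a fraction strictly between 0 and 1, such as 0.999.
    A negative result means the budget is overspent.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been consumed
    allowed_bad = (1 - slo) * total_events  # budget derived from the SLO
    actual_bad = total_events - good_events  # failures observed via the SLI
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # 1,000,000 requests at a 99.9% SLO allow about 1,000 failures.
    print(budget_remaining(0.999, good_events=999_500, total_events=1_000_000))
```

Here the SLO sets the allowance, the SLI supplies the observed failures, and the error budget is the difference between the two.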
## Business Impact
### Benefits of Error Budget Thinking
1. **Incentivizes reliability**: Teams are motivated to maintain system health
2. **Enables calculated risk-taking**: Clear budget allows confident experimentation
3. **Prevents over-engineering**: Don't build for 99.999% when 99.9% is the target
4. **Aligns business and engineering**: Both understand the reliability-investment trade-off
### Risks Without Error Budgets
- Over-investment in reliability beyond business needs
- Under-investment leading to frequent customer-facing failures
- Conflicting priorities between feature delivery and reliability
- No clear signal for when to slow down
## Error Budget vs Change Failure Rate
| Metric | Measures |
|--------|----------|
| **Error Budget** | Total allowable failures over a time period |
| **Change Failure Rate** | Percentage of deployments causing failures |
These metrics work together: a low Change Failure Rate (CFR) preserves the error budget, and a depleted error budget signals a need to improve CFR.
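A rough way to see the interaction is to estimate how much budget failed changes consume. The sketch below rests on strong simplifying assumptions that are not in the source: every failed deployment costs the same number of SLO-violating minutes, and failures are the only thing consuming budget. The function name and all inputs are hypothetical.

```python
def expected_budget_burn(
    deployments: int,
    change_failure_rate: float,
    minutes_per_failure: float,
    budget_minutes: float,
) -> float:
    """Estimate the fraction of the error budget consumed by failed changes.

    Assumption: each failed deployment costs a fixed minutes_per_failure.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    expected_failures = deployments * change_failure_rate
    return expected_failures * minutes_per_failure / budget_minutes


if __name__ == "__main__":
    # Same release cadence and impact per failure, two different CFRs.
    for cfr in (0.05, 0.15):
        burn = expected_budget_burn(40, cfr, 10, 43.8)
        print(f"CFR {cfr:.0%}: {burn:.0%} of the monthly budget consumed")
```

Under these assumptions, tripling the CFR triples the burn, which is why a depleted budget often points back at change quality.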
## Sources
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
## Related Concepts
- [[concepts/SLO]]
- [[concepts/Change-Failure-Rate]]
- [[concepts/DORA-Metrics]]
- [[concepts/High-Availability]]
- [[concepts/DevOps-Maturity]]
---
title: "Error Budget"
type: concept
tags: [SRE, Reliability, DevOps Metrics]
sources: [devops-maturity-model-from-traditional-it-to-advanced-devops]
last_updated: 2026-04-26
---
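## Definition
An error budget is the permitted amount or proportion of errors and failures that a system can absorb within a given time period. It is a risk-management tool for balancing reliability targets against the pace of innovation.
## Core Concept
The error budget comes from SRE (Site Reliability Engineering) practice. Its core idea is:
> If your service reliability target is 99.9%, you have a 0.1% "error budget" to spend on experiments and releases.
## Calculation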
```
Error Budget = (1 - Reliability SLO) × Time Period
Example:
- Monthly SLO = 99.9%
- Monthly error budget = 0.1% × 30 days × 24 hours = 0.72 hours (about 43 minutes)
```
## Position in the DevOps Maturity Model
Within the DevOps maturity metrics, error budget is an important indicator:
> "Error Budget — The permissible rate of errors and failures in production."
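How the error budget is used differs by DevOps maturity phase:
| Maturity Phase | Error Budget Usage |
|----------------|--------------------|
| Phase 1-2 | No formal error budget concept |
| Phase 3 | SLOs begin to be established, but the error budget is not fully used |
| Phase 4 | Explicit error budget policy, used to balance innovation and reliability |
| Phase 5 | Data-driven decisions; teams use the error budget autonomously for experiments |
## Relationship to Related Concepts
- [[MTTR]]: Error budget and MTTR together define the system's reliability curve
- [[Change Failure Rate]]: A high change failure rate burns through the error budget quickly
- [[Deployment Frequency]]: High deployment frequency must be paired with error budget management to keep reliability targets
- [[DevOps Maturity Model]]: Error budget is one of the key metrics for gauging organizational maturity
## Error Budget Policy Example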
```yaml
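SLO: "99.9%"  # about 43 minutes of error budget per month
policy:
  - "Budget ample (>50% remaining): release and experiment freely"
  - "Budget moderate (25-50% remaining): release with caution"
  - "Budget low (<25% remaining): freeze releases and focus on reliability"
  - "Budget exhausted: stop all non-critical changes"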
```
## Sources
- [[devops-maturity-model-from-traditional-it-to-advanced-devops]]