Error Budget
Definition
An error budget is the amount of unreliability (errors, failures, or downtime) that a system can accumulate within a defined period without violating its reliability target. It represents the "budget" of allowed failures before reliability SLAs are breached.
Error Budget = 100% - (Reliability Target)
Example: If your target is 99.9% uptime over a monthly window, your error budget is 0.1% of that month, or about 43.8 minutes of downtime in an average month.
Role in DevOps Maturity
The DevOps Maturity Model explicitly lists Error Budget as one of the key metrics for measuring DevOps maturity.
Error Budget Across Maturity Levels
| Maturity | Error Budget Usage |
|---|---|
| Phase 1 | No error budget concept; teams react to failures as they occur |
| Phase 2 | Awareness growing; teams begin to understand the cost of failures |
| Phase 3 | Error budgets not explicitly managed; standardization helps, but budgets are not yet measured |
| Phase 4 | Error budgets tracked; continuous monitoring enables measurement |
| Phase 5 | Error budgets actively drive deployment decisions, balancing innovation against reliability |
How Error Budgets Work
The Concept
If your uptime target is:
- 99.9% uptime: 8.76 hours of downtime allowed per year (43.8 minutes per month)
- 99.99% uptime: 52.6 minutes of downtime allowed per year (4.38 minutes per month)
These figures assume a 365-day year, with a month taken as one twelfth of it.
The "error budget" is the quantity of bad events (downtime, failed requests) that the target permits. Once it is depleted, deployment velocity must slow down until reliability improves.
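The downtime figures above follow directly from the target percentage and the length of the window. A minimal sketch of that arithmetic, assuming a 365-day year and a month of one twelfth of a year; the function name `allowed_downtime` and the constants `YEAR` and `MONTH` are illustrative, not part of any standard tool:

```python
from datetime import timedelta

# Assumption: a non-leap year, and an "average" month of 1/12 of that year.
YEAR = timedelta(days=365)
MONTH = YEAR / 12


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an uptime target permits over `period`."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    # Error budget as a fraction: 100% minus the target.
    budget_fraction = (100 - slo_percent) / 100
    return period * budget_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99):
        yearly = allowed_downtime(target, YEAR)
        monthly = allowed_downtime(target, MONTH)
        print(f"{target}%: {yearly.total_seconds() / 3600:.2f} h/year, "
              f"{monthly.total_seconds() / 60:.2f} min/month")
```

Changing `period` to a rolling 28-day or 30-day window gives slightly different monthly figures, so the window should be stated alongside the target.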
Error Budget Policy Example
- If more than 50% of the error budget remains: Deploy freely (encourage experimentation)
- If 25-50% of the error budget remains: Proceed with caution, require additional testing
- If less than 25% of the error budget remains: Pause non-critical deployments until budget recovers
- If the error budget is exhausted: Stop all deployments, focus on reliability
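The policy tiers above can be encoded as a simple lookup. This is a sketch under stated assumptions: the function name `deployment_policy` is hypothetical, the input is the fraction of budget remaining (0.0 to 1.0), and the boundary values 25% and 50% are assigned to the "caution" tier, which the policy itself leaves open.

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to a policy tier."""
    if budget_remaining <= 0:
        # Budget exhausted (or overspent).
        return "stop all deployments"
    if budget_remaining < 0.25:
        return "pause non-critical deployments"
    if budget_remaining <= 0.50:
        # Assumption: exactly 25% and exactly 50% fall in this tier.
        return "proceed with caution"
    return "deploy freely"


if __name__ == "__main__":
    for remaining in (0.80, 0.40, 0.10, 0.0):
        print(f"{remaining:.0%} remaining: {deployment_policy(remaining)}")
```

A check like this could gate a deployment pipeline, though the thresholds and the actions attached to them are for each team to agree on.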
Error Budget and SLOs
| Concept | Role |
|---|---|
| SLO (Service Level Objective) | The target reliability level (e.g., 99.9%) |
| Error Budget | The allowable failure budget derived from the SLO |
| SLI (Service Level Indicator) | The reliability actually measured (e.g., the fraction of successful requests) |
Error Budgets operationalize SLOs by creating concrete incentives for balancing innovation and reliability.
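To make the relationship between the three concepts concrete, the sketch below derives the remaining budget from an SLO and an event-based SLI. It assumes the SLI is the ratio of good events to total events in the window; the names `sli` and `budget_remaining` are illustrative.

```python
def sli(good_events: int, total_events: int) -> float:
    """Measured reliability: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    `slo` is a fraction such as 0.999, not a percentage.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo) * total_events   # the error budget, in events
    actual_bad = total_events - good_events  # budget spent so far
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    good, total = 999_750, 1_000_000
    print(f"SLI: {sli(good, total):.5f}")
    print(f"Budget remaining: {budget_remaining(0.999, good, total):.1%}")
```

Returning a negative value when the budget is overspent, rather than clamping at zero, keeps the size of the overrun visible.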
Business Impact
Benefits of Error Budget Thinking
- Incentivizes reliability: Teams are motivated to maintain system health
- Enables calculated risk-taking: Clear budget allows confident experimentation
- Prevents over-engineering: Don't build for 99.999% when 99.9% is the target
- Aligns business and engineering: Both understand the reliability-investment trade-off
Risks Without Error Budgets
- Over-investment in reliability beyond business needs
- Under-investment leading to frequent customer-facing failures
- Conflicting priorities between feature delivery and reliability
- No clear signal for when to slow down
Error Budget vs Change Failure Rate
| Metric | Measures |
|---|---|
| Error Budget | Total allowable failures over a time period |
| Change Failure Rate | Percentage of deployments causing failures |
These metrics work together: a low Change Failure Rate (CFR) preserves the error budget, and a depleted error budget signals a need to improve CFR.
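The link between the two metrics can be illustrated by computing both from the same deployment history. This is a minimal sketch: the `Deployment` record, its fields, and both function names are hypothetical, and it assumes the downtime caused by each failed deployment is known in minutes.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record; a real pipeline would supply its own fields.
    caused_failure: bool
    downtime_minutes: float = 0.0


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)


def budget_consumed_by_changes(deployments: list[Deployment],
                               budget_minutes: float) -> float:
    """Fraction of the downtime budget spent on failed deployments."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    downtime = sum(d.downtime_minutes for d in deployments if d.caused_failure)
    return downtime / budget_minutes


if __name__ == "__main__":
    history = [Deployment(False), Deployment(True, 10.95),
               Deployment(False), Deployment(False)]
    print(f"CFR: {change_failure_rate(history):.0%}")
    print(f"Budget consumed: {budget_consumed_by_changes(history, 43.8):.0%}")
```

Failures unrelated to deployments also draw on the budget, so this covers only the change-driven share of it.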