# Error Budget

## Definition

An error budget is the permissible amount of errors and failures that a system can tolerate within a defined period without violating its reliability target. It represents the "budget" of allowed failures before reliability SLAs are breached.

`Error Budget = 100% - Reliability Target`

Example: if your target is 99.9% uptime, your error budget is 0.1% of the period, which is about 43.8 minutes of downtime per month.

## Role in DevOps Maturity

The DevOps Maturity Model explicitly lists Error Budget as one of the key metrics for measuring DevOps maturity.

### Error Budget Across Maturity Levels

| Maturity | Error Budget Usage |
|----------|-------------------|
| Phase 1 | No error budget concept; teams react to failures as they occur |
| Phase 2 | Awareness growing; teams begin to understand the cost of failures |
| Phase 3 | Error budgets not explicitly managed; standardization helps, but reliability is not yet measured |
| Phase 4 | Error budgets tracked; continuous monitoring enables measurement |
| Phase 5 | Error budgets actively used to drive deployment decisions, balancing innovation against reliability |

## How Error Budgets Work

### The Concept

If your system targets:

- **99.9% uptime**: 8.76 hours of downtime allowed per year (43.8 minutes per month)
- **99.99% uptime**: 52.6 minutes of downtime allowed per year (4.38 minutes per month)

The error budget is the allowance of bad events. Once it is depleted, deployment velocity must slow down until reliability improves.

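
The downtime figures above follow directly from the formula in the Definition section. As a minimal sketch, assuming a 365-day year and a month taken as one twelfth of that year (the function and constant names are illustrative, not from the source):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def allowed_downtime_minutes(slo_percent: float, period_hours: float) -> float:
    """Minutes of downtime permitted in a period at the given uptime target."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # Error budget = 100% - reliability target, expressed as a fraction.
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * period_hours * 60


for slo in (99.9, 99.99):
    per_year = allowed_downtime_minutes(slo, HOURS_PER_YEAR)
    per_month = allowed_downtime_minutes(slo, HOURS_PER_YEAR / 12)
    print(f"{slo}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

For a 99.9% target this gives 525.6 minutes per year, which is the 8.76 hours listed above. A calendar month of 30 days would give a slightly smaller monthly figure (43.2 minutes), so the choice of period should be stated alongside any budget.
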
### Error Budget Policy Example

- If more than 50% of the error budget remains: deploy freely (encourage experimentation)
- If 25-50% of the error budget remains: proceed with caution and require additional testing
- If less than 25% of the error budget remains: pause non-critical deployments until the budget recovers
- If the error budget is exhausted: stop all deployments and focus on reliability

## Error Budget and SLOs

| Concept | Role |
|---------|------|
| **SLO (Service Level Objective)** | The target reliability level (e.g., 99.9%) |
| **Error Budget** | The allowable failure budget derived from the SLO |
| **SLI (Service Level Indicator)** | The reliability actually measured, which is compared against the SLO |

Error budgets operationalize SLOs by creating concrete incentives for balancing innovation and reliability.

## Business Impact

### Benefits of Error Budget Thinking

1. **Incentivizes reliability**: Teams are motivated to maintain system health
2. **Enables calculated risk-taking**: A clear budget allows confident experimentation
3. **Prevents over-engineering**: Don't build for 99.999% when 99.9% is the target
4. **Aligns business and engineering**: Both understand the trade-off between reliability and investment

### Risks Without Error Budgets

- Over-investment in reliability beyond business needs
- Under-investment leading to frequent customer-facing failures
- Conflicting priorities between feature delivery and reliability
- No clear signal for when to slow down

## Error Budget vs Change Failure Rate

| Metric | Measures |
|--------|----------|
| **Error Budget** | Total allowable failures over a time period |
| **Change Failure Rate** | Percentage of deployments causing failures |

These metrics work together: a low Change Failure Rate (CFR) preserves the error budget, and a depleted error budget signals a need to improve CFR.

## Sources

- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]

## Related Concepts

- [[concepts/SLO]]
- [[concepts/Change-Failure-Rate]]
- [[concepts/DORA-Metrics]]
- [[concepts/High-Availability]]
- [[concepts/DevOps-Maturity]]
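
## Worked Example: Applying the Policy

The thresholds in the Error Budget Policy Example can be turned into a small decision function. The sketch below is illustrative only: it assumes the budget is counted in bad events (for example, failed requests) against a total event count, and the names `budget_remaining` and `deployment_policy` are hypothetical rather than part of any tool or of the source.

```python
def budget_remaining(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    # Allowed bad events = (100% - SLO target) applied to all events in the period.
    allowed_bad = (100 - slo_percent) / 100 * total_events
    if allowed_bad <= 0:
        raise ValueError("SLO and event count leave no error budget to spend")
    return 1 - bad_events / allowed_bad


def deployment_policy(remaining: float) -> str:
    """Map the remaining budget fraction to the example policy tiers."""
    if remaining <= 0:
        return "stop all deployments"
    if remaining < 0.25:
        return "pause non-critical deployments"
    if remaining <= 0.50:
        return "proceed with caution"
    return "deploy freely"


# Hypothetical period: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failed requests are allowed.
for failed in (400, 600, 800, 1200):
    remaining = budget_remaining(99.9, 1_000_000, failed)
    print(failed, deployment_policy(remaining))
```

The policy list leaves the exact boundaries open, so this sketch makes a choice: exactly 25% and exactly 50% remaining both fall into the "proceed with caution" tier. A real policy should state those boundaries explicitly.
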