2026-04-21 20:03:06 +08:00

Error Budget

Definition

An error budget is the amount of errors and failures a system can tolerate within a defined period without violating its reliability target. It is the "budget" of allowed failures before the service level objective (SLO) is breached.

Error Budget = 100% - (Reliability Target)

Example: If your target is 99.9% uptime, your error budget is 0.1% downtime, or about 43.8 minutes in an average month.

Role in DevOps Maturity

The DevOps Maturity Model explicitly lists Error Budget as one of the key metrics for measuring DevOps maturity.

Error Budget Across Maturity Levels

| Maturity | Error Budget Usage |
|----------|--------------------|
| Phase 1 | No error budget concept; teams react to failures as they occur |
| Phase 2 | Awareness growing; teams begin to understand the cost of failures |
| Phase 3 | Error budgets not explicitly managed; standardization helps, but budgets are not measured |
| Phase 4 | Error budgets tracked; continuous monitoring enables measurement |
| Phase 5 | Error budgets actively used to drive deployment decisions, balancing innovation against reliability |

How Error Budgets Work

The Concept

If your reliability target is:

  • 99.9% uptime: 8.76 hours of downtime allowed per year (43.8 minutes per month)
  • 99.99% uptime: 52.6 minutes of downtime allowed per year (4.38 minutes per month)
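
The figures above follow directly from the budget formula. The sketch below is one way to reproduce them, assuming a 365-day year and an "average month" of one twelfth of a year; the function name `allowed_downtime` is illustrative and does not come from any particular tool.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target permits over `period`."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_fraction = (100 - slo_percent) / 100  # e.g. 99.9 -> 0.001
    return period * budget_fraction


year = timedelta(hours=HOURS_PER_YEAR)
month = year / 12  # average month, an assumption of this sketch

for slo in (99.9, 99.99):
    per_year = allowed_downtime(slo, year)
    per_month = allowed_downtime(slo, month)
    print(f"{slo}%: {per_year.total_seconds() / 3600:.2f} h/year, "
          f"{per_month.total_seconds() / 60:.2f} min/month")
```

A different window (for example a rolling 28 or 30 days) gives slightly different per-month numbers, so the window should be stated alongside the target.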

The "error budget" is the amount of unreliability the target permits. Once it is depleted, deployment velocity should slow down until reliability improves.

Error Budget Policy Example

  • If more than 50% of the error budget remains: Deploy freely (encourage experimentation)
  • If 25-50% of the error budget remains: Proceed with caution, require additional testing
  • If less than 25% of the error budget remains: Pause non-critical deployments until the budget recovers
  • If the error budget is exhausted: Stop all deployments, focus on reliability
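
A policy like this is simple enough to encode as a deployment gate. The following is a minimal sketch, assuming the remaining budget is supplied as a fraction between 0.0 and 1.0 by some monitoring system; the function name and the returned strings are hypothetical, and the thresholds are just the example values listed above.

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to a deployment action.

    `budget_remaining` is 1.0 for an untouched budget and 0.0 (or less)
    once the budget is exhausted.
    """
    if budget_remaining <= 0:
        return "stop all deployments"
    if budget_remaining < 0.25:
        return "pause non-critical deployments"
    if budget_remaining <= 0.50:
        return "proceed with caution"
    return "deploy freely"


for remaining in (0.80, 0.40, 0.10, 0.0):
    print(f"{remaining:.0%} remaining -> {deployment_policy(remaining)}")
```

In practice the thresholds and actions would be agreed between engineering and business stakeholders rather than hard-coded.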

Error Budget and SLOs

| Concept | Role |
|---------|------|
| SLO (Service Level Objective) | The target reliability level (e.g., 99.9%) |
| Error Budget | The allowable failure budget derived from the SLO |
| SLI (Service Level Indicator) | The measured reliability (e.g., the fraction of successful requests) |

Error Budgets operationalize SLOs by creating concrete incentives for balancing innovation and reliability.
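
To make the relationship between the three concepts concrete, here is a rough sketch that derives the SLI and the remaining budget from event counts. It assumes a request-based SLI (good events divided by total events) and an SLO expressed as a fraction; `BudgetReport` and `budget_report` are illustrative names, and the counts are made-up example values.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float               # measured fraction of good events
    allowed_bad: float       # bad events the SLO permits in this window
    budget_remaining: float  # fraction of budget left (negative if overspent)


def budget_report(good_events: int, total_events: int, slo: float) -> BudgetReport:
    """Compare measured reliability (SLI) against the target (SLO)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    bad_events = total_events - good_events
    allowed_bad = (1 - slo) * total_events  # the error budget, in events
    return BudgetReport(
        sli=good_events / total_events,
        allowed_bad=allowed_bad,
        budget_remaining=1 - bad_events / allowed_bad,
    )


# Hypothetical window: 1,000,000 requests, 600 of them failed, 99.9% SLO.
report = budget_report(good_events=999_400, total_events=1_000_000, slo=0.999)
print(report)
```

With these example numbers, 600 bad events against roughly 1,000 allowed leaves about 40% of the budget.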

Business Impact

Benefits of Error Budget Thinking

  1. Incentivizes reliability: Teams are motivated to maintain system health
  2. Enables calculated risk-taking: Clear budget allows confident experimentation
  3. Prevents over-engineering: Don't build for 99.999% when 99.9% is the target
  4. Aligns business and engineering: Both understand the reliability-investment trade-off

Risks Without Error Budgets

  • Over-investment in reliability beyond business needs
  • Under-investment leading to frequent customer-facing failures
  • Conflicting priorities between feature delivery and reliability
  • No clear signal for when to slow down

Error Budget vs Change Failure Rate

| Metric | Measures |
|--------|----------|
| Error Budget | Total allowable failures over a time period |
| Change Failure Rate | Percentage of deployments causing failures |

These metrics work together: a low change failure rate (CFR) preserves the error budget, and a depleted error budget can signal a need to improve CFR.
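
The interaction between the two metrics can be illustrated with a small calculation. This is only a sketch under simplifying assumptions: every failed deployment is taken to cost the same amount of downtime, and all names and numbers are hypothetical.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Fraction of deployments that caused a failure."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return failed_deployments / total_deployments


def budget_spent_on_changes(
    failed_deployments: int,
    minutes_lost_per_failure: float,
    budget_minutes: float,
) -> float:
    """Fraction of the downtime budget consumed by failed deployments.

    Assumes each failure costs the same number of minutes, which is a
    simplification made for this example.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return failed_deployments * minutes_lost_per_failure / budget_minutes


# Hypothetical month: 40 deployments, 4 failures, about 6 minutes lost each,
# measured against the 43.8-minute monthly budget of a 99.9% target.
cfr = change_failure_rate(failed_deployments=4, total_deployments=40)
spent = budget_spent_on_changes(4, minutes_lost_per_failure=6.0, budget_minutes=43.8)
print(f"CFR: {cfr:.0%}, budget spent on failed changes: {spent:.0%}")
```

In this made-up scenario, halving the number of failed deployments would also halve the share of the budget they consume, which is the sense in which a lower CFR preserves error budget.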
