Update nexus: fix conflicts and sync local changes
@@ -1,79 +1,79 @@
# Error Budget

## Definition
An error budget is the amount of errors and failures a system can tolerate within a defined period without violating its reliability target. It represents the "budget" of allowed failures before reliability commitments such as SLAs are breached.

Error Budget = 100% - Reliability Target

Example: If your target is 99.9% uptime, your error budget is 0.1% of the period, or about 43.8 minutes of downtime in an average month.
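
As a minimal sketch of the formula above, the following Python computes an error budget from a reliability target. The function name `error_budget_percent` is a hypothetical helper introduced here for illustration; it is not taken from the source or from any library.

```python
def error_budget_percent(target_percent: float) -> float:
    """Return the error budget (in percent) for a reliability target (in percent)."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    # Error Budget = 100% - reliability target
    return 100.0 - target_percent


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        # Round for display only; floating-point subtraction is not exact.
        budget = error_budget_percent(target)
        print(f"target {target}% -> error budget {budget:.2f}%")
```

The result is a percentage of whatever period the target covers, so the same 0.1% budget means a different number of minutes for a week, a month, or a year.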

## Role in DevOps Maturity

The DevOps Maturity Model explicitly lists Error Budget as one of the key metrics for measuring DevOps maturity.

### Error Budget Across Maturity Levels
| Maturity Phase | Error Budget Usage |
|----------------|--------------------|
| Phase 1 | No error budget concept; teams react to failures as they occur |
| Phase 2 | Awareness growing; teams begin to understand the cost of failures |
| Phase 3 | Error budgets not explicitly managed; standardization helps, but budgets are not yet measured |
| Phase 4 | Error budgets tracked; continuous monitoring enables measurement |
| Phase 5 | Error budgets actively used to drive deployment decisions, balancing innovation against reliability |

## How Error Budgets Work

### The Concept
If your availability target is (assuming a 365-day year):
- **99.9% uptime**: 8.76 hours of downtime allowed per year (43.8 minutes per month)
- **99.99% uptime**: 52.6 minutes of downtime allowed per year (4.38 minutes per month)

The error budget is the allowance for bad events. Once it is depleted, deployment velocity must slow down until reliability improves.
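
The downtime figures above can be reproduced with a short calculation. This is a sketch under two stated assumptions: a year is 365 days (8,760 hours), and an "average month" is one twelfth of that. `allowed_downtime_minutes` and `HOURS_PER_YEAR` are hypothetical names chosen for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_minutes(target_percent: float, window_hours: float) -> float:
    """Return the minutes of downtime a window can absorb at the given uptime target."""
    budget_fraction = (100.0 - target_percent) / 100.0
    return budget_fraction * window_hours * 60


if __name__ == "__main__":
    for target in (99.9, 99.99):
        per_year = allowed_downtime_minutes(target, HOURS_PER_YEAR)
        # Assumption: an average month is one twelfth of a year (730 hours).
        per_month = allowed_downtime_minutes(target, HOURS_PER_YEAR / 12)
        print(f"{target}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Under these assumptions a 99.9% target allows about 525.6 minutes per year, which is the 8.76 hours listed above.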

### Error Budget Policy Example
- More than 50% of the error budget remaining: Deploy freely (encourage experimentation)
- 25-50% remaining: Proceed with caution, require additional testing
- Less than 25% remaining: Pause non-critical deployments until the budget recovers
- Budget exhausted: Stop all deployments, focus on reliability
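
The example policy above could be encoded as a simple lookup, sketched here in Python. The function name `deployment_policy` is hypothetical, and the handling of the boundaries is an assumption: the tiers above do not say which one exactly 25% or exactly 50% falls into, so this sketch places both in the "caution" tier.

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining (1.0 = untouched) to an action."""
    if budget_remaining <= 0:
        return "stop all deployments, focus on reliability"
    if budget_remaining < 0.25:
        return "pause non-critical deployments"
    # Assumption: exactly 25% and exactly 50% both count as the caution tier.
    if budget_remaining <= 0.50:
        return "proceed with caution, require additional testing"
    return "deploy freely"


if __name__ == "__main__":
    for remaining in (0.80, 0.40, 0.10, 0.0):
        print(f"{remaining:.0%} remaining -> {deployment_policy(remaining)}")
```

Keeping the thresholds in one function makes the policy easy to review and change when the team renegotiates it.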

## Error Budget and SLOs

| Concept | Role |
|---------|------|
| **SLO (Service Level Objective)** | The target reliability level (e.g., 99.9%) |
| **Error Budget** | The allowable failure budget derived from the SLO |
| **SLI (Service Level Indicator)** | The reliability actually measured |

Error budgets operationalize SLOs by creating concrete incentives for balancing innovation and reliability.
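
To illustrate how the three concepts in the table relate, here is a minimal sketch that derives the remaining budget from an SLO and a measured SLI. It assumes an event-based SLI (good events divided by total events), which is one common choice and not something the source specifies; `measured_sli` and `budget_remaining` are hypothetical names.

```python
def measured_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events  # budget derived from the SLO
    actual_bad = total_events - good_events   # failures observed via the SLI
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 500 of them failed.
    sli = measured_sli(999_500, 1_000_000)
    remaining = budget_remaining(0.999, 999_500, 1_000_000)
    print(f"SLI {sli:.4%}, budget remaining {remaining:.0%}")
```

With a 99.9% SLO over one million requests, the budget is roughly 1,000 failed requests, so 500 failures leave about half of it unspent.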

## Business Impact

### Benefits of Error Budget Thinking
1. **Incentivizes reliability**: Teams are motivated to maintain system health
2. **Enables calculated risk-taking**: A clear budget allows confident experimentation
3. **Prevents over-engineering**: Don't build for 99.999% when 99.9% is the target
4. **Aligns business and engineering**: Both understand the trade-off between reliability and investment

### Risks Without Error Budgets
- Over-investment in reliability beyond business needs
- Under-investment, leading to frequent customer-facing failures
- Conflicting priorities between feature delivery and reliability
- No clear signal for when to slow down

## Error Budget vs Change Failure Rate

| Metric | Measures |
|--------|----------|
| **Error Budget** | Total allowable failures over a time period |
| **Change Failure Rate** | Percentage of deployments causing failures |

These metrics work together: a low Change Failure Rate (CFR) preserves the error budget, and a depleted error budget signals a need to improve CFR.
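
The relationship between the two metrics can be sketched by computing both from the same deployment log. Everything here is hypothetical and for illustration only: the `Deployment` record, its fields, and the idea of attributing downtime minutes to individual deployments are assumptions, not something the source defines.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    caused_failure: bool
    downtime_minutes: float = 0.0  # budget consumed by this deployment, if any


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of deployments that caused a failure."""
    if not deployments:
        return 0.0
    failures = sum(d.caused_failure for d in deployments)
    return 100.0 * failures / len(deployments)


def budget_consumed_by_changes(deployments: list[Deployment]) -> float:
    """Total downtime (minutes) attributable to failed deployments."""
    return sum(d.downtime_minutes for d in deployments if d.caused_failure)


if __name__ == "__main__":
    # Illustrative log: 20 deployments, 2 of which caused downtime.
    log = [Deployment(False)] * 18 + [Deployment(True, 10.0), Deployment(True, 15.0)]
    print(f"CFR: {change_failure_rate(log):.0f}%")
    print(f"Budget consumed by changes: {budget_consumed_by_changes(log):.1f} min")
```

In this made-up log, two failed deployments out of twenty consume 25 minutes, more than half of a 43.8-minute monthly budget, which is the kind of signal that would prompt work on lowering CFR.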

## Sources
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]

## Related Concepts
- [[concepts/SLO]]
- [[concepts/Change-Failure-Rate]]
- [[concepts/DORA-Metrics]]
- [[concepts/High-Availability]]
- [[concepts/DevOps-Maturity]]