# Change Failure Rate

## Definition

Change Failure Rate (CFR) is the percentage of deployments that cause a failure in production, such as a service outage, degraded performance, or an incident that requires a hotfix, rollback, or patch.

CFR is one of the four core **DORA metrics** used to measure DevOps performance.

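The definition above reduces to a simple ratio: failed deployments divided by total deployments, times 100. The following is a minimal sketch of that arithmetic. The `Deployment` record and its `caused_failure` field are hypothetical names chosen for illustration; how a team decides that a deployment "caused a failure" is an assumption left to its own incident process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    deploy_id: str
    # True if the deployment led to an outage, degradation, hotfix, rollback, or patch.
    caused_failure: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Return CFR as a percentage: failed deployments / total deployments * 100."""
    if not deployments:
        raise ValueError("CFR is undefined when there are no deployments")
    failed = sum(1 for d in deployments if d.caused_failure)
    return 100.0 * failed / len(deployments)


history = [
    Deployment("d1", False),
    Deployment("d2", True),
    Deployment("d3", False),
    Deployment("d4", False),
]
print(f"CFR: {change_failure_rate(history):.1f}%")  # 1 of 4 failed -> CFR: 25.0%
```

Raising on an empty history is a design choice in this sketch: reporting 0% for a period with no deployments would look like perfect stability rather than missing data.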
## Why Change Failure Rate Matters

A low change failure rate indicates:
- High confidence in the deployment process
- Robust testing and quality assurance
- Effective risk management
- Mature operational practices

A high change failure rate means:
- Frequent production incidents
- Unstable deployments
- Low team confidence
- Customer impact

## Across DevOps Maturity Levels

| Maturity | Change Failure Rate Characteristic |
|----------|-----------------------------------|
| Phase 1 | High: manual processes, no automated testing, siloed teams, security checked only at release |
| Phase 2 | Improving: unit, integration, and end-to-end tests in place, but security still handled separately |
| Phase 3 | Lower: automated infrastructure, security scans integrated throughout development |
| Phase 4 | Significantly reduced: performance and load testing, immutable infrastructure, dependency vulnerability management |
| Phase 5 | 0-15% (elite): zero human intervention, real-time data-driven decisions, integrated security that blocks non-compliant code |

## Elite Performance Benchmark (DORA)

- **Elite performers**: 0-15% change failure rate
- **High performers**: 16-30% change failure rate
- **Medium performers**: 16-30% change failure rate
- **Low performers**: 31-100% change failure rate

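The bands above can be turned into a small lookup, sketched below. The function name `performance_tiers` is hypothetical. Because the list gives high and medium performers the same 16-30% band, the sketch returns both labels for that range rather than guessing between them. It also assumes that fractional values falling between two listed bands (for example 15.5%) belong to the next band up.

```python
def performance_tiers(cfr_percent: float) -> list[str]:
    """Map a CFR percentage to the tier label(s) from the benchmark list."""
    if not 0.0 <= cfr_percent <= 100.0:
        raise ValueError("CFR must be between 0 and 100")
    if cfr_percent <= 15.0:
        return ["elite"]
    if cfr_percent <= 30.0:
        # High and medium share the 16-30% band, so CFR alone
        # cannot tell them apart; the other DORA metrics are needed.
        return ["high", "medium"]
    return ["low"]


for value in (4.0, 22.5, 45.0):
    print(value, performance_tiers(value))
```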
## Types of Failed Changes

- Production outages
- Service degradations
- Data corruption
- Security vulnerabilities introduced
- Performance regressions
- Failed rollbacks

## How to Reduce Change Failure Rate

### Technical Practices

- Comprehensive test automation (unit, integration, E2E)
- Feature flags for gradual rollouts
- Canary deployments
- Blue-green deployments
- Automated rollback mechanisms
- Chaos engineering to find weaknesses before production

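Two of the practices above, gradual rollouts behind a feature flag and an automated rollback decision for a canary, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference to any particular feature-flag or deployment tool: `in_rollout`, `should_roll_back`, the flag name, and the `tolerance` threshold are all hypothetical.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def should_roll_back(baseline_error_rate: float,
                     canary_error_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Roll back when the canary's error rate exceeds the baseline by more than `tolerance`."""
    return canary_error_rate - baseline_error_rate > tolerance


enabled = sum(in_rollout(f"user-{i}", "new-checkout", 10) for i in range(10_000))
print(f"{enabled} of 10000 users see the flag at 10%")
print(should_roll_back(0.002, 0.050))  # True: canary is far worse than baseline
print(should_roll_back(0.002, 0.004))  # False: difference is within tolerance
```

Hashing the user ID together with the flag name keeps each user's bucket stable, so raising `percent` only adds users and never removes anyone who already has the feature.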
### Process Improvements

- Code review requirements
- Security scanning in CI/CD pipeline
- Staging environment parity with production
- Small batch sizes to limit blast radius
- Dependency management and vulnerability scanning

### Cultural Factors

- Blameless post-mortems
- Learning from failures
- Psychological safety to report issues
- Shared ownership of reliability

## Relationship with Other DORA Metrics

- **Deployment Frequency**: High deployment frequency combined with a low CFR indicates elite performance
- **Lead Time**: Shorter lead times with a consistently low CFR indicate high performance
- **MTTR**: A lower CFR means fewer incidents to recover from, so less total time is spent restoring service; MTTR then captures how quickly each remaining failure is resolved

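To see how these metrics sit side by side, the sketch below derives deployment frequency, CFR, and MTTR from one list of deployment records. The `Deploy` record and `summarize` function are hypothetical, and the sketch assumes restore time is measured from the moment of deployment, which is a simplification: teams often measure from when the failure was detected. Lead time is omitted because it needs commit timestamps that this record shape does not carry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    """One deployment (hypothetical record shape)."""
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def summarize(deploys: list[Deploy], window_days: int) -> dict[str, float]:
    """Compute deployment frequency, CFR, and MTTR over a reporting window."""
    failures = [d for d in deploys if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at) / timedelta(hours=1)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "cfr_percent": 100.0 * len(failures) / len(deploys) if deploys else 0.0,
        "mttr_hours": sum(restore_hours) / len(restore_hours) if restore_hours else 0.0,
    }


start = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(start),
    Deploy(start + timedelta(days=1), failed=True,
           restored_at=start + timedelta(days=1, hours=2)),
    Deploy(start + timedelta(days=2)),
    Deploy(start + timedelta(days=3)),
]
print(summarize(deploys, window_days=4))
# {'deploys_per_day': 1.0, 'cfr_percent': 25.0, 'mttr_hours': 2.0}
```

Note that MTTR here is an average over failed deployments only, so it does not move when CFR drops; what shrinks is the number of incidents contributing to it.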
## Sources

- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
- [[sources/cloud-devop-maturity-guideline.md]]

## Related Concepts

- [[concepts/DORA-Metrics]]
- [[concepts/Continuous-Deployment]]
- [[concepts/DevOps-Maturity]]
- [[concepts/Error-Budget]]
- [[concepts/Rollback-Rate]]