Change Failure Rate

Definition

Change Failure Rate (CFR) is the percentage of production deployments that cause a failure requiring remediation, such as a service outage, degraded performance, or an incident that needs a hotfix, rollback, or patch.

Change Failure Rate is one of the four core DORA metrics used to measure DevOps performance, alongside deployment frequency, lead time for changes, and time to restore service (MTTR).
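
The definition above reduces to a simple ratio: failed deployments divided by total deployments, times 100. A minimal sketch of that arithmetic follows. The `Deployment` record and its `caused_failure` flag are hypothetical names chosen for illustration; how a team decides that a deployment "caused a failure" is an assumption left outside the code.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    deploy_id: str
    # True if the deployment led to an outage, degradation, hotfix, rollback, or patch
    caused_failure: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Return CFR as the percentage of deployments that caused a failure."""
    if not deployments:
        raise ValueError("CFR is undefined when there are no deployments")
    failed = sum(1 for d in deployments if d.caused_failure)
    return 100.0 * failed / len(deployments)


# Illustrative data only: 3 failed deployments out of 20
history = [Deployment(f"d{i}", caused_failure=(i < 3)) for i in range(20)]
print(f"CFR: {change_failure_rate(history):.1f}%")  # CFR: 15.0%
```

The denominator is the count of deployments, not the count of incidents, so one deployment that triggers several incidents still counts once.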

Why Change Failure Rate Matters

A low change failure rate indicates:

  • High confidence in the deployment process
  • Robust testing and quality assurance
  • Effective risk management
  • Mature operational practices

A high change failure rate means:

  • Frequent production incidents
  • Unstable deployments
  • Low team confidence
  • Customer impact

Across DevOps Maturity Levels

| Maturity | Change Failure Rate Characteristic |
|---|---|
| Phase 1 | High: manual processes, no automated testing, siloed teams, security only at release |
| Phase 2 | Improving: unit, integration, and end-to-end tests implemented, but security separate |
| Phase 3 | Lower: automated infrastructure, security scans integrated throughout development |
| Phase 4 | Significantly reduced: performance/load testing, immutable infrastructure, dependency vulnerability management |
| Phase 5 | 0-15% (elite): zero human intervention, real-time data decisions, high-level security integration prevents non-compliant code |

Performance Benchmarks (DORA)

  • Elite performers: 0-15% change failure rate
  • High performers: 16-30% change failure rate
  • Medium performers: 16-30% change failure rate
  • Low performers: 31-100% change failure rate
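
As a rough illustration, the bands listed above can be turned into a lookup. This is a sketch under stated assumptions, not an official DORA classifier: the function name `cfr_tier` is hypothetical, fractional values between bands (for example 15.5%) are assigned to the higher band, and because the list gives high and medium performers the same 16-30% band, the sketch returns a combined label for it.

```python
def cfr_tier(cfr_percent: float) -> str:
    """Map a CFR percentage to the tier labels in the list above."""
    if not 0.0 <= cfr_percent <= 100.0:
        raise ValueError("CFR must be between 0 and 100")
    if cfr_percent <= 15.0:
        return "elite"
    if cfr_percent <= 30.0:
        # The list assigns 16-30% to both tiers, so CFR alone cannot separate them.
        return "high/medium"
    return "low"


for value in (4.0, 22.5, 48.0):
    print(f"{value:5.1f}% -> {cfr_tier(value)}")
```

In practice CFR is read together with the other DORA metrics, so a tier label derived from CFR alone is only a first approximation.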

Types of Failed Changes

  • Production outages
  • Service degradations
  • Data corruption
  • Security vulnerabilities introduced
  • Performance regressions
  • Failed rollbacks

How to Reduce Change Failure Rate

Technical Practices

  • Comprehensive test automation (unit, integration, E2E)
  • Feature flags for gradual rollouts
  • Canary deployments
  • Blue-green deployments
  • Automated rollback mechanisms
  • Chaos engineering to find weaknesses before production
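
Canary deployments and automated rollback, two of the practices above, are often combined into a gate that compares the canary's error rate with the stable baseline. The sketch below shows one possible decision rule. The `Metrics` type, the `canary_decision` function, and the thresholds (`max_ratio`, `min_requests`, and the 0.1% floor) are all assumptions for illustration and do not come from any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts observed over the same time window (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Metrics, canary: Metrics,
                    max_ratio: float = 1.5, min_requests: int = 500) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary release."""
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    # A small absolute floor keeps a near-zero baseline from making any single error fatal.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "rollback" if canary.error_rate > allowed else "promote"


baseline = Metrics(requests=10_000, errors=50)               # 0.5% error rate
print(canary_decision(baseline, Metrics(1_000, 20)))         # rollback
print(canary_decision(baseline, Metrics(1_000, 5)))          # promote
print(canary_decision(baseline, Metrics(100, 0)))            # wait
```

A gate like this limits how many users a bad change reaches. Whether a rolled-back canary counts toward CFR depends on how the team defines a failed deployment.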

Process Improvements

  • Code review requirements
  • Security scanning in CI/CD pipeline
  • Staging environment parity with production
  • Small batch sizes to limit blast radius
  • Dependency management and vulnerability scanning

Cultural Factors

  • Blameless post-mortems
  • Learning from failures
  • Psychological safety to report issues
  • Shared ownership of reliability

Relationship with Other DORA Metrics

  • Deployment Frequency: High deployment frequency combined with a low CFR indicates elite performance
  • Lead Time: Shorter lead times with a stable or low CFR indicate high performance
  • MTTR: A lower CFR means fewer incidents to recover from, which reduces total recovery time; MTTR itself measures how quickly each failure is resolved

Sources