MTTR (Mean Time to Recovery)
Definition
MTTR (Mean Time to Recovery) is the average time required to recover from a failure, measured from the moment the failure begins to the moment service is fully restored to normal operation. Detection and acknowledgement time count toward the total.
MTTR is one of the four core DORA metrics used to measure DevOps performance.
Key Components
MTTR can be broken down into:
- MTTD (Mean Time to Detect) — Average time to identify a problem
- MTTA (Mean Time to Acknowledge) — Average time to acknowledge and begin addressing a problem
- Mean Time to Repair/Restore — Actual time to fix and restore service

These components add up: MTTR = MTTD + MTTA + Mean Time to Repair
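
The decomposition can be checked against real incident data. The following is a minimal sketch, assuming each incident record carries four timestamps; the `Incident` class and its field names are hypothetical and would need to be mapped to whatever your incident tracker exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; adapt them to your incident tracker's export.
    started_at: datetime       # failure begins
    detected_at: datetime      # monitoring or a user report flags it
    acknowledged_at: datetime  # a responder starts working on it
    resolved_at: datetime      # service is back to normal


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations, expressed in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def summarize(incidents: list[Incident]) -> dict[str, float]:
    mttd = mean_minutes([i.detected_at - i.started_at for i in incidents])
    mtta = mean_minutes([i.acknowledged_at - i.detected_at for i in incidents])
    repair = mean_minutes([i.resolved_at - i.acknowledged_at for i in incidents])
    # MTTR is the sum of the three component means.
    return {"mttd": mttd, "mtta": mtta, "repair": repair, "mttr": mttd + mtta + repair}


t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=15), t0 + timedelta(minutes=45)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=25), t0 + timedelta(minutes=75)),
]
summary = summarize(incidents)
print(summary)  # {'mttd': 15.0, 'mtta': 5.0, 'repair': 40.0, 'mttr': 60.0}
```

Splitting the total this way shows which stage dominates. In the sample data, repair accounts for 40 of the 60 minutes, so faster detection alone would not bring MTTR down much.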
Across DevOps Maturity Levels
| Maturity | Detection & Recovery Capability |
|---|---|
| Phase 1 | Long MTTD and MTTR — outages reported by users (reactive), no proactive monitoring |
| Phase 2 | Better MTTD — essential monitoring tools alert teams when issues affect users |
| Phase 3 | Improved — security scans integrated earlier, but monitoring unchanged from Phase 2 |
| Phase 4 | Continuous monitoring tracks system health, enabling early detection and root cause analysis |
| Phase 5 | Max uptime — high collaboration, rapid data-driven decision-making, minimal customer interruptions |
MTTD and MTTA
MTTD (Mean Time to Detect)
- The average time to identify that a problem has occurred
- Lower is better — faster detection means faster recovery
- Requires: comprehensive monitoring, alerting, and observability
MTTA (Mean Time to Acknowledge)
- The average time from detection to someone actively working on the issue
- Includes time to notify on-call staff, triage, and begin investigation
- Requires: clear incident response processes and on-call coverage
Elite Performance Benchmark (DORA)
- Elite performers: MTTR < 1 hour
- Short MTTR indicates:
  - Robust incident detection and alerting
  - Clear incident response processes
  - Well-practiced on-call procedures
  - Effective automation for rollback and recovery
  - Good observability and debugging tools
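
A team can compare its own recovery times against the one-hour elite threshold with a few lines of code. This is only a sketch: the function name and the list-of-durations input are assumptions, and only the one-hour threshold comes from the benchmark above.

```python
from datetime import timedelta

ELITE_MTTR_THRESHOLD = timedelta(hours=1)  # elite benchmark quoted above


def meets_elite_benchmark(recovery_times: list[timedelta]) -> bool:
    """Return True when the mean recovery time is strictly under one hour."""
    if not recovery_times:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(recovery_times, timedelta())
    return total / len(recovery_times) < ELITE_MTTR_THRESHOLD


fast_team = [timedelta(minutes=30), timedelta(minutes=50)]  # mean: 40 minutes
slow_team = [timedelta(minutes=30), timedelta(hours=2)]     # mean: 75 minutes
print(meets_elite_benchmark(fast_team), meets_elite_benchmark(slow_team))  # True False
```

Because MTTR is a mean, a single long outage can push a team over the threshold even when most incidents are resolved quickly, so it helps to look at the individual durations as well.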
How to Reduce MTTR
- Implement comprehensive monitoring and alerting
- Practice chaos engineering and incident simulations
- Automate rollback procedures
- Use feature flags to isolate failures
- Maintain runbooks for common failures
- Foster a blameless post-mortem culture
- Use observability tools for faster root cause analysis
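
To illustrate how a feature flag can isolate a failure, here is a minimal sketch of a "kill switch". The `FeatureFlags` class, the `new_checkout` flag name, and both checkout functions are hypothetical; a production setup would typically read flags from a dedicated flag service rather than an in-memory dictionary.

```python
class FeatureFlags:
    """In-memory flag store (illustrative stand-in for a real flag service)."""

    def __init__(self, initial: dict[str, bool]):
        self._flags = dict(initial)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing entry never enables new code.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


def legacy_checkout(cart_total: float) -> str:
    return f"legacy:{cart_total:.2f}"


def new_checkout(cart_total: float) -> str:
    return f"new:{cart_total:.2f}"


def checkout(flags: FeatureFlags, cart_total: float) -> str:
    # The flag gates the risky path; the stable path remains as a fallback.
    if flags.is_enabled("new_checkout"):
        return new_checkout(cart_total)
    return legacy_checkout(cart_total)


flags = FeatureFlags({"new_checkout": True})
before = checkout(flags, 19.99)
flags.disable("new_checkout")  # the switch an on-call responder would flip
after = checkout(flags, 19.99)
print(before, after)  # new:19.99 legacy:19.99
```

Turning the flag off restores the known-good path without a redeploy, which shortens the repair portion of MTTR while the root cause is investigated.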
Sources
- sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md
- sources/cloud-devop-maturity-guideline.md