# MTTR (Mean Time to Recovery)

## Definition

MTTR (Mean Time to Recovery) is the average time required to recover from a failure: from the moment the failure occurs to the moment service is fully restored to normal operation. Because the clock starts at the failure itself, MTTR includes the time spent detecting and acknowledging the problem, not only the time spent fixing it.

MTTR is one of the four core **DORA metrics** used to measure DevOps performance.

## Key Components

MTTR can be broken down into:

1. **MTTD (Mean Time to Detect)**: average time to identify a problem
2. **MTTA (Mean Time to Acknowledge)**: average time to acknowledge and begin addressing a problem
3. **Mean Time to Repair/Restore**: actual time to fix and restore service

These combine as:

**MTTR = MTTD + MTTA + Mean Time to Repair**

## Across DevOps Maturity Levels

| Maturity | Detection & Recovery Capability |
|----------|--------------------------------|
| Phase 1 | Long MTTD and MTTR: outages are reported by users (reactive), with no proactive monitoring |
| Phase 2 | Better MTTD: essential monitoring tools alert teams when issues affect users |
| Phase 3 | Improved: security scans are integrated earlier, but monitoring is unchanged from Phase 2 |
| Phase 4 | Continuous monitoring tracks system health, enabling early detection and root cause analysis |
| Phase 5 | Maximum uptime: high collaboration, rapid data-driven decision-making, minimal customer interruptions |

## MTTD and MTTA

### MTTD (Mean Time to Detect)

- The average time to identify that a problem has occurred
- Lower is better: faster detection means faster recovery
- Requires: comprehensive monitoring, alerting, and observability

### MTTA (Mean Time to Acknowledge)

- The average time from detection to someone actively working on the issue
- Includes time to notify on-call staff, triage, and begin investigation
- Requires: clear incident response processes and on-call coverage

## Elite Performance Benchmark (DORA)

- **Elite performers**: MTTR < 1 hour
- A short MTTR indicates:
  - Robust incident detection and alerting
  - Clear incident response processes
  - Well-practiced on-call procedures
  - Effective automation for rollback and recovery
  - Good observability and debugging tools

## How to Reduce MTTR

- Implement comprehensive monitoring and alerting
- Practice chaos engineering and incident simulations
- Automate rollback procedures
- Use feature flags to isolate failures
- Maintain runbooks for common failures
- Foster a blameless post-mortem culture
- Use observability tools for faster root cause analysis

## Sources

- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
- [[sources/cloud-devop-maturity-guideline.md]]

## Related Concepts

- [[concepts/DORA-Metrics]]
- [[concepts/MTTD]]
- [[concepts/MTTA]]
- [[concepts/Error-Budget]]
- [[concepts/Change-Failure-Rate]]
- [[concepts/DevOps-Maturity]]
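
## Worked Example

The decomposition under Key Components (MTTR = MTTD + MTTA + Mean Time to Repair) can be illustrated with a small calculation. The following is a minimal sketch, not a reference implementation: the `Incident` record, its field names, and the sample timestamps are all hypothetical, and it assumes every incident has four recorded timestamps (failure start, detection, acknowledgement, restoration).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    started: datetime       # failure begins
    detected: datetime      # monitoring or a user report flags the problem
    acknowledged: datetime  # someone starts actively working on it
    restored: datetime      # service is back to normal operation


def mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Compute MTTD, MTTA, mean repair time, and their sum (MTTR), in minutes."""
    mttd = mean_minutes([i.detected - i.started for i in incidents])
    mtta = mean_minutes([i.acknowledged - i.detected for i in incidents])
    repair = mean_minutes([i.restored - i.acknowledged for i in incidents])
    return {"MTTD": mttd, "MTTA": mtta, "repair": repair,
            "MTTR": mttd + mtta + repair}


# Made-up sample data: two incidents on the same day.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10),
             datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 10, 45)),
    Incident(datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 4),
             datetime(2024, 1, 1, 14, 7), datetime(2024, 1, 1, 14, 27)),
]

summary = summarize(incidents)
print(summary)
# Compare against the DORA elite benchmark quoted above (MTTR < 1 hour).
print("Meets elite benchmark:", summary["MTTR"] < 60)
```

With this sample data the averages are 7 minutes to detect, 4 minutes to acknowledge, and 25 minutes to repair, giving an MTTR of 36 minutes. Keeping the three components separate shows where recovery time is actually spent, which helps decide whether to invest in monitoring (MTTD), on-call process (MTTA), or rollback automation (repair).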