# MTTR (Mean Time to Recovery)
## Definition
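MTTR (Mean Time to Recovery) is the average time required to recover from a failure, measured from the moment the failure begins to the moment service is fully restored to normal operation. Some teams start the clock at detection instead; this page includes detection time, in line with the breakdown under Key Components.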
MTTR is one of the four core **DORA metrics** used to measure DevOps performance, where it is reported as the time to restore service.
## Key Components
MTTR can be broken down into three components:
1. **MTTD (Mean Time to Detect)**: average time to identify a problem
2. **MTTA (Mean Time to Acknowledge)**: average time to acknowledge and begin addressing a problem
3. **Mean Time to Repair/Restore**: actual time to fix and restore service

Together these give the total: **MTTR = MTTD + MTTA + Mean Time to Repair**
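The breakdown above can be illustrated with a minimal Python sketch. The `Incident` record, its four timestamp fields, and the sample data are assumptions made for illustration; real incident tooling will expose these timestamps under different names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; the field names are illustrative only.
    started: datetime       # when the failure began
    detected: datetime      # when monitoring or a user report flagged it
    acknowledged: datetime  # when a responder began working on it
    restored: datetime      # when service was back to normal


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Compute each component and sum them into MTTR (all in minutes)."""
    mttd = mean_minutes([i.detected - i.started for i in incidents])
    mtta = mean_minutes([i.acknowledged - i.detected for i in incidents])
    repair = mean_minutes([i.restored - i.acknowledged for i in incidents])
    return {"mttd": mttd, "mtta": mtta, "repair": repair,
            "mttr": mttd + mtta + repair}


# Made-up sample data: two incidents starting at the same reference time.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=15),
             t0 + timedelta(minutes=45)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=25),
             t0 + timedelta(minutes=75)),
]
summary = summarize(incidents)
print(summary)
```

With this sample data the components average 15, 5, and 40 minutes, so the MTTR is 60 minutes. Averaging each component separately also shows which stage dominates recovery time.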
## Across DevOps Maturity Levels
| Maturity | Detection & Recovery Capability |
|----------|--------------------------------|
| Phase 1 | Long MTTD and MTTR — outages reported by users (reactive), no proactive monitoring |
| Phase 2 | Better MTTD — essential monitoring tools alert teams when issues affect users |
| Phase 3 | Improved — security scans integrated earlier, but monitoring unchanged from Phase 2 |
| Phase 4 | Continuous monitoring tracks system health, enabling early detection and root cause analysis |
| Phase 5 | Max uptime — high collaboration, rapid data-driven decision-making, minimal customer interruptions |
## MTTD and MTTA
### MTTD (Mean Time to Detect)
- The average time to identify that a problem has occurred
- Lower is better — faster detection means faster recovery
- Requires: comprehensive monitoring, alerting, and observability
### MTTA (Mean Time to Acknowledge)
- The average time from detection to someone actively working on the issue
- Includes time to notify on-call staff, triage, and begin investigation
- Requires: clear incident response processes and on-call coverage
## Elite Performance Benchmark (DORA)
- **Elite performers**: MTTR < 1 hour
- Short MTTR indicates:
  - Robust incident detection and alerting
  - Clear incident response processes
  - Well-practiced on-call procedures
  - Effective automation for rollback and recovery
  - Good observability and debugging tools
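As a rough sketch, the one-hour benchmark can be checked against a list of per-incident restore times. The function name `meets_elite_benchmark` and the sample durations are hypothetical; only the one-hour threshold comes from the benchmark above.

```python
from datetime import timedelta

# Threshold taken from the benchmark above: elite MTTR is under one hour.
ELITE_MTTR_THRESHOLD = timedelta(hours=1)


def meets_elite_benchmark(restore_times):
    """Return True if the mean of the given restore times is under one hour."""
    if not restore_times:
        raise ValueError("at least one incident is required")
    average = sum(restore_times, timedelta()) / len(restore_times)
    return average < ELITE_MTTR_THRESHOLD


# Made-up sample data: the mean is 40 minutes in the first case
# and 70 minutes in the second.
fast_team = [timedelta(minutes=30), timedelta(minutes=50)]
slow_team = [timedelta(minutes=50), timedelta(minutes=90)]
print(meets_elite_benchmark(fast_team), meets_elite_benchmark(slow_team))
```

Because the metric is a mean, one long outage can push a team over the threshold even when most incidents are resolved quickly.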
## How to Reduce MTTR
- Implement comprehensive monitoring and alerting
- Practice chaos engineering and incident simulations
- Automate rollback procedures
- Use feature flags to isolate failures
- Maintain runbooks for common failures
- Foster blameless post-mortem culture
- Use observability tools for faster root cause analysis
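One way to picture automated rollback is a loop that watches a health check and reverts after repeated failures. This is only a sketch under stated assumptions: `check_health` and `rollback` are hypothetical callables standing in for whatever probe and deployment tooling a team actually uses, and the thresholds are arbitrary.

```python
from typing import Callable


def monitor_and_rollback(
    check_health: Callable[[], bool],
    rollback: Callable[[], None],
    max_failures: int = 3,
    max_checks: int = 10,
) -> bool:
    """Trigger rollback after `max_failures` consecutive failed checks.

    Returns True if a rollback was triggered, False otherwise.
    """
    consecutive_failures = 0
    for _ in range(max_checks):
        if check_health():
            consecutive_failures = 0  # a healthy probe resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                rollback()
                return True
    return False


# Simulated probe results (hypothetical): healthy once, then failing.
probe_results = iter([True, False, False, False, True])
actions = []
rolled_back = monitor_and_rollback(
    check_health=lambda: next(probe_results),
    rollback=lambda: actions.append("rollback"),
)
print(rolled_back, actions)
```

Requiring several consecutive failures is a design choice that trades a slightly longer detection time for fewer rollbacks caused by a single flaky probe.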
## Sources
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
- [[sources/cloud-devop-maturity-guideline.md]]
## Related Concepts
- [[concepts/DORA-Metrics]]
- [[concepts/MTTD]]
- [[concepts/MTTA]]
- [[concepts/Error-Budget]]
- [[concepts/Change-Failure-Rate]]
- [[concepts/DevOps-Maturity]]