# MTTR (Mean Time to Recovery)

## Definition
MTTR (Mean Time to Recovery) is the average time required to recover from a failure, measured from the moment the failure begins until service is fully restored to normal operation. Some teams start the clock at detection instead, which leaves MTTD out of the total.

MTTR is one of the four core **DORA metrics** used to measure DevOps performance.

## Key Components

MTTR can be broken down into:
1. **MTTD (Mean Time to Detect)**: average time to identify a problem
2. **MTTA (Mean Time to Acknowledge)**: average time to acknowledge a problem and begin addressing it
3. **Mean Time to Repair/Restore**: actual time to fix the problem and restore service

Summed, these give: **MTTR = MTTD + MTTA + Mean Time to Repair**

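
The breakdown above can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the `Incident` record, its field names, and the sample timestamps are all hypothetical, and it assumes the clock starts when the failure begins.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; field names are illustrative only.
    started: datetime       # failure begins
    detected: datetime      # monitoring or a user report flags it
    acknowledged: datetime  # a responder starts working on it
    restored: datetime      # service is back to normal


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttr_breakdown(incidents):
    """Return MTTD, MTTA, mean time to repair, and their sum (MTTR), in minutes."""
    mttd = mean_minutes([i.detected - i.started for i in incidents])
    mtta = mean_minutes([i.acknowledged - i.detected for i in incidents])
    repair = mean_minutes([i.restored - i.acknowledged for i in incidents])
    return {"mttd": mttd, "mtta": mtta, "repair": repair, "mttr": mttd + mtta + repair}


# Made-up sample data: two incidents starting at the same reference time.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=15), t0 + timedelta(minutes=45)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=25), t0 + timedelta(minutes=75)),
]
print(mttr_breakdown(incidents))
# {'mttd': 15.0, 'mtta': 5.0, 'repair': 40.0, 'mttr': 60.0}
```

Because the three phases are contiguous, the sum of their means equals the mean of `restored - started`, which is a useful cross-check on real incident data.
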
## Across DevOps Maturity Levels

| Maturity | Detection & Recovery Capability |
|----------|--------------------------------|
| Phase 1 | Long MTTD and MTTR: outages are reported by users (reactive), with no proactive monitoring |
| Phase 2 | Better MTTD: essential monitoring tools alert teams when issues affect users |
| Phase 3 | Improved: security scans are integrated earlier, but monitoring is unchanged from Phase 2 |
| Phase 4 | Continuous monitoring tracks system health, enabling early detection and root cause analysis |
| Phase 5 | Maximum uptime: high collaboration, rapid data-driven decision-making, minimal customer interruptions |

## MTTD and MTTA

### MTTD (Mean Time to Detect)
- The average time to identify that a problem has occurred
- Lower is better: faster detection means faster recovery
- Requires: comprehensive monitoring, alerting, and observability

### MTTA (Mean Time to Acknowledge)
- The average time from detection to someone actively working on the issue
- Includes time to notify on-call staff, triage, and begin investigation
- Requires: clear incident response processes and on-call coverage

## Elite Performance Benchmark (DORA)
- **Elite performers**: MTTR < 1 hour
- Short MTTR indicates:
  - Robust incident detection and alerting
  - Clear incident response processes
  - Well-practiced on-call procedures
  - Effective automation for rollback and recovery
  - Good observability and debugging tools

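
As a rough illustration, the elite threshold above can be checked against a set of recovery durations. The function names and sample durations are hypothetical, and only the one-hour elite threshold from this note is encoded; other performance tiers are deliberately left out.

```python
from datetime import timedelta

# Elite threshold taken from the benchmark above.
ELITE_MTTR = timedelta(hours=1)


def mean_recovery(durations):
    """Average a non-empty list of recovery durations (timedelta values)."""
    return sum(durations, timedelta()) / len(durations)


def is_elite_mttr(durations):
    """True when the mean recovery time is strictly below one hour."""
    return mean_recovery(durations) < ELITE_MTTR


# Made-up recovery times for three incidents.
recoveries = [timedelta(minutes=30), timedelta(minutes=50), timedelta(minutes=70)]
print(mean_recovery(recoveries))  # 0:50:00
print(is_elite_mttr(recoveries))  # True
```

Note that a mean can hide outliers: the third incident here took longer than an hour even though the average stays under the threshold.
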
## How to Reduce MTTR
- Implement comprehensive monitoring and alerting
- Practice chaos engineering and incident simulations
- Automate rollback procedures
- Use feature flags to isolate failures
- Maintain runbooks for common failures
- Foster a blameless post-mortem culture
- Use observability tools for faster root cause analysis

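
One way the automated rollback item above might look is a guard that triggers a rollback after several consecutive failed health checks. This is a sketch under stated assumptions: `RollbackGuard`, its parameters, and the `rollback` callable are hypothetical, and a real setup would call the deployment tool's own rollback mechanism instead of the stand-in lambda used here.

```python
from typing import Callable


class RollbackGuard:
    """Trigger a rollback once after N consecutive failed health checks."""

    def __init__(self, rollback: Callable[[], None], max_failures: int = 3):
        self.rollback = rollback          # hypothetical hook into the deploy tool
        self.max_failures = max_failures  # consecutive failures tolerated
        self.failures = 0
        self.rolled_back = False

    def record(self, healthy: bool) -> None:
        """Feed in one health check result; roll back when the limit is hit."""
        if self.rolled_back:
            return  # already rolled back; ignore further results
        # A healthy check resets the streak, a failed one extends it.
        self.failures = 0 if healthy else self.failures + 1
        if self.failures >= self.max_failures:
            self.rollback()
            self.rolled_back = True


# Stand-in rollback action that just records that it was called.
calls = []
guard = RollbackGuard(rollback=lambda: calls.append("rollback"), max_failures=3)
for healthy in [True, False, False, True, False, False, False, False]:
    guard.record(healthy)
print(calls)  # ['rollback']
```

Requiring consecutive failures rather than a single one trades a slightly longer detection time for fewer rollbacks caused by transient blips.
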
## Sources
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
- [[sources/cloud-devop-maturity-guideline.md]]

## Related Concepts
- [[concepts/DORA-Metrics]]
- [[concepts/MTTD]]
- [[concepts/MTTA]]
- [[concepts/Error-Budget]]
- [[concepts/Change-Failure-Rate]]
- [[concepts/DevOps-Maturity]]