MTTR (Mean Time to Recovery)

Definition

MTTR (Mean Time to Recovery) is the average time required to recover from a failure — from the moment a failure is detected to the moment service is fully restored to normal operation.

MTTR is one of the four core DORA metrics used to measure DevOps performance.

Key Components

MTTR can be broken down into:

MTTD (Mean Time to Detect) — Average time to identify a problem
MTTA (Mean Time to Acknowledge) — Average time to acknowledge and begin addressing a problem
Mean Time to Repair/Restore — Actual time to fix and restore service
MTTR = MTTD + MTTA + Mean Time to Repair

Across DevOps Maturity Levels

Maturity	Detection & Recovery Capability
Phase 1	Long MTTD and MTTR — outages reported by users (reactive), no proactive monitoring
Phase 2	Better MTTD — essential monitoring tools alert teams when issues affect users
Phase 3	Improved — security scans integrated earlier, but monitoring unchanged from Phase 2
Phase 4	Continuous monitoring tracks system health, enabling early detection and root cause analysis
Phase 5	Max uptime — high collaboration, rapid data-driven decision-making, minimal customer interruptions

MTTD and MTTA

MTTD (Mean Time to Detect)

The average time to identify that a problem has occurred
Lower is better — faster detection means faster recovery
Requires: comprehensive monitoring, alerting, and observability

MTTA (Mean Time to Acknowledge)

The average time from detection to someone actively working on the issue
Includes time to notify on-call staff, triage, and begin investigation
Requires: clear incident response processes and on-call coverage

Elite Performance Benchmark (DORA)

Elite performers: MTTR < 1 hour
Short MTTR indicates:
- Robust incident detection and alerting
- Clear incident response processes
- Well-practiced on-call procedures
- Effective automation for rollback and recovery
- Good observability and debugging tools

How to Reduce MTTR

Implement comprehensive monitoring and alerting
Practice chaos engineering and incident simulations
Automate rollback procedures
Use feature flags to isolate failures
Maintain runbooks for common failures
Foster blameless post-mortem culture
Use observability tools for faster root cause analysis

2.7 KiB Raw Blame History