MTTR (Mean Time to Recovery)

Definition

MTTR (Mean Time to Recovery) is the average time required to recover from a failure, measured from the moment the failure begins until service is fully restored to normal operation. Detection time counts toward MTTR here, as the breakdown under Key Components shows.

MTTR is one of the four core DORA metrics used to measure DevOps performance, where it is reported as "time to restore service".

Key Components

MTTR can be broken down into three components:

  1. MTTD (Mean Time to Detect): average time to identify a problem
  2. MTTA (Mean Time to Acknowledge): average time to acknowledge and begin addressing a problem
  3. Mean Time to Repair/Restore: average time to fix the problem and restore service

The three components sum to the overall figure:

  MTTR = MTTD + MTTA + Mean Time to Repair
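As a worked example of the breakdown above, the following sketch computes each component from a list of incident timestamps. The `Incident` record, its field names, and the sample data are assumptions made for illustration; they do not come from any particular incident-management tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; the field names are illustrative assumptions.
    started: datetime       # failure begins
    detected: datetime      # monitoring (or a user) reports the problem
    acknowledged: datetime  # someone starts working on it
    restored: datetime      # service is back to normal


def mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttr_breakdown(incidents):
    """Return MTTD, MTTA, mean repair time, and MTTR, all in minutes."""
    return {
        "MTTD": mean_minutes(i.detected - i.started for i in incidents),
        "MTTA": mean_minutes(i.acknowledged - i.detected for i in incidents),
        "repair": mean_minutes(i.restored - i.acknowledged for i in incidents),
        "MTTR": mean_minutes(i.restored - i.started for i in incidents),
    }


# Made-up sample data: two incidents on the same day.
day = datetime(2024, 1, 15)
incidents = [
    Incident(day.replace(hour=10, minute=0), day.replace(hour=10, minute=10),
             day.replace(hour=10, minute=15), day.replace(hour=10, minute=45)),
    Incident(day.replace(hour=14, minute=0), day.replace(hour=14, minute=20),
             day.replace(hour=14, minute=35), day.replace(hour=15, minute=45)),
]

print(mttr_breakdown(incidents))
# {'MTTD': 15.0, 'MTTA': 10.0, 'repair': 50.0, 'MTTR': 75.0}
```

Because every incident's total duration is the sum of its three phases, the averaged components add up to the averaged total (15 + 10 + 50 = 75 minutes here).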

Across DevOps Maturity Levels

Maturity | Detection & Recovery Capability
Phase 1 | Long MTTD and MTTR: outages are reported by users (reactive), with no proactive monitoring
Phase 2 | Better MTTD: essential monitoring tools alert teams when issues affect users
Phase 3 | Improved overall, but not in detection: security scans are integrated earlier, while monitoring is unchanged from Phase 2
Phase 4 | Continuous monitoring tracks system health, enabling early detection and root cause analysis
Phase 5 | Maximum uptime: high collaboration, rapid data-driven decision-making, minimal customer interruptions

MTTD and MTTA

MTTD (Mean Time to Detect)

  • The average time to identify that a problem has occurred
  • Lower is better — faster detection means faster recovery
  • Requires: comprehensive monitoring, alerting, and observability

MTTA (Mean Time to Acknowledge)

  • The average time from detection to someone actively working on the issue
  • Includes time to notify on-call staff, triage, and begin investigation
  • Requires: clear incident response processes and on-call coverage

Elite Performance Benchmark (DORA)

  • Elite performers: MTTR < 1 hour
  • A short MTTR usually indicates:
    • Robust incident detection and alerting
    • Clear incident response processes
    • Well-practiced on-call procedures
    • Effective automation for rollback and recovery
    • Good observability and debugging tools
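The benchmark above can be expressed as a simple check. This is a minimal sketch: the function and constant names are hypothetical, and only the one-hour elite threshold comes from this page.

```python
from datetime import timedelta

# Elite threshold as stated on this page: MTTR under one hour.
ELITE_MTTR_THRESHOLD = timedelta(hours=1)


def meets_elite_benchmark(mttr: timedelta) -> bool:
    """Return True when the measured MTTR is strictly below one hour."""
    return mttr < ELITE_MTTR_THRESHOLD


print(meets_elite_benchmark(timedelta(minutes=45)))  # True
print(meets_elite_benchmark(timedelta(hours=2)))     # False
```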

How to Reduce MTTR

  • Implement comprehensive monitoring and alerting
  • Practice chaos engineering and incident simulations
  • Automate rollback procedures
  • Use feature flags to isolate failures
  • Maintain runbooks for common failures
  • Foster a blameless post-mortem culture
  • Use observability tools for faster root cause analysis
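To illustrate the "automate rollback procedures" item, here is a minimal sketch of a deploy step that rolls back as soon as a health probe fails. The `deploy`, `is_healthy`, and `rollback` callables are placeholders the caller would supply; they are assumptions for illustration, not the API of any real deployment tool.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    checks: int = 3,
) -> bool:
    """Deploy, probe health up to `checks` times, and roll back on the first failure.

    Returns True if every probe passed, False if the release was rolled back.
    """
    deploy()
    for _ in range(checks):
        if not is_healthy():
            rollback()
            return False
    return True


# Usage with stand-in callables that record what happened.
events = []
probe_results = iter([True, False, True])  # second probe fails

ok = deploy_with_rollback(
    deploy=lambda: events.append("deploy"),
    is_healthy=lambda: next(probe_results),
    rollback=lambda: events.append("rollback"),
)
print(ok, events)  # False ['deploy', 'rollback']
```

A real pipeline would normally wait between probes and check actual service health signals. The point of the sketch is that recovery starts without waiting for a human, which removes acknowledgement time from the incident.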
