nexus/wiki/concepts/MTTD.md
2026-04-21 20:03:06 +08:00

MTTD (Mean Time to Detect)

Definition

MTTD (Mean Time to Detect) is the average time required to identify that a problem or failure has occurred in a system. It measures the effectiveness of monitoring, alerting, and observability practices.

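MTTD is one component of MTTR (used here in the sense of mean time to resolve an incident, measured from onset to fix). It covers the first phase of incident response: the interval between a fault starting and a person or system noticing it.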
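
As a rough illustration of the definition, MTTD can be computed by averaging the gap between when each incident began and when it was detected. The sketch below assumes a hypothetical list of incident records with `started` and `detected` timestamps; the field names and sample data are invented for the example and are not taken from any particular incident tracker.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when the fault began and when it was detected.
incidents = [
    {"started": datetime(2026, 4, 1, 10, 0), "detected": datetime(2026, 4, 1, 10, 12)},
    {"started": datetime(2026, 4, 8, 2, 30), "detected": datetime(2026, 4, 8, 2, 33)},
    {"started": datetime(2026, 4, 15, 17, 45), "detected": datetime(2026, 4, 15, 18, 30)},
]


def mean_time_to_detect(records):
    """Return the average of (detected - started) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum((r["detected"] - r["started"] for r in records), timedelta())
    return total / len(records)


mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:20:00 (detection gaps of 12, 3, and 45 minutes average to 20)
```

In practice the start time of a fault is often only known after the fact, so the `started` value usually comes from post-incident analysis rather than from the alert itself.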

Why MTTD Matters

A short MTTD means:

  • Failures are caught before they cascade into larger outages
  • Customer impact is minimized
  • The team can begin recovery faster
  • Root cause analysis starts sooner

A long MTTD means:

  • Problems can escalate undetected
  • User experience degrades for longer periods
  • More customers are affected
  • Root cause analysis becomes harder as the incident grows

Across DevOps Maturity Levels

| Maturity | Detection Capability |
|----------|----------------------|
| Phase 1 | Long MTTD: outages reported by users, no proactive monitoring, reactive approach |
| Phase 2 | Better MTTD: essential monitoring tools alert teams as soon as issues affect users |
| Phase 3 | Improved detection: automated monitoring continues, security scans added earlier in the pipeline |
| Phase 4 | Continuous monitoring: tracks system health for early problem detection and root cause analysis |
| Phase 5 | Minimal MTTD: maximum uptime with high collaboration and continuous monitoring, no customer interruptions |

Key Practices for Low MTTD

Monitoring & Alerting

  • Comprehensive application performance monitoring (APM)
  • Infrastructure monitoring
  • Log aggregation and analysis
  • Real-user monitoring (RUM)
  • Synthetic monitoring
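
Of the monitoring types listed above, synthetic monitoring is the easiest to sketch in a few lines: a scripted check runs on a schedule and reports failure if the check errors out or exceeds a latency budget. The helper below is a minimal sketch under assumptions; `run_probe`, its parameters, and the result fields are hypothetical names, and the check itself is passed in as a callable so that any protocol can be probed.

```python
import time
from typing import Callable


def run_probe(check: Callable[[], bool], max_latency_s: float = 1.0) -> dict:
    """Run one synthetic check and report whether it passed within the latency budget."""
    start = time.monotonic()
    try:
        passed = bool(check())
        error = None
    except Exception as exc:  # a probe should report failures, not crash the scheduler
        passed = False
        error = repr(exc)
    latency = time.monotonic() - start
    return {
        "ok": passed and latency <= max_latency_s,
        "latency_s": latency,
        "error": error,
    }


# Example with a stand-in check; a real probe might issue an HTTP request here.
result = run_probe(lambda: True)
print(result["ok"])  # True
```

Because the probe runs whether or not real users are active, it can surface a failure during quiet hours, before any customer encounters it.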

Alerting Best Practices

  • Meaningful alert thresholds (avoid alert fatigue)
  • Alert routing to appropriate on-call staff
  • Clear alert context for rapid triage
  • Correlation of related alerts
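
One common way to make a threshold "meaningful" is to require that a metric stay above it for several consecutive samples before paging anyone, so that a single spike does not wake the on-call engineer. The class below is a minimal sketch of that idea; the class name, the 5% error-rate threshold, and the window of three samples are assumptions chosen for illustration.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when the metric exceeds `threshold` for `window` consecutive samples."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # breach flags for the latest samples

    def observe(self, value: float) -> bool:
        self.recent.append(value > self.threshold)
        # Fire only once the window is full and every sample in it breached.
        return len(self.recent) == self.recent.maxlen and all(self.recent)


alert = SustainedThresholdAlert(threshold=0.05, window=3)  # e.g. error rate above 5%
samples = [0.01, 0.09, 0.02, 0.07, 0.08, 0.06]
fired = [alert.observe(s) for s in samples]
print(fired)  # [False, False, False, False, False, True]
```

The trade-off is direct: a longer window suppresses more noise but adds roughly the window length times the sampling interval to detection time, so the window size is itself an MTTD decision.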

Observability

  • Structured logging
  • Distributed tracing
  • Metrics dashboards
  • Error tracking
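
Structured logging means emitting each log event as machine-parseable fields rather than free text, which lets log aggregation tools filter and alert on specific attributes. The sketch below uses only Python's standard `logging` and `json` modules; the `JsonFormatter` class and the `service` and `request_id` field names are assumptions made for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("service", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "payment gateway timeout",
    extra={"service": "checkout", "request_id": "req-123"},
)
```

With events in this shape, an alert rule can match on `level` and `service` directly instead of relying on fragile text patterns.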

MTTD vs Other Metrics

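  • MTTR: MTTD is the first component of mean time to resolve (MTTR = MTTD + MTTA + time to repair, where MTTA is Mean Time to Acknowledge)
  • Availability: High availability depends partly on short MTTD, because undetected downtime still counts against uptime
  • Change Failure Rate: A lower change failure rate means fewer production incidents to detect, which eases the load on monitoring and on-call staff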

Sources