# MTTD (Mean Time to Detect)
## Definition
MTTD (Mean Time to Detect) is the average time between the start of a problem or failure in a system and the moment it is identified. It measures the effectiveness of monitoring, alerting, and observability practices.
MTTD is a component of MTTR and represents the first phase of incident response.
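
As an illustration of the definition, MTTD can be computed from incident records as the mean of the gap between failure start and detection. This is a minimal sketch; the field names `started_at` and `detected_at` and the sample incidents are assumptions made for the example, not part of any particular incident tool.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Return the mean of (detected_at - started_at) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the per-incident detection delays, starting from a zero timedelta.
    total = sum(
        (i["detected_at"] - i["started_at"] for i in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident records, for illustration only.
incidents = [
    {"started_at": datetime(2024, 1, 5, 10, 0), "detected_at": datetime(2024, 1, 5, 10, 12)},
    {"started_at": datetime(2024, 2, 9, 14, 30), "detected_at": datetime(2024, 2, 9, 14, 34)},
    {"started_at": datetime(2024, 3, 1, 8, 0), "detected_at": datetime(2024, 3, 1, 8, 20)},
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:12:00
```

In practice the hard part is usually `started_at`: the true start of a failure is often only known after the fact, so it tends to be backfilled during the post-incident review.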
## Why MTTD Matters
A short MTTD means:
- Failures are caught before they cascade into larger outages
- Customer impact is minimized
- The team can begin recovery faster
- Root cause analysis starts sooner
A long MTTD means:
- Problems can escalate undetected
- User experience degrades for longer periods
- More customers are affected
- Root cause analysis becomes harder as the incident grows
## Across DevOps Maturity Levels
| Maturity | Detection Capability |
|----------|---------------------|
| Phase 1 | Long MTTD: outages are reported by users, with no proactive monitoring and a reactive approach |
| Phase 2 | Better MTTD: essential monitoring tools alert teams as soon as issues affect users |
| Phase 3 | Improved detection: automated monitoring continues, and security scans are added earlier in the pipeline |
| Phase 4 | Continuous monitoring: system health is tracked for early problem detection and root cause analysis |
| Phase 5 | Minimal MTTD: maximum uptime with high collaboration and continuous monitoring, no customer interruptions |
## Key Practices for Low MTTD
### Monitoring & Alerting
- Comprehensive application performance monitoring (APM)
- Infrastructure monitoring
- Log aggregation and analysis
- Real-user monitoring (RUM)
- Synthetic monitoring
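
Synthetic monitoring, the last item above, can be sketched as a probe that runs a scripted check on a schedule and records whether it passed and how long it took. The names `run_probe` and `max_latency_s` are hypothetical; the check is passed in as a callable so the sketch does not assume any particular HTTP client or endpoint.

```python
import time


def run_probe(check, max_latency_s=1.0):
    """Run one synthetic check and return (ok, latency_s, error).

    `check` is any zero-argument callable that raises on failure,
    for example a function that requests a health endpoint.
    """
    start = time.monotonic()
    try:
        check()
    except Exception as exc:
        # A raised exception counts as a failed probe.
        return False, time.monotonic() - start, repr(exc)
    latency = time.monotonic() - start
    # A check that succeeds but is too slow also counts as a failure.
    return latency <= max_latency_s, latency, None


def healthy_check():
    """Stand-in for a real request; always succeeds."""
    return None


def failing_check():
    """Stand-in for a real request; always fails."""
    raise RuntimeError("health endpoint returned 503")


ok, latency, error = run_probe(healthy_check)
bad_ok, bad_latency, bad_error = run_probe(failing_check)
```

Because the probe runs whether or not real users are active, it can surface a failure during quiet periods when user-driven signals would stay silent.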
### Alerting Best Practices
- Meaningful alert thresholds (avoid alert fatigue)
- Alert routing to appropriate on-call staff
- Clear alert context for rapid triage
- Correlation of related alerts
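
One common way to keep thresholds meaningful is to alert only when a metric stays above its threshold for several consecutive samples, so that a single spike does not page anyone. The sketch below assumes a hypothetical `should_alert` helper and an error-rate metric; the threshold and window values are illustrative, not recommendations.

```python
def should_alert(samples, threshold, consecutive=3):
    """Return True if the last `consecutive` samples all exceed `threshold`."""
    if len(samples) < consecutive:
        return False
    return all(s > threshold for s in samples[-consecutive:])


# Hypothetical error-rate samples (fraction of failed requests per minute).
spike = [0.01, 0.02, 0.40, 0.02]
sustained = [0.01, 0.12, 0.15, 0.18]

print(should_alert(spike, threshold=0.10))      # False
print(should_alert(sustained, threshold=0.10))  # True
```

The trade-off is explicit: a larger `consecutive` value reduces noise but adds that many sampling intervals to detection time, which raises MTTD.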
### Observability
- Structured logging
- Distributed tracing
- Metrics dashboards
- Error tracking
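
Structured logging can be sketched with Python's standard `logging` module by emitting one JSON object per log line, which makes logs easier to filter and alert on. The `JsonFormatter` class, the `context` key passed through `extra`, and the sample fields are assumptions for this example; the output goes to an in-memory stream only to keep the sketch self-contained.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge in any fields passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "payment failed",
    extra={"context": {"order_id": "A-1001", "latency_ms": 842}},
)

line = stream.getvalue().strip()
print(line)
```

With fields such as `level` and `latency_ms` available as keys rather than free text, a log pipeline can count errors or slow requests directly instead of parsing message strings.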
## MTTD vs Other Metrics
- **MTTR**: MTTD is a component of MTTR (MTTR = MTTD + MTTA + Mean Time to Repair)
- **Availability**: High availability depends partly on short MTTD
- **Change Failure Rate**: When fewer failed changes reach production, there are fewer incidents to detect, which eases the pressure on detection
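
The MTTR decomposition above can be worked through with sample numbers to show how much of the total recovery time is spent on detection. The durations below are invented for illustration, and `mttr_breakdown` is a hypothetical helper.

```python
from datetime import timedelta


def mttr_breakdown(mttd, mtta, repair):
    """Return (mttr, detection_share) using MTTR = MTTD + MTTA + repair time."""
    mttr = mttd + mtta + repair
    # Dividing one timedelta by another yields a plain float ratio.
    return mttr, mttd / mttr


# Illustrative values only.
mttr, detection_share = mttr_breakdown(
    mttd=timedelta(minutes=12),
    mtta=timedelta(minutes=5),
    repair=timedelta(minutes=43),
)

print(mttr)                      # 1:00:00
print(f"{detection_share:.0%}")  # 20%
```

In this example, halving MTTD from 12 to 6 minutes would cut MTTR from 60 to 54 minutes without any change to how quickly the team acknowledges or repairs the incident.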
## Sources
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]
## Related Concepts
- [[concepts/MTTR]]
- [[concepts/MTTA]]
- [[concepts/DORA-Metrics]]
- [[concepts/APM]]
- [[concepts/DevOps-Maturity]]