# MTTD (Mean Time to Detect)

## Definition

MTTD (Mean Time to Detect) is the average time between the start of a problem or failure in a system and the moment it is detected. It measures the effectiveness of monitoring, alerting, and observability practices.

MTTD is a component of MTTR and represents the first phase of incident response.

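
As a minimal sketch of the definition, MTTD can be computed by averaging the gap between when each incident started and when it was detected. The `Incident` record, its field names, and the timestamps below are hypothetical values chosen for illustration, not data from a real system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started_at: datetime   # when the failure actually began
    detected_at: datetime  # when monitoring or a person noticed it


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average of (detected_at - started_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTD")
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


# Made-up example data: detection delays of 4, 10, and 1 minutes.
incidents = [
    Incident(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 4)),
    Incident(datetime(2024, 2, 11, 14, 30), datetime(2024, 2, 11, 14, 40)),
    Incident(datetime(2024, 3, 2, 22, 15), datetime(2024, 3, 2, 22, 16)),
]

mttd = mean_time_to_detect(incidents)
print(f"MTTD: {mttd.total_seconds() / 60:.1f} minutes")  # MTTD: 5.0 minutes
```

In practice the hard part is usually establishing `started_at`, since the true start of a failure is often only known after the fact.
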
## Why MTTD Matters

A short MTTD means:

- Failures are caught before they cascade into larger outages
- Customer impact is minimized
- The team can begin recovery faster
- Root cause analysis starts sooner

A long MTTD means:

- Problems can escalate undetected
- User experience degrades for longer periods
- More customers are affected
- Root cause analysis becomes harder as the incident grows

## Across DevOps Maturity Levels

| Maturity | Detection Capability |
|----------|---------------------|
| Phase 1 | Long MTTD: outages reported by users, no proactive monitoring, reactive approach |
| Phase 2 | Better MTTD: essential monitoring tools alert teams as soon as issues affect users |
| Phase 3 | Improved detection: automated monitoring continues, security scans added earlier in pipeline |
| Phase 4 | Continuous monitoring: tracks system health for early problem detection and root cause analysis |
| Phase 5 | Minimal MTTD: maximum uptime with high collaboration and continuous monitoring, no customer interruptions |

## Key Practices for Low MTTD

### Monitoring & Alerting

- Comprehensive application performance monitoring (APM)
- Infrastructure monitoring
- Log aggregation and analysis
- Real-user monitoring (RUM)
- Synthetic monitoring

### Alerting Best Practices

- Meaningful alert thresholds (avoid alert fatigue)
- Alert routing to appropriate on-call staff
- Clear alert context for rapid triage
- Correlation of related alerts

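
Two of these practices can be sketched in a few lines. The example below is one possible approach, not a reference to any particular alerting tool: `should_alert` fires only after several consecutive threshold breaches (to reduce noise from transient spikes), and `correlate` groups alerts for the same service that arrive close together. The function names, the alert dictionary keys (`service`, `ts`, `name`), and all sample values are assumptions made for illustration.

```python
def should_alert(samples: list[float], threshold: float, consecutive: int) -> bool:
    """Fire only if the last `consecutive` samples all exceed `threshold`."""
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)


def correlate(alerts: list[dict], window_seconds: int) -> list[list[dict]]:
    """Group alerts per service when each follows the previous within the window."""
    groups: list[list[dict]] = []
    for alert in sorted(alerts, key=lambda a: (a["service"], a["ts"])):
        last = groups[-1][-1] if groups else None
        if (
            last is not None
            and last["service"] == alert["service"]
            and alert["ts"] - last["ts"] <= window_seconds
        ):
            groups[-1].append(alert)  # same service, close in time: same group
        else:
            groups.append([alert])    # otherwise start a new group
    return groups


# Hypothetical error-rate samples: one isolated spike, then a sustained breach.
error_rates = [0.01, 0.09, 0.02, 0.08, 0.09, 0.12]
print(should_alert(error_rates[:3], threshold=0.05, consecutive=3))  # False
print(should_alert(error_rates, threshold=0.05, consecutive=3))      # True

# Hypothetical alerts; `ts` is seconds since an arbitrary start.
alerts = [
    {"service": "checkout", "ts": 100, "name": "high_latency"},
    {"service": "checkout", "ts": 130, "name": "error_rate"},
    {"service": "search", "ts": 140, "name": "error_rate"},
    {"service": "checkout", "ts": 900, "name": "high_latency"},
]
groups = correlate(alerts, window_seconds=60)
print([len(g) for g in groups])  # [2, 1, 1]
```

Requiring consecutive breaches trades a slightly longer detection time for fewer false alarms, so the `consecutive` value is itself an MTTD tuning decision.
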
### Observability

- Structured logging
- Distributed tracing
- Metrics dashboards
- Error tracking

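
To illustrate structured logging, here is a minimal sketch using only Python's standard `logging` and `json` modules. It emits each log record as one JSON object so that a log aggregator can filter on fields instead of parsing free text. The `JsonFormatter` class, the `checkout` logger name, and the `service` and `request_id` fields are hypothetical choices for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("service", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

# Emits one JSON line (to stderr by default) with the extra fields included.
logger.error(
    "payment provider timeout",
    extra={"service": "checkout", "request_id": "req-123"},
)
```

Consistent field names across services are what make such logs useful for detection, since alerts and dashboards can then query the same keys everywhere.
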
## MTTD vs Other Metrics

- **MTTR**: MTTD is a component of MTTR (MTTR = MTTD + MTTA + Mean Time to Repair)
- **Availability**: High availability depends partly on short MTTD
- **Change Failure Rate**: A lower change failure rate means fewer failures reach production, which reduces the load on detection

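
The MTTR decomposition can be made concrete with a short worked example. The durations below are made-up values chosen only to show the arithmetic; the variable names are illustrative.

```python
from datetime import timedelta

# Hypothetical averages for one team over some reporting period.
mttd = timedelta(minutes=5)                  # mean time to detect
mtta = timedelta(minutes=3)                  # mean time to acknowledge
mean_time_to_repair = timedelta(minutes=40)  # mean time to repair

# MTTR = MTTD + MTTA + Mean Time to Repair
mttr = mttd + mtta + mean_time_to_repair
print(mttr)  # 0:48:00

# Share of the total incident time spent before anyone knew about the problem.
detection_share = mttd / mttr
print(f"{detection_share:.0%}")  # 10%
```

Tracking the detection share over time shows whether improvements in monitoring are actually shortening the first phase of incident response.
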
## Sources

- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]

## Related Concepts

- [[concepts/MTTR]]
- [[concepts/MTTA]]
- [[concepts/DORA-Metrics]]
- [[concepts/APM]]
- [[concepts/DevOps-Maturity]]