Update nexus: fix conflicts and sync local changes
@@ -1,66 +1,66 @@
# MTTD (Mean Time to Detect)

## Definition
MTTD (Mean Time to Detect) is the average time between the start of a problem or failure in a system and the moment it is identified. It measures the effectiveness of monitoring, alerting, and observability practices.

MTTD is a component of MTTR and represents the first phase of incident response.

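
As a rough illustration of the definition, MTTD can be computed by averaging the gap between when each incident started and when it was detected. The sketch below is a minimal example under stated assumptions: the function name `mean_time_to_detect`, the `(started_at, detected_at)` record shape, and the timestamps are all hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average gap between when each incident started and when it was detected."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (detected - started for started, detected in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (started_at, detected_at) pairs.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 12)),   # 12 min
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 33)),  # 3 min
    (datetime(2024, 2, 2, 8, 15), datetime(2024, 2, 2, 8, 45)),    # 30 min
]

mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:15:00
```

In practice the start time is often only estimated after the fact, so the result is only as reliable as the incident timeline behind it.
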
## Why MTTD Matters

A short MTTD means:
- Failures are caught before they cascade into larger outages
- Customer impact is minimized
- The team can begin recovery faster
- Root cause analysis starts sooner

A long MTTD means:
- Problems can escalate undetected
- User experience degrades for longer periods
- More customers are affected
- Root cause analysis becomes harder as the incident grows

## Across DevOps Maturity Levels

| Maturity | Detection Capability |
|----------|---------------------|
| Phase 1 | Long MTTD: outages reported by users, no proactive monitoring, reactive approach |
| Phase 2 | Better MTTD: essential monitoring tools alert teams as soon as issues affect users |
| Phase 3 | Improved detection: automated monitoring continues, security scans added earlier in pipeline |
| Phase 4 | Continuous monitoring: tracks system health for early problem detection and root cause analysis |
| Phase 5 | Minimal MTTD: maximum uptime with high collaboration and continuous monitoring, no customer interruptions |

## Key Practices for Low MTTD

### Monitoring & Alerting
- Comprehensive application performance monitoring (APM)
- Infrastructure monitoring
- Log aggregation and analysis
- Real-user monitoring (RUM)
- Synthetic monitoring

### Alerting Best Practices
- Meaningful alert thresholds (avoid alert fatigue)
- Alert routing to appropriate on-call staff
- Clear alert context for rapid triage
- Correlation of related alerts

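
One common way to keep thresholds meaningful is to fire only when a metric stays above its limit for several consecutive samples, rather than on a single spike. The sketch below illustrates that idea. The class name `SustainedThresholdAlert`, the 5% error-rate threshold, and the three-sample window are hypothetical choices for illustration, not values from the source.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, value):
        """Record a sample and return True if the alert should fire."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(v > self.threshold for v in self.recent)


# Hypothetical rule: error rate above 5% for three consecutive samples.
alert = SustainedThresholdAlert(threshold=0.05, window=3)
samples = [0.02, 0.09, 0.03, 0.06, 0.07, 0.08]
fired = [alert.observe(s) for s in samples]
print(fired)  # [False, False, False, False, False, True]
```

The window is a trade-off: a longer one suppresses more noise, but every extra sample it waits for is added directly to detection time.
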
### Observability
- Structured logging
- Distributed tracing
- Metrics dashboards
- Error tracking

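
To make the structured logging item concrete, here is a minimal sketch that emits one JSON object per log line using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `checkout` logger name, and the `service` and `request_id` fields are assumptions chosen for illustration. A real setup would normally also include a timestamp, which is left out here to keep the example short.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("service", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service logger
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "payment failed",
    extra={"service": "checkout", "request_id": "abc123"},
)
```

Because every line is machine-parseable, a log aggregator can filter or alert on fields such as `level` or `service` without fragile text matching.
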
## MTTD vs Other Metrics
- **MTTR**: MTTD is the first component of MTTR, read here as the total time to resolve an incident (MTTR = MTTD + MTTA + Mean Time to Repair)
- **Availability**: High availability depends partly on short MTTD
- **Change Failure Rate**: A lower change failure rate means fewer production failures that must be detected, which eases the load on detection

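
The MTTR breakdown above can be worked through with a few numbers. The per-phase averages below are hypothetical, picked only to show how much of total resolution time detection can account for.

```python
from datetime import timedelta

# Hypothetical per-phase averages for one service.
mttd = timedelta(minutes=15)    # mean time to detect
mtta = timedelta(minutes=5)     # mean time to acknowledge
repair = timedelta(minutes=40)  # mean time to repair

# MTTR = MTTD + MTTA + Mean Time to Repair
mttr = mttd + mtta + repair
detect_share = mttd / mttr  # timedelta / timedelta gives a float

print(mttr)                   # 1:00:00
print(f"{detect_share:.0%}")  # 25%
```

With these assumed figures, a quarter of the resolution time passes before anyone knows there is a problem, which is the portion that better monitoring and alerting can shrink.
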
## Sources
- [[sources/devops-maturity-model-from-traditional-it-to-advanced-devops.md]]

## Related Concepts
- [[concepts/MTTR]]
- [[concepts/MTTA]]
- [[concepts/DORA-Metrics]]
- [[concepts/APM]]
- [[concepts/DevOps-Maturity]]