nexus/wiki/concepts/SRE.md
2026-04-18 05:18:07 +08:00

---
title: "SRE"
type: concept
tags: [sre, devops, reliability]
---
## Definition
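SRE (Site Reliability Engineering) is a practice that applies software engineering methods to operations problems, with the goal of building highly reliable and scalable systems.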
## Core Practices
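- **SLI/SLO/SLA**: Service Level Indicator / Objective / Agreement
- **Error Budget**: the allowed quota of failure, used to balance innovation against stability
- **Postmortem**: a blameless review of an incident that extracts the lessons learned
- **Toil Reduction**: cutting down repetitive, manual operations work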
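
The relationship between an SLI, an SLO, and the error budget can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the `SloWindow` class, its field names, and the request-based way of counting failures are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Hypothetical container for one SLO measurement window."""
    slo_target: float   # target fraction of good events, e.g. 0.999
    total_events: int   # all events (e.g. requests) seen in the window
    bad_events: int     # events that violated the SLI definition

    def __post_init__(self) -> None:
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_events <= 0:
            raise ValueError("total_events must be positive")

    @property
    def sli(self) -> float:
        """Measured fraction of good events."""
        return (self.total_events - self.bad_events) / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        """Error budget expressed as a number of events."""
        return (1.0 - self.slo_target) * self.total_events

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        return 1.0 - self.bad_events / self.allowed_bad_events


window = SloWindow(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"SLI: {window.sli:.5f}")
print(f"Budget remaining: {window.budget_remaining:.0%}")
```

A team might treat a low or negative `budget_remaining` as a signal to slow down releases and prioritize stability work, which is the trade-off the error budget is meant to make explicit.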
## Key Metrics
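- **MTTR** (Mean Time To Recovery): the average time taken to restore service after a failure
- **MTTF** (Mean Time To Failure): the average time a system runs before it fails
- **Availability target**: typically 99.9% (three nines) to 99.99% (four nines)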
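
To make these numbers concrete, the sketch below converts an availability target into a downtime allowance and estimates availability from MTTF and MTTR using the common steady-state approximation MTTF / (MTTF + MTTR). The function names and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window for a given availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * window_days * 24 * 60


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Approximate availability as uptime over total time: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


for target in (0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days allows about {minutes:.2f} minutes of downtime")

# A system that runs 999 hours between failures and takes 1 hour to recover
print(f"Estimated availability: {steady_state_availability(999, 1):.3%}")
```

Under these assumptions, three nines leaves roughly 43 minutes of downtime per 30 days and four nines roughly 4 minutes, which shows why shortening MTTR matters as much as lengthening MTTF.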
## Related Entities
- [[AI SRE]]: tools that use AI to automate SRE tasks
## Related Concepts
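- [[DevOps]]: a methodology that combines development and operations to achieve continuous software delivery
- [[混沌工程]] (Chaos Engineering): the practice of proactively testing system resilience
- [[无责复盘]] (Blameless Postmortem): a failure analysis method that assigns no personal blame and focuses on the root of the problem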