---
title: "Progressive Rollout (渐进式放量)"
tags: [devops, continuous-delivery, feature-management, risk-mitigation]
aliases: [Canary Deployment, 灰度发布, Canary Release]
created: 2026-04-25
---

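# Progressive Rollout (渐进式放量)

**Progressive Rollout** (渐进式放量, also called 灰度发布 or gray release) is a risk-management strategy that uses a [[Feature Flag]] to release a new feature to the user base in stages. Unlike a traditional all-or-nothing deployment, a progressive rollout confines the impact of a bad change to the cohort currently exposed, which makes the **RTO quantifiable**: recovery is a flag flip rather than a redeploy.
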
## Aliases

- Canary Deployment
- Canary Release
- 灰度发布
- Staged Rollout

## Core Mechanism

> "Instead of flipping the switch for everyone simultaneously, roll out gradually."

```
1% of users   → watch error rate and performance metrics
5% of users   → monitor conversion rate and user feedback
25% of users  → check downstream system load
100% of users → full rollout complete
```

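One common way to implement the percentage stages above is deterministic hashing, so that a user who is in the 1% cohort stays in the 5% and 25% cohorts as the rollout widens. The following is a minimal sketch under that assumption; the function names `bucket` and `is_enabled` and the flag key `new-checkout` are hypothetical and not taken from any particular feature-flag product.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map a user to a stable bucket in [0, 10000) for the given flag."""
    # Salting with the flag key gives each flag an independent cohort.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to basis points.
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(user_id: str, flag_key: str, rollout_percentage: float) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    # Buckets are basis points, so 1% corresponds to buckets 0..99.
    return bucket(user_id, flag_key) < rollout_percentage * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for pct in (1, 5, 25, 100):
        enabled = sum(is_enabled(u, "new-checkout", pct) for u in users)
        print(f"{pct:>3}% stage: {enabled} of {len(users)} users enabled")
```

Because a user's bucket never changes, raising the percentage only adds users, and lowering it back removes the most recently added users first.
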
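## Progressive Rollout vs. Big Bang Release

| Dimension | Big Bang (full release) | Progressive Rollout |
|-----------|-------------------------|---------------------|
| Blast radius | All users | Small, controlled cohort |
| Problem detection | After the fact | During the rollout (visible as early as the 1% stage) |
| RTO (if something breaks) | Hours (emergency rollback) | Seconds (turn the flag off) |
| Rollback risk | May lose new transactions | No data loss |
| Team stress | High (2 AM deployments) | Low (daytime rollout) |
| Feedback collection | Post-mortem analysis | Real-time monitoring |
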
## RTO Redefined

> "If something breaks at the 5% mark, you've contained the damage. Your RTO is measured in seconds (flip the flag off) instead of hours (emergency rollback deployment)."

| Aspect | Big Bang | Progressive Rollout |
|--------|----------|---------------------|
| When the problem is detected | After the full release | Observable at the 1% stage |
| Time to stop the bleeding | Hours (rollback deployment) | Seconds (turn the flag off) |
| Affected users | 100% | At most the current stage's share (e.g. 5%) |

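## Rollout Strategies

### Cohort-Based Targeted Rollout

| Strategy | Description | When to use |
|----------|-------------|-------------|
| Random sampling | Randomly select X% of users | General purpose |
| Region targeting | Roll out in specific regions first | Regulatory compliance, time-zone testing |
| User tiering | Roll out to free-tier users before paying users | Reducing risk to high-value users |
| Device type | Desktop first, then mobile | Mobile compatibility risk |
| Beta users | Open to internal and beta users first | Early feedback is needed |
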
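The targeting strategies in the table can be combined into a single rule that is checked before any percentage sampling. The sketch below is one possible shape for such a rule; the `User` and `TargetingRule` types and their field names are assumptions made for illustration, not an existing API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class User:
    user_id: str
    region: str           # e.g. "NZ", "US"
    tier: str             # e.g. "free", "paid"
    device: str           # e.g. "desktop", "mobile"
    is_beta: bool = False


@dataclass(frozen=True)
class TargetingRule:
    # None means "no restriction on this attribute".
    regions: Optional[FrozenSet[str]] = None
    tiers: Optional[FrozenSet[str]] = None
    devices: Optional[FrozenSet[str]] = None
    beta_only: bool = False

    def matches(self, user: User) -> bool:
        """Return True only if the user satisfies every configured restriction."""
        if self.beta_only and not user.is_beta:
            return False
        if self.regions is not None and user.region not in self.regions:
            return False
        if self.tiers is not None and user.tier not in self.tiers:
            return False
        if self.devices is not None and user.device not in self.devices:
            return False
        return True


if __name__ == "__main__":
    # Hypothetical first stage: desktop users in one region only.
    rule = TargetingRule(regions=frozenset({"NZ"}), devices=frozenset({"desktop"}))
    print(rule.matches(User("u1", "NZ", "free", "desktop")))
    print(rule.matches(User("u2", "NZ", "free", "mobile")))
```

Keeping the rule immutable makes each stage's audience easy to log and review alongside the rollout percentage.
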
### Metric-Based Automated Gates

```yaml
# Illustrative schema, not tied to a specific tool.
# A stage's gates are checked before the rollout advances to the next stage.
rollout_stages:
  - percentage: 1
    gates:
      - "error_rate < 0.1%"
      - "p95_latency < 500ms"
  - percentage: 5
    gates:
      - "conversion_rate > baseline - 5%"
      - "support_tickets < 10"
  - percentage: 25
    gates:
      - "downstream_api_latency < 200ms"
      - "no_critical_errors"
```

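To make the gate idea concrete, the sketch below evaluates the two gates of the 1% stage against a snapshot of metrics. It assumes gates are expressed as (metric, comparison, threshold) triples rather than parsed from strings; the `Gate` type, the metric names, and the `failed_gates` helper are hypothetical.

```python
import operator
from typing import Callable, List, Mapping, NamedTuple, Sequence


class Gate(NamedTuple):
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float


# Mirrors the gates of the 1% stage in the config above.
STAGE_1_GATES = [
    Gate("error_rate", operator.lt, 0.001),       # error_rate < 0.1%
    Gate("p95_latency_ms", operator.lt, 500.0),   # p95_latency < 500ms
]


def failed_gates(gates: Sequence[Gate], metrics: Mapping[str, float]) -> List[str]:
    """Return the metric names of all gates that do not pass.

    A metric that is missing from the snapshot counts as a failure,
    so the rollout never advances on incomplete data.
    """
    failures = []
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None or not gate.compare(value, gate.threshold):
            failures.append(gate.metric)
    return failures


if __name__ == "__main__":
    snapshot = {"error_rate": 0.002, "p95_latency_ms": 320.0}
    print(failed_gates(STAGE_1_GATES, snapshot))
```

Treating a missing metric as a failed gate is a deliberate choice in this sketch: it keeps a broken monitoring pipeline from looking like a healthy rollout.
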
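## Relationship to [[Kill Switch]]

Progressive Rollout and Kill Switch are two sides of the same mechanism:

- **Progressive Rollout**: controls how a feature reaches users (gradually)
- **Kill Switch**: cuts the feature off immediately when a problem is found (defensively)

Combined, they provide both the controllability of a gradual rollout and the emergency safety net of a Kill Switch.
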
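One way to picture how the two fit together is a single flag state in which the kill switch takes precedence over the rollout percentage. This is a minimal sketch under that assumption; `FlagState` and `should_serve` are hypothetical names, and `user_bucket` is assumed to be a stable per-user value in the range 0 to 99.

```python
from dataclasses import dataclass


@dataclass
class FlagState:
    rollout_percentage: float = 0.0   # share of users who should see the feature
    kill_switch: bool = False         # emergency override, checked first


def should_serve(flag: FlagState, user_bucket: int) -> bool:
    """Decide whether a user in the given bucket (0..99) gets the feature."""
    if flag.kill_switch:
        # The kill switch wins regardless of the rollout percentage.
        return False
    return user_bucket < flag.rollout_percentage


if __name__ == "__main__":
    flag = FlagState(rollout_percentage=25)
    print(should_serve(flag, 10))   # inside the 25% cohort
    flag.kill_switch = True
    print(should_serve(flag, 10))   # cut off by the kill switch
```

Keeping the kill switch separate from the percentage means the rollout position is preserved, so the rollout can resume from the same stage once the problem is fixed.
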
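## Practical Guidelines

1. **Monitoring first**: make sure the monitoring dashboards are ready before each rollout step
2. **Define rollback criteria**: decide in advance which metrics trigger a pause or a rollback
3. **Automate the rollout**: avoid the errors that manual steps introduce
4. **Cross-team alignment**: product, engineering, and operations should jointly define the rollout cadence
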
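Points 2 and 3 can be sketched as a small controller that walks through the stages and rolls back to 0% as soon as a health check fails. The names `run_rollout`, `set_percentage`, and `is_healthy` are assumptions for illustration; a real controller would also wait for a soak period at each stage before checking health.

```python
from typing import Callable, List, Sequence


def run_rollout(
    stages: Sequence[int],
    set_percentage: Callable[[int], None],
    is_healthy: Callable[[int], bool],
) -> int:
    """Advance through the stages in order and return the final percentage.

    If the health check fails at any stage, the rollout is set back to 0%
    and no further stages are attempted.
    """
    current = 0
    for percentage in stages:
        set_percentage(percentage)
        current = percentage
        if not is_healthy(percentage):
            set_percentage(0)
            return 0
    return current


if __name__ == "__main__":
    history: List[int] = []
    # Hypothetical health check that starts failing at the 25% stage.
    final = run_rollout([1, 5, 25, 100], history.append, lambda pct: pct < 25)
    print(final, history)
```

Passing `set_percentage` and `is_healthy` in as callables keeps the controller independent of any particular flag service or metrics backend.
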
## Related Concepts

- [[Feature Flag]]: the technical foundation of Progressive Rollout
- [[Kill Switch]]: the emergency safety net for Progressive Rollout
- [[RTO]]: Progressive Rollout reduces RTO from hours to seconds
- [[Deployment-vs-Release]]: Progressive Rollout decouples deployment from release
- [[Micro-Recovery]]: Progressive Rollout enables precise, feature-level recovery

## Sources

- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]