Auto-sync: 2026-04-29 00:02
82
wiki/concepts/Availability-Zone-ID.md
Normal file
@@ -0,0 +1,82 @@
---
title: "Availability-Zone-ID"
type: concept
tags: [AWS, VPC, Networking, Multi-Account]
sources:
- ctp-topic-45-automatic-ip-address-allocation-with-ipam
- ctp-topic-61-workload-vpc-provision-with-ipam-automation
last_updated: 2026-04-24
---

## Availability-Zone-ID

An Availability Zone ID (AZ ID, for example `apse1-az1` or `apse1-az2`) is the AWS identifier used to pinpoint a physical Availability Zone in a multi-account AWS environment. AZ names such as `ap-southeast-1a` are mapped independently per account, whereas an AZ ID refers to the same physical location in every account.

## Problem: AZ Name Inconsistency

Different AWS accounts may use different **names** for the same physical Availability Zone:

| Account A | Account B | Physical location (AZ ID) |
|-----------|-----------|---------------------------|
| ap-southeast-1a | ap-southeast-1b | apse1-az1 |
| ap-southeast-1b | ap-southeast-1a | apse1-az2 |

**Issues**:
- `ap-southeast-1a` in account A and `ap-southeast-1b` in account B are the same physical location.
- Designing cross-account VPC peering or an availability architecture around AZ names can produce a layout that looks symmetric but is physically asymmetric.

## Solution: Use AZ ID

Use AZ IDs (for example `apse1-az1`) instead of AZ names:

```yaml
availability_zone_ids:
  - apse1-az1  # first AZ ID in ap-southeast-1 (apse1)
  - apse1-az2  # second AZ ID in ap-southeast-1 (apse1)
```

**Benefits**:
- Cross-account consistency: an AZ ID identifies the same physical location in every account
- Reliability design: ensures a high-availability architecture is genuinely symmetric at the physical level
- VPC peering: cross-account connections are configured against the correct zones

## How to Find AZ IDs

```bash
# Query the AZ name to AZ ID mapping of the current account with the AWS CLI
aws ec2 describe-availability-zones --output json
```

Example output (abridged to the relevant fields):
```json
{
  "AvailabilityZones": [
    {
      "ZoneName": "ap-southeast-1a",
      "ZoneId": "apse1-az1",
      "RegionName": "ap-southeast-1"
    },
    {
      "ZoneName": "ap-southeast-1b",
      "ZoneId": "apse1-az2",
      "RegionName": "ap-southeast-1"
    }
  ]
}
```

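The name-to-ID mapping can also be compared across accounts programmatically. The following is a minimal Python sketch, not part of the source pages: it parses JSON in the shape returned by `aws ec2 describe-availability-zones --output json` and checks whether two AZ names from two accounts refer to the same AZ ID. The function names and the two sample payloads are hypothetical and simply mirror the table in the Problem section.

```python
import json


def zone_name_to_id(describe_output: str) -> dict:
    """Map each AZ name to its AZ ID from describe-availability-zones JSON."""
    zones = json.loads(describe_output)["AvailabilityZones"]
    return {zone["ZoneName"]: zone["ZoneId"] for zone in zones}


def same_physical_zone(map_a: dict, name_a: str, map_b: dict, name_b: str) -> bool:
    """Return True when two AZ names (from two accounts) share one AZ ID."""
    return map_a[name_a] == map_b[name_b]


# Hypothetical sample payloads for two accounts (assumption, for illustration).
ACCOUNT_A = json.dumps({"AvailabilityZones": [
    {"ZoneName": "ap-southeast-1a", "ZoneId": "apse1-az1"},
    {"ZoneName": "ap-southeast-1b", "ZoneId": "apse1-az2"},
]})
ACCOUNT_B = json.dumps({"AvailabilityZones": [
    {"ZoneName": "ap-southeast-1a", "ZoneId": "apse1-az2"},
    {"ZoneName": "ap-southeast-1b", "ZoneId": "apse1-az1"},
]})

map_a = zone_name_to_id(ACCOUNT_A)
map_b = zone_name_to_id(ACCOUNT_B)
```

With these sample payloads, `same_physical_zone(map_a, "ap-southeast-1a", map_b, "ap-southeast-1b")` is true, while comparing `ap-southeast-1a` in both accounts is false. In practice each payload would come from running the CLI command under the respective account's credentials.
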
## Key Concepts

- [[VPC-自动化供给]]: the AZ ID is part of the VPC YAML configuration
- [[IPAM]]: AZ mapping must be considered when IPAM is integrated with VPC provisioning

## Connections

- [[ctp-topic-45-automatic-ip-address-allocation-with-ipam]] ← the YAML supports specifying AZ IDs
- [[ctp-topic-61-workload-vpc-provision-with-ipam-automation]] ← emphasizes AZ IDs for cross-account consistency

## Aliases

- AZ ID
- Availability Zone Identifier
- 物理可用区标识符
74
wiki/concepts/CIDR-审批流程.md
Normal file
@@ -0,0 +1,74 @@
---
title: "CIDR-审批流程"
type: concept
tags: [AWS, Networking, IPAM, Approval, Automation]
sources:
- ctp-topic-45-automatic-ip-address-allocation-with-ipam
- ctp-topic-61-workload-vpc-provision-with-ipam-automation
last_updated: 2026-04-24
---

## CIDR Approval Workflow (CIDR-审批流程)

Differentiated approval rules based on CIDR block size, used to balance automation efficiency against governance of the IP address space. The IPAM system automatically routes each request to a different workflow according to the requested subnet size.

## Approval Matrix

| CIDR prefix | Total addresses | Usable addresses | Approval path |
|-------------|-----------------|------------------|---------------|
| /16 | 65,536 | 65,534 | **Auto-approved** |
| /18 | 16,384 | 16,382 | **Auto-approved** |
| /20 | 4,096 | 4,094 | **Auto-approved** |
| /22 | 1,024 | 1,022 | **Auto-approved** ✅ |
| /24 | 256 | 254 | **Approval required** ⚠️ |
| /26 | 64 | 62 | **Approval required** ⚠️ |
| /28 | 16 | 14 | **Approval required** ⚠️ |

## Decision Logic

```
User requests a subnet
        ↓
IPAM system checks the CIDR size
        ↓
┌──────────────────────────────────────────────┐
│ Block /22 or larger (prefix length ≤ 22)?    │
│   ├─ Yes → approved automatically            │
│   │        call the NIOS API                 │
│   │        allocate the next available block │
│   └─ No  → submitted to the network team     │
│            requester provides justification  │
│            network team reviews manually     │
└──────────────────────────────────────────────┘
```

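The threshold check can be sketched in a few lines of Python using the standard `ipaddress` module. This is an illustrative sketch only: the function names and return strings are assumptions, and it models just the routing decision, not the NIOS API call or the notification flow.

```python
import ipaddress

# Largest prefix length that is still auto-approved (the /22 threshold above).
AUTO_APPROVE_MAX_PREFIX = 22


def address_counts(cidr: str) -> tuple:
    """Return (total, usable) addresses; usable excludes network and broadcast."""
    network = ipaddress.ip_network(cidr, strict=True)
    total = network.num_addresses
    return total, max(total - 2, 0)


def approval_path(cidr: str) -> str:
    """Route a request: blocks of /22 or larger are auto-approved."""
    network = ipaddress.ip_network(cidr, strict=True)
    if network.prefixlen <= AUTO_APPROVE_MAX_PREFIX:
        return "auto-approve"
    return "network-team-review"
```

Note the direction of the comparison: a "larger" block has a numerically smaller prefix length, so `/20` passes the check and `/24` does not. `address_counts` reproduces the two address columns of the matrix.
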
## Business Rationale

1. **/22 and larger (auto-approved)**
   - Ample IP address space
   - Little impact on the overall address space
   - Encourages teams to self-serve through automation

2. **/24 and smaller (approval required)**
   - IP address space is tight
   - The request must be assessed for merging into a larger block
   - Prevents fragmentation of the address space

## Key Concepts

- [[IPAM]]: the core system that executes the approval logic
- [[Infoblox-NIOS]]: stores and manages CIDR allocation records
- [[VPC-自动化供给]]: CIDR approval is part of automated provisioning

## Notes

- The approval threshold (/22) is defined by the network team and may change with organizational policy
- Email notifications cover both the requester and the network team, so the whole process is traceable
- See [[ctp-topic-61-workload-vpc-provision-with-ipam-automation]] for details

## Aliases

- CIDR Approval Workflow
- IP 地址审批流程
- 子网大小审批规则
- IPAM Approval Threshold
78
wiki/concepts/Customer-Zero.md
Normal file
@@ -0,0 +1,78 @@
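---
title: "Customer Zero Environment"
type: concept
tags: [Customer-Zero, DevOps, QA, Staging, Release-Management, Production-Readiness]
sources:
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
last_updated: 2026-04-29
---

## Customer Zero Environment

A Customer Zero Environment (the first-customer, or internal validation, environment for a new release) is a pre-production environment deployed internally before a new version or product is released to external customers. It is used to validate functional correctness, performance and recoverability under realistic traffic. It is a key practice of the [[SRE]] Build stage and the core concept of the "Build" stage in the four-part [[Recovery-Assurance]] framework.

## Definition

> "Customer Zero is the environment where your organization is the first customer of your own product — validating releases in production-like conditions before external rollout."

A Customer Zero environment is essentially an **internal shadow customer**: the organization uses its own product, simulates real usage in a controlled environment, and releases externally only after problems have been found and fixed.

## Purpose

| Goal | Description |
|------|-------------|
| **New release validation** | Test the functionality and performance of a new release in a realistic environment |
| **Recovery path validation** | Verify that backup, restore and failover procedures work under real load |
| **Configuration change validation** | Test the effect of configuration changes (IaC scripts, infrastructure adjustments) on the system |
| **Disaster drills** | Deliberately trigger failures in an isolated environment to verify the recovery SLA |
| **Performance baselining** | Establish the performance baseline of the system under normal load |

## Customer Zero vs. Other Environments

| Environment | Purpose | When used |
|-------------|---------|-----------|
| **Dev** | Development and debugging | Day-to-day coding by developers |
| **Test** | Functional testing | QA team runs test cases |
| **Staging** | Pre-release validation | Testing on a near-production mirror |
| **Customer Zero** | **Internal shadow-customer validation** | **Final validation under real production configuration** |
| **Production** | Serving customers | Live operation |
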
## Key Characteristics

1. **Production-equivalent configuration**: Customer Zero uses exactly the same infrastructure configuration as production (VPCs, subnets, security groups, IAM roles)
2. **Shadow data**: uses a de-identified copy of production data (or synthetic data) that reflects real data volume and distribution
3. **Isolated but connected**: normally isolated from production, but may consume de-identified versions of production data sources (such as CloudWatch Logs)
4. **Continuous validation**: not a one-off check before release but a standing validation gate in the CI/CD pipeline

## Connection to SRE

In the Build stage of [[SRE]], the Customer Zero environment is the core of "Release Readiness":

- **Part of the go-live checklist**: before supporting a new product launch, the SRE team validates monitoring coverage, alert thresholds and recovery procedures in Customer Zero
- **Error budget validation**: after a new release, error trends are monitored in Customer Zero to confirm that error budget consumption is as expected
- **Toil discovery**: recurring problems found in Customer Zero drive automation improvements and reduce future toil

## Connection to Recovery Assurance

The "Build" stage in the four-part [[Recovery-Assurance]] framework:

```
Design → Software → Build (Customer Zero) → Environments
```

- **Design**: define recoverability requirements ([[RTO]]/[[RPO]])
- **Software**: telemetry built into the software to support health monitoring
- **Build**: the Customer Zero environment validates recovery paths and SLAs
- **Environments**: SRE + [[Observability]] support ongoing operations

## Related Concepts

- [[SRE]]: Customer Zero is a key practice of the SRE Build stage
- [[Recovery-Assurance]]: the validation environment for the Build stage
- [[Observability]]: recovery drills in Customer Zero depend on observability data
- [[RTO]] / [[RPO]]: Customer Zero verifies whether DR targets are met
- [[CI/CD]]: Customer Zero is a quality gate in the CI/CD pipeline

## Sources

- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]
85
wiki/concepts/Disaster-Recovery.md
Normal file
@@ -0,0 +1,85 @@
---
title: "Disaster Recovery"
type: concept
tags: [Disaster-Recovery, DR, Business-Continuity, RTO, RPO, High-Availability, Cloud-DevOps]
sources:
- ctp-topic-72-implementing-an-enterprise-dr-strategy-using-aws-backup
- ctp-topic-44-aws-backup-in-micro-focus
- rto-vs-rpo-key-differences-for-modern-disaster-recovery
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
last_updated: 2026-04-29
---

## Disaster Recovery

Disaster recovery (DR) is the set of strategies and practices that protect information systems from catastrophic events (earthquakes, floods, fires, ransomware, hardware failures, human error). It is the core IT component of a [[Business-Continuity-Plan]].

## Core Metrics

The two core quantitative metrics of DR:

| Metric | Full name | Meaning | Direction of measurement |
|--------|-----------|---------|--------------------------|
| **[[RTO]]** | Recovery Time Objective | Maximum acceptable time between an outage and restored service | Forward (from the failure onward) |
| **[[RPO]]** | Recovery Point Objective | Maximum acceptable window of data loss, expressed as time | Backward (from the failure back to the last recoverable point) |

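The two measurement directions can be made concrete with a small calculation. The sketch below is illustrative only: the function name, the timestamps and the 2-hour / 30-minute targets are assumptions chosen for the example, not values from the source sessions.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup: datetime, failure: datetime,
                    restored: datetime, rto: timedelta, rpo: timedelta) -> dict:
    """Compare one incident against RTO and RPO targets."""
    downtime = restored - failure             # measured forward from the failure
    data_loss_window = failure - last_backup  # measured backward from the failure
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: backup at 01:00, failure at 01:45, service back at 03:15.
report = recovery_report(
    last_backup=datetime(2026, 1, 1, 1, 0),
    failure=datetime(2026, 1, 1, 1, 45),
    restored=datetime(2026, 1, 1, 3, 15),
    rto=timedelta(hours=2),
    rpo=timedelta(minutes=30),
)
```

In this example the 90-minute outage is within the 2-hour RTO, but the 45 minutes of data written since the last backup exceed the 30-minute RPO, so only one of the two objectives is met.
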
## DR Strategies

### Protection Scope

| Strategy | Description | RTO | RPO | Cost |
|----------|-------------|-----|-----|------|
| **Backup Only** | Periodic backups, no standby facility | Hours to days | Hours to days | $ |
| **Pilot Light** | Core services always running; remaining infrastructure kept on cold standby | Tens of minutes | Minutes | $$ |
| **Warm Standby** | Some services on hot standby, scaled out on demand | Minutes | Seconds | $$$ |
| **Multi-Region Active-Active** | Runs in multiple Regions simultaneously | ~0 | ~0 | $$$$ |

### Cloud-Native DR on AWS

- **[[AWS-Backup]]**: centralized backup management for services such as EC2, RDS, DynamoDB and S3
- **[[AWS-Backup-Audit-Manager]]**: automated compliance auditing
- **Cross-Region Replication**: S3 Cross-Region Replication for objects, plus cross-Region copies of EBS volume snapshots
- **AWS Elastic Disaster Recovery**: continuous replication into AWS, providing an RPO measured in seconds

## DR vs. High Availability

| Dimension | High availability (HA) | Disaster recovery (DR) |
|-----------|------------------------|------------------------|
| **Target failure** | Single-component failure (hardware, software) | Regional disaster (data center loss) |
| **Scope** | Redundancy within a single site | Protection across geographic locations |
| **Trigger** | Automatic failover | Triggered by a human decision |
| **Testing frequency** | Exercised continuously (always-on) | Periodic drills |

## DR Testing Challenges

Common challenges in enterprise DR testing today (OpenText case):

- **Reactive**: tests are scheduled around customer timetables rather than designed proactively
- **Manual**: heavy manual coordination, with SMEs involved throughout
- **Inconsistent**: no unified DR methodology across the organization
- **Limited**: testing on hyperscale cloud platforms mainly covers Region failures and lacks validation of account-level and service-level failures

## DR to Recovery Assurance Evolution

The evolution framework proposed by [[OpenText]], moving from reactive DR to proactive [[Recovery-Assurance]]:

1. **Design**: make recoverability an up-front architectural design principle
2. **Software**: build telemetry into the software to support continuous health monitoring
3. **Build**: validate recovery paths in the Customer Zero environment
4. **Environments**: SRE + observability engineering underpin resilience

## Related Concepts

- [[RTO]]: Recovery Time Objective, a core DR metric
- [[RPO]]: Recovery Point Objective, a core DR metric
- [[Business-Continuity-Plan]]: the business continuity plan, the framework that sits above DR
- [[Recovery-Assurance]]: the direction in which DR is evolving, from reactive response to proactive assurance
- [[High-Availability]]: high availability, the component-level counterpart to DR
- [[AWS-Backup]]: the AWS cloud-native tool for implementing DR

## Sources

- [[ctp-topic-72-implementing-an-enterprise-dr-strategy-using-aws-backup]]
- [[ctp-topic-44-aws-backup-in-micro-focus]]
- [[rto-vs-rpo-key-differences-for-modern-disaster-recovery]]
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]
63
wiki/concepts/Gate-Process.md
Normal file
@@ -0,0 +1,63 @@
---
title: "Gate Process"
type: concept
tags: [CTP, Cloud, AWS, Governance]
sources: [ctp-topic-20-program-demand-process-flow-and-poc-onboarding]
last_updated: 2026-04-14
---

## Definition

The Gate Process is a framework of key decision points used to govern the progress of cloud migration projects. Placing a gate at each key milestone ensures that a project meets all entry criteria before it moves to the next stage, which controls risk and keeps governance rigorous.

## Gate Stages in CTP

### Gate 0: Assessment Gate

- **Purpose**: confirm that the demand falls within the scope of the cloud transformation
- **Review content**: business need, technical feasibility, resource availability
- **Output**: entry decision on whether to proceed to detailed planning

### Gate 1: Design Authority Gate

- **Purpose**: verify that the solution design follows cloud-native principles
- **Review content**: architecture design, security policy, compliance, IaC plan
- **Output**: approval or rejection by the Design Authority
- **Key requirement**: an approved [[Solution-Design]] is mandatory

### Gate 3: Migration Gate

- **Purpose**: confirm that the product is ready to migrate into the production environment
- **Review content**: POC results, IaC readiness, team capability, security review outcome
- **Output**: final migration approval

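The gate sequence behaves like an ordered checklist: a project advances only while every review item of the current gate is satisfied. The sketch below models that idea in Python. It is an assumption-laden illustration: the `Gate` class, the criteria strings (paraphrased from the review content above) and the function name are hypothetical, and only the three gates listed on this page are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    name: str
    criteria: tuple  # review items that must all be satisfied


# Gates in the order listed above; criteria strings are paraphrases.
GATES = (
    Gate("Gate 0", ("business need", "technical feasibility", "resource availability")),
    Gate("Gate 1", ("architecture design", "security policy", "compliance", "IaC plan")),
    Gate("Gate 3", ("POC results", "IaC readiness", "team capability", "security review")),
)


def furthest_gate_passed(satisfied: set) -> Optional[str]:
    """Return the last gate passed in sequence, or None if Gate 0 fails."""
    passed = None
    for gate in GATES:
        if not all(item in satisfied for item in gate.criteria):
            break  # a failed gate blocks every later gate
        passed = gate.name
    return passed
```

Because evaluation stops at the first unmet gate, satisfying the Gate 3 items without an approved design still leaves the project at Gate 0, which is the governance property the stage-gate model is meant to enforce.
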
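## Gate Process vs Agile Methods

| Dimension | Gate Process | Agile methods (Scrum/Kanban) |
|-----------|--------------|------------------------------|
| Decision model | Staged approval checkpoints | Continuous feedback loops |
| Change control | Frozen between gates | Priorities adjusted at any time |
| Suitable for | Governance of large-scale migrations | Iterative product development |
| Risk control | Mandatory review points | Fail fast, adjust fast |
| Documentation | High (deliverables for each gate) | Low (working software) |

## Relationship with Agile

The two do not contradict each other; they suit different scenarios:

- **Gate Process** suits large-scale enterprise migration decisions that need strict governance
- **Agile methods** suit continuously iterated product development and delivery

In CTP practice the two can be combined: agile methods manage iterative delivery, and the Gate Process governs migration entry decisions.

## Related Concepts

- [[Program-Demand-Process]]: the Gate Process is the core governance mechanism of the demand process
- [[Proof-of-Concept]]: a POC must be completed and its success criteria verified before Gate 3
- [[Solution-Design]]: the core deliverable approved at Gate 1
- [[Design-Authority]]: the body that performs the Gate 1 approval
- [[Agile]]: the iterative management approach that complements the Gate Process

## References

- [[ctp-topic-20-program-demand-process-flow-and-poc-onboarding]]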
52
wiki/concepts/Hub-and-Spoke.md
Normal file
@@ -0,0 +1,52 @@
---
title: "Hub-and-Spoke Network Topology"
type: concept
tags: [AWS, Networking, Topology, Transit Gateway]
sources: [ctp-topic-18-wide-area-networking-in-aws-cloud]
last_updated: 2026-05-07
---

## Hub-and-Spoke

Hub-and-Spoke is a star network topology in which every branch (spoke) connects to a central node (hub), and traffic between branches normally transits the hub.

## Definition

- **Hub (central node)**: aggregates traffic from all spokes and enforces routing decisions and security policy
- **Spoke (branch node)**: an independent VPC or Landing Zone that joins the global network through the hub
- **Communication pattern**: spoke-to-spoke traffic must be forwarded through the hub rather than over a direct link

## In AWS Transit Gateway Architecture

In the architecture described in [[ctp-topic-18-wide-area-networking-in-aws-cloud]]:

- **Hub**: a regional Transit Gateway for each geography (APJ, EMEA, AMS), such as the London hub for EMEA and the Oregon hub for AMS
- **Spoke**: the individual Landing Zones, which join the regional hub through TGW Peering
- **Inter-Hub**: regional hubs are connected in a full mesh, which ensures global reachability

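The combined pattern (spoke-to-hub star, hub-to-hub full mesh) can be sketched as a small path computation. This Python sketch is an illustration under assumptions: the spoke and hub names are hypothetical placeholders, not the real Landing Zone names, and the model ignores route tables entirely.

```python
# Hypothetical topology: spoke name -> regional hub (assumption, for illustration).
SPOKE_TO_HUB = {
    "lz-emea-1": "hub-emea",
    "lz-emea-2": "hub-emea",
    "lz-ams-1": "hub-ams",
    "lz-apj-1": "hub-apj",
}


def spoke_path(src: str, dst: str) -> list:
    """Return the hop list between two spokes; traffic always transits a hub."""
    src_hub, dst_hub = SPOKE_TO_HUB[src], SPOKE_TO_HUB[dst]
    if src_hub == dst_hub:
        return [src, src_hub, dst]
    # Hubs form a full mesh, so any two hubs are exactly one hop apart.
    return [src, src_hub, dst_hub, dst]


def full_mesh_links(hub_count: int) -> int:
    """Number of hub-to-hub links needed for a full mesh of hub_count hubs."""
    return hub_count * (hub_count - 1) // 2
```

With three regional hubs the full mesh needs only three inter-hub links, while each additional spoke adds exactly one link to its hub. That is why the topology scales well as Landing Zones are added.
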
## Key Properties

| Property | Value |
|----------|-------|
| Architecture type | Star topology |
| Scalability | High: a new spoke only needs a connection to the hub |
| Complexity | Low: routing policy is managed centrally |
| Drawback | The hub can become a bottleneck or a single point of failure |
| Use cases | Multi-account VPC interconnect, global Landing Zone networking |

## Relationship to Transit Gateway

AWS Transit Gateway is the core service for implementing a Hub-and-Spoke architecture:
- [[AWS-Transit-Gateway-TGW]] provides the regional hub function
- [[TGW-Peering]] provides cross-Region interconnect between hubs
- [[Hub-and-Spoke]] is combined with Full Mesh (Spoke-to-Hub = Hub-and-Spoke, Hub-to-Hub = Full Mesh)

## Connections

- [[AWS-Transit-Gateway-TGW]] ← implements ← [[Hub-and-Spoke]]
- [[TGW-Peering]] ← cross-hub connectivity ← [[Hub-and-Spoke]]
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]] ← case study ← [[Hub-and-Spoke]]

## Sources

- [[ctp-topic-18-wide-area-networking-in-aws-cloud]]
66
wiki/concepts/IPAM.md
Normal file
@@ -0,0 +1,66 @@
---
title: "IPAM"
type: concept
tags: [Networking, AWS, Automation, IP-Address-Management]
sources:
- ctp-topic-45-automatic-ip-address-allocation-with-ipam
- ctp-topic-61-workload-vpc-provision-with-ipam-automation
- ctp-topic-22-global-dns-service-offerings
last_updated: 2026-04-24
---

## IPAM (IP Address Management)

An enterprise IP address management platform whose core functions are to **manage**, **control**, **monitor** and **allocate** the organization's internal IP address space. IPAM replaces the traditional manual, Excel-based approach with centralization and automation.

## Problem Statement

Traditional IP address management relies on Excel spreadsheets:
- **Inefficient**: every VPC provisioning request needs several hand-offs with the network team
- **Error-prone**: manual planning easily produces overlapping IP ranges
- **Not traceable**: there is no unified change history
- **Not automatable**: IP allocation cannot be integrated with IaC pipelines

## Core Capabilities

1. **Centralized management**: a single source of truth for all IP address allocations
2. **Automated provisioning**: integrates with IaC tools through an API to allocate the next available IP block automatically
3. **Lifecycle management**: IP addresses are reclaimed automatically when a VPC is destroyed
4. **Approval workflow**: differentiated approval rules based on CIDR size
5. **Extensible attributes**: stores metadata (owner, company, subnet_name, status and so on)

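"Allocate the next available block" and "reclaim on VPC destroy" can be illustrated with a toy in-memory allocator built on Python's standard `ipaddress` module. This is a sketch of the idea only: the `SimpleIpam` class and its methods are hypothetical, keep one block per owner, and do not represent the Infoblox NIOS API or data model.

```python
import ipaddress


class SimpleIpam:
    """Toy in-memory allocator; a stand-in for a real IPAM backend."""

    def __init__(self, pool: str):
        self.pool = ipaddress.ip_network(pool)
        self.allocations = {}  # owner name -> allocated network

    def allocate(self, owner: str, prefixlen: int):
        """Allocate the first free block of the requested size, or raise."""
        if owner in self.allocations:
            raise ValueError(f"{owner} already holds {self.allocations[owner]}")
        for candidate in self.pool.subnets(new_prefix=prefixlen):
            # A candidate is free only if it overlaps no existing allocation.
            if not any(candidate.overlaps(used) for used in self.allocations.values()):
                self.allocations[owner] = candidate
                return candidate
        raise RuntimeError("address pool exhausted")

    def release(self, owner: str) -> None:
        """Return an owner's block to the pool (for example on VPC destroy)."""
        self.allocations.pop(owner, None)
```

The overlap check is what prevents the conflicting ranges that manual spreadsheet planning tends to produce, and `release` makes a freed block the first candidate for the next request of the same size.
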
## Implementation: Infoblox NIOS

In this wiki the core IPAM implementation is **Infoblox NIOS**:

- **Grid architecture**: a distributed grid with a master database and redundant DNS/NTP/DHCP services
- **API-driven**: IP addresses are allocated automatically through API calls
- **AWS integration**: acts as the source of IP addresses for the automated VPC provisioning workflow

## Key Concepts

- [[Infoblox-NIOS]]: the core network control plane
- [[Infoblox-Grid]]: the distributed grid architecture
- [[CIDR-审批流程]]: approval rules based on CIDR size
- [[VPC-自动化供给]]: declarative VPC creation driven by IPAM

## Key Entities

- [[Pushka]] (Principal SRE): initiator of the IPAM automation solution
- [[Infoblox]]: the IPAM vendor
- [[AWS-Landing-Zone]]: the environment in which IPAM is implemented

## Connections

- [[ctp-topic-45-automatic-ip-address-allocation-with-ipam]] ← mechanism ← **IPAM**
  - Introduces the core IPAM mechanism and the YAML-driven approach
- [[ctp-topic-61-workload-vpc-provision-with-ipam-automation]] ← application ← **IPAM**
  - Shows the end-to-end use of IPAM in Workload VPC provisioning
- [[ctp-topic-22-global-dns-service-offerings]] ← shares_infra ← **IPAM**
  - Infoblox underpins both DNS Anycast and IPAM

## Aliases

- IP Address Management
- IP Address Management System
- IPAM 系统
64
wiki/concepts/Infoblox-Grid.md
Normal file
@@ -0,0 +1,64 @@
---
title: "Infoblox-Grid"
type: concept
tags: [Networking, DNS, DHCP, IPAM, Infoblox, High-Availability]
sources:
- ctp-topic-45-automatic-ip-address-allocation-with-ipam
- ctp-topic-61-workload-vpc-provision-with-ipam-automation
- ctp-topic-22-global-dns-service-offerings
last_updated: 2026-04-24
---

## Infoblox Grid

The Infoblox Grid is Infoblox's distributed grid architecture and the highly available infrastructure for enterprise DNS, DHCP and IPAM (IP address management) services. It organizes multiple Infoblox appliances into a single logical unit with a unified control plane and redundancy.

## Architecture Components

### Grid Master
- **Role**: the controlling node of the whole Grid
- **Responsibilities**: manages member nodes, distributes configuration, makes IP allocation decisions
- **Location**: in this organization, the Houston data center

### Grid Members
- **Role**: worker nodes distributed across multiple geographic locations
- **Responsibilities**: host the DNS, DHCP and IPAM services
- **Redundancy**: multi-member deployment provides failover capability

### Supporting Services
- **DNS**: Anycast for low latency worldwide
- **NTP**: time synchronization
- **DHCP**: dynamic IP address assignment

## Grid Communication

- Members communicate with each other over the Grid protocol
- Configuration changes are distributed centrally by the master
- IP allocation decisions are coordinated by the master

## vs. Single Node

| Feature | Single node | Grid architecture |
|---------|-------------|-------------------|
| High availability | ❌ | ✅ Automatic failover |
| Geographic distribution | ❌ | ✅ Multiple sites worldwide |
| Centralized management | ✅ | ✅ Stronger |
| Scalability | Limited | ✅ Linear scaling |

## Key Concepts

- [[Infoblox-NIOS]]: the operating system that runs on Grid members
- [[IPAM]]: one of the Grid's core functions
- [[DNS-Anycast]]: an advanced feature of the Grid DNS service

## Connections

- [[ctp-topic-45-automatic-ip-address-allocation-with-ipam]] ← Infoblox Grid as the IPAM backend
- [[ctp-topic-61-workload-vpc-provision-with-ipam-automation]] ← the Grid prevents overlapping IP ranges
- [[ctp-topic-22-global-dns-service-offerings]] ← the Grid underpins the DNS Anycast service

## Aliases

- Infoblox Grid Architecture
- NIOS Grid
- Infoblox Cluster
71
wiki/concepts/Infoblox-NIOS.md
Normal file
@@ -0,0 +1,71 @@
---
title: "Infoblox-NIOS"
type: concept
tags: [Networking, DNS, DHCP, IPAM, Infoblox]
sources:
- ctp-topic-45-automatic-ip-address-allocation-with-ipam
- ctp-topic-61-workload-vpc-provision-with-ipam-automation
- ctp-topic-22-global-dns-service-offerings
last_updated: 2026-04-24
---

## Infoblox NIOS

NIOS is Infoblox's core network control-plane operating system and provides three core functions: **DNS**, **DHCP** and **IP address management (IPAM)**. As enterprise-grade network infrastructure, NIOS is the technical core of the IPAM automation solution in the Cloud Transformation Programme.

## Core Functions

### 1. DNS (Domain Name System)
- Authoritative DNS server
- DNS Anycast for low-latency resolution worldwide
- Deep integration with Microsoft Active Directory

### 2. DHCP (Dynamic Host Configuration Protocol)
- Automated IP address assignment
- IP lease management
- Integration with dynamic DNS updates

### 3. IPAM (IP Address Management)
- Centralized management of IP address pools
- Extensible Attributes for storing metadata
- API-driven automated allocation
- Integration with IaC tools (Terraform/Terragrunt)

## Extensible Attributes

NIOS supports custom extensible attributes for storing business metadata:

| Attribute | Purpose |
|-----------|---------|
| space_owner | Owner of the IP address space |
| company | Owning company or business unit |
| subnet_name | Subnet name |
| compartment_type | Compartment type |
| status | Allocation status (allocated/reserved/available) |
| business_contact | Business contact |
| engineering_contact | Engineering contact |

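One way to keep this metadata consistent before it is sent to any backend is to model it as a typed record with basic validation. The Python sketch below is an assumption for illustration: the `SubnetMetadata` class is hypothetical, it only mirrors the attribute names in the table above, and it does not reflect the actual NIOS API payload format.

```python
from dataclasses import asdict, dataclass

# Status values taken from the table above.
ALLOWED_STATUS = {"allocated", "reserved", "available"}


@dataclass
class SubnetMetadata:
    space_owner: str
    company: str
    subnet_name: str
    compartment_type: str
    status: str
    business_contact: str
    engineering_contact: str

    def __post_init__(self):
        # Reject status values the table does not list.
        if self.status not in ALLOWED_STATUS:
            raise ValueError(f"unknown status: {self.status!r}")


# Hypothetical example record; every value here is a placeholder.
example = SubnetMetadata(
    space_owner="network-team",
    company="example-bu",
    subnet_name="workload-a-private",
    compartment_type="workload",
    status="allocated",
    business_contact="owner@example.com",
    engineering_contact="sre@example.com",
)
example_dict = asdict(example)
```

`asdict(example)` yields a plain dictionary keyed by the attribute names, which a pipeline could then map onto whatever format the IPAM backend expects.
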
## Architecture

- **Grid Master**: the controlling node, located in the Houston data center
- **Redundant services**: active-active redundancy for DNS, NTP and DHCP
- **API**: a RESTful API supports automation integrations
- **AWS integration**: the VPC provisioning pipeline calls the NIOS API to allocate IP addresses

## Key Concepts

- [[IPAM]]: one of the core functions of NIOS
- [[Infoblox-Grid]]: the distributed grid architecture of NIOS
- [[DNS-Anycast]]: the DNS high-availability mechanism of NIOS

## Connections

- [[ctp-topic-45-automatic-ip-address-allocation-with-ipam]] ← NIOS as the IPAM engine
- [[ctp-topic-61-workload-vpc-provision-with-ipam-automation]] ← NIOS drives VPC provisioning
- [[ctp-topic-22-global-dns-service-offerings]] ← NIOS provides DNS Anycast

## Aliases

- NIOS
- InfoBlocks NIOS
- Infoblox Network Operating System
38
wiki/concepts/Inline-Layer.md
Normal file
@@ -0,0 +1,38 @@
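---
title: "Inline Layer (Firewall Policy)"
type: concept
tags: ["AWS", "Firewall", "Checkpoint", "Network-Security", "Policy"]
sources: ["ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security"]
last_updated: 2026-04-28
---

## Definition
An Inline Layer is a second way of organizing firewall policy in Checkpoint Firewall: a parent-child rule structure keyed by account number. Several child rules are nested under one parent rule, and the child rules control traffic per account. Unlike an Ordered Layer (sequential multi-layer inspection), an Inline Layer groups rules and applies inheritance along the account dimension.

## Mechanism
In [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]], Pradeep demonstrated where Inline Layers are used:

- **Grouping by account**: rules for the same account or OU are aggregated into one Inline Layer block
- **Parent-child rule structure**: the parent rule defines the scope (which account), and the child rules define the specific traffic that is allowed or denied
- **Automation-friendly**: when a new account is onboarded, child rules are added under a parent rule without changing the core policy structure
- **Simpler rule management**: the number of rules grows linearly with the number of accounts rather than as N^2
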
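The parent-child lookup can be sketched as a two-level match: first find the parent rule for the account, then evaluate only that parent's child rules. The Python below is a simplified illustration, not Check Point's rule model: the classes, the port-only matching, the implicit deny and the account IDs are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ChildRule:
    destination_port: int
    action: str  # "allow" or "deny"


@dataclass
class ParentRule:
    account_id: str
    children: list = field(default_factory=list)


def evaluate(parents: list, account_id: str, destination_port: int) -> str:
    """Match the parent rule by account, then the first matching child rule."""
    for parent in parents:
        if parent.account_id != account_id:
            continue
        for child in parent.children:
            if child.destination_port == destination_port:
                return child.action
        return "deny"  # assumption: an inline layer ends in an implicit deny
    return "deny"  # no parent rule covers this account


# Hypothetical policy with placeholder account IDs.
POLICY = [
    ParentRule("111111111111", [ChildRule(443, "allow"), ChildRule(22, "deny")]),
    ParentRule("222222222222", [ChildRule(5432, "allow")]),
]
```

Onboarding another account means appending one `ParentRule` with its children; existing entries are untouched, which is the linear-growth property described above.
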
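## Ordered Layer vs Inline Layer

| Dimension | Ordered Layer | Inline Layer |
|-----------|---------------|--------------|
| Organizing dimension | Multiple layers (geography → type → BU → product → environment → role) | Account number |
| Evaluation logic | Traffic passes through every layer in order | Child rules are matched under the parent rule |
| Suitable for | Fine-grained, multi-layer security control | Cross-account rule aggregation and automation |
| Management complexity | Medium (many dimensions, fine granularity) | Low (grouping by account simplifies management) |

## Combined Usage
In a Landing Zone, Checkpoint typically combines the two layer types:
- **Ordered Layers**: handle the security control layers (geo-blocking, BU isolation and so on)
- **Inline Layers**: handle per-account rule management and support automation and scaling

## Connections
- [[Checkpoint-Firewall]]: the Inline Layer is a core way of organizing a Checkpoint policy set
- [[Ordered-Layer]]: the two organizing modes of Checkpoint policy are Ordered Layer (sequential checks) and Inline Layer (account dimension)
- [[AWS-Landing-Zone]]: implemented within the LZ network segregation architecture
- [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]]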
@@ -21,3 +21,4 @@ last_updated: 2026-05-06
## Related Sources
- [[ctp-topic-35-aws-landing-zone-design-refresher-saas-labs]]
- [[ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones]]
- [[ctp-topic-39-implementing-eks-in-the-aws-lab-landing-zone]]

@@ -1,47 +1,72 @@
---
title: "Observability"
type: concept
tags: [devops, monitoring, sre, infrastructure]
last_updated: 2026-04-26
---

## Observability

**Chinese name:** 可观测性

**Type:** Technical methodology / core SRE pillar

**Aliases:**
- 可观测性
- 云原生可观测性
- Observability Stack

---

## Definition

Observability is the ability to infer a system's internal state from its external outputs. In IT operations it is usually described as three pillars:

1. **Metrics:** time-series aggregates of numeric runtime data, such as CPU utilization, memory usage and request QPS. Representative tools: Prometheus, InfluxDB, VictoriaMetrics.
2. **Logs:** discrete records of runtime events, such as error logs, access logs and business events. Representative tools: ELK (Elasticsearch + Logstash + Kibana), Loki, Graylog.
3. **Traces:** the call path of a distributed request across multiple services, such as the full latency of an HTTP request from API → DB → Cache. Representative tools: Jaeger, Zipkin, OpenTelemetry.

**Convergence of the three pillars:** OpenTelemetry (OTel), a CNCF project, is becoming the unified collection standard for observability data, bringing traces, metrics and logs together under one specification.

---

## Application in a Home Monitoring Setup

For home server/NAS monitoring, observability is implemented with the following components:
- **Metrics:** Prometheus + node_exporter + cAdvisor + blackbox_exporter
- **Visualization:** Grafana dashboards
- **Alerting:** Alertmanager + email/Slack notifications
- **Logs (optional):** Loki + Promtail

---

## Related Sources
- [[家庭监控方案-prometheus-grafana-node-exporter-cadvisor-blackbox]]
- [[public-cloud-learning-sessions-observability-with-opentelemetry]]
- [[ctp-topic-67-cloud-native-observability-using-opentelemetry]]
- [[ctp-topic-8-implementation-of-cloud-monitoring-using-micro-focus-operations-brid]]
---
title: "Observability"
type: concept
tags: [Observability, SRE, Cloud-Native, Telemetry, Monitoring, Reliability]
sources:
- public-cloud-learning-sessions-observability-with-opentelemetry-20240402-160113
- ctp-topic-67-cloud-native-observability-using-opentelemetry
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
last_updated: 2026-04-29
---

## Observability

Observability is the ability to understand a system's internal state from its external outputs. In software engineering, observability uses telemetry (metrics, logs and traces) to understand system health continuously, and it is the core technical foundation of [[SRE]] and [[Recovery-Assurance]].

## Three Pillars

The three pillars of observability:

| Pillar | Description | Examples |
|--------|-------------|----------|
| **Metrics** | Aggregated numeric data that reflects trends in system state | CPU utilization, request latency, error rate |
| **Logs** | Discrete event records of system activity in time order | Access logs, error logs, audit logs |
| **Traces** | The propagation path of a request across services and components | Distributed tracing, call-chain visualization |

## Observability vs. Monitoring

The core differences between traditional monitoring and observability:

| Dimension | Traditional monitoring | Observability |
|-----------|------------------------|---------------|
| **Goal** | Answer predefined questions | Answer arbitrary, previously unknown questions |
| **Assumption** | Failure modes are known | Failure modes are unknown |
| **Data** | Aggregated metrics, low cardinality | Raw events, high cardinality |
| **Root-cause analysis** | Relies on predefined dashboard views | Explores telemetry to locate the cause |
| **Suitable for** | Stable systems | Cloud-native, distributed systems |

> "You can't monitor your way to understanding a distributed system. You need observability." — Charity Majors

## Observability Engineering

Observability engineering treats observability as an architectural design principle and builds telemetry collection into the software development lifecycle:

- **Left-Shift**: define SLIs/SLOs during development and validate them continuously
- **Telemetry as Code**: keep telemetry configuration in IaC so that it is version-controlled
- **Continuous Validation**: use active probing (synthetic monitoring) to validate recovery paths

## Connection to SRE and Recovery Assurance

In [[SRE]] practice, observability is a prerequisite for meeting reliability targets:

- **Measurement basis for SLI/SLO/SLA**: observability supplies the raw data for quantifying reliability
- **Support for the error budget**: metrics track how fast the error budget is being consumed
- **Basis for on-call response**: logs and traces are the main data sources for reducing MTTR (Mean Time To Recovery)
- **Precondition for [[Recovery-Assurance]]**: a system that cannot be observed cannot have its recoverability assured

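Error budget tracking reduces to simple arithmetic over request metrics. The sketch below is a minimal Python illustration under assumptions: a request-based availability SLO, a hypothetical function name, and example numbers (99.9% SLO, one million requests) that do not come from the source sessions.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute error budget consumption for a request-based availability SLO."""
    # The budget is the number of requests allowed to fail under the SLO.
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,      # 1.0 means the budget is fully spent
        "budget_exhausted": consumed >= 1.0,
    }


# Hypothetical window: 99.9% SLO, 1,000,000 requests, 250 failures.
status = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=250)
```

Here roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget. The inputs would normally come from the metrics pillar, for example request and error counters.
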
## OpenTelemetry

[[OpenTelemetry]] (OTel) is the CNCF's open-source observability framework and provides a vendor-neutral, unified collection standard for metrics, logs and traces.

## Related Concepts

- [[SRE]]: observability is the foundation for SRE's four golden signals
- [[Recovery-Assurance]]: observability is the technical precondition for Recovery Assurance
- [[OpenTelemetry]]: a concrete implementation framework for observability engineering
- [[RTO]] / [[RPO]]: observability supports continuous monitoring of RTO/RPO

## Sources

- [[public-cloud-learning-sessions-observability-with-opentelemetry-20240402-160113]]
- [[ctp-topic-67-cloud-native-observability-using-opentelemetry]]
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]

41
wiki/concepts/Ordered-Layer.md
Normal file
@@ -0,0 +1,41 @@
---
title: "Ordered Layer (Firewall Policy)"
type: concept
tags: ["AWS", "Firewall", "Checkpoint", "Network-Security", "Policy"]
sources: ["ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security"]
last_updated: 2026-04-28
---

## Definition
An Ordered Layer is a way of organizing firewall policy in Checkpoint Firewall: a multi-layer inspection mechanism in which policy layers are arranged in priority order. Traffic must pass the checks layer by layer and is allowed only after passing all of them. Its counterpart is the Inline Layer (a parent-child rule structure keyed by account number).

## Mechanism
In [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]], Pradeep demonstrated the Checkpoint Ordered Layers in the Frankfurt Landing Zone:

1. **Geo-blocking**: blocks traffic by source/destination geography
2. **Type**: access control based on the resource's `Type` tag
3. **Business unit isolation (BU)**: isolates traffic between business units based on the `BU`/`BusinessUnit` tag
4. **Product isolation (Product)**: isolates traffic between products based on the `Product` tag
5. **Environment isolation (Environment)**: isolates environments (production/non-production) based on the `Environment` tag
6. **Server role (Server Role)**: fine-grained, role-level control based on the `ServerRole` tag

**Core properties**:
- Sequential evaluation: traffic must pass every layer, and a rejection at any layer rejects the traffic
- Cross-product traffic is blocked by default (inter-product is not allowed)
- Policy is driven by tags instead of traditional IP address rules

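Sequential evaluation, where any rejecting layer rejects the flow, can be sketched as a list of predicates applied in order. The Python below is an illustration under assumptions, not Check Point's evaluation engine: the flow dictionary shape, the placeholder country code, and the reading of "isolation" as "source and destination tags must match" are all hypothetical. Only four of the six layers are modeled; `Type` and `ServerRole` would be added the same way.

```python
BLOCKED_COUNTRIES = {"XX"}  # hypothetical placeholder country code


def geo_blocking(flow: dict) -> bool:
    """Pass when the source country is not on the block list."""
    return flow["src_country"] not in BLOCKED_COUNTRIES


def same_tag(tag: str):
    """Build a layer check that passes only if both ends carry the same tag value."""
    def check(flow: dict) -> bool:
        src = flow["src_tags"].get(tag)
        return src is not None and src == flow["dst_tags"].get(tag)
    return check


# Layers in evaluation order (a subset of the six layers listed above).
ORDERED_LAYERS = [
    ("geo-blocking", geo_blocking),
    ("BU", same_tag("BU")),
    ("Product", same_tag("Product")),
    ("Environment", same_tag("Environment")),
]


def evaluate(flow: dict) -> tuple:
    """Return (allowed, rejecting_layer); the first layer that rejects wins."""
    for name, check in ORDERED_LAYERS:
        if not check(flow):
            return False, name
    return True, None
```

A flow between two resources tagged with different `Product` values is rejected at the Product layer even though geo-blocking and BU passed, which matches the default-deny stance on inter-product traffic.
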
## Comparison with Traditional Firewall Rules

| Dimension | Traditional IP-based rules | Ordered Layer + tag-driven |
|-----------|----------------------------|----------------------------|
| Rule maintenance | IP changes require manual updates | Tags are associated automatically, so rules need no update |
| Dynamism | Static, poorly suited to cloud | Dynamic, follows resource tags |
| Scalability | Rule count explodes as accounts and services grow | Controlled through the OU + tag hierarchy |
| Management complexity | High (N^2 rules) | Low (layers + tag dimensions) |

## Connections
- [[Checkpoint-Firewall]]: the Ordered Layer is a core way of organizing a Checkpoint policy set
- [[Inline-Layer]]: the two organizing modes of Checkpoint policy are Ordered Layer (sequential checks) and Inline Layer (account dimension)
- [[AWS-Landing-Zone]]: implemented within the LZ network segregation architecture
- [[Resource-Tagging]]: the Ordered Layer depends on the tagging scheme to drive policy enforcement
- [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]]
49
wiki/concepts/Overlay-Network.md
Normal file
@@ -0,0 +1,49 @@
---
title: "Overlay Network"
type: concept
tags: [AWS, Networking, Virtualization, SD-WAN]
sources: [ctp-topic-18-wide-area-networking-in-aws-cloud]
last_updated: 2026-05-07
---

## Overlay Network

An overlay network is a logical network built on top of an existing physical network (the underlay). It uses tunneling to implement complex routing, security and traffic-engineering functions while staying decoupled from the underlying physical infrastructure.

## Definition

- **Underlay**: the physical infrastructure, that is routers, switches and fiber links (such as the physical connections between AWS Regions)
- **Overlay**: the logical tunnel network, a virtual network layer built on top of the underlay that provides end-to-end connectivity through encapsulation
- **Value of decoupling**: path selection and policy control in the overlay are independent of the physical topology of the underlay

## Key Mechanisms

- **Tunneling protocols**: GRE, VXLAN, IPSec, WireGuard and others
- **Encapsulation**: the original packet is wrapped in a new IP header and carried across the underlay
- **Network virtualization**: a VPC is AWS's native overlay network implementation

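Encapsulation is easiest to see as one packet carried as the payload of another. The Python sketch below is a conceptual model only: the `Packet` class is hypothetical, it carries no real header fields, and it does not implement GRE, VXLAN or any other actual tunneling protocol. The outer addresses use documentation ranges as placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: object  # application data, or another Packet when tunneled


def encapsulate(inner: Packet, tunnel_src: str, tunnel_dst: str) -> Packet:
    """Wrap the overlay packet in an outer header addressed between tunnel endpoints."""
    return Packet(src=tunnel_src, dst=tunnel_dst, payload=inner)


def decapsulate(outer: Packet) -> Packet:
    """Strip the outer header at the far tunnel endpoint."""
    return outer.payload


# Overlay addresses (private) travel inside underlay addresses (placeholders).
inner = Packet(src="10.1.0.5", dst="10.2.0.9", payload="hello")
outer = encapsulate(inner, tunnel_src="192.0.2.1", tunnel_dst="198.51.100.1")
```

Underlay routers only ever look at `outer.src` and `outer.dst`, which is why overlay addressing and policy can change without touching the physical network.
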
## In AWS Transit Gateway + SD-WAN Architecture

In the architecture described in [[ctp-topic-18-wide-area-networking-in-aws-cloud]]:

- **Underlay**: the physical network connections between AWS Regions (the full mesh between the APJ/EMEA/AMS regional hubs)
- **Overlay**: the logical network formed by the virtual Silver Peak SD-WAN appliances deployed in AWS
- **Value**: the SD-WAN overlay selects paths dynamically and fails over automatically even when static routes in the underlay fail

## Relationship to Related Concepts

| Concept | Relationship |
|---------|--------------|
| [[AWS-Transit-Gateway-TGW]] | AWS-native overlay service (regional) |
| [[SD-WAN]] | One form of overlay implementation |
| [[Hub-and-Spoke]] | A topology pattern for overlay networks |

## Connections

- [[ctp-topic-18-wide-area-networking-in-aws-cloud]] ← scenario ← [[Overlay-Network]]
- [[SD-WAN]] ← implementation ← [[Overlay-Network]]
- [[AWS-Transit-Gateway-TGW]] ← AWS-native ← [[Overlay-Network]]

## Sources

- [[ctp-topic-18-wide-area-networking-in-aws-cloud]]
57
wiki/concepts/Prisma-Access.md
Normal file
@@ -0,0 +1,57 @@
|
||||
---
|
||||
title: "Prisma Access"
|
||||
type: concept
|
||||
tags: [AWS, Security, SASE, VPN, Networking]
|
||||
sources: [ctp-topic-18-wide-area-networking-in-aws-cloud]
|
||||
last_updated: 2026-05-07
|
||||
---
|
||||
|
||||
## Prisma Access
|
||||
|
||||
Prisma Access 是 [[PaloAltoNetworks]] 提供的基于云的安全访问服务(SASE, Secure Access Service Edge),用于替代传统 VPN,提供更安全的统一访问体验。
|
||||
|
||||
## Definition
|
||||
|
||||
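- **Type**: SASE (Secure Access Service Edge) cloud security service
- **Vendor**: [[PaloAltoNetworks]]
- **Core function**: combines network security functions (SWG, CASB, ZTNA, Firewall-as-a-Service) and network connectivity (SD-WAN) in a single cloud-native service
- **Replaces**: traditional VPN (for example, Pulse Secure VPN)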
|
||||
|
||||
## Key Capabilities
|
||||
|
||||
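- **Nearest-PoP access**: a large number of points of presence (PoPs) are deployed worldwide, and users are routed automatically to the nearest one
- **Unified security policy**: all traffic passes through the same security inspection, with no per-device configuration
- **ZTNA (Zero Trust Network Access)**: access is authorized by identity and device posture rather than network location
- **SD-WAN integration**: connects directly to the SD-WAN backbone, unifying connectivity between the cloud and branch offices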
|
||||
|
||||
## In CTP Architecture
|
||||
|
||||
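The plan in [[ctp-topic-18-wide-area-networking-in-aws-cloud]]:

- **Current state**: [[Pulse-VPN]] provides remote access (traditional VPN architecture)
- **Target**: migrate to Prisma Access in order to:
  1. Offer more access gateways worldwide, so users connect to a nearby one
  2. Reduce access latency significantly
  3. Connect directly to the SD-WAN backbone
  4. Manage security policy in one place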
|
||||
|
||||
## Comparison: Traditional VPN vs Prisma Access
|
||||
|
||||
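| Dimension | Pulse VPN (traditional) | Prisma Access (SASE) |
|------|-----------------|----------------------|
| Access model | VPN tunnel, IP routing | Nearest PoP, identity-driven |
| Latency | Single VPN entry point, high latency | Global PoPs, low latency |
| Security policy | Based on network location | Based on identity and device posture |
| Scalability | Poor | Good (cloud-native) |
| SD-WAN integration | None | Native |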
|
||||
|
||||
## Connections
|
||||
|
||||
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]] ← remote access solution ← [[Prisma-Access]]
- [[PaloAltoNetworks]] ← vendor ← [[Prisma-Access]]
- [[SD-WAN]] ← integrates with ← [[Prisma-Access]]
- [[Pulse-VPN]] ← replaces ← [[Prisma-Access]]
|
||||
|
||||
## Sources
|
||||
|
||||
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]]
|
||||
35
wiki/concepts/Private-Hosted-Zone.md
Normal file
35
wiki/concepts/Private-Hosted-Zone.md
Normal file
@@ -0,0 +1,35 @@
|
||||
---
|
||||
title: "Private Hosted Zone"
|
||||
type: concept
|
||||
tags:
|
||||
- AWS
|
||||
- DNS
|
||||
- Networking
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
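A private hosted zone (PHZ) is an Amazon Route 53 feature that resolves custom private domain names (such as `int-sas.local` or `corp.internal`) inside the Amazon VPCs you specify. Unlike a public hosted zone, a PHZ does not expose its DNS records to the internet; they are visible only from the associated VPCs.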
|
||||
|
||||
## Aliases
|
||||
- Private Hosted Zone
|
||||
- PHZ
|
||||
- AWS 私有托管区
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
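- **VPC-scoped isolation**: records resolve only from associated VPCs, so internal names are never exposed
- **Cross-account association**: a VPC can be associated with a PHZ owned by another AWS account, but the authorization step must be completed before the association step
- **Interaction with Resolver rules**: when a name matches only the PHZ, Route 53 Resolver answers from the PHZ records; when a forwarding rule covers the same domain name, the forwarding rule takes precedence and the query is forwarded
- **Multiple VPCs**: one PHZ can be associated with many VPCs, including VPCs in other Regions (the source recommends same-Region association to reduce latency)
- **Centralized vs. distributed**: in a Landing Zone architecture, manage PHZs from a central DNS account rather than creating them separately in each workload account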
|
||||
|
||||
## Related Concepts
|
||||
- [[Route-53-Resolver]]: resolves queries against the PHZ
- [[Resolver-Rules]]: forward queries that are not answered from a PHZ
- [[VPC-Association-Authorization]]: the cross-account PHZ association flow
- [[AWS-Landing-Zone]]: PHZ management strategy in a multi-account environment
|
||||
|
||||
## Sources
|
||||
- [[ctp-topic-19-configuring-dns-within-aws-lzs]]
|
||||
46
wiki/concepts/Program-Demand-Process.md
Normal file
46
wiki/concepts/Program-Demand-Process.md
Normal file
@@ -0,0 +1,46 @@
|
||||
---
|
||||
title: "Program Demand Process"
|
||||
type: concept
|
||||
tags: [CTP, Cloud, AWS, Demand-Management]
|
||||
sources: [ctp-topic-20-program-demand-process-flow-and-poc-onboarding, ctp-topic-57-product-backlog-managing-demand]
|
||||
last_updated: 2026-04-14
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
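The Program Demand Process is the end-to-end path from the moment a business need arises, through prioritization, to delivery of the cloud migration. It is the main entry point to the governance framework of the Cloud Transformation Programme (CTP).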
|
||||
|
||||
## Demand Sources
|
||||
|
||||
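Demand comes from three kinds of drivers:

- **Business case**: business pressures such as data center closures, business continuity needs, and aging infrastructure
- **Strategic priority**: enterprise priorities defined by senior management (for example, Matt) and passed down from the top
- **Product roadmap**: the product teams' technology evolution needs and roadmap plans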
|
||||
|
||||
## Process Stages
|
||||
|
||||
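1. **Demand intake**: the business submits a transformation request
2. **Prioritization**: requests are ranked by business value and urgency
3. **POC decision**: assess whether a proof of concept is needed
4. **Gate approval**: pass the key decision points (Gate 0, Gate 1, Gate 3)
5. **Migration**: migrate to the Labs or SaaS production environment
6. **Acceptance and closure**: confirm that the migration met its goals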
|
||||
|
||||
## Key Gate Points
|
||||
|
||||
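- **Gate 0**: entry assessment; confirms the request is in scope for cloud transformation
- **Gate 1**: Design Authority approval; validates the solution design
- **Gate 3**: migration readiness; final approval to start the production migration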
|
||||
|
||||
## Related Concepts
|
||||
|
||||
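- [[Proof-of-Concept]]: the key validation step that reduces migration risk within the demand process
- [[Gate-Process]]: the decision framework that governs the demand process
- [[Solution-Design]]: the core deliverable for Gate 1 approval
- [[Product-Backlog]]: the prioritization mechanism for demand management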
|
||||
|
||||
## References
|
||||
|
||||
- [[ctp-topic-20-program-demand-process-flow-and-poc-onboarding]]
|
||||
- [[ctp-topic-57-product-backlog-managing-demand]]
|
||||
59
wiki/concepts/Proof-of-Concept.md
Normal file
59
wiki/concepts/Proof-of-Concept.md
Normal file
@@ -0,0 +1,59 @@
|
||||
---
|
||||
title: "Proof of Concept"
|
||||
type: concept
|
||||
tags: [CTP, Cloud, AWS, POC]
|
||||
sources: [ctp-topic-20-program-demand-process-flow-and-poc-onboarding]
|
||||
last_updated: 2026-04-14
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
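A proof of concept (POC) is the experimental phase before a formal cloud migration. It proves that the architecture is feasible, tests complex networking requirements, and validates the migration method. It is the main way to reduce cloud migration risk.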
|
||||
|
||||
## POC Objectives
|
||||
|
||||
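- **Architecture feasibility**: confirm that the target cloud architecture meets business requirements
- **Technical feasibility**: verify complex network configurations, dependencies, and integration points
- **Team enablement**: familiarize the team with the new-generation, Gruntwork-based Landing Zone environment
- **Risk identification**: find and resolve potential problems before the formal migration
- **Migration method validation**: validate the specific methods for data and application migration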
|
||||
|
||||
## POC vs. the Classic Landing Zone
|
||||
|
||||
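| Dimension | Classic approach | New-generation Landing Zone + POC |
|------|----------|---------------------------|
| Build method | Manual | Automated with IaC (Terraform/Terragrunt) |
| Repeatability | Low | High (through code reuse) |
| Environment consistency | Hard to guarantee | Strictly consistent |
| Documentation | Scattered | Centralized in the IaC code |
| Audit trail | Difficult | Git version control |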
|
||||
|
||||
## POC Deliverables
|
||||
|
||||
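The POC phase must produce these key deliverables:

- **Solution design document**: the architecture design approved by the Design Authority
- **IaC scripts**: Terraform/Terragrunt configuration usable for the production deployment
- **Migration schedule**: clear milestones and delivery dates
- **Success criteria report**: evidence that the product is ready for migration to production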
|
||||
|
||||
## Success Criteria
|
||||
|
||||
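POC success criteria must be defined before the POC starts. They include:

- Technical feasibility (the architecture meets requirements)
- Performance (the non-functional requirements, NFRs, are met)
- Security and compliance (the security review is passed)
- Team capability (the team can operate the new environment independently)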
|
||||
|
||||
## Related Concepts
|
||||
|
||||
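- [[Program-Demand-Process]]: the POC is the key validation step in the demand process
- [[Gate-Process]]: the POC phase is subject to Gate 1 Design Authority approval
- [[Solution-Design]]: the core POC output, which must be approved before migration
- [[Landing-Zone-Architecture]]: the target environment on which the POC is deployed
- [[Infrastructure-as-Code]]: the core technique behind the new-generation Landing Zone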
|
||||
|
||||
## References
|
||||
|
||||
- [[ctp-topic-20-program-demand-process-flow-and-poc-onboarding]]
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
title: "RPO (Recovery Point Objective)"
|
||||
tags: [devops, disaster-recovery, sre, reliability, data-protection]
|
||||
last_updated: 2026-04-28
|
||||
last_updated: 2026-04-29
|
||||
---
|
||||
|
||||
# RPO (Recovery Point Objective)
|
||||
@@ -85,7 +85,9 @@ RTO 和 RPO 衡量的是不同维度,必须**同时优化**:
|
||||
- [[Kill Switch]]: disables a faulty feature to stop further data corruption
- [[High Availability]]: the infrastructure foundation for a lower RPO
- [[Data-Governance]]: data governance, which includes RPO policy
|
||||
|
||||
## Sources
|
||||
- [[sources/rto-vs-rpo-key-differences-for-modern-disaster-recovery.md]]
|
||||
- [[sources/ctp-topic-72-implementing-an-enterprise-dr-strategy-using-aws-backup.md]]
|
||||
|
||||
- [[ctp-topic-72-implementing-an-enterprise-dr-strategy-using-aws-backup]]
|
||||
- [[ctp-topic-44-aws-backup-in-micro-focus]]
|
||||
- [[rto-vs-rpo-key-differences-for-modern-disaster-recovery]]
|
||||
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]
|
||||
@@ -8,7 +8,8 @@ tags:
|
||||
- Cloud Architecture
|
||||
sources:
|
||||
- ctp-topic-66-exposing-the-differences-between-postgresql-rds-and-aurora
|
||||
last_updated: 2026-04-28
|
||||
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
|
||||
last_updated: 2026-04-29
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
95
wiki/concepts/Recovery-Assurance.md
Normal file
95
wiki/concepts/Recovery-Assurance.md
Normal file
@@ -0,0 +1,95 @@
|
||||
---
|
||||
title: "Recovery Assurance"
|
||||
type: concept
|
||||
tags: [Recovery-Assurance, SRE, Disaster-Recovery, Observability, Automation, Cloud-DevOps, Resilience]
|
||||
sources:
|
||||
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
|
||||
last_updated: 2026-04-29
|
||||
---
|
||||
|
||||
## Recovery Assurance
|
||||
|
||||
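Recovery Assurance is the next step in the evolution of disaster recovery ([[Disaster-Recovery]]): from reacting to disasters to designing for recovery up front, validating it continuously, and using automation to guarantee that systems can recover. It is the central idea of the DR evolution framework that [[OpenText]] proposed in 2024.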
|
||||
|
||||
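## Definition

> "Recovery Assurance = recoverability as an architecture design principle + observability as the means of continuous monitoring + automation as the guarantee at scale"

Traditional DR asks "how do we recover after a disaster?" Recovery Assurance asks "how do we guarantee that the system recovers reliably from any failure?" It is a shift from reactive to proactive.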
|
||||
|
||||
## The Four-Pillar Framework
|
||||
|
||||
The four-pillar framework proposed by [[OpenText]] applies Recovery Assurance at four levels of the architecture:
|
||||
|
||||
```
┌──────────────────────────────────────────────────────┐
│ 1. DESIGN                                            │
│   → Recoverability as an architecture principle      │
│   → Recovery mechanisms defined at design time       │
│   → [[RTO]]/[[RPO]] targets set in design review     │
└──────────────────────────────────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────┐
│ 2. SOFTWARE                                          │
│   → Embedded telemetry for continuous health checks  │
│   → [[Self-Healing]] capabilities                    │
│   → [[Observability]]-driven fault detection         │
└──────────────────────────────────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────┐
│ 3. BUILD                                             │
│   → Recovery paths validated in [[Customer-Zero]]    │
│   → RTO/RPO checked against the SLA before release   │
│   → Recovery drills in the CI/CD pipeline            │
└──────────────────────────────────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────┐
│ 4. ENVIRONMENTS                                      │
│   → Ongoing [[SRE]] and observability operations     │
│   → One recoverability standard for AWS/GCP/Azure    │
│   → Release cadence driven by error budget           │
└──────────────────────────────────────────────────────┘
```
|
||||
|
||||
## Key Enablers
|
||||
|
||||
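| Enabler | Description |
|----------|------|
| **[[SRE]]** | Applies software engineering to operations problems and quantifies reliability through error budgets |
| **[[Observability]]** | Uses telemetry to understand system health continuously; the technical prerequisite for Recovery Assurance |
| **[[Self-Healing]]** | Automatic recovery at the software level, reducing manual response time and toil |
| **[[Customer-Zero]]** | Internal validation environment where recovery paths are tested under production-grade configuration |
| **[[Automation]]** | Cuts manual coordination cost so that Recovery Assurance can scale |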
|
||||
|
||||
## Why Evolution is Needed
|
||||
|
||||
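| Problem with traditional DR | Recovery Assurance answer |
|---------------|-----------------------------|
| Reactive | Proactive, designed in |
| Manual testing, high cost | Automated validation, running continuously |
| Tested on the customer's schedule | Continuous monitoring, immediate validation |
| No consistent method | One four-pillar framework |
| Does not scale | Automation makes it scalable |
| Covers only regional failures | Covers multi-cloud, multi-layer failure modes |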
|
||||
|
||||
## Connection to Business Continuity
|
||||
|
||||
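Recovery Assurance is the IT-level implementation of the [[Business-Continuity-Plan]] (BCP):

- **The BCP defines business recovery goals** (maximum tolerable outage, critical business functions)
- **Recovery Assurance delivers the technical recovery capability** (RTO/RPO, automated recovery paths)
- **Together**: they ensure that the business resumes operations within the SLA after a disaster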
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Disaster-Recovery]]: the predecessor from which Recovery Assurance evolved
- [[SRE]]: the core methodology of Recovery Assurance
- [[Observability]]: the technical foundation of Recovery Assurance
- [[Self-Healing]]: automatic recovery at the software level
- [[Customer-Zero]]: the validation environment for the Build pillar
- [[RTO]] / [[RPO]]: the quantitative targets of Recovery Assurance
- [[Business-Continuity-Plan]]: the business framework above Recovery Assurance
|
||||
|
||||
## Sources
|
||||
|
||||
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]
|
||||
35
wiki/concepts/Resolver-Rules.md
Normal file
35
wiki/concepts/Resolver-Rules.md
Normal file
@@ -0,0 +1,35 @@
|
||||
---
|
||||
title: "Resolver Rules"
|
||||
type: concept
|
||||
tags:
|
||||
- AWS
|
||||
- DNS
|
||||
- Networking
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
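Resolver rules are the core configuration object of Amazon Route 53 Resolver. Each rule defines which target DNS servers (for example, on-premises DNS in a data center) receive the queries for a given domain name. They are the key mechanism for hybrid-cloud DNS resolution.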
|
||||
|
||||
## Aliases
|
||||
- Resolver Rules
|
||||
- Route 53 Resolver Rules
|
||||
- DNS Forwarding Rules
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
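- **Domain-based forwarding**: a rule is defined for a domain name (for example, `corp.internal`) and applies to that name and its subdomains; matching queries are forwarded to the DNS servers at the configured IP addresses
- **Sharing**: rules are shared with workload accounts through AWS RAM (Resource Access Manager); a workload account associates the shared rule with its VPCs instead of creating its own
- **Inbound vs. outbound**: forwarding rules work with an outbound endpoint; an inbound endpoint handles the reverse direction (queries coming into AWS from outside)
- **Terraform automation**: rule definitions can be managed declaratively in Terraform and integrated into the modular Landing Zone provisioning flow
- **Acceptance**: for a cross-account share, the receiving account must accept the share before the rule can be used, unless sharing within AWS Organizations is enabled, in which case acceptance is automatic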
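
The domain matching described above can be sketched as a most-specific-suffix lookup. This is only an illustration of the selection logic, not the Resolver implementation; the `RULES` table, its IP addresses, and the `match_rule` helper are hypothetical.

```python
from typing import Optional

# Hypothetical forwarding rules: rule domain -> target DNS server IPs.
RULES = {
    "corp.internal": ["10.0.0.2"],
    "eu.corp.internal": ["10.1.0.2"],
}


def match_rule(query: str, rules: dict[str, list[str]]) -> Optional[str]:
    """Return the most specific rule domain that covers the query name, or None."""
    name = query.rstrip(".").lower()
    labels = name.split(".")
    # Try the full name first, then strip leading labels one at a time,
    # so the longest matching suffix wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    return None
```

A query that matches no rule returns `None`; in that case resolution falls through to whatever else applies in the VPC, such as an associated private hosted zone.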
|
||||
|
||||
## Related Concepts
|
||||
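- [[Route-53-Resolver]]: Resolver rules are configuration objects of the Resolver
- [[AWS-RAM]]: the mechanism for sharing rules across accounts
- [[Private-Hosted-Zone]]: complementary; a PHZ answers private names directly, while rules cover names that must be forwarded to external DNS
- [[AWS-Landing-Zone]]: rule management strategy with a centralized DNS account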
|
||||
|
||||
## Sources
|
||||
- [[ctp-topic-19-configuring-dns-within-aws-lzs]]
|
||||
45
wiki/concepts/Resource-Tagging.md
Normal file
45
wiki/concepts/Resource-Tagging.md
Normal file
@@ -0,0 +1,45 @@
|
||||
---
|
||||
title: "Resource Tagging"
|
||||
type: concept
|
||||
tags: ["AWS", "Tagging", "Cloud-Governance", "Cost-Allocation", "Security"]
|
||||
sources: ["ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security", "public-cloud-learning-sessions-opentext-tagging-standard-v2"]
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
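Resource tagging is the metadata system of AWS and other cloud platforms: key-value pairs attached to cloud resources that describe their business attributes, security classification, operational information, and so on. Tags are the basis for dynamic, automated governance of a cloud environment.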
|
||||
|
||||
## Standard Tag Taxonomy
|
||||
In the OpenText/Micro Focus cloud transformation environment, the core tag dimensions are:

| Tag key | Description | Example |
|--------|------|------|
| `Owner` | Resource owner (a PDL is preferred) | `Steve.Jarman@opentext.com` |
| `Team` | Team name | `ADM`, `ITOM` |
| `Type` | Resource type | `R&D`, `Production` |
| `BU` / `BusinessUnit` | Business unit | `Octane`, `ArcSight` |
| `Product` | Product the resource belongs to | `IDM`, `Operations` |
| `Environment` | Environment | `Production`, `UAT`, `Dev` |
| `ServerRole` | Server role | `Web`, `DB`, `App` |
| `AppID` | Application identifier | `OCT-HUB-001` |
| `Account` | AWS account | `123456789012` |
|
||||
|
||||
## Tagging as Security Foundation
|
||||
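In [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]], Steve Jarman stresses:
- **Prerequisite for migration planning**: before moving assets to the cloud, collect machine information, understand the migration scope, and then apply the correct tags
- **Tags as security credentials**: traditional IP-based firewall rules cannot keep up with a dynamic cloud environment, so tags become the dynamic input to security policy
- **SCP enforcement**: [[SCP-Security-Control-Policy]] denies creation of resources whose tags are non-compliant
- **Tag-driven Checkpoint policy**: Checkpoint Firewall reads resource tags to decide network access policy; missing or wrong tags cause traffic to be blocked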
|
||||
|
||||
## Tagging Governance Workflow
|
||||
```
Define tag standard → Tag automatically via IaC → Enforce with SCP → Audit with Tag Validation Tool → Fix non-compliant resources
```
(See [[ctp-topic-28-aws-tag-validation-tool]])
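
The audit step of this workflow can be sketched as a simple check for missing tags. This is a minimal sketch, not the Tag Validation Tool itself; `REQUIRED_TAGS` is an assumed subset of the standard, and `missing_tags` is a hypothetical helper.

```python
# Assumed subset of the tag standard, chosen for illustration only.
REQUIRED_TAGS = {"Owner", "Team", "Environment", "Product"}


def missing_tags(resource_tags: dict[str, str],
                 required: set[str] = REQUIRED_TAGS) -> list[str]:
    """Return required tag keys that are absent or empty, sorted for stable output."""
    return sorted(k for k in required if not resource_tags.get(k, "").strip())
```

A resource with an empty result is compliant; anything else goes to the "fix non-compliant resources" step.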
|
||||
|
||||
## Connections
|
||||
- [[SCP-Security-Control-Policy]]: tags are what SCPs evaluate when they enforce policy
- [[Checkpoint-Firewall]]: tags drive firewall policy
- [[AWS-Landing-Zone]]: the tag scheme is central to LZ governance
|
||||
- [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]]
|
||||
- [[public-cloud-learning-sessions-opentext-tagging-standard-v2]]
|
||||
34
wiki/concepts/Route-53-Resolver.md
Normal file
34
wiki/concepts/Route-53-Resolver.md
Normal file
@@ -0,0 +1,34 @@
|
||||
---
|
||||
title: "Route 53 Resolver"
|
||||
type: concept
|
||||
tags:
|
||||
- AWS
|
||||
- DNS
|
||||
- Networking
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
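Amazon Route 53 Resolver is the core DNS resolution component of Amazon Route 53. It forwards DNS queries between VPCs and other networks through two endpoint types: inbound endpoints, which let an on-premises data center send DNS queries into an AWS VPC, and outbound endpoints, which let a VPC forward queries to on-premises DNS servers. Together they provide two-way DNS resolution in a hybrid cloud.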
|
||||
|
||||
## Aliases
|
||||
- Route 53 Resolver
|
||||
- AWS Resolver
|
||||
|
||||
## Key Characteristics
|
||||
|
||||
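- **Hybrid-cloud DNS gateway**: enables name resolution in both directions between AWS resources in a VPC and the on-premises data center
- **Inbound endpoint**: listens on UDP/TCP port 53 on its ENIs and receives DNS queries from the on-premises network
- **Outbound endpoint**: sends queries that match a forwarding rule (Resolver rule) to the configured IP addresses, such as on-premises DNS servers
- **Cross-account sharing**: Resolver rules can be shared with other AWS accounts through AWS RAM, so each account does not have to create its own
- **Works with private hosted zones**: Resolver answers from an associated PHZ when no forwarding rule covers the name; when a forwarding rule and a PHZ cover the same domain name, the forwarding rule takes precedence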
|
||||
|
||||
## Related Concepts
|
||||
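- [[Private-Hosted-Zone]]: resolves private domain names inside a VPC
- [[Resolver-Rules]]: define the forwarding logic per domain
- [[VPC-Association-Authorization]]: authorization for associating a VPC with a PHZ in another account
- [[AWS-Landing-Zone]]: the context for centralized DNS management across accounts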
|
||||
|
||||
## Sources
|
||||
- [[ctp-topic-19-configuring-dns-within-aws-lzs]]
|
||||
42
wiki/concepts/SCP-Security-Control-Policy.md
Normal file
42
wiki/concepts/SCP-Security-Control-Policy.md
Normal file
@@ -0,0 +1,42 @@
|
||||
---
|
||||
title: "SCP (Service Control Policy)"
|
||||
type: concept
|
||||
tags: ["AWS", "Security", "Landing-Zone", "Tagging", "OU"]
|
||||
sources: ["ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security"]
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
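A service control policy (SCP) is a policy type in AWS Organizations, used here mainly through explicit deny statements to enforce organization-wide security and compliance rules. Unlike an IAM policy, an SCP is attached to the organization root, an organizational unit (OU), or an account. It does not grant permissions; it sets the upper bound on what principals in the affected accounts are allowed to do.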
|
||||
|
||||
## Core Mechanism
|
||||
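- **Tag-based SCPs**: deny resource creation when the expected tag values are absent (for example, deny creating an EC2 instance without the `Environment: Production` tag in a specific OU)
- **OU hierarchy**: SCPs are inherited from the top of the OU hierarchy down, and an explicit deny at any level cannot be overridden lower down
- **Tamper protection for tags**: stop ordinary users from changing tags (for example, from `Team: ADM` to `Team: ITOM`) to bypass security auditing or access control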
|
||||
|
||||
## In AWS Landing Zone Context
|
||||
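In the [[AWS-Landing-Zone]] architecture, SCPs are a key governance component:
- They work together with the tag-driven policy of [[Checkpoint-Firewall]]: SCPs ensure that only correctly tagged resources enter the cloud environment, and Checkpoint enforces network-layer access control based on those tags
- SCPs are the main way to implement guardrails
- They complement the grant-based IAM model with an enforced deny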
|
||||
|
||||
## Example Use Case
|
||||
Deny creating an EC2 instance when the `Owner` tag is missing from the request:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyRunInstancesWithoutOwnerTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/Owner": "true"
        }
      }
    }
  ]
}
```
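
The effect of the `Null` condition in the statement above can be modeled in a few lines. This is a deliberately simplified sketch of that one statement, not of IAM policy evaluation in general; `is_denied` is a hypothetical helper.

```python
def is_denied(action: str, request_tags: dict[str, str]) -> bool:
    """Mimic the statement above: deny ec2:RunInstances when no Owner tag is sent."""
    # "Null": {"aws:RequestTag/Owner": "true"} is satisfied when the key is absent.
    owner_is_null = "Owner" not in request_tags
    return action == "ec2:RunInstances" and owner_is_null
```

Note that the condition only checks that the tag key is present in the request; it does not validate the tag's value.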
|
||||
|
||||
## Connections
|
||||
- [[AWS-Landing-Zone]]: SCPs are a core LZ governance tool
- [[Checkpoint-Firewall]]: SCPs plus Checkpoint form an end-to-end, tag-driven security model
- [[ctp-topic-10-aws-landing-zone-lz-data-collection-tagging-related-security]]
- [[ctp-topic-28-aws-tag-validation-tool]]: SCPs enforce tags on new resources; the Tag Validation Tool audits existing ones
|
||||
58
wiki/concepts/SD-WAN.md
Normal file
58
wiki/concepts/SD-WAN.md
Normal file
@@ -0,0 +1,58 @@
|
||||
---
|
||||
title: "SD-WAN (Software-Defined Wide Area Network)"
|
||||
type: concept
|
||||
tags: [AWS, Networking, WAN, Overlay, SASE]
|
||||
sources: [ctp-topic-18-wide-area-networking-in-aws-cloud]
|
||||
last_updated: 2026-05-07
|
||||
---
|
||||
|
||||
## SD-WAN (Software-Defined Wide Area Network)
|
||||
|
||||
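SD-WAN (Software-Defined Wide Area Network) is a WAN technology in which a software control layer abstracts the physical network to provide dynamic path selection, load balancing, and automated traffic steering.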
|
||||
|
||||
## Definition
|
||||
|
||||
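- **SD**: Software-Defined; the control plane is separated from the data plane and managed centrally in software
- **WAN**: Wide Area Network; a network that spans geographic regions
- **Core value**: abstracts the physical underlay into a logical overlay so that traffic can be steered flexibly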
|
||||
|
||||
## In CTP Architecture
|
||||
|
||||
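The evolution path described in [[ctp-topic-18-wide-area-networking-in-aws-cloud]]:

- **Current state**: routing between TGWs relies on static prefix lists with no dynamic BGP routing, so DR scenarios need manual intervention
- **Target**: introduce [[SilverPeak]] SD-WAN as the overlay by deploying virtual SD-WAN appliances in AWS
- **Problems solved**: dynamic path selection and automated traffic steering remove the limits of static routing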
|
||||
|
||||
## Key Properties
|
||||
|
||||
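| Property | Value |
|------|-----|
| Architecture type | Overlay network |
| Control plane | Centralized in software, decoupled from hardware |
| Path selection | Based on real-time link quality (bandwidth, latency, packet loss) |
| Deployment model | Virtual appliance (vSIM or software-only) |
| Typical vendors | Silver Peak, Viptela (Cisco), VeloCloud (VMware) |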
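
Path selection on real-time link quality, as listed in the table, can be sketched as a scoring function over candidate links. The weights, the loss threshold, and the names `Link`, `link_score`, and `best_path` are illustrative assumptions, not the behavior of any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    name: str
    latency_ms: float
    loss_pct: float
    bandwidth_mbps: float


def link_score(link: Link) -> float:
    """Lower is better. The weights are illustrative assumptions."""
    return link.latency_ms + 50.0 * link.loss_pct - 0.01 * link.bandwidth_mbps


def best_path(links: list[Link], max_loss_pct: float = 2.0) -> Link:
    """Pick the lowest-scoring link among those under the loss threshold."""
    usable = [link for link in links if link.loss_pct <= max_loss_pct]
    if not usable:
        raise RuntimeError("no link meets the loss threshold")
    return min(usable, key=link_score)
```

Re-running `best_path` whenever measurements change is what makes the choice dynamic: a link that degrades drops out, and traffic moves without a routing change in the underlay.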
|
||||
|
||||
## Relationship to SASE
|
||||
|
||||
SD-WAN is a core component of the SASE (Secure Access Service Edge) architecture:
- SD-WAN provides flexible WAN connectivity
- SASE combines SD-WAN with security services (SWG, CASB, ZTNA)
- [[Prisma-Access]] is the Palo Alto Networks SASE product
|
||||
|
||||
## Connections
|
||||
|
||||
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]] ← evolution target ← [[SD-WAN]]
- [[SilverPeak]] ← vendor ← [[SD-WAN]]
- [[Overlay-Network]] ← built on ← [[SD-WAN]]
- [[Prisma-Access]] ← integrates with ← [[SD-WAN]]
|
||||
|
||||
## Relationship to CTP Topic 31
|
||||
|
||||
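In [[ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones]], SSM serves as an **interim solution** until SD-WAN is in place: SSM provides secure access without a VPN, and once SD-WAN is deployed it will address multi-region connectivity and unified security policy management at the network layer.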
|
||||
|
||||
## Sources
|
||||
|
||||
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]]
|
||||
- [[ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones]]
|
||||
40
wiki/concepts/SLR.md
Normal file
40
wiki/concepts/SLR.md
Normal file
@@ -0,0 +1,40 @@
|
||||
---
|
||||
title: "SLR"
|
||||
type: concept
|
||||
tags: [Service-Level, SLO, SRE, Monitoring, Cloud-Transformation]
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
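A Service Level Requirement (SLR) is the business-level statement of how reliable and how fast a service needs to be. It is the formal requirement, seen from the business side, for the availability and performance a service must reach, and it is normally used together with an SLA (Service Level Agreement, the external commitment) and an SLO (Service Level Objective, the internal target).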
|
||||
|
||||
## Relationship with SLO and SLA
|
||||
|
||||
```
|
||||
SLA (external commitment) ← based on ← SLO (internal target) ← based on ← SLR (business requirement)
|
||||
```
|
||||
|
||||
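| Level | Perspective | Description |
|------|------|------|
| SLR | Business requirement | What the product or business owner actually needs from the service |
| SLO | Internal target | Internal reliability target set by the SRE/operations team (usually stricter than the SLA) |
| SLA | External commitment | Formal contractual commitment to customers |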
|
||||
|
||||
## SLR in Cloud Transformation
|
||||
|
||||
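In a cloud transformation project, the SRE team and product teams define the SLR/SLO scheme together:
- **Goal**: report metrics to product teams weekly, biweekly, or monthly
- **Method**: define monitoring metrics and roll SLIs up into KPIs
- **Tooling**: show SLO attainment in observability tools such as Grafana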
|
||||
|
||||
## Relationship with Monitoring
|
||||
|
||||
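SLRs are the starting point for monitoring design:
1. Derive concrete SLIs (Service Level Indicators) from the SLR
2. Map each SLI to specific monitoring metrics and alert rules
3. Collect the metrics continuously and compare them with the SLO; every shortfall consumes error budget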
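
The last step above, comparing measurements with the SLO and spending error budget, comes down to simple arithmetic. The sketch below assumes an availability SLO measured over a rolling window; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo, window_days)
```

For example, a 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so 21.6 minutes of outage leaves roughly half the budget.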
|
||||
|
||||
## Sources
|
||||
|
||||
- [[ctp-topic-30-managing-change]]
|
||||
54
wiki/concepts/SRE.md
Normal file
54
wiki/concepts/SRE.md
Normal file
@@ -0,0 +1,54 @@
|
||||
---
|
||||
title: "SRE"
|
||||
type: concept
|
||||
tags: [SRE, Site-Reliability-Engineering, DevOps, Automation, Cloud-Transformation]
|
||||
last_updated: 2026-04-28
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
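Site Reliability Engineering (SRE) is a discipline that applies software engineering to operations problems. Its core idea is to break down the wall between traditional operations and product development and to raise service quality through automation, reliability measurement, and systematic methods. SRE originated at Google and is now widely used in cloud-native and enterprise IT environments.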
|
||||
|
||||
## Core Principles
|
||||
|
||||
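### 1. Treat errors as learning opportunities
SRE treats failures and errors as chances to improve the system rather than occasions to assign blame. Post-mortems and the CAPA process extract root causes from incidents.

### 2. Embrace risk
SRE accepts that every service carries inherent risk. Error budgets and SLOs quantify the acceptable level of reliability and balance the pace of innovation against stability.

### 3. Eliminate toil
Toil is work that is manual, repetitive, automatable, of no lasting value, and that grows linearly with scale. An SRE team should keep toil below 50% of its time and spend the rest on systematic improvement and feature development.

### 4. Automate everything that can be automated
IaC (Infrastructure as Code) and CI/CD pipelines automate changes end to end, reducing manual approval steps and the chance of error.

### 5. Measurability first
Every system behavior needs supporting metrics (SLI/SLO/SLA), with monitoring and alerting to catch problems early.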
|
||||
|
||||
## SRE Team Collaboration Model
|
||||
|
||||
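In a cloud transformation project, the SRE team works with product teams in three phases:

| Phase | Description | SRE responsibilities |
|------|------|----------|
| Build | Product infrastructure is set up | Define the technical architecture, share IaC modules, define SLOs/SLRs |
| Early Live Support | Transition between Build and BAU | Complete the Go-Live Checklist (monitoring coverage, support model, incident response process) |
| BAU (business as usual) | Ongoing operations | Weekly/biweekly/monthly metrics reporting, continuous improvement |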
|
||||
|
||||
## Key Metrics
|
||||
|
||||
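- **SLI (Service Level Indicator)**: a directly measured system metric (for example, availability or latency)
- **SLO (Service Level Objective)**: the target value for an SLI (for example, 99.9% availability)
- **Error budget**: the amount of unreliability the SLO permits; it guides release cadence
- **Toil**: repetitive manual work; should stay below 50%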
|
||||
|
||||
## Relationship with DevOps
|
||||
|
||||
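SRE is one concrete implementation of DevOps. DevOps stresses removing the boundary between development and operations; SRE puts that into practice with quantitative measures (error budgets, SLOs) and automation tooling.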
|
||||
|
||||
## Sources
|
||||
|
||||
- [[ctp-topic-30-managing-change]]
|
||||
- [[ctp-topic-41-nfrs-and-error-budgets]]
|
||||
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]
|
||||
99
wiki/concepts/Self-Healing.md
Normal file
99
wiki/concepts/Self-Healing.md
Normal file
@@ -0,0 +1,99 @@
|
||||
---
|
||||
title: "Self-Healing"
|
||||
type: concept
|
||||
tags: [Self-Healing, SRE, Automation, Resilience, Cloud-Native, Fault-Tolerance]
|
||||
sources:
|
||||
- public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2
|
||||
last_updated: 2026-04-29
|
||||
---
|
||||
|
||||
## Self-Healing
|
||||
|
||||
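Self-healing is the ability of a software system to monitor its own health continuously and to detect failures and restore service automatically, without human intervention. It is the software-level implementation of the [[SRE]] and [[Recovery-Assurance]] ideas.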
|
||||
|
||||
## Definition
|
||||
|
||||
> "Self-healing is the ability of a system to detect failures, diagnose the root cause, and restore service automatically without human intervention." — [[SRE]] Principles
|
||||
|
||||
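A self-healing system recovers automatically through four mechanisms:

1. **Failure detection**: identify anomalies from the telemetry collected through [[Observability]]
2. **Root-cause diagnosis**: analyze the anomaly pattern and classify the failure (transient vs. persistent)
3. **Recovery**: trigger a predefined remediation (restart the service, switch nodes, scale out, or degrade gracefully)
4. **Verification**: after recovery, verify service availability and confirm a healthy state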
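
The detect, recover, and verify steps above can be sketched as a small control loop. This is a minimal illustration under simplifying assumptions: diagnosis is omitted, there is no back-off between attempts, and `self_heal`, `is_healthy`, and `remediate` are hypothetical names.

```python
from typing import Callable


def self_heal(is_healthy: Callable[[], bool],
              remediate: Callable[[], None],
              max_attempts: int = 3) -> bool:
    """Detect, remediate, verify. Returns True once healthy, False if attempts run out."""
    for _ in range(max_attempts):
        if is_healthy():      # detection, and verification of the previous attempt
            return True
        remediate()           # recovery action, e.g. restart or fail over
    return is_healthy()       # final verification after the last remediation
```

Bounding the number of attempts matters: a failure that remediation cannot fix should escalate to a human instead of looping forever.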
|
||||
|
||||
## Self-Healing Mechanisms
|
||||
|
||||
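| Layer | Mechanism | Examples |
|------|------|------|
| **Infrastructure** | Automatic replacement of failed compute nodes | Kubernetes node replacement, EC2 Auto Recovery |
| **Container/orchestration** | Automatic Pod restart and rescheduling | Kubernetes liveness/readiness probes, restart policy |
| **Application** | Self-healing logic built into the application | Circuit breaker pattern, graceful degradation |
| **Data** | Automatic failover | Multi-AZ RDS automatic failover, DynamoDB automatic replication |
| **Network** | Automatic traffic rerouting | Route 53 health checks with DNS failover, NLB removal of unhealthy targets |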
|
||||
|
||||
## Relationship with SRE
|
||||
|
||||
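In [[SRE]] practice, self-healing is an important way to eliminate toil:

- **Lower Mean Time To Recovery (MTTR)**: automated recovery is much faster than a human response (the source cites 10-100x)
- **Less toil**: on-call engineers no longer handle predictable failure modes by hand
- **Error budget protection**: faster recovery means higher availability and slower error budget consumption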
|
||||
|
||||
## Connection to Recovery Assurance
|
||||
|
||||
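[[Recovery-Assurance]] requires not only that a system can recover but that its ability to recover is **guaranteed**. Self-healing is one of its technical foundations:

- **Continuous recoverability validation**: every self-healing test is also a continuous validation of a recovery path
- **Less dependence on people**: manual coordination is the main cause of delay in DR testing, and self-healing removes that bottleneck
- **Prerequisite for scale**: a system that cannot heal itself cannot guarantee recovery at cloud-native scale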
|
||||
|
||||
## Self-Healing vs. Chaos Engineering
|
||||
|
||||
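| Dimension | Self-healing | Chaos engineering |
|------|---------------------|---------------------------|
| **Purpose** | Recover automatically when a failure occurs | Inject failures deliberately to verify resilience |
| **Trigger** | Reactive (a failure occurs) | Proactive (an experiment injects one) |
| **Timing** | During production failures | Routine experiments |
| **Relationship** | Complementary: chaos engineering finds weaknesses, self-healing repairs failures | Complementary: chaos engineering finds weaknesses, self-healing repairs failures |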
|
||||
|
||||
## Implementation Pattern
|
||||
|
||||
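```yaml
# Kubernetes self-healing example (names and image are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      restartPolicy: Always              # failed containers restart automatically
      terminationGracePeriodSeconds: 30  # graceful shutdown
      containers:
        - name: app
          image: example.com/demo-app:1.0   # placeholder image
          resources:
            requests:
              cpu: 100m                  # needed for CPU-utilization autoscaling
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
---
# HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```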
|
||||
|
||||
## Related Concepts
|
||||
|
||||
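- [[SRE]]: self-healing is a core SRE technique for eliminating toil and improving reliability
- [[Recovery-Assurance]]: self-healing is a technical foundation of Recovery Assurance
- [[Observability]]: self-healing depends on telemetry from observability
- [[High-Availability]]: high availability provides the infrastructure that self-healing relies on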
|
||||
|
||||
## Sources
|
||||
|
||||
- [[public-cloud-learning-sessions-opentext-evolving-from-dr-to-recovery-assurance-2]]
|
||||
83
wiki/concepts/Solution-Design.md
Normal file
83
wiki/concepts/Solution-Design.md
Normal file
@@ -0,0 +1,83 @@
|
||||
---
|
||||
title: "Solution Design"
|
||||
type: concept
|
||||
tags: [CTP, Cloud, AWS, Architecture]
|
||||
sources: [ctp-topic-20-program-demand-process-flow-and-poc-onboarding]
|
||||
last_updated: 2026-04-14
|
||||
---
|
||||
|
||||
## Definition
|
||||
|
||||
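A Solution Design is the architecture document that must be completed during the POC phase and approved by the Design Authority. It ensures that the cloud migration plan follows cloud-native principles, security and compliance requirements, and the company's technical standards.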
|
||||
|
||||
## Purpose
|
||||
|
||||
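- Provide a validated technical blueprint for the cloud migration
- Ensure the architecture meets business requirements and non-functional requirements (NFRs)
- Give the Design Authority (Gate 1) a basis for approval
- Serve as the specification for the IaC implementation that follows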
|
||||
|
||||
## Key Components
|
||||
|
||||
### 1. Architecture Overview
|
||||
|
||||
- Target cloud architecture diagram (VPC design, network topology, AZ distribution)
- Integration with the existing on-premises environment
- Multi-account structure

### 2. Landing Zone Design

- Landing Zone configuration based on the Gruntwork reference architecture
- Security boundaries and network segmentation
- IAM roles and access control policies

### 3. Application Migration Design

- Cloud migration strategy for each application (Rehost/Replatform/Refactor)
- Data migration plan
- Dependency mapping

### 4. IaC Design

- Terraform/Terragrunt module design
- CI/CD pipeline configuration
- Environment consistency strategy

### 5. Security & Compliance

- Security baseline configuration
- Compliance audit planning
- Data protection measures

### 6. Operations Design

- Monitoring and observability plan
- Disaster recovery strategy
- Operations processes and runbooks
|
||||
|
||||
## Design Principles
|
||||
|
||||
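- **Cloud-native first**: make full use of cloud-native services and minimize lift-and-shift
- **Security built in**: security requirements are included from the design stage, not added later
- **IaC at the core**: all infrastructure changes are managed as code
- **Designed for observability**: monitoring and logging are planned from the start
- **Scalability**: the architecture should accommodate future business growth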
|
||||
|
||||
## Review & Approval Process
|
||||
|
||||
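1. **Self-review**: internal review by the solution team
2. **Security review**: the security team checks compliance
3. **Design Authority review**: the central review step and Gate 1 approval
4. **Final approval**: the design is added to the formal migration plan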
|
||||
|
||||
## Related Concepts
|
||||
|
||||
- [[Proof-of-Concept]]: the Solution Design is the core output of the POC phase
- [[Gate-Process]]: the Solution Design is the core deliverable for Gate 1 approval
- [[Landing-Zone-Architecture]]: the infrastructure blueprint for the Solution Design
- [[Infrastructure-as-Code]]: how the Solution Design is implemented
- [[Design-Authority]]: the body that approves the Solution Design
|
||||
|
||||
## References
|
||||
|
||||
- [[ctp-topic-20-program-demand-process-flow-and-poc-onboarding]]
|
||||
50
wiki/concepts/Static-Routing.md
Normal file
50
wiki/concepts/Static-Routing.md
Normal file
@@ -0,0 +1,50 @@
|
||||
---
|
||||
title: "Static Routing"
|
||||
type: concept
|
||||
tags: [AWS, Networking, Routing, Transit Gateway]
|
||||
sources: [ctp-topic-18-wide-area-networking-in-aws-cloud]
|
||||
last_updated: 2026-05-07
|
||||
---
|
||||
|
||||
## Static Routing
|
||||
|
||||
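Static routing uses fixed route entries that a network administrator configures by hand; the routes do not update automatically when the topology changes. The alternative is dynamic routing, where a routing protocol such as BGP or OSPF discovers and updates routes automatically.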
|
||||
|
||||
## Definition
|
||||
|
||||
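- **Configuration**: the mapping from destination network to next hop is entered manually on the router or gateway
- **Route selection**: fixed until an administrator changes it
- **Best suited to**: small networks and environments where paths are well defined and stable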
|
||||
|
||||
## Limitations in AWS Transit Gateway Context
|
||||
|
||||
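The problems described in [[ctp-topic-18-wide-area-networking-in-aws-cloud]]:

- **Current state**: cross-region routing between TGWs relies on static prefix lists
- **No dynamic protocol**: without BGP or another dynamic routing protocol, link failures and topology changes are not detected automatically
- **DR pain point**: in a disaster recovery scenario, routes must be switched manually; there is no automatic convergence
- **Scale limit**: as the number of Landing Zones grows, the effort of maintaining static route tables by hand grows much faster than linearly (roughly quadratically in a full mesh)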
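
The scale limit in the last point can be made concrete with a little counting. The sketch below assumes a full mesh of regional hubs where every hub needs a static route for each prefix behind every other hub; both function names are hypothetical.

```python
def peering_count(hubs: int) -> int:
    """Number of TGW peering attachments in a full mesh of regional hubs."""
    return hubs * (hubs - 1) // 2


def static_route_entries(hubs: int, prefixes_per_hub: int) -> int:
    """Static routes to maintain: each hub needs a route per prefix of every remote hub."""
    return hubs * (hubs - 1) * prefixes_per_hub
```

Going from 3 hubs to 6 takes the mesh from 3 peerings to 15, and the number of route entries to maintain by hand grows at the same rate.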
|
||||
|
||||
## Static vs Dynamic Routing
|
||||
|
||||
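| Dimension | Static routing | Dynamic routing (BGP/OSPF) |
|------|---------|-------------------|
| Configuration complexity | Low (small networks) | High |
| Failure recovery | ❌ Manual intervention | ✅ Automatic convergence |
| Scalability | Poor | Good |
| Resource overhead | Low | High (protocol overhead) |
| Suitable scale | < 10 nodes | Any scale |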
|
||||
|
||||
## Evolution Path
|
||||
|
||||
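The limits of static routing drove the move to [[SD-WAN]]. SD-WAN selects paths dynamically through a software control layer, so the overlay can steer traffic intelligently even while the underlay still uses static routes.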
|
||||
|
||||
## Connections
|
||||
|
||||
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]] ← pain point ← [[Static-Routing]]
- [[AWS-Transit-Gateway-TGW]] ← applied to ← [[Static-Routing]]
- [[SD-WAN]] ← evolution target ← [[Static-Routing]]
|
||||
|
||||
## Sources
|
||||
|
||||
- [[ctp-topic-18-wide-area-networking-in-aws-cloud]]
|
||||
46
wiki/concepts/TCO.md
Normal file
46
wiki/concepts/TCO.md
Normal file
@@ -0,0 +1,46 @@
---
title: "Total Cost of Ownership (TCO)"
type: concept
tags:
- Cloud
- FinOps
- Cost-Management
- AWS
- VMware
last_updated: 2026-05-07
---

## Total Cost of Ownership (TCO)

TCO is a financial framework for evaluating the total cost of acquiring, operating, and maintaining a technology solution over its entire lifecycle, so that different deployment options can be compared on the same basis.

## Definition
TCO encompasses all direct and indirect costs associated with a technology investment, not just the upfront acquisition cost. In cloud migration and hybrid cloud contexts, TCO analysis is used to compare on-premises infrastructure against cloud-hosted solutions such as VMware Cloud on AWS or native AWS services.

## Components
- **Acquisition Costs**: Hardware/software procurement, licensing, implementation
- **Operational Costs**: Staff, maintenance, support contracts, utilities
- **Infrastructure Costs**: Data center space, power, cooling, physical security
- **Hidden Costs**: Underutilization, downtime, technical debt, migration effort
- **Exit Costs**: Data transfer out, licensing cancellation, decommissioning

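The components above can be combined into a simple multi-year comparison. The following is a minimal sketch, assuming acquisition and exit costs are one-time and the other components recur yearly. The class name, field names, and all figures are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Cost inputs for one deployment option (all values hypothetical)."""
    acquisition: float              # one-time: procurement, licensing, implementation
    operational_per_year: float     # staff, maintenance, support contracts, utilities
    infrastructure_per_year: float  # data center space, power, cooling, security
    hidden_per_year: float          # underutilization, downtime, technical debt
    exit_cost: float                # one-time: data transfer out, decommissioning

    def tco(self, years: int) -> float:
        """Total cost over the given number of years."""
        recurring = (self.operational_per_year
                     + self.infrastructure_per_year
                     + self.hidden_per_year)
        return self.acquisition + recurring * years + self.exit_cost


on_prem = CostModel(500_000, 120_000, 80_000, 40_000, 30_000)
cloud = CostModel(50_000, 260_000, 0, 20_000, 60_000)
for name, option in (("on-prem", on_prem), ("cloud", cloud)):
    print(name, option.tco(years=5))
```

Keeping exit costs as an explicit field makes it harder to leave them out of a comparison, which is a common way for a cloud business case to look better than it is.
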
## Key Applications

### VMC on AWS TCO Analysis
- The cloud economics team performs TCO calculations that compare VMC on AWS with on-premises and native hyperscaler options
- VMware sells entire hosts, which lets customers overcommit (over-provision) workloads on each host and reduce cost
- According to the source session, VMC on AWS offers a 27% cost saving compared with moving to a native public cloud
- Compare TCO against on-premises and other hyperscalers to make informed migration decisions

### General Cloud TCO Considerations
- On-premises: CapEx heavy, and capacity can sit unused (Micro Focus hardware utilization < 40%)
- Cloud: OpEx model with pay-as-you-go pricing, but with egress costs and lock-in risk
- Hybrid: Balances migration flexibility against the cost of a gradual transition

## Connections
- [[VMware-Cloud-on-AWS]] ← evaluates ← [[TCO]] for cloud migration decisions
- [[Cloud-Transformation]] ← uses ← [[TCO]] for business case justification
- [[FinOps]] ← practices ← [[TCO]] analysis for cloud cost optimization

## Sources
- [[ctp-topic-43-vmware-cloud-on-aws]]
53
wiki/concepts/TGW-Peering.md
Normal file
@@ -0,0 +1,53 @@
---
title: "TGW Peering"
type: concept
tags: [AWS, Networking, Transit Gateway, Multi-Region]
sources: [ctp-topic-18-wide-area-networking-in-aws-cloud]
last_updated: 2026-05-07
---

## TGW Peering

TGW Peering (Transit Gateway Peering) is a point-to-point connection between two AWS Transit Gateways, either in different Regions or in the same Region. It carries traffic between network segments and interconnects VPCs across Regions.

## Definition

- **Connected objects**: Two Transit Gateways (cross-Region or same-Region)
- **Traffic types**: VPC-to-VPC, Transit Gateway-to-on-prem, cross-Region interconnect
- **Route control**: Configured through Transit Gateway route tables. A peering attachment is associated with a route table, but route propagation is not supported over peering attachments, so routes that point at the peer must be added as static routes

## In CTP Global Architecture

The architecture described in [[ctp-topic-18-wide-area-networking-in-aws-cloud]]:

- **Connection model**: All Landing Zones attach through TGW Peering to the regional hub Transit Gateway of their own geography
- **Cross-region connectivity**: The regional hub Transit Gateways are connected to each other with full-mesh TGW Peering, so traffic is reachable globally
- **Geographic partitioning**: Three geographies (APJ, EMEA, AMS), each with its own hub Transit Gateway (for example, London for EMEA and Oregon for AMS)

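The full-mesh requirement between regional hubs can be sketched as a pair enumeration: every unordered pair of hubs needs exactly one peering. This is a minimal illustration, assuming the three geography labels from the text stand in for real Transit Gateway IDs; the function name is hypothetical.

```python
from itertools import combinations


def full_mesh_peerings(hubs: list[str]) -> list[tuple[str, str]]:
    """Return one (hub_a, hub_b) pair per peering needed for a full mesh."""
    return list(combinations(hubs, 2))


# Geography labels from the text; real hubs would be Transit Gateway IDs.
hubs = ["APJ", "EMEA", "AMS"]
for a, b in full_mesh_peerings(hubs):
    print(f"{a} <-> {b}")
```

With n hubs the mesh needs n(n-1)/2 peerings, so three hubs need three peerings and a fourth hub would raise that to six.
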
## Key Properties

| Property | Value |
|------|-----|
| Connection type | Point-to-point (peer-to-peer) |
| Cross-Region support | ✅ Cross-Region peering supported |
| Bandwidth limit | Bounded by the AWS global network infrastructure |
| Routing method | Transit Gateway route table (a Transit Gateway can hold multiple route tables) |
| Scope | Cross-Region or intra-Region connection |

## Relationship to Related Concepts

| Concept | Relationship |
|------|------|
| [[AWS-Transit-Gateway-TGW]] | The endpoints that TGW Peering connects |
| [[Hub-and-Spoke]] | Landing Zones act as spokes and attach to the hub through TGW Peering |
| [[Static-Routing]] | Routing across TGW Peering currently relies on static prefix lists |

## Connections

- [[ctp-topic-18-wide-area-networking-in-aws-cloud]] ← connection mechanism ← [[TGW-Peering]]
- [[AWS-Transit-Gateway-TGW]] ← connected object ← [[TGW-Peering]]
- [[Hub-and-Spoke]] ← implementation ← [[TGW-Peering]]

## Sources

- [[ctp-topic-18-wide-area-networking-in-aws-cloud]]
@@ -30,6 +30,7 @@ Transit Gateway 在 AWS Landing Zone 架构中扮演网络互联的核心角色
- **Scope**: Regional
- **Architecture**: Hub-and-Spoke
- **In SAS LZ**: Core component of the Network Account
- **Inter-Regional**: Regional hubs are interconnected in a full mesh through [[TGW-Peering]]

## Relationship to Checkpoint
- Transit Gateway handles routing

36
wiki/concepts/VPC-Association-Authorization.md
Normal file
@@ -0,0 +1,36 @@
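---
title: "VPC Association Authorization"
type: concept
tags:
- AWS
- DNS
- Networking
- Multi-Account
last_updated: 2026-04-28
---

## Definition

VPC Association Authorization is the security mechanism for associating an Amazon Route 53 private hosted zone (PHZ) across accounts. When a VPC owned by account A needs to be associated with a private hosted zone owned by another account (B), the PHZ owner (account B) must first create an authorization that explicitly allows the association request for that VPC, and then the VPC owner (account A) performs the actual association.

## Aliases
- VPC Association Authorization
- PHZ Cross-Account Association
- 跨账号 PHZ 授权

## Key Characteristics

- **Two-step flow**: (1) The PHZ owner runs `create-vpc-association-authorization`, passing the hosted zone ID and the `--vpc` parameter (the other account's VPC), to grant the authorization; (2) the VPC owner runs `associate-vpc-with-hosted-zone` in its own account to complete the association
- **Security boundary**: The authorization ensures that only explicitly approved VPCs can resolve the private domain names in the PHZ, which prevents unauthorized access
- **Terraform support**: Both steps can be managed declaratively in Terraform; the recommendation is for the DNS account to perform the authorization centrally
- **Disassociation**: An existing association is removed with `disassociate-vpc-from-hosted-zone`. Deleting the authorization (`delete-vpc-association-authorization`) does not remove an existing association; it only prevents the VPC from being associated again later
- **Use case**: In a multi-account Landing Zone architecture, VPCs in workload accounts need to be associated with the PHZ hosted in the DNS account
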
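The two-step flow can be sketched in Python. This is a minimal sketch, assuming each argument is a Route 53 client such as `boto3.client("route53")` created with credentials for the respective account; the function name `associate_cross_account` is hypothetical. The three client methods mirror the Route 53 API operations of the same names, and the clients are passed in so the ordering can be exercised without AWS access.

```python
def associate_cross_account(phz_owner_r53, vpc_owner_r53,
                            hosted_zone_id: str, vpc_id: str,
                            vpc_region: str) -> None:
    """Run the cross-account PHZ association in the required order.

    Both clients are Route 53 clients (for example boto3.client("route53")),
    each created with credentials for its own account.
    """
    vpc = {"VPCRegion": vpc_region, "VPCId": vpc_id}

    # Step 1 (PHZ owner, account B): authorize the VPC.
    phz_owner_r53.create_vpc_association_authorization(
        HostedZoneId=hosted_zone_id, VPC=vpc)

    # Step 2 (VPC owner, account A): perform the association.
    vpc_owner_r53.associate_vpc_with_hosted_zone(
        HostedZoneId=hosted_zone_id, VPC=vpc)

    # Optional cleanup (PHZ owner): the authorization is no longer needed
    # once the association exists; deleting it keeps the association.
    phz_owner_r53.delete_vpc_association_authorization(
        HostedZoneId=hosted_zone_id, VPC=vpc)
```

Passing the clients in rather than creating them inside the function keeps the credential handling (for example, assuming a role in each account) outside this sketch.
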
## Related Concepts
- [[Private-Hosted-Zone]]: the target of the authorization
- [[AWS-Landing-Zone]]: the typical multi-account setting where it applies
- [[Route-53-Resolver]]: the resolution engine that works together with PHZs
- [[AWS-RAM]]: can share Resolver rules across accounts; PHZ association authorization is a separate cross-account sharing mechanism

## Sources
- [[ctp-topic-19-configuring-dns-within-aws-lzs]]
100
wiki/concepts/VPC-自动化供给.md
Normal file
@@ -0,0 +1,100 @@
---
title: "VPC-自动化供给"
type: concept
tags: [AWS, VPC, IaC, Automation, IPAM]
sources:
- ctp-topic-45-automatic-ip-address-allocation-with-ipam
- ctp-topic-61-workload-vpc-provision-with-ipam-automation
last_updated: 2026-04-24
---

## VPC-自动化供给

An automated flow that creates AWS VPCs from declarative configuration files. IP address allocation is driven entirely by the IPAM system, with no manual intervention. Automated VPC provisioning is a core component of network-layer automation in the Cloud Transformation Programme.

## Traditional Workflow

```
Business unit (BU)
↓ raises an IP address requirement
SRE team
↓ submits a request to the network team
Network team
↓ calculates the optimal CIDR range
↓ updates the spreadsheet
SRE team
↓ prepares the YAML configuration file (hard-coded CIDR)
↓ runs Terraform/Terragrunt
```

**Problems**:
- Multiple manual handoffs, which is inefficient
- Manual planning easily produces overlapping IP ranges
- Spreadsheets are hard to maintain and lack version control
- Little automation, so changes are slow

## Automated Workflow

```
User
↓ fills in the YAML (business contact + engineering contact + desired subnet size)
Terragrunt
↓ calls the IPAM API (Infoblox NIOS)
Infoblox Grid
↓ automatically allocates the next available IP address block
Terragrunt
↓ creates the VPC
AWS
↓ VPC + subnets created
Infoblox Grid
↓ records the allocation result
```

**Advantages**:
- No manual IP address requests
- A single source of truth (IPAM)
- Version-control-friendly YAML configuration
- IP addresses are reclaimed automatically on destroy
- Backward compatible with legacy configuration

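The "allocate the next available block" step can be illustrated locally with Python's standard `ipaddress` module. This is only a sketch of the idea: it does not call the Infoblox NIOS API, the function name is hypothetical, and the list of already allocated ranges is invented for the example.

```python
import ipaddress
from typing import Optional


def next_available_block(parent_cidr: str, prefix_len: int,
                         allocated: list[str]) -> Optional[str]:
    """Return the first block of the requested size inside parent_cidr
    that overlaps none of the allocated CIDRs, or None if none is free."""
    parent = ipaddress.ip_network(parent_cidr)
    used = [ipaddress.ip_network(cidr) for cidr in allocated]
    for candidate in parent.subnets(new_prefix=prefix_len):
        if not any(candidate.overlaps(block) for block in used):
            return str(candidate)
    return None


# Parent CIDR and size follow this page's YAML example; the allocated
# list is made up for illustration.
print(next_available_block("10.1.0.0/16", 22, ["10.1.0.0/22", "10.1.4.0/24"]))
# prints 10.1.8.0/22
```

The second /22 is skipped because a smaller /24 already sits inside it, which is the kind of overlap that manual spreadsheet planning tends to miss.
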
## YAML Configuration

New-format YAML configuration file (compared with the legacy network.yml):

```yaml
infoblox:
business_contact: "bu@example.com"
engineering_contact: "sre@example.com"
date: "2026-04-14"
subnet_size: "/22" # desired subnet size (not a hard-coded CIDR)
parent_cidr: "10.1.0.0/16" # regional constant parent CIDR
vpc_name: "my-vpc" # VPC name (multiple VPCs supported)
availability_zone_ids: # optional: pin AZ IDs (not AZ names)
- "apse1-az1"
- "apse1-az2"
```

## CIDR Approval Workflow

| CIDR size | Process |
|-----------|------|
| /22 or larger | **Auto-approved**, no manual intervention |
| /24 or smaller | **Justification required**, reviewed by the network team |

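The approval rule can be written as a small decision function. This sketch rests on two assumptions that the table does not state: "larger" and "smaller" are read as block size (so "/22 or larger" means a prefix length of 22 or less), and a /23 request, which the table does not cover, is reported as unspecified rather than guessed. The function name and return strings are hypothetical.

```python
def approval_route(prefix_len: int) -> str:
    """Map a requested IPv4 prefix length to the approval path in the table.

    Assumption: "larger"/"smaller" refer to block size, not to the number
    after the slash.
    """
    if not 0 <= prefix_len <= 32:
        raise ValueError(f"invalid IPv4 prefix length: {prefix_len}")
    if prefix_len <= 22:
        return "auto-approved"
    if prefix_len >= 24:
        return "justification required"
    return "unspecified"  # /23 is not covered by the table


for size in (20, 22, 23, 24, 26):
    print(f"/{size}: {approval_route(size)}")
```
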
## Key Concepts

- [[IPAM]]: the core system that drives automated provisioning
- [[Infoblox-NIOS]]: the technical implementation of IPAM
- [[CIDR-审批流程]]: differentiated approval rules based on CIDR size

## Connections

- [[ctp-topic-45-automatic-ip-address-allocation-with-ipam]] ← introduces the automated VPC provisioning mechanism
- [[ctp-topic-61-workload-vpc-provision-with-ipam-automation]] ← shows a complete application case
- [[ctp-topic-31-network-segregation-and-secure-access]] ← VPC automation is the foundation for network segmentation

## Aliases

- VPC Provisioning
- VPC 自动供给
- Automated VPC Creation
38
wiki/concepts/Zero-Trust-Access.md
Normal file
@@ -0,0 +1,38 @@
---
title: "Zero-Trust Access"
type: concept
tags: ["AWS", "Security", "Zero-Trust", "IAM", "SSM"]
sources: ["ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones"]
last_updated: 2026-05-08
---

## Definition
Zero-Trust Access is a security model whose core principle is "never trust, always verify": every request must pass authentication and authorization checks, whether it originates inside or outside the network.

## In AWS Landing Zone Context
In [[ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones]], replacing VPN with SSM reflects zero-trust access principles:
- **No default trust**: Every access requires the user to authenticate through an IAM role
- **Least privilege**: Permission is granted only to reach the SSM Agent on specific EC2 instances
- **No VPN required**: Access does not rely on network-level trust; IAM plus the SSM Agent provide fine-grained access control
- **Two-factor authentication**: Combines AWS IAM conditions with multi-factor authentication (MFA)

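The least-privilege and MFA points can be pictured as an IAM policy document. The sketch below builds one in Python, assuming access is limited to a single EC2 instance and gated on the `aws:MultiFactorAuthPresent` condition key. The function name, account ID, and instance ID are placeholders, and a production policy would likely need further statements (for example, for ending sessions) that the source does not describe.

```python
import json


def ssm_session_policy(instance_arn: str) -> dict:
    """Build an IAM policy that allows starting an SSM session on one
    instance, and only when the caller authenticated with MFA."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": instance_arn,
                "Condition": {
                    "Bool": {"aws:MultiFactorAuthPresent": "true"}
                },
            }
        ],
    }


# Placeholder account ID and instance ID, for illustration only.
arn = "arn:aws:ec2:ap-southeast-1:111122223333:instance/i-0123456789abcdef0"
print(json.dumps(ssm_session_policy(arn), indent=2))
```

Scoping `Resource` to one instance ARN is what turns the network-wide reach of a VPN into the instance-level reach described above.
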
## Relationship to Traditional VPN
| Dimension | Traditional VPN | Zero-Trust (SSM) |
|------|---------|------------------|
| Trust boundary | Network layer (trusted once inside the VPN tunnel) | Identity layer (verified on every access) |
| Access scope | Network-segment level (whole network reachable) | Instance level (down to a single EC2 instance) |
| Credential management | Shared VPN credentials | Dynamic IAM role credentials |
| Two-factor | Depends on the VPN provider | Depends on AWS IAM + MFA |

## Long-term Vision
The evolution path described in [[ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones]]:
- Current: SSM zero-trust access (interim solution)
- End goal: Everything managed as IaC plus break-glass emergency access, eliminating console logins entirely

## Related Concepts
- [[Network-Segmentation]]: zero-trust network isolation
- [[IAM-Role]]: zero-trust identity model
- [[AWS-SSM]]: the concrete tool that implements zero-trust access

## Related Sources
- [[ctp-topic-31-network-segregation-and-secure-access-to-the-new-aws-landing-zones]]
53
wiki/concepts/cost-of-delay.md
Normal file
@@ -0,0 +1,53 @@
---
title: "Cost of Delay (CoD)"
type: concept
tags:
- SAFe
- WSJF
- Prioritization
- CTP
sources:
- ctp-topic-65-tracing-the-value-delivered-in-cloud-transformation
last_updated: 2026-04-28
---

## Definition

Cost of Delay (CoD) is the **economic value lost** by postponing or delaying the delivery of a piece of work. It is the core component of the WSJF formula and quantifies the price of doing something later.

## Formula

```
CoD = Business Value + Time Criticality + Risk Reduction & Opportunity Enablement
```

### Components

| Component | Description | Examples |
|------|------|------|
| Business Value | Direct economic benefit once delivered | Revenue growth, cost savings |
| Time Criticality | How urgent the time window is | Compliance deadlines, competitive threats |
| Risk Reduction | Value of reducing future risk | Security fixes, stability improvements |
| Opportunity Enablement | Value of opening new business opportunities | Entering new markets, API releases |

## Usage in WSJF

CoD is the numerator of the [[Weighted Shortest Job First (WSJF)]] formula:
```
WSJF = CoD / Job Size
```
The higher the CoD and the smaller the Job Size, the higher the WSJF and the higher the priority.

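A short worked example of the two formulas follows. It is a minimal sketch, assuming each component is scored as a relative number and that Risk Reduction and Opportunity Enablement are scored together as one term, as in the CoD formula. The function names and scores are illustrative, not values from the source.

```python
def cost_of_delay(business_value: int, time_criticality: int,
                  risk_reduction_opportunity: int) -> int:
    """CoD is the sum of its three scored components."""
    return business_value + time_criticality + risk_reduction_opportunity


def wsjf(cod: float, job_size: float) -> float:
    """WSJF = CoD / Job Size; job size must be positive."""
    if job_size <= 0:
        raise ValueError("job_size must be positive")
    return cod / job_size


cod = cost_of_delay(business_value=8, time_criticality=5,
                    risk_reduction_opportunity=3)
print(cod, wsjf(cod, job_size=4))  # prints 16 4.0
```
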
## Related Concepts

- [[Weighted Shortest Job First (WSJF)]]: the main framework in which CoD is applied
- [[Serviceable Obtainable Market (SOM)]]: a macro-level reference for assessing Business Value

## Related Sources

- [[CTP Topic 65 Tracing the Value Delivered in Cloud Transformation]]

## Aliases
- Cost of Delay
- CoD
- Delay Cost
38
wiki/concepts/serviceable-obtainable-market.md
Normal file
@@ -0,0 +1,38 @@
---
title: "Serviceable Obtainable Market (SOM)"
type: concept
tags:
- Market-Analysis
- Business-Value
- CTP
sources:
- ctp-topic-65-tracing-the-value-delivered-in-cloud-transformation
last_updated: 2026-04-28
---

## Definition

SOM (Serviceable Obtainable Market) is the **market size that can realistically be captured** within a given period, taking into account existing sales capacity, distribution channels, and market-share targets. It is a subset of SAM (Serviceable Available Market), which is in turn a subset of TAM (Total Addressable Market).

## Application in the CTP Value Framework

In the value-capture framework of the Cloud Transformation Programme (CTP), SOM is one of four assessment dimensions:
1. **Revenue Increase**
2. **Cost Reduction**
3. **Risk Position Improvement**
4. **SOM Size**

SOM helps product teams and demand managers set **achievable business targets** and avoid overly optimistic market expectations.

## Related Concepts

- [[Cost of Delay (CoD)]]: a framework for quantifying the cost of postponing a decision
- [[Value Stream]]: the SOM assessment is the market-dimension input to value stream analysis

## Related Sources

- [[CTP Topic 65 Tracing the Value Delivered in Cloud Transformation]]

## Aliases
- SOM
- Serviceable Obtainable Market
44
wiki/concepts/value-stream.md
Normal file
@@ -0,0 +1,44 @@
---
title: "Value Stream"
type: concept
tags:
- Lean
- Value-Delivery
- Scaled-Agile
- CTP
sources:
- ctp-topic-65-tracing-the-value-delivered-in-cloud-transformation
last_updated: 2026-04-28
---

## Definition

A value stream is the **set of activities** that delivers a product or service to a customer (external or internal), covering the full chain from the initial request to realized value. Its goal is to maximize customer value by eliminating waste (Muda), reducing unevenness (Mura), and relieving overburden (Muri).

## Types

- **Operational Value Stream (OVS)**: Delivers solutions to external customers; defined by the Scaled Agile Framework (SAFe)
- **Development Value Stream (DVS)**: Builds and maintains internal products or platforms; defined by the Scaled Agile Framework (SAFe)

## Components

| Activity type | Description | Examples |
|---------|------|------|
| Value-adding | Directly creates value for the customer | Feature delivery, fast response |
| Value-enabling | Indirectly supports value delivery | Test environment setup, training |
| Waste (Muda) | Creates no value and should be eliminated | Waiting, over-processing, defects |

## Related Concepts

- [[Weighted Shortest Job First (WSJF)]]: the method for prioritizing work within a value stream
- [[Lean]]: the theoretical foundation of value stream analysis
- [[Cost of Delay (CoD)]]: a metric that quantifies the value lost to delay

## Related Sources

- [[CTP Topic 65 Tracing the Value Delivered in Cloud Transformation]]

## Aliases
- Value Streams
- Value Stream Mapping
- VSM
52
wiki/concepts/weighted-shortest-job-first.md
Normal file
@@ -0,0 +1,52 @@
---
title: "Weighted Shortest Job First (WSJF)"
type: concept
tags:
- SAFe
- Prioritization
- WSJF
- CTP
- Value-Delivery
sources:
- ctp-topic-65-tracing-the-value-delivered-in-cloud-transformation
last_updated: 2026-04-28
---

## Definition

WSJF is the quantitative method that the Scaled Agile Framework (SAFe) uses to **prioritize work**. Dividing the Cost of Delay by the Job Size gives the value returned per unit of effort, and the work with the highest WSJF value is done first.

## Formula

```
WSJF = Cost of Delay (CoD) / Job Size
```

**Cost of Delay (CoD)** has three parts:
- **Business Value**: the direct economic impact on the customer
- **Time Criticality**: the value lost by delivering later
- **Risk Reduction / Opportunity Enablement**

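Applied to a backlog, the formula turns into a sort. The sketch below ranks a few made-up work items by WSJF, highest first. The `Job` class, the `rank` helper, and the item names and scores are all hypothetical and serve only to show the ordering.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cost_of_delay: float  # sum of the three CoD components
    job_size: float       # relative size estimate, must be positive

    @property
    def wsjf(self) -> float:
        return self.cost_of_delay / self.job_size


def rank(jobs: list[Job]) -> list[Job]:
    """Order jobs from highest to lowest WSJF."""
    return sorted(jobs, key=lambda job: job.wsjf, reverse=True)


# Made-up backlog items for illustration.
backlog = [Job("A", 16, 8), Job("B", 10, 2), Job("C", 21, 7)]
for job in rank(backlog):
    print(job.name, job.wsjf)
```

Item B comes first despite having the lowest Cost of Delay, because its small size gives it the best return per unit of effort.
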
## Interpretation

| WSJF value | Priority | Meaning |
|---------|--------|------|
| High | 🔴 Highest | High value and small effort; do first |
| Medium | 🟡 Medium | Value and effort are balanced |
| Low | 🟢 Low | Low value and large effort; do later |

**Core idea**: `What we want to do is deliver the maximum value early back into the business for the least amount of effort.`

## Related Concepts

- [[Cost of Delay (CoD)]]: the numerator of the WSJF formula
- [[Value Stream]]: WSJF is used to rank work items within a value stream
- [[Process]]: WSJF optimizes how efficiently a process delivers value

## Related Sources

- [[CTP Topic 65 Tracing the Value Delivered in Cloud Transformation]]

## Aliases
- WSJF
- Weighted Shortest Job First