Update nexus wiki content
50
wiki/concepts/Medallion-Architecture.md
Normal file
@@ -0,0 +1,50 @@
---
title: "Medallion Architecture"
type: concept
tags: [data-engineering, lakehouse, architecture]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Definition

Medallion Architecture is a layered architecture for the data lakehouse. Its three tiers, Bronze → Silver → Gold, progressively refine data from raw to business-ready.

## Three Layers

### Bronze Layer (Raw)
- **Characteristics**: raw, immutable, append-only
- **Rules**: never transform data in place; retain the full source file, ingestion timestamp, and source system metadata
- **Schema**: schema-on-read (inferred at read time)
- **Partitioning**: partition by ingestion date to enable low-cost historical replay
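
The Bronze rules above can be illustrated with a small in-memory sketch. It is not a Delta Lake or Spark implementation: the dictionary `bronze_table` stands in for a table partitioned by ingestion date, and the column names `_source_file` and `_ingestion_date` as well as the helper names are assumptions made for this example (only `_ingested_at` and `_source_system` appear elsewhere on this page).

```python
from datetime import datetime, timezone


def to_bronze_record(raw_payload, source_file, source_system, ingested_at=None):
    """Wrap a raw payload with ingestion metadata without altering the payload."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return {
        "payload": raw_payload,                # stored exactly as received
        "_source_file": source_file,           # hypothetical column name
        "_source_system": source_system,
        "_ingested_at": ingested_at.isoformat(),
        # Partition key (hypothetical column name): the ingestion date.
        "_ingestion_date": ingested_at.date().isoformat(),
    }


def append_to_bronze(bronze, record):
    """Append-only write: add to the day's partition, never update or delete."""
    bronze.setdefault(record["_ingestion_date"], []).append(record)


def replay(bronze, start_date, end_date):
    """Replay history by scanning only partitions within [start_date, end_date]."""
    for day in sorted(bronze):
        if start_date <= day <= end_date:
            yield from bronze[day]


bronze_table = {}
ts = datetime(2026, 5, 2, 8, 30, tzinfo=timezone.utc)
append_to_bronze(
    bronze_table,
    to_bronze_record('{"id": 1, "amt": "12.5"}', "orders_0001.json", "erp", ts),
)
replayed = list(replay(bronze_table, "2026-05-01", "2026-05-02"))
```

Because the payload is kept as an unparsed string, type inference is deferred to read time, and a replay only touches the date partitions it asks for.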
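
### Silver Layer (Cleansed)
- **Characteristics**: cleansed, deduplicated, conformed
- **Rules**: must be joinable across domains; handle nulls explicitly (impute, flag, or reject); standardize data types, date formats, currency codes, and country codes
- **Implementation**: SCD Type 2 to track historical changes; deduplicate on primary key + event timestamp
- **Quality**: the null-handling rule for every field must be explicitly documented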
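
As a rough illustration of two of the Silver rules (deduplication on primary key + event timestamp, and an explicit null rule per field), here is a minimal plain-Python sketch. The field names (`id`, `event_ts`, `country`, `amount`), the `NULL_RULES` table, and the `_null_flags` column are assumptions for this example; a real pipeline would express the same logic in its own engine.

```python
def deduplicate(records, key="id", event_ts="event_ts"):
    """Keep the latest record per primary key, judged by event timestamp."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        # On a timestamp tie, the record seen first is kept.
        if current is None or rec[event_ts] > current[event_ts]:
            latest[rec[key]] = rec
    return list(latest.values())


# One documented rule per field: impute a default, flag the row, or reject it.
NULL_RULES = {
    "id": ("reject", None),
    "country": ("impute", "UNKNOWN"),
    "amount": ("flag", None),
}


def apply_null_rules(record, rules=NULL_RULES):
    """Return a cleaned copy of the record, or None if a 'reject' rule fires."""
    cleaned = dict(record)
    flags = []
    for field, (action, default) in rules.items():
        if cleaned.get(field) is not None:
            continue
        if action == "reject":
            return None
        if action == "impute":
            cleaned[field] = default
        elif action == "flag":
            flags.append(field)
    cleaned["_null_flags"] = flags  # hypothetical column listing flagged fields
    return cleaned


rows = [
    {"id": 1, "event_ts": "2026-05-01T10:00:00", "country": "DE", "amount": 10.0},
    {"id": 1, "event_ts": "2026-05-01T12:00:00", "country": None, "amount": None},
    {"id": None, "event_ts": "2026-05-01T09:00:00", "country": "FR", "amount": 5.0},
]
cleaned_rows = [r for r in map(apply_null_rules, rows) if r is not None]
silver = deduplicate(cleaned_rows)
```

Keeping the rules in one table makes the per-field null policy reviewable in a single place, which is one way to satisfy the documentation requirement above. ISO 8601 timestamps in the same format compare correctly as strings, which is why no parsing is needed here.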
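
### Gold Layer (Business)
- **Characteristics**: business-ready, SLA-backed, optimized for query patterns
- **Rules**: Gold-layer consumers must not read Bronze or Silver directly; every row must carry a data quality score; use `replaceWhere` for atomic overwrites
- **Optimization**: Z-Ordering for multi-dimensional clustering, partition pruning, pre-aggregation
- **SLA**: state the refresh frequency explicitly (e.g. "refreshed every 15 minutes")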
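
The overwrite rule can be sketched without a cluster. The function below is an in-memory stand-in for the idea behind `replaceWhere` (swap only the rows that match a predicate), not Delta Lake's API or its transactional commit. The scoring rule in `quality_score`, the `dq_score` column, the check that incoming rows match the predicate, and all table contents are assumptions for illustration.

```python
def quality_score(row, required=("customer_id", "revenue")):
    """Fraction of required fields that are populated (hypothetical rule)."""
    present = sum(row.get(field) is not None for field in required)
    return present / len(required)


def replace_where(table, predicate, new_rows):
    """Return a new table in which rows matching `predicate` are replaced.

    Only the targeted slice is overwritten. As a design choice of this
    sketch, incoming rows must themselves fall inside that slice.
    """
    if not all(predicate(row) for row in new_rows):
        raise ValueError("new rows must satisfy the replace predicate")
    kept = [row for row in table if not predicate(row)]
    return kept + list(new_rows)


def is_may_2(row):
    return row["day"] == "2026-05-02"


gold = [
    {"day": "2026-05-01", "customer_id": "c1", "revenue": 100.0, "dq_score": 1.0},
    {"day": "2026-05-02", "customer_id": "c2", "revenue": 40.0, "dq_score": 1.0},
]
recomputed = [
    {"day": "2026-05-02", "customer_id": "c2", "revenue": 45.0},
    {"day": "2026-05-02", "customer_id": "c3", "revenue": None},
]
for row in recomputed:
    row["dq_score"] = quality_score(row)  # row-level quality score

once = replace_where(gold, is_may_2, recomputed)
twice = replace_where(once, is_may_2, recomputed)  # rerun of the same step
```

Running the step a second time yields the same table as the first run, which is the property that makes predicate-scoped overwrites a natural fit for idempotent Gold refreshes.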
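
## Core Principles

- **Immutability**: the Bronze layer is never overwritten; every record carries `_ingested_at` and `_source_system`
- **Progressive quality**: data quality improves step by step at each layer, Bronze → Silver → Gold
- **Consumer protection**: upstream schema changes are captured with `mergeSchema=true` but do not automatically pollute downstream layers
- **Idempotency**: every pipeline step from Silver to Gold must be idempotent; rerunning it must not produce duplicates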
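
The consumer-protection principle has two halves: accept new upstream columns at ingestion, and keep them out of downstream tables until someone opts in. The second half can be sketched as a projection onto an explicit column contract. `GOLD_CONTRACT`, the helper names, and the sample record are hypothetical; this shows the idea, not how any particular platform enforces it.

```python
GOLD_CONTRACT = ("order_id", "amount")  # hypothetical downstream contract


def detect_new_columns(record, known_columns):
    """Return upstream columns that are not yet part of the known schema."""
    return sorted(set(record) - set(known_columns))


def project_to_contract(record, contract=GOLD_CONTRACT):
    """Expose only contracted columns so new upstream fields do not leak."""
    return {col: record.get(col) for col in contract}


upstream = {
    "order_id": 7,
    "amount": 19.9,
    "coupon_code": "SPRING",  # newly added upstream column
    "_ingested_at": "2026-05-02T08:30:00+00:00",
    "_source_system": "erp",
}
known = set(GOLD_CONTRACT) | {"_ingested_at", "_source_system"}
drift = detect_new_columns(upstream, known)    # surfaced for review
downstream = project_to_contract(upstream)     # unchanged shape for consumers
```

Here the new `coupon_code` column is reported as drift while the downstream record keeps exactly the contracted columns, so consumers see a stable shape until the contract itself is changed.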

## Related Concepts
- [[CDC (Change Data Capture)]]
- [[Data Contract]]
- [[Data Lineage]]
- [[SCD Type 2]]

## Related Entities
- [[Delta Lake]] (storage format for Bronze/Silver/Gold)
- [[Apache Spark]] (compute engine)
- [[Databricks]] (managed platform)
- [[Apache Iceberg]] (alternative open table format)