Update nexus wiki content
55
wiki/entities/Delta-Lake.md
Normal file
@@ -0,0 +1,55 @@
---
title: "Delta Lake"
type: entity
tags: [data-engineering, lakehouse, open-table-format, ACID]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

Delta Lake is an open table format open-sourced by Databricks. It adds ACID transactions, time travel, Z-Ordering, and related capabilities to data lakes. The Data Engineer Agent uses Delta Lake as the single storage format across all three layers of the Medallion Architecture (Bronze/Silver/Gold).

## Key Features

### ACID Transactions
- Writes commit atomically, so readers always see a consistent snapshot.
- Concurrent writers never leave partial writes behind.
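The two guarantees above can be illustrated with a toy commit log. This is only a sketch under simplifying assumptions: it models a commit as a numbered JSON file that is published with a put-if-absent step, and the helper names `try_commit` and `committed_versions` are hypothetical. The real Delta protocol carries more metadata and conflict checks than shown here.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version`; return False if it is taken."""
    final_path = os.path.join(log_dir, f"{version:020d}.json")
    # Write the full commit to a temporary file first, so a reader can never
    # observe a half-written commit under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(actions, f)
    try:
        # os.link raises FileExistsError if another writer already owns
        # this version, which gives put-if-absent semantics.
        os.link(tmp_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        os.remove(tmp_path)


def committed_versions(log_dir: str) -> list:
    """Readers only consider versions whose commit file exists."""
    names = [n for n in os.listdir(log_dir) if n.endswith(".json")]
    return sorted(int(n.split(".")[0]) for n in names)


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
# A second writer racing for version 0 loses and retries at version 1.
second = try_commit(log_dir, 0, [{"add": {"path": "part-0001.parquet"}}])
retry = try_commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
```

A losing writer sees nothing published under its name, so readers listing the log only ever find complete commits.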

### Time Travel
- Query the table as of any retained version or timestamp (`VERSION AS OF` or `TIMESTAMP AS OF`).
- Used for auditing, compliance, and rollback.
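As a rough illustration of the two clauses, the helper below builds the corresponding SQL text. The function name `time_travel_query` and the table name `silver.orders` are assumptions made for this sketch, and the strings are interpolated without escaping, so it is not suitable for untrusted input.

```python
from typing import Optional


def time_travel_query(table: str, version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads `table` as of a version or a timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


by_version = time_travel_query("silver.orders", version=12)
by_time = time_travel_query("silver.orders", timestamp="2026-05-01 00:00:00")
# With an active SparkSession named `spark` (assumed), either string can be run:
# df = spark.sql(by_version)
```

The DataFrame reader exposes the same feature through the `versionAsOf` and `timestampAsOf` options, for example `spark.read.format("delta").option("versionAsOf", 12).load(path)`.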

### Schema Enforcement & Evolution
- `mergeSchema=true`: allows schema evolution so that upstream changes are captured.
- Dropping required columns is prohibited, and type changes must be declared explicitly.
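The rules above can be modeled in plain Python. This sketch is not the Delta Lake API: `check_schema_change` is a hypothetical helper, schemas are simplified to name-to-type dictionaries, and any type change is simply rejected to stand in for "must be declared explicitly".

```python
from typing import Dict, List


def check_schema_change(current: Dict[str, str], incoming: Dict[str, str],
                        required: List[str], merge_schema: bool = False) -> Dict[str, str]:
    """Return the table schema after a write, or raise if the change is not allowed."""
    for column in required:
        if column not in incoming:
            raise ValueError(f"required column dropped: {column}")
    for column, dtype in incoming.items():
        if column in current and current[column] != dtype:
            raise TypeError(f"type change for {column}: {current[column]} -> {dtype}")
    new_columns = {c: t for c, t in incoming.items() if c not in current}
    if new_columns and not merge_schema:
        # Enforcement: unknown columns are rejected unless evolution is enabled.
        raise ValueError(f"unexpected columns: {sorted(new_columns)}")
    return {**current, **new_columns}


current = {"id": "long", "name": "string"}
incoming = {"id": "long", "name": "string", "email": "string"}
evolved = check_schema_change(current, incoming, required=["id"], merge_schema=True)
```

In Spark the evolution switch is passed on the writer, for example `df.write.format("delta").option("mergeSchema", "true").mode("append").save(path)`.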
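
### Z-Ordering
- Multi-dimensional clustering that physically co-locates related data.
- Speeds up queries that filter on several of the clustered columns, because more files can be skipped.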
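The idea behind a Z-order curve can be sketched by interleaving the bits of two integer columns (a Morton code) and sorting by the result. This shows the general technique only. How Delta Lake computes its clustering internally is not described in this page, and `z_value` is a hypothetical helper.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
# Split the sorted rows into four "files" of four rows each.
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]
```

Each simulated file ends up covering a 2x2 block of the grid, so a filter on both `x` and `y` only needs to read one file. On a real table the layout is produced with `OPTIMIZE <table> ZORDER BY (col1, col2)`.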

### Liquid Clustering (Delta Lake 3.x+)
- Automatic compaction and clustering that adapts to the workload.
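
### UPSERT / MERGE
```python
from delta.tables import DeltaTable

# Assumed to exist: `spark`, `source` (DataFrame), `merge_condition` (e.g. "target.id = source.id").
target = DeltaTable.forName(spark, "silver.customers")  # hypothetical table name
target.alias("target").merge(source.alias("source"), merge_condition) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
```
With a merge condition that matches on a unique key, this makes incremental updates idempotent: re-running the same batch leaves the table unchanged.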
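The idempotence claim can be checked with a plain-Python model of the same update-or-insert semantics. This is a sketch, not Delta code: `upsert` is a hypothetical helper, and the table is represented as a dictionary keyed by `id`.

```python
from typing import Dict, List


def upsert(target: Dict[int, dict], batch: List[dict], key: str = "id") -> Dict[int, dict]:
    """Return a copy of `target` with `batch` merged in by `key`."""
    merged = dict(target)
    for row in batch:
        # Matched keys are overwritten (update); unmatched keys are added (insert).
        merged[row[key]] = row
    return merged


target = {1: {"id": 1, "name": "Ada"}}
batch = [{"id": 1, "name": "Ada L."}, {"id": 2, "name": "Grace"}]
once = upsert(target, batch)
twice = upsert(once, batch)  # replaying the same batch changes nothing
```

Because the result depends only on the key and the latest row for it, a retried or duplicated batch does not create duplicate rows.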

## Alternative Formats
- [[Apache Iceberg]]: another open table format specification, with cross-engine interoperability (Spark/Trino/Presto).
- Apache Hudi: an open table format oriented toward incremental processing (upserts and incremental queries).

## Used By
- [[Databricks]] (native support)
- [[Apache Spark]] (supported directly through the `delta` format)
- AWS Glue and Snowflake (through connectors)

## Related Concepts
- [[Medallion Architecture]]
- [[Apache Spark]]
- [[SCD Type 2]]