---
title: "Delta Lake"
type: entity
tags: [data-engineering, lakehouse, open-table-format, ACID]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

Delta Lake is an open table format, open-sourced by Databricks, that adds ACID transactions, time travel, Z-Ordering, and related capabilities to data lakes. The Data Engineer Agent uses Delta Lake as the single storage format across all three layers (Bronze/Silver/Gold) of the Medallion Architecture.

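## Key Features

### ACID Transactions
- Writes commit atomically, so readers always see a consistent snapshot of the table
- Concurrent writes never leave partial results: each commit either succeeds in full or fails, with conflicts detected through optimistic concurrency control
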
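### Time Travel
- Query the table as of any earlier version or timestamp that is still within its retention window (`VERSION AS OF` or `TIMESTAMP AS OF`)
- Used for audits, compliance, and rollback
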
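The two clauses above can be sketched as a small query builder. The helper name `time_travel_query` and the table name `events` are hypothetical, and actually running the query assumes a Spark session with Delta Lake configured.

```python
from typing import Optional


def time_travel_query(table: str, version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads `table` as of a version or a timestamp."""
    # Exactly one of the two selectors must be given.
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Usage (assumes an existing SparkSession named `spark`):
#   spark.sql(time_travel_query("events", version=3))
#   spark.sql(time_travel_query("events", timestamp="2026-01-01"))
```

The DataFrame reader offers the same capability through options, for example `spark.read.format("delta").option("versionAsOf", 3).load(path)`.
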
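### Schema Enforcement & Evolution
- `mergeSchema=true`: lets the table schema evolve so that upstream changes (such as new columns) are captured; without it, a write whose schema does not match the table is rejected
- Dropping required columns is prohibited, and type changes must be declared explicitly
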
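As a rough illustration of the two rules above, the following sketch checks a proposed schema against the current one before an append. The dict-based schema shape and the names `evolution_violations` and `append_with_evolution` are assumptions made for this example; only the `mergeSchema` write option comes from Delta Lake itself.

```python
from typing import Dict, List, Tuple

# Hypothetical schema shape: column name -> (type name, required flag)
Schema = Dict[str, Tuple[str, bool]]


def evolution_violations(current: Schema, proposed: Schema) -> List[str]:
    """Return the reasons a proposed schema breaks the rules above."""
    problems = []
    for name, (col_type, required) in current.items():
        if name not in proposed:
            # Only required columns are protected from removal.
            if required:
                problems.append(f"required column dropped: {name}")
            continue
        new_type = proposed[name][0]
        if new_type != col_type:
            problems.append(
                f"type change needs an explicit cast: {name} {col_type} -> {new_type}"
            )
    return problems


def append_with_evolution(df, path: str) -> None:
    """Append `df` (a Spark DataFrame), letting new columns extend the table schema."""
    df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```

A pipeline might call `evolution_violations` first and only run `append_with_evolution` when the returned list is empty.
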
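### Z-Ordering
- Multi-dimensional clustering that physically co-locates related data
- Speeds up queries that filter on several columns at once, because the engine can skip files whose column statistics exclude the filter values
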
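Z-Ordering is applied as part of an `OPTIMIZE` run. A minimal sketch, assuming a table named `events` and illustrative column names; the helper `zorder_statement` is hypothetical:

```python
from typing import Sequence


def zorder_statement(table: str, columns: Sequence[str]) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for the given columns."""
    if not columns:
        raise ValueError("Z-Ordering needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Usage (assumes an existing SparkSession named `spark` with Delta Lake enabled):
#   spark.sql(zorder_statement("events", ["customer_id", "event_date"]))
```

Columns that appear together in query filters are the usual candidates for the Z-Order list.
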
### Liquid Clustering (Delta Lake 3.x+)
- Automatic compaction and clustering that adapts to the workload

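A minimal sketch of how clustering keys might be declared at table creation, assuming a Delta Lake release that supports the `CLUSTER BY` clause. The helper `create_clustered_table`, the table name, and the column definitions are hypothetical.

```python
from typing import Sequence


def create_clustered_table(table: str, columns_ddl: str,
                           cluster_by: Sequence[str]) -> str:
    """Build a CREATE TABLE statement that declares liquid clustering keys."""
    if not cluster_by:
        raise ValueError("at least one clustering column is required")
    keys = ", ".join(cluster_by)
    return f"CREATE TABLE {table} ({columns_ddl}) USING DELTA CLUSTER BY ({keys})"


# Usage (assumes an existing SparkSession named `spark` with Delta Lake enabled):
#   spark.sql(create_clustered_table("events", "id BIGINT, event_date DATE", ["event_date"]))
```
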
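### UPSERT / MERGE
```python
# Assumes `target` is a delta.tables.DeltaTable, `source` is a DataFrame, and
# `merge_condition` is a SQL string such as "target.id = source.id".
target.alias("target").merge(source.alias("source"), merge_condition) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
```
This enables idempotent incremental updates: re-running the same batch updates the matching rows instead of inserting duplicates.
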
## Alternative Formats
- [[Apache Iceberg]]: another open table format specification, interoperable across engines (Spark/Trino/Presto)
- Apache Hudi: supports incremental processing, with record-level upserts and incremental queries

## Used By
- [[Databricks]] (native support)
- [[Apache Spark]] (supported directly through the `delta` format)
- AWS Glue, Snowflake (via connectors)

## Related Concepts
- [[Medallion Architecture]]
- [[Apache Spark]]
- [[SCD Type 2]]