nexus/wiki/entities/Delta-Lake.md
2026-05-03 05:42:12 +08:00

---
title: "Delta Lake"
type: entity
tags: [data-engineering, lakehouse, open-table-format, ACID]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
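Delta Lake is an open table format open-sourced by Databricks. It brings ACID transactions, time travel, Z-Ordering, and related capabilities to data lakes. The Data Engineer Agent uses Delta Lake as the unified storage format across the three layers of the Medallion Architecture (Bronze/Silver/Gold).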
## Key Features
### ACID Transactions
- Writes commit atomically, so readers always see a consistent state
- Concurrent writers never leave partial writes behind
### Time Travel
- Query the data as of any retained version or point in time (`VERSION AS OF`, `TIMESTAMP AS OF`)
- Used for auditing, compliance, and rollback
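The ACID and time-travel behavior described above both rest on an append-only transaction log: each commit records which data files were added and removed, and a reader reconstructs a table version by replaying commits up to that version. The sketch below is a toy in-memory model of that idea, not Delta Lake's actual implementation; `Commit`, `ToyLog`, `commit`, `snapshot`, and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    """One atomic log entry: files added and files removed together."""
    added: frozenset
    removed: frozenset


@dataclass
class ToyLog:
    """Hypothetical stand-in for a table's ordered commit log."""
    commits: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        # Appending a single entry is the atomic step: a reader sees
        # either the whole commit or none of it.
        self.commits.append(Commit(frozenset(added), frozenset(removed)))
        return len(self.commits) - 1  # version number of this commit

    def snapshot(self, version=None):
        # Replay commits 0..version to get the live file set at that version.
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for c in self.commits[: version + 1]:
            files -= c.removed
            files |= c.added
        return files


log = ToyLog()
v0 = log.commit(added={"part-0.parquet"})
v1 = log.commit(added={"part-1.parquet"})
v2 = log.commit(added={"part-2.parquet"}, removed={"part-0.parquet"})

latest = log.snapshot()      # current state of the table
as_of_v1 = log.snapshot(v1)  # "time travel" back to version 1
```

Because old commits are never rewritten, rolling back in this model amounts to reading (or re-committing) an earlier snapshot.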
### Schema Enforcement & Evolution
- `mergeSchema=true`: allows schema evolution so that upstream changes are captured
- Dropping required columns is disallowed; type changes must be declared explicitly
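The two rules above can be pictured as a check that runs before each write. This is a minimal sketch in plain Python, assuming a simplified schema model (column name mapped to a type name and a required flag). `evolve_schema` and its error messages are hypothetical and cover only the rules listed here, not Delta Lake's full behavior.

```python
def evolve_schema(table, incoming, merge_schema=False):
    """Return the table schema after a write, or raise ValueError if rejected.

    Schemas are dicts of column name -> (type name, required flag).
    """
    for name, (col_type, required) in table.items():
        if name not in incoming:
            if required:
                raise ValueError(f"required column dropped: {name}")
            continue  # an optional column may be absent from a write
        if incoming[name][0] != col_type:
            raise ValueError(f"implicit type change on column: {name}")
    new_cols = {n: spec for n, spec in incoming.items() if n not in table}
    if new_cols and not merge_schema:
        # Enforcement: unknown columns are rejected unless evolution is on.
        raise ValueError(f"unexpected new columns: {sorted(new_cols)}")
    return {**table, **new_cols}


table = {"id": ("long", True), "name": ("string", False)}
incoming = {**table, "email": ("string", False)}

# With evolution enabled, the new column is added to the table schema.
evolved = evolve_schema(table, incoming, merge_schema=True)

# Without it, the same write is rejected.
try:
    evolve_schema(table, incoming)
    rejected = None
except ValueError as exc:
    rejected = str(exc)
```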
### Z-Ordering
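- Multi-dimensional clustering that physically co-locates related data
- Significantly speeds up queries that filter on several columns at once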
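To see why clustering along a Z-order curve helps multi-column filters, the toy example below interleaves the bits of two integer columns into a Morton key, sorts rows by that key, and splits them into fixed-size "files". It illustrates only the curve idea; how Delta Lake maps real column values onto the curve and rewrites files is not shown, and `morton_key`, `files_hit`, and the 4x4 grid are illustrative assumptions.

```python
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x (even positions) and y (odd)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def chunk(rows, size):
    """Split rows into consecutive 'files' of `size` rows each."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_hit(files, predicate):
    """Count files containing at least one row that matches the filter."""
    return sum(any(predicate(row) for row in f) for f in files)


points = [(x, y) for x in range(4) for y in range(4)]  # sorted by x, then y

row_major_files = chunk(points, 4)
z_order_files = chunk(sorted(points, key=lambda p: morton_key(*p)), 4)


def both_small(p):
    return p[0] <= 1 and p[1] <= 1


def y_is_zero(p):
    return p[1] == 0


hits = {
    "row_major_both": files_hit(row_major_files, both_small),
    "z_order_both": files_hit(z_order_files, both_small),
    "row_major_y": files_hit(row_major_files, y_is_zero),
    "z_order_y": files_hit(z_order_files, y_is_zero),
}
```

In this grid each Z-ordered file holds one 2x2 quadrant, so a filter on both columns, or on the second column alone, touches fewer files than the layout sorted by the first column only.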
### Liquid Clustering (Delta Lake 3.x+)
- Automatic compaction and clustering that adapts to the workload
### UPSERT / MERGE
```python
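# Assumed context: `target` is a delta.tables.DeltaTable, `source` is a DataFrame,
# and `merge_condition` is a SQL string such as "target.id = source.id".
target.alias("target").merge(source.alias("source"), merge_condition) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()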
```
This makes incremental data updates idempotent.
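The idempotence claim can be checked with a small model of the upsert semantics: matched rows are overwritten, unmatched rows are inserted, so replaying the same batch changes nothing. The sketch below uses plain Python lists of dicts rather than Delta tables; `merge_upsert` and the sample rows are hypothetical.

```python
def merge_upsert(target, source, key="id"):
    """Return a new table: matched rows replaced, unmatched rows inserted."""
    merged = {row[key]: row for row in target}
    for row in source:
        # Same effect as update-all on a match and insert-all otherwise.
        merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


target = [{"id": 1, "name": "old"}, {"id": 2, "name": "keep"}]
batch = [{"id": 1, "name": "new"}, {"id": 3, "name": "added"}]

once = merge_upsert(target, batch)
twice = merge_upsert(once, batch)  # replaying the batch is a no-op
```

This toy resolves duplicate keys in the source as last-write-wins; it assumes the batch has at most one row per key, which is also the safe assumption for a real merge.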
## Alternative Formats
- [[Apache Iceberg]]: another open table format specification, interoperable across engines (Spark/Trino/Presto)
- Apache Hudi: supports hoodie-based incremental processing
## Used By
- [[Databricks]] (native support)
- [[Apache Spark]] (supported directly through the `delta` format)
- AWS Glue, Snowflake (via connectors)
## Related Concepts
- [[Medallion Architecture]]
- [[Apache Spark]]
- [[SCD Type 2]]