---
title: "Apache Hudi"
type: entity
tags: [data-engineering, lakehouse, open-table-format, incremental]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

Apache Hudi is an open table format focused on incremental processing and upsert support. The Data Engineer Agent uses Hudi's Copy-on-Write (CoW) and Merge-on-Read (MoR) table types to build incremental data pipelines.
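
## Key Capabilities

### Copy-on-Write (CoW)
- Rewrites the affected data files (Parquet) on every write
- Read-optimized; suited to read-heavy workloads with infrequent writes
- Data files always stay fully compacted, with no log files to merge at read time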

### Merge-on-Read (MoR)
- Updates are appended as log files and merged with the base files at read time
- Write-optimized; suited to frequent incremental writes
- Supports late-arriving data and near-real-time analytics
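
The trade-off between the two table types can be illustrated with a toy model. This is a minimal sketch in plain Python, not Hudi code: dicts and lists stand in for base files and log files, and the class names and the `rows_rewritten` counter are hypothetical, introduced only to make the write cost visible.

```python
# Toy model only: plain dicts and lists stand in for Hudi base files and
# log files. Class and attribute names are hypothetical.

class CopyOnWriteTable:
    """Every write produces a new, fully merged base file."""

    def __init__(self):
        self.base_file = {}        # record key -> row
        self.rows_rewritten = 0    # rough proxy for write cost

    def upsert(self, records):
        merged = dict(self.base_file)
        merged.update(records)
        self.base_file = merged                 # whole file is replaced
        self.rows_rewritten += len(merged)

    def read(self):
        return dict(self.base_file)             # no merge work at read time


class MergeOnReadTable:
    """Writes append to a log; readers merge the log over the base file."""

    def __init__(self):
        self.base_file = {}
        self.log = []              # appended update batches
        self.rows_rewritten = 0

    def upsert(self, records):
        self.log.append(dict(records))          # cheap append, nothing rewritten

    def read(self):
        merged = dict(self.base_file)
        for batch in self.log:                  # merge cost is paid by the reader
            merged.update(batch)
        return merged

    def compact(self):
        """Fold the log into a new base file, as a compaction job would."""
        self.base_file = self.read()
        self.rows_rewritten += len(self.base_file)
        self.log = []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for table in (cow, mor):
    table.upsert({1: "a", 2: "b", 3: "c"})
    table.upsert({2: "b2"})
    table.upsert({3: "c2"})

# Both tables expose the same logical contents; they differ in where the
# merge cost lands (write path for CoW, read path or compaction for MoR).
same_result = cow.read() == mor.read()
```

In this model the CoW table rewrites every row on each small update, while the MoR table defers that work until a reader merges the log or a compaction folds it into a new base file.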
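
### Incremental Pull
- Hudi provides incremental queries, so consumers read only the changes committed since their last processed commit
- Supports a change-log mode that returns only the changed records rather than a full snapshot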
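
As a rough illustration, an incremental pull through the Spark datasource is driven by read options. The sketch below only builds the option dictionary; the helper name `incremental_read_options` and the checkpoint value are hypothetical. The option keys are the ones documented for Hudi's Spark datasource, but they should be checked against the Hudi version in use. The change-log mode is assumed here to correspond to Hudi's CDC query format, which also requires the table to have been written with `hoodie.table.cdc.enabled` set to `true`.

```python
def incremental_read_options(begin_instant, end_instant=None, cdc_format=False):
    """Build Spark datasource read options for a Hudi incremental query.

    Hypothetical helper; only the option keys come from Hudi itself.
    """
    opts = {
        "hoodie.datasource.query.type": "incremental",
        # Only commits after this instant are returned.
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    if cdc_format:
        # Assumption: "change-log mode" maps to the CDC incremental format.
        opts["hoodie.datasource.query.incremental.format"] = "cdc"
    return opts


# Hypothetical checkpoint: the instant time of the last commit the consumer
# has already processed.
last_checkpoint = "20260501000000000"
opts = incremental_read_options(last_checkpoint, cdc_format=True)

# With a SparkSession that has the Hudi bundle on its classpath, the options
# would be passed along these lines (base_path is a placeholder):
# changes = spark.read.format("hudi").options(**opts).load(base_path)
```

After each run the consumer would store the latest commit time it has seen and pass it as `begin_instant` on the next run.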

## Use Cases

- **CDC Ingestion**: Debezium CDC records are ingested incrementally into Hudi tables
- **Slowly Changing Dimension (SCD)**: MoR tables support SCD Type 1 and Type 2
- **Time Travel Audit**: queries against past table states provide audit logs that satisfy regulatory requirements
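
To make the CDC ingestion case concrete, the following toy sketch applies Debezium-style change events to a keyed table. It is plain Python rather than Hudi code: the function name and sample rows are hypothetical, and the event shape (`op`, `before`, `after`, `ts_ms`) follows the Debezium envelope. Keeping only the newest event per key mirrors what an ordering (precombine) field does during a Hudi upsert. For simplicity the sketch does not compare event timestamps against rows already in the table.

```python
def apply_change_events(table, events, key_field="id"):
    """Return a new table (dict keyed by record key) with the events applied.

    `op` is "c" (create), "u" (update), "d" (delete) or "r" (snapshot read).
    For each key, the event with the highest `ts_ms` wins, so a late-arriving
    older event cannot overwrite a newer one from the same batch.
    """
    latest = {}
    for event in events:
        # Deletes carry the row image in `before`; other ops use `after`.
        row = event["before"] if event["op"] == "d" else event["after"]
        key = row[key_field]
        if key not in latest or event["ts_ms"] >= latest[key]["ts_ms"]:
            latest[key] = event

    result = dict(table)
    for key, event in latest.items():
        if event["op"] == "d":
            result.pop(key, None)
        else:
            result[key] = event["after"]
    return result


# Hypothetical sample data.
existing = {2: {"id": 2, "city": "Lima"}}
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "city": "Oslo"}, "ts_ms": 100},
    {"op": "u", "before": {"id": 1, "city": "Oslo"},
     "after": {"id": 1, "city": "Bergen"}, "ts_ms": 300},
    # Late arrival: older than the update above, so it must not win.
    {"op": "u", "before": {"id": 1, "city": "Oslo"},
     "after": {"id": 1, "city": "Tromso"}, "ts_ms": 200},
    {"op": "d", "before": {"id": 2, "city": "Lima"},
     "after": None, "ts_ms": 150},
]
current = apply_change_events(existing, events)
```

An SCD Type 1 dimension behaves like `current` here, with each key holding only its latest row. A Type 2 dimension would instead keep the superseded rows with validity ranges.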

## Ecosystem Position

Apache Hudi stands alongside [[Delta Lake]] and [[Apache Iceberg]] as one of the three major open table formats. Hudi's differentiators are its incremental processing capabilities and its deep integration with Spark Structured Streaming.

## Related Concepts
- [[Delta Lake]]
- [[Apache Iceberg]]
- [[CDC (Change Data Capture)]]
- [[SCD Type 2]]