---
title: "Apache Hudi"
type: entity
tags: [data-engineering, lakehouse, open-table-format, incremental]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
Apache Hudi is another open table format, focused on incremental processing and upsert support. The Data Engineer Agent uses Hudi's Copy-on-Write (CoW) and Merge-on-Read (MoR) table types to build incremental data pipelines.
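
As a rough illustration of how an upsert and a table type are selected, the sketch below builds the option map that a Spark write to Hudi would take. The option keys are the Spark datasource keys as documented for Hudi; check them against the Hudi version in use. The table name `orders`, the fields `order_id` and `updated_at`, and the path in the comment are hypothetical.

```python
def hudi_write_options(table_name, record_key, precombine_field,
                       table_type="COPY_ON_WRITE"):
    """Build Spark datasource options for an upsert into a Hudi table."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown Hudi table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # Field that uniquely identifies a record (the upsert key).
        "hoodie.datasource.write.recordkey.field": record_key,
        # Field used to pick the winner when two records share a key.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": table_type,
    }


# Hypothetical table and field names, for illustration only.
options = hudi_write_options("orders", "order_id", "updated_at", "MERGE_ON_READ")

# With a SparkSession and the Hudi bundle available, the write would look like:
# df.write.format("hudi").options(**options).mode("append").save("/tmp/hudi/orders")
```

Keeping the options in a plain dictionary lets the same pipeline switch between CoW and MoR by changing one argument.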
## Key Capabilities
### Copy-on-Write (CoW)
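- Each write rewrites the affected data files (Parquet)
- Read-optimized; suited to read-heavy, write-light workloads
- Data always stays in compact columnar files, so reads need no merge step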
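
The rewrite-on-write behaviour can be sketched with a toy in-memory model. This is not Hudi code: files are plain dictionaries, and the `file-N.parquet` names and the `id` key are hypothetical. It only shows why a single-row update costs a whole-file rewrite under CoW.

```python
def cow_upsert(files, updates, key="id"):
    """Toy Copy-on-Write upsert.

    `files` maps a file name to its list of rows. Any file that holds an
    updated key is rewritten in full; untouched files are carried over as-is.
    Returns the new layout and the names of the files that were (re)written.
    """
    pending = {row[key]: row for row in updates}
    new_files, rewritten = {}, []
    for name, rows in files.items():
        if any(r[key] in pending for r in rows):
            # One changed row forces a rewrite of the entire file.
            new_files[name] = [
                pending.pop(r[key]) if r[key] in pending else r for r in rows
            ]
            rewritten.append(name)
        else:
            new_files[name] = rows
    if pending:
        # Keys not found anywhere are inserts and land in a new file.
        new_name = f"file-{len(files) + 1}.parquet"
        new_files[new_name] = list(pending.values())
        rewritten.append(new_name)
    return new_files, rewritten


files = {
    "file-1.parquet": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "file-2.parquet": [{"id": 3, "v": "c"}],
}
new_files, rewritten = cow_upsert(files, [{"id": 2, "v": "B"}])
```

Here only `file-1.parquet` is rewritten; `file-2.parquet` is reused unchanged, which is why reads stay cheap while frequent small updates become expensive.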
### Merge-on-Read (MoR)
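- Updates are appended as log files and merged at read time
- Write-optimized; suited to high-frequency incremental writes
- Supports late-arriving data and near-real-time analytics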
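
A matching toy model for MoR, again not Hudi code, appends updates to a log and merges them only when a reader asks for a snapshot. The class name `MorFileGroup` and the `id` key are hypothetical. The `compact` method models Hudi's compaction step, which the list above does not mention but which is what eventually folds the log back into the base file.

```python
class MorFileGroup:
    """Toy Merge-on-Read file group: a columnar base plus an append-only log."""

    def __init__(self, base_rows, key="id"):
        self.key = key
        self.base = list(base_rows)
        self.log = []

    def upsert(self, rows):
        # Writes are cheap: nothing is rewritten, rows are only appended.
        self.log.extend(rows)

    def snapshot_read(self):
        # Reads pay the cost: later log entries override earlier rows per key.
        merged = {r[self.key]: r for r in self.base}
        for r in self.log:
            merged[r[self.key]] = r
        return list(merged.values())

    def compact(self):
        # Fold the log into a new base so later reads need no merge.
        self.base = self.snapshot_read()
        self.log = []


group = MorFileGroup([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
group.upsert([{"id": 2, "v": "B"}, {"id": 3, "v": "c"}])
snapshot = group.snapshot_read()
```

The base file is untouched by `upsert`, which is the trade-off against CoW: lower write latency in exchange for merge work at read time until compaction runs.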
### Incremental Pull
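- Hudi provides incremental queries (`hoodie.datasource.query.type=incremental` in the Spark datasource), so consumers read only the changes made since their last processed commit
- Supports a Change Log mode (returns only the changed records rather than a full snapshot)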
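
The checkpoint pattern behind incremental pull can be sketched without Spark. The model below is an assumption-laden simplification: commits are dictionaries with a sortable `instant` string, the begin instant is treated as exclusive, and only the latest version of each changed key is returned. The commit layout and the `id` key are hypothetical.

```python
def incremental_pull(commits, begin_instant, key="id"):
    """Return rows changed after `begin_instant` and the next checkpoint.

    Instants are fixed-width timestamp strings, so string comparison
    orders them chronologically.
    """
    newer = sorted(
        (c for c in commits if c["instant"] > begin_instant),
        key=lambda c: c["instant"],
    )
    latest = {}
    for commit in newer:
        for row in commit["records"]:
            latest[row[key]] = row  # a later commit wins for the same key
    checkpoint = newer[-1]["instant"] if newer else begin_instant
    return list(latest.values()), checkpoint


commits = [
    {"instant": "20260501090000", "records": [{"id": 1, "v": "a"}]},
    {"instant": "20260502090000", "records": [{"id": 2, "v": "b"}]},
    {"instant": "20260502100000", "records": [{"id": 1, "v": "a2"}]},
]
changes, checkpoint = incremental_pull(commits, "20260501090000")
# In Spark the equivalent read would set, among other options:
#   hoodie.datasource.query.type = incremental
#   hoodie.datasource.read.begin.instanttime = <last checkpoint>
```

The consumer stores `checkpoint` after each successful run and passes it as `begin_instant` next time, so an idle table yields an empty result and an unchanged checkpoint.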
## Use Cases
- **CDC Ingestion**: Hudi + Debezium CDC records → incremental ingestion
- **Slowly Changing Dimension (SCD)**: MoR tables support SCD Type 1 and Type 2
- **Time Travel Audit**: audit logs that satisfy regulatory requirements
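
For the CDC case, one possible shape of the glue code is a function that flattens a Debezium change event into a row ready for a Hudi upsert. This is a sketch under assumptions: the event is taken to be the already-unwrapped Debezium payload with `op`, `before`, `after`, and `ts_ms`, and deletes are expressed through Hudi's `_hoodie_is_deleted` marker column. The `id` and `name` fields are hypothetical.

```python
def debezium_to_hudi_row(event):
    """Flatten one Debezium payload into a row for a Hudi upsert."""
    op = event["op"]
    if op == "d":
        # Deletes carry the last known state in `before`.
        row = dict(event["before"])
        row["_hoodie_is_deleted"] = True
    elif op in ("c", "u", "r"):
        # Create, update, and snapshot-read events carry the new state in `after`.
        row = dict(event["after"])
        row["_hoodie_is_deleted"] = False
    else:
        raise ValueError(f"unsupported Debezium op: {op!r}")
    # Keep the event time so it can serve as the precombine field.
    row["ts_ms"] = event["ts_ms"]
    return row


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}, "ts_ms": 1000},
    {"op": "u", "before": {"id": 1, "name": "Ada"},
     "after": {"id": 1, "name": "Ada L."}, "ts_ms": 2000},
    {"op": "d", "before": {"id": 1, "name": "Ada L."}, "after": None, "ts_ms": 3000},
]
rows = [debezium_to_hudi_row(e) for e in events]
```

Using `ts_ms` as the precombine field means that if several events for one key arrive in the same batch, the most recent one is the version that survives the upsert.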
## Ecosystem Position
Apache Hudi, together with [[Delta Lake]] and [[Apache Iceberg]], is one of the three major open table formats. Hudi's differentiators are its incremental processing capability and its deep integration with Spark Structured Streaming.
## Related Concepts
- [[Delta Lake]]
- [[Apache Iceberg]]
- [[CDC (Change Data Capture)]]
- [[SCD Type 2]]