Update nexus wiki content
43
wiki/entities/Apache-Hudi.md
Normal file
@@ -0,0 +1,43 @@
---
title: "Apache Hudi"
type: entity
tags: [data-engineering, lakehouse, open-table-format, incremental]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

Apache Hudi is another open table format, focused on incremental processing and upsert support. The Data Engineer Agent uses Hudi's Copy-on-Write (CoW) and Merge-on-Read (MoR) table types to build incremental data pipelines.
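
## Key Capabilities

### Copy-on-Write (CoW)

- Every write rewrites the affected data files (Parquet)
- Read-optimized; suited to workloads with few writes and many reads
- Data files are always fully merged columnar files, so reads never need to combine extra log files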
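
The copy-on-write behaviour described above can be sketched with a toy in-memory model. This is not Hudi code: `CowTable` and its methods are hypothetical names used only for illustration, and a plain dict stands in for a Parquet data file.

```python
from dataclasses import dataclass, field


@dataclass
class CowTable:
    """Toy copy-on-write table: every commit stores a full new file version."""

    # One immutable snapshot (a dict standing in for a data file) per commit.
    versions: list = field(default_factory=list)

    def upsert(self, records: dict) -> None:
        # Copy the latest version, apply the changes, and keep the result
        # as a brand-new version. The previous version is left untouched.
        latest = dict(self.versions[-1]) if self.versions else {}
        latest.update(records)
        self.versions.append(latest)

    def read(self) -> dict:
        # Readers return the newest version; no merge work at query time.
        return self.versions[-1] if self.versions else {}


table = CowTable()
table.upsert({"u1": "alice", "u2": "bob"})
table.upsert({"u2": "robert"})  # rewrites the whole "file" for one change
```

The second upsert changes a single record but still produces a complete new version, which is why this layout favours read-heavy workloads over frequent small writes.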

### Merge-on-Read (MoR)

- Updates are appended as log records and merged at read time
- Write-optimized; suited to high-frequency incremental writes
- Supports late-arriving data and near-real-time analytics
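
As a rough illustration of the merge-on-read idea, the sketch below keeps a base file and an append-only log, and merges them only when a reader asks for the latest state. `MorTable` and its method names are hypothetical and simplified; real Hudi manages file groups, log blocks, and scheduled compaction.

```python
class MorTable:
    """Toy merge-on-read table: writes append to a log, reads merge it."""

    def __init__(self):
        self.base = {}  # stands in for the compacted columnar base file
        self.log = []   # appended (key, value) updates not yet compacted

    def upsert(self, records: dict) -> None:
        # Writes are cheap: append to the log and leave the base file alone.
        self.log.extend(records.items())

    def read_snapshot(self) -> dict:
        # Latest view: replay the log on top of a copy of the base file.
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        return merged

    def read_optimized(self) -> dict:
        # Fast view: base file only, may lag behind recent writes.
        return dict(self.base)

    def compact(self) -> None:
        # Fold the log into a new base file and clear the log.
        self.base = self.read_snapshot()
        self.log = []


mor = MorTable()
mor.upsert({"u1": "alice", "u2": "bob"})
mor.compact()
mor.upsert({"u2": "robert"})  # lands in the log only
```

Until the next compaction, the snapshot view sees `robert` while the read-optimized view still returns `bob`, which is the freshness versus read-cost trade-off of this table type.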

### Incremental Pull

- Hudi offers incremental queries, so consumers only read the changes made since their last processed commit
- Supports a change log mode that returns only the changed records rather than a full snapshot
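
A minimal sketch of the incremental pull pattern follows. The toy timeline, `pull_changes`, and `incremental_read_options` are hypothetical helpers; the two `hoodie.datasource.*` keys are Hudi Spark datasource read options as commonly documented, but the exact key names and whether the begin boundary is inclusive or exclusive should be checked against the Hudi version in use. The toy below assumes a strictly-after checkpoint.

```python
def incremental_read_options(begin_instant: str) -> dict:
    """Build Spark read options for a Hudi incremental query (assumed keys)."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def pull_changes(timeline, checkpoint: str):
    """Toy consumer: collect records from commits after the checkpoint."""
    changes = {}
    latest = checkpoint
    for instant, records in timeline:
        # Fixed-width timestamps compare correctly as strings.
        if instant > checkpoint:
            changes.update(records)
            latest = max(latest, instant)
    return changes, latest


# Toy commit timeline: (instant time, records written by that commit).
commits = [
    ("20240101000000", {"u1": "alice"}),
    ("20240102000000", {"u2": "bob"}),
    ("20240103000000", {"u1": "alicia"}),
]

changes, new_checkpoint = pull_changes(commits, "20240101000000")
opts = incremental_read_options(new_checkpoint)
# With PySpark and the Hudi bundle available, the options would be used as:
# spark.read.format("hudi").options(**opts).load(table_path)
```

The consumer stores `new_checkpoint` after each run and passes it as the begin instant next time, so each run processes only new commits instead of rescanning the table.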

## Use Cases

- **CDC Ingestion**: Debezium CDC records are ingested incrementally into Hudi tables
- **Slowly Changing Dimension (SCD)**: MoR tables support SCD Type 1 and Type 2
- **Time Travel Audit**: audit logs that satisfy regulatory requirements
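
To make the CDC ingestion case concrete, here is a small sketch of applying Debezium-style change events as upserts and deletes on a keyed table. The `op`, `before`, and `after` fields follow the Debezium event envelope; `apply_cdc`, the `id` key field, and the sample rows are assumptions for illustration, and a dict stands in for the Hudi table.

```python
def apply_cdc(table: dict, events: list, key_field: str = "id") -> dict:
    """Apply Debezium-style events to a dict keyed by the record key."""
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # Create, snapshot read, and update all become an upsert.
            row = event["after"]
            table[row[key_field]] = row
        elif op == "d":
            # Deletes carry the old row in "before".
            table.pop(event["before"][key_field], None)
        else:
            raise ValueError(f"unsupported op: {op!r}")
    return table


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "bob"}},
    {"op": "u", "before": {"id": 2, "name": "bob"},
     "after": {"id": 2, "name": "robert"}},
    {"op": "d", "before": {"id": 1, "name": "alice"}, "after": None},
]

state = apply_cdc({}, events)
```

In a real pipeline the same mapping would typically be expressed as Hudi upsert and delete writes keyed on the record key, rather than dict operations.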

## Ecosystem Position

Apache Hudi stands alongside [[Delta Lake]] and [[Apache Iceberg]] as one of the three major open table formats. Hudi's differentiators are its incremental processing capabilities and its deep integration with Spark Structured Streaming.

## Related Concepts

- [[Delta Lake]]
- [[Apache Iceberg]]
- [[CDC (Change Data Capture)]]
- [[SCD Type 2]]