Update nexus wiki content
This commit is contained in:
65
wiki/entities/Apache-Kafka.md
Normal file
65
wiki/entities/Apache-Kafka.md
Normal file
@@ -0,0 +1,65 @@
|
||||
---
|
||||
title: "Apache Kafka"
|
||||
type: entity
|
||||
tags: [data-engineering, streaming, event-driven, real-time]
|
||||
sources: [engineering-data-engineer]
|
||||
last_updated: 2026-05-02
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Apache Kafka 是分布式事件流平台,支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。
|
||||
|
||||
## Core Streaming Pipeline
|
||||
|
||||
```python
|
||||
from pyspark.sql.functions import from_json, col, current_timestamp
|
||||
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
|
||||
|
||||
def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
|
||||
stream = spark.readStream \
|
||||
.format("kafka") \
|
||||
.option("kafka.bootstrap.servers", kafka_bootstrap) \
|
||||
.option("subscribe", topic) \
|
||||
.option("startingOffsets", "latest") \
|
||||
.option("failOnDataLoss", "false") \
|
||||
.load()
|
||||
|
||||
parsed = stream.select(
|
||||
from_json(col("value").cast("string"), order_schema).alias("data"),
|
||||
col("timestamp").alias("_kafka_timestamp"),
|
||||
current_timestamp().alias("_ingested_at")
|
||||
).select("data.*", "_kafka_timestamp", "_ingested_at")
|
||||
|
||||
return parsed.writeStream \
|
||||
.format("delta") \
|
||||
.outputMode("append") \
|
||||
.option("checkpointLocation", f"{bronze_path}/_checkpoint") \
|
||||
.option("mergeSchema", "true") \
|
||||
.trigger(processingTime="30 seconds") \
|
||||
.start(bronze_path)
|
||||
```
|
||||
|
||||
## Key Semantics
|
||||
|
||||
- **Exactly-Once Processing**:`checkpointLocation` + Delta Lake ACID 写入确保端到端 Exactly-Once
|
||||
- **At-Least-Once**:默认 delivery guarantee(配合幂等消费者可升级为 Exactly-Once)
|
||||
- **Late-Arriving Data**:Watermark + Event-Time 窗口处理迟到事件
|
||||
|
||||
## Streaming vs. Micro-Batch Trade-off
|
||||
|
||||
| 模式 | 延迟 | 吞吐量 | 适用场景 |
|
||||
|------|------|--------|----------|
|
||||
| Continuous(连续处理) | ~100ms | 中等 | 超低延迟需求 |
|
||||
| Micro-batch(微批次) | ~30s | 极高 | 高吞吐量、低成本 |
|
||||
|
||||
## Related Platforms
|
||||
|
||||
- **Azure Event Hubs**:Azure 托管 Kafka 兼容服务
|
||||
- **AWS Kinesis**:AWS 流式数据平台(Kafka 替代)
|
||||
- **Confluent Cloud**:Kafka 全托管云服务
|
||||
|
||||
## Related Concepts
|
||||
- [[CDC (Change Data Capture)]]
|
||||
- [[Medallion Architecture]](Kafka → Bronze Delta Lake)
|
||||
- [[Apache Spark]](Structured Streaming + Kafka)
|
||||
Reference in New Issue
Block a user