66 lines
2.4 KiB
Markdown
66 lines
2.4 KiB
Markdown
---
|
||
title: "Apache Kafka"
|
||
type: entity
|
||
tags: [data-engineering, streaming, event-driven, real-time]
|
||
sources: [engineering-data-engineer]
|
||
last_updated: 2026-05-02
|
||
---
|
||
|
||
## Overview
|
||
|
||
Apache Kafka 是分布式事件流平台,支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。
|
||
|
||
## Core Streaming Pipeline
|
||
|
||
```python
|
||
from pyspark.sql.functions import from_json, col, current_timestamp
|
||
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
|
||
|
||
def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
|
||
stream = spark.readStream \
|
||
.format("kafka") \
|
||
.option("kafka.bootstrap.servers", kafka_bootstrap) \
|
||
.option("subscribe", topic) \
|
||
.option("startingOffsets", "latest") \
|
||
.option("failOnDataLoss", "false") \
|
||
.load()
|
||
|
||
parsed = stream.select(
|
||
from_json(col("value").cast("string"), order_schema).alias("data"),
|
||
col("timestamp").alias("_kafka_timestamp"),
|
||
current_timestamp().alias("_ingested_at")
|
||
).select("data.*", "_kafka_timestamp", "_ingested_at")
|
||
|
||
return parsed.writeStream \
|
||
.format("delta") \
|
||
.outputMode("append") \
|
||
.option("checkpointLocation", f"{bronze_path}/_checkpoint") \
|
||
.option("mergeSchema", "true") \
|
||
.trigger(processingTime="30 seconds") \
|
||
.start(bronze_path)
|
||
```
|
||
|
||
## Key Semantics
|
||
|
||
- **Exactly-Once Processing**:`checkpointLocation` + Delta Lake ACID 写入确保端到端 Exactly-Once
|
||
- **At-Least-Once**:默认 delivery guarantee(配合幂等消费者可升级为 Exactly-Once)
|
||
- **Late-Arriving Data**:Watermark + Event-Time 窗口处理迟到事件
|
||
|
||
## Streaming vs. Micro-Batch Trade-off
|
||
|
||
| 模式 | 延迟 | 吞吐量 | 适用场景 |
|
||
|------|------|--------|----------|
|
||
| Continuous(连续处理) | ~100ms | 中等 | 超低延迟需求 |
|
||
| Micro-batch(微批次) | ~30s | 极高 | 高吞吐量、低成本 |
|
||
|
||
## Related Platforms
|
||
|
||
- **Azure Event Hubs**:Azure 托管 Kafka 兼容服务
|
||
- **AWS Kinesis**:AWS 流式数据平台(Kafka 替代)
|
||
- **Confluent Cloud**:Kafka 全托管云服务
|
||
|
||
## Related Concepts
|
||
- [[CDC (Change Data Capture)]]
|
||
- [[Medallion Architecture]](Kafka → Bronze Delta Lake)
|
||
- [[Apache Spark]](Structured Streaming + Kafka)
|