2.4 KiB
2.4 KiB
title, type, tags, sources, last_updated
| title | type | tags | sources | last_updated | |||||
|---|---|---|---|---|---|---|---|---|---|
| Apache Kafka | entity |
|
|
2026-05-02 |
Overview
Apache Kafka 是分布式事件流平台,支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。
Core Streaming Pipeline
from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap) \
.option("subscribe", topic) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load()
parsed = stream.select(
from_json(col("value").cast("string"), order_schema).alias("data"),
col("timestamp").alias("_kafka_timestamp"),
current_timestamp().alias("_ingested_at")
).select("data.*", "_kafka_timestamp", "_ingested_at")
return parsed.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", f"{bronze_path}/_checkpoint") \
.option("mergeSchema", "true") \
.trigger(processingTime="30 seconds") \
.start(bronze_path)
Key Semantics
- Exactly-Once Processing:
checkpointLocation+ Delta Lake ACID 写入确保端到端 Exactly-Once - At-Least-Once:默认 delivery guarantee(配合幂等消费者可升级为 Exactly-Once)
- Late-Arriving Data:Watermark + Event-Time 窗口处理迟到事件
Streaming vs. Micro-Batch Trade-off
| 模式 | 延迟 | 吞吐量 | 适用场景 |
|---|---|---|---|
| Continuous(连续处理) | ~100ms | 中等 | 超低延迟需求 |
| Micro-batch(微批次) | ~30s | 极高 | 高吞吐量、低成本 |
Related Platforms
- Azure Event Hubs:Azure 托管 Kafka 兼容服务
- AWS Kinesis:AWS 流式数据平台(Kafka 替代)
- Confluent Cloud:Kafka 全托管云服务
Related Concepts
- CDC (Change Data Capture)
- Medallion Architecture(Kafka → Bronze Delta Lake)
- Apache Spark(Structured Streaming + Kafka)