--- title: "Apache Kafka" type: entity tags: [data-engineering, streaming, event-driven, real-time] sources: [engineering-data-engineer] last_updated: 2026-05-02 --- ## Overview Apache Kafka 是分布式事件流平台,支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。 ## Core Streaming Pipeline ```python from pyspark.sql.functions import from_json, col, current_timestamp from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str): stream = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", kafka_bootstrap) \ .option("subscribe", topic) \ .option("startingOffsets", "latest") \ .option("failOnDataLoss", "false") \ .load() parsed = stream.select( from_json(col("value").cast("string"), order_schema).alias("data"), col("timestamp").alias("_kafka_timestamp"), current_timestamp().alias("_ingested_at") ).select("data.*", "_kafka_timestamp", "_ingested_at") return parsed.writeStream \ .format("delta") \ .outputMode("append") \ .option("checkpointLocation", f"{bronze_path}/_checkpoint") \ .option("mergeSchema", "true") \ .trigger(processingTime="30 seconds") \ .start(bronze_path) ``` ## Key Semantics - **Exactly-Once Processing**:`checkpointLocation` + Delta Lake ACID 写入确保端到端 Exactly-Once - **At-Least-Once**:默认 delivery guarantee(配合幂等消费者可升级为 Exactly-Once) - **Late-Arriving Data**:Watermark + Event-Time 窗口处理迟到事件 ## Streaming vs. Micro-Batch Trade-off | 模式 | 延迟 | 吞吐量 | 适用场景 | |------|------|--------|----------| | Continuous(连续处理) | ~100ms | 中等 | 超低延迟需求 | | Micro-batch(微批次) | ~30s | 极高 | 高吞吐量、低成本 | ## Related Platforms - **Azure Event Hubs**:Azure 托管 Kafka 兼容服务 - **AWS Kinesis**:AWS 流式数据平台(Kafka 替代) - **Confluent Cloud**:Kafka 全托管云服务 ## Related Concepts - [[CDC (Change Data Capture)]] - [[Medallion Architecture]](Kafka → Bronze Delta Lake) - [[Apache Spark]](Structured Streaming + Kafka)