Files
nexus/wiki/entities/Apache-Kafka.md
2026-05-03 05:42:12 +08:00

2.4 KiB
Raw Blame History

title, type, tags, sources, last_updated
title type tags sources last_updated
Apache Kafka entity
data-engineering
streaming
event-driven
real-time
engineering-data-engineer
2026-05-02

Overview

Apache Kafka 是分布式事件流平台支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。

Core Streaming Pipeline

from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
    stream = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", kafka_bootstrap) \
        .option("subscribe", topic) \
        .option("startingOffsets", "latest") \
        .option("failOnDataLoss", "false") \
        .load()

    parsed = stream.select(
        from_json(col("value").cast("string"), order_schema).alias("data"),
        col("timestamp").alias("_kafka_timestamp"),
        current_timestamp().alias("_ingested_at")
    ).select("data.*", "_kafka_timestamp", "_ingested_at")

    return parsed.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", f"{bronze_path}/_checkpoint") \
        .option("mergeSchema", "true") \
        .trigger(processingTime="30 seconds") \
        .start(bronze_path)

Key Semantics

  • Exactly-Once ProcessingcheckpointLocation + Delta Lake ACID 写入确保端到端 Exactly-Once
  • At-Least-Once:默认 delivery guarantee配合幂等消费者可升级为 Exactly-Once
  • Late-Arriving DataWatermark + Event-Time 窗口处理迟到事件

Streaming vs. Micro-Batch Trade-off

模式 延迟 吞吐量 适用场景
Continuous连续处理 ~100ms 中等 超低延迟需求
Micro-batch微批次 ~30s 极高 高吞吐量、低成本
  • Azure Event HubsAzure 托管 Kafka 兼容服务
  • AWS KinesisAWS 流式数据平台Kafka 替代)
  • Confluent CloudKafka 全托管云服务