Files
nexus/wiki/entities/Apache-Kafka.md
2026-05-03 05:42:12 +08:00

66 lines
2.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
title: "Apache Kafka"
type: entity
tags: [data-engineering, streaming, event-driven, real-time]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
Apache Kafka 是分布式事件流平台支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。
## Core Streaming Pipeline
```python
from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap) \
.option("subscribe", topic) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load()
parsed = stream.select(
from_json(col("value").cast("string"), order_schema).alias("data"),
col("timestamp").alias("_kafka_timestamp"),
current_timestamp().alias("_ingested_at")
).select("data.*", "_kafka_timestamp", "_ingested_at")
return parsed.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", f"{bronze_path}/_checkpoint") \
.option("mergeSchema", "true") \
.trigger(processingTime="30 seconds") \
.start(bronze_path)
```
## Key Semantics
- **Exactly-Once Processing**`checkpointLocation` + Delta Lake ACID 写入确保端到端 Exactly-Once
- **At-Least-Once**:默认 delivery guarantee配合幂等消费者可升级为 Exactly-Once
- **Late-Arriving Data**Watermark + Event-Time 窗口处理迟到事件
## Streaming vs. Micro-Batch Trade-off
| 模式 | 延迟 | 吞吐量 | 适用场景 |
|------|------|--------|----------|
| Continuous连续处理 | ~100ms | 中等 | 超低延迟需求 |
| Micro-batch微批次 | ~30s | 极高 | 高吞吐量、低成本 |
## Related Platforms
- **Azure Event Hubs**Azure 托管 Kafka 兼容服务
- **AWS Kinesis**AWS 流式数据平台Kafka 替代)
- **Confluent Cloud**Kafka 全托管云服务
## Related Concepts
- [[CDC (Change Data Capture)]]
- [[Medallion Architecture]]Kafka → Bronze Delta Lake
- [[Apache Spark]]Structured Streaming + Kafka