Update nexus wiki content

This commit is contained in:
2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions

View File

@@ -0,0 +1,65 @@
---
title: "Apache Kafka"
type: entity
tags: [data-engineering, streaming, event-driven, real-time]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
Apache Kafka 是分布式事件流平台支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层,构建 Exactly-Once 语义的实时 Bronze 层摄取管道。
## Core Streaming Pipeline
```python
from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap) \
.option("subscribe", topic) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load()
parsed = stream.select(
from_json(col("value").cast("string"), order_schema).alias("data"),
col("timestamp").alias("_kafka_timestamp"),
current_timestamp().alias("_ingested_at")
).select("data.*", "_kafka_timestamp", "_ingested_at")
return parsed.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", f"{bronze_path}/_checkpoint") \
.option("mergeSchema", "true") \
.trigger(processingTime="30 seconds") \
.start(bronze_path)
```
## Key Semantics
- **Exactly-Once Processing**`checkpointLocation` + Delta Lake ACID 写入确保端到端 Exactly-Once
- **At-Least-Once**:默认 delivery guarantee配合幂等消费者可升级为 Exactly-Once
- **Late-Arriving Data**Watermark + Event-Time 窗口处理迟到事件
## Streaming vs. Micro-Batch Trade-off
| 模式 | 延迟 | 吞吐量 | 适用场景 |
|------|------|--------|----------|
| Continuous连续处理 | ~100ms | 中等 | 超低延迟需求 |
| Micro-batch微批次 | ~30s | 极高 | 高吞吐量、低成本 |
## Related Platforms
- **Azure Event Hubs**Azure 托管 Kafka 兼容服务
- **AWS Kinesis**AWS 流式数据平台Kafka 替代)
- **Confluent Cloud**Kafka 全托管云服务
## Related Concepts
- [[CDC (Change Data Capture)]]
- [[Medallion Architecture]]Kafka → Bronze Delta Lake
- [[Apache Spark]]Structured Streaming + Kafka