Update nexus wiki content

2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions
--- a/wiki/entities/Apache-Kafka.md
+++ b/wiki/entities/Apache-Kafka.md
@@ -0,0 +1,65 @@
+---
+title: "Apache Kafka"
+type: entity
+tags: [data-engineering, streaming, event-driven, real-time]
+sources: [engineering-data-engineer]
+last_updated: 2026-05-02
+---
+
+## Overview
+
+Apache Kafka 是分布式事件流平台，支持高吞吐量、低延迟的实时数据管道。Data Engineer Agent 使用 Kafka 作为流式数据摄取的核心传输层，构建 Exactly-Once 语义的实时 Bronze 层摄取管道。
+
+## Core Streaming Pipeline
+
+```python
+from pyspark.sql.functions import from_json, col, current_timestamp
+from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
+
+def stream_bronze_orders(kafka_bootstrap: str, topic: str, bronze_path: str):
+    stream = spark.readStream \
+        .format("kafka") \
+        .option("kafka.bootstrap.servers", kafka_bootstrap) \
+        .option("subscribe", topic) \
+        .option("startingOffsets", "latest") \
+        .option("failOnDataLoss", "false") \
+        .load()
+
+    parsed = stream.select(
+        from_json(col("value").cast("string"), order_schema).alias("data"),
+        col("timestamp").alias("_kafka_timestamp"),
+        current_timestamp().alias("_ingested_at")
+    ).select("data.*", "_kafka_timestamp", "_ingested_at")
+
+    return parsed.writeStream \
+        .format("delta") \
+        .outputMode("append") \
+        .option("checkpointLocation", f"{bronze_path}/_checkpoint") \
+        .option("mergeSchema", "true") \
+        .trigger(processingTime="30 seconds") \
+        .start(bronze_path)
+```
+
+## Key Semantics
+
+- **Exactly-Once Processing**：`checkpointLocation` + Delta Lake ACID 写入确保端到端 Exactly-Once
+- **At-Least-Once**：默认 delivery guarantee（配合幂等消费者可升级为 Exactly-Once）
+- **Late-Arriving Data**：Watermark + Event-Time 窗口处理迟到事件
+
+## Streaming vs. Micro-Batch Trade-off
+
+| 模式 | 延迟 | 吞吐量 | 适用场景 |
+|------|------|--------|----------|
+| Continuous（连续处理） | ~100ms | 中等 | 超低延迟需求 |
+| Micro-batch（微批次） | ~30s | 极高 | 高吞吐量、低成本 |
+
+## Related Platforms
+
+- **Azure Event Hubs**：Azure 托管 Kafka 兼容服务
+- **AWS Kinesis**：AWS 流式数据平台（Kafka 替代）
+- **Confluent Cloud**：Kafka 全托管云服务
+
+## Related Concepts
+- [[CDC (Change Data Capture)]]
+- [[Medallion Architecture]]（Kafka → Bronze Delta Lake）
+- [[Apache Spark]]（Structured Streaming + Kafka）