Update nexus wiki content

2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions

@@ -0,0 +1,51 @@
---
title: "Apache Spark"
type: entity
tags: [data-engineering, big-data, processing-engine]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
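Apache Spark is a unified engine for large-scale data processing, supporting batch processing, stream processing, machine learning, and SQL queries. The Data Engineer Agent uses PySpark (Spark's Python API) as its core compute platform for building Bronze→Silver→Gold ETL/ELT pipelines.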
## Key Capabilities for Data Engineering
### PySpark Data Pipeline
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws, lit
from delta.tables import DeltaTable

# Build a session with the Delta Lake SQL extension and catalog enabled
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```
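The `current_timestamp`, `sha2`, and `concat_ws` imports point to audit columns being added as data moves from Bronze to Silver. A minimal sketch of such a step follows. The function name `add_audit_columns`, the column names `_ingested_at` and `_row_hash`, and the `||` separator are hypothetical, not taken from the source pipeline.

```python
def add_audit_columns(df, key_cols):
    """Add an ingestion timestamp and a SHA-256 row hash to a Spark DataFrame.

    `key_cols` (hypothetical parameter) lists the columns that feed the hash.
    """
    if not key_cols:
        raise ValueError("key_cols must name at least one column")

    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import col, concat_ws, current_timestamp, sha2

    # Cast every hashed column to string so concat_ws receives uniform input.
    hashed_inputs = [col(c).cast("string") for c in key_cols]
    return (
        df.withColumn("_ingested_at", current_timestamp())
          .withColumn("_row_hash", sha2(concat_ws("||", *hashed_inputs), 256))
    )
```

A call such as `add_audit_columns(bronze_df, ["customer_id", "email"])` would return the DataFrame with the two extra columns. Note that `concat_ws` skips NULL values, so rows that differ only in which column is NULL can produce the same hash.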
### Delta Lake Integration
- Spark + Delta Lake is the standard combination for implementing the Medallion Architecture
- Supports `mergeSchema=true` for handling schema evolution
- Supports `MERGE INTO` for idempotent upserts
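The two options above can be sketched roughly as follows. The helper names `build_merge_sql` and `append_with_schema_evolution`, and the table, view, and column names in the usage note, are hypothetical. `UPDATE SET *` and `INSERT *` are Delta Lake shorthand for copying all source columns.

```python
def build_merge_sql(target, source, key_cols):
    """Return a Delta Lake MERGE INTO statement that upserts `source` into `target`."""
    if not key_cols:
        raise ValueError("key_cols must name at least one column")
    # Match target (t) and source (s) rows on every key column.
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def append_with_schema_evolution(df, path):
    """Append `df` to the Delta table at `path`, letting new columns extend its schema."""
    df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```

With an active session, `spark.sql(build_merge_sql("silver.customers", "customer_updates", ["customer_id"]))` would run the upsert against a registered view. Re-running the same batch leaves the table unchanged only if the source holds at most one row per key, so deduplicating before the merge is a reasonable precaution.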
### Streaming
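- Spark Structured Streaming + Kafka: builds real-time pipelines with Exactly-Once semantics (end to end, this relies on checkpointing and an idempotent or transactional sink such as Delta Lake)
- Trigger modes: Micro-batch (the default) or Continuous (an experimental low-latency mode with at-least-once guarantees)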
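A micro-batch Kafka-to-Delta stream might look like the sketch below. The function names, the 30-second trigger interval, and the selected columns are illustrative assumptions. The Kafka source also requires the `spark-sql-kafka` connector package on the Spark classpath.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Build the option map for Spark's Kafka streaming source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_bronze_stream(spark, bootstrap_servers, topic, table_path, checkpoint_path):
    """Start a micro-batch stream from Kafka into a Delta table and return the query."""
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    return (
        # Kafka delivers key and value as binary; cast them for downstream parsing.
        raw.selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "topic", "partition", "offset", "timestamp",
        )
        .writeStream.format("delta")
        # The checkpoint records consumed offsets so a restart resumes without duplicates.
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="30 seconds")
        .start(table_path)
    )
```

Switching to `.trigger(continuous="1 second")` would select continuous processing, but that mode supports only a limited set of sinks, so the Delta sink here assumes micro-batch.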
## Performance Features
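- **Adaptive Query Execution (AQE)**: dynamic partition coalescing and Broadcast Join optimization at runtime
- **Z-Ordering**: multi-dimensional clustering that speeds up queries filtering on several columns at once
- **Bloom Filters**: file skipping for high-cardinality string columns (IDs, emails)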
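As a rough illustration of the first two items, the sketch below collects AQE settings and builds a Delta Lake `OPTIMIZE ... ZORDER BY` statement. The helper names, the 64 MB threshold, and the table and column names are assumptions chosen for the example. `spark.sql.adaptive.autoBroadcastJoinThreshold` is available from Spark 3.2 onward. Bloom filter indexes are platform-specific and are not shown.

```python
# AQE-related session settings; the threshold value is illustrative only.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),
}


def apply_conf(spark, conf):
    """Apply each key/value pair to the runtime configuration of a session."""
    for key, value in conf.items():
        spark.conf.set(key, value)


def build_zorder_sql(table, cols):
    """Return a Delta Lake OPTIMIZE statement that Z-orders `table` by `cols`."""
    if not cols:
        raise ValueError("cols must name at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(cols)})"
```

For example, `apply_conf(spark, AQE_CONF)` followed by `spark.sql(build_zorder_sql("gold.orders", ["customer_id", "order_date"]))` would enable AQE and then cluster the table on the two filter columns.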
## Managed Platforms
- [[Databricks]]: Unity Catalog, DLT, Workflows
- Amazon EMR (managed Spark on AWS); see also [[Amazon-RDS]]
- Google Dataproc (managed Spark on GCP)
## Related Concepts
- [[Medallion Architecture]]
- [[Delta Lake]]
- [[Apache Kafka]]
- [[CDC (Change Data Capture)]]