---
title: "Apache Spark"
type: entity
tags: [data-engineering, big-data, processing-engine]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
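Apache Spark is a unified engine for large-scale data processing that supports batch processing, stream processing, machine learning, and SQL queries. The Data Engineer Agent uses PySpark (Spark's Python API) as its core compute platform for building Bronze→Silver→Gold ETL/ELT pipelines.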
## Key Capabilities for Data Engineering
### PySpark Data Pipeline
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws, lit
from delta.tables import DeltaTable

# Build a session with the Delta Lake SQL extension and catalog enabled.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```
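As a rough illustration of how the functions imported above might be used, the sketch below moves a Bronze DataFrame to Silver by deduplicating on business keys, adding a row hash, and stamping an ingestion time. The function names, the `row_hash` and `ingested_at` column names, and the `||` separator are assumptions made for this example, not part of the original pipeline.

```python
from typing import Sequence


def row_hash_expr(columns: Sequence[str], sep: str = "||") -> str:
    """Build a Spark SQL expression that hashes the given columns into one key."""
    # Cast every column to string so concat_ws accepts mixed types.
    cols = ", ".join(f"cast(`{c}` as string)" for c in columns)
    return f"sha2(concat_ws('{sep}', {cols}), 256)"


def to_silver(bronze_df, key_columns: Sequence[str]):
    """Hypothetical Bronze -> Silver step: dedupe, hash, and timestamp rows."""
    # Imported lazily so row_hash_expr stays usable without a Spark install.
    from pyspark.sql.functions import current_timestamp, expr

    return (
        bronze_df
        .dropDuplicates(list(key_columns))
        .withColumn("row_hash", expr(row_hash_expr(key_columns)))
        .withColumn("ingested_at", current_timestamp())
    )
```

A call such as `to_silver(bronze_df, ["id", "email"])` would return a new DataFrame; writing it to a Silver Delta table is left to the caller.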
### Delta Lake Integration
- Spark + Delta Lake is the standard implementation pairing for the Medallion Architecture
- Supports `mergeSchema=true` for handling schema evolution
- Supports `MERGE INTO` for idempotent upserts
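The two Delta Lake features listed here can be sketched roughly as follows. This is a minimal example under stated assumptions: the table name, the `updates` temp view, and the helper names are hypothetical, and the `UPDATE SET *` / `INSERT *` shorthand assumes the source and target share column names.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, keys: Sequence[str]) -> str:
    """Build a MERGE INTO statement that upserts `source` rows into `target`."""
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, updates_df, target: str, keys: Sequence[str]) -> None:
    """Upsert a batch; replaying the same batch leaves the table unchanged."""
    updates_df.createOrReplaceTempView("updates")  # hypothetical view name
    spark.sql(build_merge_sql(target, "updates", keys))


def append_with_schema_evolution(df, path: str) -> None:
    """Append rows, letting Delta add any new columns found in `df`."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(path)
    )
```

The upsert is idempotent only if the key columns uniquely identify a row in both the batch and the target table.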
### Streaming
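- Spark Structured Streaming + Kafka: builds real-time pipelines with exactly-once semantics (this relies on checkpointing and an idempotent or transactional sink)
- Trigger modes: Continuous (continuous processing, which is experimental and offers at-least-once guarantees) or Micro-batch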
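A Kafka-to-Delta stream along these lines might look like the sketch below. The broker address, topic, paths, and helper names are placeholders chosen for illustration, and the Kafka source assumes the `spark-sql-kafka` connector package is available on the cluster.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Reader options for the Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def trigger_kwargs(mode: str, interval: str = "1 minute") -> Dict[str, str]:
    """Map a trigger mode name to DataStreamWriter.trigger() keyword arguments."""
    if mode == "micro-batch":
        return {"processingTime": interval}
    if mode == "continuous":
        # In continuous mode the interval is the checkpoint interval.
        return {"continuous": interval}
    raise ValueError(f"unknown trigger mode: {mode}")


def start_stream(spark, servers: str, topic: str, sink_path: str, checkpoint_path: str):
    """Start a micro-batch stream from Kafka into a Delta table."""
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(servers, topic))
        .load()
    )
    # Kafka delivers key and value as binary, so cast them before use.
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp"
    )
    return (
        events.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # offsets are tracked here
        .trigger(**trigger_kwargs("micro-batch"))
        .start(sink_path)
    )
```

The checkpoint location is what lets a restarted query resume from the last committed offsets, so it should be stable across runs.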
## Performance Features
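- **Adaptive Query Execution (AQE)**: dynamic partition coalescing and broadcast join optimization
- **Z-Ordering**: multi-dimensional clustering that speeds up queries with compound filters
- **Bloom Filters**: file skipping for high-cardinality string columns (IDs, email addresses)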
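The first two features above could be switched on roughly as shown below. This is a sketch, assuming a Delta Lake version (or Databricks runtime) that supports `OPTIMIZE ... ZORDER BY`; the table and column names are hypothetical. Bloom filter indexes are left out because their syntax depends on the platform.

```python
from typing import Dict, Sequence

# Standard Spark SQL settings for Adaptive Query Execution.
AQE_SETTINGS: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def zorder_sql(table: str, columns: Sequence[str]) -> str:
    """Build an OPTIMIZE statement that Z-orders a Delta table by `columns`."""
    if not columns:
        raise ValueError("at least one Z-order column is required")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


def tune(spark, table: str, columns: Sequence[str]) -> None:
    """Enable AQE on the session, then compact and Z-order the table."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
    spark.sql(zorder_sql(table, columns))
```

Z-ordering tends to pay off when the chosen columns appear together in query filters; it rewrites data files, so it is usually run as a scheduled maintenance job instead of on every write.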
## Managed Platforms
- [[Databricks]]: Unity Catalog, DLT, Workflows
- Amazon EMR: managed Spark on AWS (related: [[Amazon-RDS]])
- Google Dataproc: managed Spark on GCP
## Related Concepts
- [[Medallion Architecture]]
- [[Delta Lake]]
- [[Apache Kafka]]
- [[CDC (Change Data Capture)]]