---
title: "Apache Spark"
type: entity
tags: [data-engineering, big-data, processing-engine]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

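## Overview

Apache Spark is a unified engine for large-scale data processing, supporting batch processing, stream processing, machine learning, and SQL queries. The Data Engineer Agent uses PySpark (Spark's Python API) as its core compute platform to build Bronze→Silver→Gold ETL/ELT pipelines.
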
## Key Capabilities for Data Engineering

### PySpark Data Pipeline
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws, lit
from delta.tables import DeltaTable

# Build a session with the Delta Lake SQL extension and catalog enabled,
# so Delta tables can be read and written through the Spark SQL catalog.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

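As a hedged illustration of what one Bronze→Silver step on such a session might look like, the sketch below adds a row hash and a load timestamp, then drops duplicate rows. The column names (`customer_id`, `email`), the helper names, the `row_hash` / `ingested_at` output columns, and the `||` separator are all assumptions made for illustration, not part of the original pipeline. The transformation is expressed as Spark SQL expressions passed to `selectExpr`, so the expression builder can be inspected without a running cluster.

```python
def silver_exprs(key_cols):
    """Build Spark SQL expressions that add a row hash and a load timestamp.

    Column and output names are hypothetical; adapt them to the real schema.
    """
    if not key_cols:
        raise ValueError("key_cols must not be empty")
    # Cast to STRING so non-string key columns are hashed consistently.
    joined = ", ".join(f"CAST(`{c}` AS STRING)" for c in key_cols)
    return [
        "*",
        f"sha2(concat_ws('||', {joined}), 256) AS row_hash",
        "current_timestamp() AS ingested_at",
    ]


def bronze_to_silver(bronze_df, key_cols):
    """Apply the expressions to a Bronze DataFrame and drop duplicate hashes."""
    return bronze_df.selectExpr(*silver_exprs(key_cols)).dropDuplicates(["row_hash"])
```

A call such as `bronze_to_silver(spark.read.format("delta").load(bronze_path), ["customer_id", "email"])` would return the deduplicated DataFrame, where `bronze_path` is a placeholder for the real Bronze table location.
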
### Delta Lake Integration
- Spark + Delta Lake is the standard implementation pairing for the Medallion Architecture
- Supports schema evolution via the `mergeSchema=true` write option
- Supports idempotent upserts via `MERGE INTO`

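The two Delta Lake features above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the helper names, the `updates` temp view name, and the table and key names in the usage note are hypothetical, and the merge assumes each key appears at most once in the incoming batch.

```python
def build_merge_sql(target, source, key_cols):
    """Return a Delta Lake MERGE INTO statement that upserts source into target."""
    if not key_cols:
        raise ValueError("key_cols must not be empty")
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, updates_df, target, key_cols, view_name="updates"):
    """Register the batch as a temp view and merge it into the target table."""
    updates_df.createOrReplaceTempView(view_name)
    spark.sql(build_merge_sql(target, view_name, key_cols))


def append_with_schema_evolution(df, path):
    """Append to a Delta table, letting new columns extend the table schema."""
    df.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```

For example, `upsert(spark, batch_df, "silver.customers", ["customer_id"])` merges a batch by key. Replaying the same batch updates the matched rows to the same values and inserts nothing new, which is what makes the upsert idempotent.
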
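### Streaming
- Spark Structured Streaming + Kafka: builds real-time pipelines with exactly-once semantics; the end-to-end guarantee relies on checkpointing and an idempotent or transactional sink such as Delta Lake
- Trigger modes: Micro-batch (the default) or Continuous (continuous processing, which is experimental and limited to at-least-once guarantees)
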
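A Kafka-to-Delta stream along these lines might be wired up as in the sketch below. The helper names, the one-minute trigger interval, and the choice to keep only `key`, `value`, and `timestamp` are assumptions for illustration; it also assumes the Spark Kafka connector and Delta Lake packages are available on the cluster.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Options for Spark's Kafka source; the values passed in are placeholders."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_kafka_to_delta(spark, bootstrap_servers, topic, checkpoint_path, output_path):
    """Start a micro-batch query that lands Kafka records in a Delta table."""
    stream = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    return (
        # Kafka delivers key and value as binary, so cast them to strings.
        stream.selectExpr("CAST(key AS STRING) AS key",
                          "CAST(value AS STRING) AS value",
                          "timestamp")
        .writeStream.format("delta")
        # The checkpoint stores consumed offsets so a restart resumes cleanly.
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .trigger(processingTime="1 minute")
        .start(output_path)
    )
```

The checkpoint location is what lets a restarted query resume from the last committed offsets; combined with Delta's transactional commits, it is the basis for the exactly-once behavior described above.
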
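## Performance Features

- **Adaptive Query Execution (AQE)**: dynamically coalesces shuffle partitions and optimizes Broadcast Joins at runtime
- **Z-Ordering**: multi-dimensional clustering that speeds up queries filtering on several columns
- **Bloom Filters**: file skipping for high-cardinality string columns (IDs, email addresses)
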
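As a rough sketch of how the first two features are typically switched on: the configuration keys below are standard Spark SQL settings, and `OPTIMIZE ... ZORDER BY` is the Delta Lake statement for Z-Ordering. The helper names and the table and column names in the usage note are hypothetical.

```python
# Standard Spark SQL settings for Adaptive Query Execution.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe(spark, settings=AQE_SETTINGS):
    """Set the AQE options on an existing session."""
    for key, value in settings.items():
        spark.conf.set(key, value)


def build_zorder_sql(table, columns):
    """Return a Delta Lake OPTIMIZE statement that Z-Orders by the given columns."""
    if not columns:
        raise ValueError("columns must not be empty")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"
```

For instance, `spark.sql(build_zorder_sql("gold.orders", ["customer_id", "order_date"]))` would cluster the table's files on the two columns that are most often filtered together.
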
## Managed Platforms
- [[Databricks]] (Unity Catalog, DLT, Workflows)
- Amazon EMR (managed Spark on AWS)
- Google Dataproc (managed Spark on GCP)

## Related Concepts
- [[Medallion Architecture]]
- [[Delta Lake]]
- [[Apache Kafka]]
- [[CDC (Change Data Capture)]]