| title | type | tags | sources | last_updated |
| --- | --- | --- | --- | --- |
| Apache Spark | entity | data-engineering, big-data, processing-engine | engineering-data-engineer | 2026-05-02 |
Overview
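Apache Spark is a unified engine for large-scale data processing that supports batch processing, stream processing, machine learning, and SQL queries. The Data Engineer Agent uses PySpark (Spark's Python API) as its core compute platform to build Bronze→Silver→Gold ETL/ELT pipelines.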
Key Capabilities for Data Engineering
PySpark Data Pipeline
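A minimal sketch of what a Bronze→Silver→Gold flow might look like in PySpark. The function names, the JSON input format, and the column names (`order_id`, `order_date`, `amount`) are hypothetical and chosen only for illustration; they are not taken from the agent's actual pipelines. Each function receives a `SparkSession` or `DataFrame` as an argument and uses only standard DataFrame methods.

```python
# Illustrative Bronze -> Silver -> Gold steps.
# All column names (order_id, order_date, amount) are hypothetical.


def to_bronze(spark, raw_path):
    """Bronze: load raw JSON as-is and stamp each row with an ingestion time."""
    raw = spark.read.json(raw_path)
    return raw.selectExpr("*", "current_timestamp() AS ingested_at")


def to_silver(bronze_df):
    """Silver: drop rows without a key and de-duplicate on that key."""
    return (
        bronze_df
        .filter("order_id IS NOT NULL")
        .dropDuplicates(["order_id"])
    )


def to_gold(silver_df):
    """Gold: aggregate to one business-level row per day."""
    return (
        silver_df
        .groupBy("order_date")
        .sum("amount")  # produces a column named "sum(amount)"
        .withColumnRenamed("sum(amount)", "daily_revenue")
    )
```

With a live session the steps chain as `to_gold(to_silver(to_bronze(spark, raw_path)))`; a real pipeline would normally persist each layer to storage between steps rather than chaining them in memory.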
Delta Lake Integration
- Spark + Delta Lake is the standard implementation pairing for the Medallion Architecture
- Supports `mergeSchema=true` for handling schema evolution
- Supports `MERGE INTO` for idempotent upserts
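The two Delta Lake features above can be sketched as follows. This is an illustration under assumptions, not the agent's actual code: the helper names `append_with_schema_evolution` and `build_merge_sql`, and the table and column names in the usage note, are hypothetical. The merge helper only builds the SQL text, which would then be passed to `spark.sql(...)` on a session with Delta Lake enabled.

```python
def append_with_schema_evolution(df, path):
    """Append to a Delta table, letting new columns extend the table schema."""
    (
        df.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # accept columns the table lacks
        .save(path)
    )


def build_merge_sql(target, source, key_columns):
    """Build a MERGE INTO statement that upserts `source` into `target`.

    Identifiers are interpolated directly, so they must come from
    trusted configuration, never from user input.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )
```

For example, `spark.sql(build_merge_sql("silver.orders", "updates", ["order_id"]))` would upsert a registered `updates` view into `silver.orders`. Re-running the same batch matches the existing keys and updates those rows instead of inserting duplicates, which is what makes the upsert idempotent.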
Streaming
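- Spark Structured Streaming + Kafka: builds real-time pipelines with exactly-once semantics
- Trigger modes: Micro-batch (the default) or Continuous (continuous processing, which the Spark documentation describes as experimental and at-least-once)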
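A rough sketch of a Kafka-to-Delta stream in micro-batch mode. It assumes the Kafka connector and Delta Lake are available on the cluster; the function names, the `startingOffsets` choice, and the one-minute trigger interval are illustrative assumptions. In this setup, exactly-once behaviour depends on the checkpoint location together with a transactional sink such as Delta.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",  # assumption: replay from the start
    }


def start_kafka_to_delta(spark, bootstrap_servers, topic,
                         table_path, checkpoint_path):
    """Start a micro-batch stream that appends Kafka records to a Delta table."""
    events = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # Kafka delivers key/value as binary; cast them for downstream use.
        .selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "timestamp",
        )
    )
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")
        # The checkpoint stores consumed offsets so a restart resumes safely.
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="1 minute")  # micro-batch trigger
        .start(table_path)
    )
```

The returned `StreamingQuery` can be held by the caller, for instance to call `awaitTermination()` on it.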
Performance Features
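- Adaptive Query Execution (AQE): dynamic partition coalescing and Broadcast Join optimization
- Z-Ordering: multi-dimensional clustering that speeds up queries with compound filters
- Bloom Filters: file skipping on high-cardinality string columns (IDs, email addresses)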
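The AQE and Z-Ordering items above might be applied along these lines. This is a sketch with hypothetical helper names; the 10 MB broadcast threshold matches Spark's documented default and is shown only to make the setting visible. `OPTIMIZE ... ZORDER BY` is Delta Lake SQL, so it assumes a Delta table. Bloom filter index creation is left out because its syntax is platform-specific.

```python
# Session settings related to Adaptive Query Execution (AQE).
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions at runtime.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Size limit (bytes) under which a join side may be broadcast.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
}


def apply_settings(spark, settings=AQE_SETTINGS):
    """Apply each key/value pair to the session's runtime configuration."""
    for key, value in settings.items():
        spark.conf.set(key, value)


def build_zorder_sql(table, columns):
    """Build a Delta Lake OPTIMIZE statement that Z-orders by `columns`."""
    if not columns:
        raise ValueError("at least one column is required")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"
```

The Z-order columns would typically be the ones that appear together in query filters, for example `spark.sql(build_zorder_sql("gold.events", ["user_id", "event_date"]))` with hypothetical table and column names.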
Managed Platforms
- Databricks(Unity Catalog、DLT、Workflows)
- Amazon EMR (managed Spark on AWS)
- Google Dataproc (managed Spark on GCP)
Related Concepts