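---
title: "Apache Spark"
type: entity
tags: [data-engineering, big-data, processing-engine]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

Apache Spark is a unified engine for large-scale data processing that supports batch processing, stream processing, machine learning, and SQL queries. The Data Engineer Agent uses PySpark (Spark's Python API) as its core compute platform for building Bronze→Silver→Gold ETL/ELT pipelines.

## Key Capabilities for Data Engineering

### PySpark Data Pipeline

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws, lit
from delta.tables import DeltaTable

# Build a Spark session with the Delta Lake SQL extension and catalog enabled.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

### Delta Lake Integration

- Spark + Delta Lake is the standard implementation pairing for the Medallion Architecture
- Supports the `mergeSchema=true` write option for handling schema evolution
- Supports `MERGE INTO` for idempotent upserts

### Streaming

- Spark Structured Streaming + Kafka: builds real-time pipelines with exactly-once semantics (end to end, this relies on checkpointing and an idempotent or transactional sink such as Delta Lake)
- Trigger modes: Micro-batch (the default) or Continuous (continuous processing, which is experimental and provides at-least-once rather than exactly-once guarantees)

## Performance Features

- **Adaptive Query Execution (AQE)**: coalesces shuffle partitions at runtime and converts sort-merge joins to broadcast joins when runtime statistics allow
- **Z-Ordering**: multi-dimensional clustering (a Delta Lake feature) that speeds up queries with compound filters
- **Bloom Filters**: file skipping for high-cardinality string columns (IDs, emails)

## Managed Platforms

- [[Databricks]] (Unity Catalog, DLT, Workflows)
- Amazon EMR (managed Spark on AWS)
- Google Dataproc (managed Spark on GCP)

## Related Concepts

- [[Medallion Architecture]]
- [[Delta Lake]]
- [[Apache Kafka]]
- [[CDC (Change Data Capture)]]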
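
## Example: Idempotent Upsert with MERGE INTO

The `MERGE INTO` upsert listed under Delta Lake Integration can be sketched roughly as follows. This is a minimal illustration, not the agent's actual pipeline code: the helper names `build_merge_sql` and `upsert`, the temp view name `updates`, and the table name in the usage note are all hypothetical. It assumes a Delta-enabled Spark session and a source batch that has at most one row per key.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, keys: Sequence[str]) -> str:
    """Build a Delta Lake MERGE INTO statement that upserts `source` into `target`.

    Hypothetical helper: rows are matched on the given key columns.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    # Join condition on every key column, e.g. "t.id = s.id AND t.region = s.region".
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, target: str, source_df, keys: Sequence[str]) -> None:
    """Upsert a DataFrame into a Delta table (assumes a Delta-enabled session)."""
    # Expose the batch as a temp view so the SQL statement can reference it.
    source_df.createOrReplaceTempView("updates")
    spark.sql(build_merge_sql(target, "updates", keys))
```

Calling `upsert(spark, "silver.customers", batch_df, ["customer_id"])` twice with the same batch should leave the table in the same state: the second run matches every row on its key and rewrites it with identical values, which is what makes the load safe to retry.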