nexus/wiki/entities/Apache-Spark.md
2026-05-03 05:42:12 +08:00


title: Apache Spark
type: entity
tags: data-engineering, big-data, processing-engine
sources: engineering-data-engineer
last_updated: 2026-05-02

Overview

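Apache Spark is a unified engine for large-scale data processing that supports batch processing, stream processing, machine learning, and SQL queries. The Data Engineer Agent uses PySpark (Spark's Python API) as its core compute platform for building Bronze→Silver→Gold ETL/ELT pipelines.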

Key Capabilities for Data Engineering

PySpark Data Pipeline

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, sha2, concat_ws, lit
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

Delta Lake Integration

  • Spark + Delta Lake is the standard combination for implementing the Medallion Architecture
  • Supports mergeSchema=true for handling schema evolution
  • Supports MERGE INTO for idempotent upserts
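
A minimal sketch of the two points above, assuming a Delta-enabled SparkSession such as the one configured earlier. The helper names (`merge_sql`, `upsert`, `append_with_schema_evolution`), the temp view name `updates`, and the table and column names in the usage note are hypothetical, chosen only for illustration. The upsert is idempotent only if the source batch holds at most one row per key.

```python
def merge_sql(target_table, source_view, key_cols):
    """Build a Delta Lake MERGE INTO statement keyed on key_cols."""
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        f"ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, updates_df, target_table, key_cols):
    """Upsert updates_df into target_table; re-running the same batch
    leaves the table unchanged, provided keys are unique in the batch."""
    # "updates" is an assumed temp view name, not part of any API.
    updates_df.createOrReplaceTempView("updates")
    spark.sql(merge_sql(target_table, "updates", key_cols))


def append_with_schema_evolution(df, path):
    """Append to a Delta table, letting new columns in df extend its schema."""
    (df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(path))
```

For example, `upsert(spark, daily_df, "silver.customers", ["customer_id"])` would update matching customers and insert new ones in a single statement.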

Streaming

  • Spark Structured Streaming + Kafka: build real-time pipelines with exactly-once semantics
  • Trigger modes: Continuous (continuous processing) or Micro-batch
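
The sketch below shows one way a Kafka-to-Bronze stream might look in micro-batch mode. It assumes the Kafka connector package (`spark-sql-kafka-0-10`) and Delta Lake are available on the cluster; the helper names, the one-minute trigger interval, and the paths in the usage note are illustrative assumptions. End-to-end exactly-once behavior depends on the checkpoint location together with a replayable source and a transactional sink; Spark's documentation describes continuous processing as providing at-least-once guarantees.

```python
def kafka_options(bootstrap_servers, topic, starting_offsets="earliest"):
    """Options for Spark's Kafka source (option keys are the connector's own)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_bronze_stream(spark, options, bronze_path, checkpoint_path):
    """Start a micro-batch stream that lands raw Kafka records in a Delta table."""
    raw = spark.readStream.format("kafka").options(**options).load()

    # Kafka delivers key/value as binary; cast them to strings for the Bronze layer.
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "topic",
        "partition",
        "offset",
        "timestamp",
    )

    return (
        events.writeStream.format("delta")
        .outputMode("append")
        # The checkpoint stores offsets so a restart resumes where it left off.
        .option("checkpointLocation", checkpoint_path)
        # Micro-batch trigger; the interval is an assumed value.
        .trigger(processingTime="1 minute")
        .start(bronze_path)
    )
```

A call might look like `start_bronze_stream(spark, kafka_options("localhost:9092", "orders"), "/lake/bronze/orders", "/lake/_chk/orders")`, where the broker address, topic, and paths are placeholders.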

Performance Features

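  • Adaptive Query Execution (AQE): dynamic partition coalescing and Broadcast Join optimization
  • Z-Ordering: multi-dimensional clustering that speeds up queries with compound filters
  • Bloom Filters: file skipping on high-cardinality string columns (IDs, email addresses)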
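
As a rough illustration of the first two features, the sketch below enables AQE through session configuration and builds a Delta Lake `OPTIMIZE ... ZORDER BY` statement. The helper names and the table and column names in the usage note are hypothetical. AQE is already on by default in recent Spark 3.x releases, so setting it explicitly mainly documents intent; Bloom filter indexes are platform-specific and are not covered here.

```python
# Spark SQL configuration keys for Adaptive Query Execution.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after each stage.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split heavily skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_conf(spark, conf):
    """Apply runtime SQL settings to an existing SparkSession."""
    for key, value in conf.items():
        spark.conf.set(key, value)


def zorder_sql(table, cols):
    """Build an OPTIMIZE statement that Z-orders a Delta table by cols."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(cols)})"
```

With a session in hand, `apply_conf(spark, AQE_CONF)` followed by `spark.sql(zorder_sql("gold.orders", ["customer_id", "order_date"]))` would cluster the table on the two columns most often filtered together.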

Managed Platforms

  • Databricks (Unity Catalog, DLT, Workflows)
  • Amazon-RDS / EMR (managed Spark on AWS)
  • Google Dataproc (managed Spark on GCP)