weishen 4e9ee6f51e Auto-sync: 2026-04-24 00:02

2026-04-24 00:03:01 +08:00

title: MPP (Massively Parallel Processing)
type: concept
tags: Distributed Computing, Data-Warehouse, Performance
sources: ctp-topic-68-introduction-to-redshift
last_updated: 2026-04-23

Overview

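MPP (Massively Parallel Processing) is a distributed computing architecture in which multiple compute nodes execute queries and data-processing tasks in parallel, substantially improving query speed and system throughput on large datasets.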

How It Works

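Task decomposition: The coordinator node (Leader/Coordinator) breaks a large query into multiple subtasks.
Parallel dispatch: The subtasks are distributed to multiple compute nodes (Compute Node).
Independent execution: Each node runs its computation in parallel on its local subset of the data (Slice/Partition).
Result aggregation: Each node returns its result to the coordinator, which performs the final aggregation and produces the output.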
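
The four steps above can be sketched as a scatter-gather computation. This is a minimal illustration, not how any particular engine is implemented: threads stand in for compute nodes, round-robin placement stands in for a real distribution strategy, and the names `partition`, `local_aggregate`, and `leader_average` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, num_slices):
    """Place rows round-robin into slices (one list per simulated compute node)."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices

def local_aggregate(slice_rows):
    """Runs on a compute node: partial SUM and COUNT over its local slice only."""
    return sum(slice_rows), len(slice_rows)

def leader_average(rows, num_slices=4):
    """Leader: split the work, fan it out, then merge the partial results."""
    slices = partition(rows, num_slices)
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        partials = list(pool.map(local_aggregate, slices))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

print(leader_average(list(range(1, 101))))  # 50.5
```

Note that the leader computes AVG from per-node SUM and COUNT pairs rather than averaging the per-node averages, which would be wrong whenever slices hold different numbers of rows.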

Key Benefits

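Linear scaling: In the ideal case, adding nodes improves query performance close to linearly, provided the data is evenly distributed and cross-node communication stays small.
High throughput: Well suited to complex analytical queries and large-scale data aggregation.
Fault tolerance: A single-node failure does not bring down the whole system (in some implementations).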
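
How close to linear the scaling gets depends on how much of a query cannot be parallelized, such as planning and the final merge on the leader. One common way to reason about this is Amdahl's law; the sketch below applies it with an assumed serial fraction, which is an illustrative number and not a measurement from any system.

```python
def amdahl_speedup(num_nodes, serial_fraction):
    """Upper bound on speedup when a fixed fraction of the work stays serial."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_nodes)

# Assumed for illustration: 5% of the query runs serially on the leader.
for nodes in (2, 4, 8, 16):
    print(nodes, round(amdahl_speedup(nodes, 0.05), 2))
```

With a serial fraction of zero the function returns exactly the node count, which is the ideal linear case; any non-zero serial fraction caps the achievable speedup.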

Trade-offs

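Data skew: Uneven data distribution overloads some nodes.
Cross-node communication: Transferring data between nodes adds latency.
Complex query optimization: Requires a carefully designed data distribution strategy.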
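
Data skew can be made concrete by hashing keys to nodes and comparing the busiest node with the average. The sketch below is an illustration under assumptions: CRC32 is used only because it is deterministic across runs, real engines use their own hash functions, and the customer keys are made up.

```python
import zlib
from collections import Counter

def node_for(key, num_nodes):
    """Assign a key to a node with a stable hash (CRC32, chosen for determinism)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def rows_per_node(keys, num_nodes):
    """Count how many rows land on each node."""
    counts = Counter(node_for(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]

def skew_ratio(loads):
    """Busiest node divided by the mean load; 1.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0

# Hypothetical data: 10,000 distinct customers vs. one customer owning 90% of rows.
even_keys = [f"customer-{i}" for i in range(10_000)]
hot_keys = ["customer-0"] * 9_000 + [f"customer-{i}" for i in range(1, 1_001)]

print(skew_ratio(rows_per_node(even_keys, 4)))
print(skew_ratio(rows_per_node(hot_keys, 4)))
```

Because every row with the same key hashes to the same node, the hot key forces at least 9,000 of the 10,000 rows onto one of four nodes, so the second ratio is at least 3.6 no matter which hash function is used.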

Applications

Data warehousing: Amazon Redshift, Snowflake, Google BigQuery
Big data processing: Apache Spark (Spark SQL), Presto/Trino
Scientific computing: distributed matrix operations, genomic analysis

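Columnar-Storage: Columnar storage works together with MPP to optimize analytical queries.
Distribution-Key: The data distribution strategy affects MPP performance.
Sort-Key: Sort keys improve data locality, raising efficiency within each MPP node.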
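
One reason the distribution key matters is join co-location: when a table is distributed on its join column, matching rows are already on the same node and nothing needs to be shipped across the network. The sketch below counts how many rows would have to move for a join on `customer_id`, assuming the other table is hash-distributed on `customer_id` with the same hash. The table, column names, and the `rows_to_move` helper are hypothetical.

```python
import zlib

def node_for(value, num_nodes):
    """Map a distribution-key value to a node with a stable hash (CRC32 here)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes

def rows_to_move(rows, dist_col, join_col, num_nodes):
    """Count rows whose current node differs from the node their join key hashes to."""
    return sum(
        1
        for row in rows
        if node_for(row[dist_col], num_nodes) != node_for(row[join_col], num_nodes)
    )

# Hypothetical orders table: 1,000 orders spread over 50 customers.
orders = [{"order_id": i, "customer_id": i % 50} for i in range(1_000)]

# Distributed on the join column: every row is already where the join needs it.
print(rows_to_move(orders, "customer_id", "customer_id", 4))  # 0
# Distributed on order_id: rows generally have to be redistributed before joining.
print(rows_to_move(orders, "order_id", "customer_id", 4))
```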