| title | type | tags | sources | last_updated |
| --- | --- | --- | --- | --- |
| Great Expectations | entity | data-engineering, data-quality, testing, validation | engineering-data-engineer | 2026-05-02 |
Overview
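Great Expectations (gx) is a Python-native data quality validation framework for automated testing of data pipeline reliability. The Data Engineer Agent uses Great Expectations between the Silver and Gold layers to apply row-level data quality scoring, so that Gold-layer data meets its SLA commitments.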
Core Usage Pattern
Key Features
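- Expectations: declarative data quality assertions (such as `expect_column_values_to_be_between` and `expect_column_mean_to_be_between`)
- Profiling: automatically generates expectation rules from existing data
- Data Docs: automated data quality reports (HTML format)
- Checkpoint: embeds tests in CI/CD pipelines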
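To make the idea of a declarative assertion concrete, the sketch below mimics what the two expectations named above check, using plain Python only. It is not the Great Expectations API: the helper names `check_values_between` and `check_mean_between`, the `amount` column, and the simplified result dictionaries are assumptions made for illustration.

```python
from statistics import mean


def check_values_between(rows, column, min_value, max_value):
    """Row-wise check: every non-null value must lie in [min_value, max_value]."""
    values = [r[column] for r in rows if r.get(column) is not None]
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def check_mean_between(rows, column, min_value, max_value):
    """Aggregate check: the mean of the non-null values must lie in the range."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        # Nothing to average, so the check cannot pass (illustrative choice).
        return {"success": False, "observed_value": None}
    observed = mean(values)
    return {"success": min_value <= observed <= max_value, "observed_value": observed}


# Hypothetical sample data: one value violates the row-wise range.
rows = [{"amount": 10.0}, {"amount": 25.0}, {"amount": -5.0}]

values_result = check_values_between(rows, "amount", 0, 100)
mean_result = check_mean_between(rows, "amount", 0, 100)
```

The same batch can fail a row-wise expectation while passing an aggregate one, which is why pipelines typically declare both kinds.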
Role in Medallion Architecture
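- Bronze layer: basic integrity checks (whether files are readable, whether the schema exists)
- Silver layer: field-level quality validation (null rate, value ranges, distributions)
- Gold layer: business rule validation plus SLA scoring; a score must be attached to every row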
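The Gold-layer requirement that a score accompany every row can be sketched as follows. This is a minimal plain-Python illustration, not the agent's actual implementation and not a Great Expectations feature: the rule set `RULES`, the `dq_score` column name, the sample columns, and the equal weighting of rules are all assumptions.

```python
# Hypothetical business rules; each takes a row dict and returns True or False.
RULES = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_in_range": lambda r: r.get("amount") is not None and 0 <= r["amount"] <= 10_000,
    "currency_known": lambda r: r.get("currency") in {"USD", "EUR"},
}


def score_row(row, rules):
    """Return the fraction of rules the row passes (1.0 means fully valid)."""
    passed = sum(1 for rule in rules.values() if rule(row))
    return passed / len(rules)


def attach_scores(rows, rules):
    """Return copies of the rows with a dq_score column appended to each."""
    return [{**row, "dq_score": score_row(row, rules)} for row in rows]


# Hypothetical Silver-layer input.
silver_rows = [
    {"order_id": 1, "amount": 120.0, "currency": "USD"},
    {"order_id": None, "amount": -3.0, "currency": "USD"},
]

gold_rows = attach_scores(silver_rows, RULES)
```

Keeping the score on the row, rather than only in a batch-level report, lets downstream consumers filter or weight records against whatever SLA threshold applies to them.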
Integration
- Spark Integration: validates PySpark DataFrames directly through a Spark data source (in GX 1.x, registered with `context.data_sources.add_spark()`)
- Databricks: native integration of Great Expectations with Databricks Jobs
- dbt Tests: Great Expectations rules can be exported as dbt tests
Related Concepts