---
title: "Great Expectations"
type: entity
tags: [data-engineering, data-quality, testing, validation]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---
## Overview
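Great Expectations (gx) is a Python-native data quality validation framework that supports automated testing of data pipeline reliability. The Data Engineer Agent uses Great Expectations between the Silver and Gold layers to apply row-level data quality scoring, ensuring that Gold-layer data meets its SLA commitments.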
## Core Usage Pattern
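```python
from datetime import datetime

import great_expectations as gx

context = gx.get_context()


class DataQualityException(Exception):
    """Assumed to be the pipeline's own exception type; not part of GX."""


def validate_silver_orders(df) -> dict:
    # NOTE: fluent-datasource style; exact method names and keyword
    # arguments vary across Great Expectations versions.
    batch = context.sources.pandas_default.read_dataframe(df)
    result = batch.validate(
        expectation_suite_name="silver_orders.critical",
        run_id={"run_name": "silver_orders_daily", "run_time": datetime.now()},
    )
    # Summarize the outcome for logging and downstream scoring.
    stats = {
        "success": result["success"],
        "evaluated": result["statistics"]["evaluated_expectations"],
        "passed": result["statistics"]["successful_expectations"],
        "failed": result["statistics"]["unsuccessful_expectations"],
    }
    # Stop the pipeline step if any critical expectation failed.
    if not result["success"]:
        raise DataQualityException(
            f"Silver orders failed validation: {stats['failed']} checks failed"
        )
    return stats
```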
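The gating logic in the pattern above can be exercised without a GX context. The following is a minimal sketch, assuming a result shaped like the dictionary accessed above; `summarize_result`, `fake_result`, and the stand-in `DataQualityException` are hypothetical names for illustration, not part of Great Expectations.

```python
class DataQualityException(Exception):
    """Hypothetical stand-in for the pipeline's own exception type."""


def summarize_result(result: dict) -> dict:
    """Extract summary stats from a validation result and raise on failure."""
    statistics = result["statistics"]
    stats = {
        "success": result["success"],
        "evaluated": statistics["evaluated_expectations"],
        "passed": statistics["successful_expectations"],
        "failed": statistics["unsuccessful_expectations"],
    }
    if not stats["success"]:
        raise DataQualityException(
            f"Silver orders failed validation: {stats['failed']} checks failed"
        )
    return stats


# Hand-built result mimicking the keys used above (illustrative values only).
fake_result = {
    "success": True,
    "statistics": {
        "evaluated_expectations": 3,
        "successful_expectations": 3,
        "unsuccessful_expectations": 0,
    },
}
stats = summarize_result(fake_result)
```

Separating the summary step from the GX call keeps the failure policy unit-testable with plain dictionaries.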
## Key Features
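- **Expectations**: declarative data quality assertions (such as `expect_column_values_to_be_between` and `expect_column_mean_to_be_between`)
- **Profiling**: automatically generates expectation rules from existing data
- **Data Docs**: automated data quality reports (HTML format)
- **Checkpoint**: embeds tests in CI/CD pipelines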
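To illustrate what a declarative expectation checks, the sketch below reimplements the idea behind `expect_column_values_to_be_between` and `expect_column_mean_to_be_between` in plain Python. It is not the Great Expectations API; the function names, the suite layout, and the sample rows are assumptions made for illustration.

```python
from statistics import mean


def values_between(rows, column, min_value, max_value):
    """True if every value in `column` lies within [min_value, max_value]."""
    return all(min_value <= row[column] <= max_value for row in rows)


def mean_between(rows, column, min_value, max_value):
    """True if the mean of `column` lies within [min_value, max_value]."""
    return min_value <= mean(row[column] for row in rows) <= max_value


# The "suite" is data, not control flow: each entry names a check and its
# parameters, which is what makes the assertions declarative.
suite = [
    ("amount_in_range", values_between,
     {"column": "amount", "min_value": 0, "max_value": 1000}),
    ("amount_mean_in_range", mean_between,
     {"column": "amount", "min_value": 10, "max_value": 500}),
]

# Hypothetical sample rows; the third violates the per-value range.
rows = [{"amount": 20}, {"amount": 35}, {"amount": 1200}]
report = {name: check(rows, **kwargs) for name, check, kwargs in suite}
```

Here the per-value check fails because one row exceeds the upper bound, while the column mean still falls inside its range, which shows why value-level and aggregate expectations are declared separately.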
## Role in Medallion Architecture
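- **Bronze layer**: basic integrity checks (whether files are readable, whether the schema exists)
- **Silver layer**: field-level quality validation (null rate, value ranges, distributions)
- **Gold layer**: business rule validation plus SLA scoring; the score must be attached to every row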
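The Gold-layer requirement that a score be attached to every row can be sketched in plain Python. This is a minimal illustration, assuming the score is the fraction of rules a row passes; the `dq_score` column name, the rules, and the sample rows are hypothetical and not taken from Great Expectations or the original pipeline.

```python
def score_row(row, rules):
    """Return the fraction of rules the row satisfies, from 0.0 to 1.0."""
    passed = sum(1 for rule in rules if rule(row))
    return passed / len(rules)


def attach_scores(rows, rules):
    """Return copies of the rows with a `dq_score` field added to each."""
    return [{**row, "dq_score": score_row(row, rules)} for row in rows]


# Hypothetical business rules: an order needs an id and a non-negative amount.
rules = [
    lambda r: r.get("order_id") is not None,
    lambda r: r.get("amount") is not None and r["amount"] >= 0,
]

gold_rows = attach_scores(
    [
        {"order_id": 1, "amount": 25.0},
        {"order_id": None, "amount": 10.0},
    ],
    rules,
)
```

Carrying the score on the row itself lets downstream consumers filter or weight records against an SLA threshold without re-running validation.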
## Integration
- **Spark Integration**: validates PySpark DataFrames directly via `gx.spark.from_pyspark_df()`
- **Databricks**: native Great Expectations + Databricks Jobs integration
- **dbt Tests**: Great Expectations rules can be exported as dbt tests
## Related Concepts
- [[Data Contract]]
- [[Medallion Architecture]]