nexus/wiki/entities/Great-Expectations.md

---
title: "Great Expectations"
type: entity
tags: [data-engineering, data-quality, testing, validation]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

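Great Expectations (gx) is a Python-native data quality validation framework for automated testing of data pipeline reliability. The Data Engineer Agent uses Great Expectations between the Silver and Gold layers to apply row-level data quality scoring, so that Gold-layer data meets its SLA commitments.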

## Core Usage Pattern

```python
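from datetime import datetime

import great_expectations as gx

# Assumes the fluent datasource API of GX 0.16 to 0.18.
# GX 1.x exposes the same entry point as `context.data_sources`.
context = gx.get_context()


class DataQualityException(Exception):
    """Raised when a critical expectation suite fails."""


def validate_silver_orders(df) -> dict:
    # Wrap the pandas DataFrame in a Validator on the default pandas datasource.
    validator = context.sources.pandas_default.read_dataframe(df)
    # The suite must already exist in the context's expectation store.
    suite = context.get_expectation_suite("silver_orders.critical")
    result = validator.validate(
        expectation_suite=suite,
        run_id={"run_name": "silver_orders_daily", "run_time": datetime.now()},
    )
    # Summarize the run so callers can log or alert on it.
    stats = {
        "success": result["success"],
        "evaluated": result["statistics"]["evaluated_expectations"],
        "passed": result["statistics"]["successful_expectations"],
        "failed": result["statistics"]["unsuccessful_expectations"],
    }
    # Fail the pipeline step instead of promoting bad data.
    if not result["success"]:
        raise DataQualityException(f"Silver orders failed validation: {stats['failed']} checks failed")
    return stats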
```

## Key Features

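- **Expectations**: declarative data quality assertions (for example `expect_column_values_to_be_between`, `expect_column_mean_to_be_between`)
- **Profiling**: generates expectation rules automatically from existing data
- **Data Docs**: automated data quality reports in HTML format
- **Checkpoint**: embeds validation runs in CI/CD pipelines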
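
To make the declarative style concrete, the sketch below reimplements the idea behind a range expectation in plain Python. It is not the Great Expectations implementation: `check_values_between` and its result keys are hypothetical names, chosen only to resemble the shape of a validation result, and the null handling and `mostly` threshold are assumptions of this sketch.

```python
from typing import Any, Dict, Iterable, Optional


def check_values_between(
    values: Iterable[Optional[float]],
    min_value: float,
    max_value: float,
    mostly: float = 1.0,
) -> Dict[str, Any]:
    """Hypothetical stand-in for a range expectation (not the GX API).

    Nulls are skipped; the check passes when at least `mostly` of the
    non-null values fall inside [min_value, max_value].
    """
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if not (min_value <= v <= max_value)]
    # An empty column has nothing that violates the rule, so it passes.
    passed_fraction = 1.0 if not non_null else 1 - len(unexpected) / len(non_null)
    return {
        "success": passed_fraction >= mostly,
        "element_count": len(non_null),
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


amounts = [12.5, 40.0, None, -3.0, 99.9]
strict = check_values_between(amounts, min_value=0, max_value=100)
lenient = check_values_between(amounts, min_value=0, max_value=100, mostly=0.7)
```

With the sample data, one of four non-null values is out of range, so the strict check fails while the check that tolerates up to 30% violations passes.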

## Role in Medallion Architecture

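- **Bronze layer**: basic integrity checks (whether files are readable, whether the schema exists)
- **Silver layer**: field-level quality validation (null rate, value ranges, distributions)
- **Gold layer**: business rule validation plus SLA scoring; the score must be attached to every row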
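
The Gold-layer requirement that a score travels with every row can be sketched without any library. The rule set, the `quality_score` and `failed_rules` column names, and the equal weighting of rules are all assumptions for illustration; the source does not specify how the score is computed.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Callable[[Row], bool]

# Hypothetical business rules for an orders table.
RULES: Dict[str, Rule] = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
    "currency_known": lambda r: r.get("currency") in {"USD", "EUR", "CNY"},
}


def score_rows(rows: List[Row], rules: Dict[str, Rule] = RULES) -> List[Row]:
    """Return copies of the rows with a quality score and the failed rule names."""
    scored = []
    for row in rows:
        failed = [name for name, rule in rules.items() if not rule(row)]
        scored.append({
            **row,
            # Fraction of rules the row passed, each rule weighted equally.
            "quality_score": round(1 - len(failed) / len(rules), 3),
            "failed_rules": failed,
        })
    return scored


silver = [
    {"order_id": 1, "amount": 25.0, "currency": "USD"},
    {"order_id": None, "amount": -5.0, "currency": "USD"},
]
gold = score_rows(silver)
```

Keeping the failed rule names next to the score lets downstream consumers filter on a threshold or trace why a particular row scored low.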

## Integration

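- **Spark Integration**: PySpark DataFrames are validated through a Spark datasource (in GX 0.16 to 0.18, for example, `context.sources.add_spark(...)` with a DataFrame asset)
- **Databricks**: Great Expectations runs as a regular Python library inside Databricks notebooks and Jobs
- **dbt Tests**: equivalent checks can be expressed as dbt tests, for example with the community `dbt-expectations` package, which ports many Great Expectations rules to dbt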

## Related Concepts
- [[Data Contract]]
- [[Medallion Architecture]]