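---
title: "Great Expectations"
type: entity
tags: [data-engineering, data-quality, testing, validation]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

## Overview

Great Expectations (gx) is a Python-native data quality validation framework for automated testing of data pipeline reliability. The Data Engineer Agent uses Great Expectations to apply row-level data quality scoring between the Silver and Gold layers, so that Gold-layer data meets its SLA commitments.

## Core Usage Pattern

```python
from datetime import datetime

import great_expectations as gx


class DataQualityException(Exception):
    """Raised when a critical expectation suite fails."""


context = gx.get_context()


def validate_silver_orders(df) -> dict:
    # Wrap the in-memory DataFrame with the default pandas datasource.
    # NOTE: these entry points are version-dependent; confirm them against the installed GX release.
    batch = context.sources.pandas_default.read_dataframe(df)
    result = batch.validate(
        expectation_suite_name="silver_orders.critical",
        run_id={"run_name": "silver_orders_daily", "run_time": datetime.now()},
    )
    # Summarize the outcome for logging and downstream scoring.
    stats = {
        "success": result["success"],
        "evaluated": result["statistics"]["evaluated_expectations"],
        "passed": result["statistics"]["successful_expectations"],
        "failed": result["statistics"]["unsuccessful_expectations"],
    }
    # Fail fast so that bad Silver data never reaches the Gold layer.
    if not result["success"]:
        raise DataQualityException(
            f"Silver orders failed validation: {stats['failed']} checks failed"
        )
    return stats
```

## Key Features

- **Expectations**: declarative data quality assertions (for example `expect_column_values_to_be_between`, `expect_column_mean_to_be_between`)
- **Profiling**: generates expectation rules automatically from existing data
- **Data Docs**: automated data quality reports in HTML format
- **Checkpoint**: embeds validation runs in CI/CD pipelines

## Role in Medallion Architecture

- **Bronze layer**: basic integrity checks (is the file readable, does the schema exist)
- **Silver layer**: field-level quality validation (null rate, value ranges, distributions)
- **Gold layer**: business rule validation plus SLA scoring; the score must be attached to every row

## Integration

- **Spark Integration**: validate PySpark DataFrames directly through `gx.spark.from_pyspark_df()` (the entry point differs between GX releases, so check the name against the installed version)
- **Databricks**: native Great Expectations integration with Databricks Jobs
- **dbt Tests**: Great Expectations rules can be exported as dbt tests

## Related Concepts

- [[Data Contract]]
- [[Medallion Architecture]]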
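
## Example: Row-Level Scoring Sketch

The Gold-layer requirement that a quality score be attached to every row can be illustrated without any framework. The block below is a minimal sketch in plain Python, not the Great Expectations API: the rule names, the `dq_score` and `dq_failed_rules` fields, the column names, and the `promote_to_gold` helper are all hypothetical, chosen only to show the shape of the idea. In a real pipeline the per-row pass/fail signals would presumably come from the expectation suite instead of hand-written lambdas.

```python
from typing import Callable

# Hypothetical rule set: each rule maps a row to True (pass) or False (fail).
Rule = Callable[[dict], bool]

RULES: dict[str, Rule] = {
    "order_id_not_null": lambda row: row.get("order_id") is not None,
    "amount_in_range": lambda row: row.get("amount") is not None
    and 0 <= row["amount"] <= 10_000,
    "status_is_known": lambda row: row.get("status") in {"placed", "shipped", "cancelled"},
}


def score_rows(rows: list[dict], rules: dict[str, Rule] = RULES) -> list[dict]:
    """Return copies of the rows with a score and the failed rule names attached."""
    if not rules:
        raise ValueError("at least one rule is required to compute a score")
    scored = []
    for row in rows:
        failed = [name for name, rule in rules.items() if not rule(row)]
        # Score is the fraction of rules the row passed, rounded to two decimals.
        score = (len(rules) - len(failed)) / len(rules)
        scored.append({**row, "dq_score": round(score, 2), "dq_failed_rules": failed})
    return scored


def promote_to_gold(rows: list[dict], min_score: float = 1.0):
    """Split scored rows into those that meet the threshold and those that do not."""
    scored = score_rows(rows)
    gold = [r for r in scored if r["dq_score"] >= min_score]
    rejected = [r for r in scored if r["dq_score"] < min_score]
    return gold, rejected


if __name__ == "__main__":
    silver = [
        {"order_id": 1, "amount": 50.0, "status": "shipped"},
        {"order_id": None, "amount": -5, "status": "shipped"},
    ]
    gold, rejected = promote_to_gold(silver)
    print(len(gold), "rows promoted;", len(rejected), "rows rejected")
```

Keeping the failed rule names next to the score makes a rejected row self-explanatory, which is one plausible way to trace an SLA breach back to a specific check. The threshold of `1.0` is an assumption; a pipeline could just as well promote rows above a lower score and carry the score forward.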