Update nexus wiki content
wiki/entities/Great-Expectations.md
@@ -0,0 +1,58 @@
---
title: "Great Expectations"
type: entity
tags: [data-engineering, data-quality, testing, validation]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

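## Overview

Great Expectations (gx) is a Python-native data quality validation framework that supports automated testing of data pipeline reliability. The Data Engineer Agent uses Great Expectations between the Silver and Gold layers to apply row-level data quality scoring, ensuring that Gold-layer data meets its SLA commitments.
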
## Core Usage Pattern

```python
from datetime import datetime

import great_expectations as gx

# Assumes the GX 0.17/0.18 fluent API; these entry points changed in GX 1.x.
context = gx.get_context()


class DataQualityException(Exception):
    """Raised when a critical expectation suite fails."""


def validate_silver_orders(df) -> dict:
    # read_dataframe returns a Validator bound to the in-memory DataFrame.
    validator = context.sources.pandas_default.read_dataframe(df)
    # Validator.validate expects a suite object, so load the suite by name first.
    suite = context.get_expectation_suite("silver_orders.critical")
    result = validator.validate(
        expectation_suite=suite,
        run_id={"run_name": "silver_orders_daily", "run_time": datetime.now()},
    )
    stats = {
        "success": result["success"],
        "evaluated": result["statistics"]["evaluated_expectations"],
        "passed": result["statistics"]["successful_expectations"],
        "failed": result["statistics"]["unsuccessful_expectations"],
    }
    # Fail fast so that invalid Silver data never reaches the Gold layer.
    if not result["success"]:
        raise DataQualityException(f"Silver orders failed validation: {stats['failed']} checks failed")
    return stats
```

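The summarizing and fail-fast logic in `validate_silver_orders` can be exercised without a GX installation. The sketch below is only an illustration under stated assumptions: `summarize_validation` and the hand-built `fake_result` dictionary are hypothetical stand-ins that mirror the `success` and `statistics` keys read above, not a real GX result object.

```python
class DataQualityException(Exception):
    """Raised when a critical validation run fails."""


def summarize_validation(result) -> dict:
    """Extract the counters from a result-like mapping and fail fast on errors."""
    statistics = result["statistics"]
    stats = {
        "success": result["success"],
        "evaluated": statistics["evaluated_expectations"],
        "passed": statistics["successful_expectations"],
        "failed": statistics["unsuccessful_expectations"],
    }
    if not stats["success"]:
        raise DataQualityException(
            f"Silver orders failed validation: {stats['failed']} checks failed"
        )
    return stats


# Hypothetical result with the same shape as the keys read above.
fake_result = {
    "success": False,
    "statistics": {
        "evaluated_expectations": 5,
        "successful_expectations": 3,
        "unsuccessful_expectations": 2,
    },
}

try:
    summarize_validation(fake_result)
    promoted = True
except DataQualityException as exc:
    promoted = False  # block the Silver-to-Gold promotion
    failure_message = str(exc)
```

Raising an exception rather than returning a flag means a caller cannot silently ignore a failed run; the orchestrator has to handle the failure explicitly before promoting data.
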
## Key Features

- **Expectations**: declarative data quality assertions (for example `expect_column_values_to_be_between`, `expect_column_mean_to_be_between`)
- **Profiling**: automatically generates expectation rules from existing data
- **Data Docs**: automated data quality reports (HTML format)
- **Checkpoint**: embeds validation tests in CI/CD pipelines

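To make the idea of a declarative expectation concrete, the following pure-Python sketch mimics what the two expectations named in the list check. It is not the GX implementation: the function names, the list-of-dicts `rows` input, and the simplified result dictionaries are assumptions chosen for illustration.

```python
from statistics import mean


def values_between(rows, column, min_value, max_value) -> dict:
    """Row-wise check in the spirit of expect_column_values_to_be_between."""
    unexpected = [
        row[column] for row in rows
        if not (min_value <= row[column] <= max_value)
    ]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def mean_between(rows, column, min_value, max_value) -> dict:
    """Aggregate check in the spirit of expect_column_mean_to_be_between."""
    observed = mean(row[column] for row in rows)
    return {"success": min_value <= observed <= max_value, "observed_value": observed}


# Hypothetical sample data: one row violates the row-wise range.
rows = [{"amount": 10.0}, {"amount": 25.0}, {"amount": -5.0}]
row_check = values_between(rows, "amount", 0, 100)
mean_check = mean_between(rows, "amount", 5, 50)
```

Here the row-wise check fails because one amount is negative, while the aggregate check on the mean still passes, which is why the two kinds of expectation are usually combined.
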
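## Role in Medallion Architecture

- **Bronze layer**: basic integrity checks (whether the file is readable, whether the schema exists)
- **Silver layer**: field-level quality validation (null rates, value ranges, distributions)
- **Gold layer**: business rule validation plus SLA scoring; a score must be attached to every row
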
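The requirement that a score be attached to every Gold row could look roughly like the sketch below. The rule names, the equal weighting of rules, and the `dq_score` and `dq_failed_rules` field names are all hypothetical; the source does not specify how the SLA score is computed.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical business rules: each maps a row to True (pass) or False (fail).
RULES: Dict[str, Callable[[Row], bool]] = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
    "currency_known": lambda r: r.get("currency") in {"USD", "EUR"},
}


def score_rows(rows: List[Row]) -> List[Row]:
    """Return copies of the rows with a dq_score in [0, 1] and the failed rule names."""
    scored = []
    for row in rows:
        failed = [name for name, rule in RULES.items() if not rule(row)]
        score = 1 - len(failed) / len(RULES)  # fraction of rules that passed
        scored.append({**row, "dq_score": round(score, 2), "dq_failed_rules": failed})
    return scored


gold_rows = score_rows([
    {"order_id": 1, "amount": 20.0, "currency": "USD"},
    {"order_id": None, "amount": -3.0, "currency": "USD"},
])
```

Keeping the failed rule names next to the score lets downstream consumers filter on a threshold and still see why a given row scored low.
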
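## Integration

- **Spark Integration**: validates PySpark DataFrames directly. The source names `gx.spark.from_pyspark_df()` for this, which does not match the documented GX API; in GX 0.17/0.18 the documented route is a Spark data source with a DataFrame asset (`context.sources.add_or_update_spark(...)` followed by `add_dataframe_asset(...)`).
- **Databricks**: native Great Expectations integration with Databricks Jobs
- **dbt Tests**: Great Expectations rules can be exported as dbt tests
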
## Related Concepts
- [[Data Contract]]
- [[Medallion Architecture]]