nexus/wiki/sources/engineering-data-engineer.md
2026-05-03 05:42:12 +08:00

---
title: "Data Engineer Agent Personality"
type: source
tags: []
date: 2026-05-02
---
## Source File
- [[../../../../../Workspace/nexus/raw/Agent/agency-agents/engineering/engineering-data-engineer.md]]
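## Summary
- Core topic: the personality definition of the Data Engineer Agent, a specialist agent that builds reliable, observable, self-healing data pipelines and Lakehouse architectures
- Problem domain: how to turn raw, messy data from many sources into reliable, high-quality, analysis-ready data assets, delivered on time, at scale, and with end-to-end observability
- Methods and mechanisms: Medallion Architecture (Bronze → Silver → Gold), PySpark + Delta Lake ETL/ELT, dbt data quality contracts, Great Expectations quality validation, Kafka stream processing, and CDC incremental ingestion
- Conclusion and value: the Data Engineer Agent's core value is delivering data reliability as a product. The Medallion layers keep Bronze raw and immutable, Silver cleaned and deduplicated, and Gold business-ready, while SLA monitoring, lineage tracking, and a data catalog provide full-stack observability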
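
The Bronze → Silver → Gold flow summarized above can be sketched in plain Python. This is only an illustration under stated assumptions: the source works with PySpark and Delta Lake, whereas the lists, the `to_silver` and `to_gold` helpers, and the `order_id`, `customer`, and `amount` fields here are hypothetical stand-ins.

```python
from collections import defaultdict


def to_silver(bronze_rows):
    """Clean and deduplicate Bronze rows without mutating the Bronze input."""
    seen_ids = set()
    silver = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        amount = row.get("amount")
        if order_id is None or amount is None:
            # Incomplete record: skipped here; a real pipeline would quarantine it.
            continue
        if order_id in seen_ids:
            continue  # duplicate delivery of the same order
        seen_ids.add(order_id)
        silver.append({
            "order_id": order_id,
            "customer": (row.get("customer") or "").strip().lower(),
            "amount": float(amount),
        })
    return silver


def to_gold(silver_rows):
    """Aggregate Silver rows into a business-ready total per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Bronze is kept exactly as received (hypothetical sample data).
bronze = [
    {"order_id": 1, "customer": " Alice ", "amount": "10.5"},
    {"order_id": 1, "customer": " Alice ", "amount": "10.5"},  # duplicate
    {"order_id": 2, "customer": "bob", "amount": "4"},
    {"order_id": 3, "customer": "bob", "amount": None},  # incomplete
]
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each layer is derived from the one before it, so Silver and Gold can be rebuilt at any time from the untouched Bronze records.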
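## Key Claims
- The Data Engineer Agent uses the layered Medallion Architecture (Bronze → Silver → Gold) to raise data quality progressively from raw to business-ready
- The Data Engineer Agent requires every pipeline to be idempotent: rerunning it produces the same result and never creates duplicate data
- The Data Engineer Agent uses CDC (Change Data Capture) and incremental pipeline design to cut cost by more than 90% relative to full refreshes
- The Data Engineer Agent uses Great Expectations for row-level data quality scoring, so that Gold-layer data meets its SLA guarantees
- The Data Engineer Agent uses Apache Kafka for exactly-once semantics and late-arriving data handling, balancing the cost-latency trade-off between streaming and micro-batch processing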
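
The idempotency and CDC claims above can be illustrated with a small merge routine. This is a minimal sketch, not the agent's implementation: the event shape (`seq`, `op`, `id`, `data`), the `apply_changes` helper, and the use of tombstones for deletes are all assumptions made for the example.

```python
def apply_changes(target, changes):
    """Merge CDC events into a table keyed by primary key.

    Returns a new dict; neither `target` nor `changes` is modified.
    An event is applied only if its sequence number is newer than the
    one already stored for that key, so replaying a batch is a no-op.
    """
    merged = {key: dict(value) for key, value in target.items()}
    for event in sorted(changes, key=lambda e: e["seq"]):
        key = event["id"]
        current = merged.get(key)
        if current is not None and current["seq"] >= event["seq"]:
            continue  # already applied, or older than what we hold
        if event["op"] == "delete":
            # Keep a tombstone so a replayed older upsert cannot resurrect the row.
            merged[key] = {"seq": event["seq"], "deleted": True, "data": None}
        else:
            merged[key] = {"seq": event["seq"], "deleted": False, "data": event["data"]}
    return merged


# Hypothetical change feed from a source system.
changes = [
    {"seq": 1, "op": "upsert", "id": "a", "data": {"name": "Ann"}},
    {"seq": 2, "op": "upsert", "id": "b", "data": {"name": "Bo"}},
    {"seq": 3, "op": "upsert", "id": "a", "data": {"name": "Anna"}},
    {"seq": 4, "op": "delete", "id": "b", "data": None},
]

once = apply_changes({}, changes)
twice = apply_changes(once, changes)  # rerun of the same batch
print(once == twice)
```

Because only the changed keys are touched, the work scales with the size of the change feed rather than with the size of the table, which is the intuition behind the incremental-versus-full-refresh saving the source describes.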
## Key Quotes
> "Bronze = raw, immutable, append-only; never transform in place" (core principle of the Bronze layer in the Medallion Architecture)
> "All pipelines must be idempotent — rerunning produces the same result, never duplicates" (first rule of pipeline reliability)
> "Null handling must be deliberate — no implicit null propagation into gold/semantic layers" (null-handling rule for the Silver → Gold transition)
> "Data in gold/semantic layers must have row-level data quality scores attached" (mandatory data quality requirement for the Gold layer)
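
The last two quotes can be made concrete with a small example. The sketch below is written in plain Python rather than Great Expectations, and the check names, the `UNKNOWN` default, and the `dq_score` and `dq_failed_checks` columns are hypothetical choices for illustration.

```python
def score_row(row):
    """Return (score, failed_checks) for one row; score is the share of checks passed."""
    amount = row.get("amount")
    checks = {
        "order_id_present": row.get("order_id") is not None,
        "amount_present": amount is not None,
        "amount_non_negative": amount is not None and amount >= 0,
        "currency_present": row.get("currency") is not None,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return sum(checks.values()) / len(checks), failed


def prepare_for_gold(rows):
    """Attach a quality score to every row and handle nulls explicitly."""
    accepted, quarantined = [], []
    for row in rows:
        score, failed = score_row(row)
        out = dict(row, dq_score=score, dq_failed_checks=failed)
        if out.get("order_id") is None or out.get("amount") is None:
            # Required fields are never defaulted: the row is held back instead.
            quarantined.append(out)
            continue
        if out.get("currency") is None:
            # Optional field: replaced by a documented sentinel, not left as null.
            out["currency"] = "UNKNOWN"
        accepted.append(out)
    return accepted, quarantined


rows = [
    {"order_id": 1, "amount": 20.0, "currency": "USD"},
    {"order_id": 2, "amount": 5.0, "currency": None},
    {"order_id": 3, "amount": None, "currency": "EUR"},
]
accepted, quarantined = prepare_for_gold(rows)
for row in accepted:
    print(row["order_id"], row["currency"], row["dq_score"])
```

The score is computed before any default is filled in, so a consumer of the Gold rows can still see that a value was originally missing.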
## Key Concepts
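- [[Medallion Architecture]]: a three-layer lakehouse architecture of Bronze (raw, read-only) → Silver (cleaned, deduplicated) → Gold (business aggregates), with explicit transformation rules and SLAs for each layer
- [[CDC (Change Data Capture)]]: builds incremental pipelines from captured changes, which can save 90%+ of compute cost compared with full refreshes
- [[Data Contract]]: an explicit schema contract between data producers and consumers; schema drift must trigger an alert rather than silently corrupt data
- [[Data Lineage]]: lineage tracking, so that every row can be traced back to its source system
- [[SCD Type 2]]: Slowly Changing Dimension Type 2, which tracks historical changes to dimension records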
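
The Data Contract idea above, where schema drift raises an alert instead of passing through silently, might look like the following in plain Python. The `CONTRACT` mapping, the `SchemaDriftError` class, and both helper functions are assumptions for this sketch; the source defines its contracts with dbt.

```python
# Hypothetical contract agreed between a producer and its consumers.
CONTRACT = {"order_id": int, "amount": float, "customer": str}


class SchemaDriftError(Exception):
    """Raised when a row no longer matches the agreed contract."""


def detect_drift(contract, row):
    """Return a list of human-readable contract violations for one row."""
    problems = []
    for name, expected in contract.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif row[name] is not None and not isinstance(row[name], expected):
            problems.append(
                f"type drift in {name}: expected {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    for name in row:
        if name not in contract:
            problems.append(f"unexpected column: {name}")
    return problems


def enforce_contract(contract, rows):
    """Fail loudly on the first drifting row; return the rows unchanged otherwise."""
    for index, row in enumerate(rows):
        problems = detect_drift(contract, row)
        if problems:
            raise SchemaDriftError(f"row {index}: " + "; ".join(problems))
    return rows


good = {"order_id": 1, "amount": 2.0, "customer": "alice"}
drifted = {"order_id": "1", "amount": 2.0, "region": "eu"}
print(detect_drift(CONTRACT, good))
print(detect_drift(CONTRACT, drifted))
```

In a real pipeline the exception would typically be routed to an alerting channel so that the producer and the consumers learn about the drift at the same time.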
## Key Entities
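- [[Apache Spark]]: large-scale parallel processing engine and the Data Engineer Agent's core compute platform
- [[Delta Lake]]: open table format that provides ACID transactions, time travel, and Z-Ordering
- [[dbt]]: data transformation and quality management tool, which the Data Engineer Agent uses to define data quality contracts
- [[Great Expectations]]: data quality validation framework, which the Data Engineer Agent uses for row-level data quality scoring
- [[Apache Kafka]]: event streaming platform, which the Data Engineer Agent uses to build real-time pipelines with exactly-once semantics
- [[Databricks]]: Lakehouse platform (Unity Catalog, DLT) and one of the Data Engineer Agent's main hosting environments
- [[Snowflake]]: cloud data warehouse and the Data Engineer Agent's other main data platform
- [[Apache Iceberg]]: open table format specification, which the Data Engineer Agent uses for cross-engine interoperability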
## Connections
- [[Apache Spark]] ← builds_with ← [[Delta Lake]]
- [[dbt]] ← validates ← [[Apache Spark]]
- [[Apache Kafka]] ← streams_to ← [[Delta Lake]]
- [[Great Expectations]] ← enforces ← [[Data Contract]]
- [[Databricks]] ← hosts ← [[Apache Spark]], [[Delta Lake]]
- [[Medallion Architecture]] ← implements ← [[Data Lineage]]
- [[CDC (Change Data Capture)]] ← enables ← [[Medallion Architecture]]
## Contradictions
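- No known conflicts. The Data Engineer Agent and the SRE Agent ([[engineering-sre]]) are highly complementary in SLA monitoring and alert response for data pipelines: the Data Engineer owns observability inside the pipeline, while the SRE owns overall service reliability.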