Update nexus wiki content

2026-05-03 05:42:06 +08:00
parent 90f3811b83
commit 111bc65b7b
707 changed files with 32306 additions and 7289 deletions

@@ -0,0 +1,57 @@
---
title: "Data Engineer Agent Personality"
type: source
tags: []
date: 2026-05-02
---
## Source File
- [[../../../../../Workspace/nexus/raw/Agent/agency-agents/engineering/engineering-data-engineer.md]]
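## Summary
- Core topic: the Data Engineer Agent personality definition, describing a specialist agent that builds reliable, observable, self-healing data pipelines and Lakehouse architectures.
- Problem domain: how to turn raw, messy data from many sources into reliable, high-quality, analysis-ready data assets, delivered on time, at scale, and with end-to-end observability.
- Methods/mechanisms: Medallion Architecture (Bronze→Silver→Gold), PySpark + Delta Lake ETL/ELT, dbt data quality contracts, Great Expectations quality validation, Kafka stream processing, and CDC incremental ingestion.
- Conclusion/value: the Data Engineer Agent's core value is delivering data reliability as a product. The Medallion layering keeps Bronze raw and immutable, Silver cleaned and deduplicated, and Gold business-ready, while SLA monitoring, lineage tracking, and a data catalog provide full-stack observability.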
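
The Bronze→Silver→Gold layering summarized above can be sketched in plain Python. This is a minimal illustration only: the source names PySpark and Delta Lake as the actual tooling, and every function name, field name (`order_id`, `customer`, `amount`, `updated_at`), and sample row below is a hypothetical stand-in.

```python
from collections import defaultdict


def append_to_bronze(bronze, raw_rows):
    """Return Bronze with the new rows appended; existing rows are never edited."""
    return bronze + [dict(row) for row in raw_rows]


def build_silver(bronze_rows):
    """Clean and deduplicate: skip rows without an id, keep the latest row per id."""
    latest = {}
    for row in bronze_rows:
        if row.get("order_id") is None:
            continue  # the bad row is skipped here but remains in Bronze
        current = latest.get(row["order_id"])
        if current is None or row["updated_at"] >= current["updated_at"]:
            latest[row["order_id"]] = row
    return sorted(latest.values(), key=lambda r: r["order_id"])


def build_gold(silver_rows):
    """Aggregate Silver rows into a business-ready total per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


# Hypothetical sample data: one duplicate order and one row with a missing id.
raw = [
    {"order_id": 1, "customer": "acme", "amount": 10.0, "updated_at": 1},
    {"order_id": 1, "customer": "acme", "amount": 12.0, "updated_at": 2},
    {"order_id": None, "customer": "acme", "amount": 99.0, "updated_at": 3},
    {"order_id": 2, "customer": "globex", "amount": 5.0, "updated_at": 1},
]
bronze = append_to_bronze([], raw)
silver = build_silver(bronze)
gold = build_gold(silver)
```

Because Silver and Gold are derived purely from Bronze, either layer can be rebuilt from scratch at any time, which is the practical payoff of keeping Bronze untouched.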
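## Key Claims
- The Data Engineer Agent uses the layered Medallion Architecture (Bronze→Silver→Gold) to raise data quality progressively from raw to business-ready.
- The Data Engineer Agent requires every pipeline to be idempotent: a rerun produces the same result and never creates duplicate data.
- The Data Engineer Agent uses CDC (Change Data Capture) and incremental pipeline design to cut cost by more than 90% relative to full refreshes.
- The Data Engineer Agent uses Great Expectations for row-level data quality scoring, so that Gold-layer data meets its SLA guarantees.
- The Data Engineer Agent uses Apache Kafka for exactly-once semantics and late-arriving data handling, balancing the cost-latency trade-off between streaming and micro-batch processing.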
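
The idempotency and CDC claims above can be made concrete with a small sketch. It assumes each change event carries a monotonically increasing log sequence number (`lsn`); that field, the event shape, and the function name `apply_changes` are all hypothetical and not taken from the source. It says nothing about the quoted cost figure, only about why replaying the same changes is safe.

```python
def apply_changes(target, changes):
    """Apply CDC events to a copy of `target`, keyed by primary key.

    An event is skipped when the target row has already seen that lsn or a
    later one, so replaying a batch leaves the result unchanged.
    """
    result = {key: dict(row) for key, row in target.items()}
    for event in sorted(changes, key=lambda e: e["lsn"]):
        key = event["id"]
        existing = result.get(key)
        if existing is not None and existing["lsn"] >= event["lsn"]:
            continue  # replayed event: already applied
        if event["op"] == "delete":
            # Keep a tombstone so a replayed older insert cannot resurrect the row.
            result[key] = {"id": key, "lsn": event["lsn"], "deleted": True}
        else:
            # "insert" and "update" are both treated as upserts.
            result[key] = {"id": key, "lsn": event["lsn"], "deleted": False,
                           **event["data"]}
    return result


# Hypothetical change feed.
changes = [
    {"op": "insert", "id": 1, "lsn": 1, "data": {"status": "new"}},
    {"op": "update", "id": 1, "lsn": 2, "data": {"status": "paid"}},
    {"op": "insert", "id": 2, "lsn": 3, "data": {"status": "new"}},
    {"op": "delete", "id": 2, "lsn": 4, "data": {}},
]
once = apply_changes({}, changes)
twice = apply_changes(once, changes)  # rerun with the same batch
```

Only the changed keys are touched on each run, which is where an incremental pipeline saves work compared with rewriting the whole table.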
## Key Quotes
> "Bronze = raw, immutable, append-only; never transform in place" (core principle of the Bronze layer in the Medallion Architecture)
> "All pipelines must be idempotent — rerunning produces the same result, never duplicates" (the first rule of pipeline reliability)
> "Null handling must be deliberate — no implicit null propagation into gold/semantic layers" (null-handling rule for the Silver→Gold transition)
> "Data in gold/semantic layers must have row-level data quality scores attached" (mandatory data quality requirement for the Gold layer)
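
One way to read the last two rules together is sketched below: nulls in required fields reject the row, nulls in optional fields get an explicit default, and every promoted row carries a score. This is plain Python rather than Great Expectations, and the field lists, the default value, the scoring formula, and the `dq_score` column name are all assumptions made for illustration.

```python
REQUIRED = ("order_id", "amount")   # hypothetical: a null here rejects the row
DEFAULTS = {"currency": "USD"}      # hypothetical: a null here gets a default


def score_row(row):
    """Return (score, checks), where score is the fraction of checks passed."""
    checks = {
        "has_required": all(row.get(f) is not None for f in REQUIRED),
        "amount_non_negative": row.get("amount") is not None and row["amount"] >= 0,
        "currency_present": row.get("currency") is not None,
    }
    return sum(checks.values()) / len(checks), checks


def promote_to_gold(silver_rows):
    """Split rows into (gold, rejected); no null reaches Gold implicitly."""
    gold, rejected = [], []
    for row in silver_rows:
        score, checks = score_row(row)
        if not checks["has_required"]:
            rejected.append({**row, "dq_score": score})
            continue
        filled = dict(row)
        for field, default in DEFAULTS.items():
            if filled.get(field) is None:
                filled[field] = default  # deliberate, documented default
        # The score is computed before defaults are filled, so defaulted rows
        # stay distinguishable from fully populated ones.
        gold.append({**filled, "dq_score": score})
    return gold, rejected


# Hypothetical sample rows.
gold, rejected = promote_to_gold([
    {"order_id": 1, "amount": 10.0, "currency": "EUR"},
    {"order_id": 2, "amount": 5.0, "currency": None},
    {"order_id": 3, "amount": None, "currency": "EUR"},
])
```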
## Key Concepts
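- [[Medallion Architecture]]: a three-layer lakehouse architecture, Bronze (raw, read-only) → Silver (cleaned, deduplicated) → Gold (business aggregates); each layer has explicit transformation rules and SLAs.
- [[CDC (Change Data Capture)]]: incremental pipelines driven by change data capture, which can save 90%+ of compute cost compared with full refreshes.
- [[Data Contract]]: an explicit schema contract between data producers and consumers; schema drift must trigger an alert rather than corrupt data silently.
- [[Data Lineage]]: lineage tracking, so that every row can be traced back to its source system.
- [[SCD Type 2]]: Slowly Changing Dimension Type 2, which tracks historical changes to dimension records.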
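
The data contract rule in this list (drift must raise an alert, not pass silently) can be illustrated with a small check. This is a sketch under assumptions: the contract contents, the `SchemaDriftError` exception, and both function names are hypothetical, and a real setup would more likely express the contract in dbt or a schema registry.

```python
# Hypothetical contract: field name -> expected Python type.
CONTRACT = {"order_id": int, "amount": float, "currency": str}


class SchemaDriftError(Exception):
    """Raised when a row does not match the agreed contract."""


def check_contract(row, contract=CONTRACT):
    """Return a list of human-readable contract violations for one row."""
    problems = []
    for field, expected in contract.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif row[field] is not None and not isinstance(row[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(row[field]).__name__}"
            )
    for field in row:
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems


def enforce(rows, contract=CONTRACT):
    """Fail loudly on the first drifting row instead of loading bad data."""
    for index, row in enumerate(rows):
        problems = check_contract(row, contract)
        if problems:
            raise SchemaDriftError(f"row {index}: " + "; ".join(problems))
    return rows
```

For example, `check_contract({"order_id": "7", "amount": 1.0, "channel": "web"})` reports a type mismatch on `order_id`, a missing `currency`, and an unexpected `channel`, and `enforce` would raise on that row rather than let it through.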
## Key Entities
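- [[Apache Spark]]: large-scale parallel processing engine; the Data Engineer Agent's core compute platform.
- [[Delta Lake]]: open table format providing ACID transactions, time travel, Z-Ordering, and related capabilities.
- [[dbt]]: data transformation and quality management tool; the Data Engineer Agent uses it to define data quality contracts.
- [[Great Expectations]]: data quality validation framework; the Data Engineer Agent uses it for row-level data quality scoring.
- [[Apache Kafka]]: event streaming platform; the Data Engineer Agent uses it to build real-time pipelines with exactly-once semantics.
- [[Databricks]]: Lakehouse platform (Unity Catalog, DLT); one of the Data Engineer Agent's main hosting environments.
- [[Snowflake]]: cloud data warehouse; the Data Engineer Agent's other main data platform.
- [[Apache Iceberg]]: open table format specification; the Data Engineer Agent uses it for cross-engine interoperability.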
## Connections
- [[Apache Spark]] ← builds_with ← [[Delta Lake]]
- [[dbt]] ← validates ← [[Apache Spark]]
- [[Apache Kafka]] ← streams_to ← [[Delta Lake]]
- [[Great Expectations]] ← enforces ← [[Data Contract]]
- [[Databricks]] ← hosts ← [[Apache Spark]], [[Delta Lake]]
- [[Medallion Architecture]] ← implements ← [[Data Lineage]]
- [[CDC (Change Data Capture)]] ← enables ← [[Medallion Architecture]]
## Contradictions
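- No known conflicts. The Data Engineer Agent and the SRE Agent ([[engineering-sre]]) are highly complementary in SLA monitoring and alert response for data pipelines: the Data Engineer Agent owns observability inside the pipeline, while the SRE Agent owns overall service reliability.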