---
title: "Databricks"
type: entity
tags: [data-engineering, lakehouse, analytics-platform, cloud]
sources: [engineering-data-engineer]
last_updated: 2026-05-02
---

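## Overview

Databricks is a unified analytics and AI platform built on Apache Spark. It provides a lakehouse architecture, notebooks, MLflow, Delta Live Tables (DLT), and Unity Catalog, among other capabilities. The Data Engineer Agent uses Databricks as its primary managed execution environment.
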
## Key Products for Data Engineering

### Unity Catalog
- Unified governance: a single data catalog and permission model across clouds (AWS/Azure/GCP)
- Fine-grained row-level security and column masking

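To make the two fine-grained controls concrete, here is a minimal plain-Python sketch of their semantics. It does not call Unity Catalog: the `User` type, the `region` and `email` columns, and the `admins` group are all hypothetical names chosen for illustration. In Unity Catalog itself, such rules are attached to tables as governed functions rather than written in application code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    # Hypothetical principal: a name, its group memberships, and a home region.
    name: str
    groups: frozenset = field(default_factory=frozenset)
    region: str = ""


def row_filter(user: User, row: dict) -> bool:
    """Row-level security: admins see every row, others only their region."""
    return "admins" in user.groups or row["region"] == user.region


def mask_email(user: User, value: str) -> str:
    """Column masking: hide the local part of the address from non-admins."""
    if "admins" in user.groups:
        return value
    _, _, domain = value.partition("@")
    return "***@" + domain


def read_table(user: User, rows: list) -> list:
    """Apply the row filter first, then mask the sensitive column."""
    visible = [r for r in rows if row_filter(user, r)]
    return [{**r, "email": mask_email(user, r["email"])} for r in visible]


ROWS = [
    {"id": 1, "region": "EU", "email": "ana@example.com"},
    {"id": 2, "region": "US", "email": "bob@example.com"},
]

analyst = User("eve", frozenset({"analysts"}), region="EU")
admin = User("root", frozenset({"admins"}))

print(read_table(analyst, ROWS))  # one EU row, email masked
print(read_table(admin, ROWS))    # both rows, emails intact
```

The point of the sketch is that both controls are evaluated per caller at read time, so the same table yields different results for different principals without maintaining separate copies of the data.
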
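### Delta Live Tables (DLT)
- Declarative streaming and batch pipelines
- Automatically manages infrastructure, checkpoints, and data quality
- Built-in expectations let you declare data quality rules that are validated automatically
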
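In DLT's Python interface, expectations are declared with decorators such as `@dlt.expect`, `@dlt.expect_or_drop`, and `@dlt.expect_or_fail`. Because the `dlt` module is only importable inside a running pipeline, the sketch below imitates those three behaviors in plain Python instead. The `Action` enum, the `apply_expectation` helper, and the sample `orders` rows are hypothetical and are not part of any Databricks API.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    WARN = "warn"   # keep invalid rows, only count them
    DROP = "drop"   # remove invalid rows from the output
    FAIL = "fail"   # abort processing on the first invalid row


def apply_expectation(rows, name, predicate: Callable[[dict], bool], action):
    """Return (kept_rows, violation_count) for one named expectation."""
    kept, violations = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        violations += 1
        if action is Action.FAIL:
            raise ValueError(f"expectation {name!r} violated by {row}")
        if action is Action.WARN:
            kept.append(row)
    return kept, violations


# Hypothetical sample data: the second row is missing its key.
orders = [
    {"order_id": 1, "amount": 30.0},
    {"order_id": None, "amount": 12.5},
    {"order_id": 3, "amount": -4.0},
]


def has_id(row: dict) -> bool:
    return row["order_id"] is not None


warned, warn_count = apply_expectation(orders, "valid_id", has_id, Action.WARN)
dropped, drop_count = apply_expectation(orders, "valid_id", has_id, Action.DROP)
print(len(warned), warn_count)
print(len(dropped), drop_count)
```

The three actions trade completeness against strictness: warning preserves every row for later inspection, dropping keeps downstream tables clean, and failing stops bad data from propagating at all.
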
### Databricks Workflows
- Orchestrates multi-task pipelines (notebooks + SQL + JAR)
- Supports CI/CD integration (Asset Bundles)

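A multi-task pipeline is, in effect, a dependency graph of tasks. The following sketch shows how such a graph resolves into an execution order, using only the Python standard library. The `job` dictionary is loosely modeled on a task list with `task_key` and `depends_on` fields; its exact shape, the `kind` field, and the task names are simplifying assumptions, not the real job specification.

```python
from graphlib import TopologicalSorter

# Hypothetical job: one notebook task feeding a SQL task, which fans out
# to a JAR task and a second SQL task.
job = {
    "name": "daily_sales",
    "tasks": [
        {"task_key": "ingest", "kind": "notebook", "depends_on": []},
        {"task_key": "clean", "kind": "sql", "depends_on": ["ingest"]},
        {"task_key": "score", "kind": "jar", "depends_on": ["clean"]},
        {"task_key": "report", "kind": "sql", "depends_on": ["clean"]},
    ],
}


def execution_waves(job_spec):
    """Group tasks into waves; tasks in one wave have no pending dependencies."""
    graph = {t["task_key"]: set(t["depends_on"]) for t in job_spec["tasks"]}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(execution_waves(job))  # [['ingest'], ['clean'], ['report', 'score']]
```

Tasks that land in the same wave, such as `report` and `score` here, have no dependency on each other and could run in parallel.
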
### Asset Bundles
- Manages Databricks resources as infrastructure as code (IaC)
- Integrates with GitHub Actions for automated deployment

## Cloud Platforms
- **AWS**: S3 + Databricks
- **Azure**: ADLS + Databricks (Microsoft Fabric integration)
- **GCP**: GCS + Databricks

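Each pairing above implies a different object-storage URI scheme in pipeline code. The helper below is a small sketch of that mapping, assuming ADLS means ADLS Gen2 accessed through the `abfss` scheme. The function name `storage_uri` and the bucket, container, and account names are hypothetical.

```python
def storage_uri(cloud: str, container: str, path: str, account: str = "") -> str:
    """Build an object-storage URI for the given cloud (illustrative only)."""
    path = path.lstrip("/")
    if cloud == "aws":
        return f"s3://{container}/{path}"
    if cloud == "azure":
        # Assumption: ADLS Gen2, which also needs the storage account name.
        if not account:
            raise ValueError("ADLS Gen2 URIs need a storage account name")
        return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
    if cloud == "gcp":
        return f"gs://{container}/{path}"
    raise ValueError(f"unsupported cloud: {cloud!r}")


print(storage_uri("aws", "lake", "bronze/orders"))
print(storage_uri("azure", "lake", "bronze/orders", account="acme"))
print(storage_uri("gcp", "lake", "bronze/orders"))
```

Keeping the scheme logic in one place lets the rest of a pipeline stay cloud-agnostic and refer only to logical paths.
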
## Related Concepts
- [[Medallion Architecture]]
- [[Delta Lake]] (Databricks is a primary contributor and promoter)
- [[Apache Spark]]
- [[dbt]] (dbt Cloud integrates deeply with Databricks)