
Notes: Spark, DBT, and Cloud Provider Project

Jul 19, 2024

Introduction

  • Today’s Focus: Project using Spark, DBT, and a cloud provider (Azure).
  • Community Poll:
    • Spark, DBT, and Cloud Platform: 43% (Winner).
    • Data Warehousing: 38%.
    • Snowflake: 16%.
    • Apache Flink: 4%.

System Architecture

  • Adopted Architecture: Medallion architecture with three layers:
    1. Bronze Layer: Raw data, exactly as ingested from the source.
    2. Silver Layer: Cleaned and transformed data.
    3. Gold Layer: Ready-for-consumption data for BI units and data scientists.

Technologies Used

  • Cloud: Azure.
  • Services: Azure Data Factory, Azure Databricks, DBT, Azure Data Lake Storage Gen2 (ADLS).
  • Initial Setup: An Azure subscription with the required resources (Databricks, storage, etc.).
    • A free Azure account with $200 in credits is recommended.

Project Setup

Azure Resources and Configurations

  1. Azure Resource Creation:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Location: UK South.
  2. Azure Data Lake Storage (ADLS) Gen2:
    • Storage Account: medallionstorageaccount.
    • Layers: Bronze, Silver, Gold created in containers.
  3. Azure Data Factory (ADF):
    • Resource Group: Medallion Spark DBT Resource Group.
    • Name: Medallion ADF.
  4. Azure Key Vault:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Name: Medallion Spark KV.
    • Used for storing secrets like storage account keys.
  5. SQL Database Setup:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Database: medalliondbdev.
    • Pre-loaded with AdventureWorks sample data.
  6. Databricks Workspace:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Name: medallionsparkdatabricks.

Data Flow and Transformation

Data Factory Pipeline

  • Pipeline Components:
    1. Lookup Activity: Retrieves the schema and table names.
    2. ForEach Activity: Loops through the tables, running a Copy Data activity for each.
    3. Copy Data Activity: Dumps the data into the Bronze layer in ADLS.
  • Linked Services:
    • SQL Database.
    • ADLS Gen2.
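A trimmed sketch of how the Lookup → ForEach → Copy structure might look in ADF's JSON export (pipeline, activity, and query details here are illustrative, not taken from the project):

```json
{
  "name": "CopySqlToBronze",
  "properties": {
    "activities": [
      {
        "name": "LookupTableList",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "LookupTableList", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupTableList').output.value",
            "type": "Expression"
          },
          "activities": [
            { "name": "CopyToBronze", "type": "Copy" }
          ]
        }
      }
    ]
  }
}
```

The ForEach items expression feeds each schema/table pair from the Lookup output into the inner Copy Data activity, which writes to the Bronze container via the ADLS Gen2 linked service.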

Azure Databricks Configuration

  • Compute Cluster: Use existing cluster for job submission.
  • Storage Mounting:
    • Mounted Bronze, Silver, and Gold layers on Databricks.
  • Notebook: Create and run notebooks to process data from Bronze to Silver and Gold.
  • Jobs Setup:
    • Submit jobs from Azure Data Factory to Databricks notebooks.
  • SQL Execution in Databricks:
    • Create databases and tables over the data dumped into the Bronze layer.
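The database-and-table step above can be sketched in Spark SQL; the mount path, table, and file format are assumptions (AdventureWorks exposes SalesLT.Customer, but the actual layout depends on how the Copy activity wrote the files):

```sql
-- Database for the bronze tables
CREATE DATABASE IF NOT EXISTS bronze;

-- Expose a dumped dataset as an external table; the path assumes the
-- Bronze container was mounted at /mnt/bronze as described above
CREATE TABLE IF NOT EXISTS bronze.customer
USING PARQUET
LOCATION '/mnt/bronze/SalesLT/Customer/';
```

External tables like this let DBT query the raw dumps without moving the data again.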

Transformations with DBT

Initial Setup

  • DBT Installation:
    • Install the dbt-databricks adapter via pip install dbt-databricks.
    • Install the Databricks CLI for an easier connection setup.
  • DBT Configurations:
    • Configure profiles for connecting DBT to Databricks.
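A minimal profiles.yml for the dbt-databricks adapter might look like the sketch below; the profile name, host, http_path, and schema are placeholders to adapt to your workspace:

```yaml
medallion_dbt:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: silver
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123   # or a cluster's HTTP path
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```

Reading the token from an environment variable keeps the secret out of version control, in the same spirit as storing keys in Azure Key Vault.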
  • Snapshots:
    • Create initial snapshots of AdventureWorks tables using SQL.
    • Store snapshots in Silver layer in Delta format.
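A dbt snapshot over an AdventureWorks table could be sketched as follows; the source name is an assumption, though CustomerID and ModifiedDate are real columns on SalesLT.Customer. On Databricks, dbt writes snapshot tables in Delta format by default, which matches the Silver-layer requirement:

```sql
{% snapshot customer_snapshot %}

{{
    config(
      target_schema='silver',
      unique_key='CustomerID',
      strategy='timestamp',
      updated_at='ModifiedDate'
    )
}}

SELECT * FROM {{ source('adventureworks', 'customer') }}

{% endsnapshot %}
```

The timestamp strategy tracks row changes over time, so later runs capture slowly changing dimensions rather than overwriting history.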

Data Models

  • Materialization:
    • Define transformations and materialize as tables in Gold layer.
    • Example SQL for customers: filtering out invalid records.
  • DBT Testing and Documentation:
    • Run tests to validate the transformations.
    • Generate and serve documentation.
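The customers model mentioned above might look like this as a dbt model materialized into the Gold layer; the column list and filter conditions are illustrative, not the project's exact SQL:

```sql
{{ config(materialized='table', schema='gold') }}

-- Keep only valid customer records from the Silver snapshot
SELECT
    CustomerID,
    FirstName,
    LastName,
    EmailAddress
FROM {{ ref('customer_snapshot') }}
WHERE CustomerID IS NOT NULL
  AND EmailAddress IS NOT NULL
```

dbt run builds the model, dbt test validates it, and dbt docs generate followed by dbt docs serve produces the browsable documentation and lineage graph mentioned in the summary.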

Summary

  • End Result: Data processed through Bronze-Silver-Gold layers with appropriate transformations using integrated Azure components (Data Factory, Databricks, ADLS Gen2) and DBT.
  • DBT Doc Server: View lineage and dependencies of the data models.

Closing

  • Subscribe and turn on notifications for future updates.
  • Q&A and feedback are welcome.