
Notes: Spark, DBT, and Cloud Provider Project

Jul 19, 2024

Introduction

  • Today’s Focus: Project using Spark, DBT, and a cloud provider (Azure).
  • Community Poll:
    • Spark, DBT, and Cloud Platform: 43% (Winner).
    • Data Warehousing: 38%.
    • Snowflake: 16%.
    • Apache Flink: 4%.

System Architecture

  • Adopted Architecture: Medallion architecture with three layers:
    1. Bronze Layer: Raw data, exactly as ingested from the source.
    2. Silver Layer: Cleaned and transformed data.
    3. Gold Layer: Ready-for-consumption data for BI units and data scientists.

Technologies Used

  • Cloud: Azure.
  • Services: Azure Data Factory, Azure Databricks, DBT, Azure Data Lake Storage Gen2 (ADLS).
  • Initial Setup: An Azure subscription with the required resources (Databricks, storage, etc.).
    • A free Azure account with $200 in credits is recommended.

Project Setup

Azure Resources and Configurations

  1. Azure Resource Creation:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Location: UK South.
  2. Azure Data Lake Storage (ADLS) Gen2:
    • Storage Account: medallionstorageaccount.
    • Layers: Bronze, Silver, Gold created in containers.
  3. Azure Data Factory (ADF):
    • Resource Group: Medallion Spark DBT Resource Group.
    • Name: Medallion ADF.
  4. Azure Key Vault:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Name: Medallion Spark KV.
    • Used for storing secrets like storage account keys.
  5. SQL Database Setup:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Database: medalliondbdev.
    • Pre-loaded with AdventureWorks sample data.
  6. Databricks Workspace:
    • Resource Group: Medallion Spark DBT Resource Group.
    • Name: medallionsparkdatabricks.

Data Flow and Transformation

Data Factory Pipeline

  • Pipeline Components:
    1. Lookup Activity: Retrieves the schema and table names.
    2. ForEach Activity: Loops through the tables, running a Copy Data activity for each.
    3. Copy Data Activity: Dumps the data into the Bronze layer in ADLS.
  • Linked Services:
    • SQL Database.
    • ADLS Gen2.
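A trimmed sketch of how the Lookup → ForEach → Copy structure might look in ADF's JSON export (pipeline, activity, and query details here are illustrative, not taken from the project):

```json
{
  "name": "CopySqlToBronze",
  "properties": {
    "activities": [
      {
        "name": "LookupTableList",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "LookupTableList", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "items": {
            "value": "@activity('LookupTableList').output.value",
            "type": "Expression"
          },
          "activities": [
            { "name": "CopyToBronze", "type": "Copy" }
          ]
        }
      }
    ]
  }
}
```

The ForEach items expression feeds each schema/table pair from the Lookup output into the inner Copy Data activity, which writes to the Bronze container via the ADLS Gen2 linked service.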

Azure Databricks Configuration

  • Compute Cluster: Use existing cluster for job submission.
  • Storage Mounting:
    • Mounted Bronze, Silver, and Gold layers on Databricks.
  • Notebook: Create and run notebooks to process data from Bronze to Silver and Gold.
  • Jobs Setup:
    • Submit jobs from Azure Data Factory to Databricks notebooks.
  • SQL Execution in Databricks:
    • Create databases and tables over the data dumped into the Bronze layer.
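The database-and-table step above can be sketched in Spark SQL; the mount path, table, and file format are assumptions (AdventureWorks exposes SalesLT.Customer, but the actual layout depends on how the Copy activity wrote the files):

```sql
-- Database for the bronze tables
CREATE DATABASE IF NOT EXISTS bronze;

-- Expose a dumped dataset as an external table; the path assumes the
-- Bronze container was mounted at /mnt/bronze as described above
CREATE TABLE IF NOT EXISTS bronze.customer
USING PARQUET
LOCATION '/mnt/bronze/SalesLT/Customer/';
```

External tables like this let DBT query the raw dumps without moving the data again.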

Transformations with DBT

Initial Setup

  • DBT Installation:
    • Install the dbt-databricks adapter via pip install dbt-databricks.
    • Install the Databricks CLI for an easier connection setup.
  • DBT Configurations:
    • Configure profiles for connecting DBT to Databricks.
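A minimal profiles.yml for the dbt-databricks adapter might look like the sketch below; the profile name, host, http_path, and schema are placeholders to adapt to your workspace:

```yaml
medallion_dbt:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: silver
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123   # or a cluster's HTTP path
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```

Reading the token from an environment variable keeps the secret out of version control, in the same spirit as storing keys in Azure Key Vault.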
  • Snapshots:
    • Create initial snapshots of AdventureWorks tables using SQL.
    • Store snapshots in Silver layer in Delta format.
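A dbt snapshot over an AdventureWorks table could be sketched as follows; the source name is an assumption, though CustomerID and ModifiedDate are real columns on SalesLT.Customer. On Databricks, dbt writes snapshot tables in Delta format by default, which matches the Silver-layer requirement:

```sql
{% snapshot customer_snapshot %}

{{
    config(
      target_schema='silver',
      unique_key='CustomerID',
      strategy='timestamp',
      updated_at='ModifiedDate'
    )
}}

SELECT * FROM {{ source('adventureworks', 'customer') }}

{% endsnapshot %}
```

The timestamp strategy tracks row changes over time, so later runs capture slowly changing dimensions rather than overwriting history.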

Data Models

  • Materialization:
    • Define transformations and materialize as tables in Gold layer.
    • Example SQL for customers: filtering out invalid records.
  • DBT Testing and Documentation:
    • Run tests to validate the transformations.
    • Generate and serve documentation.
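The customers model mentioned above might look like this as a dbt model materialized into the Gold layer; the column list and filter conditions are illustrative, not the project's exact SQL:

```sql
{{ config(materialized='table', schema='gold') }}

-- Keep only valid customer records from the Silver snapshot
SELECT
    CustomerID,
    FirstName,
    LastName,
    EmailAddress
FROM {{ ref('customer_snapshot') }}
WHERE CustomerID IS NOT NULL
  AND EmailAddress IS NOT NULL
```

dbt run builds the model, dbt test validates it, and dbt docs generate followed by dbt docs serve produces the browsable documentation and lineage graph mentioned in the summary.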

Summary

  • End Result: Data processed through Bronze-Silver-Gold layers with appropriate transformations using integrated Azure components (Data Factory, Databricks, ADLS Gen2) and DBT.
  • DBT Doc Server: View lineage and dependencies of the data models.

Closing

  • Subscribe and turn on notifications for future updates.
  • Q&A and feedback are welcome.