Performing SCD Type 2 Operations in Delta Lake

Jul 7, 2024

Introduction

  • Presenter: Rajesh
  • Topic: How to perform an SCD Type 2 operation in Delta Lake when developing a Databricks solution.
  • Background Music: Yes

Key Concepts

Slowly Changing Dimensions (SCD)

  • Important concept in data warehousing.
  • Definition: The attributes/values of an entity change over time.
  • Example: A customer changes address over time; the data warehouse must reflect this change as well.

Methods to Handle SCD

  • SCD Type 1: Overwrites the old value with the new one; history is lost.
  • SCD Type 2: Marks the previous record inactive and inserts a new record; maintains full history, at the cost of the same business key appearing in multiple rows.
  • SCD Type 3: Adds a new column for the updated value; keeps only the current and previous values, in separate columns.
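The three strategies can be contrasted with a minimal plain-Python sketch, using the address-change example above (the `customer` records and dates here are made up for illustration):

```python
# Hypothetical example: a customer moves from "Pune" to "Mumbai".

# SCD Type 1: overwrite in place -- the old address is lost.
customer = {"id": 1, "address": "Pune"}
customer["address"] = "Mumbai"

# SCD Type 2: expire the old row, insert a new active row -- full history.
rows = [{"id": 1, "address": "Pune", "active": "Y",
         "start_date": "2024-01-01", "end_date": None}]
rows[0].update(active="N", end_date="2024-07-01")    # retire the old version
rows.append({"id": 1, "address": "Mumbai", "active": "Y",
             "start_date": "2024-07-01", "end_date": None})

# SCD Type 3: keep only current and previous values in separate columns.
customer3 = {"id": 1, "address": "Pune", "prev_address": None}
customer3["prev_address"], customer3["address"] = customer3["address"], "Mumbai"
```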

Focus of the Video

  • Demonstrate an SCD Type 2 implementation in Delta Lake on Databricks.

Explanation and Demo

Initial Setup

  • Delta Table Example: Composite primary key (pk1, pk2), four dimension attributes (dim1, dim2, dim3, dim4), active status, start date, end date.
  • Target Table: Example with hypothetical data.
  • Source Data: Could be from JSON, CSV, SQL; example with three records.
    • Record 1: No change
    • Record 2: New record
    • Record 3: Value change
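The setup above could be sketched as plain-Python literals. The column names (pk1, pk2, dim1–dim4, active, start_date, end_date) follow the video; all values are hypothetical:

```python
from datetime import date

# Target (Delta) table: composite primary key (pk1, pk2), four dimension
# attributes, plus the SCD Type 2 bookkeeping columns.
target = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20, "dim3": 30, "dim4": 40,
     "active": "Y", "start_date": date(2024, 1, 1), "end_date": None},
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 21, "dim3": 31, "dim4": 41,
     "active": "Y", "start_date": date(2024, 1, 1), "end_date": None},
]

# Incoming source batch (could originate from JSON, CSV, or SQL):
source = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20, "dim3": 30, "dim4": 40},  # no change
    {"pk1": 3, "pk2": "C", "dim1": 12, "dim2": 22, "dim3": 32, "dim4": 42},  # new record
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 99, "dim3": 31, "dim4": 41},  # dim2 changed
]
```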

Steps to Implement SCD Type 2

  1. Create Delta Table: Structure includes composite primary key, attributes, status, dates.
  2. Insert Initial Data: Populate delta table with sample data.
  3. Define Schema & Source Data: Create a schema and a DataFrame for the incoming source data.
  4. Left Join: Join source data with target table.
  5. Filter Changes: Compare dimension values and filter only changed records.
  6. Create Merge Key: Concatenate primary keys into a single merge key.
  7. Handle Matching Records: For records whose values changed, add a duplicate ("dummy") row with a null merge key, so the merge can update the old row and still insert the new version.
  8. Union DataFrames: Combine the modified records into a single DataFrame.
  9. Perform Merge Statement: Use the Delta Lake merge statement (via PySpark) to update and insert records in the Delta table.
  10. Verify Final Data: Query delta table to check the updated records.
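Steps 4–8 above can be simulated in plain Python. The video does this with PySpark DataFrames; this is only a simplified sketch with made-up data (and two dimension columns instead of four) to show the merge-key trick:

```python
DIMS = ["dim1", "dim2"]  # shortened to two dimension columns for brevity

target = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20, "active": "Y"},
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 21, "active": "Y"},
]
source = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20},  # no change
    {"pk1": 3, "pk2": "C", "dim1": 12, "dim2": 22},  # new record
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 99},  # dim2 changed
]

# Index the active target rows by primary key for the "left join".
active = {(t["pk1"], t["pk2"]): t for t in target if t["active"] == "Y"}

staged = []
for s in source:
    t = active.get((s["pk1"], s["pk2"]))                 # left join on the keys
    if t and all(t[d] == s[d] for d in DIMS):
        continue                                          # no change: drop it
    # Merge key = primary keys concatenated into a single value.
    staged.append(dict(s, merge_key=f'{s["pk1"]}|{s["pk2"]}'))
    if t:
        # Changed record: also stage a dummy copy with a null merge key,
        # so the merge later inserts the new version as well.
        staged.append(dict(s, merge_key=None))
```

After this step, `staged` holds the new record, the changed record, and the dummy copy of the changed record.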

PySpark Code Execution Steps

  1. Left Join Example:
    • Creates a joined view of source and target tables.
  2. Filter Operation:
    • Filters out records with no changes.
  3. Creating and Handling Merge Key:
    • Combines primary keys.
    • Filters matched records and creates null merge key for dummy records.
  4. Performing Merge:
    • Uses the merge statement to mark outdated records as inactive and insert the new versions.
  5. Final Table Validation:
    • Shows final state of the target table after merge operations.
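The merge semantics in step 4 of the list above can also be simulated in plain Python: a staged row whose merge key matches an active target row expires that row, and every non-matching row (including the dummy rows with a null merge key) is inserted as a new active version. In Databricks this would be a `MERGE INTO` / `DeltaTable.merge`; the sketch below only mirrors the behavior, with hypothetical data and a single dimension column:

```python
from datetime import date

TODAY = date(2024, 7, 7)  # illustrative load date

target = [
    {"pk1": 1, "pk2": "A", "dim2": 20, "active": "Y", "end_date": None},
    {"pk1": 2, "pk2": "B", "dim2": 21, "active": "Y", "end_date": None},
]
# Staged rows produced by the earlier join/filter/union steps (made up):
staged = [
    {"pk1": 3, "pk2": "C", "dim2": 22, "merge_key": "3|C"},  # brand-new key
    {"pk1": 2, "pk2": "B", "dim2": 99, "merge_key": "2|B"},  # changed record
    {"pk1": 2, "pk2": "B", "dim2": 99, "merge_key": None},   # dummy copy
]

for s in staged:
    matched = False
    for t in target:
        # WHEN MATCHED: expire the current active version.
        if s["merge_key"] == f'{t["pk1"]}|{t["pk2"]}' and t["active"] == "Y":
            t["active"], t["end_date"] = "N", TODAY
            matched = True
    if not matched:
        # WHEN NOT MATCHED: insert the row as the new active version.
        target.append({"pk1": s["pk1"], "pk2": s["pk2"], "dim2": s["dim2"],
                       "active": "Y", "end_date": None})
```

The dummy null-merge-key row can never match, which is exactly what forces the insert of the changed record's new version alongside the update of its old one.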

Conclusion

  • Successfully demonstrated SCD Type 2 operation in Delta Lake using Databricks.
  • Ensured the data maintains historical integrity and new records are properly inserted.
  • Encouraged viewers to like, comment, and subscribe to the channel.