Performing SCD Type 2 Operations in Delta Lake

Jul 7, 2024

Introduction

  • Presenter: Rajesh
  • Topic: How to perform an SCD Type 2 operation in Delta Lake when developing a Databricks solution.
  • Background Music: Yes

Key Concepts

Slowly Changing Dimensions (SCD)

  • Important concept in data warehousing.
  • Definition: The attributes/values of an entity change over time.
  • Example: A customer changes address over time; the data warehouse must reflect this change as well.

Methods to Handle SCD

  • SCD Type 1: Overwrites the old value with the new one; history is lost.
  • SCD Type 2: Marks the previous record inactive and inserts a new record; maintains full history, at the cost of the same business key appearing in multiple rows.
  • SCD Type 3: Adds a new column for the updated value; keeps only the current and previous values, in separate columns.
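The three strategies can be contrasted with a minimal plain-Python sketch, using the address-change example above (the `customer` records and dates here are made up for illustration):

```python
# Hypothetical example: a customer moves from "Pune" to "Mumbai".

# SCD Type 1: overwrite in place -- the old address is lost.
customer = {"id": 1, "address": "Pune"}
customer["address"] = "Mumbai"

# SCD Type 2: expire the old row, insert a new active row -- full history.
rows = [{"id": 1, "address": "Pune", "active": "Y",
         "start_date": "2024-01-01", "end_date": None}]
rows[0].update(active="N", end_date="2024-07-01")    # retire the old version
rows.append({"id": 1, "address": "Mumbai", "active": "Y",
             "start_date": "2024-07-01", "end_date": None})

# SCD Type 3: keep only current and previous values in separate columns.
customer3 = {"id": 1, "address": "Pune", "prev_address": None}
customer3["prev_address"], customer3["address"] = customer3["address"], "Mumbai"
```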

Focus of the Video

  • Demonstrate an SCD Type 2 implementation in Delta Lake on Databricks.

Explanation and Demo

Initial Setup

  • Delta Table Example: Composite primary key (pk1, pk2), four dimension attributes (dim1, dim2, dim3, dim4), active status, start date, end date.
  • Target Table: Example with hypothetical data.
  • Source Data: Could be from JSON, CSV, SQL; example with three records.
    • Record 1: No change
    • Record 2: New record
    • Record 3: Value change
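The setup above could be sketched as plain-Python literals. The column names (pk1, pk2, dim1–dim4, active, start_date, end_date) follow the video; all values are hypothetical:

```python
from datetime import date

# Target (Delta) table: composite primary key (pk1, pk2), four dimension
# attributes, plus the SCD Type 2 bookkeeping columns.
target = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20, "dim3": 30, "dim4": 40,
     "active": "Y", "start_date": date(2024, 1, 1), "end_date": None},
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 21, "dim3": 31, "dim4": 41,
     "active": "Y", "start_date": date(2024, 1, 1), "end_date": None},
]

# Incoming source batch (could originate from JSON, CSV, or SQL):
source = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20, "dim3": 30, "dim4": 40},  # no change
    {"pk1": 3, "pk2": "C", "dim1": 12, "dim2": 22, "dim3": 32, "dim4": 42},  # new record
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 99, "dim3": 31, "dim4": 41},  # dim2 changed
]
```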

Steps to Implement SCD Type 2

  1. Create Delta Table: Structure includes composite primary key, attributes, status, dates.
  2. Insert Initial Data: Populate delta table with sample data.
  3. Define Schema & Source Data: Create a schema and a DataFrame for the incoming source data.
  4. Left Join: Join source data with target table.
  5. Filter Changes: Compare dimension values and filter only changed records.
  6. Create Merge Key: Concatenate primary keys into a single merge key.
  7. Handle Matching Records: For records whose values changed, add a duplicate ("dummy") row with a null merge key, so the merge can update the old row and still insert the new version.
  8. Union DataFrames: Combine the modified records into a single DataFrame.
  9. Perform Merge Statement: Use the Delta Lake merge statement (via PySpark) to update and insert records in the Delta table.
  10. Verify Final Data: Query delta table to check the updated records.
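Steps 4–8 above can be simulated in plain Python. The video does this with PySpark DataFrames; this is only a simplified sketch with made-up data (and two dimension columns instead of four) to show the merge-key trick:

```python
DIMS = ["dim1", "dim2"]  # shortened to two dimension columns for brevity

target = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20, "active": "Y"},
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 21, "active": "Y"},
]
source = [
    {"pk1": 1, "pk2": "A", "dim1": 10, "dim2": 20},  # no change
    {"pk1": 3, "pk2": "C", "dim1": 12, "dim2": 22},  # new record
    {"pk1": 2, "pk2": "B", "dim1": 11, "dim2": 99},  # dim2 changed
]

# Index the active target rows by primary key for the "left join".
active = {(t["pk1"], t["pk2"]): t for t in target if t["active"] == "Y"}

staged = []
for s in source:
    t = active.get((s["pk1"], s["pk2"]))                 # left join on the keys
    if t and all(t[d] == s[d] for d in DIMS):
        continue                                          # no change: drop it
    # Merge key = primary keys concatenated into a single value.
    staged.append(dict(s, merge_key=f'{s["pk1"]}|{s["pk2"]}'))
    if t:
        # Changed record: also stage a dummy copy with a null merge key,
        # so the merge later inserts the new version as well.
        staged.append(dict(s, merge_key=None))
```

After this step, `staged` holds the new record, the changed record, and the dummy copy of the changed record.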

PySpark Code Execution Steps

  1. Left Join Example:
    • Creates a joined view of source and target tables.
  2. Filter Operation:
    • Filters out records with no changes.
  3. Creating and Handling Merge Key:
    • Combines primary keys.
    • Filters matched records and creates null merge key for dummy records.
  4. Performing Merge:
    • Uses the merge statement to mark outdated records as inactive and insert the new versions.
  5. Final Table Validation:
    • Shows final state of the target table after merge operations.
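The merge semantics in step 4 of the list above can also be simulated in plain Python: a staged row whose merge key matches an active target row expires that row, and every non-matching row (including the dummy rows with a null merge key) is inserted as a new active version. In Databricks this would be a `MERGE INTO` / `DeltaTable.merge`; the sketch below only mirrors the behavior, with hypothetical data and a single dimension column:

```python
from datetime import date

TODAY = date(2024, 7, 7)  # illustrative load date

target = [
    {"pk1": 1, "pk2": "A", "dim2": 20, "active": "Y", "end_date": None},
    {"pk1": 2, "pk2": "B", "dim2": 21, "active": "Y", "end_date": None},
]
# Staged rows produced by the earlier join/filter/union steps (made up):
staged = [
    {"pk1": 3, "pk2": "C", "dim2": 22, "merge_key": "3|C"},  # brand-new key
    {"pk1": 2, "pk2": "B", "dim2": 99, "merge_key": "2|B"},  # changed record
    {"pk1": 2, "pk2": "B", "dim2": 99, "merge_key": None},   # dummy copy
]

for s in staged:
    matched = False
    for t in target:
        # WHEN MATCHED: expire the current active version.
        if s["merge_key"] == f'{t["pk1"]}|{t["pk2"]}' and t["active"] == "Y":
            t["active"], t["end_date"] = "N", TODAY
            matched = True
    if not matched:
        # WHEN NOT MATCHED: insert the row as the new active version.
        target.append({"pk1": s["pk1"], "pk2": s["pk2"], "dim2": s["dim2"],
                       "active": "Y", "end_date": None})
```

The dummy null-merge-key row can never match, which is exactly what forces the insert of the changed record's new version alongside the update of its old one.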

Conclusion

  • Successfully demonstrated SCD Type 2 operation in Delta Lake using Databricks.
  • Ensured the data maintains historical integrity and new records are properly inserted.
  • Encouraged viewers to like, comment, and subscribe to the channel.