
Change Data Capture and Stream Objects in Snowflake

Jun 24, 2024


Introduction:

  • Change Data Capture (CDC): A method to track changes (inserts, updates, and deletes) made to data in a table
    • Traditional methods: update timestamps, row versioning, transaction-log scanning
    • Challenges: complex to build, cumbersome to maintain, and heavy on custom logic
  • Snowflake's Solution: Stream objects
    • An elegant and simple approach
    • Exposes three additional metadata columns alongside the source table's columns when the stream is queried
    • Tracks all DML operations for downstream processes
    • A core Snowflake concept for data professionals

Key Questions to Address:

  1. Tracking different DML operations: insert, update, delete
  2. Types of streams in Snowflake
  3. Stream creation on external/transient tables
  4. Cloning stream objects
  5. Cost considerations
  6. Mechanism of tracking changes

Overview of Stream Objects:

  • Definition: First-class, schema-level objects that track DML changes made to a source table
  • Source Tables: Can be a permanent, transient, temporary, or external table
    • Example: Creating a stream called my_stream on a source table customer
  • SQL Syntax:
    CREATE STREAM my_stream ON TABLE customer;
  • Metadata Columns Added:
    • METADATA$ACTION
    • METADATA$IS_UPDATE
    • METADATA$ROW_ID
  • Data Analysis:
    • Querying the base table shows the current data
    • Querying the stream shows only the changed rows plus the metadata columns (see the sketch below)
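
A minimal end-to-end sketch of the flow above; the customer table's schema here (id, name) is illustrative:

```sql
-- Source table (illustrative schema)
CREATE OR REPLACE TABLE customer (
    id   INT,
    name STRING
);

-- Stream tracking changes on the source table
CREATE OR REPLACE STREAM my_stream ON TABLE customer;

-- A DML operation on the base table...
INSERT INTO customer VALUES (1, 'Alice');

-- ...surfaces in the stream together with the three metadata columns
SELECT id, name,
       METADATA$ACTION,
       METADATA$IS_UPDATE,
       METADATA$ROW_ID
FROM my_stream;
```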

Example Scenario:

  • Initial state: Customer table holds 10 records at 9:00
  • Changes: 5 DML operations (inserts, updates, and deletes) applied at 11:00
  • Stream Table: Shows only those 5 changes, not the full data set

Types of Streams:

  1. Standard (Default) Streams: Capture all types of DML (insert, update, delete)
  2. Append-Only Streams: Capture only insert operations
    • Parameter: APPEND_ONLY = TRUE
  3. Insert-Only Streams: The append-only equivalent for external tables (syntax sketched below)
    • Parameter: INSERT_ONLY = TRUE
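
Syntax sketches for the two non-default types; my_stream_append, my_stream_ext, and the external table customer_ext are illustrative names:

```sql
-- Append-only stream on a regular table: updates and deletes are ignored
CREATE OR REPLACE STREAM my_stream_append
    ON TABLE customer
    APPEND_ONLY = TRUE;

-- Insert-only stream, the variant supported for external tables
-- (assumes an external table customer_ext already exists)
CREATE OR REPLACE STREAM my_stream_ext
    ON EXTERNAL TABLE customer_ext
    INSERT_ONLY = TRUE;
```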

Creating Multiple Streams on a Source Table:

  • Unlimited Streams: No hard limit on stream creation
  • Example: Creating my_stream_delta and my_stream_append on the customer table (sketched below)
    • my_stream_delta: Captures all DMLs
    • my_stream_append: Captures only inserts
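
Both streams from the example, as a sketch against the same customer table:

```sql
-- Two independent streams on one source table
CREATE OR REPLACE STREAM my_stream_delta  ON TABLE customer;                     -- all DML
CREATE OR REPLACE STREAM my_stream_append ON TABLE customer APPEND_ONLY = TRUE;  -- inserts only

-- Each stream keeps its own offset, so consuming one
-- does not affect what the other still has to report
SELECT * FROM my_stream_delta;
SELECT * FROM my_stream_append;
```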

Cloning Stream Objects:

  • Yes, streams can be cloned.
  • Example: Cloning my_stream to create my_cloned_stream (sketched below)
    • Both streams remain independent after cloning
    • Offset Value: Each stream's offset marks a point in the source table's version history; the stream returns the changes made after that point
      • The clone starts at the source stream's current offset, then the two advance independently
      • Consuming a stream inside a transaction advances its offset, so the same changes are not returned twice
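
A minimal cloning sketch; the clone inherits the source stream's current offset at clone time:

```sql
-- Clone an existing stream
CREATE OR REPLACE STREAM my_cloned_stream CLONE my_stream;

-- From here on the two offsets advance independently:
-- consuming one stream does not drain the other
SELECT * FROM my_stream;
SELECT * FROM my_cloned_stream;
```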

Practical Implementation:

  1. Creating Customer Table: Setting up the source table
  2. Inserting Records: Inserting initial records to test DML operations
  3. Tracking Changes: Using stream objects to track all DML operations on the source table
  4. Understanding Append-Only Streams: Capture only new insert operations
  5. Creating Streams on Transient Tables: Similar behavior to permanent tables (see the sketch after this list)
  6. Impact of Time Travel on Streams: To keep an unconsumed stream from going stale, Snowflake can extend the source table's data retention up to MAX_DATA_EXTENSION_TIME_IN_DAYS (14 days by default)
  7. Consuming Stream Data: Transferring changes to a downstream table
  8. External Tables & Streams: Create streams on external tables with insert-only mode
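
A sketch for steps 5 and 6; customer_tr and my_stream_tr are illustrative names:

```sql
-- Step 5: streams on transient tables use the same syntax as on permanent tables
CREATE OR REPLACE TRANSIENT TABLE customer_tr (id INT, name STRING);
CREATE OR REPLACE STREAM my_stream_tr ON TABLE customer_tr;

-- Step 6: the stale_after column in the output shows the deadline
-- by which the stream must be consumed before it goes stale
SHOW STREAMS LIKE 'my_stream_tr';
```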

Practical Example of Insert, Delete, and Update Operations:

  1. Insert 2 records (ID: 11, 12)
    • Stream Table: Captures insert as METADATA$ACTION=INSERT
  2. Delete record (ID: 9)
    • Stream Table: Captures delete as METADATA$ACTION=DELETE
  3. Update records (ID: 3, 5)
    • Stream Table: Captures each update as a DELETE/INSERT pair
    • Old row version appears as DELETE, new version as INSERT, both flagged METADATA$IS_UPDATE=TRUE (see the sketch below)
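
The three operations above as a sketch; the IDs follow the example and the name values are illustrative:

```sql
-- 1. Inserts: each appears once with METADATA$ACTION = 'INSERT'
INSERT INTO customer VALUES (11, 'Kim'), (12, 'Lee');

-- 2. Delete: appears with METADATA$ACTION = 'DELETE'
DELETE FROM customer WHERE id = 9;

-- 3. Updates: each appears as a DELETE/INSERT pair,
--    with both rows flagged METADATA$IS_UPDATE = TRUE
UPDATE customer SET name = 'Updated' WHERE id IN (3, 5);

-- Inspect the captured changes and their metadata
SELECT id, name, METADATA$ACTION, METADATA$IS_UPDATE
FROM my_stream
ORDER BY id, METADATA$ACTION;
```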

Detailed Scenario on Consumption:

  1. Creating Customer Table: Serves as the source table
  2. Creating Dimensional Table: Serves as the downstream target
  3. Stream Creation: On the customer table
  4. Performing DML Operations: And validating them through the stream
  5. Consuming Data: Transfer DML changes to the downstream table using INSERT statements within transactions (see the sketch below)
    • Example: Customer table to Customer_Dim table
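
A minimal consumption sketch, assuming customer_dim shares the illustrative (id, name) schema; using the stream as the source of a DML statement consumes it, so the offset advances on COMMIT and the same rows are not returned again:

```sql
CREATE OR REPLACE TABLE customer_dim (id INT, name STRING);

BEGIN;

-- Reading the stream inside a DML statement consumes it:
-- on COMMIT the offset advances past these changes
INSERT INTO customer_dim (id, name)
SELECT id, name
FROM my_stream
WHERE METADATA$ACTION = 'INSERT';

COMMIT;
```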

Best Practices:

  1. Regularly check stream offsets and whether streams still hold unconsumed changes (see the sketch below)
  2. Ensure the DML types a stream captures match the business logic that consumes them
  3. Design queries and transaction blocks so each change is consumed exactly once, avoiding data duplication
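
One concrete way to apply the first practice: SYSTEM$STREAM_HAS_DATA reports whether a stream currently holds unconsumed changes (it is also commonly used as a task trigger condition):

```sql
-- Returns TRUE if the stream's offset is behind the table version,
-- i.e. there are changes waiting to be consumed
-- (unquoted identifiers are stored uppercase, hence 'MY_STREAM')
SELECT SYSTEM$STREAM_HAS_DATA('MY_STREAM');
```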

Future Learning: Snowflake Task Objects

  • Automated data pipelines and ETL processes
  • Covered in Chapter 18

Keep practicing and exploring Snowflake’s stream objects for a comprehensive understanding.