Lecture Notes: Change Data Capture and Stream Objects in Snowflake
Introduction:
- Change Data Capture (CDC): A method to track changes in data
- Traditional methods: update timestamps, row versioning, log scanning
- Challenges: complex, cumbersome, logic-heavy
- Snowflake's Solution: Stream objects
- Elegant and simple approach
- Exposes three additional metadata columns alongside the source table's columns when the stream is queried
- Tracks all DML operations for downstream processes
- Ideal for data professionals and crucial for Snowflake knowledge
Key Questions to Address:
- Tracking different DML operations: insert, update, delete
- Types of streams in Snowflake
- Stream creation on external/transient tables
- Cloning stream objects
- Cost considerations
- Mechanism of tracking changes
Overview of Stream Objects:
- Definition: First-class, schema-level objects that track changes to a source table
- Source Tables: Can be a permanent, transient, temporary, or external table
- Example: Creating a stream called my_stream on the source table customer
- SQL Syntax:
CREATE STREAM my_stream ON TABLE customer;
- Metadata Columns Added:
METADATA$ACTION
METADATA$ISUPDATE
METADATA$ROW_ID
- Data Analysis:
- Querying the base table shows current data
- Querying the stream table shows only the changed rows plus the metadata columns (see the sketch below)
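A minimal sketch of this difference, reusing my_stream from the example above:
-- Querying the base table returns only the current data
SELECT * FROM customer;
-- Querying the stream returns only the changed rows; SELECT * also
-- includes METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID
SELECT * FROM my_stream;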
Example Scenario:
- Initial state: Customer table with 10 records at 9:00
- Changes: 5 DML operations including inserts, updates, and deletes at 11:00
- Stream Table: Shows only the changes, not the full data set
Types of Streams:
- Default Streams: Capture all types of DMLs (insert, update, delete)
- Append-Only Streams: Capture only insert operations
- Parameter: APPEND_ONLY = TRUE
- For external tables, use insert-only streams: INSERT_ONLY = TRUE (all three types are sketched below)
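A sketch of all three stream types; customer_ext is a hypothetical external table:
-- Default (standard) stream: captures inserts, updates, and deletes
CREATE STREAM my_default_stream ON TABLE customer;
-- Append-only stream: captures only inserts
CREATE STREAM my_append_stream ON TABLE customer APPEND_ONLY = TRUE;
-- Insert-only stream on an external table
CREATE STREAM my_ext_stream ON EXTERNAL TABLE customer_ext INSERT_ONLY = TRUE;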
Creating Multiple Streams on a Source Table:
- Unlimited Streams: No hard limit on stream creation
- Example: Creating my_stream_delta and my_stream_append on the customer table (sketched below)
- my_stream_delta: captures all DMLs
- my_stream_append: captures only inserts
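A sketch of this example; each stream maintains its own independent offset:
-- Two independent streams on the same source table
CREATE STREAM my_stream_delta ON TABLE customer;
CREATE STREAM my_stream_append ON TABLE customer APPEND_ONLY = TRUE;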
Cloning Stream Objects:
- Yes, streams can be cloned.
- Example: Cloning my_stream to create my_cloned_stream (sketched below)
- Both streams remain independent
- Offset Value: Each stream keeps its own offset, a pointer into the source table's version history that marks which changes it has already returned
- Consuming a stream within a transaction advances its offset, so the same changes are not delivered twice
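A sketch of the cloning example:
-- The clone starts at the same offset as the source stream;
-- afterward the two advance independently
CREATE STREAM my_cloned_stream CLONE my_stream;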
Practical Implementation:
- Creating Customer Table: setting up the source table (see the end-to-end sketch after this list)
- Inserting Records: inserting initial records to test DML operations
- Tracking Changes: Using stream objects to track all DML operations on the source table
- Understanding Append Only Streams: Only capture new insert operations
- Creating Streams on Transient Tables: Similar behavior to permanent tables
- Impact of Time Travel on Streams: to keep a stream from going stale, Snowflake can extend the source table's data retention up to 14 days by default (the MAX_DATA_EXTENSION_TIME_IN_DAYS parameter)
- Consuming Stream Data: Transferring changes to a downstream table
- External Tables & Streams: Create streams on external tables with insert-only mode
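A minimal end-to-end sketch of the setup steps above (table and column names are assumptions):
-- Source table
CREATE TABLE customer (id INT, name STRING);
-- Initial records to test DML operations against
INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
-- Standard stream tracking all DML on the source table
CREATE STREAM my_stream ON TABLE customer;
-- Transient tables behave the same way
CREATE TRANSIENT TABLE customer_tr (id INT, name STRING);
CREATE STREAM my_stream_tr ON TABLE customer_tr;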
Practical Example of Insert, Delete, and Update Operations:
- Insert 2 records (ID: 11, 12)
- Stream Table: captures each insert as METADATA$ACTION=INSERT
- Delete record (ID: 9)
- Stream Table: captures the delete as METADATA$ACTION=DELETE
- Update records (ID: 3, 5)
- Stream Table: captures each update with METADATA$ISUPDATE=TRUE
- Old record marked as deleted, new record marked as inserted (both rows carry METADATA$ISUPDATE=TRUE); see the sketch below
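A sketch of these operations against the customer table from the setup above (names and values are assumptions):
-- Inserts appear in the stream with METADATA$ACTION = 'INSERT'
INSERT INTO customer VALUES (11, 'Kim'), (12, 'Lee');
-- Deletes appear with METADATA$ACTION = 'DELETE'
DELETE FROM customer WHERE id = 9;
-- Updates appear as a DELETE row (old values) plus an INSERT row
-- (new values), both with METADATA$ISUPDATE = TRUE
UPDATE customer SET name = 'Updated' WHERE id IN (3, 5);
-- Inspect the captured change records
SELECT * FROM my_stream;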
Detailed Scenario on Consumption:
- Creating Customer Table: as the source table
- Creating Dimensional Table: for downstream consumption
- Stream Creation: on the customer table
- Performing DML Operations: and validating them against the stream
- Consume Data: transfer the DML changes to the downstream table using INSERT statements within transactions
- Example: Customer table to Customer_Dim table (sketched below)
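A sketch of the consumption step; customer_dim and its columns are assumptions:
CREATE TABLE customer_dim (id INT, name STRING, action STRING);
BEGIN;
-- Reading the stream inside a DML statement consumes the change
-- records and advances the stream's offset
INSERT INTO customer_dim
  SELECT id, name, METADATA$ACTION FROM my_stream;
COMMIT;
-- The stream is now empty until new DML hits the source table
SELECT * FROM my_stream;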
Best Practices:
- Regularly check stream offsets and staleness
- Ensure the stream type (standard vs. append-only) matches the DML operations your business logic needs
- Design queries and transaction blocks so each change record is consumed exactly once, avoiding data duplication (see the checks below)
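Two built-in checks that support these practices:
-- Returns TRUE if the stream still has change records to consume
SELECT SYSTEM$STREAM_HAS_DATA('MY_STREAM');
-- Lists streams with their offsets and staleness information
SHOW STREAMS;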
Future Learning: Snowflake Task Objects
- Automated data pipelines and ETL processes
- Covered in Chapter 18
Keep practicing and exploring Snowflake’s stream objects for a comprehensive understanding.