CDC Stream Processing with Apache Flink
Introduction
- Overview of Apache Flink and its applications in stream processing
- Presentation split into three parts:
- Introduction to Flink
- Flink SQL
- In-depth exploration of Flink's engine
Speaker Background
- Long-time Apache Flink committer
- Core architect of Flink SQL
- Career background includes:
- Data Artisans (acquired by Alibaba)
- Founded Immerok (acquired by Confluent)
Understanding Stream Processing
- Stream processing vs. batch processing
- Key building blocks in stream processing (tied together in the sketch after this list):
- Streams
- Creating and distributing streams
- Joining main and side streams
- Dynamic rule engines and reprocessing logic
- Time
- Managing progress and synchronization
- Handling timeouts and fast-forwarding
- State
- Storing and managing state efficiently
- State backup, versioning, and time travel
- Fault tolerance
- Creating snapshots and restoring states
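
A minimal DataStream sketch tying the four building blocks together; the element values, the five-second out-of-orderness bound, and the checkpoint interval are arbitrary choices for illustration, not from the talk:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class BuildingBlocksSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fault tolerance: periodic consistent snapshots of all operator state.
        env.enableCheckpointing(10_000);

        env.fromElements("a", "b", "a")
                // Time: declare how event time and watermarks progress.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, ts) -> System.currentTimeMillis()))
                // Streams: partition the stream so state is scoped per key.
                .keyBy(value -> value)
                .process(new CountPerKey())
                .print();

        env.execute("building-blocks-sketch");
    }

    // State: a per-key counter kept in Flink's local state backend.
    private static class CountPerKey extends KeyedProcessFunction<String, String, String> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            long updated = (count.value() == null ? 0L : count.value()) + 1;
            count.update(updated);
            out.collect(value + " seen " + updated + " time(s)");
        }
    }
}
```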
Unique Features of Apache Flink
- Declarative pipeline definition without dealing with parallelism directly
- Efficient local state access for each subtask
- Checkpoints for consistent snapshots stored in durable storage (configuration sketched below)
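
A sketch of how checkpointing might be configured; the interval and storage path are placeholders:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a consistent snapshot of all operator state every 60 seconds.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep snapshots in durable storage so a failed job can be restored.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // placeholder path
    }
}
```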
Use Cases for Apache Flink
- Transaction processing, IoT, event interactions
- Data analytics, data integration (ETL), event-driven applications
- Example: social networks using Flink for event counting
APIs in Flink
- Core APIs:
- DataStream API: interacts directly with the stream operator API for low-level operations
- Table & SQL API: relational API that includes a planner and an optimizer
- Example of DataStream API usage (sketched below)
- Example of Table & SQL API usage (sketched below)
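
Hypothetical minimal usage of both APIs; the `datagen` table and its row count are illustration choices, not from the talk:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ApiSketch {

    public static void main(String[] args) throws Exception {
        // DataStream API: explicit, low-level operators on a stream.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3, 4)
                .filter(n -> n % 2 == 0)
                .map(n -> n * 10).returns(Types.INT) // type hint for the lambda
                .print();
        env.execute("datastream-sketch");

        // Table & SQL API: declarative queries, planned and optimized by Flink.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        tEnv.executeSql(
                "CREATE TABLE numbers (n INT) " +
                "WITH ('connector' = 'datagen', 'number-of-rows' = '10')");
        tEnv.executeSql("SELECT n * 10 AS n10 FROM numbers WHERE MOD(n, 2) = 0").print();
    }
}
```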
Changelog and CDC Processing
- Concept of changelog stream processing in Flink
- Distinction between bounded and unbounded streams
- Dynamic tables as logical concepts representing streams
- Mapping between tables and streams using stream-table duality (sketched below)
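
A sketch of the duality, closely following the pattern in Flink's Table API documentation; names and values are illustrative:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class StreamTableDualitySketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // A changelog stream: each row carries a RowKind describing the change.
        DataStream<Row> changelog = env.fromElements(
                Row.ofKind(RowKind.INSERT, "alice", 10),
                Row.ofKind(RowKind.UPDATE_BEFORE, "alice", 10),
                Row.ofKind(RowKind.UPDATE_AFTER, "alice", 25));

        // Stream -> dynamic table: the rows are interpreted as changes to a table.
        Table accounts = tEnv.fromChangelogStream(changelog);

        // Table -> stream: the table's changes are materialized back into a stream.
        tEnv.toChangelogStream(accounts).print();

        env.execute("stream-table-duality-sketch");
    }
}
```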
Operations in Flink SQL
- Changelog row kinds for operations: insert (+I), update before (-U), update after (+U), delete (-D)
- Importance of retraction in managing updates
- Example of summing transactions and dynamically updating the result (sketched below)
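
A sketch of the summing example with made-up values; the op column of the printed result shows the change kinds:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RetractionSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // For 'alice', the changelog of the running sum looks roughly like:
        //   +I[alice, 10]   first result for the key
        //   -U[alice, 10]   update before: retract the outdated row
        //   +U[alice, 25]   update after: emit the corrected row
        tEnv.executeSql(
                "SELECT name, SUM(amount) AS total " +
                "FROM (VALUES ('alice', 10), ('alice', 15), ('bob', 7)) AS t(name, amount) " +
                "GROUP BY name")
            .print();
    }
}
```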
Connectors in Flink
- Various source connectors and their changelog modes
- Examples include:
- File system, Kafka, JDBC
- CDC via Kafka for dynamic updates (sketched below)
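
A sketch of a CDC source over Kafka, assuming the topic carries Debezium-formatted change events; topic name and broker address are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaCdcSourceSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The 'debezium-json' format turns each Debezium event into insert,
        // update-before/update-after, or delete rows of a dynamic table.
        tEnv.executeSql(
                "CREATE TABLE products (" +
                "  id INT," +
                "  name STRING," +
                "  price DECIMAL(10, 2)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'products'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'debezium-json'" +
                ")");
    }
}
```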
Advanced Features and Examples
- Temporal joins and stream enrichment (see the sketch after this list)
- Using primary keys to optimize performance
- Summary of advanced SQL features like pattern matching and windowing
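
A sketch of a temporal join enriching transactions with the exchange rate valid at each transaction's time; all table definitions and connection values are made up for illustration:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TemporalJoinSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Append-only transactions with an event-time attribute.
        tEnv.executeSql(
                "CREATE TABLE transactions (" +
                "  id STRING, currency STRING, amount DECIMAL(10, 2)," +
                "  tx_time TIMESTAMP(3)," +
                "  WATERMARK FOR tx_time AS tx_time - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'datagen')");

        // Versioned table: the primary key plus event time let Flink track
        // exactly one rate per currency per point in time.
        tEnv.executeSql(
                "CREATE TABLE rates (" +
                "  currency STRING, rate DECIMAL(10, 4)," +
                "  update_time TIMESTAMP(3)," +
                "  WATERMARK FOR update_time AS update_time," +
                "  PRIMARY KEY (currency) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'upsert-kafka'," +
                "  'topic' = 'rates'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'key.format' = 'json', 'value.format' = 'json')");

        // Each transaction is joined against the rate valid at its tx_time.
        tEnv.executeSql(
                "SELECT t.id, t.amount * r.rate AS amount_eur " +
                "FROM transactions AS t " +
                "JOIN rates FOR SYSTEM_TIME AS OF t.tx_time AS r " +
                "ON t.currency = r.currency")
            .print();
    }
}
```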
Conclusion
- Flink SQL as a powerful, flexible changelog processor
- Emphasis on integrating systems with diverse changelog semantics
- Invitation for questions from the audience
Note: The talk contained a demo showcasing real-time updates in a Flink job using a MySQL database.
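
The demo itself was not captured here; a sketch of a comparable setup, assuming the separate Flink CDC connector (flink-connector-mysql-cdc) is on the classpath, with placeholder connection details:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlCdcDemoSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Reads MySQL's binlog directly: every INSERT/UPDATE/DELETE in the
        // database becomes a change row in this dynamic table.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id INT, customer STRING, amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'localhost', 'port' = '3306'," +      // placeholder connection
                "  'username' = 'flink', 'password' = 'secret'," +    // placeholder credentials
                "  'database-name' = 'shop', 'table-name' = 'orders'" +
                ")");

        // The result updates in real time as rows change in MySQL.
        tEnv.executeSql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer")
            .print();
    }
}
```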