Exploring Stream Processing with Apache Flink

Aug 12, 2024

CDC Stream Processing with Apache Flink

Introduction

  • Overview of Apache Flink and its applications in stream processing
  • Presentation split into three parts:
    • Introduction to Flink
    • Flink SQL
    • In-depth exploration of Flink's engine

Speaker Background

  • Long-time committer to Apache Flink
  • Core architect of Flink SQL
  • Career background includes:
    • Data Artisans (acquired by Alibaba)
    • Founded Immerok (acquired by Confluent)

Understanding Stream Processing

  • Stream processing vs. batch processing
  • Key building blocks in stream processing (a combined sketch follows this list):
    1. Streams
      • Creating and distributing streams
      • Joining main and side streams
      • Dynamic rule engines and reprocessing logic
    2. Time
      • Managing progress and synchronization
      • Handling timeouts and fast-forwarding
    3. State
      • Storing and managing state efficiently
      • State backup, versioning, and time travel
    4. Fault tolerance
      • Creating snapshots and restoring states
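
A minimal sketch (not from the talk) of how the first three building blocks combine in Flink's DataStream API: a keyed stream, per-key state, and a processing-time timer for timeout handling. All names and values are illustrative.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class BuildingBlocksSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromElements("user-1", "user-2", "user-1")  // a stream of events
               .keyBy(user -> user)                         // distribute the stream by key
               .process(new CountWithTimeout())             // per-key state and timers
               .print();
            env.execute("building-blocks-sketch");
        }

        // Counts events per key; a timer illustrates timeout handling.
        static class CountWithTimeout extends KeyedProcessFunction<String, String, String> {
            private transient ValueState<Long> count;  // per-key state, accessed locally

            @Override
            public void open(Configuration parameters) {
                count = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("count", Long.class));
            }

            @Override
            public void processElement(String value, Context ctx, Collector<String> out)
                    throws Exception {
                long updated = count.value() == null ? 1L : count.value() + 1;
                count.update(updated);
                out.collect(value + " seen " + updated + " time(s)");
                // time: in a long-running job, onTimer fires if this key
                // stays silent for 10 seconds
                ctx.timerService().registerProcessingTimeTimer(
                        ctx.timerService().currentProcessingTime() + 10_000);
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
                    throws Exception {
                out.collect("key " + ctx.getCurrentKey() + " idle; last count " + count.value());
            }
        }
    }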

Unique Features of Apache Flink

  • Declarative pipeline definition without dealing with parallelism directly
  • Efficient local state access for each subtask
  • Checkpoints for consistent snapshots stored in durable storage
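
For illustration, a checkpointing setup on the execution environment; the 60-second interval and the storage path below are placeholder values, not from the talk.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // take a consistent snapshot of all operator state every 60 seconds
            env.enableCheckpointing(60_000);
            // keep snapshots in durable storage so a restarted job can restore from them
            env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
            // ... define and execute the pipeline ...
        }
    }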

Use Cases for Apache Flink

  • Transaction processing, IoT, event interactions
  • Data analytics, data integration (ETL), event-driven applications
  • Examples: social networks using Flink for event counting

APIs in Flink

  • Core APIs:
    • DataStream API: builds directly on the stream operator API for low-level operations
    • Table API and SQL: relational APIs with a planner and optimizer
  • Examples of DataStream API and Table/SQL API usage (sketched below)
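
The talk's own snippets were not captured in these notes; the following is a hedged sketch of both API styles, with an illustrative datagen table standing in for a real source.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class ApiSketches {
        public static void main(String[] args) throws Exception {
            // DataStream API: imperative transformations at the operator level
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromElements(1, 2, 3)
               .map(n -> n * 2)
               .print();
            env.execute("datastream-sketch");

            // Table & SQL API: declarative queries, planned and optimized before execution
            TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
            tEnv.executeSql(
                "CREATE TABLE numbers (n INT) WITH (" +
                "  'connector' = 'datagen', 'number-of-rows' = '5')");
            tEnv.executeSql("SELECT n * 2 AS doubled FROM numbers").print();
        }
    }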

Changelog and CDC Processing

  • Concept of changelog stream processing in Flink
  • Distinction between bounded and unbounded streams
  • Dynamic tables as logical concepts representing streams
  • Mapping between tables and streams using stream-table duality
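
The stream-table duality is visible in the Table API's changelog bridge. Below is a minimal sketch with made-up rows; the column names default to f0/f1.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;
    import org.apache.flink.types.RowKind;

    public class StreamTableDuality {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // a changelog stream: an insert, then an update encoded as -U/+U
            DataStream<Row> changelog = env.fromElements(
                Row.ofKind(RowKind.INSERT, "alice", 10),
                Row.ofKind(RowKind.UPDATE_BEFORE, "alice", 10),
                Row.ofKind(RowKind.UPDATE_AFTER, "alice", 20));

            // interpret the stream as a dynamic table ...
            Table table = tEnv.fromChangelogStream(changelog);

            // ... and turn the continuously updating table back into a stream
            tEnv.toChangelogStream(table).print();
            env.execute("duality-sketch");
        }
    }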

Operations in Flink SQL

  • Changelog row kinds per operation: insert (+I), update_before (-U), update_after (+U), delete (-D)
  • Importance of retraction in managing updates
  • Example of summing transactions and dynamically updating results
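
A sketch along the lines of the summing example; the accounts and amounts are made up, and the exact output order depends on parallelism.

    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.types.Row;

    public class SumWithRetraction {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
            Table transactions = tEnv.fromValues(
                DataTypes.ROW(
                    DataTypes.FIELD("account", DataTypes.STRING()),
                    DataTypes.FIELD("amount", DataTypes.INT())),
                Row.of("alice", 10),
                Row.of("alice", 20),
                Row.of("bob", 5));
            tEnv.createTemporaryView("transactions", transactions);

            // the second "alice" row retracts the old sum (-U alice 10)
            // and emits the updated one (+U alice 30)
            tEnv.executeSql(
                "SELECT account, SUM(amount) AS total FROM transactions GROUP BY account")
                .print();
        }
    }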

Connectors in Flink

  • Various source connectors and their changelog modes
  • Examples include:
    • File system, Kafka, JDBC
    • CDC via Kafka for dynamic updates
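
As a sketch of a CDC source, a table backed by the Kafka connector with the debezium-json format; the topic, broker address, and schema are placeholders, and the Kafka connector dependency is assumed on the classpath.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CdcViaKafkaSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
            // Debezium envelopes (before/after) are decoded into +I/-U/+U/-D rows,
            // so the table behaves as a continuously updating view of the source table
            tEnv.executeSql(
                "CREATE TABLE products (" +
                "  id BIGINT," +
                "  name STRING," +
                "  weight DECIMAL(10, 2)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'products_binlog'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'debezium-json')");
            tEnv.executeSql("SELECT * FROM products").print();
        }
    }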

Advanced Features and Examples

  • Temporal joins and stream enrichment (see the sketch after this list)
  • Using primary keys to optimize performance
  • Summary of advanced SQL features like pattern matching and windowing
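
A hedged fragment of a temporal join for stream enrichment; it assumes an append-only orders table with an event-time attribute order_time and a versioned currency_rates table (primary key plus watermark), both declared elsewhere, and a TableEnvironment tEnv as in the earlier sketches.

    // enrich each order with the rate that was valid at the order's event time
    tEnv.executeSql(
        "SELECT o.order_id, o.price * r.rate AS converted_price " +
        "FROM orders AS o " +
        "JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r " +
        "ON o.currency = r.currency");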

Conclusion

  • Flink SQL as a powerful, flexible changelog processor
  • Emphasis on the capability to integrate various systems with diverse semantics
  • Invitation for questions from the audience

Note: The talk included a demo showcasing real-time updates in a Flink job using a MySQL database.