Real-Time Data Pipeline Design for Clickstream Data

Oct 10, 2024

Data Pipeline for Near Real-Time Ingestion of Clickstream or Playback Data

Introduction

  • Speaker: Karthik, Senior Manager and Lead Data Engineer with 15 years of experience.
  • Expertise: Designing, building, and optimizing large-scale data pipelines in big data and cloud engineering.

Question Overview

  • Task: Create a data pipeline for near real-time ingestion of Netflix clickstream or playback data, focusing on ad hoc monitoring of metrics.

Clarification Questions

  • Metrics of Interest:
    • User engagement metrics, playback metrics, customer churn, navigation paths, and behavior profiling.
    • Importance of understanding user behaviors for product insights and machine learning applications.

Metrics Discussion

  • Key Metrics:
    • Customer Churn: Monitoring user activity to identify and retain users at risk of leaving (e.g., those who stop visiting); a simple inactivity-window sketch follows this list.
    • Path Analysis: Understanding navigation flow and identifying potential obstacles in the user journey (e.g., the number of clicks needed to reach a target).
    • Playback Data: Analyzing streaming activity, including the number of sessions, pauses, and engagement with content.
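
As a hedged illustration of the churn metric, the sketch below flags users with no clickstream events inside a configurable inactivity window; the event shape, field names, and 30-day threshold are assumptions rather than details from the discussion.

```python
from datetime import datetime, timedelta

# Assumed event shape: {"user_id": str, "event_type": str, "timestamp": ISO-8601 str}
events = [
    {"user_id": "u1", "event_type": "play", "timestamp": "2024-10-01T12:00:00"},
    {"user_id": "u2", "event_type": "click", "timestamp": "2024-08-15T09:30:00"},
]

def churn_candidates(events, as_of, inactivity_days=30):
    """Return user_ids with no activity in the last `inactivity_days` days."""
    last_seen = {}
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Track the most recent event per user.
        if event["user_id"] not in last_seen or ts > last_seen[event["user_id"]]:
            last_seen[event["user_id"]] = ts
    cutoff = as_of - timedelta(days=inactivity_days)
    return [user for user, ts in last_seen.items() if ts < cutoff]

print(churn_candidates(events, as_of=datetime(2024, 10, 10)))  # -> ['u2']
```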

Pipeline Design Overview

  • Pipeline Segments:
    • Data Capture
    • Streaming Processing
    • Storage and Analytics

Data Capture Methods

  • Push Model: Application servers push events to the data infrastructure as they are generated.
  • Pull Model: Ingestion services pull (poll) data from the applications.
  • Trade-offs:
    • Push can overwhelm the downstream infrastructure during traffic spikes; pull can introduce latency between event generation and ingestion. A client-side batching sketch of the push model follows this list.
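
A minimal sketch of the push model, assuming a hypothetical HTTPS ingestion endpoint exposed by the API gateway: the client buffers events locally and pushes them in batches, which reduces request volume and the risk of overwhelming downstream infrastructure. The URL, payload shape, and batch size are assumptions.

```python
import time
import requests  # third-party HTTP client

INGEST_URL = "https://example.com/v1/clickstream"  # hypothetical gateway endpoint
BATCH_SIZE = 50

_buffer = []

def track(user_id, event_type, **attrs):
    """Buffer a click event locally; flush when the batch is full."""
    _buffer.append({"user_id": user_id, "event_type": event_type, "ts": time.time(), **attrs})
    if len(_buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Push all buffered events to the ingestion endpoint in a single request."""
    if not _buffer:
        return
    response = requests.post(INGEST_URL, json={"events": _buffer}, timeout=5)
    response.raise_for_status()
    _buffer.clear()

track("u1", "play_start", title_id="t42")
flush()  # flush any remainder, e.g., on shutdown
```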

High-Level Design

  1. Data Collection:
    • Users generate clickstream events while interacting with the application (e.g., Netflix).
    • API Gateway: Exposes endpoints for data ingestion.
  2. Event Processing:
    • Use Kafka as the streaming platform to buffer incoming events (a producer sketch follows this list).
    • Optional: Use Flink for stream processing with millisecond-level latency.
  3. Data Storage:
    • Data Lake: Organized in layers (raw, processed, access) for historical analysis and ease of querying.
    • NoSQL Database: For real-time access and performance under high event throughput (a write/read sketch follows this list).
  4. Analytics:
    • Use tools like AWS Athena for ad hoc queries directly on the Data Lake (a query sketch follows this list).
    • Consider Redshift for data warehousing.
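
For steps 1 and 2, a minimal sketch of how an ingestion handler behind the API gateway might publish incoming click events to Kafka using the kafka-python client; the broker address, topic name, and event fields are assumptions.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic name are assumptions for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

def handle_click_event(event: dict) -> None:
    """Called by the ingestion endpoint; buffers the event in Kafka."""
    # Keying by user_id keeps a given user's events ordered within one partition.
    producer.send("clickstream-events", key=event["user_id"], value=event)

handle_click_event({"user_id": "u1", "event_type": "play_pause", "title_id": "t42"})
producer.flush()  # ensure buffered messages are delivered before exit
```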
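
For step 3, one way to serve recent events with low latency is a NoSQL table keyed by user; the sketch below assumes DynamoDB with a hypothetical clickstream_recent table (partition key user_id, sort key ts), which is one possible choice rather than the design's prescribed store.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("clickstream_recent")  # hypothetical table: PK user_id, SK ts

def write_event(event: dict) -> None:
    """Persist an event for low-latency, per-user lookups."""
    table.put_item(Item={
        "user_id": event["user_id"],
        "ts": int(time.time() * 1000),       # sort key: event time in milliseconds
        "event_type": event["event_type"],
        "title_id": event.get("title_id", "unknown"),
    })

def recent_events(user_id: str, limit: int = 20) -> list:
    """Fetch the most recent events for a user, newest first."""
    response = table.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,  # descending by sort key
        Limit=limit,
    )
    return response["Items"]
```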
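
For step 4, a hedged sketch of running an ad hoc metric query against the data lake through Athena via boto3; the database, table, date column, and S3 result location are assumptions.

```python
import time
import boto3

athena = boto3.client("athena")

# Database, table, and output location are assumptions for this sketch.
QUERY = """
SELECT event_type, COUNT(*) AS event_count
FROM playback_events
WHERE event_date = DATE '2024-10-10'
GROUP BY event_type
"""

def run_query(sql: str) -> list:
    """Submit an Athena query, poll until it finishes, and return the result rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "clickstream_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]

for row in run_query(QUERY):
    print([col.get("VarCharValue") for col in row["Data"]])
```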

Additional Technologies Discussed

  • AWS Kinesis: An alternative to Kafka for streaming ingestion, well suited for clickstream data.
  • Kinesis Data Firehose: Buffers records and periodically loads them into the data lake, ensuring the data is available for historical analysis (see the sketch below).
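
If Kinesis is chosen instead of Kafka, the sketch below publishes a single event both to a Kinesis data stream (for low-latency consumers) and to a Firehose delivery stream (for buffered loading into the data lake); the stream names are assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

event = {"user_id": "u1", "event_type": "play_start", "title_id": "t42"}
data = json.dumps(event).encode("utf-8")

# Publish to a Kinesis data stream for low-latency stream processing (name assumed).
kinesis.put_record(
    StreamName="clickstream-events",
    Data=data,
    PartitionKey=event["user_id"],  # keeps a given user's events on one shard
)

# Firehose buffers records and periodically delivers them to the data lake (e.g., S3).
firehose.put_record(
    DeliveryStreamName="clickstream-to-datalake",
    Record={"Data": data + b"\n"},  # newline-delimited JSON simplifies later querying
)
```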

Challenges and Considerations

  • Cost Factors: Balance between performance and costs.
  • Schema Evolution: Importance of flexibility in data modeling, particularly with NoSQL databases.
  • Potential Bottlenecks: Need for fault tolerance and scalability in high-throughput scenarios.

Conclusion

  • Overall, the design focused on flexibility, scalability, and real-time processing to effectively handle and analyze clickstream and playback data for Netflix.
  • Suggested areas for future exploration include deeper discussions on trade-offs, bottlenecks, and fault tolerance in the pipeline.