Overview
This lecture explains data skew in Apache Spark: what causes it, why it hurts performance, and how to detect it, with worked examples using Spark jobs and the Spark UI.
What is Data Skew?
- Data skew occurs when data is unevenly distributed across Spark partitions.
- Some partitions may have much more data than others, causing unequal processing times.
Detecting Data Skew
- In the Spark UI, a job can appear stuck at its last stage because one oversized partition is still being processed.
- The event timeline shows most tasks finishing early while one runs much longer.
- The task summary shows a large gap between the minimum and maximum task durations, a telltale sign of skewed partitions.
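The same check can be done programmatically. A minimal sketch, assuming an existing DataFrame `df` and an active Spark session; `spark_partition_id` is a built-in PySpark function that reports which partition each row lives in:

```python
from pyspark.sql.functions import spark_partition_id

# Count the rows in each partition; a wide spread between the
# smallest and largest counts is a sign of skew.
(df.groupBy(spark_partition_id().alias("partition_id"))
   .count()
   .orderBy("count", ascending=False)
   .show())
```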
Example of Skewed vs. Ideal Partitioning
- If a Spark job has five partitions and one is much larger than the rest, the core processing that partition keeps working while the other cores sit idle.
- In an ideal scenario, data is evenly split and all cores finish at roughly the same time.
Operations Causing Data Skew
- Aggregation (groupBy): Grouping by a key whose values occur with very uneven frequency (e.g., one country accounts for most transactions) sends far more rows to one partition than to the others.
- Join Operations: Joining on a skewed key (e.g., a product ID or customer ID that dominates the dataset) leaves some partitions with far more data than others.
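As a concrete illustration of the aggregation case, here is a minimal sketch; the transaction data and column names are made up for the demo, with roughly 95% of rows sharing one country value:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("skew-demo").getOrCreate()

# Hypothetical data: ~95% of rows share one key value.
transactions = spark.range(1_000_000).withColumn(
    "country",
    F.when(F.rand() < 0.95, F.lit("US")).otherwise(F.lit("DE")),
)

# collect_list buffers every value per key, so the shuffle must move
# nearly all "US" rows to a single reducer task, which then runs far
# longer than the task handling "DE". (A plain count() would be
# partially aggregated map-side and would hide the skew.)
skewed = transactions.groupBy("country").agg(F.collect_list("id"))
skewed.count()
```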
Why Data Skew is a Problem
- Jobs take longer to finish, wasting developer time on debugging.
- Uneven resource utilization leads to idle resources but continued costs.
- Risk of out-of-memory errors or data spills, as Spark writes excess data to disk and reads it back, slowing processing.
Examples in Spark
- Uniform data: Using spark.range and checking partition sizes shows even row counts per partition (first sketch below).
- Skewed data: Unioning DataFrames of very different sizes and repartitioning by a key column creates partitions with unequal row counts (second sketch below).
- Join with skew: Skewed join keys (e.g., customer ID) make some tasks run much longer than others, visible in the Spark UI event timeline (third sketch below).
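First sketch: the uniform case. Assuming an active `spark` session; `spark.range` spreads rows evenly across the default parallelism:

```python
from pyspark.sql.functions import spark_partition_id

# spark.range distributes rows evenly across partitions.
uniform_df = spark.range(0, 1_000_000)

# Each partition reports roughly the same row count.
uniform_df.groupBy(spark_partition_id().alias("pid")).count().show()
```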
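Second sketch: manufacturing skew by unioning DataFrames of different sizes and repartitioning by a key column; the key names and row counts are made up for the demo:

```python
from pyspark.sql.functions import lit, spark_partition_id

# One large DataFrame with a single "hot" key and a small one
# with a "cold" key.
big = spark.range(0, 900_000).withColumn("key", lit("hot"))
small = spark.range(0, 100_000).withColumn("key", lit("cold"))

# Repartitioning by the key hashes all "hot" rows into the same
# partition, leaving one partition nine times larger than the other.
skewed_df = big.union(small).repartition("key")

skewed_df.groupBy(spark_partition_id().alias("pid")).count().show()
```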
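Third sketch: a join on a skewed customer ID. Broadcast joins and AQE are disabled here so the shuffle (and the skew) is actually visible; all table and column names are hypothetical:

```python
from pyspark.sql.functions import when, rand, lit

# Force a shuffle-based (sort-merge) join; otherwise Spark may
# broadcast the small table and avoid the skewed shuffle entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
# Keep AQE from splitting the skewed partition automatically.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Hypothetical orders table where ~90% of rows belong to customer 1.
orders = spark.range(1_000_000).withColumn(
    "customer_id",
    when(rand() < 0.9, lit(1)).otherwise((rand() * 10_000).cast("long")),
)
customers = spark.range(10_000).withColumnRenamed("id", "customer_id")

# During the join shuffle, ~900k rows land on the task handling
# customer_id = 1; it shows up as one long bar in the event timeline.
orders.join(customers, "customer_id").count()
```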
Key Terms & Definitions
- Data Skew — Uneven distribution of data across partitions in a distributed system.
- Partition — A chunk of a distributed dataset; each partition is processed as a single task on one executor core.
- Executor — A process launched for an application that runs tasks in Spark.
- Aggregation — An operation that groups data and applies functions like count or sum.
- Join Key — The field used to combine two datasets in a join operation.
- Shuffle — Redistribution of data across partitions, triggered by wide operations such as groupBy or join.
Action Items / Next Steps
- Review related videos on AQE (Adaptive Query Execution), broadcast joins, and salting for data skew solutions.
- Practice detecting and analyzing data skew in your Spark jobs using the Spark UI.
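As a preview of the AQE-based fix mentioned above, Spark 3.x can detect and split skewed partitions during joins at runtime; a minimal configuration sketch (option names are from Spark 3.x):

```python
# Enable Adaptive Query Execution so Spark re-optimizes plans at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Let AQE detect oversized partitions in sort-merge joins and split them.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```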