
Understanding Data Skew in Spark

Jul 10, 2025

Overview

This lecture explains data skew in Apache Spark: what causes it, why it matters, how to detect it, and how it appears in example Spark jobs and the Spark UI.

What is Data Skew?

  • Data skew occurs when data is unevenly distributed across Spark partitions.
  • Some partitions may have much more data than others, causing unequal processing times.

Detecting Data Skew

  • In the Spark UI, a job can appear stuck near the end of a stage because a single oversized partition keeps one task running.
  • The event timeline shows most tasks finishing early while one runs far longer.
  • The task summary shows a large gap between the minimum and maximum task durations, a telltale sign of skewed partitions (a programmatic check is sketched below).
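
Beyond the UI, a quick complement is to count rows per partition. A minimal PySpark sketch (the session setup and the spark.range stand-in for a real DataFrame are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import spark_partition_id, count

    spark = SparkSession.builder.appName("skew-detection").getOrCreate()
    df = spark.range(0, 1_000_000)  # stand-in for your real DataFrame

    # Tag each row with its partition ID, then count rows per partition;
    # a lopsided result confirms skew without opening the UI.
    (df.withColumn("pid", spark_partition_id())
       .groupBy("pid")
       .agg(count("*").alias("rows"))
       .orderBy("rows", ascending=False)
       .show())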

Example of Skewed vs. Ideal Partitioning

  • If a Spark stage has five partitions and one is far larger, the other cores finish their partitions and sit idle while a single core works through the large one.
  • In the ideal scenario, data is split evenly and all cores finish at roughly the same time.

Operations Causing Data Skew

  • Aggregation (groupBy): Grouping by a key with an uneven value distribution (e.g., a country column where one country accounts for most transactions) funnels those rows into a single shuffle partition (sketched in code after this list).
  • Join operations: Joining on a skewed key (e.g., a product ID or customer ID that dominates the data) leaves a few partitions holding far more rows than the rest.
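
To make the aggregation case concrete, here is a minimal sketch with a hypothetical transactions DataFrame where one country dominates (the 90/10 split and the column names are illustrative assumptions, not the lecture's data):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import when, rand

    spark = SparkSession.builder.appName("skewed-groupby").getOrCreate()

    # ~90% of rows share one country, so the shuffle behind the groupBy
    # piles those rows into a single partition.
    tx = spark.range(0, 1_000_000).withColumn(
        "country", when(rand(seed=42) < 0.9, "US").otherwise("CA")
    )

    # The task that handles the "US" group does most of the work.
    tx.groupBy("country").count().show()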

Why Data Skew is a Problem

  • Jobs take longer to finish, and developers lose time debugging the stragglers.
  • Uneven resource utilization leaves most cores idle while the cluster continues to accrue cost.
  • Oversized partitions risk out-of-memory errors or disk spills, where Spark writes excess data to disk and reads it back, further slowing processing.

Examples in Spark

  • Uniform data: spark.range distributes rows evenly; counting rows per partition shows near-identical counts.
  • Skewed data: Unioning DataFrames of very different sizes and repartitioning on the skewed column yields partitions with unequal row counts.
  • Join with skew: A skewed join key (e.g., one customer ID dominating the data) makes some partitions take far longer, visible as a straggler task in the Spark UI event timeline (all three demos are sketched after this list).
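
Minimal PySpark sketches of the three demos (the row counts, the "key" and "customer_id" columns, and the skew ratios are illustrative assumptions, not the lecture's exact code):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import spark_partition_id, lit, when, rand

    spark = SparkSession.builder.appName("skew-demos").getOrCreate()

    def rows_per_partition(df):
        # Row count per partition, as in the detection sketch above.
        return df.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid")

    # 1) Uniform data: spark.range splits rows evenly across partitions.
    uniform = spark.range(0, 1_000_000, numPartitions=8)
    rows_per_partition(uniform).show()

    # 2) Skewed data: union a large and a small DataFrame, then
    #    hash-repartition on a column that is mostly one value.
    big = spark.range(0, 900_000).withColumn("key", lit("hot"))
    small = spark.range(900_000, 1_000_000).withColumn("key", lit("cold"))
    skewed = big.union(small).repartition(8, "key")  # "hot" rows land together
    rows_per_partition(skewed).show()

    # 3) Join with skew: ~95% of fact rows reference one customer_id,
    #    so one join partition receives nearly all the data.
    facts = spark.range(0, 1_000_000).withColumn(
        "customer_id",
        when(rand(seed=1) < 0.95, lit(7)).otherwise((rand(seed=2) * 1000).cast("int")),
    )
    customers = spark.range(0, 1000).withColumnRenamed("id", "customer_id")
    facts.join(customers, "customer_id").count()  # watch the straggler task in the UI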

Key Terms & Definitions

  • Data Skew — Uneven distribution of data across partitions in a distributed system.
  • Partition — A chunk of a distributed dataset in Spark, processed by a single task on one core.
  • Executor — A worker process launched for a Spark application that runs its tasks.
  • Aggregation — An operation that groups data and applies functions like count or sum.
  • Join Key — The field used to combine two datasets in a join operation.
  • Shuffle — Redistribution of data across partitions, required by wide operations like groupBy or join.

Action Items / Next Steps

  • Review related videos on AQE (Adaptive Query Execution), broadcast joins, and key salting for data skew solutions (a brief config sketch follows this list).
  • Practice detecting and analyzing data skew in your Spark jobs using the Spark UI.
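
As a preview of those solutions, a hedged sketch of two common mitigations (the config keys are standard Spark 3.x settings; facts and customers refer to the hypothetical DataFrames from the join sketch above):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("skew-mitigation").getOrCreate()

    # AQE detects skewed shuffle partitions at runtime and splits them
    # (both settings default to true in recent Spark 3.x releases).
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Broadcast join: ship the small side to every executor so the
    # skewed key never has to be shuffled.
    # result = facts.join(broadcast(customers), "customer_id")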