Salting to Solve Data Skew

Jul 10, 2025

Overview

This lecture explains how "salting" is used to solve data skew in distributed data processing systems, focusing on aggregation and join operations.

What is Salting?

  • Salting means adding randomness to a key to distribute data more evenly across partitions in distributed systems.
  • The salt is typically a random number appended to the original key.
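As a minimal sketch of the idea in plain Python (not tied to any framework), a salted key can be represented as a (key, salt) pair. The helper name `salt_key` and the parameter `num_salts` are illustrative assumptions, not names from the lecture.

```python
import random

def salt_key(key, num_salts, rng):
    """Return a composite (key, salt) pair, with salt drawn uniformly from 0..num_salts-1."""
    return (key, rng.randrange(num_salts))

# Seeded generator so repeated runs produce the same salts.
rng = random.Random(0)

# Six rows that all share the key '1' now carry up to three distinct composite keys.
salted = [salt_key("1", 3, rng) for _ in range(6)]
```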

Data Skew and Its Problems

  • Data skew occurs when one key (like '1') appears far more often than the others, so the partition that receives it is overloaded.
  • Every row with the same key hashes to the same partition, so the task handling the hot key keeps running after the others finish and holds up the whole stage.
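The effect can be reproduced with a small simulation. This is a sketch under assumptions: `partition_of` is a hypothetical stand-in for a framework's hash partitioner, and the row counts are made up for illustration.

```python
import zlib
from collections import Counter

def partition_of(key, num_partitions):
    """Deterministic hash partitioner (a stand-in for the framework's own)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions

# Key 1 appears 9,000 times; keys 2..1001 appear once each.
keys = [1] * 9_000 + list(range(2, 1_002))

# Count how many rows land in each of 4 partitions.
sizes = Counter(partition_of(k, 4) for k in keys)
largest = max(sizes.values())
```

Whichever partition receives key 1 holds at least 9,000 of the 10,000 rows, while the other three share the remainder.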

Salting in Joins

  • To salt a skewed table, choose a salt number (e.g., 3), which is the number of pieces each key can be split into, and assign each row a random salt value in the range 0 to 2.
  • Create a new composite join key: (original key, salt).
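  • For the second table, attach the full list of salt values (0, 1, 2) to every row and apply an "explode" operation, so each row is replicated once per salt value and can match any salted row from the first table.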
  • Join both tables on the composite key to distribute matching keys across multiple partitions.
  • Salting ensures previously skewed keys are split and processed in parallel, reducing processing time.
  • Explode multiplies a table's row count by the salt number, so where possible apply it to the smaller table and assign random salts to the larger one to save resources.
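The steps above can be sketched in plain Python, with lists of (key, value) tuples standing in for the two tables. The function names `salted_join` and `plain_join` and the sample rows are assumptions made for illustration. In Spark the same steps would be expressed with DataFrame operations rather than Python loops.

```python
import random
from collections import defaultdict

def salted_join(big_rows, small_rows, num_salts=3, seed=0):
    """Inner-join two lists of (key, value) rows on a composite (key, salt) key."""
    rng = random.Random(seed)
    # Step 1: give every row of the skewed table a random salt in 0..num_salts-1.
    salted_big = [((k, rng.randrange(num_salts)), v) for k, v in big_rows]
    # Step 2: "explode" the other table, one copy of each row per salt value.
    exploded_small = [((k, s), v) for k, v in small_rows for s in range(num_salts)]
    # Step 3: join on the composite key (a simple in-memory hash join).
    index = defaultdict(list)
    for composite_key, v in exploded_small:
        index[composite_key].append(v)
    return [(ck[0], big_v, small_v)
            for ck, big_v in salted_big
            for small_v in index.get(ck, [])]

def plain_join(big_rows, small_rows):
    """Unsalted inner join on the original key, used here only as a reference."""
    return [(bk, bv, sv) for bk, bv in big_rows for sk, sv in small_rows if bk == sk]

big = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]   # key 1 is the skewed key
small = [(1, "X"), (2, "Y"), (3, "Z")]
joined = salted_join(big, small)
```

Each salted row matches exactly one exploded copy, so the salted join returns the same rows as the unsalted join; only the distribution of work changes.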

Salting in Aggregations

  • For aggregations like group by/count, data skew leads to uneven partition processing.
  • Assign a random salt to each row using the chosen salt number.
  • Group by both the original key and the salt, then count to get partial counts.
  • Finally, group by the original key and sum the counts to get final results.
  • This approach splits heavily skewed keys into multiple groups, ensuring balanced partition processing.
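A minimal sketch of this two-stage count in plain Python follows. The function name `salted_count` and the sample data are assumptions, and `Counter` stands in for a distributed group-by.

```python
import random
from collections import Counter

def salted_count(keys, num_salts=3, seed=0):
    """Count keys in two stages: per (key, salt), then per original key."""
    rng = random.Random(seed)
    # Stage 1: group by (key, salt) and count. A hot key is spread over
    # up to num_salts groups instead of one.
    partial = Counter((k, rng.randrange(num_salts)) for k in keys)
    # Stage 2: group by the original key and sum the partial counts.
    final = Counter()
    for (k, _salt), count in partial.items():
        final[k] += count
    return partial, final

keys = [1] * 900 + [2] * 50 + [3] * 50   # key 1 is heavily skewed
partial, final = salted_count(keys)
```

The second stage handles at most `num_salts` partial rows per key, so it stays cheap even for a heavily skewed key.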

Key Terms & Definitions

  • Data Skew — Imbalanced distribution of records across partitions, often due to repeating key values.
  • Partition — A chunk of data processed in parallel in distributed systems.
  • Shuffle Partition — A partition produced when data is redistributed (shuffled) across the cluster; the number of shuffle partitions is configurable (in Spark, via `spark.sql.shuffle.partitions`).
  • Salt — A random or pseudo-random number added to the key to break up skew.
  • Explode — A function that converts array values in a row into separate rows, often used for creating all possible salt values per key.
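To make the explode definition concrete, here is a small plain-Python sketch that mimics the behaviour on a list of dictionaries. The helper `explode` below is a hypothetical illustration, not a framework API.

```python
def explode(rows, array_field):
    """Return one output row per element of each row's array_field."""
    out = []
    for row in rows:
        for item in row[array_field]:
            new_row = dict(row)          # copy the other columns unchanged
            new_row[array_field] = item  # replace the array with a single element
            out.append(new_row)
    return out

# Each key carries the full list of salt values before exploding.
rows = [{"key": 1, "salt": [0, 1, 2]}, {"key": 2, "salt": [0, 1, 2]}]
exploded = explode(rows, "salt")
```

Two input rows with three salt values each become six rows, one per (key, salt) combination.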

Action Items / Next Steps

  • Practice implementing salting for both joins and aggregations in Spark or similar distributed systems.
  • Experiment with different salt numbers to observe effects on data distribution.
  • Review and understand partitioning and shuffle operations in your data processing framework.