Salting to Solve Data Skew

Overview

This lecture explains how "salting" is used to solve data skew in distributed data processing systems, focusing on aggregation and join operations.

Salting means adding randomness to a key to distribute data more evenly across partitions in distributed systems.
The salt is typically a random number appended to the original key.

Data skew occurs when a key (like '1') appears much more frequently, causing that key’s partition to be overloaded.
Overloaded partitions slow down processing due to uneven data distribution after hashing and partitioning.

To salt a skewed table, choose a salt number (e.g., 3) and assign a random salt value (0 to 2) to each row.
Create a new composite join key: (original key, salt).
For the second table, generate all possible salt values for each key and use an "explode" operation.
Join both tables on the composite key to distribute matching keys across multiple partitions.
Salting ensures previously skewed keys are split and processed in parallel, reducing processing time.
If one table is smaller, apply the explode operation to the smaller table to save resources.

For aggregations like group by/count, data skew leads to uneven partition processing.
Assign a random salt to each row using the chosen salt number.
Group by both value and salt, then count.
Finally, group by the original key and sum the counts to get final results.
This approach splits heavily skewed keys into multiple groups, ensuring balanced partition processing.

Data Skew — Imbalanced distribution of records across partitions, often due to repeating key values.
Partition — A chunk of data processed in parallel in distributed systems.
Shuffle Partition — The number of partitions data is distributed into during shuffling (data re-distribution).
Salt — A random or pseudo-random number added to the key to break up skew.
Explode — A function that converts array values in a row into separate rows, often used for creating all possible salt values per key.

Practice implementing salting for both joins and aggregations in Spark or similar distributed systems.
Experiment with different salt numbers to observe effects on data distribution.
Review and understand partitioning and shuffle operations in your data processing framework.