Overview
This lecture explains how "salting" is used to mitigate data skew in distributed data processing systems, focusing on aggregation and join operations.
What is Salting?
- Salting means adding randomness to a key to distribute data more evenly across partitions in distributed systems.
- The salt is typically a random number appended to the original key.
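As a minimal sketch of the idea above, the snippet below attaches a random salt to a key in plain Python. The names `NUM_SALTS` and `salt_key` are hypothetical and chosen only for illustration; a real job would apply the same idea with its framework's own column functions.

```python
import random

NUM_SALTS = 3  # assumed number of distinct salt values (0, 1, 2)

def salt_key(key, num_salts=NUM_SALTS, rng=random):
    """Return a composite (key, salt) with salt drawn uniformly from 0..num_salts-1."""
    return (key, rng.randrange(num_salts))

rng = random.Random(42)  # seeded only so the sketch is repeatable
salted = [salt_key("1", rng=rng) for _ in range(6)]
print(salted)  # six (key, salt) tuples, all with key "1" and a salt in 0..2
```

Because the original key is kept inside the composite, it can be recovered later by dropping the salt.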
Data Skew and Its Problems
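- Data skew occurs when one key (like '1') appears much more frequently than the others, causing that key’s partition to be overloaded.
- Rows are assigned to partitions by hashing the key, so every row with the hot key lands in the same partition. That overloaded partition slows down processing, because the job cannot finish until its task completes.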
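The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not any framework's actual partitioner: `partition_of` is a hypothetical hash partitioner built on SHA-256, and the key distribution (90% of rows on key '1') is made up for the example.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # assumed partition count for the illustration

def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Hypothetical hash partitioner: stable hash of the key, modulo partition count."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# 9,000 rows with the hot key '1', plus 1,000 rows spread over keys '2'..'11'
keys = ["1"] * 9000 + [str(k) for k in range(2, 12)] * 100

rows_per_partition = Counter(partition_of(k) for k in keys)
for p in range(NUM_PARTITIONS):
    print(f"partition {p}: {rows_per_partition[p]} rows")
```

Whichever partition key '1' hashes to receives at least 9,000 of the 10,000 rows, while the remaining partitions share the rest.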
Salting in Joins
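- To salt a skewed table, choose the number of salt values N (e.g., 3) and assign each row a random salt from 0 to N-1 (here, 0 to 2).
- Create a new composite join key: (original key, salt).
- For the second table, replicate each row once per possible salt value: attach the array of all salt values (0 to 2) to each row, then use an "explode" operation to turn it into one row per salt, each with the composite key (original key, salt).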
- Join both tables on the composite key to distribute matching keys across multiple partitions.
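- Rows of a previously skewed key are now split across up to N composite keys and processed in parallel, which typically reduces processing time.
- Explode multiplies a table's row count by the number of salt values, so if one table is smaller, apply the explode operation to the smaller table to save resources.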
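The join steps above can be sketched in plain Python. The tables `orders` (skewed) and `customers` (small) and all variable names are hypothetical, and the dictionary lookup stands in for a distributed hash join; the sketch also assumes each key appears only once in the small table.

```python
import random

NUM_SALTS = 3  # assumed salt count

# Hypothetical data: key "1" dominates the large table
orders = [("1", f"order-{i}") for i in range(6)] + [("2", "order-6")]
customers = [("1", "Alice"), ("2", "Bob")]  # small table, one row per key

rng = random.Random(0)  # seeded only so the sketch is repeatable

# Step 1: salt the skewed table with a random salt per row
salted_orders = [((key, rng.randrange(NUM_SALTS)), order_id)
                 for key, order_id in orders]

# Step 2: "explode" the small table into one row per (key, salt) combination
exploded_customers = [((key, salt), name)
                      for key, name in customers
                      for salt in range(NUM_SALTS)]

# Step 3: join on the composite key, then drop the salt from the result
lookup = dict(exploded_customers)
joined = [(composite[0], order_id, lookup[composite])
          for composite, order_id in salted_orders
          if composite in lookup]

for row in joined:
    print(row)
```

Every order still finds its customer, because the exploded table contains a match for whichever salt a row happened to receive. The cost is that the small table grows by a factor of `NUM_SALTS`.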
Salting in Aggregations
- For aggregations like group by/count, data skew leads to uneven partition processing.
- Assign a random salt to each row using the chosen salt number.
- Group by both the original key and the salt, then count. This produces one partial count per (key, salt) pair.
- Finally, group by the original key and sum the counts to get final results.
- This approach splits heavily skewed keys into multiple groups, ensuring balanced partition processing.
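A minimal sketch of this two-stage count, using Python's `Counter` in place of a distributed group-by. The data and names are made up for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 3  # assumed salt count

# Hypothetical skewed column: key "1" accounts for 90% of the rows
keys = ["1"] * 9000 + ["2"] * 600 + ["3"] * 400

rng = random.Random(7)  # seeded only so the sketch is repeatable

# Stage 1: group by (key, salt) and count; the hot key is split into several groups
partial_counts = Counter((key, rng.randrange(NUM_SALTS)) for key in keys)

# Stage 2: group by the original key and sum the partial counts
final_counts = Counter()
for (key, _salt), count in partial_counts.items():
    final_counts[key] += count

print(dict(partial_counts))
print(dict(final_counts))
```

The final counts equal those of a plain group-by on the key. The same pattern works for other aggregates that can be combined from partial results, such as sum, min, and max; an average would need a partial sum and a partial count carried through both stages.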
Key Terms & Definitions
- Data Skew — Imbalanced distribution of records across partitions, often due to repeating key values.
- Partition — A chunk of data processed in parallel in distributed systems.
- Shuffle Partition — A partition produced when data is redistributed across the cluster during a shuffle. The number of shuffle partitions is configurable (in Spark SQL, via `spark.sql.shuffle.partitions`).
- Salt — A random or pseudo-random number added to the key to break up skew.
- Explode — A function that converts array values in a row into separate rows, often used for creating all possible salt values per key.
Action Items / Next Steps
- Practice implementing salting for both joins and aggregations in Spark or similar distributed systems.
- Experiment with different salt numbers to observe effects on data distribution.
- Review and understand partitioning and shuffle operations in your data processing framework.
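As a starting point for experimenting with different salt numbers, the simulation below reports the size of the largest partition for several salt counts. It is a sketch only: `partition_of` is a hypothetical SHA-256 based partitioner, the data is synthetic, and the numbers it prints say nothing about a real cluster.

```python
import hashlib
import random
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count

def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Hypothetical hash partitioner: stable hash of the key, modulo partition count."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def largest_partition(keys, num_salts, seed=0):
    """Salt every key, partition by the composite key, return the biggest partition size."""
    rng = random.Random(seed)
    sizes = Counter(partition_of((k, rng.randrange(num_salts))) for k in keys)
    return max(sizes.values())

# Synthetic skewed data: 9,000 rows on key '1', 1,000 rows over keys '2'..'11'
keys = ["1"] * 9000 + [str(k) for k in range(2, 12)] * 100

for num_salts in (1, 2, 4, 8, 16):
    print(f"salts={num_salts:>2}  largest partition={largest_partition(keys, num_salts)}")
```

With a salt count of 1 the salt is always 0, so the run behaves like the unsalted case. Larger salt counts spread the hot key further, but they also mean more replication on the exploded side of a join, so the choice is a trade-off.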