Overview
This lecture covers the different join strategies used by Apache Spark to optimize joins between DataFrames, explaining when and how each strategy is applied for efficient job execution.
Introduction to Join Strategies in Spark
- Join strategy refers to the algorithm Spark uses internally to join two DataFrames.
- The choice of join strategy affects both query optimization and execution time.
- Common join types include: broadcast hash join, shuffle hash join, shuffle sort merge join, Cartesian join, and broadcast nested loop join.
Broadcast Hash Join
- Broadcast hash join is used when one DataFrame is smaller than the broadcast threshold (10 MB by default).
- The smaller DataFrame is broadcasted to all worker nodes and stored in memory.
- Reduces network shuffling since data is available locally for hash joining.
- Only supports equi joins (joins using equality).
- Not used for full outer joins; works best when broadcasting a small DataFrame.
- The threshold can be configured with spark.sql.autoBroadcastJoinThreshold, as shown in the sketch after this list.
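A minimal PySpark sketch (the DataFrame names and data are illustrative) showing the explicit broadcast hint and the threshold configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Auto-broadcast any side estimated below 10 MB; set to -1 to disable auto-broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

orders = spark.createDataFrame(
    [(1, "US"), (2, "IN"), (3, "US")], ["order_id", "country_code"])
countries = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")], ["country_code", "country_name"])

# Explicit hint: the small side is copied to every executor, so the large side
# is joined locally without shuffling.
joined = orders.join(broadcast(countries), "country_code")
joined.explain()   # the physical plan should show BroadcastHashJoin
```

Spark applies the same strategy automatically when the optimizer's size estimate for one side falls under the threshold; the hint only makes the choice explicit.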
Shuffle Hash Join
- Used when a broadcast join is not possible and sort merge join is disabled (see the sketch after this list).
- Data is shuffled so all records with the same join key arrive at the same executor.
- Hash join happens after the shuffle.
- More expensive than broadcast join due to the shuffling step.
- Only supports equi joins; does not support full outer joins.
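A sketch of forcing this path in PySpark (Spark 3.x; the DataFrames are throwaway placeholders built with spark.range):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rule out broadcast and tell the planner not to prefer sort merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

left = spark.range(1_000_000).withColumnRenamed("id", "k")
right = spark.range(1_000).withColumnRenamed("id", "k")

# Both sides are shuffled on k; the smaller side of each partition is hashed.
joined = left.join(right.hint("shuffle_hash"), "k")   # hint available in Spark 3.0+
joined.explain()   # look for ShuffledHashJoin in the physical plan
```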
Shuffle Sort Merge Join
- The default strategy when sort merge join is preferred (spark.sql.join.preferSortMergeJoin, true by default) and a broadcast join is not possible; see the sketch after this list.
- Data is shuffled and sorted on join keys before merging.
- Supports all join types (inner, left, right, full outer, anti-join).
- Suitable when join keys are sortable.
- More costly than broadcast hash join but supports a broader range of joins.
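A sketch of the default path once broadcasting is ruled out (PySpark; the sizes are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)   # disable auto-broadcast

left = spark.range(1_000_000).withColumnRenamed("id", "k")
right = spark.range(1_000_000).withColumnRenamed("id", "k")

# Both sides are shuffled on k, sorted, then merged; full outer joins are supported.
joined = left.join(right, "k", "full_outer")
joined.explain()   # the physical plan should show SortMergeJoin
```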
Cartesian Join
- Executes a Cartesian product between two DataFrames.
- Supports both equi and non-equi conditions, but only inner joins (see the sketch after this list).
- Very expensive; matches each row from one DataFrame with every row from the other.
- Generally not preferred due to high resource usage.
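A tiny sketch of an explicit Cartesian product in PySpark (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

colors = spark.createDataFrame([("red",), ("blue",)], ["color"])
sizes = spark.createDataFrame([("S",), ("M",), ("L",)], ["size"])

# Every left row pairs with every right row: 2 x 3 = 6 output rows.
combos = colors.crossJoin(sizes)
combos.show()
combos.explain()   # the physical plan should show CartesianProduct
```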
Broadcast Nested Loop Join
- Used for both equi and non-equi joins and supports all join types.
- Broadcasts one DataFrame and compares every record in a nested loop fashion.
- No sorting involved; slower and more costly than the other join strategies (see the sketch after this list).
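A sketch of a non-equi (range) join that typically falls back to this strategy (PySpark; the table and column names are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.getOrCreate()

events = spark.createDataFrame([(1, 5), (2, 42)], ["event_id", "value"])
buckets = spark.createDataFrame(
    [(0, 10, "low"), (10, 100, "high")], ["lo", "hi", "label"])

# A range condition cannot be hashed or merged on a single key, so Spark
# broadcasts the small side and compares every pair of rows in a nested loop.
joined = events.join(
    broadcast(buckets),
    (col("value") >= col("lo")) & (col("value") < col("hi")),
    "left")
joined.explain()   # the physical plan should show BroadcastNestedLoopJoin
```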
Default Join Strategy Selection
- For equi joins, Spark's default preference order is: broadcast hash join > sort merge join > shuffle hash join > Cartesian join (see the hint sketch after this list).
- For non-equi joins: broadcast nested loop join is preferred over Cartesian join.
- Strategies are chosen based on DataFrame size, configuration, and join type.
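For reference, the Spark 3.x join hints map onto these strategies roughly as follows (a sketch; a and b are placeholder DataFrames):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.range(100).withColumnRenamed("id", "k")
b = spark.range(100).withColumnRenamed("id", "k")

a.join(b.hint("broadcast"), "k").explain()             # broadcast hash join
a.join(b.hint("merge"), "k").explain()                 # shuffle sort merge join
a.join(b.hint("shuffle_hash"), "k").explain()          # shuffle hash join
a.join(b.hint("shuffle_replicate_nl"), "k").explain()  # Cartesian product (inner joins only)
```

Hints are requests, not commands: Spark still falls back to a compatible strategy if the hinted one cannot be applied to the join type or condition.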
Key Terms & Definitions
- Equi Join — Join where the keys are compared for equality (e.g., ==).
- Broadcast — Distributing a small DataFrame to all nodes to avoid shuffling.
- Shuffling — Redistributing data across partitions/nodes to align join keys.
- Cartesian Join — Produces all possible row pair combinations from two DataFrames.
- Nested Loop Join — Checks every record in one DataFrame against every record in another.
Action Items / Next Steps
- Review the Spark configurations spark.sql.autoBroadcastJoinThreshold and spark.sql.join.preferSortMergeJoin (see the sketch after this list).
- Practice implementing different join types and inspecting execution plans.
- Understand when to adjust join strategies for job performance.
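A small sketch for checking the current configuration values and reading the chosen strategy off a plan (PySpark):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current values are returned as strings; the broadcast threshold defaults to 10 MB.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
print(spark.conf.get("spark.sql.join.preferSortMergeJoin"))

# Join two throwaway DataFrames and inspect the physical plan.
a = spark.range(10).withColumnRenamed("id", "k")
b = spark.range(10).withColumnRenamed("id", "k")
a.join(b, "k").explain()   # e.g. BroadcastHashJoin or SortMergeJoin
```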