Spark DAG and Operations

Jul 10, 2025

Overview

This lecture explains how Apache Spark creates and manages Directed Acyclic Graphs (DAGs) for different operations such as reading files, narrow and wide transformations, joins, and aggregation. The focus is on understanding Spark's internal execution mechanisms to optimize data processing jobs.

Reading Files in Spark

  • Reading a Parquet file typically generates two jobs: one to read the file metadata and one for the action itself (e.g., show).
  • The metadata includes details such as the partitions, columns, sizes, and column types that Spark uses for optimization.
  • Each part file of the Parquet dataset is treated as a partition; Spark reads only as many partitions as the action requires.
  • Spark batches rows internally for efficiency, but a batch is not the same thing as a partition (see the sketch after this list).
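
A minimal PySpark sketch of this behavior, assuming a hypothetical Parquet file at /tmp/events.parquet; with the Spark UI open, the metadata read and the show() appear as separate jobs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# First job (roughly): Spark reads the Parquet footers to pick up the
# schema, column types, and partition layout.
df = spark.read.parquet("/tmp/events.parquet")  # hypothetical path

# Second job: show() is an action, and it scans only as many partitions
# as it needs to produce the first 20 rows.
df.show()
```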

Narrow Transformations

  • Narrow transformations (e.g., filter, adding/modifying columns, select) do not trigger shuffles.
  • Writing a DataFrame is an action: it triggers a job that scans every partition, not just the few a show would need.
  • In the DAG, column operations are grouped under a single "Project" node, analogous to a SQL SELECT clause.
  • Write operations can be simulated for benchmarking with the "noop" format, which executes the plan without persisting output (sketched below).
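
A sketch tying these points together, reusing the hypothetical df from the previous example (the column names are made up); the filter, withColumn, and select are all narrow and fuse into one stage:

```python
from pyspark.sql import functions as F

# Narrow transformations: no shuffle, so they all appear inside a single
# stage, with the column work grouped under one "Project" node in the DAG.
out = (
    df.filter(F.col("amount") > 0)                     # hypothetical column
      .withColumn("amount_usd", F.col("amount") * 1.1)
      .select("user_id", "amount_usd")
)

# The write is an action: it triggers a job that scans every partition.
# The "noop" format executes the full plan but discards the output, which
# is useful for timing transformations without writing to a real sink.
out.write.format("noop").mode("overwrite").save()
```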

Wide Transformations: Joins

  • A non-broadcast join usually creates multiple jobs: reading each side, shuffling, and the join itself.
  • Sort-merge joins shuffle both sides on the join key and therefore span multiple stages.
  • With AQE, Spark adaptively coalesces shuffle partitions and splits skewed partitions for better balance.
  • Before execution, the plan header shows "AdaptiveSparkPlan isFinalPlan=false"; after execution, it becomes isFinalPlan=true (see the example below).
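
One way to observe this, sketched with hypothetical tables: setting spark.sql.autoBroadcastJoinThreshold to -1 disables broadcast joins, which forces a sort-merge join, and explain() lets you watch the AdaptiveSparkPlan header flip once the query has run:

```python
# Disable auto-broadcast so the join falls back to sort-merge.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

orders = spark.read.parquet("/tmp/orders.parquet")  # hypothetical paths
users = spark.read.parquet("/tmp/users.parquet")

joined = orders.join(users, "user_id")  # hypothetical join key

# Before execution the plan header reads "AdaptiveSparkPlan isFinalPlan=false".
joined.explain()

# Run the query; AQE can now coalesce shuffle partitions and split skewed
# ones based on runtime statistics. (collect() is fine for a small demo.)
joined.collect()

# Re-printing the plan should now show "isFinalPlan=true".
joined.explain()
```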

Broadcast Joins

  • A broadcast join is chosen when one DataFrame is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); the smaller table is shipped to every executor.
  • Broadcast joins require only two jobs: one to broadcast the small table and one for the join itself.
  • The join logic and the final column selection both appear as nodes in the DAG (example below).
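
A minimal broadcast-join sketch using the same hypothetical tables; the broadcast() hint from pyspark.sql.functions forces the small side to be shipped to every executor regardless of the size threshold:

```python
from pyspark.sql.functions import broadcast

# Hint that `users` should be broadcast. Every executor gets a full copy,
# so the large side never needs to be shuffled.
joined = orders.join(broadcast(users), "user_id")

# The physical plan shows BroadcastHashJoin instead of SortMergeJoin.
joined.explain()
```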

Aggregations (Group By)

  • A group by with count/sum triggers two jobs: one to read the data and one for the shuffle-based aggregation.
  • Spark first performs a partial aggregation locally within each partition, then shuffles rows by the grouping key, and finishes with a final aggregate.
  • AQE coalesces shuffle partitions when many of them are empty or small (see the plan in the sketch below).
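
A sketch of such an aggregation (hypothetical columns throughout); in the physical plan this shows up as a partial HashAggregate, an Exchange (the shuffle on the grouping key), and a final HashAggregate:

```python
from pyspark.sql import functions as F

per_user = orders.groupBy("user_id").agg(
    F.count("*").alias("n_orders"),
    F.sum("amount").alias("total_amount"),
)

# Plan reads roughly: HashAggregate(partial) -> Exchange -> HashAggregate(final)
per_user.explain()
```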

Aggregation: Group By with Count Distinct

  • A group by with count distinct needs two shuffles and therefore three stages.
  • Spark first deduplicates locally, then shuffles on the pair of group key and distinct column so that duplicates land together, and deduplicates again.
  • A second shuffle on the group key alone then combines the partial counts into the final distinct count per group (sketch below).
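
The same pattern with a distinct aggregate, again as a sketch; the plan should now contain two Exchange nodes, one keyed on the (group key, distinct column) pair and one on the group key alone:

```python
from pyspark.sql import functions as F

distinct_products = orders.groupBy("user_id").agg(
    F.countDistinct("product_id").alias("n_products")  # hypothetical column
)

# Two shuffles -> three stages, matching the description above.
distinct_products.explain()
```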

Key Terms & Definitions

  • Spark DAG (Directed Acyclic Graph) — The graph of transformations and actions that Spark builds as the execution plan for a query.
  • Partition — A slice of the data processed by a single Spark task.
  • Batch — A group of rows processed together; not the same as a partition.
  • Narrow Transformation — Operations where each partition of the parent RDD is used by at most one partition of the child RDD.
  • Wide Transformation — Operations that require shuffling data across the network, like joins and groupBy.
  • Shuffle — Data transfer across partitions, typically needed for wide transformations.
  • AQE (Adaptive Query Execution) — Spark 3.0 feature for optimizing query plans based on runtime statistics.
  • Broadcast Join — A join where a small table is sent to all nodes to avoid shuffles.
  • Hash Aggregate — An aggregation method using hash tables for grouping and aggregation.

Action Items / Next Steps

  • Review Spark query plans and DAG visualizations from your own Spark jobs.
  • Experiment with narrow and wide transformations, especially joins and groupBy, on sample datasets.
  • Read up on Adaptive Query Execution in Spark documentation.