Transformations and Actions in Spark

Jul 16, 2024

Overview

  • Discussing transformations and actions in Spark with respect to RDD.
  • Prerequisite concepts of Spark and RDD are covered in previous videos (available in the playlist).
  • Video link and further resources are provided in the description of the video.

Key Points

Transformations and Actions

  • Transformations: Operations that take an RDD and return a new RDD. Examples: map, filter.
  • Actions: Operations that run a computation on an RDD and return a result to the driver (or write it out). Examples: collect, show, saveAsTextFile.
  • Lazy Evaluation: Spark only computes the transformations when an action is called.

Identifying Transformations and Actions

  • You can often tell from the code whether an operation is a transformation or an action just by its name.
  • map and filter are transformations.
  • collect, show, and saveAsTextFile are actions.

Lazy Evaluation

  • Spark utilizes lazy evaluation to optimize resource usage.
  • Example: when you chain read, map, and filter, nothing actually executes until an action (such as saveAsTextFile) is called.
  • Why Lazy Evaluation?: To avoid unnecessary use of memory and computation resources.
  • When an action is encountered, Spark traces the lineage backwards (bottom to top) to determine which transformations must run.

Types of Transformations

Narrow Transformations

  • Transformations where each partition of the parent RDD is used by at most one partition of the child RDD.
  • Examples: map, filter.

Wide Transformations

  • Transformations where each child partition may depend on multiple parent partitions.
  • Examples: reduceByKey, groupByKey.
  • Wide transformations involve shuffling data across the network.

Example: Word Count Program

  1. Narrow Transformations: read, flatMap, map (executed sequentially without triggering any job in Spark UI until an action is performed).
  2. Wide Transformation: reduceByKey (creates a new stage due to shuffling).
  3. Action: collect (triggers the execution of the DAG).

Stages and DAG

  • Stages in Spark are determined by the type of transformations: narrow transformations are pipelined within one stage, while each wide transformation starts a new stage.
  • DAG (Directed Acyclic Graph): Shows the sequence of transformations and the lineage of each RDD.
  • Spark maintains the DAG to recompute lost data and ensure fault tolerance.

Practical Demonstration

  • Spark UI shows the stages involved in the word count example with read, map, flatMap in one stage, and reduceByKey in another due to shuffling.
  • Demonstrates how jobs are only triggered by actions, not by transformations alone.

Miscellaneous

  • Detailed explanation of each concept is planned for subsequent videos.
  • Encouragement to subscribe, share, and connect on social media for more resources and updates.