Transformations and Actions in Spark

Jul 16, 2024

Overview

  • Discussing transformations and actions in Spark with respect to RDD.
  • Prerequisite concepts of Spark and RDD are covered in previous videos (available in the playlist).
  • Video link and further resources are provided in the description of the video.

Key Points

Transformations and Actions

  • Transformations: Operations that take an RDD and return a new RDD. Examples: map, filter.
  • Actions: Operations that run a computation on an RDD and return a result to the driver (or write it out). Examples: collect, show, saveAsTextFile.
  • Lazy Evaluation: Spark only computes the transformations when an action is called.

Identifying Transformations and Actions

  • You can often tell from the code whether an operation is a transformation or an action just by its name.
  • map and filter are transformations.
  • collect, show, and saveAsTextFile are actions.

Lazy Evaluation

  • Spark utilizes lazy evaluation to optimize resource usage.
  • Example: when you chain read, map, and filter, nothing actually executes until an action (such as saveAsTextFile) is called.
  • Why Lazy Evaluation?: To avoid unnecessary use of memory and computation resources.
  • When an action is encountered, Spark traces the lineage backwards (bottom to top) to determine which transformations must run.

Types of Transformations

Narrow Transformations

  • Transformations where each partition of the parent RDD is used by at most one partition of the child RDD.
  • Examples: map, filter.

Wide Transformations

  • Transformations where each child partition may depend on multiple parent partitions.
  • Examples: reduceByKey, groupByKey.
  • Wide transformations involve shuffling data across the network.

Example: Word Count Program

  1. Narrow Transformations: read, flatMap, map (executed sequentially without triggering any job in Spark UI until an action is performed).
  2. Wide Transformation: reduceByKey (creates a new stage due to shuffling).
  3. Action: collect (triggers the execution of the DAG).

Stages and DAG

  • Stages in Spark are determined by the type of transformations: narrow transformations are pipelined within one stage, while each wide transformation starts a new stage.
  • DAG (Directed Acyclic Graph): Shows the sequence of transformations and the lineage of each RDD.
  • Spark maintains the DAG to recompute lost data and ensure fault tolerance.

Practical Demonstration

  • Spark UI shows the stages involved in the word count example with read, map, flatMap in one stage, and reduceByKey in another due to shuffling.
  • Demonstrates how jobs are only triggered by actions, not by transformations alone.

Miscellaneous

  • Detailed explanation of each concept is planned for subsequent videos.
  • Encouragement to subscribe, share, and connect on social media for more resources and updates.