Spark RDD Transformations Overview

Jul 10, 2025

Overview

This lecture covers transformation operations in Apache Spark RDDs, explaining their types, characteristics, and Python implementation examples.

Transformations in RDD

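  • A transformation is an operation applied to an RDD that produces a new RDD as output; the input RDD is immutable and is left unchanged.
  • Transformations are "lazy": calling one only records the operation, and no data is computed until an action (such as collect or count) is applied.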
  • Spark tracks transformations using a Directed Acyclic Graph (DAG) for execution planning.

Types of Transformations

  • Transformations are classified as "narrow" or "wide" based on data dependencies.
  • Narrow Transformations: Each output partition is computed from a single input partition, so no data shuffling is needed.
  • Wide Transformations: Output partitions depend on multiple input partitions and require data shuffling.

Narrow Transformation Examples

  • Map: Applies a function to each element, e.g., multiplying by 2.
  • FlatMap: Maps each input element to zero or more output elements and flattens the results, so the number of elements can change while the number of partitions stays the same.
  • Filter: Selects elements based on a condition, e.g., even numbers.
  • Union: Combines two RDDs into one, keeping duplicates; no deduplication or sorting is performed.
  • Sample: Draws a random sample of roughly a given fraction of an RDD, with an option for sampling with replacement and an optional seed for reproducibility.

Wide Transformation Examples

  • GroupBy: Groups elements based on a function (e.g., first letter or a modulo operation).
  • Intersection: Finds elements common to two RDDs; the result contains no duplicates, even if the inputs did.
  • Subtract: Returns the elements of one RDD that do not appear in the other.
  • Distinct: Returns unique elements from an RDD.

Key Terms & Definitions

  • RDD (Resilient Distributed Dataset) — The core data structure in Spark, representing a distributed collection of elements.
  • Lazy Evaluation — Transformations are only computed when an action triggers execution.
  • DAG (Directed Acyclic Graph) — Execution plan built from a series of transformations for optimized computation.
  • Shuffling — Redistribution of data across partitions (and often across nodes); it occurs during wide transformations.

Action Items / Next Steps

  • Review and practice with Python code for map, flatMap, filter, union, sample, groupBy, intersection, subtract, and distinct.
  • Prepare for the next lecture on key-value RDD transformations, including reduceByKey, groupByKey, and join.