Overview
This lecture covers transformation operations on Apache Spark RDDs: their types and characteristics, with implementation examples in Python.
Transformations in RDD
- A transformation is an operation applied to an RDD that produces a new RDD as output.
- Transformations are "lazy": Spark records them but does not compute the new RDD's data until an action (e.g., collect or count) is called.
- Spark tracks transformations using a Directed Acyclic Graph (DAG) for execution planning.
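The laziness described above can be sketched in plain Python, without Spark. The `LazyDataset` class and its `lineage` attribute are hypothetical names invented for this illustration, not part of the Spark API. The class only records each transformation and runs the recorded steps when the action `collect()` is called. The linear list of steps stands in for the DAG, which in real Spark can also branch and merge.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage  # recorded (name, function) steps, oldest first

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._data, self.lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, only the step is recorded.
        return LazyDataset(self._data, self.lineage + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        items = iter(self._data)
        for name, fn in self.lineage:
            items = map(fn, items) if name == "map" else filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)  # side effect that reveals when the work happens
    return x * 2

numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double)            # nothing is computed yet
big = doubled.filter(lambda x: x > 4)    # still nothing is computed
calls_before_action = len(calls)         # double() has not been called
result = big.collect()                   # the action triggers execution
```

Each transformation returns a new object and leaves its parent unchanged, which mirrors the immutability of RDDs. `double` is not called until `collect()` runs.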
Types of Transformations
- Transformations are classified as "narrow" or "wide" based on data dependencies.
- Narrow Transformations: Each output partition is computed from a single input partition, so no data is shuffled.
- Wide Transformations: Output partitions depend on multiple input partitions and require data shuffling.
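The difference between the two dependency types can be made concrete with a small plain-Python model in which an RDD is a list of partitions (a list of lists). The helper names `narrow_map` and `wide_group_by` are hypothetical, and routing records by `key % num_partitions` is an assumption that keeps the demo deterministic for integer keys. This sketch shows the idea only and does not reflect how Spark implements partitioning.

```python
from collections import defaultdict

def narrow_map(partitions, fn):
    # Narrow: each output partition is built from exactly one input partition.
    return [[fn(x) for x in part] for part in partitions]

def wide_group_by(partitions, key_fn, num_partitions):
    # Wide: every record is routed by its key, so one output partition may
    # receive records from many input partitions (the "shuffle").
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    sources = [set() for _ in range(num_partitions)]
    for src_index, part in enumerate(partitions):
        for x in part:
            key = key_fn(x)
            target = key % num_partitions  # assumes integer keys
            buckets[target][key].append(x)
            sources[target].add(src_index)  # remember where the record came from
    return [dict(b) for b in buckets], sources

partitions = [[1, 2, 3], [4, 5, 6]]
doubled = narrow_map(partitions, lambda x: x * 2)
grouped, sources = wide_group_by(partitions, lambda x: x % 2, num_partitions=2)
```

Here `doubled` keeps the same partition layout as its input, while `sources` records that each output partition of the grouping drew records from both input partitions.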
Narrow Transformation Examples
- Map: Applies a function to each element, e.g., multiplying by 2.
- FlatMap: Maps each input element to zero or more output elements and flattens the results, so the number of elements can change (the number of partitions does not).
- Filter: Selects elements based on a condition, e.g., even numbers.
- Union: Combines two RDDs into one, keeping duplicates; the order of the result should not be relied on.
- Sample: Draws a random sample from an RDD, given whether to sample with replacement, the fraction to sample, and an optional seed for reproducibility.
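The semantics of these five operations can be tried out on ordinary Python lists before running them on a cluster. The sketch below is a plain-Python analogue, with the corresponding PySpark call named in a comment beside each step. The sample data and the `bernoulli_sample` helper are hypothetical, and the helper is only a rough model of sampling without replacement, not Spark's sampler.

```python
import random
from itertools import chain

numbers = [1, 2, 3, 4, 5, 6]

# map: exactly one output per input.
# PySpark: rdd.map(lambda x: x * 2)
doubled = [x * 2 for x in numbers]

# flatMap: zero or more outputs per input, flattened into one collection.
# PySpark: rdd.flatMap(lambda x: [x] * (x % 3))
repeated = list(chain.from_iterable([x] * (x % 3) for x in numbers))

# filter: keep only the elements that satisfy a condition.
# PySpark: rdd.filter(lambda x: x % 2 == 0)
evens = [x for x in numbers if x % 2 == 0]

# union: concatenate two collections; duplicates are kept.
# PySpark: rdd_a.union(rdd_b)
combined = [1, 2, 3] + [3, 4]

# sample (without replacement): keep each element with probability `fraction`.
# PySpark: rdd.sample(False, 0.5, 42)
def bernoulli_sample(data, fraction, seed):
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [x for x in data if rng.random() < fraction]

sample_a = bernoulli_sample(numbers, 0.5, seed=42)
sample_b = bernoulli_sample(numbers, 0.5, seed=42)  # same seed, same sample
```

In this model the flatMap step drops 3 and 6 entirely and emits 2 and 5 twice, and the sample size varies from draw to draw because `fraction` is a per-element probability rather than an exact count.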
Wide Transformation Examples
- GroupBy: Groups elements by the key a function returns for each one (e.g., a word's first letter or a number modulo 2).
- Intersection: Finds elements common to two RDDs.
- Subtract: Gets elements present in one RDD but not another.
- Distinct: Returns unique elements from an RDD.
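As a plain-Python analogue of the four operations above, the sketch below works on local lists, with the corresponding PySpark call named in a comment beside each step. The sample data is hypothetical. The results are sorted only to make them easy to compare; Spark does not guarantee any particular output order for these operations.

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# groupBy: collect the elements that share a key (here, the first letter).
# PySpark: rdd.groupBy(lambda w: w[0])  -> (key, iterable of elements) pairs
groups = defaultdict(list)
for w in words:
    groups[w[0]].append(w)
by_letter = dict(groups)

a = [1, 2, 2, 3, 4]
b = [2, 4, 5]

# intersection: elements present in both collections, without duplicates.
# PySpark: rdd_a.intersection(rdd_b)
common = sorted(set(a) & set(b))

# subtract: elements of `a` that do not appear in `b`.
# PySpark: rdd_a.subtract(rdd_b)
b_values = set(b)
only_in_a = [x for x in a if x not in b_values]

# distinct: each value once.
# PySpark: rdd.distinct()
unique = sorted(set(a))
```

All four operations need to compare elements that may sit in different partitions, so on a cluster equal elements or keys must first be brought together, which is why they are wide.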
Key Terms & Definitions
- RDD (Resilient Distributed Dataset) — The core data structure in Spark, representing a distributed collection of elements.
- Lazy Evaluation — Transformations are only computed when an action triggers execution.
- DAG (Directed Acyclic Graph) — Execution plan built from a series of transformations for optimized computation.
- Shuffling — Redistribution of data across partitions or nodes; it occurs during wide transformations.
Action Items / Next Steps
- Review and practice with Python code for map, flatMap, filter, union, sample, groupBy, intersection, subtract, and distinct.
- Prepare for the next lecture on key-value RDD transformations, including reduceByKey, groupByKey, and join.