RDD in Spark

Jul 16, 2024

Overview

  • RDD: Stands for Resilient Distributed Dataset.
  • The fundamental data abstraction in Apache Spark.
  • Datasets in RDD are distributed across multiple servers/nodes.
  • Enables parallel computations on distributed datasets.
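The idea of a distributed dataset can be sketched in plain Python (this is not the Spark API; the partition count and slicing scheme are illustrative):

```python
# Sketch of how an RDD-like dataset is split into partitions that can
# be processed independently, as Spark's sc.parallelize(data, 3) would.
data = list(range(10))
num_partitions = 3  # hypothetical partition count

# Round-robin split into roughly equal partitions.
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# Each partition is summed independently (in Spark, possibly on a
# different node); the partial results are then combined.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
print(total)  # 45, same as summing the undistributed data
```

In real Spark the partitions live on different executors, but the shape of the computation — independent per-partition work followed by a combine step — is the same.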

Characteristics of RDD

  • Immutable: Collection of objects that cannot be altered.
  • Fault-tolerant: Resilient because Spark records each RDD's lineage in a DAG (Directed Acyclic Graph).
    • Lost or missing partitions can be recomputed from that lineage.
  • Distributed: Data can be stored across multiple nodes/servers.
  • Dataset: Users can load data externally (e.g., from CSV or JSON files).
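Lineage-based recovery can be sketched in plain Python (an illustrative model, not Spark internals): a lost partition is rebuilt by replaying the transformation that produced it.

```python
# Sketch of lineage-based fault tolerance: the "lineage" here is just
# the transformation function plus the source partitions, so a lost
# result partition can be recomputed on demand.
source = [1, 2, 3, 4, 5, 6]
transform = lambda part: [x * 2 for x in part]  # the recorded lineage step

partitions = [source[0:3], source[3:6]]
computed = [transform(p) for p in partitions]

# Simulate losing partition 1, then recompute it from its lineage.
computed[1] = None
computed[1] = transform(partitions[1])
print(computed)  # [[2, 4, 6], [8, 10, 12]]
```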

Operations on RDD

  • Two main operations: Transformations and Actions.

Transformations

  • Function that takes an RDD as input and returns a new RDD (or several).
  • Lazy Operations:
    • Not computed immediately.
    • Executed only when an action is performed.
  • Methods include:
    • map
    • filter
    • reduceByKey
  • Two types of transformations:
    • Narrow: each output partition depends on a single input partition, so no data shuffling is needed (e.g., map, filter).
    • Wide: output partitions depend on multiple input partitions, requiring a data shuffle across the cluster (e.g., reduceByKey).
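Python's built-in map() and filter() are a handy plain-Python analogy for lazy transformations (this is not the Spark API): they return lazy iterators, and nothing runs until a terminal step, the analogue of an action, consumes them. The reduceByKey step is mimicked with a dict.

```python
from collections import defaultdict

# Lazy "transformations": map() and filter() build iterators without
# computing anything, much like rdd.map(...).filter(...) builds a DAG.
nums = [1, 2, 3, 4, 5]
doubled = map(lambda x: x * 2, nums)      # like rdd.map(...): lazy
big = filter(lambda x: x > 4, doubled)    # like rdd.filter(...): lazy

# Only now is anything computed, when list() (the "action") runs.
result = list(big)
print(result)  # [6, 8, 10]

# reduceByKey analogue: merge values per key with a function.
pairs = [("a", 1), ("b", 2), ("a", 3)]
merged = defaultdict(int)
for k, v in pairs:
    merged[k] += v
print(dict(merged))  # {'a': 4, 'b': 2}
```

In Spark, reduceByKey is a wide transformation because values sharing a key may start out on different partitions and must be shuffled together.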

Actions

  • Produces the final result of the RDD computation.
  • Triggers execution of the DAG built up by the transformations.
  • Spark loads the source data, applies the pending transformations, and returns the final result to the driver program.
  • Does not create new RDDs.
  • Action methods include:
    • first
    • take
    • reduce
    • collect
    • count
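A plain-Python sketch of what each of these actions returns, using a list in place of a distributed dataset (the variable names are illustrative):

```python
from functools import reduce

data = [5, 3, 8, 1]

first = data[0]                            # rdd.first()   -> first element
taken = data[:2]                           # rdd.take(2)   -> first 2 elements
total = reduce(lambda a, b: a + b, data)   # rdd.reduce(+) -> 17
collected = list(data)                     # rdd.collect() -> all elements to the driver
count = len(data)                          # rdd.count()   -> 4

print(first, taken, total, collected, count)
```

Note that collect() pulls the entire dataset back to the driver, so on a real cluster it should only be used on data small enough to fit in the driver's memory.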

Key Points

  • Transformations create new RDDs from existing RDDs but are lazy and not computed until an action is performed.
  • Actions are used to work with the actual dataset and to get the final computed result, without creating new RDDs.