RDD in Spark

Jul 16, 2024

Overview

  • RDD: Stands for Resilient Distributed Dataset.
  • The fundamental data abstraction in Apache Spark.
  • Datasets in RDD are distributed across multiple servers/nodes.
  • Enables parallel computations on distributed datasets.
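The idea of a distributed dataset can be sketched in plain Python (this is not the Spark API; the partition count and slicing scheme are illustrative):

```python
# Sketch of how an RDD-like dataset is split into partitions that can
# be processed independently, as Spark's sc.parallelize(data, 3) would.
data = list(range(10))
num_partitions = 3  # hypothetical partition count

# Round-robin split into roughly equal partitions.
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# Each partition is summed independently (in Spark, possibly on a
# different node); the partial results are then combined.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)
print(total)  # 45, same as summing the undistributed data
```

In real Spark the partitions live on different executors, but the shape of the computation — independent per-partition work followed by a combine step — is the same.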

Characteristics of RDD

  • Immutable: Collection of objects that cannot be altered.
  • Fault-tolerant: Resilient because Spark records each RDD's lineage in a DAG (Directed Acyclic Graph).
    • Lost or missing partitions can be recomputed from that lineage.
  • Distributed: Data can be stored across multiple nodes/servers.
  • Dataset: Users can load data externally (e.g., from CSV or JSON files).
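Lineage-based recovery can be sketched in plain Python (an illustrative model, not Spark internals): a lost partition is rebuilt by replaying the transformation that produced it.

```python
# Sketch of lineage-based fault tolerance: the "lineage" here is just
# the transformation function plus the source partitions, so a lost
# result partition can be recomputed on demand.
source = [1, 2, 3, 4, 5, 6]
transform = lambda part: [x * 2 for x in part]  # the recorded lineage step

partitions = [source[0:3], source[3:6]]
computed = [transform(p) for p in partitions]

# Simulate losing partition 1, then recompute it from its lineage.
computed[1] = None
computed[1] = transform(partitions[1])
print(computed)  # [[2, 4, 6], [8, 10, 12]]
```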

Operations on RDD

  • Two main operations: Transformations and Actions.

Transformations

  • Function that takes an RDD as input and returns a new RDD (or several).
  • Lazy Operations:
    • Not computed immediately.
    • Executed only when an action is performed.
  • Methods include:
    • map
    • filter
    • reduceByKey
  • Two types of transformations:
    • Narrow: each output partition depends on a single input partition, so no data shuffling is needed (e.g., map, filter).
    • Wide: output partitions depend on multiple input partitions, requiring a data shuffle across the cluster (e.g., reduceByKey).
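Python's built-in map() and filter() are a handy plain-Python analogy for lazy transformations (this is not the Spark API): they return lazy iterators, and nothing runs until a terminal step, the analogue of an action, consumes them. The reduceByKey step is mimicked with a dict.

```python
from collections import defaultdict

# Lazy "transformations": map() and filter() build iterators without
# computing anything, much like rdd.map(...).filter(...) builds a DAG.
nums = [1, 2, 3, 4, 5]
doubled = map(lambda x: x * 2, nums)      # like rdd.map(...): lazy
big = filter(lambda x: x > 4, doubled)    # like rdd.filter(...): lazy

# Only now is anything computed, when list() (the "action") runs.
result = list(big)
print(result)  # [6, 8, 10]

# reduceByKey analogue: merge values per key with a function.
pairs = [("a", 1), ("b", 2), ("a", 3)]
merged = defaultdict(int)
for k, v in pairs:
    merged[k] += v
print(dict(merged))  # {'a': 4, 'b': 2}
```

In Spark, reduceByKey is a wide transformation because values sharing a key may start out on different partitions and must be shuffled together.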

Actions

  • Produces the final result of the RDD computation.
  • Triggers execution of the DAG built up by the transformations.
  • Spark loads the source data, applies the pending transformations, and returns the final result to the driver program.
  • Does not create new RDDs.
  • Action methods include:
    • first
    • take
    • reduce
    • collect
    • count
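A plain-Python sketch of what each of these actions returns, using a list in place of a distributed dataset (the variable names are illustrative):

```python
from functools import reduce

data = [5, 3, 8, 1]

first = data[0]                            # rdd.first()   -> first element
taken = data[:2]                           # rdd.take(2)   -> first 2 elements
total = reduce(lambda a, b: a + b, data)   # rdd.reduce(+) -> 17
collected = list(data)                     # rdd.collect() -> all elements to the driver
count = len(data)                          # rdd.count()   -> 4

print(first, taken, total, collected, count)
```

Note that collect() pulls the entire dataset back to the driver, so on a real cluster it should only be used on data small enough to fit in the driver's memory.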

Key Points

  • Transformations create new RDDs from existing RDDs but are lazy and not computed until an action is performed.
  • Actions are used to work with the actual dataset and to get the final computed result, without creating new RDDs.