Lecture on RDD in Spark

Jul 16, 2024

Introduction to RDD

  • RDD stands for Resilient Distributed Dataset
  • Purpose: Distribute a dataset across the nodes of a cluster so computations can run on its partitions in parallel.

Key Features of RDD

  • Immutable Collection: RDDs are read-only collections of objects; transformations produce new RDDs rather than modifying existing ones.
  • Resilience: Fault-tolerant; the lineage recorded in a DAG (Directed Acyclic Graph) lets Spark recompute lost partitions.
  • Distribution: Data is partitioned and stored across multiple nodes/servers of the cluster.
  • Data Loading: RDDs can be created from external datasets (e.g., CSV or JSON files), as in the sketch at the end of this list.
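  • Example: a minimal sketch (Scala) of loading an external dataset into an RDD; local mode, the app name, and the file path below are placeholder assumptions:

      import org.apache.spark.{SparkConf, SparkContext}

      // Local SparkContext for illustration; "data/users.csv" is a placeholder path.
      val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
      val sc   = new SparkContext(conf)

      // Load an external dataset into an RDD of lines, split into partitions.
      val lines = sc.textFile("data/users.csv")

      // Each partition can be processed in parallel on a different node.
      println(s"Number of partitions: ${lines.getNumPartitions}")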

Operations on RDD

  • Two Main Types of Operations
    • Transformations
    • Actions

Transformations

  • Definition: Functions that take an RDD as input and return one or more RDDs as output.
  • Characteristics:
    • Lazy Operations: Not executed until an action is performed.
    • Common methods include map, filter, and reduceByKey, which can be chained together (see the sketch at the end of this list).
  • Types:
    • Narrow Transformations (e.g., map, filter): each output partition depends on a single input partition, so no shuffle is needed.
    • Wide Transformations (e.g., reduceByKey): output partitions depend on data from many input partitions, which requires a shuffle.
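  • Example: a minimal sketch of chained, lazy transformations (Scala), reusing the SparkContext `sc` from the earlier sketch; "data/words.txt" is a placeholder path:

      // Narrow transformations: each output partition depends on a single input partition.
      val lines = sc.textFile("data/words.txt")
      val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
      val pairs = words.map(word => (word, 1))

      // Wide transformation: reduceByKey shuffles records with the same key together.
      val counts = pairs.reduceByKey(_ + _)

      // Nothing has executed yet -- transformations are lazy until an action runs.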

Actions

  • Definition: Operations that trigger execution of the accumulated transformations and return the final result of the RDD computation.
  • Process:
    • Load data into the original RDD
    • Perform intermediate transformations
    • Return the result to the driver program
  • Characteristics:
    • Do not create new RDDs
    • Final results can be saved to files or displayed on the console.
  • Examples: first, take, reduce, collect, count (illustrated in the sketch below)
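  • Example: a minimal sketch of actions triggering execution (Scala), again assuming the SparkContext `sc` from the first sketch; the output directory is a placeholder:

      val numbers = sc.parallelize(1 to 10)
      val squares = numbers.map(n => n * n)       // transformation: still lazy

      // Actions trigger execution and return results to the driver program.
      println(squares.first())                    // 1
      println(squares.take(3).mkString(", "))     // 1, 4, 9
      println(squares.reduce(_ + _))              // 385
      println(squares.count())                    // 10
      println(squares.collect().mkString(", "))   // all elements on the driver

      // Results can also be saved instead of returned; the path is a placeholder.
      squares.saveAsTextFile("output/squares")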

Conclusion

  • RDDs are a foundational data structure in Apache Spark.
  • They enable efficient, fault-tolerant, and parallel data processing.