RDD in Spark
Jul 16, 2024
Overview
RDD: Stands for Resilient Distributed Dataset.
The fundamental data structure in Apache Spark.
Data in an RDD is partitioned and distributed across multiple servers/nodes.
Enables parallel computations on distributed datasets, as in the sketch below.
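A minimal sketch of creating an RDD and computing on it in parallel (the app name, master URL, partition count, and sample numbers are illustrative assumptions, not from the notes):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOverview {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a real cluster the partitions
    // would be spread across worker nodes.
    val conf = new SparkConf().setAppName("rdd-overview").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Distribute a local collection across 4 partitions.
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)

    // Each partition is processed in parallel.
    val total = numbers.map(_ * 2).sum()
    println(total)   // 1001000.0

    sc.stop()
  }
}
```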
Characteristics of RDD
Immutable: A collection of objects that cannot be altered.
Fault-tolerant: Resilient because each RDD's lineage is tracked as a DAG (Directed Acyclic Graph).
Missing or lost partitions can be recomputed from this lineage.
Distributed: Data can be stored across multiple nodes/servers.
Dataset: Users can load data externally (e.g., from CSV or JSON files), as shown in the sketch after this list.
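A hedged sketch of the Dataset, Immutable, and Fault-tolerant points above, reusing the sc from the previous sketch (the file path is a placeholder):

```scala
// Load an external file: one RDD element per line (path is hypothetical).
val lines = sc.textFile("data/people.csv")

// Immutable: map returns a *new* RDD; `lines` itself is never altered.
val upper = lines.map(_.toUpperCase)

// Fault-tolerant: the lineage (DAG) printed here is what Spark replays
// to recompute a lost partition.
println(upper.toDebugString)
```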
Operations on RDD
Two main operations: Transformations and Actions.
Transformations
A function that takes an RDD as input and returns one or more new RDDs.
Lazy Operations:
Not pre-computed.
Executed only when an action is performed.
Methods include (see the sketch after this list):
map
filter
reduceByKey
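A small sketch of how these methods stay lazy until an action runs, assuming the same sc as above (the sample pairs are made up):

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// None of these lines runs a Spark job yet -- they only record the plan.
val scaled = pairs.map { case (k, v) => (k, v * 10) }
val big    = scaled.filter { case (_, v) => v > 10 }
val summed = big.reduceByKey(_ + _)

// Only this action triggers the actual computation.
println(summed.collect().toList)   // List((a,30), (b,20)); order may vary
```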
Two types of transformations (contrasted in the sketch below):
Narrow: Each output partition depends on a single input partition, so no data shuffling is needed (e.g., map, filter).
Wide: Requires shuffling data across partitions (e.g., reduceByKey).
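A sketch contrasting the two types, again assuming the sc from above:

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark"))

// Narrow: each output partition is built from exactly one input partition.
val tagged = words.map(w => (w, 1))

// Wide: rows with the same key may live in different partitions, so Spark
// shuffles data across the network before reducing.
val counts = tagged.reduceByKey(_ + _)
println(counts.collect().toList)   // List((spark,2), (rdd,1)); order may vary
```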
Actions
Produces the final result of the RDD computation.
Uses DAG to execute tasks.
Loads data into the original RDD, performs the transformations, and returns the final result to the driver program.
Does not create new RDDs.
Action methods include (see the sketch after this list):
first
take
reduce
collect
count
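A sketch running each listed action, with the same sc assumed and made-up numbers:

```scala
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

println(nums.first())          // 5 -- the first element
println(nums.take(3).toList)   // List(5, 1, 4) -- the first n elements
println(nums.reduce(_ + _))    // 15 -- combine all elements with a function
println(nums.collect().toList) // List(5, 1, 4, 2, 3) -- everything, returned to the driver
println(nums.count())          // 5 -- number of elements
```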
Key Points
Transformations create new RDDs from existing RDDs, but they are lazy and not computed until an action is performed.
Actions are used to work with the actual dataset and to get the final computed result, without creating new RDDs.