Overview
This lecture covers actions in PySpark, how to create and manage Spark Context and RDDs, and the difference between actions and transformations. It also demonstrates common actions such as collect, count, reduce, and fold.
Spark Context & Initialization
- Spark Context is the main entry point for PySpark, connecting Python code to the Spark cluster.
- Only one Spark Context can be active per JVM; it must be stopped to create a new one.
- Spark Context can be created with or without custom configurations (e.g., app name, master).
- For DataFrame and SQL operations, Spark Session is used instead.
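A minimal sketch of both entry points, assuming a local PySpark installation; the application name "actions-demo" and the local[*] master are arbitrary choices:

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Create a Spark Context with a custom configuration (app name and master are arbitrary here).
conf = SparkConf().setAppName("actions-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# ... RDD work ...

# Only one SparkContext can be active per JVM; stop it before creating another.
sc.stop()

# For DataFrame and SQL operations, use a SparkSession instead.
spark = SparkSession.builder.appName("actions-demo").getOrCreate()
sc = spark.sparkContext  # the session also exposes a SparkContext for RDD work
```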
RDD Basics
- RDD (Resilient Distributed Dataset) is an immutable, distributed collection of objects, partitioned across the cluster so it can be processed in parallel.
- RDDs can be created from local data, CSV or text files, or other storage using sc.parallelize or sc.textFile (see the code sketch below).
- Typical machine-learning preparation performed on this data includes preprocessing, filtering, imputing missing values, and feature selection.
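A short sketch of RDD creation, assuming the SparkContext sc from above; the file path "data/sample.txt" is a hypothetical placeholder:

```python
nums = sc.parallelize([1, 2, 3, 4, 5])           # RDD from a local list
rng = sc.parallelize(range(100), numSlices=4)    # RDD from a range, split into 4 partitions
lines = sc.textFile("data/sample.txt")           # RDD of lines from a text file (hypothetical path)

print(nums.count())            # 5
print(rng.getNumPartitions())  # 4
```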
Actions vs Transformations
- Transformations create new RDDs and are lazily evaluated, forming a DAG (Directed Acyclic Graph) executed only upon action.
- Actions trigger execution of the DAG and return a result to the driver (a number, a list, etc.) rather than an RDD; see the code sketch below.
- Examples of actions: collect, countByValue, take, max, min, mean, reduce, fold.
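A minimal sketch, assuming the sc created above, showing that transformations only build the DAG and nothing runs until an action is called:

```python
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)            # transformation: returns a new RDD, no work yet
evens = squared.filter(lambda x: x % 2 == 0)  # another transformation, still lazy

result = evens.collect()                      # action: triggers execution of the whole DAG
print(result)                                 # [0, 4, 16, 36, 64]
```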
Common PySpark Actions
- collect() returns the entire dataset to the driver as a list (avoid on large datasets in production).
- countByValue() returns the count of each unique element.
- foreach() applies a function to each element on the executors and returns nothing (useful for logging or other side effects).
- take(n) returns the first n elements of the RDD.
- distinct().count() gives the number of unique elements.
- glom() coalesces the elements of each partition into a list for easier access to partition data (strictly a transformation, usually followed by collect()).
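A sketch exercising these actions on a small two-partition RDD (the values are arbitrary):

```python
rdd = sc.parallelize([1, 2, 2, 3, 3, 3], numSlices=2)

print(rdd.collect())             # [1, 2, 2, 3, 3, 3] -- the whole dataset as a list
print(rdd.countByValue())        # counts per value, e.g. {1: 1, 2: 2, 3: 3}
print(rdd.take(2))               # [1, 2] -- the first two elements
print(rdd.distinct().count())    # 3 -- number of unique elements
print(rdd.glom().collect())      # [[1, 2, 2], [3, 3, 3]] -- one list per partition

rdd.foreach(lambda x: print(x))  # runs on the executors and returns nothing
```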
Reduce & Fold
- reduce(function) aggregates the elements of the RDD with the given function (e.g., sum or product).
- fold(initial_value, function) works like reduce, but applies the initial (zero) value within each partition and once more when combining the partition results; see the sketches below.
- Because of this, the result of fold depends on the number of partitions whenever the initial value is not the identity for the operation.
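A minimal sketch of reduce and fold on the same values, assuming sc as before:

```python
nums = sc.parallelize([1, 2, 3, 4])

print(nums.reduce(lambda a, b: a + b))   # 10 -- sum
print(nums.reduce(lambda a, b: a * b))   # 24 -- product
print(nums.fold(0, lambda a, b: a + b))  # 10 -- a zero value of 0 leaves the sum unchanged
```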
Example RDD Operations
- RDDs can be created from lists or ranges using sc.parallelize.
- Partitions can be controlled with the numSlices parameter.
- Partitioning affects the result of fold and the output of glom, as shown in the sketch below.
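A sketch of the partitioning effect, assuming sc: glom() makes the numSlices split visible, and a non-zero initial value makes fold's result depend on the partition count:

```python
two = sc.parallelize(range(1, 7), numSlices=2)
three = sc.parallelize(range(1, 7), numSlices=3)

print(two.glom().collect())    # typically [[1, 2, 3], [4, 5, 6]] -- elements grouped per partition
print(three.glom().collect())  # typically [[1, 2], [3, 4], [5, 6]]

# fold applies the initial value once per partition and once more for the final merge:
print(two.fold(10, lambda a, b: a + b))    # 21 + (2 + 1) * 10 = 51
print(three.fold(10, lambda a, b: a + b))  # 21 + (3 + 1) * 10 = 61
```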
Key Terms & Definitions
- Spark Context — Entry point for Spark functionality in Python; manages connection to the cluster.
- RDD — Resilient Distributed Dataset; immutable partitioned collection for distributed computing.
- Transformation — Operation that returns a new RDD and is lazily evaluated.
- Action — Operation that triggers computation and returns a non-RDD result.
- glom() — Coalesces the elements of each partition of an RDD into a list; used to inspect per-partition data.
- reduce() — Aggregates RDD elements using a function.
- fold() — Like reduce, but starts each partition-level and final aggregation from an initial value.
Action Items / Next Steps
- Review previous videos for foundational knowledge on PySpark and RDDs.
- Practice creating Spark Contexts and RDDs using different methods.
- Experiment with various actions, especially reduce and fold, to understand partitioning effects.
- Watch the next video for an in-depth explanation of transformations, including narrow and wide types.