πŸ’»

PySpark and RDDs Overview

Jul 10, 2025

Overview

This lecture introduces PySpark, explains its differences from Pandas, and provides a detailed overview of RDDs (Resilient Distributed Datasets), the core data structure in Apache Spark.

Introduction to PySpark

  • PySpark is the Python API for Apache Spark, allowing Python developers to leverage Spark's big data capabilities.
  • It is made possible by the Py4J library, which lets Python code interoperate with the Java Virtual Machine that Spark runs on.
  • Python, Apache Spark, and a Java runtime must be installed to use PySpark; a minimal setup sketch follows this list.
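
A minimal sketch of starting PySpark from a script, assuming PySpark is installed (e.g. via pip install pyspark) and a compatible Java runtime is on the PATH; the application name "pyspark-intro" is an illustrative placeholder:

    from pyspark.sql import SparkSession

    # Build (or reuse) a local SparkSession; "local[*]" uses all CPU cores.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pyspark-intro")   # hypothetical app name
             .getOrCreate())

    print(spark.version)     # the underlying Apache Spark version
    sc = spark.sparkContext  # the SparkContext, entry point for RDDs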

PySpark vs. Pandas

  • Pandas is best for small datasets that fit into a single machine’s RAM; PySpark handles much larger datasets by distributing data across clusters.
  • PySpark supports parallelism across CPU cores and cluster nodes, while Pandas operates on a single CPU core.
  • PySpark operations are lazy, computed only when an action requests a result; Pandas operations are eager, executed immediately (contrasted in the sketch after this list).
  • RDDs (main data structure in PySpark) are immutable, increasing safety for parallel processing; Pandas DataFrames are mutable.
  • Pandas has a richer API and is generally easier for complex data manipulations; PySpark trades simplicity for scalability.
  • PySpark can be horizontally scaled by adding new cluster nodes, unlike Pandas.
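
A small sketch contrasting the two execution models, reusing the hypothetical spark session from the earlier setup sketch:

    import pandas as pd

    pdf = pd.DataFrame({"x": [1, 2, 3, 4]})
    doubled = pdf["x"] * 2             # Pandas: computed immediately (eager)

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
    lazy = rdd.map(lambda v: v * 2)    # PySpark: builds a plan, nothing runs yet
    print(lazy.collect())              # the action collect() triggers execution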

Resilient Distributed Dataset (RDD)

  • RDD, short for Resilient Distributed Dataset, is Spark's primary data structure; each word in the name describes a key property:
  • "Resilient" means RDDs are fault-tolerant and can recover data if a node fails.
  • "Distributed" means data is automatically split into partitions across multiple cluster nodes.
  • "Dataset" refers to a collection of data, like a table or file.
  • RDDs are immutable and follow lazy evaluation (transformations are not executed until an action is called).
  • Data is partitioned automatically, enabling parallel processing across the cluster (see the sketch after this list).
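
A minimal sketch of RDD creation and automatic partitioning, assuming a SparkContext sc (e.g. sc = spark.sparkContext from the setup sketch):

    # Distribute ten numbers across four partitions.
    rdd = sc.parallelize(range(10), numSlices=4)

    print(rdd.getNumPartitions())  # -> 4
    print(rdd.glom().collect())    # elements grouped per partition,
                                   # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]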

Key Features of RDDs

  • In-memory computation for fast data processing.
  • Fault tolerance via data lineage graphs for recovery from failures.
  • Immutability, allowing safe sharing across nodes without risk of data corruption.
  • Logical partitioning, the basis for parallelism in Spark.
  • Persistence: RDDs can be cached in RAM or stored on disk for reuse.
  • Coarse-grained operations (like map, filter, and groupBy) apply to the dataset as a whole rather than to individual records; several appear in the sketch after this list.
  • Location stickiness: the DAG (Directed Acyclic Graph) scheduler uses data lineage tracking to place tasks close to the data they operate on, minimizing data movement.
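
A sketch tying several of these features together, again assuming the SparkContext sc from the earlier sketches:

    nums = sc.parallelize(range(1, 101))

    # Coarse-grained transformations apply to every element, and each
    # returns a NEW immutable RDD rather than mutating nums in place.
    evens = nums.filter(lambda n: n % 2 == 0)
    squares = evens.map(lambda n: n * n)

    squares.cache()          # persist the result in memory for reuse

    print(squares.count())   # first action: computes and caches (50)
    print(squares.sum())     # second action: served from the cache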

Key Terms & Definitions

  • PySpark β€” Python API interface for Apache Spark
  • RDD (Resilient Distributed Dataset) β€” Immutable, distributed data collection in Spark
  • Fault Tolerance β€” Ability to recover data after failures
  • Partition β€” A logical division of data within an RDD, spread across nodes
  • Lazy Evaluation β€” Computations executed only when an action is called
  • Coarse-grained Operation — Operation applied to the dataset as a whole (e.g. map, filter), rather than to individual elements
  • DAG Scheduler — Spark component that organizes transformations into stages of tasks for efficient execution
  • Data Lineage Graph — Metadata tracking the origin and transformations of each RDD (inspected in the sketch after this list)
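
As a small illustration of the last two terms, PySpark exposes each RDD's lineage via toDebugString(); a sketch, assuming the SparkContext sc from earlier:

    rdd = sc.parallelize(range(8)).map(lambda x: x + 1).filter(lambda x: x > 4)

    # Prints the lineage graph Spark would replay to recompute
    # lost partitions after a node failure.
    print(rdd.toDebugString().decode("utf-8"))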

Action Items / Next Steps

  • Watch the previous videos for foundational knowledge on Hadoop, MapReduce, and Spark.
  • Prepare for the next video on installing Python, Apache Spark, and using PySpark in Jupyter Notebook or PyCharm.