
Apache Spark and PySpark Overview

Jul 28, 2024

Notes on Apache Spark and PySpark Lecture

Introduction to PySpark

  • PySpark is an interface for Apache Spark in Python.
  • Often used for large-scale data processing and machine learning.
  • Course led by Krish Naik.

Overview of the Course Topics

  • Focus: Using Spark from Python through the PySpark library.
  • Key Topics to Cover:
    • Why Spark is required.
    • Machine Learning with Spark MLlib.
    • Data preprocessing techniques using PySpark DataFrames.
    • Implementation on cloud platforms like Databricks and AWS.

Advantages of Apache Spark

  • Can run some workloads up to 100 times faster than Hadoop MapReduce, largely by keeping intermediate data in memory.
  • Ease of use with APIs in Java, Scala, Python, and R.
  • Capable of handling huge datasets beyond local system limits by leveraging distributed computing.
  • Can run on various cluster managers such as Hadoop YARN and Kubernetes, as well as on managed cloud services.

Getting Started with PySpark

  1. Installation of PySpark:
    • Create a new environment (e.g., myenv).
    • Install PySpark using pip install pyspark.
  2. Creating a Spark Session:
    • Import Spark functionalities using from pyspark.sql import SparkSession.
    • Initialize the session: spark = SparkSession.builder.appName("MyApp").getOrCreate().

Working with DataFrames in PySpark

  • DataFrames in PySpark resemble pandas DataFrames, but their data is partitioned across the cluster and transformations are evaluated lazily, only when an action such as show() runs.
  • Reading Data:
    • Use spark.read.csv("filename.csv", header=True, inferSchema=True) to read CSV files.
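    • Set header=True so the first row supplies the column names (otherwise columns are named _c0, _c1, and so on), and inferSchema=True so column types are inferred instead of all being read as strings.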

DataFrame Operations

  • Check Schema and Data Types: df.printSchema() prints each column's name and type.
  • Display Data: df.show().
  • Selecting a Column: Use df.select("column_name").show().
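  • Adding a Column: Use df.withColumn("new_column_name", expression), where expression is a Column such as df["column_name"] + 2; a plain Python value is rejected unless wrapped with lit(). The result is a new DataFrame; df itself is unchanged.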
  • Dropping Columns: Use df.drop("column_name").show().
  • Renaming Columns: Use df.withColumnRenamed("old_name", "new_name").

Machine Learning with PySpark MLlib

  • Uses the DataFrame-based pyspark.ml API for machine learning tasks.
  • Core tasks include: regression, classification, clustering.
  • Data preprocessing is crucial to prepare data for machine learning.