PySpark Lecture Series

Jul 4, 2024

PySpark Lecture Notes

Introduction to PySpark and Apache Spark

  • PySpark: the Python interface to Apache Spark, used for large-scale data processing and machine learning.
  • Course Instructor: Krish Naik.
  • Focus: how to use Spark with Python via the PySpark library.
  • Topics Covered:
    • Why Spark matters and what it requires.
    • Apache Spark MLlib for machine learning within Spark using PySpark.
    • Pre-processing datasets using PySpark.
    • Using PySpark DataFrames.
    • Implementing PySpark on cloud platforms like Databricks and AWS.

Why Apache Spark?

  • Efficient for working with large datasets (gigabytes of data and beyond).
  • Distributes data preprocessing and other operations across a cluster.
  • Workloads can run up to 100 times faster than Hadoop MapReduce.
  • Easy to use from several programming languages, including Java, Scala, Python, and R.
  • Can combine SQL, streaming, and complex analytics.
  • Supports multiple deployment modes: standalone, on a cluster, or on cloud platforms (AWS, Databricks).
  • Utilizes in-memory cluster computing.

Setting Up PySpark

  • Installation:
    • Create a new environment (e.g., my_env).
    • Use pip install pyspark.
  • Verifying Installation:
    • Execute import pyspark in Python to ensure it was installed correctly.
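  • A quick sanity check (assumes pyspark was installed into the currently active environment):
    import pyspark
    print(pyspark.__version__)  # prints the installed Spark version string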

Working with DataFrames in PySpark

Basic Operations

  • Starting Spark Session:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('practice').getOrCreate()
    
  • Reading Data:
    df = spark.read.csv('test1.csv', header=True, inferSchema=True)
    df.show()
    
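  • If inferSchema is not desired, the schema can be declared explicitly. A minimal sketch; the column names and types here are assumed from the examples that follow:
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    # assumed layout of test1.csv: name, age, experience, salary
    schema = StructType([
        StructField('name', StringType(), True),
        StructField('age', IntegerType(), True),
        StructField('experience', IntegerType(), True),
        StructField('salary', IntegerType(), True),
    ])
    df = spark.read.csv('test1.csv', header=True, schema=schema)
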
  • Schema and Data Types:
    df.printSchema()
    
  • Selecting Columns:
    df.select('name', 'experience').show()
    
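  • Other standard inspection helpers on a DataFrame:
    df.dtypes             # list of (column name, type) pairs
    df.describe().show()  # summary statistics for numeric columns
    df.head(3)            # first three rows as Row objects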

Advanced DataFrame Operations

  • Adding Columns:
    df = df.withColumn('experience_plus_two', df['experience'] + 2)
    df.show()
    
  • Dropping Columns:
    df = df.drop('experience_plus_two')
    df.show()
    
  • Renaming Columns:
    df = df.withColumnRenamed('name', 'new_name')
    df.show()
    
  • Filtering Data:
    df.filter(df['salary'] < 20000).show()
    
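  • Conditions can be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses:
    df.filter((df['salary'] >= 15000) & (df['salary'] <= 20000)).show()
    df.filter(~(df['salary'] < 20000)).show()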

Handling Missing Values

  • Dropping Rows with Null Values:
    df.na.drop().show()
    
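  • na.drop also takes how, thresh, and subset to control which rows are removed:
    df.na.drop(how='all').show()        # drop a row only if every column is null
    df.na.drop(thresh=2).show()         # keep rows with at least 2 non-null values
    df.na.drop(subset=['age']).show()   # consider nulls in 'age' only
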
  • Filling Missing Values (the fill value's type must match the column type, so numeric columns need a numeric fill):
    df.na.fill(0, subset=['age', 'experience']).show()
    
  • Imputing with Mean:
    from pyspark.ml.feature import Imputer
    imputer = Imputer(inputCols=['age', 'experience'], outputCols=['age_imputed', 'experience_imputed']).setStrategy('mean')
    df = imputer.fit(df).transform(df)
    df.show()
    

GroupBy and Aggregate Functions

  • Grouped Aggregation:
    df.groupBy('department').sum('salary').show()
    
  • Finding Mean Salary by Department:
    df.groupBy('department').mean('salary').show()
    
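  • Several aggregates can be computed at once with agg and pyspark.sql.functions:
    from pyspark.sql import functions as F
    df.groupBy('department').agg(
        F.sum('salary').alias('total_salary'),
        F.avg('salary').alias('mean_salary'),
        F.count('*').alias('headcount'),
    ).show()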

Machine Learning with PySpark MLlib

  • VectorAssembler for combining input features into a single vector column:
    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=['age', 'experience'], outputCol='features')
    output = assembler.transform(df)
    output.select('features', 'salary').show()
    
  • Train Test Split (split the assembled data so the 'features' column is available for training):
    final_data = output.select('features', 'salary')
    train_data, test_data = final_data.randomSplit([0.7, 0.3])
    
  • Linear Regression Model:
    from pyspark.ml.regression import LinearRegression
    lr = LinearRegression(featuresCol='features', labelCol='salary')
    lr_model = lr.fit(train_data)
    predictions = lr_model.transform(test_data)
    predictions.show()
    
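  • The fitted model exposes its coefficients and an evaluate method for test-set metrics (standard LinearRegressionModel API):
    print(lr_model.coefficients, lr_model.intercept)
    test_results = lr_model.evaluate(test_data)
    print(test_results.meanAbsoluteError, test_results.rootMeanSquaredError)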

Using Databricks

  • Setting Up Databricks:
    • Use the free Community Edition.
    • Create clusters and upload datasets to DBFS (the Databricks File System).
  • Running PySpark on Databricks:
    • Create a cluster and upload datasets.
    • Use notebooks to execute PySpark code much as in Jupyter notebooks; reading an uploaded file is sketched below.
    • Install necessary libraries on the cluster (for example, via the cluster's Libraries tab or %pip in a notebook).
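  • A minimal sketch of reading an uploaded dataset inside a Databricks notebook (/FileStore/tables/ is the default upload location; the filename is assumed):
    # 'spark' is pre-created in Databricks notebooks; no SparkSession setup is needed
    df = spark.read.csv('/FileStore/tables/test1.csv', header=True, inferSchema=True)
    df.show()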

Summary

  • PySpark is a powerful interface for using Apache Spark from Python for big data processing and machine learning.
  • Installing and setting up PySpark involves creating a new virtual environment and verifying the installation.
  • DataFrames are central to using PySpark, enabling easy manipulation and analysis of large datasets.
  • Machine learning capabilities in PySpark are accessed through MLlib, with VectorAssembler making it straightforward to prepare features and train and evaluate models.
  • Databricks offers a robust platform to execute Spark operations with cluster support and cloud integration.