PySpark Lecture Series

Jul 4, 2024

PySpark Lecture Notes

Introduction to PySpark and Apache Spark

  • PySpark: the Python interface to Apache Spark, used for large-scale data processing and machine learning.
  • Course Instructor: Krish Naik.
  • Focus: how to use Spark with Python via the PySpark library.
  • Topics Covered:
    • Why Spark matters and what it requires.
    • Apache Spark MLlib for machine learning within Spark using PySpark.
    • Pre-processing datasets using PySpark.
    • Using PySpark DataFrames.
    • Implementing PySpark on cloud platforms like Databricks and AWS.

Why Apache Spark?

  • Efficient for working with large datasets (gigabytes of data and beyond).
  • Distributes data preprocessing and other operations across a cluster.
  • Workloads can run up to 100 times faster than Hadoop MapReduce.
  • Easy to use from several programming languages, including Java, Scala, Python, and R.
  • Can combine SQL, streaming, and complex analytics.
  • Supports multiple deployment modes: standalone, on a cluster, or on cloud platforms (AWS, Databricks).
  • Utilizes in-memory cluster computing.

Setting Up PySpark

  • Installation:
    • Create a new environment (e.g., my_env).
    • Use pip install pyspark.
  • Verifying Installation:
    • Execute import pyspark in Python to ensure it was installed correctly.
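  • A quick sanity check (assumes pyspark was installed into the currently active environment):
    import pyspark
    print(pyspark.__version__)  # prints the installed Spark version string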

Working with DataFrames in PySpark

Basic Operations

  • Starting Spark Session:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('practice').getOrCreate()
    
  • Reading Data:
    df = spark.read.csv('test1.csv', header=True, inferSchema=True)
    df.show()
    
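  • If inferSchema is not desired, the schema can be declared explicitly. A minimal sketch; the column names and types here are assumed from the examples that follow:
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    # assumed layout of test1.csv: name, age, experience, salary
    schema = StructType([
        StructField('name', StringType(), True),
        StructField('age', IntegerType(), True),
        StructField('experience', IntegerType(), True),
        StructField('salary', IntegerType(), True),
    ])
    df = spark.read.csv('test1.csv', header=True, schema=schema)
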
  • Schema and Data Types:
    df.printSchema()
    
  • Selecting Columns:
    df.select('name', 'experience').show()
    
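  • Other standard inspection helpers on a DataFrame:
    df.dtypes             # list of (column name, type) pairs
    df.describe().show()  # summary statistics for numeric columns
    df.head(3)            # first three rows as Row objects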

Advanced DataFrame Operations

  • Adding Columns:
    df = df.withColumn('experience_plus_two', df['experience'] + 2)
    df.show()
    
  • Dropping Columns:
    df = df.drop('experience_plus_two')
    df.show()
    
  • Renaming Columns:
    df = df.withColumnRenamed('name', 'new_name')
    df.show()
    
  • Filtering Data:
    df.filter(df['salary'] < 20000).show()
    
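  • Conditions can be combined with & (and), | (or), and ~ (not); each condition needs its own parentheses:
    df.filter((df['salary'] >= 15000) & (df['salary'] <= 20000)).show()
    df.filter(~(df['salary'] < 20000)).show()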

Handling Missing Values

  • Dropping Rows with Null Values:
    df.na.drop().show()
    
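  • na.drop also takes how, thresh, and subset to control which rows are removed:
    df.na.drop(how='all').show()        # drop a row only if every column is null
    df.na.drop(thresh=2).show()         # keep rows with at least 2 non-null values
    df.na.drop(subset=['age']).show()   # consider nulls in 'age' only
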
  • Filling Missing Values (the fill value's type must match the column type, so numeric columns need a numeric fill):
    df.na.fill(0, subset=['age', 'experience']).show()
    
  • Imputing with Mean:
    from pyspark.ml.feature import Imputer
    imputer = Imputer(inputCols=['age', 'experience'], outputCols=['age_imputed', 'experience_imputed']).setStrategy('mean')
    df = imputer.fit(df).transform(df)
    df.show()
    

GroupBy and Aggregate Functions

  • Grouped Aggregation:
    df.groupBy('department').sum('salary').show()
    
  • Finding Mean Salary by Department:
    df.groupBy('department').mean('salary').show()
    
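  • Several aggregates can be computed at once with agg and pyspark.sql.functions:
    from pyspark.sql import functions as F
    df.groupBy('department').agg(
        F.sum('salary').alias('total_salary'),
        F.avg('salary').alias('mean_salary'),
        F.count('*').alias('headcount'),
    ).show()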

Machine Learning with PySpark MLlib

  • VectorAssembler for combining input features into a single vector column:
    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=['age', 'experience'], outputCol='features')
    output = assembler.transform(df)
    output.select('features', 'salary').show()
    
  • Train Test Split (split the assembled data so the 'features' column is available for training):
    final_data = output.select('features', 'salary')
    train_data, test_data = final_data.randomSplit([0.7, 0.3])
    
  • Linear Regression Model:
    from pyspark.ml.regression import LinearRegression
    lr = LinearRegression(featuresCol='features', labelCol='salary')
    lr_model = lr.fit(train_data)
    predictions = lr_model.transform(test_data)
    predictions.show()
    
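  • The fitted model exposes its coefficients and an evaluate method for test-set metrics (standard LinearRegressionModel API):
    print(lr_model.coefficients, lr_model.intercept)
    test_results = lr_model.evaluate(test_data)
    print(test_results.meanAbsoluteError, test_results.rootMeanSquaredError)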

Using Databricks

  • Setting Up Databricks:
    • Use the free Community Edition.
    • Create clusters and upload datasets to DBFS (the Databricks File System).
  • Running PySpark on Databricks:
    • Create a cluster and upload datasets.
    • Use notebooks to execute PySpark code much as in Jupyter notebooks; reading an uploaded file is sketched below.
    • Install necessary libraries on the cluster (for example, via the cluster's Libraries tab or %pip in a notebook).
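  • A minimal sketch of reading an uploaded dataset inside a Databricks notebook (/FileStore/tables/ is the default upload location; the filename is assumed):
    # 'spark' is pre-created in Databricks notebooks; no SparkSession setup is needed
    df = spark.read.csv('/FileStore/tables/test1.csv', header=True, inferSchema=True)
    df.show()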

Summary

  • PySpark is a powerful interface for using Apache Spark from Python for big data processing and machine learning.
  • Installing and setting up PySpark involves creating a new virtual environment and verifying the installation.
  • DataFrames are central to using PySpark, enabling easy manipulation and analysis of large datasets.
  • Machine learning capabilities in PySpark are accessed through MLlib, with VectorAssembler making it straightforward to prepare features and train and evaluate models.
  • Databricks offers a robust platform to execute Spark operations with cluster support and cloud integration.