Overview
This lecture demonstrates how to apply machine learning regression algorithms (Linear Regression and Random Forest Regression) on a Spark SQL DataFrame using PySpark, covering data import, preprocessing, feature vector assembly, model training, evaluation, and predictions.
Setting Up PySpark for Machine Learning
- Import necessary libraries: SparkSession, VectorAssembler, LinearRegression, RandomForestRegressor, RegressionEvaluator, and Imputer.
- Use the PySpark ML package (pyspark.ml), which works on DataFrames, rather than the older RDD-based MLlib API (pyspark.mllib).
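A minimal setup sketch based on the imports listed above; the exact session configuration used in the lecture is not shown, so the app name here is arbitrary:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, Imputer
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Create (or reuse) a SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("AdmissionRegression").getOrCreate()
```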
Loading and Preparing Data
- Read data from a CSV file into a DataFrame, ensuring headers are included.
- By default, all imported columns are of string type and must be cast to float for computation.
- Use the .cast('float') method to convert columns to float.
- Check for null values in each column using SQL-like functions and alias results for clarity.
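A sketch of the load, cast, and null-check steps; the file name admission.csv is a placeholder, since the lecture's actual path and code are not given:

```python
from pyspark.sql import functions as F

# Read the CSV with a header row; without inferSchema, every column is read as a string.
df = spark.read.csv("admission.csv", header=True)

# Cast every column to float so the ML stages can operate on numeric data.
for c in df.columns:
    df = df.withColumn(c, F.col(c).cast("float"))

# Count nulls per column, aliasing each count with its column name for readability.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()
```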
Handling Missing Data
- Use PySpark's Imputer to fill missing values in the input columns.
- Apply the .fit() and .transform() methods to impute the data, replacing nulls with computed values.
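A minimal imputation sketch continuing from the DataFrame above; imputing the columns in place and relying on the Imputer's default strategy (mean) are assumptions, since the lecture's exact settings are not shown:

```python
# Impute every feature column in place, leaving the target column untouched.
feature_cols = [c for c in df.columns if c != "Chance of Admit"]
imputer = Imputer(inputCols=feature_cols, outputCols=feature_cols)  # default strategy: mean
df = imputer.fit(df).transform(df)
```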
Feature Engineering with VectorAssembler
- Exclude the target column ("Chance of Admit") from the feature columns.
- Use VectorAssembler to combine the input columns into the single feature vector column that Spark ML models require.
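A sketch of the assembly step, reusing the feature column list from above; the output column name "features" is an assumption:

```python
# Combine all non-target columns into a single vector column for the regressors.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df)
```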
Model Training and Evaluation: Linear Regression
- Split the data into train (70%) and test (30%) subsets using .randomSplit().
- Initialize and train a LinearRegression model, specifying the feature and label columns.
- View model coefficients (slopes) and intercept (y = mx + c).
- Evaluate model with metrics: Root Mean Squared Error (RMSE) and R² score on training data.
- Predict on the test data and evaluate the R² score using RegressionEvaluator (approx. 78% without preprocessing).
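A sketch of the linear-regression workflow described above; the seed is arbitrary, the column names follow the earlier sketches, and the printed metrics will vary with the split:

```python
# 70/30 train-test split (seed chosen arbitrarily for reproducibility).
train_df, test_df = df.randomSplit([0.7, 0.3], seed=42)

# Fit a linear model: y = m1*x1 + ... + mn*xn + c
lr = LinearRegression(featuresCol="features", labelCol="Chance of Admit")
lr_model = lr.fit(train_df)
print(lr_model.coefficients, lr_model.intercept)

# Training metrics: RMSE and R².
print(lr_model.summary.rootMeanSquaredError, lr_model.summary.r2)

# Predict on the held-out data and score it with a reusable R² evaluator.
lr_predictions = lr_model.transform(test_df)
evaluator = RegressionEvaluator(labelCol="Chance of Admit",
                                predictionCol="prediction", metricName="r2")
print(evaluator.evaluate(lr_predictions))
```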
Model Training and Evaluation: Random Forest Regression
- Initialize and train a RandomForestRegressor with the same input and output columns.
- Predict on test data and evaluate R² score (approx. 82% without preprocessing).
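The same workflow with a random forest, reusing the splits and evaluator from the linear-regression sketch; hyperparameters are left at their defaults, since the lecture does not specify any:

```python
# Train a random forest on the same features and label.
rf = RandomForestRegressor(featuresCol="features", labelCol="Chance of Admit")
rf_model = rf.fit(train_df)

# Evaluate R² on the same test split with the evaluator defined earlier.
rf_predictions = rf_model.transform(test_df)
print(evaluator.evaluate(rf_predictions))
```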
Additional Notes
- Additional feature transformations (e.g., standard scaling, one-hot encoding) can be applied using PySpark ML functions; see the sketch after these notes.
- Data preprocessing steps were not covered in detail in this video, but will be addressed in future content.
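These transformations are not covered in the lecture; the sketch below only illustrates the relevant PySpark ML classes, and the "category" column used for one-hot encoding is hypothetical (the admissions data is entirely numeric):

```python
from pyspark.ml.feature import StandardScaler, StringIndexer, OneHotEncoder

# Scale the assembled feature vector to unit standard deviation (default settings).
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
df = scaler.fit(df).transform(df)

# For a string column (hypothetical "category"), index it and then one-hot encode it;
# these stages would be fit and applied, or chained in a Pipeline.
indexer = StringIndexer(inputCol="category", outputCol="category_index")
encoder = OneHotEncoder(inputCols=["category_index"], outputCols=["category_vec"])
```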
Key Terms & Definitions
- DataFrame — Table-like data structure in PySpark.
- VectorAssembler — Tool to combine multiple columns into a single feature vector.
- Imputer — Tool to fill missing (null) values in columns.
- Linear Regression — Regression algorithm to fit a straight line to data.
- Random Forest Regression — Ensemble algorithm using multiple decision trees for regression.
- RegressionEvaluator — Utility to assess model performance using metrics like R² and RMSE.
- R² Score — Metric indicating how much of the target's variance the model explains (closer to 1.0 is better).
- RMSE — Root Mean Squared Error; lower values indicate better fit.
Action Items / Next Steps
- Practice loading, casting, and imputing data in PySpark DataFrames.
- Try applying both Linear Regression and Random Forest Regression to a sample dataset.
- Prepare for the next lecture on classification examples in PySpark.