PySpark Regression Overview

Jul 10, 2025

Overview

This lecture demonstrates how to apply machine learning regression algorithms (Linear Regression and Random Forest Regression) to a Spark SQL DataFrame using PySpark, covering data import, preprocessing, feature vector assembly, model training, evaluation, and prediction.

Setting Up PySpark for Machine Learning

  • Import the necessary classes: SparkSession, VectorAssembler, LinearRegression, RandomForestRegressor, RegressionEvaluator, and Imputer (see the import sketch below).
  • Use the DataFrame-based pyspark.ml package for machine learning (not the older RDD-based pyspark.mllib API).
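
A minimal sketch of these imports, assuming a SparkSession is created for the session (the application name is an arbitrary choice):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, Imputer
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Entry point for the DataFrame API; reuses an existing session if one is running
spark = SparkSession.builder.appName("pyspark-regression").getOrCreate()
```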

Loading and Preparing Data

  • Read data from a CSV file into a DataFrame, ensuring headers are included.
  • Unless schema inference is enabled, all imported columns are read as strings and must be cast to a numeric type (e.g., float) for computation.
  • Use the .cast('float') method to convert columns to float.
  • Check for null values in each column using SQL-like functions, aliasing the results for clarity (see the sketch below).
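
A sketch of these steps; the file name admission.csv is a placeholder, and the cast loop and null-count expression are one common way to do this rather than the lecture's exact code:

```python
from pyspark.sql import functions as F

# Read the CSV with the first row treated as column headers
df = spark.read.csv("admission.csv", header=True)  # placeholder file name

# Columns come in as strings; cast every column to float for computation
for c in df.columns:
    df = df.withColumn(c, df[c].cast("float"))

# Count nulls per column, aliasing each count with its column name for clarity
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()
```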

Handling Missing Data

  • Use PySpark's Imputer to fill missing values in input columns.
  • Apply the .fit() and .transform() methods to replace nulls with computed values (the column mean by default), as sketched below.
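
A minimal sketch, assuming the feature columns (everything except the "Chance of Admit" target) are imputed in place with the default mean strategy:

```python
from pyspark.ml.feature import Imputer

# Impute every column except the target; output columns overwrite the inputs
input_cols = [c for c in df.columns if c != "Chance of Admit"]
imputer = Imputer(inputCols=input_cols, outputCols=input_cols)  # mean strategy by default
df = imputer.fit(df).transform(df)
```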

Feature Engineering with VectorAssembler

  • Drop the target column ("Chance of Admit") from feature columns.
  • Use VectorAssembler to combine the input columns into the single feature vector that ML models require (see the sketch below).
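
A sketch of the assembly step, assuming the imputed DataFrame df from above and an output column named features:

```python
from pyspark.ml.feature import VectorAssembler

# All columns except the target become inputs to the assembler
feature_cols = [c for c in df.columns if c != "Chance of Admit"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df).select("features", "Chance of Admit")
assembled.show(5, truncate=False)
```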

Model Training and Evaluation: Linear Regression

  • Split data into train (70%) and test (30%) subsets using .randomSplit().
  • Initialize and train a LinearRegression model with feature and label columns.
  • View model coefficients (slopes) and intercept (y = mx + c).
  • Evaluate model with metrics: Root Mean Squared Error (RMSE) and R² score on training data.
  • Predict on the test data and evaluate the R² score using RegressionEvaluator (approx. 78% without preprocessing); see the sketch below.
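
A sketch of the full linear-regression workflow; the seed value and variable names are assumptions:

```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# 70/30 split; a fixed seed keeps the split reproducible
train, test = assembled.randomSplit([0.7, 0.3], seed=42)

lr = LinearRegression(featuresCol="features", labelCol="Chance of Admit")
lr_model = lr.fit(train)

print(lr_model.coefficients)                   # slopes (m in y = mx + c)
print(lr_model.intercept)                      # intercept (c)
print(lr_model.summary.rootMeanSquaredError)   # RMSE on training data
print(lr_model.summary.r2)                     # R² on training data

# Predict on the held-out test data and evaluate R²
predictions = lr_model.transform(test)
evaluator = RegressionEvaluator(
    labelCol="Chance of Admit", predictionCol="prediction", metricName="r2"
)
print(evaluator.evaluate(predictions))
```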

Model Training and Evaluation: Random Forest Regression

  • Initialize and train RandomForestRegressor with the same input/output columns.
  • Predict on the test data and evaluate the R² score (approx. 82% without preprocessing); a sketch follows.
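
A matching sketch for the random-forest model, reusing the split and evaluator from the linear-regression sketch:

```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol="features", labelCol="Chance of Admit")
rf_model = rf.fit(train)

# Predict on the same test split and report R²
rf_predictions = rf_model.transform(test)
print(evaluator.evaluate(rf_predictions))
```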

Additional Notes

  • Additional feature transformations (e.g., standard scaling, one-hot encoding) can be applied using PySpark ML functions; see the scaling sketch below.
  • Data preprocessing steps were not covered in detail in this lecture, but will be addressed in future content.
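
As an illustration, a hedged sketch of standard scaling applied to the assembled feature vector (not shown in the lecture):

```python
from pyspark.ml.feature import StandardScaler

# Scales each feature in the assembled vector to unit standard deviation
# (set withMean=True to also center the features)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaled = scaler.fit(assembled).transform(assembled)
```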

Key Terms & Definitions

  • DataFrame — Table-like data structure in PySpark.
  • VectorAssembler — Tool to combine multiple columns into a single feature vector.
  • Imputer — Tool to fill missing (null) values in columns.
  • Linear Regression — Regression algorithm that fits a linear relationship (a straight line) between the features and the target.
  • Random Forest Regression — Ensemble algorithm using multiple decision trees for regression.
  • RegressionEvaluator — Utility to assess model performance using metrics like R² and RMSE.
  • R² Score — Metric indicating the proportion of variance in the target explained by the model (closer to 1.0 is better).
  • RMSE — Root Mean Squared Error; lower values indicate better fit.

Action Items / Next Steps

  • Practice loading, casting, and imputing data in PySpark DataFrames.
  • Try applying both Linear Regression and Random Forest Regression to a sample dataset.
  • Prepare for the next lecture on classification examples in PySpark.