Overview
This lecture demonstrates how to apply machine learning regression algorithms (Linear Regression and Random Forest Regression) on a Spark SQL DataFrame using PySpark, covering data import, preprocessing, feature vector assembly, model training, evaluation, and predictions.
Setting Up PySpark for Machine Learning
- Import necessary libraries: SparkSession, VectorAssembler, LinearRegression, RandomForestRegressor, RegressionEvaluator, and Imputer.
- Use the PySpark ML package (pyspark.ml), which works on DataFrames, rather than the older RDD-based MLlib API (pyspark.mllib).
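A minimal setup sketch based on the imports listed above; the exact session configuration used in the lecture is not shown, so the app name here is arbitrary:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, Imputer
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Create (or reuse) a SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("AdmissionRegression").getOrCreate()
```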
Loading and Preparing Data
- Read data from a CSV file into a DataFrame, ensuring headers are included.
- By default, all imported columns are of string type and must be cast to float for computation.
- Use the .cast('float') method to convert columns to float.
- Check for null values in each column using SQL-like functions and alias results for clarity.
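A sketch of the load, cast, and null-check steps; the file name admission.csv is a placeholder, since the lecture's actual path and code are not given:

```python
from pyspark.sql import functions as F

# Read the CSV with a header row; without inferSchema, every column is read as a string.
df = spark.read.csv("admission.csv", header=True)

# Cast every column to float so the ML stages can operate on numeric data.
for c in df.columns:
    df = df.withColumn(c, F.col(c).cast("float"))

# Count nulls per column, aliasing each count with its column name for readability.
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()
```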
Handling Missing Data
- Use PySpark's Imputer to fill missing values in the input columns.
- Apply the .fit() and .transform() methods to impute the data, replacing nulls with computed values.
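A minimal imputation sketch continuing from the DataFrame above; imputing the columns in place and relying on the Imputer's default strategy (mean) are assumptions, since the lecture's exact settings are not shown:

```python
# Impute every feature column in place, leaving the target column untouched.
feature_cols = [c for c in df.columns if c != "Chance of Admit"]
imputer = Imputer(inputCols=feature_cols, outputCols=feature_cols)  # default strategy: mean
df = imputer.fit(df).transform(df)
```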
Feature Engineering with VectorAssembler
- Exclude the target column ("Chance of Admit") from the feature columns.
- Use VectorAssembler to combine the input columns into the single feature vector column that Spark ML models require.
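A sketch of the assembly step, reusing the feature column list from above; the output column name "features" is an assumption:

```python
# Combine all non-target columns into a single vector column for the regressors.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df)
```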
Model Training and Evaluation: Linear Regression
- Split the data into train (70%) and test (30%) subsets using .randomSplit().
- Initialize and train a LinearRegression model, specifying the feature and label columns.
- View model coefficients (slopes) and intercept (y = mx + c).
- Evaluate model with metrics: Root Mean Squared Error (RMSE) and R² score on training data.
- Predict on the test data and evaluate the R² score using RegressionEvaluator (approx. 78% without preprocessing).
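A sketch of the linear-regression workflow described above; the seed is arbitrary, the column names follow the earlier sketches, and the printed metrics will vary with the split:

```python
# 70/30 train-test split (seed chosen arbitrarily for reproducibility).
train_df, test_df = df.randomSplit([0.7, 0.3], seed=42)

# Fit a linear model: y = m1*x1 + ... + mn*xn + c
lr = LinearRegression(featuresCol="features", labelCol="Chance of Admit")
lr_model = lr.fit(train_df)
print(lr_model.coefficients, lr_model.intercept)

# Training metrics: RMSE and R².
print(lr_model.summary.rootMeanSquaredError, lr_model.summary.r2)

# Predict on the held-out data and score it with a reusable R² evaluator.
lr_predictions = lr_model.transform(test_df)
evaluator = RegressionEvaluator(labelCol="Chance of Admit",
                                predictionCol="prediction", metricName="r2")
print(evaluator.evaluate(lr_predictions))
```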
Model Training and Evaluation: Random Forest Regression
- Initialize and train a RandomForestRegressor with the same input and output columns.
- Predict on test data and evaluate R² score (approx. 82% without preprocessing).
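The same workflow with a random forest, reusing the splits and evaluator from the linear-regression sketch; hyperparameters are left at their defaults, since the lecture does not specify any:

```python
# Train a random forest on the same features and label.
rf = RandomForestRegressor(featuresCol="features", labelCol="Chance of Admit")
rf_model = rf.fit(train_df)

# Evaluate R² on the same test split with the evaluator defined earlier.
rf_predictions = rf_model.transform(test_df)
print(evaluator.evaluate(rf_predictions))
```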
Additional Notes
- Additional feature transformations (e.g., standard scaling, one-hot encoding) can be applied using PySpark ML functions; see the sketch after these notes.
- Data preprocessing steps were not covered in detail in this video, but will be addressed in future content.
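These transformations are not covered in the lecture; the sketch below only illustrates the relevant PySpark ML classes, and the "category" column used for one-hot encoding is hypothetical (the admissions data is entirely numeric):

```python
from pyspark.ml.feature import StandardScaler, StringIndexer, OneHotEncoder

# Scale the assembled feature vector to unit standard deviation (default settings).
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
df = scaler.fit(df).transform(df)

# For a string column (hypothetical "category"), index it and then one-hot encode it;
# these stages would be fit and applied, or chained in a Pipeline.
indexer = StringIndexer(inputCol="category", outputCol="category_index")
encoder = OneHotEncoder(inputCols=["category_index"], outputCols=["category_vec"])
```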
Key Terms & Definitions
- DataFrame — Table-like data structure in PySpark.
- VectorAssembler — Tool to combine multiple columns into a single feature vector.
- Imputer — Tool to fill missing (null) values in columns.
- Linear Regression — Regression algorithm to fit a straight line to data.
- Random Forest Regression — Ensemble algorithm using multiple decision trees for regression.
- RegressionEvaluator — Utility to assess model performance using metrics like R² and RMSE.
- R² Score — Metric indicating how much of the target's variance the model explains (closer to 1.0 is better).
- RMSE — Root Mean Squared Error; lower values indicate better fit.
Action Items / Next Steps
- Practice loading, casting, and imputing data in PySpark DataFrames.
- Try applying both Linear Regression and Random Forest Regression to a sample dataset.
- Prepare for the next lecture on classification examples in PySpark.