General Features: Various features that describe automobiles (make, fuel type, engine location, etc.).
Categorical Variables: Symboling, Make, Fuel Type, Aspiration, etc.
Numeric Variables: Length, Weight, Horsepower, etc.
Important Columns
Symboling: Risk assessment for insuring a specific automobile.
Normalized Losses: Losses incurred by insurance companies on specific vehicles.
Data Preparation
Initial Steps
Select Columns in Data Set: Select all features.
Remove Duplicate Rows: Check for duplicates and remove if any.
Edit Metadata: Separate categorical features from numeric ones.
Confirm that there are 11 categorical features and 15 numeric features.
Explicitly set the type (categorical or numeric) of each feature.
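The initial steps above are Designer components, but the same logic can be sketched in plain Python. This is only an illustration: the sample rows, column names, and helper functions are hypothetical and not part of the session's pipeline.

```python
# Hypothetical sample rows; column names are illustrative, not the full dataset.
rows = [
    {"make": "audi", "fuel_type": "gas", "horsepower": 102.0, "price": 13950.0},
    {"make": "bmw", "fuel_type": "gas", "horsepower": 101.0, "price": 16430.0},
    {"make": "audi", "fuel_type": "gas", "horsepower": 102.0, "price": 13950.0},  # duplicate
]

def remove_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def split_feature_types(rows):
    """Label a column categorical if any value is a string, numeric otherwise."""
    categorical, numeric = [], []
    for name in rows[0]:
        if any(isinstance(r[name], str) for r in rows):
            categorical.append(name)
        else:
            numeric.append(name)
    return categorical, numeric

unique_rows = remove_duplicate_rows(rows)
categorical_cols, numeric_cols = split_feature_types(unique_rows)
```

Type inference like this would label an integer-coded column such as Symboling as numeric, which is one reason to set feature types explicitly rather than rely on inference.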
Data Imputation
Categorical Variables: Use the Clean Missing Data component to replace missing values with the mode.
Numeric Variables: Use the Clean Missing Data component to replace missing values with the median.
Data Normalization: Normalize numeric data using Min-Max normalization.
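As a rough sketch of what the imputation and normalization steps compute, the following plain-Python example fills missing values with the mode or median and then applies Min-Max scaling. The function names and sample values are assumptions made for illustration; missing values are represented here as None.

```python
from statistics import median, mode

def impute(values, strategy):
    """Replace None entries with the mode or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if strategy == "mode" else median(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical columns with one missing value each.
fuel_type = ["gas", None, "diesel", "gas"]
horsepower = [100.0, None, 150.0, 200.0]

fuel_type_filled = impute(fuel_type, "mode")        # None -> "gas"
horsepower_filled = impute(horsepower, "median")    # None -> 150.0
horsepower_scaled = min_max_normalize(horsepower_filled)
```

Min-Max scaling maps the smallest value to 0 and the largest to 1, so numeric features with very different units end up on a comparable scale.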
Splitting Data
Initial Split: Split data into training (95%) and test (5%) datasets.
Secondary Split: Further split training data into 95% training and 5% validation datasets.
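The two-stage split can be sketched as follows. The helper name, the seed, and the 200-row stand-in dataset are assumptions for illustration only.

```python
import random

def split(rows, first_fraction, seed):
    """Shuffle a copy of rows and cut it at first_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(200))  # stand-in for 200 data rows

# Initial split: 95% training, 5% test.
train_full, test = split(rows, 0.95, seed=42)

# Secondary split: 95% of the training portion for training, 5% for validation.
train, validation = split(train_full, 0.95, seed=42)
```

Applied to the whole dataset, the two splits leave roughly 90.25% of the rows for training, 4.75% for validation, and 5% for testing.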
Linear Regression Model Setup
Initialize Models
Initialize two linear regression models with different solution methods:
Ordinary Least Squares (OLS).
Online Gradient Descent.
Adjust hyperparameters for both models.
L2 Regularization Weight: Set to a larger value to penalize large coefficients more strongly.
Online Gradient Descent Learning Rate: Set to 0.1, enable learning rate decay, and set the number of training epochs.
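To make the difference between the two solution methods concrete, here is a minimal single-feature sketch. It is not the Designer component's implementation: the function names, the 1/sqrt(epoch) decay schedule, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import math

def fit_ols(xs, ys, l2_weight=0.0):
    """Closed-form least squares for one feature, with an L2 penalty on the slope."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + l2_weight)
    return slope, y_mean - slope * x_mean

def fit_online_gd(xs, ys, learning_rate=0.1, epochs=2000, decay=True, l2_weight=0.0):
    """Per-sample gradient descent on squared error.

    The 1/sqrt(epoch) decay schedule is an assumed choice, not the component's.
    """
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for epoch in range(1, epochs + 1):
        rate = learning_rate / math.sqrt(epoch) if decay else learning_rate
        for x, y in zip(xs, ys):
            error = slope * x + intercept - y
            slope -= rate * (error * x + l2_weight * slope / n)
            intercept -= rate * error
    return slope, intercept

# Toy data lying exactly on y = 2x + 1 (x already Min-Max scaled).
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]

ols_slope, ols_intercept = fit_ols(xs, ys)
ridge_slope, ridge_intercept = fit_ols(xs, ys, l2_weight=1.0)
gd_slope, gd_intercept = fit_online_gd(xs, ys)
```

On this toy data the closed-form fit recovers the line in one step, a larger L2 weight shrinks the slope toward zero, and gradient descent only approaches the same answer as the number of epochs grows.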
Hyperparameter Tuning
Use the Tune Model Hyperparameters component.
Parameter Sweeping Mode: Random grid, set maximum runs.
Random Seed: Set a fixed seed for reproducibility.
Metric for Performance: Root Mean Squared Error (RMSE).
Run the tuning process on both models.
Explore which hyperparameter combinations yield the best performing models.
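The idea behind a random grid sweep can be sketched like this: sample a limited number of points from a grid of candidate values, fit a model for each, and keep the one with the lowest RMSE on held-out data. The helper names, the candidate grid, and the toy data are hypothetical; the Tune Model Hyperparameters component handles all of this internally.

```python
import math
import random

def fit_ridge(xs, ys, l2_weight):
    """Closed-form single-feature least squares with an L2 penalty on the slope."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + l2_weight)
    return slope, y_mean - slope * x_mean

def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the line on (xs, ys)."""
    return math.sqrt(sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def random_sweep(train, validation, candidates, max_runs, seed):
    """Evaluate max_runs randomly chosen grid points; return (best_rmse, best_l2)."""
    rng = random.Random(seed)  # fixed seed makes the sweep reproducible
    tried = rng.sample(candidates, min(max_runs, len(candidates)))
    results = []
    for l2_weight in tried:
        slope, intercept = fit_ridge(*train, l2_weight)
        results.append((rmse(*validation, slope, intercept), l2_weight))
    return min(results)

# Hypothetical (xs, ys) pairs for training and validation.
train = ([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], [1.1, 1.3, 1.9, 2.2, 2.5, 3.1])
validation = ([0.1, 0.5, 0.9], [1.2, 2.0, 2.8])
grid = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0]

best_rmse, best_l2 = random_sweep(train, validation, grid, max_runs=4, seed=0)
```

Because only `max_runs` of the grid points are evaluated, the sweep trades completeness for speed; rerunning with the same seed tries the same points and returns the same result.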
Training the Model
Use the Train Model component to train the best model from hyperparameter tuning.
Specify the label column as Price.
Connect the training dataset from the initial split.
Run the training process for both models.
Model Testing and Evaluation
Score Model: Score both trained models using the test dataset.
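Evaluate Model: Evaluate both models and compare performance metrics.
Coefficient of Determination (R²): Measures goodness of fit as the proportion of variance in the label explained by the model.
Mean Absolute Error (MAE): Average absolute difference between predicted and actual values.
Root Mean Squared Error (RMSE): Square root of the mean of squared errors.
Relative Absolute Error (RAE): Total absolute error divided by the total absolute error of always predicting the mean, which normalizes errors for comparison.
Relative Squared Error (RSE): Total squared error divided by the total squared error of always predicting the mean; similar in use to RAE, and noted in the session as often used in financial modeling.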
Visualize and compare results from both models.
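For reference, the five metrics can be computed directly from actual and predicted values using their standard definitions. The function name and the sample numbers below are made up for illustration and are not results from the session.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, RAE, RSE, and R² from paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Baseline errors of a model that always predicts the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / base_abs,
        "RSE": sq_err / base_sq,
        "R2": 1 - sq_err / base_sq,
    }

# Hypothetical values, not results from the session.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(actual, predicted)
```

With these definitions R² equals 1 minus RSE, so a model that is no better than predicting the mean has RSE near 1 and R² near 0.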
Key Insights
OLS vs. Online Gradient Descent: OLS typically performs better because it is a closed-form solution, while online gradient descent is iterative and approximate.
Hyperparameters and Regularization: Adjusting hyperparameters and applying regularization can significantly affect model performance.
Q&A and Conclusion
Addressed various queries from participants regarding data steps, model configurations, and statistical significance of the parameters.
Discussion on Future Sessions: Professor will check if logistic regression is included.
Feedback Request: Participants encouraged to provide feedback.
Next Steps: Concluding remarks and thanks to participants.
Contact Information
Professor’s Email: Provided for further queries.