Gradient Boosting: Lecture Notes
Introduction
- Presenter: Ranjan
- Topic: Gradient Boosting
- Context: AI Playlist on YouTube
- Previous Topics Covered: Ensemble techniques including Voting, Bagging, Stacking, Pasting, and Adaptive Boosting (AdaBoost)
Gradient Boosting Overview
- Types of Boosting: four commonly cited types of boosting; AdaBoost was covered previously
- Importance: Widely used in data science problems and competitions, common interview topic
- Difference from Other Techniques: Boosting involves dependency between models, unlike Bagging where models are independent
Key Concepts
- AdaBoost Recap: Weight Manipulation
- Correct classification: decrease weight
- Misclassification: increase weight
- Gradient Boosting: Loss Function Optimization
- No sample re-weighting
- Each new model learns by minimizing a loss function, fitting the residuals (actual − predicted), which for squared error are the negative gradient of the loss (worked out below)
- Various loss functions can be used: L1 (MAE), L2 (MSE), RMSE, etc.
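As a quick check of this idea: with the squared-error loss, the negative gradient of the loss with respect to the current prediction is exactly the residual, so fitting residuals is a gradient-descent step in function space:
$$
L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2, \qquad -\frac{\partial L}{\partial \hat{y}} = y - \hat{y} = \text{residual}
$$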
Detailed Process of Gradient Boosting
- Base Model: A simple initial prediction (for regression, typically the mean of the target), only slightly better than random guessing
- Errors & Residuals: Use of Residuals
- First, use the base model for initial prediction
- Calculate errors (residuals) as actual - predicted
- Residuals are used as input for subsequent models
- Learning Rate: Determines the magnitude of change added by each subsequent model
- Model Iteration: Continue until the number of estimators is reached or the residuals become sufficiently small (see the loop sketch after this list)
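A minimal sketch of this loop, assuming squared-error loss and scikit-learn decision trees as the weak learners; the parameter names (n_estimators, learning_rate, max_depth) are illustrative conventions, not quotes from the lecture:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Fit a simple gradient-boosted regressor with squared-error loss."""
    base_prediction = y.mean()        # base model: average of the target
    residuals = y - base_prediction   # initial residuals (actual - predicted)
    trees = []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)        # each new tree learns the current residuals
        residuals = residuals - learning_rate * tree.predict(X)  # shrink the correction by eta
        trees.append(tree)
    return base_prediction, trees

def gradient_boost_predict(X, base_prediction, trees, learning_rate=0.1):
    """Final output = base model + eta * RM1 + eta * RM2 + ..."""
    prediction = np.full(X.shape[0], base_prediction)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction
```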
Example Walkthrough
- Dataset: Home price data with variables like number of rooms, floor, and room size
- Steps:
- Base Model: Calculate average dependent variable (home price)
- Calculate Residuals: Difference between actual values and predictions from the base model
- Train First Residual Model (RM1): Use residuals as target for RM1
- Update Predictions: Add RM1’s predictions to base model’s predictions
- Recalculate Residuals: Use new predictions to calculate residuals
- Repeat: Fit subsequent residual models (RM2, RM3, etc.) on updated residuals until all residuals are minimized
- Formula for Final Output:
$$
\text{Final Output} = \text{Base Model Output} + \eta \times \text{RM1 Output} + \eta \times \text{RM2 Output} + \ldots
$$
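To make the walkthrough concrete, here is a toy version of the home-price example (all numbers invented for illustration), computing the base model, the first residual model RM1, and the updated predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical home-price data: columns are [rooms, floor, room size]
X = np.array([[2, 1, 60], [3, 2, 80], [4, 3, 120], [3, 1, 75]])
y = np.array([200.0, 300.0, 450.0, 280.0])   # price in thousands

eta = 0.1
base_pred = y.mean()                         # base model: average price
residuals = y - base_pred                    # residuals of the base model

rm1 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # RM1 fits the residuals
pred_after_rm1 = base_pred + eta * rm1.predict(X)           # base + eta * RM1
new_residuals = y - pred_after_rm1                          # residuals for RM2

print(pred_after_rm1)
print(new_residuals)
```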
Hyperparameters in Gradient Boosting
- Learning Rate (η): Value between 0-1, controls the contribution of each tree; lower is preferable for robustness and generalization
- Number of Estimators: Number of decision trees used
- Subsample: Fraction of data used in each iteration
- Loss Function: Type of loss function (e.g., MSE, RMSE)
- Maximum Features & Depth: Controls overfitting by limiting the number of features considered at each split and the depth of each tree (configuration sketch below)
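These hyperparameters map directly onto scikit-learn's GradientBoostingRegressor; a sketch of a typical (untuned, illustrative) configuration:

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    learning_rate=0.05,    # eta: contribution of each tree (lower = more robust)
    n_estimators=300,      # number of boosting stages (trees)
    subsample=0.8,         # fraction of samples used to fit each tree
    loss="squared_error",  # loss function to optimize (name in recent scikit-learn versions)
    max_features="sqrt",   # features considered when looking for the best split
    max_depth=3,           # depth of each individual tree
)
# model.fit(X_train, y_train); preds = model.predict(X_test)
```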
Summary
- Gradient Boosting: Each new model fits the residuals of the models before it, iteratively reducing the error
- Final Steps: Add up the models to get final predictions, minimize error through iterations
- Next Video: Practical implementation using a dataset
Conclusion
- Engagement: Like, subscribe, press the bell icon, share the video
- Call to Action: Stay tuned for the next video on using Gradient Boosting in a model