Gradient Boosting: Lecture Notes
Introduction
- Presenter: Ranjan
- Topic: Gradient Boosting
- Context: AI Playlist on YouTube
- Previous Topics Covered: Ensemble techniques including Voting, Bagging, Stacking, Pasting, and Adaptive Boosting (AdaBoost)
Gradient Boosting Overview
- Types of Boosting: four commonly cited types of boosting; AdaBoost was covered previously
- Importance: Widely used in data science problems and competitions, common interview topic
- Difference from Other Techniques: Boosting involves dependency between models, unlike Bagging where models are independent
Key Concepts
- AdaBoost Recap: Weight Manipulation
- Correct classification: decrease weight
- Misclassification: increase weight
- Gradient Boosting: Loss Function Optimization
- No sample re-weighting
- Each new model learns by minimizing a loss function, fitting the residuals (actual − predicted), which for squared error are the negative gradient of the loss (worked out below)
- Various loss functions can be used: L1 (MAE), L2 (MSE), RMSE, etc.
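As a quick check of this idea: with the squared-error loss, the negative gradient of the loss with respect to the current prediction is exactly the residual, so fitting residuals is a gradient-descent step in function space:
$$
L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2, \qquad -\frac{\partial L}{\partial \hat{y}} = y - \hat{y} = \text{residual}
$$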
Detailed Process of Gradient Boosting
- Base Model: A simple initial prediction (for regression, typically the mean of the target), only slightly better than random guessing
- Errors & Residuals: Use of Residuals
- First, use the base model for initial prediction
- Calculate errors (residuals) as actual - predicted
- Residuals are used as input for subsequent models
- Learning Rate: Determines the magnitude of change added by each subsequent model
- Model Iteration: Continue until the number of estimators is reached or the residuals become sufficiently small (see the loop sketch after this list)
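A minimal sketch of this loop, assuming squared-error loss and scikit-learn decision trees as the weak learners; the parameter names (n_estimators, learning_rate, max_depth) are illustrative conventions, not quotes from the lecture:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Fit a simple gradient-boosted regressor with squared-error loss."""
    base_prediction = y.mean()        # base model: average of the target
    residuals = y - base_prediction   # initial residuals (actual - predicted)
    trees = []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)        # each new tree learns the current residuals
        residuals = residuals - learning_rate * tree.predict(X)  # shrink the correction by eta
        trees.append(tree)
    return base_prediction, trees

def gradient_boost_predict(X, base_prediction, trees, learning_rate=0.1):
    """Final output = base model + eta * RM1 + eta * RM2 + ..."""
    prediction = np.full(X.shape[0], base_prediction)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction
```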
Example Walkthrough
- Dataset: Home price data with variables like number of rooms, floor, and room size
- Steps:
- Base Model: Calculate average dependent variable (home price)
- Calculate Residuals: Difference between actual values and predictions from the base model
- Train First Residual Model (RM1): Use residuals as target for RM1
- Update Predictions: Add RM1’s predictions to base model’s predictions
- Recalculate Residuals: Use new predictions to calculate residuals
- Repeat: Fit subsequent residual models (RM2, RM3, etc.) on updated residuals until all residuals are minimized
- Formula for Final Output:
$$
\text{Final Output} = \text{Base Model Output} + \eta \times \text{RM1 Output} + \eta \times \text{RM2 Output} + \ldots
$$
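To make the walkthrough concrete, here is a toy version of the home-price example (all numbers invented for illustration), computing the base model, the first residual model RM1, and the updated predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical home-price data: columns are [rooms, floor, room size]
X = np.array([[2, 1, 60], [3, 2, 80], [4, 3, 120], [3, 1, 75]])
y = np.array([200.0, 300.0, 450.0, 280.0])   # price in thousands

eta = 0.1
base_pred = y.mean()                         # base model: average price
residuals = y - base_pred                    # residuals of the base model

rm1 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # RM1 fits the residuals
pred_after_rm1 = base_pred + eta * rm1.predict(X)           # base + eta * RM1
new_residuals = y - pred_after_rm1                          # residuals for RM2

print(pred_after_rm1)
print(new_residuals)
```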
Hyperparameters in Gradient Boosting
- Learning Rate (η): Value between 0-1, controls the contribution of each tree; lower is preferable for robustness and generalization
- Number of Estimators: Number of decision trees used
- Subsample: Fraction of data used in each iteration
- Loss Function: Type of loss function (e.g., MSE, RMSE)
- Maximum Features & Depth: Controls overfitting by limiting the number of features considered at each split and the depth of each tree (configuration sketch below)
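These hyperparameters map directly onto scikit-learn's GradientBoostingRegressor; a sketch of a typical (untuned, illustrative) configuration:

```python
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    learning_rate=0.05,    # eta: contribution of each tree (lower = more robust)
    n_estimators=300,      # number of boosting stages (trees)
    subsample=0.8,         # fraction of samples used to fit each tree
    loss="squared_error",  # loss function to optimize (name in recent scikit-learn versions)
    max_features="sqrt",   # features considered when looking for the best split
    max_depth=3,           # depth of each individual tree
)
# model.fit(X_train, y_train); preds = model.predict(X_test)
```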
Summary
- Gradient Boosting: Each new model fits the residuals of the models before it, iteratively reducing the error
- Final Steps: Add up the models to get final predictions, minimize error through iterations
- Next Video: Practical implementation using a dataset
Conclusion
- Engagement: Like, subscribe, press the bell icon, share the video
- Call to Action: Stay tuned for the next video on using Gradient Boosting in a model