Transcript for:
Gradient Boosting: Lecture Notes

Hey, hello, and welcome to my YouTube channel. It's Ranjan, this is the AI playlist, and today we will talk about gradient boosting. I have already covered some ensemble techniques in my previous videos: voting, bagging, stacking, pasting, and adaptive boosting (AdaBoost), and I hope the basic difference between all these terms is clear by now. You should also be aware that there are four well-known types of boosting; we have already covered AdaBoost, and in this video we will talk about gradient boosting. Before starting, one note: gradient boosting and XGBoost (extreme gradient boosting) are used in almost every data science problem and competition, and this is a favorite interview topic, so give this video your full attention, because it is fairly involved.

I have already covered the basic difference between bagging and boosting: in bagging there is no dependency between the models, while in boosting there is a dependency from one model to the next. In AdaBoost, the weight of a sample is decreased when it is classified correctly and increased when it is misclassified.

Before getting into gradient boosting, let's compare it with AdaBoost. In AdaBoost, as we saw in the last video, learning happens through weight updates: a misclassified sample gets an increased weight, so it has a higher probability of being selected by the next model, and a correctly classified sample gets a decreased weight. In gradient boosting we do not use sample weights at all; the learning happens by optimizing a loss function. We all know what a loss is: actual minus predicted. Loss function, error function, cost function, they are all the same thing here, and we can use any loss, such as L1, L2, MSE, or RMSE. So in gradient boosting the learning happens by optimizing the loss function, and that is exactly why it is called "gradient": in gradient descent, too, we minimize the loss by optimizing the loss function.

Another difference: in AdaBoost every base estimator was a stump, a decision tree with only one level of depth, as covered in the previous video. In gradient boosting we use two kinds of models: the first model is an average model, and all the consecutive models are decision trees grown to full depth, so they can go to any number of levels.

Gradient boosting is used for both classification and regression, though it is most often used for classification. Today I will take a regression example, and in my upcoming video I will take a classification example as well.

First I will give a brief overview of how it works, and then we will take an example. I have drafted a flowchart of gradient boosting. This is our original data, the data we have to work on, with X and y: X is the set of independent variables and y is the dependent variable. First we make a base model. It is nothing special, just an average model, only a little better than random guessing.
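To make the word "gradient" a little more concrete than the slides do, here is one standard way to write the model that matches the description above. The notation (F_m for the model after m trees, h_m for the m-th residual tree, eta for the learning rate) is mine, not from the video:

$$F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \eta\, h_m(x), \quad m = 1, 2, \dots$$

For the squared-error loss,

$$L(y, F) = \tfrac{1}{2}\,(y - F)^2, \qquad -\frac{\partial L}{\partial F} = y - F,$$

so the negative gradient of the loss is exactly the residual, actual minus predicted, and each new tree h_m is trained on the gradient of the loss of the model built so far. That is where the name comes from.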
I will show in the next slides how the base model calculates its output. The data is passed to the base model, the base model predicts some output, and we compare that output with the actual values. When we compare actual and predicted there will be some error, and that error goes to the next model, a decision tree. So we are using the error, the residuals, as the target for that decision tree, and we still use all the X variables as its input. Note that X and y are passed only to the base model; all the remaining models receive only the X data, together with the residuals.

That first decision tree is RM1, residual model 1, and it is grown to full depth. It predicts residuals of its own: suppose the true residual is 50 and RM1 predicts 30, then the remaining error is passed on to the next model, again along with only the X variables. We can keep adding decision trees like this: residual model 1, residual model 2, residual model 3, and so on up to residual model n.

How is the final output calculated? In plain language: the final output is the output of the base model, plus the learning rate times the output of RM1, plus the learning rate times the output of RM2, and so on, so every residual prediction is scaled by the learning rate and added onto the base output. I will show in the next slides how this works out in numbers.

These are the steps we follow in gradient boosting. Step 1: create a base model. For regression this is the average model (for classification it would be the most frequent category), and we pass it all the data, the independent variables X and the dependent variable y. Step 2: calculate the residuals. Once we have the predicted output from the average model and the actual values, we subtract them, and the residual we get is basically the loss. Step 3: create a model RM1, residual model 1, which takes the residuals as its target. Notice that RM1 is trained on X and the residuals, not on y; fitting the residuals is what optimizes the loss function, and I will show a worked example in the next slides. RM1 then predicts new residual values, we use them to update the prediction, subtract the new prediction from the actual values to get fresh residuals, and create a new model RM2 on those. A small sketch of this whole loop follows below.
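Here is a minimal from-scratch sketch of these steps for regression. It is my own illustration, not code from the video: it assumes scikit-learn's DecisionTreeRegressor for the residual models, and the function names (gradient_boost_fit, gradient_boost_predict) are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=None):
    """Step 1: average base model; then repeat steps 2-3, fitting trees on residuals."""
    y = np.asarray(y, dtype=float)
    base_pred = float(np.mean(y))                         # step 1: the average (base) model
    current_pred = np.full(len(y), base_pred)
    trees = []
    for _ in range(n_estimators):
        residuals = y - current_pred                      # step 2: residual = actual - predicted
        tree = DecisionTreeRegressor(max_depth=max_depth) # fully grown when max_depth=None
        tree.fit(X, residuals)                            # step 3: RM_m takes residuals as target
        current_pred = current_pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base_pred, trees

def gradient_boost_predict(X, base_pred, trees, learning_rate=0.1):
    """Final output = base model + lr * RM1 + lr * RM2 + ..."""
    pred = np.full(X.shape[0], base_pred)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```

The usual trade-off is visible here: a smaller learning_rate means each tree corrects only a small fraction of the remaining error, so more estimators are needed, but the model tends to generalize better.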
So the basic recipe is: create the average model, calculate the residuals, fit residual model RM1, calculate the residuals again, fit residual model RM2, calculate the residuals again, and so on. This continues until we have reached the configured number of estimators or the residuals have come down to zero; that is the point where the error is zero, which I plot against the number of iterations in the next slide.

In that plot, the green curve is my base model, which just gives the average. At that point the error is still large, so I add another model, the red one, which is RM1, and the error gets reduced. Then I add RM2 and the error is reduced again, then RM3, and so on. At the end we add up all the models to get the predicted output, and at that point the error is minimal. If these are the iterations, first, second, third, fourth, fifth, then by about the sixth iteration the error is zero or at its minimum.

Let me spell out what is happening when I train these models. When I train a model on some data, I get a loss; I then use that loss as the input, the target, for my next model. The base model produces a loss, that loss goes into RM1; RM1 produces a loss, and that goes into RM2, and so on. If there is one thing to remember, the mantra of gradient boosting, it is this: it uses the gradient, which is the loss or error of one model, as the input to its next model, and it keeps going. Each model therefore works on the previous model's error, and by doing so it boosts the performance and the accuracy; that is the "boosting" part. And it is called "gradient" boosting because, just as in gradient descent, the loss keeps decreasing over the iterations. We are optimizing the loss function, and optimizing the loss function simply means decreasing the difference between actual and predicted, driving that loss toward zero. The short snippet below shows this loss-per-iteration picture in code.
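To reproduce that decreasing-loss curve, one can record the training error after every residual model. Again, this is my own sketch rather than the video's code, and the function name training_error_per_stage is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def training_error_per_stage(X, y, n_estimators=10, learning_rate=0.1):
    """Return the training MSE after the base model and after each residual model."""
    y = np.asarray(y, dtype=float)
    pred = np.full(len(y), np.mean(y))             # base (average) model
    errors = [float(np.mean((y - pred) ** 2))]     # loss before any residual model
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor().fit(X, y - pred)
        pred = pred + learning_rate * tree.predict(X)
        errors.append(float(np.mean((y - pred) ** 2)))
    return errors   # this list shrinks stage by stage, like the curve on the slide
```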
Now I will explain gradient boosting with the help of a small dataset. A little background on this data: it is home-price data, and each home has some details such as the number of rooms, which floor it is on, and the size in square feet. Those three variables are our X, and the price is our y.

Following the gradient boosting steps I explained: according to step 1, we calculate the average of the price column, our dependent variable, and it comes out to approximately 108. According to step 2, we have to calculate the residuals, meaning the error. We could use any error measure, MSE or RMSE, but here we use the simplest one, actual minus predicted. The base model predicts 108 for every row, because the average model gives one output for all rows; so for the first row, where the actual price is 130, the residual is 130 minus 108, which is 22, and every other row gets its residual in the same way, some positive and some negative. With that, step 2 is complete: the residuals of the base model's predictions are calculated.

Now step 3, as I explained earlier: we create RM1, which takes the residuals as its target. So the residual column becomes the new target; I delete the original y column, because it is not used by RM1, and simply replace it with the residuals. RM1 is a decision tree regressor, grown to full depth; it is fit on the X features, the independent variables, with the residuals as the dependent variable, and it then predicts new residual values: 25, minus 14, minus 13, and so on. Of course, there is some difference between the actual residuals and what RM1 predicts.

Next I have to calculate the new predicted value of the home price, because ultimately our goal is to predict the home price. The formula is: the value predicted by the base model, plus lr, the learning rate eta, which varies from 0 to 1 (we can take anything in that range), times the residual predicted by RM1; in the next step it would be plus eta times RM2, exactly as in the output formula shown earlier. Take the first row: the base model predicted 108, the learning rate is, say, 0.1, and the predicted residual is 25, so the new prediction is 108 plus 2.5, which is 110.5, rounded on the slide to about 111. My actual home price was 130 and the base model predicted 108; after RM1 the prediction has moved to about 111, so you can see the loss is already reduced to some extent, and the same happens in the other rows. Here the prediction increases because the actual price was much larger than 108; in rows where the actual price is much smaller than the average, the prediction decreases instead. This gives me a new predicted price column using just the base model and RM1; the first-row arithmetic is spelled out in the small snippet below.
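As a quick check on those numbers, here is the first-row arithmetic written out. The learning rate of 0.1 and the values 130, 108, and 25 come from the video's example; note that 108 plus 2.5 is 110.5, which the slide rounds up to 111:

```python
actual_price  = 130     # first row's home price
base_pred     = 108     # output of the average (base) model
rm1_residual  = 25      # residual predicted by RM1 for this row
learning_rate = 0.1     # eta, chosen between 0 and 1

new_pred     = base_pred + learning_rate * rm1_residual   # 108 + 2.5 = 110.5 (about 111)
old_residual = actual_price - base_pred                   # 130 - 108 = 22
new_residual = actual_price - new_pred                    # 130 - 110.5 = 19.5, smaller than 22
print(new_pred, old_residual, new_residual)
```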
Now, in step 4, I have to calculate my residuals again, on the basis of this new predicted price. I have just copied the table from the last slide; the old residual column was the previous information and will not be used in the next round, so I drop it. What matters now is the new predicted price, which is the addition of two models, base model plus RM1, giving values like 111, 107, and so on. I subtract these from the actual prices to get the new residuals: for the first row, 130 minus 111 is about 19, already smaller than the original 22, and similarly for the remaining rows. These new residuals become the target for RM2, exactly as I explained: each model passes its loss, its residuals, on to the next model. We follow the same procedure again and again, until either the residuals are zero or we have reached the number of estimators configured in the model.

Gradient boosting has some hyperparameters we can use to tune the model and get good accuracy. There is no base estimator to choose: gradient boosting uses decision trees by default, with the first model being the average model and all the consecutive models being decision trees. The learning rate controls the magnitude of the change contributed by each decision tree, that is, how much of the residual value calculated by each model (RM1, RM2, and so on) actually gets added. Lower values are generally preferable: they make the model more robust, prevent overfitting, and allow the model to generalize well. The number of estimators is self-explanatory: how many consecutive decision trees to use. These parameters already exist in AdaBoost and bagging; what is new here is subsample, which is the fraction of the observations used for each tree. Suppose we have a thousand rows in the original dataset and subsample is 0.8; then 800 observations go to each decision tree, and if we take 1.0, all the observations go. Generally we take 0.8, and it can be fine-tuned further. The loss parameter is which loss function we want to optimize, for example squared error (MSE) or absolute error. Maximum features is how many features we want to consider for each decision tree, and maximum depth limits the depth of the decision trees; remember that in AdaBoost we were using a depth of one, whereas here it generally ranges between about 5 and 32. The number of estimators we have already covered.

In the next video we will cover how to use gradient boosting in code on a dataset. I hope this video was informative and you have learned something about gradient boosting; if so, please like the video, subscribe to the channel if you have not already, and share it with your friends. See you all in the next video; till then, goodbye and happy learning. As a small preview of that next lesson, a minimal scikit-learn sketch is included below.
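This preview sketch is my own, not the video's code. The dataset here is synthetic and purely illustrative, and the loss name "squared_error" assumes a recent scikit-learn version (older releases called the same loss "ls"):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Illustrative data: three columns standing in for rooms, floor, square feet; y is the price.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 100 + 50 * X[:, 2] + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=100,      # number of residual trees
    learning_rate=0.1,     # eta, scales each tree's contribution
    subsample=0.8,         # fraction of rows used for each tree
    max_depth=3,           # depth of each residual tree
    max_features=None,     # features considered per split
    loss="squared_error",  # loss function to optimize
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on held-out data
```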