Welcome to XGBoost Math. This is part 8 of Ensemble Techniques, and in this video we shall cover the most important mathematical formulations behind XGBoost. There are some prerequisites: first, the AdaBoost math, which we covered in the part 5 video; second, the Gradient Boost math, which we covered in part 6, particularly all eight steps of the algorithm; and third, the XGBoost concepts covered in part 7 of this video series. Let's get started.

XGBoost is an implementation of optimized gradient boosted trees, and the eight-step gradient boosting algorithm we discussed in the previous video on XGBoost concepts is very much applicable here, but with a variety of improvements. It is also a regularized model, and it measures the complexity of the tree. What that means mathematically is that our objective function is going to have two components: the first is a training loss, which measures how well the model fits the training data, and the second is a regularization term, which measures the complexity of the tree. The loss function used by XGBoost here is a squared loss, the sum of squared differences between the predicted values and the actual values. There is another thing to note here, which is the trade-off built into the optimization: the objective function trades off the training loss against the regularization. Generally, optimizing the training loss encourages predictive models with higher accuracy on your training data, whereas optimizing the regularization encourages simpler, more general models that are better suited for production environments.

Okay, so now let's look into the details, first on the regularization term within XGBoost. XGBoost uses two quantities to define the complexity of the tree: the first is the number of leaves, and the second is an L2 norm of the leaf scores. The other parameters in this term, gamma and lambda, are hyperparameters. The overall complexity is defined as the number of leaves T weighted by gamma, plus a lambda-weighted L2 norm of the leaf scores (a leaf score is also referred to as a leaf weight): Omega = gamma * T + (1/2) * lambda * sum over j of w_j squared. At later stages we will see how the w's are computed, but for now let's take a very quick example of how the overall Omega, the regularization term, is computed for a sample tree. Assume we have a sample tree with two internal nodes and three leaves; leaf 1 has a weight w1 of +2, leaf 2 has a weight w2 of 0.1, and leaf 3 has a weight w3 of -1. Omega is then gamma times 3, because we have three leaves, plus half times lambda times (w1 squared plus w2 squared plus w3 squared).

Now that we understand how Omega is put together, let's start looking at the other component, the training loss. The training loss can be represented as an additive model, because we are obviously in a boosting setting, and in the earlier videos on AdaBoost and Gradient Boost we saw how the for loops were constructed and how the trees were sequentially added together to finally aggregate an ensemble model. Applying the same idea, we can write the model at round t, y-hat of t, as the previous model, y-hat of (t - 1), plus the new tree f_t that we are learning. So now we can rewrite our objective function in these terms, as an additive model, as below.
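To make these pieces concrete (the squared loss, the Omega term for the sample tree, and the additive form of the objective), here is a minimal Python sketch; the function names, the gamma and lambda values, and the example numbers are my own assumptions, purely for illustration, not something defined in the video.

```python
# Minimal sketch of the two pieces of the round-t objective:
#   obj(t) = sum_i l(y_i, y_hat_i^(t-1) + f_t(x_i)) + Omega(f_t)
# with squared loss l(y, p) = (y - p)^2 and
#   Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2.

def squared_loss(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

def tree_complexity(leaf_weights, gamma, lam):
    num_leaves = len(leaf_weights)                     # T, the number of leaves
    l2_of_scores = sum(w * w for w in leaf_weights)    # sum of w_j^2
    return gamma * num_leaves + 0.5 * lam * l2_of_scores

# The sample tree from the video: three leaves with weights +2, 0.1 and -1.
# With assumed gamma = 1 and lambda = 1, Omega = 3 + 0.5 * 5.01 = 5.505.
omega = tree_complexity([2.0, 0.1, -1.0], gamma=1.0, lam=1.0)

# Additive step: the round-t prediction is the previous prediction plus the
# new tree's output for each instance (made-up numbers, purely illustrative).
y_true = [1.0, 0.0, 1.0]
y_hat_prev = [0.7, 0.3, 0.4]          # y_hat^(t-1)
f_t_output = [0.1, -0.1, 0.3]         # f_t(x_i), the new tree being learned
y_hat_t = [p + f for p, f in zip(y_hat_prev, f_t_output)]

objective = squared_loss(y_true, y_hat_t) + omega
print(round(objective, 3))
```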
Now let's take a look at the next section, where we are going to apply a formulation from mathematics. I want to introduce you to a theorem called the Taylor approximation, which is used to approximate well-behaved differentiable functions. The Taylor approximation says that for a function f evaluated at x plus delta x, where delta x is a small increment of x, the value can be written in terms of f(x) and delta x; the second-order approximation is f(x + delta x) ≈ f(x) + f'(x) * delta x + (1/2) * f''(x) * delta x squared, where f'(x) is the first derivative and f''(x) is the second derivative of f at x. That's the approximation given by Taylor's theorem. Now, considering we have already defined our XGBoost objective function in additive form, we are going to apply this Taylor approximation up to the second-order term and simplify our objective function.

When we apply it to the objective function from the earlier slide, we get two new terms, g_i and h_i, in the formulation. The g_i term is the first derivative of the loss with respect to the previous round's prediction, and the h_i term is the second derivative; together they give the quadratic approximation of the loss.

Now that we have applied the Taylor approximation to the original objective function and obtained its quadratic form, let's put the training loss together with the regularization term we derived earlier. The training loss we have derived so far is the formulation with the g_i and h_i terms, and the regularization term has the number of leaves and the L2 norm of the leaf scores. When we put them together, we can further simplify by using capital G_j and capital H_j to refer to the sum of the g_i's and the sum of the h_i's of the instances falling into leaf j, respectively. Remember, the lowercase g's and h's are per-instance values, whereas the uppercase G and H now describe the tree's structure for this formulation. The objective function we have derived and simplified is a sum of T quadratic expressions, one per leaf. From each quadratic we can derive the optimal weight w_j, which is what we want, and that optimal w_j comes out as minus G_j divided by (H_j plus lambda). Now that the optimal w_j is derived, we substitute it back into the objective function and get a new objective function without any w's. This we shall call the minimum objective function, and it is the most simplified quadratic approximation of our original objective function.
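Here is a minimal Python sketch of the quantities just derived; the function and variable names are my own assumptions, and I am using the convention that the squared loss is written as 0.5 * (y - y_hat)^2 so that the derivatives come out tidy (with the plain squared loss, g and h simply pick up a factor of 2).

```python
# Minimal sketch, assuming L = 0.5 * (y - y_hat)^2, so that
#   g_i = dL/dy_hat = y_hat_i - y_i   and   h_i = d2L/dy_hat^2 = 1.
def grad_hess_squared_loss(y_true, y_pred):
    g = [p - t for t, p in zip(y_true, y_pred)]   # first derivatives g_i
    h = [1.0 for _ in y_true]                     # second derivatives h_i
    return g, h

def optimal_leaf_weight(G, H, lam):
    # w_j* = -G_j / (H_j + lambda), the minimizer of leaf j's quadratic.
    return -G / (H + lam)

def minimum_objective(G_per_leaf, H_per_leaf, gamma, lam):
    # Substituting w_j* back in gives the minimum objective (structure score):
    #   obj* = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    T = len(G_per_leaf)
    return -0.5 * sum(G * G / (H + lam)
                      for G, H in zip(G_per_leaf, H_per_leaf)) + gamma * T

# Made-up usage: three instances, two leaves (instances 0 and 2 fall in leaf 1).
g, h = grad_hess_squared_loss(y_true=[1.0, 0.0, 1.0], y_pred=[0.6, 0.2, 0.9])
G1, H1 = g[0] + g[2], h[0] + h[2]
G2, H2 = g[1], h[1]
print(optimal_leaf_weight(G1, H1, lam=1.0))
print(minimum_objective([G1, G2], [H1, H2], gamma=1.0, lam=1.0))
```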
Okay, now let's try to understand how XGBoost grows these trees. First, it is a greedy tree-growing algorithm, and it starts with a tree of depth zero. Second, for each leaf node of the tree it tries to add a split; that is what the greedy algorithm does. Along with that comes a small dynamic we also need to understand. In this particular diagram I have shown you five different nodes, and what is worth noting is that there is an objective function before and an objective function after: the objective before is defined before node 2 was split, and the objective after is defined after node 2 was split. The objective before is calculated only with nodes 1, 2, and 3, because at that point in time only these three were the leaves in the tree, whereas the objective after is calculated with nodes 1, 4, 5, and 3, because these are the leaves now left in the tree. The change in the objective function is calculated using the gain function. This gain is intuitively similar to how we did gain calculations using entropy or Gini in the decision tree videos: the gain is the score of the left child plus the score of the right child, minus the aggregated score if we do not split. That is the intuition behind the formulation, and then gamma is subtracted, which is the complexity cost of introducing the additional leaf (there is a small illustrative code sketch of this gain calculation at the end, after the recap). Now that the gain calculation is done, we need to know that XGBoost will not make the split if the best split candidate has a negative gain. It is also worth noting that we have already seen, in the previous video, the sparsity-awareness algorithm of XGBoost.

Okay, so now that we have reviewed the tree-building process, or the structure-learning process, let's quickly review the overall process for XGBoost and summarize the algorithm. First there is an initial weak learner f(x_i). Then, in a for loop where T refers to the total number of trees to be built, we create a new instance of the model, and each tree is built using the previously discussed tree-learning algorithm; the minimum objective function is used in these build-ups because it tells us how good the t-th tree's structure is. All the new models are additively combined, and the training stops either when T is reached or when the objective reaches acceptable accuracy levels. The final model is an additive combination of all the models built within the loop, and that is the final ensemble XGBoost model.

That's the overall summary. Now let's quickly recap everything we have covered in this video. First we saw the basic general objective function and the squared loss function. Then we understood the trade-off optimization built into this function. We reviewed the measure of complexity of the tree. We decomposed the objective function into an additive form. Then we applied a Taylor approximation to get a quadratic approximation of the objective function. Then we derived the optimal weight and also the minimum objective function, which was the most simplified approximation we could get from our general objective function. Then, finally, we reviewed how the structure is learned, or how the tree is built.

Okay, in the next video we will be covering an illustrative example, using all of this math, of how XGBoost builds these trees, satisfying the various formulations we have discussed in this video. So until then, good luck to you, and stay safe.
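As mentioned above, here is the small illustrative sketch of the split-gain calculation and the negative-gain stopping check; the function name and the numbers are my own assumptions, purely for illustration.

```python
# Minimal sketch of the split gain described above:
#   Gain = 0.5 * [ G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda)
#                  - (G_L + G_R)^2/(H_L + H_R + lambda) ] - gamma
# i.e. (left-child score + right-child score) minus the score if we do not split,
# minus gamma, the complexity cost of the extra leaf.
def split_gain(G_left, H_left, G_right, H_right, gamma, lam):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

# Made-up numbers: if even the best candidate split has non-positive gain,
# XGBoost keeps the leaf as it is instead of splitting.
gain = split_gain(G_left=-4.0, H_left=3.0, G_right=5.0, H_right=4.0,
                  gamma=1.0, lam=1.0)
print("split" if gain > 0 else "do not split", round(gain, 3))
```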