Hi guys, my name is Nitish, and welcome to my YouTube channel. XGBoost — I'm guessing you've heard the name. It is probably the most famous machine learning library available today: it's used in pretty much every Kaggle competition, and whenever there are machine learning projects in industry, XGBoost shows up there too. Unfortunately, until now I had never covered XGBoost properly on this channel. So what I'm going to do now is cover XGBoost in great detail, across multiple videos. Today's video is going to be an introduction to XGBoost — and looking at the length of the video, you might wonder how an introduction can possibly take an hour. Here's what I actually did: I took the original XGBoost paper and studied it in great detail while making this video, and everything the authors said there, I have simplified and tried to put into this one video. There are a lot of things inside the XGBoost library, and that's why, as a beginner, you'll sometimes feel overwhelmed. But I guarantee you: if you watch today's video completely, it will be hugely valuable, because you'll get a very good overview of everything that XGBoost contains and the order in which to study it. The introduction video has become a little long, but I'm pretty sure that if you watch it completely, you'll take a lot of knowledge away with you. Okay, with that, let's start the video. So, to understand XGBoost better, we must first go back a bit and talk about machine learning itself. When you start studying machine learning, the first definition you're taught is that machine learning is a technique where you
learn from data. So what do you do in machine learning? You have some dataset, you apply an algorithm to it, the algorithm learns the patterns within your data, and then it makes predictions. That's mostly what happens in machine learning. Now, if you ask what an algorithm is — you probably already know many machine learning algorithms: linear regression, logistic regression, Naive Bayes, KNN, SVM, and so on. When I was starting to study machine learning, there was a very obvious confusion in my mind: why are there so many algorithms? Why isn't there a single algorithm that you apply to all types of problems? Obviously, as I read more, I learned that algorithms are data-specific or scenario-specific. If you talk about the very beginning — the 70s and 80s — the algorithms developed at that time, like Naive Bayes and the others from that era, had one good thing going for them: people could actually solve problems using machine learning. But they had a major disadvantage: they performed well only on very specific types of data. For example, linear regression specifically works nicely on linear data. Naive Bayes performs really well on a very limited type of data, such as textual data. So the algorithms of the 70s and 80s mostly had this drawback: they were not very general; they were specialized. Then, in the 90s, algorithms started being made that were not only more powerful performance-wise but also more general, meaning they could be applied to different types of data. The three major algorithms that emerged in the 90s were Random Forest, SVM, and
gradient boosting. What's special about all three of these algorithms is that they are very powerful — their results and performance are consistently good — and at the same time, they work on different types of data. But if you talk about their disadvantages, they had some too. The first disadvantage was that these algorithms still struggled with problems like overfitting. The second was scalability: as the internet grew around 2000, data started arriving very fast, and as the size of datasets increased, these algorithms stopped being effective on large datasets. So there were two problems: performance — how good the results were on the evaluation metrics — and speed, because they were slow on large datasets. XGBoost came in 2014 to solve exactly these two problems. That's quite recent, but ever since it arrived, it has become the best performer among machine learning methods: on data of any size and any type, it gives good performance. Now, here's the fun part: XGBoost is not an algorithm. Many people think it is, but no — XGBoost is not an algorithm. XGBoost is actually a library that sits on top of the gradient boosting algorithm. In short: a guy named Tianqi Chen — then a young PhD student — saw that gradient boosting has a lot of potential, and that if we take gradient boosting and make improvements to it, both in terms of performance and speed, it could become something very powerful. And that is exactly what XGBoost is: XGBoost is basically gradient boosting with a variety of optimizations applied on top, so that
it becomes a super powerful machine learning system. So XGBoost, in a nutshell, merges two things: machine learning, which it picks up from gradient boosting, and a lot of concepts from software engineering, which we are going to read about in this video. Combine both, and you get a very powerful library that can be used in a lot of scenarios, on lots of datasets, and that will mostly give you great performance at very high speed. Okay, so our job in this video is this: I'll tell you a bit about the whole history of how XGBoost was developed, what improvements were made in terms of performance and speed, and overall we'll take a tour of XGBoost — because XGBoost is huge, there is a lot inside it, so it's important to have an overview first. Only then will you be able to properly understand anything we cover later. To get that overview, we need to understand its history, so let's do one thing: let me walk you through the history of XGBoost a little. I'm dividing the full history of XGBoost into three parts. The first part will be the early days, up to when XGBoost was created in 2014. The next stage I would name the Kaggle days — I'll tell you why the fuss — this was around 2014 to 2016. And the third stage is the open source days, from 2016 onwards. So first of all, let's talk about the creator of XGBoost and the question: why the gradient boosting algorithm? This question came to my mind too — at that time there were many other popular machine learning algorithms, so what was the reason Tianqi Chen felt that gradient boosting was the right candidate for building XGBoost? So let me praise gradient boosting a little. Obviously, ideally you should already know about gradient boosting — and if you don't know it yet,
then you'll have to go and study it first. There are certain factors that make gradient boosting a very powerful and capable candidate. The first of them is flexibility. If you have read a little about gradient boosting, the first thing you'd know is that you can use any loss function in gradient boosting — unlike other machine learning algorithms, which work only with very specific loss functions. Gradient boosting is designed in such a way that it can work with any loss function that is differentiable. The advantage is that you can work on many different types of problems: you can work on regression problems, you can work on classification, you can even work on ranking problems — in fact, if you want, you can work on your own custom-defined problems too. That was reason number one. Reason number two was performance: as you probably know, on most datasets, if you apply gradient boosting, it will give you good results. The next point is that it is quite robust, meaning that if you apply regularization correctly, the results you get are very impressive. One more thing, and lastly: many people on Kaggle were already using gradient boosting to win different competitions. And on top of that, there's another feature contributing to robustness: it handles missing values internally quite comfortably. So there are lots of small features like this in gradient boosting. After looking at all these factors, Tianqi Chen thought: gradient boosting is doing so many things right. The only problems were that the performance could still be improved and that it struggled on larger datasets — if it could be made to work on those as well, what an algorithm it would become. That was the thought process in Tianqi Chen's brain, and so he
decided: we'll take gradient boosting, apply optimizations on top of it, and make XGBoost. So in 2014 itself, he published a paper — "XGBoost: A Scalable Tree Boosting System". I'll put the link in the description; you should definitely go through it. It describes the improvements they made — we'll cover them in this video too, but reading papers is a good habit to build. Once XGBoost was formed, what did Tianqi Chen do next? To understand how well it performed, he started participating in the competitions held on Kaggle. He himself took part in a competition called the Higgs Boson Machine Learning Challenge — a competition related to particle physics. If you go to Kaggle, you'll still find its page; there, machine learning is used for identifying certain particles — I don't have much idea about the physics either; I read about it long ago. But this was the competition where Tianqi Chen himself participated, and his XGBoost-powered entry did so well that, for the first time, people noticed: a top Kaggle competitor has tried something new, has released a new library called XGBoost. Then more people started trying it out, and among the 29 winning solutions published on Kaggle's blog during 2015, 17 used XGBoost. As soon as this became known to the world, XGBoost's popularity increased rapidly, and at this point Tianqi Chen decided: there is potential in this algorithm, in this library, and if we want to take it further, there is only one way — open source it. Open sourcing means taking the code to the world so that machine learning enthusiasts and engineers everywhere can make their own additions on top of it. So XGBoost became an open source project, and from then on it never looked back. Many things were added to it — a lot of new features, which we'll discuss one by one in this video — and many more optimizations, which improved its performance further and increased its speed even more, because many people came and contributed to its development. Multiple platform support arrived: across different operating systems, and in different programming languages too, XGBoost started working. Apart from this, the documentation, the tutorials, and in general the content available about XGBoost on the internet kept growing and becoming more popular. Besides that, today, whenever anyone participates in any competition, it's almost a default that at least one of their solutions applies XGBoost. As the community evolved, the growth that XGBoost would probably have taken 10 years to achieve, it achieved in five. Fast forward to 2023: as of today, everyone agrees that XGBoost is something you have to know if you want to become a data scientist or machine learning engineer, because both in terms of performance and in terms of speed, there is no match for it. So this was a brief history of how XGBoost came to be and how it started to rule the world of machine learning. Okay guys, we've taken an overview of XGBoost and we've also read its history. Now our goal is to get a good grip on XGBoost — mastering it so that we can use it effectively in any project. Now, I'll be very honest with you: XGBoost is a huge library; there are many things inside it, and if you are a beginner, it's very easy to feel overwhelmed — it is genuinely difficult. Since there is so much that seems a bit difficult, here's what I have planned: first of all, I will cover XGBoost by dividing it into multiple videos. But this
is that video: in this one, I'll walk you through all the features of XGBoost once, because I want to give you an idea of them, so that in the upcoming videos, when you read about them in detail, a light will go on in your mind: yes, we've read about this. Okay, so ideally, this is the official website — by going here, you can read about XGBoost a bit. You'll find most things here when you go to the docs, and you can see the docs have a lot of content. It's not possible to read everything, but since so much content is there, I would just recommend glancing through it once; I'll give you an idea of the rest — of what the features in XGBoost are. So, talking about Tianqi Chen, the creator of XGBoost: while building this library, there were three major areas of concern in his mind — three things this library had to have. The first area of concern was performance: he wanted the library to have very good performance, with robustness and no overfitting problems. Performance was a very important criterion on his mind. The second, an even more important criterion and concern, was speed, because XGBoost needed to be scalable: being able to work comfortably on top of big data was essential, so a lot of focus went on speed too. And the third was flexibility. Flexible in the sense that Tianqi Chen wanted more and more people to use XGBoost; he didn't want it restricted only to those who know Python, or to those who only have Windows. We'll discuss what came under speed and what came under flexibility — let's talk about all three aspects one by one. First, flexibility. XGBoost, by default, the way it is designed, is very, very flexible. Tianqi Chen always had this philosophy in his mind: a library becomes successful only when
it is able to reach a lot of people. For flexibility, there are four essential things, and I'll take you through each one of them from my point of view. The first was cross-platform: it basically means that XGBoost models can run on any operating system — Linux, Windows, or Mac. They are cross-platform, meaning whatever machine you have, you can run them on it. Okay, being cross-platform is not such a big deal today; most machine learning libraries are cross-platform. Let's extend beyond this and discuss the next point — this is something unique that you won't find with most other machine learning tooling. XGBoost actually supports multiple programming languages. Generally, you'll notice that your predictive models run in, say, Python, or in R, or sometimes there are algorithms that work in MATLAB — MATLAB is a very famous piece of software in which you can also run some predictive models. But what did XGBoost do? It created wrappers for all the famous programming languages. If you look today, you'll find Java, Scala, Ruby, Python, R, and some other languages — in fact, if you go to the official documentation, you can see that for Java, Ruby, Swift, Julia, C, C++, and more, an interface is available. In fact, you can even build a model in, let's say, Python, then load it in Java and use it in Java as well. I have attached a small code snippet here just to show you: this bit here is written in Python — you're building a model and saving it to a file — and then this code is written in Java: you import these libraries in Java, load the same XGBoost model, and
get a prediction from it. This is a truly powerful feature. Assume that tomorrow you've built a website for a client that is fully designed in Java — a lot of enterprise applications are made in Java — and then a machine learning component needs to be added to it. Most machine learning components are built in Python, so here you'd face a compatibility issue: how do you run a machine learning model built in Python directly in Java? Normally there's only one way: you first build the model in Python, put it behind an API service, and call that API service from Java — but that brings many system design constraints with it. What if you could do this directly? That's the convenience: if wrappers are created for other languages as well, then XGBoost can easily be used in many kinds of projects. Again, great thinking by the creator. Beyond the popular languages, XGBoost doesn't stop there either: it also has support for many other famous libraries. If you talk about the model-building process — where you do data analysis and then build models — XGBoost is compatible with all the famous Python libraries, like NumPy and Pandas, plotting libraries like Matplotlib, plus scikit-learn. All of this integrates very easily.
If you talk about distributed computing — we'll discuss distributed computing properly after some time, but for now, just to name names: the famous distributed computing libraries, like Spark (via PySpark) or Dask — XGBoost is compatible with these too. If you talk about model interpretability — where, once the model is built, you explain it — there are famous libraries there as well, like SHAP and LIME, and XGBoost is compatible with them. If you talk about model deployment, you might have heard the name Docker — XGBoost fits into those pipelines too. And lastly, if you talk about workflow management, there are famous libraries here as well, like Apache Airflow, or MLflow, which is also a very famous library — the compatibility with all of these is very strong. XGBoost has done a lot of work in this aspect too. Lastly, and most importantly: XGBoost is comfortable with many different kinds of machine learning problems. If you talk about linear regression, it is only comfortable with regression problems, right? If you talk about logistic regression, it is only comfortable with classification problems. But XGBoost is like an algorithm to which you can give any kind of machine learning problem, and it will solve it. It can solve regression problems; it can work on classification — binary classification as well as multi-class classification. Not only that: if you talk about time series forecasting, XGBoost is used a lot in time series forecasting too, and it gives good results. Apart from this, if you talk about ranking problems — where you are given some items and you have to rank them based on some requirement, like in recommender systems — those work too. Apart from
this, XGBoost is very useful even in anomaly detection — XGBoost gets used a lot there as well. And lastly, since XGBoost is based on GBDT — that is, gradient boosted decision trees — and in gradient boosted trees you can use any differentiable loss function, this feature is available in XGBoost too. Tomorrow, if you define a problem according to your own needs, you can provide your own custom loss function. Say you need a loss function that gives more weight to positive points compared to negative points — you can define that. Or you need a loss function that weights false positives more than false negatives — you can design that as well. You can even create your own custom evaluation metric. This whole flexibility is what XGBoost gives you. So I hope that after this whole discussion, you understand why flexibility has been such an important aspect of XGBoost. Okay, so these were the four major points, which we have covered here. So guys, now let's move on to the next aspect: speed. A lot of focus was put on speed while making XGBoost. Tianqi Chen realized that as we move forward, the size of datasets will keep increasing, so if we want to create a very powerful machine learning library, it must have this capability: no matter how big a dataset you give it, it should train comfortably on it. The algorithms of that day — I mean gradient boosting and the rest — simply weren't fast enough on big datasets. So in XGBoost, a lot of software optimizations are applied in order to increase the training speed. Okay, so we are going to read about six optimizations one by one, which will make you understand why XGBoost is faster than the rest of the machine learning algorithms. In fact, I'll prove it to you once with a small experiment showing that XGBoost's training time really is less in comparison to gradient boosting.
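Before we get into the speed experiments — the custom loss idea from a moment ago deserves a tiny sketch. XGBoost accepts a custom objective as a function returning the per-sample gradient and hessian of the loss with respect to the predictions. Below is what that pair looks like for a squared error that weights positive examples more heavily; the function names and the 3x weight are my own illustration, not an XGBoost API, and the gradient is checked numerically rather than by training a model.

```python
import numpy as np

# Shape of an XGBoost-style custom objective: per-sample gradient and
# hessian of the loss w.r.t. the raw prediction. The loss here -- squared
# error with 3x weight on positive targets -- is illustrative.
def weighted_squared_error(preds, labels, pos_weight=3.0):
    w = np.where(labels > 0, pos_weight, 1.0)
    grad = 2.0 * w * (preds - labels)   # first derivative of w * (p - y)^2
    hess = 2.0 * w                      # second derivative
    return grad, hess

def loss(preds, labels, pos_weight=3.0):
    w = np.where(labels > 0, pos_weight, 1.0)
    return w * (preds - labels) ** 2

# Sanity check: analytic gradient vs. central finite difference
preds = np.array([0.5, -1.0, 2.0])
labels = np.array([1.0, 0.0, 1.0])
grad, hess = weighted_squared_error(preds, labels)

eps = 1e-6
numeric = (loss(preds + eps, labels) - loss(preds - eps, labels)) / (2 * eps)
print(bool(np.allclose(grad, numeric)))  # True
```

A function with exactly this grad/hess return shape is what you'd hand to XGBoost as a custom objective; anything differentiable works the same way.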
Here's the sample code I wrote — nothing special. I created a synthetic toy dataset with 10,000 rows and 200 columns — not a huge dataset, but not small either. Then I trained a gradient boosting classifier on it, then an XGBoost classifier, and I simply measured the time. Very simple code. Here you can see the result: gradient boosting took 72 seconds to train on this data, whereas XGBoost took only around five or six seconds — so it's almost 12 or 13 times faster. Now, this is not the most accurate way to measure the difference; the point is just to show that any time you pick a dataset, especially a large one, XGBoost's training time will be something like ten times less, and that's a big improvement. If training a model on some dataset is taking you 10 hours because the dataset is very big, and using XGBoost brings it down to an hour, that's a pretty big deal. So how does XGBoost manage this? We'll discuss all six aspects one by one. The first point is parallel processing. I guess even if you're from a non-technical background, you'd understand what parallel processing is. Say you're building a house. If you build this house alone, let's say the process takes you 30 days. How can you get this work done faster? Don't do it alone — bring five more people. There's a good chance that if everyone shares the work equally, it will be done in six days. The main idea: divide the work instead of doing it sequentially.
So when I read that XGBoost supports parallel processing, a very innocent doubt came up. The doubt was this: we all read this about boosting — boosting works in stages. You first give your data to a model; the first model makes some mistakes; you forward those mistakes to the second model; this model makes some mistakes too, which you send to the third model, and so on. So essentially, boosting is a sequential process, right? You train the models one by one. So I found it very surprising: if the process is essentially sequential, how can you bring in parallel processing? After doing a little research, I understood that actually, the parallel processing is not in building the models. That means the models themselves are still sequential — we don't apply any parallelism there — but the building of each individual model is done in parallel. Let me explain. The point here is that every model is a decision tree. So what I mean to say is that while you are building one tree — while you are growing it — that is where you can apply parallel
processing; once that tree is complete, you start building the next tree. How do you apply parallel processing here? For that, it's very important that you know how decision trees work — if you don't, you won't be able to follow the example, so let me walk through it. Let's assume we have data with age, marks, and, let's say, the placement package. Say here the age is 21, the marks are 98, and the package came to 13 lakhs; here the age is 19, the marks 89, and the package was 11 lakhs; and here the age is 17, the marks 95, and the package came to 15 lakhs. Now, this data isn't of much use beyond the example, so bear with me. If I have to build the first model on this data — a decision tree — then you might remember how a decision tree is built. First, you have to create your root node. To find the root node, you have two features, F1 (age) and F2 (marks); the root node will come from one of the two. Say F1 becomes the root node — what do I do? First, I grab the age column and sort it. Then I have to find the candidate splitting points: the average of these two values is 18, and the average of those two is 20. So basically, I divide the data once on the basis of 18, and one more time on the basis of 20, and for each split I calculate the Gini index (or whichever impurity criterion applies). Whichever of the two comes out minimum becomes my splitting criterion — say it's 18; then the split becomes less-than-or-equal-to-18 versus greater-than-18, with its corresponding score. Then I do the same work on marks: I sort them first — 89, 95, 98 — then take the split values in between: between these two comes 92, and between those two comes about 96.5. Now I try splitting on both of these, compute the impurity index for each, see which one is minimum, and select it — let's say 96.5 wins for marks, with its corresponding score. Now look at this whole process: first we test all the splitting criteria on F1, then we test all the splitting criteria on F2, and finally the two are compared with each other. But don't you think that these two jobs could run side by side? There is no dependency between them, no relationship — and that is exactly where parallel processing comes in. One core of the processor picks up F1, a second core takes up F2, and both jobs happen in parallel. Here you get one splitting criterion, there you get the second splitting criterion, then you compare the two and arrive at your final splitting criterion. Since both tasks happen in parallel, the overall speed increases. I gave you only two features in this example — now imagine 200 features. If parallel processing were not enabled, you would have to try all the features one by one, which would take a lot of time. But here, the features can be processed simultaneously. Obviously, how much you gain will depend on how many cores your processor has.
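To make that per-feature split search concrete, here's a tiny sketch on the toy data above. One note: the Gini index applies to classification targets; since the placement package here is numeric, this sketch scores splits by the sum of squared errors instead — the structure of the search (sort, take midpoints, score each candidate) is the same either way.

```python
# Exhaustive split search on the toy data: (age, marks) -> package in lakhs
rows = [(21, 98, 13), (19, 89, 11), (17, 95, 15)]

def candidate_splits(values):
    s = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(s, s[1:])]  # midpoints of neighbors

def sse(ys):
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    best_t, best_score = None, float("inf")
    for t in candidate_splits(xs):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)   # lower = purer children
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

ages = [r[0] for r in rows]
marks = [r[1] for r in rows]
pkg = [r[2] for r in rows]

print(candidate_splits(ages))    # [18.0, 20.0] -- just like the example
print(candidate_splits(marks))   # [92.0, 96.5]
print(best_split(ages, pkg))     # (18.0, 2.0)
print(best_split(marks, pkg))    # (92.0, 2.0)
```

The two `best_split` calls are completely independent of each other, which is precisely why two cores can evaluate the two features at the same time.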
If it's an 8-core machine, you can work on 8 features at a time — whatever the case, the speed goes up. In the library, this is exposed through a hyperparameter called n_jobs: set it to -1, and this parallelism gets activated — your parallel processing starts working. This is one of the optimizations because of which XGBoost is fast. I hope you've got the first point: XGBoost uses parallel processing in the tree-building process, and the benefit is that all the trees get built a bit faster. But XGBoost is only able to do this because it uses optimized data structures. What is a data structure? It's how you store data. What the rest of the machine learning algorithms generally do is store data row-wise. Let me explain with an example. Suppose you have this data — CGPA, IQ, and placement package — recorded for different people. The rest of your machine learning algorithms do the usual row-wise storage: basically, blocks are formed, where this is one block of information, this is a second block of information, and then you do row-by-row processing on top of that. But XGBoost is different: it uses an optimized data structure that we call the column block. What do you do in a column block? Instead of storing information row-wise, you store information column-wise: you organize your features into separate column blocks, and whenever parallel processing is required, you pick up one column block at a time and start operating on it — you start finding the splitting criteria. This is why parallel processing becomes possible: XGBoost internally stores data in a different manner. Where the rest of the algorithms store data in a row-wise fashion, XGBoost stores it in a column-block fashion.
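Here's a toy illustration of the column-block layout — my own simplification of what the paper describes (real XGBoost stores compressed, pre-sorted columns), with illustrative data and feature names. Each feature is sorted once, with its row indices kept alongside, so every split search just walks one contiguous sorted array.

```python
# Toy "column block": each feature pre-sorted once as (value, row_id) pairs,
# reused for every split search instead of re-sorting at every tree node.
rows = [(6.1, 110, 1), (7.8, 125, 0), (5.4, 95, 0), (8.2, 130, 1)]  # (cgpa, iq, placed)
features = {"cgpa": [r[0] for r in rows], "iq": [r[1] for r in rows]}

blocks = {
    name: sorted((value, row_id) for row_id, value in enumerate(vals))
    for name, vals in features.items()
}

print(blocks["cgpa"])  # [(5.4, 2), (6.1, 0), (7.8, 1), (8.2, 3)]
print(blocks["iq"])    # [(95, 2), (110, 0), (125, 1), (130, 3)]
# A split search walks one sorted array left to right; since each feature's
# block is independent, different cores can take different features.
```

Compare that to the row-wise layout, where scanning a single feature means touching every row tuple — the column block is what makes the per-feature parallel search cheap.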
The benefit is that you can do this parallel processing very easily. Are you getting this? And this is just one example — apart from this, there are other places where you'd see very smart data structures being used to improve the speed of the algorithm. Let's move on to the third point. The third thing is called cache awareness — cache memory, as we say. Basically, XGBoost uses cache memory very efficiently. Now, if you come from a non-technical background and don't know what cache memory is: in general, a computer is set up so that there is a CPU, whose job is to do all the computations, and the data needed to perform those computations sits in the RAM. What happens the whole time is that these two components talk to each other: whenever the CPU needs data, it asks the RAM for it. The two are connected, and the system simply runs like that. Now, the problem with this approach is that since these two components are two different pieces of hardware, transferring this data back and forth takes some time. To solve this, a concept called cache memory comes in: a very small memory located inside the CPU itself. Generally, on a normal laptop, it can be 4 MB to 8 MB or a little more — it's small, but you can store things in it that are going to be used again and again. You may even have noticed this yourself: if you visit some website frequently, on a daily basis, then after a week or so the website takes less time to load. That's because some basic things get stored in cache: its CSS file, its logo — files like these are cached, so the next time you load the page, you aren't actually fetching from the server; you're loading from your own machine. So what does XGBoost do? XGBoost utilizes this concept of cache memory. I can't explain
it fully right now, because we haven't covered the details yet, but XGBoost uses a concept for tree building that we call histogram-based training. Whenever it gets a numerical training feature, like CGPA or IQ, it creates a histogram on top of that feature. You probably know how a histogram is formed: you take the values and place them into bins. Say you have a data set where IQ is given, ranging from 50 to 150; you can make bins over it, like one bin from 50 to 60, the next from 60 to 70, and so on, and that is how your histogram is formed. What XGBoost does is store these bin values in cache memory, because these are the values that are going to be used again and again to build the trees. Similarly, other required quantities that it needs repeatedly are also kept in cache memory, and in this simple way the training speed improves.

I will give you an analogy. Suppose you are cooking in the kitchen, and the ingredients you need are kept inside the fridge. While cooking, you realize certain ingredients are needed again and again, so what do you do? You pick up those ingredients and keep them next to you, so you no longer need to go to the fridge every time; you can pick them up directly from your side and use them. Here, whatever we keep next to us is the cache memory, the fridge is the RAM, and the cook is the CPU. So basically, in simple words (it got a bit technical), XGBoost uses the concept of cache memory efficiently to build decision trees, and that is why XGBoost becomes a little faster.
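The histogram idea mentioned above can be sketched in a few lines of plain Python. The IQ values and the fixed bin width here are my own illustration; XGBoost's real histogram code lives in optimized C++ and chooses bins differently:

```python
# Toy sketch of histogram binning for a numerical feature (e.g. IQ from 50 to 150).
iq_values = [52, 61, 64, 78, 95, 101, 103, 118, 133, 147]

bin_width = 10
bins = {}  # bin start -> count of values falling in [start, start + bin_width)
for v in iq_values:
    start = (v // bin_width) * bin_width
    bins[start] = bins.get(start, 0) + 1

print(sorted(bins.items()))
```

Once built, these bin statistics are reused over and over during tree construction, which is exactly why keeping them in cache memory pays off.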
Now let's move on to the fourth optimization, called out-of-core computing. I will explain it in very simple words. Let's say you have a computer, a laptop with 8 GB of RAM, and your manager has sent you a data set that is very important, but when you look at it you realize the data set is 10 GB. The simple question is: how will you load a 10 GB data set into 8 GB of RAM? And if you can't load it, there is no way to do the work. This is a big challenge in machine learning: if a very large data set arrives, so large that even your RAM can't contain it, you cannot train the model. This is where out-of-core computing comes in. What do you do in out-of-core computing? To process your large data set, you divide it into chunks; say you divide the 10 GB data set into 2 GB chunks, so you have created 5 chunks. Now you load one chunk at a time into RAM, train the model on it, then load the next chunk and train again, then the third, then the fourth, then the fifth, and all this happens sequentially. I guess XGBoost is one of the only libraries where this out-of-core computing feature is available directly; you have almost nothing to do. There is a hyperparameter you need to set, named tree_method, and if you set it as per your requirement, out-of-core computing kind of gets activated, and no matter how large the data set is, you can train your model chunk by chunk. Cache memory is also very important in out-of-core computing: while you are training the model sequentially, the important things that are going to be needed again and again are simply stored in cache memory, and the training pulls that information back out of the cache. So out-of-core computing works in tandem with cache awareness.
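The chunk-by-chunk pattern described above can be sketched in plain Python. This is not XGBoost's implementation, just the general out-of-core idea: never hold the full data in memory, only a chunk and some running totals (the generator standing in for a 10 GB file is my own toy):

```python
def data_on_disk():
    # Stand-in for a huge file we cannot load at once:
    # a generator that yields one record at a time.
    for i in range(1, 101):
        yield i

def read_in_chunks(record_stream, chunk_size):
    # Group the stream into fixed-size chunks.
    chunk = []
    for record in record_stream:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Process one chunk at a time, keeping only running totals in "RAM".
total, count = 0, 0
for chunk in read_in_chunks(data_on_disk(), chunk_size=20):
    total += sum(chunk)
    count += len(chunk)

print(total / count)  # → 50.5, the mean of 1..100, computed without loading it all
```

Model training over chunks follows the same shape: load a chunk, update the model, discard the chunk, repeat.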
These two things work together. Whatever the case, this is a very important feature, because having it means XGBoost can be scaled to arbitrarily large data sets. Okay, so that was the fourth optimization available in XGBoost.

So guys, let's move to the next point, which is a very interesting one, and that is distributed computing. If you are from a software background you might have heard this term; it is very famous, considered quite a buzzword, and honestly, people doing this in the industry get paid very well. Anyway, let me first explain what distributed computing is, and then I'll explain, in the context of XGBoost, how XGBoost uses distributed computing. Distributed computing means you have a lot of machines, and in the literature these machines are called nodes; each one of these is a node. Now say you have a task, such as training a model on a data set. What you do is use all these nodes together to complete the task. So basically, distributed computing is a process where you divide your task, distribute the pieces to different nodes, and every node performs its share of the work in parallel. That is what distributed computing is. Now, if you talk about XGBoost, you have this feature available if you want it:
support for multiple devices. If you have a data set that you cannot load on a single machine, then in that case you can divide the data into smaller parts and give them to different nodes; all these nodes will independently perform training on their part, and afterwards the results are aggregated. Now the question might come to your mind: if this is the same divide-into-parts idea, we could also do it with out-of-core computing, where we make small chunks and feed them to the machine. But there is a benefit to distributed computing; let's see what it is. In out-of-core computing you are dividing the data into chunks, but your machine is just one, so it can only work on one chunk at a time; when you create five chunks, your machine processes one chunk at a time, in sequence. Now suppose instead you have five nodes and a 10 GB data set broken into five chunks; you can send one chunk to each of your five machines, and the training happens in parallel. So basically your speed goes up. The first benefit is that you are able to work with large data sets at all; the second benefit is faster training. Now I will try to explain the whole process and why this is a better option. So what do you do? You divided the data set into five parts: this node got d1, this one got d2, this one got d3, this one d4, this one d5. Now each machine will start the tree-building process on the data given to it. Let's say our 10 GB data had two features, f1 and f2, so every node has 1/5 of the data; you just make sure everyone gets an equal share of the data. Now
each node will process its f1 and f2 and compute the best splitting criteria. For f1, a splitting criterion will come from this node, one from this one, one from here, one from here, and one from here. Now you have to perform aggregation, so there is another node that works as a kind of master; it handles the communication, goes and asks everyone: tell me your splitting criterion, what's yours, what's yours, what's yours. It checks which splitting criterion gives the most gain, and that splitting criterion is selected. So basically, step one is partitioning your data; then you are training locally on every single node; then there is a master node whose job is overseeing the entire communication and performing the aggregation, and once the aggregation is done, your work is done. This is the process of distributed computing that is available in XGBoost. Implementing it is a bit involved, though; as I told you at the beginning of the video, you have to use external libraries. Another cool feature of XGBoost is that it can integrate with external libraries, and you can get this work done there. There is a library for this called Dask that you can use, or you can also use Kubernetes; these are some of the options you have for performing distributed computing. In some future video I will teach how to do it; here I just wanted to give you a rough idea of what distributed computing is and how XGBoost helps you use it.
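The worker/master pattern just described can be sketched in plain Python. Everything here (the gain numbers, the `local_best_split` helper, the shard names) is a made-up illustration of the aggregation step, not XGBoost's actual distributed protocol:

```python
# Each "worker" holds a shard of the data and proposes its best split
# (feature, threshold, gain) found locally on that shard.
def local_best_split(shard):
    # Toy stand-in: a real node would scan its shard of the data;
    # here each shard directly carries its candidate split and gain.
    return shard["candidate"]

shards = [
    {"node": "d1", "candidate": ("f1", 4.5, 0.30)},
    {"node": "d2", "candidate": ("f2", 7.0, 0.55)},
    {"node": "d3", "candidate": ("f1", 5.5, 0.42)},
    {"node": "d4", "candidate": ("f2", 6.0, 0.18)},
    {"node": "d5", "candidate": ("f1", 8.5, 0.51)},
]

# The "master" collects every proposal and keeps the split with maximum gain.
proposals = [local_best_split(s) for s in shards]
feature, threshold, gain = max(proposals, key=lambda p: p[2])

print(feature, threshold, gain)  # → f2 7.0 0.55, the globally selected criterion
```

The workers run in parallel; only the tiny candidate tuples travel over the network, not the data itself, which is what makes the scheme scale.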
So guys, let's move on to the last part of the speed aspect, and that is GPU support. GPU stands for graphics processing unit, what we popularly call a graphics card. In all the discussions we had before, the understanding was that whatever processing is going on happens through the CPU; yes, the main guy is the CPU. But you also know that as data gets bigger, the CPU becomes slow, and when the CPU can't keep up, a very good approach is to use a GPU. A graphics card is essential for deep learning; deep learning training, as you know, happens on GPUs. What is a GPU? Basically it is a processor with a lot of cores, each of which is less powerful than a CPU core, but there are many more of them. Each core cannot perform very large calculations, but it can do small calculations, and that is why GPUs are used so heavily in gaming, where very high-grade graphics have to be rendered, and in deep learning training, where many small matrix-based calculations are performed. So the creators of XGBoost realized that since quite a lot of the work had already been made parallel, like histogram-based training, histogram creation, and split finding, and all these tasks are highly parallelizable, why not use the GPU as well? So they decided to provide GPU support, and XGBoost is one such library where you have GPU support. In fact, look at this graph, which I took from a blog. There is test error on the y axis, which we are trying to minimize, and time on the x axis. The orange curve is the CPU, which has 32 cores, and the blue curve is a Tesla P100 GPU, and you can see that at around 120 seconds the GPU is already performing much better. And
to work with the GPU you don't have to do anything special. You basically get a hyperparameter, the one I just mentioned above, whose name is tree_method: you have to set its value to gpu_hist, and internally, if your computer has a graphics card, XGBoost will start using it to perform the training, and you will notice that your speed goes way up. So this is the sixth, but a very important, point about how XGBoost uses software optimization to become faster and speedier.

Okay, now after telling you all these things, let me ask: what do you feel about XGBoost, about the way these people used software engineering to improve the speed of the algorithm? I'm pretty sure a word may have come to your mind, and that word could be "extreme". They picked up all these heavy software concepts, put them all on top of gradient boosting, and derived XGBoost; and no wonder the full name of XGBoost is Extreme Gradient Boosting. That "extreme" is coming from exactly here. I also felt that, friend, they did a somewhat extreme amount of optimization work, and that is why they called this algorithm XGBoost.

Okay, so that was the aspect where we read how software engineering concepts were applied to further optimize gradient boosting. Now we move on to the next section and discuss how concepts from machine learning were used to further optimize the performance; that will be our next point of discussion. So guys, now we are moving to the third aspect of XGBoost, and that is performance. We will discuss how Tianqi Chen improved performance by working on different aspects, and here you will see how advanced machine learning concepts have been used to make the algorithm better. Okay, so here we will discuss five points one by one. First, let me tell
you about the first point: the regularized learning objective. I guess you got to know about regularization when we studied linear models, linear regression, logistic regression, etc.; there we studied the concept of regularization and talked about L1 and L2 regularization. In a nutshell, regularization is mostly used when your model gives good results on the training data but the results on the testing data are not as good, basically due to overfitting. In this situation you try to simplify your model a little by using a regularization term: you add a regularization term to your loss function, and now, when minimizing your loss function, you have to minimize this term as well. That is the basic idea behind regularization. Now, in gradient boosting you could use any differentiable loss function, but regularization could not be applied there directly. To reduce overfitting in gradient boosting there were two ways: one was the learning rate, where you multiply each subsequent tree by a learning rate to reduce its impact, and the second was during the tree-construction process, where you would prune the trees and so on. So what is the advantage in XGBoost here?
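As a quick refresher on the recap above, here is a tiny sketch of what adding an L2 penalty to a loss looks like, in plain Python with my own toy numbers (XGBoost's actual objective is richer and is discussed next):

```python
# Ridge-style (L2) regularized loss: MSE plus lambda times the sum of squared weights.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def regularized_loss(y_true, y_pred, weights, lam):
    return mse(y_true, y_pred) + lam * sum(w ** 2 for w in weights)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 6.0]
weights = [0.8, -1.2]  # toy model weights

plain = mse(y_true, y_pred)
penalized = regularized_loss(y_true, y_pred, weights, lam=0.1)
print(plain, penalized)
```

Because the penalty grows with the weights, minimizing the combined loss pushes the optimizer toward a simpler model, which is exactly the anti-overfitting effect described above.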
In XGBoost, the loss function has a regularization term by default, so when you try to minimize your loss function, your focus automatically goes to the regularization term as well, and regularization starts happening on its own. Let me show you. This is the original XGBoost paper, and if you go down, the first point under tree boosting is exactly this: the regularized learning objective. Here you can see the overall loss function of XGBoost: a differentiable loss function between your prediction and your target, and at the same time this extra term gets added. If you focus on this term, it is the regularization term: w here is basically the weight of a leaf, and lambda is your regularization hyperparameter. Because this is present in the loss function itself, applying regularization becomes much easier and happens directly, whereas you will notice that in gradient tree boosting this does not happen in the same way. We will go into the details when we study the mathematical formulation; I just wanted to explain the point that the performance of XGBoost is a little better because an effort has been made to add regularization directly. What happens as a result is that your model becomes a slightly more general model, and its performance on different types of data sets is good, mostly because the loss function is designed so that by default it has a regularization term. I hope you understood this point.
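Following the paper's notation loosely, the objective has the shape L = Σ l(yᵢ, ŷᵢ) + Σ Ω(fₖ), with the per-tree penalty Ω(f) = γT + ½λ‖w‖², where T is the number of leaves and w the leaf weights. Here is a toy computation of that penalty (all numbers are made up, purely to show how leaf weights feed into the penalty):

```python
# Toy version of XGBoost's per-tree regularization penalty:
#   Omega(f) = gamma * T + 0.5 * lambda * sum(w_j ** 2)
# where T is the number of leaves and w_j are the leaf weights.
def omega(leaf_weights, gamma, lam):
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w ** 2 for w in leaf_weights)

leaf_weights = [0.4, -0.3, 0.1]  # one tree with three leaves (made-up values)
penalty = omega(leaf_weights, gamma=1.0, lam=2.0)
print(penalty)
```

A tree with more leaves or larger leaf weights pays a bigger penalty, so minimizing the objective naturally prefers smaller, smoother trees.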
So guys, let's move on to the next point, and that is handling missing values. To understand it, we will actually take the third point first: we will first understand what sparsity-aware split finding is, and as soon as you understand that, you will automatically understand how XGBoost handles missing values. To give you some context, what we have read until today is that if you pass a data set containing missing values to any machine learning algorithm, it will not work; so what we do is pre-processing, where we find our missing values and impute them, or else remove them. But in XGBoost there is no such need; XGBoost handles missing values automatically by default. To understand how, we will read about sparsity-aware split finding. First, what is sparsity? Sparsity happens when your data has lots of zeros or missing values. There is a mechanism inside XGBoost by which it is first able to detect sparsity and then able to handle it. Let me give you an example of how missing values are handled. Assume we have a feature in which different values are given, and some of them are missing. Any other algorithm would not work on it, because there are missing values; you would either have to fill them or remove them. But what does XGBoost do smartly? It
tries to understand in which direction it should send the missing values. Let me explain what this means. First you decide the splitting criteria: your first splitting criterion becomes 4.5, the next becomes 5.5; around the missing values you do not create splitting criteria, so between six and eight it becomes seven, and here it becomes 8.5. Now say you create a node "greater than 4.5". You go through each data point and check whether it goes to the left node or the right node: the four goes here, the "yes" points go to this side, the "no" points go to that side, the 6 and 8 go here and not there. Now the question is, which node should the missing values go to, this one or that one? What XGBoost does is perform this splitting twice: once it sends the missing values to this side and calculates the gain, and once it sends them to the other side and calculates the gain, and whichever side gives the maximum gain, it understands that the missing values should be sent to that node. So in this manner, XGBoost is trying to learn a default direction, a directional sense of which node the missing values should go to.
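The try-both-directions idea can be sketched in plain Python. The data, the split at 4.5, and the variance-reduction "gain" below are my own toy choices; XGBoost's real gain formula uses gradients and hessians, but the gating logic is the same:

```python
# Toy: decide the default direction for missing values at a split "x > 4.5"
# by trying both directions and keeping the one with the higher gain.
def variance(ys):
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def gain(parent, left, right):
    # Variance reduction: impurity of the parent minus the weighted child impurity.
    n = len(parent)
    weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(parent) - weighted

# (feature value, target); None marks a missing feature value.
rows = [(2.0, 1.0), (4.0, 1.2), (6.0, 3.0), (8.0, 3.2), (None, 3.1), (None, 2.9)]
threshold = 4.5

left = [y for x, y in rows if x is not None and x <= threshold]
right = [y for x, y in rows if x is not None and x > threshold]
missing = [y for x, y in rows if x is None]
parent = [y for _, y in rows]

gain_left = gain(parent, left + missing, right)    # missing values sent left
gain_right = gain(parent, left, right + missing)   # missing values sent right
default_direction = "left" if gain_left > gain_right else "right"
print(default_direction)  # → right (the missing rows' targets resemble the right branch)
```

At prediction time, any row with a missing value at this node simply follows the learned default direction.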
And that is how XGBoost handles missing values. Right now I haven't told you this in much detail; we will make a separate video to explain completely how it works. What I can do right now is give you a code example showing that if you really do give missing values to XGBoost, it will handle them. So I wrote this code: here you can see we have data with two features; the first feature has no missing values, but the second feature has a missing value, and this is our y. We did a train-test split, and without trying to fill the missing values at all, I directly trained an XGBClassifier, and look, we still get an accuracy score. No ordinary machine learning algorithm would run like this on missing values, but XGBoost is not like that; XGBoost can handle missing values internally. Okay, so that was an overview; the rest we will cover in detail.

So guys, let's move on to the next point, and that is efficient split finding. In this we will get to know about two things: one is the weighted quantile sketch, and the other is approximate tree learning. This is an important part, because of it XGBoost not only performs well in terms of speed but also performs well on the data set itself. I will give you the basic idea. Boosting algorithms depend on building trees, and they build a lot of them. Now, there is a big problem in building trees. Assume you have a numerical feature like this one. If you want to build a tree here, then at every node you have to decide a splitting criterion.
But how do you decide the splitting criterion? You use a technique called exact greedy search: you take the average between every two consecutive points, treat that as a candidate splitting criterion, and then you compare the results of all of them, like we read in decision trees. Now, what is the problem with this approach? Obviously it will give you the best result, because you are trying out every value, but I think you can understand that the problem with this approach is that it is slow, because you have a lot of values to try out. Imagine a big data set where a column contains crores of values; how many times will you have to do this? Right here comes the solution: approximate tree learning. In this concept you don't try out all the values; instead, you bin the numerical column, meaning you divide it into bins: one bin from one to five, the second from six to 10, the third from 11 to 15, and so on. By binning, your continuous variable now becomes discrete. Obviously, the performance will not come out quite as good as with exact greedy search, but the speed can increase a lot. This is called approximate tree learning. Now, in approximate tree learning you have to make these bins, and making the bins is itself a whole process.
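Here is a toy comparison of the two ideas in plain Python: exact greedy proposes a candidate split between every pair of consecutive values, while the approximate method proposes only a handful of bin boundaries (taken at quantiles here, anticipating the next section; all numbers are illustrative):

```python
values = sorted([3, 7, 12, 18, 21, 25, 30, 34, 41, 47, 52, 60])

# Exact greedy: a candidate split at the midpoint of every consecutive pair.
exact_candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

# Approximate: only a few candidate splits, taken at quantiles of the sorted data.
def quantile_candidates(vals, n_bins):
    step = len(vals) / n_bins
    return [vals[int(i * step)] for i in range(1, n_bins)]

approx_candidates = quantile_candidates(values, n_bins=4)

print(len(exact_candidates), len(approx_candidates))  # → 11 3
```

With 12 values the saving looks trivial, but with crores of values the drop from "every midpoint" to "a few dozen quantile boundaries" is exactly where the speed comes from.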
The training that is done in this manner is called histogram-based training in XGBoost, because by creating bins you are essentially building a histogram. The technique we use for this bin-creation process is called the weighted quantile sketch. So what do you do? You study the distribution of the numerical column. Let's say this is the distribution; now, according to the distribution, you decide the bins: you say there will be a bin from here to here, another bin from here to here, another from here to here, then from here to here, and so on. So basically you are calculating quantiles and creating the bins on the basis of those quantiles; like, one bin can go from zero to 30, another quantile can be 30 to 45. The binning you are doing is on the basis of quantiles. Now what is its benefit? I will explain it; when we study the exact method further, we will see why these bins use quantiles. To see the advantage, think about what the alternative way to make bins could have been. Assume the values of your feature are all between zero and 100; one way to do the binning is uniform binning: starting from zero, a bin from 0 to 10, one from 10 to 20, one from 20 to 30, doing this all the way up, a bin from 90 to
100. But uniform binning actually has no matching with the distribution, no relation to it at all. Now, if you start using quantiles for binning, what is the benefit? Assume this is the distribution of your data, from zero here to 100 here. Wherever the data is sparse, like the early part here and the later part here, it means the values between zero and 10 are few and the values between 80 and 100 are also few, while the most values lie, say, between 40 and 60. What you would want is that where the values are few, you make big bins, and where the values are very dense, you make small bins, because a large number of data points sit inside that region. When you bin using quantiles, you are actually studying the distribution of the data and making an educated decision about what the bins should be. With that, you will be able to describe the data more accurately through these bins, and the trees that grow will be more accurate. So basically, first of all, through approximate tree learning we acquired speed, but the cost of acquiring that speed was performance: the results on test data started decreasing, because obviously exact greedy search would do better. To fill that gap, what did they do? The binning technique used is the weighted quantile sketch: on the basis of the data's distribution it decides where the bins will be, and its benefit is that in this way you can create the bins that the data actually requires. I know this is just the surface that I am touching; we will read this in detail, but this is one very important aspect of XGBoost that you do not get in gradient boosting. Let's move on to the last machine
learning based optimization, and that is tree pruning. If you have read topics like decision trees, random forest and bagging, then you will know what tree pruning is. Tree pruning is basically a process where you cut short or trim the depth of the trees, and the goal is to reduce complexity and reduce overfitting. Okay, so it is a very important aspect throughout the whole boosting process. We use two types of pruning: one is called post-pruning, where you first grow the tree fully and then, based on that, decide how much smaller to make it; and the second is pre-pruning, where you decide during the tree-construction process itself how big the tree will be. The good part about XGBoost is that it gives you a lot of options for pruning: here you have not only a lot of hyperparameters related to pre-pruning and post-pruning, but at the same time there is another hyperparameter called gamma, using which you can decide that a new branch should be created only when there is a significant reduction in the loss. Because all of this is available, tree pruning in XGBoost is flexible, unlike vanilla gradient boosting, where tree pruning was not used as effectively; and since XGBoost uses it effectively internally, obviously the performance of XGBoost is great on most data sets. Okay, so these are the five things we discussed. There are other things too, but these are five different things in XGBoost that are found only in XGBoost, and that is why I thought I should give you an introduction to them once; the rest, we will discuss all these things in future videos.
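The gamma idea, split only when the gain is big enough, can be sketched like this in plain Python (the gains and the threshold are made-up numbers; in XGBoost gamma also enters the real gain formula itself, so this shows only the gating logic):

```python
# Pre-pruning with a gamma-style threshold: a candidate split is accepted
# only if its gain exceeds gamma; otherwise the node stays a leaf.
def should_split(split_gain, gamma):
    return split_gain > gamma

candidate_gains = [0.02, 0.40, 0.11, 0.75]  # toy gains for four candidate splits
gamma = 0.10

accepted = [g for g in candidate_gains if should_split(g, gamma)]
print(accepted)  # → [0.4, 0.11, 0.75]
```

Raising gamma makes the model more conservative: weak splits like the 0.02 one above never get created, which keeps the trees smaller and reduces overfitting.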
So, we will cover those in detail as well. Guys, we have now discussed in great detail the components in XGBoost that make it so important in machine learning. Now, if you have it in your mind that XGBoost is the only gradient boosting implementation that is very robust and gives very good results on data sets, you are actually wrong; there are other implementations, and you may have even heard their names. One of them is on your screen: this is from Microsoft Research, and its name is LightGBM, Light Gradient Boosting Machine, and it has many features that stand in comparison to XGBoost. Yes, it is a little more lightweight, and we will discuss the details further, but at this point I just want to show you that this is also an important library you can use; many times, in LightGBM versus XGBoost comparisons, the results are even better than XGBoost. It is targeted at faster training speed and lower memory usage, sometimes better accuracy, and it has the same kind of features: parallel, distributed and GPU learning; and it is capable of handling large-scale data. There is another similar library whose name you may have heard, and that is CatBoost. CatBoost is a high-performance open-source library for gradient boosting on decision trees, and one of its most important features is built-in support for categorical features, which again we will read about later. But at this point, don't get the idea that XGBoost is the only library that has improved on gradient boosting; it is not. There are these two more libraries, in fact there are many more, but these three are the ones you will hear about the most: XGBoost, LightGBM, CatBoost. We will go into the details of XGBoost in the upcoming videos, and gradually we will cover LightGBM and CatBoost as well. Okay, so with that, I'm pretty sure you must have learned something new, and now you are excited
to learn XGBoost in detail. So if you liked the video, then please hit like, and if you are not subscribed to the channel, please do subscribe. See you in the next video. Bye.