Transcript for:
Decision Trees and Random Forests

I'm going to teach a couple of machine learning techniques called decision trees and random forests in this lecture. We'll start with the decision tree. A decision tree, as the name suggests, is a tree-like structure that you generate from data in order to do some machine learning task, and as we have seen before, the most common machine learning tasks are classification and regression (or function approximation). You can generate decision trees for both types of problems, classification problems and regression problems. Just to recap for people who have not seen this: a classification problem is one where, given a data point, you are trying to assign it to one of a set of predefined classes, and a regression problem is one where, given a data point, you are trying to predict an output or target feature value.

Now, you could build a decision tree for either a classification problem or a regression problem, and there are multiple algorithms for building these trees, which are all mentioned here. What I would like to emphasize is that at the end of it, what you get is a tree structure which lets you either predict a value for a target or provide a classification. The reason I mention that there are many algorithms is that when you use a particular software package, these become options that you can choose. There are also multiple metrics that are used in building a decision tree. I will demonstrate how one metric works in this lecture, and the other metrics work similarly, so once you understand one of them you will be able to read about the others on your own and understand how they might be used. These metrics, which we will talk about in more detail as we go along, are basically what allow you to unravel the tree, so to speak: if you think of a decision tree as a tree with some top node out of which further nodes are unraveled and formed, there must be some logical procedure for this unraveling, for making this tree from data, and there are multiple metrics that can drive it. Gini impurity, information entropy, and variance reduction are the metrics typically used: Gini impurity and information entropy when you have classification problems, and variance reduction when you have regression problems to model. In this lecture I will focus on classification problems, with Gini impurity as the metric, but what you should watch for through this lecture is how the tree is formed; once you understand that, it will be very easy to see how you can develop a tree like that for regression problems.

The way I am going to teach the development of a decision tree, or the understanding of decision trees, is through an example, which will make the key ideas a lot more concrete. So let's start with a classification problem, as I mentioned in the previous slide. This is a very well known problem; if you search for it you will see it described in multiple papers, and it has been a data set that many people have tried their algorithms on for a while now. The idea is that there is this iris flower, and there are multiple species of it. Given a new iris flower, would you be able to classify it as one of the three species that we are considering here?
The flower type is iris, and the species are setosa, versicolor, and virginica. Now, when we say that, given a flower, you should be able to classify it into one of these categories, we need some training data to do this. So, for me to explain decision trees to you, let's assume that there are about 149 data points; that means I have 149 samples of this flower which have been pre-classified as being of the setosa, versicolor, or virginica species. Now, when we say data, what does that mean? For each sample we compute certain characteristics, and in this data set there are four features computed for each sample: the sepal length, the sepal width, the petal length, and the petal width. These are actual geometric features that you measure for each sample. So, for example, you take each of the 49 samples already identified as setosa, measure these quantities, put them into the data set, and record that they came from setosa, and so on for the other species. The problem now is that I have to come up with a classifier which learns how to classify a new data point based on this already pre-classified data, which is used to train my algorithm, and the algorithm we are going to look at in this case is a decision tree, which is what we will use in the example that follows.

Now, the ranges of these feature values: for sepal length, if it is a setosa flower we notice that it lies between 4.3 and 5.8; for versicolor it is between 4.9 and 7; and for virginica between 4.9 and 7.9. This is basic inspection of the data, seeing what the ranges of values are for each attribute within each class. So that is the basic description of the problem we are trying to solve.
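Before going further, it may help to look at this data directly. Below is a minimal sketch, assuming scikit-learn and pandas are installed; note that scikit-learn's bundled copy of the iris data has 150 samples (50 per class), while the lecture works with 149, i.e. one setosa sample fewer, so the counts differ by one.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                                   # four feature columns plus an integer 'target'
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df["species"].value_counts())               # 50 setosa, 50 versicolor, 50 virginica
# Per-class minimum and maximum of each feature: the ranges quoted in the lecture
print(df.drop(columns="target").groupby("species").agg(["min", "max"]))
```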
Now, just from this data, we are going to generate a tree which will allow us to classify new data points. The key things I want you to notice as we go through this example are how this tree is generated and, at the end of it, how the tree helps us classify a new data point into one of these three classes.

So let me explain decision trees through this iris classification example. A decision tree, as the name suggests, is a tree-like structure where you have nodes which open out into other nodes, which might open into other nodes, and so on; you start with one node and end up with some number of nodes at the bottom. How you generate this tree and how it is useful for classification is what we are going to see in this lecture. To understand this easily, let's think about what each node means: each node holds a collection of data points. If I start with the root node, the very first node, that node holds all the data points in the problem; in this case we know there are 50 data points from versicolor, 50 from virginica, and 49 from setosa.

Now, suppose we did not build the tree at all and just sat with this data, and I said, let me do a classification whenever you give me a new data point. The best we could do is look at this data and see which class occurs most often, under the assumption that the overall population has similar proportions: whatever occurs most here is the most likely species for a new sample. Here we would say it has to be either versicolor or virginica, and if forced to give one decision we might randomly break the tie and say a new data point is, say, versicolor. If you just sit at the root node and do this, that is the solution you come up with, but it is not useful from a classification viewpoint. What we are trying to do is use the features of the samples and do some computation so that we get a classification with high accuracy. If I just kept saying versicolor without developing the decision tree any further, then out of the 149 sample points I would get 50 of them right, because every time I just say versicolor, and 99 of them wrong; that is very poor accuracy, which is what you don't want. So the basic idea is to develop the tree so that whichever node I stop at and give an answer from, the accuracy improves.

The same notion also works for regression trees. If you had target values at the first node and I asked you the likely target value for a new data point, without doing anything at all you might simply take the average of all the training targets and report that for any new point; that gives a very poor model, so there too you expand the tree so that the regression problem is solved much better. So the same idea works for both classification and regression problems.
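As a quick sanity check of that "do nothing" baseline, here is the arithmetic in a couple of lines of Python, using the class counts quoted in the lecture:

```python
# Majority-class baseline: always predict the most frequent class in the training data
counts = {"versicolor": 50, "virginica": 50, "setosa": 49}
baseline_accuracy = max(counts.values()) / sum(counts.values())
print(baseline_accuracy)   # 50 / 149 ≈ 0.336, i.e. right only about a third of the time
```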
Now we are going to learn how to develop this tree, starting with just this first data node, and introduce the concepts needed to explain how the tree is developed. But before we do that, to understand this better, let's look at what the best-case scenario for this example might be. Suppose that, at the end of developing the tree, I get three nodes: one node has all the versicolor data, with zero virginica and zero setosa; one node has zero versicolor, all 50 virginica, and zero setosa; and one node has zero, zero, and all 49 setosa. If this is the case, then as you start from the root node and use the feature values of a new data point to traverse the tree (we will explain shortly what traversing means), if you end up at the first node you say the new data point is versicolor, if you end up at the second you say it is virginica, and if you end up at the third you say it is setosa. So the questions are: how do you develop the tree so that this partition happens, and what does traversing the tree mean, that is, how do I start at the root and decide whether to go to one child node or the other? That is basically what I am going to teach, and it is the basic idea of the decision tree algorithm itself.

Now, remember that you might not always be able to classify the data into such distinct nodes; there might be some overlap that you simply cannot get rid of, and all of this depends on the data. In some data sets you can get a complete separation, and in some data sets, whatever you do, you might not. And, actually, you know from your machine learning knowledge that getting complete separation on the training set by itself might not be very helpful, because then we are over-learning the training set, and the ability to generalize to a new data point might be much poorer.

Okay, now that we have set up this basic tree structure and explained the idea of how we are going to use the tree to make classifications, it is easy to describe how the tree itself is generated purely from data and how these algorithms work. A tree developed like this from data is called a decision tree because, at every node, based on a feature value, we make a decision and traverse the tree until we stop at some node, where we give the final decision about which class the data point belongs to. It basically mimics how we take decisions: if this is the case and that is the case, then do this, and so on; the same notion is captured in the tree.

Now, to develop this tree we are going to introduce some basic terminology. As I mentioned before, I am going to describe the notion of Gini impurity. You can define a Gini impurity for every node in the tree, and it is defined by the equation G = 1 − Σᵢ fᵢ², where the sum runs over the classes and fᵢ is the fraction of samples in that node belonging to class i. Let's try to understand what this means by taking the root node. For this node the Gini impurity is 1 − Σᵢ₌₁³ fᵢ², because there are three classes. If we compute the fraction for versicolor, f₁: the number of versicolor samples in this data set is 50 and the total number of samples is 149, so f₁ = 50/149; f₂ will also be 50/149, and f₃ will be 49/149. Once we have these three numbers, we can compute the Gini impurity of this node as 1 − (50/149)² − (50/149)² − (49/149)². That is how you compute the Gini impurity of the root node. Similarly, once we get to each of the other nodes, based on the number of data points of each class that land there, we can compute its Gini impurity, and you will see nicely in this example that, because we keep partitioning the data, the Gini impurity keeps changing from node to node.

Now that we have a Gini impurity for every node of a decision tree, we might ask: what is the best kind of node? For example, let's try to compute the Gini impurity of one of the leaf nodes in the best-case tree above. Why am I picking that node? Because it is what I am going to call a pure node: if I somehow land at that node, I can categorically say the sample belongs to versicolor; if I land at the second node I can say it belongs to virginica; and if I land at the third I can say it belongs to setosa. These are what are called pure nodes.
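As a small illustration, here is that formula written directly in Python (a minimal sketch; the class counts are the ones quoted in the lecture), applied both to the root node and to a pure node:

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 - sum_i f_i^2, where f_i is the fraction of class i."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([50, 50, 49]))   # root node with all the data: ≈ 0.667
print(gini_impurity([50, 0, 0]))     # a pure node (only versicolor): 0.0
```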
A pure node is a node that has collected data corresponding to only one specific class, with the other classes excluded. So if I have a pure node like this, how would I compute its Gini impurity? I use the same formula, but what will the value be? Again I have 1 − Σᵢ₌₁³ fᵢ², and in this case f₁ for versicolor is 50/50, because the counts are 50, 0, and 0, so f₁ = 1, f₂ = 0/50, and f₃ = 0/50. The fractions are 1, 0, and 0, and if you use these to calculate the Gini impurity, it comes out to zero. So whenever the Gini impurity is 0, it means we have a pure set, where only one class is represented and the other classes are all left out.

So, ideally, the goal of the whole decision tree is to start with the data itself, which gives me some Gini impurity without doing anything, and then somehow unravel the tree and get nodes at the bottom which are all pure, or as close to pure as possible. At the top of the decision tree there will be a positive value of the Gini impurity, and our goal is to come down to the lowest level of the tree where all the nodes have a Gini impurity of zero, that is, pure nodes. In other words, we are trying to keep reducing the Gini impurity as we go down the tree, towards zero.

So now we come to the question of how we decide how to traverse this tree; what does it mean to traverse the tree? As I said before, each node represents a decision. At a node I have to take a decision, and the way I take it is: I pick one feature from the data set, do some computation with that feature, and the result of the computation tells me whether to go to the left child or the right child. An example might be: take one feature, and if that feature is greater than five go to this node, and if it is less than five go to that node. In other words, whenever we traverse or unravel this tree, we are looking at a feature value and making a decision based on that feature value. The whole tree is developed like that: each decision point has a feature and something to compare it with. Now, as you notice, in this case there are four features: petal width, petal length, sepal width, and sepal length. You could choose any one of these four features for opening out the tree, and each feature has a different range of values, as we saw in the previous slide, so there has to be some partition point for those values so that we can decide to go to one part of the tree or the other. How do we do that? That is what we are going to see.

Once we have the Gini impurity, we are also going to define something called the Gini split index. Let's assume I choose some feature and decide to split the data on it. What does that mean? We will see in the next slide, but for now, bear with me and just think about this: suppose I pick one feature, say sepal length, and I say that if sepal length is less than some value a, go to the left node, and if it is greater than or equal to a, go to the right node.
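Mechanically, that one decision is just a split of the node's data on a single feature and a threshold. A minimal sketch (the choice of sepal length and the threshold 5.0 are purely illustrative):

```python
from sklearn.datasets import load_iris

def split_node(data, feature, threshold):
    """One decision: rows with feature < threshold go to the left child, the rest to the right."""
    left = data[data[feature] < threshold]
    right = data[data[feature] >= threshold]
    return left, right

df = load_iris(as_frame=True).frame               # same illustrative data as in the earlier sketch
left, right = split_node(df, "sepal length (cm)", 5.0)
print(len(left), len(right))                      # how many samples each child receives
```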
Right, that is a decision I can make. Why I should choose sepal length and not sepal width, petal length, or petal width is a question we are going to answer, but for now I am just explaining the process. If this is how I am going to unravel the tree, then what I do is take all the samples at this node, find the ones that satisfy the condition, and put them in the left child. Maybe, of 50 samples (we will see the actual values later, I am just explaining the idea), 30 are such that sepal length is less than a, so they go left, and the remaining 20 go right. In this way the whole set of 149 data points gets split into two parts; maybe 99 data points come to one node and 50 go to the other, or maybe 60 come here and the rest go there; it all depends on what we choose.

Now, because we have already defined the Gini impurity of a node, we can compute a Gini impurity for the left child based on what lands there, and a Gini impurity for the right child based on what lands there. So if we use sepal length and this number a as the choices for making the split, we can compute something called the Gini split index, which is the Gini impurity of the original node minus p₁ times the Gini impurity of the left child minus p₂ times the Gini impurity of the right child, where the children's impurities are computed with the same formula, based on how the split happened. The only things left to define in this formula are p₁ and p₂, and that is very simple: if I start with 149 data points and, say, 99 go to the left node and 50 go to the right node, then p₁ is the fraction of data points that went left, 99/149, and p₂ is the fraction that went right, 50/149. So we can compute these fractions and the Gini impurity of each child, and from those we can compute the Gini split index.

Now, what this value is will depend very critically on which feature we chose to unravel the node and on the value of that feature we used to make the partition. The whole business of building a decision tree is coming up with the sequence of features used to unravel the tree and the values used to partition on; all of that is automated, and these algorithms and packages will give you the best solution for these splits, so you don't have to do it by hand. I am explaining it so that you understand what is happening when you finally look at the result of a decision tree on an example you work with. The important rules for constructing the tree are these: every parent node has a higher Gini impurity than the weighted impurity of its children; remember that at the top, where I have done no splitting, the impurity is highest, and ideally what I am looking for are bottom-level nodes which are pure, with Gini impurity zero. So the Gini impurity keeps coming down as we go down the tree. There are multiple options for doing the splitting, so what we do is enumerate those options, or use some smart algorithm to pick options that are likely to be good, compute the Gini split index for each option, and always choose the option whose Gini split index is maximum.
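Written out in code, the split index is just the parent's impurity minus the weighted impurities of the two children. A minimal sketch, with the same Gini impurity helper as in the earlier sketch:

```python
def gini_impurity(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split_index(left_counts, right_counts):
    """Gini split index = G(parent) - p1 * G(left child) - p2 * G(right child)."""
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    return (gini_impurity(parent_counts)
            - (n_left / n_total) * gini_impurity(left_counts)
            - (n_right / n_total) * gini_impurity(right_counts))
```

The higher this value, the more the split reduces impurity; a split that produces two pure children recovers the parent's entire impurity.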
Now, why do we want to choose the split with the maximum Gini split index? That comes from the previous equation. If I have a node with a certain Gini impurity and I want to make children nodes out of it, I want to get to pure nodes as quickly as possible, because only then can I do perfect classification. Ideally, if I could split in such a way that the two children coming out of this node both have a Gini impurity of zero, that would be the best solution, and in that case the Gini split index equals the Gini impurity of the original node, because both child terms are zero. So the split index should be as high as possible, which means the children's impurities are as low as possible, that is, the children are as close to pure sets as possible. When I have multiple possibilities, one thing I can do is enumerate them, compute the Gini split index for each one, find the possibility for which the Gini split index is maximum, and say that is the choice I am going to make. That is what is shown here.

So let's look at this with the example; now that I have explained all of this, hopefully you will understand it much better. The root node, as we showed before, has 49 setosa, 50 versicolor, and 50 virginica data points, and if you look at the node in the figure you see these numbers. You also see that the node is labelled versicolor; that means that if you are sitting at this node with a data point and I ask how you classify the sample, you are going to say versicolor. In other words, at the beginning, if I do nothing at all, then for any new sample you give me I am just going to close my eyes and say it is versicolor. Why versicolor? I could have said versicolor or virginica, since there are 50 data points of each; I have just randomly broken the tie and said versicolor. Also notice that at this node none of the data is lost: all of it is still there, 49 setosa, 50 versicolor, and 50 virginica.

Now, remember that in the tree the first thing we have to do is take a decision, and a decision means choosing one of the features; there are four different possibilities. What the algorithms do is look at all these possibilities and find the best split, so as to come up with the most compact tree that can be built, but here in this lecture I am trying to explain the ideas behind it, so I am going to take a couple of examples to show you what happens. For example, suppose the algorithm had actually chosen petal length as the feature here. If you look at petal length, for setosa the range is 1 to 1.9, for versicolor it is 3 to 5.1, and for virginica it is 4.5 to 6.9. All of this is done automatically by the algorithm; I am just walking through the data so that you understand the logic behind how the split is made, so that once you look at a tree you can see what is actually happening. Now, if you look at the versicolor range, 3 to 5.1, and the virginica range, 4.5 to 6.9, you will see that there is an overlap, so for any value of petal length I pick to partition on, there are likely to be both versicolor and virginica samples on the same side of the partition.
But if you look at setosa against the other two, there is a clean partition: for setosa, petal length is between 1 and 1.9, for versicolor it is between 3 and 5.1, and for virginica between 4.5 and 6.9. So if you take any value between 1.9 and 3, you can separate setosa from the other species. One value you can take is roughly in the middle of 1.9 and 3, so that you get a good separation; you might say, if petal length is less than 2.4, go to the left node, and if petal length is greater than or equal to 2.4, go to the right node. If that is the decision you make, let's see how the data gets partitioned. In the training data, if you pick all the data points where petal length is less than 2.4 and bring them to the left node, you will notice that all the setosa data comes to that node, and all the versicolor and virginica data goes to the right node. So the 149 data points have now been split between two nodes: one node retains 49 data points and the other retains 100. Notice that the 49-point node is a pure node, whereas the 100-point node is not.

Nonetheless, if you make this decision, choosing petal length as the first feature on which to separate and 2.4 as the number, then you get this split, and now you can quite easily compute the Gini impurity of each node. For the root it is 1 − (49/149)² − (50/149)² − (50/149)², which, as I mentioned, turns out to be about 0.66. Since the left node is a pure node, its value is 0, because the three fractions are 49/49, 0/49, and 0/49. For the right node the three fractions are 0/100, 50/100, and 50/100, so its impurity is 1 − 0 − 0.5² − 0.5², which is 0.5, as shown here. So I have a Gini impurity for the root, for the left node, and for the right node.

Now, for this option of petal length with the value 2.4, I can compute the Gini split index. Remember, the Gini split index is the Gini impurity of the original node, minus the fraction of data that came to the left node times that node's Gini impurity, minus the fraction that went to the right node times that node's Gini impurity. We have already computed the impurities of these three nodes, so the Gini split index is the original node's impurity, minus 49/149 (the 49 points that went left) times 0, minus 100/149 (the 100 points that went right) times 0.5. If you compute this you get the 0.324 shown here. Please look at this computation carefully; every other time the same computation is done there is no difference at all. All you need to know is that every node has a Gini impurity, and based on a partition you can compute a Gini split index.
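Here is a sketch that repeats this whole computation on scikit-learn's copy of the data (150 samples, so the left node gets 50 setosa rather than 49). The split index comes out at about 0.33; the lecture's 0.324 appears to come from carrying the rounded root impurity 0.66 through the arithmetic.

```python
from sklearn.datasets import load_iris

def gini_impurity(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

# The candidate first split: petal length < 2.4 cm
left = df[df["petal length (cm)"] < 2.4]
right = df[df["petal length (cm)"] >= 2.4]

class_counts = lambda part: part["species"].value_counts().tolist()
g_root = gini_impurity(class_counts(df))      # ≈ 0.667
g_left = gini_impurity(class_counts(left))    # 0.0  (pure: only setosa)
g_right = gini_impurity(class_counts(right))  # 0.5  (50 versicolor, 50 virginica)

split_index = (g_root
               - len(left) / len(df) * g_left
               - len(right) / len(df) * g_right)
print(g_root, g_left, g_right, split_index)   # split index ≈ 0.33
```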
Now suppose that instead of petal length I had used sepal length, and suppose I come up with the value 5.8. If you ask me how I came up with 5.8, there are multiple heuristics for this, and you can use a very simple heuristic here to arrive at it. The reason I don't want to go too deeply into the heuristics is that there are multiple possibilities and these algorithms automatically figure out what the good possibilities are, so we really do not need to worry about where the 5.8 comes from; what you do want to think about is how, once you have a final solution, you understand that solution. So let's say I have somehow come up with 5.8 as the best split value for sepal length. I am again starting from the top node, whatever we had before, without any partition, and I partition the data such that everything with sepal length less than 5.8 goes to the left node and everything with sepal length greater than or equal to 5.8 goes to the right node. It might turn out, when I do this, that 49 setosa samples, 21 versicolor samples, and 3 virginica samples come to the left node. Notice that this node is labelled setosa; why? Because the class from which most of its data comes is setosa, so if I stop there at the end of my decision tree process, I will say the sample is most likely setosa. The right node is labelled virginica, because it has 0 setosa, 29 versicolor, and 47 virginica samples, so the majority of its samples are virginica, and I would call a sample landing there virginica. I won't go through the computations again; if you do the same computations as before, you get a split index of 0.1915, and you will notice that the previous value, 0.324, was different from this.

The key thing I want you to notice is the mechanics, which are very simple: all that is happening is that your original data set is being split into multiple data sets. In the first case the original data set was split into two data sets in a particular fashion, and with the second choice the same data set is split into two different data sets. So the way the decision tree works is by splitting the data set across multiple nodes, and in each node, depending on which class is preponderant in the data, that is how I classify; that's it. If I have more nodes in the tree, then starting from here each of these data sets will be split further, so you can think of the original data as being whittled down into smaller and smaller data sets at each of these nodes, and the way those smaller data sets are produced is through the choices we make: in this case, if we set sepal length less than 5.8, we go to the training data, take all instances where sepal length is less than 5.8 and combine them into one node, and the rest into the other node.

Now, after going through all of these possibilities, we might identify that petal length is the best possibility. If you go back and look, with petal length as the choice I have already classified all the setosa samples, so I don't need to explore that part of the tree any more; there is nothing to do, it is already a pure set. The only problem is the other node, so I want to start exploring that part of the tree more. Notice what we are doing: we are starting from the node labelled versicolor, with 0 setosa, 50 versicolor, and 50 virginica. Now, again, I have multiple choices; I could even say I want to use petal length again, and there is nothing stopping me from that.
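That search over features and candidate thresholds is exactly what the packages automate. Below is a rough, brute-force sketch of it, applied to the remaining versicolor and virginica samples; with scikit-learn's copy of the data it should pick petal width with a threshold near 1.75, which matches the split discussed next (the exact threshold and any ties depend on the data).

```python
from sklearn.datasets import load_iris

def gini_impurity(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts) if total else 0.0

def weighted_child_impurity(data, feature, threshold, label="species"):
    left = data[data[feature] < threshold]
    right = data[data[feature] >= threshold]
    return (len(left) / len(data) * gini_impurity(left[label].value_counts().tolist())
            + len(right) / len(data) * gini_impurity(right[label].value_counts().tolist()))

iris = load_iris(as_frame=True)
df = iris.frame
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
node = df[df["species"] != "setosa"]          # the impure node left after the first split

features = [c for c in node.columns if c.endswith("(cm)")]
node_impurity = gini_impurity(node["species"].value_counts().tolist())
best = None
for feature in features:
    values = sorted(node[feature].unique())
    # Candidate thresholds: midpoints between consecutive observed values
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        split_index = node_impurity - weighted_child_impurity(node, feature, threshold)
        if best is None or split_index > best[0]:
            best = (split_index, feature, threshold)

print(best)   # expected: roughly (0.39, 'petal width (cm)', 1.75)
```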
So again I have these four choices, and again I have to compare the chosen feature with some value. Let's say I chose petal width, with the condition "less than 1.8". Of the 149 samples, 49 have already been classified, so they are no longer under consideration; I start with the other 100 samples, which are 50 versicolor and 50 virginica, use petal width, and say: less than 1.8 goes to the left node. I collect all the data satisfying this and see how many come here: about 54 of the 100 come to the left node and about 46 go to the right node, and 54 plus 46 is 100. Now notice what I call these nodes: the left node I call versicolor, because most of the samples in it are versicolor, and the right node I call virginica, because most of its samples are virginica. Ultimately, you do all of this and you come up with this kind of decision tree: the root decision is versicolor, one branch ends at setosa, and below the other branch you get versicolor and virginica.

Now, when I get a new data point, how will I use this decision tree? I first take the data point and check its petal length. If the petal length is less than 2.4, I say this new data point is setosa, and the problem is done. If it is greater than or equal to 2.4, then I look at the petal width of the new data point: if it is less than 1.8 I say it is versicolor, and if it is greater than or equal to 1.8 I say it is virginica, knowing fully well that I could have errors at these leaves. For example, about 5 samples that are actually virginica are being called versicolor even in the training data; for new data we don't know, and that is where we use a test set to check the decision tree and see how often it gives the correct answer. So there is always the possibility of some misclassification at the end; the data might not be such that you can cleanly classify every sample all the time.

If you are not happy with this and you say, this node is still not pure, it has 49 of one class and 5 of another, can I break it down further and separate those 5, then you have to look at the other features and see whether they allow you to do that. In some cases they will, and in some cases the data is such that you cannot. And, as I mentioned before, even in cases where the data allows it, you don't want to keep building a tree that is very, very complex: you don't want a very deep tree which completely learns the training data set and has no generalization capability. So you might want to stop at some point and say, I am willing to accept a certain amount of error on the training set so that I have better generalization on test data.
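In practice you would let a library build exactly this kind of tree. A minimal sketch with scikit-learn: a depth-2 tree trained with the Gini criterion on the iris data recovers essentially the tree described above (setosa is split off first on a petal measurement, and the remaining node is split on petal width near 1.8), though the exact feature and thresholds it prints may differ slightly from the 2.4 and 1.8 used in the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned decision rules
print(export_text(tree, feature_names=iris.feature_names))

# Classify a new flower: [sepal length, sepal width, petal length, petal width] in cm
new_point = [[6.0, 2.9, 4.5, 1.5]]
print(iris.target_names[tree.predict(new_point)[0]])   # expected: versicolor
```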
Now that we have seen decision trees, random forests are a very simple idea. The key observation is the following: with decision trees, if there are minor changes in the data or minor errors in the data, the decision tree that comes out can be quite different. Decision trees are generally not very robust to errors in data, because you are choosing particular split values, and if there are errors, those values might not partition the data well, and so on. In some sense, what you want is that if you are given a data set and you build a decision tree, and then the data changes a little bit, you do not want to see major changes in the decision tree; you don't want the tree to change completely, which is possible because of errors in the data. To avoid that and give the method some robustness, we come up with the idea of random forests, and as the name suggests, a random forest is a collection of decision trees.

You might ask: how do I make multiple decision trees from the same data set? The way random forests work is the following. You have all of this data, and from it you make multiple data sets. How? Sometimes you sub-select only a portion of the data points and say, I am going to build a decision tree on this portion of the data. In other cases you sub-select only certain features: in the previous example we had four features, and you might drop one feature and call the remaining data your data set. So you can sub-select data points, you can sub-select features, and so on, and from one original data set you generate multiple data sets which are modified forms of the original, obtained by dropping some data points or dropping some features. Now, using the technique I discussed, for each of these data sets you can build a decision tree; so if I make ten different data sets from the original data, I can build ten decision trees, and those ten decision trees together form a random forest.

Then you might ask: I now have ten decision trees, so which tree's solution do I use? In a random forest, if you have ten decision trees, you run the new data point through all of them, get a solution from each, and whatever the majority decision of all these trees is, that is the solution of the random forest. For example, you might build one tree from a subset of the data where you keep only sepal length, sepal width, and petal length; you give it a new test data point and this tree predicts setosa. You might have built another tree with only sepal length and petal length; when you get the new data point you pick out just those attributes, run it through this tree, and say that it also predicts setosa. A third tree might use petal length and petal width, and in any of these cases you could also have sub-selected the data points; you send the new test data point through, and it also says setosa. Now there are three trees and all of them said setosa, so the solution is setosa. If two of them had said setosa and one virginica, the solution is still setosa; that is what is meant by majority. If one says setosa, one says virginica, and one says versicolor, there is a problem and you have to break the tie somehow to give a solution, but in general, when you have multiple trees, you would hope that there is some consensus among their solutions, and you take that as the solution of the random forest.
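A minimal sketch of this with scikit-learn is below. One small difference worth flagging: scikit-learn's RandomForestClassifier draws a bootstrap sample of the data points for each tree and then considers a random subset of the features at every split (max_features), rather than dropping features once per tree as in the simplified description above, but the spirit, many randomized trees combined by majority vote, is the same.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(
    n_estimators=10,       # ten trees, as in the example above
    bootstrap=True,        # each tree is trained on a resampled copy of the data points
    max_features="sqrt",   # each split considers only a random subset of the features
    random_state=0,
)
forest.fit(iris.data, iris.target)

new_point = [[5.0, 3.4, 1.5, 0.2]]                 # a flower with small petals
# Each individual tree votes; the forest reports the majority decision
votes = [iris.target_names[int(t.predict(new_point)[0])] for t in forest.estimators_]
print(votes)
print(iris.target_names[forest.predict(new_point)[0]])   # expected: setosa
```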
By doing this, the stochasticity or the problems with the data can be addressed to a large extent in many problems, basically by bagging the trees: there are multiple trees from which you are getting the result, and each of these trees is produced from a modified data set obtained by dropping some data points or dropping some features, and so on. So you try to make yourself immune to fluctuations in the data through the multiple data sets that you generate, and when all of the trees overwhelmingly say that something is the result, it is more likely to be correct than if you depend on just one decision tree. That is the idea of random forests.

Now, all of this is for you to understand how these things work, but when you use one of the Python packages, they will do most of this work for you. All you need is to be able to look at a decision tree and understand what that tree means, and to know the difference between a random forest and a decision tree, and so on. So hopefully this has been a useful session for you to understand decision trees and random forests. Thank you.