Transcript for:
Understanding Decision Trees for Classification

Hello people from the future, welcome to Normalized Nerd! Today I will explain the concept behind decision trees, to be more precise, classification using decision trees. As you know me, obviously I will discuss the intuition and the underlying math behind training a decision tree, and this video will contain a lot of visualizations, so I hope you are super excited. If you want to see more videos like this and stay connected with me, please subscribe to this channel and join our Discord server; the link will be in the description. So let's get started.

First of all, we need some data, because no algorithm can learn without any previous experience. This data set contains two features, x0 and x1. I have plotted x0 along the horizontal axis and x1 along the vertical axis. There are two classes, green and red. If you look carefully, the classes are not linearly separable, that is, you can't just draw a line to separate them. This is intentional, because I want to show you the true power of decision trees.

So what is a decision tree? Generally speaking, it's a binary tree that recursively splits the data set until we are left with pure leaf nodes, that is, nodes containing only one type of class. Don't worry if you didn't get what I just said; by the end of this video you will understand every bit of it.

The best way to start is to look at a trained decision tree, so here's the decision tree classifier for this data set. Notice that there are two kinds of nodes: decision nodes and leaf nodes. The former contains a condition to split the data, and the latter helps us decide the class of a new data point; how, we will see later. I want to walk you through the tree first.

Let's have the whole data at our root node. Based on the condition of the root node, we are going to split it. The condition is whether the x0 feature is less than or equal to -12. In the plot, the splitting condition looks like this vertical line. Every point that lies on the line or to its left satisfies the condition; we place those points in the left child of the root node, and the points that don't meet the condition go to the right. We follow this rule for the remaining nodes as well. Notice that the left child contains only red points, so it's a pure node and we don't need to split it further.

Let's come to the right child. The condition is whether x0 is less than or equal to 9; here is the splitting line and the resulting child nodes. In this case the right child is pure, so we further split the left one. Following the left child, this time the condition is x1 less than or equal to 9, so obviously here we have a horizontal splitting line. See that in this case both child nodes are pure, so we don't need to split any further.

Now the natural question is: how can we use this tree to classify a new data point? Suppose the new data point has x0 and x1 equal to 15 and 7 respectively. First we check the condition at the root node: 15 is greater than -12, so it doesn't satisfy the condition and we move to the right child. Here it fails again, so finally we arrive at a leaf node. Luckily, all the points in this node are red, so we can easily classify our new point as red. But this might not be the case for more complex data; we might not have 100 percent pure leaf nodes. In those cases we perform a majority vote and assign the majority class to the test point. Let me show you how our model has divided the feature space into regions based on the splitting criteria. Okay, so that is how we use a binary decision tree for classification.
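To make the walk-through concrete, here is a minimal Python sketch of the tree we just traced, written as plain nested if statements. This is not the from-scratch code coming in the next video, just an illustration: the thresholds (-12, 9, 9) and the two red leaves come from the walk-through above, but the video doesn't say which class the two leaves under the final x1 split hold, so those labels are assumptions.

```python
# Sketch of the trained tree from the walk-through as nested if statements.
# Thresholds (-12, 9, 9) and the two "red" leaves are from the video;
# the labels of the two leaves under the x1 <= 9 split are NOT stated,
# so "green"/"red" there are assumptions made only for illustration.

def predict(x0, x1):
    if x0 <= -12:               # root decision node (vertical split)
        return "red"            # pure left leaf: only red points
    else:
        if x0 <= 9:             # second decision node (vertical split)
            if x1 <= 9:         # third decision node (horizontal split)
                return "green"  # assumed label of this pure leaf
            else:
                return "red"    # assumed label of this pure leaf
        else:
            return "red"        # pure right leaf (where our test point lands)

# The new data point from the video: x0 = 15, x1 = 7
print(predict(15, 7))  # -> "red"
```

For more complex data with impure leaves, you would store the class counts at each leaf and return the majority class instead of a hard-coded label.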
At this point, some of you might be thinking: this is just a bunch of nested if statements, so why do we consider it machine learning? Well, yes, it is just a bunch of if statements, but we need to know the correct conditions, right? There are many, many possible splitting conditions; here I am showing a couple of them. Our model needs to learn which features to use and the corresponding threshold values to optimally split the data, and this is why it is machine learning.

But how does our model decide the optimal splits? For that, let's start at the root. At this stage we have the whole data with us. We are going to compare two splits: the first one is x1 less than or equal to 4, and the corresponding child nodes look like this; the second split condition is x0 less than or equal to -12. Now, which split would you choose? Well, if you remember our goal, which is to get pure leaf nodes, we must go for the second split, right? Because in this case we have successfully produced a left child with red points only. But how can our computer do the same? The answer lies in information theory; more precisely, the model will choose the split that maximizes the information gain.

To calculate information gain, we first need to understand the information contained in a state. Let's look at the root state. Here the number of red points and green points is the same. Imagine we are predicting the class of a randomly picked point: only half of the time will we be correct, which means this state has the highest impurity, or uncertainty. The way to quantify this is entropy, the measure of information contained in a state: H = −Σᵢ pᵢ log₂(pᵢ), where pᵢ denotes the probability of the i-th class. If entropy is high, then we are very uncertain about the randomly picked point and we need more bits to describe this state. Here both red and green have equal probability, 0.5. If we just plug in the values, we find that the entropy of the root state is 1, which is, by the way, the highest possible value of entropy; please note that I am using log base 2 here. Using this formula we can calculate the entropy of the remaining four states. You can also notice that the state with four red points gives the minimum entropy, zero, and that's why we call it a pure node.

Now, to find the information gain corresponding to a split, we subtract the combined (weighted) entropy of the child nodes from the entropy of the parent node. Notice that I have written weights for the child nodes; that's nothing but the relative size of a child with respect to the parent. So let's calculate the information gain. Clearly the second split gives the greater information gain, and that's why we would choose it. (You can see these calculations worked out in the Python sketches below.)

I have just compared two splits to show you how this works; in reality the model compares every possible split and takes the one that maximizes information gain. So the model traverses every feature and every feature value to search for the best feature and the corresponding threshold. It will be clearer once I show you the code; yes, in the next video we will see how to code a decision tree from scratch.

Here I want to tell you a very important point: the decision tree is a greedy algorithm. It selects the current best split, the one that maximizes information gain, and it does not backtrack and change a previous split, so all the following splits depend on the current one. This does not guarantee that we will get the most optimal set of splits, but greedy search makes our training a lot faster, and it works really well despite its simplicity.
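As a quick sanity check of the numbers above, here is a small Python sketch that computes entropy and information gain from class counts. The root with equal red and green points giving entropy 1, and a pure node giving entropy 0, are from the video; the specific point counts used below are made up for illustration, since the transcript doesn't give them.

```python
import math

def entropy(counts):
    """Entropy H = -sum(p_i * log2(p_i)) of a node, given its class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(parent_counts, children_counts):
    """IG = H(parent) - sum_i w_i * H(child_i), where w_i = |child_i| / |parent|."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Root state: equal red and green points -> entropy 1 (counts are illustrative).
print(entropy([10, 10]))   # 1.0, the maximum for two classes
print(entropy([4, 0]))     # 0.0, a pure node (e.g. four red points)

# Comparing two candidate splits of the same 20-point root (made-up counts):
print(information_gain([10, 10], [[6, 4], [4, 6]]))   # small gain: both children mixed
print(information_gain([10, 10], [[6, 0], [4, 10]]))  # larger gain: one pure child
```

The split with one pure child wins, which mirrors why the x0 <= -12 split beats x1 <= 4 in the comparison above.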
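The transcript also says the model scans every feature and every feature value and keeps the split with the highest information gain. A minimal brute-force version of that search might look like the following; the helper names (class_counts, best_split) are mine, not from the video, and it reuses entropy() and information_gain() from the sketch above.

```python
import numpy as np

def class_counts(y, classes):
    """Count how many samples of each class are in y."""
    return [int(np.sum(y == c)) for c in classes]

def best_split(X, y):
    """Scan every feature and every observed value as a threshold and
    return the (feature, threshold, gain) triple with the highest information gain.
    Reuses entropy()/information_gain() from the previous sketch."""
    classes = np.unique(y)
    parent = class_counts(y, classes)
    best = (None, None, -1.0)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue  # not a real split
            gain = information_gain(parent, [class_counts(left, classes),
                                             class_counts(right, classes)])
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best

# Tiny illustrative usage (made-up points with two features, like x0 and x1):
X = np.array([[-15, 2], [-14, 8], [3, 4], [5, 12], [12, 6], [16, 3]], dtype=float)
y = np.array(["red", "red", "green", "red", "red", "red"])
print(best_split(X, y))
```

The greedy part is that a real tree-building loop calls a search like this once per node and never revisits earlier choices.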
Coming back to our tree: now it's going to find the best split on the right child, and so on. After this process is done, we get our final model. Notice how at every level the impurity of the states decreases. Okay, so by now you should have a solid understanding of decision tree classifiers. In the next video I will show you how to code such a tree from scratch that you can actually use on a real-world data set, and I will also show you another thing called the Gini index. So please subscribe and turn on the bell notification. I will see you in the next video. Thanks for watching!