Understanding Decision Trees in Machine Learning

Welcome to the decision tree tutorial. My name is Richard Kirschner with Simplilearn, that's www.simplilearn.com.

The decision tree, one of the many powerful tools in the machine learning library, begins with a problem: I think I have to buy a car. In asking that, you want to know how to decide which one to buy, and you start asking questions. Is the mileage greater than 20? Is the price less than 15? Will it be sufficient for six people? Does it have enough airbags? Anti-lock brakes? All these questions come up, and as we feed all this data in we make a decision, and that decision comes up: oh hey, this seems like a good idea, here's a car. So we're going to explore this decision process using a decision tree, maybe not for buying a car, but for processing data.

What's in it for you? Let's start by finding out what machine learning is and why we even want to know about it for processing our data. We'll go into the three basic types of machine learning and the problems machine learning is used to solve. Then we'll get into what a decision tree is, what problems a decision tree solves, and what the advantages and disadvantages of using one are. Then we want to dig a little deeper into the mechanics: how does the decision tree work? And then we'll do a use case, loan repayment prediction, where we put together some Python code and show you the basic Python code for generating a decision tree.

What is machine learning? There are so many different ways to describe and illustrate machine learning in today's world. We're going to take a graphic here about making decisions, or trying to understand what's going on. Really, underlying machine learning is that people wish they were smarter, wish they could understand the world better. So you can see a guy here who's saying, hey, how can I understand the world better? And someone comes up and says, let's use artificial intelligence.
Machine learning is a part of artificial intelligence, and now he gets a big smile on his face, because he has artificial intelligence to help him make his decisions, and he can think in new ways, so this brings in new ideas.

So what is machine learning? This is a wonderful graphic: you can see we have learn, predict, decide. These are the three most basic premises of machine learning. In learning, we can describe the data in new ways and learn new aspects of what we're looking at. Then we can use that to predict things and to make decisions, so maybe it's something that's never happened before, but we can make a good guess whether it's going to be a good investment or not. It also helps us categorize things so we can remember them better and more easily pull them out of the catalog. We can analyze data in new ways we never thought possible. And then of course there's the very large and growing field of recognition: facial recognition, driver recognition, automated car recognition. All of these are part of machine learning.

Going back to our guy here, who's in his ordinary system and would like to be smarter and make better choices: machine learning is an application of artificial intelligence wherein the system gets the ability to automatically learn and improve based on experience. This is exciting, because your ordinary guy now has another form of information coming in, and artificial intelligence helps him see things he never saw or track things he can't track. Instead of having to read all the news feeds, he can have an artificial intelligence sort them out, so he's only looking at the information he needs to make a choice. And with all those machine learning tools behind him, he's now making smarter choices with less work.

Types of machine learning. Let's break it into three primary types of learning. First is supervised learning, where you already have the data and the answers.
So if you worked at a bank, you'd already have a list of all the previous loans, who defaulted on them, and who made good payments on them. You then train your machine learning tool on that, and it lets you predict whether the next person is going to be able to make their loan payments or not. That's the first category, where we already know the answers.

The next one is where you don't know the answers; you just have a lot of information coming in. Unsupervised learning allows you to group like information together. So if you're analyzing photos, it might group all the images of trees together and all the images of houses together, without ever knowing what a house or a tree is.

Which leads us to the third type of machine learning, reinforcement learning. Unlike supervised or unsupervised learning, you don't have the data prior to starting. You get the data one line at a time, and depending on whether you make a good choice or a bad choice, the machine learning tool has to adjust accordingly, so you get plus or minus feedback. You can liken this to the way a human learns: we experience life one minute at a time and we learn from it, and either the memory is good or we learn to avoid something.

Problems in machine learning. To understand where the decision tree fits into our machine learning tools, we have to understand the basics of some machine learning problems. Three of the primary ones are these. First, classification: problems with categorical solutions, like yes or no, true or false, one or zero. This might be, does it belong to a particular group, yes or no? Then we have regression: problems where a continuous value needs to be predicted, like product prices or profit. You can see here a very simple linear graph; you can guess what the next value is based on the first four, since it kind of follows a straight line going up. And clustering: problems where the data needs to be organized to find specific patterns, like in
the case of product recommendation, where they group all the different products that people just like you viewed on a shopping site and say, people who bought this also bought this.

The most common use for the decision tree is classification: is it red or is it not, is it a fruit or is it a vegetable, yes or no, true or false, left or right, zero or one. So when we talk about classification, we're going to look at the basic machine learning tools. These are the four main tools used in classification: naive Bayes, logistic regression, decision tree, and random forest. The first two are for simpler data, so if your data is not very complex, you can usually use them to get a fairly good representation by drawing a line or a curve through the data. They work wonderfully on a lot of problems, but as things get more complicated the decision tree comes in, and then if you have a very large amount of data you start getting into the random forest. The decision tree is actually a part of the random forest, but today we're just going to focus on the decision tree.

What is a decision tree? Let's go through a very simple example before we dig in deep. A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction. Let's start with a simple question: how do we identify a random vegetable from a shopping bag? We have this group of vegetables here, and we can start off by asking a simple question: is it red? If it's not, then it's going to be the purple one to the left, probably an eggplant. If it's true, it's going to be one of the red ones. Is the diameter greater than 2? If false, it's going to be what looks to be a red chili, and if true, it's going to be a bell pepper from the capsicum family, so it's a capsicum.

Problems that a decision tree can solve. Let's look at the two different categories the decision tree can be used on. It can be used on classification, the true or false, yes or
no, and it can be used on regression, where we figure out what the next value is in a series of numbers or a group of data. In classification, the classification tree determines a set of logical if-then conditions to classify problems, for example discriminating between three types of flowers based on certain features. In regression, a regression tree is used when the target variable is numerical or continuous in nature. We fit the regression model to the target variable using each of the independent variables, and each split is made based on the sum of squared errors.

Before we dig deeper into the mechanics of the decision tree, let's take a look at the advantages of using one, and we'll also take a glimpse at the disadvantages. The first thing you'll notice is that it's simple to understand, interpret, and visualize. It really shines here, because you can see exactly what's going on in a decision tree. Little effort is required for data preparation, so you don't have to do special scaling; there are a lot of things you don't have to worry about when using a decision tree. It can handle both numerical and categorical data, as we discovered earlier. And non-linear parameters don't affect its performance, so even if the data doesn't fit an easy curved graph, you can still use it to create an effective decision or prediction.

If we're going to look at the advantages of a decision tree, we also need to understand its disadvantages. The first disadvantage is overfitting. Overfitting occurs when the algorithm captures noise in the data, which means you're solving for one specific instance instead of finding a general solution for all the data. Next is high variance: the model can get unstable due to small variations in the data. And last is the low-bias tree: a highly complicated decision tree tends to have low bias, fitting the training data so closely that it becomes difficult for the model to work with new data.

Decision tree important terms. Before we dive in further, we need to look at some basic terms. We need to have some definitions to go with
our decision tree and the different parts we're going to be using. We'll start with entropy. Entropy is a measure of randomness or unpredictability in the data set. For example, we have a group of animals in this picture. There are four different kinds of animals, and this data set is considered to have high entropy: you really can't pick out what kind of animal it is when you're looking at the four kinds as one big clump of entities.

So as we start splitting it into subgroups, we come to our second definition, which is information gain. Information gain is a measure of the decrease in entropy after the data set is split. In this case, based on the color yellow, we've split one group of animals on one side as true and those that aren't yellow as false. As we continue down the yellow side we split based on height, true or false, against 10, and on the other side on whether height is less than 10, true or false. As you see, as we split, the entropy gets less and less, and so our information gain is simply the entropy E1 at the top and how much it has dropped to E2 at the bottom. We'll look at the deeper math, although you really don't need to know a huge amount of math when you actually do the programming in Python, because it'll do it for you, but we'll look at the actual math of how they compute entropy.

Finally, we get to the different parts of our tree. First is what they call the leaf node. The leaf node carries the classification or the decision, so it's the final end at the bottom. The decision node has two or more branches; this is where we're breaking the group up into different parts. And finally you have the root node: the topmost decision node is known as the root node.

How does a decision tree work? "I wonder what kind of animals I'll get in the jungle today." Maybe you're the hunter with the gun, or if you're more into photography, you're a photographer with a camera. So let's look at this group of animals and try to classify the different types of animals based on their features
using a decision tree. So the problem statement is to classify the different types of animals based on their features using a decision tree. The data set is looking quite messy, and the entropy is high in this case. So let's look at a training data set. We're looking at color, we're looking at height, and then we have our different animals: our elephants, our giraffes, our monkeys, and our tigers, and they're of different colors and shapes. Let's see what that looks like.

How do we split the data? We have to frame the conditions that split the data in such a way that the information gain is the highest. Note that gain is a measure of the decrease in entropy after splitting. The formula for entropy is a sum, that's what this symbol is, the one that looks kind of like a funky E, for i equals 1 to k, where k is the number of different animals in there. The p value of i is the percentage of that animal, and it's multiplied by the log base 2 of that same percentage; the whole sum is then negated, since the log of a fraction is negative.

Let's try to calculate the entropy for the current data set and take a look at what that looks like. Don't be afraid of the math; you don't really have to memorize it. Just be aware that it's there and that this is what's going on in the background. We have three giraffes, two tigers, one monkey, and two elephants, a total of eight animals gathered. If we plug that into the formula, we get an entropy that equals the negative of: three over eight, so three giraffes out of a total of eight, times the log (usually they use base 2 on the log) base 2 of 3 over 8; plus, for the elephants, 2 over 8, two elephants out of a total of 8, times log base 2 of 2 over 8; plus one monkey out of a total of eight, times log base 2 of 1 over 8; plus 2 over 8 for the tigers, times log base 2 of 2 over 8. If we plug that into our calculator, since I obviously can't do logs in my head, we get an entropy of about 1.906 with base-2 logs (the 0.571 shown here is roughly what you get with base-10 logs instead). The program will actually calculate the entropy of the
data set similarly after every split to calculate the gain. Now, we're not going to go through each set one at a time to see what those numbers are; we just want you to be aware that this is the formula, the mathematics, behind it. Gain can be calculated by finding the difference between the entropy values before and after a split.

Now we will try to choose the condition that gives us the highest gain. We do that by splitting the data using each condition and checking the gain we get out of each of them. The condition that gives us the highest gain will be used to make the first split. Can you guess what that first split will be just by looking at this image? As a human, it's probably pretty easy to split it. Let's see if you're right: if you guessed the color yellow, you're correct. Let's say the condition that gives us the maximum gain is yellow, so we will split the data based on the color yellow. If it's true, that group of animals goes to the left; if it's false, it goes to the right. The entropy after the split has decreased considerably. However, we still need some splitting on both branches to attain an entropy value equal to zero, so we decide to split both nodes using height as the condition. Since every branch now contains a single label type, we can say that the entropy in this case has reached its lowest value. Here you see we have the giraffes, the tigers, the monkey, and the elephants all separated into their own groups. This tree can now predict all the classes of animals present in the data set with a hundred percent accuracy. That was easy.

Use case: loan repayment prediction. Let's get into my favorite part, open up some Python, and see what the programming code and the scripting look like. Here we're going to want to do a prediction, and we start with this individual who wants to find out how good his customers are going to be, whether they're going to repay their loans to his bank or not. From that we want to generate a problem statement: to predict if a
customer will repay the loan amount or not. And we're going to be using the decision tree algorithm in Python. Let's see what that looks like and dive into the code.

In our first few steps of implementation, we're going to start by importing the necessary packages that we need from Python, and we're going to load up our data and take a look at what it looks like. The first thing I need is something to edit my Python and run it in, so let's flip on over. Here I'm using the Anaconda Jupyter Notebook. You can use any Python IDE you like, but I find the Jupyter Notebook really nice for doing things on the fly. Let's go ahead and paste that code in at the beginning. Before we start, let's talk a little bit about what we're bringing in, and then we have to make a couple of changes as we go through this first part of the import.

The first thing we bring in is numpy as np. That's very standard when we're dealing with mathematics, especially with very complicated machine learning tools; you almost always see numpy come in. It's short for numerical Python, and it has your mathematics in there. In this case we actually could take it out, but generally you'll need it for most of the things you work with. Then we're going to use pandas as pd. That's also standard. pandas is a data frame setup, and you can liken it to taking your basic data and storing it in a way that looks like an Excel spreadsheet. So as we come back to this, when you see np or pd, those are very standard uses, and you'll know pd is pandas. I'll show you a little bit more when we explore the data in just a minute.

Then we're going to need to split the data, so I'm going to bring in train_test_split, and this is coming from the sklearn cross_validation package. In just a minute we're going to change that, and we'll go over that too. And then there's also the sklearn.tree import of the decision tree
classifier; that's the actual tool we're using. Remember I told you not to be afraid of the mathematics, that it's going to be done for you? Well, the decision tree classifier has all that mathematics in there, so you don't have to figure it out again. Then we have the sklearn.metrics import of accuracy_score. We need to score our setup; that's the whole reason we're splitting the data between training and testing. And finally we still have the sklearn import of tree; that's just the basic tree module that the decision tree classifier lives in.

Finally, we're going to load our data down here. I'm going to run this, and we're going to get two things: one, an error, and two, a warning. Let's see what that looks like. The first thing we have is an error. Why is this error here? Well, it's looking at this line that says to read a file, and when this was written, the path in there was where the person who wrote it stored the file. So let's go ahead and fix that. I'm going to put in my file path here; I'm just going to call it full_file_name. You'll see it's on my C drive, and it's a very lengthy path to where I stored the data_2.csv file. Don't worry too much about the full path, because on your computer it'll be different. The data_2.csv file was generated by Simplilearn; if you want a copy, you can comment down below and request it here on YouTube. And since I've given it the name full_file_name, I'm going to change it down here to full_file_name too. So let's run it now and see what happens.

And we get a warning. When you're coding, understanding the different warnings and errors that come up is probably the hardest lesson to learn, so let's take a look at this and use it as an opportunity to understand what's going on. If you read the warning, it says that cross_validation is deprecated. So it's a warning that it's being removed, and it's going to
be moved in favor of model_selection. So if we go up here, we have sklearn.cross_validation, and if you research this and go to the sklearn site, you'll find that you can just swap model_selection right in there. When I do that and run it again, the warning goes away. What they've done is they've had two different developers develop it in two different branches, and then they decided to keep one of those and eventually get rid of the other. That's all that is, and it's very easy and quick to fix.

Before we go any further, I went ahead and opened up the data from this file. Remember the data file we just loaded, data_2.csv? Let's talk a little more about that and see what it looks like, both as a text file, because it's a comma-separated values file, and in a spreadsheet. This is what it looks like as a basic text file. You can see that at the top they've created a header, and it's got one, two, three, four, five columns, and each column has data in it. Let me flip this over, because we're also going to look at it in an actual spreadsheet. Here I've opened it up in OpenOffice Calc, which is pretty much the same as Excel, and zoomed in. You can see we've got our columns and our rows of data, a little easier to read. We have a result, yes, yes, no, and we have initial payment, last payment, credit score, and house number. If we scroll way down, we'll see that this occupies a thousand and one lines, with the first one being the column header and then 1,000 lines of data.

As a programmer, if I'm looking at a small amount of data, I usually start by pulling it up in different tools so I can see what I'm working with. With larger data you won't have that option; it'll just be too large. So you need to either bring in a small amount that you can look at, like we're doing right now, or start looking at it through the Python code. So let's go ahead and
move on and take the next couple of steps to explore the data using Python. Let's see what it looks like in Python to print the length and the shape of the data. Let's start by printing the length of the data set. We can use the simple len function from Python, and when I run this you'll see that it's a thousand long. That's what we expected: there are a thousand lines of data in there once you subtract the column header. This is one of the nice things about loading balance_data with the pandas read_csv: it treats row zero of the file as the header, so it automatically takes that row off and shows the data separately. It does a good job sorting that out for us.

Then we can use a different function, so let's take a look at that. Again we're going to use the tools in pandas. Since balance_data was loaded as a pandas data frame, we can do a shape on it, so let's run the shape and see what that looks like. What's nice about shape is that not only does it give me the length of the data, a thousand lines, it also tells me there are five columns. So when we were looking at the data, we had five columns of data.

Let's take one more step to explore the data using Python. Now that we've taken a look at the length and the shape, let's use the pandas head method, another beautiful thing we can use on the data set. So let's put that on our sheet here. We have a print of "Dataset" and then balance_data.head. The head has its own print feature in there, and we went ahead and gave it a label with a simple print statement of "Dataset". We run that, and let's take a closer look; let me zoom in here. There we go. pandas does such a wonderful job of making this a very clean, readable data set. You can look at the data and the column headers, and when you call head it prints the first five lines of the data. And we
always start with zero, so for five lines we have zero, one, two, three, four instead of one, two, three, four, five. That's standard in scripting and programming: you want to start with the zero position. And that is what head does: it pulls the first five rows of data and puts them in a nice format that you can look at. It's a very powerful tool for viewing the data. Instead of having to flip over and open up an Excel spreadsheet or OpenOffice Calc, or trying to look at a Word doc where it's all scrunched together and hard to read, you now get a nice open view of what you're working with. We're working with a shape of a thousand long and five wide, so we have five columns, and when we do the head you can actually see what the data looks like: the initial payment, last payment, credit score, and house number.

Now that we've explored the data, let's start digging into the decision tree. In our next step we're going to train and build our decision tree, and to do that we first need to separate the data out. We're going to separate it into two groups, so that we have something to train the model with and then some data on the side to test it and see how good our model is. Remember, with any machine learning you always want to have some kind of test set to weigh the model against, so you know how good it is when you distribute it.

Let's break this code down and look at it in pieces. First we have our X and Y. Where do X and Y come from? Well, X is going to be our data and Y is going to be the answer, or the target; you can think of them as source and target. In this case we're using X and Y to denote the data going in and the value we're actually trying to guess. To separate them, we can simply put in X equals balance_data.values. The first part inside the brackets means we're going to select all the lines in the data set, so it's all the rows, and the second part, 1:5, says we're only going to look at columns one through four, since the end of a Python slice isn't included.
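
A minimal sketch of that X and Y selection, using a tiny made-up table in place of the real file. The column names and numbers below are assumptions for illustration only; just the slicing pattern follows the walkthrough. Note that a slice written 1:5 stops before index 5, so it picks up columns 1, 2, 3, and 4.

```python
import pandas as pd

# Hypothetical stand-in for the loan file: column 0 is the yes/no label,
# columns 1-4 are the features. Names and values are invented.
balance_data = pd.DataFrame(
    [
        ["yes", 201, 10018, 250, 3046],
        ["no", 205, 10016, 395, 3044],
        ["yes", 257, 10129, 109, 3251],
    ],
    columns=["result", "initial_payment", "last_payment",
             "credit_score", "house_number"],
)

values = balance_data.values

# All rows, columns 1 through 4 (the stop index 5 is excluded).
X = values[:, 1:5]
# All rows, column 0 only: the answer we want to predict.
Y = values[:, 0]

print(X.shape)
print(Y)
```

Here X ends up with one row per loan and four feature columns, while Y holds just the label for each row.
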
Remember, we always start with zero. Column zero is the yes or no, and that's whether the loan was repaid or went into default, so we want to start with column one. If we go back up here, that's the initial payment, and the slice goes all the way through the house number, which is column four; it's written 1:5 because the end of the slice is left out. Well, if we can slice out those columns for X, we can do the same thing for Y, which is the answers, and we set that equal to just column zero, for all rows. So now we've divided this into two different data sets, one with the data going in and one with the answers.

Next we need to split the data, and here you'll see that we have it split into four different parts: X_train, X_test, y_train, and y_test. Simply put, we have X going in where we're going to train, and we have to know the answer to train with. Then we have X_test, where we're going to test, and we have to know in the end what the y was supposed to be. That's where the train_test_split we loaded earlier from the modules comes in; it does it all for us. You can see they set test_size equal to 0.3, so roughly 30 percent will be used for the test, and then we use a random_state, which is a seed: the rows that go into the test set are picked at random, but the same ones are picked every time you run it.

Then finally we get to actually build our decision tree. They've called it clf_entropy here; that's the actual decision tree classifier, and they've added a couple of parameters that we'll explore in just a minute. Then we need to fit the data to it. We take the clf_entropy that we created and we fit X_train, and since we know the answers for X_train, which are y_train, we put those in too. Let's go ahead and run this. What most of these sklearn modules do, when you set up the variable (in this case we set clf_entropy to a decision tree classifier), is automatically print out what's in it. There are a lot of parameters you can play with in here.
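
The split-and-fit step can be sketched roughly as follows. The data is synthetic and the labeling rule is invented, so this is not the tutorial's data set. The 0.3 test size and the entropy criterion, random state of 100, depth limit of 3, and leaf size of 5 are the values mentioned in the walkthrough; the seed passed to train_test_split is an assumption, since the walkthrough does not state it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 1,000 rows with 4 numeric features (invented).
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(1000, 4))
# Invented labeling rule, just so the tree has a pattern to learn.
Y = np.where(X[:, 0] + X[:, 2] > 1000, "yes", "no")

# Hold out 30% of the rows for testing. random_state is a seed, so the
# same rows are held out on every run (the value 100 is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100
)

# Build the classifier with the settings described in the walkthrough,
# then train it on the training rows and their known answers.
clf_entropy = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5
)
clf_entropy.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(clf_entropy.get_depth())
```

With 1,000 rows and a test size of 0.3, 700 rows go to training and 300 are held back for testing.
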
And it's quite beyond the scope of this tutorial to go through all of these and how they work. But we're working with entropy; that's one of the options we've set. There's a random_state of 100, which is just the seed for the random number generator, not a percentage. And we have a max_depth of three. The max depth, if you remember the different graphs of animals above, means the tree is only going to go down three layers before it stops. Then we have min_samples_leaf set to five, which means every leaf at the end has to hold at least five training samples. So the tree will have no more than three layers, and each end leaf, with the final result at the bottom, is backed by at least five samples.

Now that we've created our decision tree classifier, and not only created it but trained it, let's go ahead and apply it and see what that looks like. So let's make a prediction. We're going to paste our predict code in here, and before we run it, let's take a quick look at what it's doing. We have a variable for the y prediction, y_pred_en, and we use the clf_entropy we created, and then you'll see .predict. It's very common in the sklearn modules that the different tools have a predict method for when you're actually running a prediction. In this case we put our X_test data in here. If you delivered this for actual commercial use and distributed it, this would be the new loans you're putting in, to guess whether the person is going to pay them back or not. In this case, though, we need to test with the data and see how good our sample is, how good a job our tree does at predicting the loan payments.

Finally, since the Anaconda Jupyter Notebook works like a command line for Python, we can simply put y_pred_en on the last line to print it. I could just as easily have used print with parentheses around y_pred_en to print it out. We'll go ahead and do that; it doesn't matter which way you do it. And you'll see right here that it runs a prediction. This
is roughly 300 in here. Remember, it's 30 percent of a thousand, so you should have about 300 answers in here, and this shows, for each one of the lines of our test set that went in, what our y prediction came out as.

So let's move on to the next step. We're going to take this data and try to figure out just how good a model we have. Here we go. Since sklearn does all the heavy lifting and all the math for you, we have a simple line of code to tell us what the accuracy is. Let's go through that and see what it means and what it looks like. Let's paste this in, and let me zoom in a little bit. There we go, so you have a nice full picture. We'll see here that we're just going to print "Accuracy is" and then the accuracy_score. This was something we imported earlier, if you remember, at the very beginning. Let me scroll up there real quick so you can see where that's coming from: it's coming from here, from sklearn.metrics import accuracy_score. You could probably write your own script to do this very easily: how accurate is it, how many out of 300 do we get right?

So we put in our y_test, the true answers for the rows we ran the predict on, and then we put in our y_pred_en, the answers we got. And we multiply that by a hundred, because the score comes back as a decimal and we want to see it as a percentage. Let's run that and see what it looks like. If you look here, we got an accuracy of 93.66667. So when we look at the number of loans and how well our model fit, we can tell people it's about 93.7 percent accurate.

Just a quick recap on that: we now have an accuracy figure, and we have created a model that uses the decision tree algorithm to predict whether a customer will repay the loan or not. The accuracy of the model is about 93.7 percent. The bank can now use this model to decide whether it should approve the loan request from a particular
customer or not. This information is really powerful. As individuals we may not be able to understand all these numbers, because there are thousands of them coming in, but you can see that it's a smart decision for the bank to use a tool like this to help predict how good its profits are going to be from the loan balances and how many loans are going to default.

We've had a lot of fun learning about decision trees, so let's take a look at the key takeaways we've covered today. What is machine learning? We covered different aspects of machine learning, how it's used in your everyday life, and what you can use it for: predicting, describing, guessing what the next outcome is, and storing information. We looked at the three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. We looked at the problems machine learning solves: classification, regression, and clustering. Then we went through how the decision tree works, where we looked at the hunter trying to sort out the different animals and what kinds they are. And then we rolled up our sleeves, did our Python coding, and actually applied it to a data set.

Remember, if you have more questions on this or you have suggestions, you can post them down below in the comments section. If you want a copy of the data set that Simplilearn put together for this tutorial, you can put a request in there too. That brings us to our conclusion, and I'd like to thank you for joining us. For more information you can visit www.simplilearn.com. You can also ask more questions through the YouTube interface.
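
Looking back at the animal example, the entropy arithmetic can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the counts (3 giraffes, 2 tigers, 1 monkey, 2 elephants) come from the walkthrough, but the grouping used for the gain (giraffes and tigers on the yellow side, the monkey and elephants on the other) is a guess, and the gain here weights each child's entropy by its share of the animals, which is the usual textbook definition rather than something the walkthrough spells out.

```python
from math import log2


def entropy(counts):
    """Base-2 Shannon entropy of a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Counts from the example: 3 giraffes, 2 tigers, 1 monkey, 2 elephants.
animals = [3, 2, 1, 2]
e_root = entropy(animals)
print(round(e_root, 3))  # 1.906 with base-2 logs

# Assumed split on "is it yellow?": giraffes and tigers vs. monkey and elephants.
gain = information_gain(animals, [[3, 2], [1, 2]])
print(round(gain, 3))

# A group holding a single kind of animal has zero entropy.
e_pure = entropy([3])
```

With base-10 logs in place of log2, the same counts give roughly 0.574, so the log base matters when comparing figures.
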
