Transcript for:
Decision Trees in Machine Learning

So we've looked at some very numerical kinds of setups where there's a lot of math involved, Euclidean geometry, that kind of thing. A totally different machine learning algorithm for approaching this is the decision tree, and there are also forests that go with decision trees; they're based on multiple trees combined.

The decision tree is a supervised learning algorithm used for classification. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree is a hierarchical tree structure where an internal node represents a feature or attribute, a branch represents a decision rule, and each leaf node represents the outcome. You can see here where they have the first one, yes or no, and then you go either left or right and so forth.

One of the coolest things about decision trees, and I'll see people actually run a decision tree even though their final model is different, is that a decision tree allows you to see what's going on. You can actually look at it and say: why did you go right or left, what was the choice, where's that break? That is really nice if you're trying to share that information with somebody else. When you start getting into why this is happening, decision trees are very powerful.

The topmost node of a decision tree is known as the root node. It learns to partition on the basis of the attribute value, and it partitions the tree in a recursive manner. So you have your decision node; if you get yes, you go down to the next node, which is another decision node, and so on, and if it ends on a leaf node then you know your answer, which is yes or no. So there's your classification setup.

Here's an example of a decision tree that tells whether I'll sleep or not on a particular evening. Mine would depend on whether I have the news on or not. Do I need to sleep? No: okay, I'll work. Yes: is it raining outside? Yes, I'll sleep. No, I'll work.
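As a minimal sketch, the sleep-or-work tree just described can be written out by hand as nested if statements. The function name `evening_plan` and the boolean inputs are made up for illustration; a real decision tree learns its questions from data rather than having them typed in.

```python
def evening_plan(need_sleep: bool, raining: bool) -> str:
    """Walk the two-question tree from the example and return the leaf label."""
    # Root node: do I need to sleep?
    if not need_sleep:
        return "work"   # leaf node
    # Decision node: is it raining outside?
    if raining:
        return "sleep"  # leaf node
    return "work"       # leaf node


# Print the outcome for every combination of answers.
for need_sleep in (False, True):
    for raining in (False, True):
        print(f"need_sleep={need_sleep}, raining={raining} -> "
              f"{evening_plan(need_sleep, raining)}")
```

Reading the function top to bottom is the same as reading the tree from the root down to a leaf, which is the "why" factor mentioned above.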
I guess if it's not raining outside it's harder to fall asleep, where they have that nice rain coming in. And again, what's really cool about a decision tree is I can actually look at it and go: oh, I like to sleep when it rains outside. So when you're looking at all the data you can say, this is where the switch comes in: when it rains outside I'll sleep really well; if it's not raining, or if I don't need sleep, then I'm not going to sleep, I'm going to go work.

So let's go ahead and take a look at what that looks like in code. Just like we did before, we open up the scikit-learn documentation to see what the decision tree classifier has. You have your parameters, which we'll look at in a little more depth as we write the code: the criterion; the splitter, which is the strategy used to choose the split at each node; and max depth. Remember the tree: how far down do you want it to go? Do you want to take up the space of your whole computer and map every piece of data? The smaller that number is, the fewer levels the tree has and the less processing it takes, but it's also more general, so you're less likely to get as in-depth an answer. Then of course there are the minimum samples you need for a split and the minimum samples for a leaf. There are a lot of settings in here for how big the tree is, how to define it and how to weight it. They also have the different attributes, which you can dig deeper into; that can be very important if you want to know the why of things.

Then we go down here to the methods, and just like everything else we have our fit method, very important, and our predict, the two main things that we use: fit to train on our data, and predict for what we're going to predict from our X.

We'll go ahead and go up here and start putting together the code. We're going to import our NumPy, our pandas, and there's our confusion matrix, our train_test_split, and DecisionTreeClassifier; that's the big one that
we're actually working with; that's the line right here. Oops, decision tree, there it is: DecisionTreeClassifier, that's the one I was looking for. And of course we want the accuracy score and the classification report.

We're going to do this a little differently than we did in the other examples, and there's a reason for it. Let me go ahead and run this and load it up. We're going to build things in functions. When you start splitting work up across a team, this is the kind of thing you see a lot more, both in teams and for yourself, because you might want to swap in a different data set to test on depending on what's going on.

So we have our import data function here, which prints the data set length and shape and returns the balance data. Let me go ahead and print it, because I'm curious what it looks like. If I run that, you can see we have a whole bunch of data coming in with some interesting values: let's see, B, R, R, I'm not sure exactly what that represents, then 1 1 1 1, 1 1 1 2 and so forth. The shape is five columns, 1, 2, 3, 4, 5: a letter at the beginning, B, R or L, and then a bunch of numbers, 1 1 1 1 all the way down to 5 5 5 5. Since it's called balance data, I'm going to guess that B means balanced, R means it needs to be moved right, and L means it needs to be moved left, or is skewed to the left; I'm not sure which.

Let's close that out and create a function to split the data set. X and Y both come from the balance data values: Y is the values in column 0, there's that letter, remember, left, right and balanced, and for X we're looking for the values of 1 through 5, the number columns.
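Here is a minimal sketch of the data-loading and slicing steps just described, carried through the split, training and scoring steps the walkthrough turns to next. Several things are assumptions and not taken from the video: the data is rebuilt in code rather than read from a file (assuming the standard balance-scale layout, where the label compares left weight times left distance with right weight times right distance), the helper names `train` and `report` are hypothetical, and `max_depth=3` and `min_samples_leaf=5` are illustrative settings. Only `random_state=100`, `test_size=0.3` and the two criteria come from the walkthrough.

```python
from itertools import product

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def import_data():
    """Rebuild a balance-scale-style table: label first, then four 1-5 columns."""
    rows = []
    for lw, ld, rw, rd in product(range(1, 6), repeat=4):
        left, right = lw * ld, rw * rd
        label = "B" if left == right else ("L" if left > right else "R")
        rows.append([label, lw, ld, rw, rd])
    return np.array(rows, dtype=object)


def split_dataset(balance_data):
    """Slice features and target, then hold out 30% of the rows for testing."""
    X = balance_data[:, 1:5].astype(int)   # the four numeric columns
    Y = balance_data[:, 0].astype(str)     # the letter: L, B or R
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test


def train(criterion, X_train, y_train):
    """Fit one tree; criterion is 'gini' or 'entropy'. Depth settings are assumed."""
    clf = DecisionTreeClassifier(criterion=criterion, random_state=100,
                                 max_depth=3, min_samples_leaf=5)
    clf.fit(X_train, y_train)
    return clf


def report(name, clf, X_test, y_test):
    """Predict on the held-out rows and print the three summaries."""
    y_pred = clf.predict(X_test)
    print(f"Results using {name}:")
    print(confusion_matrix(y_test, y_pred))  # rows and columns ordered B, L, R
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))
    return y_pred


data = import_data()
print("Dataset shape:", data.shape)
X, Y, X_train, X_test, y_train, y_test = split_dataset(data)
clf_gini = train("gini", X_train, y_train)
clf_entropy = train("entropy", X_train, y_train)
y_pred_gini = report("Gini index", clf_gini, X_test, y_test)
y_pred_entropy = report("entropy", clf_entropy, X_test, y_test)
```

Passing the criterion in as an argument is one way to get the "one function with different setups" idea; the numbers it prints depend on the assumed tree settings and may not match the ones shown in the video.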
And we go ahead and split it just like you would: X_train, X_test, y_train, y_test, with random state set to 100 and test size 0.3, so we're taking 30% of the data for testing. It returns your X, your Y, your X_train, your X_test, your y_train and your y_test. Again, we do this because if you're running a lot of these you might want to switch how you split the data and how you train on it. I tend to use a k-fold method: I'll take a third of the data, train on the other two thirds and test on that third, and then I'll switch which third is the test data. Then I can take that information and combine it, and it gives me a really robust way of figuring out what the overall accuracy is. But in this case we're just going to go ahead with this; this is our function for splitting the data.

This is where it gets kind of interesting, because remember we were talking a little bit about the different settings in our model. In here we're going to create a decision tree, but we're going to use the Gini setup. Where did that come from? What's the Gini on here? If we go back to the top of the documentation page and look at which criterion we're going to use, they have Gini and entropy; those are the two main ones they use for the decision tree. So this one's going to be Gini.

If we're going to have a function that creates the Gini model, and it even goes down here and fits the Gini model on the training data, we'll probably also want to create one for entropy. Sometimes I might even make this just one function with different setups. On one of the things I worked on recently I had to create one that tested across multiple models, so I would send the parameters to the models, or I would send this part right here where it says DecisionTreeClassifier; that whole thing might be what I send to create the model. I know it's going to fit, we're going to have our X_train, and
we're going to have our predict, and all that stuff is the same, so you can just send that model to your function for testing different models. Again, this just gives you one of the ways to do it. You can see here we're going to train with Gini and we're also going to train with entropy to see how that works, so you have two separate models going.

We'll go ahead and create a prediction function. This simply is our y_pred equals whatever model object we sent, the clf object, predicting against our X_test. You can see here we print y_pred and return y_pred. We'll load that definition up.

If we're going to have a function that runs a predict and prints some things out, we should also have our accuracy function. So here's our calculate accuracy. What are we sending? We're sending our y_test data, which could also be called y actual, and y_pred. Then we print out a confusion matrix, then the accuracy score, and then the classification report, and bundle it all together there.

So if we bring this all together, we have all the steps we've been working towards, starting with importing our data. By the way, you'll spend 80% of your time importing data in most machine learning setups, cooking it and burning it and getting it formatted so that it works with whatever models you're working with. The decision tree has some cool features in that if you're missing data it can actually pick that up and just skip it and say: I don't know how to split this, there's no way of knowing whether it rained or didn't rain last night, so I'll look at something else, like whether you watched TV after 8:00, you know, that blue screen thing.

So we have our function for importing our data set, we bring in the data, and we split the data so we have our X_test and our y_train and so on. Then we have our different models: our clf_gini, so it's a decision
tree classifier using the Gini setup, and then we can also create the model using entropy. Once we have that, we have our function for making the prediction and our function for calculating the accuracy.

If we're going to have all that, we should probably have our main code here. This probably looks more familiar depending on what you're working in: if you're working in something like PyCharm you'd see this, as opposed to throwing something up real quick in a Jupyter notebook. So here's our main: the data import, which we've already defined; we get our split data; we create our Gini model and our entropy model. So there are our two models; these are two separate models, and we've already sent them to be trained. Then we print the results using the Gini index: we make our predictions by sending X_test to the Gini model and calculate the accuracy. Then we print the results using entropy, which is just the same thing coming down like we did with Gini: we put out our y_pred for entropy and our calculations.

So let's go ahead and run that and see what this piece of code does. We do get a warning on there; it's nothing major, just a simple warning, probably about a change coming in a new version. And here we are: we have our data set, it's got 625 rows, and you can see an example of the data. B, 1, 1, 1, 1 means it's balanced, I guess, and it's skewed to the right with 1, 1, 1, 2 and so forth. Then we're going to predict whether it tips to the right or to the left. You can think of a washing machine that's off balance and banging on one side, or maybe an automated car where we're driving down the middle of the road, that's in balance, and it starts going
veering to the right, so we need to correct for it. When we print out the confusion matrix we have three different classes, R, L and B, so we should see the three different classes on here. As far as whether it predicts balanced, there are not a lot of balanced loads in here, and it didn't do a good job guessing whether it's balanced or not; that's what I took from this first row. The second one, I'm guessing, is the right, so it did a pretty good job guessing the right balance, but you can see that a bunch of them came up as left. That's probably not good for an automated car, as it tells you that for 18 of them it missed and told you to go the wrong direction. And here we are going the other way, 19 to 71. Of course we can back that up with the classification report, where you can see the precision, how well it did on the left and right balance, 79% and 79% precision, and so forth.

Then we went and used entropy. Let me see if we can get them both next to each other. Here's our first model, which is the Gini model: 67, 18, 19, 71. And here's entropy: 63, 22, 20, 70. The two models are pretty close; that's not a huge difference in numbers. The second one, entropy, looks like it did slightly worse: one worse on the right balance and four worse on the left balance, or whichever way round it is. If I was choosing between these I'd probably use the first one, but they're so close that it wouldn't be a clear choice as to which one worked better. There are a lot of numbers you can play with here which might give better results, depending on what data is going in.

Now, one of the takeaways you should have from the different classification routines we ran is that they run very similarly. You certainly change the parameters in them, which model you're using, how you're using it and what data it gets applied to, but when you're talking
about the scikit-learn package, it does such an awesome job of making it easy: you split your data up, you train on your data, you run the prediction, and then you see what kind of accuracy and what kind of confusion matrix it generates.

So let's talk about algorithm selection, starting with logistic regression and k-nearest neighbors. Logistic regression is used when we have a binomial outcome, for example to predict whether an email is spam or not, or whether a tumor is malignant or not. Logistic regression works really well on that. You can do it with k-nearest neighbors also; the question is which one will work better. I find logistic regression models work really well with a lot of raw numbers, so if you're working with, say, the stock market, is this a good investment or a bad investment, that's one of the things where it handles the numbers better.

K-nearest neighbors is used in scenarios where nonparametric algorithms, with no fixed number of parameters, are required. It is used in pattern recognition, data mining and intrusion detection. So k-nearest neighbors is really good at finding the patterns. I've seen it used as a preprocessor for a lot of other processes, where you use k-nearest neighbors to figure out what data groups together. Very powerful package.

Support vector machines are used whenever the data has higher dimensions, like human genome microarrays. SVMs are extensively used in handwriting recognition models. You saw that we were able to switch between the parabolic and the circular setup, where you can have that donut kind of data and still be able to separate it out with the support vector machine.

And then decision trees are mostly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. They are preferred where the model needs to be easy to understand. I like that last one; that's a good description: is it easy to understand? So you have data coming in: when am I going
to go to bed, you know, is it raining outside? You can go back and actually look at the pieces and see those different decision nodes. It takes a little bit more to dig in there and figure out what they're doing, but you can do that, and it can actually help you figure out why. People love it for the why factor.

So, strengths and limitations; that's a big one on all of these. For logistic regression, the strengths are that it is easy to implement and efficient to train, and it is relatively easy to regularize. Remember how we put all the data points between zero and one? With logistic regression models you don't have to worry about that as much. The limitations are a high reliance on proper representation of the data, and it can only predict a categorical outcome.

K-nearest neighbors doesn't need a separate training period, and new data can be added seamlessly without affecting the accuracy of the model. That's kind of an interesting thing, because it means you can do partial training. That can become huge if you're running across really large data sets or the data keeps coming in: you can continually do a partial fit on the data with k-nearest neighbors and keep adjusting. On the other side, it doesn't work well on high-dimensional and large data sets. We were looking at the breast cancer data with 36 different features; what happens when you have 127 features, or a million features? And you say, well, where do you have a million features? If I was analyzing legal documents, I might have a tokenizer that splits the words up to be analyzed, and that tokenizer might create a million different words that might be in the document, each one needing a weight. It's also sensitive to noisy data, outliers and missing values. That's a huge one with k-nearest neighbors: it really doesn't know what to do with a missing value. How do you compute the distance if you don't know what the value is?

The SVM works more efficiently on high-dimensional data.
It is relatively memory efficient, so it's able to create those planes with only a few different variables, as opposed to having to store a lot of data for different features and things like that. It's not suitable for large data sets; when you start running an SVM over gigabytes of data it causes some huge issues. It underperforms if the data has noise or overlapping classes. That's a big one. We were looking at that where the SVM splits the data and creates a soft buffer, but what happens when you have a lot of stuff in the middle that's hard to sort out? It doesn't know what to do with that, and it causes the SVM to start crashing or not perform as well.

Decision trees handle nonlinear parameters and missing values efficiently. The missing values part is huge. I've seen this in, was it the wine tasting data sets, where they have three different data sets that share certain features, but each one has some features that aren't in the others, and you have to figure out how to handle those. The decision tree does that automatically, instead of you having to figure out a way to fill that data in before processing like you would with the other models. It's easy to understand and has a shorter training period, so it trains pretty quickly: it just keeps forking the tree down and moving the parts around, so it doesn't have to go through the data multiple times guessing and adjusting; it just creates the tree as it goes.

Overfitting and high variance are the most annoying part of it, and that's an understatement. That has to do with how many leaves and how many decisions you have: the more you have, the more overfit the tree is to the data. Also, just in making the choices and the order the choices come in, it might overfit to a specific feature, because that's where it started and that's what it knows. And it is really challenged by large data sets. They've been working on that with the random forest, but it's not suitable for large data sets; it's really something
you'd probably run on a single machine and not across a data pool or anything.

So, the decision tree, one of the many powerful tools in the machine learning library. It begins with a problem: I think I have to buy a car. In asking this question, you want to know: how do I decide which one to buy? So you start asking questions. Is the mileage greater than 20? Is the price less than 15? Will it be sufficient for six people? Does it have enough airbags, anti-lock brakes? All these questions come up. Then as we feed all this data in we make a decision, and that decision comes up: oh hey, this seems like a good idea, here's a car. So we're going to explore this decision process using a decision tree, maybe not in buying a car, but in how to process data.

What's in it for you? Let's start by finding out what machine learning is and why we even want to know about it for processing our data, and we'll go into the three basic types of machine learning and the problems that machine learning is used to solve. Then we'll get into what a decision tree is, what problems a decision tree solves, and what the advantages and disadvantages of using a decision tree are. Then we want to dig a little deeper into the mechanics: how does the decision tree work? And then we'll go in and do a use case, loan repayment prediction, where we're actually going to put together some Python code and show you the basic Python code for generating a decision tree.

What is machine learning? There are so many different ways to describe what machine learning is in today's world, and to illustrate it
we're going to take a graphic here about making decisions or trying to understand what's going on. Really, underlying machine learning is that people wish they were smarter, wish we could understand the world better. So you can see a guy here who's saying, hey, how can I understand the world better? And someone comes up and says, let's use artificial intelligence; machine learning is a part of artificial intelligence. That puts a big smile on his face, because now he has artificial intelligence to help him make his decisions, and he can think in new ways, so this brings in new ideas.

So what is machine learning? This is a wonderful graphic here; you can see where we have learn, predict, decide. These are the three most basic premises of machine learning. In learning, we can describe the data in new ways and learn new aspects of what we're looking at. Then we can use that to predict things, and we can use that to make decisions. So maybe it's something that's never happened before, but we can make a good guess whether it's going to be a good investment or not. It also helps us categorize things so we can remember them better, so it's easier to pull them out of the catalog. We can analyze data in new ways we never thought possible. And then of course there's the very large, growing industry of recognition: facial recognition, driver recognition, automated car recognition. All of these are part of machine learning.

Going back to our guy here, who's in his ordinary system and would like to be smarter and make better choices: machine learning is an application of artificial intelligence wherein the system gets the ability to automatically learn and improve based on experience. This is exciting, because you have your ordinary guy who now has another form of information coming in, and the artificial intelligence helps him see things he never saw or track things he can't track. So instead of having to read all the news feeds, he can now have an
artificial intelligence sort it out, so he's only looking at the information he needs to make a choice with. And of course we use all those machine learning tools back in there, and he's now making smarter choices with less work.

Types of machine learning. Let's break it into three primary types of learning. First is supervised learning, where you already have the data and the answers. So if you worked at a bank, you'd already have a list of all the previous loans, who defaulted on them and who made good payments on them. You then train your machine learning tool on that, and it lets you predict whether the next person is going to be able to make the payments on their loan. So that's the one category, where you already know the answers.

The next one is where you don't know the answers; you just have a lot of information coming in. Unsupervised learning allows you to group like information together. So if you're analyzing photos, it might group all the images of trees together and all the images of houses together, without ever knowing what a house or a tree is.

Which leads us to the third type of machine learning, reinforcement learning. Unlike supervised or unsupervised learning, you don't have the data prior to starting. You get the data one line at a time, and then, whether you make a good choice or a bad choice, the machine learning tool has to adjust accordingly, so you get plus or minus feedback. You can liken this to the way a human learns: we experience life one minute at a time and we learn from that; either our memories are good, or we learn to avoid something.

Problems in machine learning. To understand where the decision tree fits into our machine learning tools, we have to understand the basics of some of the machine learning problems, and three of the primary ones are these. First, classification: problems with categorical solutions like yes or no, true or false, one or zero. This might be: does it belong to a particular group, yes or
no? Then we have regression problems, where a continuous value needs to be predicted, like product prices or profit. You can see here a very simple linear graph: you can guess what the next value is based on the first four; it kind of follows a straight line going up. And clustering: these are problems where the data needs to be organized to find specific patterns, like in the case of product recommendation, where they group all the different products that people just like you viewed on a shopping site and say, people who bought this also bought this.

The most common use for the decision tree is classification: figuring out is it red or is it not, is it a fruit or is it a vegetable, yes or no, true or false, left or right, 0 or 1. So when we talk about classification, these are the four main tools used: naive Bayes, logistic regression, decision tree and random forest. The first two are for simpler data, so if your data is not very complex you can usually use them to get a fairly good representation by drawing a line or a curve through the data. They work wonderfully on a lot of problems. But as things get more complicated, the decision tree comes in, and if you have a very large amount of data you start getting into the random forest. So the decision tree is actually a part of the random forest, but today we're just going to focus on the decision tree.

What is a decision tree? Let's go through a very simple example before we dig in deep. A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence or reaction. Let's start with a simple question: how do I identify a random vegetable from a shopping bag? We have this group of vegetables in here, and we can start off by asking a simple question: is it red? If it's not, then it's going to be the purple fruit to the left, probably an eggplant. If it's true,
it's going to be one of the red fruits. Is the diameter greater than two? If false, it's going to be what looks to be a red chili, and if it's true, it's going to be a bell pepper from the capsicum family, so it's a capsicum.

Problems that a decision tree can solve. Let's look at the two different categories the decision tree can be used on. It can be used on classification, the true/false, yes/no, and it can be used on regression, where we figure out what the next value is in a series of numbers or a group of data. In classification, a classification tree will determine a set of logical if-then conditions to classify problems, for example discriminating between three types of flowers based on certain features. In regression, a regression tree is used when the target variable is numerical or continuous in nature. We fit the regression model to the target variable using each of the independent variables, and each split is made based on the sum of squared errors.

Before we dig deeper into the mechanics of the decision tree, let's take a look at the advantages of using a decision tree, and we'll also take a glimpse at the disadvantages. The first thing you'll notice is that it's simple to understand, interpret and visualize. It really shines here, because you can see exactly what's going on in a decision tree. Little effort is required for data preparation, so you don't have to do special scaling; there are a lot of things you don't have to worry about when using a decision tree. It can handle both numerical and categorical data, as we discovered earlier, and nonlinear parameters don't affect its performance, so even if the data doesn't fit an easy curved graph, you can still use it to create an effective decision or prediction.

If we're going to look at the advantages of a decision tree, we also need to understand the disadvantages. The first disadvantage is overfitting. Overfitting occurs when the algorithm captures noise in the data; that means you're solving for one specific instance
instead of a general solution for all the data. High variance: the model can get unstable due to small variations in the data. Low-bias tree: a highly complicated decision tree tends to have low bias, which makes it difficult for the model to work with new data.

Decision tree important terms. Before we dive in further, we need to look at some basic terms; we need some definitions to go with our decision tree and the different parts we're going to be using. We'll start with entropy. Entropy is a measure of randomness or unpredictability in the data set. For example, we have a group of animals in this picture; there are four different kinds of animals, and this data set is considered to have high entropy. You really can't pick out what kind of animal it is based on looking at just the four animals as a big clump of entities.

As we start splitting it into subgroups, we come to our second definition, which is information gain. Information gain is a measure of the decrease in entropy after the data set is split. So in this case, based on the color yellow, we've split one group of animals onto one side as true, and those that aren't yellow onto the other as false. As we continue down the yellow side, we split based on height against 10, true or false, and on the other side on height less than 10, true or false. As you see, as we split, the entropy gets less and less and less. So our information gain is simply the entropy E1 from the top and how much it has changed to E2 at the bottom. We'll look at the deeper math, although you really don't need to know a huge amount of math when you actually do the programming in Python, because it will do it for you, but we'll look at the actual math of how the entropy is computed.

Finally, we want to understand the different parts of our tree. The leaf node carries the classification or the decision, so it's the final end at the bottom. The decision node has two or more branches; this is where we're breaking the group up into different parts.
And finally you have the root node: the topmost decision node is known as the root node.
How does a decision tree work? "I wonder what kind of animals I'll get in the jungle today." Maybe you're the hunter with a gun, or if you're more into photography, you're a photographer with a camera. So let's look at this group of animals and let's try to classify the different types of animals based on their features using a decision tree. So the problem statement is to classify the different types of animals based on their features using a decision tree. The data set is looking quite messy and the entropy is high in this case, so let's look at a training set, or a training data set. We're looking at color, we're looking at height, and then we have our different animals: we have our elephants, our giraffes, our monkeys and our tigers, and they're of different colors and shapes. Let's see what that looks like. And how do we split the data? We have to frame the conditions that split the data in such a way that the information gain is the highest. Note, gain is the measure of the decrease in entropy after splitting. So the formula for entropy is the negative of a sum, that's what this symbol is, the one that looks kind of like a funky E, for i equals 1 to k, where k represents the number of different animals in there, of p of i, the fraction of the data set made up of that animal, times the log base 2 of that same fraction. The minus sign is there because the log of a fraction is negative, and we want the entropy to come out positive. Let's try to calculate the entropy for the current data set and take a look at what that looks like. And don't be afraid of the math; you don't really have to memorize this math, just be aware that it's there and this is what's going on in the background. So we have three giraffes, two tigers, one monkey and two elephants, a total of eight animals gathered, and if we plug that into the formula we get an entropy that equals 3 over 8, so we have three giraffes out of a total of 8, times the log, usually they use base two on the log, so log base 2 of 3 over 8, plus in this case
let's say it's the elephants, 2 over 8, two elephants over a total of 8, times log base 2 of 2 over 8, plus one monkey over a total of 8, times log base 2 of 1 over 8, plus 2 over 8 for the tigers, times log base 2 of 2 over 8. And if we plug that into our computer, our calculator, I obviously can't do logs in my head, the slide shows an entropy equal to .571. That figure is close to what you get with base 10 logs; with log base 2, which is what we just described, it works out to about 1.91. The program will actually calculate the entropy of the data set similarly after every split to calculate the gain. Now we're not going to go through each set one at a time to see what those numbers are; we just want you to be aware that this is the formula, or the mathematics, behind it. Gain can be calculated by finding the difference of the subsequent entropy values after a split. Now we will try to choose a condition that gives us the highest gain. We will do that by splitting the data using each condition and checking the gain we get out of them; the condition that gives us the highest gain will be used to make the first split. Can you guess what that first split will be just by looking at this image? As a human it's probably pretty easy to split it. Let's see if you're right: if you guessed the color yellow, you're correct. Let's say the condition that gives us the maximum gain is yellow, so we will split the data based on the color yellow. If it's true, that group of animals goes to the left; if it's false, it goes to the right. The entropy after the splitting has decreased considerably; however, we still need some splitting at both the branches to attain an entropy value equal to zero. So we decide to split both the nodes using height as a condition. Since every branch now contains a single label type, we can say that entropy in this case has reached the least value, and here you see we have the giraffes, the tigers, the monkey and the elephants all separated into their own groups. This tree can now predict all the classes of animals present in the data set with 100% accuracy. That was easy.
Use case: loan repayment prediction. Let's get into my favorite part and open up some Python
and see what the programming code and the scripting looks like. In here we're going to want to do a prediction, and we start with this individual here who's requesting to find out how good his customers are going to be, whether they're going to repay their loan or not for his bank. From that we want to generate a problem statement: to predict if a customer will repay the loan amount or not. And then we're going to be using the decision tree algorithm in Python. Let's see what that looks like, and let's dive into the code. In our first few steps of implementation we're going to start by importing the necessary packages that we need from Python, and we're going to load up our data and take a look at what the data looks like. So the first thing I need is something to edit my Python and run it in, so let's flip on over, and here I'm using the Anaconda Jupyter notebook. Now you can use any Python IDE you like to run it in, but I find the Jupyter notebooks really nice for doing things on the fly. Let's go ahead and just paste that code in at the beginning, and before we start let's talk a little bit about what we're bringing in, and then we're going to do a couple things in here; we have to make a couple changes as we go through this first part of the import. The first thing we bring in is numpy as np. That's very standard when we're dealing with mathematics, especially with very complicated machine learning tools; you almost always see NumPy come in for your numbers. It's short for numerical Python, and it has your mathematics in there. In this case we actually could take it out, but generally you'll need it for most of the different things you work with. And then we're going to use pandas as pd; that's also a standard. pandas is a data frame setup, and you can liken this to taking your basic data and storing it in a way that looks like an Excel spreadsheet. So as we come back to this, when you see np or pd, those are very standard uses; you'll know that's NumPy and pandas, and I'll
show you a little bit more when we explore the data in just a minute. Then we're going to need to split the data, so I'm going to bring in our train_test_split, and this is coming from the sklearn.cross_validation package; in just a minute we're going to change that, and we'll go over that too. And then there's also the from sklearn.tree import DecisionTreeClassifier; that's the actual tool we're using. Remember I told you don't be afraid of the mathematics, it's going to be done for you? Well, the DecisionTreeClassifier has all that mathematics in there for you, so you don't have to figure it back out again. And then we have from sklearn.metrics import accuracy_score; we need to score our setup, that's the whole reason we're splitting it between the training and testing data. And finally we still need the from sklearn import tree, and that's just the basic tree module that goes along with the decision tree classifier. And finally we're going to load our data down here, and I'm going to run this, and we're going to get two things on here: one, we're going to get an error, and two, we're going to get a warning. Let's see what that looks like. So the first thing we have is an error. Why is this error here? Well, it's looking at this and it says I need to read a file, and when this was written, the person who wrote it, this is their path where they stored the file. So let's go ahead and fix that, and I'm going to put in here my file path. I'm just going to call it full file name, and you'll see it's on my C drive, and it's this very lengthy setup on here where I stored the data_2.csv file. Don't worry too much about the full path, because on your computer it'll be different.
The data_2.csv file was generated by Simplilearn; if you want a copy of that you can comment down below and request it here on YouTube. And then, since I'm going to give it a name, full file name, I'm going to go ahead and change it here to full file name. So let's go ahead and run it now and see what happens, and we get a warning. When you're coding, understanding these different warnings and these different errors that come up is probably the hardest lesson to learn, so let's just go ahead and take a look at this and use this as an opportunity to understand what's going on here. If you read the warning, it says cross_validation is deprecated, so it's a warning that it's being removed, and it's going to be removed in favor of model_selection. So if we go up here we have sklearn.cross_validation, and if you research this and go to the sklearn site you'll find out that you can actually just swap it right in there with model_selection, and so when I come in here and run it again, that removes the warning. What they've done is they've had two different developers develop it in two different branches, and then they decided to keep one of those and eventually get rid of the other one. That's all that is, and it's very easy and quick to fix.
Before we go any further, I went ahead and opened up the data from this file, remember, the data file we just loaded on here, the data_2.csv. Let's talk a little bit more about that and see what that looks like, both as a text file, because it's a comma-separated values file, and in a spreadsheet. This is what it looks like as a basic text file. You can see at the top they've created a header, and it's got one, two, three, four, five columns, and each column has data in it. And let me flip this over, because we're also going to look at this in an actual spreadsheet so you can see what that looks like. Here I've opened it up in OpenOffice Calc, which is pretty much the same as Excel, and zoomed in, and you can see we've got our columns and our rows of data, a little easier to read in here. We have a result, yes, yes, no; we have initial payment, last payment, credit score, house number. If we scroll way down we'll see that this occupies 1,001 lines of data, with the first one being the column header and then 1,000 lines of data. Now as a programmer, if you're looking at a small amount of data, I usually start by pulling it up in different sources so I can see what I'm working with, but with larger data you won't have that option; it'll just be too large. So you need to either bring in a small amount that you can look at, like we're doing right now, or start looking at it through the Python code. So let's go ahead and move on and take the next couple steps to explore the data using Python. Let's go ahead and see what it looks like in Python to print the length and the shape of the data. So let's start by printing the length of the data set. We can use the simple len function from Python, and when I run this you'll see that it's a thousand long, and that's what we expected: there's a thousand lines of data in there if you subtract the column header. This is one of the nice things: when we loaded balance_data with the pandas read_csv, the first row of the file was taken as the header, so it automatically removes that row and then shows the data separately. It does a good job sorting that data out for us. And then we can use a different
function, and let's take a look at that. Again we're going to utilize the tools in pandas, and since balance_data was loaded as a pandas data frame, we can do a shape on it. Let's go ahead and run the shape and see what that looks like. What's nice about this shape is not only does it give me the length of the data, we have a thousand lines, it also tells me there's five columns; so when we were looking at the data, we had five columns of data. And then let's take one more step to explore the data using Python. Now that we've taken a look at the length and the shape, let's go ahead and use the pandas head, another beautiful thing in the data frame that we can utilize. So let's put that on our sheet here, and we have print, data set, and balance_data.head(). The head is a pandas feature of its own for showing the frame, and we went ahead and gave a label for our print job here of data set, just a simple print statement. When we run that, let's just take a closer look at it, let me zoom in here, there we go. pandas does such a wonderful job of making this a very clean, readable data set, so you can look at the data and you can look at the column headers. When you ask for the head, it prints the first five lines of the data, and we always start with zero, so we have five lines, 0, 1, 2, 3, 4, instead of 1, 2, 3, 4, 5. That's standard in scripting and programming: you want to start with the zero position. And that is what the data head does: it pulls the first five rows of data and puts them in a nice format that you can look at and view. It's a very powerful tool to view the data, so instead of having to flip over and open up an Excel spreadsheet or OpenOffice Calc, or trying to look at a Word doc where it's all scrunched together and hard to read, you can now get a nice open view of what you're working with. We're working with a shape of a thousand long, five wide, so we have five columns, and when we do the data head you can actually see
what this data looks like: the initial payment, last payment, credit score, house number. So now that we've explored the data, let's start digging into the decision tree. In our next step we're going to train and build our decision tree, and to do that we need to first separate the data out. We're going to separate it into two groups so that we have something to actually train the model with, and then we have some data on the side to test it, to see how good our model is. Remember, with any of the machine learning you always want to have some kind of test set to weigh it against, so you know how good your model is when you distribute it. Let's go ahead and break this code down and look at it in pieces. So first we have our X and Y. Where do X and Y come from? Well, X is going to be our data and Y is going to be the answer, or the target; you can look at it as source and target. In this case we're using X and Y to denote the data in and the data that we're actually trying to guess the answer for. And so to separate it we can simply put in X equals balance_data.values. The first part in the brackets means that we're going to select all the lines in the data set, so it's all the rows, and the second one says we're only going to look at columns one through four; it's written as one to five, but the end of a Python slice isn't included. Remember we always start with zero; column zero is the yes or no, and that's whether the loan went into default or not, so we want to start with one. If we go back up here, that's the initial payment, and it goes all the way through the house number. We can do the same thing for Y, which is the answers, and we're going to set that just equal to column zero, so it's just column zero and then it's all rows going in there. So now we've divided this into two different data sets, one of them with the data going in and one with the answers. Next we need to split the data, and here you'll see that we have it split into four different parts: your X train, your X test, your y train and your y test. Simply put, we have X going in where we're going to train it, and we have to know the answer to train it with, and then we have X test where we're going to test that data, and we have to know in the end what the y was supposed to be. That's where this train_test_split comes in that we loaded earlier in the modules; this does it all for us. You can see they set the test size equal to 0.3, so 30% will be used in the test, and then we use a random state, which seeds the shuffle, so which rows it takes out of there is random but comes out the same every time you run it. And then finally we get to actually build our decision tree, and they've called it here clf_entropy; that's the actual decision tree, or DecisionTreeClassifier. In here they've added a couple variables, which we'll explore in just a minute, and then finally we need to fit the data to that. So we take our clf_entropy that we created and we fit the X train, and since we know the answers for X train are the y train, we go and put those in. Let's go ahead and run this, and what most of these sklearn modules do is, when you set up the variable, in this
case when we set clf_entropy equal to DecisionTreeClassifier, it automatically prints out what's in that decision tree. There's a lot of variables you can play with in here, and it's quite beyond the scope of this tutorial to go through all of these and how they work. But we're working on entropy, that's one of the options we've added; the random state is 100, which is just a seed so the results are repeatable, not a percentage; and we have a max depth of three. Now the max depth, if you remember above when we were doing the different graphs of animals, means it's only going to go down three layers before it stops. And then the minimum samples per leaf is five, so every leaf at the end has to hold at least five rows of the training data. So we have no more than three layers, and every end leaf has at least five samples behind the final result at the bottom. Now that we've created our decision tree classifier, not only created it but trained it, let's go ahead and apply it and see what that looks like. So let's go ahead and make a prediction. We're going to paste our predict code in here, and before we run it let's just take a quick look at what this is doing. We have a variable y_pred_en that we're going to fill, and we're going to use our variable clf_entropy that we created, and then you'll see .predict, and that's very common in the sklearn modules, that their different tools have a predict for when you're actually running a prediction. In this case we're going to put our X test data in here. Now if you delivered this for actual commercial use and distributed it, this would be the new loans you're putting in here to guess whether the person's going to pay them back or not. In this case, though, we need to test out the data and just see how good our sample is, how good our tree does at predicting the loan payments. And finally, since the Anaconda Jupyter notebook works as a command line for Python, we can simply put y_pred_en in to print it. I could just as easily have put the print
and put brackets around y_pred_en to print it out; we'll go ahead and do that. It doesn't matter which way you do it, and you'll see right here that it runs a prediction. This is roughly 300 in here; remember it's 30% of 1,000, so you should have about 300 answers in here, and this tells you, for each one of the lines of our test data that went in there, what our y_pred_en came out as. So let's move on to the next step, where we're going to take this data and try to figure out just how good a model we have. So here we go: since sklearn does all the heavy lifting for you, and all the math, we have a simple line of code to let us know what the accuracy is. Let's go ahead and go through that and see what that means and what that looks like. Let's go ahead and paste this in, and let me zoom in a little bit, there we go, so you have a nice full picture. We'll see here we're just going to do a print, accuracy is, and then we do the accuracy_score, and this was something we imported earlier, if you remember, at the very beginning. Let me just scroll up there real quick so you can see where that's coming from: that's coming from here, from sklearn.metrics import accuracy_score. You could probably make your own script to do this very easily: how accurate is it, how many out of 300 do we get right? And so we put in our y test, that's the true answers for the rows we ran the predict on, and then we put in our y_pred_en, that's the answers we got, and we're just going to multiply that by 100, because this is just going to give us an answer as a decimal and we want to see it as a percentage. Let's run that and see what it looks like, and if you see here we got an accuracy of 93.66667. So when we look at the number of loans and we look at how good our model fit, we can tell people it's right about 93.7% of the time. So just a quick recap on that: we now have accuracy set up on here, and so we have created a model that uses the decision tree algorithm to predict whether a customer will repay the loan or not. The accuracy of the model is about 93.7%. The bank can now use this model to decide whether it should approve the loan request from a particular customer or not, and so this information is really powerful. We might not be able to, as individuals, understand all these numbers, because they have thousands of numbers that come in, but you can see that this is a smart decision for the bank, to use a tool like this to help them predict how good their profit's going to be off of the loan balances and how many are going to default or not.
So we've had a lot of fun learning about decision trees, so let's take a look at the key takeaways that we've covered today. What is machine learning: we covered some different aspects of machine learning, how it's utilized in your everyday life and what you can use it for, for predicting, for describing, for guessing what the next outcome is, for storing information. We looked at the three main types of machine learning: supervised learning, unsupervised learning and reinforcement learning. We looked at problems in machine learning and what it solves: classification, regression and clustering. Finally we went through how the decision tree works, where we looked at the hunter who's trying to sort out the different animals and what kind of animals they are, and then we rolled up our sleeves and did our Python coding and actually applied it to a data set
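The loan walkthrough above can be sketched end to end roughly like this. It is a minimal sketch under stated assumptions, not the notebook from the video: the CSV file is not used, so a synthetic 1,000-row table stands in for it, and the snake_case column names and the rule that decides yes or no are made up for the example. The classifier settings (entropy, random state 100, max depth 3, minimum samples per leaf 5) and the 30% test split follow what the walkthrough describes, with train_test_split imported from sklearn.model_selection.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan file (assumption: values and rule are made up).
rng = np.random.default_rng(0)
n_rows = 1000
credit_score = rng.integers(300, 800, size=n_rows)
balance_data = pd.DataFrame({
    "result": np.where(credit_score > 550, "yes", "no"),  # hypothetical rule
    "initial_payment": rng.integers(100, 500, size=n_rows),
    "last_payment": rng.integers(10000, 15000, size=n_rows),
    "credit_score": credit_score,
    "house_number": rng.integers(3000, 5000, size=n_rows),
})

# Explore the data: length, shape, and the first five rows (indexed 0 to 4).
print("Dataset length:", len(balance_data))
print("Dataset shape:", balance_data.shape)
print(balance_data.head())

# Columns 1 through 4 are the features (the slice end, 5, is excluded);
# column 0 holds the yes/no answers.
X = balance_data.iloc[:, 1:5].to_numpy()
y = balance_data.iloc[:, 0].to_numpy()

# Hold out 30% of the rows for testing; random_state is a seed, so the
# split is shuffled but repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Entropy-based tree, at most 3 levels deep, at least 5 samples in each leaf.
clf_entropy = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5
)
clf_entropy.fit(X_train, y_train)

# Predict the held-out rows and compare against the true answers.
y_pred_en = clf_entropy.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_en)
print("Accuracy is", accuracy * 100)
```

Because the synthetic labels follow one clean rule, the accuracy printed here will not match the figure from the real data set; the point is only the order of the steps.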