Transcript for:
Lecture on Choosing the Best Machine Learning Model and Cross-Validation

Sometimes we get into this dilemma of which machine learning model to use for a problem. For example, we worked on the iris flower dataset problem: you can classify those iris flowers using SVM, random forest, logistic regression, or a decision tree, so which of these models is the best? Cross-validation is a technique that allows you to answer exactly that question; it basically allows you to evaluate a model's performance. When you are building a machine learning model, such as one classifying emails as spam or not spam, the typical procedure is to first train the model using whatever labeled dataset you have, and once the model is built, the next step is to use a different dataset to test it. Your model returns its predictions, and you compare those results with the truth to measure the accuracy of the model.

Now, there are several ways you can set up this training step. The first way is to feed all the available training data to your model, so you use 100% of your samples to train the model, and then you use those exact same samples to test it. In real life this is like preparing a cute kid for a math test: say you have 100 math questions, you train this kid on those questions, and then in the exam you ask the exact same questions and try to measure his math skills based on the score. This is not a very good way of measuring someone's math skills, because he has already seen those questions before; even if he gets 100 out of 100, it proves nothing.

The second option is to split the available samples into training and test datasets. For example, out of 100 samples I will use 70 for training and 30 for testing. We have been using this train_test_split method in all of our supervised learning model tutorials; you will see that we have used it in, I think, the majority of our tutorials. Going back to our student example, you give this person 70 math questions for preparation and reserve the remaining 30 for the test. That way, when he takes the test, he has not seen those 30 questions, which lets you measure his skills fairly. But there is a problem with this approach as well: say the 70 questions you gave this person were all algebra, and the remaining 30 questions are all calculus. He has not seen those questions before and has no knowledge of that topic itself, so he might not perform well. This technique kind of works, but it is not perfect, and that's why we have k-fold cross-validation.

In this technique, we divide our 100 samples into folds. Say I have five folds here, each containing 20 samples, and then I run multiple iterations. In the first iteration, you use folds two to five for training the model and the first fold for testing, and you note down the score. In the second iteration, you use the first fold and folds three to five for training and the second fold for testing, and again note down the score. You repeat the process until the last fold, where you use fold five for testing and the rest for training, and once you have all the scores you just average them out. This technique is very, very good, because you are giving a variety of samples to your model, taking the individual scores, and averaging them out.
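To make the three evaluation strategies concrete, here is a minimal sketch (not from the lecture itself) that scores one model on the iris dataset three ways: on its own training data, on a single held-out split, and with k-fold cross-validation. The model choice, split ratio, and fold count are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

iris = load_iris()
X, y = iris.data, iris.target

# Strategy 1: train and test on the same 100% of samples (overly optimistic).
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print("train == test:", model.score(X, y))

# Strategy 2: a single 70/30 train/test split (depends on which samples land where).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("70/30 split:  ", model.score(X_test, y_test))

# Strategy 3: 5-fold cross-validation, averaging the five per-fold scores.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold mean:  ", scores.mean())
```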
Here I have imported the necessary libraries that I am going to use in this tutorial. The purpose of this coding exercise is to classify the digits dataset, which comes with the sklearn library. You probably know about it: these are handwritten digits that we classify into one of ten categories, zero to nine. We will use different algorithms, different models, and then evaluate the performance of each of them using k-fold cross-validation.

So here I imported my digits dataset, and the first thing I am going to do is split it into training and test datasets. As you may be aware, using train_test_split you can split it into training and test sets. Once you have that, you can try different classifiers. The first classifier I am going to use is logistic regression, and I am going to measure its score. Here I created a logistic regression classifier (a classifier is basically a machine learning model that tries to classify your samples), trained it using X_train and y_train, and when I tested the performance using the score method it returned 0.959, which is pretty good. I used logistic regression, but I have many other models as well, for example SVM, so let me try SVM and see how it performs. You can see SVM performs pretty poorly; its score is actually very low compared to logistic regression. I can try random forest as well, and in the random forest case it is performing the best, I think. So this was a quick way of measuring the performance of these three models: logistic regression, SVM, and the random forest classifier. We evaluated the performance and found that the random forest classifier performs the best.

Now, this works in practical situations, but the problem is that the distribution of samples in X_train and X_test is not uniform. For example, when I run this method one more time, my samples change completely. So when I re-execute this code (using Ctrl+Enter, by the way, which is a Jupyter Notebook shortcut to execute a cell), you will see that the score changes. The previous score was 0.959; when I re-executed, it became 0.953, so it changed a little bit. Here SVM was 0.40, and when I re-execute it you can see it becomes 0.62. Why did that happen? Because when I re-executed this cell, X_train, X_test, y_train, and y_test changed; the samples that go into these four sets changed, and hence you should expect the performance of your models to change. When I execute this, random forest is still performing better, but you see that previously its score was 0.97 and now it is 0.98. So you see the problem with the train_test_split method: you can't run it only one time and say that a particular model is better than another; you have to run it multiple times. See, if I run it multiple times, the scores change every time.
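A minimal sketch of this comparison, assuming the digits dataset and the three sklearn classifiers as in the lecture (exact hyperparameters are not stated there, so the ones below are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)

# Score each classifier on the same random split.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print("Logistic Regression:", lr.score(X_test, y_test))

svm = SVC()
svm.fit(X_train, y_train)
print("SVM:               ", svm.score(X_test, y_test))

rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train, y_train)
print("Random Forest:     ", rf.score(X_test, y_test))

# Re-running the split above reshuffles the samples, so these
# scores change from run to run -- the weakness discussed here.
```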
Now let's try k-fold. First I am going to use the KFold API to demonstrate what exactly it does. From sklearn.model_selection you can import KFold, and then kf = KFold(n_splits=3): here you can specify how many folds you want to create, and I want to create three folds just as an example. The way you use this KFold on a dataset is to write something like: for train_index, test_index in kf.split(...), supplying the dataset as the argument. For simplicity's sake, I will just supply the numbers one to nine, and then I can print train_index and test_index. When I run this, kf.split returns an iterator, and that iterator yields the train and test indices for each of the iterations. It divided the data into three folds of three samples each. In the first iteration it used one fold for testing and the remaining two folds for training. In the second iteration it moved that fold into training (you can see 0, 1, 2 is in the training set) and used the next fold for testing, and it repeats the procedure like that. You could supply ten folds as well, and it would work accordingly.

Now we are going to use k-fold for our digits example. To simplify things, I am going to write a generic method called get_score which takes a model as input, along with X_train, X_test, y_train, and y_test, and I will tell you the purpose of this method. It calls model.fit, which means it trains the model using X_train and y_train, and once the training is done, it returns the model's score using the test samples you supplied as arguments. This method is pretty powerful; we could have used it to measure the performance of the earlier models as well. For example, let me quickly show you: instead of repeating those three lines for each model, I could have just called the get_score method with the model plus X_train, X_test, y_train, and y_test, and it would have returned the score. The same goes for SVC. It is just modularizing our code.
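A sketch of the KFold demonstration and the get_score helper as described above (the one-to-nine dummy data and the fold count follow the lecture; everything else is standard sklearn):

```python
import numpy as np
from sklearn.model_selection import KFold

# Demonstrate what KFold does on a tiny dummy dataset of nine samples.
kf = KFold(n_splits=3)
for train_index, test_index in kf.split(np.arange(1, 10)):
    print(train_index, test_index)
# [3 4 5 6 7 8] [0 1 2]
# [0 1 2 6 7 8] [3 4 5]
# [0 1 2 3 4 5] [6 7 8]

# A generic helper: train the given model, then return its score
# on the supplied test samples.
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)
```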
Once we have this method ready, I am going to use k-fold on our digits dataset. From sklearn.model_selection, this time I am going to import StratifiedKFold. StratifiedKFold is similar to KFold, but it is a little better in that when it separates out your folds, it divides each of the classification categories in a uniform way. This can be very helpful: imagine you are creating three folds for our iris flower dataset, and two folds have two types of flowers while one fold has just a very different type of flower; that might create problems, which is why using StratifiedKFold is better. So here I am going to set n_splits = 3. People usually use 10 splits, but just to keep things simple I am using three here.

Once you have your folds ready, this works exactly the same way as before: KFold and StratifiedKFold are used the same way, so we are repeating what we did in those two lines, but now with our real example, the digits dataset. I will also prepare score arrays for our different models: scores_l for the logistic regression scores, then scores_svm, then scores_rf for random forest. I need these arrays, and I will tell you why in a moment. Here we are doing the exact same thing as before, but instead of the dummy data we are now using our real digits data. In each iteration we get a train_index and a test_index into the digits data, and I am going to use them to build X_train, X_test, y_train, and y_test: X_train and X_test are indexed out of digits.data, y_train is digits.target at train_index, and y_test is digits.target at test_index. If you do this and look at the length of each of these sets, you will realize it is doing exactly the same thing as it did in that earlier cell.

Now it is time to measure the performance of our three models in each iteration. Since we have three folds, this for loop is going to repeat three times; every time it will take a different X_train, X_test, y_train, and y_test, measure the performance of each model, and append the scores to these arrays. So let's first start with the get_score method: as you know, the first argument it takes is the model, then X_train, X_test, y_train, and y_test. Let me just print this, and I can do the same thing for three different models: the second model is SVM and the third is the random forest classifier. When I print the scores, it prints the different scores: for the first iteration these are the three models' performances, then the second iteration, then the third. Now, instead of just printing, why don't we append these scores to the arrays? So instead of printing, I will call scores_l.append, scores_svm.append, and scores_rf.append with the exact same scores.

Now my score arrays should be ready, so I want to print them. You can see the SVM performance is not that great; we have seen the same behavior before. Logistic regression and random forest look to be performing similarly; on one instance random forest performed a little better. What you can do now is take the average of these three scores and determine the best model for the given problem. Based on this, it looks like our logistic regression model might perform the best. One optimization I can think of doing here is increasing the number of trees in my random forest classifier, so let me increase the trees to 40 and see how it performs. I increased the trees to 40 and my scores are ready, so I will once again evaluate the scores. Nice, so you can see that random forest now seems to be doing better: its fold scores are around 0.95, compared to roughly 0.89 to 0.92 for the others.
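Assembled into one sketch, the loop described above looks like this (using the 40-tree random forest from the final tweak; max_iter is an illustrative assumption, and get_score is redefined so the snippet stands alone):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

digits = load_digits()
folds = StratifiedKFold(n_splits=3)

scores_l, scores_svm, scores_rf = [], [], []

# One iteration per fold: build the four sets from the indices,
# then score each model and collect the result.
for train_index, test_index in folds.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]

    scores_l.append(get_score(LogisticRegression(max_iter=1000),
                              X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(),
                                X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=40),
                               X_train, X_test, y_train, y_test))

# Average the per-fold scores to compare the three models.
print("Logistic Regression:", scores_l, np.mean(scores_l))
print("SVM:                ", scores_svm, np.mean(scores_svm))
print("Random Forest:      ", scores_rf, np.mean(scores_rf))
```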
So after I did a little tweaking, a bit of parameter tuning in my random forest classifier, my scores improved. Now, this code looks a little messy because we have to deal with so many indices, but luckily the sklearn library comes with a ready-made method called cross_val_score which you can use to do the exact same thing I did here. I wrote this code just to explain to you how k-fold works, but in real life, when you are solving a machine learning problem, you don't need to write this much code; you can just call the cross_val_score method, which I can demo right now. To use it, you import cross_val_score from sklearn.model_selection. Once the method is imported, you call it with your classifier as the first argument, your X (which is digits.data) as the second, and your y (which is digits.target) as the third. If you do Shift+Tab to look at the documentation, it says the first argument is the estimator, which is a model, the second argument is X, and the third is y, and that is what we did. When you execute this line, it shows you scores similar to what we computed before: internally, this method did the same thing as our for loop; it created the folds, measured the performance, and prepared the scores array. I will now call the same method on my SVM classifier and see how that goes. Again, SVM didn't perform very well: the accuracy is 39 percent, 41 percent, and so on, versus 89 to 94 percent here, which is much better. For the third case, I will just copy-paste the code. So see, all you have to do is call one line. Machine learning is not hard: if you know the internals of these libraries, then all you need to do is call one single method to measure the performance.

Now, we compared different classifiers, but you can also compare the same classifier with different parameters; this is called parameter tuning. For example, we have the random forest classifier right here, and we ran it with 40 trees; we can run the same classifier with, say, 5 trees and get the score. So you see, this is the score with 40 trees and this is the score with 5 trees; the score went down a little bit. Let me try 15, for example, and the score went up a little bit. So it looks like as I increase the number of trees, my score increases. How about if I make it 50 trees? With 50 it increased even further. And 60? I think 50 was better than 60.
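A sketch of the cross_val_score calls and the tree-count tuning described here (the cv value is not stated explicitly in the lecture; cv=3 below is an assumption matching the three folds used earlier):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# One line per model: cross_val_score builds the folds, trains,
# and scores internally, returning one score per fold.
print(cross_val_score(LogisticRegression(max_iter=1000),
                      digits.data, digits.target, cv=3))
print(cross_val_score(SVC(), digits.data, digits.target, cv=3))
print(cross_val_score(RandomForestClassifier(n_estimators=40),
                      digits.data, digits.target, cv=3))

# Parameter tuning: same algorithm, different numbers of trees.
for n in (5, 15, 40, 50, 60):
    scores = cross_val_score(RandomForestClassifier(n_estimators=n),
                             digits.data, digits.target, cv=3)
    print(n, "trees -> mean score:", np.mean(scores))
```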
So this way you can take a model and do parameter tuning: you are using the same algorithm, here the random forest classifier, but you are tuning the parameters and trying to gauge which setting delivers the best performance. So you can see that the cross-validation technique is pretty useful: it not only allows you to compare different algorithms, it can also tell you how the same algorithm with different parameters would behave for your given problem. Machine learning is not like a scientific equation where for a given problem you always use this model versus that model; it is somewhat trial-and-error based, where for your given problem and given dataset you need to try various models with various parameters and figure out which one is best for your use case.

All right, now comes the most interesting part, which is the exercise. What we want you to do in this tutorial is take the iris flower dataset, use four different classifiers (random forest, decision tree, SVM, and logistic regression), and use the cross_val_score method to find the best-performing classifier. So, if I am a teacher telling you to solve the classification problem of the iris flower and asking which model you would use, you would run cross_val_score with these four different models and tell me which one performs the best.

That's all I had for this tutorial. I have a link to the Jupyter Notebook used in this tutorial, as well as the exercise, in the video description section down below. Stay tuned, I'll be back with the next machine learning tutorial pretty soon. If you like my content, please subscribe to my channel and please give this video a thumbs up; it helps me with the YouTube search ranking. Thanks once again, bye!