Transcript for:
Scikit-Learn Tutorial Part 1 Notes

Hello and welcome to Scikit-Learn Tutorial Part 1. My name is Richard Kirchner with the Simplilearn team; that's www.simplilearn.com, get certified, get ahead. We're going to cover the scikit-learn tutorial, which has a lot of features and all kinds of APIs for exploring data and doing your data science. It's probably one of the top data science packages out there. So what is scikit-learn? It's a simple and efficient tool for data mining and data analysis, built on NumPy, SciPy and Matplotlib, so it interfaces very well with those modules, and it's open source and commercially usable under the BSD license. BSD originally stood for Berkeley Software Distribution, but what it means is the software is open source with very few restrictions on what you can do with it, and you don't have to pay for a commercial license the way you would with many proprietary platforms; that's another reason to really like the scikit-learn setup. What can we achieve with scikit-learn? The two main things are classification and regression models. Classification is identifying which category an object belongs to. One very common application is spam detection: is it spam or is it not spam, yes or no. In banking it might be: is this a good loan or a bad loan? Today we'll be looking at wine: is it going to be a good wine or a bad wine? Regression is predicting an attribute associated with an object. One example is stock price prediction: if this stock sold today for twenty-three dollars and five cents a share, what do you think it's going to sell for tomorrow, and the next day, and the next day? That would be a regression model. Same with weather forecasting; any of these are regression models, where we're predicting one specific attribute. Today we'll be doing classification, looking at whether a wine is good or bad, but the regression model is in many cases more useful
because it gives you an actual value; it's also a little harder to follow sometimes, so classification is a really good place to start. We can also do clustering and model selection. Clustering is the automatic grouping of similar objects into sets; customer segmentation is an example. Customers who bought this will probably also like that, or if you like these particular features, maybe you'll like these other objects; referrals are a good example, especially on amazon.com or any of the shopping networks. Model selection is comparing, validating and choosing parameters and models. This goes a little deeper into scikit-learn: we try different models to find the best solution. Today, with the wines, the question is how to pick out the best wine, so we'll compare different models, look a little at that, and improve a model's accuracy with different parameters and fine-tuning. Since this is only part one we're not going to do too much tuning on the models, but I'll point the options out as we go. Two other features are dimensionality reduction and pre-processing. Dimensionality reduction is reducing the number of random variables to consider, which increases model efficiency. We won't touch it in today's tutorial, but be aware that if you have thousands of columns of data coming in, thousands of features, some of those are going to be duplicated and some can be combined into a new column. By reducing all those features to a smaller set you increase the efficiency of your model: it processes faster, and in some cases it will be less biased, because if you weigh the same feature over and over again the model becomes biased toward that feature. Pre-processing is feature extraction and normalization, so
we're transforming input data, such as text, for use with machine learning algorithms. We'll be doing a simple scaling for our pre-processing, and I'll point it out and discuss it when we get there. With that, let's roll up our sleeves and dive in. I like to use the Jupyter Notebook, and I launch it from the Anaconda Navigator. If you install Anaconda Navigator it comes with Jupyter Notebook by default, or you can install Jupyter Notebook by itself, and this code will work in any of your Python setups. I believe I'm running a Python 3.7 environment (I'd have to go into Environments and look it up), but it's one of the 3.x versions. We launch it and it opens in a web browser, which is nice because it keeps everything separate, and in Anaconda you can have different environments with different versions of Python and different modules installed in each one, so it's a very powerful tool if you do a lot of development. Jupyter Notebook is a wonderful visual display. Certainly you can use Spyder, which is also installed with Anaconda; I actually use a simple Notepad++ for some of my Python scripting, and any of your IDEs will work fine. Jupyter Notebook runs on IPython, since it's designed for that interface, but it's good to be aware of the different tools. When I launch Jupyter Notebook it opens a web page, and we go over to New and create a new Python notebook; like I said, I believe this is Python 3.7, but scikit-learn works with any of the 3.x versions, and there are even 2.7 versions, so it's been around a long time and it's very big on the development side. The folks in the back put this notebook together for me, so let's go ahead and import our
different packages. If you've been reading some of our other tutorials you'll recognize pandas as pd. The pandas library is pretty widely used; its DataFrame is like the columns and rows of a spreadsheet, with a lot of features for looking things up. Seaborn sits on top of Matplotlib and is for graphing; you'll see how quick it is to throw a graph out there in the Jupyter Notebook for demos and showing people what's going on. Then we're going to use the random forest, the SVC or support vector classifier, and also a neural network, so we'll go through three of the most common classifiers, show how they work in the scikit-learn setup, and see how they differ. You'll also want to import some metrics: from sklearn.metrics we're using the confusion matrix and the classification report. From sklearn.preprocessing we take StandardScaler and LabelEncoder; StandardScaler is probably the most commonly used pre-processor, and there are a lot of different pre-processing utilities in sklearn. Then model_selection is for splitting our data up; it's one of many ways to split data into different sections. The last line is %matplotlib inline. Some of the Seaborn and Matplotlib output will display in line without it and some won't, so it's good to always include it in the Jupyter Notebook. If you run this in an IDE it will instead open a new window and display the graphics there, so you only need this line when running in a notebook-style editor. I'm not even familiar with other editors like this, but I'm sure they're out there, maybe even a Firefox version or something.
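The import cell he describes probably looks something like this (a sketch reconstructed from the narration; the exact cell isn't shown in the notes, though the pd/sns aliases and these sklearn paths are the conventional ones):

```python
import pandas as pd
import seaborn as sns

# The three classifiers discussed in this tutorial
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Metrics, pre-processing, and data splitting
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# In a Jupyter notebook you would also run the magic below
# so plots render in line:
# %matplotlib inline
```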
Jupyter Notebook just happens to be the most widely used. We hit the Run button, and all the packages are now loaded and available to us for the project we're working on. A little side note while we're at it: when you're playing with cells, even if I went back, deleted a cell and hit the scissors up here, whatever it loaded is still in the kernel. Until I go under Kernel and choose Restart, Restart & Clear Output, or Restart & Run All, I'll still have access to pandas. That's important to know, because I've done it before: I've loaded up my own code, changed my mind, and then wondered why it kept producing the wrong output, only to realize the old version was still loaded in the kernel. You have to restart the kernel; just a quick troubleshooting note for working with the Jupyter Notebook. Now let's load our data set using pandas (if you haven't yet, go look at our pandas tutorial): a simple read_csv with the separator specified. I run that and it's now loaded into the variable wine. Let's take a quick look at the actual file, because I always like to look at the data I'm working with. In this case it's winequality-red, which I'll open in my OpenOffice setup; it's separated by semicolons, which is important to notice. Scrolling all the way down, it looks like 1,600 lines of data minus the header row, so 1,599 rows, with a number of features going across. The last one is quality, and the quality shows up as different numbers: five, six, seven. I'm not sure how high the scale goes, but I don't see anything over a seven here.
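The load step, sketched; in the video this reads the semicolon-separated red-wine file from disk, and here a tiny made-up in-line sample stands in for it so the snippet is self-contained:

```python
import io

import pandas as pd

# In the video: wine = pd.read_csv('winequality-red.csv', sep=';')
# A made-up two-row sample stands in for the file here:
csv_text = (
    "fixed acidity;volatile acidity;alcohol;quality\n"
    "7.4;0.70;9.4;5\n"
    "7.8;0.88;9.8;5\n"
)
wine = pd.read_csv(io.StringIO(csv_text), sep=';')  # note the semicolon separator
print(wine.shape)  # (2, 4)
```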
So it's roughly five through seven that I see: five, six and seven. (Looking through before the demo I hadn't realized the setup on this; there are several different quality values in there.) Alcohol, sulphates, pH, density, total sulfur dioxide and so on are the other features we'll be looking at. Since this is a pandas DataFrame we can just do wine.head(), which prints our first five rows of data, a pandas command of course. It looks very similar to what we saw in the spreadsheet: pandas has automatically assigned an index on the left, which it does if you don't give it one, and it has taken the column names from the first row of the file. So we have our data pulled off the comma-separated-values file, in this case semicolon-separated, with the different features going across: count them, 12 features including quality, and quality is the one we want to understand. Because we're in a pandas DataFrame we can also do wine.info(), and running that tells us a lot about the variables we're working with. You'll see there are 1,599 entries, matching what I counted in the spreadsheet, all non-null float64. That's very important information, especially the non-null part, because null values can really trip us up in pre-processing. There are a number of ways to handle null values: one is simply to delete those rows, so if you have enough data you might just drop them; another is to fill the missing values in with, say, the average or the most common value. We won't have to worry about that here, but let's check it another way: we can also do wine.isnull() and sum it up. That won't tell us these are float values, but it will give us a summation.
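The three inspection calls just mentioned, as a sketch (a small made-up DataFrame stands in for the real wine data):

```python
import pandas as pd

# Stand-in for the real wine DataFrame (made-up values)
wine = pd.DataFrame({'alcohol': [9.4, 9.8, 10.0],
                     'quality': [5.0, 5.0, 7.0]})

print(wine.head())           # first five rows (here all three)
wine.info()                  # row count, dtypes, non-null counts per column
print(wine.isnull().sum())   # how many nulls in each column; all zeros here
```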
Let me run that: it gives us a count of how many null values are in each column. wine.isnull() on its own shows you where the nulls are but not how many; the sum tells you clearly that you have, say, five null values in one column and two in another. If you had only seven null values in all that data you'd probably just delete them, whereas if ninety percent of the data were null you might rethink your data collection setup or find a different way to deal with the nulls. We'll talk about that a little in the models too, because the models themselves have some built-in features, especially the forest model we're going to look at. At this point we need to make a choice, and to keep it simple we're going to do a little pre-processing of the data and create some bins: (2, 6.5, 8). What this means: if you remember, scrolling back up, quality comes out between roughly 2 and 8, and you can see 5, 5, 5, 6, variation in quality just in the first five lines. We've decided to separate that into just two bins, labeled bad and good. The split point is 6.5 and the top edge is 8, the 8 because the quality scale runs up to 8 here. The 6.5 we could change, making it smaller or greater, but we're only looking for the really good wine: not the 0 through 6, but the wines scoring 7 or 8, high quality. This is what I want to put on my dinner table at night; I want to taste the good wine, not the semi-good or mediocre wine. Then pd, which remember stands for pandas: pd.cut means we're cutting up the wine quality, and
we're replacing it. We pass bins=bins: the keyword bins is the pandas argument, and our variable bins holds (2, 6.5, 8), so two bins, along with our labels bad and good. Then let's look at wine['quality'].unique(), another pandas command. When I run this I get a lovely error. Why? Because I replaced wine['quality'] with the cut, which literally altered one of the variables saved in memory, so cutting it a second time fails. We go up to Kernel, Restart & Run All, which starts from the very beginning, and that fixes the error because I'm no longer cutting something that's already been cut. Now wine['quality'].unique() shows bad and good, so we have two quality categories, with bad listed below good, meaning bad is going to be zero and good is going to be one. To make that happen we need to actually encode it, so we create label_quality = LabelEncoder(). The LabelEncoder, if you go back up, was one of the things we imported: from sklearn.preprocessing import StandardScaler (which we'll use in a minute) and LabelEncoder. That's what maps bad to 0 and good to 1. We run that and then apply it to the data: we take our wine['quality'] and set it equal to label_quality.fit_transform(wine['quality']). Look at that fit_transform, because you'll see it throughout pre-processing; fitting and transforming happen together so often that they're combined into one command. So we feed the wine quality in and put the result back into the wine['quality'] column.
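Putting the binning and encoding steps together (a sketch; the five quality values are made up, but the bins, labels and LabelEncoder usage follow the narration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

wine = pd.DataFrame({'quality': [5, 5, 6, 7, 8]})  # made-up sample column

# Two bins: (2, 6.5] -> 'bad', (6.5, 8] -> 'good'
bins = (2, 6.5, 8)
group_names = ['bad', 'good']
wine['quality'] = pd.cut(wine['quality'], bins=bins, labels=group_names)
print(wine['quality'].unique())  # bad, good

# Encode the labels: 'bad' -> 0, 'good' -> 1 (alphabetical order)
label_quality = LabelEncoder()
wine['quality'] = label_quality.fit_transform(wine['quality'])
print(wine['quality'].tolist())  # [0, 0, 0, 1, 1]
```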
We run that, and now when we do wine.head() for the first five rows you can see zeros under quality; you have to go down a little further to find the better wines. Looking at ten rows, yes, there are some ones down there: every row is now a zero or a one for quality, and again we're looking for high quality, the 7s and 8s, 6.5 and up. Let's get a little more information about the quality column itself with a simple pandas value_counts(). Typing that in correctly, we can see that only 217 of our wines fall into the higher-quality bucket; the rest, 1,382, fall into the bad bucket, the zeros. So we're looking at the top slice of these, probably a little under 20 percent: our top wines, the 7s and 8s. Now let's plot this on a graph. sns, if you remember from the top, is Seaborn; Seaborn sits on top of Matplotlib with a lot of added features plus everything Matplotlib offers, and it makes it quick and easy to put out a graph. We'll do a simple bar graph, which Seaborn calls a count plot, so we do a countplot of wine['quality'] and run it, and it appears nicely in line (remember, that's why we included %matplotlib inline). The first bar represents the low-quality wine and the second bar the high-quality wine, and since we're only keeping the top-quality wine, most of the wine we'd want to just give away to the neighbors.
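The class-balance check and the Seaborn count plot, sketched (the counts here are made up; note that newer seaborn releases want the keyword form countplot(x=...)):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs without a display

import pandas as pd
import seaborn as sns

# Stand-in encoded quality column (made-up: 8 bad, 2 good)
wine = pd.DataFrame({'quality': [0] * 8 + [1] * 2})

print(wine['quality'].value_counts())  # in the video: 1382 bad vs 217 good

ax = sns.countplot(x=wine['quality'])  # one bar per class, as in the video
```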
And maybe if you don't like your neighbors, the good-quality wine stays with you; I don't know what you do with the bad-quality wine, I guess use it for cooking. But you can see Seaborn forms a nice little graph for us. So now we've done some pre-processing, we've described our data a little, we have a picture of how much of the wine we expect to be high versus low quality, and we've checked that there are no null values to contend with, or any odd values. Another thing you sometimes look for is values that are way off the chart, where a measurement might be off or equipment miscalibrated if you're in a scientific field. The next step is to separate, or reformat, our data set. We usually use a capital X to denote the features we're working with and a lowercase y to denote what we're looking for, in this case quality. So X is wine.drop('quality', axis=1), all the features minus quality; make sure you have axis=1 in there, since drop works on rows (axis=0) by default and we want to drop a column. And y, since we removed quality from X, is wine['quality'], just the quality column. We run that, and now we've separated the features we'll use to predict the quality of the wine from the quality itself. Next, if you're going to fit a model you have to know how good your model is, so we split the data into train and test sets. This uses one of the packages we imported from sklearn, train_test_split, and we call it with X and y, a test
size of 0.2, and random_state=42. This returns four variables, and the most common naming you'll see is capital X_train, the data we train with; X_test, the data we keep on the side to test with; y_train (y, remember, stands for the quality, the answer we're looking for); and y_test. When we train we use X_train and y_train, and then y_test shows how well our model does on X_test. train_test_split, going back up to the top, came from sklearn.model_selection. There are a lot of ways to split data; when you're building your first model you probably start with the basics, one set for training and one for testing. Our test size is 0.2, or 20 percent, and random_state is just a seed for the random shuffle, so that's not too important; we're randomly selecting which rows go where. Since this is the most common way, it's what we'll use today. There's also the approach of splitting the data into thirds and rotating, combining two thirds for training and one third for testing each time, so you go through all the data and get three different test results; that's cross-validation, which is pretty cool, and you could do it here by splitting the data into thirds yourself. But a single train/test split works fine for most projects, especially when you're starting out. So we have our X_train, X_test, y_train and y_test, and then we need to do the scaling. Let's talk about this, because it's really important: some models do not need scaling, but most do.
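The feature/target separation and the train/test split, sketched (the stand-in DataFrame is made up; the drop call, test_size and random_state follow the narration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the encoded wine DataFrame (made-up rows)
wine = pd.DataFrame({'alcohol': range(10),
                     'pH': [3.1] * 10,
                     'quality': [0] * 8 + [1] * 2})

X = wine.drop('quality', axis=1)  # every feature except the target
y = wine['quality']               # the target column

# 80/20 split; a fixed random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```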
So we create our scaler variable; we'll call it sc = StandardScaler(), and if you remember we imported that along with the LabelEncoder. This is going to level the playing field between features. If you scroll back up, some values run around fifty, sixty, forty, even 90-something; total sulfur dioxide has these huge values coming into our model, and some models would look at that and become very biased toward sulfur dioxide, giving it the hugest impact, while a feature like chlorides, with values like 0.076 or 0.098, would have very little impact just because the numbers are so small. So we take the scaler and level the playing field, putting every feature on a comparable scale. Let's take a look: we start with X_train = sc.fit_transform(X_train); we talked about fit_transform earlier, the sklearn idiom that both fits and transforms, here fitting on X_train and writing the transformed values back into X_train. And if we transform X_train we also need to do it to the test set, and this is important: you do not want to refit on the test data. We use the same fit we learned on the training data, otherwise you get different results, so for the test side it's transform only, X_test = sc.transform(X_test), not fit_transform. We run that, and just to get an idea, let's print the first ten rows of X_train, very much the way you'd do head on a DataFrame. You can see our variables are now much more uniform; they've been put on the same scale, between comparable numbers. With the basic scaler you can fine-tune things, but I just let it use its defaults, which is fine for what we're doing; in most cases you don't really need to mess with it too much.
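The scaling step, sketched with made-up arrays; the key point from the narration is fit_transform on the training side and plain transform on the test side:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in train/test feature arrays (made-up numbers)
X_train = np.array([[10.0, 0.10], [20.0, 0.20], [30.0, 0.30]])
X_test = np.array([[15.0, 0.15]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
X_test = sc.transform(X_test)        # reuse that same fit; never refit on test

print(X_train.round(3))  # each column now has mean 0 and unit variance
```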
Looking at the train variable, the values stay within a small range around zero; that's what the standard scaler does, transforming each feature to mean 0 and unit standard deviation. I'll go ahead and cut that cell out of there. So before we actually build the models and start discussing the sklearn models we're going to use: we've covered a lot of ground here, and when you work with these models, most of the effort goes into prepping the data. We looked at the data, noticed it was semicolon-separated, and loaded it up. We found out there are no null values (it's hard to say 'no null values'; there's none, nobody, I can't say it), and of course we summed them up; if you had a lot of null values, that null summary would be really important. We pre-processed the quality column into bins, and this is something you might start playing with: maybe you don't want only the super-fine wine, the 7s and 8s; maybe you want to split it differently, so you can certainly adjust the bins, make them smaller, or lean more toward the lower quality so you get, say, medium-to-high quality. We gave it labels, again all in pandas, with the unique group names bad and good, bad being less than good. That's important: you don't know how many times people go through these models with the labels reversed and then wonder why the data doesn't look correct, so remember what you set up here and double-check it. Then we used our LabelEncoder to encode quality as bad/good, 0/1, and double-checked that that's what came out in the quality column. We threw it into a graph, because people like to see graphs; I don't know about you, but you start looking at all these numbers and all this text, and you get down here and say ah yes, this is how much
of the wine we're going to label as subpar, not good, and this is how much we're going to label as good. Then we got down to separating out our data so it's ready to go into the models. The models take an X and a y: X is all of our features minus the one we're predicting, and y is the feature we're looking for, so for X we dropped quality and for y we kept only quality. Then, because we need a training set and a test set to see how good our models do, we split the data into X_train, X_test, y_train and y_test using train_test_split, part of the sklearn package, with a test size of 0.2 or 20 percent (the default is 0.25, so if you leave it out you get that), and random_state=42 (leave that out and the default is None, meaning a different random split each run). Finally we scaled the data, and this is so important: going back up, if one feature comes out around a hundred it's going to really outweigh one that's 0.071. Not in all the models; different models handle it differently, and I'll talk a little about that as we look at them. We're going to look at just three models today, three of the top models used for this kind of task, and see how their numbers compare. Let me change my cell here to Markdown, there we go, so we have a heading. We'll start with the random forest classifier; the three contenders are the random forest classifier, the support vector classifier, and a neural network. We start with the random forest classifier because it has the fewest moving parts to fine-tune. We'll call it rfc, for random forest classifier, and if you remember we imported that, so let me go back up to the top real quick, where we did the import of the
RandomForestClassifier from sklearn.ensemble. While we're up here, let me also point out the svm import, where we got our support vector classifier (SVM stands for support vector machine, and SVC is the support vector classifier), and our neural network, the multi-layer perceptron classifier from sklearn.neural_network. Kind of a mouthful, that perceptron; don't worry too much about the name, it's just a neural network with a lot of different options in its setup, which is where the perceptron name comes from. So we have our three different models; we'll go through them one by one and then weigh them against each other, using a confusion matrix, also from the sklearn package, to see how good each model does on our split. Back down below, we have rfc = RandomForestClassifier(n_estimators=200). n_estimators is the main value you play with in a random forest classifier: how many trees in the forest, in other words how many models are in there. That makes it a pretty good starter model, because you're only playing with one number and it's pretty clear what it means. You can lower this number or raise it; usually you start with a higher number and then bring it down, checking that the score stays the same, because the smaller the model, the better, and the easier it is to send out to somebody else if you're going to distribute it. Everything I read says the random forest classifier is geared toward medium-sized data sets; you can run it on big data and obviously on smaller data, but it's tested in the mid-range. Then we take our rfc and call rfc.fit(X_train, y_train), sending it the features and, in y_train, the quality we want to predict; a simple fit. Remember, this is sklearn, so everything is fit or transform;
another one is predict, which we'll do in just a second. Let's do that now: pred_rfc = rfc.predict(X_test). What are we predicting on? We trained with our training values, so now we use our test set, X_test. And that's it: three lines of code to create our random forest variable, fit our training data (in this case building 200 different trees), and predict. Let me run that, and we can look at pred_rfc[:20], the first 20 values, much the way you'd do head on a DataFrame. In our first 20 predictions, three wines make the cut and the other 17 don't: 17 bad quality and 3 good quality in our predicted values. And remember, this is based on our test set: the first 20 rows of X_test have all the different features listed, and they've been scaled, so they're a little confusing to look at and hard to read, values like -0.36, 0.164, -0.09 and so on. I think I misspoke earlier about the exact range; the scaled values mostly sit in a small band around zero, which is what the standard scaler should give us. I'll cut that cell out and run this, and now we've run the prediction and have predicted values. What do we do with them? We want to see how our model performed; that's the whole reason for splitting into training and testing sets. For that, if you remember, we imported the classification report, again from sklearn, alongside the confusion matrix.
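The three random-forest lines, sketched end to end (the training data here is synthetic; in the video X_train and X_test come from the wine split):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the scaled wine data
rng = np.random.RandomState(0)
X_train = rng.randn(100, 4)
y_train = (X_train[:, 0] > 0).astype(int)  # made-up 0/1 labels
X_test = rng.randn(20, 4)

rfc = RandomForestClassifier(n_estimators=200)  # 200 trees in the forest
rfc.fit(X_train, y_train)         # fit on the training features and labels
pred_rfc = rfc.predict(X_test)    # predict on the held-out rows
print(pred_rfc[:20])              # first 20 predictions, 0 = bad, 1 = good
```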
The classification report actually sits on top of the confusion matrix, so it uses that information. For our classification report we compare y_test, the actual values, against pred_rfc, the predictions. We print the report and take a look: for class 0, the bad wines, precision is about 0.92 — of the wines labeled bad, 92% were actually bad — and for the quality wines precision is running about 78%, which overall gives us roughly 90%. You can also see the f1-score, support, and recall in there. The confusion matrix gives a little more information, and since we want to compare the random forest classifier with the other two models, let's go ahead and print it too, with y_test and pred_rfc. In the confusion matrix we can see we had 266 correct and 7 wrong — those are the mislabels for bad wine — and a lot more mislabels for good wine. So our quality labels aren't that good: we're good at predicting bad wine, not so good at predicting whether a wine is good quality. Important to note. That's our basic random forest classifier. Let me change the cell type to markdown and run it so we have a nice label, and let's look at our SVM classifier, our support vector model. This should look familiar: we create clf, just like we created rfc, and call clf.fit(X_train, y_train), identical to above. And just like before, we do the prediction: pred_clf = clf.predict(X_test). Right about now you may realize that you can take these different models and just
create a loop to go through your different models and feed the data in — that's how sklearn was designed, to give you that ability. Let's run this, then do our classification report; I'm just going to copy it from above. They say you shouldn't copy and paste your code, because when you go in and edit it you inevitably miss something, but we only have two lines, so I think I'm safe today. Let's run it and see how the SVM classifier came out. Up here we had 90 percent, and down here we're running about 86 percent, so it's not doing as well. Now remember, we randomly split the data, so if I ran this a bunch of times you'd see some changes down here; with this much data the numbers would probably stay within plus or minus three or four points, and over a hundred runs the two would come out almost the same in classification quality. On the confusion matrix, this one had 22 and 25 where the other had 35 and 12, so it's not doing quite as well, and that shows up as 71 percent versus 78 percent. Before we jump into neural networks — the big cure-all, "deep learning", because everything else must be shallow learning (that's a joke) — let's talk a little about the SVM versus the random forest classifier. The SVM tends to work better on smaller data sets, and it's usually the fastest and easiest model to apply. The random forest tends to do better when you've converted things into numbers and bins, at least in my brief experience, or when you have a lot of raw data coming in. So both have their own benefits, and you'll find that when you run these a hundred times, the difference between the two on a data set like this just goes away.
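That loop over models could look something like this sketch — the model list, dictionary names, and use of `score` are my own illustration rather than the video's code, and the data is again a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Every sklearn estimator shares the same fit/predict interface,
# so comparing models is just a loop
models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "svm": SVC(),
    "neural net": MLPClassifier(hidden_layer_sizes=(11, 11, 11),
                                max_iter=500, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: {scores[name]:.2f}")
```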
There's randomness involved, depending on which data ended up in which split and how it was classified. The big one is neural networks, and what makes them nice is that they can take in huge amounts of data. For a project like this you probably don't need a neural network, but it's important to see how they work and how they come out differently. They can handle huge amounts of data, and in many respects they work really well with text analysis, especially when it's order-sensitive: more and more you have ordered text, and different ways have come out for feeding that data in where the sequence of the words really matters. Same with predicting the stock market: if you have tons of data coming in from different sources, a neural network can process it in a powerful way and pull out things that weren't seen before. And when I say lots of data, I'm not talking about just the daily highs and lows you can run an SVM on easily; I'm talking about data where maybe you've pulled the Twitter feeds and have word counts going, you've pulled the different news feeds for the business you're looking at, and the different reports as they're released — a neural network does really well with all of that coming in together. Picture processing is also moving heavily into neural networks: if you have a Pixel 2 or Pixel 3 phone from Google, it uses a neural network — it's kind of goofy, but you can put little Star Wars droids dancing around in your pictures, and that's all done with a neural network. So they have a lot of different uses, but they also require a lot of data and are a little heavy-handed for something like this. And this should now look familiar, because we've done it twice before: we have our
multi-layer perceptron classifier. We'll call it mlpc, and it's the MLPClassifier we imported. There are a lot of settings in here; the first is the hidden layers, which you have to specify. We're going to do three layers of 11 nodes each — that's how many nodes per layer — based on the fact that we have 11 features coming in. I just did three layers; you could probably get by with a lot less, but I didn't want to sit and play with it all afternoon. Again, this is one of those things you tune a lot, because the more hidden layers you have, the more resources you're using, you can run into overfitting problems with too many layers, and you also have to run more iterations. We set max_iter to 500 — the default is 200 — because I used three layers of eleven each, which is by the way kind of a default I use: about three layers deep and the number of features across. You'll see that's pretty common for a first classifier when you're working with neural networks, but it also means more iterations, so we upped them to 500, meaning it goes through the data 500 times to train those layers and carefully adjust them. (We do have full tutorials on neural networks you can look up to understand the settings a lot more.) Then, just like our previous models, we fit it — mlpc.fit(X_train, y_train) — and create our prediction: pred_mlpc = mlpc.predict(X_test). If I run that, we've now trained it and have our prediction, same as before. We'll copy the print again — and always be careful with copy and paste, because you run the chance of missing one of these variables.
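The MLP settings just walked through — three hidden layers of 11 nodes, and `max_iter` raised from the default 200 to 500 — look like this in code; the data is a synthetic stand-in for the scaled wine features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Neural nets want scaled inputs, just like the wine features in the video
sc = StandardScaler().fit(X_train)
X_train, X_test = sc.transform(X_train), sc.transform(X_test)

# Three hidden layers of 11 nodes each (one per feature); extra layers
# need extra passes through the data, hence max_iter=500
mlpc = MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=500,
                     random_state=42)
mlpc.fit(X_train, y_train)
pred_mlpc = mlpc.predict(X_test)
```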
If you're doing a lot of coding you might want to skip the copy and paste and just type it in. Let's run this and see what it looks like: we came up with 88%, compared with the 86% from our SVM classifier and the 90% from the random forest classifier. Keep in mind that random forest classifiers do well on mid-size data and the SVM on smaller amounts, although to be honest I don't think that's necessarily the split between the two, and these numbers would come together if you ran it a number of times. We can see down here the number of good wines mislabeled is on par with our random forest — 22 and 25, which shouldn't be a surprise, it's nearly identical — they just didn't do as well labeling the bad wines: the random forest had 266 and 7, and down here we have 260 and 13, so it mislabeled a few more bad wines as good. So we've explored three basic classifiers, probably the three most widely used right now. If we open up the scikit-learn website under supervised learning, there's a linear model — we didn't do that one, though with most data you usually start with a linear model because it processes the quickest and uses the least resources — and you can see they have linear and quadratic models, kernel ridge, there's our support vector machine, stochastic gradient descent, nearest neighbors (another common one, very similar to the SVM), Gaussian processes, cross decomposition, naive Bayes (more of an academic one that I don't see used a lot, but it's the basis of a lot of other things), decision trees (another one used a lot), ensemble methods, multiclass and multilabel algorithms, feature selection, and neural networks, the other one we used. So you can see in sklearn there are
so many different options, and they've developed them over the years. We covered three of the most commonly used ones and went over a little of why they're different — the neural network partly just because it's fun to work in deep learning and not shallow learning (as I said, that's a joke; it doesn't mean the SVM is actually shallow). It covers a lot of things, and the same goes for the random forest classifier, and we noticed there are a number of other classifier options in there; these are just the three most common, and I'd probably throw in nearest neighbors and the decision tree, which is usually part of the random forest, depending on the backend you're using. Now, if I were in the shareholders' office I wouldn't want to leave them with only a confusion matrix — they need that information for making decisions, but we also want to give them one particular score. So from sklearn.metrics we import the accuracy score, and I'm just going to do this on the random forest, since that was our best model. We compute the accuracy score — and I forgot to print it, but remember that in a Jupyter notebook, leaving a variable as the last line of a cell displays it. The accuracy score we get is 90%, which matches the precision we saw up above, so you could quote either, but a lot of times people like to see it highlighted at the very end: this is the accuracy of this model. The final stage is to use the model on future data. Let's take our wine data — if you remember, we do wine.head(10) and run that. Remember our original data set: we've gone through so many steps, and now we're going back to the original data. We can see our top 10 on the list; only two of them make the cut as having high enough quality for us to be interested. Then let's go ahead and create some new data.
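The single headline number described above comes from `accuracy_score`; here's a sketch on stand-in data (the 90% in the video comes from the real wine split, so these numbers will differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(n_estimators=200, random_state=42)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

# One number for the shareholders: fraction of test wines labeled correctly
acc = accuracy_score(y_test, pred_rfc)
print(acc)  # in Jupyter, leaving `acc` as the cell's last line also displays it
```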
We'll call it X_new. We just randomly picked some values that look an awful lot like the other numbers in the data, which is what they should look like — X_new = [[7.3, 0.58, ...]] and so on. And then — this is so important, this is the step people forget — X_new = sc.transform(X_new). Remember sc, the StandardScaler variable we created? If you go right back up to before we did anything else, we created sc, we fit it, and we transformed the training data, so now we need to transform the data we're going to feed in the same way. So we go back down, transform X_new, and then use our random forest — if you remember, it's just rfc.predict. So y_new = rfc.predict(X_new), and it's nice to know what it actually puts out, so we print our prediction for this wine and... oh, it's a bad wine. OK, so we didn't pick out a good wine for our X_new, and that should be expected: if you remember, only a small percentage of the wines met our quality requirements. So we look at this and say, oh, we'll have to try another wine — which is fine by me, because I like trying new wines, and I certainly have a collection of old wine bottles, very few of which make the cut. But you can see we've gone through the whole process. A quick rehash: we had our imports — we touched a lot of sklearn: our random forest, our SVM, and our MLP classifier, so we had our support vector classifier, our random forest, and our neural network, three of the top used classifiers in the sklearn system. We also had our confusion matrix and classification report, our StandardScaler for scaling, and our LabelEncoder. And of course we needed to split our data into train and test.
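The easy-to-forget step above — pushing new raw measurements through the same fitted scaler before predicting — can be sketched like this. The eleven feature values below are made-up stand-ins, not the row from the video, and the training data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in training data: 11 features, like the wine set
X, y = make_classification(n_samples=1000, n_features=11, random_state=42)

sc = StandardScaler().fit(X)   # fit the scaler ONCE, on the training data
rfc = RandomForestClassifier(n_estimators=200, random_state=42)
rfc.fit(sc.transform(X), y)

# One new, unscaled measurement (eleven made-up values)
X_new = [[7.3, 0.58, 0.0, 2.0, 0.07, 15.0, 45.0, 0.99, 3.3, 0.5, 9.5]]

X_new = sc.transform(X_new)    # reuse the SAME scaler -- the step people forget
y_new = rfc.predict(X_new)
print(y_new)                   # e.g. [0] for a bad wine, [1] for a good one
```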
We explored the data for null values, set up our quality into bins, took a look at what we actually have, and put up a nice little plot to show the quality we're looking at. Then we went through our three different models — and it's always interesting: you spend so much time getting to the models, and then you play with them until you get the best training without becoming biased. That's always the challenge: not over-training your data to the point where you're fitting it to the test values. Finally, we actually used the model and applied it to a new wine, which unfortunately didn't make the cut — it'll be the one we drink a glass of and save the rest for cooking, according to the random forest, of course, since that was the best model we came up with. That completes part one of our sklearn tutorial. If you have questions or want copies of the code, feel free to leave a note on the YouTube video below or visit www.simplilearn.com — they'll be happy to answer any questions you have — or drop into the forums at simplilearn.com. Again, my name is Richard Kirchner with the Simplilearn team: www.simplilearn.com, get certified, get ahead. Thank you for joining us today. If you liked this video, subscribe to the Simplilearn YouTube channel and click the links to watch similar videos, nerd up, and get certified.