Hi everyone, my name is Vincent and this is a photo of me. In this series of videos I am going to explain scikit-learn to you. Scikit-learn is the go-to library for machine learning, with an amazing ecosystem of plugins, and in this series of videos I will highlight some of the most important parts to understand if you're going to be working with this tool. What's important to note is that the videos I'm about to show you originally appeared on calmcode; that website features short videos about a topic in a series, and what I've done is concatenate the videos about scikit-learn together for this freeCodeCamp video. That means that this long video is split up into different sections, and each section will highlight a different but relevant scikit-learn topic. Note that all the code for the videos you're about to see is available on GitHub; you'll find a link in the show notes, and you can download each notebook yourself to run locally, but you can also use a service like Google Colab.

So let's talk about the topics that we will be discussing in this video. In the first section we'll discuss some high-level topics involved with scikit-learn. They are mainly there to help you appreciate how to construct machine learning pipelines, but the videos will also try to help explain why machine learning is still hard in practice; after all, machine learning tends to go beyond just optimizing a model. After that we've got a segment where we will talk about the pre-processing tools you can use with a scikit-learn model: the performance of the model really depends a lot on how the data is pre-processed, and understanding this pre-processing step tends to make your models a lot better. Next, if you want to judge a model, you'll also need to think about how to quantify its qualities, so in that section we'll talk about the metrics that scikit-learn offers, but also how to build custom metrics to judge your machine learning models on for your own specific use case. After that we're going to discuss meta estimators. The idea here is that while you can pre-process data as a step in your machine learning pipeline, you sometimes also want to apply post-processing, and in order to properly explain this we will have to discuss these things called meta estimators. Then, finally, I would like to demonstrate a machine learning library that integrates with scikit-learn and tries to make machine learning a little bit more human. The tool is called human-learn and, full disclaimer, it's a tool that I made myself. The goal of this library is to show you how you might be able to benchmark the domain knowledge that's inside of your company before you resort to machine learning. I hope these topics sound interesting to you, but let's now get started with understanding scikit-learn.

Scikit-learn is possibly the most used machine learning tool in the world, and in this series of videos I would like to explain the general flow of how to use scikit-learn when you want to make predictions. Now, for reasons that will become clear later, we are going to use a very specific version of scikit-learn in this series of videos. In particular we'll be using version 0.23.0, and if you're in a Jupyter notebook you can use this command to make sure that you've got this specific version installed. It should also be said that this series of videos is different from some of the other series on the calmcode website: in this series it is extra important that you watch all of the videos, not just the first few.
You actually need to watch all of the videos to get a proper understanding of how to use scikit-learn appropriately. Also, the scikit-learn ecosystem is vast and it's impossible to do it justice in just a short series of videos. The goal here is more to give you a bit of an overview of how, in general, you should think and work inside of scikit-learn; there will be other series of videos that go more in depth on certain aspects of scikit-learn that I won't be able to do justice to here. That said, let's dive in.

The way scikit-learn works is: you start with some data, you eventually give it to a model, the model will learn from it, and then you will be able to make predictions. That's the general flow. However, let's be a little more specific, because just "giving data to a model" is a bit vague. What do we mean by giving data to a model? Typically, if we have a data set that's useful for predictions, we can split the data set up into two parts, and the common notation is to call one part of the data X and the other part y. Typically my data set X represents everything that I'm using to make a prediction, and my data set y contains the prediction that I'm interested in making. The use case that we're going to deal with in a moment has to do with house price prediction, so you can imagine that the y data set contains the house prices, and X contains information about the houses: things like square feet, how big the garden is, that sort of thing. If you split your data up in this fashion, the next thing you can do is pass it to your model, and then it's the model's job to learn the pattern such that we can predict y using X. This X and y notation comes from scientific articles, and because it's the notation that scikit-learn adheres to, it's also the notation that I will use in my videos.

What we'll do next is go back to our notebook and get ourselves a data set that scikit-learn provides, which gives us our X and y. Given that we have scikit-learn installed, I can say "from sklearn.datasets import", type "load", press autocomplete, and then I see a whole lot of data sets at my disposal; these data sets are meant for benchmarking and educational purposes. What I'll do is load the Boston data set; that's a data set that contains house prices of Boston, I believe from the 70s. Now, if I call load_boston as is, it's going to give me a dictionary with lots of items in it. However, there's a parameter that we can set called return_X_y, and if we set that to True then we get two arrays out: one array represents the house prices, in thousands of dollars, and the other contains the properties of the houses. One nice thing I can do is type "X, y =" and then I've got my X and my y that I can go ahead and use inside of a scikit-learn model.

So what we are about to do is make a model appear inside of scikit-learn, but just conceptually it is nice to point out what a model does. When you create a model it hasn't seen any data yet; there has to be this moment where we declare that this model learns from data. In scikit-learn that means there are two phases: phase one, where we create the model, and phase two, where the model has to learn from the data. In scikit-learn all models are just Python objects, and this learning step, in scikit-learn terminology, is often called the .fit step, and during that fit step we typically pass it both the X and the y data set.
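As a rough sketch of what that looks like in code (assuming scikit-learn 0.23, where load_boston still ships):

```python
from sklearn.datasets import load_boston

# return_X_y=True gives two arrays instead of a dictionary-like object
X, y = load_boston(return_X_y=True)

# X holds the house properties, y holds the prices in thousands of dollars
print(X.shape, y.shape)
```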
It's important to know that these are two separate phases that the model has to go through if we want to be able to make predictions. But given that we've highlighted this overview, let's go back to the notebook and show what the code might look like. So let's load up our first model. Scikit-learn comes with many, but one of my favourites is the k-nearest-neighbor model; you can find it in the neighbors submodule, and I will grab this KNeighborsRegressor. For the creation step, I will have a variable called model that points to this model object. At the moment there hasn't been a phase where the model was able to learn anything, so if I were now to call model.predict(X), as in "hey model, could you perhaps predict y using X", I would get an error, and that's because the model has not been fitted yet. However, if I now say "hey model, please fit yourself on X and y", then by calling this, the k-nearest-neighbor model will learn from the data as well as possible, such that if I now want to make a prediction it is able to do so, and you can verify yourself that the number of predictions we're making is equal to the number of rows in our X array.

I'll discuss later what this machine learning model does exactly, because you might wonder how the predictions get made, but one thing I would briefly like to point out is that I can also use another model. From sklearn.linear_model I'm just going to import LinearRegression, and the beauty of scikit-learn is that even though this is a completely different machine learning model that internally works very differently, the API is still exactly the same. If I replace the KNeighborsRegressor with this linear regression, I can run all the cells, and sure, the predictions will be different because it's a different model and the internals work differently, but the API is exactly the same: it's still .fit and .predict. It's this that makes scikit-learn very nice to use; you can stick to a general API and not have to worry too much about the internals, and that is very nice.

So I'm back in a Jupyter notebook, and what I've got is my trusty k-nearest-neighbor model. I am fitting that model such that later on I can make predictions with it, and what I've got now is this array of predictions, but what I've also got is the array of original values; these were the true values. One thing I typically like to do is just make a scatter plot, with the predictions on one axis and the true values on the other, and by looking at this we get a bit of an impression of how well the model is doing. It's just a plot, but it does give me the impression that in general it seems to pick up something of a signal and that there is indeed some form of correlation happening: when our model says the house price should be high, it seems to be high in reality as well. It's not perfect though, and it's a little bit noisy here and there, and there's a couple of reasons for why that may be.
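Here's a minimal sketch of the fit/predict flow just described, including the swap to a linear model and the scatter plot; the exact plotting calls are my own assumption:

```python
import matplotlib.pylab as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

model = KNeighborsRegressor()   # phase one: create the model, nothing is learned yet
model.fit(X, y)                 # phase two: learn from the data
pred = model.predict(X)         # one prediction per row in X

# a completely different model, but the exact same .fit/.predict API
model = LinearRegression()
pred = model.fit(X, y).predict(X)

plt.scatter(pred, y)            # predictions on one axis, true prices on the other
```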
But let's now talk about how this KNeighborsRegressor works. Let's consider a simpler version of the data set, where the data contains the square feet that a house has as well as, let's say, the proximity to a school; let's say it's just these two. And let's assume that if you have a big house and the proximity to schools is relatively small, then these red dots indicate that these are indeed the expensive houses, and you can imagine maybe the cheaper houses being over here. Now, the way a prediction is made when you're using a KNeighborsRegressor is this: let's say I want to make a prediction for this point right over here. What the model will do is start looking for its nearest neighbors; let's say we're looking for the nearest five, and I suppose that will be these. If I look at the distance from the point that I'm trying to predict to all the other points in this data set, I think these might be the nearest neighbors. The prediction for this data point will be the average of those five neighbors that we found, and that is how a prediction is made.

But here is where the tricky thing can happen. It might be that the proximity to a school is in miles, and it might very well be the case that you have one mile over here, two miles over here and three over here, but the square feet that we have on this axis, well, that can be well into the thousands. That means that the distance that will be used to find neighbors will be very different on this axis compared to that axis; because this x-axis just features larger numbers, that axis will have a much bigger effect on our end predictions, and that might not be what we want.

So this means we have to rethink what our model actually is. At the moment this is our high-level overview of what a machine learning model is: data going into the model, and then we get a prediction eventually. But maybe we have to rethink what a model exactly is here, because if I think about the example that we just saw, maybe before this data set X goes into this model, we would have to do some pre-processing first. We saw that the square footage can be in the thousands while school proximity can be in single digits, so there might be something to be said for applying pre-processing before the data touches the model. So let's draw that.

Here's the redrawn schematic. The idea is that we take our data set X and, before we give it to our k-nearest neighbor, we apply some sort of scaling, just to make sure that the effect each column can have on the prediction is on a level playing field, and doing this will make the k-nearest neighbor predict rather differently. When you think about it that way, maybe we need to redefine what we think a model is. Before, we said that it is this k-nearest neighbor that is the model, but maybe we should expand that idea: maybe everything inside of this box is the model. If the preprocessing has a large effect on the model itself, then for all intents and purposes we would like that to be part of our entire system. So, that said, maybe it's more this pipeline that I've drawn over here that should be regarded as the model. And it just so happens that in scikit-learn we have this notion of a pipeline, and this pipeline also has the API where we can call .fit on the entire pipeline, as well as .predict once it's trained. The reason that is so nice is that the pre-processing step also has to learn from the data in order to properly scale and normalize, and by putting everything into a pipeline we handle that automatically, such that we still interface with one object instead of many. I hope this overview paints a clear picture of why we like to have a new definition of what a model is, and what I'll do now is implement this in the Jupyter notebook.

So let's first import the parts that we need from scikit-learn. I will need to import, from the preprocessing module, something that can do scaling, and for that I will just use the StandardScaler object.
Next, I will also import the Pipeline object, and that allows me to chain processing steps after each other. So what I will do now, having imported these tools, is start a new Pipeline object. It needs a list of tuples, and each tuple is a pair of a name as well as a step. Do keep in mind that you have to pass an object here, not the class, so it's important that you use these brackets. After we've done scaling we would like to use our nearest neighbor, and this is the pipeline. What I can now do is just call pipe.fit(X, y) and this entire pipeline will train and fit itself, and what I can then do is replace the model that I had originally with the pipeline that now also scales, and when I run this we should see a new graphic appear.

I don't know about you, but this does look a bit better, because there's less noise. There is one other issue, though, that we've just introduced, so let's have a look at what's actually happening now, because we're cheating. I'm telling the pipeline to go ahead and predict using this data set X, but note that that's the same data set that we're using in the .fit moment: we are learning from the same data as we are judging on, and the k-nearest neighbor will now do something that's cheeky. Suppose that I want to make a prediction for a point that's, let's say, over here. What I will then do is grab the five nearest neighbors, and then I will make a prediction for this one point by taking the average. But we're not making a prediction like this in our scatter chart: the point that I'm trying to predict is a point that's in our original data set as well. So what we're actually doing is saying "suppose I want to predict this point, what are the nearest five neighbors", and well, the nearest five neighbors would include these four points as well as the original point itself. That means that I'm literally using the data point that I'm trying to predict in order to figure out if I'm doing the prediction well. As far as judging whether or not a model is good goes, this chart is giving us a view that is too optimistic, and I can force it, too: what I'll quickly do now is change the pipeline a little bit to emphasize what's currently going wrong. This KNeighborsRegressor has a few settings, and in particular there is this number of neighbors that we can set, so let's change this number from 5 to 1.
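A sketch of that pipeline, with n_neighbors set to 1 to reproduce the effect described next; the "scale" step name is my own choice, while the "model" name matches what comes up later:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline([
    ("scale", StandardScaler()),                    # note: an instance, not the class
    ("model", KNeighborsRegressor(n_neighbors=1)),  # only look at the single nearest neighbor
])

# fitting and predicting on the very same data: the chart will look too optimistic
pred = pipe.fit(X, y).predict(X)
```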
I will run every single step again, just to show you what the effect is. If I only select one neighbor, then the chart falsely suggests that we're making a perfect prediction, but the model is only able to do that because it's allowed to memorize the original data: the nearest neighbor here is the original data point. So this chart doesn't tell us anything about how it might predict points that are not in the original data set, and this is a big issue. We want our model to predict data that it's not seen before, and we cannot trust charts and statistics where the model is allowed to predict on the same data that it's trained on.

That brings us to two issues. The first issue is: how can we get a fair comparison going for ourselves? But perhaps another issue is: how do I go about picking an appropriate number of neighbors here? When you consider thoughts like this, you might realize that we have to review our understanding of what a model is one more time. Currently this is our belief of what a model is: we have this pipeline, and there can be multiple steps inside of it. But now we have another issue, because now we have some settings for the k-nearest neighbor. I might want to try out the system with one neighbor, or two, all the way up until ten, and I would like to pick this number-of-neighbors setting such that my model makes the best predictions. In order to figure out which predictions are the best, one thing we can do is compare our prediction with the original label, but as we've seen in a previous video, we have to be really careful here: we don't want to judge the model on the same data set as we're learning from. With that in mind, maybe we should do a trick with the data set, just to keep the methodology clean.

Here's the idea. I'm going to cut this data set up into, let's say, three different segments, and I'm also going to copy both data sets three times. First I'm going to give this part the "predict" name, and let's do it over here; I'll set the predict name here in the second copy, and I'll put predict down here in the third, and I'm going to declare the other parts for training. The first time around, this part of the data set is going to be used for training, and then, given that trained model, I can use the portion of the data that's not been used for training to test how well my predictions are going. In the next data copy I'm going to repeat the exercise, but a different portion of the data is going to be used for prediction as well as training, and finally the same thing happens here as well. The idea is that I'm going to call .fit and .predict, but I'm fitting on the green part and I'm predicting on the red part. This prevents me from ever predicting on data that I've used during training, but it does allow me to judge, in the predict section of my data, how good my predictions are. And the idea essentially is: if I just repeat this, then maybe I'll have a pretty good metric of performance for when I had one neighbor selected, when I had two neighbors selected, and when I had ten neighbors selected.

However, in scikit-learn, all of this is something that the pipeline will not handle for you. Instead there is a different object, and the name of this object is the GridSearchCV object. The idea behind it is that you can give this grid search a pipeline, and you can also give it a grid, like this number of neighbors over here, and internally it will perform cross-validation, which is the procedure that I've just explained.
By performing this cross-validation we have a methodology that is somewhat sound, and what I would argue here is that maybe this grid search object is the model that we should be thinking about; in scikit-learn this grid search also has a .fit as well as a .predict method attached. So let's turn this pipeline into a proper grid search. The first thing I need to do is make sure that I have the GridSearchCV object imported; you can import it from sklearn.model_selection, and the object that you're interested in is this GridSearchCV object. Given that we have this imported, I can start a new grid search object. To get started I need to pass an estimator, and an estimator is something that has a .fit as well as a .predict; the pipeline that I made earlier, this one over here, will do just fine. Next, I have to pass a parameter grid, and this parameter grid is going to represent all the settings that we would like to go over in our pipeline; in particular, the one that we're interested in changing is the number of neighbors in this KNeighborsRegressor. To set the grid we need to have the name of that parameter, and the easiest way to get there is to use the get_params method that is on every scikit-learn estimator, including this pipeline. When you run this, you will see all the settings that you're able to tweak; in particular, you'll notice that there is this number-of-neighbors property on the model. The name "model" that I have here corresponds with the name of this pipeline step, and the number of neighbors corresponds with the n_neighbors parameter in this object over here. For our intents and purposes the only thing I need to grab is this one, but know that you can grab more here if you would like to change more parameters. What I'm going to do is say "hey, these are all the values that I would like you to check", and I would really like this grid search to also do cross-validation, so let's set the cross-validation parameter to three, and now this is my model. Oops, forgot a comma there.

As far as grid search goes, I would like to mention that what we're doing here is relatively basic; I will leave links to other video series that go more in depth, but the main thing I want to give you is a high-level overview of what you will end up doing. When you have a grid search like this, you are simply going to call model.fit(X, y), just like with every other estimator we've seen so far, except in the grid search all sorts of settings and cross-validation are handled on our behalf, so we don't have to write that code ourselves. However, once this has trained, there is this very interesting property available called cv_results_, and note that this property ends with an underscore. The grid search will train, and for every cross-validation split and every setting it keeps track of a couple of numbers, and I can give the dictionary that comes out of that to a pandas DataFrame. Here you see all sorts of statistics: you see how long it took to fit the entire thing, and for every parameter that we have and every cross-validation split that we've made, we can see how well it did on a certain score, and we can also see which result was the best. At this point in time it seems like we found that this might be the best setting, and that's interesting.
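Putting those pieces together, a sketch of the grid search as described (variable names and the exact neighbor range are my own):

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV

mod = GridSearchCV(
    estimator=pipe,
    param_grid={"model__n_neighbors": list(range(1, 11))},  # the name comes from pipe.get_params()
    cv=3,
)
mod.fit(X, y)

# one row per tried setting, with timings and a score per cross-validation split
pd.DataFrame(mod.cv_results_)
```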
At this point we could spend a little bit of time analyzing this generated data set to figure out why that might be. The main point I'm trying to make so far, though, is that with only a very small amount of code we have a fairly mature pipeline, and we also have a proper set of lego bricks to build machine learning systems with. If you'll be using scikit-learn a lot, this is the pattern that you are eventually going to be aiming for: we have a proper pipeline that we can tweak quite easily, it's clear what steps are happening, the steps are reproducible, and as far as methodology goes we can argue that we're doing a couple of things right because we're using this grid search object. So try to stick to this pattern whenever you're using scikit-learn. The system of fit and predict that scikit-learn offers, and the way that it allows you to construct pipelines, is something to appreciate.

That said, we could ask ourselves: have we now found a model that can go to production? Have we done our work? Are we now proper data scientists? And the answer is no, and I have to highlight why; this is also the reason why it's very important that you watch all of the videos in this series. So far we have been using the scikit-learn API appropriately. I use the word "appropriately" here in the sense that we've been using its building blocks in the right way: we clicked together a pipeline, we've been using a grid search, and so far I would argue these are all good things. As far as an approach to a data science project goes, though, we could not have done it worse, and I would like to explain why. We've been using this load_boston data set, but during the entire analysis we've not taken any time, or even bothered, to look at what's actually inside of this data set.

When I just run load_boston as is, you will notice that I get a dictionary, and one of the things inside of this dictionary is this description tag, so I can get a description of all the variables that are in the data set. If I want nice output I have to print what is returned to me here, so I'll do that, and now I can have a little look at what we're actually dealing with. First of all, you can wonder: is 506 houses enough to give us a lot of confidence in our model? Maybe not. We can also ask ourselves: from what year is this data? It might be that this data is really old and that the world has moved on in a way that this data set doesn't really represent what's currently happening. That's also a valid concern. But it gets worse. We also never bothered to look at what attributes are actually in the data set X that we're using to make a prediction. We have things like crime in the neighborhood, we have things like how industrial the area is, but the one that really concerned me when I first looked at this is this one: apparently this is a data set where the proportion of Black residents in your town is something that is actually being used to predict a house price. Now, I don't know about you, but I'm having a really hard time coming up with a use case for this data set that is appropriate. Looking at this feature, it's clear to me that we have the potential for a racist algorithm, and that is a property I don't want to have in production. There are a lot of things that can go wrong when we deploy a machine learning model, and we've discussed methodology, and those are fine topics, it's good to worry about them, but grid search is simply not enough.
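For reference, that inspection step is a one-liner (again assuming a scikit-learn version where load_boston is still available):

```python
from sklearn.datasets import load_boston

# the object returned by load_boston() has a DESCR entry with the variable descriptions
print(load_boston()["DESCR"])
```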
What has been bothering me is that this data set has been used for so long, in so many different courses, without even looking at the variables that are in there. Putting a model that has this feature in it into production is incredibly naive, and I think as a profession it really helps if we just do better. This is also why scikit-learn has now chosen to remove this data set: load_boston will, after a few releases, be gone, and this was the reason I had to pin the version number of scikit-learn in the beginning, mainly because the load_boston data set will no longer be available in a future release.

Remember this chart that we made halfway into this video series. It contains our predicted values versus our true values, and this chart was generated because we were evaluating on the same data set that we trained on, and the danger is that the chart suggested that our model was better than it is, even though the model was effectively leaking information. If we had blindly trusted this model, the risk is that it might have gone to production with very high expectations, and the results clearly would not have been great. The only way to catch these sorts of things in practice is to remain skeptical. You should always feel free to distrust the model and to try to test every weakness that you can come up with; only after a long period and lots of stress tests are you allowed to put a little bit of faith into the model that you have. And note that the same thing that happened in this chart also happened to us while we were doing the grid search. The grid search introduced a methodology, and we certainly have statistical concerns when it comes to a model: if it has poor predictive power then it won't be very useful in practice. But we should mention that there is a danger in doing this grid search: it may give you the impression that you're doing the right thing. After all, numbers go up, and you might get optimistic about the quality of your model, and it's exactly this optimism that might cause you to develop a blind spot. It's during the time you make these optimistic charts, and when you see numbers go up in your grid search, that you forget to think about things like "hey, what's actually in my data?". I hope the load_boston data set, as we've seen in the previous video, makes it clear that you cannot just blindly put any data into a model; you actually have to spend some effort to understand what you have at hand. There are many things that can and have gone wrong in the application of machine learning models, and I would argue that it's also your responsibility to educate yourself to make sure that it doesn't happen to you. If the output of a machine learning model is your responsibility, then so is the data going in. So please use scikit-learn and its amazing lego-brick-like features, but also understand that scikit-learn typically is the easy part of the profession. The hard part is understanding the story behind the data set and understanding what might go wrong when you put a model into production, and it means that it will be good to make sure that you're up to date on themes of ethics in algorithmic design, but also on topics like feedback mechanisms, and to consider fallback scenarios for when things go wrong in production.

That concludes the scikit-learn portion of this video. This would be a good time to grab a drink or have a quick break, and in the next portion we'll talk about pre-processing tools inside of scikit-learn. Quite typically in scikit-learn you'll have your labels and you'll have the data that you want to use to make a prediction.
Both of these will eventually be passed to a model, and then you'll have something that is able to make a prediction. But the idea of what a model is can be extended here, because, more often than not, it's a pretty good idea to first transform the data that you're using for the prediction; the reason for doing that is that the model performance will just be a bit better. So what I figured might be a good idea is to spend some videos explaining some of the more frequent transformers that people tend to use, and also to show that it's important not to forget about these transformers, because they really do matter in your pipeline.

So what I've done here is start a notebook; I've imported numpy, pandas and matplotlib, and I've imported this data set called drawdata1. It's a csv file; it has a column x, a column y and a column z, and z has two values, either a or b. The idea is that these are just some numbers, and we're trying to predict this one column over here. Using matplotlib we can show what the data set looks like: we've got a group of data points over here and another group of data points over here, and we should notice that there seems to be a small group of outliers over here, as well as a small group of outliers over here. Another thing you should be aware of is that the y-axis is on a completely different scale than the x-axis, and that can be bothersome. The effect that these axes will have depends on your algorithm, but in general you can imagine that algorithms are sensitive to this kind of thing. So a large chunk of your pre-processing, in this case, is going to revolve around scaling: we want to rescale this data such that there's still information in there, but it's just numerically a bit more stable, because the x and y axes are a little more in line with each other.

As you might be able to guess from the name, a standard way of doing this is using the StandardScaler from scikit-learn. What it does is, for each column, calculate the mean as well as the variance, the idea being that if you have a data point x and you subtract the mean of x from it and then divide by the square root of the variance, then you're going to have something that revolves around zero, and this will also have a variance that's kept at bay. So what I'll now do is use the StandardScaler to rescale the data set that you see here. To use a StandardScaler we first have to import it, and next we have to create a scaler object, but from there we should be able to call fit_transform and give it our data set X, and this will be our new transformed data set. What I'll just go ahead and do is copy this plot code and put it down here, so that I can feed it the new values for X and we can see what it looks like. One thing you should notice at this point is that these axes are numerically much more similar, but it's not exactly perfect: it seems that this spread is about 8 units whereas the spread over here is more like three and a half, and we can also observe that nothing really happens with these outliers. So the standard scaler is doing things we like, but it does make you wonder: is there maybe another way of scaling this?
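A minimal sketch of the scaling step just described, assuming the drawdata1 csv was read into a dataframe called df with numeric columns x and y:

```python
from sklearn.preprocessing import StandardScaler

X = df[["x", "y"]].values   # the two numeric columns that live on very different scales

# for each column: subtract the mean, then divide by the square root of the variance
X_new = StandardScaler().fit_transform(X)
```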
To further demonstrate what might be a weakness of the standard scaler, I figured I would generate a data set to make the point a little bit more tangible. What I've got here is just some data that has a couple of outliers on one end, and what I'm going to do is say: let's take that data set, subtract from it the mean of the data set, and then divide by the standard deviation. Whenever I run this I get slightly different data, because I'm simulating it, but as I run it over and over you should notice a few things. Yes, for starters, the numbers we have here on the x-axis are definitely scaled, so you could argue that's a good thing, but the downside is that we still have outliers, and depending on the algorithm that you're using, outliers will make life a little bit harder for you. So let's instead conceptually come up with a different way of normalizing, where these outliers are a little bit less of a problem.

Let's say that this is my original data set. What I could do is calculate the mean value, which would probably be around here, and I could then say "let's standardize around that", but let's calculate some other values instead; let's ignore the mean for now. Instead, I could calculate the quantiles. I could imagine, for example, that the 50% quantile is over here, which means that 50% of all the data is on this side of the line and 50% of the data is on that side of the line. I could imagine that maybe the 25% quantile is over here, meaning that 25% of all the data is on the left side and 75% is on the right side, and I think the 75% quantile will be somewhere over here, and let's say the 99% quantile might be over here. This is something we can go ahead and calculate, and if I were now to think "how can I project that onto something that's normalized", well, I could have a number line down below: I would have the number 50 halfway, the number 100 all the way to the right, the number 0 over here, 25 over here and 75 over here. I hope you can see that there's a mapping here, and notice that when I scale it this way, in this scaled representation the distance from the outlier to the 75% quantile is a lot smaller. That means that by using quantiles, as opposed to means and standard deviations, we may be able to get a more robust pre-processing step if there are outliers in the data.

So let's use this idea as a pre-processing step and see what the effect is on our data set. From sklearn.preprocessing I can now import the QuantileTransformer, and what I'm going to do is replace the StandardScaler with that QuantileTransformer. Before I run this, notice the axes, the numbers that are the minimum and the maximum, and also notice these outlier clusters; I'm now going to run this and show the effect. There's a warning: typically scikit-learn likes to calculate a thousand quantiles, and the data set that we gave it doesn't have enough data for that, so I'm just going to turn that number of quantiles into a hundred. But notice that the minimum as well as the maximum on both axes are now exactly equal to zero and one, and you might also recognize that the clusters we saw earlier are still in the data; it's just that they have a less profound effect, and the reason for that is that we're now using quantiles to transform and scale, as opposed to using the mean and standard deviation.
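The change in code is small; a sketch, reusing the same df assumption as before:

```python
from sklearn.preprocessing import QuantileTransformer

# 100 quantiles instead of the default 1000, since this data set is fairly small
tfm = QuantileTransformer(n_quantiles=100)

# with the default uniform output, both axes now run from exactly 0 to 1
X_new = tfm.fit_transform(df[["x", "y"]].values)
```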
In the previous video we showed that when you take your X matrix and pass it through a transformer, the quantile transformer, you can get a very different output, and what I would like to show in this video is that when you take that output and then pass it to a model, the predictions are also going to be very different. So what I have done is make this plot_output function; the function allows you to pass in a transformer, and it will then run all of this: it's going to train a model, a k-nearest-neighbor model, and it will then also produce some predictions. I just would like to show you what a profound effect this transformer can have in a predictive pipeline as well. I'll first run it with a standard scaler, and what you can see here is the original data that we start out with, followed by the transformed data, and then the predicted data, which is this final plot over here. What you can see is that everything around this area over here would get predicted as the yellow class and pretty much everything else as the purple one. Let's now compare that to the quantile transformer. Because the transformed data is much less influenced by the outliers, you'll notice that the model is bound to think that there's just this dividing line between the two classes, and that's how we're going to classify everything. If I compare the predicted-data plots you'll really see a big difference, and in this particular case I might argue that, as far as numerical stability goes, the quantile transformer does have benefits. It would still be a good idea to verify this with a grid search, but I hope you can imagine that the quantile transformer, even with a grid search, is just going to be more stable in the long run, because these groups of outliers are not going to have as much of an effect anymore.

Let's now have a look at a different data set. In this case I'm looking at drawdata2.csv, and this data set is special in the sense that it represents a classification task that is not linearly separable: it's not possible for me to draw a single line such that on one side of the line I'll have purple points and on the other side yellow points. This might make you think that a logistic regression is not going to be the right choice for this task, and that you may need a different algorithm to get a good classifier, so let's see if that is accurate. What you see right now is a pipeline that has a quantile transformer first, as a pre-processing step, and then a logistic regression, and the best thing the logistic regression has been able to do is come up with this line to separate the two classes. This is obviously a bad classifier: these two groups should be the same color, and the line should not separate things this way. But maybe we can fix this with preprocessing instead. If I consider the two axes that I have at my disposal here, then essentially what is happening is that the logistic regression gets x1 and x2, and at the moment it can only use those two columns to come up with a separating line. But what if I generate x1 times x2 as another feature? That's something the logistic regression might be able to use. And what about x1 to the power of 2, and x2 to the power of 2?
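The next step in the video uses scikit-learn's PolynomialFeatures for exactly this; as a preview, a sketch of such a pipeline (step names and settings are my approximation):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, PolynomialFeatures
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", QuantileTransformer(n_quantiles=100)),
    ("poly", PolynomialFeatures(degree=2)),  # adds x1*x2, x1**2 and x2**2 as extra columns
    ("model", LogisticRegression()),
])

pipe.fit(X, y)   # X, y here are the two drawn columns and the a/b label from the csv
```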
What you could say is that we can limit ourselves to the linear features, but we can also use those linear features to come up with non-linear features that our model might be able to use. So let's change the pipeline such that we add these features in as well, and then see what the effect is. From sklearn.preprocessing I can import PolynomialFeatures and just start a new object over here. Note that I'm not changing any of the input variables here, but one thing that's nice to point out is that I am only calculating the interactions at the moment, and only for a degree of 2, though I could increase this if I wanted to. Let's just see what the effect of this is. Booyah. If I scroll up now to compare, I would argue we have a near perfect classification, and granted, this is on the train set, so it's cheating a little bit, but what I do hope to have shown here is how much of an effect a single pre-processing step can have in your pipeline; the effects can actually be quite drastic.

So far we've been doing a lot of pre-processing on numeric data, but you can also imagine that we have data like this, where maybe we have classes of low, medium and high risk, and then it would be nice if we could do some pre-processing such that this text data becomes numeric data as well. The most common technique for that is the one-hot encoder. What this encoder will do is take in an array of text or categories and transform it into something that is indeed numeric. Now, if you just run this as is, you're going to get a data structure known as a sparse matrix; there's a setting, though, that we can change such that the sparsity is false, and then we can actually see what is inside of it. Note that the first two rows are indicated as low, and we can see that they indeed share the same column; then we see high over here, which is listed there, and then we see medium below, which is listed there. So we can see a form of correspondence, and that is indeed useful. The most common use case for this is that if this is, let's say, the class that you would like to predict, then this numeric representation is going to be the y array that you pass to scikit-learn, because this is something that scikit-learn can train on numerically. There is some behavior to be aware of, though, and it's not super relevant if you're generating labels, but it is relevant if you're using this one-hot encoder to encode information for the data that will predict the label. Let's say I grab the encoder now and I ask it to transform something that it's never seen before: notice that I'm asking it to give me an encoding for zero, but zero does not appear in the set that we performed the fit on. So we might wonder what's going to happen here. Well, we're going to get a big fat ValueError, saying "ValueError: found unknown category 0".
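A sketch of the behaviour just described; the example values are my own:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

arr = np.array(["low", "low", "high", "medium"]).reshape(-1, 1)

enc = OneHotEncoder(sparse=False)   # sparse=False so we get a plain array back
enc.fit_transform(arr)              # one column per category; equal labels share a column

enc.transform([["zero"]])           # a category the encoder never saw -> ValueError
```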
So essentially it is telling us: you're not allowed to give me data that I've never seen before. We can change the setting for this, though, because at the moment the handle_unknown parameter is set to "error", but we can change that to "ignore", and now if I run this it's not going to give me an error. What it's doing is saying: well, these are all zeros, or, put another way, this value is neither low, high nor medium, so we just get an all-zeros array back. One thing to finally note is that this is a very useful setting if you're generating your X matrix, so to say, but you don't want to do this if you're generating your y labels, because those are things you want to have very strict control over.

In this series of videos I've shown you some of the pre-processing steps that are available, but a very convenient way for you to play around with more of them, and to get a better understanding, is to go to this website called drawdata.xyz. Full disclaimer, it's a website that I made, but it allows you to quite literally make a drawing of a little bit of data, and that way you can play with it from your Jupyter notebook, and playing with preprocessors is the best way to learn about them. What you can do from here, once you've drawn the data set that you're interested in, is click this "download csv" button to download the file locally, but you can also copy the csv to your clipboard. What you can then do is type pandas.read_clipboard, and this will be able to read from your clipboard; the only thing you have to do manually is set the separator to a comma, because I think the clipboard is typically read as if it comes from Excel. What I can now do is just run this, and lo and behold, the data set that I was just drawing is now available to me here, and this is a really nice way to get a little bit playful with scikit-learn pre-processing steps and pipelines.
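The clipboard trick mentioned a moment ago is only a couple of lines; a sketch:

```python
import pandas as pd

# after clicking "copy csv" on drawdata.xyz; the clipboard content is comma separated
df = pd.read_clipboard(sep=",")
```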
I hope that this series of videos has inspired you to check out these pre-processing steps a little bit more in depth, but mainly I would like you to remember that these pre-processing steps really do matter: they can have a profound effect on what your algorithm ends up doing. And that concludes the pre-processing part of this video; next we will move on to metrics.

I was looking around for a fun data set when I found one on Kaggle. It's a relatively well-known data set about credit card fraud, and I figured it would be a nice example to explain something about scikit-learn. Kaggle is giving me a data set, and this data set consists of a label that I would like to learn as well as the data that I'll be using to predict that label. Typically you would start building your model, but you're not going to build one; typically you're going to make a few. So you're going to have model A all the way down to model Z, and these models might be the same type of model, but they're going to have different hyperparameters, and for all intents and purposes they will be different models because of it. All these different models are going to give a different prediction, and what we would like to do is pick the best model. In scikit-learn you would use a grid search for this, and what the grid search will do is take all of these predictions and compare them against our original label using a metric, and it's important that you get that metric right: select the wrong metric and you're going to pick the wrong model. So I figured it would be useful to spend a few videos explaining how scikit-learn goes about this metric, because it is a really important part of your machine learning pipeline.

Let's check the data set before we talk about metrics. I downloaded the data set into a csv file that's in my downloads folder and I'm reading it in with pandas; the data set is a little bit big, so I'm only reading in the first 80,000 rows. The data set has a lot of columns and most of them are anonymized. At the end of the data frame we see a class, which is the thing we're interested in predicting accurately, and we see an amount. The idea behind this data set is that we have all sorts of anonymized features, represented by columns v1 all the way down to v28, and all of these features describe characteristics of a transaction; we also know the amount involved in the transaction. To get this data frame into scikit-learn it helps to have it in numpy, and that's what I've done below: X contains all the columns that have a v in them, and y is the column that I'd like to predict. To emphasize what kind of data set this is, let's print some useful information: let's print the shape and let's also list the number of fraud cases. This looks like data that's ready for scikit-learn, but we have to be aware of one phenomenon, and that is the fact that the number of fraud cases is about 200, out of 80,000 cases. So it's safe to say that this data set is imbalanced: there are way more cases without fraud than with, and that is something to keep in mind.

Keeping that in mind, let's start with a simple model, and I'll just use logistic regression for now. First I'm going to import it, and then I'm going to say: hey, please fit on that data and make a prediction as well. When I run this I get an error saying that the total number of iterations has been reached, and what's probably happening is that, because this data set is so imbalanced, it's not converging within the number of iterations that it has. So I'll just set the maximum number of iterations to be higher; I think the number of iterations is initially a hundred, but let's set it to a thousand. Okay, so that works. What I'll now do is take the sum over all of the predictions that I have. This is by no means a perfect metric, but what I can observe is that, even predicting on the train set, the model detects fewer fraud cases than I actually have in my data set. So without a grid search I can already tell that this model could be doing better. Let's add a setting that might allow us to move the algorithm in the direction that we're interested in, and for logistic regression one convenient way of going about that is to specify the class weight. The class_weight is a dictionary that allows me to specify how much weight to assign to each class, and in particular the way that you should read it is that for class 0, which would be the non-fraud cases, we assign a weight of 1.
But for my class of 1, which would be the fraud cases, I'm saying: let's give that double the weight, the idea being that we're going to get more fraud cases selected. So let's run this and see what the effect is, and booyah, I'm able to find more fraud cases this way. This is a pretty good place to start: I have a setting here that I would like to optimize, and that means that right now I can start worrying about the grid search and about the metrics.

Now that this basic example works, let's get a grid search going so we can find the best value for this class weight, and I'll start with basic settings. Let's first import it, and next let's define our grid search. I'll first need to pass it an estimator, and that's our logistic regression in this case, and let's also set the maximum number of iterations to a thousand. The next thing I have to do is set the parameters, and I'm interested in changing this class weight, so I have to give it a list of settings to loop over; in particular, the settings have to be dictionaries, like so, and let's use a list comprehension for that. Next, let's specify how many cross-validations we want to do, and I'll just say 4 for now, and since I have a couple of cores on this machine I'll set the number of jobs to -1 so this grid search can run in parallel. My grid is now defined, and I can tell my grid to go ahead and fit. It's done training, so I'll start a new cell here, because this grid object now has a cv_results_ property, and it contains all of the results from the cross-validation. That's a dictionary with lots of values, but I can easily turn it into a data frame. When I look at these results I see the class weight appear, and I also see the scores: for every cross-validation split we see this score, and it's this score that the grid search is using to pick the best model.

But then we've got to wonder: how has it come up with this score? We didn't specify any metrics in our grid search, yet it is able to find this score right here. Look, there are no metrics. Where that score comes from is the model. I've got a logistic regression right here, and I can ask for the scoring method that is in there. We see that there's a bound method called score, and if I use two question marks I can see the implementation of it. This is the implementation, and I see a docstring, and if I look at the implementation it says that from metrics it's importing the accuracy score internally. So that means that this logistic regression has a score, and unless we specify otherwise, the score for logistic regression is just going to be accuracy. When I look down below at this mean test score, that makes sense: the model is predicting "no fraud" most of the time, so we're getting a really high accuracy there. But this is not the metric that I'm interested in, so let's change that.

Let's import some things from the scikit-learn metrics module; in particular I'll import the precision score and the recall score. The way these score functions work, and I'll take precision as a first example, is that you can pass them the true values, those are the values that should be predicted, and you can give them the predicted values, so I'll just predict some, and so far so good; I can do the same thing with recall. Now, precision and recall measure different things. What recall will tell me is: did I get all the fraud cases? And precision is saying: given that I predict fraud, how accurate am I?
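In code, both metric functions share the same (y_true, y_pred) signature; a small sketch, assuming the fitted logistic regression from above is called mod:

```python
from sklearn.metrics import precision_score, recall_score

y_pred = mod.predict(X)            # mod is the fitted logistic regression from above

print(precision_score(y, y_pred))  # of the cases flagged as fraud, how many really are fraud
print(recall_score(y, y_pred))     # of the actual fraud cases, how many did we catch
```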
You can imagine, in an extreme example, that if I were to predict that every single case is a fraud case, then the recall score is going to be really high and the precision score is going to be really low. In another extreme case, suppose I find one candidate that's fraudulent, but nobody else gets predicted as fraud; then I'll have a really high precision and a really low recall. You can imagine that if you optimize for either of these two, you're going to get a substantially different algorithm, and in this case the crux is: do we care more about false positives or false negatives? For now I'll assume that we care a little bit more about precision, so what I will do is take these two metric functions and add them to the grid search.

To add precision and recall to the grid search we have to pass a scoring dictionary, and I can say: hey, I've got my precision score and I've got my recall score. There's one extra thing we have to do, though: if you want to use these functions inside of a grid search, then we first have to pass them to the make_scorer function. We'll discuss later why scikit-learn makes this distinction, but by doing this we now say: let's add these metrics as well. The one extra thing I have to pass now is the refit parameter. If I just tell scikit-learn "these are the extra scores that I want you to keep track of", scikit-learn will do it, but if I want this grid search to select the best model based on one of these scores, then I have to explicitly mention which score it has to optimize over. So let's run this. It just ran, and one of the effects you now see is that we have the test precision listed here, as well as a test recall score. One extra thing we could add now: these are the test scores, which are useful, but sometimes it's also nice to see the train scores as well, so we can set a flag saying that we also want the train scores in our cross-validation results, and if we now scroll all the way to the back we should at some point see, yes, scores for the train set as well.

Since the grid search is now pretty well set up, it would be good to do a proper run, so I will change two things. For starters, I'm going to increase the number of cross-validations; this will take longer to run, but we should have more accurate metrics coming out. Next, I will replace this range(4) with a numpy linspace, which allows me to say: let's start at 1, end at 20, and have 30 steps in between. This should give me a higher resolution on the effect of this class weight, and again, by setting this value v higher I'm telling the algorithm to focus on the fraud cases. I will now run this, and when it's done running I'll show some charts that summarize it.

It's now done running and I've made a few charts. This is the first one: it shows the test results, with the class weight on this axis and the two scores on the y-axis. If you want a really good precision you have to be on this end of the spectrum, and if you want a really good recall you have to be on this end of the spectrum, and note that if you want a balance between the two, you want to be somewhere in the middle. So this is interesting, but you might wonder what the scores from the train sets tell us. Again we have our class weight here and our scores, but we get a completely different picture, so it's a good reminder that cross-validating your results is a good idea.
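Pulling the last few steps together, a sketch of the grid search with both metrics attached; the refit choice and the cross-validation count are my own guesses at what's on screen:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score, make_scorer

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"class_weight": [{0: 1, 1: v} for v in np.linspace(1, 20, 30)]},
    scoring={"precision": make_scorer(precision_score),
             "recall": make_scorer(recall_score)},
    refit="precision",         # which tracked score decides the "best" model
    return_train_score=True,   # also keep the train scores in cv_results_
    cv=10,                     # increased from 4; the exact number used isn't shown
    n_jobs=-1,
)
grid.fit(X, y)
```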
cross-validating results is a good idea because i've got two metrics now scikit-learn is able to optimize either of them so at the moment i'm able to either optimize for precision which is going to give me the model over here with very very high precision and very low recall or i can say hey pick the model with the best recall and then i have the opposite problem maybe instead i want to be here in the middle and there are two ways of going about that one might be to go for another metric that's inside a psychic learn in this particular case the f1 score is something that you might be interested in because the f1 score represents a balance between precision and recall but it might be more interesting instead to make our own scikit-learn does not support every single metric out there so it is good to be able to write your own sometimes and in this case i think it might be cool to have one that selects the minimum out of the recall and precision score if i select the best model based on this metric then i will have a model that balances the two so let's go and implement this now you might remember that the precision score function that we had that's the function that we're using over here but the signature of that function was that we said hey there's these true labels going in and there are these predicted labels going in and the output of this function was a number so let's use that as a template to create our own and i will call this the min recall precision and if i want to calculate the minimum between precision and recall i'll first calculate the recall next i'll calculate the precision and then i'll return the minimum of the two let's now add this to our grid search i'll just call that the minimum of both and again i gotta pass that to the make scorer and let's now also say that the grid search has the refit on it let's now run this it's done running now and let's have another look at the charts i updated them while the grid search was running so one thing that's interesting is it does seem that the grid search would now pick a model that's around this region over here i can't exactly see where but it makes sense it's close to where the two are balanced so that's interesting but you might wonder well the green line well that green line that's always lower than either of the two i might wonder well why is that i'll leave it as a small exercise for yourself to figure out why and you can check the appendix for the answer in the previous video we've made our own metric here and then we used it in the grid search over here but before we were able to use it we first had to pass it to this make scorer function and it wasn't just our own function we also did this for the precision and the recall as well so what's up with that so to show what is happening there let's make a scorer using our min precision recall function and then just let's just have a look at the implementation so the make score function apparently takes the min recall precision function that we had that's this one and it takes that function and then turns it into a predict scorer object and it has a signature in goes an estimator x y true and some form of sample weight so one way of looking at this is to say well we start with a metric function that accepts y true the true labels and y predicted the predicted labels and what makes score is doing is it's turning that into something else some other callable object where i can pass it the estimator my x data set my entire y labels and a sample weight as well and this is what the 
make score function does the idea behind it is that sometimes you would like to use the function this way if you're in a notebook and you just want to quickly use a metric then calling that directly can be useful but it would be a shame if you had to rewrite that function such that the grid search will be able to use it because the grid search likes to have something callable where you also have access to the estimator now what i just quickly want to do is rewrite this function over here such that i no longer need make scorer just to demonstrate that this is indeed the case and we'll do that using this signature over here so i've copied the signature this is now the input and let's now fill in this function well what i can do is i can first calculate the predicted y values from this estimator that is being passed in so that means i can just say s dot predict and now i should no longer need this make scorer so let's run this and it ran and let's now just check if we get the same chart excellent when you're looking at this min recall precision function then there's probably a lot of parts that you recognize for example the estimator that we have here that is going to be a logistic regression like we've defined here with the correct settings for the class weight and and you can also imagine that we have this x here and this y true and that these will be the train or test set labels and data sets but you might wonder well what's up with this sample weight over here and sample weight well that's an extra feature that can be passed along certain machine learning models allow you to say well this row is more important than this other row that's different than the class weight the class weight that we have over here that says hey this class should get more attention but the sample weight allows us to pass data that says this row is more important than that row and in the case that we're dealing with fraud where we have rows that resemble financial transactions you could maybe say that maybe the transaction amount maybe that's a valid setting the really big transactions that have millions of dollars in it well sure if there's fraud there that's much more important to catch than if it's only about a single dollar so one thing that i could do is i could rewrite this entire function to take that sample weight into account one thing that i would need to do though is i would also need to give a sample weight and that's something i have to do over here so let's say i'll do that and what i can do here is i can say hey take that original data frame and take the amount column that is the amount of the financial transaction and just for numerical stability what you can do is you can also maybe not take the exact amount but the logarithm of it this way if you have really big transactions then you prevent that we're overfitting on it and what i'm going to do now is i'm just going to run this and i'll leave this the way it is i'm not changing this if i have a principled way of taking that sample weight and translating that to a metric i want to optimize then that is definitely something i can now do here which is good that's a benefit that gives me flexibility but i don't have the financial expertise to come up with a good rule now so what i just want to briefly show is if we actually do this what is the effect because even though we're not changing the metric adding this effect will change the algorithm so i'll run this now and show you the summary charts to see what changed so just for reference this is the 
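(a rough sketch of the custom metric from the last two steps, in both flavours; the "Amount" column name and the logarithm are assumptions about the fraud dataframe, and passing sample_weight through .fit assumes the estimator's fit method accepts it, which LogisticRegression does)

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, make_scorer

# flavour 1: the plain (y_true, y_pred) signature, which needs make_scorer
def min_recall_precision(y_true, y_pred):
    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    return min(recall, precision)

scorer = make_scorer(min_recall_precision)

# flavour 2: the scorer signature, which the grid search can use directly
def min_recall_precision_scorer(est, X, y_true, sample_weight=None):
    y_pred = est.predict(X)
    return min(recall_score(y_true, y_pred), precision_score(y_true, y_pred))

# scoring={"min_both": min_recall_precision_scorer}, refit="min_both"

# the training sample weight goes in through .fit as a fit parameter
grid.fit(X, y, sample_weight=np.log(1 + df["Amount"]))
```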
original chart and this is where the balancing point is between precision and recall and let's now see how this compares so the dashed line is where the balance was before and the dotted line is where the balance is now so adding sample weight will definitely influence the algorithm here but it's good to know that we could also use the sample weight for our metric if we wanted to so let's consider a new approach to this problem we could consider that perhaps fraud is like an outlier it's a rare event but it's also something out of the ordinary so one could wonder sure we can have a classifier go in there and predict whether or not something is fraud but if we just have an outlier detection algorithm might that not correlate with fraud so what i would now like to do is replace the logistic regression with an outlier detector and then adapt these metrics such that we can check if this is the case so let's first write a little bit of code that can handle outlier detection i'll just start by importing the isolation forest algorithm and because it's an outlier detection algorithm when i call dot fit it doesn't need a label it just needs the data set i can still use it to make predictions though so let's just run this when i look at this output it seems like everything here is a 1 as if it's always predicting fraud but that's not the case to show that i'll just quickly import a counter object and i'll just count how often a one occurs and how often another number occurs and that's the thing scikit-learn assumes that a one represents not an outlier and that minus one does represent an outlier so if i want to translate these outlier detections what i got to do is i got to say well if these predictions are equal to negative 1 then i want to have a 1 because that indicates a label for fraud and 0 otherwise now with that observation done let's just briefly go back here and look at our metrics because a lot of stuff now will break these recall and precision scores they expect zero or one values at the moment not negative one so if i were to put an isolation forest here the metrics would break that is unless i write my own variant over here so i'll just quickly go ahead and do that and there we are i have made two functions over here that don't need the make_scorer like we saw before and by doing it this way these predictions are now turned from outlier predictions into fraud label predictions what i've now also changed is i have removed the logistic regression here and put in the isolation forest and i have now also put in the contamination factor here as the hyperparameter that i want to tune again i'll not go into the details on what that means but we could say it's something i want to optimize finally the two functions i have here again are being put into the scoring methods down here below and here's the amazing thing about this notice again that this isolation forest when you train an isolation forest you call it via fit and then you pass it an x matrix without y labels per se but because we are passing in y labels below here anyway we are able to use them here in our own custom metrics and that is something really flexible because it allows us to use outlier detection algorithms as if they were classifiers so what i'll now just quickly do is run this and then make a similar chart like what we had before so here's the similar chart on the x-axis again we have our hyperparameter and on the y-axis we have our scores and when looking at this it does seem that even though the outlier detection algorithm is able to detect some things it's not able to do so as well as the logistic regression that we had before
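(a sketch of the outlier detection variant described here, with custom scorers that translate the -1/+1 output of the isolation forest into 0/1 fraud labels; the contamination values are placeholders)

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score

def outlier_precision(est, X, y_true):
    # -1 means "outlier", which we treat as the fraud label 1
    preds = np.where(est.predict(X) == -1, 1, 0)
    return precision_score(y_true, preds)

def outlier_recall(est, X, y_true):
    preds = np.where(est.predict(X) == -1, 1, 0)
    return recall_score(y_true, preds)

grid = GridSearchCV(
    estimator=IsolationForest(),
    param_grid={"contamination": np.linspace(0.001, 0.02, 10)},
    scoring={"precision": outlier_precision, "recall": outlier_recall},
    refit="precision",
    cv=5,
    n_jobs=-1,
)
# IsolationForest.fit ignores y, but passing it here means the
# custom scorers above still receive the true labels
grid.fit(X, y)
```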
that said this can still be a useful feature to have in your pipeline but that is outside the scope of this series of videos what is amazing here is that we are able to use metrics to quickly judge if an outlier model would be useful in this classification problem in this series of videos i hope to have been able to show you how you can use metrics in your scikit-learn pipelines and that it actually offers a very flexible api that said i do think it's fair to point out that you don't always have to make your own metrics very often the metrics that scikit-learn has will be relevant to your project too you can have a scroll around there's metrics for classification but also for regression and it's definitely worth spending some time here on the documentation page before wrapping up though i do want to point out one small danger with the current approach the way that we model things is by taking our data frame passing it through a grid search to get our model while also checking the metrics and in terms of methodology this is a fair approach but there is one big danger and that's especially true in the case of fraud do we really trust these labels especially in the case of fraud it may be safe to assume that we don't have a complete picture of all the fraud cases it's a cat and mouse game so when i see a label for fraud i think i can assume that that's accurate but when there's no fraud that might just be fraud that's undetected and if we think a step further probably the labels that i have those are the labels of the easy cases those were the cases that got caught and if we then train a model that optimizes for the easier cases then we have to be honest we might have a model that has a blind spot for the harder to detect fraud cases so i must stress it is really important that we keep track of our metrics and that we take them very seriously because they have a huge impact on how we pick the model but typically that is not enough we still need to concern ourselves with the quality of these labels because if the labels are wrong then so is your metric in the previous video we were using the make_scorer function to take our custom metric and to turn that into something that the grid search was able to use and before delving in depth into what the make_scorer function does exactly i figured it'd be good to point out that when it comes to customization there's a couple of settings that you can still add not every metric will have the rule where greater is better if you take mean squared error for example in that case greater will be worse so that is something you want to specify otherwise the grid search might pick the wrong model and there are also metrics that depend on having a probability measure available as opposed to just a class and if you need the probability measure for your metric that's also something you ought to specify so for example let's say i take this min recall precision function and if greater was worse for this metric you could set greater_is_better to false and then you would pass this in your grid search now note that this is of course a bit of a detail but there will be moments when you need it so i figured it'd be good to at least mention it let's now move on to a relatively fancy feature of scikit-learn the ability to work with meta models usually in scikit-learn what you'll do is you'll make a pipeline and that pipeline
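(a quick sketch of those two make_scorer flags, using log loss as an example because it both prefers lower values and needs probabilities; the flag names match the scikit-learn version used in these videos)

```python
from sklearn.metrics import make_scorer, log_loss

# lower is better for log loss, and it needs predict_proba output,
# so both flags are set; this is roughly the built-in "neg_log_loss" scorer
neg_log_loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
```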
will have pre-processing tools maybe some featurizers and after that you'll have a machine learning model and the idea is that this pipeline is just a very convenient way to connect everything here however if you think about what you might want to have happen inside of a pipeline then there might be some steps you would like to have after you've made your model maybe there's some post processing you would like to do the way that the scikit-learn pipeline works however is that you're able to have many featurizers and pre-processing tools chained together but at the end it stops with this model the thing that can do fit predict that should be the final thing in your pipeline so that might make you wonder how can we go about some of these extra steps and post processing tools we might have things that we would like to grid search so it will be nice if there's a trick up our sleeves that allows us to still have access to these sorts of techniques in our scikit-learn pipeline over here now it's a bit of an advanced trick but the way to go about this is to think about meta estimators scikit-learn has a few of them but there's a lot of them implemented in other tools like scikit lego full disclosure this is a tool that i made but the idea is to have an estimator that can take this model and add extra behavior to it it's a powerful modeling technique so in this series of videos what i would like to do is highlight a few of these meta estimators and show you how you can use them to make your modeling pipelines a bit more expressive scikit-learn has a classifier that's known as the voting classifier and it's an example of a meta estimator to help explain what it does though let's consider this classification task it's definitely an artificial data set i am using the make classification method to generate it but what you can see here is that i've got these blobs of yellow points as well as purple points and what you can see is that there are some purple points around here where there's mainly yellow points and vice versa there's also a few yellow points around here now there are two kinds of models i suppose that you could make here if we were to say train a logistic regression on this data set then effectively what the logistic regression would do is it would pick the direction where it can make the most difference and then create one separating line in this case it probably will have a separating line somewhere over here and it will say that everything on this end of the line is supposed to be the yellow class and everything on this side of the line is supposed to be the purple one a different algorithm near his neighbor would do something slightly different whenever it needs to predict a point let's say a point over here it would look at all the neighbors let's say the nearest five and then make a prediction and that would mean that in this region over here it might predict a purple point and where the logistic regression is a little bit too general as in it's splitting up everything here on the left and the right you can imagine that the nearest neighbor is maybe too specific at times and you might want to have a way to balance these two things out and being able to balance these two that is something that the voting classifier can do the idea behind the voting classifier is that you can give it a list of estimators and you can also give it a weight for each estimator and that might allow you to say well i want this estimator to be weighted twice as heavily as this one the nice thing about this 
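(a minimal sketch of the soft voting setup described here, assuming X and y come from the make_classification example)

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    weights=[1.0, 1.0],  # one weight per estimator
    voting="soft",       # average the predict_proba values of both models
)
clf.fit(X, y)

# the weights can be tuned like any other hyperparameter, for example:
# GridSearchCV(clf, param_grid={"weights": [[1, 1], [1, 2], [2, 1]]}, cv=5)
```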
is that this weight over here is something that you can use inside of your grid search and that's really nice because that means you don't manually have to specify these weights you can have the grid search determine the best way to balance different models for your data set and here's an example of how the voting classifier works on this data set i've got my voting classifier over here that i've imported beforehand and then i kind of submit something that resembles a pipeline it's a list of tuples with the name that i associate with an estimator and then the estimator itself note that the first classifier going in here is this logistic regression and the second classifier over here is this k neighbors classifier but what i'm able to specify is i'm able to say well these are the weights i would like to associate with both models and i would like you to do soft voting and soft voting over here effectively means that we are averaging the predict proba values now to show the effect of this i've created a couple of charts this chart over here shows the original data and this chart over here shows the proba predictions of the first model this one and you can see we get the behavior that we expect there's a line that's separating everything and you're either on the left hand or right hand side of it this chart represents the predictions from the k neighbors classifier and you can definitely see that it's zooming in on a couple of areas we see darker colors appear in the yellow region and we also see lighter colors appear in the dark purple region and what you can see in this third chart are the predictions from this voting classifier and you can see that we basically smooth the predictions of these two models what i can do is i can say well let's give a somewhat higher weight to the k nearest neighbor model and by doing that you can see that these two charts are now definitely more similar but i can also give a higher weight to the logistic regression classifier as well so there's two things to observe here first of all there's merits to having such a voting classifier you can combine different models that work different ways this way which is nice but moreover the main thing that's happening here code wise is that we have an estimator here that takes as input other psychic learn estimators this voting classifier is adding behavior to both of these models and that can be a powerful modeling strategy note that if we put this voting classifier inside of a pipeline that this is still the final model at the end of it but we are able to add behavior because models can be used as input here and this way of thinking about models gives us a lot of flexibility and expressiveness let's now say that we have a slightly different data set again it's an artificial one made with the make blobs function in scikit-learn but given that we have a data set like this let's now talk about a different consideration for models let's say that we have a logistic regression that we'd like to fit on this data set you might get a separating line that's somewhere over here let's say the way that the separating line works if i were to draw it out on this axis over here if i were to draw what the probability is that my predicted value is equal to 1 which i'm associating with these yellow points then the predicted curve would look something like this now what's happening here is we have a threshold around where the probability is larger or lower than 0.5 when we actually say to which class a point belongs to you could wonder 
though what if we just move that threshold just slightly maybe move it somewhere over here maybe at 0.7 if we were to do that then this classification line would move upwards it would move over here that's the new one and that's the old one now the consequence of doing this is that your model might become less accurate overall but whenever we do predict that it's going to be a yellow point using this new line we're probably more sure of it in this region over here there's some purple points and there's some yellow points in this region over here there's still a couple of purple points left but it's less so by tuning this threshold over here we might have a nice knob to turn to exchange precision and recall in our model and the ability to do this is provided by the thresholder meta model that you can find in the scikit lego package the way that it works is similar to the voting classifier we have a meta model over here and as input it accepts an estimator in this case logistic regression thresholder accepts any cycler model but it only works in binary classification cases what you're then able to do is you can set this threshold value over here which again is something that you can put inside of a scikit-learn grid search and here's the effect you can see that if we set a very low threshold that maybe the optimal line that we would have had over here kind of moves down a bit over here this setting i assume will be great for recall but bad for precision and if we move the threshold to the other direction over here then we might have something that's very high precision but very low recall what i've got here is an example of how you can maybe use this inside of a grid search i have my modeling pipeline over here inside of that modeling pipeline i've got my one model and this model has this threshold parameter i can refer to that using model underscore underscore in my parameter grid here which i am using in my grid search and what i'm doing is i'm just looping over all sorts of values of this threshold and i'm keeping track of the precision the recall as well as the accuracy and if we scroll down to the results of this grid search we see some interesting things the blue line over here represents the precision and if we look at the threshold value then indeed it seems that the higher the threshold the more picky we are and then also the better the precision is as expected this does come at the cost of this recall over here which is the orange line and we can definitely see it plummet whenever we have a high threshold value now keep in mind that we're doing this on the test sets we're not reporting train numbers here something that's interesting is that for this particular data set it seems that as long as we remain let's say in this range of thresholds the accuracy doesn't really change too much but the precision and recall curves do so that means that you don't have to give up your accuracy to get a bit of a boost in either precision or recall and the nice thing about this threshold is that it's something that's easy to interpret and at the same time it's also a nice example of what i would call post processing in order to tune the threshold i need to have the model ready and it's this sort of post-processing steps that are probably best implemented as a meta model as a model that accepts another model as input what i've got here is a somewhat large pipeline that is predicting the weight of a chicken over time it's being applied to a data set that has a diet column that says something about 
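(a sketch of tuning the decision threshold as a post-processing step with the thresholder meta model from scikit-lego; the model__threshold name follows from the step name in the pipeline and the grid values are placeholders)

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklego.meta import Thresholder

pipe = Pipeline([
    # wrap the classifier so the decision threshold becomes a hyperparameter
    ("model", Thresholder(LogisticRegression(), threshold=0.5)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"model__threshold": np.linspace(0.1, 0.9, 50)},
    scoring={"precision": "precision", "recall": "recall", "accuracy": "accuracy"},
    refit="precision",
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
```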
what the chicken's eating different chickens have different foodstuffs and also time the idea is that chickens gain weight over time but they might gain more or less depending on the diet that they have the way that i'm going about that is i am selecting the diet column and i am one hot encoding that next what i'm doing is i'm taking the time column and i'm just passing that through a standard scaler schematically what is happening is whenever i get my new data set it is going into this pipeline that has a feature union which is splitting up the features into two buckets the time feature and the diet feature the time feature gets scaled the diet one gets encoded and then this becomes the data set that i can use for my machine learning and it's this data set that is being used down below here for my linear regression and here's what the predictions look like for every diet you see a new prediction line now the downside of the way that we've encoded our data is that this linear regression sees the one-hot encoded variable for these diets and the only way that it can deal with it is to see it as a constant shift and if we think about the possible effect that a diet could have then i hope that you might agree that we're not really modeling it the right way here the effect of the diet might be something else than adding a mere constant so then the question is can we maybe reuse this diet in a different way maybe instead of adding a feature that is one hot encoded for every diet can't we maybe group per this diet and then train a different model for each you would have model 1 for diet 1 model 2 for diet 2 etc and then maybe what we can do is we can say well whenever there's a new data set x coming into this model then what needs to happen is this internal grouping needs to figure out what diet this new data belongs to and then make a prediction using the appropriate model this behavior can also be implemented as a meta model and scikit lego has an implementation of exactly this and here's an implementation and the effect in this case we have a data frame that's going into this model over here so that means that i can use the column name here to describe the group but if it was a numpy like data structure then you can also refer to the column indices over here you'll note that when we do this the effect of this diet isn't simply a constant instead it's training a different linear regression each of these lines has their own intercept and their own slope and i hope that you can imagine that making models this way can be reasonable the main risk that you have to keep in mind is that it is possible that maybe one of these groups has way more data than another one of these groups so that is something to keep in mind if you're interested in doing something like this let's now consider a time series task there's a basic one over here and one of the properties of this time series is that it's also changing over time there's a seasonal pattern in here for sure but the shape of the seasonal effect is definitely getting amplified as time goes on so let's say you want to maybe model this one way of modeling this is that we can say well let's have a look at every month in the year and let's try to calculate the mean of every month and then that can be our model and what we can do is we can use the grouped predictor for that as well as a dummy regressor and if we were to then calculate the mean for every month the prediction would look something like this again note that we're using the dummy regressor from 
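(a sketch of fitting one linear regression per diet with the grouped predictor from scikit-lego, assuming the load_chicken dataset with diet, time and weight columns)

```python
from sklego.datasets import load_chicken
from sklego.meta import GroupedPredictor
from sklearn.linear_model import LinearRegression

df = load_chicken(as_frame=True)

# one LinearRegression per value of the "diet" column
mod = GroupedPredictor(LinearRegression(), groups=["diet"])
mod.fit(df[["diet", "time"]], df["weight"])

# each diet now gets its own intercept and its own slope
preds = mod.predict(df[["diet", "time"]])
```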
scikit-learn here and the grouped predictor that we saw in a previous video when you have a look at these predictions though you'll notice that something is definitely off it seems that we're really good at predicting this middle year but we're undershooting in the more recent years and we're overshooting in the far past so that means that we have to wonder well what can we maybe do here to make this just a little bit better and it's good to observe that this is something reasonably general in this particular case i'm definitely looking at a time series but it's not unreasonable to consider that the world is a moving target no matter what you're trying to predict there's something to be said for focusing in on the more recent data and maybe forgetting the data that's in the past now the interesting thing is that scikit-learn actually provides a small mechanism to deal with this not every model allows for this but some models do the dummy regressor is one of these models though and what we're interested in is a parameter that's tucked away in this fit method if you look at the signature you'll notice that the fit method has this sample_weight parameter it's a parameter that's set to none in most cases but what you can do with it is you can say well how much do i weigh my data points let's say that this is my big input x and my array y that i would like to predict then i can have another array of sample weights and what that allows me to do is it allows me to say well this data point over here that's worth 0.1 this one is worth 0.2 and maybe this data point is worth 10. the idea behind these weights is that you can say that maybe this data point is way more important to predict than maybe this one or this one and if you can specify this for every row in your data set then this gives you an option to customize now for our purposes it would be nice to have a meta model that can automatically fill in values for these sample weights for whatever we're trying to predict and in particular something that's general is that we might say you know what let's just do some exponential smoothing if we assume that this data set here is sorted such that everything on this end is the most recent data and everything on this end is the old data then maybe we can say that there's a decay parameter that says that everything that's happening here is super important but as we move to the past it gets less and less important now this idea of adding a decay that is a meta model that's also made available by scikit-lego and here's the implementation and you got to pay attention because we're actually using two meta models here we're starting out with our good old dummy regressor but then we say well we want to add decay to it what we're doing is we're saying well the more recent data points matter more than the old ones note that if you want to do this your data set does need to be sorted beforehand next what i'm doing is i'm saying okay take that model but now i want you to calculate the mean not on the entire data set but per month and what's nice about this example is that we can demonstrate that we start with a very basic model but by adding this meta behavior to it we are quite expressive in what sort of model we would like to end up with and we can see the effect over here as well the orange line over here is what we would get if we didn't do any decay and what you can see here is this green line which fits the more recent data very well but it doesn't really fit the old data as much it doesn't really pay
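(a sketch of stacking the two meta models described here; the decay argument name can differ between scikit-lego versions, and the month and value column names of the sorted time series dataframe are assumptions)

```python
from sklearn.dummy import DummyRegressor
from sklego.meta import DecayEstimator, GroupedPredictor

# inner model: a per-row mean where recent rows weigh more (exponential decay)
# outer model: fit one of those per month
mod = GroupedPredictor(
    DecayEstimator(DummyRegressor(strategy="mean"), decay=0.99),
    groups=["month"],
)

# the dataframe must be sorted by date so "recent" really is at the end
mod.fit(ts_sorted[["month"]], ts_sorted["value"])
preds = mod.predict(ts_sorted[["month"]])
```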
attention to that and that's because we're adding a decay that's saying well ignore the far away history and focus on the more recent events one thing to keep in mind here is that by applying this trick you will get this awkward situation where it is possible that your training data has way worse performance than your testing data and that is because you're probably testing towards the future and it's the future that we're actually paying more attention to in our model so don't be too surprised if that's something that you see but the main thing that i hope that you recognize here is that it's very nice to be able to take models that we already know and understand and just add small interpretable behaviors to it it's that that gives us the opportunity to customize a model to the use case that we're interested in and not only will that make our models more accurate but it also makes them just slightly more interpretable which is always a good thing in this final segment we will talk about a tool called human learn and the whole goal of this segment is to showcase that you don't always need machine learning back when data science was really just starting out it wasn't really common to use machine learning algorithms instead what was more common is you would start with a data set and then you would get a domain expert to come up with some business rules and this would give you a system that could automatically assign labels to new data that came in these days though it's a little bit more common to not write your own rules but instead to have a machine learning algorithm figure them out you would collect data as well as labels and then it will be the job of the machine learning algorithm to figure out what the appropriate rules are the question is though in this transition from a rule-based system to a machine learning model have we maybe lost something along the way and there's a couple of issues if you think about it for starters machine learning isn't perfect it might be that the machine learning model is very accurate but that it has behavior that we aren't necessarily proud of think of themes like fairness and it might also be a little bit ineffective odds are that if you're starting out with data science and you've got a use case that there's also a domain expert around who can pretty much tell you what rules already should matter if nothing else it would be nice if we can construct rule-based systems in such a way that we can easily benchmark it against machine learning models so that got me thinking maybe we need to have a tool that makes it easier to write these rule-based systems and that is why i started this new open source project called human learn the goal of the project is to offer scikit-learn compatible tools that make it easier to construct and benchmark these rule-based systems it is made in such a way that it's also easy for you to combine these rules into machine learning models as well and in this series of videos i would like to highlight some of the main tools of the package and how you can get it working in your daily flow what i've got here is a jupiter notebook and one thing to observe is that at the top i have this pip install human learn command and it's this command that makes this ulearn package available to me what i'm doing in this notebook so far is somewhat basic the main thing is i'm loading in this low titanic function and that allows me to load in this data set about the titanic disaster the goal of the dataset is to predict whether or not a passenger 
survived so we have a column here that we would like to predict that's binary you survived or you didn't and there's a couple of things that we have at our disposal in order to predict that we have the passenger class we've got the name gender age and what we can do is we can take all of this data and put it into a machine learning model and look for a prediction however if i were to come up with a simple domain rule that might also go ahead and work then i could argue that perhaps i can also just have a look at this column here that tells me how much somebody paid something's telling me that the people who paid more for a ticket are probably also the ones who were perhaps more protected odds are that they were on the upper decks and i can definitely imagine that they had better access to lifeboats than everyone else on the ship so that means that i could already come up with a rule-based system i could say hey let's just have a threshold on this fare and everyone above the threshold survived and everyone below the threshold didn't probably the easiest way to implement this logic in python is to write it down as a function which is what i've done here i'm assuming that a data frame goes in as the first argument and then i have this threshold parameter and effectively all the logic of predicting is happening in here if you paid more than the threshold then i'm assuming that you survived and i'm returning a numpy array and what i could do is i could just say hey let's apply that fare-based function to my input data x that i've defined here and i will get all sorts of predictions here the issue with this is that this is a function it is not something that scikit-learn is able to accept as a classifier and that means that i cannot use it in a grid search or pipeline and that's a bummer because i would really like to be able to grid search this threshold hyperparameter over here lucky for us though there's a tool inside of human-learn that addresses this the tool that we want to use now is called a function classifier and i'm importing it here what the function classifier will do is two things it will take a function that you've written that outputs a prediction and it will turn it into a scikit-learn compatible classifier the second thing that it will then do is it will also make sure that any arguments that you've defined here when creating the function classifier that overlap with the arguments in the function itself that those arguments can be grid searched so to speak and to show you how it works what i've done here is i've created a somewhat simple grid search i'm doing two cross validations and i have a parameter grid here over the threshold effectively i'm looping over lots of values between 0 and 100 and what i'm doing is i'm keeping track of the accuracy the precision and the recall and what this grid search will do is it will pick the best hyperparameter based on the accuracy when i call dot fit over here this system will run and then i can inspect the grid search results the grid search just ran and i'm taking the results from it turning it into a data frame and then making a chart what you see on the x-axis over here is the threshold value that we set on the fare column and then you see three lines the blue line is the accuracy the orange line is the precision and the green line stands for the recall one thing that we can see is that when we increase the threshold the precision goes up that's the orange line and this suggests that the rule that we started with isn't half bad people who paid more for the ticket are probably more wealthy and are therefore perhaps more likely to get to a lifeboat quicker
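(a sketch of the fare rule wrapped in a function classifier from human-learn; the column names follow the load_titanic dataset and the grid values are placeholders)

```python
import numpy as np
from hulearn.datasets import load_titanic
from hulearn.classification import FunctionClassifier
from sklearn.model_selection import GridSearchCV

df = load_titanic(as_frame=True)
X, y = df.drop(columns=["survived"]), df["survived"]

def fare_based(dataf, threshold=10):
    # everyone who paid more than the threshold is predicted to have survived
    return np.array(dataf["fare"] > threshold).astype(int)

mod = FunctionClassifier(fare_based, threshold=10)

grid = GridSearchCV(
    mod,
    cv=2,
    param_grid={"threshold": np.linspace(0, 100, 30)},
    scoring={"accuracy": "accuracy", "precision": "precision", "recall": "recall"},
    refit="accuracy",
)
grid.fit(X, y)
```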
now as a model this isn't necessarily perfect the recall that we have down below here decreases the higher we set the threshold and that's also because there's fewer and fewer people that paid a certain amount for their ticket but all in all this is already kind of nice because it allows us to test if machine learning models can make better or worse decisions than domain experts and quantifying that can be nice a final thing to mention about this is how general this is we can come up with any python function that we like here with any logic that can produce a prediction and then any argument that we have in the function can be used as a hyperparameter inside of a grid search and this in practice can be very very powerful the function classifier is meant to be very customizable that means that you can do more than just come up with a system that can make predictions you can also use it to add behaviors to existing models for example let's say that i already have a classifier and that i also have an outlier detection system something that might be really sensible is that you first check for a new data point whether or not it is an outlier if it is then you say well let's not make a prediction let's fall back and if it's not an outlier it might also be good to check if the model's certainty or the proba that's in the model is high enough and if it isn't that might also be a nice reason to perform a fallback scenario only when these two checks pass should we take the action that the original classifier takes if you want to construct something like this and you already have an outlier detection model as well as a classifier then the function classifier object can also be used to very easily declare systems that do exactly this to give an example i have written down the pseudo code that you might need to construct something like this i'm assuming that you already have your outlier detector and i'm also assuming that you already have your classifier but then if you want to construct the diagram that we had before the only thing you have to do is declare this one make decision function inside of this function the first thing that needs to happen is you first need to have a classifier that makes a prediction on all the data points that are passed in next if there is any doubt that is to say we calculate the probability values for each class and if the maximum of all those probability values is less than a threshold well then we are not going to use the predictions that we had instead we are going to override it by saying well the class is now doubt fallback and the numpy where function makes it really convenient to write this logic we also like to make sure that it's not an outlier so we apply a similar trick if we have an outlier detection model that's pre-trained we can make a prediction and if the prediction says it's an outlier then we can assign another label instead of the original predictions this gives us our make decision function and our function classifier makes it very easy to then accept that function and we now have a fallback model that you can use in a scikit-learn grid search like you're used to now i hope you agree that designing systems like this is beneficial in practice it is really nice that we can add our own little logic in front of an existing machine learning model because that allows us to add behavior that we are interested in i also hope it makes it
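(a sketch of the fallback logic described here, assuming a pre-trained classifier clf and a pre-trained outlier detector outlier already exist; the label strings and the threshold value are placeholders)

```python
import numpy as np
from hulearn.classification import FunctionClassifier

def make_decision(dataf, proba_threshold=0.8):
    # start with the predictions of the original classifier
    preds = clf.predict(dataf)

    # when the model is not confident enough, fall back
    confidence = clf.predict_proba(dataf).max(axis=1)
    preds = np.where(confidence < proba_threshold, "doubt_fallback", preds)

    # when the point looks like an outlier, fall back as well
    preds = np.where(outlier.predict(dataf) == -1, "outlier_fallback", preds)
    return preds

# the threshold is now grid-searchable, just like in the fare example
fallback_model = FunctionClassifier(make_decision, proba_threshold=0.8)
```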
clear that the goal of this package isn't to make rule-based systems that act as if they're just predictors instead we want to construct rule-based systems using pre-existing machine learning models the idea being here that this gives us a nice balance between natural intelligence and artificial intelligence let's consider another way of constructing rule-based systems and to explain this i will be using a data set that is in the scikit lego package now to get the data that we're interested in what we got to do is we got to import this load penguins function all we can do then is we can call the function like you see here and this gives us a data frame the goal of this data set is to predict the species of penguin based on the properties of the penguin like the body weight or the length of the flipper to get a different view of the data set though what i'm doing below is i am using this module inside of human learn the experimental interactive module to generate some interactive charts and here you can see one of these interactive charts i'll just zoom out a bit and the interactive chart that you see here shows these different classes appear for the different kinds of penguins that we have now one reason why people like doing machine learning is because it's kind of hard to come up with a rule-based system that can handle data points like this writing an if-else statement that will perfectly describe this area over here is relatively tricky but what if we could just go ahead and draw it instead if i'm to consider this data set over here i can argue that it's not really that complex as a human i can kind of draw the area where i expect the green dots to be in and i can pretty much eyeball what the algorithm should look like and that allows me to draw out these polygons to get started with drawing i got a selected color that i'm interested in in this case red which corresponds to this class and then i can double click and click and click click click click and when i'm done making the polygon i can double click once more and then i've got a shape i can click and drag this shape if i'm interested and what i can also do is i can use this edit button over here to edit the shape but the main point i want to make here is that we can also declare rules by drawing them and this is just one drawing but let's also make another one and now i've got these two drawings that as far as i'm concerned should serve as a pretty okay classification algorithm now to use this as a classification algorithm it's useful to point out one more thing note that we added a chart to this clf object over here and that object is this interactive charts object the one that took the data frame as input and this species as a label column what i can do is i can take that object and ask it for all the polygons that i've drawn so far using this json blob we can reconstruct the drawings that we made and we can use a point-in-poly algorithm to check if a new point is in one of these polygons at any given point and if that's the case we can make the appropriate classification to get this to work though we're going to have to import something called an interactive classifier the idea behind this classifier is that we can define a scikit-learn compatible model from this json blob and what we can then do is we can generate our x and y data sets as we would normally from a data frame and then we could use this model as if it was any cycler model that we're used to and to emphasize how these classifications work i've added a little bit 
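(a sketch of the drawing workflow, assuming the penguins dataset from scikit-lego; the column names and the charts.data() call are assumptions about the human-learn api, and the add_chart cells need to run in a notebook so you can actually draw the polygons)

```python
from sklego.datasets import load_penguins
from hulearn.experimental.interactive import InteractiveCharts
from hulearn.classification import InteractiveClassifier

df = load_penguins(as_frame=True).dropna()

# each add_chart call renders an interactive scatterplot to draw polygons on
charts = InteractiveCharts(df, labels="species")
charts.add_chart(x="flipper_length_mm", y="body_mass_g")
charts.add_chart(x="bill_length_mm", y="bill_depth_mm")

# the drawn polygons live in a json blob that the classifier can consume
clf = InteractiveClassifier(json_desc=charts.data())
X, y = df.drop(columns=["species"]), df["species"]
probas = clf.fit(X, y).predict_proba(X)
```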
of code that makes the matplotlib charts and you can verify yourself that the plots that you see here do correspond with the drawings made earlier in this notebook it deserves to be mentioned though that these drawn models have properties that you might not have thought of up front and some of these properties are actually quite beneficial what i'm doing now is i'm taking the exact same model as i drew before but now i'm having it predict a new example over here and this new example is predicted with a class of addly and we can see that it's doubting between two classes in particular if i were to scroll up now then i can also explain why compared to this chart the new example point that i've made is somewhere over here whereas for the other chart the same point will be somewhere around here so that explains why it's having trouble making a proper classification but let's now say that the data that i'm getting in it's incomplete it's missing a value well in this case if this was the value that's missing then i could scroll back up to here and then where before the point was inside of this polygon it now isn't anymore so this chart is not giving us any more information for our classification task however because these two variables are definitely known we can definitely still say that the penguin is in this region and from a machine learning perspective that means that our drawings are robust against missing data we now see that we're predicting a new class and that's because we are now zooming in on these two features and no longer using these two there is another use case for these drawn machine learning models so the same if we have a look at the drawing right here you'll notice that there's a couple of points that kind of fall outside the polygon that we drew and in this particular case we can perhaps make the argument maybe some of these points are outliers especially if we were to zoom out i hope you recognize that this machine learning model really only has a comfort zone around here if we were to see points that are definitely on the outskirts of this comfort zone then it'd be nice if we can have a system that says hey that's an outlier we should treat those points differently luckily for us what we can do is we can take the same drawing that we use to make a classifier and use it as an outlier detection system where before we would check whether or not a point was inside of a polygon in order to make a decision the outlier detector will check whether or not the point is outside of polygons and there's a hyper parameter we can specify the minimum number of polygons that a point has to be in in order for it not to be an outlier but to give a simple demo of this feature you'll notice below here that i am using the interactive outlier detector as opposed to the classifier and then i'm following very similar steps as before but the main thing that you'll notice now is that when i make the chart of the predictions that all the points on the outskirts are indeed now classified as an outlier and again what we could do is we can make changes to this drawing over here and then we would get different results in all the drawings that we've made so far we've had the luxury of this label that was available however you can also imagine situations where we don't have a label readily available and in these situations it might be nice to use this drawing mechanic to actually assign labels to data points so as a demo let's just add group 1 and group 2 as labels here because the labels parameter here 
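(a sketch of reusing the same drawings for outlier detection; the module path and argument names here are assumptions about the human-learn api)

```python
from hulearn.outlier import InteractiveOutlierDetector

# the same drawn polygons, now interpreted as a "comfort zone"
outlier = InteractiveOutlierDetector(json_desc=charts.data())
preds = outlier.fit(X, y).predict(X)   # -1 marks points outside the polygons
```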
is now a list these interactive charts will no longer internally check for a column and instead it will assume that these are the values that you want to assign so if i now were to run this you'll notice that the colors in the chart here are gone i do however still have the ability to make a drawing like before and i can use this drawing to make an outlier detector or a classifier but what i can also do is use this drawing in a pre-processing step human-learn comes with this interactive preprocessor and just like in the examples before it is able to read in this chart data but instead of acting as a predictor it can act as a scikit-learn transformer and it can also be used in a pandas pipeline as you see below here in the case of pandas what will happen is it will add two new columns with counts of how often a data point was in a polygon and if i were to use the scikit-learn flow instead i would get the same counts but as a numpy array now this interactive charts api is relatively experimental currently it only allows for 2d charts but you can imagine that other visuals will also be added in the future things like histograms can also be used to make selections the main utility of this library though is to maybe start thinking a little bit differently about machine learning models and to maybe start thinking about rule-based systems instead the goal isn't to throw machine learning out the window but instead to use machine learning algorithms with some business rules around them such that literally machine learning models can play by the rules i hope you enjoyed watching these videos on scikit-learn it's a vast ecosystem and we've only really been able to scratch the surface here but i hope that i've been able to explain enough for you to either get started or to explore the tool more if you enjoyed these videos then i would like to mention that there is some other machine learning content out there that you might also appreciate in particular i make a lot of videos on the calm code website so definitely feel free to check that out also if you're interested in learning more about natural language processing then you might also appreciate the videos i've been making on the official spacy channel in this series of videos i'm trying to make a model that can detect programming languages and if that sounds interesting definitely feel free to check that out as well and if you're interested in learning more about natural language processing definitely feel free to check out the algorithm whiteboard channel on youtube it's a playlist that i maintain on behalf of my employer rasa and you'll find many examples and explanations of algorithms there as well if your interest is to learn more about scikit-learn though then i might also have some other recommendations for you for starters definitely feel free to check out free code camp i've been in industry for over seven years but i still find many useful courses and hints on a wide variety of topics on this website which includes related toolkits like numpy and pandas second i might also recommend checking out the pydata youtube channel understanding how scikit-learn works is great but sometimes you would also like to hear anecdotes and lessons learned from applying scikit-learn in practice and the pydata channel on youtube is a great resource for just that now the third and final resource that i really recommend is the scikit-learn documentation page scikit-learn is a gigantic library and there's a lot of features that to this day i still
discover basically just by reading the documentation but there's also something specific about the documentation that i would like to highlight if you go to the documentation page you'll find a couple of subsections there is a user guide that you can follow but there's also a subsection over here under more where it reads related packages you see scikit-learn isn't just a package at this stage you could also say it's an ecosystem and there are many different projects for specific use cases that might seriously help out there's experimentation frameworks tools for model selection there's also extra support for tools that have to do with time series and although the list of tools here can be intimidatingly large i really do recommend just having a look here simply to get a feel for what you could do with scikit-learn so i hope you'll give this documentation page a glance and once more i hope that you'll be able to use scikit-learn to do meaningful work with machine learning thanks for listening