What is going on guys, welcome back. In today's video we're going to go through the full data science process: we're going to complete a full data analysis, starting by exploring the data set, then pre-processing the data, engineering some custom features, training a model and evaluating it to see how well it performs, and we're also going to introduce some more advanced topics like hyperparameter tuning. This video is targeted towards beginners in data science and machine learning, but as I said I will also introduce some more advanced concepts, so I think it's interesting for everyone who's interested in data science and machine learning. So let us get right into it.

All right, for this video there are only a few prerequisites that we need to have in advance. The first one is a working IPython notebook environment. That can be Jupyter Notebook, that can be JupyterLab, that can be IPython notebooks inside of VS Code or PyCharm, it doesn't really matter, but you want to use IPython notebooks and not just ordinary Python scripts. When we do data science and go through the full process, we want to work in sections: we do the exploration, then the pre-processing, then the feature engineering and so on, and we don't want to re-run all of these steps every time we add something new. Once the exploration is done, we want to continue with its results and move on to the pre-processing without running everything again just to execute the next thing we coded. Working in cells makes that possible, and that's what IPython notebooks give us. I have multiple videos on this channel where I show you how to work with them: a video on JupyterLab, one on ordinary Jupyter notebooks and one on Google Colab, so you might want to check those out if you don't know what a Jupyter notebook or an IPython notebook is in the first place.

The second thing we're going to need is the basic data science stack, so the basic data science libraries, which you can install via pip on the command line: numpy, pandas, matplotlib, seaborn and scikit-learn. We're not going to use neural networks today, so we won't need TensorFlow, Keras or anything like that. Those should be all the libraries we need; if some new modules come up I'm going to add them later on, but those are the main traditional machine learning and data science libraries, so we can close this now.

The data set that we're going to use for this video is the California housing prices data set from Kaggle. You will find a link in the description down below if I don't forget; if I do forget, you can just look at the URL up here. Essentially we have a bunch of features like the coordinates, the median age, the total rooms, the total bedrooms and the population. Each row is an area of houses, and then we have the target variable, which is the median house value. So this is going to be a regression task: we're going to try to predict the house value based on all the other features. We're going to look at the total bedrooms, the total rooms, the coordinates, the ocean proximity and so on, and try to predict the house value. That's the basic idea, and in order to be able to do that,
we're going to download the data set. I'm going to save it in my programming directory, which is programming, NeuralNine, Python current, as archive.zip. I can open this up and extract it with 7-Zip ("Extract Here"), and we get this housing.csv file. Next we open up a terminal again and navigate to this directory, so I'm going to go to Desktop, then programming, then Python, or actually I think it was NeuralNine, Python current, and here I'm just going to start JupyterLab. You can start whatever IPython notebook environment you like; I prefer JupyterLab, so I'm going to run it here. It starts locally and opens up in the browser, and then I can create a new IPython notebook in this directory by just clicking here, and we can start with the data analysis.

The first thing we do is import a couple of basic libraries: pandas as pd, numpy as np, matplotlib.pyplot as plt and seaborn as sns. The scikit-learn pieces we're going to import separately as we need them. Then we say data = pd.read_csv("housing.csv") to load the CSV file into our notebook. I think the default separator is a comma, so it already fits; otherwise we would have to specify either sep or delimiter, I'm not sure which keyword pandas uses, but since the file is comma-separated we can just load it like that. We can look at the data already by just typing data. You can see we have the coordinates (let me just close this panel on the left), the median age, the total rooms, and I think the features are also described; actually yes, there you go: "total number of rooms within a block". So we have blocks on the map; a row is not one house, it's an area on the map in California, and the same is true for all the features, for example "total number of people residing within a block". That's the basic idea of what we're working with, and here you have the numbers that we're going to use to train the model. We want to predict the median house value, so this is going to be our target variable, and all the other columns we're going to use as input features to predict this value.

Because of this goal we already see some problems. First of all, this ocean proximity column: we cannot really use it right away. We cannot just take these words and feed them into our model, whatever the model is, a support vector machine, a random forest, a decision tree, a neural network; we cannot feed raw text into the model as a category, so we need to pre-process this variable at the very least. The other columns look like they probably need to be scaled. Let's continue a little bit with the exploration: we can say data.info() to see whether we have any null values, and you can see most features have 20,640 non-null values, but one column actually has some values missing, which is a problem. Since there are not a lot of values missing, around 200, we're just going to drop these NaN entries (NaN stands for "not a number"). We can say data.dropna(), and this returns a data frame with only non-null rows.
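As a rough sketch (not the exact notebook from the video), the cells up to this point could look something like the following, assuming housing.csv sits next to the notebook and using the column names from the Kaggle file:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Kaggle California housing data; the file is comma-separated,
# so the default separator of read_csv works as-is
data = pd.read_csv("housing.csv")

data            # in a notebook cell this displays the frame
data.info()     # 20,640 rows; total_bedrooms has roughly 200 missing values
data.dropna()   # returns a copy without the NaN rows (not saved back yet)
```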
If we actually want to apply this and keep the result, we can pass inplace=True, so this takes the data, drops the NaN values and saves the result in the data object again. If we now say data.info() again, you can see that we have the same number of non-null values in all the columns.

What we're going to do next, and this is important, is split the data into training and testing data, and also into X and y. This matters because we want to train the model on one part of the data and evaluate it on another part. We are not going to work with all of the data, because we need some unseen data that the model has never seen before to check whether it performs well on it. We do have the target values for all the rows, but the model doesn't get to see them for the test set. So we open a new cell and say from sklearn.model_selection import train_test_split. Before we can do the actual split we need to define what X and y are, because train_test_split takes X and y and turns them into X_train, y_train, X_test and y_test. X is going to be the data frame without the target variable, so data.drop with the target column median_house_value, and since we're dropping a column we need to say axis=1. The y is the opposite: only the median_house_value column. If I print X you can see it's the full data frame without that one column, and if I print y you can see it's only that one column, a pandas Series in this case. Now we say that X_train, X_test, y_train, y_test is the result of splitting X and y with a certain test size, where you specify what fraction of the data you want to use for testing. The go-to value is 0.2, so 20 percent of the data will be reserved for evaluation, and we're not going to touch this data for anything: we're not going to tune any hyperparameters with it, we're not going to do anything with it before we're confident that we can launch the model. The test set is something you only look at when you say, okay, I'm done with the training, I'm done with the hyperparameter tuning, I have my model, I'm confident launching it, and now I want to evaluate it on the test data. That's what the test data is for, so we're not going to look at it again before we're done with the full process. We're going to use only the training data from now on, and maybe split it further into a training set and a validation set; we might look at that in a second.

What we want to do now is join the X training data and the y training data back together so that we can analyze some basic correlations. We can say train_data = X_train, and I think we should just be able to call join on y_train; then we have the combined data frame again, but only for the training data. With that we can do some basic exploration of the numeric features (it's not going to include the ocean proximity), for example by calling train_data.hist().
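Sketched out, these cells might look like this; the dropna(inplace=True) line is the version we end up with, and median_house_value is the target column name from the Kaggle data set:

```python
from sklearn.model_selection import train_test_split

# Drop the ~200 rows with missing values and keep the result in data
data.dropna(inplace=True)

# The target is the median house value; everything else is an input feature
X = data.drop("median_house_value", axis=1)
y = data["median_house_value"]

# Reserve 20% of the rows as an untouched test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Re-join features and target for the training part so we can explore them together
train_data = X_train.join(y_train)
```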
That gives us histograms for the distributions of the individual features, and I think we should also be able to pass a figsize; let's see if that works, there you go. You can see the distribution of the various features, and probably even more interesting is the correlation with the target variable. For that we're going to use sns.heatmap to visualize a correlation matrix, and we can just pass in train_data.corr(). Maybe before we plot it, let me show you what this looks like: train_data.corr() gives us a correlation matrix, where every feature has a correlation of one with itself, since it's the exact same values, and then you can see the correlations for all the other feature combinations. We're going to take this and plot a heat map based on it. We set annot=True so that we can see the actual correlation numbers, the color map is going to be YlGnBu (yellow, green, blue), and I think we should also say plt.figure with a figsize of (15, 8). There you go. What's interesting to us is the correlation with the median house value: you can see that the median income of a block correlates quite strongly with the median house value, so this is a very interesting variable to look at, it can be a great predictor for the house value, and you can see that, for example, the latitude is negatively correlated with the house value (that's one part of the coordinates). That's it for the exploration; you can do more if you want to, just look at the data to familiarize yourself with it.

Now we're going to go into the pre-processing. As we saw, we have a bunch of skewed features: the total rooms, the total bedrooms, the population and the households are all quite skewed. I think in statistics you call this right-skewed, even though visually you would say it's leaning to the left; I think the proper term is right-skewed, but I'm not 100 percent sure. In any case the data is skewed, it's not a nice Gaussian bell curve. So for those features we're going to take the logarithm and see what the distributions look like afterwards. We say train_data["total_rooms"] = np.log(train_data["total_rooms"] + 1); the plus one is there so that we never take the logarithm of zero. Then we copy this line for total_bedrooms, population and households. We can run this and then again say train_data.hist with a figsize of (15, 8).
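Something like the following reproduces these exploration and log-transform steps; the numeric_only=True argument is my addition for newer pandas versions, since ocean_proximity is still a string column at this point:

```python
# Histograms of the numeric training features
train_data.hist(figsize=(15, 8))

# Correlation matrix as a heat map with the actual numbers annotated
plt.figure(figsize=(15, 8))
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap="YlGnBu")

# The count-like features are right-skewed, so take log(1 + x)
for col in ["total_rooms", "total_bedrooms", "population", "households"]:
    train_data[col] = np.log(train_data[col] + 1)

train_data.hist(figsize=(15, 8))
```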
Now you can see we have something that looks more like a Gaussian bell curve; a log-normal distribution seems to fit those four features.

What we should do next is use the ocean proximity feature as well, because I simply assume, as a human, that when you're closer to the coast you're probably going to have higher prices, since that's a more desirable area, and when you're further inland you're probably going to have lower prices. That's just my assumption, I don't know if it's true, but I don't think this feature is irrelevant, so we're going to use it. We can't use it as it is, though; we have to turn it into numerical values first. With categorical features it often makes sense not to ordinal-encode them, so not to just assign numbers like one, two, three to the category values, but to one-hot encode them: to take the individual category values and turn each of them into a binary feature that can be either zero or one. What would that mean? If we say train_data["ocean_proximity"].value_counts(), you can see the categories, and instead of assigning numbers to them we're going to create a new feature for every single one of them and give it a value of one or zero. So one of them is a feature representing whether the location is less than one hour away from the ocean; yes would be one, no would be zero. To do that we just say pd.get_dummies, which is the one-hot encoding from pandas (there is also one-hot encoding in scikit-learn, but this is how you do it in pandas), and we pass the categorical column, in this case train_data["ocean_proximity"]. This returns the dummy columns, which we can join with our train data, so we can just say train_data.join; I think join is enough, there you go. We now have these category columns, and afterwards we're going to drop the original ocean_proximity column, because we've turned it into multiple features. Of course we need to save the result into train_data again. We can print it, and now we can also look at the correlations for those values: we take the heat map from before and run it again to see how these new features correlate with the target variable. You can see, for example, that the median house value has a negative correlation with INLAND: if a block is inland, the median price is way lower than if it's not, and the opposite is true for "less than one hour away from the ocean", which usually means a higher price. So we now have a bunch of additional useful features.

What we also might want to do is take features that we already have and combine them into new features, and that brings us to the feature engineering part. But maybe before we do that, let me show you one thing that can be quite interesting: let's visualize the coordinates.
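The one-hot encoding step, sketched in code (the dummy column names, such as INLAND, come straight from the values in ocean_proximity):

```python
# Inspect the categories, then one-hot encode them instead of
# giving the category values arbitrary ordinal numbers
train_data["ocean_proximity"].value_counts()
dummies = pd.get_dummies(train_data["ocean_proximity"])
train_data = train_data.join(dummies).drop("ocean_proximity", axis=1)

# Correlation heat map again, now including the new binary columns
plt.figure(figsize=(15, 8))
sns.heatmap(train_data.corr(), annot=True, cmap="YlGnBu")
```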
We're going to say plt.figure and assign a figsize of (15, 8) again, just so that you can see how the position of the individual blocks influences the pricing. Then we use seaborn and say sns.scatterplot: the x coordinate is going to be the latitude, the y coordinate is going to be the longitude (I hope that's how it's pronounced, I'm not sure), the data itself is the train data, and the color, this is the interesting part, will be set to our target variable, the median house value, with the palette set to coolwarm. So what you see here is essentially the median house value: the more red it gets, the more expensive the houses, and the more blue, the less expensive. Even though this is not a map of California per se, down here would be the coast, so this would be the sea, and over here we would have the inland, and you can see that all the blocks on the coast, close to the ocean, close to the water, are more expensive. This is a nice thing to visualize and to keep in mind.

Now let us move on with the feature engineering. We have a bunch of features that are interesting on their own, but we can also combine them into possibly more interesting features. For example, we have the total rooms per block and we also have the total bedrooms, but what might be even more interesting, or at least interesting as well, is the number of bedrooms per room, so what share of the rooms are bedrooms. This can be a separate feature: we can say train_data["bedroom_ratio"] = train_data["total_bedrooms"] / train_data["total_rooms"]. Then we can add another feature: we have the total rooms, but if a block has more households than usual it will probably also have more rooms, so the room count by itself doesn't necessarily give us the full picture. What we also want is train_data["household_rooms"], the number of rooms per household, which is just train_data["total_rooms"] / train_data["households"]. Once again we can show a correlation heat map, look at the target variable and add our new features, and you can see they have a significant correlation: the bedroom ratio has a negative correlation with the median house value per block. You can also see that the households by themselves are not very interesting as an input variable, they have a correlation of 0.07 with our target variable, but the household rooms are more interesting. For the bedrooms, the raw bedroom count is not really interesting by itself (the room count actually is), but the bedroom ratio is quite interesting. So we engineered two features that seem to be interesting and important.

That's it for the feature engineering. You can of course come up with more features, and you can also drop features; oftentimes it makes a model better if you drop features that have close to zero correlation with the target variable. We're not going to do that here, we're going to use all the columns. What we want to do now is train multiple models.
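A sketch of the scatter plot and the two engineered features described above:

```python
# Colour each block by its median house value; coastal blocks show up as expensive
plt.figure(figsize=(15, 8))
sns.scatterplot(x="latitude", y="longitude", data=train_data,
                hue="median_house_value", palette="coolwarm")

# Ratio features: share of bedrooms among rooms, and rooms per household
train_data["bedroom_ratio"] = train_data["total_bedrooms"] / train_data["total_rooms"]
train_data["household_rooms"] = train_data["total_rooms"] / train_data["households"]
```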
The most simple model that we can train is a plain linear regression. This is barely even machine learning, I mean it is machine learning, but it's more statistics than machine learning. We say from sklearn.linear_model import LinearRegression, and the regressor is going to be a LinearRegression instance. We have to split the data again, not into training and testing but into X and y, because one thing that's important now is that we did the split up there on the original columns, maybe a little bit early, so without the new features. That means we have to do the X-y split again, and we also have to apply all these new features to our test data later to evaluate the model. So we say X_train is train_data.drop with our target variable median_house_value and axis=1, and y_train is the median_house_value column, and then we call reg.fit on X_train and y_train. One thing you probably noticed is that we didn't scale the data; that's something we should do, but let's first see, just as a test, how well this performs on the test data. We evaluate it right away because we're done with the linear regression: we're not going to do hyperparameter tuning for it, we're just going to say, okay, we're confident launching the linear regression into production, and we want to see how it performs. For the other model, the random forest regressor, we will do hyperparameter tuning.

First, as I mentioned, we need to take the test data and put it through the same process that we applied to the training data. Maybe it wasn't the best idea to split that early, but we're going to basically redo everything that we did with the train data. We copy the cells and paste them down here; of course, if you're doing an actual project, a large-scale data analysis, you would want to put this into a function and feed the data through it automatically instead of copy-pasting and editing like I'm doing right now, but this is a simple project, so we'll do it manually. We take the log-transform cell, the one-hot encoding cell and the ratio-feature cell, and all we need to do is change train_data to test_data everywhere, here, here and there as well, and of course X_test.join(y_test) instead of the training split. I think that should work right out of the box, but there is one potential problem: we only have very few instances of the ISLAND category, so if by randomness we don't get a single ISLAND row in our test data, we would be missing that column and the test set wouldn't have the same format as our training data. In this case, though, if we print test_data we have 16 columns, and train_data has 16 columns as well.
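Putting that into code, roughly: the test-set preparation below is the same pipeline applied to test_data, written as a loop rather than the copy-pasted cells from the video (that compression is my own shorthand, not the exact notebook):

```python
from sklearn.linear_model import LinearRegression

# Re-split the fully processed training frame into features and target
X_train = train_data.drop("median_house_value", axis=1)
y_train = train_data["median_house_value"]

reg = LinearRegression()
reg.fit(X_train, y_train)

# The test set must go through the same steps: log transform, one-hot
# encoding and the ratio features (in a real project, wrap this in a function)
test_data = X_test.join(y_test)
for col in ["total_rooms", "total_bedrooms", "population", "households"]:
    test_data[col] = np.log(test_data[col] + 1)
test_data = test_data.join(pd.get_dummies(test_data["ocean_proximity"]))
test_data = test_data.drop("ocean_proximity", axis=1)
test_data["bedroom_ratio"] = test_data["total_bedrooms"] / test_data["total_rooms"]
test_data["household_rooms"] = test_data["total_rooms"] / test_data["households"]
```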
So that works, and we should be able to just call reg.score on X_test and y_test. I hope this is possible... it is not, why is that? "Could not convert string to float" for the "less than one hour from the ocean" category. Do we still have the... oh, I know why: the problem is that we redo the X-y split here after we're done with processing the training data, but we never did it for the test data. So we need to do this for the test set as well: X_test and y_test come from test_data, and maybe we should run this in a separate cell, because re-running the code from the earlier block would cause problems. And this still says X_train, y_train, which is wrong; I copied the wrong thing, I'm confused now, so let me just do this one more time. We take that cell, open up a new cell, paste it, change train to test in every spot, and then we check whether we still have the ISLAND column, because honestly I don't believe it. We do have it, though, unless I messed up something else. Does X_train now have 15 columns as well? Okay, so the shapes are compatible, let's see if that works. There you go, we get a score of 0.668, which is not too bad but also not too good; we will definitely get a better score with the random forest regressor, which we're going to do next. So this is the most simple model, a plain linear regression on the data with our features.

It might make sense to scale the data first, and we're going to do that for the forest regressor as well, but let's see what we get here when we scale it. We say from sklearn.preprocessing import StandardScaler, the scaler is a StandardScaler, and X_train_s is scaler.fit_transform(X_train); then we train on X_train_s. We don't usually need to scale the output. We do the same for the test set (oh, actually I shouldn't have re-run that cell, because fitting the scaler twice won't work, but I don't think I've messed up the code): X_test_s is scaler.transform(X_test), and then we score again. It doesn't really change much here, but usually you want to scale your data.

Now let's move on to the random forest regressor. This is a more powerful model, and we're also going to do hyperparameter tuning to try to find the optimal model. We say from sklearn.ensemble, which is a special type of machine learning where we combine multiple models, in this case decision trees, import RandomForestClassifier... sorry, not a classifier, we're doing regression, so RandomForestRegressor. We say forest = RandomForestRegressor() and try the defaults first: forest.fit(X_train, y_train) and then forest.score(X_test, y_test). It takes some time to train, and you can see we already get about 0.81, which is way better, but hopefully we can push this to an even higher score, maybe, maybe not, by providing a parameter grid and trying different options. For that we're going to use grid search with cross-validation. Cross-validation, to keep it simple, means we take the data and split it into K folds.
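Sketched out, this part could look like the snippet below. Note that in the video the forest is first fitted on the unscaled features and only later on the scaled ones; the sketch goes straight to the scaled version, and it assumes the one-hot columns came out identical in both splits:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Split the processed test frame and evaluate the linear regression
X_test = test_data.drop("median_house_value", axis=1)
y_test = test_data["median_house_value"]
print(reg.score(X_test, y_test))        # roughly 0.67 (R^2) in the video

# Scale the inputs: fit the scaler on the training data only, then reuse it
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline random forest with default hyperparameters
forest = RandomForestRegressor()
forest.fit(X_train_s, y_train)
print(forest.score(X_test_s, y_test))   # around 0.81 in the video
```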
K is just some number; we use all but one fold to train and the remaining fold for evaluation, so if you have 10 folds you use nine for training and one for testing, you do that for all possible combinations, and then you evaluate the hyperparameters that were used across those runs. So we're going to say from sklearn... or actually, before we do that, let's check whether the scaling improves the forest. We don't need to pre-process again, we can just feed in X_train_s and X_test_s, the scaled versions, and see if we get more than a 0.81 score. It's a tiny bit higher, I think before we had a three in that decimal place and now a four, but it doesn't really change much.

Now let's say from sklearn.model_selection import GridSearchCV, grid search with cross-validation. We need to provide a parameter grid, param_grid, where we specify the different hyperparameters that we want to try. One is the n_estimators parameter, where we can pass a list of values, for example 3, 10 and 30, then we can try max_features with 2, 4, 6 and 8, and there is also min_samples_split, though I'm not sure this won't take too long. The problem is that when we provide a parameter grid, the grid search does the cross-validation for all combinations: it tries 3 estimators with 2 max features, 3 estimators with 4 max features and so on, then 10 estimators with 2 max features, going through every combination, and if I add another parameter it combines that one as well, so that's probably too much for the training time here. We say grid_search = GridSearchCV with the forest regressor that we had up there, the parameter grid we just defined, five-fold cross-validation, scoring based on the negative mean squared error (scikit-learn always maximizes the score, and a lower squared error is better, so the sign gets flipped), and return_train_score=True. Then we just fit this grid search on the training data, and we'll train it on the scaled data right away, so X_train_s and y_train. This fits the grid search, and it will probably take some time, so I'm going to skip that part unless it finishes by the time I'm done talking. It's going to give us the optimal model: we get the best estimator, the best regressor, and then we can score it against the test data once we say, okay, we want to launch this into production. One thing I noticed is that it probably makes sense (I'm not sure if it makes a difference) to define the forest again as a fresh RandomForestRegressor, so that we don't reuse the already fitted version from above; then I run this again and we'll see the result. Once the grid search is done we can say grid_search.best_estimator_ and get the optimal hyperparameters: max_features equal to 8 and n_estimators equal to 30. Now we can check whether this best estimator actually produces better results. That's not always the case; sometimes it's actually worse or it doesn't change much, but more often than not it produces better results.
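The first grid search, roughly as described above:

```python
from sklearn.model_selection import GridSearchCV

# Every combination in the grid gets evaluated with 5-fold cross-validation
param_grid = {
    "n_estimators": [3, 10, 30],
    "max_features": [2, 4, 6, 8],
}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid,
                           cv=5, scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X_train_s, y_train)

grid_search.best_estimator_   # max_features=8, n_estimators=30 in the video
```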
Let's see what we have now. I'm going to call this best_forest, and then best_forest.score evaluated on X_test_s and y_test, and in this case we actually get a worse performance. As I mentioned, this can happen; usually, when you tune the hyperparameters, you get better results. Maybe we can improve this by adding more options to the parameter grid: for example, 30 seems to be the best value here, so maybe we change the list to go from 30 to 50 to 100, and for max_features we had eight, which was also the maximum, so let's go with 8, 12, 20 or something like that, and then maybe min_samples_split with 2, 4, 6, 8. This is going to take longer, but we'll run it again and see if we find a better estimator, and for that we provide different values in our parameter grid.

When you go to the scikit-learn documentation of the RandomForestRegressor, you see that the default for the n_estimators parameter is 100, so by default the forest uses 100 decision trees, which is what the model above used. That default setting with 100 trees performed better than our tuned version with 30 trees, and 30 was also the maximum in our grid. This is something you need to know about hyperparameter tuning: if you provide a parameter grid and the optimal value for a certain parameter is either the maximum or the minimum of the range, you might want to explore further in that direction. If you have values from 50 to 5000 and the best value is 50 or 5000, the truly optimal value might not be inside your parameter grid at all, which was the case for us: we had n_estimators at 30, which was the maximum, and the default of 100 performs better. So we change the list to 100, 200, 300. Also, max_features is maybe not the most useful parameter to tune; when I work with random forest regressors I usually tune things like min_samples_split, the number of samples you need to split a node, which is 2 by default as far as I know (let's look it up; yes, the default is 2), so we can provide the default 2 but also 4, for example. Then maybe max_depth, how deep the tree is allowed to go; by default there is no limit, but we can specify one, so we pass None as the default and then something like 4 or 8 just to see what happens. If we then see, for example, that the best depth is 8, we can explore whether another number might be better, or maybe it stays at None, which is basically no limitation, and if the optimal number of estimators turns out to be 300 we might explore even further. Let's just run this parameter grid search now.

All right, this grid search took quite some time, so let's see what the optimal model is: grid_search.best_estimator_ says n_estimators equal to 200 seems to be the best, right in the middle, which is nice, and it took the default value of 2 for min_samples_split and the default of None for max_depth. You can tell because it doesn't list those parameters explicitly; if a parameter isn't mentioned, it's at its default. So let's say grid_search.best_estimator_.score on X_test_s (not X_train) and y_test, and we get about 0.814, while up here with the defaults we got a slightly higher value, so it's still a bit worse than the default.
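And the widened grid plus the final evaluation, roughly as it ends up in the video:

```python
# Widen the grid after seeing the best n_estimators sit at the edge of the old one
param_grid = {
    "n_estimators": [100, 200, 300],
    "min_samples_split": [2, 4],
    "max_depth": [None, 4, 8],
}

grid_search = GridSearchCV(RandomForestRegressor(), param_grid,
                           cv=5, scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X_train_s, y_train)

best_forest = grid_search.best_estimator_
print(best_forest)                          # n_estimators=200 wins in the video
print(best_forest.score(X_test_s, y_test))  # about 0.814, still below the default forest
```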
But yeah, this can happen; oftentimes it doesn't, but it can. In any case, this is a full project: taking the housing prices data set from Kaggle, doing the exploration, doing the pre-processing, engineering features, scaling them, and then building models and evaluating them. You can continue from here: you could do a logistic regression... I mean, actually that's for classification, so not that, but you can go with a support vector machine, you can go with a simple decision tree, you can try to train a neural network if you want to, you can play around with it. You can also try to engineer different features, you can drop features, you can do whatever you think might improve the quality of the model, and you will definitely learn something; it will make you a better data scientist. But this is a full data science and machine learning project, or analysis, in Python.

So that's it for today's video. I hope you enjoyed it and I hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below, and of course don't forget to subscribe to this channel and hit the notification bell to not miss a single future video for free. Other than that, thank you very much for watching, see you in the next video, and bye.