Transcript for:

This course is a practical, hands-on introduction to machine learning with Python and Scikit-Learn for beginners. It is designed and taught by Aakash N.S., who is the CEO and co-founder of Jovian. You'll learn about a bunch of common machine learning models and different types of machine learning. By the end of this course, you'll be able to confidently build, train, and deploy machine learning models in the real world.

The topic for today is linear regression with Scikit-Learn. This is the first topic we are covering in machine learning with Python, and one of the things we'll try to do today is apply linear regression to a real-world problem: we'll try to predict medical expenses using this technique. Here's a quick outline of what we're going to cover. First we state a typical problem statement for machine learning. Then we download and explore a dataset. Then we perform linear regression with one variable using Scikit-Learn, and then linear regression with multiple variables; don't worry if these terms don't make sense right now, they will by the end of this session. Then we'll use categorical features for machine learning and learn how to apply that in general, not just for linear regression. We'll also talk about regression coefficients and feature importance: how do you interpret the outputs of a model that you create? And then we'll apply linear regression to some other datasets. This tutorial is an executable Jupyter notebook hosted on Jovian. You can run it online on Binder, you can run it on your own computer, and you can also run it on Google Colab; that's perfectly fine too.

So here's the problem that we will solve today. Using this problem, we will define the terms machine learning and linear regression in context, and later we'll generalize their definitions. Let's read the problem carefully, and then we'll think about how we can approach it. Acme Insurance Inc. offers affordable health insurance to thousands of customers all over the United States. As the lead data scientist at Acme, you are tasked with creating an automated system to estimate the annual medical expenditure for new customers, using information such as their age, sex, BMI, children, smoking habits, region of residence, et cetera. Why are you asked to do that? Because estimates from your system will be used to determine the annual insurance premium (the amount paid every month or every year) that is offered to the customer. So your job is to figure out, when a new customer applies for medical insurance, based on information like their age, sex, BMI, et cetera, how much they might incur in medical bills in an average year. If you can figure that out, that estimate, along with some other information, can be used to determine what you should charge the customer for insurance. This is a typical example of a machine learning problem, where you want to replace a manual decision-making process, or one piece of it, with some automation. In this case, you might otherwise need an expert who has seen customers for several decades to estimate, based on this information, what a person's medical expenses might look like. As a data scientist, you can figure that out using data, and that's what we're going to create today.
You need some data to actually figure this information out, and fortunately, because Acme Insurance already has thousands of customers, they have given you a CSV file. This CSV file, provided by your manager or whoever is working on this project, contains historical data, let's say from the past year, for over 1,300 customers. You can see we have information like age, sex, BMI (the body mass index), number of children, whether the person is a smoker or not, which region of the country they belong to, and the actual medical charges they incurred in the previous year. So we have some historical data, and we would want recent historical data, not going too far back, because medical charges can change due to inflation and other reasons as well. That's the data you have, and your job is to create some kind of system which can take this information (age, sex, BMI, children, smoker, region) for a new customer and estimate what their charges are going to be over the coming year. There's also a source linked for this dataset; it's a popular dataset used in machine learning. Before we proceed further, just take a moment to think about how you might approach this problem.

Let's begin by downloading the data. We will download the data using the urlretrieve function from urllib.request, a simple function that's already present in the Python standard library. So let's put the URL into a variable called medical_charges_url, import urlretrieve from urllib.request, and call urlretrieve with medical_charges_url, downloading the data to the file medical.csv. Now we have a file, medical.csv, which should have the required data. Next, we create a pandas DataFrame from the downloaded file to view and analyze the data. We install the pandas library using pip, import pandas using the alias pd, and call pd.read_csv with the name of the CSV file, and here we have the DataFrame.

The first step, no matter what kind of problem you're working on, whether it is data analysis or machine learning, is to get the data into your Jupyter notebook. One way is to download it using urllib. Another way is to go to File > Open, click the upload button, and upload the file from your computer. We've also looked at other ways to download datasets, for example from Kaggle, and we will apply all of these techniques at different points in this course.

So here's the data. It contains 1,338 rows and seven columns, and each row contains information about exactly one customer. Our objective is to find a way to estimate the value in the charges column using the values in the other columns. If we can do so for historical data, then we should be able to estimate the charges for new customers too, simply by asking for information like their age, sex, BMI, et cetera, maybe in an application form.
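Here's a minimal sketch of this download-and-load step in code. The URL below is a placeholder (substitute the actual dataset link from the notebook), and the last two calls are the inspection steps we'll walk through next.

    from urllib.request import urlretrieve
    import pandas as pd

    # Placeholder: replace with the actual dataset URL used in the lesson
    medical_charges_url = 'https://example.com/medical-charges.csv'

    # Download the CSV to a local file named medical.csv
    urlretrieve(medical_charges_url, 'medical.csv')

    # Load the file into a pandas DataFrame
    medical_df = pd.read_csv('medical.csv')

    # Quick inspection: column types, non-null counts, and summary statistics
    medical_df.info()
    print(medical_df.describe())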
Let's first check the data type for each column with medical_df.info(). We have seven columns. Age seems to be an integer. Sex seems to be a string (as you can see, male and female), so it's really more of a categorical column, but let's just call it a string for now. BMI is a floating point number; you can see it has decimals. Age could also have decimals, but it looks like only the integer portion of the age was collected, so that's what we're working with. Then we have children, the number of children the customer has; this is again an integer, typically a number like zero, one, two, three. Then we have smoker, whether the person smokes or not, with a yes/no answer (probably a tick box somewhere on a form), so this is also a categorical column which we currently have as a string. Then we have region: it seems customers can belong to different regions like southwest, southeast, northwest, northeast. There appear to be four of them, but we'll have to check. And finally we have the actual annual medical charges incurred by the customer. So there are numeric columns and there are categorical (string) columns, and among the numeric ones there are integers and floating point numbers.

One good thing here is that there are no null entries: the non-null count is 1,338 for every column, which is also the total number of entries. So none of the columns contain any missing values, and that already saves us a fair bit of work. But don't worry, we will deal with missing values in future lessons.

Next, let's look at some statistics for the numerical columns. Calling medical_df.describe() gives us the count, mean, standard deviation, minimum, the percentiles (25th, 50th, which is the median, and 75th), and the maximum. This is always a good thing to do at the very beginning, just to see whether the range of the data makes sense, and here it does. The average age is around 40, the standard deviation is around 15, the minimum is 18 and the maximum is 64. Both seem reasonable, and this also tells us that perhaps the insurance company does not take applications below the age of 18 or above the age of 64. Then we have the BMI. The BMI, just to give you a sense, is the mass of a person in kilograms divided by the square of their height in meters, so it's in units of kg/m². The minimum BMI is around 15.96, or about 16, and the maximum is around 53. Then we have the number of children, which goes from a minimum of zero to a maximum of five. And then we have the actual medical charges, which go from a minimum of about 1,121 (just over a thousand dollars, so this person basically did not go to the hospital at all during the year) to a maximum of around $64,000. So there is a huge variation in the charges, from about a thousand to 64,000. Not only that: if you check the 50% mark, half the people have medical charges under $10,000, and even 75% have them under roughly $16,000. That means there is a huge skew, or there are several outliers, in the charges column.
We'll see that in more detail once we study the visualizations, but even just by looking at these numbers you can tell some of these things: the mean is about 13,000, but the 50% mark is around 9,000, so there's a bit of a skew in the data. This is what you should do whenever you load up a data frame: just look at the data, look at all these statistics, and try to draw whatever inferences you can. In this case, once again, there don't seem to be any incorrect values like negative ages, ages greater than 200, or negative charges. Everything seems nice and clean. Somebody has put together this dataset for us and made sure it is pristine; all the data is valid and nothing is missing, which makes our job a lot easier. So that was downloading the data. Let me just save my project at this point: I'm going to import the jovian library and call jovian.commit(). Make sure you're saving your notebook from time to time.

Now that we've downloaded the data, let's start by doing what we know: some exploratory analysis and visualization, and let's see where that takes us. We will visualize the distributions of values in some of the columns, and we will also look at the relationships between the charges column and the other columns, because we have to use the other columns to predict charges. We will use the libraries matplotlib, seaborn, and plotly for visualization; plotly specifically because it gives us interactive charts and saves us some coding, but all of these plots can be done using just matplotlib and seaborn as well, and I've linked to some tutorials that you can check out in case you want to refresh or learn one of these libraries. So I'm going to install these libraries first using pip, and then import them: plotly.express (a high-level API that makes it really easy to create interactive plotly charts), matplotlib.pyplot as plt (this is how matplotlib is typically imported), and seaborn as sns. These are all the conventional ways in which these libraries are imported. One last thing is %matplotlib inline. The purpose of this line is to ensure that any charts you create show up as outputs within your Jupyter notebook and not as pop-ups, because when you close a pop-up, the chart goes away. I'm also going to change some settings to improve the default style and font sizes for our charts: make the fonts a bit larger and use the dark grid style for seaborn. This only applies to matplotlib and seaborn, not to plotly.

Let's begin with the first column, age. Age is a numeric column; the minimum age in the dataset is 18 and the maximum is 64. So we can visualize the distribution of age using a histogram, essentially counting how many customers we have for every age from 18 to 64. That means we need 64 minus 18 plus 1, or 47 bins, one for each year. Another useful plot is a box plot, which tells you the minimum, the maximum, the median, and the quartiles.
That can give you a sense of the distribution and of where the 50% point lies. Using plotly, we can create both of these charts with just one line of code: we say px.histogram, pass in the medical_df data frame, specify that we want age on the x-axis with 47 bins, and give the chart the title "Distribution of Age". By adding the marginal='box' parameter, plotly also draws a box plot right above the histogram for us.

Here's what the distribution of age looks like. At the age of 18 we have 69 customers, at 19 we have 68, and then we have anywhere from 20 to 30 customers at every age. There seem to be a lot of people in the range of 40 to 50, the highest age is 64, and there's nobody beyond 64. It's a fairly uniform distribution: if you ignore those first two values, the midpoint is around 39, and the quartiles are also fairly balanced on each side. So the distribution of ages in the dataset is almost uniform, with 20 to 30 customers for every age, except for the ages of 18 and 19, which seem to have over twice as many customers as the other ages. The uniform distribution might simply arise from the fact that there is not a big variation in the number of people at any given age between 18 and 64 in the United States. For example, there's a chart available online that shows the population in different age ranges (15 to 19, 20 to 24, 25 to 29, and so on, split across male and female), and the overall population across these age groups is fairly similar. You can assume that a certain percentage of the population takes its insurance from Acme Insurance, so the data reflects the overall population. This is something you should always check when you have a dataset: does your data reflect the distribution of the overall population or not? In one respect here it does not, and I will let you figure out why. Can you explain why there are over twice as many customers aged 18 and 19 compared to other ages? If I had to guess, maybe Acme Insurance offers a lower premium if you sign up before the age of 20; that could be one reason. Or maybe 18 is the legal age at which you can get insurance, at least from this company, so a lot of people get insurance as soon as they turn 18. But these are just guesses; we would have to go and verify whether this is indeed the case, maybe even look at some industry averages if we can find reports online.

So that was the distribution of age: mostly uniform, with a couple of ages that have twice as many people. Next, let's look at the distribution of body mass index. Once again we create a histogram, this time in red, with a box plot on top. Here is the distribution of the body mass index (both histogram calls are sketched below), and it looks very different.
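A quick sketch of the two histogram calls described above, assuming plotly.express has been imported as px and the data frame is called medical_df (details like the bargap setting are just illustrative):

    import plotly.express as px

    # Age: one bin per year from 18 to 64, with a box plot drawn above the histogram
    fig = px.histogram(medical_df,
                       x='age',
                       marginal='box',
                       nbins=47,
                       title='Distribution of Age')
    fig.update_layout(bargap=0.1)
    fig.show()

    # BMI: same idea, drawn in red
    fig = px.histogram(medical_df,
                       x='bmi',
                       marginal='box',
                       color_discrete_sequence=['red'],
                       title='Distribution of BMI')
    fig.update_layout(bargap=0.1)
    fig.show()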
It seems to be distributed according to a Gaussian distribution, also called a normal distribution. In a normal distribution, most of the values in a population are centered around a single value, the mean, and the probability decreases as you move away from that mean. That's exactly what is visible here. Why is this distribution Gaussian while the previous one was roughly uniform? I'll let you think about that. What we can notice with BMI is that most people have a body mass index around 30, maybe in the range of 25 to 35, and as you get further away, fewer and fewer people have that BMI. So 25 to 35, roughly speaking, might represent the average person in terms of the ratio of weight to height. Here's how body mass index values are typically interpreted: less than 18.5 is considered underweight, 18.5 to 24.9 is considered normal, 25 to 29.9 is considered overweight, 30 to 39.9 is considered obese, and greater than 40 is considered morbid obesity. This is an important factor from a health perspective. Typically people are expected to be in the normal-to-overweight range; if they're underweight they are more susceptible to certain illnesses, and if they are obese or morbidly obese they are more susceptible to certain illnesses as well. That's why this is one important piece of information requested on your insurance form. So as we perform this analysis, you can see that there may be a certain relationship between age and medical charges (try to guess what it might be), and there may be a certain relationship between BMI and medical charges as well. With BMI, I'm guessing somewhere in the middle might mean lower medical charges, and either extreme might mean higher medical charges.

Now let's look at the most important column: the distribution of charges. Before I plot it, do you want to take a guess at what the distribution of charges is going to look like? Think about it; we've already seen the statistics for it. Here we create a histogram once again with px.histogram: we pass in the medical data frame, put charges on the x-axis, show a box plot above the histogram, and set the title to "Annual Medical Charges". But we have one more important argument here: color='smoker'. What we're doing is splitting the histogram of charges for smokers versus non-smokers; we have a column smoker and we use it to split the data. Here's what that gives us: green represents people who responded yes to smoker, and gray represents people who responded no. You can see that people who are not smokers seem to have lower annual medical charges. In fact, most people overall seem to fall in roughly the $2,000 to $15,000 expense range, which is not a very high expense, and then there are some people who need to spend a lot every year.
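Here's that call as a sketch, again assuming px and medical_df from earlier:

    # Distribution of charges, split by the smoker column
    fig = px.histogram(medical_df,
                       x='charges',
                       marginal='box',
                       color='smoker',
                       title='Annual Medical Charges')
    fig.update_layout(bargap=0.1)
    fig.show()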
And there is a stark difference between people who smoke and people who do not: the median for non-smokers is about 7,300, while the median for smokers is around 34,000. Now, it's not entirely clear whether this is because they are smokers, that is, whether it's just generally true that you will have higher medical expenses if you smoke regularly, or whether there's some selection effect in which smokers and non-smokers take out insurance in the first place. We don't know what it means yet; all we know is that people who responded yes to smoker seem to have a higher median medical expense. In general, you can see that charges follow an exponential distribution, or more appropriately, what is called a power law: a lot of people have a very low expense, and then you see this exponential decrease as the medical charges grow. If you separate out smokers and non-smokers and plot them separately, you see that for non-smokers you have this exponential decay, and for smokers you have something similar, but there seem to be two sections: maybe one is just general smokers, and the other is smokers with some ailments. So there is something to discover here: why are we seeing this kind of pattern? Do we need additional categories like light smokers and heavy smokers? We might need to look into this, but at the very least we're already starting to observe some trends: for most customers the annual medical charges are under $10,000 to $14,000, and only a small fraction of customers have higher medical expenses, possibly due to accidents, major illnesses, or genetic diseases. And that is what happens: most of the population is healthy and probably does not need to go to a hospital every year, maybe except for a few regular checkups, but some people unfortunately have accidents, major illnesses, or genetic diseases, and that's why they incur a higher medical expense. The second observation is that there is a significant difference in medical expenses between smokers and non-smokers: while the median for non-smokers is about 7,300, the median for smokers is close to 35,000.

Here's an exercise for you: visualize the distribution of medical charges in connection with some other factors like sex and region. Replace smoker by sex, and then by region, and see whether you find a similar disparity across males and females or across different regions.

While we're talking about the smoker column, let's visualize its distribution and see how many examples we have for each value. It seems that out of the 1,300 or so people, a thousand or so responded no for smoker and around 274 responded yes, so about 20% of the customers are smokers. We can visualize this using a histogram as well: we say px.histogram, pass in the data frame, put smoker on the x-axis, and break the visualization down by sex this time. Here you can see that "yes" represents people who responded yes to smoker (in total about 268 or so) and "no" represents people who responded no, broken down by sex: about 115 females and 159 males said yes, while 517 males and 547 females said no.
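A sketch of those two steps; the value_counts call is my shorthand for "see how many examples we have for each":

    # How many customers fall into each smoker category
    print(medical_df.smoker.value_counts())

    # Distribution of the smoker column, broken down by sex
    fig = px.histogram(medical_df,
                       x='smoker',
                       color='sex',
                       title='Smoker')
    fig.show()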
So it seems that smoking is more common among males in this dataset, and that's something you should try to verify. Similarly, about 20% of the customers report that they smoke; this is also something you should try to verify, to see whether it matches the national average (you can assume the data was collected around 2010, since the dataset has been around for about ten years now). The reason you should verify these things as you're performing this analysis, and the reason you should perform this analysis in the first place, is to understand whether the distribution of the data you have matches the distribution of the population itself. If your data contains 90% smokers but the general population has only about 10% smokers, then any analysis you do, or any model you build, will be wrong, because it will be built under the assumption that 90% of the inputs coming to the model will have yes for smoker. So whenever you're building a model, make sure that the distributions of each column match the distributions of the population where the model will be applied. With that, we've looked at the individual columns. You can also try to visualize the distributions of sex, region, and children and report your observations.

Now, having looked at individual columns, we can visualize the relationship between charges, which is the value we wish to predict, and the other columns. I would guess that age is one of the most important factors in determining your medical expenses, so let's visualize the relationship between age and charges using a scatter plot. We use px.scatter, pass in the data frame medical_df, put age on the x-axis and charges on the y-axis, and show a single point on the chart for each customer. Whenever you have a scatter plot, you should also use the opportunity to add some color, so let's color the points based on whether the person is a smoker or not. And let's set the opacity of the points to 0.8: we have over a thousand customers, so there may be overlapping points, and reducing the opacity a little lets us see the data better. Finally, we set a title as well.

Now we have this chart, and it looks very interesting. I can see three clear clusters of points, each of which seems to form a line with an increasing slope. The first cluster roughly lies along a line; of course it goes up and down, but you can see a linear trend. Then you have the second cluster, which also seems to have a somewhat linear trend increasing with age, but if you look carefully, it has a mix of smokers and non-smokers. It's possible this is actually two different clusters that just happen to fall in a similar place. For example, if I turn off smokers, this is what we get: among non-smokers, most people incur a fairly low medical expense, going from around 0 to 5,000 at age 18 up to around 13,000 to 15,000 at age 64. But then, as we said, there are maybe five to ten percent of people who, unfortunately, due to accidents, genetic ailments, or major illnesses, have to spend a lot more.
And then there's probably a similar trend for smokers. You can see what seems to be the baseline (probably, if you just smoke and don't have any other ailments, this is where you are), and then there's another group of people above it. It's possible that these are people who are smokers and also have ailments, or, as one suggestion goes, maybe it's the combination of being a smoker and also being obese. That could be the reason, so it's worth investigating, but you can see there are two clear sections. Maybe you need a "smoker" and a "heavy smoker" category; maybe two cigarettes a day don't make a difference, but twenty do. There are a lot of interesting things here, and that's what we've summarized. This is a general thing you should do in your exploratory analysis: whenever you observe a trend, summarize it and try to offer some explanation. And mention clearly when something is a hypothesis. We're using terms like "it's possible", which means it's something to be verified, and "presumably", as in, presumably you have one cluster for smokers and one cluster for smokers with medical issues, or it could be smokers who are also obese. We don't know; we'll have to dig deeper and figure it out, and I'll leave that as an exercise for you. There are definitely more inferences you can draw from this chart. For example, what percentage of people lie outside the norm, and how many outliers are there?

Let's look at BMI; this again seems to be an important factor. So let's visualize the relationship between body mass index and charges using another scatter plot. Once again we use px.scatter: this time bmi goes on the x-axis, charges on the y-axis, we color the points using the smoker column, set the opacity to 0.8, show the sex on hover so that hovering over a point gives some more information, and set a title for the chart. Here's what the chart looks like, and now the picture is starting to get a little clearer. If I turn off the smokers, you can see that there seems to be no real relationship between BMI and medical charges. Maybe BMI is a flawed metric; maybe it doesn't really capture much. The numbers are all over the place, and if anything, there are some outliers somewhere in the 20 to 40 BMI range, not beyond 50 or below 20 (though that could just be because there are very few people with BMIs below 20 or beyond 40). But once you add the smokers back in, or look at just the smokers, you start to see a trend: there seem to be two major clusters. Below a BMI of 30, the medical charges stay relatively low, below roughly the 30K range, and above a BMI of 30 they seem to increase. So it might appear that smoking combined with obesity is a particularly bad condition to be in. This requires more investigation, but we're already starting to uncover more information. At this point, if you have doubts, you should probably reach out to somebody in the company, somebody who actually has expertise in the field, and ask whether this is indeed the case.
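Sketches of the two scatter plots we just walked through; the marker size tweak is just illustrative:

    # Age vs. charges, colored by smoker
    fig = px.scatter(medical_df,
                     x='age',
                     y='charges',
                     color='smoker',
                     opacity=0.8,
                     title='Age vs. Charges')
    fig.update_traces(marker_size=5)
    fig.show()

    # BMI vs. charges, colored by smoker, with sex shown on hover
    fig = px.scatter(medical_df,
                     x='bmi',
                     y='charges',
                     color='smoker',
                     opacity=0.8,
                     hover_data=['sex'],
                     title='BMI vs. Charges')
    fig.update_traces(marker_size=5)
    fig.show()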
And all of these things are going to be important, because you will often have to explain why the model you build gives the predictions it does; there are regulatory requirements in healthcare, in finance, and in a bunch of other domains as well. That's why understanding these relationships is important. What other insights can you gather? See if you can pull some more out of this graph, and try to create more graphs to visualize how the charges column is related to other columns. For example, charges against children: do you see a general increase in charges when people have more children? Or charges against sex: is there a difference between males and females? Maybe compare with region, and maybe compare directly with smoker; just look at the average charges for smokers and for non-smokers. Here's a hint: you might want to use a violin plot for some of these. If you do px.scatter with medical_df, put children on the x-axis and charges on the y-axis, there's a problem: you can't really tell much, because the value of children is not continuous. It goes zero, one, two, three, four, five, so you don't know whether there are five points at a given spot, or fifty, or five hundred; they're all on top of each other. If you just change scatter to violin (and if you've ever wondered why a violin chart is useful, now you'll see), you also get a width for each value: the width tells you, relatively, how many values lie there. So you can see a bunch of values here, a bunch there, and then all these outliers, and now you can observe a general trend: the bulk of the values shifts up as the number of children increases. As the number of children increases, the medical costs seem to increase slightly. Not by a lot, because there are many people who have very low medical costs despite having four children, and many people who have very high medical costs despite having zero children, but there is some sort of trend, a very weak trend, but a trend nevertheless.

As you can tell from the analysis, the values in some columns are more closely related to the values in charges than others. For example, age and charges seem to grow together, and BMI and charges don't so much. The relationship between two columns, whether they grow together or not, is often expressed numerically using a measure called the correlation coefficient. It has a certain formula and a certain meaning, but first let's just see how to compute it. The correlation coefficient can be computed using the .corr method of a pandas Series. If we say medical_df.charges.corr(medical_df.age), that is, we compute the correlation between charges and age, it comes out to about 0.299. We'll see what that means in just a second. If we compute the correlation between the medical charges and the BMI, it's about 0.198, so there's definitely a higher correlation between age and charges than between BMI and charges. And let's also compare children.
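Here's a sketch of the violin plot and the correlation calls described above:

    # A violin plot shows how many customers sit at each value of 'children'
    fig = px.violin(medical_df, x='children', y='charges')
    fig.show()

    # Correlation of charges with the numeric columns
    print(medical_df.charges.corr(medical_df.age))       # roughly 0.299
    print(medical_df.charges.corr(medical_df.bmi))       # roughly 0.198
    print(medical_df.charges.corr(medical_df.children))  # roughly 0.067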
So with children, there seems to be an even lower correlation than with BMI. That's how you compute correlation between columns. You can also compute correlations for categorical columns, but it won't work directly (at least I believe it won't), because you need numeric data to compute a correlation. So what you need to do first is convert your categorical data into numeric data. Here's what I'm going to do: I'll create a new series called smoker_numeric. How? We take medical_df.smoker and call .map on it. .map takes either a function or a dictionary and applies it to every value. Here we supply a dictionary that converts "no" into 0 and "yes" into 1, so when you call medical_df.smoker.map(smoker_values), every no becomes 0 and every yes becomes 1. This is a very useful technique whenever you want to quickly convert some categories into numbers. If we compare medical_df.smoker with smoker_numeric, where you had yes, no, no, no, you now have 1, 0, 0, 0. Once it's converted into numbers, you can compute the correlation: medical_df.charges.corr(smoker_numeric) gives a correlation of about 0.787.

Now you can start to see the trend: 0.787 for smoker, probably the highest correlation, then 0.299 for age, then about 0.19 for BMI, and then about 0.06 for children. And there are some other categorical columns as well, so we're not done yet.

So here's how the correlation coefficient is interpreted. It indicates two things: strength and direction. The greater the absolute value of the correlation coefficient, the stronger the relationship. An extreme value like -1 or 1 (and 1 is the highest value you can get; you cannot get anything higher than 1 or lower than -1) indicates a perfectly linear relationship, where a change in one variable is accompanied by a perfectly consistent change in the other. Essentially, the points lie on a line, and there's a geometric interpretation: if you have x and y, say age and charges, and the points form a perfectly straight upward line, that's a correlation of 1, a perfect positive correlation. If the points form a straight line but an increase in one comes with a decrease in the other, that's a perfect negative correlation, close to -1. Usually you have something in between: a strong positive correlation, where the points are distributed around a line; a weak positive correlation, where there's a weak trend of some kind; no correlation at all, where the points are all over the place; or a weak or strong negative correlation. So the sign of the correlation coefficient represents the direction of the relationship: if it's positive, you have an increasing relationship; if it's negative, a decreasing one; and if the correlation coefficient is very close to zero, you have practically no correlation.
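As code, the categorical-to-numeric trick from above looks roughly like this:

    # Convert the yes/no smoker column into 1/0 so we can compute a correlation
    smoker_values = {'no': 0, 'yes': 1}
    smoker_numeric = medical_df.smoker.map(smoker_values)

    print(medical_df.charges.corr(smoker_numeric))  # roughly 0.787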
The correlation coefficient also has a formula that you can check out: essentially, for each point you take x_i minus the mean of x, times y_i minus the mean of y, add those up, and divide by the square root of another quantity that normalizes the result. I'll let you work through exactly what the formula means, because it's simply a question of carefully looking at each of the terms, and the interpretation is the more important thing. When you see a correlation of +1, or anything greater than, say, 0.5 or 0.7, picture points tightly clustered around an upward-sloping line. Less than 0.5, picture a weaker upward trend; close to zero, points all over the place; and less than zero, a downward trend. Those are the pictures to keep in mind. If you really want to learn the mathematical definition and the geometric interpretation, I have linked to a video here, but this picture pretty much summarizes everything we need to know.

Pandas data frames also provide a .corr method to compute the correlation coefficients between all pairs of numeric columns. If I just call medical_df.corr(), it computes the correlations among age, BMI, children, and charges, which are the numeric columns. You can see that age has a perfect correlation with itself: if you plot age against age with px.scatter (x='age', y='age'), you obviously get a straight line and a perfect correlation. What we're interested in, though, is the correlation of charges with the other columns: age has a correlation of 0.299 with charges, BMI a weaker 0.198, and children an even weaker 0.067, while charges with charges is of course 1. This is normally displayed as a heat map, because it's difficult to scan all these numbers. The result is called a correlation matrix, and I've simply passed it into sns.heatmap from seaborn, which adds some color and helps you see that age and charges have a relatively high correlation.
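Here's a sketch of the correlation-matrix step, plus a hand-written Pearson computation that spells out the formula described above. It assumes seaborn is imported as sns and matplotlib.pyplot as plt; selecting the numeric columns explicitly and the 'Reds' color map are just illustrative choices.

    import numpy as np

    # Correlation matrix for the numeric columns, shown as a heat map
    corr_matrix = medical_df[['age', 'bmi', 'children', 'charges']].corr()
    sns.heatmap(corr_matrix, cmap='Reds', annot=True)
    plt.title('Correlation Matrix')
    plt.show()

    # Pearson's r by hand: sum of (x - mean_x) * (y - mean_y), divided by the
    # square root of the product of the summed squared deviations
    def pearson_r(x, y):
        x, y = np.asarray(x), np.asarray(y)
        dx, dy = x - x.mean(), y - y.mean()
        return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

    print(pearson_r(medical_df.age, medical_df.charges))  # should match .corr, about 0.299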
One last thing I want to mention is the correlation-versus-causation fallacy. Note that a high correlation cannot be used to infer a cause-and-effect relationship between features. Two features X and Y can be correlated if X causes Y or if Y causes X, and you can see that in the formula: if you swap X and Y, nothing changes; it's the same formula, and if you flip the axes of the scatter plot, the relationship stays the same. So correlation gives you no direction in which to interpret causation. X could cause Y, Y could cause X, or both could be caused independently by some other factor Z, and the correlation will no longer hold if one of those cause-effect relationships is broken: if Z causes both X and Y, but you fix or change whatever effect Z has on Y, then X and Y will no longer be correlated. So do not read cause-and-effect relationships into correlations. It's easy to do, and a lot of people do it, even when it seems obvious, but it's wrong. To give you an example: suppose you created a scatter plot between race and income and found that a certain race has lower income and another has higher income. If you drew a cause-and-effect conclusion there, you could say that income is a function of race, that if you belong to a certain race you will earn less, and you might then read even more into that, attributing it to such-and-such reasons. But it's possible this is because of something else entirely, for example systemic racism, which may have caused incomes to be lower for a particular race. If you fix that, if you control for it and look at people who have not been subjected to racism for several generations, you might find that the correlation no longer holds. So try not to read too much into these graphs. Always be careful, because it's a very natural human tendency to create these interpretations, and the problem is magnified when such correlations get captured in automated systems. Computers cannot differentiate between correlation and causation, and decisions based on an automated system can have major consequences for society, so it's important to study why an automated system leads to a given result. Another example you may have heard about: facial recognition technology developed by Amazon is being used in law enforcement. Because of the current distribution of people who are incarcerated, a certain race has a higher representation in the incarcerated population, so the model is more likely to predict or detect a person of that race as a criminal, and so more people of that race are likely to be incarcerated. It can actually aggravate the problems we already have in our human systems. That is why it's very important to be mindful of why we're getting a certain result out of a system we create, and to realize that determining cause-and-effect relationships requires human insight. We shouldn't blindly trust computer systems. So that's my small note about correlation and causation to keep in mind.

So that's our exploratory data analysis. We've spent half the lecture just doing analysis, but it's important; in many ways this is the more important piece of machine learning, the work that comes before the actual machine learning, and the machine learning part becomes fairly straightforward after that.

Let's move on to linear regression with a single feature, and we'll define what linear regression is. We know that the smoker and age columns have the strongest correlation with charges; we've seen that already. So let's try a way of estimating the value of charges using the value of the age column. We can't really use smoker directly, because it's just a yes/no value; how exactly would you use that to predict charges? But age has a continuous value. So let's do one thing: let's estimate the value of charges using the value of age, just for non-smokers, and we'll deal with smokers later. First, let's create a data frame containing just the data for non-smokers: we select the rows of medical_df where smoker equals "no". And let's plot it once again; I'm just using a seaborn scatterplot this time, since we don't need any interactivity.
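A sketch of that filtering-and-plotting step, assuming seaborn is imported as sns and matplotlib.pyplot as plt:

    # Keep only the rows for non-smokers
    non_smoker_df = medical_df[medical_df.smoker == 'no']

    # Simple (non-interactive) scatter plot of age vs. charges for non-smokers
    plt.title('Age vs. Charges (non-smokers)')
    sns.scatterplot(data=non_smoker_df, x='age', y='charges', alpha=0.7)
    plt.show()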
And you can see the trend: there seems to be a generally linear pattern, so I could maybe fit a line here. There are a bunch of outliers, so maybe I'd move my line up a little bit to account for them; a line roughly through the middle would be ideal. So apart from a few exceptions, the points seem to form a line, and we'll try to fit a line to these points and then use that line to predict charges for a given age. Say we fit a line, and then a new customer comes in and provides their information; let's say they're 30 years old. We look at where the line is at age 30, that is, the y-value of the line at 30, which might be somewhere around 5,200, and we would predict that their medical charges are about 5,200. That's how prediction works: once you get the line, you use the line to make predictions.

A line on x-y coordinates has the formula y = wx + b. As you change the value of x, the value of y changes, and it is determined by two numbers: w, called the slope, and b, called the intercept. You pick a value for w, say 100, and a value for b, say 50, and as you change x, y changes; that's what draws a line on the 2D plane. It's basic geometry you would have seen in high school. w is called the slope because it determines how much change you see in y when you make a change in x; the rate of change of y with respect to x is w. And b is called the intercept because b is the value of y when you set x to zero, which means b is the point at which the line intersects the y-axis.

In the graph above, the x-axis represents age and the y-axis represents charges. So what we're essentially assuming is that charges and age have the following relationship: charges equals some number w multiplied by age, plus some number b. We will then try to figure out a w and a b that give the line which best fits the data. Obviously this is not a perfect relationship; medical charges depend on a lot more than age, as we've already seen, and maybe you can't predict medical charges exactly because there's a lot of uncertainty. But there's a certain trend we're trying to capture, because some information is better than no information; at the very least, knowing that a person of a higher age should probably be charged a higher premium is useful. So we have charges = w × age + b. The number w is also called a weight, and the number b is called a bias; they're called slope and intercept in geometry, but weight and bias in statistics and machine learning. In any case, this is the relationship we are assuming. This technique is called linear regression, and we call this equation a linear regression model, because it models the relationship between age and charges: you're putting forward the hypothesis that charges is some weight w multiplied by age, plus some number b. So this is our model.
This equation, literally, is our model, and the numbers w and b are called the parameters or weights of the model. The model has a structure (an equation) and it has some parameters. As you change the parameters, you still have a line; the structure is maintained, but the line moves around, so the model gives different predictions: when you input a certain age, it gives a different output for charges as you change the values of w and b. Finally, the values in the age column of the dataset are called the inputs to the model, and the values in the charges column are called the targets. We assume that when we multiply the inputs by the number w and add b, we get something close to the charges, and we want to get as close as possible. That's our linear regression model. It's called linear regression because we're fitting a line, and we're trying to make the line fit the points as closely as possible.

Now, this equation is well and good, but we'd like to express it in a way that we can compute, which is using code. So we can turn this equation into a function. Let's define a helper function, estimate_charges, which takes an age, a value of w, and a value of b, multiplies the weight by the age, adds the bias b, and returns the result, which is an estimate of the charges. This function, estimate_charges, is our very first model, our first machine learning model, and we'll define machine learning in just a second. We don't know what values to pick for w and b, so I'm just going to assume a weight of 50 and a bias of 100 and see how it works out; I don't expect it to be perfect. We can call estimate_charges on a single age: for age 30 with these w and b, the estimated charges come out to 1,600. Is that reasonable? For 30, an estimate of 1,600 would sit quite low on the chart, so my guess looks bad. I can check 40 as well: the estimate is 2,100, which still seems pretty bad, because 2,100 is also well below the points. So my line is running below the data, and I should verify that somehow.

Here's what we can do. Let me take all the values in the age column, non_smoker_df.age, which is a list of all the ages in the dataset, and get the estimated charges from our model for these ages. We put the ages into estimate_charges along with w and b (w is 50, b is 100), and it computes w times the ages plus b. Because this is a numpy array, the same multiplication and addition operators work just fine on an entire array of data, not just a single value. Looking at the estimated charges: for age 18 the estimate is 1,000, for age 28 it is 1,500, and so on. You can compare them with the actual charges, non_smoker_df.charges, and you can already see that these are pretty bad: we estimate 1,000 where the actual value is about 1,700, and 1,500 where the actual value is about 4,500. So we are way off.
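Here's a minimal sketch of that helper and how it's applied to the whole age column:

    def estimate_charges(age, w, b):
        # Our first "model": a straight line, charges = w * age + b
        return w * age + b

    w, b = 50, 100                     # an arbitrary first guess
    print(estimate_charges(30, w, b))  # 1600
    print(estimate_charges(40, w, b))  # 2100

    # The same expression works element-wise on the whole column
    ages = non_smoker_df.age
    estimated_charges = estimate_charges(ages, w, b)
    print(estimated_charges.head())
    print(non_smoker_df.charges.head())  # compare with the actual targets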
Let's plot what this looks like. Maybe first I'll do a scatter plot: the estimated charges have a linear relationship, which is exactly what we assumed, and that's what the graph shows. It might be even easier to plot it as a line instead of a scatter plot, so I'll use plt.plot, which draws a line, and make it a red line; we don't need actual points to tell us it forms a line. Now if you want to predict, say at age 45, the estimated charges would be around 2,400 or so. So we have a model: we give it some data, an age, and it gives us an output. But we want to know how good the model is, and for that we can plot this line on top of the scatter plot of the actual data. Here's what we do: we take the actual data, non_smoker_df.charges, which is what we want to predict, and our estimates. We first plot the estimates with plt.plot, ages against estimated charges, as a red line, and then plot a scatter plot of ages against the actual charges people have incurred in our dataset. Here's what it looks like: this is our estimate based on the current values of w and b, which are 50 and 100, and these are the actual values for the different customers. Obviously this is pretty bad.

But let's try to improve it. We now have a starting point, and we can tell that, hey, maybe my weight needs to be higher, because as I go from 30 to 40 the increase needs to be bigger, and maybe the bias is roughly okay; if I drew a better line, it would probably intercept the y-axis somewhere around 2,000, I don't know. So I've defined this function try_parameters, which takes a w and a b, gets the ages and the targets, computes the estimated charges using the estimate_charges function, and then plots the line and the scatter plot. We can call try_parameters with different values of w and b to see what best fits the data. Calling try_parameters with 60 and 200 is slightly better than 50 and 100, but still no good. Let's go higher: increase the weight to around 400 and the bias to 5,000. That moved the line up by quite a bit. Maybe I should shift the bias back down a little: instead of 5,000, try 3,000, maybe 2,000. As I change the value of b, the line moves down: maybe 1,000, maybe 500, maybe I even need to go negative, say minus 600, or minus 2,000. There you go. You can now start to see what's happening as you change the bias and the weight: the line shifts and moves. When you change b, the line moves up and down; when you change w, the slope of the line increases or decreases. Set the weight to zero and the line goes flat; set it to a thousand and the line gets really steep. We probably want to be somewhere in between, around 500. So that's how you try to fit a line.
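A sketch of that plotting helper, reusing the names from earlier (estimate_charges, non_smoker_df, plt); the marker sizes and legend labels are just illustrative:

    def try_parameters(w, b):
        # Plot the line for this (w, b) on top of the actual data points
        ages = non_smoker_df.age
        target = non_smoker_df.charges
        estimated_charges = estimate_charges(ages, w, b)

        plt.plot(ages, estimated_charges, 'r', alpha=0.9)
        plt.scatter(ages, target, s=8, alpha=0.8)
        plt.xlabel('Age')
        plt.ylabel('Charges')
        plt.legend(['Estimate', 'Actual'])
        plt.show()

    try_parameters(60, 200)    # a slightly better guess
    try_parameters(400, 5000)  # steeper line, higher intercept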
So here's an exercise for you: try various values of w and b to find a line that best fits the points. Spend maybe five or ten minutes trying out different values, and you'll be able to get to a pretty good fit just by guessing. As you saw, the strategy is to look at the line, compare it with the points, and adjust: if you need to move the line down, decrease b; if you need to increase the slope, increase w; if you need to decrease the slope, decrease w. So you already have a strategy you can apply each time.

Now, as we change the values of w and b manually, trying to move the line visually closer to the points, we are learning the approximate relationship between age and charges. Wouldn't it be nice if a computer could try several different values of w and b and learn the relationship between age and charges for us? After all, once we've established that there's going to be a linear relationship, we just need to move the line around, up and down, and change its slope, and we should be able to get to the right place. To do this, we need to solve a couple of problems. First, we have eyes and can see that the line is far from the points, but the computer does not have eyes, so to speak, so we need a way to measure numerically how well the line fits the points. Second, we are able to guess which way to move w and b by looking at how far the line is from the points, but again, we're applying a lot of visual insight, so once we've measured the fit of the line, we need a way to modify w and b to improve it. If we can solve these two problems, it will be a breakthrough, because it then becomes possible for a computer to determine the best-fit values of w and b starting from a random guess.

Let's solve the first problem. We want to compare our model's predictions with the actual targets, and here's the method we can follow. First, we calculate the difference between the targets (the actual charges from the dataset) and the predictions from our model; this difference is called the residual. The targets are simply non_smoker_df.charges, and the predictions are the estimated charges, that is, what our model predicted for the input data for a particular choice of w and b. So we take the difference between each prediction and the corresponding target, which tells us how far each prediction is from its target. The problem is that some predictions are lower than the target and some are higher, so to remove negative values we square all the elements of the difference vector. For example, 1,000 minus 1,725 is -725; square it, so 725 squared. Then 1,500 minus 4,449 is again a negative value; square it. At some points you may get positive values too, maybe a prediction of 2,600 against an actual value of 1,600, so you get a positive value; square that as well.
So each difference is called a residual, and we square all the residuals. Then we calculate the average of all those squared residuals: this prediction minus its target, squared, plus the next prediction minus its target, squared, and so on, divided by the total number of targets. And finally, we take the square root of the result. It seems like a complicated process, but it's actually fairly straightforward, and what it gives us is called the root mean squared error. Let's break that down. The error is simply the difference between the targets and the predictions. If you just added up the errors, positives and negatives could cancel out, so we square them first; now all the errors are positive, and taking their average gives us the mean squared error. Then we take the square root because we want to interpret the number as a dollar value: squaring converted the differences from dollars into dollars squared, so the square root brings us back to dollars. That's why this is called the root mean squared error, and it tells us, on average, roughly how far each point is from the line. This is not the only option. There is also the mean absolute error, where instead of squaring you take the absolute value, which simply ignores the sign. The difference between the two is that outliers contribute much more heavily to the squared error, because the greater the error, the greater its square, so it becomes a bigger factor in the overall number. That's just one thing to keep in mind, but make sure you understand root mean squared error: it is, on average, how far away we are from the line. The formula says exactly the same thing: prediction minus actual, squared, summed over all the predictions, divided by N, and then the square root of that. And you can visualize it too: suppose this is the line and these are the actual target points; for each point you calculate how far it is from the prediction, where the prediction is simply the point on the line at the same x value. That distance is the residual for that point. Then you add up the squares of all the residuals, take their average, and take the square root. Those are all different ways to look at the root mean squared error, but let's look at it in the way we understand best, which is code. I'm going to install the NumPy library and import numpy as np, and then define this function rmse, which takes a set of targets and a set of predictions. First we compute targets minus predictions, that's the error; we square it, take the mean, and then take the square root. That's the root mean squared error. So now we have a way to tell how bad our model is.
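As a sketch, the rmse helper described here might look like this (assuming the targets and predictions are NumPy arrays or pandas Series):

import numpy as np

def rmse(targets, predictions):
    # Root mean squared error: square the residuals, average them, take the square root
    return np.sqrt(np.mean(np.square(targets - predictions)))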
So let's compute the root mean squared error for a model with a sample set of weights. Here are the weights, 50 and 100, which is what we started out with, and here's the line those weights produce. Now let's get some predictions. The targets are simply non_smoker_df.charges, and the predictions come from putting the weights and the inputs into the model: I take the list of ages and pass them to the estimate_charges function along with the weight and bias. That gives us, for each age in the actual dataset, the point on the line, or more precisely the y coordinate of the point on the line for that age. Now we can compute the root mean squared error between the targets and the predictions, and it comes out to about 8,461. How do we interpret this number? Roughly speaking, on average each prediction differs from the actual target by 8,461. That's not the exact interpretation, but it's pretty close, and it's a useful way to look at it. So is that good or bad? Visually we can already tell it's bad, but think about it numerically too: for somebody without very high medical expenses, the charges mostly fall somewhere in the range of about 2,000 to 15,000. If you're trying to predict a value between 2,000 and 15,000 and you're off by 8,000, that's pretty bad, even accounting for the outliers, because the outliers are a small number of points. I might be willing to live with an error of maybe four or five thousand at most, but 8,000 is just all over the place, and you can tell the same thing from the line: it's too far away from the actual data. That's how you interpret the root mean squared error. This error is also called a loss, because it indicates how bad the model is at predicting the targets. If the targets are very far from the predictions, the differences will be large, their squares even larger, the average larger, and so the square root larger too; in other words, if your model is really bad, the root mean squared error will be very high. It's called a loss because it represents the information lost by the model: remember, we are not capturing the actual relationship, we are simply modeling it, and all models are wrong, it's just that some are useful. How do we figure out which models are useful? By looking at the loss: the lower the loss, the better the model; the higher the loss, the noisier and more useless the model becomes. So let's do one thing. Before we train a computer to find the lowest loss, let's try to do it manually. I have this try_parameters function, which takes a weight and a bias, makes predictions using our estimate_charges function, and plots a graph; at the end I've also computed and printed the loss for the predictions made with those weights and biases. Here's what that looks like: the same line as before, and an RMSE loss of 8,461. Now let's take up the challenge of reducing the RMSE loss and making it as low as possible.
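Concretely, the baseline loss quoted above can be computed like this (a sketch reusing the hypothetical estimate_charges and rmse helpers and the non_smoker_df data frame):

w, b = 50, 100
targets = non_smoker_df['charges']
predictions = estimate_charges(non_smoker_df['age'], w, b)
rmse(targets, predictions)   # roughly 8461 for this starting guess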
One thing that's clear is that I should be increasing the slope of the line. Let me make this 300. That did not do much. Oh wait, that's not the slope; this is the slope. Let me make it 350. All right, that pushed it up by quite a bit. Maybe let me push it down a bit, say by 4,000. Yep, that's getting pretty close, and I've already arrived at an RMSE loss of 4,991. So that's what you want to do, and you can experiment here and try to get the loss as low as possible. Now you have a very clear target: you don't have to judge it visually, and you'll see that as the RMSE loss gets lower and lower, the line gets closer and closer to the points. It will never lie exactly among the points, because of all the outliers we need to account for, but it should lie somewhere along this region. So 350 and -4,000 is my best attempt; see if you can beat that, and try to come up with a general strategy for finding better values of w and b just by looking at the loss, not necessarily by looking at the graph. Here's one strategy I'll suggest. If you have a certain w and a certain b, try slightly increasing w. If that reduces the loss, increase w by a larger amount. If it doesn't, try slightly decreasing w, and if that reduces the loss, decrease w by a larger amount. If that doesn't help either, try the same with b, or try w and b in parallel. This strategy of making a small change, seeing whether it increases or decreases the loss, and then taking a larger step in the direction that helps, is a bit like standing on an uneven surface in a dark room and trying to figure out which way is downhill: you put one foot out and feel higher ground, put the other foot out and feel lower ground, take a couple of steps along the lower ground, and repeat the process. This strategy is called gradient descent, and it can be implemented mathematically using derivatives. That's what we need: a strategy to modify the weights and biases, the parameters of our model, to reduce the loss and improve the fit of the line to the data. There are a couple of standard approaches. One is called the ordinary least squares method. Using matrix operations and matrix inversions, a combination of calculus and linear algebra, it directly arrives at the best values of w and b. It works well for small datasets, say a few thousand rows, but for larger datasets it is inefficient and can take up a lot of memory. For those you have the other technique, stochastic gradient descent, which is what I just described: imagine you're on a hill in the dark and want to get down; you take small steps to figure out which direction goes downhill, take a few steps in that direction, and repeat.
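Just to make the "nudge and check" idea concrete, here is a deliberately naive sketch of that strategy in code; it is not how scikit-learn works internally (ordinary least squares and stochastic gradient descent are far more efficient), and it reuses the hypothetical estimate_charges and rmse helpers from above:

ages = non_smoker_df['age']
targets = non_smoker_df['charges']

w, b = 50, 100                                   # starting guess
for _ in range(5000):
    best = rmse(targets, estimate_charges(ages, w, b))
    # try a small nudge in each direction and keep the first one that lowers the loss
    for dw, db in [(1, 0), (-1, 0), (0, 50), (0, -50)]:
        if rmse(targets, estimate_charges(ages, w + dw, b + db)) < best:
            w, b = w + dw, b + db
            break

print(w, b)   # should have drifted toward a much better line than (50, 100)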
You can check out these videos for a mathematical explanation of how these methods work. We're not going to look at them in a lot of detail, because you will essentially never have to implement them yourself; it helps to understand them, but in practice you will depend on libraries to do this, and the more important skill is knowing how to use these methods rather than reproducing the math. To summarize: ordinary least squares works well for small datasets and uses direct matrix computations; gradient descent works well for larger datasets and uses an iterative approach. Here's a visualization of how gradient descent works. It starts out with a flat line and then slowly improves the parameters: you can see it improving the bias first, and once it gets close enough it starts turning the line as well. At each step it computes the loss, figures out whether it should increase or decrease the weight, figures out whether it should increase or decrease the bias or intercept, and keeps changing the line step by step until it gets very close to the best fit line. It works really well for large datasets with millions or tens of millions of data points, and you will get to see gradient descent in action when we apply linear regression to some really large datasets. It's actually quite similar to our own strategy of gradually moving the line closer to the points. Now, in practice you will never have to implement either of these methods yourself; you can use a library like scikit-learn to do it for you. So I'm going to install scikit-learn, which contains functions and modules for doing machine learning with Python; it's one of the most used libraries in the world for machine learning. We'll start with the simplest thing it offers, the LinearRegression class, and use it to find the best fit line for age versus charges. So I run pip install scikit-learn, and then I import the LinearRegression class; it lives in the linear_model module. We haven't done anything yet; next we create a new model object, model = LinearRegression(). Every model in scikit-learn has a fit method, so let's look at the help for fit. The fit method takes some inputs and some targets, and in the case of linear regression it fits a line between them. That's what we'll do next: fit the model to the data. model.fit takes an X, the inputs, which should have the shape (number of samples, number of features), where the number of samples means the number of data points we have; in this case we have just one feature. The important thing is that X should be a two-dimensional matrix or NumPy array. Then there's y, which should be array-like: for a single target, simply a one-dimensional array giving the target value for each sample.
Or, you can even predict multiple targets: say you want to predict medical charges, travel charges and some other charges, you can have multiple target columns as well. In our case we have only one. So let's create some inputs and targets and pass them into model.fit. Here's what I'm going to do: from non_smoker_df I'm going to pick age, but I don't want to just write non_smoker_df.age, because that gives me a series, which has only one dimension, and remember model.fit requires two dimensions. So instead I index with a list of columns, and that list contains the single column 'age'. It's a subtle difference: we don't want a series, we want a data frame. If you check, the type of inputs is DataFrame, and inputs has the shape (1064, 1). So the inputs need to be two-dimensional; that's very important. The targets can just be an array, and the targets are simply the charges, non_smoker_df.charges. We check the shape of the inputs and the shape of the targets, and they have the expected shapes. Now, this is the one line that does everything we've been building towards. When we wrote model = LinearRegression(), we already told scikit-learn that we want to assume a linear relationship between our inputs and our targets, meaning we want the target, charges, to be some weight w multiplied by the age plus some number b. Now we're telling scikit-learn: here are some inputs, and here are the targets for those inputs, meaning here are some ages and the charges for those ages. Please start with some initial w and b, get the predictions, compare them with the targets to see how bad the predictions are, and then use gradient descent or some other optimization strategy to keep improving the values of w and b until you cannot improve any more, and give me the final result. That's what happens when you call model.fit. Once model.fit has run, a line has been fitted inside the model, which means a w and a b have been found that make the root mean squared loss as low as possible; the loss function is handled inside the model, so we don't need to worry about it either. Here's what we can use the model for now: making predictions. We've fitted our model, so let's try predicting the charges for the ages 23, 37 and 61. We call model.predict and give it a two-dimensional array with a bunch of rows and one column per row: the value 23, which is an age, another value 37, and another value 61. So we expect to get three charge values back. For age 23 the model predicts 4055, for age 37 it predicts 7796, and for age 61 it predicts 14210. So it's starting to look good.
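Putting the steps described above together, a sketch of the single-feature fit and the sample predictions (column names as discussed; non_smoker_df assumed from earlier):

import numpy as np
from sklearn.linear_model import LinearRegression

inputs = non_smoker_df[['age']]      # a DataFrame (2-D), not a Series
targets = non_smoker_df['charges']   # 1-D array-like is fine for the targets

model = LinearRegression()
model.fit(inputs, targets)

# Predict charges for three sample ages; each row is one input with one feature
model.predict(np.array([[23], [37], [61]]))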
Remember that for the smaller ages the actual costs are around $4,000, and for the larger ages, around 60 or 64, they get closer to 15,000, so it seems like our model has done a pretty good job. Let's compute the predictions for the entire set of inputs we gave the model. Our inputs are simply the ages, so let's pass in the full list of ages and see what the model predicts for each one. This is what the predictions look like, and you can check them against the inputs to see whether they seem reasonable: 18, 28, 33, 31 and so on. For 18 we predict 2719, for 28 we predict 5391, and so on. Let's compare them with the targets, the actual costs these people incurred. This person incurred 1725 and we predicted 2719. This person incurred 4449 and we predicted 5391. This person incurred 21,000 and we predicted 6727. This person incurred 2007 and we predicted 3520. So it's in a decent range: maybe off by a thousand or so, but when you're dealing with values up to around 15,000, an error of a thousand is probably acceptable, and it's mostly on the higher side for the lower values, which is fine too, because we know there are a lot of outliers that are quite high. We can also look at the loss. We have the rmse function, the root mean squared error: take the targets minus the predictions, square all the numbers, take the average, and take the square root. The RMSE for this model is 4662, which means on average we're off by about $4,600. But when you look at the individual values, you'll see that we're usually off by much less; it's just that some of these are such big outliers that we cannot capture them with a simple line. For this one we should be predicting 21,000 but we get 6,727, and it's simply not possible to reach an outlier that far out using a line. So on average we're actually doing much better than 4,662, probably more in the one to two thousand range, but the RMSE captures the fact that there are a bunch of outliers the model can't explain. This is the whole process of machine learning: you assume a certain relationship between inputs and targets, in this case a linear relationship; you define a loss, a way to measure how bad your model is for whatever parameters it currently has; and then you use some kind of optimization algorithm to improve the parameters of the model, in this case w and b, until the loss is minimized. So we've gotten to a good point: our model is off by around 4,000 on average, which is not too bad considering there are several outliers. Now you may be wondering, where are w and b? In scikit-learn, w is stored in model.coef_, short for coefficients, which are the weights applied to each individual feature. In this case we're working with just age, so the coefficient w is about 267. And b is stored in model.intercept_, which is about -2091. So if we had picked a w of 267 and a b of about -2091 by hand, we would have gotten a pretty good line; in fact, that would have been the best possible model.
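So, as a sketch, recovering the learned parameters from the fitted model looks like this (the values in the comments are just the rough numbers quoted above):

w = model.coef_        # the weight for each feature; here a single value, roughly 267
b = model.intercept_   # the bias/intercept, roughly -2091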
So when you try this out later, see how close you can get without actually looking at the values of w and b. Let's plot it once more: here we have all the points in blue, and here is our line showing the model's predictions. You can see that the line fits quite nicely. It sits slightly above the main cluster, because it also has to account for the outliers at the top; if the line ran right through the cluster, it would be ignoring the fact that those outliers exist. So to account for them, it sits slightly above the bulk of the data, and that's pretty good. That's how you do linear regression with scikit-learn: you use the LinearRegression class, you fit the model, and then you generate predictions on new or existing data. Now, the LinearRegression class uses the optimization technique called ordinary least squares. There's the other technique I mentioned, gradient descent; for that you use the SGDRegressor class. I encourage you to try out SGDRegressor. It works in exactly the same way, and that's one of the nice things about scikit-learn: if you know how to use one model, you know how to use all the models; it's just what happens internally that differs. So try the SGDRegressor, train a model, make some predictions, compute the loss and see whether you notice any difference in the result. One other exercise for you is to repeat the steps in this section to train a linear regression model that estimates the medical charges for smokers. So far we've been dealing with non-smokers; now work with the smokers, visualize the targets and predictions, and compute the loss. Can you get a model as good as the one for non-smokers? What does it look like? For the smokers we seem to have two fairly distinct clusters of data, so that would be worth figuring out. And that's it. Congratulations, you have just trained your first machine learning model. Machine learning is simply the process of computing the best parameters to model the relationship between some features and some targets. Every machine learning problem you'll solve, in this course or in general, will have three components. The first thing you have to decide is how you want to model the relationship. Do you want to model it as a linear equation? Maybe as a polynomial equation? Maybe as some kind of decision tree? Maybe in some other way? There are many, many ways to model relationships. The simplest is to say that the thing we're trying to predict, the target, is a weighted sum of the input features we're interested in. For example, we could say that the charges are some weight w1 multiplied by age, plus some weight w2 multiplied by the number of children, plus some weight w3 multiplied by the BMI, plus some number b, the intercept. So that's the model, and every model has parameters; that's what the machine has to learn.
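Coming back to the SGDRegressor exercise mentioned above, here's a minimal sketch. Note that stochastic gradient descent is sensitive to the scale of the inputs, so you may need to scale the features (covered later in this session) to get a sensible result; inputs, targets and rmse are the hypothetical names used in the earlier sketches:

from sklearn.linear_model import SGDRegressor

sgd_model = SGDRegressor()
sgd_model.fit(inputs, targets)
sgd_predictions = sgd_model.predict(inputs)
rmse(targets, sgd_predictions)   # compare this loss with the LinearRegression loss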
To understand how good the parameters of the model are, we have a cost function, and there are several choices of cost function. The idea is that it gives you a number, a cost, that tells you how badly the model is performing, and the lower the cost, the better the model. Once you have a cost function that can be computed from some targets and predictions, you need some kind of optimization technique to reduce that cost, which means changing the parameters of the model so that it fits the data better. This is what the whole process looks like. You put some input data into the model and get some outputs. You feed both the targets and the outputs into the error or loss function. You then apply an optimization method, and that method changes the weights and biases, the parameters of the model, so the model gets slightly better. Then you put in more input data, get more output, feed that into the loss function, optimize again, and the model keeps improving. What we're describing here is essentially the animation we saw earlier: taking the line and slowly moving it until it fits the data as well as possible. Most of machine learning, and even a lot of deep learning, is basically glorified line fitting. That's all we're doing: fitting lines. The only difference later on is that we'll say the relationship is something other than a line, and we'll ask the computer to compute the parameters of that model instead. So that's machine learning, and as we've seen, it takes just a few lines of code to train a machine learning model using scikit-learn. You start by creating some inputs and some targets: the inputs in this case come from non_smoker_df, and we're just picking a set of columns, currently the single column age. The inputs need to be a two-dimensional array. Then you have targets, which can be one-dimensional or two-dimensional. Once you have your inputs and targets, you create a model, here a LinearRegression. Most of the time you want to fit the model immediately, not always, but most of the time, and if so you can create the object and call .fit on it in the same expression: LinearRegression().fit(inputs, targets) returns the fitted model. The class determines what kind of relationship you're assuming, here a linear relationship, and fitting it to the data gives you a trained model. Then you use that model to make predictions. So now, every time a new customer comes in, you ask them their age, they say 32, you put in the number 32, and you get an estimate of their medical charges. That estimate can then be used to compute their insurance premium, and off they go. It automates that piece of analysis, which previously might have required several experts with years of expertise, and it comes with a certain error, so whenever you use a model you should be aware of what its error, or information loss, is. In this case we'd say our predictions are off by maybe three or four thousand dollars.
So you may just want to keep that in mind when you're calculating the insurance premium for a person. And that's your three or four step process: create inputs and targets, create and train a model, and generate predictions with the model, and wherever possible also have a measure of the information loss or error in your model. You create the input data and put it into some kind of model, and the model gives you predictions. You compare the predictions with the actual outputs, and that gives you a measurement of how bad the model is; it's really bad initially. Then you run some kind of optimization method, and that improves the model. As you give the model data, it learns from it and makes better predictions; you give it more feedback through the loss function, it gets better, and it keeps improving over time. If you can grasp that basic idea and see how it translates into code here, that's enough to get started. It is a fundamentally different way of thinking about problems, and if you're struggling with it a little, that's okay; once it clicks, you'll suddenly see the potential of what it makes possible, because instead of needing several years of expertise in evaluating insurance applications, you have this one system: you feed in some input, it gives you an output, you know roughly how far off it is, and you can use it. What about some of the other features? So far we've used just age to estimate charges, and adding another feature like BMI is very straightforward. We simply change the assumed relationship: instead of saying charges is some weight w times the age plus b, we now say that to calculate the medical charges we give some weight to the age and some weight to the BMI, and we still have the intercept term that moves everything up or down. Of course, this is no longer a line. It's linear in the sense that there is still a linear relationship, we're not squaring or cubing anything, but if you put age, BMI and charges on three axes, it becomes a plane in three dimensions. So now we're working with a plane, but it's still called linear regression: the charges are a linear combination, a weighted sum, of age and BMI. How do we apply this in code? The exact same four steps. Create inputs and targets: the inputs come from non_smoker_df, and we take age and the body mass index; the targets are simply the charges. That's the only change we make. The way to create and train the model is the same: create a LinearRegression, call .fit, and pass in the inputs and the targets; it automatically detects that you now have two columns of data. Then you generate predictions by passing the inputs into model.predict, and finally you compute the loss. Strictly speaking, this last part isn't required: once the model is trained, you can make predictions on any new input data. At this point, of course, we don't have any other input data; we just have the original inputs.
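As a sketch, adding BMI only changes the list of input columns; everything else stays the same (names as in the earlier sketches):

inputs = non_smoker_df[['age', 'bmi']]
targets = non_smoker_df['charges']

model = LinearRegression().fit(inputs, targets)
predictions = model.predict(inputs)
rmse(targets, predictions)   # barely changes compared to the age-only model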
So we just pass back the original inputs: you call model.predict(inputs), and that gives you a set of predictions, and then you compute the loss with rmse. Whenever you calculate predictions, if you have the actual targets for those inputs, which we do, it's always a good idea to compare the model's predictions with those targets and get the loss. And you can see here that we go from a loss of 4662.505 to 4662.312. So clearly BMI did not make a major impact, and you can see why: BMI has a very weak correlation with charges, especially for non-smokers. You can even see this in the scatter plot of BMI versus charges: there's no real relationship visible, no line you can make out, and that's why the correlation is low. That is why, even though we've now included BMI, you don't see any real improvement in the loss; it still sits at around 4,662, a very minor improvement of less than a dollar on average. Let's also check the parameters of the model. You'll find that they haven't changed much either. The weight for age is still about 266, and the intercept is about -2293, which is quite close to the original parameters of 267 and -2091. So the parameters for age and the intercept have stayed roughly the same: 266 is w1, the weight applied to age, and -2293 is the intercept. And you can see that the weight applied to BMI is very small, which means BMI has very low weightage; it makes only a tiny contribution to the charges for non-smokers, and even that is probably accidental. This is something you may want to report back to your company: hey, there seems to be no relation between BMI and medical charges, should we even be looking at BMI? Another important thing to keep in mind is that you cannot find a relationship that does not exist. If your boss came to you and said, I want you to create a model that predicts medical charges using the BMI, you should go back and tell them there's no relationship here: no matter what machine learning technique you use, no matter how much data you give me, there is simply no relationship; you can't expect to fit a line to this. And you can verify that: train a linear regression model to estimate charges using just the BMI, decide whether you expect it to be better or worse than the previously trained model (it should be worse), and verify it, maybe even visualize it by plotting the graph. When you have a single feature you can plot it easily; once you go beyond one input, there's no easy way to visualize it. Now let's go one step further and add a third numeric column, children. So now we have w1 times age, plus w2 times BMI, plus w3 times children, plus b. You can see the correlation between charges and children here: it's 0.138, lower than for age but higher than for BMI, and we can see it using this plot.
This is called a strip plot; it's similar to a violin plot, but it shows the actual points, so you can see that there are a lot of points at the lower values and, as the number of children grows, more of the points sit at higher values. There's definitely a somewhat increasing trend, and that's why you have a slightly positive, though small, correlation. To use children, once again we create the inputs and targets: we use age, BMI and children, and the targets are the charges. Then we train a model with LinearRegression().fit(inputs, targets), generate predictions with model.predict(inputs), this time with three columns of data, and compute the loss for those predictions against the targets. The loss is about 4,608, so there's a slight reduction: we went from 4,662 to 4,608, meaning the average error dropped by roughly $50. That's better than nothing, and the change is definitely greater than in the case of BMI. So here's an exercise for you: repeat the steps in this section to train a linear regression model estimating the medical charges for smokers, using all the numeric columns, and visualize the targets and predictions and compute the loss. And here's another thing we can do: let's consider the entire dataset, ignore the distinction between smokers and non-smokers, and see whether we can still fit a line. So we repeat the steps in this section on the entire dataset. All I'm changing is that instead of using non_smoker_df, I use medical_df, the entire dataset of smokers and non-smokers. I take age, BMI and children as the inputs and charges as the target, pass in the inputs, which is now a two-dimensional array with three columns of data, and the targets, which is simply a series of numbers, make some predictions, and compute the loss. And the loss is much higher: 11,355 on average. You can probably tell why. When we do px.scatter on medical_df with age on the x axis, charges on the y axis and the color set by smoker, you can see all the data that wasn't there when we were dealing only with non-smokers. Before, we were only fitting this lower band, which is obviously much easier to fit than the full picture. Where do you even put the line here? Put it somewhere in the middle and it has a high error against the lower band and a high error against the smokers, who are really far away from it. So your loss becomes quite high, and the model is arguably useless at this point, because most of the values you're trying to predict are in the range of five or six thousand, and being off by 11,000 doesn't make sense at all. So the next thing you might want to do is start using the categorical features for machine learning as well. So far we've only used numeric columns, because we can only perform the computations with numbers: the loss and the optimization rely on calculus and linear algebra, and all of that only works with numbers.
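For reference, a sketch of the full-dataset model described above, using the three numeric columns of medical_df (other names as in the earlier sketches):

inputs = medical_df[['age', 'bmi', 'children']]
targets = medical_df['charges']

model = LinearRegression().fit(inputs, targets)
rmse(targets, model.predict(inputs))   # much worse (around 11,355) once smokers are included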
Now, if we could use categorical columns like smoker, it might be possible to train a single model for the entire dataset that also accounts for smoking in some fashion. To use categorical columns, the trick is simply to convert them into numbers, and there are three common techniques for doing that. First, if a categorical column has just two categories, say smoker yes/no, it's called a binary category; it simply indicates the presence or absence of something. In that case we can replace the values with 0 and 1: no becomes 0 and yes becomes 1. Then a weight applies when the person is a smoker, and when they're not, that weight simply doesn't apply; it's like a penalty term that gets added if the person is a smoker. Second, if a categorical column has more than two categories, for example region, we can perform one-hot encoding, which means we create a new column for each category. For the northeast region we create a column containing ones and zeros, with a one wherever the row belongs to the northeast; likewise for northwest, southeast and southwest. We create those four columns and then ignore the original region column. Each of the new columns contains zeros and ones, and every row contains exactly one one and three zeros; that's why it's called one-hot encoding. Third, if the categories have a natural order, like cold, neutral, warm, hot, they can be converted into numbers like 1, 2, 3, 4 while preserving that order. It's possible, not in this problem but in others, that cold, neutral, warm and hot are indications of temperature and there is an actual linear relationship between the thing you're predicting and that order, in which case converting them into numbers preserves useful information; these are called ordinals. So you will often have to decide whether to convert categories into ordered numbers or into one-hot columns. When there's a meaningful correlation between the categories in their natural order and the target, convert them into numbers; when there's no natural order, as with region, where there's no clear reason the regions should be ranked in determining charges, just convert them into one-hot encoded columns. Let's start with the smoker column: we're going to add a new column called smoker_code. Here's what smoker versus charges looks like: for smokers the mean charges are around 32,000, although there are a lot of outliers, so the average doesn't convey the full picture, and for non-smokers the mean is more like six or seven thousand.
So now we create the two codes, no and yes, store them in a dictionary, and call medical_df.smoker.map with that dictionary. That replaces all the 'no' values with 0 and all the 'yes' values with 1, and we store the result in a new smoker_code column. You can check medical_df: where smoker is yes, smoker_code is 1; where smoker is no, smoker_code is 0. And we can now check the correlation between charges and smoker_code: it's quite high, 0.787. So we can use the smoker_code column for linear regression. We say charges equals w1 times age, plus w2 times BMI, plus w3 times children, plus w4 times smoker_code, plus b. So the inputs are age, BMI, children and smoker_code, and the rest of the code is exactly the same. And we end up with a loss of about 6,056. That's almost a 50% reduction: we've gone from being off by 11,000 on average to being off by about 6,000. It's not as good as the model we had just for non-smokers, but it's better by about 50% from the previous value. So that's an important lesson: never ignore categorical data; it may be the deciding factor in what makes your model good. Let's try adding the sex column as well. We do the same thing: it turns out that for females the average charge is slightly lower than for males, so I'll put 0 for female and 1 for male in a sex_code column. In medical_df you can see that where we have female we get 0 and where we have male we get 1. Let's check the correlation: it's very low, hardly anything. We run the process again with age, BMI, children, smoker_code and sex_code, and you can see there's no real change: the loss goes from about 6056.4 to about 6056.1, so whatever change there is amounts to less than a dollar on average. It's not really important; we could probably ignore the sex column altogether. Next up we have the region column, which contains four values, and that's why we'll need one-hot encoding to create a new column for each region. Instead of having northeast, northwest, southeast and southwest in the same column, we want a column for northeast with ones and zeros, a column for northwest, a column for southeast, and a column for southwest. There is a way to do this in scikit-learn, and I typically just look it up; the process is called one-hot encoding, so I search for an example. And if you look at the data, there is some difference in charges by region, so maybe region is a factor: different regions with different average incomes may have different medical costs at different hospitals, so we should include it. The way you one-hot encode this data is with the OneHotEncoder class from sklearn.preprocessing. From sklearn we import preprocessing, then we create preprocessing.OneHotEncoder(), and we give the encoder the list of regions so it can identify the distinct values.
This step is called fit, but all it really does is discover that there are four different values: northeast, northwest, southeast, southwest. Once it has identified those values, we can give it some input, either our original data or something we want to test later. For example, let's transform the northeast region: we call enc.transform with 'northeast' as the input, and I need to call .toarray() to actually see the values. It converts northeast into [1, 0, 0, 0]: a one for the first category and zeros for the rest. Let's give it two rows of data, northeast and northwest, and we get [1, 0, 0, 0] and [0, 1, 0, 0]. Now we can take the original regions from our data frame, southwest, southeast and so on, the region for each customer, and put them into encoder.transform, and that gives us this kind of data: a column for northeast, a column for northwest, a column for southeast and a column for southwest. All this is doing is taking a categorical column and giving us one-hot vectors. Then we can insert it back into the data frame by telling medical_df that we want to create four new columns and giving them this one-hot data; remember it's not part of the data frame until we do that. So after all of that, and you don't need to memorize any of it, you can always look it up, what we've achieved is to take the region column with its values southwest, southeast, northwest, northeast and replace it with these four columns: northeast, northwest, southeast, southwest. I've kept the original columns around for now, just so we can check that things are working properly. And now we can include region in our model as well: w1 times age, plus w2 times BMI, plus w3 times children, plus w4 times smoker_code, plus w5 times sex_code, plus weights for the region columns, plus b. The idea is that once you understand linear regression with one variable really well, once you internalize how it works, all of this is a very simple extension; you don't even have to change the code. With one variable you're trying to fit a line: you start with a random line, use the loss function to measure how bad it is, and then use an optimizer to gradually move the line until it fits the data to the best extent possible, which is to say until the loss is minimized. Then you simply extend that mathematically into more dimensions. Now we're dealing with not one feature but several, yet the code is exactly the same. So there you go: we now have a lot of columns, age, BMI, children, smoker_code, sex_code and the regions. To be honest, the equation I wrote isn't quite right: there should be a w6, w7, w8 and w9, that is w6 times northeast, w7 times northwest, w8 times southeast and w9 times southwest, so there are four separate weights, one for each region, and I should correct the equation accordingly.
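Here's a sketch of the whole categorical-encoding step described above: the binary .map for smoker and sex, and OneHotEncoder for region. Older scikit-learn versions return a sparse matrix from transform, hence the .toarray() call; column names follow the narration:

from sklearn import preprocessing

# Binary categories: map the two values to 0 and 1
medical_df['smoker_code'] = medical_df.smoker.map({'no': 0, 'yes': 1})
medical_df['sex_code'] = medical_df.sex.map({'female': 0, 'male': 1})

# One-hot encoding for the four regions
enc = preprocessing.OneHotEncoder()
enc.fit(medical_df[['region']])                       # discovers the four region values
one_hot = enc.transform(medical_df[['region']]).toarray()
medical_df[['northeast', 'northwest', 'southeast', 'southwest']] = one_hot

medical_df.charges.corr(medical_df.smoker_code)       # roughly 0.787, as noted above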
For every region we're actually applying a different weight altogether. And there you go: with the regions included, the loss reduces slightly more, to about 6,041. Where were we earlier? 6,056, and now 6,041, so a decrease of roughly $15. You can already see that a lot of these extra factors aren't really helping; it seems like age and smoker were the two biggest factors, and the rest don't matter much, maybe giving you a one percent difference or so. Now here's an exercise for you: try to figure out whether one linear regression model actually makes sense for this entire dataset, because as we already saw just from the distribution against age, this data is not linear. There is a trend, but it's not linear in any clean sense. At the very least, if we want some sort of linear relationship, we should take the smoker column into account, and the way we've taken it into account right now is by giving a weight to whether a person is a smoker or not. We're effectively adding a smoker penalty: if smoker_code is 1, that weight gets added to the charges, and if it's 0 it doesn't, but based on whether the person is a smoker or not we are not changing any of the other weights. So here's something to try: are two separate linear regression models, one for smokers and one for non-smokers, better than a single linear regression model? And what does better mean? I'll let you think about that. Try it: create two separate linear regression models using all the other features, one trained only on smokers and one trained only on non-smokers, and see whether the losses for those two models are lower on average. If, say, the losses for both models are in the range of three or four thousand, that means you're better off using two models rather than one linear regression model. And that's a decision you'll have to make as a data scientist: does a single linear regression model make sense, or do you need different models depending on whether people are smokers or not, because the slopes of the lines are different? We know, for example, that BMI makes a big difference for smokers but not for non-smokers, yet right now we've imposed a single weight on BMI; ideally the weight applied to BMI should depend on whether the person is a smoker. All these factors have to be taken into consideration, and sometimes two linear regression models, based on our understanding of the data, may be better than one. This also leads into something we'll cover in the coming weeks, which is decision trees. Sometimes you don't want to make a calculation using just weights; sometimes you want to make decisions: check whether a person is a smoker, then maybe check their region, then check something else, and then maybe apply some kind of averaging. We'll see how to mix some of these models as we go along. Next, a couple of model improvements. There is some code involved here, and I won't go too deep into it.
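One way you might set up the two-model exercise described above, as a sketch (assuming medical_df, LinearRegression and the rmse helper from the earlier sketches):

smoker_df = medical_df[medical_df.smoker == 'yes']
non_smoker_df = medical_df[medical_df.smoker == 'no']

for df in [smoker_df, non_smoker_df]:
    inputs, targets = df[['age', 'bmi', 'children']], df['charges']
    m = LinearRegression().fit(inputs, targets)
    print(rmse(targets, m.predict(inputs)))   # compare these losses with the single-model loss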
We're now getting to the point where you should be able to read code yourself and figure out what it's doing, so I'll explain the idea and leave you to work through the code during the study hours or whenever you have time; it's simple code, not too difficult. Recall that due to some regulatory requirements in healthcare, we also need to explain the rationale behind our model's predictions. So suppose a new applicant walks in with a certain profile, matching the input columns our model looks at, and let's create a sample input: a person of age 28, a BMI of 30, two children, a smoker, so the smoker code is 1; female, so the sex code is 0; and from the northwest region, so the region columns are 0, 1, 0, 0. Now we predict: we give this one row of data to our model, and it predicts that the estimated medical charges for this new applicant are about 29,875. Using this, we charge the person a certain premium; we look at other factors too, but this is one of the major ones. Now suppose this person comes back and says, I'd like to understand why you've quoted me this insurance premium. You tell them that you predicted their medical charges would be around $29,000, and they say, fine, but on what basis are you saying my medical charges are going to be around $29,000? And if you answer, well, we have a data scientist, they built a model, and the model says 29,000, they'll quite reasonably ask why the model says 29,000. That's a fair question, because remember the issues with correlation and causation; we don't know what issues you had in your data. You can't hide behind your model and say the model says so, therefore that's what we charge; you should be able to explain why your model gives a certain output. Fortunately, in the case of linear regression that's fairly straightforward. We know that our model applies some weight to the age, some weight to the BMI, some weight to the number of children, some weight to the smoker code, some weight to the sex code, some weights to the region columns, and then adds an intercept, a sort of baseline. So we can compare the importance of each feature: if we have to tell them we're quoting this value partly because of their BMI, we need to know what weight BMI carries in our model. So we check the coefficients of our model, which are 256, 339, 475 and so on; these are the weights applied to the features. Comparing them, w1 is for age, w2 is for BMI, w3 is for children, w4 is for the smoker code. You might be tempted to say at this point: look, we apply a very high weight to smoking, around $23,000 of your estimate is accounted for simply because you're a smoker. And that could be fine in the case of smoker, because smoker_code is a simple zero-or-one column, so the interpretation makes sense.
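As a sketch, the sample prediction described above, with the row laid out in the same column order as the training inputs (age, bmi, children, smoker_code, sex_code, northeast, northwest, southeast, southwest):

# age 28, BMI 30, 2 children, smoker, female, northwest region
new_customer = [[28, 30, 2, 1, 0, 0, 1, 0, 0]]
model.predict(new_customer)   # roughly 29,875 in the walkthrough above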
But if you look at the weights carefully, and to help you I've actually put them into a data frame, it seems like BMI has a higher weight than age: BMI has a weight of 339 while age has 256, and children have a weight of 475. That goes completely against what we understand. We know that age is one of the most important factors, and we saw that for non-smokers at least, BMI has no bearing and children likewise. So what's going on? Our model's weights seem to be out of sync with what we know about the features. We agreed that the smoker code deserves a high weight, no doubt, but we felt that age should matter far more than BMI, children, sex code and so on, and yet even the southeast region has a weight of about -448, a negative weight whose magnitude is larger than the age weight. What's going on is that the ranges of the values differ. BMI is limited: it runs roughly from 15 to 40, it's never going beyond 40, and in most cases it's around 30. The northeast column can only take the values 0 or 1, which means the maximum change the northeast column can produce in the final charges is about $500, no more. On the other hand, even though the weight for age is only 256, age can go up to around 60, so it can contribute 256 times 60, a difference of over $15,000. So the raw weights do not accurately represent which features are important, because the ranges of the different features are different. That's one issue. Another issue, which matters less for linear regression but will come up with later datasets, is that when one column has a very large range of inputs, like age, and another has a very small range, like the 0-to-1 region columns, the large values disproportionately affect the optimization process. Ultimately your model is optimizing a single number, the loss, and the loss is computed from all of these inputs, outputs and weights; if one column has very large values, it contributes disproportionately to the loss, and most of the optimization effort goes into that one column. That's why we perform something called standardization: for every value in a column, we subtract the mean of the column and then divide by the standard deviation of the column. What does that do? Take BMI, which ranges from about 15 to 40. After standardization, we've subtracted the mean, roughly 30, and divided by the standard deviation, around six, so the values become centered around zero and are scaled down, giving a distribution with a mean of zero and a standard deviation of one.
So the idea is that we scale all the values down to small numbers, roughly in the range of minus one or minus two to plus one or plus two, with a mean of zero, and we do that for every numeric column. The categorical columns are already zero or one, so we only standardize the numeric columns to mean zero and standard deviation one, and then the weights will make a lot more sense. Here's how you do it, and again, the way to really figure this out is to just look it up online: how do I perform scaling of features in scikit-learn? So I'm going to import StandardScaler, take the numeric columns age, BMI, and children, get the numeric data from the data frame, medical_df[numeric_cols], and pass that into the scaler. We call scaler.fit, and all it does is compute the mean and the variance of each numeric column. It turns out that the mean for age is about 39, the mean for BMI is about 30, and the mean for children is about one. The variance in age is about 197 (variance is the square of the standard deviation), the variance in BMI is far less, and the variance in children is even smaller. Now we can scale the data: we call scaler.transform and give it the data we want to transform, which is medical_df[numeric_cols], the age, BMI, and children columns. Once transformed, the values mostly land between minus one and one: this one is transformed to something between zero and one, this one becomes negative, and so on. Then we take the categorical data and put everything back together, so we have these three scaled numeric columns plus the categorical columns from the one-hot encoding and the binary encoding, and a row of data looks something like this. Then we train the model again. We get the same loss, because in linear regression scaling does not really affect the loss, but the difference is that the weights for the different features now make a lot more sense. The weight for smoker is about 2.38 × 10⁴, which is around 23,000, and the weight for age is 3.6 × 10³, which is 3,600. This is probably not the best way to print it, so let's just look at model.coef_ and interpret the numbers here: the weight for smoker is about 23,000, the weight for age is 3,600, and the weight for BMI is 2.071 × 10³, around 2,000. So now you can start to see that the features have very different levels of weight, and you can disclose that in our model smoking has a very high weight, contributing about 23,000 to the calculation, while age has a factor of about 3,600.
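Here is a compact sketch of the scaling steps just walked through, assuming the `medical_df` data frame and the lowercase column names from earlier:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the numeric columns, then transform them.
numeric_cols = ['age', 'bmi', 'children']
scaler = StandardScaler()
scaler.fit(medical_df[numeric_cols])      # computes mean_ and var_ per column
print(scaler.mean_, scaler.var_)
scaled_inputs = scaler.transform(medical_df[numeric_cols])  # mean 0, std 1 per column
```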
Coming back to the weights: we add about $3,600 for every year that you have been alive; based on your BMI, we add about $2,000 per BMI point to your estimated medical expenses; and for every child you have, it seems we add about $572. Then we have a weight of minus 1.249 for sex, so there's an inverse relationship there: we actually subtract some amount if you're male. Depending on which region you're from, we apply a certain correction factor, and one of the biggest terms is the bias itself, which simply represents the intercept. Now, one thing you will have to do: when a new person comes to you, let's go back to that applicant from before, you will have to scale their values before you can put them into the model. So let's say we have just this one new customer. You can't directly feed in the raw values 28, 30, and 2; you need to scale those numeric values with the same scaler first, then simply copy over the categorical values, and then call model.predict. You can see here that we get the result 29,824, but now we have a much clearer description of why we've arrived at roughly that number. So that's how you interpret a linear regression model. What I said earlier about age adding $3,600 per year is not literally true anymore, because we have scaled the age column, but you can still give a relative explanation: the most important factor is smoking, then age, then BMI, and these are the weightages that apply in your case. In your case the scaled age value is about 0.1 or negative 0.1, and the scaled BMI value is about 0.75, and so on. So that's how you scale the data, and I guess we're going to end here today. Just a quick summary of how you approach a machine learning problem. First, explore the data and find correlations between inputs and targets; understand what leads to a certain target. Maybe even go back and do some research on whether there is a cause-and-effect relationship or just a correlation, whether the correlation might go away, and whether the correlation is unfair in any way. For example, maybe you should not be using race as a factor in computing insurance premiums; that's something you should be conscious of as a data scientist. Then you pick the right model (in this case linear regression), the right loss function (in this case root mean squared error), and the right optimizer (in this case least squares), and all three of these are captured in the LinearRegression class from scikit-learn. For different problems you have to pick different models: sometimes decision trees, sometimes neural networks, or something else. Then you scale the numeric variables so that they all have a mean of zero and a standard deviation of one, and you perform one-hot encoding for the categorical data. There's also the step of setting aside a test set, which is not something we covered today, but we'll cover it next time. Then you train the model, make predictions, and compute the loss.
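Going back to the new-applicant prediction step from above, here is a minimal sketch of it. The numbers and the column order (smoker, sex, then the one-hot region columns) are assumptions based on the walkthrough, not the exact notebook code:

```python
import numpy as np

# Scale the numeric values with the same fitted scaler, append the
# categorical codes, and predict. Column order is an assumption.
new_numeric = scaler.transform([[28, 30, 2]])     # age, bmi, children, scaled
new_categorical = np.array([[1, 0, 0, 1, 0, 0]])  # smoker=1, sex=0, northwest region
new_input = np.concatenate([new_numeric, new_categorical], axis=1)
print(model.predict(new_input))
```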
Now, we didn't do this today, but one thing you should do is set aside a small fraction of your data, say 10%, just for testing and reporting the results of your model, because models like the ones we created today are designed to be used in the real world, not only on the data we trained them on. What people typically do when they get some data, let's say from a product manager or whoever has given us the insurance data, is take 10% of it and set it aside for good. Then we do all our modeling, say we build a linear regression model on the remaining 90%, and when we report the metrics, "I built this model and this is the kind of error it has," we report that error on the test data, the 10% that we set aside at the beginning and did not use for training. You can do this in scikit-learn using train_test_split: it takes some inputs, some targets, and a fraction like 0.1, and it gives you back 90% of the data for training (inputs_train and targets_train) and 10% for testing. We'll talk about this in a lot more detail next time. What you will notice is that the loss on your test set is almost always higher than the loss on your training set, because your model has not seen those examples, and a model is worse at predicting examples it has not seen. This is true of a lot of machine learning models: when you put them out into the real world, they perform worse. You may get something like 98% accuracy while training the model, but when you put it out, it could be far lower. This is called overfitting, and we will talk about a lot of techniques in this course to counter it. So the right strategy is to set aside a test set before you train a model, then train the model, make predictions, compute the loss, and report the results on your test set. We will apply this process over and over to several problems with different kinds of models over the next few weeks. The topic for today is logistic regression with scikit-learn, and here is a quick overview of the topics we're covering. This image gives you a hint of the problem statement we're looking at today. We'll start by downloading a real-world dataset from Kaggle, explore the dataset, and perform a little exploratory data analysis and visualization. We will then split the dataset into training, validation, and test sets, and talk about why these are important. We will talk about filling and imputing missing values in numeric columns, look at how to scale numeric features to a zero-to-one range, and look at encoding categorical columns as one-hot vectors. Then we will train a logistic regression model using the scikit-learn library, evaluate the model using the validation set and the test set we created, and finally save the model to disk and load it back. You'll see quite a bit of overlap between what we're doing today and what we did last week, and that is intentional.
The idea here is that the machine learning workflow is something you will apply to pretty much every machine learning problem you solve, so by looking at it multiple times, with multiple datasets, in the context of different machine learning algorithms, you will start to get used to it, and by the end of this course it should become second nature to you. You can see instructions on how to run the code: we are currently running it on Google Colab using the "Run on Colab" option, but you can also run it on Binder (it will be a little slower), or run it on your computer locally using the "Run locally" option under the Run button. So here's the problem statement we're looking at today. Since we want to take a practical, coding-focused approach, we will learn how to apply logistic regression to a real-world dataset from Kaggle. The Rain in Australia dataset contains about ten years of daily weather observations from numerous Australian weather stations, and it is taken from Kaggle; if you open this link, you will see the page the dataset comes from. Kaggle is a platform where you can participate in data science competitions, and it is also a community where people share datasets and data science projects. On the dataset page you can browse the data and see some information about it: as it says, this dataset contains about ten years of daily weather observations from many locations across Australia, and the objective it was created for is to predict next-day rain; we'll talk about what that means. You can also look at the data itself. It seems there are about 23 columns. The dates of the observations start around 2007, in fact, and go on till 2017, and the columns include the location (for example Albury, Canberra, Sydney), the minimum temperature for the day at that location, the maximum temperature, rainfall, evaporation, sunshine, et cetera. So it's a bunch of different weather measurements at several locations, day by day, for almost ten years. That's the dataset we're working with, and here's a sample from it, just to give you some context of what it looks like. Now, as a data scientist at the Bureau of Meteorology, you are tasked with creating a fully automated system that can use today's weather data for a given location to predict whether it will rain at that location tomorrow. If you look at this dataset, you have all these columns of data, location, min temp, max temp, et cetera, a total of 23 columns (not all of them are shown here), and then there are a couple of columns at the end: one is RainToday, which indicates whether it rained on that particular day, and the other is RainTomorrow, which indicates whether rainfall was recorded on the next day. So for ten years of data we have all these weather measurements, and for each day we have also listed whether it rained at that location on the next day or not. This is very useful, because now we can use these ten years of historical data to create some kind of system which can take the data for today and use it to predict whether it will rain tomorrow. It's a weather prediction system that can potentially replace a meteorologist. That's what we're trying to create: we want to make this forecast, specifically.
We want to create this rain forecast, whether it's going to rain or not at a particular location, given its weather information for today, and here is some of the information you have. Before you proceed further, just take a moment to think about how you might approach this problem; maybe try listing five or more ideas that come to mind as you execute this notebook. It's important to think about this, because machine learning requires a change in mindset: you can see what your original mindset is as you list out some of these ideas, and once we work through the machine learning approach, you can check whether you need to shift that mindset a little, so that the next time you come across a problem you can judge whether to think of it in terms of a machine learning algorithm or not. In the previous tutorial, we attempted to predict a person's annual medical charges by looking at information like their age, their sex, their number of children, and whether they're a smoker, and there we used linear regression. In this tutorial we are going to use logistic regression, which is better suited for classification problems, like predicting whether it will rain tomorrow, and identifying whether a given problem is a classification problem or a regression problem is a very important first step in machine learning. So let's define these terms. Problems where each input must be assigned a discrete category (sometimes these categories are called labels or classes) are known as classification problems, and here's what I mean, with some concrete examples. Rainfall prediction: predicting whether it will rain tomorrow using today's weather data. We take all the measurements for today and, using those, either say that it will rain or say that it will not rain. So there are two categories, or two classes, into which we want to classify these measurements: whether they suggest that it will rain tomorrow, or whether they suggest it won't. Here's another one, breast cancer detection, and this is actually something that is used in the real world: predicting whether a tumor is benign (non-cancerous) or malignant (cancerous). If a tumor is detected in your body, can we predict whether it's cancerous or non-cancerous using information like its radius, texture, et cetera? Once again, we are classifying tumors into two classes, and there's a dataset you can click through and find to do exactly this. Here's one more, loan repayment prediction: predicting whether an applicant will be able to repay a home loan based on factors like age, income, loan amount, number of children, et cetera. Here we are not predicting medical charges, which is a continuous dollar amount, but rather whether they will be able to repay a loan or not, yes or no; again, a classification problem. And then there's handwritten digit recognition: you're given pictures of handwritten digits, and your job is to identify which digit a picture represents. You can click through and explore that dataset as well. In this case there are ten classes: you have to classify whether the digit is a zero, one, two, three, four, five, six, seven, eight, or nine. So classification problems can be binary, which is a simple yes-or-no decision:
will rain or will not rain, benign or malignant, will repay the loan or will not repay the loan; or they can be multi-class, where you have to assign each input to one of many classes, as in handwritten digit recognition. Try and think of some more examples of classification problems, and once you work through this notebook, try replicating the steps we follow with each of those datasets, just to get a better sense of how classification problems work. So those are the kinds of problems we're looking at today. The kind of problems we looked at last time, where a continuous numeric value must be predicted for each input, are known as regression problems, and here are some examples. Medical charges prediction: we use inputs like age, number of children, and whether the person smokes to predict an exact number, the amount of money they will spend on medical expenses. Similarly, house price prediction: given information like the location of a house, the number of bedrooms, the square-foot area, how many bathrooms it has, whether it has a lawn, and so on, can you predict the price of the house? Again, it's an exact number, not a yes/no or a class or a category. There's also a dataset here for ocean temperature prediction: given several measurements taken from the ocean floor, can you predict the temperature? You can check out that dataset too. And in weather itself, if we want to predict the temperature for tomorrow, rather than whether it will rain or not, the temperature is again a continuous number, so that is a regression problem, not a classification problem, because we're trying to predict a continuous quantity. Try and think of some more regression problems you may come across or may want to solve, and try replicating the steps we followed in the previous tutorial with each of those datasets. So that's classification and regression: classification means classifying data into certain classes, while regression literally means fitting data, trying to create a model which fits as closely as possible to the numeric value you want to predict. Whenever you see regression, think continuous values; whenever you see classification, think discrete classes. We saw that linear regression is a commonly used technique for solving regression problems, hence the name, and in a linear regression model the target is modeled as a linear combination, or weighted sum, of the input features. For example, we said that predicting the medical charges can be done by taking a linear combination of some input features. Say we're looking at the input features age and whether the person is a smoker (zero or one). The model we created last time assumed that age has a certain weight w1, the smoker column has a certain weight w2, and then there's a certain number b. If you can figure out good values for w1, w2, and b, then you can take any value of age and any information about whether the person is a smoker or not, and get a good estimate of their medical charges. That was the model we created last time.
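Written out, that relationship is just charges ≈ w1 · age + w2 · smoker + b. As a one-line sketch (illustrative, not the exact notebook code):

```python
# The linear model described above: w1, w2 and b are the parameters
# that the training process has to figure out.
def estimate_charges(age, smoker, w1, w2, b):
    return w1 * age + w2 * smoker + b
```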
What we did was take that model, start with random values of w1, w2, and b, and use it to generate predictions for all the inputs we had. Then we compared the predictions of our randomly initialized model, which assumes a linear relationship, with the actual targets, the actual medical charges of those people, using a loss function called the root mean squared error. And then we trained the machine learning model to reduce that loss by adjusting the weights w1, w2, and b. That is what a machine learning training process looks like. So here's the picture: we take some training data, which contains inputs like age, whether the person is a smoker, number of children, et cetera, and also the outputs. We put the inputs into the model and get some predictions. We take the predicted outputs from the model and the actual outputs from the training set, put them both into a loss function (root mean squared error in the case of linear regression), and then run an optimization method to reduce the loss by changing the weights inside the model. That improves the model over time, and as we do this repeatedly for long enough, the weights get pretty good, and then we can use the model to make predictions on new data. That's the machine learning process we followed for linear regression. If you want to see a more mathematical discussion of linear regression, check out the YouTube playlist linked here; it goes through in more detail how each component fits in and how the optimization works. For classification problems, the technique that is often used is called logistic regression, specifically for binary classification problems, which is the kind we're looking at today: a simple yes/no classification. In a logistic regression model, we start out the same way: we take a linear combination, or weighted sum, of the input features. Say we just look at today's temperature and today's rainfall, and we want to use those to predict whether it will rain tomorrow. We start by picking a random weight W1, a random weight W2, and a random bias B, and we fix them. Then we calculate this quantity for a given input. Say we have some information about the city of Melbourne: we take the temperature at Melbourne and the rainfall at Melbourne, multiply them by W1 and W2, and add B, so we get the quantity X1·W1 + X2·W2 + B. Then we have an additional step. In linear regression we were done at this point, but in logistic regression we take this quantity and put it through an activation function called sigmoid, represented by the character σ (sigma). What the sigmoid function does is take the output of this linear combination and squish it into the range zero to one. The exact details of how the function does that are not very important.
What matters is that it takes this quantity, X1·W1 + X2·W2 + B, and squishes it into the range zero to one, and the value we get out of the sigmoid is interpreted as the probability of the class we're trying to predict. Since we're trying to make a yes/no prediction, it will rain tomorrow or it will not, the output of the sigmoid will be a number between zero and one, and that number indicates what our model thinks is the probability that it will rain tomorrow. Then we compare those probabilities with the actual values, because we have historical data, so we can see whether our model is predicting sensible probabilities or not. If it is not, we penalize it by giving it a high loss; if it is, we penalize it less by giving it a low loss. Overall, we come up with a loss for the model, and for logistic regression that loss is called the cross entropy loss. It has a formula, and again the details are not that important; the important things are that there is a small change in the structure of the model, this sigmoid function, and that the loss is high if the model predicts bad probabilities and low if it predicts good probabilities. Then, once again, we train the model in the exact same way: we take the weather data for today, put it through the model, and get a prediction. We compare the model's prediction with the actual data, because this is historical data, so we have the actual yes/no answers for whether it rained the next day. We take the predictions, which are probabilities, put them together with the actual answers into the cross entropy loss, and run that through an optimization algorithm; there are many optimization algorithms, like gradient descent or least squares, and the optimizer changes the parameters, the weights and biases W1, W2, et cetera, so that the model makes better predictions and gets a lower loss. So it is the same machine learning workflow, except the details have changed: the model, instead of being just a linear combination, also applies a sigmoid function, so it outputs numbers between zero and one, and the error, instead of being a root mean squared error, is a cross entropy error between the yes/no targets and the predicted probabilities. On the one hand the targets are yes/no, or zero/one, while the predictions are probabilities in the range of zero to one, something like 0.3, 0.7, or 0.9. The optimization method may also differ, but all the other parts remain the same. So that's logistic regression, and that's all we'll say about it right now, because we want to focus on how to build good logistic regression models. If you do want to see how all of this works internally, once again check out the YouTube playlist linked here; it's about an hour long, goes into the mathematics, shows a more geometric interpretation, and explains why it is called logistic regression even though it's solving a classification problem, which makes the name a bit confusing.
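Before getting to why it's called that, here is a rough numeric illustration of the two pieces just described, the sigmoid and the cross entropy loss; this is a sketch with made-up numbers, not the scikit-learn internals:

```python
import numpy as np

# Sigmoid squishes any number into (0, 1); cross entropy is low when the
# predicted probabilities agree with the 0/1 targets.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(targets, probs):
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

z = 0.3 * 21.5 + 0.02 * 1.2 - 5.0   # x1*w1 + x2*w2 + b with made-up values
prob_rain = sigmoid(z)              # about 0.81
print(prob_rain, cross_entropy(np.array([1.0]), np.array([prob_rain])))
```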
The reason is that the output of this sigmoid function, that number between zero and one, is called a logistic, and what we're trying to do is fit, or regress, that logistic to the data. We're trying to get that value as close to one as possible for the inputs whose output is yes, and as close to zero as possible for the inputs whose output is no. So in some sense, even though we are working with discrete classes, yes and no, we have created this continuous quantity, the probability, and we want to fit this probability as closely as possible to the desired outputs. That's why it's called logistic regression, but logistic regression is a classification algorithm, so just keep that in mind. Okay, so that's the machine learning workflow, and that's basically all the theory for today. Maybe, if we have time at the end, we'll walk through an actual logistic regression example with a few inputs, but roughly what you want to understand is that it is a different kind of algorithm, one that outputs probabilities and uses a different error function, but the training process is the same and it is used for classification. Now, classification and regression are both supervised machine learning problems, because they use labeled data. By labeled data, we mean we have some targets: for classification, the target is the class we want to assign to each input, rain or no rain, which we already have in our dataset, and for regression, it is the quantity we want to predict, for example the medical charges. Both of these are called supervised learning, and under supervised learning you have classification and regression. But there are cases where we do not have any labels, and mostly what we want to do is group the data into clusters; that is called unsupervised learning. We will look at unsupervised learning towards the very end of this course, but our primary focus will be on supervised learning, which is classification and regression. So, to come back to the problem we are looking at: we will train a logistic regression model using the Rain in Australia dataset to predict whether or not it will rain at a location tomorrow, using today's data, and this is a binary classification problem because it's a yes/no classification. Let's start by downloading the data. The dataset is on Kaggle, and there are several ways to download it. For example, you can download the data by clicking Download on the Kaggle page and then upload it to Colab. Colab runs on Google's servers, so if you check the Files tab here, these are not files from your computer; if you want to get a file from your computer onto Colab, you need to use the upload button. So you could download the zip file, upload it to Google Colab, and then figure out a way to unzip it, which is a bit cumbersome. Another option is the Kaggle API, which you can check out here: it's a command-line tool for downloading datasets from Kaggle, where you run something like "kaggle datasets download" on the terminal and put an API key somewhere, so that's a bit cumbersome as well. The approach we recommend is using the opendatasets library.
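In short, the whole download-and-load flow looks roughly like this; it's a sketch, and the dataset URL shown is an assumption, so double-check the exact URL on the Kaggle page for the Rain in Australia dataset:

```python
import os
import opendatasets as od
import pandas as pd

# Download the Kaggle dataset (you'll be prompted for your Kaggle
# username and API key), then load the CSV into a data frame.
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
od.download(dataset_url)

data_dir = './weather-dataset-rattle-package'
print(os.listdir(data_dir))                      # expect ['weatherAUS.csv']
raw_df = pd.read_csv(data_dir + '/weatherAUS.csv')
```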
So this is an open-source library that we have created to make it really easy to download Kaggle datasets anywhere. You install it using pip install opendatasets, then import opendatasets as od, and you can check the version of the library, which here is 0.1.20; always make sure you have the latest version, which you can ensure by running pip install with the --upgrade flag. So we've imported opendatasets as od, we take the dataset URL, which is the Kaggle page URL, and we run od.download(dataset_url). When you run od.download, you will be asked to provide your Kaggle username and a Kaggle API key. Where do you find those? Go back to Kaggle.com, make sure you're signed in, click on your username, then click on Account, and you should see an option "Create New API Token". Clicking it downloads a file called kaggle.json to your computer, and that file contains the information you need. If you open up kaggle.json, you can take the username, paste it into the prompt here, then take your Kaggle key, paste that in as well, and hit enter. Now I'm authenticated, and the dataset has been downloaded; you can verify that by opening the Files tab and checking that the dataset folder is there. Another way to do this, instead of manually entering your Kaggle key, is to click the upload button and upload the kaggle.json file itself; once you do that, your Kaggle credentials are picked up automatically. opendatasets is also smart about re-downloads: if it finds that you have already downloaded the dataset, it skips the download, and you can specify force=True if you want to force it to re-download, as it has done in this case. So now we have this folder, weather-dataset-rattle-package, downloaded from Kaggle using the opendatasets library; all we needed to do was import opendatasets as od, run od.download, and provide our Kaggle credentials. Let's check the folder using os.listdir, and it seems there's a single file, weatherAUS.csv. Let's create the full path to this file, so now the train_csv variable holds the full path, weather-dataset-rattle-package/weatherAUS.csv, and we can load the data from weatherAUS.csv using pandas. Pandas, as you already know, is a library for working with CSV files. I'm just going to clear my notebook outputs at the moment (Edit > Clear all outputs) so that we get all the outputs from scratch. So now we have pandas imported as pd, we give pd.read_csv the path to the file we just downloaded, and that creates this data frame. Here is our dataset, now available to us as a data frame: we have information like the date, we have the location.
We have minimum temperature, maximum temperature, rainfall, evaporation, sunshine, wind direction, et cetera. So there are a total of 23 columns, starting from the date. If you ignore the date for a moment, because we're not going to use the date for prediction, then the final column is the target, what we want to predict, which is the RainTomorrow column. That leaves 23 minus 2, or 21, columns of data that we are going to use to make the prediction, so we're looking at 21 weather features for today, and hopefully that gives us enough information to predict whether it will rain tomorrow. The dataset contains over 145,000 rows. It has a date column and some numeric columns, as you can see here, and it also seems to contain some categorical columns: the wind direction, for example, is definitely categorical; the values are strings, but they take only a limited number of distinct values. So it's a good mix, and our objective is to create a model to predict the value in the RainTomorrow column using all the other columns, except perhaps the date. Let's check the data types and see if there are any missing values. In total, we have 145,460 rows of data. There are no null values for the date, and no null values for the location, which makes sense, because every observation was collected on a particular date at a particular location. But you will start to see that there are quite a few null values in the other columns, and that's perfectly fine; maybe that data was not available, or not measured, or not entered; a lot of things could have happened. One issue, though, is that the RainToday and RainTomorrow columns also have quite a few null values: out of 145,000 total entries, there are about 142,000 values for RainToday and RainTomorrow, so about 3,000 rows don't have information on whether it rained the next day. That's a problem, because we want to train a model to predict whether it will rain tomorrow, so we can only use data that actually contains that information; if the target column itself is null, that row isn't useful for the model to learn anything, because you're not giving it any target. So we should remove the rows where the target column is empty. I want to go one step further, and this is just a hypothesis: whether it rained today should have a very strong bearing on whether it rains tomorrow, so if we have missing data on whether it rained today, I think a lot of important information is missing from that row, even though we have a bunch of other measurements. So what we're going to do is drop all the rows where either the RainToday value or the RainTomorrow value is empty. The second one clearly makes sense, because it's the target column and we should only keep rows where we actually have a value to predict, but the first one, RainToday, is a judgment call: I'm guessing that if I don't know whether it rained today, it's going to be very hard to predict whether it will rain tomorrow.
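Here's what that drop looks like in code, a minimal sketch using the column names as they appear in weatherAUS.csv:

```python
# Drop rows where RainToday or RainTomorrow is missing.
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)
raw_df.info()
```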
So I'm going to run raw_df.dropna with the subset RainToday and RainTomorrow, in place, which drops those rows from raw_df. Now if you check raw_df.info(), you should see that the number of rows has gone down, to about 140,000, but at the same time these two columns no longer have any null values. The rest of the missing values we will deal with over time, but these were the two important ones that I felt we had to take care of in this dataset. This is something you will also have to figure out on a case-by-case basis: are there missing targets? Then maybe those rows are not useful. Are there any critical columns that should never be null? Then maybe you should drop those rows as well. And you can think about how you will deal with missing values in the other columns; we will come back to that too. Next up, some exploratory data analysis and visualization. Before training a machine learning model, it's always a good idea to explore the distributions of the various columns and see how they relate to the target column. This will often give you a lot of interesting insights about the data: ideas about whether you need to transform certain features, whether there is some invalid data you need to fix, whether you can combine multiple features to create new ones, whether taking the logarithm of a certain feature gives a better correlation, et cetera. You will get many ideas as you explore the data, but today, because our focus is on logistic regression, we'll do a limited amount of EDA. Let's install the plotting libraries matplotlib, seaborn, and plotly. First I'm going to look at the location data: what different locations we have and how many measurements we have per location. I believe we have around 29 locations, if I'm not wrong, but let's check: raw_df.Location.nunique() tells us that we actually have 49 locations. So these are the 49 locations, and for each location we have roughly 3,000 values, which makes sense: about 365 days of data per year, multiplied by roughly nine or nine and a half years, should give us about 3,000 values per location. You will see that for certain locations the number of measurements is smaller; maybe they did not have weather stations earlier and only got them recently, or maybe data was not collected, or was lost, on certain days. So we're getting a good sense that the data is mostly uniformly distributed across locations. I have also chosen RainToday as the color, so each bar is split by whether the value of RainToday is no or yes. If you look at Albury, for example, there were roughly 2,300 measurements where there was no rain on that day and about 600 days where there was rain; 600 out of about 3,000 is roughly 20%, and that seems to be roughly the case in pretty much every city: on about 20% of days there was rain, and on about 75 or 80% of days there was no rain. Of course, that's not true everywhere.
For example, at Uluru you can see that there are very few days where there was rain, while at other places, Walpole for instance, there are quite a few days with rain, almost a thousand out of 2,500. So location definitely seems to be a factor in how much rain there is going to be; there is some correlation here for sure. Let's look at a few more charts, quickly. Let's look at the temperature at 3 p.m.: we'll plot a histogram of the 3 p.m. temperatures and break it down by whether it rained on the next day or not. My guess is that if the temperature at 3 p.m. today was low, then it's more likely to rain tomorrow, and this histogram seems to support that. The overall histogram forms roughly a Gaussian curve, a normal distribution, but the red portion indicates the cases where it did rain the next day, and you can see there are proportionally more cases of low temperatures followed by rain than of high temperatures followed by rain. So if the temperature is lower, it seems more likely that it will rain, although there are still a decent number of examples where the temperature was high and it still rained. This is how you can break down histograms, especially for classification problems: you can use the color argument in Plotly Express (in Seaborn the equivalent is the hue argument) to distinguish between classes. Here I'm just getting a sense that when the temperature is lower, the chance of rainfall might be higher. Let's also look at the distribution of RainTomorrow: in how many cases is it set to yes, and in how many is it set to no? It seems that out of about 140,000 data points, for about 110,000 it did not rain the next day, and for the remaining 30,000 or so it did. This is called class imbalance: you have two classes, no and yes, and you do not have an equal number of observations in each class. It's an important factor that we'll talk about later when we build our models. One other thing we've done here is choose RainToday as the color. What does that tell us? There were about 92,000 instances where RainToday was no and RainTomorrow was also no, and only about 16,000 instances where RainToday was yes and RainTomorrow was no. That means if it did not rain today, there's a pretty good chance it will not rain tomorrow. On the other hand, for the days where it did rain tomorrow, you see a mostly even split, so if it did rain today, it may or may not rain tomorrow. So predicting "rain tomorrow: yes" is difficult, but predicting "rain tomorrow: no" is easier when it did not rain today. There are many ways to think about and interpret this chart, so spend some time on each chart trying to understand what it represents, but as such, we can already tell that if it did not rain today, we can be fairly confident that it is not going to rain tomorrow; of course, there are a lot of other factors involved as well.
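Here is a minimal sketch of the kind of chart just described, using Plotly Express with the color argument (the column names are the ones in weatherAUS.csv):

```python
import plotly.express as px

# Histogram of one column, with each bar split by a second categorical
# column via the `color` argument (Seaborn's equivalent is `hue`).
fig = px.histogram(raw_df, x='Temp3pm', color='RainTomorrow',
                   title='Temperature at 3 pm vs. Rain Tomorrow')
fig.show()

fig = px.histogram(raw_df, x='RainTomorrow', color='RainToday',
                   title='Rain Tomorrow vs. Rain Today')
fig.show()
```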
Then here's a scatter plot of maximum temperature against minimum temperature. You can see what the distribution looks like on most days; in fact, there's a linear correlation between them, so you could probably even fit a linear regression model to predict maximum temperature from minimum temperature, and you would get a line like this. But that's not what we're doing here. What we've done is color in these points using RainToday, and you will start to see that in the cases where the minimum temperature is low and the maximum temperature is close to the minimum temperature, where there's not a lot of variation, RainToday seems to be more common. So when it rains on a particular day, the variation in temperature is small, and maybe that is also relevant for whether it rains tomorrow. These are all different facets you can explore; weather is infinitely complex, and there are a lot of interesting things you can learn from it. And then here is a graph of temperature today versus humidity today and how that relates to whether it rains tomorrow. My guess is that if the temperature today is low and the humidity is high, then there's a higher chance of rain tomorrow, and that seems to be the case: you can see that the red points sit in the low-temperature, high-humidity region, and the blue points sit in the high-temperature, low-humidity region. So try and make your own interpretations from these charts, and try visualizing the other columns of the dataset and studying their relation with RainToday and RainTomorrow as well. But we will move ahead. At this point, let's just save our work. From time to time you should save your work, and you can do this by importing the jovian Python library and running jovian.commit. The reason is that if you step away from the Google Colab notebook for a while, the notebook will shut down and you can lose your work. When you run jovian.commit, you will be asked to provide an API key; this is not the Kaggle API key, it's a Jovian API key, because we want to take a snapshot of this notebook and put it on Jovian. Just open up jovian.ai, and in the Get Started tab of your profile you will find the API key; click API key, come back here, and paste it in. This captures a snapshot of all the code execution we have done so far and puts the Jupyter notebook on your Jovian profile. As you can see here, this is my personal copy of the lesson notebook that I've been working on, and any changes I've made are preserved here. You can always go back to jovian.ai and find it in the Notebooks tab of your profile; as you can see, I have a notebook here called "python-sklearn-logistic-regression" just for this, and if you want, you can click "Run on Colab" or "Run on Binder" to run it again. So keep running jovian.commit from time to time. All right. Now there is one other optional step that I would like to suggest when you're working with massive datasets, where you have millions of rows. We don't have that right now.
We have around a hundred and forty thousand rows, which is fairly small, relatively speaking. But if you have millions or tens of millions of rows, it's always a good idea to initially work with a sample, and by a sample we mean a randomly picked fraction. This is so that you can quickly set up your model training notebook, so that each line of code you execute runs quickly and you're able to try out many different ideas fast. If you would like to work with a sample in this notebook, just set use_sample to True. If use_sample is True, we pick a fraction of the data to work with: we set raw_df to raw_df.sample(frac=sample_fraction).copy(), so we're taking, say, a 10% fraction, and instead of using 140,000 rows we would be using about 14,000, which makes our analysis and training much faster. In this case it doesn't matter that much, but if we had 10 million rows, it would make a lot of sense to work with maybe a hundred thousand initially, set up the notebook, then come back, change use_sample back to False, and rerun the notebook from top to bottom. This will save you a lot of time when you're working with massive datasets, so always set up these three things: use_sample, sample_fraction, and the if condition that lets you easily switch between working with a sample and working with the entire data. That's a quick trick. All right, moving right ahead: before we train the model, we have to learn a few best practices, ones that will come up over and over again with every model we build. While building real-world machine learning models, it is quite common to split a dataset into three parts. The first part is the training set, which is used to train the model: we take the inputs from the training set, put them into the model (which starts out with some random weights), take the outputs or predictions of that model, compare them with the actual targets, compute a loss using a loss function, and then use an optimization technique to change the model's weights. That's the machine learning training workflow. The training set is used to train the model, but remember that your model is not going to be used on the training data; the training data here is historical data, and we want to use the model on new data, so it would not be accurate to report the model's performance on just the training set. That is why we create two other sets: a validation set and a test set. The validation set is used to evaluate the model during training, because training is often not a one-step process: you train a little, see if you've got a good enough result, maybe train some more, maybe change some parameters of the model, maybe use a polynomial logistic regression model instead of plain logistic regression, or maybe change the optimization algorithm, from least squares to gradient descent to a different solver, et cetera.
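Circling back for a second, the sampling switch described at the start of this stretch is just these few lines (a sketch with the variable names used above):

```python
# Keep use_sample = True while setting things up and experimenting,
# then flip it to False and rerun the notebook on the full data.
use_sample = False
sample_fraction = 0.1

if use_sample:
    raw_df = raw_df.sample(frac=sample_fraction).copy()
```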
So while you're training the model, you will want to experiment with a lot of different tweaks, to the model structure itself, to the optimization process, maybe even to the loss function. As you tune all of these different aspects of training, you should be using what is called a validation set, which is simply a fraction of the dataset that you do not use to train the model, only to evaluate it. Once you have trained the model using the training set, you make predictions on the validation set and check what the score is. You don't use that score to train the model or improve the weights; you use the training set for training, and the validation set simply for checking how well the model is doing while you are still experimenting with it. Picking a good validation set is one of the most important parts of a machine learning project, because you want your model to generalize well, and you can learn more about it in the blog post I've linked about picking good validation sets. Then, finally, there is a test set, a third set, which is used to compare different models, or different approaches by different people. For example, there are public datasets used by researchers around the world; if everybody reported their accuracy on a different validation set, maybe somebody picks the first 10% of the data, somebody the last 10%, somebody a middle or random 10%, then those results would not be comparable. That's why it's common practice to standardize a test set for a dataset. You will often see that you are given training data and test data separately: you split the training data into training and validation, but the test data is standardized, and you're not supposed to use it for training, or even for validation; you should only use it at the end, to report the model's final accuracy. For many datasets, test sets are provided separately, and in a lot of cases, for example in Kaggle data science competitions, you are not even given the targets for the test set: you're given only the inputs, you make predictions, create a CSV file with your predictions, and submit it to Kaggle, and Kaggle compares them with the actual targets and gives you your score. This is called a hidden test set, and a lot of companies do the same thing with data scientists: once you're given a project, you might be given 80% of the data with labels and another 20% with the labels hidden from you. Once you've trained your models, you take that 20%, make predictions, give them back to whoever gave you the project, and they tell you how well your model is doing, which reflects how it would do in the real world. This is a very important piece of machine learning, and you will often see, when you start out, that the models you train perform really well in training.
You get to 98, 99, even 100% accuracy, but when you put them out into the real world, they get something like 30 or 40% accuracy. This phenomenon is called overfitting, where your model is optimizing too much for the training set. To give you an analogy for the training, validation, and test sets, think about how high school works. In India we have twelve grades, and in grade twelve you have a textbook and classes. You read the textbook, and the textbook, say a math textbook, has some questions. You solve the questions in the textbook and use them to improve your understanding of mathematics. One way to do that is to actually learn the concepts, but another way is to simply memorize the answers. Your model, because it is simply trying to optimize for the lowest possible loss, will over time start memorizing training examples, just like in a lot of school systems people start memorizing answers to questions. To avoid blind memorization, schools conduct exams: midterms, quizzes, tests at various points during the school year. You can think of those as the validation set: that's where you, or the model, face questions you have not already solved. Your performance is a little worse, but more importantly, you are pushed to learn the actual concepts and underlying relationships rather than simply memorizing answers. And then you have board exams, or what you might call the SATs, which are used to compare the performance of students across the nation or even the world; you can think of those as the test set. At the very end of the school year you take that one exam, and everybody is ranked based on it. So that's the training set, validation set, and test set: the training set is what you learn from, the validation set is what you use to check and improve your approach while you're still building the model, and the test set is what you use to report your final performance. As a general rule of thumb, and this is not set in stone, you can use about 60% of the data for training, 20% for validation, and 20% for testing. This is typically good enough if you're working with tens or hundreds of thousands of rows. If you have more data, millions of rows, you can shrink the validation and test sets and use more for training; if you have less data, you may have to keep a larger fraction aside; it really depends on the problem. And if a separate test set is already provided to you, say you're given a training set and a test set separately, then you only need a training/validation split, which could be something like 75/25 or 70/30. Typically this division into validation and test sets is done randomly, especially when the dataset does not have any inherent order: if the rows are just random observations, it is very common practice to pick a random subset of rows for the test set and the validation set, and this can be done using the train_test_split utility from scikit-learn.
So we import train_test_split from sklearn.model_selection, and you give train_test_split the data that you want to split. In this case, we want to split raw_df, the entire dataset that we have, and we want to extract 20% and set that aside as the test data frame, test_df. So I'm just setting test_size equal to 20%. Now, which 20% do we want to extract? We can specify a random state, and we don't have to; if we don't, then it will pick something automatically, but we can specify a number. In this case, I'm just specifying the number 42. What this will do is initialize a random number generator, and using that random number generator, certain rows will be picked. So a 20% fraction of the data will be put into test_df, and the remaining will be put into what I'm calling train_val_df. Okay. Now the benefit of random state is that it gives you the guarantee that every time you pass 42 as the random state, the same 20% fraction will get picked as the test set. So the benefit here is that each time you rerun the notebook, your test and validation sets will not change. We want a fraction which is picked randomly once, but we also want it to be fixed across runs, because otherwise we will not be able to compare between runs of our notebook whether we're doing better or not. Right. So you want to pick a random fraction, but you want to fix it, and that's what random state is for. Okay. So now we've created a test data frame, which is 20% of the data, and then we have a training-plus-validation data frame, which is 80% of the data. Now out of that 80%, I once again want to do a split where 25% of the remaining data becomes the validation data and 75% of the remaining data becomes the training data, which works out to roughly a 60-20-20 split overall. So I'm using train_test_split once again, and this time I'm setting the test size to 0.25. I could use a different random state here if I wanted, but it's okay, I'm just using 42 again. And that's going to give me a training data frame and a validation data frame. So first we extracted a test data frame, then we extracted a validation data frame, and finally we were left with the training data frame. This is how we have created three data frames, and you can check their sizes: the training data frame now contains 84,500 data points, the validation data frame contains 28,000, and the test data frame contains 28,000. Okay. So this is what you will do for most datasets: you will pick a random sample as a validation set and a random sample as a test set. If you're given a test set already, this step goes away; maybe you'll have a separate CSV file for the test set, so you'll simply do a 75-25 split between training and validation.
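For reference, here's a rough sketch of the random split described above, assuming the full data frame is called raw_df (the variable names follow the narration, not necessarily the exact notebook):

```python
from sklearn.model_selection import train_test_split

# First split off 20% of the rows as the test set.
# random_state=42 fixes the shuffle so the same rows are picked
# every time the notebook is re-run.
train_val_df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)

# Then split the remaining 80% into 75% training and 25% validation,
# which works out to roughly 60% / 20% of the original data.
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

print(train_df.shape, val_df.shape, test_df.shape)
```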
Okay. Now, one other thing I want to add here is that when you're working with dates, which we are in this case, fortunately or unfortunately, there is an additional consideration to keep in mind, because we want to take this model, which has been trained on historical data, and then use it in the future. That means that ideally, when your model is making a prediction for a certain date, it should not have seen any data that came after that date, right? You should not train your model to learn from future data to predict past data. So whenever you're working with what is called time series data, or whenever you're working with dates, it is often a better idea to separate the training, validation, and test sets by time. So what you want to do is maybe just check out the years in the data. This is what the distribution of the data looks like: we have a bunch of values for 2007, a bunch of them for 2008, nine, 10, 11, 12, 13, 14. And we are presumably going to use the model to make predictions in 2018 or 2019. So here is a good way to divide the training, validation and test sets. How about we take the data till 2014 as the training set? So we train the model using data till 2014. Then we use the data from 2015 as the validation set, so the model will be evaluated on data from the future, data it has not already seen in the past, right? And then we use 2016 and 2017 as the test set. So the final accuracy of the model will be reported based on its performance on 2016 and 2017, while the model has been trained on data till 2014 and evaluated on 2015. The benefit of this is that because we are going to use the model in 2018 and 2019, the test set score is going to reflect very well how the model will perform in real life. On the other hand, let's say you pick 2011 as the test set. Then it's possible that the model may use some information from 2010 and 2012 to become a better predictor for 2011 than it would be for 2018, right? Because it has access to data from the past as well as the future. So what I want you to take away from this is that whenever you're working with dates, or with time series data, it's often a good idea to take the last fraction as the test set and maybe the fraction just before that as the validation set. Okay. So that's what I'm going to do. I have created this year column: I'm going to convert raw_df.Date to a datetime series, and then I'm going to get the year out of it. I'm going to ignore what I created earlier, all of those data frames created using random splitting. For the training data frame, I'm just going to use the data before 2015; for the validation data frame, I'm going to use data just from 2015; and for the test data frame, I'm going to use the data after 2015. So 2016 and 17 are going to be the test set, validation is going to be 2015, and training is going to be before 2015. And this is what the split looks like. It's not a perfect 60-20-20 split, but we have still ensured that the validation and test sets both contain data for at least 12 months, an entire calendar year, which means they see all the seasonality, and they occur in the future, after the training data, which will more accurately reflect our model's performance in the real world. The whole purpose of training, validation and test sets is to create a model whose performance can be reliably measured in terms of how it will perform in the real world, right? We don't just want to get a higher accuracy just for the sake of reporting it. We want to create models that work well in the real world, and that's why we do all these splits. So that's your training set, the first eight years of data; then the validation set, the next one year of data; and then the next year and a half of data is the test set. Okay. If you do not have dates, then pick a random 60-20-20 sample. All right. And I'm just going to save my notebook once again and record another snapshot.
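And here's roughly what that year-based split might look like in code, assuming the date column is called Date (the column name is an assumption based on the dataset described above):

```python
import pandas as pd

# Derive a year for every row and split by time instead of randomly.
year = pd.to_datetime(raw_df.Date).dt.year

train_df = raw_df[year < 2015]   # everything before 2015
val_df = raw_df[year == 2015]    # just 2015
test_df = raw_df[year > 2015]    # 2016 and 2017

print(train_df.shape, val_df.shape, test_df.shape)
```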
Now, some of what we have done here may seem a little complex; it's not something that you may be able to grasp at first glance, but do work through the notebook and do go through every line of code. All of it is explained, and you can separate each line into its own cell and try to work through it and see how it works out. So the next best practice, or the next step in a machine learning problem, is to identify the input and target columns. Often this is something that people tend to mess up as well, just in the hurry to get to the modeling process. Not all columns in a dataset are useful for training a model. In fact, in the current dataset, if we look at raw_df, you can see here that we have a date column, and the date column is different for each day. In fact, it only contains dates from 2008 to 2017. Now, presumably we're going to use the model in 2020 or 2021 and onwards, so the date will never be something like this, right? There's no point training the model on these dates, dates which it has seen in the past, if it is going to be used in the future. So date is not a useful input to the model. On the other hand, minimum temperature definitely is, because the model is going to see temperatures in the same ranges. It may not see 13.4, but it may see 13.5, it may see 13.1. So this is a useful thing to have, and maximum temperature is again a useful thing to have. So all of these columns apart from date, like location, min temp, max temp, et cetera, are definitely useful for prediction, so we definitely want to use them. Similarly, sometimes you will have ID columns. For example, in the last example of insurance or medical charges prediction, there were certain IDs. Sometimes you will have names of people, like the name of a customer, or maybe emails of customers, or a customer ID, or some unique identifier which is different for every row. Whenever you have these ID columns, you should completely ignore them, because they are just unique values, and that is not useful information. Then there is obviously also a target column, the column that we want to predict. We want to take all of this information and use it to predict the rain tomorrow column. So we want to make sure that we don't use the rain tomorrow column as an input, for sure, right? You don't want to use whether it will rain tomorrow to predict whether it will rain tomorrow; rain tomorrow is the target. It seems like an obvious thing, but in a lot of cases you will see that you train a model and you suddenly see a hundred percent accuracy: a hundred percent training accuracy, a hundred percent validation accuracy, a hundred percent test accuracy. And you debug and debug, or maybe you just think you've won, but ultimately you will figure out that you were actually using the target as an input. I do this all the time whenever I'm training models. So that's why one thing you should try to do is very clearly identify what the inputs to your model are, what the outputs are, and what the things you need to ignore are. So I'm going to create two lists here. One is a list of input columns. Here I'm saying train_df.columns; it could be from the training data frame or from the raw data frame, it doesn't matter, they have the same set of columns.
I'm going to pick from column number one, which means I'm going to ignore the date, and I'm going to pick up to the second-to-last column, so I'm going to skip the last column, which is the target. That's going to give me all the columns starting from location up to rain today. And then the target column that I want to predict is rain tomorrow. In this case, there is just one target column, and generally in classification there is just one target column, but sometimes there might be multiple target columns as well, so this could be a list of target columns in certain cases. Another thing to keep in mind is that certain machine learning algorithms expect you to always pass a list of target columns; linear regression supports that, I think. But certain machine learning algorithms work fine with just a single value, and logistic regression, in fact, expects a single target column, and that's why we're just picking a single target column here. Okay. So just be aware of that: sometimes this may need to be a list, sometimes this may need to be just a string, and sometimes it won't matter, but this is the idea. We have input columns and a target column. So these are the input columns; I'm just very explicitly printing them out. It's always a good idea to eyeball this and make sure, okay, I'm not using rain tomorrow as an input. And then we have the target column as well, which is rain tomorrow. Now, another special case of this is that sometimes you may have columns which are derived from the target column. For example, you may have rain tomorrow and rainfall tomorrow, let's say in millimeters. If you have a column rain tomorrow and a column rainfall tomorrow, then you probably don't want to use rainfall tomorrow as an input, because you want a model that works with today's data. Rainfall tomorrow is going to be a very strong indicator of rain tomorrow, but at the same time it is information derived from rain tomorrow, and it is not something that you will have in the real world. So be mindful of that: sometimes some of these columns are derived from, or very closely related to, the target, and are essentially targets themselves, so make sure to ignore those, right? If you want to use today's data, pick just today's data. So that's your input columns and target column. Now, once we have that, it's a good idea to create separate data frames. So we have train inputs; here I'm simply picking the input columns from the training data frame, and I'm also creating a copy, because we will now start modifying this data, we'll start doing scaling and encoding, et cetera. So I'm just going to create a copy so that the original data is not disturbed. And then we have train targets; here I'm just picking the target column. So train inputs is going to be a data frame, you can see here, a data frame containing just the input columns, and train targets picks a single column, so it's going to be a pandas Series. You can see that this is a series, and this series just has a bunch of yeses and noes. Similarly, we are creating validation inputs and validation targets, and we are creating test inputs and test targets, and we can see what they look like. So mostly all the inputs are going to be data frames, or sometimes NumPy arrays, and the targets could be series or they could be data frames.
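A rough sketch of those two lists and the copied data frames, assuming the target column is named RainTomorrow (as in the usual version of this dataset):

```python
# Skip the first column (the date) and the last column (the target).
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'

# .copy() so that later imputation, scaling and encoding don't touch the originals.
train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col].copy()
```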
And if you're ever in doubt, you can always just check what the type is. So this is a series, but the general rule of thumb is that when you pick multiple columns, or a list of columns, you get a data frame; when you pick a single column, you get a series. Okay. One other thing that will be useful is to identify which columns are numerical and which ones are categorical. This will be useful later because we will need to convert categorical data into numerical data for training a logistic regression model, or any machine learning model. So if we look here, these seem to be numeric, and then this seems to be categorical. I don't know, maybe this is also numeric, but then this is definitely categorical, this is definitely categorical, this is definitely numeric. So there's a combination. In fact, location here is definitely a categorical column, and it has a grand total of 49 categories. And obviously location is a big factor, because the weather or the climate of each location is definitely different. So as long as we are predicting just for these same locations, location is an important factor to look at. And by the way, you should also understand that because we are using location as an input, your model can only make predictions for locations that it has already seen. You cannot use this model for New York, because your model does not know what to do with New York as an input; it has only seen these particular locations as inputs. So if you want to create something generic, something that works irrespective of location, then you will have to remove the location column, right? So keep that in mind; on a case-by-case basis, you'll have to figure out whether to use it or not. For now, we'll assume that we are going to use the model for predicting for these same locations most of the time, so we will retain the location column. So here's a quick trick to come up with the numerical and categorical columns. If you take a data frame and call select_dtypes and simply tell it which type of columns to include (here we have said include np.number), then .columns gives you the list of columns. So basically this selects just those columns of the data frame which are numeric, .columns gives you the names of those numeric columns, and then we're just converting them to a list. So I think we should remove this as well, and that should give us the numeric columns. So let's see numeric cols. These are all the numeric columns: minimum temperature, maximum temperature, rainfall, evaporation, sunshine, et cetera. Okay. All I've done is selected the columns which have a numeric type, np.number, and converted the list of those columns into a list. And similarly for categorical, we have selected the columns with data type object and converted those to a list. Now, I was able to do this because there are no string columns in this dataset, but if there were string columns, then we would have to do something more interesting, or maybe we would have to manually type out the categorical columns. And by the way, you can do that as well, that's completely fine; there are just 23 columns, so you can manually type out the names of the numeric and categorical columns. Okay. So that's the numeric and categorical columns. Here are the numeric columns: min temp, max temp, rainfall, et cetera.
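Here's a minimal sketch of that trick, assuming the inputs data frame is called train_inputs:

```python
import numpy as np

# Numeric columns have a number dtype; the categorical ones are stored as object (strings).
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes(include='object').columns.tolist()

print(numeric_cols)
print(categorical_cols)
```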
And as soon as you have the numeric columns, you can also very quickly check some statistics by calling .describe. So here we are just selecting the numeric columns from the training inputs and then calling .describe, just to explore: okay, what do the ranges of these values look like? What's the minimum, what's the maximum, what's the standard deviation, what's the median? So it seems like the minimum temperature goes from minus 8.5 to 33, and the maximum temperature goes from minus four to 48. This is also a good opportunity for you to see if there are any values that don't make any sense at all, in which case we may have to fix some incorrect values in some way, maybe draw some more histograms, et cetera. Okay. Yeah, but these values look fine to me, so I don't think there's a lot of data cleaning required here at this point. So that's the numeric data. Let's also check the categorical data. For the categorical data, it's a good idea to check how many unique categories there are, and this will also let you verify whether these are actually categorical. For example, location has 49 values; that may seem like a lot, but we are dealing with 140,000 rows of data here, so with 50 I would still call it categorical, I would not call it a string. But if it had maybe 5,000 or 10,000 different values, then maybe that's better suited as a string column and not a categorical column. Then we have wind direction, with 16 values; well, those are just compass directions, and similarly for wind direction at 9 AM and 3 PM. And finally, we have the rain today column, which is also a categorical column: yes and no. And rain today, I'm sure, is going to be very helpful in predicting whether it's going to rain the next day or not. Okay. So that's just identifying the input columns and target columns, and within the input columns, the numeric columns and categorical columns. The reason we've done this is because we'll now start doing some processing with this data. Now, machine learning models cannot work with missing data. Whenever you have a NaN and you try to train a machine learning model, it's probably going to crash. So you need to fill these missing values with some valid values, and the process of filling missing values is called imputation. There are several techniques for imputation, but we will use the most basic technique here, and I'll let you check out some of the other techniques; we will try out other techniques over time. The most basic technique is to replace missing values with the average value in the column, and you can do this using the SimpleImputer class from sklearn.impute. So what we want to do is replace this NaN with the average value in this column. And the average in this column, if we have just three rows and the other two values are 17 and 5, is 22 divided by 2, which is 11, so we put 11 here. The average in this column is six, so we put six here, and the average in this column is just seven, so we put seven here. So this is just a guess; it obviously may not be correct, but you have a choice here. You could throw away those rows, but if you have a lot of NaN values in a certain column and you throw away those rows, you may lose a lot of data. So you just take a guess and say, okay, maybe I replace it with the average.
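To make that calculation concrete, here's a tiny made-up example (the numbers 17, 5 and so on are just the ones mentioned above, not real data):

```python
import pandas as pd

# A small toy frame with one missing value per column.
toy_df = pd.DataFrame({
    'a': [17, 5, None],   # mean of 17 and 5 -> 11
    'b': [4, None, 8],    # mean of 4 and 8  -> 6
    'c': [None, 7, 7],    # mean of 7 and 7  -> 7
})

# .mean() skips missing values, which is also what
# SimpleImputer(strategy='mean') does internally.
print(toy_df.mean())
print(toy_df.fillna(toy_df.mean()))
```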
Sometimes you may choose to replace it not with the average but with a fixed value, and you can do that as well; SimpleImputer has several strategies. You may replace it with the mean, or with the median if you think that there are outliers that may affect the average. For example, if you have an income column, it's always a good idea to replace missing values with the median rather than the mean, because there will be outliers that affect the average. Okay. So there are many different strategies; we are going to use the simple strategy of filling in NaN values with the average value in that column. So how do we do it? We import SimpleImputer, and then we create a SimpleImputer object with some parameters, in this case just a strategy, but you can check what other parameters it supports simply by typing SimpleImputer with a question mark before it. Here you can see all the help available for SimpleImputer, and you will also see some examples. You can specify what the missing value looks like; in this case it's np.nan, which is the default. You can specify what strategy you want to use, and so on. Okay. So we create the imputer; fine, now we've created an imputer object. Before we perform any imputation, let us just check how many missing values we have in each numeric column, so that we can verify later that they were filled. So on raw_df, which is the entire dataset, we get the numeric columns and call isna, which replaces each value in the data frame with true or false depending on whether or not it is null, and then we take the sum. All of this business is just to count the number of missing values in each column. You can see that min temp and max temp have a low number of missing values, while evaporation and sunshine have a fairly high number; for evaporation, it seems like almost half of the data is missing. And for some of the other columns, there is a reasonable fraction of missing values. Okay. This is for the entire dataset, but we could also check these individually for the training, validation and test sets. For example, here is the check for just the train inputs, checking the missing values just for the training data, and you will see a similar pattern, a similar fraction, in the validation and test sets as well. Okay. So with that out of the way, the first step in imputation is to fit the imputer to the data, which means that we give it the data that we want to fill, and it's going to look through all the columns. So it's going to look through the min temp, max temp and rainfall columns in the data frame raw_df with the numeric columns, which is just the numeric columns chosen from the raw data frame. It's going to go through each column, and for each column it is going to figure out the statistic that you've asked it to figure out, the imputing statistic; in this case, we're using the mean, the average. So it's going to compute the mean for min temp, the mean for max temp, the mean for rainfall, the mean for evaporation and so on. Okay. So let's call imputer.fit. It has not yet changed any of the data; we've not filled anything in, we have simply calculated the statistics. So if you now check imputer.statistics_, you can see the statistic for each of the columns that we gave it, and remember these columns are simply the numeric columns that we have picked.
So for the min temp column, it has figured out that the average is 12, for the max temp column the average is 23, for rainfall the average is 2.5, for evaporation the average is 5.4, et cetera. We asked it to compute averages, and it has computed the averages. Now, once these averages have been computed, they need to be filled in, into the training set, the validation set and the test set. So here's how we do it: we call imputer.transform. Let's take it step by step. Let's look at the train inputs' numeric columns. Okay, so this is just my training data, which is the 60% of the data, with just the numeric columns picked out into a data frame, and obviously there are a lot of NaNs here. Then we call the imputer; the imputer already knows the averages for all of these columns, min temp, max temp, rainfall, et cetera. We call imputer.transform, and it's going to fill all the NaNs with the averages, and the result is a NumPy array. You can't really see it here, but if you look at this array, you will see that it does not contain any NaNs: it has looked at all the NaNs and replaced them with those average values. Now, it's not showing you a data frame, because the imputer just outputs a NumPy array. But what we can do is take this output, this NumPy array, and put it back into the data frame as the new values for the numeric columns, right? So we are overwriting the numeric columns in the original train inputs data frame with the imputed data, which means with the NaNs replaced by the averages. And we're doing the same for the validation data: we are transforming the valid inputs' numeric columns and overwriting the valid inputs' numeric columns with the result, and so on. So all of this business, imputer.transform and then storing the data back, all of this is simply to get to the point where, if I now check the train inputs' numeric columns, you will see that there are no NaN values anymore. Those NaN values have been replaced with the average value, which I guess in this case you can probably tell: 5.47 is probably the average value for evaporation, and 7.6 is probably the average value for sunshine. But only the NaNs have been replaced, not all the values. Okay. And now you can check, for each of the training, validation and test sets, that there are no missing values at all. Okay. Now, that's the average imputation technique, but you can try other imputation techniques like the median, or sometimes you can also use other columns to impute data. There are several techniques; just check them out. There's an entire module in scikit-learn that does this. Okay. Now we've dealt with missing values.
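Before moving on, here's a rough recap of the whole imputation flow in code, assuming the variable names used in the narration (raw_df, train_inputs, numeric_cols and so on):

```python
from sklearn.impute import SimpleImputer

# How many values are missing in each numeric column?
print(train_inputs[numeric_cols].isna().sum())

# Learn the per-column averages from the full dataset...
imputer = SimpleImputer(strategy='mean')
imputer.fit(raw_df[numeric_cols])
print(imputer.statistics_)   # the computed averages, one per numeric column

# ...then fill the NaNs in the training, validation and test inputs.
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])
```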
Another good practice is to scale features to a small range of values. Right now, if you check the numeric columns of the data, you will see that they have a fairly wide range. For example, min temperature goes from minus eight to 33; on the other hand, wind speed goes from six to 135 (I think this is the min and this is the max); and let's see, pressure at 3PM seems to go all the way up to 1039, and the minimum seems to be 977. Now, what happens is that when you put all these numbers into a machine learning algorithm, apply weights to them, and then compute a loss function, the loss is ultimately a single number, the cross entropy or root mean squared loss, and we say the higher the loss, the worse the model. But if a certain feature has a very high range of values (maybe rainfall takes values that are in the thousands), while minimum and maximum temperature take values that are less than 10, and maybe there's a certain column where the values are 0.0-something, basically in the decimals, then the values which have a very high magnitude tend to dominate the loss, because ultimately it's all these numbers getting multiplied and added together. So the features that have a high range tend to dominate the loss, while features that have lower absolute values and lower ranges tend not to show up in the loss as much. And whatever dominates the loss also dominates the optimization process. That means the weights for the features with these high numbers are going to change a lot, but the weights for the features with low numbers or smaller ranges are not going to change that much during optimization. So just to give every feature a level playing field, because we happen to measure rainfall in a unit where it gets large values but temperature in a unit where it gets lower values, we should scale all the features into the same range of values. So we're going to scale all the numeric features into the range zero to one. Sometimes this is also minus one to one; either one is fine, it depends on the circumstances, but zero to one is fine for classification problems. And the way you do this is using a MinMaxScaler. So from sklearn.preprocessing, which is again another module (in fact, what we're doing right now is called pre-processing), we import MinMaxScaler, and then we create a MinMaxScaler, and you can give it some options: you can check here, it supports a feature range, whether it should create a copy of the data, et cetera. But the objective of the MinMaxScaler is to identify, for each column, what the minimum and the maximum are, and then scale the values in the column so that they all lie in the range zero to one. It's a two-step process, just like the imputer. First, we fit the scaler to the data, so we compute the range of values for each numeric column: we say scaler.fit on raw_df's numeric columns. What's happening here is that it's going to look at each column from raw_df, the entire data frame (min temperature, max temperature, rainfall, evaporation, et cetera), and find the minimum and maximum of each column when we run scaler.fit. So now, if we check, you can see that for the numeric columns (min temp, max temp, rainfall, et cetera) the minimum values are minus 8.5, minus 4.8, zero, zero, et cetera, and the maximum values are 33, 48, 375, 371, 145, et cetera. So now it knows the minimum and the maximum, and these are stored in data_min_ and data_max_.
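Here's a sketch of the full fit-and-transform flow for the scaler; the transform step is what we'll look at next (again, the variable names are assumptions based on the narration):

```python
from sklearn.preprocessing import MinMaxScaler

# Learn the min and max of every numeric column, then rescale to the 0-1 range.
scaler = MinMaxScaler()
scaler.fit(raw_df[numeric_cols])
print(scaler.data_min_)
print(scaler.data_max_)

train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])
```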
Now, when we call scaler.transform, it's going to take each column, and instead of the values being in the range minimum to maximum, it is going to scale them into the range zero to one. So the minimum value will become zero, the maximum value will become one, and everything else will be scaled proportionately. So let's do that. It's a pretty simple transformation, ultimately. I think we can just check this here: we can call describe, and for each of the numeric columns you will see that the minimum becomes zero or very close to zero (you can see all of these are very close to zero), and the maximum becomes one or very close to one. In this case, you can see that the maximum is 0.56, 0.98; yeah, it either becomes one or very close to one. In some cases it may not actually be one, because of how the imputation works internally, but you can also verify this by looking at the actual data instead of just the statistics: all of these values are now nice and small, all between zero and one, for all the numeric columns at least. Okay. So now we have all our numeric data in the range zero to one, so none of the columns will disproportionately affect the loss. Another thing is that optimization algorithms often use calculus, and calculus involves derivatives and powers of functions and polynomials, and all of these optimization algorithms also work really well when all the numbers are small, in the zero-to-one range. There are also issues with numerical methods when they cross floating point ranges: if numbers get too large, then decimal information can be lost, and there are a bunch of issues with large numbers, especially when you're working with these numerical optimization methods. So it's always a good idea to get your numbers into a small range. Okay. The next step is about categorical data. Now our numerical data is in good shape: missing values have been filled and the data is in a good zero-to-one range. Because machine learning models can only be trained with numeric data, we also need to convert categorical data to numbers. A common technique is to use one-hot encoding for the categorical columns. So you take a categorical column, a column which has, say, category A, category B and category C. What you do is you get rid of this column and instead create a column for each category: category A becomes a column, category B becomes a column, category C becomes a column, and whichever row had the value category A in the categorical column, we put a one in the category A column and zeros in all the other columns. That's why this is called a one-hot vector: instead of the category string, or the category class, we replace it with a vector which contains a one in that particular category's column and zeros everywhere else. Similarly, category B is also a one-hot vector, it becomes zero, one, zero, and category C becomes zero, zero, one. Now here there are only three categories, but you can have any number of categories. For example, if we check the number of unique categories, for location we have 49 categories, for wind gust direction we have 16, for wind direction at 9 AM we have 16, and for rain today we have two categories. So now, if we want to perform one-hot encoding, let's just look at the location data.
What we're actually trying to say is that we want to create one column in the data for Albury, Badgerys Creek, Cobar, et cetera, and put ones and zeros in these columns. So we're going to introduce 49 columns for location. That may seem like a lot, but it's okay, it's not a problem, because most of these values are just going to be zeros, with a one here and there, so it's not something to worry about too much, and our data frame can handle it. So how do we create these one-hot encoded columns? The way to do that is using a OneHotEncoder, again from sklearn.preprocessing. And I hope you're starting to get the drill now: for all of these pre-processing steps, you create an object. So we are creating a OneHotEncoder, and we're giving it a bunch of options. We're telling it that we don't want a sparse matrix, we want to actually create a proper NumPy array (sparse is just an optimization technique), and we are saying that if it comes across a category that it hasn't seen before, it should just ignore it. There are several other options that it supports too. Okay. So we create the encoder object. Then the encoder needs to first understand the data, so we call encoder.fit, and at this point it's not creating any one-hot encodings, it's simply identifying the categories in the data. Okay, so I think this needs to be fixed somehow; I'm just going to check how to fix it. We call encoder.fit on raw_df, and maybe there's a way to ignore NaN values; let's see, one hot encoder scikit-learn ignore NaN. All right, I'm going to do a quick fix here. I think this is something to do with the actual version of scikit-learn involved, but I'm just going to take raw_df, pick the categorical columns, and call fillna, filling the missing values with the word unknown, and I'm going to call the result raw_df2. Yeah. So these are some issues that you will run into from time to time; for example, this worked with the version of scikit-learn I had on my computer, but did not run on Colab, and the issue seems to be with NaN values. So I'm just going to put in raw_df2 here, which is the data with NaN values replaced with the word unknown, and that will just be treated as another category. All right. So now we've called the one-hot encoder's fit method on the raw data frame. It has looked at all the columns that we gave it, and for each of the columns it has identified the categories that those columns contain. So if you check the categorical columns, the columns of data that we gave to the encoder: for location it has identified these categories, similarly for wind gust direction it has identified these categories, and for rain today it has identified the categories no and yes. So for each of the columns in the data frame (location, wind gust direction, wind direction 3PM and rain today), it has identified what the list of categories is. Then there is one additional step involved here: we can generate column names for each individual category using the get feature names function. Again, not something you need to remember, just something you can look up.
So what we do now is create a list of column names. Instead of having just location, we have location_Adelaide, location_Albany, location_Albury, et cetera, and similarly, for each category we have a column name that we've generated, like rain today_No and rain today_Yes. We just prefix it with the original column's name to avoid any confusion or duplicate column names, because, for example, wind direction 3PM has WSW and wind direction 9AM will also have WSW. Okay. So now we have a list of column names for these new one-hot columns that will be created, and we're just calling them encoded cols. Finally, to actually create those one-hot columns, we call encoder.transform and give it the categorical columns of data. So we are giving it the train inputs' categorical columns, and that is going to give us NumPy arrays, and we want to put those NumPy arrays into the pandas data frame, the train inputs data frame, with these column names. Okay. So I'll just show you the output here and... okay, I think I will need to fix this as well. So I'm just going to quickly fix the train inputs' categorical columns with fillna; I think we can just add this here, and this should fix it. Sorry about that. All right. So what's going to happen here is that it is going to transform all those categorical columns into NumPy arrays of one-hot encoded vectors, and then it is going to add those as new columns in the data frame. So let's view the test inputs, for example, and see what that looks like. You will now see that we have location, min temp, max temp, all of these, but if you start scrolling further, you'll start to see all these new columns: location_Adelaide, location_Albany, location_Albury, so 49 location columns, and then you will see the wind gust direction columns and the wind direction columns, and then there are two rain today columns. Okay. So we have added all these one-hot columns, and you can see that there are zeros in most of them and a one here and there. For example, this row's location is Albury, so you should see a one at location_Albury; yeah, you see location_Albury set to one. Okay. So now we have done the categorical data encoding as well. Again, it's a bunch of code, but ultimately it's something that you can look up quite easily; the more important thing is to grasp the concept that we are doing this encoding. Now, once you've done all of this pre-processing, and before you're ready to train the model, one thing that you should do is save these intermediate outputs to disk, because we've spent a couple of hours just creating all of this data, and if you're working with really large datasets, it may take even longer, because each step can take some time to run, a few seconds to a few minutes, maybe even up to an hour for some of these steps. So now we have these train inputs, train targets, val inputs, val targets, test inputs and test targets for the training, validation and test sets. You can save these to CSV, but there are more efficient formats; a very efficient format that pandas data frames support is called the Parquet format.
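To recap the whole encoding step in one place before we save everything to disk, here's a sketch; note that sparse=False and get_feature_names are spelled sparse_output=False and get_feature_names_out on newer scikit-learn versions, so treat this as indicative rather than exact:

```python
from sklearn.preprocessing import OneHotEncoder

# Treat missing categories as an 'unknown' category so fit doesn't choke on NaNs.
raw_df2 = raw_df[categorical_cols].fillna('unknown')

# sparse=False gives a plain NumPy array instead of a sparse matrix.
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(raw_df2)
print(encoder.categories_)

# One column name per category, e.g. Location_Albury, RainToday_Yes.
encoded_cols = list(encoder.get_feature_names(categorical_cols))

# Add the one-hot columns to each of the three input data frames.
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols].fillna('unknown'))
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols].fillna('unknown'))
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols].fillna('unknown'))
```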
So you can save these to the Parquet format, and for that you need the PyArrow library installed; that's why we have this pip install here. So I'm just going to save the train inputs to train inputs.parquet, the val inputs to val inputs.parquet, and the test inputs to test inputs.parquet. And similarly, I'm also going to save the targets. There's just one additional step here: we need to convert them into data frames to save them properly. So I'm going to save those targets as well. And now the benefit is that the next time you restart this notebook, or maybe you start a new notebook (maybe you want to do pre-processing in one notebook and model training in another notebook), you don't need to repeat all these pre-processing steps. You can simply read these files. So I'm just reading in the train inputs, val inputs, test inputs, train targets, val targets and test targets, and that gives me the same information. So what we've just done is take these training, validation and test sets, after all the pre-processing, and save them to disk, so that if we shut down the Jupyter notebook, they will still be there. Of course, on Colab the disk also goes away, so this is not completely true there. What you'll want to do on Colab is maybe download all these Parquet files, or save them to your Google Drive or something; at the very least you want to back up these files if you're working on Colab, and then you can upload them into a new Colab notebook when you're working on the next project or when you want to use this data. On your own computer, you can simply load them up; you don't need to do all that business, it is always saved on your file system, so you can write and read these anytime. Okay. So that's just a quick tip: whenever you have done a lot of pre-processing, just save the outputs after the pre-processing. And finally, after all of that, we are ready to train the logistic regression model. Okay. This may seem like we've done a lot of work, but actually it's just been four or five steps: identifying the inputs and the targets, identifying the numeric and categorical columns, filling missing values in the numeric columns (that was through imputation), scaling the numeric columns to the zero-to-one range, and then replacing the categorical columns with one-hot vectors. So those are the four or five pre-processing steps that we've performed, and now we have our data in a form where most of it is numeric. If I check the train inputs, yeah, we still have these categorical columns, which we will get rid of while actually training the model, but we have numeric data for the numeric columns, and the categorical columns have also been converted into numeric data.
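Here's a rough sketch of that save-and-reload step, assuming the file names used above:

```python
import pandas as pd

# Save the preprocessed splits (requires the pyarrow library).
train_inputs.to_parquet('train_inputs.parquet')
val_inputs.to_parquet('val_inputs.parquet')
test_inputs.to_parquet('test_inputs.parquet')

# The targets are Series, so wrap them in DataFrames before saving.
pd.DataFrame(train_targets).to_parquet('train_targets.parquet')
pd.DataFrame(val_targets).to_parquet('val_targets.parquet')
pd.DataFrame(test_targets).to_parquet('test_targets.parquet')

# In a fresh session they can be read back without redoing the preprocessing.
train_inputs = pd.read_parquet('train_inputs.parquet')
train_targets = pd.read_parquet('train_targets.parquet')[target_col]
```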
So now we're ready to train the model. Logistic regression, as I've already mentioned, is a common technique for solving binary classification problems. We take a linear combination, or weighted sum, of the input features; that's why we need them all to be numeric, and that's also why we need them all to be in similar ranges, so that one feature does not dominate the loss. Then we take that weighted combination and apply the sigmoid function, which ultimately gives us a number between zero and one for each input. So for each set of inputs, for each specific row of data, we get a number between zero and one as the output of the model. That number represents the probability of the input being classified as yes; that's how we interpret the number from the model. Then we use the cross entropy loss to evaluate how good the probability is with respect to the actual result. Intuitively, what we want is this: if the target is yes, or one, then we want the probability to be high, very close to one; if the target is no, or zero, then we want the probability to be very close to zero. And that's what this formula captures. This is called cross entropy; we won't get into it, but the idea is that it penalizes bad predictions. Okay. And then we train it using the normal machine learning training process. So here's how we train a logistic regression model. The actual model training part is so simple that you simply import LogisticRegression from sklearn.linear_model and then create a LogisticRegression. Now, if I hover over it here, you should be able to see that there are a lot of different arguments that we can pass: the kind of penalty, something called dual, something called C, a tolerance, fit_intercept, et cetera. You can even specify a random state, you can provide weights for the different classes, and you can specify the number of iterations. These are all parameters that you can change, and remember I was talking about the validation set: each time you try a different kind of model with a different set of parameters, you can use the validation set to judge whether that set of parameters led to a good result or not. Okay. Now, in my case, I have picked the solver called liblinear, and you can check what kind of solvers it supports; you can see the options here. There is newton-cg, there is liblinear, there are sag and saga; you can try out different ones. The reason I picked liblinear was because lbfgs did not work for me. Sometimes a solver is unable to solve the problem, because these are numerical methods and sometimes they just fail to converge, so you try a different solver. So I just tried the liblinear solver, and that took just about half a second; you can see in total it took about half a second to perform this entire fit. Well, actually, I'll take that back: we have not yet trained the model, we've just created it, we've just instantiated this model with the liblinear solver and some other options. Now, to train the model, we call model.fit. Okay, so let's run model.fit, and this is going to take a few seconds, maybe. Okay, it took about 1.4 seconds on my computer, maybe even less than that. And what did it just do? Well, it uses this exact machine learning workflow, and I hope you're starting to get bored with it at this point, because that's what we want: you want to get this machine learning workflow internalized. We initialize the model with some random parameters, some random weights and biases; we pick random values for the w's and b's for each of the features that we have. Then we pass the inputs, the training data, into the model and get some predictions. We take those predictions, the outputs from the model, and run them through a loss function together with the targets. And then we use an optimization technique to slightly reduce the loss and improve the weights of the model; in this case, we're using the liblinear optimization technique. Then we repeat steps one to four till the predictions from the model are good enough.
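The model creation and training itself boils down to something like this (a minimal sketch, assuming the column lists defined earlier):

```python
from sklearn.linear_model import LogisticRegression

# Train on the numeric + one-hot columns only; the original string
# columns (Location, RainToday, ...) are left out of the inputs.
model = LogisticRegression(solver='liblinear')
model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)
```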
So if you look at LogisticRegression, you will see that there is this max_iter, the maximum number of iterations that we want the model to take. Here, what we are telling the model is that we want to repeat this process of creating predictions, comparing them with the targets and running some optimization a hundred times at most, or we want to stop at the point where the tolerance is reached, where the improvement in the loss gets smaller than a particular value, 0.001, right? So all of these things you can configure: if you want the model to go on for longer, increase max iterations and make the tolerance lower, and that way it will go on for longer. It will take more time, but maybe it will learn better. So all of those things are things you can customize in the model. Okay. But that's it, the model has now been fitted. So now, for each of the individual columns in the data, for each of the numeric and encoded columns that we have given it, it is going to have a particular weight. And I just want to tell you one other thing that I've done here: I have not given it the entire train inputs, because the train inputs contain some categorical data as well. Remember, we have not removed the location column, although we have added those one-hot vectors. So if I just put in the train inputs, you can see that although we've added the one-hot vectors at the end, we have not removed these categorical columns. So I've done this small trick here: because I know the names of the numeric columns and I know the names of the encoded columns, I've just picked the numeric and encoded columns and given those as the input to the model. And in terms of the targets, we've simply given the training targets. Actually, the training targets are not numbers; they are these yes and no values, and that's fine. Scikit-learn, specifically logistic regression, can work with categorical training targets. It cannot work with categorical data as inputs, but it can work with categorical training targets: the first class it encounters it will automatically convert to zero, the second class it encounters it will convert to one, and so on. Okay. So the targets can be categorical, but the inputs cannot be categorical, and this differs from model to model, library to library, et cetera. In any case, we have now fit the model using just the training data, and all these weights have been set. So if we check the numeric plus encoded columns, we have all of these columns, and we can see what weight has been assigned to each column in that formula by checking the coef_ property. Minimum temperature has been assigned a weight of 0.89, and maximum temperature has a weight of minus 2.8, so it seems like maximum temperature has a negative impact on whether it will rain tomorrow. And that makes sense: if your maximum temperature gets too high, then it's unlikely to rain tomorrow. Then rainfall has a high positive coefficient.
And that again makes sense: yes, you should give a high weightage to the amount of rainfall today to figure out if it will rain tomorrow. Evaporation again has a positive coefficient, and sunshine has a negative coefficient. So you can see that these models are not just black boxes; you can actually interpret them. What this model has learned is that rainfall, the amount of rainfall today, is a good predictor and a strong indicator of whether it will rain tomorrow. On the other hand, the amount of sunshine today is a strong predictor, but in the opposite direction: if you have high sunshine today, maybe it's not going to rain tomorrow. Again, wind speed: if your wind speed is very high, if you have really fast winds today, then it's probably going to rain tomorrow. And let's keep going; let's check humidity. It seems like humidity is also a predictor, specifically humidity at about 3 p.m. is a good predictor. And if you want to see these side by side, you can create a data frame. You can do pd.DataFrame: let's call this column feature and give it the numeric columns plus the encoded columns, and let's add a weight column and give it the list of weights. Okay, maybe let me just check this. Okay, yeah, I think I need to pick index zero here, and this should be fine. Yeah. All right, we finally have it. Sometimes, as you can see, this coefficient list is actually a two-dimensional array, not a one-dimensional array, and that led to a problem. In any case, you can now see the different features and the weight for each feature: rainfall, temperature, evaporation, sunshine, et cetera. I think we can also increase the number of rows pandas shows here, so if you want to see all 118 features, we can show that information here too. But roughly, this is how you interpret the model: the higher the weight for a particular feature, the more important it is for the prediction, and the lower the weight, the less important it is. The sign indicates whether it is a positively correlated or a negatively correlated feature. So your logistic regression model is capturing all of this information in the form of weights. And then of course there is that last intercept, or bias term, that is added at the end, and that is simply a correction for the case where all the values are zero, sort of the zero correction. You can try more experiments, and that is something that may take more time: you can try different imputation techniques, you can try different kinds of encodings; there are other encodings as well. Someone asks: is max_iter the number of iterations of the simulation? Well, max_iter is, as you can see here, how many times we repeat this process: we initialize a model, put in some inputs, get predictions, compare the predictions with the targets using the loss function, and optimize the model using an optimization technique. One pass of optimization does not fix the entire model forever; it simply makes small changes to the weights that slightly reduce the loss, because all of these are approximation methods. So you need to repeat this optimization many, many times to get a fairly optimal model, and that is what max iterations indicates: how many times you repeat this process. Do you visualize the feature weights with barh? Yes, you could do that; that's a good idea. I haven't done that, but yeah, I could do that.
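Here's roughly what that feature-weight table looks like as code, assuming the model and column lists from above:

```python
import pandas as pd

# coef_ has shape (1, n_features) for binary classification,
# so take the first (and only) row of weights.
weight_df = pd.DataFrame({
    'feature': numeric_cols + encoded_cols,
    'weight': model.coef_.tolist()[0]
})
print(weight_df.head(10))
```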
So let's say we call this weight_df, and let's do sns.barplot with data equals weight_df; on the x axis we can have the weight and on the y axis we can have the feature. Yeah, and hopefully that should give us a good... okay, I think we have a lot of features, so I should probably set a figure size; something like this should make it easier to visualize. Yeah, so you can see the different features and their weights visualized here. You can clearly see that pressure at 3PM is a very important factor, wind gust speed is a very important factor, a low pressure seems to be a strong indicator, and humidity at 3PM is a very strong factor. You could probably even sort these by importance; that was a great suggestion, actually, I should have added this before. So you can sort these by importance and just pick the top 10 features. Let's do something like this: sort values by the weight, with ascending equals false, and just pick the top 10 out of these, and then we won't need this anymore. So wind gust speed, humidity at 3PM, pressure at 9AM, rainfall, temperature at 3PM, cloud at 3PM, minimum temperature: this is the order in which things are important. It seems like wind is the most important factor in predicting whether it will rain tomorrow. Okay. So moving ahead, next we want to make some predictions and evaluate the model. We've trained a model, but we ultimately want to use the model, and before we use it, we want to evaluate it on the training set, validation set and test set. So let's evaluate the model. Because we want to use just the numeric and encoded columns, I've created these three variables, X_train, X_val and X_test, which contain just the numeric data that we can pass directly into the model. So now I'm going to first make a prediction on the training set and get a bunch of predictions, and the way to do that is to call model.predict. When you call model.predict, you give it a set of inputs; you don't give it targets, you simply give it some inputs, and it is going to give you a list of predictions for those inputs. So this is what our model has predicted: no, no, no, no, yes, no, no, et cetera. And you can check that there are a few yeses here: if I do a list, you will see that there are actually some yeses; it's just that because only about 20% of the data has rainfall, only about 20% of your model's predictions are going to be yes. Okay, that's a long list, I should probably have limited it in some way. Anyway, these are the predictions that our model has created: no, no, yes, yes, et cetera. And these are the actual targets, this is what we expect to see from our model. So we have some predictions and we have some targets: for each day, the model has predicted yes or no, and we have the actual target, yes or no. The simplest way to understand how well it has done is to simply look at the accuracy. Accuracy simply means we compute the percentage of matching values in the train preds and the train targets. You see train preds here: train preds has a bunch of nos and yeses, and train targets has a bunch of nos and yeses. We compare and count these: this is a match, one match, two matches, three matches, four matches, five matches, mismatch, mismatch. So we count the number of matches and divide that by the total number of days, which is about 144,552, more or less. And we can do that directly using this accuracy score function.
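A small sketch of the prediction step and the manual accuracy calculation just described (X_train, X_val and X_test are the narration's names for the numeric-plus-encoded inputs):

```python
# Keep only the columns the model was trained on.
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

# Predictions are 'Yes'/'No' strings, one per row.
train_preds = model.predict(X_train)

# Accuracy by hand: the fraction of predictions that match the targets.
print((train_preds == train_targets).mean())
```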
So from sklearn.metrics we import accuracy_score, and we call accuracy_score. It's going to do a one-by-one match between train_targets and train_preds, count the matches, and divide by the total number of values. And that gives us an accuracy of 85.1% on the training set. Okay. Now, apart from giving you predictions, the model can also give you probabilities, and this is specific to logistic regression. If instead of calling model.predict you call model.predict_proba (that's predict underscore proba) and give it an input, it is going to give you probabilities. You can see that now we get two values for each input: this is the probability of the class 'No', and this is the probability of the class 'Yes'. You can check the order of the classes using model.classes_. And these probabilities add up to one for each row. So in this case it's 94% confident about a no, in this case it's 87% confident about a no, and so on. So your model is giving you a lot of information. Not only is it giving you a prediction, it is also giving you a probability of how confident it is about that prediction. So maybe you can also say that there is a 94% chance that it will not rain tomorrow, or an 87% chance that it will not rain tomorrow, and so on. Okay, so that's predicting using the model, predicting probabilities using the model, and finally getting an accuracy score using the model. Now the accuracy score is a good measure to get an overall idea of how well the model did, but you can also break this down into what's called a confusion matrix. You can create a table of actual values versus predicted values. If the actual value was no and the model predicted no, we call that a true negative, which means it was able to correctly identify that it will not rain. Similarly, if the actual value is yes and your model predicted yes, we call that a true positive: it is going to rain tomorrow and your model did predict that it is going to rain tomorrow. These are both cases in which your model has succeeded, the success scenarios. Then there are two types of error scenarios, called the type one error and type two error, or false positive and false negative. First we have the false positive: the actual value is no, it is not going to rain tomorrow, but your model has predicted yes, it is going to rain tomorrow. And similarly we have the false negative, where the actual value is yes, it is going to rain tomorrow, but your model has predicted no. Now, depending on the business case, or what you want to use the model for, you may want to look very closely at false positives or false negatives. For example, if you want to use this model to predict whether a baseball game should be hosted tomorrow in the open, then false negatives are a big problem.
You want to avoid false negatives, where it is going to rain tomorrow but your model predicts that it is not going to rain. If you have a high number of false negatives, that means you are in for a surprise where you get 50,000 people playing and watching baseball and it starts to rain. That's why you may want to optimize to reduce false negatives, and you're not just interested in the accuracy. On the other hand, let's say it's a different kind of model, one predicting whether somebody may have breast cancer. At that point you may want to take a close look at false positives before, let's say, recommending chemotherapy to somebody. You want to make sure that it is not a false positive, and for that, while training the model, you want to make sure that the number of false positives is low. So in general, depending on the kind of problem, you may want to optimize either for false positives or for false negatives, and how to do that is something that we will discuss in future lessons. But just keep in mind that this is called the confusion matrix. And again, scikit-learn gives you a single line of code to create a confusion matrix: you just give it some targets and some predictions, and it is going to give you the confusion matrix, and it can also convert the counts into percentages. So here, what it tells you is that among the actual noes, which means all the days where rainfall was actually supposed to be no, it predicted no in 94% of the cases and it predicted yes in 5% of the cases. And here it tells you that among the cases where it was actually going to rain, it predicted that it won't rain in 47% of the cases. So that means your model is actually pretty bad: it has a lot of false negatives, and you should probably not use it for deciding whether you should hold a baseball game in the open tomorrow. And only 52% of the time when it was actually going to rain did it predict rain correctly. So just by looking at this, 85% accuracy doesn't seem all that good, because if it is going to rain tomorrow, and really that's what we're all concerned about, whether you should carry an umbrella or not, your model's performance seems to be more or less 50-50. It's about 0.52, a bit better than a random guess, but it's not that great, right? So if the model says that it's not going to rain tomorrow, I'm not sure I can trust it, because if it is going to rain tomorrow, there is still a 47% chance that the model says it's not going to rain, right? So look at these numbers and spend some time looking at this matrix. What you can do is convert this confusion matrix into a nice heat map, and then you can start to interpret the heat map. Dark values, especially in these negative regions, mean that's good, and bright values here indicate that this is probably a bad area for the model. Right? So yeah, that's a confusion matrix. And what we've done is we've simply created this predict_and_plot function, which takes some inputs and some targets, generates the predictions for those inputs using the model, computes the accuracy, and prints the accuracy.
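Roughly what the confusion-matrix call and the predict_and_plot helper being described here might look like; the variable names (model, inputs, targets) are assumptions, and normalize='true' gives the per-class percentages discussed above.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

def predict_and_plot(inputs, targets, name=''):
    # Generate predictions and report the overall accuracy
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print('Accuracy: {:.2f}%'.format(accuracy * 100))

    # Normalizing by the true class shows, for each actual outcome,
    # what fraction was predicted as No vs. Yes
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} confusion matrix'.format(name))
    return preds
```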
It will also create a confusion matrix and plot it. Okay, so this is how you analyze the results of a model. In this case it has about 85% accuracy on the training set. Let's validate this by checking it on the validation set. So now we're calling this predict_and_plot function that we created with the validation data and the validation targets, and on the validation set as well we get an accuracy of 85.4%. That's good: even on data that the model has not seen before, it is about 85% accurate. Let's check the test set. The test set, remember, is data from the last couple of years, and on the test set you can see that the accuracy is slightly lower, about 84.2%. All right, so now you can start to see how the training, validation, and test sets are useful. If the test set accuracy was 30%, that would mean the model is very badly overfitted to the training data. But since it is quite close, about 84% versus 85% on training, you can confidently go back and report to your boss or whoever gave you the project that, hey, my model is about 84% accurate, although it does have a lot of false negatives, so please don't use it to decide if you should do something in the open tomorrow. Okay. Now one thing you might wonder is how good 84% accuracy really is. And while this depends on the nature of the problem and on the business requirements, a good way to verify whether a model has actually learned something useful is to compare its result to a random or a dumb model. So, for example, let's create two models. We'll create a model called random_guess, which takes some inputs like the training inputs or validation inputs, completely ignores the inputs, and randomly picks yes or no. That's my random guess model. And then I'm going to create another model called all_no, a model that always says that it's not going to rain. Okay. Now if I use random_guess on X_val, which is the data from the validation set, you see that half of them are nos and half of them are yeses. On the other hand, if I use all_no on X_val, you see that it always returns no. These are also models. They may not be very intelligent models, they may not have looked at the data, but they are models. And one thing that you should always try to do is take the brilliant machine learning model that you've created after all this pre-processing, et cetera, and compare it to a baseline model like this, something like a random guess or all-no. So let's do that. Let's compute the accuracy for the test set using the random guess model. We give it X_test, that gives us the predictions from the random guess model, and we compare that with the targets. So a random guess gives us about 50% accuracy. Does that make sense? You have two outcomes, and if you just guess randomly, then for any given input you have roughly a 50% chance of getting it right. But if you were to always say no, then what would you get? Then your model would have a 77% accuracy. And this is because, remember, there's an imbalance in the dataset: on 77% of days there was no rain and only on 23% of days there was rain.
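A minimal sketch of the two "dumb" baselines and how they compare on the test set; the function names and the X_test/test_targets variables are just the ones assumed from the discussion.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def random_guess(inputs):
    # Ignore the inputs entirely and pick Yes or No at random for each row
    return np.random.choice(['No', 'Yes'], len(inputs))

def all_no(inputs):
    # Always predict that it will not rain
    return np.full(len(inputs), 'No')

# Random guessing lands near 50%, while always predicting "No"
# matches the ~77% class imbalance described above
print(accuracy_score(test_targets, random_guess(X_test)))
print(accuracy_score(test_targets, all_no(X_test)))
```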
So just by predicting that there will be no rain, your model achieves an accuracy of 77%. Now, thankfully, the model that we have created has an accuracy of 85 or 84%, which is definitely better than 77%. But this is not always the case, and you will often find that once you train a model and do all of this analysis and create all these features and normalization, et cetera, you end up with a model that's worse than a dumb model like this. There are definitely ways in which we could have created a logistic regression model that was maybe 65% accurate, which would mean the model is worse than simply predicting no always. So always compare your models against these dumb or very simplistic baselines and make sure that they make sense. And now you may say, okay, if 77% is so easy to get, then I don't know if 84% is worth putting in all this effort for. Another thing to do is to establish a human baseline. You could pick a thousand days of data from the test set, ask a meteorologist to predict for them, check their accuracy, and then check your model's accuracy on those thousand data points. This is called a human baseline, and if your model is able to do better than that particular meteorologist, then you have, in a sense, beaten a human. That is how we say that an AI is now better than a human. Of course, you don't want to just beat one person. Ideally, you want to beat a panel of experts, all of whom have independently made predictions, and you want to beat maybe the average of their predictions. And if you're comprehensively doing that, then at that point we could probably just tell all the meteorologists to go home, because they're not able to do as well as our model, right? And that brings me to this last piece, which is to make predictions on single inputs. Once the model has been trained to a satisfactory accuracy, once we have analyzed it, compared it with meteorologists, and decided that, okay, we can just use this model, we now have to use it to make predictions on new data. So consider this dictionary containing today's data from the Katherine weather station: the date 2021-06-19, the location, the minimum temperature, maximum temperature, rainfall, evaporation, et cetera, and whether it rained today. We have all the information, but obviously, since tomorrow hasn't happened, we don't know if it's going to rain tomorrow, and we have to send out a weather bulletin in a couple of hours. So now it is up to our model to tell us if it is going to rain tomorrow. And as you run through this notebook, you can try changing some of these parameters. I'm going to create this dictionary, and let's see how we can use our model to make predictions. Now, obviously, we cannot directly feed this dictionary into our model. Our model expects numbers and encoded categorical data, et cetera. So the best approach when you want to make a prediction on a new input is to try to mimic the original input format as closely as possible. The original input format was a pandas data frame, and this would ideally be a row within a data frame. So here's what we can do: we can take the new input, which is a dictionary, and put it into a list.
So now we have a list of dictionaries, and remember, the pandas DataFrame constructor can take a list of dictionaries and interprets each dictionary as a row of data. So we are just going to create a pandas data frame using this list of dictionaries, which contains just one dictionary, so the data frame will contain just one row of data. We've taken this dictionary and converted it into this pandas data frame. Okay. The objective is to get the data into the same kind of structure that was used to train the model, which was a pandas data frame. Here we have just one data point, so we have just one row. But let's say you had this information for all the locations; if you had collected 49 pieces of information, you could create a data frame with 49 rows. In any case, we now have this data frame. Can we pass it into the model? Not yet, because we need to apply the same pre-processing steps that we applied while training the model. First is imputation of the missing values, and this is where the imputer that we created earlier is going to be useful. So I'm just going to call imputer.transform, and remember, it's going to use the mean values that it had calculated on the entire dataset. We can't just create a new imputer; we need to use that original imputer and keep it around, and I'll tell you how to keep it around when we talk about saving models. So we say imputer.transform and we give it the new input data frame, just the numeric columns. Remember, we also need to keep a list of the numeric columns; we grab the numeric columns out of the new input data frame and transform them, and that will fill in any missing data. For example, there's one NaN here for sunshine, and that will get filled in. Then we have scaling. The numeric features also have to be scaled: remember, these values 23, 22, all of these have to be brought into the zero to one range. So we now call scaler.transform, and that's why the scaler is useful. Remember, the scaler has already been fitted on the entire dataset, so it knows the ranges; we can't create a new scaler. We use scaler.transform and pass in the numeric columns, and that gives us the new input data frame with the numeric values scaled. And finally, we also want to encode the categorical columns. So we take the categorical columns, and remember we have an encoder: it already knows what all the different categories are and what new columns need to be created, and it's going to create these encoded columns. Let's check the new input data frame now. Yeah, you can see it has been converted into the format that is ready to be passed into the model. It's still one row of data. Now, remembering that we only want to pass numbers into the model, let's grab the numeric columns and the encoded columns from the new input data frame and check X_new_input. You can see that now it contains just the numbers that can be passed in directly, and all the columns are in exactly the same order as they were during training. That's also very important, and that's why these pre-processing steps should be applied in exactly the same way to every new input. And now we can make a prediction.
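Here's a compact sketch of the pre-processing steps just described for a single new input. It assumes the fitted imputer, scaler and encoder from earlier, the numeric_cols, categorical_cols and encoded_cols lists, and an encoder created with sparse output disabled; all of these names are assumptions based on the discussion.

```python
import pandas as pd

new_input_df = pd.DataFrame([new_input])  # one dictionary becomes a one-row data frame

# Reuse the transformers fitted earlier - never fit new ones on a single input
new_input_df[numeric_cols] = imputer.transform(new_input_df[numeric_cols])
new_input_df[numeric_cols] = scaler.transform(new_input_df[numeric_cols])
new_input_df[encoded_cols] = encoder.transform(new_input_df[categorical_cols])

# Keep only numbers, in the same column order that was used for training
X_new_input = new_input_df[numeric_cols + encoded_cols]
```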
So we call model.predict. If we just call model.predict, it's going to give us a list of predictions, one for each row of data. You can see that it's an array containing 'Yes', but obviously we just want the first prediction because we just have one row, and that's why we're selecting element zero here. So we check the prediction, and it turns out the prediction is yes. Based on this input for Katherine, the minimum temperature, maximum temperature, et cetera, the model has predicted that it will rain tomorrow. And what is the probability? We can check that as well: we can get the probability that the model thinks it will rain tomorrow. The probability for no is about 49%, and the probability for yes is 51%. So it seems like the model is just on the border. It's not a hundred percent sure whether it will rain tomorrow, it's only about 51% sure. So I don't know if you should put it in the weather bulletin that it will rain tomorrow; you may want to say slight chances of rain and leave that possibility open. And that's another interesting thing to look at. One thing you can play around with is changing some of these values. Now that you know that wind speed is very important, maybe try changing the wind speed. Let me change this 3PM wind speed to zero, repeat these steps, and see if the prediction changes. So the prediction still says yes; let's see if the probabilities change. Okay, the model actually has become more confident. So it's not the 3PM wind speed, I think; maybe it is the wind gust speed. I think this is the one, actually. So let me change this to 10 and keep the 3PM wind speed at 20, and let's see if that creates a change. Okay, so now the prediction is no. You can see that the wind gust speed has gone down by a lot, from 52 to 10, and now the model is 89% confident that it will not rain tomorrow. So this is how you should interpret your model and make sense of it. If these things seem sensible, you should then sit with the meteorologist or the domain expert and show them: hey, if I put in this number, this is the output that it gives and this is the probability that it gives; do you think it makes sense? And they can tell you from weather research whether these things actually make sense or not. And if some of these things don't make sense, then maybe you need to go back and check those correlations. Sometimes you may have an accidental correlation, and sometimes you may have what is called a data leak, where some information from the target has been put into one of the input columns somehow. All of these things happen, and that's why models don't work well in real life, so you need to look into the model and see why it is giving you a certain result. And again, as always, it helps to create a helper function to make these predictions. So now we have this helper function, predict_input, and we can just make some modifications to the new input and try things out. For example, for this input I've just changed the location to Launceston, and it seems like now it is more confident. Maybe if I pick Albury, let me see if the prediction is going to change. Yeah, now it's even more confident, about 79%. There was a certain place, Uluru, which had very little rain, so let me put Uluru here. Okay.
It's still confident. So it seems like this input indicates a very strong chance of rain for most places. Right? So that's how you analyze your model. You can try changing some of the other parameters. Let me try the rainfall for today, let me make it two. Okay, it still says yes. So do play around with this. Maybe pick some random row from the data and try changing a few things, or maybe even look this information up live today, put it in, and see how useful this model is three years after being created. Honestly, climate does not really change that much in just three years; it's a multi-decade or multi-century phenomenon. So if you've trained a model using data from around 2007 to 2017, it should hold up pretty well even today in 2021. You could go on Yahoo Weather or AccuWeather, get all this information, put it into the model, and then check if it actually rains tomorrow. And that is what machine learning is. It is about taking data and using data to make predictions. But how do you make predictions from data? You assume a certain relationship between the inputs, which is the weather information for today, and the output, which is whether it will rain tomorrow, and that relationship is called the model. The model has some internal parameters, called the weights of the model, which are initially randomly initialized. Then you train the model, which means you make predictions and compare them with the actual data, and that's where you need labeled historical data, labels for whether it rained the next day over the last 10 years. Using that comparison, you compute the loss, reduce the loss using an optimization technique, and make the model's parameters better. And as the model gets better, you evaluate the model, you interpret the model, and then you use the model in the real world. This is used for weather prediction. This is used for predicting whether somebody will default on a loan. This can be used for predicting if a certain tumor is cancerous. A very similar technique is used for recognizing pictures of cats and dogs: if you check out our other course, Deep Learning with PyTorch: Zero to GANs, there we apply logistic regression on image data. So a very similar technique is applied to separate pictures of cats and dogs, or even to identify handwritten digits. It's a very versatile technique, logistic regression by itself, and machine learning in general can be applied to a whole bunch of different kinds of problems. One last thing I want to cover before we close for the day is saving and loading trained models. Now, obviously, we've spent a lot of time training these models and we've done a bunch of pre-processing. Just as we have saved the pre-processed data, we should also save the trained models. And what is the model, really? Well, the model is two things. The model is what kind of model it is, which is a logistic regression model, and how it was initialized, all the parameters that were given during initialization, like the kind of optimizer, the number of iterations, et cetera. And then it is the learned weights of the model, which is what has come out of model.fit.
So we should save all of this so that we can reuse the model without having to train it, without having to fit it on this pre-processed data from scratch. In fact, once the model is trained and we know the coefficients, we don't need to look at the data anymore. We can just hand over the coefficients to a software developer, and they can set up a simple webpage where somebody inputs a bunch of weather parameters. That input is sent to a web server, where we simply multiply those parameters with the coefficients, apply the sigmoid function, and return a result. And that is going to work as a standalone web application without ever looking at all the data. So in some sense, the model's parameters or weights are a condensed representation of all the intelligence that is present within the data. It is capturing all the patterns within that large dataset of 130,000 rows into just these 250-odd numbers, 250 weights and biases. And that is really what makes machine learning so powerful: you are summarizing millions of data points into just hundreds of numbers, or sometimes maybe a thousand numbers, but still a very dense representation, which is sort of the intelligence of the data and can be used to make predictions. So how do we save our model? We can save our model using this library called joblib, but we don't want to save just the model. We want to save our imputers, we want to save our scalers, we want to save our encoders, we want to save even the column names that are important. Anything that will be required while generating predictions using the model should be saved; anything that is used inside this predict function, for example, should be saved. So here's what you can do. Joblib is a very generic library which can save any Python data. So I would typically create a dictionary like this. I'm calling this dictionary aussie_rain, and I'm giving it the model, the imputer, the scaler, the encoder, the input columns, et cetera. So I have the key 'model' with the value being the actual machine learning model that I've created, I have the key 'imputer' with the value being the actual imputer that I've created, and so on. Then I'm going to save this dictionary using joblib.dump. We say joblib.dump, give it the Python object and a file name, and you can save it. You can just save your model on its own if you want, or save these as separate files, but I like to keep them all together because they're going to be used together anyway. Now shut down this notebook, maybe create a new notebook, and upload this file aussie_rain.joblib. By the way, from Colab you will definitely want to download this file if you want to keep a copy around. Then in the new notebook, just do joblib.load, and you can see that it is now loaded into this variable aussie_rain2, which has all of these things. Now we can check aussie_rain2['model']: this has been loaded from disk, and you can see that we have got back the logistic regression model. In fact, you can even check its coefficients and make sure that these are the coefficients that it had learned; I think the attribute is called coef_. Yeah, so these seem to be the same coefficients that we looked at, 8.9, 3.16, et cetera. Great.
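A sketch of the save-and-load flow just described; the dictionary keys and the file name mirror the discussion, and the column-list variables are the ones assumed from earlier in the notebook.

```python
import joblib

aussie_rain = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

joblib.dump(aussie_rain, 'aussie_rain.joblib')   # everything needed for prediction, in one file

# Later, or in a brand new notebook, load it back and unpack what you need
aussie_rain2 = joblib.load('aussie_rain.joblib')
print(aussie_rain2['model'].coef_)               # the same learned coefficients as before
```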
So now we've been able to retrieve our model, and we don't need to retrain it each time. We can probably hand over this joblib file to a software developer, and maybe, if we have time, we will also look at how to create a simple web application using a machine learning model. But for now, let's just make some predictions using this model. So test_preds2 is going to be the result of using the aussie_rain2 model to make predictions on X_test, and then we compute the accuracy score against test_targets, and that gives the 84.2% result, the exact same result as we had before. And it should, of course, because we've saved whatever was in the model to disk and then loaded it back. So that's how you save and load models. Now, if you want to attach your model to your Jovian notebook, then while running jovian.commit you can use this outputs argument and give it a list, and in this list you have 'aussie_rain.joblib'. So you just give it the names of the output files you want to include, and when you run it and create a notebook on Jovian, that notebook will contain this joblib file in the Files tab, as you can see here: aussie_rain.joblib. So whenever you want to download it, you can always download it very easily. Okay. We've covered a lot of ground in this tutorial, a lot of different concepts, and you are not expected to become an expert in all of these, or even any of these, immediately. Being able to apply the things we've learned effectively, or even just getting them to work, is something that will take weeks, months, or even a year or so to really grasp. The objective of the practical machine learning course that we're doing right now is to expose you to what machine learning is, what real-world machine learning problems look like, and what the process of training and using a machine learning model looks like. Once you get this high-level picture, then over time you can slowly chip away at small pieces, and you will do this via the assignments and the project. Over time you can chip away at all of these small pieces and perfect each of them. This is not a skill that will come immediately, unlike, let's say, data analysis with pandas or visualization with Plotly. I would say even those take time, but at least there are a fixed number of things you can do there. With machine learning, you have to apply a lot of thought, a lot of creativity, a lot of insight, and there's a lot of experience that comes in. So don't be discouraged if you're not able to train good models initially. In fact, most machine learning models that you train initially will be pretty bad, but see that as an opportunity for improvement. Now, one good way to improve is to go to Kaggle datasets and click on the Code tab. So here, for example, this is the Rain in Australia dataset; click on the Code tab and then filter by most votes, and then you can check out some of these notebooks. People are very nice; they have created these public notebooks for you to check out.
You can see that there is extensive analysis, extensive feature engineering, splitting, scaling, training, predictions, accuracy scores, confusion matrices, classification metrics, and a bunch of different things. Honestly, machine learning is a lifelong pursuit, but I would also say that it is worth it, because the kind of things that you can do with machine learning models is just amazing, it's out of this world, especially once you start getting into deep learning as well. So don't feel overwhelmed; pick some datasets and try to create a model that works. And if you're unable to create a model that works, maybe read through what people have done with that dataset, pick up some ideas from here and there, maybe copy-paste some code if you need to, maybe even just run through this person's notebook and look at each line of code. That is how you learn. As for the theory behind machine learning, you can probably watch a couple of videos and get a decent idea, but that is not going to help you beyond a point. What is going to help you is to try these things out on your own, on real datasets, essentially jumping into the pool without knowing how to swim and learning along the way. That's what you want to do. There are a lot of great resources, but you have to dig for them: you have to find great datasets and then read through these notebooks. As I suggested, one quick hack is to go to a dataset, click the Code tab, and sort by most votes; you can even add some filters if you want, like language: Python. Just browse through the top three or four notebooks, and you will learn a lot about different machine learning techniques. One other thing I want to add here is that each step of everything we've done can be covered in no more than three or four lines of code. The concepts may seem intimidating at first, but go through this notebook again, review this video, and once you get the concepts, you'll realize that the work to be done is actually quite small. There may be some experimentation required, but the amount of coding is quite small. Here's what that looks like. In the data pre-processing step, downloading the dataset is three lines of code: you download it using opendatasets, you read it into raw_df, and then you drop NAs, so I've just dropped the rows where rain today or rain tomorrow is set to NaN. Then creating training, validation, and test datasets is two or three lines of code. We've used the date column here, with year less than 2015, equal to 2015, and greater than 2015, but we could also have used train_test_split, and that would also be just one or two lines. Then we created some inputs and targets by defining the input columns and the target column, and then creating training inputs, training targets, validation inputs, validation targets, test inputs, and test targets. Then we identified the numeric and categorical columns, because we need that information for the next steps. Then we imputed the missing numeric values by creating an imputer, fitting it to the data, and then transforming the training, validation, and test inputs. Imputing simply means filling in missing values. Then we scaled the numeric features.
Creating a scaler is again just one line of code. You create the scaler and then you fit it to the data, so it's going to figure out the ranges; there are other scalers too, like something called a standard scaler, which you can check out. And then you apply it to the training, validation, and test sets. Scikit-learn has a very nice API where for everything that you do in pre-processing, you first instantiate or create an object with some options, then you fit it to the data, and then you transform the data using it. One-hot encoding works the same way: you create a one-hot encoder and then you transform the data. And of course, when you transform the data, you're going to get a new column for each category, so you also need to create a list of the new columns, and that's what the encoded columns list is for. Then you apply that to the training, validation, and test sets. Then you save the pre-processed data to disk. This is an optional step, but it is useful when you're working with large datasets so that you don't have to keep rerunning the notebook over and over. Some people also like to have a pre-processing notebook, a model training notebook, and a prediction notebook, and they would often put all three notebooks in the same folder and share these files across them. So each time you improve something in pre-processing, you save it to the parquet file and read it in the training notebook, and then you save the model weights to disk and read the model weights in the prediction notebook. Okay, and this is how you load the data. Now, what you should try to do is explain each line of code above, and really there are only about 15 lines for you to explain, because most of this is just repeated code. So if you can explain these 10 to 15 lines, you're just 10 lines away from understanding how to build machine learning models. Think of it that way and try to work through understanding these 10 or 15 lines, and you will understand the pre-processing. Then you have model training and evaluation. So here we import logistic regression. Now, the interesting thing with scikit-learn is that a lot of different models have the exact same API. So instead of logistic regression, you could use a random forest classifier, an SVM classifier, or a naive Bayes classifier, and they would work exactly the same way; you would have to change almost nothing except maybe this one line. So here we are creating the model, instantiating it using the LogisticRegression constructor. Then we fit the model, which is always the next step. Fitting is where we go through the process of generating predictions, computing the loss, improving the weights and biases, then generating predictions again, and repeating that process. Once we have fit the model, it's always a good idea to check the accuracy, or the score, or whatever loss or metric we are interested in, however you want to evaluate the model, on the training data, the validation data, and the test data. And whenever you're happy with the results, you should save the model. Ideally, when you're saving it, you may also want to add a small note, like the date you're saving it on, or some kind of information in the file name, so that you can keep multiple versions. The difficulty is that you then run into issues like final_1, final_v1, v2, et cetera.
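One way around the file-name versioning problem, as recommended next, is to attach the saved model file to the notebook snapshot itself; a minimal sketch, assuming the aussie_rain.joblib file created earlier:

```python
import jovian

# The attached .joblib file appears in the Files tab of the saved notebook version
jovian.commit(outputs=['aussie_rain.joblib'])
```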
So what we recommend is that when you run jovian.commit, you just attach the saved model to your notebook, so that with every version of the notebook you have a corresponding version of your saved model that you can refer to. Okay. Again, try to explain each line of code in your own words, and if you can understand this, you understand how to train a model. You can even try this out very quickly: just replace this line with a RandomForestClassifier and see if you get a better result. Logistic regression is the simplest possible model for classification. We are going to talk about decision trees and random forests next week, which are again a very interesting approach to the same kind of problem, and you will see how they're able to get a better result. Then finally, your model is going to be used for prediction, so you need to have a way to very easily test your model. Always try to create this predict_input function, which can take an input in the raw, human-friendly format and return the prediction and the probabilities in the case of classification, or just the prediction in the case of regression. This allows you to experiment with the model and interact with it one-on-one, because if you're saying this is an AI, a piece of artificial intelligence, then you should be able to interact with it the same way you interact with a human: by giving it some input, getting some output out of it, interpreting it, and comparing it with what a human would give you. And the closer it gets to the state of the art, or the closer it gets to the truth, the more intelligent this model has become. Okay. So playing and experimenting with individual inputs will give you a lot of insight into why the model is working or not working. Now, you can see that fitting the model is a very straightforward step; there's not much you can change here, maybe some parameters. And even calculating the predictions and accuracies is a fairly straightforward step: you just run a couple of functions. But what will really give you insight into how to improve your model is looking at the misclassified examples, which means the examples it was not able to predict correctly, or the examples it was unsure about, and finally just playing with individual inputs. The kind of insight this gives you is far more than any single accuracy score or any single number can give you. So definitely play with your model. It's also very interesting, once you've done all this work and trained this model, to see how it behaves, how it performs, what it thinks, essentially; the model's brain is those weights. That is a very important piece of understanding what the model is doing. I can go on for a long time about machine learning. This is my favorite topic, and I'm probably going to spend a long time just working on machine learning. That's why I feel very excited about it, and I hope I was able to convey that same excitement, or the same power of machine learning, to you, if not the exact details. Be patient with it, don't expect all of it to come immediately, but when it does, you will be able to do some amazing things, and people will not believe the kind of things that you will be able to do. So stick with it.
That is what I'll say about machine learning. Okay. So here's a quick summary of what we've done: downloading the dataset, exploring the dataset, splitting it into training, validation, and test sets, imputing missing values, scaling numeric features, encoding categorical columns, training a logistic regression model, evaluating the model using a validation set and a test set, and saving the model to disk and loading it back. There are even more steps involved in machine learning, which we will talk about in future lessons, and we will look at other types of machine learning models as well. And here are some more resources. If you want to look at the theoretical aspects of linear regression and logistic regression, you can watch these lessons; I believe they're part of the Machine Learning course by Andrew Ng of Stanford, on Coursera, which you probably know already. That's a great course which gives you a good theoretical foundation for machine learning, and probably a good complement to this course. Then here are some tutorials that I found useful while working on this notebook, especially this one, the extensive EDA and modeling notebook; I have borrowed a lot of things from it, so do check it out. Every time I read a Kaggle notebook, I learn something, even after doing machine learning for several years. And if you want to try your hand at training your own logistic regression model, just spin up a new notebook and repeat each of these steps with one of these datasets: breast cancer detection, which should be relatively easy; loan repayment prediction, which is quite hard; and handwritten digit recognition, which is intermediate, though you will have to figure out how to work with image data. Fortunately, all of these are hosted on Kaggle, so you can go to the Code tab and sort by most votes. You can even search there; I'm just looking for logistic regression. Yeah, so you can see that there is this applied machine learning notebook, and it contains a logistic regression example somewhere; supervised machine learning, there's probably a logistic regression example somewhere in there too; logistic regression, there you go. So this is logistic regression on another dataset. The topic for today is decision trees and random forests. This is something different from what we've covered so far, which is linear models like linear and logistic regression. Decision trees, and random forests especially, are a very powerful and widely used machine learning technique. It's most likely that in your professional work you will be building decision trees and random forests most of the time, and one of the primary reasons for that is the interpretability of these models. So we will also talk about why these models learn the things that they do and why they give the results or predictions that they do. So here's what we're going to cover today. We will download a real-world dataset, just as we've been doing in the previous lessons. We will prepare the dataset for training a machine learning model. Then we will train and interpret some decision trees, and then we will move on to training and interpreting random forests. We will talk about overfitting, hyperparameter tuning, and regularization. These are some of the central problems in machine learning, and this is where you will spend a lot of your time when you are improving your models.
And finally, we will talk about how to make predictions on single inputs as well. Now, I'm running this notebook on Colab, but you can also run it locally; things may just be a little slower depending on your configuration. If you have a good CPU and plenty of RAM, you should be able to run this locally as well. Just as we've been doing in the previous lessons, we will talk about how to use decision trees and random forests to solve a real-world problem from Kaggle, and we're going to use the same dataset that we used the last time. This will also give you a chance to contrast decision trees with linear regression and logistic regression models. So we will use the Rain in Australia dataset, which contains about 10 years of daily weather observations from numerous Australian weather stations. And here is a small sample from the dataset. On several dates, you have information captured from several locations, including minimum temperature, maximum temperature, rainfall, evaporation, et cetera. The last two columns are the most interesting: one is whether it rained on that day, and the second is whether it rained on the next day. Of course, we have rain tomorrow only because we are looking at historical data. As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully automated system that can use today's weather data for a given location to predict whether it will rain at that location tomorrow. So you want to create an automated system which can essentially predict the likelihood of rainfall all over Australia. Let's see how far we can get. Before we begin, we'll just install and import some of the required libraries that we've been using throughout: opendatasets for downloading the dataset, pandas for loading datasets and working with data frames, NumPy for mathematical computing, scikit-learn, which contains all the machine learning models that we'll train, and Jovian for saving snapshots of your notebook. So let's import all the libraries as well: opendatasets as od, matplotlib.pyplot as plt, seaborn as sns, pandas as pd, NumPy as np, matplotlib, and jovian. And we will also use the os module a bit. These are standard conventions that you should follow in all your Jupyter notebooks; if you don't follow them, you will find that people get confused. Of course, you can call pandas anything you want, but prefer calling it pd, because that's how you will see it all over the internet. Finally, we're also setting some options for display here so that the graphs are a little bigger and we can see more information within our pandas data frames. All right. So the first step is to download the dataset, as we did the last time. We will download this dataset using the opendatasets library directly from Kaggle within Jupyter. So we just run od.download, and when we run od.download, we will be prompted for a Kaggle username and an API key. Here's what that looks like. Now, this is one way to provide the information, your Kaggle username and your Kaggle API key, but one other thing you can do is just click on this file explorer, find the upload button, and upload your kaggle.json file.
And if you place your kaggle.json file next to your notebook, then opendatasets will automatically find the credentials and download the dataset. As you can see here, this was a three MB dataset that was downloaded automatically. And of course, if you don't have your kaggle.json file, go to kaggle.com, which is where we're downloading the dataset from, click on Account, scroll down to 'Create New API Token', and that will download the kaggle.json file for you. Okay. So you can either provide your Kaggle username and key directly, or you can upload the kaggle.json file to Google Colab, or just place it next to the notebook if you're running it locally. So the dataset is now downloaded and extracted into this folder, weather-dataset-rattle-package, and we can check that using os.listdir. Yeah. Now I'm just going to click Edit, Clear all outputs, so that we can run all the code fresh and we do not have any stale outputs in our notebook. All right. So the file weatherAUS.csv contains the data, as you can see here. Let's load it into a pandas data frame. I'm going to run pd.read_csv, and that loads the data frame up, and here's the data frame. So here's the data; we looked at it the last time as well. We have date, location, minimum temperature, a bunch of weather parameters, and finally we have rain today and rain tomorrow. Our objective is to take all of this information, maybe not the date, because everything is on a different date, but everything except the date, and use that to predict whether it will rain on the next day. And hopefully we can then use it on some future data as well. So let's check the column types of the dataset. If we just do raw_df.info, it tells us that there are a total of 145,000 entries, and you can see the types of each column. So you have object, which is mostly strings or categorical data, and then you have float64 and int64; floats and ints are numeric data, while the object columns are mostly categorical data, though sometimes they can be plain string data as well. And you will notice that some of these columns have null values too, so we need to deal with them as well. Now, one of the things I'm going to do is remove any rows where the value of the target column is null, because we want to train a model that can predict whether or not it will rain tomorrow; giving the model data where we don't know whether it rained tomorrow will not be useful for training, right? So we will remove any rows where the target column is empty: I'm just going to call dropna with the subset rain tomorrow. And here's an exercise for you: try to perform some exploratory data analysis on this dataset, if you haven't already, and study the relationships of the other columns with the rain tomorrow column. See if you can figure out, before we build this model, which columns are the most important in determining whether it will rain tomorrow. And I'm just going to save my notebook as well, so I'm running jovian.commit here, and I am asked to enter my API key, which I can find on my Jovian profile by going to jovian.ai; just click on this, that will copy the API key, and I can paste it here. This will save a snapshot of the notebook to my profile so that I can come back and continue where I left off in my next session.
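Here's a compact sketch of the download-and-load steps described above. The Kaggle URL and folder name are the ones for the Rain in Australia dataset, and the dropna call keeps only rows where the target is present.

```python
import opendatasets as od
import pandas as pd

od.download('https://www.kaggle.com/jsphyg/weather-dataset-rattle-package')

# The extracted folder contains the CSV file
raw_df = pd.read_csv('weather-dataset-rattle-package/weatherAUS.csv')
raw_df.info()

# Drop rows where the target itself is missing - they can't be used for training
raw_df = raw_df.dropna(subset=['RainTomorrow'])
```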
The Colab notebook, of course, will shut down after some time. All right. We've done most of this before, so let's go over it quickly. We will perform some steps to prepare the dataset for training. The first step is to create a training, validation, and test split. Remember, it's common practice to set aside only about 60% of the data for training the model, and then we use 10 to 20% of the data for validation, which is to evaluate different versions of the model as we try out different parameters, and finally, to report the final accuracy, we use the test set. Now, it's common practice to do a random split, but in this case, because the data is ordered by date, and because the model we are going to create using this data is going to be used in the future, we can simulate that, which is using a model trained on the past to predict values in the future, by picking the last couple of years for the test set. We can pick the year before that for the validation set, and then all of the earlier data can be used for the training set. So this is the distribution of the number of rows per year, and we've plotted that using a simple count plot with seaborn. So here's what we'll do. We will create a train data frame, which is a fraction of the rows of the raw data frame we just loaded, where the year is less than 2015. And here is how we've computed the year: we have taken the date column, raw_df.Date, parsed each value as a datetime, and extracted the year from it. This is basic pandas date handling that you should check out if you're not familiar with it already. So data before the year 2015, which is up to 2014, is used for training, the validation data is the year 2015, and the test data frame is the years 2016 and 2017. Again, we are doing this only because we have chronologically ordered data, and this is how the model will be used in the real world. If you do not have chronologically ordered data, then you use a random split, and there is a method in scikit-learn called train_test_split which you can use to do that. So now we have about 98,000 samples for training and about 17,000 samples for validation. As we try different kinds of models, and we'll try quite a few today, we can use the validation data frame to evaluate how well those models are performing. And finally, we have a test data frame; this is where we will report the final accuracy of our model. Now, here's an exercise for you if you want to build on top of this: you can try to scrape the climate data for the recent years from this website, the official website of the Bureau of Meteorology in Australia. So you can try to scrape the data from 2017 to 2021 and try training a model with the enlarged dataset. In fact, this is how this dataset was created in the first place, by scraping data, and web scraping is a great way to create new datasets for machine learning. So we have created the training and validation split, and the next step is to identify the input and target columns, because not all the columns will be useful for machine learning, and it's also very important to separate out the input and target data. One common mistake people make initially is to accidentally use the target column to predict the target column.
In which case your machine learning model isn't really doing anything: it is just taking the value of the target column and returning it. So always make sure to carefully check the columns of your data frame and separate out the input and output columns. So if I check the train data frame, which is just a subset of the rows, you can see that we don't want to use the first column, date, and we don't want to use the last column, rain tomorrow, as inputs. Why not date? Because we are going to use the model in the future, so the date will not be a useful input. And rain tomorrow is not useful as an input because this is the value that we want to predict. So the input to the model should be the rest of the columns, and the prediction of the model should be compared with the target column, which is rain tomorrow. Here's how we're setting that up: we take train_df.columns, convert that into a list, and exclude the first and the last value from that list. Now we can take just the input columns from the training data frame and create the training inputs. I'm creating a copy here because we are going to make some modifications in the next few steps. And we can also separate out the target column. The target is just a single column, so when we select it from train_df, that is going to return a pandas series, not a data frame; just keep that in mind. So here's what that looks like: train inputs contains location through rain today, and train targets contains just the value of rain tomorrow. Okay. It's always a good idea to check what information you have within your data frames before you move forward. Similarly, we are creating the validation inputs and validation targets, and the test inputs and test targets. Next up, let's also identify the numeric and categorical columns within the data, because we will need to deal with them separately. Well, one simple thing you could do is just look at train_df and manually make a list: min temp is numeric, max temp is numeric, rainfall is numeric, et cetera. But what you should ideally be doing is detecting these automatically. So here's how you can do that: if you take train_inputs, or any data frame, and call select_dtypes, it will select only the columns which have the matching dtypes, and for the matching dtypes you can provide np.number, which encompasses float, int, and all the numeric data types. So you get back a data frame containing just the numeric columns, and then you can access its columns attribute, which gives you a list of all the numeric columns, and finally convert that into a list using tolist. So now we get back a list of all the numeric columns. To get the list of categorical columns, all you need to do is change this to object (not 'categorical'): when you change this to object, you get back the list of categorical columns. Now, how did I find this out? Well, I simply looked it up online.
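Putting the last few steps together, here's a rough sketch of the chronological split, the input/target separation, and the automatic column-type detection, assuming raw_df has the usual Date and RainTomorrow columns:

```python
import numpy as np
import pandas as pd

# Chronological split: train on years before 2015, validate on 2015, test on 2016-2017
year = pd.to_datetime(raw_df.Date).dt.year
train_df, val_df, test_df = raw_df[year < 2015], raw_df[year == 2015], raw_df[year > 2015]

# Inputs are everything except the first (Date) and last (RainTomorrow) columns
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'

train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col].copy()

# Detect column types automatically instead of listing them by hand
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes(include='object').columns.tolist()
```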
How do you find the numeric and categorical columns in a data frame? Once I found it, I just have it written in my notebook so that I can use it anytime. So these are the numeric and categorical columns. Now, one thing you might want to do at this point is decide if you really need all the columns, because every column introduces new data, which may potentially slow down your training. In this case, we have a small enough dataset, so we do not need to worry about it. But you can do some analysis, figure out how closely the columns are correlated with the target, select just a subset of the columns instead of all of them, and observe how it affects the results. Does it lead to a large decrease in accuracy, or is the decrease insignificant? If it's insignificant, then it's probably okay to drop a few columns and just use the ones that are most important. So try it out, observe it, and try to get a feel for when it makes sense to drop some columns. For now, we are going to move ahead with all the columns, and the next important step is to impute missing numeric values. Machine learning algorithms can't work with missing values; they will throw errors at you, so we want to replace the missing values with some other values. How do you check for missing values? Well, you can take the train inputs, pick just the numeric columns, and call .isna(), which replaces each value with True or False depending on whether it is NaN. Then we chain a .sum() on top of that. Chaining pandas operations is a useful skill to learn: always think about what you want to get to and what the incremental steps are that will take you there. I might also add a .sort_values(ascending=False) on the resulting series. So it seems like sunshine has the highest number of missing values, followed by evaporation, cloud 3 p.m., cloud 9 a.m., and so on. All of these numeric columns have some missing values, and we are going to replace them using a simple strategy, which is to replace them with the average of that column. For this, we can import SimpleImputer from scikit-learn, create an imputer object, and specify the strategy we want to use, which is 'mean'. After creating the imputer object, we call .fit and give it the data from all the numeric columns in our data frame. The imputer will then figure out the average for each of those columns. Once you've fitted the imputer, which means it has computed the averages (or whichever statistic we want to use) for each column, we can actually fill the columns by calling imputer.transform. We call imputer.transform on the numeric columns of the train inputs; that fills in all the missing entries in the numeric columns and returns a new NumPy array. We take that result and put it back into the original data frame, replacing the original numeric columns. The net effect of all of this is that there is no missing data in any of the numeric columns; we filled it with the mean value.
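A minimal sketch of the imputation step, assuming the train/validation/test inputs and the numeric_cols list from the earlier steps:

```python
from sklearn.impute import SimpleImputer

# Check how many missing values each numeric column has.
print(train_inputs[numeric_cols].isna().sum().sort_values(ascending=False))

# Learn the mean of each numeric column, then fill missing values with it.
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_inputs[numeric_cols])

train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])
```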
Now, the mean is not the only imputation strategy. There are several other imputation strategies available in scikit-learn, so an exercise for you is to try a different imputation strategy and see how it affects the final result. This is what practical machine learning is all about: you try different things, maybe different strategies for different columns, by doing some exploratory analysis, and figure out the strategies that work best for the problem at hand. Next up, we are going to scale the numeric features. Scaling simply means we want to take the range of each numeric feature, which is its min and max, and bring it down into a zero-to-one range. As you can see in the training, validation, and test datasets, each numeric feature has a different range: min temp is minus 8 to plus 31, wind speed is 7 to 135, whereas something like pressure can be 988 to 1039. Because there are a lot of numerical computations happening inside a machine learning algorithm, and ultimately a single loss value is optimized, we don't want any specific feature to dominate the training process. We want to give every feature a level playing field to participate in the training of the model, and that is why we scale all of these feature values to the range zero to one. We do that using MinMaxScaler. Here we create a MinMaxScaler, then we call .fit on it and give it the data from all the numeric columns. It is going to figure out, for each column, the minimum and the maximum value. Then we call scaler.transform, give it the data from the numeric columns, and it scales them into the zero-to-one range. We take that output and put it back into our training, validation, and test data frames. The net result of all of this is that the inputs change from a variety of different ranges to the zero-to-one range. Now, the zero-to-one range is not the only scaling strategy; there are several other scaling strategies as well, so you should try a different one (StandardScaler in particular is worth checking out) and observe how that affects the results. Next, we're going to encode the categorical data. Machine learning algorithms can only work with numbers, and in our data frames we have some categorical data. If I check the train data frame, you can see there is location, which is categorical, wind gust direction, which is also categorical, and a bunch of other categorical data as well, things like rain today and rain tomorrow. In fact, that's what we've listed in the categorical columns: location, wind gust direction, the wind directions at 9 a.m. and 3 p.m., and rain today. So what we're going to do is perform one-hot encoding for the categorical columns. But for the categorical columns, we first need to fix the NaNs. So I'm going to call .fillna on the categorical columns of the train data frame and fill all the NaNs with the value unknown, and I'm going to do the same for the validation and test data frames. We did fill in missing values in the numeric columns, but we had not yet done that for the categorical columns.
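A minimal sketch of scaling the numeric columns to the zero-to-one range, assuming the imputed train/validation/test inputs and numeric_cols from above:

```python
from sklearn.preprocessing import MinMaxScaler

# Learn each column's min and max from the training inputs,
# then scale all three splits into the 0-1 range.
scaler = MinMaxScaler()
scaler.fit(train_inputs[numeric_cols])

train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])
```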
And you can see here, if I just pick the categorical columns, that some of them, most of them actually, have some NaN values. So wherever we have NaN values, we are going to fill them with the string unknown, just so that the one-hot encoder doesn't complain, and we'll do that in place. Let's try that again. All right, let me just fix this. I believe this is an issue with the version of scikit-learn: this is something that worked on my computer but is not working on Google Colab. Whenever you face such issues, where something works in one place but not in another, that is probably because of version differences. Okay, let's do this one last time, and this should fix it. So watch out for version differences between libraries. If you ever want to check the version of a particular library, just run pip list; that will show you a list of all the installed libraries and their versions. You can check the version on your computer, on Colab, or wherever you're running, and identify the discrepancy. The way to install a specific version is to run pip install scikit-learn, for example, and then specify the version you want after a double equals sign. Okay, with that out of the way, we can now one-hot encode our columns. By one-hot encoding, what we want to do is take the values in the categorical columns and create a separate column for each category. Those category columns will simply contain ones or zeros, depending on whether a particular row belongs to that category. Again, this is something we've discussed in detail in the previous session. So I will just run this code here, which first creates a one-hot encoder, then fits it to the inputs that we have, then creates a list of new feature names, one for each combination of categorical column and category. Then we transform the data from the categorical columns into one-hot vectors and put them back into our data frame. The net effect of this is that for every categorical column, for example location, we have a bunch of separate columns like location Adelaide, location Albany, location Albury, et cetera, filled with zeros and ones, with a one in the column for the specific location that the row represents, for example a one for Albury because this location is Albury, and zeros elsewhere. Now, one-hot encoding is not the only encoding strategy; there are other encoding strategies as well, so I encourage you to try them out and observe how they affect the results. As a final step, let us drop the textual categorical columns from our inputs. I'm creating these new X_train, X_val, and X_test variables, which contain just the numeric columns, which have been imputed and scaled to the zero-to-one range, and the encoded categorical columns. So we are removing the actual string categorical columns and keeping just the encoded ones, and this is the input that we will use to train and evaluate our model.
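A minimal sketch of the one-hot encoding step, assuming the train/validation/test inputs, numeric_cols, and categorical_cols from above. Note that, echoing the version discussion, the keyword for dense output depends on your scikit-learn version (sparse_output in newer releases, sparse in older ones):

```python
from sklearn.preprocessing import OneHotEncoder

# Fill missing categorical values so the encoder doesn't complain.
train_inputs[categorical_cols] = train_inputs[categorical_cols].fillna('unknown')
val_inputs[categorical_cols] = val_inputs[categorical_cols].fillna('unknown')
test_inputs[categorical_cols] = test_inputs[categorical_cols].fillna('unknown')

# One new column per (column, category) pair, e.g. 'Location_Albury'.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

# Keep only the numeric and encoded columns for training.
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
```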
Of course, we have the targets as well: the train targets, validation targets, and test targets. So here's what the input to our model looks like. Let's save our work before continuing. All of this is something we did last time as well, so it should start to feel fairly standard, maybe even a little boring by now, because these are the same steps we performed last time. Let's talk about training and visualizing decision trees. A decision tree, in general parlance, represents a hierarchical set of binary decisions. For example, here is the kind of decision tree you might set up to decide whether or not to accept a job offer. If the salary is not between 50,000 and 80,000 dollars, you decline the offer. If it is in that range, then maybe you check whether the office is close to your home; if it is not, you decline the offer. Otherwise, you check whether the company provides a cab facility; if it doesn't, maybe you decline the offer, and otherwise you accept it. This kind of strategy is how we make a lot of decisions in the real world. In fact, this is how a lot of processes are set up, and if you think carefully, this is how programs are set up too: we write a lot of if-else statements to come to a certain decision. Now, a decision tree in machine learning works exactly the same way, except that we let the computer figure out the optimal structure and hierarchy of decisions instead of coming up with the criteria manually. Applying it to our problem of whether or not it will rain tomorrow: first we let the computer figure out the most important criterion to check. Say that criterion is whether it rained today or not; then there is one subtree for the case where it rained today and a different subtree for the case where it did not. If it did rain today, maybe we then look at the pressure, and if it did not, maybe we look at the wind speed at 3 p.m., and so on. So you can have different subtrees on either side, and we will see how these trees are built, but the important point is that we are not creating these trees ourselves; we are letting the machine learning model figure out the right criteria and decision points to best fit the data. To train a decision tree, we simply import the DecisionTreeClassifier model from sklearn.tree. Why a decision tree classifier? Because this is a classification problem. Remember, there are two types of problems: classification and regression. In regression you're trying to predict a continuous value, for example the medical charges for an insurance applicant, but in classification you're trying to classify the input into one of two (or more) categories. Here we're trying to classify today's measurements according to whether or not it will rain tomorrow, so yes or no. That's why we're using a decision tree classifier; if it were a regression problem, we could use a decision tree regressor.
So from sklearn.tree we import DecisionTreeClassifier, and then we create the decision tree model by simply creating an object of the DecisionTreeClassifier class. There is some randomness involved in how decision trees are built, so if you want to get the same output each time you run this code, provide a value for random_state, say random state 42. This initializes the random number generator inside the decision tree, so each time you run this code you get the same randomization and hence the same outputs. If you do not want the same output each time, you can remove the random state, but it is generally recommended to set a random state for your decision tree classifier so that you can replicate your results; otherwise your results will not be reproducible. You can pick it to be any number you want. All right, so now we've created the model, and the next step is to fit it. We give the model the training data, which is all the numeric columns that have been imputed and scaled to the range zero to one, along with the encoded categorical columns, and we give it the targets, which are simply the yes/no values for whether it will rain tomorrow for each row of inputs. We run that, and it takes a couple of seconds; it took 2.8 seconds, and our decision tree classifier has been trained. Okay, so what just happened? Let's try to use this classifier and see how it works, and then we'll try to visualize it as well. The first thing to do after training any model is to make some predictions with it and evaluate how well it is doing. Here's how you can make predictions: if we call model.predict and give it a set of inputs, it returns the predictions, which we can look at. I'm going to call model.predict on X_train. This is what X_train looks like: all of these values are numbers, missing values have been filled in, and categorical columns have been converted to one-hot vectors. The model gives me some predictions, and the predictions are either no or yes. How does the model know that it needs to predict no or yes? Because we called model.fit with our targets, and our targets have these yes/no values. When the model was learning from the data, it identified that it needs to predict a yes or a no value. Internally, of course, the model represents this yes/no target as a zero or a one, but when returning the output it gives us back the strings yes or no. All right, so now we have some predictions from our model. We called model.predict on the training data itself, and it seems like there are a lot of no's here. Just to make sure we also have some yeses, I'm going to compute the value counts, which takes the predictions and tells you the counts of the unique values. It seems there are about 76,000 no's and about 22,000 yeses in the predictions. So based on whatever logic it has learned, the model is actually predicting different things; it's not just predicting no every time.
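A minimal sketch of training the tree and inspecting its predictions, assuming the X_train and train_targets variables from the preprocessing steps above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# random_state makes the (randomized) tree construction reproducible.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, train_targets)

# Predict on the training data itself and count the unique predictions.
train_preds = model.predict(X_train)
print(pd.Series(train_preds).value_counts())
```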
So the model seems to have learned something. Now, how well has it learned? That is something we can evaluate by computing an accuracy score. We have the training predictions and we have the training targets, which are the actual values, and the simple thing we can do is compare them value by value: we compare the first value and they match, we compare the second and they match, the third and they match, and we count the percentage of values that match. I'm going to use accuracy_score, which is imported from sklearn.metrics and simply counts the fraction of matches. I run accuracy_score on the training predictions and the training targets to see how well the model has done on the training set. It seems the accuracy of the model on the training set, the data it was trained on, is 99.99%, so practically a hundred percent; the difference could just be a floating-point artifact. The decision tree also returns probabilities for each prediction, which we can check by calling model.predict_proba with the same input. It looks like the model is very confident about its predictions as well: we have an accuracy close to a hundred percent and a probability of one for most of the predictions, and you can verify whether that holds throughout. So it seems like we've learned everything there is to learn from this data. Or have we? The training set accuracy is close to a hundred percent, but we can't rely on the training set accuracy alone, because the model will not be used in the real world on the same training data. In the real world, the model will see data it has not seen before, and so far it hasn't seen the validation set. So we must evaluate the model on the validation set. We make predictions on the validation set by calling model.predict, and then we can compare the validation predictions with the validation targets using the accuracy_score function. But because this is such a common operation, scikit-learn models already have a .score method. In the case of decision trees, if you call model.score and give it the inputs (here the validation inputs) and the targets, it will make predictions on those inputs, compare them with the targets, and give you the accuracy. And it turns out that the accuracy on the validation set is just 79.2%. So the accuracy on the training set was essentially a hundred percent, as we saw, and the accuracy on the validation set is just about 79%. In fact, 79% is only marginally better than always predicting no. If you look at the validation data and compute the percentage of values that are no, by taking the value counts and dividing by the length of the validation set, it turns out that 78.8% of the rows have the target no and about 21% have the target yes. Which means a model that simply predicted no all the time would be 78.8% accurate.
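A minimal sketch of this evaluation, assuming the fitted model, X_train, X_val, train_targets, and val_targets from above:

```python
from sklearn.metrics import accuracy_score

# Accuracy on the data the model was trained on.
train_preds = model.predict(X_train)
print('Train accuracy:', accuracy_score(train_targets, train_preds))

# Class probabilities for each prediction.
print(model.predict_proba(X_train)[:5])

# model.score predicts and computes accuracy in one step.
print('Validation accuracy:', model.score(X_val, val_targets))

# Baseline: the fraction of each class in the validation targets.
print(val_targets.value_counts() / len(val_targets))
```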
And our fancy decision tree, which is a hundred percent accurate on the training set, is only marginally better, less than one percent better, than a dumb model that always predicts no. So what's going on here? How is the model a hundred percent accurate on the training data but completely failing to learn anything useful about the validation data? Here's what has happened: the model has learned the training examples perfectly, which means it has essentially memorized them. It's like memorizing the answers to all the questions in your textbook for an exam; when you get to the exam and none of the questions appear with exactly the same values, you're likely to score poorly. In the same way, the model has learned all the training examples but does not generalize well to previously unseen examples. This phenomenon is called overfitting, and reducing overfitting is one of the most important parts of any machine learning project, especially when you're dealing with tree-based models like decision trees. So we'll see how to reduce overfitting, and the first step in understanding what's going on is to visualize the decision tree that has been learned from the training data. I told you in the beginning that a decision tree is a hierarchical tree of binary decisions, something like the job offer example. Our model builds exactly such a tree, and we can visualize it using the plot_tree function from sklearn.tree. So I import plot_tree, and since plot_tree uses matplotlib under the hood, I increase the figure size so we get a big image to look at. We call plot_tree with the model; plot_tree also takes the names of the feature columns, so we provide those so it can tell us which columns the model is looking at. We also provide a maximum depth, because this tree is very deep, around 40 or 50 levels, which cannot be printed easily, so we're only going to look at the top two levels of the tree. And the filled option just adds some background color to the nodes. Let's run this and see what it looks like. So here's what our model's decisions look like. The model first checks whether the humidity at 3 p.m. is less than 0.715. If so, it goes in this direction and then checks whether the rainfall is less than 0.004; if that is also true, it checks whether sunshine is less than 0.0525, and then there are further checks, and so on. This is how the model proceeds: each time it makes a decision based on a value, like this humidity check, it goes either left or right. If it has gone right, there is another check on humidity, and again based on that it goes left or right, and it keeps going. We have only plotted to a depth of two, but you can plot to any depth here.
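A minimal sketch of this visualization, assuming the fitted model and X_train from above (the exact figure size is just a convenient choice):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw only the top two levels; the full tree is far too deep to display.
plt.figure(figsize=(80, 20))
plot_tree(model, feature_names=list(X_train.columns), max_depth=2, filled=True)
plt.show()
```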
Now, there seems to be a problem here: typically in this image you will see lines connecting these nodes, but even without them you can see the tree that's building up. This is the first decision; based on it, this may be the second decision, and based on that, the third, and so on. That keeps going until it reaches a final leaf node where there are no more decisions to be made, and that leaf node contains information about which class should be returned as the output. So I hope you can see how the model classifies a given input as a series of decisions. Of course, the tree is truncated here, but following any path from the root node to a leaf will result in a yes or a no. I hope you can also start to see how a decision tree differs from a logistic regression model. One important difference I can immediately point out is that instead of having a fixed weight for every column, the conditions and thresholds that get applied can change as you go left or right. For example, based on whether the humidity is less than or greater than 0.7, the conditions applied to wind gust speed may change. And that makes sense: if it has rained today, maybe the factors I should look at are different from the ones I'd look at if it has not rained today. That kind of non-linear relationship can be captured better by a decision tree, and it's a bit harder to capture in a linear model. So whenever you have these non-linear relationships, it's always worth trying out a decision tree and seeing whether it performs better than a logistic regression model. Okay. Now, you may wonder how this decision tree is created. How exactly does the model figure out what the first decision should be, and the second, and so on? This is where you should pay attention to the Gini value shown in each box. Every machine learning model has something called a loss function or cost function, and the objective of the model is to minimize that cost. The way a decision tree does this is by using the Gini score, which represents how good a certain split is. A lower Gini score means a lower cost, which means a better split. A perfect split, say if just looking at humidity at 3 p.m. could perfectly separate "will not rain tomorrow" from "will rain tomorrow", would have a Gini score of zero. As a split gets worse and worse, the score increases: if your split is completely useless, meaning that even after splitting there are 50% yeses and 50% no's on each side, you will have a high Gini score, close to 0.5. So a low Gini score means a good split and a high Gini score means a bad split. What does the decision tree do, then? Conceptually speaking, while training, the model evaluates all possible splits across all possible columns. Right now we are looking at this one split on humidity at 3 p.m., but conceptually the model has looked at all the columns, and for each column it has looked at all the possible split points.
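To make the Gini values in the boxes concrete, here is an illustrative sketch of the Gini impurity as it is commonly defined (one minus the sum of squared class proportions), which is what produces the zero-for-pure, roughly-0.5-for-useless behavior described above; the exact bookkeeping inside scikit-learn is more involved, but the idea is the same:

```python
def gini(labels):
    # Gini impurity of a set of labels: 1 - sum of squared class proportions.
    total = len(labels)
    impurity = 1.0
    for cls in set(labels):
        p = labels.count(cls) / total
        impurity -= p ** 2
    return impurity

print(gini(['No'] * 50 + ['Yes'] * 50))   # 0.5  -> a useless, perfectly mixed node
print(gini(['No'] * 90 + ['Yes'] * 10))   # 0.18 -> mostly one class
print(gini(['No'] * 100))                 # 0.0  -> a pure node, a perfect split
```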
So it has basically sorted all the values in each column in increasing order, taken each value as a candidate split point, performed the split at that point, and calculated the Gini score for it. Good splits have a low Gini score and bad splits have a high Gini score, and out of all the columns and all the splits, it has selected the column and split point that leads to the lowest possible Gini score. Now, of course, with just one split you can't really get to a Gini score of zero, because you can't look at one feature and one threshold and perfectly predict whether or not it will rain tomorrow. But among all the possible splits, it turns out that humidity at 3 p.m., specifically whether it is less than or greater than about 0.7, is the most important factor; it leads to the lowest Gini score. That's how the decision tree figures out the root node. Once it has figured out the root node, which is the best split among all columns and all possible split points, it performs that split on the data: all the training data with humidity less than 0.7 falls into one region, and all the data with humidity greater than 0.7 falls into the other. And this is where the process is repeated. For the subset of data with humidity less than 0.7, it tries all the columns and all possible splits and figures out the best one; it turns out that if humidity is already less than 0.7, rainfall less than 0.04 is the best split, and if humidity is greater than 0.7, then humidity less than or equal to 0.825 is the best split. So that's what is happening here. The iterative approach of machine learning, in the case of a decision tree, involves growing the tree layer by layer. First we take all the training data, look at all possible splits across all possible columns, and compute the Gini score for each. Based on the Gini score, we pick the best possible split, split the data accordingly, and then repeat the process recursively for the left split and the right split. So we are recursively growing the tree: we have the level-one decision, then we make level-two decisions with the split data, then level-three decisions, then level-four, and so on. And for how long does this go on? Well, by default it goes on until you end up with leaf nodes containing just a single value. In fact, you can see the counts here: at the very top you have 98,988 rows of data, and this split sends about 76,000 rows to the left and 22,000 rows to the right. Similarly, this node here has about 82,000 rows, and its split sends roughly 70,000 one way and 11,000 the other. So that's roughly how it works: it divides the data into parts and keeps dividing until it gets down to single-sample leaf nodes. For that single row of data, you already have the target, so the target of that row is used as the value of that leaf. Every leaf ultimately contains just one sample, and that sample already has a target of yes or no.
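Here is a conceptual sketch of the greedy split search just described: try every column, try every candidate threshold, and keep the split with the lowest weighted Gini impurity. This is only an illustration of the idea; scikit-learn's real implementation is heavily optimized and uses additional heuristics, as noted later:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(inputs, targets, columns):
    best = None   # (gini_score, column, threshold) of the best split so far
    for col in columns:
        for threshold in np.unique(inputs[col]):      # candidate split points
            mask = inputs[col] <= threshold
            left, right = targets[mask], targets[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted Gini impurity of the two child nodes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(targets)
            if best is None or score < best[0]:
                best = (score, col, threshold)
    return best
```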
So essentially what we're saying is: we follow this decision tree down to a specific example from our training set, look at the label of that training example, and return the same label. I hope you now understand why the training accuracy is a hundred percent: the decision tree has literally memorized the entire training set in the form of this tree structure. You can verify how deep this tree is by checking model.tree_.max_depth, and it turns out that this tree is 48 levels deep. So within 48 decisions you get to a leaf node, and that leaf node holds the label of the specific training example that ended up there. So this is one way to visualize what a decision tree has learned. And as I said, you would normally see arrows connecting the nodes here; I'm not sure why they're not showing up, but normally you do see them. Another way to display a decision tree, which can be easier to follow, is as text. You can call export_text and pass in the model; you can again specify a maximum depth up to which you want to show things, because this can get pretty large, and you provide a list of feature names here too. So here's what the textual representation looks like. We check if the humidity is less than 0.72, and if so we go down this path; otherwise we go down the other path, which I think isn't shown here, because the tree, even at ten levels of depth, is very large. But yes: first you check humidity, then rainfall, then sunshine, then pressure, then wind gust speed, then humidity again, then wind direction, then the location. Now, if we're looking at the location Watsonia, we then check the cloud cover at 9 a.m., then the wind speed at 3 p.m., and then the pressure. If all of these checks succeed, we return yes, it will rain tomorrow. If the pressure check fails, we return no. If the wind speed check fails, we check the minimum temperature, and there is another branch of decisions to be made there. Similarly, if the cloud cover at 9 a.m. is greater than 0.83, we check the cloud cover at 3 p.m. and return yes or no accordingly; otherwise we check the temperature, and there is yet another subtree of decisions. So the idea is that this is the same decision tree we saw above: we make these hierarchical decisions, and the model has learned which decision to make first by analyzing all possible decisions. One small note here: this is how you should think about it conceptually. The model hasn't literally analyzed every possible decision, because that would be very inefficient; there are certain heuristics applied, which are basically strategies to find good enough decisions if not the absolute best ones, and there is some randomization involved as well.
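A minimal sketch of the textual view of the tree, assuming the fitted model and X_train from above:

```python
from sklearn.tree import export_text

print('Tree depth:', model.tree_.max_depth)

# Export the tree as text, limited to a few levels, and print only the start.
tree_text = export_text(model, max_depth=10, feature_names=list(X_train.columns))
print(tree_text[:2000])
```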
So, just as an optimization, there is some randomization and there are strategies to pick, if not the best, at least a good enough decision. All right, those are internals of decision trees that we don't really need to worry about. Now, based on this discussion, can you explain why the training accuracy is a hundred percent while the validation accuracy is lower? Think about it. It's because the model has literally learned every training example, and when it sees an example that doesn't match the training data exactly, it tries to map it onto one of the existing training examples by following one path of the decision tree, and that may or may not work out well, because it ultimately boils down to a specific memorized training example. This is what is called overfitting: the model has learned or memorized specific training examples and does not generalize well to examples it has not seen before. Okay, let's keep going. Now, based on the Gini index computation, a decision tree assigns an importance value to each feature. Again, there is a certain calculation involved in how the importance is assigned, but these values can be used to interpret the results given by a decision tree. If you check model.feature_importances_ on any decision tree model, you get a list of numbers, one importance per feature. Remember the input to our model, X_train, had 119 columns, so you will see 119 values here; in fact, if I check X_train.columns, you can see there are 119 columns. So this is the importance for minimum temperature, this for maximum temperature, this for rainfall, this for evaporation, and so on. Let's create a data frame out of it: a pandas data frame with one column called feature, the name of the column in X_train, and one column called importance, and then we sort those values by importance in descending order. Looking at the ten most important columns, we have humidity at 3 p.m., which seems to be the most important at about 0.26, then pressure at 3 p.m., then rainfall, and so on. You will find that these importances line up with the decision tree itself: you can see humidity, then rainfall, then wind gust speed; sunshine shows up too, but pressure doesn't show up yet, though if you went one level deeper you would also see pressure. So these are the importances: humidity, pressure, rainfall, wind gust speed, sunshine, et cetera. We can also plot these as a bar plot; I'm using sns.barplot to create a horizontal bar plot of the ten most important features. It turns out that humidity at 3 p.m. has a feature importance above 0.25, whereas the next most important feature is pressure at 3 p.m., followed by rainfall, wind gust speed, et cetera. And these values should be interpreted in relative terms.
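A minimal sketch of inspecting and plotting the feature importances, assuming the fitted model and X_train from above:

```python
import pandas as pd
import seaborn as sns

# One importance value per input column, derived from the Gini-based splits.
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.head(10))

# Horizontal bar plot of the ten most important features.
sns.barplot(data=importance_df.head(10), x='importance', y='feature')
```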
So mostly you want to use this to figure out which features are more important than others. That's how you interpret a decision tree: you can see its actual decision-making process, and given an example you can literally draw the tree and work through it to see why the tree arrived at a certain answer, and you can also see the importance of the different factors. This is also a point where you can go back and check your data. For example, if humidity has a lot of missing values and we filled in a lot of values there, maybe we are misleading the model with all those imputed values; maybe we should remove the humidity column, or maybe we should fill those missing values differently, and so on. So you need to go back and forth and check whether your data makes sense, given the feature importances you're seeing. So that's how you train a decision tree: you import DecisionTreeClassifier from sklearn.tree, fit it to the input data, and then you can analyze it and evaluate it using the validation dataset. We saw that the decision tree classifier we trained memorized all the training examples, leading to a hundred percent training accuracy, while the validation accuracy was only marginally better than a dumb baseline model. So at this point our decision tree is basically useless, because it has just memorized all the training examples. This phenomenon is called overfitting, and in this section we will look at some strategies for reducing it. You will hear a few related terms now. In machine learning, overfitting simply means that you're doing very well on the training data but poorly on the validation data; we'll define it a bit more concretely in a short while. The process of reducing overfitting is known as regularization. So whenever you see regularization techniques, regularization coefficients, regularization components, et cetera, all of that is concerned with reducing overfitting, which means trying to increase the validation accuracy or bring it closer to the training accuracy. Sometimes we may be okay giving up some training accuracy to get a better validation accuracy, because the validation accuracy is what we ultimately care about. Now, how do we reduce overfitting in a decision tree classifier? When we created the classifier, we gave it just one argument, the random state. Apart from that, it also accepts several other arguments, which can be used to reduce overfitting. If you check the help, by typing a question mark after DecisionTreeClassifier, you will see that you can specify a criterion, which can be Gini or entropy; this is simply the loss function, so there are two loss functions to choose from. You can specify a splitter, which is the strategy used to split at each node; by default it picks the best possible split, with some randomization, or you can ask for a completely random split without evaluating the different splits. But here's something interesting.
There is a max depth parameter, and you can use it to specify the maximum depth of the tree. Typically, these arguments that you set when creating your machine learning model are called hyperparameters, because the term "parameter" is generally reserved for the numbers inside a machine learning model. For example, in logistic regression the weights of the different features are the parameters; in a decision tree, which column is the root node, which points we split at, and what the splits look like are the parameters. Anything the model learns or figures out on its own is called a parameter. So to separate the things the model figures out from the things we have to set up front, we call the latter hyperparameters. Max depth is something we specify at the very beginning, when creating the classifier, so we call it a hyperparameter: it's not something the model figures out, it's something we specify. So what is maximum depth? Well, we saw that the tree went down more than 40 levels, and we can check this on the model we trained earlier by looking at model.tree_.max_depth: the decision tree went 48 levels deep. That was one of the reasons for overfitting, because it was learning every training example. What if we did not go 48 levels deep? What if we only went three levels deep? Let's try and see what that gives us. So now we have put in a restriction that our decision tree should not go more than three levels deep, and we call model.fit with the same training inputs and the same training targets. The model trains again in a second or two, and then we compute its accuracy on the training and validation datasets by calling model.score on X_train with the training targets, and on X_val with the validation targets. It turns out that the model is now only about 82 or 83% accurate on the training data, which makes sense, because the model can no longer learn every training example; it can only go three layers deep, so it has to make the best it can out of three layers. But this has the consequence that the model is no longer overfitting: it now performs better on the validation set than it did before. It may seem counterintuitive that a three-level-deep tree performs better on real-world data than a 48-level-deep tree, but that's because the 48-level tree is memorizing specific training examples, whereas the three-level tree is picking up general trends, and in machine learning you want models to pick up general trends, not memorize training examples. So the model's validation accuracy has gone up from about 79% to 83%; that's a good improvement, and even though the training accuracy has gone down, ultimately what we care about is the validation accuracy. Now let's visualize the model using plot_tree once again. Here's what our entire decision tree looks like: first we check the humidity at 3 p.m., and if it is less than 0.715, we go left.
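A minimal sketch of limiting the tree depth, assuming X_train, X_val, train_targets, and val_targets from earlier:

```python
from sklearn.tree import DecisionTreeClassifier

# Restrict the tree to three levels of decisions.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, train_targets)

print('Train accuracy:', model.score(X_train, train_targets))
print('Validation accuracy:', model.score(X_val, val_targets))
```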
Now here, we check the rainfall; if the rainfall is less than 0.004 we go left. Then we check sunshine; if sunshine is less than 0.0525 we go left, and if we reach this leaf we return the class no. Whenever you reach a leaf node, you return the class of that leaf node. Similarly, you check humidity, rainfall, humidity again, and at this point you return no. So it seems like a lot of these leaves are no, and in many cases, as you go along this decision tree, you will end up at no, but there are certain cases where you end up at yes. For example: if the humidity at 3 p.m. is greater than 0.715 but less than 0.825, and the wind gust speed is less than 0.279, then the Gini score here is 0.471 and the class is yes, so here you return that there will be rain tomorrow. And if the humidity is greater than 0.825, it turns out that in all these cases you end up at yes. And of course, these subtrees got truncated; they could not be built beyond three levels deep. It's possible that if you had allowed more levels, some of these yes leaves would split again into further nodes, but because we end the tree at three levels, it returns no under all these conditions and yes under all these. This is worth studying carefully, because at this point we already know that we can predict with 83% accuracy simply by looking at humidity, rainfall, sunshine, and wind gust speed. Just four out of the 23-plus columns give us a prediction with 83% accuracy. And once again, we can also look at it as a textual tree: here you can see the same thing, humidity less than 0.72 and humidity greater than 0.72, then you check the rainfall, and based on the rainfall value you either check sunshine or check humidity once again, and it returns yes or no. Now, one thing you may wonder is: what is the right maximum depth to use? Should we use a maximum depth of zero? Obviously not, because then your model would not learn anything; it would always just predict no, and while that would be about 79% accurate and heavily regularized, it would not be very useful, because you've not given your model enough power. On the other hand, if you allow your model to go 40 or 50 levels deep, it can memorize every single training example, and since it is trying to optimize for the lowest Gini score, it basically memorizes all the training data, which is bad because then your model will not generalize. So the best value for the maximum depth of the tree is going to be somewhere between zero and 40. Let's try and figure out what the best value of maximum depth is. Here's what I'm defining: a function called max_depth_error, which takes a max depth value as input. Inside it, we create a DecisionTreeClassifier with that particular max depth and random state 42, and we fit this model to the training data.
Then we calculate the accuracy on the training set and the accuracy on the validation set, and we define the training error as one minus the training accuracy and the validation error as one minus the validation accuracy. If accuracy is the percentage it got right, error is the percentage it got wrong. Then we simply return these in a dictionary, and you'll see in a moment why. Now we take this max_depth_error function, which for a given max depth figures out the training error and the validation error, and we run it through a list comprehension, trying every value of max depth from 1 to 21. That's why it's taking a while: we are building a decision tree for every one of those max depth values and computing the training and validation error for each of these models. Then we put all those results into a data frame. So let's give that a minute, and here we go. This is what we get: with a max depth of one, the training error is about 0.18, which means that just by making one decision, you get to a training accuracy of about 82%, and a validation accuracy in the same range. As you increase the max depth, the training error keeps decreasing, which means the training accuracy keeps improving: 0.18, 0.17, 0.16, 0.15, 0.14, 0.13, and so on, all the way down; at a max depth of 20 we are at about 97% training accuracy. And of course, if we increase the max depth further, the model will, so to speak, memorize more training examples and get even better on the training data. But notice what's happening with the validation error: it starts around 0.17, goes down to about 0.15, and then starts to increase again, going to 0.16, 0.17, 0.18, 0.19. If you plot this, here's what it looks like: I've plotted the training versus validation error, the blue line being the training error (one minus training accuracy) and the orange line being the validation error. So what's happening here? At first you see both the training and the validation error decreasing. You are making your model progressively more powerful: here you allow it to make one decision, here two layers of decisions, here four layers, and so on. Up to a certain point it helps to add more complexity, more power, more capacity to your model; it helps to make your model bigger. But after a certain point, once your model's capacity gets large enough, it starts to just memorize the training data and stops generalizing. At that point it gets better and better on the training data and worse and worse on the validation data, and this is the scenario known as overfitting.
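A minimal sketch of this max-depth sweep and the error plot, assuming X_train, X_val, train_targets, and val_targets from earlier (the exact range of depths tried is just a convenient choice):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

def max_depth_error(md):
    # Train a tree limited to depth `md` and report errors (1 - accuracy).
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(X_train, train_targets)
    return {
        'max_depth': md,
        'train_error': 1 - model.score(X_train, train_targets),
        'val_error': 1 - model.score(X_val, val_targets)
    }

errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])

plt.plot(errors_df['max_depth'], errors_df['train_error'], label='Training error')
plt.plot(errors_df['max_depth'], errors_df['val_error'], label='Validation error')
plt.xlabel('Max depth')
plt.ylabel('Error (1 - accuracy)')
plt.legend()
plt.show()
```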
So here is a graph that you will see over and over again in pretty much every machine learning problem. As you increase the complexity, size, power, or capacity of your model (many ways of saying the same thing; ultimately it's a question of how many parameters there are inside the model), both the training error and the test or validation error will go down up to a certain point, because the model has more capacity, so it can actually capture more information about how the inputs and targets are related. But after a certain point it starts memorizing training examples, and that is where your test or validation error starts to increase and your validation accuracy starts to drop. This scenario is known as overfitting: the training error keeps going down, but the validation error goes up. If you increase the complexity of your model a little more, if you add one more layer to your decision tree, the training error goes down but the validation error actually gets worse, and this is where you should stop. So you want to pick the complexity of your model at the point where the validation error is just about to increase. Here, by plotting this graph, we've figured out that at a max depth of seven we get as good as this decision tree can get on the validation error for the given dataset, so a max depth of seven is the best depth for this decision tree. And this is how you regularize a decision tree, which means reducing overfitting by tuning some hyperparameters. Max depth is a hyperparameter, and changing its value is called tuning the hyperparameter. By tuning this hyperparameter we have regularized the model a little and reduced the amount of overfitting. Let me also print out the training score: the training accuracy and the validation accuracy are both around 84.5 to 84.6%, so that seems like the best we can do by modifying the max depth of the decision tree. Okay, so we just looked at one hyperparameter, max depth, and how it can be used to regularize the model. Let's look at another hyperparameter, called max leaf nodes. This is another way to control the complexity of a decision tree, which is to limit the number of leaf nodes. Whenever you have a decision tree, as you can see here, there are a certain number of decision nodes and a certain number of leaf nodes. The way we have limited the size and complexity of the decision tree so far is by specifying how deep it can get, but that may not be the best way; maybe you want to allow it to go a few layers deeper in some places.
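A minimal sketch of training the tree at the depth the error curve suggests, assuming the same variables as before (the chosen depth of seven follows the narration above):

```python
from sklearn.tree import DecisionTreeClassifier

best_model = DecisionTreeClassifier(max_depth=7, random_state=42)
best_model.fit(X_train, train_targets)

print('Train accuracy:', best_model.score(X_train, train_targets))
print('Validation accuracy:', best_model.score(X_val, val_targets))
```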
Maybe you want to allow it to go five layers deep in one branch and stay two layers deep in another. That's where you can specify the maximum number of leaf nodes that your decision tree can have. Here's what I'm going to do: I'm going to specify that my decision tree can have at most 128 leaf nodes. Roughly speaking, you have one node at the top that splits into two nodes below it, which split into four nodes below that, and so on. We might imagine that the decision tree is built layer by layer, building layer one, then layer two, then layer three, but what actually happens is that it always tries to make the best possible split next. Let's look at it here: once the tree has created the root split, you have two candidate nodes, left and right. It looks at both of them and sees which one is better to split; if splitting this one results in a lower Gini score, it splits that one first by creating a split condition there. Then it looks at all the current leaf nodes again and determines which is the best split to make next, makes that split, and so on. So your decision tree doesn't really grow layer by layer, doing this level and then the next; rather, it looks at all the leaf nodes, figures out which is the best one to split at the moment, and splits it. Now, how does that tie back to max leaf nodes? Here we're saying we want at most 128 leaf nodes, and 128 is two to the power of seven, so a tree that was completely filled out to seven layers would have 128 nodes at its lowest level. Let's set max leaf nodes to 128 and see whether the decision tree actually ends up with a depth of seven. We create the DecisionTreeClassifier and call model.fit; the only restriction we've specified is that the number of leaf nodes, the nodes which have not been split, should not exceed 128, and that limits the number of splits. We fit the model on the training data, and it turns out that the training accuracy is only 84.8%, not a hundred percent, for the same reason as before: it cannot go all the way down and memorize every training example, because there's only a limited number of nodes it can create. Let's check the model's accuracy on the validation dataset: this time it is 84.4%. And let's check the tree's depth as well: the depth of the tree is 12. Comparing this with what we had before, the model with a maximum depth of seven had 84.5% validation accuracy, and here we have 84.4%. Maybe if you change this a little, say to 130 or 140 leaf nodes, it may actually cross that, but the important thing is that these two are different.
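A minimal sketch of limiting the number of leaf nodes instead of the depth, assuming the same training and validation variables as before:

```python
from sklearn.tree import DecisionTreeClassifier

# Allow at most 128 leaf nodes; depth is left unrestricted.
model = DecisionTreeClassifier(max_leaf_nodes=128, random_state=42)
model.fit(X_train, train_targets)

print('Train accuracy:', model.score(X_train, train_targets))
print('Validation accuracy:', model.score(X_val, val_targets))
print('Tree depth:', model.tree_.max_depth)
```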
And the reason these two accuracies are different is that the strategy by which we are limiting the size of the tree is different: in one case we're saying the maximum depth can be seven, and in the other, that the maximum number of leaf nodes can be 128. The tree actually goes down to a depth of 12 in certain places, while other parts are shorter — maybe just three or four levels deep. We can verify this by getting the textual representation of the model and looking at the first few lines; the entire thing gets pretty long, so I've just printed the first 3,000 characters or so. Here you can see that one path is fairly long — about 12 checks — while another path has fewer than 10 checks, and another has even fewer. If I print more of the output, you will see shorter and shorter paths: some are five levels deep, some are nine levels deep, depending on the best splits the decision tree was able to find. So here's an exercise for you: find the combination of max_depth and max_leaf_nodes that results in the highest validation accuracy. Another exercise is to experiment with the other arguments of DecisionTreeClassifier. Scikit-learn has excellent documentation — very extensive, very helpful and easy to read — so check out the documentation of DecisionTreeClassifier and go through all the parameters. In most cases it tells you exactly what each parameter does. Maybe try a different criterion, maybe try the random splitter and see if that helps, and see how max_depth matters when you're working with random splits. There are several other hyperparameters there that you can experiment with, each with a detailed explanation, and in some cases links to other resources. A lot of these are implementations of some of the best papers in machine learning, so many of the best practices and techniques are handed to us — we just have to try them out with scikit-learn. Another exercise is to try a more advanced technique for reducing overfitting, called cost complexity pruning. Just as we have limited the tree by depth and by the number of leaf nodes, there is a way to limit splits based on the kind of split a node performs: a split is performed only if it satisfies certain criteria. That's cost complexity pruning. It's not a very commonly used technique, because decision trees by themselves are almost never used in isolation.
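Going back to the first exercise above (finding the best combination of max_depth and max_leaf_nodes), here is one rough way you might approach it — a sketch only, with arbitrarily chosen candidate values and the same assumed variable names as before:

```python
from sklearn.tree import DecisionTreeClassifier

# Try every combination from small grids of max_depth and max_leaf_nodes
# and keep the one with the best validation accuracy.
best = None
for md in [4, 7, 10, 15, None]:
    for mln in [32, 64, 128, 256, None]:
        model = DecisionTreeClassifier(max_depth=md, max_leaf_nodes=mln,
                                       random_state=42)
        model.fit(X_train, train_targets)
        val_acc = model.score(X_val, val_targets)
        if best is None or val_acc > best[0]:
            best = (val_acc, md, mln)

print('Best validation accuracy {:.4f} with max_depth={}, max_leaf_nodes={}'
      .format(*best))
```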
So I won't cover cost complexity pruning here, but it's something you can check out — scikit-learn has good documentation on it, including an example implementation and a link to the original paper. You can follow the code from that tutorial, try to implement cost complexity pruning yourself, and see if you can improve the validation accuracy further. Machine learning is all about trying different hyperparameters and different techniques and getting that additional boost in the model's performance. All right. So that was a quick introduction to decision trees. We've skipped over some parts, especially the more mathematical details about the Gini index and the importance calculation, but I hope you got a basic intuition of how the splits are created: we look at all possible features and all possible splits, find the best split — evaluated using the Gini index — divide the data into two portions, then identify the next best split to make among the current leaf nodes, make that split, and keep going until we either reach pure leaf nodes, in which case we have essentially memorized the entire training dataset, or until we hit limits imposed via hyperparameters for the purpose of regularization. In general, you don't want an unbounded decision tree — you want it to generalize somewhat — which is why you limit its depth and perhaps some of the other hyperparameters as well. Now, while tuning the hyperparameters of a single decision tree may lead to some improvements, a much more effective strategy is to combine the results of several decision trees trained with slightly different parameters — a slightly different max depth, slightly different max leaf nodes, and so on; we'll talk about these differences shortly. When you combine the results of several decision trees trained on the same dataset, that is called a random forest. As the name suggests, a forest is simply a collection of trees, and it's random because some amount of randomization is involved in how the forest is created. If you trained every tree in exactly the same way on exactly the same data, each decision tree would be identical, so we use randomization to make sure the trees in the forest are slightly different from each other. Why combine the results of several trees? The key idea is that each decision tree in the forest, because of this randomization, will make different kinds of errors, and upon averaging, many of those errors will cancel out. It's a very interesting idea and can seem counterintuitive, so let's look at a real-life example, commonly known as the wisdom of the crowd. Say you want to estimate the weight of this animal. One way is to actually weigh it. Another way is to ask a lot of people to guess — and it turns out that the more people you ask, the closer the average of their guesses gets to the actual result.
If you ask just one person, that person could be way off — they may never have been around livestock at all. But if you ask enough people, some will underestimate, some will overestimate, and some will actually know. The average of about 800 guesses — this is a real experiment that was conducted, and the animal was an ox, by the way — turns out to be very close to the actual weight of the ox. So this is an idea that intuitively makes sense: if you ask enough people, you get closer to the real answer. Of course, that's not one hundred percent true in all cases, because humans have biases too, but in general that's the idea. And that's the strategy here: when we take decisions from several decision trees trained with some amount of randomization, the kinds of errors they make cancel out upon averaging, and the result is better than the result from any individual decision tree. So let's check it out. Here's how a random forest works: it simply takes the results of multiple decision trees and averages or combines them. Say you have tree 1, tree 2, tree 3, and so on up to tree N, all trained on the data, and say these are classifiers. You get the output from each tree and do a simple vote: you figure out which class occurs most often among the trees' predictions, and you return that as the output of the random forest. So a random forest is simply a collection of trees that have been trained independently and slightly differently, and its output is obtained by combining the outputs of all the trees. Because we are doing classification, we combine them by voting — picking the most common class. If it were regression, we would combine them by averaging: each tree predicts a continuous number, and we return the average of those numbers. Training a random forest classifier, as you might guess by now, is really easy: you just import the RandomForestClassifier class from sklearn.ensemble. This is called an ensembling technique, which is the more general term: whenever you take multiple models — they don't have to be decision trees, they don't even have to be the same kind of model — and combine their results to get the result of the combined model, that's called an ensemble. So let's import RandomForestClassifier and set one up. There are a couple of arguments I'm passing in. One is n_jobs: the random forest contains a bunch of trees, and since each tree is trained independently, we'd like to train them in parallel. When you specify n_jobs=-1, the random forest classifier can use all available CPU threads to train decision trees in parallel, which just makes it faster. You can try removing n_jobs=-1 and see the difference. I'm also specifying a random_state of 42.
This random_state will be passed down to each decision tree to control the randomization there. The reason for passing a random_state is so that each time I run this line of code or fit this model, I get the exact same result. So we've created the RandomForestClassifier object; let's fit it. This is where n_jobs becomes useful: if n_jobs were left unspecified, training would use just one thread, but right now it is using all available workers. I recommend trying this with and without n_jobs and running model.fit again to compare. All right, it took about 16 seconds to train the model, and what happened is simply that it trained a whole bunch of decision trees. Once the model is trained, the next thing is to score it. On the training data, calling model.score with X_train, it is 99.99% accurate — which makes sense, because we have not specified any limitations on the individual decision trees, so each tree trains until it perfectly fits the training dataset. But here's where it starts to get interesting: if you now score the model on the validation set, you get an accuracy of about 85.6%. Remember, the first decision tree we trained had a training accuracy of 99.9% and a validation accuracy of 79%, which was barely better than a very dumb model that always predicted "no rain". Here we've gotten to around 85.6% — even higher than our fine-tuned decision tree, which with a carefully chosen maximum depth only got to about 84.5%. So we've seen roughly a 1% increase simply by taking a bunch of decision trees and averaging their results. How does that even work, when each individual decision tree is worse than their combination? This is where you see the power of ensembling — the general technique of combining the results of many models. The reason it works is that the errors of individual models can cancel out when we average the results. Visually, here's what it looks like. Say you have two features, feature 1 and feature 2, and a target shown here as a shape — three classes: a triangle, a square, and a circle. One thing you could do is train a single decision tree on the entire data. But here a different approach has been taken: a 30–40% sample of the data is used to train one decision tree, another 30–40% sample is used to train a second tree, and another sample is used to train a third. Each of these decision trees is imperfect: you can see that one decision boundary misses this triangle and these circles, another misses a few squares and circles, and the third is imperfect too.
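To recap the training and scoring steps just described, a minimal sketch (same assumed variable names as before):

```python
from sklearn.ensemble import RandomForestClassifier

# Train all the trees in parallel (n_jobs=-1) with a fixed random_state
# so the results are reproducible from run to run.
model = RandomForestClassifier(n_jobs=-1, random_state=42)
model.fit(X_train, train_targets)

print('Training accuracy:  ', model.score(X_train, train_targets))
print('Validation accuracy:', model.score(X_val, val_targets))
```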
Back to the visual example: if you average out all these decision boundaries — combine them — the ensemble decision boundary can be a lot more accurate than any individual decision boundary. That's the idea here, and it's something to think about for a while to see if the intuition makes sense to you; if not, let us know and we can talk more about it in the next lessons. The errors cancel out, the wisdom of the crowd is greater than the prediction of any particular tree, and that's why we get this validation accuracy of around 85.6%. A random forest also gives you probabilities for its predictions in the classification case. We can call model.predict_proba with X_train and check the training probabilities, and here's what they look like: 0.9 and 0.1, 0.97 and 0.03, and so on. They're no longer just ones and zeros; they're actual numbers between zero and one. How does this probability get calculated? The probability of a class — say the probability of the class "No" for the first example — is simply the fraction of trees which predicted that class. It turns out that among all the trees trained in the random forest, 90% predicted "No" for the first input and only 10% predicted "Yes", which is why the probability assigned to "No" for the first example is 0.9, and why the random forest's prediction for the first row of X_train is "No". You can verify this against the predictions: the model predicts "No" because the probability is 0.9. So that's how probability is calculated in a random forest — simply by taking the fraction of trees that predict a certain class. All right. We said there are a bunch of decision trees being trained — where are those trees? They can be found in model.estimators_. If you check the length of model.estimators_, you'll see there are one hundred decision trees by default, and this is something we can configure — we'll talk about it in a second. If you check the first estimator, you can see it is a DecisionTreeClassifier. We can actually plot a couple of these trees, say to a depth of two. I initially plotted decision tree number zero and decision tree number 20, but let me pick a different one to illustrate the difference better. In decision tree number zero, it turns out rain today is the most important factor — we're checking rain today ≤ 0.5, and then cloud at 9 a.m. and pressure at 3 p.m. In decision tree number 15, on the other hand, we're looking at sunshine and humidity, and also rain today — but sunshine and humidity, which don't show up at the top of tree zero at all (maybe humidity does, in a different form), are clearly more important here.
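Here's a sketch of the two things just described — the prediction probabilities and the individual trees inside the forest. The tree indices and figure size are arbitrary choices for illustration:

```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Class probabilities are the fraction of trees voting for each class.
train_probs = model.predict_proba(X_train)
print(train_probs[:5])
print(model.classes_)          # the column order of the probabilities above

# The individual trees live in model.estimators_ (100 by default).
print(len(model.estimators_))
print(model.estimators_[0])    # each element is a DecisionTreeClassifier

# Plot the top two levels of two different trees to see that they differ.
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
plot_tree(model.estimators_[0], max_depth=2,
          feature_names=list(X_train.columns), filled=True, ax=axes[0])
plot_tree(model.estimators_[15], max_depth=2,
          feature_names=list(X_train.columns), filled=True, ax=axes[1])
plt.show()
```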
So there is definitely some randomization going on here — not all the decision trees are the same — and we'll learn more about this as we explore the hyperparameters of a random forest. One exercise for you is to verify that none of the individual decision trees has a better validation accuracy than the random forest. That's really the thing to take away from this: the combination of all these trees is better than any individual decision tree, and once you get that, you get the power of ensemble methods. Now, just like decision trees, random forests also assign an importance to each feature, computed by combining the importance values from the individual trees. There's some math involved in how that computation is done, but the important thing is that scikit-learn does it for you and exposes it as model.feature_importances_. Let's check model.feature_importances_ and put it into a pandas dataframe. It looks like we have more or less the same features showing up: humidity, sunshine, pressure, and so on — a similar set to what we saw for a single decision tree, especially the unbounded tree or the one with a maximum depth of seven or eight. One thing you will notice is that the skew is a lot less, because we're now depending on a combination of all the different trees: every feature gets some importance because it is important in some decision tree. So that's how you train and evaluate a random forest. You can make predictions with it — give it a set of inputs and it gives you a set of outputs — you can get the probabilities, and you can get the feature importances. Again, this is a model-level feature importance, not tied to any specific input: the model in general values humidity a lot, then sunshine and pressure. Now let's talk about how to improve a random forest. This is where random forests and decision trees are more powerful, or more versatile, than some of the linear models like logistic regression and linear regression: you can actually increase their capacity. In a decision tree you can increase the number of layers, and in a random forest you can increase the number of trees — and we'll see other options as well. So based on the data you have and the kind of overfitting or other problems you're running into, you can tune a random forest really well. Just like decision trees, random forests have several hyperparameters; in fact, many of them are simply passed down to the underlying decision trees. We'll look at some of them, not all, and you can learn more on the scikit-learn documentation site — excellent documentation, really well explained and approachable. It lists all the parameters you can set for your random forest model, with a detailed description of each, and you can also check it anytime from the notebook by using the question mark.
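As a quick reference before we start tuning, the feature-importance dataframe described above might look roughly like this (a sketch; the exact notebook code isn't shown in the transcript):

```python
import pandas as pd

# Pair each column name with the forest's combined importance score.
importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.head(10))
```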
So let's create a base model which we can compare against the tuned hyperparameters. Before tuning anything, let's just create a base RandomForestClassifier with random_state=42 and n_jobs=-1. These two arguments are something we'll keep fixed throughout, so we know that any changes in the results come from the actual hyperparameters and not from randomness. We fit that to our training data, and that gives us the base model, from which we compute the base training accuracy and validation accuracy. We haven't done anything yet — this is just the baseline we're setting up for comparison. In the base model, as we saw earlier, we have 99.9% training accuracy and 85.59% validation accuracy. The first hyperparameter we'll look at is n_estimators. This argument controls the number of decision trees in the random forest. We know that a single decision tree will badly overfit the data, and this is one of those parameters where increasing the number of estimators increases the randomness in the model — and a general rule to keep in mind is that randomness helps reduce overfitting: the more randomness you bring into your model, the less overfitting you will have. The default value is 100. One thing you can do is simply experiment with other values. For example, here is a model with 10 estimators, that is, 10 decision trees. Ten decision trees obviously train a bit faster, and it turns out you get a training accuracy of 98.7% — not quite 99.99 — and a validation accuracy of 84.4%. That's not too far from 99.99 and 85.59, although the validation accuracy is a little low, so it seems you do need a certain number of estimators; a small number will probably not give you a very good result. Next, here we have 500 estimators — we've gone from the default of 100 to 500 — and that's going to take a minute or so to train, because it's five times as many decision trees. Maybe I should have tried a smaller number, but the general rule about n_estimators is: use as few estimators as you need. If you can do the job with 100, and going from 100 to 200 isn't meaningfully improving your accuracy or whatever your metric is, then stay with 100. If going from 100 to 200 gives you a reasonable increase, go to 200. A good way to explore n_estimators is by doubling: try 200 and see if that helps, then 400, then 800. At some point the benefit from doubling becomes too small and the cost of training becomes too high. With 500 here, it turns out I get to about 85.7%, which is an increase of a little over 0.1% from 85.59.
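A sketch of the base model and the two n_estimators experiments described above (training times will vary with your hardware; variable names assumed as before):

```python
from sklearn.ensemble import RandomForestClassifier

# Baseline: 100 trees (the default).
base_model = RandomForestClassifier(random_state=42, n_jobs=-1)
base_model.fit(X_train, train_targets)
print('Base:     ', base_model.score(X_train, train_targets),
      base_model.score(X_val, val_targets))

# Fewer trees: faster, but usually a little worse.
model_10 = RandomForestClassifier(n_estimators=10, random_state=42, n_jobs=-1)
model_10.fit(X_train, train_targets)
print('10 trees: ', model_10.score(X_train, train_targets),
      model_10.score(X_val, val_targets))

# More trees: slower, with diminishing returns.
model_500 = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
model_500.fit(X_train, train_targets)
print('500 trees:', model_500.score(X_train, train_targets),
      model_500.score(X_val, val_targets))
```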
That's about a 0.13% increase, to be precise. Now, do I care enough about a 0.1% accuracy improvement? In some cases you might — if it's some kind of financial model, that 0.1% might make you more money or improve your estimate of risk, so if you have the hardware to spare and don't mind training 500 estimators, go for it. But if you're more concerned about 85% versus 90%, I wouldn't bother too much with a 0.13% increase that costs five times as many decision trees. There are diminishing returns on the number of estimators. One thing I will say is that when you have really large datasets — millions of rows — you will often go with hundreds or sometimes even a thousand estimators; that's not uncommon. The default in scikit-learn used to be 10 estimators, and it was bumped up to 100 a few years ago because datasets have gotten much larger. So here's an exercise for you: vary the value of n_estimators — doubling it each time is a good strategy — plot the graph of training error and validation error, and try to figure out the optimal value. n_estimators is unique in the sense that it does not really add complexity to the model: it brings in more randomness, and ultimately you're averaging the results of the individual trees, so no single tree gets bigger. You can therefore keep increasing n_estimators to a fairly large value without worrying about overfitting, unlike some other hyperparameters such as max_depth and max_leaf_nodes. max_depth and max_leaf_nodes are passed directly to the underlying decision trees, so we've already seen both of them in the context of decision trees. When you pass max_depth or max_leaf_nodes to a random forest, the limit gets applied to every single decision tree in it. By default no maximum depth is specified, which is why every tree has close to one hundred percent accuracy on the training data, and you can specify max_depth to reduce overfitting: if you observe that your random forest is overfitting — the training error is going down but the validation error is going up — then limiting the depth may be a way to reduce the complexity of the model and get to a better position. So let's try that out. We want to train random forests with a certain max_depth, and maybe try two or three values. Because this is getting a bit repetitive, I've written a simple helper function called test_params, which takes a bunch of parameters, passes them to a RandomForestClassifier, fits it on the training data, and returns the model's score on the training data and the validation data. What's this double star? When you call test_params, you can call it like this: max_depth=5, max_leaf_nodes=1024, and maybe n_estimators=1000 as well.
When you call test_params with multiple keyword arguments, **params captures all of them, and they are then passed along with the same names to the RandomForestClassifier: max_depth=5, max_leaf_nodes=1024, n_estimators=1000 and so on. This is some special Python syntax — if you want to look it up, search for "Python **kwargs". All of this is simply to make testing easier: if I call test_params(max_depth=5), it sets up a RandomForestClassifier with random_state=42, n_jobs=-1 and max_depth=5, fits it on the training data, and gives us the training and validation scores. So let's test max_depth=5. Remember, the default is no max depth — unbounded — so at a max depth of 5 we have severely limited the capacity of the model, which is why the validation accuracy falls to about 82%. On the other hand, when we set a max depth of 26 — let's give that a second to run — the accuracy actually increases to 85.70%. The base accuracy was 85.59%, so that's a meaningful improvement of about 0.1%, and all we've done is reduce the depth of the trees, which also makes them faster to train. Similarly, here are some experiments with max_leaf_nodes. Here I've set it to two to the power of five, which is 32, and with a maximum of 32 leaf nodes we can only get to about 83.2% accuracy. If instead we allow two to the power of 20 — which is about a million leaf nodes — the trees will never actually get that large anyway, because they only go to a depth of about 40 or 45, so this is effectively unbounded, and we get about 85.6%. And here we have the base result for comparison, without any of these hyperparameters: 85.59% for the base, 85.65% with one setting and 85.70% with another. So maybe there is a combination of max_depth and max_leaf_nodes that gives you an even better result — it's worth trying out different values of both. The thing to remember is that the optimal value is somewhere in between: if the limit is too low, the model is not powerful enough; if it's too high, the model is too powerful and starts to memorize and overfit the training data. Your job as the machine learning practitioner is to find that sweet spot, and the way to do that is by trying multiple values — increments of five or ten, or simply doubling the max depth and max leaf nodes each time. And just as we did above for a single decision tree, you can do the exact same thing here.
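The helper and the calls described above might look roughly like this. The transcript doesn't show the exact code, so treat this as a sketch; the data variable names are assumed as before.

```python
from sklearn.ensemble import RandomForestClassifier

def test_params(**params):
    # Train a random forest with the given hyperparameters (captured via **params)
    # and return its accuracy on the training set and the validation set.
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, train_targets)
    return model.score(X_train, train_targets), model.score(X_val, val_targets)

print(test_params(max_depth=5))
print(test_params(max_depth=26))
print(test_params(max_leaf_nodes=2**5))
print(test_params(max_leaf_nodes=2**20))
```

Keeping random_state and n_jobs fixed inside the helper means any difference between two calls comes from the hyperparameters being tested, not from randomness.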
You can vary the value of max_depth, plot the graph of training error against validation error, figure out the optimal value that way, then do the same for max_leaf_nodes, and then try combining the two and see if that helps. Next up, we have max_features. Instead of considering all the features for every split — and we have a lot of features: if we check X_train's columns, there are 119 of them — we can specify that only a randomly chosen fraction of the features should be considered when figuring out a split. Here's what I mean by that. You have your dataset — your dataframe — with a bunch of columns. Typically, when a decision tree is created, you look at all the columns: for each column you check all possible splits, and you pick the best split overall. That's how each decision is made. Now, the trouble with a random forest is that if you train two decision trees using this exact same logic of checking all the columns and all possible splits, the first split — say, "a > 7" — is going to be the same in both trees. And if the first splits are the same, it's likely the next splits will be the same, and the ones after that too. So you're not really training multiple decision trees; you're training a hundred copies of the same decision tree, and you gain nothing from randomness and the averaging out of errors. What you do instead is specify that, for each split, the tree randomly picks only a fraction of the columns, tries all the possible splits on just those columns, and picks the best among them. Maybe for the first split it randomly picks four columns, finds that one of them gives the best split, and splits on that; for the next split it randomly picks another subset of columns and splits based on those. Meanwhile, another decision tree being trained in parallel in another thread is picking its own random subsets of columns. So every split, whether it's in the same tree or across different trees, is randomized, because it only uses a fraction of the columns. It may seem counterintuitive that doing randomized splits using a fraction of the columns works better than doing "perfect" splits where you look at all possible columns and all possible splits — but remember, your objective is not to fit and memorize the training set. Your objective is to generalize, to learn general trends, and one way to force the model to learn general trends is to drop random information here and there while training.
Skipping some of the columns forces the model to learn general trends rather than specific relationships tied to specific values in specific columns. Here are the values you can specify for max_features: "auto", "sqrt", "log2", an integer, or a float. The interesting thing is that the default value is "auto", and for a random forest classifier "auto" means the square root of the number of features. We have 119 features, and the square root of that is around 11, so by default every split only looks at about 11 randomly chosen features, not all 119. That is how each decision tree in the forest ends up different — this is the randomness we mentioned earlier. While it may seem counterintuitive, choosing all the features for every split of every tree would just lead to identical trees: the entire random forest would become copies of the same tree, it would overfit badly to the training set, and it would not generalize well. If you don't want to go as aggressive as the square root, you can specify an actual number — say exactly 10 or 20 features each time — or a fraction, say 30% of the features for each split, or "log2", which takes the logarithm to the base two of the number of features. These are all just different ways of specifying what fraction of features to consider. And remember the overfitting curve — it comes in here as well. If you pick too few features, the model isn't powerful enough: if you only allow one randomly chosen feature per split, the tree isn't really learning, it's just forced to split on whatever it gets, so you end up on the underfitting side. If you allow all the features, you end up on the overfitting side. Somewhere in between is the sweet spot. In practice — and this is likely the result of research that compared the options — the square root has been found to be a good default fraction of features to pick. That doesn't mean it's best for every dataset, so you should experiment, because for your dataset the best value may be different. Here I'm going to call test_params with max_features="log2" — that's log to the base two of 119, which is about seven features. I'm also testing 3 features, 6 features, and a fraction of 0.3. If you specify an integer, it picks exactly that many features; if you specify a fraction between zero and one, it picks that percentage of features — and by features we mean columns. So here we're looking at six randomly picked columns for every split, here three columns, and here is the base accuracy of 85.59%, where each split looks at about 11 features.
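Using the test_params helper sketched earlier, the max_features experiments just described might look like this:

```python
# Compare a few ways of limiting how many columns are considered per split.
print(test_params(max_features='log2'))  # about 7 of the 119 columns
print(test_params(max_features=3))       # exactly 3 columns per split
print(test_params(max_features=6))       # exactly 6 columns per split
print(test_params(max_features=0.3))     # 30% of the columns per split
```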
So it turns out that, among these, the base accuracy is the best. But now what you can try is maybe 20, and see if 20 gives you a better accuracy than the base. I hope it does, but I don't know — and that's the point: there's no real way of telling, other than actually trying it on the data, whether the default is good enough or whether you need to go to the left or right of the default. So once again, find the optimal value of max_features on this dataset by varying it, maybe in doubling steps: two, four, eight, 16, 32. And yes — it does work: when we increase max_features to 20, from the square root of 119 which is about 11, we get to 85.75% accuracy rather than 85.59%. A 0.15% increase is reasonable — it doesn't look like we're getting to 90% anytime soon, so it's a reasonable gain to take. All right, let's look at a few more and then we'll wrap up, but I hope you're starting to see the pattern: you want to understand what each hyperparameter does, understand where it sits on the overfitting curve, and then experiment to figure out the right value for it. The next two are min_samples_split and min_samples_leaf. By default, unless you have other limits like max_depth, the decision tree classifier tries to split every node that has two or more rows of data inside it. For example, we start with the entire dataset, a split sends maybe 30% of the rows one way and 70% the other way, then we split again, and so on — and by default the tree keeps splitting until it reaches leaf nodes with just one or two rows of data inside them. What you can do to regularize the tree — yet another way of reducing its complexity — is to say that a node should be split only if it has more than, say, seven rows; if a node has fewer rows than that, don't split it. The other thing you can control is the leaves that a split creates: if a split would produce leaves of size one and six, you can specify a minimum leaf size, say three, so that the split is performed only if each resulting leaf has at least that many rows. These are two closely connected settings: one is the minimum size a node must have before a split is even considered, and the other is the minimum size of the leaf nodes created when you split a node. By default, any node with two or more rows can be split and the minimum leaf size is one — which is maximum overfitting, as much as possible. So let's change that: let's set min_samples_split to three and min_samples_leaf to two, so a node will be split only if it has at least three rows.
Actually, let me make that five instead. So a node will be split only if there are at least five rows of data inside it, and the split will be performed only if each of the leaves it creates has at least two rows of data. It turns out that using a min_samples_split of five and a min_samples_leaf of two gives us 85.56% accuracy — a slight reduction. And if you increase these values further, you're saying that nodes must be larger and larger, so you're reducing the power of the tree. Here, for example, I've set min_samples_split to 100 — nodes will not be split if they have fewer than 100 rows — and min_samples_leaf to 60, so a split will not be performed if the leaves it creates have fewer than 60 rows each. That definitely reduces the power of the model, so we could end up on the underfitting side of the curve. A very high value for the split size or the leaf size puts you in the low-complexity space; a very low value, or the defaults, puts you in the high-complexity space; so somewhere in between — maybe around a split size of 10 and a leaf size of 3, something like that — is the best fit. That's again something you have to explore. Okay, so that's min_samples_split and min_samples_leaf. Next up, just a couple more and then we're done. One is min_impurity_decrease — yet another way of controlling splits. Consider a node in the decision tree with a certain Gini value, say 0.7, and suppose its most optimal split produces two nodes with Gini values of 0.68 and 0.69. There hasn't really been a significant reduction in the Gini value: we went from 0.7 to 0.68 and 0.69. If even the most optimal split only leads to a negligible reduction in the Gini value, then maybe the split is useless — it just adds more complexity to the model and causes overfitting. So you can specify that a node should be split only when the decrease in impurity is higher than a certain value. For example, I could require that a split reduces the Gini value by at least 0.1 — so this node would be split only if the Gini value goes from 0.7 down to 0.6 or lower. Now, 0.1 is a very high threshold; you would normally use something much lower, but it illustrates the idea. That's the purpose of min_impurity_decrease. By default it is zero, so any split is allowed, even one with zero reduction in the Gini value, but if you want to control it, you pass in a value. And once again — where does this lie on the overfitting curve? Whenever you see a hyperparameter, you should start thinking about the overfitting curve. A minimum threshold of zero means we can split without worrying about how much the Gini index improves, which is maximum complexity.
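Before continuing with min_impurity_decrease, here is roughly what the min_samples experiments above might look like, again using the assumed test_params helper:

```python
# Require larger nodes before splitting, and larger leaves after splitting.
print(test_params(min_samples_split=5, min_samples_leaf=2))
print(test_params(min_samples_split=100, min_samples_leaf=60))
```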
Coming back to min_impurity_decrease: a very high value for the threshold, on the other hand — say requiring a Gini reduction of 0.1 for every split — makes splits very hard, which puts you in the low-complexity zone. Somewhere in between is the value that gives you the maximum validation accuracy — and remember, that is our goal. Here I recommend trying out powers of 10: 10 to the minus 7, 10 to the minus 6, 10 to the minus 5, and so on. In my experiments, 10 to the minus 6 was where I saw the best value: somewhere between zero and 0.1, as you go through powers of 10, you will find the min_impurity_decrease that gives you the maximum validation accuracy. You can see here that when we set min_impurity_decrease to 1e-6, we get 85.68% validation accuracy versus the base 85.59%. All right, the last two: bootstrap and max_samples. One trick random forests use, which I told you about, is picking a random selection of features — a random selection of columns. Another trick they apply is not to use the entire dataset for training each decision tree; instead, they use a technique called bootstrapping. Here's how bootstrapping works. Say you have a training dataset with a bunch of rows — row 1, row 2, row 3, and so on, say 10 rows — along with their columns. Typically, to train a decision tree, you would feed this entire dataset into the decision tree algorithm, and it would figure out the optimal splits. Bootstrapping works slightly differently: you randomly pick one of the 10 rows and copy it into a new dataset — only a conceptual copy, not an actual copy in memory. Then you forget that you picked that row, randomly pick a row again, and copy that one too. Then you forget again, pick again, and so on — and maybe at some point you pick a row you've already picked. That's fine: that row now appears twice in the new dataset. The result is called a bootstrap, and this kind of sampling is called random picking with replacement: you pick a row, but you don't keep track of the fact that you've already picked it, so you allow yourself to pick it again. What does this do? By the end of the process, when you have 10 rows again, maybe two of them got picked twice, some got picked once, and some did not get picked at all — and this bootstrap is then used to train one decision tree. One bootstrap trains one decision tree, and different bootstraps will drop different rows.
So each decision tree is effectively trained on only a fraction of the data rather than all of it: even though we pick 10 rows, we pick with replacement, so some rows are potentially missing. Second, rows that get picked twice or three times effectively get a double or triple weight. Initially every row has equal weight, but for a particular decision tree, certain rows randomly end up weighted higher because they appear multiple times in its bootstrap. And this happens randomly, independently, for each of the hundred or hundreds of decision trees — that's where all this randomness really helps. Again, this is one of those techniques that has been tested empirically and shown to work well; papers have been written on it, and the authors of scikit-learn have kindly made it the default within random forests. So by default, bootstrap is set to True. If you set bootstrap to False, every decision tree is trained on the entire dataset — exactly one copy of each row. I'll let you decide which has more randomness and which will generalize better. This whole system of picking a bootstrap and training a decision tree, picking another bootstrap and training another decision tree, and so on, is sometimes called bagging. One more thing you can do is control the size of each bootstrap. What I described above was: you have 10 rows of data, and you keep picking randomly with replacement until you have 10 rows, duplicates and all. But you can also decide to pick, say, only seven rows with replacement — this is sometimes called the bag of little bootstraps — or only a 20% fraction. You still train using bootstrapping, but each sample is only a certain size. By the way, without bootstrapping, the accuracy here turns out to be about the same; it didn't make a big difference on this dataset, but on other datasets it might. What you can do, when bootstrap is set to True, is specify max_samples: instead of picking a bootstrap the size of the training set, which is about 98,000 rows, maybe you pick a bootstrap of just 90% of the rows — around 88,000 or 89,000 rows — still with replacement, using the bootstrapping technique. That can help reduce overfitting: if each decision tree is trained on a smaller fraction, there is more randomization across trees and therefore more generalization. You can see here that we went from 85.59 to 85.65 — a small increase. Once again, on the overfitting curve: if max_samples is set to 1, meaning you're picking as many rows as possible, you're on the high-complexity, overfitting side.
If max_samples is set close to zero — say 0.01, so you're only picking one percent of the rows for each decision tree — then each decision tree is very weak, so you're on the low-power side. Somewhere in between is the optimal value. Okay, the last one, and this is slightly tangential: class_weight. One thing about this particular dataset is that if you look at the targets and check the value counts — train_targets.value_counts() — you have about 76,000 "No"s and just 22,000 "Yes"es, and if you divide by the length of train_targets, that's about 77.8% "No" and only 22% "Yes". So there aren't many "Yes" rows, and you may want to give a higher weight to the rows whose class is "Yes", simply because there are relatively few of them — you want your decision trees to consider them twice or three times as important in their calculations. You can do that by specifying class_weight when you create the random forest. You can give an explicit weight per class — say a weight of one for the "No" class and two for the "Yes" class — and that gets used in the Gini calculation and in the splitting; the underlying idea is simply that more weight is given to the "Yes" class. You can compare that with the base, which gives equal weight to every row regardless of class, or you can pass the string "balanced" — check the documentation — in which case scikit-learn automatically figures out that there are roughly three to four times as many "No"s as "Yes"es and applies the inverse ratio as the class weight, so "Yes" gets a weight of around 3 or 3.5 and "No" gets a weight of one. That may not be ideal — you may not want to weight it that heavily — so something more moderate, where you manually give a slightly higher weight to the underrepresented class, might work best. You can see here that we go from the base of 85.59 to 85.7 and 85.65 simply by applying a class weight to "Yes". And once again, with class weights there is an optimal value somewhere: a ratio of one to one is not optimal and one to four is not optimal; maybe somewhere around one to two is optimal for this particular balance. This is the central theme in machine learning: you will have hyperparameters, and one end of a hyperparameter's range leads to a bad result on both training and validation because it reduces the power of the model too much, while the other end leads to a very low training error but a high validation error, because the model becomes powerful enough to overfit.
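Here is roughly what these last few experiments might look like with the assumed test_params helper. The "No"/"Yes" strings are an assumption about how the target column is encoded in this dataset; adjust them to match the actual class labels.

```python
# Impurity threshold, bootstrapping options, and class weights.
print(test_params(min_impurity_decrease=1e-6))
print(test_params(bootstrap=False))               # train every tree on the full data
print(test_params(max_samples=0.9))               # bootstraps of 90% of the rows
print(test_params(class_weight={'No': 1, 'Yes': 2}))
print(test_params(class_weight='balanced'))       # inverse-frequency weights
```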
And your job will always be to find the middle path — the optimal fit for the data, so that the model learns enough but does not overfit to the training data. Normally you'll be modifying all of these things together: you set up a RandomForestClassifier with, say, the number of estimators, max_features, max_depth, and class_weight all specified, run some experiments, and figure out which direction you need to go in. There's no single way to just look at a number and tell whether you should increase or decrease a hyperparameter; what you should look at is what happens when you increase it and what happens when you decrease it. Keep the overfitting chart in mind — this is how the training loss behaves and this is how the validation loss behaves — and use your experiments to figure out which region you're in. If you're in the underfitting region, you can increase the complexity of the model to get closer to the optimal fit, because there's still a possible reduction in the validation loss; if you're in the overfitting region, you need to reduce the complexity, because the validation loss can actually be lower with a less complex model. So let's try an example, and then you can play around with it — apply exactly this technique, with the overfitting curve in your head, and see whether all of these random forest hyperparameters together actually give us something useful. Here I've picked 500 estimators, a maximum of 7 features per split — maybe I should change that to 20 — a maximum depth of 30, and class weights of 1 and 1.5. Some of these settings bring in regularization, and some — like increasing max_features beyond the default — increase the power of the model, so you have to think about which side of the curve each one pushes you toward. One exercise for you is to modify all of these and see what's the best accuracy you can get to, so play around with these hyperparameters. Let's give this a moment to train and score... okay, 85.61%. In my previous run, with slightly different hyperparameters, I got to about 85.7%. Either way, we've increased the accuracy from 84.5% with a single fine-tuned decision tree to around 85.7% with a well-tuned random forest. Now, depending on the kind of dataset and the kind of problem, you may or may not find this improvement valuable, and you may not even see a significant improvement from hyperparameter tuning at all. It's not the case that tuning the right hyperparameters will get you to one hundred percent validation accuracy — that's almost never going to happen — and this could be due to one of the following reasons, in the order in which you should consider them when your model is not improving. The first is simply that you may not have found the right mix of hyperparameters.
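Before going through those reasons, here's roughly what the combined model described above might look like. The exact values are taken from the description as far as it's recoverable, and the "No"/"Yes" class labels are an assumption — treat this as a starting point to experiment with, not the definitive configuration.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=500,
    max_features=7,                      # try 20 as well
    max_depth=30,
    class_weight={'No': 1, 'Yes': 1.5},  # assumed class labels
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, train_targets)

print('Training accuracy:  ', model.score(X_train, train_targets))
print('Validation accuracy:', model.score(X_val, val_targets))
```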
Maybe some of your hyperparameters are still on the low-complexity side, and some are on the high-complexity side; some are overfitting and some are underfitting, which is the opposite of overfitting. So you need to bring all the hyperparameters into the right range, which means you may need to keep experimenting to improve the model. That's one possible reason why your validation accuracy is not improving. The second possible reason is that you may have reached the limits of the modeling technique you're currently using, which here is random forests. Just as logistic regression has the limitation that it cannot capture nonlinear relationships, random forests have certain limitations too, primarily because we're using decision trees and averaging across decision trees, so certain kinds of relationships simply don't get captured. For example, in a decision tree you're always making binary decisions. So even in a very simple case where the target is just a linear multiple of one of the features, your decision tree would have to grow a huge tree, because a decision tree cannot capture that kind of proportional relationship; "the target is 10 times a certain feature" just doesn't get expressed directly (there's a toy illustration of this after this paragraph). So random forests may not work well in such a case. When the relationships are linear, linear models work well and tree-based models may not; and when the relationships are not linear, tree-based models are likely to work better. So we're kind of peeling back the layers: the most immediate suspect is the hyperparameters, but when you've tried and tried and concluded that hyperparameters are not the problem, the next suspect is the modeling technique, and maybe you should try a different one. And beyond that, it's possible that you've simply reached the limit of what can be predicted with the given amount of data, and you need more data to improve the model. Think about it: we have about ten years of data. If we had data for only one year, would the model be better or worse? Worse, because we'd have less data. If we had twenty or a hundred years of data, it would probably be better. So whenever you get more data, it's possible to train better and better models, and at some point you reach the limit of what you can predict with the amount of data you have. That's one possibility. Beyond that, suppose you've gathered as much data as possible and you're still not seeing an improvement: then you may have reached the limit of how well rain tomorrow can be predicted using just the given weather measurements. We only have around 20 columns, and those 20 columns are somewhat arbitrary measurements that humans have chosen to record. There is no physical law that says the rain tomorrow depends on just these 20 columns and can be predicted accurately from them. So maybe we need more features, more measurements, or more accurate measurements to improve the model further. Maybe we need rainfall not just in this particular region but within a larger radius, winds within a larger radius, ocean temperature, et cetera.
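Here's a toy illustration of that tree limitation (this example is not from the course notebook, it's just a sketch): when the target is exactly 10 times a feature, a linear model recovers the relationship perfectly, while a depth-limited decision tree can only approximate it with a staircase of constant predictions.

```python
# Toy sketch (not from the original notebook): linear vs. tree model on y = 10 * x.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 100, 500).reshape(-1, 1)
y = 10 * X.ravel()

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # only 8 leaves -> 8 constant predictions

print(linear.predict([[55.5]]))  # exactly 555.0
print(tree.predict([[55.5]]))    # a step value that is only roughly close to 555
```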
So in many cases, adding new features, new columns, definitely helps with modeling. And in a lot of cases we can also generate new features from existing ones; this is called feature engineering, and we'll look at it briefly next time. So adding new features can help. And finally, it's possible that rain tomorrow is an inherently random or chaotic phenomenon which simply cannot be predicted beyond a certain accuracy. It may be that you cannot predict whether it will rain tomorrow beyond, say, 90% accuracy, given any amount of data, any number of additional measurements, and any modeling technique. It's just not possible. And that's why we have validation sets and test sets: to remind us that it's not all about the training data. You have to make predictions in the real world, and your models will fail beyond a certain point. So the takeaway is that ultimately all models are wrong, but some models are useful. If you can rely on the model we created today to make a travel decision for tomorrow, or if airlines can rely on it to decide whether and how many flights to schedule tomorrow, and that saves them money, then the model is useful, even if it's wrong one time out of ten. On the other hand, if you cannot rely on it, it's practically useless. So all models ultimately come down to their practical utility. Whenever you ask "what is a good accuracy?" or "is this model good enough?", the real question is: what purpose are we going to use it for, and in its current state, is the model good enough for that purpose? Okay, so that's something to think about as the bigger picture of machine learning. That's all we have for today, except for one last thing: you should also compute the accuracy of the model on the test set, and now you'll see what I mean by the train, validation, and test split. Remember, on the validation set we got to 85.68%. What's happened is that we have been manually testing so many different models that we may have overfitted our choice of hyperparameters to the validation set. Think of yourself as a machine learning algorithm that's picking hyperparameters: you are now overfitted to the validation set. The test set is there to give you, the human, the reality check that maybe you shouldn't over-optimize against the validation set; maybe you should think in a more principled way about what will make the model generalize better and choose that. That's why you keep a test set for the very end, and you use it only at the very end to report the final accuracy of your model. Okay. And then, just like last time, you can make predictions on single inputs. Here's an input, and we've written a helper function to generate a prediction for it, and we get something like a 77% probability that it's going to rain tomorrow given these inputs. And you can save and load the model: take the model, the imputer, the scaler, the input columns, target columns, et cetera, save them into a joblib file, load them back, and it should work exactly the same way. So that's all we have for today.
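Here's a minimal sketch of that save-and-load step. The dictionary keys, the file name, and the variable names (imputer, scaler, encoder, and so on) are assumptions based on the preprocessing described earlier, not the exact code from the notebook:

```python
# Sketch: bundle the model together with everything needed to reproduce its predictions.
import joblib

aussie_rain = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols,
}
joblib.dump(aussie_rain, 'aussie_rain.joblib')

# Later (even in a different script), load it back and use it the same way.
saved = joblib.load('aussie_rain.joblib')
# saved['model'].score(test_inputs, test_targets)
```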
We looked at how to download a real-world dataset, how to prepare a dataset for training, how to train and interpret decision trees, and how to train and interpret random forests. We looked at overfitting, hyperparameter tuning, and regularization, which is one of the most central problems in machine learning and something you'll spend a lot of your time breaking your head over. Making predictions on single inputs is something we looked at last time and this time as well; it's pretty straightforward, you just set up a function to do it. And we introduced a bunch of new terms, so you should start becoming familiar with them: you'll see them in tutorials online, you'll probably get asked about them in interviews, so try to build a good intuitive understanding and maybe write your own one-line description of each. That's more important than any description we can give you, so definitely try to write a one-line description for these terms. If you want to go deeper, check out the resources linked here. Decision trees and random forests can be used not just for classification but also for regression problems, for example predicting the price of a house, and there are some examples linked that you can check out. The topic for today is how to approach machine learning projects. We're going to look at a step-by-step process you can apply to ML problems, all the way from conceptualizing and identifying a problem, to data cleaning and preparation, building baseline models, model training and evaluation, then regularizing your models and performing some ensembling, and finally interpreting and presenting your results. So that's roughly what the process looks like, and here are the steps in a little more detail. The first step is to understand the business requirements, because machine learning models are always built to serve some business requirement, and to understand the nature of the available data, which tells you how good your model can possibly be, or whether it's even possible to build a model at all. The second step is to classify the problem as a supervised or unsupervised learning problem. So far we've only looked at supervised learning, but we'll talk about a couple of examples of unsupervised learning too. And if it's supervised, you classify it further as a regression problem or a classification problem. Next, you download, clean, and explore the data, and possibly create new features that may improve models. This process is called feature engineering, because the data may not arrive in a format that makes it easy for a model to learn from, and we'll look at some ways of doing it. Then you create a training, validation, and test split and prepare the data for training machine learning models, which covers things like imputing missing data, scaling the data, and encoding categorical data. The next step is to create a quick and easy baseline model to evaluate and benchmark future models against. This is a very important step in any large machine learning project: you want to know what the baseline is, so you don't waste time on models that aren't any better than a very simple approach. After that, you pick a modeling strategy, train a model, tune hyperparameters, and try to achieve the optimal fit for that strategy.
Then you repeat that with different strategies, so you can try a bunch of different modeling strategies and figure out which one works best for the problem. Next, you experiment with and combine results from multiple strategies, if possible, to get a better result, and you may also have to do some more regularization and hyperparameter tuning. And finally, you need to interpret the model, study individual predictions, and present your findings. That's the most important piece, because ultimately your model is going to be used to make individual predictions, you will have to explain why your model gives the results it gives, and you'll have to present all of this to non-technical stakeholders. So we'll talk a little bit about that too. Okay. We're running this code on Google Colab, but you can also run it locally on your computer; just follow these instructions. First, let's install the required libraries and import them (there's a rough setup sketch after this paragraph): we're using NumPy for numerical computing, pandas for working with data frames, matplotlib, plotly, and seaborn for visualization, the Jovian library for recording snapshots of this notebook, the opendatasets library for downloading data, and scikit-learn, which contains all our machine learning algorithms. Let's import all of them, and we also set certain styles so that our graphs look nicer. Now let's talk about step one: understanding the business requirements and studying the nature of the data. The first thing to appreciate is that most machine learning models are trained to serve a real-world use case, so it's very important to understand why you're building a model, what the objectives are, and what data you have available before you start building. The best way to do this is by asking questions. Start by understanding the big picture of why a company cares about a particular machine learning problem, or why somebody is asking you to build a model, and figure out whether an ML model is even required in the first place, or whether you can do something simpler and more effective. Okay. Typically you're either given some data along with some documentation and maybe a problem statement, or you have to talk to stakeholders and figure things out. In either case, whether you have written requirements or you're talking to somebody, you should try to ask some of these questions. For example: what is the business problem you're trying to solve using machine learning? We'll look at a couple of datasets and ask these questions in that context, but in general it could be increasing sales of a particular product, reducing churn (the number of people leaving your service or subscription), increasing conversion of some kind (getting more of the people who land on a page to buy a certain thing), or forecasting, where you're trying to predict a value in the future, for example whether or not it will rain tomorrow, or what the sales at a particular location of your business are going to be. So you need to ask what the business problem you're trying to solve is.
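Going back to the setup step mentioned above, here's a rough sketch of what that install-and-import cell might look like (package versions aren't pinned, and the styling values are just an assumption):

```python
# Install the libraries used in this lesson (uncomment when running for the first time):
# !pip install numpy pandas matplotlib seaborn plotly opendatasets jovian scikit-learn --quiet

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import opendatasets as od

# Styling so the charts look a little nicer.
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
```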
And if you don't know what that problem is, you won't be aligned with the business and you won't really be able to create the best model for the problem. The next question is to understand why we're interested in solving this problem and what impact it will have on the business. Typically there are two or three reasons people turn to machine learning. One is to automate some manual task that either requires a lot of people or requires special expertise, and is therefore expensive. Another is to predict the future, where there's no easy way to do that except by looking at all the data we have; it's sort of an extension of having a statistician, except you want to automate the system, so that instead of a statistician having to produce a lot of different forecasts, you can set up a model and the model gives you forecasts. And typically there's some kind of revenue impact on the business, or if not revenue, then some impact related to customers or to other stakeholders like partners. So you need to understand the reason for solving the problem and the impact it will have on the business, because all of this informs the kind of models you train, the level of interpretability you'll require, and the level of complexity you can use. The next thing you want to know is how the problem is currently solved without any machine learning tools. In a lot of cases, just studying this will show you that there's a better way to improve the existing process, and maybe you don't really need a machine learning model at all. You should also weigh the cost of building a machine learning model, in terms of time and the people involved, against the potential impact it will have on the business. If the cost of building the model is higher than the potential savings or earnings it can create, then maybe it's not that useful to build it, and at a smaller scale it may be better to solve the problem with manual processes or some kind of intelligent estimate. Next, you need to understand who will use the results of the model and how it will fit into other business processes. Your model will not be used in isolation; it will be part of some other system, so you need to understand how it will be consumed. For instance, if another team is going to use your model as a yes/no decision point, you should probably create a classification model, but if they're going to look at the exact number, then maybe you should create a regression model. So the kind of model you train depends on how it's going to be used. That's all about the problem itself: why are we doing this, and what are we looking at? Next, you ask how much historical data we have and how it was collected, because it's important to understand the scale of the data and the nature of the collection. If it was entered manually, if there was some manual digitization involved, there may be errors. If it was collected through sensors, what calibration issues might those sensors have?
If it was collected from the internet, do we have the rights to use this data, and so on. These are also the places where you should think about bias and about whether it's ethical to use a certain kind of data; for example, is it ethical to use race to decide whether somebody should be given a loan? Next, ask what features the historical data contains. You get a list of columns and try to understand whether those features will help you build the kind of model you want to create. Very importantly, you need to check whether you have historical data for the values you're trying to predict. In all the models we've seen so far, we need the target column in the training dataset, so if you don't have the values you're trying to predict, the data isn't useful for machine learning yet, and the first thing you'll need to do is add labels, add the targets, before you can build a model. Then understand the known issues with the data: whether there are data entry errors, whether there's missing data (sometimes missing data is just replaced with a zero or a minus one), whether the data was intentionally clipped to a certain range, whether there are mixed units (one person reporting in kilometers, another in miles), and so on. These are all issues you have to anticipate; in general nobody will tell you about them in advance, so you'll have to ask people questions and figure them out. Then look at some sample rows from the dataset. If you're in a discussion with someone, just ask them to show you a few sample rows; once you can actually see the data on your screen, you can ask a lot of questions about it, and you'll want to figure out how representative that sample is of the entire dataset. So you need to be very curious about the data, and before you even get into modeling and EDA, you need to be curious about the business problem and ask as many questions as you can think of. The last thing is to figure out where the data is stored and how you will get access to it. Data can be stored in multiple places, sometimes you have to combine data from multiple sources, and sometimes access is split across multiple teams, so it's your job as a data science professional to make sure you have access to the right data. So those are some questions to get you started when working on a machine learning problem, but there are many more you can ask, and if you have any other questions you think should be on this list, feel free to add them in the chat. The main idea is to gather as much information about the problem as possible so that you have a clear understanding of the objective and the feasibility of the project. In a lot of cases, machine learning projects are simply infeasible, either because of an overly ambitious problem statement, where you're trying to do something that's really difficult, or because the data isn't good enough or the dataset isn't large enough. So before you even start a project, you need to understand its feasibility, and that's something you can only do once you've asked a lot of these questions.
Of course, you're not working in a professional setting right now, so how do you apply the same process to your own projects? Whenever possible, try to work with real-world datasets, meaning data obtained from real businesses, and Kaggle is a great source of those. If you go to Kaggle.com/datasets (we've talked about how to search for data on Kaggle), you can apply some filters: if you're looking for tabular data, select CSV; if you're looking for a large dataset, put in a size like 100 MB; then search for, say, "finance" if you're interested in financial data, and sort by most votes. So now we're searching for finance datasets larger than 100 MB in CSV format, and you get historical data for Bitcoin, which is real-world data, or stock histories for AMEX, NYSE, and NASDAQ, which is real-world data, or parking ticket data, which is real-world data too. So Kaggle is a great place to find real-world datasets. Another thing you should check out on Kaggle is competitions, which are also held on real-world data. For example, the Optiver Realized Volatility competition that's currently ongoing: the data is provided by a company called Optiver, they've set an objective you need to meet by building a machine learning model, and if you're among the top three you can earn tens of thousands of dollars. Now, you're probably not going to take part in competitions for the sake of the prize money, because it's really difficult and quite unpredictable, but it's a great learning process. So what we'll do is pick an older competition, the Rossmann Store Sales prediction competition, try to answer all of these questions for it, and then try to build a machine learning model. Here's the Rossmann Store Sales competition, and the objective is stated very clearly: forecast sales using store, promotion, and competitor data. So it's about sales forecasting. Let's just read this: Rossmann operates over 3,000 drug stores in seven European countries, and currently Rossmann store managers are tasked with predicting their daily sales up to six weeks in advance. You can see there's a human element involved here: somebody has to predict the sales, and based on those predictions they might order inventory and so on. Not only is it a manual process, it's also error-prone, because what formula is each manager applying to make these predictions? Everybody has their own approach. So Rossmann wants to standardize this, and they know that store sales are influenced by many factors: whether a promotion is running, competition between stores, holidays, seasonality, and locality. With thousands of individual managers predicting sales, the accuracy of the results can be quite varied.
So now we can start to answer some of these questions. What's the business problem? We want to predict sales. Why are we interested? Because there are thousands of stores and everybody makes predictions in a different way; if we could standardize it, we could save a lot of money, because we wouldn't end up with too little or too much inventory, our revenue predictions would be more accurate, and based on that maybe our cost allocations could be more accurate too. How is the problem solved currently without machine learning tools? By these managers, individually. At this point you could ask: how could you improve this without using machine learning? One simple thing you could try is to create some kind of formula, maybe just do some statistical analysis and come up with a formula that all managers can plug a bunch of variables into and get a result out of. Now, how do you create that formula? One way is to train a machine learning model; ultimately a machine learning model is just a formula that's a little more complicated. The other way is to do some analysis over the historical data and come up with something manually, and that's perfectly fine; in a lot of cases it works just fine. How much historical data do we have, and how was it collected? Let's go to the data tab. It says there is historical sales data for 1,115 Rossmann stores, and there are these files: train.csv, test.csv, sample_submission.csv, and store.csv. So apparently train.csv is the historical data we need to train our model on, and then we need to make predictions on the test file. The way it works on Kaggle is that the target values for the test set are hidden from you: you train your model on the training data, where you have both the inputs and the targets, and then you make predictions on the test data, where you only have the inputs. Then you make a submission with a prediction for each row in the test set, and Kaggle, through an automated system, compares your predictions with the real test targets, which are hidden from you, and gives you a score. That's how the leaderboard and the competition work, and we'll see that process by the end of this tutorial. You can make a submission while the competition is active, but even after it has ended you can still try it out. This is what the submission file looks like; there's a submission file format you need to follow. Finally, it seems there is some supplemental information about the individual stores, and these are all the columns you have. If you go down and check train.csv, you can see a few rows of data: it looks like we have nine columns, including the store, day of week, date, and sales. Sales is what we want to predict. There also seems to be a number of customers, and I'm not sure we can use that, because we don't know how many customers are going to come to a store in the next month. A good way to check what the targets are is to compare the training data with the test data.
If you check the test data, you have the store, the day of the week, the date, whether the store is open, whether it was running a promotion on that day, and whether that day was a state holiday or a school holiday. It seems we have neither customers nor sales, so we can't really use the customers column as an input to our model; we'd have to use the other columns, and sales is what we're trying to predict. Okay, so now we're getting a clearer picture. How much data do we have? Somewhere in the column section you should be able to find it: about 1.02 million rows. So across roughly 1,115 Rossmann stores we have over a million rows, which means we have about a thousand days of data per store, roughly speaking, and since each row represents one day for one store, that's about three years of data. You want to have all of these figures in your head before you start working on the data, because now you know you have about three years of data across a thousand-odd stores, and that should be a sufficient amount of information to build a sales prediction model for the next month or two. So that seems like a reasonable amount of data. Known issues and so on are described on the data page, so read the description, learn about the known issues, and make your own notes if you think there are issues like missing data. We just saw some sample rows. And finally, in this case the data is stored on Kaggle, but in other cases the data might be sent to you as a CSV file, or sit in a SQL database, or you might have to combine data from multiple sources: maybe the historical data is in a SQL database, but the store data is in JSON format somewhere, and you have to read both and combine them. That's the preparation that goes in, and typically it can take a few days to a week or two just to understand all of this, make sure you're happy with the initial data you're getting, make sure the objective is clear to you, and feel at least somewhat confident about approaching the problem. All right, so that was step one: all of this preparation and understanding of the problem. The second step is where things come back into the data science domain: now we slowly translate the business requirements into a specification for a data science and machine learning project. We need to classify the problem as a supervised or unsupervised learning problem, and if it's supervised, as a regression or classification problem. We've talked about these before, but here's roughly the landscape of machine learning and where it sits within the computer science ecosystem. Computer science is a vast field, and one piece of it is artificial intelligence: systems that take seemingly intelligent decisions. Artificial intelligence is a very broad term; it includes machine learning, but it can also include hand-coded systems.
For example, if you look at Amazon Alexa: Alexa answers a lot of questions, but in the background a lot of those answers are hard-coded. What it does is look for certain keywords, and based on those keywords it triggers specific actions. If a sentence you say contains the word "alarm", an alarm gets set for you; if it contains "play music", it plays some music. So that's not really machine learning. There is some machine learning in converting your voice to text, but once your instruction is available as text, it's largely a rule-based system that checks for the presence of certain words and reacts in a certain way. So artificial intelligence systems can be rule-based, coded manually rule by rule, or they can be machine learning models, which is what we're building here: we simply give the data to a model and it learns the relationship between the inputs we want to use and the thing we want to predict. Within machine learning, you have supervised learning and unsupervised learning. Supervised learning is where you know what you want to predict: for example, we want to predict the sales at Rossmann, or we want to predict whether a tumor is cancerous or not. Unsupervised learning is where we don't have any targets and we're primarily interested in learning the structure of the data. Say you have a lot of customer activity on your website and you want to figure out which customers are infrequent visitors, which are power users, and which lie somewhere in between: that becomes a clustering problem, where you don't know the answer for each customer beforehand, but you want to group customers into similar clusters and then study those clusters. That's unsupervised learning, which we'll touch on briefly towards the end of the course. Within supervised learning, where we have categorical or numerical target data, we have two kinds of problems: classification and regression. Classification is when we want to predict a category, for example taking information about a breast tumor and predicting whether it is cancerous or not. Regression is when we want to predict a number: for the Rossmann store sales prediction, you want to predict a dollar value, a continuous value, so that's a regression problem. Within unsupervised learning there are again several types of problems: clustering, where you split the data into a number of similar clusters and then study them; dimensionality reduction, where you take data with a lot of columns, maybe hundreds, and reduce it down to five, ten, or fifteen columns without losing the essence of the data; and another class of problems like association, which is more like similarity search or recommendation, for example recommending which movie you should watch next based on the movies you've watched and the movies watched by people who have watched the same movies as you.
So from people who have similar tastes to yours, you can pick a song or a movie that they have watched and you haven't, and recommend it to you. That's association, or recommendation, or essentially similarity search. So those are the main areas, and today we're just looking at regression and classification. Now, here's a question for you: what type of problem is store sales prediction? As we just discussed, because we're trying to predict a single number, it's a regression problem. And what type of problem is breast cancer identification? As we just discussed, that's a classification problem. Here's a third example: the mall customer segmentation dataset. You have some information about customers who visited a certain shopping mall: a customer ID, which I believe is unique and therefore not very useful for modeling, plus a gender, an age, an annual income, and some kind of spending score. There's nothing we want to predict here; all we want to do is figure out what the big clusters of customers are. Maybe there are some low spenders and some high spenders, maybe some millennials who spend a lot and some older people who spend less; we don't know. That's where we use a machine learning technique called clustering, which automatically splits the data into a number of clusters. We can decide how many clusters to create, and then we study those clusters to figure out what the common property is that makes each one a cluster. All right. Once you've identified the type of problem you're solving, you next need to pick an appropriate evaluation metric. Also, depending on the kind of model you end up training, the model itself will use a loss function, or cost function, that it optimizes during the training process. For example, if you're trying to predict whether a tumor is cancerous, the way you evaluate the results of your model is probably by looking at accuracy: if you give it a hundred inputs, how many of them does it predict correctly, based on the historical data? Or you might be interested in precision and recall. You may want very few false negatives, so that somebody who has cancer is not told they don't have it, or very few false positives, so that somebody who doesn't have cancer is not told they do. So based on the exact scenario, you have to figure out which evaluation metric to use. The important thing about evaluation metrics is that they are used by humans to evaluate the model, and in fact this is something you will discuss with the other stakeholders, with the business team, to arrive at an evaluation metric together. You promise them that you'll try to deliver a model that meets a certain pre-decided threshold for that metric, and they agree that once you deliver such a model, they'll try it out in their system. All right.
So evaluation metrics are for humans, but metrics like accuracy, precision, and recall are not directly usable by machine learning models during training; models typically need something different, because their requirements are around differentiability, continuous functions, and things like that. That's where loss functions come into the picture. Machine learning models use loss functions: for the breast cancer example, a model would use a loss function called cross entropy, which is a common loss function for classification problems, and it tries to optimize the cross entropy by changing its internal parameters. So that's the distinction to keep in mind, and here is a link you can check out: a survey of common loss functions and evaluation metrics, with eleven evaluation metrics and a number of loss functions, which also discusses the difference between the two. Typically, for regression tasks like predicting sales, a continuous value, the most common evaluation metrics are root mean squared error (RMSE), root mean squared logarithmic error, mean absolute error (MAE), which is a variation of RMSE where instead of squares and square roots you just look at absolute values, and the R-squared score. For classification tasks you have accuracy, error rate, precision, recall, the F1 score, something called MCC, balanced accuracy, log loss, and AUC-ROC. So check out what all of these mean. The good thing is that for all of them there are functions built into scikit-learn: whenever you have some predictions and some targets, you simply pass them both to the function and it gives you the value of the evaluation metric (there's a small usage sketch after this paragraph). So do check out that tutorial on evaluation metrics and loss functions. Now, what is an appropriate loss function and evaluation metric for store sales prediction? This is a regression problem, and for regression problems the loss function in most machine learning models is typically root mean squared error, or something closely related like R-squared; it really depends on the kind of model you're training. We'll probably also use root mean squared error as our evaluation metric, because it's simple enough to understand: it roughly says that on average your model is off by a certain amount. And if you have a lot of outliers and you don't want to give them too much weight, you can also consider mean absolute error. Again, you'll have to discuss the evaluation metric with the business, with the other stakeholders, and figure out whether outliers are something you should be solving for: if yes, use root mean squared error; if you don't care much about outliers, mean absolute error would be a good evaluation metric. Okay. Then, what are the appropriate loss function and evaluation metric for breast cancer identification? For most classification models the loss function is typically cross entropy, or for tree-based models it's the Gini score at each split.
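Here's a minimal sketch of that scikit-learn calling convention: you pass the true targets and the predictions to a metric function and get a number back. The arrays below are made-up toy values, just to show the usage.

```python
# Toy sketch of scikit-learn's metric functions (the numbers are made up).
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score

# Regression metrics
targets = np.array([200., 250., 300.])
preds = np.array([210., 240., 320.])
rmse = mean_squared_error(targets, preds) ** 0.5   # root mean squared error, ~14.14
mae = mean_absolute_error(targets, preds)          # mean absolute error, ~13.33

# Classification metric
acc = accuracy_score(['No', 'Yes', 'No', 'Yes'], ['No', 'Yes', 'No', 'No'])  # 0.75

print(rmse, mae, acc)
```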
And the evaluation metric is generally accuracy, precision, recall, or the F1 score, which again is something you'd discuss with the other stakeholders. Okay. Now, if we actually check the Rossmann store sales competition, you'll see an evaluation section, and it turns out that submissions are evaluated using the Root Mean Square Percentage Error (RMSPE), not plain RMSE. That's very interesting. In root mean squared error, you take the target and the prediction, take the difference between them, square it, average the squares, and take the square root. Here, instead of the absolute error, we're interested in the percentage error. So this is something we'll have to account for: while evaluating our models, we'll also define a root mean square percentage error function (a sketch of one follows this paragraph). Keep an eye out for this, because it means they're not interested in the absolute amount you're off by; they care about the percentage you're off by on average, whether that's 5%, 10%, or 20%. So that was step two: identifying what kind of problem you're solving (supervised or unsupervised, regression or classification), identifying the evaluation metric the business cares about, and agreeing on that metric with the business. One thing you should also try to find out is what value of the evaluation metric the stakeholders would be happy with. For Rossmann, is a root mean square percentage error of 10% or lower good enough, or do they want 5% or lower, or 1% or lower? What exactly do they need? That's something the business can calculate: maybe they go back historically and find that the store managers' estimates were off by 30% on average, so if you can build a system that's off by 20%, it's better than all the store managers and doesn't require their time, and maybe 20% is good enough. It depends case by case. For breast cancer prediction, obviously, we wouldn't be happy with 60 or 70% accuracy; we'd want closer to 98 or 99% to avoid misclassification, because it's a question of life and death there. So those are the kinds of trade-offs you have to discuss and encounter in real machine learning projects. Okay. Now we're getting into territory we're familiar with, but you can see how much goes in before you get to this step. Let's move forward into that more familiar territory: downloading, cleaning, and exploring the data and creating new features. This is a very important step, but it's also something we've done multiple times, so we won't go very deep here.
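Since scikit-learn doesn't ship an RMSPE function out of the box, here's a sketch of how you might define one yourself, following the description above (rows where the true value is zero are skipped to avoid dividing by zero):

```python
# A sketch of a root mean square percentage error function.
import numpy as np

def rmspe(targets, preds):
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    mask = targets != 0                               # skip rows with a true value of zero
    pct_errors = (targets[mask] - preds[mask]) / targets[mask]
    return np.sqrt(np.mean(np.square(pct_errors)))

print(rmspe([100, 200, 400], [90, 220, 400]))  # ~0.0816 -> off by about 8% on average
```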
We'll just go through the process roughly, and you can explore the code whenever you're free, because we want to get through this and on to the next steps, which are all about trying different kinds of models. So I'm just going to do Edit, Clear All Outputs, so I can run all the code from scratch. The first thing is to download the dataset, and for Kaggle competitions there is one special step: go to the competition page, open the Rules tab, and click Accept Rules (you can read the competition rules there). Unless you accept the rules of the competition, you will not be able to download the data. An easy way to tell you're on a competition page is that the URL contains /c/. An easy way to download the dataset from Kaggle is the opendatasets library: you just run od.download (there's a sketch of this after this paragraph), and it asks for your Kaggle credentials, which by now you should be familiar with. Go to your Kaggle account and click Create New API Token to get your API key. I'm just going to grab my API key from my desktop and put in the username and key here. There we go, and that downloads the dataset for us, about seven megabytes. The folder it creates, rossmann-store-sales, contains test.csv, train.csv, sample_submission.csv, and store.csv, and you can see the same thing here in the sidebar. Next, load the data using pandas with pd.read_csv. This is what the training data frame looks like: store, day of week, date, sales, customers, whether the store was open (one or zero), whether it was running a promotion (one or zero), whether it was a state holiday, which could be a big factor, and whether it was a school holiday, another factor. But apart from the training data in train.csv there is also the store.csv file, so let's check that out. store.csv contains the store ID, but it also contains a store type, and I'm guessing that might be important: if one store is just a kiosk inside a bigger store and another is an actual large store, that's probably an important factor in determining sales, so maybe we should bring the store type in. Then there's the assortment, which seems to have a handful of distinct values, so it looks like a categorical column; maybe we should add that too. Then we have competition distance, and competition open since month and year. I'm not too sure what these represent; this would be a good time to go back and check the description, or ask a question of whoever gave you the data. For now I'm not going to use these very extensively, so I won't dig deeper into them, but figuring out what they represent is an exercise for you. Then we have Promo2, which again tells you whether the store is running a continuing promotion or not.
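Here's a rough sketch of that download-and-load step (the folder name follows the competition URL slug, which is how opendatasets names the download directory):

```python
# Sketch: download the competition data (you'll be prompted for your Kaggle username
# and API key) and load the CSV files into pandas data frames.
import opendatasets as od
import pandas as pd

od.download('https://www.kaggle.com/c/rossmann-store-sales')

ross_df = pd.read_csv('rossmann-store-sales/train.csv')
store_df = pd.read_csv('rossmann-store-sales/store.csv')
test_df = pd.read_csv('rossmann-store-sales/test.csv')
submission_df = pd.read_csv('rossmann-store-sales/sample_submission.csv')
```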
And then it has some information about which week, which year, and what interval that promotion has been running for, so there's some information about how long stores have been running certain promotions. All of this could be useful, and it's given store-wise: for every store, you have one row of this fixed information. Now we want to get some of this information into the training set. Here we have store 1 in store.csv, but in the training set store 1 shows up many times, because the training set has over a million rows; in fact store 1 appears roughly a thousand times across the three years, while it has just one row in store.csv. So we want to take that one row and replicate it onto each of the thousand-odd training rows for that store. To do this we can use the merge method of the data frame (there's a sketch of it after this paragraph). We take ross_df, which contains the training data, all the information about a single store's sales on a single day, and we merge store_df into it. How do we do it? We use a left outer join, which means we retain all the rows from ross_df, and if some stores in store.csv don't show up in the training data, we simply ignore them; that's why it's a left join, and we merge on the store column. Once we run that, you can see we now have store, day of week, date, sales, customers, open, promo, state holiday, school holiday, et cetera, plus the store type, assortment, and competition distance, and that store-level data has been replicated onto every row where the store number is 1. Okay, so that's the pandas data frame merge; if you're not familiar or comfortable with it, just look up merging data frames in pandas (it's also covered in our Zero to Pandas course). So now we have a much richer dataset. And this is something you will often have to do: you might get three or four columns of data, then look up some other database to expand one of those columns into more attributes, and combine the two. There was a question about weather: yes, based on the date you could probably pull in weather data, and you could probably bring in financial data as well, like a stock market index or Rossmann's stock price, if you think it will have an effect on sales. So bring in as much data as you feel might be relevant to your training dataset. Now we have one million rows and 18 columns. This dataset also contains a test set, the test.csv file, so let's see what it looks like before we add the store data into it as well. In the test set you'll see there's an ID for each row, and then you have the store, the day of the week, the date, whether the store was open, whether it was running a promotion, and whether it was a state holiday or a school holiday.
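The merge itself is a one-liner; here's a sketch (the column names follow the Kaggle CSV files):

```python
# Left join: keep every row of the daily training data, and replicate the store-level
# columns (StoreType, Assortment, CompetitionDistance, ...) onto each row.
merged_df = ross_df.merge(store_df, how='left', on='Store')
merged_df.shape  # roughly a million rows and 18 columns
```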
So there's no information about customers and no information about sales, and you have to generate predictions for the sales. I'm also going to merge the store data into the test data frame, the same way we merged it into the training data frame, so here too we get state holiday, store type, assortment, competition distance, et cetera. At this point the only difference between the training and test data is that the test data doesn't have the customers and sales columns, and sales is what we want to predict. Okay. The next step is some data cleaning. We check the data types, and then we check whether there are any null values (there's a sketch of these checks after this paragraph). In this case it turns out that, yes, there are null values in some of the columns, so we have to figure out how to deal with them. One option is simply not to use those columns. Or, if we think they're important, we may want to fill in those null values, because machine learning models cannot be trained with nulls, or for categorical features we can do some kind of one-hot encoding, where the null values just become all zeros. So you have to look at each column carefully and decide case by case. Typically, just this step may require some back-and-forth communication: you talk to the business and figure out how important they think a certain input is and what they can give you. Maybe they have the missing data sitting elsewhere, or, if you have to guess, should you fill in the mean or the median? You can decide that by studying the distribution: if it's roughly normal, the mean should be fine; if it's exponential or heavily skewed, the median is probably better; or for a categorical column you might fill in some kind of "unknown" value. So on a case-by-case basis you need to reason about what to do here. You should also look at the ranges of the data, especially the minimums and maximums, just to get a sense of whether the values are valid. In this case all the ranges look pretty reasonable; nothing seems off, and there are no negative values, so no negative sales. If you did have negative sales, you would have to go back and figure out why they were reported: maybe on that day the returns for certain products were higher than the actual sales, maybe there was a huge recall of a product you were selling because of some issue or side effect, so everybody came to the store that day just to return it and no other sales could be made. Then you might want to exclude those rows because they're not representative of a general day-to-day. Again, you need to combine your skills as a data science practitioner with actual knowledge of the business to get the right insight into how to deal with these situations. Then we have a maximum value: it seems the maximum sales were about $40,000, which already tells you that you're trying to make predictions in the range of zero to 40,000. Now, zero is an interesting value: why were there zero sales on certain days?
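A sketch of the test-set merge and the quick data-quality checks described above:

```python
# Apply the same store-level merge to the test set, then do the basic checks.
merged_test_df = test_df.merge(store_df, how='left', on='Store')

merged_df.info()        # column data types
merged_df.isna().sum()  # how many missing values each column has
merged_df.describe()    # min/max/mean of numeric columns -- look for impossible values
```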
It could be because the store was not open on those days. That's something we'll have to deal with: when the store is not open, the sales are zero. Do we want to handle that inside our machine learning model, or as a special case? It seems like a waste of all that modeling power and time when we already know that if open is zero, the sales should be zero. So that's the kind of thing you think about as you go through these ranges: look for invalid values, look for anything that stands out, and then go back and have a discussion about it. Also check for duplicate rows. Depending on how the data was collected, there may be duplicates, and if you find them you need to go back and discuss whether they're true duplicates that should be removed or valid data you should keep; that really depends on the context, and you can't remove or keep duplicates as a blanket rule. Okay. Then, wherever you have dates, you want to parse them, so here we're parsing the dates into the datetime format using pd.to_datetime (there's a sketch after this paragraph). Once you have proper dates, you can check the minimum and maximum: the training data, combined with the store data, starts around January 2013 and ends on July 31st, 2015, and the test set, for which you're making predictions, runs from August 2015 to mid-September 2015, so about a month and a half. Now, this is the objective that has been set for you, and this is the test data you have, and here's something you should already be thinking about: your test data only covers August to mid-September. If you train your model, generate predictions on this test set, and get a certain score, do you expect the model to perform the same once it's out in the world and being used on data for October and November, or December and January? Probably not; I'd say the model would fare a little worse, because you've only tested it on August and September. In this competition, that's what they're asking for, so that's what we'll solve. But in a real project you might want to go back to whoever you're working with on the business side and tell them: the test data we've agreed on covers only a month and a half, August and September, and that may not be a good predictor of the model's real-world performance; maybe we should have a full year of test data. They may tell you they only have three years of data, and using a full year for testing reduces the amount available for training, and so on. So you have to negotiate and figure out the right split: if the model has to be used year-round, it's ideal to have test data year-round, or you may reach a compromise, for example building the model just for the next month and a half now, and then a month and a half later building it again for the next month and a half.
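A sketch of the date parsing and range check (the 'Date' column name follows the Kaggle files, and the data frames come from the earlier merge steps):

```python
# Parse the 'Date' column into proper datetime values and check the covered ranges.
import pandas as pd

merged_df['Date'] = pd.to_datetime(merged_df['Date'])
merged_test_df['Date'] = pd.to_datetime(merged_test_df['Date'])

print(merged_df['Date'].min(), merged_df['Date'].max())            # about 2013-01 to 2015-07-31
print(merged_test_df['Date'].min(), merged_test_df['Date'].max())  # about 2015-08 to mid-September 2015
```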
And once we've gotten through one year and we're training models every two months, we can then use the last full year of data to make predictions, right? So you have to figure out exactly how you will do this. A project isn't just done and finished at once; you will keep improving your model over time. And if it's possible to retrain your model every two months, then maybe it's fine to use a test set that just looks ahead by two months or so. Okay. All right. So now we have the dataset ready to go, and the next step is to do some exploratory data analysis and visualization. By this time, you should know the objectives of the exploratory analysis. First, you want to study the distributions of the individual columns: whether they're uniform, normal, or exponential. Based on this, you may decide how to fill missing values, and you may decide whether to apply some transformations. For example, if you see a column that is exponential and its correlation with the target column looks like an exponential curve, maybe you can take the logarithm of that column and get a better result. Then you want to detect anomalies or errors in the data: any missing values, any incorrect values, any values that don't make sense. By drawing the charts, you can spot and fix those. Then you want to study the relationship of the target column with other columns: whether there is a linear relationship, a nonlinear relationship, etc. And then you need to gather insights about the problem and the dataset, and come up with ideas for pre-processing and feature engineering. In a lot of cases, you will see that if you combine two columns, you may get a better result with the same kind of model. For example, if you're trying to predict the price of a house and you have the overall width and length of the plot, and let's say it's rectangular, then using area as a column might be a better idea than using width and length as separate columns, because area has a direct linear relationship with the cost of the house, but width and length individually may have a different kind of relationship. So these are the kinds of things you have to think about, and this is called feature engineering: combining or transforming features to create new features that make more sense. Remember that your models are limited by how they are defined and how the training happens. If you're training a linear model, a linear model can never learn the quadratic relationship where width multiplied by length relates to the cost. On the other hand, if you introduce a third column, area, then your linear model can suddenly start learning much richer relationships, because now there's a linear relationship between area and the cost. Okay. So most of the time is actually spent in feature engineering, and once you have a good set of features figured out, model training becomes quite easy. So let's study the distribution of the sales column. Here's what it looks like, and this is the target. Already you can see that there is this huge chunk of zeros here, about 175,000 of them.
So that's about one in seven, and that should give you a hint: these are probably Sundays. Maybe the stores are closed on Sundays, and that's why the sales on those days are zero. Now, this kind of distribution is going to be very hard for a machine learning model to predict, especially linear models. Maybe tree-based models will figure it out, because they'll be able to see that when the store is not open you simply return a zero, and otherwise do the general tree-based modeling. But for linear models it's going to be really hard, and this seems like a very clear thing that we should just handle upfront. Why even put all this complexity into our machine learning models? So let's just check this once: let's check the value counts for the open column. It seems like there are about 172,000 rows where the value of open is zero, and that matches, so roughly, on all the days when the store was closed, the sales were zero. So how about we remove all the rows where the store was closed? There were no sales on those days. Rather than trying to train a model to learn to predict zero, we know that we can just look at the value of open and return zero, right? So I'm just going to exclude from the merged data frame all the dates where the store was closed, and later on, when we're making predictions, we can handle it as a special case. Here we are just checking if merged open equals one and keeping that section of the data; we are only going to use the data where the store was open. And this is something you can tell them in advance: hey, I'm creating a model for the days when the store was open, to keep things simple, so don't use it to predict sales for days when the store is closed; where it's closed, you just predict zero. Okay. So now the distribution looks much nicer, much more manageable and smooth, and we've handled that special case. And that's why it's important to do this exploratory data analysis; otherwise you might miss things like this and your models would perform really badly. Just by removing the zeros, you will see that there is a huge difference in the models, and I encourage you to try it out with the zeros as well and see if it makes a big difference. Let's check out some other columns. Let's look at sales versus customers. This is where you will see a strong correlation: as you have more customers, you have more sales, though of course not always, because sometimes some customers spend more. So this is a scatter plot; each dot represents one row, or one day of data for a particular store, and they're colored by date. Roughly, you can also see that the spending per customer probably increases over time: you have a lot of 2013s here, where for a given number of customers the sales are lower, and a lot of 2014s and 2015s here, where the sales are higher for a similar number of customers. So this gives you a sense of the trend. But unfortunately we cannot use the customers column, too bad, because we don't know how many customers are going to come to the store a few days later, right? In fact, that could be another thing to try and predict, and it could be a good exercise for you: try and predict how many customers will visit the store.
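Going back to the filtering step described above, here's a minimal sketch, assuming merged_df is the merged data frame and using Plotly Express for the charts (the plotting library is an assumption; any library works, and the column names follow the dataset):

```python
import plotly.express as px

# Roughly 172,000 rows have Open == 0, matching the zero-sales days
print(merged_df['Open'].value_counts())

# Keep only the days when the store was open; closed days will be handled
# as a special case (predict zero) at prediction time
merged_df = merged_df[merged_df['Open'] == 1].copy()

# The sales distribution now looks much smoother without the spike at zero
px.histogram(merged_df, x='Sales', nbins=100).show()

# Sales vs. customers, colored by year, shows the correlation and trend discussed above
px.scatter(merged_df, x='Customers', y='Sales', color=merged_df['Date'].dt.year).show()
```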
Next up, we have this figure of store versus sales. You have over a thousand stores, and for each store we've simply plotted what the sales on particular days look like. It's very noisy because there are so many stores, but each vertical line represents a store. You can see that this particular store had very low sales on a certain day and a very high sale on another day, while most stores have sales between roughly five and twelve or fifteen thousand dollars. And there are some stores that stand out here, so it might be worth looking into why some stores sometimes get a very high sale while most stores get sales in a standard range. Okay. Then here's the trend of the day of the week compared to the sales, and we'll also see how promotions affect sales. So this is how it looks on the different days of the week: one, two, three, four, five, six, seven. The stores are probably open seven days a week, or maybe there's a fixed set of days each year when the stores are closed. But it looks like, overall, the spending is highest on Sunday and Monday and lowest on Saturdays, and somewhere in between during the rest of the week. So the day of the week is an important thing to look at. Here we are looking at the average sale based on promotion: if you do not have a promotion, the average sale is about $6,000, and if you have a promotion running at the store, the average sale is about $8,000. So promotion definitely seems to be an important factor. And that's the kind of analysis you will do, and sometimes you will find interesting things you need to deal with, like we dealt with the zero values for sales. You can also look at correlations, at how the target column is correlated with the other columns. It seems like it's highly correlated with the number of customers, but we can't use that as an input. Then promo seems to be an important thing, and the day of the week and whether it's a holiday also seem to matter. The store column is not really a useful correlation, because store is being treated here as a numerical value but it's really a categorical value, strictly speaking. So you don't see much correlation with store here, but store should have a big impact on the sales, right? So a correlation alone may not tell you the real picture, because depending on which store we are looking at, you should have a good idea of what the sales for that store should be. Still, it's useful to look at all the correlations, especially with the numeric continuous data. And here's an exercise for you: analyze and visualize all the other columns and figure out how they are related to the target column. You have to know the data inside out, and exploratory data analysis is the way you do that. So just look at all the columns; there shouldn't be any columns that you're putting into the model that you ignore.
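Here's a minimal sketch of these groupby-style checks and the correlations, assuming merged_df from above (column names follow the dataset):

```python
# Average sales by day of week and by promotion
print(merged_df.groupby('DayOfWeek')['Sales'].mean())
print(merged_df.groupby('Promo')['Sales'].mean())   # roughly $6,000 without vs. $8,000 with a promo

# Correlation of the numeric columns with the target; Store shows up here too,
# but since it's really a categorical ID its correlation isn't very meaningful
print(merged_df.select_dtypes(include='number').corr()['Sales'].sort_values(ascending=False))
```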
All right. So next we will talk about feature engineering, which is the process of creating new features by transforming or combining existing features, or by incorporating data from external sources. For example, we have a date column; if we just check the merged data frame, there's a date column here, and I'm guessing there is probably some sort of a monthly trend, so maybe we should extract the month from the date column. I'm also guessing there may be a yearly trend, so maybe we should also extract the year. Maybe we could also extract information like whether it's a weekend or not, whether it's the first or last day of the month, whether it's the first or last day of the quarter, and the day of the month. All that sort of thing, and that's feature engineering for you, essentially. So here we'll figure out the day, month, and year. Very simple: we can do this using the .dt property of the date column, and there are many ways to do this in pandas; if you look up how to get the day, month, or year from a datetime column, you will find some resources online. So we will just add these three columns, day, month, and year, to the training and the test set, and we can then plot and see if there's actually some sort of a trend here. So we are plotting the year-wise average sales and the month-wise average sales. Here are the year-wise sales, and you can see there is a slight increase: 2013 is slightly lower than 2014, which is slightly lower than 2015. And the month-wise sales seem to show a clear trend: the average daily sale seems to be quite high in December, and there's another peak around May or June. So I think it was a good idea to extract the month and year columns. Maybe we should also look at the day column, and as I said, you can create things like whether the day is a weekend or not, whether it's the start or end of a month, et cetera. You can also create new columns by getting information from external sources. So you could try and get the weather on each day: whether it rained or not, what the temperature was. For that you would probably need the locations of the stores, so maybe you can go back to whoever is working with you on the business side and ask: can you also give me the location of each store? Then you can use the date and the location to figure out the weather on those days, and use that information to make predictions. To make predictions about the future, maybe you can use the predicted weather; there's additional uncertainty that gets added there, but it could potentially lead to better results. So that's something you can try and add. You can also add whether a date was a public holiday; you can just look up that information for Germany or for Europe and fill it in. I think in this case it's already included, but in case it isn't, that is something you may want to add.
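Before moving on, here's a minimal sketch of the date feature extraction described above, applied to both the training and test data frames (names assumed as before):

```python
# Derive Day, Month and Year from the parsed Date column using the .dt accessor
for df in [merged_df, merged_test_df]:
    df['Day'] = df['Date'].dt.day
    df['Month'] = df['Date'].dt.month
    df['Year'] = df['Date'].dt.year

# Quick check of the yearly and monthly trends in average sales
print(merged_df.groupby('Year')['Sales'].mean())
print(merged_df.groupby('Month')['Sales'].mean())
```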
We already have whether the store was running a promotion on that day, but suppose that was not available; then you could get a list of promotions that have been run across the Rossmann stores and figure out whether a particular store was running a promotion on a particular day or not. In this case it's given to us, but that may not always be the case. Okay. So try coming up with new ideas and try adding more columns. One thing you can do to get a sense of what kinds of feature engineering ideas people apply is to check the notebooks that people have shared. So let me just find Rossmann store sales. If you go to the competition page, go to the code section, sort by votes, and search "feature engineering", you will see some examples here, like "Rossmann data cleaning and feature engineering". You can open up that notebook and see what kind of features people have created. Let's see: it seems like they're grouping stores based on their sales level. I think this was something we were talking about, where we could look at the average sales and say that this is a low-sales store or a high-sales store. We don't have things like the area of a store, but we can use the sales as an approximation, a value that is proportional to the area, right? Then here are some features created using dates, and some features created by taking logarithms: logarithms of the sales, standard deviations, means, and so on. So probably one feature worth adding is to take all the historical sales of a store and add the average as a feature for that store. And then there is a bunch of feature engineering related to promotions, and a bunch related to competition; we've not really looked at the competition information that has been shared with us, so that is something worth exploring. So those are some examples of feature engineering, and a lot of your time should be spent coming up with good features. Not initially, but once you've trained a first model and you want to improve it, that's when you spend more time coming up with new, interesting features. And here's an exercise for you: look at some of the existing features and try to improve their representation. Okay. All right. So we're done with the data cleaning and visualization. The next part is data preparation. The first step here is to create a training, validation, and test split for the data. Now, we already have a test set given to us; it contains a month and a half of data after the end of the training set, so we can apply a similar strategy to create the validation set: maybe we'll just use the last 25% of rows for the validation set after ordering by date. So here's what I'm going to do. I'm just checking the length of the data frame: after removing the days when the stores were closed, we are left with approximately 850,000 rows. So about 75% of that, roughly 633,000 rows, will go into our training set, and the rest will go into our validation set.
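Here's a minimal sketch of the time-based split we're about to make, assuming merged_df already has the closed days removed:

```python
# Sort by date so the validation set is the most recent ~25% of the data
sorted_df = merged_df.sort_values('Date')
split_index = int(0.75 * len(sorted_df))

train_df = sorted_df[:split_index]
val_df = sorted_df[split_index:]

print(len(train_df), len(val_df))                      # roughly 633,000 training rows; ~200,000+ for validation
print(train_df['Date'].min(), train_df['Date'].max())  # ~Jan 2013 to ~Dec 2014
print(val_df['Date'].min(), val_df['Date'].max())      # ~Dec 2014 to Jul 2015
```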
So these are just some basic calculations of what the size of the training set needs to be. Now here's what I'm going to do. I'm going to take the data frame sorted by date, so now we have all the observations ordered by date. And then from that data frame, I'm going to select the last 25% of rows for the validation data set and the first 75% of rows for the training data set. So now we have a training data frame which contains about 633,000 rows and a validation data frame which contains about 200,000 rows. Okay. I'll let you work out how we achieve this in this line of code; you can create a new cell and try out each piece, but what we're doing is sorting by date, selecting the first 75% of the data for training, and the rest for validation. And here's what the data frame looks like now. This is our training data frame; it contains store, day of week, date, et cetera, and the validation data frame will look exactly the same. What will be different is the date range: the training set goes from January 2013 to around December 2014, it looks like, and the validation set starts around December 2014 and goes on till July 2015. And the test set starts right after that. Okay. Why are we doing this time-based split? Because we understand that our model is going to be used in the future, so we want to evaluate it on data that comes from after the period it was trained on, right? That's why you have the test set, which comes after the training period, and the validation set, which also comes after the training period, so that it's a better reflection of the test set. If you pick the validation set randomly, that might still give you a good result, but the metrics you get on the validation set may not be a good reflection of the metrics you get on the test set or in the real world. You want to make your validation set as close as possible to the real-world data your model will face. Of course, if we did not have dates, we could just pick a random segment of the data for validation. So these are the date ranges; you can see that the test set comes after the validation set, and remember, we already have a test set given to us separately. All right. So now we have our training, validation, and test data frames, and all of them have all these columns. Now we know the drill: we need to identify the input and target columns. So I'm just going to create a list of input columns, and I'm not going to use all the columns for the initial model. For your initial model, when you're just figuring out what kind of modeling approach to use and what the rough baseline performance might look like, it's not worth spending too much time engineering and understanding every single column. You can just pick a few columns and try things out. So I'm going to pick about nine columns here, four numeric plus five categorical, for my initial model, put them into a list called input columns, and the target column is sales. One thing I should be very careful of is to make sure that the sales or customers column does not end up in the input columns, because that would be an error; that's not something we have in the test set.
Let's also separate out numeric and categorical columns. Now, this is where it gets a little tricky, because store should ideally be treated as a categorical column: there are 1,115 stores, each with different properties, and there's no natural order to them. It's not that a store with a higher ID is going to have higher sales; the numbering is essentially arbitrary with respect to the sales. So technically it is a categorical column, but the difficulty is that treating it as categorical and creating 1,115 one-hot columns is going to severely increase the size of the dataset. It's also going to make the models harder to train because of something called the curse of dimensionality, which roughly means that if you have a lot of features, it becomes much easier for your models to overfit: they have a lot more parameters to play with, so they can very easily fit the training data, and those models generally do not generalize well. That's not the exact definition of the curse of dimensionality, but that's the idea: especially when you get into thousands of columns or features, you are going to start facing issues. So we are going to treat store as a numerical column. That is probably a problem, but we'll just see what happens, right? And that's why you might do some kind of feature engineering where you reorder the stores: maybe you compute the average sale per store and, based on that, give a new numbering to the stores, so that a higher store number means a higher average sale and a lower store number means a lower average sale. That's something worth trying out, but we are just going to use it as a numerical column; linear models will struggle, and decision-tree-based models might be able to figure it out. Then we have day of the week. This is categorical, and I think we can keep it categorical. We have day and month; these are technically categorical too, but with day you probably don't want to add 31 new columns, so we might leave day numeric, and maybe month numeric as well, and decision trees should be able to figure it out. They can have decision boundaries like: if the month is less than five but greater than three, do this, otherwise do something else. So I hope you get the idea here: linear models will suffer when categorical data is treated as numerical data, but decision trees generally should be able to sort it out. Okay. So I am just going to create a list of numeric and categorical columns here. The numeric columns are store, day, month, and year; I'm going to treat each of those as a number. You can try making them categorical and experimenting, but the rest I will keep as categorical.
So the numeric columns are store, day, month, and year, and the categorical columns are day of week, promo, state holiday, store type, and assortment. Anything that contains strings has to be categorical, especially if it has a limited number of values. If it is a unique string, like some notes or a description of the store, we may simply have to drop it, or use it to create new categorical or numeric columns. Okay. One other thing we're doing here is creating train inputs and train targets: new data frames which contain only the input columns and the target column respectively. This is something you should successively keep doing; keep eliminating the things you no longer need, so that your inputs and your targets contain just the information you need. So now, if we look at train inputs, you will see that it contains only the handful of columns we're interested in and does not contain the target, and train targets contains only the target value we're trying to predict. Similarly, we have validation inputs and validation targets, and we have test inputs but no test targets, because our test data does not have targets; we have to make those predictions and submit them to Kaggle. All right. So now we've identified numeric and categorical columns. There is some scope for debate here, but this is just what I went with, and maybe I can revise it later. Now we get to imputation, scaling, and encoding. The first thing we'll do is impute missing values in the numerical columns. I'm just going to use a simple imputer here with the strategy of mean. But as I said, what you should ideally do is look at the distributions of the different columns: for columns with a normal distribution, a mean makes sense; for columns with some kind of exponential distribution, a median may make more sense. And there are also more complex imputation strategies that you can check out. So the imputer takes all the numeric columns and computes the average for each column when we call fit, and then we use it to transform the training data, validation data, and test inputs, filling in the missing values wherever there are any. Now, in this particular dataset that may not have been necessary, because there were no null values in the numeric columns, but in general there might be null values, so you will have to deal with them, and as I've said, you can apply different imputation strategies to different columns depending on their distributions. So that was imputation, which is just filling missing values. Then we have scaling: we just want to get all our numbers into the zero-to-one range. Once again, you might instead try a standard scaler, which assumes a roughly normal distribution and transforms the data to have mean zero and standard deviation one, giving values roughly in a minus-one-to-one range; that's one other way to scale. The min-max scaler creates a zero-to-one range. You can try different scaling approaches here, but to begin with you can just pick one and go with it.
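Here's a minimal sketch of the column setup, imputation, and scaling just described; the column lists follow the choices made above, and train_df, val_df, and merged_test_df are assumed from the earlier steps:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ['Store', 'Day', 'Month', 'Year']
categorical_cols = ['DayOfWeek', 'Promo', 'StateHoliday', 'StoreType', 'Assortment']
input_cols = numeric_cols + categorical_cols
target_col = 'Sales'

train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
test_inputs = merged_test_df[input_cols].copy()

# Fill missing numeric values with the column mean (a median may suit skewed columns better)
imputer = SimpleImputer(strategy='mean').fit(train_inputs[numeric_cols])
for df in [train_inputs, val_inputs, test_inputs]:
    df[numeric_cols] = imputer.transform(df[numeric_cols])

# Scale the numeric columns to the 0-1 range
scaler = MinMaxScaler().fit(train_inputs[numeric_cols])
for df in [train_inputs, val_inputs, test_inputs]:
    df[numeric_cols] = scaler.transform(df[numeric_cols])
```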
Okay. And then we have the categorical columns, encoded as one-hot vectors. Again, we've seen this a few times, so it should be straightforward enough: we are taking all the categorical data and turning it into one-hot encoded vectors. Here's what happens if I check train inputs: you will see that for day of the week, since it is treated as categorical, we now have day of the week 1, day of the week 2, 3, 4, 5, 6, 7. The benefit of this is that if sales are very high on Sunday, for example, then a certain weight or a certain decision can be applied to just the day-of-the-week-7 column, and it will not affect any of the other days. So I hope you see the benefit of having these one-hot encoded columns. And of course, if you had strings here, you would have to convert them into numbers using one-hot encoding anyway. Now, because we have added new one-hot columns for each of the categorical columns, we can extract out all the numeric data, all the numbers that can be fed into a model. That includes the numeric columns, which we have imputed and scaled, and the encoded columns. So we can drop the original categorical columns and just keep the one-hot columns created for each category. All right. And now we have X train, X val, and X test; X and y generally refer to inputs and targets in scikit-learn. These are the actual inputs that will go into the model, and they are all numbers, properly scaled and imputed.
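Here's a minimal sketch of the encoding step and the final feature matrices, continuing from the data frames above (note: older scikit-learn versions use sparse=False instead of sparse_output=False):

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

# Add one column per category to each split
for df in [train_inputs, val_inputs, test_inputs]:
    df[encoded_cols] = encoder.transform(df[categorical_cols])

# Keep only the imputed/scaled numeric columns plus the new one-hot columns
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
```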
All right. So that is the data pre-processing we need to do, and I'm just going to save my notebook here. Do save your notebook from time to time; you can grab your API key from Jovian to save it. You just go to jovian.ai, click copy API key, paste it here, and a copy of this notebook will be saved to your Jovian profile. Okay, so here's my copy of "how to approach ML". Now, before you get to machine learning, one thing you should do is create some quick and easy baseline models to benchmark your machine learning models against. What do we mean by baseline models? Well, suppose you didn't know anything about machine learning and you had to somehow come up with some approximation for making these predictions, as simple or as basic an approximation as possible. What would that be? We think of such strategies, implement them, and see what results they give. So here's an example of what I'm talking about. Let us define a model, a so-called model, because it doesn't really do anything: a model which always returns the mean value of the sales as its prediction. If you check train targets and take the mean of the sales data, it's about 6,873. So let's say my model always predicts that the sale for any store on any day is $6,873. That's a very dumb model in some sense; it doesn't really do much, but let's just see what we get out of it. So we define this function, return mean, which, given a bunch of inputs, simply returns the mean over and over for every input. If I pass X train, the training data, into it, I get back some predictions, and of course all the predictions are simply the mean, the average sale for a store on a single day across the entire dataset. And here is how this becomes useful: we can now check this against our evaluation metric. Remember, we decided that the evaluation metric should be the root mean squared error, or actually, in the competition, it is the root mean squared percentage error. I'll leave it as an exercise for you to figure out how to implement RMSPE; we're just going to use the RMSE score here because there's an inbuilt function for it, but do replace it with RMSPE. So I'm importing the mean squared error function, and then I call mean squared error on the predictions, which are all a fixed value, and the actual targets, train targets, with squared set to False so that it gives us the root mean squared error. It turns out that the root mean squared error is about 3,082. Now, why is this important? Well, the first thing to see is how this compares to the actual values. It seems like the actual values are roughly in the range of 3,000 to 17,000, and 17,000 minus 3,000 is 14,000. So if you're off by about 3,000 out of a range of 14,000, then, very roughly speaking, just predicting the average all the time makes you off by only about 20%. And you can compute the root mean squared percentage error and check: if the sales forecasts coming from all these stores, from all these sales managers, are off by more than 20%, then the work they're doing is actually counterproductive; their predictions are actively causing a loss, because simply using the average value as the prediction would probably give a better forecast, right? So that's one thing you learn. The second thing you learn is that any model we train, any fancy machine learning model, should ideally have a lower loss than just predicting the average, because here there's no intelligence at all. That's why you call this a baseline: it tells us that any model we train should be at least this good, that the RMSE should at least be less than about 3,000. If the RMSE is greater than 3,000, then our model is completely useless and there is probably something we need to fix about it fundamentally. Okay. With that out of the way, we can also check this for the validation set: we generate predictions for the validation set where we just return the mean, compare them with the validation targets, and get about 3,168. So the so-called model is off by about 3,000 on average.
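Here's a minimal sketch of this baseline, together with an RMSPE helper based on the usual competition definition (verify it against the competition's evaluation page before relying on it):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def return_mean(inputs):
    # Ignore the inputs entirely and predict the average training sale every time
    return np.full(len(inputs), train_targets.mean())

train_preds = return_mean(X_train)
print(mean_squared_error(train_targets, train_preds, squared=False))   # ~3,082

val_preds = return_mean(X_val)
print(mean_squared_error(val_targets, val_preds, squared=False))       # ~3,168

def rmspe(targets, preds):
    # Root mean squared percentage error, computed only where the target is non-zero
    targets, preds = np.asarray(targets), np.asarray(preds)
    mask = targets != 0
    return np.sqrt(np.mean(((targets[mask] - preds[mask]) / targets[mask]) ** 2))
```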
Let's try another model. This time we are going to make a random guess between the lowest and the highest sale. The lowest sale, I believe, is about $3,000 since we have removed all the zeros; okay, it looks like we still have a bunch of zeros here. And the highest is about 41,000, which seems like a large number, so maybe you can change the range, maybe use 3,000 to 15,000, but here we are doing a completely random guess. So now, if you pass the training inputs into this guess-random model, you will get random training predictions; we've not looked at the data at all. And let's check the root mean squared error: this time we are off by about 18,000, in both cases. I am quite curious what happens if I just put in, say, 3,000 and 18,000 here instead of the min and max. Then we would be off by only about 6,000, right? So a random guess over a narrower range is off by about 6,000, and maybe if we played around with the range a little more we might get it down to 4,000 or 5,000. And a fixed guess of the mean is off by about 3,000. So our model should be off by less than both of these values. Here's one other thing you can try: a quick hand-coded strategy. This is where you might even be able to satisfy the requirement without building a machine learning model at all. Write a function that implements this strategy: if the store is closed, return zero; if the store is open, return the average sales of that store for the current month in the previous year. So what we're saying is, if you're checking, say, store number 1234 and you want the prediction for the sale on the 15th of July, 2015, just look at the average daily sales for the same store in July 2014 and return that. This is something you can pre-compute and keep for every store: a table of store, month, and the average sale, which you can simply look up. And I believe that should be better than the model which always returns the overall mean, because we're looking at the same store and the same month in the previous year. Maybe you can also add a rule that increases it by 10% to account for inflation, or an increase in prices or traffic, whatever. Use those rules, implement that strategy, and find the validation score for it. You may be surprised: this hand-coded strategy may actually perform a lot better than many of the machine learning models we are about to train. So don't always assume that a machine learning model will perform better. Sometimes, especially when the signal in the data is weak, human insight can give you a much better strategy. So do try out a hand-coded strategy, and if it satisfies the requirements that have come from the business, that's it, end of project. You don't have to train a machine learning model. And that's a good thing, because a hand-coded strategy is very clearly explainable: you coded it yourself.
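Here's one possible sketch of that hand-coded strategy; it's an illustration of the idea, not the course's exact implementation, and it assumes merged_df still has the raw (unscaled) Store, Year, Month, Open, and Sales columns:

```python
# Pre-compute the average daily sales for every (store, year, month) combination
monthly_avg = merged_df.groupby(['Store', 'Year', 'Month'])['Sales'].mean().to_dict()
overall_avg = merged_df['Sales'].mean()

def hand_coded_predict(row):
    if row['Open'] == 0:
        return 0.0                                   # closed store => zero sales
    key = (row['Store'], row['Year'] - 1, row['Month'])
    # Same store, same month, previous year; fall back to the overall mean if missing
    return monthly_avg.get(key, overall_avg)

# Example usage on the validation data frame
val_hand_preds = val_df.apply(hand_coded_predict, axis=1)
```

You could then score val_hand_preds against the validation targets with RMSE or RMSPE, and optionally scale the prediction up by some percentage to account for inflation.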
Okay. Then, after trying all of these things, that is when you should maybe try a baseline machine learning model. Just pick the simplest possible model: for regression problems, linear regression is the simplest model; for classification problems, logistic regression is the simplest model. And often these models themselves will be good enough, and you can just end the project there; you don't even have to do any hyperparameter tuning, maybe just a little bit of feature engineering. So let's see. Here we're importing the linear regression model and creating a model by instantiating it. Then we call fit. When we call fit, the linear regression model assumes a linear relationship between the inputs and the targets: the target, which is the sale, is assumed to be some weight multiplied by the store number, plus some weight multiplied by the day of the week, plus some weight multiplied by whether there was a promotion or not, plus some weight multiplied by whether it was a holiday or not, et cetera. There are about nine features, which become about 18 after one-hot encoding, so it will apply a weight to each of those 18 features. It starts out with random weights and slowly improves them with this kind of workflow: it puts some inputs through the model, gets some predictions, compares the predictions with the targets using a loss function (linear regression in scikit-learn uses a squared-error loss, which is closely related to the RMSE), and then applies an optimization method, in this case ordinary least squares, to improve the weights of the model. So the random guesses are improved step by step until the model gets good enough. That's what happens when we run lin reg dot fit. Now, once the model has been fitted, which means once good weights have been identified for each feature in the input, we can use the model to make predictions. Keep in mind that the parameters, or weights, inside the model will not be updated during prediction. If you look at the model now, lin reg, inside it you will find a bunch of coefficients that will be applied to each feature, and these coefficients correspond directly to the columns in X train: one coefficient is multiplied with the store number, another with the day, another with the month, another with the year, and so on for all the columns. All of these weighted values are added together, and that gives you the predicted sale. It's a very simple linear assumption, and we know it's already quite weak, because the store number has no direct linear relationship with the sale. So here are some predictions: we put X train into the model's predict method. We're no longer fitting, the weights of the model are fixed, so we are just taking the inputs, applying the weights, and getting some predictions out. Now let's compare the predictions with the train targets and get the root mean squared error. It turns out that the root mean squared error is 2,741, so our linear regression model is off by almost $2,800, while our baseline average model, the one which always predicts the mean, is off by about $3,000. So clearly linear regression is not doing too well here: we were off by about 3,000 earlier, and now we're off by about 2,700.
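For reference, here's a minimal sketch of this linear regression baseline:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linreg = LinearRegression()
linreg.fit(X_train, train_targets)

print(linreg.coef_)   # one learned weight per column of X_train

train_preds = linreg.predict(X_train)
val_preds = linreg.predict(X_val)
print(mean_squared_error(train_targets, train_preds, squared=False))  # ~2,741
print(mean_squared_error(val_targets, val_preds, squared=False))      # ~2,817
```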
It's clearly a bad model, but now you know it's a pretty bad model precisely because you can compare it with the baseline, and hopefully you can see the value of a baseline. In fact, for the validation set the error is even worse, 2,817, which is almost the same as always guessing the mean. And we understand why: the relationships are definitely not linear. The sale is not a direct linear combination of the store number, the day of the week, and things like that; the relationships here are non-linear. But anyway, that was just a baseline model with machine learning. Now we get to the interesting part, where we train more interesting models and tune hyperparameters, and this is where we have to systematically explore a bunch of different modeling strategies. Now, scikit-learn offers a cheat sheet on its website that you should check out to help you figure out which kinds of models to use. It doesn't include all the different models, but it roughly shows that scikit-learn is split into a bunch of classification models, regression models, clustering models, and dimensionality reduction models; that's what scikit-learn contains, along with some other non-modeling-related tools. And here is where they say you start: you check if you have more than 50 samples, and if you have fewer than 50 samples, just get more data, because no amount of modeling is going to help you. In this case, fortunately, we have over half a million. Next we ask: are we trying to predict a category or a quantity? If you are predicting a category, then you check if you have labeled data. For the breast cancer example, this is the direction we would head in: we do have labeled data, we already know whether tumors are malignant or benign, cancerous or non-cancerous, so it's a classification problem and we can use a classification algorithm. These are some of the classification algorithms: the SGD classifier, the K neighbors classifier, then the ensembles, which include random forest classifiers and decision tree classifiers; those aren't all listed here, but they're part of this. Almost all classification models will be applicable to that problem; not all of them may perform well, but all of them will work. On the other hand, if you're predicting a quantity, like in this case, then we check: if we have fewer than a hundred thousand samples, we should probably just use the SGD regressor or one of the other regressors. And here they have this check: if you think that some features are going to be more important than others, use lasso or elastic net. The SGD regressor is basically just a linear model trained with the gradient descent optimization technique, which works well for large datasets. The differences between the SGD regressor, elastic net, lasso, and so on lie mostly in the loss function, with terms added to prevent overfitting and things like that. And then, yeah, there are a couple of other options.
So here, what they're trying to tell you is: try the simplest model possible, and if the simplest model does not work, then maybe try these ensemble regressors, which include random forests (and they should have put decision trees in there somewhere as well). If linear models work out fine, that's great; if you're not satisfied with linear models, then you try some of the ensemble regressors. So that's the rough idea: you start out with simpler models and move to increasingly complex models. Here's a general strategy. Find out which models are applicable to the problem you're solving; obviously a K nearest neighbors classifier is not going to be applicable to sales prediction, because it's a classification model, but all of the regressors would be. Train a basic version of each type of model that's applicable. Even if you know the linear relationships are weak, it still makes sense to train a linear model just to see what result it gives you; you may be surprised, so don't assume anything. Then identify the modeling approaches that work well and tune their hyperparameters. Once you try a basic version of all of these models, you will get a sense of which one is working well, and then you can start tuning some of the hyperparameters. And as you do this, use a spreadsheet to keep track of your experiment results. Here is one spreadsheet that we've put together for you, a simple format you can follow. The spreadsheet contains two sheets; if you look at the bottom, you have an ideas sheet and an experiments sheet. What you should do is list out all the ideas you want to try, along with a potential outcome. Maybe I want to try a linear regression: I think this is going to be pretty bad, which is something we've already seen. Then maybe you want to try a ridge regression, and you can read about ridge regression, and then lasso regression, and maybe polynomial regression. Ridge and lasso regression simply make some changes to the loss function of linear regression; polynomial regression means you're taking squares and cubes of certain features which you think might be useful. Then maybe you want to try a decision tree, and random forests, and maybe gradient boosting, and maybe something called support vector machines. So just list out all the ideas you have, and it's a good idea to also note down what you think the outcome might be: ridge, you think, should be better than linear regression because it contains regularization internally; a decision tree should offer good interpretability; and things like that. Whatever outcome you're looking for from each model, try to predict whether it will work well. Then, once you actually train that model, you can list out your learnings here and see if your prediction matched what actually happened. And that is how you train your mind to think about picking the right machine learning model. There is no single right answer, and there is no series of rules that you can follow.
It's just that you have to work through a bunch of different datasets and try different kinds of models, and for each kind of model, if you take a second to predict what outcome you expect and then learn from the actual outcome, you will start course-correcting. In some sense, you're running a machine learning algorithm on your own brain, and you will get better at picking machine learning algorithms. All right. So that's the ideas sheet. Then you have the experiments sheet, where you should put in all the experiments you conduct; I think you should have a hyperparameters column here as well, and I'll tell you what I mean by that in just a second. Okay. So here's what we're going to do. We are going to define a function called try model, which takes a model. You instantiate, say, a linear regression model and pass it to this function. It first fits the model to the training data: you have X train and the train targets, so it fits the model on those. Then, once the model has been fitted, once the weights inside the model have been finalized, it generates some predictions using the model: model dot predict on X train to get predictions for the training set, and model dot predict on X val for the validation set. Then, because we already have the targets for the training and validation sets, we compute the root mean squared error for both (mean squared error of the training targets and training predictions with squared set to False, and the same for the validation targets and predictions) and return the train RMSE and the val RMSE. Why am I defining this function? So that I can try different types of models: I can call try model with a linear regression model, with a random forest, with something else. That's the benefit of scikit-learn: it has a standardized API that lets us write these convenient functions. So first I'm going to try some linear models, and you can read more about them here: linear regression, Ridge, Lasso, ElasticNet, and SGD. You can see I'm creating a linear regression model and directly passing it into try model; this is the same as writing model equals linear regression and then try model of model, just in a single line. And we've already trained a linear regression model, so we know it has a training RMSE of about 2,741 and a validation RMSE of about 2,817.
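Here's a minimal sketch of the try_model helper described above, along with the family of linear models run through it:

```python
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor

def try_model(model):
    # Fit on the training set, then report RMSE on both the training and validation sets
    model.fit(X_train, train_targets)
    train_rmse = mean_squared_error(train_targets, model.predict(X_train), squared=False)
    val_rmse = mean_squared_error(val_targets, model.predict(X_val), squared=False)
    return train_rmse, val_rmse

for model in [LinearRegression(), Ridge(), Lasso(), ElasticNet(), SGDRegressor()]:
    print(type(model).__name__, try_model(model))
```

Because every scikit-learn estimator exposes the same fit/predict interface, the same helper works unchanged for decision trees, random forests, and so on.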
So this is one experiment we've conducted: a certain model trained with the default hyperparameters, and I'm just going to record it in the sheet. I'll note that linear regression is one experiment I conducted, and I'll try to insert the current time by just typing "now". Oops, that does not really work; anyway, let me just put in today's date, July 3. For the hyperparameters, well, I didn't tune any, so I'm just going to put "none" here. Then I record the training score, about 2,741.5, and the validation score, 2,817.7. We don't have the test score yet, but we can put it in later. At this point I would probably also commit the notebook and save a link to it, so I'm just going to put that link here, along with any other details about the model we want to record. Okay. So that's the linear regression model. Then let's do ridge regression. With ridge regression it seems like we're getting a very similar result, not too different: 2,741.5 and 2,817.7. Let's try lasso regression; this seems to be taking slightly longer. Let's try some others as well: Lasso, ElasticNet, and SGD. Lasso gives us a worse result in both places, and ElasticNet is even worse. So it seems like all the linear models are pretty bad, and 2,972 is almost as bad as the mean prediction, right? So then you would just record here that you tried ElasticNet, ridge, lasso, and the SGD regressor, and put in all the information. And if you think an experiment is completely useless, you can even skip recording it, but the idea is to build out this table of your experimentation, so that when you need to look back after having conducted a hundred or 150 experiments, you can pick the right model out of it. Otherwise it gets really messy. Okay. And here's an exercise for you: try changing some of the parameters in the above models. For example, ElasticNet has a couple of parameters; it has an alpha and an l1 ratio and a bunch of other parameters, so you can try passing an alpha, maybe alpha equals 0.6, and read about it in the documentation. Changing the alpha changes the result, so maybe you note down that alpha equals 0.6 led to a certain score, then add another row with a different alpha and the score it led to. This is how you organize your work, and it's very important as a data science practitioner to keep your work organized. Okay. Next up, let's look at some tree models, and from this point on I'll assume that we are putting all the results into that spreadsheet. Here's a tree model: I'm importing the decision tree regressor and trying a basic tree model, just calling decision tree regressor with a random state of 42 and calling try model. It takes a while, but you can already see that the training loss is zero, and this is something we already know about decision trees: they overfit badly to the training data. So we may want to experiment with some hyperparameters. But even with the overfitting, the validation loss is 1,559, which is far, far lower than the roughly 2,800 we got from linear regression, and that was about as good as just predicting the mean. This is about half, right? So we've roughly cut the error from the baseline in half. Now, that I think is a pretty good model, but we would still have to go and talk to the business and figure out if this is good enough or if we need to do even better. And then you can also plot this decision tree.
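Here's a minimal sketch of the decision tree experiment and the tree plot just mentioned:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

tree = DecisionTreeRegressor(random_state=42)
print(try_model(tree))   # training RMSE ~0 (heavy overfitting), validation RMSE ~1,559

# Plot just the top couple of levels to see which features drive the first splits
plt.figure(figsize=(20, 10))
plot_tree(tree, max_depth=2, feature_names=list(X_train.columns), filled=True)
plt.show()
```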
So at this point you can plot the tree and send that information out, saying: hey, it seems like promo is the most important thing; if we do not have a promotion, we check the day of the week, and if we do have a promotion, we check the store type, and this is how we arrive at a certain value. And then you can do the analysis of the decision tree: how deep it goes, which are the most important features, et cetera. But let's also quickly try a random forest and see what happens when we go from a single decision tree to a forest of a hundred trees, again with the default configuration, not changing any hyperparameters. The random forest obviously takes a lot more time, and that is something you can record as well; feel free to add more columns to the sheet, like the training time. So we've just done the decision tree, and let's record that. Let's just call it loss, so that's clear: the training loss was zero and the validation loss was about 1,500, which is much better than our linear models. The forest is taking a while, so let's let that work itself out. An exercise for you is to tune the hyperparameters of the decision tree and the random forest to get better results. At this point you may also want to consider how much better a random forest is compared to a single decision tree, because a single decision tree offers a hundred percent interpretability: you can see exactly how each decision is taken, because you can plot it out, whereas a random forest averages a hundred trees. So if the business requires a very clear demonstration of how you arrived at a certain result, then maybe you're willing to sacrifice some accuracy for the interpretability of the model. If not, feel free to train a bigger random forest; random forests also offer some sort of feature importance. Okay. I may have to pause this at this point, but I believe from my previous run that the random forest was around 1,300, so that was again a decent reduction from a single decision tree. And then there are several other supervised learning models you can try. If you go to scikit-learn.org and look at supervised models, you have a whole bunch of linear models; we've not covered all of them, and not all of them are regression models; some are for classification, and you can check the documentation. Then you have kernel ridge regression, which seems interesting. You have support vector machines; these were pretty popular back in the early 2000s, and they're not very widely used now, but you can check them out nevertheless. Then you have stochastic gradient descent, which is just a variation of the linear models that uses the gradient descent algorithm. You have the nearest neighbors models; the idea here is that, given a data point, you find the closest four or five data points to it in the training set and just take the average of those, so you have nearest neighbors regression, classification, et cetera. You have naive Bayes. You have decision trees, which we were just looking at. And you have ensemble methods, which include random forests, gradient boosting, et cetera.
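Going back to the random forest experiment, here's a minimal sketch using a smaller forest so it trains faster (the n_estimators value is just for illustration), plus the feature importances mentioned above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
print(try_model(rf))

# Random forests expose an aggregate feature importance across all their trees
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))
```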
And then there are several other supervised learning models you can try out. If you go to scikit-learn.org and look at the supervised models, you have a whole bunch of linear models — we've not covered all of them, and of course not all of these are regression models; some are for regression, some are for classification, and you can just check the documentation. Then you have kernel ridge regression, which seems interesting. You have support vector machines; these were pretty popular back in the early 2000s and are not very widely used now, but you can check them out nevertheless. Then you have stochastic gradient descent, which is just a variation of the linear models that uses the gradient descent algorithm. You have the nearest neighbors model: the idea is that given a data point, you find the closest four or five data points to it in the training set and just take the average of those, so you have nearest neighbors regression, classification, et cetera. You have naive Bayes, you have decision trees, which we were just looking at, and you have ensemble methods, which include random forests, gradient boosting, et cetera. Gradient boosting is something we will talk about next time, and then you also have neural networks, which go into a whole other area called deep learning. So there are a lot of these that you can check out. Okay, this is still running, so there's probably some issue here; let's just set n_estimators=20 and see if that runs faster. Now, an exercise for you is to try out some of these other models and see if you can get a better result than a random forest — I believe a random forest gets you to about 1,300, if I'm not mistaken — and you can tune the hyperparameters as well. And in certain cases, when it's not a supervised learning problem, you can apply some unsupervised learning approaches, which we'll touch on briefly towards the very end: clustering, covariance estimation, density estimation, dimensionality reduction, a bunch of those topics. Okay, so now the random forest has run with 20 estimators, and I wonder what the output is like; we'll see the output in a bit. But now we come to the next, and probably one of the most important, steps. Once you have a good baseline and you're going through this whole process of trying out different models, you need to perform regularization: you pick the model that is working well and tune its hyperparameters, and you also try combining results from different models, all of which seem promising. In general, here are some strategies you can use to improve the performance of your model. So you have a model, it's beating the baseline, and you want to make it better. The best way to improve a model's performance is to gather more data: if you can get your hands on more data, just get more data. It's going to capture more relationships between the inputs and the targets, and it's going to help the model regularize and generalize better as well. Try to include more features: the more relevant features for predicting the target you can get, the better the model becomes. This could be by getting additional data, or by feature engineering — carefully thinking about which form of a certain feature is most useful. For example, going from a date to day, month, and year is a huge leap; that is not something a model can figure out on its own. Our models are actually pretty limited by the way they are constructed, so we need to give them the right features. Then, tuning the hyperparameters of the model can help. Remember how hyperparameter tuning works: as you change a certain hyperparameter, the model's complexity in a particular dimension goes from low to high. As you increase the complexity, the model initially gets more powerful, so it learns more about the data while still generalizing well, but after a certain point it starts to memorize specific training examples, and that's what you want to avoid. So you need to find that best fit: tune the hyperparameters so the model is neither underfitted nor overfitted, because overfitting increases your test error. For more information, go back and review the decision trees and random forests lecture, where we saw how varying the max depth leads to this kind of curve between the training loss and the validation loss. Okay.
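To make that complexity curve concrete, here's a small sketch that varies a single hyperparameter (the max_depth of a decision tree) and prints the training and validation RMSE at each setting; the data variables are the same assumed ones as before.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# A low max_depth underfits (both errors high); unlimited depth overfits
# (training error near zero while validation error stops improving)
for max_depth in [2, 5, 10, 20, None]:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    model.fit(train_inputs, train_targets)
    train_rmse = np.sqrt(mean_squared_error(train_targets, model.predict(train_inputs)))
    val_rmse = np.sqrt(mean_squared_error(val_targets, model.predict(val_inputs)))
    print(f'max_depth={max_depth}: train RMSE {train_rmse:.1f}, val RMSE {val_rmse:.1f}')
```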
All right, so this model has run now. With 20 decision trees, you can see that we got the validation loss down to about 1,396, so a random forest regressor definitely seems to be working quite well for this particular problem, and this is a number we can share with the business to figure out whether it's a reasonable value for prediction or not. So tuning hyperparameters is one thing you can try. Another thing you should try, once you've been tuning hyperparameters for a long time and you've tried whatever feature engineering you could come up with, is to look at specific examples where the model is making incorrect or bad predictions. You can make all the predictions from the model, look at individual predictions side by side with the targets, compute the difference or whether the prediction matched, find the ones where the predictions were bad, and study those data points to get some insight. Maybe you'll figure out that the model is missing a certain thing, or that you should give a certain weight to a certain class, or things like that. Looking at specific examples one by one is an exceptionally good strategy, because with aggregate numbers the metric just goes up and down and you can't really learn a lot from it, whereas when you look at specific examples you can apply your insight into the domain and the problem and figure out how to improve the model. Then you can also do things like grid search for hyperparameter optimization: if you're tired of tuning parameters by hand, you can set up a mechanism for this to happen automatically. There's also another technique called K-fold cross validation, which we will cover in more detail next time. And finally, you can combine the results from different types of models, or train another model using the results of a bunch of different models; this technique is called ensembling, and that specific version of it is called stacking. These are all more advanced areas of machine learning. With hyperparameter optimization, when you want to do it in an automated fashion, you typically identify the three or four hyperparameters you want to tune and the ranges you want to test, and you put those into a grid search algorithm, which creates a model for each combination of hyperparameters, returns the results to you, and lets you pick the best model. I've pointed to a tutorial here that you can follow, and we will apply hyperparameter tuning to one of the models we train in the future.
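Here's a minimal grid search sketch using scikit-learn's GridSearchCV, under the same data assumptions; the parameter grid shown is just an example, and a grid this size can take a while on a large dataset.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    'n_estimators': [20, 50, 100],
    'max_depth': [10, 20, None],
}

# GridSearchCV trains one model per combination (with cross validation) and keeps
# the best one; the RMSE scorer is negated because scikit-learn maximizes scores
grid = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
grid.fit(train_inputs, train_targets)
print(grid.best_params_, -grid.best_score_)
```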
Next, we have K-fold cross validation, and this is what it looks like visually. We have some training data and some test data. Instead of picking just one portion of the training data as the validation data, we split the training data into five folds — think of it like a piece of paper folded into five equal portions — and then train five different models. For the first model, you use fold one as the validation set and folds two, three, four, and five as the training set. For the second model, you use fold two as the validation set and folds one, three, four, and five as the training set. For the third model, you use fold three as the validation set, and you get the idea. So you're training five different models, each on 80% of the data, but each time it's a different 80%. Then you can make predictions using all five of those models and simply average those predictions. The benefit is that instead of setting aside a single validation set, you have used all of the training data while still doing validation: you still have a validation score you can use for tuning hyperparameters, but you are training on all the data you have available. Now, this is going to perform a little poorly for time-series data, or data ordered by time, because ideally we want to keep the validation data at the very end of the time period, but you could still try it — it doesn't hurt, and it gives the model more data to train with. So cross validation is something you can do using scikit-learn; I've pointed to a tutorial here as well, and it's something we will apply next time. And by the way, grid search may seem like a difficult thing to do, but because of the utilities in scikit-learn it's actually quite simple — just three or four lines of code that you can always look up. Similarly with K-fold cross validation, it may seem difficult: how am I going to split the dataset into five pieces, will I need for loops, will I have to write some functions, et cetera? Well, all of that is generally taken care of for you. There is a KFold class you can import from scikit-learn, and it takes just three or four lines of code to do all of this. It's actually easier to implement than to explain.
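Here's a hedged sketch of K-fold cross validation with scikit-learn's KFold class, assuming train_inputs and train_targets are a pandas DataFrame and Series as before. Each fold takes a turn as the validation set, and the five resulting models can be averaged for predictions.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
models, val_rmses = [], []

for train_idx, val_idx in kfold.split(train_inputs):
    # 80% of the rows for training, the remaining 20% (this fold) for validation
    X_train, X_val = train_inputs.iloc[train_idx], train_inputs.iloc[val_idx]
    y_train, y_val = train_targets.iloc[train_idx], train_targets.iloc[val_idx]
    model = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    val_rmses.append(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))
    models.append(model)

print('Validation RMSE per fold:', val_rmses)

def predict_avg(models, inputs):
    # Average the predictions of all five fold models
    return np.mean([m.predict(inputs) for m in models], axis=0)
```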
Next, you have ensembling and stacking. Ensembling simply refers to combining the results of multiple models, and here is what it looks like visually: you have the data, you train five different models — and these could be different kinds of models: one could be a linear regression model, one an SVM, one a random forest, and one a gradient boosting machine — and then you take predictions from all of them. Depending on whether it's classification or regression, you do two different things: with classification you pick the majority vote, and with regression you take the average. In a way, ensembling is an extension of random forests, which we've already covered: in a random forest we look at 100 or 500 different decision trees and average their results, while in manual ensembling we take five or six different models and average their results. All of these models are diverse — some are linear models, some are tree-based, some can be kernel-based models like SVMs — and the idea is that each model can pick up some unique way of relating the inputs to the targets, so when we average out their predictions, the result may be better than what we would get from any individual model. That's not always the case, but that is the objective, and if you do it right, this works quite well. One other thing you can do is give weightages to each model. Maybe you know that one model works well most of the time, so you give it an 80% weightage and give the others 10%, 5%, and 5%. Then the actual result for regression would be 0.8 times the first model's prediction, plus 0.1 times the second, plus 0.05 times the third, plus 0.05 times the fourth. So you can assign weights to the different models and their predictions. As an exercise, try ensembling the results of a random forest with a ridge regressor. There's nothing complicated to it: train a random forest and get its predictions, train a ridge regressor and get its predictions, then just average the two sets of predictions — add them together and divide by two. Or, if you want to apply weights, maybe apply a weight of 0.8 to the random forest and 0.2 to the ridge regressor, add the weighted predictions together, and see if you can get a better score than either of the two models individually. If you can, think about why that is: how is it possible that two models, when combined, give a better score than a single model, and how do the errors average out? Then there's something called stacking, which is a more advanced version of ensembling where we train another model using the results from multiple models. Remember, I mentioned that we can give weights to the different models — model one can get a weight of 0.8, then 0.1, 0.05, 0.05. Well, how do you decide the weights? You want to pick them in such a way that the validation loss is minimized or the validation accuracy is maximized. So how about we just train a linear regression model on top of these ensembled results and figure out the optimal weights? We can train a linear regression model that takes the outputs of each of these models and figures out the best weights to apply to them to minimize the validation loss. This is called stacking. It's a natural extension of ensembling, and again it's not very difficult to implement: when you think about how you might implement it, it might seem like a lot of code, but there are utilities to help you do this, and we will see stacking in one of the models we train too. This is now getting into fairly advanced territory, but you will see it done a lot in Kaggle competitions, because if you look at a Kaggle leaderboard, the differences in score are very small — something like 0.10999 versus 0.1098 — so people are fighting for the third, fourth, or fifth decimal place, and that's where all of these advanced techniques help. In production, you typically don't want to run all of these models at once, because it can get pretty slow and pretty resource-intensive, so you typically ensemble only two or three models, not more. Stacking can also go to multiple layers — you can have layers upon layers of stacking — but typically just one layer is what you see in the industry. So this is more of a survey of some of these advanced techniques, but I encourage you to try them out.
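Here's a small sketch of the weighted-averaging exercise, ensembling a random forest with a ridge regressor under the same assumed data variables; the 0.8/0.2 weights are just the illustrative values mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(train_inputs, train_targets)
ridge = Ridge().fit(train_inputs, train_targets)

rf_preds = rf.predict(val_inputs)
ridge_preds = ridge.predict(val_inputs)

# Plain average, and a weighted average that trusts the random forest more
avg_preds = (rf_preds + ridge_preds) / 2
weighted_preds = 0.8 * rf_preds + 0.2 * ridge_preds

for name, preds in [('average', avg_preds), ('weighted', weighted_preds)]:
    print(name, np.sqrt(mean_squared_error(val_targets, preds)))
```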
And once you've done all this hyperparameter optimization, ensembling, stacking, et cetera, you will start to realize what the best is that you can do with the data and the kinds of models you have; you will hit a wall at some point. You will not get to a hundred percent validation accuracy, unfortunately, because that signal may simply not be present in the inputs, and that's where you stop. You can almost clearly see that point: we got to about 1,300 with the random forest; maybe we try gradient boosting and get to 1,250 or so, but I think it's going to be very difficult to go beyond 1,200 or 1,150. Maybe with a lot of feature engineering we push it down to a thousand, but somewhere around there we're going to hit the limit. At that point is where we start interpreting the model and preparing to present our findings. So what are we going to do? Well, we have to explain why our model returns a particular result. Most scikit-learn models offer some kind of feature importance score: for linear models it's the coefficients, and for decision trees and random forests it's the feature importance score. So if you check the columns of the training data frame, you have these columns, and if you check the trained random forest's feature importances, you will see an importance for each of those features — store, day, month, and so on. Then you can put that into a data frame, so you can see the different features and their importances side by side, and you can plot it. This is the kind of thing you ultimately want to show in your presentation. In the presentation, you want to say: hey, the store is the biggest factor in determining what the sales are going to be; then it seems like the promotion is the next most important factor; then the day, the month, and the day of the week are important factors too; and the store type, et cetera, has some bearing as well. So maybe you then want to get more features that describe a store: maybe the area of the store is an important thing, maybe the location, maybe the number of people working there, because all of these could be indicators of how much business the store does, and all of this information is available. If you can grab that information and add those new features, you will start making better predictions. And that is how machine learning projects work: you don't just train a model, report a metric or a loss, and be done with it. You have to interpret the model, figure out what more information you need, and work out how to make it better; it's an iterative process that you keep improving over and over. And this is the kind of chart you would present to non-technical stakeholders to explain how the model arrives at its result.
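Here's a sketch of turning the feature importances into a data frame and a bar chart, assuming rf is the trained random forest and train_inputs holds the training columns; the exact plotting choices are illustrative.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tree-based scikit-learn models expose feature_importances_ after fitting
importance_df = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': rf.feature_importances_,
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature')
plt.show()
```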
And if you wanted even greater explainability, you could just use a single decision tree. Maybe the decision tree's prediction is a little worse — maybe instead of 1,300 it's at 1,400 — but the interpretability you get out of it might be more useful: first, to convince people that your model is working, and second, to make a case that they should gather some other information. Let's say they want to collect how many people work at every store, what the footfall at every store is, or the area of every store; they would have to put some resources into it, and that's where you have to make the case as a data scientist, because everything has a cost. One other thing you should do is look at some individual predictions. You should do it yourself, and you should also show them in your presentation, because a number or a training curve is not going to be impactful; what is going to be impactful is some specific examples. So here I've defined this predict_input function, and inside it you can see that I've encoded some logic: the input is a dictionary, and if the store is not open, I simply return zero. If it is open, then we create a data frame just like the training or test data frame and apply all the preprocessing steps we used for the training and test data: convert the date to a datetime column and extract the day, month, and year. Remember, your model needs all this information, so any new data that comes in has to go through the same preprocessing steps. Then we impute missing values using the imputer we have already fitted — we shouldn't lose track of that imputer; we should keep the same one around. Similarly, we shouldn't lose track of the scaler, because we need to scale to the same ranges used for the training set, and the encoder as well: you need to keep the encoder and the names of the encoded columns around. After all of this preprocessing, you get X_input, a data frame with a single row of imputed, encoded, and scaled numbers that can be passed into the model to get a prediction. You pass this one-row data frame into the model and you get back a prediction. Here's a sample input that you can play around with — you can try changing the date — and then you just make a prediction with the sample input and show what it looks like. It seems like the prediction here is 4,107.
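Here's a hedged sketch of such a predict_input function. The fitted imputer, scaler and encoder, along with the numeric_cols, categorical_cols and encoded_cols lists and the trained model, are assumed to exist from the earlier preprocessing (with the encoder producing dense output); the column names follow the Rossmann dataset, but treat this as an illustration rather than the exact function from the lecture.

```python
import pandas as pd

def predict_input(model, single_input):
    # A closed store sells nothing, so hard-code that case
    if single_input.get('Open') == 0:
        return 0.0
    input_df = pd.DataFrame([single_input])
    # Repeat the same date feature engineering used on the training data
    input_df['Date'] = pd.to_datetime(input_df['Date'])
    input_df['Year'] = input_df.Date.dt.year
    input_df['Month'] = input_df.Date.dt.month
    input_df['Day'] = input_df.Date.dt.day
    # Reuse the already-fitted imputer, scaler and encoder from training
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    return model.predict(X_input)[0]

# Hypothetical sample input (only a few of the required columns shown)
sample_input = {'Store': 1, 'Date': '2015-09-01', 'DayOfWeek': 2, 'Open': 1, 'Promo': 1}
# predict_input(model, sample_input)
```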
Now, this is the kind of thing you want to showcase, and maybe even build a simple tool for people to play with. If you have somebody on the business side who wants to experiment with your model, you could set up a simple web page for them using tools like the Flask web framework. On that page they could fill all of this information into a form, click submit, the information would be sent to the machine learning model, and the value — 4,107.65 — would be returned. This could be shared with store managers as well, and store managers could then put in values and check whether what comes out makes sense for their store. And if a lot of store managers come back and say that this value seems wrong to them, then maybe there is some problem: maybe you're getting a good loss, a good value for the evaluation metric — because that's what you optimized for — but in the real world the model isn't really performing well. So don't underestimate the importance of these qualitative checks. One other thing you will gather from this is that if a certain parameter, as you move it up and down, affects the output more than it should, then maybe your model has picked up some unintended correlation it shouldn't have. That's where you can go back and decide whether you really want to keep that column, because it seems to be correlated with the output even though you know from basic common sense, human insight, or business insight that it shouldn't be. That's where you need to work with the business and give them these tools to play with your model and interpret it in their own heads. So look at various examples from the training, validation, and test sets to decide whether you're happy with the results of your model. Then you need to present your results — whether it's a weekly presentation, a monthly one, or one at the end of the entire project; ideally you should be communicating as frequently as possible. You will need to create a presentation for non-technical stakeholders, so you need to understand your audience and figure out what they care about most. Maybe the metric is something they care about, but they probably also want to understand how the model works, so don't ignore that. Maybe the first time you just present a decision tree, get their confidence that the decision tree works well, and then tell them that a hundred decision trees, all slightly different, combined will give a better result, and that's why you should be using a random forest. Avoid showing any code or too much technical jargon: you definitely don't want to show code, and while you can mention a technique by name, don't talk about it for too long. The more important thing is to show visualizations and deliver insights in simple words. If you understand your model well enough, you should be able to explain in simple words why you've chosen the kind of model you've chosen, why you've engineered the features you've engineered, and what the results of your model are. If you say that your model's cross-entropy is 0.23, that's not conveying anything useful, and it probably indicates that you haven't understood how to interpret that value. But if you say, on the other hand, that the model is correct 98% of the time but has a slightly high false positive rate, that is something worth discussing. If they don't know what a false positive is, you can go into that, have more detailed conversations around it, and maybe show them a visualization — maybe a heat map, which is much easier to understand than raw numbers.
If you're presenting correlations or feature importances, the numbers themselves don't really mean much, but when you present them as a bar chart, anybody can make out that this is more important than that. Focus on the metrics that are relevant for the business: don't use loss functions; use the metrics that have been agreed upon, the metrics they care about. Talk about feature importance and how to interpret the results, and explain the strengths and the limitations of the model. This is where you can draw on specific examples and say: hey, this is the kind of example the model gets wrong. Somebody might then come to you and say there's an easy fix — "I can get you some more data for this particular case; will you be able to fix it?" — or you may decide together that it's a special case you don't want to deal with right now, or that maybe you want to train a different model. All of that will come out of discussion. Finally, you also want to explain how the model can be improved over time: whether you need more data, whether you should be retraining the model every two months. And this is one of the limitations of the model we just trained: I can be confident that for the months of August and September it's probably going to work pretty well, but for other months, especially the Christmas season, I don't think the model is going to work very well, because that data wasn't available for validation. Those are the kinds of things you have to convey so that you're not setting the wrong expectations. These may seem like obvious things, but most machine learning models that get deployed — you can look this up, something like 80 or 90% of them — fail, and they fail miserably, either because of unclear communication about the requirements, or because the data scientist didn't pick a good validation set that reflects the kind of data the model will see in the real world, or because they optimized the wrong metric — optimizing just the loss when they should have been optimizing a certain business metric. Those are the kinds of things you have to keep in mind, and that's what makes a good data scientist. Okay, one last thing I want to show you is how to make a submission on Kaggle. To quickly recap: on Kaggle we went to the data tab and downloaded the data — you can download it directly there, or using the opendatasets tool. Then we loaded up train.csv and store.csv, and we added the data from store.csv to the training data, which gave us a bunch of new columns. We also loaded up test.csv, which does not have the sales or customer information, and we've not yet looked at the sample submission. So here's what we're going to do. We'll take the test data, X_test — we've been doing all the preprocessing on it even though we've not used it yet — and pass it into the random forest's .predict method to get test_preds, the predictions on the test set. Looking good. Then let's load up the submission data frame from the sample_submission.csv file; Kaggle always gives you a sample_submission.csv file. Let's take a look at it.
So it seems like there is an Id column and a Sales column, and we need to fill out the Sales column. How are we going to do that? Well, here's one thing we could do: we could just set submission_df['Sales'] = test_preds. If you look at the test data frame, it has these Ids — one, two, three, four, five — and these Ids are in the same order; we have not changed them. From the test data frame we performed some preprocessing and then picked out X_test, and while X_test does not contain the Id column, it does not change the order of rows. So the outputs we get from X_test — the test predictions — apply to these Ids in the exact same order. I've loaded up the submission data frame, which contains all the Ids, and the Sales column currently says zero everywhere, so I'm simply going to put in the values of test_preds. That seems fine, and if I check the submission data frame now, you can see there is a prediction for each row. But one other thing I should be doing is filling in zeros wherever the store is closed — remember, our model doesn't know how to handle that — so wherever the store is closed, we need to fill in a zero. Here's what I'm going to do. The test data frame has an Open column, which is either one or zero; you can check this with value_counts. So I'm going to convert it to an integer with astype(int). There seem to be some NaN values, so wherever Open is NaN I'm just going to do fillna(1) — assume the store is open — and then astype(int). Now this is a one-or-zero column, and if I multiply it with test_preds, wherever Open is one we keep the value from test_preds, and wherever it is zero we get a zero. In this way we fill zeros into the places where the store is not open: we completely ignore the model's prediction when the store is closed and just fill in a zero. So now I can put that directly into the submission data frame: we take the test predictions and multiply them by whether the store is open or not. If the store is open, Open is one, so we get the actual value predicted by the model; if the store is closed, Open is zero and we get zero, which is correct, because when the store is closed the sales are going to be zero. All right, so we've loaded up the submission file, made test predictions, and added those predictions back into the submission file. The last thing we need to do is write it back to a CSV file, so I'm going to call submission_df.to_csv('submission.csv') and also set index=None. The reason for this is that if I don't set it — let me just open the file and show you — here's the submission.csv file we just created; it contains an Id and the Sales, and if you don't pass index=None, a third column, the data frame's index, gets added too. So just to avoid that third column, I've put index=None.
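Here's a sketch of this submission step, assuming rf is the trained model, X_test the preprocessed test inputs, and test_df the raw test data frame; the file path assumes the files sit in the folder created by the opendatasets download, so adjust it if needed.

```python
import pandas as pd

# Predict on the prepared test inputs (same row order as test_df)
test_preds = rf.predict(X_test)

submission_df = pd.read_csv('rossmann-store-sales/sample_submission.csv')

# Treat missing values in Open as "open", then zero out predictions for closed stores
open_flag = test_df['Open'].fillna(1).astype(int)
submission_df['Sales'] = test_preds * open_flag

# index=None avoids writing the data frame's index as an extra column
submission_df.to_csv('submission.csv', index=None)
```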
You can see what happens when you do it without, but here is the submission.csv file: it contains Id and Sales, the predictions on the test set. We can then download this file. One way is from the file browser, using the download option; another way is to generate a FileLink, a useful IPython tool that gives you a link to the file so you can just download it. That doesn't seem to work on Google Colab, so in any case I'm just going to download it from the file browser and save it on my desktop. So we made predictions for the test set, and now we have this file, submission.csv, downloaded with all the test predictions. I'm going to go back to the Kaggle competition page and click on "Late Submission", because this competition is already closed; otherwise this would just say "Submit Predictions". There's an upload button here; I click it and select submission.csv. And where is this submission coming from? It comes from the random forest, so I'll add a note here — random forest with 20 estimators — and once the file is uploaded, I click "Make Submission". What Kaggle is doing now is comparing the predictions from my file to the actual data they have for the test set, which was not shared with us — it was hidden. And now you can see where you land on the leaderboard. Because this competition has ended, you will not actually feature on the leaderboard, but if the competition were live, you would show up on it somewhere. So let's look at this number: 0.14431. Let's go to the leaderboard — and interestingly, there are two leaderboards, a public leaderboard and a private leaderboard. Here's what happens, just to make it even harder for you: when you upload your CSV file, you are shown your rank on the public leaderboard, which is calculated using only 33% of the test data. Then, when the competition ends, they show you the private leaderboard, which is calculated on the remaining 67% of the test data. You can see the extent to which they go to prevent overfitting: they do not want you to overfit to the public leaderboard, they want your model to generalize, and overfitting is a central problem in machine learning. So you make predictions on the test set, you upload them, and based on that you get your public leaderboard score. Every day you can make four or five submissions, so you try to optimize your public leaderboard score, but you shouldn't overfit to it; you should make sure your model generalizes well enough. At the end of the competition, the models that generalize best and are also the most powerful or well trained show up at the top, and you will often see a discrepancy. You can see the team that had the best score on the public leaderboard — where are they on the private leaderboard? They fell six positions, so they essentially overfitted to the public leaderboard, and another participant moved up ten places.
So it's a very interesting thing to observe with every competition. In any case, our score was about 0.1443; let's see where that comes in. There are about 3,298 participants in this challenge. Now, remember, we've not really optimized our model, so you can expect it to be pretty bad, but here we are — somewhere around position 3,000 out of roughly 3,300. That's not great, but it's not too bad either. In general, with a Kaggle competition, if you can be in the top half, that means you've trained a useful model, because the competition at the top is very intense and people are fighting for the last decimal places. But if you're in the top half, you've trained a useful model, and if you're in the top 10%, which here means roughly the top 330, you've trained a really good model. So here's the challenge for you: replicate this notebook and try to get within the top 10% — roughly within rank 330 — or at least try for the top 20%, say within the top 500 or so. If you can actually get within the top 500 with all of these techniques and all of these resources, then you are training pretty good machine learning models; nobody can deny you that. And that's what we have for today. You can make multiple submissions, and you can see a history of your submissions here; these are some submissions I've made, and it seems like I was able to get to about 0.143 with a hundred estimators, and there's a lot more hyperparameter tuning that can be done. You can then take that score and put it into the experiment spreadsheet, maybe in the test score column. So this is how you work through a Kaggle competition: you have a list of ideas you want to try out, you predict what those ideas will do, you note down the learnings once you actually try them out — and a lot of times your learnings will be different — and you keep track of your hyperparameters systematically. One last thing people tend to do is that at the very end of a Kaggle competition, when the deadline is about to close, they pick their five best models and ensemble them, which means they average out the predictions, or maybe they stack them, which means they train another model on top of those predictions, and that's how they get to the top. And the good thing about Kaggle competitions is that you can see what the winners have done. If you go to the discussion section or the code section, you can see what feature engineering they've done, how they've trained their models, how they've applied cross validation, and what imputation techniques they've used. There is no better way to learn than to learn from the winners of Kaggle competitions, so get into the habit of reading code. If you can start reading and understanding code — and a lot of these notebooks have pretty good documentation around them as well — there is so much you can learn, and this information isn't available anywhere else; no textbook or course is going to have it.
What you want to do is go to Kaggle competitions, attempt them yourself first, then go through the discussions and the code section, see what people have tried, read their code, copy and paste it, modify it, and try a few things on top of it. That is how you're going to become an effective machine learning practitioner, and whatever you need along the way, you can learn the theory as well. So an exercise for you is to repeat the steps from this notebook on the breast cancer identification dataset. That is not a competition, just a dataset, but it's still worth repeating all these steps. You can also pick out some other Kaggle competitions and try to work through them; look for competitions where people have trained random forests or decision trees, and you should be fine. One last piece we've not covered here is that once you're happy with the model and everybody is on board, you need to hand the model over to a software developer who can put it into production as part of an existing software system, and then you need to monitor the results and make improvements from time to time — for example, retraining the model every two or three months. Here's another tutorial you can check out; it requires knowledge of the Flask framework and the Heroku platform, so it's not something data scientists typically have to do, because this is usually handed over to software developers, but it's still a good thing to learn and try out. Okay, so that's how you work on a machine learning project. You understand the business requirements and the nature of the available data. You classify the problem as supervised or unsupervised, and as regression or classification. You clean, explore, and visualize the data, and you create new features. You perform data preparation, where you create training, validation, and test sets and perform imputation, categorical encoding, et cetera. You come up with some quick baseline models to evaluate your future models against. Then you train a bunch of different models that are applicable to your problem and pursue the ideas that are working well. You apply regularization, hyperparameter tuning, ensembling, stacking, and K-fold cross validation. And finally, you interpret and present these models to the other stakeholders and work with them iteratively. This is not a linear process; you may have to go back and forth. Sometimes you may have to refine the business requirements; sometimes you may have to change the type of problem — you can say, "I can't predict the exact value, but I can give you a few categories; is that good enough?" Sometimes you may want to perform some additional cleaning, or bring in more features. So you go back and forth, but this is how you should set up a notebook, and then you should have the spreadsheet on the side: list out all your ideas there, put your ideas into action in the notebook, keep recording the results of your experiments, and you will be an effective data scientist. The topic for today is gradient boosting machines, or GBMs, with XGBoost, and here's what we're going to cover. We're going to download a real-world dataset from a Kaggle competition, perform feature engineering, and prepare the dataset for training. We're going to train and interpret a gradient boosting model using the XGBoost library.
We're going to train with K-fold cross validation — a different kind of validation technique — and we will also ensemble the results once we do the K-fold cross validation. We'll also look at how to configure the gradient boosting model and how to tune its hyperparameters. Gradient boosting is a very powerful technique, probably one of the most powerful classical machine learning algorithms, so for pretty much any problem concerning tabular data you may find it useful to try out gradient boosting. It's based on a very simple but elegant idea, which we will discuss when we come to the model training. So let's begin by installing the required libraries. Here I'm installing numpy, pandas, matplotlib, seaborn, jovian, opendatasets, xgboost — the library we're using today — and graphviz, which is required to visualize the trees created by XGBoost (and by LightGBM as well, if you want to use LightGBM). One issue you may face from time to time is that a certain notebook runs on your computer but does not run on Colab, or vice versa. Whenever you face such issues, where the same code does not run in two different environments or leads to an error in one of them, you want to make sure that the versions of the libraries installed in the two environments are the same. What I mean by that is: if you run pip list, you can see the versions of all the libraries you have installed, and there is another tool called grep, so after pip list you can type the pipe character, then grep, then, say, xgboost, and that will show you just the version of xgboost you have installed, which in this case is 1.4.2. If you don't add the grep, pip list shows you every library you have installed, which may be too much, so it's useful, especially on Colab, to add the grep and check the version. There are other ways to check versions too, but this one is standard across all installed packages because it uses the pip package manager. So we are using XGBoost 1.4.2. We are going to take a practical, coding-focused approach here: we will learn gradient boosting by applying it to a real-world dataset, and we're going to pick the dataset from a Kaggle competition called the Rossmann Store Sales competition. Here's some description of it. Rossmann is a company that operates over 3,000 drugstores in seven European countries — this is a real company and this is all real data. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors: for example, whether there's a promotion running at the store or across the entire company, whether there is a competing store nearby, school and state holidays (because those affect people's behavior), seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can be quite varied. So you are provided with historical data for 1,115 Rossmann stores, and the task is to forecast the Sales column for the test set. We will look at the training set and the test set, but this is what we want to predict.
We have data for 1,115 stores, and we want to predict the sales on every single day. We are going to download the data from this Kaggle competition; if we go to the data tab, this is where we can see some description of the data. It turns out there are four files: train.csv, test.csv, sample_submission.csv, and store.csv, and we can look at the files here. If you're unable to see this table, just go to the rules tab and click "I accept the rules"; otherwise you will not be able to see the data. In the data tab you can see train.csv. This file contains nine columns, shown in a tabular format: there is a store ID (store IDs go from 1 to 1,115), a date, and they've conveniently also provided which day of the week it is (1 to 7, Monday to Sunday, I would imagine). Then they've given you the sales on that particular date, the number of customers who came to the store that day, whether the store was open, whether the store was running a promotion, whether it was a state holiday, and whether it was a school holiday. All of this is important information that can determine the sales. Now let's check the test.csv file and see what's different. It also contains stores 1 to 1,115, and apart from that, each row has an Id — presumably for evaluation of your submissions, because you have to make predictions on the test set and submit them to this Kaggle competition. Then you have the day of the week and the date, as before. One thing you will notice is that the dates in the test set occur after those in the training set, and that's just to reflect how the model will be used in real life: you train on data from the past and make predictions on data from the future. Then it has Open, Promo, StateHoliday, and SchoolHoliday. The things it does not have are the number of customers and the sales. This means you cannot use customers as an input to predict sales, but you can use all the other things: the store, the day of the week, the date — although you'll have to figure out what to do with the date, because these exact dates occur in the future and never appear in the training set, so we may have to transform it in some way. So this is a typical machine learning problem: you have a training file and a test file. But there is one more file here, store.csv, which contains metadata about the 1,115 stores. If I go back to train.csv, you can see that we have several rows of data for every store — if you check the total number of rows, you'll see it's over a million — so for every store we have data for the entire date range, which is from 1st January 2013 to 31st July 2015. On the other hand, if you check store.csv, we have just 1,115 rows, with some metadata for each store: the store type, the assortment (which conveys something about what the store offers), and the competition distance — I think this is how far away a nearby competitor is. And you can check the description here: CompetitionDistance is the distance in meters to the nearest competitor store.
Then we have CompetitionOpenSinceMonth and CompetitionOpenSinceYear, which give you the approximate year and month at which the nearest competitor was opened. So, for example, this competitor has been open since September 2008. Then there is also a second kind of promotion here. Promo, which is part of the training set, indicates whether a store was running a promotion on that day, but there is also Promo2, which is part of the store.csv file. Promo2 is a continuing and consecutive company-wide promotion for some stores: if a store participates in this company-wide promotion, Promo2 has the value one; otherwise it has the value zero. This is not dependent on the date; it's just a store-wise setting. Then there are a couple of other columns, Promo2SinceWeek and Promo2SinceYear: stores start participating in this company-wide promotion from a certain point, so you're given the year and the week number within the year when a particular store started participating. There's also another column called PromoInterval, which describes the consecutive intervals in which Promo2 is started anew. It seems that with Promo2 the promotion is started several times a year — maybe it's some kind of contest or raffle that runs in rounds — so when you have "Feb,May,Aug,Nov", it means that each round of this company-wide promotion is begun by that store in February, May, August, and November. So there's a lot of information here, and it's something you'd have to read carefully and try to understand, but it is store-level information, and we will have to figure out how to use it alongside the day-to-day data, which is the train.csv data. Okay, so that's a quick overview of the dataset. Finally, what we need to do is produce a submission.csv file, which should have Ids reflecting the Ids from the test set, with the Sales column filled in with the sales your model predicts for the test set after training on the training data. So the Id comes from the test set, the predicted sale goes next to it, you create a submission.csv file, and then you can make a submission. You're given some training data, you train a model on it, you make predictions on the test data, and you submit those predictions; once you do, you'll be able to see where your predictions rank on the leaderboard. Now, this competition has already ended, but you can still make submissions and still track where you would land. So that's the dataset, and we can download it from Kaggle directly within Jupyter using the opendatasets library.
Now, one option is to download the data from the data tab onto your computer and upload it to Colab, but the easier way is to just use the opendatasets library. Make sure to accept the competition rules before you try to download the data: go to the rules tab, click "I accept the competition rules", and review them as well. Once you've done that, we import opendatasets as od and run od.download, giving it the link of the competition, which is just kaggle.com/c/rossmann-store-sales. You are then asked for your Kaggle username, and you will also have to supply your Kaggle API key, which you can get by going to your account on Kaggle: click your avatar, click on "Account", scroll down, and click "Create New API Token". That saves a file called kaggle.json to your computer, and when I open it I can see my Kaggle credentials. Now, this is a secret, so you should not save it in a variable within a Jupyter notebook. So let me open my kaggle.json file: here I have my username, which I enter, and then I enter my Kaggle API key, and that downloads the dataset for me. You can verify that the dataset has been downloaded here in the file browser. Let's check the files that have been downloaded; as expected, it's the exact same files: sample_submission, train, test, and store.csv. The first thing we'll do is load them into pandas data frames. I'm going to load train.csv into a data frame called ross_df (for Rossmann), and I'll load the store, test, and submission data frames as well. Let's take a look at each of these. This is ross_df, and it contains exactly the information we saw earlier: the store, the date, the day of the week for that date, the sales on that date, the customers, open, promo, state holiday, and school holiday. There are over a million rows of data, so that's a lot of data to train on. Then we have the test data frame: here we have an Id, store, date, day of week, open, promo, state holiday, and school holiday — as before, no customers and no sales. Then we have the submission data frame: these are the Ids from the test.csv file, and this is where we need to put in our sales predictions to generate our submission file. And finally, we have the store data frame, where we have a bunch of information for all 1,115 stores. Now, the first thing I would do in such a situation is simply merge the information from store_df into the training and test data frames, because it would be a lot more useful if this information — the assortment at the store, the category of the store, et cetera — were part of ross_df alongside everything else. So that's what I'm going to do: I'm going to call ross_df.merge(store_df), with store_df being the information about the stores, and do a left outer join, which means we retain all the days of data from the training set; we don't want to lose any days.
For each day, we want to add the information for the appropriate store, so we're joining on the Store column, and we do a similar join on the test set. This is what the merged data frame looks like: you have the store, the day of the week, the date, the sales, the customers, whether it's a state holiday, et cetera, and then you have the store type, assortment, competition distance, and so on. Of course, these store-level details get repeated several times: every row where the store number is one will carry the same store details. Now, this is the point where we would do some exploratory data analysis, which I will leave as an exercise for you: study the distribution of values in each column and their relationship or correlation with the target column, Sales. With that, we've downloaded our dataset, and I'm just going to save my work here. On Google Colab, when you run jovian.commit to save a snapshot of your notebook, you will be asked for an API key; just go to your Jovian profile, click "Copy API key", come back, and paste it here. So just open jovian.ai, and in the "Get Started" tab you should see this button. Once you paste the key, the notebook is saved, and a snapshot of it has been captured on your Jovian profile — a snapshot you can run anytime. You can always access it by going back to jovian.ai and checking the notebooks tab. All right, so let's talk about preprocessing and a new and very important topic: feature engineering. Let's take a look at the available columns and see if we can create any new columns or apply any useful transformations. One thing I'm already seeing is that we have a date column, and the date value by itself may not be very useful, because all the dates in the training set belong to the past while all the dates in the test set occur after them. What may be more useful is extracting some parts out of the date: things like the year, the month, and the day of the month — because a lot of stores run on monthly cycles, so they may have higher or lower sales towards the end or the beginning of the month; people get their salaries at the beginning of the month, so they may spend more during the first couple of weeks. The week of the year might also be useful: whether it's the first week of the year, when a lot of people are making new purchases, or the last week of the year, when it's the holiday season, versus somewhere in between that's not as busy. So for the date we want to, first, convert it into a datetime object — right now its type is just object — and second, parse information out of it. That's what we're going to do: we're going to extract different parts of the date. And because we have to do this both on the training set and on the test set, and later also when we make predictions on single inputs, we define a split_date function. The split_date function takes a data frame, first converts the Date column into a datetime column using pd.to_datetime, and then extracts the year, the month, the day, and the week of the year.
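Here's a hedged sketch of the merge and the split_date function described above; the data frame names (ross_df, store_df, test_df, merged_df, merged_test_df) follow the ones used in the lecture, and the exact new column names are illustrative.

```python
import pandas as pd

# Attach the store-level metadata to every daily row (left join keeps all days)
merged_df = ross_df.merge(store_df, how='left', on='Store')
merged_test_df = test_df.merge(store_df, how='left', on='Store')

def split_date(df):
    # Parse the Date column and pull out useful parts of the date
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['WeekOfYear'] = df.Date.dt.isocalendar().week

split_date(merged_df)
split_date(merged_test_df)
```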
And now we can call split_date on merged_df and merged_test_df, both of them. Remember, merged_df is simply train.csv combined with store.csv, and merged_test_df is test.csv combined with store.csv. In the merged data frame you will now see these new columns: year, month, day, and week of year. Note that we already have day of week, so we don't need to extract that; we already know whether it's a Monday, Tuesday, Wednesday, Thursday, Friday, Saturday or Sunday. This is essentially what feature engineering is. A model may not be able to make a lot of sense out of a string like "2015-07-31", or even out of that date converted to a single number, but once you put it in this format, you are helping your model: now it can pick up monthly patterns, daily patterns through the month, weekly patterns over the year, and the pattern during the week. All of these patterns can now be picked up by our model. So that was one useful thing to do. The next thing to notice is that there is this column called open, which tells you whether the store is open on a certain date. If the store is not open on a certain date, it's quite likely that there will not be any sales. Let's verify that: if we take merged_df, extract the rows where merged_df.Open is zero (meaning the store is closed), and check what values sales takes on those dates, it turns out that the only value when the store is closed is zero. There are about 172,000 rows of data out of the million where the store is closed, and on all those days the sales are zero. You can verify this: if you remove value_counts, you'll see that all these sales are zero, and that's what we've confirmed here just by adding .value_counts as well. So now, instead of trying to model this relationship with a machine learning model, it would be a lot easier to just hard-code it in our predictions. Our model has a limited capacity, whatever kind of model we're training, and having to figure out that when open is zero the sales are zero is just additional work we can save the model. We can simply hard-code this relationship, so that whenever we make predictions on the test set, for example, we set the sales to zero if the store is closed. That's what we can do for the test set. What do we do in the training set? There, I would argue the best thing we can do right now is to simply remove all these rows and train a model that works on data for when the store is open. We are only building this model for the days when the store is open, and that will make it easier for the model to learn the other relationships, which are not as easy to guess. So here's what we do: we say merged_df.Open == 1, which selects all the rows where the store is open.
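Here's roughly what that check and the filtering look like in code (column names Open and Sales as in the Kaggle files; a rough sketch, not the exact notebook cell):

```python
# Sanity check: on days when a store is closed, the only observed sales value is 0
print(merged_df[merged_df['Open'] == 0]['Sales'].value_counts())

# Keep only the rows where the store was open; .copy() gives us an independent
# data frame we can modify without affecting the original merged frame
merged_df = merged_df[merged_df['Open'] == 1].copy()
```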
And we're just creating a copy here so that we can make modifications to this data frame without affecting the old one. So now we've removed from the training set all the entries where the particular store that the row refers to was closed. We are not going to remove any rows from the test set, because remember, we have to generate predictions for all the rows in the test set; we can't ignore any rows there. We're only ignoring them for training, because we don't want to train on data from when the store was closed. Okay. So let's get into some slightly more advanced feature engineering. There are these "competition open since month" and "competition open since year" columns. If I just check merged_df, you can see there is a competition distance, which tells you the distance in meters to a competing store, maybe from a different company, and "competition open since", which tells you when that competing store opened. So this competing store opened in September of 2008, and what date are we looking at here? 2015-07-31 for this particular store. Let me pick a sample so that we can see some variety. Okay, let's look at this one, for example: we are looking at the data for store number 925 on 2015-03-07, so March 7, 2015. It turns out there is a competing store 470 meters away, and that store has been open since March of 2007. Now, we could pass this information into the model as-is: the year and month the competing store opened. But something that will be a lot more useful for the model is if we simply indicate, one, whether there is a competing store, and two, if there is, for how long it has been open. How do we quantify that? We can count the number of months the competing store has been open. For example, if we're looking at March 2015 and the store has been open since March 2007, that is exactly eight years and zero months, so this particular competing store has been open for 8 × 12 = 96 months. I would argue that instead of telling the model that the competition has been open since March of 2007, it is a lot more useful to tell the model that the competing store has been open for 96 months as of the particular day we are trying to model. So this is what we're going to do for every row: we pick out the current date, and using the current date and the information in "competition open since month" and "competition open since year", we calculate a new column called competition open, which is simply the number of months the competing store has been open. The general idea is a hypothesis: the longer a competing store has been open, the more impact it will have on this store's sales. You may have different ways of looking at this, but this is one approach we're going to try and see if it helps the model. So we're defining this comp_months function, which takes a data frame and first checks how many years the competition has been open, compared to the current year.
Remember, we've already extracted the year out, so we can do 2015 minus 2007. Then we also add however many additional months there are. So if the current date in a particular row is July of 2015 and the competition has been open since March of 2007, we go (2015 − 2007) × 12, which is 96, and then July, which is stored as the number 7, minus March, which is 3, so 7 − 3 = 4 more months. So this says the competition has been open for 100 months, and of course we do that row by row, for every row. Now, the only difficulty is that sometimes a competing store may have opened in 2014 or 2015, and we don't want to put information about the future into any row. For example, if we are looking at store number one on January 1st, 2013, and at that point there is no competing store yet, then we should probably just put in zero for competition open, or maybe some other value like NaN. That's one thing to take care of: we don't want to leak information about the future into the past. That's why there is this additional line, and I'll let you figure out exactly how it works, but it simply replaces the negative values with zero. So wherever competition open has a negative value, meaning there is no competing store yet but one will open sometime in the future, we replace it with zero. After doing all of this, the result is this. If I pick out the competition-related columns (and I'm picking a random sample here), this is one row from the training set: we have the date 2013-05-18, the distance to the competing store (250 meters), the competition open since year and competition open since month, and the number of months we've calculated that the competition has been open for. This is far more useful for our model than the raw year and month, and it has been calculated using the date together with the competition open since year and month. In the cases where there is no competing store nearby, the value is zero, and in the cases where a competing store will only open in the future, the value is also zero. I can't find a good example here, but suppose the date in a row is in 2013 and the competition opens in 2014; then this value would be zero as well. So that's the kind of feature engineering people do to present information in a form that is most useful for the model. All right. Now there is another set of columns, to do with Promo2. If you recall, Promo2 represents a company-wide recurring promotion that stores can participate in. If a store participates in Promo2, you will see the number one here, otherwise zero. Each participating store signed up for this company-wide promotion at a different point in time, and you can see those dates here: we have a year and we have a week. Using these, maybe we can identify for how long this store has been running the promotion; we can compute how many months the store has been participating in Promo2.
And then this promotion is something that is kicked off multiple times every year, which is where this promo interval comes in. Here it tells you that Promo2 is kicked off at this store every January, April, July, and October, since some time in 2010. So one thing we can do is combine these columns to compute for how long Promo2 has been running at a particular store. The second thing we can do is identify whether a new round of the promotion is starting in the current month. That might be useful information, because the promo interval by itself is not something we can pass into the model, but if we compare it with the current month and create a new column telling us whether a promotion is getting kicked off in the current month, that might be a more useful piece of information for the model. So that is what we're going to do, and there are a couple of functions that perform this. I'll leave the details as an exercise for you; they're primarily just some pandas data frame manipulations, and a rough sketch of both computations is shown below. The end result is that we add a Promo2 open column, which tells you the months since Promo2 started at the store, and an "is Promo2 month" column, which tells you whether a new round of the promotion starts in the current month. You can see that here: we have this Promo2 open column, the number of months since the promotion started. You can verify it: the store joined in week 40 of 2014, somewhere towards the end of 2014, and here we are looking at March of 2015, so the promotion has been running for about five months. This promotion runs in January, April, July, and October, and March is not in that list, which is why you see a zero in the "is Promo2 month" column. If the current month matched, you would see a one. For example, here the date is in February, and February is listed in the promo interval, so here we have a one. So now we don't need the promo interval column, because we have "is Promo2 month", and we don't need the Promo2 since year and since week columns, because we have Promo2 open. Putting the specific computations aside, the idea is that we did a join, took some store-wide information, and then used that store-wide information in combination with the present date for each row in the training set to compute new columns, which gave us richer information about the data. And this is probably 70 or 80% of machine learning: cleaning the data, getting it into the right shape, combining different data sources, and engineering new features based on insights like these, that maybe we should keep track of how long a competitor has been open, or how long a promotion has been running, et cetera. This is not very easy to come up with. I've pulled it out of some of the best notebooks I saw for this Kaggle competition, and these are possibly not even the best features.
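Here is a rough sketch of the two engineered features just described, adapted from the kind of code shared publicly for this competition. The raw column names (CompetitionOpenSinceYear, Promo2SinceWeek, PromoInterval, etc.) follow the Kaggle files, while the engineered names (CompetitionOpen, Promo2Open, IsPromo2Month), the week-to-month conversion, and the month abbreviations are assumptions on my part, so treat the details as illustrative:

```python
def comp_months(df):
    # Months the competing store has been open on each row's date; negative values
    # (competitor opens in the future) and missing values are replaced with 0
    df['CompetitionOpen'] = 12 * (df['Year'] - df['CompetitionOpenSinceYear']) + \
                            (df['Month'] - df['CompetitionOpenSinceMonth'])
    df['CompetitionOpen'] = df['CompetitionOpen'].map(lambda x: 0 if x < 0 else x).fillna(0)

def promo_cols(df):
    # Approximate months since the store joined Promo2 (week of year converted to months)
    df['Promo2Open'] = 12 * (df['Year'] - df['Promo2SinceYear']) + \
                       (df['WeekOfYear'] - df['Promo2SinceWeek']) * 7 / 30.5
    df['Promo2Open'] = df['Promo2Open'].map(lambda x: 0 if x < 0 else x).fillna(0) * df['Promo2']
    # Whether a new round of Promo2 kicks off in the current row's month
    month_abbrs = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun',
                   7: 'Jul', 8: 'Aug', 9: 'Sept', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
    df['IsPromo2Month'] = df.apply(
        lambda row: row['Promo2'] == 1 and month_abbrs[row['Month']] in str(row['PromoInterval']),
        axis=1).astype(int)

comp_months(merged_df)
comp_months(merged_test_df)
promo_cols(merged_df)
promo_cols(merged_test_df)
```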
So you may have to think about what other features you can create, but this is what feature engineering is: combining multiple columns, combining data frames, and transforming things, like breaking a date out into year, month, et cetera. Okay. With that, we have now completed most of our feature engineering work. If you check merged_df now, we should have a lot more columns, about 25 of them: some competition-related, some promo-related, and some date-related columns as well. So let's move on. Now, once we've done this feature engineering and this exploration... and again, there is an interplay here. It's not that you do feature engineering once and you're done. You do some feature engineering, then you do some more EDA, and using that EDA maybe you do more feature engineering; maybe you reject some of the features you created, maybe you combine some features to get new ones, until you start seeing some kind of pattern or correlation. And it's never just done and finished: you train a model, and if the model is not good enough, maybe you go back and do more feature engineering. So although we are following a linear process here, that is not how it typically works. This problem that we are solving over the course of three hours is something people spent two to three months on for the duration of the Kaggle competition, so there's a lot of back and forth, and nothing is ever simply finished. But that said, moving on from feature engineering, let us now select the columns that we want to use for training. Here are all the columns that we have either picked up from the dataset or created, and out of these, let's see which ones we are going to use as inputs. The store can be an input; we can keep it because the store numbers in the training set and the test set are the same, and we are making predictions for the same stores. If we wanted to create a model that could work on a new store not seen in the training set, then we could not use the store number, so keep that in mind. But in this case it's the same 1,115 stores, so we can use it. Then we have day of week; that is going to be useful for sure. We have date; we can drop it, because we have extracted all the important parts out of it. We have sales; this is the target column, so that's what we've put as the target. We have the number of customers. This is not something we can use, because this information is not available in the test set, and of course when you deploy the model to make predictions for the next six weeks, you will not have the number of customers either. One thing you could do is train a separate model that predicts the number of customers and then use that model's prediction as an input to this model, and so on, but let's not go there yet. Right now we are simply going to drop number of customers as an input column. Open is again something we're not using. We've already used the information in it by dropping all the rows of data where the particular store was not open, and we are training a model only for the days when the store is open. So open is now a useless column.
It just contains the value one now. Promo will be useful: whether the store is running a promotion or not is going to have an impact on the sales. State holiday and school holiday are both useful, so we'll keep those among the input columns as well. Store type and assortment again seem like very useful things, so we should keep them around, and by the way, these are categorical; you can verify that by looking at the unique values in these columns. Competition distance, yes, that's useful: how far away a competing store is. Competition open since month and year, remember, we have combined these into competition open, which simply tells you how many months the competing store has been open for, so let's keep that. Then the date has been decomposed into day, month, and year, so we have those, plus week of year. We have Promo2 and the columns we derived from it, all of which are useful: whether the store participates in Promo2, for how long it has been participating (the number of months), and whether a new round of Promo2 kicks off in the current month. All of this is information we'll keep. The things we have engineered away, we can remove: date, competition open since month, competition open since year, Promo2 since week, et cetera. And the target is sales. So once again, I'm going to take the input columns out of the merged data frame to create a data frame called inputs, and I'm going to take the target column out, a single column, to create a series. So targets will be a series and inputs will be a data frame. I'm also creating copies here, because this is a fairly small data set and I don't have to worry too much about multiple copies; if you're working with a really large dataset, maybe you shouldn't create a copy and you just modify the original data. You can now check: this is inputs, and it contains just the input columns. One thing to note here is that state holiday is a categorical column that has not just zero or one but several types of holidays that it tracks, values like a, b, c, and so on, and store type and assortment are also categorical columns with text, so we'll have to handle them separately; the rest look numeric. Although one could argue that maybe we should treat day of week as categorical too, because sales are generally not linearly correlated with the day of the week; from Monday to Sunday there isn't a linear increase. So we should probably treat it as categorical, and one could argue the same for day, month, and year as well. This is where you have to choose which of these you want to treat numerically and which categorically. The good thing is that if you're using decision trees, they can typically work with both types, because decision trees can create decisions like "is day of week less than one?" to pick out day of week zero and send the rest of the days in another direction. So decision trees can pick out specific days like Friday, Saturday, or Monday as specific branches using binary decisions, even if you don't create separate columns for each of those days. But that's still a decision you have to make.
Should we treat some of these as categorical or some as numerical? So that's inputs. Let's take a look at targets as well: these are the values we're trying to predict, the sales. And we're also going to generate test inputs from merged_test_df: we take the input columns and create a copy, so this contains all the test inputs. At this point, the test inputs and train inputs should have the same number of columns. If they don't, then obviously you're not going to be able to train a model that can make predictions on the test set. So when you get to this point, where you've selected input columns, make sure the test inputs and train inputs have the same number of columns, which in this case is 16. Then comes the somewhat subjective exercise of picking the numeric and categorical columns. There are some binary columns that have zero/one values; we don't really need to one-hot encode them, since they're already effectively one-hot encoded as zero and one, so put them into numeric just to keep things simple and avoid introducing new columns. Now, some columns like month and day could be made categorical, but I chose to keep them numeric and see what happens. You can try converting them into categorical columns if you want and see if that has an effect. So the only things I'm keeping categorical right now are the day of the week (because there are only seven values), the state holiday (which has those values a, b, c, and so on, so it's definitely not numeric), and store type and assortment, which are also not numeric. So whatever the textual columns are, I am making them categorical, and the one exception I've made is day of week. You can experiment with this in the future and see if you get a better result. Okay, so let's now impute the missing numerical data. Instead of immediately reaching for an imputer, let's check where the missing numerical data actually is. It turns out the only place where numeric data is missing is competition distance, and this is possibly because for those stores there is no nearby competing store. The same is true for the test inputs, and there are no other columns with missing data. So let's think about this for a moment. We need to put in a number for competition distance, otherwise some of our models may not train, and a missing value here indicates that there is no competing store near this store. Now, if we put the value zero here, or a very small value, that would mean the opposite: that the competition is right next door. So we should probably not put zero here. What we might instead want to do is put in a very large value. Instead of saying there is no competition nearby, let's say the competition is a hundred kilometers away or something like that, just to retain the meaning of the value that is inserted. Zero would suggest the competition is next door; that's not what we want to suggest. We want to suggest there is no competition nearby, and the closest thing to that is to suggest that the competition is really far away.
So here's the trick I'm going to use: I am going to fill these empty values with a multiple of the highest value in the competition distance column. Let's check the max distance. It is 75,860, which is about 75 kilometers. That seems reasonable, and if I want, I can even double that number and make it about 150 kilometers. So wherever competition distance has the value NaN, I'm going to fill in 75,860 × 2. I'm effectively saying that the competition is about 150 kilometers away, which conveys essentially the same meaning as this store not having any competition. This is, again, something you have to do carefully when you are filling in missing values. It's not simply a matter of using an imputer and putting in the mean or the median. Sometimes you have to think about what it means to fill in a certain value: filling in the mean would change the meaning, because then you would be saying that the competition is maybe a few hundred meters away, whereas what we want to convey is that the competition is really far away or does not exist. So although we have seen standard strategies for imputing and scaling values, you will have to apply some thought when you use them on real-world problems. The standard strategies are fine when you just want to get to a machine learning model quickly, but when you really want to improve the model, these are all things you have to think carefully about. Let's then scale the numeric values. I don't see any problem here; I think we can just scale them all to the zero-to-one range using a min-max scaler. And then let's encode the categorical columns as well. Note that for the categorical columns, we've mostly picked columns that have text data. What will happen is we take each of the unique categories, create a column for each category, and put in the values one and zero depending on whether the row contains that category. That's what we do using a one-hot encoder. Imputation, scaling, and one-hot encoding are things we've covered probably four times now, so hopefully they're starting to feel routine; go through the previous tutorials if you want to understand how they work. One thing I want to point out is that we have not created a validation set yet. This is because we are going to use a different validation technique called K-fold cross validation; we'll see that in just a moment. But that's all the data preparation we need to do.
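Putting the column selection and preparation steps together, the code might look roughly like this. The list of input columns and the numeric/categorical split follow the discussion above, but the exact names of the engineered columns are assumptions, and depending on your scikit-learn version you may need get_feature_names instead of get_feature_names_out:

```python
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

input_cols = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday',
              'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpen',
              'Day', 'Month', 'Year', 'WeekOfYear', 'Promo2', 'Promo2Open', 'IsPromo2Month']
target_col = 'Sales'

inputs, targets = merged_df[input_cols].copy(), merged_df[target_col].copy()
test_inputs = merged_test_df[input_cols].copy()

# A somewhat subjective split into numeric and categorical columns
numeric_cols = ['Store', 'Promo', 'SchoolHoliday', 'CompetitionDistance', 'CompetitionOpen',
                'Promo2', 'Promo2Open', 'IsPromo2Month', 'Day', 'Month', 'Year', 'WeekOfYear']
categorical_cols = ['DayOfWeek', 'StateHoliday', 'StoreType', 'Assortment']

# Missing CompetitionDistance means "no competitor nearby": fill with a value
# much larger than any observed distance, rather than 0 or the mean
max_distance = inputs['CompetitionDistance'].max()
inputs['CompetitionDistance'] = inputs['CompetitionDistance'].fillna(max_distance * 2)
test_inputs['CompetitionDistance'] = test_inputs['CompetitionDistance'].fillna(max_distance * 2)

# Scale numeric columns to the 0-1 range
scaler = MinMaxScaler().fit(inputs[numeric_cols])
inputs[numeric_cols] = scaler.transform(inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

# One-hot encode the categorical columns
encoder = OneHotEncoder(handle_unknown='ignore').fit(inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
inputs[encoded_cols] = encoder.transform(inputs[categorical_cols]).toarray()
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols]).toarray()

# All-numeric matrices that can be fed directly into a model
X = inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
```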
I also encourage you to look through notebooks created by participants in the Kaggle competition. You can do that by going to the code tab of the competition, sorting by most votes, and going through some of the most popular notebooks to see what other kinds of feature engineering people have done. Even the visualization and exploratory analysis will teach you a lot, because many of these people are professional data scientists, and they're putting the same care and insight they apply to their day-to-day work into this competition. So there's a lot you can learn, and this kind of information is not available in any book or course. In fact, this entire notebook has been created just by looking at a bunch of different notebooks from the code tab of this competition, so do check that out. Let's do a quick recap before we move forward, just to revise what we did for feature engineering. We downloaded the dataset, which gave us a training CSV file, a test CSV file, and a store.csv file with some metadata about the stores. So we have a training set, a test set with about 40,000 rows of data, and the store dataset, which we merged into the training and test sets. That's something you will often have to do: merge data from multiple sources, or perform some kind of join to augment the training data. Then we looked at the dates and extracted the year, month, day, and week of year from each date, which gave us a merged data frame with these additional columns. Then we looked at whether the store was open or closed, realized that when the store is closed the sales are zero, and decided this is not something we want to model; it's something we will hard-code into our predictions. So we removed all the rows where the store was closed and the sales were zero. Next, we added some columns for the competition: instead of the month and year since the competing store opened, we now have the number of months since it opened, a single value, which is more useful to the model. Similarly, we added a couple of columns, Promo2 open and "is Promo2 month". Promo2 open indicates for how long the store has been participating in Promo2, the company-wide promotion, and "is Promo2 month" keeps track of whether the current row falls in a month when a new round of the promotion starts at that store. Then we identified the input and target columns and created the inputs data frame and the targets series. We identified the numeric and categorical columns, and imputed the numeric columns; this time we had to take a specific decision to put in a very large value for competition distance. Then we scaled the numeric values to the zero-to-one range and encoded the categorical columns, which were chosen strategically so that the categorical-looking columns containing numbers were kept as numeric. Finally, we put together these two variables: X, which contains our training-plus-validation set, essentially just the numeric columns and the encoded categorical columns, all numeric with no empty values, so it can be fed directly into a model; and X_test, which contains the inputs that will be used to make predictions for the test set. X and y are commonly used to refer to inputs and targets in machine learning, which is why we're using uppercase X and X_test. All right, so we're finally ready to start training our gradient boosting model, and here's how a GBM model works.
A GBM is short for gradient boosting machine. Suppose we're trying to predict the sale at every store. The first thing the gradient boosting model does is compute the average value of the target column and use that as the initial prediction for every input. So the model completely ignores all the training inputs it is given; it simply takes the average of the target values and uses that as the prediction. Obviously that's not a good prediction, so whatever error function you're using, say mean squared error, the error is going to be pretty bad. But here is where it gets interesting. Once you have made this first set of predictions, which is just the average, you go back into each row of training data and calculate the difference between the prediction and the actual target. To convey that, let's work through an example. Say we have a few input columns: month, day of month, competition distance (call it CD), and the store number, and then we have the sale. So we have a bunch of rows: for one store the sale was 700, for a second store on another date it was 800, for another it was 970, and for another 850. These are some rows from the training data. The first prediction our model makes is simply the average of these numbers: the average of 700, 800, 970, and 850, which we'll round to about 820 to keep the arithmetic simple. So it just predicts 820 for all of them. That's a bad prediction; let's call it P0, prediction zero. Now, using P0, the model computes the residuals. What is a residual? It is simply the difference between the actual sale and the prediction. In the first case the residual is 700 minus 820, which is −120; in the next, 800 minus 820, which is −20; in the next, 970 minus 820, which is 150; and in the last, 850 minus 820, which is 30. So now we know how far off our current predictions are for each row. Here is what happens next: the model takes the same inputs, but instead of trying to predict the sale, it builds a decision tree to predict these residuals. And it's not an unbounded decision tree; typically it is bounded in terms of depth. Using this decision tree, we now try to predict how much we were off by when we made the first prediction.
So when we made the first prediction, we were off by −120 on the first row, and now suppose our decision tree predicts −80 there. Suppose it predicts −30 for the second row, 70 for the third, and 40 for the fourth. That's what our first decision tree predicted; let's call this tree D. This is the prediction of the decision tree, which was an attempt to predict the residual, but of course it is itself not perfect. So what do we do next? We compute the next set of residuals. I wanted to predict −120 but predicted −80, so I'm still off by −40. I wanted to predict −20 but predicted −30, so I'm off by +10. I wanted to predict 150 but predicted 70, so I'm off by 80, and on the last row I'm off by −10. And this way we continue; we keep getting residuals. That's what this picture is representing. First we use the average, which is obviously very bad, so the error is high. Then we train a decision tree to predict the residuals, the errors we incurred when predicting the average. Once again, we don't do a perfect job, but it is better. If I simply add the prediction from the original model and the tree, 820 plus −80, the overall prediction becomes 740, and 740 is a lot closer to 700 than 820 was. Similarly, 820 minus 30 is 790, which is a lot closer to 800, and 820 plus 70 is 890, which is a lot closer to 970 than 820 alone. So almost all the predictions become better because we have trained a decision tree on the residuals. As soon as we create one decision tree, all our predictions improve, and how do we make a prediction? We simply add the original prediction and the residual predicted by the decision tree. But we still don't have a perfect model, so we then train another decision tree to predict the residuals that remain after combining the original guess and the first decision tree. That second decision tree is trying to correct the errors left behind by the first decision tree, which reduces the error further. Then we train another decision tree, which attempts to correct the errors that remain after the second, and so on. That's how gradient boosting works. Let's go over this step by step. First, the average value of the target column is used as the initial prediction for every input; that's the starting point. Then we compute the residuals of the predictions with respect to the targets, so we know how much we are off by, what the error in our prediction is. Then we create a decision tree of limited depth to predict the residuals for each input. We're not trying to predict the entire sales value; in some sense it's a smaller task, just correcting the errors of our initial guess.
Then there's an additional step, which involves what is called the learning rate. We take the predictions from the decision tree and scale them using a parameter called the learning rate. So we don't directly add the prediction of the decision tree to the original prediction; we scale it down a little, and we'll talk about why in a second. The scaled predictions from the decision tree are added to the previous predictions to obtain the new and improved predictions for each row. Then steps two to five are repeated to create new decision trees, each of which is trained to predict just the residuals from the previous predictions. That's how a GBM works. Now let me talk about the scaling. Coming back to our example, we had these input columns, and the target we were trying to predict was 700. We predicted the average, 820; that was our initial prediction P0. Then we computed the residual R0, which is −120, and we do that for all the rows. Then we train a decision tree D, which tries to use the inputs to predict these residuals, and again D is going to be off by a little bit. Let's say D predicted −40 for this row. Now, we have a learning rate parameter, called alpha, and let's say alpha is set to 0.5. So instead of predicting 820 plus −40 as the prediction of our model after creating one tree, we actually predict 820 plus alpha times −40, that is, 820 plus 0.5 × (−40). So the actual prediction we are now making is the original average plus the learning rate times the prediction of the decision tree: 820 + 0.5 × (−40) = 820 − 20 = 800. This is the prediction P1, and P1 comes from our average plus our first (scaled) decision tree. Once we have P1, we compute the residual R1, which is how far the prediction is from the actual value we're trying to predict: 700 − 800 = −100. We compute that for all the rows, and then we train another decision tree to predict this new residual from the inputs. So now we have D2, and let's say it predicts −10 here; it's still not perfect, because it's not a full decision tree. Once again we multiply its prediction by alpha, so we go: the initial average A, plus alpha times the prediction of decision tree one, plus alpha times the prediction of decision tree two. In this example, 800 + 0.5 × (−10) = 795, which is P2. We are still some distance from the target, so we compute the next residual, R2 = 700 − 795 = −95, and that's how it proceeds. Ultimately, the prediction of the model for a particular input is the average A, plus alpha times the prediction of decision tree one (which was trained to predict the residual from the original average, essentially the target minus A), plus alpha times the prediction of decision tree two.
And what was decision tree two trained to do? It was trained to predict the difference between that running prediction and the actual target. Then we have alpha times decision tree three, and so on. You can configure alpha, which is called the learning rate, and you can configure the number of iterations, the number of decision trees you create, which is called the number of estimators. And that's all a gradient boosting model is: successively improving our predictions by training small decision trees to correct the errors of the model we have so far. Now, why do we have this alpha in place? Why not simply add the full predictions from the previously trained decision tree? It turns out that alpha helps prevent overfitting. By applying that scaling factor to the tree that was just trained on the residuals, we are intentionally leaving some error, which can then be filled in by the next decision tree, and so on, and that helps prevent overfitting to the training set. So that's one thing. Second, why is this called gradient boosting? It is called boosting because boosting is the name of the general technique where you train a new model to correct the errors of, or improve, an existing model, as opposed to bagging, which is what random forests use. Remember, with random forests each decision tree tries to make the entire prediction on its own; that's bagging (bootstrapping and aggregating). Boosting means you are correcting or improving an existing model: the second decision tree depends on how the first was created, the third depends on the second and its predictions, and so on. That's boosting. And why "gradient"? In some sense this is similar to the gradient descent algorithm. The objective is to minimize a loss, like mean squared error, and we create new decision trees to minimize that loss, so in some sense we are doing an iterative form of gradient descent; that's the mathematical connection, and that's why it's called gradient boosting. A gradient boosting machine is simply a boosting technique that tries to minimize the error by creating new estimators, new decision trees, each time. As an exercise, try to describe in your own words how a gradient boosting machine differs from a random forest, and if you want, check out the mathematical explanation as well. I will warn you that the mathematical explanation is pretty hairy; there are a lot of equations, so unless you're comfortable with the math I would not recommend it. There are some graphs too, but they don't really help much. Still, do check it out if you're interested. Really, the most intuitive explanation I've found is simply that gradient boosting machines add more decision trees to correct the errors of the predictions made so far. I also like the tutorial series on the YouTube channel called StatQuest; it's a four- or five-part series on gradient boosting that also gets into the mathematics but has a bunch of visual explanations, so you can use it to get a good sense of the idea. As long as you're getting the basic idea, that we train more trees (each using all the input features) to correct the errors from the previous trees, you should be fine.
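To make the procedure concrete, here is a minimal sketch of plain gradient boosting for squared error, built out of scikit-learn decision trees. This is a toy illustration of the idea described above, not what XGBoost does internally (XGBoost adds regularization and many refinements):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gbm_fit(X, y, n_estimators=20, learning_rate=0.5, max_depth=4):
    base_pred = y.mean()                      # step 1: initial prediction is the average
    current_pred = np.full(len(y), base_pred)
    trees = []
    for _ in range(n_estimators):
        residuals = y - current_pred          # step 2: how far off are we right now?
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                # step 3: a small tree learns the residuals
        # steps 4 & 5: scale the tree's corrections and add them to the running prediction
        current_pred = current_pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base_pred, trees

def toy_gbm_predict(X, base_pred, trees, learning_rate=0.5):
    # final prediction = average + alpha * D1(x) + alpha * D2(x) + ...
    pred = np.full(len(X), base_pred)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred
```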
Now, to train a GBM, we can use the XGBRegressor class from XGBoost. There is also a gradient boosting machine in scikit-learn, but XGBoost is far more efficient and far more versatile. So from xgboost we are going to import XGBRegressor, and let's check out what it looks like. You can see all the parameters it accepts. There is an n_estimators parameter, which tells you how many trees to create; you can experiment with it, try setting it to one and see how many trees get created, then change it to two, three, four, and so on. Then there is max_depth, which configures how deep each tree goes. Why do we need a max depth? Think about it: if we trained an unbounded decision tree, that tree would fit the residuals exactly, so it would completely overfit to the training set, and once again that would be a problem. All of these are regularizations, because if the model overfits the training set, it will perform poorly on the validation set, which is why these trees are typically not very deep. Then you have the learning rate, which is basically the alpha (it's called eta, I believe, in XGBoost); it is simply the scaling-down, or damping, factor applied to the residual prediction from each decision tree, and once again it is used to avoid overfitting. And then there are a bunch of other parameters. So I'm simply going to create an XGBRegressor here, setting a random state and n_jobs. random_state ensures that the randomization uses this number, 42, as a seed, so that each time we run model.fit, or each time we run this piece of code, we get the same results. If you change this number you will get different results, but each time you run with the same number, say seven, you will get the same result. n_jobs configures the number of threads to use in the background; when you put in minus one, it will use all the threads available on the machine we're connected to. Then there is the number of estimators, the number of decision trees we will create: we start with the average, train one decision tree to correct the errors, train a second decision tree to correct the errors of the first two combined, and so on. We'll create 20 decision trees, and each decision tree will have a maximum depth of four. So let's call model.fit now. This doesn't take too long; 20 decision trees, each with a maximum depth of four, shouldn't take more than a few seconds, and it took about eight seconds. You can see some of the other parameters here as well, and you can look up the documentation to learn more. Right now I'm just fitting it on the entire training set.
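Put together, the training step looks roughly like this, with the parameters described above:

```python
from xgboost import XGBRegressor

# 20 boosting rounds, each tree at most 4 levels deep; random_state fixes the seed
# so repeated runs build the same trees, and n_jobs=-1 uses all available CPU threads
model = XGBRegressor(random_state=42, n_jobs=-1, n_estimators=20, max_depth=4)
model.fit(X, targets)
```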
We will talk about validation shortly, but right now I've just fitted it on the entire training data, using the inputs X and the targets. And now here's an exercise for you: can you describe in your own words how a gradient boosting model fits into this machine learning training workflow? We take some inputs and put them into the model; it makes a first guess, which is just the average. We compute the loss and the residuals, and then we train a decision tree to predict the residuals. We put in the inputs once again, and now the output is the average prediction plus the scaled-down prediction of the decision tree. Then we compute the residuals again using the targets, and we optimize by training another decision tree, and so on. Okay. We can now make predictions using the model. Since we just have the training set and have not created a validation set, I'm just going to call model.predict(X), and these are the predictions of the model. Then we can evaluate, just the way we've been evaluating all our models; I'm going to use the root mean squared error. So I've defined this RMSE function, we give it the predictions and the targets, and it turns out that the error on the training set is about 2,377. I just want to point out that you can create a validation set if you want; it's not that gradient boosting won't work with validation sets, it's just that we are using a different strategy for validation, which is why I've not created one yet. There is no fundamental limitation here; you can, and we will. So we get an RMSE of 2,377. And going back, let's look at the range of sales: merged_df.Sales min and max. The minimum is zero in a lot of cases, and the maximum is around $41,000. Let's also look at a histogram: plt.hist of merged_df.Sales, on a sample of maybe 10,000 rows just to get a sense. So this is what the distribution looks like: most of the sales are a few thousand dollars, roughly in the range of 5,000 to 10,000. And our RMSE is 2,377, which means our predictions are off by roughly $2,300 on average. That's not too bad. It's not like our predictions are in the range of 5,000 to 10,000 and we are off by a full 10,000, which would make them completely meaningless. We're off by around 2,000 or so, which is probably not great, but it's not too bad either. So already our gradient boosting model gives us a fairly good result.
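For reference, the evaluation step boils down to something like this; the rmse helper is a small convenience function along the lines described above, not necessarily the exact notebook code:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(a, b):
    # Root mean squared error between predictions and targets
    return np.sqrt(mean_squared_error(a, b))

preds = model.predict(X)
print('Training RMSE:', rmse(preds, targets))
```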
What I would encourage you to do at this point is compare it with a baseline model, maybe a very dumb model that always predicts the average, or the average of the same store over the last month, and also compare it with linear regression, a decision tree, and a random forest, and see if gradient boosting is any better. The reason it is better in a lot of cases is simply this residual trick: for each tree, we are reducing what it needs to learn, unlike a random forest, where each tree has to learn the entire relationship. Here, each tree just tries to fix the errors of the trees before it. So that's how you make predictions and evaluate, just as you do with any scikit-learn model. You can also visualize individual trees using the plot_tree function from XGBoost. This also requires the graphviz library to be installed, so you will have to install that, but here's what it looks like. We call plot_tree, pass in the model, and since the model contains many trees, we also pass in this num_trees argument. num_trees is somewhat improperly named; it should really be called the tree number or something, because you are just passing the index of the tree you want to plot: tree number zero is the first decision tree that was created, tree number one is the second, and so on. So let me plot it, and I can plot it left to right, which is convenient to look at. Here's the first decision tree that was created: "promo less than 0.5" is the first check, so it seems that whether there's a promotion running or not is the biggest factor. Then, if there was no promotion running, it depends on the day of the week: we check whether "day of week 1" is less than 0.5, which amounts to checking whether it was a Monday. If it is not a Monday, we check the month; if it is, we check something else. That's what the first decision tree looks like. I will draw your attention to the values at the leaves: these are not predictions for the actual sales, they are the deviations from the average that we're predicting. If I show you the next tree, num_trees=1, it is going to try to correct the residuals that remain after adding the first tree, and at the leaf nodes you can see the numbers getting smaller; they are no longer that large. And if I go to tree number 19, the final tree we've trained, which is trying to correct the errors remaining after the average and the first 19 trees, you can see numbers like −19, −2, 34, 3; it's making very small adjustments. So even though these trees are all very small, each one can correct the previous trees in a meaningful way and contribute to an improvement in the result. Notice how the trees only compute residuals and not the actual target values. We can also view these trees in a textual format: if we call model.get_booster(), which is the underlying booster object that XGBoost creates, and then call .get_dump(), that gives us a list of the trees as text. There are 20 trees, not surprisingly, and here is the first tree in textual format, so you can see how it makes its decisions.
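Here's roughly what the visualization calls look like; the figure size is just an assumption to make the tree readable:

```python
import matplotlib.pyplot as plt
from xgboost import plot_tree

plt.rcParams['figure.figsize'] = (30, 30)
# num_trees picks which boosting round to draw (0 = first tree);
# rankdir='LR' lays the tree out left to right
plot_tree(model, num_trees=0, rankdir='LR')
plt.show()

# The same trees in textual form, one string per boosting round
trees_text = model.get_booster().get_dump()
print(len(trees_text))   # 20 trees
print(trees_text[0])
```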
And just like decision trees and random forests, XGBoost also provides a way to understand how important each feature is: the feature importance score. You can look into how it is calculated, but there are a couple of variants. One is called the gain, which has to do with how much each feature contributed to the reduction in loss over all the trees; reduction in loss is the "gain", which is how the term is derived. The other is called the weight, which counts how many times a particular feature was used to create a split across all the trees. So here is the importance data frame: we take model.feature_importances_ and put it into a data frame along with the columns of X, the data we used to train the model. Here we have the features and their importances, and we can plot them as well. This is what it looks like. As expected, promo has a big effect, and store type seems to have a big effect too, along with day of week one (I don't know offhand whether that represents Sunday or Monday, but that's something I would check). Promo2 seems important as well, so it's good that we added a bunch of columns related to Promo2, then competition distance, as one might expect, then assortment, then month, store, and day. It's interesting that adding month and day may have significantly improved our model. What I would encourage you to do at this point is turn off some of the features, retrain the model, and see if you still get a good enough result; it's possible that you may, and it's possible that you may not. This is how you figure out whether the features you've created are actually working: you look at their importance, you look at the before and after, and you keep all of this information organized in some fashion, in some sort of spreadsheet where you note down each experiment you've tried and the result you got, and potentially also keep track of the specific version of the notebook so you can replicate the experiment if required.
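A sketch of the importance data frame and plot might look like this (assuming X is the data frame of processed columns built earlier, and using seaborn for the bar plot):

```python
import pandas as pd
import seaborn as sns

# Pair each training column with the model's importance score and sort
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot the ten most important features
sns.barplot(data=importance_df.head(10), x='importance', y='feature')
```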
We are using all the data for training, but at the same time we still have a way to analyze how these models are doing, by looking at the validation score for each of the splits. Okay, so that's what we'll implement. K-fold is quite useful for small datasets; it's not as useful for really large datasets like this one, where we have a million rows. The other thing to keep in mind is that K-fold may not be the optimal strategy here, because our validation set should ideally reflect the kind of data the model will see in real life; it should ideally be closer to the test set, and the test set occurs in the future, after the training set. So ideally the validation set should be in the future of the training set, and we are breaking that requirement here. The validation scores we get will potentially not be as useful as if we simply used the last portion of the data, so those are some things to consider. We are doing it here primarily to demonstrate how to use K-fold cross validation; when you have time-ordered data, you may just want to take the last 20% by date and use that as the validation set. Regardless, here we go with K-fold cross validation. We say from sklearn.model_selection import KFold, so we have this KFold utility that will give us all these splits to try out. How do we use it? We create the KFold object, pass in the number of splits, and then we can call kfold.split and give it the data we want to split, in this case the training inputs, which are a data frame. What this gives you, for each split, is the indexes of the rows for the training set and the indexes of the rows for the validation set. As you iterate over it in a for loop, it yields five splits, and each time you get a list of indexes of training rows and a list of indexes of validation rows. We can then take X and extract the training rows from it. So if it gives us indexes 0, 1, 2, 3, 4 and so on, roughly 20% of them will fall into val_idxs and 80% into train_idxs. We grab those rows from the input X, using iloc to select specific rows by index, and that gives us X_train; similarly, selecting those rows from the targets gives us train_targets. The validation indexes val_idxs contain the row indexes for the validation set, so we grab those rows from the data frame to get X_val, and from the targets to get val_targets. Okay, so now we have X_train, train_targets, X_val and val_targets, and we need to actually train and evaluate a model. I've written a simple function here, train_and_evaluate, which takes X_train, train_targets, X_val and val_targets, plus a bunch of hyperparameters, like a max_depth, a number of estimators, et cetera. Those parameters, as you can see, are captured as a variable number of keyword arguments using **params, and then passed directly into an XGBRegressor. So this is just some Python syntax you can look up.
This is called kwargs, or keyword arguments. We take all those parameters that were passed in and create an XGBRegressor, so for each split we are creating a new model, and then we fit that model to whatever data is in the current fold. In the first iteration we use one portion for validation and the rest for training, and create the first model. Then we make predictions on the training set and calculate the loss, and we make predictions on the validation set and compute the loss there too; one part was used for training and the other just for evaluation. We compute the training loss and the validation loss, the RMSE or root mean squared error, and we return all three: the model, the training loss and the validation loss. We capture those, print out the training and validation loss for each split, and append the model to a list of models, because we are training five models and we will need all five of them to make predictions later. That's the whole idea: you split the data five ways, use a different 20% portion each time for validation, and train a separate model for each of those splits. You can see that for the first split the train RMSE was 2352 and the validation RMSE was 2424; for the second split the train RMSE was 2406 and the validation RMSE was 2451, and so on. So when you have a really small dataset, maybe just a few hundred rows, you may want to use cross validation and then average the results from all the models. Here's a function I'm defining called predict_avg. It takes the list of models and some inputs, calls model.predict on the inputs for each of the models, and then takes the average. That's exactly what's happening here: we loop over the models, give the inputs to each model to get its predictions, and then average all the predictions along axis 0. So here's what that looks like: this is the average prediction obtained on the training data by averaging the predictions of all the models. If you want, you can break that down: you could compute models[0].predict(X) plus models[1].predict(X) plus models[2].predict(X) and so on, and then divide by the number of models you have. That's basically what we're doing with predict_avg. Okay, so that's K-fold cross validation: you use the KFold class to create some splits, and you can also add randomization. If you want to shuffle the rows before creating the splits, you can specify shuffle=True; that might be worth a try. Then you put it in a for loop, you get the train indexes and validation indexes for each split, and you get the actual training and validation data for each split.
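Putting those steps together, here is a rough sketch of the loop just described, assuming X and targets hold the training inputs and targets; the helper names train_and_evaluate and predict_avg follow the description above, but treat the exact signatures and hyperparameter values as assumptions rather than the notebook's exact code:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def rmse(a, b):
    # Root mean squared error
    return np.sqrt(mean_squared_error(a, b))

def train_and_evaluate(X_train, train_targets, X_val, val_targets, **params):
    # Train a fresh model on one fold with the given hyperparameters
    model = XGBRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    return model, train_rmse, val_rmse

models = []
kfold = KFold(n_splits=5)  # add shuffle=True to shuffle rows before splitting
for train_idxs, val_idxs in kfold.split(X):
    X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
    model, train_rmse, val_rmse = train_and_evaluate(
        X_train, train_targets, X_val, val_targets,
        max_depth=4, n_estimators=20)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))
    models.append(model)

def predict_avg(models, inputs):
    # Average the predictions of all K models
    return np.mean([model.predict(inputs) for model in models], axis=0)
```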
For each split you train a model, evaluate it, view the results, and keep track of all the models so that you can use them together to make predictions. It's definitely a more advanced technique, not something you will have to use for datasets that are large enough, but still a good thing to know. Okay, so that's how we perform K-fold cross validation. Next, let's talk about hyperparameter tuning and regularization. Just like with any other machine learning model, there are several hyperparameters we can adjust to change the capacity of the model and to reduce overfitting. And this is the typical model complexity versus error curve. When your model is very weak, let's say you're just training one decision tree, you're going to have a very high error. As you start increasing the capacity of your model, the training error starts to decrease, and it continues to decrease, because the model has more parameters to learn information about the training data. But at some point your test error will start to increase, and this is where you have to manage the different hyperparameters and adjust them so that you get to the best fit, the minimum possible test error; any further and the test error starts to increase. That should be your goal with all the different hyperparameters you tune. XGBoost has quite a few hyperparameters; you can check out the documentation, which has a detailed explanation of each one. Once you have the intuitive picture in place, the average, followed by a decision tree to correct the residuals, followed by another decision tree to correct the residuals, with predictions made by adding up all the trees' predictions scaled by a learning rate, you can check out all of these parameters: the learning rate, min_split_loss, max_depth, min_child_weight, et cetera, everything to do with the trees. You could experiment with the hyperparameters using K-fold cross validation, which means that for every hyperparameter value you try, you perform K-fold cross validation and look at the result, and there is a helper function here to help you do that. But because we don't have that kind of time and we want to do quick experiments, we will just fix a validation set for our hyperparameter tuning. This is a choice you will have to make from time to time. So I'm just going to fix a random 10% split of the data as the validation set. Here I'm importing train_test_split and calling it with the inputs and the targets, setting the test size to 0.1, which should give me a training set with 90% of the data and a validation set with 10%. And I guess for a million rows, a hundred thousand rows is enough to accurately verify the model. Ideally, again, what I should be doing is picking something from the very end, so that the validation set is closer to the test set, but I'm just going to go with a random selection and see what happens; maybe I'll come back and replace this with the last few months of data. So now we have a fixed training set, we have a fixed validation set, and then we have this test_params function.
You just give test_params some parameters, and it creates a regressor, fits it to the training data, calculates the training loss and the validation loss, and prints them out. That's all. So now let's test different numbers of estimators. n_estimators is the first hyperparameter, probably the most important one: how many decision trees are you going to train? The more you train, the lower the loss will get. Right now we are looking at 10 estimators, and with just 10 decision trees we get to a validation loss of 2334. You can see that the model isn't overfitted yet, because the train RMSE is 2340 whereas the validation RMSE is 2334, so the validation loss is actually lower. Then we go to 30 estimators and things have started to get better; we are now only off by around $1,800, which is not bad at all considering the values we're trying to predict are in the range of $5,000 to $10,000. I believe this next run is going to take some time, so maybe we'll let it run, or just cancel the execution, but you will see this kind of trend: at some point you will see the best fit, and beyond it the training error will continue to decrease while the validation error potentially starts to increase. So as an exercise, experiment with different values of n_estimators and plot a graph of the training loss and the validation loss for the different values you've tried, and see where you get the best fit. And it's not just about the best fit; sometimes it's also about how much additional benefit you're getting. Let's say that by moving from 100 to 240 estimators, the validation RMSE reduces from 1209 to 1190. That's not a big improvement, and the time it takes to train 240 estimators is more than double the time it takes to train 100. So you will have to decide whether the extra estimators are worth the time. If not, maybe you stop at 100 or 150, wherever the change in validation RMSE starts to flatten out. In a lot of cases the validation error doesn't start increasing immediately; it just keeps decreasing very slightly for a very long time, and at some point it doesn't make sense to create a huge model that is only slightly better than a much smaller one. So start out small, increase the capacity of your model, and whenever it stops making sense to increase the capacity further, either in terms of time or because of overfitting, you can stop. Similarly, we have max_depth. As you increase the maximum depth of each tree, the capacity of the tree increases and it can capture more information. Let me stop this run and redo it with a smaller number of estimators, say 10, so that we can see the effect. max_depth is simply the maximum depth you allow each of the trees that correct the residuals to have, and you can see that as we increase the max depth, the loss initially goes down. Once again, experiment with different values of max_depth.
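Here is a minimal sketch of what the test_params helper and these sweeps might look like, reusing the rmse helper from the earlier sketch and assuming X_train, train_targets, X_val and val_targets come from the fixed 90/10 split above (the specific values tried are illustrative):

```python
from xgboost import XGBRegressor

def test_params(**params):
    # Train on the fixed training split and report train/validation RMSE
    model = XGBRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))

# Try different numbers of trees
test_params(n_estimators=10)
test_params(n_estimators=30)
test_params(n_estimators=100)

# Try different tree depths with a small, fixed number of trees
for depth in [5, 10, 15, 20]:
    test_params(n_estimators=10, max_depth=depth)
```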
Now, we can see that these models are not overfitting just yet, so you can probably still push the model; you can keep increasing the max_depth until you start to see some overfitting, which means the training error continues to decrease but the validation error actually starts to increase. The kind of trend you'll notice is that, say, around a max depth of 15 the validation error flattens out, then maybe around a max depth of 18 the training error is still going down but the validation error has started to increase, and at a max depth of 20 the validation error goes up by a significant amount. Whenever you notice this trend, come back and find the lowest point. And just to clarify, overfitting is not a situation where the training error has to be zero. Overfitting specifically describes the scenario where the training error is decreasing as you increase the model complexity, but at the same time the test error is increasing; the model is getting worse even though it is getting better on the training data. Okay. Next up, we have the learning rate. The learning rate is the alpha that is applied to the predictions: remember, we take the initial average, then we apply an alpha to scale the prediction of the decision tree, and we apply that to all further decision trees as well. If we set the learning rate to 0.01, that means it gives very little weightage to each tree's prediction; it corrects the residual only by about 1%. That makes it very hard for the model to even get good at the training set, so you would need a large number of estimators. A very low learning rate, something like 0.01, will lead to underfitting, and a very high learning rate, like 0.99, may lead to overfitting. It's not that it will, but it may, because there are other hyperparameters too; one hyperparameter alone cannot cause overfitting, since there may be other regularizing parameters. But this is roughly the direction it works in: a low learning rate means low power, because you're not giving your decision trees much weight in the predictions and you're giving a much higher weight to the initial prediction; a high learning rate gives the decision trees a lot of power, so it's likely they may start overfitting the training set while the test or validation error starts to get worse. You can see here that when the learning rate is 0.01, the validation RMSE is around $5,000, and that's way off: we're trying to predict values in the range of $5,000 to $10,000, so being off by $5,000 is pretty bad. Hopefully as we increase the learning rate things start to get better. I don't think it's going to overfit just yet, because we are only at 50 estimators, but if we had 500 or a thousand estimators, each with a max depth of maybe 10, then you would definitely see that beyond a learning rate of about 0.7 the models would start to overfit. Right now the number of estimators is so low that there's not enough power; even if you set the learning rate to 1, it may not overfit.
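A quick sketch of a learning-rate sweep using the same test_params helper (the values are only examples):

```python
# Sweep the learning rate while keeping the number of trees modest
for lr in [0.01, 0.1, 0.3, 0.99]:
    test_params(n_estimators=50, learning_rate=lr)
```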
There is a question: can hyperparameter optimization be done while combining multiple hyperparameters? Yes, absolutely. Where I've put in n_estimators=50 here, I could put in n_estimators=20 and a max_depth of 10 and whatever else I want. Ultimately, what test_params is doing is creating a gradient boosting regressor, training it on the data, and then evaluating it by making predictions on the training and validation sets and computing the RMSE. So you can modify multiple hyperparameters at a time, but it's generally a good strategy to start with the hyperparameter that is going to have the biggest effect. The number of estimators, I would think, initially has the biggest effect, so find a good value for that first. I would probably try multiples of two: 50, 100, 200, 400, and at some point I would realize that the benefit of doubling again isn't there, or that it has started to overfit, and then maybe find something between 200 and 400 that works best, perhaps by plotting a graph. So pick the biggest, most impactful hyperparameter first and optimize that, then pick the next most impactful one. And how do you figure out which one is the most impactful? Typically, if you quickly try just two values of each hyperparameter, you can see how much effect each one has initially. For instance, if the number of estimators is very low, like 10, then changing the learning rate by a lot is not going to cause overfitting very quickly. So yes, it is subjective and depends on the dataset, but it's a question of experimenting a little and figuring out which hyperparameters to optimize first; for gradient boosting that is typically the number of estimators, the max depth and the learning rate. By the time we get to a learning rate of 0.99, the RMSE is looking pretty good, but as I said, you can try putting in 500 estimators, which will take ten times longer, and that will tell you whether a learning rate of 0.99 is still a good idea at a very high number of trees. I'm just going to run all of these cells so that we don't have to wait a long time. Okay. Next up, we have this argument called booster. I said that gradient boosting machines use decision trees, but instead of decision trees, XGBoost can also train linear models for each iteration. The idea is pretty much the same: you initially guess the average as your prediction P0, and then you compute the residuals. Then, instead of training a decision tree to predict the residuals, you train a linear regression model that tries to fit them, then you compute the residuals that remain, train another linear regression model to correct those, and so on. Everything else remains the same; we just switch out the decision tree regressor internally for a linear regression model, and all the hyperparameters, like the learning rate, can stay the same. So if you think there are more linear relationships within the data, you can test out the linear booster as well.
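For example, a hedged sketch of trying the linear booster through the same helper (the parameter values are illustrative):

```python
# Use linear models instead of trees for each boosting round
test_params(n_estimators=50, booster='gblinear')
```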
And it seems pretty fine to me, not too bad: 2727 is the validation RMSE here. So maybe you can experiment with some other hyperparameters, compare the linear and the tree-based boosters, and figure out which one works best. Now, what will happen over time is that as you train dozens or maybe hundreds of models, you will start to get a feel for how all of these hyperparameters interact with each other and which hyperparameters are better suited to which kind of data. A lot of this is fairly subtle, in the sense that it is not documented explicitly, and it is not something many people are even able to express clearly in words, but it is something they know and understand; it's kind of like a muscle. So the best way to become good at machine learning is to just train a lot of models, like every day, just be training some model on some dataset. A great way to orient yourself is by participating in a Kaggle competition, where you can train a bunch of models, make a submission and see whether your model worked better or not, while trying to stay organized using spreadsheets or whatever tools you need, noting down all your ideas, experiments and their results. You just keep doing that over and over, and whenever you're stuck, you look for help; the places to look are typically the Kaggle forums, where you read the discussions, or maybe online tutorials or videos, and you apply the techniques you find there. Sometimes, when you've applied a lot of these techniques and none of them work, you may have to start reading papers. That's how you get good at machine learning. There's definitely more theory you should learn eventually, and it will be useful, but understanding the theory will not help you beyond a point; machine learning is an applied field, so the more models you train and the more experiments you do, the better you will get at it. And a great way to learn is also by looking at what others are doing. All right, so here's an exercise for you: experiment with all the different hyperparameters, and there are several others beyond the ones we've covered. Gamma is another one; I think it indicates the minimum reduction in loss that is required to create a split. Then you have min_child_weight, which, I believe, has to do with how many rows or how much weight each child node should contain, something along those lines. Then you have max_delta_step; you'll have to check that one out, I'm not sure about it. Then you have subsample: for each tree, instead of using all the rows, you can use just a fraction of them. And you have colsample_bytree, which means that for each tree you create, instead of using all 120 or so columns we have right now, you use only a fraction of the columns, maybe 100 columns each time, randomly chosen. And that again creates some kind of randomization, right? Yeah.
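As an illustration, here is one hedged combination of these regularization-style hyperparameters passed through test_params; the specific values are just examples to experiment with:

```python
test_params(n_estimators=50,
            max_depth=5,
            gamma=1,               # minimum loss reduction required to make a split
            min_child_weight=2,    # minimum total instance weight needed in a child node
            subsample=0.9,         # each tree sees a random 90% of the rows
            colsample_bytree=0.7)  # each tree sees a random 70% of the columns
```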
So those are some of the things you can try out. Now let's train a final model on the entire training set. This is common practice: you have a validation set, or maybe you do cross validation, and you try many different kinds of experiments and arrive at a certain set of hyperparameters. Once you've arrived at the hyperparameters you believe are the best, at the very end, one last time, you can train your model on the entire training set, which means including the validation set, because at this point we are done evaluating different types of models and different hyperparameters. We've decided these are the hyperparameters that work well for this dataset, so it makes sense to train with the entire dataset and get a little final boost, because more data is always helpful for the model, and then make predictions on the test set and submit those to Kaggle. So always do that: have a validation set and use it till the very end, but when you're making your final predictions, train the model on the entire set. This may take some time; I've set the number of estimators to a thousand, and I believe it took about eight seconds with 50 estimators, so it will take a few minutes to run. So now we're training one final model on the entire training set with custom hyperparameters: a thousand estimators, a learning rate of 0.2, a max depth of 10, a subsample of 0.9, which means each decision tree uses only 90% of the rows, randomly chosen, and a colsample_bytree of 0.7, which means each decision tree uses only 70% of the columns. Then we fit the model, and it took about five minutes on my computer; it is still running on Colab, so maybe it'll take twice as long there. Once the model is done, we can evaluate it as well: we can create training predictions with model.predict(X) and compute the RMSE between the training predictions and the training targets. There seems to be some issue here, but ultimately you can evaluate the model on the training set, and of course that will not be a true reflection of the performance of the model; it's just that we don't have a validation set right now, because this is the final training run. Once the model is trained, we can make predictions on the test set. Remember, we have the test set as well; if I just check X_test, it contains all the information in exactly the same format as the training inputs X_train. So we can pass X_test into model.predict, and that gives us a bunch of predictions for the test set. Let's check test_preds: here are all the predictions we've generated for the test set, and they line up with the IDs, so we know that for ID 1, the first row of the test set, the prediction is about 4207, for the second row it's about 7587, for the third about 8849, and so on. Now we have to create the submission file. We already have the submission data frame here, with the IDs and a Sales column; actually, let me just load that up again from the sample submission CSV in the Rossmann store sales folder.
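A hedged sketch of these final steps; the variable names X, targets and X_test, and the sample submission path, follow the earlier parts of the notebook and are assumptions:

```python
import pandas as pd
from xgboost import XGBRegressor

# Final model trained on the full training data (validation set included)
model = XGBRegressor(random_state=42, n_jobs=-1,
                     n_estimators=1000,
                     learning_rate=0.2,
                     max_depth=10,
                     subsample=0.9,
                     colsample_bytree=0.7)
model.fit(X, targets)

# Predictions for the Kaggle test set
test_preds = model.predict(X_test)

# Sample submission file provided with the competition data
submission_df = pd.read_csv('rossmann-store-sales/submission.csv')
```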
So we are given this sample submission file, submission.csv. It contains the IDs 1 through 41,088, and it contains the value 0 for Sales. Now, instead of the value 0, we have to fill in our predictions, which are in test_preds, and we can do that by saying submission_df['Sales'] = test_preds. Once we do that, we can look at the submission data frame: here we have the IDs from the sample submission file that was given to us, and instead of zero sales for everything, we have replaced those zeros with the predictions of our model, 4207.07 and so on. And we know that these predictions line up with the IDs, because they were generated using the test set X_test, X_test came from test_df, and in test_df these IDs were in numerical order. So, especially with Kaggle, you need to be careful: make sure you're not shuffling your test set, so that the predictions you make line up with the original IDs or whatever identifier is in the submission file. Okay, so now we've created the submission data frame. There is one more thing we have to handle here. Recall that if the store is not open, then the sales must be zero, and this is not something our model accounts for; our model simply assumes the store is open. So we need to hard-code the rule that whenever the value of Open in the test set is 0, we set the sales to 0. If you check test_df, you have the ID, some other information, the store, and Open, which is either 1 or 0. If Open is 1, we can simply keep whatever prediction we have made; if the value of the Open column is 0, we should set that prediction to 0. Here's a trick we can apply: if I take the Sales column of the submission and multiply it by the Open column of the test data frame, which is also lined up by ID, then wherever the store is open the value will remain, and wherever the store is closed the value will turn into zero. So that's what I'm doing here: submission_df['Sales'] multiplied by test_df['Open']. The ones retain the values, and the zeros convert the values to zero, which is exactly what we want: when the store is closed, the sales should be zero. There is one minor issue, though: in some cases the value of Open in the test data frame is NaN. This is where, again, we'd have to figure out why this data is missing and whether those stores were open or closed on those dates. For now, I'm just going to assume that wherever the value of Open is NaN, the store was open; maybe not the best assumption, but let me go with it and fill the NaNs with 1. And now we have the submission data frame, and it is going to have a bunch of zeros in it as well, which you can see.
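In code, that whole step might look like this minimal sketch, assuming the column names Sales and Open from the Rossmann data:

```python
# Replace the placeholder zeros with our predictions
submission_df['Sales'] = test_preds

# A closed store must have zero sales; treat missing Open values as "open"
submission_df['Sales'] = submission_df['Sales'] * test_df['Open'].fillna(1)
```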
If I just check a sample, say sample(20), you will see that you have zeros in certain places, and these zeros have come from multiplying by the value of Open. Okay, I think we have been able to train the model on Colab as well, so let me just return to Colab. We make predictions on the test set, put those predictions into the submission data frame, and then we have the submission data frame that we've created. So we just did all of that locally, and we repeated the same thing on Colab. Now we have submissions from our model ready to be submitted to Kaggle. We can save this file with submission_df.to_csv('submission.csv'), setting index to None. Here you have submission.csv, and I'm just going to download this file; it's downloaded to my desktop. Then I go to the competition page and click Late Submission, because this competition has ended, but I can still submit and see a score, and I upload the submission.csv file. It takes a minute or so to upload. You can also describe your submission, which is useful because you can see your past submissions later and it will be easy to check what you tried. So maybe what I'll do is simply copy over the hyperparameters; I think that's a great way to describe the submission, because it contains everything I need to know about it, and you can even paste it in as code. Then I click Make Submission, and that gives me a score of 0.13152. Let's see where that lies on the leaderboard. It's not a great score, but it may actually be good enough, because there are about 3,200 participants; it looks like it's a long way down, somewhere around here, so we are in the top 50%, and that's not bad. If you're in the top 50% of a Kaggle competition, you have a decent model, and you will see that the gains beyond that point are actually pretty small: from 0.13 to 0.11 the overall gain is only about 10%, so you have already achieved about 90% of the possible reduction in error. In a lot of cases the top submissions ensemble a lot of models, maybe 10 of them, and we have not trained a very big model, even though it took about 15 minutes; we've only put in a thousand estimators, a max depth of 10 and a learning rate of 0.2. People do a lot more. You can imagine setting a model to train overnight with maybe 10,000 estimators, a different learning rate and a different max depth. Of course, you first have to do some experiments to make sure you're not overfitting, but especially with gradient boosting, training times can run into hours, and you can use GPUs. So one thing I would encourage you to check out is how to train an XGBoost model on a GPU; you will see a significant speedup if you do, sometimes even 10 to 100 times faster. GPUs, or graphics processing units, are very efficient at performing matrix computations, which is the kind of computation XGBoost requires. And here we have not done cross validation, so one thing we could do is K-fold cross validation, which would give us five models, and then we could average out the results of those five models.
Another thing you could do is train five models with five different sets of hyperparameters and average the results of those models. Yet another thing you could do is train the best possible random forest model, the best possible gradient boosting model, the best possible deep learning model and the best possible SVM (support vector machine) model, and then average all of their results. And one more thing you can do there is, instead of just blindly averaging all the results, apply weights to the results of each of the models: give each model a certain weight and optimize the weights. One way to optimize the weights is by hand-tuning each weight; the other way is something called stacking, which is to train a linear regression model on top of the predictions of those models to figure out the best weight to apply to each one: how much weight should be given to the XGBoost regressor, how much to the random forest, et cetera. Those are all the techniques people apply to get to the top; the first hundred or two hundred submissions will all have several models. You can even form teams on Kaggle, and you'll find that a team of five people will each have independently trained maybe five models, and being a team essentially means ensembling the results from each of those five people: you give your results, this person gives theirs, and so on, and they average the results and submit that. That alone gives a big boost, because the more different kinds of models and ideas you combine, the more the effect of ensembling works, the wisdom of the crowd kicks in, and the errors cancel out. Okay, so that's making a Kaggle submission. I encourage you to try out different hyperparameters and beat this score; it should not be too difficult, and you can take inspiration from the top notebooks in the Code tab of the competition. Here is another exercise for you: save the model and all the required objects using joblib. We did this for decision trees, and the process is exactly the same: you create a dictionary with all the important, useful things you need to take any input and make predictions on it, and then save that dictionary using joblib. Yet another exercise is to write a function, predict_input. predict_input should take a dictionary, and in that dictionary we put in all the columns of data that are in the training dataset, for example a date. Let's see what that looked like; let me just look at the Rossmann data frame. So here is a sample input. Let's say that in the sample input, Store is 2, DayOfWeek is 4, and Date is, let's say, one of these dates. Of course you won't have Sales or Customers; you will only have Open, so let's say Open is 1, and then you have Promo, StateHoliday, SchoolHoliday, et cetera. So given an input like this, can you write a function to make a prediction from the model? What are all the things you would have to do? Well, the first thing is to put this input into a data frame, so let me just use pd.DataFrame.
Here I'm just going to give it a list of dictionaries, with just one dictionary, the sample input, and let's call this input_df. There we go: input_df. Then you have to perform all the pre-processing we performed earlier. Of course, I should also put in the remaining information; let me quickly add StateHoliday 0 and SchoolHoliday 0. So that's your input data frame. The next thing you have to do is merge it with store_df, which looks something like input_df.merge(store_df, on='Store'); let's put that into input_merged_df. Next, you have to do all the feature engineering we did: the date columns, the competition-related columns, and the Promo2-related columns; you have to add all of those here. Then you have to do the pre-processing: fill missing values via imputation, do the scaling using the same imputer and scaler we created earlier, and then do the encoding for the categorical data. After that you can pass it into the model, of course only after selecting the right set of columns, which is the numeric plus encoded columns. It's always good to write these steps out, because then it becomes a lot easier to actually do them. So now, for the feature engineering, let's do the dates. What was that function called? Let's look it up: split_date. So let's call split_date on the data frame, on input_merged_df. Then let's add the competition-related data with comp_months on input_merged_df, and then the promo columns with promo_cols on input_merged_df. Let's look at input_merged_df after this. Hmm, this date shouldn't be the 31st; it should be the 30th, because the 31st is not a valid date here, so let's get that fixed. Now we have input_merged_df with all these columns. Next, we need to perform the imputation; well, there's no imputation to be done here, because the competition distance is already present, so that's not required. There is scaling required, so I'm going to do input_merged_df[numeric_cols] = scaler.transform(input_merged_df[numeric_cols]); the scaler already holds the scaling parameters. And there is also encoding required: input_merged_df[encoded_cols] = encoder.transform(input_merged_df[categorical_cols]); remember, we've already saved the names of the encoded columns. So that's the scaling and the encoding. I think I may have to go through this a little more carefully, but once you've done the imputation and scaling, you can select the numerical and encoded columns, and then you should be able to make predictions. I believe I forgot to add Promo, so let me just add that here; I just missed one field in the input. There seem to be some other issues as well, and we'll fix all of those, but once you have that, you get an X_input, and X_input is simply input_merged_df[numeric_cols + encoded_cols].
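Pulling these steps into one place, here is a heavily hedged sketch of a predict_input helper. The helpers split_date, comp_months and promo_cols, the fitted scaler and encoder, and the column lists are assumed to exist from earlier in the notebook (and are assumed to modify the frame in place); the sample values below are purely illustrative:

```python
def predict_input(model, single_input):
    # Build a one-row data frame from the raw input dictionary
    input_df = pd.DataFrame([single_input])
    # Attach the store-level information
    merged_df = input_df.merge(store_df, how='left', on='Store')
    # Feature engineering, using the same helpers applied to the training data
    split_date(merged_df)
    comp_months(merged_df)
    promo_cols(merged_df)
    # (imputation would go here if an imputer was fitted on the training data)
    # Scaling and encoding with the already-fitted objects
    merged_df[numeric_cols] = scaler.transform(merged_df[numeric_cols])
    merged_df[encoded_cols] = encoder.transform(merged_df[categorical_cols])
    # Select the same columns the model was trained on and predict
    X_input = merged_df[numeric_cols + encoded_cols]
    return model.predict(X_input)[0]

sample_input = {'Store': 2, 'DayOfWeek': 4, 'Date': '2015-09-30', 'Open': 1,
                'Promo': 1, 'StateHoliday': '0', 'SchoolHoliday': 0}
predict_input(model, sample_input)
```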
Those are the numeric columns plus the one-hot encoded columns, and then you can pass X_input into the model. You can say model.predict(X_input), and that gives you a list of predictions; of course, there's only one element in the list, so you can just select element zero, and that is your prediction. Let's take a second to fix this error, "NaN is not supported for input types, the inputs could not be safely coerced"; maybe I've still missed some column somewhere, I'm just not too sure, so maybe I have to do this a little more carefully. Checking test_df: we have the store, the day of the week, the date, Open, Promo and StateHoliday, so I think we have all the information, and I'm not sure what the issue is. I see, maybe this value should be 'a' and this one should be 0. Yep, all right. Okay, now we've fixed the input, and this is the prediction for it: the prediction our model makes is 6026.005. So the model predicts that, given this sample input, the sales on this particular date are going to be about 6026, based on everything it has learned from our data. And that's how you use the model: you have to perform all these pre-processing steps on the raw input before you can make a prediction. So that's all for today. We learned about gradient boosting, which is a very simple idea: we first predict the average as our prediction, then we find the errors that prediction has made and train a decision tree to correct them, then we make new predictions using the initial prediction plus the predictions of the decision tree scaled down by a learning rate; that gives us new errors, which we try to correct with a second decision tree, and so on. We looked at how to download a real-world dataset from a Kaggle competition, and how to perform feature engineering and prepare a dataset for training; this is a very important part, and it is something you will learn with practice and by looking at what other people are doing. We looked at training and interpreting a gradient boosting model using XGBoost. There's also another library called LightGBM, which works in exactly the same way but has certain optimizations that might be useful for larger datasets, and some of its internal implementations are different; ultimately, both XGBoost and LightGBM are implementations of several papers in the space of tree-based learning algorithms and gradient boosting, so check out what the differences are. We also saw how to perform K-fold cross validation and how to combine the results of the K folds, and we learned how to configure a gradient boosting model and tune its hyperparameters, which essentially means going up and down to figure out where we get the best fit, starting with the most impactful hyperparameter and then slowly working through the others. And that's what machine learning engineers, data scientists and ML practitioners do all day: create new features, tune hyperparameters, think of ways to improve the data for the machine learning model, and maybe get the model to train a bit faster as well. Here are some resources you should check out.
There is definitely a lot more about gradient boosting that we've not covered here, but this should serve as a good introduction. The topic for today is unsupervised learning using scikit-learn, and specifically we're going to talk about a couple of unsupervised learning techniques: clustering, which is taking some data points and identifying clusters among them (there's a visual representation here), and dimensionality reduction, which is taking a bunch of data points that exist in two, three, four, five or any number of dimensions and reducing them to fewer dimensions. For example, here we're taking all these points, taking a line, projecting all the points onto the line, and simply looking at their positions on the line instead of looking at their two coordinates. So that's what we're going to talk about today. We'll start with an overview of unsupervised learning algorithms in scikit-learn, then talk about clustering, and then talk about dimensionality reduction. This is going to be a high-level overview; we're not going to look at very detailed implementations or the detailed internal workings. We will try to grasp the intuitive idea of how these different algorithms work, what they're used for, and how they differ from one another. And I encourage you to explore more. As I've been saying for the last few weeks, we are at the point where you can learn things from online resources, books, tutorials and courses on a need-to-know basis: whenever you come across a term you need to know about, you look it up, find a good resource, spend a day or two working through some examples and become familiar with it. From this point on, you have to start searching and doing some research on your own. A great way to consolidate your learning is to put together a short tutorial of your own, like the tutorials you have been seeing in this bootcamp. Try creating your own tutorial on any topic of your choice, for example on principal component analysis, publish it and share it with the community; we would love to include it back within the bootcamp as well. So let's install the required libraries. I'm installing numpy, pandas, matplotlib and seaborn, which are the standard data analysis libraries, and I'm also installing jovian and scikit-learn, because these are the libraries we'll need today. Now, unsupervised machine learning refers to the category of machine learning techniques where models are trained on a dataset without any labels, unlike supervised learning. You might wonder what exactly we are training for if there are no labels and we just have a bunch of data. Unsupervised learning is generally used to discover patterns in data and to reduce high-dimensional data to fewer dimensions. And here is how it fits into the overall machine learning landscape: you have computer science, which is where artificial intelligence, machine learning and everything we're doing sits; within computer science you have artificial intelligence, where sometimes you have rule-based systems and sometimes you have machine learning models, which learn patterns and relationships from data.
Machine learning, in turn, comprises unsupervised learning and supervised learning. Supervised learning is where you have labels for your data, and unsupervised learning is where you do not have labels. There are also a couple of other categories, or overlaps, called semi-supervised learning and self-supervised learning; we'll not get into those right now, but I encourage you to check them out. And then there is one branch of machine learning, deep learning, which we have not talked about in this course, but we have another course on it called Deep Learning with PyTorch: Zero to GANs, so I encourage you to check that out sometime later. It cuts across all of these categories; it's a new paradigm or a new way of doing machine learning, a wide-reaching approach that applies to many different kinds of problems. Here's what we are studying in this course: we are looking at classical machine learning algorithms as opposed to deep learning, and the reason is that a lot of the data we work with today is tabular data; something like 95% of the data that companies deal with is tabular: Excel sheets, database tables, CSV files, et cetera. And the best-known algorithms for tabular data at this point, especially algorithms that can be interpreted and controlled well, are classical machine learning algorithms. We've already looked at supervised learning algorithms: linear regression, logistic regression, decision tree classification, and gradient boosting based classification and regression. Classification is where we divide observations into different classes and predict those classes for new observations; regression is where we try to predict a continuous value. Today we're looking at unsupervised learning, where there is no label for any of the data, and you either try to cluster the data, which means creating groups of similar data points, or you try to reduce the dimensionality of the data, or sometimes you try to find associations between different data points and use those for things like recommendations. scikit-learn offers a cheat sheet to help you decide which model to pick for a given problem. Most of the time, once you've done a little bit of machine learning, you will automatically know which models to pick, so it's a fairly simple, even obvious tree, but it's still good to see the four categories of algorithms available in scikit-learn. If you are predicting a category and you have some labeled data, that is when you use classification. On the other hand, if you're trying to predict a category and you do not have labeled data, that is when you use clustering; that's the difference between classification and clustering, and it's a common confusion: in clustering there are no labels for us to look at, we just want to group whatever data points we have into different clusters. And of course, if you're predicting a quantity, you should be using regression, and if you're just looking at data, if you just want to visualize it or reduce its size, that's when you look at principal component analysis, embeddings and things like that. Okay, so let's talk about clustering.
As I've said a couple of times already, clustering is the process of grouping objects from a dataset such that the objects in the same group are more similar, in some sense, to each other than to those in other groups. That's the Wikipedia definition for you, and scikit-learn offers several clustering algorithms; in fact, it has an entire section on clustering that you should check out. It describes the different clustering methods, and you can see there are ten or more of them; it tells you about their use cases, their scalability (how well they scale to different sizes of datasets and different numbers of clusters), and the parameters they take. All of these are fairly different in how they're implemented, but the goal is ultimately the same. The goal is to take some data; for example, this plot here shows incomes and debt, a scatter plot between the incomes of various people and the amount of debt they currently have, where each point represents one person. These are people with high income but low debt, these are people with low income and low debt, and these are people with low income and very high debt. So clearly there are three clusters of people here. Typically these clusters don't come predefined; you would simply have income and debt information, and what you might want to do is identify which cluster a particular person belongs to, or even figure out what the clusters should look like in the first place. That's what clustering is, and that's what we'll try to do today: given all these points, we'll try to figure out how many clusters there are in the data, which cluster each point belongs to, and, if a new data point comes in, which cluster that point will belong to. Now, why are we interested in clustering in the first place? Here are some real-world applications. One is customer segmentation. Suppose you are an insurance company, or a bank looking at loan applications; then this kind of cluster analysis is something you may want to do: plot income and debt, see which cluster a person lies in, and perhaps have a different set of operating rules for high income, low debt people, for low income, low debt people, and for low income, high debt people. To simplify decision-making, instead of having to look at both variables every time, you can feed the variables into an algorithm, get back a category, and use that category to make decisions. In fact, the classes in several classification problems are often obtained in the first place through a clustering process, especially when you're looking at things like low risk, medium risk and high risk. Product recommendation is another important application of clustering: if you can identify clusters of people who like a particular product, then you can recommend that same product to other people with similar behavior. Feature engineering is yet another application.
What you could do is perform some clustering on your data and then take the cluster number and add it to your training data as a categorical column, and it's possible that adding that categorical column improves the training of a decision tree or a gradient boosting model. Then you have anomaly or fraud detection. If you take a bunch of credit card transactions and cluster them, you will notice that fraudulent transactions stand out: maybe there are certain credit cards that make a lot of transactions, so they don't fall within the regular clusters but within an anomalous cluster, and that can then be used to detect that kind of fraudulent behavior and decide how to deal with it. Another use of clustering, hierarchical clustering specifically, is taxonomy creation. You know that there are several hierarchical divisions in biology: first you have animals and plants, then within plants you have a bunch of different families, and within each family a bunch of different species, and so on. That kind of hierarchy is created using clustering, where you take a bunch of different attributes, like how a particular animal reproduces, what kind of teeth it has, what its weight is, where it lives, and things like that, and use them to create clusters and form families of related animals; then related families get grouped into larger related groups, and so on. So those are some applications of clustering, and we will use the Iris flower dataset to study some of the clustering algorithms available in scikit-learn. Here is the Iris flower dataset; I'm just going to load it from the seaborn library, since it comes included with seaborn. In this dataset you have observations about 150 flowers: you can see 150 rows of data, and each row represents one flower. The observations are the length and width of the sepal and the length and width of the petal; the sepal and the petal are two parts of the flower. So those are the four measurements we have, and we also know which species each of these flowers belongs to. Now, for the purpose of today's tutorial, we are going to pretend that we don't know which species these flowers belong to; we just have the measurements. What we'll try to do is apply clustering techniques to group all of these flowers into clusters, and then see whether the clusters our algorithms pick out, just by looking at these four measurements, match the species or not. So we're not going to use the species column at all; we are going to treat this data as unlabeled. Here's what the data looks like: if I just plot sepal length versus petal length, this is what the points look like, and if I did not have the species information, this is what the points would look like. If you look at the points, you might say that maybe there are two clusters, one here and one there; maybe, if you look even more closely, you might say that there are three clusters here.
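For reference, a small sketch that loads the dataset and reproduces roughly the plot being described, assuming seaborn's bundled copy of the iris data:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# The iris measurements ship with seaborn
iris_df = sns.load_dataset('iris')

# Scatter plot of the two measurements discussed here; drop hue='species'
# to see the unlabeled view we're pretending to work with
sns.scatterplot(data=iris_df, x='sepal_length', y='petal_length', hue='species')
plt.show()
```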
Now, if you look at the points here, you might say that maybe there are two clusters: there's one cluster here, and there's one cluster here. If you look even more closely, you might say there are maybe three clusters: maybe this is a cluster, and this seems like a cluster. And of course, we're just looking at two dimensions here, sepal length and petal length. We're not really looking at all four dimensions, because we can't even visualize four dimensions, but we could look at sepal width and petal width instead, and there again seem to be a couple of clusters, maybe three. We could look at different combinations, or take three of the measurements and visualize them in 3D and try to identify some clusters there, but we have no way of visualizing things in 4D. That's where we'll have to take the help of a computer. Even within this representation of sepal length and petal length, you can see that clusters do start to form, and it's an interesting question how you train a computer to figure out these clusters given just the four measurements: sepal length, sepal width, petal length and petal width. As I've said, we are going to attempt to cluster the observations using the numeric columns in the data, so I'm going to pull out a list of numeric columns and take the data from those columns into a variable X (there's a rough sketch of this right after this paragraph). X is typically used to represent the input to a machine learning algorithm, and there is just this X; there's no y here. The first clustering algorithm we'll talk about is K-means clustering. The K-means algorithm attempts to classify objects into a predetermined number of clusters: the K in K-means is the number of clusters you want the data to have, and you have to decide it. In this case, let's say we have some intuition from looking at the scatter plots that maybe there are three clusters. So we give the number of clusters as an input to the K-means algorithm, say K as three, and it's then going to figure out three central points, one for each cluster: here is the central point for cluster number one, here is the central point for cluster number two, and here is the central point for cluster number three. These central points are also called centroids. Each object, or each observation in the dataset, is then classified as belonging to the cluster represented by the closest central point. That's why all of these belong to this cluster, all of these to this cluster, and all of these to this cluster. And of course, if you now go out and make a new observation, maybe you get another flower, measure its sepal and petal and all, and put the observations in here, and the flower lies somewhere here, then it will belong to the cluster whose center it is closest to. So that's the K-means algorithm. Now, the question you may have is: how exactly are the centers determined? Let's talk about that, and maybe take an example. Let's take a one-dimensional example first, and then we'll expand to two dimensions and go from there. Let me draw a line here, and let's say this line represents the petal length, going from zero to five, since all the flowers seem to have petal lengths between zero and five. And instead of considering all 150 flowers, let me take about 12.
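As referenced a little earlier, here's a rough sketch of pulling the numeric columns into the input variable X, assuming the iris_df dataframe and the seaborn column names from the previous sketch.

```python
# Keep only the four numeric measurement columns as the (unlabeled) input X
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris_df[numeric_cols]
X.head()
```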
Let's say you have these four flowers with a fairly low petal length, very close to one, maybe. Then you have another four flowers with a medium petal length, and yet another four flowers with a high petal length. Now, visually, just by looking at the petal length, you can kind of say that this is one cluster, this is one cluster, and this is one cluster. The challenge is: how do you train a computer to figure out these clusters? Here's what we do. First, in the K-means algorithm, we determine the value of K. Let's say I set the value of K to three; for some reason I have some intuition that maybe the value of K is three, and we'll talk about how to pick the right value of K later. Then we pick three random points: maybe I pick this one, that's a random point; I pick this one, that's a random point; and I pick this one, that's a random point. We are now going to treat these three random points as the centers of our clusters, so this becomes the center of cluster one, this becomes the center of cluster two, and this becomes the center of cluster three. So here's what we've done so far: pick K random objects as the initial cluster centers. The next step is to classify each object into the cluster whose center is closest to that object. So let's start, and let's try to classify this point, or this flower. Which cluster center is it closest to? You can clearly see that it is closest to center one, so it becomes one. Check the next one: this is also closest to one, so it also gets assigned cluster one, and this one, and this one. I would say even this one is closer to one than to two, so it also gets assigned cluster one, and of course this point is already in cluster one. Then you have this one in cluster two, and this one also gets assigned cluster two. This one, I would say, is closer to three, so it gets assigned cluster three, and this is cluster three, and this is cluster three as well. So now we have a clustering, but this clustering is definitely not that great: you can clearly see that these two should rather belong to cluster two. Here is where we do the next interesting thing. Once we have determined the clusters, so this is cluster one, this is cluster two, and this is cluster three, we then compute, for each cluster of classified objects, the centroid, which is nothing but the mean. We're looking at one dimension right now, so the centroid is simply the mean; we'll talk about two dimensions later. So here's what we do: we take all these values, this is about 0.7, this is about 1, this is about 1.2, et cetera, and we take their average. If we take the average of all the values in cluster one, and let me draw another line here, the average would be somewhere around here. Then we take the average of all the values in cluster two, and that would be somewhere around here, and the average of all the values in cluster three would be somewhere around here. Now, here's the interesting thing.
Once we've taken the averages, we make these the new centers of the clusters: this becomes the center for cluster one, this becomes the center for cluster two, and this becomes the center for cluster three. And now you can see things are already starting to look better. Let's just put back these points here. So what we've done is taken each cluster that we created using the randomly picked points, taken the averages of those clusters, and said these are the new cluster centers. Now, using these new cluster centers, we reclassify the points: we are going to reclassify each object using the centroids as the cluster centers. So now this point is given class one, this point goes to class one, this one goes to class one, and this one goes to class one too. This one goes to class two, class two, class two, and class two, and then you have class three, class three, class three, and class three. And that's it. Just like that, we have ended up with cluster one, cluster two, and cluster three. So that is exactly how K-means works. But there's one last issue here. Suppose the points we picked in our initial random selection were very bad points. Let's say they were somewhere here: we had picked this, this, and this as the points. What would happen then? Well, all of these would belong to the first cluster, this would be the second cluster, this would be the third cluster, and the averages would lie somewhere here. So your first cluster would be here, and the second and third would be here. Even after computing the averages and reclassifying the points, you would still end up with this huge cluster; maybe these might get excluded, maybe these might go into cluster two, but you would still get both of these sections lumped into one big cluster. So K-means does not always work perfectly; it depends on that random selection. That's why we repeat these steps, steps one to six, a few more times and pick the cluster centers with the lowest total variance, and we'll talk about that last piece again. Just to recap: pick K random objects as the initial cluster centers, like this one, this one, and this one; classify each object into the cluster whose center is closest to it; compute the centroid, or mean, for each cluster, so here's a centroid here, and here, and here; then use the centroids as the new cluster centers and reclassify all the points. You keep track of those centroids, and then you do the whole process again: pick K random objects, classify, compute the centroids, reclassify, keep that result around, and keep doing this over and over, maybe 10 or 20 times. Out of those 20 runs, you simply pick the one that gives you the lowest total variance.
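To make those steps concrete, here's a rough NumPy sketch of the assign-and-recompute loop on a one-dimensional set of petal lengths. The values, the random seed, and the fixed ten iterations are made up purely for illustration; this is not how you'd normally run K-means in practice.

```python
import numpy as np

petal_lengths = np.array([0.7, 1.0, 1.2, 1.3, 2.4, 2.6, 2.8, 3.0, 4.2, 4.4, 4.6, 4.9])
k = 3

# Step 1: pick K random points as the initial cluster centers
rng = np.random.default_rng(42)
centers = rng.choice(petal_lengths, size=k, replace=False)

for _ in range(10):  # repeat the assign/recompute steps a few times
    # Step 2: assign each point to the cluster with the closest center
    labels = np.argmin(np.abs(petal_lengths[:, None] - centers[None, :]), axis=1)
    # Step 3: recompute each center as the mean (centroid) of its cluster
    centers = np.array([
        petal_lengths[labels == i].mean() if np.any(labels == i) else centers[i]
        for i in range(k)
    ])

print(centers)  # final cluster centers
print(labels)   # cluster assigned to each flower
```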
So what do we mean by the total variance? Well, here's how you can compare two sets of clusters. I'm not going to draw the points anymore, but let's say you have one set of clusters that looks like this, and another set of clusters, again obtained through K-means, that looks like this. These points are nothing but values on the line; remember, this is petal length. And whenever you have points, you can compute the variance: the variance is simply a measure of spread, how much spread there is in the data. So you compute the variance of these points, the variance of the points here, and the variance of these points, and add up those values. Here, this is a relatively low variance: let's say this variance is 0.7, this variance is 0.2, and this variance is 0.5. On the other hand, consider this one: the variance here is pretty high, say 2.1, and the variance here is about 0.4, and the variance here is about 0.1. So the total variance in this case is about 2.6, and the total variance there, 0.7 plus 0.2 plus 0.5, is about 1.4. And the variance is simply the square of the standard deviation, if you remember from basic statistics. So what do we mean when we have a low total variance? We are essentially expressing that all the points within every cluster are very close together. And when we have a high total variance, we are expressing that there are points in certain clusters that are very far away from each other. So as we try all of these random experiments, we simply pick the cluster centers that minimize the total variance, and with enough random experiments you can be almost sure that you will get very close to the optimal solution. You may not get the best possible division, and sometimes, even when you run K-means multiple times, you may get different cluster divisions depending on the data, but once you run it enough times you will get to a pretty good place. And that's basically how the K-means algorithm works. Now, how does this apply to two-dimensional data? In exactly the same way. Let's say you have petal length and petal width, so you're looking at both of these, and you have a couple of flowers here, some flowers here, and some flowers here. What do you do? Let's say you've set K equal to three, so you pick three random points and set them as the cluster centers. Once you set them as the cluster centers, you label all the other points: one, one, one, and all of these, I think, are also close to one, and these are close to two, and this one is close to three, or something like that. Then you take all these points and compute the centroid: the centroid of these ends up here, the centroid of these ends up here, and the centroid of these ends up here. Now, once you get the centroids, you once again do the classification, and when you do the computations against the centroids, some of the clusters may change. This is probably not a great example, but let's say you got a centroid here and a centroid here; what will then happen is that all of these will fall into a different cluster, and all of these will fall into a different cluster, and all of these will fall into a different cluster.
So it's the exact same process: pick K random objects, classify, compute the centroids. And what is the centroid? Well, you take all the x values of the points in the cluster and take their mean, and you take all the y values of the points in the cluster and take their mean. The centroid is simply a dimension-wise mean. You then do this multiple times using random initial cluster centers, and you pick the centers with the lowest total variance, using that as the measure of goodness. Okay, so that's how K-means works. And here is how you actually perform K-means clustering: you don't have to write code for any of that. All you do is import KMeans from sklearn.cluster, then initialize KMeans, giving it n_clusters, the number of clusters you want to create, and a random state. You don't have to give the random state; it just ensures that the randomization is initialized the same way each time. Then you call model.fit and give it X, and remember, X simply contains the numeric data; there is no target given here. Once the model is fitted, you can check the cluster centers. Remember, this is clustering not on one dimension or two dimensions but on four different dimensions; the algorithm is exactly the same, it's just that the number of dimensions has increased. What the model has found is that the cluster centers are as follows: for cluster one, the center is at a sepal length of 5.9, a sepal width of 2.7, a petal length of 4.3 and a petal width of 1.4; this is the second cluster, similarly, and this is the third cluster. Now, when we want to classify points using the model, this is what we do. We check the distance from 5.1 to 5.9, which is about 0.8; the distance from 3.5 to 2.7, which is about 0.7; the distance from 1.4 to 4.3, which is a lot, about 2.9; and the distance from 0.2 to 1.4. So we see how far away each of these values is from the cluster center's values: we basically subtract the cluster center from the actual values and then add up the squares of those differences. What does that mean? If you have a cluster center here, and this is a point, then in two dimensions the cluster center looks like an (x, y) pair and the point looks like another (x, y) pair; let me call them (x1, y1) and (x2, y2). What we do is compute (x1 - x2) squared plus (y1 - y2) squared, and we take the square root of that. And what is that exactly? It's nothing but the actual distance between the two points in two dimensions, because if you look at this right-angled triangle, this edge of the triangle is simply x2 minus x1, and this edge over here is simply y2 minus y1, so by the Pythagorean theorem the distance is the square root of the sum of the squares of these two sides. So this is actually the distance d between the two points.
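Here's roughly what that scikit-learn usage looks like, assuming the X dataframe from the earlier sketch; the manual distance computation at the end is only there to illustrate the Euclidean distance idea described above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit K-means with 3 clusters on the unlabeled numeric data
model = KMeans(n_clusters=3, random_state=42)
model.fit(X)
print(model.cluster_centers_)  # one row per cluster, one column per dimension

# Euclidean distance from the first flower to each of the three cluster centers
first_flower = X.iloc[0].values
distances = np.sqrt(((model.cluster_centers_ - first_flower) ** 2).sum(axis=1))
print(distances)  # the flower belongs to the cluster with the smallest distance
```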
We check, for each point, its distance from each cluster center: for this point we compute the distance from this cluster center, this cluster center, and this cluster center, and we find that this cluster center is the closest, so this is the cluster it gets assigned to. Don't worry too much about the math; the basic idea is that each point gets assigned to the cluster it is closest to, and that closeness is determined using this distance, sometimes called the L2 norm and sometimes the Euclidean distance, computed in these four dimensions. The way we can now classify each point, or each flower, into these clusters is by calling model.predict. So we call model.predict on X, and the model figures out, by calculating the distance of this flower to cluster center zero, cluster center one and cluster center two, that it is closest to this cluster center. You can verify that with the distance formula yourself if you want. So all of these points belong to cluster number one, then a bunch of these points belong to cluster number zero, some of the ones in between belong to cluster number two, then a lot of these points belong to cluster number two, and some of the in-between ones belong to cluster number zero. So each flower has now been classified into a cluster, and if I want to plot that, here is what it looks like. I've plotted the cluster centers here: this is cluster center number zero, this is cluster center number one, and this is cluster center number two. I'll let you verify that it is classifying based on the closeness of the points. Here we are just looking at sepal length and petal length, but if you actually measure the closeness, it takes all four dimensions into consideration, which is why you may not get a perfect picture here; measured across all four dimensions, all of these flowers have been detected as belonging to this cluster, all of these flowers belong to this cluster, and all of these belong here. And that looks pretty similar to the chart we had earlier, the scatter plot with species, where it seemed like there was one species of flower, then a second species and a third species. It's possible that there is some misclassification, but you can already see the power of clustering.
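A minimal sketch of the prediction and plotting step described above, again assuming the fitted model and the X dataframe from the earlier sketches.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assign each flower to its closest cluster center
preds = model.predict(X)

# Visualize the clusters on two of the four dimensions
sns.scatterplot(data=X, x='sepal_length', y='petal_length', hue=preds)
# Columns 0 and 2 of the centers correspond to sepal length and petal length
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 2],
            marker='X', s=200, c='black')
plt.show()
```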
Now imagine these were not flowers but attributes about customers coming to a website. Maybe we took four attributes like how long they stayed on the site, how many things they clicked on, how far down the page they scrolled, and where they came from. We cluster our customers based on those four attributes and we get these three clusters. We can then look into each cluster and figure out that maybe these customers spend very little time on the site, maybe these customers spend a decent amount of time and also scroll quite far, and maybe these customers actually make a purchase. Then maybe we can go and interview some of those customers and figure out that we should focus more of our marketing efforts on them: understand their demographics, understand the products they look at and the kind of celebrities they follow, maybe get one of those celebrities to endorse our products, and then get a lot more customers who fall into this cluster. In general, you want to grow the cluster of your paying users, and you probably want to ignore the people who are not really interested in your product. So that's how this extends into real-world analysis. You can think of clustering more as a data analysis tool: sometimes you can just take the data, cluster it, present your observations, and use that as a kickoff point for further analysis. But technically speaking, because it is figuring out patterns from data, it is a machine learning algorithm. I also mentioned the goodness of the fit. For the total variance of these three clusters, what we do is take the variance of all these points, all these points, and all these points across all four dimensions, average out the variance across the dimensions, and then add up the variances of the individual clusters. That total of the variances of the individual clusters is called the inertia. Remember, variance tells you the amount of spread in the data, so the less spread there is within the clusters, the better the goodness of fit. Here it turns out that we have an overall inertia of 78. Now let's try creating six clusters instead of three. So here we have KMeans with the number of clusters set to six, we predict, and you can see it has made a bunch of predictions; this is what the clusters look like. We have a couple of clusters here, a couple of clusters here, and a couple of clusters here, so even with six clusters it basically took those original three clusters and broke each of them into two more. And maybe, if you go back and look at the actual species, and maybe at the actual flowers, you may realize that these are the fully grown flowers of one species and these are the young ones, or maybe these are fully grown virginica flowers and these are young virginica flowers, or things like that. So when you do clustering, you may uncover more interesting things, even if you already have some classes or labels for your data. So here's what we get with six clusters, and you can check the inertia here; it should be lower. If we check model.inertia_, you can see that it is 39 instead of 78, so these six clusters are definitely a tighter fit, because the total overall variance across all the clusters is quite low.
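For reference, scikit-learn exposes this measure through the inertia_ attribute, which it defines as the sum of squared distances of the points to their closest cluster center. Here's a small sketch comparing three and six clusters, assuming the model and X from the earlier sketches; the exact numbers you get may differ from the ones quoted above.

```python
from sklearn.cluster import KMeans

print(model.inertia_)  # inertia of the 3-cluster model

# Fit a 6-cluster model on the same data and compare
model6 = KMeans(n_clusters=6, random_state=42).fit(X)
print(model6.inertia_)  # should be noticeably lower than with 3 clusters
```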
Now, in most real-world scenarios there is no predetermined number of clusters. In such a case, what you can do is maybe just take a sample of the data: if you have millions of data points, maybe just take a hundred or a thousand of them, and then try different numbers of clusters. For each value of K, which is the number of clusters, train the model and compute the inertia of the model, the total overall variance, and then plot the inertia. So what we're going to do here is take cluster sizes of two to ten and try them all out, and then plot the number of clusters on the x axis and the inertia on the y axis. That creates a graph like this, called an elbow curve. It may not always be this nice, steadily decreasing kind of curve; in a lot of cases it will flatten out beyond a certain point, and creating new clusters won't really help. So what you can decide is that the point at which things start to flatten out is maybe the right number of clusters. In this case, I would say there's a huge decrease when we go from two clusters to three, and from three to four, and maybe even four to five, but definitely around six things start to flatten out, so maybe for this data I should just go with six clusters. This is the kind of analysis you can do, and you don't have to use the entire dataset to do it; do the analysis and pick the point where you get this elbow kind of shape. In a lot of cases, what you will end up with is a graph that looks like this: here you have the value of K, the number of clusters, here you have the inertia, and this is what the graph will typically look like. You want to pick either this point or this point as the number of clusters. So I can't really tell you that you need three clusters or five clusters, but you draw this graph and then, based on it, figure out where the elbow point is. So that's the K-means algorithm.
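Here's a rough sketch of the elbow-curve loop described above, assuming the same X as before.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-means for K = 2..10 and record the inertia for each value of K
options = range(2, 11)
inertias = []
for n_clusters in options:
    km = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(options, inertias, '-o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow curve')
plt.show()
```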
Now, if you remember the algorithm, one thing we said was that we want to repeat the whole thing randomly many times: pick K points, use them as cluster centers, classify all the points, compute the centroids, reclassify all the points, and then repeat that process with new random picks many times. If you're working with a really large dataset, that may not be a good idea, because it can get really, really slow. That's where you have a variation of K-means called the mini-batch K-means algorithm. In mini-batch K-means, instead of classifying all the points, you pick just a fixed number of points, and that fixed number is called the mini-batch size. So you pick, say, a hundred points and compute the cluster centers for just those hundred points. Then you pick the next hundred points, and for those, rather than starting from K random points, you start from the previous K centroids. So that's the small change you apply: each time you pick about a hundred or two hundred or three hundred points, whatever the batch size is, and use them to update the centroids, the cluster centers, from the previous batch. This is a point where you can go and read about mini-batch K-means clustering and figure out how it differs from the traditional K-means clustering we've just done. Here's a dataset you can practice on: the mall customers dataset, which has information about a bunch of customers who visited a mall. You can try performing K-means clustering by downloading this dataset and using the KMeans class from scikit-learn, and then maybe study the segments: once you have the cluster segments, add them as a new column in the data and do some more analysis, for example look at what the spends and the age groups look like for each cluster. Do check this out, and you can also try comparing K-means clustering with mini-batch K-means clustering on this dataset. One other thing you can try is configuring how much work the K-means algorithm does: you can set how many random initializations it tries, and the maximum number of iterations per run, which by default is 300 and is simply the number of times the classify-and-recompute steps are repeated while trying to find good cluster centers. You can set the maximum iterations to any value you want, so see what kind of impact that has on the quality of the clusters. All right, so that's the K-means clustering algorithm.
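Before moving on, here's a minimal sketch of mini-batch K-means on the iris measurements; the batch size of 100 is just an illustrative choice.

```python
from sklearn.cluster import MiniBatchKMeans

# Mini-batch K-means updates the cluster centers using small random batches of points
mbkm = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42)
mbkm.fit(X)

print(mbkm.cluster_centers_)
print(mbkm.inertia_)
```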
So let's talk about a couple more clustering algorithms, very quickly and very briefly. The next one, again a very common one, is called DBSCAN, which is short for density-based spatial clustering of applications with noise. That's a mouthful, but it's actually not that complicated a technique; it's fairly straightforward once you understand the basic steps involved. It uses the density of points in a region to form clusters, and it has two main parameters: a parameter called epsilon, and we'll see in a moment what that means, and another parameter called min samples. Using these parameters, epsilon and min samples, it classifies each point as a core point, a reachable point, or a noise point (an outlier). So let's try to understand how DBSCAN works by looking at this example for a moment. Forget all the circles and arrows and colors for now; just imagine you have these points, and let me try to replicate them roughly: a point here, here, here, and so on. Here's how DBSCAN works. First, you set an epsilon, and of course all of this is on a coordinate plane; let's say we're still talking about petal length and petal width in two dimensions. Let's say we set epsilon to 0.5 and min samples to four. Now, consider this point, and you can start at any point: around this point, draw a circle with the radius epsilon, which I think would be about this big. Then we check whether that circle contains at least min samples points, in this case at least four, including the point itself. It does, one, two, three, and four, so we say that this is a core point, and let me mark all the core points in dark. So we've classified a core point, and everything that is connected to the core point is now part of the same cluster. So these three points are part of the same cluster. Then let's go to this point and draw the circle of radius epsilon around it as well. Once again you have one, two, three, and four points, so this point is also a core point, and this point, which is within 0.5 of the core point, now belongs to the same cluster. Then let's do this one: if you draw a circle around it, you'll notice that four points lie inside, so this is also a core point, and these are all part of the same cluster. And this one also turns out to be a core point, you can verify it will have four points around it, and it is connected to this one. And this will also be a core point, with all of these inside, and this one too. So like this, you continue connecting core points. Now, there will be some points which are not core, like this point right here. It is connected to a core point, it lies within the circle of a core point, but by itself, if I draw the circle around it, it does not contain min samples points. So this is called a reachable point, or sometimes an edge point; it's kind of the edge of the cluster. It is not a core point, it does not have enough points within its own circle, but it is still part of an existing cluster. Similarly this one: if you draw the circle around it, the circle would look something like this, so it is not a core point, but it is still a connected edge point. So this way we have now identified one cluster of points, and we're done with all of these. Let's say there is another cluster of points here, where you have these four core points all connected to each other, and then these two are connected to them, so you have two edge points in this cluster as well, and four core points; this becomes another cluster. This point right here, however, is neither a core point nor an edge point: it does not have four points close by within its circle, and it is not connected to a core point. So this is called an outlier, or sometimes a noise point. That's what these triangles, colors and lines represent: you have core points, which have at least min samples observations within their epsilon radius; you have noise points, which are not connected to any cluster; and you have edge points, or reachable points, which are connected to core points but are not core themselves. So that's what DBSCAN does. And the way to use DBSCAN is simply to import DBSCAN from sklearn.cluster. You can see in the signature that you can configure epsilon, you can configure min samples, and you can configure how the distance is calculated, which by default is Euclidean. I was showing you two dimensions, but because we have four dimensions, it's going to take the square root of the sum of the squared differences along all four dimensions, the extension of the Pythagorean theorem. And you can also specify a few other things.
I'll let you look up the remaining arguments here, but that's the basic idea. So you instantiate the DBSCAN model with the epsilon and the min samples, and then you fit it to the data. Now, here's one thing: in DBSCAN, there is no prediction step, because the definition of a core point depends on certain other points being present in the dataset. In K-means you try to figure out the center of a cluster, but in DBSCAN there is no center of a cluster; the cluster is defined by the connections between points. So you cannot use DBSCAN to classify new observations; DBSCAN simply assigns labels to all the existing observations. You can just check model.labels_, and it will tell you, for the inputs that were given when we performed the DBSCAN clustering on all of them, which labels got assigned: all of these got assigned zero and all of these got assigned one. You can also check which ones are core points, which ones are reachable points, and which ones are noise points. A good way to check all the attributes and methods available on a particular model is by using dir, and you can see there is a core_sample_indices_ attribute, which tells you the indices of the core points. It seems like most points here are core points, and maybe if we change the value of epsilon, reduce it or increase it, not all points will be core points. Let me do a scatter plot here, using the sepal length and the petal length, and as the hue I'm using model.labels_. You can see that only two clusters were detected here, zero and one, and maybe if we change the value of epsilon or the value of min samples, that number might change. So that's an exercise for you: try changing the value of epsilon, which is the size of the circle drawn around each point, and try changing the value of min samples, which decides when something is treated as a core point. Try wide ranges, maybe close to zero, maybe close to a hundred; epsilon does not have to be between zero and one, it can be very large, and similarly for min samples, maybe try the value one, maybe try a hundred. Experiment and try to figure out how each of these hyperparameters affects the clustering, and see if you can get to the desired clustering, which is ideally to cluster these points according to the species the flowers belong to. So that's the DBSCAN algorithm.
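Here's a sketch of the DBSCAN usage described above, assuming the same X; the eps and min_samples values are just illustrative starting points for the exercise.

```python
from sklearn.cluster import DBSCAN
import seaborn as sns
import matplotlib.pyplot as plt

# Fit DBSCAN; there is no separate predict step, only labels for the fitted data
model = DBSCAN(eps=1.1, min_samples=4)
model.fit(X)

print(model.labels_)               # cluster label for each flower (-1 means noise)
print(model.core_sample_indices_)  # indices of the core points

sns.scatterplot(data=X, x='sepal_length', y='petal_length', hue=model.labels_)
plt.show()
```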
The natural question you may have is: when should you use DBSCAN and when should you use K-means? Here's the main difference between them: K-means uses the concept of a cluster center and uses the distance from that center to define the cluster, whereas DBSCAN uses the nearness between the points themselves to create a cluster. Here is an example. Let's say this is the x axis and this is the y axis, and you have data like this. Visually, you can clearly tell what the right clustering is: the outer points are all connected and seem to be one cluster, and the inner points, all connected to each other, seem to be another cluster. DBSCAN can do this, because it is concerned with the nearness between points, but K-means cannot identify these two clusters: if you set this as a cluster center, then any cluster that includes this point would also need to include all of these points, because these points are closer to the center than this point is. So there's no way you can create a cluster of the outer ring using K-means. This is what K-means clustering would look like instead: maybe you'll end up with one centroid here and one centroid here, so half the points go here and half the points go there. Here is another example: you have these two horseshoe shapes, and again DBSCAN is able to detect them but K-means is not. And there are a few more examples for you to check out. One other thing about DBSCAN and K-means: in K-means you can specify how many clusters you want, but DBSCAN figures that out on its own, and you can only change the values of epsilon and min samples to indirectly affect the number of clusters that get created. So just keep in mind that whenever you want to detect these kinds of patterns, which are more concerned with the nearness of the points themselves, DBSCAN may make more sense, but if you want a distance-from-center based clustering technique, you use K-means. One more thing: you can classify new points into clusters using K-means, but you cannot use DBSCAN to classify new points; you would have to run the entire scan again, because it's possible that introducing a new point makes two clusters join together or change in some fashion.
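To see the difference concretely, here's a small sketch comparing K-means and DBSCAN on two interleaved horseshoe shapes, using scikit-learn's make_moons helper as a stand-in for the example above; the eps value is just a reasonable guess for this synthetic data.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved horseshoe shapes, which distance-to-center clustering cannot separate
points, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, random_state=42).fit_predict(points)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)

# DBSCAN recovers the two horseshoes; K-means simply splits the plane in half
print(km_labels[:10])
print(db_labels[:10])
```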
One last clustering technique I want to talk about is hierarchical clustering, and as the name suggests, it creates a hierarchy, or a tree, of clusters, not just a flat set of clusters. What does that mean? As you can see in this animation, we have a bunch of points. First we take the two closest points and combine them into a cluster. Then we combine the next two closest things, which in this case turns out to be the cluster we just formed and another point. In this way, by repeatedly combining the two closest items, we create a tree of clusters: at the bottom of the tree are the individual points, above them you have clusters of two points, above those clusters of three or four points, and above those clusters of clusters, and so on. This is the kind of thing that can typically be used to generate a taxonomy. For example, if you have observations about many different types of animals and you start performing this kind of clustering, you may find that there is a very close relationship between humans and chimpanzees, and then between humans, chimpanzees and bonobos there is another relationship, which is what is captured at this point. And then between this family and, let's say, the other mammals there is a relationship captured here; here, on the other hand, you have a relationship between the plants, and finally at the top you have a single cluster. So this is how hierarchical clustering works. You first mark each point in the dataset as a cluster by itself, so all of these points, P0 to P5, are clusters. Then you pick the two closest cluster centers without a parent and combine them into a new cluster; the new cluster is the parent of the two clusters, and its center is the mean of all the points in it. Then you repeat step two: you pick the two closest cluster centers without a parent, and this time you could be combining the center of an existing cluster and a single leaf point, which then become children of a new parent cluster. You keep picking the two closest cluster centers that do not already have a parent, and that's how you get to the top level. The structure you end up with is often called a dendrogram, and that's what this looks like, with the cluster centers marked here. For each type of clustering, I've also included a video that you can watch for a detailed visual explanation, if you're willing to go deeper and follow along on your own. Scikit-learn allows you to implement hierarchical clustering, so you can try implementing it for the iris dataset; I'll let you figure that out. So we've looked at three types of clustering now: K-means clustering, DBSCAN, and hierarchical clustering. There are several other clustering algorithms in scikit-learn, about ten or so, so you can check them all out. And with that, we'll close our discussion on clustering. It is by no means exhaustive, but I hope you've gotten a sense of what clustering is, how a couple of common clustering algorithms work, and how to use them.
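As a starting point for the hierarchical clustering exercise mentioned above, here's a minimal sketch using scikit-learn's agglomerative clustering on the iris measurements, with a dendrogram drawn via scipy; it assumes the same X as before, and the choice of Ward linkage is just one common option.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Agglomerative clustering repeatedly merges the two closest clusters
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print(labels)

# Dendrogram of the merge hierarchy (Ward linkage), computed with scipy
Z = linkage(X, method='ward')
dendrogram(Z)
plt.show()
```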
Let's talk about dimensionality reduction. In machine learning problems, we often encounter datasets with a very large number of dimensions, and by dimensions we mean the number of features, the number of columns. Sometimes they go into dozens, sometimes into hundreds, especially when you're dealing with sensor data. For example, say you're trying to train a machine learning algorithm based on the data of an actual flight, a flight that started from a certain point and ended at a certain point. Flights have hundreds or sometimes thousands of sensors, the same kinds of sensors at many different places on the aircraft, and if you just collect the information from all of those sensors, you will end up with thousands of columns. That may be a very inefficient thing to analyze and a very inefficient thing to train machine learning models on, because more columns means more data, which means more processing, which requires more resources and more time, and it may significantly slow down what you're trying to do. One thing we've typically done in the past is to just pick a few useful columns and throw out the rest, so that we can quickly train a model. But what if you didn't have to throw away most of the information? What if you could reduce the number of dimensions from, let's say, a hundred to five without losing a lot of information? That is what dimensionality reduction and manifold learning are all about. So what are the applications of dimensionality reduction? The first is reducing the size of data without losing information. Let's say you have a dataset with 200 columns. What if you could reduce those 200 columns to just five columns without losing much information, if it could still retain 95% of the information? We'll talk about what we mean by information retention, but if you could do that, then anything you do on this dataset is going to be roughly 40 times faster. So are you willing to trade maybe 5% of the information for 40 times the speed? Probably yes. That's one reason to do dimensionality reduction: it allows you to train machine learning models efficiently. Another very important application of dimensionality reduction is visualizing high-dimensional data in two or three dimensions. As humans, we can only visualize in three dimensions, and even three dimensions can get a little tricky; 3D scatter plots are quite hard to read, so at least while looking at screens, we are most comfortable looking at data in two dimensions. Even with the iris dataset we've seen the problem: we have four dimensions, but we can only really visualize two at a time. So visualizing high-dimensional data in two or three dimensions is also an application of dimensionality reduction. Let's talk about a couple of techniques for dimensionality reduction. The first is principal component analysis. Principal component analysis is a dimensionality reduction technique that uses linear projections of data (we'll talk about what linear projections mean) to reduce the number of dimensions while still attempting to preserve as much of the variance of the data in the projection as possible. Here is what PCA looks like. Let's say you have this data, where x1 represents petal length and x2 represents petal width, and you have all these points; ignore the line for a moment. Instead of keeping x1 and x2, what if we could simply draw this line, set a zero point on it, and project each point onto the line, so that we just record how far along the line each projection is from the zero point? Then we could give this point the value minus one, this point minus 1.8, this one minus 2.2, this one minus 2.7, and on the other side maybe 0.1, 0.3, 1, 2, 2.3, 2.4, and so on. So now we have gone from this data in two dimensions to data in one dimension. How do we do that? Let's take a look. Say we have some data, and let me look at just petal length and petal width for now, so two dimensions. We have point 1, point 2, point 3, point 4, and so on.
For each one, we have some petal length and petal width. Now we take that and plot it here: we plot these points, and maybe it looks a bit like this, with more points here. What we would rather want is just one value per point; we don't know what that value is called yet, so let's just call it PC1. So for points one, two, three, four, five, we just want to see one value here instead of these two columns of data. And if we can go from two columns to one column, then we should be able to go from three columns to one or two columns, or from 200 columns to five columns; the idea remains the same. So how do we do that? Well, the first thing we do is shift our axes a little bit to center the points. What do I mean by shifting the axes? We take the y-axis and move it over here, and what moving the axis means is that the x values of the points change: we calculate the mean x coordinate of all the points and simply subtract that mean from every point. Similarly, we move the x-axis, and moving the x-axis means subtracting the mean of the y coordinates. So for each point P, we subtract the average point: P has an x and a y, and the mean also has an x and a y, so the point's x subtracts the average x and the point's y subtracts the average y. That centers the points; some of them will now have negative values and some will have positive values. Then we try out a candidate line. This looks like a candidate line on which to project, so we project all of these points onto the candidate line, one by one. Once we've projected everything onto the candidate line, let me get rid of the x and y coordinates for a moment, now that we know the points are centered. We now have all these projections, and we know where the zero point is, so this point can now be represented by the foot of its perpendicular on the line, and so can this point, and this one. So if this is zero, this is one, this is two, and this is three on the new line, we can now represent each point using a single number, which is the distance of its projection from the zero point along the projected line. So we can start filling in these values: the distance for point 1, the distance for point 2, the distance for point 3, and the distance for point 4. So we have now reduced two coordinates to one coordinate, but we want to retain as much of the information, as much of the variety, from the two coordinates as possible. This is where we try to maximize D1 squared plus D2 squared plus D3 squared and so on.
Why D1 squared plus D2 squared plus D3 squared and so on? Keep in mind that we have already subtracted the mean, so when we square these distances and add them up, the sum is essentially the variance of the projected values. So what we want to do is try different options for this line: we take the line and rotate it, and each time we rotate it, all the D's change, because all the perpendiculars change. And we pick the rotated line for which the sum of the squares of the D's is the highest. To clean that up a little: once we have the centered points, you can see that if I pick this line, all of the projections are very close to zero, and if all the projections are very close to zero, most of the information is lost, because D1, D2, D3, D4 are all very close to zero. On the other hand, if we pick a line that goes like this, the projections are spread quite far apart, so we are capturing the spread of the data very nicely with a well-fitting line and not capturing it at all with an ill-fitting line. That's what PCA tries to figure out. And that's going from two dimensions to one dimension. Here is an example of going from three dimensions to two dimensions: here we have feature one, feature two, feature three (ignore the word gene here), and a bunch of points. What we do is first find PC1, the best possible line along which to project all the points, the one that preserves the highest variance. Then PC2 is a line perpendicular to the first line. Remember we're in three dimensions here, so there are an infinite number of lines perpendicular to PC1; again, we pick the one that maximizes the variance of the points when they are projected onto PC2. So for three dimensions you can have PC1 and PC2; for four dimensions you can have PC1, PC2, PC3; for five dimensions, PC1 through PC4; and if you have 200 dimensions, you can just choose the five most significant axes of variance, since each of these is a line that preserves as much variance as possible. So if you have 200 dimensions, you can do a principal component analysis and reduce them down to the five dimensions along which the variance of the data is preserved as much as possible. That's principal component analysis. And how do you do principal component analysis in scikit-learn? Let's take this dataset again and just pick the numeric columns, no species for now. We simply import the PCA model from sklearn.decomposition, create the PCA model, and provide the number of target dimensions we want, which in this case is two. Then we call pca.fit and give it the data. So we are going here from four dimensions, sepal length, sepal width, petal length and petal width, to two dimensions.
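Here's roughly what that looks like, assuming the iris_df dataframe and numeric_cols list from the earlier sketches; explained_variance_ratio_ isn't mentioned above, but it's a handy attribute that shows how much of the variance each component retains.

```python
from sklearn.decomposition import PCA

# Reduce the four iris measurements to two principal components
pca = PCA(n_components=2)
pca.fit(iris_df[numeric_cols])

print(pca.components_)                # directions of the two projection lines
print(pca.explained_variance_ratio_)  # fraction of the variance each component retains
```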
And what are those two dimensions? Well, at this point we can't really interpret them as physical dimensions. It's not that we've picked, say, sepal length and sepal width; rather, we've picked two possible linear combinations of these four features, and because lines and projections are involved, everything is still linear. The two linear combinations are independent, which means the lines along which we have projected are perpendicular, so what we are left with is the projections onto those two lines. So what happened when you called fit? Those two lines got calculated. PCA internally knows what line number one and line number two are, and you can look at that information: if you do dir(pca), you can see the available attributes, and the ones we want are the components. These are the components: these four numbers together convey the direction of the first line in four-dimensional space, and these four numbers convey the direction of the second line. You can verify that the two lines are perpendicular: if you take their dot product, you will find that it is zero. Now that we have the components, we can project the points onto these lines by calling pca.transform. If we give pca.transform this data, which has four dimensions, the numeric columns of the iris dataframe, it gives us transformed data in two dimensions: this column is the data projected onto line number one, which has this direction, and this column is the data projected onto line number two, which has this direction. If you know a little bit of linear algebra, the components are unit vectors in the directions of the lines that have been picked. In any case, we've now gone from this to this, so let's check it out with a plot. When we do a scatter plot of this transformed data, we can finally visualize, in two dimensions, information from all four dimensions. Not perfectly, of course, something is definitely lost, but we can still visualize information from all four dimensions. And if we really want to study the clusters: right now I've plotted the species, but we could just as well have plotted the clusters that were detected. Let's look at the DBSCAN clusters, which I think were in model.labels_. Yes, so these are the DBSCAN clusters, and now we can visualize the clusters we generated from clustering much better. That's one more good thing about dimensionality reduction: it can let you visualize, and maybe evaluate, the results of clustering better. You take high-dimensional data, perform clustering on it, then take its principal components and use them to visualize the clusters.
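A short sketch of the transform-and-plot step described above, assuming the fitted pca object from the previous sketch.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Project the four-dimensional data onto the two principal components
transformed = pca.transform(iris_df[numeric_cols])
print(transformed.shape)  # (150, 2)

# Visualize information from all four dimensions in a single 2D scatter plot
sns.scatterplot(x=transformed[:, 0], y=transformed[:, 1], hue=iris_df['species'])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```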
Now, principal component analysis does have some limitations. The first is that it uses linear projections, which may not always achieve a good separation of the data. One thing we already noted is that with a badly chosen line, information gets lost because most of the projections fall in the same place. Here's a concrete problem: if you have a bunch of points here and another bunch here, they may all project to exactly the same value, so some information does get lost. And suppose those two bunches belong to different classes; as soon as you do PCA and then try to train a classification model, you're going to lose that information. So there are limitations with PCA, but for the most part it works out pretty well. I would use PCA whenever you have hundreds or maybe thousands of features and you need to reduce them to a few. As an exercise, I encourage you to apply principal component analysis to a large, high-dimensional dataset: take the house prices Kaggle competition dataset, try to reduce the numerical columns from around 50 down to fewer than five, and then train a machine learning model using the low-dimensional result. What you want to observe is how the loss and the training time change for different numbers of target dimensions. And that's where we come back to this trade-off. Now we know what we mean by "information": we mean variance. If you could trade 200 columns for five columns for maybe a 5% loss in variance, that would give you a 40x speedup, and that speedup could be make or break for your analysis, because you can now analyze 40 times more data in the same time. So when you have really large datasets, PCA is definitely a very useful thing to do. There's also a Jupyter notebook linked here that goes into a lot more depth about principal component analysis; the way it is actually computed is with a technique called singular value decomposition, or SVD. There's a bunch of linear algebra involved there, but the intuition and the process are exactly what we described: you find a line, rotate it until you get the maximum variance, then find the next line which is perpendicular to it and still preserves the maximum possible variance, and keep going. So let's talk about another dimensionality reduction technique: t-distributed stochastic neighbor embedding, or t-SNE for short. It belongs to a class of algorithms called manifold learning, which is an approach to nonlinear dimensionality reduction. PCA is linear, and there are more methods like it, such as ICA and LDA; they are all linear in the sense that they come down to linear algebra and matrix multiplications. But linear methods sometimes have limitations, so you can use some of these nonlinear methods for dimensionality reduction instead.
They're typically based on the idea that the dimensionality of many datasets is only artificially high: most datasets with one hundred, two hundred, or five hundred columns can really be captured quite well with four or five columns. We just have to figure out how to come up with those four or five columns, whether through some formula applied to all the columns or through something like feature engineering, except that the computer figures out these features for you based on rules built into these different kinds of models. Scikit-learn has a bunch of different manifold learning techniques, and you can see a comparison here. This is the original dataset plotted in 3D, with the points colored just to give you a sense of which point goes where, and when you apply these different manifold learning techniques, the 3D dataset collapses into these 2D plots. You can see that Isomap is able to separate the red from the yellow from the green from the blue, which would be very difficult for PCA to do: if you tried to draw two lines and drop projections, it would be very hard to get a separation like this, but Isomap manages it. Here is t-SNE, and you can see that t-SNE is also able to separate the red from the yellow, from the green, from the blue, and each technique gives you a different kind of separation. Now, t-SNE specifically is used to visualize very high-dimensional data in one, two, or three dimensions, so it is for the purpose of visualization. Very roughly speaking, here is how it works; we won't get into the detailed mechanics of t-SNE because it's a little more involved and would take more time to cover properly. Suppose you again have these clusters of points, say petal length versus petal width. If you just projected them directly onto this line, all of these blues would overlap with a bunch of oranges, and those would overlap with a bunch of reds. What t-SNE does is first project the points onto a line and then move the points around using a kind of nearness rule: every projected point is moved closer to the points that are close to it in the original dataset, and moved away from the points that are far from it in the original dataset. I know that's not a very concrete way of putting it, but here's what it means. As you project this blue point down, it ends up here, and as you project this orange point down, it ends up next to it. t-SNE will realize that the blue point should be closer to the other blue points and farther from the orange points, because that's how it is in the real dataset, so it moves the blue point closer to the blues and the orange point closer to the other oranges.
So closeness in the actual data is reflected as closeness in the dimensionality-reduced data. Keep that intuition in mind and you'll be able to figure out when you need t-SNE: whenever you need to preserve closeness no matter how much you reduce the number of dimensions, t-SNE is useful. Here is a visual representation of t-SNE applied to the MNIST dataset. The MNIST dataset contains 60,000 images of handwritten digits: 28 pixel by 28 pixel images of the digits zero to nine. Each pixel simply represents a color intensity, in this case how gray that particular pixel is, so each pixel is a number, which means each image is represented by 784 numbers (28 times 28). We can take those 784 dimensions, use t-SNE to reduce the data to two dimensions, and then plot it, and this is what we get: t-SNE is able to very neatly separate all the images of the number zero from all the images of the number one, the number two, three, four, five, and so on. I encourage you to check out this video, and there's also a tutorial linked below on how to actually create this graph; it will take you maybe an hour to download the dataset, look at some samples, and create it. What we're getting at here is that t-SNE is very powerful for visualizing data, so whenever you have high-dimensional data, use t-SNE to visualize it. Now, t-SNE does not work very well if you have a lot of dimensions; even 784 is not ideal to reduce directly to two. So what typically happens is that you first take the 784 dimensions, perform PCA to reduce them to about 50 dimensions, and then take those 50 dimensions and reduce them to two using t-SNE for visualization. The two dimensions you get from t-SNE are not that useful for doing machine learning, or even for data analysis; they are useful for visualization, because they show you which points are close together in the original data. That's why you'll see t-SNE used as a visualization technique alongside a lot of different machine learning algorithms. And how do you perform t-SNE? Exactly the same way as PCA: you import the TSNE class and set the number of components, the number of dimensions that you want, and here we want to take this four-dimensional data and transform it to two dimensions. One difference is that with t-SNE there is no separate fit and transform step; both are combined into fit_transform, because the closeness of the points is what matters. In some sense, you don't use t-SNE on new data; you only use it on the data you already have. So we call fit_transform here, and that gives us the transformed data: we've gone from four dimensions to two. And when we plot it, you can see that t-SNE has really separated out the points.
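A minimal sketch of that t-SNE call, again assuming the Iris data sits in iris_df with a numeric_cols list (names carried over from the earlier PCA example):

```python
# A minimal sketch of the t-SNE usage described above; iris_df and numeric_cols
# are assumed names, not taken verbatim from the talk's notebook.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

iris_df = sns.load_dataset('iris')
numeric_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

tsne = TSNE(n_components=2)                               # target: 2 dimensions
transformed = tsne.fit_transform(iris_df[numeric_cols])   # fit and transform in one step

plt.scatter(transformed[:, 0], transformed[:, 1],
            c=iris_df['species'].astype('category').cat.codes)
plt.title('t-SNE projection of the Iris dataset')
plt.show()
```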
You have one class here, and then these two classes here. It may well be that these points were quite near each other in the original dataset, and that's why they are near each other here, and that those other points were far away from each other in the original dataset, and that's why they are far apart here. So the takeaway is: PCA is good when you're doing machine learning, and t-SNE is good when you want to visualize the results. As an exercise, try using t-SNE to visualize the MNIST handwritten digits dataset; I have linked to the dataset here. So with that, we complete our discussion of unsupervised learning, at least two aspects of it: clustering and dimensionality reduction. There are many more ways to perform clustering in scikit-learn, and many more ways to perform dimensionality reduction, all with different use cases, but in general you would mostly start out with K-means for clustering and PCA for dimensionality reduction. In a lot of cases a better clustering algorithm or a better dimensionality reduction algorithm will give you a slight boost, but you should be fine starting with the most basic ones, and do check out some of these resources to learn more. Hello and welcome to this workshop on how to build a machine learning project from scratch. Today we are going to walk through the process of building a machine learning project, and we are going to write some code live. We'll start by downloading a dataset, then processing it, training a machine learning model (in fact, a bunch of different machine learning models), and evaluating those models to find the best one. We will also tune some hyperparameters and do some feature engineering. Now, before we start, if you're looking to begin a new machine learning project, a good place to find datasets is Kaggle, which is an online data science community and competition platform, so I just want to show you this before we get into the code for today. There are a couple of places on Kaggle where you can find good datasets for machine learning projects. The first is competitions. Kaggle has been around for close to 10 years at this point, I believe, and they have hosted hundreds of competitions. You can go to kaggle.com/competitions, go back all the way to competitions from 2010 or 2011, and use those datasets to work on your machine learning projects. For example, the dataset we'll be looking at today is from the New York City Taxi Fare Prediction challenge. This competition was conducted three years ago by Google Cloud, and the objective was to predict a rider's taxi fare given information like the pickup location, drop-off location, date, and the number of passengers. You can learn a little about the competition and look at the data before downloading it. You can also look at a lot of public notebooks shared by other participants; reading others' notebooks is a great way to learn, and in fact a lot of the techniques we're covering today come from public notebooks. And you can look at the discussions if you have any questions about how to go about doing a certain thing.
Now, one of the best parts of Kaggle is that you can actually make submissions to the leaderboard. You can go to "My Submissions" and click "Late Submission", and although you will not rank on the leaderboard, your submission will still be scored and you can see where you land among the entire set of participants. In this competition, for example, over 1,400 teams landed on the leaderboard, and getting anywhere into the top 30 to 40% of a Kaggle competition, even one that has already ended, is a sign that you're probably building really good machine learning models. So that's one place on Kaggle to find datasets for machine learning, and you have at least a hundred options to choose from; if you're building your first or second project, I would just go there. Apart from that, on kaggle.com/datasets you can find hundreds of other datasets. One thing I like to do when searching for datasets on Kaggle, especially for classical machine learning (as we call it to differentiate it from deep learning), is to go to the filters, select file type CSV, and set a minimum file size; a minimum of 50 MB generally gives you a large enough dataset to work with. Apply those filters, then sort by most votes, which leaves about 10,000 datasets to choose from, and finally put in a query or keyword to filter datasets by a specific domain. Here, for example, are all the datasets related to travel. Since these are sorted by most votes, somebody has already done a lot of the exploring for you; you can just look through the first five or ten datasets. Not all of them may be suitable for machine learning, but many are, and you can open a dataset and read its description; many descriptions mention several tasks that tell you how you can apply machine learning. Another thing you can do is go to the Code tab and search for machine learning terms like "random forest", and you'll see that people have used the dataset to build machine learning models. So that's another good place to find datasets. You have hundreds of real-world datasets to choose from, because most of these datasets, and most of the datasets in Kaggle competitions, come from real companies looking to build machine learning models to solve real business problems. So with that context, let's get started. Today we're going to work on this project called New York City Taxi Fare Prediction. This was a Kaggle competition, as I mentioned, a few years ago, and you can learn all about it on the competition page. What you're looking at right now is the notebook hosted on the Jovian platform, a Jupyter notebook hosted on my profile. In this notebook there are some explanations and some space to write code, and we're going to start writing the code here. Of course, this is a read-only view of the notebook, so to run it you click Run and select "Run on Colab". We're going to use Google Colab to run this notebook because this is a fairly large dataset and we may need some of the additional resources that Colab provides.
Now, when you go to this link (and I'm going to post it in the chat right now), you'll be able to click "Run on Colab", and you may be asked to connect your Google Drive so that the notebook can be placed in your Drive and opened in Colab. Once you're able to run the notebook, you should see this view: the Colab platform, colab.research.google.com. It is a cloud-based Jupyter notebook where you can write code, and any code you execute runs on Google's servers in the cloud on some fairly powerful machines. In fact, you can go to Runtime, then "Change runtime type", and from there you can even enable a GPU and a high-RAM machine, which I encourage doing if you're using either of those. Whenever you run a notebook hosted on Jovian on Colab, you'll see an additional cell of code at the top. This is just some code you should always run at the beginning: it connects the Colab notebook to your Jovian notebook, so any time you want to save the current version of your Colab notebook to your Jovian profile, you'll be able to do that, but you need to run that single line of code first. All right, with that out of the way, let's get started. We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like the pickup date and time, pickup location, drop-off location, and number of passengers. This dataset is taken from Kaggle, and we'll see that it contains a large amount of data. Because this is a short workshop and we're doing all of this live, we'll attempt to achieve a respectable score in the competition using just a small fraction of the data. Along the way, we will also look at some practical tips for machine learning, things you can apply to your projects to get better results faster. And I should mention that most of the ideas and techniques covered in this notebook are derived from other public notebooks and blog posts, so this is not entirely original work; nothing ever is. To run this notebook, as I said, just click Run, select "Run on Colab", and connect your Google Drive. You can also find a completed version of this notebook at this link; I'm going to drop it in the chat in case you need to refer to the code later. So here's the first tip I have for you, before we even start writing any code: create an outline for your notebook. Whenever you create a new Jupyter notebook, especially for machine learning, fill out a bunch of sections and try to create an outline for each section before you even start coding. The benefit is that this lets you structure the project, organize your thought process into specific sections, and focus on individual sections one at a time without having to worry about the big picture. You can see here, if you click on the table of contents, that I have already created an outline in the interest of time, with sections and subsections, and inside the subsections there is also some explanation of what each one covers. So here's what the outline of this project looks like. First, we're going to download the dataset. Then we're going to explore and analyze the dataset.
Then we are going to prepare the dataset for training machine learning models. Then we'll train some hard-coded and baseline models before we get to the fancy tree-based or gradient boosting kinds of models, and we'll make predictions and submit predictions from our baseline models to Kaggle, and we'll talk about why that's important. Then we will perform some feature engineering, train and evaluate many different kinds of machine learning models, and tune hyperparameters for the best models. Finally, we will briefly touch on how you can train on a GPU with the entire dataset; we won't be using the entire dataset in this tutorial, but you can repeat it with the entire dataset as well. And at the end we're going to talk a little about how to document and publish the project online. So let's dig into it, and if you have questions at any point, please post them in the Q&A; we will stop periodically to take questions if possible. All right. As I said, for each section it's always a good idea to write down the steps before we actually write the code, and Jupyter is great for this because it has markdown cells where you can write things down and modify them as required. So here are the steps: first we install the required libraries, then we download some data from Kaggle, then we look at the dataset files, then we load the training set with pandas, and then we load the test set with pandas. Let's get started. I'm going to install the Jovian library; that's already installed, but I'm going to put it in here anyway. We're going to use a library called opendatasets for downloading the dataset, plus pandas, numpy, scikit-learn, and XGBoost, and I believe that should be it. So I'm just going to install all of these libraries, and I've added --quiet to suppress the output of the installation. Now, whenever you're working on a notebook, it's important to save your work from time to time, and the way to do that is to import the Jovian library by running import jovian and then running jovian.commit. When you run jovian.commit, you will be asked to provide an API key; I'm going to go to my Jovian profile, copy the API key, and paste it here. What this does is take a snapshot of your notebook at this current moment and publish it to your Jovian profile, as you can see here: "NYC taxi fare prediction blank", created just now, version one. Every time you run jovian.commit in your notebook, a new version of the notebook gets recorded. The benefit of having a notebook on Jovian is that you can share it with anybody; I can take this link and post it in the chat. Of course, you can also make your notebooks private or secret if you'd like, and you can add topics to your notebook so that other people can find it easily. Okay, so coming back: to download the dataset we're going to use the opendatasets library, which can connect to Kaggle using your Kaggle credentials and download the dataset from this link for you. Here's how it works. I first import opendatasets as od, and now I can run od.download, giving it a URL. So let me first put the competition URL into a variable, and then I just provide dataset_url to od.download.
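Put together, the setup cells described above might look something like this; the install list and the competition URL follow what was said in the talk, so treat the exact URL as an assumption:

```python
# Sketch of the setup steps described above (library list and URL as mentioned
# in the talk; treat the exact competition URL as an assumption).
!pip install jovian opendatasets pandas numpy scikit-learn xgboost --quiet

import jovian
import opendatasets as od

dataset_url = 'https://www.kaggle.com/c/new-york-city-taxi-fare-prediction'
od.download(dataset_url)   # prompts for your Kaggle username and key (from kaggle.json)

jovian.commit()            # saves a snapshot of this notebook to your Jovian profile
```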
Now, when I run this, opendatasets is going to try to connect to Kaggle using my Kaggle credentials, and to do that it needs those credentials. The way to provide them to opendatasets is to go to kaggle.com, click on your avatar, go to your account, scroll down to the API section, and click "Create New API Token". Clicking that downloads a file called kaggle.json to your computer. You then take this kaggle.json file, come back to Colab, go to the Files tab, and upload it. Unfortunately, you'll have to do this every time you run the notebook, so I suggest downloading kaggle.json once and keeping it handy, like I have here on my desktop, so you can upload it whenever you need it. This kaggle.json file downloaded from my Kaggle account contains my username and a secret key, so you should never paste the secret key into a Jupyter notebook; otherwise somebody else will be able to use your Kaggle account. But within your own notebook, when you run od.download, it reads the credentials from the kaggle.json file and downloads the dataset for you. You can see here that this dataset is pretty large, about 1.56 gigabytes, and of course it's a zip file, so after expanding it becomes even larger. It downloads to a folder named after the competition, new-york-city-taxi-fare-prediction, so I'm just going to put that folder name into a variable so we have it handy when we want to look at the files. On the Files tab here you can see the folder, and inside it there are five files. Now, there was a question: should you try to follow along right now? I would say right now you should probably just watch; you will have a recording of the session, and you can try to follow along with a different dataset later, but it's totally up to you. All right. So the data has been downloaded, and we have this data directory variable pointing to where the data lives. Let's look at the size, the number of lines, and the first few lines of each file. First I'm going to use the ls -lh command. This is a shell command, not Python; any time a line begins with an exclamation mark, it is passed directly to the system terminal. I'm going to run ls -lh on that folder, and since the folder name is stored in a Python variable, you can pass the value of a Python variable into the shell command using curly brackets: when we put the variable inside braces, Jupyter replaces the whole expression with its value. So ls -lh on the data directory shows us a total of 5.4 gigabytes of data, and almost all of it is the training set at 5.4 gigabytes; that's a pretty large training set. The test set is just 960 kilobytes, and then there's a sample submission file (as I mentioned, you can submit predictions on the test set to Kaggle) and some instructions, which we can ignore. So those are the sizes of the files; next let's look at the number of lines in some of the important files.
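For reference, that size check might look like this in a notebook cell; the exact folder name assigned to data_dir is an assumption based on the competition name:

```python
# data_dir holds the folder created by od.download (folder name assumed here).
data_dir = 'new-york-city-taxi-fare-prediction'

# Shell command: file sizes in human-readable form. The {data_dir} braces
# substitute the value of the Python variable into the command.
!ls -lh {data_dir}
```

The line-count and preview checks that come next use wc -l and head through the same exclamation-mark mechanism.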
The way to get the number of lines is the wc -l shell command, and once again I'm going to point it at train.csv under the data directory. There you go: new-york-city-taxi-fare-prediction/train.csv contains 55,423,856 rows. That's a lot of rows. The test set contains 9,914 rows, which is much, much smaller than the training set. Let's also look at the sample submission file: it contains 9,915 rows, just one more than the test set, which could simply be an empty line, so I wouldn't worry too much about it. Let's also look at the first few lines of each file using the head shell command. Here we have the first 10 lines of train.csv; remember it has 55 million lines, and these are just the first 10. This is a CSV file, so the first row contains the names of the columns and the following rows contain the data. It seems every row has a unique identifier called the key, and then there is the fare amount for that ride; for example, here the fare amount is 4.5. Then you have the pickup datetime, the date and time of the pickup; then the pickup longitude and pickup latitude, the geo-coordinates of the pickup; then the drop-off longitude and drop-off latitude, around -73 and 40; and finally the passenger count, the number of people who took the ride. So that's the training data, simple enough. The test data looks similar: we have the key (every row has a unique key), then a pickup datetime, pickup longitude, pickup latitude, drop-off longitude, drop-off latitude, and passenger count. Great. Now, one thing that's missing from the test data is the fare amount, and this is what's typically called the target column in this machine learning problem, because remember, the project is taxi fare prediction: we need to build a model using the training data and then use that model to predict the fare amount for the test data. That's why the test data doesn't have it. Once you make predictions for the test data, you put those predictions into a submission file, and here's what a sample submission file looks like. There is a key (you'll notice these keys correspond exactly, row by row, to the test dataset) and then the prediction of your model. This sample submission file just contains 11.35, the same answer for every test row, but this column is supposed to hold the prediction generated by your model for the test set. You need to create such a file and then download it (I'm going to download it right here onto my desktop), come to the competition page, click "Late Submission", upload the file containing the key for each row in the test set and your prediction, and make a submission once it's uploaded. And once you make a submission, it gets scored.
So the score for this submission is about 9.4. What does the score mean? You can check the Overview tab and go into the Evaluation section to understand it. The score is the root mean squared error, which is simply a way of measuring how far your predictions are from the actual values. You aren't given the actual fare amounts for the test data, but Kaggle has them, and when you submit the submission.csv file, the predictions are compared to the hidden actual values. The differences are calculated, those differences are squared and added together, you take the average of the squared differences, and then you take the square root of that average: that's the root mean squared error. On average, it tells you how far your predictions are from the actual values. So, for example, our recent submission had a root mean squared error of about 9.4, which means our predictions are off by roughly $9.4 on average. We'll see whether that's good or bad in a moment, but we definitely want to do better than that. One thing you can do is check your submission against the leaderboard to see where you land. It seems people have gotten to a pretty good point where they can predict the taxi fare to within $2.8, and we're at $9.74, which is pretty high if you ask me, because most taxi rides cost around $10 to $15; if you're off by nine dollars, your prediction is practically useless. But that makes sense, because right now we've just submitted the sample file with a fixed prediction. So that's how Kaggle works. One tip I have for you here: write down these observations. Any time you have an insight about the dataset, document it so that it's there for you later if you need to come back to it, and again, Jupyter is a great way to do that. So that's what we did: we downloaded the data from Kaggle using opendatasets, looked at the dataset files, and noted down the observations. The training data is about 5.5 GB and has 55 million rows; the test set is much smaller, under 10,000 rows; and the training set has eight columns: key, fare amount, pickup datetime, pickup latitude and longitude, drop-off latitude and longitude, and passenger count. The test set has all the columns except the target column, the fare amount, and the submission file should contain the key and the fare amount for each test sample. Now I'm going to save my notebook at this point; I'm going to save regularly so I don't lose any work, and you should do that too. Next up, we are going to load the training set (you can check the table of contents if you're ever lost) and then load the test set. Here's one tip when you're working with large datasets: always start with a small sample to experiment and iterate faster. Loading the entire dataset into pandas is going to be really slow, and not just that, any operation you do afterwards is going to be slow too. So, just to set up my notebook properly, I'm going to first work with a sample and then maybe come back and work with the entire dataset. We're going to work with a 1% sample.
That is, we're going to ignore 99% of the training data, but that still gives us around 550,000 rows (1% of 55 million), and I think that should still allow us to create a pretty good model to make predictions for the 10,000 rows of data in the test set. So we're going to use a 1% sample, and we're also going to ignore the key column, because we don't really need the unique identifier from the training set, and just loading it into memory can slow things down. We're going to parse pickup_datetime while loading the data, so that pandas knows it's a datetime column; pandas has a special way of dealing with those, and informing it up front makes things faster. And we're going to specify data types for the columns so pandas doesn't have to figure them out by looking at all the rows, which again speeds things up significantly. So with that, let's set up the data loading. First, let's import pandas as pd. We're going to use the pd.read_csv function, and we need to give it the file name, so I'm going to provide the path to train.csv under the data directory. Then there are some other parameters we can provide. We want to pick a certain set of columns, so I'm going to provide usecols. We're also going to provide data types, so I'll use dtype. And finally we want to pick a sample, and there are two ways to do that. We can just pick the first 1%, the first 500,000 rows, using nrows: if you provide nrows=500,000, that picks up the first 500,000 rows for you. Or there's another way, using skiprows, where you can provide a function which is called with each row index and tells pandas whether or not to skip that row. I'll show you both. Let's start with usecols: I'll create a variable called selected_cols and put in all the columns except the key. I'm just going to take the header, put it into a string, and split it at the commas, which has the nice effect of giving us a list of column names. So we're going to use selected_cols. Then dtype: let's set up the dtypes, using float32 for these columns; of course, not all of them should be float32, we want uint8 for passenger_count. So that's the value of dtype. Now we could provide nrows=500_000 (you can write numbers like this to make them easier to read) and we'd get the first 500,000 rows, but I want a random 1% sample. For that I'm going to use skiprows and pass in a function called skip_row, which receives the row index or row number. Here's what it does. Of course we want to keep the first row, the header, so if the row index is zero, we do not skip the row and return False. Otherwise, here's a quick trick we can apply. Let's say I want my sample fraction to be 1%, which is just 0.01. Here's what I'm going to do.
I'm going to first import the random module, which can be used to generate random numbers between zero and one. Now, if I write random.random() < sample_frac, then because the random numbers are picked uniformly, there is exactly a 1% chance that random.random() is less than the sample fraction. We should keep the row only when that expression is True, which means we should skip the row when random.random() is greater than 0.01, and that happens with 99% probability. So that's what our skip_row function does: for roughly 1% of the rows it returns False, which means keep the row, and for 99% of the rows it returns True, which means skip the row. One last thing I'm going to do is call random.seed with a particular value: I'm going to initialize Python's random number generator with 42 so that I get the same set of rows every time I run this notebook. I encourage you to learn more about random number seeds. (Oh, sorry, this column isn't being treated as a datetime; of course, we also need to provide the list of datetime columns separately via parse_dates. Let me run that again. Yep, there we go.) So always fix the seeds for your random number generators, that's the third tip, so that you get the same results every time you run the notebook; otherwise you'll pick up a different 1% each time and you won't be able to iterate effectively. There was a question: can you explain the significance of using shell commands instead of Python for checking the dataset? The simple reason is that these files are so large that loading them into Python can itself slow things down a lot. Normally I would recommend using the os module from Python, but in this case I recommended shell commands because these files are so large and shell commands are really good at working with large files. There's another question: how do we know this is a regression problem? Here we're trying to predict the fare amount, and the fare amount is a continuous number: it can be $2.5, $3.2, $5.7. That is what's called a regression problem. A classification problem is one where you're trying to classify every row into a particular category, for example classifying an insurance application as low risk, medium risk, or high risk. Okay, this is taking a while; in fact, it has been running for a minute and a half, and that's while we're working with just one percent of the data, so you can imagine that with a hundred percent it's going to take a lot longer. It took about one minute 36 seconds to complete. Here's an exercise for you: try loading 3%, 10%, 30%, 100% of the data and see how that goes. All right, let's load up the test set as well.
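Before that, here is roughly what the training-set loading cell we just assembled looks like; the column names and dtypes follow what was described above, so treat the exact strings as assumptions about the CSV header:

```python
# A sketch of the sampled loading described above. Column names and dtypes
# are assumptions based on the header shown earlier.
import random
import pandas as pd

random.seed(42)          # fix the seed so the same 1% sample is picked on every run
sample_frac = 0.01

selected_cols = ('fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,'
                 'dropoff_longitude,dropoff_latitude,passenger_count').split(',')

dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

def skip_row(row_idx):
    if row_idx == 0:                         # always keep the header row
        return False
    return random.random() > sample_frac     # skip ~99% of the data rows

df = pd.read_csv(data_dir + '/train.csv',
                 usecols=selected_cols,
                 dtype=dtypes,
                 parse_dates=['pickup_datetime'],
                 skiprows=skip_row)
```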
So I'm just going to load the test set with pd.read_csv on test.csv under the data directory, providing just the dtype; that's it, I don't think we need anything else because the test set is pretty small. Let's look at the training data frame as well, maybe print it out here; it's just called df at the moment. There you go: we have the fare amount, pickup datetime, latitudes, longitudes, and all the expected values. The test_df has the key; we're going to keep the key for the test data frame because we'll use it for making submissions, and then it has the pickup datetime, longitudes, latitudes, drop-off coordinates, and passenger count. Looks great, and I can just commit again. All right, so we're done with the first step, downloading the dataset. That took a while, but we are now well set, so let's explore the dataset a little bit. We're just going to do some quick and dirty exploration, not a lot of graphs, and I'll talk about why. The quickest way to get some information about a data frame is df.info(). It tells us these are the seven columns, the number of entries, and the total space it takes in memory. Memory is an important thing to watch as you go to a hundred percent of the dataset: you can imagine it's going to take about a hundred times more, roughly 1.5 GB of RAM, and that's why we're using Colab. It also shows the data types, and it seems there are no null or missing values, so that's great. Another thing you can do is df.describe(), which gives you some statistics for each numerical column. It turns out the minimum fare amount is -$52 and the maximum is $499. The mean, or average, value is about $11, and the 50th percentile is $8.5, so we already know that 50% of rides cost less than about $8.5, and in fact 75% of rides cost less than $12.5. That gives us a sense of how good our model needs to be: if 75% of taxi fares are under $12.5, I would want my predictions to be within plus or minus $3 or so; otherwise I'm off by a lot. That's what we'll try to aim for. You can also look at the pickup latitude and longitude, the drop-off coordinates, and the passenger counts. Now, there seem to be some issues in this dataset, as is the case with all real-world datasets. The minimum pickup longitude is around -1183, which is just not valid at all; there are no such longitudes, and no such latitudes either. So we may have to do some cleaning; this could just be wrong data. There also seems to be a maximum passenger count of 208, which again seems quite unlikely; you can see that 75% of the values are 2 or fewer. So this is something we may have to fix later; we'll take a look at that. One thing that's missing here is the datetime, so let me just grab pickup_datetime and look at its minimum and maximum values. You can see that our dates start on the 1st of January 2009 and end on the 30th of June 2015, so it's about six and a half years' worth of data. And once again, all these observations are noted here: roughly 550K rows as expected, no missing data, and the fare amount and passenger count ranges.
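For reference, a sketch of the loading and quick-exploration cells just described (the talk also passes the dtype mapping when reading the test set; it's omitted in this sketch since that file is small):

```python
# Load the small test set and run the quick checks described above.
test_df = pd.read_csv(data_dir + '/test.csv', parse_dates=['pickup_datetime'])

df.info()        # columns, dtypes, non-null counts, memory usage
df.describe()    # min/max/mean/percentiles for each numeric column

# Date range covered by the training sample
print(df.pickup_datetime.min(), df.pickup_datetime.max())
```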
There seem to be some errors, and we may need to deal with outliers and data entry errors. Let's look at the test data here. Nothing surprising: 9,914 rows of data across these seven columns, no fare amount, and here are the ranges of values, which seem a lot more reasonable. The pickup longitudes are between about -74.2 at the lowest and -72.98 at the highest, so that's good, and the passenger count is between one and six. Now here's one thing we can do: if our model is going to be evaluated on the test set, which is supposed to represent real-world data, then we can limit the inputs in our training set to these ranges. Anything outside the range of the test set can be removed from the training set, and because we have so much data (55 million rows, or even 1% of that, which is still a lot), we can later just drop the rows that fall outside the test ranges. Keep that in mind. And finally, let's check the pickup datetime minimum and maximum too. You can see that the test set values also range from the 1st of January 2009 to the 30th of June 2015, which is interesting because it's the same range as the training set. That's an important point, which we'll use while creating the validation set. All right, let's commit this. That was quick enough, and we already have a lot of insight. Now, what you should do at this point, or maybe later once you've trained a few models, is create some graphs: histograms, line charts, bar charts, scatter plots, box plots, geo maps (you have location data here) or other kinds of maps, to study the distribution of values in each column and the relationship of each input column to the target. This is a useful thing to do not just right now, but also once you've created new features during feature engineering. Another thing you should try to do is ask and answer some questions about the dataset. What was the busiest day of the week? What is the busiest time of day? In which month are the fares highest? Which pickup locations have the highest fares? Which drop-off locations have the highest fares? What is the average ride distance? And keep going: the more questions you can ask about your dataset, the deeper an understanding you will develop of the data, and that will give you ideas for feature engineering and make your machine learning models a lot better. Having an understanding of the data is very important for building good machine learning models. If you're looking to learn exploratory data analysis and visualization, you can check out a couple of resources: we have a video on how to build an exploratory data analysis project from scratch, and we also have a full six-week course on data analysis with Python at zerotopandas.com. Now, one tip I would like to share here is to take an iterative approach to building machine learning models: first do some exploratory data analysis, a little like we've done without even plotting any charts, then do some feature engineering and try to create some interesting features, then train a model, and then repeat to improve your model.
Instead of doing all your EDA for maybe a week, then doing a lot of feature engineering for a month, and then training your model and discovering that most of what you did was useless, just use an iterative approach: try to train a model every day or every other day. So I'm going to skip ahead right now, and maybe I'll do some EDA after we're done with this tutorial. All right, so that was step two; we've made good progress. We've downloaded the data and looked at it, so let's prepare the dataset for training. The first thing we'll do is split training and validation sets. Then we will deal with missing values (there are no missing values here, but in case there were, this is how we'd deal with them), and then we will extract the inputs and outputs for training as well. We will set aside 20% of the training data as the validation set, so out of the 550,000 rows, 20% will be set aside, and this validation set will be used to evaluate the models we train on the training data. So the models are trained on the training data, which is the 80%, and then the evaluation is done by calculating the root mean squared error on the validation set, the 20% for which we know the targets, unlike the test set for which we don't. What the validation set does is let us estimate how the model is going to perform on the test set, and consequently in the real world. So here's the next tip: your validation set should be as similar to the test set, or to real-world data, as possible. And the way you know that is by comparing scores: you can get predictions from your model, and since you have the actual targets for the validation set, you can compute the root mean squared error there, and then compare it to the test score. The validation set is close enough to the test set when the evaluation metric of the model on the validation and test sets is very close. If the root mean squared error on the validation set is, say, $2, but when you submit to Kaggle the root mean squared error is $9, then your validation set is completely useless, and you're basically shooting in the dark, because you're trying to train different models to do better on the validation set, but the validation score has no relationship to the test score. So make sure that your validation and test scores are similar or very close, and that an improvement on the validation set is reflected as an improvement on the test set; otherwise you may need to reconsider how your validation set is created. Now, one thing here is that because the test set and training set cover the same date range (both run from January 2009 to June 2015), we can pick a random 20% fraction of the training set as the validation set. Suppose instead the test set was in the future: say the training set was data from 2009 to 2014 and the test set was data for 2015. Then, to make the validation set similar to the test set, we should probably have picked the data for 2014 as the validation set and the data for 2013 and before as the training set. So keep those things in mind; it's very important to create validation sets carefully. For creating the validation set, I'm going to import train_test_split from sklearn.model_selection.
And this is something that you can look up; you don't have to remember it. I'm just going to do train_df, val_df = train_test_split(...), splitting the original data frame and setting the test size, which in this case is really the validation size, to 0.2. Now I can check the length of train_df and the length of val_df to make sure we have the right sizes: about 441,000 rows in the training set and a randomly chosen 110,000 rows in the validation set. I'm also going to set random_state=42 just so that I get the same validation set every time I run the notebook. This is important because your scores may change slightly if you're creating a different validation set each time, and if you're comparing models across different validation sets, that can also lead to data leakage and other issues. So to fix the random split that's picked, I'm going to set the random state to 42. Okay, that's one piece. The other thing we need to do is fill or remove missing values. Now, we've seen that there are no missing values in the training data or the test data, but we've only looked at one percent of the data, so it's possible there are missing values elsewhere. Here's one simple thing you can do: train_df = train_df.dropna() and val_df = val_df.dropna(). What does this do? It drops all the rows where any of the columns has an empty or missing value from the training and validation sets. You shouldn't always do this, but because we have so much data, and at least so far I've not seen a large number of missing values, I'm estimating that the number of missing values in the entire dataset is less than one or two percent, so it should be okay to drop them. I'm not going to run this right now, but you know what it does. Next, before we train our model, we need to separate out the inputs and the outputs, because they have to be passed separately into machine learning models. So I'm going to create something called input_cols here, and maybe let's first look at train_df.columns so we can copy-paste a bit. Now, the input columns are these, but we can't really pass a datetime column by itself into a machine learning model, because it's a timestamp, not a number, so we'll have to split the datetime column into multiple columns later. For now I'm just going to use the latitudes, longitudes, and the passenger count, and for the target column I'm just going to use the fare amount. There you go, so now we have the input and target columns. Now we can create train_inputs: from the training data frame we just pick the input columns (this is how you pull out a certain set of columns from a data frame), and train_targets, which is train_df[target_col]. We can view the train inputs and train targets here: you can see we no longer have the fare amount column in the inputs, but we still have all the rows, and the targets are just the single fare amount column. Let's do the same for the validation set.
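A sketch of the preparation steps described in this section; the column names follow the ones seen in the CSV header earlier:

```python
# Split, clean, and separate inputs/targets as described above.
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Drop rows with missing values (reasonable here, since there appear to be very few)
train_df = train_df.dropna()
val_df = val_df.dropna()

input_cols = ['pickup_longitude', 'pickup_latitude',
              'dropoff_longitude', 'dropoff_latitude', 'passenger_count']
target_col = 'fare_amount'

train_inputs, train_targets = train_df[input_cols], train_df[target_col]
val_inputs, val_targets = val_df[input_cols], val_df[target_col]
```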
So val_inputs is val_df[input_cols] and val_targets is val_df[target_col], and let's look at them: about 110,000 rows, which by the way is already a lot larger than the test set, so that should be good, and here are the validation targets. Finally, the test data frame: remember, it doesn't really have a target column, but we still want to pull out just the input columns, so test_inputs is test_df[input_cols]. And there are no targets, no fare amount, in the test data frame; that's what we have to predict. So there it is. Not bad; we're making good progress. In under an hour we have downloaded the dataset, explored it a little bit, and prepared it for training, and now we are going to train some hard-coded and baseline models. Here's the next tip: always, always create a simple hard-coded model, basically a single value, a very simple rule, or some sort of baseline model, something you can train very quickly, to establish the minimum score that any proper machine learning model should beat. I can't tell you how many times I've seen people train models for hours or days and end up with results that are worse than what a simple average could have done. That can happen for a couple of reasons: one, you've actually not trained the model properly, you've created a really bad model; or two, you made a mistake somewhere in the feature engineering, in preparing the data, or in making predictions. So a baseline serves as a good way to test whether what you're doing is correct, and it gives you a number to beat. Let's create a simple model. I'm going to create a class called MeanRegressor with two methods, and I'll make it look very similar to a scikit-learn model. The fit method takes some inputs and some targets and is used to "train" our simple model, and the predict method takes a bunch of inputs and produces some predictions. Here's what I'm going to do: in fit, I'm going to completely ignore the inputs and simply store self.mean = targets.mean(), which just calculates the average value of the targets. In predict, I'm going to return that stored mean once for every input row. To do that I'll use np.full (let me import numpy): np.full takes a shape and a value, so if I have 10 inputs and I always want to return the value three, np.full(10, 3) returns ten threes. The number of rows comes from inputs.shape[0]: if you have a pandas data frame or a numpy array, like train_inputs, .shape tells you the number of rows and columns, and .shape[0] is the number of rows. So predict just returns self.mean repeated once for every row of the inputs that were passed in.
So yes, some object-oriented programming and some fancy numpy, but ultimately here's what it does. Let's first create a mean regressor model: mean_model = MeanRegressor(). Now let's call mean_model.fit, giving it the train inputs and train targets, to "train" this so-called model that always predicts the average. Once we call fit, it completely ignores the inputs, takes the targets, and simply calculates their average, a single value, which it stores in the .mean attribute. The average is 11.35; that's the average taxi fare. Then when we want predictions, say for the training set, we can call mean_model.predict(train_inputs), and it simply predicts 11.35 for every row of the training set. Similarly, mean_model.predict(val_inputs) predicts 11.35 for every row of the validation set. Now we may want to compare these predictions with the targets. How far off is this model? Of course it's going to be way off, because we're just predicting the average. Here are the train predictions and here are the training targets (you can ignore the numbers on the left, those are just the row indices from the data frame): the actual fares vary, and we're always predicting 11.35. To tell how badly we're doing, we need to compare the two and compute some evaluation metric, and that's where we use the root mean squared error, because that's the metric used on the leaderboard. So I'm going to import from sklearn.metrics, and I'm going to define a function called rmse just to make my life easier. It takes some targets and some predictions and returns the mean squared error between them, and to get the root mean squared error we set squared=False in mean_squared_error. All that done, I can now compute the root mean squared error. We have the training targets, the fare amounts for the training rows, and the train predictions; let's call rmse on them, call the result train_rmse, and print it out. This is the root mean squared error for the training set, which means that on average the predictions of our model, which are always 11.35, are off from the value the model should be predicting by about 9. That's pretty bad, because the values we're trying to predict have an average of about 11 and a 75th percentile of about 12. If you're trying to predict values roughly in the range of 10 to 20 and you're off by 9, that's a pretty bad model, and that's expected, because it's just a dumb model. But here's the thing: any model we train should hopefully be better than this, should have a lower RMSE. Let's get the validation RMSE as well; the helper is sketched below.
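Here's roughly what that helper looks like in code, reusing the mean_model from above. Note that squared=False is what makes mean_squared_error return the root of the error in the scikit-learn versions used here; newer releases expose a separate root_mean_squared_error function instead.

```python
from sklearn.metrics import mean_squared_error

def rmse(targets, preds):
    # squared=False returns the *root* mean squared error.
    return mean_squared_error(targets, preds, squared=False)

train_preds = mean_model.predict(train_inputs)
train_rmse = rmse(train_targets, train_preds)
print(train_rmse)
```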
Passing in the val targets and val preds: our hard-coded model is off by 9.899 on average, which is pretty bad considering that the average fare is 11.35. Okay, great. So that was our hard-coded dumb model. Next, let's train a very quick linear regression model to see whether machine learning is even useful at this point. From sklearn.linear_model I'm going to import LinearRegression and create a linear model. That's pretty much it; there is no random state to set here. A linear regression model is just this in scikit-learn, and here's how you fit it. By the way, we're assuming here that you're already familiar with machine learning; if you're not, I highly recommend checking out zerotogbms.com, a practical, coding-focused introduction to machine learning with Python where we cover all of the models we're looking at today. So we do linear_model.fit(train_inputs, train_targets), and once it's fit we can make predictions: train_preds = linear_model.predict(train_inputs). You can look at the predictions and compare them with the targets. The predictions are all still close to 11, but at least they're different, it's not the same prediction every time; they're still way off though. Let's get the RMSE on the train targets and train preds: it's 9.788. That's not much better; 9.789 was our average model, and our linear regression is hardly better and still fairly useless. Let's get validation predictions and the RMSE on the val targets and val preds: the root mean squared error is 9.898, which is just about 0.001 better than our average model. At this point you might want to think about why that is. I would say it's mainly because the training data, which is just geographic coordinates at this point, latitudes and longitudes, is not in a format that's very useful for the model. How is a model going to figure out that the pickup latitude and longitude and the dropoff latitude and longitude are connected, and that there's some distance between them? All those relationships are very hard for models to learn by themselves, and that's where feature engineering comes into the picture. We're also not using one of the most important columns, the pickup date and time, because fares are very seasonal: by month, by day of the month, by hour of day, by day of the week, and so on. So our data in its current format is not very useful from a machine learning perspective, and we were able to establish that using the hard-coded and baseline models. However, we now have a baseline that all our other models should ideally beat. Now, before we train any further models, we're going to make some predictions and submit them to Kaggle. Here's the next step: whenever you're working on Kaggle competitions, submit early and submit often.
Ideally you want to make your first submission on day one and a new submission every day, because the best way to improve your models is to try and beat your previous score. If you're not making submissions, you won't find out whether you're heading in the right direction, whether you have a good validation set, or whether there's anything else you should be doing. On the other hand, if you're making submissions every day, you'll have to try to beat your previous submission, and that forces you to move in the right direction. So how do you make predictions and submit to Kaggle? First, you make predictions for the test set. We have the test inputs right in front of us, so all we need to do is pass them into, say, the linear model with .predict (it's already trained on the training set) and get some predictions. Of course we don't have any targets, so the way to evaluate these predictions is by creating a submission file. We first read the sample submission file into a data frame. Now all we do is take the test predictions and replace the fare_amount column with them, because the rows in the submission file correspond one-to-one with the rows in the test file: the first row of the submission file points to the first row of the test file, and so on. So I do sub_df['fare_amount'] = test_preds, and sub_df now has the predictions; you can see they're all different values. Then you can save it to CSV: sub_df.to_csv with a file name like linear_model_submission.csv. One thing you need to do when saving submission files, especially for Kaggle, is to specify index=None, otherwise pandas will also write the 0, 1, 2, 3 row index as an extra column in your file, which you don't want. Now you have the file linear_model_submission.csv, which you can download and submit. Our previous submission gave us an RMSE of 9.409; let's see what this one gives. Click on Late Submission, call this one "simple linear model", upload it, and the submission scores 9.407. Not very different. One thing we can now verify is that our test set metric is close to the validation set metric: the validation RMSE was 9.89 and the test score is 9.4. That's not too different; they're in the same range. It's not as if the validation RMSE was 2 and the test RMSE was 10. Of course, the validation set is a lot larger, about 110,000 rows, whereas the test set is just about 10,000 rows, and it's always harder to make predictions on a larger unseen set than a smaller one, so that can have an effect, but they're close enough for us to work with. Okay.
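Put together, that step looks roughly like this (the sample submission file name is an assumption; use whatever file the competition provides):

```python
import pandas as pd

# Predictions for the test set from the trained linear model.
test_preds = linear_model.predict(test_inputs)

# Load the sample submission and overwrite fare_amount; its rows line up
# one-to-one with the rows of the test set.
sub_df = pd.read_csv('sample_submission.csv')
sub_df['fare_amount'] = test_preds

# index=None stops pandas from writing the row index as an extra column.
sub_df.to_csv('linear_model_submission.csv', index=None)
```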
So that's how you make predictions for the test set. The next step is to create reusable functions for common tasks. Remember, we said you should be making submissions every day, and if you're doing that, you shouldn't be copy-pasting all this code around each time, because that takes up mental energy, you make mistakes, and you have to keep changing values. So it's good to create functions. We create a function predict_and_submit that takes a model, the test inputs, and a file name. It calls model.predict on the test inputs, which gives us test predictions; then it calls pd.read_csv to read the sample submission into a data frame, puts the predictions into the fare_amount column, saves it to the given file name, and returns the data frame. Now we can do the same thing as before with predict_and_submit: give it the linear model, the test inputs, and the file name linear_sub_2.csv, and it does exactly the same thing, but now it's just one line any time you want to generate predictions and create a submission file. Here's linear_sub_2.csv; the notebook shows a nice pretty view of it, but it's ultimately just a CSV file containing the key and the fare amount. So that's "make predictions and submit to Kaggle" done, and we now have a function we can use any time; there's a sketch of it below. Next, one thing you'll want to do is track your ideas and experiments systematically, to avoid becoming overwhelmed by dozens of models, because you'll be working on a machine learning project for at least a couple of weeks, and probably a couple of months or longer, so you need to keep track of all the ideas you're trying. Here's a tracking sheet we've set up for you; you can go to File, Make a copy, and create your own copy. In it you can put ideas, like the kinds of models you want to try or different sample sizes (say, try a 10% sample). Keep a list of all the ideas you have as you have them (you don't have to try them right away, and you probably can't), what you expect the outcome to be, and, once you try an idea, what you learned from it. That's idea tracking. Then you have experiments: each time you train a model, give it a title and a date, note down the type of model and whatever hyperparameters you care about, the training loss, the validation loss, the test score (this could be from the Kaggle leaderboard), and a link to the notebook. Every time you save the notebook with jovian.commit, you get a link.
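Here's a sketch of that helper, wrapping the same steps as above (the sample submission file name is still an assumption):

```python
def predict_and_submit(model, test_inputs, fname):
    # Predict on the test set, fill in the sample submission, and save it.
    test_preds = model.predict(test_inputs)
    sub_df = pd.read_csv('sample_submission.csv')
    sub_df['fare_amount'] = test_preds
    sub_df.to_csv(fname, index=None)
    return sub_df

sub_df = predict_and_submit(linear_model, test_inputs, 'linear_sub_2.csv')
```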
Coming back to the tracking sheet: you can note down that this is version 10 of the notebook, and you can refer back to it. Over time you'll have dozens of versions, and for each version you'll know exactly what the parameters were. Once you have maybe 30 or 40 models, you can look at the sheet and get a very clear idea of which models are working and which are not. Let's see if we have any questions before we move ahead. I think we've already answered why we're using shell commands. "Can we directly fit the model to the remaining 99% of the data after training on the 1% sample?" I'm not sure I fully understand the question, but what we've done right now is this: there are 55 million rows available for training, and I've taken 1% of that, about 500,000 rows, so that we can train things fast. What you'd want to do later is, instead of using just 1% of the data, use maybe 10% or 20%, or at the very end all 100% of the data, train a model on the entire dataset, and then make predictions with that trained model on the test set. That should definitely be better than training on just 1% of the data; I hope that answers it. Next: "For regression problems we created a model that gives the mean as an output and then tried linear regression. What should our approach be for classification problems?" Good question. For classification problems you can predict the most common class, or predict a random class, and use that as your baseline. "Any specific reason we're using float32 and uint8?" Yes. I looked up how many digits of precision float32 supports, and it's roughly eight, which is good enough for longitudes and latitudes. If you just specify float, pandas may pick float64, which takes twice the memory, and that can be a problem for large datasets. Similarly with uint8: I looked at the passenger count values and they're in a small range, so if you just specify int you'd get int64, which uses 64 bits, but we can get away with one eighth of that, just 8 bits, because the numbers we're dealing with are small. These are just techniques to reduce the memory footprint of the dataset. There was also a question about prerequisites for this workshop: zerotopandas.com is the prerequisite, but you can also watch this now, start working on a machine learning project, and learn machine learning along the way. Okay, let's move on to feature engineering; we're about halfway through, so hopefully we'll be able to train a few models. Feature engineering means taking the columns of data you have and performing operations on them to create new columns, which might help train better models, because machine learning models are fairly dumb: they have a certain structure.
They assume a certain relationship between inputs and outputs. Linear regression, for example, assumes that the output is a weighted sum of the inputs, and that may not hold true for the data in its current form, latitudes and longitudes. But suppose we could somehow calculate the distance between the pickup and dropoff points; then there would definitely be some sort of linear relationship between the distance to be covered and the fare. By creating good features, you train much better models, because you've applied human insight to provide features that are conducive to solving the problem within the structure the model assumes. The tip here is to take an iterative approach to feature engineering. Don't go overboard; don't spend weeks creating features. Add one or two features, train a new model, evaluate it, keep the features if they help, otherwise drop them, then repeat with new features. Here are the features we're going to create, and in the interest of time I'm not taking a strictly iterative approach, I'm just going to create a bunch of features right away. First, extract parts of the date. We've totally ignored the date so far because we didn't know how to put it into a linear regression model, but we can extract things like year, month, day, weekday, and hour. These are all useful: over years I'd assume the fare increases; across months there must be some seasonal trend; across days of the month as well, because people have things to do at the start or end of the month, like going to the bank; across weekdays there should be a difference between weekdays and weekends; and across hours of the day too. So we'll extract all these parts out of the date. We'll also deal with outliers and invalid data, which is a form of feature engineering in the sense that we're cleaning up the data a little. We'll add the distance between the pickup and drop locations (we'll see how to compute distance from latitudes and longitudes), and we'll add the distance from some popular landmarks, because a lot of people take taxis to get to places where they can't normally drive or park, like airports, and there are also tolls involved that are included in the fare, so it's useful to capture that. We'll apply all of these together, but you should observe the effect of adding each feature individually. So let's extract some parts of the date. This is really easy, and once again I'm going to follow my own earlier advice and create a function for it. We have a function add_dateparts that takes a data frame and a column name, and creates a new column called column_name + '_year', extracting the year from that datetime column. To give you a sense of how this works: train_df is a data frame, and if I set col to 'pickup_datetime', then train_df[col] is all the pickup datetimes, and calling .dt.year on it gives me just the year for every row of data: 2011, 2012, 2015, and so on. (The full helper is sketched below.)
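Wrapped up as the helper we'll use in a moment, it looks roughly like this (a sketch):

```python
def add_dateparts(df, col):
    # Break a datetime column into numeric parts a model can actually use.
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour
```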
And I can save that back to the data frame, say as train_df['pickup_datetime_year'], and another way to build the column name is just col + '_year'. That's what the function does, not just for year, but also for month, day, weekday, and hour. So I'm just going to call add_dateparts on train_df, giving it the column name (let's ignore the warning for now), and do the same for the validation set and the test set. Hmm, it looks like there's an issue with the test data frame, so I'm just going to load it again. I see what happened; this is one of the things with doing things live, there are always some issues. Let's go back to creating the test set (this is why it's useful to have a table of contents): we also need to specify parse_dates=['pickup_datetime'] so that the column is parsed as a date. Now let's come back to extracting parts of the date and add them to the training, validation, and test data frames. Let's check train_df: you should now see that it has not just fare_amount and pickup_datetime, but also pickup_datetime_year, month, day, weekday, and hour, and you can verify the same for val_df and test_df. There you go. With that, we've added the different parts of the date; we've already done some basic feature engineering. Dates are always low-hanging fruit for feature engineering; you can add more, things like start of quarter, end of quarter, start of year, end of year, weekend versus weekday, and so on. The next thing we're going to add is the distance between the pickup and the drop location, and to do that we're going to use the haversine distance. There are many formulas for this; the way I found it was just to search online for something like "distance between two geographical coordinates" or "latitude longitude distance". The haversine formula involves an arcsin, a squared sine, a cosine, and so on. Then I looked up a fast haversine approximation for pandas, and somebody very helpfully had written an entire function, which I've borrowed directly here as the haversine distance function. It takes the longitude and latitude of the pickup location and the longitude and latitude of the drop location, and calculates the approximate distance in kilometers between the two points. It uses great-circle geometry, which accounts for the spherical shape of the earth and how latitudes and longitudes are defined; we don't have to get into the details, but there are resources linked if you want to. The interesting thing is that it works not just with a single latitude and longitude, but with an entire series or list of latitudes and longitudes.
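Here's what such a vectorized haversine helper looks like; treat the name haversine_np and the exact form as assumptions, based on the commonly shared numpy implementation:

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    # Approximate great-circle distance in km between two points on Earth.
    # Works element-wise on numpy arrays / pandas Series, so it can compute
    # the distance for every row of a data frame in one vectorized call.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))  # 6371 km = Earth's mean radius
```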
So if you pass in entire columns, basically a bunch of rows, it performs the calculation for each row, and it does it in a vectorized fashion because it uses numpy, so it's very efficient. That means we can use it directly to add the trip distance to our data frame: from the data frame we take the pickup longitude and pickup latitude, then the dropoff longitude and dropoff latitude, pass them into the haversine function, and it computes the distance for each individual row, which we store as a trip_distance column. So let's add trip_distance to train_df, to val_df, and to test_df. Now if we look at train_df, there's a trip_distance column with distances in kilometers: this one looks like a fairly long trip, this one is shorter, about 1.3 kilometers, and there are trips like 7.1 kilometers. You can probably already tell that the fare for the 7.1 km trip is going to be maybe four or five times the fare for the 1.3 km trip; and indeed the fare here is 18 and the fare there is 3.7, about five times as much. So that's already a very useful feature. Next, and this is something I learned by looking at discussions and notebooks on the competition page, we're going to do a slightly more creative bit of feature engineering: we'll add the distance from popular landmarks, specifically checking whether a trip ends near one of them. Airports matter in particular because they involve tolls, so we'll add JFK airport, LaGuardia airport, and Newark airport, plus the locations of Times Square, the Met Museum, and the World Trade Center. There are many more you could add: the Statue of Liberty, Central Park, plenty of other locations in New York. This is something you have to look up. We'll add the distance from the drop location; feel free to also add the distance from the pickup location, that's left as an exercise for you, and you can check whether it helps. Here's the point: creative feature engineering generally involves human insight, because no machine learning model, at least not the simple models we're training, can figure out on its own that a certain location is important, but we can figure that out quite easily given the context we have. It can also involve external data: here we've looked up the latitude and longitude values for JFK, LGA, and so on, and that data will never automatically become available to the model. Creative feature engineering is often far more effective for training good models than excessive hyperparameter tuning, running lots of grid searches and training for hours or days on tens of gigabytes and hundreds of columns. Keep in mind that just one or two good features can improve the model's performance drastically.
So focus on finding what those one or two good features are; they can improve the model's performance far more than any amount of hyperparameter tuning. The way you find them is by doing exploratory data analysis, understanding the problem statement better, reading the competition discussions, and talking it over with people. Also keep in mind that adding too many features just slows down training and makes it harder to figure out what's actually useful, which is why iterating is so important. So here are the longitudes and latitudes for JFK, LGA, Newark (EWR), the Met Museum, and the World Trade Center. Once again we can use the haversine distance function: we write a helper that takes a data frame, a landmark name, and the landmark's (longitude, latitude) pair. From the pair we get the landmark's longitude and latitude, and we create a column called landmark_name + '_drop_distance', passing the landmark's longitude and latitude as one point and the dropoff longitude and latitude as the other into the haversine function. The nice thing about that function is that the arguments don't all have to be series: a couple of them can be plain numbers and a couple can be series, and it still works fine. So we add the drop distance from each landmark in this fashion, and since we need to do it for a bunch of landmarks, we've also created an add_landmarks function that does it for each of them (the argument should probably just be called df). This is why creating functions is useful: now we can just call add_landmarks on train_df, val_df, and test_df. If we check the training data frame, it's looking nice: we still have the pickup latitude and longitude and so on, but here's where it gets interesting, we now have the trip distance, the JFK drop distance, the LGA drop distance, the EWR drop distance, the Met drop distance (this row seems to end near the Met), and the WTC drop distance (this one seems to end right at the WTC). So now we have a lot more interesting columns; there's a sketch of these helpers below. I think that's enough new features for now, but let's also remove some outliers and invalid data. If we look at test_df.describe(), the test set seems fine; it has fairly reasonable values. The pickup and dropoff longitudes are in the -75 to -72 range, the pickup and dropoff latitudes are roughly between 40 and 42, and passenger counts are between 1 and 6. Since we're only going to make predictions on these ranges of data, it makes sense to eliminate any training data that falls outside them. And the training data does seem to have some incorrect values: there are longitudes that are clearly invalid, and 208 people cannot fit in a cab.
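Before we deal with those outliers, here's a sketch of the landmark-distance helpers just described. The coordinates are approximate values you'd look up, so treat both the names and the numbers as placeholders:

```python
# Approximate (longitude, latitude) for each landmark -- looked-up values,
# rough coordinates rather than exact ones.
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0134, 40.7127

def add_landmark_dropoff_distance(df, landmark_name, landmark_lonlat):
    lon, lat = landmark_lonlat
    # Distance from the landmark to every drop-off point (vectorized).
    df[landmark_name + '_drop_distance'] = haversine_np(
        lon, lat, df['dropoff_longitude'], df['dropoff_latitude'])

def add_landmarks(df):
    for name, lonlat in [('jfk', jfk_lonlat), ('lga', lga_lonlat),
                         ('ewr', ewr_lonlat), ('met', met_lonlat),
                         ('wtc', wtc_lonlat)]:
        add_landmark_dropoff_distance(df, name, lonlat)
```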
Coming back to the invalid data: even this fare looks suspiciously high, though I wouldn't bet too much on that, it could possibly be legitimate, but there are definitely some issues in a few rows. So here's what we're going to do. We'll limit the fare amount to between $1 and $500, which is almost already the case, except that there are some negative fares and we don't want our model dealing with those; it just makes things harder. We may sacrifice predictions for one or two rows of the test set, but the overall gain in accuracy will be worth it. Then we'll limit the longitudes to the -75 to -72 range, the latitudes to 40 to 42, and the passenger count to 1 to 6. So we have a remove_outliers function that takes a data frame and picks out only the rows that match all of these conditions: the fare amount is at least 1 (and this is how you combine conditions in pandas when filtering rows) and at most 500, the pickup longitude is greater than -75 and less than -72, same for dropoff, the pickup latitude is greater than 40 and less than 42, same for dropoff, and the passenger count is between 1 and 6. There's a sketch of it below. You don't always have to remove outliers. If your model has to deal with outliers in the real world, you should keep them; but if it doesn't, or if you're going to train a separate model for the outliers, then it makes sense to remove the ranges of values that don't appear in the test data. So let's remove outliers from the training data frame and from the validation data frame; there are no outliers in the test data frame, so I won't worry about that, but I do want to save my notebook again. All right, so we've done a whole bunch of feature engineering: we've extracted parts of the date, added the distance between pickup and drop locations, added distances from popular landmarks, and removed outliers and invalid data. Next, here are a couple of exercises for you. Try scaling the numeric columns into the 0-to-1 range; right now the numeric columns all have different ranges (month, day, and so on), and scaling generally helps with linear models, or any model where the loss is computed from the actual values of the data. You can also try encoding categorical columns: things like month, year, even day of week can be treated as categorical, and you could use a one-hot encoder, which can make these columns easier for models to work with. We won't do this here for a couple of reasons: first, we want to keep things simple right now and get to a good model, and we can come back to it later if there's time; second, tree-based models generally do a decent job even without scaling numeric columns or encoding categorical columns, assuming the trees can go deep enough or you train enough trees, which we will. But that's an exercise for you, and if you don't know what these terms mean, check out the Zero to GBMs course.
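Here's that remove_outliers sketch. Note that it returns a new data frame, so the result has to be assigned back, something that trips us up a bit later in the session:

```python
def remove_outliers(df):
    # Keep only rows whose values fall inside the ranges seen in the test set.
    return df[(df['fare_amount'] >= 1.) &
              (df['fare_amount'] <= 500.) &
              (df['pickup_longitude'] >= -75) &
              (df['pickup_longitude'] <= -72) &
              (df['dropoff_longitude'] >= -75) &
              (df['dropoff_longitude'] <= -72) &
              (df['pickup_latitude'] >= 40) &
              (df['pickup_latitude'] <= 42) &
              (df['dropoff_latitude'] >= 40) &
              (df['dropoff_latitude'] <= 42) &
              (df['passenger_count'] >= 1) &
              (df['passenger_count'] <= 6)]

train_df = remove_outliers(train_df)
val_df = remove_outliers(val_df)
```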
Now, another tip: we've spent close to two hours preparing this data for training and still haven't trained our first proper machine learning model. What you can do at this point is save these intermediate outputs and download the files, or put them in Google Drive, and then load them back later in your next notebook. That way you save all the time of downloading the data, preparing the dataset, and running through all this code; you can start from pre-prepared data. You may want to do this for the entire dataset once: run the processing on all 55 million rows, save the processed files, and then you never have to repeat it. A good format for this is the Apache Parquet format, so you can just do train_df.to_parquet('train.parquet'). Parquet is really fast to load and to write, and it has a small footprint on disk, so it's a good intermediate format when you know you'll be loading it back with pandas. CSV, unfortunately, is a very heavy format because everything has to be converted to strings. Similarly we can write val_df to parquet. You can see these files here, and you can download them or push them to your Google Drive; all of those things are possible. A related tip: you may also want to create separate notebooks for EDA, feature engineering, and model training, so that your EDA notebook is just where you experiment with different graphs, your feature engineering notebook creates the new features and outputs these parquet files, and your model training notebooks simply use the outputs of the feature engineering notebooks. There's also a way to connect Google Colab to Google Drive so you can organize all of this well. Okay, let's now train and evaluate some models. We'll train three kinds of models, although there are many more you could try: a form of linear regression called Ridge, random forests, and gradient boosting models. Let me change the heading here to Ridge; you can also try things like Lasso, ElasticNet, and so on. But before we train the models, we once again have to create inputs and targets, since we've added a bunch of new columns. So let's look at train_df.columns. For the input columns we'll skip fare_amount and pickup_datetime; I'm still going to keep the pickup and dropoff latitudes and longitudes, because decision trees might still be able to use them. So those are all our inputs, and the target column is fare_amount. Now we create train_inputs as train_df[input_cols] and train_targets as train_df[target_col], then val_inputs as val_df[input_cols] and val_targets as val_df[target_col], and finally test_inputs, which is just test_df[input_cols]. Perfect.
Then, before we train the models, I'm going to create a helper function to evaluate them. It could take the model, the train inputs, and the validation inputs, but since those rarely change, we can just use the globals; the model is what changes most of the time. So we have a function evaluate that takes a model (assumed to be already trained). It makes predictions on the train inputs, giving us train predictions, and computes the root mean squared error between the training targets and the training predictions. It does the same on the validation set: predictions, then the RMSE between the validation targets and the validation predictions. It returns the RMSE for the training and validation sets along with the predictions for both. So evaluating a model is now a single line of code. Let's start with Ridge regression: from sklearn.linear_model import Ridge. If you want to learn more about Ridge regression, you can check out the documentation or the Zero to GBMs course. Let's create a model, model1 = Ridge(...). There's some randomization involved, so I'll set random_state=42 to get the same result each time. Ridge uses L2 regularization in combination with linear regression in its objective function, and you can control its strength with the alpha parameter, so I'd encourage you to play around with it; let's set alpha=0.9. Ridge also has a bunch of other options you can set, like the solver, so try them out. That's our model, and we train it by calling model.fit with the train inputs and, of course, the train targets. The model is now going to try to figure out a set of weights to apply to the input columns so that a weighted combination of them predicts the target value, the fare amount. In other words, the fare amount is being expressed as something like w1 times the distance, plus w2 times the pickup latitude, plus w3 times the number of passengers, and so on: a weighted sum that tries to predict the fare. When we call fit, Ridge figures out a good set of weights. Now we can evaluate our Ridge model by calling evaluate(model1). Hmm, I'm not sure what the issue is here; let's go step by step and figure it out, it doesn't look like a big deal. We have train_preds = model1.predict(train_inputs), yes, and similarly the validation predictions; let me just call rmse on these and see. This works too. Okay, something's wrong with the validation predictions. What did we break? Oh, I see what happened; the fixed-up version is sketched below.
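Once that issue is sorted out, the evaluation helper and the Ridge setup look roughly like this (a sketch, reusing the rmse helper and the global inputs and targets defined earlier):

```python
from sklearn.linear_model import Ridge

def evaluate(model):
    # Uses the global train/val inputs and targets defined above.
    train_preds = model.predict(train_inputs)
    train_rmse = rmse(train_targets, train_preds)
    val_preds = model.predict(val_inputs)
    val_rmse = rmse(val_targets, val_preds)
    return train_rmse, val_rmse, train_preds, val_preds

# Ridge is linear regression with L2 regularization; alpha controls its strength.
model1 = Ridge(alpha=0.9, random_state=42)
model1.fit(train_inputs, train_targets)
evaluate(model1)
```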
That's a good excuse for a quick note on debugging: whenever you have a function you need to debug, a good approach is to break it down line by line, identify which line is causing the issue, and then trace back to where the offending variable was created. Debugging always works that way. So let's evaluate the model: on evaluation it gives us about 8.0 as the training RMSE and 8.2 as the validation RMSE. That's somewhat better than our previous model; I was probably getting a slightly different number earlier, maybe without this alpha, so let me try that again. Yes, it's 8.0 and 8.2, somewhat better than our baseline model, not great, but better. Let's check that our train inputs still have the same shape; they do. Now remember our predict_and_submit function: we can reuse it, give it model1, the test inputs, and a file name. Those are our predictions for the Ridge model; let's download the Ridge submission, go back to Kaggle, upload it, and call it "ridge". It scores 7.72, not bad, not bad at all, better than 9.7, so we're getting there. At this point I'd also go and add it to my experiment sheet: note down that Ridge gave 7.72 on the test set and roughly 8.2 on validation. Let's try a random forest now; let me save the notebook first. I'm going to import RandomForestRegressor from sklearn.ensemble, and model2 = RandomForestRegressor with random_state=42 and n_jobs=-1 (n_jobs is just the number of parallel workers). That reminds me, though: I think I made a mistake while removing the outliers. remove_outliers returns a new data frame, so what we really need to do is train_df = remove_outliers(train_df) and val_df = remove_outliers(val_df). Earlier we didn't actually remove the outliers; we created new data frames but never assigned them back. That gives us a chance to see what effect removing the outliers actually has. So I'm training my Ridge regression model once again, and lo and behold, the model's error drops from around 7.2 to 5.2. Just by limiting the training data to the ranges of values found in the test data, we're able to train a much better model. Once again we can test this: I'll download the new Ridge submission CSV. This is a great example of how much data cleanup and feature engineering can change things: going from 7.2 to 5.2 is roughly a 30% reduction in the error. Let's upload it via Late Submission, and it comes in at 5.15. Let's look at the leaderboard before we move on to the random forest: where does 5.15 put us out of the roughly 1,478 submissions?
Let's load the leaderboard quickly and search for 5.15. That puts us at position 1167, so we've already moved up almost 300 places from our original submission, which was around 9.4 and sat way down near the 1400s. We jumped roughly 300 places just by doing some good feature engineering. I say "just", but figuring out that these features would be useful has probably taken people several weeks, or a month or more. If you can think of more creative features, even with a simple Ridge regression or a plain linear regression you'll be able to move up even higher, maybe into the top thousand. And let's not forget we're still only working with 1% of the data; this just keeps getting better. So let's try a random forest. In a random forest we train a bunch of decision trees, each tree makes a prediction, and we average the predictions across all the trees. I'm setting random_state=42 so we always get the same results, and n_jobs=-1 so the trees can be trained in parallel. Then there's max_depth: by default the decision trees in a random forest are unbounded, but for a very large dataset you may not want unbounded trees, because they can take a very long time to train and can overfit the training data very badly. So I'll set a max_depth of, say, 10. As for how many trees it trains, that's n_estimators; the default is 10, so let me train a hundred trees. Let's also time the training with %%time and call model2.fit(train_inputs, train_targets). While it trains, let's take some questions. First question: would this support parquet, which is a pre-compressed file? I'm not entirely sure what's being asked, but yes, if you have a compressed parquet file you can load it back with pandas; you may have to specify the type of compression. Second question: shall we use PyCaret? I don't know whether PyCaret fits in here, but parquet works just fine for our purposes. Next: if we limit the range of the training data to match the test data, aren't we reducing the generalization of the model? For instance, we won't be able to use the same model on a future dataset that doesn't have the same ranges as this test set. Absolutely, and that's why having a good test set matters: your test set should be as close as possible to what your model will face in the real world, because otherwise the test set is useless. If we're going to use the model in the real world on data outside the ranges present in the test set, then our scores on the test set are not very indicative of real-world accuracy.
So even if you're getting 90% accuracy on the test set, in the real world the model might be just 30% accurate, and this happens all the time; probably 80% of machine learning models face this issue. What I'd suggest instead is to build the test set so that it captures the entire range of values the model can encounter in the real world, even if that means coming up with some estimates. I know that's not always possible, but it's a great question, so thanks. How do we know how many landmarks to add? You don't; you try a few, train some models, try a few more, do some exploratory analysis, maybe draw a geographical heat map and see where the traffic is, and so on. Okay, our random forest has trained; it took 9 minutes 57 seconds. Let's see what it does: calling evaluate(model2), it gets down to a validation RMSE of 4.16, which is not bad at all, with a training RMSE of 3.59. Let's make a submission: predict_and_submit(model2, test_inputs, 'rf_submission.csv'). So now we've generated a random forest submission; let's download it. We're just two hours in at this point and making good progress. Of course, putting this outline together took more time, but even for a couple of days of work this is not a bad result, and remember we're only using 1% of the data. At this point I might even paste the random forest's parameters into the submission comments, so that when I look back at my submissions I can see exactly what each model contains. It seems like the upload might take a minute or two, so why don't we set up an XGBoost model in the meantime? The next model is a gradient boosting model. Gradient boosting is similar to a random forest, except that each new tree it trains tries to correct the errors of the previous trees; that technique is called boosting, and it's what sometimes makes it a lot more powerful than a random forest. We'll use the XGBoost library: from xgboost import XGBRegressor, and model3 = XGBRegressor(...). Let's give it a max_depth of 5, the default learning rate looks fine, and n_estimators of 100 looks fine, though let's change it to 200 to give it a little longer to train. We also want to set the objective to squared error, since we're dealing with root mean squared error; you can look up the documentation to understand what each option means. Searching for the XGBoost RMSE objective, the one to use is reg:squarederror, so that's the loss function. Let's also set random_state and n_jobs for some parallelism. Okay, let's train the model, then evaluate it, and then predict_and_submit(model3, test_inputs, 'xgb_submission.csv'). Let's give that a minute to train.
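While those train, here's roughly what the two model setups look like with the parameters discussed above (a sketch, not the final tuned configuration):

```python
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# Random forest: 100 depth-bounded trees, trained in parallel.
model2 = RandomForestRegressor(n_estimators=100, max_depth=10,
                               random_state=42, n_jobs=-1)
model2.fit(train_inputs, train_targets)

# Gradient boosting: each new tree tries to correct the previous trees' errors.
model3 = XGBRegressor(n_estimators=200, max_depth=5, learning_rate=0.1,
                      objective='reg:squarederror', random_state=42, n_jobs=-1)
model3.fit(train_inputs, train_targets)

evaluate(model2), evaluate(model3)
```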
In the meantime, let's check the random forest submission. It looks like it scored 3.35. Where does that put us on the leaderboard? Scrolling down to 3.35, we're at roughly 560 out of 1478, which is in the top 40%. Still only 1% of the data, still a model that took just a few minutes to train, and we're in the top 40% already. That's not bad at all; a top-40% model is actually a very good model, because most of the top submissions were trained for a very long time and also use a lot of ensembling techniques and so on. And here's our XGBoost model: it took about 34 seconds, a very quick model (I'm sure we could bump up the number of estimators quite a bit), and it gets to a validation RMSE of 3.98. Is that better than the random forest? Yes, it seems to be. So in just about 35 seconds we were able to train the best model so far. Let's go down to xgb_submission.csv, download it, come back, click Late Submission, upload it, and I'll paste the XGBRegressor parameters into the description as well. It seems we've made a submission, and that brings us to 3.20. Quick look at the leaderboard again: 3.20 puts us at about 440 out of 1478, so we've hit the top 30% mark already, pretty good. While that loads, let me tell you what's coming next. So far we've just evaluated a bunch of different models, and I'd encourage you to try a few more; I'll commit the notebook here as well. The next thing we should be doing is tuning some hyperparameters. Now, tuning hyperparameters is unfortunately more of an art than a science. There are automated tuners available, but they typically try a whole bunch of things, like grid search, and take a long time; ultimately you have to train a lot of models and build some intuition about what works and what doesn't. But I'll give you a couple of tools you can use. Here's a strategy for picking the order in which you tune hyperparameters: tune the most important and impactful hyperparameter first. For the XGBoost model, the number of estimators, the number of trees you train, is the most important hyperparameter. For this you also need to understand how the models work, at least intuitively if not the full mathematics behind them; once you understand them intuitively, each of these models can be described in a paragraph or two. So for XGBoost, one of the most important parameters is n_estimators, and you'd tune that first; we'll talk about how. Then, with the best value of the first hyperparameter fixed, you tune the next most impactful hyperparameter, which in this case I believe is max_depth, and so on. So: tune the most impactful hyperparameter, use its best value, and move on. And what do we mean by best?
Well, use the value that gives you the lowest loss on the validation set, while still training in a reasonable amount of time; it's a time-versus-accuracy trade-off. Wherever you feel you're getting the best result on the validation set, use that value. Then, keeping the best value of the first hyperparameter fixed in all future models, tune the next most impactful hyperparameter: say the best number of estimators is 500; keeping that fixed, tune max_depth, then keeping max_depth fixed, tune the next most impactful parameter, and go down through five, six, seven hyperparameters. Then go back to the top and tune each parameter once more for further marginal gains. That's the order you go through the parameters: get the best value and move forward. As I said, it's more art than science, so try to get a feel for how the parameters interact, based on your understanding of each parameter and on the experiments you run. In terms of how to tune a hyperparameter, there's an image that captures the idea really well, the overfitting curve. The idea is that hyperparameters let you control the complexity, or capacity, of the model: for certain hyperparameters, increasing them increases the model's capacity. For example, if you increase the max depth of the trees, or the number of estimators, you increase how much the model can learn. Say you try n_estimators values of 5, 10, 20, 100, 500, 2,000, 5,000, 10,000. When you have very few estimators, a very small or limited model, both the training error and the validation (or test) error are high, because the model has very low capacity relative to the amount of data; it simply doesn't have enough parameters to learn the patterns. As you increase the model's capacity, by increasing the number of estimators or the max depth, the model can learn more, so the training error decreases and the validation error decreases, up to a point. Then, at a certain point, the validation error starts to increase again. That's what's called overfitting: the model gets to a point where, instead of learning the general relationship between the inputs and outputs, it starts to memorize specific examples or specific patterns in the training data in order to push the training loss even lower. And as you make the model more and more complex, by increasing the number of parameters it has or its max depth, it can eventually memorize every single training input, which is exactly what decision trees do if you don't bound their depth.
When your model gets to that point, it's a very bad model, because all it's good for is spitting back memorized answers; give it a new question and it completely fails. It's like memorizing answers for an exam versus understanding the concepts. As you go through the material and do practice questions, your understanding gets better, but if you get to the point of blindly memorizing all the answers, your understanding may actually get worse, because you won't know how to solve general problems. The model isn't generalizing well enough, and that sweet spot before overfitting is what we want to find. To help with this, I've created a couple of functions. One is called test_params: it takes a model class and a set of parameters, trains the model with those parameters, and returns the training and validation RMSE. The other is test_param_and_plot: you give it a model class, a parameter name, a set of values you want to test for that parameter, and the other parameters to hold constant while varying it; it trains a model for each value and plots the resulting curve. Don't worry too much about the function code right now (a rough sketch of what these helpers look like is included just below). Here's what we're going to do: tune the n_estimators hyperparameter. Right now it's 200, and I want to figure out whether we should be increasing or decreasing it. So I'll call test_param_and_plot, and time it as well. The arguments are the type of model we want to train, XGBRegressor; the parameter we want to vary, n_estimators; the values to try, 100, 200 (which we've already tried) and 400, just doubling each time; and the other parameters to keep fixed: random_state=42, n_jobs=-1 and objective='reg:squarederror'. These are passed as keyword arguments (kwargs), so each key inside best_params gets passed through to the XGBRegressor. This is going to take a while, so let's fill out some more code in the meantime. After this, I'll add what I think is the best n_estimators value to best_params. Then we'll run some experiments with max_depth: we started with a max_depth of 5, so maybe let's try 3 and 7, or 3 and 6. So next we'll call test_param_and_plot for max_depth as well.
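Here's a rough sketch of what those two helpers look like. The exact code in the notebook may differ slightly; this version assumes the train_inputs, train_targets, val_inputs and val_targets variables defined earlier in the notebook.

```python
# Approximate reconstruction of the helpers described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

def rmse(targets, preds):
    return np.sqrt(mean_squared_error(targets, preds))

def test_params(ModelClass, **params):
    """Train ModelClass(**params) and return (training RMSE, validation RMSE)."""
    model = ModelClass(**params).fit(train_inputs, train_targets)
    return (rmse(train_targets, model.predict(train_inputs)),
            rmse(val_targets, model.predict(val_inputs)))

def test_param_and_plot(ModelClass, param_name, param_values, **other_params):
    """Vary one hyperparameter, keep the others fixed, and plot the overfitting curve."""
    train_errors, val_errors = [], []
    for value in param_values:
        params = dict(other_params, **{param_name: value})
        train_rmse, val_rmse = test_params(ModelClass, **params)
        train_errors.append(train_rmse)
        val_errors.append(val_rmse)
    plt.plot(param_values, train_errors, 'b-o', label='training')
    plt.plot(param_values, val_errors, 'r-o', label='validation')
    plt.xlabel(param_name)
    plt.ylabel('RMSE')
    plt.legend()
    plt.show()
```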
So we want to test max_depth with the values 3, 5 and maybe 7; anything deeper might take a long time. And there are the results for n_estimators: it doesn't seem to be making a big difference at the moment. Maybe we should reduce the learning rate and then try changing the number of estimators again, so let me change the initial learning rate from 0.1 to 0.05 and see if that gives us any benefit. Since n_estimators isn't really helping, let's just go with 100 for now, then try max_depth values of 3, 5 and 7 with the best parameters so far, add the best max_depth to best_params, and then try different learning rates, say 0.05, 0.1 and 0.2. I hope you're getting the idea: first train a basic model with some initial parameters; then, for each hyperparameter, try a bunch of different values, maybe five or seven, some smaller and some larger; look at the curve that gets created and figure out where the best value is. The best value is the point where the validation error is the lowest: not where the training and validation curves are closest, not where the training error is lowest, but where the validation error is lowest. One caveat is that the curve may not always look that clean. Sometimes it flattens out, which means the model is still getting slightly better, but if going from one point to the next takes three or four times as long to train, you're probably better off picking the value where the curve starts to flatten, so that you can run more experiments faster. That's worth thinking about. And here are the max_depth results: a max_depth of 7 actually gives us a much better model, so I'm going to pick 7. Then we can try a bunch of learning rates, pick the best one, and of course keep tuning from there; you'll want to do this with all the different parameters, not just these. Here's a set of parameters that works well: 500 estimators, a slightly bigger max_depth of 8, and a slightly lower learning rate, because as you increase the number of estimators, you generally want to decrease the learning rate.
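Put together, the sequence of experiments above would look something like this in code. The values picked here are just what I believe we settled on; adjust them to your own runs.

```python
from xgboost import XGBRegressor

best_params = {'random_state': 42, 'n_jobs': -1,
               'objective': 'reg:squarederror', 'learning_rate': 0.05}

test_param_and_plot(XGBRegressor, 'n_estimators', [100, 200, 400], **best_params)
best_params['n_estimators'] = 100     # little difference, so keep it small for now

test_param_and_plot(XGBRegressor, 'max_depth', [3, 5, 7], **best_params)
best_params['max_depth'] = 7          # validation RMSE was clearly lowest at 7

test_param_and_plot(XGBRegressor, 'learning_rate', [0.05, 0.1, 0.2], **best_params)
```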
There are also a couple of sampling parameters here: subsample, which means each tree is trained on only a fraction of the rows, say 80%, and colsample_bytree, which means each tree only uses a fraction of the columns, again something like 80%. These are values like 0.8 or 0.7 that you can tune the same way with test_param_and_plot and see where that takes you. So I'm going to train this final model now: xgb_model_final, fit it to the training inputs and targets, evaluate it, and then predict and submit the results to a file like xgb_tuned_submission.csv. I'll let this run and see where it gets us. The last time I trained this, it put us around the 460th position; I'm hoping we can beat that and maybe get into the top 25 or 26 percent, but it should be somewhere around there. And again, what's pretty amazing is that we are still using just 1% of the data: we're throwing away 99% of the training data and never looking at it. The reason we can get away with that is that the test set is really small, just 10,000 rows, and to make predictions on 10,000 rows you don't really need 55 million rows of training data. Yes, adding more data, or using all of it, would definitely make the model better. But if you always worked with all 55 million rows, then what we just did in under two and a half hours would probably take a couple of weeks or longer, because one of the big benefits of working with a sample is that we could fix errors very quickly, try out new ideas very quickly, and go from thought to action very quickly. With 55 million rows, every cell you run takes a couple of hours, and by the time it finishes you're tired and you've forgotten what you had in mind. So speed of iteration is very important, creative feature engineering is very important, and hyperparameter tuning is often just a fairly small final step that gives you that last bit of boost, but is generally not the biggest factor. Okay, this is looking pretty promising; it has gotten to about 3.8. Let's see if that's better than the best model we had, and submit the tuned XGBoost submission. This is why it's so important to plan your machine learning project well, to iterate, to try as many experiments as quickly as you can, and to track them systematically. It can be the difference between a model taking months and still not getting to a good result, versus getting to a really good model, something usable in the real world, in a matter of hours. So we just submitted and that got us to about 3.20. Let's check where that puts us on the leaderboard; I bet it's still under the 30% mark, which is pretty good considering this is a single model, most top models on Kaggle use ensembles, and our model took about a minute to train, not even ten.
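For reference, here's roughly what that final configuration looks like in code. The exact values are approximate, and predict_and_submit is assumed to be a helper defined earlier in the notebook (as is the submission file name).

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

xgb_model_final = XGBRegressor(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,        # lower learning rate to go with more trees
    subsample=0.8,             # each tree trains on a random 80% of the rows
    colsample_bytree=0.8,      # each tree sees a random 80% of the columns
    objective='reg:squarederror',
    random_state=42,
    n_jobs=-1,
)
xgb_model_final.fit(train_inputs, train_targets)

val_preds = xgb_model_final.predict(val_inputs)
print('Validation RMSE:', np.sqrt(mean_squared_error(val_targets, val_preds)))

predict_and_submit(xgb_model_final, 'xgb_tuned_submission.csv')  # assumed helper
```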
Our model trained in just about a minute, and we haven't even fully optimized the hyperparameters yet, so there's a lot more we can do there. Let's see: 3.20 puts us at position 440, which is within the top 30%. I'd encourage you to simply go bigger here; instead of 500 estimators, maybe go for 2,000 and see what that does. So here are some exercises for you: tune hyperparameters for the ridge regression and the random forest models and see what's the best model you can get, and then repeat the whole process with 3%, 10%, 30% and 100% of the data, basically tripling the sample each time. See how much reduction in error 3x the data produces, then 10x, then 100x. You'll find that the reduction is nowhere near proportional, but the training time definitely becomes a lot longer. Finally, a couple of last things. You can save the model weights to Google Drive. I'm not going to do this right now, but I'll point you to the right place. The way to save a trained model is to use a library called joblib: you can do `from joblib import dump` and then dump any Python object into a joblib file. You could dump just the model itself, the XGBoost model, and later load it back and use it like a regular XGBoost object. Or, even better, you can create a dictionary containing the XGBoost model plus anything else you need to make predictions, like a scaler or an imputer, and dump all of that into one joblib file. That's how you save models. Then, for inputs and outputs with Google Drive on Colab, you can mount your Drive: `from google.colab import drive` and then `drive.mount('/content/drive')`. When you do that, your Google Drive shows up under /content/drive in the file browser on the left. It will ask you to take a couple of extra steps, like opening a link and entering an authorization code, similar to adding your Jovian API key, and once your Drive is attached you can copy the joblib file you created into it. This lets you split your work across notebooks: an EDA notebook; a feature engineering notebook that takes the data, adds a bunch of features, and saves those files in Parquet format to Google Drive; a machine learning notebook that picks up those files, trains a bunch of models, and writes the best ones back to Drive; and an inference notebook that loads those models from Drive and makes predictions on new data, or on individual inputs.
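A minimal sketch of that joblib-plus-Drive workflow, assuming the xgb_model_final object from above; the file names are just examples.

```python
from joblib import dump, load

# Bundle the model with anything else needed at prediction time
# (imputer, scaler, encoders, ...), then write it all to a single file.
artifacts = {'model': xgb_model_final}
dump(artifacts, 'taxi_fare_xgb.joblib')

# Later, e.g. in a separate inference notebook:
artifacts = load('taxi_fare_xgb.joblib')
model = artifacts['model']

# On Google Colab, mount your Drive so the file can be copied there and persists:
from google.colab import drive
drive.mount('/content/drive')
# !cp taxi_fare_xgb.joblib /content/drive/MyDrive/
```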
That's something I'd suggest too: if you hit a wall at some point, look at some individual samples from the test set, put them into the model, look at the prediction for that individual input, and check whether the prediction makes sense to you. Just eyeball the predictions and you'll get some more ideas, then do some more feature engineering. That's the iterative process you want to follow, making submissions day after day. Now, one other thing we haven't covered here is how to train on a GPU. You can train on the entire dataset on a GPU to make things much faster. There's a library called Dask, and another called cuDF (a CUDA data frame library), which can take the data from the CSV file and load it directly onto the GPU; remember that on Colab you also get access to a GPU. You can then create training and validation sets and perform feature engineering directly on the GPU, which is a lot faster, and most importantly the training itself can be done with XGBoost on the GPU, which is going to be orders of magnitude faster. The whole process of working with the entire dataset can then be reduced to maybe 10 or 20 minutes of work (a rough sketch of this GPU workflow is included a little further below). Dask, cuDF and cuML all have very similar APIs, functions and arguments to pandas, scikit-learn and XGBoost, but some things are different and have to be done differently; unfortunately it's not a hundred percent compatible API. So I've left you a few resources to check out, and specifically do check out this project by Alan Kong, a member of the Jovian community, who built a model using the Dask library on a hundred percent of the data. His model trains in under 15 minutes, and with that he was able to get to 2.89, which is in the top 6%. Of course, it took him several days to write the code and learn the different tools required, but a single model, trained on the entire dataset in under 15 minutes, placed him in roughly the 94th percentile. His notebook is linked here and it's a good tutorial on how to use Dask, so that's an exercise for you. And finally, here's the last thing I want you to take away from this workshop: always document and publish your projects online. It helps improve your understanding, because when you have to explain what you've done, there are a lot of things you've probably copy-pasted or taken for granted, and putting them into words forces you to think, understand, and fill the gaps in your understanding. It's also a great way to showcase your skills. If you list machine learning under the skills section of your resume without offering any evidence, nobody is going to just believe it, and they don't have time to interview hundreds of people to find out. The best way to offer evidence is to write a blog post, explain your project well, and link it from your resume.
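Coming back to the GPU workflow mentioned above, here's a very rough, untested sketch of what it could look like with cuDF and XGBoost. It assumes a RAPIDS installation on a GPU runtime; the column names (input_cols, 'fare_amount') are placeholders, and depending on your XGBoost version the GPU flag may be tree_method='hist' with device='cuda' instead.

```python
import cudf
from xgboost import XGBRegressor

# Load the CSV straight into GPU memory
gdf = cudf.read_csv('train.csv')

# ... feature engineering using the (mostly pandas-like) cuDF API ...

target_col = 'fare_amount'                                 # placeholder target name
input_cols = [c for c in gdf.columns if c != target_col]   # placeholder feature list

model = XGBRegressor(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    tree_method='gpu_hist',   # older XGBoost; newer versions use tree_method='hist', device='cuda'
)
model.fit(gdf[input_cols], gdf[target_col])
```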
And the last thing is that as people read your blog posts, or as you share them on LinkedIn or Twitter or wherever, that will lead to inbound job opportunities. People will reach out to you, recruiters and employers will reach out: "I saw the project you did, it looks interesting, we have a similar problem at our company, would you be interested in talking?" You won't believe how much easier it becomes to find opportunities if you consistently write blog posts and publish your projects online. So for any project you do, please put it up online, add some explanations using markdown cells, spend another hour or two cleaning up the code and creating functions to show that you're a good programmer, and publish the Jupyter notebook to Jovian. We've made it really simple: you can run jovian.commit, or you can download the notebook (File > Download as > Notebook .ipynb) and upload it on Jovian by clicking New and uploading the notebook. Once you do that, you can share the notebook with anyone, and you can also write blog posts like this one. The benefit of a blog post is that you don't have to show all the code; you can make it much shorter and focus on the bigger narrative or the bigger idea. This one is a great example: it covers the different steps involved and the things Alan tried without showing hundreds of lines of code, so it works well as a summary of the project and a great way to share what you've done. One nice thing is that in a blog post you can embed code cells, outputs and graphs directly from a Jovian notebook. Also check out the tutorial on how to write a data science blog post; we recorded one from scratch a few months ago that will guide you through the process. So that was the machine learning project from scratch; not really from scratch, because we had written out an outline, so let's review that outline once again. We were trying to predict taxi fares for New York City by looking at information like the pickup and drop-off latitudes and longitudes, the number of passengers, and the time of pickup. We downloaded the dataset by first installing the required libraries, downloading the data from Kaggle using opendatasets, and looking at the dataset files. We saw that we had 55 million rows in the training set but just 10,000 rows in the test set, with eight columns. We loaded the training set and the test set, then explored the training set and saw that there were some invalid values, but no missing values, while the test set had fairly reasonable ranges. One thing we could have done, and a good thing to go and do right now, is exploratory data analysis and visualization, because asking and answering questions is a great way to build insight about the dataset and get ideas for feature engineering. Then we prepared the dataset for training by splitting the data into training and validation sets, and we filled or removed the missing values; in this case, we removed them.
There were no missing values in our sample anyway. Of course, one of the things we did while loading the training set was to work with a 1% sample, so that we could get through the entire tutorial in three hours, but that also had the unexpected benefit that we could experiment a lot more quickly instead of waiting tens of minutes for each cell to run. Then we separated out the inputs and outputs, the input columns and the target column, for the training, validation and test sets, because that's how machine learning models are trained. We then trained a hard-coded model that always predicts the average fare, evaluated it against the validation set, and saw that it gives an RMSE of about 11; we also made a submission from that model. We then trained and evaluated a baseline linear regression model, which gave an RMSE of about 11 as well. The learning there was that our features probably aren't good enough and we need to create new ones, because the linear regression model isn't able to learn much beyond what we get by just predicting the average. So before you go off and do a lot of hyperparameter tuning, make sure your model is actually learning something better than the simple or brute-force solution. Then we made some predictions, submitted them to Kaggle, and that established a baseline which we would try to beat with every new model. When it came to feature engineering, the low-hanging fruit was extracting parts of the date: the year, the month, the day, the day of the week, and the hour of the day. We then added the distance between the pickup and drop-off points using the haversine distance, with a function we found online and borrowed (a typical implementation is sketched a little further below). We also added the distance of the drop-off location from popular landmarks like JFK airport, Newark airport, LaGuardia airport and a few other places; you could also add the distances from the pickup location. We removed outliers and invalid data: we noticed that there was a bunch of invalid data in the training set, and that the test set had a certain range of values for latitudes, longitudes and fares, so we restricted the training data to those ranges so that our model focuses on making good predictions for the test set, which should be reflective of how the model is going to be used in the real world. We could also have done scaling and one-hot encoding, which I'm sure would have helped train the models a little better. We then saw how to save the intermediate data frames, and discussed that we can put them onto Google Drive so that we can separate our notebooks for exploratory analysis, feature engineering and training. We then trained and evaluated a bunch of different models: we once again split the inputs and targets, and trained a ridge regression model, a random forest model, and a gradient boosting model, each with some very quick and dirty hyperparameter selection. Even with that, we were able to get to a really good place, somewhere in the top 40%, without much tuning. Then we looked at hyperparameter tuning, where we decided to tune the most impactful parameter first, and then, keeping its value fixed, tune the next most impactful one.
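As an aside on the distance features mentioned in this recap, here's a typical vectorized haversine implementation; the exact function borrowed in the notebook may differ, and the column names in the commented usage are assumptions.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two sets of points."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# Example usage (column names assumed):
# df['trip_distance'] = haversine_np(df.pickup_longitude, df.pickup_latitude,
#                                    df.dropoff_longitude, df.dropoff_latitude)
```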
By tuning, we mean picking the value of the hyperparameter where the validation loss is the lowest: the point where the model hasn't started to overfit, but has still learned as much as it can about the data in general terms. So we tuned the number of trees, the max depth and the learning rate, and ran some experiments. We saw that all of these parameters could probably be pushed further; of course, we're short on time, so we couldn't explore very deep trees, which would take a couple of hours to train, but I encourage you to try those out up to the point where the validation error starts to increase. Finally, we picked a good set of parameters and trained a model that put us in the top 30%, which is pretty amazing considering we're still using just one percent of the data. We also looked at how to save those model weights to Google Drive, and discussed that the model can be trained on a GPU, which is a much better option when you're working with the entire dataset so that you don't have to wait hours for training. It does require some additional work, because you need to install a bunch of libraries and get things working, but there are a few resources listed here you can check out; maybe it could be a topic for another workshop, training classical machine learning models on a GPU. Finally, we talked about the importance of documenting and publishing your work. I cannot overstate this: whatever work you do, please document it and publish it, publish it to Jovian. If you write a blog post, go to blog.jovian.ai and check out the contribute tab; we feature blog posts there and share them not just with the subscribers of the blog but also in our newsletter, which goes out to over a hundred thousand members of the Jovian community. It's a great way to get visibility for your work and get discovered. So finally, let me share some references, and then we'll take a few questions; if you have questions, please stick around. The first reference is the dataset itself, the New York City taxi fare prediction dataset, definitely one of the more challenging datasets you'll find on Kaggle, but with the right approach you can see it's all about strategy and fast iteration, and you can do a lot with just a little bit of data. If you want to learn a bit of shell scripting, I'd recommend checking out Missing Semester from MIT to learn bash and how to work with the terminal. Then there's the opendatasets library, which was developed by Jovian to make it easy to download data from Kaggle; you can use it in all your projects, and all you need to do is specify your Kaggle credentials. For exploratory data analysis, check out the tutorial on building an EDA project from scratch; it's a follow-along tutorial you can apply to any dataset, just as this entire strategy can be applied to pretty much any Kaggle dataset, with only the specifics like feature engineering changing. Do check out the course Machine Learning with Python: Zero to GBMs if you want to learn machine learning from scratch, and do check out the blog post by Alan Kong on this particular dataset; it's really useful.
There's also the experiment tracking sheet we talked about; it's very important for staying organized as you try dozens, or even hundreds, of experiments, so that you don't lose track of your best hyperparameters, models and features. If you want to learn more about datetime components in pandas, you can check that resource out, and there are some more resources about the haversine distance. There's also the RAPIDS project, which builds all these GPU-accelerated alternative libraries, and fortunately we have GPUs on Google Colab. And if you're looking to write a blog post, there's again a follow-along tutorial on how to write a data science blog post from scratch. Here are a few examples of good machine learning projects, all created by graduates of the Zero to Data Science Bootcamp that we run. It's a six-month program where you learn data analysis, machine learning, Python programming and a bunch of analytics tools, build some real-world projects, and then learn how to prepare for interviews and apply for jobs. Here's one you should check out: Walmart store sales, a great project on forecasting Walmart's weekly sales using machine learning; it's very detailed and covers all the aspects we've talked about in this table of contents. Here's another on predicting used car prices; one thing you get to see with machine learning is how generally applicable it is to so many different kinds of problems, and this is a very interesting and well-documented model to check out. Here's one applying machine learning to geology, predicting lithologies using wireline logs. I can't say I understand the entire domain, but you can definitely see the pieces you can pick up: defining the machine learning problem, understanding what the inputs and outputs are, what kind of problem it is, what kind of models you need, and then going through the process of training good models, experimenting, and staying organized. Here's one about ad demand prediction, predicting whether a certain ad is going to be clicked on; one on financial distress prediction, predicting whether somebody will face financial distress within the next year or two; and another machine learning project on credit scoring. I hope you notice the common thread across all of these projects, which is how to apply machine learning to the real world; all of these are built on real-world datasets from Kaggle. With that, I want to thank you for attending this workshop. We're almost at the three-hour mark. We'll take questions now, but for those who want to go, thanks a lot for attending. We're planning to do more workshops every week or every other week, so do subscribe to our YouTube channel. And of course, if you're learning machine learning or data science, go to jovian.ai, sign up, take some of our courses, build some interesting projects, and share the courses with other folks who might find them useful. We also have a community Discord where you can come chat with us, and a community forum as well. And if you're pursuing a career in data science, definitely talk to us about joining the Zero to Data Science Bootcamp; we think it could be a great fit if you're looking to make that career transition. So that's all I have for you today. Thank you for joining, and I'll see you next time.
Have a good day or good night. Okay, let's take the questions now. There's a comment and question from a viewer: "Loved the session, understood everything right from creating a project pipeline, feature engineering, saving files in Parquet format, uploading our submission files with descriptions as model parameters, hyperparameter tuning, et cetera. I just had one question, not regarding the session: how can you find a problem statement that is unique?" Yeah, so I don't think there are many truly unique problem statements in the world right now. Even for the datasets you find online, you'll find that many people have already created machine learning projects from them, but that should not stop you from working with a dataset, because everyone brings their own perspective: everyone does their own analysis, trains their own kinds of models, tries their own ideas. You will almost certainly learn a lot from the process, even if there are a hundred other projects on a dataset. Take New York taxi fares, where about 1,500 people have made submissions: out of those 1,500, many people may have trained models for days, maybe months, and still not made the top 500, but with some smart feature engineering you might be able to get into the top 400 in just a couple of days. So it shouldn't stop you from trying that dataset. The second thing is about finding good problems. I would say you should look for problems that a lot of people are already working on, because that's an indication that it's a good problem to solve. When I came across the New York taxi fare dataset, I saw that it's a large dataset and that over a thousand people had participated in the competition, which probably means it's a very interesting problem. So, in a somewhat counterintuitive sense, the more people have tried a particular problem, the more interesting it is, unless it gets to the point where it becomes an instructive problem taught in courses, like the MNIST dataset or the CIFAR-10 and CIFAR-100 datasets. Those are generally used for teaching, and because of that pretty much everybody ends up building models for them. So you want to pick something that isn't used in some very popular course or tutorial, but at the same time isn't so obscure that you can't understand the problem statement, or even whether it's a machine learning problem at all. Just like with model training, there's a sweet spot somewhere in between where you'll find some really good problems. But the most important thing to look at, independent of whether it's unique or not, is how much you're going to learn from it. Next question: what would be a good reason for the test set to be so small compared to the training set? Well, I believe it was Google Cloud that ran the competition, so maybe they just wanted to see how much additional benefit you can get from 10 or 100 times more data, how much additional juice you can extract out of it. Or it could simply be that Google has so much data. But honestly, I don't know why the test set is so small.
These are all guesses, though. "Can you teach me how to make a customized dataset?" Well, that would be a topic for another day, because there's a lot involved. Every time Kaggle works with a company, I know they spend a lot of time creating the dataset, because on the one hand it should be possible to make predictions from the inputs, there has to be enough signal in the data, but on the other hand you can accidentally introduce something called leakage, where one or two features end up almost completely predicting the outcome. So for classical machine learning, I'd say it's not very easy to come up with your own custom datasets, and of course there's the whole issue of labeling: if you have to sit and label all the data yourself, that makes things much harder. For deep learning, when you're working with image recognition or natural language problems, it's a lot easier to create custom datasets, and again, that's a topic for another day, but we do have a tutorial on building a deep learning project from scratch on our YouTube channel that you can check out. Thank you again for joining. You can find us on jovian.ai, and I'll see you next time. Thanks and goodbye.

Hey guys, welcome to this tutorial on deploying a machine learning model. In this tutorial, we are going to deploy an already-trained model into live production. I'll try to keep things as simple as possible, and I hope you'll be able to follow along as I go through the tutorial. If not, please post your questions in the chat and I'll take them at the very end. With that, let's start. For this tutorial, I'll be using this notebook. Our topic for today is deploying a machine learning model: we're going to create a web application using Flask, a Python web framework, and deploy it to the cloud using Render. You'll also learn how to add HTML templates and how to style them with some very basic CSS. By the end of this tutorial, you'll have a basic understanding of how to create a web application using Flask, integrate a machine learning model into the web application, and deploy the web application to the cloud. The following topics are covered: first, we'll set up the project on GitHub along with a Conda environment; then we'll create a simple web application using Flask that just prints something on a web page; then we'll add HTML templates to the web app; then we'll look briefly at the ML model we're using, without going deep into how it was trained; then we'll run the model locally in a web browser, publish to GitHub and deploy to Render; and finally we'll create an API route for the model. There are a few prerequisites, but they're very basic: some basic knowledge of Python (by basic I just mean things like defining a function), very basic knowledge of HTML tags like body, div, headings and forms, CSS styles like background color and text-align, and some level of understanding of machine learning concepts, just enough that, given a model, you know how to make a prediction with it. I think you'll be able to understand even if you don't know some of these, so try to follow along.
So here's the problem we're going to solve in this tutorial. You are given a pre-trained email spam classifier model along with its requirements, basically everything that goes into the model. Your task is to deploy the model and prepare it for end users, so that anyone can go to the site, paste in an email body, and find out whether it's a spam email or not. To achieve this, we need to design a webpage using HTML and use the model to predict whether the text is spam. The interface should be as user-friendly and simple as possible: in short, we have to design a form that takes only one input, the email body text, and returns whether the email is spam or not. Okay, enough of the theory; let's start by creating the simple Flask app. For that, we'll begin by creating a new repository on GitHub. I'll go to github.com, where I've already signed in (you can sign in or sign up), click New repository, and give it a name, say model-deployment; it can be any unique name. I'll skip the description, keep it public, and add a README file (you can add the README later, but I'm adding it now), a .gitignore template, which is a standard template for a given programming language, so I'll use the Python one, and a license; you can skip this, but I'll use the MIT license because it's a good one for open source projects. So we've created a GitHub repository; now let's start coding. I'll be using Codespaces here, but there are other options for writing the code: you can use Replit, or VS Code locally, but I'll use Codespaces since it's easy to set up. I'll click on Code, go to the Codespaces tab, select Create codespace on main, and let that set up. I'll also have to create a Conda environment. What is Conda? Conda is a package manager, just like pip, and it also lets us create environments. An environment is useful for managing a particular project: you can create different environments for different projects, each with just the packages required for that project. So this is our repository. First, let's check whether Conda is available by running conda --version, and yes, Conda is already installed. Let's create a Conda environment: the command is conda create -n followed by an environment name, which can be something like model-deployment, and I've pinned Python 3.7.10, although you can use a newer version; I'll explain towards the end why I used this particular one. While that's installing, let's also install the Python extension in the Codespace. It asks for permission to install a few things, and I'll just click yes. Okay, just a minute, I think there's an issue: the GitHub screen isn't visible on the stream. Give me one moment. Yep, it should be visible now, so I'll quickly repeat the steps, starting from creating a new GitHub repository. Sorry for all the issues; it was a technical glitch.
So we'll log into GitHub first: github.com, click New repository, and since we already have model-deployment, I'll call this one model-deployment-v2. I'll add a README, select the Python .gitignore template, select the MIT license, and click Create repository. So we're starting again from creating a new GitHub repository. Next we need a code editor; you can use Codespaces, Replit, or VS Code locally (the tutorial notebook mentions the steps for each), and here I'm going to use Codespaces. While it opens, I'll install the Python extension, and I'll also stop and delete the previous codespace, because it would otherwise eat into my usage. Once the codespace is open, I'll install the Python extension and check whether Conda is available with conda --version; yes, Conda is available. So I'll create a new Conda environment: conda create -n followed by an environment name, say deployment (it can be anything), with Python 3.7.10. I'll let it install, answer yes to the prompts, and wait for the packages to finish. Now, if I try to activate the environment right away, it returns an error saying Conda is not initialized, so I'll run conda init bash, since I'm using bash, and then reopen a bash shell. Now you can see the Conda base environment in the prompt, and since we created another environment called deployment, I'll run conda activate deployment to activate it. Right now this environment only has a few packages installed, around 35, so I'll run the package installation command from the tutorial, which installs the bunch of packages required for this tutorial. While that installs, let's move on to the next step: creating and running a Flask web server. First we need an app.py file, so I'll create app.py; all the Flask code, basically all the Python-related code, will live in this file. Now that the libraries are installed, the first thing to do is import Flask: from flask import Flask. The next step is to create an instance of the Flask class: app = Flask(__name__), where __name__ refers to the current module. Now let's create a route. What is a route? Whatever you see on a website corresponds to a route: jovian.com is one route, and jovian.com/notebooks is another. I'll create the route '/', which is the home route, the equivalent of jovian.com for our app. In Flask, a route is defined using a decorator: app.route.
So I'll add that decorator, @app.route('/'), create a function called home, and return 'Hello World'. We're trying to create a Flask application that just shows the text Hello World. Now we have to run the Flask application, so at the bottom of the file I'll add if __name__ == '__main__': and inside it app.run(). Let me zoom in so this is visible. With this in place, we can run the file with python app.py, and you can see it's running; a prompt shows up on the right saying Open in Browser, and if I click that, you'll see Hello World on the page. Next, suppose I want to change this text; I'll make it 'Hello World updated', save it, and refresh, but it doesn't show up, because the server needs to reload whenever we change and save something. For that we can enable debug mode by passing debug=True to app.run. I'll stop the server with Ctrl+C and run python app.py again, and now we see 'Hello World updated', and if I change it again, say to 'updated version two', it shows up as soon as I save and refresh. So that's a simple, basic Flask app, and I think it was pretty straightforward so far. The notebook has been shared in the chat window, so you can access it from there. Next, let's render an HTML template. For that we'll create a new folder called templates, because Flask looks for templates in a folder with exactly that name. Inside it, we'll create a new file; it can be called anything, so let's call it index.html, and write our HTML code there. I'll just copy the code in: we give it a head with the title 'My first webpage', and the body says 'Hello world from template'. If I save this and reload the page, it won't update, because we haven't updated app.py yet. In app.py, instead of returning a string, I'll return render_template with the name of the file. As I said, render_template looks inside the templates folder, so we don't have to give the full path templates/index.html; we can just say index.html. render_template isn't imported yet, so we import it from flask as well. I'm seeing some linter errors here that I don't want, so I'll just adjust the linter setting and the errors go away. Codespaces on the free tier sometimes reloads or stops the server, so I'll run python app.py again, and there it is: 'Hello world from template'. This came from the template: the template's title, 'My first webpage', shows up in the browser tab, and the H1, 'Hello world from template', is what appears on the page.
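At this point app.py looks roughly like this (a sketch; the exact code in the walkthrough may differ slightly):

```python
# app.py -- a minimal Flask app that renders templates/index.html
from flask import Flask, render_template

app = Flask(__name__)        # __name__ refers to the current module

@app.route('/')              # the home route, e.g. http://localhost:5000/
def home():
    return render_template('index.html')   # looked up inside the templates/ folder

if __name__ == '__main__':
    app.run(debug=True)      # debug=True reloads the server whenever a file changes
```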
So yes, we've created a web app and rendered a simple template. The next thing we're going to do is create a simple form. Our end goal is an email spam classifier: it should have a heading, an input text area, and a button. So we're going to update index.html. First I'll add an H1 heading that says Email Spam Classifier. Let's see what we've got; I have to restart the server again with python app.py. I'm getting an error, so let me check; it turns out the Conda environment was the issue, so after selecting the right environment and running the app again, it loads, and we've got the Email Spam Classifier heading. Now we'll create a form containing a text area and a button. Looking at the webpage, we have a text area and a button, so let's refine them a bit: the button will say Check Spam, and the text area gets a placeholder that says 'Enter your email body'. Let's also make the text box bigger: rows controls the height and cols controls the width, so I'll set rows to 15 and cols to, say, 35, which looks a bit better. One more thing: I want the Check Spam button below the text area, so I'll put the button in its own div. That looks fine, but everything is left-aligned, so we can add a style tag to center it. This is a naive way of adding CSS, but I'll use it in the interest of time: for the body, set text-align to center. Now everything is centered. We can also improve the button by giving it a blue background color and white text; that looks better. So now we have our simple form, but what does it do? Nothing, as of now; if I type something and click the button, nothing happens. What we want is to take whatever has been written in the text area and display it below. Let me zoom in a bit. To show something below the form, we can add a paragraph tag or a heading; I'll add an H2 (a second-level header) that just says 'hi' for now. If we refresh the page, we can see 'hi'. Instead of 'hi', I want to display whatever was typed into the text area, and for that we can use a Jinja template.
We have to update a few things for that. First, we'll go to app.py and deal with the request method. By clicking the button, we're sending a POST request, so in index.html we set the form's method to post. Now the POST request gets sent, and we want to read the text from it. This home route will accept two methods, GET and POST: GET is what happens when you open the URL, the browser fetches the webpage from the server; POST is what happens when you click the button and the form data is sent back. So I'll set methods to GET and POST, and then say: if request.method == 'POST', read the text with request.form.get('email_content'). Where does 'email_content' come from? It has to be the name of the text area, so we add name="email_content" to the text area in index.html; that's the field we're reading the text from. Also, request isn't defined yet, so we import request from flask along with Flask and render_template. If I refresh now, nothing happens because the server needs restarting, and I hit an error; this happens sometimes with Codespaces and usually doesn't happen when you develop locally with VS Code, so I'll just run the server again. Now if I type something like 'hello world' and click Check Spam, it should show 'hello world' below, but it doesn't yet, because we're only rendering the file; we need the Jinja template engine to put the text into the page. In app.py we pass the text into render_template, and in index.html we fetch it with the Jinja expression for that variable and place it inside the H2. Now if I type 'hello world', it shows up below. One more thing: I also want the text to stay in the text area instead of disappearing, so I'll put the same Jinja expression inside the text area as well. Now if I submit 'hello world', it shows up below and also remains in the text area. One last touch: after submitting, I'd like a way to reset the page. A nice way to do this is to add an anchor tag, an href pointing to the home route, with the text Reset; that adds a link which simply fetches the home route with a GET request. Let's refresh, and we've got a Reset link.
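At this intermediate stage, before we wire in the model and split out a separate predict route below, app.py looks roughly like this:

```python
# app.py -- the home route handles both GET and POST and echoes the submitted text
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def home():
    text = ''
    if request.method == 'POST':
        text = request.form.get('email_content')   # must match the textarea's name attribute
    return render_template('index.html', text=text)

if __name__ == '__main__':
    app.run(debug=True)
```

In index.html, the same text variable is rendered inside the H2 below the form and inside the text area, using Jinja's double-curly-brace syntax.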
And if I click Reset, everything is cleared. This is looking a lot better now: type hello world, click Check Spam, it shows hello world, and Reset clears it. So we're now displaying whatever is typed into the text area back on the webpage. The next step is to learn a little about the model itself. An email spam classifier is a classification model: it categorizes an email as either spam or not spam. The dataset it was trained on has labeled emails: each example is an email body along with a label saying whether it's spam, so it's a supervised classification model. I'm not going to go deep into how the model was trained; there's a notebook linked here where you can see that. The things to note are: the model takes plain text, which we'll send from the text area; the text has to be tokenized, which is a step we need to perform; then we call model.predict on the tokenized text, and it returns 1 for spam and something other than 1 for not spam. That's all you really need to know about this model. Now let's move on to running the model locally. Before we do that, we need two things: the model itself and the tokenizer, which here is a CountVectorizer. I've linked both of them; I already have them downloaded, but you can download them from these links. We'll create a new folder named models inside the repository, and paste the two files into it. Now we have to load the model and integrate it with our webpage, which means updating app.py. So far, the POST handling lives in the home route, but I'm going to create a new route that takes a POST request: app.route('/predict') with methods set to POST only. This route will be hit only when we click the Check Spam button. I'll create a new function called predict, and it will return the same render_template for index.html; you can use the same template from multiple view functions. That means the home route no longer needs the POST handling: it will now just render the page and nothing else, so I'll remove the methods argument from it as well. So that's the home route sorted, and we have the new predict route. The first thing the predict route has to do is get the email text.
To get the text we type here, we can say email_text = request.form.get, and we need to give the name of the textarea, which is email_content. So now we have the email text: whatever we write in the text area ends up in this variable. Next we have to tokenize this text. Let's define a new variable, tokenized_email. To tokenize we use the tokenizer, which is the model's CountVectorizer, cv, and call cv.transform; this comes from scikit-learn. But we don't have cv yet. Both cv and the model are saved as pickle files, so first we import the pickle library and load them. So: import pickle, then cv = pickle.load(open("cv.pickle", "rb")); this is the tokenizer, and we open it in read-binary mode because a pickle file is a binary file. Similarly, model = pickle.load(open("clf.pickle", "rb")), again read-binary; clf is the classifier, that is, the model. You might not recognize these particular file names, because these are models I trained and uploaded; when you deploy your own model, you'll have your own pickle file names. So this one is the tokenizer and this one is the model. (The spelling of request was wrong there; fixed.) Now I call the tokenizer's transform on the email text, and then I have to run the prediction as well: predictions = model.predict(tokenized_email), which is the usual scikit-learn way to get a prediction. The tokenized email is the preprocessed text, the X value we pass to model.predict. One last thing: if the prediction is 1 it's spam, and if it's not 1 it's not spam. So let's write predictions = 1 if predictions == 1 else -1: if it's spam it stays 1, otherwise it becomes -1. All of these conventions come from the model I trained, so they'll change from person to person depending on how they build and train their model; you get these details directly from the model itself. Now that we have the prediction, we pass it to the template: predictions=predictions, and we'll also pass the email text, because remember, I want to keep showing the email text so it doesn't disappear. So we return email_text as well; let's call it email_text. (I'd misspelled predictions; that's correct now.) I think this is good, and I don't need to change anything else in app.py: I've kept the default route just serving the index.html page, and I have the new route called /predict.
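Putting those pieces together, the /predict route at this point looks roughly like the sketch below. The paths point into the models folder created above, and the form field name is the one used in this walkthrough; it has to match the textarea's name attribute exactly, which is what the upcoming NoneType error is about. Note that CountVectorizer.transform expects a list of documents, so the text is wrapped in a list here.

```python
# app.py -- sketch of the /predict route at this stage (illustrative)
import pickle
from flask import Flask, render_template, request

app = Flask(__name__)

# load the tokenizer (CountVectorizer) and classifier once, when the app starts
cv = pickle.load(open("models/cv.pickle", "rb"))
model = pickle.load(open("models/clf.pickle", "rb"))

@app.route("/")
def home():
    # the default route now just serves the page
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    email_text = request.form.get("email_content")     # must match the textarea's name attribute
    tokenized_email = cv.transform([email_text])        # tokenize the plain text
    predictions = model.predict(tokenized_email)[0]     # this model returns 1 for spam
    predictions = 1 if predictions == 1 else -1         # normalize to 1 / -1
    return render_template("index.html",
                           predictions=predictions,
                           email_text=email_text)
```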
The /predict route takes the text from the text area, does the tokenizing, does the prediction, and returns the prediction to the HTML. So in the template we now have to change text to predictions (I hope it's called predictions, not prediction). And since we renamed the text to email_text, I'll use email_text in the textarea as well. Let me move this onto a new line. So we have predictions and email_text, and I think this is good. One more thing: initially this form was posting to the default route, which is why we didn't give it anything extra, but now we're making a POST request to the /predict route, so we have to give the form an action. I'll set it to /predict, so we're posting to the /predict route. Let's see if this works; I'll have to run the server again: python app.py. Oh, there's an error: FileNotFoundError. The path is wrong; it should be models/cv.pickle and models/clf.pickle, because we saved the pickles in the models directory. Let's run it again and see if it works. Yes, it's running. Open it in the browser and give it a simple text: "hello world, this is about an email body." Now it says Method Not Allowed, error 405, which means the POST request isn't being accepted. Let's see what's wrong: we've given /predict as the route and the method is POST. One thing I'll do is wrap the email fetch in if request.method == "POST"; I don't strictly need it, but it means the email text is only read on a POST request, which is fine. The real problem is the form attribute: I'd written actions, and it should be action, not actions. Let me fix that and run again. Yes, that works now. But I'm still getting an error on the cv.transform line: 'NoneType' object has no attribute 'lower'. Have I not given any text? Let me run it again: python app.py (I'll have to open it in the browser again; this is a Codespaces quirk). Still the same NoneType error. Let's look at the code: we're getting the email here, but the textarea's name attribute is actually content, because I copied that markup from elsewhere, so request.form.get should fetch "content" instead. I think that should work. Yes, finally we get a result: -1, so this is not spam. The mismatched "content" name was the problem.
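As an aside (my own suggestion, not something done in the walkthrough): one way to avoid that kind of FileNotFoundError entirely is to build the pickle paths relative to app.py itself, so the app works no matter which directory the server is started from.

```python
# Optional tweak: resolve pickle paths relative to this file rather than the
# current working directory, so a path error like the one above can't recur.
import os
import pickle

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
cv = pickle.load(open(os.path.join(BASE_DIR, "models", "cv.pickle"), "rb"))
model = pickle.load(open(os.path.join(BASE_DIR, "models", "clf.pickle"), "rb"))
```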
So the markup was using content, and I had to use content on the Flask side as well. Everything else stays the same: we get the email content from the form, do the tokenization, do the prediction, and return the prediction along with the email text. In the template we show the email text in the text area and the prediction in the h2. So that's the result. Now let's check that Reset still works: click Reset, and everything is reset, so that's fine. Great. Next, instead of showing -1 and 1, I want to show "spam" or "not spam". Right now, if we type "this is an email", we get -1; I want it to say "not spam". For that we can add a little logic in the template using Jinja. I'll type the {% %} delimiters and write a condition inside, which is just simple Python-style syntax: if predictions == 1, the output will be "spam"; elif predictions == -1, the output will be "not spam"; and then I end the block. The syntax is a bit different from Python, but very similar: the if goes inside {% %}, and you close the whole thing with an endif. Let's try it with "this is an email"... ah, it should be endif, written without a space. I'll fix that, refresh, type "this is an email", click the button, and it says "not spam". We've got it. Now we can style it a bit: I'll add a small inline style, say color red for "spam" and green for "not spam". Let's check: yes, "not spam" shows up styled, it's working. I've kept an example spam email and an example non-spam email here; let me copy them in and check that everything works. This is an actual spam email I got in my mail folder, and I'm not saying the model is a hundred percent correct; it can return wrong answers too, which is just part of machine learning, and I'm not going into the machine learning side here. According to the model I've trained, this one should show spam, and it does; and this other one shouldn't show spam... yes, it says not spam. Great. I think we've finally got the model running locally, so most of the work is done. The next thing remaining is deploying this web application to the cloud.
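Before moving on to deployment, here's a consolidated sketch of the template logic we just finished: the form posting to /predict, the textarea whose name matches what the route reads, the Reset link, and the Jinja if/elif that turns 1 / -1 into a colored label. In the real project this markup lives in templates/index.html; it's inlined with render_template_string here, and the model call is replaced by a stand-in rule, purely so the sketch runs on its own.

```python
# Self-contained sketch of the template side (normally this markup is in templates/index.html)
from flask import Flask, render_template_string, request

app = Flask(__name__)

PAGE = """
<form action="/predict" method="post">                   <!-- action (not "actions"), POSTs to /predict -->
  <textarea name="content">{{ email_text }}</textarea>   <!-- name must match request.form.get("content") -->
  <button type="submit">Check Spam</button>
</form>
<a href="/">Reset</a>                                    <!-- a GET on the home route clears everything -->
{% if predictions == 1 %}
  <h2 style="color: red">spam</h2>
{% elif predictions == -1 %}
  <h2 style="color: green">not spam</h2>
{% endif %}
"""

@app.route("/")
def home():
    return render_template_string(PAGE, email_text="", predictions=None)

@app.route("/predict", methods=["POST"])
def predict():
    email_text = request.form.get("content")
    # stand-in for model.predict so this sketch runs without the pickle files
    predictions = 1 if "free" in email_text.lower() else -1
    return render_template_string(PAGE, email_text=email_text, predictions=predictions)

if __name__ == "__main__":
    app.run(debug=True)
```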
To prepare for publishing, we first need a requirements.txt file listing all the libraries required for this project. Let me stop the server for a moment and show you what pip freeze gives us: if I run pip freeze here, it shows all the libraries installed in this environment, which are the ones this project needs. So I'll save all of that into requirements.txt with pip freeze > requirements.txt (the spelling is requirements), and you can see everything is saved there. One of the entries looks wrong, though, so I'll just uninstall that package and pip install it again, then run pip freeze > requirements.txt once more. Yes, now it looks correct, and requirements.txt has all the libraries required for this project. So the project is ready to deploy and take live. Next is pushing these changes to GitHub. Before that, we can check the changes that were made: app.py, requirements.txt, settings.json; that looks fine. We could commit right from here by giving a commit message, but this time I'll do it from the terminal: git add ., git commit -m "first commit", git push origin main. If you're doing this for the first time, you may have to set your git config (who is making the commit) before you can commit; I've linked a guide here, so if you get an error while trying to commit, check that guide. Now the next thing is publishing with Render. Let's first check the GitHub repository: this was our repository at the beginning (you can also edit the README.md), and all the changes are here now; you can see the app.py changes. Now let's open Render for deployment. render.com lets you deploy some websites for free, and it's very easy to use. I'll log in with GitHub; I've done that before, which is why a few previous deployments show up here. I'll create a new web service and configure GitHub: it's not connected yet, as you can see, so I'll connect my GitHub account by pressing Configure GitHub. All of these steps are also written out in the notebook, so you can follow along there. I'll go to my account, give access to all repositories for now, and click Install. Now GitHub is connected to Render. Let me refresh once and press New Web Service; if it doesn't show up, refresh the page, because the account is already connected. The repository I was working on is model-deployment-v2, so I'll click Connect. Here I'll give a service name, say model-deployment. I can select a region; I'll leave the default (Singapore), but try to select a region near you or near where most of your traffic will come from. I'll connect the main branch; I don't have any other branches for now. I also don't need to set a root directory, because the repository root is already the project directory; if your code lived under something like src, you would set that here. I'm using Python 3 as the runtime, and we've already created the requirements.txt file.
So this is the build command, which is fine, and you can also see the start command: gunicorn app:app. We already installed gunicorn while creating the environment; if you check requirements.txt, you can see gunicorn there. gunicorn is, you could say, a server library that helps you serve your website in production. There are two parts in app:app. The first app is the name of the file you're running; our file is app.py, so this is app. The second app is the Flask instance: in our app.py we have a variable called app, which is the Flask instance, and that's what this refers to. I'll select the free instance type, which only has 512 MB RAM and 0.1 CPU, and click Create Web Service. Now the web service is created, and the next step is awaiting deployment. You can see the deployment starting, and we'll have to wait; because we're using the free instance it might take some time, and we might also get some errors that we'd have to fix. Now, why did we use Python 3.7.10? I mentioned at the very beginning that we created the Conda environment with Python 3.7.10 because, by default, Render uses Python 3.7.10, so that's what we used for this particular project. If you want to use a newer Python version, you have to specify it in the environment configuration; there's a doc linked here on how to use the latest Python version. You can see it says build uploaded... and build successful, so all the required packages were installed. Now it's starting to deploy... and the service is live. That was fast. Let's check it: it's working fine; let me zoom in. Let's test it once: "this is my email body"... it says spam. I don't know why, but it says spam. Anyway, this link is now live and published, and everyone can access it. I'll post it in the comments so anyone can use this link to check whether an email they've received is spam or not. So we've successfully deployed the email spam classifier model, and in the same way you can deploy any other model. We're done with the deployment.
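To make the app:app in that start command concrete: the part before the colon is the module name (our file app.py), and the part after it is the Flask instance defined inside that module. Roughly:

```python
# app.py -- "app" before the colon in "gunicorn app:app" is this module's name
from flask import Flask

app = Flask(__name__)   # "app" after the colon is this Flask instance

# ... routes go here ...

if __name__ == "__main__":
    # used only for local development; on Render the start command
    # "gunicorn app:app" serves the same Flask instance instead
    app.run(debug=True)
```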
One last thing to cover is the API: how to create an API route. I'll go back to the app.py file. What does creating an API route mean? Basically, we're not going to use the web interface; we're going to send the request as JSON and get the output back as JSON. That's what creating an API route means here, so let's do it. We'll create another route with app.route. As a general convention I'll put it under /api, so the route is /api/predict; this is a new route, not the same as /predict, and its methods will again be POST only. We can give the function any name; I'm calling it api_predict. The first thing is getting the email: the data comes from request.get_json, and I'll pass force=True so the body is parsed as JSON; from that I get the email body. The next part is the same as before, so I'll just copy the tokenize-and-predict code. For the return, I won't use render_template this time; I'll just return JSON with jsonify, returning prediction=prediction. jsonify isn't imported yet, so we import it from flask; it's part of the flask module. So now we've created an API route. Let's see whether this API route is reachable. Run python app.py: the normal flow still works ("this is an email" goes to the /predict route and comes back as not spam), but if I just open /api/predict in the browser, we get nothing, because that's not how you call an API. Instead we can use API tools like Postman or Insomnia, or call it with the requests library or from the terminal, but I'll use a tool called Thunder Client and install the Thunder Client extension. While it installs, back in the notebook: here's how you can call the API; you can use the command line, Python, or any API testing tool, for example Thunder Client, Insomnia, or Postman. I'll use Thunder Client. It's installed; let me refresh the Codespace once... and Thunder Client shows up. I'll create a new request, and I need the URL. For that, first run the app with python app.py; it's now running at this particular link, so the URL is that link plus /api/predict. I'll copy it in, and we're sending a POST request, not a GET. Now the body: let's go back to the code. We're calling request.get_json, so we receive the data as JSON and fetch the content out of it; that means "content" has to be a key in the JSON. So we'll send a JSON body with a key called content whose value is the email body. Let's send it... and we're not getting a response. Let me close everything and run python app.py again, open it in the browser, take the new URL plus /api/predict, paste it in, and send. Now it's showing Method Not Allowed. What have I done? Let's check: /api/predict, methods POST.
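For reference, here's roughly what the API route looks like at this point. It's a sketch that follows the conventions used above: the request body carries a "content" key, and the response is 1 for spam or -1 for not spam.

```python
# Sketch of the /api/predict route: JSON in, JSON out
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
cv = pickle.load(open("models/cv.pickle", "rb"))
model = pickle.load(open("models/clf.pickle", "rb"))

@app.route("/api/predict", methods=["POST"])
def api_predict():
    data = request.get_json(force=True)            # force parsing of the body as JSON
    email = data["content"]                        # the request body must contain a "content" key
    tokenized = cv.transform([email])
    prediction = int(model.predict(tokenized)[0])  # plain int so jsonify can serialize it
    prediction = 1 if prediction == 1 else -1
    return jsonify(prediction=prediction)          # e.g. {"prediction": -1}
```

Any HTTP client can then POST a body like {"content": "This is an email body"} to /api/predict and get back {"prediction": 1} or {"prediction": -1}.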
Let me check again: /api/predict, method POST, force=True. I'll copy the URL once more; this should return something. Let me run it once again; it's running, /api/predict... let me make a new request with this link and send. Still nothing. I think I'll have to deploy this, because Thunder Client isn't reaching this localhost URL. So let me just make another git push with the new code: I'll give the commit message "added API route", stage everything, and commit. What this does is automatically deploy the new code to Render. Let me go to Render and sign in: yes, a deployment is in progress, so the new version is deploying now. Let's wait for the deployment; it's still deploying, and you'll see when it's done. While we wait, one more thing I want to talk about is code refactoring. You can see that this part of our code repeats twice, so we can refactor it and create a new utils file that handles the model prediction. Let's do that: create a new file called utils.py and copy a few things into it. First the pickle part: import pickle, plus the cv and model loading; we're moving all of that out of app.py into this other file so it takes care of the model prediction. Next I'll copy the tokenized_email line and the prediction code into utils.py and put it inside a function, def make_prediction, which takes the email body and returns a prediction. Once that's done, in app.py we can just call this function: prediction = make_prediction(email); it accepts an email, so we pass the email in. You can see make_prediction isn't available in app.py yet, so we import it: from utils import make_prediction. Now we can use make_prediction in both places and we don't need the duplicated code; in the other route we again just write prediction = make_prediction(email). So the code has shrunk a lot; each view function is down to around three or four lines. Now let's check the deployment: yes, it's done. I'll go to the page; we have this API at /api/predict. Back in Thunder Client, make a new POST request to the deployed URL plus /api/predict, and in the body send content: "This is a email body". That's it; it takes the content key from the data. I'll send the request... and finally we get the API response. We got a prediction of 1, which means it's spam. So that's how you can create an API route and call it. I think it wasn't working earlier because we're using a Codespace, and the localhost URL there works a bit differently; but now that it's deployed, everyone can access this API, send a content (the email body), and get back a prediction of either 1 or -1.
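The refactor described above might end up looking something like this sketch, under the same file and field names used so far: the model loading and prediction move into utils.py, and both routes just call make_prediction.

```python
# utils.py -- model loading and prediction pulled out of app.py (a sketch)
import pickle

cv = pickle.load(open("models/cv.pickle", "rb"))
model = pickle.load(open("models/clf.pickle", "rb"))

def make_prediction(email):
    """Tokenize the raw email text and return 1 (spam) or -1 (not spam)."""
    tokenized_email = cv.transform([email])
    prediction = model.predict(tokenized_email)[0]
    return 1 if prediction == 1 else -1
```

```python
# app.py -- both routes now share the same helper
from flask import Flask, render_template, request, jsonify
from utils import make_prediction

app = Flask(__name__)

@app.route("/")
def home():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    email = request.form.get("content")
    prediction = make_prediction(email)
    return render_template("index.html", predictions=prediction, email_text=email)

@app.route("/api/predict", methods=["POST"])
def api_predict():
    email = request.get_json(force=True)["content"]
    return jsonify(prediction=make_prediction(email))
```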
You can also follow this notebook to see how to create the API, do the code refactoring, and call the API. One exercise for everyone: I'm not too good at creating a nice UI, and you can see this is a very plain one; if I zoom out it looks even worse. So what you can do is try to improve the UI. You can add a heading, add an introduction, style the text box and the button, and put the example spam email and example non-spam email on the left side. You could also add more models: if you have multiple machine learning models, you can list them along the top, say a spam classifier, an image classifier, whatever it is. So change and modify the UI yourself, but try to go through the entire notebook and build this email spam classifier deployment as well; we already have the model, so try to create the deployment. To recap what we've covered in this tutorial, since we're towards the end: how to use Flask and create a very basic website with it; how to set up a GitHub repository and a Conda environment; how to use HTML, CSS, and Python together; how to run a model locally and integrate it into our webpage; how to deploy the web application to the cloud, that is, push changes to GitHub, publish to Render, and fix any errors on Render (we didn't hit any errors while publishing, but if you do, you make the changes and push again); and finally how to create an API, call that API, and refactor the code. That's everything covered in this notebook. I've mentioned a few references here, and I've also included notes on setting up VS Code if you want to run this code locally, and on running it in Replit. I hope this was a simple, beginner-friendly tutorial; we faced a few technical difficulties in the middle, but I hope the deployment process itself came across as simple. Now I'll take a few questions; I think most of them have already been answered, but let's see. Thank you for the kind comments; I hope this notebook is useful to you. You can refer back to it anytime; we'll add the link in the description. "The web service is not working": I'm not sure why; post it in the YouTube comments and we can discuss it further. There's another question: "Where can we find the pickle models to test the service and try it out?" The pickle models are linked in the Jupyter notebook; you'll find the CountVectorizer and the classifier model there, so you can try it out. And if you've built a different model, you can use that too; it doesn't have to be this same email spam classifier model. Also, I definitely wouldn't say this is the ultimate way of doing a deployment.
There are steps you might have to add: authorization, for example. The models might also be very big, so you might have to store them in a database or build a Docker image, and there are plenty of other things you could do. So there is scope for future work, as you can see here: you can implement authentication and user management if you want to restrict some features; you can scale the application for high traffic; you can obviously improve the user interface, since we haven't done much on that side; you can implement real-time updates, for example classifying the email as spam or not spam while someone is typing it; and a few more things. Those are all directions you can take this in the future. This was meant to be very beginner-friendly; I tried to keep it simple for anyone who has created a model, so they can use these steps to deploy it and showcase it on LinkedIn, in a resume, or anywhere else. There's a lot that can still be done to improve on it from here. Well, I think that's everything. Thanks for attending the session; I'll end it here. Thank you, and see you in the next video.