Lecture Notes: Building a Machine Learning System to Predict Medical Insurance Costs

Hello everyone, this is Siddharthan. This is the 11th project video in our machine learning project series, and in this video we are going to see how to build a machine learning system that can predict a person's medical insurance cost. In case you are watching my videos for the first time: hi! On this channel I'm making a project-based machine learning course with Python. You can check out the playlist section of my channel to learn more about these machine learning projects and my machine learning course.

With that being said, let's get started with today's video. I would first like to explain the problem statement, then, as we always do, the workflow we are going to follow, and then we will get into the coding part using Python.

So this is the problem statement. Let's say that you are a machine learning expert or a data scientist, and there is a medical insurance company that wants to create an automatic system that can predict what a person's medical insurance cost will be. They are approaching you because you are a machine learning expert, and now we need to build a machine learning system that can learn from the data and predict what this insurance cost will be. This is what we are going to do in this video.

Now let's understand the workflow. The first step in machine learning is to collect the data; here we need insurance cost data. As I told you earlier, we are going to predict the insurance cost, and for this we need some data. Our machine learning algorithm will learn from some parameters, such as what health issues the person has had, the gender of the person, and other such things. The insurance cost varies based on these, and we need this data to
feed our machine learning model. So the first step is to collect the data for this project.

Once we have this insurance cost data, we need to do some data analysis. We analyze the data to understand what it is all about and whether it can give us some meaning. Data analysis helps us find what the data has to tell us: we plot the data in some graphs and see what relationships it has.

The next part is data preprocessing. We cannot feed the raw data directly to our machine learning algorithm; we need to do some processing on it first. Once we process the data, it will be compatible with the machine learning model.

The next step is to split our data into training data and testing data. In machine learning we train our model with the training data and evaluate it with the test data. This evaluation is just to check how well our model is performing.

Once we have split the data, we feed the training data to our machine learning model. In this case we are going to use a very simple model called linear regression. Once we train this linear regression model, we will evaluate it to check how well it is performing. After that we will have a trained linear regression model, and when we feed new data to it, the model can predict what the insurance cost will be. So this is the workflow we are going to follow.

Linear regression is more of a statistical model than a machine learning model, but it is the base for other models. People think machine learning and statistics are two
different things, but that is not the case: machine learning is built on top of statistics, so these basic statistical models are also very important for machine learning. I'm using linear regression here because we haven't used this model in our previous project videos. You can also use other regression models, such as the XGBoost regressor. While we are implementing this linear regression model, I will also explain how the model works. So this is what we are going to see in this video. Now let's get into the coding part.

Before starting, I would like to give you a quick intro to my channel. Once you go to my channel, you can see this machine learning course curriculum video, in which I have explained the different videos and modules I will be covering. You can go to the playlist section to find these modules. This is the hands-on machine learning course that I have posted. So far we have completed about four modules, and we also have the machine learning project videos, about 11 in total. I will be uploading three videos per week, on Monday evening, Wednesday evening and Friday evening: every Friday one machine learning project video, and on Monday and Wednesday videos that follow the course order.

The Python programming we will be doing is in Google Colaboratory. In case you are not aware of Google Colaboratory, go to the second module, Python Basics for Machine Learning, where you can find the video Google Colaboratory for Python. In it I have explained how you can access Google Colaboratory and what its different features are.

Now let's get started with today's video. Just a second. Okay, so there will be this
connect option here, so you need to connect your system. I have uploaded the dataset here; this is the insurance dataset. You can search for this dataset on Kaggle and on other sites, and I will also give the link to the dataset file in the description of this video. To upload it, go to the Files option and choose Upload to session storage, or right-click here and use the upload option. This is the dataset we are going to use: insurance.csv.

The first step in a project, or in any Python program, is to import the dependencies, so I'll make a text cell here as Importing the Dependencies. Dependencies are nothing but the libraries and functions we need for our program.

First I will import some basic libraries. The first library is NumPy: import numpy as np. NumPy is used to make NumPy arrays, and we need arrays for processing at various stages of our project. Next is pandas: import pandas as pd. Pandas is useful for making data frames. As you can see, the dataset file we have is a CSV file. CSV means comma-separated values, and it is not easy to analyze the data directly from a CSV file, so we use a pandas data frame. Data frames are more structured tables; once we load the dataset into a pandas data frame, it is easier to do processing and analysis on the data.

Next we are going to import two other libraries. The first is Matplotlib: import matplotlib.pyplot as plt. This library is used to make plots and graphs. The other is Seaborn, which is also a data visualization library used to make plots: import seaborn as sns. These are short forms. I don't want to type numpy everywhere in the code, so I have imported it under the short name np; that is what this as np and as pd
does: it is just a short form for the library. The same goes for import seaborn as sns.

Then we need to import a few more things. From sklearn.model_selection we import train_test_split. As I have told you, we need to split our data into training data and testing data, and the train_test_split function will help us do it. From sklearn.linear_model we import our linear regression model, LinearRegression. Pay attention to which letters are in caps and which are in lower case. And finally, from sklearn import metrics; this metrics module is used to evaluate our model.

So we have imported NumPy for arrays, pandas for data frames, Matplotlib and Seaborn for plots, the train_test_split function from sklearn.model_selection, LinearRegression from the linear_model module, and metrics. These are the dependencies we need.

The next step is data collection and analysis. You can find this insurance data on Kaggle; just go to Google and search for medical insurance cost data.

Now we need to load the data from the CSV file into a pandas data frame, so I'll make a comment here: loading the data from the CSV file to a pandas DataFrame. I'll create a variable and name our data frame insurance_dataset. So insurance_dataset is equal to pd.read_csv, where pd is the short form for pandas. The read_csv function reads the CSV file and loads its content into a pandas data frame. Go to the file's options menu, where you can see Copy path; copy the path of the file and paste it here. Now you can run this cell by pressing Shift + Enter, which runs the cell and moves to the next one. This will load the data from the CSV file into
this insurance_dataset data frame. Now we can start to understand the data. You can print the first five rows of the data frame: mention the data frame name, insurance_dataset, then .head(). The head function prints the first five rows of the data frame.

As you can see, we have seven columns in total, excluding the serial number (index) column: the age of the person; the sex, or gender, of the person; the BMI, or body mass index, of the person; how many children the person has; whether the person is a smoker or a non-smoker; the region, which is the place they are from; and their insurance cost, in charges. The charges are in US dollars. This is a US dataset, and the region represents a region of the United States: southwest, southeast, northwest and northeast. So the first value is nothing but 16,884 US dollars, the second is 1,725 US dollars, and so on.

What we are going to do is train our machine learning model with this data. Once it has learned from this data, we will give it new data containing the columns from age to region, without the charges, and our machine learning model can predict what this cost will be. This is what we are going to accomplish in this project.

Now that we have seen the different columns, let's understand more about the data. I would like to see how many data points this dataset has, so we are going to find the number of rows and columns. Again mention the data frame name: insurance_dataset.shape. This shape attribute will give us the number of rows and
columns. As you can see, we have 1,338 rows and 7 columns. Rows mean the data of different persons, so we have data from 1,338 persons, and 7 represents the number of columns: each person has these seven values. It is not a very small dataset and not a very huge one; it is in a kind of mid-range. In machine learning we also deal with datasets that have thousands of rows and even lakhs (hundreds of thousands) of rows, so it is important to know the size of the dataset we have.

The next step is to get some information about the dataset. It is always good practice to mention what you are doing in a particular line of code, so I'll add a comment: getting some information about the dataset. Now let's run insurance_dataset.info(); you need to include the parentheses. As you can see, it tells us the data type of each column and how many values we have. In total we have 1,338 entries and 7 columns, and these are all non-null values. Null values are missing values; non-null values are values that are present, so this means no values are missing.

We also have the data types: integer, object, float and so on. You can see that the sex column has the data type object, and so do smoker and region. These columns hold classes, not numerical values, whereas age, bmi, children and charges, these four columns, are numerical. The sex column contains either male or female, smoker contains either yes or no, and region contains only four values: southeast, southwest, northeast or northwest. We call these features categorical features, because they contain only categories and not particular integer
values. So these are the categorical features. They are very important, and we also need to do some processing on them, so I'll put a text note here: we have three categorical features in this dataset. The first is sex, the gender of the person; the second is the smoker column, whether the person is a smoker or a non-smoker; and the third is region.

You may be wondering what is meant by features. As I have already told you, once we give all the other values, our model can predict the charges column. The value of interest, the one we are going to predict, is called the target variable. In this case charges is the target, and the remaining six columns are called features. Among these features we have three categorical ones, which are just categories such as yes or no, or male or female.

The next step is to check whether the dataset has any missing values. The info output already tells us this, but there is a standalone way to check. So, checking for missing values: insurance_dataset.isnull().sum(). This line of code gives us the number of missing values in each column: wherever a value is null it is counted, and the counts are summed for each column. As you can see, we don't have any missing values, so it is a very good dataset. When a dataset does have missing values, we need to do some processing on it. If you want to know how to handle missing values, you can go to the fourth module of my machine learning course, on data collection and preprocessing, where I have explained it. So in this case
we don't have any missing values, so we can proceed with the further steps.

The next part is data analysis: we are going to analyze this data using some plots. First let's get some statistical measures of the dataset: insurance_dataset.describe(). The describe function gives us statistical measures such as the mean, standard deviation and percentiles. As you can see, it gives them only for the numerical columns and not for the categorical features, because we cannot find the mean and standard deviation of categorical features.

First it gives the count, the number of values in each column; we have 1,338 values in each column. Then we have the mean: the mean age is about 39, the mean BMI is 30.6, the mean number of children is 1.09, and we likewise get the mean of charges. Then we have the standard deviation, the minimum and maximum value of each column, and three percentiles. These are percentiles, not percentages; the two are different. For the age column, the 25th percentile means that 25 percent of the values are less than 27, and the 50th percentile means that 50 percent of the values are less than 39. That is what percentile means.

So we get some statistical measures of the data, which tell us how the data is distributed: measures of central tendency such as the mean, the standard deviation, and so on.

Now let's understand more about this by finding the distribution of the data. We will analyze it column by column. The first column is age, so first we are going to find the distribution of the age values. I'll use sns.set(); this sns.set() will give us some better
themed plots. We want things like grids in our plots, and that kind of thing is provided by the sns.set() function; I'll say more about it in a moment. Then plt.figure. If you remember, I imported seaborn as sns and matplotlib.pyplot as plt as short forms, and that is what I'm using in these lines. In plt.figure we need to mention the figure size we want: figsize is equal to (6, 6). You can give other values as well; it is just the size of our plot.

Then sns.distplot. distplot stands for distribution plot; it tells us how the values are distributed throughout the dataset. Inside the parentheses mention the data frame name, insurance_dataset, and then the column. Check the column name: if it is in lower case in the data, it should be in lower case here as well, and all our column names are in lower case. We are taking the age column first, so mention it in square brackets after the data frame: from this data frame I want to take the age column. The column name should be in quotes.

Next you can give a title to your plot: plt.title, and let's name this plot Age Distribution. Finally we need plt.show(). Let's run this. So this is our distribution. sns.set() is what produces these grids and other themes; you can try once without sns.set() and once with it to see the difference it makes.

Now let's understand the distribution. You can see the age axis runs from about 10 to 70, and the maximum density, the largest number of values, is in the range of 20, 21 and 22. From about 23 or 24 up to about 70, the distribution is almost equal across the age range. So only
the age around 20 is more frequent, which means more people in our dataset are of this particular age: 20, 21 and 22. So this is the age distribution.

The second column we have is sex, the gender column, so let's plot it. We cannot plot a categorical feature with this kind of distribution plot; we need to use a count plot for categorical features, and I'll explain what a count plot means. We have plotted the distribution of the age column; now we are going to check the gender column.

So plt.figure. You don't need to call sns.set() every time; running it once is enough, and it will be applied to all your plots. In plt.figure let's mention the figure size as (6, 6). Now we use the count plot: sns.countplot, with x equal to the column name. We are taking the sex column, so put that column name in x, and in data mention the data frame, which is insurance_dataset. This gives us a count plot. In the sex column we have two types of values, male and female, and the count plot shows the number of data points with the value male and the number with the value female. That is what a count plot is. Let's give the plot a name: plt.title, and let's put it as Sex Distribution. Then plt.show(). Let's run this. I made a mistake here: the parameter should be figsize.

You can see that we now have the count for these two kinds of values, and it is almost equal: the distribution is almost the same for both genders, and the count is not much higher for one kind of value than the other. There is another way to understand this: use insurance_dataset and mention the column name. Here we are analyzing the sex
column, so ['sex'], then .value_counts(). The value_counts function gives us the number of values for male and the number for female. I'll run this. In total we have 676 values for male and 662 values for female. This is how you can use value_counts to check how many values there are in each category.

The next column is bmi. Let's go in order. BMI is not a categorical column, so we need to use a distribution plot. I'll just copy the distribution plot code; we need to change only a few things. We are going to check the BMI distribution in our dataset, so I'll paste this, change age to bmi because we are now checking bmi, and rename the title to BMI Distribution.

BMI stands for body mass index; it tells us whether a person is overweight, underweight or in the normal weight range. We call this kind of distribution of data a normal distribution: most of the values are in the mid-range, which is around 30 in this case, and the distribution is similar on either side of this peak. From about 15 to 20 or 25 the counts grow, there are more values around a BMI of 30, and the counts fall off after that.

There is one important thing I have to tell you here. For a person, the normal BMI range is 18.5 to 24.9. As we know, BMI is calculated using the height and weight of a person. If the BMI of a person is less than 18.5, that person is underweight; if the value is in the range 18.5 to 24.9, the BMI is in the normal range; and if the BMI is more than 24.9, that person is overweight. Now if you look at this data, we have a larger number
of values from 25 to around 35 or 40, so a lot of people in this dataset are overweight. This is one important thing to note, and it can affect the insurance cost a person gets: it might increase the insurance cost if this is the case. So the plot helps us understand the distribution, and here we get a normal distribution.

The next column after bmi is children, the number of children. It is a numerical value, but instead of a distribution plot you can just use a count plot here, because the children column contains only six distinct values: 0, 1, 2, 3, 4 and 5. So let's do this for the children column: plt.figure with the figure size (6, 6), then sns.countplot. In x we give the column name, which is children, and the data is insurance_dataset. Then plt.title; it is the same procedure we followed for the sex column. Let's name it Children. Finally, plt.show().

You can see that the largest group is people who don't have any children, with a count of around 550 to 570. The number of people with one child is also high compared with the other values, a little more than 300. So this is the distribution we have for the children column.

Let's also use the value_counts function to see exactly how many data points we have for each value: insurance_dataset, then the column name children, then .value_counts(). You can see there are 574 people who don't have any children, 324 people with one child, and so on.

The next column we need to understand is the smoker column. For smoker also
you can use a count plot, as we did for the sex and children columns. I'll just copy those lines of code, and this is for the smoker column: change the column name to smoker and rename the title to Smoker. Now we can see that the number of non-smokers is much higher in this dataset, more than a thousand, while we have about 200 to 250 smokers. We can also use value_counts to see the exact numbers. In data science it is important to understand how the data is distributed within a dataset; it gives you a clear picture of the variation in the data. Here we need to change the column name to smoker. So we have two values: 1,064 for non-smokers and 274 for smokers.

The next column we need to understand is region: southeast, southwest, northeast and northwest. I'll paste the code again, this time for the region column: x is equal to region, and the title is Region. As you can see, we have four regions in total: southeast, southwest, northwest and northeast. The counts are almost the same for all of them, with slightly more for southeast. Now let's use the value_counts function to see the exact numbers: insurance_dataset['region'].value_counts(). So this is the distribution for region.

Finally, there is only one column left, which is charges. This is numerical data, so we cannot put it in a count plot, because it can take a great many distinct values. In the categorical columns we have only a few values: in the sex column only male and female, and for smoker only yes or no. But when it comes to age, bmi and charges, the range of values can be very large and we cannot use count plots there. We need to use a distribution plot to see how the values are spread over the range, rather than counting individual values. So I'll just copy the
distribution plot code and change the column name to charges. So: distribution of charges, where charges represents the insurance cost. You can remove the sns.set() call here, and let's title it Charges Distribution. You can see that a lot of the data is concentrated around 10,000 dollars, and we have very few values around 30,000 and 40,000 dollars. So this is how our charges are distributed.

That is the data analysis part, which we can use to understand how our data is distributed, among other things.

The next step is data processing; we may also call it data preprocessing. There is one important thing we need to do here. As you can see, we have three categorical columns: sex, smoker and region. The sex column has female and male as its values, smoker has yes or no, and region has four values in total. We cannot feed this text data to our machine learning model. Computers don't understand text well; they understand numerical values well. So we are going to create numerical labels for these classes. Say we convert female to 1 and male to 0, then for smoker yes to 0 and no to 1, and likewise for southeast, southwest, northeast and northwest we give one numerical value to each class. This process is called encoding, so here we are encoding the categorical features. I'll explain how we can do this.

The first categorical column we have is the sex column, so: encoding the sex column. Mention the data frame, then insurance_dataset.replace. The replace function replaces all the text values with the values we want. So insurance_dataset.replace, open a parenthesis, then open a curly bracket and mention the column of the data frame
which we need to change. Here we need to change the sex column, so mention sex, and within it: the value male we want to encode as 0, and female we will transform to 1. Then inplace is equal to True, because I want to make the change in place, in the data frame itself. These are the parameters we need to give. So what we are basically doing is replacing some values: I am taking the sex column, and in it I want to replace all the male text with 0 and all the female text with 1.

The next categorical feature is smoker, so make a comment here: encoding the smoker column. Again mention the data frame, insurance_dataset.replace, and now the column name is smoker; you can look up the column names if you are not sure. If the value is yes, meaning the person is a smoker, let's put 0, and if the person is a non-smoker we encode the value as 1. And inplace is equal to True as well. So we are replacing the values of the smoker column: yes becomes 0 and no becomes 1.

The final categorical column is region, so: encoding the region column. Here we have four different areas, as we found through value_counts: southeast, southwest, northeast and northwest. I'll mention the data frame, insurance_dataset.replace, give the column as region, and then: if the value is southeast, I want to transform it to 0; if the value is southwest, let's transform it to 1; if the value is northeast, let's transform it to 2; and finally all the northwest values should be transformed to 3.
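As a rough sketch of the encoding just described (not the exact notebook cells), the snippet below applies the same label mapping to a tiny made-up sample. The sample rows and the names sample, encoding and encoded are assumptions for illustration only. It also uses Series.map with assignment in place of the lecture's replace(..., inplace=True), so that each encoded column is explicitly written back.

```python
import pandas as pd

# Hypothetical four-row sample standing in for insurance.csv (values made up).
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Label mapping taken from the lecture.
encoding = {
    "sex": {"male": 0, "female": 1},
    "smoker": {"yes": 0, "no": 1},
    "region": {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3},
}

encoded = sample.copy()
for column, mapping in encoding.items():
    # Series.map looks each value up in the dict; assigning the result back
    # stores the encoded column (swapped in for replace(..., inplace=True)).
    encoded[column] = encoded[column].map(mapping).astype(int)

print(encoded)
```

Because the result is assigned back to the column, a forgotten inplace argument cannot silently leave a column unencoded. A value missing from the mapping would become NaN, and the astype(int) step would then fail rather than pass unnoticed.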
So we have encoded the four values. We are replacing the region column: southeast, southwest, northeast and northwest get the labels 0, 1, 2 and 3. I'll just check it once: for sex, male is 0 and female is 1; for smoker, yes is 0 and no is 1; and for region we have southeast, southwest, northeast and northwest. Okay, so let's run this. There is a small mistake here, so let me correct it. Okay, so this has transformed our columns, and you can see the values here. Previously we had text values such as female and male, and yes or no for smoker, and all of those are now transformed to their corresponding numerical labels. As you can see, female is transformed to 1 and male to 0, a non-smoker is 1 and a smoker is 0, and the same goes for region. So this is how you can encode the categorical features.

The next step is to split the features and the target. As I have already told you, the six columns from age to region act as the features and the charges column acts as the target, because that is what we are going to predict. So, splitting the features and target: I'm going to create two variables, X and Y, and put all the features in X and the target in Y. Mention the data frame, insurance_dataset.drop. The drop function can drop a row or a column, and the column I want to drop here is charges, so mention charges and axis=1. What I'm doing is removing the charges column and storing all the remaining values in X, and I want to store the charges column in Y. You can see this axis parameter: if we are removing a column, we need to mention the
axis as 1, and if we are removing a particular row we need to mention axis=0. So 1 represents a column and 0 represents a row. Now let's put the target variable in Y: insurance_dataset, and inside the square brackets mention charges. Let's run this, and this will split the data into features and target. You can go ahead and print X and Y separately, so print capital X.

Okay, you can see we have a mistake here: the region column is not encoded properly. The problem is that I haven't included the inplace parameter in the region cell, so the replaced values were only shown in that cell's output and were not saved back to the data frame. We need to include it there, so inplace=True, and this will rectify the problem. I want to restart and run everything again, so go to Runtime and choose Restart and run all. Otherwise the columns that are already encoded would be encoded a second time, and that would create a problem, so I'm restarting the entire runtime and running all the cells. Okay, it has run. Now the region cell no longer prints the data frame, and the region column is encoded correctly as well. So the problem was only that I hadn't included inplace there.

Now we can print Y. Y contains the target column, which is charges. So X contains all the columns except charges, and Y contains the charges column. This is how you can split the features and target present in a data set.

The next step, as I told you in the workflow, is to split our data set into training data and testing data. So, splitting the data into training data and testing data. We need to create four arrays here: the first is X_train, the second is X_test, and then Y_train and Y_test. I'm going to use the train_test_split function for this, and I'll explain in a moment what is meant by these four
arrays. In the parameters we need to mention X and Y, because we are going to split X and Y into training and testing data, and then mention test_size, which is how much of the data you want as testing data. Generally we take 10 percent or 20 percent of the data as testing data, so 0.2 represents 20 percent. So we are taking 20 percent of the entire data as testing data, and 80 percent goes to the training data. The next one is random_state, which I'll explain in a minute.

You can see here we have created four arrays: X_train, X_test, Y_train and Y_test. What happens is that X is split into two arrays, X_train and X_test. The corresponding target values, the charges for the rows in X_train, go to Y_train, and the charges for the rows in X_test go to Y_test. I hope you are getting what we are doing here: X is split into training and testing data, the labels for the training rows go into Y_train, and the labels for the test rows go into Y_test. So X_train and X_test are the split of X, and Y_train and Y_test are the corresponding split of Y. That is why we mention X and Y in the train_test_split function, along with the test size we want, which is 20 percent.

Finally we have the random_state parameter. So what is its use? Let's say you run this command with some other value, like 3 or 4: your data will be split in a different manner. But if you mention the same value that I am using here, random_state=2, then your data and my data will be split in the same manner. So it is just there to split the data in the same manner on two different runs; it fixes the manner in which the data is split.
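To make the role of test_size and random_state concrete, here is a small sketch on made-up arrays (the toy X and Y below are assumptions for illustration, not the insurance data). It shows that the rows of X_test stay paired with the values in Y_test, and that repeating the call with the same random_state gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows with 2 feature columns; row i is [2*i, 2*i + 1].
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# 20 percent of the rows go to the test set, as in the notes.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

# Same random_state, so the split is repeated exactly.
X_train_again, X_test_again, Y_train_again, Y_test_again = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

print(X.shape, X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_again))
```

With a different random_state the shapes stay the same, but different rows would generally end up in the test set.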
This is basically to make the code reproducible. So that is the splitting of the data into training data and testing data. Let's run this, and now we can check the shape of these arrays: print X.shape, X_train.shape and X_test.shape, where X is the original data set, X_train is the training data and X_test is the testing data. This tells us how many values we have. In total we have 1338 data points; of those, 80 percent, which is 1070, go to X_train, and 20 percent, which is 268, go to X_test. So this is how you can split your data into training data and test data.

The next step is training our model, so model training. As I mentioned at the start of this video, we are going to use a linear regression model. Before implementing it, I'll explain how this linear regression model works, so let me go to my slide. Let's say we have two axes, the x-axis and the y-axis, and we take the features on the x-axis and the target on the y-axis. The features in this case are age, sex, bmi, number of children, smoker and region, so these six columns are the features, and the target column is charges. The data points are plotted in this plot, and when we use a linear regression model it tries to fit a line to this data. Linear regression is a linear model, so it tries to find the best-fit line for these data points. The equation of a line, as you know, is y = mx + c. As for the variables in the equation: x represents the input features, y represents the predicted value, m represents the slope and c represents the
intercept. In this case x, the input features, are those six columns from age to region, and y is the prediction we are making for the charges column. With six features the model learns one slope per feature, but the idea is the same. Then we have the slope and the intercept. If we have different data points we will get a different slope, and the line will be tilted at a different angle. So changing the data points changes the slope, and that changes the orientation of the line. c represents the intercept, which is the distance from the origin to the point where the line crosses the y-axis, and this too can change based on the data we have. So the model tries to fit the data with a line, and based on that fit it predicts new values. If we have a new data point with only the x values, it locates that x value on the line and reads off the corresponding y value, and this y value becomes the predicted charges, the cost. This is how a simple linear regression model works.

Now we can implement this in Python, so I'll go to my Google Colab. If you remember, we imported LinearRegression from sklearn.linear_model, and we need to use it now. So, loading the linear regression model: let's name the regressor as regressor, so regressor = LinearRegression(). This loads one instance of the linear regression model into this variable. Now we need to fit this model to our training data. Here the training data is X_train and the corresponding target values are in Y_train, so we need to fit with X_train and Y_train. Mention the model, which here is regressor, so regressor.fit, and as I explained what happens in a fit, it's regressor.fit(X_train, Y_train), so we are
fitting the line to the training data, using the features of the training data and the labels of the training data. Let's run this. This will fit the line to the data points, as I explained with the graph.

So the training of our model is done, and now we need to evaluate it, so model evaluation. For this we are going to use the R squared score. Before that we need to make some predictions, so prediction on training data. First let's predict the charges, the cost, for all the training data, and then let's predict it for the test data. Let's create a variable called training_data_prediction and use the model, which is regressor: regressor.predict, and in predict let's mention X_train. Let's run this. Note that I am giving only X_train here and not Y_train. The model takes X_train and predicts the Y values, and those predicted values are stored in training_data_prediction.

Now we need to find the R squared value, which is one performance metric. The R squared value usually lies in the range 0 to 1 (it can be negative when a model fits worse than simply predicting the mean), and as a general convention, the closer the value is to 1, the better the model is performing. What counts as a good value differs from case to case, though. I will explain in detail how these evaluation metrics work in a separate video which we will be doing on model training. For now, let's see how we can find this R squared value. This is the R squared value for the training data, so let's name it r2_train. We have imported the metrics module from the sklearn library, so we are
going to use it. You can refer to the importing the dependencies cell which we ran earlier. So it is metrics.r2_score, which gives us the R2 score, and in it mention Y_train and training_data_prediction. What we are doing is comparing the original values in Y_train with the values predicted by our model. When you run this we get a score, so I'll print it: print R squared value, r2_train. Here the value is around 0.751. As I told you, a value close to 1 is good, so in this case we get a reasonably good R2 score. This is a simple linear regression model; if you use more advanced regression models, like the XGBoost regressor, you may get better values. We have already used the XGBoost regressor in one of our regression projects.

So this is our R squared value for the training data, but this alone is not enough. We need one more important metric, the R2 score for the testing data. As I have told you, evaluation on testing data is important because our model has not seen this test data. So, test_data_prediction = regressor.predict, and now we need to change this to X_test. Let's run this. Now we need to find the R2 value, so let's change a few things: change the variable to r2_test, and Y_train should be changed to Y_test. We are doing the same thing, but predicting for the test data. Let's run this, and here we get a value of 0.744, so the two values are close to each other.

You may ask why we are predicting for both training data and testing data. I have already told you that evaluation on testing data is important, but why are we finding the score for training data as well? There is one important thing to note here: there is a concept called over
training in machine learning, which we also call overfitting. What happens in overfitting is that the model overlearns the training data, which we do not want. In such cases we get a very high score for the training data and a much lower score for the testing data. For a good fit, for good performance of the model, the two scores for training data and testing data should be almost equal. In this case they are almost equal, so we don't have an overfitting issue here. So this is how you need to evaluate the model.

Finally we come to one more important step, the final step of our project, which is to build a predictive system. So, building a predictive system that can predict the cost given all these parameters like age, sex, bmi and so on. I'll create a variable called input_data. Let me open the data set, the same one we used, in a Notepad file. What I'm going to do is select a random row here. Let's take this one, but you can take any row. Notice that I'm not including the charges value: I'm taking only the values from age to region, and you can see the column names here. Given these values, the model should predict the insurance cost. So I'll copy these values and put them in input_data.

Once we give these values, the model should predict the medical insurance cost, but there is another main thing to note here. We have encoded our categorical values: female to 1, smoker yes to 0 and no to 1, and so on. Now we need to do the same thing for this data. For female we encoded the value as 1, so we need to change this female to 1. Then we have no, so the no should be transformed
to 1. So we have no here, and let's transform it to 1. Finally we have southeast, and southeast should be changed to 0, so let's transform it to 0. Given these values, the model should predict the insurance cost.

There are a few things we need to do here. This is a tuple, because we have stored the values in parentheses, and I want to change it into a NumPy array. So, changing input_data, which is a tuple, to a NumPy array, because it is easier to do processing on NumPy arrays than on tuples. Let me name this input_data_as_numpy_array. We have imported the NumPy library as np, so use the np.asarray function and mention input_data inside the parentheses. This converts the tuple to a NumPy array, and we store it in the variable input_data_as_numpy_array. So np.asarray is the function. Let's run this.

There is another thing: now we need to reshape this array, so reshape the array. I'll name this input_data_reshaped, and I'll explain in a minute what the reshaping does. So it is input_data_as_numpy_array.reshape(1, -1). If we don't reshape, the model doesn't know that we are predicting for a single data point. We are giving just one row, one data point, but while training we used 1070 data points in X_train, arranged as rows and columns, so the model expects its input in the same two-dimensional form, with one row per data point. We want the model to understand that we need the value for one particular data point, and that is why we reshape to (1, -1): one row, with as many columns as there are values. That is what this reshape does. So now we can find the prediction.
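The tuple-to-array conversion and the reshape can be seen in isolation with a short sketch. The age, bmi and children values below are illustrative assumptions; the notes only fix the encoded values female = 1, non-smoker = 1 and southeast = 0.

```python
import numpy as np

# (age, sex, bmi, children, smoker, region); age, bmi and children are made up.
input_data = (31, 1, 25.74, 0, 1, 0)

# Tuple -> one-dimensional NumPy array with six values.
input_data_as_numpy_array = np.asarray(input_data)

# One row and six columns: a single data point in the two-dimensional
# (n_samples, n_features) layout that a fitted model's predict() expects.
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

print(input_data_as_numpy_array.shape)
print(input_data_reshaped.shape)
```

The -1 tells NumPy to work out the number of columns from the number of values, so the same line works for any number of features.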
The model which we use is regressor, so it is regressor.predict, and the predict function will give us the charges value. So predict input_data_reshaped: we need to feed this reshaped input array to the linear regression model. Let's print this prediction, which will print the medical insurance cost. Let's run this. The value we are getting is 3760. You can go to the data set: for the data point we took, the actual medical insurance cost is 3756 dollars, and the value predicted by our model is 3760, which is very close. So for this data point our model is performing quite well. This is how you can build a predictive system.

I'll print one more thing. As you can see, the value is printed inside square brackets, which means the prediction comes back as an array, like a list. We want to print the value on its own, so let's print the insurance cost is USD, followed by prediction[0]. Why am I mentioning prediction[0]? The prediction is like a list which contains only one value, so if we want that one value alone we need to mention prediction[0]. Zero is the index of this value, and that's why I'm writing prediction with square brackets and zero. If you are not familiar with the list and tuple data types, you can go to the second module, Python Basics for Machine Learning, where I explain all the different data types in Python. Let's run this, and as you can see it gives us the insurance cost as 3760, which is very close to the real value. So this is how you can build a predictive system using machine learning.

I hope you have understood all the things covered in this video. I'll just give you a quick recap of what we are doing
in this video. I have already told you how you can upload the data set, and I'll give the link for the data set and for this Colab file in the description of this video. Please practice this code: don't just download the Colab file and run it once, but write the code by yourself, as it will help you understand things better.

So the first step is to import all the dependencies, the necessary libraries and functions. Then we saw how to import the data into a pandas data frame, printed the first five rows of the data frame, and found how many rows and columns we have. We saw which features in the data frame are categorical and checked whether there are any missing values. We did some data analysis by getting some statistical measures, finding the distribution of values and making count plots of some columns. Once we had done this data analysis, the next important thing was to encode the categorical features into numerical values. Then we split the data into features and target, and after that into training data and testing data. Then we saw how to train a linear regression model, and I also explained how a linear regression model works. Then we evaluated our model, and finally we built a predictive system that, given all these parameters, can find the medical insurance cost. So we have achieved what we wanted to do in this project.

I hope you have understood all this. If you have any doubts, you can reach out to me in the comments or on social media; I have given the links for my Telegram channel, LinkedIn profile and Facebook group in the description of all my videos. Please share this with your friends who you think might be interested in machine learning and artificial intelligence. So I hope
you have understood everything. Thanks for watching.
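The training, evaluation and prediction steps from the recap can be put together in one short sketch. It is only an illustration under stated assumptions: the data is synthetic (three made-up features with a known linear relation plus noise) rather than the insurance data set, so the scores it prints are not the 0.751 and 0.744 reported in the video. The variable names follow the ones used in the notes.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: Y depends linearly on three features, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
noise = rng.normal(0, 0.5, size=200)
Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 + noise

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

# Fit the line to the training data only.
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# R squared on training data and on unseen test data; similar values
# suggest the model is not overfitting.
training_data_prediction = regressor.predict(X_train)
r2_train = metrics.r2_score(Y_train, training_data_prediction)
test_data_prediction = regressor.predict(X_test)
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print("R squared value (train):", r2_train)
print("R squared value (test):", r2_test)

# Predict for one new data point: tuple -> array -> one row.
input_data = (5.0, 2.0, 1.0)
input_data_reshaped = np.asarray(input_data).reshape(1, -1)
prediction = regressor.predict(input_data_reshaped)
print("Predicted value:", prediction[0])
```

To try this on the real data, X and Y would instead come from the encoded insurance_dataset, with charges as the target.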

hello everyone this is siddharthan this is the 11th project video in our machine learning project series and in this video we are going to see how we can build a machine learning system that can predict what is the medical insurance cost of a person okay in case you are watching my videos for the first time hi in this channel i&#39;m making a project based machine learning course with python you can check out the playlist section in my channel to more about this machine learning projects and my machine learning course okay so with that being said let&#39;s get started with today&#39;s video and in this video i would like to explain you more about this problem statement first and then as we always do i would like to explain you about the workflow which we are going to follow and then let&#39;s get uh you know into the coding part using python okay so this is the problem statement of this video so let&#39;s say that you are a machine learning expert or you are a data scientist and that is a medical insurance company and this company wants to create automatic system that can predict what is the medical insurance cost of a person will be using some methods okay and they are approaching you as you are a machine learning expert okay and now we need to build a machine learning system that can you know learn from the data and it can predict what this price can be what is this insurance cost can be okay so this is what we are going to do in this video so now let&#39;s understand what is the workflow that we have so the first step in machine learning is to collect the data here we need insurance cost data so as we as we i have told you earlier we are going to predict the insurance cost and for this we need some data so what happens is our machine learning algorithm will learn some parameters so these parameters can be what are some health issues that person had and what is the gender of the person and other such kind of things and based on this the insurance cost varies and 
we need this data to feed our machine learning model okay so the first step is to collect the data for this project okay so as once we have this insurance cost data we need to do some data analysis so we need to analyze this data to understand whether it can give us some meaning so what is this data all about and such kind of things so data analysis helps us to gain the meaning the data has to tell okay so in this we just make some plots and try to plot the data in some graphs and see what is the relationship it has okay so that that kind of thing is done in data analysis so the next part is data preprocessing so once we have the data we cannot feed it directly to our machine learning algorithm so we need to do some processing on it this is where data preprocessing comes into play so once we process the data it will be compatible to go into the machine learning model then the next step will be to split our data into training data and testing data okay so in machine learning we train our machine learning model with this training data and we evaluate our model with this test data okay so this evaluation is just to uh see how well our model is performing okay so it is just to check the performance of our model so once we split the data into training data and testing data we feed this training data to our machine learning model so in this case we are going to use a very simple model called as linear regression model okay so once we train this linear regression model we will evaluate this model to check how well it is performing so after that we will have a trained linear regression model okay so once we have that we can feed new data and once we feed this new data this model can predict what is the insurance cost will be okay so this is the workflow which we are going to follow okay so linear regression model is most of a statistical model rather than a machine learning model but you know it is the base for other models so people think uh you know machine learning and 
statistics are two different things but it is not the case machine learning is built on top of statistics okay so these kind of basic statistical model is also very important for machine learning okay so i&#39;m just doing this linear regression model because we haven&#39;t used this model in our previous project videos so that&#39;s why so you can also use other regression models as well like xg boost regressor and the other regression models and also while we are implementing this linear regression model i will be explaining you how this model exactly works okay so this is what we are going to see in this video so now let&#39;s get into the coding part okay so before starting the video i would like to give you a quick intro to my channel so as you can see here once you go to my channel you can save this learning course curriculum video so in this i have explained you what are the different videos and modules i will be covering in this channel and you can go to this playlist section so that you can find these modules so this is the hands-on machine learning course that i have posted so so far we have completed about four modules and we also have this machine learning project video so totally we have about 11 project videos and i will be uploading three videos per week every friday i will be uploading one machine learning project video and monday and wednesday i will be uploading videos on this course order okay so totally i will be uploading three videos in a week on monday evening wednesday evening and friday evening okay so the programming the python programming which we will doing is in google collaboratory in case you are not aware of this google collaboratory you can go to this second module python basics for machine learning if you go there you can see this google collaboratory for python in this video i have explained how you can access google collaboratory and what are the different features in it okay so now let&#39;s get started with today&#39;s video 
just a second okay so there will be this connect option here so you need to connect your system and i have uploaded this data set here so this is the insurance data set so you can search for this data set in kaggle and also other sites you will get it and also give the link for this data set file in the description of this video okay so you just need to go to this files option and then give this upload to session storage or just click right click here okay there will be this upload option so from there you can upload your data set here so this is the data set which we are going to use it is insurance dot csv okay so the first step in in a project or in any python program is to import the dependencies so i&#39;ll make a text here as importing the dependencies so dependencies are nothing but the libraries and functions which we need for our program so first of all i will import some basic libraries so the first library is numpy so import numpy scnp so numpy library is used to make numpy arrays okay so we need some arrays for our processing in various stages of our project so i&#39;ll import numpy and next is pandas okay so i&#39;ll input pandas as pd so pd you know pandas is useful for making data frames okay so as you can see here the data set file which we have is a csv file so csv means comma separated value and it is not easy to analyze the data from uh you know csv file so we use this pandas data frame so data frames are more structured table so once we feed this or once we load the dataset to a pandas dataframe it is easier to do some processing and analysis on the data so next we are going to import two other libraries they are math plotlib math.lib.pi plot as plt so this library is used to make some plots and graphs and i&#39;ll import another library c bond so c bond is also a data visualization library it is used to make some plots so import c bonus sms okay so these are the short forms so i don&#39;t want to use numpy in all the codes so i have imported it 
in a short form as cnp so that is what this as np and spd does so it is just a short form for this library okay so import cbon as sls then you need to import another thing from sklearn dot model selection we need to import drain test spread so as i have told you we need to split our data into training data and testing data and this train test split function will help us do it and from sklearn dot linear model we need to input our linear regression model so linear regression so pay attention to which letters are in caps and which letters are in you know in lower case okay so linear regression and finally from sklearn input matrix so this matrix is used to perform some evaluation on our model okay so we have imported numpy for arrays pandas for pandas data frame matplotlib and c bond for plots and we have imported the strain test split function from scale under model selection and linear regression from linear model module okay so these are the dependencies which we need so the next step is data collection and processing data collection and okay so let&#39;s put data collection and analysis so we are going to analyze this data okay so you can find this insurance data in kaggle just go to google and search as insurance cost data medical insurance cost data okay so now we need to load our data in the csv file to a pandas data frame so i&#39;ll just make a comment here as loading the data from csv file to a pandas data frame okay so i&#39;ll create a variable so let&#39;s name our data frame as insurance data set okay so insurance dataset is equal to so we are going to use pandas so pd right so a short form for pandas so pd dot read csv this read csv function will read the csv file and load the content to a pandas data frame so go here you can find this options menu here so go to that options and you can see this copy path so copy the path of this file and you can paste it here okay so now you can run this cell by pressing shift plus enter so this will run this cell and 
go to the next one. This will load the data from the CSV file to this insurance_dataset data frame. Okay, so now we can understand more about this data. You can print the first five rows of the data frame: mention the data frame name, which is insurance_dataset, then .head(). This head function will print the first five rows of the data frame. As you can see here, totally we have seven columns, excluding the serial number column: the age of the person, the sex or gender of the person, the BMI (body mass index) of that person, how many children that person has, whether that person is a smoker or a non-smoker, their region, so the place they are from, and their insurance cost in charges. This value is in US dollars; this is a US data set, and the region represents a region in the United States: southwest, southeast, northwest and northeast, so these are the four regions we have. The charges are in US dollars, so the first is nothing but 16,884 US dollars, the second is 1,725 US dollars, and so on.

So totally we have seven columns. What we are going to do is train our machine learning model with this data. Once it has trained, we will give it new data from age to region, and we won't give the charges. Once we give this data, our machine learning model can predict what the cost can be. This is what we are going to accomplish in this particular project. So you can see what the different columns are, and now let's understand more about them. I would like to see how many data points this dataset has, so now we are going to find the number of rows and columns in this data set. Okay, so again mention the data frame name,
insurance_dataset.shape. This shape attribute will give us the number of rows and columns. As you can see here, totally we have 1,338 rows; rows means the data of different persons, so we have data from 1,338 persons. And seven represents the columns, so each person has these seven values. It is not a very small data set and it is not a very huge data set; it is kind of mid-range. In machine learning we also deal with data sets with thousands of rows and even lakhs of rows, so it's important to know the size of the data set we have.

The next step is getting some information about the data set. It is always a good practice to mention what you are doing in a particular line of code. Now let's do this: insurance_dataset.info(); you need to mention the parentheses. Run this. As you can see here, it tells the data type of each column and how many values we have. Totally we have 1,338 entries and seven columns, and these are non-null values. Null values means missing values; non-null values means values that are present, so no values are missing. And we have the data types integer, object, float, etc. What this means is: for the sex column we have the data type object, and for smoker and region we also have object. These are nothing but classes, not numerical values, whereas age, BMI, children and charges, these four columns, are numerical values. The sex column contains either male or female, smoker contains either yes or no, and region contains only four values: southeast, southwest, northeast or northwest. So we call these features
categorical features, as they contain only categories and not particular integer values. These categorical features are very important, and we also need to do some processing on them. So I'll just put a text here: totally we have three categorical columns in this data set. The categorical features are: first, sex, the gender of the person; second, the smoker column, whether that person is a smoker or a non-smoker; and the third categorical column is region.

You may be wondering what is meant by features. As I have already told you, once we give all these values, our model can predict this particular column, charges. The value which is of interest, which we are going to predict, is called the target variable. So in this case charges is the target variable, and the remaining columns are called features. These six columns are features and the charges column is the target, and among the features we have three categorical features, which are just categories, like yes or no, or male or female.

Okay, so the next step is to understand whether the data set has any missing values. As I have told you already, info also tells us this, but there is a standalone way to check. So, looking for missing values: insurance_dataset, which is our data frame, then .isnull().sum(). This particular line of code will give us the number of missing values in each column: if a value is null, it is counted in the sum for its column. As you can see here, we don't have any missing values, so it is a very good data set. In case we have any missing values, we do some processing on them. If you want to know how we can handle missing values in a data set, you can go to the fourth module in my machine learning course; it's on data collection and
pre-processing, and in that I have explained how you can handle the missing values in a data set. In this case we don't have any missing values, so we can proceed with the further steps.

The next part is data analysis; we are going to analyze this data using some plots. First let's get some statistical measures of the data set: insurance_dataset.describe(). This describe function will give us statistical measures like mean, standard deviation, percentiles, etc. As you can see, it gives them only for the numerical columns and not for the categorical features; we cannot find the mean value and standard deviation for categorical features, so that's the reason. First it tells the count, the number of values we have in each column, so we have 1,338 values in each column. Then we have the mean: the mean of the age column is 39, the mean BMI is 30.6, the mean number of children is 1.09, and likewise for charges. Then we have the standard deviation, the minimum value in each column, the maximum value in each column, and then these three percentiles. These are percentiles, not percentages; they are different. What it means is: the 25th percentile for age is 27, so 25 percent of the values are less than 27, and 50 percent of the values are less than 39. That's what percentile means. So we get some statistical measures of the data, and they tell us how the data is distributed: the central tendency, and mean is one measure of central tendency, the standard deviation, and such things.

Now let's understand more about this. What we are going to do is find the distribution of the data set. The first column is age, right? So we will analyze this column by column.
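The difference between a percentile and a percentage described above can be checked on a small example. This is only a sketch: the ages below are made up for illustration and are not rows from the insurance data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical ages, made up for illustration; not rows from the insurance data.
ages = pd.DataFrame({"age": [18, 19, 23, 27, 31, 39, 45, 52, 64]})

# describe() reports count, mean, std, min, the 25th/50th/75th percentiles and max.
summary = ages.describe()
print(summary)

# The 25th percentile is a value in the column's own units (an age here),
# not a percentage: roughly a quarter of the ages fall below it.
q25 = summary.loc["25%", "age"]
share_below = (ages["age"] < q25).mean()
print(q25, share_below)
```

With only nine values the share below the 25th percentile is not exactly one quarter; on a column with 1,338 values it would be much closer.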
So first we are going to find the distribution of the age values. I'll use sns.set(); this sns.set will give us some better-themed plots. We want grids in our plots, and such kind of things are given by this sns.set function; I'll explain this later. Then plt.figure. If you remember, I have imported seaborn as sns, so that's what I'm using here, and I have imported matplotlib.pyplot as plt as a short form, so that's what I'm using in this line. So sns.set and plt.figure, and in this we need to mention the figure size we want: figsize is equal to (6, 6). You can give other values as well; it is just the size of our plot. Then sns.distplot; distplot represents distribution plot, and it tells us how the values are distributed throughout the data set. Inside the parentheses mention the data frame name, insurance_dataset, and the column name. If the column name is in lower case, you should write it in lower case as well; here all the column names are in lower case. First we are taking the age column, so mention it in the square brackets: I'm mentioning the data frame, and from this data frame I want to take the age column, and the name should be in quotes. The next step is to give a title for your plot, so plt.title, and let's name this plot Age Distribution. Now we need to use plt.show(), so let's run this.

Okay, so this is our distribution. This sns.set is used to make the grids and other themes; you can try once without sns.set and once again with the sns.set function, and it shows you what difference the sns.set function is creating. Now let's understand the distribution. You can see the age values from 10 to 70, and we have maximum density, the most number of values, in this 20, 21, 22 range, right?
There are a lot of values there, whereas from maybe 23 or 24 to about 70 the distribution is almost even, so the distribution is almost equal in this age range. Only the ages around 20 have more, which means more people in our dataset are of this particular age: 20, 21 and 22. So this is the age distribution.

Now the second column we have is sex, the gender column, so let's try to plot this. We cannot plot a categorical feature in this kind of distribution plot, so we need to use a count plot for categorical features; I'll explain what is meant by a count plot. We have plotted the distribution of the age column, and now we are going to check it for the gender column. So plt.figure; you don't need to mention the sns.set function every time, running it once is enough, and it will be incorporated in all your plots. So plt.figure, and let's mention the figure size, (6, 6). Now we are going to use a count plot: sns.countplot, x is equal to the column name, and we are taking the sex column, so put that column name in x, and mention the data which you are taking; in this case the data is our data frame, insurance_dataset. This will give us a count plot. In this sex column we have two types of values, male and female, so this count plot gives the number of data points for the male value and the number of data points for the female value. That's what a count plot is. Then plt.title to give it a name; let's put it as Sex Distribution. And plt.show(); let's run this. Okay, I just made a mistake here: it should be figsize. So you can see here, now we have the count for these two kinds of values, and it is almost equal, right? The distribution is almost equal for both genders, okay, so the
value is not very high for only one of the two categories, okay. There is another way to understand this: use insurance_dataset and mention the column name; here we are analyzing the sex column, so ['sex'].value_counts(). This value_counts function will give us the number of values we have for male and the number of values we have for female. I'll run this. Totally we have 676 values for male and 662 values for female. So this is how you can use the value_counts function to check how many values there are for the two categories.

Okay, so let's go in this order: the next column is BMI. BMI is not a categorical column, so we need to use a distribution plot. I'll just copy this particular block of code; we just need to change a few things. We are going to check the BMI distribution in our data set. I will paste this, and in this you need to change age to bmi, because we are checking for BMI, and let's rename the title to BMI Distribution. BMI represents body mass index; it tells whether a person is overweight, underweight or in the normal range. We call this kind of distribution of data a normal distribution: most of the values are in the mid range, which is around 30 in this case, and we have a similar kind of distribution on either side of this peak. From 15 to 20 or 25 we have this kind of growth, there are more values around the BMI value of 30, and it reduces after that.

Here there is one important thing that I have to tell you: for a person, the normal BMI range is 18.5 to 24.9. As we know, BMI represents body mass index, and it is calculated using the height and weight of a person. If the BMI of a person is less than 18.5, then that person is
underweight, and if the value is in this particular range, 18.5 to 24.9, then the BMI is in the normal range, and if the BMI is more than 24.9, then that person is overweight. Now if you look at this data, we have more values from 25 to around 35 or 40, so a lot of people in this particular data set are overweight. This is one important thing to note here, and it can affect the insurance cost a person gets: it might increase the insurance cost if this is the case. So this helps us understand the distribution, and we get a roughly normal distribution.

Okay, so the next column after BMI is children, the number of children. It is a numerical value, but in this case, instead of using a distribution plot, you can just use a count plot as well, because this children column contains only six values: zero, one, two, three, four and five. So let's do this for the children column: plt.figure, and let's mention the figure size, (6, 6), and then sns.countplot. In x we need to give the column name, which here is children, and the data we are taking is insurance_dataset. Then plt.title to name it Children; it's the same procedure which we have done for the sex column. Finally we can use plt.show(). Now you can see here that we have more people who don't have any children; the count is around 550 or 570. The number of people with one child is again more compared to the other values; we have a little more than 300. This is the distribution we have for the children column. Let's also use the value_counts function to see exactly how many data points we have for each of the different values: insurance_dataset, then we need to mention the column name, so ['children'].value_counts(). Now you can see the result.
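As a small illustration of what value_counts returns, here is a sketch on a made-up column; the data frame name demo and its values are assumptions for the example, not the lecture's data. The bar heights in a count plot of the same column would be exactly these counts.

```python
import pandas as pd

# Hypothetical column values, made up for illustration.
demo = pd.DataFrame({"children": [0, 1, 0, 2, 0, 1, 3, 0]})

# value_counts() returns one entry per distinct value, sorted from most to least frequent.
counts = demo["children"].value_counts()
print(counts)

# The counts always add up to the number of rows.
assert counts.sum() == len(demo)
```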
There are 574 data points, so 574 people who don't have any children, 324 people with one child, and so on.

Okay, so the next column which we need to understand is the smoker column. For smoker also you can use a count plot, as we have used for the sex column and the children column. So I'll just copy this block of code, and this is for the smoker column, so let's rename this to smoker. Now we can see here that the number of non-smokers is more in this data set, more than a thousand, and we have smokers in the range of about 200 or 250. We can also use value_counts to see the exact numbers. It is kind of important, when it comes to data science, to understand how the data is distributed in a data set; it gives you a clear picture of the different variations in the data. So here we need to change this to smoker. Okay, so we have two values: 1,064 values for non-smokers and 274 values for smokers.

The next column we need to understand is region: southeast, southwest, northeast and northwest. I'll paste this; this is for the region column, so x is equal to region. Now as you can see here, we have totally four regions, southeast, southwest, northwest and northeast, and the count is almost similar for all of them; it is a little more for the southeast. Now let's use the value_counts function to see the exact numbers: insurance_dataset['region'].value_counts(). Okay, so this is the distribution for this column.

Finally there is only one column left, which is charges. This is numerical data, so we cannot put it in a count plot, because we can have a very large number of distinct values. In the case of the categories we have only two or three values: say, in the sex column we have only male and female, for smoker
we have only yes or no. But when it comes to age, BMI and charges, the range of values can be very large and we cannot use count plots there, so we need to use a distribution plot to see how this range is distributed instead of individual values. I'll just copy this distribution plot and change the column name to charges. So, distribution of charges; charges represents the insurance cost. You can remove this sns.set, and let's title it Charges Distribution. Okay, so you can see here that we have a lot of data distributed around ten thousand dollars, and we have very few values around thirty thousand and forty thousand dollars. So this is how our charges are distributed. This is the data analysis part, which we can use to understand how our data is distributed and other such things.

Okay, so the next step is data processing, which we usually call data pre-processing. There is one important thing which we need to do here. As you can see, we have three categorical columns: sex, smoker and region. The sex column has female and male as its values, smoker has yes or no, and region has totally four values. We cannot feed this text data to our machine learning model: computers don't understand text well, but they understand numerical values well. So what we are going to do is create numerical labels for these. Say, we will convert female to 1 and male to 0, and for smoker maybe yes to 0 and no to 1, and for southeast, southwest, northeast and northwest we give one numerical value to each of these classes. This process is called encoding, so we are going to encode the categorical features. I'll explain how we can do this. The first categorical column we have is the sex column, so: encoding sex column. Okay, so mention the data frame,
insurance_dataset.replace. This replace function will replace all the text data with the appropriate values which we want. So insurance_dataset.replace, open a parenthesis here, and then we need to put a curly bracket and mention the column of the data frame which we need to change. Here we need to change the sex column, so mention 'sex', and in that, inside another curly bracket, the male value we want to encode to zero, and female we will be transforming to one. Then inplace is equal to True: I want to change it in the place of those values, so I mention it here. So these are the parameters which we need to mention. What we are basically doing is replacing some values: I am taking the sex column, and in that column I want to replace all the male text with zero and the female text with one.

The next categorical feature we have is smoker, so make a comment here: encoding smoker column. Again mention the data frame, insurance_dataset.replace, and now the column name is smoker; you can refer to the column names if you are not sure. If the value is yes, if that person is a smoker, let's put 0, and if that person is a non-smoker, then we can encode this value to 1. And inplace is equal to True as well. So we are replacing the values of the smoker column: yes we are transforming to zero and no we are transforming to one.

And the final categorical column we have is the region column: encoding region column. Here we totally have four different kinds of areas, as we have found through value_counts: southeast, northwest, southwest and northeast. So I'll mention the data frame here, insurance_dataset.replace, and for how we want to replace, I mention the column
region, and if the value is southeast, I want to transform it to 0, if the value is southwest, let's transform it to 1, if the value is northeast, let's transform it to 2, and finally we have northwest, so all the northwest values should be transformed to 3. Okay, so we have encoded the four values. We are replacing in the region column: southeast, southwest, northeast and northwest, and the labels are zero, one, two and three. I'll just check it once: insurance_dataset.replace, male 0, female 1; for smoker, yes is 0 and no is 1; and southeast, southwest, northeast and northwest. Okay, so let's run this. There is some mistake here; okay, so it shouldn't be this. Okay, so this has successfully transformed our columns. You can see the values here: previously we had these text values, female and male, smoker yes or no, and such things. All of them are transformed to their corresponding numerical labels. As you can see here, if the value is female it is transformed to one, and for male it is zero; for non-smoker it is one and for smoker it is zero; and then also for region. So this is how you can encode the categorical features.

Okay, so the next step is to split the features and target. As I have already told you, these six columns will act as the features and the charges column will act as the target, because this is what we are going to predict. So, splitting the features and target. I'm going to create two variables, X and Y; let's put all the features in X and the target in Y. Mention the data frame, insurance_dataset.drop. This drop function can drop a row or a column. I want to drop a particular column, and the column I want to drop here is charges. So 'charges', and axis is equal
to 1. As I have told you, these six columns are the features, so I want to remove the charges column and store all the remaining values in X, and I want to store the charges column in Y. That is why I'm using this drop function. You can see this axis parameter: if we are removing a column, we need to mention the axis as one, and in case we are removing a particular row, we need to mention the axis as zero. So one represents a column and zero represents a row. Now let's put the target variable in Y: insurance_dataset, and inside the square brackets mention 'charges'. Let's run this. This will split the data into features and target, and you can go ahead and print X and Y separately. So print capital X.

Okay, so we have some mistake here: the region column is not encoded properly, so I have to change something. The problem is that I haven't included the inplace parameter for the region column, so the replacement was not saved to the data frame. We need to include it here, so inplace is equal to True, and this will rectify our problem. I want to restart and run this again: you can go to Runtime and choose 'Restart and run all', because otherwise the already encoded data would be encoded again, and that would create a problem. So I'm restarting the entire runtime and running all the cells. Okay, it has run, and now you can see that the region column is encoded correctly here as well. So the problem was that I hadn't included inplace there. Now we can print Y. Y contains the target column, which is charges. So X contains all the columns except charges, and Y contains the charges column. This is how you can split the features and target present in a data set.

So the next step, as I
have told you in the workflow, is to split our data set into training data and testing data. So, splitting the data into training data and testing data. We need to create four arrays here: the first one is X_train, the second one is X_test, and then Y_train and Y_test. I'm going to use the train_test_split function, and I'll explain what is meant by these four arrays. In the parameters we need to mention X, Y, because we are going to split X and Y into training and testing data, and then mention the test size you want. test_size is how much data you want in the testing data; generally we take 10 percent or 20 percent of the data as testing data, and 0.2 represents 20 percent. So we are taking 20 percent of the entire data as testing data, and 80 percent goes to the training data. The next one is random_state; I'll explain in a minute what is meant by this random state.

Okay, so you can see here we have created four arrays: X_train, X_test, Y_train and Y_test. What happens is, X will be split into two arrays: one is X_train and the second one is X_test. The corresponding target values for X_train, the charges, go to Y_train, and the corresponding charges values for X_test go to Y_test. I hope you are getting what we are doing here: X is split into training and testing data, and the corresponding labels for the training data go to Y_train, and for X_test they go to Y_test. So X_train and X_test are the split of X, and Y_train and Y_test are the corresponding split of Y. That is why we are mentioning X and Y in this train_test_split function, and here we have mentioned the size we want, 20 percent test data. Finally we have this random_state parameter. So what is its use? Let's say that you run this command with some other value for it, like three or four.
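The four-array split described above can be sketched on synthetic data. The arrays below are stand-ins with the same shape as the lecture's data (1,338 rows and 6 features), not the real insurance values, and random_state=2 is the value the lecture uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shape as the lecture's data (1338 rows, 6 features).
X = np.arange(1338 * 6).reshape(1338, 6)
Y = np.arange(1338)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
print(X.shape, X_train.shape, X_test.shape)

# Rows and targets stay aligned: row i of X_train belongs to Y_train[i].
# (In this synthetic data the first feature of row k is 6 * k and its target is k.)
assert (X_train[:, 0] // 6 == Y_train).all()

# Repeating the call with the same random_state reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, Y, test_size=0.2, random_state=2)
assert (X_train == X_train_again).all()
```

The alignment check is the point of passing X and Y together: each feature row keeps its own target value after shuffling.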
Then your data will be split in a different manner, okay? But if you mention the same value that I am mentioning here, so if you put random_state is equal to 2, then your data and my data will be split in the same manner. It is just to split the data in the same manner in two different instances; basically, it is there to make the code reproducible. So this is the splitting of data into training data and testing data. Let's run this, and now we can check the shape of these arrays: print X.shape, X_train.shape, X_test.shape. X is the original data set, X_train represents the training data and X_test represents the testing data. This will tell us how many values we have. Totally we had 1,338 data points; of that, 80 percent of the data, which is 1,070, goes to X_train, and 20 percent of the entire data, which is 268, goes to X_test. So this is how you can split your data into training data and test data.

Okay, so the next step is training our model, so: model training. As I have mentioned at the start of this video, we are going to use a linear regression model. Before implementing this, I'll just explain how this linear regression model works, so let me go to my slide. Let's say that we have two axes, the x-axis and the y-axis, and we are taking the features on the x-axis and the target on the y-axis. The features in this case are age, sex, BMI, number of children, smoker and region, so these six columns are the features, and the target column is nothing but charges. In this plot the data points are plotted; let's say that we have some data points and this is the plot we get. When we use a linear regression model, it tries to fit this data. So linear regression is a
linear model, right? It will try to fit this data, and we will get a best-fit line for these data points. The equation for this line, as you know, the equation of a line, is Y = mX + c, and these are the variables in the equation: X represents the input features, Y represents the prediction, m represents the slope and c represents the intercept. In this case X, the input features, are those six columns from age to region, and Y, the prediction we are making, is for the charges column. With six features the model learns one slope per feature, but the idea is the same. Then we have the slope and the intercept. If we have different data points, we will get a different slope, and this line will be tilted at a different angle. So changing the data points will change the slope, and that will change the orientation of this line. And c represents the intercept: the intercept is the distance from the origin to the point where the line crosses the y-axis, and this can also change based on the data we have. So this is what happens: the model will try to fit the data with a line, and based on that fit it tries to predict new values. Let's say that we have a new data point with only the X values. What it will do is take this X value and check the corresponding Y value on the line, and this Y value becomes our charges, the insurance cost. So this is how it can predict; this is how a simple linear regression model works.

Now we can implement this in Python, so I'll go to my Google Colab. If you remember, we have imported the linear regression model from sklearn.linear_model, so we need to use this LinearRegression function now. So, loading the linear regression model. Let's name the regressor as regressor, which is equal to LinearRegression(). So regressor is equal to LinearRegression(); this will load one instance of this linear regression
model to this particular variable. Now we need to fit this model to our training data. Here the training data is X_train and the corresponding Y values are in Y_train, so we need to fit with X_train and Y_train. Mention the model, which here is regressor, so regressor.fit; I've explained what happens in a fit. So it's regressor.fit(X_train, Y_train): we are fitting on the features of the training data and the labels of the training data. Let's run this. This will fit the line to these data points, as I have explained in that graph.

So the training of our model is done, and now we need to evaluate it. Model evaluation: for this we are going to use the R squared value. Before that we need to make some predictions. First let's predict the charges, the cost, for all the training data, and then let's predict for the test data. So, prediction on training data: let's create a variable named training_data_prediction, which is equal to the model, regressor, so regressor.predict, and in this predict let's mention X_train. Let's run this. Note that I'm just giving X_train; I'm not mentioning Y_train. What happens is, I give X_train and the model predicts the Y values for it, and those predicted values are stored in training_data_prediction.

Now we need to find the R squared value; this is one performance metric. The R squared value typically lies in the range of 0 to 1, and as a general convention, if the value is close to one, we can say that our model is performing well. But it is not the same in all cases: in some problems even a lower R squared value can be acceptable, so it differs in different cases. So I
will explain in detail how these evaluation metrics work in a separate video on model training. For now, let's see how we can find this R squared value. This is the R squared value for the training data, so let's name it r2_train. We have imported the metrics module from the sklearn library, so we are going to use it; you can refer to the importing dependencies cell. So it's metrics.r2_score, which will give us the R2 score, and in that mention Y_train and training_data_prediction. What we are doing is comparing the original values in Y_train with the values predicted by our model. When you run this we will get a score, so I'll just print that score: print 'R squared value' and r2_train. This gives us the R squared value, and here the value is around 0.751. As I told you, since it is close to 1 the value is good in this particular case, so we get a reasonably good R2 score. This is a simple linear regression model; if you use a more advanced regression model such as the XGBoost regressor, you may get better values as well. We have already used the XGBoost regressor in one of our regression projects.

So this is our R squared value for the training data, but this alone is not sufficient. We need another important metric: the R squared value for the testing data. As I have told you, the evaluation on testing data is important because our model didn't see this test data. So it's test_data_prediction = regressor.predict, and now we need to change the argument to X_test. Let's run this. Now we need to find the R2 value, so let's change a few things here: let's change this to r2_test, and here Y_train should be changed to y
test (Y_test). We are doing the same thing, but predicting for the test data. Let's run this. Here we get a value of 0.744, so the two values are close to each other.

You may ask why we are predicting for both the training data and the testing data. I have already told you that evaluation on testing data is important, but why are we finding it for the training data as well? There is one important thing to note here: there is a concept called overtraining in machine learning, which we also call overfitting. In that case the model overlearns the training data, which is not what we want. In such cases we get a high score on the training data and a much lower score on the testing data. For a good fit, for good performance of the model, these two values for training data and testing data should be almost equal to each other. In this case they are almost equal, so we don't have this overfitting issue. So this is how you need to evaluate the model.

Finally, there is one more important step, the final step of our project, which is building a predictive system that can predict the cost given all these parameters such as age, sex, BMI, etc. I create a variable called input_data. Let me open the data set, the one which we used; I'll just open it in a Notepad file. I'm going to select a random row here. Let's take this one; you can take any row as well. Notice that I'm not including the charges value: I'm taking only the values from age to region. You can see the column names here, age to region, and given these values the model should predict what the insurance cost is. So I'll copy these values and put them here.
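
A minimal sketch of the training and evaluation steps walked through above. The real notebook works on the insurance data set; since that file is not reproduced here, this sketch uses synthetic stand-in data (an assumption), and the 80/20 split and the random_state value are also assumptions rather than settings confirmed in this part of the video. The variable names follow the ones spoken in the video.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six encoded features and the charges column
# (assumption: the real notebook loads these from the insurance data set).
rng = np.random.default_rng(2)
X = rng.normal(size=(1338, 6))
coefficients = np.array([250.0, -20.0, 330.0, 500.0, -2000.0, 150.0])
Y = X @ coefficients + 13000.0 + rng.normal(scale=1000.0, size=1338)

# Assumed 80/20 split, which leaves 1070 rows for training.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

# Train the linear regression model on the training data only.
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# R squared on the data the model has already seen.
training_data_prediction = regressor.predict(X_train)
r2_train = metrics.r2_score(Y_train, training_data_prediction)

# R squared on data the model has not seen.
test_data_prediction = regressor.predict(X_test)
r2_test = metrics.r2_score(Y_test, test_data_prediction)

print('R squared value (train):', r2_train)
print('R squared value (test): ', r2_test)
```

A training score far above the test score would point to overfitting; similar scores, as in the video, do not.
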
I put these values in input_data. Once we give these values, the model should predict our medical insurance cost. But there is another main thing to note here: we have encoded our categorical values. We encoded female as 1, smoker yes as 0 and no as 1, and so on. Now we need to do the same thing for this data. For female we used the value 1, so we need to change this female to 1. Then we have no for smoker, and no should be transformed to 1, so let's change it to 1. Finally we have southeast, and southeast should be changed to 0, so let's transform it to 0. Given these values, the model should predict the insurance cost.

There are a few things we need to do here. This is a tuple: we have stored the values in parentheses, so the data type is tuple. I want to change this into a NumPy array, because it is easier to do some processing on NumPy arrays than on tuples. Let me name the result input_data_as_numpy_array. We have imported the numpy library as np, so use the np.asarray function and, inside the parentheses, mention input_data. This converts the tuple to a NumPy array, and we are storing it in a variable called input_data_as_numpy_array. So np.asarray is the function. Let's run this.

There is another thing: now we need to reshape this array. I'll name the result input_data_reshaped, and I'll explain in a minute what this reshaping does. It's input_data_as_numpy_array.reshape(1, -1). If we don't do this reshape, the model doesn't know that we are predicting for one data point. What we are doing is giving just one row, right?
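
The conversion and reshape steps can be sketched like this. The numeric values in input_data are placeholders chosen for illustration (an assumption, not necessarily the row picked in the video); the codes 1, 1, and 0 follow the encoding stated above for female, non-smoker, and southeast.

```python
import numpy as np

# Hypothetical, already-encoded row in column order:
# age, sex, bmi, children, smoker, region
input_data = (31, 1, 25.74, 0, 1, 0)

# Convert the tuple to a NumPy array: a flat array with 6 values.
input_data_as_numpy_array = np.asarray(input_data)
print(input_data_as_numpy_array.shape)   # (6,)

# Reshape to one row; -1 lets NumPy work out the number of columns.
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
print(input_data_reshaped.shape)         # (1, 6)
```

The second shape reads as "one data point with six features", which is the two-dimensional layout that scikit-learn's predict expects.
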
We are giving just one data point and we want a prediction for it. If we don't reshape it, the model doesn't know that. While training we used about 1,070 data points in X_train, so the model expects a two-dimensional array with one row per data point and one column per feature. A flat array of six values doesn't have that shape. We want our model to understand that we want the value for one particular data point, and that is why we are reshaping to (1, -1): one row, with the number of columns worked out automatically. This is what this particular reshape does.

Now we can find the prediction. The model which we use is regressor, so it's regressor.predict; the predict function will give us the charges value. Inside predict, mention input_data_reshaped: we need to feed this reshaped input array to the linear regression model. Let's print this prediction; this will print the medical insurance cost. Let's run this. The value we are getting is 3760. You can go to the data set which we have: for this particular data point, the medical insurance cost is 3756, that is, three thousand seven hundred fifty-six dollars, and the value predicted by our model is 3760, which is very close. That tells us the model is performing reasonably well on this example. So this is how you can build a predictive system.

I'll just print one more thing. As you can see here, the result is printed in square brackets, which means it is not a single number: predict returns a NumPy array, which we can index like a list. So we need to print the value separately. Let's print 'The insurance cost is USD' and prediction[0]. Why am I mentioning prediction[0]? This prediction is an array which contains only one value. If we want to take this one value alone, we need to write prediction[0]. This is the index: the index of this particular value is 0, so that's why I'm
mentioning prediction[0]. If you are not aware of what is meant by the list data type or the tuple data type, you can go to the second module, Python Basics for Machine Learning, where I explained all the different data types in Python. Let's run this. As you can see, it says the insurance cost is 3760, which is very close to the real value. So this is how you can build a predictive system using machine learning.

I hope you have understood all the things covered in this video. I'll just give you a quick recap of what we did. I have already told you how you can upload the data set, and I'll give the link for the data set and the link for this Colab file in the description of this video. Please practice this code: don't just download the Colab file and run it once. Practice the code by yourself; it will help you understand things better.

The first step is to import all the dependencies, the necessary libraries and functions. Then we saw how to import the data into a pandas DataFrame. We printed the first five rows of the data frame, found how many rows and columns we have, saw which features in the data frame are categorical, and checked whether it has any missing values. We did some data analysis by getting some statistical measures, finding the distribution of values, and making count plots of some columns. Once we had done this data analysis, the next important thing was to encode the categorical features into numerical values. Then we split the data into features and target, and after that we split it into training data and testing data. Then we saw how to train a linear regression model; I have also explained how a linear regression model
works. Then we evaluated our model, and finally we built a predictive system that, given all these parameters, can find the medical insurance cost. So we have achieved what we wanted to do in this particular project. I hope you have understood all this. If you have any doubts, you can reach out to me in the comments or on social media; I have given the links to my Telegram channel, LinkedIn profile, and Facebook group in the description of all my videos. Please share this with friends who you think might be interested in machine learning and artificial intelligence. I hope you have understood everything. Thanks for watching.
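
As a compact companion to the recap, here is a rough sketch of the encode, split, train, and predict steps on a toy table. The eight rows are invented for illustration and are not taken from the real data set. The codes for female, smoker, and southeast follow the video; the code for male and the codes for the other three regions are assumptions. The sketch uses pandas `map` for the encoding, which may differ from the call used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows invented for illustration only; NOT from the real data set.
df = pd.DataFrame({
    'age':      [19, 33, 45, 52, 27, 61, 38, 24],
    'sex':      ['female', 'male', 'female', 'male',
                 'male', 'female', 'male', 'female'],
    'bmi':      [27.9, 22.7, 30.1, 35.2, 24.6, 29.0, 31.5, 26.3],
    'children': [0, 1, 2, 3, 0, 1, 2, 0],
    'smoker':   ['yes', 'no', 'no', 'yes', 'no', 'no', 'yes', 'no'],
    'region':   ['southwest', 'northwest', 'southeast', 'northeast',
                 'southeast', 'northwest', 'southwest', 'northeast'],
    'charges':  [16900.0, 4500.0, 8200.0, 41000.0,
                 2900.0, 13500.0, 38000.0, 2700.0],
})

# Encode the categorical columns as numbers.
# From the video: female -> 1, smoker yes -> 0 / no -> 1, southeast -> 0.
# Assumed here: male -> 0 and the codes for the remaining regions.
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['smoker'] = df['smoker'].map({'yes': 0, 'no': 1})
df['region'] = df['region'].map(
    {'southeast': 0, 'southwest': 1, 'northeast': 2, 'northwest': 3}
)

# Split into features and target (no train/test split on 8 toy rows).
X = df.drop(columns='charges').to_numpy()
Y = df['charges'].to_numpy()

regressor = LinearRegression()
regressor.fit(X, Y)

# Predict for one already-encoded data point (placeholder values).
input_data = (31, 1, 25.74, 0, 1, 0)
input_data_reshaped = np.asarray(input_data).reshape(1, -1)
prediction = regressor.predict(input_data_reshaped)

print(prediction)  # a one-element NumPy array, shown in square brackets
print('The insurance cost is USD', prediction[0])
```

With only eight invented rows the predicted number itself means nothing; the point is the order of the steps and the `prediction[0]` indexing at the end.
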
