Transcript for:
Overview of Machine Learning Algorithms

So in today's session, what are we going to discuss? First of all, we are going to discuss the different types of machine learning algorithms and how many types there are. Understand that the purpose of this session is to help you clear interviews. Once you go for data science interviews, the main purpose is to clear them, and I've seen that people who knew the machine learning algorithms in a proper way were definitely able to clear them, because they explained the algorithms well to the recruiter and got hired. First is the introduction to machine learning, where I am specifically going to talk about AI versus ML versus DL versus data science. The second thing we are going to talk about is the difference between supervised ML and unsupervised ML. The third thing we are going to discuss is linear regression, where we will clearly understand the maths and the geometric intuition. The fourth topic is R square and adjusted R square, and the fifth topic is ridge and lasso regression.

The first topic, then, is AI versus ML versus DL versus data science. If you really want to understand the difference, just imagine the entire universe: this entire universe I will call AI, artificial intelligence. Whatever role you are in, whether you are working as a machine learning developer, a deep learning developer, a computer vision developer, a data scientist or an AI engineer, at the end of the day you are creating an AI application. If I really want to define artificial intelligence, you can say it is a process wherein we create applications that are able to do their tasks without any human intervention. That means a person need not monitor the AI application; it will automatically be able to take decisions and perform its tasks. Some examples I would definitely like to consider: Netflix has an AI module, and the AI work implemented there is recommendation. If you watch action movies for some time, the AI module inside Netflix will make sure it recommends action movies to you; similarly, if you continuously watch comedy movies, it will recommend comedy movies. Through this it understands your behaviour and does its task without asking you anything. The second example I would like to take up is amazon.in: if you buy an iPhone, it may recommend headphones to you, and this kind of recommendation is also part of an AI model integrated with the amazon.in website. Another example is the ads you see when you open my channel,
through which I get paid a little bit for the hard work that I do on YouTube. How those ads get recommended to you is also an AI engine included in YouTube itself, and it serves a business-driven goal; these are business-driven things that we do with the help of AI. One more example is self-driving cars: if you take Tesla, the car is able to drive itself based on the road, and who is doing that? There is an AI application integrated with the car itself. So all of these are AI applications, and whatever role you do, you are going to create an AI application. A common point people miss: our CEO, Sudan Sukumar, has written in his profile that he is an AI engineer, which basically means his goal is to create AI applications, and in product-based companies you will see this kind of role called AI engineer.

Now let's go to the next term, machine learning. Where does machine learning come into the picture? Machine learning is a subset of AI, and its role is to provide statistical tools to analyse the data, visualise the data and, apart from that, to do predictions and forecasting. You will be seeing a lot of machine learning algorithms, and internally the equations those algorithms use are based on statistical techniques, because whenever we work with data, statistics is very important. So this is machine learning, and it is very important to understand that ML is a subset of AI.

Now let's go to the next one, deep learning. Deep learning is again a subset of ML. Why did deep learning come into existence? Because in the 1950s and 60s, scientists thought: can we make machines learn the way we human beings learn? For that purpose deep learning came into existence; the plan is to mimic the human brain. To do that we use multi-layered neural networks, and these multi-layered neural networks help us train the machines or applications we are trying to create. Deep learning has really done amazing work, and with its help we are able to solve very complex use cases that we will discuss as we go ahead.

Now if I come to data science: if you want to call yourself a data scientist, tomorrow you may be given a business use case, and the situation may be that you have to solve it with machine learning algorithms or deep learning algorithms; again, the final goal is to create an AI application. You cannot say, "I am a data scientist but I will only work on machine learning", or "I will only work on deep learning", or "I don't know how to analyse data". You cannot do that. When I was
working in Panasonic, I got various kinds of tasks: sometimes I was told to use Tableau or Power BI to visualise and analyse the data, sometimes I was given a machine learning project, sometimes a deep learning project. So if I consider where the data scientist falls in this picture, it is a part of everything.

Now, if I talk about machine learning and deep learning with respect to the problem statements we solve, the majority of business use cases fall into two sections: supervised machine learning and unsupervised machine learning. Most of the problems you solve are with respect to these two types of machine learning. In supervised machine learning there are two major problem statements: one is the regression problem and the other is the classification problem. In unsupervised machine learning you are solving two different types of problems: one is clustering and the other is dimensionality reduction. There is also one more type called reinforcement learning; I will definitely talk about it, but not right now.

Now understand what happens in supervised machine learning. Let's consider a data set with two features, age and weight, with values like 24 and 62, 25 and 63, 21 and 72, and many more rows. Let's say my task is to take this data and create a model: first we train the model with this data, and then whenever it gets a new age it should be able to give us the weight as output. This model is also called a hypothesis; I'll discuss that when we talk about linear regression. Now, what are the important components in this kind of problem statement? There are two things: independent features and dependent features. Independent features are the inputs I am training on, so in this case age is the independent feature. Whatever I am predicting, the output I want the model to give, is the dependent feature, so here weight is the dependent feature. We call it dependent because it depends on the other value: whenever age increases or decreases, this value changes. Remember, in supervised machine learning there will be one dependent feature and there can be any number of independent features.
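To make that concrete, here is a minimal sketch of the supervised setup just described: train a model on (age, weight) pairs and ask it for the weight of a new, unseen age. The numbers and the use of scikit-learn's LinearRegression are my own illustration, not something fixed by the session.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[24], [25], [21], [26], [30]])   # independent feature: age (one column)
y = np.array([62, 63, 72, 65, 70])             # dependent feature: weight

model = LinearRegression()   # the "hypothesis" that will be learned from the data
model.fit(X, y)              # training on the labelled (age, weight) pairs

new_age = np.array([[23]])
print(model.predict(new_age))  # predicted weight for a new, unseen age
```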
Now let's go ahead and discuss regression and classification and the difference between them. First, a regression problem statement. Suppose I take the same example of age and weight, with values like 24 and 72, 23 and 71, 25 and 71.5. Weight is my output variable, my dependent feature, and whenever the output I am trying to find is a continuous variable, this becomes a regression problem statement. For example, if I plot this data set as a scatter plot, then to solve the problem with linear regression I will try to draw a straight line, and that line is given by the equation y = mx + c. With the help of this equation I find the predicted points: any new point I see on the line is my predicted value of y. That is how we solve a regression problem statement, so always remember that in a regression problem your output is a continuous variable.

The second one is a classification problem. Suppose I have a data set with number of study hours, number of play hours and number of sleeping hours as my independent features, and finally an output which is pass or fail; that output is my dependent feature. Whenever your output has a fixed number of categories, it becomes a classification problem. If it has just two outputs, it is binary classification; if there are more than two categories, it is multi-class classification. That is the difference between a regression problem statement and a classification problem statement.
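As a quick illustration of the classification side, here is a hedged sketch using the study/play/sleep-hours example: the target takes only two categories (pass or fail), so a classifier is trained instead of a regression line. The data is invented, and logistic regression (covered only later in the session) is used here purely as an example classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: study hours, play hours, sleep hours (independent features)
X = np.array([
    [8, 1, 7],
    [2, 5, 9],
    [6, 2, 8],
    [1, 6, 6],
    [7, 1, 6],
])
y = np.array([1, 0, 1, 0, 1])    # dependent feature: 1 = pass, 0 = fail (two fixed categories)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[5, 2, 7]]))  # predicted class for a new student
```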
Now let's discuss unsupervised machine learning, which is my second main topic. In unsupervised machine learning there are two main problem statements that we solve: clustering and dimensionality reduction. Let's take a data set with salary and age. In this scenario we don't have any output variable, no dependent variable at all, so what kind of conclusions can we draw from this data? Here I would do something called clustering, and a common use of clustering is customer segmentation. Clustering means that, based on this data, I try to find groups of similar people: this is one group, this is another group, this is a third group. Each cluster will convey some information: one cluster may say these people are very young but earning an amazing salary, another may say these people are older and getting a good salary, another may be more middle class, where the salary is not increasing much with age. So what we are doing is grouping; that word is very important. Why do we use this? Suppose my company launches a product and I want to target it only at rich people; say product one is for rich people and product two is for middle-class people. If I make these kinds of clusters, I will be able to target my ads, or send that particular product, only to that specific group of people. That is targeted marketing, and it uses customer segmentation, which is a very important example; based on this segmentation we can later apply regression or classification kinds of problem statements.

The second one, after clustering, is dimensionality reduction. Here the focus is: if we have 1000 features, can we reduce them to a lower dimension, say from 1000 features down to 100 features? Yes, it is possible with the help of dimensionality reduction algorithms such as PCA, which I will also cover as we go ahead. And understand: clustering is not a classification problem; clustering is a grouping algorithm, and there is no output feature, no dependent variable, in unsupervised ML. I will also cover LDA and PCA as we go ahead.

So with respect to supervised and unsupervised learning, the first algorithm we are going to cover is linear regression; the second, after linear regression, is ridge and lasso; the third is logistic regression; the fourth is decision trees, which include both classification and regression; the fifth is AdaBoost; the sixth is random forest; the seventh is gradient boosting; the eighth is XGBoost; and the ninth is naive Bayes. Then, when we go to the unsupervised machine learning algorithms, the first is K-means, then DBSCAN, then hierarchical clustering; there is also k-nearest neighbours (KNN), and then we'll look at PCA and LDA. Yes, I have missed SVM here, so I'll include SVM as well, and KNN will also get covered; I may miss one or two, but we are going to cover everything.
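Before starting on linear regression, here is a small, hedged sketch of the two unsupervised tasks described above: K-means grouping customers by salary and age, and PCA reducing 1000 features to 100. The data is synthetic and the cluster count is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# salary and age for 30 imaginary customers, drawn around three loose groups
salary = rng.normal(loc=[30_000, 60_000, 120_000], scale=5_000, size=(10, 3)).T.ravel()
age = rng.normal(loc=[25, 45, 35], scale=3, size=(10, 3)).T.ravel()
X = np.column_stack([salary, age])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster id per customer; note there is no "right answer" column

# dimensionality reduction: 1000 features -> 100 features, as in the example above
X_wide = rng.normal(size=(200, 1000))
X_small = PCA(n_components=100).fit_transform(X_wide)
print(X_small.shape)    # (200, 100)
```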
So let's start with our first algorithm, linear regression. The linear regression problem statement is very simple. Suppose I have two features, one feature x and one feature y, where x is age and y is weight, and based on these two features I have some data points. In linear regression we try to create a model with the help of this training data set: I train a model, which is essentially a hypothesis, that takes a new age and gives the weight as output, and then with the help of performance metrics we verify whether this model is performing well or not. In short, in linear regression we try to find a best-fit line that helps us do the prediction: if I get a new age, what should the output be with respect to y? Whenever we draw a diagram like this, I can say that y is a linear function of x.

Now, how do we create this best-fit line? This is very important. Linear regression means we are going to create a straight line; you may be thinking, sir, why a straight line and not a non-linear curve? I'll discuss that when we look at other algorithms. To begin with, consider the line itself. This straight line can be written with many different equations and notations: some people write y = mx + c, some people write y = beta0 + beta1 * x, and some people write h_theta(x) = theta0 + theta1 * x. The first place I learned linear regression from is Andrew Ng, so I would definitely like to give him the entire credit; I will explain it based on his notation and add a few points he may not have mentioned in his video, though obviously he is the best and I cannot compare myself to him. So to create this straight line I will use the equation h_theta(x) = theta0 + theta1 * x_i, where x_i is the i-th data point. Now, let's say this is my x, this is my y, these are my data points, and I am trying to create a best-fit line through them given by this equation. So what do theta0 and theta1 indicate?
Theta0 is called the intercept. What is an intercept? When x is 0, h_theta(x) = theta0, so the intercept indicates at what point the line meets the y axis: when x = 0, the value at which the line intersects the y axis is your intercept. The second thing is theta1, which is the slope, or coefficient. What does it indicate? For one unit of movement along the x axis, how many units does y move? That movement in y per unit of x is the slope or coefficient. So theta0 and theta1 are the two parameters, and x_i is simply your data point.
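Written as code, the hypothesis and the meaning of the two parameters look like this; the particular values of theta0 and theta1 below are arbitrary, just to show the intercept and slope behaviour.

```python
def hypothesis(x, theta0, theta1):
    # h_theta(x) = theta0 + theta1 * x, the straight-line hypothesis used above
    return theta0 + theta1 * x

theta0, theta1 = 2.0, 0.5   # arbitrary values, purely for illustration
print(hypothesis(0, theta0, theta1))                                   # 2.0: the intercept, where the line meets the y axis
print(hypothesis(1, theta0, theta1) - hypothesis(0, theta0, theta1))   # 0.5: the slope, the change in h for one unit of x
```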
Now, our main aim is to create the best-fit line in such a way that the distance between the actual data points and the predicted points is as small as possible. Suppose I create a best-fit line: for a given data point, the actual value was here but my predicted point is on the line; if I take the sum of all those distances, it should be minimal, and only then can I say this is the best-fit line. You may be thinking: Krish, why not just create multiple lines, compare them, and select whichever gives the smallest total distance? We could, but how many iterations would you do, and how would you know that a particular line is the best one? For that specific purpose we should start at one point and move step by step towards the best-fit line, and for that we create something called a cost function.

I have already shown you the hypothesis: the best-fit line equation is h_theta(x) = theta0 + theta1 * x. Now coming to the cost function, which is super important. The distance I talked about, between the real point y and the predicted point h_theta(x), I will write as h_theta(x) - y, and I will square it, because otherwise I may get negative values. I also need the summation over all points, i = 1 to m, where m is the number of data points, because I need the distance between the predicted and real values for every point. Finally I divide by 2m. Dividing by m gives us the average over all the values; the extra division by 2 is for the derivation, because it makes the equation much simpler later on, when we update the weights, meaning when we update theta0 and theta1. To find the best-fit line I need to keep changing theta0 and theta1 until I get the best-fit line, and for that I need this cost function. So the cost function is J(theta0, theta1) = (1 / 2m) * sum over i = 1 to m of (h_theta(x_i) - y_i)^2. Why the 1/2? Because it helps in the derivation: if I take the derivative of x^2 with respect to x I get 2x (in general the derivative of x^n is n * x^(n-1)), so when that 2 comes out, the 2 and the 1/2 cancel. This entire equation is called the squared error function. Mathematical simplicity basically means that when we update theta0 and theta1 we take derivatives of the cost function, and this form keeps those derivatives clean; the squaring is done so that we don't get any negative values.
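Here is the same squared-error cost function written directly from the formula above, as a minimal sketch with no assumptions beyond the notation already introduced; it gets exercised on the worked example a little further below.

```python
import numpy as np

def cost(x, y, theta0, theta1):
    # J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2
    m = len(x)
    predictions = theta0 + theta1 * x        # h_theta(x) for every data point
    return np.sum((predictions - y) ** 2) / (2 * m)
```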
Now let's look at what we need to solve. This is my cost function, and I need to minimise (1 / 2m) * sum over i = 1 to m of (h_theta(x_i) - y_i)^2 by adjusting the parameters theta0 and theta1. This whole expression is J(theta0, theta1), and minimising it is our task.

Now let's compare two things, the hypothesis and the cost function, with an example. Right now my hypothesis is h_theta(x) = theta0 + theta1 * x. If theta0 is 0, what does that indicate? It means the best-fit line passes through the origin, and the hypothesis becomes h_theta(x) = theta1 * x. So for now let's consider theta0 = 0: the line passes through the origin and this is the equation we get. Now let's take an example and solve it with this new hypothesis. Suppose I draw axes with ticks at 1, 2 and 3, and I have three data points: (1, 1), (2, 2) and (3, 3). These are my data points from the data set. Now, if theta1 is 1, where do you think the straight line will pass? It will pass through all the points, and each point is also its own prediction point, because theta1 is the slope and a slope of 1 sends the line exactly through every point. Now go ahead and calculate J(theta1). Since theta0 is 0, the formula becomes (1 / 2m) * sum over i = 1 to 3 of (h_theta(x_i) - y_i)^2, because there are three points. For the first point both h_theta(x) and y are 1, so the first term is (1 - 1)^2; the next point falls on (2, 2), so we add (2 - 2)^2; and then (3 - 3)^2,
which in total gives 0. So when theta1 is 1, J(theta1) is 0. J(theta1) is the cost function, so let me draw the cost-function graph: on the horizontal axis I have theta1 with values 0.5, 1, 1.5, 2, 2.5, and on the vertical axis I have J(theta1) with values 0.5, 1, 1.5, 2, 2.5 as well. Right now theta1 is 1 and J(theta1) is 0, so that is my first point on this graph. (And again, the 1/2 in 1/2m is there to make the calculation simpler, and the 1/m averages the summation we are doing.)

Now let's take the second scenario: theta1 = 0.5. If theta1 is 0.5, what points do I get? For x = 1, 0.5 * 1 = 0.5; for x = 2, 0.5 * 2 = 1; and for x = 3, 0.5 * 3 = 1.5. Now when I draw this best-fit line, say in green, the slope has clearly decreased. Let's calculate J(theta1) with the same equation, (1 / 2m) * sum over i = 1 to 3 of (h_theta(x_i) - y_i)^2, where h_theta(x_i) is the predicted point and y_i is the real point. The first term is (0.5 - 1)^2, because the real point is 1 and the predicted point is 0.5; the second is (1 - 2)^2; and the third is (1.5 - 3)^2. So I get 1 / (2 * 3), which is 1/6, multiplied by 0.25 + 1 + 2.25, which comes out to approximately 0.58. So with theta1 = 0.5 the cost is 0.58, and that is my next point on the cost graph, plotted at theta1 = 0.5, again in green.

Now the third condition: theta1 = 0. In that case 0 multiplied by x is obviously 0, so all three predictions are 0 and the line lies along the x axis. Calculating J(theta1) when theta1 = 0, the terms are (0 - 1)^2, (0 - 2)^2 and (0 - 3)^2, so it becomes 1/6 multiplied by 1 + 4 + 9, which is approximately 2.33. So with theta1 = 0 we get about 2.33, which is the point plotted at theta1 = 0 on the cost graph.
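These three hand calculations can be checked with a few lines of code; this simply re-evaluates the same cost function at theta1 = 1, 0.5 and 0 for the points (1, 1), (2, 2), (3, 3), with theta0 fixed at 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(theta1):
    m = len(x)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for theta1 in (1.0, 0.5, 0.0):
    print(theta1, round(cost(theta1), 2))
# 1.0 -> 0.0, 0.5 -> 0.58, 0.0 -> 2.33, matching the values worked out above
```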
Similarly, if I start computing with theta1 = 2, I will get another point, and when I join all these points together you will see that I get a curve. This is the cost curve, and gradient descent, which moves along this curve, will play a very important role in making sure you get the right theta1, the right slope value. Now, which is the most suitable point? The most suitable point is the bottom of this curve, which is called the global minimum, because out of the three lines we drew, the best-fit line was the one whose point sits at the bottom; when I am at that point, the distance between the predicted and the real points is very small. So that specific point is called the global minimum.

But you might say: Krish, you assumed theta1 = 1, theta1 = 0.5, theta1 = 0; you are assuming many values, calculating each one and drawing this curve. Instead, the idea should be that I come to some point on the curve and then move towards the minimum. How do I do that? How do I first get to a point and then move towards this global minimum? For that we use a convergence algorithm, because once I am at a specific point I just need to keep updating theta1 instead of trying many different theta1 values. The convergence algorithm says: repeat until convergence, which basically means we are in a while loop, and inside it we keep updating the theta value as theta_j := theta_j - alpha * (d/d theta_j) J(theta0, theta1). I'll talk about this alpha, don't worry. That means that, starting from some value of theta, repeatedly performing this operation should bring us to the global minimum. The d/d theta_j term is the derivative, and the derivative means I am finding the slope at that point. This equation will definitely work, trust me, and I'll draw it to show you why. Let's say this is my cost curve, with theta1 on the horizontal axis and J(theta1) on the vertical axis, and my first point lands somewhere up on the right side, while I have to reach the bottom; and consider also the mirror case where my initial point lands on the left side of the curve.
How will we come to the global minimum using this equation? (I'll talk about alpha too, don't worry.) Suppose I am at a point on the right side of the curve. I apply the derivative of J(theta1) at that point, which means I find the slope by drawing the tangent line there. If the right-hand side of that tangent points upwards, it is a positive slope; that is the easiest way to tell whether a slope is positive or negative. So in this case the slope is positive, and when I update theta1 with the convergence algorithm, theta1 := theta1 - alpha * (derivative), I am subtracting a positive number: alpha, the learning rate, times a positive slope. So theta1 decreases, and after some number of iterations I come down to the global minimum. Similarly, if I start on the left-hand side and draw the tangent, the slope is negative, so the update becomes theta1 := theta1 - alpha * (negative number), and minus times minus is plus, so theta1 increases and again moves towards the global minimum. So whether the slope is positive or negative, the update moves us to the global minimum.

Now, what is this learning rate? The learning rate decides with what speed I move from my current point towards the global minimum. Usually we select something like 0.01. If I pick a small value, the algorithm takes small steps towards the optimum. If I take a huge alpha value, the updates to theta1 will keep jumping here and there, and the situation can be that it never reaches the global minimum. So it is a good decision to keep alpha small; at the same time it should not be an extremely small value, because then it takes tiny steps and will take forever to reach the global minimum, which means the model will keep training for a very long time. So this algorithm is definitely going to work.
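To see that intuition numerically, here is a small sketch on the same three points, comparing a small and a large learning rate; the gradient expression used (with theta0 held at 0) follows from the cost function above, and the particular alpha values are arbitrary choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def run(alpha, steps=20):
    theta1 = 0.0
    for _ in range(steps):
        grad = np.sum((theta1 * x - y) * x) / m   # dJ/dtheta1 when theta0 = 0
        theta1 = theta1 - alpha * grad
    return theta1

print(run(alpha=0.01))   # creeps slowly toward the optimum theta1 = 1
print(run(alpha=0.5))    # overshoots the minimum each step and blows up
```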
Now let me talk about one scenario: what if my cost function has a local minimum? Suppose the curve has a dip that is not the global minimum, and one of my points lands there. In that case my update is simply theta1 := theta1 - alpha * 0, because at a local minimum the slope is 0, so theta1 stays equal to theta1 and we would be stuck in that local minimum. But with the cost function and the gradient descent setup we are using here, we do not get stuck in local minima, because for linear regression the cost curve always looks like this single convex bowl. In deep learning, when we learn about gradient descent and ANNs, there are lots of local minima, and because of that we have different gradient descent variants, like RMSprop and the Adam optimizer, which solve that specific problem. I wanted to mention this because tomorrow, if someone asks you in an interview whether you see any local minima in linear regression, you can say that the cost function we use will not give us local minima, but in deep learning techniques such as ANNs we have different kinds of optimizers which solve that particular problem. That is the answer you have to give.

Now let me write the gradient descent algorithm again, and remember, gradient descent is an amazing algorithm that you will definitely be using, so please make sure you know it perfectly. One common question: when will convergence stop? Convergence stops when we come near the region where J(theta) is very, very small. So the algorithm is: repeat until convergence, theta_j := theta_j - alpha * (d/d theta_j) J(theta0, theta1), for j = 0 and j = 1. Now we really need to work out what that derivative, (d/d theta_j) J(theta0, theta1), actually is.
So how do I write this? Remember j will be 0 and 1, because we need the derivative for both theta0 and theta1, and J(theta0, theta1) is our cost function, (1 / 2m) * sum over i = 1 to m of (h_theta(x_i) - y_i)^2. Think of a simpler case first: if I have (1 / 2m) * x^2 and I differentiate it, I get (2 / 2m) * x, and the 2s cancel. The same thing happens here. For j = 0, the derivative with respect to theta0 is (d/d theta0) J(theta0, theta1) = (1 / m) * sum over i = 1 to m of (h_theta(x_i) - y_i): the square becomes a factor of 2 that cancels the 1/2, and since h_theta(x) = theta0 + theta1 * x, differentiating the inside with respect to theta0 gives just 1, so nothing extra comes out. For j = 1, remember again that h_theta(x) is theta0 + theta1 * x, so when I differentiate with respect to theta1 an extra factor of x comes out, and I get (d/d theta1) J(theta0, theta1) = (1 / m) * sum over i = 1 to m of (h_theta(x_i) - y_i) * x_i. In both cases the square goes away, just like the derivative of x^2 being 2x.

So let me write the final convergence algorithm cleanly. Repeat until convergence, with two updates happening together: theta0 := theta0 - alpha * (1 / m) * sum over i = 1 to m of (h_theta(x_i) - y_i), and theta1 := theta1 - alpha * (1 / m) * sum over i = 1 to m of (h_theta(x_i) - y_i) * x_i. Alpha is the learning rate, which we initialise to some small value like 0.001. So here we have discussed linear regression and how the best-fit line is actually found.
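Putting the whole derivation together, here is a short, hedged implementation of batch gradient descent with exactly those two update rules; the toy data (points lying on y = 1 + 2x) and the choices of alpha and iteration count are my own, purely for illustration.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iterations=5000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        predictions = theta0 + theta1 * x                  # h_theta(x) for all points
        error = predictions - y
        grad0 = np.sum(error) / m                          # dJ/dtheta0
        grad1 = np.sum(error * x) / m                      # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # exactly y = 1 + 2x
print(gradient_descent(x, y))              # prints values very close to (1.0, 2.0)
```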
Now, similarly, when you have multiple features, like x1, x2, x3, x4, the cost function becomes a convex surface in more dimensions, a 3D bowl-like curve, and gradient descent on it is just like coming down a mountain. Now let's discuss two performance metrics that are important in this case: R square and adjusted R square. We usually use these performance metrics to verify how good our model is with respect to linear regression. R square is a performance metric to check how good the model is, and the formula is 1 minus the sum of residuals divided by the total sum of squares: R^2 = 1 - SS_res / SS_tot, where SS_res = sum over i of (y_i - y_i_hat)^2 and SS_tot = sum over i of (y_i - y_bar)^2. Here y_i_hat is nothing but h_theta(x_i), the prediction, and y_bar is the mean of y.

Let me explain what this formula says. Suppose these are my data points and I create the best-fit line; y_i_hat is the predicted point on that line, shown in green, and the sum of residuals is the sum of the squared differences between each real point and its predicted point. The next part, which is very important, is y_i minus y_bar, where y_bar is the mean of y: if I calculate the mean of y I get a horizontal line, and then I calculate the distance between each point and that mean line. The denominator will generally be higher than the numerator, because the distances to the mean line are usually larger than the distances to the best-fit line. So we have a low value divided by a high value, which gives a small number, and 1 minus a small number is a big number. That indicates the model has fitted properly, a good R square. Now tell me, can this R square be a negative number? Say in this case I got 90 percent; could it ever be negative? There can be a situation: if I create a line that fits so badly that the residual sum becomes higher than the total sum, then R square goes negative. But in a usual scenario it will not happen, because we try to fit a line that is at least reasonable; we are not just pulling a line from anywhere, and we don't want a best-fit line that is worse than the mean line itself.
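Here is the same R-squared formula computed directly; the actual and predicted values below are invented just to show the calculation.

```python
import numpy as np

y = np.array([62, 63, 72, 65, 70], dtype=float)       # actual values
y_hat = np.array([63, 64, 70, 66, 69], dtype=float)   # model predictions (y_i_hat)

ss_res = np.sum((y - y_hat) ** 2)        # sum of residuals
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean line
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))   # close to 1 when the predictions track the actual values well
```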
Now here you will see one interesting property of R square. Suppose I am predicting the price of a house and my feature is the number of bedrooms. If I solve this problem I will get some R square value, say 85 percent. What if I add one more feature, the location of the house? Location is definitely correlated with price, so there is a definite chance the R square value increases, say to 90 percent, with these two features; it increases because the new feature is correlated with price. Now let's say I add one more feature: the gender of the person who is going to stay there, male or female. You know gender is nowhere correlated with price, but even then there is a scenario where my R square still increases and becomes, say, 91 percent, even though the feature is not at all important. The R square formula works in such a way that if I keep adding features, even ones that are nowhere correlated with the target, it still tends to increase. This should not happen: whether a male or a female will stay does not matter at all, yet when you do the calculation R square still goes up. And that would mean the model showing 91 percent, the one that includes gender, gets picked over the model with 90 percent, just because it appears to perform better, when actually the extra feature is irrelevant and the other model should have been picked. To prevent this situation we use something called adjusted R square.

What is adjusted R square and how does it work? It is a very nice concept. Adjusted R square is given by the formula R^2_adjusted = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the total number of samples and p is the number of features, also called predictors. Suppose in the first scenario my number of predictors was 2, and in the second case it was 3. With 2 predictors I got an R square of 90 percent; when the calculation is done, the adjusted R square will be a little less, say 86 percent, based on those 2 predictors.
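The adjusted R-squared formula itself is easy to compute; here is a sketch with illustrative numbers of my own. A small sample size (n = 10) is assumed so that the penalty for the extra, useless predictor visibly outweighs the tiny R-squared gain; the exact percentages quoted in the session are not reproduced here.

```python
def adjusted_r2(r2, n, p):
    # R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R^2 creeps up from 0.90 to 0.91 after adding one irrelevant predictor,
# but with only 10 samples the adjusted value goes down instead
print(round(adjusted_r2(0.90, n=10, p=2), 3))   # ~0.871
print(round(adjusted_r2(0.91, n=10, p=3), 3))   # ~0.865
```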
Now when I use 3 predictors, predictors meaning the number of features I'm going to use, and one of them, gender, is nowhere related to price, what do we get? R square increases to 91 percent, but the adjusted R square will not increase; it will in turn decrease, say to 82 percent. (I've just taken some values, 86 and 82, to show that here R square goes up while adjusted R square comes down.) How does this happen? Look at the p in the formula. If I put p = 3, then n - p - 1 becomes a smaller number, so (n - 1) / (n - p - 1) becomes a bigger number. If the new feature is not correlated, R square barely increases, so (1 - R^2) stays roughly the same, and a roughly unchanged number multiplied by a bigger number gives a bigger number; 1 minus a bigger number gives a smaller value, so the adjusted R square decreases. With p = 2, n - p - 1 is larger than it is with p = 3, so the penalty is smaller. And remember, R square is still there to support you: when the features are highly correlated with the target, the R square value increases a lot, so adjusted R square can still go up; when a feature is barely correlated, there is only a small increase in R square and the penalty dominates. So even though my adjusted R square was 86 percent with two predictors, since the new feature is nowhere correlated I may end up with 82 percent because of this equation. I hope you are understanding this; it is a very important property. A simple way to put it: as the number of predictors p keeps increasing, my R square gets adjusted, and whatever adjusted value I get will always be less than the plain R square. There was one interview question asked to one of my students: between R square and adjusted R square, which will always be bigger? The student said R square, and then he was asked to explain adjusted R square and why that happens.

The agenda now: first, ridge and lasso regression; second, the assumptions of linear regression; third, logistic regression; fourth, the confusion matrix; and fifth, practicals for linear, ridge, lasso and logistic regression. So the first topic
So the first topic we are going to discuss is something called ridge and lasso regression. Let's understand ridge and lasso regression. If you remember, in our previous session we discussed linear regression, the cost function, R square and adjusted R square, and gradient descent. The cost function was nothing but J of theta 0 comma theta 1 equal to 1 by 2m summation of i equal to 1 to m of h theta of x of i minus y of i, whole square — this is the cost function we discussed yesterday, and with respect to this J of theta 0 comma theta 1 we ran gradient descent. Now let me give you a scenario. Let's say I just have two points that look like this. If I have these two specific points, I will try to create a best fit line, and the best fit line will definitely pass through both the points. If I try to calculate the cost function, what will be the value of J of theta 0 comma theta 1? Let's say in this particular case the line passes through the origin, so my theta 0 will be 0. Since the line passes through every point there is no difference between predicted and actual, so the cost will obviously become zero. Now understand, this data that you see is basically called training data — the two points I have plotted are my training data. What is the problem with this? Right now the line that is getting created through the hypothesis passes through every point, so the cost is zero, and our main aim is to minimize the cost function, which sounds absolutely fine. But imagine that tomorrow new data points come. Suppose a new data point comes here; if I want to predict for this particular point, my predicted point lies on the line over there. Is the difference between the predicted point and the real point quite huge? Yes. So this is creating a condition which is called overfitting. Since every training point lies exactly on the best fit line, it causes something called overfitting, and you really need to understand what overfitting is. Overfitting basically means my model performs well with the training data but fails to perform well with the test data. What is the test data here? The test data is these new points: the real answer was this point, but because my line is like this I'm getting the predicted point over there, and if I calculate that distance it is quite huge. So whenever my model performs well with training data and fails to perform well with test data, that scenario is what we call overfitting.
When the model performs well with the training data, I have a condition called low bias, and when it fails to perform well with the test data, it is called high variance — very important, and I will make everyone understand it one by one. Performing well with the training data means low bias; failing to perform well with the test data means high variance. Similarly, I may have another scenario called underfitting. Always remember: whenever I talk about bias, it is something related to the training data, and whenever I talk about variance, we are talking about the test data. So for overfitting you basically have low bias and high variance — low bias with respect to the training data and high variance with respect to the test data. Now, if the model accuracy is bad with the training data and the model accuracy is also bad with the test data, that scenario we call underfitting. That basically means for the training data the model gives bad accuracy, and for the test data it also gives bad accuracy. So for underfitting we can say two things: high bias and high variance. Very important. Let me explain once again with three models. Suppose model one has a training accuracy of 90 percent and a test accuracy of 80 percent. Model two has a training accuracy of 92 percent and a test accuracy of 91 percent. Model three has a training accuracy of 70 percent and a test accuracy of 65 percent. The first case is basically overfitting, the second becomes my generalized model, and the third one becomes underfitting. What are the main properties? For overfitting, since it performs well with the training data, it is low bias and high variance; the generalized model is low bias and low variance; and underfitting is high bias and high variance. If you understand the terminology in this particular way, you will be able to follow everything. Why do we always require a generalized model? Because whenever new data comes, a generalized model will be able to give us a very good output. Let's go back to the earlier example: the line I created there is basically overfitting, so whenever I get new points, the real value and
the predicted point differ by quite a lot, so it will definitely be a scenario of overfitting where the model has low bias and high variance. Again, take the example with two points: when I drew the best fit line passing through both points, that scenario caused the overfitting problem, and I also showed you that my J of theta 0 comma theta 1 will be 0 in this case, since the line passes exactly through the points. Now, what can we take out of this? Our cost function is nothing but 1 by 2m summation of i equal to 1 to m of h theta of x of i minus y of i whole square. Let's write h theta of x of i as y hat of i, so the term becomes y hat of i minus y of i whole square — nothing but the difference between the predicted value and the real value. In this scenario, if I add up these differences I obviously get 0, and I have to make sure this value does not come out to exactly 0, because that is still overfitting. That is where ridge regression comes into the picture — ridge and lasso will come into the picture. Ridge is also called L2 regularization. What L2 regularization does is add one more term to the cost, which is lambda multiplied by slope square: whatever the slope of this line is, we square it. Suppose my equation is h theta of x equal to theta 0 plus theta 1 x. In this particular case my theta 0 was 0, so my h theta of x is nothing but theta 1 x, and theta 1 is the slope; I take this theta 1 and square it. So understand, I don't want the cost to come out as exactly 0, because if it does it may indicate an overfitting condition. Now what happens when I add this term? The squared-error part will still come out as 0; let's consider my lambda value is 1 — I'll talk about how to set the lambda value later, for now I'm just initializing it to 1.
Now, with this lambda value of 1, let's consider our slope value is initially 2, and because of that slope of 2 I got this best fit line passing exactly through the points. So the total cost is the squared error, which is 0, plus 1 multiplied by 2 square, which is nothing but 4. But the optimization will not stop here, because it still has to reduce this value of 4, so it will again change the theta 1 value. Let's say my theta 1 value has changed and I get another best fit line, a slightly flatter one — lambda is a hyperparameter, guys, I'll talk about what exactly lambda is in a moment. Why do I get this new line? Because I have changed my theta 1 value in order to minimize the cost: we recalculate the slope of the new line together with the squared error. Once we have this new line, the training points no longer lie exactly on it, so there is a small difference. If I now evaluate the new cost, y hat of i minus y of i whole square plus lambda multiplied by slope square, the first part is a small value, because I now have some residual error, and for the penalty the slope has decreased — say from 2 to 1.5 — so 1 multiplied by 1.5 square is 2.25. A small value plus 2.25 is clearly less than the previous 4. So understand what is happening: the overall cost has come down from 4 to roughly 2.25 plus a small error, and that is the importance of ridge. What you end up with is a more generalized model that has low bias and low variance instead of this overfitting condition. Why specifically do we add ridge, the L2 regularization? It is basically to prevent overfitting, because you are not stopping at the zero-cost line; you keep reducing the penalized cost until you get a line that behaves as a generalized model. Now, if new points come, like the ones I drew earlier, the distance between predicted and actual will be less, so you can see the model generalizes better. And the residual part stays a small value: initially, with the exact line, it was zero, and moving the line slightly only introduces a slight error, but what this movement is really saying is that the slope should not be too steep. A very steep slope most of the time leads to an overfitting condition; the line should be a little less steep, while still fitting the data, so that it actually helps you create a generalized model.
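As a quick arithmetic check of the walkthrough above, here is a tiny sketch assuming lambda is 1, the first slope is 2 and the second slope is 1.5; the small residual value 0.30 is an assumed number just to make the comparison concrete.

```python
# Ridge-style penalized cost: sum of squared errors + lambda * slope^2
lam = 1.0

# First fit: line passes exactly through both training points, slope = 2
sse_1, slope_1 = 0.0, 2.0
cost_1 = sse_1 + lam * slope_1 ** 2      # 0 + 1 * 4 = 4.0

# Second fit: slightly flatter line, slope = 1.5, a small residual error appears
sse_2, slope_2 = 0.30, 1.5               # 0.30 is an assumed small residual
cost_2 = sse_2 + lam * slope_2 ** 2      # 0.30 + 2.25 = 2.55 < 4.0

print(cost_1, cost_2)
```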
You will see that after playing around for some amount of time this value will not reduce much further; after some point it settles to a minimal, smaller value. And for this you also have to specify the number of iterations — how many times you want to train. This number of iterations is also a hyperparameter, and based on the number of iterations you will see your R square or adjusted R square change. Also understand that the cost will never become exactly 0; if it becomes 0, trust me, it is an overfitting model. Now what is lambda? This lambda is a hyperparameter: it basically controls how strongly you penalize the steepness — how fast you flatten the line or allow it to stay steep — and this lambda is also selected through hyperparameter tuning, which I'll show you today in the practical. What do I mean by iterations? Iterations is basically how many times I want to change the theta 1 value — that is the convergence algorithm. So L2 regularization, or ridge, is basically used in such a way that you should never overfit. Why do we assume theta 0 equal to 0? Because I am considering that the line passes through the origin. What does steep mean? Steep basically means how steep the line is: one line can be quite steep, another less steep. Now let's go to the next regularization, which is called lasso regression — this is also called L1 regularization. Here the formula changes a little bit: you will have y hat of i minus y of i whole square, and you will again add a parameter lambda, but you will not be adding slope square; instead you add the mod of the slope, the absolute value. What this mod of slope does is actually help you to do feature selection. Now you may be thinking: how feature selection, Krish? Consider the equation where I have many, many features, so my h theta of x, which I am indicating here as y hat — apart from preventing overfitting it will also help you do feature selection. Let me show you with an example. If I have multiple features, then obviously there are many coefficients, many slopes. So the mod-of-slope term will be nothing but the sum of the absolute values of the coefficients — not x1, x2 and so on, sorry, those are the data points — it should be mod of theta 0 plus mod of theta 1 plus mod of theta 2 and so on up to mod of theta n. That is how the penalty is basically calculated. Now as we go ahead, guys, for whichever features are not playing an important role, the theta value — the coefficient, the slope — will become very, very small, which is just like that entire feature being neglected.
In the ridge case we were squaring the slope, and because of that squaring the penalty value kept inflating, but here, because of the mod, that value will not blow up; instead we get a condition wherein we are basically neglecting those features that are not at all important for this specific problem statement. So with the help of L1 regularization, that is lasso, you are able to do two important things: one is preventing overfitting, and the second is that if you have many features and many of them are not that important in finding out your best fit line, it will also help you perform feature selection. That is the importance of ridge and lasso regression — here I am just writing L1 regularization, and obviously we have discussed L2 regularization as well. You have probably understood that lambda is a hyperparameter, and this lambda will be found out through cross validation. Cross validation is a technique wherein we train our model with multiple candidate values and find out which value works best. In short, what are we doing? We are just trying to reduce the cost function in such a way that it will never become exactly zero, but it will keep reducing based on the lambda and the slope values. In most scenarios, if you ask me, we should try both regularizations and use whichever gives the better performance metrics. What does cross validation mean here? I will try different lambda values and pick the best one. So in short, let me write it down again. For ridge regression, which uses the L2 norm, the cost function is a little bit different: I can write it as summation of h theta of x of i minus y of i whole square plus lambda multiplied by slope square, and the purpose is very simple — here we are preventing overfitting. That was with respect to ridge regression, the L2 norm. For the next one, lasso regression, which is also called L1 regularization, the cost function is summation of h theta of x of i minus y of i whole square plus lambda multiplied by mod of slope, and the purposes are two: one is to prevent overfitting and the second one is feature selection. These two are the outcomes of the whole thing. See, with respect to lasso, when you have many features you have many slopes — theta 0, theta 1, theta 2, theta 3 and so on up to theta n. Those features that are not performing well, that have no contribution in finding out your output, will have a coefficient value that is almost nil, very close to zero; in short you are neglecting those values by using the modulus — you are not squaring them up and inflating them.
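Before moving on to the assumptions, here is a minimal, hedged sketch of ridge and lasso with scikit-learn. Note that `alpha` is scikit-learn's name for the lambda discussed above, and the synthetic data (only the first two of five features actually drive the target) is purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
# only the first two features actually drive y; the other three are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))

# Typical behaviour: lasso pushes the useless coefficients to exactly 0
# (feature selection), while ridge only shrinks them towards 0.
```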
Now I will continue and discuss the assumptions of linear regression. Assumption number one: linear regression works well if our features follow a normal or Gaussian distribution — if the features follow that distribution, our model will get trained well. There is one concept called feature transformation: if a feature does not follow a Gaussian distribution, we apply some kind of mathematical transformation onto the data and try to convert it into a normal or Gaussian distribution. The second point is standardization, which is nothing but scaling your data using the z score — I hope everybody remembers the z score — so that the mean becomes 0 and the standard deviation becomes 1. See guys, wherever gradient descent is involved it is good to do standardization, because if the starting point is already close to the minimum, training happens quickly; otherwise, if your feature values are quite huge, the cost surface can be very stretched and the starting point can land anywhere far away. The third point is linearity: linear regression works well if the relationship in your data is linear — if your data is very linear it will obviously give a very good answer, and logistic regression, which we are going to discuss today, has the same property. Now you may be asking whether standardization is compulsory: if you want to speed up the training of your model or optimize it, I would suggest you go ahead and do standardization. Coming to the fourth point, you really need to check for multicollinearity. What is multicollinearity? Let's say I have features x1, x2 and x3, and y is my output feature. Suppose x1 and x2 are 95 percent correlated with each other. Is it a wise decision to use both features? The answer is no — we can drop any one of the two and do the prediction with the remaining one. There is also a concept called the variance inflation factor; multicollinearity is also detected with the help of the variance inflation factor, and I will try to make a dedicated video about it. One more condition is homoscedasticity — that is also checked — but if you satisfy these assumptions you will definitely be able to perform well with linear regression. So you have got an idea of the assumptions.
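Here is a small sketch of two of the checks just mentioned: standardization via z-score scaling and a multicollinearity check via the variance inflation factor. It assumes statsmodels is installed; the toy DataFrame, where x3 is deliberately built to be almost a copy of x1, is only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(0)
X = pd.DataFrame({"x1": np.random.rand(100), "x2": np.random.rand(100)})
X["x3"] = 0.95 * X["x1"] + 0.05 * np.random.rand(100)   # highly correlated with x1

# Standardization: every feature gets mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))

# VIF: values far above roughly 5-10 usually flag multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```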
Now let's go towards something called logistic regression, the first algorithm we are going to learn for classification. Let's take an example: suppose I have the number of study hours and the number of play hours, and based on these two features I want to predict whether a child passes or fails. So here you can see I have a fixed number of categories — in this particular scenario two categories, which makes it binary classification, and logistic regression works very well with binary classification. The next question is: can we solve multi-class classification using logistic regression? The answer is simply yes, you can definitely do it. So let's go ahead and discuss logistic regression. What is its main purpose? First let's understand one scenario. Suppose I have a single feature, number of study hours, with values like 1, 2, 3, 4, 5, 6, 7, and the outcome is either pass or fail. Let me make some data points: if I study less than three hours I probably fail, and if I study more than three hours I probably pass, so I mark the points below three as fail and the points above three as pass — this is my training data set. Now the first question is: fine, Krish, you have some data here, whenever the study hours are less than three the person fails, and when they are greater than three the data points show a pass — can't we solve this problem with linear regression? With linear regression the first thing I can do is draw a best fit line through these points; it may look something like this, where fail is 0, pass is 1 and the middle point is 0.5. So with the help of linear regression I can create this best fit line and set a rule: whenever the predicted value is less than 0.5, the output should be 0. Say a new data point comes, I project it onto the line and the output I get is 0.25; in that case I write the condition that if my h theta of x value is less than 0.5 then my output is 0, where 0 basically means fail. Similarly, I write the condition that if my h theta of x is greater than or equal to 0.5 then the output is 1, which is nothing but pass. These two conditions I can definitely write, with 0.5 as my center point. So if another new data point comes, say this red point, I draw a straight line up to the best fit line, extend it across, and based on that I get a prediction which is greater than 0.5, so I say the person has passed. Obviously this is fine, this seems to work. So what is the problem, and why do we not just use linear regression here?
Why do we specifically need logistic regression? The answer is very simple, guys. Suppose I have an outlier, say a point far out on the axis at 9 study hours — the x axis now extends to 7, 8, 9, 10 — and when someone studies for nine hours the outcome is obviously pass. When I have this outlier, the entire best fit line changes: it rotates and shifts towards the outlier. Now even at five hours, or at points where the previous line would have predicted pass, the new line gives a value less than 0.5, so the prediction becomes fail, which is wrong — based on the previous line, someone studying more than five hours should pass, but now the value comes out below 0.5 while the real label is pass. So I hope you understand: because of one outlier the entire line changes. How do we fix this? There are two problems here. First, as we just saw, because of a single outlier your entire line gets shifted around. Second, the line's output is not bounded: sometimes you get values greater than one, and if I project a point far on the other side I even get a negative value, although my outputs should only be 0 or 1. So we have to squash this function, and when we squash it the ends flatten out. How do we squash it? For this we use something called the sigmoid activation function, or simply the sigmoid function. If somebody asks you why you don't use linear regression to solve a classification problem, your answer should be these two specific points. We will now take the same setup, write the cost function and understand how the cost function looks for logistic regression. Again, the second reason: the line goes above one and below zero, but I only have the labels zero and one, so the maximum and minimum meaningful values are one and zero. I hope you have understood why linear regression cannot be used — I showed you all the scenarios. Now we will continue and understand what exactly logistic regression is all about and how the decision boundary is created. Our values should always be between 0 and 1 here, because it is a binary classification problem. So let's define our decision boundary. First of all, as usual, in logistic regression we define our hypothesis: if I write my h theta of x as theta 0 plus theta 1 x 1 plus theta 2 x 2 and so on up to theta n x n, then I can write this entire equation as theta transpose x — obviously I can write it that way,
and this is the notation you will see in many places. So with respect to the decision boundary of logistic regression we can write the hypothesis in terms of theta like this, but we have to do one more thing, which is squashing the line. How does that squashing happen? If I have the line we saw above, with some data points on one side and some on the other, and I create the best fit line, I have to squash it at both ends. To do this squashing I use a function called the sigmoid activation function. The straight line is denoted by h theta of x equal to theta 0 plus theta 1 multiplied by x 1, and on top of this value I have to apply something so that the line gets squashed instead of just extending forever. So my hypothesis becomes g of theta 0 plus theta 1 x 1, where g is a function applied on top of the linear regression output to squash the line. Now let's find out what this g is. Let z equal theta 0 plus theta 1 multiplied by x — I'm just defining this — and my h theta of x is nothing but g of z. What is this g function? It is h theta of x equal to 1 divided by 1 plus e to the power of minus z, which, if I substitute z back in, is 1 divided by 1 plus e to the power of minus of theta 0 plus theta 1 x. This is my hypothesis, and it works well because it squashes the function; this function is called the sigmoid or logistic function. Now you need to understand what the sigmoid looks like as a graph: on the x axis is the z value and on the y axis is g of z; the curve runs between 0 and 1 and crosses 0.5 in the middle, giving the familiar S-shaped curve. From this we can make an important observation: g of z is greater than or equal to 0.5 whenever z is greater than or equal to 0, and if z is less than 0 then g of z is less than 0.5 — you can write that condition as well. This is the most important condition here. And why is it called logistic regression? Because with the help of regression you are creating the straight line, and with the help of the sigmoid you are squashing it, so the two names have basically been combined.
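As a small sketch of the sigmoid function and the 0.5 decision rule described above (the plotting is just for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--")     # g(z) >= 0.5 exactly when z >= 0
plt.xlabel("z"); plt.ylabel("g(z)")
plt.show()

print(sigmoid(0))    # 0.5
print(sigmoid(3))    # ~0.95 -> class 1
print(sigmoid(-3))   # ~0.05 -> class 0
```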
Will squashing the best fit line help to overcome the outlier issue? Yes, obviously it will help. So let's go ahead and set up the problem. Consider my training set: suppose I have training points x of 1 comma y of 1, x of 2 comma y of 2, x of 3 comma y of 3, and so on up to x of n comma y of n. This is my training data, and here y will belong to 0 or 1, because I only have two outputs since we are solving a binary classification problem. I hope everybody remembers g of z, which is nothing but 1 divided by 1 plus e to the power of minus z, where z is theta 0 plus theta 1 multiplied by x 1. Now we have to select the theta values. In this particular case let's consider that my theta 0 is 0, because the line passes through the origin — just to keep things simple — so my z is theta 1 multiplied by x. I now need to change my parameter theta 1 in such a way that I get the best fit line, and on top of that I apply the sigmoid activation function. Let's first define our cost function, because we definitely require one. You already know the cost function of linear regression, because the first best fit line you create is with the help of linear regression: J of theta 1 is nothing but 1 by m summation of i equal to 1 to m of 1 by 2 multiplied by h theta of x of i minus y of i whole square — this is the whole thing we discussed yesterday for linear regression. Now what happens for logistic regression? I take the same cost function, but now h theta of x is nothing but 1 divided by 1 plus e to the power of minus theta 1 x — that is my equation with respect to logistic regression. So if I write the cost function again, I can say 1 by 2 of h theta of x of i minus y of i whole square summed over the data, where h theta of x gets replaced by 1 divided by 1 plus e to the power of minus theta 1 x — the intercept I am considering as 0, guys. When I replace the hypothesis like this, it becomes the logistic regression cost function. But there is one problem: we cannot use this cost function, and there is a reason for it. Because of this equation, 1 divided by 1 plus e to the power of minus theta 1 x, the cost becomes a non-convex function.
Now you may be wondering what a non-convex function is, so let me write it down and differentiate it from a convex function — this is related to gradient descent, very important. If you remember, with linear regression the gradient descent surface we get is a convex function, a parabola-like curve. Because in that case h theta of x is nothing but theta 0 plus theta 1 x, the squared-error cost always gives you this parabola, a convex function. But here your h theta of x has changed, so if I use that same squared-error cost I get a curve with lots of wiggles. What is the problem with that curve? It has lots of local minima, and if there are local minima you may never reach the global minimum — that is the reason we cannot use that cost function. Mathematically you can also go and search what a convex or non-convex function looks like, but always remember: when we update theta 1 using this equation by finding the slope, we will get stuck, because at a local minimum the slope is also zero, so theta 1 will never get updated and you will never reach the global minimum — you may get stuck here or there; this is the local minima problem. In the case of linear regression you will reach the global minimum, but in this case you will not. So how do we solve it? To solve this problem we have something called the logistic regression cost function, written in a different way. Researchers thought about this and came up with the proposal that the cost for logistic regression, cost of h theta of x of i comma y, should be written like this. Let me write the cost function J of theta 1: the output y can be 1 or 0, and based on these two scenarios the cost looks like this — when y is equal to 1 the cost is minus log of h theta of x, and I hope you all know what h theta of x is, it is nothing but 1 divided by 1 plus e to the power of minus theta 1 x; and whenever y is 0 the cost is minus log of 1 minus h theta of x of i. This is how you write the cost function in this scenario.
With this cost function, since the log is being used, you will always be able to reach a global minimum; that is the reason they completely rejected the earlier cost function and used this one. Now what does this cost function actually mean? There are two scenarios. If y is equal to 1, consider the cost function graph with h theta of x on the x axis — its value ranges between 0 and 1 since it is a classification problem — and J of theta 1, the cost, on the y axis. When y is equal to 1 we use minus log of h theta of x of i, and that gives a curve which decreases towards zero. What does this curve tell us? The cost will be zero when y is equal to 1 and h theta of x is equal to 1 — that is, when the actual output is 1 and the model also assigns 1, the cost is zero, the curve touches zero there — and this is again a convex shape. The next case is y equal to 0: if your y is 0 you get a different curve, minus log of 1 minus h theta of x, which is zero when h theta of x is 0 and blows up as h theta of x approaches 1. When you combine these two curves you get something on which gradient descent works nicely, so this definitely helps us create a usable cost function. I hope everybody is able to follow till here. Finally I can write the cost function in a combined way. The cost of h theta of x of i comma y is minus log of h theta of x if y is equal to 1, and minus log of 1 minus h theta of x if y is equal to 0, and I can combine both and write cost of h theta of x of i comma y equal to minus y log of h theta of x of i minus 1 minus y multiplied by log of 1 minus h theta of x of i. This will be my final cost function, and you can verify it: if I replace y with 1, only the first term remains — the minus log of h theta of x of i term — because 1 minus 1 is 0 and 0 multiplied by anything is 0; and if y is equal to 0, the minus y term becomes 0, so only minus log of 1 minus h theta of x of i remains. So both conditions are captured by this combined cost function. Yes — the cost function and the loss function, with respect to the parameters, will be almost the same.
So finally, if I try to write J of theta — since I also have the 1 by 2m factor — I can write J of theta 1 equal to minus 1 by 2m summation of i equal to 1 to m of y of i multiplied by log of h theta of x of i plus 1 minus y of i multiplied by log of 1 minus h theta of x of i. That becomes my entire cost function, and obviously you know what h theta of x of i is: 1 divided by 1 plus e to the power of minus theta 1 multiplied by x. Finally, for my convergence algorithm I repeat the update until convergence: theta j equal to theta j minus the learning rate multiplied by the derivative of J of theta 1 with respect to theta j. So this is my cost function and this is my convergence algorithm, where I keep updating theta 1, and that solves logistic regression. Simple questions may come in interviews, like how it is different from linear regression and how it is similar. Can we relate this to log likelihood from probability? Yes, this is a log likelihood. Now I will discuss performance metrics, and this is specific to classification problems — I am talking about binary classification. Let's consider I have a data set with features x1 and x2 and output y, and in logistic classification the actual outputs are like 0, 1, 0, 1, 1, 0, 1, while y hat, the output predicted by the model, is 1, 1, 0, 1, 1, 1, 0. So this is my predicted output and this is my actual output. Can we come to some kind of conclusion about the accuracy of this model with respect to these data points? That is exactly what the confusion matrix deals with, so first of all we have to create a confusion matrix. For a binary classification problem the confusion matrix looks like a two-by-two table with 1 and 0 on both sides: one side for the actual values and one side for the predicted values. When my actual value is 0 and my predicted value is 1, what does that mean? A wrong prediction, so that cell's count goes up by one. In the second row of the data the actual value is 1 and the predicted value is 1, so I increase the count of the one-one cell. Similarly, when the actual value is 0 and the predicted value is 0, I increase that count by one. Going on, there is another one-one pair, so that cell becomes two, and then one more, so it becomes three. Then I have actual 0 and predicted 1 again, so that cell also becomes two, and finally I have actual 1 and predicted 0, where I increase that count to one.
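Here is a minimal sketch of the log-loss cost and its gradient descent update, assuming a single feature and no intercept (theta 0 = 0), as in the discussion above. The data is synthetic, and the cost below uses a plain 1/m average (a constant scale factor such as 1/2 does not change where the minimum is).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta1, x, y):
    h = sigmoid(theta1 * x)
    # J(theta1) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)   # noisy binary labels

theta1, lr = 0.0, 0.1
for _ in range(500):                      # "repeat until convergence"
    h = sigmoid(theta1 * x)
    grad = np.mean((h - y) * x)           # derivative of the log-loss w.r.t. theta1
    theta1 -= lr * grad

print(theta1, cost(theta1, x, y))
```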
Now what does this all mean? This table has my actual values, 1 and 0, on one axis and my predicted values, 1 and 0, on the other. The cell where actual is 1 and predicted is 1 is called true positive. The cell where actual is 0 and predicted is 0 is called true negative. Whenever your actual value is 0 and you have predicted 1, that is a false positive, and whenever your actual value is 1 and you have predicted 0, that is a false negative. Now I really need to find out the accuracy of this model, and this is what is called the confusion matrix. From the confusion matrix the accuracy is very simple: the diagonal elements are the right outputs, so if I add them up I get the correct predictions. Accuracy is TP plus TN divided by TP plus FP plus FN plus TN. Once I calculate this I have 3 plus 1 divided by 3 plus 2 plus 1 plus 1, which is nothing but 4 by 7, and 4 by 7 is about 0.57, so I am getting 57 percent accuracy here. That is how we calculate basic accuracy with the help of the confusion matrix. Now there are some more things you really need to keep in mind: always remember that our model's aim should be to reduce false positives and false negatives. Let me discuss one more topic. Suppose in my data set's output the zeros are 900 and the ones are 100 — this becomes an imbalanced data set, very clear, a biased data set. If instead the zeros are around 600 and the ones are around 400, I would say that is a balanced data set, because yes, one class has fewer samples, but it may not impact many of the algorithms. Most of the algorithms we will discuss are affected if we have an imbalanced data set. Let me show you why. Say the number of zeros is 900 and the number of ones is 100, and I create a model that directly predicts zero for every input it gets from this training data. What will my accuracy be? 900 divided by 1000, which is 90 percent. Is that a good accuracy? On paper, yes, but the data is biased and the model is just outputting 0, 0, 0 for everything — it gets 90 percent accuracy while doing nothing useful. So you should not depend only on accuracy. There are other terminologies we use: one is called precision, then we also use recall, and finally we will discuss the F score; I'll write the formulas and explain what we need to focus on in each. Whenever you have an imbalanced data set you can also do oversampling, but in some scenarios oversampling may work and in others it may not, so we have to focus on the type of performance metric we are using.
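As a quick sketch of the same bookkeeping using scikit-learn, with the seven actual and predicted labels from the walkthrough above:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 1, 0, 1, 1, 0, 1]   # actual values from the example
y_pred = [1, 1, 0, 1, 1, 1, 0]   # predicted values from the example

# sklearn lays the matrix out as [[TN, FP], [FN, TP]] -> here [[1, 2], [1, 3]]
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 4/7 ≈ 0.57
```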
Right now I will not say F1 score, I will say F score — the reason why, I'll let you know shortly. Let's talk about recall first: the recall formula is given by true positive divided by true positive plus false negative. Precision is given by true positive divided by true positive plus false positive. And then I will also discuss the F score, which we also call F beta. Let me draw the confusion matrix again, with ones and zeros for the actual values and for the predicted values, and the cells true positive, true negative, false positive and false negative. Now understand what recall is and where its focus lies. Recall is TP divided by TP plus FN, so it basically asks: out of all the actual positive values, how many have been predicted as positive? That is what recall says, and here the false negatives are given more priority — our focus should be to reduce the false negatives. Now let's discuss precision. In precision we ask: out of all the predicted positive values, how many of them are actually positive? That is what precision means. Now, suppose my task is spam classification — tell me, in this particular case should we use precision or recall? And one more use case: predicting whether a person has cancer or not. In which case should we go with recall and in which case with precision? By the way, recall is also called the true positive rate, or sensitivity. For spam classification we should definitely go with precision. Why precision? Because the damaging mistake there is a false positive — an important mail getting marked as spam — so in that scenario we should try to reduce the false positives, and precision is the metric that talks about the false positives in spam classification in a better way. In the case of cancer I should definitely use recall. Look at the recall formula, TP divided by TP plus FN: if a person actually has cancer, the actual label is one and it should be predicted as one; a false negative means the model says the person does not have cancer when they actually do, and that is a really dangerous situation. If a person does not have cancer and the model predicts that they do, they may go and do further tests and then find out whether they have it or not — inconvenient, but not dangerous. But a person who has cancer being told they don't is very dangerous. So here the false negative is given more priority, and in spam classification the false positive is given more priority. This is something important, and you really need to understand it with respect to each different problem statement.
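Continuing with the same seven-point example, here is a small sketch of precision and recall computed both by hand from the counts (TP = 3, FP = 2, FN = 1) and with scikit-learn as a cross-check:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

tp, fp, fn = 3, 2, 1
print(tp / (tp + fp))                    # precision = 3/5 = 0.60
print(tp / (tp + fn))                    # recall    = 3/4 = 0.75

print(precision_score(y_true, y_pred))   # 0.60
print(recall_score(y_true, y_pred))      # 0.75
```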
Let me give you one more example: predicting that tomorrow the stock market is going to crash. What should we focus on here — precision or recall? Now two things matter: who is solving the problem, and from whose point of view you are creating the model. Many people will say recall or precision straight away, but are you creating this model for the industry or for the people? For the people, a person should definitely get identified and notified that they need to sell their stock because the market is going to crash tomorrow; for companies, getting that call wrong is very bad — I hope everybody understands, for companies it is very, very bad. So in this particular case we sometimes need to focus on both false positives and false negatives, and again, which problem you are solving for decides it: if you are solving for people, they should get the notification that it is going to crash; if you are doing it for companies, your precision-recall trade-off may change. But if I have to consider both scenarios at the same time, I will definitely use something called the F score, also written as F beta. How is the F beta formula given? Here is the generic definition: F beta equals 1 plus beta square, multiplied by precision multiplied by recall, divided by beta square multiplied by precision, plus recall. Whenever both false positives and false negatives are important, we select beta as one. If I select beta as one, the factor becomes 1 plus 1, which is 2, so it is 2 multiplied by precision into recall, divided by precision plus recall. This is called the harmonic mean — you have probably seen this kind of equation written as 2xy divided by x plus y — and here the focus is on both false positives and false negatives. Now let's say your false positives are more important than your false negatives; at that point you decrease your beta value. Say I decrease beta to 0.5: then it becomes 1 plus 0.5 whole square, multiplied by precision into recall, divided by 0.25 multiplied by precision, plus recall. Decreasing the beta basically means you are giving more importance to false positives than false negatives. And finally, if I consider the beta value as 2, that means you are giving more importance to false negatives than false positives. From this you can conclude which value you want to use: when beta equals 1 it becomes the F1 score, when beta is 0.5 it becomes the F0.5 score, and when beta is 2 it becomes the F2 score. So based on whether the false positives or the false negatives are more important, you choose the beta, and the F score comes out differently accordingly.
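A small sketch of the F beta formula just defined, reusing the precision 0.6 and recall 0.75 from the earlier example and cross-checking one value with scikit-learn:

```python
from sklearn.metrics import fbeta_score

def f_beta(p, r, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = 0.6, 0.75
print(f_beta(p, r, 1.0))    # F1  ≈ 0.667, the harmonic mean of P and R
print(f_beta(p, r, 0.5))    # F0.5 leans towards precision (false positives matter more)
print(f_beta(p, r, 2.0))    # F2   leans towards recall (false negatives matter more)

y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]
print(fbeta_score(y_true, y_pred, beta=1))   # same as the F1 score on this data
```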
If your false positives are more important, you reduce the beta value; if your false negatives are more important than your false positives, you increase it. Beta is the deciding parameter between the F1 score, the F2 score, the F0.5 score and so on. Now, first things first: what is the agenda of today's session? First of all we will complete the practicals for all the algorithms we have discussed, with simple examples, and we will also do hyperparameter tuning. The second algorithm I am going to discuss is Naive Bayes, which is a classification algorithm, and we will understand its intuition. The third one is the KNN algorithm. I know the written plan looks short, but there is a fair amount of maths involved in Naive Bayes: we will revisit probability and something called Bayes' theorem, and then solve a problem with it. So let's proceed and enjoy today's session. How do we enjoy? By working through a practical problem, so I am opening a notebook file in front of you. We will try to solve a problem with linear regression, Ridge and Lasso; the aim is that everybody learns the basics in a better way. So, everybody open your Jupyter notebook. The first algorithm I am going to discuss is sklearn's LinearRegression. Let's see what is available in it; we will use parameters like fit_intercept as they are, but the main aim is to find the coefficients, indicated by theta 0, theta 1 and so on. We will start with linear regression and then move on to Ridge and Lasso; I am just going to make this cell a markdown cell. There are several libraries for linear regression: you can do it with statsmodels, you can do it with scikit-learn, and there are others. First of all we need a dataset, and we are going to take a small one: the Boston house pricing dataset, which is already present in sklearn. To import it I will write: from sklearn.datasets import load_boston. I am going to execute this, and I will also create a lot of empty cells so I don't have to keep adding them. The basic libraries I want are: import numpy as np, import pandas as pd, import seaborn as sns, import matplotlib.pyplot as plt, and %matplotlib inline, and I will execute this. My typing speed has become a little faster from writing and executing these queries again and again.
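Here are the imports described above as a runnable sketch. One caveat I am adding that is not in the lecture: load_boston was deprecated and has been removed from recent scikit-learn releases, so this cell only runs on older versions; fetch_california_housing is the usual replacement today.

```python
# Basic libraries used throughout the walkthrough
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline   # only needed inside a Jupyter notebook

# Boston house pricing dataset (removed in newer sklearn versions)
from sklearn.datasets import load_boston

df = load_boston()       # returns an sklearn.utils.Bunch (dict-like object)
print(type(df))
print(df.keys())         # data, target, feature_names, DESCR, ...
```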
So I have imported all the necessary libraries, which will be more than sufficient to start with. To load the dataset I use load_boston and initialise it. If you press Shift+Tab you can see the docstring: it loads and returns the Boston house prices dataset, which is a regression problem. Once I execute it and check type(df), it says sklearn.utils.Bunch; if I print df you will see it is in the form of key-value pairs: target is here, data is here, and feature_names is here. We need the feature names, the target values and the data values, and we have to combine them properly into a DataFrame. So I write pd.DataFrame and pass df.data; since this is a key-value structure, df.data gives me all the feature values. If I execute df.data on its own, I can see the whole array: feature one, feature two, feature three and so on; there are 13 features here, each with its values. Next I should attach the feature names as column headers. I convert this into a DataFrame, rename the variable to dataset, and set dataset.columns. At first I mistakenly set it to df.target and got an error: expected axis has 13 elements, new values have 506. So target is wrong here; the attribute I want is feature_names. If you look inside df there is a key called feature_names, so I use df.feature_names instead. Now if I run dataset.head(), or print it, you can see the whole dataset. These are the features of the house pricing dataset: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B and LSTAT. The description tells you what each feature means: DIS is the weighted distance to five Boston employment centres, RAD is the index of accessibility to radial highways, TAX is the full-value property-tax rate, PTRATIO is the pupil-teacher ratio, and so on. I don't know every detail of what each one means, but we have proper data in front of us. These are my independent features.
If you want the feature details you can see them here: CRIM is the per-capita crime rate by town, ZN is the proportion of residential land zoned for lots over 25,000 square feet, and so on. I did not do much: I just built a DataFrame from df.data with the feature names as columns, and I am getting these values; it is very simple. Let's go a little slowly so that everybody can follow. This is dataset.head(). Now, I have taken all these values, but they are only my independent features; I still need my dependent feature. So I create a new column called Price, the price of the house, and assign it the target values: the target is the sale price of the houses, again in the form of an array, and I am going to use it as the dependent feature. So I write dataset['Price'] = df.target. Once I execute this and look at dataset.head() again, you will see the features plus one more column, Price. In this dataset the price is the median home value expressed in thousands of dollars; you can confirm that from the dataset description. So all of these are my independent features and Price is my dependent feature, and if I want to solve linear regression I have to separate the independent and dependent features properly. The next step is dividing the dataset, first into independent and dependent features. X will be my independent features, so I write dataset.iloc, which is available on DataFrames, and specify which columns to take: I want everything except the last feature. To skip the last column, I index all rows and all columns except the last one; that is how you do this kind of indexing. For y, the dependent feature, I again use dataset.iloc, take all the rows, and take only the last column. Remember that the first index in iloc refers to the rows and the second refers to the columns: from all the columns I skip the last one for X and keep only the last one for y. Once this is executed, X.head() shows all the independent features and y.head() shows the dependent feature.
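Continuing from the previous sketch (it assumes df and pd from that cell), this is the DataFrame construction and the X/y split described above.

```python
# Build the DataFrame from the Bunch and add the target as a 'Price' column
dataset = pd.DataFrame(df.data, columns=df.feature_names)
dataset['Price'] = df.target         # median home value (in $1000s)
print(dataset.head())

# Independent / dependent split with iloc
X = dataset.iloc[:, :-1]             # all rows, every column except the last
y = dataset.iloc[:, -1]              # all rows, only the last column (Price)
```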
Now let's go to the first algorithm, linear regression. Remember, whenever I start with linear regression I will not stop at plain linear regression; I will also go on to ridge regression and lasso regression, because they give us more options for hyperparameter tuning. But first let me show you how linear regression itself is done. We need a couple of libraries here. The first is the LinearRegression class: from sklearn.linear_model import LinearRegression. Do you need to memorise this? No; I also Google where things live inside sklearn. I initialise it as lin_reg = LinearRegression(). Then I am going to apply cross-validation. Cross-validation is very important because it splits the data into train and test sets in such a way that every combination of train and test folds is seen by the model, and the scores from all the folds are combined. For this I import one more library: from sklearn.model_selection import cross_val_score. What does cross-validation do? Suppose this is your entire dataset of 100 records. If you do 5-fold cross-validation, in the first fold one fifth is your test data and the rest is your training data; in the second fold a different fifth becomes the test data and the rest becomes the training data; and so on, five times, with a different train/test combination each time. I am not going to discuss it in more depth here; if you want a separate session on it I will include one later. So I take cross_val_score, and the first parameter I give is my model, the LinearRegression instance, and then X and y. I am not doing a train/test split here; I am giving the entire X and y and doing cross-validation on that. You can also do a train/test split first and give only X_train and y_train for the cross-validation; in fact the best practice is to split first and cross-validate only on the training data. For scoring I will use 'neg_mean_squared_error' (negative mean squared error), and again, you can find all the available scoring options on the sklearn page for cross_val_score. Finally you give the cross-validation value, cv = 5 or 10, whatever you want.
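As a runnable sketch (continuing from the earlier cells, so X, y and np are assumed to exist):

```python
# Cross-validated linear regression: cross_val_score fits the model on 5
# different train/test folds and returns one score per fold
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lin_reg = LinearRegression()
mse = cross_val_score(lin_reg, X, y,
                      scoring='neg_mean_squared_error', cv=5)

print(mse)            # 5 negative-MSE values, one per fold
print(np.mean(mse))   # average score across the folds
```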
After this, since I am doing 5-fold cross-validation, I will get five mean squared error scores; if you don't believe me, print mse and you will see five different values, one per fold. So I take np.mean of all five; that is my mean_mse, and I print it. That is my average score: the value is negative because we used 'neg_mean_squared_error', but as an ordinary mean squared error it is about 37.13. So that is how you do cross-validation. With plain linear regression you cannot modify much through the parameters, and that is exactly why, to overcome overfitting and to do feature selection, we use ridge and lasso regression. So now I will show you how to do ridge regression. By the way, if you want to make predictions, all you have to do is take the model, lin_reg, and call .predict on it with whatever test values you want; the prediction happens automatically. I am going to remove that and focus on ridge regression now, because I want to show how hyperparameter tuning is done there. For ridge regression I use two libraries: from sklearn.linear_model import Ridge (ridge is also inside linear_model), and, for the hyperparameter tuning, from sklearn.model_selection import GridSearchCV. GridSearchCV will help us with the hyperparameter tuning. And about the difference between MSE and negative MSE: it is not a big thing; if you use plain mean squared error you get about 37.
I have just used the negation of the MSE; either is fine, you can go with plain mean squared error as well. There is also another scoring option which focuses on the square root, root mean squared error, so there are different metrics you can choose. Now, to get a good value, I am going to do hyperparameter tuning, so let's go ahead with GridSearchCV. Again I define my model, which will be Ridge; that is what I imported. Let me open the sklearn documentation for Ridge so we can see which parameters are used. Do you remember the alpha value? Why do we use alpha? I told you: alpha multiplied by the slope squared is the penalty term we use in ridge, and lasso has a corresponding one. Alpha is the most important parameter here and it is the one we will tune. The next parameter we could tune is max_iter, the maximum number of iterations, meaning how many times we are allowed to update the theta values to reach the right ones. So I am going to select some alpha values and play with them; if you want, you can also play with the other parameters, such as the iteration count. It is up to you, try whichever parameter you want. Before running GridSearchCV, let me define my parameters. Here is my Ridge model, and now I create a dictionary called parameters. Initially I wrote the key as C; sorry, my mistake, it is not C, it is alpha. C is the corresponding parameter for logistic regression, which I will show later. So the key is alpha, and I list some values, starting with 1e-15.
1e-15 means 0.000000000000001; similarly 1e-10 is an even tinier-looking way of writing a small number (I am just having a bit of fun so you stay entertained). Then 1e-8, 1e-3, 1e-2, and from there I increase: 1, 5, 10, 20, something like that. I am going to play with all of these values, because what GridSearchCV does is try every combination of these alpha values and, wherever the model performs best, it selects that parameter and reports it as the best fit. Now I apply GridSearchCV: ridge_regressor = GridSearchCV(...), and the arguments I pass are, first, my model (the Ridge instance), and then the params dictionary I defined. If I press Shift+Tab on GridSearchCV (after executing the import, otherwise Shift+Tab will not show anything) you can see that estimator is the first argument, param_grid is the second, then scoring, then the other parameters. So the first thing you pass is your model, then the parameters you want to search over, and the third is the scoring; here again I use 'neg_mean_squared_error'. Some people ask why plain 'mean_squared_error' is not available as a scoring string: sklearn tries to keep a generic scoring interface that works across algorithms, which is probably why only the negated version exists; if you want to dig deeper, go and Google it. Then I call ridge_regressor.fit(X, y), and again, you could first do a train/test split on X and y and fit only on X_train and y_train. I got an error: the parameter grid cannot be a list, so I convert it into a dictionary. Right now I am focused on implementing things; if I get an error I will just fix it rather than worry about why it came. So that is the GridSearchCV, the fit is done, and now let's look at the best parameters. I print ridge_regressor.best_params_ and ridge_regressor.best_score_. The value selected is alpha = 20, and the best score is about -32. Initially I got about -37, so because of ridge regression the negative mean squared error has improved; there is a minus sign, don't worry about it, but from 37 it has come down to 32. And remember, cross-validation is also happening inside GridSearchCV while it tries every combination.
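The ridge tuning described above, as a sketch that continues from the earlier cells (X, y already defined); the alpha grid is the lecture's, the rest of the values are illustrative.

```python
# Hyperparameter tuning for Ridge with GridSearchCV over a grid of alphas
# (the param grid must be a dict, not a list)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge = Ridge()
params = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2,
                    1, 5, 10, 20, 30, 35, 40, 45, 100]}

ridge_regressor = GridSearchCV(ridge, params,
                               scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(X, y)    # better practice: fit on X_train, y_train only

print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)
```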
We can also pass the cv value, the number of cross-validation folds, to GridSearchCV ourselves. Considering all of this, many people ask: "Krish, the negative value changed; does that mean we cannot use ridge regression?" Let me write it down again so nobody worries: previously, with plain linear regression and cross-validation, I got about -37; now, with ridge, I am getting about -32. Since these are negative mean squared errors, the value closer to zero is the better one, so between the two, ridge is performing better here, and remember that ridge also tries to reduce overfitting. Still, let's also try lasso regression and compare. I copy and paste the same code: from sklearn.linear_model import Lasso, and this becomes my lasso model; let's see whether it improves things. This is the parameter that got selected: lasso_regressor.best_params_ shows alpha = 1, and I print lasso_regressor.best_score_ as well. Here I am getting about -35, versus -32 with ridge, so lasso is not doing better than ridge so far. Now watch what happens if I add more alpha values to the grid: I remove the old list and add values like 1, 5, 10, 20, 30, 35, 40, 45, 100, and let's see whether the performance improves. First I run it for ridge with these extra parameters and cv = 5 (I am adding more parameters like this, in case you could not see it on screen). After adding more values, ridge gives about -29, and the alpha that got selected is 100. If you want, try it with cross-validation of 10 and execute it again; these are the kinds of hyperparameters we play with. So ridge is now at about -29; you can also increase the cross-validation value and run it again. With lasso, though, it does not seem to be improving much; it is coming to about -34.
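The same tuning procedure with Lasso; only the estimator changes (this sketch reuses GridSearchCV and the params dict from the ridge cell, and the cv value is just one reasonable choice).

```python
# Lasso tuned over the same alpha grid
from sklearn.linear_model import Lasso

lasso = Lasso()
lasso_regressor = GridSearchCV(lasso, params,
                               scoring='neg_mean_squared_error', cv=10)
lasso_regressor.fit(X, y)

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
```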
You just have to play with these parameters. For a bigger problem statement it does not stop here: we take many more parameters and many more values, and whichever combination gives the best result is the one we keep. Yes, in some of these runs the error went up even after trying different parameters; that happens. Here I originally had about -37, and one more thing I can do to improve on it is to bring in a proper train/test split and repeat the process; let's see an example. How do we do a train/test split? from sklearn.model_selection import train_test_split. (You may get slightly different values than mine; that's okay.) Let me make the problem a little simpler: I insert a cell below and redo it with a train/test split. I take the same code, paste it, and write X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) with some random_state; the random state can be anything. test_size=0.33 means the test set will have 33 percent of the data and the training set the remaining 67 percent. Now I fit on X_train and y_train, and when I look at the score I am seeing about -25. Understand that this value should move towards zero; the closer it gets to zero, the better the performance. Similarly I do it for ridge with X_train and y_train, and the best score there is about -25.47, and here about -25.18, so the improvement is still limited because we are not getting very close to zero. Next, you can take lasso_regressor.predict and predict on X_test; that gives the predictions, call them y_pred, to compare against y_test. Then, if you remember R-squared and adjusted R-squared, sklearn has r2_score, so I write from sklearn.metrics import r2_score. I compute the R-squared on y_test and y_pred (note the convention is r2_score(y_true, y_pred)), store it in a variable, and print it.
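A sketch of the split-and-evaluate step just described, continuing from the earlier cells (X, y, lasso_regressor); test_size and random_state are illustrative choices, not anything fixed by the lecture.

```python
# Train/test split version of the workflow, ending with R-squared on held-out data
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

lasso_regressor.fit(X_train, y_train)       # tune on the training data only
y_pred = lasso_regressor.predict(X_test)    # predictions on unseen data

print(r2_score(y_test, y_pred))             # closer to 1 is better
```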
By the way, when you search the metrics you will also see adjusted_rand_score in the list, which is something different; r2_score is there, but adjusted R-squared is not a separate function in sklearn, you normally compute it yourself from the R-squared value. So this is how the output looks with the lasso regressor. Ideally it should be close to 1, that is 100 percent; right now I am getting about 67 percent. If I want, I can also try ridge: ridge_regressor.predict gives me about 68 percent. Then I try the plain linear regressor and get an error saying the model is not fitted yet. Why is it not fitted? Because I never called fit on it after the split, so I fit it with lin_reg.fit(X_train, y_train), run the prediction again, and the R-squared also comes out around 67 to 68 percent. Since this is just linear regression you will not get to 100 percent: you are only drawing a straight line. For that you would use other algorithms such as XGBoost, Naive Bayes and so on; many algorithms exist. And yes, you pass y_test and y_pred together because you are comparing the two. At some point you can only push the performance so far: in linear regression I can only create one best-fit line, I cannot create a curve, so the accuracy will be limited. Let's quickly do the logistic regression practical, and there too we can use GridSearchCV. First the dataset: I will quickly implement logistic regression with from sklearn.linear_model import LogisticRegression. For logistic regression we need a classification problem, so let's take a new dataset: from sklearn.datasets import load_breast_cancer, the breast cancer dataset, which is also present in sklearn. I load it; all the independent features are in df.data and the column names are in df.feature_names, the same way we did previously, and that becomes my complete set of independent features. If I look at X.head() you can see that based on these input features we need to determine whether the person has cancer or not; there are many features here. That was the independent part; the dependent feature is already present in df.target. So I create y as pd.DataFrame(df.target) with the column name 'target', and if I look at y it contains the zeros and ones in the target column.
First of all we need to check whether this y column is balanced or imbalanced; if the dataset is imbalanced we would have to work on that, for example by upsampling. If I write y['target'].value_counts() and execute it, value_counts tells me how many ones and how many zeros there are: 357 ones and 212 zeros. Is this an imbalanced dataset? It is reasonably balanced, so we can proceed. Now I do a train/test split; I copy the same train_test_split code as before and get X_train, X_test, y_train and y_test. Next, if I search for LogisticRegression in the sklearn documentation, I can see the parameters. There is the penalty, the l1 or l2 regularisation we discussed for logistic regression, and the C value; these two parameters are very important. The penalty decides which kind of regularisation you add, and you can use 'l2' or 'l1'. C is the inverse of the regularisation strength, roughly 1/lambda, and it is also very important. There is also class_weight: if your dataset is not balanced you can apply weights to the classes, for example class_weight='balanced', or you can pass your own weights. And no, this is not ridge or lasso; this is logistic regression, but logistic regression also has l1 and l2 penalty norms. I probably skipped that part in the theory session: logistic regression can be understood in two ways, the probabilistic way and the geometric way, and in my logistic regression video on the YouTube channel I have explained the l1 and l2 norms as well; here they act as a penalty for this kind of classification problem. So let's decide which parameters to tune. I will play with two: C, for which I define a set of values like 1, 5, 10, 20 (you can define any set), and max_iter, which is specifically for the GridSearchCV I am about to run. I execute this and that is my params. Now I quickly define my model, model1, a LogisticRegression with some default C and max_iter values, and then apply GridSearchCV to it with param_grid=params.
Since this is a classification problem and I am not completely sure whether false positives or false negatives matter more here, I am going to use f1 as the scoring, the F1 score, which is the performance metric we discussed yesterday, and cv = 5. That is my whole GridSearchCV model; I execute it and then call model.fit on my X_train and y_train data. Once it runs you will see the output along with a lot of warnings (with this many parameter combinations that is expected), and finally a best estimator is selected. To see the best parameters, print model.best_params_: here a max_iter of 150 got selected; and model.best_score_ is about 95 percent. But we still want to test it on the test data, and yes, we can definitely do that: model.predict on X_test gives the predictions, which I store as y_pred. If you look at y_pred it is the predicted ones and zeros. After getting the predictions I can build a confusion matrix (I hope you remember the confusion matrix), so from sklearn.metrics I import confusion_matrix and classification_report. To see the confusion matrix I pass y_test and y_pred; swapping the order only flips the layout, as I showed you earlier, and this is my confusion matrix: 63, 118, 3 and 4.
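The logistic regression walkthrough from the last few steps, condensed into one sketch (pd is assumed from the earlier cells; the C/max_iter grid, test_size and random_state are illustrative choices):

```python
# Breast-cancer data -> balance check -> split -> GridSearchCV (F1) -> evaluation
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score)

df = load_breast_cancer()
X = pd.DataFrame(df.data, columns=df.feature_names)
y = pd.DataFrame(df.target, columns=['target'])

print(y['target'].value_counts())      # 357 ones vs 212 zeros -> fairly balanced

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

params = {'C': [1, 5, 10, 20], 'max_iter': [100, 150, 200]}
model = GridSearchCV(LogisticRegression(), param_grid=params,
                     scoring='f1', cv=5)
model.fit(X_train, y_train.values.ravel())

print(model.best_params_, model.best_score_)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```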
Finally, for the accuracy, I can also import accuracy_score; with y_test and y_pred, which we discussed yesterday, it gives about 96 percent. If you want the detailed precision, recall and F1 numbers, use classification_report with y_test and y_pred; here you can see the precision, recall and F1 score, and since this is a fairly balanced dataset the performance is naturally good. Yes, you can also use the ROC curve; I will show you how to use ROC as well, where you compute the false positive rate and the true positive rate, but don't worry about ROC for now, I will explain the theory first. Now let's go ahead and discuss Naive Bayes. Naive Bayes is an important algorithm, another excellent algorithm used specifically for classification, and it works on something called Bayes' theorem. So first we need to understand Bayes' theorem. Let's say I have an experiment called rolling a die. If I ask what is the probability of getting a 1, you will say 1/6; the probability of a 2 is also 1/6; the probability of a 3 is 1/6. These kinds of events are called independent events: getting a one or a two in one roll does not depend on what came before, which is why we call them independent. Now for dependent events, consider a bag of marbles with three red marbles and two green marbles. In the first event I take out a red marble: the probability of drawing a red marble is 3/5. In the second event, given that a red marble has already been removed, I want to take out a green marble. One marble is gone, so four marbles are left, and the probability of drawing a green marble is 2/4, which is 1/2. So in the first event I drew a red marble and in the second event I drew a green marble, and these two are dependent events, because the number of marbles decreases as you draw. So if I ask for the probability of drawing a red marble and then a green marble, the formula is very simple.
As we have already discussed in statistics, it is simply the probability of red multiplied by the probability of green given red. This is called conditional probability: the probability of the green marble given that the red-marble event has already occurred, because the two events are dependent. Let me write it down properly: P(A and B) = P(A) × P(B | A). Now let's derive something. Can I write P(A and B) = P(B and A)? Yes, definitely; if you do the calculation you get the same answer, so you should not say no. In the marble example, taking the events the other way round: the probability of green is 2/5 and the probability of red given green is 3/4, and 2/5 × 3/4 gives the same 3/10 as 3/5 × 2/4. So I can also write P(B and A) = P(B) × P(A | B). Setting the two expressions equal and rearranging, I can derive P(A | B) = P(B | A) × P(A) / P(B). This is called Bayes' theorem, and it is the crux behind Naive Bayes; understand that this is the crux behind the whole algorithm. Now let's see how we use it to solve problems; let me take an example. Suppose I have features x1, x2, x3, x4, x5, up to xn, and I have my output y. These are my independent features and y is my output, which is also my dependent feature. What I really want to find is the probability of y given the inputs: on a training dataset the model sees the input values and the corresponding outputs and learns from them, and then, given new input values, it has to predict the output. Writing the whole thing in terms of the equation, with y playing the role of A and the inputs x1 through xn playing the role of B: P(y | x1, x2, x3, ..., xn) equals P(y) multiplied by P(x1, x2, ..., xn | y), divided by P(x1, x2, ..., xn).
Now, under the naive assumption that the features are conditionally independent given y, the numerator expands: P(y) multiplied by P(x1 | y) × P(x2 | y) × P(x3 | y) and so on up to P(xn | y), and the denominator becomes P(x1) × P(x2) × P(x3) × ... × P(xn). The x values change from record to record, but the possible outputs of y stay the same; y may be yes or no, and in general it can be binary or multi-class, whatever your problem is. I will solve a problem in front of you so it all makes sense. So let's say my dataset has features x1, x2, x3, x4 and an output y, and y takes the values yes or no. How do I write the prediction? I ask: what is the probability that y equals yes given this record's xi values, where xi means the values of x1, x2, x3, x4 for that record? Using the equation above, that is P(yes) multiplied by P(x1 | yes) × P(x2 | yes) × P(x3 | yes) × P(x4 | yes), divided by P(x1) × P(x2) × P(x3) × P(x4). y is fixed to yes or no, but the xi values may change with every record. Similarly, the probability that y equals no given xi is P(no) multiplied by P(x1 | no) × P(x2 | no) × P(x3 | no) × P(x4 | no), divided by the same denominator P(x1) × P(x2) × P(x3) × P(x4). For any input xi we may get either yes or no, so we need to compute both probabilities.
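Written compactly, the two results derived above (this is just a summary of the lecture's equations):

```latex
% Bayes' theorem, and the Naive Bayes factorisation used for prediction
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad
P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```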
For any input xi the output can be yes or no, and we need both probabilities; both formulas are written above. One thing you can see is that the denominator is the same in both: it is fixed, it does not change between the two cases, so I can treat it as a constant and simply ignore it. From now on I will use only the numerator to compare the probabilities. Now suppose that for a specific record the unnormalised score for yes given xi comes out as 0.13 and the score for no given xi comes out as 0.05. You know that in binary classification we usually threshold at 0.5, treating anything greater than or equal to 0.5 as 1 and anything below as 0, but 0.13 and 0.05 are not proper probabilities yet, so we do something called normalisation: the probability of yes given xi becomes 0.13 / (0.13 + 0.05), which is about 0.72, or 72 percent, and the probability of no given xi is 1 - 0.72 = 0.28, or 28 percent. That is the final answer, and these are the formulas you have to remember. Now let's solve a problem, a very interesting one. I have a dataset with features like Day, Outlook, Temperature, Humidity and Wind, and the output feature Play Tennis, which is a binary classification target; let me copy this dataset for you. First let's extract some information from the Outlook feature. Day, Outlook, Temperature, Humidity and Wind are the input, or independent, features, and Play Tennis is the output feature. I take the Outlook feature and build a small frequency table from it. Outlook has three categories: Sunny, Overcast and Rain, so I write those down, and for each category I count how many Yes and how many No outcomes there are, and from that compute the probability of Yes and the probability of No. So the table has the Outlook categories as rows and the columns Yes, No, P(Yes) and P(No). Now count: with Sunny, how many records are No? The first Sunny record is a No, so the count goes to 1; the next Sunny record is also a No, so the count goes to 2.
The third Sunny record is also a No, so the No count for Sunny goes to 3. Then, with Sunny, how many are Yes? There are two, so for Sunny I have 2 Yes and 3 No. (Remember, Outlook is playing the role of my x1 feature here.) Next, Overcast: how many Overcast records are Yes? One, two, three, four, so four Yes with Overcast; and if you check, there are zero No records with Overcast. Then Rain: counting through, there are 3 Yes and 2 No with Rain. If you add up all the Yes and No counts you get 9 Yes and 5 No, which is 14 records in total, matching the full dataset. Now fill in the probabilities. Out of the 9 Yes records, 2 are Sunny, so the Yes-column entry for Sunny is 2/9; for Overcast it is 4/9 and for Rain it is 3/9. Strictly speaking these entries are P(Sunny | Yes), P(Overcast | Yes) and P(Rain | Yes), the likelihood of each Outlook value among the Yes records, but I am writing the table in a simpler way so that you do not get confused: one column for the Yes probabilities and one for the No probabilities. Similarly, on the No side the entries are 3/5, then 0/5, then 2/5.
Now let's do the same for one more feature; let's consider Temperature. Temperature has three categories: Hot, Mild and Cool, and again I build a table with columns for Yes, No, P(Yes) and P(No). With Hot, counting through the records, there are 2 Yes and 2 No. With Mild there are 4 Yes and 2 No, and with Cool there are 3 Yes and 1 No. Again the totals come to 9 Yes and 5 No, matching what we had before. Now fill in the likelihoods: P(Hot | Yes) is 2/9, P(Mild | Yes) is 4/9 and P(Cool | Yes) is 3/9; on the No side, P(Hot | No) is 2/5, P(Mild | No) is 2/5 and P(Cool | No) is 1/5. So these two tables are ready. Finally, for the Play Tennis output itself, the total number of Yes is 9 and No is 5 out of 14, so the prior probability of Yes is 9/14 and of No is 5/14; you need these two values as well. Now suppose a new test record arrives where the Outlook is Sunny and the Temperature is Hot: what is the output? Let me write it down. P(Yes | Sunny, Hot) = P(Yes) × P(Sunny | Yes) × P(Hot | Yes), divided by P(Sunny) × P(Hot); and the denominator will be exactly the same when I compute the No case.
Now substitute the values. P(Yes) is 9/14, P(Sunny | Yes) is 2/9 and P(Hot | Yes) is 2/9, so the numerator is 9/14 × 2/9 × 2/9 = 2/63, which is about 0.031. (I read that term out slightly wrong at first; it should be probability of Sunny given Yes.) Next calculate P(No | Sunny, Hot): it is P(No) × P(Sunny | No) × P(Hot | No) over the same denominator P(Sunny) × P(Hot), and since the denominator is a constant it cancels out of the comparison. P(No) is 5/14, P(Sunny | No) is 3/5 and P(Hot | No) is 2/5, so the numerator is 5/14 × 3/5 × 2/5 = 3/35, which on the calculator is about 0.0857. Let me write it down again: the unnormalised score for Yes given Sunny and Hot is about 0.031, and for No given Sunny and Hot it is about 0.085. Now normalise: 0.085 divided by (0.031 + 0.085) is about 0.73, which is 73 percent, and 1 - 0.73 = 0.27, which is 27 percent. So if the input comes in as Sunny and Hot, meaning the weather is sunny and hot, will the person play or not? The answer is no. Now my next question: if your new data is Overcast and Mild, what will the probability be using Naive Bayes? You can add any number of features; you could also build the same kind of tables for Humidity and Wind to include them. But this one is an assignment: try to solve Overcast and Mild with Naive Bayes yourself.
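The same hand calculation, written as plain Python so you can check the arithmetic; every number comes straight from the frequency tables built above.

```python
# Play-tennis Naive Bayes example: predict for Outlook=Sunny, Temperature=Hot
p_yes, p_no = 9/14, 5/14                     # priors from the Play column

# Likelihoods read off the Outlook and Temperature tables
p_sunny_given_yes, p_sunny_given_no = 2/9, 3/5
p_hot_given_yes,   p_hot_given_no   = 2/9, 2/5

# Unnormalised posteriors (the constant denominator is ignored)
score_yes = p_yes * p_sunny_given_yes * p_hot_given_yes   # ~0.031
score_no  = p_no  * p_sunny_given_no  * p_hot_given_no    # ~0.086

# Normalise so the two probabilities sum to 1
p_no_given_x = score_no / (score_yes + score_no)
print(round(p_no_given_x, 2))        # ~0.73 -> the person will not play
print(round(1 - p_no_given_x, 2))    # ~0.27
```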
so the second algorithm that we are going to discuss about is something called as knn the knn algorithm is a very simple algorithm okay which can be used to solve both classification and regression knn basically means k nearest neighbors let's first of all discuss the classification problem let's say that i have a binary classification problem where i have two groups of data points and suppose a new data point comes over here then how do i say whether it belongs to this category or that category if i probably create a logistic regression i may divide it with a line but in this particular scenario how do we come to a conclusion so here we basically use k nearest neighbors let's say my k value is 5 so what it is going to do is take the five nearest closest points let's say two of them come from one group and three come from the other group so here we basically look at the distance to find which are my nearest points now in this particular case the maximum number of points are from the red category from the red category i'm getting three points and from the white category i'm getting two points so whichever category the maximum number of neighbors belong to we assign the new point to that particular class just with the help of distance and which distance do we specifically use we use two distances one is euclidean distance and the other one is manhattan distance now what does euclidean distance say suppose your two points are denoted by x1 y1 and x2 y2 then euclidean distance is the square root of x2 minus x1 whole square plus y2 minus y1 whole square whereas in the case of manhattan distance we calculate the distance along the axes from here to here and then here to here we don't calculate the hypotenuse distance the formula for manhattan distance uses the modulus mod of x2 minus x1 plus mod of y2 minus y1 so this is the basic difference between euclidean and manhattan distance now you may be thinking krish fine that is for classification what do we do for regression for regression also it is very simple suppose i have all the data points and for a new data point i want to calculate the output we again take the nearest five points let's say my k is five k is a hyper parameter which we play with so with respect to k equal to 5 it will take the outputs of those five nearest points calculate their average and that average becomes your prediction so that is the only difference between regression and classification and since k is a hyper parameter we try k equal to 1 to 50 check the error rate and whichever k gives the least error we select that model now two more things knn works very badly with respect to two things one is outliers and the other is an imbalanced data set if i have an outlier let's say this is one of my categories and this is another category and there are some outliers over here now if i'm trying to predict for this point the nearest neighbours are basically those outliers so the point will be treated as that group only even though it may not really belong there so that was it from my side on knn guys and yes i've also made detailed videos about whatever topics we have discussed today you can directly go and search for that particular topic a small sketch of knn with both distance metrics is given below
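As a minimal illustration of the idea above, here is a small sketch of KNN classification with both distance metrics, written with NumPy. The function name and the toy points are made up for illustration; this is not a library implementation.

```python
import numpy as np

# toy sketch of k nearest neighbors for classification, assuming small in-memory arrays
def knn_predict(X_train, y_train, x_new, k=5, metric="euclidean"):
    diff = X_train - x_new
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=1))        # sqrt((x2-x1)^2 + (y2-y1)^2)
    else:
        dist = np.abs(diff).sum(axis=1)                # |x2-x1| + |y2-y1| (manhattan)
    nearest = np.argsort(dist)[:k]                     # indices of the k closest points
    labels, votes = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(votes)]                    # majority vote among the neighbors

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array(["red", "red", "red", "white", "white", "white"])
print(knn_predict(X, y, np.array([2, 2]), k=3))        # expected "red"
```

For the regression case, the only change would be to replace the majority vote with the mean of the neighbours' output values.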
so today's session we are basically going to discuss about decision tree okay and in this session we are going to understand what is the exact purpose of a decision tree with the help of a decision tree you are actually solving two different problems one is regression and the other one is classification so we'll try to understand both and we will take a specific data set and try to solve those problems now coming to the decision tree one thing you need to understand let's say i am writing this condition if age is less than or equal to 18 i'm going to print go to college so here i'm printing college then i'll write else if age is greater than 18 and age is less than or equal to 35 i'm going to print work basically people need to work in this age else i'm just going to print retire so here is my if else condition now whenever we have this kind of nested if else condition we can also represent it in the form of a decision tree first of all we will have a specific root node and in this root node the first condition is age less than or equal to 18 so obviously i will be having two branches one will be yes and one will be no if this condition is true we go on this side and here we will basically have college so this is your leaf node similarly when the answer is no we go to the next condition where i again create a node and say age is greater than 18 and less than or equal to 35 and again i'll have two branches yes or no if it is yes i will print work so this will again be my leaf node and for no i do the further split which is retire so here you can see that this entire code that i have written has got converted into this kind of tree where you are able to take decisions yes or no so can we solve a regression and a classification problem using these decision trees by creating this kind of nodes yes so in short whenever we talk about decision trees decision trees are nothing but these nested if else conditions which can solve a specific problem statement but in a visualized way where we create the tree in the form of nodes a minimal sketch of the same age rules as code is given below
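Here are the same age rules from the explanation above written as nested if/else, just to show that the tree and the code express exactly the same decisions. The function name and the test ages are illustrative.

```python
# the age rules from the explanation above, written as nested if/else
def decide(age):
    if age <= 18:
        return "college"        # condition true -> leaf node "college"
    elif 18 < age <= 35:
        return "work"           # second node: 18 < age <= 35 -> leaf node "work"
    else:
        return "retire"         # everything else -> leaf node "retire"

for age in (15, 25, 60):
    print(age, decide(age))     # college, work, retire
```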
now you need to understand what type of maths we will probably use for this so let's do one thing let's take a specific data set which i will do over here in front of you let me just open my snipping tool so this is the data set that i have now this data set is pretty important because in research papers also the people who have come up with this algorithm usually take this same example now this particular problem statement is a classification problem statement okay but don't worry i will also explain how decision tree regression will work so let's go ahead suppose i have this specific problem statement how do we solve it this is my output feature play tennis yes or no whether the person is going to play tennis or not so if i have the input features outlook temperature humidity and wind is the person going to play tennis or not this is what my model should predict with the help of a decision tree so how will the decision tree work in this particular case first of all let's consider any specific feature let's say that outlook is my feature so this will be my first feature now just tell me how many records have yes and how many have no in this data set you will find there are nine yes one two three four five six seven eight nine and five no so nine yes and five no and the first node that i have taken is outlook so focusing on this feature how many categories do i have i have one category sunny then another category overcast then another category rain so i have three unique categories and based on these three categories i will create three child nodes one for sunny one for overcast and one for rain so i am splitting on outlook now just go and see in sunny how many yes and how many no are there in sunny i have three no and two yes so with respect to sunny there are two yes and three no now let's say that i have randomly selected this feature outlook why outlook and not something else it is up to the decision tree to select the feature here i have specifically taken outlook and later on i'll explain how it selects the feature okay don't worry so in the case of sunny we have two yes and three no next let's go and see for overcast in overcast i have four yes and i don't have any no so over here it will be four yes and zero no and then finally when we go to the rain part count how many yes and no are there if i take rain
as an example it will be three yes and two no once you understand the algorithm everything else will follow so let's summarize sunny has two yes and three no overcast has four yes and zero no and rain has three yes and two no now if i look at overcast here you need to understand two terms one is pure split and one is impure split what does pure split mean pure split basically means that in this particular node i have either all yes or all no so in overcast i have four yes and zero no that means this is a pure split tomorrow in my data set if on some day say date 15 the outlook is overcast then i know directly that the person is going to play so this part of the tree is already decided and this node is called a pure node understand why it is called a pure node because either you have all yes and zero no or zero yes and all no in this particular case i have all yes so if i take this path i know that with respect to overcast my final decision is always going to be yes so i don't have to split further from here i will definitely not split more because i don't require it it is a pure leaf node now let's talk about sunny in the case of sunny you have two yes and three no so this is obviously impure so what do we do we take the next feature and again how do we calculate which feature we should take next i'll discuss that let's say after this i take up temperature and i start splitting again since this node is impure and this splitting will happen again and again until we finally get a pure split similarly with respect to rain we will take another feature and keep on splitting unless and until we get a leaf node which is completely pure i hope you understood how this exactly works now two questions the first is krish how do we calculate this purity how do we come to know that a split is pure just by looking i can definitely say by counting how many yes and no are there but formally for this we use two different things one is entropy and the other one is something called as gini impurity so we will try to understand how entropy works and how gini impurity works in a decision tree which will help us determine whether a split is pure or not or whether a node is a leaf node or not then coming to the second thing your most important question why did i select outlook how are the features selected and here you have a topic which is called information gain and if you know both of these your problem is solved so now let's go ahead and understand entropy gini impurity and information gain and by the way i'll call it gini impurity over here not gini coefficient
i hope everybody has understood till here let's go ahead and discuss the first thing that is entropy how does entropy work and what formula do we use so the entropy formula is given by h of s equal to minus p plus times log base 2 of p plus minus p minus times log base 2 of p minus where p plus is the probability of the positive class yes and p minus is the probability of the negative class no and in gini impurity the formula is 1 minus summation over i equal to 1 to n of p i square i'll also talk about when you should use gini impurity and when you should use entropy you know by default the decision tree classifier uses gini impurity now let's take one specific example i have a feature one as my root node and in this root node i have six yes and three no very simple let's say this feature has two categories and based on these two categories a split has happened in category c1 i have three yes and three no and in category c2 i have three yes and zero no always understand if i do the summation it should match the root node three plus three is six yes and three plus zero is three no so the child counts always add up to the parent now let's calculate the entropy of the pure child first the one with three yes and zero no h of s equal to minus p plus log base 2 p plus minus p minus log base 2 p minus what is p plus it is the probability of yes which is three by three equal to one and what is p minus it is zero by three equal to zero so the second term becomes zero because zero times anything is zero and the first term is one times log base 2 of one which is also zero so tell me whether this is a pure split or an impure split it is a pure split and whenever we have a pure split the entropy comes out to be zero so here i am going to draw one graph with h of s on one axis and p plus on the other when the probability of plus is 0.5 the probability of minus is also 0.5 it's just like p equal to 1 minus q and at that point h of s will be 1 so the entropy curve goes from 0 at p plus equal to 0 up to 1 at p plus equal to 0.5 and back down to 0 at p plus equal to 1 a quick sketch of both formulas in code is given below
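As a quick check on the two formulas above, here is a minimal sketch in plain Python that computes entropy and gini impurity from the yes/no counts of a node. The function names are just illustrative.

```python
import math

def entropy(n_yes, n_no):
    total = n_yes + n_no
    h = 0.0
    for count in (n_yes, n_no):
        p = count / total
        if p > 0:                       # 0 * log2(0) is treated as 0
            h -= p * math.log2(p)
    return h

def gini(n_yes, n_no):
    total = n_yes + n_no
    return 1 - (n_yes / total) ** 2 - (n_no / total) ** 2

print(entropy(3, 0), gini(3, 0))        # pure node   -> 0.0, 0.0
print(entropy(3, 3), gini(3, 3))        # 50/50 node  -> 1.0, 0.5
```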
now what will be the entropy of the other node the one with three yes and three no h of s equal to minus three by six log base 2 three by six minus three by six log base 2 three by six and if you compute this you are actually going to get 1 so when i have three yes and three no the probability is 50 50 and when p plus is 0.5 h of s comes out as 1 which matches the graph and if your p plus is 0 or your p plus is 1 that basically means it is a pure split and the entropy is 0 so always understand your entropy will be between zero and one this node is a completely impure split because here you have a 50 percent probability of yes and a 50 percent probability of no and h of s is just the notation i am using for the entropy of the sample so whenever a split is happening the first thing that is done is the purity check and the purity check is done with the help of entropy i'll also show gini impurity don't worry so with entropy if i am getting 1 that basically means it is a completely impure split and if i am getting 0 it is a pure split so this is the entropy graph again if your probability of yes is 0.5 that is 50 50 three yes and three no then your entropy is going to be one and if your probability is completely one or completely zero that basically means either all yes or all no then your entropy will be zero which means it is a pure split for anything in between for example a probability of 0.3 or 0.6 you go to the graph and read off the corresponding entropy value somewhere between zero and one so here you have understood the purity check you use entropy to find out whether a node is pure or impure and if it is impure you go ahead with the further division by taking another feature now let's discuss the second issue okay fine krish this is very good you have explained it well but how do we decide which feature to take and split on this is the second problem that we are trying to solve let's say that i have feature one over here with two categories c1 and c2 the root has nine yes and five no c1 has six yes and two no and c2 has three yes and three no and in my data set i have features like f1 f2 f3
now let's say that another split i could actually start with feature two instead and in feature two i may have three categories c1 c2 c3 so with respect to the root node and all the other features because after this also i may have to take another feature and keep on splitting based on pure or impure splits how do i decide should i take f1 first or f2 first or f3 first how should i decide which feature to take and do the split that is the major question so for this we specifically use something called information gain now what is this information gain first of all i will write the formula the gain of sample s with respect to feature f1 is gain of s comma f1 equal to h of s minus summation over v belonging to the values of the feature of s of v by s multiplied by h of s of v don't worry if you have not understood the formula i will explain each and every term let's take this feature one split that you have already seen this is my feature one with two categories c1 and c2 the root has nine yes and five no c1 has six yes and two no and c2 has three yes and three no now i will try to calculate the information gain of this specific split see over here if i want to compute the gain of s comma f1 the first thing i need to find out is h of s and this h of s is specifically the entropy of the root node so if i want to compute the entropy of the root node tell me how should i compute it h of s equal to minus p plus log base 2 p plus minus p minus log base 2 p minus so calculate along with me the probability of plus in this root node is nothing but 9 by 14 so i have minus 9 by 14 log base 2 9 by 14 and then p minus is 5 by 14 so minus 5 by 14 log base 2 5 by 14
so this calculation will come out to approximately 0.94 just check whether you are getting this or not you can use a calculator if you want now i have found the entropy of the root node next let's see the other terms what is s of v what is s and what is h of s v very important just have a look everybody h of s v is the entropy of each child category you need to find the entropy of category one and the entropy of category two so if i write h of s of c1 for category one it will be minus 6 by 8 log base 2 6 by 8 minus 2 by 8 log base 2 2 by 8 because c1 has six yes and two no out of eight records and if i compute this i'm actually going to get 0.81 similarly if i calculate h of s of c2 it has three yes and three no so it is 50 50 and the entropy is 1 so now we have all these values and we'll start putting them into the equation gain of s comma f1 equal to 0.94 minus the summation term what is s of v it basically means how many samples i have in that category for category one i have eight samples and the total number of samples is how much there are nine yes and five no so 14 total samples so this weight becomes 8 by 14 and you multiply it by h of s v which is the entropy of category one 0.81 then you go back and see that for c2 the total number of samples is 3 plus 3 which is 6 so you add 6 by 14 multiplied by 1 so gain of s comma f1 equal to 0.94 minus 8 by 14 into 0.81 minus 6 by 14 into 1 which comes to approximately 0.048 so this is my gain for s comma f1 amazing i did this with feature 1 only what about feature 2 let's say that for the feature 2 split i compute the gain of s comma f2 and i get 0.00051 now tell me with which feature should i start splitting first f1 or f2 based on these values you know that the information gain of s comma f1 is greater than the gain of s comma f2 so the answer is very simple we will use feature 1 to start the split the thing you have to understand here is that if i really want to select which feature to start my splitting with then i have to calculate the information gain for every candidate split and whichever feature gives the highest information gain we select that one a small sketch of this information gain calculation in code is given below
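Here is a minimal sketch of the same information gain calculation in plain Python, using the yes/no counts of the parent node and its children as discussed above. The helper names are illustrative only.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    # gain = H(parent) - sum over children of (|child| / |parent|) * H(child)
    n_parent = sum(parent)
    weighted = sum(sum(child) / n_parent * entropy(child) for child in children)
    return entropy(parent) - weighted

# root node: 9 yes / 5 no, split into c1 = (6 yes, 2 no) and c2 = (3 yes, 3 no)
print(round(entropy([9, 5]), 3))                               # ≈ 0.94
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))    # ≈ 0.048
```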
now the question arises krish obviously this is good but you had written about gini impurity what is the purpose of that and why is gini impurity used so let me go ahead with gini impurity i told you that yes you can obviously use entropy but why gini impurity the gini impurity formula which i have written is 1 minus summation over i equal to 1 to n of p i square now what is this p i here n is the number of output classes right now i have two outputs yes or no so i will expand this as 1 minus probability of plus whole square minus probability of minus whole square so this is the formula for gini impurity now you may be thinking okay fine the calculation will obviously be very easy suppose i have a node which has two yes and two no in this particular case how do i calculate it i will write 1 minus half square minus half square which is 1 minus one fourth minus one fourth which is nothing but half so i will be getting 0.5 now understand this is a completely impure split right if you have a completely impure split in entropy the output you get is one whereas in the case of gini impurity it is 0.5 so if i go back to the graph that i had created earlier my gini impurity curve will look similar but flatter for a probability of 0 i'll be getting 0 but whenever my probability of plus is 0.5 i'm going to get 0.5 and that is the difference between gini impurity and entropy but you may be asking krish when to use what now let's understand when to use gini and when to use entropy tell me guys if i consider this formula of gini impurity and if i consider the entropy formula which one do you think will take more time for execution because if you
have a hundred features you will keep on comparing by splitting on many many features and computing information gain again and again so which is faster entropy or gini impurity understand in entropy you have a log function whereas in gini you have simple maths just squares and subtraction so out of entropy and gini impurity the more time is taken by entropy so if you have a huge number of features like 100 or 200 and you are planning to apply a decision tree i would suggest you use gini impurity rather than entropy if you have a small set of features then you can go ahead with entropy so with respect to speed gini is faster than entropy now you may be thinking krish okay fine you have explained this for categorical variables what if i have a numerical feature let's say i have an f1 feature which is a numerical feature and an output so this f1 is a continuous feature let's say initially i have values like 2.3 1.3 4 5 7 3 now this is a continuous feature so for a continuous feature how will the decision tree entropy and information gain be calculated here you will see that the decision tree will first of all sort these values so in f1 after sorting i have 1.3 then 2.3 then 3 then 4 then 5 and then 7 a small sketch of how thresholds are tried on such a sorted feature is given below
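To illustrate the threshold search described next, here is a small sketch that sorts the values above and computes the information gain for a split at every candidate value. The yes/no labels are made up purely for illustration, since the transcript does not give them.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_for_threshold(values, labels, t):
    parent = [labels.count("yes"), labels.count("no")]
    left = [l for v, l in zip(values, labels) if v <= t]      # records with value <= threshold
    right = [l for v, l in zip(values, labels) if v > t]      # remaining records
    weighted = sum(len(side) / len(labels) * entropy([side.count("yes"), side.count("no")])
                   for side in (left, right) if side)
    return entropy(parent) - weighted

values = [2.3, 1.3, 4, 5, 7, 3]
labels = ["no", "no", "yes", "yes", "yes", "no"]              # hypothetical labels for the sketch
for t in sorted(values):                                      # try a split at every sorted value
    print(t, round(gain_for_threshold(values, labels, t), 3)) # pick the threshold with the highest gain
```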
now whenever you have a continuous feature how will the split actually work first of all the decision tree will take the first sorted value and create a condition like is the value less than or equal to 1.3 so here you will be getting two branches yes or no on the yes side you will have just one record and on the no side you will have the remaining five records and for each side you see how many yes and no outputs there are so in the first instance it will calculate the information gain of this split then it will move the threshold to the next value and say less than or equal to 2.3 so now two records fall on one side and the remaining records on the other side and again the information gain is computed then again it moves to the next value less than or equal to 3 and so on like this it is done for each and every candidate value and finally whichever threshold gives the highest information gain that value of that feature is selected and the node is split there so for a continuous feature this is how it works all the thresholds are tried the best information gain gets selected and from there the splitting happens now let's understand the next topic how does all of this work in a decision tree regressor because in a decision tree regressor my output is a continuous variable so suppose i have feature one feature two and the output is continuous any value can be there in this particular case how do i split let's say the f1 feature is getting selected when it is selected first of all the mean of the output will get calculated so at that node i will have the mean and here the cost function that is used is not gini impurity or entropy here you use mean squared error or you can also use mean absolute error now what is mean squared error if you remember from linear regression it is 1 by n summation over i equal to 1 to n of y i minus y hat i whole square so what it will do is first based on the f1 feature it will assign the mean value as the prediction at that node then it will compute the mse value and then it will go ahead and do the splitting now when it does the splitting some records will go to one side then i will have the mean value of those records over there that will be my output for that node and then again the mse will get
calculated over here and as the mse gets reduced that basically means we are reaching near the leaf node and the same thing will happen on the other side so finally when you follow a path whatever mean value is present at that leaf will be your output this is the difference between the decision tree regressor and the classifier here instead of using entropy or gini you use mean squared error or mean absolute error you want me to explain the regressor part once more okay let's say i have feature f1 and my outputs are values like 20 24 26 28 30 initially the mean of all these values gets assigned at the root then using mean squared error i compute the error suppose i get some mse value then i split the node and i get two or three more nodes and for each of those nodes the mean of the records that landed there changes and the mse gets calculated again i'm just taking this as an example just try to visualize it now let's go to one more topic which is called hyper parameters tell me if i keep on growing this decision tree to any depth what kind of problem will it face always understand a decision tree without any constraints leads to overfitting because we just keep dividing the nodes to whatever level we want so in order to prevent overfitting we perform two important steps one is post pruning and one is pre-pruning let's say that i have done some splits and at some node i have seven yes and two no and i could do a further split now in this particular scenario if seven yes and two no are there then there is roughly an eighty percent chance that this node is saying the output is yes so should we do further splitting the answer is no we can stop and cut the branch from here this technique is called post pruning that basically means first you create your full decision tree then you look at it see whether there is an unnecessary branch and just cut it there is one more thing which is called pre-pruning and pre-pruning is decided by hyper parameters what kind of hyper parameters you can set things like what is the max depth how many max leaf nodes you can have and so on all these parameters you can tune with grid search cv and that way you come up with a pre-pruning setup and to answer the question from the chat no the gini value will not be one for a binary problem gini always stays between 0 and 0.5 while entropy goes between 0 and 1 so this is the idea about the decision tree regressor and pruning a small sketch of pre-pruning with grid search cv is given below
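Here is a minimal sketch of pre-pruning with scikit-learn's GridSearchCV, along the lines described above. The particular grid values are just illustrative choices, not recommendations from the transcript.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# pre-pruning: constrain tree growth through hyper parameters and let grid search pick them
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, 5, None],
    "max_leaf_nodes": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)            # the pre-pruning settings that generalized best
print(round(search.best_score_, 3))   # cross-validated accuracy for those settings
```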
so the first thing first as usual we should import the libraries so i'll say import pandas as pd and import matplotlib dot pyplot as plt these basic things i have with me then i will take any data set that i want from sklearn dot datasets i will import load iris and load the iris data set so iris dot data gives me my features there are four features petal length petal width sepal length and sepal width these are my independent features then if i want to apply a decision tree classifier i first import from sklearn dot tree the decision tree classifier right now i'm just going to overfit the data and then i'll show you how you can go ahead with pruning now by default what are the parameters over here if you go and see the classifier you have criterion which is the first parameter and by default it is gini then you have splitter splitter basically means how you are going to split and there you have two options best and random you should usually go with best max depth is a hyper parameter min samples leaf is a hyper parameter max features how many features we are going to consider that is also a hyper parameter so all these things are hyper parameters i will just execute it with the defaults and the next thing i am going to do is plot the decision tree for this i will use plt dot figure with a figsize let me take an area of 15 by 10 so that everybody will be able to see it and then i will call tree dot plot underscore tree with the classifier and filled equal to true for the coloring for this i also have to import tree so from sklearn import tree initially i got an attribute error because the function is plot underscore tree not plot and then i got a not fitted error so first i have to call classifier dot fit on iris dot data and iris dot target once this is done it will get executed and this is how your graph will look like a cleaned up version of this whole snippet is given below
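Putting the steps narrated above together, here is a cleaned-up sketch of the same iris example. It follows the walkthrough's flow (load the data, fit a default tree, plot it); the figure size is the one mentioned in the session.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()                                   # 4 features: sepal/petal length and width

classifier = DecisionTreeClassifier()                # defaults: criterion="gini", splitter="best"
classifier.fit(iris.data, iris.target)               # fit before plotting, otherwise NotFittedError

plt.figure(figsize=(15, 10))
tree.plot_tree(classifier, filled=True)              # filled=True colors nodes by majority class
plt.show()
```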
now if i show you the graph you can see some amazing things over here there are three output classes in this when you look at the left hand side that node becomes a leaf node this first one is probably the versicolor flower and if you go on the right hand side you can see a 50 50 split so based on one feature you are getting a leaf node and based on another branch you are getting a 0 50 50 distribution then again you have two more features getting split over here so here you have 49 and 5 and here you have 47 and 1 do we require this split anybody tell me do you require any more splits after this just try to think this is basically post pruning i want to find out whether more splits are required or not in this particular case after a node with 47 and 1 you really don't require another split so this is post pruning you look at the tree decide your level and cut the unnecessary branches now one node here is showing a gini of 0.667 which looks odd since i said gini stays between 0 and 0.5 but that limit is for two classes for a node with three equally mixed classes the maximum gini is 1 minus 3 times one third square which is about 0.667 so for this three class iris data that value can occur everywhere else you are getting less than 0.5 and the plotting itself is very easy you do from sklearn import tree then call plot underscore tree with the classifier and filled equal to true so now let me define the agenda of the next session first we'll understand ensemble techniques and in ensemble techniques we are basically going to discuss the difference between bagging and boosting then we are going to cover random forest then adaboost and if i have more energy i will also try to cover xgboost so all these algorithms we will discuss so let's start the first topic which is ensemble techniques what exactly are ensemble techniques till now we have solved two different kinds of problem statements classification and regression and you have learnt different algorithms like linear regression logistic regression we have discussed knn and yesterday we discussed naive bayes so different algorithms we have already finished now with respect to a classification or regression problem whatever algorithm we were discussing there was only one algorithm at a time and we were trying to solve either a classification or a regression problem the next question is can we use multiple algorithms to solve a problem if i ask this specific question i will definitely say yes we can because we are going to use something called ensemble techniques in ensemble techniques we specifically use two different ways one is something called the bagging technique and the other is something called the boosting technique so what exactly do we do in bagging what do we do in boosting and how are we combining multiple models to solve a problem let's first of all discuss bagging how does bagging work let's
say that i have a specific data set so this is my data set with rows and columns just imagine i have many features over here like f1 f2 f3 and probably my output so this is my data set d now what we do in bagging is that we create multiple models and these models can be anything for a classification problem one can be logistic regression so this is my model m1 let's say i have another model m2 which is a decision tree then another model m3 which is a knn classifier and this model m4 can again be a decision tree that's fine i could also add a fifth model m5 if i want so here you can see that we have used many models now with respect to each model what will i do from this data set i will just take up some rows so i'll basically do row sampling and take a sample d dash where d dash is always less than d and those rows i'll push to m1 so model one will be trained on them let's say out of ten thousand records i'm doing a row sampling of one thousand rows and giving them to m1 to train then for model m2 i again do row sampling and give it another sample and remember some of the rows may get repeated between this d dash and the next d double dash similarly i do row sampling for m3 and m4 with d triple dash and d four dash so different data points when i say row sampling i'm talking about data points go to separate models and each model trains on its own sample so each model is trained with a different chunk of data now how does the inferencing happen for the test data let's say i get a new test data point this test data will be passed to m1 and suppose m1 gives 0 as its output let's say i am doing a binary classification so this is an output of 0 next m2 for the same test data gives 1 m3 gives 1 and m4 also gives 1 as the output now in this particular case what do you think the final output will be m1 has predicted 0 while m2 m3 and m4 have all predicted 1
so finally all these outputs are going to get aggregated and a simple thing that gets applied is majority voting so tell me what will be the output with respect to this the output will obviously be 1 because by majority voting you can see three models are saying it is 1 so my output over here will be 1 okay this is the concept of bagging wherein you are providing different rows with probably all the features to different models and then finally you are combining them based on majority voting and you are getting the answer as 1 this step is called bootstrap aggregation that basically means you're aggregating all the outputs that come from all the specific models now many people will ask krish what about a tie in this kind of situation always understand guys in practice we will be having more than a hundred to two hundred plus models so there is a very high chance that a clear majority will always be available it will rarely end up exactly tied so this was the concept of bagging now some people will say krish why are you using different models guys i'm not discussing random forest over here random forest uses only one type of model that is the decision tree but as a general concept of bagging you can have different models and combine them so this is one family of ensemble techniques and it is called bagging now one point i missed out this was with respect to the classification problem with respect to a regression problem what will happen let's say i got 120 from one model 140 from another 122 from another and 148 from another as the outputs so in regression the mean of all the outputs will be taken and that will be the output of the ensemble average or mean very simple so based on the average you will be able to solve the regression problem a small sketch of this kind of custom bagging with majority voting is given below
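Here is a minimal sketch of the custom bagging idea described above: different model types, each trained on its own row sample, combined by majority vote. The synthetic data, sample sizes, and the choice of the three models are purely illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(), KNeighborsClassifier()]

rng = np.random.default_rng(0)
for m in models:
    rows = rng.choice(len(X), size=300, replace=True)         # row sampling with replacement
    m.fit(X[rows], y[rows])                                   # each model sees a different sample

x_new = X[:5]
preds = np.array([m.predict(x_new) for m in models])          # one row of predictions per model
majority = (preds.sum(axis=0) > len(models) / 2).astype(int)  # majority vote for 0/1 labels
print(majority)
```

For a regression version, the last two lines would simply average the models' predictions instead of voting.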
now before going to the algorithms i need to make you understand what exactly boosting is in bagging you have seen that you have parallel independent models you give some row samples to different models and then combine their outputs in the case of boosting boosting is a sequential combination of models you have a series of models one after the other first i give my training data to the first model then it passes on to the next model then the next and the next so this will be my m1 m2 m3 m4 and finally i will be getting my output and these models m1 m2 up to mt we basically call weak learners each one is a weak learner and when i combine all these weak learners together the combination becomes a strong learner so here you have all the models sequentially one after the other and you pass the data from one model to the next and each of these models is a very simple weak learner which by itself will not be able to predict properly but when you combine all of them sequentially they become a strong learner how this specifically works i'll show you with adaboost and xgboost a weak learner basically means the individual prediction is quite bad but as you go sequentially and combine them they become a strong learner one example i want to give you let's say model one is a teacher of physics model two is a teacher of chemistry model three is a teacher of maths and model four is a teacher of geography now if you are trying to solve one problem and the physics teacher is not able to solve it then probably the chemistry teacher can help or the maths teacher or the geography teacher so when we combine this much expertise together they will be able to give you the output in an efficient way sumit asked whether all the features are passed to all the models i'll talk about it just give me some time but in short if someone asks you in an interview what exactly is boosting you can say it is a sequential set of models combined together where each individual model is usually a weak learner and when they are combined they become a strong learner which gives a much better output and right now in most kaggle competitions people use some type of boosting or bagging technique so as i said we have bagging and boosting in bagging the algorithms we specifically use are the random forest classifier and the random forest regressor which i'm going to discuss right now and in boosting we use techniques like adaboost gradient boosting and number three extreme gradient boosting which we also call xgboost so let's discuss the first algorithm which is the random forest classifier and regressor now first thing first let's recall something from yesterday's class what is the main problem with a decision tree whenever we create a decision tree without any hyper parameters does it not lead to overfitting it does and why overfitting because it completely splits on all the features till its full depth overfitting basically means for the training data the accuracy is high and for the test data the accuracy is low so for the training data the bias is low and for the test data the variance is high so low bias and high variance yes obviously we can do pruning and all guys but again understand pruning is an extensive task
probably if you have 100 features and data points which are like 1 million then even pruning is very difficult yes pre-pruning can be done but we cannot guarantee that it will work well so right now with respect to a decision tree you have this specific problem low bias and high variance and you know that a good generalized model should have low bias and low variance so if somebody asks you why do you use random forest you can explain it through decision trees like this my main aim is to convert this high variance into low variance and i can do that using a random forest classifier or random forest regressor now what does random forest do random forest is a bagging technique similarly i have a data set over here and i will be having multiple models like m1 m2 m3 m4 and there can be many more models now in random forest all of these models are decision trees you don't have a different type of model there so the first thing you should know is that all the models used in random forest are decision trees now whenever we are using decision trees you know that a decision tree grown to full depth may lead to overfitting and because of that every individual decision tree will have low bias and high variance but if we combine them as a bootstrap aggregator this high variance gets converted to low variance why because we take a majority vote over all these decision trees there will be many trees so many outputs will be coming and with the help of the majority voting this high variance becomes low variance now how does it work in random forest with respect to the data set d two things basically happen for the first model we do row sampling plus feature sampling that basically means we select some set of rows and some set of features and give them to m1 similarly you do row sampling and feature sampling and give it to m2 then row sampling and feature sampling for m3 and then for m4 now when you do this independently you're giving some features along with some rows and there may be a situation where your features get repeated across models and your records or data points may also get repeated so when you train each model with its specific data and specific features that model becomes an expert in predicting something as i said in the example i'm giving the physics model some data the chemistry model some data and so on so each model becomes an expert with respect to its own data now based on all this whenever i get a new test data point what will happen suppose this is a classification problem the m1 model will predict 0 this one will predict 1 this one will predict 0 and this one will predict 0
Now in this particular case, since it is a classification problem, majority voting happens, and here you will get the output as zero. I hope everybody is able to understand: all the models here are decision trees, and based on that you do the voting. See, in an interview these points are very, very important, and if you explain them like this to the interviewer, this kind of algorithm question is basically cracked. Some of my students have told me, "Krish, when the interviewer asked me which is my favourite algorithm, I said random forest." I asked why, and they said, "Because then the interviewer can ask me anything on random forest — I'm very confident about it and I can prove why it is so good." So in this specific case you can see that, because of the overfitting tendency of a single decision tree, you combine multiple decision trees so that you get a generalized model with low bias and low variance. Feature sampling basically means: suppose I have features one, two, three, four; to the first model I may give two features, to the second model three features, to the fourth model all four features or even just one feature. Internally, random forest takes care of that, and this is how random forest works. The only difference between the random forest classifier and the random forest regressor is that in regression, whatever outputs you get from all the models, you take their mean — you just average them to get the final output. Now let's talk about some important interview points on random forest. The first question: is normalization required in random forest? And the related question: is normalization — when I say normalization or standardization, I'll just say standardization — required in KNN? So, is normalization or standardization required in random forest or a decision tree? The answer is no, because a decision tree just performs splits; even if you rescale the data, the splits are not really affected. But for KNN, is standardization or normalization required? The answer is yes, because there we use Euclidean distance or Manhattan distance, and because of that you definitely have to apply standardization so that the distance computation behaves well. This is one of the most common interview questions asked around random forest. The third question: is random forest impacted by outliers? The answer is no — you can also go and verify this on Google. And is KNN impacted by outliers? The answer is yes, a big yes. So these are the interview questions that need to be covered.
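To tie the random forest discussion together, here is a minimal scikit-learn sketch. None of this is from the lecture notebook — the dataset is synthetic and the parameter values are only illustrative — but it shows how the row sampling and feature sampling map onto RandomForestClassifier's parameters:

```python
# Minimal sketch of a random forest as a bagging ensemble of decision trees
# (synthetic data; parameter values are illustrative, not from the lecture).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree sees a bootstrap sample of rows (bootstrap=True, max_samples)
# and a random subset of features at every split (max_features).
rf = RandomForestClassifier(
    n_estimators=100,      # number of decision trees combined by majority vote
    max_features="sqrt",   # feature sampling
    bootstrap=True,        # row sampling with replacement
    max_samples=0.8,       # fraction of rows given to each tree
    random_state=42,
)
rf.fit(X_train, y_train)   # note: no scaling/normalization needed for trees
print("test accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```

For a regression target, RandomForestRegressor works the same way but averages the trees' outputs instead of taking a majority vote.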
Now, before we move on, one last note on bagging: in bagging, most of the time we specifically use random forest, or you can also create custom bagging techniques. A custom bagging technique means you take whatever combination of algorithms you want and combine their outputs — you can even do that manually by hand. Okay guys, so the second thing we are going to discuss is the boosting technique, and the first boosting algorithm we'll cover is AdaBoost — how does AdaBoost work? I know there is a lot of confusion, so let's solve a problem. Suppose I have a dataset that looks like this: f1, f2, f3 are my features and this last column is my output, yes or no. Let me count how many records there are — three, four, five, six, and one more, seven — so there are seven records. Now, in AdaBoost the first step is that we define a weight, and the weight is very simple: initially we give every input record an equal weight. How do we set that equal weight? We just count the number of records. Here the total number of records is seven, and every record gets an equal weight between zero and one so that the overall sum is one. So if I assign 1/7 to every record, those are definitely equal weights, and if I take the total sum it will obviously be one. Now, what do we do next? The next step in AdaBoost is to decide which feature to start with — should we go with f1, f2 or f3? We can decide that with the help of information gain and entropy, or Gini impurity; here too you are essentially building small decision trees, so you have to determine which feature to split on first. Suppose, out of feature one, feature two and feature three, you find that feature one has the best information gain, so you use feature one and split on it. When I build this decision tree, its depth will be only one — one level — and because it has only one level we call it a stump. So what we do here is create a decision tree using only one feature and split it to only one level — one depth, that's it — and that is what is called a stump. In AdaBoost, each stump is the weak learner.
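If you want to see what a stump looks like in code, it is just a decision tree restricted to depth one. This is a minimal sketch with made-up data, not the lecture's dataset; in scikit-learn a depth-one DecisionTreeClassifier is also what AdaBoostClassifier uses as its default base learner:

```python
# A "stump" is simply a decision tree restricted to depth 1 (one split).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, criterion="entropy", random_state=0)
stump.fit(X, y)

# The single split is chosen by information gain (criterion="entropy");
# on its own this weak learner usually predicts poorly.
print("training accuracy of the stump:", stump.score(X, y))
```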
Why do we call it a weak learner? There is a reason. For the weak learner we create just a stump — a one-level decision tree, that's it. Based on information gain and entropy I selected the feature, and then I built a decision tree with only one level; that is why it is a weak learner, and that is why we use only a stump. The next step is that we pass all these records through this stump on f1 and we train this model — a decision tree of just one level. After training, we pass all the records through it again to find out how many it gets right and how many it gets wrong. Let's say that out of all these records, exactly one record was predicted wrong by this model. Now, what do we do in this case? Understand one very important thing: after this, we are going to calculate the total error — how many errors did this model make? In this case only one record was wrong. How do I calculate the total error? I take the weights of the wrong records and add them up. Only one record is wrong and its weight is 1/7, so the total error of this stump — my f1 stump — is 1/7. That was the first step. The second step is that I need to measure the performance of the stump, and the performance is given by the formula: performance of stump = ½ × logₑ((1 − total error) / total error). Why are we doing this? Everything will make sense in just a little time. So the first step in AdaBoost is to find the total error, and the second step is to find the performance of the stump. In this particular case it will be ½ × logₑ((1 − 1/7) / (1/7)).
Once I calculate it, it comes out to about 0.895 — remember, out of all the features I had already found, using information gain and entropy, that f1 was the best feature to build this stump on. So the first step is to find the total error, and the second step is the performance of the stump, where TE means total error. Now see the steps: whenever I am discussing boosting, I am going to combine weak learners together to get a strong learner. So what will be my third step? My third step will be to update all these weights, and that is exactly why I calculated the total error and the performance of the stump. So the third step is computing the new sample weights after this first stump. Why do I need to update the weights? For the correct records — all the records that were classified correctly — the weight should reduce, and for the wrong records the weight should increase. Why? Because if I increase the weights of the wrong records, those records are more likely to be passed on to the next weak learner; that is the whole point. How do we update the weights? For correct records the formula is: new weight = weight × e^(−performance of stump). So that is 1/7 × e^(−0.895), and if you do the calculation — everybody try it — the answer comes to about 0.05. What about the incorrect records? For the incorrect records the formula is: new weight = weight × e^(+performance of stump) — plus, not minus. So that is 1/7 × e^(0.895), which comes out to about 0.349. So these are the two updated weights I get. That means, for all the records that were correct, the initial weight of 1/7 — which was about 0.142 — gets reduced to 0.05, and for the wrong record the new weight becomes 0.349. So my new weights are: 0.05, 0.05, 0.05 for the first three records, 0.349 for the fourth record — which was the wrong one — and 0.05, 0.05, 0.05 for the remaining three.
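If you want to verify the numbers just quoted (0.895, 0.05, 0.349), here is a quick sketch of the three formulas, assuming seven records and one misclassified record as in the example:

```python
# Quick check of the AdaBoost numbers for 7 records with 1 misclassified.
import numpy as np

n_records = 7
initial_weight = 1 / n_records            # equal starting weight, 1/7 ≈ 0.142

total_error = 1 * initial_weight          # one wrong record -> TE = 1/7
performance = 0.5 * np.log((1 - total_error) / total_error)
print("performance of stump:", round(performance, 3))     # ≈ 0.895

w_correct = initial_weight * np.exp(-performance)          # weight shrinks
w_wrong = initial_weight * np.exp(+performance)            # weight grows
print("updated weight (correct):", round(w_correct, 3))    # ≈ 0.058 ~ 0.05
print("updated weight (wrong):  ", round(w_wrong, 3))      # ≈ 0.35
```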
Now tell me, guys: if I take the sum of all these new weights, is it 1? No — if you add them up it comes to about 0.649, not 1, whereas the original weights all summed to 1. So here I need to find my normalized weights. To normalize, all you have to do is find the sum of all these weights — which is about 0.649 — and divide every weight by 0.649. So what answers will you get? The normalized weights will look like roughly 0.07, 0.07, 0.07, then about 0.537 for the wrong record, and then 0.07, 0.07, 0.07 again — every value divided by 0.649. This is my normalized weight. After you get the normalized weights, we create something called buckets. See, we have already created one decision tree — a stump — and we know what output it gives; now, in this sequential model, we are going to add another stump after it. To train this next stump I need to give it some specific rows, and since the first stump got one record wrong, I want that wrong record, along with some other data points, to go to this next model so that it can learn from it. So let's create the buckets. Based on the normalized weights, the buckets are cumulative ranges between 0 and 1: the first bucket goes from 0 to 0.07, the next from 0.07 to about 0.14, the next from 0.14 to about 0.21, then the wrong record's bucket runs from about 0.21 to roughly 0.75 — because its normalized weight of 0.537 gets added on — and then the remaining small buckets continue up to 1. Now, how do we make sure the wrong record is the one that mostly goes to the next stump? The algorithm randomly generates numbers between 0 and 1, and whichever bucket a number falls into, that record gets selected. Tell me, which record has the biggest bucket? Obviously the wrong record. So if I randomly generate numbers between 0 and 1, the highest probability is that they land in that bucket. There is still a chance that other records also get picked and go to the next stump, but the maximum number of selections will be the wrong record, because its bucket is the widest. So most of the time this specific record will get selected and passed on.
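Here is a small sketch of that bucket idea, continuing the same numbers — the normalized weights become cumulative edges, and random draws between 0 and 1 pick the record whose bucket they fall into (the weights and the random seed are just for illustration):

```python
# Normalize the updated weights and sample records via cumulative "buckets".
import numpy as np

rng = np.random.default_rng(0)

# Updated weights from the example: record 4 was wrong, the rest were correct.
weights = np.array([0.05, 0.05, 0.05, 0.349, 0.05, 0.05, 0.05])
normalized = weights / weights.sum()        # ≈ [0.077, ..., 0.537, ..., 0.077]

buckets = np.cumsum(normalized)             # bucket upper edges between 0 and 1
print("bucket upper edges:", np.round(buckets, 3))

# Draw 7 random numbers in [0, 1); whichever bucket each number falls into
# decides which record is passed to the next stump. The wrong record owns
# the widest bucket, so it is selected most often.
draws = rng.random(7)
selected = np.searchsorted(buckets, draws)
print("records selected for the next stump:", selected)
```

In practice you would not code this loop by hand; scikit-learn's AdaBoostClassifier wraps the whole sequence of stumps for you.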
So it then goes to the second stump. This is my first stump, this is my second stump, this is my third stump, and so on: from the second stump, whichever records come out wrong will mostly go to the third, and again it gets trained — there can be a lot of stumps, easily a hundred or more decision trees. For a new test data point, every weak learner gives one output — this weak learner gives one output, that weak learner gives another — so obviously the time complexity is higher. Now suppose it is a binary classification problem and the outputs I get are 0, 1, 1, 1: again majority voting happens and the final output is 1. In the case of a regression problem each stump gives a continuous value, and the average of those values gives me the output. So for regression we take the average, and for classification majority voting happens — that same idea carries through everywhere. Buckets are very simple, guys: based on the normalized weights we create buckets so that whichever records have the biggest buckets get selected, via those randomly generated numbers, and are passed on to the next stump. That is why the wrong record's bucket is big — and if there are, say, four or five wrong records, their buckets will all be bigger, so based on the random numbers between zero and one most of the wrong records get selected and given to the next stump. Similarly that next stump will make some mistakes, its wrong records' weights will get increased, all the weights will get updated, and they get passed to the next stump. And when I say "wrong record", its actual output stays the same — still 0 or 1; only the weight changes. Interesting, everyone? I hope you understood — that was quite a lot of maths in AdaBoost, and that is how AdaBoost actually works. Three main things are calculated again and again: the total error, the performance of the stump, and the new sample weights, and the normalized weights are used because the sum of the weights should come back to approximately one. Someone asked: in boosting, why not just take the last model's output? No, no — we have to give importance to every decision tree's output; every tree's output matters. Okay, now let me talk about one more concept: black box models versus white box models. What is the difference? If I take linear regression, is it a white box or a black box model? What about random forest? What about a decision tree? What about an ANN? Linear regression is called a white box model because you can actually visualize how the theta values change and how they move towards the global minimum. Random forest I would call a black box model, because it is practically impossible to inspect how every decision tree inside it is working — that is why the internals feel so complex. A decision tree is a white box model, because we know exactly how the splits are happening — with pen and paper you could work it out yourself.
In the case of an ANN, it is a black box model, because there you don't really know how many neurons are doing what, how they are performing, and how the weights are getting updated. So that is the basic difference between black box and white box models. Now, this is the agenda of today's session, so let's start. The first algorithm we are going to discuss today is K-means clustering, and this is a kind of unsupervised machine learning. Always remember: the most important thing about unsupervised ML is that you don't have any specific output column. So suppose you have feature one and feature two and you have a bunch of data points; based on this data, what we try to do is create clusters, and a cluster basically tells you which data points are similar to each other. That is what we do in clustering, and there are various techniques like K-means, hierarchical clustering and so on. First of all we'll understand K-means and how it works. It's simple: suppose you have data points plotted in two dimensions against features f1 and f2, and there is one bunch of points here and another bunch of points there. Our main purpose is to group them into different clusters — this will be one group and that will be the other group — so two groups, because you can clearly see two sets of similar data: this is my cluster one and this is my cluster two. Let me also tell you why this is useful, and then we'll get into the math intuition. Where does clustering get used? In many of the custom ensemble techniques I mentioned: whenever we are creating a model, the first thing we often do on our dataset is create clusters. So on this dataset the first algorithm we apply is a clustering algorithm, and after that we can apply a regression or classification model. Suppose the clustering gives me two or three groups; for each group we can then apply a separate supervised machine learning algorithm, if we know the specific output we want to predict — I'll give some examples as we go ahead. Now let's focus on understanding how the K-means clustering algorithm actually works. The "k" in K-means basically refers to the number of centroids. Suppose I have a dataset that looks like this; just by looking at it, how many groups do you think are possible? You would definitely say k equals 2, and when you say k equals 2 that means you will get two groups like this, and each group will have its own centroid point — a centroid here and a centroid there — and each centroid basically defines its own separate group.
So here you can definitely see there are two groups, but how do we actually come to the conclusion that there are only two groups? We cannot just decide it by looking at the data, because real data will be high-dimensional — right now I'm showing you two dimensions, but for high-dimensional data you won't be able to see how the points are plotted. So how do you conclude how many groups there are? For this there are some steps we perform in K-means. The first step is that we try different k values and find the most suitable one — k is nothing but the number of centroids; we try different numbers of centroids. To judge the k values there is a concept called within-cluster sum of squares, which I'll come to. So step one: we pick a k value to try; let's say we start with k equals 2 (how we settle on k equals 2 as the right value, I'll explain shortly). Step two: we initialize k centroids — in this case two — at random positions in the space. So let's say this is my first centroid, in one colour, and this is my second centroid, in another colour; I've initialized two centroids randomly. After initializing the centroids, what we have to do is find out which points are nearest to which centroid, and for that we can simply use the Euclidean distance between each point and each centroid. If I want to show it visually, I can draw a line between the two centroids and a perpendicular boundary through the middle: whichever points fall on the green centroid's side are nearer to the green centroid, so they become green points, and whichever points fall on the other side are nearer to the red centroid, so they become red points — meaning they belong to that group. I hope everybody is clear till here. So the initialization of the k centroids is done, then we calculate the distances and find out which points are nearest to which centroid — let's say this is my first centroid and this is my second.
We have seen that these points belong to this centroid — they are nearest to it, so they become red — and the others become green; it is all based on the shortest distance. Now, what is the next step after this? The next step is that we take all the points that are red and compute their average — we compute the average because we need to update the centroid. As soon as we compute the average, the red centroid moves to a new location, somewhere in the middle of the red points, so that becomes my new red centroid. Similarly the same thing happens with the green points: their average is computed and the green centroid also moves and gets updated. Then again the distances are calculated, again the boundary between the centroids is recomputed, and we check whether any points switch sides. Based on the new distances the centroids are updated again, and eventually you will see that all the points stay with their own centroid and no update happens any more. If there had been, say, one red point that was actually closer to the green centroid, it would have switched to green; but once the updates settle and nothing changes, we stop updating the centroids. At that point we have found the final centroids, this set of points is one group and that set is the other group, and they don't overlap. I hope everybody has understood the steps we followed: initializing the centroids, assigning the points, and updating the centroids. Is it clear, everybody, with respect to K-means?
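Here is a bare-bones NumPy sketch of exactly those two steps — assign each point to its nearest centroid by Euclidean distance, then move each centroid to the mean of its points — on made-up data (the blob positions and the iteration cap are just assumptions for illustration):

```python
# Bare-bones K-means loop: assign points to the nearest centroid, then
# recompute each centroid as the mean of its assigned points.
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])  # two blobs

k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization

for _ in range(10):                                     # usually converges quickly
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)                   # nearest-centroid assignment
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):           # nothing moved -> converged
        break
    centroids = new_centroids

print("final centroids:\n", centroids)
```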
Now let's discuss one more point: how do we decide this k value? For deciding the k value there is a concept called the elbow method, and the elbow method is very important because it helps us find the optimal k — whether k should be 2, 3, 4 and so on. Suppose this is my dataset with its data points; we cannot directly say that k equals 2 is going to work, so obviously we iterate — say for k from 1 to 10 — and for every iteration we record a point on a graph of k versus something called WCSS. What is WCSS? WCSS means within-cluster sum of squares. Let's say we start with one centroid, initialized somewhere. If we compute the distance from every single point to that one centroid and add it all up — that is essentially the within-cluster sum of squares — will that value be large or small? It will be very, very large, because we are summing the distance from every point to a single centroid. So for k equals 1 the WCSS is a huge value. Next we go to k equals 2: now I initialize two centroids and run the whole process I described above. For every point we now measure its distance to its nearest centroid — the green one or the red one — so will this sum be smaller than the previous WCSS? Obviously, yes. So for k equals 2 the value drops, for k equals 3 it drops further, and for k equals 4, 5, 6 it keeps coming down. If I join these points on the graph, you will see there is an abrupt change in the WCSS at first and then the curve flattens out — and this is what we call the elbow curve, because it is shaped like an elbow: at one specific point there is an abrupt change and after that it goes more or less straight. This is important: for finding the k value we use the elbow method, and for validating whether the clustering is performing well we use the silhouette score, which I'll show you in a bit. So in K-means we keep updating the centroids and computing the distances, and as k increases the WCSS settles down; we then look for the k where the last abrupt change happens — after that the changes are marginal — and we take that as the k value, say k equals 4 as an example. Obviously the model complexity goes up, because we are checking many different k values and WCSS values: we first construct the elbow curve, look for where the abrupt change happens, and once we find it we say that is probably our k. And finding the clusters themselves is very simple: we take a k value, initialize k centroids, compute the averages to update the centroids, recompute the distances, check whether any points have changed groups, and repeat until we get stable, separate groups. That is the entire funda of K-means clustering.
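If you want to see WCSS written out in code, it is just the sum of squared distances from every point to the centroid of its own cluster; here is a tiny sketch with made-up numbers (in scikit-learn, KMeans exposes this same quantity as inertia_):

```python
# Within-cluster sum of squares (WCSS): squared distance from every point
# to the centroid of the cluster it belongs to, summed over all points.
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        np.sum((X[labels == j] - centroids[j]) ** 2)
        for j in range(len(centroids))
    )

# Tiny example: four points, two clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [5.0, 5.5]])
print(wcss(X, labels, centroids))   # 0.5 + 0.5 = 1.0
```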
So finally you'll see that, with respect to the chosen k value, you get that many groups: if my k value is 4, I will get four different groups — one, two, three, four — and every group will have its own centroid. Centroids are very important; yes, I'll also show you this in the coding part, guys. Now let's go to the second algorithm: hierarchical clustering. Hierarchical clustering is quite simple, guys. Let's say these are your data points on axes x and y, and I've named them p1, p2, p3, p4, p5, p6 and p7. Hierarchical clustering says we go step by step: the first thing is to find the two points that are nearest to each other. Let's say p1 and p2 are the nearest pair; we compute the distance between them and combine them into one cluster. At the same time, on the right-hand side, we keep another notation — a diagram in which we connect the points as we merge them — with p1, p2, p3, p4 up to p7 listed along the bottom and the merge distances (distance one, two, three, four, five, six and so on) going upward. So hierarchical clustering first finds the nearest pair, computes the distance, and merges them into one group — p1 and p2 have been combined. Then it finds the next nearest pair; let's say p6 and p7 are near, so they also get combined into one group, and since their distance is larger than the previous one, that link sits a bit higher in the diagram. Then we see that p3 and p5 are nearest to each other, so we combine p3 and p5, again at a distance greater than the previous ones, because we always start with the shortest distance and work towards the longest. Next, the nearest point to that group is p4, so p4 gets connected to the p3–p5 group. Then, which is the nearest group to it — the p6–p7 group or the p1–p2 group? Here you can see it is p1–p2, so the p1–p2 group will get combined with the p3, p5, p4 group.
So I get another connecting line in the diagram, and then finally you see that the p6–p7 group is the nearest group to all of that, so it gets combined too, and everything merges into one big group at the top with one final link. This diagram is what is called a dendrogram — it is built from the bottom up to a single root at the top. Now the question arises: how do you find how many groups there should be? The idea is very clear, guys: you need to find the longest vertical line in the dendrogram that has no horizontal line passing through it. This is very important — no horizontal line should cut through it. What does that mean? If I extend each horizontal merge line across the diagram, many of the vertical lines get crossed by one of them; I look for the longest vertical line that none of those extended horizontal lines passes through. Then I draw a horizontal cut across the dendrogram at that level and count how many vertical lines it passes through: if it passes through one, two, three, four vertical lines, then you will have four clusters. That is how we do the calculation in hierarchical clustering. Again, the line I've drawn here may not be perfect — I've drawn it with some assumptions — but this is the way you would do it, and I've already uploaded a lot of practical videos on hierarchical clustering. Now tell me: is the maximum effort or maximum time taken by K-means or by hierarchical clustering? This is a question for you. (And yes, guys, the number of clusters in my sketch might really be three; I'm just showing you how to count the lines the cut passes through.) So how do you determine whether the maximum time is taken by K-means or hierarchical clustering? This is an interview question, and the answer is that the maximum time is taken by hierarchical clustering. Why? Because if I have very many data points, hierarchical clustering keeps constructing these dendrograms, and that takes a lot of time. So if your dataset is small, you can go with hierarchical clustering; if your dataset is large, go with K-means. In short, both take time on large data, but K-means will perform better than hierarchical clustering. Just imagine forming these dendrograms with 10 features and a huge number of data points — it becomes a cumbersome process, you won't even be able to read the dendrogram properly, and you obviously cannot do it manually. So this was with respect to K-means clustering and hierarchical clustering.
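In the session the dendrogram is drawn by hand; in code it is usually built with SciPy. This is only a sketch on made-up blobs — the linkage method and the cut distance here are assumptions for illustration:

```python
# Agglomerative (hierarchical) clustering and a dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

Z = linkage(X, method="single")    # "single" merges the closest points first
dendrogram(Z)
plt.title("Dendrogram (bottom-up merging of nearest points)")
plt.show()

# Cutting the tree where the vertical gap is largest gives the cluster count;
# here a cut around distance 2 should separate the two obvious groups.
labels = fcluster(Z, t=2, criterion="distance")
print(labels)
```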
I hope everybody has understood. Now, the next topic we'll focus on is validation: see, how do we validate a classification problem? We use performance metrics like the confusion matrix, accuracy, true positive rate, precision and recall. But how do we validate clustering models? For that we use something called the silhouette score. I'll show you what the silhouette score is — I'm just going to open the Wikipedia page — and this is what it looks like; it's a really nice topic. So how do we validate whether my model with, say, three or four clusters is actually good? Suppose I find that my k value is three — how do we check that? But first, one more issue with K-means which I forgot to mention. Let's say I have data points arranged like this; looking at them you would say my k value should be 2, one cluster here and another cluster there. But because of a bad initialization of the centroids — if I just drop the centroids randomly anywhere — there is a possibility that I end up with three clusters instead: one here, one here and one here. So one condition on the initialization of the centroids is that they should be very far apart. If we initialize the centroids far apart, then as they keep updating they will settle nicely into the centres of the real groups; but if we don't, there is a chance that where I really wanted only two centroids, I end up with three. That is a problem, and for this there is an algorithm called k-means++, which I will show you in the practical: k-means++ makes sure that the centroids that get initialized are very far from each other. We'll see in the practical application exactly where those centroids get used. Now let me come back to the silhouette score, and I'm going to explain it in a nice way, because this is important: if someone asks you how to validate a clustering model, this is what we use — it can be used with K-means, and it can be used with hierarchical clustering as well. In the silhouette method there are a few major steps for validating a cluster model. The first and most important thing is that we compute something called a(i). What is this a(i)? First, I take one cluster and I pick one data point i inside that cluster.
Then what I'm going to do is take whatever other points are there inside this same cluster and compute the distance between point i and each of them; I sum those distances and take their average. So when I write d(i, j), i means the point I picked and j runs over all the other points of that cluster. The value you see being divided by |Cᵢ| − 1 is exactly that: in short, I am calculating the average distance from point i to the other points of its own cluster. That is the first quantity, a(i). Similarly, the next thing we need to compute is b(i). What is b(i)? In a K-means problem statement there will be multiple clusters, and we look at the cluster nearest to the one containing i. Let's say this is the nearest cluster, with its own variety of points. Then b(i) says: I compute the distance from point i to each point in that neighbouring cluster — this point to that point, this point to that point, and so on, for every point — and once I have all those distances I take their average. (The same can be done for every point, one after the other.) Now tell me: if my clustering model is good, what should the relationship between a(i) and b(i) be — will a(i) be greater than b(i), or will b(i) be greater than a(i)? If we have a really good model, then obviously b(i), the distance to the neighbouring cluster, will be greater than a(i), the distance within its own cluster. And if I talk about the silhouette score, its values will always lie between minus 1 and plus 1.
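Here is a small sketch computing a(i), b(i) and the silhouette value for a single point by hand and checking it against scikit-learn's silhouette_samples; the six points and their labels are made up purely for illustration:

```python
# Silhouette for one point, computed by hand and checked against scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],      # cluster 0
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])     # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0                                                   # the point we score
same = X[labels == labels[i]]
other = X[labels != labels[i]]

# a(i): average distance to the other points of i's own cluster.
a_i = np.mean([np.linalg.norm(X[i] - p) for p in same if not np.array_equal(p, X[i])])
# b(i): average distance to the points of the nearest other cluster.
b_i = np.mean([np.linalg.norm(X[i] - p) for p in other])

s_i = (b_i - a_i) / max(a_i, b_i)
print("manual  s(i):", round(s_i, 3))
print("sklearn s(i):", round(silhouette_samples(X, labels)[i], 3))
```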
The closer the value is to plus one, the better the clustering model is; the closer it is to minus one, the worse it is, because that means the opposite condition is happening — the within-cluster distance a(i) is larger than the distance b(i) to the neighbouring cluster, which is exactly what you don't want. That is the whole point of the silhouette score. Finally, when we put it together, the silhouette formula looks like this: s(i) = (b(i) − a(i)) / max(a(i), b(i)), defined when the cluster of i has more than one point. By this you get a value between minus one and plus one: the more the value is towards plus one, the better your model; the more it is towards minus one, the worse, because then a(i) is greater than b(i). If s comes out around zero, that basically means the clustering still needs to be improved. And what is i here? i is just one data point in its cluster Cᵢ — you can read this on the page, guys. I hope everybody has understood this. Now let's go ahead to the next topic — we have obviously finished the silhouette score — and discuss something called DBSCAN. DBSCAN is an amazing clustering algorithm; we'll try to understand how it works, and you'll learn a lot from it. In DBSCAN clustering, what are the important terms? The first term you really need to remember is min points (I'll also explain when a point is called a core point and what the other kinds of points are). So the first term I'll discuss is min points, the second is core points, the third is border points, and the fourth is noise points. Now tell me, guys: in K-means clustering, if I have groups like this, I would just split everything into two clusters. But look at what problem that causes: suppose one of these points is actually an outlier, sitting far away on its own. If I do K-means clustering, I may still get one cluster here and another cluster there, and even though that far-away point is an outlier, K-means will still pull it into one of the groups. So can we have a clustering algorithm that leaves the outlier out separately? Yes — that is exactly what we will use DBSCAN for.
That left-out point will be called a noise point — a noisy point, or you can simply call it an outlier. So for this kind of situation, where you want to skip the outliers, we can definitely use DBSCAN, which stands for density-based spatial clustering of applications with noise. It's a very nice algorithm and I've used it a lot — nowadays I often reach for it instead of K-means or hierarchical clustering. Now, what are the important things here? First of all you need min points. Min points is a hyperparameter, and along with it there is a value called epsilon, which I forgot to list, so let me write it down here. What does epsilon mean? If I take a point and draw a circle around it, epsilon is simply the radius of that circle. Now, what does min points = 4 mean? Let's say I take a point, draw the circle of radius epsilon around it, and set min points = 4 (again, a hyperparameter): if there are at least four points within that epsilon circle, then this red point becomes a core point — the core point I listed above, a point that has at least min points neighbours within its epsilon radius. Someone asked: is there a particular unit of epsilon, or do we simply take it as a distance? The epsilon value also gets selected through a procedure; I'll show you in the practical application, don't worry. Next, let's say I take another point and draw its epsilon circle, and inside it there is only one point — fewer than the min points. Then this point becomes a border point, the other category we listed: if it had at least four points it would have become a core point like the red one, but with fewer it is a border point. And there is one more scenario: suppose I draw the epsilon circle around a point and there are no points near it at all — then it becomes a noise point, and that noise point just sits on its own. So those are the key terms. What happens is that whenever we have a noise point like in this scenario — no core point or border point found within its epsilon — it simply gets neglected, which basically means it is treated as an outlier.
It can also be treated as a noise point, and it will never, ever be taken inside a group. Now suppose I have this set of points — red core points and so on — and there is also a border point; by drawing multiple epsilon circles here you can see how we are defining the core points and the border points, and all of these can be combined into a single group. How is the connection made? See, this yellow point is covered by one epsilon circle, and within that circle we have one core point — and remember, it has to be at least one core point, not just any one point. If it has at least one core point within its epsilon, it becomes a border point, which means, yes, it can be part of that group. So what are we doing overall? Whenever there is a noise point, we neglect it; wherever there are border and core points, we combine them into groups. Let me show you one more diagram, which will help you compare this with K-means and hierarchical clustering. Look at this, everybody: the right-hand side of the diagram is DBSCAN clustering and the left-hand side is a traditional clustering method, let's say K-means. Which do you think is better? On the DBSCAN side, the outliers are not forced into any group, while the points that are close together, as core points and border points, are formed into their own separate groups. That is how nice DBSCAN clustering is — that is basically its outcome. On the K-means side you can see all of these points, outliers included, have been absorbed into the blue group as one cluster, whereas DBSCAN separates the data into sensible groups. So I'm telling you, guys, go ahead and use DBSCAN without worrying about anything.
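A minimal scikit-learn sketch of this, with made-up data rather than the lecture's: eps is the circle radius and min_samples is the min-points rule described above, and points labelled −1 are the noise points DBSCAN leaves out.

```python
# DBSCAN: eps is the circle radius, min_samples is the "min points" rule;
# points that end up with label -1 are treated as noise/outliers.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0]]])          # add one obvious outlier

db = DBSCAN(eps=0.2, min_samples=4).fit(X)
labels = db.labels_

# With these settings DBSCAN should find the two moon-shaped groups and
# flag the far-away point as noise.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", np.sum(labels == -1))
print("some core point indices:", db.core_sample_indices_[:5])
```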
First we generate some samples with two features, and we say they should have four centroids (centers). So I'm generating some X and y data randomly, and this data set will be used for the clustering algorithms. Don't worry about range_n_clusters yet — we need it because we will try different numbers of clusters and compute the silhouette score for each, so I've just initialised it with the values two, three, four, five and six. If I look at my X data, it has two features, and my y data is one output column telling which class each point belongs to; that is what make_blobs gives you. Now let's apply the k-means clustering algorithm. As I said, I will be using WCSS, which means within-cluster sum of squares. I import KMeans, and then for i in range(1, 11) — that is, for different numbers of centroids — I fit k-means and record the WCSS, so that I can draw the elbow graph I showed you earlier. Inside the loop I create KMeans with n_clusters=i, the initialisation technique k-means++ (so that the initial centroids are chosen far apart), and random_state=0; I call fit, and then I append kmeans.inertia_ to the wcss list — inertia gives you the sum of squared distances between the points and their centroids. Finally I plot it, and the graph really does look like an elbow. The point I'm going to pick is the last abrupt change: after this value the changes are gradual, so I select k = 4. Now, with the silhouette score we are going to check whether k = 4 is actually valid. With n_clusters = 4 I can get the predictions, and that part is done. Next, this big block of code — I've taken it directly from the scikit-learn silhouette example page; it is given there as is, and I'll just point out the important parts. With respect to the different cluster counts — two, three, four, five, six — we are going to check whether the k value should really be four, using silhouette scoring. So we loop: for n_clusters in range_n_clusters, starting with two, and we initialise the clusterer with that n_clusters value and a random generator seed of 10 for reproducibility. (A minimal sketch of the elbow loop from a moment ago follows below, before we continue with the silhouette comparison.)
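A minimal sketch of the elbow loop just described, under the assumption of a make_blobs data set with two features and four centers; the exact sample counts and random seeds in the actual notebook may differ.

```python
# Elbow method: fit KMeans for k = 1..10 and plot the within-cluster sum of squares.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# two-feature data with four centroids, like the make_blobs step described above
X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

wcss = []  # within-cluster sum of squares for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init="k-means++", random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ = sum of squared distances to the closest centroid

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method: pick k at the last abrupt change")
plt.show()
```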
First n_clusters is 2; I call fit_predict on X, and then I compute silhouette_score on X and the cluster labels. What does this do? As we discussed for the silhouette method, for each point it finds a(i) — the average distance to points in its own cluster — and b(i) — the average distance to the nearest other cluster — and computes the score from them; the value lies between −1 and +1, and the closer to +1 the better. That is exactly what this function does, and it gives me the average silhouette value. We then repeat this for the other cluster counts. The rest of the code you see is nothing complex; it is just there to display the data nicely as plots — again, I did not write it, I took it straight from the scikit-learn silhouette page. Let me execute it and look at the output. For n_clusters = 2 the average silhouette score is 0.704 — I told you the value is between −1 and +1, and 0.704 is very good. For n_clusters = 3 it is 0.588, for n_clusters = 4 it is about 0.65 which is also quite good, for n_clusters = 5 it is 0.563, and for n_clusters = 6 it is around 0.45. So for two clusters we get the highest score, 0.704 — should we therefore select n_clusters = 2? Not so fast: we should not conclude directly from the average; we also need to check whether any cluster contains points with negative silhouette values. Look at the first plot, for two clusters: the values run from 0 up towards 1 and never dip below zero, so two clusters did solve the problem cleanly — I'll keep it as a candidate; k = 2 may perform well. Now look at the next one, for three clusters: one of the clusters has negative values. A negative value means a(i) is greater than b(i), so I am not going to prefer this, even though the clusters look reasonable. What is the problem with that cluster? If I take a point in it and compute distances, that point is actually nearer to points of the other cluster than to its own, and that is why the silhouette value comes out negative. The dotted line in the plot marks the average score, about 0.588 in this case. (A minimal version of this scoring loop is sketched below.)
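A stripped-down version of that scoring loop, keeping only the silhouette part (the full plotting code on the scikit-learn example page is much longer); the data-generation parameters here are assumptions.

```python
# For each candidate cluster count, report the average silhouette score and the
# minimum per-sample value (a negative minimum means some points sit in the wrong cluster).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    silhouette_avg = silhouette_score(X, cluster_labels)   # average over all points, in [-1, 1]
    sample_values = silhouette_samples(X, cluster_labels)  # per-point values, to check for negatives
    print(f"n_clusters={n_clusters}: avg={silhouette_avg:.3f}, "
          f"min sample value={sample_values.min():.3f}")
```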
So that negative score basically indicates that the point is nearer to the other cluster, which is why we get a negative value — you really need to understand this. Similarly, if I go to n_clusters = 4, this looks good: there are no negative values, and you can see how nicely it has divided the points with k = 4. With 5 you can see some negative values, and with 6 there are negative values as well, so I will definitely not go with 6; I may go with 4 or with 2. Now, whenever you have such options, take the bigger number — instead of 2 take 4 — because it will be able to create a more generalized model. So from this I am going to take k = 4. Should we compare with the elbow method? There I also got 4, so both match. This shows that with the silhouette score we can come to a conclusion and validate our clustering model properly — and this is how you validate such a model. You can go through the rest of the code yourself; essentially it takes the average value and then, for each cluster in the loop, maps the per-cluster values and displays them. So that was the session: today we covered k-means, hierarchical clustering, the silhouette score and DBSCAN clustering. In tomorrow's session the pending topics are: first SVM and SVR, second XGBoost, and third PCA — let's see whether we can complete all of that. One more thing I want to teach you, because many people ask me for the definitions of bias and variance. People get confused: let's say a model gives around 90 percent accuracy on the training data and around 70 percent on the test data. Which scenario is this? Most people will say it is overfitting, and when I say overfitting I describe it as low bias and high variance — and that is where the confusion starts. Why do we always say bias with respect to the training data and variance with respect to the test data? For this you need the definition of bias, so let me write it down: bias is a phenomenon that skews the result of an algorithm in favour of or against an idea.
When I say "idea" in that definition, think for now of the training data set. When we train a model on a particular training data set, the outcome may be in favour of that data or against it — in other words the model may perform well on it or it may not. So there are two scenarios of bias: if it is in favour, meaning the model performs well on the training data, we say it has low bias; if it is not able to perform well on the training data, we say it has high bias. This is exactly where many people get confused. Now similarly, let's talk about variance, because the definition really matters here. Variance refers to the changes in the model when using different portions of the training or test data. Let's unpack that. Whenever we work on a data set we split it into two parts, the training data and the test data. The model gets trained on the training data, and how well it does there is what bias describes. But when we come to prediction, we use data the model was not trained on in the same way — a different portion of the training data, or the test data. On that data we make predictions, and again there are two scenarios, and that is what variance describes: whether the model's behaviour holds up when the data changes. If it gives good predictions — the accuracy on the test data is also good — we say low variance; if the accuracy on the test data is bad, we say high variance. So let's look at three scenarios: this is my model 1, this is my model 2 and this is my model 3.
In this scenario, say model 1 has a training accuracy of 90 percent and a test accuracy of 75 percent; model 2 has a training accuracy of 60 percent and a test accuracy of 55 percent; and model 3 has a training accuracy of 90 percent and a test accuracy of 92 percent. What do we get? For model 1 the training accuracy is good, so with respect to bias we say low bias; the test accuracy is clearly worse than the training accuracy, so we say high variance. For model 2, going by the same definitions, it is high bias and high variance, because it is not performing well on either. The last scenario, model 3, is the one we want: low bias and low variance — this gives us a generalized model, and that is our aim when we work as data scientists. Many people have asked me for these definitions, and I hope with these examples you have understood the two terms we keep using: high bias, low bias, high variance, low variance. A tiny code restatement of these three scenarios is given below.
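A tiny restatement of that rule of thumb in code — bias judged from the training accuracy, variance from the test accuracy; the 0.80 cut-off is purely an illustrative assumption, not a standard threshold.

```python
# Label a (train accuracy, test accuracy) pair using the lecture's rule of thumb:
# bias comes from how well the model does on the training data, variance from the test data.
def diagnose(train_acc: float, test_acc: float, good: float = 0.80) -> str:
    bias = "low bias" if train_acc >= good else "high bias"
    variance = "low variance" if test_acc >= good else "high variance"
    return f"{bias}, {variance}"

print(diagnose(0.90, 0.75))  # model 1 -> low bias, high variance  (overfitting)
print(diagnose(0.60, 0.55))  # model 2 -> high bias, high variance (underfitting)
print(diagnose(0.90, 0.92))  # model 3 -> low bias, low variance   (generalized model)
```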
Now let's move on to the next algorithm, XGBoost. Let's consider a small data set with salary, credit score and approval, and use it to understand how XGBoost works. If salary is less than or equal to 50k and the credit is bad, approval is 0 — he or she will not get the loan; if salary is ≤ 50k and the credit is good, approval is 1; another ≤ 50k row with good credit is again 1; if salary is greater than 50k and the credit is bad, approval is 0; greater than 50k with good credit is 1; and greater than 50k with normal credit is also 1. That is my data set. The full form of XGBoost is extreme gradient boosting, and it can be used to solve both classification and regression problems. It is a boosting technique, and internally it uses decision trees — so let's see how those decision trees are constructed in XGBoost. Whenever we start an XGBoost classifier, the first thing we create is a base model. This base model is a weak learner, and in a classification problem it always gives an output probability of 0.5 for any record you pass to it — it is just a dummy base model. Next to the data I create a residual column. If the predicted probability is 0.5, the residual is approval minus that value: 0 − 0.5 = −0.5, 1 − 0.5 = 0.5, 1 − 0.5 = 0.5, 0 − 0.5 = −0.5, then 0.5 and 0.5 for the last two rows. Let me also add one more record so we have a bit more data: salary ≤ 50k with normal credit and approval 0, whose residual is again −0.5. So the first step is to create the base model; it matters because all the decision trees are built sequentially after it. Now the steps for constructing each tree — please note them down. Step one: create a binary decision tree using the features. Step two: calculate the similarity weight; the formula is the square of the sum of the residuals divided by the sum of probability × (1 − probability) plus lambda, where lambda is a hyperparameter that keeps the tree from overfitting. Step three: calculate the information gain. Let's go ahead and construct the tree. Say I take the salary feature as my root node and split on it; since XGBoost builds binary trees, one branch is salary ≤ 50k and the other is salary > 50k.
So those are the two branches of the binary split; in the case of credit, where there are three categories, I'll show you further on how that also gets converted into a binary split. So we have salary ≤ 50k and salary > 50k. Before the split, the residuals we are training on are: −0.5, 0.5, 0.5, −0.5, 0.5, 0.5, and finally −0.5. Now if I make the split, the ≤ 50k branch collects the residuals of the ≤ 50k rows: −0.5, 0.5, 0.5 and −0.5; the remaining rows go to the > 50k branch, which gets −0.5, 0.5 and 0.5. Where do these residuals come from? From the base model, which by default outputs 0.5: the residual is the approval value minus that probability, so 0 − 0.5 gives −0.5 and 1 − 0.5 gives 0.5. So that was the first step, we have constructed a binary tree. The second step says calculate the similarity weight, and the formula is the square of the sum of the residuals divided by the sum of p(1 − p) plus lambda — note that it is the whole sum squared, not each residual squared separately. Let's calculate it for the ≤ 50k node. The numerator is (−0.5 + 0.5 + 0.5 − 0.5) squared. The denominator is the summation of probability × (1 − probability), where the probability for each point comes from the base model; so for every point in the node I take p and 1 − p, multiply them, and sum those products.
For the ≤ 50k node I do that summation four times: 0.5 × (1 − 0.5) four times, which gives 1, and lambda we will set to 0 for now (it is a hyperparameter; I'll talk about it, but let's keep it 0 here). In the numerator, −0.5 and 0.5 cancel, and the other 0.5 and −0.5 cancel, so the sum is 0 and the whole numerator is 0 — and 0 divided by anything is 0. So the similarity weight of the ≤ 50k node is 0. Now the > 50k node: remember it is the whole sum squared, not each term squared. The residuals are −0.5, 0.5 and 0.5, so the numerator is (−0.5 + 0.5 + 0.5)² = 0.25; there are three points, so the denominator is 0.25 + 0.25 + 0.25 = 0.75, plus lambda which is 0; and 0.25 / 0.75 = 1/3, which is about 0.33. So the similarity weight for that node is 0.33. Before we compute the information gain, let's also do the root node. (And why is the base-model probability 0.5? Just understand that it is a dummy model — effectively an if-condition that always returns 0.5.) For the root, most of the residuals cancel: the sum of all seven residuals is 0.5, so the numerator is 0.25; the denominator is seven times 0.25, which is 1.75; and 0.25 / 1.75 = 1/7 ≈ 0.142, so roughly 0.14. So we have 0.14 at the root, 0 on the ≤ 50k side and 0.33 on the > 50k side. Now we calculate the information gain.
The third step is to calculate the information gain. In this case it is the sum of the similarity weights of the two child nodes minus the similarity weight of the parent: 0 + 0.33 − 0.14, which is 0.19. So the information gain of this split is 0.19. You already know that features get selected based on information gain; let's say salary gives the highest gain, so the first split is on salary. Now we go further and split with the next feature, credit. Again it has to be a binary split, but credit has three categories — so how do we split? We can group two categories on one side, say good and normal, and bad on the other side, and it becomes a binary split again. Let's see which data points fall where, following the ≤ 50k path. The first ≤ 50k record has bad credit, so its residual, −0.5, goes to the bad side — that is the only record there. The ≤ 50k records with good credit contribute 0.5 and 0.5, and the ≤ 50k record with normal credit contributes −0.5; the > 50k records never reach this branch, so three residuals land on the good/normal side and one on the bad side. Then we repeat the same process and calculate the similarity weight. For the bad leaf there is only one residual, so the numerator is (−0.5)² = 0.25, and the denominator is just one p(1 − p) term, 0.5 × 0.5 = 0.25; 0.25 / 0.25 gives a similarity weight of 1.
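A small sketch that redoes the hand calculation above (including the credit split we are about to finish), using the similarity-weight formula from step two with lambda = 0; the helper function and variable names are just for illustration, this is not the internal xgboost code.

```python
# similarity = (sum of residuals)^2 / (sum of p*(1-p) + lambda); gain = left + right - root.
def similarity(residuals, p=0.5, lam=0.0):
    num = sum(residuals) ** 2                    # square of the SUM, not sum of squares
    den = len(residuals) * p * (1 - p) + lam     # every base-model probability is 0.5
    return num / den

root  = [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]   # residuals = approval - 0.5
left  = [-0.5, 0.5, 0.5, -0.5]                   # salary <= 50k
right = [-0.5, 0.5, 0.5]                         # salary > 50k

gain_salary = similarity(left) + similarity(right) - similarity(root)
print(similarity(left), round(similarity(right), 2), round(similarity(root), 2))   # 0.0 0.33 0.14
print(round(gain_salary, 2))                                                       # 0.19

# further split of the salary <= 50k node on credit: {bad} vs {good, normal}
bad, good_normal = [-0.5], [0.5, 0.5, -0.5]
gain_credit = similarity(bad) + similarity(good_normal) - similarity(left)
print(similarity(bad), round(similarity(good_normal), 2), round(gain_credit, 2))   # 1.0 0.33 1.33
```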
Now the good/normal leaf: again this is simple — 0.5 and −0.5 cancel, so the numerator is (0.5 + 0.5 − 0.5)² = 0.25, the denominator is three times 0.25 = 0.75, and the similarity weight is 1/3 ≈ 0.33. Then the information gain of this credit split: I add the two children and subtract the parent — 1 plus 0.33 minus 0, and why zero? Because the similarity weight of the node above, the ≤ 50k node, was 0. So the gain is 1.33. Like this, further splits keep happening on different nodes — always binary splits — and we compare candidate splits by information gain to decide which one to keep. Now let's say I have developed my entire binary decision tree, which is the speciality of XGBoost, and let's consider the inferencing part: a record comes in, how do we calculate the output? First the record goes to the base model, which gives probability 0.5. Based on that 0.5, how do we get to the real prediction? We apply the log-odds: log(p / (1 − p)). For the base model that is log(0.5 / 0.5) = log(1) = 0, so whenever any record goes in, the base model contributes 0. To that I add the contribution of the binary decision tree: the record follows its branch — here salary ≤ 50k, then bad credit — and lands in the leaf whose value (in this walk-through, the similarity weight) is 1. We multiply that leaf value by a learning-rate parameter, let's call it alpha, which is usually a small value, just like learning rates elsewhere. On top of this sum we apply an activation function called sigmoid — since this is a classification problem — and the sigmoid squashes the result so the output lies between 0 and 1.
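A sketch of that inference flow as described in this walk-through: base log-odds plus learning rate times the leaf value, then sigmoid. Note that the leaf value of 1 and the learning rate of 0.3 follow the lecture's simplified example; the actual XGBoost library stores a leaf output value rather than the similarity weight.

```python
# Inference as sketched above: start from the base model's log-odds, add
# learning_rate * (value of the leaf the record falls into), then apply sigmoid.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

base_p = 0.5
base_log_odds = math.log(base_p / (1 - base_p))   # log(1) = 0 for the dummy base model

learning_rate = 0.3                               # illustrative value
leaf_value = 1.0                                  # leaf reached by a (<=50k, bad credit) record

raw_score = base_log_odds + learning_rate * leaf_value
print(round(sigmoid(raw_score), 3))               # probability between 0 and 1
```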
Now I hope you're getting how the entire inferencing happens. We construct decision trees like this one after another, so the whole function looks like: output = alpha_0 (the base model) plus alpha_1 × decision tree 1 plus alpha_2 × decision tree 2 plus alpha_3 × decision tree 3, and so on up to alpha_n × decision tree n — that is the final output when you infer on a new record. The reason we call this boosting is that we keep adding each decision tree's output, step by step, to reach the final prediction. Question: does credit need to be simplified further? Yes — we can also split credit differently, say good on one side and bad and normal on the other, and whichever grouping gives more information gain is the one that gets taken. It is very difficult to calculate and visualise all of this by hand, and that is one reason we say XGBoost is a black box model. Is it prone to overfitting? At some stage we do hyperparameter tuning, and we also do pre-pruning. And no, these trees are not independent: first comes the base model, then my data goes to decision tree 1 (the binary tree we built), then we build another decision tree, again a binary tree, and the prediction is 0 from the base model plus alpha_1 times decision tree 1 plus alpha_2 times decision tree 2, and we keep adding more trees until the whole ensemble becomes a strong learner. For a regression problem the decision tree is constructed the same way from the independent features, and the lambda value is a hyperparameter that we set with the help of cross-validation. A quick library-level sketch is below.
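For completeness, a rough idea of how this looks with the xgboost package itself; the numeric encoding of the toy table and all parameter values are assumptions for illustration, not a tuned setup.

```python
# In practice you would not hand-build these trees; the xgboost package does it for you.
import numpy as np
from xgboost import XGBClassifier

# toy encoding of the credit example: salary flag (1 if > 50k) and credit (0=bad, 1=normal, 2=good)
X = np.array([[0, 0], [0, 2], [0, 2], [1, 0], [1, 2], [1, 1], [0, 1]])
y = np.array([0, 1, 1, 0, 1, 1, 0])

model = XGBClassifier(
    n_estimators=50,      # how many sequential trees to add
    learning_rate=0.3,    # the alpha applied to each tree's output
    max_depth=2,
    reg_lambda=1.0,       # the lambda regularisation term from the similarity-weight formula
)
model.fit(X, y)
print(model.predict_proba(X)[:, 1])
```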
Now let's go ahead and discuss the second algorithm, the XGBoost regressor, and how it actually works. Is it following random forest? No — random forest is completely different; there bagging happens. So let's go with the regressor. Here I'm going to take an example: experience, gap, and salary, where salary is my output feature. Say experience is 2, 2.5, 3, 4 and 4.5, the gap column is yes, yes, no, no, yes, and the salaries are around 40k, 42k, 52k, 60k and 62k. The first step, just like in the classifier, is to create a base model. Here the base model outputs the average of all the target values, which comes to roughly 51k — so by default the base model takes any input and just predicts 51. Based on that I calculate my residuals: 40 − 51 = −11, 42 − 51 = −9, 52 − 51 = 1, 60 − 51 = 9 and 62 − 51 = 11 (I've used 42k instead of 41k just to make the calculation a little easier). Those are my residuals. Next, as before, I construct my decision tree; let's say I use the experience feature. Experience is a continuous feature, so the split works the way I've already shown you for continuous features in decision trees. The root node holds all the residuals: −11, −9, 1, 9 and 11. Now I do a binary split on experience: one branch is experience ≤ 2 and the other is experience > 2. For ≤ 2 I get only one value, −11, and the other branch gets all the remaining values: −9, 1, 9 and 11. After this we calculate the similarity weight, but for regression the formula changes a little: it is the square of the sum of the residuals divided by the number of residuals plus lambda — and the larger the lambda, the more we penalise the residuals. Let's apply it to the first leaf, taking lambda as 0 for the moment: (−11)² = 121 divided by 1 + 0, so the similarity weight is 121.
Now, what if we take lambda equal to 1? Then we directly penalise the similarity weight by adding one to the denominator. So let's do that: for the ≤ 2 leaf it becomes 121 divided by (1 + 1), which is 60.5. Similarly for the other leaf: (−9 + 1 + 9 + 11) is 12, 12 squared is 144, and 144 divided by (4 + 1) is 28.8. And for the root node: −11 − 9 + 1 + 9 + 11 sums to 1, squared it is still 1, and divided by (5 + 1) it is 1/6, roughly 0.17. Finally the information gain is simple: 60.5 + 28.8 − 1/6, which comes to about 89.13. That is the information gain for this particular split; we compare it with the gain of every other candidate split and keep whichever is better, and in this way the whole tree gets built and the trees get added sequentially. (Don't worry too much about the arithmetic — in practice the library does all of this automatically.) Now the next split we could try is experience ≤ 2.5 versus > 2.5. With that split the two records with experience up to 2.5 go one way, so the left leaf gets −11 and −9, and the right leaf gets 1, 9 and 11. If this gives a better information gain, the split will happen this way instead. The similarity weight of that left leaf is (−11 − 9)² = (−20)² = 400, divided by (2 + 1) = 3, which is 133.33. Similarly we can compute the right leaf: (1 + 9 + 11) squared divided by (3 + 1).
1 + 9 + 11 is 21, 21 squared is 441, and 441 divided by 4 gives 110.25. For the root it is the same as before, 1/6. So the information gain for this split is 133.33 + 110.25 − 1/6, roughly 243.4, which is clearly greater than the previous 89.13 — so we will use this split, since it is better than the previous one. Let's say this is the split that gets taken. Finally, how do we produce the output — how does inferencing work? Any new record first goes to the base model, whose value is 51, and to that we add alpha (the learning rate) times the output of the leaf the record lands in. In a regression tree the leaf output is the average of the residuals in that leaf: if the record goes down the branch containing −11 and −9, the average is (−11 − 9) / 2 = −10, so −10 gets multiplied by alpha and added; if it goes down the branch containing 1, 9 and 11, the average is 21 / 3 = 7, so the leaf value there is 7.
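A short sketch reproducing the regression-side numbers above, with lambda = 1 as in the worked example; the learning rate of 1.0 at the end is only there to show the 51 + 7 arithmetic.

```python
# Regression case: similarity = (sum of residuals)^2 / (n + lambda),
# gain = left + right - root, and the leaf output at inference is the mean residual.
def similarity(residuals, lam=1.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = [-11, -9, 1, 9, 11]                                     # residuals = salary - 51

# split 1: experience <= 2 vs > 2
gain_2 = similarity([-11]) + similarity([-9, 1, 9, 11]) - similarity(root)
print(round(similarity([-11]), 1), round(similarity([-9, 1, 9, 11]), 1), round(gain_2, 2))
# 60.5 28.8 89.13

# split 2: experience <= 2.5 vs > 2.5 -- higher gain, so this split wins
gain_25 = similarity([-11, -9]) + similarity([1, 9, 11]) - similarity(root)
print(round(similarity([-11, -9]), 2), round(similarity([1, 9, 11]), 2), round(gain_25, 2))
# 133.33 110.25 243.42

# inference for a record that lands in the (1, 9, 11) leaf, with learning rate alpha
alpha = 1.0                                                    # illustrative value
prediction = 51 + alpha * (sum([1, 9, 11]) / 3)                # 51 + 7 = 58
print(prediction)
```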
So that was with respect to decision tree 1; in the same way we construct further trees, and the prediction becomes the base value plus alpha_2 times decision tree 2, alpha_3 times decision tree 3, and so on up to alpha_n times decision tree n — that sum is your final output in the regression case. Essentially you are playing with the same ingredients in a slightly different way to compute everything. Everybody clear? But again, it is a black box model; you cannot easily visualise all of it. Now let's go to the third algorithm, SVM. SVM starts off very much like logistic regression: suppose I have two classes of data points; with logistic regression we try to create a best fit line that divides them. In SVM we not only create that best fit line but also two additional lines called marginal planes: the middle one is the hyperplane and the ones on either side are the marginal planes, and whichever hyperplane gives the maximum distance between its marginal planes will divide the points most effectively. In a normal scenario, though, there is usually a lot of overlap between the points — I may have one point here and another point overlapping it — so it is very difficult to get a clean straight marginal plane that splits the points exactly. Still, we want this margin to be as large as possible. When a perfect separation is possible, we call it a hard marginal plane; when some points overlap and some errors are unavoidable, we call it a soft marginal plane. So in SVM we focus on creating the marginal planes with maximum distance, and even when there are some errors we handle them through hyperparameters. How do we actually create these marginal planes? Think of it this way: suppose this is my best fit line. The usual equation of a line is y = mx + c, where m is the slope and c is the intercept. (A hard margin is practically impossible on a normal data set, so we generally go with a soft margin.) Now, ax + by + c = 0 is also the equation of a straight line — so can I say that both equations are the same? Let's check.
If I take ax + by + c = 0 and solve for y, I get y = (−a x − c) / b, so it is the same form with m = −a/b and the intercept equal to −c/b. So both equations are essentially the same. And whenever I write y = mx + c, I can also write it as y = w1·x1 + w2·x2 + … + b, in other words w-transpose x plus b — the same equation, just written with weights. Now let's say the slope is negative, say −1, and the line passes through the origin, so the intercept b is 0. Take a point on one side, (−4, 0). If I evaluate w-transpose x plus b, treating the slope −1 as my weight in this simplified check, I get (−1) × (−4) + 0 = 4, a positive value. And since that came out positive, any point I take on that side of the line will give a positive value when I compute it this way. Similarly, take a point on the other side, say (4, 4): the same computation gives (−1) × 4 + 0 = −4, a negative value — so any point on that side of the plane will give a negative value. So what do we get? Positive values on one side and negative values on the other, which means we can treat one side as one category and the other side as the other category.
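A clean two-dimensional version of that sign test; here I assume the separating line is x1 + x2 = 0 (slope −1 through the origin), so w = (1, 1) and b = 0 — the sign convention simply flips depending on how w is written.

```python
# Which side of the line a point lies on is given by the sign of w.x + b.
import numpy as np

w = np.array([1.0, 1.0])   # assumed weights for the line x1 + x2 = 0
b = 0.0

def side(point):
    return float(w @ np.asarray(point) + b)   # positive on one side, negative on the other

print(side([-4, 0]))   # -4.0 -> one class
print(side([4, 4]))    #  8.0 -> the other class
```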
That basically means I can definitely use a plane to split the points. Now let's see how the marginal plane gets created and what the cost function is that makes the margin work. Suppose I have two well-separated groups of points, and this is my best fit line that splits them. Apart from that, I also create the marginal planes: each one passes through the nearest point on its side, parallel to the best fit line. I've already told you this line can be written as w-transpose x plus b = 0, because ax + by + c = 0 is the same thing — that we have already proved. Now let's give the marginal planes equations too. For the plane on the negative side, anything computed from that side comes out negative, so I'll write it as w-transpose x plus b = −1; and the plane on the other side is w-transpose x plus b = +1, because from that side the computed values come out positive. Strictly I should write these as −k and +k, but in most articles and research papers you will see −1 and +1, so let's use that. My aim is to increase this distance between the two marginal planes — if that distance increases, my model is performing well. So let's find that distance. Take a point x1 lying on the plane w-transpose x plus b = 1 and a point x2 lying on the plane w-transpose x plus b = −1. If I subtract the two equations, the b cancels and I get w-transpose (x1 − x2) = 2.
Now always understand: whenever we consider any vector, it also has something called a magnitude. So if I want to keep only the direction, I can divide by the magnitude of w; then only the unit vector remains. So I am going to divide both sides by this magnitude of w, because right now we don't care about the direction, we just care about the distance, and the distance between the two marginal planes becomes 2 divided by the magnitude of w.

Now when I write it like this, what is our aim? Can I say that our aim is to maximize 2 by magnitude of w? Yes or no, guys? Our aim is to maximize this by updating the w and b values. Is everybody clear with this? If I maximize this, that basically means my marginal plane becomes bigger.

Now, along with this, can I write a condition? My output y of i will follow two cases: y of i is plus 1 when w transpose x plus b is greater than or equal to 1, and y of i is minus 1 when w transpose x plus b is less than or equal to minus 1. What does this basically mean? Whenever I compute w transpose x plus b and it is greater than or equal to 1, I'm obviously going to get plus 1, and when w transpose x plus b is less than or equal to minus 1, I'm always going to get the output as minus 1. That is the reason why I have written it like this. So these two things we have already discussed: we want to increase the marginal plane, and I'm writing the condition that my y of i value will be plus 1 when w transpose x plus b is greater than or equal to 1, and minus 1 when it is less than or equal to minus 1. Everybody clear with this?

Now on top of it we can add one more very important point. Instead of writing "such that" with two separate cases, we can also say that our major aim is this: if I multiply y of i by w transpose x of i plus b, this product will always be greater than or equal to 1 for correctly classified points. Because understand, if the label is minus 1 and it is a correct point, then w transpose x plus b is also negative, and minus into minus will obviously give something greater than or equal to 1; similarly, for the plus 1 case the product will also be greater than or equal to 1.
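Here is a minimal sketch of that combined constraint, again with assumed values of w, b, and the sample points, chosen only for illustration.

```python
import numpy as np

# Assumed separating line w^T x + b with marginal planes at +1 / -1.
w = np.array([1.0, 1.0])
b = -3.0

# (point, label) pairs; labels are +1 or -1 as in the lecture.
samples = [
    (np.array([3.0, 2.0]), +1),   # w^T x + b = +2.0
    (np.array([0.0, 1.0]), -1),   # w^T x + b = -2.0
    (np.array([2.0, 1.5]), +1),   # w^T x + b = +0.5, inside the margin
]

for x, y in samples:
    product = y * (np.dot(w, x) + b)     # y_i * (w^T x_i + b)
    status = "satisfies >= 1" if product >= 1 else "violates the margin"
    print(f"y * (w^T x + b) = {product:+.1f} -> {status}")
```

Correctly classified points outside the margin give a product of at least 1 in both the plus 1 and minus 1 cases, which is why the two conditions collapse into this single constraint.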
So I can definitely say that my major aim is that when I multiply y of i with this, it will always be greater than or equal to plus 1, which is just saying that it will be a positive value for correct points. So this is just a representation, guys, but understand the cost function. This is my maximization objective: maximize over w comma b the quantity 2 by magnitude of w. I can also write it the other way: minimize over w comma b the inverse of it, which is magnitude of w by 2. Are these both the same or not? Yes, both are equivalent. And why do we specifically write it as a minimization? Because in machine learning algorithms, when we are continuously updating the values of w and b during optimization, we usually frame the objective as something to minimize. So here my main target is to minimize this particular value by changing w and b, and now I will start adding some more parameters over here. This is fine till here; I think everybody has got it. This is our aim.

But I'm going to add two more terms to this optimizer: one is c, and the other is a summation from i equal to 1 to n of a term I will call eta of i (this slack term is more commonly written as xi of i). First of all I'll tell you what c is. See, with this specific data set, let's say some of my points fall on the wrong side over here: is that a right prediction or a wrong prediction? Obviously it is a wrong prediction; if my points are somewhere here, again it is an incorrect prediction, right? So this c value basically says how many errors we can tolerate. If it says fine, we can have six or seven errors, that is how many errors we can allow even though we are using the marginal plane. So that is what c specifies. Eta of i basically captures the distance of the wrong points: since we are doing a summation, this entire term measures the total distance of the wrongly placed points from their marginal plane. And how do we calculate that distance? Suppose this is a wrong point: I will calculate the distance from the marginal plane to this point, and I will do that summation for every such point; similarly for the green points another summation will happen in the same way. So we are telling the optimizer: fine, if you are not able to fit perfectly, apply these two hyperparameters, accept that this many errors are there, it is well and good, no problem, do the summation of those distances, and based on that construct the best fit line along with the marginal plane, even though there are some errors on this side or that side.

One more thing is there, which is called SVR, support vector regression. In SVR only one thing gets changed in this formulation; only this constraint value changes, and remaining everything is the same. So I want you all to explore this; this will be one assignment for you: just try to find out which value changes so that this same setup becomes an SVR, and let me know.
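If you want to see the effect of the c hyperparameter discussed above in practice, here is a minimal scikit-learn sketch on a synthetic dataset; the dataset and the c values are illustrative assumptions, not something from the session.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data, assumed only for illustration.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=0)

# Smaller C tolerates more margin violations (wider margin);
# larger C penalizes violations heavily (narrower margin, fewer errors allowed).
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2 / np.linalg.norm(model.coef_)
    print(f"C={C:<6} margin width = {margin_width:.3f}, "
          f"support vectors = {len(model.support_)}")
```

For the SVR assignment, scikit-learn's sklearn.svm.SVR class is a reasonable place to start comparing, since it optimizes a very similar objective with a modified constraint.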
So overall, did you like the entire session, everyone? Okay, one more thing is there, which is called the kernel, the SVM kernel. Now with an SVM kernel, what happens? Suppose I have data points which look like this, one class surrounded by the other, so we obviously cannot use a straight line to divide them. What we do is convert these two dimensions into three dimensions and then push the points apart: one set of points goes up like this and the other set of points goes down, and then we can use a plane to split them. I have uploaded a video around that and you can definitely have a look at it; I have also shown practically how to do it, and that is the reason I created that specific video.

So great, this was it from my side. I hope you liked this session. Thank you everyone, have a great day, keep on rocking, keep on learning, and never give up.
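As a small follow-up sketch of the kernel idea mentioned near the end, here is a minimal scikit-learn example on a synthetic circles dataset; the dataset and kernel choice are assumptions for illustration, not material from the session.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes arranged in concentric circles: no straight line can split them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # a straight line struggles here
kernel_svm = SVC(kernel="rbf").fit(X, y)      # kernel maps to a higher dimension

print("linear kernel training accuracy:", linear_svm.score(X, y))
print("rbf kernel training accuracy:   ", kernel_svm.score(X, y))
```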