What is going on everybody, this is JP and welcome to Code Boosters. In this video we are going to talk about gradient descent for linear regression. We know that linear regression makes its predictions by plotting the straight line that best fits the data set, and the way we judge that line is with something called the cost function, which measures how well the predictions are doing; in other words, the cost value represents the error between the predictions and the actual values. In the previous video we saw that the cost function is given by J(θ) = (1/2m) · Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)², where m is the total number of data points, y are the actual values or labels, and ŷ (y-hat) are the predictions. For the straight line to fit the data, the cost, which represents the error, should be as small as possible, and to minimize it we use the gradient descent algorithm. So gradient descent is the algorithm we use to minimize the cost function so that our straight line fits the data set as well as possible.

So what is the gradient descent algorithm? Let's plot a graph of cost versus theta. Here I am taking only one of the many theta parameters; if the number of theta parameters increases, the dimension of the graph increases, so for simplicity I am plotting a single theta against the cost. The graph of cost versus theta looks like a parabola. Let's say at the beginning our cost value sits somewhere high up on one side. Our goal is to minimize the cost, and the minimum lies at the bottom of the parabola, so we need to get from this starting point down to that minimum. How do we do that? We take one step and get a little closer, then another step and get closer still, then another step that overshoots, so we backtrack, and we end up oscillating in the region around the minimum. One thing to note about gradient descent: because we move in discrete steps like this, we are never going to converge exactly at the minimum, but we will oscillate in a region very close to it, and that serves our purpose.

What we do in gradient descent is repeat one step, which is θ := θ − α · ∂(cost)/∂θ, where α is a positive constant. Why does this equation represent one step of the algorithm? Let's take a closer look at the derivative ∂(cost)/∂θ. When we are on the left half of the parabola, the derivative of the cost with respect to theta, which is nothing but the slope of the graph, is negative. So this quantity is negative, and multiplied by the minus sign in the update it becomes positive: theta gets a positive quantity added to it and increases, moving toward the minimum. And if we are on the right half, the slope is positive, so we are subtracting a positive quantity and theta decreases, again moving toward the minimum.
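To make that sign argument concrete, here is a minimal Python sketch of repeating the single update θ := θ − α · ∂(cost)/∂θ on a toy one-dimensional parabola. The function names and the choice of cost(θ) = (θ − 3)² are my own illustrative assumptions, not the linear regression cost from the video.

```python
# Minimal sketch: gradient descent on a toy 1-D parabola whose minimum is at theta = 3.

def cost(theta):
    return (theta - 3) ** 2        # parabola-shaped cost

def grad(theta):
    return 2 * (theta - 3)         # slope: negative left of the minimum, positive right of it

theta = 0.0                        # start far from the minimum
alpha = 0.1                        # positive constant (learning rate)

for step in range(25):
    theta = theta - alpha * grad(theta)   # one gradient descent step

print(theta)   # close to 3; with a larger alpha the steps would overshoot and oscillate around 3
```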
So every time we repeat this update, if theta started on the left side it moves closer and closer to the minimum, and if a step overshoots past the minimum, the next update pushes it back, so it again moves closer and closer. When we repeat this step enough times, we eventually end up at the (local) minimum. That is the gradient descent algorithm.

Now let us derive the derivative of the cost function with respect to theta. Before that, let's see what our data set looks like. Say our data set is X, with m data points and n' features, so it is a matrix of shape (m, n'). Now I am going to do something different with this data set: I am going to append a column vector of ones to it, so the total number of columns, or features, becomes n' + 1; let's call that n. Why I did that will become clear in a moment. Similarly, the labels y have m entries, one for each data point, so the shape of y is (m, 1).

We know that the prediction ŷ has the formula ŷ = θ_{n'} x_{n'} + θ_{n'−1} x_{n'−1} + … + θ₁ x₁ + θ₀. So if we collect the thetas into a vector, it has dimension (n' + 1, 1), and since n' + 1 is nothing but n, the size of the theta vector is (n, 1).

Let me summarize. X is our data set; along the columns are the features. Say the data set is about house prices: the features can be the size of the house, the number of bedrooms, the square-foot area, the number of other rooms, bathroom quality, living-room quality, how many cars the garage can store, and so on. Along the rows are the individual houses; let's say we have data for m houses. For every house there is one label, the price of the house, which is stored in y. In linear regression the prediction is given by the formula above, where the thetas are the parameters and the x's are the features of the house. If I write the thetas as a vector, it has size (n, 1). I hope this is clear.

So why are we discussing this? The prediction equation above represents only one prediction, for one house, where the x's are the features of that house and the thetas are the parameters. If we want to represent the predictions for all m houses in matrix form, they are nothing but X matrix-multiplied by theta: X has size (m, n), theta has size (n, 1), so the product Ŷ = Xθ has size (m, 1) and contains the prediction equation evaluated m times, once for each house.

Now that we can write the predictions in matrix form as Ŷ = Xθ, we can also write the cost function in matrix form: we take Ŷ − Y, square it element-wise, sum the resulting matrix, and multiply by 1/2m, so cost = (1/2m) · Σ (Ŷ − Y)², where Y is the matrix of actual labels and Ŷ is the matrix of predictions. So I have represented that summation entirely in matrix form; this is the cost.
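Here is a small NumPy sketch of those shapes, assuming a toy random data set; the column of ones is what lets the single matrix product Xθ also include the intercept θ₀. The sizes and data below are made up purely for illustration.

```python
import numpy as np

m, n_dash = 5, 3                       # m houses, n' raw features

X_raw = np.random.rand(m, n_dash)      # data set of shape (m, n')
ones = np.ones((m, 1))                 # column vector of ones
X = np.hstack([ones, X_raw])           # shape (m, n' + 1) = (m, n); ones column absorbs theta_0

y = np.random.rand(m, 1)               # labels, shape (m, 1)
theta = np.zeros((X.shape[1], 1))      # parameter vector, shape (n, 1)

y_hat = X @ theta                      # all m predictions at once, shape (m, 1)
print(X.shape, theta.shape, y_hat.shape)   # (5, 4) (4, 1) (5, 1)
```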
Now, given that cost function, what is ∂(cost)/∂θ? If you know calculus, you will see that taking the derivative brings the square down, which cancels the 1/2, and the partial derivative of the inner term brings out the x's, so ∂(cost)/∂θ = (1/m) · Xᵀ(Ŷ − Y). Because these are matrix multiplications, the dimensions must match: Xᵀ has shape (n, m) and (Ŷ − Y) has shape (m, 1), so the overall result has shape (n, 1), which is exactly the shape of the theta vector. So ∂(cost)/∂θ is given by this equation.

Now let's put the complete gradient descent algorithm together. At the beginning we have our data set X and the labels y, and we initialize the theta vector to all zeros, with size (n, 1). Because theta starts at all zeros, the cost will start out large, somewhere high up on the parabola. Then we loop; this is pseudocode for the gradient descent algorithm. Let's say we loop a thousand times; it can be any number of times, as long as it is enough to get close to the minimum. Inside the loop we first find the predictions, given by the matrix multiplication of X and theta. Then we compute the cost value, (1/2m) times the sum of the squares of (Ŷ − Y). Then we find the derivative of the cost with respect to theta, which I am writing as del_theta; we just saw that it equals (1/m) · Xᵀ(Ŷ − Y). Finally we update theta as theta minus alpha times del_theta, where alpha is a positive constant. If we are on the left half of the parabola, the slope is negative, so del_theta is negative and theta increases, moving us toward the minimum. As the loop goes on, we move closer and closer to the minimum, and once we pass it the slope becomes positive, theta decreases, and we end up oscillating in a very small region near the minimum. That is how we minimize the cost function, and that is the complete implementation of the gradient descent algorithm; a runnable sketch follows below.
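Here is one way the pseudocode above could look in NumPy; the helper name gradient_descent, the toy data, and the choices of alpha and n_iterations are my own assumptions for illustration, not the exact code from the follow-up video. It assumes X already has the column of ones prepended.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iterations=1000):
    m, n = X.shape
    theta = np.zeros((n, 1))                              # initialize theta as all zeros, shape (n, 1)

    for _ in range(n_iterations):
        y_hat = X @ theta                                 # predictions, shape (m, 1)
        cost = (1 / (2 * m)) * np.sum((y_hat - y) ** 2)   # cost function value
        d_theta = (1 / m) * X.T @ (y_hat - y)             # derivative of cost w.r.t. theta, shape (n, 1)
        theta = theta - alpha * d_theta                   # one gradient descent step

    return theta, cost

# Toy example: y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = rng.random((100, 1))
y = 2 + 3 * x + 0.05 * rng.standard_normal((100, 1))
X = np.hstack([np.ones((100, 1)), x])                     # prepend the column of ones

theta, cost = gradient_descent(X, y, alpha=0.5, n_iterations=1000)
print(theta.ravel())                                      # approximately [2, 3]
```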
Now one thing to note here is alpha. Alpha is a positive constant and we call it the learning rate, and there is a reason for that name: the value of alpha determines how big each step is going to be. If alpha is large, we take larger steps; if alpha is small, we take smaller steps. But if the value of alpha is extremely large, the algorithm will overshoot: instead of moving down toward the minimum, it jumps across to a point even higher on the other side, then jumps higher still on the way back, so instead of converging it diverges and the cost blows up toward infinity. That is something to keep in mind about alpha.

Okay, that's it. You just saw the gradient descent algorithm, and you also saw the pseudocode for it. Now, if you want it to stick, you must implement the complete linear regression model by yourself, otherwise it is just going to slip out of your mind. I have another video for you which shows the complete implementation of linear regression with Python, and we implement it by building a model for house price predictions; you're going to love that. If you're interested in learning machine learning, then hit the red subscribe button, because I upload a machine learning video every week, and if you are a student like me or you like fun learning, you might like my channel, so hit the red subscribe button. Also, if you liked the video, give it a thumbs up, hit the bell icon as well, and I'll see you in the next one.