Transcript for:
Understanding Gradient Descent in Regression

I believe one thing you really need to understand for gradient descent to make sense is regression models. These are among the simplest models in machine learning, normally used for predicting continuous numerical values, anything from stock prices to temperatures. But the real question is, how do they do this?

Let's assume we are given a dataset of the years of experience and salaries of employees at a certain company, and our task is to predict the salaries that will be paid to a set of new employees with various years of experience. The first thing to keep in mind is that this is a regression problem, because it involves predicting an unknown continuous value, which is the salary in this case.

The first thing we have to do is understand the relationship between the years of experience and the salaries in our training dataset, and this is one of the things done in the exploratory data analysis phase. I believe the most intuitive way to do this is by plotting the years of experience against the salaries. Notice that as the years of experience increase, the salaries also increase, so in statistical terms we say that the years of experience and the salaries have a positive relationship. One thing to note about this plot is that each data point represents an example, or a row, in our training data.

Now here is the most important thing to understand: training our model on this data is as simple as fitting a straight line through these data points. You might be wondering how you are supposed to fit this line through the data points. Should I just drop it here? Should I place it like this, or like that? Or, hold on, maybe I should just place it here? Well, not quite. We leave this part for the computer to figure out. This is what training a regression model consists of: it's just a computer trying to figure out where to place the line.

To understand how this is done, we have to take a closer look at the regression model itself. As you may have already noticed, our regression model is just a straight line, and it has a mathematical equation of the form y = wx + b, where y is the predicted value, w is the model weight, x is the input value, and b is the bias term. In machine learning and deep learning, w and b are called model parameters, and it's these parameters that the model has to learn, because we know everything else about our model apart from them.

So now let's try to understand how these two parameters affect the line by giving them random values. You will realize that the weight changes the orientation (the slope) of the line, and the bias shifts it up and down. So now, how do we know the best position and orientation for the line, or in other words, the best combination of the parameters? Well, this is where error functions, also called loss functions, come in. The central goal of model training is to find the combination of the parameters w and b, the weights and biases, that minimizes the loss function.

But what is a loss function anyway? First we need to understand how errors are measured in regression, and in machine learning in general. An error is simply the difference between the model's predicted value and the actual value. For example, suppose a model is fitted to the training data with the following parameters. The key thing to observe is what the model predicts, and whether its predictions are close to the correct y values for the given x values.
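To make this concrete, here is a minimal Python sketch of the model y = wx + b and the per-point errors. The toy experience/salary numbers and the starting guess for w and b are made up for illustration, not taken from the video's dataset.

```python
# A tiny, made-up experience/salary dataset (illustrative values only).
years = [1.0, 2.0, 3.0, 5.0, 7.0]          # years of experience (inputs x)
salaries = [30.0, 35.0, 42.0, 55.0, 68.0]  # salaries in $1000s (targets y)

def predict(x, w, b):
    """The regression model: a straight line y = w*x + b."""
    return w * x + b

# An arbitrary guess for the parameters; training is about finding better values.
w, b = 4.0, 25.0

for x, y_true in zip(years, salaries):
    y_pred = predict(x, w, b)
    error = y_pred - y_true  # can be negative; the sign is dropped later
    print(f"x={x:.1f}  predicted={y_pred:6.2f}  actual={y_true:6.2f}  error={error:+6.2f}")
```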
To measure the error the model has made, we subtract the actual y value from the predicted value. Note that sometimes the error will be negative, but we really don't care about the sign of the error; instead we either take the absolute value of the error or square it, which ensures that all errors are positive. This process is repeated for all data points in the training dataset.

To measure the model's overall performance, we sum up the squared errors for all data points (the error of one point here is zero because the model predicted its y value exactly) and then divide by the total number of data points, giving us the mean error. This is what we call the mean squared error (MSE) loss function. Another method is to sum the absolute errors of all data points and divide by the total number of data points, giving us another loss function called the mean absolute error (MAE). These are the commonly used loss functions for regression tasks.

So finally, putting it all together: model training means searching for the combination of model parameters that minimizes a loss function, and this idea cuts across everything from classical machine learning algorithms to deep learning. Since we saw earlier that the MSE loss has the form L = (1/n) Σ (y_pred − y)², and that our regression model is of the form y_pred = wx + b, we can unify them into a single function with two learnable parameters, w and b. This unified equation of the loss function lays the foundation onto which gradient descent is built.

Let's ignore the bias parameter for now and only work with the weight parameter. Upon plotting, the MSE is a continuous bowl-shaped curve, which means we can easily investigate its gradient at different values of w. To learn the best possible parameters that minimize the loss function, the model uses an algorithm called gradient descent. This algorithm iteratively adjusts the parameters of the model based on the gradient of the loss function at the current parameter values: w_new = w_old − α · (dL/dw), where w_new is the new weight value, w_old is the old weight value, α (alpha) is called the learning rate, which controls how fast the model learns, and dL/dw is the gradient of the loss function at the current parameter value.

First things first: in machine learning, before training a model, it's always a good idea to standardize the dataset. This ensures that all features are on the same scale, with a mean of zero and a standard deviation of one, which helps the model learn more efficiently. So from now onwards we shall be working with a scaled version of the dataset.

Since we are working with a model with only one parameter, for each parameter value within some range, for example −3 to 3, we will calculate its corresponding MSE loss over all examples in our dataset, creating a plot of w against the loss. This will help us examine how the loss decreases with each parameter update during gradient descent.

Let's set up our training environment. We've already chosen the model and the loss function. Gradient descent requires the gradient of the loss function with respect to w, which is simply the derivative of the loss function with respect to the weight. We have our training data and our plot. We then initialize our model with a random parameter value and set the learning rate to 0.06; by the way, learning rates are usually small values. Having everything we need, let's start training the model.

So now let's break down the first step. First we calculate the sum over all examples in our dataset, then substitute for n, which is the total number of examples, followed by substituting the old parameter, i.e. the current parameter value, which in this case is 3.6.
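Here is a rough Python sketch of that single update step for the one-parameter model (bias dropped), assuming a small made-up standardized dataset. Because this data differs from the video's, the resulting numbers won't reproduce the on-screen 3.6 → 3.28 step exactly, but the mechanics are the same.

```python
import numpy as np

# Toy standardized data (roughly zero mean, unit variance); illustrative only.
x = np.array([-1.4, -0.7, 0.0, 0.7, 1.4])
y = np.array([-1.3, -0.8, 0.1, 0.6, 1.4])

def mse_loss(w):
    # Bias dropped for now: the model is y_hat = w * x
    return np.mean((w * x - y) ** 2)

def mse_gradient(w):
    # d/dw of (1/n) * sum((w*x_i - y_i)^2) = (2/n) * sum((w*x_i - y_i) * x_i)
    return 2.0 * np.mean((w * x - y) * x)

w_old = 3.6   # current (randomly initialized) weight
alpha = 0.06  # learning rate

# One gradient descent step: w_new = w_old - alpha * dL/dw
w_new = w_old - alpha * mse_gradient(w_old)

print(f"loss before: {mse_loss(w_old):.4f}")
print(f"loss after : {mse_loss(w_new):.4f}")
print(f"w: {w_old} -> {w_new:.4f}")
```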
Substituting 3.6 gives us a new parameter value of 3.28. As you can see on the plot, the new parameter reduced the loss function by moving a single step down towards the global minimum. This whole process is called an epoch, and the same process happens all over again until a minimum loss is reached. As you can see, the step sizes keep shrinking as the model approaches the global minimum. Let's also see how this process affects the fitted regression line.

Now let's experiment and see what happens when we make the learning rate huge, say 0.9. As you can see, this makes the model bounce around, making the learning process unstable.

It's now time to bring back the parameter we dropped. Everything essentially remains the same; the only difference is that there are now two gradient steps, one for each parameter. We find the gradient of the loss with respect to the weight, followed by the gradient of the loss with respect to the bias. The more parameters a model has, the more gradient descent steps there are. Everything else remains the same apart from the extra gradient descent step, which the model calculates for the bias parameter, and this is done concurrently with the weight update.

This is how the loss plot for this model looks: it's in three dimensions because of the extra bias parameter. Now let's experiment with different datasets. Gradient descent enables the model to learn all kinds of relationships that exist between the features in these datasets.

The concept of gradient descent lays the foundation onto which more sophisticated optimization algorithms are built, things like Adam (adaptive moment estimation), mini-batch stochastic gradient descent, and so on. Let me know if you found this video helpful, and thanks for watching.
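As a closing sketch, here is one way the full two-parameter training loop described above might look in Python, assuming a toy standardized dataset and an MSE loss. The data, the learning rate, and the number of epochs are illustrative choices, not the exact values used in the video.

```python
import numpy as np

# Toy standardized data (illustrative values only).
x = np.array([-1.4, -0.7, 0.0, 0.7, 1.4])
y = np.array([-1.3, -0.8, 0.1, 0.6, 1.4])

w, b = np.random.randn(), np.random.randn()  # random initialization
alpha = 0.06                                  # learning rate
epochs = 50

for epoch in range(epochs):
    y_pred = w * x + b
    loss = np.mean((y_pred - y) ** 2)
    if epoch % 10 == 0:
        print(f"epoch {epoch:3d}  loss={loss:.4f}  w={w:.3f}  b={b:.3f}")

    # Gradients of the MSE loss with respect to each parameter.
    dw = 2.0 * np.mean((y_pred - y) * x)
    db = 2.0 * np.mean(y_pred - y)

    # One gradient descent step per parameter; both gradients are computed
    # from the same current values of w and b, i.e. the updates happen concurrently.
    w -= alpha * dw
    b -= alpha * db
```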