Transcript for:
Understanding Gradient Descent in Linear Regression

Hello, it's me coming to you again from the future. You might recognize me from my failed videos, like Calculus: Partial Derivative. Anyway, I made some videos with some calculus stuff. They didn't turn out very well. You can find them if you want; they're kind of unlisted now. But I just tried again. And so this video, if you're watching it, is a follow-up to my linear regression with gradient descent video. That video stands alone. It's a programming video where all I do is walk through the code for how to build an example that demonstrates linear regression with gradient descent. And this is a puzzle piece in my machine learning series that will hopefully act as a foundation and a building block to your understanding of some more creative or practical examples that will come later. This video is totally optional as part of this series, because in that one I just applied the formula. What I tried to do in this video is give some background. And I kind of worked it all out here. This is the end; this is what's on the whiteboard. I thought somehow if I used multiple colored markers it would make a better video. I don't think I really succeeded at that. But I kind of walked through and tried to describe the math. I should say that this involves topics from calculus, and there's a great video series by 3Blue1Brown on YouTube that gives you great background and more depth in calculus. So I'll put links to those videos in this video's description. Honestly, if you're really interested in soaking up as much of this as you can, I would go and watch those videos first and then come back here; they'll give you the background for understanding the pieces that I've done here. So I look forward to your feedback, positive and negative, constructive feedback on whether this was helpful and if it made sense. And if you then go on and keep watching, there'll be some future videos where I'll be getting back into the code. But there's no code in this video, just some math stuff. Maths, maths, okay? Enjoy.

So to recap: I have a bunch of data points in 2D space. I have a line in that 2D space. The formula for that line is y = mx + b. When I try to make a prediction, I get a piece of input, a data input x. And from there, I try to make a guess. In addition to the guess, I have the known y. So this is the correct data that goes with x. My machine learning system makes a guess. The error is the difference between those two things: the error is y, the correct answer, minus the guess. So this relates to the idea of a cost function, also called a loss function. If we want to evaluate how our machine learning algorithm is performing, we have this large data set. Maybe it has n elements. What we want to do, from 1 to n, for all n elements, is minimize that error. So the cost function is: cost = the sum from i = 1 to n of (y_i - guess_i)², every known answer minus the corresponding guess, squared. This is known as a cost function. This is the total error for the particular model, the model being the current m and b values that describe this particular line. So perhaps we can agree that our goal is to minimize this cost, also known as a loss. We want to minimize that loss. We want to have the lowest error. We want the m and b values for the lowest error. So we want to minimize this function. Now, what does it mean to minimize a function? This function has the form: something equals something squared.
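To make that cost function concrete before we talk about minimizing it, here's a minimal sketch in Python. This is just my own illustration, not the code from the previous video; the name cost and the sample data points are made up for this example.

def cost(m, b, points):
    # Sum of squared errors for the line y = m*x + b over a list of (x, y) points.
    total = 0.0
    for x, y in points:
        guess = m * x + b          # the model's prediction for this x
        total += (y - guess) ** 2  # squared difference from the known y
    return total

points = [(0, 1), (1, 3), (2, 5)]  # these lie exactly on y = 2x + 1
print(cost(2, 1, points))          # 0.0, a perfect fit
print(cost(1, 0, points))          # 14.0, a worse line, higher cost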
That form, something equals something squared, is not that different from me saying, just for a moment, y = x². So if I were to take a Cartesian coordinate system and graph y = x², it would look something like this. I'm drawing in purple now because I've stepped away from the notation and syntax of this particular scenario, and I'm just talking about a function in general: y = x², which you could also write as f(x) = x². But I'm graphing y = x². So what does it mean to minimize this function? I said I want to minimize the loss. I want the smallest error. I want whatever line has the smallest error. Well, what it means to minimize a function is to actually find the x value that produces the lowest y. This is like the easiest thing in the world that we could ever possibly do right now. You don't need any calculus, fancy math, or anything to minimize this function. There it is, at the bottom. It's the lowest point: zero. There it is! I can see it. It's quite obvious. So this is the thing. Eventually, in the machine learning systems that I'm going to get further into, neural-network-based systems with many dimensions of data, there might be some much harder to describe, crazy function that we're trying to approximate. Of course, we could eyeball this one, but it's much harder to mathematically compute exactly where the minimum of that is, especially if you imagine this not as a single curve but as a bowl, and then what happens when you get to three dimensions and four dimensions and five dimensions: things get kind of wonky. But if we know the formula for the function, there is another way that you can find that minimum. Minima? Minimum? (Minimum is the singular; minima is the plural.) And that is what I keep talking about: gradient descent. So let's think about what gradient descent means. Let's say we're looking at this point here. I'm going to walk along this function, and I'm like, I'm right here. Hello, I'm looking for the minimum. Is it over there? Is it over there? Could you help me, please? Can I use my GPS Google Maps thing to find the minimum? How would I find the minimum? Well, if I'm right here, I've got two options: I could go this way, or I could go this way. And if I knew which direction to go, I could also say, oh, I should take a big step, or I should take a little step. There are all sorts of options. So I need to know which way to go and how big of a step to take. And there's a way to figure that out, and it's known as the derivative. The derivative is a term that comes from calculus, and I would refer you to 3Blue1Brown's calculus series, where you can get a bit more background on what the meaning of a derivative is, how it works, and how you can think about these concepts from calculus. But for us right now, we can think of it as just the slope of the graph at this particular point. And a way to describe that is as a tangent line to the graph. So if I'm able to compute this line, then I could say, ah, well, if I go this direction, it's going up, and I'm moving away from the minimum. If I go this direction, I'm going down, and I'm moving towards the minimum. So I want to go down. And you can see, over here the slope is less extreme, so if I'm right here, maybe I don't need to go very far anymore.
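That's the whole search strategy: check the slope, step downhill, repeat. As a minimal sketch of that idea (my own illustration, not code from the video), here's gradient descent on y = x², whose slope at any point is 2x. The learning rate of 0.1, the small multiplier on each step, and the starting point of 5.0 are values I picked arbitrarily.

def descend_parabola(x, learning_rate=0.1, steps=50):
    # Step downhill on f(x) = x**2 using its derivative (slope), 2*x.
    for _ in range(steps):
        slope = 2 * x               # positive slope means uphill is to the right
        x -= learning_rate * slope  # so step the other way, scaled by the rate
    return x

print(descend_parabola(5.0))  # approaches 0.0, the minimum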
If I'm further up the curve, though, that slope is much steeper and points much more this way: oh, I should take a bigger step down. So this idea of being able to compute this slope, this derivative of this function, tells me how to search and find the bottom. OK. So this is the landscape of the puzzle we're trying to solve, and the pieces of that puzzle. But what's the actual part of the code that I'm trying to give you more background on? It's right over here. This is the gradient descent algorithm that I programmed in the previous video, where we looked at every data point, we made a guess, we got the error, the difference between the known output and the guess, and then we adjusted the m and b values. So the idea here is that as we're training the system (I don't know which color I'm using right now), I want to say m = m + delta m, some change in m, and b = b + delta b, some change in b. I want to know: what is a way that I can change the value of m in y = mx + b in order to make the error less? The next step is to find the minimum cost. I want to find the m and b values with the lowest error. To do that, we've established that gradient descent says: if I can find the derivative of a function, I know which way to move to minimize it. So somehow I need to find the derivative of this function to know which way to move. OK, in order to do that, though, I'm going to have to rewrite this function in a different way. A couple of things. One is, I think I made a mistake earlier: this should actually be guess minus y. It sort of doesn't matter, because we were squaring it, so the sign doesn't matter; but I think this is important for later. Technically, the error is guess minus y, not y minus guess. OK, so I'm going to call the cost function J. No, I'm sorry, not the error function, because I'm about to call something else the error: the loss function, the cost function, is J, and J is a function of m and b. Then what I'm going to do is simplify this guess minus y and call it the error. I'm also going to take out the summation. The summation is kind of important, but this has to do with that stochastic versus batch gradient descent that I talked about in the previous video: do I want to get the error over everything, or do I want to look at each error one at a time? So let's simplify things and say we're looking at each error one at a time. So I'm going to now say this equals error squared. I have essentially rewritten this function and simplified it: the cost J is equal to this error, the guess minus y, squared. So what I want to do is find the derivative of J relative to m. I want to know how to minimize J. How does J change when m changes? That's dJ/dm. OK, so again, I recommend that you go and check out some of the 3Blue1Brown calculus videos, which will help give you more background here. But what I'm actually going to need to do here is use two rules from calculus. I'm looking for another pen color for no reason. I need to use the power rule, that is one rule, and I need to use the chain rule. Let me establish what the power rule is really quickly.
If I have a function like f(x) = x to the n, the power rule says that the derivative is n times x to the (n - 1). So that's the power rule. I'm going to now apply that here, and I'm going to say (I don't know why I'm in purple now, but I am): 2 times error to the first power. So the power rule gives 2 times error. OK, but I also need the chain rule. I'm not done. Why do I need the chain rule? Well, the chain rule is a rule. I'm going to erase this over here and use another marker, because somehow if I just use multiple colored markers, all of this will make sense. The chain rule states, whew, OK. Let's say I have a function like, can you see this orange?, y = x². And I have a function like x = z². So y depends on x, and x depends on z. What the chain rule says is that if I want the derivative of y relative to z, I can take the derivative of y relative to x, which is 2x, and multiply that by the derivative of x relative to z, which is 2z. I can chain derivatives: I can get the derivative of one thing relative to something, times the derivative of that something relative to something else. And that's actually, weirdly, what's going on here. It may not be immediately apparent to you. J is a function of error, and error is a function of m and b, because I'm computing the error as the guess, mx + b, minus a known y. So here, I can take this derivative, 2 times error, and multiply that by the derivative of the error function itself relative to m, because I'm trying to get delta m. I could also do it relative to b when I want delta b. And this has to do with a partial derivative. Oh, see, there are so many concepts baked into this. Again, I'm sitting here being like, this is all just a bad idea. OK, but what is this? This is actually quite simple to work out, and I'm going to do that for you right now. I'm going to get the black marker. I want the derivative of error relative to m. OK, well, if I unpack this function: guess is mx + b, so error = mx + b - y. When I say partial derivative, the derivative relative to m, what I mean is that everything else is a constant: x is a constant, b is a constant, y is a constant. I mean, x and y are actually already constants, because x is the input data and y is the known output result. So really, I should write this as x times m, plus b, minus y. The derivative of this, by the power rule, is 1 times x times m to the 0 power, which means x. And the derivative of a constant is 0, because a constant doesn't change, right? The derivative describes how something changes, and a constant doesn't, so the derivative of b and of y is 0. So guess what? It's just x, meaning this whole thing turns out to be 2 times the error times x. And guess what? This 2, the whole point is, if you watched the previous video, we're going to take this and multiply it by something called a learning rate. Because we know the direction to go; this is giving us the direction to go to minimize that error, minimize that cost. But do I want to take a big step or a little step? Well, if I'm going to multiply it by a learning rate anyway, it's sort of like this 2 has no point, because I could have a learning rate that's twice as big or half as big. So ultimately, this is all it is: error times x.
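If you'd rather not take the calculus on faith, here's a small sketch that checks the result numerically: it compares the analytic derivative, 2 times error times x, against the slope estimated by nudging m a tiny amount. The sample values for m, b, x, and y are arbitrary, made up just for this check.

def J(m, b, x, y):
    # The single-point cost from above: (guess - y) squared.
    return (m * x + b - y) ** 2

m, b, x, y = 0.5, 0.1, 3.0, 2.0  # arbitrary sample values
error = m * x + b - y            # guess minus the known y

analytic = 2 * error * x         # the power rule + chain rule result
h = 1e-6                         # tiny nudge for the numerical slope
numerical = (J(m + h, b, x, y) - J(m - h, b, x, y)) / (2 * h)

print(analytic, numerical)       # both print about -2.4; they agree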
All of this math and craziness with the power rule and the chain rule and partial derivative this, it all boils down to this: error times x. That's what should go here in delta m. Guess what? Let's go back over to our code. And we can see, there it is: error times x. Error times x. There we go. That's it. That's why it says error times x. That was a lot! But that's why it says it! Even though that was not the best explanation and there are lots of confusing bits and pieces, I feel very happy to have arrived there. This was useful for me. Just making this video makes me feel like something happened today. OK, so, a couple of things I want to mention here. A way that I can make this chain rule thing a little bit clearer (thank you to Kay Wiegmann in the Slack channel) is this: what I'm looking for is the derivative of the cost function relative to m. What happens when I change the m value? What does that do to the cost? And the chain rule says that I can take the derivative of that function relative to the error, and multiply it by the derivative of the error relative to m. Right? So this is actually the chain rule: dJ/dm = (dJ/derror) times (derror/dm). And that's what's going on here: 2 times error, times this, and that's where I'm getting all of this. So this is one way of looking at it. And you can see, like, oh yeah, it's kind of like the numerator and denominator cancel each other out, so that makes sense. The other thing is: if I did this whole thing again, but took the derivative down here relative to b, b instead of m, what do I get? Well, this is now a constant, so it becomes 0. This is a constant, it becomes 0. And what is this? I apply the power rule, so I take 1 times b to the 0, and I just get 1. So this becomes error times 1 rather than error times x. Boy, look at this mess that I wrote here. Can we please end this video with this at least written neatly, with very nice handwriting? So when it's relative to m, it's 2 times error times x. But when it's relative to b, it's 2 times error times 1. And again, we can get rid of the 2. So it's really just error times x, or error times 1. And then if I come back over here again, there you go: error times x. m changes by error times x; b changes by just the error. So that hopefully gives you some more background as to why these formulas exist this way. And as I go forward, what comes after this is session four, where I'm going to build a neural network model for learning. You're going to see this formula over and over again. Change the weight: instead of saying m and b, I'm going to say the weight. The weight changes based on the input multiplied by the error. And then there are going to be a lot of other pieces. But this formula is going to be everywhere. So I hope this was helpful; it was another attempt. Again, there are a lot of things I've glossed over here in terms of the background: what really is a derivative, why does calculus exist, why does the chain rule work the way it works, why does the power rule work the way it works, and what about that partial derivative, huh? Did you say something about partial derivative?
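Before I sign off, here's a minimal sketch of the whole stochastic gradient descent loop with both update rules in place, as I've described it here; the learning rate and the data points are placeholder values of my own. Note the minus signs: because error is guess minus y, we step against the derivative to go downhill (with the earlier definition, error = y minus guess, the signs would flip to plus).

learning_rate = 0.05
m, b = 0.0, 0.0                    # start with an arbitrary line
points = [(0, 1), (1, 3), (2, 5)]  # made-up data lying on y = 2x + 1

for epoch in range(1000):
    for x, y in points:
        guess = m * x + b
        error = guess - y                # guess minus y, as corrected above
        m -= learning_rate * error * x   # delta m is proportional to error * x
        b -= learning_rate * error       # delta b is proportional to error * 1

print(m, b)  # converges towards m = 2, b = 1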
And so again, take a look at this video's description; I'm going to point you towards resources and tutorials that dive into each of those components a bit more deeply. But hopefully this gives you some semblance of the overall picture, okay? Thanks for watching, and, I don't know, maybe you want to hit like or subscribe, but honestly, I totally, totally understand if you don't. If you give it a thumbs down, I get it, I get it. Okay, I'll see you in a future video, maybe. Okay, goodbye.