Transcript for:
Linear Regression, Gradient Descent, and Normal Equations

Morning and welcome back. What we'll see today in class is the first in-depth discussion of a learning algorithm: linear regression. Over the next hour and a bit you'll see linear regression; batch and stochastic gradient descent, which are algorithms for fitting linear regression models; and then the normal equations, which are a very efficient way to fit linear models. We're also going to define notation and a few concepts today that will lay the foundation for a lot of the work we'll see the rest of this quarter. To motivate linear regression — it's maybe the simplest, or one of the simplest, learning algorithms — remember the ALVINN video, the autonomous driving video I showed in class on Monday, the self-driving car video. That was a supervised learning problem. The term supervised learning meant that you were given inputs X, which were pictures of what's in front of the car, and the algorithm had to map that to an output Y, which was the steering direction. And that was a regression problem, because the output Y that you want is a continuous value, as opposed to a classification problem, where Y takes on a discrete value. We'll talk about classification next Monday; today is supervised learning regression. So I think the simplest possible learning algorithm for a supervised learning regression problem is linear regression. And to motivate that, rather than using a self-driving car example, which is quite complicated, we'll build up a supervised learning algorithm using a simpler example. Let's say you want to predict or estimate the prices of houses. The way you build a learning algorithm is to start by collecting a dataset of houses and their prices. This is a dataset we collected off Craigslist a little while back; it's data from Portland, Oregon. There's the size of each house in square feet, and there's the price of each house in thousands of dollars. So there's a house that is 2,104 square feet whose asking price was $400,000, a house with that size and that price, and so on. And more conventionally, if you plot this data, there's size on one axis and price on the other, so you have some dataset like that. What we'll end up doing today is fit a straight line to this data, and we'll go through how to do that. So in supervised learning, the process is that you have a training set, such as the dataset I drew on the left, and you feed it to a learning algorithm. The job of the learning algorithm is to output a function to make predictions about housing prices. By convention, I'm going to call the function it outputs a hypothesis. The job of the hypothesis is that it can take as input the size of a new house — a house you haven't seen yet — and will output the estimated price. So the job of the learning algorithm is to input a training set and output a hypothesis; the job of the hypothesis is to take as input the size of any house and try to tell you what it thinks the price of that house should be. Now, when designing a learning algorithm — and this is true even for something as familiar as linear regression —
you may have seen it in a linear algebra class or some other class before — the way you go about structuring a machine learning algorithm is important. The design choices — what is the workflow, what is the dataset, what is the hypothesis, how do you represent the hypothesis — these are the key decisions you have to make in pretty much every supervised learning, every machine learning algorithm's design. So as we go through linear regression, I'll try to describe the concepts clearly, because they lay the foundation for the rest of the algorithms; the algorithms you'll see later this quarter are sometimes much more complicated. So when designing a learning algorithm, the first thing we need to ask is: how do you represent the hypothesis h? In linear regression, for the purpose of this lecture, we're going to say that the hypothesis is h(x) = Theta_0 + Theta_1 * x: it takes the input size x and outputs a number as a linear function of the size x. Now, the mathematicians in the room will say that technically this is an affine function — it would be a linear function only if there were no Theta_0 — but in machine learning we sometimes just call this a linear function. It doesn't really matter. More generally, in this example we have just one input feature x. If you have multiple input features — if you have more information about these houses, such as the number of bedrooms (excuse my handwriting; that word is "bedrooms") — then... Yeah, my father-in-law lives a little bit outside Portland, and he's actually really into real estate, so this is a real dataset from Portland. [LAUGHTER] So more generally, if you know the size as well as the number of bedrooms of these houses, then you have two input features, where x_1 is the size and x_2 is the number of bedrooms (I'm using the pound sign, #bedrooms, to denote the number of bedrooms), and you might estimate the price of a house as h(x) = Theta_0 + Theta_1 * x_1 + Theta_2 * x_2, where x_1 is the size of the house and x_2 is the number of bedrooms. Okay? Now, in order to make that notation a little more compact, I'm also going to introduce this other notation, where we write the hypothesis as h(x) = sum from j = 0 to 2 of Theta_j * x_j, where for conciseness we define x_0 to be equal to 1. If we define x_0 to be a dummy feature that always takes on the value 1, then you can write the hypothesis h(x) this way, and it's the same as the equation you saw to the upper right. And so here Theta becomes a three-dimensional parameter vector — Theta_0, Theta_1, Theta_2, with the index starting from 0 — and the features become a three-dimensional feature vector x_0, x_1, x_2, where x_0 is always 1, x_1 is the size of the house, and x_2 is the number of bedrooms of the house. Okay? So, to introduce a bit more terminology.
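(To make the vectorized hypothesis concrete, here is a minimal sketch in Python/NumPy. This is an illustration, not code shown in the lecture, and the parameter and feature values are made up.)

```python
import numpy as np

# Hypothesis h_theta(x) = sum_j theta_j * x_j, with the dummy feature x_0 = 1.
def hypothesis(theta, features):
    x = np.concatenate(([1.0], features))  # prepend x_0 = 1
    return theta @ x                        # theta^T x

# Made-up numbers: theta = [theta_0, theta_1, theta_2],
# features = [size in square feet, number of bedrooms].
theta = np.array([50.0, 0.15, 20.0])
print(hypothesis(theta, np.array([2104.0, 3.0])))  # predicted price in $1000s
```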
Theta is called the parameters of the learning algorithm, and the job of the learning algorithm is to choose parameters Theta that allow you to make good predictions about the prices of houses. And just to lay out some more notation we'll use throughout this quarter: we'll use m to denote the number of training examples, so m is the number of rows in the table above, where each house in your training set is one training example. You've already seen me use x to denote the inputs, and often the inputs are called features. You know, as an emerging discipline grows up, notation kind of emerges depending on what different scientists used the first time they wrote a paper. Frankly, I don't think "hypothesis" is a great name, but someone many decades ago wrote a few papers calling it a hypothesis, others followed, and we're kind of stuck with some of this terminology. So x is called the input features, or sometimes the input attributes, and y is the output, sometimes called the target variable. A pair (x, y) is one training example, and I'm going to use the notation (x^(i), y^(i)), with i in parentheses, to denote the i-th training example. The superscript (i) is not exponentiation; it's just a way of writing an index into the table of training examples above. So, for example, if the first training example is the house of size 2,104 square feet, then x_1^(1) would be equal to 2104, because that is the first feature — the size — of the first house in the training set. And x_1^(2), the first feature of the second example, would be 1416 in our table above. So the superscript in parentheses is just an index into the different training examples, where i runs from 1 through m, the number of training examples you have. And one last bit of notation: I'm going to use n to denote the number of features you have for the supervised learning problem. So in this example n equals 2, because we have two features: the size of the house and the number of bedrooms. Which is why you can write the hypothesis as the sum from j = 0 to n of Theta_j * x_j. And so here x and Theta are (n+1)-dimensional, because we added the extra x_0 and Theta_0. So with two features these are three-dimensional vectors, and more generally, with n features, x and Theta end up being (n+1)-dimensional vectors. All right. You'll see this notation multiple times, in multiple algorithms, throughout this quarter, so if you don't manage to memorize all these symbols right now, don't worry about it; you'll see them over and over and they'll become familiar. All right. So, given the dataset, and given that this is how you define the hypothesis, how do you choose the parameters? The learning algorithm's job is to choose values for the parameters Theta so that it can output a hypothesis. So how do you choose parameters Theta?
Well, what we'll do is choose Theta such that h(x) is close to y for the training examples. One final bit of notation first: I've been writing h(x) as a function of the features of the house — the size and number of bedrooms. Sometimes, to emphasize that h depends both on the parameters Theta and on the input features x, we write h_Theta(x). But sometimes, for notational convenience, I'll just write h(x); whether or not I include the Theta, they mean the same thing — it's just an abbreviation. Okay. So in order to learn a set of parameters, what we'll want to do is choose parameters Theta so that, at least for the houses whose prices you know, the learning algorithm outputs prices that are close to what you know the correct asking prices were for those houses. More formally, in the linear regression algorithm, also called ordinary least squares, we will want to minimize — and I'm going to build out this equation one piece at a time — the squared difference between what the hypothesis outputs and the true price: (h_Theta(x) - y)^2. So we want to choose values of Theta that minimize the squared difference between the prediction h(x) and y, the correct price. To fill this out: you have m training examples, so I'm going to sum from i = 1 through m of that squared difference — a sum over all, say, 50 examples you have — of the squared difference between what your algorithm predicts and the true price of the house. And finally, by convention, we put a one-half constant in front, because when we take derivatives to minimize this later, the one-half makes some of the math a little simpler. Minimizing that formula gives you the same answer as minimizing one-half of it, but we often put the one-half there to simplify the math later. So in linear regression we define the cost function J(Theta) = (1/2) * sum from i = 1 to m of (h_Theta(x^(i)) - y^(i))^2, and we'll find the parameters Theta that minimize the cost function J(Theta). Now, a question I often get is: why squared error? Why not absolute error, or the error to the power of 4? We'll talk more about that when we talk about the generalization of linear regression. When we talk about generalized linear models next week, you'll see that linear regression is a special case of a bigger family of algorithms called generalized linear models, and that using squared error corresponds to a Gaussian assumption. So we'll justify a bit more why squared error, rather than absolute error or error to the fourth power, next week. Let me just check and see if there are any questions at this point. No? Okay, cool. All right. So next, let's see how you can implement an algorithm to find the value of Theta that minimizes the cost function J(Theta).
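(As a rough sketch of what that cost function looks like in code — again an illustration, not something from the lecture; here X already includes the leading column of ones for x_0, and the example numbers are made up.)

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2.
    X is m x (n+1) with a leading column of ones; y has length m."""
    errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every example
    return 0.5 * np.sum(errors ** 2)

# Tiny made-up example: sizes in square feet, prices in $1000s.
X = np.array([[1.0, 2104.0], [1.0, 1416.0], [1.0, 1534.0]])
y = np.array([400.0, 232.0, 315.0])
print(cost(np.zeros(2), X, y))      # cost of the all-zeros hypothesis
```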
We're going to use an algorithm called gradient descent. And — this is my first time teaching in this classroom, so I'm still figuring out the logistics; let's get rid of the chair. Cool, all right. With gradient descent, we're going to start with some value of Theta — Theta equal to the vector of all zeros would be a reasonable default, or we can initialize it randomly; it doesn't really matter. Remember Theta is this three-dimensional vector, and I'm writing 0 with an arrow on top to denote the vector of all zeros: 0, 0, 0 in every entry. So we start with some initial value of Theta, and we're going to keep changing Theta to reduce J(Theta), okay? Let me show you a visualization of gradient descent, and then we'll write out the math. Say you want to minimize some function J(Theta). It's important to get the axes right in this diagram: the horizontal axes are Theta_0 and Theta_1, and the height of the surface is J(Theta). What you want to do is find values for Theta_0 and Theta_1 — in our example it's actually Theta_0, Theta_1, Theta_2, because Theta is three-dimensional, but I can't plot that, so I'm just using Theta_0 and Theta_1 — that minimize the height of the surface J(Theta). So maybe this point down here looks like a pretty good point. In gradient descent, you start off at some point on this surface — you do that by initializing Theta_0 and Theta_1 either randomly or to all zeros; it doesn't matter too much — and you imagine you're standing on this hill, standing at that little cross. What you do in gradient descent is turn around all 360 degrees, look all around you, and ask: if I were to take a tiny little baby step, in what direction should I step to go downhill as fast as possible? Because you're trying to get to the lowest possible elevation, the lowest possible value of J(Theta). So gradient descent will stand at that point, look all around, pick the direction that takes it downhill as quickly as possible, and take that little baby step. And then repeat: now you're a little lower on the surface, so again you look all around, find the steepest direction downhill, take another little step, and another, and so on, until hopefully you get to a local optimum. Now, one property of gradient descent is that, depending on where you initialize the parameters, you can end up at different points. So previously you had started at that lower point x.
But imagine you had started just a few steps over to the right, at that new x instead of the one on the left. If you had run gradient descent from that new point, then that would have been the first step, that the second step, and so on, and you would have gotten to a different local optimum — a different local minimum, okay? It turns out that when you run gradient descent on linear regression there are no local optima, but we'll talk about that in a little bit. So let's formalize the gradient descent algorithm. Remember, in this example the training set is fixed: you've collected the dataset of housing prices from Portland, Oregon, and it's just stored in your computer's memory. So the dataset is fixed, the cost function J is a fixed function of the parameters Theta, and the only thing you're going to do is tweak or modify the parameters Theta. One step of gradient descent can be implemented as follows: Theta_j := Theta_j - Alpha * (partial derivative of J(Theta) with respect to Theta_j). A bit more notation: I'm going to use ":=" to denote assignment, meaning we take the value on the right and assign it to the variable on the left. So in the notation we'll use this quarter, "a := a + 1" means increment the value of a by 1, whereas if I write "a = b" I'm asserting a statement of fact: that the value of a equals the value of b. And hopefully I won't ever write "a = a + 1", because that is rarely true, okay? So in each step of gradient descent, for each value of j — for j = 0, 1, up to n, where n is the number of features — you take Theta_j and update it to Theta_j minus Alpha, which is called the learning rate, times this formula: the partial derivative of the cost function J(Theta) with respect to the parameter Theta_j. For those of you who haven't seen calculus, or some of the other prerequisites, for a while, we'll go over this in a bit more detail in the discussion section, but I'll do it quickly now. If you took a calculus class a while back, you may remember that the derivative of a function tells you the direction in which the function changes fastest; stepping opposite the gradient is what lets you go downhill as steeply as possible on a hill like that. There's a question: how do you determine the learning rate? Good question — let me get back to that. For now: there's a theory and there's a practice. In practice, you set it to 0.01. [LAUGHTER] Let me say a bit more about that later. If you scale all the features to be between, say, 0 and 1, or minus 1 and plus 1, or something like that, then —
you can try a few values and see which lets you minimize the function best. But if the features are scaled to roughly plus or minus 1, I usually start with 0.01 and then try increasing and decreasing it. I'll say a little more about that later. All right, cool. So let me quickly show how the derivative calculation is done. I'm going to do a few more equations in this lecture, but all of these definitions and derivations are written out in full detail in the lecture notes posted on the course website. Sometimes I'll do more math in class when we want you to see the steps of a derivation, and sometimes, to save time in class, we'll gloss over the mathematical details and leave you to read the full details in the lecture notes on the CS229 course website. So: the partial derivative with respect to Theta_j of J(Theta) is the partial derivative with respect to Theta_j of (1/2) * (h_Theta(x) - y)^2. I'm going to do a slightly simpler version assuming we have just one training example; the actual definition of J(Theta) has a sum over i from 1 to m over all the training examples, so I'm just forgetting that sum for now and pretending you have only one training example. From calculus, when you take the derivative of a square, the 2 comes down and cancels with the one-half: you get 2 times one-half times the thing inside, and then, by the chain rule, times the partial derivative with respect to Theta_j of (h_Theta(x) - y). So if you take the derivative of a square, the 2 comes down, and then you take the derivative of what's inside and multiply by it. The 2 and the one-half cancel, so this leaves you with (h_Theta(x) - y) times the partial derivative with respect to Theta_j of (Theta_0 * x_0 + Theta_1 * x_1 + ... + Theta_n * x_n - y), where I've just taken the definition of h_Theta(x) and expanded it out into that sum, because h_Theta(x) is just equal to that. Now if you look at the partial derivative of each of these terms with respect to Theta_j, the partial derivative of every one of them is going to be 0 except for the term corresponding to j. For instance, if j were equal to 1, the Theta_0 * x_0 term doesn't depend on Theta_1, and neither do any of the other terms; the only term that depends on Theta_1 is Theta_1 * x_1, and its partial derivative with respect to Theta_1 is just x_1. So when you take the partial derivative of this big sum with respect to Theta_j in general, the only term that depends on Theta_j is Theta_j * x_j; the partial derivatives of all the other terms are 0, and the partial derivative of that term with respect to Theta_j is x_j. And so the whole thing ends up being (h_Theta(x) - y) * x_j, okay? And again, if you haven't played with calculus for a while — if you don't quite remember what a partial derivative is, or don't quite get what we just did — don't worry too much about it.
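(If you'd like to sanity-check that (h_Theta(x) - y) * x_j really is the partial derivative, here's a quick numerical check you can run yourself — my own illustration in NumPy, not part of the lecture, comparing the formula against a finite-difference approximation on one made-up training example.)

```python
import numpy as np

def single_example_cost(theta, x, y):
    # J(theta) = 1/2 * (h_theta(x) - y)^2 for one training example
    return 0.5 * (theta @ x - y) ** 2

x = np.array([1.0, 2104.0, 3.0])   # [x_0 = 1, size, #bedrooms], made-up example
y = 400.0
theta = np.array([0.1, 0.2, 0.3])
eps = 1e-6

analytic = (theta @ x - y) * x     # (h_theta(x) - y) * x_j for every j
for j in range(len(theta)):
    theta_plus, theta_minus = theta.copy(), theta.copy()
    theta_plus[j] += eps
    theta_minus[j] -= eps
    numeric = (single_example_cost(theta_plus, x, y)
               - single_example_cost(theta_minus, x, y)) / (2 * eps)
    print(j, analytic[j], numeric)  # the two columns should match closely
```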
We'll go over a bit more of this in section, and also read through the lecture notes, which go over it in more detail and more slowly than we might in class, okay? So we've just calculated that this partial derivative equals (h_Theta(x) - y) * x_j, and plugging it back into that formula, one step of gradient descent is the following: Theta_j is updated to Theta_j - Alpha * (h_Theta(x) - y) * x_j. Now I'm going to add a few more things to this equation. I did this with one training example — I used a definition of the cost function J(Theta) with just a single training example — but you actually have m training examples. The derivative of a sum is the sum of the derivatives, so if you redo this little derivation with the correct definition of J(Theta), which sums over all m training examples, you end up with: Theta_j := Theta_j - Alpha * sum from i = 1 to m of (h_Theta(x^(i)) - y^(i)) * x_j^(i), where remember x^(i) is the i-th training example's input features and y^(i) is the target label — the price — in the i-th training example. That's the correct formula for the partial derivative of the cost function J(Theta) with respect to Theta_j when J is defined using all of the training examples, okay? And so the gradient descent algorithm is: repeat until convergence, carrying out this update, and in each iteration of gradient descent you do this update for j = 0, 1, up to n, where n is the number of features — n was 2 in our example. Okay. And if you do this, then — well, let me show you the animation — hopefully you find a pretty good value of the parameters Theta. It turns out that when you plot the cost function J(Theta) for a linear regression model, unlike the earlier diagram I showed, which had local optima, if J(Theta) is defined the way we just defined it for linear regression — as a sum of squared terms — then J(Theta) turns out to be a quadratic function, and it will always look like a big bowl. So J(Theta) has no local optima, or rather, the only local optimum is also the global optimum. Another way to look at a function like this is to look at the contours of the plot: you get the contours by taking horizontal slices of the big bowl and plotting where the edges of each slice fall. The contours of a big bowl — or, formally, of this quadratic function — will be ellipses, ovals like these. And so if you run gradient descent on this cost function, let's say I initialize my parameters at that little x shown over here.
Usually you'd initialize Theta to the vector of all zeros, but it doesn't matter too much, so let's initialize it over there. Then with one step of gradient descent the algorithm takes that step downhill, and with a second step it takes that step downhill. By the way, fun fact: if you think about the contours of the function, it turns out that the direction of steepest descent is always at 90 degrees — always orthogonal — to the contour lines. I seem to remember that from high school; I think it's true. And so as you take steps downhill, because there's only one global minimum, this algorithm will eventually converge to the global minimum. Okay. Now, the question from just now about the choice of the learning rate Alpha: if you set Alpha too large, you can overshoot — the steps you take can be too large and you can run past the minimum. If you set it too small, you need a lot of iterations and the algorithm is slow. So what happens in practice is you usually try a few values and see which value of the learning rate lets you drive down the value of J(Theta) most efficiently. And if you see J(Theta) increasing rather than decreasing — the cost function going up rather than down — that's a very strong sign that the learning rate is too large. What I often do is try out multiple values of the learning rate Alpha, usually on an exponential scale: 0.01, 0.02, 0.04, 0.08 — a doubling, or maybe tripling, scale — and see which value drives down the cost function fastest. Okay. Now let me visualize this one other way, with the data. This is the actual dataset — there are 49 points in it, so m, the number of training examples, is 49. If you initialize the parameters to 0, that means initializing your hypothesis — your straight-line fit to the data — to be the horizontal line: if Theta_0 = 0 and Theta_1 = 0, then for any input size of house the estimated price is 0, so your hypothesis starts off as a horizontal line where, whatever the input x, the output y is 0. And what you're doing as you run gradient descent is changing the parameters Theta: the parameters went from this value, to this value, to this value, and so on. So the other way of visualizing gradient descent is: if it starts off with this hypothesis, then with each iteration of gradient descent you're finding different values of the parameters Theta that let this straight line fit the data better. After one iteration of gradient descent, this is the new hypothesis — you now have different values of Theta_0 and Theta_1 that fit the data a little better. After two iterations, you end up with that hypothesis, and with each iteration gradient descent is trying to minimize J(Theta) — trying to minimize one-half of the sum of squared errors of the hypothesis's predictions on the different examples.
With three iterations of gradient descent, four iterations, and so on — and then a bunch more iterations — eventually it converges to that hypothesis, which is a pretty decent straight-line fit to the data. Okay. Is there a question? Yeah, go for it. [inaudible] Sure — just to repeat the question: why are you subtracting Alpha times the gradient rather than adding Alpha times the gradient? Let me suggest you work through an example — actually, let me raise the screen. It turns out that if you add a multiple of the gradient, you'll be going uphill rather than downhill. One way to see it: take a quadratic function, with Theta on the horizontal axis and J on the vertical. If you're at a point where the gradient is positive and you want to reduce J, you want Theta to decrease; since the gradient is positive and you want Theta to decrease, you subtract a multiple of the gradient. I think the best way to see it is to work through an example yourself: set J(Theta) = Theta^2 and Theta = 1. For that quadratic function the derivative is 2 * Theta = 2, which is positive, so you want to subtract that value from Theta rather than add it. Okay? Cool. All right, great. So you've now seen your first learning algorithm, and gradient descent with linear regression is definitely still one of the most widely used learning algorithms in the world today; if you implemented this today, you could use it for some pretty decent purposes. Now, I want to give this algorithm one other name. Our gradient descent algorithm here calculates the derivative by summing over your entire training set of m examples, and so this version of gradient descent has another name: batch gradient descent. The term "batch" — and again, in machine learning we just make up names, and sometimes the names aren't great — refers to the fact that you look at the entire training set, all 49 examples in the example I just had on the slide; you think of all 49 examples as one batch of data and process the whole batch at once, hence the name batch gradient descent. The disadvantage of batch gradient descent is that if you have a giant dataset — and in the era of big data we're moving to larger and larger datasets; we now train machine learning models on hundreds of millions of examples — say you download the United States census database, which is a very large dataset, and you want to predict housing prices from all across the United States, that might be a dataset with many millions of examples. And the disadvantage of batch gradient descent is that in order to make one update to your parameters — to take even a single step of gradient descent — you need to calculate this sum. If m is a million, or 10 million, or 100 million, you need to scan through your entire dataset and calculate this for, you know, 100 million examples and sum it up.
And so every single step of gradient descent becomes very slow, because you're scanning over — reading over — like 100 million training examples before you can make even one tiny little step of gradient descent. And by the way, I feel like in today's era of big data people start to lose their intuitions about what a big dataset is. Even by today's standards, a hundred million examples is still very big; I only rarely use a hundred million examples. Maybe in a few years we'll look back on a hundred million examples and say that was really small, but at least today it isn't. So the main disadvantage of batch gradient descent is that every single step requires you to read through your entire dataset — maybe tens or hundreds of terabytes of data — before you can update the parameters even once. And if gradient descent needs hundreds of iterations to converge, you'll be scanning through your entire dataset hundreds of times — and sometimes we train our algorithms for thousands or tens of thousands of iterations — so this gets expensive. So there's an alternative to batch gradient descent. Let me write out the algorithm here so we can talk about it. This algorithm is called stochastic gradient descent. Instead of scanning through all million examples before you update the parameters Theta even a little bit, in stochastic gradient descent the inner loop of the algorithm loops over i = 1 through m, taking a gradient descent step using the derivative of just one single example — just example i. So: repeat { for i = 1 to m { for every j, update Theta_j := Theta_j - Alpha * (h_Theta(x^(i)) - y^(i)) * x_j^(i) } }, where the derivative is now taken with respect to just the one training example i. Okay, let me draw a picture of what this algorithm is doing. Say this is the contour plot, like the one you saw just now, where the axes are Theta_0 and Theta_1 and the contours show the height of the surface J(Theta). With stochastic gradient descent you initialize the parameters somewhere, and then you look at your first training example: hey, let's look at just one house and see if we can predict that house better, and you modify the parameters to improve the prediction of the price of that one house. And because you're fitting just that one house, maybe you improve the parameters a little bit, but not quite in the most direct downhill direction. Then you look at the second house — let's try to fit that house better — and update the parameters. Then the third house, the fourth house, and so on. So as you run stochastic gradient descent, it takes a slightly noisy, slightly random path, but on average it's headed toward the global minimum. And as you run stochastic gradient descent, it will actually never quite converge.
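(Here is a rough sketch of the two update rules side by side, so you can see what gets summed in each; my own NumPy illustration, under the same conventions as before — X has a leading column of ones — and not code shown in class. The default learning rate assumes features scaled to roughly plus or minus 1, as discussed above.)

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=100):
    """Each step uses the full sum over all m examples."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y)   # sum_i (h(x^(i)) - y^(i)) * x^(i)
        theta -= alpha * gradient
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=10):
    """Each step uses just one example i, so theta updates m times per pass."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_passes):
        for i in range(m):                  # loop over the examples one at a time
            error = X[i] @ theta - y[i]     # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]   # update every theta_j at once
    return theta
```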
With batch gradient descent, the parameters went to the global minimum and stopped; with stochastic gradient descent, even as you keep running it, the parameters will oscillate and never quite converge, because you're always running around looking at different houses and trying to do better on just that one house, and that one house, and that one house. But when you have a very large dataset, stochastic gradient descent allows your algorithm to make much faster progress, and so when you have very large datasets, stochastic gradient descent is used much more in practice than batch gradient descent. [BACKGROUND] Yeah — is it possible to start with stochastic gradient descent and then switch over to batch gradient descent? Yes, it is. Something that isn't covered in this class but is covered in CS230 is mini-batch gradient descent, where you use, say, a hundred examples at a time rather than one example at a time; that's another algorithm that's used quite often in practice. But in practice, when your dataset is large, we rarely ever switch to batch gradient descent, because batch gradient descent is just so slow. Thinking through concrete examples of problems I've worked on: for a lot of modern machine learning, where you have very large datasets — say you're building a speech recognition system and you have a terabyte of data — it's so expensive to scan through a terabyte of data, just reading it from disk, that you'd probably never even run one iteration of batch gradient descent. And there's one huge saving grace of stochastic gradient descent: say you run stochastic gradient descent and end up with some parameter near, but not exactly at, the global optimum, and that's the parameter you use for your machine learning system. It turns out that parameter is actually not that bad — you'll probably make perfectly fine predictions even if you don't quite get to the exact global minimum. So what you suggested is a fine thing to do, no harm trying it, although in practice we usually don't bother and just use stochastic gradient descent. The thing that actually is more common is to slowly decrease the learning rate: keep using stochastic gradient descent, but reduce the learning rate over time so it takes smaller and smaller steps. If you do that, the size of the oscillations decreases, so you end up bouncing around in a smaller region. Wherever you end up may not be exactly the global minimum, but at least it will be closer to it. So decreasing the learning rate is used much more often. Cool. Question? Yeah. [BACKGROUND] Oh sure — when do you stop with stochastic gradient descent? Plot J(Theta) over time. J(Theta) is the cost function you're trying to drive down, so monitor J(Theta) as it goes down over time, and if it looks like it has stopped going down, then you say, "okay, it looks like it has stopped going down," and you stop training.
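(A minimal sketch of those two practical points — decaying the learning rate and monitoring J(Theta) over time so you can decide when to stop — again my own illustration, not code from the lecture; the decay schedule shown is just one simple choice.)

```python
import numpy as np

def sgd_with_decay(X, y, alpha0=0.01, num_passes=50):
    """Stochastic gradient descent with a slowly decreasing learning rate."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    history = []                                # J(theta) recorded after each pass
    for t in range(num_passes):
        alpha = alpha0 / (1.0 + t)              # shrink the step size over time
        for i in range(m):
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
        history.append(0.5 * np.sum((X @ theta - y) ** 2))
    return theta, history                       # plot history; stop once it flattens out
```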
Although — one nice thing about linear regression is that it has no local optima, so you run into these convergence-debugging types of issues less often. When you're training highly non-linear things like neural networks, which we'll talk about later in CS229 as well, these issues become more acute. Cool. Okay, great. Yeah? [BACKGROUND] Oh — would your learning rate then be 1/m times what it is for batch gradient descent? Not really; it's usually much bigger than that, because if your learning rate were 1/m times what you'd use with batch gradient descent, stochastic gradient descent would end up being as slow as batch gradient descent. It's usually much bigger. Okay. So that's stochastic gradient descent, and I'll tell you what I do: if you have a relatively small dataset — say hundreds or maybe thousands of examples — where it's computationally cheap to run batch gradient descent, I would almost always just use batch gradient descent, because it's one less thing to fiddle with: one less thing to worry about, with the parameters oscillating. But if your dataset is so large that batch gradient descent becomes prohibitively slow, then almost everyone uses stochastic gradient descent, or something more like stochastic gradient descent, okay? All right. So gradient descent — both batch and stochastic — is an iterative algorithm, meaning you have to take multiple steps to hopefully get near the global optimum. It turns out there's another algorithm. For many of the other algorithms we'll talk about in this course, including generalized linear models and neural networks, you will have to use gradient descent, and so we'll see gradient descent again as we develop multiple different algorithms later this quarter. But it turns out that for the special case of linear regression — not the algorithm we'll talk about next Monday, not the one we'll talk about next Wednesday, but exactly linear regression — there's a way to solve for the optimal value of the parameters Theta in one step, jumping straight to the global optimum without an iterative algorithm. This is called the normal equation. It works only for linear regression; it doesn't work for any of the other algorithms we'll talk about later this quarter. But let me quickly show you the derivation. What I want to do is give you a flavor of how to derive the normal equation, and where you end up is a formula of the form "Theta equals some expression," so that you just set Theta equal to that and, in one step, with a few matrix multiplications, you land right at the global optimum — just like that, in one step. Okay. And if you've taken advanced linear algebra courses before, you may have seen this formula for linear regression.
What some linear algebra classes do is cover the board with pages and pages of matrix derivatives. What I want to do instead is describe a matrix derivative notation that lets you derive the normal equation in roughly four lines of linear algebra, rather than pages and pages of it. In the work I've done in machine learning, notation really matters: with the right notation you can solve some problems much more easily. So I want to define this matrix linear algebra notation — I won't do all the steps of the derivation, I just want to give you a sense of its flavor — and then ask you to work through a lot of the details yourself in the lecture notes, where we work everything out in more detail than I want to do in class. And in problem set one you'll get to practice using this notation yourself to derive some additional things. I've found this notation really convenient for deriving learning algorithms. Okay. So I'm going to use the following notation. J is a function mapping from the parameters to the real numbers. I'm going to define the derivative of J(Theta) with respect to Theta — where remember Theta is a vector, in R^3, or really R^(n+1): if you have two features for the house, so n = 2, then Theta is 3-dimensional, and in general it's (n+1)-dimensional. I'm going to define the derivative of J(Theta) with respect to Theta to be itself a 3-by-1 vector whose components are the partial derivatives of J with respect to each of the three elements: the first component is the partial derivative of J with respect to Theta_0, the second with respect to Theta_1, and the third with respect to Theta_2. More generally, in the notation we'll use, let me give you an example. Say A is a matrix — say A is a 2-by-2 matrix with entries A_11, A_12, A_21, A_22. Then you might have some function of the matrix A that returns a real number, so f maps from R^(2x2) to R. For example, if f(A) = A_11 + (A_12)^2, then f of the matrix with entries 5, 6, 7, 8 would be 5 plus 6 squared. As we do this derivation, we'll be working a little bit with functions that map from matrices to real numbers, and this is just one made-up example of a function that takes a matrix as input and maps its values to a real number. When you have a matrix function like this, I'm going to define the derivative of f(A) with respect to A to be itself a matrix, with the same dimensions as A, whose elements are the derivatives of f with respect to the individual elements of A. Actually, let me just write it out. Okay.
So if A is a 2-by-2 matrix, then the derivative of f(A) with respect to A is itself a 2-by-2 matrix, and you compute it just by looking at f, taking derivatives with respect to the different elements of A, and plugging them into the corresponding elements of this matrix. So in this particular example, the derivative of f(A) with respect to A would be the matrix with first row [1, 2*A_12] and second row [0, 0]. I got these four numbers by taking the definition of f, taking the derivative with respect to A_11 and plugging that in here, taking the derivative with respect to A_12 and plugging that in here, and taking the derivatives with respect to the remaining elements and plugging them in here — which were 0. Okay? So that's the definition of a matrix derivative. Yeah? [inaudible] Oh, yes — for a vector we're just using this same definition, treating the vector as an n-by-1, or rather (n+1)-by-1, matrix. In fact that definition and the definition I gave for the derivative of J with respect to Theta are consistent: if you apply the matrix definition to a column vector, treating it as an (n+1)-by-1 matrix, it specializes to what we described here. All right. So I'll leave the details to the lecture notes, because there are more lines of algebra that I won't do, but let me give you an overview of what the derivation of the normal equation looks like. Armed with this definition of the derivative with respect to a matrix, the broad outline of what we're going to do is: take J(Theta), the cost function, take its derivative with respect to Theta — Theta is a vector, so we take the derivative with respect to that vector — set it equal to 0, and then solve for the value of Theta that makes the derivative 0. After all, how do you minimize a function? The maxima and minima of a function are where the derivative equals 0. So the way you derive the normal equation is: J(Theta) maps from a vector to a real number; take its derivative with respect to Theta, set it to the zero vector, solve for Theta, and you end up with a formula for Theta that takes you immediately to the global minimum of the cost function J(Theta). And a lot of the buildup, a lot of this notation, is about what this derivative means and whether there's an easy way to compute the derivative of J(Theta). Okay? So — and I hope this helps when you look at the lecture notes — just a couple more definitions. If A is a square matrix, say an n-by-n matrix (number of rows equals number of columns), I'm going to define the trace of A to be the sum of its diagonal entries: tr A = sum over i of A_ii. This is pronounced "the trace of A"; you can also write it as the trace operator, the trace function applied to A, but by convention we often write tr A without the parentheses. So trace just means the sum of the diagonal entries, and here are some facts about the trace of a matrix.
The trace of A is equal to the trace of A transpose: when you transpose a matrix you flip it along the 45-degree diagonal, so the diagonal entries stay the same, and hence tr A = tr A^T. There are some other useful properties of the trace operator. Here's one that I won't prove, but that you could go home and prove yourself with a little bit of work: if you define f(A) = tr(AB), where B is some fixed matrix — so what f(A) does is multiply A and B and then take the sum of the diagonal entries — then it turns out that the derivative of f(A) with respect to A is equal to B transpose. For any matrix B, if f(A) is defined this way, the derivative is B^T. The trace operator has other interesting properties: tr(AB) = tr(BA) — you can prove this from first principles; it's a little bit of work if you expand out the definitions of A and B — and tr(ABC) = tr(CAB). That's the cyclic permutation property: if you multiply several matrices together, you can always take the one at the end and move it to the front, and the trace stays the same. And one more that is a little harder to prove: the derivative with respect to A of tr(A A^T C) equals CA + C^T A. Just as in ordinary calculus we know the derivative of x^2 is 2x — we all know that so well that we use it without re-deriving it every time — this is a little bit like that: think of the derivative with respect to A of tr(A A^T C) as analogous to d/da of (a^2 c) = 2ac, but in matrix notation. All right. So finally, what I'd like to do is take J(Theta) and express it in this matrix-vector notation, so we can take derivatives with respect to Theta, set the derivatives equal to 0, and just solve for the value of Theta. Let me write out the definition of J(Theta): J(Theta) = (1/2) * sum from i = 1 to m of (h_Theta(x^(i)) - y^(i))^2. It turns out that if you define a matrix, capital X, as follows — I'm going to take the training examples we have and stack them up in rows; we have m training examples, and since the x^(i)'s are column vectors, I take transposes, (x^(i))^T, to stack the training examples up as rows — let me call this the design matrix. Capital X is called the design matrix. And it turns out that if you define X this way, then X times Theta — where Theta is a column vector with entries Theta_0, Theta_1, Theta_2 —
works like this: in matrix-vector multiplication you multiply this column vector with each of the rows in turn, so X*Theta ends up being the column vector with entries (x^(1))^T Theta, (x^(2))^T Theta, down to (x^(m))^T Theta — which is of course just the vector of all of the predictions of the algorithm. Now let me also define a vector y, taking all of the labels from your training examples and stacking them up into a big column vector. It turns out that J(Theta) can then be written as (1/2) * (X*Theta - y)^T (X*Theta - y). Okay. Let me just outline the proof; I won't do it in great detail. X*Theta minus y is the vector with entries h(x^(1)) - y^(1) down to h(x^(m)) - y^(m) — it's just all the errors your learning algorithm is making on the m examples, the differences between the predictions and the actual labels. And if you remember, for a vector z, z^T z = sum over i of z_i^2 — a vector transposed times itself is the sum of the squares of its elements. So this error vector transposed times itself is the sum of the squares of its elements, which is exactly why the cost function J(Theta) — the sum of the squares of all of these errors — can be computed by taking the vector X*Theta - y, transposing it, and multiplying it by itself: that gives the sum of squares of those error terms. Okay. If some of the steps don't quite make sense, really don't worry about it; all of this is written out more slowly and carefully in the lecture notes, but I wanted you to have a sense of the broad arc, the big picture, of the derivation before you go through it yourself in greater detail in the lecture notes. So finally, what we want to do is take the derivative with respect to Theta of J(Theta) and set that to 0. This is the derivative with respect to Theta of (1/2) * (X*Theta - y)^T (X*Theta - y). I'm going to do the steps really quickly — they use some of the little properties of traces and matrix derivatives I wrote down briefly just now — so, without getting into the details: taking transposes, this becomes the derivative with respect to Theta of (1/2) * (Theta^T X^T - y^T)(X*Theta - y), and then, kind of like expanding out a quadratic — this is (A - B)(C - D), so you get AC minus AD and so on — it expands to (1/2) * (Theta^T X^T X Theta - Theta^T X^T y - y^T X Theta + y^T y). What I just did is similar to how (ax - b)(ax - b) = a^2 x^2 - axb - bax + b^2; it's just expanding out a quadratic function. And then the final step is — yeah, go ahead. [BACKGROUND] Oh, is that right? Oh yes, thank you. Thank you. And then the final step is, for each of these four terms — the first, second, third, and fourth — to take the derivative with respect to Theta.
And if you use some of the formulas I was alluding to over there, you find that the derivative — I won't show the derivation — turns out to be (1/2) * (X^T X Theta + X^T X Theta - X^T y - X^T y), which simplifies to X^T X Theta - X^T y. As described earlier, we set this derivative to 0; how to go from this step to that step uses the matrix derivatives explained in more detail in the lecture notes. Setting the derivative to 0 implies that X^T X Theta = X^T y. This is called the normal equations, and the optimal value for Theta is Theta = (X^T X)^(-1) X^T y. Okay. So if you implement this, then in basically one step you get the value of Theta that corresponds to the global minimum. A common question I get is: what if X^T X is non-invertible? What that usually means is that you have redundant features — your features are linearly dependent. If you use something called the pseudoinverse, you essentially get the right answer in that case, although I think the even better answer is that if you have linearly dependent features, it probably means you have the same feature repeated twice, and I would usually go and figure out which features are actually repeated and are causing the problem. Okay. All right. So that's the normal equations; I hope you'll read through the detailed derivations in the lecture notes. Any last questions before we break? Okay. [BACKGROUND] Oh, yeah — how do you choose a learning rate? It's quite empirical, I think: most people try different values and then just pick one. All right, I think let's break. If people have more questions, when the TAs come up we'll keep taking questions. Let's break for the day. Thanks everyone.
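(For reference alongside the lecture notes, here is a minimal sketch of the normal equation in NumPy — my own illustration, not code from the lecture. X is the design matrix with the leading column of ones, y is the vector of prices, and the example rows are made up.)

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for theta in one step.
    Uses a linear solve rather than an explicit inverse; if X^T X is singular
    (linearly dependent features), fall back to the pseudoinverse."""
    A, b = X.T @ X, X.T @ y
    try:
        return np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return np.linalg.pinv(A) @ b

X = np.array([[1.0, 2104.0, 3.0],
              [1.0, 1416.0, 2.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])   # made-up rows: [1, size, #bedrooms]
y = np.array([400.0, 232.0, 315.0, 178.0])
theta = normal_equation(X, y)
print(theta)                          # lands at the global minimum of J(theta)
```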