Transcript for:
Linear Models and Learning Theory

ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we talked about linear models. And linear models share what we would refer to as the signal, which is this formula. It's a linear sum involving the input variables and weights, that can be put in vector form. And all linear models, in one form or another, have that as their basic building block. And you can have a classification linear system, like the perceptron, that uses that signal and takes the sign of it to make a decision, +1 or -1. Or you can take something like regression, which is real-valued, that takes the signal as it is and has that as output. We looked at the linear regression algorithm, which was a particularly easy algorithm. All it does is take the inputs and put them in a particular matrix form, and likewise the outputs. That's the inputs and outputs of the data set. And then, by computing this very simple formula, in one shot, it can get you the optimal value of the weight vector. If you look at linear models, you can think of them as an economy car. They get you where you want to go, and they don't consume a lot of gas. You may not be very proud of them, but they actually do the job. If you want a luxury car, wait until we get to support vector machines. And you have to pay the price for that. However, for linear models, it is remarkable how often they succeed on their own. And they are sufficient to get you the learning performance that you want. So I urge you to give linear models in general more attention than you would otherwise give. And try to use them first when you face a learning problem, and see if they will actually achieve what you want. To strengthen linear models even further, we introduced nonlinear transformation. And the idea behind it is that the signal is not only linear in x, which is what you would think of as the reason we call these linear systems. It is actually also linear in w, the weight vector. And the reason this is important is because learning actually modifies w in the learning process until it gets to the optimal one, while x, which you usually think of as a variable, is actually a bunch of constants, which are the data set that are handed to you. So the linearity in w is the key point. And if you take x and transform it in any way you want to another vector z, in a very nonlinear way if you like, this will still preserve this linearity, the linearity in w. Obviously, it will not be linear in x. And that's all that matters for you to apply the machinery that we got here, like the simple linear regression algorithm. And we took an example where we had the two variables x_1 and x_2, and we transformed them nonlinearly to x_1 squared and x_2 squared. And we found that in this case, the transformation helps us separate the data, where if we worked in the original space, we would not be able to. This time, I'm going to talk about error and noise. And these are practical considerations that we have to take into account when we consider real-life problems. And we're going to modify the learning diagram that we have, by incorporating the notion of error and the notion of noise. And I will do that for the bulk of the lecture. However, my starting point will be to wrap up the nonlinear transformation that we started last time. So let's look at what we had last time. We had this space. Let me magnify it a little bit. This space is the original X space. And the origin is in the middle. And you have these points, which are your data set.
And each point belongs to that space. And as you realize, there is really no way of separating the blue from the red using a line, which is what linear models do. And the idea for us was: let's do a nonlinear transformation. We called it phi. And if you look at what happened here-- let's look at both of them at the same time-- this is where you took x_1 squared. So this is x_1, and this is x_2. And that transformation here was x_1 squared. So this would be x_1 squared, and this is x_2 squared, which we're going to label as z. So the transformation is you take every point x_n in the sample, you put it through a transformation phi, and you get the corresponding point z_n. And now, you are working in the feature space, or the nonlinear space Z. When we did this, we realized that a data set like this can become linearly separable in the new space. And that allows us to apply the linear model algorithm here. And when you do that, you will get a separating boundary here. And that separating boundary is obtained by applying your simple linear model, like linear regression-- in this case, linear classification, the perceptron-- to the data in the Z space. So that's what we get. But we are not working in the Z space. When I give you a test point, it will be x. And you have managed to separate things in the Z space. Well, the way you do it is you go back to the input space. And as you realize, I'm using the "inverse" transformation, in quotation marks. Because the transformation, in principle, may not have an inverse. There are some points in the Z space that may not be a mapping of any point in X, and some points in the Z space which may be the mapping of more than one point. And therefore, in spite of the fact that phi is a mapping, a function, phi inverse, as we call it, is not. But when you apply this, figuratively, what you're going to get is a separating surface in the X space, that is not linear. And that was obtained by applying purely linear methods. And therefore, you can classify a new point by applying g of x, which would be the hypothesis that's defined here, the linear one, which happens to have that formula. So you look at the diagram all together. And this is basically the cycle you have, when you're doing the nonlinear transformation. You take the data set, transform it, classify it, and interpret it. In reality, when you get the new x, what you're going to do, you are going to take the new x, wherever it might be. You're going to transform it. And then look here where it lies, and classify it accordingly. It's a very simple procedure. And as you can see, although we are illustrating it here in a case where you're going from two-dimensional to two-dimensional, you could in principle go from two-dimensional to 100-dimensional, with highly nonlinear coordinates, and the same principle will apply. You will be classifying here, with a hyperplane in that case. And then, this surface would be very, very complicated-- could be completely jagged and whatnot. And that enables you to implement a lot of sophisticated surfaces. So let's look at the nonlinear transformation and ask ourselves, what transforms to what, to make sure that all the notions are clear. The first thing is the input point x. This is a single point that is represented by its coordinates x_1 up to x_d, together with the mandatory constant x_0 which equals 1 that takes care of the threshold term. This is a general representation of a point in the X space. What does this transform to?
I'd like you to think before I give the answer. Well, it transforms to a z. That is a vector. Each of these coordinates, let's say, z_1, is a nonlinear function, potentially nonlinear function, of all of the x's, so of the entire vector x. For example, this could be x_1 x_2 e to the x_3. The next one would be 1 over x_2 times x_3 cubed. Whatever it is. And you can go on and on and on, and there is really no limit. So if we thought of linear methods here as an economy car, this could be a truck. This could be actually an 18-wheeler. And we must be proud of that, because with such a simple method, we are able to create such a strong machine. But be careful, because you may not be able to drive it. And if you do the wrong transformation, you will end up crashing-- in this case, generalization-wise crashing. That is, although you did everything right, and you did this transformation, and this is a powerful machine, you don't know how to drive the powerful machine, and you end up with very poor generalization. And we need the theory in order to get our driver's license. That will tell us what to do in order to be able to drive this machine. So that is x. Now, what do x_1 up to x_N go to? Remember, this is the data set, the inputs of the data set. Each of these guys, by itself, is a full vector x that has all of these coordinates. So this is the data set we have. What does it transform to? Not surprisingly, z_1 up to z_N. So you end up with the same number of points. That is obvious. And each of them is a vector. And each vector can be very long, according to the transformation you chose. Next one-- the labels. The data set comes with inputs and outputs, right? So the inputs I did the transformation-- what do y_1 up to y_N transform to? Well, they transform to y_1 up to y_N. These are untouched. These are the values. They are not touched. And these are the ones we learn. If it's classification, they are +1 or -1, exactly the same way they were there before. How about the weights? When we use linear models, we have a weight vector. So when we are in the X space here, what are the weights? The answer is that there are no weights in the X space when you do a nonlinear transformation. The weights are done in the Z space. And I labeled the weights here as w tilde. And I'm using tilde as nonlinear, so that you remember it's a nonlinear space. And everything here is tilde. So this is w tilde. And if you look here at the dimension-- you may have not seen it. Let me magnify it. The dimensionality here is also d tilde. So whenever we need to distinguish between z and x, we will add tilde to the z counterpart, so that you are not confused about which is which. So we have those weights. And finally, you ask yourself: I've done all of this machinery. Could you please tell me, what is the hypothesis that I'm delivering to my customer? We're still calling it g of x, the final hypothesis of your learning process. And it happens to be exactly the same way, except in Z space. You take the linear form here, and take the sign, and that would be your hypothesis. Except it's a little bit annoying, because this is g of x, and you're telling me this is w tilde transposed times z. Where is x? Don't worry. Here is x. What z is, is the transformation of x. When you want to evaluate this for any x, all you need to do is plug into this formula, and you are ready to go. That's the entire story of the nonlinear transformation. 
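To make the pipeline concrete, here is a minimal sketch in Python. It is not from the lecture; the function names, the toy circular data set, and the use of linear regression for classification are illustrative assumptions, while the transform is the x_1 squared, x_2 squared example discussed above.

```python
import numpy as np

# Minimal sketch of the pipeline: transform x to z, learn linear weights in the
# Z space, and classify new x's by transforming them first. Names are illustrative.

def phi(X):
    # z = (1, x_1^2, x_2^2); the leading 1 is the constant coordinate for the threshold
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

def fit_linear(Z, y):
    # One-shot linear regression in the Z space: w~ = pseudo-inverse(Z) times y
    return np.linalg.pinv(Z) @ y

def g(X, w_tilde):
    # Final hypothesis g(x) = sign(w~^T phi(x)), evaluated on a raw input x
    return np.sign(phi(X) @ w_tilde)

# Toy data: +1 inside a circle, -1 outside -- not linearly separable in X,
# but (close to) separable after the squared-coordinate transform.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1.0, -1.0)

w_tilde = fit_linear(phi(X), y)              # the weights live in the Z space only
print("in-sample error:", np.mean(g(X, w_tilde) != y))
```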
Now, with that out of the way, let's go to the main topic of the day, which is error measures and noisy targets. When we face a real learning problem, we will realize that there are components, practical components in real life, that we have not fully taken into consideration. And what I'm going to do, I'm going to take the learning diagram, which we introduced in the first lecture. And then, I'm going to adjust it according to these practical components, until we get the final general form of the supervised learning diagram. That will take us through both topics, which are the error measures and the noisy targets. Here is the learning diagram in case you forgot what it was. That's where we left it. Let's see what it looks like. Remember, it's a pretty simple diagram. And we built it from scratch. I need to rebuild it, in order for you to realize that I'm not just flashing a jungle on you. It has a logic to it. This is what we're trying to learn. It's an unknown target function, represented to us by training examples. We have a learning algorithm that will take these examples and produce the final hypothesis. All of this is nice. We said that the learning algorithm is picking the hypothesis from the hypothesis set. And we said that this is a convenient technicality that has no loss of generality. So we accepted that we will always have a hypothesis set. And then, we went into the feasibility of learning. And we realized that for that to happen, we need to introduce a probability distribution on X, any probability distribution, and generate the points x_1 up to x_N, which constitute the inputs to the training examples, using this probability distribution independently. Once you do that, you get the benefit of Hoeffding. And you can make a statement that you're going to do something out-of-sample that is reflected by the in-sample. That's where we stood. So this is the diagram we are going to be modifying piece by piece. Let's talk about error measures, the first notion. Error measures try to answer the following question: what does it mean for h to approximate f? You have two functions. And you say, this is a good approximation, this is a bad approximation. Is it a qualitative statement, or is it quantitative? It is quantitative. And because it's quantitative, we are going to define an error measure that measures how well, or how badly, h approximates f. So the error measure will be defined as E of two guys. And these will be h and f. It returns a number for any two functions you plug in. One of them will be the target function. One of them will be a hypothesis of interest. And you ask yourself, how well, or how badly in this case, does h approximate f? And you get an error. If the error is 0, then h perfectly reflects f, and you're home free. If there's an error, then maybe you need to look for another h that has smaller error. So that formalizes the search question of the learning algorithm into minimizing an error function. We call it an error function, and we also call it an error measure. It is neither a measure in the measure-theoretic sense, nor a function-- this is actually a functional. But we are not worrying about that. It just takes these objects and returns a number. And we refer to it as a function. And we talk about error measure in the sense of the English word "measure", not the mathematical "measure". So the error function, in principle, returns a number for a pair of functions. But it is almost always defined in terms of the difference at a particular point.
And then, you put these points together. That's the pointwise definition. In this case, you define a small e that goes with a capital E that also takes two arguments. And these two arguments are the value of h at a particular point, and the value of f at the same point. That makes sense. I'm trying to compare functions. I want them to be the same on the same point. Therefore, if I compare them for every point this way, then I will get something meaningful. And then, I will need to do something about the different e's, small e's, that will get me the big E. Although this is not strictly required by the definition-- you could have a crazy error function that does not reduce to corresponding points. But invariably, this is what you're going to do. Have we seen this before? Yes, we have. Remember the squared error? How do we formalize it in this sense? We can say that the error in this case is h of x minus f of x squared. That's what we did, and that is indeed an error function that measures the difference between the two. And indeed, if the error is 0, it means that h of x equals f of x, and we have exactly what we want. We also saw it before, although we didn't explicitly talk about it in those terms, when we talked about binary error. When we had perceptrons, every point could be either right or wrong. And that doesn't look like a quantitative error function. It's binary. However, you can also put it in those terms as follows. We agreed on this notation-- let me magnify it for you. This notation, which is the funny bracket, means that you return 1 if the statement enclosed here is true. And you return 0 if it is false. That's a standard notation. So if you take this as your error function, what will happen? If h of x equals f of x, then this statement is false, and you return 0. So the error is 0, good. And if this statement is true, you return 1. And indeed, in that case, you're making an error. So it's a binary error because of this. And if you take it as a formal error, and do the rest of the development to get the other one, the big E for the global function, you will find that this is exactly what we did when we were talking about frequency of error and probability of error. That will become clear in a moment. Now, let's move from the pointwise-- you define it on one point in the space-- to define the error function, capital E, on the entire space. The way it is done is that the overall error, which has this notation, will always be the average of pointwise errors. So you take these pointwise errors and average them. And all we need to do is articulate what we mean by "average" in order to get that. So let's look at the in-sample error. When we have the in-sample error, this is the formula for it. And now, you think of in-sample error as the in-sample version of this. Because now, we are going to use the pointwise error that goes with that error function, or that error measure, in defining the in-sample error. If you take a single point from your training set, you would be having n going from 1 to N, so one of them generically is n. And I'm putting it in red, because that's what we are going to average with respect to. You compute this error measure whatever it may be, squared error, binary error, any other error you can think of. And now, you get the average. And the average in that case will be the simple average, which is 1 over N times the sum. This is indeed consistent with what we thought of as the training error.
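As a small illustration of how the pointwise error and the in-sample average fit together, here is a sketch in Python; the function names are assumptions, not anything from the lecture:

```python
import numpy as np

# Pointwise error measures e(h(x), f(x)) and the in-sample average
# E_in(h) = (1/N) * sum over n of e(h(x_n), y_n). Names are illustrative.

def squared_error(h_val, y_val):
    return (h_val - y_val) ** 2

def binary_error(h_val, y_val):
    # [[ h(x) != f(x) ]]: 1 if they disagree, 0 if they agree
    return float(h_val != y_val)

def in_sample_error(h, X, y, pointwise):
    # Average the pointwise error over the N training points
    return np.mean([pointwise(h(x), y_n) for x, y_n in zip(X, y)])

# With binary_error, in_sample_error is exactly the frequency of error in the sample;
# the out-of-sample counterpart replaces this average by an expectation over P(x).
```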
And if you go back to the binary error, which is the funny one, and you look at what this formula will return, it will return exactly the frequency of error in the sample, correct? Now let's go for out-of-sample error. Again, the out-of-sample error is the out-of-sample version of this error measure. Now, in this case, the point is a general point in the space. So we are labeling it as x in general, without any subscript. This could be any point in the space that is picked from X, which is the input space. And in order to get an average in that case, what you do is you get the expected value, in this case with respect to x. So that is the average for the out-of-sample case. And again, if you take the binary error, and you take the expected value of this, this would be identically the probability of error overall. And we are using the probability distribution over the input space X, in order to compute this quantity. That's how we get from a definition that you invoke on a single point, to the in-sample and out-of-sample versions. Now let's revise the learning diagram with this added component. Here is the learning diagram. There's nothing that changed here except that now, this is the standard color, because we already got used to it. The red stuff is the new stuff. So you have here, and you want to add the error measure. I'd like you to think for a moment what we are going to do. We're going to do it in two steps. The first one is to realize that we are defining the error measure on a point. So here's the addition. The addition is that in deciding whether g is close to f, which is the goal of learning, we test this with a point x. And the criterion for deciding whether g of x is approximately the same as f of x is our pointwise error measure. Furthermore, this x is created from the space using something very specific. And that is, it comes from the same probability distribution. It comes from the same probability distribution that generated those points. This was implicit when we talked about the bin. mu was the probability distribution in the bin. And nu was the sample frequency in the sample that you pick. When you test the system that you trained using a certain probability distribution, I ask you to test it with points drawn from the same distribution. That's the only requirement in order to invoke Hoeffding, or the counterparts of Hoeffding for more elaborate types of functions. Now, if you do that, then you have the guarantee. And it does make sense that with the benign assumption that P can be an unknown-- any probability distribution-- all you're asked to do in order to get the guarantees that we talked about is: use it to generate the examples, and use it to test the hypothesis. So that is what we have. Now we come to the question-- I understand what the role is. I'm going to define the error measure point by point. I know how to move from a point to the in-sample, and on to the out-of-sample. Now, we come to the crux of the question. How do I define the error measure? What is the number that I should return for h differing from f on a particular point? I will give you an example to make the point. And my example will be fingerprint verification. You declare yourself. And you want to authenticate yourself. So you put your finger, and the system will decide whether it's you and return +1, or it's an intruder and return -1. That's what the system does. And we would like to see how to define an error measure in this case.
There are two types of error that you can make when you have a system like this. One of them is called false accept. I think it's self-explanatory. False accept meaning that someone who shouldn't be there was accepted, was falsely accepted. So the intruder got in. That's an error. The other type is false reject. You, the owner of the system, the one who paid for it, you put your finger, and you are rejected. And you're mad at them. That's a false reject. Now, in defining an error measure, I'd like to use this case, because there is a great intuition about what is going on. So if we can come up with a meaningful error measure here, that captures both the false accept and false reject, we will have a handle on what the error measures are all about. So how do we penalize each type? That's what you do. When there is an error, you penalize it such that the error measure is large. So you move away from that hypothesis, to get a better hypothesis that doesn't get penalized as much. Now, we can put it in a matrix form. This is the target. This is the perfect system. This returns +1 whenever it's you, returns -1 whenever it's an intruder. That's our dream system. We don't have that. We're going to use machine learning using examples. And we are going to come up with a hypothesis. This may not be the final hypothesis. We are talking about a general hypothesis here. When it becomes the final hypothesis, we are going to call it g. So h could be +1 or -1. And +1 means you are accepted, or the person is accepted, whoever he may be. And -1 means they are rejected. Now, let's try to put under the four possibilities here what the errors would be. First, the easy one-- the diagonal corresponds to no error. And I'm putting it in faint color, because in that case, it's clear that you're going to make zero error. It's you, and you were accepted. It's not you, and you were rejected. That's fine. What's interesting are these two. So we need to get a number for the false accept, which means it's an intruder but they were accepted, or the false reject, which means it's you but you were rejected. If I can come up with four numbers here, two of which I already know, then I have the answer. The key message I'm conveying with this discussion is that there is no inherent merit to choosing one error function over another. It's not an analytic question. It's an application-domain question. And I'm going to argue for that. Let's look at the error measure for this problem, for the important application of supermarkets. What happens with supermarkets? Well, supermarkets decide to become fancy. And when you go and you want the discount for your special program, you not only declare yourself, they need to verify that it's you. Because recently they have had too many people just claiming any number to get the discount. They want to control it a little bit. So at the checkout, you identify yourself. And then, you put your finger. And then, the system will verify you or decide that you're an intruder. Now, given this application, let's try to see false accepts and false rejects and how to penalize them. The false reject in this case actually is costly. Think of it this way. You are a customer, you go out, and you have this huge bag of stuff, $100 worth of stuff, and you think you're an important customer to the supermarket. You put your finger, you are rejected. You put your finger again, you're rejected, put your finger again, you're rejected.
You're embarrassed in front of the entire queue, and in your mind, you say, the heck with this supermarket. I'm going to go to the competitor. So when they have a false reject, they run the risk of losing a customer-- the customer gets annoyed. False accept is not that big of a deal. Someone comes in and claims they're you, and the system passes them. What is the downside? They got a discount. OK, one more discount-- for business, it's not that important. And furthermore, if you think about it, that must be a very brave person. Because they are an intruder, and they left their fingerprint on the system. That's part of the deal. All to get a discount-- I think really they are in trouble. So we really shouldn't penalize the false accepts that much, if that will help us with the false rejects. If you look at a matrix that fits this situation, this one qualifies. I'm going to penalize false accepts. That's not a question. But I'm going to penalize them just by 1. When it comes to the other one, which are the false rejects-- let me try again. This is the false reject. It's you, and you were rejected, and you are penalized dearly. And this one is the false accept. It's not you, and you were accepted, and therefore, you give it less weight. So this looks like a reasonable matrix for that case. Now, let's look at the exact same system. You are given training examples, you are told that the target function is fingerprint verification, and you go about your machine learning algorithm. Now, one of them is for a supermarket. And the other one is for the CIA. Let's see the situation here. Now, what is the CIA going to use the system for? It uses it for security-- identity verification, that you are an authentic person authorized to do something, could be entering the building, could be looking at a document. So you put your fingerprint first. Now, let's look at the false accept and false reject. You have to agree with me that false accept in this case is an unmitigated disaster. Someone got access to something that they're not authorized for, and national security is at stake. That's a no-no. False reject in this case can be tolerated. Why? You are not a customer. You are an employee. It's you, but the system rejected you. Just try again and again and again. Because you're paid to be here. Just take the inconvenience and do this, in order to avoid the false accepts. So in this case, it makes sense that we are going to put the weights in exactly the opposite way, even more extreme. And you will have a matrix that looks like this. If you are the wrong person, and you are accepted, that's a false accept. That's a huge penalty. The other one, you put a meager penalty. If you're really cruel with your employees, you put this at 0.1 or 0.001, and then have them really go at it 20 times until they're accepted. But in general, you can see that this has to be a much bigger number than this, whereas in the supermarket case, this was a bigger number than that. So the take-home lesson is that, when you're dealing with a practical learning problem, the error measure should be specified by the user. You are going to deliver a system to them. The system is not perfect. They want the target function, and you give them a hypothesis instead. You should ask them, how much does it cost you to use my imperfect system in place of the perfect system? That is their decision to make. And if they articulate that as a quantitative error function, this is the error function you should work with.
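As an illustration only, a user-specified cost matrix like the ones above might be turned into a weighted error as in the following sketch; the 1/10 and 1000/1 penalties echo the slides, while the names and the encoding of +1 as "accept" are assumptions:

```python
import numpy as np

# Cost-weighted error for the fingerprint example. f = +1 means it really is the
# person; h = +1 means the system accepts. The penalty numbers echo the slides;
# the rest is an illustrative sketch, not the lecture's code.

SUPERMARKET = {"false_accept": 1.0,    "false_reject": 10.0}
CIA         = {"false_accept": 1000.0, "false_reject": 1.0}

def weighted_error(h_vals, f_vals, costs):
    h_vals, f_vals = np.asarray(h_vals), np.asarray(f_vals)
    false_accept = (h_vals == 1) & (f_vals == -1)    # intruder accepted
    false_reject = (h_vals == -1) & (f_vals == 1)    # legitimate person rejected
    penalty = costs["false_accept"] * false_accept + costs["false_reject"] * false_reject
    return penalty.mean()                            # the diagonal (correct) cases cost 0

# The same hypothesis is graded very differently by the two applications:
h = np.array([ 1, -1,  1, -1])
f = np.array([-1, -1,  1,  1])                       # one false accept, one false reject
print(weighted_error(h, f, SUPERMARKET), weighted_error(h, f, CIA))
```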
However, this does not always happen. People may not have the formalization that will capture the error measure in reality. And sometimes, they capture it, but it's very difficult to optimize. There are other considerations. So now, what I'm going to talk about are the alternatives to this approach. And the alternatives are a compromise. They are very common and popular compromises, and people indulge in them. I don't mind indulging in them, because actually, there are some nice properties to them. But always remember, this is a second choice. If we knew what the error measure that needs to be used by the user is, we would use that. So here are the two alternatives-- you don't have the user-specified error measure. Then, you resort to plausible measures, measures that you can argue analytically have merit. Usually, the analytic argument starts with a loaded assumption. And from there, the going is very smooth. But nonetheless, in the absence of a user-specified error measure, we might as well resort to that. We have seen an example of that, which is squared error. If you look at squared error, you can say that if the noise is Gaussian-- I didn't do that in the lecture, but it's not a difficult thing to imagine-- the corresponding error measure in this case would be squared error. So that is the plausibility of it. And you can take other cases, for example, the binary error. You can go and get a cross-entropy type of error corresponding to the binary error and whatnot. So these guys have an error measure that goes with them. The other approach is not to have a plausible measure, but to have a friendly measure. You are not justifying that this is a meritorious measure. You're just using it because it's easy to use. And we have also seen that. For example, we can get a closed-form solution if we choose a particular error measure. Linear regression comes to mind. If you didn't use squared error in that case, you would not have gotten the very easy formula that we started the lecture with. And also, even if you can't get a closed-form solution, you might be able to use optimization that is favorable. For example, the cross-entropy that I referred to ends up, in the case of a linear model, logistic regression, being convex, which means that optimization is efficient. In that case, you get a global minimum and all of that. So now, you resort to either a conceptual appeal, the plausibility, or a practical appeal, which is the friendly aspect, to choose the error measure. This is completely legitimate, because in many cases, you're not going to have the user-specified error measure. Now, let us modify the learning diagram once more to introduce the error measure as we defined it. So here is-- and I'd like to ask you to look at this for maybe 15 seconds, and tell me where you think the error measure will fit in this diagram-- the error measure itself. What does it affect? What does it take from? What exactly-- I can put the error measure, for example, here. That's an option! I can put it inside the unknown target function. What does the error have to do with the target function? So you can think, where would it be? It's not difficult to realize that that's where it belongs. It has two roles. One of them is to evaluate this statement. This statement is qualitative. The output, the final hypothesis, approximates the target function. This is what gives it a number. This gives it a grade. And you use the error measure to quantify this approximation.
The other thing is that you have to feed the error measure to the learning algorithm. Because what does the learning algorithm do when you have an error measure? It minimizes the in-sample error, let's say, in this case. And the in-sample error depends on your error measure. If you're minimizing squared error, that's different from minimizing another type of error. So the error measure feeds into those two. Now we go for the next guy, which is the noisy targets-- new topic, another addition to the learning diagram. The noisy targets are actually very important. Because in reality, these are the only type you're going to encounter in real-life problems. Very seldom do you get a very clean target function. The first statement is, the target function is not always a function. Why are we calling it target function if it's not a function? Well, function is a mathematical notion. You have to return a unique value for every point in the domain. That's what qualifies it as a function. We used it here a little bit liberally. So far, we dealt with it as if it was really a function. But let's take the example we started with, the credit example. You consider credit card approval, and here is the historical record. This is an input. Isn't it possible that two identical customers have these fields, and one of them ended up being good credit, and one of them ended up being bad credit? Sure-- this doesn't capture all the information in the world. There's information that is not given that contributes noise, if you will. And there are circumstances that the people will go through that will make it probabilistic whether they'll be good credit or bad credit. So we come to realize that two identical customers, in the sense that their input representation is the same, can have two different behaviors. And having this-- this is one point mapping to two values, so it is not a function. What do we do about that? Well, we use a target distribution, as in probability distribution. So instead of having y equals f of x-- where you tell me what x is, and I tell you what the value of y is for sure-- you use a target distribution. And the notation for that is probability of y given x. So again, it depends on x. But its dependence is probabilistic. Some y's are more likely than others in this case. Here, one y was possible, and the rest were impossible. Now, we make it a little bit more accommodating. Now, we have a target distribution instead of a target function. Let's follow it through. x used to be generated by the input probability distribution. It will still be generated by that distribution. This is an artifact that we introduced in order to get the benefit of the Hoeffding-type inequalities. Nothing has changed. But what will change now is that, instead of y being a deterministic function of x, once you generate x, y is also probabilistic-- generated by this fellow. So you can think now of x,y as a pair being generated by the joint distribution, which is P of x times P of y given x. There's no assumption of independence once you put it this way. The point is simply that the P of y you are given is conditional on x. Now you get noisy targets. What is the noisy target in this case? Well, a noisy target can be posed as a deterministic target, like the one we had before, plus noise. This applies to any numerical target function.
So if y is a real number, or binary, or something numerical, you can always pose the question of a target distribution as if it was a deterministic target function proper, plus noise. This is just a convenience, to show you that this is not far from what we have already. And why is that? Because if you define now a target function to be the expected value-- the conditional expected value of y given x, that's a function. Although P of y given x gives you different values, you take the expected value-- that's a number, and you call this the value of the function, f of x. Then, whatever is left out, you call the noise. It's a nice trick. You've got the bulk of it. And then, you go here, and you call the rest the noise. And that is usually the form it is given. So you think that you are really trying to learn the target function still. But there is this annoying noise, and you're trying to make your algorithm pick this pattern. And there's nothing it can do about the remaining noise, which averages to 0. Now, by the same token, there is no loss of generality when we talk about probability distributions. If you actually have a proper function, which happens once in a blue moon, you can still pose this as a probability distribution. How do you do that? You get here P of y given x. And you define it to be identically 0 unless y equals f of x that you have in mind. So if we were talking about finite domains, you put all the probability 1 on this value. And you put the probability 0 for all other values. If it happens to be continuous, which is almost always the case, you put all the mass on the point, you put a delta function there, and you let the other ones be identically 0. In this case, the target function is a probability distribution. Therefore, if we decide that the target is always a distribution, there is no loss of generality. Now, let's do the final installment of the learning diagram. Once we are done with this, we will freeze it forever. This will be the general learning diagram for supervised learning. And here is what we have. We're going to include the noisy targets. This is what we had so far. It's getting crowded, isn't it? And we are going to make it more crowded. And in this case, we are including the noisy targets. So obviously, the modification will happen here. And what you do is you replace this with a target distribution. Let me magnify it. So the unknown target function became unknown target distribution. You define it as a conditional probability distribution of y given x. And you can think of this as if it was a target function, which is the expected value that I talked about, plus noise, which is the remaining part. But as a target, it is a distribution. And I'd like to look at this, and appreciate the time we spent to build the blocks here. In spite of the fact this looks like a complete jungle, you can go back and understand every single component here. Every component has a reason. This is to accommodate a practical consideration, which is the fact that we are learning something that is noisy. This is because a specification of the penalty, or the cost you pay for not being perfect, needs to be specified by the user. This is our artificial addition to the problem in order to make learning feasible, and so on and so forth. So that is the final diagram for supervised learning. Now, I'd like to make one final point about noisy targets, which is the distinction between the two probabilities that we have. 
We have probability of x, which we artificially introduced to accommodate Hoeffding. And then, there is probability of y given x, which was introduced in a completely different context. That is to accommodate the fact that genuine targets that you encounter in practice are not functions-- they are actually noisy distributions. So let's look at this. They look very much related. They both feed into the training examples. That's how the training examples are generated. This guy passes on the probability of y given x. This guy passes on the probability of x. When this guy gets it, it generates those guys according to the joint distribution-- multiplies these, if you will, and then uses it as a way to generate the pair x,y. So they look like they belong to the same category. Both of them are unknown. This one is unknown so that my machine learning statement can be as general as I can afford. You're learning something. You don't know what it is. So that's good to have. This one is unknown, because it is the least we could assume in order to get Hoeffding. We needed a probability distribution, but we didn't need to make any assumptions about it. So we left it to be an arbitrary one, and unknown, and we don't want to know it. So these are the similarities. Now, let's look at the differences. Both have probabilistic aspects. We have seen that. Now, the target distribution is what you are trying to learn. You are not trying to learn the input distribution. As a matter of fact, when you are done, you will not know what the input distribution is. The input distribution is merely playing the role of quantifying the relative importance of the point x. Let me give you an example. Let's say you are approving credit again. The target distribution is the probability of creditworthiness given the input. Let's simplify the input and say it's the salary. So I give you the salary, you decide what is the risk of this person defaulting, and then you decide-- output is +1, approve credit with probability 0.9, and disapprove credit with probability 0.1. That is the target distribution. And that is what you're trying to learn. You're probably going to approximate it with a hard decision. Or you can actually learn the probability distribution, as we'll see later on. The input distribution just tells you the distribution of salaries in the general population. How many people make $100,000, how many people make $10,000, et cetera. The probability distribution over the input does matter, in the following sense. Let's say that you encounter a population where the salaries are very high, so P of x is tilted very much towards the high salaries, and let's conjecture that high salaries correspond to high creditworthiness. In this case, the same system that you trained, which will take any salary, low or high, and then decide whether to approve credit or not, will be tested mostly in the very comfortable region of high salaries. So it will be returning: yes approve, yes approve, yes approve, with very small probability of error. And if you instead put the mass of probability around the borderline cases, the cases where the decision is difficult, the same system that you learned will probably perform worse, just because there are so many points that are borderline. So the input distribution does give the weight that will finally grade your hypothesis. But you're not trying to learn that distribution. And when you put them together analytically, which you're allowed to do, you can merge them as P of x and y. And that's what you will find in the literature.
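To keep the two distributions apart, here is a small sketch of how a data set could be generated in the credit example; the lognormal salary distribution and the particular P of y = +1 given x curve are invented numbers, purely for illustration:

```python
import numpy as np

# x comes from the input distribution P(x) (salaries -- NOT what we learn);
# y comes from the target distribution P(y | x) (creditworthiness -- what we learn).
# All specific numbers below are assumptions for illustration.

rng = np.random.default_rng(1)

def sample_x(n):
    # Assumed salary distribution P(x)
    return rng.lognormal(mean=11.0, sigma=0.5, size=n)

def p_plus_given_x(x):
    # Assumed target distribution P(y = +1 | x): creditworthiness rises with salary
    return 1.0 / (1.0 + np.exp(-(x - 60_000.0) / 20_000.0))

def sample_dataset(n):
    x = sample_x(n)
    y = np.where(rng.random(n) < p_plus_given_x(x), 1, -1)   # noisy target
    return x, y

# The "deterministic part" f(x) = E[y | x] would be 2 * p_plus_given_x(x) - 1;
# whatever is left over is the noise, which averages out to 0.
x, y = sample_dataset(5)
```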
That merged form is very nice and pleasant, and you generate the examples using that joint distribution. However, you just need to remember that this merging mixes two concepts that are inherently different. Definitely, P of x and y is not a target distribution for supervised learning. The target distribution, the one you're actually trying to learn, is this fellow. And the other component is just a catalyst in the process. That covers the error and noise, and we have arrived at the final statement of the learning problem. So now, let me spend the rest of the lecture trying to prepare you for the coming two weeks, which will cover the theory of learning. It's a very important theory, and I encourage everyone to bite the bullet and go through it. I will do my best to make it user-friendly. However, it's important not just because of the mathematical derivation. The insight, and the secondary tools you are going to get, are extremely important. It's worth two weeks-- not a full two weeks, but four hours' worth of listening to a lecture and actually trying to study the material well. So that's my pitch. Let me give you the preamble to the theory. Let's see what we know so far in order to put the theory in perspective. We know that learning is feasible in a probabilistic sense. And the way we did this is by stating that it is likely that the out-of-sample performance is close to the in-sample performance. That, in our mind, corresponded to the feasibility of learning. And I'm going to test this premise now and ask: is this really learning? Is this condition the condition that captures what we mean by learning? Let's raise some doubt. Well, learning means that you got the hypothesis right. You give it to your customer, and it behaves well-- as close as possible to the target function, right? That's success. That means that the condition for learning is really that g approximates f. And now, we are sophisticated people. We know what "approximates" means. This condition is not really this condition, right? What is this condition in terms of the E_in and E_out and stuff? Very simple-- that's what it means. It means that the out-of-sample error for g is close to 0. Because the out-of-sample error measures what? It measures how far you are, that is g, from the target function over the entire space. And therefore, the statement that you are close means that the out-of-sample error is small. So this is what we want. And this is what we have. Now, what was that? If it's not learning, what was it? Well, this was actually good generalization. And it's an important building block. Because you will never have access to this. If I gave you this as the condition, then you say: A quantity that I will never know is close to 0. Thank you very much! But now, with the theory, I was able to tell you: you have a window on E_out by dealing with E_in, if you have the right checks in place, which we stated vaguely so far as: the number of hypotheses is not too big. And we will define it very, very accurately when we get to the theory part. So if you have that, all of a sudden, E_in is an important quantity. Because now, it acts as a proxy for E_out that you don't know. So this is not a total waste, but it's only half the story. The full story of learning has two questions. And if you look at this slide, and remember that this is always the way the learning problem is posed, you will dismiss a lot of misconceptions of-- learning is impossible, learning is possible. You'll find all kinds of results in the literature. So here's the deal.
This quantity patently is: we learned well. That's what it means, right? So now, we are going to achieve this through two conditions. The first condition is the one we developed using Hoeffding. E_in is close to E_out. The second condition is: E_in is small. You put them together, and you have learning. And you can see the difference between the two. This is a theoretical result. This is a practical result. E_in, I know E_in. I can try to use linear regression, or something else, to knock this down and get it as close to 0 as possible. And indeed, if you look at where we handled these questions, we didn't explicitly pose the questions in this form. But we already covered them in two lectures. This was the subject of Lecture 2, right? Hoeffding was all about the fact that E_in is close to E_out. This was the subject of Lecture 3. We had data, and we wanted to get the in-sample error to be small. And we looked for techniques to do that. So now, because this is important, let's put it in a box. Learning reduces to two questions. First question: can we make sure that the out-of-sample performance is close enough to the in-sample performance? This is a theoretical question, and we are going to spend two weeks answering this question. Second one: can we make the in-sample error small enough? This is a practical question, and we're going to spend four weeks doing this one. And then, by the way, at the end, we'll have one week to reflect. And it's always sweet to reflect when you have a concrete ground to stand on. We can do all the philosophy in the world, and we will have very concrete mathematics, very concrete algorithms, and very concrete results on real data, to know that what we're talking about means something. That's the plan. Now, let me just make a small comment about this one. "Small enough" has meant close to 0 so far. There is a very important class of applications where there is no way under the sun that you'll get an out-of-sample performance close to 0, anywhere near 0. And by proxy, E_in will not be near 0 either. And I'll give you a very simple example. Let's say that you're doing financial forecasting, trying to detect whether the market's going up or down. Under idealized conditions, this is impossible. That data is purely noisy, and there's nothing to learn. The fact that the conditions are not ideal is what lets hedge funds make money. They exploit a little bit of inefficiency. But they don't get it right 100% of the time. They don't get it right 70% of the time. They will be very, very happy if they get it right 53% of the time, consistently. Under those conditions, you will make a lot of money. So the out-of-sample error here, the best that we're trying to achieve, is very close to a half. If you get it right 53% of the time, the error is 47%. So the smallest error you can get, in some applications, is not really near 0 at all, but actually closer to one half. As long as you know for a fact, or at least have the theoretical guarantee, that what you're seeing in-sample, once you add the Hoeffding allowance, if you will, translates to an out-of-sample error comfortably smaller than a half, consistently, you are in business. If you don't have the Hoeffding guarantee, then you are so happy in-sample. You get the stock market right 75% of the time. You think you're going to make money. And then, when you look at the jump from E_in to E_out, you find that the error bar is that big, and you are in trouble.
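One way to see the two questions at work is a tiny simulation: fit a simple model, measure E_in on the training sample, and estimate E_out on a large fresh sample from the same distributions. This is a sketch with assumed details, not something from the lecture:

```python
import numpy as np

# E_in as a proxy for E_out: train on N points, then estimate the out-of-sample
# error on a large fresh sample drawn from the same P(x) and P(y | x).
# The data-generating details are assumptions for illustration.

rng = np.random.default_rng(2)

def sample(n, noise=0.1):
    # Linearly separable target plus label noise: flip each y with probability `noise`
    X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, 2))])
    y = np.sign(X[:, 1] + X[:, 2])
    flip = rng.random(n) < noise
    return X, np.where(flip, -y, y)

def binary_error(X, y, w):
    return np.mean(np.sign(X @ w) != y)

X_train, y_train = sample(100)
w = np.linalg.pinv(X_train) @ y_train        # linear regression used for classification

E_in = binary_error(X_train, y_train, w)
X_test, y_test = sample(100_000)             # large fresh sample approximates E_out
print(E_in, binary_error(X_test, y_test, w)) # close to each other, and both near `noise`
```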
Let me talk about what the theory that we're going to cover in the next two weeks will achieve, and then I will stop. This is a typical figure that you are going to get. Theory deals with in-sample error and out-of-sample error. Let me actually magnify it. OK. We will see the behavior of in-sample error as you get the model to be more and more elaborate, which will be measured by a quantity which we are going to call the VC dimension. You will find that there are certain behaviors of the in-sample, and the out-of-sample, and the model complexity. And all of the things that appear in this figure will get a formal definition, and we will be able to evaluate them when we get to the theory. So it is worthwhile. But if you summarize what the theory does, there are two steps that are the most important. The first one, which is the most remarkable result in the theory of learning, is that we are going to characterize the feasibility of learning for the case where-- infinite M, remember what M was? M was the number of hypotheses. We worked with a finite hypothesis set, in order to be able to work with simple Hoeffding. And we realized that the bigger M is, the looser the bound. And if M goes too big, the bound is meaningless. So if M is infinity, we are toast. So now, we would like to be able to find a counterpart that is able to handle an infinite hypothesis set, like every hypothesis set we have seen so far-- the perceptron, the linear regression model. All of these are infinite hypothesis sets. And we are going to try to find a way to deal with infinite hypothesis sets. This is the bulk of the development. And that is, we'll end up-- we are going to measure the model not by the number of hypotheses, but by a single parameter which tells us the sophistication of the model. And that sophistication will reflect the out-of-sample performance as it relates to the in-sample performance. Once we do this, lots of doors open. So we're going to characterize a tradeoff that we observed on and off as we went through the lectures. We realized that we would like our model, the hypothesis set, to be elaborate, in order to be able to fit the data. The more parameters you have, the more likely you are to fit the data and get here. So the E_in goes down if you use more complex models. We also suspected that, if you make the model more complex-- going in the same direction-- the discrepancy between E_out and E_in gets worse and worse. E_in tracks E_out much more loosely than it used to. Now, the good news from the theory is that this will be pinned down so concretely that we are going to derive techniques from this that will make a lot of difference in practical learning. Regularization is a direct result of this. And without regularization, you basically cannot do machine learning, other than extremely naively. So this will set the foundation for a practical method that is used in almost every machine learning problem you will have. So it's worth really knowing. Now, I will stop here. And we'll take questions. And we'll start the theory next time. Please, get ready. Do your exercise, and get ready for two weeks' worth of very interesting derivation. Now, let's go to questions. MODERATOR: How does P of x impact the learning algorithm? So does it matter if P of x is different in the training and the real data set? PROFESSOR: There is the absolute impact of P of x. And then, there's the relative impact. You're asking about the relative impact.
Let's say that I pick the training points according to one distribution, and then test the system using another. Let's answer that question first. There is a correction to the theory that takes into consideration the difference between the two probability distributions, assuming that they are not extreme. For example, if one probability distribution completely vanishes, then obviously there's a problem. Because the points in that part of the space will never happen, and you shouldn't be hoping to learn at all from that. But there are modifications to the theory, where you get a correction term based on the difference between the two probabilities. The absolute version-- I don't know whether this was asked, but let me address it anyway-- how does P of x affect the learning algorithm? Well, the emphasis that P of x gives on certain parts of the space over others will affect the choice of the learning examples. And if you have a limited resource in your hypothesis set-- which you always have to have, otherwise the model is too complicated-- then there's always a compromise about which part of the space should I try to get better than the other? You don't think of this explicitly, but that's what the algorithm does when it tries to satisfy a number of examples. Therefore, if you change the probability distribution-- even if it's the same for both of them-- then you will end up with a slightly different hypothesis that takes into consideration the emphasis of the new one. Nonetheless, you are not learning that input distribution. It's just affecting your choices. MODERATOR: In this discussion, you introduced a probability of y given x and probability of x. Will probability of x given y ever play a role? PROFESSOR: I can imagine cases where it plays a role, where you have P of x and y, and you ask yourself: if I get this output, what is the likely input? This is a role. I don't know whether it's a machine learning role or not. But in general, the merging of P of x and P of y given x in the same quantity, although it's mathematically convenient, as I mentioned, it's a little bit-- not misleading. Misleading is a strong word. What you're really looking at, you always think, I have P of y given x. That's the genuine thing that I'm trying to learn. And that is an integral part of the learning problem. On the other hand, P of x plays a technical role-- a technical role that is fairly negligible. It's essential for it to exist, but it's not nearly as important as P of y given x. MODERATOR: In the case of considering the target function as a probability distribution, then what is better to have-- N pairs of x and y or N y's per x? PROFESSOR: I don't have a theoretical proof for it. It seems to me obvious that if you get all the outputs corresponding to one input, you are dealing with a very specific part of the input space, and you're unlikely to infer anything about the rest of the space. So the answer to the question is that intuitively-- and I think it would probably be not that difficult to prove-- that you get them independently rather than get them for the same input. Also, by the argument we mentioned before, you should be choosing the inputs according to the probability distribution, the input probability distribution P of x, independently. So if you get all the examples according to the same x, this really means that P of x that you're using is a delta function on that x. 
So if you live up to the expectations, and use the same probability distribution to generate the output, then you are in good shape. But if you change the game on me, and generate all the examples according to this delta function, and then when you want to test it, you go out and give me points that I haven't seen before, then I'm in trouble. MODERATOR: Can you clarify what you mean by poor generalization? It's a common question. PROFESSOR: This will be part of the theory. There will be a very specific quantity we measure, which is the discrepancy between E_out and E_in. And we are going to call this the generalization error. And that will quantify poor generalization or good generalization. MODERATOR: Going back to slide 11 and 12-- PROFESSOR: Ah, the supermarket and the CIA. MODERATOR: Yes, so you chose the numbers 1 and 10, or 1000 and 1. Is there a principled way of choosing these numbers? PROFESSOR: The principled way is to estimate the cost of this occurrence, and then translate it into those. This was only an illustration. And I wasn't really interested in the 1 or 10. I was only interested in making the point crisp, that the error measure is different between two application domains for exactly the same system-- the same system as in machine learning system, same training data, same target function. But the error measure can be different depending on the application domain. So in this case, indeed, we can actually go and see, for example, the loss of revenue by giving an unwarranted discount for the supermarket, and the probability that the customer will be annoyed, and lost revenue because of the customer, and actually come up with the right balance between false accept and false reject for the supermarket. It may not be 10, but it will be-- definitely, the number that is 10 here would be bigger than the number that is 1 here. Similarly, for the CIA, you can go and ask yourself, what is the risk and how much does it cost, versus the lost time for the employees by trying the system again, and then come up with a more principled way of doing it. But that was not really the crux of what I'm doing here. I was only making the point that they are different, that's all. MODERATOR: Once the theory is explained, will it quantify the errors that result from not knowing parts of P of x, especially if P of x has maybe long tails and things like that? PROFESSOR: P of x has been assumed to be an unknown function. And I only used it as a utility to invoke a probabilistic setup. There are no assumptions about P of x. As long as you pick the points from the same distribution to train as to test, everything that I said, and I will say during the theory part, will be valid. If it's a long tail, it's a long tail for training and for testing. The probability of getting something from-- let's say if it's a heavy tail and I get something that is outlier, I will get a certain error. I will get an in-sample error and I'll get an out-of-sample error. I basically don't worry about the structure of P of x, because I'm assuming it's unknown. And I'm assuming that, in the course of supervised learning, I'm not going to learn it. MODERATOR: What happens in the case that both the false positives and negatives have higher values? PROFESSOR: If you scale both of them up, it makes no difference whatsoever. Then, the error measure is scaled up, and you're minimizing it. So it's just a constant multiplied by it. 
MODERATOR: There's also a question asking you to reiterate the relation of P of x to the Hoeffding Inequality.

PROFESSOR: The Hoeffding Inequality was based on the bin. The bin had marbles, and we picked them according to some probability-- it's a Bernoulli trial, so it's a binary outcome-- and the probability was mu. The bin became the input space. The input space, when we started talking about machine learning, did not have a probability distribution. It was just a set. So in order to be able to invoke the probabilistic aspect, we needed to put a probability distribution over the input space, such that when you create green and red marbles according to agreement or disagreement, and the input space becomes a bin, there is a probability that goes with it for picking a red versus a green marble. It doesn't matter which, because any probability you put will correspond to some mu, and then you have the rest of it. And we know that the Hoeffding Inequality-- the bound on the right-hand side-- is independent of the value of mu. So any old probability will do. Will do what? It will legitimize the learning problem as far as the probabilistic approach is concerned. Obviously, we can enter a discussion about the probability being concentrated or spread out, or parts of the space having probability 0. All of that is good and valid, except that it doesn't affect the basic question, which is to make sure that learning is feasible in a probabilistic sense. Any P of x will achieve that.

MODERATOR: A clarification-- some people are asking you to go over the case of the squared error measure and the closed-form solution.

PROFESSOR: This actually goes back to the review. Let me go to the review slide, because this is from last lecture. There is an algorithm that we derived for linear regression, and the algorithm was based on minimizing the squared error. Remember that we took a gradient, and we took advantage of the form being squared error in order for the thing to be differentiable, and for the derivative to have a simple form. And that simple form is what ended up giving us the formula for w at the bottom-- X transposed X, the inverse of that, times X transposed y-- as a simple closed-form solution for the final hypothesis of linear regression. So in that case, it is the squared error measure, which defines linear regression, that enabled us to find such a simple solution. If you take another measure, you may or may not get a simple solution. For sure, in this case, we got it. And there are definitely error measures you can choose where you cannot find a simple solution like this one.
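A minimal sketch of that closed form, assuming the convention that each row of X is one input vector with a leading constant 1; the data here is made up purely to show the shapes involved.

```python
import numpy as np

# Made-up data set: each row of X is an input with a leading constant 1
X = np.array([[1.0,  0.2,  0.7],
              [1.0,  1.5, -0.3],
              [1.0, -0.4,  0.9],
              [1.0,  0.8,  0.1]])
y = np.array([1.1, 2.3, 0.2, 1.6])

# One-shot solution that minimizes the in-sample squared error:
# w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

y_hat = X @ w                      # in-sample predictions
E_in = np.mean((y_hat - y) ** 2)   # in-sample squared error
```

In practice, np.linalg.pinv(X) @ y or np.linalg.lstsq compute the same solution in a more numerically stable way.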
MODERATOR: Going back to the problem of the CIA and the supermarket, if the probability of y equals +1 versus y equals -1 is not balanced, should you do something regarding P of x to have a correct estimate of your error?

PROFESSOR: The probability of y, in the absolute, depends on two things-- the probability of y given x and the probability of x. So if you put them together and you get an imbalance in the probability of y, this means that the building quantities, which are P of x and P of y given x, are what gave rise to that imbalance. And those quantities will definitely affect the learning process. So if you want to know what happens when y is not balanced, go back and see what gave rise to it, and then you will be able to find the answer more directly, through the quantities that directly affect the learning process.

MODERATOR: And again, on these costs, is there ever a case where you can use rewards instead of costs, as in assigning negative values to--

PROFESSOR: Yeah, they are equivalent, indeed. You are just maximizing the reward or minimizing the punishment. I think it's just two ways of looking at the same thing.

MODERATOR: A question-- in the example of the bins, when you say there's a bin that becomes the input space, does the input space include just the training data points, or does the input space include all possible points?

PROFESSOR: The input space includes all possible points, but it includes only the input parts of those possible points. If you look at the examples, the training data, there is x and y. The input space deals only with the x part. When you talk about the input space in general, it covers all possible x's. When you talk about the training data, you are talking about the x's that were picked as a training set, N of them.

MODERATOR: Regarding the transformation, what relation does phi have to something like principal component analysis?

PROFESSOR: This is a different subject. There is a subject of processing the input in order to make it more compact, in order to get rid of irrelevant parts and whatnot. That is a legitimate processing step, but it's not what I was alluding to here. What I was alluding to here, in the nonlinear transformation, is an ability to implement more sophisticated hypotheses using the same simple method, which is the linear method. And therefore, the transformation is done with a view to that, not with a view to getting rid of some of the artifacts of the input. However, feature extraction is feature extraction. You can think of the nonlinear transformation as feature extraction, and you can also think of other methods for processing the input, and getting rid of some of the irrelevances, as feature extraction. If you think of the example of the handwritten digits that we talked about, we started with the full image, which I think was 257 inputs' worth, counting the constant 1, and we ended up with only two features plus the 1. The two features were symmetry and intensity. In some sense, these are informative features, and in that case you lost some information about the input, but hopefully what you lost is not relevant. Principal component analysis, and other methods like it, are fairly systematic ways of detecting that without attaching meaning. So you don't really study the subject matter; you just apply a standard method that will pick the most informative directions in the input space, in the input representation space, and those will be your coordinates. So it's a different subject. It's not related to the nonlinear transformation, per se.
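A minimal sketch of that kind of feature extraction for the digits example. The precise definitions of intensity and symmetry are not spelled out in this discussion, so the ones below are only one plausible choice: intensity as the average pixel value, and symmetry as the negative average difference between the image and its left-right mirror image.

```python
import numpy as np

def extract_features(image):
    """Map a grayscale digit image to (1, intensity, symmetry).
    These definitions are an assumption made for illustration."""
    intensity = image.mean()
    symmetry = -np.abs(image - np.fliplr(image)).mean()
    return np.array([1.0, intensity, symmetry])

# Made-up 16x16 "image" just to show the shapes involved
image = np.random.default_rng(0).random((16, 16))
z = extract_features(image)   # 3 numbers instead of 256 pixels plus the constant
```

The point is only the reduction: a handful of informative coordinates replacing the raw input, whether they come from domain knowledge (symmetry, intensity) or from a systematic method such as PCA.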
MODERATOR: Regarding the error measures-- the squared error measure is used mainly for mathematical convenience. Do we lose too much by replacing it with something like just an absolute value?

PROFESSOR: You lose the optimization. Squared error looks like this-- nice and smooth, with all kinds of friendly properties. You take the absolute value, and you get this guy, and the edge is really bad news. All of a sudden, it becomes combinatorial optimization instead of minimizing a smooth function. So yes, you lose in terms of optimization. If you have a specific merit for using the absolute value-- that is, someone tells you, this is my error function, and I want to make sure that this is what you minimize-- then we have to bite the bullet and work through it. But if you're making an analytic choice just for the heck of it, you might as well pick something that is friendly either to the concept or to the optimizer.

MODERATOR: There's a question regarding the use of a linear model, where you have P of y given x that represents f of x, and then f of x would be the result of w transposed-- hello? OK, so if this is the case, then when you subtract f of x from y, does it mean you have the shape of P of y given x?

PROFESSOR: The target is f of x. The target is not w transposed x; w transposed x is the final hypothesis, which is my attempt to approximate the target function. So I was talking about the target function versus the target distribution, even without any learning taking place. Someone has a target function. It's noisy. I'm telling them that they can model it this way-- take the expected value, assuming it's a numerical function, the expected value under the probability distribution. You will get the average y given a particular x. That's a function, so you call this f of x. The remaining part, which is the value of y minus that, will be pure noise in the sense that it averages out to 0. So this was just a way of looking at it. It does not touch at all on linear models or any other models. It's a characterization of the target function versus the target distribution.

MODERATOR: There's a tradeoff between complexity and performance. Is there a way to simultaneously improve the generalization as well as minimize the error?

PROFESSOR: If you sit through the next four lectures very, very attentively, you'll get the answer to that at the end of the four lectures. I'm half joking, but that's the reality of it. You'll have enough tools to be able to answer questions like that.

MODERATOR: I think that's it.

PROFESSOR: OK, that's it. Thank you, and we'll see you next week.