The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
We're talking about generalized linear models. And in generalized linear models, we generalize linear models in two ways. The first one is to allow for different distributions for the response variables.
And the distributions that we wanted were the ones in the exponential family. And this is a family that can be defined in general over random variables that take values in, say, R^q, with parameters in R^k. But we're going to focus on a very specific case, which is when y is a real-valued response variable, which is the one you're used to when you're doing linear regression. And the parameter theta also lives in R.
And so we're going to talk about the canonical case. So that's the canonical exponential family, where you have a density f theta of y, which is of the form exponential minus, sorry, exponential plus. And then we have y, which interacts with theta only by taking a product. Then there's a term that depends only on theta, and some dispersion parameter phi.
And then we have some normalization factor. Let's call it c of y, phi. So it only depends on y and phi. So it really should not matter too much.
So it's c of y, phi. And that's really just a normalization factor. And here, we said that we're going to assume that phi is known. I have no idea what I'm writing. I don't know if you guys can read it.
I don't know what chalk has been used today. But I just can't see it. It's not my fault.
All right. So we're going to assume that phi is known. And so we saw that several distributions that we know well, including the Gaussian, for example, belong to this family. And there's other ones, such as Poisson and Bernoulli. So if the PMF has this form, if you have a discrete random variable, this is also valid.
And the reason why we introduce this family is because there are going to be some properties. So we know that this thing here, this function b of theta, is essentially what completely characterizes your distribution. So if phi is fixed, we know that the interaction is of this form.
And this really just comes from the fact that we want the function to integrate to 1. So this b here in the canonical form encodes everything we want to know. If I tell you what b of theta is, and of course, I tell you what phi is, but let's say for a second that phi is equal to 1. If I tell you this b of theta, you know exactly what distribution I'm talking about. So it should encode everything that's specific to this distribution, such as mean, variance, all the moments that you would want. And we'll see how we can compute from this thing the mean and the variance, for example.
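For reference, the canonical form described above can be written out as a formula. This is a reconstruction from the spoken description, so the exact notation on the board is an assumption; it agrees with the log likelihood used later in the lecture.

```latex
f_\theta(y) = \exp\!\left( \frac{y\,\theta - b(\theta)}{\phi} + c(y,\phi) \right),
\qquad y \in \mathbb{R},\ \theta \in \mathbb{R},\ \phi \text{ known}.
```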
So today we're going to talk about likelihood. And we're going to start with the likelihood function, or the log likelihood for one observation. From this, we're going to do some computations.
And then we'll move on to the actual log likelihood based on n independent observations. And here, as we will see, the observations are not going to be identically distributed, because we're going to want each of them, I mean conditionally on x, to be a different function of x, where theta is just a different function of x for each of the observations. So remember, the log likelihood, and this is for one observation, is just the log of the density.
And we have this identity that I mentioned at the end of the class on Tuesday. And this identity is just that the expectation of the derivative of this guy with respect to theta is equal to 0. So let's see why. So if I take the derivative with respect to theta of log f theta of x, what I get is the derivative with respect to theta of f theta of x divided by f theta of x. All right?
Now, if I take the expectation of this guy with respect to this theta as well. What I get is that this thing, what is the expectation? Well, it's just the integral against f theta.
Or if I'm in a discrete case, I just have the sum against f theta, if it's a PMF. It's just the definition of the expectation. The expectation of, well, let's say of h of x, is the integral of h of x f theta of x if x is continuous, or it's just the sum of h of x f theta of x if x is discrete.
So if it's continuous, you put this soft sum, the integral. And if it's discrete, it's the same thing, right? So I'm just going to illustrate the case when it's continuous. So this is what?
Well, this is the integral of the partial derivative with respect to theta of f theta of x, divided by f theta of x, all times f theta of x, dx. And now these f thetas cancel.
So I'm actually left with the integral of the derivative, which I'm going to write as the derivative of the integral. But f theta being a, sorry, f theta being a density for any value of theta that I can take, as a function of theta, this integral is constantly equal to 1. For any theta that I take, it takes value 1. So this is constantly equal to 1. I put three bars to say that for any value of theta, this is 1, which actually tells me that the derivative is equal to 0. OK, yes.
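The chain of equalities just described can be written in one line. This is a sketch for the continuous case, and it assumes, as the lecture does implicitly, that the derivative and the integral can be exchanged.

```latex
\mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta}\log f_\theta(X)\right]
  = \int \frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,dx
  = \int \partial_\theta f_\theta(x)\,dx
  = \frac{\partial}{\partial\theta}\underbrace{\int f_\theta(x)\,dx}_{\equiv\,1}
  = 0 .
```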
That's the definition of the expectation. Oh, sorry. No, that's just the definition of the derivative of the log of a function.
OK. Log of f, prime, is f prime over f, right? OK, OK, OK. That's a log, yeah.
Just by elimination. You had a squiggle that starts with an L. I'm sorry? When you had a squiggle that starts with an F.
Yeah. OK. And you do good, because that's probably how my mind processes. And so I'm like, yeah, L, here's enough information. OK?
Everybody's good with this? So that was convenient. So it just said that the expectation of the derivative of the log likelihood is equal to 0. That's going to be our first identity. Let's move on to the second identity using exactly the same trick, which is let's hope that at some point we have the integral of this function that's constantly equal to 1 as a function of theta and then use the fact that its derivative is equal to 0. So if I start taking the second derivative of the log of f of theta. So what is this?
Well, it's the derivative of this guy here. So I'm going to go straight to it. So it's the second derivative of f theta of x times f theta of x, minus the first derivative of f theta of x times the first derivative of f theta of x. Here is some super important stuff.
No, I'm kidding. And so you can still see that guy over there. So it's just the square. And then I divide by f theta of x squared.
OK? So here I have the second derivative times f itself. And here I have the product of the first derivative with itself. So that's the square. So now I'm going to integrate this guy.
So if I take the expectation of this thing here, what I get is the integral. So here, the only thing that's going to happen when I take my integral is that one of those squares is going to cancel against f theta, right? So I'm going to get the second derivative minus the first derivative squared. And then I divide by f theta.
And I know that this thing is equal to 0. OK? Now, one of this guy here, sorry, why do I have, yeah. So I have this guy here.
So this guy here is going to cancel. So this is what? This is equal to the integral of the partial, so the second derivative of f theta of x.
Right? Because those two guys cancel minus the integral of the second derivative. OK?
And this is telling me what? It's telling me that. Yeah, I'm losing one because I have some weird consequences. Thank you. OK, so this is actually, yeah, OK, because now this is not, well, this is still positive.
So I know this thing, I want to say that this thing is actually equal to 0. But then it gives me some weird things, which are that, I have the integral of a positive function, which is equal to 0. Yeah, that's what I'm thinking of doing, but I'm going to get 0 for this entire integral, which means that I have the integral of a positive function, which is equal to 0, which means that this function is equal to 0, which sounds a little bad. Basically tells me that this function, f theta, is linear. So I went a little too far, I believe, because I only want to prove that the expectation of the second derivative, I want to show that the expectation of the second, I mean, is that true actually?
MEH. Yeah, so I want to pull this out. And so let's see. If I keep rolling with this, I'm going to get that.
Well, no, because the fact that it's divided by f theta means that indeed the second derivative is equal to 0. So I cannot do this here. OK, maybe I should. But the term on the right should be.
That's correct. But let's write it like this. So you're right.
So this is what? This is the expectation of the partial with respect to theta of f theta of x divided by f theta of x, squared, right?
And this is exactly the derivative of the log, right? So indeed, this thing is equal to the expectation with respect to theta of the partial of L, well, of log of f theta divided by partial theta, squared.
So this is one of the guys that I want. And this is actually equal.
So this will be equal to the expectation. Oh, yeah, right, right, right. So this term should be equal to 0. Yeah, this was not 0. You're absolutely right.
So at some point, I got confused because I thought putting this equal to 0 would mean that this is 0. But this thing is not equal to 0. So this thing, you're right. I take the same trick as before, and this is actually equal to 0, which means that now I have what's on the left-hand side, which is equal to what's on the right-hand side. And if I recap, I get that E theta of the second derivative of the log of f theta is equal to minus, because I had a minus sign here, the expectation with respect to theta of the partial of log of f theta divided by partial theta, squared. OK?
Thank you for being on watch when I'm falling apart. All right, so this is exactly what you have here, except that both terms have been put on the same side. All right, so those things are going to be useful to us. So maybe we should write them somewhere here. And then we have that the expectation of the second derivative of the log is equal to minus the expectation of the square of the first derivative.
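Since the board work went back and forth, here is the second identity written out cleanly. It is a sketch under the same assumption as before, namely that derivatives and integrals can be exchanged, so that the integral of the second derivative of f theta vanishes.

```latex
\frac{\partial^2}{\partial\theta^2}\log f_\theta(x)
  = \frac{\partial_\theta^2 f_\theta(x)\, f_\theta(x) - \big(\partial_\theta f_\theta(x)\big)^2}{f_\theta(x)^2},
\qquad\text{so}\qquad
\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(X)\right]
  = \underbrace{\int \partial_\theta^2 f_\theta(x)\,dx}_{=\,0}
  - \int \left(\frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\right)^{2} f_\theta(x)\,dx
  = -\,\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^{2}\right].
```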
OK. And this is indeed my Fisher information. This is just telling me what is the second derivative of my log likelihood at theta. So everything is with respect to theta when I take these expectations.
And so it tells me that the expectation of the second derivative of the log, at least first of all, what it's telling me is that it's concave, right? Because the second derivative of this thing, which is the second derivative of the KL divergence, is actually minus something which must be non-negative. And so it's telling me that it's concave here at this maximum.
OK? And in particular, it's also telling me that it has to be strictly positive unless the derivative of f is equal to 0. So unless f is constant, then I don't have a, it's not going to change. All right.
Do you have a question? So now let's use this. So what does my log likelihood look like when I actually compute it for this canonical exponential family?
I mean, we have this exponential function, so taking the log should make my life much easier. And indeed, it does. So if I look at the canonical family, what I have is that the log of f theta of y, well, it's equal simply to y theta minus b of theta, divided by phi, plus this function that does not depend on theta. So let's see what this tells me. Let's just plug those equalities in there.
I mean, I can take the derivative of the right-hand side and just say that in expectation, it's equal to 0. So if I start looking at the derivative, this is equal to what? Well, here I'm going to pick up only y. So this is a function of y.
I'm going to pick only a. All right. I was talking about likelihood, so I actually need to put the random variable here. So I get y minus the derivative of b of theta. Since it's only a function of theta, I'm just going to write b prime.
Is that OK? Rather than having the partial with respect to theta. Then this is a constant.
This does not depend on theta, so it goes away. So if I start taking the expectation of this guy, I get the expectation of this guy, which is the expectation of y minus, well, this does not depend on y, so it's just itself, b prime of theta. And the whole thing is divided by phi.
But from my first equality over there, I know that this thing is actually equal to 0. Right? We just proved that. So in particular, it means that since phi is non-zero, it means that this guy must be equal to this guy. Or phi is not infinity, say. And so that implies that the expectation with respect to theta of y is equal to b prime of theta.
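Written as formulas, the step just made is the following. It only uses the canonical form of the log likelihood and the first identity.

```latex
\frac{\partial}{\partial\theta}\log f_\theta(Y) = \frac{Y - b'(\theta)}{\phi},
\qquad
0 = \mathbb{E}_\theta\!\left[\frac{Y - b'(\theta)}{\phi}\right]
  = \frac{\mathbb{E}_\theta[Y] - b'(\theta)}{\phi}
\ \Longrightarrow\
\mathbb{E}_\theta[Y] = b'(\theta).
```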
I'm sorry, you're not registered in this class. I'm going to have to ask you to leave. I'm not kidding.
You are? Yeah. Wow, I've never seen you here.
I mean, I saw you for the first lecture. OK. So e theta of y is equal to b prime of theta.
Everybody agrees with that? So this is actually nice. Because if I tell you, OK, I told you, if I give you an exponential family, the only thing I really need to tell you is what b theta is. And if I give you b of theta, then computing a derivative is actually much easier than having to integrate y against the density itself. I mean, you could really have fun and try to compute this, which you would be able to do, right?
And then there's the plus c of y phi blah, blah, blah dy. And that's the way you would actually compute this thing. But that would be really, sorry, this guy is here. That would be painful, right? I don't know what this normalization looks like.
So I would also have to make that explicit so I can actually compute this thing. And just the same way, if you want to compute the expectation of a Gaussian, OK, the expectation of a Gaussian is not the most difficult one. But even if you compute the expectation of a Poisson, you start to have to work a little bit.
There's a few things that you have to work through. Here, I'm just telling you all you have to know is what b of theta is. And then you can just take the derivative. Let's see what the second equality is going to give us.
OK, so what is the second equality? It's telling me that if I look at the second derivative and then I take its expectation, I'm going to have something which is equal to negative this guy squared. Sorry, that was the log, right?
We've already computed this first derivative of the likelihood. It's just the expectation of the square of this thing here. So the expectation of the partial of log f theta of y divided by partial theta, squared, this is equal to the expectation of the square of y minus b prime of theta, divided by phi squared. And sorry, I'm actually going to move on with the. OK. And so if I start computing, what is this thing?
Well, this, we just agreed that this was what? The expectation of theta, right? So that's just the expectation of y.
We just computed here. Yeah, that's b prime. There's a derivative here.
OK? So now, this is what? This is simply?
Anyone? I'm sorry? Variance.
Variance of y, but there's scaling by phi squared, right? OK, so this is negative of the right-hand side of our equality. And now I just have to take one more derivative of this guy. So now if I look at the left-hand side, I have the second derivative of log of f theta of y, the second partial with respect to theta.
So this thing is equal to, well, now I'm not left with much. The y part is going to go away, and I'm left only with minus the second derivative of b of theta, divided by phi. So if I take expectation, well, it just doesn't change. This is deterministic. So now what I've established is that this guy is equal to negative this guy.
So those two things, the signs are going to go away. And so this implies that the variance of y is equal to b prime prime of theta. And then I have a phi in the denominator that cancels only one of the phis in phi squared. So it's times phi. So now I have that my second derivative, since I know phi, is completely determining the variance.
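The variance computation above can be summarized as follows, using the second identity together with the fact that the expectation of y is b prime of theta.

```latex
\mathbb{E}_\theta\!\left[\left(\frac{\partial}{\partial\theta}\log f_\theta(Y)\right)^{2}\right]
  = \frac{\mathbb{E}_\theta\big[(Y - b'(\theta))^2\big]}{\phi^2}
  = \frac{\operatorname{Var}_\theta(Y)}{\phi^2},
\qquad
\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log f_\theta(Y)\right]
  = -\frac{b''(\theta)}{\phi},
```
```latex
-\frac{b''(\theta)}{\phi} = -\frac{\operatorname{Var}_\theta(Y)}{\phi^2}
\ \Longrightarrow\
\operatorname{Var}_\theta(Y) = \phi\, b''(\theta).
```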
So basically, that's why b is called the cumulant generating function. It's not generating moments, but cumulants. But cumulants, in this case, correspond basically to the moments, at least for the first two.
If I start going farther, I'm going to have more combinations of the expectation of y cubed, y squared, and y itself. But as we know, those are the ones that are usually the most useful, at least if we're interested in asymptotic performance. The central limit theorem tells us that all that matters are the first two moments.
And then the rest is just going to go and say, well, it doesn't matter. It's all going to a normal anyway. So let's go to a Poisson, for example.
So if I had a Poisson distribution, all right, so this is a discrete distribution. And what I know is that, so f, let me call mu the parameter, f mu of y, this is, well, OK, so it's mu to the y divided by y factorial, times exponential minus mu. So mu is usually called lambda, and y is usually called x. That's why it takes me a little bit of time, but usually it's lambda to the x over factorial x, times exponential minus lambda.
This thing clearly, since this is just the series expansion of the exponential, when I sum those things from 0 to infinity, this thing sums to 1. But then if I wanted to start understanding what the expectation of this thing is, so if I want to understand the expectation with respect to mu of y, then I would have to compute the sum from k equals 0 to infinity of k times mu to the k over factorial of k, times exponential minus mu, which means that I would essentially have to take the derivative of my series in the end. So I can do this.
This is a standard exercise. You've probably done it when you took probability. But let's see if we can actually just read it off from the first derivative of b.
So to do that, we need to write this in the form of an exponential where there's one parameter that captures mu and that interacts with y just by doing this parameter times y, and then something that depends only on y, sorry, something that depends only on mu. That's the important one. That's going to be our b.
And then there's going to be something that depends only on y. So let's write this and check that this f mu indeed belongs to this canonical exponential family. So I definitely have an exponential that comes from this guy.
So I have minus mu. And then this thing is going to give me what? It's going to give me plus y log mu. And then I'm going to have minus log of y factorial, all right?
OK. So clearly, I have a term that depends only on mu, a term that depends only on y, and I have a product of y and something that depends on mu. If I want to be canonical, I must have this to be exactly the product of y and the parameter theta itself. So I'm going to call this guy theta. So theta is log mu, which means that mu is equal to e to the theta.
And so wherever I see mu, I'm going to replace it by e to the theta, because my new parameter now is theta. So this is what? This is equal to exponential of y times theta. And then I'm going to have minus e to the theta. And then who cares?
Something that depends only on y. So this is my c of y. And phi is equal to 1 in this case. So that's all I care about.
So let's use it. OK, so this is my canonical exponential family. y interacts with theta exactly like this, and then I have this function. So this function here must be b of theta.
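Put together, the rewriting of the Poisson PMF into canonical form reads as follows. The only step is the change of parameter from mu to theta equal to log mu.

```latex
f_\mu(y) = \frac{\mu^{y}}{y!}\,e^{-\mu}
  = \exp\!\big( y\log\mu - \mu - \log(y!) \big)
  = \exp\!\big( y\,\theta - e^{\theta} - \log(y!) \big),
\qquad \theta = \log\mu,
```
```latex
\text{so that}\quad b(\theta) = e^{\theta},\qquad \phi = 1,\qquad c(y,\phi) = -\log(y!).
```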
So from this function, exponential theta, I'm supposed to be able to read what the mean is. So because in this course I always know what the dispersion is, I can actually always absorb it into theta. But here, it's really of the form y times something divided by 1, right? I mean, if it was like log of mu divided by phi, it would be the question of whether I want to call phi my dispersion or if I want to just have it in there. But if I know, so usually, so OK, this makes no difference in practice.
But the real thing is it's never going to happen that this thing, this dispersion, is going to be an exact number. If it's an actual numerical number, this just means that this number should be absorbed in the definition of theta.
But if it's something that is called sigma, say, and I will assume that sigma is known, then it's probably preferable to keep it in the dispersion so you can see that there's this parameter here that you can essentially play with. It doesn't make any difference when you know phi. So now what I have, so if I look at the expectation of some y, so now I'm going to have y which follows my Poisson mu.
I'm going to look at the expectation. And I know that the expectation is b prime of theta, right? Agreed?
That's what I just erased, I think. Agreed with this, the derivative? All right. So what is this?
Well, it's the derivative of e to the theta, which is e to the theta, which is mu. So my Poisson is parameterized by its mean. I can also compute the variance, which is equal to minus the second derivative of, is it, yeah, no, it's equal to the second derivative of b.
Dispersion is equal to 1. Again, if I took phi elsewhere, I would see it here as well. So if I just absorbed phi here, I would see it divided here. So it would not make any difference. And what is the second derivative of the exponential? It's still the exponential.
So it's still equal to mu. OK? So that certainly makes our life easier. Just one quick remark. Here's the b function.
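As a quick numerical sanity check of what was just read off from b, here is a minimal Python sketch. It compares the derivatives of b of theta equal to e to the theta, taken by finite differences, with the brute-force sums against the Poisson PMF. The value of mu, the step size h, and the number of terms in the sum are arbitrary choices made for illustration, not something from the lecture.

```python
import math

def b(theta):
    # Cumulant function of the Poisson family in canonical form (phi = 1).
    return math.exp(theta)

def first_derivative(f, x, h=1e-4):
    # Central finite difference; h is an arbitrary small step.
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    # Central second difference with the same arbitrary step.
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def poisson_moments_by_summation(mu, terms=200):
    # Brute-force alternative: sum k * P(Y = k) and k^2 * P(Y = k) over the PMF.
    pmf = math.exp(-mu)  # P(Y = 0)
    mean = 0.0
    second_moment = 0.0
    for k in range(terms):
        mean += k * pmf
        second_moment += k * k * pmf
        pmf *= mu / (k + 1)  # P(Y = k + 1) obtained from P(Y = k)
    return mean, second_moment - mean ** 2

mu = 3.5              # hypothetical Poisson parameter
theta = math.log(mu)  # canonical parameter
phi = 1.0             # dispersion of the Poisson family

mean_from_b = first_derivative(b, theta)       # b'(theta)
var_from_b = phi * second_derivative(b, theta)  # phi * b''(theta)
mean_by_sum, var_by_sum = poisson_moments_by_summation(mu)

print("mean:", mean_from_b, mean_by_sum)
print("variance:", var_from_b, var_by_sum)
```

Both routes should agree with mu up to numerical error, which is the point of the remark that differentiating b is easier than summing the series.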
I'm giving you a problem. Can the b function of some, this function b, can it ever be equal to log of theta?
Who says yes? Who says no? Why? Yeah. All right.
So what I've learned from this is sort of completely analytic, right? So we just took derivatives and this thing just happened, right? This thing actually allowed us to relate the second derivative b to the variance.
And one thing that we know about a variance is that this is non-negative. And in particular, it's always positive. If I give you a canonical exponential family that has zero variance, trust me, you will see it. It means that this thing is not going to look like something that's finite.
It means it's going to have a point mass. It's going to take value infinity at one point. So this will basically never happen. This thing is actually strictly positive, which means that this thing is always strictly concave.
It means that the second derivative of this function b has to be strictly positive, and so that the function is convex. So this is concave. So this is definitely not working.
I need to have something that looks like this when I talk about my b. So theta squared.
We'll see a bunch of exponential theta. There's a bunch of them. But if you start writing something and you find b, try to think of the plot of b in your mind, and you find that b looks like it's going to be concave, you've made a sign mistake somewhere. All right, so we've done a pretty big parenthesis to try to characterize what the distribution of y was going to be.
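The sign check suggested here is easy to automate. The following Python sketch evaluates a finite-difference second derivative of a few candidate b functions on a small grid. The candidates theta squared over 2 (the Gaussian case, stated here as a known fact rather than something derived in this section), exponential of theta, and log of theta, as well as the grid and step size, are choices made for illustration.

```python
import math

def second_derivative(f, x, h=1e-3):
    # Central second difference; h is an arbitrary small step.
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

# Candidate cumulant functions b(theta).
candidates = {
    "theta^2 / 2": lambda t: t * t / 2,  # Gaussian family
    "exp(theta)": math.exp,              # Poisson family
    "log(theta)": math.log,              # the proposed counterexample
}

grid = [0.5, 1.0, 2.0, 4.0]  # arbitrary positive grid so that log is defined

looks_convex = {}
for name, f in candidates.items():
    values = [second_derivative(f, t) for t in grid]
    # phi * b''(theta) is a variance, so every value must be positive.
    looks_convex[name] = all(v > 0 for v in values)
    print(name, "-> b'' > 0 on the grid:", looks_convex[name])
```

A candidate that fails this check, like log of theta, would give a negative variance and so cannot be the b of a canonical exponential family.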
We wanted to extend from, say, Gaussian to something else. But when we're doing regression, which means the generalized linear models, we are not interested in the distribution of y, but really the conditional distribution of y given x. So I need now to sort of couple those back together.
So what I know is that this, say, mu in this case, which is the expectation, what I want to say is that the conditional expectation of y given mu, sorry, the conditional expectation of y given x, this is some mu of x, right? And when we did linear models, we said, well, this thing was some x transpose beta for linear models. And the whole premise of this chapter is to say, well, this might make no sense because x transpose beta can take the entire range of real values, whereas this mu can take only a partial range.
So even if you actually focus on the poisson, for example, we know that the expectation of a poisson has to be a non-negative number. Actually, a positive number as soon as you have a little bit of variance. It's mu itself.
Mu is a positive number. And so it's not going to make any sense to assume that mu of x is equal to x transpose beta, because you might find some x's for which this value ends up being negative. And so we're going to need what we call the link function, which transforms mu, maps it onto the real line, so that you can now express it in the form x transpose beta. So we're going to take, not this, but we're going to assume that g of mu of x is now equal to x transpose beta, and that's the generalized linear model, right?
OK, so as I said, it's kind of weird to transform mu, rather than x transpose beta, to make it take values in the real line. It feels like, at least to me, it feels a bit more natural to take x transpose beta and make it fit the particular distribution that I want. And so I'm going to want to talk about g and g inverse at the same time.
So I'm going to actually take always g. So g is my link function. And I'm going to want g to be continuously differentiable. Let's say that it has a derivative, and its derivative is continuous. And I'm going to want g to be strictly increasing. OK?
And that actually implies that g inverse exists. Actually, that's not true. What I'm also going to want is that g of mu spans.
OK, how do I do this? Well, OK. So I want the g, as I range over all possible values of mu, whether they're all positive values, or whether they're values that are limited to the interval 0, 1, I want those to span the entire real line, so that when I talk about g inverse, it is defined over the entire real line. I know where I started.
So this implies that g inverse exists. What else does it imply about g inverse? So for a function to be invertible, I only need for it to be strictly monotone, right?
I only need it to be strictly increasing. So in particular, the fact that I picked increasing implies that this guy is actually increasing. OK? That's the image.
OK. So this is my link function. And this slide is just telling me I want my function to be invertible so I can talk about g inverse. I'm going to switch between the two.
So what link functions am I going to get? So for linear models, we just said there's no link function, which is the same as saying that the link function is identity, which certainly satisfies all these conditions. It's invertible. It has all these nice properties.
But might as well not talk about it. For Poisson data, when we assume that the conditional distribution of y given x is Poisson, then mu, as I just said, is required to be positive. So I need a g that goes from the interval 0, infinity to the entire real line. I need a function that starts from one end and takes not only positive values, but values that are split between positive and negative. And here, for example, I could take the log link.
So the log is defined on this entire interval. And as I range from 0 to plus infinity, the log is ranging from negative infinity to plus infinity. So you can probably think of other functions that do that, like 2 times log. That's another one. But there's many others you can think of.
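To make the role of the log link concrete, here is a small Python sketch. It computes the linear predictor x transpose beta for a few covariate vectors and compares the identity link with the log link. The coefficient vector and the covariates are made-up numbers chosen only so that the linear predictor becomes negative.

```python
import math

# Hypothetical coefficients and covariates (first entry plays the role of an intercept).
beta = [0.5, -1.0]
xs = [[1.0, 0.2], [1.0, 2.0], [1.0, 3.5]]

def linear_predictor(x, beta):
    # x transpose beta
    return sum(xi * bi for xi, bi in zip(x, beta))

eta = [linear_predictor(x, beta) for x in xs]

# Identity link: mu(x) = x^T beta, which can be negative (not a valid Poisson mean).
mu_identity = eta

# Log link: g(mu) = log(mu), so mu(x) = g^{-1}(x^T beta) = exp(x^T beta) > 0.
mu_log = [math.exp(e) for e in eta]

for x, e, m in zip(xs, eta, mu_log):
    print("x =", x, " x^T beta =", e, " mu with log link =", m)
```

With the identity link two of the three means come out negative, while the log link always returns a positive mean, and applying the log to that mean recovers the linear predictor.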
But let's say the log is one of them that you might want to think about. And we'll see that this is actually, I mean, it is a natural one in the sense that it's one of the first functions we can think of, but we'll see also that it has another canonical property that makes it a natural choice. The other one is the other example where we had an even stronger condition on what mu could be. Mu could only be a number between 0 and 1. That was the probability of success of a coin flip, right, the probability of success of a Bernoulli random variable.
And now I need g to map 0, 1 to the entire real line. And so here are a bunch of things that you can come up with, because now you start to have, maybe, I mean, I will soon claim that this one, log of mu divided by 1 minus mu, is the most natural one. But maybe if you had never thought of this, that might not be the first function you would come up with, right? You mentioned trigonometric functions, for example.
So maybe you can come up with something that comes from hyperbolic trigonometry or something. So what does this function do? Well, we'll see a picture. But this function does map the interval 0, 1 to the entire real line. We also discussed the fact that we can think reciprocally.
If I want to think about g inverse, what I want is a function that maps the entire real line into the unit interval. And as we said, if I'm not a very creative statistician or probabilist, I can just pick my favorite continuous strictly increasing cumulative distribution function, which, as we know, will arise as soon as I have a density that has support on the entire real line. If I have support everywhere, meaning that the density is strictly positive everywhere, then it means that my cumulative distribution function has to be strictly increasing. And of course, it has to go from 0 to 1, because that's just the nature of those things.
And so for example, I can take the Gaussian. That's one such function. But I could also take the double exponential that looks like an exponential on one end and an exponential on the other end.
Basically, if you take capital Phi, which is the standard Gaussian cumulative distribution function, it does work for you. And you can take its inverse. And in this case, so this guy is called logit. And this guy is called probit. And you see it usually every time you have a package on generalized linear models you're trying to implement.
You have this choice. And for what's called logistic regression, so it's kind of funny that it's called logistic regression, but you can actually use the probit link, which in this case, I guess, is called probit regression. But those things are essentially equivalent, and it's really a matter of taste.
Maybe of community. Some communities might prefer one or the other. We'll see that, again, as I claimed before, the logistic one, the logit one, has a slightly more compelling argument for its reason to exist. I guess for this one, the compelling argument is that it involves the standard Gaussian, which, of course, is something that should show up everywhere. And then you can think about crazy stuff. Even something that crazy gets a name, complementary log log, which is the log of minus log of 1 minus mu. Why not? So I guess you can iterate that thing.
You can just put a log 1 minus in front of this thing and it's still going to go. All right. So that's not true. I have to put a minus and take another.
Well, I don't know. No, that's not true. OK. So you can think of whatever you want.
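The three links mentioned for Bernoulli data, together with their inverses, can be sketched in a few lines of Python using only the standard library. The function names below are illustrative and not taken from any particular GLM package.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard Gaussian, mean 0 and standard deviation 1

def logit(mu):
    # log(mu / (1 - mu)), maps (0, 1) onto the real line
    return math.log(mu / (1 - mu))

def inv_logit(t):
    # e^t / (1 + e^t), written in an equivalent form
    return 1 / (1 + math.exp(-t))

def probit(mu):
    # Inverse of the standard Gaussian CDF (capital Phi inverse)
    return _std_normal.inv_cdf(mu)

def inv_probit(t):
    # Standard Gaussian CDF (capital Phi)
    return _std_normal.cdf(t)

def cloglog(mu):
    # Complementary log log: log(-log(1 - mu))
    return math.log(-math.log(1 - mu))

def inv_cloglog(t):
    return 1 - math.exp(-math.exp(t))

links = {
    "logit": (logit, inv_logit),
    "probit": (probit, inv_probit),
    "cloglog": (cloglog, inv_cloglog),
}

probabilities = [0.1, 0.5, 0.9]
for name, (g, g_inv) in links.items():
    # Each g should be strictly increasing and g_inv should undo it.
    print(name, [round(g(p), 4) for p in probabilities])
```

Each of these is strictly increasing on the interval 0, 1 and has an inverse defined on the whole real line, which are the requirements placed on g earlier.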
But now, you can actually, so I'll show, so I claim that the logit link is the natural choice. So here's a picture. I should have actually plotted the other one so we can actually compare it.
To be fair, I don't even remember how it would actually fit in those two functions. So the blue one, which is this one for those of you who don't see the difference between blue and red, sorry about that. So the blue one is the logistic one. So this guy is the function that does e to the x over 1 plus e to the x. As you can see, this is a function that's supposed to map the entire real line into the interval 0, 1. So that's supposed to be the inverse of your function.
And I claim that this is the inverse of the logit function. And the other one, well, this is the Gaussian CDF. So you know it's clearly the inverse of the inverse of the Gaussian CDF.
And that's the red one. That's the one that goes here. So one of the things that I would guess is that the complementary log log is something that's probably going above here and for which the slope is actually even a little flatter as you cross 0. So of course, these are not our link functions. These are the inverses of our link functions.
So what do they look like when I actually basically flip my thing like this? So this is what I see. And so I can see that in blue, this is my logistic link. So I have it crosses 0 with a slightly faster rate.
Remember, our hope, if we could use the Identity, that would be very nice to us, right? We would just want to take the identity. The problem is that if I start having the identity that goes here, it's going to start being a problem.
And this is the probit link, the capital Phi inverse that you see here. It's a little flatter. I mean, you can check: you can compute the derivative of those guys. So what is the derivative of the logit?
So I'm taking the derivative of log of x over 1 minus x. So it's 1 over x plus 1 over 1 minus x. And this lives on the interval 0, 1, so I'm interested in the slope at 0.5.
So at 0.5, what I get is 2 plus 2, which is 4. OK, so that's the slope that we get.
And what is the derivative of capital Phi inverse? Well, it's 1 divided by little phi of capital Phi inverse of x. At x equals 1/2, capital Phi inverse is 0, so I just need the derivative of capital Phi at 0, which is little phi of 0, which is just 1 over square root 2 pi.
And then just say, well, the slope has to be 1 over that. OK, so square root 2 pi. OK, so that's just a comparison. But again, so far, we do not have any reason to prefer one to the other. And so now I'm going to start giving you some reasons to prefer one to the other.
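The slope comparison at 1/2 can be checked numerically. This is a small sketch using only the Python standard library; the function names are made up for the illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Gaussian

def logit_slope(mu):
    # d/dmu log(mu / (1 - mu)) = 1/mu + 1/(1 - mu)
    return 1.0 / mu + 1.0 / (1.0 - mu)

def probit_slope(mu):
    # d/dmu Phi^{-1}(mu) = 1 / phi(Phi^{-1}(mu)), with phi the Gaussian density
    return 1.0 / std_normal.pdf(std_normal.inv_cdf(mu))

print(logit_slope(0.5))   # 2 + 2 = 4
print(probit_slope(0.5))  # sqrt(2 * pi), roughly 2.5066
```

So at 1/2 the logit link is the steeper of the two, and the probit link is the flatter one.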
And actually, for each canonical family, there is something which is called the canonical link. And when you don't have any other reasons to choose anything else, why not choose the canonical one? And the canonical link is the one that says, OK, what I want is g to map mu onto the real line. But mu is not the parameter of my canonical family.
I mean, here, for example, mu is e to the theta, but the canonical parameter is theta. And the parameter of a canonical exponential family is something that lives in the entire real line, right? It was defined for all thetas. And so in particular, I can just take theta to be the one that's x transpose beta. So I'm just going to try to find the link that says, OK, when I take g of mu, I'm going to map it to theta.
So I know that g of mu is going to be equal to x transpose beta. And now what I'm going to say is, OK, let's just take the g that makes this guy equal to theta, so that it's theta that I actually model as x transpose beta. Feels pretty canonical, right?
I mean, what else? What other central easy choice would you take? This was pretty easy.
There is a natural parameter for this canonical family. And it takes value on the entire real line. I have a function that maps mu onto the entire real line.
So let's just map it to the actual parameter. OK. So now what I claim, OK, why do I have this? Well, we've already figured that out. The canonical link function is strictly increasing.
OK, sorry. So now I want g of mu to be equal to theta, which is equivalent to saying that I want mu to be equal to g inverse of theta.
But we know that mu is what? b prime of theta. So that means that b prime is the same function as g inverse.
And I claim that this is actually giving me, indeed, a function that has the properties that I want. Because before, I said just pick any function that has these properties. And now I'm giving you a very hard rule to pick it.
So you still need to check that it satisfies those conditions, in particular that it's increasing and invertible. And for this to be strictly increasing and invertible, really what I need is that the inverse is strictly increasing and invertible, which is the case here because b prime, as we said, is the derivative of a strictly convex function. And b has a second derivative that's strictly positive.
We just figured it out using the fact that the variance, which is phi times b prime prime, was strictly positive. And if phi is strictly positive, then b prime prime has to be strictly positive. So b prime prime is the derivative of a function called b prime. If your derivative is strictly positive, you are strictly increasing. And so we know that b prime is indeed strictly increasing.
And what I need also to check, well, I guess this is already checked in its own, because b prime is actually mapping all of r into the possible values. I mean, when theta ranges on the entire real line, then b prime ranges in the entire interval of the mean values that it can take. And so now I have this thing that's completely defined.
b prime inverse is a valid link, and it's called the canonical link. OK. So again, if I give you an exponential family, which is another way of saying I give you a convex function b, which gives you some exponential family, then if you just take b prime inverse, this gives you the associated canonical link for this canonical exponential family.
So OK, clearly there's an advantage of doing this, which is I don't have to actually think about which one to pick if I don't want to think about it. But there's other advantages that come to it. And we'll see that in the representation.
There's basically going to be some cancellations that show up. So before we go there, let's just compute the canonical link for the Bernoulli distribution. So remember, the Bernoulli distribution has a PMF, which is part of the canonical exponential family. So the PMF of the Bernoulli,
OK, so let me just write it like this. It's p to the y times 1 minus p to the 1 minus y, which I will write as exponential of y log p plus 1 minus y times log of 1 minus p.
OK, we've done that last time. Now I'm going to group my terms in y to see how y interacts with this parameter p. And what I'm getting is y times log of p divided by 1 minus p. And then the only term that remains is log of 1 minus p.
OK? Now, I want this to be a canonical exponential family. So it is part of the exponential family; I can read that off.
If I want it to be canonical, this guy must be theta itself. OK, so I have that theta is equal to log of p over 1 minus p. If I invert this thing, it tells me that p is e to the theta divided by 1 plus e to the theta.
That's just inverting this function. And so, in particular, it means that log of 1 minus p is equal to log of 1 minus this thing. The e to the thetas go away
in the numerator, so what I get is log of 1 over 1 plus e to the theta, right? Which is equal to minus log of 1 plus e to the theta.
OK? So I'm going a bit fast, but these are very elementary manipulations. Maybe it requires one more line to convince yourself, but just do it in the comfort of your room.
And then what you have is the exponential of y times theta. And then I have minus log of 1 plus e to the theta. So this is the representation of the PMF of a Bernoulli distribution as a member of the canonical exponential family.
And it tells me that b of theta is equal to log of 1 plus e to the theta. That's what I have there. From there, I can compute the expectation, and, you know, hopefully I'm going to get p as the mean and p times 1 minus p as the variance. Otherwise, that would be weird.
So let's just do this. b prime of theta should give me the mean. And indeed, b prime of theta is e to the theta divided by 1 plus e to the theta, which is exactly this p that I had there. OK, just for fun, let's, well, I don't know.
Maybe that's not part of it. Yeah, let's not compute the second derivative. It's probably going to be on your homework at some point, if not on the final.
All right, so b prime now, we know, oh, I erased it, of course. g, the canonical link is b prime inverse. And I claim that this is going to give me the logit function, log of mu over 1 minus mu. So let's check that. So b prime is this thing.
So now I want to find the inverse. Well, I should really call my inverse a function of p. And I've done it before.
All I have to do is to solve this equation, which I've actually just done. That's where I'm actually coming from. So it's actually telling me that the solution of this thing is equal to log of p over 1 minus p.
We just solve this thing both ways. And this is indeed logit of p, by definition of logit.
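The Bernoulli computation can be sanity-checked numerically. The sketch below, with hypothetical function names, verifies three things at a test value of theta: the canonical form reproduces the usual PMF, b prime gives the mean p, and the logit inverts b prime.

```python
import math

def b(theta):
    # Bernoulli: b(theta) = log(1 + e^theta)
    return math.log1p(math.exp(theta))

def b_prime(theta):
    # b'(theta) = e^theta / (1 + e^theta), which should equal the mean p
    return math.exp(theta) / (1.0 + math.exp(theta))

def canonical_link(mu):
    # g = (b')^{-1}, claimed to be the logit: log(mu / (1 - mu))
    return math.log(mu / (1.0 - mu))

def pmf_direct(y, p):
    # Usual Bernoulli PMF: p^y (1 - p)^(1 - y), for y in {0, 1}
    return p ** y * (1.0 - p) ** (1 - y)

def pmf_canonical(y, theta):
    # Canonical exponential family form: exp(y * theta - b(theta))
    return math.exp(y * theta - b(theta))

def numerical_derivative(f, x, eps=1e-6):
    # Central finite difference, used only as a sanity check
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

theta = 0.7             # arbitrary test value
p = b_prime(theta)      # the corresponding mean
print(pmf_direct(1, p), pmf_canonical(1, theta))
print(numerical_derivative(b, theta), p)
print(canonical_link(p), theta)
```

Each printed pair should agree up to floating-point and finite-difference error.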
So the logit, this function that seemed to come out of nowhere, is really just the inverse of b prime, which we know is the canonical link. And canonical is some sort of ad hoc choice that we've made by saying, let's just take the link such that g of mu is giving me the actual canonical parameter theta. Yeah? Yeah, you're right.
Now, of course, I'm going through all this trouble. But you could see it immediately. I know this is going to be theta.
We also have prior knowledge, hopefully, that the expectation of a Bernoulli is p itself. So right at this step, when I say that I'm going to take theta to be this guy, I already knew that the canonical link was the logit link. Because I just said, oh, here's theta, and it's just this function of mu. Bam.
OK, so you can do that for a bunch of examples. And this is what they're going to give you. So in the Gaussian case, b of theta, we've actually computed it actually once. This is theta squared over 2. So the derivative of this thing is really just theta, which means that g or g inverse is actually equal to the identity.
And again, sort of sanity check. When I'm in the Gaussian case, there's nothing generalized about generalized linear models. You don't need a link. The Poisson case, you can actually check.
Did we do this, actually? Yes, we did it, right? So that's when we had this e to the theta.
And so b is e to the theta, and so is b prime, which means that the natural link is the inverse of the exponential, which is log. And so that's the logarithm link, for which, as I said, I use the word natural. Now, you can also use the word canonical if you want to describe this function as being the right function to map the positive real line to the entire real line.
The Bernoulli, we just did it. So b, the cumulant function, is log of 1 plus e to the theta, and the canonical link is log of mu over 1 minus mu. And for the gamma, where b is minus log of minus theta, the reciprocal link is the link that actually shows up, so minus 1 over mu.
That maps. This is another link. Sorry, this is a OK.
Yeah, OK, that's the one. OK, so are there any questions about canonical links, canonical families? I use the word canonical a lot. But is everything fitting together right now? So we have this function.
We have a canonical exponential family by assumption. It has a function b, which contains every information we want. At the beginning of the lecture, we established that it has information about the mean in the first derivative, about the variance in the second derivative. And it's also giving us the canonical link.
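The four examples can be collected and checked in one place. This is a sketch under the parameterizations quoted in the lecture (unit variance for the Gaussian, theta negative for the gamma); the dictionary layout and helper names are assumptions made for the illustration.

```python
import math

# For each family: (b, b', canonical link g = (b')^{-1}).
FAMILIES = {
    "gaussian": (lambda t: t * t / 2.0,
                 lambda t: t,
                 lambda mu: mu),                         # identity link
    "poisson": (math.exp,
                math.exp,
                math.log),                               # log link
    "bernoulli": (lambda t: math.log1p(math.exp(t)),
                  lambda t: 1.0 / (1.0 + math.exp(-t)),
                  lambda mu: math.log(mu / (1.0 - mu))), # logit link
    "gamma": (lambda t: -math.log(-t),                   # requires theta < 0
              lambda t: -1.0 / t,
              lambda mu: -1.0 / mu),                     # reciprocal link
}

# One admissible test value of theta per family (negative for the gamma).
TEST_THETAS = {"gaussian": 0.7, "poisson": 0.7, "bernoulli": 0.7, "gamma": -0.7}

def numerical_derivative(f, x, eps=1e-6):
    # Central finite difference, used only as a sanity check
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def check_family(name):
    """Check that b' is the derivative of b and that g inverts b'."""
    b, b_prime, link = FAMILIES[name]
    theta = TEST_THETAS[name]
    mean_ok = abs(numerical_derivative(b, theta) - b_prime(theta)) < 1e-6
    link_ok = abs(link(b_prime(theta)) - theta) < 1e-9
    return mean_ok and link_ok

for name in FAMILIES:
    print(name, check_family(name))
```

Once b is written down, the mean function and the canonical link both follow from it, which is the point of the summary above.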
So just cherish this b once you've found it, because it's everything you need. Yeah? What would be an example where you have, I don't know, political preference? You know, I don't know, honestly. If I were a serious practitioner, I probably would have a
better answer for you. At this point, I just don't. I think it's a matter of practice and actual preferences.
You can also try both. We didn't mention that there's this idea of cross-validation. Well, we mentioned it without going too much into detail.
But you could try both and see which one performs best on a yet unseen data set in terms of prediction, and just say, I prefer this one of the two. Because this actually comes as part of your modeling assumption, right? Not only did you decide to model the image of mu through the link function as a linear model, but really your model has two pieces: you have the distribution of y, but you also have the fact that mu is modeled as g inverse of x transpose beta. And for different g's, these are just different modeling assumptions.
So why should this be linear? This thing be linear? I don't know.
My authority, as a person who has not examined many data sets with both things, would be that the changes are fairly minor. OK. So this was all for one observation. We just basically did probability. We described some densities, some properties of the densities, how to compute expectations.
That was really just probability. There was no data involved at any point. We did a bit of modeling, but it was all for one observation.
What we're going to try to do now is the reverse engineering of probability, that is, statistics: given data, what can I infer about my model? Now remember, there are three parameters that are sort of floating around in this model, right? There's one that was theta. There's one that was mu, and there's one that is beta.
So those are the three parameters that are floating around. What we wanted to say, what we said is that the expectation of y given x is mu of x. So if I estimate mu, I know the conditional expectation of y given x, which definitely gives me Theta of x, right?
How do I go from mu of x to theta of x? Yeah. The inverse of what?
Of the arrow? Yeah. Sure, but how do I go from this guy to this guy?
So theta as a function of mu is? Yeah, right? So we just computed that mu was b prime of theta.
So it means that theta is just b prime inverse of mu. OK? So those two things are the same as far as we're concerned, because we know that b prime is strictly increasing.
It's invertible. So it's just a matter of reparameterization. And we just can switch from one to the other whenever we want. But why we go through mu, because so far for the entire semester, I told you there's one parameter that's theta.
It does not have to be the mean. And that's the parameter that we care about. That's the one on which we want to do inference. That's the one for which we're going to compute the Fisher information. This was the parameter that was our object of worship.
And now I'm saying, oh, there's this mu that's coming around. And why we have mu is because this is the mu that we use to go to beta. So I can go freely from theta to mu using b prime or b prime inverse. And now I can go from mu to beta because I have that g of mu of x is beta transpose x.
So in the end, now this is going to be my object of worship. This is going to be the parameter that matters, because once I set beta, I set everything else through this chain. So the question is, if I start stacking up this pile of parameters, so I start with my beta, which in turn gives me a mu, which in turn gives me a theta, can I just have a long, streamlined, what is the outcome when I actually start writing my likelihood, not as a function of theta, not as a function of mu, but as a function of beta, which is the one at the end of the chain? And hopefully, things are going to happen nicely, and they might not.
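The chain from beta to mu to theta can be written out for one concrete case. The sketch below uses the Poisson family with its log link; the values of `beta` and `x` are made up purely for illustration.

```python
import math

def linear_predictor(beta, x):
    # x^T beta for a single observation x
    return sum(b_j * x_j for b_j, x_j in zip(beta, x))

def mu_from_beta(beta, x):
    # Log link: g(mu) = log(mu) = x^T beta, so mu = exp(x^T beta)
    return math.exp(linear_predictor(beta, x))

def theta_from_mu(mu):
    # Poisson: b(theta) = e^theta, so mu = b'(theta) = e^theta and theta = log(mu)
    return math.log(mu)

beta = (0.2, 0.3)   # hypothetical coefficients
x = (1.0, 2.0)      # hypothetical covariates (first entry acts as an intercept)

mu = mu_from_beta(beta, x)
theta = theta_from_mu(mu)
print(linear_predictor(beta, x), mu, theta)
```

Because the log link is the canonical one for the Poisson family, theta comes back equal to x transpose beta: once beta is set, everything else is set through the chain.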
Yeah? So what about mu of x? It's g.
It's g. That's my link, right? g of mu of x.
Now, mu is a function of x because it's conditional on x. So this is really theta of x. mu of x.
But b is not a function of x, because it's just something that tells me what the function of x is. mu is the conditional expectation of y given x.
It has actually a fancy name in the statistics literature. It's called, anybody knows the name of this function, mu of x, which is the conditional expectation of y given x? That's the regression function. That's the actual definition. If I tell you what is the definition of the regression function, that's just the conditional expectation of y given x.
And I could look at any property of the conditional distribution of y given x. I could look at the conditional 95th percentile. I could look at the conditional median. I could look at the conditional interquartile range.
I could look at the conditional variance. But I decide to look at the conditional expectation, which is called the regression function. OK? Yes? Just to be sure, can you Oh, there's no transpose here.
Actually, only Victor Emmanuel uses prime for transpose, and I found it confusing with the derivatives. So prime here is only a derivative. That is a matrix of 3. Oh, yeah, sorry, beta transpose x.
Sorry, I thought you said. OK. So you said what? I said that g of mu of x is beta transpose x?
Yeah, mu of x is. Isn't that the same thing? X is a vector here, right?
So X transpose beta and beta transpose X are the same thing. So beta looks like this. X looks like this.
So it's just a simple number. Yeah, you're right. When I'm going to start to look at matrices, I'm going to have to be slightly more careful when I do this.
OK, so let's do the reverse engineering. I'm giving you data. From this data, hopefully you should be able to get the conditional distribution: if I give you an infinite
amount of data, of pairs x, y, you would know exactly what the conditional distribution of y given x is. And in particular, you would know what the conditional expectation of y given x is, which means that you would know mu, which means that you would know theta, which means that you would know beta.
Now, when I have a finite number of observations, I'm going to try to estimate mu of x. But really, I'm going to go the other way around: because I assume specifically that g of mu of x is x transpose beta, I only have to estimate beta, which is a much simpler object than the entire regression function. And so that's what I'm going to go for.
I'm going to try to represent the log likelihood of my data as a function not of theta, not of mu, but of beta, and then maximize that guy. All right, so now rather than thinking of just one observation, I'm going to have a bunch of observations.
And this might actually look a little confusing, so let's just make sure that we understand each other before we go any further. So I'm going to have observations x1, y1, all the way to xn, yn, just like in a natural regression problem, except that here my y's might be 0, 1 valued.
They might be positive valued. They might be exponential. They might be anything in the canonical exponential family. OK, so I have this thing: my observations are x1, y1, all the way to xn, yn.
And what I want is, I'm going to assume that the conditional distribution of yi given xi is something that has a density of this canonical form, with parameter theta i. Did I put an i on y? Yeah. I'm not going to deal with the phi and the c now.
And why do I have theta i and not theta? It's because theta i is really a function of xi, right? So it's really theta i of xi. But what do I know about theta i of xi?
It's actually equal to b prime inverse of mu of xi.
And I'm also going to assume that this thing is of the form beta transpose xi. And this is why I have theta i: it's because this theta i is a function of xi, and I'm going to assume a very simple form for this thing. Sorry, sorry, sorry, sorry.
Yeah, sorry, I should not write it like this. This is only when I have the canonical link. So in general, this is actually equal to b prime inverse of g inverse of xi transpose beta. And when I have the canonical link, those two things are actually canceling each other. OK?
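Written out, what is on the board at this point is presumably the following. The exact form of the density is an assumption based on the canonical exponential family described at the start of the lecture, with dispersion phi and normalization c.

```latex
% Conditional density of y_i given x_i (assumed canonical form)
f_{\theta_i}(y_i) = \exp\!\left( \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right),
\qquad
\theta_i = (b')^{-1}\big(\mu(x_i)\big) = (b')^{-1}\big(g^{-1}(x_i^\top \beta)\big).

% With the canonical link g = (b')^{-1}, the two inverses cancel:
\theta_i = x_i^\top \beta .
```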
So as before, I'm going to stack everything into some, well, actually I'm not going to stack anything for the moment. I'm just going to give you a peek at what's happening next week rather than just manipulating the data. All right, so here is how we're going to proceed at this point.
Well, now I'm going to want to write my likelihood function not as a function of theta, but as a function of beta, because that's the parameter I'm actually trying to maximize, right? So if I have a link, so this thing that matters here, I'm going to call h. By definition, this is going to be h of xi transpose beta. Helena, you have a question? No.
OK. So this is just all the things that we know. Theta is just by definition of the fact that mu is b prime of theta.
The mean is b prime of theta. It means that theta is b prime inverse of mu. And then mu
is modeled through the systematic component: g of mu is xi transpose beta, so mu is g inverse of xi transpose beta. So I want to have b prime inverse of g inverse. This function is a bit annoying to say, so I'm just going to call it h.
And when I do the composition of two inverses, I get the inverse of the composition of those two things in the reverse order. So h is really the inverse of g composed with b prime. And now if I have the canonical link, since I know that g is b prime inverse, this is really just the identity.
So as you can imagine, this entire thing, which is actually quite complicated, just does not show up when I have the canonical link. I really just have that theta i can be replaced by xi transpose beta. So think about going back to this guy here.
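To see what h does in practice, here is a minimal sketch for the Bernoulli family, comparing the canonical logit link with the probit link. The helper `make_h` and the other names are hypothetical, introduced only for this illustration.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # standard Gaussian

def b_prime_inverse(mu):
    # Bernoulli: (b')^{-1}(mu) = log(mu / (1 - mu))
    return math.log(mu / (1.0 - mu))

def g_inverse_logit(t):
    # Inverse of the logit link
    return 1.0 / (1.0 + math.exp(-t))

def g_inverse_probit(t):
    # Inverse of the probit link: the standard Gaussian CDF
    return std_normal.cdf(t)

def make_h(g_inverse):
    # h = (b')^{-1} composed with g^{-1}, i.e. the inverse of (g composed with b')
    return lambda t: b_prime_inverse(g_inverse(t))

h_canonical = make_h(g_inverse_logit)   # should be the identity
h_probit = make_h(g_inverse_probit)     # a genuinely nonlinear function

for t in (-1.0, 0.0, 0.8):
    print(t, h_canonical(t), h_probit(t))
```

With the logit link, h returns its argument, so theta i is xi transpose beta. With the probit link, h is a nonlinear function that has to be carried through the likelihood.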
Now theta becomes only xi transpose beta. That's going to be much more simple to optimize. Because remember, when I'm going to do log likelihood, this thing is going to go away. I'm going to sum those guys. And so what I'm going to have is something which is essentially linear in beta.
And then I'm going to have this minus b, which is just minus the sum of convex functions of beta. And so I'm going to have to bring in the tools of convex optimization. Now it's not just going to be take the gradient, set it to 0. It's going to be more complicated to do that.
I'm going to have to do that in an iterative fashion. And so that's what I'm telling you. When you look at your log likelihood for all those functions, you sum. The exponential goes away because you had the log. And then you have all these things here.
You kept the b. I kept the h. But if h is the identity, this is the linear part, yi times xi transpose beta, minus b of my theta, which is now only xi transpose beta. And that's the function I want to maximize in beta. It's a simple one; I mean, it's a concave function, a linear part minus a convex part.
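In symbols, and assuming the canonical density form with known dispersion phi, the log likelihood being described would read as follows. The notation \ell_n is an assumption introduced here.

```latex
% General link, with h = (g \circ b')^{-1}
\ell_n(\beta) = \sum_{i=1}^n \frac{y_i \, h(x_i^\top \beta) - b\big(h(x_i^\top \beta)\big)}{\phi}
              + \sum_{i=1}^n c(y_i, \phi)

% Canonical link, h = identity; the c terms do not depend on \beta
\ell_n(\beta) = \frac{1}{\phi} \sum_{i=1}^n \Big( y_i \, x_i^\top \beta - b(x_i^\top \beta) \Big)
              + \text{const}
```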
When I know what b is, I have an explicit formula for this. And I want to just bring in some optimization. And that's what we're going to do. And we're going to see three different methods, which are really basically the same method.
It's something which is based on just adaptation or specialization of the so-called Newton-Raphson method, which is essentially telling you do iterative local quadratic approximations through your function. So second order Taylor expansion, minimize this guy, and then do it again from where you were. And we'll see that this can be actually implemented using what's called iteratively reweighted least squares, which means that every step, since it's just a quadratic, there's going to be just squares in there.
Every step can actually be solved by using a weighted least squares version of the problem. So I'm going to stop here for today. So we'll continue and probably not finish this chapter, but finish next week.
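As a preview, here is a minimal sketch of Newton-Raphson for the Bernoulli family with the canonical (logit) link, with one covariate plus an intercept so that the 2-by-2 linear system can be solved by hand. The data are made up for illustration, and this is only one way to organize the iteration, not necessarily the form the course will use.

```python
import math

def sigmoid(t):
    # b'(t) for the Bernoulli family: the mean as a function of theta
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(beta, xs, ys):
    # sum_i [ y_i * theta_i - b(theta_i) ], theta_i = beta0 + beta1 * x_i
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta[0] + beta[1] * x
        total += y * eta - math.log1p(math.exp(eta))
    return total

def newton_step(beta, xs, ys):
    # Gradient: sum_i (y_i - mu_i) * (1, x_i)
    # Negative Hessian: sum_i w_i * (1, x_i)(1, x_i)^T with w_i = mu_i (1 - mu_i)
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(beta[0] + beta[1] * x)
        w = mu * (1.0 - mu)
        g0 += y - mu
        g1 += (y - mu) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    # Solve the 2x2 weighted normal equations for the update direction
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (beta[0] + d0, beta[1] + d1)

def fit_logistic(xs, ys, n_iter=25, tol=1e-10):
    beta = (0.0, 0.0)
    for _ in range(n_iter):
        new_beta = newton_step(beta, xs, ys)
        if max(abs(new_beta[0] - beta[0]), abs(new_beta[1] - beta[1])) < tol:
            return new_beta
        beta = new_beta
    return beta

# Hypothetical data: overlapping 0/1 responses, so the maximizer is finite
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

beta_hat = fit_logistic(xs, ys)
print(beta_hat, log_likelihood(beta_hat, xs, ys))
```

The linear system solved in each step has the form of a weighted least squares problem with weights w_i, which is where the name iteratively reweighted least squares comes from.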
And then I think there's only one lecture. Actually, for the last lecture, what do you guys want to do? Do you want to have donuts and cider?
Do you want to just have some more outlooking lecture on what's happening post-1975 in statistics? Do you want to have a review for the final exam? Pragmatic people. All right. No, I think that you can have interesting advanced and interesting things.
You want to do interesting advanced topics for the last lecture? Yeah. Yeah, so that's sort of.
That we haven't thought of yet. Yeah, the thing is, I think, so there's two. That's basically what I'm asking, right?
Interesting advanced topics versus ask me any question you want. Those questions can be about interesting advanced topics, though. Like, what are interesting advanced topics?
I'm sorry? Interesting with donuts. Is that OK? Yeah, yeah, we can always do the donuts.
We can always do the donuts. As long as there are donuts. All right, we'll do that.
All right, so you guys have a good weekend.