Transcript for:
Estimators: Method of Moments vs MLE

Hello and welcome. This lecture is devoted to a set of examples in which I will derive both the method of moments estimator (MME) and the maximum likelihood (ML) estimator for different problems, show you the similarities and differences, and illustrate how these estimators can look and how differently they can behave.

Let us start with the exponential distribution. For the exponential distribution, lambda hat MME is already known: if you remember, it is n / (x1 + ... + xn). We have seen and derived this before. How do we do ML? Remember the pdf is lambda e^(-lambda x) for x > 0, so the likelihood is the product from i = 1 to n of lambda e^(-lambda xi). Multiplying these together gives lambda^n e^(-lambda (x1 + ... + xn)). This turns out to be easy enough: take the log, so lambda* is the argmax over lambda of n log(lambda) - lambda (x1 + ... + xn). Differentiate and equate to 0: n/lambda - (x1 + ... + xn) = 0, which gives lambda* = n / (x1 + ... + xn). So lambda hat ML is also the same quantity, n / (x1 + ... + xn). A simple recipe, isn't it? The distribution is described very simply by lambda e^(-lambda x); you plug it in, do the calculation, and you get the answer. So here again the method of moments estimator and the ML estimator agree, which is a nice result to have.
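If you want to check this on a computer, here is a minimal sketch in Python, assuming numpy is available; the true rate 2.5 and the sample size 1000 are made-up values used only to generate data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up values just for the simulation: the true rate and the sample size.
lambda_true = 2.5
n = 1000

# numpy parameterizes the exponential by its mean, so scale = 1 / lambda.
x = rng.exponential(scale=1.0 / lambda_true, size=n)

# Both estimators are n / (x1 + ... + xn), i.e. one over the sample mean.
lambda_mme = n / x.sum()
lambda_mle = 1.0 / x.mean()   # algebraically the same quantity

print(lambda_mme, lambda_mle)   # both should come out close to 2.5
```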
That was an easy distribution, so now we will start seeing some slightly non-trivial cases. The first is a discrete case where, instead of having just the two values 0 and 1, I have three values: 1, 2 and 3. A sample is 1 with probability p1, 2 with probability p2 and 3 with probability p3. The first thing I want you to note is that p3 = 1 - p1 - p2, because the three probabilities have to add up to 1 (and each lies between 0 and 1). The data are x1 to xn. Because of this constraint, I will take p1 and p2 to be the unknown parameters I have to find; I should have said that earlier, but keep it in mind.

Let us do the MME first. I am going to have two equations. The first sample moment m1 is equated to E[X] = p1 + 2 p2 + 3 p3; that is one equation. The other equation equates m2 to the second moment, p1 + 4 p2 + 9 p3. Now substitute p3 = 1 - p1 - p2. The first equation becomes m1 = p1 + 2 p2 + 3 - 3 p1 - 3 p2 = 3 - 2 p1 - p2, which you can rewrite as 2 p1 + p2 = 3 - m1. The second equation becomes m2 = p1 + 4 p2 + 9 - 9 p1 - 9 p2 = 9 - 8 p1 - 5 p2, which you can rewrite as 8 p1 + 5 p2 = 9 - m2. Now solve these two equations: multiply the first by 5 and subtract the second, which gives 2 p1 = 15 - 5 m1 - 9 + m2 = 6 - 5 m1 + m2, so p1 = 3 - (5/2) m1 + m2/2. Then p2 = 3 - m1 - 2 p1 = 4 m1 - m2 - 3. (You can check this quickly: plugging these back in, 2 p1 + p2 comes out to 3 - m1 and 8 p1 + 5 p2 to 9 - m2, as required, so the solution is consistent.) So p1 hat MME = 3 - (5/2) m1 + m2/2 and p2 hat MME = 4 m1 - m2 - 3, where m1 is the sample mean x bar = (1/n) sum of xi and m2 = (1/n) sum of xi^2 is the second sample moment. It required some work, but hopefully you are convinced it is not too difficult: you set up the moment equations and painfully solve them. Both MME estimators come out as linear functions of m1 and m2, which is good enough.

Now, how will ML look for this case? The case is becoming slightly more involved than we thought, but ML actually happens to be very easy. The likelihood is p1^w1 * p2^w2 * (1 - p1 - p2)^(n - w1 - w2), where w1 is the number of ones in the sample and w2 is the number of twos. So the likelihood is really easy. Take the log: (p1*, p2*) is the argmax over (p1, p2) of w1 log(p1) + w2 log(p2) + (n - w1 - w2) log(1 - p1 - p2). Differentiate, and you will get these very elegant and nice answers: p1* = w1/n and p2* = w2/n.
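Here is a minimal sketch of both estimators on simulated data, assuming numpy; the true probabilities and the sample size are made up. (A small algebra exercise shows that the MME expressions above in fact reduce to w1/n and w2/n, so the printed pairs agree up to floating-point rounding, even though the formulas look very different.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true probabilities for the values 1, 2, 3 (p3 = 1 - p1 - p2).
p_true = [0.2, 0.5, 0.3]
n = 1000
x = rng.choice([1, 2, 3], size=n, p=p_true)

# Method of moments: plug the sample moments into the solved equations.
m1 = x.mean()           # first sample moment
m2 = np.mean(x ** 2)    # second sample moment
p1_mme = 3 - 2.5 * m1 + 0.5 * m2
p2_mme = 4 * m1 - m2 - 3

# Maximum likelihood: relative frequencies of the values 1 and 2.
p1_ml = np.mean(x == 1)
p2_ml = np.mean(x == 2)

print(p1_mme, p2_mme)   # for this model these simplify to w1/n and w2/n
print(p1_ml, p2_ml)
```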
Notice how nice this is: p1 hat ML is the number of ones in the sample divided by n, and p2 hat ML is the number of twos in the sample divided by n. Such simple, elegant expressions. It is a very nice extension of the Bernoulli case, where p hat was simply the number of ones divided by n, w/n. So we expect something like that when we have values 1, 2, 3 with probabilities p1, p2, p3: p1 should be the fraction of 1's and p2 the fraction of 2's. That seems like a reasonable estimator. But look at what the method of moments gave you: something rather convoluted and maybe confusing, while ML has given very nice, intuitively pleasing answers. Hopefully this gives you a comparison: even though the ML approach may seem a little more difficult at first, and the MME a bit easier, ML gives very interesting answers in many cases. You can actually extend this to any number of values: instead of just 1, 2, 3 you can have an alphabet 1, 2, 3, ..., up to some k, with an unknown probability for each letter, and given a sample the ML estimate of the probability of a particular letter is the number of times that letter occurred divided by n. This is a good principle to remember, and the proof is easy in some sense. I did not show you the differentiation, but you can treat p2 as a constant and differentiate with respect to p1 (and vice versa); it needs a little bit of work, but you will get these answers.

Next we move on to another very interesting and simple-looking case, where you will get different answers from the method of moments and from ML. Both lead to nice, simple-sounding optimization problems, but you will see the answers are very different; later on we will come back and compare these two as well. We have i.i.d. samples from Uniform(0, theta), a continuous random variable whose density, as you know, is 1/theta for x between 0 and theta, and 0 otherwise. The "0 otherwise" is very important; keep that in mind.

Let us start with the method of moments estimator; this is usually the easiest. Equate m1 to E[X] = theta/2; from here theta = 2 m1, so theta hat MME = 2 m1 = 2 (x1 + ... + xn)/n, which is another way of writing 2 x bar, twice the sample mean. A very easy estimator with the method of moments. Is it intuitive? You expect a uniform distribution between 0 and theta, the mean is theta/2, so you compute the sample mean and multiply by 2. Seems reasonable.

Let us see what ML says; it will tell you a completely different story. Suppose you observe samples x1 to xn. The likelihood of these samples is very interesting because there are two cases: it equals 1/theta^n if all the samples lie between 0 and theta, and it is 0 otherwise. Let me give you a concrete example of actual samples: take n = 3 and suppose you got the samples 0.5, 0.5 and 8 (an extreme example). Then theta hat MME = 2 x (0.5 + 0.5 + 8)/3 = 2 x 3 = 6. Notice that 8 is bigger than 6, so this gave you an estimate of theta with which the sample itself cannot agree: the value 8 could never be observed if theta hat were really 6. That seems a little ridiculous, but it is what the method of moments estimator gives you.

On the other hand, to maximize the likelihood you absolutely have to avoid the zero case. Why avoid zero? Because zero can never be the maximum; any non-zero value is bigger than zero. So you have to pick theta so that the condition is satisfied: theta* must be at least max(x1, ..., xn). Once you do this, the likelihood is always 1/theta^n and the zero case does not occur. Now, to maximize 1/theta^n over theta, you have to pick the least possible theta, because as theta increases, 1/theta^n keeps decreasing. Since theta must be greater than or equal to max(x1, ..., xn), clearly theta hat ML = max(x1, ..., xn). That is the ML estimator for Uniform(0, theta).

Notice how seriously the method of moments estimator disagrees with the ML estimator. The method of moments says: twice the sample mean. ML says: you observed n samples from Uniform(0, theta) with theta unknown, and the value of theta that maximizes the likelihood is the maximum of the observed samples. If you predict any theta below that maximum, your likelihood is zero, so why would you ever predict it? It does not make sense; you have to take theta hat ML to be the maximum. In our example, theta hat ML = 8; you predict 8, not 6 like the MME does. The two estimators are different in philosophy, but max(x1, ..., xn) is a very reasonable way to make a prediction for Uniform(0, theta): you do not know theta, you observe a lot of samples, and the best guess for theta is the maximum of those values. Very reasonable, and it ends up being the maximum likelihood estimate. An interesting example. I do not know if you are the type who enjoys these kinds of mathematical arguments, but even if you do not enjoy the argument, look at the final expression: it is a very intuitive expression for what the estimator for Uniform(0, theta) can be, so hopefully that is a bit more interesting for everybody.
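Here is a minimal sketch of the two estimators for Uniform(0, theta), assuming numpy; the true theta and the sample size are made-up values, and the last two lines reproduce the extreme three-sample example from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true parameter and sample size for the simulation.
theta_true = 10.0
n = 50
x = rng.uniform(0.0, theta_true, size=n)

theta_mme = 2 * x.mean()   # method of moments: twice the sample mean
theta_ml = x.max()         # maximum likelihood: the largest observation

print(theta_mme, theta_ml)

# The extreme example from the lecture: samples 0.5, 0.5 and 8.
y = np.array([0.5, 0.5, 8.0])
print(2 * y.mean(), y.max())   # MME gives 6.0, below the observed 8; ML gives 8.0
```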
Next is a discrete distribution: the uniform discrete distribution on {1, 2, ..., n}, where I do not know n. Here also you can do MME and you can do ML. I am going to give you the final answer for ML; you can guess it, it is not so difficult: max(x1, ..., xn). I am not going to prove this in great detail, but you can see it: the likelihood of the observed samples is (1/n) raised to the number of samples, you need n to be at least the maximum value you observed, and to maximize the likelihood you pick n as the least possible value that allows that.

What about the MME? What is the expected value? It is uniform, so E[X] = (1/n)(1) + (1/n)(2) + ... + (1/n)(n) = (1/n)(1 + 2 + ... + n) = (1/n) n(n + 1)/2 = (n + 1)/2, which you might have already known. Equate m1 = (n + 1)/2 and solve: n = 2 m1 - 1. So n hat MME = 2 x bar - 1, twice the sample mean minus one, sort of related to the 2 x bar we saw before. So you see, again, in the Uniform{1, ..., n} case the method of moments estimator and the ML estimator are very different.
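A minimal sketch of the discrete uniform case, assuming numpy; the unknown upper limit is written N here just to keep it distinct from the sample size, and its true value is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true upper limit (written N to keep it separate from the sample size).
N_true = 40
x = rng.integers(1, N_true + 1, size=30)   # i.i.d. uniform on {1, ..., N_true}

N_mme = 2 * x.mean() - 1   # method of moments: twice the sample mean minus one
N_ml = x.max()             # maximum likelihood: the largest observed value

print(N_mme, N_ml)
```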
So far we have seen cases where I was able to give a closed-form expression for the ML estimator. Now we will see a couple of cases where we do not get closed-form expressions; we get some equations, and I will say you have to solve them numerically. The method of moments estimator for the gamma distribution we have seen before, so I am not going to reproduce it here; you can go back to one of the previous lectures and look at the slides where it is derived. Let us look at ML. The likelihood is the product from i = 1 to n of (beta^alpha / Gamma(alpha)) xi^(alpha - 1) e^(-beta xi). If you simplify this, it becomes beta^(n alpha) / Gamma(alpha)^n, and you cannot throw any of these factors out because they all depend on alpha and beta, which you do not know; then the product of the xi terms gives (x1 ... xn)^(alpha - 1); and then e^(-beta (x1 + ... + xn)).

So (alpha*, beta*) is the argmax over (alpha, beta) of the log of this, which is n alpha log(beta) - n log(Gamma(alpha)) + (alpha - 1) log(x1 ... xn) - beta times the sum from i = 1 to n of xi. You take the log of each term and add or subtract as appropriate, and you get this complicated-looking expression. Now differentiate with respect to beta, treating alpha as a constant: the n alpha log(beta) term gives n alpha / beta, the n log(Gamma(alpha)) term goes away, the (alpha - 1) log(product) term goes away, and you are left with n alpha / beta - sum of xi = 0. Notice this equation is nothing but alpha / beta = sample mean, (1/n) sum of xi. A very nice relationship: we know E[X] = alpha / beta for the gamma distribution, so alpha / beta ends up having to equal the sample mean as well. A nice, intuitive relationship that needs to be satisfied by alpha and beta.

The other equation, it turns out, is not anything so nice; it is one rather nasty-looking relationship. Differentiating with respect to alpha gives n log(beta) - n Gamma'(alpha)/Gamma(alpha) + log(x1 ... xn) = 0. I do not know how many of you know the gamma function; it is maybe not that nasty, but it is a complicated function, not very easy to write down, and its derivative makes this equation more involved. So you have this equation together with alpha = beta times the sample mean, and these two equations have to be solved together. If you want, you can rewrite the first one a little: write the log of the product as a sum of logs, divide by n, push terms around, and you get Gamma'(alpha)/Gamma(alpha) - log(beta) = (1/n) sum of log(xi), a form sort of similar to the other one. Anyway, however you simplify it, these two equations have to be solved numerically; there is no way to write a very clean closed-form expression here. There are computational procedures: you put these equations in a suitable form into a Python program, load some packages, and it will solve them and give you the answer.

So this is the first case where we are not getting a convenient, neat, closed-form expression for the ML estimator. On the other hand, the method of moments estimator was very easy: it gave a closed-form expression in terms of the sample moments. Here we do not get that; log(xi) and so on are involved, and it is not clear how it will work out in the end. But even without a closed form, solving these kinds of equations is not very hard today; one can put them into a computer, it will crank a little bit, and give you the answer. There are many more examples like this, but the gamma is complicated enough for you to see the point. In this class we will not bother too much with complicated problem solving like this; even when we do, we will expect only some very simple differentiations, nothing more.
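The lecture mentions that a Python program with suitable packages can solve these two equations; here is a minimal sketch of one way to do that, assuming numpy and scipy are available. The beta-equation gives beta = alpha / (sample mean); substituting it into the alpha-equation leaves a single equation in alpha involving the digamma function Gamma'(alpha)/Gamma(alpha), which a root finder can handle. The true parameter values are made up just to generate test data.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve

rng = np.random.default_rng(0)

# Made-up true parameters, only used to simulate data to test against.
alpha_true, beta_true = 3.0, 2.0
n = 2000
x = rng.gamma(shape=alpha_true, scale=1.0 / beta_true, size=n)

mean_x = x.mean()
mean_log_x = np.log(x).mean()

# The beta-equation gives beta = alpha / mean_x. Substituting into the
# alpha-equation leaves: digamma(alpha) - log(alpha / mean_x) = mean_log_x.
def alpha_equation(alpha):
    return digamma(alpha) - np.log(alpha / mean_x) - mean_log_x

alpha_ml = fsolve(alpha_equation, x0=1.0)[0]
beta_ml = alpha_ml / mean_x

print(alpha_ml, beta_ml)   # should come out close to 3.0 and 2.0
```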
The next example is Binomial(n, p). You remember the MME; we have seen it before, so I am not going to redo it, you can go back and check. The ML, when you do not know both n and p, becomes a little more complicated. (Be careful that the letter n is doing double duty here: it is the unknown binomial parameter, and we also observe n samples x1, ..., xn.) Let us see how the likelihood works out. The likelihood is the product from i = 1 to n of C(n, xi) p^(xi) (1 - p)^(n - xi). This is a bit complicated, but you can start simplifying it: you get one term which is C(n, x1) ... C(n, xn), then p^(x1 + ... + xn), and then (1 - p) raised to the power n times n minus (x1 + ... + xn), because the product multiplies n factors of (1 - p)^(n - xi). It is a bit ugly, agreed. Now take the log; sometimes it is useful to just take the log and see how it works out: log C(n, x1) + ... + log C(n, xn) + (x1 + ... + xn) log(p) + (n times n - (x1 + ... + xn)) log(1 - p).

You can try to differentiate this with respect to p; it is ugly enough, you get a 1/p term and a 1/(1 - p) term, but if you write it down and set the derivative to zero you get the equation n p = (x1 + ... + xn)/n. I am doing a lot of simplification in my head here, differentiating, equating to zero and simplifying, but this is actually an expected result: the sample mean should be n times p, since E[X] = np. So that is one equation, and it is fine. But if you start differentiating with respect to n, you get something that is really difficult to handle: how are you going to differentiate log C(n, xi) with respect to n? You can think about simplifying it in some ways, and it does simplify a little, but you get a rather more complicated equation, so I will just leave it at that. I welcome you to try it, but you will not get anything too simple. Getting to the answer here is a bit more complicated; I am not even sure how it will look in the end, I have not tried very hard to simplify it, but my guess is it is going to be a bit more difficult than before. Maybe it will come out to something interesting, but at the very least it is quite complicated.
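Since differentiating with respect to the binomial parameter does not give a clean equation, one practical approach, sketched below and not the only possibility, is to search numerically: for each candidate value of the parameter (written N to keep it distinct from the number of samples) set p from the equation Np = sample mean, compute the log-likelihood, and keep the best candidate. This assumes numpy and scipy are available; the true values and the search range are made up.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# Made-up true parameters; both are treated as unknown when estimating.
N_true, p_true = 20, 0.3
n = 500
x = rng.binomial(N_true, p_true, size=n)

# Profile out p: for a fixed candidate N the p-equation gives p = sample mean / N,
# so scan candidate N values and keep the one with the largest log-likelihood.
best = None
for N in range(x.max(), x.max() + 200):
    p = x.mean() / N
    ll = binom.logpmf(x, N, p).sum()
    if best is None or ll > best[0]:
        best = (ll, N, p)

print(best[1], best[2])   # numerical ML estimates of N and p
```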
So we saw quite a few examples. Simple examples like Bernoulli, Poisson, exponential and normal, where the sample moments naturally ended up being the answer, which was very nice to see. Then slightly more complicated examples: the uniform distributions, both discrete and continuous, where MME and ML gave very different answers, and the simple discrete distribution on 1, 2, 3 with probabilities p1, p2, p3, where the MME expressions looked convoluted while ML gave very clean, intuitive, counting-based answers; the MME was usually not that great, while ML ended up being very good. And finally examples like the gamma and the binomial, where the simplification for ML is really hard and the MME ends up being much simpler: there you get the MME directly, while with ML it is a little more difficult. You will see all these kinds of examples with other types of distributions as well, but the recipe for ML is always the same: write the likelihood, throw away what is not necessary, take the log, differentiate with respect to the parameters, and try to solve; that gives you your ML estimate. Thank you very much.