Transcript for:
EM Algorithm Overview

All right, good morning. It's week nine, so there's just a little bit of time left. We'll continue covering different topics, and today we'll look at the EM algorithm, or at least one specific case of it, one specific application. The EM algorithm is the expectation-maximization algorithm, and we can use it in situations where we have both missing data and unknown parameters. When only one of these things is unknown it's not too hard to estimate what you need: you can guess or generate the missing data, or you can estimate the unknown parameters. But if both of them are unknown, it's a little bit trickier.

Here's the first scenario, where a value is missing but the parameters are known. Say we're told the values 1, 2, and an unknown x are drawn from a known distribution, a normal distribution with mean 1 and standard deviation 1. What is our best guess for the missing value x? We look at the normal distribution, note that its peak is at the mean of 1, and so our best guess for x is the mean of the distribution: we'll just say 1.

Here's another scenario, where the values are known but a parameter is unknown. Say we're given the values 1, 2, and 0, and I tell you they're drawn from a normal distribution with an unknown mean and standard deviation 1. What is your best guess for the mean mu? This time we use maximum likelihood estimation and take the mean of our sample as the best guess for the mean of the distribution: (0 + 1 + 2)/3 gives us 1, so we plug in 1 as our best guess for mu.

That's all fine, but what about this scenario: we have data with a missing value, so we have 1, 2, and an unknown value x, and I tell you they're drawn from a normal distribution whose mean is also unknown. What do we do? What is your best guess for the missing value x and the missing parameter mu? This is a little trickier, so we take an iterative approach. First we start off with an arbitrary guess for mu; I'll just guess 0. If mu is 0, then my best guess for x is 0, and my data become 1, 2, 0. Then I iterate back: given the data 1, 2, 0, my best guess for mu is 1. But now that should bother me, because if mu is 1, my best guess for x should be 1, so my data update to 1, 2, 1, and in that case my maximum likelihood estimate for mu is the average of 1, 2, 1, which is 1.333. We keep iterating back and forth like this, between updating the missing value x and updating the best guess for mu.
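Just to make that alternation concrete, here is a minimal R sketch of the toy example; the starting guess of 0 and the number of iterations are arbitrary choices of mine:

```r
# Toy EM: observed values 1 and 2 plus one missing value x,
# all assumed drawn from Normal(mu, sd = 1) with mu unknown
mu <- 0                      # arbitrary starting guess for mu
for (i in 1:25) {
  x  <- mu                   # E step: best guess for the missing value is the current mean
  mu <- mean(c(1, 2, x))     # M step: maximum likelihood estimate of mu from the filled-in data
}
c(x = x, mu = mu)            # both settle at 1.5
```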
Eventually we converge to a best guess for x of 1.5 and a best guess for mu of 1.5; that's what we end up getting. And that's the EM algorithm in its simplest form: you have these two steps, an E step, the expectation step, where you estimate your missing values, and an M step, the maximization step, where you update your unknown parameters to their maximum likelihood estimates. You iterate back and forth between these two steps, updating your missing values and then updating your parameter guesses.

We'll apply this to a situation where we have Gaussian mixtures: a mixture of normal, or Gaussian, distributions. The data we have were generated by several different Gaussian distributions, except we don't know which distribution a data point came from. So our missing data is which distribution a point belongs to: we have the value of the point (in the multivariate case we might have its x1 and x2 coordinates), but not whether it came from distribution A, B, or C. The parameters of these Gaussian distributions are also unknown to us: we don't know their means and variances. We're going to use the EM algorithm to iteratively figure out the unknown parameters of the Gaussian distributions as well as the unknown cluster assignments.

As an example, I'll do a simple case with three multivariate normal distributions in two dimensions: cluster 1, cluster 2, and cluster 3. Cluster 1 will represent 50% of our data, cluster 2 will be 30%, and cluster 3 will be 20%. This vector alpha represents the mixture proportions; it's kind of like the recipe of our data: 50% come from cluster 1, 30% from cluster 2, and 20% from cluster 3. Alpha can be a vector of any length, for however many distributions you're mixing together, and the sum of alpha has to be 1.

Our first component will be a multivariate normal distribution centered at (0, 0), and its sigma matrix is going to be [9, 0; 0, 9], so it has a lot of variance: the variance in the x1 direction is 9, the variance in the x2 direction is 9, but they are independent, so it will look like a circle of points. Component 2 is centered at (4, 4), in the upper right corner; its variances are 1 and 1, but it also has a covariance of 0.9, so it will look like an ellipse pointing from the lower left to the upper right. Component 3 is centered at (-4, -4), and its variance matrix has a negative covariance. Here's the plot of the points we have: you can see a component centered at (0, 0), this big circular cloud of points; a component centered at (4, 4) with high covariance, which is the angled group; and a cluster centered at (-4, -4) with negative covariance, which is this more concentrated patch over here.
As far as the PDF of our mixture model goes, it's the sum over the components of alpha times the PDF of each multivariate normal distribution. So we write the distribution of x given our parameters as p(x | theta) = sum over k of alpha_k times p(x | mu_k, Sigma_k), where this big bold theta collects all of the parameters together: the individual means and sigma matrices, plus the mixture proportions alpha, which add up to 1. The p here is just the individual PDF, in our case the multivariate normal PDF, and z is which cluster, or component, a point belongs to: component 1, 2, or 3.

To generate this in R, I have my mixture proportion vector alpha, I have the means (0, 0), (4, 4), and (-4, -4) to correspond with it, and I have my sigma matrices, which I gather together into a list. Then to generate each point I sample a value 1, 2, or 3 to decide whether we're drawing from component 1, 2, or 3, select the appropriate mean and sigma values, and generate one random point using rmvnorm. A sketch of that is below, and that gives us the resulting cloud of points.
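Here is a rough sketch of the kind of generation code I'm describing, using rmvnorm from the mvtnorm package. The sample size, the seed, and the exact entries of the third sigma matrix are placeholders of mine, since those numbers aren't spelled out here:

```r
library(mvtnorm)   # for rmvnorm()

set.seed(1)
n      <- 500                                    # sample size (arbitrary for this sketch)
alpha  <- c(0.5, 0.3, 0.2)                       # mixture proportions
mus    <- list(c(0, 0), c(4, 4), c(-4, -4))      # component means
sigmas <- list(matrix(c(9, 0, 0, 9), 2, 2),      # component 1: big circular cloud
               matrix(c(1, 0.9, 0.9, 1), 2, 2),  # component 2: positive covariance
               matrix(c(2, -1, -1, 2), 2, 2))    # component 3: negative covariance (entries assumed)

z <- sample(1:3, n, replace = TRUE, prob = alpha)   # which component each point comes from
x <- t(sapply(z, function(k) rmvnorm(1, mean = mus[[k]], sigma = sigmas[[k]])))
plot(x, col = z, xlab = "x1", ylab = "x2")          # colored by the (normally unknown) component
```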
Now I can label, or color, the points based on the components they were generated from, because I generated the data, so I know these data points came from cluster 2 and those came from cluster 1: cluster 1 is represented in black, cluster 2 in red, and cluster 3 in green, and I can say, yes, these points right here came from cluster 3, and so on. But in reality, when we get our data, it comes in unlabeled, because which component a point came from is unknown to us. What we're trying to do is take this unlabeled data and figure out that there are three clusters here (that part we have to decide: we're going to look for three clusters), and for those three clusters we have to find the mean and the variance of each one.

As we do this, if you look at one of these points on the boundary, there's the question of which cluster to assign it to: should I assign it to the red cluster 2, centered up here, or to cluster 1 in black, centered over here? Our solution, unlike K-means clustering where we just assign it to the nearest centroid, is a probabilistic assignment. We might say this point on the border is 51% cluster 2 and 49% cluster 1, or maybe 60% and 40%, and we can go through that for all of our points: this point really probably is cluster 2, so maybe we'll say 99% cluster 2, 1% cluster 1, and pretty close to zero for cluster 3. Rather than strictly assigning a point to one cluster, we use a Bayes classifier to get a probabilistic assignment, and when we do that we call it a membership weight. Each point now has a vector of membership weights, probabilities that sum to one: point number one is this probability cluster 1, this probability cluster 2, this probability cluster 3.

Once you have those membership weights, we use the weighted values to calculate the parameters of each cluster distribution: a weighted mean, and a weighted variance-covariance matrix using the membership weights of the points, and that maximizes the likelihood of the values that are, quote, assigned to each cluster. We iterate back and forth between calculating the membership weights, which is estimating the missing values, and recalculating the parameters of the cluster distributions, the means and the variances.

So you can think of this EM algorithm almost as a blend between K-means clustering and the Bayes classifier. In K-means clustering you had that iterative back and forth: assign your points to clusters, recalculate your cluster centroids, reassign points, recalculate centroids, and so on. Here we do something very similar, except when we reassign the points we're not strictly saying you're cluster 1, you're cluster 2, you're cluster 3; we're saying you are 80% cluster 1 and 20% cluster 2, you are 70% cluster 2, 25% cluster 1, and 5% cluster 3, and so on. So we use a Bayes classifier when we, quote, assign the points to each cluster, and then we recalculate the centroids: we recalculate the means using a weighted mean and the variances using a weighted variance.

So how do we calculate the membership weights? With the Bayes classifier: the likelihood times the prior, divided by the marginal, so the weight of point i for cluster k is w_ik = alpha_k * p(x_i | mu_k, Sigma_k) divided by the sum over j of alpha_j * p(x_i | mu_j, Sigma_j). The likelihood is basically: what is the probability of getting this particular value if we know it came from cluster 1, from cluster 2, from cluster 3? (You need initial starting means and variances for clusters 1, 2, and 3 for that.) The prior is your vector alpha, which reflects how many values are assigned to cluster 1, cluster 2, and cluster 3, and the marginal is the sum of the numerator across all possible classes. Here's a sketch of that calculation.
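This is one way you might code up that E step, just as an illustration of the Bayes-classifier calculation rather than the exact homework code. It assumes the same shapes as in the generation sketch above (an n-by-2 matrix of points and lists of current means and sigma matrices), and the function name e_step is my own label:

```r
library(mvtnorm)   # for dmvnorm()

# E step: membership weights via the Bayes classifier.
# x is an n-by-2 matrix of points; alpha, mus, sigmas are the current parameter guesses.
e_step <- function(x, alpha, mus, sigmas) {
  K <- length(alpha)
  # likelihood of each point under each component, times the prior alpha_k
  num <- sapply(1:K, function(k) alpha[k] * dmvnorm(x, mean = mus[[k]], sigma = sigmas[[k]]))
  w   <- num / rowSums(num)    # divide by the marginal so each row sums to 1
  w                            # n-by-K matrix of membership weights
}
```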
Each of these pieces is a little bit tricky. The alpha here is the prior probability, based on the previous iteration, that a point belongs to cluster k. Usually we'd say something like: if there are 100 points and 60 of them were assigned to cluster 1, our prior probability for cluster 1 is 60%. But it's a little tricky, because the number of points assigned to cluster k, this n_k, isn't actually an integer. It's not an actual count of points but the sum of the weights: each point's weight vector adds up to 1, so if you sum all the weight vectors you get the total number of points n, but if you take the sum of, say, the first value in each weight vector, that's the quote number of points that belong to cluster 1, and it's not necessarily an integer. So n_k = sum over i of w_ik; all the n_k's add up to n, but each one is a weighted count rather than an integer, and the updated prior is alpha_k = n_k / n.

As for the likelihood, that's just the multivariate normal PDF: what is the probability of getting x if you have this mean and this sigma matrix? Thankfully R can give us the density of the multivariate normal with dmvnorm: we plug in the value and our current estimates of the mean and the sigma matrix. So that's how we estimate the membership weights: we need the likelihood and the prior, and we sum those products to get the marginal. I'm going over this quickly because I'm assuming we feel okay with the Bayes classifier, since we did it in the homework; it's pretty much that.

Once you have the membership weights assigned, we have to update the mean and variance estimates of each cluster. (Oh, before I forget, your first view quiz answer for today is A, as in apple.) We basically just take a weighted mean: some values have higher weights, some have lower weights; you multiply each weight by x, sum them up, and divide by n_k, the quote number of points assigned to cluster k, again not necessarily an integer. So mu_k = (1/n_k) times the sum over i of w_ik * x_i: the points whose w's are close to 1 contribute more to the weighted mean, and the points whose w's are close to 0 contribute less. It's the same idea for calculating the variances. This is just the definition of the variance, (x - mu)(x - mu) transpose, which gives us a 2x2 matrix, but then we weight it by w, add the terms up, and divide by n_k: Sigma_k = (1/n_k) times the sum over i of w_ik * (x_i - mu_k)(x_i - mu_k) transpose. Normally you would add these terms up and divide by n, or by n minus 1 to fix the bias, but here we multiply by w and divide by n_k, however many points are, quote, in that cluster. Here's a sketch of that M step.
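A sketch of the corresponding M step, under the same assumptions: x is the n-by-2 data matrix and w is the n-by-K weight matrix from the E step sketch above. The function name m_step is again my own label:

```r
# M step: update mixture proportions, weighted means, and weighted covariance matrices.
m_step <- function(x, w) {
  n  <- nrow(x)
  K  <- ncol(w)
  nk <- colSums(w)                  # "effective" number of points per cluster (not an integer)
  alpha  <- nk / n                  # updated mixture proportions
  mus    <- lapply(1:K, function(k) colSums(w[, k] * x) / nk[k])    # weighted means
  sigmas <- lapply(1:K, function(k) {
    d <- sweep(x, 2, mus[[k]])      # x_i - mu_k for every point
    (t(d) %*% (w[, k] * d)) / nk[k] # weighted variance-covariance matrix
  })
  list(alpha = alpha, mus = mus, sigmas = sigmas)
}
```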
So here's how the data come in: unlabeled. When we do this we just have to start off with some arbitrary values for the means, the variances, and so on, so I'm going to arbitrarily pick three starting centers, and my sigma matrices will be identity matrices. These concentric circles represent contour lines of density; I forget exactly which values I used, but something like the 60th, 70th, 80th, 90th, and 95th percentiles. Right now these are terrible estimates. We use the Bayes classifier to assign the points to these clusters, and I've colored each point according to which cluster has the highest probability. But these graphs are a tiny bit deceiving: it looks like this point right here is assigned to cluster 1 in black and this one is assigned to cluster 3 in green, but in reality, along this border, the points colored black are something like 51% cluster 1 and 49% cluster 3, and the points colored green are 51% cluster 3 and 49% cluster 1. It's just that when you color a point you have to pick one color. Remember these are not strict assignments; they're probabilistic, so all of these points have some probability of being in cluster 1, some probability of being in cluster 3, and technically some probability of being in cluster 2, even if that number is very, very close to zero: something like 51% black, 49% green, and 10^-8 red. That's what I have written on this slide.

After one iteration, these are our current assignments, and I update: based on all of the weighted values assigned to cluster 1, what is the mean of cluster 1? That's this point in the center, the weighted mean. But when we ask for its estimate of the variance, you can see that estimate got much, much bigger, because it included all of these points. Meanwhile, these were the points, quote, assigned to the green cluster, so we take the weighted mean of those points, which lands right here, and then calculate the variance, and the variance matrix looks like this. We do the same thing with red. Now we update all of the points and their assignments, coloring some black, some green, some red, and this is what it looks like. Then we do another iteration, and compared to the previous one you can see the variance-covariance matrix for the green cluster is a lot tighter, still centered about here; something similar has happened for red; and this is the variance-covariance matrix for the points assigned to cluster 1 in black. We keep updating like this: we started off here, and stepping through quickly, we end up with something like this. So after some iterating back and forth, updating the assignments and updating the corresponding parameter estimates, this is what we end up getting.
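Putting the pieces together, the overall loop might look roughly like this, reusing the e_step and m_step sketches above and the data x from the generation sketch. Starting from random data points as centers is just one arbitrary choice (the slides use fixed starting centers), and I fix the number of iterations instead of checking convergence to keep it short:

```r
# Full EM loop (sketch): start from arbitrary guesses and alternate E and M steps.
run_em <- function(x, K = 3, iters = 50) {
  alpha  <- rep(1 / K, K)                                    # equal starting mixture proportions
  mus    <- lapply(sample(nrow(x), K), function(i) x[i, ])   # arbitrary starting centers (random points)
  sigmas <- replicate(K, diag(2), simplify = FALSE)          # identity matrices as starting variances
  for (it in 1:iters) {
    w    <- e_step(x, alpha, mus, sigmas)                    # E step: membership weights
    pars <- m_step(x, w)                                     # M step: weighted parameter updates
    alpha <- pars$alpha; mus <- pars$mus; sigmas <- pars$sigmas
  }
  list(alpha = alpha, mus = mus, sigmas = sigmas, w = w)
}

fit <- run_em(x)                       # x from the generation sketch above
apply(fit$w, 1, which.max)             # hard labels, if you need one color per point
```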
After letting it run, these are our final cluster assignments, and this is what we have. Again, these points are probabilistically assigned, so the ones right on the border are something like 51% red cluster 2 and 49% black cluster 1, and so on. But overall, if we had to pick a single assignment, we'd say these points get assigned to this cluster, these get assigned to that cluster, and everything else gets assigned to the center cluster. We can compare that to the original assignments. The label-switching issue, which cluster ends up being called red and which green, is in play here, but that's okay: the points still got clustered together. You can see these points right here are actually a mix of the cluster centered at (-4, -4) and the cluster centered at (0, 0), so there really is a mix, but in our final cluster assignments they all got probabilistically assigned to this cluster over here; in reality these points are probably something like 95% one cluster and 5% the other. We can compare to the true groups because we generated the data ourselves, so we know what it looks like.

We could also try different starting locations. If I start off with bad starting weights and starting locations, we can watch how the estimates get updated: these are the different clusters, after one iteration we end up here, after another iteration we go there, and we keep going, and it still manages to converge, which is encouraging.

I want to show a couple of examples from Wikipedia. This is a similar diagram, but it's a real data set, I think the Old Faithful geyser in Yellowstone National Park: they measured how long each eruption lasts and how long you have to wait between one eruption and the next, and it appears there are two clusters. Again, you start with some initial assignments, and iteratively you get convergence.

There are a couple of reasons why we would want to do this instead of K-means clustering, and this is in the homework as well. Say you have data that look like a Mickey Mouse distribution: a big cluster here, a cluster over here, and a cluster over here. K-means clustering uses Euclidean distance, so it can only take distance into account and knows nothing about variance. These points up here, when we look at the distribution, we'd say they should belong to this center cluster, but because the K-means algorithm identifies a centroid right here, it says the points right here are closer to this centroid than to that one, and therefore they get assigned to this cluster over here; similarly, these points up here get assigned to this cluster.
So K-means isn't able to take into account the internal variance of a cluster; it only uses Euclidean distance. The EM algorithm, on the other hand, is able to say: this cluster of points has a very small variance, and therefore these points out here, even though in terms of Euclidean distance they're closer to the centroid of this cluster up here, are less likely to be assigned to it and more likely to be assigned to the cluster in the center. The cluster in the center has a huge amount of variance, so it's able to pick up points that are way out here. Whereas with K-means clustering, a point way out here just gets assigned to the nearest centroid, which probably doesn't quite make sense once you take into account the amount of variance each cluster has. So the EM algorithm is able to take into account how much variance exists in each cluster and include that in its assignments.

I've put this in your homework assignment, where you'll have to create these plots yourself, and the guiding principles are really on these slides right here: calculate your assignment weights using the Bayes classifier, and then once you have those, calculate your weighted mean and weighted variance matrices. I guess that's all I have for today's lecture on the EM algorithm, so we'll wrap things up a little bit early. The last two view quiz answers are C and E, C as in cat and E as in elephant. So we'll finish here, and we'll see you guys later.