Transcript for:
Understanding Denoising Diffusion Models

Hi everyone. In this video we'll look into diffusion models, specifically DDPM, which is Denoising Diffusion Probabilistic Models, a seminal paper on diffusion models in the image space. We'll dive deep into the forward and the reverse diffusion process and understand the entire math that ultimately leads to the simple formulation of the training objective. So let's get started.

The idea of diffusion models is to destroy all the information in the image progressively, over a sequence of time steps t, where at each step we add a little bit of Gaussian noise, and by the end of a large number of steps what we have is completely random noise, mimicking a sample from a normal distribution. This is called the forward process, and we apply this transition function to move from x_{t-1} to x_t. We then learn the reverse of this process via a neural network model, where effectively what the model is learning is to take the image and remove noise from it, step by step. Once this model is learned, we can feed it random noise sampled from a normal distribution and it will remove a little bit of noise from it; we then pass this slightly denoised image in again and keep repeating this, and at the end of a large number of denoising steps we'll get an image from the original distribution.

If all of this sounded like a very vague overview, not giving any insight or intuition, then yeah, this is exactly what I felt at the start. So instead of rushing through all the details, I decided to try and understand why we are doing this and what the reason behind each of those steps is, if not from a mathematical standpoint then at least from an intuitive one.

The first thought that one could have after seeing all of this is: what is the significance of the term "diffusion" here? To be honest, all I really knew is that diffusion is the movement of molecules from regions of higher concentration to regions of lower concentration, so what's that got to do with this? Moreover, this looks in some sense very similar to what we do in variational autoencoders, doesn't it? I mean, in a VAE we have an image sample and we convert it to the mean and variance of a Gaussian distribution using an encoder, and simultaneously train a decoder that takes a sample from this Gaussian distribution and attempts to reconstruct the image. So it seems like we are doing something similar here, except maybe not in one shot but through multiple levels. Then why is this new term "diffusion" being brought in, and why is this not just some variation of a VAE?

To answer that, once you look more into what diffusion processes are, these are some of the results that you will get. "Stochastic" means there is some randomness involved. "Markov", we know, simply means that the future state of the process depends only on the present state, and once we know the present state, any knowledge about the past gives no additional information about the future state. And "continuous" would mean there aren't any sort of jumps. You can connect this to our equation of the forward process, where we have this Q function honoring the Markov definition: x_t is only dependent on x_{t-1}.

Here is another result that will come up, and it gives a different perspective: it states that diffusion is a process of converting samples from a complex distribution to a simple one. If you try to link that to what we are doing here, then our input image distribution obviously is a complex one, and if our goal is to convert samples from this distribution to samples from a simple prior, that is, a normal distribution, then this diffusion process can come in handy.
The mention of repeated application of a transition kernel is again referring to this transition function that we have, and the claim is that this will lead to a normal distribution. We'll soon see why this is the case, but for now let's just assume it to be true. So diffusion is a stochastic Markov process that allows us to transition from any complex distribution to a simple distribution by repeatedly applying a transition kernel.

Let's look at the next one. Okay, so my little brain is screaming to ignore this completely, and that's what we are going to do. But wait, let's bring the equation back again. Ignoring specific details, it is saying that diffusion processes have equations of this form: the change in x is a combination of a deterministic term and a stochastic term. And this is also something we are doing in this transition function. If we apply the reparameterization trick we get this, and if we simply add and subtract x_{t-1} we get this. Here also we see that the change, which is x_t minus x_{t-1}, is again a combination of a deterministic and a stochastic term. So clearly, while the end objective is very similar to what we have been doing in VAEs, we are modeling the process of reaching that objective very differently, and in fact it is modeled behind ideas from physics, specifically those pertaining to diffusion processes, hence the term diffusion. The hope is that modeling it this way will lead to better generation results than VAEs.

Now it seems that the central focus of this diffusion process, specifically the forward one where we move from the image distribution to a normal one, is this transition function, and somehow applying it repeatedly converts an image, which is a sample from a very complex distribution, into a sample from a normal distribution. So why does this transition function achieve that effect? First, let's not even bother with any relationship between the coefficients; let's try with alpha and beta where they are independent of each other. The way we'll go about it is this: we'll start with this distribution, which is not really complex, but that's okay. We'll sample some numbers according to this distribution and chart out a histogram. Then we'll apply the transition step once; this will lead to a change in the values and hence a change in the histogram. Let's say these are the values we use, so we get this histogram. Then, connecting the bar midpoints and having a sufficiently small bin size, we can approximate that this is our changed distribution, and effectively what we are saying is that applying this transition function once changes the left distribution into the right one.

Now obviously we could say that with alpha equal to 0 and beta equal to 1, in one shot itself we are at a Gaussian, but that's not much use, is it? What we want to do is gradually shift the distribution to a Gaussian, making slow, progressive changes, if we really want to model the transition as a diffusion process, just like how dye would slowly spread in water. Let's first try with alpha. We can see that setting alpha to anything greater than one will never allow us to reach our goal, because the variance will end up blowing up soon enough; hence having some decay of the original distribution values is necessary. We can try a value just smaller than one for alpha. But what about beta? We need to introduce this noise term slowly, and having higher values of beta will never allow us to do that; moreover, it will end up making sudden changes in the distribution, and we don't want that either. We want slow transitions, and the more of the original distribution we have destroyed, the more it should look like a normal distribution.
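To make this concrete, here is a minimal sketch of that 1D experiment in NumPy. The bimodal starting distribution and the specific alpha and beta values are my own illustrative choices, not necessarily the ones used on screen.

```python
import numpy as np

# Start from a deliberately non-Gaussian (bimodal) 1D distribution.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4.0, 0.5, 50_000),
                    rng.normal(3.0, 1.0, 50_000)])

alpha, beta = 0.99, 0.008   # alpha just below 1, beta small; purely illustrative values
T = 1000                    # number of forward steps

for t in range(1, T + 1):
    eps = rng.normal(0.0, 1.0, x.shape)
    # One application of the transition: shrink the signal a little, add a little noise.
    x = np.sqrt(alpha) * x + np.sqrt(beta) * eps
    if t % 200 == 0:
        print(f"step {t:4d}: mean = {x.mean():+.3f}, std = {x.std():.3f}")
```

With these particular, unrelated values the histogram does become Gaussian-shaped, but its standard deviation settles near sqrt(beta / (1 - alpha)), which is about 0.89 here rather than exactly 1. That is already a hint of why the two coefficients end up being tied together.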
When we put in these values, this is exactly what we get: a smooth transition between steps, and slowly the distribution changes into a Gaussian.

Let's mathematically see why this is the case. This was our transition function for the forward step that we started with, where we formulate x_t in terms of x_{t-1}. Starting from the last time step T, we can write x_{T-1} in terms of x_{T-2} using the same formulation. Separating the x_{T-2} term and everything else, we get this. We can already see a recurring pattern, but let's do the same steps once more, this time replacing x_{T-2} in terms of x_{T-3}. Since at each step we only get one term involving the earlier x, and the rest are all noise terms independent of it, repeating this for all time steps we end up with this. Only the first term has x_0, and when T is a sufficiently large number this term will be very close to zero, because all the values in the product are less than one. The rest of the terms are all Gaussian with zero mean but different variances; however, since they are all independent, we can combine them into one Gaussian with mean zero and variance equal to the sum of the individual variances. So this one will have a variance of beta, this one beta times (1 - beta), and this one beta times (1 - beta) squared. This is a geometric progression with first term beta and common ratio (1 - beta), hence the sum over T terms will be this, and it ends up being one. So this transition function indeed ultimately leads to a zero-mean, unit-variance Gaussian.

By the way, if you noticed, I conveniently went from using uncorrelated coefficients alpha and beta for the two terms to the related coefficients square root of (1 - beta) and square root of beta. This is because that is what the authors use. But even if you go through this proof with square root of alpha and square root of beta under the same constraints, that is, alpha just smaller than one and beta a very small value, we end up with the first term being this, which will again vanish for large T and alpha less than one. The variances are these for the first, second and third terms, and so on; again a geometric progression, but now its sum will be this. So mathematically, the relationship between alpha and beta comes out of our requirement that the final sum of variances be one. Intuitively it also makes sense, as you are effectively correlating the amount of structure of the original distribution that you destroy with the amount of noise that you add. That's all I could think of, but do let me know if you think there's more to the reason for the relationship between the scaling and the noise variance terms.
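Before moving on, here is a quick numeric sanity check of that geometric-progression argument. The fixed beta value and step count are just illustrative choices.

```python
import numpy as np

beta = 0.01   # a fixed, small noise variance, chosen only for illustration
T = 1000      # number of forward steps

signal_coeff = 1.0   # coefficient multiplying x_0 in the unrolled recursion
noise_var = 0.0      # total variance of all the accumulated noise terms

for _ in range(T):
    # x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    signal_coeff *= np.sqrt(1.0 - beta)          # the x_0 coefficient shrinks geometrically
    noise_var = (1.0 - beta) * noise_var + beta  # old noise is scaled down, fresh noise is added

# The geometric-progression sum beta + beta(1-beta) + beta(1-beta)^2 + ... approaches 1,
# and the x_0 coefficient (sqrt(1-beta))^T approaches 0.
print(f"coefficient on x_0 after {T} steps: {signal_coeff:.2e}")
print(f"total noise variance after {T} steps: {noise_var:.6f}")
```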
What we did here was discretize the diffusion process into a finite number of steps for a 1D distribution. We can use the same principles for a W x H image, and the output of the forward process will then be a W x H image where each pixel is effectively a sample from a Gaussian distribution with mean zero and unit variance.

In practice we don't use a fixed noise variance at all steps; we instead use a schedule. The authors use a linear schedule where they increase the variance of the noise over time, which makes sense if you think from the reverse-process standpoint: when you are at the start of the reverse process you want the model to learn to make larger jumps, and when you are at the end of the reverse process, very close to a clean image, you want it to make very careful, small jumps. Another benefit that I could see is that the schedule allows the variance of the distribution to scale very smoothly from the input to the target Gaussian. This is the plot of the variance of our not-so-complex distribution throughout the time steps of the forward process. You can see that with fixed noise the reduction in the distribution's variance is larger at the start, and after only about 500 steps it gets very close to one, with only minor changes after that, whereas the reduction is much more progressive and more uniform over a larger fraction of time steps when we use the schedule the authors suggest.

There's another good property I should cover before we move away from the forward process of diffusion. Because we want the process to run over close to a thousand steps, we are essentially requiring that, for T equal to 1000, we apply this equation a thousand times to go through the entire Markov chain, which is not really efficient. However, it turns out we don't really need to. We have this as the formulation of x_t; now let's define 1 - beta_t as alpha_t, allowing us to write x_t as this. We can do the same recursion we did a few minutes back: we put in the expression for x_{t-1} in terms of x_{t-2} using this transition formula, and separating the x_{t-2} term from the noise terms we get this. Like before, this is adding two independent Gaussian noises, with variance 1 - alpha_t for this term and this for the other, and they can be rewritten as one Gaussian with variance equal to the sum of these variances; the alpha_t terms cancel each other out. Doing the same for x_{t-2} in terms of x_{t-3}, and again separating the terms, we get this. We can see the pattern that's emerging, and if you repeat this till x_0, what we get are these two terms. So we define the cumulative product of the alphas from i equal to 1 to t as this term, and then we can rewrite the last equation as this. Finally, we see that we can actually go from the initial image to the noisy version of the image at any time step t of the forward process in one shot. All we need is this cumulative product term, which can be pre-computed, the original image, and a noise sample from a normal distribution.
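Putting that together in code, here is a minimal sketch of pre-computing the cumulative products and doing the one-shot jump. I'm using PyTorch here, and the linear schedule range (beta from 1e-4 to 0.02 over 1000 steps) is what I believe the authors use; treat the exact numbers as an assumption.

```python
import torch

T = 1000
# Linear variance schedule (assumed range), increasing over time.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product, pre-computed once

def q_sample(x0: torch.Tensor, t: int, eps: torch.Tensor) -> torch.Tensor:
    """One-shot forward jump: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a_bar = alpha_bars[t]
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps

# Usage on a dummy 3x64x64 "image" scaled to [-1, 1]:
x0 = torch.rand(3, 64, 64) * 2 - 1
xt = q_sample(x0, t=500, eps=torch.randn_like(x0))
print(xt.mean().item(), xt.std().item())
```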
Now, although we covered a lot of math, if you take a step back, all we really did was convert an image to noise, and you would agree there are simpler ways to do that. So why go through all of this? How does modeling this forward process benefit us if ultimately all we did was convert an image to noise? Well, it turns out that under certain conditions, which are met here, the reverse process is also a diffusion process, with transitions having the same functional form. That means our reverse process, which is what actually generates data from random noise, will also be a Markov chain with Gaussian transition probabilities. We can't compute it straight away, because that would require computations involving the entire data distribution, but we can approximate it by a distribution p, formulated as a Gaussian whose mean and sigma are the parameters we are interested in learning. This formulation can be done because we know, according to the theory, that the reverse will have that form. Here I did not go into the details of why the reverse is also a diffusion with Gaussian transitions, because frankly I don't know; I'm guessing it has roots in some theory of stochastic processes, but if somebody knows an intuitive answer as to why the reverse is a diffusion process, do let me know in the comments.

I just said that we can't compute the reverse distribution but we'll somehow approximate it using p; let's now look into how we do that. We have a similar situation in VAEs, where we don't know the true distribution of x given z but we learn to approximate it by a neural network. Let's quickly see that and then try to apply the same method for learning our reverse distribution. There we want to learn p(x|z) such that we are able to generate images as close to our training data distribution as possible, and for that we try to maximize the log likelihood of the observed data. So we have log p(x), and we can rewrite it by marginalizing over the latent variables. We can then multiply and divide by a distribution q(z|x) inside the integral to get this. This is nothing but the log of an expectation of this quantity under the distribution q(z|x), and because log is a concave function, the log of an expectation of something is always greater than or equal to the expectation of the log. Using this fact we get a lower bound on the log likelihood of our data, and instead of maximizing the likelihood we can maximize this lower bound. Once you expand the terms within the lower bound, you get these two terms: the first, under the assumption that p is Gaussian, ends up being the MSE or reconstruction loss between the generated output and the ground-truth image, and the second is the KL divergence between our prior, which is a zero-mean, unit-variance Gaussian, and the distribution the encoder predicts.

The same line of thought can be extended to our diffusion case, where instead of going to z in one go, we move through a sequence of latents x_1 to x_T, and our q(z|x) is now q(x_t | x_{t-1}), which is fixed and not learned. So again we want to maximize the likelihood of the observed data, which is log p(x_0). This is equivalent to integrating out everything from x_1 to x_T from the joint distribution of the entire chain, and applying our inequality for concave functions we get this, which is the lower bound we'll maximize, kind of what we got in the VAE case as well.

But up until now we haven't really used the fact that both the forward and the reverse processes are Markov chains, so let's do that now. This was the term inside the expectation; we can factorize both p and q because both are Markov chains. Here q is our forward process and p is the approximation of the reverse process. Notice that this conditioning on x_0 is actually doing nothing, and since the forward process is Markov we could remove it, as q(x_t | x_{t-1}, x_0) is the same as q(x_t | x_{t-1}), but we'll keep it, and we'll soon see that it actually ends up helping us. Let's first look at the denominator. On the terms inside the product we can apply Bayes' theorem to get this; then, after expanding, you will see that all the q(x_t | x_0) terms cancel out, like this and this, this and this, and even the x_1 given x_0 one, all except the term involving the last time step T. So only terms in the numerator remain, and specifically what's left are these q(x_{t-1} | x_t, x_0) terms and this one term, q(x_T | x_0). We can therefore rewrite the denominator as this and put it in our lower bound equation. Now we can separate the x_0 given x_1 term from the numerator and split this entire thing into a sum of three log terms; remember, all of this is under the expectation over q.
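Written out in the usual notation, the bound we just arrived at, grouped into those three terms, looks like this. This is my transcription of what is on screen, so treat the exact form as a sketch.

```latex
\log p_\theta(x_0) \;\ge\;
\underbrace{-\,D_{KL}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{\text{prior term}}
\;+\;
\underbrace{\mathbb{E}_{q}\big[\log p_\theta(x_0 \mid x_1)\big]}_{\text{reconstruction}}
\;-\;
\sum_{t=2}^{T}
\underbrace{\mathbb{E}_{q}\Big[D_{KL}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)\Big]}_{\text{denoising matching terms}}
```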
This is now very similar to the VAE lower bound. The first term is like the KL divergence prior term, but here, because we are using diffusion, q is fixed, and the final q(x_T), according to the theory, will actually be very close to a normal distribution, so this term is parameter-free and we won't bother optimizing it. The second is the reconstruction of our input x_0 given x_1. And the last term is a sum of quantities which are nothing but KL divergences, actually negatives of KL divergences, and since we want to maximize the lower bound we want to minimize all of these KL terms. The good part about this formulation is that the last term involves two quantities of the same form, and it simply requires the approximate denoising transition distribution to be very close to the ground-truth denoising transition distribution conditioned on x_0. At first glance it might look like we are back where we started, but that's not the case: instead of q(x_{t-1} | x_t) we have additional conditioning on x_0, and intuitively this is something that's easier to compute, because once you look at the original image you have some sense of how to go about denoising the image from one time step to the previous one. In fact, we'll compute this quantity next, and we'll see that it has a very nice form.

We can use Bayes' theorem to rewrite this quantity. Now each of these is a Gaussian, and we have already computed them: the first term is our forward process, and although it's conditioned on x_0, that has no impact because our forward process is a Markov chain. The other two terms can be written using the recursion we established earlier, which allows us to go from x_0 to a noisy image at any time step t. Because these are all Gaussians, we can write them in this exponential form, and our ultimate goal is to compute the distribution of x_{t-1}; our hope is that somehow we can convert this into a perfect square, allowing us to read off the equation of some Gaussian distribution. With that motivation, we'll separate the x_{t-1} squared terms, the x_{t-1} terms, and everything independent of x_{t-1}, which is what we have done here: the x_{t-1}-related ones are within the first two terms, and everything not involving x_{t-1} is wrapped within this last parenthesis, namely the x_t squared term coming from the first, the x_0 squared term coming from the second, and all the terms coming from this one. Combining both terms for x_{t-1} squared we get this, and combining both for x_{t-1} gives this. Now, because our motivation is to get just an x_{t-1} squared, we factor out everything multiplied with it from all the terms here, and to allow that we have multiplied and divided the last term by this. So up till now we wrote the normal distributions in exponential form and did some algebra to get this. Now we'll work on the final term, which was everything independent of x_{t-1}; like I said before, these will be the x_t squared term from this, the x_0 squared term from this, and everything in this. After we simplify these terms, we'll notice that it's exactly the square of the term multiplied with 2 x_{t-1} here: the first two are the square terms and the third is their product. This allows us to rewrite the whole equation for the reverse distribution as a Gaussian, with this being the mean and this being the variance. All of this was done because these were the likelihood terms we had to optimize, and for the last term we needed to compute q(x_{t-1} | x_t, x_0), which we have now done.
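For reference, the mean and variance that fall out of completing the square are the following. Again, this is my transcription into the notation we've been using, with alpha-bar the cumulative product of the alphas.

```latex
q(x_{t-1} \mid x_t, x_0)
  = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I \right),
\qquad
\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0
  \;+\; \frac{\sqrt{\alpha_t}\,\big(1 - \bar{\alpha}_{t-1}\big)}{1 - \bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t
```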
If you take a closer look at the mean, it's kind of a weighted average of x_t and x_0, and if you compute the weight on x_0 you will see that it is very low at high time steps and increases as we move closer to the end of the reverse process. From the graph it might seem that the weight is just zero for a long time, but when you plot the log values you can see that there is indeed an increase over time.

Coming back to our reverse diffusion process, this is what we need to compute. We obviously have to approximate it, because for generation we won't really have x_0, but because we know that our reverse is a Gaussian, we can have our approximation also be a Gaussian, and all we need to do is learn its mean and variance. The first thing the authors do is fix the variance to be exactly the same as the ground-truth denoising step variance. Now remember, our log likelihood terms were these, all of them under the expectation over q, making them KL divergences, and when you use the KL divergence formula for Gaussians, because both distributions here have the exact same variance, it ends up being the squared difference between the means divided by twice the variance, which is this quantity. Since our goal is to maximize the likelihood, we need to reduce this difference. Now, because our ground-truth denoising step has its mean in this form, we can write the mean of our approximate denoising distribution in the same form; however, since we don't have x_0, we approximate it by x_theta. So at any time step t, x_theta will be our approximation of the original image based on what x_t looks like. Once you formulate the mean like this, the KL divergence, which earlier was the squared difference of the means, ends up being this: we still divide by twice the variance, which was already present, but we also need to scale by the square of everything that x_0 and x_theta are multiplied with. So after all the math, we end up with the loss term being how well our model approximates the original image after seeing a noisy image at time step t.

This is a valid formulation of the loss term, but the authors actually formulate the loss as how well our model approximates the noise that was used, which can be derived from what we have here, so let's see that as well. If you remember, we had seen that x_t can be written in terms of x_0 through this equation; we can move x_0 to one side, which gives us a way to rewrite x_0 in terms of x_t and the original noise that was added. Now we can use this to reformulate the mean: we replace x_0 with this expression, and after some simplification we arrive at this quantity. We have now expressed our ground-truth mean in terms of x_t and the noise sample that was added. We can similarly rewrite our approximation of the mean in terms of x_t and a noise prediction, and now the difference of means becomes the squared difference between the ground-truth noise and the predicted noise. We still have twice the variance in the denominator, as that comes from the KL divergence formula, but we also have to multiply by the square of whatever factor the noise term in the mean was multiplied with. After doing that, we finally have this loss term, which is exactly what the authors mention in the paper.

So to recap: from our likelihood terms we ignored the first one because it had no trainable parameters, and upon working through the individual terms in the last summation, we identified that each can be written as a scaled squared difference between the ground-truth noise and the noise prediction generated by some model using x_t as input. In practice we also provide the time step to the model, but that is something we'll cover anyway in the implementation video.
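To see what this objective looks like in code, here is a minimal, hypothetical training-loss sketch. The name noise_predictor is a stand-in for whatever network predicts the noise from x_t and t, not the actual architecture from the paper, and the schedule values are the same assumption as before.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule, as before
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # pre-computed cumulative products

def training_loss(noise_predictor, x0: torch.Tensor) -> torch.Tensor:
    """One step of the simplified objective: predict the noise mixed into x_t, regress it with MSE.
    x0 is a batch of images shaped (B, C, H, W); noise_predictor(x_t, t) is a hypothetical model."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                       # a uniformly sampled time step per image
    eps = torch.randn_like(x0)                          # the ground-truth noise
    a_bar = alpha_bars[t].view(b, 1, 1, 1)              # broadcast over channels and pixels
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps   # one-shot forward jump
    eps_pred = noise_predictor(xt, t)                   # model sees the noisy image and the time step
    return F.mse_loss(eps_pred, eps)                    # scaling factor dropped, as the authors do
```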
The authors actually ignore the scaling altogether, and through experiments find that just training the model on the squared difference of noises is good enough. Even the second term in the likelihood is wrapped under this loss, by having the model approximate the noise that was added to x_0 to get to x_1; during training, t equal to 1 is treated the same as any other time step, and the only minor difference happens in sampling, which we'll see soon.

Now that we have covered every bit of math involved and reached our simple objective to optimize, let's see, all together, how we train this. During training, we sample an image from our dataset and also sample a time step t uniformly. Then we sample a random noise from a normal distribution and use the formulation we derived to get the noisy version of this image in the diffusion process at time step t; the original image and the sampled noise are used, and we use the time step and our noise schedule to get the cumulative product terms, which we can also pre-compute. The noisy version of the image is then fed to our neural network, and we train the network using this loss to ensure that the predicted noise is as close to the actual noise as possible. Training for a large number of steps, we'll cover all the time steps, and effectively we are optimizing each individual member of this sum.

For image generation we need to go through the reverse process, and we can do that by iteratively sampling from our approximation of the denoising step distribution p that our neural network has learned. For generating, we start with a random sample from a normal distribution as our final time step T image. We pass it to our trained model to get the predicted noise, and just to reiterate, our approximate denoising distribution was this, and we formulated its mean this way using the noisy image at time step t and the predicted noise; if you simplify the terms, you end up with the mean image of the denoising distribution as this. Once we have the predicted noise, we can sample an image from this distribution using the mean image and the variance, which was fixed to be the same as in the forward process, and this becomes our x_{t-1}. Then we just keep repeating this process all the way till we get our original image x_0; the only thing we do differently is return the mean image itself to get from x_1 to x_0. And that's it for the training as well as the generation part of the DDPM diffusion model.

With this we come to the final question of this video, which is why even create this video, especially since I am not an expert in diffusion models. In fact, on a scale of 0 to 1, since all I've spent is about 500 hours understanding diffusion models, I would peg myself to be somewhere right about here. The whole reason for this video is to make whatever little number of people are on my left a bit more knowledgeable about what's happening in the diffusion process. And there's a selfish reason as well, which is that I wanted to dive deep into every little detail I could think of, and making this video ensures that I do that, and also that people on my right can pitch in and improve my understanding wherever my explanation was missing something. This entire video would not have been possible had these amazing people not invested their time in creating resources which helped me understand things a lot better, so a big thank you to them. Do take a look; the links are in the description. In the next video I'll get into the architecture and implementation of DDPM, so see you then.