In this video I'll cover the implementation of diffusion models. We'll create DDPM for now and, in later videos, move to Stable Diffusion with text prompts. In this one we'll be implementing the training and sampling parts for DDPM. For our model we'll actually implement the architecture that is used in the latest diffusion models rather than the one originally used in DDPM. We'll dive deep into the different blocks in it before finally putting everything in code and seeing the results of training this diffusion model on grayscale and RGB images.

I'll cover the specific math of diffusion models that we need for implementation very quickly in the next few minutes, but this should only act as a refresher. So if you're not aware of it and are interested in knowing it, I would suggest first watching my diffusion math video that's linked above.

The entire diffusion process involves a forward process where we take an image and create noisier versions of it step by step by adding Gaussian noise. After a large number of steps, it becomes equivalent to a sample of noise from a normal distribution. We do this by applying this transition function at every time step t, and beta is the scheduled noise which we add to the image at t-1 to get the image at t. We saw that having alpha as 1 minus beta and computing cumulative products of these alphas at time t allows us to jump from the original image to the noisy image at any time step t in the forward process.

We then have a model learn the reverse process distribution, and because the reverse diffusion process has the same functional form as the forward process, which here is a Gaussian, we essentially want the model to learn to predict its mean and variance. After going through a lot of derivation from the initial goal of optimizing the log likelihood of the observed data, we ended with the requirement to minimize the KL divergence between the ground-truth denoising distribution conditioned on x_0, which we computed as having this mean and this variance, and the distribution
predicted by our model. We fix the variance to be exactly the same as that of the target distribution and rewrite the mean in the same form. After this, minimizing the KL divergence ends up being minimizing the square of the difference between the predicted noise and the original noise sample.

Our training method then involves sampling an image, a time step t and a noise sample, and feeding the model the noisy version of this image at the sampled time step t using this equation. The cumulative product terms need to come from the noise scheduler, which decides the schedule of noise added as we move along time steps, and the loss becomes the MSE between the original noise and whatever the model predicts.

For generating images, we just sample from the learned reverse distribution, starting from a noise sample x_T from a normal distribution and then computing the mean using the same formulation, just in terms of x_t and the noise prediction; the variance is the same as that of the ground-truth denoising distribution conditioned on x_0. Then we get a sample from this reverse distribution using the reparameterization trick, and repeating this gets us to x_0. For x_0 we don't add any noise and simply return the mean.

This was a very quick overview and I had to skim through a lot. For a detailed version of this, I would encourage you to look at the previous diffusion video.

So for implementation, we saw that we need to do some computation for the forward and the reverse process, so we'll create a noise scheduler which will do these two things for us. For the forward process, given an image, a noise sample and a time step t, it will return the noisy version of this image using the forward equation. In order to do this efficiently, it will store the alphas, which are just 1 minus beta, and the cumulative product terms of alpha for all t. The authors use a linear noise scheduler where they linearly scale beta from 1e-4 to 0.02 with a thousand time steps between them, and we'll do the same. The second responsibility of this scheduler is that,
given an x_t and a noise prediction from the model, it'll give us x_{t-1} by sampling from the reverse distribution. For this, it'll compute the mean and variance according to their respective equations and return a sample from this distribution using the reparameterization trick. To do this, we also store 1 minus alpha_t, 1 minus the cumulative product terms, and its square root. Obviously we can compute all of this at runtime as well, but pre-computing them simplifies the code for the equations a lot.

So let's implement the noise scheduler first. As I mentioned, we'll be creating a linear noise schedule. After initializing all the parameters from the arguments of this class, we'll create betas to linearly increase from start to end such that we have beta_t from zero till the last time step. We'll then initialize all the variables that we need for the forward and reverse process equations.

The add_noise method is our forward process, so it will take in an image, the original noise sample and a time step t. The images and noise will be of shape B×C×H×W and the time step will be a 1D tensor of size B. For the forward process, we need the square root of the cumulative product terms for the given time steps and the square root of 1 minus those terms, and then we reshape them so that they are B×1×1×1. Lastly, we apply the forward process equation.

The second function is the one that takes the image x_t and gives us a sample from our learned reverse distribution. For that, we'll have it receive x_t, the noise prediction from the model and the time step t as arguments. We'll be saving the original image prediction x_0 for visualizations, and we get that using this equation. It can be obtained from the same forward process equation that takes us from x_0 to x_t by just rearranging the terms and using the noise prediction instead of the actual noise. Then for sampling, we'll compute the mean, which is simply this equation, and as mentioned, at t = 0 we simply return the mean; noise is only added for the other time steps. The
variance of that is the same as the variance of the ground-truth denoising distribution conditioned on x_0, which was this. Lastly, we'll sample from a Gaussian distribution with this mean and variance using the reparameterization trick. This completes the entire noise scheduler, which handles the forward process of adding noise and the reverse process of sampling.

Let's now get into the model. For diffusion models, we are actually free to use whatever architecture we want as long as we meet two requirements: the first is that the shape of the input and output must be the same, and the other is some mechanism to fuse in time step information. Let's talk about why for a bit. The information of what time step we are at is always available to us, whether we are training or sampling. In fact, knowing what time step we are at would aid the model in predicting the original noise, because we are providing the information of how much of that input image actually is noise. So instead of just giving the model an image, we also give it the time step that we are at.

For the model I'll use U-Net, which is also what the authors use, but for the exact specification of the blocks, activations, normalizations and everything else, I'll mimic the Stable Diffusion U-Net used by Hugging Face in the diffusers pipeline. That's because I plan to soon create a video on Stable Diffusion, so that will allow me to reuse a lot of the code that I'll create now.

Actually, even before going into the U-Net model, let's first see how the time step information is represented. Let's call this the time embedding block, which will take in a 1D tensor of time steps of size B, which is the batch size, and give us a t_emb_dim sized representation for each of those time steps in the batch. The time embedding block would first convert the integer time steps into some vector representation using an embedding space. That will then be fed to two linear layers separated by an activation to give us our final time step representation. For the embedding space,
the authors use the sinusoidal position embedding used in Transformers. For activations, I have used sigmoid linear units (SiLU) everywhere, but you can choose a different one as well.

Okay, now let's get into the model. As I mentioned, I'll be using U-Net just like the authors, which is essentially this encoder-decoder architecture where the encoder is a series of downsampling blocks, each of which reduces the size of the input, typically by half, and increases the number of channels. The output of the final downsampling block is passed to layers of mid block, which all work at the same spatial resolution, and after that we have a series of upsampling blocks. These one by one increase the spatial size and reduce the number of channels to ultimately match the input size of the model. The upsampling blocks also fuse in the output coming from the corresponding downsampling block at the same resolution via residual skip connections. Most diffusion models follow this U-Net architecture but differ in the specifics of what happens inside the blocks, and as I mentioned, for this video I've tried to mimic to some extent what's happening inside the Stable Diffusion U-Net from Hugging Face.

Let's look closely into the down block; once we understand that, the rest are pretty easy to follow. Down blocks of almost all the variations would be a ResNet block followed by a self-attention block and then a downsample layer. For our ResNet plus self-attention block, we'll have group norm followed by an activation followed by a convolutional layer. The output of this will again be passed to a normalization, activation and convolutional layer. We add a residual connection from the input of the first normalization layer to the output of the second convolutional layer. This entire thing is what we'll call a ResNet block, which you can think of as two convolutional blocks plus a residual connection. This is then followed by a normalization and a self-attention layer, and again a residual connection. We have multiple
such ResNet plus self-attention layers, but for simplicity our current implementation will only have one layer. The code on the repo, however, will be configurable to make as many layers as desired.

We also need to fuse the time information, and the way it's done is that each ResNet block has an activation followed by a linear layer, and we pass the time embedding representation through them first before adding it to the output of the first convolutional layer. So essentially this linear layer is projecting the t_emb_dim time step representation to a tensor of the same size as the channels in the convolutional layer's output. That way these two can be added by replicating this time step representation across the spatial dimensions.

Now that we have seen the details inside the block, to simplify, let's replace everything within this part with a ResNet block and everything within this with a self-attention block. The other two blocks use the same components, just slightly differently. Let's go back to our previous illustration of all three blocks. We saw that the down block is just multiple layers of ResNet followed by self-attention, and lastly we have a downsampling layer. The up block is exactly the same, except that it first upsamples the input to twice the spatial size and then concatenates the down block output of the same spatial resolution across the channel dimension. Post that, it's the same layers of ResNet and self-attention blocks. The layers of the mid block always maintain the input at the same spatial resolution. The Hugging Face version has first one ResNet block, followed by layers of self-attention and ResNet, so I also went ahead and made the same implementation.

And let's not forget the time step information. For each of these ResNet blocks we have a time step projection layer. This was what we just saw: an activation followed by a linear layer. The existing time step representation goes through these blocks before being added to the output of the first convolution layer of the ResNet
block.

Let's see how all of this looks in code. The first thing we'll do is implement the sinusoidal position embedding code. This function receives a B-sized 1D tensor of time steps, where B is the batch size, and is expected to return a B × t_emb_dim tensor. We first implement the factor part, which is everything that the position, which here is the time step integer value, will be divided by inside the sine and cosine functions. This will get us all values from 0 to half of the time embedding dimension size; half because we'll concatenate sine and cosine. After replicating the time step values, we get our desired shape tensor and divide it by the factor that we computed. This is now exactly the argument with which we have to call the sine and cosine functions. Again, all this method does is convert the integer time steps into embeddings using a fixed embedding space.

Now we'll be implementing the down block, but before that let's quickly take a peek at what layers we need to implement. We need layers of ResNet plus self-attention blocks. ResNet will be two norm-activation-convolution layers with a residual, and self-attention will be a norm followed by self-attention. We also need the time projection layers, which will project the time embedding onto the same dimension as the number of channels in the output of the first convolution's feature map. I'll only implement the block to have one layer for now, so we'll only need single instances of these. And after ResNet and self-attention, we have a downsampling layer.

Okay, back to coding it. For each down block we'll have these arguments: in_channels is the number of channels expected in the input, out_channels is the number of channels we want in the output of this down block, and then we have the embedding dimension. I also add a downsample argument just so that we have the flexibility to skip the downsampling part in the code. Lastly, num_heads is the number of heads that our attention block will have. This is our first convolution block of the
ResNet. We make the channel conversion from input to output channels via the first conv layer itself, so after this everything will have out_channels as the number of channels. Then these are the time projection layers for this ResNet block. Remember, each ResNet block will have one of these, and we had seen that this was just an activation followed by a linear layer. The output of this linear layer should have out_channels so that we can do the addition. This is the second conv block, which will be exactly the same except with everything operating on out_channels as the channel dimension.

Then we add the attention part: the normalization and multi-head attention. The feature dimension for multi-head attention will be the same as the number of channels. This residual connection is a 1×1 conv layer, and it ensures that the input to the entire ResNet block can be added to the output of the last conv layer. Since the input had in_channels, we have to first transform it to out_channels, so this just does that. And finally we have the downsample layer, which can also be average pooling, but I've used a convolution with stride two, and if the arguments say not to downsample, then this is just identity.

The forward method will be very simple. We first pass the input to the first conv block and then add the time information. Then, after going through the second conv block, we add the residual, but only after passing it through the 1×1 conv layer. Attention will happen between all the H×W spatial cells, with out_channels being the feature dimensionality of each of those cells, so the transpose just ensures that the channel features are the last dimension. After the channel dimension has been enriched with the self-attention representation, we do the transpose back and again have the residual connection. If we had multiple layers, we would loop over this entire thing, but since we are only implementing one layer for now, we'll
just call the downsampling convolution after this.

Next up is the mid block, and again let's revisit the illustration for this. For the mid block we'll have a ResNet block and then layers of self-attention followed by ResNet. Same as the down block, we'll only implement one layer for now. The code for the mid block will have the same kind of layers, but we need two instances of every layer that belongs to the ResNet block, so let's just put all of that in. The forward method will have just one difference, that is, we call the first ResNet block and then self-attention and the second ResNet block. Had we implemented multiple layers, the self-attention and the following ResNet block would be in a loop.

Now let's do the up block, which will be exactly the same as the down block except that instead of downsampling we'll have an upsampling layer. We'll use ConvTranspose to do the upsampling for us. In the forward method, let's first copy everything that we did for the down block. Then we need to make three changes: add the down block output of the same spatial resolution as an argument, then, before the ResNet plus self-attention blocks, upsample the input, and concat the corresponding down block output. Another way to implement this could be to first concat, followed by ResNet and self-attention, and then upsample, but I went with this one.

Finally, we'll build our U-Net class. It will receive the number of channels in the input image as an argument. We'll hardcode the down channels and mid channels for now. The way the code is implemented is that these four values of down channels will essentially be converted into three down blocks, each taking an input of channel i dimensions and converting it to an output of channel i+1 dimensions, and the same goes for the mid blocks. This is just the downsample arguments that we are going to pass to the blocks. Remember, our time embedding block had a position embedding followed by linear layers with an activation in between; these are those two linear layers. This is different from the time step layers that we had for each ResNet block.
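
As a reminder of the position-embedding step that the time embedding block starts from, here's a minimal sketch in plain Python. Treat it as an illustration under assumptions rather than the code from the repo: the function name get_time_embedding is hypothetical, it returns nested lists instead of a B × t_emb_dim tensor so that it runs without any dependencies, and the base of 10000 is the constant from the original Transformer position embedding.

```python
import math


def get_time_embedding(time_steps, t_emb_dim):
    """Map integer time steps to fixed sinusoidal embeddings.

    time_steps: list of B integer time steps.
    Returns a list of B embeddings, each of length t_emb_dim,
    laid out as [sin values..., cos values...].
    """
    assert t_emb_dim % 2 == 0, "embedding dimension must be even"
    half = t_emb_dim // 2

    # The factor each time step is divided by inside sin and cos:
    # 10000^(i / half) for i = 0 .. half - 1.
    factor = [10000 ** (i / half) for i in range(half)]

    embeddings = []
    for t in time_steps:
        args = [t / f for f in factor]
        # Concatenate sine and cosine halves to reach t_emb_dim values.
        embeddings.append([math.sin(a) for a in args] + [math.cos(a) for a in args])
    return embeddings


if __name__ == "__main__":
    emb = get_time_embedding([0, 10, 999], t_emb_dim=8)
    print(len(emb), len(emb[0]))
```

In the full block, this fixed embedding is what then goes through the two linear layers with the activation in between.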
This will only be called once in an entire forward pass, right at the start, to get the initial time step representation. We'll also first have to convert the input to have the same channel dimension as the input of the first down block, and this convolution will just do that for us. We then create the down blocks, mid blocks and up blocks based on the number of channels provided. For the last up block, I simply hardcode the output channels as 16. The output of the last up block undergoes a normalization and convolution to get us to the same number of channels as the input image. We'll be training on the MNIST dataset, so the number of channels in the input image would be one.

In the forward method, we first call the conv_in layer and then get the time step representation by calling the sinusoidal position embedding followed by our linear layers. Then we just call the down blocks, and we keep saving the output of the down blocks because we need them as input for the up blocks. During the up block calls, we simply take the down outputs from that list one by one and pass each together with the current output. Then we call our normalization, activation and output convolution. Once we pass a 4×1×28×28 input tensor to this, we get the following output shapes. You can see that because we downsampled only twice, the smallest size input to any convolution layer is 7×7.

The code on the repo is much more configurable; it creates these blocks based on whatever configuration is passed and can create multiple layers as well. We'll look at a sample config file later, but first let's take a brief look at the dataset, training and sampling code.

The dataset class is very simple. It just takes in the path where the images are and then stores the file names of all those images. Right now we are building an unconditional diffusion model, so we don't really use the labels. Then we simply load the images, convert them to tensors, and also scale them to be from -1 to 1, just like the authors, so
that our model consistently sees similarly scaled images as compared to the random noise.

Moving to the train_ddpm file, the train function loads up the config and gets the model, dataset, diffusion and training configurations from it. We then instantiate the noise scheduler, dataset and our model. After setting up the optimizer and the loss function, we run our training loop. Here we take our image batch, sample random noise of shape B×1×H×W, and sample random time steps. The scheduler adds noise to these batch images based on the sampled time steps, and we then backpropagate based on the loss between the noise prediction by our model and the actual noise that we added.

For sampling, similar to training, we load the config and the necessary parameters, our model and the noise scheduler. The sample method then creates a random noise sample based on the number of images requested, and then we go through the time steps in reverse. For each time step, we get our model's noise prediction and call the reverse process of the scheduler that we had created with this x_t and noise prediction, and it returns x_{t-1} and an estimate of the original image. We can choose to save either one of these to see the progress of sampling.

Now let's also take a look at our config file. This just has the dataset parameters, which store our image path, and model params, which store the parameters necessary to create the model, like the number of channels, down channels and so on. Like I had mentioned, we can put in the number of layers required in each of our down, mid and up blocks. And finally we specify the training parameters. The U-Net class in the repo has blocks which actually read this config and create the model based on whatever configuration is provided. It does everything similar to what we just implemented, except that it loops over the number of layers as well. I've also added the shapes of the output that we would get at each of those block calls, so that it helps a bit in understanding everything. For
training, as I mentioned, I train on MNIST, but in order to see if everything works for RGB images, I also train on this dataset of texture images because I already had it downloaded from an earlier implementation video. Here is a sample of images from this dataset. These are not generated; these are images from the dataset itself. Though the dataset has 256×256 images, I resized the images to be 28×28, primarily because I lack two important things for training on larger images: patience and compute, or rather cheap compute.

For MNIST, I train for about 20 epochs, taking 40 minutes on a V100 GPU, and for this texture dataset I train for about 60 epochs, taking roughly 3 hours, and that gives me these results. Here I'm saving the original image prediction at each time step, and you can see that because MNIST images are all similar looking, the model pretty quickly gets a decent original image prediction, whereas for the texture dataset it doesn't until about the last 200 to 300 time steps. But by the end of all the steps, we get decent results for both datasets. You can obviously train it on a larger sized dataset, though you would probably have to increase the channels and train for more epochs to get nice results.

So that's all that I wanted to cover for implementing DDPM. We went through the scheduler implementation and the U-Net implementation, and saw how everything comes together in the training and sampling code. Hopefully it gave you a better understanding of diffusion models. Thank you so much for watching this video, and if you're liking the content and benefiting from it, do subscribe to the channel. See you in the next video.
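
As a recap of the noise scheduler from this walkthrough, here's a minimal sketch in plain Python. It's an illustration under assumptions rather than the code from the repo: it works on flat lists of floats instead of B×C×H×W tensors so that it runs without any dependencies, and the class name LinearNoiseScheduler and the method name sample_prev_timestep are hypothetical (add_noise is the only name mentioned in the video). The equations are the standard DDPM forward step, posterior mean and posterior variance described above.

```python
import math
import random


class LinearNoiseScheduler:
    """Linear beta schedule plus the forward and reverse DDPM steps (sketch)."""

    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        step = (beta_end - beta_start) / (num_timesteps - 1)
        # betas increase linearly from beta_start to beta_end.
        self.betas = [beta_start + i * step for i in range(num_timesteps)]
        self.alphas = [1.0 - beta for beta in self.betas]
        # Pre-compute the cumulative products of alpha for every t.
        self.alpha_cum_prod = []
        running = 1.0
        for alpha in self.alphas:
            running *= alpha
            self.alpha_cum_prod.append(running)

    def add_noise(self, x0, noise, t):
        """Forward process: jump from x_0 straight to x_t."""
        sqrt_acp = math.sqrt(self.alpha_cum_prod[t])
        sqrt_one_minus_acp = math.sqrt(1.0 - self.alpha_cum_prod[t])
        return [sqrt_acp * x + sqrt_one_minus_acp * n for x, n in zip(x0, noise)]

    def sample_prev_timestep(self, xt, noise_pred, t):
        """Reverse process: return (x_{t-1}, estimate of x_0)."""
        acp = self.alpha_cum_prod[t]
        sqrt_one_minus_acp = math.sqrt(1.0 - acp)
        # Rearranged forward equation, with the predicted noise plugged in.
        x0_pred = [(x - sqrt_one_minus_acp * n) / math.sqrt(acp)
                   for x, n in zip(xt, noise_pred)]
        # Mean of the reverse distribution in terms of x_t and the noise.
        mean = [(x - self.betas[t] * n / sqrt_one_minus_acp) / math.sqrt(self.alphas[t])
                for x, n in zip(xt, noise_pred)]
        if t == 0:
            # No noise is added at the last step; just return the mean.
            return mean, x0_pred
        variance = self.betas[t] * (1.0 - self.alpha_cum_prod[t - 1]) / (1.0 - acp)
        sigma = math.sqrt(variance)
        # Reparameterization trick: mean + sigma * z with z ~ N(0, 1).
        sample = [m + sigma * random.gauss(0.0, 1.0) for m in mean]
        return sample, x0_pred


if __name__ == "__main__":
    scheduler = LinearNoiseScheduler()
    x0 = [0.5, -0.25, 1.0]
    noise = [random.gauss(0.0, 1.0) for _ in x0]
    xt = scheduler.add_noise(x0, noise, t=500)
    x_prev, x0_pred = scheduler.sample_prev_timestep(xt, noise, t=500)
    print(x0_pred)
```

When the true noise is passed in as the prediction, the x_0 estimate recovers the original values up to floating-point error, which is a handy sanity check before plugging in a trained model.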