Transcript for:
Understanding AI Generative Models

This is an artificial intelligence image generator. Given a text description of a picture, it will create, out of nothing, an image matching that description. As you can see, it is capable of generating high-quality images of all kinds of different scenes. And it's not just images: in recent years, generative AI models have been developed that can generate text, audio, code, and soon, videos too. All of these models are based on the same underlying technology, namely deep neural networks.

In a few of my previous videos, I've explained how and why deep neural networks work so well. But I only explained how neural nets can solve prediction tasks. In a prediction task, the neural net is trained on a bunch of examples of inputs and their labels, and tries to predict what the label will be for a new input which it hasn't seen before. For example, if you trained a neural net on images labelled with the type of object appearing in each image, that neural net would learn to predict which object a human would say is in an image, even for new images it hasn't seen before. Under the hood, prediction tasks are solved by converting the training dataset into a set of points in a space and then fitting a curve through those points, so prediction tasks are also known as curve-fitting tasks.

And while prediction is certainly cool and very useful, it's not generation. Right? This model is just fitting a curve to a set of points. It can't produce new images. So where does the creativity of these generative models come from, if neural nets can only do curve fitting?

Well, all of these generative models are in fact just predictors. Yep, it turns out that the process of producing novel works of art can be reduced to a curve-fitting exercise. And in this video, you'll learn exactly how.

Suppose that we have a training dataset consisting of a bunch of images. We want to train a neural net to create new images which are similar in style to these training images.

The first thing you might try is to simply use the images as labels to train the predictor. Here we don't care about the mapping from inputs to outputs, so we can use anything we like for the inputs, for example a completely black image. Predictors learn to map inputs to outputs according to their training data. So this predictor, once trained, should be able to map the dummy all-black image to new images like those seen in the training set, right? Err, okay, maybe not quite. That didn't work so well: instead of producing a nice, beautiful picture, we just got this blurry mess.

This demonstrates a very important fact about predictors. If there are multiple possible labels for the same input, the predictor will learn to output the average of those labels. For traditional classification tasks, this isn't really a problem, because the average of multiple class labels can still be a meaningful label. For example, this image could plausibly be given two different labels: both cat and dog would be valid. In that case a classifier would learn to output the average of those labels, which means you end up with a score of 0.5 cat and 0.5 dog, which is still a useful label. In fact, it's arguably a better label than either of the original ones. On the other hand, when you average a bunch of images together, you do not get a meaningful image out. You just get a blurry mess.
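To see why the average comes out, here is a minimal sketch in NumPy. The tiny random "images" and the training loop are purely illustrative assumptions, but they show that the squared-error-optimal output for a single repeated input is exactly the mean of its labels:

    # A minimal sketch (NumPy). The tiny random "images" are stand-ins;
    # the point is that minimizing squared error against many labels for
    # one fixed input converges to the average of those labels.
    import numpy as np

    rng = np.random.default_rng(0)
    images = rng.random((100, 8, 8))      # 100 hypothetical training images
    prediction = np.zeros((8, 8))         # the model's output for the all-black input

    for step in range(1000):              # plain gradient descent on the MSE
        grad = 2 * (prediction - images).mean(axis=0)
        prediction -= 0.1 * grad

    print(np.allclose(prediction, images.mean(axis=0)))   # True: it learned the average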
Let's try something a bit easier this time. How about, instead of generating a new image from scratch, we try to complete an image which has a part of it missing? In fact, let's make this really easy and suppose there is only one missing pixel, say, the bottom-right pixel. Can we train a neural net to predict the value of this one missing pixel? Well, as before, the neural net is going to output the average of the plausible values that the missing pixel can take. But since we're only predicting one pixel, the average value is still meaningful. The average of a bunch of colors is just another color; there's no blurring effect. So this model works perfectly fine!

And we can use the value predicted by this neural net to complete images which are missing the bottom-right pixel.

Great, so we can complete images with 1 missing pixel… What about 2?

Well, we can do the same thing again: train another neural net on images with 2 missing pixels, using the value of the second missing pixel as the label, and then use this neural net to fill in the second missing pixel. Now we have an image with just 1 missing pixel, and we can use the first neural net to fill that in. Great.

And we can do this for every pixel in the image: train a neural net to predict the color of that pixel when it and all of the subsequent pixels are missing. Now we can "complete" an image starting from a fully black image, filling in one pixel at a time. Crucially, each neural net only predicts one pixel, so there's no blurring effect.

And there we have it, we have just generated a plausible image, out of nothing… There's just one small problem. If we run this model again, it will generate exactly the same image… Not very creative, is it? But not to worry, we can fix this by introducing a bit of random sampling. You see, all predictors actually output a probability distribution over possible labels. Usually, we just take the label with the largest probability as the predicted value. But if we want diversity in our outputs, we can instead randomly sample a value from this probability distribution. This way, each time the model is run, it will sample different values at each step, which changes the prediction for subsequent steps, and we get a completely different image each time. Now we have an interesting image generator.

But still, at the end of the day, this model is made of predictors. They take as input a partially masked image and predict the value of the next pixel. The only difference between this and a traditional image classifier is the label we used for training. The labels for our generator happen to be pixel colors which come from the original image itself, not from a human labeller. This is a very important point in practice: it means we don't need humans to manually label images for this model; we can just scrape unlabelled images off the internet. But from the point of view of the neural net, it doesn't know, nor does it care, that the label came from the original image. As far as it's concerned, this is just a curve-fitting exercise, like any other.

The generative model we've just created is called an auto-regressor. We have a removal process, which removes pixels one at a time, and we train neural nets to undo this process, generating and adding back in pixels one at a time.
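Here is what that generation loop looks like in code, as a minimal PyTorch sketch. The `model` and its calling convention are assumptions for illustration: it is taken to return logits over the 256 possible values of pixel i, given the pixels generated so far.

    # A minimal sketch of auto-regressive generation (PyTorch).
    # `model` is a hypothetical network mapping (partial image, position)
    # to logits over the 256 possible values of the next pixel.
    import torch

    def generate(model, height=28, width=28):
        image = torch.zeros(height * width)        # start from an all-black image
        for i in range(height * width):
            logits = model(image, i)               # distribution over pixel i's value
            probs = torch.softmax(logits, dim=-1)
            value = torch.multinomial(probs, 1)    # sample, rather than take the argmax
            image[i] = value.item() / 255.0        # fill in the pixel and continue
        return image.reshape(height, width)

Replacing `torch.multinomial` with an argmax would give back the deterministic, same-image-every-time behaviour described above.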
This is actually one of the oldest generative models; the very earliest use of auto-regression dates back to 1927, when it was used to model the timing of sunspots. But auto-regressors are still in use today. Most notably, ChatGPT is an auto-regressor: it generates text by using a transformer classifier to output a probability distribution over possible next words, given a partial piece of text. However, auto-regressors are not used to generate images anymore. The reason is that, while they can generate very realistic images, they take too long to run.

In order to generate a sample with an auto-regressor, we need to evaluate a neural net once for every element. This is fine for generating a few thousand words to make a piece of text, but large images can have tens of millions of pixels. How can we get away with fewer neural net evaluations?

For our auto-regressor, we removed one pixel at a time. But we don't have to remove only one pixel. We could, for example, remove a 4 by 4 patch of pixels at a time, and train the neural net to predict all 16 missing pixels at once. This way, when we use our model to generate an image, it can produce 16 pixels per evaluation, and so generation is 16 times as fast.

But there is a limit to this. We can't generate too many pixels at the same time. In the extreme case, if we try to generate every pixel in the image at once, then we're back to the original problem: there are many possible labels that get averaged together into a blurry mess.

To be clear, the reason why the image quality degrades is that, when we predict a bunch of pixels at the same time, the model has to decide on the values for all of them at once. There are lots of plausible ways that this missing patch could be filled in, and so the model outputs the average of those. The model isn't able to make sure that the generated values are consistent with each other. In contrast, when we predict one pixel at a time, the model gets to see the previously generated pixels, and so it can change its prediction for this pixel to make it consistent with what has already been generated. This is why there's a trade-off: the more pixels we generate at once, the less computation we need to use, but the worse the quality of the generated images will be.

However, this problem only arises if the values we are predicting are related to each other. Suppose instead that the values were statistically independent of each other, that is, knowing one of them does not help to predict any of the others. In this case, the model doesn't need to look at the previously generated values, since knowing what they were wouldn't change its prediction for the next value anyway. So you can predict all of them at the same time without any loss in quality.

That means, ideally, we want our model to generate a set of pixels that are unrelated to each other. For natural images, nearby pixels are the most strongly related, because they are usually part of the same object. Knowing the value of one pixel very often gives you a good idea of what color nearby pixels will be. This means that removing pixels in contiguous chunks is actually the worst way to do it. Instead, we should be removing pixels that are far away from each other, and hence more likely to be unrelated. So if, in each step, we remove a random set of pixels and predict values for those, then we can remove more pixels in each step for the same loss in image quality, compared to contiguous chunks.
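As a concrete illustration, here is a minimal PyTorch sketch of that random-order, several-pixels-per-step generation loop. Again, the `model` and its interface are assumptions: it is taken to return one set of logits per requested pixel position.

    # A minimal sketch: fill in k randomly chosen (hence spread-out) pixels
    # per neural net evaluation, instead of one pixel at a time.
    import torch

    def generate_random_order(model, n_pixels=784, k=16):
        image = torch.zeros(n_pixels)
        order = torch.randperm(n_pixels)           # a random, spread-out pixel order
        for start in range(0, n_pixels, k):
            positions = order[start:start + k]     # the k pixels to fill this step
            logits = model(image, positions)       # assumed shape: (k, 256)
            probs = torch.softmax(logits, dim=-1)
            samples = torch.multinomial(probs, 1).squeeze(-1)
            image[positions] = samples.float() / 255.0
        return image

With k = 16, a 28 by 28 image takes 49 neural net evaluations instead of 784.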
In order to minimize the number of steps needed for generation, we want the pixels we remove in each step to be as spread out as possible. Removing pixels in a random order is a pretty good way of maximizing the average spread, but there is an even better way.

We can think of our generative model as two processes: a removal process that gradually removes information from the input until nothing is left, and a generation process that uses neural nets to undo the removal process, generating and adding back in information. So far, we have been completely removing pixels. But rather than completely removing a pixel, we could instead remove only some of the information from a pixel, by, for example, adding a small amount of random noise to it. This means we no longer know exactly what the original pixel value was, but we do know it was somewhere close to the noisy value. Now, instead of removing a bunch of pixels in each step, we can add noise to the entire image. This way, we can remove information from every pixel in the image in a single step, which is the most spread-out way of removing information. And since it's more spread out, we can remove more information in each step for the same loss in generation quality.

There is one small problem with this, though. When we want to generate a new image, we need to start the neural net off with some initial blank image. When we were removing pixels, every image eventually ended up as a completely black image, so of course that's where we started the generation process from. But now that we're adding noise, the values just keep getting larger and larger, never converging to anything. So where do we start the generation process from?

We can avoid this problem by changing our noising step slightly, so that we first scale down the original value and then add the noise. This ensures that, when we repeat this noising step many times, the information from the original image will disappear, and the result will be equivalent to a pure random sample from the noise distribution. So we can start our generation process from any such noise sample.

And there we have it: this is known as a denoising diffusion model. The overall form is identical to an auto-regressor; the only difference is the way in which we remove information at each step. By adding noise, we can spread the removal of information across the entire image, which makes the predicted values as independent of each other as possible, allowing us to use fewer neural net evaluations. Empirically, diffusion models can produce high-quality, photo-realistic images in about a hundred steps, where auto-regressors would take millions.
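Here is a minimal NumPy sketch of that "scale down, then add noise" step. The noise level beta and the stand-in image are assumptions for illustration:

    # A minimal sketch of the "scale down, then add noise" step (NumPy).
    # beta is an assumed noise level in (0, 1); repeating the step drives
    # any starting image toward a pure sample of standard normal noise.
    import numpy as np

    rng = np.random.default_rng(0)

    def noising_step(x, beta=0.02):
        return np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

    x = rng.random((28, 28))          # a stand-in "image"
    for t in range(1000):
        x = noising_step(x)
    print(round(x.mean(), 2), round(x.std(), 2))   # near 0 and 1: pure noise

The sqrt(1 - beta) scaling is what makes the process converge: without it, the values would grow without bound, which is exactly the problem described above.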
Now that we understand how these generative models work at a conceptual level, if you are ever going to implement them in practice, there are a few important technical details that you should be aware of.

First, in the procedure I described for auto-regression, I used a different neural net in each step of the process. This is certainly the best way to get the most accurate predictions, but it's also very inefficient, since we need to train a whole bunch of different neural nets. In practice, you would just use the same model for every step. This gives slightly worse predictions, but the savings in computation time more than make up for it.

In order to train a single neural net to perform all of the generation steps, you would remove a random number of pixels from each input, and train the neural net to predict the corresponding next pixel of each input. Additionally, you can give the number of pixels removed as an input to the neural net, so that it knows which pixel it's supposed to be generating. Now this one neural net can be used for all generation steps.

In the setup I just described, for each training image, the neural net is trained on only one generation step for that image. But ideally, we would like to train it on every generation step of every image; we get more use out of our training data that way. If you did this the naïve way, you would have to evaluate the neural net once for every generation step, which means a lot more computation.

Fortunately, there exist special neural net architectures, known as causal architectures, that allow you to train on all of these generation steps while only evaluating the neural net once. There are causal versions of all of the popular neural net architectures, such as causal convolutional neural nets and causal transformers. Causal architectures actually give slightly worse predictions, but in practice, auto-regression is almost always done with causal architectures because the training is so much faster. The generation process for causal architectures is still exactly the same, though.

For diffusion models, you can't use causal architectures, and so you do have to train with each data point at a random generation step.

I described the diffusion model as predicting the slightly less noisy image from the previous step. However, it's actually better to predict the original, completely clean image at every step. The reason is that this makes the job of the neural net easier. If you make it predict the noisy next-step image, then the neural net needs to learn how to generate images at all different noise levels, which means the model will waste some of its capacity learning to produce noisy versions of images. If you instead have the neural net always predict the clean image, then the model only needs to learn how to generate clean images, which is all we care about. You can then take the predicted clean image and reapply the noising process to it to get the next step of the generation process.

Except that, when you predict the clean image, at the early steps of the generation process the model has only pure noise as input, so the original clean image could have been anything, and you get a blurry mess again. To avoid this, we can train the neural net to predict the noise which was added to the image. Once we have a predicted value for the noise, we can plug it into the noising equation to get a prediction for the original clean image. So we are still predicting the original clean image, just in a roundabout way. The advantage of doing it this way is that now the model output is uncertain at the later stages of the generation process, since any noise could have been added to the clean image. So the model outputs the average of a bunch of different noise samples, which is still valid noise.
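Putting those last two details together, here is a minimal PyTorch sketch of one diffusion training step. The `model`, the `alpha_bar` noise schedule, and the image shapes are all assumptions for illustration; the key point is that the network predicts the added noise, from which a clean-image prediction can be recovered.

    # A minimal sketch of one diffusion training step (PyTorch).
    # alpha_bar is an assumed 1-D tensor of cumulative scaling factors, one per step.
    import torch
    import torch.nn.functional as F

    def training_step(model, clean, alpha_bar):
        t = torch.randint(0, len(alpha_bar), (clean.shape[0],))   # random step per image
        a = alpha_bar[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(clean)
        noisy = a.sqrt() * clean + (1 - a).sqrt() * noise         # scale down, add noise
        pred_noise = model(noisy, t)                # the step number t is also an input
        loss = F.mse_loss(pred_noise, noise)        # train to predict the added noise
        # invert the noising equation to recover the clean-image prediction:
        pred_clean = (noisy - (1 - a).sqrt() * pred_noise) / a.sqrt()
        return loss, pred_clean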
So far, we've just been generating images from nothing, but most image generators actually allow you to provide a text prompt describing the image you want to make. The way this works is exactly the same; you just give the neural net the text as an additional input at each step. These models are trained on pairs of images and their corresponding text descriptions, usually scraped from image alt-text tags found on the internet. This ensures that the generated image is something for which the text prompt could plausibly have been given as a description.

In principle, you can condition generative models on anything, not just text, so long as you can find appropriate training data. For example, here is a generative model that is conditioned on sketches.

Finally, there's a technique that makes conditional diffusion models work better, called classifier-free guidance. For this, during training, the model is sometimes given the text prompt as an additional input, and sometimes it isn't. This way, the same model learns to make predictions with or without the conditioning prompt as input. Then, at each step of the denoising process, the model is run twice: once with the prompt, and once without. The prediction without the prompt is subtracted from the prediction with the prompt, which removes details that are generated without the prompt, leaving only details that came from the prompt, and so generations follow the prompt more closely.
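As a minimal sketch of one guided denoising step, again with the `model` interface, the empty prompt, and the guidance scale as illustrative assumptions:

    # A minimal sketch of classifier-free guidance at one denoising step (PyTorch).
    import torch

    def guided_prediction(model, noisy_image, t, prompt, empty_prompt, scale=7.5):
        with_prompt = model(noisy_image, t, prompt)           # run with the prompt
        without_prompt = model(noisy_image, t, empty_prompt)  # run without it
        # keep the unconditional prediction, and amplify whatever the prompt adds
        return without_prompt + scale * (with_prompt - without_prompt)

A guidance scale above 1 pushes the output further toward the prompt-specific details.

In conclusion, generative AI, like all machine learning, is just curve fitting.

And that's all for this video. If you enjoyed it, please like and subscribe. And if you have any suggestions for topics you'd like me to cover in a future video, leave a comment below.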