Transcript for:
Understanding AI Generative Models

This is an artificial intelligence image generator. Given a text description of a picture, it will create, out of nothing, an image matching that description. As you can see, it is capable of generating high-quality images of all kinds of different scenes. And it's not just images: in recent years, generative AI models have been developed that can generate text, audio, code, and soon, videos too. All of these models are based on the same underlying technology, namely deep neural networks.

In a few of my previous videos, I've explained how and why deep neural networks work so well. But I only explained how neural nets can solve prediction tasks. In a prediction task, the neural net is trained on a bunch of examples of inputs and their labels, and tries to predict what the label will be for a new input which it hasn't seen before. For example, if you trained a neural net on images labelled with the type of object appearing in each image, that neural net would learn to predict which object a human would say is in an image, even for new images it hasn't seen before. Under the hood, prediction tasks are solved by converting the training dataset into a set of points in a space and then fitting a curve through those points, so prediction tasks are also known as curve-fitting tasks.

And while prediction is certainly cool and very useful, it's not generation. Right? This model is just fitting a curve to a set of points. It can't produce new images. So where does the creativity of these generative models come from, if neural nets can only do curve fitting?

Well, all of these generative models are in fact just predictors. Yep, it turns out that the process of producing novel works of art can be reduced to a curve-fitting exercise. And in this video, you'll learn exactly how.

Suppose that we have a training dataset consisting of a bunch of images. We want to train a neural net to create new images which are similar in style to these training images.

The first thing you might try is to simply use the images as labels to train the predictor. Here we don't care about the mapping from inputs to outputs, so we can use anything we like for the inputs, for example a completely black image. Predictors learn to map inputs to outputs according to their training data. So this predictor, once trained, should be able to map the dummy all-black image to new images like those seen in the training set, right? Err, okay, maybe not quite. That didn't work so well: instead of producing a nice, beautiful picture, we just got this blurry mess.

This demonstrates a very important fact about predictors. If there are multiple possible labels for the same input, the predictor will learn to output the average of those labels. For traditional classification tasks, this isn't really a problem, because the average of multiple class labels can still be a meaningful label. For example, this image could plausibly be given two different labels: both cat and dog would be valid. In that case a classifier would learn to output the average of those labels, which means you end up with a score of 0.5 cat and 0.5 dog, which is still a useful label. In fact, it's arguably a better label than either of the original ones. On the other hand, when you average a bunch of images together, you do not get a meaningful image out. You just get a blurry mess.
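To see why the average comes out, here is a minimal sketch in NumPy. The tiny random "images" and the training loop are purely illustrative assumptions, but they show that the squared-error-optimal output for a single repeated input is exactly the mean of its labels:

    # A minimal sketch (NumPy). The tiny random "images" are stand-ins;
    # the point is that minimizing squared error against many labels for
    # one fixed input converges to the average of those labels.
    import numpy as np

    rng = np.random.default_rng(0)
    images = rng.random((100, 8, 8))      # 100 hypothetical training images
    prediction = np.zeros((8, 8))         # the model's output for the all-black input

    for step in range(1000):              # plain gradient descent on the MSE
        grad = 2 * (prediction - images).mean(axis=0)
        prediction -= 0.1 * grad

    print(np.allclose(prediction, images.mean(axis=0)))   # True: it learned the average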
Let's try something a bit easier this time. How about, instead of generating a new image from scratch, we try to complete an image which has a part of it missing? In fact, let's make this really easy and suppose there is only one missing pixel, say, the bottom-right pixel. Can we train a neural net to predict the value of this one missing pixel? Well, as before, the neural net is going to output the average of the plausible values that the missing pixel can take. But since we're only predicting one pixel, the average value is still meaningful. The average of a bunch of colors is just another color; there's no blurring effect. So this model works perfectly fine!

And we can use the value predicted by this neural net to complete images which are missing the bottom-right pixel.

Great, so we can complete images with 1 missing pixel… What about 2?

Well, we can do the same thing again: train another neural net on images with 2 missing pixels, using the value of the second missing pixel as the label, and then use this neural net to fill in the second missing pixel. Now we have an image with just 1 missing pixel, and we can use the first neural net to fill that in. Great.

And we can do this for every pixel in the image: train a neural net to predict the color of that pixel when it and all of the subsequent pixels are missing. Now we can "complete" an image starting from a fully black image, filling in one pixel at a time. Crucially, each neural net only predicts one pixel, so there's no blurring effect.

And there we have it, we have just generated a plausible image, out of nothing… There's just one small problem. If we run this model again, it will generate exactly the same image… Not very creative, is it? But not to worry, we can fix this by introducing a bit of random sampling. You see, all predictors actually output a probability distribution over possible labels. Usually, we just take the label with the largest probability as the predicted value. But if we want diversity in our outputs, we can instead randomly sample a value from this probability distribution. This way, each time the model is run, it will sample different values at each step, which changes the prediction for subsequent steps, and we get a completely different image each time. Now we have an interesting image generator.

But still, at the end of the day, this model is made of predictors. They take as input a partially masked image and predict the value of the next pixel. The only difference between this and a traditional image classifier is the label we used for training. The labels for our generator happen to be pixel colors which come from the original image itself, not from a human labeller. This is a very important point in practice: it means we don't need humans to manually label images for this model; we can just scrape unlabelled images off the internet. But from the point of view of the neural net, it doesn't know, nor does it care, that the label came from the original image. As far as it's concerned, this is just a curve-fitting exercise, like any other.

The generative model we've just created is called an auto-regressor. We have a removal process, which removes pixels one at a time, and we train neural nets to undo this process, generating and adding back in pixels one at a time.
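Here is what that generation loop looks like in code, as a minimal PyTorch sketch. The `model` and its calling convention are assumptions for illustration: it is taken to return logits over the 256 possible values of pixel i, given the pixels generated so far.

    # A minimal sketch of auto-regressive generation (PyTorch).
    # `model` is a hypothetical network mapping (partial image, position)
    # to logits over the 256 possible values of the next pixel.
    import torch

    def generate(model, height=28, width=28):
        image = torch.zeros(height * width)        # start from an all-black image
        for i in range(height * width):
            logits = model(image, i)               # distribution over pixel i's value
            probs = torch.softmax(logits, dim=-1)
            value = torch.multinomial(probs, 1)    # sample, rather than take the argmax
            image[i] = value.item() / 255.0        # fill in the pixel and continue
        return image.reshape(height, width)

Replacing `torch.multinomial` with an argmax would give back the deterministic, same-image-every-time behaviour described above.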
This is actually one of the oldest generative models; the very earliest use of auto-regression dates back to 1927, when it was used to model the timing of sunspots. But auto-regressors are still in use today. Most notably, ChatGPT is an auto-regressor: it generates text by using a transformer classifier to output a probability distribution over possible next words, given a partial piece of text. However, auto-regressors are not used to generate images anymore. The reason is that, while they can generate very realistic images, they take too long to run.

In order to generate a sample with an auto-regressor, we need to evaluate a neural net once for every element. This is fine for generating a few thousand words to make a piece of text, but large images can have tens of millions of pixels. How can we get away with fewer neural net evaluations?

For our auto-regressor, we removed one pixel at a time. But we don't have to remove only one pixel. We could, for example, remove a 4 by 4 patch of pixels at a time, and train the neural net to predict all 16 missing pixels at once. This way, when we use our model to generate an image, it can produce 16 pixels per evaluation, and so generation is 16 times as fast.

But there is a limit to this. We can't generate too many pixels at the same time. In the extreme case, if we try to generate every pixel in the image at once, then we're back to the original problem: there are many possible labels that get averaged together into a blurry mess.

To be clear, the reason why the image quality degrades is that, when we predict a bunch of pixels at the same time, the model has to decide on the values for all of them at once. There are lots of plausible ways that this missing patch could be filled in, and so the model outputs the average of those. The model isn't able to make sure that the generated values are consistent with each other. In contrast, when we predict one pixel at a time, the model gets to see the previously generated pixels, and so it can change its prediction for this pixel to make it consistent with what has already been generated. This is why there's a trade-off: the more pixels we generate at once, the less computation we need to use, but the worse the quality of the generated images will be.

However, this problem only arises if the values we are predicting are related to each other. Suppose instead that the values were statistically independent of each other, that is, knowing one of them does not help to predict any of the others. In this case, the model doesn't need to look at the previously generated values, since knowing what they were wouldn't change its prediction for the next value anyway. So you can predict all of them at the same time without any loss in quality.

That means, ideally, we want our model to generate a set of pixels that are unrelated to each other. For natural images, nearby pixels are the most strongly related, because they are usually part of the same object. Knowing the value of one pixel very often gives you a good idea of what color nearby pixels will be. This means that removing pixels in contiguous chunks is actually the worst way to do it. Instead, we should be removing pixels that are far away from each other, and hence more likely to be unrelated. So if, in each step, we remove a random set of pixels and predict values for those, then we can remove more pixels in each step for the same loss in image quality, compared to contiguous chunks.
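As a concrete illustration, here is a minimal PyTorch sketch of that random-order, several-pixels-per-step generation loop. Again, the `model` and its interface are assumptions: it is taken to return one set of logits per requested pixel position.

    # A minimal sketch: fill in k randomly chosen (hence spread-out) pixels
    # per neural net evaluation, instead of one pixel at a time.
    import torch

    def generate_random_order(model, n_pixels=784, k=16):
        image = torch.zeros(n_pixels)
        order = torch.randperm(n_pixels)           # a random, spread-out pixel order
        for start in range(0, n_pixels, k):
            positions = order[start:start + k]     # the k pixels to fill this step
            logits = model(image, positions)       # assumed shape: (k, 256)
            probs = torch.softmax(logits, dim=-1)
            samples = torch.multinomial(probs, 1).squeeze(-1)
            image[positions] = samples.float() / 255.0
        return image

With k = 16, a 28 by 28 image takes 49 neural net evaluations instead of 784.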
In order to minimize the number of steps needed for generation, we want the pixels we remove in each step to be as spread out as possible. Removing pixels in a random order is a pretty good way of maximizing the average spread, but there is an even better way.

We can think of our generative model as two processes: a removal process that gradually removes information from the input until nothing is left, and a generation process that uses neural nets to undo the removal process, generating and adding back in information. So far, we have been completely removing pixels. But rather than completely removing a pixel, we could instead remove only some of the information from a pixel, by, for example, adding a small amount of random noise to it. This means we no longer know exactly what the original pixel value was, but we do know it was somewhere close to the noisy value. Now, instead of removing a bunch of pixels in each step, we can add noise to the entire image. This way, we can remove information from every pixel in the image in a single step, which is the most spread-out way of removing information. And since it's more spread out, we can remove more information in each step for the same loss in generation quality.

There is one small problem with this, though. When we want to generate a new image, we need to start the neural net off with some initial blank image. When we were removing pixels, every image eventually ended up as a completely black image, so of course that's where we started the generation process from. But now that we're adding noise, the values just keep getting larger and larger, never converging to anything. So where do we start the generation process from?

We can avoid this problem by changing our noising step slightly, so that we first scale down the original value and then add the noise. This ensures that, when we repeat this noising step many times, the information from the original image will disappear, and the result will be equivalent to a pure random sample from the noise distribution. So we can start our generation process from any such noise sample.

And there we have it: this is known as a denoising diffusion model. The overall form is identical to an auto-regressor; the only difference is the way in which we remove information at each step. By adding noise, we can spread the removal of information across the entire image, which makes the predicted values as independent of each other as possible, allowing us to use fewer neural net evaluations. Empirically, diffusion models can produce high-quality, photo-realistic images in about a hundred steps, where auto-regressors would take millions.
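Here is a minimal NumPy sketch of that "scale down, then add noise" step. The noise level beta and the stand-in image are assumptions for illustration:

    # A minimal sketch of the "scale down, then add noise" step (NumPy).
    # beta is an assumed noise level in (0, 1); repeating the step drives
    # any starting image toward a pure sample of standard normal noise.
    import numpy as np

    rng = np.random.default_rng(0)

    def noising_step(x, beta=0.02):
        return np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

    x = rng.random((28, 28))          # a stand-in "image"
    for t in range(1000):
        x = noising_step(x)
    print(round(x.mean(), 2), round(x.std(), 2))   # near 0 and 1: pure noise

The sqrt(1 - beta) scaling is what makes the process converge: without it, the values would grow without bound, which is exactly the problem described above.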
Now that we understand how these generative models work at a conceptual level, if you are ever going to implement them in practice, there are a few important technical details that you should be aware of.

First, in the procedure I described for auto-regression, I used a different neural net in each step of the process. This is certainly the best way to get the most accurate predictions, but it's also very inefficient, since we need to train a whole bunch of different neural nets. In practice, you would just use the same model for every step. This gives slightly worse predictions, but the savings in computation time more than make up for it.

In order to train a single neural net to perform all of the generation steps, you would remove a random number of pixels from each input, and train the neural net to predict the corresponding next pixel of each input. Additionally, you can give the number of pixels removed as an input to the neural net, so that it knows which pixel it's supposed to be generating. Now this one neural net can be used for all generation steps.

In the setup I just described, for each training image, the neural net is trained on only one generation step for that image. But ideally, we would like to train it on every generation step of every image; we get more use out of our training data that way. If you did this the naïve way, you would have to evaluate the neural net once for every generation step, which means a lot more computation.

Fortunately, there exist special neural net architectures, known as causal architectures, that allow you to train on all of these generation steps while only evaluating the neural net once. There are causal versions of all of the popular neural net architectures, such as causal convolutional neural nets and causal transformers. Causal architectures actually give slightly worse predictions, but in practice, auto-regression is almost always done with causal architectures because the training is so much faster. The generation process for causal architectures is still exactly the same, though.

For diffusion models, you can't use causal architectures, and so you do have to train with each data point at a random generation step.

I described the diffusion model as predicting the slightly less noisy image from the previous step. However, it's actually better to predict the original, completely clean image at every step. The reason is that this makes the job of the neural net easier. If you make it predict the noisy next-step image, then the neural net needs to learn how to generate images at all different noise levels, which means the model will waste some of its capacity learning to produce noisy versions of images. If you instead have the neural net always predict the clean image, then the model only needs to learn how to generate clean images, which is all we care about. You can then take the predicted clean image and reapply the noising process to it to get the next step of the generation process.

Except that, when you predict the clean image, at the early steps of the generation process the model has only pure noise as input, so the original clean image could have been anything, and you get a blurry mess again. To avoid this, we can train the neural net to predict the noise which was added to the image. Once we have a predicted value for the noise, we can plug it into the noising equation to get a prediction for the original clean image. So we are still predicting the original clean image, just in a roundabout way. The advantage of doing it this way is that now the model output is uncertain at the later stages of the generation process, since any noise could have been added to the clean image. So the model outputs the average of a bunch of different noise samples, which is still valid noise.
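Putting those last two details together, here is a minimal PyTorch sketch of one diffusion training step. The `model`, the `alpha_bar` noise schedule, and the image shapes are all assumptions for illustration; the key point is that the network predicts the added noise, from which a clean-image prediction can be recovered.

    # A minimal sketch of one diffusion training step (PyTorch).
    # alpha_bar is an assumed 1-D tensor of cumulative scaling factors, one per step.
    import torch
    import torch.nn.functional as F

    def training_step(model, clean, alpha_bar):
        t = torch.randint(0, len(alpha_bar), (clean.shape[0],))   # random step per image
        a = alpha_bar[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(clean)
        noisy = a.sqrt() * clean + (1 - a).sqrt() * noise         # scale down, add noise
        pred_noise = model(noisy, t)                # the step number t is also an input
        loss = F.mse_loss(pred_noise, noise)        # train to predict the added noise
        # invert the noising equation to recover the clean-image prediction:
        pred_clean = (noisy - (1 - a).sqrt() * pred_noise) / a.sqrt()
        return loss, pred_clean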
So far, we've just been generating images from nothing, but most image generators actually allow you to provide a text prompt describing the image you want to make. The way this works is exactly the same; you just give the neural net the text as an additional input at each step. These models are trained on pairs of images and their corresponding text descriptions, usually scraped from image alt-text tags found on the internet. This ensures that the generated image is something for which the text prompt could plausibly have been given as a description.

In principle, you can condition generative models on anything, not just text, so long as you can find appropriate training data. For example, here is a generative model that is conditioned on sketches.

Finally, there's a technique that makes conditional diffusion models work better, called classifier-free guidance. For this, during training, the model is sometimes given the text prompt as an additional input, and sometimes it isn't. This way, the same model learns to make predictions with or without the conditioning prompt as input. Then, at each step of the denoising process, the model is run twice: once with the prompt, and once without. The prediction without the prompt is subtracted from the prediction with the prompt, which removes details that are generated without the prompt, leaving only details that came from the prompt, and so generations follow the prompt more closely.
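As a minimal sketch of one guided denoising step, again with the `model` interface, the empty prompt, and the guidance scale as illustrative assumptions:

    # A minimal sketch of classifier-free guidance at one denoising step (PyTorch).
    import torch

    def guided_prediction(model, noisy_image, t, prompt, empty_prompt, scale=7.5):
        with_prompt = model(noisy_image, t, prompt)           # run with the prompt
        without_prompt = model(noisy_image, t, empty_prompt)  # run without it
        # keep the unconditional prediction, and amplify whatever the prompt adds
        return without_prompt + scale * (with_prompt - without_prompt)

A guidance scale above 1 pushes the output further toward the prompt-specific details.

In conclusion, generative AI, like all machine learning, is just curve fitting.

And that's all for this video. If you enjoyed it, please like and subscribe. And if you have any suggestions for topics you'd like me to cover in a future video, leave a comment below.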