Diffusion Models Overview

Aug 18, 2025

Overview

This lecture explains how diffusion models generate realistic images, covering their mechanisms, architectures, and key techniques, with a focus on Stable Diffusion.

Diffusion Model Basics

  • During training, a forward process progressively adds noise to images until only noise remains.
  • Image generation reverses this process: starting from pure noise, the model gradually denoises until a clean image emerges (see the sketch after this list).
  • The process is inspired by thermodynamic diffusion: adding noise increases the entropy of the image.
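
To make the forward (noising) process concrete, here is a minimal PyTorch sketch, not taken from the lecture, of the closed-form corruption x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε under a simple linear β schedule; the names `betas`, `alpha_bars`, and `add_noise` are illustrative, not any particular library's API.

```python
import torch

# Linear beta schedule: small noise steps early, larger ones later (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product, \bar{alpha}_t

def add_noise(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(abar_t)*x0 + sqrt(1 - abar_t)*eps."""
    eps = torch.randn_like(x0)              # Gaussian noise
    abar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
    return x_t, eps                         # eps is the training target (see below)

# Example: noise a batch of 4 "images" at random timesteps.
x0 = torch.randn(4, 3, 64, 64)              # stand-in for real images scaled to [-1, 1]
t = torch.randint(0, T, (4,))
x_t, eps = add_noise(x0, t)
```

As t approaches T, ᾱ_t approaches 0 and x_t becomes indistinguishable from pure Gaussian noise, which is why generation can start from noise alone.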

Denoising Networks & U-Net Architecture

  • Denoising is performed by fully convolutional neural networks, most commonly based on the U-Net architecture (a minimal sketch follows this list).
  • A U-Net pairs downsampling and upsampling paths with skip connections, preserving fine details alongside high-level features.
  • Training typically has the network predict the full noise added to an image, rather than output a slightly less noisy image.
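
A toy PyTorch sketch of the U-Net pattern described above, assuming a single down/up stage and omitting timestep embeddings, attention, and normalization for brevity; it illustrates only the skip-connection idea, not Stable Diffusion's actual architecture.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net: one downsampling stage, one upsampling stage, one skip connection."""
    def __init__(self, ch: int = 3, base: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(ch, base, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(base, base, 3, padding=1), nn.SiLU())
        self.down = nn.Conv2d(base, base * 2, 4, stride=2, padding=1)           # halve resolution
        self.mid = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.SiLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)    # restore resolution
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(base, ch, 3, padding=1))              # predict noise, same shape as input

    def forward(self, x):
        skip = self.enc(x)               # high-resolution features
        h = self.mid(self.down(skip))    # low-resolution, high-level features
        h = self.up(h)
        h = torch.cat([h, skip], dim=1)  # skip connection preserves detail
        return self.dec(h)

noise_pred = TinyUNet()(torch.randn(1, 3, 64, 64))  # output shape matches input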

Training & Inference Process

  • Models are trained to predict the noise that was added to an image, rather than to output the denoised image directly.
  • During inference, the predicted noise is subtracted (suitably scaled) step by step until a clean image remains.
  • The noise level of each training sample is set by a noise scheduler parameterized by a timestep; the timestep is embedded and passed to the model as an additional input (see the training-and-sampling sketch after this list).
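
Below is a hedged training-and-sampling sketch in PyTorch, assuming a generic noise-prediction network `model(x_t, t)` that also receives the timestep (the toy U-Net above omits this conditioning) and the same linear β schedule as the earlier sketch; the function names are illustrative, not a specific library's API.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def train_step(model, x0, optimizer):
    """One training step: predict the noise that was added to x0 at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps       # forward (noising) process
    loss = torch.nn.functional.mse_loss(model(x_t, t), eps)  # target is the noise itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample_step(model, x_t, t: int):
    """One DDPM-style reverse step: subtract the scaled predicted noise, then re-add a little noise."""
    eps_hat = model(x_t, torch.full((x_t.shape[0],), t))
    alpha, abar = alphas[t], alpha_bars[t]
    mean = (x_t - (1 - alpha) / (1 - abar).sqrt() * eps_hat) / alpha.sqrt()
    if t == 0:
        return mean                                           # final step: no extra noise
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```

Repeating `sample_step` from t = T − 1 down to 0, starting from pure Gaussian noise, yields a generated image.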

Comparison to GANs & Variational Autoencoders

  • Diffusion models train more stably than GANs and are less likely to fail catastrophically.
  • They are typically slower than GANs and Variational Autoencoders at generation time, since sampling requires many sequential denoising steps.

Latent Diffusion and Stable Diffusion

  • Latent Diffusion speeds up generation by running the diffusion process in a low-dimensional latent space produced by a variational autoencoder (VAE).
  • The VAE is trained to compress and reconstruct images; diffusion is performed on the compressed representations.
  • After denoising in latent space, the VAE decoder reconstructs the full-resolution image (see the sketch after this list).
  • Latent Diffusion also enables image-to-image tasks such as inpainting and outpainting.
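
As an illustration of the encode → denoise → decode pipeline, the sketch below uses the Hugging Face diffusers library and the publicly hosted runwayml/stable-diffusion-v1-5 VAE weights (both assumptions about tooling, not part of the lecture); the denoising step itself is left as a placeholder comment.

```python
import torch
from diffusers import AutoencoderKL

# Load the Stable Diffusion v1.5 VAE (weights downloaded from the Hugging Face Hub).
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.eval()

@torch.no_grad()
def roundtrip(images: torch.Tensor) -> torch.Tensor:
    """Compress images to latents and reconstruct them, as Latent Diffusion does around the denoiser."""
    # Encode: 512x512x3 images in [-1, 1] -> 64x64x4 latents (8x smaller per side).
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    # ... the diffusion U-Net would denoise `latents` here ...
    # Decode: latents -> full-resolution images.
    return vae.decode(latents / vae.config.scaling_factor).sample

images = torch.randn(1, 3, 512, 512)  # stand-in for a real batch of images scaled to [-1, 1]
recon = roundtrip(images)             # shape (1, 3, 512, 512)
```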

Text Conditioning & Guidance

  • Text prompts are tokenized and encoded (e.g., via CLIP) to produce embeddings that condition the diffusion model.
  • Stable Diffusion uses CLIP as its text encoder, conditioning the U-Net on the resulting text embeddings.
  • The guidance scale controls how strongly the prompt influences generation: higher values follow the prompt more closely, lower values give the model more freedom.
  • Classifier-free guidance runs the denoiser twice per step (with and without the prompt) and amplifies the difference between the two predictions to steer the result (see the sketch after this list).
  • Negative prompts can be used to push generated images away from specified features.
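
A minimal sketch of classifier-free guidance, assuming a hypothetical noise predictor `model(x_t, t, text_emb)`; the combination rule is the standard ε̂ = ε_uncond + s·(ε_cond − ε_uncond).

```python
import torch

def guided_noise(model, x_t, t, cond_emb, uncond_emb, guidance_scale: float = 7.5):
    """Classifier-free guidance: run the denoiser with and without the prompt,
    then push the prediction away from the unconditional one."""
    eps_uncond = model(x_t, t, uncond_emb)  # embedding of "" (or of a negative prompt)
    eps_cond = model(x_t, t, cond_emb)      # embedding of the text prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With guidance_scale = 1 the formula reduces to the plain conditional prediction; larger values push the image more strongly toward the prompt, and a negative prompt replaces the empty-prompt embedding in the unconditional pass. In practice the same idea is exposed through high-level APIs; for example, with the diffusers library (again an assumption about tooling, not something covered in the lecture):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a watercolor painting of a lighthouse",
             negative_prompt="blurry, low quality",
             guidance_scale=7.5).images[0]
```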

Key Terms & Definitions

  • Diffusion Model — A model that generates images by gradually removing noise from random noise.
  • U-Net — A neural network architecture with symmetrical downsampling and upsampling paths connected by skip connections.
  • Latent Space — A low-dimensional representation of data (e.g., images), typically produced by an autoencoder.
  • Variational Autoencoder (VAE) — A neural network that learns to compress input data into a latent representation and reconstruct it, with a probabilistic (variational) latent space.
  • CLIP — A model that encodes images and text into a shared multimodal embedding space.
  • Guidance Scale — A parameter controlling the influence of a text prompt on image generation.
  • Classifier-Free Guidance — A technique that uses both conditioned and unconditioned predictions to enhance prompt effects.

Action Items / Next Steps

  • Review the structure and function of U-Net and VAEs.
  • Explore how Latent Diffusion and CLIP work in Stable Diffusion.
  • Experiment with prompt engineering and guidance scales in image generation tasks.