Overview
This lecture explains how diffusion models generate realistic images, covering their mechanisms, architectures, and key techniques, with a focus on Stable Diffusion.
Diffusion Model Basics
- During training, a forward process gradually adds noise to images until only noise remains; the model learns to undo this corruption (the noising step is sketched after this list).
- Image generation reverses the process: starting from pure noise, the model denoises step by step until a clean image emerges.
- The process is inspired by thermodynamic diffusion, increasing image entropy as noise is added.
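A minimal sketch of the forward (noising) step under the standard DDPM-style variance-preserving formulation; the schedule values and shapes here are illustrative, not the lecture's exact code.

```python
import torch

# Precompute a simple linear noise schedule (values are illustrative).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t, alphas_cumprod):
    """Sample x_t from the forward process q(x_t | x_0).

    x0:             clean images, shape (batch, C, H, W)
    t:              integer time steps, shape (batch,)
    alphas_cumprod: cumulative product of (1 - beta) over the schedule
    """
    noise = torch.randn_like(x0)                 # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # per-sample noise level
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise                            # the noise becomes the training target
```

Larger time steps correspond to larger noise levels, so a single function covers every point along the corruption process.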
Denoising Networks & U-Net Architecture
- Denoising uses fully convolutional neural networks, often based on U-Net architecture.
- U-Net has downsampling and upsampling layers with skip connections to preserve both details and high-level features.
- The network is typically trained to predict the full noise present in an image, not just a slightly less noisy version of it (a toy U-Net is sketched after this list).
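A toy PyTorch U-Net illustrating the downsample/upsample structure with a skip connection; real denoising U-Nets add residual blocks, attention layers, and time-step conditioning, so treat this as a sketch rather than the architecture used in practice.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: one downsampling stage, a bottleneck, and one upsampling
    stage, joined by a skip connection that preserves fine detail."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.mid   = nn.Sequential(nn.Conv2d(ch * 2, ch * 2, 3, padding=1), nn.ReLU())
        self.up1   = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU())
        self.out   = nn.Conv2d(ch * 2, 3, 3, padding=1)   # ch (skip) + ch (upsampled)

    def forward(self, x):
        d1 = self.down1(x)               # full-resolution features
        d2 = self.down2(d1)              # half resolution
        m  = self.mid(d2)                # bottleneck
        u1 = self.up1(m)                 # back to full resolution
        u1 = torch.cat([u1, d1], dim=1)  # skip connection
        return self.out(u1)              # predicted noise, same shape as the input
```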
Training & Inference Process
- The model is trained to predict the noise that was added to an image rather than the denoised image itself.
- During inference, the predicted noise is subtracted (suitably scaled) step by step until a clean image remains.
- The noise level of each training sample is set by a noise scheduler parameterized by a time step, which is embedded and fed to the model alongside the noisy image (see the training and sampling sketch after this list).
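A minimal sketch of one training step and one reverse (sampling) step under the noise-prediction objective, building on the schedule, `add_noise`, and `TinyUNet` from the earlier sketches; in practice the model also receives an embedding of the time step `t`, which the toy network above omits.

```python
import torch
import torch.nn.functional as F

alphas = 1.0 - betas  # from the schedule defined in the earlier sketch

def training_step(model, x0, optimizer):
    """One optimization step: corrupt a batch, then regress the added noise."""
    t = torch.randint(0, len(betas), (x0.shape[0],))  # random time step per sample
    x_t, noise = add_noise(x0, t, alphas_cumprod)     # forward process (earlier sketch)
    loss = F.mse_loss(model(x_t), noise)              # the model predicts the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def reverse_step(model, x_t, t):
    """One denoising step: subtract the scaled predicted noise (DDPM-style update)."""
    pred_noise = model(x_t)
    a, a_bar = alphas[t], alphas_cumprod[t]
    x_prev = (x_t - (1.0 - a) / (1.0 - a_bar).sqrt() * pred_noise) / a.sqrt()
    if t > 0:
        x_prev = x_prev + betas[t].sqrt() * torch.randn_like(x_t)  # re-inject a little noise
    return x_prev
```

Sampling simply calls `reverse_step` from the last time step down to zero, starting from pure Gaussian noise.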
Comparison to GANs & Variational Autoencoders
- Diffusion models train more stably than GANs and are less prone to catastrophic failure modes such as mode collapse.
- They are typically slower than GANs and Variational Autoencoders for image generation.
Latent Diffusion and Stable Diffusion
- Latent Diffusion speeds up generation by working in a low-dimensional latent space using a variational autoencoder (VAE).
- The VAE is trained to compress and reconstruct images; diffusion is performed on compressed representations.
- After denoising in latent space, the decoder reconstructs the high-resolution image.
- Latent Diffusion also supports image-to-image tasks such as inpainting and outpainting (an encode/denoise/decode sketch follows this list).
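A minimal sketch of the encode → denoise → decode flow, assuming the Hugging Face diffusers `AutoencoderKL` as the pretrained VAE; the model ID and the `denoise_latents` placeholder are illustrative, not the lecture's exact pipeline.

```python
import torch
from diffusers import AutoencoderKL

# A pretrained Stable Diffusion VAE (model ID is illustrative).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

def denoise_latents(latents):
    return latents  # stand-in for the latent-space U-Net sampling loop

@torch.no_grad()
def image_to_image(images):
    # Encode: compress pixels, e.g. (B, 3, 512, 512) -> latents (B, 4, 64, 64).
    latents = vae.encode(images).latent_dist.sample()
    latents = latents * vae.config.scaling_factor      # scale latents as the U-Net expects

    # Diffusion runs here, on the compressed latents instead of full-resolution pixels.
    latents = denoise_latents(latents)

    # Decode: reconstruct the high-resolution image from the denoised latents.
    return vae.decode(latents / vae.config.scaling_factor).sample
```

Because the latents are roughly 48× smaller than the pixel grid, each denoising step is far cheaper than it would be in pixel space.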
Text Conditioning & Guidance
- Text prompts are tokenized and encoded (e.g., via CLIP) to provide embeddings to the diffusion model.
- Stable Diffusion uses CLIP as a text encoder, conditioning the U-Net on text embeddings.
- The guidance scale balances prompt influence: higher values push generations closer to the prompt (often at the cost of variety), while lower values leave the model more freedom.
- Classifier-free guidance runs each denoising step twice, with and without the prompt, and amplifies the difference between the two predictions to steer the result toward the prompt (sketched after this list).
- Negative prompts can be used to remove specific features from generated images.
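A minimal sketch of classifier-free guidance at a single denoising step; the `unet` call signature and the `guidance_scale` default are illustrative assumptions.

```python
import torch

@torch.no_grad()
def guided_noise(unet, latents, t, text_emb, uncond_emb, guidance_scale=7.5):
    """Run the denoiser with and without the prompt, then amplify the difference."""
    noise_cond = unet(latents, t, text_emb)      # prediction conditioned on the prompt embedding
    noise_uncond = unet(latents, t, uncond_emb)  # prediction for an empty (or negative) prompt
    # Push the prediction further in the direction the prompt contributes.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

A negative prompt simply swaps the empty-prompt embedding for an embedding of the unwanted features, so the same formula steers the image away from them.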
Key Terms & Definitions
- Diffusion Model — A model that generates images by gradually removing noise from random noise.
- U-Net — A neural network architecture with symmetrical downsampling and upsampling paths connected by skip connections.
- Latent Space — A low-dimensional representation of data (e.g., images), typically produced by an autoencoder.
- Variational Autoencoder (VAE) — A neural network that learns to compress and reconstruct input data.
- CLIP — A model that encodes images and text into a shared multimodal embedding space.
- Guidance Scale — A parameter controlling the influence of a text prompt on image generation.
- Classifier-Free Guidance — A technique that uses both conditioned and unconditioned predictions to enhance prompt effects.
Action Items / Next Steps
- Review the structure and function of U-Net and VAEs.
- Explore how Latent Diffusion and CLIP work in Stable Diffusion.
- Experiment with prompt engineering and guidance scales in image generation tasks.