Overview
This lecture explains how diffusion models generate realistic images, covering their mechanisms, architectures, and key techniques, with a focus on Stable Diffusion.
Diffusion Model Basics
- During training, a forward process gradually adds noise to images until only noise remains; the model learns to undo this corruption (the noising step is sketched after this list).
- Image generation reverses the process: starting from pure noise, the model denoises step by step until a clean image emerges.
- The process is inspired by thermodynamic diffusion, increasing image entropy as noise is added.
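A minimal sketch of the forward (noising) step under the standard DDPM-style variance-preserving formulation; the schedule values and shapes here are illustrative, not the lecture's exact code.

```python
import torch

# Precompute a simple linear noise schedule (values are illustrative).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t, alphas_cumprod):
    """Sample x_t from the forward process q(x_t | x_0).

    x0:             clean images, shape (batch, C, H, W)
    t:              integer time steps, shape (batch,)
    alphas_cumprod: cumulative product of (1 - beta) over the schedule
    """
    noise = torch.randn_like(x0)                 # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # per-sample noise level
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise                            # the noise becomes the training target
```

Larger time steps correspond to larger noise levels, so a single function covers every point along the corruption process.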
Denoising Networks & U-Net Architecture
- Denoising uses fully convolutional neural networks, often based on U-Net architecture.
- U-Net has downsampling and upsampling layers with skip connections to preserve both details and high-level features.
- The network is typically trained to predict the full noise present in an image, not just a slightly less noisy version of it (a toy U-Net is sketched after this list).
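A toy PyTorch U-Net illustrating the downsample/upsample structure with a skip connection; real denoising U-Nets add residual blocks, attention layers, and time-step conditioning, so treat this as a sketch rather than the architecture used in practice.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: one downsampling stage, a bottleneck, and one upsampling
    stage, joined by a skip connection that preserves fine detail."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.mid   = nn.Sequential(nn.Conv2d(ch * 2, ch * 2, 3, padding=1), nn.ReLU())
        self.up1   = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU())
        self.out   = nn.Conv2d(ch * 2, 3, 3, padding=1)   # ch (skip) + ch (upsampled)

    def forward(self, x):
        d1 = self.down1(x)               # full-resolution features
        d2 = self.down2(d1)              # half resolution
        m  = self.mid(d2)                # bottleneck
        u1 = self.up1(m)                 # back to full resolution
        u1 = torch.cat([u1, d1], dim=1)  # skip connection
        return self.out(u1)              # predicted noise, same shape as the input
```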
Training & Inference Process
- The model is trained to predict the noise that was added to an image rather than the denoised image itself.
- During inference, the predicted noise is subtracted (suitably scaled) step by step until a clean image remains.
- The noise level of each training sample is set by a noise scheduler parameterized by a time step, which is embedded and fed to the model alongside the noisy image (see the training and sampling sketch after this list).
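A minimal sketch of one training step and one reverse (sampling) step under the noise-prediction objective, building on the schedule, `add_noise`, and `TinyUNet` from the earlier sketches; in practice the model also receives an embedding of the time step `t`, which the toy network above omits.

```python
import torch
import torch.nn.functional as F

alphas = 1.0 - betas  # from the schedule defined in the earlier sketch

def training_step(model, x0, optimizer):
    """One optimization step: corrupt a batch, then regress the added noise."""
    t = torch.randint(0, len(betas), (x0.shape[0],))  # random time step per sample
    x_t, noise = add_noise(x0, t, alphas_cumprod)     # forward process (earlier sketch)
    loss = F.mse_loss(model(x_t), noise)              # the model predicts the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def reverse_step(model, x_t, t):
    """One denoising step: subtract the scaled predicted noise (DDPM-style update)."""
    pred_noise = model(x_t)
    a, a_bar = alphas[t], alphas_cumprod[t]
    x_prev = (x_t - (1.0 - a) / (1.0 - a_bar).sqrt() * pred_noise) / a.sqrt()
    if t > 0:
        x_prev = x_prev + betas[t].sqrt() * torch.randn_like(x_t)  # re-inject a little noise
    return x_prev
```

Sampling simply calls `reverse_step` from the last time step down to zero, starting from pure Gaussian noise.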
Comparison to GANs & Variational Autoencoders
- Diffusion models train more stably than GANs and are less prone to catastrophic failure modes such as mode collapse.
- They are typically slower than GANs and Variational Autoencoders for image generation.
Latent Diffusion and Stable Diffusion
- Latent Diffusion speeds up generation by working in a low-dimensional latent space using a variational autoencoder (VAE).
- The VAE is trained to compress and reconstruct images; diffusion is performed on compressed representations.
- After denoising in latent space, the decoder reconstructs the high-resolution image.
- Latent Diffusion also supports image-to-image tasks such as inpainting and outpainting (an encode/denoise/decode sketch follows this list).
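A minimal sketch of the encode → denoise → decode flow, assuming the Hugging Face diffusers `AutoencoderKL` as the pretrained VAE; the model ID and the `denoise_latents` placeholder are illustrative, not the lecture's exact pipeline.

```python
import torch
from diffusers import AutoencoderKL

# A pretrained Stable Diffusion VAE (model ID is illustrative).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

def denoise_latents(latents):
    return latents  # stand-in for the latent-space U-Net sampling loop

@torch.no_grad()
def image_to_image(images):
    # Encode: compress pixels, e.g. (B, 3, 512, 512) -> latents (B, 4, 64, 64).
    latents = vae.encode(images).latent_dist.sample()
    latents = latents * vae.config.scaling_factor      # scale latents as the U-Net expects

    # Diffusion runs here, on the compressed latents instead of full-resolution pixels.
    latents = denoise_latents(latents)

    # Decode: reconstruct the high-resolution image from the denoised latents.
    return vae.decode(latents / vae.config.scaling_factor).sample
```

Because the latents are roughly 48× smaller than the pixel grid, each denoising step is far cheaper than it would be in pixel space.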
Text Conditioning & Guidance
- Text prompts are tokenized and encoded (e.g., via CLIP) to provide embeddings to the diffusion model.
- Stable Diffusion uses CLIP as a text encoder, conditioning the U-Net on text embeddings.
- The guidance scale balances prompt influence: higher values push generations closer to the prompt (often at the cost of variety), while lower values leave the model more freedom.
- Classifier-free guidance runs each denoising step twice, with and without the prompt, and amplifies the difference between the two predictions to steer the result toward the prompt (sketched after this list).
- Negative prompts can be used to remove specific features from generated images.
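A minimal sketch of classifier-free guidance at a single denoising step; the `unet` call signature and the `guidance_scale` default are illustrative assumptions.

```python
import torch

@torch.no_grad()
def guided_noise(unet, latents, t, text_emb, uncond_emb, guidance_scale=7.5):
    """Run the denoiser with and without the prompt, then amplify the difference."""
    noise_cond = unet(latents, t, text_emb)      # prediction conditioned on the prompt embedding
    noise_uncond = unet(latents, t, uncond_emb)  # prediction for an empty (or negative) prompt
    # Push the prediction further in the direction the prompt contributes.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

A negative prompt simply swaps the empty-prompt embedding for an embedding of the unwanted features, so the same formula steers the image away from them.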
Key Terms & Definitions
- Diffusion Model — A model that generates images by gradually removing noise from random noise.
- U-Net — A neural network architecture with symmetrical downsampling and upsampling paths connected by skip connections.
- Latent Space — A low-dimensional representation of data (e.g., images), typically produced by an autoencoder.
- Variational Autoencoder (VAE) — A neural network that learns to compress and reconstruct input data.
- CLIP — A model that encodes images and text into a shared multimodal embedding space.
- Guidance Scale — A parameter controlling the influence of a text prompt on image generation.
- Classifier-Free Guidance — A technique that uses both conditioned and unconditioned predictions to enhance prompt effects.
Action Items / Next Steps
- Review the structure and function of U-Net and VAEs.
- Explore how Latent Diffusion and CLIP work in Stable Diffusion.
- Experiment with prompt engineering and guidance scales in image generation tasks.