Diffusion Models - Paper Explanation
Jun 22, 2024
Introduction
Diffusion Models: Popular in image generation, competitive with state-of-the-art GANs.
Applications:
Text-to-image generation
Image variations (e.g., DALL-E 2)
Inpainting/removing objects from images
Generating animations
Fundamental Papers
2015: Introduced to ML from statistical physics (Sohl-Dickstein et al.).
2020: Significant quality improvements (DDPM, Ho et al.).
2021: OpenAI contributions for better performance and faster runtime (Nichol & Dhariwal).
Structure of Video
General idea of diffusion models
Detailed math explanation behind DDPMs
Improvements from relevant papers
Summary
General Idea
Forward Process: Systematically and slowly destroys the structure in the data, turning it into noise.
Reverse Process: Learns to restore structure, mapping noise back to data.
Noise is added iteratively to an image until it becomes pure noise.
The reverse process uses a neural network to remove noise step by step, recovering a clear image.
Key Insight: Predicting the added noise directly works better than predicting the clean image.
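The iterative noising described above can be sketched numerically. This is an illustrative NumPy toy (an 8x8 gradient stands in for an image), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a structured 8x8 gradient standing in for real data.
x = np.linspace(-1.0, 1.0, 64).reshape(8, 8)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule (2020 paper)

# Forward process: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
for beta in betas:
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

# After T steps the sample is statistically indistinguishable from N(0, 1):
# the original structure has been destroyed.
print(round(float(x.mean()), 2), round(float(x.std()), 2))
```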
Architectural Details
UNet-like Architecture:
Bottleneck in the middle.
Downsample -> Bottleneck -> Upsample.
Attention blocks at certain resolutions.
Skip connections between similar spatial resolutions.
Sinusoidal timestep embeddings, as in transformers.
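The sinusoidal timestep embedding borrowed from transformers can be sketched as follows; the embedding dimension of 128 is an arbitrary illustrative choice:

```python
import math
import numpy as np

def sinusoidal_embedding(t: int, dim: int = 128) -> np.ndarray:
    """Transformer-style sinusoidal embedding of a diffusion timestep.
    A minimal sketch; real implementations feed this through MLP layers."""
    half = dim // 2
    # Geometrically spaced frequencies, as in the transformer paper.
    freqs = np.exp(-math.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_embedding(t=500, dim=128)
print(emb.shape)  # (128,)
```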
Schedule: Regulates the amount of noise added at each step.
Linear schedule from 2020 paper.
Cosine schedule from OpenAI for better noise distribution.
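Both schedules are simple to write down. This sketch follows the standard formulations (linear betas from 1e-4 to 0.02; cosine alpha-bar with offset s = 0.008), though the exact constants may differ from those in the video:

```python
import numpy as np

T = 1000

# Linear schedule (2020 paper): betas ramp from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)
abar_linear = np.cumprod(1.0 - betas_linear)

# Cosine schedule (OpenAI): define the cumulative alpha-bar directly.
s = 0.008
steps = np.arange(T + 1)
f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
abar_cosine = f[1:] / f[0]

# The cosine schedule destroys information more gradually in the middle
# of the chain, which motivates "better noise distribution" above.
print(round(float(abar_linear[T // 2]), 3), round(float(abar_cosine[T // 2]), 3))
```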
Improvements by OpenAI
Network Depth: Increased depth, decreased width.
Attention Blocks: More blocks and more attention heads.
Residual Blocks: BigGAN-style residual blocks used for upsampling and downsampling.
Adaptive Group Normalization: Incorporates the timestep and class label into normalization.
Classifier Guidance: Uses a classifier's gradients to steer samples toward a specified class.
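Classifier guidance amounts to shifting the reverse-step mean by the classifier's gradient, scaled by the step variance. A minimal sketch with toy stand-in values (no real classifier or diffusion model involved):

```python
import numpy as np

def guided_mean(mu: np.ndarray, sigma2: float,
                grad_log_py: np.ndarray, s: float = 1.0) -> np.ndarray:
    """Shift the reverse-step mean toward higher classifier log-probability
    of the target class. `s` is the guidance scale (larger = stronger pull)."""
    return mu + s * sigma2 * grad_log_py

# Hypothetical stand-in values, purely illustrative:
mu = np.zeros(4)
grad = np.array([1.0, -1.0, 0.5, 0.0])  # pretend classifier gradient
print(guided_mean(mu, sigma2=0.01, grad_log_py=grad, s=10.0))
```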
Mathematical Explanation
Notation:
Image: Represented as x_t (subscript t denotes the timestep).
Forward Function (q): Adds noise, transforming x_{t-1} into x_t.
Reverse Function (p): Neural-network denoising, transforming x_t into x_{t-1}.
Forward Process: Noise is applied iteratively; the whole chain admits a closed-form cumulative representation.
Linear schedule for the noise.
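Because each step adds Gaussian noise, the forward chain collapses into a single closed-form jump, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I). A sketch assuming the linear schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear schedule
alphas_bar = np.cumprod(1.0 - betas)   # cumulative product abar_t

def q_sample(x0: np.ndarray, t: int) -> np.ndarray:
    """Jump straight from x_0 to x_t in one step (standard DDPM identity)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.standard_normal((8, 8))       # toy stand-in for an image
xt = q_sample(x0, t=999)               # near-pure noise at the final step
print(xt.shape)
```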
Reverse Process: Predicts mu (the mean of the reverse Gaussian) at each step, which eventually simplifies to predicting the noise itself.
Variational Lower Bound and Loss Function
Objective: Optimize a variational lower bound, since the likelihood is intractable to compute directly.
Reformulated using Bayes' rule to make it tractable.
Optimization: Simplifies to the mean squared error between the actual and predicted noise.
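Written out in the usual DDPM notation, the simplified objective is the expected squared error between the true noise and the network's prediction at a noised input:

```latex
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[
      \left\| \epsilon - \epsilon_\theta\!\left(
        \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t
      \right) \right\|^2
    \right]
```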
Final Objectives and Algorithm
Training:
Sample data, a timestep, and noise.
Optimize the objective via gradient descent.
Sampling:
Start with pure noise -> Iterative denoising using the trained model -> Final clean image.
Iteration: Key to achieving high-quality results.
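The sampling procedure above can be sketched as follows; `fake_eps_model` is a hypothetical stand-in for the trained noise-prediction network, so the output here is only illustrative noise, not an image:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def fake_eps_model(x: np.ndarray, t: int) -> np.ndarray:
    """Hypothetical stand-in for the trained noise predictor eps_theta."""
    return np.zeros_like(x)

def ddpm_sample(shape=(8, 8)) -> np.ndarray:
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(T)):
        eps = fake_eps_model(x, t)
        # Posterior mean from the predicted noise (standard DDPM update).
        mu = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added on the final step
            mu = mu + np.sqrt(betas[t]) * rng.standard_normal(shape)
        x = mu
    return x

sample = ddpm_sample()
print(sample.shape)
```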
Results and Comparisons
FID Scores on ImageNet (256x256):
Improved DDPM: 12.3
OpenAI: 4.59 (Ablated Diffusion Model), 3.94 (with upsampling)
Comparison with Other Models: Diffusion models are catching up to GANs and may soon surpass them.
Conclusion
Diffusion Models: Powerful generative models making strides in image synthesis.
Transformation: From noise to high-fidelity images, iteratively, using a trained model.
Future Potential: Likely to surpass GANs with continued research and improvements.
Recap:
Dual processes of adding noise and learning to remove noise.
Neural network architecture for efficient denoising.
Mathematical grounding behind objectives and processes.
Continued improvements yielding better results over time.