Diffusion Models - Paper Explanation
Jun 22, 2024
Introduction
Diffusion Models: Popular in image generation, competitive with state-of-the-art GANs.
Applications:
Text-to-image generation
Image variations (e.g., DALL-E 2)
Inpainting/removing objects from images
Generating animations
Fundamental Papers
2015: Introduced to ML from statistical physics (Sohl-Dickstein et al.).
2020: Significant quality improvements (DDPM, Ho et al.).
2021: OpenAI contributions for better performance and faster runtime (Nichol & Dhariwal).
Structure of Video
General idea of diffusion models
Detailed math explanation behind DDPMs
Improvements from relevant papers
Summary
General Idea
Forward Process: Systematically and slowly destroys the structure in the data, turning it into noise.
Reverse Process: Learns to restore structure, mapping noise back to data.
Noise is added iteratively to an image until it becomes pure noise.
The reverse process uses a neural network to remove noise step by step, recovering a clear image.
Key Insight: Predicting the added noise directly works better than predicting the clean image.
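The iterative noising described above can be sketched numerically. This is an illustrative NumPy toy (an 8x8 gradient stands in for an image), not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a structured 8x8 gradient standing in for real data.
x = np.linspace(-1.0, 1.0, 64).reshape(8, 8)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear noise schedule (2020 paper)

# Forward process: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
for beta in betas:
    eps = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

# After T steps the sample is statistically indistinguishable from N(0, 1):
# the original structure has been destroyed.
print(round(float(x.mean()), 2), round(float(x.std()), 2))
```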
Architectural Details
UNet-like Architecture:
Bottleneck in the middle.
Downsample -> Bottleneck -> Upsample.
Attention blocks at certain resolutions.
Skip connections between similar spatial resolutions.
Sinusoidal timestep embeddings, as in transformers.
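The sinusoidal timestep embedding borrowed from transformers can be sketched as follows; the embedding dimension of 128 is an arbitrary illustrative choice:

```python
import math
import numpy as np

def sinusoidal_embedding(t: int, dim: int = 128) -> np.ndarray:
    """Transformer-style sinusoidal embedding of a diffusion timestep.
    A minimal sketch; real implementations feed this through MLP layers."""
    half = dim // 2
    # Geometrically spaced frequencies, as in the transformer paper.
    freqs = np.exp(-math.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_embedding(t=500, dim=128)
print(emb.shape)  # (128,)
```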
Schedule: Regulates the amount of noise added at each step.
Linear schedule from 2020 paper.
Cosine schedule from OpenAI for better noise distribution.
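Both schedules are simple to write down. This sketch follows the standard formulations (linear betas from 1e-4 to 0.02; cosine alpha-bar with offset s = 0.008), though the exact constants may differ from those in the video:

```python
import numpy as np

T = 1000

# Linear schedule (2020 paper): betas ramp from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)
abar_linear = np.cumprod(1.0 - betas_linear)

# Cosine schedule (OpenAI): define the cumulative alpha-bar directly.
s = 0.008
steps = np.arange(T + 1)
f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
abar_cosine = f[1:] / f[0]

# The cosine schedule destroys information more gradually in the middle
# of the chain, which motivates "better noise distribution" above.
print(round(float(abar_linear[T // 2]), 3), round(float(abar_cosine[T // 2]), 3))
```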
Improvements by OpenAI
Network Depth: Increased depth, decreased width.
Attention Blocks: More blocks and more attention heads.
Residual Blocks: BigGAN-style residual blocks used for upsampling and downsampling.
Adaptive Group Normalization: Incorporates the timestep and class label into normalization.
Classifier Guidance: Uses a classifier's gradients to steer samples toward a specified class.
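Classifier guidance amounts to shifting the reverse-step mean by the classifier's gradient, scaled by the step variance. A minimal sketch with toy stand-in values (no real classifier or diffusion model involved):

```python
import numpy as np

def guided_mean(mu: np.ndarray, sigma2: float,
                grad_log_py: np.ndarray, s: float = 1.0) -> np.ndarray:
    """Shift the reverse-step mean toward higher classifier log-probability
    of the target class. `s` is the guidance scale (larger = stronger pull)."""
    return mu + s * sigma2 * grad_log_py

# Hypothetical stand-in values, purely illustrative:
mu = np.zeros(4)
grad = np.array([1.0, -1.0, 0.5, 0.0])  # pretend classifier gradient
print(guided_mean(mu, sigma2=0.01, grad_log_py=grad, s=10.0))
```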
Mathematical Explanation
Notation:
Image: Represented as x_t (subscript t denotes the timestep).
Forward Function (q): Adds noise, transforming x_{t-1} into x_t.
Reverse Function (p): Neural-network denoising, transforming x_t into x_{t-1}.
Forward Process: Noise is applied iteratively; the whole chain admits a closed-form cumulative representation.
Linear schedule for the noise.
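Because each step adds Gaussian noise, the forward chain collapses into a single closed-form jump, q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I). A sketch assuming the linear schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear schedule
alphas_bar = np.cumprod(1.0 - betas)   # cumulative product abar_t

def q_sample(x0: np.ndarray, t: int) -> np.ndarray:
    """Jump straight from x_0 to x_t in one step (standard DDPM identity)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.standard_normal((8, 8))       # toy stand-in for an image
xt = q_sample(x0, t=999)               # near-pure noise at the final step
print(xt.shape)
```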
Reverse Process: Predicts mu (the mean of the reverse Gaussian) at each step, which eventually simplifies to predicting the noise itself.
Variational Lower Bound and Loss Function
Objective: Optimize a variational lower bound, since the likelihood is intractable to compute directly.
Reformulated using Bayes' rule to make it tractable.
Optimization: Simplifies to the mean squared error between the actual and predicted noise.
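Written out in the usual DDPM notation, the simplified objective is the expected squared error between the true noise and the network's prediction at a noised input:

```latex
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[
      \left\| \epsilon - \epsilon_\theta\!\left(
        \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t
      \right) \right\|^2
    \right]
```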
Final Objectives and Algorithm
Training:
Sample data, a timestep, and noise.
Optimize the objective via gradient descent.
Sampling:
Start with pure noise -> Iterative denoising using the trained model -> Final clean image.
Iteration: Key to achieving high-quality results.
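The sampling procedure above can be sketched as follows; `fake_eps_model` is a hypothetical stand-in for the trained noise-prediction network, so the output here is only illustrative noise, not an image:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def fake_eps_model(x: np.ndarray, t: int) -> np.ndarray:
    """Hypothetical stand-in for the trained noise predictor eps_theta."""
    return np.zeros_like(x)

def ddpm_sample(shape=(8, 8)) -> np.ndarray:
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(T)):
        eps = fake_eps_model(x, t)
        # Posterior mean from the predicted noise (standard DDPM update).
        mu = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # no noise is added on the final step
            mu = mu + np.sqrt(betas[t]) * rng.standard_normal(shape)
        x = mu
    return x

sample = ddpm_sample()
print(sample.shape)
```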
Results and Comparisons
FID Scores on ImageNet (256x256):
Improved DDPM: 12.3
OpenAI: 4.59 (Ablated Diffusion Model), 3.94 (with upsampling)
Comparison with Other Models: Diffusion models are catching up to GANs and may soon surpass them.
Conclusion
Diffusion Models: Powerful generative models making strides in image synthesis.
Transformation: From noise to high-fidelity images, iteratively, using a trained model.
Future Potential: Likely to surpass GANs with continued research and improvements.
Recap:
Dual processes of adding noise and learning to remove noise.
Neural network architecture for efficient denoising.
Mathematical grounding behind objectives and processes.
Continued improvements yielding better results over time.