Implementing Denoising Diffusion Models

Sep 22, 2024

Implementation of Diffusion Models: DDPM

Overview

  • Focus on implementing Denoising Diffusion Probabilistic Model (DDPM)
  • Future videos will cover Stable Diffusion with text prompts
  • Training and sampling implementation for DDPM
  • Aim to implement architecture used in latest diffusion models

Diffusion Process

Forward Process

  • Create noisier versions of an image by adding Gaussian noise step-by-step
  • After many steps, results in a noise sample from a normal distribution
  • A transition function is applied at every time step t
  • ( \beta_t ) is the scheduled noise variance added to the image at step t-1 to get the image at step t
  • Alpha defined as:
    • ( \alpha_t = 1 - \beta_t )
    • Cumulative products ( \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s ) allow jumping from the original image directly to the noisy image at any step t
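
The cumulative-product shortcut above is the standard DDPM closed-form forward process: ( x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon ), where ( \epsilon \sim \mathcal{N}(0, I) ) — one reparameterized Gaussian sample replaces t sequential noising steps.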

Reverse Process

  • Model learns reverse process distribution
  • Same functional form as the forward process
  • Model predicts mean and variance
  • Goal: Minimize KL Divergence between the ground-truth denoising distribution and the model's predicted distribution
    • Variance is fixed to match the target distribution, so only the mean is learned
    • This reduces to minimizing the squared difference between the predicted noise and the actual noise
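
With the variance fixed, the KL objective simplifies to the DDPM noise-prediction loss: ( L_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t} \left[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \right] ), i.e. an MSE between the sampled noise and the model's prediction.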

Training Method

  • Sample an image, a random time step t, and a Gaussian noise sample
  • Feed the noisy version of the image (along with the time step) to the model
  • Loss becomes Mean Squared Error (MSE) between original noise and model prediction
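
The three steps above can be sketched as a single PyTorch training step. This is a minimal sketch, not the video's exact code: `scheduler.add_noise` and `scheduler.num_timesteps` are assumed interface names, and `model(xt, t)` is assumed to predict the noise.

```python
import torch
import torch.nn.functional as F

def train_step(model, scheduler, x0, optimizer):
    # Sample a random timestep per image and a Gaussian noise sample
    t = torch.randint(0, scheduler.num_timesteps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    # Forward process: jump straight to the noisy image x_t
    xt = scheduler.add_noise(x0, noise, t)
    # Loss is MSE between the true noise and the model's prediction
    loss = F.mse_loss(model(xt, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```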

Implementation Steps

  • Create noise scheduler to handle forward and reverse processes
  • Utilize a linear noise schedule from 1e-4 to 0.02 over 1000 time steps

Noise Scheduler Functions

  1. Forward Process: Returns the noisy image ( x_t ) given an image, noise, and time step t
  2. Reverse Process: Given ( x_t ) and the model's noise prediction, returns a sample of ( x_{t-1} )
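
A minimal scheduler sketch covering both functions (PyTorch; the method names `add_noise` and `sample_prev` are my own, and the reverse-step variance is fixed to ( \beta_t ), one of the two standard DDPM choices):

```python
import torch

class LinearNoiseScheduler:
    """Linear beta schedule from 1e-4 to 0.02, as in the DDPM paper."""
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

    def add_noise(self, x0, noise, t):
        # Forward process q(x_t | x_0): jump straight to step t
        # using the cumulative products (t is a batch of timesteps)
        sqrt_ab = self.alpha_bars[t].sqrt().view(-1, 1, 1, 1)
        sqrt_omab = (1.0 - self.alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
        return sqrt_ab * x0 + sqrt_omab * noise

    def sample_prev(self, xt, noise_pred, t):
        # One reverse step x_t -> x_{t-1} (t is a plain int here):
        # compute the mean from the predicted noise, then add sigma_t * z
        mean = (xt - self.betas[t] / (1.0 - self.alpha_bars[t]).sqrt() * noise_pred)
        mean = mean / self.alphas[t].sqrt()
        if t == 0:
            return mean  # no noise is added at the final step
        sigma = self.betas[t].sqrt()
        return mean + sigma * torch.randn_like(xt)
```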

Model Architecture

  • Use U-Net architecture
  • Input and output shapes must match; include time step information
  • Time Embedding Block: Converts time steps into a tensor representation through embedding and linear layers
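
One common way to build the time embedding is a transformer-style sinusoidal embedding of the integer timestep, which the U-Net then passes through linear layers. A sketch (the function name and dimensions are assumptions):

```python
import math
import torch

def sinusoidal_time_embedding(t, dim):
    # Sinusoidal position embedding of integer timesteps, as used in transformers
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    emb = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
    # In the U-Net this would typically be followed by a small MLP, e.g.
    # nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
    return emb
```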

U-Net Structure

  • Encoder: Downsampling blocks reduce size, increase channels
  • Mid Block: Operates at the same spatial resolution
  • Decoder: Upsampling blocks increase size, reduce channels
  • Skip connections between corresponding encoding and decoding layers
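
The skip-connection wiring can be sketched schematically as follows (the block call signatures are assumptions; each down block's output is stashed and later concatenated with the matching up block's input):

```python
import torch

def unet_forward(x, t_emb, downs, mid, ups):
    # Encoder: store each down block's output as a skip connection
    skips = []
    for down in downs:
        x = down(x, t_emb)
        skips.append(x)
    # Mid block operates at the lowest spatial resolution
    x = mid(x, t_emb)
    # Decoder: concatenate skips in reverse order along the channel axis
    for up in ups:
        x = up(torch.cat([x, skips.pop()], dim=1), t_emb)
    return x
```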

Down Block Implementation

  • Residual connection, self-attention, downsampling
  • Each residual block consists of normalization, activation, and convolutional layers
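
A residual block in this style can be sketched as normalization, activation, and convolution applied twice with a skip connection (group counts and channel sizes are assumptions; the time-embedding injection between the two convolutions is omitted here for brevity):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # GroupNorm -> SiLU -> Conv, twice, plus a skip connection;
    # a 1x1 conv matches channels when in_ch != out_ch
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block1 = nn.Sequential(nn.GroupNorm(8, in_ch), nn.SiLU(),
                                    nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.block2 = nn.Sequential(nn.GroupNorm(8, out_ch), nn.SiLU(),
                                    nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.block2(self.block1(x)) + self.skip(x)
```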

Mid Block Implementation

  • Similar to the down block, but it operates at a fixed resolution (no downsampling) and interleaves residual blocks with self-attention layers

Up Block Implementation

  • Same structure as the down block, but with downsampling replaced by an upsampling layer, and the skip connection from the matching down block concatenated at the input

Coding the U-Net

  • Initialize parameters and create down, mid, and up blocks based on the image channels
  • Time embedding processed at input to get necessary representation

Training and Sampling

  • Dataset class handles loading and converting images to tensors
  • Training loop samples random noise, applies noise scheduler, and backpropagates loss
  • Sampling method creates random noise sample and iteratively calls reverse process
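
The sampling loop above can be sketched as follows (a sketch assuming the scheduler exposes `num_timesteps` and a `sample_prev` reverse-step method, and that `model(xt, t)` predicts the noise):

```python
import torch

@torch.no_grad()
def sample(model, scheduler, shape):
    # Start from pure Gaussian noise and walk the reverse process back to x_0
    xt = torch.randn(shape)
    for t in reversed(range(scheduler.num_timesteps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        noise_pred = model(xt, t_batch)
        xt = scheduler.sample_prev(xt, noise_pred, t)
    return xt
```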

Configuration File

  • Contains dataset parameters, model parameters, and training parameters
  • Allows flexibility in model block configurations
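
An illustrative layout for such a config file (all key names here are assumptions, not the video's actual keys):

```yaml
dataset_params:
  im_path: data/mnist
  im_size: 28
  im_channels: 1

model_params:
  down_channels: [32, 64, 128]
  time_emb_dim: 128
  num_down_layers: 1

train_params:
  batch_size: 64
  num_epochs: 40
  lr: 1e-4
  num_timesteps: 1000
```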

Results

  • Trained on MNIST and a texture dataset (images resized to 28x28)
  • MNIST converges faster because its images are simple and visually similar to one another
  • Texture dataset takes longer to converge but produces decent images by the end

Conclusion

  • Steps covered: Scheduler, U-Net implementation, training, and sampling code.
  • Encouraged to check previous videos for more detailed information on diffusion models.

  • If you found this helpful, consider subscribing for more content!