Denoising Diffusion Implicit Models from Scratch
A step-by-step guide using PyTorch
In my previous article, we delved into understanding and coding Denoising Diffusion Probabilistic Models (DDPMs) using PyTorch. Now we turn to the idea behind Denoising Diffusion Implicit Models (DDIMs). Again, I won't dive into the mathematical derivations, but I will try to explain the idea behind the necessary formulas.
At a high level, DDIMs are a variation of DDPMs that allow faster and, optionally, fully deterministic sampling. DDPMs use a stochastic approach that injects randomness at every step of the generation process, whereas DDIMs can use a deterministic update, so the same starting noise always produces the same output.
To understand the mathematical transformation from DDPMs to DDIMs, it’s important to first grasp the foundational concepts of DDPMs and then see how DDIMs modify this framework. Let’s break it down into more digestible parts.
Understanding DDPMs
DDPMs are based on a stochastic process with two main phases: the forward (noising) process and the reverse (denoising) process.
Forward Process
DDPMs simulate a diffusion process, which essentially adds noise to data (such as images) over a series of steps. Mathematically, this is typically modeled as a Markov chain in which an image x₀ is gradually corrupted by Gaussian noise over T steps. In closed form, the noised image at step t can be written as:
xₜ = \sqrt{\bar{αₜ}} x₀ + \sqrt{1 − \bar{αₜ}} ϵ
where ϵ is sampled from a standard normal distribution, and \bar{αₜ} are variance schedule parameters.
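As a minimal sketch of the forward process described above, the snippet below noises a batch of images directly to arbitrary steps t. The linear beta schedule, the helper names `make_alphabar` and `q_sample`, the 0-based step indexing, and the tensor shapes are assumptions made for illustration; they are not taken from the DDPM class used later in this article.

```python
import torch


def make_alphabar(beta1: float = 1e-4, beta2: float = 0.02, n_timesteps: int = 1000) -> torch.Tensor:
    """Cumulative product of alpha_t = 1 - beta_t for an assumed linear beta schedule."""
    betas = torch.linspace(beta1, beta2, n_timesteps)
    return torch.cumprod(1.0 - betas, dim=0)


def q_sample(x0: torch.Tensor, t: torch.Tensor, alphabar: torch.Tensor, eps: torch.Tensor = None) -> torch.Tensor:
    """Closed-form forward process: x_t = sqrt(alphabar_t) * x0 + sqrt(1 - alphabar_t) * eps."""
    if eps is None:
        eps = torch.randn_like(x0)  # eps ~ N(0, I)
    ab = alphabar[t].view(-1, 1, 1, 1)  # one value per sample, broadcast over (C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps


alphabar = make_alphabar()
x0 = torch.randn(4, 3, 32, 32)       # stand-in for a batch of images
t = torch.tensor([0, 99, 499, 999])  # a different (0-based) step for each sample
x_t = q_sample(x0, t, alphabar)
```

Because \bar{αₜ} shrinks as t grows, samples at later steps keep less of x₀ and more of the noise.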
Reverse Process
The model learns to reverse this diffusion process: it predicts the noise that was added at each step and removes it, which is essentially a denoising step. In DDPMs, each reverse step samples from a Gaussian distribution, and this is where the probabilistic aspect comes into play. This can be represented as drawing xₜ₋₁ around a predicted mean μθ(xₜ, t) with standard deviation σₜ:
xₜ₋₁ ∼ N(μθ(xₜ, t), σₜ² I)
Here, even with the same xₜ and t, the output xₜ₋₁ can vary due to the sampling from the Gaussian distribution.
Mathematically, the reverse transition is modeled as:
xₜ₋₁ = (1 / \sqrt{αₜ}) (xₜ − ((1 − αₜ) / \sqrt{1 − \bar{αₜ}}) ϵθ(xₜ, t)) + σₜ z
where:
- xₜ₋₁ is the estimate of the data at the previous time step.
- xₜ is the noised data at the current time step.
- αₜ and \bar{αₜ} are the noise schedule parameters, with \bar{αₜ} being the cumulative product of αₜ up to time t.
- ϵθ(xₜ, t) is the predicted noise, estimated by the neural network.
- σₜ is the standard deviation of the noise added at step t to maintain stochasticity, and z is a noise sample drawn from a standard Gaussian.
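The reverse transition described by these terms can be sketched as a single function. This is only an illustration under stated assumptions: σₜ is set to \sqrt{βₜ} (one common choice), steps are indexed from 0, and `eps_pred` is a random stand-in for the output of the trained network ϵθ(xₜ, t). The name `ddpm_reverse_step` is hypothetical.

```python
import torch


def ddpm_reverse_step(x_t, t, eps_pred, betas, alphabar, z=None):
    """One DDPM reverse step, assuming sigma_t = sqrt(beta_t)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    if z is None:
        # Fresh Gaussian noise at every step except the last one (t == 0)
        z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    # Remove the predicted noise and rescale (note that 1 - alpha_t == beta_t)
    mean = (x_t - beta_t / (1.0 - alphabar[t]).sqrt() * eps_pred) / alpha_t.sqrt()
    return mean + beta_t.sqrt() * z


betas = torch.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphabar = torch.cumprod(1.0 - betas, dim=0)

x_t = torch.randn(2, 3, 8, 8)
eps_pred = torch.randn_like(x_t)  # stand-in for eps_theta(x_t, t)

# Same x_t, same t, same predicted noise, but a fresh z in each call
a = ddpm_reverse_step(x_t, 500, eps_pred, betas, alphabar)
b = ddpm_reverse_step(x_t, 500, eps_pred, betas, alphabar)
same = torch.allclose(a, b)
```

Calling the step twice with identical inputs gives two different results, which is the stochasticity that DDIMs remove.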
Think of the forward process (adding noise) as a journey where you randomly walk away from your starting point (the original image) in a fog (noise). DDPMs try to find their way back (reverse the process) by guessing at each step.
For an in-depth understanding of DDPMs, please refer to the following article and its references.
Transition to DDIMs
DDIMs modify the reverse process to make it deterministic. Instead of sampling noise from a distribution at each step, DDIMs directly compute the denoised data. This is done using an implicit model that estimates the original data from the noisy data at each step.
The core transformation involves generalizing the Markovian forward process used in DDPMs to a family of non-Markovian processes that share the same training objective. In simple terms, DDIMs use a fixed function to reverse each step of the diffusion process, unlike the random sampling in DDPMs. This works because each step is conditioned on both the current state and an estimate of the original data x₀, rather than on the current state alone. That is, the primary denoising function in DDIMs, written fθ(xₜ, t) below, is deterministic and guides the reverse process. To optionally reintroduce variability, DDIMs add a random noise term, which is sampled from a Gaussian distribution:
xₜ₋₁ = fθ(xₜ, t) + σₜϵₜ
Here, ϵₜ is a sample from a standard normal distribution, scaled by σₜ. The primary computation (the deterministic function) is fixed for a given xₜ and t, but the noise term introduces variability.
The modified equation for the reverse process in DDIMs is (the DDIM paper writes αₜ for the cumulative product that is denoted \bar{αₜ} here):
xₜ₋₁ = \sqrt{\bar{αₜ₋₁}} ((xₜ − \sqrt{1 − \bar{αₜ}} ϵθ⁽ᵗ⁾(xₜ)) / \sqrt{\bar{αₜ}}) + \sqrt{1 − \bar{αₜ₋₁} − σₜ²} ϵθ⁽ᵗ⁾(xₜ) + σₜϵₜ
where:
- the component "predicted x₀" (the first term) represents the model’s prediction of the original data point (or image) x₀ based on the current noisy version xₜ. The term ϵθ⁽ᵗ⁾(xₜ) is the noise predicted by the neural network for the current step. The ratio in parentheses inverts the forward noising formula to recover an estimate of x₀. Multiplying it by \sqrt{\bar{αₜ₋₁}} rescales that estimate to the signal level expected at step t − 1.
- the component "direction pointing to xₜ" (the second term) captures the direction in the latent space that the model should move along. The term ϵθ⁽ᵗ⁾(xₜ) provides this direction. It is scaled by \sqrt{1 − \bar{αₜ₋₁} − σₜ²} so that, together with the random noise term, the total noise level matches what is expected at step t − 1, and the model doesn’t overshoot or undershoot the target.
- the component "random noise" (the third term) is a random noise term, where ϵₜ is sampled from a standard Gaussian distribution. This term introduces a small amount of stochasticity to the reverse process, allowing for some variation in the generated samples. With σₜ = 0 it vanishes and the update is fully deterministic.
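To make the three components concrete, here is a minimal sketch of one DDIM update. It is an illustration, not the article's implementation: the function name `ddim_step` and the toy values of \bar{αₜ} and \bar{αₜ₋₁} are assumptions, and the "network" is replaced by the true noise so the result can be checked by hand. With a perfect noise prediction and σₜ = 0, the predicted x₀ equals the real x₀, and the step lands exactly on the less noisy version of the same image.

```python
import torch


def ddim_step(x_t, eps_pred, ab_t, ab_prev, sigma_t=0.0, noise=None):
    """One DDIM update; ab_t and ab_prev are the cumulative products at steps t and t-1."""
    # 1) Predicted x0: invert the forward noising formula
    x0_pred = (x_t - (1.0 - ab_t).sqrt() * eps_pred) / ab_t.sqrt()
    # 2) Direction pointing to x_t: re-add the predicted noise at the level of step t-1
    direction = (1.0 - ab_prev - sigma_t ** 2).sqrt() * eps_pred
    # 3) Random noise: vanishes when sigma_t == 0
    if noise is None:
        noise = torch.randn_like(x_t)
    x_prev = ab_prev.sqrt() * x0_pred + direction + sigma_t * noise
    return x_prev, x0_pred


torch.manual_seed(0)
ab_t, ab_prev = torch.tensor(0.5), torch.tensor(0.6)  # toy schedule values (assumed)
x0 = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x0)
x_t = ab_t.sqrt() * x0 + (1.0 - ab_t).sqrt() * eps  # forward process to step t

# Use the true noise as a stand-in for a perfect network prediction
x_prev, x0_pred = ddim_step(x_t, eps, ab_t, ab_prev, sigma_t=0.0)
expected = ab_prev.sqrt() * x0 + (1.0 - ab_prev).sqrt() * eps
```

In practice the prediction is imperfect, so the predicted x₀ is only an estimate that gets refined as the steps proceed.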
In comparison to DDPMs, this means that, in DDIMs, instead of guessing at each step, you follow a fixed formula that tells you how to remove the estimated blur (noise). Because the path back to the clear picture is set rather than random, it can also be traversed in fewer steps, which is what makes sampling quicker.
The paper shows that when defining
σₜ = \sqrt{(1 − \bar{αₜ₋₁}) / (1 − \bar{αₜ})} \sqrt{1 − \bar{αₜ} / \bar{αₜ₋₁}}
for all t, the forward process becomes Markovian, and the generative process becomes a DDPM. This means that if you’ve trained a model as a DDIM and then decide to use it as a DDPM, it's not necessary to retrain it if you use the formula above for σₜ. The learned parameters and weights are the same, since the only change is in the sampling method. σₜ is often scaled by a factor η:
σₜ(η) = η \sqrt{(1 − \bar{αₜ₋₁}) / (1 − \bar{αₜ})} \sqrt{1 − \bar{αₜ} / \bar{αₜ₋₁}}
This recovers the original DDPM generative process when η = 1 and the deterministic DDIM when η = 0.
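The role of η can be checked numerically. The sketch below assumes a linear beta schedule, 0-based step indexing, and a hypothetical helper `ddim_sigma`. It computes σₜ for η = 0 and η = 1 and compares the latter with the standard deviation of the DDPM posterior, \sqrt{βₜ (1 − \bar{αₜ₋₁}) / (1 − \bar{αₜ})}, which is one of the variance choices discussed for DDPMs.

```python
import torch


def ddim_sigma(ab_t: torch.Tensor, ab_prev: torch.Tensor, eta: float) -> torch.Tensor:
    """sigma_t(eta) = eta * sqrt((1 - ab_prev) / (1 - ab_t)) * sqrt(1 - ab_t / ab_prev)."""
    return eta * ((1.0 - ab_prev) / (1.0 - ab_t)).sqrt() * (1.0 - ab_t / ab_prev).sqrt()


# Assumed linear schedule, in float64 to keep the comparison tight
betas = torch.linspace(1e-4, 0.02, 1000, dtype=torch.float64)
alphabar = torch.cumprod(1.0 - betas, dim=0)

t = 500
ab_t, ab_prev = alphabar[t], alphabar[t - 1]

sigma_ddim = ddim_sigma(ab_t, ab_prev, eta=0.0)  # deterministic DDIM: no added noise
sigma_ddpm = ddim_sigma(ab_t, ab_prev, eta=1.0)  # DDPM-like sampling

# Since ab_t / ab_prev = alpha_t = 1 - beta_t, eta = 1 matches the DDPM posterior std
posterior_std = ((1.0 - ab_prev) / (1.0 - ab_t) * betas[t]).sqrt()
```

Values of η between 0 and 1 interpolate between the two samplers by adding a fraction of the DDPM noise at each step.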
In summary, the transition from DDPMs to DDIMs involves moving from a stochastic, random-walk type of reverse process to a deterministic, guided path. This change is reflected in the mathematical formulation of the reverse process, enabling DDIMs to efficiently reconstruct the original data from its noisy version.
Coding this in PyTorch is as simple as creating a new DDIM class that inherits from DDPM to be used during training (code modified from minDiffusion):
import torch
import torch.nn as nn

from ddpm import DDPM


class DDIM(DDPM):
    """
    DDIM Sampling.

    Args:
        eps_model: A neural network model that predicts the noise term given a tensor.
        betas: A tuple containing two floats, which are parameters used in the DDPM schedule.
        eta: Scaling factor for the random noise term (0 gives deterministic DDIM sampling).
        n_timesteps: Number of timesteps in the diffusion process.
        criterion: Loss function.
    """

    def __init__(
        self,
        eps_model: nn.Module,
        betas: tuple[float, float],
        eta: float,
        n_timesteps: int,
        criterion: nn.Module = nn.MSELoss(),
    ) -> None:
        super().__init__(eps_model, betas, n_timesteps, criterion)
        self.eta = eta

    def sample(self, n_samples: int, size: tuple[int, ...], device: str) -> torch.Tensor:
        # Initialize x_i with random noise from a standard normal distribution
        # x_i corresponds to x_T in the diffusion process, where T is the total number of timesteps
        x_i = torch.randn(n_samples, *size, device=device)  # x_T ~ N(0, 1)

        # Iterate backwards through the timesteps from n_timesteps down to 1 (inclusive)
        # Assumes self.alphabar_t (from DDPM) has entries for indices 0..n_timesteps
        for i in range(self.n_timesteps, 0, -1):
            # Sample additional random noise z, unless i is 1 (in which case z is 0, i.e., no additional noise)
            z = torch.randn(n_samples, *size, device=device) if i > 1 else 0  # z ~ N(0, 1) for i > 1, else z = 0

            # Predict the noise eps to be removed at the current timestep, using the eps_model
            # The current timestep i is normalized by n_timesteps and replicated for each sample
            eps = self.eps_model(x_i, torch.tensor(i / self.n_timesteps).to(device).repeat(n_samples, 1))

            # Calculate the predicted x0 (original data) at timestep i
            x0_t = (x_i - eps * (1 - self.alphabar_t[i]).sqrt()) / self.alphabar_t[i].sqrt()

            # Compute coefficients for the DDIM sampling process:
            # c1 is sigma_t scaled by eta, c2 scales the direction pointing to x_t
            c1 = self.eta * (
                (1 - self.alphabar_t[i] / self.alphabar_t[i - 1])
                * (1 - self.alphabar_t[i - 1])
                / (1 - self.alphabar_t[i])
            ).sqrt()
            c2 = ((1 - self.alphabar_t[i - 1]) - c1 ** 2).sqrt()

            # Update x_i using the DDIM formula: predicted x0 + random noise + direction
            x_i = self.alphabar_t[i - 1].sqrt() * x0_t + c1 * z + c2 * eps

        return x_i
Then, it's just a matter of training using DDIM instead of DDPM:
import os

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from IPython.display import display
from torch.utils.data import DataLoader
from torchvision.utils import make_grid, save_image
from tqdm import tqdm

# NaiveUnet (the U-Net from minDiffusion) must also be importable;
# its module path depends on your project layout.


def train(data_loader: DataLoader, n_epoch: int = 1000, device: str = "cuda:0") -> None:
    # Initializing the DDIM model with a specified U-Net architecture, beta values, and timesteps
    ddim = DDIM(eps_model=NaiveUnet(3, 3, n_feat=256), betas=(1e-4, 0.02), n_timesteps=1000, eta=0.3)
    ddim.to(device)

    # Setting up the optimizer for training (Adam optimizer with a learning rate of 1e-3)
    optimizer = torch.optim.Adam(ddim.parameters(), lr=1e-3)

    # Create a list to store the average loss of each epoch
    epoch_losses = []

    # Make sure the output folder for the sample grids exists
    os.makedirs("./images", exist_ok=True)

    # Setting up the plot
    fig, ax = plt.subplots()
    display_handle = display(fig, display_id=True)

    # Main training loop
    for epoch in tqdm(range(n_epoch), desc="Processing epoch", leave=False):
        ddim.train()  # Setting the model to training mode
        pbar = tqdm(data_loader, desc="loss:", leave=False)
        batch_losses = []  # List to store the loss of each batch

        for x, _ in pbar:
            optimizer.zero_grad()  # Zeroing the gradients
            x = x.to(device)  # Moving the batch of images to the specified device
            loss = ddim(x)  # Forward pass to compute the loss
            loss.backward()  # Backward pass to compute gradients
            nn.utils.clip_grad_norm_(ddim.parameters(), 1.0)  # Clip gradients to a max norm of 1.0
            optimizer.step()  # Update model parameters

            # Store the loss of each batch
            batch_losses.append(loss.item())
            pbar.set_description(f"loss: {loss.item():.4f}")

        # Calculate and store the average loss of the epoch
        epoch_avg_loss = sum(batch_losses) / len(batch_losses)
        epoch_losses.append(epoch_avg_loss)

        # Clear the previous plot and plot the updated epoch_losses
        ax.clear()
        ax.plot(epoch_losses, label='Epoch Loss')
        ax.legend()
        ax.set_xlabel('Epoch')
        ax.set_ylabel('Loss')
        ax.set_title(f'Training Loss up to Epoch {epoch + 1}')
        ax.grid(True)

        # Redraw the plot
        display_handle.update(fig)

        # Evaluation and sample generation
        ddim.eval()
        with torch.no_grad():
            if epoch % 1 == 0:  # Sample after every epoch; raise the modulus to sample less often
                samples = ddim.sample(8, (3, 32, 32), device)
                # Stack generated samples on top of real images from the last batch
                sample_set = torch.cat([samples, x[:8]], dim=0)
                grid = make_grid(sample_set, normalize=True, value_range=(-1, 1), nrow=4)
                save_image(grid, f"./images/ddim_sample_{epoch}.png")

        # Saving the model weights
        torch.save(ddim.state_dict(), "./ddim_weights.pth")

    plt.close()
The modified code is available (with comments and plots) on my personal GitHub, under ddim.ipynb.
This story is published on Generative AI.
