Ben

Summary

The article discusses the diffusion model, a class of generative models that convert noise into images through a backward denoising process, and its implementation using TensorFlow.

Abstract

The diffusion model is a generative model that transforms noise into images via a backward denoising process, as proposed by Sohl-Dickstein et al. in 2015. This model aims to address the limitations of other generative models, such as GANs, VAEs, and flow-based models, by providing a more scalable and stable training process. The diffusion model uses a slow and iterative process to convert noise into images, making it more scalable than GAN models. Additionally, the diffusion model's training process is more stable since it is based on supervised learning, unlike GANs, which rely on unsupervised learning. The article also covers the implementation of the diffusion model using TensorFlow and provides an overview of the model architecture and training process.

Opinions

  • The diffusion model provides a more scalable and stable training process than other generative models.
  • The diffusion model uses a slow and iterative process to convert noise into images, making it more scalable than GAN models.
  • The diffusion model's training process is more stable since it is based on supervised learning, unlike GANs, which rely on unsupervised learning.
  • The article discusses the implementation of the diffusion model using TensorFlow.
  • The model architecture and training process are also discussed in the article.
  • The diffusion model aims to address the limitations of other generative models, such as GANs, VAEs, and flow-based models.
  • The diffusion model can generate high-quality images, such as those produced by the current state-of-the-art image generation model, StyleGAN-XL.

Understanding the Diffusion Model and the theory behind it

TensorFlow implementation with explanation

Image by Author

AI image generation has been hotly discussed in both the art and Deep Learning (DL) communities. You have probably heard of AI art generators such as DALL-E 2 or NovelAI, DL models that generate realistic-looking images from a given text prompt.

To explore this technology more deeply, we need to introduce a newer class of generative model called the diffusion model, first proposed by Sohl-Dickstein et al. (2015), which generates images from noise using a backward denoising process.

Several families of generative models already exist, including GANs, VAEs and flow-based models. Most of them can generate high-quality images; StyleGAN-XL, for example, is the current state-of-the-art image generation model. However, each has limitations of its own.

GAN models are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAEs rely on a surrogate loss. Flow models have to use specialized architectures to construct reversible transforms (Lilian Weng, 2021).

The diffusion model converts noise into an image through a slow, iterative process, which makes it more scalable than a GAN. Besides, since the diffusion model is trained to predict the noise that was added to its input, which is a supervised task with a known target, we can expect its training to be much more stable than that of a GAN (unsupervised, adversarial learning).

The implementation in this article is based on Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020) and Denoising Diffusion Implicit Models (DDIM) (Song et al., 2021).

Table of Contents

· What are Diffusion Models?
· Forward Noising
  • Property 1: Reparameterization trick
  • Property 2: Xt at any timestep can be represented by X0 and β
  • Mathematical proof of Property 2
· Backward Denoising
  • DDPM
  • Mathematics behind the reverse process
· Model Architecture and Training
  • U-net blocks
  • U-net model
  • Training
· Result
· Reference

What are Diffusion Models?

the diffusion process of the particle in water (source)

Diffusion is defined as the movement of a substance from a region of higher concentration to a region of lower concentration.

Inspired by this concept, the diffusion model defines a Markov chain that slowly adds random noise to the image. The Markov chain can be seen as the diffusion, and the process of adding noise as the movement. Thus, our target is to find the noise (movement) added to the image and reverse this process.

The diffusion model is mainly composed of two processes, Forward Noising and Backward Denoising; this can be regarded as continuously adding noise to the image and then reversing this process. 😈 😈

Forward Noising

In the DDPM paper, the authors define the forward process as:

Eq. 1: Definition of the Markov chain in Forward noising
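
For reference, the forward process in the DDPM paper (Ho et al.) is usually written as follows, where β_t is the variance added at step t and I is the identity matrix. This is a restatement in the paper's standard notation:

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})
$$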

The above is a Markov chain in which each timestep t depends only on the previous step t−1. We use a variance schedule β to control the mean and variance of each step, where β₁ < β₂ < … < β_T.

We start from x0 (sampled from the real data distribution q(x)) and then reassign the mean and variance of x0 according to the schedule to generate x1. Repeating this step eventually leads to the final state xT, which is Gaussian noise. This process can be seen as pushing the image outward iteratively until it leaves the real data distribution and becomes noise.

The figure describes the forward process (source)

Before we start coding, let me first introduce two important properties in the diffusion model.

Property 1: Reparameterization trick

In the diffusion model, many values need to be sampled from a distribution, e.g. z ~ N(z; μ, σ²). However, we cannot backpropagate through such a sampling step, since a random draw has no derivative with respect to μ and σ.

The reparameterization trick provides another form of the sampling process. Instead of sampling z directly from N(z; μ, σ²), we can rewrite it as:

Eq. 2: reparameterization of Normal distribution
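
The standard form of this reparameterization is given below, where ε is an auxiliary noise variable (the symbol name is a convention, not taken from the article) and ⊙ is element-wise multiplication:

$$
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$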

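Now the randomness is transferred to an auxiliary variable ε sampled from a standard Gaussian distribution, and z becomes a deterministic function of μ, σ and ε. This makes the process differentiable with respect to μ and σ, since the sampled ε is just a fixed value during training. 💛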
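
To make the idea concrete, here is a minimal sketch in plain Python (standard library only, rather than TensorFlow, so it runs anywhere). The function name `sample_reparameterized` is hypothetical and the values of μ and σ are arbitrary:

```python
import random

def sample_reparameterized(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of mu, sigma and eps."""
    eps = rng.gauss(0.0, 1.0)  # all randomness lives in eps ~ N(0, 1)
    z = mu + sigma * eps       # z is linear in mu and sigma once eps is fixed
    return z, eps

rng = random.Random(0)
z, eps = sample_reparameterized(mu=2.0, sigma=0.5, rng=rng)

# For a fixed eps, the derivatives are simple: dz/dmu = 1 and dz/dsigma = eps.
samples = [sample_reparameterized(2.0, 0.5, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Because `z` is an ordinary arithmetic expression of `mu` and `sigma`, an autodiff framework can propagate gradients through it, which is the whole point of the trick.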

Property 2: Xt at any timestep can be represented by X0 and β

To obtain a noisy image Xt, we would need to walk through the Markov chain step by step until we reach timestep t. Obviously, this is very inefficient, especially in DL training, where a whole batch of inputs with different timesteps is processed simultaneously.

Thanks to the reparameterization trick, we can instead get the noisy image Xt directly from the initial image X0 and the corresponding timestep t, based on the following formula:

Eq. 3: Distribution of q after applying reparameterization trick
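
In the standard DDPM notation, with α_t = 1 − β_t and ᾱ_t the running product of the α values (both are defined again in the proof at the end of this section), this closed form reads:

$$
q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I)
$$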

(The detailed derivation is at the end of this section.)

Finally, let's code the forward-noising process based on Property 2 😄

The above shows two different versions of the diffusion schedule, discrete and continuous. This implementation focuses on the former.

For more information about continuous Diffusion schedules, I recommend reading the Keras example of the DDIM model.
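
As a rough illustration of the discrete schedule and of Property 2, here is a dependency-free sketch in plain Python. It is not the article's TensorFlow code: the function names are hypothetical, the "image" is a toy list of three pixel values, and the linear schedule from 1e-4 to 0.02 over 1000 steps is an assumption borrowed from the DDPM paper's defaults.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Discrete variance schedule with beta_1 < ... < beta_T (linear spacing)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 (Property 2); t indexes the schedule from 0."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n
          for x, n in zip(x0, noise)]
    return xt, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -0.25, 1.0]  # toy "image" with three pixels
xt, noise = forward_noise(x0, t=999, alpha_bars=alpha_bars, rng=random.Random(0))
```

At the last timestep `alpha_bars[-1]` is close to zero, so `xt` is almost entirely noise, which matches the description of xT as Gaussian noise.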

Image created by the author

Mathematical proof of Property 2

Eq. 4: the noising process

Our final target is to obtain xT by going through multiple Gaussian conditional probabilities q(xt|xt−1).

First, we redefine alpha and beta as below:

Eq. 5: redefine alpha and beta

Using the reparameterization trick, we can rewrite q(xt|xt−1) as:

Eq. 6: Gaussian conditional probability q(xt|xt−1)

where z1 ~ N(0, I)

Expanding xt, we get:

Eq. 6: calculation of Xt

Since the sum of two independent Gaussians is also a Gaussian:

Eq. 7: the sum of two Gaussians is also a Gaussian

The merged standard deviation is

Eq. 8: merged standard deviation

Finally, using this result, we can get the noisy image at any timestep.

Eq. 9: Distribution of q after applying reparameterization trick
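
The steps of this proof can be written out as follows. This is the standard derivation in DDPM notation; the z terms are independent standard Gaussian noise variables, and the merged noise \bar{z} plays the role of the noise written ε elsewhere.

$$
\begin{aligned}
\alpha_t &= 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i \\
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,z_{t-1} \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\,z_{t-2} + \sqrt{1-\alpha_t}\,z_{t-1} \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{z}_{t-2} \\
    &= \dots \\
    &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\bar{z}
\end{aligned}
$$

The third line uses the fact that the sum of two independent zero-mean Gaussians is again Gaussian, with the variances added:

$$
\mathcal{N}(0, \sigma_1^2 I) + \mathcal{N}(0, \sigma_2^2 I) = \mathcal{N}\left(0, (\sigma_1^2 + \sigma_2^2) I\right),
\qquad
\sqrt{\alpha_t (1-\alpha_{t-1}) + (1-\alpha_t)} = \sqrt{1 - \alpha_t \alpha_{t-1}}
$$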

🥀 🥀 🥀 🥀 🥀🥀 🥀 🥀 🥀 🥀🥀 🥀 🥀 🥀 🥀🥀 🥀 🥀 🥀 🥀🥀 🥀 🥀🥀🥀🥀🥀🥀

Backward Denoising

If the forward process is the process of adding noise, then the backward process is to remove the noise.

The figure describes the backward process (source)

If we could find the reverse distribution q(xt−1|xt), we can recreate the real image from Gaussian noise xT ~ N(0, I). Since q(xt|xt−1) is a Gaussian, if βt is small enough, q(xt−1|xt) will also be a Gaussian.

However, we cannot estimate q(xt−1|xt) easily, since doing so would require knowing the entire data distribution; thus, we learn a model p to approximate this conditional probability. The distribution p(xt−1|xt) is written as:

Eq. 10: reverse distribution p(xt−1|xt)

Our target is to obtain the initial state X0.

Eq. 11: the process to obtain initial state X0
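
In the usual DDPM notation, where θ denotes the parameters of the learned model, the two expressions above take the following form (a restatement of the standard formulas):

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),
\qquad
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
$$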

DDPM

In the DDPM paper, the authors define the sampling process as:

Sampling process in DDPM (source)

The reverse distribution p(xt−1|xt) could be written as:

Eq. 12: reverse distribution p(xt−1|xt) in DDPM
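
Written out in the DDPM paper's notation, with ε_θ the noise predicted by the network and z fresh Gaussian noise, one reverse step is commonly expressed as:

$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right),
\qquad
x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z,\quad z \sim \mathcal{N}(0, I)
$$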

We use the U-Net model to predict ε_θ from the input (xt, t). In addition, DDPM does not train σ_θ; it fixes σ_θ (σ_t in the image above) to a value given by the schedule, with σ_t² set to βt.

Let's code this !!! ~~~ 😙

The above is the denoising process of DDPM. However, I prefer the DDIM denoising process, which is based on:
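
A single DDPM reverse step can be sketched in plain Python as follows. This is a simplified stand-in for TensorFlow code: the name `ddpm_step` is hypothetical, images are flat lists of floats, and `eps_pred` stands for whatever noise the U-Net would predict.

```python
import math
import random

def ddpm_step(xt, eps_pred, beta_t, alpha_bar_t, rng, add_noise=True):
    """One DDPM reverse step x_t -> x_{t-1}, using sigma_t^2 = beta_t."""
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    # Posterior mean: remove the predicted noise, then rescale.
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps_pred)]
    if not add_noise:  # at the final step the mean is returned without noise
        return mean
    sigma_t = math.sqrt(beta_t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]

# Sanity check with a one-step chain, where alpha_bar_1 = alpha_1 = 1 - beta_1.
beta = 0.02
x0 = [0.5, -1.0]
eps = [0.3, -0.7]
xt = [math.sqrt(1.0 - beta) * x + math.sqrt(beta) * e for x, e in zip(x0, eps)]
recovered = ddpm_step(xt, eps, beta, 1.0 - beta, random.Random(0), add_noise=False)
noisy = ddpm_step(xt, eps, beta, 1.0 - beta, random.Random(0))
```

When the true noise is supplied in place of a prediction, the noise-free step recovers `x0` up to floating-point error, which is a handy check that the coefficients are right.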

Eq. 13: Sampling process in DDIM
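
For reference, the DDIM update rule from Song et al., rewritten here with the DDPM symbol ᾱ_t for the cumulative product (the DDIM paper itself writes this quantity as α_t), is:

$$
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)
+ \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\;\epsilon_\theta(x_t, t)
+ \sigma_t \epsilon_t
$$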

By setting σ to 0, we remove the randomness during sampling. The deterministic DDIM sampler can also skip timesteps and use far fewer steps than DDPM, which reduces the inference time.

Mathematics behind the reverse process
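
With σ = 0 the DDIM update has no random term, so a step can be sketched as a pure function. The following plain-Python version is illustrative only: `ddim_step` is a hypothetical name, images are flat lists of floats, and `eps_pred` stands for the network's noise prediction.

```python
import math

def ddim_step(xt, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (sigma = 0) from x_t to an earlier timestep."""
    # Estimate the clean image from x_t and the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
               for x, e in zip(xt, eps_pred)]
    # Re-noise that estimate to the earlier timestep using the same noise.
    x_prev = [math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
              for x0, e in zip(x0_pred, eps_pred)]
    return x_prev, x0_pred

x0 = [0.5, -1.0]
eps = [0.3, -0.7]
alpha_bar_t, alpha_bar_prev = 0.5, 0.8
xt = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
      for x, e in zip(x0, eps)]
x_prev, x0_pred = ddim_step(xt, eps, alpha_bar_t, alpha_bar_prev)
```

Nothing in the function requires `alpha_bar_prev` to belong to the immediately preceding timestep, which is why this sampler can jump over several steps at once.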

Ok, it's time for some math ~~ 😢 😢

Let us review the sampling process again. Our target is to get the reverse conditional probability q(xt−1|xt). To make it tractable, we first condition it on X0 like this:

Eq. 14: adding x0 into q(xt−1|xt)

After applying Bayes’ rule, we have:

Eq. 15: applying Bayes’ rule to q(xt−1|xt, x0)

Now every q term is a forward-process distribution, which means we can get its mean and variance from Property 2 in the previous section.

Expanding the Gaussian density function of each q, we get:

Eq. 16: Expanding the standard Gaussian density function of q

Continuing the expansion, we get:

Eq. 17: obtain the mu and beta

Finally, only X0 remains to be eliminated. Thanks to the reparameterization trick (Property 1) applied to the closed form of Property 2, we can express X0 in terms of Xt and the noise, and rewrite the mean as:

Eq. 18: replace x0 in mu using the reparameterization trick
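
The key results of this derivation, in the standard DDPM notation (restated here, with ε_t the noise added in the forward process), are:

$$
\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &= \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}
 = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right) \\
\tilde{\mu}_t(x_t, x_0) &= \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t
 + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \\
x_0 &= \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t \right)
\quad\Longrightarrow\quad
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t \right)
\end{aligned}
$$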

As shown above, the only thing we need in order to perform the denoising is ε_t, which is equal to the noise added in the forward process, and a neural network can predict it. Yeah ~~ 😁 😁

Model Architecture and Training

In the diffusion model, we use a U-Net structure to predict the noise ε_t from the noisy image Xt and the timestep t.

image is taken from the official paper of u-net (source)

U-Net is a popular convolutional neural network (CNN) architecture, which was first developed for biomedical image segmentation. It uses convolutional layers to downsample and then upsample the input image, and adds skip connections between layers that have the same resolution.

Here is the link to the official U-Net paper.

Let's write our diffusion model !! ~~~ 👐

Feel free to visit my GitHub. I have put the code there ~~ 😸

U-net blocks

U-net model

For simplicity, I omit the attention layers, which could provide better global coherence. Besides, I use batch normalization instead of group normalization to reduce the amount of computation.

Training

We use the mean squared error (MSE) as the loss function, computed between the noise added in the forward process and the noise predicted by the model.

But why can such a simple function as MSE be used to match the two distributions, p and q?

To answer this, I strongly recommend watching Ari Seff's YouTube video and reading the Lil’Log post. 🥀 🥀
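
The training objective can be sketched in plain Python as below. This is an illustration rather than the article's training loop: `training_loss`, `zero_model` and `oracle_model` are hypothetical names, the three-entry `alpha_bars` list is a toy schedule, and a real setup would use a U-Net with an optimizer in place of the stub models.

```python
import math
import random

def mse(pred, target):
    """Mean squared error between two equally sized lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def training_loss(model, x0, alpha_bars, rng):
    """Noise-prediction loss for one sample at one random timestep."""
    t = rng.randrange(len(alpha_bars))         # pick a random timestep
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]  # the target the model must predict
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n
          for x, n in zip(x0, noise)]          # forward noising (Property 2)
    return mse(model(xt, t), noise)

alpha_bars = [0.9, 0.5, 0.1]  # toy cumulative schedule
x0 = [0.5, -0.25, 1.0]        # toy "image"

def zero_model(xt, t):
    """Stub that always predicts zero noise."""
    return [0.0 for _ in xt]

def oracle_model(xt, t):
    """Stub that cheats by knowing x0, so it recovers the exact noise."""
    a_bar = alpha_bars[t]
    return [(x - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)
            for x, c in zip(xt, x0)]

loss_zero = training_loss(zero_model, x0, alpha_bars, random.Random(0))
loss_oracle = training_loss(oracle_model, x0, alpha_bars, random.Random(0))
```

A perfect noise predictor drives the loss to zero, while a model that ignores its input is left with the full variance of the noise, which is the gap training is meant to close.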

Result

~~~~~~~~~~~~~~

Finally, I hope you enjoyed this article. I will write more articles related to AI, including explanations of their underlying principles and how to implement them.

If that sounds interesting to you, feel free to follow me. 👏 😁

My Medium: link

My LinkedIn: link

Reference

  1. Denoising Diffusion Probabilistic Models
  2. Denoising Diffusion Implicit Models
  3. Keras example — Denoising Diffusion Implicit Models
  4. Lil’ Log — What are Diffusion Models?
  5. Vedant Jumle — Image generation with diffusion models using Keras and TensorFlow