Summary

ControlNet is a neural network that adds control to Stable Diffusion by accepting a conditioning image, which is then used to steer the image generation process.

Abstract

ControlNet is a neural network that works in conjunction with a pretrained diffusion model, such as Stable Diffusion. It allows for the inclusion of conditional inputs, such as edge maps, segmentation maps, and key points, into large diffusion models. This incorporation of ControlNets provides greater control over the image generation process, resulting in the ability to generate more specific and desired images. The inner architecture of ControlNet involves creating copies of the original weights rather than training the original weights directly, which prevents overfitting when the dataset is small and maintains the high-quality performance of large models. The overall architecture of ControlNet with Stable Diffusion involves encoding the input conditions into feature maps and integrating them into the denoising step of the Stable Diffusion process. ControlNet can be trained using a loss function that includes the text condition and latent condition to improve output consistency with specified conditions. ControlNet is a flexible tool that allows for the use of different condition input types, such as Canny edge, line art, scribble, Hough line, semantic segmentation, depth, normal map, and open pose.

Bullet points

ControlNet is a neural network that adds conditional control to Stable Diffusion.
ControlNet allows for the inclusion of conditional inputs, such as edge maps, segmentation maps, and key points, into large diffusion models.
ControlNet's inner architecture involves creating copies of the original weights rather than training the original weights directly.
The overall architecture of ControlNet with Stable Diffusion involves encoding the input conditions into feature maps and integrating them into the denoising step of the Stable Diffusion process.
ControlNet can be trained using a loss function that includes the text condition and latent condition to improve output consistency with specified conditions.
ControlNet is a flexible tool that allows for the use of different condition input types.

Stable Diffusion — ControlNet Clearly Explained!

Generating images from line art, scribble, or pose key points using Stable Diffusion and ControlNet.

An image generated using Stable Diffusion with ControlNet

ControlNet is a neural network that controls a pretrained image Diffusion model (e.g. Stable Diffusion). Its function is to allow input of a conditioning image, which can then be used to manipulate the image generation.

├─ What Does ControlNet Do?
├─ Inner Architecture
│ ├─ Feedforward
│ ├─ Backpropagation
├─ Architecture with Stable Diffusion
│ ├─ Encoder
│ ├─ Overall Architecture
├─ Training
├─ Conditioning
│ ├─ Canny Edge
│ ├─ Line Art
│ ├─ Scribble
│ ├─ Hough Line
│ ├─ Semantic Segmentation
│ ├─ Depth
│ ├─ Normal Map
│ ├─ Open Pose
├─ Summary
├─ References

What Does ControlNet Do?

Combining ControlNet with Stable Diffusion lets the model take in a condition input that guides the image generation process, giving more direct control over the content and layout of the result.

It can accept scribbles, edge maps, pose key points, depth maps, segmentation maps, normal maps, and more as the condition input that guides the content of the generated image. Here are a few examples:

Source: https://github.com/lllyasviel/ControlNet

Inner Architecture

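All the parameters in the Stable Diffusion UNet are locked, and the blocks that ControlNet controls are cloned into a trainable copy on the ControlNet side (in the paper [1], these are the UNet's encoder blocks and its middle block). This copy is then trained with an external condition vector.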

The reason for creating copies of the original weights rather than training the original weights directly is to prevent overfitting when the dataset is small, and to maintain the high-quality performance of large models that have been trained on billions of images and are ready for deployment in production.

Feedforward

Notations:

x, y : Deep features in the neural network
c : An extra condition
“+” : Feature addition
Z( · ; · ) : Zero convolution operation (1×1 convolution layer with both weight and bias initialized to zero)
F( · ; · ) : A neural network block operation (e.g. “resnet” block, “conv-bn-relu” block, etc.)
Θ_z1 : The parameters of the first zero convolution layer
Θ_z2 : The parameters of the second zero convolution layer
Θ_c : The parameters of the trainable copy
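Put together, and following the formulation in the paper [1], a locked block y = F(x; Θ) with a ControlNet attached computes the output below. Note that Θ (the locked parameters) and y_c (the block output with control applied) are symbols taken from the paper, not from the list above.

```latex
\[
y_c = \mathcal{F}(x;\Theta)
    + \mathcal{Z}\bigl(\mathcal{F}\bigl(x + \mathcal{Z}(c;\Theta_{z1});\,\Theta_c\bigr);\,\Theta_{z2}\bigr)
\]
```

When both zero convolutions still hold their initial zero parameters, the second term vanishes and y_c = F(x; Θ) = y.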

During the first training step, since the weight and bias of a zero convolution layer are initialized as zeros, the feed-forward process is identical to the process without ControlNet.

After backpropagation, zero convolution layers in ControlNet become non-zero and affect the output.

In other words, before any optimization, applying a ControlNet to a neural network block has no influence on the deep neural features.
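The behaviour described above can be checked with a toy sketch. It assumes the block is wired as in the paper [1], that is, locked output plus Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2), and it replaces feature maps with single numbers. The `block` function and all parameter values are hypothetical stand-ins chosen for illustration, not the real UNet blocks.

```python
import copy

def block(x, params):
    """Hypothetical stand-in for F(x; Θ): a scalar 'linear + ReLU' block."""
    return max(0.0, params["w"] * x + params["b"])

def zero_conv(x, params):
    """Stand-in for Z(x; Θ_z): a 1×1 convolution reduced to a scalar affine map."""
    return params["w"] * x + params["b"]

locked = {"w": 1.5, "b": -0.2}        # Θ: frozen original weights (made-up values)
trainable = copy.deepcopy(locked)     # Θ_c: trainable copy starts as an exact clone
z1 = {"w": 0.0, "b": 0.0}             # Θ_z1: zero-initialized
z2 = {"w": 0.0, "b": 0.0}             # Θ_z2: zero-initialized

def controlnet_block(x, c):
    """Locked output plus the control branch, as in the wiring stated above."""
    y = block(x, locked)
    control = zero_conv(block(x + zero_conv(c, z1), trainable), z2)
    return y + control

x, c = 2.0, 0.7
y_plain = block(x, locked)            # output without ControlNet
y_ctrl = controlnet_block(x, c)       # output with an untrained ControlNet
print(y_plain == y_ctrl)              # the control branch contributes nothing yet
```

Once training moves `z1` and `z2` away from zero, the control branch starts to change the output, while `locked` itself is never modified.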

Backpropagation

The backpropagation updates the trainable copy and the zero convolution layers in the ControlNet, enabling the zero convolution weights to gradually transition to optimized values through the learning process.

Why will the gradient not be zero?

We might assume that the gradient would be zero if the weights of the convolution layers are zero. However, this is not true.

Consider the zero convolution layer y = wx + b, where w and b are the weight and bias respectively, and x is the input feature map. The gradients for each term are ∂y/∂w = x, ∂y/∂x = w, and ∂y/∂b = 1.

All become non-zero after 1 training step

In the beginning, when the weight value w = 0, the input feature x is typically non-zero. As a result, although the gradient on x becomes zero due to the zero convolution, the gradients of the weight and bias are not affected.

Nonetheless, after one gradient descent step, the weight value w will be updated to a non-zero value (since the partial derivative of y w.r.t. w is non-zero).
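As a rough numerical check of this argument, the sketch below runs one gradient descent step on a scalar version of y = wx + b. The squared-error loss, the learning rate, and the values of x and the target are assumptions made only for this illustration; they are not taken from the ControlNet training setup.

```python
def zero_layer_grads(w, x, upstream):
    """Gradients of a loss through y = w*x + b, given upstream = dL/dy."""
    grad_w = upstream * x      # dy/dw = x
    grad_b = upstream * 1.0    # dy/db = 1
    grad_x = upstream * w      # dy/dx = w
    return grad_w, grad_b, grad_x

def sgd_step(w, b, x, target, lr=0.1):
    """One plain gradient descent step on L = (y - target)^2 (assumed loss)."""
    y = w * x + b
    upstream = 2.0 * (y - target)
    grad_w, grad_b, grad_x = zero_layer_grads(w, x, upstream)
    return w - lr * grad_w, b - lr * grad_b, grad_x

x, target = 0.5, 1.0           # non-zero input feature, arbitrary target
w0, b0 = 0.0, 0.0              # zero initialization

w1, b1, grad_x_step0 = sgd_step(w0, b0, x, target)
w2, b2, grad_x_step1 = sgd_step(w1, b1, x, target)

print(grad_x_step0)            # gradient on x is zero while w is still zero
print(w1, b1)                  # weight and bias are non-zero after one step
print(grad_x_step1)            # gradient on x is non-zero from the second step on
```

The gradient on x only starts flowing once w has left zero, which happens after the very first update because the gradient on w depends on x and not on w.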

Architecture with Stable Diffusion

Encoder

Since the UNet of Stable Diffusion accepts a latent feature (64×64) instead of the original image, we have to also convert the image-based conditions to 64×64 feature space to match the convolution size.

We can use a network ε to encode the input conditions (c_i) into feature maps (c_f).
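To make the size matching concrete, here is a minimal sketch of the shape arithmetic only. It assumes a 512×512 condition image and a hypothetical stack of three stride-2 convolutions (kernel 3, padding 1); the actual layer configuration of the condition encoder is not specified in this article, so treat these numbers as placeholders.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_output_size(size, layers):
    """Apply each (kernel, stride, padding) layer in turn to a square input."""
    for kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
    return size

# Hypothetical encoder: three downsampling convolutions, each halving the size.
HYPOTHETICAL_LAYERS = [(3, 2, 1)] * 3

latent_size = encoder_output_size(512, HYPOTHETICAL_LAYERS)
print(latent_size)  # 512 -> 256 -> 128 -> 64
```

Any encoder whose total downsampling factor is 8 would bring a 512×512 condition to the 64×64 resolution that the UNet works in.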

In the diagram, we use z_t and z_t-1 as the input and output for the locked network block to match the notation in the Stable Diffusion context.

Overall Architecture

The following diagram shows the inputs and outputs of the ControlNet and the UNet in Stable Diffusion within one denoising step.

Furthermore, the diagram below illustrates how ControlNet and Stable Diffusion work together across the whole reverse diffusion process (sampling).

The flow of the entire reverse diffusion

The above diagram has been modified from my previous article on Stable Diffusion. If you have not yet read it, I suggest that you familiarize yourself with the Stable Diffusion architecture explained there first, as it is a comprehensive resource.

Stable Diffusion Clearly Explained!

How does Stable Diffusion paint an AI artwork? Understanding the tech behind the rise of AI-generated art.


Training

The ControlNet loss function is similar to that of Stable Diffusion, but it conditions the noise prediction on both the text condition (c_t) and the latent condition (c_f) to improve output consistency with the specified conditions.
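Written out as in the paper [1], the objective takes the form below. The symbols z_0 (the clean latent), t (the time step), z_t (the noised latent), ε (the sampled noise), and ε_θ (the noise-predicting network) come from the paper and the usual diffusion objective; they are not defined elsewhere in this article.

```latex
\[
\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)}
\Bigl[ \bigl\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \bigr\rVert_2^2 \Bigr]
\]
```

Dropping c_f from the arguments of ε_θ gives back the text-conditioned Stable Diffusion objective.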

As part of the training process, we randomly replace 50% of the text prompts (c_t) with empty strings. This helps ControlNet better understand the meaning of input condition maps such as Canny edge maps or human scribbles.

By removing the prompts, the encoder is forced to rely more on the information in the control maps, which improves its ability to understand their semantic content.
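The prompt replacement can be sketched in a few lines. This is only an illustration of the idea: `drop_prompts`, its parameters, and the example prompts are hypothetical names and values, and the real training code may implement the 50% replacement differently.

```python
import random

def drop_prompts(prompts, drop_prob=0.5, seed=None):
    """Replace each prompt with an empty string with probability drop_prob."""
    rng = random.Random(seed)  # local generator so the sketch is reproducible
    return ["" if rng.random() < drop_prob else p for p in prompts]

# Made-up prompts for illustration only.
prompts = [
    "a house by a lake",
    "a dancer on stage",
    "a city street at night",
    "a bowl of fruit",
]

batch = drop_prompts(prompts, drop_prob=0.5, seed=0)
for original, used in zip(prompts, batch):
    print(repr(original), "->", repr(used))
```

Each prompt is dropped independently, so the 50% rate holds on average over many batches rather than exactly within every batch.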

Conditioning

ControlNet is a flexible tool that lets you guide Stable Diffusion with different types of condition input. The following are some examples of the input types that can be used with ControlNet.

Canny Edge

Line Art

Scribble

Hough Line

Semantic Segmentation

Depth

Normal Map

Open Pose

Summary

ControlNet is a type of neural network that is used in conjunction with a pretrained diffusion model such as Stable Diffusion.

ControlNets allow for the inclusion of conditional inputs, such as edge maps, segmentation maps, and key points, into large diffusion models like Stable Diffusion.

Incorporating a ControlNet provides greater control over the image generation process, making it possible to generate images that match a specific, intended layout or structure.

References

[1] L. Zhang and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” arXiv.org, https://arxiv.org/abs/2302.05543.
