avatarNg Wai Foong

Summary

The provided content is a comprehensive tutorial on utilizing ControlNet with HuggingFace's diffusers package for enhanced control in text-to-image generation using Stable Diffusion models.

Abstract

The article introduces ControlNet, an innovative neural network structure designed to augment Stable Diffusion models with additional conditioning inputs such as edge maps, segmentation maps, and pose keypoints. This allows for more precise control over the generated images, aligning them closer to the desired input specifications. The tutorial guides readers through the setup and installation of necessary packages, including diffusers, accelerate, opencv-python, and controlnet-aux. It also discusses the use of xformers for performance optimization. Practical implementation is demonstrated with examples using Canny edge detection and OpenPose bone images to condition the image generation process. The article concludes with a discussion on the potential of multi-ControlNet configurations, which is an emerging feature in active development, and provides references to the ControlNet GitHub repository and HuggingFace documentation for further exploration.

Opinions

  • The author emphasizes the significance of ControlNet in providing fine-grained control over text-to-image generation, marking a substantial improvement over traditional image-to-image generation methods.
  • ControlNet's ability to be trained on small datasets and augmented with pre-trained Stable Diffusion models is highlighted as a key advantage, making it accessible for various applications.
  • The tutorial suggests that the opencv-contrib-python package is recommended, but also notes that inference can be performed with any of the available OpenCV packages.
  • The author expresses that the use of xformers can significantly boost inference speed, indicating its importance for users looking to optimize performance.
  • The article conveys the opinion that the integration of multiple ControlNets will further enhance the capabilities of diffusion models, allowing for even more nuanced control during image synthesis.

Introduction to ControlNet for Stable Diffusion

Better control for text-to-image generation

Image by the author. Generated examples taken from HuggingFace.

This tutorial covers a step-by-step guide on text-to-image generation with ControlNet conditioning using the HuggingFace’s diffusers package.

ControlNet is a neural network structure to control diffusion models by adding extra conditions. It provides a way to augment Stable Diffusion with conditional inputs such as scribbles, edge maps, segmentation maps, pose key points, etc during text-to-image generation. As a result, the generated image will be a lot closer to the input image, which is a big improvement over traditional methods such as image-to-image generation.

In addition, a ControlNet model can be trained with small datasets on consumer GPU. Then, the model can be augmented with any pre-trained Stable Diffusion models for text-to-image generation.

The initial release of ControNet came with the following checkpoints:

Let’s proceed to the next section for the setup and installation.

Setup

It is highly recommended to create a new virtual environment before the package installation.

diffusers

Activate the virtual environment and run the following command to install the stable version of diffusers module:

pip install diffusers

ControlNet requires diffusers>=0.14.0

For the latest version of diffusers package, install it as follows:

pip install git+https://github.com/huggingface/diffusers

accelerate

You can install the accelerate module as follows:

pip install accelerate

This tutorial contains a few code snippets that relies on accelerate>=0.17.0, which has yet to be released on PyPi at the time of this writing. Install the latest version as follows:

pip install git+https://github.com/huggingface/accelerate

opencv-python

Note that the pre-conditioning processor and dependencies are different depending on the ControlNet used for image generation. For simplicity, this tutorial will cover the canny edge processor, which is dependent on the opencv-python package.

opencv-python comes with 4 different packages. The official documentation recommends the opencv-contrib-python package but inference can be done using any of the following packages:

  • opencv-python — main package
  • opencv-contrib-python — full package (comes with contrib/extra modules)
  • opencv-python-headless — main package without GUI
  • opencv-contrib-python-headless — full package without GUI

Install it via the following command (replace the package name based on your preferences):

pip install opencv-contrib-python

controlnet-aux

On the other hand, the OpenPose processor requires the controlnet-aux package. Run the following command to install it:

pip install controlnet-aux

xformers (optional)

The xformers package provides significant boost to inference speed. The latest version comes with pip wheels support for PyTorch 1.13.1.

Pip install (win/linux)

For those with torch==1.13.1, simply run the following command to install xformers:

pip install -U xformers

Conda (linux)

For conda users, the installation supports either torch==1.12.1 or torch==1.13.1

conda install xformers

Building from source

For the other use cases, consider building xformers directly from source:

# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)

Implementation

Let’s explore how to utilize the canny edge ControlNet for image generation. It requires a canny edge image as input.

Canny

Create a new file called canny_inference.py and add the following import statements:

import cv2
import numpy as np
from PIL import Image

Then, continue by adding the following code snippets to create a canny edge image from an existing image

import cv2
import numpy as np
from PIL import Image

image = Image.open('input.png')
image = np.array(image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
canny_image.save('canny.png')

Save the file and run the following command to convert an image to canny edge image:

python canny_inference.py

Have a look at the following example:

Image by the author

The next step is to perform inference using the canny image as conditional input. Modify the import statement as follows:

import cv2
+import torch
import numpy as np
from PIL import Image
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, DPMSolverMultistepScheduler

Update the code by initializing the ControlNet and Stable Diffusion pipelines:

...

canny_image = Image.fromarray(image)
# canny_image.save('canny.png')

# for deterministic generation
generator = torch.Generator(device='cuda').manual_seed(12345)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
)
# change the scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# enable xformers (optional), requires xformers installation
pipe.enable_xformers_memory_efficient_attention()
# cpu offload for memory saving, requires accelerate>=0.17.0
pipe.enable_model_cpu_offload()

Run inference and save the generated image:

...

# cpu offload for memory saving, requires accelerate>=0.17.0
pipe.enable_model_cpu_offload()

image = pipe(
    "a beautiful lady, celebrity, red dress, dslr, colour photo, realistic, high quality",
    negative_prompt="cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, blurry, bad anatomy, bad proportions",
    num_inference_steps=20,
    generator=generator,
    image=canny_image,
    controlnet_conditioning_scale=0.5
).images[0]
image.save('output.png')

The StableDiffusionControlNetPipeline accepts the following parameter:

  • controlnet_conditioning_scale — The outputs of the controlnet are multiplied by controlnet_conditioning_scale before they are added to the residual in the original unet. Defaults to 1.0 and accepts any value between 0.0–1.0.

Run the script and you should get the following output:

Image by the author

Let’s rerun the script again with a different input image and settings:

...

image = pipe(
-   "a beautiful lady, celebrity, red dress, dslr, colour photo, realistic, high quality",
+   "a beautiful lady wearing blue yoga pants working out on beach, realistic, high quality",
    negative_prompt="cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, blurry, bad anatomy, bad proportions",
    num_inference_steps=20,
    generator=generator,
    image=canny_image,
-   controlnet_conditioning_scale=0.5
+   controlnet_conditioning_scale=1.0
).images[0]
image.save('tmp/output.png')

The output is as follows:

Image by the author

OpenPose

Let’s try using OpenPose bone image as conditional input instead. Have a look at the following image as reference on how it should look like:

Image by the author

The controlnet-aux module provides support to convert an image to OpenPose bone image. Create a new Python file called pose_inference.py and add the following import:

import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, DPMSolverMultistepScheduler

Continue by adding the following code snippets:

...

image = Image.open('input.png')
openpose = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')
pose_image = openpose(image)
pose_image.save('pose.png')

Save the file and run the following command to convert an image to a OpenPose bone image:

python pose_inference.py

Have a look at the following example for reference:

Image by the author

Complete the script by appending the following lines of code:

...

# for deterministic generation
generator = torch.Generator(device='cuda').manual_seed(12345)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
)
# change the scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# enable xformers (optional), requires xformers installation
pipe.enable_xformers_memory_efficient_attention()
# cpu offload for memory saving, requires accelerate>=0.17.0
pipe.enable_model_cpu_offload()

# cpu offload for memory saving, requires accelerate>=0.17.0
pipe.enable_model_cpu_offload()

image = pipe(
    "a beautiful hollywood actress wearing black dress attending award winning event, red carpet stairs at background",
    negative_prompt="cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, blurry, bad anatomy, bad proportions",
    num_inference_steps=20,
    generator=generator,
    image=pose_image,
    controlnet_conditioning_scale=1.0
).images[0]
image.save('output.png')

Run the script and the output is as follows:

Image by the author

ControlNet is an extremely powerful neural network structure to control diffusion models by adding extra conditions.

At the time of this writing, the support for Multi-ControlNet is still in active development by the open-source community.

The new feature provides a way to use multiple ControlNets and adding the outputs together for image generation, allowing better control on the whole image. Simply pass in

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet_canny = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", 
                                                   torch_dtype=torch.float16).to("cuda")
controlnet_pose = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", 
                                                   torch_dtype=torch.float16).to("cuda")

pipe = StableDiffusionControlNetPipeline.from_pretrained(
 "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16,
 controlnet=[
  controlnet_pose, 
  controlnet_canny
 ],
).to("cuda")

image = pipe(prompt='...',
             image=[pose_image, canny_image],
        ).images[0]
image.save("output.png")

When using multiple ControlNet(s), you can control the scale factor by passing in a list of float as input argument to controlnet_conditioning_scale as follows:

controlnet_conditioning_scale=[1.0, 0.5]

Conclusion

Let’s recap the learning points for today.

This article started off with a brief introduction on ControlNet and a list of supported models.

Then, it moved on to setup and installation step via pip install.

Subsequently, it covered on using opencv-python to get a canny edge image. The output is then used as conditional input for text-to-image generation.

Besides that, this tutorial also explained on using OpenPose bone image as conditional input instead. The controlnet-aux module comes in handy to convert an image to OpenPose image.

Thanks for reading this piece. Have a great day ahead!

References

  1. Github — ControlNet
  2. HuggingFace Documentation— ControlNet
Stable Diffusion
Machine Learning
Artificial Intelligence
Text To Image Generation
Controlnet
Recommended from ReadMedium