Ng Wai Foong

How to Fine-tune Stable Diffusion using Dreambooth

Personalized generated images with custom styles or objects

Image by the author

Previously, I wrote an article on fine-tuning Stable Diffusion using textual inversion. This tutorial focuses on how to fine-tune Stable Diffusion using another method called Dreambooth. Unlike textual inversion, which trains just the embedding without modifying the base model, Dreambooth fine-tunes the whole text-to-image model so that it learns to bind a unique identifier to a specific concept (object or style). As a result, the generated images are more personalized to the object or style than those produced with textual inversion.

This tutorial is based on a forked version of the Dreambooth implementation by HuggingFace. The original implementation requires about 16GB to 24GB of VRAM to fine-tune the model. The maintainer of the fork, ShivamShrirao, optimized the code to reduce VRAM usage to under 16GB. Depending on your needs and settings, you can fine-tune the model on a GPU with 10GB to 16GB of VRAM. I have personally tested the training to be feasible on a Tesla T4 GPU.

Please note that none of the existing implementations are by the original authors of Dreambooth. As a result, there might be slight differences in terms of reproducibility.

Let’s proceed to the next section to set up all the necessary modules.

Setup

It is recommended to create a new virtual environment before you continue with the installation.

Python packages

In your working directory, create a new file called requirements.txt with the following code:

accelerate==0.12.0
torchvision
transformers>=4.21.0
ftfy
tensorboard
modelcards

Activate your virtual environment and run the following commands one by one to install all the necessary modules:

pip install git+https://github.com/ShivamShrirao/diffusers.git
pip install -r requirements.txt

NOTE: You need to install diffusers using the URL above instead of installing it directly from PyPI.

bitsandbytes package

There is an optional package called bitsandbytes, which can reduce the VRAM usage further. However, it only supports CUDA versions 10.2–11.7 and your machine must fulfill the following requirements:

  • LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
  • 8-bit optimizers and quantization: NVIDIA Maxwell GPU or newer (>=GTX 9XX).

You can install it as follows:

pip install bitsandbytes

xformers package

If your GPU has less than 24GB of VRAM, you need to install the xformers package to reduce the VRAM usage further. At the time of this writing, installing xformers is not straightforward due to the lack of wheel support from the developers.

You can install the package as follows:

pip install xformers

If you encounter an error with the command above, run the following command to build the package directly from the repository:

pip install git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

If you run into issues with your CUDA version, make sure you install the latest version of CUDA that is compatible with your machine. Follow the instructions from the following link.

Accelerate setup

The next step is to initialize an Accelerate environment. Run the following command:

accelerate config

There will be multiple prompts in the terminal. Answer them based on your own use case. Have a look at the following example as a reference:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: fp16

HuggingFace’s models

You can skip this section if you already have the diffusers model for Stable Diffusion (v1.4/v1.5).

Fine-tuning requires the diffusers model format rather than a ckpt file. You can use the following script to convert your ckpt file to a diffusers model.

Register a new account in HuggingFace and accept the model license before downloading or using the weights.

Once you are done with it, refer to this section of the documentation to enable the access token.

Run the following command and pass in your token for authentication:

huggingface-cli login

It will download the weights to the cache folder during the initial run.

Datasets

You need to collect a high-quality dataset to get consistent and good results. The training images should match the expected output and be resized to 512 x 512 in resolution.

Please note that artifacts such as motion blur or low resolution will affect the generated images. The same applies to any unwanted text, watermarks or icons in your training dataset. Pay close attention to the images that you use for training.
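As a rough illustration of the resizing step, the sketch below center-crops an image to a square and resizes it to 512 x 512. It assumes Pillow is installed (the article does not name a tool for this), and the helper names center_crop_box and prepare_image are hypothetical.

```python
from pathlib import Path


def center_crop_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


def prepare_image(src_path, dst_path, size=512):
    """Center-crop one image to a square and save it at size x size pixels."""
    # Pillow (an assumption, not part of the article's requirements.txt) is
    # imported lazily so the pure helper above works without it.
    from PIL import Image

    with Image.open(src_path) as img:
        img = img.convert("RGB")  # drop alpha channels and palettes
        img = img.crop(center_crop_box(*img.size))
        img = img.resize((size, size), Image.LANCZOS)
        Path(dst_path).parent.mkdir(parents=True, exist_ok=True)
        img.save(dst_path)


# Example with hypothetical file names:
#   prepare_image("raw/bear-01.jpg", "instance-images/bear-01.png")
```

A center crop is only a starting point. If the subject is off-center, crop by hand first so the object stays in frame.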

Depending on your use cases, you can use the following guidelines:

Object

Use images of your object with a normal background. A transparent background may leave a fringe or border around the object. All training images should focus on just the object, with variations in:

  • camera angle
  • pose
  • props (clothing, haircut, etc.)
  • background (taken at different locations)

The number of training images should be around 5 to 20. You may need to crop the images to focus on just the object.

Style

Use images of the style that you like. They can be from your own art collections or from public movies, animations or TV shows that share a consistent style. All training images should focus on the style and not a particular object.

To generalize well, make sure that the same object does not appear more than once in the training images (one of each character). If your goal is to generate different characters in the same style, include only training images with characters. Otherwise, include landscapes, objects and other relevant images in your training dataset.

Training images

In this tutorial, I will be using the following training images:

Image by the author

This tutorial uses the following terms for the training datasets.

  • Instance images — Custom images that represent the specific concept for Dreambooth training. You should collect high-quality images based on your use case.
  • Class images — Regularization images for prior-preservation loss to prevent overfitting. You should generate these images directly from the base pre-trained model. You can choose to generate them on your own or generate them on the fly when running the training script.

Training

Head over to the following GitHub repository and download the train_dreambooth.py file to your working directory.

Training commands

The following is a list of the common parameters that should be modified based on your use cases:

  • pretrained_model_name_or_path — Path to pretrained model or model identifier from huggingface.co/models
  • pretrained_vae_name_or_path — Path to pretrained vae or vae identifier from huggingface.co/models. You can fine-tune the model with or without vae
  • instance_data_dir — A folder containing the training data of instance images
  • class_data_dir — A folder containing the training data of class images
  • instance_prompt — The prompt with identifier specifying the instance
  • class_prompt — The prompt to specify images in the same class as provided instance images
  • num_class_images — Minimum number of class images for prior-preservation loss
  • output_dir — The output directory where the model predictions and checkpoints will be written
  • max_train_steps — Total number of training steps to perform. It is recommended to set it to N * 100, where N represents the number of instance images.
  • learning_rate — Initial learning rate (after the potential warmup period) to use
  • lr_scheduler — The scheduler type to use. Choose between [linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup]
  • lr_warmup_steps — Number of steps for the warmup in the lr scheduler. Use max_train_steps / 10 when using polynomial or 0 when using constant.
  • save_interval — Save weights every N steps. Make sure you have sufficient storage space for it. Each weight is about 4GB.
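The rules of thumb in the list above can be sketched as a small helper. This is only an illustration: the function names suggest_schedule and checkpoint_storage_gb are hypothetical, and the storage figure assumes one checkpoint of about 4GB per full save interval.

```python
def suggest_schedule(num_instance_images, lr_scheduler="constant"):
    """Apply the rules of thumb from the parameter list above."""
    if num_instance_images < 1:
        raise ValueError("need at least one instance image")

    # max_train_steps: N * 100, where N is the number of instance images
    max_train_steps = num_instance_images * 100

    if lr_scheduler == "polynomial":
        lr_warmup_steps = max_train_steps // 10
    elif lr_scheduler == "constant":
        lr_warmup_steps = 0
    else:
        # The list gives no warmup guidance for the other schedulers.
        raise ValueError(f"no rule of thumb for lr_scheduler={lr_scheduler!r}")

    return {
        "max_train_steps": max_train_steps,
        "lr_scheduler": lr_scheduler,
        "lr_warmup_steps": lr_warmup_steps,
    }


def checkpoint_storage_gb(max_train_steps, save_interval, gb_per_checkpoint=4):
    """Rough disk estimate, assuming one checkpoint per full save interval."""
    return (max_train_steps // save_interval) * gb_per_checkpoint


settings = suggest_schedule(num_instance_images=20, lr_scheduler="polynomial")
print(settings)
print(checkpoint_storage_gb(settings["max_train_steps"], save_interval=500), "GB")
```

Treat the returned values as starting points and adjust them after inspecting the generated images.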

You can set custom vae using the following flag:

--pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse"

Training with prior-preservation loss helps to prevent overfitting. Enable it as follows:

--with_prior_preservation --prior_loss_weight=1.0

Moreover, you can fine-tune the text_encoder along with unet. However, this will increase the VRAM usage drastically. Set it with the following flag:

--train_text_encoder

For object training, you can use the following examples as a reference for instance_prompt and class_prompt. Feel free to experiment with different strings based on your use case.

# Woman
--instance_prompt="photo of zwx woman" \
--class_prompt="photo of a woman" \

# Black man
--instance_prompt="photo of zwx black man" \
--class_prompt="photo of a black man" \

# Dog
--instance_prompt="photo of zwx dog" \
--class_prompt="photo of a dog" \

You can use your own custom string as the unique identifier. In the early implementations, most of the examples use sks as the unique identifier. However, sks is a known token for a semi-automatic rifle. It is highly recommended to use a different unique identifier that does not already appear as a token in the original Stable Diffusion’s datasets.

Also, the unique identifier is not limited to a single string. Some users reported good performance with the following prompts for style training:

# Style 1
--instance_prompt="modern disney style" \
--class_prompt="artwork style" \

# Style 2
--instance_prompt="classic animation style" \
--class_prompt="illustration style" \

As for learning rate and scheduler, kindly refer to the following blog by HuggingFace.

Training examples

Have a look at the following example training command for the training images above (tested on Tesla T4 with 16GB memory):

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
  --instance_data_dir="./instance-images/" \
  --class_data_dir="./class-images/" \
  --output_dir="./output-models/" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="photo of zwx bear toy" \
  --class_prompt="photo of bear toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --train_text_encoder \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=300 \
  --max_train_steps=2000 \
  --save_interval=500

Modify the data directories according to your working directory.

If you have insufficient memory, turn off text_encoder training by removing the --train_text_encoder flag (the example below also omits the custom VAE):

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./instance-images/" \
  --class_data_dir="./class-images/" \
  --output_dir="./output-models/" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="photo of zwx bear toy" \
  --class_prompt="photo of bear toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=300 \
  --max_train_steps=2000 \
  --save_interval=500
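If you prefer to launch training from Python, the command above can be assembled as an argument list, which makes it easy to toggle text_encoder training on and off. This is a sketch only: build_train_command is a hypothetical helper, and it uses only the flags and values already shown in this tutorial.

```python
import shlex


def build_train_command(train_text_encoder=False):
    """Assemble the low-memory training command shown above as an argument list."""
    args = [
        "accelerate", "launch", "train_dreambooth.py",
        "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
        "--instance_data_dir=./instance-images/",
        "--class_data_dir=./class-images/",
        "--output_dir=./output-models/",
        "--with_prior_preservation", "--prior_loss_weight=1.0",
        "--instance_prompt=photo of zwx bear toy",
        "--class_prompt=photo of bear toy",
        "--resolution=512",
        "--train_batch_size=1",
        "--mixed_precision=fp16",
        "--use_8bit_adam",
        "--gradient_accumulation_steps=1",
        "--gradient_checkpointing",
        "--learning_rate=1e-6",
        "--lr_scheduler=constant",
        "--lr_warmup_steps=0",
        "--num_class_images=300",
        "--max_train_steps=2000",
        "--save_interval=500",
    ]
    if train_text_encoder:
        # Increases VRAM usage drastically, so enable only if memory allows.
        args.append("--train_text_encoder")
    return args


command = build_train_command(train_text_encoder=False)
print(shlex.join(command))  # prints a copy-pasteable shell command
# To launch from Python instead: subprocess.run(command, check=True)
```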

Make sure not to over-train your model, as the Dreambooth method tends to overfit quickly. If your model does not generalize well to your prompts or produces artifacts, it most likely means that you have over-trained it. Reduce the number of training steps, or train for more steps with a lower learning rate.

Moreover, the latest training script accepts a new parameter called concepts_list. It represents the path to a JSON file containing a list of dictionaries. It will override parameters like instance_prompt, class_prompt, etc. You can use this to train multiple concepts into a single model at the same time. For example, given the following concepts_list.json file:

[
    {
        "instance_prompt":      "photo of zwx dog",
        "class_prompt":         "photo of a dog",
        "instance_data_dir":    "./instance-images/",
        "class_data_dir":       "./class-images/"
    }
]

You can use the following parameter in the training command:

--concepts_list ./concepts_list.json

Simply append a new dictionary to the list for training another concept at the same time.
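For several concepts, it can be less error-prone to generate the JSON file from Python. The sketch below is one possible approach, not part of the training script. The helpers make_concept and write_concepts_list are hypothetical, as are the second identifier (ukj) and the per-concept folder names.

```python
import json
from pathlib import Path


def make_concept(identifier, class_name, instance_dir, class_dir):
    """Build one entry in the same shape as the concepts list example above."""
    return {
        "instance_prompt": f"photo of {identifier} {class_name}",
        "class_prompt": f"photo of a {class_name}",
        "instance_data_dir": instance_dir,
        "class_data_dir": class_dir,
    }


def write_concepts_list(concepts, path="concepts_list.json"):
    """Write the list of concept dictionaries to a JSON file and return its path."""
    Path(path).write_text(json.dumps(concepts, indent=4), encoding="utf-8")
    return path


concepts = [
    make_concept("zwx", "dog", "./instance-images/dog/", "./class-images/dog/"),
    # "ukj" is an arbitrary placeholder identifier for a second concept.
    make_concept("ukj", "woman", "./instance-images/woman/", "./class-images/woman/"),
]
print(json.dumps(concepts, indent=4))
# Call write_concepts_list(concepts) to save the file for --concepts_list.
```

Giving each concept its own instance and class folder keeps the regularization images of one class from being mixed with another.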

When you run it for the first time, it will generate the class images. You can re-use the same class images for subsequent training runs as long as you are referring to the same concept. Simply set class_data_dir to the same directory as in your previous training. Hence, you can re-use most of the generated class images when training for a particular style.

Here are some class images generated for this tutorial:

Image by the author

Training output

The script will save a new weight on each interval based on the value of save_interval. You should have the following files and folders inside each of the newly generated weight folders:

|- feature_extractor
|  |- preprocessor_config.json
|- scheduler
|  |- scheduler_config.json
|- text_encoder
|  |- config.json
|  |- pytorch_model.bin
|- tokenizer
|  |- merges.txt
|  |- special_tokens_map.json
|  |- tokenizer_config.json
|  |- vocab.json
|- unet
|  |- config.json
|  |- diffusion_pytorch_model.bin
|- vae
|  |- config.json
|  |- diffusion_pytorch_model.bin
|- args.json
|- model_index.json
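Before running inference, you may want to confirm that a saved weight folder is complete. The sketch below checks a folder against the layout listed above. The helper name missing_files is hypothetical, and the file list is copied from this tutorial, so it may differ for other versions of the training script.

```python
from pathlib import Path

# Relative paths taken from the folder layout listed above.
EXPECTED_FILES = [
    "feature_extractor/preprocessor_config.json",
    "scheduler/scheduler_config.json",
    "text_encoder/config.json",
    "text_encoder/pytorch_model.bin",
    "tokenizer/merges.txt",
    "tokenizer/special_tokens_map.json",
    "tokenizer/tokenizer_config.json",
    "tokenizer/vocab.json",
    "unet/config.json",
    "unet/diffusion_pytorch_model.bin",
    "vae/config.json",
    "vae/diffusion_pytorch_model.bin",
    "args.json",
    "model_index.json",
]


def missing_files(checkpoint_dir):
    """Return the expected files that are absent from a saved weight folder."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example: missing_files("./output-models/1500/") returns an empty list
# when the folder contains every file in the layout above.
```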

Inference

Now, create a new Python file called inference.py in your working directory. Add the following code to it:

from diffusers import StableDiffusionPipeline, DDIMScheduler
import torch

device = "cuda"
# use DDIM scheduler, you can modify it to use other scheduler
scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=True)

# modify the model path
pipe = StableDiffusionPipeline.from_pretrained(
    f"./output-models/1500/",
    scheduler=scheduler,
    safety_checker=None,
    torch_dtype=torch.float16,
).to(device)

# enable xformers memory attention
pipe.enable_xformers_memory_efficient_attention()

prompt = "photo of zwx bear toy"
negative_prompt = ""
num_samples = 4
guidance_scale = 7.5
num_inference_steps = 50
height = 512
width = 512

images = pipe(
    prompt,
    height=height,
    width=width,
    negative_prompt=negative_prompt,
    num_images_per_prompt=num_samples,
    num_inference_steps=num_inference_steps,
    guidance_scale=guidance_scale
).images

for count, image in enumerate(images, start=1):
    # save image to local directory
    image.save(f"img-{count}.png")

Some users reported that using xformers for image generation produces non-deterministic results. This means that you can’t reproduce the same image with the same settings (seed, image size, etc.). Experiment with it yourself and modify the code according to your use case.
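One way to test this on your own setup is to generate the same prompt twice with the same seed and compare the pixels. The sketch below assumes the pipeline call accepts a generator argument, as StableDiffusionPipeline does in the upstream diffusers library, and that the returned images are PIL images. The helper is_reproducible and its make_generator argument are hypothetical names.

```python
def is_reproducible(pipe, prompt, make_generator, seed=1234, **pipe_kwargs):
    """Run the same prompt twice with the same seed and compare the raw pixels."""
    runs = []
    for _ in range(2):
        # A fresh generator per run, so both runs start from the same random state.
        generator = make_generator(seed)
        images = pipe(prompt, generator=generator, **pipe_kwargs).images
        runs.append([image.tobytes() for image in images])
    return runs[0] == runs[1]


# Usage with the pipeline from inference.py (requires torch and a GPU):
#
#   import torch
#   make_generator = lambda seed: torch.Generator(device="cuda").manual_seed(seed)
#   print(is_reproducible(pipe, "photo of zwx bear toy", make_generator,
#                         num_inference_steps=50, guidance_scale=7.5))
```

If the check returns False with xformers enabled, remove the enable_xformers_memory_efficient_attention() call from inference.py and run it again to see whether the outputs become identical.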

Once inference.py is ready, run the following command to generate images using the newly fine-tuned model:

python inference.py

Here are some of the example outputs (prompts are not included as I made a mistake of not keeping track of them initially):

Image by the author

Conclusion

Dreambooth is a good technique to fine-tune the Stable Diffusion model with a particular concept (object or style).

With the advancement of research and development in AI, it is now possible for the average Joe to fine-tune their own custom models. However, it may be a double-edged sword, with both opportunities and challenges for the art industry.

If you work in the art industry, it is recommended to embrace this technology and use it to streamline your workflow. Also, it would be great if everyone used this technology wisely for the betterment of mankind.

Unfortunately, I do not have sufficient memory at the moment to fine-tune the whole model using image and caption pairs.

Thanks for reading this piece. Have a great day ahead!

References

  1. HuggingFace Diffusers GitHub
  2. ShivamShrirao’s Dreambooth GitHub