Demystifying DreamBooth: A New Tool for Personalizing Text-To-Image Generation
Exploring the technology that turns boring images into creative masterpieces

Introduction
Imagine the joy of effortlessly generating a new image of your beloved puppy against the backdrop of the Acropolis in Athens. Not satisfied yet, you would like to see how Van Gogh would have painted your best friend or what he would look like if he had been conceived by a lion 😱! Thanks to DreamBooth, all of this is a reality, and today it is possible to make any animal, object or ourselves travel in fantasy from a handful of images.
While many of us have already seen on social media the mind-blowing results that can be achieved with this technology, and there is no shortage of tutorials for trying it on our own photographs, rarely has anyone tried to answer the question: yes, but how the hell does it work?
In this article, I will do my best to break down the scientific paper from which everything started, DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation by Ruiz et al. But don’t worry, I’ll simplify the complex parts and add explanations wherever some prior knowledge is required. Now, fair warning, this is an advanced topic, so I assume you’ve got the basics of deep learning and related stuff covered. But hey, if you want to dig deeper into diffusion models or other cool topics, I’ll drop some references along the way. Let’s dive in!
Related Work

Before we get into the nitty-gritty of DreamBooth’s approach, let’s take a closer look at the related work and tasks associated with this technique.
Image Composition
Amidst the chaos of everyday life, it’s been far too long since your beloved backpack embarked on a globetrotting journey. It is time to infuse it with an exciting dose of adventure while you are planning your next vacation. Enter image composition: merge your subject seamlessly into new backgrounds and let your backpack travel from the Grand Canyon to Boston in seconds.
If simply copy-pasting the subject doesn’t fulfill your desire for new perspectives, one possibility is to explore the application of 3D reconstruction techniques. However, it’s important to note that these techniques are primarily designed for rigid objects and often require a substantial number of starting views.
DreamBooth introduces a remarkable capability to generate fresh poses within new contexts while smoothly incorporating crucial elements such as lighting, shadows, and other scene-relative aspects. Achieving such consistency has proven challenging with prior methodologies. The paper calls this task recontextualization.
Text-to-Image Editing and Synthesis
Image editing based on textual input is a secret dream cherished by many avid users of photo editing software. Early methodologies, such as those employing GANs, demonstrated impressive results, but only in well-structured scenarios like editing human faces.
Even new approaches that take advantage of diffusion models have limitations and are usually restricted to global editing. Only recently have advances such as Text2LIVE emerged that allow localized editing. However, none of these techniques allow the generation of a given subject in new contexts.
Although text-image synthesis models like Imagen, DALL·E 2, and Stable Diffusion have made significant strides, the attainment of fine-grained control and the preservation of subject identity in synthesized images continue to pose substantial challenges.
Controllable Generative Models
To avoid subject modification, many approaches rely on a user-provided mask that limits the area to be modified. Inversion techniques, such as the one used by DALL·E 2, present an effective solution for preserving the subject while modifying the context.
Prompt-to-Prompt enables both local and global editing without the need for an input mask.
However, these methods do not adequately preserve the identity while generating novel samples of a subject.
While some GAN-based methods focus on generating instance variations, they often have limitations. For instance, they are primarily designed for the face domain, require many instances of the input subject, struggle with unique subjects, and fail to preserve important subject details.
Finally, Gal et al. recently presented Textual Inversion, a method that shares several features with DreamBooth but which, as we will see, is limited by the expressiveness of the frozen diffusion model on which it is based.

Since this is the work with which the authors compare DreamBooth, it is worth providing a brief description of it.
Textual Inversion starts from a pre-trained diffusion model, such as Latent Diffusion, and defines a new placeholder string S* to represent the new concept to be learned. Then, with the diffusion model kept frozen, only the embedding of this new placeholder is optimized from just 3–5 images, a number similar to what DreamBooth uses. If this brief description is not clear enough, wait until you read the more detailed description of DreamBooth, which has many points in common with this work.
Method

Before describing the components of DreamBooth in detail, let’s see schematically how this technology works:
- Choose 3–5 images of your favorite subject. It can be an animal, an object or even an abstract concept such as an art style.
- Associate this concept with a rare word that maps to a unique token, which will represent the subject from now on. In the paper, the authors call this word [V].
- Fine-tune the model using images of the subject of interest with a simple prompt such as “A [V] [class noun]”, for example “A [V] dog” if the input images are photographs of your dog.
- Since we are fine-tuning all the parameters of the model, there is a risk that all dogs (or whatever class our subject belongs to) will end up looking like our input images. To avoid this degradation, we generate images with the frozen model from a prompt such as “A dog” (or “A [class noun]”) and add a loss term that penalizes the model being fine-tuned when its output for this prompt deviates from those generated by the frozen model.
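The last two steps can be sketched as a single training objective: one term for reconstructing the subject and one weighted term for staying close to what the frozen model produced for the plain class prompt. The snippet below is only a toy illustration on lists of floats. The `model` callable, the fixed noise level and the `prior_weight` parameter are assumptions made for the example, not the authors' implementation.

```python
import random

def mse(a, b):
    """Mean squared error between two equally sized lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def corrupt(image, noise, signal_scale=0.8, noise_scale=0.6):
    """Mix an image with noise at one fixed, illustrative noise level."""
    return [signal_scale * x + noise_scale * n for x, n in zip(image, noise)]

def dreambooth_loss(model, subject_image, class_image, prior_weight=1.0, rng=random):
    """Subject reconstruction loss plus a weighted prior-preservation loss.

    model(noisy_image, prompt) is a hypothetical denoiser that returns a
    predicted clean image. class_image stands for a sample that the frozen
    model generated beforehand from the prompt "a dog".
    """
    terms = (
        (subject_image, "a [V] dog", 1.0),       # learn the subject
        (class_image, "a dog", prior_weight),    # do not forget the class
    )
    total = 0.0
    for image, prompt, weight in terms:
        noise = [rng.gauss(0.0, 1.0) for _ in image]
        prediction = model(corrupt(image, noise), prompt)
        total += weight * mse(prediction, image)
    return total
```

Setting `prior_weight` to zero leaves only the subject term, which is exactly the situation in which every dog risks turning into our dog.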
Okay, now that we have a high-level idea of the procedure, let’s go into more detail about the various components.
Text to Image Diffusion Models
Do you really want to learn how diffusion models work and, in particular, latent diffusion models such as Stable Diffusion? Read my previous article below; I will be waiting for you here when you are done!
OK, maybe you don’t want a whole explanation, in which case I will give here the intuition behind diffusion models, which is very simple.

- Take an image x0 and add an amount of noise (e.g. Gaussian noise) that grows with a timestep t. If t is zero, no noise is added; the larger t is, the more noise is added, until you arrive at an image that is just noise.
- Train a model, such as a U-Net, to predict the noise-free image (or the noise that has been added) by giving as input to the model the timestep t and the corrupted image.
- At this point, having trained a model that can remove noise from an image, we can sample an image composed only of noise and gradually remove it (doing it all at once works poorly) either by predicting the image without noise or by predicting the noise and subtracting it from the image.
- The first three points describe an unconditional diffusion model. To produce an output conditioned on a textual prompt, the text is encoded using models like CLIP, or language models such as BERT, T5, and others. The resulting text encoding is then fed as input to the model alongside the corrupted image and the timestep t.
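To make the first steps concrete, here is a minimal sketch of the noising process on a tiny "image" represented as a list of floats. It assumes the common DDPM formulation with a linear noise schedule. The schedule values and function names are my own choices for illustration and are not taken from the DreamBooth paper. It also shows why predicting the noise and predicting the clean image are two views of the same thing: given one, you can compute the other.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal variance left after t noising steps."""
    product = 1.0
    for step in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (step - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product

def add_noise(x0, t, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def recover_x0(xt, eps_pred, t):
    """Turn a noise prediction into a clean-image prediction."""
    a = alpha_bar(t)
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps_pred)]

if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [0.5, -1.0, 2.0]
    for t in (0, 250, 1000):
        xt, _ = add_noise(x0, t, rng)
        print(t, round(alpha_bar(t), 5), [round(v, 3) for v in xt])
```

At t = 0 the image is untouched, while at the last timestep almost none of the original signal is left. A trained network would supply `eps_pred`; here you can pass the true noise to check that `recover_x0` returns the original image.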
In the paper, the authors use two diffusion models: Google’s Imagen (DreamBooth also comes from Google Research) and Stable Diffusion from Stability AI, the main open-source text-to-image model.
Imagen employs a multi-resolution strategy to enhance the quality of generated images. Initially, a diffusion model is trained using low-resolution 64x64 images. The output of the low-resolution model is then upscaled by two additional diffusion models that operate at higher resolutions, 256x256 and 1024x1024. The first model specializes in capturing macro-details, while the subsequent models refine the output by leveraging the conditioning effect of the lower resolution model’s generated image. This iterative refinement facilitates the generation of high-resolution images with improved quality and fidelity.
Stable Diffusion, as a latent diffusion model, instead follows a three-step approach that makes training and generating high-resolution images more efficient. First, a Variational Autoencoder (VAE) is trained to compress a high-resolution image. From this point onward, the process closely resembles that of standard diffusion models, with one key distinction: instead of the original image, the diffusion model takes as input the latent representation produced by the VAE encoder. Finally, the output of the reverse diffusion process is restored to the original resolution by the VAE decoder. For a more comprehensive understanding of this entire procedure, I go into greater detail in the aforementioned article.
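A bit of arithmetic shows why working in latent space is cheaper. The numbers below are assumptions that correspond to the first public Stable Diffusion releases (512x512 RGB images, a VAE that downsamples each side by 8 and produces 4 latent channels); they are not stated in the DreamBooth paper, and the helper names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent the diffusion model sees."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds compared to the image."""
    channels, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (channels * h * w)

if __name__ == "__main__":
    print(latent_shape(512, 512))
    print(compression_ratio(512, 512))
```

Under these assumptions, every denoising step operates on a 4x64x64 tensor instead of a 3x512x512 image, that is, on 48 times fewer values.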
Personalization of Text to Image Models
DreamBooth aims to place the subject instance (e.g. your dog) within the output domain of the model, enabling the model to generate fresh images of the subject upon query. An advantage of diffusion models, as opposed to GANs, is their ability to effectively incorporate new information into their domain while retaining knowledge of previous data and avoiding overfitting to a limited training image set.
Designing Prompts for Few-Shot Personalization
As mentioned above, the model is trained using simple prompts structured as “a [identifier] [class noun]”. Here, [identifier] is a unique identifier associated with the subject, and [class noun] is a coarse description of the subject’s category (such as cat, dog, watch, etc.). The authors include the class noun in the prompt to tie the general class to our individual subject, observing that using an incorrect class noun, or none at all, leads to longer training times and language drift, ultimately hurting performance. Essentially, the aim is to exploit the relationship between the class and our subject, reusing the knowledge the model has already acquired about the class. This enables us to generate fresh poses and variations of the subject across various contexts.
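In code, the prompt scheme amounts to little more than string formatting. The helper names below are hypothetical and only meant to show how the training prompt, the class prompt used for regularization, and a later generation prompt relate to each other.

```python
def instance_prompt(identifier, class_noun):
    """Prompt paired with the 3-5 subject images during fine-tuning."""
    return f"a {identifier} {class_noun}"

def class_prompt(class_noun):
    """Prompt used with the frozen model to generate generic class images."""
    return f"a {class_noun}"

def generation_prompt(identifier, class_noun, context):
    """Prompt used after fine-tuning to place the subject in a new context."""
    return f"{instance_prompt(identifier, class_noun)} {context}"

if __name__ == "__main__":
    print(instance_prompt("[V]", "dog"))
    print(class_prompt("dog"))
    print(generation_prompt("[V]", "dog", "in front of the Acropolis"))
```

Because the identifier always appears next to the class noun, everything the model already knows about dogs remains available when we ask for our dog in a new scene.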
Rare-Token Identifiers
The paper highlights that common English words are not ideal in this context since the model needs to disassociate them from their original meaning and reintegrate them to refer to our subject.
To address this, the authors propose using an identifier that has a weak prior in both the language and diffusion models. While selecting random characters like “xxy5syt00” may initially appear appealing, it is risky: the tokenizer could tokenize each letter individually, and single letters have strong priors of their own. So, what is the solution? The most effective approach is to look up rare tokens in the vocabulary and then invert these tokens back into text space. This minimizes the likelihood of the identifier having a strong prior.
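The difference between the two strategies is easy to see with a toy tokenizer. Everything in the sketch below is hypothetical: the tiny vocabulary, its frequency ordering and the greedy longest-match rule stand in for a real tokenizer and are only meant to illustrate the idea of picking a rare token first and decoding it to text afterwards.

```python
import random

# Hypothetical toy vocabulary, ordered from most to least frequent.
VOCAB = ["a", "dog", "the", "cat", "photo", "of", "zwx", "qhv", "olis"]

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer; unknown characters become their own token."""
    tokens = []
    i = 0
    while i < len(text):
        matches = (v for v in vocab if text.startswith(v, i))
        piece = max(matches, key=len, default=text[i])
        tokens.append(piece)
        i += len(piece)
    return tokens

def rare_identifier(vocab=VOCAB, rare_start=6, max_chars=3, rng=random):
    """Sample a short token from the rare end of the vocabulary and decode it."""
    candidates = [i for i in range(rare_start, len(vocab)) if len(vocab[i]) <= max_chars]
    token_id = rng.choice(candidates)
    return vocab[token_id]  # "inverting" the token id back into text

if __name__ == "__main__":
    print(tokenize("xxy5syt00"))
    identifier = rare_identifier(rng=random.Random(0))
    print(identifier, tokenize(identifier))
```

The random string falls apart into one token per character, while the identifier obtained from a rare token is guaranteed to map back to that single token.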
Funnily enough, most tutorials use “sks” for this purpose but, as pointed out by one of the authors, this seemingly harmless word can have side effects…











