Summary

Stable Diffusion 2.0 by StabilityAI introduces improvements and new features while addressing legal concerns, though it faces criticism for performance in certain areas and a departure from the previous model's capabilities.

Abstract

The release of Stable Diffusion 2.0 by StabilityAI represents a significant update that brings state-of-the-art features, with a focus on enhancing image quality and resolution, and addresses long-standing legal issues such as child pornography and deep fakes. Trained on a larger, NSFW-filtered LAION-5B dataset, the new model aims to generate better images in areas like realistic photography and 3D renders, but has received mixed reactions due to its tendency to produce less sexually attractive nudes, perform poorly on human anatomy, and introduce biases towards monochrome images. The model incorporates a new text encoder, OpenCLIP-ViT/H, which is expected to provide better understanding of prompts and reduce domain biases, but requires adjustments to existing prompts, embeddings, and dreambooth models. Stable Diffusion 2.0 also introduces a depth-to-image model for enhanced image-to-image generation capabilities, though its initial integration faced usability challenges. StabilityAI encourages the community to further refine the model with custom datasets and distributed training experiments.

Opinions

Positive opinions highlight the new model's ability to generate better images in non-human photography, improved lighting in photo-realistic scenes, and higher resolution outputs.
Criticism includes the model's lack of sexual attractiveness in nudes, poor performance in generating human anatomy and celebrity faces, and a bias towards monochrome images.
There is concern over the model's tendency to generate images with watermarks due to the absence of the estimated watermark probability filter in the LAION-5B dataset.
The new depth-guided model is praised for its potential to enhance image-to-image generation, possibly enabling the creation of layered images and text to 3D models.
Users have noted the increased effort required to craft prompts for the new model and the incompatibility of existing embeddings and dreambooth models, necessitating the use of the corresponding text encoder.
Some community members are optimistic about the future potential for customization and improvement of the text encoder and overall model through distributed training efforts.

Stable Diffusion 2: The Good, The Bad and The Ugly

One step forward, one step back

On 24 Nov 2022, StabilityAI announced the public release of Stable Diffusion 2.0, a big update to the previous version with breaking changes. At the time of this writing, it has received mixed reactions from the community. The new architecture was praised for its state-of-the-art features, but StabilityAI was also criticized for the direction it has taken.

StabilityAI released the following checkpoints for version 2:

512-base-ema.ckpt — Version 2 checkpoint to generate 512x512 images.
768-v-ema.ckpt — Checkpoint based on 512-base-ema.ckpt. It was further fine-tuned using a v-objective on the same datasets. Able to generate 768x768 images natively.
512-depth-ema.ckpt — Checkpoint based on 512-base-ema.ckpt with an extra input channel to process the (relative) depth prediction produced by MiDaS. It is a depth-guided diffusion model. Good for image-to-image generation.
512-inpainting-ema.ckpt — Version 2 checkpoint of the inpainting model to inpaint images in 512x512 resolution.
x4-upscaling-ema.ckpt — Super resolution upscaler diffusion model to generate images with resolution of 2048x2048 or higher.
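
The article does not define the v-objective mentioned for 768-v-ema.ckpt. Assuming it refers to the v-prediction parameterization from the progressive-distillation literature (an assumption, not something stated here), the network predicts v = alpha * eps - sigma * x0 instead of the noise eps, where x_t = alpha * x0 + sigma * eps and alpha^2 + sigma^2 = 1. A minimal scalar sketch, with all function and variable names chosen purely for illustration:

```python
import math


def noisy_sample(x0, eps, alpha, sigma):
    """Forward diffusion step: mix the clean value with noise."""
    return alpha * x0 + sigma * eps


def v_target(x0, eps, alpha, sigma):
    """Training target under the assumed v-parameterization."""
    return alpha * eps - sigma * x0


def recover_x0(x_t, v, alpha, sigma):
    """Invert the parameterization; relies on alpha^2 + sigma^2 = 1."""
    return alpha * x_t - sigma * v


def recover_eps(x_t, v, alpha, sigma):
    """The noise can be recovered from the same two quantities."""
    return sigma * x_t + alpha * v


# Pick alpha and sigma on the unit circle so that alpha^2 + sigma^2 = 1.
angle = 0.3
alpha, sigma = math.cos(angle), math.sin(angle)

x0, eps = 0.8, -1.2
x_t = noisy_sample(x0, eps, alpha, sigma)
v = v_target(x0, eps, alpha, sigma)

x0_hat = recover_x0(x_t, v, alpha, sigma)
eps_hat = recover_eps(x_t, v, alpha, sigma)
```

The point of the sketch is only that a v prediction carries the same information as a noise prediction: given x_t and v, both the clean value and the noise can be reconstructed.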

This article covers some of the facts of Stable Diffusion 2.0 and my personal opinions.

Training Data

Generally, Stable Diffusion 1 is trained on LAION-2B (en), subsets of laion-high-resolution and laion-improved-aesthetics.

laion-improved-aesthetics is a subset of laion2B-en, filtered to images with an original size >= 512x512, estimated aesthetics score > 5.0, and an estimated watermark probability < 0.5.

On the other hand, Stable Diffusion 2 is based on a subset of LAION-5B:

laion2B-en: 2.32 billion images with English text
laion2B-multi: 2.26 billion images with text in 100+ other languages
laion1B-nolang: 1.27 billion images whose text language could not be clearly detected

The dataset is then filtered for explicit pornographic material using the LAION-NSFW classifier with punsafe=0.1, and for an aesthetic score >= 4.5.

In other words, Stable Diffusion 2 used a larger, NSFW-filtered dataset for training. Also, the second phase of training is based on images with a resolution greater than or equal to 512x512.
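
The two filtering recipes described above can be contrasted in a small sketch. The record fields (`aesthetic`, `p_watermark`, `p_unsafe`) and both function names are hypothetical and do not mirror the real LAION metadata schema. It is also an assumption that punsafe=0.1 acts as an upper bound on the predicted unsafe probability, and that "original size >= 512x512" means both sides are at least 512 pixels.

```python
from dataclasses import dataclass


@dataclass
class ImageMeta:
    """Hypothetical per-image metadata; field names are illustrative."""
    width: int
    height: int
    aesthetic: float    # estimated aesthetics score
    p_watermark: float  # estimated watermark probability
    p_unsafe: float     # estimated NSFW probability


def passes_v1_aesthetics_filter(m: ImageMeta) -> bool:
    """laion-improved-aesthetics criteria as described for version 1."""
    return (
        min(m.width, m.height) >= 512
        and m.aesthetic > 5.0
        and m.p_watermark < 0.5
    )


def passes_v2_filter(m: ImageMeta, punsafe: float = 0.1,
                     min_aesthetic: float = 4.5) -> bool:
    """Version 2 criteria as described: NSFW and aesthetics only."""
    return m.p_unsafe <= punsafe and m.aesthetic >= min_aesthetic


# A safe, good-looking stock photo that very likely carries a watermark.
stock_photo = ImageMeta(width=1024, height=768, aesthetic=5.6,
                        p_watermark=0.9, p_unsafe=0.01)

kept_by_v1 = passes_v1_aesthetics_filter(stock_photo)
kept_by_v2 = passes_v2_filter(stock_photo)
```

Under these assumptions the watermarked image is rejected by the version 1 recipe but kept by the version 2 recipe, because the latter has no watermark term.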

The Good

With the new release, StabilityAI addresses the long-standing legal problems related to child pornography and deep fakes. Also, the new model generates better images in certain areas. Based on tests conducted by the community, it seems to do well for realistic photography (non-human), lighting in photo-realistic scenes, 3D renders and designs.

Here is an example of birds generated using Stable Diffusion 2.0.

The Bad

Based on user feedback, you can still generate nudes but the generated images tend to lack any sexual attractiveness. Also, the new model performs poorly for images related to human anatomy and faces of celebrities.

Moreover, certain prompts tend to generate mostly black-and-white images (a bias towards monochrome), and you have to add the keyword “colourful” to the prompt to generate colored images.

The Ugly

Similar to version 1, hands and text-in-image generation are still a big problem. Also, it is unclear why Stable Diffusion 2 did not use the estimated watermark probability to filter the LAION-5B dataset. In my own experience, the new model has a higher tendency to generate images with watermarks than the old model.

Furthermore, LAION favors quantity over quality. You can easily inspect the image and text label pairs via the following clip retrieval page. You will realize that most of the images are poorly labelled, which has big implications for the performance of the model.

Text encoder

Stable Diffusion 2 uses OpenCLIP-ViT/H as its text encoder, while the older architecture uses OpenAI’s ViT-L/14. ViT/H is trained on LAION-2B and reaches an accuracy of 78.0 (zero-shot top-1 on ImageNet, as reported by OpenCLIP). It is one of the best open-source weights provided by OpenCLIP.

Although the weights for ViT-L/14 are open-source, OpenAI did not release the training data. As a result, you have little control over what is being learned by the model. StabilityAI solves this by using the open-source OpenCLIP-ViT/H weights, which are trained on the same datasets they used for the latent diffusion model.

The Good

The new model will now understand your prompt better. Going forward, StabilityAI and the community can improve the text encoder by training their own OpenCLIP model. As a result, you can guide the model to generate the desired images more easily.

In addition, there seem to be fewer biases in certain domains. For example, version 2 is now capable of generating rooms with non-white ceilings.

The Bad

All existing prompts for version 1 will not work the same for version 2. The input prompt needs to be a little more descriptive. Using the prompt “cat” is different from the prompt “a photo of a cat”.
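
As a rough illustration of what "a little more descriptive" can mean in practice, the sketch below wraps a bare subject in a short template and appends optional keywords. The helper name `describe_prompt` and its parameters are hypothetical; this is not part of any Stable Diffusion tooling, and the template itself is only one plausible choice.

```python
def describe_prompt(subject, medium="a photo of", extras=()):
    """Turn a bare subject into a more descriptive prompt string.

    `medium` is prepended to the subject and `extras` are appended as
    comma-separated keywords.
    """
    parts = [f"{medium} {subject}".strip()]
    parts.extend(extras)
    return ", ".join(parts)


basic = describe_prompt("a cat")
colourful = describe_prompt("a cat", extras=("colourful",))
```

Here `basic` is "a photo of a cat" and `colourful` is "a photo of a cat, colourful"; whether a given template actually helps still has to be checked by generating images.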

Besides that, all existing embeddings and DreamBooth models will not work out-of-the-box. You need to use the corresponding text encoder in your pipeline, depending on the base pre-trained model.

Not only that, the new text encoder will not recognize some of the “famous artists” in your prompt. It will only recognize artists that are present in the LAION-5B dataset.

Note that no artists were deliberately removed from the training dataset.
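
One concrete reason for the incompatibility is that the two text encoders emit token embeddings of different widths. The sketch below checks a learned embedding vector against the width expected by each version. The widths (768 for OpenAI ViT-L/14, 1024 for OpenCLIP ViT-H) come from the published model configurations rather than from this article, and the dictionary and function names are hypothetical.

```python
# Width of one token embedding produced by each version's text encoder.
# Values are taken from the public model configs, not from this article.
TEXT_EMBED_DIM = {"v1": 768, "v2": 1024}


def compatible_versions(embedding):
    """Return the model versions whose text encoder matches this width."""
    dim = len(embedding)
    return [ver for ver, width in TEXT_EMBED_DIM.items() if width == dim]


def check_embedding(embedding, target_version):
    """Raise a clear error before a mismatched embedding reaches a pipeline."""
    expected = TEXT_EMBED_DIM[target_version]
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has width {len(embedding)}, "
            f"but {target_version} expects {expected}"
        )


# An embedding trained against version 1 (one vector, for simplicity).
old_embedding = [0.0] * 768
matches = compatible_versions(old_embedding)

try:
    check_embedding(old_embedding, "v2")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

A matching width is necessary but not sufficient: even an embedding of the right size only makes sense with the text encoder it was trained against.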

The Ugly

Since Stable Diffusion is trained on subsets of LAION-5B, there is a high chance that OpenCLIP will train a new text encoder using LAION-5B in the future. Given that the text encoder is a crucial component of the entire Stable Diffusion architecture, most of the existing work related to prompts will be invalidated when the text encoder changes.

Also, some of the existing implementations will have to provide backward compatibility support for both the old and new versions.

Depth-to-Image Model

The new depth-guided model is one of the most promising features released by StabilityAI. It relies on MiDaS to infer the depth of an input image, and then uses both the text and the depth information to generate new images.

Here is an example of how it works:

Check out the following Space on Hugging Face to learn more about generating depth information from an image.
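
To make the "extra input channel" idea concrete, here is a minimal pure-Python sketch of how a relative depth map could be rescaled and stacked next to four latent channels to form a five-channel input. The [-1, 1] rescaling, the function names, and the tiny 2x2 shapes are assumptions made for illustration; a real pipeline also resizes the depth map to the latent resolution, which this sketch skips.

```python
def normalize_depth(depth):
    """Rescale a relative depth map (list of rows) to the range [-1, 1]."""
    flat = [value for row in depth for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A flat depth map carries no relative information.
        return [[0.0 for _ in row] for row in depth]
    return [[2.0 * (value - lo) / span - 1.0 for value in row] for row in depth]


def add_depth_channel(latents, depth):
    """Append the normalized depth map as one more channel.

    `latents` is a list of channels, each a list of rows with the same
    height and width as `depth`.
    """
    height, width = len(depth), len(depth[0])
    for channel in latents:
        if len(channel) != height or any(len(row) != width for row in channel):
            raise ValueError("latent channels and depth map must share a shape")
    return latents + [normalize_depth(depth)]


# Four dummy latent channels and a MiDaS-style relative depth map.
latents = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]
depth = [[1.0, 3.0], [5.0, 9.0]]

unet_input = add_depth_channel(latents, depth)
```

Only relative depth matters in this formulation, so the absolute scale of the depth prediction is discarded by the normalization.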

The Good

The new depth-guided model offers better results for image-to-image generation. This opens up new possibilities for features such as generating images with layers or even text-to-3D models.

The Bad and The Ugly

At the time of this writing, you can only use the depth-guided model with the stablediffusion repository. This makes the model difficult to test, as most users experienced difficulties in setting up and running the provided scripts. Version 0.9.0 of diffusers only supports the other four models.

From version 0.10.0 onward, you can now run the model with diffusers:

stable-diffusion-2-depth

import torch
import requests
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

# Load the depth-guided checkpoint in half precision and move it to the GPU.
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

# Download the input image whose depth will guide the generation.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)

prompt = "two tigers"
negative_prompt = "bad, deformed, ugly, bad anatomy"
# strength sets how far the result may drift from the input image.
image = pipe(
    prompt=prompt,
    image=init_image,
    negative_prompt=negative_prompt,
    strength=0.7,
).images[0]

Conclusion

StabilityAI states that the new model only serves as a base model for further improvement by the community. In the future, they will make regular releases so that anyone can further improve upon it by fine-tuning it with their own datasets. They will also offer methods for public distributed training (an experimental test that is not guaranteed to work).

All in all, the release of Stable Diffusion 2 marks a step forward in research and in dealing with legal issues. However, it is also a step back for those adopters in the community who prefer unrestricted creation or expected a general improvement over the previous version.

Thanks for reading this piece. Have a great day ahead!
