Jim Clyde Monge

Summary

Stability AI has released Stable Diffusion 3, an advanced text-to-image model with improved multi-subject prompt handling, image quality, and text rendering capabilities, positioning it as a strong competitor to OpenAI's Dall-E 3 and Google's Gemini.

Abstract

Stable Diffusion 3 (SD3) by Stability AI represents a significant leap in AI-generated imagery, showcasing enhanced performance in generating complex images with multiple subjects, superior image quality, and the ability to render text within images. This latest iteration incorporates a diffusion transformer architecture and flow matching techniques, setting new benchmarks in generative model performance, particularly on tasks involving class-conditional image generation. SD3's capabilities are demonstrated through its ability to accurately interpret and visualize detailed prompts, such as arranging objects and animals in a scene with specific color and spatial relationships. The model has been compared to OpenAI's Dall-E 3 and Google's Gemini, with some early assessments suggesting SD3 may outperform these competitors in certain tasks, such as rendering text within an image. Access to SD3 is currently limited, with interested parties invited to join a waitlist for early access. The announcement also emphasizes a strong focus on AI safety, which is seen as both a priority and a potential oversight in marketing, given the community's interest in the model's open-source nature and versatility for personal use.

Opinions

  • The author expresses excitement about SD3's text rendering support, which is comparable to that of OpenAI's Dall-E 3 and Google's Imagen 2 in Gemini.
  • There is a noted interest in the subtle details SD3 captures, such as the green tint on the white fur of animals in a generated image, speculating that the model may have learned such effects from real-world photography scenarios.
  • The author is surprised and critical of Dall-E 3's inability to render text with certain prompts, which contrasts with SD3's proficiency in this area.
  • The author is intrigued by the potential of SD3, having already signed up for early access to further explore its capabilities.
  • There is a concern that the emphasis on safety in the SD3 announcement may detract from the model's open-source and user-empowerment aspects, which are seen as key selling points.
  • The author reiterates that despite the focus on safety, the SD3 image model will remain open-source, with the preview aimed at improving quality and safety in line with previous Stable Diffusion releases.

Stable Diffusion 3 Is Here — It’s Packed With Huge Improvements

Image by Stability AI

The biggest week in the history of AI isn’t over yet. Just days after OpenAI announced Sora, which can generate jaw-dropping videos, and Google revealed Gemini 1.5, which supports a context window of up to 1 million tokens, Stability AI today showed an early preview of Stable Diffusion 3.

What is Stable Diffusion 3?

Stable Diffusion 3 is the latest and most capable text-to-image model from Stability AI. It boasts significant improvements in handling multi-subject prompts, image quality, and even text rendering abilities.

The suite of models currently ranges from 800M to 8B parameters. It combines a diffusion transformer architecture (similar to Sora) and flow matching.

Diffusion Transformer Architecture

The Diffusion Transformer (DiT) architecture is a class of diffusion models that uses a transformer as the denoising backbone. Unlike traditional diffusion models, which commonly use convolutional U-Net backbones, DiTs split the latent representation of an image into patches and process those patches as a sequence of tokens.

The Diffusion Transformer (DiT) architecture

This architecture proves particularly effective for class-conditional image generation tasks on large datasets like ImageNet, where DiTs have set new benchmarks for image quality and generative model performance.
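To make the "latent patches" idea a bit more concrete, here is a minimal sketch of the patch step in plain Python. It is only an illustration and not Stability AI's code: the function name `patchify`, the toy 4-channel 8x8 latent, and the patch size of 2 are all assumptions I picked for the example.

```python
def patchify(latent, patch_size):
    """Split a (channels, height, width) nested-list latent into patch tokens.

    Each token is the flattened contents of one patch_size x patch_size
    window across all channels. A transformer would then treat the list
    of tokens as its input sequence.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = [
                latent[c][y][x]
                for c in range(channels)
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            tokens.append(token)
    return tokens


# Toy latent (hypothetical sizes): 4 channels, 8x8 spatial grid.
latent = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)]
          for c in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))  # 16 16
```

With these toy sizes, the 8x8 grid becomes 16 tokens of 16 values each. A real DiT would additionally project each token to the model width and add position information before the transformer blocks.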

Flow Matching

Flow Matching (FM) is a simulation-free approach for training Continuous Normalizing Flows (CNFs) that makes it possible to train them at much larger scales than before. FM works by regressing the vector fields of fixed conditional probability paths. It is compatible with a general family of Gaussian probability paths between noise and data, which includes the paths used by existing diffusion models.

Sample paths from the same initial noise with models trained on ImageNet

Using Flow Matching with diffusion paths makes training diffusion models more robust. It also opens the door to training CNFs with non-diffusion probability paths, such as optimal transport (OT) paths, which allow faster training and sampling and better generalization.
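Here is a rough sketch of what "regressing a vector field" along an OT path can look like for a single training sample. It follows the straight-line conditional path from the Flow Matching paper as I understand it and is not SD3's actual training code: `SIGMA_MIN`, `ot_path_sample`, `flow_matching_loss`, and the `zero_model` stand-in are names and values I made up for illustration.

```python
import random

SIGMA_MIN = 1e-4  # assumed small noise level left at the data end of the path


def ot_path_sample(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target_velocity) on the straight path from noise x0 to data x1."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # The path is linear in t, so its velocity is constant along the path.
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, target = ot_path_sample(x0, x1, t)
    predicted = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(x1)


def zero_model(x_t, t):
    """Hypothetical stand-in for a neural network: always predicts zero velocity."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
data_point = [1.0, -2.0, 0.5]
loss = flow_matching_loss(zero_model, data_point, rng)
print(loss >= 0.0)  # True
```

The part that makes this "simulation-free" is that the loss only needs a random time, a noise sample, and a data sample. No ODE has to be solved during training, and a real setup would simply swap `zero_model` for a network and average the loss over batches.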

What’s new in Stable Diffusion 3?

Here are the key improvements SD3 brings:

  • Text rendering support
  • Improved performance
  • Multi-subject prompts
  • Better image quality

Perhaps the most exciting feature of this new image model is its ability to render text, much like OpenAI’s Dall-E 3 and Google’s Imagen 2 in Gemini. Stability AI CEO Emad Mostaque has been sharing images generated with SD3, and here are some of my favorites:

Prompt: “Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right is a dog, on the left is a cat”

Stable Diffusion 3 sample image

One thing I find interesting about this image is the subtle green tint on the white fur of the animals. I wonder if the model learned this effect from behind-the-scenes photos of green-screen film sets.

Prompt: “cinematic photo of a red apple on a table in a classroom, on the blackboard are the words “go big or go home” written in chalk”

Stable Diffusion 3 sample image

Stable Diffusion 3 vs Dall-E 3 vs Gemini

I did a quick comparison of the images generated by SD3 and OpenAI’s Dall-E 3. In the example below, I used a prompt from SD3’s announcement blog post.

Prompt: “Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says “Stable Diffusion 3” made out of colorful energy”

Image by Jim Clyde Monge

Did Stable Diffusion 3 just beat Dall-E 3? Honestly, I’m surprised that Dall-E 3 repeatedly refused to render text with this prompt. Go try it yourself.

Out of curiosity, I also fed the prompt into Gemini Advanced, and here’s the result:

Image by Jim Clyde Monge

How do I get access to SD3?

Right now, Stable Diffusion 3 is not available to the general public. You can, however, sign up for the waitlist to get invited to the Discord server.

Stable Diffusion 3 waitlist

Final Thoughts

Overall, I’m very excited to see more examples of Stable Diffusion 3. I’ve already signed up to get early access to the preview model.

One thing I am concerned about, though, is that half of the announcement post was talking about AI safety. The obsession with safety in this announcement feels like a missed marketing opportunity, considering the recent Gemini debacle.

Isn’t the main appeal of Stable Diffusion that you can install it on your own computer and make what you want to make?

Anyway, open-source models can be fine-tuned by the community if needed. Just to make it clear, the SD3 image model will still be open-source. The preview is meant to improve its quality and safety, just like the previous Stable Diffusion releases.

This story is published on Generative AI.
