Summary

Google has introduced a new AI-powered virtual try-on feature for Google Shopping that uses a diffusion model to enable realistic clothing visualization with detailed textures and flexible body poses.

Abstract

Google's latest AI model, TryOnDiffusion, represents a significant advancement in virtual try-on technology. It allows users to realistically visualize clothing on themselves with high fidelity and adaptability to various poses. This technology improves upon traditional methods that struggled with preserving clothing details and accommodating different body shapes and poses. The AI model uses a diffusion-based approach, which involves adding and then gradually removing noise to an image, resulting in a seamless blend of the person's image with the clothing item. The process leverages Google's Shopping Graph, the largest dataset of product information, and includes an upscaling step to enhance image quality to 1024x1024 resolution using the Imagen Super-Resolution Diffusion Model.

Opinions

The new virtual try-on feature is seen as a breakthrough, overcoming the limitations of previous models that could not balance detail preservation with pose and shape flexibility.
The integration of two UNet models in TryOnDiffusion is considered a successful solution to the challenges faced by earlier virtual try-on technologies.
The use of a diffusion model is praised for its elegant approach to image generation, which simplifies neural network parameterization and makes training more efficient.
The reliance on Google's Shopping Graph underscores the importance of a comprehensive and real-time dataset for enabling advanced shopping features.
The comparison with Walmart's virtual try-on feature suggests that Google's technology is part of a broader trend towards more accessible and scalable virtual fitting solutions in e-commerce.

Google’s Latest AI Model Enables Virtual Try On clothes with Unchanged Details and Flexible Poses

credit: https://tryondiffusion.github.io/

Google unveiled a new virtual try-on feature for Google shopping.

One easiest example is to search for womens top in Google search and scroll down to look for a 4-pack layout unit.

Traditionally, virtual try-on technology struggled to accurately represent clothing details and accommodate various poses. However, Google’s innovative AI model has overcome these obstacles, enabling users to virtually try on clothes with greater fidelity and flexibility.

Paper link: https://tryondiffusion.github.io/

In the past, the critical challenge of such models lied in striking a balance between preserving clothing details, enabling garment deformations, and accommodating different body poses and shapes to ensure a seamless and natural appearance. Significantly, when it has a pattern or details like pockets or rare sleeves.

Groundbreaking AI Model

The previous work in this area decomposes the try-on task into two steps: the warping model and the blending model. They either preserved clothing details while failing to handle pose and shape variations or allowed for pose changes but compromised on clothing details.

However, with the integration of two UNet models, TryOnDiffusion has successfully addressed these limitations. The new model performs implicit warping and blending in a single pass. This unified approach allows the AI model to retain clothing details within a single network while incorporating significant pose and body changes.

Let’s see what it looks like on the phone:

credit: https://blog.google/products/shopping/ai-virtual-try-on-google-shopping/

AI-based Virtual Try-On

They have employed a new diffusion-based AI model called TryOnDiffusion, which gradually adds additional pixels (or “noise”) to the image until it becomes unrecognizable, and then progressively eliminates the noise to reconstruct the original image with impeccable quality.

Overview of the diffusion model

A probabilistic diffusion model, referred to as a “diffusion model” for convenience, which showcases promising capabilities in generating high-quality samples.

The key concept behind the diffusion model lies in its ability to reverse a diffusion process. This process involves gradually adding noise to the data in the opposite direction of sampling until the signal is no longer distinguishable. By learning the transitions of this chain, the diffusion model becomes adept at generating samples that closely resemble the original data.

Notably, when the diffusion process involves small amounts of Gaussian noise, the sampling chain transitions can be simplified to conditional Gaussians. This elegant approach allows for a streamlined neural network parameterization, making diffusion models straightforward to define and efficient to train.

Training steps

In the preprocessing step, there are 4 images:

Person image by removing the original clothing but retaining the person's identity.
Person poses map.
Garment pose map.
The target garment is segmented out of the garment image.

Next step in the first Parallel UNet diffusion, the person image and garment image are handled respectively. The person-UNet takes a person image and the garment-UNet takes a segmented garment image. The person and garment poses are used to guide the process.

In the second Parallel UNet diffusion, the inputs are the input from preprocessing step and the output from the first Parallel UNet diffusion.

The last step is a purely super-resolution step to upscale the try-on image from 256x256 to 1024x1024 for better image quality.

Upscale image quality

Imagen is the Super-Resolution Diffusion Model used in the last step for better image quality. See research here.

Imagen was introduced with powerful capabilities in visualizing text input by generating high-resolution images. Leveraging the diffusion model, Imagen combines a frozen text encoder, conditional diffusion models, and text-conditional super-resolution diffusion models to create stunning visual representations.

The process begins with the input text being encoded into text embeddings using a frozen text encoder. These embeddings serve as a bridge between the textual information and the subsequent image generation process.

Imagen incorporates text-conditional super-resolution diffusion models. These models apply a step-wise approach to upsample the generated image, first from 64x64 to 256x256 and then from 256x256 to an impressive 1024x1024 resolution.

Here, the upscale image quality process only uses part of the Imagen process, which is from 256x256 to 1024x1024 resolution.

Dataset: Shopping Graph

The virtual try-on feature is leveraging the power of Google Shopping Graph, which stands as the most extensive collection of product and seller data globally.

Shopping Graph — an ML-powered, real-time data set comprising the world’s products and sellers. This revolutionary tool serves as a comprehensive repository, storing billions of global product listings and detailed information about each item.

Final Thoughts — How to try on clothes virtually?

In the past, virtual try-on experiences often required extensive 3D scanning processes for each product, limiting their scalability and accessibility. The previous virtual try-on normally means a VR, AR-based technology for shopping. The use case is more like an AR fitting room, using photorealistic virtual clothing, accessories, makeup, and footwear that instantly displays styles on the phone.

The only similar virtual try-on feature I found is Walmart virtual try-on(launched in Sept 2022), which was called Zeekit and acquired by Walmart in 2021. With Walmart's virtual try-on feature, users have the option to either upload their own photo or use a model’s photo to try on different clothes virtually.

With the rise of generative AI technology, the scale-up process for virtual applications has become significantly faster and more efficient. A single-person image and one clothing image are all that are needed for the virtual try-on feature.