Consistency Models: The Next Step in Generative AI (Diffusion Models)
Open AI bring new option for Diffusion models
Open AI have achieved significant breakthroughs in image, audio, and video generation using diffusion models. However, these models rely on an iterative generation process that results in slow sampling speeds, limiting their potential for real-time applications. This performance however comes at a price, as diffusion models’ iterative sampling process that progressively removes noise to generate high-quality outputs typically requires 10–2000 times more compute than conventional single-step generative models such as generative adversarial networks (GANs). To address this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training.
https://arxiv.org/pdf/2303.01469v1.pdf
Consistency models are designed to support fast one-step generation and can still allow for few-step sampling to trade compute for sample quality. They also offer zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training for these tasks.
These models can be trained either as a way to distill pre-trained diffusion models or as standalone generative models. Extensive experiments demonstrate that consistency models outperform existing distillation techniques for diffusion models in one- and few-step generation. For instance, we achieve the new state-of-the-art FID of 3.55 on CIFAR10 and 6.20 on ImageNet 64 x 64 for one-step generation.

When trained as standalone generative models, consistency models outperform single-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64 x 64, and LSUN 256 x 256.
Introduction
Diffusion models are generative models that have achieved remarkable success in various fields, including image generation, audio synthesis, and video generation. These models do not rely on adversarial training and have fewer constraints on model architectures compared to other generative models. However, their iterative sampling process requires a high amount of compute, limiting their real-time applications.
To address this issue, researchers propose to create consistency models that facilitate efficient, single-step generation without sacrificing the benefits of iterative refinement. Consistency models are based on the probability flow (PF) ordinary differential equation (ODE) in continuous-time diffusion models, which smoothly transition the data distribution into a tractable noise distribution. The proposed model maps any point at any time step to the trajectory’s starting point, allowing for self-consistency.
Consistency models enable the generation of data samples with only one network evaluation, and by chaining their outputs at multiple time steps, the sample quality can be improved. The researchers offer two methods for training consistency models based on enforcing the self-consistency property. The first method distills a diffusion model into a consistency model by minimizing the difference between model outputs for pairs of adjacent points on a PF ODE trajectory. The second method eliminates the need for a pre-trained diffusion model and allows training consistency models in isolation.
Empirically, consistency models outperform progressive distillation across a variety of datasets and number of sampling steps. They achieve new state-of-the-art FIDs on CIFAR-10 and ImageNet 64ˆ64 with one and two network evaluations, respectively. Consistency models achieve comparable performance to progressive distillation for single-step generation, despite having no access to pre-trained diffusion models. Additionally, consistency models can be used to perform zero-shot data editing tasks such as image denoising, interpolation, inpainting, colorization, super-resolution, and stroke-guided image editing.
Diffusion models
Diffusion models use Gaussian perturbations to generate data by sequentially denoising noise. They start by diffusing the data distribution with a stochastic differential equation (SDE), which generates the Probability Flow (PF) ODE whose solution trajectories are distributed according to the data distribution. Diffusion models are also known as score-based generative models, and a score model is trained via score matching to obtain an empirical estimate of the PF ODE. Numerical ODE solvers are used to solve the empirical PF ODE backwards in time and obtain an approximate sample from the data distribution. Diffusion models have slow sampling speed due to the computational cost of evaluating the score model. Existing methods for faster sampling include faster numerical ODE solvers and distillation techniques, but they still require many evaluation steps. Progressive distillation is the only distillation approach that does not require a large dataset of samples from the diffusion model prior to distillation.
What is Conistency model?
The paper describes “consistency models,” a new type of generative model that supports single-step generation at the core of its design while still allowing iterative generation for zero-shot data editing and trade-offs between sample quality and compute.
Consistency models are a new type of generative model that support single-step generation at the core of their design, while still allowing iterative generation for zero-shot data editing and trade-offs between sample quality and compute. The main idea is to learn a consistency function that maps points on any trajectory of the Probability Flow Ordinary Differential Equation (PF ODE) to the trajectory’s origin. This function has the property of self-consistency, which means that its outputs are consistent for arbitrary pairs of points that belong to the same PF ODE trajectory.
To implement this idea, consistency models are trained to enforce the self-consistency property of the consistency function. This can be done in either the distillation mode or the isolation mode. In the former case, consistency models distill the knowledge of pre-trained diffusion models into a single-step sampler, significantly improving other distillation approaches in sample quality, while allowing zero-shot image editing applications. In the latter case, consistency models are trained in isolation, with no dependence on pre-trained diffusion models, making them an independent new class of generative models.
To generate samples using a well-trained consistency model, one can sample from the initial distribution and then evaluate the consistency model for the sample. This involves only one forward pass through the consistency model and generates samples in a single step. One can also evaluate the consistency model multiple times by alternating denoising and noise injection steps for improved sample quality. This multistep sampling procedure provides the flexibility to trade compute for sample quality and has important applications in zero-shot data editing.
Consistency models enable various data editing and manipulation applications in zero shot. For example, they can easily interpolate between samples by traversing the latent space. They can also perform denoising for various noise levels and solve certain inverse problems in zero shot by using an iterative replacement procedure similar to that of diffusion models. This enables many applications in the context of image editing, including inpainting, colorization, super-resolution, and stroke-guided image editing.
Training
In recent years, generative models have seen significant advancements in the form of deep learning architectures, which allow for modeling complex distributions of data. In particular, diffusion models, which use the Fokker-Planck equation to model the evolution of probability densities over time, have been shown to be effective in generating high-quality samples from various datasets. However, the training of these models can be computationally expensive and requires large amounts of data.
To address this issue, recent research has focused on developing consistency models, which are trained to predict the difference between two consecutive samples generated by a diffusion model. These models are easier and faster to train than diffusion models and can be used to improve the quality of diffusion-generated samples by enforcing consistency constraints.
One approach for training consistency models is based on distillation, where a pre-trained score model is used to approximate the score function of the diffusion model. The score function is a gradient of the log-likelihood of the data distribution, and its estimation is key to the training of both diffusion and consistency models. By using a score model, which can be pre-trained on a large dataset, the estimation of the score function becomes more efficient.

The consistency distillation method involves plugging the pre-trained score model into the Fokker-Planck equation to obtain an empirical equation that approximates the evolution of the probability density over time. This equation is discretized into a set of sub-intervals, and an ODE solver is used to compute the density at each time step. The resulting trajectory is used to generate pairs of adjacent samples, which are then used to train the consistency model via a distillation loss that minimizes the differences between the model predictions and the ground truth differences.
To improve the stability of the training, the distillation loss is computed with respect to a target network, which is an exponential moving average of the consistency model parameters. This approach is similar to the one used in deep reinforcement learning and momentum-based contrastive learning. Additionally, the step size of the ODE solver is progressively decreased during training, which improves the accuracy of the density estimation.
An important feature of consistency distillation is that it can be used to train consistency models without relying on any pre-trained diffusion models. This makes consistency models a new independent family of generative models that can be trained from scratch. To estimate the score function in this case, an unbiased Monte Carlo estimator is used, which involves generating pairs of adjacent samples and computing their difference.
The effectiveness of the consistency distillation method has been demonstrated on several datasets, including CelebA-HQ and LSUN Church. The resulting consistency models have been shown to improve the quality of diffusion-generated samples, as well as to outperform other consistency-based approaches.
In addition to consistency distillation, another approach for training consistency models is based on consistency training, which involves estimating the difference between two adjacent samples directly from the data distribution, rather than relying on a pre-trained score model. In this approach, an ODE solver is used to simulate the evolution of the density over time, and the resulting trajectory is used to generate pairs of adjacent samples, which are then used to train the consistency model via a consistency loss that minimizes the differences between the model predictions and the ground truth differences.
The main advantage of consistency training over consistency distillation is that it does not require a pre-trained score model, which can be computationally expensive to obtain. However, the estimation of the density evolution is more challenging in this approach, as it requires solving an ODE for each pair of adjacent samples. Additionally, the convergence properties of the resulting consistency models are not well-understood.
Overall, consistency models represent a promising approach for generative modeling that can improve the quality of diffusion-generated samples and enable new applications in image and video synthesis.
The paper proposes a novel approach to training generative models called consistency modeling, which seeks to enforce consistency between the samples generated from the model and the output of a diffusion model. The approach is based on the observation that diffusion models can learn to smooth out noise, whereas generative models can learn to generate sharp images. By combining the strengths of both models, the authors propose to learn generative models that are both sharp and realistic.
To train the generative models, the authors introduce two algorithms: consistency distillation and consistency training. Consistency distillation distills a generative model from a pre-trained diffusion model by minimizing the discrepancy between the two models’ outputs. Consistency training, on the other hand, trains a generative model from scratch by iteratively refining the model using an adaptive schedule of step sizes and noise levels.
The authors evaluate the proposed approach on several image datasets, including CIFAR-10, ImageNet 64x64, LSUN Bedroom 256x256, and LSUN Cat 256x256. They compare their approach with several state-of-the-art generative models and report that their approach achieves higher fidelity and sharper images. They also perform ablation studies to analyze the effect of various hyperparameters on the performance of their approach. Finally, they demonstrate the effectiveness of their approach on few-shot image generation tasks, where the proposed approach outperforms the existing methods.
Conclusion
In conclusion, consistency models are a promising approach to generative modeling that offer several advantages over existing techniques, particularly in the realm of one-step and few-step generation. Through our empirical evaluation, we have demonstrated the effectiveness of our consistency distillation method in outperforming other distillation techniques for diffusion models on multiple image benchmarks and various sampling iterations.
Moreover, consistency models prove to be a strong standalone generative model, surpassing other available models that permit single-step generation, with the exception of GANs. These models also offer zero-shot image editing applications, including inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation. These capabilities make consistency models particularly appealing for real-world applications, including computer vision, digital art, and image editing.
One of the key strengths of consistency models is their ability to leverage similarities with techniques used in other fields, such as deep Q-learning and momentum-based contrastive learning. This presents exciting prospects for cross-pollination of ideas and methods among diverse fields, as researchers continue to explore the possibilities and limitations of these models.
However, like any modeling technique, consistency models also have limitations and areas for improvement. For example, while they excel at one-step and few-step generation, they may struggle with generating more complex, high-resolution images or videos. Additionally, there are still open questions around the interpretability and explainability of these models, which is an important consideration for applications where transparency and accountability are essential.
Despite these limitations, the potential of consistency models is undeniable. As researchers continue to refine and improve these models, they have the potential to unlock new possibilities in fields ranging from computer vision to digital art. By building on the strengths of existing techniques and incorporating new ideas from diverse fields, consistency models offer a promising path forward for generative modeling.
consistency models represent an exciting development in the field of generative modeling, with their ability to support one-step and few-step generation, offer zero-shot image editing capabilities, and leverage insights from other fields. While there are still areas for improvement and questions to be answered, the potential of these models is vast and exciting, and we look forward to seeing how they will continue to evolve in the years to come
More content at PlainEnglish.io. Sign up for our free weekly newsletter. Join our Discord community and follow us on Twitter, LinkedIn and YouTube.
Learn how to build awareness and adoption for your startup with Circuit.





