avatarSynced

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

1771

Abstract

gcaption></figure><p id="3485">This work aims to build a simple and general language model for both text and image data that unifies autoregressive generation and bidirectional context-aware mask prediction. In the pretraining stage, the researchers employ a unified tokenizer (ICE Tokenizer, <a href="https://github.com/THUDM/icetk">Icetk</a>) for Image, Chinese, and English that captures bilingual text and image tokens. They use a transformer with Sandwich LayerNorm as the CogLM backbone, and scale the model up to six billion parameters. A smart and versatile masking strategy is designed and implemented to improve model performance and enable finetuning for different downstream tasks.</p><p id="d4b0">The team further proposes a CogView2 system to support interactive text-guided editing on images and fast super-resolution. They summarize their hierarchical generation process as:</p><ol><li>First, we generate a batch of low-resolution images (20x20 tokens in CogView2) using the pretrained CogLM, and then (optionally) filter out the bad samples based on the perplexity of CogLM image captioning, which is the post-selection method introduced in CogView.</li><li>The generated images are directly mapped into 60x60-token images by a direct super-resolution module finetuned from the pretrained CogLM. We use local attention implemented by our customized CUDA kernel to reduce the training expense. The high-resolution images from this step usually have inconsistent textures and a lack of details.</li><li>These high-resolution images are refined via another iterative super-resolution module finetuned from the pretrained CogLM. Most tokens are re-masked and re-generated in a local parallel autoregressive (LoPAR) way, which is much faster than the usual au

Options

toregressive generation.</li></ol><p id="f16b">The team conducted extensive experiments on 30 million text-image pairs and compared the proposed CogView2 with popular benchmarks such as DALL-E-2 and XMC-GAN.</p><figure id="16d4"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*1DC4Ek5KcrZGVvbA.png"><figcaption></figcaption></figure><p id="9643">The results show that finetuning the CogLM on the MS-COCO dataset achieves greatly improved performance on the FID metric, and that the resulting CogView2 text-to-image system can generate images with better quality and similar resolutions 10x faster than CogView and obtain results competitive with current state-of-the-art DALL-E-2 baselines.</p><p id="4612">Overall, the study successfully leverages hierarchical transformers to enable autoregressive text-to-image models to overcome slow generation and high complexity issues and narrow the gap between text-to-image pretraining and vision transformers.</p><p id="d112">Codes and a demo website will be updated on the project’s <a href="https://github.com/THUDM/CogView2">GitHub</a>. The paper <i>CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers </i>is on <a href="https://arxiv.org/abs/2204.14217">arXiv</a>.</p><p id="1010"><b>Author</b>: Hecate He | <b>Editor</b>: Michael Sarazen</p><figure id="9808"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*J55wGkD2zeKkh9gA.png"><figcaption></figcaption></figure><p id="5b11">We know you don’t want to miss any news or research breakthroughs. <b>Subscribe to our popular newsletter <a href="https://mailchi.mp/2fb3aa308ad3/welcome-to-synced-global-ai-weekly-newsletter"><i>Synced Global AI Weekly</i></a> to get weekly AI updates.</b></p></article></body>

Tsinghua U & BAAI’s CogView2 Achieves SOTA Competitive Text-to-Image Generation With 10x Speedups

Text-to-image generation has become one of the most publicly engaging AI research fields, with OpenAI’s recently unveiled state-of-the-art DALL-E-2 model garnering global mainstream media attention with its stunningly hyperrealistic images. High-performance autoregressive models like DALL-E-2 and 2021’s CogView however remain limited by slow generation speeds and expensive high-resolution training costs. Moreover, these models’ uni-directional token generation process differs from the bidirectional masked prediction of vision transformers (ViTs), limiting their application on traditional visual tasks such as image classification and object detection.

A research team from Tsinghua University and the Beijing Academy of Artificial Intelligence addresses these issues in their new paper CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, introducing a pretrained Cross-Modal general Language Model (CogLM) for efficient text and image tokens prediction. When finetuned for fast super-resolution, the resulting CogView2 hierarchical text-to-image system generates images with comparable resolution and quality at speeds up to 10x faster than CogView.

This work aims to build a simple and general language model for both text and image data that unifies autoregressive generation and bidirectional context-aware mask prediction. In the pretraining stage, the researchers employ a unified tokenizer (ICE Tokenizer, Icetk) for Image, Chinese, and English that captures bilingual text and image tokens. They use a transformer with Sandwich LayerNorm as the CogLM backbone, and scale the model up to six billion parameters. A smart and versatile masking strategy is designed and implemented to improve model performance and enable finetuning for different downstream tasks.

The team further proposes a CogView2 system to support interactive text-guided editing on images and fast super-resolution. They summarize their hierarchical generation process as:

  1. First, we generate a batch of low-resolution images (20x20 tokens in CogView2) using the pretrained CogLM, and then (optionally) filter out the bad samples based on the perplexity of CogLM image captioning, which is the post-selection method introduced in CogView.
  2. The generated images are directly mapped into 60x60-token images by a direct super-resolution module finetuned from the pretrained CogLM. We use local attention implemented by our customized CUDA kernel to reduce the training expense. The high-resolution images from this step usually have inconsistent textures and a lack of details.
  3. These high-resolution images are refined via another iterative super-resolution module finetuned from the pretrained CogLM. Most tokens are re-masked and re-generated in a local parallel autoregressive (LoPAR) way, which is much faster than the usual autoregressive generation.

The team conducted extensive experiments on 30 million text-image pairs and compared the proposed CogView2 with popular benchmarks such as DALL-E-2 and XMC-GAN.

The results show that finetuning the CogLM on the MS-COCO dataset achieves greatly improved performance on the FID metric, and that the resulting CogView2 text-to-image system can generate images with better quality and similar resolutions 10x faster than CogView and obtain results competitive with current state-of-the-art DALL-E-2 baselines.

Overall, the study successfully leverages hierarchical transformers to enable autoregressive text-to-image models to overcome slow generation and high complexity issues and narrow the gap between text-to-image pretraining and vision transformers.

Codes and a demo website will be updated on the project’s GitHub. The paper CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers is on arXiv.

Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.

Text To Image Generation
Computer Vision
Transformers
Machine Learning
Artificial Intelligence
Recommended from ReadMedium