Machine Learning Art
Controllable Text to Image Synthesis

Generative models for text-to-image synthesis have shown significant progress. However, generating high-fidelity images remains a challenge, especially with long text descriptions.
The last few years have seen a surge in the use of GANs for image generation and advances in transformer architectures for natural language understanding. In this paper, the authors propose a novel text-to-image method: Make-A-Scene Text-to-Image Generation with Human Priors.
The method proposes in the paper:
- enables a simple control mechanism complementary to text in the form of a scene,
- introduce elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects),
- adapting classifier-free guidance for the transformer use case.
Achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels, significantly improving visual quality.
You can use a similar method to generate VIDEO based on your text. You don’t need to do anything, just write a few words, and the rest will happen by itself. Just click the HERE, 🟠 and we’ll get started! The article includes step-by-step instructions with screenshots and videos. A complete, non-technical guide.
Project Page (scroll down) Through scene controllability, the authors introduce several new capabilities:
🔵 Scene editing,
🔵 text editing with anchor scenes,
🔵 overcoming out-of-distribution text prompts,
🔵 story illustration generation, as demonstrated in the below video.









