Machine Learning Art

How a neural network hallucinates houses from a single image

Novel view synthesis from a single image has attracted much attention in computer vision and computer graphics. It brings a photograph to life by extrapolating beyond the input pixels, generating new pixels that follow the geometric structure of the scene while staying semantically coherent with the existing ones. Current view synthesis methods that learn a 3D geometric representation have shown promising results in generating high-quality new views, but only within a limited range of camera motion. Synthesizing what lies outside the door of a room, for example, remains a major challenge for them.

Project Page (scroll down)

Novel view synthesis from a single image has recently attracted a lot of attention, and it has been primarily advanced by 3D deep learning and rendering techniques. However, most work is still limited to synthesizing new views within relatively small camera motions. In this paper, we propose a novel approach to synthesize a consistent long-term video given a single scene image and a trajectory of large camera motions. Our approach utilizes an autoregressive Transformer to perform sequential modeling of multiple frames, which reasons about the relations between multiple frames and the corresponding cameras to predict the next frame. To facilitate learning and ensure consistency among generated frames, we introduce a locality constraint based on the input cameras to guide self-attention among a large number of patches across space and time. Our method outperforms state-of-the-art view synthesis approaches by a large margin, especially when synthesizing long-term future in indoor 3D scenes.

During training, images and camera transformations are first encoded into modality-specific tokens. The tokens are then fed into an autoregressive Transformer that predicts images. During inference, given a single image and a camera trajectory, novel views are generated autoregressively by the same Transformer.

https://xuanchiren.com/pub/look-outside-door.pdf
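The inference procedure described above can be sketched as a simple rollout loop. This is only an illustration under stated assumptions, not the authors' implementation: `rollout`, `predict_next`, `context_frames`, and the toy predictor are hypothetical names, frames are represented as flat lists of integer tokens, and the trained Transformer is replaced by a placeholder function.

```python
from typing import Callable, List, Sequence

Tokens = List[int]    # one frame as a flat list of discrete image tokens (assumed encoding)
Camera = List[float]  # one camera transformation, e.g. a flattened pose (assumed encoding)


def rollout(
    first_frame: Tokens,
    cameras: Sequence[Camera],
    predict_next: Callable[[List[Tokens], List[Camera]], Tokens],
    context_frames: int = 2,
) -> List[Tokens]:
    """Generate one frame per camera, feeding each prediction back as context.

    cameras[0] belongs to the input image; cameras[i] is the target pose of frame i.
    """
    frames = [first_frame]
    for step in range(1, len(cameras)):
        # Keep only the most recent frames as context (hypothetical window size).
        start = max(0, len(frames) - context_frames)
        ctx_frames = frames[start:]
        # Context cameras plus the camera of the frame we want to predict.
        ctx_cams = list(cameras[start:step + 1])
        frames.append(predict_next(ctx_frames, ctx_cams))
    return frames


def toy_predictor(ctx_frames: List[Tokens], ctx_cams: List[Camera]) -> Tokens:
    """Placeholder for a trained model: shifts every token of the last frame by one."""
    return [(t + 1) % 16 for t in ctx_frames[-1]]


video = rollout([0, 1, 2], [[0.0], [0.1], [0.2], [0.3]], toy_predictor)
print(len(video), video[-1])
```

In a real system, `predict_next` would wrap the Transformer and an image decoder. The loop only shows how each generated frame becomes input for the next prediction.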

Conclusion

The paper presents an autoregressive Transformer-based model for novel view synthesis, aimed especially at synthesizing long-term future frames in indoor 3D scenes. The method applies a locality constraint, based on the input cameras, to self-attention in order to keep the generated frames consistent. As a result, it outperforms state-of-the-art approaches in novel view synthesis. The authors thereby take a further step in exploring the capabilities of geometry-free methods and manage to synthesize consistent, high-fidelity 3D scenes.
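To make the idea of a camera-based locality constraint more concrete, here is a minimal sketch of one possible form: an additive penalty on attention scores that grows with the distance between camera positions. This article does not give the paper's exact formulation, so the distance-based penalty, the function name `locality_biased_weights`, and the `strength` parameter are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def locality_biased_weights(
    scores: Sequence[float],
    query_cam: Sequence[float],
    key_cams: Sequence[Sequence[float]],
    strength: float = 1.0,
) -> List[float]:
    """Softmax attention weights with a camera-distance penalty (assumed form).

    scores:    raw attention scores of one query against each key
    query_cam: camera position of the frame the query patch belongs to
    key_cams:  camera position of the frame each key patch belongs to
    """
    # Penalize keys whose camera is far from the query's camera.
    biased = [
        s - strength * math.dist(query_cam, cam)
        for s, cam in zip(scores, key_cams)
    ]
    # Numerically stable softmax.
    peak = max(biased)
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


weights = locality_biased_weights(
    scores=[0.0, 0.0, 0.0],
    query_cam=(0.0, 0.0, 0.0),
    key_cams=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
)
print(weights)
```

With equal raw scores, patches seen from nearby cameras receive more attention than patches seen from distant ones, which is the intuition behind restricting attention to local context across space and time.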

@inproceedings{ren2022look,
  title={Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image},
  author={Ren, Xuanchi and Wang, Xiaolong},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

https://xuanchiren.com/pub/look-outside-door.pdf

Project Page:

Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image
