Summary

ByteDance's new AI tool, Boximator, enables the transformation of static images into animated videos by using defined boxes for object motion control, showcasing advanced capabilities in AI-driven video generation.

Abstract

ByteDance, the parent company of TikTok, has introduced Boximator, an innovative AI animator that breathes life into still images by creating videos with controlled object movement. This tool leverages a combination of hard and soft boxes to precisely outline and animate objects within a video, offering users a high degree of customization. Boximator operates by generating image descriptions, extracting noun chunks, and using these prompts with a grounding model and object tracker to produce bounding boxes across video frames. The technology is trained on the WebVid-10M dataset, comprising over 10 million video-caption pairs, and demonstrates superior dynamic results when compared to other AI video generators like Pika 1.0 and Runway Gen2. While the demo website for Boximator is not yet publicly available, interested individuals can contact the creators for a preview of its capabilities. The article expresses enthusiasm for the potential of such technology while also acknowledging the need for responsible consumption of online media to mitigate risks like deepfake misuse.

Opinions

The author is impressed by Boximator's ability to generate dynamic and controllable animations from static images, indicating a significant advancement in AI video generation.
The comparison with other AI video generators suggests that Boximator's additional control layer, using box constraints, leads to more realistic and precise animations.
There is an underlying excitement about the future possibilities of AI tools like Boximator and OpenAI's Sora, hinting at their potential widespread accessibility and impact.
The author emphasizes the importance of being critical of online media and aware of the potential misuse of such powerful technology, particularly in the context of deepfakes and misinformation.

TikTok (ByteDance)’s New AI Animator Is Mind-Blowing

AI video generators have recently dominated tech headlines, especially following OpenAI’s announcement of Sora, its first video model, which can generate jaw-dropping AI videos from simple text prompts.

Today, ByteDance, the company that made TikTok, is getting in on the action too. They’ve created Boximator, which lets you turn static pictures into videos.

What is Boximator?

Boximator combines “box” and “animator” to describe its function: animating objects within videos using user-defined boxes. This tool aims to give users control over how objects move in a video, offering a mix of hard and soft boxes for motion control.

Hard boxes pin down an object’s exact bounding box, while soft boxes mark a looser region the object should stay within, which allows for more fluid motion paths.

In the example above, all bounding boxes are projected to the cropped region (white dashed box).
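To make the two box types a bit more concrete, here is a minimal Python sketch. It is not Boximator's code. It assumes boxes are (x1, y1, x2, y2) pixel coordinates, reads a hard box as a near-exact target and a soft box as a looser region the object has to stay inside, and shows one plausible way to project a box into a cropped region. The names Box, satisfies, and project_to_crop, and the 0.9 overlap threshold, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (assumed layout, not Boximator's)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # An "inverted" box (no overlap) counts as empty.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def intersection(a: Box, b: Box) -> Box:
    """Overlap of two boxes; has zero area when they do not overlap."""
    return Box(max(a.x1, b.x1), max(a.y1, b.y1), min(a.x2, b.x2), min(a.y2, b.y2))


def iou(a: Box, b: Box) -> float:
    """Intersection over union, a standard measure of how well two boxes match."""
    inter = intersection(a, b).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def satisfies(obj: Box, constraint: Box, hard: bool, min_iou: float = 0.9) -> bool:
    """Hard: the object box must nearly coincide with the constraint.
    Soft: the object box only has to lie inside the constraint region.
    The 0.9 threshold is an arbitrary value chosen for this sketch."""
    if hard:
        return iou(obj, constraint) >= min_iou
    return (constraint.x1 <= obj.x1 and constraint.y1 <= obj.y1
            and obj.x2 <= constraint.x2 and obj.y2 <= constraint.y2)


def project_to_crop(box: Box, crop: Box) -> Optional[Box]:
    """Clip a box to the crop and express it in crop-local coordinates.
    Returns None when the box falls entirely outside the crop."""
    clipped = intersection(box, crop)
    if clipped.area() == 0:
        return None
    return Box(clipped.x1 - crop.x1, clipped.y1 - crop.y1,
               clipped.x2 - crop.x1, clipped.y2 - crop.y1)


dog = Box(120, 80, 220, 180)
region = Box(100, 60, 300, 200)
print(satisfies(dog, region, hard=False))  # the loose region contains the dog
print(satisfies(dog, region, hard=True))   # but it is far too loose for a hard box
print(project_to_crop(dog, Box(100, 50, 200, 150)))
```

The same rectangle can be a fine soft box and a poor hard box, which is the whole point of having both: you can be strict where it matters and loose everywhere else.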

How Boximator works

Here is how the training clips are annotated with bounding boxes:

For every clip in the dataset, the first frame is taken to generate an image description using a visual language model.
Noun chunks, such as “young man” or “white shirt,” are then extracted from each description.
These prompts are fed to a pre-trained grounding model and object tracker to generate bounding boxes and populate them across all frames of the video.
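The three steps above can be sketched as a small Python pipeline. This is only an outline under stated assumptions: annotate_clip and every toy_* function are hypothetical stand-ins, frames are plain strings instead of images, and the toy functions return canned values where a real system would call a visual language model, a noun-chunk parser, a grounding model, and an object tracker.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), an assumed layout
Frame = str                      # stand-in for real image data


def annotate_clip(
    frames: List[Frame],
    describe: Callable[[Frame], str],
    noun_chunks: Callable[[str], List[str]],
    ground: Callable[[Frame, str], Box],
    track: Callable[[List[Frame], Box], List[Box]],
) -> Dict[str, List[Box]]:
    """Return one box per frame for every noun chunk found in the first frame."""
    description = describe(frames[0])                   # step 1: caption the first frame
    annotations: Dict[str, List[Box]] = {}
    for phrase in noun_chunks(description):             # step 2: e.g. "young man"
        first_box = ground(frames[0], phrase)           # step 3a: locate the phrase
        annotations[phrase] = track(frames, first_box)  # step 3b: follow it over time
    return annotations


# Toy stand-ins so the sketch runs end to end. None of these are real models.
def toy_describe(frame: Frame) -> str:
    return "a young man in a white shirt"


def toy_noun_chunks(text: str) -> List[str]:
    known = ["young man", "white shirt"]
    return [phrase for phrase in known if phrase in text]


def toy_ground(frame: Frame, phrase: str) -> Box:
    canned = {"young man": (40, 20, 140, 220), "white shirt": (60, 80, 120, 160)}
    return canned[phrase]


def toy_track(frames: List[Frame], box: Box) -> List[Box]:
    # Pretend the object drifts 5 pixels to the right on every frame.
    return [(box[0] + 5 * i, box[1], box[2] + 5 * i, box[3]) for i in range(len(frames))]


result = annotate_clip(["frame0", "frame1", "frame2"],
                       toy_describe, toy_noun_chunks, toy_ground, toy_track)
for phrase, boxes in result.items():
    print(phrase, boxes)
```

Passing the four stages in as functions keeps the outline honest: the pipeline's shape comes from the article, while the models behind each stage are left open.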

The full model architecture of Boximator is illustrated below.

In every spatial attention block of video diffusion models, there are two stacked attention layers: a spatial self-attention layer and a spatial cross-attention layer.
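For readers who have not seen this pattern before, here is a stripped-down Python sketch of two stacked attention layers. It assumes a single head, no learned projection weights, residual connections around each layer, and text tokens as the context for cross-attention, which is the usual setup in text-conditioned diffusion models. It does not show how Boximator feeds box constraints into the block, and all function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention; context acts as both keys and values."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        # Each output is a weighted average of the context vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, context))
                        for d in range(len(q))])
    return outputs


def add(xs: List[Vector], ys: List[Vector]) -> List[Vector]:
    """Element-wise sum, used here for the residual connections."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def spatial_attention_block(visual: List[Vector], text: List[Vector]) -> List[Vector]:
    # Layer 1, spatial self-attention: visual tokens attend to each other.
    x = add(visual, attention(visual, visual))
    # Layer 2, spatial cross-attention: visual tokens attend to the text tokens.
    x = add(x, attention(x, text))
    return x


visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy image patches
text_tokens = [[0.5, 0.5], [1.0, -1.0]]                # two toy prompt tokens
out = spatial_attention_block(visual_tokens, text_tokens)
print(out)
```

The output has the same shape as the visual input, which is what lets blocks like this be stacked many times inside a diffusion model.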

Full details of how this works are described in this whitepaper.

The training dataset

Unlike image datasets, few publicly available video datasets come with object-tracking annotations. The engineers therefore curated their training set from the WebVid-10M dataset.

WebVid-10M is a large-scale dataset of short videos with textual descriptions sourced from stock footage sites. The videos are diverse and rich in their content.

10.7 million video-caption pairs.
52K total video hours.
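A quick back-of-the-envelope calculation in Python shows what these two numbers imply about clip length. Both figures are rounded, so treat the result as a rough average, not an official statistic.

```python
video_caption_pairs = 10_700_000  # 10.7 million clips
total_hours = 52_000              # 52K hours of video

# Convert hours to seconds, then spread them evenly over all clips.
average_seconds = total_hours * 3600 / video_caption_pairs
print(f"Average clip length: about {average_seconds:.1f} seconds")
```

So a typical WebVid-10M clip runs for well under half a minute, which fits the description of short stock-footage videos.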

Example videos

Here are some incredible examples:

Left: “The kitten is hiding herself into the cup”

Right: “A dog is chasing a red ball.”

Left: “A young woman is turning her head, revealing her face in profile.”

Right: “A man sitting on a table is drinking a cup of coffee.”

Comparison with other AI video generators

The examples below compare Boximator against two of the most popular AI video generators, Pika 1.0 and Runway Gen-2.

Note: Pika and Gen-2 use image and text conditions; Boximator uses additional box constraints derived from the text prompt.

Prompt: “Adding wine to a glass.”

Boximator (left), Pika 1.0 (middle), Gen-2 (right)

Prompt: “A handsome man is taking out a rose from his pocket with his right hand and looking at the rose.”

Prompt: “Two raccoons in blue shirts are playing a ball, the left one is jumping up.”

What do you think of these videos?

Looking at these examples, it’s evident that adding a control layer improves the results. The videos generated by Boximator are more dynamic than the ones from Pika and Gen-2.

How to try

The demo website is not yet available to the public. According to its creators, it should be ready in the next two to three months:

“Our demo website is under development and will be available in the next 2–3 months. We will attach the demo link on this website once the demo is ready.”

If you really want to try Boximator before then, you can email the creators at [email protected] with your input image and text prompt, and they will reply with the generated video.

Final Thoughts

As a tech enthusiast, I’m excited to see tech giants showcase software like Boximator and Sora that could be at our fingertips in the near future.

However, it is important to be aware of the risks associated with this technology. As with any powerful tool, there is the potential for misuse. Deepfakes, for example, could be used to spread misinformation or propaganda.

It is important to consume online media responsibly and to be critical of the information you see.

This story is published on Generative AI.