Stability AI has introduced Stable Video Diffusion (SVD), an AI model capable of generating short videos from text or images, showcasing potential for various applications despite current limitations.
Abstract
Stability AI, known for their Stable Diffusion image generator, has announced the development of Stable Video Diffusion (SVD), an AI model that can create up to 4 seconds of video from either text prompts or images. The model outputs smooth, multi-frame video clips with fluid interpolation between frames and supports various features such as multi-view generation, frame interpolation, 3D scene understanding, and camera control via LoRA. SVD has been favorably compared to competitors like Runway's Gen2 and Pika Labs, with user evaluations ranking it higher in image-to-video quality. While the technology is still in research preview and has limitations such as a 4-second video cap, lack of perfect photorealism, and issues with rendering motion, text, and people, it represents a significant advancement in AI-generated video content. Access to SVD is currently limited, with interested parties able to join a waitlist. The potential applications of this technology could revolutionize creative workflows in design, animation, VR, and filmmaking.
Opinions
The author is impressed with the smooth interpolation between frames and the decent quality of the example videos produced by SVD.
SVD's multi-view synthesis results are seen as particularly impressive, with the potential to generate 3D objects from 2D images being highlighted as a game-changer.
The author acknowledges some limitations of the current versions of SVD and SVD-XT, including the inability to generate videos longer than 4 seconds and imperfect photorealism.
Despite these limitations, the author is optimistic about the future of the technology, expressing enthusiasm for the creative possibilities it opens up and its potential impact across various industries.
The author encourages continued innovation and responsible implementation of the technology, anticipating profound positive impacts with careful research and development.
Stable Video Diffusion Is Here — Stability AI’s Text/Image-To-Video AI Model
Screenshot of Stable Video Diffusion example videos
Today, Stability AI, the startup behind the popular open-source AI image generator Stable Diffusion, announced Stable Video Diffusion (SVD), an AI model that brings text or images to life by turning them into short videos.
What is Stable Video Diffusion?
Stable Video Diffusion takes an image or text prompt as input and outputs a smooth, multi-frame video clip that’s up to 4 seconds long. The interpolation between frames appears to be remarkably fluid.
There are two image-to-video models, SVD and SVD-XT, capable of generating 14 and 25 frames, respectively, at customizable frame rates between 3 and 30 frames per second.
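To get a feel for how frame count and frame rate translate into clip length, here is a minimal sketch of the arithmetic. The frame counts and the 3 to 30 fps range come from the figures above; the function name, the dictionary, and the sample frame rates (including 7 fps) are my own illustrative choices, not part of any Stability AI API.

```python
# Frame counts and the fps range are taken from the article.
# The names and the sample frame rates below are illustrative assumptions.
FRAME_COUNTS = {"SVD": 14, "SVD-XT": 25}
MIN_FPS, MAX_FPS = 3, 30


def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Return the playback length of a clip with num_frames frames at fps."""
    if not MIN_FPS <= fps <= MAX_FPS:
        raise ValueError(f"fps must be between {MIN_FPS} and {MAX_FPS}, got {fps}")
    return num_frames / fps


if __name__ == "__main__":
    for model, frames in FRAME_COUNTS.items():
        for fps in (3, 7, 30):  # hypothetical sample rates within the range
            seconds = clip_duration_seconds(frames, fps)
            print(f"{model}: {frames} frames at {fps} fps = {seconds:.2f} s")
```

At 7 fps, for example, SVD's 14 frames play for 2 seconds and SVD-XT's 25 frames for roughly 3.6 seconds. The number of generated frames is fixed per model, so changing the frame rate only changes how quickly those frames are played back.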
Both models support the following features:
Text-to-Video
Image-to-Video
Resolution of 576 x 1024
Multi-View Generation: Create videos from multiple perspectives
Frame Interpolation: Smoothly transition between frames for seamless motion
3D Scene Understanding
Camera Control via LoRA: Manipulate camera movements for cinematic effects
If you want to delve into the details of how SVD works, check out this research paper.
Example Videos
Stability AI showcased the capabilities of SVD with a series of captivating example videos in its announcement.
Stable Video Diffusion example videos
To me, these example videos look very cool. The interpolation between frames looks smooth, and the overall video quality is quite good.
How does it compare to competitors?
Currently, a handful of AI tools can generate videos from text or existing images. Two notable competitors are Runway’s Gen2 and Pika Labs. User evaluations have consistently ranked SVD’s image-to-video capabilities as superior in terms of overall quality.
SVD vs Runway Gen2 vs Pika Labs
For in-depth details on the user study, refer to the accompanying research paper.
Multi-view synthesis
I was particularly impressed by the multi-view synthesis results—the potential to generate 3D objects from 2D images could be game-changing. Some users on X even posted their experiments to generate 3D objects with SVD.
How cool is that? The videos shared so far showcase this capability on basic 3D shapes, but we’ve yet to see how it handles more complex real-world scenes.
Current limitations
Of course, some limitations exist in these early versions. Here are some notable limitations for both SVD and SVD-XT:
The generated videos are limited to 4 seconds
The models do not achieve perfect photorealism
The model may generate videos without motion or with only very slow camera pans
The model cannot render legible text
Faces and people in general may not be generated properly
How To Get Access
SVD is still in research preview and is not yet intended for real-world applications. As Stability AI puts it in the announcement:
“While we eagerly update our models with the latest advancements and work to incorporate your feedback, we emphasize that this model is not intended for real-world or commercial applications at this stage.”
Those eager to experiment can join the waitlist here, but a release date for wider access is still unknown.
Stable Video Diffusion waitlist form
Final Thoughts
I am happy to see another text-to-video and image-to-video AI model emerge after a period of slowed progress. The implications of this technology are immense: the ability to effortlessly animate our creative visions could revolutionize workflows across design, animation, VR, and other fields.
Imagine creating a full-blown film with no experience.
I hope innovators continue pushing boundaries even further in the coming weeks and months. Of course, this is an extremely difficult technical challenge, but the results we’ve seen with SVD prove it’s possible.
I’m optimistic that with enough care, research, and responsible implementation, the positive impacts will be profound.
This story is published on Generative AI.