
ByteDance Open-Sources Video-As-Prompt Model: Use Videos as Commands to Turn Static Images into Animations in Seconds!

October 24, 2025
Updated Oct 24
2 min read

A new breakthrough in AI video generation! ByteDance has officially open-sourced its innovative Video-As-Prompt (VAP) model. The technology lets users use a reference video directly as a “prompt” to animate any static image, faithfully replicating the reference video’s semantics and dynamic style. This article provides an in-depth look at VAP’s core concept, the differences between its two released models, and why its performance can compete with top commercial models like Kling and Vidu.


A New Approach to AI Video Generation: No Longer Just a Word Game

Have you ever wished you could make a static photo dance, run, or even pull off subtle facial expressions, just like the person in a video? Until now, we commanded AI to generate videos with text (Text-to-Video), but text descriptions often struggle to accurately convey the complex dynamics and emotions we have in mind.

Now, all of that is about to change.

ByteDance recently open-sourced a new technology called Video-As-Prompt (VAP), which upends the traditional video-generation paradigm. Its core concept is very intuitive: use a video directly as the command that drives a static image.

It’s like pointing to a video of Michael Jackson dancing and then saying to a portrait of the Mona Lisa, “Hey, make her dance like this!” VAP can understand the “semantics” of the dance—not just the movement trajectory, but also the rhythm, style, and sense of power—and apply it to the Mona Lisa.

What is the Core Concept of Video-As-Prompt?

Simply put, VAP’s task is this: given a reference video with specific semantics (the Video Prompt), make a reference image (the Reference Image) move with exactly the same semantics as that video.

Behind this is a new paradigm called “in-context generation.” It no longer requires complex text descriptions or multiple conditional controls, but instead learns directly from the example video, understands the essence of its dynamics, and then imitates and transfers it. This makes video generation unprecedentedly intuitive and flexible.
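To make the workflow concrete, here is a minimal sketch of what calling a VAP-style model could look like in Python. The `VideoAsPrompt` class, the checkpoint name, and the `generate` signature are illustrative assumptions for this article, not the project’s actual API:

```python
# A minimal sketch of a Video-As-Prompt-style inference call.
# The class name, checkpoint identifier, and method signature are
# illustrative assumptions, NOT the official VAP API.
from PIL import Image


class VideoAsPrompt:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # a Wan2.1 or CogVideoX backbone

    def generate(self, reference_video: str, reference_image: Image.Image,
                 num_frames: int = 49) -> list:
        """Transfer the reference video's dynamics onto the still image."""
        ...  # stands in for the real diffusion pipeline
        return []


model = VideoAsPrompt("ByteDance/Video-As-Prompt-Wan2.1")  # hypothetical name
frames = model.generate(
    reference_video="mj_dance.mp4",               # the video "prompt"
    reference_image=Image.open("mona_lisa.png"),  # the image to animate
)
```

Note what is absent: there is no text prompt at all. The example video carries the entire instruction.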

Two Models, Two Choices: Wan2.1 vs. CogVideoX

To meet the needs of different users, ByteDance has thoughtfully provided two versions of VAP, which make different trade-offs between capability and stability.

1. Wan2.1-I2V-14B: More Powerful, Better at Understanding Humans

  • Advantages: Thanks to the strong capabilities of its 14-billion-parameter base model, this version performs exceptionally well at generating human motion and novel concepts. Whether it’s complex dance moves or culture-specific concepts like “Squid Game,” it can capture and reproduce them accurately.
  • Limitations: Because the model is so large, it received relatively few training steps under limited computing resources. As a result, it is slightly less stable under certain semantic conditions, and unexpected results occasionally appear.

2. CogVideoX-I2V-5B: A More Stable and Reliable Choice

  • Advantages: At 5 billion parameters, this model is more lightweight, which allowed the development team to train it longer on the same resources. The result is extremely high stability across most semantic conditions; for everyday animation-generation tasks, it almost never slips up.
  • Limitations: Constrained by its backbone network, it is slightly weaker at human-centric generation tasks. Its understanding and generation of concepts that are rare in the pre-training data (such as Labubu or Minecraft) are also weaker.

How to choose? The conclusion is simple: if you need to generate complex human actions or niche, trendy content, then Wan2.1 is your first choice; if you are pursuing high stability and reliability in various common scenarios, then CogVideoX will be a more stable choice.
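That rule of thumb is easy to encode. A tiny helper, with hypothetical checkpoint names standing in for whatever the release actually ships:

```python
# A hedged rule of thumb encoded as code; the checkpoint identifiers
# below are assumptions for illustration, not confirmed release names.
def pick_checkpoint(human_motion: bool, novel_concepts: bool) -> str:
    """Wan2.1 for humans and trendy concepts, CogVideoX for stability."""
    if human_motion or novel_concepts:
        return "Video-As-Prompt-Wan2.1-14B"   # stronger, slightly less stable
    return "Video-As-Prompt-CogVideoX-5B"     # lighter, very stable


print(pick_checkpoint(human_motion=True, novel_concepts=False))
```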

Tech Deep Dive: How Does VAP Work?

VAP’s architecture is quite clever: rather than building a new model from scratch, it stands on the shoulders of giants.

The core of the entire system is a “frozen” Video Diffusion Model Transformer (Video DiT). You can think of it as a general-purpose brain that is already very good at generating videos. “Freezing” means locking its parameters to ensure that it does not forget its original powerful capabilities when learning new tasks, which effectively avoids the “catastrophic forgetting” problem common in the AI field.
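In PyTorch terms, freezing simply means disabling gradients on the backbone. A minimal sketch, assuming a generic `nn.Module` backbone rather than VAP’s actual training code:

```python
import torch.nn as nn


def freeze_backbone(video_dit: nn.Module) -> None:
    """Lock the pretrained Video DiT so training the new modules
    cannot overwrite what it already knows (no catastrophic forgetting)."""
    for param in video_dit.parameters():
        param.requires_grad = False
    video_dit.eval()  # also fix dropout / norm statistics


# Only the newly attached expert (the MoT branch described next) stays
# trainable, so the optimizer should be built from its parameters only.
```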

So, how do you make this general-purpose brain understand “video commands”? The answer is a plug-and-play Mixture-of-Transformers (MoT). This MoT expert is like a translator, specializing in interpreting the dynamic semantics in the reference video, and then transmitting these instructions to the core DiT model to guide it to generate the required animation.
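The general pattern, a trainable expert branch exchanging information with a frozen backbone layer through joint attention, can be sketched as follows. This is a simplified illustration of the idea, not VAP’s exact architecture:

```python
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    """Simplified sketch: a trainable expert runs beside one frozen DiT
    layer, and the two token streams meet in joint attention.
    Illustrative only, not VAP's exact design."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.expert = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, frozen_layer: nn.Module,
                video_tokens: torch.Tensor,    # frames being denoised
                prompt_tokens: torch.Tensor):  # reference-video tokens
        # The trainable expert "translates" the video prompt...
        prompt_tokens = self.expert(prompt_tokens)
        # ...then the frozen backbone attends over both streams, so the
        # command steers generation without touching frozen weights.
        joint = torch.cat([video_tokens, prompt_tokens], dim=1)
        out = frozen_layer(joint)
        return out[:, : video_tokens.size(1)], prompt_tokens
```

In real training, only the expert’s parameters would receive gradients, matching the freezing shown earlier.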

In addition, VAP uses temporally biased position embeddings, which help the model capture contextual cues from the reference video without learning a spurious frame-to-frame temporal correspondence.
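One simple way to realize such a bias (an assumption about the mechanism, not necessarily VAP’s exact formulation) is to shift the reference video’s frame indices so that prompt frame t never shares a position with target frame t:

```python
import torch


def biased_temporal_positions(prompt_frames: int, target_frames: int,
                              bias: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Assign biased temporal indices: the reference video occupies a
    shifted range, so the model treats it as preceding context rather
    than frames aligned 1:1 with the output. Sketch of the idea only."""
    prompt_pos = torch.arange(prompt_frames) - bias  # e.g. [-16, ..., -1]
    target_pos = torch.arange(target_frames)         # [0, ..., T-1]
    return prompt_pos, target_pos


prompt_pos, target_pos = biased_temporal_positions(16, 49, bias=16)
```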

Amazing Performance! Can VAP Challenge Commercial Giants?

After all this, how does VAP actually perform? In a word: impressively.

According to the officially released data, VAP, as a unified and generalizable semantic-controlled video generation model, has surpassed all existing open-source models in performance. More importantly, its user preference score is almost on par with top closed-source commercial models like Kling and Vidu!

| Model | CLIP Score (⬆) | Motion Fluency (⬆) | Dynamism (⬆) | Aesthetic Quality (⬆) | Alignment Score (⬆) | User Preference Rate (⬆) |
| --- | --- | --- | --- | --- | --- | --- |
| VACE (Original) | 5.88 | 97.60 | 68.75 | 53.90 | 35.38 | 0.6% |
| VACE (Depth) | 22.64 | 97.65 | 75.00 | 56.03 | 43.35 | 0.7% |
| VACE (Optical Flow) | 22.65 | 97.56 | 79.17 | 57.34 | 46.71 | 1.8% |
| CogVideoX-I2V | 22.82 | 98.48 | 72.92 | 56.75 | 26.04 | 6.9% |
| CogVideoX-I2V (LoRA) | 23.59 | 98.34 | 70.83 | 54.23 | 68.60 | 13.1% |
| Kling / Vidu | 24.05 | 98.12 | 79.17 | 59.16 | 74.02 | 38.2% |
| Video-As-Prompt | 24.13 | 98.59 | 77.08 | 57.71 | 70.44 | 38.7% |

The table makes it clear that Video-As-Prompt achieves the highest scores on several key metrics, most notably the 38.7% user preference rate: in blind tests against the other models, nearly 40% of users judged VAP’s videos to be the best. For an open-source model, this is a milestone achievement.

Future Outlook: The Next Step Towards Universal Video Generation

The emergence of VAP not only provides developers and creators with a powerful new tool, but more importantly, it demonstrates the huge potential of AI video generation technology. Its powerful zero-shot generalization ability means that it can handle many tasks that it has never seen in training, which marks a solid step towards the goal of universal and controllable video generation.

From the creation of dynamic memes on social media, to artists bringing static paintings to life, to the design of animation prototypes in the film and television industry, the application prospects of VAP are limitless. With the participation and iteration of the community, we have reason to believe that AI will give new life to static images in an unprecedented way.

