Tencent has officially open-sourced its latest Hunyuan world model, Voyager. The model not only took first place on the WorldScore benchmark, but can also generate world-consistent 3D point cloud videos from a single image, letting users explore scenes immersively. How magical is this technology? Let’s find out.
Imagine giving an AI a single photo and having it construct a complete 3D world for you, a world you can freely “walk” through and explore. This sounds like something out of a science fiction movie, but Tencent’s newly open-sourced “HunyuanWorld-Voyager” is making it a reality.
This model is no simple toy; it’s the industry’s first world model to support native 3D reconstruction and ranked first overall in the authoritative WorldScore benchmark test. Even more impressively, it can directly output point cloud videos, opening up entirely new possibilities for 3D applications, game development, and virtual reality.
If you want to experience it for yourself, an online demo is available, and tech enthusiasts can find all the open-source data on GitHub.
How is this magical technology achieved?
Many may wonder how Voyager transforms a static image into a dynamic 3D world. In fact, two key core components are at work behind the scenes.
1. World-Consistent Video Diffusion Technology
First, Voyager employs a unified architecture that can simultaneously generate precisely calibrated color videos (RGB) and depth video sequences. What does this mean? Simply put, it not only “paints” the scene you see but also simultaneously “understands” the distance of each object in the scene. This ensures that as you move through this virtual world, the position and scale of all objects remain correct, without strange distortions or warping, guaranteeing global scene consistency.
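Because every RGB frame comes with an aligned depth map, each pixel can be lifted directly into 3D. Here is a minimal sketch of that unprojection under a standard pinhole camera model; the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative values, not Voyager’s actual parameters:

```python
import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Unproject a per-pixel depth map into a colored 3D point cloud
    using the pinhole camera model (hypothetical intrinsics)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx  # back-project along the camera rays
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    return points, colors

# Toy input: a flat plane 2 m from the camera.
depth = np.full((4, 4), 2.0)
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
pts, cols = depth_to_point_cloud(depth, rgb, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
print(pts.shape)  # (16, 3)
```

This is why depth consistency matters: if the depth channel drifted between frames, the unprojected points from different frames would no longer line up in 3D.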
2. Long-Range World Exploration Capability
A single scene is not enough; to create a “world,” the scene must keep expanding. For this, Voyager proposes an efficient world caching mechanism. It works like a super-powerful memory for the AI, combining point culling with autoregressive inference to remember the details of every scene generated so far.
This way, when you need to explore further, the AI can iteratively expand the scene outward from this cache. Smooth sampling across clips keeps old and new scenes seamlessly connected, producing fluid video.
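The expand-and-remember loop above can be sketched as follows. The three callables are hypothetical stand-ins for Voyager’s actual renderer, diffusion model, and point-cloud merge step; the stubs exist only to show the control flow, not the real computation:

```python
def explore(world_cache, camera_path, generate_rgbd, render_partial, merge_points):
    """Autoregressive world expansion: each step conditions the model on
    what the cache already knows about the scene from the new viewpoint."""
    clips = []
    for pose in camera_path:
        guidance = render_partial(world_cache, pose)   # project cached points into the new view
        rgb, depth = generate_rgbd(guidance, pose)     # model fills in the unseen regions
        world_cache = merge_points(world_cache, rgb, depth, pose)  # fold new points back in
        clips.append((rgb, depth))
    return clips, world_cache

# Stub implementations, just to exercise the loop.
clips, cache = explore(
    world_cache=set(),
    camera_path=[0, 1, 2],
    generate_rgbd=lambda g, p: (f"rgb{p}", f"d{p}"),
    render_partial=lambda c, p: len(c),
    merge_points=lambda c, r, d, p: c | {p},
)
print(len(clips), sorted(cache))  # 3 [0, 1, 2]
```

The key design point is that the cache, not the previous frame alone, conditions each new clip, which is what keeps far-apart views of the same region consistent.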
The Success Behind the Scenes: A Massive Data Training Engine
Training such a powerful AI model requires a massive amount of data. To this end, the Tencent team built a scalable data construction engine.
This engine is highly automated: for any input video it estimates the camera poses and depth information on its own, with no manual annotation required. This greatly improves efficiency and makes large-scale, high-quality training data construction practical. Built on this engine, Voyager combines real-world captured videos with scenes rendered in Unreal Engine to form a massive dataset of over 100,000 video clips.
How to objectively evaluate the quality of a virtual world?
After all this, how do we know that the world generated by Voyager is truly “good” and not just something that looks okay? This requires some objective evaluation criteria. In the following tables, you will see some technical terms. Don’t worry, they are actually easy to understand.
Three Major Metrics for Measuring Video/Image Quality
When an AI generates a video, we need to compare it with a “real” video. The following three metrics are used for this purpose:
- Peak Signal-to-Noise Ratio (PSNR) ↑: You can think of this as a “pixel-level comparison.” It compares each pixel of the generated image and the real image one by one. The higher the score (the arrow ↑ means the higher the better), the smaller the pixel difference between the two images and the lower the distortion.
- Structural Similarity (SSIM) ↑: This metric goes a step further than PSNR. It doesn’t just look at pixels but is more concerned with the “structure” that the human eye sees, such as brightness, contrast, and object edges. The higher the SSIM score (↑), the more similar the image feels to the original image to the human eye.
- Learned Perceptual Image Patch Similarity (LPIPS) ↓: This is the “smartest” metric. It uses another neural network to mimic human visual perception when judging how similar two images are, so it catches subtle differences that the human eye is sensitive to but that traditional metrics may miss. Here, lower is better (the arrow ↓): the lower the score, the closer the two images “feel” to each other.
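To make the first metric concrete, here is a minimal NumPy implementation of PSNR using its standard formula (SSIM and LPIPS require dedicated libraries and are omitted from this sketch):

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB (higher is better)."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: zero distortion
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy example: a random "clean" image vs. a slightly noisy copy.
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)
noisy = np.clip(clean + rng.normal(0, 5, clean.shape), 0, 255)
print(round(psnr(clean, noisy), 2))
```

With noise of standard deviation 5 on a 0-255 scale, the mean squared error is around 25, giving a PSNR in the mid-30s dB; the scores in the tables below (16-19 dB) reflect much larger differences, since the models are synthesizing entirely new views rather than denoising.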
Now, let’s look at Voyager’s performance with this knowledge.
The Proof is in the Pudding: Performance Comparison
Video Generation Quality Comparison
In a comparison with four other open-source models (Swerve, ViewCrafter, See3D, FlexWorld), Voyager performed best on all key metrics.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Swerve | 16.648 | 0.613 | 0.349 |
| ViewCrafter | 16.512 | 0.636 | 0.332 |
| See3D | 18.189 | 0.694 | 0.290 |
| FlexWorld | 18.278 | 0.693 | 0.281 |
| Voyager | 18.751 | 0.715 | 0.277 |
From the data, it is clear that Voyager has the highest PSNR and SSIM scores and the lowest LPIPS score. This means that the videos it generates are not only the closest to reality at the pixel level but are also the most realistic in the perception of the human eye and AI.
From the actual generated videos, when the camera moves significantly, other models struggle to produce reasonable predictions, often resulting in obvious “ghosting” or loss of detail. Voyager, on the other hand, can effectively retain the detailed features of the input image, such as the chandelier in the example, and generate a highly realistic video sequence.
3D Scene Reconstruction Quality Comparison
Another major advantage of Voyager is its ability to directly generate RGB-D (color + depth) videos, which gives it a significant advantage in 3D reconstruction tasks. Other models can only generate color videos and require additional tools like VGGT to estimate depth, which naturally compromises the results.
| Method | Post-processing | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| Swerve | VGGT | 15.581 | 0.602 | 0.452 |
| ViewCrafter | VGGT | 16.161 | 0.628 | 0.440 |
| See3D | VGGT | 16.764 | 0.633 | 0.440 |
| FlexWorld | VGGT | 17.623 | 0.659 | 0.425 |
| Voyager | VGGT | 17.742 | 0.712 | 0.404 |
| Voyager | - | 18.035 | 0.714 | 0.381 |
This table tells us that even if we let other models “cheat” by using post-processing tools to add depth information, Voyager’s reconstruction results are still superior in terms of geometric consistency. If we directly use Voyager’s own depth information (the last row of the table, with post-processing as “-”, meaning no processing is needed), the results are far ahead, once again proving the power of its native 3D generation capabilities.
Reaching the Top of WorldScore: The All-Around Champion
Finally, let’s look at the comprehensive WorldScore benchmark. It doesn’t just look at image quality but evaluates a model’s ability to generate a “world” from multiple dimensions.
- Camera Control: Can the model accurately move the viewpoint according to instructions?
- Object Control: Are the objects in the scene stable, without deforming or disappearing randomly?
- Content Alignment: Is the style and theme of the generated content consistent with the original image?
- 3D Consistency: When viewing the same object from different angles, is its 3D structure reasonable?
- Subjective Quality: Finally, a human scores it: does this world look real? Is it engaging?
| Method | World Average Score | Camera Control | Object Control | Content Alignment | 3D Consistency | … | Subjective Quality |
|---|---|---|---|---|---|---|---|
| WonderJourney | 63.75 | 84.6 | 37.1 | 35.54 | 80.6 | … | 66.56 |
| WonderWorld | 72.69 | 92.98 | 51.76 | 71.25 | 86.87 | … | 49.81 |
| Voyager | 77.62 | 85.95 | 66.92 | 68.92 | 81.56 | … | 71.09 |
The results are clear: HunyuanWorld-Voyager achieved the highest scores in both “World Average Score” and “Subjective Quality,” officially taking the top spot. This fully demonstrates that Voyager exhibits strong competitiveness in both technical hard metrics and human subjective perception, setting a new benchmark in the field of 3D content generation.
In conclusion, the emergence of Tencent Hunyuan Voyager is not just a technological breakthrough; it also heralds a fundamental change in the way we will interact with the digital world in the future. From gaming and filmmaking to virtual reality, the application potential of this technology is endless. A new era of AI-driven 3D content creation may have already quietly arrived.


