Alibaba's New Move! Qwen3-VL Lightweight Versions Arrive: Can They Challenge Gemini and GPT-5?
Alibaba has open-sourced 4B and 8B lightweight versions of Qwen3-VL. Not only do they have very low VRAM requirements, they also beat Gemini 2.5 Flash Lite and GPT-5 Nano on several benchmarks. Are these small models really that good? Let's take a look at their performance.
In the world of artificial intelligence, there seems to be a persistent myth: the larger the model, the more powerful it is. But what if I told you there is now a small, well-crafted model that not only uses few resources but can go toe-to-toe with famous rivals? Would you believe it?
This is not a fantasy. Alibaba's Tongyi team recently dropped a bombshell: it has officially open-sourced 4B and 8B lightweight versions of Qwen3-VL. These two models fully retain Qwen3-VL's core multimodal capabilities while dramatically lowering the hardware barrier, letting more developers and researchers get started easily.
Small Size, Big Punch: What Makes Qwen3-VL So Strong?
The biggest highlight of this lightweight release of Qwen3-VL is the "light" part. Parameter counts of 4B and 8B mean the demand for graphics memory (VRAM) drops sharply. VRAM is a pain point for every AI developer: until now, running a powerful multimodal model without a top-tier graphics card was nearly impossible.
But now, Qwen3-VL makes it all much more accessible.
More importantly, the models have shrunk but their capabilities have not. Image understanding, video analysis, document OCR: all of these core functions are fully retained. On top of that, in pursuit of deployment efficiency, Alibaba also provides an FP8 (8-bit floating point) version, a quantized format that lets the model run faster while using less memory. For developers who need to deploy on edge devices or personal computers, this is great news.
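To make the FP8 savings concrete, here is a rough back-of-the-envelope sketch (my own arithmetic, not an official figure): the memory needed for the weights alone at BF16 (2 bytes per weight) versus FP8 (1 byte per weight), ignoring activations, the vision encoder's buffers, and the KV cache.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

for name, params in [("Qwen3-VL-4B", 4e9), ("Qwen3-VL-8B", 8e9)]:
    bf16 = weight_memory_gb(params, 2)  # BF16: 2 bytes per weight
    fp8 = weight_memory_gb(params, 1)   # FP8: 1 byte per weight
    print(f"{name}: ~{bf16:.0f} GB (BF16) -> ~{fp8:.0f} GB (FP8)")
```

In practice total VRAM usage is higher than the weights alone, but the halving of the dominant term is why FP8 matters so much for consumer GPUs.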
The Data Speaks for Itself: Direct Confrontation with Gemini and GPT-5 Nano
Talk is cheap, so let's look directly at the official benchmark numbers. This report card is genuinely impressive.
| Category | Benchmark | Qwen3-VL-4B | Qwen3-VL-8B | Qwen2.5-VL-72B | Gemini 2.5 Flash Lite | GPT-5 Nano |
| --- | --- | --- | --- | --- | --- | --- |
| STEM & Puzzle | MMMU_val | 67.4 | 69.6 | 72.2* | 72.7 | 57.6 |
| | MMMU_pro_full | 53.2 | 55.9 | 51.1* | 55.6 | 36.5 |
| | MathVista_mini | 73.7 | 77.2 | 74.8* | 70.3 | 40.9 |
| | MathVision | 51.6 | 53.9 | 38.1* | 52.9 | 33.2 |
| | MATHVerse_mini | 46.8 | 62.1 | 57.6* | 33.2 | 27.0 |
| | ZEROBench_pub | 21.0 | 22.8 | 18.0* | 15.3 | 15.9 |
| | MMBench (tidy_en_v1.1) | 85.1 | 85.0 | 86.4* | 82.4 | 51.5 |
| General VQA | RealWorldQA | 70.9 | 71.5 | 77.1* | 70.5 | 60.7 |
| | MMStar | 55.8 | 70.3 | 70.8* | 71.3 | 41.5 |
| | SimpleVQA | 48.6 | 50.2 | 58.2 | 52.2 | 39.0 |
| | HallusionBench | 57.6 | 61.1 | 58.1* | 53.6 | 39.3 |
| Subjective Experience & Instruction Following | MM-MT-Bench | 7.5 | 7.7 | 7.6* | 7.1 | 6.2 |
| | MIABench | 89.7 | 91.1 | 90.7 | 90.5 | 89.6 |
| | MMLongBench-Doc | 43.5 | 47.9 | 42.1 | 38.3 | 22.1 |
| | DocVQA-TEST | 95.3 | 96.1 | 96.4* | 92.0 | 78.3 |
| | IdleVQA-TEST | 80.3 | 83.1 | 87.3* | 75.0 | 49.2 |
| Text Recognition & Chart/Document Understanding | AI2D-TEST | 83.7 | 85.0 | 88.7* | 84.8 | 65.7 |
| | OCRBench | 881 | 896 | 945* | 912 | 701 |
| | OCRBench (en/zh) | 63.2 / 57.6 | 65.4 / 61.2 | 61.5* / 63.7* | 48.1 / 24.2 | 37.9 / 27.3 |
| | CC-OCR-Bench_overall | 76.2 | 79.9 | 79.8* | 72.1 | 52.9 |
| | ChartXv2 (QG) | 76.2 | 83.0 | 87.4* | 73.5 | 64.4 |
| | ChartXv2 (Q) | 39.7 | 46.4 | 49.7* | 44.6 | 31.7 |
| | ODinW-13 | 48.2 | 44.7 | 43.1* | - | - |
| 2D/3D Grounding | ARKitScenes | 56.6 | 56.8 | - | - | - |
| | Hypersim | 12.2 | 12.7 | - | - | - |
| | SUNRGB-D | 34.7 | 36.2 | - | - | - |
| Multi-Image | BLINK | 60.8 | 60.1 | 64.4* | 62.0 | 42.3 |
| | MM-ARENA | 63.4 | 64.4 | 70.7* | 67.0 | 45.7 |
| | M-VGA | 41.3 | 45.8 | - | 40.5 | 45.8 |
| | VSI-Bench | 58.4 | 59.4 | - | 27.0 | 27.0 |
| Embodied & Spatial Understanding | EmbSpatialBench | 79.6 | 78.5 | - | 66.3 | 50.7 |
| | RefSpatialBench | 46.6 | 54.2 | - | 12.3 | 2.5 |
| | RobsSpatialHome | 61.7 | 66.9 | - | 41.2 | 44.8 |
| Video | MVBench | 68.9 | 68.7 | - | - | - |
| | Video-MME (w/o sub) | 69.3 | 71.4 | 73.5* | 65.0 | 49.4 |
| | MVBench-Q | 75.8 | 73.1 | 74.6* | 69.3 | 52.6 |
| | Charades | 58.2 | 58.3 | 58.3* | 52.6 | - |
| | Charades-STA | 55.6 | 56.0 | 50.9* | - | - |
| | Video-MMMU | 56.2 | 65.3 | 60.2* | 63.0 | 40.2 |
| | ScreenSpot | 94.0 | 94.4 | 87.1* | - | - |
| Agent | ScreenSpot Pro | 59.5 | 54.6 | 43.6* | - | - |
| | OS-World-G | 58.2 | 58.2 | - | - | - |
| | AndroidWorld | 45.3 | 47.6 | 35.0* | - | - |
| | OS-World | 26.2 | 33.9 | 8.8* | - | - |
| Fine-grained Perception | V* | 80.1 | 86.4 | 69.1 | 64.9 | 69.7 |
| | HRBench4K | 76.3 | 77.6 | 75.6 | 72.4 | 77.6 |
| | HRBench8K | 72.9 | 74.0 | 68.0 | 67.2 | - |
Note: Scores for closed-source models were by default obtained via API calls. Evaluations use 2-shot prompts, with videos parsed at up to 2048 frames.
The evaluation results in the table above show that Qwen3-VL-8B is unexpectedly strong in several key areas.
- General VQA: On tests such as RealWorldQA and MMStar, Qwen3-VL-8B scores significantly higher than Google's Gemini 2.5 Flash Lite and GPT-5 Nano.
- OCR & Document Understanding: On OCRBench, Qwen3-VL-8B scored 896, leaving its rivals far behind. This translates to very high accuracy when processing text-heavy images and documents.
- Video: Dynamic video content is a harder challenge for any model, but on tests such as Video-MME and ScreenSpot the lightweight Qwen3-VL versions remain solid, even surpassing larger models on some items.
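As an illustration of how such a model is typically queried for document OCR, here is a minimal sketch of an OpenAI-style multimodal chat payload. The model identifier and the prompt wording here are placeholders for illustration; check the provider's documentation for the actual model ID and endpoint.

```python
import json

def build_ocr_request(image_url: str, model: str = "qwen3-vl-8b-instruct") -> dict:
    """Build an OpenAI-style chat payload asking a VLM to transcribe an image.
    The model name is a placeholder, not a confirmed official identifier."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text",
                 "text": "Extract all text in this image, preserving line breaks."},
            ],
        }],
    }

payload = build_ocr_request("https://example.com/invoice.png")
print(json.dumps(payload, indent=2))
```

In a real application this dictionary would be POSTed to an OpenAI-compatible chat-completions endpoint; the interesting part is simply that image and text are mixed in one message, which is what "multimodal" means in practice.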
Most surprising of all, on some tasks Qwen3-VL-8B is comparable to Alibaba's own flagship Qwen2.5-VL-72B, released only half a year earlier. Getting close to flagship-level performance at a fraction of the size speaks for itself about the engineering behind it.
Beyond Benchmarks: How Big Is the Real-World Potential?
Strong benchmark scores ultimately have to translate into real applications. So what can the lightweight versions of Qwen3-VL actually do for us?
Their low resource requirements mean they can be deployed in far more scenarios: real-time image recognition and interaction on phones, smarter AI assistants on personal computers, or giving IoT devices the ability to "understand" the world around them.
In addition, their strong showing on Agent tasks suggests they could become the core that drives complex automated workflows. Imagine an AI assistant that not only reads the screenshot you send but understands its content and automatically carries out the follow-up actions: that is the future Qwen3-VL is aiming for.
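The screenshot-to-action loop described above can be sketched as follows. Note that the JSON action schema here is a hypothetical convention invented for illustration, not Qwen3-VL's actual output format; real agent frameworks define their own schemas and prompt the model to follow them.

```python
import json

# Hypothetical convention: the model is prompted to reply with JSON like
# {"action": "click", "x": 120, "y": 340} or {"action": "type", "text": "hello"}.
def dispatch(model_reply: str) -> str:
    """Parse the model's JSON action and describe what an executor would do."""
    action = json.loads(model_reply)
    kind = action["action"]
    if kind == "click":
        return f"click at ({action['x']}, {action['y']})"
    if kind == "type":
        return f"type text: {action['text']!r}"
    raise ValueError(f"unknown action: {kind}")

print(dispatch('{"action": "click", "x": 120, "y": 340}'))  # click at (120, 340)
```

A full agent would loop: capture a screenshot, send it to the model with the action schema in the prompt, dispatch the parsed action to the OS, and repeat until the task is done.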
Get Your Hands on It Now! Resource Portal
After all that, are you eager to try Qwen3-VL for yourself? Alibaba has generously provided all the resources. Whether you want to call the API directly or download the models for local deployment, there is a channel for you.
- Hugging Face: The favorite community for AI developers, where you can find models and related tools.
- ModelScope: Alibaba’s own model community with the most complete resources.
- API Quick Experience: If you don’t want to deploy it yourself, you can call it directly through the API.
- Cookbooks (Tutorials): Provides a wealth of code examples to help you get started quickly.
In summary, the lightweight release of Qwen3-VL proves once again that bigger is not always better. Beyond chasing peak performance, balancing efficiency with accessibility may be the key to making AI technology truly widespread. Could this signal that the era of high-performance lightweight models has arrived?