Just when we thought AI agents driven by Large Language Models (LLMs) were omnipotent, the latest benchmark VitaBench, released by Meituan’s LongCat team, serves as a reality check for the entire industry. This “hardest mock exam” shows that even top AI models have a surprisingly low success rate when dealing with complex real-world tasks. What is going on?
When AI Agents Step Out of the Lab, Reality Hits Hard
In recent years, AI agents driven by Large Language Models (LLMs) have undoubtedly been the hottest topic in the tech world. We imagine a future where, with just a few words, AI assistants can handle everything from booking restaurants and planning trips to arranging deliveries. Sounds great, right?
But reality is often a bit harsh. Current AI agents may perform well in simple, closed environments, like driving in a training course—everything goes smoothly. However, can they still cope when placed at the crossroads of the real world—a complex environment full of unexpected situations, vague instructions, and multiple tasks?
The answer might be a little disappointing. Many past evaluation benchmarks have oversimplified problems and failed to truly reflect the complexity of real life. It’s like using a linear equation to assess a mathematician’s ability—it doesn’t measure their true skills at all.
VitaBench: The “Ultimate Proving Ground” for AI Agents
To solve this problem, Meituan’s LongCat team launched VitaBench—a new, high-difficulty benchmark designed specifically to evaluate the performance of LLM agents in real-world applications.
You can think of VitaBench as an extremely realistic “life simulator.” It’s no longer about theory; it throws AI directly into the three major life scenarios we are most familiar with:
- Food Delivery
- In-store Consumption
- Online Travel Services
How complex is this simulated environment? It integrates 66 different tools, covering nearly every operation a user might need, from querying store information and making reservations to placing orders and paying.
Not Just Single Tasks, but a Continuous “Cross-Scenario” Challenge
The core challenge of VitaBench lies in its task design. It not only has 300 single-scenario tasks but also 100 extremely challenging “cross-scenario tasks.”
What does this mean? For example, a real user request might be: “Help me book a hotel with a river view, and for the night of check-in, find a well-rated, non-spicy restaurant near the hotel with a budget of $200.”
This task requires the AI agent to:
- Understand Complex Intent: Not just booking a hotel, but also a restaurant, and the two are related.
- Cross-Temporal and Spatial Reasoning: Needs to handle check-in dates, dinner times, and the geographical relationship between the hotel and the restaurant.
- Flexible Use of Tools: Must first use the “hotel booking tool” and then use the “restaurant search tool” based on the results.
- Proactive Clarification: If the user’s instructions are vague, the AI needs to ask follow-up questions, such as “What type of cuisine would you prefer for the restaurant?”
- Track Dynamic Intent: In a multi-turn conversation, the user might change their mind, and the AI needs to keep up.
Honestly, this is a bit complicated even for humans, let alone AI.
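To make the chaining concrete, here is a minimal Python sketch of the dependency at the heart of such a task: the restaurant search cannot run until the hotel result is known, because its location argument comes from the hotel. The tool names (`search_hotels`, `search_restaurants`) and the data are hypothetical stand-ins for illustration, not VitaBench's actual tool API:

```python
def search_hotels(view: str) -> list[dict]:
    # Stand-in for a real hotel-search tool (hypothetical data).
    return [{"name": "Riverside Inn", "view": view, "district": "Old Town"}]

def search_restaurants(near: str, max_price: float, spicy: bool) -> list[dict]:
    # Stand-in for a real restaurant-search tool (hypothetical data).
    options = [
        {"name": "Bamboo Garden", "district": "Old Town",
         "price": 180, "spicy": False, "rating": 4.6},
        {"name": "Fire Wok", "district": "Old Town",
         "price": 150, "spicy": True, "rating": 4.8},
    ]
    return [r for r in options
            if r["district"] == near
            and r["price"] <= max_price
            and r["spicy"] == spicy]

def plan_trip(budget: float) -> dict:
    # Step 1: the hotel must be chosen first...
    hotel = search_hotels(view="river")[0]
    # Step 2: ...because the restaurant query depends on its location.
    candidates = search_restaurants(near=hotel["district"],
                                    max_price=budget, spicy=False)
    best = max(candidates, key=lambda r: r["rating"])
    return {"hotel": hotel["name"], "restaurant": best["name"]}

print(plan_trip(budget=200))
```

A real agent would additionally have to interleave clarification questions ("What cuisine do you prefer?") and re-plan when the user changes their mind mid-conversation, which is exactly where current models struggle.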
The Brutal Report Card: Top AIs “Fall” One After Another
So, how did the most powerful AI models of today perform in this ultimate test?
The results are quite shocking.
Thinking Models
| Rank | Model | Avg@4 | Cross-Scenario (Pass) | Cross-Scenario (Pass@4) | Single-Scenario (Avg@4) |
|---|---|---|---|---|---|
| 1 | o3 (high) | 30.0 | 61.0 | 6.0 | 53.5 |
| 2 | Claude-4.1-Opus (w/ thinking) | 29.0 | 56.0 | 6.0 | 47.5 |
| 3 | LongCat-Flash-Thinking | 24.3 | 54.0 | 3.0 | 42.3 |
| 4 | Gemini-2.5-Pro | 23.5 | 53.0 | 5.0 | 49.0 |
| 5 | Claude-4-Sonnet (w/ thinking) | 23.0 | 51.0 | 6.0 | 46.0 |
| 6 | GPT-5 (high) | 22.8 | 51.0 | 3.0 | 54.0 |
| 7 | GLM-4.5 (w/ thinking) | 22.8 | 48.0 | 2.0 | 44.5 |
| 8 | o4-mini (high) | 19.5 | 49.0 | 1.0 | 44.5 |
| 9 | Qwen3-235B-A22B-Thinking-2507 | 18.8 | 45.0 | 2.0 | 44.0 |
| 10 | Doubao-Seed-1.6-Thinking | 17.0 | 42.0 | 1.0 | 30.3 |
| 11 | DeepSeek-R1-0528 | 14.5 | 39.0 | 0.0 | 40.3 |
| 12 | Gemini-2.5-Flash (think on) | 5.3 | 24.0 | 0.0 | 32.0 |
| 13 | Qwen3-32B (w/ thinking) | 5.0 | 47.0 | 3.0 | 22.8 |
Non-thinking Models
| Rank | Model | Avg@4 | Cross-Scenario (Pass) | Cross-Scenario (Pass@4) | Single-Scenario (Avg@4) |
|---|---|---|---|---|---|
| 1 | Claude-4.1-Opus (w/o thinking) | 21.8 | 47.0 | 3.0 | 46.0 |
| 2 | Claude-4-Sonnet (w/o thinking) | 21.3 | 49.0 | 4.0 | 39.0 |
| 3 | LongCat-Flash-Chat | 20.3 | 45.0 | 2.0 | 39.5 |
| 4 | GLM-4.5 (w/o thinking) | 20.0 | 47.0 | 1.0 | 45.8 |
| 5 | Qwen3-Max | 18.5 | 47.0 | 3.0 | 37.2 |
| 6 | DeepSeek-V3.2-Exp (w/o thinking) | 17.7 | 41.0 | 2.0 | 36.2 |
| 7 | DeepSeek-V3.1 (w/o thinking) | 16.3 | 40.0 | 1.0 | 34.0 |
| 8 | Kimi-K2-0905 | 15.5 | 39.0 | 2.0 | 35.3 |
| 9 | Qwen3-235B-A22B-Instruct-2507 | 14.3 | 38.0 | 0.0 | 34.3 |
| 10 | GPT-4.1 | 13.8 | 35.0 | 0.0 | 37.8 |
| 11 | Doubao-Seed-1.6 | 10.5 | 29.0 | 0.0 | 37.8 |
| 12 | Gemini-2.5-Flash (think off) | 5.8 | 17.0 | 1.0 | 31.0 |
| 13 | Qwen3-32B (w/o thinking) | 4.0 | 12.0 | 0.0 | 16.5 |
| 14 | GPT-5 (minimal) | 4.0 | 9.0 | 0.0 | 30.0 |
| 15 | DeepSeek-V3-0324 | 3.8 | 12.0 | 0.0 | 25.3 |
According to the leaderboard published with VitaBench, the data reveals a huge performance gap:
- On the relatively simple 300 single-scenario tasks, even the best-performing models score only around 50%.
- On the 100 complex cross-scenario tasks, the success rate of the strongest models plummets to roughly 30%!
This report card clearly tells us that current LLM agents have significant shortcomings in the following areas:
- Difficulty in Domain Switching: An AI that is good at handling travel bookings can easily “crash” when asked to handle dining issues at the same time.
- Tool Selection Obstacles: Faced with 66 tools, AI often doesn’t know when and which one is the most appropriate to use.
- Lack of Long-term Coordination: Handling long-term tasks that require multiple steps and span several rounds of conversation remains a huge challenge for AI.
What Does This Mean for Our Future?
The emergence of VitaBench is not meant to undermine our confidence in AI. On the contrary, it acts like a mirror, truthfully reflecting the current technological deficiencies and pointing the way forward for the entire industry.
This research tells us that to make AI agents truly reliable assistants in our lives, we must not only focus on improving the language capabilities of the models but also train their ability to reason, plan, and execute tasks in complex, dynamic environments.
VitaBench gives developers a valuable resource for testing and improving their AI agents in an environment much closer to reality. The current 30% ceiling may look low, but this is precisely the period of building strength before a technological takeoff.
Frequently Asked Questions about VitaBench
Q1: What exactly is VitaBench? A: VitaBench is a high-difficulty evaluation benchmark developed by the Meituan LongCat team, specifically designed to assess the ability of Large Language Model (LLM) agents to perform complex interactive tasks in simulated real-world scenarios (such as delivery, travel).
Q2: Why do we need evaluation tools like VitaBench? A: Because existing evaluation tools are mostly oversimplified and cannot reflect the complexity of real-world tasks. VitaBench provides a “testing ground” that is closer to reality, which can effectively test the true capabilities of AI agents in handling multiple goals, dynamic information, and complex toolsets, thereby promoting the practical application and development of the technology.
Q3: Which AI models are currently performing best on VitaBench? A: According to the published leaderboard, in the most challenging cross-scenario tasks, models such as o3 (high), Claude-4.1-Opus (w/ thinking), and LongCat-Flash-Thinking are in the lead, but even so, their highest average success rate is only around 30%.
Q4: How can I learn more about or use VitaBench? A: The VitaBench project is open source. You can visit its official website to view the detailed research paper, dataset, and leaderboard. Developers can also find the relevant code and resources on its GitHub page.


