Just when we thought AI agents driven by Large Language Models (LLMs) were omnipotent, the latest benchmark VitaBench, released by Meituan’s LongCat team, serves as a reality check for the entire industry. This “hardest mock exam” shows that even top AI models have a surprisingly low success rate when dealing with complex real-world tasks. What is going on?
When AI Agents Step Out of the Lab, Reality Hits Hard
In recent years, AI agents driven by Large Language Models (LLMs) have undoubtedly been the hottest topic in the tech world. We imagine a future where, with just a few words, AI assistants can handle everything from booking restaurants and planning trips to arranging deliveries. Sounds great, right?
But reality is often a bit harsh. Current AI agents may perform well in simple, closed environments, like driving in a training course—everything goes smoothly. However, can they still cope when placed at the crossroads of the real world—a complex environment full of unexpected situations, vague instructions, and multiple tasks?
The answer might be a little disappointing. Many past evaluation benchmarks have oversimplified problems and failed to truly reflect the complexity of real life. It’s like using a linear equation to assess a mathematician’s ability—it doesn’t measure their true skills at all.
VitaBench: The “Ultimate Proving Ground” for AI Agents
To solve this problem, Meituan’s LongCat team launched VitaBench—a new, high-difficulty benchmark designed specifically to evaluate the performance of LLM agents in real-world applications.
You can think of VitaBench as an extremely realistic “life simulator.” It’s no longer about theory; it throws AI directly into the three major life scenarios we are most familiar with:
- Food Delivery
- In-store Consumption
- Online Travel Services
How complex is this simulated environment? It integrates 66 different tools, covering nearly every operation a user might need, from querying store information and making reservations to placing orders and paying.
Not Just Single Tasks, but a Continuous “Cross-Scenario” Challenge
The core challenge of VitaBench lies in its task design. It not only has 300 single-scenario tasks but also 100 extremely challenging “cross-scenario tasks.”
What does this mean? For example, a real user request might be: “Help me book a hotel with a river view, and for the night of check-in, find a well-rated, non-spicy restaurant near the hotel with a budget of $200.”
This task requires the AI agent to:
- Understand Complex Intent: Not just booking a hotel, but also a restaurant, and the two are related.
- Cross-Temporal and Spatial Reasoning: Needs to handle check-in dates, dinner times, and the geographical relationship between the hotel and the restaurant.
- Flexible Use of Tools: Must first use the “hotel booking tool” and then use the “restaurant search tool” based on the results.
- Proactive Clarification: If the user’s instructions are vague, the AI needs to ask follow-up questions, such as “What type of cuisine would you prefer for the restaurant?”
- Track Dynamic Intent: In a multi-turn conversation, the user might change their mind, and the AI needs to keep up.
Honestly, this is a bit complicated even for humans, let alone AI.
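To make the chaining concrete, here is a minimal Python sketch of the dependency at the heart of such a task: the restaurant search cannot run until the hotel result is known, because its location argument comes from the hotel. The tool names (`search_hotels`, `search_restaurants`) and the data are hypothetical stand-ins for illustration, not VitaBench's actual tool API:

```python
def search_hotels(view: str) -> list[dict]:
    # Stand-in for a real hotel-search tool (hypothetical data).
    return [{"name": "Riverside Inn", "view": view, "district": "Old Town"}]

def search_restaurants(near: str, max_price: float, spicy: bool) -> list[dict]:
    # Stand-in for a real restaurant-search tool (hypothetical data).
    options = [
        {"name": "Bamboo Garden", "district": "Old Town",
         "price": 180, "spicy": False, "rating": 4.6},
        {"name": "Fire Wok", "district": "Old Town",
         "price": 150, "spicy": True, "rating": 4.8},
    ]
    return [r for r in options
            if r["district"] == near
            and r["price"] <= max_price
            and r["spicy"] == spicy]

def plan_trip(budget: float) -> dict:
    # Step 1: the hotel must be chosen first...
    hotel = search_hotels(view="river")[0]
    # Step 2: ...because the restaurant query depends on its location.
    candidates = search_restaurants(near=hotel["district"],
                                    max_price=budget, spicy=False)
    best = max(candidates, key=lambda r: r["rating"])
    return {"hotel": hotel["name"], "restaurant": best["name"]}

print(plan_trip(budget=200))
```

A real agent would additionally have to interleave clarification questions ("What cuisine do you prefer?") and re-plan when the user changes their mind mid-conversation, which is exactly where current models struggle.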
The Brutal Report Card: Top AIs “Fall” One After Another
So, how did the most powerful AI models of today perform in this ultimate test?
The results are quite shocking.
Thinking Models
| Rank | Model | Avg@4 | Cross-Scenario (Pass) | Cross-Scenario (Pass@4) | Single-Scenario (Avg@4) |
|---|---|---|---|---|---|
| 1 | o3 (high) | 30.0 | 61.0 | 6.0 | 53.5 |
| 2 | Claude-4.1-Opus (w/ thinking) | 29.0 | 56.0 | 6.0 | 47.5 |
| 3 | LongCat-Flash-Thinking | 24.3 | 54.0 | 3.0 | 42.3 |
| 4 | Gemini-2.5-Pro | 23.5 | 53.0 | 5.0 | 49.0 |
| 5 | Claude-4-Sonnet (w/ thinking) | 23.0 | 51.0 | 6.0 | 46.0 |
| 6 | GPT-5 (high) | 22.8 | 51.0 | 3.0 | 54.0 |
| 7 | GLM-4.5 (w/ thinking) | 22.8 | 48.0 | 2.0 | 44.5 |
| 8 | o4-mini (high) | 19.5 | 49.0 | 1.0 | 44.5 |
| 9 | Qwen3-235B-A22B-Thinking-2507 | 18.8 | 45.0 | 2.0 | 44.0 |
| 10 | Doubao-Seed-1.6-Thinking | 17.0 | 42.0 | 1.0 | 30.3 |
| 11 | DeepSeek-R1-0528 | 14.5 | 39.0 | 0.0 | 40.3 |
| 12 | Gemini-2.5-Flash (think on) | 5.3 | 24.0 | 0.0 | 32.0 |
| 13 | Qwen3-32B (w/ thinking) | 5.0 | 47.0 | 3.0 | 22.8 |
Non-thinking Models
| Rank | Model | Avg@4 | Cross-Scenario (Pass) | Cross-Scenario (Pass@4) | Single-Scenario (Avg@4) |
|---|---|---|---|---|---|
| 1 | Claude-4.1-Opus (w/o thinking) | 21.8 | 47.0 | 3.0 | 46.0 |
| 2 | Claude-4-Sonnet (w/o thinking) | 21.3 | 49.0 | 4.0 | 39.0 |
| 3 | LongCat-Flash-Chat | 20.3 | 45.0 | 2.0 | 39.5 |
| 4 | GLM-4.5 (w/o thinking) | 20.0 | 47.0 | 1.0 | 45.8 |
| 5 | Qwen3-Max | 18.5 | 47.0 | 3.0 | 37.2 |
| 6 | DeepSeek-V3.2-Exp (w/o thinking) | 17.7 | 41.0 | 2.0 | 36.2 |
| 7 | DeepSeek-V3.1 (w/o thinking) | 16.3 | 40.0 | 1.0 | 34.0 |
| 8 | Kimi-K2-0905 | 15.5 | 39.0 | 2.0 | 35.3 |
| 9 | Qwen3-235B-A22B-Instruct-2507 | 14.3 | 38.0 | 0.0 | 34.3 |
| 10 | GPT-4.1 | 13.8 | 35.0 | 0.0 | 37.8 |
| 11 | Doubao-Seed-1.6 | 10.5 | 29.0 | 0.0 | 37.8 |
| 12 | Gemini-2.5-Flash (think off) | 5.8 | 17.0 | 1.0 | 31.0 |
| 13 | Qwen3-32B (w/o thinking) | 4.0 | 12.0 | 0.0 | 16.5 |
| 14 | GPT-5 (minimal) | 4.0 | 9.0 | 0.0 | 30.0 |
| 15 | DeepSeek-V3-0324 | 3.8 | 12.0 | 0.0 | 25.3 |
According to the leaderboard published with VitaBench, the data reveals a huge performance gap:
- On the relatively simple 300 single-scenario tasks, even the best-performing models score only around 50%.
- On the 100 complex cross-scenario tasks, the success rate of the strongest models plummets to roughly 30%!
This report card clearly tells us that current LLM agents have significant shortcomings in the following areas:
- Difficulty in Domain Switching: An AI that is good at handling travel bookings can easily “crash” when asked to handle dining issues at the same time.
- Tool Selection Obstacles: Faced with 66 tools, AI often doesn’t know when and which one is the most appropriate to use.
- Lack of Long-term Coordination: Handling long-term tasks that require multiple steps and span several rounds of conversation remains a huge challenge for AI.
What Does This Mean for Our Future?
The emergence of VitaBench is not meant to undermine our confidence in AI. On the contrary, it acts like a mirror, truthfully reflecting the current technological deficiencies and pointing the way forward for the entire industry.
This research tells us that to make AI agents truly reliable assistants in our lives, we must not only focus on improving the language capabilities of the models but also train their ability to reason, plan, and execute tasks in complex, dynamic environments.
VitaBench gives developers a valuable resource for testing and improving their AI agents in an environment much closer to reality. The current 30% ceiling may look low, but this is precisely the period of building strength before a technological takeoff.
Frequently Asked Questions about VitaBench
Q1: What exactly is VitaBench? A: VitaBench is a high-difficulty evaluation benchmark developed by the Meituan LongCat team, specifically designed to assess the ability of Large Language Model (LLM) agents to perform complex interactive tasks in simulated real-world scenarios (such as delivery, travel).
Q2: Why do we need evaluation tools like VitaBench? A: Because existing evaluation tools are mostly oversimplified and cannot reflect the complexity of real-world tasks. VitaBench provides a “testing ground” that is closer to reality, which can effectively test the true capabilities of AI agents in handling multiple goals, dynamic information, and complex toolsets, thereby promoting the practical application and development of the technology.
Q3: Which AI models are currently performing best on VitaBench? A: According to the published leaderboard, in the most challenging cross-scenario tasks, models such as o3 (high), Claude-4.1-Opus (w/ thinking), and LongCat-Flash-Thinking are in the lead, but even so, their highest average success rate is only around 30%.
Q4: How can I learn more about or use VitaBench? A: The VitaBench project is open source. You can visit its official website to view the detailed research paper, dataset, and leaderboard. Developers can also find the relevant code and resources on its GitHub page.


