
Meituan's Meeseeks Emerges: A Major Test of AI Models' 'Obedience' - Who Can Pass the Ultimate Challenge?

September 2, 2025

Is AI not ‘obedient’ enough? Meituan has released a new instruction-following evaluation benchmark, Meeseeks. Through a unique multi-turn error correction mechanism, it deeply evaluates whether AI models can truly understand and execute complex instructions. This article will take you deep into Meeseeks’ three-layer evaluation framework, its technical principles, and why it is crucial for the development of AI.


Have you ever had this experience? You meticulously give a series of instructions to an AI assistant, hoping it will generate a piece of copy that meets a specific format, tone, and even rhyme scheme, only to receive an answer that is completely off the mark. This kind of “talking past each other” dilemma is a common challenge faced by many powerful language models today - they are knowledgeable, but not necessarily “obedient.”

To solve this problem, Meituan’s research team has launched a new instruction-following ability evaluation benchmark called Meeseeks. It’s like an ultra-difficult driver’s license test designed for AI, not only testing the model’s basic abilities but also focusing on its adaptability and self-correction capabilities in continuous multi-turn conversations.

This is not just a simple benchmark test; it simulates the real-world scenarios of our interactions with AI: we make a request, the AI responds, and then we provide feedback based on the response, asking it to make corrections. So, how does Meeseeks work? And how will it drive the evolution of AI models?

So, what exactly is Meeseeks?

Simply put, Meeseeks is a benchmark test specifically designed to evaluate the “instruction-following” ability of AI models. Its biggest difference from other evaluations is its specially designed multi-turn scenario.

Imagine that a traditional evaluation is like an exam with only one chance to answer; if you get it wrong, it’s over. But Meeseeks is more like a patient teacher. If the model fails to fully satisfy all instructions in the first round of answers, the evaluation framework will automatically generate structured feedback, clearly pointing out what was done wrong, and then ask the model to “correct the answer based on the feedback.”

This process is not just an evaluation, but also a test of the model’s adaptability, instruction adherence, and iterative improvement potential. This is also its core feature - a built-in “self-correction loop.”
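
To make the loop concrete, here is a minimal Python sketch of how such an evaluate-and-retry cycle could work. Everything here (the function name, the `(description, predicate)` check pairs, the feedback wording) is a hypothetical illustration, not the framework’s actual API:

```python
def evaluate_multi_turn(model, instruction, checks, max_turns=3):
    """Query the model, verify each constraint, and feed failures back."""
    prompt = instruction
    answer = None
    for turn in range(1, max_turns + 1):
        answer = model(prompt)
        # Each check is a (description, predicate) pair; collect what failed.
        failed = [desc for desc, ok in checks if not ok(answer)]
        if not failed:
            return {"turn": turn, "answer": answer, "passed": True}
        # Turn the failures into structured feedback for the next attempt.
        feedback = "Unmet requirements:\n" + "\n".join(f"- {d}" for d in failed)
        prompt = f"{instruction}\n\n{feedback}\nPlease correct your answer."
    return {"turn": max_turns, "answer": answer, "passed": False}

# Toy usage: a stub "model" and a single word-count check.
checks = [("word count limit of 5 words", lambda a: len(a.split()) <= 5)]
print(evaluate_multi_turn(lambda p: "Short answer here.", "Reply briefly.", checks))
```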

Three-layer evaluation framework: How Meeseeks “interrogates” AI

To evaluate models comprehensively and objectively, Meeseeks has designed a sophisticated “three-level capability” evaluation framework. This framework progresses from shallow to deep, layer by layer, ensuring that only the most “obedient” models can stand out.

Level 1 Capability: Do you understand my core meaning?

This is the most basic test, evaluating whether the model correctly understands the user’s core task intent.

  • Core task: Does the model know whether to “write a poem” or “write a review”?
  • Overall structure: If asked to generate a three-paragraph article, did the model actually produce three paragraphs?
  • Independent units: Does each sentence or paragraph in the article conform to the details of the instructions?

This layer ensures that the AI does not go off track from the very beginning.
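
As a rough illustration, the structural part of this layer can be checked with a few lines of Python (the `paragraph_count_ok` helper below is a hypothetical example, not Meeseeks’ actual code):

```python
def paragraph_count_ok(text: str, expected: int) -> bool:
    # Treat blank-line-separated chunks of text as paragraphs.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return len(paragraphs) == expected

assert paragraph_count_ok("First.\n\nSecond.\n\nThird.", expected=3)
```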

Level 2 Capability: Details determine success or failure

If the model passes the first level, it will then face more specific constraints. These are mainly divided into two categories:

  • Content constraints: For example, theme (about summer), style (light and humorous), language (Traditional Chinese), and word count (within 200 words).
  • Format constraints: Does it follow the specified template? Is the number of paragraphs or points correct?

This layer tests the model’s precise execution ability, not just a general understanding.
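
Constraints like these map naturally onto small, objective checks. The sketch below covers two of the examples named above with hypothetical helpers; the real framework’s checkers are more elaborate:

```python
def within_word_limit(text: str, limit: int) -> bool:
    return len(text.split()) <= limit

def mentions_theme(text: str, keyword: str) -> bool:
    return keyword.lower() in text.lower()

draft = "A light, humorous piece about lazy summer afternoons by the sea."
assert within_word_limit(draft, limit=200)
assert mentions_theme(draft, keyword="summer")
```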

Level 3 Capability: The ultimate challenge - subtle rules

This is the most difficult level, evaluating the model’s ability to follow highly fine-grained rules. These rules are often “counter-intuitive” and demand extremely tight control over the output. For example:

  • Rhyme: The end of every sentence must rhyme with the “an” sound.
  • Keyword avoidance: The word “but” is forbidden throughout the article.
  • No repetition: There can be no repeated sentences or words.
  • Symbol usage: Only periods and commas can be used.

Many models will “show their true colors” at this level, as it requires them to constantly monitor these subtle restrictions while generating content.
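
Here is what monitors for three of the rules above might look like in Python. These helpers are illustrative assumptions, not the project’s actual implementation:

```python
import re

def avoids_word(text: str, banned: str) -> bool:
    # True when the banned word never appears as a whole word.
    return re.search(rf"\b{re.escape(banned)}\b", text, re.IGNORECASE) is None

def only_allowed_punctuation(text: str, allowed: str = ".,") -> bool:
    # True when every non-alphanumeric, non-space character is whitelisted.
    return all(ch.isalnum() or ch.isspace() or ch in allowed for ch in text)

def no_repeated_sentences(text: str) -> bool:
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences) == len(set(sentences))

sample = "The sun rises, the tide turns. Gulls wheel overhead."
assert avoids_word(sample, "but")
assert only_allowed_punctuation(sample)
assert no_repeated_sentences(sample)
```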

Not just a one-time exam: Meeseeks’ “correction loop”

Meeseeks’ most fascinating part is its multi-turn error correction mode. If the AI’s answer in the first round is flawed - for example, it forgets the word count limit or uses the wrong symbols - the system will not directly judge it as a failure.

Instead, it will provide specific feedback like this: “Your answer did not meet the ‘word count limit of 200 words’ instruction, please modify it.” Then, the model has the opportunity to make a second or even third attempt based on this feedback.
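
A hypothetical helper that assembles feedback in exactly this style could be as simple as the following (the wording mirrors the example above; the framework’s real feedback format may differ):

```python
def format_feedback(failed_constraints):
    # One explicit, correctable sentence per unmet constraint.
    return "\n".join(
        f"Your answer did not meet the '{c}' instruction, please modify it."
        for c in failed_constraints
    )

print(format_feedback(["word count limit of 200 words"]))
```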

In the benchmark’s published results, top models such as Claude-3.7-Sonnet-thinking perform very well in multi-turn interactions, with scores that remain consistently high. In contrast, some models, such as GPT-4o-mini, perform acceptably in the first round, but their subsequent correction ability appears limited: their scores fall rather than rise over later turns. This difference is exactly what Meeseeks aims to reveal - a good AI should be not only smart, but also good at learning from feedback and correcting itself.

Why is Meeseeks important?

As AI technology develops at a rapid pace, it is no longer enough to pursue models that are merely “bigger” or “more knowledgeable.” What we need are tools that can collaborate with humans precisely. The emergence of Meeseeks brings at least two major benefits:

  1. Objective and measurable standards: It abandons vague criteria (such as “write it better”); every evaluation item can be judged objectively, which makes the results more accurate and credible.
  2. Pointing the way for model development: Its difficult test cases effectively separate models of different quality. Developers can see clearly where their models fall short and optimize them accordingly.

A brief analysis of the technical principles

You may be curious: how does Meeseeks automatically determine whether an AI’s answer is compliant? This relies on a set of mature technologies:

  • In the level 1 capability evaluation, it uses Natural Language Processing (NLP) technology to parse the user’s instructions and identify their core intent and structural requirements.
  • In the level 2 capability evaluation, it uses text analysis algorithms to check whether the generated content conforms to constraints such as word count and style.
  • At the most complex level 3, it uses tools such as regular expressions to precisely check for forbidden words, compliance with specific writing rules, and so on.
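
Tying the three levels together, an evaluator might keep a small registry of checks per level, roughly like this (a purely illustrative sketch; the open-source project defines its own structure):

```python
import re

# One sample check per capability level, keyed by a descriptive name.
CHECKS = {
    "level_1_structure": lambda t: len([p for p in t.split("\n\n") if p.strip()]) == 3,
    "level_2_word_count": lambda t: len(t.split()) <= 200,
    "level_3_forbidden_word": lambda t: re.search(r"\bbut\b", t, re.I) is None,
}

def run_checks(text):
    return {name: check(text) for name, check in CHECKS.items()}

print(run_checks("One.\n\nTwo.\n\nThree."))
```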

Want to try Meeseeks for yourself?

Meeseeks is an open-source project, which means any developer or researcher can use it to evaluate their own models. If you are interested, the code is available on GitHub and the dataset on Hugging Face.

In summary, Meeseeks is not just a new evaluation tool; it represents a new direction for AI development: from the pursuit of “erudition” to the pursuit of “precision” and “obedience.” When AI models learn how to better understand, follow, and learn from their mistakes, they can truly become reliable partners in our work and life.


Frequently Asked Questions (FAQ)

Q1: What is the difference between Meeseeks and other benchmarks?

A1: The main difference lies in the multi-turn error correction mechanism. Traditional evaluations are mostly “one-time,” while Meeseeks can provide specific feedback after the model makes a mistake and ask it to correct it. This can more realistically evaluate the model’s learning and adaptation capabilities. In addition, its evaluation standards are very objective and the difficulty is designed to be higher, which can effectively distinguish the subtle differences between top models.

Q2: Why is “multi-turn error correction” so important for AI models?

A2: Because real-world human-computer interaction is a process of continuous communication and correction. It is rare for users to give perfect instructions in one go, and the same is true for AI. A model that knows how to adjust itself based on feedback is far more practical than a model that only does “one-shot deals.” This ability is the key for AI to evolve from a “query tool” to an “intelligent collaborator.”

Q3: Is this evaluation framework open source? Can anyone use it?

A3: Yes, the Meeseeks project is completely open source. Researchers and developers can freely access its code on GitHub and download its dataset on Hugging Face to test and validate their own language models.
