Meituan's LongCat Releases a New Inference Model! Flash-Thinking Demonstrates Strength in Multiple Benchmarks, Setting a New Standard for Open-Source Models

Meituan’s LongCat team has launched a new high-efficiency inference model, LongCat-Flash-Thinking, which has reached the top level of open-source models in multiple fields such as logic, mathematics, and code. This article will provide an in-depth analysis of its performance, efficiency advantages, and significance to the AI developer community.


The pace of artificial intelligence development is so fast it’s hard to keep up, especially in the field of large language models (LLMs), where amazing new technologies emerge almost constantly. Recently, Meituan’s LongCat team brought some big news, officially announcing their new high-efficiency inference model—LongCat-Flash-Thinking.

This is not just a minor update. This model not only inherits the extreme speed of its predecessor, LongCat-Flash-Chat, but also achieves a huge leap in “thinking” ability. Comprehensive evaluations show that it has reached the state-of-the-art (SOTA) level among global open-source models in logic, mathematics, code generation, and even complex agent tasks.

So, what makes LongCat-Flash-Thinking so strong?

Simply put, it is a smarter, more specialized thinker.

In the past, many models might perform well on a single task, but they seemed to struggle with complex problems that require deep thinking and multi-step reasoning. LongCat-Flash-Thinking attempts to break this deadlock. Its most prominent feature is that it is the first language model in China to integrate “deep thinking + tool calling” and “informal + formal” reasoning capabilities.

This sounds a bit technical, but we can understand it this way:

  • Deep Thinking + Tool Calling: It can not only perform complex logical reasoning like a human, but also autonomously and intelligently call external tools (such as calculators, code interpreters) to assist itself, just like an expert who knows how to use tools to solve problems.
  • Informal + Formal Reasoning: It can understand our daily natural language conversations (informal) and also handle rigorous mathematical theorem proofs (formal), making its application range wider.
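To make the tool-calling idea concrete, here is a minimal sketch in Python. This is purely illustrative: `calculator_tool` and `answer` are invented names for this sketch, not LongCat's actual API, and the routing logic stands in for the model's own decision to invoke a tool.

```python
import ast
import operator

# Illustrative sketch of "deep thinking + tool calling" (not LongCat's API).
# The reasoning loop decides when it cannot answer directly and delegates to
# an external tool -- here, a tiny safe arithmetic evaluator standing in for
# a calculator tool.

def calculator_tool(expression: str) -> float:
    """A stand-in external tool: safely evaluate basic arithmetic."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def ev(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)

def answer(question: str) -> str:
    """Toy reasoning loop: route arithmetic questions to the calculator tool."""
    if question.startswith("compute:"):
        value = calculator_tool(question.removeprefix("compute:").strip())
        return f"tool result: {value}"
    return "answered directly from model knowledge"

print(answer("compute: 12 * (3 + 4)"))  # -> tool result: 84
```

In a real agent loop, the model itself emits a structured tool call, an external executor runs it, and the result is fed back into the model's context; the sketch above compresses that round trip into one function.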

Put plainly, LongCat-Flash-Thinking's advantages show most clearly on genuinely demanding tasks: difficult math-competition problems, complex code debugging, or agent tasks that require multi-step planning.

Not just talk: what does the data say?

Of course, empty words are not enough. The strength of a model ultimately depends on its performance in standardized tests. From the data chart released by the LongCat team, LongCat-Flash-Thinking has indeed delivered an impressive report card.

In a series of benchmark tests covering code, mathematics, and logical reasoning, it competed with the world’s top models, including closed-source giants like GPT-5-Thinking and Gemini-2.5 Pro, as well as other excellent open-source models.

| Benchmark (Metric) | LongCat-Flash-Thinking | DeepSeek-V2.1-Thinking | Qwen1.5-32B-A22B-Thinking-S207 | GLM-4.5 | OpenAI o1 mini | Gemini-2.5 Pro | GPT-5-Thinking |
|---|---|---|---|---|---|---|---|
| LiveCodeBench (Mean@4) | 79.4 | 80.6 | 73.5 | 75.4 | 61.1 | 76.2 | 74.2 |
| OJBench (Pass@1) | 40.7 | 33.6 | 32.1 | 19.0 | 38.4 | 41.6 | 34.1 |
| AIME-24 (Mean@32) | 93.3 | 93.9 | 89.3 | 91.6 | 90.7 | 92.0 | - |
| HMMT-25 (Mean@32) | 83.7 | 80.4 | 76.3 | 71.9 | 79.3 | 83.8 | - |
| τ²-Bench (Average Mean@4) | 74.0 | - | 63.8 | 44.4 | 57.8 | 67.6 | 80.1 |
| VitaBench (Pass@1) | 29.5 | 21.5 | 13.5 | 26.8 | 35.3 | 29.3 | 24.3 |
| MiniF2F-Test (Pass@32) | 81.0 | 79.5 | 26.6 | 27.0 | 37.7 | 41.8 | 51.2 |
| ARC-AGI (Pass@1) | 50.3 | 37.5 | 45.3 | 21.4 | 47.3 | 46.8 | 59.0 |

Let’s look at a few key test items:

  • In the OJBench test, which examines code-generation ability, LongCat-Flash-Thinking scored 40.7, the best result among the open-source models listed and second only to Gemini-2.5 Pro (41.6).
  • In the MiniF2F-Test for formal mathematical reasoning, it led the pack with a score of 81.0, well clear of the next-best 79.5.
  • In the highly challenging math competitions AIME-24 and HMMT-25, its scores (93.3 and 83.7) were comparable to those of top closed-source models such as Gemini-2.5 Pro (92.0 and 83.8).
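For readers unfamiliar with formal benchmarks: MiniF2F poses math-competition problems as machine-checkable theorem statements, so the model must produce a proof a proof assistant will accept. A toy example in Lean 4 of the kind of formal statement involved (invented for illustration, not an actual MiniF2F problem):

```lean
-- A toy formal statement and its proof in Lean 4 (illustrative only):
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Unlike an informal natural-language answer, a proof like this either type-checks or it doesn't, which is what makes the "formal reasoning" setting so unforgiving.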

How does this compare to GPT-5 or Gemini?

One noteworthy detail: although a slight gap remains against top closed-source models like GPT-5-Thinking on some comprehensive scores (74.0 vs. 80.1 on τ²-Bench, for example), LongCat-Flash-Thinking has firmly established itself in the top tier of open-source models. For the AI community this is an important milestone, because it means developers and researchers can access world-class reasoning capability with a much lower barrier to entry.

Powerful performance, but what about cost?

For developers, model performance is certainly important, but operating efficiency and cost are also key considerations. This is another major highlight of LongCat-Flash-Thinking.

It is not only smart, but also “frugal.”

According to official figures, on the AIME-24 math-competition test LongCat-Flash-Thinking matched top-tier accuracy while using 64.5% fewer tokens. That translates directly into lower compute costs and faster response times.
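To make the 64.5% figure concrete, a quick back-of-the-envelope calculation. The baseline token count below is a made-up illustration, not an official LongCat number:

```python
# What a 64.5% token reduction means in practice.
# The baseline below is hypothetical, chosen only to illustrate the math.

def tokens_after_reduction(baseline_tokens: int, reduction: float = 0.645) -> int:
    """Tokens consumed after applying the stated percentage reduction."""
    return round(baseline_tokens * (1 - reduction))

baseline = 20_000  # hypothetical tokens spent on one hard AIME-style problem
print(tokens_after_reduction(baseline))  # -> 7100
```

Since inference is typically billed per token, the same proportional saving carries straight through to the compute bill.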

Behind this efficiency is the team's infrastructure work. The asynchronous reinforcement learning (Async RL) framework they adopted trains roughly 3x faster than a traditional synchronous framework, letting the model iterate and improve more quickly while also giving users a more efficient inference experience.
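As a rough intuition for why asynchrony helps, here is a toy sketch of the async data flow: a producer pushes rollouts into a queue while the learner consumes them as they arrive, so neither side idles waiting for a full synchronized batch. Everything here (function names, payloads, the update step) is illustrative and unrelated to LongCat's actual framework:

```python
import queue
import threading

# Toy sketch of asynchronous RL data flow (illustrative only).
# An actor thread generates rollouts while the learner consumes them
# concurrently, instead of both sides alternating in lockstep.

def actor(rollout_queue: queue.Queue, n_rollouts: int) -> None:
    """Produce rollouts (dummy payloads here) without waiting for the learner."""
    for i in range(n_rollouts):
        rollout_queue.put({"id": i, "reward": float(i % 3)})
    rollout_queue.put(None)  # sentinel: no more rollouts

def learner(rollout_queue: queue.Queue) -> int:
    """Consume rollouts as they arrive; each one triggers a (toy) update."""
    updates = 0
    while True:
        rollout = rollout_queue.get()
        if rollout is None:
            break
        updates += 1  # stand-in for a gradient step on this rollout
    return updates

def run_async_training(n_rollouts: int = 8) -> int:
    q = queue.Queue(maxsize=4)  # bounded queue applies back-pressure
    t = threading.Thread(target=actor, args=(q, n_rollouts))
    t.start()
    updates = learner(q)
    t.join()
    return updates

print(run_async_training())  # -> 8
```

In a real Async RL system the payoff is that slow rollout generation (sampling from the model) overlaps with training steps instead of blocking them, which is where speedups like the reported 3x come from.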

Experience it firsthand, embrace the power of open source

Of course, the best way to judge is to try it yourself. The Meituan LongCat team has fully open-sourced LongCat-Flash-Thinking on multiple platforms, a clear signal of their commitment to advancing open AI development.

Whether you are an AI researcher, an application developer, or simply curious about cutting-edge technology, you can access this powerful model through the platforms where it has been released.

In summary, the release of LongCat-Flash-Thinking is not only a major technological breakthrough for Meituan in the AI field, but also a generous gift to the global open-source community. It proves that open-source models are also capable of challenging and even surpassing top-level performance on the most complex reasoning tasks, while also taking into account efficiency and cost. This will undoubtedly inspire the birth of more innovative applications and is worthy of our continued attention.

© 2025 Communeify. All rights reserved.