
Beyond Gold: Google DeepMind Launches IMO-Bench, Setting a New Benchmark for AI Math Reasoning

November 5, 2025

After its Gemini model reached the gold-medal standard at the International Mathematical Olympiad (IMO), Google DeepMind has officially released IMO-Bench. More than an evaluation tool, it is a benchmark designed to push AI from “problem-solving” to “deep reasoning,” aiming to lead the field into a new era of more robust and creative mathematical reasoning.


What should we focus on after AI wins gold in math competitions?

In July 2025, the field of artificial intelligence reached a historic milestone: Google DeepMind’s advanced Gemini model, equipped with Deep Think technology, performed at the gold-medal standard in the International Mathematical Olympiad (IMO).

However, the significance of this victory goes far beyond achieving excellent results on IMO-level problems. The real goal is to build a system capable of deep, robust mathematical reasoning. After all, simply providing the correct answer is not enough; understanding and proving “why” is the key to true intelligence.

Guided by this philosophy, Google DeepMind launched IMO-Bench, a suite of advanced reasoning benchmarks, at the EMNLP 2025 conference. The suite not only played a core role in Gemini’s journey to gold but also aims to open a new door for mathematical reasoning across the AI community.

So, what exactly is IMO-Bench?

Simply put, IMO-Bench is a set of “test questions” designed specifically to evaluate the mathematical abilities of AI models. But this is no ordinary exam: every question has been rigorously vetted by a panel of 10 IMO gold medalists and 5 silver medalists.

IMO problems are difficult because they demand not only rigorous multi-step reasoning but also creativity that goes beyond formulaic frameworks. That dual demand is precisely the core of IMO-Bench: it cares not only about whether AI can calculate the answer, but about whether AI can “think.”

IMO-Bench is composed of three parts, each with its own focus (a rough data sketch follows the list):

  1. IMO-AnswerBench: A large-scale test of 400 problems, evaluating a model’s ability to produce correct final answers.
  2. IMO-ProofBench: An advanced evaluation of 60 problems, testing a model’s ability to write rigorous proofs.
  3. IMO-GradingBench: A set of 1,000 cases for driving progress in automatically evaluating long-form answers.
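
To make that structure concrete, here is a minimal sketch of how records from each component might be represented. The field names are illustrative assumptions, not DeepMind’s published schema.

```python
from dataclasses import dataclass

# Hypothetical record types for the three IMO-Bench components.
# All field names are assumptions for illustration only.

@dataclass
class AnswerBenchItem:        # IMO-AnswerBench: 400 problems
    problem: str              # problem statement
    answer: str               # short, checkable final answer

@dataclass
class ProofBenchItem:         # IMO-ProofBench: 60 problems
    problem: str              # problem requiring a full proof
    reference_proof: str      # expert-written reference solution
    subset: str               # "basic" or "advanced"

@dataclass
class GradingBenchItem:       # IMO-GradingBench: 1,000 cases
    problem: str              # problem statement
    candidate_solution: str   # long-form answer to be graded
    human_grade: float        # expert score the grader should predict
```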

With this benchmark, DeepMind hopes to shift the community’s focus from the “final answer” alone to the more critical “proof process” itself, enabling a more rigorous evaluation of AI’s reasoning capabilities.

Beyond Standard Answers: The Challenge of IMO-ProofBench

In the past, we often evaluated AI’s mathematical ability only by the accuracy of its answers. But this is far from enough. A system that truly understands mathematics must be able to construct rigorous and valid mathematical arguments.

To this end, IMO-Bench includes IMO-ProofBench, which raises evaluation to a new level. This benchmark contains 60 proof-based problems, divided into two subsets:

  • Basic Set: Problems ranging from pre-IMO to medium difficulty, for evaluating models earlier in their development.
  • Advanced Set: New and extremely challenging problems, simulating the hardest problems on a real IMO exam.

Test results show significant differences in performance among different models. On the basic set, Gemini Deep Think (IMO Gold) achieved a high score of 89.0%, but most models still scored below 60%.

On the more challenging advanced set, the gap is even more pronounced. All non-Gemini models scored below 25%, while Gemini Deep Think achieved the current state-of-the-art 65.7%. Although this achievement is a huge leap, it also shows that even the most powerful models still have a long way to go on the road to perfect mathematical reasoning.

Can AI grade AI’s papers? The Birth of ProofAutoGrader

Although human expert evaluation is the gold standard for verifying mathematical proofs, its high time and labor costs limit the feasibility of large-scale research.

To solve this problem, the DeepMind team built ProofAutoGrader, an automatic grading tool based on Gemini 2.5 Pro. It works by giving the model the problem statement, a candidate solution, a reference solution, and specific grading guidelines, then letting it score the proof automatically.
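
The underlying pattern is the familiar “LLM as judge.” Below is a minimal sketch of such a pipeline using the public google-genai Python SDK; the prompt wording, the 0–7 scale (the scale used for real IMO problems), and the score parsing are assumptions for illustration, not DeepMind’s actual implementation.

```python
import re

from google import genai  # public Gemini SDK: pip install google-genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment


def autograde_proof(problem: str, candidate: str, reference: str,
                    guidelines: str) -> int:
    """Score a candidate proof with Gemini, LLM-as-judge style.

    Illustrative sketch only: the prompt and the 0-7 scale are
    assumptions, not DeepMind's actual ProofAutoGrader setup.
    """
    prompt = (
        "You are grading a mathematical proof.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference}\n\n"
        f"Grading guidelines:\n{guidelines}\n\n"
        f"Candidate solution:\n{candidate}\n\n"
        "Check each step for rigor and correctness, then output a final "
        "line of the form 'SCORE: <integer from 0 to 7>'."
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro", contents=prompt
    )
    match = re.search(r"SCORE:\s*(\d)", response.text)
    return int(match.group(1)) if match else 0
```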

The results are encouraging. When testing 14 public models, ProofAutoGrader’s scores were highly correlated with human expert scores, with Pearson correlation coefficients reaching an astonishing 0.96 and 0.93 on the basic and advanced sets, respectively. This means that AI automatic grading is not only feasible but also quite reliable, paving the way for future large-scale, scalable AI reasoning research.
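
For context, that agreement figure is an ordinary Pearson correlation over paired (human, automatic) scores. A minimal check, with purely hypothetical grades on the 0–7 scale, might look like this:

```python
from scipy.stats import pearsonr

# Hypothetical paired grades (illustrative only, not real data).
human_scores = [7, 0, 3, 7, 1, 5, 2, 6]
auto_scores = [7, 1, 3, 6, 1, 5, 2, 7]

r, p_value = pearsonr(human_scores, auto_scores)
print(f"Pearson r = {r:.2f}")  # values near 1.0 mean strong agreement
```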

Seeing the Real Gap in AI Reasoning from the Leaderboard

| Model | Advanced ProofBench (overall) | Novel | IMO 2024† | USAMO 2025 | Query date |
| --- | --- | --- | --- | --- | --- |
| Gemini Deep Think (IMO Gold) | 65.7% | 61.1% | 76.2% | 69.0% | 2025-08-02 |
| Gemini Deep Think (IMO lite) | 37.6% | 31.7% | 40.5% | 52.4% | 2025-08-20 |
| Gemini 2.5 Pro with (Huang & Yang, 2025) | 24.8% | 17.5% | 19.1% | 52.4% | 2025-07-14 |
| Grok 4 (heavy) | 23.3% | 11.1% | 7.1% | 76.2% | 2025-07-12 |
| o3 | 20.5% | 15.1% | 4.8% | 52.4% | 2025-08-04 |
| GPT-5 | 20.0% | 15.9% | 33.3% | 19.0% | 2025-09-18 |
| Grok 4 | 18.6% | 17.5% | 16.7% | 23.8% | 2025-08-20 |
| Gemini 2.5 Pro | 17.6% | 15.9% | 7.1% | 33.3% | 2025-08-04 |
| o4-mini (high reasoning) | 11.4% | 8.7% | 7.1% | 23.8% | 2025-08-04 |
| Kimi-K2-Instruct | 7.1% | 4.0% | 2.4% | 21.4% | 2025-08-21 |
| Qwen3-235B | 5.2% | 7.1% | 0.0% | 4.8% | 2025-08-21 |
| Claude Sonnet 4 | 4.8% | 6.4% | 2.4% | 2.4% | 2025-09-17 |
| DeepSeek V3 | 4.3% | 6.3% | 2.4% | 0.0% | 2025-09-16 |
| DeepSeek R1 | 3.8% | 6.4% | 0.0% | 0.0% | 2025-09-16 |
| Claude Opus 4 | 2.9% | 0.0% | 2.4% | 11.9% | 2025-08-04 |

The IMO-Bench leaderboard reveals an interesting phenomenon: some models may have an “overfitting” problem.

For example, the Grok 4 (heavy) model scored as high as 76.2% on USAMO 2025 problems, but only 11.1% on new, unseen problems. This indicates that its strong performance may be overly reliant on specific datasets.

In contrast, Gemini Deep Think (IMO Gold) scored 69.0% and 61.1% on USAMO problems and new problems, respectively, showing its more general reasoning ability without overfitting to specific data.
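
One simple way to read the table is to compare each model’s score on known competition problems (USAMO 2025) with its score on novel problems; a large positive gap hints at overfitting. A quick sketch using the figures above:

```python
# (USAMO 2025 %, Novel %) pairs taken from the leaderboard above.
scores = {
    "Gemini Deep Think (IMO Gold)": (69.0, 61.1),
    "Grok 4 (heavy)": (76.2, 11.1),
    "GPT-5": (19.0, 15.9),
}

for model, (usamo, novel) in scores.items():
    gap = usamo - novel  # large positive gap suggests overfitting
    print(f"{model}: USAMO {usamo:.1f}% vs novel {novel:.1f}% "
          f"(gap {gap:+.1f} points)")
```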

This also highlights the value of IMO-ProofBench: it measures not only a model’s peak performance but also how general and robust that capability is, helping researchers build a more complete picture of a model’s mathematical abilities.

Future Outlook: Jointly Promoting AI’s Mathematical Thinking

Google DeepMind has opened IMO-Bench, together with its rich evaluation data, to the entire community, hoping to inspire further innovation and collaboration.

With a more rigorous and comprehensive evaluation standard, researchers can measure model progress more accurately and focus on developing AI systems with genuine creativity and deep understanding. This matters not just for mathematics, but for every field that requires complex reasoning.

Want to know more details about these benchmarks and results? You can check their official paper, dataset, and leaderboard. The next chapter of AI mathematical reasoning is waiting for us to write together.
