
AI IQ Battle Turned Upside Down! Latest Data Reveals the Smartest Model Isn't What You Think

August 13, 2025
Updated Aug 13
5 min read

The AI intelligence race has taken a surprising turn! According to the latest published test data, top models from OpenAI, Google, and Anthropic trade wins and losses across different intelligence tests. This article presents the complete IQ ranking of 29 AI models and digs into the less obvious truths behind the numbers.

The “Olympics” of the AI World: Rules Are More Complex Than You Think

We are all accustomed to looking for a single champion. In the race of artificial intelligence, we also want to know: who is the smartest AI? A website called Tracking AI attempts to answer this question through regular intelligence tests. However, the latest test data shows that the answer is far more complex than a simple ranking.

This competition doesn’t have just one event, but at least two different “exam papers”: one is the Offline Test, and the other is the Mensa Norway test. Different AIs can perform vastly differently on these tests. It’s like an athlete who might be a 100-meter dash champion but may not necessarily win the marathon.

Complete AI IQ Ranking: Understand the True Strength of 29 Models at a Glance

This complete ranking, based on the latest data, lists the scores of each model in both tests. For ease of comparison, we have primarily sorted them by the Offline Test score, but be sure to pay attention to the surprising contrast in the Mensa Norway test.

| Rank (by Offline Test) | AI Model | Offline Test IQ | Mensa Norway IQ |
|---|---|---|---|
| 1 | OpenAI GPT-5 Pro (Vision) | 123 | 136 |
| 2 | Gemini 2.5 Pro | 118 | 137 |
| 3 | Claude-4 Opus | 118 | 117 |
| 4 | OpenAI GPT-5 Pro | 116 | 148 |
| 5 | OpenAI o3 | 116 | 135 |
| 6 | OpenAI o3 Pro | 109 | 133 |
| 7 | Claude-4 Sonnet | 107 | 119 |
| 8 | Grok-4 | 103 | 121 |
| 9 | OpenAI o3 Pro (Vision) | 100 | 104 |
| 10 | Gemini 2.5 Pro (Vision) | 99 | 96 |
| 11 | OpenAI o3 (Vision) | 97 | 94 |
| 12 | OpenAI GPT-5 | 93 | 115 |
| 13 | OpenAI o4 mini | 90 | 112 |
| 14 | Gemini 2.5 Flash Thinking | 90 | 87 |
| 15 | Claude-4 Sonnet (Vision) | 88 | 93 |
| 16 | OpenAI GPT-5 (Vision) | 87 | 67 |
| 17 | OpenAI o4 mini high | 87 | 99 |
| 18 | DeepSeek R1 | 86 | 101 |
| 19 | OpenAI o4 mini (Vision) | 84 | 79 |
| 20 | Claude-4 Opus (Vision) | 82 | 82 |
| 21 | Llama 4 Maverick | 82 | 100 |
| 22 | Llama 4 Maverick (Vision) | 82 | 75 |
| 23 | DeepSeek V3 | 79 | 92 |
| 24 | Mistral | 74 | 85 |
| 25 | GPT-4o | 69 | 85 |
| 26 | Grok-4 (Vision) | 68 | 82 |
| 27 | Bing Copilot | 67 | 86 |
| 28 | GPT-4o (Vision) | 65 | 64 |
| 29 | OpenAI GPT-5 Thinking | 64 | 79 |

Scores change over time; please refer to the Tracking AI website for the latest figures.


Insights Behind the Data: Do You Really Understand This Chart?

Just looking at the rankings is for amateurs; understanding the nuances is for experts. This seemingly simple table actually hides several very important insights:

1. The “Dual Standard” for the Champion's Throne: Who is the Real Number One?

If you only look at the Offline Test, OpenAI GPT-5 Pro (Vision) takes the top spot with a score of 123, seemingly the undisputed king of visual reasoning.

But shift your gaze to the Mensa Norway column. There, the text-only OpenAI GPT-5 Pro scores a staggering 148, not only far exceeding its own performance in the other test (116) but also the highest score overall! What does this mean? It means the title of “smartest” completely depends on which ruler you use to measure. One model might be king in a test requiring visual-spatial abilities, while another is the champion in a test of abstract logic or verbal reasoning.

2. Do AIs Also “Specialize”? The Two Tests Are Vastly Different

The huge score difference for the same model in the two tests reveals that they have a clear tendency to “specialize.” For example:

  • OpenAI GPT-5 Pro: 116 on Offline Test, 148 on Mensa Norway, a full 32-point difference!
  • Gemini 2.5 Pro: 118 on Offline Test, 137 on Mensa Norway, also a 19-point difference.

This strongly suggests that the Offline Test and the Mensa Norway test have completely different focuses. The former may emphasize concrete reasoning abilities like pattern recognition and spatial relationships, which is why vision models generally perform well on it. The latter may lean more towards the abstract logic, numerical patterns, and verbal reasoning found in traditional IQ tests, allowing top-tier text-only models to shine.
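As a minimal illustration (not part of the original analysis), the "specialization" gap can be computed directly from rows transcribed from the table above. A positive gap means the model is stronger on Mensa Norway than on the Offline Test:

```python
# A few rows transcribed from the ranking table: (Offline Test IQ, Mensa Norway IQ).
scores = {
    "OpenAI GPT-5 Pro": (116, 148),
    "Gemini 2.5 Pro": (118, 137),
    "Claude-4 Opus": (118, 117),
    "Llama 4 Maverick": (82, 100),
}

# Positive gap = stronger on Mensa Norway; negative = stronger on the Offline Test.
gaps = {model: mensa - offline for model, (offline, mensa) in scores.items()}

# List models from most Mensa-leaning to most Offline-leaning.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:+d}")
```

Run on these four rows, the sketch reproduces the gaps cited above: +32 for GPT-5 Pro, +19 for Gemini 2.5 Pro, while Claude-4 Opus is nearly balanced at -1.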

3. The Vision-Language Divide: Different Faces of the Same Model

This data also shows us the “modality gap” in AI capabilities. Take Gemini 2.5 Pro as an example. Its text-only model scored near the top of both tests (118 and 137), but its vision model’s scores dropped to 99 and 96. This indicates that even when the underlying technology is the same, versions optimized for different tasks (processing text vs. processing images) can show significant differences in performance.

4. Hidden Dark Horses and Underrated Contenders

If you only look at the top three, you’ll miss many interesting details.

  • Llama 4 Maverick’s Offline Test score is only 82, which seems unremarkable, but its Mensa Norway score reaches 100, surpassing many models ranked above it.
  • DeepSeek R1 is similar, with a very respectable Mensa Norway score (101).

This shows that some open-source or second-tier models may not be inferior in specific reasoning abilities; they just haven’t been extremely optimized for all areas. For users with specific needs, these “specialized” contenders might offer better value.
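To make the dark horses visible, one can simply re-rank by the Mensa Norway column instead of the Offline Test. A minimal sketch, again using rows transcribed from the table above:

```python
# Rows transcribed from the table: (model, Offline Test IQ, Mensa Norway IQ).
rows = [
    ("Grok-4", 103, 121),
    ("OpenAI o3 (Vision)", 97, 94),
    ("DeepSeek R1", 86, 101),
    ("Llama 4 Maverick", 82, 100),
    ("Claude-4 Opus (Vision)", 82, 82),
]

# Re-rank by the Mensa Norway score (third field), highest first.
by_mensa = sorted(rows, key=lambda r: r[2], reverse=True)
for name, offline, mensa in by_mensa:
    print(f"{name}: Offline {offline}, Mensa {mensa}")
```

In this subset, DeepSeek R1 and Llama 4 Maverick jump past OpenAI o3 (Vision), even though both sit well below it in the Offline Test ordering.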

Conclusion: There Is No Single Champion, Only a More Suitable Tool

In conclusion, this latest, more realistic data tells us an important truth: in the world of AI, there is no single, all-powerful champion.

Simplifying an AI’s “intelligence” into a single score is an oversimplification. Different models are designed to solve different problems, and each has its specialty. GPT-5 Pro (Vision) might be your best partner for solving picture puzzles, while the text-only GPT-5 Pro might be a stronger assistant for in-depth academic discussions or logical analysis.

As users, what we should do is not blindly chase the top-ranked model, but rather understand which AI performs best in the “examination hall” of our specific needs. The greatest value of this ranking is precisely in revealing this diversity, helping us move away from the myth of “who is the smartest?” and instead think about “who is the most suitable for me?”.


© 2026 Communeify. All rights reserved.