The AI intelligence race has taken a surprising turn! According to the latest real-world test data, top models from OpenAI, Google, and Anthropic trade wins and losses across different intelligence tests. This article presents the complete IQ ranking of 29 AI models and digs into the hidden truths behind the numbers.
The “Olympics” of the AI World: Rules Are More Complex Than You Think
We are all accustomed to looking for a single champion. In the artificial intelligence race, we likewise want to know: who is the smartest AI? A website called Tracking AI attempts to answer this question through regular intelligence tests. Yet the latest test data shows that the answer is far more complex than a simple ranking.
This competition doesn’t have just one event, but at least two different “exam papers”: the Offline Test and the Mensa Norway test. The same AI can perform vastly differently on the two, much like an athlete who wins the 100-meter dash but falls short in the marathon.
Complete AI IQ Ranking: Understand the True Strength of 29 Models at a Glance
This complete ranking, based on the latest data, lists the scores of each model in both tests. For ease of comparison, we have primarily sorted them by the Offline Test score, but be sure to pay attention to the surprising contrast in the Mensa Norway test.
| Rank (by Offline Test) | AI Model | Offline Test IQ | Mensa Norway IQ |
|---|---|---|---|
| 1 | OpenAI GPT-5 Pro (Vision) | 123 | 136 |
| 2 | Gemini 2.5 Pro | 118 | 137 |
| 3 | Claude-4 Opus | 118 | 117 |
| 4 | OpenAI GPT-5 Pro | 116 | 148 |
| 5 | OpenAI o3 | 116 | 135 |
| 6 | OpenAI o3 Pro | 109 | 133 |
| 7 | Claude-4 Sonnet | 107 | 119 |
| 8 | Grok-4 | 103 | 121 |
| 9 | OpenAI o3 Pro (Vision) | 100 | 104 |
| 10 | Gemini 2.5 Pro (Vision) | 99 | 96 |
| 11 | OpenAI o3 (Vision) | 97 | 94 |
| 12 | OpenAI GPT-5 | 93 | 115 |
| 13 | OpenAI o4 mini | 90 | 112 |
| 14 | Gemini 2.5 Flash Thinking | 90 | 87 |
| 15 | Claude-4 Sonnet (Vision) | 88 | 93 |
| 16 | OpenAI GPT-5 (Vision) | 87 | 67 |
| 17 | OpenAI o4 mini high | 87 | 99 |
| 18 | DeepSeek R1 | 86 | 101 |
| 19 | OpenAI o4 mini (Vision) | 84 | 79 |
| 20 | Claude-4 Opus (Vision) | 82 | 82 |
| 21 | Llama 4 Maverick | 82 | 100 |
| 22 | Llama 4 Maverick (Vision) | 82 | 75 |
| 23 | DeepSeek V3 | 79 | 92 |
| 24 | Mistral | 74 | 85 |
| 25 | GPT-4o | 69 | 85 |
| 26 | Grok-4 (Vision) | 68 | 82 |
| 27 | Bing Copilot | 67 | 86 |
| 28 | GPT-4o (Vision) | 65 | 64 |
| 29 | OpenAI GPT-5 Thinking | 64 | 79 |
Scores change as models are retested; please refer to the Tracking AI website for the latest figures.
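For readers who want to slice this table themselves, here is a minimal Python sketch of how the ranking could be reproduced. The scores are transcribed from the table above; the list-of-dicts structure and the field names (`offline`, `mensa`) are our own assumptions, not anything Tracking AI publishes.

```python
# A minimal sketch reproducing the table's sort order.
# Only the first five entries are shown; the rest follow the same shape.

models = [
    {"name": "OpenAI GPT-5 Pro (Vision)", "offline": 123, "mensa": 136},
    {"name": "Gemini 2.5 Pro",            "offline": 118, "mensa": 137},
    {"name": "Claude-4 Opus",             "offline": 118, "mensa": 117},
    {"name": "OpenAI GPT-5 Pro",          "offline": 116, "mensa": 148},
    {"name": "OpenAI o3",                 "offline": 116, "mensa": 135},
    # ... remaining 24 models from the table above ...
]

# Sort primarily by Offline Test IQ, descending, mirroring the table.
ranked = sorted(models, key=lambda m: m["offline"], reverse=True)

for rank, m in enumerate(ranked, start=1):
    print(f"{rank:>2}. {m['name']:<28} offline={m['offline']}  mensa={m['mensa']}")
```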
Insights Behind the Data: Do You Really Understand This Chart?
Just looking at the rankings is for amateurs; understanding the nuances is for experts. This seemingly simple table actually hides several very important insights:
1. The “Dual Standard” for the Champion's Throne: Who is the Real Number One?
If you only look at the Offline Test, OpenAI GPT-5 Pro (Vision) takes the top spot with a score of 123, seemingly the undisputed king of visual reasoning.
But shift your gaze to the Mensa Norway column. The text-only OpenAI GPT-5 Pro scores a staggering 148, not only far above its own Offline Test result (116) but also the highest score on the entire board! What does this mean? The title of “smartest” depends entirely on which ruler you use to measure. One model may be king in a test requiring visual-spatial abilities, while another takes the crown in a test of abstract logic or verbal reasoning.
2. Do AIs Also “Specialize”? The Two Tests Are Vastly Different
The huge score differences that the same models show across the two tests reveal a clear tendency to “specialize.” For example:
- OpenAI GPT-5 Pro: 116 on Offline Test, 148 on Mensa Norway, a full 32-point difference!
- Gemini 2.5 Pro: 118 on Offline Test, 137 on Mensa Norway, also a 19-point difference.
This strongly suggests that the Offline Test and the Mensa Norway test probe very different skills. The former may emphasize concrete reasoning abilities like pattern recognition and spatial relationships, which would explain why the top (Vision) variant does well there. The latter may lean towards the abstract logic, numerical patterns, and language comprehension of traditional IQ tests, letting top-tier text-only models shine.
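To make the “specialization” idea concrete, here is a small sketch that computes the gap between a model’s two scores. The `specialization_gap` helper is hypothetical; only the numbers come from the table above.

```python
# Measuring how "specialized" a model is by the gap between its two scores.

def specialization_gap(model):
    """Positive gap = stronger on Mensa Norway; negative = stronger on the Offline Test."""
    return model["mensa"] - model["offline"]

examples = [
    {"name": "OpenAI GPT-5 Pro", "offline": 116, "mensa": 148},
    {"name": "Gemini 2.5 Pro",   "offline": 118, "mensa": 137},
]

for m in examples:
    print(f"{m['name']}: gap = {specialization_gap(m):+d}")
# OpenAI GPT-5 Pro: gap = +32
# Gemini 2.5 Pro: gap = +19
```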
3. The Vision-Language Divide: Different Faces of the Same Model
This data also exposes the “modality gap” in AI capabilities. Take Gemini 2.5 Pro as an example: its text-only version achieved top-tier scores on both tests (118/137), but its vision variant dropped to 99/96. Even when the underlying technology is the same, a model optimized for a different input modality (text versus images) can perform significantly differently.
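One illustrative way to quantify this modality gap is to pair each text-only entry with its (Vision) counterpart by name and compare their scores, as in the sketch below. The pairing-by-suffix rule is an assumption based purely on how this table labels its rows.

```python
# Pairing text-only models with their (Vision) counterparts and
# printing the score drop on each test. Scores are from the table above.

scores = {
    "Gemini 2.5 Pro":          (118, 137),
    "Gemini 2.5 Pro (Vision)": (99, 96),
    "Claude-4 Opus":           (118, 117),
    "Claude-4 Opus (Vision)":  (82, 82),
}

for name, (offline, mensa) in scores.items():
    if name.endswith(" (Vision)"):
        base = name[: -len(" (Vision)")]
        if base in scores:
            b_off, b_men = scores[base]
            print(f"{base}: offline {b_off} -> {offline} ({offline - b_off:+d}), "
                  f"mensa {b_men} -> {mensa} ({mensa - b_men:+d})")
# Gemini 2.5 Pro: offline 118 -> 99 (-19), mensa 137 -> 96 (-41)
# Claude-4 Opus: offline 118 -> 82 (-36), mensa 117 -> 82 (-35)
```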
4. Hidden Dark Horses and Underrated Contenders
If you only look at the top three, you’ll miss many interesting details.
- Llama 4 Maverick’s Offline Test score is only 82, which seems unremarkable, but its Mensa Norway score reaches 100, beating the Mensa scores of many models ranked above it.
- DeepSeek R1 is similar, with a very respectable Mensa Norway score (101).
This shows that some open-source or second-tier models are not necessarily inferior in specific reasoning abilities; they simply haven’t been optimized across the board. For users with specific needs, these “specialized” contenders might offer better value.
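A quick way to surface such dark horses is to re-rank the models by Mensa Norway score and see who climbs. The sketch below does this for a hand-picked subset of the table; the jumps it prints only hold within that subset, not across the full list of 29.

```python
# Spotting "dark horses": models whose rank improves when sorted by
# Mensa Norway instead of the Offline Test. (name, offline, mensa)

models = [
    ("OpenAI GPT-5 (Vision)",   87,  67),
    ("DeepSeek R1",             86, 101),
    ("OpenAI o4 mini (Vision)", 84,  79),
    ("Llama 4 Maverick",        82, 100),
]

offline_rank = {name: r for r, (name, *_) in
                enumerate(sorted(models, key=lambda m: -m[1]), start=1)}
mensa_rank = {name: r for r, (name, *_) in
              enumerate(sorted(models, key=lambda m: -m[2]), start=1)}

for name, *_ in models:
    jump = offline_rank[name] - mensa_rank[name]
    if jump > 0:  # climbs the ladder under the Mensa Norway ordering
        print(f"{name}: climbs {jump} place(s) when ranked by Mensa Norway")
# DeepSeek R1: climbs 1 place(s) when ranked by Mensa Norway
# Llama 4 Maverick: climbs 2 place(s) when ranked by Mensa Norway
```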
Conclusion: There Is No Single Champion, Only a More Suitable Tool
In conclusion, this latest, more realistic data tells us an important truth: in the world of AI, there is no single, all-powerful champion.
Simplifying an AI’s “intelligence” into a single score is a misleading oversimplification. Different models are designed to solve different problems, and each has its specialty. GPT-5 Pro (Vision) might be your best partner for picture puzzles, while the text-only GPT-5 Pro might be the stronger assistant for in-depth academic discussion or logical analysis.
As users, what we should do is not blindly chase the top-ranked model, but rather understand which AI performs best in the “examination hall” of our specific needs. The greatest value of this ranking is precisely in revealing this diversity, helping us move away from the myth of “who is the smartest?” and instead think about “who is the most suitable for me?”.