The AI intelligence race has taken a surprising turn! According to the latest real-world test data, top models from OpenAI, Google, and Anthropic trade wins and losses across different intelligence tests. This article presents the complete IQ ranking of 29 AI models and digs into the hidden truths behind the numbers.
The “Olympics” of the AI World: Rules Are More Complex Than You Think
We are all accustomed to looking for a single champion. In the artificial intelligence race, we likewise want to know: who is the smartest AI? A website called Tracking AI attempts to answer this question through regular intelligence tests. Yet the latest test data shows that the answer is far more complex than a simple ranking.
This competition doesn’t have just one event, but at least two different “exam papers”: the Offline Test and the Mensa Norway test. The same AI can perform vastly differently on the two, much like an athlete who wins the 100-meter dash but falls short in the marathon.
Complete AI IQ Ranking: Understand the True Strength of 29 Models at a Glance
This complete ranking, based on the latest data, lists the scores of each model in both tests. For ease of comparison, we have primarily sorted them by the Offline Test score, but be sure to pay attention to the surprising contrast in the Mensa Norway test.
| Rank (by Offline Test) | AI Model | Offline Test IQ | Mensa Norway IQ |
|---|---|---|---|
| 1 | OpenAI GPT-5 Pro (Vision) | 123 | 136 |
| 2 | Gemini 2.5 Pro | 118 | 137 |
| 3 | Claude-4 Opus | 118 | 117 |
| 4 | OpenAI GPT-5 Pro | 116 | 148 |
| 5 | OpenAI o3 | 116 | 135 |
| 6 | OpenAI o3 Pro | 109 | 133 |
| 7 | Claude-4 Sonnet | 107 | 119 |
| 8 | Grok-4 | 103 | 121 |
| 9 | OpenAI o3 Pro (Vision) | 100 | 104 |
| 10 | Gemini 2.5 Pro (Vision) | 99 | 96 |
| 11 | OpenAI o3 (Vision) | 97 | 94 |
| 12 | OpenAI GPT-5 | 93 | 115 |
| 13 | OpenAI o4 mini | 90 | 112 |
| 14 | Gemini 2.5 Flash Thinking | 90 | 87 |
| 15 | Claude-4 Sonnet (Vision) | 88 | 93 |
| 16 | OpenAI GPT-5 (Vision) | 87 | 67 |
| 17 | OpenAI o4 mini high | 87 | 99 |
| 18 | DeepSeek R1 | 86 | 101 |
| 19 | OpenAI o4 mini (Vision) | 84 | 79 |
| 20 | Claude-4 Opus (Vision) | 82 | 82 |
| 21 | Llama 4 Maverick | 82 | 100 |
| 22 | Llama 4 Maverick (Vision) | 82 | 75 |
| 23 | DeepSeek V3 | 79 | 92 |
| 24 | Mistral | 74 | 85 |
| 25 | GPT-4o | 69 | 85 |
| 26 | Grok-4 (Vision) | 68 | 82 |
| 27 | Bing Copilot | 67 | 86 |
| 28 | GPT-4o (Vision) | 65 | 64 |
| 29 | OpenAI GPT-5 Thinking | 64 | 79 |
Scores change as models are retested; please refer to the Tracking AI website for the latest figures.
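For readers who want to slice this table themselves, here is a minimal Python sketch of how the ranking could be reproduced. The scores are transcribed from the table above; the list-of-dicts structure and the field names (`offline`, `mensa`) are our own assumptions, not anything Tracking AI publishes.

```python
# A minimal sketch reproducing the table's sort order.
# Only the first five entries are shown; the rest follow the same shape.

models = [
    {"name": "OpenAI GPT-5 Pro (Vision)", "offline": 123, "mensa": 136},
    {"name": "Gemini 2.5 Pro",            "offline": 118, "mensa": 137},
    {"name": "Claude-4 Opus",             "offline": 118, "mensa": 117},
    {"name": "OpenAI GPT-5 Pro",          "offline": 116, "mensa": 148},
    {"name": "OpenAI o3",                 "offline": 116, "mensa": 135},
    # ... remaining 24 models from the table above ...
]

# Sort primarily by Offline Test IQ, descending, mirroring the table.
ranked = sorted(models, key=lambda m: m["offline"], reverse=True)

for rank, m in enumerate(ranked, start=1):
    print(f"{rank:>2}. {m['name']:<28} offline={m['offline']}  mensa={m['mensa']}")
```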
Insights Behind the Data: Do You Really Understand This Chart?
Just looking at the rankings is for amateurs; understanding the nuances is for experts. This seemingly simple table actually hides several very important insights:
1. The “Dual Standard” for the Champion's Throne: Who is the Real Number One?
If you only look at the Offline Test, OpenAI GPT-5 Pro (Vision) takes the top spot with a score of 123, seemingly the undisputed king of visual reasoning.
But shift your gaze to the Mensa Norway column. The text-only OpenAI GPT-5 Pro scores a staggering 148, not only far above its own Offline Test result (116) but also the highest score on the entire board! What does this mean? The title of “smartest” depends entirely on which ruler you use to measure. One model may be king in a test requiring visual-spatial abilities, while another takes the crown in a test of abstract logic or verbal reasoning.
2. Do AIs Also “Specialize”? The Two Tests Are Vastly Different
The huge score differences that the same models show across the two tests reveal a clear tendency to “specialize.” For example:
- OpenAI GPT-5 Pro: 116 on Offline Test, 148 on Mensa Norway, a full 32-point difference!
- Gemini 2.5 Pro: 118 on Offline Test, 137 on Mensa Norway, also a 19-point difference.
This strongly suggests that the Offline Test and the Mensa Norway test probe very different skills. The former may emphasize concrete reasoning abilities like pattern recognition and spatial relationships, which would explain why the top (Vision) variant does well there. The latter may lean towards the abstract logic, numerical patterns, and language comprehension of traditional IQ tests, letting top-tier text-only models shine.
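To make the “specialization” idea concrete, here is a small sketch that computes the gap between a model’s two scores. The `specialization_gap` helper is hypothetical; only the numbers come from the table above.

```python
# Measuring how "specialized" a model is by the gap between its two scores.

def specialization_gap(model):
    """Positive gap = stronger on Mensa Norway; negative = stronger on the Offline Test."""
    return model["mensa"] - model["offline"]

examples = [
    {"name": "OpenAI GPT-5 Pro", "offline": 116, "mensa": 148},
    {"name": "Gemini 2.5 Pro",   "offline": 118, "mensa": 137},
]

for m in examples:
    print(f"{m['name']}: gap = {specialization_gap(m):+d}")
# OpenAI GPT-5 Pro: gap = +32
# Gemini 2.5 Pro: gap = +19
```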
3. The Vision-Language Divide: Different Faces of the Same Model
This data also exposes the “modality gap” in AI capabilities. Take Gemini 2.5 Pro as an example: its text-only version achieved top-tier scores on both tests (118/137), but its vision variant dropped to 99/96. Even when the underlying technology is the same, a model optimized for a different input modality (text versus images) can perform significantly differently.
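One illustrative way to quantify this modality gap is to pair each text-only entry with its (Vision) counterpart by name and compare their scores, as in the sketch below. The pairing-by-suffix rule is an assumption based purely on how this table labels its rows.

```python
# Pairing text-only models with their (Vision) counterparts and
# printing the score drop on each test. Scores are from the table above.

scores = {
    "Gemini 2.5 Pro":          (118, 137),
    "Gemini 2.5 Pro (Vision)": (99, 96),
    "Claude-4 Opus":           (118, 117),
    "Claude-4 Opus (Vision)":  (82, 82),
}

for name, (offline, mensa) in scores.items():
    if name.endswith(" (Vision)"):
        base = name[: -len(" (Vision)")]
        if base in scores:
            b_off, b_men = scores[base]
            print(f"{base}: offline {b_off} -> {offline} ({offline - b_off:+d}), "
                  f"mensa {b_men} -> {mensa} ({mensa - b_men:+d})")
# Gemini 2.5 Pro: offline 118 -> 99 (-19), mensa 137 -> 96 (-41)
# Claude-4 Opus: offline 118 -> 82 (-36), mensa 117 -> 82 (-35)
```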
4. Hidden Dark Horses and Underrated Contenders
If you only look at the top three, you’ll miss many interesting details.
- Llama 4 Maverick’s Offline Test score is only 82, which seems unremarkable, but its Mensa Norway score reaches 100, beating the Mensa scores of many models ranked above it.
- DeepSeek R1 is similar, with a very respectable Mensa Norway score (101).
This shows that some open-source or second-tier models are not necessarily inferior in specific reasoning abilities; they simply haven’t been optimized across the board. For users with specific needs, these “specialized” contenders might offer better value.
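A quick way to surface such dark horses is to re-rank the models by Mensa Norway score and see who climbs. The sketch below does this for a hand-picked subset of the table; the jumps it prints only hold within that subset, not across the full list of 29.

```python
# Spotting "dark horses": models whose rank improves when sorted by
# Mensa Norway instead of the Offline Test. (name, offline, mensa)

models = [
    ("OpenAI GPT-5 (Vision)",   87,  67),
    ("DeepSeek R1",             86, 101),
    ("OpenAI o4 mini (Vision)", 84,  79),
    ("Llama 4 Maverick",        82, 100),
]

offline_rank = {name: r for r, (name, *_) in
                enumerate(sorted(models, key=lambda m: -m[1]), start=1)}
mensa_rank = {name: r for r, (name, *_) in
              enumerate(sorted(models, key=lambda m: -m[2]), start=1)}

for name, *_ in models:
    jump = offline_rank[name] - mensa_rank[name]
    if jump > 0:  # climbs the ladder under the Mensa Norway ordering
        print(f"{name}: climbs {jump} place(s) when ranked by Mensa Norway")
# DeepSeek R1: climbs 1 place(s) when ranked by Mensa Norway
# Llama 4 Maverick: climbs 2 place(s) when ranked by Mensa Norway
```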
Conclusion: There Is No Single Champion, Only a More Suitable Tool
In conclusion, this latest, more realistic data tells us an important truth: in the world of AI, there is no single, all-powerful champion.
Simplifying an AI’s “intelligence” into a single score is a misleading oversimplification. Different models are designed to solve different problems, and each has its specialty. GPT-5 Pro (Vision) might be your best partner for picture puzzles, while the text-only GPT-5 Pro might be the stronger assistant for in-depth academic discussion or logical analysis.
As users, what we should do is not blindly chase the top-ranked model, but rather understand which AI performs best in the “examination hall” of our specific needs. The greatest value of this ranking is precisely in revealing this diversity, helping us move away from the myth of “who is the smartest?” and instead think about “who is the most suitable for me?”.