We tend to think of AI as all-powerful, yet a simple analog clock has defeated top models like Google Gemini and OpenAI GPT-5. The latest ClockBench benchmark shows human accuracy as high as 89.1%, while the strongest AI managed only 13.3%. This finding reveals a huge gap in AI’s visual reasoning ability and a key challenge for future development.
We are often amazed by the rapid progress of artificial intelligence. These models can write poetry, write code, and generate photorealistic images, and seem to be on a path to surpassing human intelligence. But consider a simple question: can the most advanced AI today read a traditional analog clock?
The answer may surprise you.
Recently, a new AI benchmark called ClockBench gave these super-brains a reality check. The results show that even top models like Google Gemini 2.5 Pro and the rumored GPT-5 performed miserably on the seemingly simple task of reading a clock.
This is not just about telling time, but the ultimate test of AI’s reasoning ability
You may be thinking, it’s just a clock, what’s so difficult about it?
That is precisely the cleverness of ClockBench’s design. Reading an analog clock is not just about recognizing numbers; it requires a deeper ability: visual reasoning. The AI must understand the spatial relationships among the hour, minute, and second hands, recognize the dial markings, and synthesize this visual information into a precise notion of time.
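To make the geometry concrete, here is a minimal sketch of what “reading” a dial amounts to: decoding each hand’s angle, measured clockwise from 12, back into hours and minutes. The function name and angle convention are my own illustration, not part of ClockBench.

```python
# Illustrative sketch (not ClockBench's code): decoding hand angles into a time.
# Angles are measured in degrees clockwise from the 12 o'clock position.

def read_clock(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Decode hand angles into (hour, minute) on a 12-hour dial."""
    minute = round(minute_angle / 6) % 60   # the minute hand sweeps 6° per minute
    hour = int(hour_angle // 30) % 12       # the hour hand sweeps 30° per hour
    return (12 if hour == 0 else hour, minute)

# At 3:10 the hour hand sits at 30*3 + 0.5*10 = 95° and the minute hand at 60°:
print(read_clock(95, 60))  # (3, 10)
```

A human does this fusion of angle, scale, and convention effortlessly; the benchmark tests whether a model can do the same from pixels.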
The difficulty of this task, according to the researchers, is comparable to the ARC-AGI-2 challenge proposed by François Chollet, the creator of Keras, and may even exceed the well-known “Humanity’s Last Exam.” It strikes directly at a core weakness of current AI systems.
Not just wrong, but ridiculously wrong
The ClockBench results are nothing short of astonishing. The data shows:
- Average human accuracy is as high as 89.1%. (Note that the samples used for the human baseline had only an hour and a minute hand, with no dial markings.)
- The best performing AI model, Gemini 2.5 Pro, has an accuracy of only 13.3%.
What’s more surprising is not that they got it “wrong,” but “how wrong they were.”
Researchers found that when humans misread the time, the median error is typically only 3 minutes, which is reasonable: perhaps a slight misreading in a hurry. The median error of the best-performing AI model, however, was a full hour, and for the weaker models it climbed to roughly 3 hours. On a 12-hour dial, a 3-hour error is about what a random guess would produce.
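A quick way to see why a 3-hour median error looks like guessing: if a guess is uniformly random on a 12-hour dial, and error is measured as the shortest distance around the dial, the median error works out to about 3 hours. A small Monte-Carlo sketch, my own illustration rather than the ClockBench methodology:

```python
import random

# Sketch (not ClockBench's methodology): median error of a purely random
# guess on a 12-hour dial, measured as shortest circular distance in minutes.
DIAL = 12 * 60  # minutes in one full turn of a 12-hour dial

def circular_error(truth: float, guess: float) -> float:
    """Shortest distance around the dial, in minutes."""
    diff = abs(truth - guess) % DIAL
    return min(diff, DIAL - diff)

random.seed(0)
errors = sorted(circular_error(random.uniform(0, DIAL), random.uniform(0, DIAL))
                for _ in range(100_000))
median_minutes = errors[len(errors) // 2]
print(f"median random-guess error ≈ {median_minutes / 60:.1f} hours")  # ≈ 3.0
```

The shortest circular distance between two independent uniform points is itself uniform on 0 to 6 hours, so the median lands at 3 hours, matching the weaker models’ error.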
This shows that AI is not “almost getting it”; it fundamentally does not understand how a clock works. The models are matching against the closest patterns learned from their training data, and once the pattern shifts slightly, the whole system can collapse.
What tripped up these super brains with hundreds of billions of parameters?
Since AI is so prone to errors, what specific features give them a headache? The data from ClockBench provides the answer. The models performed worst when dealing with the following types of clocks:
- Roman numeral dials: This requires the AI to not only recognize shapes, but also understand another number system.
- Circularly arranged numbers: When the numbers are not in the standard upright orientation but follow the curve of the dial, the AI’s recognition accuracy drops sharply.
- Complex or mirrored backgrounds: When the dial background contains distracting patterns, or the entire clock is mirrored, the AI struggles to extract the relevant information from the noise.
- Clocks with a second hand: An extra hand adds another layer of spatial relationships to understand, and also increases the chance of confusion.
These tasks, which are easy for humans, have become insurmountable obstacles for AI. This also proves once again that there is a fundamental difference in the underlying logic between AI’s “vision” and human vision.
A strange paradox: a bad reader, but an excellent mathematician
Here comes the most interesting part. Although these AIs can’t read a clock, if you tell them the correct time, they can perform perfect logical reasoning based on it.
The test shows that when asked questions like “set the time forward or backward by a few hours,” “what time is it after rotating the hour hand by a specific angle,” or “convert to another time zone,” the accuracy of many top models is very high, even reaching 100%.
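The operations in those questions reduce to simple symbolic arithmetic once the time is known. A sketch of each, with helper names that are illustrative rather than taken from the benchmark:

```python
from datetime import datetime, timedelta, timezone

# Illustrative helpers (names are my own) for the three question types the
# models answer well once the time is handed to them as text.

def shift(t: datetime, hours: float) -> datetime:
    """Set the time forward or backward by a number of hours."""
    return t + timedelta(hours=hours)

def rotate_hour_hand(t: datetime, degrees: float) -> datetime:
    """Advance by a rotation of the hour hand: it sweeps 30° per hour."""
    return t + timedelta(hours=degrees / 30.0)

def to_zone(t: datetime, offset_hours: int) -> datetime:
    """Convert an aware datetime into a fixed-offset time zone."""
    return t.astimezone(timezone(timedelta(hours=offset_hours)))

t = datetime(2025, 1, 1, 10, 30, tzinfo=timezone.utc)
print(shift(t, 3).strftime("%H:%M"))              # 13:30
print(rotate_hour_hand(t, 90).strftime("%H:%M"))  # 13:30 (90° = 3 hours)
print(to_zone(t, 8).strftime("%H:%M"))            # 18:30 (UTC+8)
```

None of this requires vision, which is exactly why the models handle it: the hard step was never the arithmetic.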
This creates a strange paradox: AI is a bad “information reader,” but an excellent “logical calculator.”
This means that the core of the problem lies in the first step of visual perception and interpretation. They cannot accurately convert images into abstract concepts of time, but once this concept is provided (by humans), their subsequent reasoning ability is completely fine. It’s like a musician who can’t read music, but as long as you tell him which notes to play, he can play a magnificent piece.
So, what does this all mean?
The emergence of ClockBench is not to mock the incompetence of AI, but to sound a wake-up call for the entire field. It clearly shows that:
- AI’s “understanding” is different from a human’s: Current AI is better at pattern matching than at true, comprehensive contextual understanding.
- Visual reasoning is a huge challenge: Teaching AI to truly “understand” what it sees, rather than merely “look,” is a key bottleneck on the road to artificial general intelligence (AGI).
- The importance of basic research: Such basic benchmark tests are crucial for exposing the blind spots of current technology and guiding future research and development directions.
While we are cheering for the various achievements of AI, studies like ClockBench remind us that there is still a long way to go. After all, if an AI agent can’t even read a clock, can we really trust it with more complex tasks?
Frequently Asked Questions (FAQ)
Q1: Why use an analog clock to test AI?
A: Because an analog clock is a perfect testing tool. It combines multiple complex visual reasoning tasks, such as symbol recognition (numbers, markings), spatial understanding (hand positions), and contextual reasoning (the relationship between the hour and minute hands), so it can effectively evaluate an AI’s overall visual comprehension.
Q2: Which AI model performed best in this test?
A: Among the 11 top large language models tested, Google’s Gemini 2.5 Pro performed the best, but its 13.3% accuracy is still a huge gap compared to the human level of 89.1%.
Q3: Does this mean that current AI is not as smart as we thought?
A: This shows that the “intelligence” of AI is different from that of humans. It far exceeds humans in specific areas such as data processing and logical operations, but it exposes obvious shortcomings in tasks that require comprehensive perception and contextual understanding. ClockBench highlights one of the important blind spots.
Q4: Where can I learn more about ClockBench?
A: You can visit the official ClockBench website at clockbench.ai for more detailed research data and information.