
Apple Paper Claims AI Reasoning Is an "Illusion"? GitHub Engineer Fires Back: Tower of Hanoi Test Is a Total Misunderstanding!

June 11, 2025

Apple’s latest research puts AI to the test with the classic Tower of Hanoi puzzle, arguing that large language models (LLMs) have fundamental limitations in reasoning—and may just be giving the illusion of thinking. But this sparked strong rebuttals, especially from a GitHub engineer. So, is AI truly flawed, or did Apple miss the point from the start? Let’s explore this fascinating debate and uncover the truth behind AI reasoning.


The tech world has been abuzz lately with a heated debate: Can AI actually think? And it all started with Apple.

Apple’s machine learning research team published a paper titled “The Illusion of Thinking”, delivering a bold claim: even the most advanced large language models (LLMs) have fundamental reasoning flaws.

This sparked immediate discussion across platforms like X (formerly Twitter) and tech forums. One of the loudest dissenting voices was from GitHub senior software engineer Sean Goedecke, who directly called out Apple’s research as flawed. According to him, using the Tower of Hanoi—a classic, rigid logic puzzle—as a test of AI reasoning was missing the point entirely and misunderstood what LLMs are built to do.

So what’s really going on here? Did Apple expose a critical flaw, or did the GitHub expert uncover a misjudgment? Let’s dive in.

Apple’s Bold Claim: The Illusion of Thinking?

Apple’s paper sounds highly academic, but the core question is actually straightforward: When we say an AI model is “reasoning,” is it truly thinking—or just mimicking the appearance of thought?

To explore this, the researchers turned to the Tower of Hanoi puzzle. You might remember it from childhood: the goal is to move a stack of disks from one rod to another, moving one disk at a time and never placing a larger disk on a smaller one, in the fewest possible moves.

Despite the simple rules, the minimum number of moves grows exponentially with the number of disks (an n-disk puzzle needs 2^n − 1 moves), making it an ideal environment for testing reasoning under controlled complexity.
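
To make that growth concrete, here is a minimal Python sketch (not from Apple's paper, just the textbook formula) printing the minimum move count, 2^n − 1, for a few disk sizes:

```python
# Minimum number of moves to solve an n-disk Tower of Hanoi: 2**n - 1.
for n in (3, 5, 7, 10, 15):
    print(f"{n:>2} disks -> {2**n - 1:>6} minimum moves")

# Prints:
#  3 disks ->      7 minimum moves
#  5 disks ->     31 minimum moves
#  7 disks ->    127 minimum moves
# 10 disks ->   1023 minimum moves
# 15 disks ->  32767 minimum moves
```

At ten disks the optimal solution already runs past a thousand moves, which is exactly the regime where Apple reports the models collapsing.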

Here’s what Apple found in their experiments, which broke down into three scenarios:

  1. Simple Tasks: With only a few disks, standard models performed surprisingly well, sometimes even outperforming large reasoning models (LRMs). In other words, you don’t need a bazooka to kill a mosquito.
  2. Moderately Complex Tasks: As the puzzle became more challenging, models with explicit reasoning traces began to shine, performing significantly better than baseline models. This aligned with expectations.
  3. Highly Complex Tasks: Here’s the kicker. Once the puzzle required 1,000+ moves to solve (a 10-disk tower, for example, takes 1,023 moves at minimum), all models failed spectacularly.

What shocked researchers wasn’t that the models made mistakes—but that they appeared to give up. Instead of trying to reason through the solution, they simply stopped attempting to generate logical steps, even when given ample token budget and time.

Apple’s conclusion? LLMs don’t actually reason. What we perceive as “thinking” is merely a surface-level illusion that breaks down once the task complexity crosses a threshold.

The full paper, “The Illusion of Thinking,” is available from Apple’s Machine Learning Research group.

GitHub Engineer Fires Back: That’s Not Reasoning!

Upon reading Apple’s claims, Sean Goedecke pushed back hard. His main point? The Tower of Hanoi is the wrong kind of task to test reasoning.

His argument is straightforward: the puzzle is essentially an algorithmic problem—recursive, repetitive, and mechanically solvable through a fixed procedure. That’s not what LLMs are designed for.
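To see what “a fixed procedure” means in practice, here is the textbook recursive solver in Python (a standard algorithm, not code from either Apple’s paper or Goedecke’s post); it enumerates every move without doing anything that looks like reasoning:

```python
def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n disks onto `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023 -- the scale at which Apple says the models gave up
```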

He offered a sharp analogy: “This is like criticizing a language model’s linguistic ability just because it can’t write a complex poem. That’s simply unfair.”

And he has a point. We don’t use LLMs to perform thousands of recursive math operations. Instead, we use them for creative thinking, ambiguity resolution, and pattern recognition across vast and messy datasets. Expecting them to solve the Tower of Hanoi like a procedural algorithm is like asking a strategic consultant to manually calculate π to 10,000 decimal places.

Goedecke stressed that LLMs are pattern predictors, not symbolic logic engines. What looks like reasoning is actually the model guessing what a reasoning human might say next—not truly building a logical tree internally.

So, Can AI Actually “Think”? Where’s the Real Issue?

Who should we believe?

The truth likely lies somewhere in between.

Apple’s research reveals a valid limitation of LLMs: they struggle with precision, long-chain logic, and exponential complexity. This is an important insight that tempers over-enthusiastic expectations.

But Goedecke’s criticism also hits home: the test itself might have been inappropriate. Real reasoning isn’t defined solely by solving rigid math or algorithmic puzzles.

This raises a deeper issue: to achieve human-like reasoning, simply predicting the next token isn’t enough. We need symbolic reasoning, hypothesis testing, simulation, reward mechanisms, real-time learning—and much more that LLMs don’t currently possess.

Which leads us to the real question: Are we hitting the ceiling of what today’s architecture can do?

Beyond the Current Framework: We Need a “New Brain,” Not a Faster CPU

This debate points to a larger conversation: the future of computation.

Many experts argue that we’re reaching the limits of the current paradigm—namely, the von Neumann architecture based on sequential processing. The human brain, in contrast, is massively parallel, with countless neurons firing simultaneously.

Trying to simulate that on linear silicon chips is like trying to run an entire city’s traffic on a one-lane road. No matter how fast the cars go, it’ll eventually jam.

That’s why real AI breakthroughs may require not just more powerful GPUs—but an entirely new computing architecture, like neuromorphic chips. These chips are modeled after the human brain’s structure, enabling true parallel computation.

Some even argue that while industry is pouring billions into quantum computing, neuromorphic computing may hold more scalable potential—especially for AI reasoning—without the same noise and error correction challenges.

Ultimately, the Apple vs. GitHub debate is more than a technical disagreement. It’s a wake-up call—reminding us to rethink what intelligence means, and what steps we need to take to reach Artificial General Intelligence (AGI).


Frequently Asked Questions (FAQ)

Q1: Does Apple’s research completely discredit AI reasoning capabilities? Not entirely. Apple’s findings highlight LLM limitations in handling specific problem types—those requiring precise, long-chain, algorithmic reasoning under exponential complexity. But in moderately complex or creative reasoning tasks, LLMs still perform impressively.

Q2: Why is using the Tower of Hanoi so controversial in AI testing? Because of “task mismatch.” Critics argue that the Tower of Hanoi is a structured, formulaic problem. Using it to evaluate LLMs—models designed for natural language, pattern recognition, and ambiguity—is like judging a weightlifter’s strength by how far they can run. It misses the point.

Q3: What might AI need to achieve true reasoning in the future? This debate suggests that scaling models alone isn’t enough. Future breakthroughs may require:

  • Hybrid Architectures: Combining LLMs with symbolic reasoning engines (see the sketch after this list).
  • New Algorithms: Capable of internal simulation, trial and error, and long-term planning.
  • New Hardware: Especially neuromorphic chips that mimic the brain’s parallel processing and structure.
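
As a rough, hypothetical illustration of the first bullet, the Python sketch below shows one way an LLM and a symbolic engine could be combined: the model proposes a full move list, and a rule-based checker replays it and only accepts plans that obey the Hanoi rules. The `llm_propose_plan` function is a stand-in for a real model call (here it just returns a hard-coded 3-disk solution); nothing in this sketch comes from Apple’s paper or Goedecke’s post.

```python
def verify_plan(n, plan):
    """Symbolic checker: replay a proposed move list and confirm it solves n-disk Hanoi."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in plan:
        if not pegs[src] or (pegs[dst] and pegs[src][-1] > pegs[dst][-1]):
            return False                              # illegal move: reject the plan
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))         # solved iff every disk ends on peg C

def llm_propose_plan(n):
    """Hypothetical stand-in for an LLM call returning (source, target) moves."""
    return [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"),
            ("B", "A"), ("B", "C"), ("A", "C")]       # hard-coded 3-disk solution

print(verify_plan(3, llm_propose_plan(3)))            # True: the checker accepts this plan
```

The design point is the division of labor: the language model handles open-ended generation, while correctness is delegated to a component that can actually verify it.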