A new research paper from a group of researchers at Apple argues that artificial intelligence (AI) ‘reasoning’ is not all it is cracked up to be. Through an analysis of some of the most popular large reasoning models on the market, the paper showed that their accuracy faces a “complete collapse” beyond a certain complexity threshold.
The researchers put to the test models such as OpenAI o3-mini (medium and high configurations), DeepSeek-R1, DeepSeek-R1-Qwen-32B, and Claude 3.7 Sonnet (thinking). Their findings showed that the AI industry may be grossly overstating these models’ capabilities. They also benchmarked these large reasoning models (LRMs) against large language models (LLMs) with no reasoning capabilities, and found that in some cases the latter outperformed the former.
“In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives — an ‘overthinking’ phenomenon. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths. Beyond a certain complexity threshold, models completely fail to find correct solutions,” the paper said, adding that this “indicates LRMs possess limited self-correction capabilities that, while valuable, reveal fundamental inefficiencies and clear scaling limitations”.
To clarify the terminology: LLMs are AI models trained on vast amounts of text data to generate human-like language, used especially in tasks such as translation and content creation. LRMs prioritise logical reasoning and problem-solving, focusing on tasks that require analysis, such as math or coding. In short, LLMs emphasise language fluency, while LRMs focus on structured reasoning.
To be sure, the paper’s findings are a dampener on the promise of large reasoning models, which many have touted as a frontier breakthrough that could understand and assist humans in solving complex problems in sectors such as health and science.
The puzzles
Apple researchers evaluated the reasoning capabilities of LRMs through four controllable puzzle environments, which allowed fine-grained control over complexity and rigorous evaluation of reasoning:
Tower of Hanoi: It involves moving n disks between three pegs following specific rules, with complexity determined by the number of disks (see the sketch after this list).
Checker Jumping: This requires swapping red and blue checkers on a one-dimensional board, with complexity scaled by the number of checkers.
River Crossing: This is a constraint satisfaction puzzle where n actors and n agents must cross a river, with complexity controlled by the number of actor/agent pairs and the boat capacity.
Blocks World: Focuses on rearranging blocks into a target configuration, with complexity managed by the number of blocks.
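The Tower of Hanoi shows how these environments let difficulty be dialled up with a single knob. The sketch below is a minimal Python illustration, not the researchers’ code: the shortest solution grows exponentially with the number of disks n, so each added disk roughly doubles the amount of reasoning a model has to string together without error.

```python
# Minimal illustrative sketch (not from the Apple paper): the optimal Tower of
# Hanoi solution has 2**n - 1 moves, so the "complexity knob" n makes the
# required reasoning grow exponentially.

def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # park n-1 disks on the spare peg
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # re-stack the n-1 disks on top
    )

for n in range(1, 11):
    print(f"n={n:2d} disks -> {len(hanoi_moves(n)):4d} moves")  # 1, 3, 7, ..., 1023
```

The other three puzzles expose similar knobs (number of checkers, actor/agent pairs, blocks), which is what lets the researchers chart accuracy against a precisely controlled difficulty scale.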
“Most of our experiments are conducted on reasoning models and their non-thinking counterparts, such as Claude 3.7 Sonnet (thinking/non-thinking) and DeepSeek-R1/V3. We chose these models because they allow access to the thinking tokens, unlike models such as OpenAI’s o-series. For experiments focused solely on final accuracy, we also report results on the o-series models,” the researchers said.
How complexity affected reasoning
The researchers found that as problem complexity increased, the accuracy of reasoning models progressively declined. Eventually, their performance reached a complete collapse (zero accuracy) beyond a specific, model-dependent complexity threshold.
Apple analysis of AI models (Source: Apple)
Initially, reasoning models increased their thinking tokens proportionally with problem complexity. This indicates that they exerted more reasoning effort for more difficult problems. However, upon approaching a critical threshold (which closely corresponded to their accuracy collapse point), these models counter-intuitively began to reduce their reasoning effort (measured by inference-time tokens), despite the increasing problem difficulty.
Their work also found that at low problem complexity, non-thinking models (LLMs) were able to match or even outperform thinking models, while using tokens more efficiently at inference. At medium complexity, the advantage of reasoning models capable of generating long chains of thought began to show, and the performance gap between LLMs and LRMs widened. At higher complexity, however, the performance of both types of models collapsed to zero. “Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,” the paper said.
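Throughout these comparisons, accuracy refers to whether a model’s proposed solution actually solves the puzzle. The checker below is a rough sketch of that idea for the Tower of Hanoi, assuming answers arrive as (disk, from_peg, to_peg) tuples; it is an illustration, not the paper’s evaluation harness.

```python
# Rough sketch (not the paper's harness) of how a proposed Tower of Hanoi
# solution can be checked move by move: simulate each (disk, from_peg, to_peg)
# step, reject any illegal move, and accept only if the final state is solved.

def is_valid_hanoi_solution(moves, n_disks):
    """Return True only if the move sequence legally transfers all disks to peg C."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # list end = top of peg
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False          # the disk being moved is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False          # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))

# The optimal 3-disk solution passes; truncating it fails.
solution = [(1, "A", "C"), (2, "A", "B"), (1, "C", "B"),
            (3, "A", "C"), (1, "B", "A"), (2, "B", "C"), (1, "A", "C")]
print(is_valid_hanoi_solution(solution, 3))      # True
print(is_valid_hanoi_solution(solution[:4], 3))  # False
```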
It is worth noting, though, that the researchers have acknowledged the potential limitations of their work: “While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.”