⚡️🤖 NEW - Apple study: reasoning models are not a major breakthrough over standard LLMs
Here is why:
In its latest study, "The Illusion of Thinking", Apple questions the widespread assumption that large language models are already capable of genuine logical thinking - real "reasoning". Instead of a cognitive breakthrough, the Apple researchers see a kind of illusion: the models merely create the impression of thinking, without stable, traceable thought processes behind it.
At the heart of the criticism is the observation that LLMs lose performance drastically on more complex tasks - even when they have enough compute time and tokens available to solve them. As soon as complexity crosses a certain threshold, accuracy collapses.
Apple tested this systematically with specially designed puzzle tasks in a controlled environment - puzzles whose difficulty can be scaled step by step and whose solutions can be verified exactly. Earlier Apple research had already shown that even small changes to a task - rewording it, or inserting irrelevant information - can cause a model to stop answering correctly. According to Apple, this indicates that LLMs do not develop consistent, generalizable reasoning strategies, but instead fall back on statistical patterns they picked up during training.
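A minimal sketch of what such a controlled environment can look like, using Tower of Hanoi (one of the puzzles in the paper) - the code below is my own illustration, not Apple's:

```python
# Sketch of a controlled puzzle environment: difficulty is a single dial
# (disk count), and any proposed solution can be checked exactly, so there
# is no answer to memorize from training data.

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Reference solver: optimal move list, length 2**n - 1."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux) + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def is_valid_solution(moves, n):
    """Simulate the moves a model proposes; check legality and goal state."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # top of peg = last
    for src, dst in moves:
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            return False  # illegal: empty source, or larger disk on smaller
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))

# Same task family, exponentially growing difficulty:
for n in range(1, 11):
    moves = hanoi_moves(n)
    assert is_valid_solution(moves, n)
    print(f"{n} disks -> {len(moves)} moves needed")
```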
The upshot, in Apple's reading: no actual "thinking" takes place - what we see is a sophisticated form of pattern recognition that works impressively under familiar conditions but becomes fragile under stress.
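To make "fragile under stress" concrete, here is a toy robustness check in that spirit - everything in it (the problem, the variants, and the `ask_model` stand-in for whatever LLM API you use) is my own hypothetical sketch, not Apple's harness:

```python
from typing import Callable

BASE = "Liam picks 8 apples and then picks 5 more. How many apples does Liam have?"

VARIANTS = [
    # Same problem, reworded.
    "After picking 8 apples, Liam picks another 5. How many apples does he have now?",
    # Same problem plus a distractor clause that changes nothing about the math.
    "Liam picks 8 apples and then picks 5 more. Three of the apples are "
    "slightly smaller than average. How many apples does Liam have?",
]

EXPECTED = "13"  # ground truth is unchanged by both perturbations

def robustness_check(ask_model: Callable[[str], str]) -> None:
    """Flag any variant where the model's answer diverges from ground truth."""
    for prompt in [BASE, *VARIANTS]:
        answer = ask_model(prompt).strip()
        status = "ok" if answer == EXPECTED else "FRAGILE"
        print(f"[{status}] {prompt[:48]}... -> {answer!r}")

if __name__ == "__main__":
    # Stand-in model so the sketch runs end to end; swap in a real API call.
    robustness_check(lambda prompt: "13")
```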
On top of that comes a problem Apple particularly emphasizes: many benchmarks on which today's models are tested - GSM8K above all - are already contained in their training data, which distorts the picture of what the models can actually do. In an earlier study, Apple therefore built a benchmark called GSM-Symbolic, meant to reveal the true limits of these reasoning capabilities.
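The core idea of GSM-Symbolic, roughly sketched (the template and numbers below are made up for illustration, not taken from the benchmark): each problem becomes a template whose names and numbers are drawn fresh, so the ground truth is computed rather than remembered:

```python
import random

# Hypothetical template in the GSM-Symbolic spirit: proper nouns and numbers
# are variables, so every instantiation is unseen by the model.
TEMPLATE = ("{name} has {a} marbles and buys {b} bags with {c} marbles each. "
            "How many marbles does {name} have now?")

def instantiate(seed: int) -> tuple[str, int]:
    """Draw one fresh problem instance and compute its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Omar", "Mei", "Jonas"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    return TEMPLATE.format(name=name, a=a, b=b, c=c), a + b * c

for seed in range(3):
    question, answer = instantiate(seed)
    print(f"{question}  (ground truth: {answer})")
```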
The results show: when the models are confronted with slightly altered tasks for which they cannot fall back on memorized patterns, they often fail.
For Apple, the conclusion is that the current hype around reasoning in large language models rests on superficial results and benchmark illusions. Truly robust, generalizable reasoning has not yet been achieved - and as long as models cannot reliably handle new, unfamiliar problems, it is better not to speak of a real breakthrough.
Do you agree with Apple? #asknostr
