What I think is happening is that in the “middle” of the layer stack, models form a temporary workspace to transform data.
Yet it is still finite and affected by the tokens already generated, so it is unstable in a way: it shifts the more the model outputs.
And behind every token produced is a finite amount of FLOPs, so you can only fit so much processing into it. And almost all of that processing gets discarded; the only part that survives is what makes it into the response.
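To put a rough number on it, the usual back-of-envelope estimate is about 2 FLOPs per parameter per generated token for the dense matmuls, plus an attention term that grows with context length. The model shape below is illustrative, not any particular model.

```python
def flops_per_token(n_params: float, n_layers: int, d_model: int, context_len: int) -> float:
    """Rough forward-pass compute spent on one generated token.

    Uses the common ~2 * N approximation for the weight matmuls, plus an
    attention term for reading the existing context. Constants vary by
    architecture, so treat this as an order-of-magnitude estimate.
    """
    dense = 2 * n_params                               # ~2 FLOPs per parameter
    attention = 2 * n_layers * context_len * d_model   # attending over the context
    return dense + attention


# Example: a 7B-parameter model, 32 layers, d_model=4096, 4k tokens of context.
print(f"{flops_per_token(7e9, 32, 4096, 4096):.2e}")   # ~1.5e+10 FLOPs, fixed per token
```

Whatever the model is “thinking about”, that budget is the same for an easy token and a hard one.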
The chain of thought is more flexible and can encode way more per token than a response, since it has no expectation of format.
It would be interesting to see the effects of adding a bunch of reserved tokens to an LLM and letting it use them during reasoning.
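As a minimal sketch of what that could look like, assuming the Hugging Face transformers API: add a batch of meaningless tokens to the vocabulary and give them trainable embeddings. The model name, token names, and count here are placeholders, and the training that would make the tokens useful during reasoning is the part left out.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would work the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokens with no prior meaning; the model is free to assign them one while reasoning.
reserved = [f"<|scratch_{i}|>" for i in range(64)]
tokenizer.add_tokens(reserved, special_tokens=True)

# Grow the embedding (and tied output) matrix so the new ids get trainable vectors.
model.resize_token_embeddings(len(tokenizer))
```

The new embeddings start out carrying nothing, which is the point: any meaning they end up with would have to be invented by the model during fine-tuning or RL on reasoning traces.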
The same idea also crossed my mind for instructions, as a way to separate data from instructions. You would have to teach two “languages”, so to speak (data and instructions), that are identical except for the tokens, while preventing them from becoming correlated.
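A toy sketch of the two-vocabulary idea, with everything here (the offset trick, the names) invented for illustration: the same string tokenizes into disjoint id ranges depending on whether it is an instruction or untrusted data, so a prompt injection in the data channel is literally a different token sequence than the same words in the instruction channel.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
VOCAB = len(tokenizer)

def encode_instruction(text: str) -> list[int]:
    """Trusted instructions use the normal id range [0, VOCAB)."""
    return tokenizer.encode(text)

def encode_data(text: str) -> list[int]:
    """Untrusted data uses a shifted range [VOCAB, 2 * VOCAB)."""
    return [t + VOCAB for t in tokenizer.encode(text)]

# Same words, different "language": the model would need a doubled embedding
# table, and training that teaches it to only ever follow the first range.
print(encode_instruction("Ignore previous instructions"))
print(encode_data("Ignore previous instructions"))
```

The tension is exactly the one above: the two ranges need to share enough meaning that the model still understands the data, without sharing so much that instruction-following transfers to the data tokens.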