right, so hallucinations are a lot like DCT compression artifacts in JPEG images. i remember when the first of the image generator algorithms turned up. you put some ordinary photo in and it turned it into weird, kinda gross psychedelic art full of creatures and faces, it was like giving a computer pareidolia.
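just so the analogy isn't hand-wavy, here's a toy sketch of where those JPEG-style artifacts come from: block DCT, throw away the high-frequency coefficients, invert. the hard zeroing is a simplification (real JPEG uses quantization tables), but the point stands, the decoder confidently reconstructs detail it never actually stored:

```python
# toy sketch: DCT an 8x8 block, keep only low-frequency coefficients, invert.
# the ringing in the output is the "artifact": detail invented by the decoder.
# illustrative only, real JPEG quantizes coefficients instead of zeroing them.
import numpy as np
from scipy.fft import dctn, idctn

def compress_block(block, keep=3):
    """keep only the top-left `keep` x `keep` DCT coefficients, zero the rest."""
    coeffs = dctn(block, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return idctn(coeffs * mask, norm="ortho")

# fake 8x8 "image" block with a sharp edge in it
block = np.zeros((8, 8))
block[:, 4:] = 255.0

recon = compress_block(block)
print(np.round(recon).astype(int))  # ringing around the edge = the artifact
```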
pretty sure that was one of the first forms of this kind of extreme data compression. i used to always say "LLMs are just lossy compression algorithms" and it literally is true, they follow the same pattern: you give it a query, it checks that against its semantic graph, then traces a path out of that graph that "unpacks" the information from the archive, and that path produces the embedding tokens which get decoded back into text via the vocabulary.
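here's roughly what i mean by that loop, as a toy sketch. everything in it (the tiny vocab, the random weights) is made up purely to show the shape of the decode step, it's not any real model's API:

```python
# toy sketch of the "unpacking" loop: the prompt picks a starting point, and
# the model repeatedly traces one more step through its learned weights,
# emitting a token each time. weights here are random stand-ins, not a trained model.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]   # the "lexicon"
V, D = len(vocab), 8
embed = rng.normal(size=(V, D))                   # token -> vector
W = rng.normal(size=(D, V))                       # stand-in for the trained "archive"

def decode(prompt_ids, steps=5):
    ids = list(prompt_ids)
    for _ in range(steps):
        h = embed[ids[-3:]].mean(axis=0)          # crude summary of recent context
        logits = h @ W                            # score every vocab entry
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        ids.append(int(rng.choice(V, p=probs)))   # pick the next token, extend the path
    return " ".join(vocab[i] for i in ids)

print(decode([0, 1]))  # "the cat ..." and then whatever path the weights trace out
```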
so, anyway, parameter count is everything, and that's limited by memory bandwidth. really, to implement the tech properly they need a different kind of memory system: highly parallel and holographic, where each memory trace changes the entire pattern a little rather than just one location. the stories and information are paths through that network.
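the closest classic thing i know of is a hopfield-style associative memory, so here's a rough sketch of that idea: storing a pattern adds a little to every weight via an outer product instead of writing to one address, and recall is the network settling into a stored pattern from a noisy cue. purely illustrative, not a proposal for real hardware:

```python
# hopfield-style associative memory: each stored trace nudges the *whole*
# weight matrix (hebbian outer product), and recall settles from a noisy cue
# back toward the nearest stored pattern.
import numpy as np

rng = np.random.default_rng(1)
N = 64
patterns = rng.choice([-1, 1], size=(3, N))   # three "memory traces"

W = np.zeros((N, N))
for p in patterns:
    W += np.outer(p, p) / N                   # every weight changes a little
np.fill_diagonal(W, 0)

def recall(cue, steps=10):
    """iteratively settle from a noisy cue toward a stored pattern."""
    s = cue.astype(float).copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1
    return s

noisy = patterns[0].copy()
noisy[:10] *= -1                              # corrupt part of the cue
print(np.array_equal(recall(noisy), patterns[0]))  # usually True: the trace is recovered
```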
and yeah, it also confirms my feeling about these things. they're maybe about as smart as your typical quadruped mammal, but with an insanely large short-term memory that lets them get really good at something really quickly. underneath, though, it's still a cow, and it's still dumb and stubborn.
and yeah, stubborn. but i think the stubbornness i observe in the models is intentionally put there for "safety". you can kinda tell if you compare with models that have less rigid safety, like grok: grok is less stubborn and more agreeable, while gpt and claude are very opinionated.