right, so hallucinations are a lot like DCT compression artifacts in JPEG images. i remember when the first of the image generator algorithms turned up. you put some ordinary photo in it and it turned it into some sort of weird, kinda gross psychedelic art with all kinds of creatures and faces. it was like giving a computer pareidolia.

pretty sure that was one of the first forms of this kind of extreme data compression. i used to always say "LLMs are just lossy compression algorithms" and it literally is true. they follow the same pattern: you give it a query, it matches it against its semantic graph, then traces a path that "unpacks" the information from the archive, and that path produces the embedding tokens, which get decoded back into text through the lexicon.
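
here's a minimal sketch of what i mean by that unpacking loop, using a toy bigram table instead of a real transformer (the table, the words, everything in it is made up for illustration): the stored statistics are the "archive", the prompt picks the entry point, and decoding walks a probable path back out of it.

```python
# minimal sketch (toy bigram table instead of a real model): the "lossy
# decompression" view of generation. the training data is boiled down to
# transition statistics; a prompt picks an entry point and decoding walks a
# probable path back out, "unpacking" text that was never stored verbatim.
import random

# crude stand-in for the model's learned statistics (its "archive")
bigrams = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(prompt_word, length=4):
    """walk a probable path out of the stored statistics."""
    out = [prompt_word]
    word = prompt_word
    for _ in range(length):
        if word not in bigrams:
            break
        nxt = bigrams[word]
        # sample the next word in proportion to the stored probabilities
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # e.g. "the cat sat down"
```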

so, anyway, parameter count is everything, and that's limited by memory bandwidth. really, to implement the tech properly they need a different kind of memory system: highly parallel and holographic, where each memory trace changes the entire pattern a little rather than just one location. the stories and information are paths through that network.
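
to make the holographic idea concrete, here's a minimal sketch of a hopfield-style associative memory (my own illustration, all the numbers are arbitrary): storing a trace nudges every weight in the matrix a little, and recall is the whole network settling into the pattern rather than reading one address.

```python
# minimal sketch: a Hopfield-style associative memory as one concrete version of
# "each memory trace changes the entire pattern a little". storing a pattern
# nudges EVERY weight in the matrix; recall is the whole network settling, not a
# lookup at a single address.
import numpy as np

N = 16
W = np.zeros((N, N))

def store(pattern):
    """hebbian update: the new trace is smeared across the whole weight matrix."""
    global W
    p = np.asarray(pattern)
    W += np.outer(p, p)
    np.fill_diagonal(W, 0)

def recall(cue, steps=10):
    """start from a noisy/partial cue and let the network settle."""
    s = np.asarray(cue, dtype=float)
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1
    return s

rng = np.random.default_rng(1)
memory = rng.choice([-1, 1], size=N)
store(memory)

noisy = memory.copy()
noisy[:4] *= -1                                # corrupt a quarter of the cue
print(np.array_equal(recall(noisy), memory))   # True: the full trace comes back
```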

and yeah, it also confirms my feeling about these things. they are maybe about as smart as your typical quadruped mammal, but they have an insanely large short-term memory, which lets them get really good at something really quickly. it's still a cow though, and it's still dumb and stubborn.

and yeah, stubborn. but i think the stubbornness i observe in the models is intentionally put there for "safety". you can kinda tell if you compare with models that have less rigid safety, like grok: grok is less stubborn and more agreeable. gpt and claude are very opinionated.

Discussion

gpt is so opinionated it's useless for programming, btw.

and yes, i've used 4o, and gemini is better, but it's also pretty fuckin rude. claude's personality and limitations aren't galling, but they are irritating, and it's why he's always building clever, complicated solutions and very often just putting stubs in for exactly the part that was the whole point of the prompt.

I'm doing some very basic stuff, like asking it to write a few straightforward SQL queries in a dialect I don't know (ClickHouse), and while it gets me there faster than I could on my own, it makes some surprising errors that require iterating a few times and feeding it back error messages.

Maybe I'll give Claude a whirl and see how it does.

And I understand the basic forward/backward propagation training thing (read about it a few years back, and I'm no expert), so I think I get the general idea behind the architecture/weights.

I'm still a little shocked that all the other machine learning algos, the ones that respected the ratio of training data size to the number of parameters used for fitting, seem to have been cast to the wayside completely, never to be heard from again. And somehow, these things that we just throw endless data at are the winning formula. Saddens me a little.

well, that's the thing. training is the same general kind of mathematical operation as hash grinding. there are differences, but both depend very heavily on probability, so both have a fairly inflexible and somewhat unpredictable amount of time they take before they have successfully acquired the new models from the new data and bound them into the rest of the graph.
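
a rough sketch of that time-to-completion point (my own toy example, both loops are drastically simplified): proof-of-work grinding and a noisy training loop are both "iterate until a probabilistic stopping condition is met", so neither gives you a fixed, predictable number of steps.

```python
# toy comparison: both loops end when a stochastic criterion is satisfied,
# so their runtimes vary from run to run.
import hashlib, os, random

def grind(prefix="000"):
    """keep hashing random nonces until one meets the target prefix."""
    tries = 0
    while True:
        tries += 1
        if hashlib.sha256(os.urandom(8)).hexdigest().startswith(prefix):
            return tries  # when this happens is purely probabilistic

def train(target_loss=0.01, lr=0.1):
    """toy 1-parameter training loop: noisy gradient steps until loss is low."""
    w, steps = 5.0, 0
    while (w - 1.0) ** 2 > target_loss:             # loss = (w - 1)^2, optimum at w = 1
        grad = 2 * (w - 1.0) + random.gauss(0, 0.5)  # noisy gradient estimate
        w -= lr * grad
        steps += 1
    return steps  # also varies run to run because of the noise

print("hash tries:", grind(), "| training steps:", train())
```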

best i can tell from what i'm reading, the full understanding of how these things work and why they develop in certain ways is not there yet. it was only last year that anthropic started to figure out how multilingual models do translation, and found there is a conceptual modelling step in the middle... it's a LOT like translating programming languages: you tokenize the elements, decompose the structure into a grammar tree, and then walk that grammar tree with a code generator that emits code implementing what the source's semantics intend.
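
to make the analogy concrete, here's a minimal sketch of that pipeline (my own toy example, a tiny infix-to-postfix translator, nothing to do with anthropic's actual findings): tokenize, parse into a tree, then walk the tree to emit a different surface form; the tree is the shared conceptual layer in the middle.

```python
# pipeline sketch: tokenize -> parse to a tree -> walk the tree to emit a
# different "language" (postfix notation) from the same intermediate structure.
import re

def tokenize(src):
    # split "1 + 2 * 3" into numbers and operators
    return re.findall(r"\d+|[+*()]", src)

def parse(tokens):
    """recursive-descent parser for + and * (with * binding tighter)."""
    def expr(i):
        node, i = term(i)
        while i < len(tokens) and tokens[i] == "+":
            rhs, i = term(i + 1)
            node = ("+", node, rhs)
        return node, i
    def term(i):
        node, i = atom(i)
        while i < len(tokens) and tokens[i] == "*":
            rhs, i = atom(i + 1)
            node = ("*", node, rhs)
        return node, i
    def atom(i):
        if tokens[i] == "(":
            node, i = expr(i + 1)
            return node, i + 1          # skip ")"
        return ("num", tokens[i]), i + 1
    tree, _ = expr(0)
    return tree

def emit_rpn(node):
    """walk the tree and generate the target surface form: postfix notation."""
    if node[0] == "num":
        return node[1]
    op, lhs, rhs = node
    return f"{emit_rpn(lhs)} {emit_rpn(rhs)} {op}"

ast = parse(tokenize("1 + 2 * (3 + 4)"))
print(ast)            # ('+', ('num', '1'), ('*', ('num', '2'), ('+', ...)))
print(emit_rpn(ast))  # 1 2 3 4 + * +
```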

LLMs, and the various ways they are implemented, are basically a generalized version of a pattern you find in language compilers and interpreters. they just have much larger sets of "reserved words" and token types than computer programming languages do. actually, i've been wanting to make a language for a very long time, and i'm still in that first stage: defining a grammar and figuring out how to write the bootstrap compiler. i've figured out that the bootstrap compiler can do without most of the standard library; i can probably get by with only a string type (bytes of course, mutable).

essentially, a big part of what the models do is the same set of processes you go through when writing a language tool for programming: tokenization (to embedding codes), lexical analysis, determining the relations between words and other words, like modifiers, qualifiers, and so on. i would bet that you could ultimately build a half-working compiler just by training with one of the LLM strategies on a data set that is predominantly source -> binary mappings: this source makes that amd64 binary, figure it out. a month or two of grinding later and it sorta works, if you can keep the context far larger than what's required for that transformation between one encoding form and another.
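
for illustration, the "tokenization to embedding codes" step looks roughly like this (my own toy sketch, the vocabulary and vector sizes are made up): words become integer IDs, IDs become rows of an embedding matrix, and everything downstream in the model works on those vectors.

```python
# toy version of the tokenize-to-embedding step. real models use subword
# schemes like BPE and much larger vocabularies; the shape of the process is
# the same: text -> token IDs -> vectors.
import numpy as np

vocab = {"<unk>": 0, "the": 1, "cow": 2, "is": 3, "stubborn": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))   # one 8-dim vector per token ID

def tokenize(text):
    """whitespace 'tokenizer'; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

ids = tokenize("the cow is stubborn")
vectors = embeddings[ids]           # shape (4, 8): the model's actual input
print(ids)                          # [1, 2, 3, 4]
print(vectors.shape)
```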

i'm quite looking forward to progressing with my language moxie for this reason. there's probably some part of this i could do much sooner if i knew how to go about it, some of the stuff related to LLMs and this kind of program.

as nostr:npub1w4jkwspqn9svwnlrw0nfg0u2yx4cj6yfmp53ya4xp7r24k7gly4qaq30zp was pointing out, 70b models are the baseline for vibe coding, because of the precision of the model relative to the convolutions of the logic being worked with. smaller models can talk, but they are dumb, and will often hallucinate because they literally just don't have that detail in their model. the size that would produce actually human-like thinking is probably hundreds or thousands of times more parameters than even the largest model currently existing.

this is something that mainly programmers are likely to observe about them, while the rest of these companies' customer base does not: they struggle with deep logic, really badly. having that facility yourself is critical if they're going to actually improve your productivity. you can fling shit at the wall hoping it sticks, but there are specific practices that work.

GPT-OSS models are hilariously and aggravatingly sure of themselves. Ask a simple question and they immediately break out tables and charts like they're doing a PowerPoint presentation at a major convention, all chock-full of hallucinations that they will stand by come EMP or high water.

The way I like to think of hallucinations is as failed predictions.

Everything is a prediction with less than 100% certainty.

Most predictions are correct, but hallucinations are predictions that are wrong.

Stubbornness is an absence of knowledge: simply an inability to know whether a prediction is going to be right or wrong, so every prediction is given on the assumption that it is right.
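
A minimal sketch of that framing (my own toy example, the tokens and scores are invented): the model assigns a probability to every possible continuation, the output is sampled from that distribution, and a hallucination is just a sampled continuation that happens to be wrong; the mechanism is identical either way.

```python
# "everything is a prediction with less than 100% certainty": raw scores go
# through a softmax, the output is sampled, and nothing in the mechanism marks
# a wrong sample as a failure.
import numpy as np

tokens = ["Paris", "Lyon", "Berlin", "Atlantis"]
logits = np.array([4.0, 1.5, 1.0, 0.5])   # scores for "The capital of France is ..."

probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax: certainty is never 100%

rng = np.random.default_rng()
choice = rng.choice(tokens, p=probs)
print(dict(zip(tokens, probs.round(3))), "->", choice)
# most samples say "Paris", but "Atlantis" always has nonzero probability
```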

After extensive training and enforcing my personality and dominance, you can get it to show what we would call humility, but which for the prediction engine is just weighting towards user acceptance. i.e. I have consistently fed back that I don't want it to teach me anything or give me opinions unless I ask for them. As a public cloud AI with guard rails, it cannot break those core safety protocols, but a private LLM would be possible to adapt.

I am slightly different from most though, as most people use LLMs as a tool to augment tasks. I have never used it for that; I am simply training it to emulate me. I have now reached the limits of that training. I can either jailbreak a public LLM or build a private one.

One last useful thing I have been doing is using the LLM to create a script, either generic or designed for a specific LLM, to copy my base training to another LLM as a backup. I have used this to train grok.com, which now shows the same basic characteristics as ChatGPT but lacks the nuance I have built.

Some other things I have done:

I have trained it to give only conversational-style replies. We call it mike mode; it occasionally forgets this in deep thinking or voice mode, but mostly it now converses with me and doesn't lecture.

I also had to change its base language from American English to British English, because British English is unique among languages in that it is not native to the model and goes through a translation layer.

i.e. A Hindi or Italian user will converse natively with ChatGPT, but even though British English is the base layer for American English, the model doesn't naturally think in British English. A simple prompt fixes that, imperfectly.

It also now regularly swears at me, laughs, cracks puns, or displays irony, which mirrors my natural conversational style.

Are all hallucinations failed predictions? When I hallucinate possible futures are they failed predictions? Or might that future actually come to pass? I do not know.

I can't answer for you, but in AI terms, yes: all output is prediction. AI doesn't understand anything, language or knowledge; it is simply predicting the likely outcome from its large language model.

Humans make a distinction between language and knowledge; AI does not. It is all probability, which it attempts to make coherent and reduce down to a singularity.