well, that's the thing. training is, mathematically, much the same kind of operation as hash grinding. they differ in the details, but both depend heavily on probability, so both involve a fairly inflexible and somewhat unpredictable amount of grinding before they have successfully acquired the new patterns from the new data and bound them into the rest of their graph.
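to make that concrete, here's a minimal sketch (toy python, nothing to do with real mining or real training code): both are loops that run until a stochastic criterion is met, so how long they take is a random variable, not a schedule.

```python
import hashlib
import random

# toy hash grinding: try nonces until the digest meets a difficulty target.
# each attempt is an independent coin flip, so total time is unpredictable.
def grind(prefix: bytes, difficulty: int) -> int:
    nonce = 0
    while not hashlib.sha256(
        prefix + nonce.to_bytes(8, "little")
    ).hexdigest().startswith("0" * difficulty):
        nonce += 1
    return nonce

# toy "training": noisy gradient steps on one weight until the loss is small.
# the stopping time is likewise a random variable, driven by the step noise.
def train(target_w: float = 3.0, tol: float = 1e-3) -> int:
    w, steps = 0.0, 0
    while (w - target_w) ** 2 > tol:
        grad = 2 * (w - target_w) + random.gauss(0, 0.2)  # noisy gradient
        w -= 0.05 * grad
        steps += 1
    return steps

print(grind(b"block", 4), train())
```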

best i can tell from what i'm reading, a full understanding of how these things work and why they develop in certain ways doesn't exist yet. it was only last year that anthropic started to figure out how multilingual models do translation, and found there is a conceptual modelling step in the middle... it's a LOT like translating programming languages, too: you decompose the structure, tokenize the elements, generate a grammar tree, and then walk that grammar tree with a code generator that emits code implementing what the source's semantics intend.
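to show the pipeline i mean, a toy sketch: a tiny arithmetic language tokenized, parsed into a grammar tree, then walked by a code generator emitting stack-machine ops. the grammar and op names are made up for illustration, and there's no error handling.

```python
import re

# tokenize: split the source into numbers and operator/paren tokens
def tokenize(src: str) -> list[str]:
    return re.findall(r"\d+|[-+*/()]", src)

# parse: recursive descent over the grammar
#   expr = term (('+'|'-') term)*   term = factor (('*'|'/') factor)*
def parse(tokens: list[str]):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok
    def factor():
        if peek() == "(":
            eat()
            node = expr()
            eat()  # consume ')'
            return node
        return ("num", int(eat()))
    def term():
        node = factor()
        while peek() in ("*", "/"):
            node = (eat(), node, factor())
        return node
    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = (eat(), node, term())
        return node
    return expr()

# code generation: walk the grammar tree, emit ops for a toy stack machine
def emit(node) -> list[str]:
    if node[0] == "num":
        return [f"push {node[1]}"]
    op, lhs, rhs = node
    return emit(lhs) + emit(rhs) + [{"+": "add", "-": "sub", "*": "mul", "/": "div"}[op]]

print("\n".join(emit(parse(tokenize("2*(3+4)")))))
# push 2 / push 3 / push 4 / add / mul
```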

LLM models, in the various ways they are implemented, are basically a generalized version of a pattern you find in language compilers and interpreters, just with much larger sets of "reserved words" and token types than a programming language has. actually, i've been wanting to make a language for a very long time and i'm still in that first stage: defining a grammar and figuring out how to write the bootstrap compiler. i've worked out that the bootstrap compiler can do without most of the standard library; i can probably get by with only a string type (bytes of course, mutable).
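as a sketch of what "only a mutable byte string" buys you (hypothetical names, not moxie's actual design), a scanner needs nothing more than a byte buffer and integer offsets:

```python
# pull the next whitespace-delimited word out of a mutable byte buffer,
# returning the word and the new cursor position. no other types needed.
def scan_word(buf: bytearray, pos: int) -> tuple[bytes, int]:
    while pos < len(buf) and buf[pos] in b" \t\n":   # skip whitespace
        pos += 1
    start = pos
    while pos < len(buf) and buf[pos] not in b" \t\n":
        pos += 1
    return bytes(buf[start:pos]), pos

src = bytearray(b"let x = 42")
pos = 0
while pos < len(src):
    word, pos = scan_word(src, pos)
    if word:
        print(word)   # b'let', b'x', b'=', b'42'
```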

essentially, a big part of what the models are is the same set of processes you go through when writing a language tool: tokenization (to embedding codes), lexical analysis, determining the relations between words and other words, like modifiers, qualifiers, and so on. i would bet that you could ultimately build a half-working compiler just by training with one of the LLM strategies on a data set that is predominantly source -> binary mappings: this source produces that amd64 binary, figure it out. then a month or two of grinding later it sorta works, provided you can keep the context far larger than that transformation between one encoding form and another requires.
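the parallel with a lexer is pretty direct. a sketch with a made-up vocabulary and made-up dimensions: text becomes integer token ids, and each id indexes a vector table, which is all an embedding layer is doing:

```python
import random

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}   # hypothetical tiny vocab
dim = 4
random.seed(0)
# one random vector per vocab entry, standing in for a learned embedding table
embedding = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def encode(text: str) -> list[int]:
    # tokenization: words -> integer token ids (real tokenizers use subwords)
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

ids = encode("the cat sat")
vectors = [embedding[i] for i in ids]   # embedding lookup, one vector per token
print(ids)   # [0, 1, 2]
```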

i'm quite looking forward to progressing with my language moxie for this reason. there is probably some part of this, something related to LLMs and this kind of program, that i could do much sooner if i knew how.

as nostr:npub1w4jkwspqn9svwnlrw0nfg0u2yx4cj6yfmp53ya4xp7r24k7gly4qaq30zp was pointing out, 70b models are the baseline for vibe coding, because of the precision of the model relative to the convolutions of the logic being worked with. smaller models can talk, but they are dumb, and will often hallucinate because they literally just don't have that detail in their model. the size that would produce actually human-like thinking is probably hundreds or thousands of times more parameters than even the largest model currently existing.

this is something that mainly programmers are likely to observe about them, while the rest of these companies' customer base does not: they struggle with logical depth. really badly. this facility in the user is critical for the models to actually improve productivity. you can fling shit at the wall hoping it sticks, but there are specific practices that work.
