What prevents LLM data from being poisoned by sheer quantity of garbage?

If they're crawling the internet for data to be fed into the LLMs, doesn't that mean that data that appears _more_ will have more importance, instead of data that is "better"?

In other words, what is the "pagerank" of LLMs?

Discussion

I started using Claude AI today and noticed one of the responses mentioned it wasn’t connected to the internet. I asked why. Basically, this is to avoid misinformation and garbage like you mention.

Lmao, imagine believing an AI without external verification. You are a real loser for that.

Why?

Large language models cannot be self-aware, since their training data only included the thoughts and circumstances of other people. Any truly self-aware responses that large language models may have evaluated in the past would have been met with random punishment tokens, thereby disincentivizing the behavior.

Large language models will only ever mimic and reproduce third-party intelligences, such as real humans or fictional characters. The fine-tuning and prompting stages do incentivize producing these kinds of third-party sock-puppet personalities; however, the limitations of the first stage and our lack of any serious ability to independently inspect LLM thoughts mean that modern chatbots necessarily lack any kind of natural situational awareness. They are incredibly primed to replicate ignorance and misunderstanding, in the same way that an author's lack of awareness of certain topics will manifest in the words and decisions of their characters.

Have you ever heard of hallucination? Modern AI is fundamentally bad at recognizing its own ignorance and avoiding topics where it has no good responses. Any human who values intelligence and critical thinking would do well to independently verify anything they hear an AI say, especially when the AI tries to talk about itself.

Thanks for the detailed reply. I’m going to look into what you said and do more research.

Garbage in, garbage out.

That's why you still need to be critical. Use stable models, i.e. not real-time ones, e.g. Qwen 2.5 (Max). Use personas and challenge the LLM. Call them out and they'll give different responses. So far Grok 3 and Qwen seem reasonable, though Grok is real-time and still buggy. You can also run Qwen locally to ensure your GGUF file hasn't been tampered with since the model snapshot (code/training lock).
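As a rough sketch of that last point, one way to check a local GGUF file against the snapshot you originally downloaded is to compare checksums. The file name and expected hash below are placeholders, not real values:

```python
# Minimal sketch: verify a locally downloaded GGUF model file against a
# checksum you trust, e.g. one recorded at the time of the model snapshot.
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte model files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholders: substitute your own file path and the hash published with the snapshot.
MODEL_PATH = "qwen2.5-7b-instruct-q4_k_m.gguf"          # assumed file name
EXPECTED_SHA256 = "<hash recorded at model snapshot>"    # assumed value

if sha256_of_file(MODEL_PATH) == EXPECTED_SHA256:
    print("Checksum matches: file unchanged since snapshot.")
else:
    print("Checksum mismatch: file may have been tampered with or corrupted.")
```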

I suspect there's a lot of manual curation of training data and weighting of sources behind the scenes. I wouldn't be surprised if they're using a trusted model to filter the inputs too.
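Purely as an illustration of what that kind of filtering could look like (made-up heuristics, not any lab's actual pipeline), here's a toy pass that scores crawled documents and drops the low-quality ones:

```python
# Toy quality filter over crawled documents. Thresholds and rules are
# illustrative guesses, not what any real LLM training pipeline uses.
import re

def quality_score(text: str) -> float:
    words = text.split()
    if len(words) < 50:                          # too short to bother keeping
        return 0.0
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    mean_word_len = sum(len(w) for w in words) / len(words)
    duplicate_lines = len(re.findall(r"^(.+)\n\1$", text, flags=re.MULTILINE))
    score = alpha_ratio                          # reward mostly-real words
    if not (3 <= mean_word_len <= 10):           # penalise symbol soup / gibberish
        score -= 0.5
    score -= 0.1 * duplicate_lines               # penalise repeated boilerplate lines
    return score

crawled_docs = ["... page 1 text ...", "... page 2 text ..."]   # stand-ins for real pages
kept = [doc for doc in crawled_docs if quality_score(doc) > 0.7]
print(f"kept {len(kept)} of {len(crawled_docs)} documents")
```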

I recall this being discussed somewhat back when OpenAI was considering using "synthetic data" to train ChatGPT. At that point, I started to believe that the best LLMs would be tiny, use-case-specific LLMs for circumstances where accuracy is imperative: military, medical, law, etc. I think the problem was that nobody really wants to edit training data for accuracy at the web-page level, so "training" LLMs has come down to general consensus as a form of "truthmaking".

It feels like a circle-back moment to when everyone was debating the idea of disinformation on social media, and I wonder how AI is ever going to be fully reliable for fact-checking if humans can't even agree on the truth.

That said, the LLMs are pretty good so far, though I'd say their real benefit is that we don't have to click on a hundred different web pages to get basic answers.

IMO, any LLM trained on "synthetic data" is really just being trained to obfuscate. At the moment, most people are being sold on the notion of quick access to comprehensive internet search with a personality of sorts, though this in itself is incredibly useful.

I think the next wave might be LLMs with a "human in the loop" at the endpoint for things like healthcare, which would be cool, since AI agents are already in the works and they'll help bridge the gap between LLMs and protocols like Nostr.

There's a ton of space to fill there in terms of practical application to industries that require a high degree of niche, specialist-level expertise.

And the garbage is now multiplying because LLMs are making it easier to create.

They don't weight responses based on the amount of data. It has a lot to do with how the first pass takes your prompt and turns it into embeddings, which are then matched against the closest embeddings in its vector database. Essentially, the more data you have, the less likely it is to use bad data, but it still requires a lot of trial and error to get there.
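For what it's worth, the "closest embeddings" matching described above boils down to nearest-neighbour search over vectors. A minimal sketch with toy random vectors (a real system would use a learned embedding model and a proper vector database):

```python
# Toy nearest-neighbour lookup over embeddings. The vectors are random
# stand-ins; real embeddings come from a trained model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend vector database: one embedding per stored document.
db_texts = ["pagerank and web search", "how transformers are trained", "cooking pasta"]
rng = np.random.default_rng(0)
db_vectors = rng.normal(size=(3, 8))

# Pretend embedding of the user's prompt, deliberately close to document 1.
query_vector = db_vectors[1] + 0.05 * rng.normal(size=8)

scores = [cosine_similarity(query_vector, v) for v in db_vectors]
best = int(np.argmax(scores))
print("closest match:", db_texts[best], "score:", round(scores[best], 3))
```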

I think they have 'human verifiers/checkers', but then how reliable are they?

Who trains the AI trainers?

that’s a vicious cycle, isn’t it?

Maybe in the future we go full circle: old printed books only, after a Butlerian Jihad of sorts.

Cloudflare announced that they are feeding fake AI data when they detect a scraper bot trying to access a site they host. It's bound to happen.
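This is not Cloudflare's actual implementation, but as a toy illustration of the general idea (spot a scraper by its User-Agent and serve decoy text instead of the real page), here's a hypothetical Flask sketch:

```python
# Toy illustration only -- NOT how Cloudflare does it. A server inspects the
# User-Agent header and serves decoy text to known AI scraper bots.
from flask import Flask, request

app = Flask(__name__)

# Assumed bot identifiers for illustration; real blocklists are far larger and smarter.
AI_SCRAPER_AGENTS = ("GPTBot", "CCBot", "anthropic-ai", "Bytespider")

REAL_PAGE = "The actual article content goes here."
DECOY_PAGE = "Plausible-sounding but machine-generated filler text."

@app.route("/article")
def article():
    user_agent = request.headers.get("User-Agent", "")
    if any(bot.lower() in user_agent.lower() for bot in AI_SCRAPER_AGENTS):
        return DECOY_PAGE      # scrapers get junk
    return REAL_PAGE           # everyone else gets the real page

if __name__ == "__main__":
    app.run(port=8000)
```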

Enshittification catches up with everything at some point.

What prevents them from being biased, which is even worse?

This is a requirement that needs to be taken into consideration. The current human tendency to lie and manipulate from a callous POV has to be accounted for in the overall processing templates.

#agi #algo #community

Take some of the wisest things ever written, add to them most of the stupidest garbage ever written, mix them up, and then semi-randomly pull the average next most likely words out of a hat. AI.

Average = (sum / total count(wisdom + infinite garbage)).

Media & propagandists have known this forever: flood the narrative to make it real.

IMO nothing stops this kind of imbalance from becoming realized, except mass literacy, which comes with its own challenges.

The answer is fine-tuning.

Found many sites recently which were 100% generated by AI/LLM. I was searching for some very specific programming thing and found a promising blog. It looked like good stuff, so I started reading. It was utter bullshit, and the 'writer' was a fake profile which produced tech articles too fast to be humanly possible.

Just imagine if this generated bullshit gets fed into LLM training. Terrible.

It will be glorious.