Nostr Web Client

What prevents LLM data from being poisoned by sheer quantity of garbage?

If they're crawling the internet for data to feed into the LLMs, doesn't that mean that data that appears _more often_ will carry more weight than data that is "better"?

In other words, what is the "pagerank" of LLMs?

Jochens 9mo ago

They don’t weight responses based on the amount of data. It has a lot to do with how the first pass takes your prompt and turns it into embeddings. Those are then matched against the closest embeddings in its vector database. Essentially, the more data you have, the less likely it is to use bad data, but it still takes a lot of trial and error to get there.
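To make the lookup described above concrete, here is a minimal sketch of matching a prompt embedding against stored embeddings by cosine similarity. Everything in it is an assumption for illustration: the vectors are hand-written 3-dimensional toys (real embeddings come from a model and have hundreds of dimensions), and the names `store`, `nearest`, and the `doc_*` keys are hypothetical. It only illustrates a retrieval-style nearest-neighbour step, not how a model's training data is weighted.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, k=1):
    """Return the k keys in `store` whose vectors are most similar to `query`."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical toy "vector database": name -> embedding.
store = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_taxes": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the user's prompt.
query = [0.85, 0.15, 0.05]

print(nearest(query, store, k=2))  # ['doc_cats', 'doc_dogs']
```

In this sketch the ranking depends only on how close each stored vector is to the query, so adding more copies of an unrelated entry such as `doc_taxes` would not move it ahead of the closer matches.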
