๐—ช๐—ต๐—ฒ๐—ฟ๐—ฒ ๐—ฑ๐—ผ ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐™๐™š๐™–๐™ก๐™ก๐™ฎ ๐—ฆ๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ ๐˜๐—ต๐—ฒ๐—ถ๐—ฟ ๐——๐—ฎ๐˜๐—ฎ?

There's a persistent myth that ChatGPT was trained on "the whole internet". If you thought that, you're not alone; it's a very common misconception.

It's time we dispel this myth once and for all. ⬇️

The truth is, the amount of data that #LLMs are trained on is tiny, at least compared to the amount of data available out there. ChatGPT, for example, was trained on less than 0.000000001% of the internet, according to most estimates of the internet's size.
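That percentage is easy to sanity-check. Here's a back-of-the-envelope sketch in Python, assuming two commonly cited figures: roughly 570 GB of filtered training text (the GPT-3 paper's number) and about 64 zettabytes of data in the world (a 2020 industry estimate). Both are rough assumptions, not official OpenAI numbers:

```python
# Back-of-the-envelope check on "less than 0.000000001%".
# Assumed figures (rough public estimates, not official numbers):
TRAIN_BYTES = 570e9        # ~570 GB of filtered training text (GPT-3 paper)
INTERNET_BYTES = 64e21     # ~64 zettabytes of global data (2020 estimate)

fraction = TRAIN_BYTES / INTERNET_BYTES
percent = fraction * 100
print(f"{percent:.1e} %")  # ~8.9e-10 %, i.e. under 0.000000001%
```

Swap in whatever estimates you prefer; the conclusion barely moves, because the gap spans about ten orders of magnitude.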

For perspective, if all the data on the internet were represented by the entire surface of the Earth, then all of ChatGPT's training data would cover only about 4,500 square meters, roughly the area of a football pitch.

๐™’๐™๐™ฎ ๐™ž๐™จ ๐™ฉ๐™๐™–๐™ฉ ๐™จ๐™ค?

It's because most of the data out there is not in a useful format for training a language model. In fact, you can think of data as untapped raw material: it has to be cleaned and refined before it can be used.
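To make "cleaning and refining" concrete, here's a toy filter in Python. The thresholds and rules are invented for this example; real training pipelines use far more sophisticated quality filtering and deduplication, but the effect is the same: most raw web data gets thrown away.

```python
import hashlib

def clean_corpus(docs):
    """Toy quality filter showing why raw web data shrinks so much."""
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text) < 20:                 # drop near-empty fragments
            continue
        letters = sum(c.isalpha() for c in text)
        if letters / len(text) < 0.7:      # drop markup/navigation junk
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()
        if digest in seen:                 # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept

raw = ["<div>menu | home | login</div>",
       "Cats are small domesticated carnivores kept as pets.",
       "Cats are small domesticated carnivores kept as pets.",
       "ok"]
print(clean_corpus(raw))  # only one copy of the cat sentence survives
```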

Then how can LLMs respond to questions as well as they do?

To answer this, it's important to understand that Large Language Models are really just sophisticated probability machines. They are trained on the relationships between words and sentences. What they produce is a *probability* that one word will follow another. Think of them as much more capable versions of the predictive text on your phone.
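A toy illustration of that idea: a simple bigram counter, invented for this example. Real LLMs use neural networks over tokens rather than word counts, but their output is the same kind of object, a probability distribution over what comes next.

```python
from collections import Counter, defaultdict

# Toy "next word" model: count which word follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word):
    """Probability distribution over the word that follows `word`."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

Generating text is then just repeatedly sampling from this distribution, which is exactly what your phone's keyboard does on a smaller scale.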

How can probability machines do so much with so little? How can they make any sense of the exabytes of cat videos, fake news, podcasts, articles, NSFW content, social media posts, music, app downloads, and more? The answer: humans.

Humans are essential for separating the signal from the noise. That touches on another myth: that #AI will replace humans in their work. But that's for next time. 😉

Did this help you understand AI and LLMs better? Give it a Like 🤙

Know anyone with this misconception? Share 🔄 it with them.

Have AI-related questions for me? Drop them in the comments 👇

๐—ก๐—ผ๐˜๐—ฒ: This is a reboot of one of my previous posts. It improved so much that it warranted being posted again.
