๐—ช๐—ต๐—ฒ๐—ฟ๐—ฒ ๐—ฑ๐—ผ ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐™๐™š๐™–๐™ก๐™ก๐™ฎ ๐—ฆ๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ ๐˜๐—ต๐—ฒ๐—ถ๐—ฟ ๐——๐—ฎ๐˜๐—ฎ?

There's a persistent myth that ChatGPT was trained on "the whole internet". If you thought that, you're not alone; it's a very common misconception.

It's time we dispel this myth once and for all. ⬇️

The truth is, the amount of data that #LLMs are trained on is tiny, at least compared to the amount of data available out there. ChatGPT, for example, was trained on less than 0.000000001% of the internet, according to most estimates of the internet's size.
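That percentage is easy to sanity-check. Here's a back-of-the-envelope sketch in Python, assuming two commonly cited figures: roughly 570 GB of filtered training text (the GPT-3 paper's number) and about 64 zettabytes of data in the world (a 2020 industry estimate). Both are rough assumptions, not official OpenAI numbers:

```python
# Back-of-the-envelope check on "less than 0.000000001%".
# Assumed figures (rough public estimates, not official numbers):
TRAIN_BYTES = 570e9        # ~570 GB of filtered training text (GPT-3 paper)
INTERNET_BYTES = 64e21     # ~64 zettabytes of global data (2020 estimate)

fraction = TRAIN_BYTES / INTERNET_BYTES
percent = fraction * 100
print(f"{percent:.1e} %")  # ~8.9e-10 %, i.e. under 0.000000001%
```

Swap in whatever estimates you prefer; the conclusion barely moves, because the gap spans about ten orders of magnitude.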

For perspective, if all the data on the internet were represented by the entire surface of the Earth, then all of ChatGPT's training data would cover only about 4,500 square meters, roughly the area of a football pitch.

๐™’๐™๐™ฎ ๐™ž๐™จ ๐™ฉ๐™๐™–๐™ฉ ๐™จ๐™ค?

It's because most of the data out there is not in a useful format for training a language model. In fact, you can think of data as untapped raw material: it has to be cleaned and refined before it can be used.
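To make "cleaning and refining" concrete, here's a toy filter in Python. The thresholds and rules are invented for this example; real training pipelines use far more sophisticated quality filtering and deduplication, but the effect is the same: most raw web data gets thrown away.

```python
import hashlib

def clean_corpus(docs):
    """Toy quality filter showing why raw web data shrinks so much."""
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text) < 20:                 # drop near-empty fragments
            continue
        letters = sum(c.isalpha() for c in text)
        if letters / len(text) < 0.7:      # drop markup/navigation junk
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()
        if digest in seen:                 # drop exact duplicates
            continue
        seen.add(digest)
        kept.append(text)
    return kept

raw = ["<div>menu | home | login</div>",
       "Cats are small domesticated carnivores kept as pets.",
       "Cats are small domesticated carnivores kept as pets.",
       "ok"]
print(clean_corpus(raw))  # only one copy of the cat sentence survives
```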

Then how can LLMs respond to questions as well as they do?

To answer this, it's important to understand that Large Language Models are really just sophisticated probability machines. They are trained on the relationships between words and sentences. What they produce is a *probability* that one word will follow another. Think of them as much more capable versions of the predictive text on your phone.
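A toy illustration of that idea: a simple bigram counter, invented for this example. Real LLMs use neural networks over tokens rather than word counts, but their output is the same kind of object, a probability distribution over what comes next.

```python
from collections import Counter, defaultdict

# Toy "next word" model: count which word follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word):
    """Probability distribution over the word that follows `word`."""
    c = counts[word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

Generating text is then just repeatedly sampling from this distribution, which is exactly what your phone's keyboard does on a smaller scale.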

How can probability machines do so much with so little? How can they make any sense of the exabytes of cat videos, fake news, podcasts, articles, NSFW content, social media posts, music, app downloads, and more? The answer: humans.

Humans are essential for separating the signal from the noise. That touches on another myth: that #AI will replace humans in their work. But that's for next time. 😉

Did this help you understand AI and LLMs better? Give it a Like 🤙

Know anyone with this misconception? Share 🔄 it with them.

Have AI-related questions for me? Drop them in the comments 👇

๐—ก๐—ผ๐˜๐—ฒ: This is a reboot of one of my previous posts. It improved so much that it warranted being posted again.
