Where do LLMs Really Source their Data?
There's a common myth going around that ChatGPT was trained on "the whole internet". If you thought that, you're not alone.
It's time we dispel this myth once and for all.

The truth is, the amount of data that #LLMs are trained on is tiny, at least in comparison to the amount of available data out there. ChatGPT, for example, was trained on less than 0.000000001% of the internet, according to most internet size estimates.
For perspective, if all the data on the internet were represented by the entire surface of the Earth, then all of ChatGPT's data would be represented by only about 478 square centimeters (about 74 square inches), or approximately the area taken up by a typical dinner plate.
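If you want to sanity-check the dinner plate comparison, here is a minimal Python sketch of the arithmetic. It assumes the Earth's total surface area is roughly 510 million square kilometers, a commonly cited figure that is not part of the original estimate; the 478 square centimeters comes from the comparison above, and the function names are purely illustrative.

```python
import math

# Assumption: Earth's total surface area is roughly 510 million km^2.
# This is a commonly cited figure, not a number taken from the post.
EARTH_SURFACE_KM2 = 510e6
CM2_PER_KM2 = 1e10        # 1 km = 1e5 cm, so 1 km^2 = 1e10 cm^2
CM2_PER_IN2 = 2.54 ** 2   # 1 inch = 2.54 cm exactly

PLATE_AREA_CM2 = 478.0    # patch size quoted in the post


def implied_fraction(patch_cm2, total_km2=EARTH_SURFACE_KM2):
    """Fraction of the total surface covered by a patch of the given area."""
    return patch_cm2 / (total_km2 * CM2_PER_KM2)


def disc_diameter_cm(area_cm2):
    """Diameter of a circular plate with the given area."""
    return 2 * math.sqrt(area_cm2 / math.pi)


fraction = implied_fraction(PLATE_AREA_CM2)
print(f"Fraction of Earth's surface: {fraction:.2e}")
print(f"As a percentage:             {fraction * 100:.2e}%")
print(f"Patch area in square inches: {PLATE_AREA_CM2 / CM2_PER_IN2:.1f}")
print(f"Equivalent plate diameter:   {disc_diameter_cm(PLATE_AREA_CM2):.1f} cm")
```

Under that assumption, the plate covers a share of the Earth's surface on the order of 1e-14 percent, which sits comfortably under the "less than 0.000000001%" upper bound quoted above.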
Why is that so?
It's because most of the data out there is not in a useful format for training a language model. In fact, you can think of data like untapped raw materials: it has to be cleaned and refined before it can be used.
Then how can LLMs respond to questions as well as they do?
To answer this, it's important to understand that Large Language Models are really just sophisticated probability machines. They learn the statistical relationships between words and sentences in their training text. What they produce is a *probability* for each word that could come next, given the words so far. Think of them as much more capable versions of predictive text on your phone.
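To make the "probability machine" idea concrete, here is a toy Python sketch that counts which word follows which in a tiny made-up corpus and turns those counts into next-word probabilities. This is a simple bigram counter, not how ChatGPT or any real LLM is built: real models use neural networks over tokens and look at far more context than one word. The corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(model, word):
    """Turn the follower counts for `word` into probabilities."""
    followers = model.get(word.lower())
    if not followers:
        return {}  # word never seen with a follower in the training text
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpus, standing in for cleaned training data.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")

# Print the candidate next words for "the", most likely first.
for candidate, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"the -> {candidate}: {p:.2f}")
```

Even this tiny version behaves like predictive text: after "the", it rates "cat" as more likely than "mat" or "fish" simply because it saw that pairing more often.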
How can probability machines do so much with so little? How can they make any sense of the exabytes of cat videos, fake news, podcasts, articles, NSFW content, social media posts, music, app downloads, and more? The answer: humans.
Humans are essential for separating the signal from the noise. Which touches on another myth: that #AI will replace humans in their work. But that's for next time.

Did this help you understand AI and LLMs better? Give it a Like.
Know anyone with this misconception? Share it with them.
Have AI-related questions for me? Drop them in the comments.