To help explain the weirdness of LLM Tokenization I thought it could be amusing to translate every token to a unique emoji. This is a lot closer to truth - each token is basically its own little hieroglyph and the LLM has to learn (from scratch) what it all means based on training data statistics.

So have some empathy the next time you ask an LLM how many letters 'r' there are in the word 'strawberry', because your question looks like this:

‍‍‍

Play with it here :)

https://colab.research.google.com/drive/1SVS-ALf9ToN6I6WmJno5RQkZEHFhaykJ#scrollTo=75OlT3yhf9p5…

Source: x.com/karpathy/status/1816637781659254908

Reply to this note

Please Login to reply.

Discussion

No replies yet.