The Evolution of LLM (Part 6): Unveiling the Mystery of the Tokenizer

The tokenizer is an important yet unglamorous component of LLMs. In the earlier parts of this series, tokenization was character-level: we built an embedding table over all 65 possible characters and then encoded the training set through an embedding layer. In practice, modern language models use more sophisticated schemes that operate at the chunk level, using algorithms such as Byte Pair Encoding (BPE).
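To make the contrast concrete, here is a minimal sketch of the character-level approach from the earlier parts. The tiny corpus and the `stoi`/`itos` names are illustrative only; the point is simply that every unique character becomes one token id.

```python
# Minimal character-level "tokenizer": every unique character gets an integer id.
text = "hello world"                      # stand-in for the full training corpus
chars = sorted(set(text))                 # the Shakespeare corpus yields 65 such characters
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]   # string -> list of integer token ids
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)            # [3, 2, 4, 4, 5] for this tiny corpus
print(decode(ids))    # "hello"
```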

In the GPT-2 paper Language Models are Unsupervised Multitask Learners, researchers built a vocabulary of size 50,257 with a context length of 1024 tokens.


In the attention layers of the language model's Transformer network, each token attends to the tokens that precede it in the sequence, so it can see at most the previous 1024 tokens.

Tokens can be seen as the atomic units of a language model, and tokenization is the process of converting a text string into a sequence of tokens.
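As a concrete illustration, the sketch below uses OpenAI's tiktoken library, which ships the GPT-2 BPE (the example string is arbitrary). Encoding a string yields integer token ids, and each id maps back to a chunk of text; the vocabulary size matches the 50,257 figure from the paper.

```python
# Chunk-level tokenization with GPT-2's BPE, via OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)                     # 50257, the GPT-2 vocabulary size

ids = enc.encode("Tokenization is fun!")
print(ids)                             # a short list of integer token ids
print([enc.decode([i]) for i in ids])  # the text chunk each id maps back to
```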

As an aside: there is also research that feeds bytes directly into the model without any tokenization (e.g., MegaByte), but such approaches have not yet been fully validated.

First taste

Tokenization is the reason behind many strange phenomena in LLMs (a short demonstration follows the list below):


  • Why can't LLMs spell words correctly? Tokenization.
  • Why can't LLMs perform very simple string manipulation tasks, like reversing a string? Tokenization.
  • Why do LLMs perform worse when handling non-English languages (like Japanese)? Tokenization.
  • Why do LLMs perform poorly on simple arithmetic? Tokenization.
  • Why does GPT-2 encounter unnecessary trouble in Python coding? Tokenization.
  • Why does my LLM suddenly stop when it sees the string "<|endoftext|>"? Tokenization.
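To see why several of these issues arise, it helps to inspect what the model actually receives. The sketch below (again assuming tiktoken's GPT-2 encoding; the sample strings are arbitrary) prints the chunks a few inputs are split into: words arrive as multi-letter chunks rather than individual characters, and numbers split into arbitrary pieces.

```python
# Why spelling, reversal, and arithmetic are hard: the model sees chunk ids, not characters.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["strawberry", ".DefaultCellStyle", "12345 + 6789"]:
    ids = enc.encode(text)
    chunks = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {chunks}")

# A word like "strawberry" arrives as a few multi-letter chunks, so character-level
# questions (spelling, reversing) ask the model about letters it was never explicitly
# given; long numbers are split into arbitrary chunks, which hurts simple arithmetic too.
```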