As briefly mentioned in Part 1, tokenization can affect LLM performance on basic arithmetic. This post considers that idea in slightly more detail. The blog post Integer Tokenization is Insane, and its follow-up Integer Tokenization is now Much Less Insane by the same author, showed how early models like GPT-2 tokenized integers with no particular plan, while later models took a much more deliberate approach. I wanted to look at that for myself.

What was GPT-2 doing?

The first thing is to recreate the original chart for GPT-2. This can be done pretty easily in Python using the llama-cpp-python module. There are alternative ways to run local LLMs, but I'm using llama.cpp for most of my experiments at the moment.

Here's the code. It requires downloading a llama.cpp-compatible GGUF version of GPT-2; there are several available on HuggingFace.

import matplotlib.pyplot as plt
import numpy as np
from llama_cpp import Llama

# Load a GGUF conversion of GPT-2; only the tokenizer is used here.
llm = Llama(model_path='./models/gpt2.Q4_K_M.gguf')

# Count how many tokens each integer 0-9999 gets. The leading space matters:
# GPT-2's BPE treats " 362" and "362" as different strings.
token_counts = []
for i in range(10000):
    digit_str = f" {i}"
    tokens = llm.tokenize(digit_str.encode("utf-8"), add_bos=False)
    token_counts.append(len(tokens))

# Arrange the counts into a 100x100 grid and render it as a heatmap.
data = np.array(token_counts).reshape(100, 100)

plt.style.use('dark_background')
plt.figure(figsize=(12, 10))
plt.imshow(data, cmap="viridis")
plt.tight_layout()
plt.savefig("gpt2_digits_heatmap.png")

The resulting chart matches the one in the original blog post, so we should be good.

Heatmap of token counts for the integers 0-9999 under the GPT-2 tokenizer (digits_gpt2.png)

What we're looking at is the number of tokens assigned to each integer from 0-9999. The first 362 numbers (0-361) are each assigned one token. Then at 362 something different happens: the tokenizer splits "362" into two separate tokens: "36" and "2". Then 363 is assigned a single token again.

Number   # Tokens   Token Values
0        1          0
1        1          1
..       ..         ..
361      1          361
362      2          [36, 2]
363      1          363
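The individual splits can be inspected by detokenizing each token id back to its text. Here's a quick check reusing the llm object from the script above; the exact rendering of each piece (for example, whether the leading space is attached to the first one) is my assumption about how llama.cpp decodes GPT-2 tokens.

# Reusing the `llm` object loaded earlier, show how a few numbers split.
for n in [361, 362, 363, 495, 497, 499]:
    tokens = llm.tokenize(f" {n}".encode("utf-8"), add_bos=False)
    pieces = [llm.detokenize([t]).decode("utf-8") for t in tokens]
    print(n, len(tokens), pieces)

# Expected, per the tables in this post:
#   361 1 [' 361']
#   362 2 [' 36', '2']
#   ...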

There's nothing special about "362". It's just the first number that happens to split. More splits occur before we reach this interesting run in the 400s, where, across five consecutive numbers, we see splits after the hundreds digit, a split after the tens digit, and no splitting at all.

Number   # Tokens   Token Values
495      2          [4, 95]
496      2          [4, 96]
497      2          [49, 7]
498      2          [4, 98]
499      1          499
..       ..         ..

After reaching the 400s, most numbers are encoded with 2 tokens.

But then another interesting sequence begins in the late 1800s. All of the recent "year" numbers between 1895 and 2022 are encoded with a single token.

Number   # Tokens   Token Values
1895     1          1895
..       ..         ..
2021     1          2021
2022     1          2022
2023     2          [20, 23]
2024     1          2024
..       ..         ..

After 2022, it mostly goes back to using 2 tokens. The reason this happens is that GPT-2 uses something called BPE, or Byte-Pair Encoding. Roughly speaking, short, common words or sub-words are assigned individual tokens, while longer, less common words are represented as a combination of multiple tokens. Because recent years are so common in the training data, they get their own tokens, even though other values in the 1000s range use multiple tokens.
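To make "roughly speaking" a bit more concrete, here is a heavily simplified sketch of the BPE training loop: start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol. The tiny corpus and merge count are made up for illustration; real tokenizers learn tens of thousands of merges over byte sequences, not this toy setup.

from collections import Counter

def train_bpe_merges(corpus, num_merges):
    """Learn BPE merges from a list of strings (toy version)."""
    # Start with every word split into single characters.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for i, symbols in enumerate(words):
            out, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            words[i] = out
    return merges

# "2021" appears often, so its pieces keep getting merged;
# "1437" appears once and stays split.
corpus = ["2021"] * 50 + ["1437"]
print(train_bpe_merges(corpus, 3))   # [('2', '0'), ('20', '2'), ('202', '1')]

Because "2021" shows up far more often than "1437" in this toy corpus, its pieces keep getting merged until the whole year is a single symbol, which is essentially what happened to the recent-year tokens in GPT-2's training data.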

What happens nowadays?

This problem was noticed. Newer models handle numbers differently, with more deliberate encodings. There appear to be two main schemes in use, at least among the handful of publicly available models I checked.

10 unique number tokens

In this scheme, the digits 0-9 are each represented as 1 token. Then two-digit numbers are represented as 2 tokens, three-digit numbers as 3 tokens, and so on.

Number   # Tokens   Token Values
0        1          [0]
1        1          [1]
..       ..         ..
9        1          [9]
10       2          [1, 0]
11       2          [1, 1]
..       ..         ..
100      3          [1, 0, 0]
101      3          [1, 0, 1]

This method is used by Mistral, Qwen, Gemma, and Llama 2.

I'm not sure whether these are customized at all, but it appears this type of splitting is available as a flag in SentencePiece, which is commonly used for tokenization.
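For what it's worth, the SentencePiece trainer does expose a split_digits option that forces each digit to be its own piece. Here's a sketch of how that flag is used; the corpus file, vocab size, and model prefix are placeholders, and I'm not claiming any of these models were trained with exactly this configuration.

import sentencepiece as spm

# Train a toy tokenizer with digit splitting enabled.
# "corpus.txt", the vocab size, and the model prefix are placeholder values.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_tokenizer",
    vocab_size=2000,
    split_digits=True,   # each digit 0-9 becomes its own piece
)

sp = spm.SentencePieceProcessor(model_file="toy_tokenizer.model")
print(sp.encode("1004", out_type=str))   # something like ['▁1', '0', '0', '4']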

1000 unique number tokens

In this scheme, the numbers 0-999 are each represented with a unique token. Then four-digit numbers are represented with 2 tokens.

Number   # Tokens   Token Values
0        1          [0]
..       ..         ..
10       1          [10]
..       ..         ..
100      1          [100]
..       ..         ..
999      1          [999]
1000     2          [100, 0]
1001     2          [100, 1]
1002     2          [100, 2]

This method is used by Llama 3 and Phi.

Why did they choose to encode 1000 as [100, 0] versus [1, 000]? Not sure!
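One rule that is consistent with the observed splits is greedy left-to-right longest match against the 0-999 vocabulary: grab the longest known number token first and leave the remainder behind. The sketch below is only an illustration of that rule, not a claim about the actual tokenizer internals.

def split_number_greedy(number_str, max_chunk=3):
    """Split a digit string by greedily taking the longest chunk (up to
    max_chunk digits) from the left, mimicking a 0-999 token vocabulary."""
    pieces, i = [], 0
    while i < len(number_str):
        take = min(max_chunk, len(number_str) - i)
        pieces.append(number_str[i:i + take])
        i += take
    return pieces

print(split_number_greedy("1000"))   # ['100', '0']
print(split_number_greedy("1001"))   # ['100', '1']
print(split_number_greedy("999"))    # ['999']

Under this rule "1000" comes out as ['100', '0'] rather than ['1', '000'], matching the table above.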

Why does it matter?

LLMs have no inherent concept of "numbers"; everything is just tokens. Initially, before training, the tokens for "7" and "9" are as unrelated as "tree" and "cup". Encoding the digits in a systematic way should make mathematical relationships easier to learn. By encoding "1004" as "1-0-0-4" or even "100-4", that number is automatically imbued with a sense of "1-ness", "0-ness" and "4-ness" (or "100-ness" and "4-ness"). It's like you get to bootstrap the numeric relationships for free, without needing to learn every single number as if it were totally unique.

Visualizing Digit Tokenization

Here is what the newer models look like visually. They're less fun and more regular than GPT-2. The chart below shows Mistral and Phi, but all the models mentioned above match one of these two schemes.

Note the detail in the first few cells for Mistral: 0-9 are 1 token, 10-99 are 2 tokens, 100-999 are 3 tokens, and 1000-9999 are 4 tokens. Phi uses 1 token for 0-999 and 2 tokens for 1000-9999.

Heatmap comparing tokenization of digits 0-9999 for Mistral and Phi models
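The comparison chart is generated the same way as the GPT-2 one, just looped over two models. Here's a sketch; the two GGUF paths are placeholders for whichever local conversions you happen to have.

import matplotlib.pyplot as plt
import numpy as np
from llama_cpp import Llama

# Placeholder paths: substitute whatever GGUF conversions you have locally.
models = {
    "Mistral": "./models/mistral-7b.Q4_K_M.gguf",
    "Phi": "./models/phi-3-mini.Q4_K_M.gguf",
}

plt.style.use('dark_background')
fig, axes = plt.subplots(1, 2, figsize=(20, 10))
for ax, (name, path) in zip(axes, models.items()):
    llm = Llama(model_path=path)
    counts = [
        len(llm.tokenize(f" {i}".encode("utf-8"), add_bos=False))
        for i in range(10000)
    ]
    ax.imshow(np.array(counts).reshape(100, 100), cmap="viridis")
    ax.set_title(name)

plt.tight_layout()
plt.savefig("digits_comparison_heatmap.png")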

Visualizing Word Tokenization

The last thing to look at here is how "number words" are tokenized. By that, I mean the words "ninety-five" instead of the digits "95". If nothing else, the charts look cool with those banding patterns.
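To generate the word strings I'd reach for the num2words package, though I can't say whether the original chart was produced that way. A sketch of counting tokens for the spelled-out numbers:

from num2words import num2words
from llama_cpp import Llama

# Placeholder path: any local GGUF conversion will do.
llm = Llama(model_path='./models/mistral-7b.Q4_K_M.gguf')

word_token_counts = []
for i in range(10000):
    # e.g. 95 -> "ninety-five"
    word_str = f" {num2words(i)}"
    tokens = llm.tokenize(word_str.encode("utf-8"), add_bos=False)
    word_token_counts.append(len(tokens))

print(num2words(95), "->", word_token_counts[95], "tokens")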

Heatmap comparing tokenization of words 0-9999 for Mistral and Phi models

Here is a table showing some differences in the word tokenizations. The visual banding occurs because the multiples of ten (twenty, thirty, forty, ...) are encoded with fewer tokens. Mistral (on the left) and Llama 2 (not pictured) have some extra banding in the eighties and nineties as well. It's interesting how differently "eighteen" and "nineteen" are tokenized here.

Word         Mistral              Llama 2              Qwen / Phi / Gemma / Llama 3
zero         [zero]               [zero]               [zero]
one          [one]                [one]                [one]
..           ..                   ..                   ..
eight        [eight]              [eight]              [eight]
nine         [nine]               [nine]               [nine]
..           ..                   ..                   ..
eighteen     [eighteen]           [eigh, teen]         [eighteen]
nineteen     [ninete, en]         [nin, ete, en]       [nineteen]
twenty       [twenty]             [twenty]             [twenty]
thirty       [thirty]             [thirty]             [thirty]
eighty       [eight, y]           [eight, y]           [eighty]
ninety       [nin, ety]           [nin, ety]           [ninety]
ninety-one   [nin, ety, -, one]   [nin, ety, -, one]   [ninety, -, one]