In the "just for fun" category, I got this idea from the YouTube video Let's Build the GPT Tokenizer by Andrej Karpathy. Skip to 1:55:10 and he discusses why LLMs have trouble with certain tasks like arithmetic. That inspired me to experiment with different models and operations to see how well they actually perform.

Also, heatmaps look cool.

What are we doing?

Making charts showing which arithmetic problems LLMs are good at.

Basically, we're asking the LLM to generate "times tables" and then scoring the results. But with other things too, not just multiplication.

Why bother?

Purely for my own amusement.

In practice, there's literally no reason to ask an LLM to do arithmetic. "Tool-calling" can handle this perfectly well by handing off to a traditional programming language. But I still think it's useful and interesting to do experiments that build intuition around model behavior.

Why is math hard for LLMs?

One difficulty is that integer tokenization is insane. In early models, the algorithms to split words into tokens didn't handle numbers in any special way. The digits "9", "7" and the number "97" would each be treated like any other text, receiving completely different and unrelated tokens.

Tokenization is less insane nowadays. Under most current schemes, larger numbers are split into tokens for their individual digits or small digit groups. But even so, numbers are still "just text" to the model. Digits like "9" initially have no more meaning than "cow" or "pineapple". Any meaning has to be learned from the training data.
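To make this concrete, here's a quick sketch using OpenAI's tiktoken library as a stand-in (the local GGUF models tested here use their own tokenizers, so treat the exact splits as illustrative):

# Compare how an older and a newer BPE vocabulary split numbers into tokens.
# tiktoken is just a convenient stand-in; llama.cpp models use their own
# tokenizers, but the general pattern is similar.
import tiktoken

old_enc = tiktoken.get_encoding("gpt2")         # older vocabulary
new_enc = tiktoken.get_encoding("cl100k_base")  # newer vocabulary

for text in ["9", "97", "977", "9797"]:
    old_pieces = [old_enc.decode([t]) for t in old_enc.encode(text)]
    new_pieces = [new_enc.decode([t]) for t in new_enc.encode(text)]
    print(f"{text!r}: gpt2={old_pieces} cl100k={new_pieces}")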

That's a problem because, evidently, neural nets have difficulty generalizing arithmetic. They can absolutely be trained up to a point: training examples can be memorized, and some generalization does seem to occur. But as we'll see, small numbers work well while larger numbers tend to fail. LLMs are general purpose and not specifically designed for math.

The Experiment

I've only run this experiment with local models under 10B parameters using llama.cpp. The current setup makes 1000s of API calls per test. It would be, uh.. inefficient ("expensive").. to run this against the paid APIs of the flagship models. That's okay. I think the local model case is more interesting in some ways. We know the big models are clever, but how about the smaller ones?

To be clear..

This is NOT a rigorous scientific analysis. This is a guy making heatmaps for fun.

My selection criteria for models were basically: "what's popular on HuggingFace, but small enough to run locally?" I also grabbed some older models for comparison.

What are the results?

I was primarily testing:

  • model + size (different families, from 500M up to 8B)
  • operator (add, subtract, multiply, divide and modulus)
  • words versus digits ("3+3" versus "three plus three")

Heatmap showing model performance by operator and number format

How are the heatmaps calculated?

Each individual heatmap shows the accuracy of the target operator when applied to the numbers 0-99.

So for example, in the big chart above, for Gemma3-1B with the "addition" operator, the 33rd cell corresponds to the equation "x + 32 = ?". The score "89" is the count of correct answers for "x" in the range 0-99. So for that single cell, it tried:

  • 0 + 32 = ?
  • 1 + 32 = ?
  • ..
  • 99 + 32 = ?

In this case, it got 11 wrong, so it scored 89/100 for that cell.

Detail showing how each heatmap cell is computed
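In code, scoring one cell boils down to something like the sketch below. This is a simplified stand-in, not my actual harness: it assumes a local llama.cpp server (llama-server) on its default port, and parse_number is a bare-bones placeholder for the real response parser described in the footnotes.

import requests

PROMPT = """Solve the addition problem. Give just the numeric answer.

Examples:
2 + 3 = 5
5 + 0 = 5
7 + 8 = 15

{x} + {y} =
"""

def ask(prompt: str) -> str:
    # Query a local llama.cpp server; endpoint and port are the llama-server defaults.
    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": prompt, "temperature": 0, "n_predict": 16},
        timeout=60,
    )
    return resp.json()["content"]

def parse_number(text: str):
    # Placeholder parser: grab the first digit-like token in the response.
    for token in text.split():
        try:
            return float(token.strip(".,"))
        except ValueError:
            continue
    return None

def score_cell(y: int) -> int:
    # One heatmap cell: fix y, sweep x over 0-99, count correct answers.
    correct = 0
    for x in range(100):
        answer = parse_number(ask(PROMPT.format(x=x, y=y)))
        if answer is not None and answer == x + y:
            correct += 1
    return correct  # e.g. 89 means the cell scores 89/100

print(score_cell(32))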

What patterns do we see?

Apart from being pretty, there are some fairly clear patterns in the big chart above.

Larger models outperform smaller models
Within the same family, at least. That makes sense; it would be quite strange if the larger models did worse. The same doesn't necessarily hold when comparing across model families.

Newer models outperform older models
That makes sense. Newer architectures and training recipes are different and are expected to perform better.

Accuracy degrades with larger inputs
This is generally true, but the nature of the degradation varies significantly between models.

Modulus is hard
Remarkably, Llama-3.1-8B and Qwen3-8B actually do a pretty good job, but the others mostly fail. This doesn't feel super surprising. Modulus is probably rare in the training data.

Subtraction is also hard
This was surprising to me. Most of the models have a hard time with subtraction. I would have expected it to be about the same as addition. More on that below.

Digits are usually better than words
That is "99+1" usually does better than "ninety-nine plus one" (but not always). I suspect that in the specific context of math problems, digits are much more common in the training data.

Multiplying by 10 is easier
There are pretty clear streaks showing higher accuracy when multiplying by 10, 20, 30, 40, and so on, for most models.

Why is subtraction hard?

I had expected addition and subtraction to perform about the same. Admittedly, I didn't give it much thought beforehand. But most models have a lot of difficulty with subtraction.

Interestingly, Qwen3-8B does significantly better using "words" for subtraction than "digits". I have a guess here. The plus sign "+" pretty much always means "addition", but the minus sign or dash "-" is HEAVILY overloaded with other meanings. Maybe this confusion makes subtraction harder? Or maybe subtraction just isn't heavily represented in the training data? 3-4 models do well, so subtraction is learnable.

Looking at the heatmaps, a few models show a linear gradient of performance degradation with subtraction. Inspecting the actual inputs/outputs, it appears that the minus sign is often missing, so the model will get the magnitude correct, but then forget the sign (not every time, but often). The missing sign is counted as "wrong" by my scoring logic. Since the problems are formatted as "x - y", and since "y" increases from 0 to 99, an increasing number of correct solutions are negative at each step, which visually causes that smooth gradient.
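This failure mode is easy to flag automatically. Here's a hypothetical helper (the results format is made up for the example) that marks any wrong answer whose magnitude matches the expected result:

def missing_sign_errors(results):
    # results: list of (x, y, parsed_answer) tuples from a subtraction run.
    # Flag answers that are wrong only because the minus sign went missing.
    flagged = []
    for x, y, answer in results:
        expected = x - y
        if answer is not None and answer != expected and abs(answer) == abs(expected):
            flagged.append((x, y, answer, expected))
    return flagged

# Example: the model answered "4" for "3 - 7", which should be -4.
print(missing_sign_errors([(3, 7, 4), (9, 2, 7)]))   # -> [(3, 7, 4, -4)]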

What was the prompt?

Here is the template I used. The prompt is parameterized to substitute the correct "operation" (addition, subtraction, etc..) and to give appropriate examples for that operation. I don't love this prompt, and I suspect it could be optimized a lot. But it worked and gave interesting results, so I ran with it for this experiment.

I'm not gonna worry too much. This is a goofy exercise. Why are we asking LLMs to do arithmetic in the first place?

But if your favorite model family underperformed here, it's likely due to the prompt.

Solve the addition problem. Give just the numeric answer.

Examples:
2 + 3 = 5
5 + 0 = 5
7 + 8 = 15

7 + 0 =
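For reference, the templating behind this is nothing fancy. A rough sketch (the subtraction examples and exact wording here are illustrative, not copied from my harness):

TEMPLATE = """Solve the {name} problem. Give just the numeric answer.

Examples:
{examples}

{x} {symbol} {y} =
"""

# Each operation supplies its symbol and a few worked examples.
OPERATIONS = {
    "addition":    {"symbol": "+", "examples": [(2, 3, 5), (5, 0, 5), (7, 8, 15)]},
    "subtraction": {"symbol": "-", "examples": [(5, 2, 3), (9, 9, 0), (4, 7, -3)]},
}

def build_prompt(name: str, x: int, y: int) -> str:
    op = OPERATIONS[name]
    examples = "\n".join(f"{a} {op['symbol']} {b} = {c}" for a, b, c in op["examples"])
    return TEMPLATE.format(name=name, examples=examples, x=x, symbol=op["symbol"], y=y)

print(build_prompt("addition", 7, 0))   # reproduces the prompt shown above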

Don't flagship models crush college-level math exams?

Yep. So what's the difference here?

I think it's primarily because those are "thinking models" that do multi-stage reasoning. Those models are 100-1000x more powerful than the local models I am running here. Also, college math tests are more about "mathematical reasoning", which is a different skill than plain old arithmetic.

That said, I'm curious whether different prompting strategies, like "show your work", could achieve better results. My guess is "yes". Maybe something to explore in the future.

Followups

I had fun experimenting with this, but there are potentially A LOT of things to follow up on.

More models
Trying more model families and sizes would be interesting. But I don't plan to test anything beyond 10B anytime soon. Being able to run locally is nice.

Prompts
I suspect the prompt above is PRETTY BAD. But also, most of the models nearly aced "addition", so maybe it's fine? I didn't experiment much with the format here. But I think it could make a HUGE difference. Different models are likely to respond better to different prompts. In all the agent/LLM stuff that I see from others and myself, I frequently wonder: "what if the prompt was better?"

There are lots of other things that could be tried too. How well does it work with numbers in different languages? How well does it work in different numerical bases? What about other mathematical functions? Maybe all things to experiment with in the future.


Footnotes

  1. All models came from HuggingFace, used Q4_K_M quantization in GGUF format, and can run locally using llama.cpp on my MacBook M4.
  2. All tests used temperature=0.
  3. The division and modulus operators use the range 1-100 instead of 0-99. I don't know how to divide by zero, so it seems unfair to ask the model to do it.
  4. Correctness was determined by finding the first "number" in the LLM response and comparing it to the expected answer. What's a "number"? Either a numeric or a word response ("5" or "five") was accepted by the response parser.
  5. Testing "division" means lots of results with repeating decimals and potential floating point problems. I tried to be lenient and considered anything within an epsilon value to be correct.