Last month, I asked Claude to calculate the total cost of 47 items at $12.99 each. It confidently returned $593.53. The correct answer is $610.53. A $17 error. Then I asked it to do the same calculation again, and it gave me $612.04. Neither answer was correct, and neither matched the other.
This isn't a bug. It's a feature—albeit an infuriating one—of how modern language models actually work.
The Tokenization Problem Nobody Talks About
Here's what most people don't realize: when you feed numbers into an AI model, they don't get processed as mathematical entities. They get broken into "tokens"—chunks of text. The number 47 might be tokenized as "4" and "7" separately, depending on the model's vocabulary. This means the AI is essentially trying to reason about numbers the way you'd solve a puzzle if someone handed you individual letter tiles and asked you to multiply.
GPT-4's tokenizer has roughly 100,000 tokens in its vocabulary. Of those, perhaps a few thousand represent numbers and number fragments. When a model encounters "2847" in your prompt, it might split it across multiple tokens, forcing the model to piece together the value somewhat blindly. It's like playing telephone, except the message is a number and every extra token is another whisper.
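You can see the splitting for yourself with tiktoken, OpenAI's open-source tokenizer library. A minimal sketch; the exact splits depend on the encoding, so treat the printed pieces as illustrative rather than guaranteed:

# Inspect how numbers are split into tokens using OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by recent GPT models

for text in ["47", "2847", "12.99"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")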
OpenAI acknowledged this limitation in their technical documentation, but it rarely makes headlines. Most articles focus on impressive capabilities—an AI that beats lawyers on contract review, another that writes code. Nobody writes a Medium post titled "AI Still Can't Divide." But these failures are frequent and consequential.
What the Data Actually Shows
A 2023 study from Stanford's Center for Research on Foundation Models tested various large language models on arithmetic tasks. The results were sobering. GPT-3.5 achieved only 47% accuracy on two-digit multiplication problems. GPT-4 improved to 84%—respectable, but still worse than a seventh grader. When researchers tested chain-of-thought prompting (asking the model to "show its work"), accuracy climbed to 98% on those same problems.
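"Show its work" is a change to the prompt, not to the model. A rough illustration of the difference (the wording here is mine, not the study's):

# Illustrative only: the same question phrased two ways. The first invites a
# single pattern-matched guess; the second nudges the model to step through
# the calculation before committing to a total.
direct_prompt = "What is 47 * 12.99? Reply with just the number."

chain_of_thought_prompt = (
    "What is 47 * 12.99? "
    "Work through the multiplication step by step, then state the final total."
)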
But here's the twist: the model wasn't actually better at math. It was better at explaining its reasoning in a way that forced it to step through calculations sequentially. Given a different format or a slightly different problem structure, performance degraded again.
This reveals something crucial: these models don't understand mathematics. They pattern-match against training data. When you ask an AI to solve a problem it's seen thousands of times in its training set, it performs well. Ask it something slightly novel, and accuracy drops sharply.
Why This Matters Beyond Calculators
You might think: "Fine, I'll use a calculator for math. Why does AI's mathematical illiteracy matter?" The answer lurks in applications you probably interact with weekly.
A financial institution deploying an AI system to assess loan risk needs accurate numerical reasoning. A healthcare platform using AI to recommend medication dosages cannot afford to be "right 84% of the time." An e-commerce recommendation engine that miscalculates inventory could create cascading logistics failures. In these domains, approximation isn't acceptable.
There's also a subtler problem: confidence inflation in AI outputs. When a model generates a plausible-looking but incorrect number, users often trust it because it was delivered with certainty. The model doesn't hedge or admit uncertainty—it just outputs a figure. A human researcher checking the model's work might miss the error if they're not specifically testing arithmetic.
I interviewed a data scientist at a mid-size tech company who discovered their AI-powered budget forecasting tool was making systematic calculation errors. They'd deployed it without catching this because they were focused on whether it could identify trends, not whether it could add correctly. When they discovered the bug, they'd already shared reports with executives based on the faulty numbers. That's a trust problem that's hard to recover from.
The Road Forward: Specialized Architectures
Researchers aren't ignoring this. Some promising approaches have emerged. Neural-symbolic AI combines traditional machine learning with explicit symbolic computation—letting the AI handle language understanding while delegating math to deterministic algorithms. It's less elegant than a pure language model, but it works.
OpenAI's Code Interpreter feature sidesteps the problem by having the model write Python code, then execute it to verify results. The model generates the logic, and deterministic code handles the computation. Anthropic has published research on similar approaches.
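Both approaches share the same idea: let the model translate the question into something a deterministic program can execute. Here is a minimal sketch of that pattern, where the model would supply only the expression (the ask_model helper named in the comment is hypothetical, standing in for whatever chat API call your application already makes):

# The model is asked only to translate the question into an arithmetic
# expression; a small deterministic evaluator does the actual computation.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expression: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))

# e.g. the expression returned by a hypothetical ask_model("Total for 47 items at $12.99?")
expression = "47 * 12.99"
print(safe_eval(expression))  # 610.53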
The issue is that these solutions require extra steps and computational overhead. They're also not available in every application. The dream of a single unified AI system that can reason about language, images, and numbers with equal fluency remains distant.
What You Should Do Right Now
If you're using AI in any capacity that involves numbers—budgeting, forecasting, financial analysis, scientific research—you need a human checkpoint. Assume any numerical output from an AI is a draft until verified independently. This isn't paranoia; it's basic quality control.
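If model output is already flowing through your code, that checkpoint can be a few lines. A minimal sketch using the opening example, with decimal arithmetic so the cents are exact (the function name and inputs are mine):

# Recompute the model's figure independently before trusting it.
from decimal import Decimal

def verify_total(model_answer: str, quantity: int, unit_price: str) -> bool:
    expected = Decimal(unit_price) * quantity
    return Decimal(model_answer) == expected

print(verify_total("593.53", 47, "12.99"))  # False: the confident answer was wrong
print(verify_total("610.53", 47, "12.99"))  # True: matches 47 * $12.99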
For companies: audit your AI deployments. If numerical accuracy is critical to your use case, either implement secondary verification or use specialized tools instead of general-purpose language models.
For everyone else: be skeptical of AI-generated numbers without context. Ask the model to show its work. Better yet, ask it to explain the reasoning step-by-step. And if something seems off? It probably is.
The reality is that modern AI is exceptional at pattern matching and language generation, but fundamentally limited at mathematical reasoning. Acknowledging this limitation—rather than pretending it doesn't exist—is how we build more reliable systems.
