Last week, I asked ChatGPT to write a 100-word summary of a scientific paper. It confidently delivered what it promised. Then I counted the words. Seventy-three. Not even close.
This isn't an isolated glitch. It's a fundamental quirk of how large language models operate, and it exposes something crucial about the gap between what these systems appear to do and what they're actually doing under the hood.
The Counting Problem Nobody Talks About
Try this yourself. Ask any major AI model—ChatGPT, Claude, Gemini—to write exactly 50 words on a topic. Most of the time, you'll get somewhere between 45 and 65 words. Repeat the request with different word counts and the same pattern emerges: the models genuinely struggle with a task any literate human handles without thinking.
This happens because of how these models process language. They don't read words the way you do. They don't scan across a sentence and count each unit. Instead, they work with "tokens"—chunks of text that don't map neatly onto word counts.
The relationship between tokens and words depends on the language and the specific tokenizer in use. In English, one word corresponds to roughly 1.3 tokens on average. But that's just an average. Some words become a single token. Others break into two or three. Punctuation adds its own complications. So when an AI model "counts" to 100, it's really estimating length from token probabilities, which is fundamentally different from enumeration.
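If you want to see the mismatch yourself, a small script makes it concrete. This sketch assumes OpenAI's open-source tiktoken library is installed and uses its cl100k_base encoding, the one used by recent GPT models; other tokenizers will split differently, but the pattern is the same.

```python
# A minimal sketch: compare the everyday notion of "word count" with the
# token count a GPT-style model actually sees. Assumes tiktoken is
# installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Antidisestablishmentarianism is, unsurprisingly, a single word."
words = text.split()
tokens = enc.encode(text)

print(f"Words:  {len(words)}")   # 6 whitespace-separated words
print(f"Tokens: {len(tokens)}")  # typically more, because long words get split
for t in tokens:
    print(repr(enc.decode([t])))  # inspect exactly how the text was chunked
```

The model never sees the six words you see; it sees the chunks, and any notion of "word count" has to be inferred from them.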
What This Reveals About AI Reasoning
Here's where it gets interesting. This counting problem isn't a minor engineering oversight. It's evidence that these models don't actually "think" the way we've been assuming they do.
We often hear that AI systems are "reasoning" or "problem-solving." These words create an impression of conscious deliberation. But the word-counting gap suggests something different. These models are performing statistical pattern completion. They predict the next token based on massive amounts of training data, not because they're performing arithmetic or logical operations.
When you ask an AI to write exactly 50 words, it's not setting an internal counter to zero and incrementing it. It's generating text in a way that, during training, statistically corresponded to lengths humans labeled as "approximately 50 words." The model learned the correlation, not the concept.
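For contrast, here is what actual counting looks like when it's done in code rather than learned as a correlation. This is a plain illustration of enumeration, not a picture of anything happening inside the model:

```python
def count_words(text: str) -> int:
    # Explicit enumeration: set a counter to zero and increment it
    # once per word. Transformers have no built-in equivalent of this loop.
    count = 0
    for _ in text.split():
        count += 1
    return count

print(count_words("Large language models predict tokens; they do not count words."))  # 10
```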
This becomes obvious when you stack increasingly specific constraints. Ask for 50 words, and you might get 48. Ask for 50 words arranged in exactly three paragraphs with a semicolon in each paragraph, and the output starts to wobble. Add the requirement that each paragraph contains exactly one question, and performance degrades further. You're stacking probability estimates on top of each other, and the errors compound.
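The compounding is easy to see with a back-of-the-envelope model: if each constraint is satisfied independently with some probability, the chance of satisfying all of them is the product. The per-constraint success rates below are invented purely for illustration, not measured from any model:

```python
# Illustrative only: the success rates are assumptions, not benchmarks.
constraints = {
    "exactly 50 words": 0.60,
    "exactly three paragraphs": 0.80,
    "a semicolon in each paragraph": 0.70,
    "one question per paragraph": 0.65,
}

p_all = 1.0
for name, p in constraints.items():
    p_all *= p
    print(f"after adding '{name}': {p_all:.0%} chance of full compliance")
```

Four reasonable-sounding constraints, and the combined compliance rate has already dropped below one in four.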
The Broader Implications for AI Reliability
This seemingly trivial limitation has serious consequences for how we should think about AI systems in professional contexts.
If a model can't reliably count words, what else is it masking beneath a confident exterior? As I wrote about previously, AI systems often give terrible advice while sounding perfectly reasonable. The counting problem is a window into why: these systems excel at pattern matching and can sound authoritative on almost anything, but they lack the underlying mechanisms for precise logical operations.
Consider someone using an AI to summarize legal documents with specific length requirements. Or imagine relying on an AI to write grant proposals where word limits are absolute constraints. The confident delivery masks the fundamental inability to guarantee the constraint is met.
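If a hard limit genuinely matters, the safer pattern is to treat the model's output as a draft and enforce the constraint outside the model. A rough sketch of that guardrail, where generate_summary is a placeholder for whatever model API you actually call:

```python
def generate_summary(prompt: str) -> str:
    # Placeholder for a real model call (OpenAI, Anthropic, etc.).
    raise NotImplementedError

def summary_within_limit(prompt: str, max_words: int, attempts: int = 3) -> str:
    # Regenerate until the hard word limit is met, and fail loudly
    # instead of silently submitting an over-length document.
    for _ in range(attempts):
        draft = generate_summary(prompt)
        if len(draft.split()) <= max_words:
            return draft
    raise ValueError(f"No draft met the {max_words}-word limit after {attempts} attempts")
```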
Researchers have noticed that this extends beyond counting. AI models struggle with other tasks requiring discrete logical operations: exact instruction following, maintaining consistent internal state across long outputs, and genuinely understanding numerical relationships. These models are castles built on sand—impressive-looking structures that collapse under specific kinds of pressure.
The Silver Lining (Sort Of)
The good news is that researchers are actively working on this problem. Recent approaches focus on giving language models access to "tools"—calculators, external memory systems, code execution environments—that handle discrete operations separately.
Some newer models can write Python code to solve problems, letting them leverage traditional programming logic for tasks where pattern matching fails. Others use "chain-of-thought" prompting to encourage step-by-step reasoning before generating answers. These workarounds don't solve the underlying issue, but they acknowledge it and build scaffolding around it.
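The same division of labor shows up in a toy form below: the model handles the wording, and ordinary code handles the arithmetic. The check_length function here is just a stand-in for the kind of deterministic tool a model can be wired to call:

```python
def check_length(text: str, target_words: int) -> str:
    # A deterministic "tool": exact counting done in code, with the
    # verdict returned as text the model (or a human) can act on.
    actual = len(text.split())
    if actual == target_words:
        return f"OK: exactly {target_words} words."
    return f"Off by {actual - target_words}: got {actual}, wanted {target_words}."

print(check_length("This sentence has exactly six words.", 6))  # OK: exactly 6 words.
```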
The real lesson here is humility. AI systems are powerful at specific tasks—language generation, creative writing, content summarization, coding assistance—because those tasks align with what statistical pattern completion does well. But asking these systems to operate outside their sweet spot reveals their nature: they're not thinking. They're performing sophisticated statistical tricks that sometimes look identical to thinking.
Next time an AI gives you an answer that sounds right, remember: it might just be playing a really, really good game of statistical prediction. And if you ask it to count to ten, maybe verify the result yourself.
