Photo by Steve Johnson on Unsplash

Last month, I asked ChatGPT to write me a short story about a detective investigating a missing cat. Thirty minutes later, I asked it to continue the story. The detective had vanished. The cat's name changed. The entire plot reset itself like someone had hit the rewind button on the universe.

This wasn't a glitch. It was a feature—or rather, a fundamental limitation that's been haunting AI development since the beginning.

The Problem Nobody Talks About

Most people don't realize that large language models operate under strict amnesia conditions. They have no actual memory between conversations. None. Zero. Each time you start a new chat, the model genuinely doesn't know what you said before, even if it was five seconds ago.

This becomes obvious when you try anything requiring sustained attention. Ask an AI assistant to remember that you hate cilantro, then come back tomorrow with a recipe request. Watch what happens. The AI will confidently suggest cilantro salsa, then act surprised when you remind it of your preference.

But here's the weird part: within a single conversation, AI models can hold context remarkably well. They can track character names, plot points, and even contradictions. So what's the difference?

The answer lies in something called the "context window."

Tokens, Windows, and the Mathematics of Forgetting

Every word that flows into an AI model gets converted into numerical representations called tokens. Plain English text works out to roughly 1.3 tokens per word (a token is about four characters). So a 100-word paragraph? That's roughly 130 tokens.
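If you want to see this for yourself, OpenAI's open-source tiktoken library will count tokens for you. Here's a quick sketch; the exact numbers depend on the model and its tokenizer, so treat them as approximate:

```python
# Rough sketch of counting tokens with OpenAI's open-source tiktoken
# tokenizer (pip install tiktoken). Exact counts vary by model and tokenizer.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

paragraph = "The detective stared at the empty cat bed. " * 20  # 160 words
tokens = encoding.encode(paragraph)

print(f"Words:  {len(paragraph.split())}")  # 160
print(f"Tokens: {len(tokens)}")             # usually a bit more than the word count
```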

Here's where it gets constraining: most models have a maximum context window. GPT-3.5 worked with 4,000 tokens. GPT-4 expanded this to 8,000 tokens (with a 32,000-token variant). Claude pushes it to 100,000 tokens. These sound like big numbers until you realize that 100,000 tokens is only about 75,000 words, roughly the length of a novel.

Once you exceed your context window, the model can't "see" the earlier information anymore. It's like trying to read a book where pages keep disappearing from the beginning as you turn to new ones.
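You can sketch the effect in a few lines: a history that only ever keeps whatever still fits under a token budget, newest first. Both the tiny budget and the crude word-based token counter below are made up for illustration.

```python
# Minimal sketch of why early messages "disappear": only the most recent
# messages that fit inside the token budget ever reach the model.
def fit_to_window(messages, max_tokens, count_tokens):
    """Keep the newest messages whose combined token count fits the window."""
    kept, total = [], 0
    for message in reversed(messages):   # walk backwards from the newest
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break                        # everything older falls off the front
        kept.append(message)
        total += cost
    return list(reversed(kept))

# Crude stand-in token counter: roughly 1.3 tokens per word.
def approx_tokens(text):
    return int(len(text.split()) * 1.3)

history = [
    "Chapter 1: Detective Mora takes the missing-cat case.",
    "Chapter 2: The cat, Biscuit, was last seen near the docks.",
    "Chapter 3: A suspicious fishmonger enters the story.",
]
print(fit_to_window(history, max_tokens=25, count_tokens=approx_tokens))
# Chapter 1 no longer fits, so the model never sees it.
```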

This is why AI can't remember yesterday—yesterday's conversation is outside the context window, deleted from view entirely.

The implications are enormous. Long-form research projects become nightmares. Customer service bots lose track of previous complaints. Collaborative writing sessions devolve into chaos. Any task requiring accumulated knowledge hits a wall.

How Attention Mechanisms Changed the Game

Before 2017, the neural networks behind language processing read text sequentially, one word at a time, like reading left to right. They were slow to train and forgetful over long passages. Then researchers at Google published a paper called "Attention Is All You Need," and everything shifted.

The attention mechanism works differently. Instead of processing information linearly, it lets the model simultaneously examine every part of the input and decide which pieces deserve focus. It's like having a student who can look at all pages of a textbook at once and instantly identify the relevant passages.

This breakthrough made modern language models possible. Suddenly, context windows expanded. Models could handle longer documents. But there's still a hard ceiling.

Why? Because attention has a computational cost. The model needs to compute relationships between every token and every other token. With 100,000 tokens, that's 10 billion comparison operations. With a million tokens, you're looking at a trillion comparisons. Even with modern GPUs, this becomes prohibitively expensive.
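To make the scaling concrete, here is a toy self-attention sketch in NumPy. Real transformers project tokens into separate query, key, and value vectors across many heads, but even this stripped-down version builds an n-by-n score matrix, which is where the quadratic cost comes from.

```python
# Toy self-attention in NumPy. The score matrix is n x n: one entry for every
# pair of tokens, so work and memory grow with the square of the input length.
# (Real transformers use separate query/key/value projections and many heads.)
import numpy as np

def toy_self_attention(x):
    """x: (n_tokens, d_model) array of token vectors."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ x                                 # each output mixes every token

out = toy_self_attention(np.random.randn(16, 8))       # fine at 16 tokens...
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {n * n:>22,} pairwise scores")
```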

The Emerging Solutions

Engineers aren't taking this limitation lying down. Several approaches are gaining traction.

Retrieval-augmented generation (RAG) works by storing information externally. Instead of keeping everything in the context window, the model queries an external database. When you ask it about your cilantro preference, it retrieves that memory from storage rather than keeping it internally. It's a workaround, but it works.
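A stripped-down sketch of the idea: store facts outside the model, look up the most relevant ones at question time, and paste them into the prompt. Real systems use vector embeddings and a vector database; the word-overlap scoring and the `remember`/`build_prompt` helpers here are invented purely for illustration.

```python
# Toy sketch of the retrieval step behind RAG. Real systems embed text into
# vectors and search a vector database; crude word overlap stands in for
# semantic similarity here, just to show the shape of the idea.
memory_store = []

def words(text):
    return {w.strip(".,?!") for w in text.lower().split()}

def remember(fact):
    memory_store.append(fact)

def retrieve(query, top_k=1):
    """Return the stored facts that share the most words with the query."""
    return sorted(memory_store,
                  key=lambda fact: len(words(fact) & words(query)),
                  reverse=True)[:top_k]

def build_prompt(question):
    context = "\n".join(f"- {fact}" for fact in retrieve(question))
    return f"Known facts about the user:\n{context}\n\nQuestion: {question}"

remember("The user hates cilantro.")
remember("The user's cat is named Biscuit.")
print(build_prompt("Can you suggest a salsa recipe? Remember the cilantro thing."))
```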

Sparse attention mechanisms are another approach. Instead of computing relationships between every token pair, sparse attention only examines nearby tokens or strategically important ones. This reduces computational load while preserving context. OpenAI's Sparse Transformer demonstrated this principle, though it hasn't yet reached mainstream consumer models.
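One of the simplest sparse patterns is a local window: each token only attends to its near neighbours. The sketch below just counts how many token pairs survive that restriction; it illustrates the principle, not OpenAI's actual implementation.

```python
# Rough sketch of a local (banded) attention pattern: each token may only
# attend to tokens within `window` positions of itself. Counting the surviving
# pairs shows how much work a sparse pattern saves over full attention.
import numpy as np

def local_attention_mask(n_tokens, window):
    """Boolean mask where mask[i, j] is True if token i may attend to token j."""
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 4_096, 128
mask = local_attention_mask(n, window)
full_pairs = n * n
sparse_pairs = int(mask.sum())
print(f"full attention:  {full_pairs:,} token pairs")
print(f"local attention: {sparse_pairs:,} token pairs ({sparse_pairs / full_pairs:.1%})")
```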

Then there's the brute-force approach: just make the models bigger and better at managing attention. Anthropic's Claude model uses a 100,000-token window, which is genuinely useful for analyzing entire research papers or long documents. It's not a perfect solution, but it's a meaningful improvement.

Some researchers are exploring completely different architectures. State-space models like Mamba process information more efficiently than traditional transformers, potentially offering paths toward true long-term memory without the computational overhead.
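The core idea can be sketched as a simple linear recurrence: the model folds each new token into a fixed-size state, so the cost grows linearly with length and the "memory" never gets wider. This is a bare-bones illustration with made-up matrices, not Mamba's actual selective architecture.

```python
# Bare-bones state-space recurrence: a fixed-size hidden state is updated once
# per token, so cost is linear in sequence length. Mamba's selective state-space
# layers are far more sophisticated; this only illustrates the recurrence.
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (n_tokens, d_in) inputs. Returns outputs of shape (n_tokens, d_out)."""
    state = np.zeros(A.shape[0])      # memory size is fixed, whatever the length
    outputs = []
    for x_t in x:                     # one cheap update per token
        state = A @ state + B @ x_t   # fold the new token into the state
        outputs.append(C @ state)     # read out from the compressed state
    return np.stack(outputs)

rng = np.random.default_rng(0)
A = np.eye(16) * 0.9                  # slowly decaying memory of the past
B = rng.normal(size=(16, 8)) * 0.1
C = rng.normal(size=(4, 16))
y = ssm_scan(rng.normal(size=(1_000, 8)), A, B, C)
print(y.shape)                        # (1000, 4), computed with a constant-size state
```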

What This Means for You

If you're using AI tools for anything substantial, understanding context windows matters. Summarize important information so it fits within available memory. Use external storage for critical details. Break large projects into smaller chunks that fit comfortably within context limits.
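As a starting point for that last suggestion, here is a rough sketch of splitting a long document into chunks that each stay under a token budget, reusing the same crude tokens-per-word estimate from earlier. A real pipeline would count tokens with the model's own tokenizer and split on paragraph or section boundaries.

```python
# Sketch of splitting a long document into chunks that fit under a token budget,
# using a rough 1.3-tokens-per-word estimate. A production pipeline would count
# tokens with the model's own tokenizer and split on natural boundaries instead.
def chunk_by_budget(text, max_tokens=3_000, tokens_per_word=1.3):
    max_words = int(max_tokens / tokens_per_word)
    all_words = text.split()
    return [
        " ".join(all_words[i : i + max_words])
        for i in range(0, len(all_words), max_words)
    ]

document = "word " * 10_000            # stand-in for a long report
chunks = chunk_by_budget(document)
print(f"{len(chunks)} chunks; the first is about "
      f"{int(len(chunks[0].split()) * 1.3):,} tokens")
```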

For AI developers and companies, this is genuinely urgent. As AI moves from novelty to infrastructure, the memory problem becomes critical. You can't build reliable AI assistants, research tools, or autonomous systems on top of models that forget everything the moment they stop talking to you.

The good news? The field is moving fast. Context windows have expanded 25-fold in three years. New architectural innovations arrive monthly. The amnesia problem isn't permanent—it's just the current frontier.

Your detective will eventually remember the missing cat. We just need to figure out how to make AI remember longer.