
The Day Claude Forgot Its Own Instructions

Last month, I fed a leading language model a 50,000-token document about quantum computing, asked it to summarize the key findings, and then posed a follow-up question about something mentioned in the opening paragraph. The model confidently told me that concept wasn't discussed anywhere in the document. It wasn't hallucinating—it simply couldn't see information that was technically within its context window but had been effectively "forgotten" by the attention mechanism.

This isn't a rare glitch. It's a systematic problem that nobody talks about, and it's costing companies millions in productivity losses.

Understanding Context Windows and Attention Decay

Modern language models work with something called a "context window"—basically, the amount of text the model can theoretically see at once. GPT-4 Turbo can handle 128,000 tokens. Claude 3 pushes toward 200,000. Gemini advertises a 1 million token window. Numbers that sound impressive until you actually try to use them.

The problem isn't capacity. The problem is attention.

Transformer models—the architecture behind ChatGPT, Claude, and most modern AI systems—use something called the attention mechanism. Picture it like this: when processing a sentence, the model looks at each word and calculates which other words are most important to understanding it. This happens in "layers," with each layer refining the relationships and importance scores.
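To make that concrete, here is a toy version of scaled dot-product attention in plain NumPy. It's a minimal sketch, not any production model's code: there are no learned projections, no multiple heads, and no stacked layers, just the core move of scoring every position against every other position and blending their values accordingly.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: each position scores every other position,
    then mixes their values according to those scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                                 # each output is a weighted blend of values

# Four made-up token vectors of dimension 8. In a real transformer these come
# from learned query/key/value projections, repeated across many heads and layers.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
output = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape)  # (4, 8): one updated vector per token
```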

But here's where it breaks down: as context gets longer, the attention mechanism gets noisier. Research from Stanford, UC Berkeley, and other institutions has shown that language models exhibit what researchers call "lost in the middle" syndrome—information in the middle of long documents receives dramatically less attention than information at the beginning or end. A document with 10,000 tokens? The model might miss 30-40% of the critical information in the middle sections, even though it technically "sees" the whole thing.

One study showed that when documents exceed 4,000 tokens, retrieval accuracy for facts in the middle of the document drops to less than 60%. By 8,000 tokens, it's worse than random chance for some information types.

Why This Matters More Than You Think

Enterprise teams are discovering this the hard way. A legal firm loads an 80,000-word contract into their AI assistant, asks it to identify all liability clauses, and the model misses three critical sections. A research team asks Claude to analyze a 200-page paper and synthesize contradictory findings in the middle sections—it fails to notice the contradictions exist. A customer service team uses AI to handle complex support tickets that reference previous conversations, and the AI forgets what happened two messages ago despite having the full thread.

Companies are throwing context window size at the problem like it's a solution. It isn't. More tokens just mean the model has more opportunities to get distracted.

The real issue? There's an inverse relationship between context window size and processing reliability. Larger windows let you include more information, but attention quality degrades as the window fills. You get to choose: narrow windows with higher accuracy, or wide windows with Swiss-cheese reliability.

What's Actually Happening Inside the Model

Let's get specific about the mechanics. In a transformer's attention layer, every token calculates an "attention score" with every other token in the context. With a 128,000-token window, that's over 16 billion attention scores per head, per layer. The model learns to compress and prioritize, essentially asking "which tokens matter most for predicting the next token?"
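A couple of lines of arithmetic show how quickly that pairwise cost grows (the "over 16 billion" figure is just 128,000 squared):

```python
# Every token attends to every other token, so the number of attention
# scores grows with the square of the context length.
for seq_len in (4_000, 8_000, 128_000):
    print(f"{seq_len:>7,} tokens -> {seq_len ** 2:>17,} scores per head, per layer")
# 128,000 tokens -> 16,384,000,000 scores: the "over 16 billion" mentioned above.
```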

Early in a document, everything matters equally, so attention spreads relatively evenly. By the middle, the model has seen enough pattern-matching to start ignoring "unimportant" stuff. By the end, there's heavy concentration on recent tokens and information the model has already identified as critical.

This isn't a bug in the traditional sense. It's an optimization that works brilliantly for predicting the next word in a novel. It's catastrophic for actually retrieving specific information from a knowledge base you've provided.

The semantic meaning gets preserved—the model "knows" the information is there—but the ability to *access* it degrades. When models can't access information reliably, they start hallucinating with confidence, which is arguably worse than simply saying "I don't know."

What Teams Should Actually Do Right Now

The interim solution isn't as sexy as you'd hope: retrieval-augmented generation (RAG). Instead of dumping your entire document into the context window, you keep documents in a searchable database and pull only the relevant chunks into the model for each query. It's slower, less elegant, and dramatically more effective.
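As a rough illustration, here is what the retrieval step can look like. This is a minimal sketch under simplifying assumptions: the contract chunks and query are made up, and TF-IDF similarity stands in for the embedding model and vector store you'd use in practice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up contract chunks; in practice these would be real sections, pages,
# or fixed-size passages with some overlap.
chunks = [
    "Section 1: definitions and scope of the agreement.",
    "Section 7: limitation of liability and indemnification obligations.",
    "Section 12: termination, notice periods, and surviving clauses.",
]

def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query.
    TF-IDF similarity stands in for a real embedding model here."""
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    chunk_vectors = vectorizer.transform(chunks)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    ranked = sorted(zip(scores, chunks), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

context_chunks = retrieve("Which clauses cover liability?", chunks)
# Only the retrieved chunks go into the prompt, so the model spends its
# attention on a few hundred tokens instead of an 80,000-word contract.
prompt = "Answer using only this context:\n" + "\n".join(context_chunks)
print(prompt)
```

Swap in real embeddings and a vector database and the shape stays the same: search first, then hand the model only the passages it actually needs.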

Companies like Anthropic are also experimenting with different attention mechanisms and training approaches, but these are still research projects. They won't fix the problem next quarter.

If you're building AI systems that need to work with long documents: don't max out your context window and assume the problem is solved. Implement proper retrieval mechanisms. Break documents into logical sections. Ask your model to refer back to specific sections rather than relying on it to hold everything in mind simultaneously.
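If "break documents into logical sections" sounds abstract, it can be as simple as splitting on whatever structure your documents already carry. A hypothetical splitter for heading-delimited text:

```python
import re

def split_into_sections(document: str):
    """Split on markdown-style headings; a stand-in for whatever logical
    structure your documents actually have (clauses, chapters, ticket threads)."""
    parts = re.split(r"\n(?=#+ )", document)
    return [part.strip() for part in parts if part.strip()]

doc = "# Overview\nIntro text.\n## Liability\nClause text.\n## Termination\nNotice terms."
for number, section in enumerate(split_into_sections(doc), start=1):
    print(f"[{number}] {section.splitlines()[0]}")
# [1] # Overview
# [2] ## Liability
# [3] ## Termination
```

From there you can query the model one section at a time, or feed the section list into a retriever like the sketch above.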

The context window is not your safety net. It's a window of opportunity, and like most windows, what you can see through it depends heavily on how you angle the light.