
Last year, OpenAI, Google, and Meta all arrived at the same uncomfortable realization: they were running out of internet.

Not literally, of course. But effectively? Yes. The fuel that powered the AI revolution—billions upon billions of text samples scraped from websites, books, and forums—is becoming scarce. We've already used most of the high-quality text data available on Earth. The models that changed everything about how we work were trained on decades of human-generated content, but you can't squeeze blood from a stone, and you can't endlessly multiply training data from a finite planet.

This isn't a minor technical hiccup. This is the scaling wall, and it's forcing the entire AI industry to confront an uncomfortable truth: the age of "bigger model, better results" might be ending.

The Era of Cheap Scaling Is Over

For roughly a decade, the AI improvement story was refreshingly simple. Researchers discovered that if you train neural networks on more data and with more computational power, the results get better. Not just incrementally better. Dramatically, reliably, predictably better.

This relationship—known as scaling laws—held true across model after model. GPT-3 was better than GPT-2. GPT-4 was better than GPT-3. Each generation required exponentially more resources, but the returns were worth it. Companies invested billions into data centers and specialized chips specifically because scaling worked.

But here's what nobody wanted to acknowledge: scaling laws assume an infinite supply of training data. We don't have that.

Researchers at Epoch AI estimated that the world contains approximately 5 zettabytes of data. Sounds infinite, right? The problem: most of it is either private (medical records, financial data, proprietary code), low-quality (bot-generated spam, duplicate content, garbage), or both. The high-quality, usable text data is much, much smaller. Current estimates suggest we'll exhaust the publicly available, high-quality text internet by 2026—maybe 2027 if we're generous with what counts as "useful."
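To put rough numbers on the scaling relationship and on that ceiling, here's a small sketch using the loss curve DeepMind fitted in the Chinchilla paper (Hoffmann et al., 2022). The functional form is theirs; the constants are approximate published fits, and the specific values printed are illustrative only:

```python
# Chinchilla-style scaling law (Hoffmann et al., 2022): loss falls as you add
# parameters and training tokens, but each term has diminishing returns.
# Constants are approximate published fits; treat the numbers as illustrative.

def predicted_loss(params: float, tokens: float) -> float:
    """Predicted training loss for a model with `params` parameters
    trained on `tokens` tokens."""
    E = 1.69                  # irreducible loss
    A, alpha = 406.4, 0.34    # parameter-count term
    B, beta = 410.7, 0.28     # data term
    return E + A / params**alpha + B / tokens**beta

print(predicted_loss(1e9, 2e10))     # ~1B params on ~20B tokens
print(predicted_loss(7e10, 1.4e12))  # ~70B params on ~1.4T tokens (Chinchilla-scale)

# With the token supply capped, the data term becomes a hard floor:
# a 100x bigger model barely moves the needle.
print(predicted_loss(7e12, 1.4e12))  # ~7T params, same 1.4T tokens
```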

DeepMind's Chinchilla made the point back in 2022: a 70-billion-parameter model trained on far more data outperformed Gopher, a model four times its size, on the same compute budget, suggesting that bigger isn't always better. But bigger is what companies know how to do. Bigger is what investors fund. Bigger is the strategy everyone bet their company on.
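Chinchilla's headline heuristic, roughly 20 training tokens per parameter for compute-optimal training, is also the clearest way to see why the data ceiling bites. The multiplier below is the commonly quoted approximation, not an exact law:

```python
# Back-of-the-envelope: compute-optimal token budgets under the commonly quoted
# ~20-tokens-per-parameter rule of thumb from the Chinchilla results.
# The multiplier is an approximation, not an exact law.

def compute_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Rough token budget for training a model of a given size compute-optimally."""
    return params * tokens_per_param

for params in (70e9, 400e9, 1e12):
    tokens = compute_optimal_tokens(params)
    print(f"{params / 1e9:>5.0f}B parameters -> ~{tokens / 1e12:.1f}T tokens")

# A hypothetical 1-trillion-parameter model wants on the order of 20T tokens of
# good text, which is the kind of number that starts to strain the usable supply.
```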

The Hallucination Problem Gets Worse Before It Gets Better

As companies grow desperate to find more training data, quality inevitably drops. Some are recycling AI-generated content to train new models—a practice that creates compounding errors. Others are licensing proprietary datasets at astronomical costs. A few are hoarding data like tech dragons, hoping their private collection gives them a durable advantage.

The consequence? Models trained on increasingly marginal data sources start exhibiting more frequent errors. This connects directly to a problem we've been watching explode: why AI keeps hallucinating facts. When a model runs out of legitimate examples to learn from, it starts improvising. It fills gaps with plausible-sounding nonsense.

Companies are desperately trying workarounds. Some are training models on synthetic data generated by other AI systems—essentially teaching AI from other AI's output, which is like photocopying a photocopy. Others are restricting models to smaller, ultra-clean datasets and accepting lower capability as the tradeoff. Neither solution is particularly satisfying.
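To see why recycling model output compounds errors, here's a toy simulation loosely in the spirit of the model-collapse literature (Shumailov et al.); the Gaussian setup is a deliberate simplification of mine, not anyone's actual training pipeline:

```python
# Toy illustration of recursive training on synthetic data: each "generation"
# fits a Gaussian to samples drawn from the previous generation's fit,
# the statistical equivalent of photocopying a photocopy.

import random
import statistics

random.seed(0)
TRUE_MEAN, TRUE_STD = 0.0, 1.0
samples = [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(200)]  # "human" data

for generation in range(1, 21):
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    # The next generation never sees the real data, only the previous model's output.
    samples = [random.gauss(mu, sigma) for _ in range(200)]
    print(f"generation {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

# On average the estimated spread decays slightly each generation and the mean
# drifts; rare tail values vanish first, which is exactly the kind of information
# a model trained on recycled output stops seeing.
```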

Welcome to the Creativity Phase

Here's where things get interesting. With scaling hitting its limits, the industry is pivoting away from "more data, more compute" toward different approaches entirely.

Some researchers are experimenting with synthetic data generation—creating new training examples through simulation or algorithmic generation rather than human-produced content. OpenAI's recent moves toward reasoning-based models (like o1) represent a different philosophy: instead of memorizing patterns, maybe models should learn to think step-by-step through problems. It's computationally expensive during inference, but it doesn't require infinite training data.
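Mechanically, trading training data for inference compute can look something like the sketch below. The generate and extract_answer callables are hypothetical placeholders for whatever model API you use, and the sample-and-vote pattern follows the published self-consistency idea rather than anything OpenAI has disclosed about how o1 works internally:

```python
# Sketch of spending more compute at inference time instead of at training time.
# `generate` and `extract_answer` are hypothetical stand-ins for any LLM call;
# the sample-several-chains-and-vote pattern follows the self-consistency idea
# (Wang et al.), not a description of o1's internals.

from collections import Counter
from typing import Callable

def solve_with_extra_compute(
    question: str,
    generate: Callable[[str], str],       # returns a reasoning chain plus an answer
    extract_answer: Callable[[str], str],  # pulls the final answer out of the chain
    n_samples: int = 16,
) -> str:
    """Sample several step-by-step solutions and return the most common answer."""
    answers = Counter()
    for _ in range(n_samples):
        chain = generate(f"Think step by step, then answer:\n{question}")
        answers[extract_answer(chain)] += 1
    # More samples means more inference cost, but no new training data required.
    return answers.most_common(1)[0][0]
```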

Others are exploring architectural innovations. Maybe we don't need trillion-parameter models. Maybe we need cleverer ways to organize parameters. Maybe we need models that can be continuously updated with new information without full retraining. Maybe we need smaller, specialized models that excel at specific domains rather than massive generalists.

The timeline matters here. If synthetic data approaches pan out, we might find a new path forward. If they don't, and the scaling wall holds, we could see a genuine slowdown in AI capability improvements. That's not as catastrophic as doomers suggest, but it would be historically unprecedented: the first time the more-compute-means-better-models curve has actually flattened.

What This Means for You (And Your AI Assistant)

The practical impact depends on where you sit. If you're a startup competing on raw capability, you're in trouble—you can't outspend OpenAI or Google, and the cheap scaling advantage that allowed underdogs to win is gone.

If you're using AI tools for work, prepare for a period of incremental rather than revolutionary improvement. The jump from GPT-3 to GPT-4 felt transformative. The jump from GPT-4 to whatever comes next might feel more like a refinement. That's not bad—refinements matter—but it's different from what we've experienced.

The optimistic take: this forces the industry to get smarter rather than just bigger. Creativity might trump raw compute. The pessimistic take: we're approaching the inherent limits of language model scaling, and that's just the beginning of our problems.

Either way, the age of effortless scaling is behind us. The real work is just starting.