
Last year, training a single large language model cost somewhere between $10 million and $100 million. That's not a typo. We're talking about the kind of money that could fund a decent-sized university for a year, and it's being spent to teach machines to write better emails and answer trivia questions.

The scaling problem is the elephant in every AI lab, conference room, and startup pitch deck—and it's getting worse. Much worse.

The Brutal Economics of Bigger Models

Here's the uncomfortable truth: the current path to smarter AI is absurdly expensive. The pattern is consistent and relentless. GPT-2's training compute is estimated to have cost tens of thousands of dollars. GPT-3's ran into the millions. Estimates for GPT-4 exceed $100 million, though OpenAI remains characteristically quiet about the exact figure.

This isn't just about throwing more GPUs at the problem. The relationship between model size and training cost isn't linear; it's steeply superlinear. Doubling the number of parameters doesn't just double your bill, because a bigger model also wants proportionally more training data, so the compute budget roughly quadruples. You need more powerful hardware, more sophisticated cooling systems, more electricity (serious amounts of it), and researchers who command six-figure salaries to optimize every ounce of efficiency.
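To see why, here is a back-of-the-envelope sketch in Python. It uses the common "compute ≈ 6 × parameters × tokens" rule of thumb and a roughly 20-tokens-per-parameter data budget; the GPU throughput and hourly price are illustrative assumptions, not published figures from any lab.

```python
# Back-of-the-envelope sketch of why training cost grows faster than model size.
# Assumptions (not from this article): the common C ~= 6 * N * D FLOPs rule of thumb,
# a ~20 tokens-per-parameter data budget, and illustrative hardware numbers.

def training_cost_usd(params: float,
                      tokens_per_param: float = 20.0,      # assumed data-to-parameter ratio
                      flops_per_gpu_second: float = 4e14,  # assumed sustained throughput per GPU
                      usd_per_gpu_hour: float = 2.0) -> float:
    """Rough dollar cost of one training run for a dense model with `params` parameters."""
    tokens = params * tokens_per_param
    total_flops = 6.0 * params * tokens                    # forward + backward pass approximation
    gpu_hours = total_flops / flops_per_gpu_second / 3600.0
    return gpu_hours * usd_per_gpu_hour

for n in (7e9, 14e9, 70e9):                                # hypothetical model sizes
    print(f"{n/1e9:>5.0f}B params -> ~${training_cost_usd(n):,.0f}")
# Doubling the parameters also scales the token budget, so compute (and cost) roughly quadruples.
```

The absolute numbers are crude, but the scaling is the point: ten times the parameters means roughly a hundred times the compute.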

A 2022 study estimated that training a single large language model (GPT-3) consumed roughly 1,287 megawatt-hours of electricity. That's enough to power roughly 120 homes for an entire year. For one training run. If the experiment fails, and in machine learning failure is common, you do it again and consume another 120 homes' worth of annual electricity.
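For readers who want to check that comparison, the arithmetic is simple; the household figure below is an assumed US average of roughly 10.7 MWh per year, not a number from the study.

```python
# Sanity check of the "120 homes for a year" comparison.
training_run_mwh = 1287           # estimated energy for one large training run
avg_home_mwh_per_year = 10.7      # assumed average annual US household consumption
print(training_run_mwh / avg_home_mwh_per_year)   # ~120 home-years of electricity
```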

OpenAI, Google, Meta, and Anthropic are among the few organizations with the capital to absorb these costs. Everyone else? They're building smaller models, fine-tuning existing ones, or hoping for a breakthrough in efficiency that might never come.

The Efficiency Paradox That's Keeping Researchers Up at Night

You'd think that with billions in funding and thousands of brilliant researchers working on the problem, we'd have cracked efficiency by now. We haven't. And there's a reason: there might not be a clean solution.

The fundamental trade-off is brutal. You can have a smaller model that's cheap to train but less capable. Or you can have a powerful model that requires a small country's energy infrastructure to build. Some researchers are exploring techniques like distillation (teaching a smaller model to mimic a larger one) or pruning (removing unnecessary components), but these approaches have limitations.

Distillation is promising but imperfect. A smaller model trained to mimic GPT-4 will never quite match it. You lose something in translation. Similarly, pruning a model trained on 2 trillion tokens of text requires knowing exactly which parts matter—which, ironically, requires deeper understanding of how these models work. Which we don't fully have.
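To make the mimicry idea concrete, here is a minimal sketch of a distillation loss in PyTorch. The temperature and the teacher/student wiring are illustrative choices, not any lab's published recipe; the gap this loss can't close is exactly what gets lost in translation.

```python
# Minimal knowledge-distillation loss: a small "student" model is trained to match
# a large "teacher" model's output distribution. Names and hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soften both distributions and penalize the student for diverging from the teacher."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Inside a training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits)
# loss.backward()
```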

Then there's the infrastructure problem. Training at scale requires specialized hardware (specifically, NVIDIA's H100 GPUs, which cost $40,000 each and are in constant shortage). Building a cluster large enough for frontier model training means securing thousands of these chips, which creates a massive bottleneck. NVIDIA literally cannot produce enough GPUs to meet demand. This has created a situation where access to cutting-edge AI development is gated not just by money but by the physical availability of silicon.

This creates a secondary effect that most people miss: it concentrates power. If only five organizations in the world can afford to train frontier models, then innovation becomes dominated by those five organizations. Everyone else is essentially running on their scraps.

What Happens When the Money Runs Out?

Here's where it gets interesting. We're approaching what some researchers call the "scaling plateau." The assumption has always been that we can simply throw more data and more compute at the problem and get better results. But we might be running out of high-quality training data. The entire public internet has already been scraped, deduped, and used multiple times over.

One widely discussed 2022 analysis projected that the supply of high-quality public text data could be exhausted around 2026. If that projection holds, we hit a wall where adding more compute doesn't help because there's nothing new to train on. Some labs are exploring synthetic data (using AI to generate training data for other AI systems), but that introduces its own problems, like models training on the mistakes and biases of their predecessors, compounded through generations.

This is also connected to why your AI model keeps hallucinating about things that never happened. When models are trained on increasingly sparse or synthetic data, they're more likely to confabulate convincing-sounding but false information.

The uncomfortable question everyone's avoiding: what if we've already hit the point of diminishing returns? What if the next generation of models will cost twice as much for only marginal improvements? At some point, even tech billionaires might decide that $500 million per model is too much.

The Underground Movement Toward Smaller, Smarter Models

Not everyone is playing the scale game. There's a quieter revolution happening among researchers who are exploring whether you can get surprising results with smaller models through better architecture, better training techniques, or smarter data selection.

Meta's LLaMA release demonstrated that a 13-billion-parameter model could compete with much larger systems on certain benchmarks, and it was trained for a fraction of what competitors spent. Other organizations are exploring mixture-of-experts architectures, where a routing network activates only a few smaller expert sub-networks for each input, rather than running one massive monolithic network end to end.
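As a rough illustration of that routing idea, here is a toy mixture-of-experts layer in PyTorch. The layer width, expert count, and top-k value are arbitrary placeholders, and production systems add load-balancing losses and capacity limits that this sketch omits.

```python
# Toy mixture-of-experts layer: a small gating network picks a few "expert" sub-networks
# per token, so only a fraction of the total parameters are active for any given input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)              # router: scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.gate(x)                                  # (tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)      # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only top_k of the experts run for any given token, so the parameter count can grow without the per-token compute growing with it. That is the appeal: capacity without a proportional compute bill.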

This approach won't replace frontier models trained at massive scale, but it suggests there's a future where competent, useful AI systems don't require a national laboratory's budget to create. For companies, startups, and researchers, that's a lifeline.

The Real Problem: Nobody Knows the Endpoint

The deepest fear in AI development isn't computational cost itself—it's uncertainty. Nobody actually knows what the optimal model size is. We don't know if AGI requires models the size of planetary atmospheres or whether a $10,000 system trained on the right data could theoretically achieve it. We're essentially driving blindfolded toward a destination we can't see.

If the answer turns out to be "you need even bigger models," then we're facing infrastructure and energy challenges that might be physically impossible to solve. If the answer is "smaller, smarter systems are the way," then the massive investments in scale were premature.

Either way, the companies betting everything on scaling are gambling with stakes they don't fully understand. And everyone else is waiting to see how the cards fall.