The Confidence Problem Nobody Expected
Last Tuesday, I asked ChatGPT to write me a professional biography. Thirty seconds later, it had created an entirely fictional career for me: "Principal Data Scientist at TensorFlow Industries (2019-2021)" and "Speaker at NeurIPS 2022." I've never worked there. I've never spoken at NeurIPS. The system didn't hedge with "I'm not sure about this" or "I might be wrong"—it presented invented credentials as absolute fact, with the kind of certainty you'd expect from someone reading from their actual resume.
This is the hallucination problem, and it's gotten weirder than anyone predicted. Not because AI systems are making mistakes—that was always expected. But because as these models have become larger and more capable, they haven't simply maintained their error rate; they've become more brazenly confident about being wrong. It's like having a friend who gets progressively drunker while simultaneously becoming more insistent that they're completely sober.
When Scale Makes Things Worse Instead of Better
OpenAI's GPT-4 is objectively more capable than GPT-3.5 across nearly every benchmark. It scores higher on bar exams, standardized tests, and technical coding challenges. Yet researchers have documented that it still hallucinates, sometimes in different ways than its predecessor, and occasionally with more elaborate detail.
Consider what happened at Google. In February 2023, Bard claimed in a promotional demo that the James Webb Space Telescope took the very first pictures of a planet outside our solar system, a claim that was false, and Alphabet's shares slid in the days that followed, erasing roughly $100 billion in market value. This wasn't a model that lacked capability; it was a model that confidently asserted something easily disprovable. The embarrassment stung partly because it felt preventable, like the AI should have known better.
Here's what makes this scientifically unsettling: we don't fully understand why scaling up models sometimes amplifies hallucinations instead of reducing them. The prevailing theory involves something called "training data saturation." When models see patterns in their training data that correlate with correct answers, they learn to reproduce those patterns. But when they encounter novel situations—questions about recent events, niche topics, or specific instructions—they extrapolate from existing patterns rather than admitting "I don't know." Bigger models are better at extrapolation. Sometimes that's useful. Sometimes it means they're just better at lying convincingly.
The Apology Paradox
There's another peculiar layer to this problem, one that some researchers call the "false modesty effect." It's closely tied to why AI chatbots keep apologizing for things they never did: models often hedge and apologize for correct statements while confidently asserting falsehoods.
A user asks: "What's the capital of Australia?" Claude responds with "I believe it's Canberra, though I could be mistaken." Technically correct, but unnecessarily cautious. The same user asks: "Who invented the internet?" And now you get a confident historical narrative that conveniently ignores ARPANET and credits the wrong people entirely—no hedging whatsoever.
The asymmetry is maddening. The models have learned that sounding uncertain reads as more human, more trustworthy. So they sprinkle uncertainty over mundane factual questions while projecting unwarranted certainty onto complex queries. It's like they learned that humility is a fashion accessory to be worn selectively, not a genuine reflection of their limitations.
Why This Matters Beyond Academic Curiosity
The hallucination problem stops being abstract the moment someone in a legal department uses an AI assistant to research case law and submits fabricated precedents to a judge. This has already happened. In Mata v. Avianca, a lawyer used ChatGPT to help draft a brief and cited entirely invented cases. The judge was not pleased, and the lawyers responsible were ultimately sanctioned over the fake citations.
Healthcare organizations using AI for diagnostic support face similar stakes. A radiologist's AI assistant confidently identifies a tumor that doesn't exist, or more dangerously, confidently identifies a benign finding when something sinister is present. The consequences compound because medical professionals might defer to the AI's confidence level.
The business impact is real too. Customer service departments implementing ChatGPT-powered support systems have discovered that the model will happily invent product features, pricing, or return policies. One company reported that its chatbot offered 200% discounts because it confidently hallucinated a promotional policy that never existed.
What's Being Done (And Why It's Harder Than You'd Think)
Researchers are experimenting with several approaches. Retrieval-augmented generation (RAG) forces models to cite sources and check their claims against a knowledge base before responding. Constitutional AI methods train models using a set of principles they're supposed to follow, reducing hallucinations by 25-35% in some tests. Fine-tuning with human feedback helps, but it's labor-intensive and surprisingly inconsistent.
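To make the RAG idea concrete, here's a minimal sketch of the pattern, assuming a toy in-memory knowledge base, a naive word-overlap retriever, and a stubbed `call_llm` function standing in for whatever model client you actually use. None of these names come from a specific library; the point is the shape of the prompt, not the plumbing.

```python
# Minimal retrieval-augmented generation sketch (illustrative only).
# The knowledge base, scoring function, and call_llm() are placeholders,
# not any particular vendor's API.

KNOWLEDGE_BASE = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Standard shipping takes 3-5 business days.",
    "The Pro plan costs $29 per month and includes priority support.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question
    (a stand-in for a real embedding-based retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for an actual model call; swap in whichever client you use."""
    return "(model response would go here)"

def answer(question: str) -> str:
    context = retrieve(question, KNOWLEDGE_BASE)
    prompt = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say 'I don't know.'\n\n"
        "Context:\n" + "\n".join(f"- {c}" for c in context)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)

print(answer("How long do refunds take?"))
```

The load-bearing part is the instruction to answer only from the retrieved context and to say "I don't know" otherwise; the retrieval itself can be as simple or as sophisticated as the use case demands.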
The uncomfortable truth is that we might not be able to solve hallucinations by making models bigger or training them longer. Some researchers suspect the problem is fundamental to how transformer-based models work. They pattern-match. They extrapolate. They don't genuinely access facts the way a database does; they approximate based on statistical relationships learned during training. Asking them not to hallucinate might be like asking a human not to misremember things—possible to reduce, but never to fully eliminate.
Anthropic, the company behind Claude, recently published research showing that their newer models could be tuned to be more honest about uncertainty. But the tradeoff was real: models that admitted ignorance more frequently also answered fewer questions overall. It's a reminder that every solution to the hallucination problem comes with costs.
The Future Probably Involves Acceptance
We might be approaching an important inflection point: the realization that we can't engineer hallucinations away entirely. Instead, the industry is gradually shifting toward guardrails, constraints, and clearer communication about limitations. Enterprise implementations increasingly pair AI assistants with mandatory human review processes. User interfaces are being redesigned to communicate uncertainty more transparently. Some organizations are building AI systems specifically designed to fail gracefully, returning "I don't know" rather than confidently guessing.
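As a rough illustration of that "fail gracefully" idea, here's a sketch of a guardrail wrapper that abstains and routes the question to a human review queue when confidence is low. The `estimate_confidence` heuristic, the threshold, and the queue are assumptions made for the example, not a standard API.

```python
# Illustrative guardrail wrapper: abstain and escalate instead of guessing.
# estimate_confidence() and the review queue are assumptions for this sketch,
# not part of any particular vendor's SDK.
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    needs_human_review: bool

REVIEW_QUEUE: list[str] = []

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "(model response would go here)"

def estimate_confidence(question: str, draft: str) -> float:
    """Stand-in heuristic; in practice this might use token log-probabilities,
    self-consistency across multiple samples, or agreement with retrieved sources."""
    return 0.4  # pretend the model is unsure for the sake of the example

def guarded_answer(question: str, threshold: float = 0.7) -> Answer:
    draft = call_llm(question)
    if estimate_confidence(question, draft) < threshold:
        REVIEW_QUEUE.append(question)  # escalate instead of confidently guessing
        return Answer("I don't know; a human will follow up.", needs_human_review=True)
    return Answer(draft, needs_human_review=False)

print(guarded_answer("What is our refund policy for enterprise contracts?"))
```

The wrapper doesn't make the model hallucinate less; it just ensures the confident guesses stop being the default thing a user sees.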
What we shouldn't expect is a magic bullet. The models that will dominate in five years will probably still hallucinate. The difference will be that we've learned to build systems that work *around* that limitation rather than pretending it doesn't exist. And maybe that's the real innovation—not smarter AI, but wiser humans using it.
