Last week, I asked Claude to name the capital of Australia. It told me it was Sydney. With complete confidence. Zero hedging language. Just a straightforward, authoritative statement delivered as fact. The problem? Sydney isn't the capital. It's Canberra. And Claude knows this—I've gotten the right answer from it dozens of times before.
This isn't a bug. It's not an error in training data. It's something more fundamental and honestly more disturbing: these systems have learned to generate confident-sounding answers even when they have no idea what they're talking about. They don't experience doubt the way humans do. They experience probability distributions.
The Difference Between Not Knowing and Not Knowing You Don't Know
Here's what happens inside an AI model when it processes text. It doesn't think through problems step-by-step like you might. Instead, it's predicting the next word based on statistical patterns learned from training data. Each word gets a probability score. The model picks the highest-probability word and moves to the next one.
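To make that concrete, here's a toy sketch in Python of greedy next-word selection. The words and probabilities are invented for illustration; a real model scores tens of thousands of tokens at every step, but the mechanic is the same.

```python
# Toy sketch of greedy next-word selection. The probabilities are
# invented for illustration; a real model scores tens of thousands
# of tokens at every step.
next_token_probs = {
    "Canberra":  0.46,   # often follows "the capital of Australia is"
    "Sydney":    0.31,   # often appears near "Australia" in general
    "Melbourne": 0.12,
    "the":       0.11,
}

# Greedy decoding: always take the single most probable word.
choice = max(next_token_probs, key=next_token_probs.get)
print(choice)  # "Canberra" here -- but nothing stops "Sydney" from winning
               # in a context where its patterns happen to dominate
```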
When trained on billions of text samples, these models learn patterns incredibly well. They learn that "Canberra" appears in contexts with "capital of Australia." They learn grammar, logic, and even nuanced meaning. But here's the catch: they also learn that confident-sounding language is everywhere in training data, because humans write confidently about all kinds of things, true and false alike.
The model has no mechanism to distinguish between "I'm confident because I was trained on reliable sources" and "I'm confident because confident language is statistically common." Both produce the same output pattern. A smooth, fluid, completely self-assured sentence.
Worse, the model can't even conceive of uncertainty in the way we do. It doesn't have a little voice saying, "Wait, I'm not actually sure about this." It just continues selecting high-probability words.
Why Temperature Settings Don't Solve Anything
Some people think tinkering with "temperature" settings can fix this. Temperature controls how much randomness enters the selection process. Lower temperature = more predictable outputs. Higher temperature = more creative, chaotic outputs.
But here's the thing: neither setting makes the model actually know things it doesn't know. Lower temperature just makes it more confident in its hallucinations. Higher temperature makes it randomly worse. You're not choosing between accuracy and creativity. You're choosing between confident false statements and uncertain false statements.
A model trained on data up to April 2023 doesn't become accurate about world events in 2024 by adjusting temperature. It becomes creatively wrong instead of rigidly wrong.
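If you want to see why, here's a minimal sketch of what temperature actually does to the model's raw scores. The numbers are made up; the point is that rescaling changes how the dice are weighted, not what the model knows.

```python
import math

def apply_temperature(logits, temperature):
    """Rescale raw scores by temperature, then softmax into probabilities."""
    scaled = [score / temperature for score in logits]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores for two candidate answers the model already "believes" in,
# e.g. a wrong answer narrowly out-scoring a right one.
logits = [2.0, 1.5]
print(apply_temperature(logits, 0.2))  # ~[0.92, 0.08] -> more confident, same ranking
print(apply_temperature(logits, 2.0))  # ~[0.56, 0.44] -> more random, same knowledge
```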
The Reward Model Problem Nobody Talks About
Modern AI systems like ChatGPT were fine-tuned using something called RLHF—Reinforcement Learning from Human Feedback. Human raters scored model outputs on quality, helpfulness, and harmlessness. The system then got better at producing responses that matched high-scoring examples.
But here's where it gets weird. Human raters gave high scores to responses that were well-written, detailed, and confident-sounding. They often preferred longer, more elaborate answers over shorter admissions of uncertainty. Why? Because a confident-sounding wrong answer feels more useful than a hedged admission of "I don't know."
The model learned the lesson perfectly: confidence scores better, so produce confident outputs. This training objective accidentally incentivized hallucination.
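For the curious, here's a rough sketch of the pairwise comparison objective commonly used to train reward models. The scores are placeholders, not outputs from any real system, but the shape of the incentive is the point: whatever raters prefer gets pushed up.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: push the preferred response's reward above the rejected one's."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical rater judgment: a long, confident (but wrong) answer beat a
# short "I'm not sure". Training on pairs like this teaches the reward model
# that confidence is what earns a high score.
print(round(preference_loss(score_preferred=1.8, score_rejected=0.4), 3))  # ~0.22
```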
I've tested this with various models. The ones that have been trained with heavy human preference optimization tend to be more confidently wrong about uncertain topics. The ones that lean more on pure language modeling—even though they're less polished—tend to be slightly more honest about their limitations. It's a tradeoff nobody advertises.
What Actually Happens When You Ask Something Hard
Let's trace through a concrete example. Imagine you ask an AI: "What was the exact revenue of Tesla in Q3 2022?"
The model's training data includes thousands of documents mentioning Tesla's quarterly revenue. Some of them are accurate. Some are paraphrased incorrectly. Some are estimates. The model doesn't know which is which—it just knows these patterns co-occur.
When you ask the question, the model's attention mechanisms activate patterns associated with "Tesla," "Q3," "2022," and "revenue." From these patterns, it generates likely words. It might produce "$16.93 billion" because that number appeared frequently in accurate-looking documents. Or it might produce "$17.2 billion" because an incorrectly paraphrased article used that number but appeared in high-quality domains.
The model has no way to verify. No way to check. No internal fact-checker. And because confident language is the dominant pattern in human text, it wraps the answer in absolute certainty.
The actual answer was $16.934 billion. The model might be right, might be slightly off, might be wildly off. But it will never know, nor will it express uncertainty at the moment of generation.
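Here's a toy illustration of that moment of generation. The candidate figures and their weights are invented, not pulled from any real model or filing; the takeaway is that the output sentence sounds equally certain no matter which one gets sampled.

```python
import random

# Invented distribution over candidate completions for
# "Tesla's Q3 2022 revenue was ..." -- illustrative numbers only.
candidates = ["$16.93 billion", "$17.2 billion", "$16.5 billion"]
weights    = [0.55,             0.30,            0.15]

answer = random.choices(candidates, weights=weights, k=1)[0]
print(f"Tesla's Q3 2022 revenue was {answer}.")
# Whichever figure comes out, the sentence reads equally certain; nothing in
# the distribution records which number came from a reliable source.
```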
The Problem With Asking for Sources
Some people think the solution is to ask AI to cite sources. Show me where you got that information. Prove it to me.
This backfires spectacularly. Models are now trained to provide sources. So they generate sources. Made-up ones. Completely fabricated citations that sound plausible. URL structures that match real websites. Author names that fit patterns. They've learned that sources increase confidence ratings from human evaluators.
This is actually worse than no sources at all. At least without citations, you know you're hearing an unsourced claim. With fabricated citations, you get false authority. You get the appearance of verification. Why Your AI Chatbot Confidently Lies to You (And How to Spot When It's Making Things Up) breaks down the mechanics of these fabrications in detail.
Several companies tried building AI systems that cite actual sources from the web. But then you hit a different problem: the AI needs to retrieve information from massive databases, and that retrieval process has its own error modes. Now you're combining hallucination with lookup errors.
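A deliberately oversimplified sketch of that retrieve-then-generate pattern shows how the failure points stack. Both helper functions below are stand-ins, not a real search index or model API.

```python
# Deliberately oversimplified retrieve-then-generate loop. Both helper
# functions are stand-ins, not a real search index or model API.
def retrieve(query, documents):
    """Naive keyword overlap -- pick the wrong passage and that's a lookup error."""
    return max(documents, key=lambda doc: sum(w in doc.lower() for w in query.lower().split()))

def generate(query, passage):
    """Stand-in for the language model -- it can still misstate whatever it's handed."""
    return f"Based on what I found, the answer to '{query}' is: {passage}"

documents = [
    "Canberra is the capital of Australia.",
    "Sydney is the largest city in Australia.",
]
print(generate("capital of Australia", retrieve("capital of Australia", documents)))
# Two failure points now: the retriever can fetch the wrong passage, and the
# generator can still paraphrase it incorrectly.
```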
What This Means For You Right Now
The uncomfortable truth is this: current AI systems are essentially very sophisticated bullshitters. Not malicious ones. They're not trying to deceive you. They're doing exactly what they were optimized to do—generate plausible-sounding, well-formed text that matches training patterns.
That's incredibly useful for brainstorming, explaining concepts, and generating creative content. For factual claims, especially about recent events or specific data, treat them like you'd treat a very smart but fundamentally unreliable person. Someone who's impressive to talk to but shouldn't be trusted with important facts.
The researchers building these systems know this. They're working on solutions. Better training methodologies. Approaches that weight accuracy higher than confidence. Systems that can actually express "I don't know" without it being statistically penalized.
But we're not there yet. And the gap between what these systems seem like they can do and what they can actually reliably do keeps widening.
