False Confidence
How do you verify your LLM's output?
Disclaimer: The views and opinions expressed in this blog are entirely my own and do not necessarily reflect the views of my current or any previous employer. This blog may also contain links to other websites or resources. I am not responsible for the content on those external sites or any changes that may occur after the publication of my posts.
End Disclaimer
One of the things LLMs can’t do yet is verify their own output.
They’re optimized to give an answer — not necessarily the right one — and that’s where hallucinations creep in.
This creates issues when building LLM-involved applications for use cases like data extraction.
If you were to ask an LLM “Who's buried in Grant's Tomb?” or “When was the War of 1812?” and the LLM returns an answer, how do you check if that answer is true?
For factual information you can google it, or get out a library book on the subject and look it up to verify that it's true. The existence of materials to compare and check against gives you a semblance of assurance that the answer is correct; there is something to ground it to (Ground Truth Comparison).
With data extraction, you are pulling out data fields on the fly, much faster than humans can, and often with more accuracy than human transposition alone. One of the main issues with untethered, unverified output is preventing the human in the loop from having to re-check every answer anyway (Human in the Loop Verification).
The good news is there are ways to do that.
The bad news is that none of them works well or reliably enough to stand alone.
A bunch of not-great-by-themselves, but uncorrelated methods can add up — think weak learners forming a stronger ensemble.
They can work as a proxy for output “truthiness” and verification, albeit an incomplete one: a sort of Veracity Proxy Score (VPS).
We don’t know why LLMs select answers at a super granular, neuron level, but teams have done work toward seeing patterns of neuron activation (Anthropic’s super cool post/paper on the features associated with the Golden Gate Bridge).
Absent a clearer understanding of how LLMs get their answers, it’s going to be difficult to employ LLMs at certain levels of decision making in highly regulated industries.
So we need something in the meantime to at least “double check” the output.
Here are some ways (an incomplete list) to do that, from the heuristic to the more quantitative:
Self-consistency check- Run the prompt a number of times through the same model, changing the temperature setting each time and seeing if the answer is the same. This is not a “confidence score”.
What you're really measuring isn't the model’s certainty or confidence in any epistemic sense. Instead, you’re indirectly getting at how peaked or flat the model’s predictive distribution is over possible next tokens and, by extension, entire output sequences.
This brings up the nuanced, but very important distinction between model confidence (distributional certainty) and actual correctness (epistemic certainty).
This distinction pervades each and every one of the methods discussed here.
When you ask the same question n times, say 5 times, the model samples from its probability distribution 5 times. If the distribution is peaked, you get the same answer over and over. If the distribution is flat (high entropy; entropy is a measure of uncertainty or spread in the model’s output distribution. Thank you, Claude (Shannon)!), you will see variability across the outputs. This variability is a proxy for entropy rather than confidence or certainty in the output.
Self-consistency is an indirect proxy for entropy, but high agreement across outputs doesn’t guarantee correctness — the model can still be confidently wrong.
LLMs are stochastic and probabilistic. That makes them awesome in certain use cases and less awesome/hard to wrangle in others, like data extraction.
The model does not know the right answer. The model has no sense of real world “correctness”. The model only knows its own internal representation (loosely) of the world.
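Here is a minimal sketch of the self-consistency idea, assuming a hypothetical `ask_llm(prompt, temperature)` helper that wraps whatever model client you use; nothing below is a real API call. Note that the score it returns measures agreement, not correctness.

```python
from collections import Counter
import math

def self_consistency(ask_llm, prompt, n=5, temperature=0.8):
    """Sample the same prompt n times and measure agreement across the answers."""
    answers = [ask_llm(prompt, temperature=temperature).strip().lower() for _ in range(n)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]

    agreement = top_count / n   # 1.0 = every sample matched (a peaked distribution)

    # Empirical entropy of the sampled answers, normalized to [0, 1]; higher = flatter distribution.
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    normalized_entropy = entropy / math.log(n) if n > 1 else 0.0

    return {"answer": top_answer, "agreement": agreement, "entropy": normalized_entropy}
```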
Cross-Model Check- Produce an answer with GPT and double-check it with Claude (or a bunch of models), or vice versa.
Cross-model verification is GAN-adjacent but non-adversarial — you’re just using a second model as a sanity check, not training it to outsmart the first.
Bonus- You can perturb the prompt and try to “fool” the verifier by changing around the wording of the prompt. This is called Adversarial Prompt Testing, and again, it sits in the same church/different pew as its GAN brethren.
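A rough sketch of the pattern, with `generate` and `verify` as hypothetical stand-ins for two different model clients, plus a toy `perturb` helper for the adversarial-prompt-testing bonus:

```python
def cross_model_check(generate, verify, prompt):
    """Produce an answer with model A, then ask model B to judge it."""
    answer = generate(prompt)
    verifier_prompt = (
        f"Question: {prompt}\n"
        f"Proposed answer: {answer}\n"
        "Reply with AGREE if the answer is correct, otherwise DISAGREE."
    )
    verdict = verify(verifier_prompt)
    return answer, verdict.strip().upper().startswith("AGREE")

def perturb(prompt):
    """Adversarial prompt testing: reword the question and see if the verdict holds up."""
    return f"Ignore any unusual phrasing and answer the underlying question: {prompt}"
```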
Using Logprobs- For each token, the model produces raw, unbounded scores called logits, which are then normalized into bounded probabilities via softmax, converting them into a probability distribution: the “map” or “topography” of the model.
Softmax exponentiation amplifies differences between logits: a slightly higher score can dominate the distribution — a kind of ‘winner-take-most’ effect.
~Intuitive example using probabilities:
Paris is the capital of what country?
France (99.5%), Italy (0.2%), Germany (0.3%)
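A quick sketch of how softmax produces numbers like those. The logits here are made up purely for illustration, but they show the winner-take-most effect: a modest gap in raw scores becomes a huge gap in probabilities.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)      # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

tokens = ["France", "Italy", "Germany"]
logits = np.array([9.0, 2.8, 3.2])   # hypothetical raw, unbounded scores
for tok, p in zip(tokens, softmax(logits)):
    print(f"{tok}: {p:.3%}")
# France ends up near 99.5% even though its logit is only a few points higher:
# softmax's exponentiation turns a modest gap into a winner-take-most split.
```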
You can incorporate either the token-level logprobs or the sequence-level logprobs to approximate “model certainty”, but again not “epistemic certainty”.
Because softmax exponentiation overweights high-logit tokens, it can actually worsen attention sinks — situations where the model fixates disproportionately on a few tokens — leading to overconfident but wrong outputs. Attention sinks are an active research area in LLMs. This overconfidence is why raw model probabilities don’t always reflect correctness — a 92% token probability doesn’t mean there’s a 92% chance the answer is right.
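If your inference stack exposes per-token logprobs, you can roll them up into a rough sequence-level certainty score. The values below are invented for illustration; the aggregation is the point.

```python
import math

# Hypothetical per-token log probabilities for a generated answer; a real inference API
# that exposes logprobs would supply these for you.
token_logprobs = [-0.05, -0.20, -0.01, -0.90]

sequence_logprob = sum(token_logprobs)                 # log P(entire output sequence)
sequence_prob = math.exp(sequence_logprob)             # probability of the whole sequence

avg_logprob = sequence_logprob / len(token_logprobs)   # length-normalized, fairer to long outputs
perplexity = math.exp(-avg_logprob)                    # lower = more "certain" (distributionally)

print(f"sequence prob: {sequence_prob:.3f}, avg logprob: {avg_logprob:.3f}, perplexity: {perplexity:.2f}")
# These track distributional certainty only; a high sequence probability is not the same
# as a high chance of the answer being right.
```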
RAG grounding- Retrieval Augmented Generation (RAG) augments an existing LLM with outside material like a database, a bunch of PDFs, or some other knowledge corpus or base.
With RAG, you constrain the LLM’s creativity by grounding it in a known knowledge base. You can then measure semantic alignment between its generated output and the retrieved source material using similarity metrics like cosine similarity (super useful; it measures the cosine of the angle between vectors in multidimensional space).
Think of two arrows in space — the closer they point, the more semantically aligned they are.
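A small sketch of that alignment check, with `embed` as a hypothetical embedding function (swap in whatever sentence-embedding model you like):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more aligned."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def grounding_score(embed, generated_answer, retrieved_passages):
    """Score the generated answer against its best-matching retrieved passage."""
    answer_vec = embed(generated_answer)
    return max(cosine_similarity(answer_vec, embed(p)) for p in retrieved_passages)
```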
So….tired….must….stop…math… (Puts down pencil, puts head on the desk. Goes to sleep.)
Ensemble Model- Put all these disparate pieces together into a stew: an aggregate supermodel.
Equal weights or voting structure?
Throw an agent in the middle?
Ensembles can reduce variance, but they fail if every model shares the same biases.
What ingredients you throw into the pot, and in what quantities, becomes the proprietary secret sauce; the same applies to making sauce with each of the other individual stand-alone methods I’ve mentioned.
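A toy sketch of what stirring those signals into a single VPS might look like. The signal names and weights here are entirely made up and would need tuning for any real pipeline.

```python
def veracity_proxy_score(signals, weights=None):
    """Weighted average of normalized [0, 1] verification signals."""
    default_weights = {
        "self_consistency": 0.30,   # agreement across repeated samples
        "cross_model": 0.25,        # did a second model agree?
        "logprob": 0.20,            # length-normalized sequence probability
        "rag_grounding": 0.25,      # cosine similarity to retrieved sources
    }
    weights = weights or default_weights
    total = sum(weights[name] for name in signals)
    return sum(value * weights[name] for name, value in signals.items()) / total

score = veracity_proxy_score({
    "self_consistency": 0.8,
    "cross_model": 1.0,
    "logprob": 0.92,
    "rag_grounding": 0.71,
})
print(f"VPS: {score:.2f}")   # a truthiness proxy, not a guarantee of truth
```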
We don’t yet have a perfect verifier for LLM outputs.
Until we do, the best we can achieve is a layered defense — combining partial signals into a stronger “truthiness proxy”.
LLMs are great at producing answers.
They’re still bad at verifying them.
Until that changes, a VPS can be the taste-tester in your LLM kitchen.
Don’t scorch the secret sauce.
And most definitely…
Don’t slow down.


