Hallucination monitoring in production
The metrics we watch weekly: per-claim refusal rate, citation-mismatch rate, and the human-graded sample. What we do when each one moves, and the threshold values that trigger an alert.
We monitor hallucination in production with three metrics. Each one catches a different failure mode. Each one has a threshold that triggers an alert. This post covers what those metrics are, why we picked these specific ones, and what we do when one moves.
The framing is from the Stanford HAI legal RAG study: commercial RAG systems claim grounding and hallucinate 17–33% of the time. We assume our system is in that range until measurement says otherwise. The point of monitoring is to catch the regression, not to claim a number.
A note on the numbers. The threshold values, refusal rates, and mismatch rates below are from our internal monitoring on our own production traffic. They are directional and operational — they describe where we set alerts on our system, not a benchmark. The specific percentages move as the KB, the verifier, and the question mix change. The durable content is the shape of the alerting discipline, not the numeric values.
Metric 1 — Per-claim refusal rate
When the drafter generates an answer, every numeric claim and named-entity claim is run through a verifier that checks the claim against the cited source span. If the verifier cannot entail the claim from the source, the system refuses to draft that claim. The refusal rate is the share of generated claims that get refused.
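To make the mechanics concrete, here is a minimal sketch of that per-claim check, assuming a claim extractor and an entailment-style verifier; names like `extract_claims` and `verifier.entails` are illustrative, not our actual pipeline code.

```python
from dataclasses import dataclass

@dataclass
class ClaimCheck:
    claim: str        # the numeric or named-entity claim as drafted
    source_span: str  # the cited span it is supposed to rest on
    entailed: bool    # did the verifier entail the claim from the span?

def check_claims(draft, verifier, extract_claims):
    """Run every numeric / named-entity claim against its cited source span."""
    checks = []
    for claim, span in extract_claims(draft):  # (claim text, cited span) pairs
        entailed = verifier.entails(claim=claim, source=span)
        checks.append(ClaimCheck(claim, span, entailed))
    return checks

def refusal_rate(checks):
    """Share of generated claims the verifier refused to let through."""
    if not checks:
        return 0.0
    refused = sum(1 for c in checks if not c.entailed)
    return refused / len(checks)
```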
Why this is a hallucination metric. Refusals are the system catching itself before shipping. A rising refusal rate could mean (a) the KB has gone stale and the source spans no longer contain the facts the drafter wants to assert, (b) the drafter is generating claims that don’t appear in the source, (c) the verifier has gotten more strict, or (d) the questions coming in are harder than the historical baseline.
The threshold. We alert if the per-tenant 7-day refusal rate moves more than 30% relative to the prior 30-day baseline. The absolute number does not matter; movement does.
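The alerting rule itself is small. A sketch, assuming we already have daily per-tenant refusal rates on hand; the 0.30 relative-change threshold is the one described above.

```python
def refusal_alert(daily_rates_7d, daily_rates_30d, rel_threshold=0.30):
    """Alert if the 7-day refusal rate moves >30% relative to the 30-day baseline."""
    current = sum(daily_rates_7d) / len(daily_rates_7d)
    baseline = sum(daily_rates_30d) / len(daily_rates_30d)
    if baseline == 0:
        # any refusals against a zero baseline counts as movement
        return current > 0
    relative_change = abs(current - baseline) / baseline
    return relative_change > rel_threshold
```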
What we do when it moves. Read the refusals. The refusal log carries the claim, the source span, and the verifier’s reason. About 60% of the time the refusal is correct (the source genuinely doesn’t support the claim) and the question is “why did the drafter try to assert something the source doesn’t say.” The fix is usually a prompt or retrieval change. About 30% of the time the refusal is over-strict (the source supports the claim but in a paraphrased way the verifier missed). The fix is verifier tuning. About 10% of the time the source is genuinely stale, which is covered by the freshness scores we shipped.
Metric 2 — Citation-mismatch rate
When the drafter produces a citation, the citation points to a source span. The mismatch rate is the share of citations where the cited span does not actually contain the claim it is supposed to support — sampled and checked by a stronger model on a 5% slice of all answers.
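A sketch of how that sampled check could be wired up. The answer objects and the `strong_model_supports` judgment are stand-ins for the stronger-model call described above, not a real API.

```python
import random

def citation_mismatch_rate(answers, strong_model_supports, sample_frac=0.05, seed=0):
    """Estimate the mismatch rate on a ~5% random slice of answers."""
    rng = random.Random(seed)
    sample = [a for a in answers if rng.random() < sample_frac]
    citations = [(c.claim, c.span) for a in sample for c in a.citations]
    if not citations:
        return 0.0
    mismatched = sum(1 for claim, span in citations
                     if not strong_model_supports(claim=claim, source=span))
    return mismatched / len(citations)
```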
Why this is a hallucination metric. Citations are how grounded systems claim grounding. If the citation does not actually support the claim, the citation is decorative. The Stanford paper named this exact failure: “citations that point at irrelevant material, cited as if they support the claim.”
The threshold. Per-tenant weekly mismatch rate above 4% triggers an investigation. Above 8% triggers an alert. The Stanford baseline for legal RAG was 17–33% — we are aiming for an order of magnitude better, and we measure ourselves against that.
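Expressed as code, the two-tier rule is just this; the 4% and 8% cut-offs are the ones stated above.

```python
def mismatch_action(weekly_mismatch_rate):
    """Map the weekly per-tenant mismatch rate to an action."""
    if weekly_mismatch_rate > 0.08:
        return "alert"        # page someone
    if weekly_mismatch_rate > 0.04:
        return "investigate"  # open an investigation, no page
    return "ok"
```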
What we do when it moves. Look at the mismatched samples. Common causes: the retriever returned a topically related but factually distinct block; the drafter conflated two retrieved blocks and cited only one; the verifier passed something it shouldn’t have. Each cause has a different fix.
Metric 3 — Human-graded weekly sample
Every week, a human (one of us) grades 30 random production answers against three rubrics: factual correctness against source, citation accuracy, and answer relevance to the question. The grader is blind to which model produced the answer.
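A minimal sketch of how the weekly sample could be drawn and blinded, plus the review trigger described in the next paragraph. Field names like `model` and `factually_correct` are illustrative, not our actual schema.

```python
import random

def draw_weekly_sample(week_answers, n=30, seed=None):
    """Pick 30 random production answers and strip the model identity
    so the grader is blind to which model produced each one."""
    rng = random.Random(seed)
    sample = rng.sample(week_answers, k=min(n, len(week_answers)))
    return [{k: v for k, v in a.items() if k != "model"} for a in sample]

def needs_review(graded, max_incorrect=3):
    """More than 3 of 30 factually incorrect answers prompts a review."""
    incorrect = sum(1 for g in graded if not g["factually_correct"])
    return incorrect > max_incorrect
```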
Why we still do this. Model-graded eval is fast and noisy in ways that correlate with the model’s training. We tried it and stopped — covered in the quarterly eval report. Human grading on a small sample is slower, more expensive, and harder to game. Thirty answers per week is not statistically powerful, but it is a check on the automated metrics. If the automated metrics say things are fine and the human grading says they are not, the automated metrics are wrong.
The threshold. Any week where the human grader flags more than 3 of 30 answers as factually incorrect prompts a review. We have hit this threshold twice in the year and a half we have been running it; both times the failure traced to a specific retrieval bug that the automated metrics had not flagged.
What we do when it moves. Read every flagged answer. Trace the failure to a retrieval, drafting, or verification cause. File a bug. Ship a fix. Re-run the same 30 answers against the fix to confirm.
The metrics we considered and didn’t ship
A few candidate metrics that we evaluated and decided against.
Single “hallucination rate” number. Tempting as a marketing number. Useless as an operational metric. The hallucination-budget-per-claim post goes deeper on why a single number obscures the slice-level failures that actually matter.
Self-reported model confidence. LLM-reported confidence does not correlate well with actual correctness. We tested it on a 200-answer sample. Confidence and correctness had a weak positive correlation that vanished when the questions got harder. We do not gate anything on it.
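For anyone who wants to run the same check, a sketch of the correlation test on a labeled sample; it assumes each sample carries a model-reported confidence and a human-graded correctness label.

```python
from scipy.stats import spearmanr

def confidence_correlation(samples):
    """Rank correlation between self-reported confidence and graded correctness."""
    confidence = [s["confidence"] for s in samples]         # model-reported, 0..1
    correct = [1 if s["correct"] else 0 for s in samples]   # human-graded label
    rho, p_value = spearmanr(confidence, correct)
    return rho, p_value
```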
Edit-distance between draft and final. A drafter who writes nonsense that the reviewer rewrites top-to-bottom looks the same as a drafter who writes a usable draft that the reviewer polishes. The edit-distance proxy can’t tell them apart.
What this monitoring is for
The point of this is operational, not marketing. We want to catch a regression in production before a customer does. We have caught two so far — both inside 24 hours of the regression landing. Both came from the citation-mismatch monitor; the per-claim refusal rate was the slower-moving signal that confirmed them.
AutogenAI’s research on proposal hallucination names the failure modes well: invented case studies, fabricated statistics, incorrect compliance claims. These are the failures the customer sees. The metrics above are how we try to see them first.
What we have not built yet: per-customer dashboards so customers can see their own monitoring numbers without asking us. That is on the Q4 list. The internal dashboard already exists; the work is exposing it safely without surfacing other tenants’ data.