Hallucination rate: a year-in measurement update
How we measure hallucination rate on grounded drafts, what the number looks like a year in, what moved it since the early baseline, and where the number lives in production for customers to see.
We publish a hallucination number on /platform/grounded-ai. Today it sits at 0.9% across grounded drafts, measured the way we will describe below. This post is about how we measure it, what moved it since the early baseline, and what the number does and does not tell you.
The short context: the Stanford HAI research on legal RAG landed in 2024 and documented 17–33% hallucination rates on commercial grounded systems. That paper shaped how we talked about the category from day one. We are not the first to measure and we are not the only team trying to get the number lower, but we are one of the few teams that publishes it and defines exactly what the measurement covers.
The measurement, defined
A “hallucinated claim” in our measurement framework is a sentence in a grounded draft that asserts a factual claim, carries a citation, and is not actually supported by the cited source. Three components, all load-bearing.
The claim has to be factual. Stylistic assertions, transition sentences, and framing sentences are excluded. We are measuring claims of the form “our team has 14 certifications” or “the SLA is 42 minutes,” not “our approach is comprehensive.”
The claim has to carry a citation. We do not include uncited sentences in the denominator, because the product is designed to flag uncited factual claims at draft time. A sentence that ships without a citation is a different failure mode, and we measure it separately (see the citation density post).
And the claim has to be unsupported by the specific cited source. Not “unsupported by any knowable evidence” — that would be a different and much harder measurement. Our test is whether a reviewer who opens the cited block and reads it can verify the claim. If yes, the claim is grounded. If no, it is a hallucination.
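To make the three-part test concrete, here is a minimal sketch of the predicate, assuming a hypothetical Claim record; the field names are illustrative, not the production schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str                   # the sentence as it appears in the draft
    is_factual: bool            # factual assertion vs. framing/stylistic sentence
    citation_id: Optional[str]  # id of the cited source block, if any
    verdict: Optional[str]      # "supported" | "contradicted" | "not_addressed"

def counts_toward_denominator(claim: Claim) -> bool:
    # Only factual, cited claims enter the measurement at all.
    return claim.is_factual and claim.citation_id is not None

def is_hallucination(claim: Claim) -> bool:
    # A hallucinated claim is a cited factual claim the cited block does not support.
    return counts_toward_denominator(claim) and claim.verdict in ("contradicted", "not_addressed")
```

Uncited factual sentences never reach is_hallucination; they are counted by the separate citation-density measurement.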
The measurement pipeline
Three stages, run nightly against a fixed evaluation set of 500 grounded drafts:
Stage one: claim extraction. A separate model pass extracts every factual claim from each draft, along with its attached citation. We use a second model (not the drafting model) to avoid the failure mode of the drafting model grading its own work. The separation matters — we saw inflated self-scores the one time we tried having the drafting model verify itself.
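A sketch of what that extraction pass can look like, assuming a hypothetical call_model helper that wraps the second model; the real extraction prompt is the one documented in the eval harness repo, not this one.

```python
import json

EXTRACTION_PROMPT = (
    "Extract every factual claim from the draft below. For each claim, return the "
    "sentence verbatim and the id of the citation attached to it. Ignore stylistic, "
    "transition, and framing sentences. Return a JSON list of objects with keys "
    "'claim' and 'citation_id'.\n\nDraft:\n"
)

def extract_claims(draft_text: str, call_model) -> list[dict]:
    # call_model wraps the extraction model (not the drafting model); it is injected
    # so the drafting model can never end up grading its own work.
    raw = call_model(EXTRACTION_PROMPT + draft_text)
    return json.loads(raw)
```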
Stage two: evidence lookup. For each extracted claim, the pipeline pulls the cited source block verbatim and presents it to a verification model along with the claim. The verification model is prompted to return one of three labels: supported, contradicted, or not addressed.
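A sketch of the per-claim verification call, under the same assumption of a hypothetical call_model helper; the three labels are the ones named above.

```python
VERIFY_PROMPT = (
    "You are verifying a claim against its cited source.\n"
    "Answer with exactly one word: supported, contradicted, or not_addressed.\n\n"
    "Cited source block:\n{source}\n\nClaim:\n{claim}\n"
)

VALID_LABELS = {"supported", "contradicted", "not_addressed"}

def verify_claim(claim_text: str, source_block: str, call_model) -> str:
    label = call_model(VERIFY_PROMPT.format(source=source_block, claim=claim_text)).strip().lower()
    # Anything outside the label set is treated as "not_addressed" so the check fails safe.
    return label if label in VALID_LABELS else "not_addressed"
```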
Stage three: human audit on a sample. Ten percent of the verification model’s labels are sampled and reviewed by a human. The human label is ground truth. The agreement rate between the verification model and the human gives us a confidence interval on the published number. Current agreement rate is 94.5%, which gives us roughly ±0.2 percentage points on the published number.
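For illustration, a sketch of the audit sampling and the agreement computation, with made-up field names; the step that maps the agreement rate onto the ±0.2-point interval is not reproduced here.

```python
import random

def draw_audit_sample(labeled_claims: list[dict], fraction: float = 0.10, seed: int = 0) -> list[dict]:
    # Sample 10% of the verification model's labels for human review.
    rng = random.Random(seed)
    k = max(1, int(len(labeled_claims) * fraction))
    return rng.sample(labeled_claims, k)

def agreement_rate(audited: list[dict]) -> float:
    # Each audited record carries the model label and the human (ground-truth) label.
    agree = sum(1 for c in audited if c["model_label"] == c["human_label"])
    return agree / len(audited)
```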
The pipeline is described in more detail in the eval harness CLI post; the measurement is rerun every night, and the public-facing number updates weekly.
What moved the number since year one
The year-one baseline (April 2025) was 2.3%. Today it is 0.9%. The drop comes from three interventions, each shipped separately.
First intervention: forcing citation at draft time, not at review time. A year ago the drafting loop produced sentences and then retrieved supporting evidence. Half the hallucination surface was sentences where the generator wrote something plausible and the retriever dutifully attached a roughly-on-topic block that didn’t actually support the claim. The change was to reverse the order — retrieve first, then draft from the retrieved blocks, with the blocks pinned in context. The hallucination rate dropped roughly 1 percentage point from this change alone.
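A sketch of the reordered loop, assuming hypothetical retrieve and call_model helpers; the point is only the ordering, retrieval before generation, with the retrieved blocks pinned into the drafting prompt.

```python
def draft_grounded_section(question: str, retrieve, call_model, k: int = 8) -> str:
    # Retrieval happens before any text exists, so the generator can only draft
    # from blocks that are already in front of it.
    blocks = retrieve(question, top_k=k)
    context = "\n\n".join(f"[{b['id']}] {b['text']}" for b in blocks)
    prompt = (
        "Draft an answer using ONLY the numbered source blocks below. "
        "Attach the block id as a citation to every factual sentence. "
        "If the blocks do not cover a point, say so instead of asserting it.\n\n"
        f"Source blocks:\n{context}\n\nQuestion:\n{question}\n"
    )
    return call_model(prompt)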
Second intervention: per-claim verification as a hard gate. After drafting, every claim in the draft is re-verified against its cited block before the draft is released to the reviewer. If the verification model labels a claim contradicted or not addressed, the claim is flagged and held back. The reviewer sees a “this claim needs your attention” badge instead of a shipped sentence. This cost us roughly 20% more LLM spend per draft and moved the rate another 0.3 points.
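A sketch of the gate itself, on illustrative claim records: anything the verifier does not label supported is held back rather than shipped.

```python
def gate_draft(claims: list[dict]) -> tuple[list[dict], list[dict]]:
    # Each claim dict carries the verifier's label under "verdict" (illustrative field name).
    released, held = [], []
    for claim in claims:
        if claim["verdict"] == "supported":
            released.append(claim)
        else:
            # These get the "this claim needs your attention" badge in the reviewer view.
            held.append(claim)
    return released, held
```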
Third intervention: better retrieval on high-risk claim types. Numeric claims and regulatory claims are where hallucinations cluster — the AutogenAI writeup names this pattern directly, and our data agreed. We added specialized retrievers for numbers and for compliance language that give higher weight to blocks with matching numeric or regulatory structure. The remaining 0.2 points came from this.
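A sketch of the re-ranking idea for high-risk claim types; the patterns and weights below are invented for illustration and are not the production retrievers.

```python
import re

NUMERIC = re.compile(r"\d[\d,.%]*")
REGULATORY = re.compile(r"\b(ISO \d+|SOC 2|GDPR|HIPAA|FedRAMP)\b", re.IGNORECASE)

def rerank(claim_text: str, candidates: list[dict]) -> list[dict]:
    claim_numbers = set(NUMERIC.findall(claim_text))
    claim_regs = {m.lower() for m in REGULATORY.findall(claim_text)}

    def score(block: dict) -> float:
        s = block.get("base_score", 0.0)
        if claim_numbers & set(NUMERIC.findall(block["text"])):
            s += 0.5  # shared numeric structure (illustrative boost)
        if claim_regs & {m.lower() for m in REGULATORY.findall(block["text"])}:
            s += 0.5  # shared regulatory language (illustrative boost)
        return s

    return sorted(candidates, key=score, reverse=True)
```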
The pattern across all three interventions is the same pattern the Hacker News thread on RAG hallucination argues for: the hallucination problem is not solved by better retrieval alone, and it is not solved by better generation alone. It is solved — partially — by structural discipline that forces retrieval and generation to agree at the claim level.
What the number doesn’t tell you
The 0.9% is a measurement on our evaluation set. It is not an upper bound on hallucination in production, because production drafts contain claim types the evaluation set does not cover.
It is not a measurement of “is the claim true.” A claim can be supported by its cited block and still be false in the world — if the cited block itself is wrong, stale, or misrepresenting its own source. We call that a KB quality problem, not a hallucination, and we measure it separately with the content-freshness dashboards the KB block versioning post describes.
And it is not a claim that grounded drafting is “solved.” Even a 0.9% rate means roughly one in every 110 claims is a hallucination. A 50-page proposal with 1,000 citable claims ships with about nine hallucinated sentences. That is why per-claim verification is a hard gate, not a soft warning, and why the product still insists on reviewer sign-off before anything goes to a buyer.
Where the number lives
The 0.9% number is updated weekly on /platform/grounded-ai and tagged with the measurement date, the evaluation set version, and the verification-model version. The evaluation set is described in the golden-set post; the prompt for the verification model is documented in the same eval harness repo that runs the nightly pipeline.
Customers can reproduce the measurement on their own drafts. Any customer running the production pipeline can pull their own claim-level verification logs and compute the same rate on their own data. A handful of customers do this quarterly and publish the number internally as part of their own AI governance reporting. We think that is the right pattern — the hallucination number is most useful when each customer owns it for their own corpus.
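A sketch of that customer-side computation, assuming the per-claim verification logs are exported as JSON lines; the field names here are illustrative, not the log schema.

```python
import json

def hallucination_rate(log_path: str) -> float:
    cited, hallucinated = 0, 0
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if not record.get("cited"):
                continue  # uncited claims are a separate metric, not in this denominator
            cited += 1
            if record["verdict"] in ("contradicted", "not_addressed"):
                hallucinated += 1
    return hallucinated / cited if cited else 0.0
```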
What year three looks like
Three things we are working on for the next year. A better verification model (the current one is off-the-shelf; a distilled version trained on our audit data should cut LLM cost per verification by half). A richer label than “supported / contradicted / not addressed” that distinguishes partial support from full support. And a per-customer rate, published to the customer’s own dashboard, so each customer can see their number against the aggregate.
Hallucination rate is the measurement the category most needs to take seriously. Publishing a number with a method attached is the least a grounded-AI vendor should do, and until this is table stakes across the category we will keep publishing ours.