Field notes

Grounded Retrieval 101, Part 2: why citations don't guarantee groundedness

A citation tells you which passage was retrieved. It does not tell you whether the cited passage actually supports the generated claim. Part 2 of the Grounded Retrieval series — the entailment gap, and what closes it.

In part one we covered why retrieval-augmented generation still hallucinates. Stanford HAI’s legal-RAG study put numbers on it: hallucination rates of 17% to 33% in commercial tools that retrieve, cite, and use domain-tuned prompting.

This post is about why the citations those tools produce don’t fix the problem. The mechanism is the entailment gap — the difference between attributing a claim to a source and the source actually supporting the claim. Once you can see that gap, you can see why a separate verification pass is the only reliable defense.

The two questions a citation should answer

When a grounded-AI system writes a sentence and attaches a citation, the citation should answer two questions for the reader:

  1. Where did this come from? Which document, which paragraph, which page.
  2. Does the source actually support this? If the reader were to read the cited paragraph, would they conclude the sentence above is a fair summary of what the paragraph says?

A typical RAG pipeline answers question 1. It almost never answers question 2.

The reason is structural. Retrieval finds passages that are relevant to the question. The model generates a sentence that answers the question. The citation links the sentence to the passage. None of those steps verify that the passage entails the sentence. The pipeline is logically incomplete — it is missing a step that nobody is required to add and that very few products add.
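
To make the missing step concrete, here is the typical pipeline as a sketch. Every name below (Passage, retrieve, generate) is an illustrative placeholder, not any particular framework’s API; the point is the shape, three steps, none of which checks entailment.

```python
# A schematic RAG pipeline. All names are illustrative placeholders,
# not a real library's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Passage:
    id: str
    text: str

@dataclass
class Answer:
    sentence: str
    citation: str  # pointer to the retrieved passage

def answer_question(
    question: str,
    retrieve: Callable[[str], Passage],       # finds a passage RELEVANT to the question
    generate: Callable[[str, Passage], str],  # writes a sentence that ANSWERS the question
) -> Answer:
    passage = retrieve(question)
    sentence = generate(question, passage)
    # The citation links the sentence to the passage...
    return Answer(sentence=sentence, citation=passage.id)
    # ...and no step anywhere checked that passage.text entails sentence.
```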

What entailment means, in retrieval terms

Two sentences are in an entailment relation if reading the first sentence commits you to the truth of the second. “We support AES-256 encryption with customer-managed keys” entails “we support encryption at rest.” The first commits you to the second; you cannot assert the first and deny the second without contradiction.

In a RAG context, the question is whether the retrieved source entails the generated claim. The retrieved passage is the premise. The generated sentence is the conclusion. If the conclusion follows from the premise, the citation is honest. If it doesn’t, the citation is a fig leaf.

Entailment is not symmetric. The premise can entail the conclusion without the conclusion entailing the premise. A passage that says “we maintain SOC 2 Type II compliance, certified annually by an independent auditor” entails a generated claim that says “we are SOC 2 Type II compliant.” The reverse — generating “we are certified annually by an independent auditor” from a passage that only says “we are SOC 2 Type II compliant” — is not entailment, because the source doesn’t carry the annual-certification claim. A pipeline that doesn’t verify entailment will happily produce both directions.
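
This is a well-studied problem in NLP (natural language inference), and off-the-shelf models exist for it. A minimal sketch, assuming the Hugging Face transformers library and one such cross-encoder (cross-encoder/nli-deberta-v3-base is one choice; any MNLI-style classifier behaves similarly):

```python
# Premise = the retrieved passage; hypothesis = the generated sentence.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/nli-deberta-v3-base"  # illustrative; any NLI cross-encoder works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

premise = ("We maintain SOC 2 Type II compliance, "
           "certified annually by an independent auditor.")
hypothesis = "We are SOC 2 Type II compliant."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])  # expected: entailment

# Swap premise and hypothesis and the label should no longer be
# "entailment": the direction of the check matters.
```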

Where the entailment gap shows up in proposals

AutogenAI has been writing about this in the proposal context — hallucinations in proposals show up as invented case studies, fabricated statistics, and incorrect compliance claims. Most of those failures are entailment gaps.

A proposal RAG system, asked to “describe your past work with state agencies,” retrieves a passage about a Department of Health Services contract. The model generates “we have completed three state-agency engagements in the last 24 months.” The retrieved passage describes one engagement. The “three engagements in 24 months” claim is not supported by the passage. The citation points to a real, on-topic passage. The claim is fabricated. A reviewer who sees the citation and assumes it validates the claim will ship the fabrication.

The same pattern shows up on numeric claims. A passage says “RTO is targeted at 4 hours and RPO at 15 minutes for tier-1 systems.” The model generates “we offer RTO of 4 hours and RPO of 15 minutes across all customer tiers.” The “all customer tiers” generalization is not in the source. The cited passage looks correct at a glance because the numbers are right. The scope is wrong.

These are the failures the entailment gap explains. They look like citation success. They are claim failure.

Why the model can’t fix this on its own

A common reaction: better prompting will fix it. “Tell the model to only assert what’s in the source.” We’ve tried this. Stanford’s paper effectively tried it too, against the production prompts of three commercial vendors. Better prompting reduces the rate. It does not close the gap.

The reason is that a generative model’s job is to be helpful, and helpful in this context means producing an answer to the question. When the retrieved passage is on-topic but doesn’t fully answer, the model fills the gap with plausible-sounding content drawn from training. The fill is fluent. It cites the on-topic passage. It is wrong. The model is following the instructions it was trained on — be useful, answer the question — and a prompt that asks for restraint loses to those instructions in a fraction of cases. That fraction compounds across the many sentences of a long proposal.

You cannot solve this from inside the generation step. The fix has to live above it.

What closes the gap

A separate verification pass. After the model generates a sentence and a citation, an independent step asks: does the cited source entail the sentence? The verifier reads the source and the sentence. It returns entailed or not-entailed. Sentences that fail are refused — sent back as “ungrounded” rather than shipped as if they were verified.

The verifier is a different model from the drafter. It is tuned for entailment classification rather than for generation. It is cheap to run — entailment is a smaller problem than open-ended generation, and a smaller model handles it well. It runs on every sentence the drafting engine emits, not just on a sample.
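
A sketch of what that pass looks like, under the same assumptions as the earlier snippet: an off-the-shelf NLI cross-encoder standing in for the verifier, plus a hypothetical Verdict type. Production verifiers differ in tuning and thresholds, not in shape.

```python
# One verdict per drafted sentence: entailed ships, anything else refuses.
import torch
from dataclasses import dataclass
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/nli-deberta-v3-base"  # illustrative verifier choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
verifier = AutoModelForSequenceClassification.from_pretrained(MODEL)

@dataclass
class Verdict:
    sentence: str
    source: str
    grounded: bool

def verify(pairs):
    """Check every (generated sentence, cited source) pair, not a sample."""
    verdicts = []
    for sentence, source in pairs:
        inputs = tokenizer(source, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            label_id = verifier(**inputs).logits.argmax(-1).item()
        label = verifier.config.id2label[label_id]
        # Only a positive entailment verdict ships. "Neutral" (on-topic
        # but unsupported) fails, which is exactly the gap case.
        verdicts.append(Verdict(sentence, source, grounded=(label == "entailment")))
    return verdicts

# Failures go back to the user as refusals rather than drafted text:
#   ungrounded = [v for v in verify(pairs) if not v.grounded]
```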

A long Hacker News thread on Mayo Clinic’s “Reverse RAG” approach walks through a related design. Mayo’s framing is per-claim evidence verification: every claim emitted by the system gets re-checked against the source it came from, and claims that fail verification are dropped. The economics of this — running a verification step on every sentence, on every draft — were debated in that thread, and the consensus among practitioners was that for high-stakes domains the cost is justified. Proposal work is high-stakes in this sense. Shipping a fabricated SOC 2 statement is more expensive than the cost of running a verifier across the response.

What this looks like in our stack

We described the architecture in detail in our grounded-AI pledge enforcement post — the post that pairs with this one. The short version: three gates before any drafted sentence ships. Retrieval-floor gate (no draft below a confidence threshold). Constrained-rewrite drafting (the model rewrites the source rather than generating from training). Entailment verification (a separate model checks every drafted sentence against its source).

A failure at any gate produces a refusal — surfaced to the user as “ungrounded, here are the candidate sources, here’s why the engine stopped” — rather than a drafted-but-fabricated sentence. Refusals are a feature, not a bug.
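
In outline, the gates compose like this. Everything below is a schematic, not our production code; the threshold, names, and refusal strings are illustrative.

```python
# Three gates between a query and a shipped sentence.
from dataclasses import dataclass
from typing import Callable, Optional

RETRIEVAL_FLOOR = 0.7  # hypothetical confidence threshold

@dataclass
class Result:
    sentence: Optional[str]  # drafted sentence, if every gate passed
    refusal: Optional[str]   # why the engine stopped, if one did not

def draft_sentence(query: str,
                   retrieve: Callable,  # returns (passage_text, confidence)
                   rewrite: Callable,   # constrained rewrite of the source
                   entails: Callable) -> Result:
    # Gate 1: retrieval floor. No draft below the confidence threshold.
    passage, confidence = retrieve(query)
    if confidence < RETRIEVAL_FLOOR:
        return Result(None, "ungrounded: retrieval confidence below floor")

    # Gate 2: constrained rewrite. The model rewrites the source
    # rather than generating freely from training.
    sentence = rewrite(passage)

    # Gate 3: entailment verification. A separate model checks the
    # drafted sentence against the source it was rewritten from.
    if not entails(passage, sentence):
        return Result(None, "ungrounded: sentence not entailed by source")

    return Result(sentence, None)
```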

What we still don’t have right

Two areas where the verifier is noisier than we’d like.

Numeric mismatches. “99.9% uptime” vs. a source that says “99.94%” can pass or fail entailment depending on the verifier’s tuning. We err on the side of strictness, which means the human reviewer sees more entailment failures than they would on a looser system. The trade is correct — false positives in the verifier are a workflow nuisance; false negatives are a fabricated claim shipping to a customer.
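
A toy version of the strict policy, to show the trade: extract the numbers on both sides and fail the claim on any mismatch, however close. (Our actual verifier is an entailment model, not this regex; this is only an illustration of the bias.)

```python
import re

def numbers_in(text: str) -> list[float]:
    """Extract numeric values: '99.9% uptime' -> [99.9]."""
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]

def numbers_supported(claim: str, source: str) -> bool:
    # Strict by design: every number in the claim must appear in the
    # source. A false positive here is a reviewer nuisance; a false
    # negative is a fabricated figure shipping to a customer.
    source_numbers = set(numbers_in(source))
    return all(n in source_numbers for n in numbers_in(claim))

print(numbers_supported("99.9% uptime", "We target 99.94% uptime."))
# False: 99.9 != 99.94, so the sentence goes back to the reviewer.
```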

Synonym-and-paraphrase entailment. “We support SOC 2 Type II” vs. “SOC 2 Type II compliance is maintained.” A modern entailment model handles common paraphrase well, but corner cases slip. We treat these as the failures we most want customers to flag, because the verifier improves on real examples faster than it does on synthetic ones.

What’s coming in this series

Part 3 (next week): the verification step in detail — what model we use, how it’s tuned, what it costs per sentence, where the false positives and false negatives land.

Part 4: how we measure all of this in production. The held-out evaluation set, the metrics, and the honest version of what the numbers say about the engine over time.

The short version

A citation tells you where a passage came from. It does not tell you whether the passage supports the claim it’s attached to. Closing that gap is a separate engineering problem, and it requires a verification step the typical RAG pipeline does not include. Until that step is in your pipeline, the citation is a UI element, not a guarantee.

Sources

  1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2024)
  2. Hacker News — Mayo Clinic’s secret weapon against AI hallucinations: Reverse RAG
  3. AutogenAI — AI hallucination: how can proposal teams reduce risk?