Grounded retrieval: what it is, what it isn't, what we measure
The canonical long-read on grounded retrieval: the three invariants, the anti-patterns, the eval harness, the four open failure modes, and the research we're running next.
Grounded retrieval is the discipline of producing text whose every factual clause is traceable to a retrieved source span that entails it. It is not the same thing as RAG. Most commercial RAG systems hallucinate at rates between 17% and 33% even with citations wired in — a 2024 Stanford HAI study of Lexis+ AI, Westlaw AI, and Ask Practical Law established the floor on this, and the rate has not moved meaningfully in the 18 months since. The Hacker News thread on whether RAG can solve hallucinations is a useful artifact to read alongside the paper: the comments from people running these systems in production track the academic findings.
We’ve published a Pledge about how PursuitAgent handles this, and a separate post on how the Pledge is enforced in code. This post is the canonical long-read on the underlying discipline. What grounded retrieval is. What it isn’t. What we measure. Where it still breaks. What we’re working on next.
A note on the numbers. Figures in the “What we measure” section below are directional. They come from our internal eval harness, run on our own workloads and our own gold sets. They are not a general benchmark and we would not expect them to reproduce on an external setup. The specific precisions, recalls, and composite weights below are snapshots of where the system sits on our corpora — not a performance claim that generalizes. We will publish the harness and a public gold set when both are stable; until then, read the numbers as shape, not as score.
What it is
Grounded retrieval has three invariants. We’ve published these before in the Pledge-in-code post, but they are worth restating in this canonical place because they are the load-bearing definition.
Pointer. Every emitted sentence carries a pointer to a specific source: a block ID, a block version, a document ID, and a page reference. The pointer is not a hash that the system might have generated post-hoc to make the citation look complete. The pointer is the actual location of the source the generator drew from. It is recorded at the moment of generation, not after.
Provenance. Every substantive noun phrase in the emitted sentence has a provenance trace into a span inside that source. “Substantive” here means anything that carries a fact: numbers, named entities, claims about behavior, claims about features. Connective tissue (“and,” “the system”) doesn’t need provenance. “We support SOC 2 Type II since 2021” has three substantive phrases (SOC 2 Type II, since 2021, we support) and each one needs a span trace.
Entailment. The source block, read independently, supports the emitted sentence. This is the hard invariant. A pointer can be correct (the system did look at that block) and the provenance can be present (every substantive phrase was lifted from somewhere) without entailment holding (the assembled sentence makes a claim the source doesn’t actually support).
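To make the three invariants concrete as data, here is a hedged sketch of what a grounded sentence record could carry. The names and fields are illustrative, not our production schema.

from dataclasses import dataclass

@dataclass
class SpanTrace:
    phrase: str  # a substantive noun phrase from the emitted sentence
    start: int   # character offsets of the supporting span
    end: int     # within the source block's text

@dataclass
class GroundedSentence:
    text: str
    # Pointer: recorded at the moment of generation, never reconstructed.
    block_id: str
    block_version: str
    document_id: str
    page: int
    # Provenance: one span trace per substantive phrase.
    traces: list[SpanTrace]
    # Entailment: set only after the verifier confirms the source block,
    # read independently, supports the full sentence.
    entailment_verified: bool = False

A record like this makes the audit mechanical: the reviewer's click follows the pointer fields, and the span traces answer "where did you get that number?"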
These three invariants are conceptually distinct. A system can satisfy two and miss the third. A system that has pointers and provenance but doesn’t verify entailment will look correct on a casual review and fail under a strict one. A system that has pointers and entailment but doesn’t track provenance will satisfy a reviewer reading the citation but won’t survive a precision audit. A system that has provenance and entailment but no pointer is internally consistent and externally non-auditable.
Why do all three matter? Because the operator using a proposal tool is going to be audited. By their own gold-team reviewer, by their own compliance lead, sometimes by the buyer asking “where did you get that number?” The invariants are not engineering aesthetics. They are the contract a proposal needs to satisfy when a finance reviewer looks at a cited claim and clicks the link. The link has to go somewhere. The somewhere has to actually back up the claim. And the link has to have been chosen because the claim came from there, not because the system found a plausible-looking citation after the fact.
What it isn’t
A meaningful amount of work in this space calls itself grounded retrieval while satisfying one or zero of the three invariants. The patterns repeat often enough to be worth naming.
It isn’t RAG alone. Retrieval-augmented generation is the architectural pattern of putting retrieved documents into the model’s context. RAG is necessary but not sufficient for grounded retrieval. A RAG system whose generator can ignore the retrieved context, paraphrase loosely, or interpolate between two retrieved snippets is doing RAG without doing grounded retrieval. Most of the RAG systems shipped in 2023–2024 operated this way. The Stanford HAI paper measures exactly this gap.
It isn’t “AI with sources.” A growing number of proposal-tool marketing pages describe their AI features as “grounded” because the output includes a citation footer. A citation footer is a UI element. It tells you the system retrieved something. It does not tell you the citation supports the claim. We have seen demos where the citation footer cheerfully links to a paragraph that contradicts the generated sentence. The grader on the buyer side is going to click the link. UI-only grounding fails on the first click.
It isn’t prompt engineering. The most popular failure mode in the practitioner community is a system prompt that reads, in essence, “only use the documents I’m giving you. If you can’t find the answer in the documents, say so.” This pattern works on benchmark questions. It fails on real proposal content. The model still produces text that has the surface structure of an answer drawn from the documents while actually drifting into its training distribution. Telling a model to be grounded is not the same as constraining it to be grounded.
It isn’t post-hoc citation matching. Some systems generate text first, then run a retrieval pass to find the “best matching” source for each sentence and attach it as a citation. This produces citations that are surface-similar to the generated text — same noun phrases, same numbers — without verifying that the cited source actually entails the generated claim. The Mayo Clinic discussion on Hacker News around per-claim verification is a good entry point on why post-hoc matching isn’t enough; verification has to happen before the sentence is committed, not after.
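The difference is visible at the pseudocode level. The first function below is the anti-pattern; the second is the shape grounded retrieval requires. All names are hypothetical.

def post_hoc_citation(generate, find_best_match, question, corpus):
    # Anti-pattern: the citation is chosen for surface similarity
    # after the sentence already exists. Nothing checks entailment.
    sentence = generate(question)
    citation = find_best_match(sentence, corpus)  # cosine / BM25 match
    return sentence, citation

def verified_draft(retrieve, draft_from, entails, question):
    # Grounded shape: a sentence is committed only if the block it was
    # drafted from entails it. Otherwise the system refuses.
    for block in retrieve(question):
        sentence = draft_from(question, block)
        if sentence and entails(sentence, block.text):
            return sentence, block  # the pointer is structural
    return None, None  # refusal, not a best-effort citation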
It isn’t an attribution heatmap. A separate UI pattern shades a generated sentence to indicate which source it “drew from most.” Heatmaps are pretty. They are also not causal. The shade reflects a similarity metric between the generated sentence and the retrieved chunks; it does not reflect a guarantee that the generator actually used the highest-shaded chunk to produce the sentence. We have shipped attribution heatmaps in prototype builds and pulled them because they confused reviewers into thinking the heatmap was a proof of grounding. It wasn’t. We replaced it with refusals at draft time.
What we measure
The eval harness is where the discipline gets concrete. Without measurement, “grounded retrieval” is a marketing word. With measurement, it is a number that moves up and down release-to-release and that we can hold ourselves to.
We measure five things on every meaningful pipeline change. None of them is sufficient on its own; together, they let us tell the difference between a regression in generation, a regression in retrieval, and a regression in verification.
Gold sets. We maintain a curated corpus of (RFP question, expected source block, expected entailed answer) triples. The current production gold set has 1,840 triples, drawn from public RFPs (state and federal procurement portals), redacted customer data with permission, and a synthetic but human-verified set generated from our own product documentation. Each triple is reviewed by two annotators; disagreements are arbitrated by a third. The gold set is versioned and tagged by domain (federal IT, healthcare, security questionnaire, generic enterprise). We do not optimize for the gold set in training, because we don’t train models — we use commercial frontier models and tune the pipeline around them. The gold set is held strictly as evaluation.
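In code, a triple has roughly this shape (illustrative fields, not the production schema):

from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    id: str
    text: str

@dataclass(frozen=True)
class GoldTriple:
    question: str          # the RFP question as asked
    expected_block: Block  # the block a correct pipeline should retrieve
    expected_answer: str   # a human-verified answer that block entails
    domain: str            # e.g. "federal-it", "healthcare", "security-questionnaire"
    gold_set_version: str  # the gold set is versioned, never silently edited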
Precision@k. For a given RFP question, the retrieval step returns a ranked list of candidate blocks. Precision@k measures, of the top-k blocks returned, what fraction are the gold-set “expected source blocks.” We track k=1, k=3, and k=10. k=1 is the headline number; k=10 is the safety net that tells us whether the right block was in the candidate list at all (and if not, the problem is the retrieval scoring, not the generator). Precision@1 on our current gold set runs in the low-to-mid 0.8s across domains. Precision@10 runs in the high 0.9s. The gap between those numbers is the work the retrieval ranker has to do.
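With one gold block per triple, precision@k in the harness reduces to a membership check over the top k; a minimal sketch:

def precision_at_k(retrieved_blocks, expected_block_id, k):
    # With a single gold block per question this is a hit indicator:
    # did the expected block appear anywhere in the top k?
    return any(b.id == expected_block_id for b in retrieved_blocks[:k])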
Claim-level entailment rate. For each emitted sentence, we decompose it into atomic claims (typically two to four per sentence) and run the verifier against the cited source block. The entailment rate is the fraction of atomic claims that the verifier confirms are entailed. A sentence that decomposes into four claims, three of which entail and one of which doesn’t, scores 0.75. We average across the test set. Our production target sits in the high 0.9s; the current run on our gold set is a couple of points below that. The gap is mostly numeric precision and synonym handling, which we cover below.
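A worked example of the scoring, with hypothetical decomposition output:

# Drafted sentence (hypothetical): "We hold SOC 2 Type II and ISO 27001,
# and have maintained 99.94% uptime since 2021."
claims = [
    "The vendor holds SOC 2 Type II certification",  # entailed by source
    "The vendor holds ISO 27001 certification",      # entailed by source
    "The vendor's uptime has been 99.94%",           # entailed by source
    "The uptime record extends back to 2021",        # not in the source
]
verdicts = [True, True, True, False]
entailment_rate = sum(verdicts) / len(claims)  # 0.75, as in the example above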
Citation-fidelity rate. Of the emitted sentences that have citations attached, the fraction whose citation actually supports the sentence under independent re-evaluation. This sounds like a duplicate of claim-level entailment but is distinct: it measures the contract between the citation pointer and the generated sentence, independent of whether the underlying retrieval was correct. A system can have high precision@k (good retrieval) and high claim-level entailment (good verification) and still fail citation-fidelity if the pointer attached to the sentence is the wrong one of several retrieved blocks. We run near-ceiling on this one on our internal eval, because the architecture makes the pointer a structural property of the draft step, not a post-hoc match. We are deliberately not quoting a third-decimal number — the metric’s headline is “close to 1,” and more precision than that implies a benchmarking posture we haven’t earned yet.
Composite grounded-score. A weighted combination of the four above, plus a refusal-rate adjustment (a system that achieves high entailment by refusing the majority of queries is not solving the problem). The weighting we run today leans heaviest on claim-level entailment, then citation-fidelity, then precision@1, with smaller weights on precision@10 and on the inverse-refusal term. The specific coefficients matter internally and are re-tuned every six months as failure modes shift; we do not publish them as operational facts because they are a calibration knob on our own system, not a general recipe. What is durable is the shape: entailment is the hardest invariant and gets the heaviest weight. A sketch with illustrative weights follows the harness code below.
# Sketch of the eval harness driver. Real harness is in the
# evaluation/ package; this is the load-bearing shape.
from dataclasses import dataclass

@dataclass
class EvalRow:
    triple: object
    refused: bool
    precision_at_1: bool = False
    precision_at_10: bool = False
    entailment_rate: float = 0.0
    citation_fidelity: bool = False

def evaluate(pipeline, verifier, gold_set):
    results = []
    for triple in gold_set:
        question = triple.question
        expected_block = triple.expected_block

        # Retrieval returns a ranked candidate list; @1 is the headline,
        # @10 is the safety net.
        retrieval = pipeline.retrieve(question)
        precision_at_1 = bool(retrieval) and retrieval[0].id == expected_block.id
        precision_at_10 = any(b.id == expected_block.id for b in retrieval[:10])

        draft = pipeline.draft(question, retrieval)
        if draft.kind == "ungrounded":
            # Refusal is a feature; it is recorded, not hidden.
            results.append(EvalRow(triple=triple, refused=True))
            continue

        # Claim-level entailment: decompose into atomic claims, verify
        # each against the cited source block.
        claims = decompose_claims(draft.sentence)
        entailed = [verifier.entails(c, draft.source_block.text) for c in claims]
        entailment_rate = sum(entailed) / len(claims)

        # Citation fidelity: the attached pointer's block must entail
        # the whole sentence under independent re-evaluation.
        fidelity = verifier.entails(draft.sentence, draft.source_block.text)

        results.append(EvalRow(
            triple=triple,
            refused=False,
            precision_at_1=precision_at_1,
            precision_at_10=precision_at_10,
            entailment_rate=entailment_rate,
            citation_fidelity=fidelity,
        ))
    return composite_score(results), results
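The composite_score helper referenced above is not shown. Here is a minimal sketch with illustrative coefficients, not the production weights, which we keep internal per the paragraph before the harness:

def composite_score(rows):
    # Illustrative weights only. They encode the published shape:
    # entailment heaviest, then citation fidelity, then precision@1,
    # with smaller weights on precision@10 and the inverse-refusal term.
    answered = [r for r in rows if not r.refused]
    if not answered:
        return 0.0
    n = len(answered)
    entailment = sum(r.entailment_rate for r in answered) / n
    fidelity = sum(r.citation_fidelity for r in answered) / n
    p1 = sum(r.precision_at_1 for r in answered) / n
    p10 = sum(r.precision_at_10 for r in answered) / n
    answer_rate = n / len(rows)  # penalizes solving entailment by refusing
    return (0.40 * entailment + 0.25 * fidelity
            + 0.20 * p1 + 0.05 * p10 + 0.10 * answer_rate)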
We did not build all of this from scratch. The public RAG eval ecosystem is more mature than it was 18 months ago. RAGAS provides reference-free evaluation primitives — faithfulness, context precision, context recall — that we use as a sanity check against our internal harness. ARES is more academic but useful for cross-domain calibration; we run our gold set through ARES quarterly and compare. TruLens provides observability around RAG calls in production, which we use for the live monitoring stack.
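For the RAGAS sanity check, a run looks roughly like this. Column names and the evaluate signature vary across ragas versions, so treat this as a shape, not a recipe:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall

rows = Dataset.from_dict({
    "question": ["Do you support SOC 2 Type II?"],
    "answer": ["We have maintained SOC 2 Type II compliance since 2021."],
    "contexts": [["SOC 2 Type II compliance is maintained annually; first audit in 2021."]],
    "ground_truth": ["Yes. SOC 2 Type II, audited annually since 2021."],
})
report = evaluate(rows, metrics=[faithfulness, context_precision, context_recall])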
What we built custom: the gold-set curation pipeline (because public datasets don’t reflect proposal-domain content), the claim decomposition (because RAGAS’s claim primitives are tuned for shorter answers than proposal text), and the citation-fidelity metric (because public frameworks treat citation as an output artifact, not as a structural property of the generation step). The eval harness is a place where building on top of public tooling makes sense. The metrics that matter most to the use case are the ones we wrote ourselves.
How it fails
The system has open failure modes. We don’t pretend it doesn’t. Honesty about which ones are open is part of how the discipline works — without naming the failures, the system’s claim to be grounded becomes an aspiration, not a verified state.
Numeric precision. “We have 99.9% uptime” against a source block that says “99.94%.” Both numbers describe the same uptime in different rounding. A strict entailment check fails the sentence because 99.9 != 99.94. A lax check passes it because they round to the same first-decimal value. We err strict — numeric mismatches fail by default — and surface the strict number from the source block as the suggested replacement. This catches the case where the model rounded down and the source actually has the more precise figure. It also catches the case where the model invented a number that sounds plausible. The cost is that some entailment failures are technically false negatives (99.9 and 99.94 are compatible at appropriate precision), and a reviewer has to decide.
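The strict-by-default numeric policy is simple to state; a sketch, names illustrative:

from decimal import Decimal

def numeric_match(drafted: str, source: str) -> bool:
    # Strict by default: drafted and source numbers must be exactly
    # equal. 99.9 vs. 99.94 fails even though they are compatible at
    # one decimal place; the source figure is surfaced as the
    # suggested replacement and a reviewer decides.
    return Decimal(drafted) == Decimal(source)

numeric_match("99.9", "99.94")   # False: fail, suggest "99.94"
numeric_match("99.94", "99.94")  # True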
Compound claims across blocks. A drafted sentence states two things, one of which is entailed by source block A and the other by source block B. Neither block, on its own, entails the full sentence. The current production stack doesn’t compose entailment across blocks — it requires a single source block to entail the whole sentence — so compound claims either get refused or get drafted as two separate sentences with separate citations. The two-sentence form is fine but stylistically clunky in a proposal where one fluid sentence would read better. We have a research branch that runs multi-block entailment by decomposing the drafted sentence into claims and verifying each claim against its strongest-entailing block. It works in offline eval. It is not in the production path yet, because the latency budget on a draft pass is tight and multi-block entailment doubles the verifier calls.
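In shape, the research branch looks like this, reusing the hypothetical helpers from the harness sketch:

def multi_block_entailment(sentence, blocks, verifier):
    # Every atomic claim must be entailed by at least one retrieved
    # block; no single block has to entail the whole sentence. The
    # cost is extra verifier calls, which is why this is not on the
    # production path yet.
    claim_support = {}
    for claim in decompose_claims(sentence):
        supporting = [b for b in blocks if verifier.entails(claim, b.text)]
        if not supporting:
            return None  # refuse: at least one claim is unsupported
        claim_support[claim] = supporting[0]  # stand-in for "strongest-entailing"
    return claim_support  # per-claim pointers for the citation layer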
Synonym and tense handling. “We support SOC 2 Type II” vs. “SOC 2 Type II compliance is maintained annually.” The two sentences describe the same fact in different tense and voice. A small entailment model handles common paraphrase, but corner cases slip. Specifically: tense shifts (“we have supported” vs. “we support”), modal shifts (“we may support” vs. “we support”), and aspectual shifts (“we have been supporting since 2021” vs. “we support, having begun in 2021”). The verifier we ship currently treats the second of each pair as fully entailed by the first. We are tracking these false positives by paraphrase type and tuning the verifier on them; the rates move with each verifier model upgrade.
Entailment false negatives in dense technical text. When the source block is dense technical writing — a security architecture document, a compliance attestation — the verifier sometimes fails to recognize entailment that a human reviewer would accept. The drafted sentence is a faithful summary; the source block uses different vocabulary; the verifier’s entailment confidence falls below the threshold and the sentence is refused. We see this in security questionnaire responses, where the source language is heavily standardized (NIST 800-53 phrasing, ISO 27001 phrasing) and the drafted sentence translates the standard language into the buyer’s vocabulary. The fix is partly a verifier model upgrade and partly training-time exposure to the standard vocabularies. Open work.
These are the four we’ve named publicly. There are smaller ones — punctuation handling in tables, footnote handling across page boundaries, hyphenated compound nouns — that we track in an internal failure-mode taxonomy and address as they reach the threshold of customer-visible impact.
What’s next
Three pieces of work are in flight, with timelines we believe but do not promise.
Multi-block entailment. The research branch we mentioned above. Goal: ship in production by end of Q3. The blocker is latency. A draft pass that does single-block entailment runs in ~1.4 seconds of verifier time. Multi-block entailment in the naive form runs in ~3.2 seconds. Reviewers tolerate the slower path on hard questions but not on every draft. We are working on a routing layer that detects when a question needs multi-block entailment and runs the slower path only there.
Claim decomposition as a first-class primitive. Today, claim decomposition runs after generation. We are exploring whether it should run during generation — a two-pass draft where the first pass produces atomic claims, each tied to a source, and the second pass assembles them into a fluid sentence. This would change the failure mode shape: instead of “generate then verify,” it becomes “compose from verified parts.” Open research. We expect to know whether it is viable by end of summer.
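Purely illustrative, the two-pass shape under exploration might look like this:

def compose_from_verified_parts(question, retrieve, draft_claims, assemble, verifier):
    # Pass 1: produce atomic claims, each tied to a source block and
    # verified before anything is assembled.
    verified = []
    for claim, block in draft_claims(question, retrieve(question)):
        if verifier.entails(claim, block.text):
            verified.append((claim, block))
    if not verified:
        return None  # refusal
    # Pass 2: assemble the verified claims into one fluid sentence.
    # The open question: can assembly be constrained so it cannot
    # introduce a claim that was not in the verified set?
    return assemble([c for c, _ in verified]), [b for _, b in verified]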
Structured source verification for numeric facts. Numbers are a category. Dates are a category. Named entities are a category. A general-purpose entailment verifier handles them adequately; a specialized verifier per category would handle them better. We are sketching a verification layer that routes substantive phrases by type — numbers go through a numeric-precision checker, dates go through a date-range checker, named entities go through a knowledge-graph match. It is a long-tail project. We do not expect to ship it in 2025.
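The dispatch being sketched, all names hypothetical:

def verify_phrase(phrase, source_span):
    # Route each substantive phrase to a type-specialized checker;
    # general entailment stays the fallback for everything else.
    if phrase.kind == "number":
        return numeric_precision_check(phrase, source_span)
    if phrase.kind == "date":
        return date_range_check(phrase, source_span)
    if phrase.kind == "entity":
        return knowledge_graph_match(phrase, source_span)
    return general_entailment(phrase, source_span)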
If you’ve made it this far, you have the canonical version. The rest of our grounded-AI writing — the Pledge, the Pledge-in-code post, the Grounded Retrieval 101 series, and the Part 2 follow-up — consists of slices of what’s here. This post is what they all point back to.
One closing observation
The thing we believed five years ago and still believe is that grounded retrieval is not a clever model trick. It is an engineering discipline imposed on a generative model that, left to its defaults, will fabricate. The discipline lives in the gates: retrieval floor, rewrite-only drafting, entailment verification, refusal as a feature. Every release is measured against the gates. Every regression we ship and have to walk back is a place we either widened a gate too far or didn’t see a failure mode the gate didn’t cover. The work is not exciting. It is the kind of work that, done well, produces a system you can hand to a finance reviewer and have them click every link.
Sources
1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (2024)
2. Hacker News — The issue of hallucinations won't be solved with the RAG approach
3. Hacker News — Mayo Clinic's secret weapon against AI hallucinations: Reverse RAG
4. RAGAS — Reference-free evaluation framework for RAG
5. ARES — Automated Evaluation Framework for RAG
6. TruLens — RAG evaluation and observability
7. PursuitAgent — Grounded-AI Pledge
See grounded retrieval in the product.
Start a trial workspace and watch PursuitAgent draft cited answers from the documents you provide.