Grounded Retrieval 101, Part 4: what we're still wrong about
The closing post of the Grounded Retrieval 101 series. Three failure modes we have not solved — numeric precision, compound claims, synonym drift — with the test cases that surface them and what we are doing about each.
This is Part 4 — the closing post — of the Grounded Retrieval 101 series. Earlier posts in the series covered the retrieval-floor pattern (Part 1), the rewrite-only drafting prompt (Part 2), and the entailment verifier (Part 3). They described what we built and why it works.
This post is the honest one. It describes what we have not solved.
The series exists because we promise customers, contractually, that PursuitAgent will refuse to draft a sentence it cannot ground in their KB. The promise is the Grounded-AI Pledge. It is a real promise with a real penalty clause. If we get it wrong, the customer can terminate without paying for the remaining term.
Promising something this strongly forces a discipline we wouldn’t otherwise have. We have to know exactly where the system fails and tell customers about it before they discover it themselves. Three failure modes are still open. This post walks through each of them, with the test cases that surface them.
Failure 1 — Numeric precision
The issue: the entailment verifier does not enforce numeric exactness consistently.
A drafted sentence reads “Acme has 99.9% uptime across all services.” The retrieved block reads “Acme’s uptime SLA is 99.94%, measured monthly across the platform.” The block contains a more precise number and a stronger qualifier. The drafted sentence rounds down and drops the qualifier.
A strict verifier should fail this. The drafted claim (“99.9%”) is a weakening of the source (“99.94%”) — it is supported by the source but it is not the source’s claim. In a proposal context, weakening is rarely what the customer wants. They want the source’s number, not a rounded restatement of it.
Our verifier is tuned to reject numeric mismatches in either direction (drafted higher or lower than the source), and it does reject this case. But the rejection is not consistent. When the source reads “approximately 99.9%” and the drafted sentence reads “99.94%,” the verifier sometimes passes the draft because “approximately” reads as a range, and the model’s interpretation of that range is not stable.
The test cases. We maintain a held-out set of 240 numeric-precision pairs — drafted sentence, source block, expected verdict — assembled from real proposals and real hallucinations. Our current verifier hits 91% accuracy on this set; a sketch of how we score it follows the list below. The 9% miss rate is concentrated in three patterns:
- Range expressions (“approximately,” “around,” “roughly”) in the source.
- Different units in source and draft (“4Gbps” vs. “4,096 Mbps”).
- Compound figures (“99.9% in 2024 and 99.94% in 2025” — drafted as “99.9% historically”).
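For concreteness, here is a minimal sketch of how a held-out set like this can be scored. The JSONL layout, the field names, and the `verify` callable are hypothetical stand-ins (the real verifier is a model call); the shape of the measurement is the point: every pair carries an expected verdict, accuracy is agreement with it, and misses are tagged by pattern so we can see where the error rate concentrates.

```python
import json
from collections import Counter

def score_pairs(path, verify):
    """Score a held-out set of (draft, source, expected) pairs.

    `verify(draft, source)` -> "pass" | "fail" is the verifier under test.
    Field names and the JSONL layout are illustrative, not our real schema.
    """
    misses = Counter()
    total = correct = 0
    with open(path) as f:
        for line in f:
            pair = json.loads(line)
            got = verify(pair["draft"], pair["source"])
            total += 1
            if got == pair["expected"]:
                correct += 1
            else:
                # Tag each miss with its annotated pattern (ranges, units,
                # compound figures) so the breakdown above falls out directly.
                misses[pair.get("pattern", "unlabeled")] += 1
    return correct / total, misses

# e.g. accuracy, by_pattern = score_pairs("numeric_pairs.jsonl", my_verifier)
```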
What we’re doing. We are training a small numeric-precision classifier specifically for this case, separate from the general entailment verifier. The classifier reads only the numeric claims in source and draft and grades them against a stricter rubric: exact equality required unless the source explicitly authorizes rounding. It is in evaluation; not yet shipped.
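To make the rubric concrete, here is a sketch of it as code. The regex, the unit table, and the function names are ours for illustration; the real classifier is a trained model, and this is the rule it is graded against, not its implementation. The unit table uses the binary convention so that 4 Gbps and 4,096 Mbps normalize to the same value, matching the example above; the convention itself is a policy choice.

```python
import re

# Illustrative unit table, binary convention (1 Gbps = 1,024 Mbps) to match
# the example above. The real classifier is a trained model; this function
# encodes the rubric it is graded against.
UNIT_FAMILY = {"%": ("pct", 1.0), "mbps": ("rate", 1.0), "gbps": ("rate", 1024.0)}
ROUNDING_CUES = ("approximately", "around", "roughly", "about")
NUM = re.compile(r"([\d,]+(?:\.\d+)?)\s*(%|mbps|gbps)", re.IGNORECASE)

def numeric_claims(text):
    """Extract numbers as (unit family, base-unit value) pairs so that
    "4Gbps" and "4,096 Mbps" normalize to the same claim."""
    claims = []
    for raw, unit in NUM.findall(text):
        family, factor = UNIT_FAMILY[unit.lower()]
        claims.append((family, float(raw.replace(",", "")) * factor))
    return claims

def rubric_verdict(draft, source, tolerance=0.001):
    """Strict rubric: every number in the draft must equal a source number
    exactly (after unit normalization), unless the source explicitly
    authorizes rounding, in which case a tight relative tolerance applies.
    Whether a rounding cue also licenses *extra* precision in the draft
    ("approximately 99.9%" drafted as "99.94%") is a policy call, not
    settled by this sketch."""
    source_claims = numeric_claims(source)
    rounding_ok = any(cue in source.lower() for cue in ROUNDING_CUES)
    for family, value in numeric_claims(draft):
        same = [v for f, v in source_claims if f == family]
        exact = any(value == v for v in same)
        near = any(abs(value - v) <= tolerance * max(abs(v), 1.0) for v in same)
        if not (exact or (rounding_ok and near)):
            return "fail"
    return "pass"
```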
Until it ships, we tell customers in onboarding that numeric facts in proposals should be reviewed by a human. The system catches most numeric drift and surfaces the source for human verification, but the residual error rate is high enough that we don’t claim it as solved.
Failure 2 — Compound claims across blocks
The issue: a drafted sentence sometimes legitimately rests on two retrieved blocks, but our entailment verifier checks entailment against only one block at a time.
Example. The drafted sentence reads “Acme is SOC 2 Type II certified, audited annually by [auditor].” The retrieved blocks contain:
- Block A: “Acme’s SOC 2 Type II report is renewed annually.”
- Block B: “Acme’s SOC 2 audits have been conducted by [auditor] since 2022.”
Either block alone does not entail the full drafted sentence. Block A doesn’t name the auditor. Block B doesn’t establish that the audit is annual. Together they do.
Our current verifier checks the full sentence against each block independently and passes only if some single block entails it on its own. The compound case fails the verifier even though the underlying claim is well-grounded.
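In code terms, the current behavior is a per-block loop. The `entails` callable below is a hypothetical wrapper around the verifier model:

```python
def verdict_single_block(sentence, blocks, entails):
    """Current behavior: the sentence passes only if some single block
    entails it on its own. `entails(block, sentence)` is the model-backed
    check. A well-grounded compound like the SOC 2 example fails here,
    because no individual block carries the whole claim."""
    return "pass" if any(entails(b, sentence) for b in blocks) else "fail"
```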
We mitigate by retrieving generously — when a question pulls multiple relevant blocks, we pass them all to the drafter. The drafter often avoids the problem by paraphrasing the claim down to what a single block supports, but sometimes it produces the compound naturally and we have to refuse.
The test cases. Our compound-claim test set has 180 pairs. The verifier currently fails 31% of true compounds (rejects when it should accept) and accepts 4% of false compounds (accepts when it shouldn’t). The asymmetry is intentional — we’d rather over-refuse than under-refuse, given the Pledge — but the over-refusal rate is high enough to be visible to customers.
What we’re doing. We have a multi-block entailment branch in research. The implementation reads two or three blocks at once and asks whether the drafted sentence is entailed by their union. The technical work is ranking which combinations to check (combinatorics get ugly past four blocks) and avoiding the obvious failure mode where the model “entails” a sentence by stitching together unrelated phrases. Not yet in production.
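Here is a sketch of the shape that branch takes, with everything invented for illustration: `entails` is the same model-backed check as above, the placeholder ranking stands in for the combination-scoring that is the actual research problem, and the caps are where the combinatorics comment bites.

```python
from itertools import combinations

def verdict_multi_block(sentence, blocks, entails, max_blocks=3, max_checks=10):
    """Research branch: try single blocks first, then small unions.
    Capping union size and the number of checks keeps the combinatorics
    sane; ranking which combinations to try first is the open problem."""
    candidates = []
    for k in range(1, min(max_blocks, len(blocks)) + 1):
        candidates.extend(combinations(blocks, k))
    # Placeholder ranking: prefer smaller unions. A real ranker would score
    # combinations by retrieval relevance before spending model calls.
    candidates.sort(key=len)
    for combo in candidates[:max_checks]:
        if entails("\n".join(combo), sentence):
            return "pass"
    # The stitching failure mode (entailing a sentence by gluing unrelated
    # phrases across blocks) has to be guarded in the entails() prompt and
    # model, not in this loop.
    return "fail"
```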
Failure 3 — Synonym and tense drift
The issue: the verifier handles common paraphrase well but misses corner cases that read identically to a human.
Examples we’ve collected from real proposals:
- Source: “Encryption keys are rotated every 90 days.” Draft: “Acme rotates encryption keys quarterly.” These are equivalent — quarterly is 90 days — but the verifier sometimes flags the draft as ungrounded because the literal numeric anchor isn’t in the draft.
- Source: “All data is encrypted at rest using AES-256.” Draft: “Customer data is encrypted at rest with AES-256 encryption.” These are equivalent. The verifier passes this one. But change it to: “Acme encrypts all customer data at rest, with industry-standard encryption.” The verifier may or may not pass — “industry-standard” is not a synonym for AES-256, even though most readers would accept it as one.
- Source: “SOC 2 Type II compliance is maintained annually.” Draft: “Acme is SOC 2 Type II compliant, with annual audits.” Tense and voice shift. Verifier handles this consistently.
- Source: “Acme’s data centers are located in US-East and EU-West.” Draft: “Acme operates from data centers in North America and Europe.” The draft generalizes — US-East is in North America, but the generalization is a one-way entailment. The verifier passes this. We’re not sure it should.
The HN thread on RAG hallucinations linked from this post covers some of the same ground: span-level verification has accuracy gaps that are not closed by larger models or stricter prompts. They are closed, slowly, by tighter test sets and explicit handling for known edge cases.
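One concrete form that “explicit handling for known edge cases” takes is a canonicalization pass over known-equivalent phrasings before the model sees either side. A minimal sketch, assuming a hand-maintained table: the quarterly/90-day entry comes from the example above, and everything else here is illustrative.

```python
# Illustrative canonicalization table for known-equivalent phrasings.
# Entries are added only as edge cases are confirmed, never inferred:
# the point is to hard-code equivalences the model gets wrong, like the
# quarterly / every-90-days pair from the first example above.
KNOWN_EQUIVALENTS = {
    "quarterly": "every 90 days",
}

def canonicalize(text: str) -> str:
    """Rewrite known-equivalent phrasings to one canonical form so the
    entailment check compares like with like."""
    lowered = text.lower()
    for phrase, canonical in KNOWN_EQUIVALENTS.items():
        lowered = lowered.replace(phrase, canonical)
    return lowered

# Both sides are canonicalized before the verifier runs:
# entails(canonicalize(source_block), canonicalize(draft_sentence))
```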
The test cases. Our paraphrase-and-tense test set is the largest of the three — 612 pairs. The verifier hits 94% accuracy. The 6% miss is roughly even between false rejections and false acceptances, which makes it harder to bias-correct than the numeric set.
What we’re doing. Three things, in priority order. First, we add to the test set every time a customer flags a verifier disagreement (the disagreement form is in the review UI). Second, we publish the test set internally and run the verifier against it on every model upgrade — we caught a regression last quarter when a Claude version change moved the false-acceptance rate up by a percentage point. Third, we are evaluating a different verifier model that’s been tuned for the proposal domain specifically, not a general entailment model. Initial numbers are promising but it’s not in production.
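The second item is mechanically a regression gate: re-run the set on the new model, compare error rates against the previous baseline, and block the upgrade if the false-acceptance rate rises. A sketch, with the threshold and count labels invented for illustration:

```python
def regression_gate(results, baseline, max_fa_increase=0.005):
    """Block a model upgrade if the false-acceptance rate rises.

    `results` and `baseline` map verdict labels to counts, e.g.
    {"false_accept": 9, "false_reject": 12, "correct": 591}. The label
    names and threshold are illustrative; the real cutoff is a policy
    decision, not shown here.
    """
    def fa_rate(counts):
        total = sum(counts.values())
        return counts.get("false_accept", 0) / total if total else 0.0

    delta = fa_rate(results) - fa_rate(baseline)
    if delta > max_fa_increase:
        raise SystemExit(
            f"verifier regression: false-acceptance rate up {delta:.2%} "
            "on the paraphrase set; upgrade blocked"
        )
```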
Why we publish this
There is a temptation, when you make a strong promise, to keep the failure modes private. The category we work in is full of vendors who do that — every AI proposal product on the market has known failure modes, and few of them publish anything about what those failure modes are.
Our position is that publishing the failures is part of what makes the Pledge credible. If a customer reads the Pledge and asks “where does this break?”, we want them to be able to read this post — and the rest of the Grounded Retrieval 101 series — and form a real opinion. The Stanford HAI study on legal RAG (linked) showed that vendors who claim “hallucination-free” without disclosing failure modes hallucinate at rates between 17% and 33%. We do not want to be on that list.
The honest claim about PursuitAgent’s grounded retrieval is: the system refuses to draft when it cannot entail. The verifier that decides whether it can entail is good and getting better. It has three known failure surfaces — numeric precision, compound claims, and synonym drift — and we have public test sets for each.
A customer who needs a stricter guarantee than that should not buy the product. A customer who is willing to operate inside the failure surfaces and use human review where the system flags low confidence is buying something that, as far as we can tell from our own measurements and from the public literature on retrieval evaluation, is a meaningful step ahead of the category’s incumbents.
What’s next
The Grounded Retrieval 101 series ends here. The next series in the engineering pillar — provisionally titled “Retrieval Evaluation in Production” — picks up where this leaves off. It’s about how we run the test sets continuously, what we do when a model upgrade changes the numbers, and how we surface verifier disagreements back into the data we train on. Targeting July.