The hallucination budget, per claim
Treat hallucination as a cost: each claim in a draft has a probability of being mis-attributed. Here's how we budget it, how we trade latency against grounding strength, and why the budget is per-claim, not per-draft.
Hallucination is a cost. A drafted claim has a probability of being mis-attributed to a source that doesn’t actually support it. Stanford HAI’s 2024 study put that probability between 17% and 33% across commercial legal RAG tools, even with retrieval. That’s the working number for what an unbudgeted system ships.
You can spend engineering effort to lower the probability. Better retrieval, stricter rewrite prompts, claim-level verification, multi-pass entailment, human-in-the-loop reviews. Each of those costs latency, dollars, or both. The question is how much you’re willing to pay to lower the probability by how much, on which claims.
That tradeoff is what we mean by a hallucination budget. It is per-claim, not per-draft, and the per-claim part is load-bearing.
Why per-claim and not per-draft
A draft has many claims. Some are low-stakes — “PursuitAgent supports SAML 2.0” — and the source block either says it or doesn’t. Some are high-stakes — “PursuitAgent meets the FedRAMP Moderate baseline” — and the cost of mis-attribution is the deal, possibly the company.
A per-draft budget — “this entire draft has at most a 3% chance of containing a hallucination” — averages across all claims in the draft. That is the wrong abstraction. The buyer doesn’t grade your draft as a whole; they grade specific claims. A draft with a 1% hallucination rate that hallucinates the FedRAMP claim is worse, contractually, than a draft with a 5% rate that hallucinates a low-stakes one.
Per-claim budgeting acknowledges this. Each claim gets a budget that depends on its risk class, and the system allocates engineering effort accordingly.
Risk classes
We classify claims into four risk tiers. Tier assignment is automatic at extraction time, with a manual override path for content owners.
| Tier | Examples | Budget |
|---|---|---|
| 1 — Existential | Compliance certifications, regulatory claims, named contractual commitments | Effectively zero — refuse below the floor |
| 2 — Material | Specific product capabilities, named integrations, pricing claims | Single-digit basis points |
| 3 — Standard | General product description, architectural shape, customer count ranges | Sub-1% |
| 4 — Low-stakes | Tone, style, transitional sentences | Bounded but not zero-tolerance |
The tier maps directly to which gates run on the claim. A Tier 1 claim runs through retrieval-floor + rewrite + claim-level verification + multi-block entailment + a manual review flag. A Tier 4 claim runs through retrieval-floor + rewrite + claim-level verification only.
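In code, that mapping is roughly a lookup from tier to gate stack. A minimal sketch: the gate names are shorthand for the gates above, and the Tier 2 and 3 stacks are assumptions, since the text only pins down Tiers 1 and 4.

```python
from enum import IntEnum

class Tier(IntEnum):
    EXISTENTIAL = 1   # compliance, regulatory, contractual
    MATERIAL = 2      # capabilities, integrations, pricing
    STANDARD = 3      # general description, architecture
    LOW_STAKES = 4    # tone, transitions

# Every claim runs the base stack; higher-stakes tiers add heavier
# checks on top. Tier 2 and 3 stacks are assumed, not from the post.
BASE_GATES = ["retrieval_floor", "rewrite", "claim_verification"]

GATE_STACKS = {
    Tier.EXISTENTIAL: BASE_GATES + ["multi_block_entailment", "manual_review_flag"],
    Tier.MATERIAL: BASE_GATES + ["multi_block_entailment"],
    Tier.STANDARD: BASE_GATES,
    Tier.LOW_STAKES: BASE_GATES,
}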
The budget on each tier is tuned, not assumed. We benchmark against held-out RFP corpora with human-graded entailment labels. The tier floors get adjusted when the eval surfaces drift.
The tradeoffs the budget surfaces
Three live tradeoffs.
Latency vs. grounding strength
Multi-block entailment lowers the probability of false positives on compound claims: a sentence asserting two facts can be half-supported by a single block and still pass a lenient check, while verifying against multiple blocks lets the system demand support for each part. It also adds latency, since the verifier runs against multiple source blocks per claim. Every additional block increases recall on entailment but burns the latency budget.
For Tier 1 claims, we run multi-block. The latency cost is paid because the claim is high-stakes. For Tier 4 claims, we don’t — the verifier runs against the single source block and accepts the false-positive risk on compound claims.
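Schematically, the two paths differ in how many verifier calls they spend. A sketch, assuming the compound claim has already been split into sub-claims upstream, with `verify_entailment` standing in for whatever entailment model actually runs:

```python
def verify_tier4(claim: str, blocks: list[str], verify_entailment) -> bool:
    # Single-block path: one verifier call against the top-ranked block.
    # A compound claim half-supported by this block can slip through.
    return verify_entailment(claim, blocks[0])

def verify_tier1(sub_claims: list[str], blocks: list[str],
                 verify_entailment, max_blocks: int = 3) -> bool:
    # Multi-block path: every component of the compound claim must be
    # entailed by at least one of the top-k blocks. Worst case is
    # len(sub_claims) * max_blocks verifier calls: that is the latency.
    return all(
        any(verify_entailment(sc, block) for block in blocks[:max_blocks])
        for sc in sub_claims
    )
```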
Cost vs. coverage
Larger verifier models entail more accurately. They also cost more per call. A frontier-model verifier run on every claim in every draft would multiply our per-draft cost by 5-10x. We use a smaller specialized verifier on most claims and reserve the frontier model for Tier 1 claims that the smaller verifier flagged as borderline.
This is a tiered pipeline, not a single threshold. The smaller verifier returns confidence; below a floor it refuses; in a borderline band it escalates to the larger model; above the upper floor it accepts. The escalation rate is the cost lever.
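A sketch of that band logic. The thresholds here are illustrative; the real floors are tuned per tier against the eval described above.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    decision: str       # "accept" or "refuse"
    confidence: float
    escalated: bool = False

REFUSE_BELOW = 0.30     # illustrative floor, not a production value
ACCEPT_ABOVE = 0.90     # illustrative ceiling

def route(claim: str, block: str, small_verifier, frontier_verifier) -> Verdict:
    conf = small_verifier(claim, block)   # cheap model runs on every claim
    if conf < REFUSE_BELOW:
        return Verdict("refuse", conf)
    if conf >= ACCEPT_ABOVE:
        return Verdict("accept", conf)
    # Borderline band: escalate to the frontier model. The width of
    # this band is the cost lever.
    conf = frontier_verifier(claim, block)
    decision = "accept" if conf >= ACCEPT_ABOVE else "refuse"
    return Verdict(decision, conf, escalated=True)
```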
Refusal rate vs. user friction
The strictest possible system refuses anything not perfectly entailed. Users hate it. They see refusal after refusal on sentences that obviously paraphrase the source, get tired, and start manually overriding the system, which defeats the budget the system was supposed to enforce.
A more permissive system passes more claims and ships more hallucinations. Users initially love it (more drafts shipped) and then learn to distrust it.
We tune the refusal rate against user-perceived correctness on real drafts, not against pure entailment metrics. Refusal that surfaces a clear next-step (add a KB block, escalate, accept partial) is OK. Refusal that surfaces nothing actionable is not.
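At the data level, "actionable" just means every refusal carries one of those three next steps. A hypothetical shape, not our actual API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical refusal payload. The three next_step values mirror the
# options above: add a KB block, escalate, or accept the entailed part.
@dataclass
class Refusal:
    claim: str
    reason: str
    next_step: str                      # "add_kb_block" | "escalate" | "accept_partial"
    partial_text: Optional[str] = None  # the entailed fragment, if accept_partial

# The invariant: every refusal must populate next_step. Refusals that
# surface nothing actionable are the ones users learn to override.
```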
What the budget is, mathematically
For a single claim in a draft session, the residual hallucination probability after the gates is approximately:
P(hallucinated_claim) ≈ P(retrieval_returned_wrong_block)
                      × P(rewrite_drifted_from_source | wrong_block)
                      × P(verifier_passed | drift)
Each gate is a multiplier. Improving any gate improves the product. The budget per tier is the product floor we’re willing to ship at.
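With illustrative per-gate numbers (not measured rates), the composition looks like this; the point is only that the product, not any single gate, is what gets budgeted.

```python
# Hypothetical failure rates for the three gates on a Tier 2 claim.
p_wrong_block = 0.04                # retrieval returns a non-supporting block
p_drift_given_wrong = 0.15          # rewrite drifts from source, given the wrong block
p_verifier_pass_given_drift = 0.10  # verifier misses the drift

residual = p_wrong_block * p_drift_given_wrong * p_verifier_pass_given_drift
print(f"{residual:.4f}")  # 0.0006, i.e. 6 basis points: inside a
                          # single-digit-bps Tier 2 budget
```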
For Tier 1 we target a residual probability that's effectively zero: the gates compose so that borderline cases are refused, and the claims that do pass are wrong only as often as the underlying source is.
For Tier 4, we accept higher residual probability because the cost of refusing common transitional phrasing exceeds the cost of an occasional drift on a low-stakes sentence.
Stanford HAI’s 17-33% baseline is the unbudgeted system. A budgeted system with the gates we’ve described drives the high-tier rate substantially below that. The exact aggregate number we publish is a function of which tier mix we evaluate against, which is why our public numbers are tier-specific rather than aggregate.
Where the budget breaks
Three places.
Tier misclassification. A claim that should be Tier 1 but lands as Tier 3 gets the lighter gate stack and ships with higher residual probability than it should. The classifier is itself a model; it has an error rate. We monitor the classifier's drift on held-out labeled examples and run periodic human-in-the-loop calibration.
Compound claims that span tiers. A sentence with a Tier 1 claim and a Tier 4 claim — “PursuitAgent meets FedRAMP Moderate and provides an intuitive user interface” — defaults to the strictest tier on the sentence as a whole. This is occasionally over-strict and surfaces refusals on what would be a fine sentence with the soft claim trimmed. The drafting prompt is tuned to avoid stacking high-tier and low-tier claims in the same sentence; it doesn’t always work.
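The default itself is trivial: a min over the claim tiers in the sentence, since Tier 1 is the strictest in the numbering above.

```python
def sentence_tier(claim_tiers: list[int]) -> int:
    # A sentence inherits its strictest claim's tier. Lower number
    # means stricter in the four-tier scheme above.
    return min(claim_tiers)

# The FedRAMP + intuitive-UI sentence from above: Tier 1 wins,
# so the whole sentence runs the full Tier 1 gate stack.
assert sentence_tier([1, 4]) == 1
```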
Adversarial source. A KB block that is itself wrong — outdated certification, retired feature, copied from a deprecated document — passes entailment cleanly because the verifier checks the draft against the block, not the block against the world. Hallucination budgeting does not solve KB freshness. KB freshness is its own problem; we wrote about it in the chunking pipeline post and it’s a recurring topic on the engineering side of the blog.
The takeaway
Treating hallucination as a budget — per claim, by tier, with tunable gates — moves the question from "is your AI grounded?" to "at what residual probability, on which kinds of claims, with what latency cost?" That second question has answers. The first one mostly has marketing.
The Pledge sets the contract. The budget sets the engineering targets. The verifier post from last week is the gate the budget pays for. Three layers, one promise: the ungrounded claim doesn’t ship.