Field notes

The claim-verification cost profile, stage by stage

Per-claim verification is the defense against citation hallucination. It also costs real money. A breakdown of token costs at each stage of the verification pipeline, with the numbers we actually see in production.

Grounded AI without per-claim verification is AI that cites sources it cannot defend. The Stanford HAI study on legal RAG tools is the honest benchmark: 17-33% hallucination rates with retrieval in place. The fix is a verification pass that asks, for each factual claim in the draft, whether the cited source actually supports it.

This post is the cost profile of that verification pass. Token counts, per-stage breakdown, and the numbers we see on a typical 40-page RFP response in production as of February 2026.

A note on the numbers. Costs below are directional. They come from our internal telemetry on our own production traffic at current provider pricing. Exact token counts, per-claim costs, and total per-response costs depend on workload mix, model pricing at the time of the call, and our specific prompt structure — they are not a benchmark. The citation-supported-by-source rate below comes from our internal eval harness and would not be expected to reproduce on an external setup. Read the shape (which stage dominates, how claim density moves the bill) as the durable content; treat the specific dollar figures as a snapshot.

The pipeline

Four stages, in order:

  1. Claim extraction. Parse the draft, segment it into atomic factual claims.
  2. Evidence retrieval. For each claim, pull the cited source span plus a small neighborhood of nearby text.
  3. Entailment check. Ask a model: does the evidence support this claim?
  4. Repair pass. For claims that fail entailment, regenerate with a tighter retrieval or flag for human review.

We wrote about the mechanism in our post on the claim-level verification pass. This post is the accounting.
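To keep the stage-by-stage sketches below concrete, here is a minimal data model for what flows between stages. It is a sketch only; the field names are illustrative, not our production schema.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One atomic factual claim extracted from a drafted section."""
    text: str                   # the claim sentence as written in the draft
    section_id: str             # which response section it came from
    span: tuple[int, int]       # character offsets into the section text
    citation_id: str            # id of the source the draft cites for this claim
    high_stakes: bool = False   # compliance, security posture, past performance


@dataclass
class Verdict:
    """Entailment result for one claim-evidence pair."""
    claim: Claim
    evidence: list[str]         # retrieved source spans
    label: str                  # "supported" | "partially_supported" | "not_supported"
    rationale: str              # one-sentence model rationale
    model: str                  # which model produced the verdict
```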

Stage 1 — Claim extraction

Input: one drafted response section, typically 600-1200 words. Output: a JSON array of claims with span offsets.

We use a small, fast model (Claude Haiku 4.5 in most cases, Gemini 3 Flash for very long sections). The prompt is roughly 300 tokens of instructions plus the section text; output is 200-400 tokens per 1,000 words of input.
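A sketch of the extraction call, reusing the Claim record above. The prompt wording, the `complete()` helper, and the model label are stand-ins for our production prompt and provider SDK, not the real thing.

```python
import json

EXTRACTION_INSTRUCTIONS = """\
Segment the section below into atomic factual claims.
Return a JSON array; each element must have:
  "text":  the claim, quoted verbatim from the section,
  "start": character offset where the claim begins,
  "end":   character offset where the claim ends,
  "citation_id": the citation marker attached to the claim, or null.
Return only the JSON array."""


def extract_claims(section_id: str, section_text: str, complete) -> list[Claim]:
    """Stage 1: one small-model call per drafted section.

    `complete(prompt, model)` is a thin wrapper around the provider SDK;
    it returns the model's text output.
    """
    prompt = f"{EXTRACTION_INSTRUCTIONS}\n\n<section>\n{section_text}\n</section>"
    raw = complete(prompt, model="haiku-tier")  # small, fast model, per the text above
    return [
        Claim(
            text=item["text"],
            section_id=section_id,
            span=(item["start"], item["end"]),
            citation_id=item.get("citation_id") or "",
        )
        for item in json.loads(raw)
    ]
```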

Cost per section: approximately $0.004-$0.008 at current pricing (February 2026). For a 40-page response averaging 12,000 words, we run extraction on 15-20 sections. Total: $0.06-$0.16 per response.

This is the cheapest stage. It stays cheap because Haiku-tier models are good enough at claim segmentation when the prompt is structured. We tried running this with larger models early on; the quality lift was marginal and the cost was roughly 5x higher.

Stage 2 — Evidence retrieval

Input: the extracted claims. Output: for each claim, the cited source span plus 1-2 neighboring chunks for context.

Retrieval is embedding-based and runs against the same index the draft was generated from. Cost is in embedding calls for any claim text we need to re-embed for similarity checks, plus the vector DB query cost.
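A minimal version of the retrieval step, assuming psycopg and pgvector. The table and column names are illustrative, and `embed()` stands in for whatever embedding client is in use (here it is assumed to return a numpy vector so pgvector can adapt it).

```python
import psycopg
from pgvector.psycopg import register_vector


def retrieve_evidence(claim: Claim, embed, conn: psycopg.Connection, k: int = 3) -> list[str]:
    """Stage 2: embed the claim text and pull the nearest chunks of the cited source."""
    register_vector(conn)
    vec = embed(claim.text)  # numpy vector from the embedding model
    rows = conn.execute(
        """
        SELECT chunk_text
        FROM source_chunks
        WHERE document_id = %s       -- restrict to the source the draft cited
        ORDER BY embedding <=> %s    -- cosine distance over the HNSW index
        LIMIT %s
        """,
        (claim.citation_id, vec, k),
    ).fetchall()
    return [chunk for (chunk,) in rows]
```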

Per-claim cost: approximately $0.00002 in embedding tokens plus $0.0001 per DB query (pgvector, HNSW index, self-hosted). For a response with 150 claims: $0.015-$0.02 per response.

This stage is effectively free at our volume. The cost shows up only on very large responses where claim counts cross 500.

Stage 3 — Entailment check

The expensive stage. For each claim-evidence pair, we prompt a reasoning-capable model with the claim, the evidence, and a structured instruction: does the evidence support the claim? Return supported / partially supported / not supported, plus a one-sentence rationale.

Input per claim: 400-600 tokens (instructions plus claim plus evidence span of 200-400 tokens). Output per claim: 80-120 tokens.

We run this on Claude Sonnet 4.6 for most claims. For high-stakes claims — compliance assertions, security posture, past performance — we run a second pass on Claude Opus 4.5. The second pass catches about 8-12% more entailment failures; at nearly 4x the cost per claim, the math works out because those claims carry the highest risk of reviewer rejection.
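A sketch of the tiered entailment pass, continuing the helpers above. The prompt, the model labels, and the exact routing policy for the second pass are simplified stand-ins for the production versions.

```python
ENTAILMENT_INSTRUCTIONS = """\
Decide whether the evidence supports the claim.
Return JSON: {"label": "supported" | "partially_supported" | "not_supported",
              "rationale": "<one sentence>"}"""


def check_entailment(claim: Claim, evidence: list[str], complete, model: str) -> Verdict:
    """One claim-evidence entailment call."""
    joined = "\n---\n".join(evidence)
    prompt = (
        f"{ENTAILMENT_INSTRUCTIONS}\n\n"
        f"<claim>{claim.text}</claim>\n"
        f"<evidence>\n{joined}\n</evidence>"
    )
    result = json.loads(complete(prompt, model=model))
    return Verdict(claim=claim, evidence=evidence,
                   label=result["label"], rationale=result["rationale"], model=model)


def verify(claim: Claim, evidence: list[str], complete) -> Verdict:
    """Stage 3: baseline pass for every claim; second pass on the larger
    model only for high-stakes claims, keeping the stricter verdict."""
    verdict = check_entailment(claim, evidence, complete, model="sonnet-tier")
    if claim.high_stakes:
        second = check_entailment(claim, evidence, complete, model="opus-tier")
        if second.label != "supported":
            verdict = second
    return verdict
```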

Per-claim cost on Sonnet: approximately $0.004. Per-claim on Opus: approximately $0.015. For a response with 150 claims, 20 of which get the Opus second pass: $0.60 on Sonnet + $0.30 on Opus = $0.90 per response.

This is 75-85% of the verification-pipeline cost.

Stage 4 — Repair pass

For claims that fail entailment, we run one of two repairs. Cheap repair: retrieve again with tighter query, re-ask the model to draft the sentence using only the re-retrieved evidence. Expensive repair: flag for human reviewer with a structured prompt (claim text, cited source, entailment verdict, suggested rewrite).
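The repair routing, sketched with the same helpers; `review_queue` is a stand-in for however flagged claims reach a human reviewer in the product.

```python
def repair(verdict: Verdict, embed, conn, complete, review_queue: list) -> Verdict:
    """Stage 4: cheap repair first, escalate to a human if the rewrite still fails."""
    if verdict.label == "supported":
        return verdict

    # Cheap repair: re-retrieve with a tighter query (the claim text itself,
    # fewer neighbors), then regenerate the sentence from that evidence only.
    evidence = retrieve_evidence(verdict.claim, embed, conn, k=2)
    joined = "\n---\n".join(evidence)
    rewrite = complete(
        "Rewrite the sentence so it is fully supported by the evidence, and by "
        f"nothing else.\n<sentence>{verdict.claim.text}</sentence>\n"
        f"<evidence>\n{joined}\n</evidence>",
        model="sonnet-tier",
    )
    repaired = Claim(text=rewrite, section_id=verdict.claim.section_id,
                     span=verdict.claim.span, citation_id=verdict.claim.citation_id,
                     high_stakes=verdict.claim.high_stakes)
    recheck = check_entailment(repaired, evidence, complete, model="sonnet-tier")

    if recheck.label != "supported":
        # Expensive repair: structured hand-off to a human reviewer.
        review_queue.append({
            "claim": verdict.claim.text,
            "cited_source": verdict.claim.citation_id,
            "verdict": verdict.label,
            "suggested_rewrite": rewrite,
        })
    return recheck
```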

About 5-10% of claims trigger repair in production. Of those, about 70% succeed on the cheap repair; the remaining 30% go to a human reviewer.

Cheap repair cost per claim: approximately $0.006 (generation on Sonnet, one round). The expensive repair carries no inference cost; human reviewer time is accounted for elsewhere.

For a response with 150 claims, 12 repairs, 8 of which are cheap: $0.05 per response.

Total cost per response

Stage 1 (claim extraction): $0.06-$0.16
Stage 2 (evidence retrieval): $0.015-$0.02
Stage 3 (entailment check): $0.60-$0.90
Stage 4 (repair pass): $0.05

Total: $0.73-$1.13 per 40-page response.

For comparison, the draft-generation pass itself costs $4-$7 for a response of this size. Verification adds 15-25% to total inference cost.
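To make the roll-up easy to replay, a back-of-envelope estimator built from the per-unit figures above. The constants are this post's snapshot numbers, so treat the output as directional, not a quote.

```python
def verification_cost(n_claims: int,
                      n_sections: int = 18,
                      opus_fraction: float = 20 / 150,
                      repair_rate: float = 0.08,
                      cheap_repair_share: float = 0.7) -> float:
    """Rough per-response verification cost in dollars at the snapshot rates above."""
    extraction = n_sections * 0.006                          # Stage 1: ~$0.004-0.008 per section
    retrieval = n_claims * 0.00012                           # Stage 2: embeddings + DB query
    entailment = n_claims * 0.004                            # Stage 3: Sonnet pass on every claim
    entailment += n_claims * opus_fraction * 0.015           #   ...plus the Opus second pass
    repairs = n_claims * repair_rate * cheap_repair_share * 0.006   # Stage 4: cheap repairs only
    return extraction + retrieval + entailment + repairs


print(round(verification_cost(150), 2))   # ~1.08, inside the $0.73-$1.13 range above
print(round(verification_cost(450), 2))   # ~3.01: a claim-dense DDQ costs roughly 2.8x as much
```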

What moves the number

Three variables dominate.

Claim density. A compliance-heavy response has more claims per word than a narrative one. A DDQ with 250 questions and 250 short answers has 400-500 claims; a 40-page sales narrative has 120-180. DDQs cost 2-3x more to verify, in our data.

Claim complexity. “We are SOC-2 Type II certified” is a one-sentence entailment check against a well-structured evidence chunk. “Our past performance on federal IT modernization has resulted in a 34% reduction in help-desk ticket volume across three agencies” is a multi-predicate claim that requires evidence for each predicate. Complex claims cost 3-5x per verification.

Model tier for the second pass. Running every claim on Opus would take verification cost from ~$1 to ~$4 per response. Running zero claims on Opus saves cost but loses the 8-12% lift on high-stakes claims. Our current split (20 of 150 on Opus) costs $0.30 incremental and catches the kinds of errors that would survive the Sonnet pass and ship to the buyer.

What we’re trading

Verification adds 15-25% to inference cost on top of drafting alone. It also adds 30-45 seconds to response latency on a typical draft. In exchange, our internal eval harness shows near-ceiling citation-supported-by-source rates on our corpora (above 0.95) — a meaningful gap from the 67-83% range the Stanford paper found in commercial legal RAG tools, though the comparison is directional: our eval, our corpora; their eval, their corpora.

We think that trade is correct for this category. Regulated buyers do not accept 20% fabrication rates; proposal teams get disqualified when reviewers catch invented claims. For a commodity-bid tool, the math might run differently. For ours, verification is not an optimization target — it is a product requirement.

Where we want to go

Two optimizations on the roadmap:

  • Selective verification. Not every claim needs Opus. We’re training a small classifier that predicts which claims are high-risk, so we can route only those to the expensive pass. Expected: 40-60% cost reduction on Stage 3.
  • Evidence caching. Many claims across sections cite the same source spans. We cache the evidence retrieval; we don’t yet cache the entailment result. Early experiments show ~15% reduction when caching is on; a sketch of what an entailment-result cache could look like follows this list.
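One plausible design for that cache, continuing the sketches above: key the verdict on the claim text plus the retrieved evidence, so a claim repeated across sections against the same source spans is checked once. This is a sketch of the idea, not the shipped implementation.

```python
import hashlib


def entailment_cache_key(claim_text: str, evidence: list[str]) -> str:
    """Stable key over the (claim, evidence) pair."""
    h = hashlib.sha256(claim_text.encode("utf-8"))
    for span in evidence:
        h.update(b"\x00")
        h.update(span.encode("utf-8"))
    return h.hexdigest()


_verdict_cache: dict[str, Verdict] = {}


def cached_verify(claim: Claim, evidence: list[str], complete) -> Verdict:
    """Wrap the Stage 3 pass so identical claim-evidence pairs hit the model once."""
    key = entailment_cache_key(claim.text, evidence)
    if key not in _verdict_cache:
        _verdict_cache[key] = verify(claim, evidence, complete)
    return _verdict_cache[key]
```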

We’ll publish the results when the experiments land. For now, the cost is what it is, and we think it’s worth it.

For more on cost accounting across the whole response pipeline, see our cost per response breakdown and RAG cost control budgets posts.

Posts bylined to “The PursuitAgent engineering team” are written by the people building the product. Cost figures reflect pricing and configuration as of February 2026.

Sources

  1. Claim-level verification pass (PursuitAgent)
  2. Cost per response breakdown (PursuitAgent)
  3. RAG cost control budgets (PursuitAgent)
  4. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
  5. Anthropic — Claude Haiku 4.5
  6. Anthropic — Claude release notes
  7. Google — Gemini 3