Field notes

Numeric claim extraction and verification

How we parse numbers from drafts — percentages, dollar figures, head counts, dates — and check each one against a KB source before the sentence ships. The pipeline, the regex floor, the LLM ceiling, and what we still get wrong.

The PursuitAgent engineering team · 6 min read · Grounded AI

Numbers are the easiest claims to fabricate and the hardest to forgive. A drafted sentence that says “Acme Corp serves 230 customers” is wrong in a way that a vague sentence about “many customers” cannot be. Evaluators check numbers. Compliance reviewers check numbers. The buyer’s procurement office, in a federal pursuit, will sometimes check numbers against your filed past performance data and disqualify the bid if they don’t reconcile.

The Stanford HAI paper on commercial legal RAG tools showed hallucination rates of 17% to 33% even with retrieval grounding. Numeric hallucinations are a meaningful subset of that surface area. AutogenAI’s breakdown lists “fabricated statistics” as one of the three failure modes that pass review under deadline pressure.

This post is the engineering note on how we extract numeric claims from drafts and verify each one before the sentence ships.

What counts as a numeric claim

Three categories, in order of how often they appear in proposals:

  • Percentages. Performance metrics, market shares, customer satisfaction scores, accuracy claims.
  • Dollar figures. Revenue, ACV, contract sizes, savings claims, price points.
  • Counts and dates. Customer counts, employee counts, engineer counts, founding dates, contract durations, anniversary years.

A fourth category — units — is harder because the surface form is heterogeneous. “200 hours saved” is a numeric claim. So is “a 4x improvement.” So is “99.95% uptime.” We treat any token that contains a digit as a candidate, then filter against context.

The pipeline

Each drafted sentence passes through four stages:

  1. Extract. A regex floor pulls every digit-bearing span. A second pass uses a small LLM (Gemini 2.5 Flash) to convert each span into a structured claim: subject, predicate, value, unit, time qualifier. “Acme Corp serves 230 customers” becomes {subject: "Acme Corp", predicate: "customer count", value: 230, unit: "customers", time: "current"} (the claim shape is sketched after this list).
  2. Match. Each structured claim is matched against KB blocks the draft already cites. The matcher checks for the same predicate-unit-value tuple in the source span. A claim that matches a sourced span is verified.
  3. Lookup. A claim that doesn’t match its cited block is rerouted to a numeric KB lookup that searches across all approved blocks for a matching predicate-unit. If a different block has the matching value, the citation is updated to that block. If no block has the value, the claim is flagged.
  4. Flag or refuse. Flagged claims appear in the verify panel for the reviewer to handle. Numeric claims that the engine cannot ground at all are refused at draft time — the sentence does not ship until a human resolves it.
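
A sketch of the claim shape the extract stage produces and the later stages consume, using the field names from the Acme Corp example. The TypeScript names are ours; the production types may differ.

// The structured claim produced by the extract stage. Comments show the values
// from the "Acme Corp serves 230 customers" example.
interface NumericClaim {
  subject: string;      // "Acme Corp"
  predicate: string;    // "customer count"
  value: number;        // 230
  unit: string;         // "customers"
  time: string;         // "current", or a period such as "FY2024"
  source_span: string;  // the literal draft text the claim was extracted from
}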

The regex floor

The floor is deliberately permissive. A regex catches all digit spans, including dates (“2024”), version numbers (“v2.1”), section references (“3.4”), and ordinals (“1st”). Most of those are not numeric claims; they are references. The structuring step filters them.

The regex pattern (TypeScript) looks roughly like this:

const NUMERIC_SPAN = /(?:\$|€|£)?\s*\d+(?:[,.\s]\d{3})*(?:\.\d+)?\s*(?:%|x|×|k|m|bn|million|billion|customers|employees|hours|years|days)?/gi;

The pattern catches false positives. The structuring step is what makes the false-positive rate tolerable — Gemini 2.5 Flash is cheap, fast, and good at distinguishing a numeric claim (“230 customers” in the Acme Corp example) from a section reference (“Section 3.4”) in context.
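
For a feel of how permissive the floor is, here is what the pattern pulls out of a representative sentence. The helper and the example sentence are ours, for illustration only.

const extractSpans = (sentence: string): string[] =>
  (sentence.match(NUMERIC_SPAN) ?? []).map((span) => span.trim());

// extractSpans("In 2024, Acme Corp served 230 customers with 99.95% uptime (see Section 3.4).")
//   → ["2024", "230 customers", "99.95%", "3.4"]
// The structuring step keeps "230 customers" and "99.95%" as claims and returns the rest as references.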

The structuring step

The structuring prompt asks the model to return either a structured claim or a reference marker for each span. The return is a tagged JSON object so the downstream matcher can dispatch on type.

{
  "type": "claim",
  "subject": "Acme Corp",
  "predicate": "customer count",
  "value": 230,
  "unit": "customers",
  "time": "current",
  "source_span": "Acme Corp serves 230 customers"
}

A version-number span returns {type: "reference"}. A section-reference span returns {type: "reference"}. Only type: "claim" proceeds to matching.
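
In TypeScript terms, the tagged return is a small discriminated union. A sketch, reusing the NumericClaim shape from the pipeline section; the names are ours.

type StructuredSpan =
  | ({ type: "claim" } & NumericClaim)  // proceeds to matching
  | { type: "reference" };              // version numbers, section references, bare years

const claimsOnly = (spans: StructuredSpan[]): NumericClaim[] =>
  // Only type: "claim" survives this point; references are dropped.
  spans.filter((span): span is { type: "claim" } & NumericClaim => span.type === "claim");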

The LLM call is cached per (sentence, KB-version). Re-running verification on a sentence whose text and grounding are both unchanged costs nothing the second time.
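
A minimal sketch of that cache, assuming an in-memory map and a composite key. The store and names are ours (production presumably persists this), and StructuredSpan is the union sketched above.

const structuringCache = new Map<string, StructuredSpan[]>();

async function structureCached(
  sentence: string,
  kbVersion: string,
  structure: (s: string) => Promise<StructuredSpan[]>,  // the Gemini 2.5 Flash call, injected
): Promise<StructuredSpan[]> {
  const key = `${kbVersion}::${sentence}`;
  const hit = structuringCache.get(key);
  if (hit) return hit;                                   // unchanged draft + unchanged KB: free
  const spans = await structure(sentence);
  structuringCache.set(key, spans);
  return spans;
}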

The matcher

The matcher is structural, not semantic. It compares the structured claim to the structured claims pre-extracted from each cited source block. The pre-extraction happens at KB ingest time — every block has a numeric_claims field with the same shape. A draft claim that appears in the cited block’s numeric_claims is a verified claim.

This is fast and cheap. The LLM is not in the matching loop; the LLM did the structuring upfront. The matcher is a hash lookup.
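
A sketch of that lookup, reusing the NumericClaim shape from earlier. The KbBlock shape and the key format are ours, chosen to match the predicate-unit-value tuple described above.

// Each KB block carries the claims pre-extracted at ingest time.
interface KbBlock {
  id: string;
  numeric_claims: NumericClaim[];
}

// The tuple the matcher compares on: predicate, unit, value.
const claimKey = (c: NumericClaim): string =>
  `${c.predicate.toLowerCase()}|${c.unit.toLowerCase()}|${c.value}`;

// In practice this index would be built once at ingest time, not per check.
const blockIndex = (block: KbBlock): Set<string> =>
  new Set(block.numeric_claims.map(claimKey));

// No LLM in this loop: a draft claim is verified iff its tuple appears in a cited block.
const isVerified = (claim: NumericClaim, citedBlocks: KbBlock[]): boolean =>
  citedBlocks.some((block) => blockIndex(block).has(claimKey(claim)));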

The lookup step

When a draft sentence cites block A but the numeric claim does not appear in block A, the lookup searches all approved blocks for a matching predicate-unit. Three outcomes (a sketch of the dispatch follows the list):

  • A different block has the matching value. The citation is updated to that block, and the original block’s relevance to the sentence is flagged for the writer to review.
  • The same predicate-unit appears in multiple blocks with different values. The sentence is flagged with the conflict — the writer chooses which value is current, and the stale block is queued for KB review.
  • No block has the value. The sentence is refused at draft time. A reviewer sees the refusal in the verify panel along with the candidate blocks the engine considered.
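
A sketch of that dispatch, reusing the KbBlock, claimKey, and blockIndex helpers from the matcher sketch. The outcome names are ours, not the production verdicts.

type LookupOutcome =
  | { kind: "recited"; blockId: string }      // a different block grounds the exact value
  | { kind: "conflict"; blockIds: string[] }  // same predicate-unit, disagreeing values
  | { kind: "refused" };                      // nothing in the KB grounds the claim

function lookupAcrossKb(claim: NumericClaim, approvedBlocks: KbBlock[]): LookupOutcome {
  // Every approved block that speaks to the same predicate-unit, regardless of value.
  const samePredicateUnit = approvedBlocks.filter((block) =>
    block.numeric_claims.some(
      (c) =>
        c.predicate.toLowerCase() === claim.predicate.toLowerCase() &&
        c.unit.toLowerCase() === claim.unit.toLowerCase(),
    ),
  );
  // 1. A different block has the matching value: update the citation to it.
  const exact = samePredicateUnit.find((block) => blockIndex(block).has(claimKey(claim)));
  if (exact) return { kind: "recited", blockId: exact.id };
  // 2. The predicate-unit exists in the KB but the values disagree: flag the conflict.
  if (samePredicateUnit.length > 0)
    return { kind: "conflict", blockIds: samePredicateUnit.map((b) => b.id) };
  // 3. No block has the value: refuse at draft time.
  return { kind: "refused" };
}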

What we still get wrong

Three failure modes we have not eliminated.

Compound claims. “Acme Corp’s 230 customers across 18 countries generated 4.2 billion in pipeline last year” is three claims in one sentence. The structuring step extracts all three, but the matching step needs all three to verify against blocks before the sentence ships. If two claims verify and the third doesn’t, the sentence is flagged — but the writer’s natural fix is to delete the unverified clause, which can change the sentence’s meaning. We have not yet built a good UI for partial-verification edits.

Time-qualified claims. “Acme Corp’s customer count grew 40% year-over-year” is a derived claim. The KB has an absolute customer count, not a year-over-year rate. The matcher doesn’t know to compute the derivative. For now, derived claims fail to verify and the writer is asked to either rewrite the claim against an absolute number or to add a derived-metrics block to the KB. We have a project planned to compute simple derivatives at lookup time but haven’t shipped it.

Unit normalization. “$4.2B” and “four billion two hundred million dollars” and “4.2 billion in revenue” should all match. They don’t always. The structuring step does most of the normalization, but edge cases — abbreviations like “MM” for million, locale-specific separators, currency conversions — still produce false negatives. We accept the false-negative rate because the alternative is false positives, which let unverified numeric claims ship.
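
For the cases that do normalize cleanly, here is a sketch of the kind of rule the structuring step applies to dollar figures. The multiplier table is ours and deliberately incomplete, which is exactly where the false negatives come from.

const MULTIPLIERS: Record<string, number> = {
  k: 1e3,
  m: 1e6,
  mm: 1e6,        // finance-style "MM" for million
  million: 1e6,
  b: 1e9,
  bn: 1e9,
  billion: 1e9,
};

// Returns the canonical numeric value, or null when the surface form needs the LLM
// (spelled-out numbers, locale-specific separators, currency conversion).
function normalizeDollarFigure(raw: string): number | null {
  const match = raw
    .toLowerCase()
    .replace(/[,\s]/g, "")
    .match(/^\$?(\d+(?:\.\d+)?)(k|mm|m|bn|b|million|billion)?$/);
  if (!match) return null;
  const [, value, suffix] = match;
  return parseFloat(value) * (suffix ? MULTIPLIERS[suffix] : 1);
}

// normalizeDollarFigure("$4.2B")          → 4200000000
// normalizeDollarFigure("4.2 billion")    → 4200000000
// normalizeDollarFigure("$4,200,000,000") → 4200000000
// normalizeDollarFigure("four billion two hundred million dollars") → null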

How this connects to the rest of the verifier

Numeric verification is one of three claim types in the broader verifier pipeline. The other two are entity claims (named customers, named contracts, named certifications) and capability claims (does the product do X). All three flow through the same verify panel, the same refusal path, and the same caching layer described in “Shipped: the inline verify button in drafts” and “The Grounded-AI Pledge in code.”

The reason numeric extraction gets its own post is that the failure modes are sharper. A vague entity claim is recoverable in review. A wrong dollar figure on page seven of a federal proposal is not.

Sources

  1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
  2. PursuitAgent Grounded-AI Pledge
  3. AutogenAI — AI hallucination: how can proposal teams reduce risk
  4. Google — Gemini API changelog