The claim-level verification pass, explained
After the draft model writes a sentence, a smaller verifier model reads each substantive claim and asks: is this entailed by the source block? Here's how that works, what it costs, and where it still misses.
The draft model writes a sentence. The verifier model reads it and asks: does the source block entail this? If yes, the sentence ships. If no, it gets refused.
That sentence describes the claim-level verification pass, which is the third gate in the Grounded-AI Pledge enforcement stack. We’ve written about the Pledge enforcement architecture at a high level. This post zooms into the verifier specifically — what model runs, how the entailment check is structured, what it costs, and where it still misses.
A note on the numbers. Latency and false-negative rates below come from our internal eval harness and production monitoring on our own system. They are directional, not a benchmark. The shape (why we use a smaller verifier, why exact-match on numerics matters, where the residual false-negatives live) is durable; the specific millisecond figures and the aggregate error rates move as the verifier model and the gold set evolve.
The problem the verifier solves
A draft model with retrieval still hallucinates. Stanford HAI’s 2024 study put commercial legal RAG hallucination rates between 17% and 33% even with citations attached. The failure mode is specific: the model attaches a citation pointer to a sentence whose substance the cited block doesn’t actually support. The pointer is real. The support is not.
Better prompting reduces this. It does not eliminate it. The only architectural fix is to verify each substantive claim against the source block as a separate step, with a model that didn’t generate the claim. That is what the verifier does.
What “claim-level” means
A drafted sentence can have one claim or several. Take an example.
“PursuitAgent encrypts customer data at rest using AES-256 and supports SOC 2 Type II compliance.”
That is two claims. AES-256 at rest is one. SOC 2 Type II is the other. A verifier that scores the whole sentence against a source block can pass the sentence even if only one of the two claims is entailed. That is the failure we are trying to avoid.
Claim-level decomposition splits the sentence first, then verifies each claim independently, then refuses the whole sentence if any claim fails. In code, roughly:
// One substantive claim extracted from a drafted sentence.
type Claim = { text: string; phrase: string };

// Split the sentence into independent claims, then gate each one.
const claims = await decomposeClaims(draftedSentence);
for (const claim of claims) {
  const result = await verifier.entails(claim, sourceBlock.text);
  if (!result.entailed) {
    // One failed claim refuses the whole sentence.
    return {
      kind: "ungrounded",
      reason: "claim-level-entailment-failure",
      failedClaim: claim,
    };
  }
}
// Every claim entailed: the sentence ships.
The decomposition step is itself a model call. We use a small instruction-tuned model with a constrained output schema (JSON list of claims). It is fast — two-digit milliseconds — and it has an honest failure mode: if it under-decomposes, you over-pass; if it over-decomposes, you over-refuse. We err toward over-decomposition because over-refusal is recoverable (the operator sees the refusal and supplies a better source) and over-passing is the failure we promised the customer we wouldn’t ship.
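For concreteness, here's what that call can look like. This is a sketch, not our production interface: the model id, the schema shape, and the completeJSON helper are all illustrative assumptions.

// Sketch only: the model id, schema, and completeJSON helper are
// illustrative, not the production interface.
declare function completeJSON(args: {
  model: string;
  schema: unknown;
  prompt: string;
}): Promise<Claim[]>;

const CLAIM_SCHEMA = {
  type: "array",
  items: {
    type: "object",
    properties: {
      text: { type: "string" },   // the claim, restated standalone
      phrase: { type: "string" }, // the sentence span it came from
    },
    required: ["text", "phrase"],
  },
};

async function decomposeClaims(sentence: string): Promise<Claim[]> {
  // One call to a small instruction-tuned model, constrained to emit
  // a JSON list of claims and nothing else.
  return completeJSON({
    model: "claim-decomposer-small", // hypothetical model id
    schema: CLAIM_SCHEMA,
    prompt:
      "Split this sentence into independent, individually checkable " +
      `factual claims: "${sentence}"`,
  });
}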
What the verifier model actually is
The verifier is not Claude or GPT-class. It is a smaller, distilled, entailment-tuned model that we run on a co-located inference path. The reasons:
Cost. A verifier call runs per claim, which means a five-claim sentence gets five verifier calls. If each one is a frontier-model call, the per-draft cost balloons fast. We need the verifier to be cheap enough that we can run it on every sentence the drafting engine emits, not just the suspicious ones.
Latency. Frontier models add a hundred milliseconds or more per call. Five claims means at least half a second added to a budget that already lives at 400ms P95 for retrieval (separate post on the retrieval latency budget). We need the verifier under 30ms per claim P50.
Specialization. Entailment is a narrower task than open generation. A model trained on natural language inference (NLI) corpora — SNLI, MNLI, ANLI, plus domain-specific extensions — outperforms a general frontier model on the entailment-only task at a fraction of the parameter count. We benchmark on held-out RFP-shaped pairs.
The current verifier sits in the 1B-parameter range, quantized, running on CPU. It hits 22ms P50 per claim, 60ms P95.
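The contract the rest of this post assumes is small. Here's the shape of the verifier object the earlier snippet calls into (the field names are ours, and the result type is a simplification of what the model actually returns):

// The shape this post assumes for the verifier; field names are ours.
interface EntailmentResult {
  entailed: boolean;   // the binary gate
  confidence: number;  // 0..1; used for the soft-rule floor below
}

interface Verifier {
  // One hypothesis (the claim) against one premise (the source block).
  // Budget: ~22ms P50, 60ms P95 per call on the co-located CPU path.
  entails(claim: Claim, sourceText: string): Promise<EntailmentResult>;
}

declare const verifier: Verifier; // provided by the co-located inference path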
What “entailment” means here, operationally
Two strict rules and one soft rule.
Strict rule 1. Numeric facts must match exactly or within an explicit tolerance. “99.9% uptime” in the draft against “99.94%” in the source fails entailment. We err on the strict side for numbers because accepting near-numbers is the failure regulated buyers most often catch us on.
Strict rule 2. Named entities must match. “Encrypts data using AES-256” against a source that says “AES” without specifying key length fails. “AES-128” against a source that says “AES-256” fails. The verifier is configured to treat key sizes, version numbers, certifications, and named compliance frameworks as named entities.
Soft rule. Synonym handling and tense changes are allowed. “We support SOC 2 Type II” against “SOC 2 Type II compliance is maintained annually” passes. “We provide encryption at rest” against “All customer data is encrypted at rest using industry-standard algorithms” passes — barely, because “industry-standard” is vaguer than “AES-256.” The verifier returns a confidence score; below a floor, we still refuse.
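Stacked together, the gate looks roughly like the sketch below. extractNumbers is a naive stand-in and the 0.8 floor is a placeholder, not our production value.

// Sketch of the rules layered on the raw entailment result.
// CONFIDENCE_FLOOR and extractNumbers are illustrative placeholders.
const CONFIDENCE_FLOOR = 0.8;

function extractNumbers(text: string): string[] {
  // Naive: pull numeric literals as strings ("99.9", "256") so
  // near-numbers stay distinct, matching the exact-match rule.
  return text.match(/\d+(?:\.\d+)?/g) ?? [];
}

function numbersMatch(claimText: string, sourceText: string): boolean {
  // Strict rule 1: every number in the claim must appear in the source
  // exactly (an explicitly configured tolerance is omitted here).
  const sourceNums = new Set(extractNumbers(sourceText));
  return extractNumbers(claimText).every((n) => sourceNums.has(n));
}

async function passesGate(claim: Claim, sourceText: string): Promise<boolean> {
  const result = await verifier.entails(claim, sourceText);
  if (!result.entailed) return false;                      // entailment gate
  if (!numbersMatch(claim.text, sourceText)) return false; // strict rule 1
  // Strict rule 2 (named entities: key sizes, versions, certifications)
  // slots in here the same way, with an entity extractor instead of numbers.
  return result.confidence >= CONFIDENCE_FLOOR;            // soft-rule floor
}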
Cost, in numbers
A draft session of 60 sentences, averaging three claims per sentence, is 180 verifier calls. At 22ms P50 each, that’s about four seconds of verifier wall time per draft session if run sequentially. We run the calls for a sentence in parallel, so verification adds roughly one call’s latency per sentence instead of blocking user-perceived latency. The compute cost is real but small — the verifier runs on the same instance pool as the drafting service, and at current utilization the marginal cost per draft is in the cents range.
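Mechanically, the parallelism is ordinary fan-out. Here's a sketch, reusing the per-claim gate from above:

// Fan out all claims for a sentence concurrently; the verdict arrives in
// roughly one verifier-call latency rather than claims x latency.
async function verifySentence(
  sentence: string,
  sourceText: string,
): Promise<boolean> {
  const claims = await decomposeClaims(sentence);
  const verdicts = await Promise.all(
    claims.map((claim) => passesGate(claim, sourceText)),
  );
  return verdicts.every(Boolean); // any failed claim refuses the sentence
}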
We measure dollar cost on every sentence and surface it on internal dashboards. A regression in verifier utilization shows up there before it shows up in customer behavior.
False negatives — the part we’re honest about
The verifier refuses sentences it shouldn’t. This is the most user-visible failure mode of the system. A reviewer sees a refusal, opens the candidate block, reads it, and says “this clearly supports the sentence — why did you refuse?”
Three patterns recur.
Compound claims that span blocks. Sentence claims A and B; block 1 entails A; block 2 entails B; neither block entails both. The current verifier scores against one source block at a time, so the sentence is refused. Multi-block entailment is in a research branch, not on the production path.
Aggressive paraphrase. The draft model rephrased the source so heavily that the claim’s surface form drifted from the entailment-trained pairs the verifier knows. We’ve reduced this by tuning the rewrite prompt toward conservative paraphrase, but it still happens.
Numeric tolerance corner cases. “More than 99% uptime” against “99.94%” should pass — 99.94 is more than 99 — but the verifier reads “99% uptime” as a specific number and refuses. We’re working on a numeric-comparison subroutine.
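To make that concrete, the subroutine we have in mind parses the comparator out of the claim instead of treating the number as a literal. A sketch of the idea, not shipped code:

// Read "more than 99%" as a constraint (> 99), not a literal 99,
// then test the source's value against it.
function comparativeNumberPasses(claimText: string, sourceValue: number): boolean {
  const m = claimText.match(/\b(more than|over|at least|less than|under)\s+([\d.]+)/i);
  if (!m) return false; // no comparator: fall back to the exact-match rule
  const bound = parseFloat(m[2]);
  switch (m[1].toLowerCase()) {
    case "more than":
    case "over":
      return sourceValue > bound;
    case "at least":
      return sourceValue >= bound;
    case "less than":
    case "under":
      return sourceValue < bound;
    default:
      return false;
  }
}

comparativeNumberPasses("more than 99% uptime", 99.94); // true: 99.94 > 99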
The aggregate false-negative rate, on our internal eval set, is low single-digit percent. The aggregate false-positive rate (sentences that shouldn’t have passed) is lower still; we’ve optimized the system to keep the false-positive rate small even at the cost of false-negative noise. A reviewer who sees a false negative is annoyed. A reviewer who sees a false positive ships an ungrounded claim to a customer, and we promised we wouldn’t.
What’s next
The hallucination budget — what it costs, per claim, to ship a draft that’s grounded to a chosen confidence level — is the topic of next week’s post. Verification cost is one term in that budget. Retrieval recall is another. The tradeoffs between them are where the engineering work lives.
If the verifier refuses a sentence in your draft today and you can see why it shouldn’t have, send the example. The eval set learns from corner cases, and corner cases are where the tail of the false-negative rate lives.