Field notes

Retrieval evaluation, part 2: dealing with numeric claims

Why numeric facts break vanilla retrieval and the two tactics — hybrid search and numeric-claim isolation — that fix it. Continuation of the eval series.

The PursuitAgent engineering team · 6 min read · Engineering

Continuation of the retrieval-eval series. Part 1 was the pillar piece: gold sets, metrics, the harness, the regression-guard. This post zooms in on a single failure mode that the gold-set numbers hide if you don’t look at them carefully: numeric claims.

A note on the numbers. The precision@5, recall@20, and accuracy figures below come from our internal eval harness running against our own corpora. They are directional — they describe the shape of a specific intervention on specific corpora, not a generalizable benchmark. We would not expect these numbers to reproduce on an external setup, and we will publish the harness and a public gold set when both are stable.

The failure mode

A buyer asks: “What’s the retention period on audit logs in your platform?”

The right answer is in the KB, in a security-policy block: “Audit logs are retained for 13 months in primary storage and 7 years in cold archive.”

Vanilla dense retrieval ranks the wrong block in the top 3 about 40% of the time on numeric questions like this one. The block it ranks first is the adjacent policy, usually user-session retention or email-archive retention, because the surrounding language closely matches the audit-log block; only the numbers differ.

Embeddings encode semantics, so semantically similar blocks cluster together. The number, the one thing that actually distinguishes the right block from the wrong one, is a low-signal token in a high-dimensional space, and the retriever doesn't weight it heavily enough.

This shows up in our gold-set metrics as a precision@5 dip on the adversarial slice — about 8 points lower than on the non-numeric slice. The aggregate number hides it. Per-query inspection makes it visible.

Tactic 1 — hybrid search

The first fix is the obvious one: dense + sparse, combined.

Dense retrieval (embeddings → ANN) ranks by semantic similarity. Sparse retrieval (BM25) ranks by lexical overlap. Numbers are lexical features. A query containing “13 months” matches a block containing “13 months” via BM25 directly; the embedding model conflates “13 months” and “12 months” because they live next to each other in vector space.
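As a toy illustration of the sparse side, here is the lexical match in isolation. This uses the open-source rank_bm25 package rather than our production retriever, and the block texts are invented; the point is only that number tokens score as ordinary lexical features.

```python
# Toy illustration: BM25 rewards exact lexical overlap on number tokens.
# Uses the open-source rank_bm25 package; not our production sparse retriever.
from rank_bm25 import BM25Okapi

blocks = [
    "Audit logs are retained for 13 months in primary storage and 7 years in cold archive.",
    "User sessions are retained for 12 months before automatic deletion.",
    "Email archives are retained for 24 months under the standard plan.",
]
tokenized = [b.lower().split() for b in blocks]
bm25 = BM25Okapi(tokenized)

query = "is audit log data retained for 13 months".lower().split()
scores = bm25.get_scores(query)
for block, score in sorted(zip(blocks, scores), key=lambda x: -x[1]):
    print(f"{score:5.2f}  {block[:60]}")
```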

Our hybrid path is: run both retrievers, normalize scores, sum with a tunable weight, take top-k from the combined ranking. The hybrid-search post covers the architecture; the relevant detail for this post is the weight.

We tune the dense:sparse weight per corpus. On corpora where the labelers flagged a high share of numeric questions (typically ~25% of the gold-set queries), sparse weight runs around 0.4 — meaningful but not dominant. On corpora dominated by free-form essay questions (where numerics are rarer), sparse weight runs around 0.15. A single global weight would underfit one of the two distributions.
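For concreteness, a minimal sketch of the fusion step described above. The score dictionaries stand in for the two retrievers, and the function names, corpus names, and weight values are illustrative, not our production code.

```python
# Minimal sketch of the hybrid fusion step: normalize each retriever's scores,
# combine with a per-corpus sparse weight, and take top-k from the merged ranking.
# The score dicts, corpus names, and weight values are illustrative.

def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(dense: dict[str, float], sparse: dict[str, float],
                sparse_weight: float, k: int = 20) -> list[tuple[str, float]]:
    dense_n, sparse_n = min_max_normalize(dense), min_max_normalize(sparse)
    combined = {}
    for doc_id in set(dense_n) | set(sparse_n):
        combined[doc_id] = ((1.0 - sparse_weight) * dense_n.get(doc_id, 0.0)
                            + sparse_weight * sparse_n.get(doc_id, 0.0))
    return sorted(combined.items(), key=lambda x: -x[1])[:k]

# Per-corpus weights in the spirit of the tuning described above (values illustrative).
SPARSE_WEIGHT = {"security_policies": 0.4, "essay_library": 0.15}
```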

The improvement on the adversarial slice: precision@5 on numeric queries moved from the mid-0.5s to around 0.7 across the five corpora. Recall@20 moved from the high 0.7s into the high 0.8s. Closing about half the gap to non-numeric performance.

Tactic 2 — numeric-claim isolation

Hybrid gets us most of the way. The second tactic targets the residual: queries where the right block contains the right number but the wrong block contains a similar number, and BM25 weights both equally.

The mechanism: at index time, we extract numeric claims from each block as structured fields. A block containing “13 months” gets a separate column noting (quantity: 13, unit: months, context: audit-log-retention). The context is extracted via a small classifier that reads the surrounding 50 tokens.
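A sketch of the index-time extraction, with a regex standing in for the production parser and a stub where the small context classifier would run; the field names are illustrative.

```python
# Sketch of index-time numeric-claim extraction. A regex pulls (quantity, unit)
# pairs out of each block; classify_context() stands in for the small classifier
# that reads the surrounding tokens. Field names are illustrative.
import re
from dataclasses import dataclass

NUMERIC_CLAIM = re.compile(r"(\d+(?:\.\d+)?)\s+(years?|months?|weeks?|days?|hours?)", re.I)

@dataclass
class NumericClaim:
    quantity: float
    unit: str
    context: str  # e.g. "audit-log-retention", assigned by the classifier

def classify_context(window: str) -> str:
    # Placeholder for the small classifier that reads the surrounding ~50 tokens.
    return "audit-log-retention" if "audit log" in window.lower() else "other"

def extract_numeric_claims(block: str, window_chars: int = 200) -> list[NumericClaim]:
    claims = []
    for m in NUMERIC_CLAIM.finditer(block):
        start, end = max(0, m.start() - window_chars), m.end() + window_chars
        claims.append(NumericClaim(
            quantity=float(m.group(1)),
            unit=m.group(2).lower().rstrip("s"),
            context=classify_context(block[start:end]),
        ))
    return claims

print(extract_numeric_claims(
    "Audit logs are retained for 13 months in primary storage and 7 years in cold archive."))
```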

At query time, if the query parser detects a numeric question (regex on patterns like “how long,” “what is the X period,” or directly on number-shaped tokens), the retriever runs an additional ranking pass on numeric-extracted fields. Blocks whose extracted numeric context matches the query’s target (audit-log-retention, not session-retention) get a re-rank boost.
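And a sketch of the query-time pass; the detection regex, the boost value, and the keyword-based context match are simplifications of what actually ships.

```python
# Sketch of the query-time pass: detect numeric questions with a regex, then
# boost blocks whose extracted numeric-claim context matches the query's target.
# The boost value and the keyword-based context match are illustrative.
import re

NUMERIC_QUERY = re.compile(
    r"\bhow long\b|\bwhat is the \w+ period\b|\b\d+\s*(years?|months?|days?)\b", re.I)

def is_numeric_query(query: str) -> bool:
    return bool(NUMERIC_QUERY.search(query))

def rerank_with_numeric_boost(ranked: list[tuple[str, float]],
                              contexts_by_doc: dict[str, set[str]],
                              query: str,
                              boost: float = 0.2) -> list[tuple[str, float]]:
    # contexts_by_doc maps doc_id -> contexts extracted at index time,
    # e.g. {"doc-17": {"audit-log-retention"}}.
    if not is_numeric_query(query):
        return ranked
    boosted = []
    for doc_id, score in ranked:
        # Crude stand-in for matching the query's target against claim contexts.
        if "audit log" in query.lower() and "audit-log-retention" in contexts_by_doc.get(doc_id, set()):
            score += boost
        boosted.append((doc_id, score))
    return sorted(boosted, key=lambda x: -x[1])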

This is brittle in the predictable way: the extraction classifier has to label “audit-log-retention” consistently across blocks written in different styles. We hand-labeled and curated a 400-block training set in-house, and the classifier is 88% accurate on a held-out set. The 12% it misses is concentrated on blocks that mention multiple retention windows in one paragraph, exactly the case where the human labelers also disagreed.

The improvement on the adversarial numeric slice, on top of hybrid: precision@5 moved from around 0.7 into the high 0.7s. Smaller delta than hybrid, but it closes most of the remaining gap on the queries where hybrid alone wasn’t enough.

What we tried and removed

Direct numeric-similarity scoring. First attempt. We added a feature to the reranker scoring “how close is the number in the block to the number in the query.” This worked when the query contained a number (“retention period of 13 months — what’s the policy?”) but not when the query took the more common form (“what’s our audit-log retention period?”). In the more common form there’s no query-side number to compare against. Removed.
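For concreteness, a sketch of the query-side gap that killed this feature; the scoring function is illustrative, not the reranker feature we actually removed.

```python
# Sketch of why direct numeric-similarity scoring was removed: it only has
# something to compare against when the query itself contains a number.
import re

def query_number(query: str) -> float | None:
    m = re.search(r"\d+(?:\.\d+)?", query)
    return float(m.group()) if m else None

def numeric_similarity(query: str, block_quantity: float) -> float:
    q = query_number(query)
    if q is None:
        # The common phrasing ("what's our audit-log retention period?") lands here,
        # so the feature contributes nothing exactly when it is needed most.
        return 0.0
    return 1.0 / (1.0 + abs(q - block_quantity))

print(numeric_similarity("retention period of 13 months, what's the policy?", 13))  # 1.0
print(numeric_similarity("what's our audit-log retention period?", 13))             # 0.0
```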

Number-aware embeddings. Second attempt. We fine-tuned the embedding model on a synthetic corpus of numeric-policy paragraphs to make it more sensitive to digit tokens. The fine-tune helped on the synthetic eval but underperformed the base model on the gold set. The numeric distinction got slightly better at the cost of general semantic quality. Removed.

The lesson is the same one BEIR’s authors keep emphasizing: bespoke retrieval improvements that win one benchmark slice often lose more on others. The hybrid + numeric-extraction path won on the numeric slice without regressing the general slice. The fine-tune did not.

What’s still broken

Two failure modes the current path doesn’t fix.

Comparative numeric questions. “Is your retention period longer than industry standard?” The right answer requires knowing both our number and an industry benchmark, then comparing. Retrieval surfaces the policy block; the engine has no path to the comparator. Out of scope for retrieval; the drafting engine refuses.

Range and conditional numerics. “Retention is 13 months, except for healthcare customers, where it’s 7 years to comply with HIPAA.” The block has two numbers, and a buyer asking about healthcare might get the wrong one surfaced. Our extraction labels both, but the retriever doesn’t know which condition the buyer is asking about. We are working on conditional-claim awareness; not shipped.

What this changes about the eval

Two changes to the eval methodology after this work.

  1. The adversarial slice is broken out separately in every release report. Aggregate precision@5 hides numeric regressions until they’re large. Reporting the slice separately catches them faster.
  2. The gold set’s numeric-question share is now tracked. When new corpora are onboarded, we explicitly check that numeric questions are at least 15% of the gold set; if not, we add adversarial numeric queries until they are. Otherwise the eval doesn’t measure this failure mode at all. A minimal sketch of both checks follows this list.
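The sketch below assumes each gold-set query carries an is_numeric flag and a per-query precision@5 value; the field and function names are illustrative, not our harness's API.

```python
# Minimal sketch of the two eval changes: report the adversarial numeric slice
# separately, and fail corpus onboarding if the numeric-question share is under 15%.
# Assumes each gold-set query carries an is_numeric flag and a per-query p@5 value.
from statistics import mean

def slice_report(gold_queries: list[dict]) -> dict[str, float]:
    numeric = [q["precision_at_5"] for q in gold_queries if q["is_numeric"]]
    other = [q["precision_at_5"] for q in gold_queries if not q["is_numeric"]]
    return {
        "p@5_numeric": mean(numeric) if numeric else float("nan"),
        "p@5_non_numeric": mean(other) if other else float("nan"),
        "p@5_aggregate": mean(q["precision_at_5"] for q in gold_queries),
    }

def check_numeric_share(gold_queries: list[dict], min_share: float = 0.15) -> None:
    share = sum(q["is_numeric"] for q in gold_queries) / len(gold_queries)
    if share < min_share:
        raise ValueError(f"numeric share {share:.0%} below the {min_share:.0%} floor; "
                         "add adversarial numeric queries before onboarding")
```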

The takeaway: the failure modes you can’t see in the aggregate metrics are the ones where your eval methodology, not your retrieval, is the bottleneck. Slice your numbers.

Sources

  1. BEIR — Heterogeneous benchmark for zero-shot evaluation
  2. Anserini — BM25 baselines
  3. PursuitAgent — Retrieval evaluation pillar
  4. PursuitAgent — Hybrid search, dense and sparse