Retrieval eval snapshot, December 2025
Quarter four retrieval evaluation numbers against our held-out RFP and DDQ corpus. What moved since September, what's still stuck, and which regressions we're not yet fixing.
Every quarter we publish the numbers for our retrieval eval. This is the December 2025 snapshot. If you want the methodology in full, the eval pillar lays out the harness, the held-out sets, and why we chose the metrics we did. This post is the Q4 cut: what moved, what didn’t, and which regressions we’re shipping around instead of fixing.
A note on the numbers. The recall, MRR, and grounded-hit-rate figures below are directional. They are produced by our internal eval harness on our own corpora and gold sets. They are not a general benchmark and we would not expect them to reproduce on an external setup. We will publish the harness and a public gold set when both are stable. Read the deltas — which way things moved quarter-over-quarter — as the load-bearing content here; read the absolute values as shape, not as score.
The headline numbers
Our held-out eval set has three partitions: the RFP-question set (4,200 queries derived from real public-sector RFPs), the DDQ set (2,800 queries from CAIQ-aligned and SIG-aligned question banks), and the adversarial set (600 queries designed to surface keyword-shaped confusions). Metrics are recall@10, reranker-adjusted MRR@10, and grounded-hit-rate — the fraction of top-10 results that a reviewer marked as genuinely supporting the question.
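For reference, a minimal sketch of the three metrics as the harness computes them, assuming each query carries a ranked list of block IDs plus a gold set and a reviewer-grounded set. The function and argument names are illustrative, not the harness's actual API.

```python
from typing import Sequence, Set

def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 10) -> float:
    """Fraction of gold blocks that appear in the top k (a plain hit rate when there is one gold block)."""
    if not gold_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)

def mrr_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first gold block in the top k; 0 if none appears."""
    for rank, block_id in enumerate(ranked_ids[:k], start=1):
        if block_id in gold_ids:
            return 1.0 / rank
    return 0.0

def grounded_hit_rate(ranked_ids: Sequence[str], grounded_ids: Set[str], k: int = 10) -> float:
    """Fraction of the top k that a reviewer marked as genuinely supporting the question."""
    top = ranked_ids[:k]
    return sum(block_id in grounded_ids for block_id in top) / len(top) if top else 0.0

# The partition-level figures in the table below are means of these per-query values.
```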
| Metric | Q3 (Sep) | Q4 (Dec) | Delta |
|---|---|---|---|
| Recall@10 — RFP set | 0.81 | 0.84 | +0.03 |
| Recall@10 — DDQ set | 0.87 | 0.89 | +0.02 |
| Recall@10 — adversarial set | 0.62 | 0.64 | +0.02 |
| MRR@10 — RFP set | 0.54 | 0.57 | +0.03 |
| MRR@10 — DDQ set | 0.63 | 0.64 | +0.01 |
| Grounded-hit-rate — all sets | 0.71 | 0.73 | +0.02 |
Nothing regressed quarter-over-quarter. The RFP set moved most, driven by the hybrid-merge tuning we shipped in October. The DDQ set moved less because it was already the strongest partition.
What moved, and why
RFP recall from 0.81 to 0.84. The hybrid merge now weights BM25 higher on queries that contain a clause identifier pattern (digits with a period between them, “shall” within three tokens of a verb, CFR-style citations). On queries the classifier routes as “clause-shaped,” BM25 gets a weight of 0.6 in the reciprocal-rank fusion instead of the default 0.4. Recall on clause-shaped queries moved from 0.78 to 0.86. Recall on non-clause queries held flat. The classifier itself is a 12-feature logistic regression, not a model call; the feature extraction is a regex pass over the query string. We measured it at sub-millisecond.
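A minimal sketch of the routing and merge, with caveats: only three of the twelve features are shown, the feature weights are hand-set stand-ins rather than the fitted logistic-regression coefficients, and the `bm25_ranking` and `dense_ranking` inputs are illustrative names.

```python
import re

# Three of the twelve features, for illustration. The "shall within three tokens
# of a verb" feature is simplified to a bare keyword check here.
CLAUSE_ID = re.compile(r"\b\d+\.\d+(\.\d+)*\b")        # e.g. "3.2.1"
CFR_CITE  = re.compile(r"\b\d+\s+C\.?F\.?R\.?", re.I)  # CFR-style citations
SHALL     = re.compile(r"\bshall\b", re.I)

def clause_shaped(query: str, threshold: float = 0.5) -> bool:
    """Stand-in for the logistic regression: hand-set weights, not fitted coefficients."""
    score = (0.45 * bool(CLAUSE_ID.search(query))
             + 0.35 * bool(CFR_CITE.search(query))
             + 0.20 * bool(SHALL.search(query)))
    return score >= threshold

def weighted_rrf(bm25_ranking, dense_ranking, bm25_weight, k=60):
    """Reciprocal-rank fusion with an explicit weight on the BM25 contribution."""
    scores = {}
    for rank, doc_id in enumerate(bm25_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + bm25_weight / (k + rank)
    for rank, doc_id in enumerate(dense_ranking, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1.0 - bm25_weight) / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_merge(query, bm25_ranking, dense_ranking):
    # Clause-shaped queries get BM25 weight 0.6; everything else keeps the default 0.4.
    weight = 0.6 if clause_shaped(query) else 0.4
    return weighted_rrf(bm25_ranking, dense_ranking, bm25_weight=weight)
```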
MRR from 0.54 to 0.57 on the RFP set. The reranker got a fresh training pass on 22,000 newly labeled query-block pairs from Q3 usage, filtered to pairs where a human reviewer had explicitly accepted or rejected the block as a draft source. The relabeling was the cheap part; the quality-control pass on the labels was the expensive part. About 8% of the initial labels were contradictory or ambiguous and had to be re-adjudicated.
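For shape, a hedged sketch of that quality-control pass: group reviewer decisions per query-block pair, keep unanimous pairs for training, and route mixed accept/reject pairs to re-adjudication. The record fields and decision encoding are assumptions, not our actual label schema.

```python
from collections import defaultdict

def split_labels(pairs):
    """Group reviewer decisions by (query_id, block_id): keep unanimous pairs for
    training, route mixed accept/reject pairs to re-adjudication."""
    by_key = defaultdict(list)
    for p in pairs:  # each p: {"query_id": ..., "block_id": ..., "decision": "accept" | "reject"}
        by_key[(p["query_id"], p["block_id"])].append(p["decision"])

    clean, contested = [], []
    for key, decisions in by_key.items():
        if len(set(decisions)) == 1:
            clean.append((key, decisions[0]))
        else:
            contested.append(key)  # roughly 8% of pairs landed here in Q3
    return clean, contested
```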
Grounded-hit-rate from 0.71 to 0.73. This moves slowly because it depends on reviewer judgment, which is noisier than recall. Two percentage points is real but modest. The ceiling we’re pushing against is the adversarial set — questions engineered to surface the kind of “retrieved result has surface-level keyword overlap but doesn’t actually answer the question” failure that Stanford HAI documented in the commercial legal RAG space.
What didn’t move
Adversarial recall is stuck. We moved it from 0.62 to 0.64. We thought we'd move it further. The queries in that partition are deliberately constructed to have high lexical overlap with wrong-answer blocks — same vocabulary, different intent. The reranker helps, but the retriever upstream of the reranker is still surfacing the wrong candidates in the top 50. The ceiling we've hit points to the pre-rerank candidate pool as the bottleneck.
The research branch that addresses this is a query-rewriting pass that expands the query into a set of intent-specific paraphrases before retrieval. Early offline numbers are promising (+0.06 on the adversarial set) but the pass adds 40ms to the retrieval budget, and we haven’t figured out how to make it optional without adding a routing decision. Shipping target: Q1 2026.
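A hedged sketch of what that pass looks like, assuming a `paraphrase` callable (which accounts for most of the ~40ms overhead) and a `retrieve` callable that returns a ranked candidate list; the fusion over variants is plain RRF.

```python
def retrieve_with_rewrites(query, paraphrase, retrieve, n_rewrites=3, k=60):
    """Expand the query into intent-specific paraphrases, retrieve for each variant,
    and fuse the candidate pools with plain RRF before the reranker sees them."""
    variants = [query] + paraphrase(query, n=n_rewrites)  # paraphraser is the bulk of the ~40ms
    scores = {}
    for variant in variants:
        for rank, doc_id in enumerate(retrieve(variant), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```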
Long-document retrieval on 200-page PDFs. On queries where the correct answer lives deep in a 200+ page extracted RFP and the surrounding chunks are near-duplicates of other sections (a common pattern in federal solicitations with repeated boilerplate), recall stays in the high 0.6s. We know the problem. The fix involves a hierarchical chunking pass whose evaluation we haven't finished. Not shipping this quarter.
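That branch isn't shipped, but one ingredient is easy to sketch: a parent-aware cap on the candidate pool so repeated boilerplate from the same section can't crowd out deeper, distinct sections. The `parent_id` attribute (a section-level ancestor produced by the hierarchical chunking pass) is an assumption about the eventual schema, not what we index today.

```python
def cap_per_parent(ranked_chunks, max_per_parent=2):
    """Limit how many chunks any one parent section contributes to the candidate pool,
    so repeated boilerplate can't crowd out deep, distinct sections."""
    kept, counts = [], {}
    for chunk in ranked_chunks:
        n = counts.get(chunk.parent_id, 0)  # parent_id: section-level ancestor (assumed schema)
        if n < max_per_parent:
            kept.append(chunk)
            counts[chunk.parent_id] = n + 1
    return kept
```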
A regression we caught and fixed mid-quarter
In late October, a reindex changed the embedding normalization path. HNSW distances drifted by roughly 1.5% on average. The online eval held flat on the headline metrics, because we measure top-10 and the top-10 set absorbs small distance drifts. The offline adversarial set caught the drift: recall dropped from 0.63 to 0.58 in a span of four hours.
We rolled the change back within a day, added a pre-deploy eval gate that blocks on the adversarial set specifically, and filed the incident post-mortem in our internal tracker. The lesson is not novel: the metric you watch in production has to include the set you’re most likely to regress on. Headline recall@10 on the RFP set is not that set.
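The gate itself is small. A hedged sketch, with illustrative function names and thresholds rather than our actual configuration:

```python
ADVERSARIAL_FLOOR = 0.62  # illustrative; roughly the Q3 baseline

def predeploy_gate(run_eval, candidate_index, serving_index):
    """Block the deploy if adversarial-set recall@10 drops below an absolute floor
    or falls meaningfully behind the currently serving index."""
    candidate = run_eval(candidate_index, partition="adversarial")
    baseline = run_eval(serving_index, partition="adversarial")
    if candidate < ADVERSARIAL_FLOOR or candidate < baseline - 0.02:
        raise RuntimeError(
            f"adversarial recall@10 regressed: candidate {candidate:.3f} vs serving {baseline:.3f}"
        )
    return candidate
```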
A regression we know about and are shipping
The one we’re not fixing this quarter: reranker latency P99 on a specific class of customer workloads. When a customer runs a DDQ of 250 questions in parallel, the reranker queue depth spikes and P99 latency crosses 900ms. P50 stays at 48ms. The bottleneck is CPU contention on the inference path, not model quality.
The fix is a dedicated rerank worker pool. We’ve scoped it. The cost isn’t trivial for the volume we’re at, and the P99 pain is concentrated in a small number of customer accounts during their year-end surge. We’ve told those customers the constraint. We’re shipping around it this quarter and revisiting in January.
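For shape, a minimal sketch of the scoped fix, assuming an async serving path; the pool size and the `rerank_fn` callable are placeholders, not our production setup.

```python
import asyncio

class RerankPool:
    """Dedicated, bounded pool for rerank calls: a 250-question DDQ burst queues
    here instead of contending for CPU with the rest of the inference path."""

    def __init__(self, rerank_fn, max_workers: int = 4):
        self._rerank_fn = rerank_fn            # synchronous rerank call (placeholder)
        self._sem = asyncio.Semaphore(max_workers)

    async def rerank(self, query, candidates):
        async with self._sem:                  # bounds concurrent rerank calls
            return await asyncio.to_thread(self._rerank_fn, query, candidates)
```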
This is the honest trade: we know about it, it’s on the list, and it is not the most valuable thing we could ship in the next three weeks. Being direct about the trade is better than quietly pretending the P99 number doesn’t exist.
What to take from this
Three things, for anyone running an equivalent eval harness:
- A classifier-routed hybrid merge beats a single hybrid weighting. The win on clause-shaped queries was the biggest single improvement this quarter and the cheapest to implement.
- Your offline eval has to include the adversarial partition your production traffic lives on. We nearly shipped a silent 5-point regression because the headline metric was flat.
- “Shipping around” a known regression is a legitimate engineering move, not a failure, as long as it is named. The unnamed version is the one that accumulates.
The next eval snapshot lands in March. The query-rewriting pass should be in it by then, or we will publish why it isn’t.