Testing retrieval: gold sets, precision@k, and why BLEU lies for proposals
Surface-form metrics like BLEU and ROUGE rate proposal text by token overlap. Token overlap is a poor proxy for whether the answer is actually right. Here's the eval stack we use instead.
A retrieval system that scores well on BLEU and ROUGE can still produce proposal answers that are wrong. We learned this twice — once on the way in, when we built our first eval harness on text-overlap metrics because they were the obvious thing, and once on the way out, when we replaced them with task-specific metrics and the production failure modes started showing up in the test scores instead of in customer reports.
This is the longer version of the eval section in the grounded retrieval pillar piece. That post covers the canonical structure. This one is the practitioner’s notes on why the obvious metrics don’t work for proposal text.
What BLEU measures
BLEU was designed for machine translation. It compares a candidate translation against one or more reference translations using n-gram overlap, with a brevity penalty. A high BLEU score means the candidate shares many short token sequences with the references; a low score means the candidate uses different wording.
The metric was a real advance for translation. Translation outputs have a relatively narrow correct-answer space — there are usually only a few good ways to render a sentence in the target language, and lexical overlap with a reference translation correlates with quality.
Proposal text does not work this way.
Why proposal text is BLEU-hostile
A correct answer to “describe your security posture for cloud-hosted PII” can take many forms. It can lead with the framework (NIST 800-53, ISO 27001, SOC 2), or lead with the controls (encryption at rest and in transit, key rotation, access logging), or lead with the audit cadence (annual third-party assessment, quarterly internal review). Each version is a correct answer. Each version uses different vocabulary. None of them have particularly high n-gram overlap with the others.
If our gold-set reference answer leads with the framework and the system’s drafted answer leads with the controls, BLEU will rate them as low-overlap. The system will look like it failed even though both answers are correct, well-grounded, and citation-faithful.
The inverse is also a problem. A drafted answer that uses many of the same words as the reference but assembles them into a claim that doesn't match, say "we have annual SOC 2 Type I audits" where the reference says "we have annual SOC 2 Type II audits," will score well on BLEU because the n-grams overlap almost completely. The substantive error (Type I vs. Type II are different attestations: control design at a point in time versus operating effectiveness over a period) is exactly the kind of error that matters in a regulated proposal and exactly the kind that BLEU misses.
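To make both failure directions concrete, here is a minimal sketch using NLTK's sentence-level BLEU. The sentences are invented and this is not our harness code; the point is only that the reworded-but-correct answer scores far lower than the near-verbatim-but-wrong one.

```python
# Minimal illustration with NLTK's sentence-level BLEU. Invented sentences,
# not harness code. Smoothing avoids zero scores on short sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = ("we undergo annual soc 2 type ii audits covering encryption "
             "at rest and in transit").split()

# Correct answer, different wording: leads with the controls, not the audit.
correct_rephrased = ("customer data is encrypted in transit and at rest with "
                     "controls assessed each year by an independent auditor").split()

# Wrong answer, near-identical wording: only the attestation type changes.
wrong_similar = ("we undergo annual soc 2 type i audits covering encryption "
                 "at rest and in transit").split()

print(sentence_bleu([reference], correct_rephrased, smoothing_function=smooth))  # low
print(sentence_bleu([reference], wrong_similar, smoothing_function=smooth))      # much higher
```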
ROUGE has the same shape of failure. Both are surface-form metrics. Both are poor proxies for correctness in a domain where multiple wordings are correct and similar wordings can be wrong.
What we use instead
The eval harness we run today combines five metrics. We covered them at length in the pillar piece; here are the practitioner notes on each.
Precision@k. For a given question, the retrieval step returns a ranked list of candidate blocks. Precision@1 is whether the top-1 block is the gold-set “correct source.” Precision@10 is whether the correct source is anywhere in the top-10. The two together let us localize regressions: if precision@10 drops, the retrieval scoring is broken; if precision@10 holds and precision@1 drops, the ranker is broken; if both hold but downstream metrics regress, the problem is generation or verification.
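For concreteness, here is a rough sketch of precision@k as defined above: a hit-within-top-k check against a single gold source. The GoldTriple schema, the field names, and the retrieve() interface are illustrative, not our actual harness.

```python
# Sketch of precision@k as defined in this post: does the gold source appear
# in the top-k retrieved blocks? Schema and field names are illustrative.
from dataclasses import dataclass

@dataclass
class GoldTriple:
    question: str
    expected_source_id: str   # the block the answer should be drawn from
    stratum: str              # e.g. "federal-it", "healthcare", "enterprise"

def precision_at_k(gold: list[GoldTriple], retrieve, k: int) -> float:
    """retrieve(question) -> ranked list of block ids (assumed interface)."""
    hits = 0
    for triple in gold:
        top_k = retrieve(triple.question)[:k]
        if triple.expected_source_id in top_k:
            hits += 1
    return hits / len(gold)

# p_at_1 = precision_at_k(gold_set, retrieve, k=1)
# p_at_10 = precision_at_k(gold_set, retrieve, k=10)
```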
Claim-level entailment rate. The drafted sentence gets decomposed into atomic claims. Each claim is checked for entailment against the cited source block. The entailment rate is the fraction of claims that pass. This is the metric BLEU should have been: it measures whether the substantive content of the answer is supported, regardless of surface form.
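A rough sketch of the check, with an off-the-shelf MNLI model standing in for the verifier; the decomposition of a drafted sentence into atomic claims is assumed to have happened upstream, and none of this is our production code.

```python
# Illustrative claim-level entailment check. An off-the-shelf MNLI model
# stands in for the verifier; the claim list is assumed to come from an
# upstream decomposition step not shown here.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entails(source_block: str, claim: str, threshold: float = 0.9) -> bool:
    inputs = tok(source_block, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item() >= threshold   # index 2 = "entailment" for this model

def entailment_rate(claims: list[str], cited_block: str) -> float:
    """Fraction of atomic claims supported by the cited source block."""
    if not claims:
        return 0.0
    return sum(entails(cited_block, c) for c in claims) / len(claims)
```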
Citation-fidelity rate. Of sentences that emit a citation, how often does the citation actually support the sentence under independent re-evaluation? This metric is high in our system (~0.99) because the architecture ties the pointer to the draft step rather than attaching it post-hoc.
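Citation fidelity is the same kind of support check, restricted to sentences that actually emit a citation. A small sketch, with supports() as a placeholder for the independent re-evaluation (it could be the entails() check above, applied at the sentence level); the sentence structure is illustrative.

```python
# Citation-fidelity rate over a drafted answer: of the sentences that carry a
# citation, how many survive an independent support check? `supports` and the
# sentence dict fields are placeholders, not production names.
def citation_fidelity(sentences: list[dict], supports) -> float:
    cited = [s for s in sentences if s.get("citation")]
    if not cited:
        return 1.0
    ok = sum(supports(s["text"], s["citation"]) for s in cited)
    return ok / len(cited)
```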
Refusal rate. A system can hit perfect entailment by refusing every question. We track refusal rate as a denominator. Production target: between 4% and 8% on the gold set. Above that band the system is over-cautious; below it we worry that the refusal gates aren’t tight enough.
Composite grounded-score. A weighted combination of the four. The composite is what we plot release-over-release. Movements in the composite trigger investigation; movements in any component metric trigger localization.
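The weights below are illustrative placeholders that show the shape of the composite, not the production formula. Refusal rate is folded in as an answer rate so that every component points in the "higher is better" direction.

```python
# Illustrative composite grounded-score. Weights are placeholders, not the
# production values. answer_rate = 1 - refusal_rate so all terms agree in sign.
WEIGHTS = {
    "precision_at_10": 0.2,
    "entailment_rate": 0.4,
    "citation_fidelity": 0.3,
    "answer_rate": 0.1,
}

def grounded_score(metrics: dict[str, float]) -> float:
    return sum(w * metrics[name] for name, w in WEIGHTS.items())
```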
Why we keep BLEU around as a sanity check
We dropped BLEU as a primary metric. We did not drop it entirely.
Two cases where it remains useful. First, it is cheap. A regression that drops BLEU 20 points on the gold set is almost always a real regression somewhere — even if BLEU isn’t measuring the right thing, a 20-point drop usually means something structural broke. Second, it is comparable across systems. If we want to compare ourselves to a public baseline (or to ourselves a year ago), BLEU is one of the few metrics with a stable, well-known computation.
So BLEU runs in our nightly eval as a tripwire, not as a target. A movement triggers a look. The look usually finds something. The fix is rarely “improve BLEU.”
How the gold set gets built
A gold set is only as good as its construction. Ours has three sources.
Public RFPs. State and federal procurement portals publish RFPs and, in some cases, the awarded responses. We harvest a rotating set of (question, expected answer, expected source) triples from public materials, with the answer drawn from publicly available product documentation rather than confidential vendor responses. This part of the gold set runs about 600 triples.
Customer-provided evaluation sets. Customers who participate in our eval program provide redacted (question, answer, source) triples drawn from their own historical RFPs. We pay for the redaction work. Triples land in the gold set after a two-annotator review with arbitration. About 900 triples.
Synthetic, human-verified. Generated from our own product documentation — questions an evaluator might plausibly ask, with the answer and source taken directly from the docs. Each triple is reviewed by a human annotator. About 340 triples.
On top of the three sources, the gold set is stratified by domain. We track precision@k and entailment rate per stratum, because the retrieval system’s failures cluster differently in different domains. Federal IT questions look different from healthcare questions, which look different from generic enterprise questions.
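A sketch of the per-stratum reporting, building on the GoldTriple and precision_at_k sketch above; the stratum labels are illustrative.

```python
# Per-stratum precision@k, reusing GoldTriple and precision_at_k from the
# earlier sketch. Stratum labels are illustrative.
from collections import defaultdict

def precision_by_stratum(gold: list[GoldTriple], retrieve, k: int) -> dict[str, float]:
    by_stratum: dict[str, list[GoldTriple]] = defaultdict(list)
    for triple in gold:
        by_stratum[triple.stratum].append(triple)
    return {name: precision_at_k(triples, retrieve, k)
            for name, triples in by_stratum.items()}
```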
What the gold set can’t do
Three honest limits.
The gold set is a snapshot. Customer KBs change. Public RFPs change. The gold set we shipped six months ago is degrading by attrition as the underlying corpora drift. We refresh the synthetic and customer portions monthly; the public portion runs on a quarterly refresh.
The gold set under-represents long-tail questions. Most gold-set triples are questions a typical proposal would face. The hard questions — the ones at the long tail of customer-specific terminology, the ones that need cross-document composition, the ones with numeric precision requirements — are under-represented because they are harder to construct and harder to annotate. Our production failures tend to live in the long tail. Eval coverage in the long tail is the open work.
The gold set rewards systems that score well on the gold set. This is a tautology with teeth. A system tuned to pass the gold set may not generalize to questions the gold set doesn’t cover. We mitigate by holding out 20% of the gold set as a never-touched eval slice and refusing to look at the held-out scores during tuning. We compare them only at release time. Drift between the seen and held-out slices triggers a review of the gold-set construction.
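One simple way to implement a never-touched slice is to key the split on a hash of the question text rather than on a random draw, so the slice stays stable as the gold set is refreshed. A sketch of that mechanism, not necessarily the only way to do it:

```python
# Deterministic hold-out split keyed on the question text. Hash-based splits
# stay stable across gold-set refreshes; the 20% fraction mirrors the post.
import hashlib

def is_held_out(question: str, holdout_fraction: float = 0.2) -> bool:
    digest = hashlib.sha256(question.encode("utf-8")).hexdigest()
    return (int(digest, 16) % 100) < int(holdout_fraction * 100)
```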
What we wish we had
A standardized public benchmark for proposal-domain RAG. Nothing of the kind exists at production-grade quality today. RAGAS and ARES are general-purpose; they do not stratify by proposal subdomain, and their reference answers are short. The Stanford HAI legal-RAG paper points at a benchmark shape that we wish had been published as a corpus, not just as a paper. Until something like that exists, every production proposal vendor’s eval is a private gold set, which is a problem for category-level transparency.
We would publish ours if we could. The customer-provided portion has confidentiality constraints; the public-RFP portion is shareable in principle and we are working on a release. No timeline yet. The synthetic portion is the easiest to share and the least informative.
What to take away
If you are running an eval on a RAG system and you are using BLEU or ROUGE as your headline number, you are measuring surface form. Surface form is a weak proxy for correctness in any domain where multiple wordings are correct. In proposal text, multiple wordings are routinely correct. Move to entailment-based metrics. Build a gold set, even if it’s small. Track precision@k and refusal rate alongside the entailment number. The eval will tell you more about what’s broken in the system, and the production failure modes will start showing up in the test scores before they show up in customer feedback.
That migration costs about a quarter of engineering time. We did it once. We would not unship it.