Field notes

Hybrid search: dense embeddings plus BM25 for proposals

Pure dense retrieval misses on numeric identifiers, product names, and SOC codes. Pure BM25 misses on paraphrase. The blend ratio we use, how we tune it, and the test set that catches regressions.

Pure dense retrieval — embeddings only — has a known failure pattern in proposal data. It misses on the things proposals are full of: numeric identifiers, product names, certification codes, version strings. “SOC 2 Type II” embeds close to “SOC 2 Type I”. “AES-256” embeds close to “AES-128”. The semantic-similarity score doesn’t carry the discrete information that distinguishes them.

Pure BM25 — keyword scoring only — has the inverse failure. It nails the literal token matches and falls over on paraphrase. “How is data encrypted at rest?” won’t pull a block that says “All customer data is stored using AES-256.” The query has zero token overlap with the answer.

The fix in both cases is well-known: hybrid search. We run both retrievers and blend the scores. This post is the version of that we run, the blend ratio we settled on, and the test set we use to catch regressions when we tune it.

What we run, exactly

Two retrievers in parallel, results merged.

Dense retrieval. Each KB block is embedded once at ingest using a domain-tuned variant of a current-generation embedding model. We re-embed when the model upgrades (rare; we lock to a model version per customer for citation stability — see below). Query is embedded at request time. Top-K candidates by cosine similarity are returned with their raw scores.
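
As a rough sketch, the dense leg is nothing more exotic than embed-once-at-ingest plus cosine top-K at query time. The snippet below assumes the block embeddings already sit in a NumPy matrix; the function name and shapes are illustrative, not our internal API.

    import numpy as np

    def top_k_dense(query_vec: np.ndarray, block_vecs: np.ndarray, k: int = 20):
        """Return (block index, cosine score) pairs for the top-k blocks."""
        # Normalize both sides so the dot product is cosine similarity.
        q = query_vec / np.linalg.norm(query_vec)
        b = block_vecs / np.linalg.norm(block_vecs, axis=1, keepdims=True)
        scores = b @ q
        top = np.argsort(scores)[::-1][:k]
        return [(int(i), float(scores[i])) for i in top]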

BM25 retrieval. A standard BM25 index over the same blocks. Tokenization preserves case-sensitive terms (AES-256, SOC 2, FedRAMP) and uses a small custom analyzer that doesn't fold away version-string punctuation. Top-K candidates by BM25 score are returned.
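
A sketch of that leg, using the rank_bm25 package for illustration. The regex is an assumption about what "don't fold away version-string punctuation" can look like, not our actual analyzer.

    import re
    from rank_bm25 import BM25Okapi

    def tokenize(text: str) -> list[str]:
        # Keep identifiers like AES-256, FedRAMP, Postgres 16, v2.1.3 intact
        # instead of splitting on every hyphen and dot; case is preserved.
        return re.findall(r"[A-Za-z0-9](?:[A-Za-z0-9.\-]*[A-Za-z0-9])?", text)

    blocks = [
        "All customer data at rest is encrypted with AES-256.",
        "We hold a SOC 2 Type II report, renewed annually.",
    ]
    bm25 = BM25Okapi([tokenize(b) for b in blocks])
    scores = bm25.get_scores(tokenize("Are you SOC 2 Type II?"))  # one score per block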

Blend. Scores from each are normalized to [0, 1] using the highest score in each retriever’s result set as the denominator. Final score is a weighted sum:

final = w_dense * dense_norm + w_bm25 * bm25_norm

The merged candidate set is the top-K by final. Ties go to dense.
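
A minimal sketch of that merge, assuming each retriever hands back a dict of block ID to raw score. The weights are the ones discussed in the next section; the function and variable names are illustrative, not our production code.

    W_DENSE, W_BM25 = 0.65, 0.35

    def blend(dense: dict[str, float], bm25: dict[str, float], k: int = 20):
        """dense/bm25 map block_id -> raw score; returns top-k (block_id, final)."""
        # Max-normalize each retriever's scores to [0, 1].
        d_max = max(dense.values(), default=1.0) or 1.0
        b_max = max(bm25.values(), default=1.0) or 1.0
        merged = {}
        for block_id in set(dense) | set(bm25):
            d = dense.get(block_id, 0.0) / d_max
            b = bm25.get(block_id, 0.0) / b_max
            # Second element breaks ties toward the higher dense score.
            merged[block_id] = (W_DENSE * d + W_BM25 * b, d)
        ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
        return [(block_id, score) for block_id, (score, _) in ranked[:k]]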

The blend ratio

We currently run w_dense = 0.65, w_bm25 = 0.35.

This is not a universal best — it’s tuned on our test set, and the test set is biased toward the kinds of queries proposals generate. The ratio reflects four observations.

Numeric identifiers need BM25 at around 0.35. When BM25 carries 35% of the weight, blocks whose relevance turns on a discrete token (encryption algorithm names, SOC types, port numbers, version strings) reliably surface above paraphrase-only matches. Dropping BM25 below 0.30 produces visible regressions on the numeric-identifier slice of our test set.

Paraphrase coverage requires dense to dominate. At 0.65 dense weight and above, paraphrase queries hit reliably. Below 0.55, the retriever starts failing on synonym-only matches like the encryption example above.

The shape matters more than the exact numbers. The blend is not sharply tuned: 0.60/0.40 and 0.70/0.30 both produce acceptable test-set performance. We picked 0.65/0.35 because it keeps BM25 comfortably above the 0.30 floor where numeric-identifier regressions start while leaving headroom on the paraphrase side.

Per-customer tuning is in our backlog. Some customer corpora are more numeric (regulated industries with heavy compliance vocabulary); some are more narrative (consulting, legal). We have not shipped per-customer tuning; we run the global ratio. The work is worth doing eventually.

Why pure-dense was tempting and wrong

When we were prototyping the retrieval engine, pure-dense looked attractive. The model is the modern thing. BM25 is from 1994. Embedding models have been catching up on identifier-heavy queries for a few years now, and the conventional wisdom in some RAG circles is that they’re “good enough” to skip the BM25 stage.

The conventional wisdom doesn’t survive contact with proposal data. Three patterns broke pure-dense in the first month of evaluation:

  1. SOC 2 Type II vs Type I. The embedding for the question “Are you SOC 2 Type II?” sits close to the embedding for “We are SOC 2 Type I”; the distinction BM25 draws from the exact Type token barely registers in embedding space. A pure-dense retriever returned the wrong block for this query 8 times out of 100 in our held-out set. With BM25 weighted in, the failure rate dropped to under 1%.
  2. Version strings. “Java 8” and “Java 11” embed close. So do “Postgres 13” and “Postgres 16”. Customers responding to RFPs about specific tech-stack support get burned by version drift. BM25 catches it.
  3. Acronym-heavy queries. “FedRAMP Moderate” embeds close to “FedRAMP Low” and “FedRAMP High”. The discrete token differentiates. The embedding does not.

These are not exotic queries. They’re typical of the questions DDQs and security questionnaires ask.

The G2 review pattern of “search is terrible” against incumbents in this category — common across Responsive reviews — likely has the same root. Their search is keyword-only or near-keyword-only. Ours has the dense layer to handle paraphrase. The pure-keyword vendors have the inverse problem from the pure-dense ones, but the user-visible failure looks similar: the search returns the wrong thing.

Tuning

The ratio is tuned against a test set we maintain in-house: 1,400 query-block-relevance triples, drawn from real customer questions and real customer KBs (with permission, redacted), labeled by the proposal team for whether the block actually answers the query.

Three slices of the set carry most of the signal:

  • Numeric identifier slice (220 triples). Queries that turn on a numeric or version-coded fact. The slice that BM25 carries.
  • Paraphrase slice (480 triples). Queries with no token overlap with the relevant block. The slice that dense carries.
  • Mixed slice (700 triples). Queries that combine a discrete-token signal with paraphrase. The slice the blend has to handle.

We re-run the test set on every retriever change (model upgrade, BM25 analyzer change, weight adjustment) and gate the change on no slice regressing more than 1 percentage point. The gate has caught two regressions in the last four months — one when we upgraded the embedding model (numeric slice fell 4 points; we kept the old model for that customer cohort while we evaluated alternatives) and one when we changed the BM25 tokenizer (paraphrase slice fell unexpectedly because the new tokenizer was folding away some terms the old one preserved as multi-token features).
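
A sketch of what that gate reduces to, with made-up slice metrics (recall expressed in percentage points) purely for illustration; the slice names follow the list above.

    GATE_POINTS = 1.0  # no slice may regress by more than one percentage point

    def gate(baseline: dict[str, float], candidate: dict[str, float]):
        """baseline/candidate map slice name -> metric in percentage points."""
        regressions = []
        for name, before in baseline.items():
            after = candidate.get(name, 0.0)
            if before - after > GATE_POINTS:
                regressions.append(f"{name}: {before:.1f} -> {after:.1f}")
        return not regressions, regressions

    ok, details = gate(
        {"numeric": 92.0, "paraphrase": 88.5, "mixed": 90.1},  # last shipped run
        {"numeric": 88.0, "paraphrase": 89.0, "mixed": 90.3},  # candidate: numeric fell 4 points
    )
    # ok is False; details == ["numeric: 92.0 -> 88.0"], so the change is blocked.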

The test set is internal but the methodology is public, and we’ve discussed test-set construction in our Grounded-AI Pledge enforcement post. A customer who wants to know whether our retriever performs on their corpus can run their own version of the same test on a sample of their KB.

Citation stability under model upgrades

One subtlety. Citations in PursuitAgent point to specific KB blocks by ID. When we upgrade the embedding model, the dense scores change — the same query-block pair may rank differently than it did yesterday. If a customer’s reviewer tomorrow re-runs a query they ran today, they could see a different top result.

We handle this by locking each customer’s KB to an embedding model version. The lock is broken only on a deliberate re-embed, which is a customer-visible operation. The audit log records the model change. A reviewer who re-runs a query after a re-embed sees a flag indicating the corpus was re-indexed since the last query.

This is part of why customers shouldn’t experience the retrieval engine as a moving target. The blend ratio is global; the model version is per-customer; citations resolve to a specific version.

What’s still open

Two known limitations.

Per-customer ratio tuning. Mentioned above. Some customer corpora would do better at 0.70/0.30 (heavily narrative content) and some at 0.55/0.45 (compliance-heavy). We don’t tune yet.

Cross-encoder reranking. A second-stage cross-encoder reranker over the top-K candidates is a known win in the RAG literature, particularly for the kind of paraphrase-heavy queries proposals generate. We have a research branch implementing it; the latency cost is the open question. A reranker that adds 200ms to retrieval is acceptable in the drafting flow; one that adds 800ms is not.
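
For what it’s worth, the research branch has roughly the following shape, sketched here with a public cross-encoder from sentence-transformers. The model name and the latency measurement are illustrative assumptions, not what we ship.

    import time
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[tuple[str, str]]):
        """candidates are (block_id, block_text) pairs from the hybrid top-K."""
        start = time.perf_counter()
        scores = reranker.predict([(query, text) for _, text in candidates])
        elapsed_ms = (time.perf_counter() - start) * 1000
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        # The open question is elapsed_ms: ~200 ms is acceptable in the
        # drafting flow, ~800 ms is not.
        return [c for c, _ in ranked], elapsed_ms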

The takeaway

Hybrid is not exotic. It is the boring, correct answer for proposal retrieval. Pure dense fails on the discrete-token vocabulary that fills compliance and security content. Pure BM25 fails on paraphrase. The blend at 0.65/0.35 works on our test set; the methodology — labeled test set with three slices, gated on no slice regressing — is what keeps it working as the underlying models change.

If your current proposal software gives you results that look “loosely related,” you are probably running a pure-keyword retriever. If your current AI proposal product gives you results that get the version number wrong, you are probably running pure-dense. Hybrid catches both.

Sources

  1. G2 — Responsive (formerly RFPIO) reviews — search complaints
  2. PursuitAgent — How the Grounded-AI Pledge is enforced in code