The reranker that paid for itself
Rerankers add latency and cost. They earn it back when retrieval is borderline and the wrong block in the top-K poisons the draft. Where we run a reranker, where we do not, and the honest tradeoffs.
A reranker is a small model that takes the top-K candidate blocks from retrieval and reorders them based on a more careful relevance computation. It costs more per query than the initial retrieval — usually two to ten times more, depending on the model — and it adds 100 to 400 milliseconds of latency. In exchange, the top-1 candidate after reranking is the right candidate meaningfully more often than it was before.
Whether the trade is worth it depends on what you are doing with the top-1.
For us, in the grounded-drafting loop, the answer is “yes, on a subset of questions, with real cost-benefit math behind the subset boundary.” This post walks through the math.
A note on the numbers. The accuracy, latency, and cost figures below are directional. They come from our internal eval harness and our production cost tracking, against our own corpora and workloads. They are not a general benchmark and we would not expect the specific numbers to reproduce on an external setup. We will publish the harness and a public gold set when both are stable; until then, read the shape (when rerank pays, when it doesn’t, how to gate it) as the durable content.
What a reranker actually does
Initial retrieval is fast and approximate. We use hybrid retrieval (dense vector similarity plus BM25 sparse matching), which scores blocks against the query in milliseconds. The score is a blend of semantic and lexical signals. It works well on average. It struggles when the question and the source have different surface vocabulary for the same concept, or when multiple blocks score similarly and the ordering is noisy.
A reranker addresses both. It takes the query paired with each candidate block, runs each pair through a dedicated cross-encoder model (or a small LLM prompted to score relevance), and produces a more careful relevance score per pair. The cross-encoder reads the query and the block jointly, which lets it capture relevance signals that the initial retrieval (which scored them independently) missed.
The output is a reordered top-K. The block that was rank 4 after initial retrieval may be rank 1 after rerank. The drafting step then consumes the reranked top, not the original.
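As a concrete sketch of the mechanics, here is the reorder step using an open-source cross-encoder via sentence-transformers. The model name is illustrative and this is not our production path (we use a hosted rerank model there), but the shape is the same: score each (query, block) pair jointly, then sort.

```python
# Rerank sketch using sentence-transformers. The model name is
# illustrative, not our production choice.
from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and the block jointly, per pair.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, blocks: list[str]) -> list[tuple[str, float]]:
    """Score each (query, block) pair; return blocks in descending score order."""
    pairs = [(query, block) for block in blocks]
    scores = model.predict(pairs)  # one relevance score per pair
    return sorted(zip(blocks, scores), key=lambda bs: bs[1], reverse=True)

# The drafting step consumes the reranked top, not the original ordering:
# top_block, top_score = rerank(query, top_k_blocks)[0]
```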
Where it earns its keep
The reranker earns its cost on questions where the initial retrieval is borderline. Three patterns where this happens.
Compressed questions. “Describe your approach” with implicit context. The initial retrieval pulls a noisy top-K because the lexical signal is weak. Reranking, with the rewritten query and the top-K candidates, sharply pulls the actually-relevant block to the top. Most of the time this works because the query rewriting step has already added context; the reranker is the second-stage refinement.
Vocabulary mismatch. The question uses one term (“data migration”), the corpus uses another (“cutover” or “transition”). Initial retrieval scores them similarly low. The reranker, with a model that has stronger paraphrase awareness, can distinguish a block that genuinely covers the topic (under different vocabulary) from a block that lexically matches without semantic relevance.
Tie-breaking in the top-K. Multiple blocks score within 0.02 of each other on initial retrieval. The drafting step is going to pick the top-1 (or compose from the top-2). Which one ends up at rank 1 matters disproportionately. Reranking produces a more confident ordering even when initial retrieval was indecisive.
On these questions, reranking moves the fraction of questions where the top-1 is the right block from the low-to-mid 0.7 range to the mid-0.8 range, measured against our internal held-out evaluation set. The usual caveats apply: the set is finite, the gains are not uniform across question types, and the exact numbers move when the set grows or the reranker is retuned. The improvement is real and reproducible run-to-run on our corpora.
Where it does not earn its keep
Three patterns where the reranker is paying cost for no gain.
Specific factual questions. “What is the maximum file size your API accepts on a POST to /v1/documents?” The initial retrieval is decisive — the top-1 score is well above the second-place score, and the top block is unambiguously the right one. Reranking does not change the order. We are spending the reranker’s cost for a no-op.
Already-decisive retrievals. Even on non-specific questions, the initial retrieval is sometimes confident. When the top-1 score is far from the top-2, reranking rarely changes the order. We can detect this from the score gap and skip the reranker.
Refusal-bound questions. Questions where the corpus does not contain the answer at all. Initial retrieval returns blocks with low scores; reranking reorders low-scoring blocks. The output of the loop is going to be a refusal regardless. Spending the reranker’s cost on a question that will refuse is wasted cost.
These cases are most of the volume. Reranking everything is expensive and pays back less than reranking selectively.
Our gating rule
We rerank when one of three conditions holds.
Condition 1. The top-1 retrieval score is below a “decisive” threshold. If the top-1 is below the threshold, the retrieval is borderline and the reranker has work to do.
Condition 2. The score gap between top-1 and top-3 is under 0.03 (on a 0-to-1 score scale). If the top three are bunched, reranking helps tie-break.
Condition 3. The query rewriting step produced a substantively different query than the original. The initial retrieval ran on the rewritten query, but a heavy rewrite is a signal that the original question was compressed or ambiguous, and on those questions the initial ordering is less trustworthy, so reranking is more likely to help.
Roughly 35 to 45 percent of questions in our typical mix hit one of these conditions and trigger rerank. The remaining 55 to 65 percent skip it. This is the cost-benefit boundary.
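In code, the gate is a few lines. A sketch of the rule as stated: the tie gap is the 0.03 from condition 2; the decisive threshold and the rewrite-change signal are stand-ins you would tune for your own setup, since neither value is given above.

```python
# Gate for the three rerank conditions. DECISIVE_THRESHOLD is a
# placeholder, not our production value; set it from your own score
# distribution. rewrite_changed stands in for "the rewriting step
# produced a substantively different query", which is its own
# judgment call (e.g. token overlap below some bound).
DECISIVE_THRESHOLD = 0.60  # illustrative
TIE_GAP = 0.03             # condition 2, on a 0-to-1 score scale

def should_rerank(scores: list[float], rewrite_changed: bool) -> bool:
    """scores: initial retrieval scores, sorted descending, len >= 3."""
    borderline = scores[0] < DECISIVE_THRESHOLD       # condition 1
    bunched = (scores[0] - scores[2]) < TIE_GAP       # condition 2
    return borderline or bunched or rewrite_changed   # condition 3
```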
What it costs
At our current model choice (a Cohere rerank-class model, with a fallback to a Haiku-class LLM-as-reranker for some question types), the cost per rerank call is roughly $0.001 to $0.003. Latency adds 100 to 250 milliseconds.
For a 100-question response with 40 questions hitting rerank: 40 × ~$0.002 = ~$0.08 in additional cost. Latency adds maybe four to ten seconds in total wall-clock, distributed across questions. Both are inside our envelope.
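The same arithmetic as a snippet, if you want to plug in your own rate and per-call figures (all values here are the ones quoted above):

```python
# Back-of-envelope batch cost and latency for selective reranking.
n_questions = 100
rerank_rate = 0.40                         # fraction hitting the gate
cost_per_call = 0.002                      # dollars, midpoint of $0.001-$0.003
latency_low, latency_high = 0.100, 0.250   # seconds per rerank call

calls = n_questions * rerank_rate
print(f"added cost: ${calls * cost_per_call:.2f}")   # ~$0.08
print(f"added latency: {calls * latency_low:.0f}-"
      f"{calls * latency_high:.0f} s total")         # 4-10 s wall-clock
```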
The earlier cost-per-response post folds reranker cost into the broader retrieval line item; it is not yet broken out separately. It probably should be, in a future revision of that post.
What we measured before turning it on
When we evaluated whether to enable reranking by default, we ran four scenarios on the held-out evaluation corpus.
- No rerank. Baseline. Top-1 was the right block roughly 73% of the time across the full distribution.
- Rerank everything. Top-1 right roughly 84%. Cost roughly 3x baseline retrieval cost.
- Rerank only on borderline retrievals (condition 1). Top-1 right roughly 81%. Cost roughly 1.4x baseline.
- Rerank on all three conditions (current production). Top-1 right roughly 83%. Cost roughly 1.7x baseline.
The numbers above are on our internal eval; they will not generalize to other corpora and we are not claiming they do. The shape of the result was: “rerank everything” wins on accuracy by a small margin over “rerank selectively,” but the cost ratio is meaningfully worse and the marginal accuracy gain comes from questions that would have refused either way.
The gating rule produces most of the accuracy gain at less than half the cost. We turned it on with that configuration.
What still goes wrong
Reranker hallucinations on long blocks. When a candidate block is long (1,200+ tokens), the LLM-as-reranker sometimes scores it high based on partial relevance — the first half of the block is on-topic, the second half drifts. The downstream drafter then has to deal with a block whose top score did not come from the part of the block that gets used. Mitigation: we chunk more aggressively before retrieval, which makes the blocks shorter and the rerank more accurate.
Reranker disagreement with verifier. Occasionally the reranker promotes a block to rank 1 based on relevance scoring, but the entailment verifier later refuses the draft because the block did not actually support the specific claim. The reranker sees relevance; the verifier checks support; relevance does not equal support. We covered this distinction in the Pledge-in-code post. The fix is not to weaken the verifier — it is to accept that some reranker promotions correctly identify the most relevant block and the corpus genuinely does not contain a supporting source.
Latency for the user. Reranking adds time to the loop. On small, fast questions where the user is iterating in the proposal UI, the added latency is noticeable. We are evaluating a “fast mode” that skips reranking for live editing flows and only runs it on batch operations (like ingesting a 300-question questionnaire).
When to turn it off
If your retrieval is consistently decisive (the top-1 score is well separated from the rest), the reranker is paying cost for no gain. The data tells you this — look at the score-gap distribution in your retrieval logs; a sketch of that check appears below.
If your KB is small (fewer than ~5,000 chunks) and the corpus is highly specific to a narrow topic, the reranker likely does not help. The initial retrieval has so few candidates that vocabulary mismatch is rare and ordering is mostly correct.
If the latency cost is unacceptable for your workflow (live editing, sub-second response targets), skip the reranker on that path and run it only on batch.
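The score-gap check from the first point above, as a sketch against retrieval logs. The JSONL format with a "scores" field is an assumption; adapt the parsing to however your pipeline logs retrieval scores.

```python
# Measure how bunched your initial retrievals are, from logs.
# Assumes JSONL with a "scores" field holding the top-K initial
# retrieval scores in descending order.
import json

def gap_report(log_path: str, tie_gap: float = 0.03) -> None:
    gaps = []
    with open(log_path) as f:
        for line in f:
            scores = json.loads(line)["scores"]
            if len(scores) >= 3:
                gaps.append(scores[0] - scores[2])
    gaps.sort()
    bunched = sum(1 for g in gaps if g < tie_gap)
    print(f"queries: {len(gaps)}")
    print(f"median top1-top3 gap: {gaps[len(gaps) // 2]:.3f}")
    print(f"bunched (gap < {tie_gap}): {bunched / len(gaps):.0%}")

# A small bunched fraction plus well-separated top-1 scores means
# the reranker would mostly be a no-op on your corpus.
```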
The short version
Rerankers are a real cost and a real benefit. They earn their keep on borderline retrievals; they are dead weight on decisive ones. A simple gating rule (rerank when initial retrieval is borderline) captures most of the accuracy gain at a fraction of the always-on cost.
We run them in production with the gate. The tradeoff math holds at our current question mix. If the mix shifts — more decisive questions, fewer borderline ones — we will revisit. The point of writing this is not to argue that reranking is always right. It is to show the math we used to decide when it is.
That math is reproducible from your own retrieval logs. If you are running a RAG pipeline and you have not yet checked whether reranking pays for itself in your specific corpus, the experiment is a half day of work and the answer might surprise you.