Field notes

Draft latency, a year on: 45s P95 to 28s

A year of draft-latency work. What moved P95 from 45 seconds to 28, which changes cost quality and which cost money, and the three tradeoffs we chose not to take.

A year ago the P95 latency for a grounded draft on a 500-word RFP question was 45 seconds. Today it is 28 seconds. This post is the detail on what moved it, which changes cost quality, which cost money, and the three tradeoffs we chose not to take.

The framing to keep in mind throughout: draft latency is not a pure engineering metric. It is a quality-and-cost metric wearing an engineering costume. Every change that makes drafting faster turns a knob on either quality or cost. The interesting question is which knob you turned.

The starting position

April 2025. P50 latency of 18 seconds, P95 of 45, P99 of 80. The pipeline at the time was a single generation pass that retrieved context, generated the draft, and attached citations. Straightforward. The tail was ugly because complex multi-part questions would trigger multi-round retrieval and the generation pass would block on the slowest retrieval.
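For orientation, a minimal sketch of that starting shape, with stand-in functions for the real stages (the names are illustrative, not our API). The point is the blocking structure: generation cannot start until the slowest retrieval round has returned.

```python
import time

def retrieve_round(question: str, round_no: int) -> list[str]:
    time.sleep(2.0)  # stand-in: each retrieval round costs ~2s
    return [f"passage {round_no} for {question!r}"]

def draft_v0(question: str, rounds: int = 3) -> str:
    context: list[str] = []
    for r in range(rounds):            # multi-round retrieval, sequential
        context.extend(retrieve_round(question, r))
    draft = f"draft grounded in {len(context)} passages"  # single generation pass
    return draft + " [citations attached]"                # citation attachment
```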

The target we set for the year was P95 under 30 seconds, without regressing on hallucination rate or citation density. That is a real constraint: a latency number that comes from cutting verification is not a latency number; it is a quality cut that happened to make the clock run faster.

The changes that moved the number

Five, in order of impact.

One: parallel retrieval across sub-queries. The query understanding post describes how we split compound questions into sub-queries. A year ago those sub-queries ran sequentially. We made them parallel. A typical 3-part compound question went from 6 seconds of retrieval to 2. A simple win, with no quality tradeoff.
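The change itself is small. A hedged sketch with asyncio, using a `retrieve` coroutine as a stand-in for the real retriever: the sequential loop becomes a single gather, so wall-clock retrieval time is the slowest sub-query rather than the sum.

```python
import asyncio

async def retrieve(sub_query: str) -> list[str]:
    await asyncio.sleep(2.0)  # stand-in: each retrieval takes ~2s
    return [f"passage for {sub_query!r}"]

async def retrieve_parallel(sub_queries: list[str]) -> list[list[str]]:
    # Fan out all sub-queries at once; a 3-part question now costs one
    # retrieval's worth of wall clock (~2s) instead of the sum (~6s).
    return await asyncio.gather(*(retrieve(q) for q in sub_queries))

# asyncio.run(retrieve_parallel(["scope", "pricing", "SLA terms"]))
```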

Two: streaming the draft into the UI as it generates. Not a latency change for the full draft, but a perceived-latency change for the user. Time-to-first-token dropped from 6 seconds to 1.1. The P95 for “full draft done” is still 28 seconds, but the experience is very different when the user sees the draft start appearing in under two seconds.
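A sketch of the perceived-latency mechanics, with a toy token generator standing in for the model: the full-draft clock does not change, but the first token reaches the user almost immediately.

```python
import asyncio
import time

async def generate_tokens(prompt: str):
    # Stand-in for a streaming model call: tokens arrive as they decode.
    for tok in ["The ", "vendor ", "will ", "provide ", "..."]:
        await asyncio.sleep(0.05)
        yield tok

async def stream_draft(prompt: str) -> str:
    start, first, out = time.monotonic(), True, []
    async for tok in generate_tokens(prompt):
        if first:
            # Time-to-first-token is the number the user actually feels.
            print(f"TTFT: {time.monotonic() - start:.2f}s")
            first = False
        out.append(tok)  # in the product this is pushed to the UI
    return "".join(out)

# asyncio.run(stream_draft("Describe your SLA."))
```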

Three: smaller verification model. Per-claim verification used to run on the same model as drafting. We moved it to a distilled model that is 4x faster and 10x cheaper. Agreement with the larger model’s verification labels is 94%, which we validated against human audit and accepted. The 6% disagreement cases get routed to a larger model as a second pass, so the worst case is no worse than before.
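A sketch of the two-tier routing, under one assumption the post does not spell out: that likely-disagreement cases are identified by the small model's own confidence. The verifier stand-ins and the threshold are illustrative, not our production logic.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    supported: bool
    confidence: float

def verify_small(claim: str, evidence: str) -> Verdict:
    # Stand-in for the distilled verifier (~4x faster, ~10x cheaper):
    # treat a verbatim claim-in-evidence match as strong support.
    hit = claim.lower() in evidence.lower()
    return Verdict(supported=hit, confidence=0.99 if hit else 0.7)

def verify_large(claim: str, evidence: str) -> Verdict:
    # Stand-in for the full-size verifier used as the second pass.
    return Verdict(supported=claim.lower() in evidence.lower(), confidence=0.99)

ESCALATE_BELOW = 0.9  # hypothetical threshold, tuned on the human audit set

def verify(claim: str, evidence: str) -> Verdict:
    v = verify_small(claim, evidence)
    if v.confidence < ESCALATE_BELOW:
        # Low-confidence verdicts get a second pass on the larger model,
        # so the worst case is no worse than the old single-tier setup.
        return verify_large(claim, evidence)
    return v
```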

Four: KV-cache reuse across sub-sections of the same draft. Drafting a long response section involves multiple generation passes. The prefix of each pass is substantially similar to the previous pass. Cache reuse across the passes cut generation time on long sections by about 25%. This was the hardest of the five to build correctly and the one we were most worried would break something subtle. It did not.
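A sketch of the reuse idea at the application level, with hypothetical stand-ins for the inference engine (`encode_prefix` and `generate_with_prefix` are not real APIs): the shared prompt prefix is prefilled once per draft and keyed by content hash, so later section passes skip the repeated prefill.

```python
import hashlib

def encode_prefix(prefix: str) -> str:
    # Stand-in for the engine's prefill; in reality this yields KV tensors.
    return f"<kv state for {len(prefix)} chars>"

def generate_with_prefix(kv: str, section_prompt: str) -> str:
    # Stand-in for decoding on top of cached KV state.
    return f"[section draft from {section_prompt!r} on {kv}]"

class PrefixCache:
    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def generate(self, prefix: str, section_prompt: str) -> str:
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = encode_prefix(prefix)  # expensive, done once
        # Every later pass over the same prefix reuses the cached state;
        # only the section-specific suffix is prefilled and decoded.
        return generate_with_prefix(self._cache[key], section_prompt)
```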

Five: early-exit when confidence is high. Our drafting loop used to always run a full verification pass. For claims where the verification model’s confidence is very high (>0.98) and the retrieved evidence is a verbatim match for the claim’s numerics, the loop now exits verification early. This covers about 30% of claims and saves the verification time on those.
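A sketch of the gate, assuming one simple reading of "verbatim match for the claim's numerics": every numeric token in the claim must appear literally in the retrieved evidence.

```python
import re

CONFIDENCE_EXIT = 0.98  # the threshold from the post

def numerics(text: str) -> set[str]:
    # Numeric tokens we require to match verbatim (counts, prices, percents).
    return set(re.findall(r"\d+(?:\.\d+)?%?", text))

def can_exit_early(claim: str, evidence: str, confidence: float) -> bool:
    # Skip the full verification pass only when the verifier is very
    # confident AND every number in the claim appears in the evidence.
    # In our pipeline this fast path covers about 30% of claims.
    return confidence > CONFIDENCE_EXIT and numerics(claim) <= numerics(evidence)
```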

What the changes cost

In measurable quality: nothing. The hallucination rate over the same year went from 2.3% to 0.9% (see the hallucination-rate update post). Latency and quality improved together, which is the only kind of latency win worth taking.

In engineering time: roughly 4 engineer-months across the year. Not dominant; the team shipped a lot of other things.

In LLM spend: total LLM spend per draft is down 15% despite the per-claim verification gate being tightened. The distilled verification model did most of that work.

In complexity: higher. The pipeline now has more moving pieces — parallel retrieval, cache layer, two-tier verification, confidence-gated early exit. Complexity cost is real, and we track it by how often a production incident traces back to an interaction between components that weren’t designed together. In the past 12 months, that happened twice.

Three tradeoffs we chose not to take

Not taken: reduce the per-claim verification pass to improve latency. We could cut roughly 4 seconds off P95 by skipping verification on claims with the highest-confidence citations. We measured the quality hit and it was non-trivial — the top-confidence claims are mostly correct, but “mostly” includes a non-zero rate of false-positive citations that verification catches. The 4 seconds of latency was not worth the re-introduction of that failure mode.

Not taken: use a smaller drafting model. The drafting model dominates the latency budget. A smaller model would cut P95 meaningfully. We ran the experiment; the smaller model’s drafts were materially worse on the sections that attract the most reviewer edits (see the heatmap post). The saved latency was spent in review. Net-negative for the user.

Not taken: batch requests across customers. Batching would improve throughput and reduce cost, but would increase tail latency for the user whose request got stuck at the end of a batch window. In a product where users are watching a draft generate, tail latency is the latency the user experiences. Not a trade we want.

The Hacker News discussion on reverse RAG makes a related point about per-claim verification at scale: the economics are tight and you can always cut quality to save time. The discipline is choosing not to. Our draft-latency work was, in retrospect, a year of choosing not to cut quality and finding other ways to move the clock.

What comes next

The easy wins are gone. The next 5 seconds of P95 reduction will be harder than the 17 we just banked. Three candidates on the board:

Candidate one: speculative decoding on the drafting model. Model-level technique. Expected impact: 15-20% on drafting time, which is roughly 3-5 seconds on P95. Risk: quality regression on edge cases; has to be measured carefully.
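For readers unfamiliar with the technique, a heavily simplified sketch of why it saves time. The published method uses a probabilistic accept/reject rule and a single batched target-model pass; this toy greedy version shows only the mechanism: a cheap draft model proposes several tokens and the expensive target model validates them in bulk.

```python
from typing import Callable

def speculative_step(
    draft_next: Callable[[list[str]], str],   # cheap model: next token
    target_next: Callable[[list[str]], str],  # expensive model: next token
    prefix: list[str],
    k: int = 4,
) -> list[str]:
    # Propose k tokens cheaply with the draft model.
    ctx, proposed = list(prefix), []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Validate against the target model; keep the agreeing run, then take
    # the target's token at the first divergence. When the models agree
    # often, one expensive pass yields several tokens.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        t = target_next(ctx)  # in practice: one batched target forward pass
        if t == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(t)
            break
    return accepted
```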

Candidate two: prefetching retrieval based on partial question parsing. Start retrieval while the user is still typing the question. Expected impact: another 1-2 seconds on time-to-first-token. Risk: cost of wasted retrievals on partial questions that change.
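A sketch of the prefetch pattern, assuming an async retriever and a stream of partial question texts (both hypothetical here): each new keystroke cancels the in-flight prefetch, and the debounce delay is the knob trading wasted retrievals against head start.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

async def _debounced(partial: str,
                     retrieve: Callable[[str], Awaitable[list[str]]],
                     delay: float) -> list[str]:
    await asyncio.sleep(delay)  # a newer keystroke cancels us during this wait
    return await retrieve(partial)

async def prefetch_while_typing(
    partials: AsyncIterator[str],
    retrieve: Callable[[str], Awaitable[list[str]]],
    debounce_s: float = 0.5,
) -> list[str] | None:
    task: asyncio.Task | None = None
    async for partial in partials:
        if task is not None:
            task.cancel()  # newer text supersedes the in-flight prefetch
        task = asyncio.create_task(_debounced(partial, retrieve, debounce_s))
    return await task if task else None
```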

Candidate three: a drafting-time cache for commonly-cited block combinations. Many drafts cite overlapping sets of blocks. A cache keyed on block-set fingerprints could skip some work. Expected impact: uncertain, probably 1-2 seconds on P95 in common cases. Risk: cache invalidation is hard, especially with versioned KB blocks.
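The fingerprint itself is the easy part. A sketch, assuming KB blocks carry version numbers as the invalidation handle: the key is order-independent over the cited block set, and bumping a block's version naturally misses the cache.

```python
import hashlib

def block_set_fingerprint(block_ids: list[str], version_of: dict[str, int]) -> str:
    # Order-independent key over the cited blocks. Including each block's
    # version means an edited KB block changes the key, so stale cached
    # work is never served. The hard part the post flags is everything
    # else about invalidation, not the key itself.
    parts = sorted(f"{b}@{version_of[b]}" for b in set(block_ids))
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```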

We will ship one or two of these. P95 under 25 seconds by next April is a reasonable target. Under 20 is probably not, given the verification gate is non-negotiable and the drafting model dominates the remaining budget.

The honest summary of the year: the latency work that mattered was the work that found time without cutting quality. The work that would have cut quality was cheaper and we did not take it. In a grounded-AI product, that is not just a principle. It is the feature.

Sources

  1. Hacker News — Mayo Clinic's reverse RAG