Draft autocomplete latency, end to end
Typing lag, inference queue, streaming output. The three budgets that add up to the 240ms P95 we hold ourselves to, and what happens when any one of them slips.
The draft autocomplete feature is the one where a writer types a prompt in the proposal editor — “summarize our SOC 2 posture in three sentences for a fintech buyer” — and the editor streams back an answer grounded in the KB. It is not a chat interface. It is not a one-shot request/response. It is a typing loop where every keystroke after the trigger can change the suggestion the model is producing.
The user-facing bar is that the feedback from keystroke to first visible token feels immediate enough not to break the writer's flow. That translates to a P95 latency budget of 240ms from keystroke to first token. Below 240ms, the interaction feels like the editor is keeping up with the writer. Above it, the writer notices the lag and stops trusting the suggestion.
This post walks through where those 240ms go and what we do when any one of the budgets slips.
The three budgets
240ms is not a single budget. It is three budgets stacked end to end, each with a warm-path target and a ceiling.
Budget 1 — the typing-debounce window (60ms). We do not fire an inference request on every keystroke. We debounce — 60ms after the last keystroke, if nothing new has arrived, the request fires. 60ms is short enough that a writer who has stopped typing does not wait for the suggestion; long enough that a writer typing at normal speed does not fire a new request per character.
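A minimal sketch of that debounce in TypeScript. The 60ms constant is the real budget; the function names are placeholders, not our client code.

```ts
const DEBOUNCE_MS = 60;

let pending: ReturnType<typeof setTimeout> | null = null;

function onKeystroke(currentFragment: string): void {
  // Every keystroke cancels the previous timer; a request fires only once
  // the writer has paused for the full debounce window.
  if (pending !== null) {
    clearTimeout(pending);
  }
  pending = setTimeout(() => {
    pending = null;
    fireCompletionRequest(currentFragment); // hypothetical request path
  }, DEBOUNCE_MS);
}

// Hypothetical stand-in for the real request path.
function fireCompletionRequest(fragment: string): void {
  console.log(`requesting completion for: ${fragment}`);
}
```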
Budget 2 — retrieval and prompt assembly (100ms target, 140ms ceiling). The editor sends the current proposal context to the retrieval layer. The retrieval layer pulls the top N blocks from the KB that match the current sentence and the section the writer is in. The prompt is assembled — system prompt, style profile, retrieved blocks, the current fragment. This step is 100ms on a warm path, up to 140ms if the retrieval index was cold for this tenant.
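For illustration, the assembled prompt can be pictured as a small structure like the one below. The field names and the top-N selection are assumptions about the shape, not the actual schema.

```ts
interface KbBlock {
  id: string;
  text: string;
  score: number; // retrieval similarity score
}

interface AssembledPrompt {
  system: string;
  styleProfile: string;
  retrievedBlocks: KbBlock[];
  fragment: string; // the sentence the writer is midway through
}

function assemblePrompt(
  system: string,
  styleProfile: string,
  candidates: KbBlock[],
  fragment: string,
  topN = 5, // hypothetical N; the real value is tuned, not fixed here
): AssembledPrompt {
  // Keep only the highest-scoring blocks; everything else is dropped to
  // keep the prompt small enough for the latency budget.
  const retrievedBlocks = [...candidates]
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
  return { system, styleProfile, retrievedBlocks, fragment };
}
```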
Budget 3 — inference queue to first token (80ms target, 140ms ceiling). The assembled prompt goes to the LLM API. We use streaming — see OpenAI’s streaming reference and Anthropic’s equivalent — so the time we care about is time-to-first-token, not time-to-completion. On the warm path, a small-model completion returns its first token in 80ms. On a queued-request path it can take 140ms.
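Measuring that number is straightforward once the response streams. A sketch, assuming a generic streaming HTTP endpoint rather than any specific provider's API:

```ts
async function streamWithTtft(url: string, payload: unknown): Promise<number> {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.body) {
    throw new Error("response is not streamable");
  }

  const reader = res.body.getReader();
  let ttftMs = -1;
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    if (ttftMs < 0 && value && value.length > 0) {
      // First streamed chunk: the 80ms/140ms inference budget is written
      // against this moment, not against time-to-completion.
      ttftMs = performance.now() - start;
    }
    // ...decode the chunk and append tokens to the suggestion here...
  }
  return ttftMs;
}
```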
The budgets stack deliberately: 60 + 100 + 80 = 240ms on the warm path. 60 + 140 + 140 = 340ms in the worst case. We hold the P95 at 240ms by making sure the warm path is actually warm for typical tenants during typical hours. The worst case happens; we watch it, and we treat sustained worst-case behavior as a bug.
Why 240ms
The Nielsen Norman Group’s response-time thresholds — the old reference that has not meaningfully changed in 30 years — name 100ms as the limit for “feels instantaneous” and 1 second as the limit for “uninterrupted flow of thought.” Autocomplete sits between the two. 240ms is above the instantaneous bar but well under the flow-of-thought bar. Writers report that suggestions at 240ms feel “quick” but not “immediate” — which is the right feel. We do not want immediate. Immediate would suggest that the writer’s typing drives the output more than it should. A small deliberate lag reads as “the system thought about this.”
We experimented with 180ms and 320ms earlier in the year. 180ms made the suggestions feel noisy — writers complained that suggestions were flickering as they typed. 320ms made the suggestions feel slow — writers would finish their own sentence before the suggestion appeared, and the suggestion then felt redundant. 240ms was the sweet spot.
What happens when a budget slips
Retrieval slips. Usually because the tenant’s KB has been recently written to and the HNSW index has a pending rebuild. We detect this on the retrieval side and fall back to a scan over the recently-written blocks plus the indexed set. The fallback is slower but correct. When the fallback is active, the autocomplete latency indicator on our internal observability dashboard turns yellow. If the fallback stays active for more than 10 minutes, the incident channel gets a page.
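The fallback decision itself is simple. A sketch, assuming the index exposes a flag for a pending rebuild; the method and type names are illustrative:

```ts
interface Block {
  id: string;
  text: string;
  score: number;
}

interface RetrievalIndex {
  hasPendingRebuild(tenantId: string): boolean;
  query(tenantId: string, sentence: string, topN: number): Block[];
  scanRecentWrites(tenantId: string, sentence: string): Block[];
}

function retrieveBlocks(
  index: RetrievalIndex,
  tenantId: string,
  sentence: string,
  topN: number,
): { blocks: Block[]; fallbackActive: boolean } {
  if (!index.hasPendingRebuild(tenantId)) {
    // Warm path: the index covers everything the tenant has written.
    return { blocks: index.query(tenantId, sentence, topN), fallbackActive: false };
  }
  // Fallback: slower but correct. Merge the indexed results with a scan
  // over blocks written since the last index build, then re-rank.
  const merged = [
    ...index.query(tenantId, sentence, topN),
    ...index.scanRecentWrites(tenantId, sentence),
  ]
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
  return { blocks: merged, fallbackActive: true };
}
```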
Inference slips. Usually because the provider has a queueing event — a regional issue, a rate-limit shift, a model cold-start. We detect this from the time-to-first-token metric. When sustained P95 crosses 180ms on the inference budget alone, we fail over to a secondary provider for the affected tenants. The draft quality is equivalent across the two providers for the autocomplete model tier; we validated that before wiring the failover.
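A sketch of the failover trigger, reading "sustained" as a full rolling window of recent time-to-first-token samples. The 180ms threshold is the real one; the window size and the failover call are placeholders.

```ts
const TTFT_P95_FAILOVER_MS = 180;
const WINDOW_SIZE = 200; // hypothetical sample count per tenant

const ttftSamples = new Map<string, number[]>();

function recordTtft(tenantId: string, ttftMs: number): void {
  const samples = ttftSamples.get(tenantId) ?? [];
  samples.push(ttftMs);
  if (samples.length > WINDOW_SIZE) samples.shift();
  ttftSamples.set(tenantId, samples);

  // Only act once the window is full, so a single slow request cannot
  // flip the provider on its own.
  if (samples.length === WINDOW_SIZE && p95(samples) > TTFT_P95_FAILOVER_MS) {
    failOverToSecondaryProvider(tenantId);
  }
}

function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length * 0.95)];
}

// Hypothetical stand-in for the real routing change.
function failOverToSecondaryProvider(tenantId: string): void {
  console.log(`routing autocomplete for tenant ${tenantId} to the secondary provider`);
}
```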
Debounce does not slip. It is local timer logic with no external dependency. If it ever does not work, it is a client bug, not an infrastructure bug.
The per-call cost side
Every autocomplete request costs money on the API side. We do not run autocomplete on every keystroke — the debounce plus a contextual check (“is the writer actually asking for a completion?”) means maybe one in six typing sessions actually triggers a completion. The retrieval step is cached at the prompt-prefix layer where possible — see the draft-step caching post for the mechanics. Inference uses a small, fast model, not the primary drafting model; quality for a 50-word completion does not need the full-draft model.
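We will not spell out the real contextual check here, but a gate in that spirit might look like the sketch below; both conditions are illustrative, not the heuristic we actually run.

```ts
// Purely illustrative gate: trigger only when the cursor is at the end of
// an unfinished sentence with enough context to complete.
function shouldTriggerCompletion(fragment: string, cursorAtEnd: boolean): boolean {
  const trimmed = fragment.trimEnd();
  const endsMidSentence = !/[.!?]$/.test(trimmed); // sentence not yet closed
  const longEnough = trimmed.split(/\s+/).length >= 4; // hypothetical minimum
  return cursorAtEnd && endsMidSentence && longEnough;
}
```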
Autocomplete is meaningfully cheaper than a full-draft request, which matters, because a writer using autocomplete heavily through a workday can fire hundreds of requests. If we were using the full-draft stack for every one of them, the unit economics would not work.
What we watch
Four metrics, per tenant, on the autocomplete dashboard:
- P95 end-to-end latency, target 240ms.
- Time-to-first-token P95 on the inference step, target 80ms warm / 140ms ceiling.
- Retrieval P95, target 100ms warm / 140ms ceiling.
- Abandon rate — how often a user starts a completion and dismisses it before it finishes. Rising abandon rate with flat latency means the suggestions got worse, not slower. That is a retrieval-quality problem, not a latency problem, and it goes to the eval harness in the retrieval eval pillar, not here.
What is next
The two experiments in flight: a client-side draft cache for near-duplicate completions, and a speculative-retrieval pattern that pre-warms the retrieval index when the writer enters a new section. Both are performance plays that target the same 240ms bar. We will write them up when they ship or when we decide not to ship them.
The thing we try not to forget: the writer is the point. The budget exists because a writer who has lost flow is a writer who closes the editor. Every millisecond we hold onto is a millisecond the writer gets to keep thinking about the bid.