The SLA on draft generation: 45 seconds, 95th percentile
The operational target we hold draft generation to, why it's 45 seconds and not 30 or 90, and the specific things we do to hold the number under peak federal-FY-Q2 load.
We hold draft generation to 45 seconds at the 95th percentile. Not the median — the 95th. A P95 that drifts past 60 seconds is a paging event.
This post is how we set that number, and what it costs us to defend it during the federal FY Q2 surge when the queue depth triples.
## Why 45 seconds and not 30 or 90
Draft generation has a hard floor set by the slowest sensible model call for a grounded answer. In our stack that’s the claim-verification pass, which we walked through in the per-claim verification post. Faster than ~20 seconds for a medium-length draft means the verifier didn’t actually verify; the system degraded silently.
45 seconds is roughly the point where a proposal writer keeps their attention on the draft tab. Past 60, they Slack a colleague and lose context. Past 90, they close the tab and come back later, which means the draft arrives into an empty session and gets re-triggered. We watched that happen for a year before we set the SLA.
So the number is product-shaped, not infra-shaped. The infra has to hold it.
## What we measure
Three numbers, tracked per-draft:
- Time-to-first-token (TTFT). The SSE connection opens, the first retrieval chunk lands, and the model emits its first token. Our target: under 8 seconds P95. Covered in more depth in the SSE streaming post.
- Time-to-first-citation (TTFC). The first inline citation renders. This is the number writers actually care about — it’s when the draft stops being “generic” and starts being “from our KB.” Target: under 18 seconds P95.
- Time-to-complete (TTC). The full draft, verifier pass included. Target: under 45 seconds P95.
A green dashboard has all three inside target. Amber means at least one is outside. Red means TTC has been outside target for more than five consecutive minutes.
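The green/amber/red rule is simple enough to sketch directly. A minimal version, assuming the three P95s arrive as floats in seconds and one sample lands per minute (the names, the sampling cadence, and the `DashboardState` class are all illustrative; the thresholds are the targets above):

```python
from dataclasses import dataclass

# P95 targets from the section above, in seconds.
TARGETS = {"ttft": 8.0, "ttfc": 18.0, "ttc": 45.0}


@dataclass
class DashboardState:
    # Assumes one P95 sample per minute, so "more than five
    # consecutive minutes" means more than five consecutive samples.
    consecutive_ttc_breaches: int = 0

    def status(self, p95: dict) -> str:
        breaches = [name for name, limit in TARGETS.items() if p95[name] > limit]
        # Track how long TTC has been outside target.
        if p95["ttc"] > TARGETS["ttc"]:
            self.consecutive_ttc_breaches += 1
        else:
            self.consecutive_ttc_breaches = 0
        if self.consecutive_ttc_breaches > 5:
            return "red"
        return "amber" if breaches else "green"
```

A single out-of-target sample is amber; only a sustained TTC breach escalates to red, which is what makes red a paging event rather than noise.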
## Where the seconds go
A representative 40-second generation, stage by stage:
| Stage | Typical ms | Notes |
|---|---|---|
| Auth + tenant routing | 120 | Edge; mostly cache hits |
| Retrieval (hybrid + rerank) | 2,400 | Dense + BM25 + cross-encoder |
| Prompt assembly | 600 | Block composition, schema validation |
| First model call (draft) | 14,000 | Streaming; TTFT inside this |
| Claim verification pass | 16,000 | Per-claim entailment, parallelized |
| Citation rendering + persist | 1,500 | Spans resolved back to blocks |
| Export-ready format pass | 2,000 | Markdown + inline anchor IDs |
Retrieval is 6% of the budget. The two model calls together are 75%. The interesting optimization work is almost always in the model stages, not the retrieval stages — a surprise to every engineer who joins the team expecting the opposite.
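The percentages fall straight out of the stage table. A quick sanity check against the 40-second budget (timings copied from the table; the stages sum to ~36.6s, which leaves a few seconds of headroom inside the envelope):

```python
# Stage timings from the table above, in milliseconds.
stages = {
    "auth_routing": 120,
    "retrieval": 2_400,
    "prompt_assembly": 600,
    "draft_model_call": 14_000,
    "claim_verification": 16_000,
    "citation_render": 1_500,
    "export_format": 2_000,
}
budget_ms = 40_000  # the representative 40-second generation

retrieval_share = stages["retrieval"] / budget_ms
model_share = (stages["draft_model_call"] + stages["claim_verification"]) / budget_ms
print(f"retrieval: {retrieval_share:.0%}, model calls: {model_share:.0%}")
```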
## Holding the number under surge
Federal FY Q2 roughly triples our draft queue depth in the last week of March. Three things keep the number from blowing up:
1. Parallelized verification. The claim verifier fans out one call per claim, capped at eight concurrent per draft. A 20-claim draft used to be serial (slow) and is now 3 batches of 8, 8, 4 (fast). The cap is there because an uncapped fan-out starved other drafts during the December surge last year.
2. A separate queue for long-tail drafts. Drafts with more than 40 expected claims route to a dedicated worker pool with a 90-second SLA. Those drafts don’t poison the main pool’s P95. They also don’t show “45s” in the UI — they show “this one is long, planning to be ~80s” from the start. Honest latency beats a broken SLA.
3. Retrieval budgeting. Under surge we drop the cross-encoder reranker’s top-k from 30 to 12 when queue depth crosses a threshold. We lose ~2% on retrieval quality metrics (measured in the quarterly eval report) and save ~900ms per draft. During Q2 surge we take the trade. Outside surge, we don’t.
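The three tactics above compose into a small amount of routing and concurrency code. A sketch, assuming an async per-claim verifier and a queue-depth gauge (the function names and the surge threshold of 200 are hypothetical; the cap of eight, the 40-claim routing cutoff, and the 30→12 top-k drop are the numbers from the list):

```python
import asyncio

VERIFY_CONCURRENCY = 8    # per-draft cap; uncapped fan-out starved other drafts
LONG_TAIL_CLAIMS = 40     # above this, route to the 90-second-SLA pool
SURGE_QUEUE_DEPTH = 200   # hypothetical queue-depth threshold for budgeting


def rerank_top_k(queue_depth: int) -> int:
    # Under surge, trade ~2% retrieval quality for ~900ms per draft.
    return 12 if queue_depth > SURGE_QUEUE_DEPTH else 30


def route_pool(expected_claims: int) -> str:
    # Long-tail drafts get their own pool so they don't poison the main P95.
    return "long_tail" if expected_claims > LONG_TAIL_CLAIMS else "main"


async def verify_claims(claims, verify_one):
    # Fan out one call per claim, capped at eight concurrent:
    # a 20-claim draft runs as waves of 8, 8, and 4.
    sem = asyncio.Semaphore(VERIFY_CONCURRENCY)

    async def guarded(claim):
        async with sem:
            return await verify_one(claim)

    return await asyncio.gather(*(guarded(c) for c in claims))
```

The semaphore keeps ordering and results intact (`gather` preserves input order) while bounding how much of the model-call capacity any single draft can claim.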
## What we’re not doing
We are not caching drafts. Every draft is grounded against the current KB state at generation time. A cached draft is a stale draft, and staleness is the failure mode the whole product is designed against.
We are not using a faster, less-grounded model as a fallback. A model that doesn’t verify its claims produces output that reviewers have to audit by hand — so the “faster” path costs the team more downstream time than it saved upstream. We tried it in 2025 and the reviewer complaints were immediate and correct.
We are not hiding latency behind a fake progress bar. The UI shows the actual stage the draft is in — retrieval, drafting, verifying, rendering — because writers who wait 45 seconds for a blank spinner close the tab, and writers who wait 45 seconds for a visible pipeline don’t.
## The complication
The 45-second target is the easy part. The hard part is holding it across the tail of the latency distribution, not just at the median. Our P50 is 22 seconds. Our P95 is 41. Our P99 is 63.
The gap between P95 and P99 is almost entirely about one class of draft: long responses where the verifier encounters a compound claim that has to cross four blocks to entail. Those drafts run long, they’re correct to run long, and they’re the ones routed to the dedicated pool. We’re still tuning where that routing threshold should sit. A lower threshold protects the P95 but pushes more drafts into the “this one is long” UX. A higher threshold loosens the P95 but keeps more drafts on the happy path.
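Because the target is defined on the tail, it matters that every dashboard computes the percentiles the same way. A minimal sketch using the standard library (the sample data is made up; `statistics.quantiles` with `n=100` returns 99 cut points, so P95 is index 94):

```python
import statistics


def latency_percentiles(samples_s):
    # 99 cut points dividing the samples into 100 groups.
    cuts = statistics.quantiles(samples_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

On our real traffic this is computed over a sliding window per tenant, but the indexing convention is the part worth pinning down: an off-by-one here reports P94 as P95 and quietly loosens the SLA.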
## The takeaway
A draft-generation SLA is a product decision wearing infrastructure clothing. The number lives where the writer’s attention lives. Infrastructure’s job is to hold it, and to tell the writer the truth when it can’t.