Field notes

Caching the draft step

How we cache partial drafts across proposals without introducing stale-answer risk. The cache key design, invalidation rules, and the directional cost impact we measured internally.

The PursuitAgent engineering team

Drafting is the most expensive step in our pipeline by a wide margin. It is where the model sees the retrieved context, the question, the style profile, and the compliance constraints, and produces a first-pass answer. Every token in the input prompt gets billed. Every token in the output gets billed. The input side is where the money actually goes, because the retrieved context is long.

Two months ago we shipped caching on the draft step. The measured reduction in draft-step cost per response is in the high teens to low twenties of percent across our three largest tenants, depending on content-reuse rate. This post is how it works and why we are confident it is not introducing a stale-answer failure mode.

A note on the numbers. The cache-hit and cost-reduction figures below come from our internal usageEvents tracking on our own production tenants. They are directional — they describe what the cache is doing on our specific prompt structure and our specific tenant mix, not a generalizable benchmark. The ceiling we describe (the cacheable-prefix ratio) is structural; the exact percentages depend on the prompt design at the time of measurement.

Where the cache actually lives

The cache is on the prompt side, not the response side. We are not caching finished drafts and reusing them; that would be the exact stale-answer failure mode the platform is built to avoid. What we cache is the prompt prefix — the parts of the input to the model that are identical across many requests.

A draft-step prompt has four parts:

  1. The system prompt — grounded-draft instructions, citation rules, refusal rules. Stable per tenant.
  2. The style profile — per-customer tone guidance. Stable per tenant, changes on explicit user action.
  3. The retrieval context — N blocks pulled from the KB for this question. Variable per request.
  4. The question itself — the RFP question being answered. Variable per request.

The first two are long and stable. The last two are short and volatile. Provider-side prompt caching — as documented by OpenAI and by Anthropic — lets us mark the first two as cacheable and pay a reduced per-token rate on cache hits.
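To make the split concrete, here is a minimal sketch of marking the stable prefix as cacheable with Anthropic's Messages API. The prompt text, variable names, and model id are illustrative placeholders, not our production code.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Illustrative per-tenant values; in production these are loaded per tenant.
const systemPrompt = "Grounded-draft instructions, citation rules, refusal rules...";
const styleProfile = "Per-customer tone guidance...";

async function draftAnswer(retrievalContext: string, rfpQuestion: string) {
  return anthropic.messages.create({
    model: "claude-sonnet-4-6", // model id shown for illustration
    max_tokens: 2048,
    system: [
      // Parts 1 and 2: long, stable per tenant, marked cacheable.
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
      { type: "text", text: styleProfile, cache_control: { type: "ephemeral" } },
    ],
    messages: [
      {
        role: "user",
        // Parts 3 and 4: retrieval context and the question, never cached.
        content: `${retrievalContext}\n\nQuestion: ${rfpQuestion}`,
      },
    ],
  });
}
```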

The key design

The cache key is the byte-level hash of the cacheable prefix. Any change — a style-profile update, a system-prompt revision, a tenant-specific policy edit — produces a new key, which means a cache miss, which means the stale prompt does not get reused. The cache key never includes retrieved context or the question, because those should never be shared across requests.
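As a sketch of the idea (SHA-256 is an assumption for illustration; the exact digest we use is an implementation detail, and the field names are ours):

```typescript
import { createHash } from "node:crypto";

// Sketch: the key is a byte-level hash over the cacheable prefix only.
function prefixCacheKey(systemPrompt: string, styleProfile: string): string {
  return createHash("sha256")
    .update(systemPrompt, "utf8")
    .update("\u0000") // separator so text shifting across the boundary cannot collide
    .update(styleProfile, "utf8")
    .digest("hex");
}

// Editing either input changes the hash, which forces a cache miss on the
// next request. Retrieval context and the question never enter the key.
```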

Expiration is handled by the provider. We do not build our own TTL. The cacheable content is re-fingerprinted on every request and the provider compares it against its own cache state. If the prefix has not been seen recently, we pay the full rate. If it has, we pay the cache rate.
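The provider also reports what happened on each call. Continuing the draftAnswer sketch above, Anthropic's response usage block separates cached from uncached input tokens; the field names come from their documentation, the surrounding variables are illustrative.

```typescript
// Per-request values, shown as placeholders.
const retrievalContext = "...KB blocks retrieved for this question...";
const rfpQuestion = "...the RFP question being answered...";

const response = await draftAnswer(retrievalContext, rfpQuestion);

// cache_creation_input_tokens: the prefix was written to the provider cache on this call.
// cache_read_input_tokens: the prefix was already warm and billed at the cache rate.
const prefixWasWarm = (response.usage.cache_read_input_tokens ?? 0) > 0;
const fullRateInputTokens = response.usage.input_tokens; // billed at the uncached rate
```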

Measuring that it works

We split the draft-step cost into two lines on the usageEvents dashboard: cached-input tokens and uncached-input tokens. Before the change, all input tokens were uncached. After the change, a healthy majority of input tokens served across the three tenants we measured now hit cache. The weighted average cost drop — factoring in that cached tokens are billed at a lower rate — lands in the high teens to low twenties of percent depending on tenant.

This is not a 50% reduction. It will never be. The retrieval context and the question are inherently per-request and cannot be cached. The ceiling on draft-step savings is set by the ratio of cacheable-prefix tokens to total-input tokens, and that ratio tops out around 70–75% for our current prompt structure.
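Back-of-envelope, the shape of that ceiling looks like this. The numbers below are illustrative stand-ins, not measured constants, and the cache-read multiplier is an assumption rather than a quoted price.

```typescript
// Illustrative numbers only: the point is the shape of the ceiling, not the values.
const cacheablePrefixRatio = 0.72; // stable prefix tokens / total input tokens
const hitRate = 0.9;               // fraction of requests with a warm prefix
const cacheReadMultiplier = 0.1;   // cached-input price / full input price (assumed)

// Fraction of input-token spend removed by caching, bounded by cacheablePrefixRatio.
const inputSavings = cacheablePrefixRatio * hitRate * (1 - cacheReadMultiplier);
// ~0.58 of input spend with these stand-ins. Output tokens, cache misses, and
// cache-write overhead are untouched, so the whole-step reduction lands well
// below the input-side figure.
```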

Why the stale-answer risk is low

Three defenses:

The retrieval context is not cached. The actual facts the model draws from — the KB blocks retrieved for this specific question — are in the per-request part of the prompt, not the cacheable prefix. Every draft sees fresh retrieval. No factual staleness is possible from the cache.

Style-profile edits invalidate immediately. When a customer edits their style profile, the prefix changes, the hash changes, the next request is a cache miss. There is no window during which an updated style profile is ignored because an old prefix is still cached.

The cache is observable. The usageEvents row for every draft call logs whether the call hit cache and on what key. If we saw a stale prefix being served, we would see it in the log. We have not.
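For reference, a stripped-down version of what each row records; the real usageEvents schema is wider and these field names are illustrative.

```typescript
// Illustrative shape only; the production usageEvents schema carries more fields.
interface DraftUsageEvent {
  tenantId: string;
  step: "draft";
  cacheHit: boolean;         // did the provider report a warm prefix?
  prefixKey: string;         // hash of the cacheable prefix at call time
  cachedInputTokens: number;
  uncachedInputTokens: number;
  outputTokens: number;
  createdAt: string;         // ISO-8601
}

// A stale serve would show up as a row whose prefixKey does not match the
// tenant's current prefix hash. That check has not turned anything up.
```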

What it did not change

We did not change the draft output. Same model (Claude Sonnet 4.6), same temperature, same grounding pipeline. Responses pre-cache and post-cache are not measurably different on our retrieval-eval harness. The change was an infrastructure change, not a product change.

What is next

The retrieval step itself is the next target for caching. Retrieval over a KB that changes slowly — which is most KBs most of the time — could cache embedding comparisons at the query level for repeated or near-duplicate questions. We have a design for that and it is in the build log queue. The retrieval eval pipeline is the harness we will use to verify it does not degrade retrieval quality before we ship.
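As a speculative sketch of that query-level idea (not the design sitting in the queue; every name below is hypothetical), cached retrieval results would be keyed on a normalized question and invalidated on any KB change, mirroring the prefix-hash rule above.

```typescript
// Hypothetical sketch: reuse ranked block ids for a repeated or near-duplicate
// question, invalidating whenever the KB version moves.
const retrievalCache = new Map<string, { blockIds: string[]; kbVersion: string }>();

function normalizedQueryKey(question: string): string {
  return question.trim().toLowerCase().replace(/\s+/g, " ");
}

function cachedRetrieval(question: string, kbVersion: string): string[] | null {
  const hit = retrievalCache.get(normalizedQueryKey(question));
  return hit && hit.kbVersion === kbVersion ? hit.blockIds : null;
}
```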

A cost reduction that ships without a quality regression is a cost reduction worth keeping. This one has held for eight weeks. If anything moves, we will write the correction.

Sources

  1. OpenAI — Prompt caching documentation
  2. Anthropic — Prompt caching announcement