The async drafting worker pool, explained
How 40 concurrent draft sections get written without exhausting the LLM budget or tripping the rate limiter. The worker pool, the budget enforcer, and the retry ladder.
When a user clicks “draft all sections” on a 40-question DDQ, the system has a choice. It can draft the sections serially (40 sequential LLM calls, somewhere between 8 and 20 minutes wall-clock), or it can draft them concurrently. Concurrency is obviously faster. Concurrency without a governor is also how you get rate-limited, blow past a token budget, or crash the drafting run mid-way with half the sections unfilled.
This post describes the shape of our worker pool for async drafting. Three parts: the pool, the budget enforcer, and the retry ladder.
The pool
The drafting pool is a Hatchet workflow. When a draft-all request comes in, the orchestrator fans out one task per question. Each task is independent: it owns the retrieval for its question, the prompt assembly, the LLM call, and the write of the draft and its citations back to the proposal record.
The pool size is a per-tenant configurable with a platform default. The default is 8 concurrent workers. Why 8 and not 40? Two reasons.
First, the LLM provider’s rate limits are measured in tokens-per-minute and requests-per-minute. Anthropic’s and OpenAI’s published limits tier by account and model; a new tenant in a lower tier can get throttled at 5 concurrent requests against a long-context model. 8 is a conservative starting point that rarely triggers rate limits for standard tiers.
Second, concurrent drafting doesn’t scale linearly with pool size past a point. Retrieval contention on the same KB, embedding-cache thrashing, and the reranker’s per-query compute all mean that 40 concurrent drafts come out nowhere near 5x faster than 8 concurrent drafts. We benchmarked at 4, 8, 16, and 32; 8 is the sweet spot on our current retrieval stack.
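A minimal sketch of the fan-out under that cap, written in plain asyncio rather than Hatchet’s own API; the function and field names here are illustrative, not our production code:

```python
import asyncio

POOL_SIZE = 8  # platform default; per-tenant configurable in the real system

async def draft_section(question: dict) -> dict:
    # Stand-in for the real worker: retrieval, prompt assembly,
    # the LLM call, and the write-back of draft + citations.
    await asyncio.sleep(0)  # placeholder for the actual I/O
    return {"question_id": question["id"], "draft": "..."}

async def draft_all(questions: list[dict]) -> list[dict]:
    # The semaphore caps in-flight drafts at POOL_SIZE; the rest of the
    # tasks queue behind it instead of hitting the provider all at once.
    sem = asyncio.Semaphore(POOL_SIZE)

    async def bounded(q: dict) -> dict:
        async with sem:
            return await draft_section(q)

    # Fan out one task per question; completion order is arbitrary.
    return await asyncio.gather(*(bounded(q) for q in questions))
```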
The budget enforcer
Every drafting job carries a token budget. The budget is set per-bid at the tenant’s configured tier (a Starter tenant might get 2 million tokens per bid; an Enterprise tenant might have no hard cap). Before each LLM call, the worker checks the remaining budget against the call’s projected cost.
The projected-cost check isn’t magical. It adds up the input tokens (prompt, retrieved chunks, instructions) and a generous upper estimate for output tokens (the model’s max_tokens setting, not a guess about what the model will actually produce). If the projected cost exceeds the remaining budget by more than a tolerance, the worker marks the section as “budget-deferred” and the section doesn’t draft. The proposal manager sees a yellow flag in the UI; the section can be drafted later by raising the budget or narrowing the retrieval.
The enforcer refuses to overdraft. A bid that budgeted for 1.5 million tokens spends at most 1.5 million tokens, never 1.7 million because three workers happened to return verbose output. This is strict in a way that annoys users who expected “a little over is fine.” It’s also the feature that keeps Finance from getting called into a proposal-software pricing conversation.
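A sketch of the reservation logic; the class shape, field names, and the refund step are invented for illustration. The property that matters is that the check and the debit happen atomically per bid, so concurrent workers can’t jointly overdraft:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class BidBudget:
    remaining: int                  # tokens left on this bid's budget
    tolerance: int = 0              # allowed overshoot on the projection
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def try_reserve(self, prompt_tokens: int, max_tokens: int) -> bool:
        # Projection = actual input tokens + the model's max_tokens cap,
        # not a guess at what the model will really produce.
        projected = prompt_tokens + max_tokens
        async with self._lock:      # check-and-debit is atomic across workers
            if projected > self.remaining + self.tolerance:
                return False        # caller marks the section "budget-deferred"
            self.remaining -= projected
            return True

    async def refund(self, unused: int) -> None:
        # Hypothetical reconciliation step: return the gap between the
        # worst-case reservation and what the call actually consumed,
        # so later sections can use it.
        async with self._lock:
            self.remaining += unused
```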
The retry ladder
LLM calls fail. Transient network errors, provider 5xx responses, rate-limit 429s, validation errors from the provider’s schema checks. Each failure mode wants a different retry strategy.
Our ladder has four rungs, in order (a code sketch follows below):
- Immediate retry with identical request. For transient network errors. One attempt. Backoff: 0 seconds.
- Retry with exponential backoff. For 5xx provider errors. Three attempts, 2s / 8s / 32s.
- Retry after rate-limit headers. For 429s. The provider tells us when to retry; we wait exactly that long. One attempt after the header-specified wait.
- Model fallback. For persistent failures on the primary model. The worker falls back to a secondary model (a smaller or cheaper one in the same family) with a note in the draft’s metadata. The section still gets drafted; the proposal manager sees the fallback flag at review time.
After the fourth rung, the section is marked “drafting-failed” and the proposal manager sees a red flag. Nothing is silently skipped. A 40-section bid will complete with 40 sections either drafted or explicitly marked as failed — never with 37 sections drafted and 3 missing because of errors the system swallowed.
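A condensed sketch of the ladder. The exception classes and call signature are made up for illustration (real provider SDKs name these differently), and one simplification is noted in the comments:

```python
import asyncio

class TransientNetworkError(Exception): ...
class ProviderServerError(Exception): ...       # provider 5xx
class RateLimited(Exception):                   # provider 429
    def __init__(self, retry_after: float):
        self.retry_after = retry_after

async def call_with_ladder(call, fallback_call) -> dict:
    # Simplification: here the first failure picks the rung; the real
    # ladder re-classifies every failure, not just the first.
    try:
        return await call()
    except TransientNetworkError:
        delays = [0]                 # rung 1: one immediate retry
    except ProviderServerError:
        delays = [2, 8, 32]          # rung 2: exponential backoff, 3 attempts
    except RateLimited as rl:
        delays = [rl.retry_after]    # rung 3: wait what the headers say

    for delay in delays:
        await asyncio.sleep(delay)
        try:
            return await call()
        except (TransientNetworkError, ProviderServerError, RateLimited):
            continue

    # Rung 4: secondary model, flagged in metadata for review time.
    # If this also raises, the caller marks the section "drafting-failed".
    draft = await fallback_call()
    draft["metadata"] = {"model_fallback": True}
    return draft
```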
What concurrency does to determinism
Concurrent drafts don’t happen in the order the user sees them. Section 1 might finish after section 12. This doesn’t usually matter — each section is independent — except when the proposal-level assembly step needs ordered input. For that step, we collect all drafts, sort by the original section order, and assemble in order. The drafting stage is concurrent; the assembly stage is serial.
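In code, the ordering step can be as small as a sort key, assuming each draft carries its original section index (the field name is illustrative):

```python
def assemble_in_order(drafts: list[dict]) -> list[dict]:
    # Drafts arrive in completion order; restore the user's section
    # order before the serial assembly pass.
    return sorted(drafts, key=lambda d: d["section_index"])
```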
One thing that bit us early: a proposal-level caching optimization that reused a prompt template across sections interacted badly with concurrent drafts. Two workers would try to seed the cache simultaneously, one would win, and the loser’s request would run without the cache. We fixed this by moving the template-prep step into a synchronous pre-flight and letting all concurrent workers read from a shared in-memory handle.
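Roughly, the fix looks like this, with prepare_template standing in for the real template-prep step and all names illustrative:

```python
import asyncio

async def prepare_template() -> dict:
    # Stand-in for the template-prep step (the old racy cache seed).
    return {"system_prompt": "..."}

async def draft_with_template(question: dict, template: dict) -> dict:
    return {"question_id": question["id"], "draft": "..."}

async def draft_all_with_preflight(questions: list[dict]) -> list[dict]:
    # Synchronous pre-flight: the template is prepared exactly once,
    # before any worker starts, so there is no race to seed the cache.
    template = await prepare_template()
    # Every concurrent worker reads the same in-memory handle.
    return await asyncio.gather(
        *(draft_with_template(q, template) for q in questions)
    )
```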
What we haven’t shipped yet
Per-section priority. Right now, the pool picks work FIFO. A section the user marked “high priority” doesn’t jump the queue. The UX for setting priority exists; the pool doesn’t honor it yet. Priority-aware scheduling is a Q1 item.
Cross-bid scheduling. Two concurrent bids from the same tenant share the same per-tenant pool. They compete. A long-running bid’s 40 sections can starve a faster bid’s 5 sections. We have a bid-level fairness hack that limits any single bid to 75% of the pool; that’s a cap, not a scheduler. A proper weighted-fair scheduler is on the roadmap.
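The 75% cap can be sketched as two nested semaphores, with the names invented for illustration:

```python
import asyncio
import math

POOL_SIZE = 8
BID_CAP = math.floor(0.75 * POOL_SIZE)   # any single bid: at most 6 of 8 slots

pool_sem = asyncio.Semaphore(POOL_SIZE)
bid_sems: dict[str, asyncio.Semaphore] = {}

async def run_capped(bid_id: str, task):
    # A worker needs both a global pool slot and one of its bid's capped
    # slots. This is a cap, not a scheduler: a small bid can still wait
    # behind a big one for the remaining slots.
    bid_sem = bid_sems.setdefault(bid_id, asyncio.Semaphore(BID_CAP))
    async with bid_sem, pool_sem:
        return await task()
```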
Why Hatchet, not a queue library
We chose Hatchet over building on a plain job queue for three reasons: the workflow DAG semantics match how drafts are composed of sub-tasks (retrieve → rerank → draft → validate → store), the failure-handling story is built around exactly the retry ladder described above, and the dashboard gives an SRE-shaped view of workers, queue depth, and failure rates. Building those three on top of a raw queue would have been two months of work we’d rather spend on retrieval quality.
Related
- Background job queue choices for proposal workloads — the earlier write-up on why we run async jobs at all.
- Cost per response breakdown — what the budget enforcer is actually enforcing.