Field notes

The background job queue for proposal processing

How Hatchet orchestrates the ingest, classify, draft, and verify stages of a proposal response. The four stages, the retry policies, the dead-letter handling, and the one place we deliberately chose synchronous over async.

The PursuitAgent engineering team · 6 min read · Engineering

A proposal that arrives at our system goes through four distinct processing stages: ingest, classify, draft, and verify. Each stage takes seconds to minutes. Each one has different failure modes. Each one needs different retry behavior. Doing all four synchronously inside the API request that uploaded the RFP would be wrong; doing all four async with no orchestration would also be wrong.

We use Hatchet for the orchestration. This post covers what the workflow looks like, why we chose async-with-orchestration over the alternatives, and the one place we deliberately chose synchronous.

The four stages

┌───────────┐     ┌──────────┐     ┌────────┐     ┌────────┐
│  Ingest   │ ──► │ Classify │ ──► │ Draft  │ ──► │ Verify │
└───────────┘     └──────────┘     └────────┘     └────────┘
   PDF parse,     Question         RAG draft       Per-claim
   chunking,      bucketing,       per question    verification,
   embedding      KB retrieval                     citation check

Ingest. The RFP arrives — PDF, DOCX, ZIP, or portal export. Ingest extracts text, identifies sections, chunks the content, embeds chunks, and writes them to the per-tenant vector index. Failure modes: parse errors, OCR errors on scanned pages, embedding API timeouts. Retries: idempotent, up to 5 with exponential backoff. Dead-letter on persistent failure goes to a per-stage queue reviewed daily.
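Idempotency is what makes aggressive retries safe: attempt three must not duplicate the chunks attempt two already wrote. A minimal sketch of the idea, keyed on a content hash; every helper name here is an illustrative stand-in, not one of our real modules:

import { createHash } from "node:crypto";

// All helpers below are hypothetical stand-ins for internal modules.
declare function loadProposalFile(proposalId: string): Promise<{ bytes: Buffer }>;
declare function chunkDocument(file: { bytes: Buffer }): Array<{ index: number; text: string }>;
declare function embed(text: string): Promise<number[]>;
declare const vectorIndex: {
  has(key: string): Promise<boolean>;
  upsert(key: string, vector: number[]): Promise<void>;
};

async function ingestRFP(proposalId: string): Promise<void> {
  const file = await loadProposalFile(proposalId);
  const contentHash = createHash("sha256").update(file.bytes).digest("hex");
  for (const chunk of chunkDocument(file)) {
    const key = `${contentHash}:${chunk.index}`;
    if (await vectorIndex.has(key)) continue; // written by an earlier attempt; skip
    await vectorIndex.upsert(key, await embed(chunk.text));
  }
}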

Classify. Once the document is ingested, the classifier identifies questions and assigns each to a bucket (security, finance, legal, ops, etc.) — covered in the DDQ classification changelog. Failure modes: model timeouts, low-confidence classifications that route to human review. Retries: 3, with the human-review path as the deterministic fallback.
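The deterministic fallback is worth spelling out: below some confidence floor, the question goes to a reviewer rather than back through the retry loop. A sketch, with the threshold and helpers as illustrative assumptions:

// Threshold and helpers are illustrative, not our production values.
const CONFIDENCE_FLOOR = 0.7;

declare function classifyOne(question: string): Promise<{ bucket: string; confidence: number }>;
declare function enqueueHumanReview(question: string, guess: string): Promise<void>;

async function routeQuestion(question: string): Promise<string> {
  const { bucket, confidence } = await classifyOne(question);
  if (confidence < CONFIDENCE_FLOOR) {
    await enqueueHumanReview(question, bucket); // deterministic fallback, no retry
    return "pending-human-review";
  }
  return bucket;
}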

Draft. For each classified question, the system retrieves from the KB and produces a grounded draft. This is the heaviest stage — minutes per response on a typical RFP. Failure modes: retrieval misses, drafting timeouts, refusals from the verifier (which surface as “needs SME” in the UI). Retries: 2, with refusals being a successful outcome rather than an error.
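The mechanics of "refusals are a successful outcome" matter: if the task threw on a refusal, the orchestrator would retry it, and a second refusal would burn the retry budget for nothing. A sketch of the return shape we mean, with illustrative names:

// Illustrative result type: a refusal comes back as data, not a thrown
// error, so the orchestrator records the task as succeeded and moves on.
type DraftResult =
  | { status: "drafted"; answer: string }
  | { status: "needs_sme"; reason: string };

declare function retrieveContext(questionId: string): Promise<string[]>;
declare function generateDraft(questionId: string, context: string[]): Promise<DraftResult>;

async function draftOne(questionId: string): Promise<DraftResult> {
  const context = await retrieveContext(questionId);
  if (context.length === 0) {
    return { status: "needs_sme", reason: "nothing retrieved to ground from" };
  }
  return generateDraft(questionId, context);
}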

Verify. Every claim in the draft runs through the verifier. The output is the draft annotated with citations or refusals. Failure modes: verifier disagreements with the drafter that look like over-refusal. Retries: 1, then escalates to the per-claim refusal log.
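The annotated output has a per-claim shape along these lines (illustrative types, not our actual schema):

// Each claim either carries citations or a refusal that feeds the
// per-claim refusal log.
type ClaimVerdict =
  | { claim: string; verdict: "supported"; citations: string[] }
  | { claim: string; verdict: "refused"; reason: string };

interface VerifiedDraft {
  questionId: string;
  claims: ClaimVerdict[];
}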

Why async

The naive alternative would be synchronous processing inside the upload request. For a 60-page RFP with 200 questions, that means a 5-to-15-minute HTTP request. That is not workable UX, and it pins a worker thread for the duration. We tried it briefly during early prototyping; the "user uploads, closes the laptop, comes back later" test was the killer.

The other alternative is uncoordinated async — fire each stage as its own queue item, let them race, manage state in the database. We tried that too. The state-management code grew faster than the workflow logic. Hatchet exists specifically to be the orchestrator, so the workflow logic doesn't have to manage its own state. The workflow definition, roughly as it looks in our codebase:

// The Hatchet client is instantiated elsewhere; this import path is ours.
import { hatchet } from "./hatchet-client";

const proposalWorkflow = hatchet.workflow({
  name: "proposal-processing",
  on: { event: "proposal.uploaded" },
});

const ingest = proposalWorkflow.task({
  name: "ingest",
  retries: 5, // external failure modes (parsers, embedding APIs): retry aggressively
  retryBackoff: { strategy: "exponential", initial: 5_000 },
  fn: async (input, ctx) => ingestRFP(input.proposalId),
});

const classify = proposalWorkflow.task({
  name: "classify",
  parents: [ingest],
  retries: 3, // after these, route to human review
  fn: async (input, ctx) => classifyQuestions(input.proposalId),
});

const draft = proposalWorkflow.task({
  name: "draft",
  parents: [classify],
  retries: 2, // a refusal is a successful outcome, not a retryable error
  fn: async (input, ctx) => draftAnswers(input.proposalId),
});

const verify = proposalWorkflow.task({
  name: "verify",
  parents: [draft],
  retries: 1, // verifier disagreement is the answer, not an error
  fn: async (input, ctx) => verifyClaims(input.proposalId),
});
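For completeness, the workflow needs a worker to run on and an event to trigger it. A rough sketch, assuming the worker-registration and event-push helpers from the Hatchet TypeScript SDK; check the SDK docs for the exact names in your version:

// Sketch only; the exact API surface varies across Hatchet SDK versions.
const worker = await hatchet.worker("proposal-worker", {
  workflows: [proposalWorkflow],
});
await worker.start();

// Elsewhere, the upload handler publishes the triggering event:
await hatchet.events.push("proposal.uploaded", { proposalId: "prop_123" });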

The shape of the workflow is a straightforward DAG. Each task carries its own retry policy. Hatchet handles the rest.

Retry policies, in detail

Retry policies are not the same across stages because the failure modes are not the same.

Ingest retries are aggressive (5 attempts, exponential backoff up to 5 minutes) because the failure modes are external: a PDF parse vendor returns 503, an embedding API rate-limits us, a tenant’s vector index is in the middle of a maintenance window. Retrying is almost always the right move.
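Concretely, with a 5-second initial delay and a doubling factor (the factor is our assumption; the config above only pins the initial delay), the schedule never reaches the 5-minute cap within 5 attempts:

// Illustrative schedule: 5s initial, doubling per attempt, 5-minute cap.
const delaysMs = Array.from({ length: 5 }, (_, attempt) =>
  Math.min(5_000 * 2 ** attempt, 300_000),
);
// => [5000, 10000, 20000, 40000, 80000]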

Classify retries are moderate (3 attempts) because the failure modes split between transient (model timeouts) and structural (the document genuinely has no extractable questions, in which case retrying is pointless). After 3, we route to human review.

Draft retries are conservative (2 attempts) because retrying a draft that produced bad output usually produces similarly bad output. The expensive thing here is the LLM call, not the network round-trip; two attempts are enough to catch the transient errors.

Verify retries are 1 because verifier disagreement is a feature, not a bug. If the verifier refuses, that refusal is the answer.

Dead-letter handling

Failures that persist across all retries land in a per-stage dead-letter queue: one each for ingest, classify, draft, and verify. Each has its own owner and its own review cadence:

  • Ingest DLQ: reviewed daily by an on-call engineer. Most common cause: PDFs with broken structure, scans we can’t OCR cleanly, files in unsupported formats.
  • Classify DLQ: reviewed weekly by the classification owner. Most common cause: documents that don’t have a question structure (e.g., a buyer uploaded a contract instead of an RFP).
  • Draft DLQ: reviewed by the drafting team. Most common cause: questions where retrieval returns nothing and the generator has nothing to ground from. These often surface real KB gaps and feed into the freshness work.
  • Verify DLQ: reviewed by the grounded-AI team. Most common cause: the drafter and verifier disagreeing in a way that suggests the verifier is being over-strict.

A DLQ that grows unbounded is a sign that one stage is failing systematically, not transiently. The growth rate is the signal.
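Because the growth rate is the signal, the alert should fire on the derivative rather than the absolute depth. A hypothetical sketch of the check; the threshold and helpers are ours for illustration:

// Hypothetical monitoring sketch: alert when a stage's DLQ grows faster
// than its review cadence drains it, regardless of current depth.
declare function dlqDepth(stage: string): Promise<number>;
declare function alertOwner(stage: string, message: string): Promise<void>;

const MAX_GROWTH_PER_HOUR = 10; // illustrative threshold

async function checkDlqGrowth(stage: string, depthOneHourAgo: number): Promise<number> {
  const depthNow = await dlqDepth(stage);
  if (depthNow - depthOneHourAgo > MAX_GROWTH_PER_HOUR) {
    await alertOwner(stage, `DLQ growing systematically: ${depthOneHourAgo} -> ${depthNow} in one hour`);
  }
  return depthNow;
}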

The one place we chose synchronous

Citation verification when a user clicks “verify this citation” in the UI is synchronous. The user is waiting on the answer. They expect a few hundred milliseconds, not “we will process this in the background and notify you.” The verification call is fast enough (sub-second for short claims) that synchronous is the right call. The orchestration overhead would be larger than the work.
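The whole path fits in a request handler. A sketch, assuming an Express-style route; the route, handler shape, and verifyCitation helper are illustrative:

// Illustrative synchronous endpoint: the user is waiting, so no queue and
// no orchestration; the verifier is called inline and the result returned.
import express from "express";

declare function verifyCitation(claimId: string): Promise<{ verdict: string; citations: string[] }>;

const app = express();
app.post("/api/citations/:claimId/verify", async (req, res) => {
  const result = await verifyCitation(req.params.claimId); // sub-second for short claims
  res.json(result);
});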

The async-by-default rule is not a religion. The right question is “is the user waiting.” If they are, synchronous. If they are not, async with proper orchestration.

What we have not built

Cross-tenant prioritization. Right now, a paying customer’s draft job and a free-tier user’s ingest job sit in the same queue and FIFO each other. We have not had to fix this yet because volume is manageable. When volume crosses some threshold, we will add per-tenant priority lanes. Hatchet supports it; we have not configured it.

Also: no live progress streaming to the UI yet. The frontend polls for status. SSE or WebSocket-based progress is on the Q4 list. The polling works fine but is not the right long-term answer.

Sources

  1. Hatchet — Workflow orchestration documentation