Field notes

The prompt library behind grounded drafting

Seven named prompts, one kill-switch registry, a versioning scheme, and the governance pattern we use to keep prompt sprawl from becoming an outage. Engineering notes on how we actually run prompts in production.

The PursuitAgent engineering team · 7 min read · Engineering

A production RAG system is a prompt library attached to a retriever. The retriever is the part most teams talk about; the prompt library is the part that decides whether the drafted output is usable. After a year of running grounded drafting in front of real proposals, we know more about how our prompt library fails than how the retriever fails. This post catalogs the library — seven named prompts, one kill-switch registry — and the governance pattern around it.

For the broader architecture, the grounded retrieval pillar is the long version. For the versioning mechanics, prompt versioning in production covers how a prompt change rolls from branch to prod without breaking in-flight drafts. This post is about the library itself — what lives in it, how prompts relate to each other, and the kill switches we have wired in.

The seven prompts

Every draft request touches at least three of these prompts and often all seven. They run in sequence or branch, depending on the question.

classify-question decides what kind of question the inbound item is. A compliance-matrix line? A DDQ security question? A free-text narrative prompt? A table-structured request? The classifier’s output routes the request to a specialized downstream prompt. Misclassification is the dominant source of bad drafts — it is cheaper to improve this prompt than any other.
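
To make the routing concrete, here is a minimal sketch of the dispatch step, assuming the classifier returns a short label string. The label names, the mapping, and the structural instructions are illustrative, not the production taxonomy.

```python
from typing import Callable

# Illustrative label -> (downstream prompt, structural instruction) map.
# The real taxonomy is richer; unrecognized labels fall back to a plain
# narrative draft rather than failing the request.
ROUTES = {
    "compliance_matrix_line": ("draft-with-citations", "table"),
    "ddq_security_question": ("draft-with-citations", "narrative"),
    "free_text_narrative": ("draft-with-citations", "narrative"),
    "table_structured": ("draft-with-citations", "table"),
}

def route(question: str, classify: Callable[[str], str]) -> tuple[str, str]:
    label = classify(question)
    return ROUTES.get(label, ("draft-with-citations", "narrative"))
```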

rewrite-for-retrieval rewrites the question into a retrieval-friendly form. Evaluators write questions in evaluator vocabulary; the KB is written in vendor vocabulary; the gap between the two loses information at the retrieval step. This prompt expands abbreviations, resolves pronouns, and adds the vertical or product context the retriever needs. It does not invent facts. It restates the question.

draft-with-citations is the main drafting prompt. It receives the retrieved chunks, the question, and a structural instruction (narrative, table, bulleted). It produces a draft where every claim is citation-tagged against a specific chunk ID. This prompt is the largest single piece of the library. It is also the one that fails most loudly when something upstream is wrong.
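
As a rough sketch of what "citation-tagged" means at the data level, assuming a simple span-plus-chunk-ID shape; the field names are ours for illustration, not the production schema.

```python
from dataclasses import dataclass

@dataclass
class CitedSpan:
    text: str      # one claim in the draft
    chunk_id: str  # ID of the retrieved chunk the claim cites

@dataclass
class Draft:
    structure: str          # "narrative", "table", or "bulleted"
    spans: list[CitedSpan]

    def untagged(self) -> list[CitedSpan]:
        # Spans with no citation tag are rejected before verification runs.
        return [s for s in self.spans if not s.chunk_id]
```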

verify-claim-spans runs after drafting. It receives the draft and the cited chunks, and for each span, confirms the chunk actually supports the span. This is our defense against the Stanford HAI finding that citations do not imply grounding — a draft can cite correctly at the reference level while stating something the reference does not support. This prompt flags ungrounded spans for human review.
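
A minimal sketch of that check, assuming the per-span support judgment is itself a call to the verify prompt passed in as a callable; the function and parameter names are illustrative.

```python
from typing import Callable

def flag_ungrounded(
    spans: list[tuple[str, str]],          # (claim_text, chunk_id) pairs from the draft
    chunks: dict[str, str],                # chunk_id -> chunk text
    supports: Callable[[str, str], bool],  # wraps the verify-claim-spans call
) -> list[tuple[str, str]]:
    # A span fails if its cited chunk is missing or does not support the claim.
    # Failures are routed to human review, not silently dropped.
    flagged = []
    for claim, chunk_id in spans:
        chunk = chunks.get(chunk_id, "")
        if not chunk or not supports(claim, chunk):
            flagged.append((claim, chunk_id))
    return flagged
```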

extract-compliance-language is specialized for RFP intake. It reads a document and produces a structured list of “shall,” “must,” “will provide,” “describe” statements with their parent section. It is the compliance-matrix scaffold that the pipeline later reconciles against drafted sections.
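
For illustration, the output shape looks roughly like the record below. The lexical pre-pass is only our sketch of the candidate surface; the prompt does the real extraction and attaches each statement to its parent section.

```python
import re
from dataclasses import dataclass

@dataclass
class ComplianceItem:
    section: str    # parent section, e.g. "3.2 Data Security"
    modal: str      # "shall", "must", "will provide", "describe"
    statement: str  # the full requirement sentence

MODALS = re.compile(r"\b(shall|must|will provide|describe)\b", re.IGNORECASE)

def candidate_items(section: str, sentences: list[str]) -> list[ComplianceItem]:
    # Lexical pre-pass only; a hypothetical filter, not the prompt itself.
    items = []
    for sentence in sentences:
        match = MODALS.search(sentence)
        if match:
            items.append(ComplianceItem(section, match.group(1).lower(), sentence))
    return items
```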

summarize-for-executive produces the executive-summary condensation of a full response. It does not draft new content; it compresses drafted sections into an executive-friendly narrative. Running this as a separate prompt, downstream of the draft prompt, keeps the exec summary honest — it cannot claim things the body does not say.

post-mortem-extract is the closed-loop prompt. Given a finished proposal, its win/loss outcome, and any debrief notes, it extracts candidate blocks to write back to the KB — promoted themes, content additions, answer refreshes. This is the prompt that makes the product compounding rather than stateless; without it, the library never learns from what shipped.
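
As a sketch of the write-back payload, assuming the three candidate kinds named above map onto a simple record; the names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class WriteBackCandidate:
    kind: str            # "promoted_theme", "content_addition", or "answer_refresh"
    text: str            # the block proposed for the KB
    source_section: str  # where in the shipped proposal it came from

@dataclass
class PostMortemResult:
    proposal_id: str
    outcome: str  # "win" or "loss"
    debrief_notes: str
    candidates: list[WriteBackCandidate] = field(default_factory=list)
```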

How the prompts relate

The seven do not form a pipeline anybody would draw as a straight line. They fan out from the classifier, fan in at verify, and close an asynchronous loop from post-mortem back into the KB that the others read from. In operational terms, every user-visible action calls a small subset, and a full draft-to-submit cycle touches all seven across its life.
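
In code terms, a single draft request looks roughly like the sketch below; the `prompts` registry of callables and the `retrieve` function are stand-ins, and post-mortem-extract is deliberately absent because it runs later, asynchronously, once the proposal has an outcome.

```python
def run_draft_request(question, retrieve, prompts):
    label = prompts["classify-question"](question)
    query = prompts["rewrite-for-retrieval"](question, label)  # restate, don't invent
    chunks = retrieve(query)
    draft = prompts["draft-with-citations"](question, chunks, label)
    flagged = prompts["verify-claim-spans"](draft, chunks)     # fan-in point
    return draft, flagged  # flagged spans go to human review
```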

The relationships matter because changes propagate. A change to rewrite-for-retrieval affects what draft-with-citations receives. A change to classify-question affects which specialized prompt runs at all. We do not let a prompt change ship without a regression pass over its downstream consumers; the governance pattern below enforces this.

The kill-switch registry

Every prompt has a kill switch. The registry is a table:

prompt_id | current_version | fallback_version | kill_switch
--------- | --------------- | ---------------- | -----------
classify-question | v14 | v13 | off
rewrite-for-retrieval | v8 | v7 | off
draft-with-citations | v22 | v21 | off
verify-claim-spans | v6 | v5 | off
extract-compliance-language | v11 | v10 | off
summarize-for-executive | v4 | v3 | off
post-mortem-extract | v3 | v2 | off

When a production regression is detected — by the eval harness, by a customer, by a support ticket — engineering flips the kill switch. The kill switch is not “turn the prompt off”; it is “fall back to the previous version.” The system stays online, degraded to the known-good prompt, while the regression is diagnosed.

A kill switch is a specific intervention, not a global one. Killing draft-with-citations does not kill classify-question. This matters because the failure mode is rarely “all prompts are broken.” It is usually “the prompt we shipped last Tuesday regressed one sub-class of question.” Flipping the whole system back would lose the wins from the other six prompts.
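
A minimal sketch of how the registry resolves the version to load, assuming it is read per request; the Python shape and the helper name are ours, the semantics are the ones described above.

```python
from dataclasses import dataclass

@dataclass
class RegistryRow:
    current_version: str
    fallback_version: str
    kill_switch: bool = False

# Mirrors the table above (two rows shown; one row per prompt in practice).
REGISTRY = {
    "classify-question": RegistryRow("v14", "v13"),
    "draft-with-citations": RegistryRow("v22", "v21"),
}

def resolve_version(prompt_id: str) -> str:
    # Flipping the switch does not disable the prompt; it pins it to the
    # known-good fallback while the regression is diagnosed. Other prompts
    # keep running their current versions.
    row = REGISTRY[prompt_id]
    return row.fallback_version if row.kill_switch else row.current_version
```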

The governance pattern

A prompt change goes through four gates.

  1. Branch and eval. The change lives on a branch, in the repo, with a human-readable change log. The eval harness runs the new version against the regression suite, which is about 400 questions with known-good answers. The harness produces a per-category accuracy delta.
  2. Review. An engineer who did not author the change reviews it. The review checks three things: does the change improve the metric it targets, does it regress any category, and is the change safe if the retriever returns something weird.
  3. Canary. The change ships to 5% of traffic for at least 48 hours. We watch citation verification rate, human-edit frequency, and time-to-ship; a regression on any of them reverts the canary automatically. A minimal version of that check is sketched after this list.
  4. Promote. After the canary window, the change promotes to 100%. The previous version stays as the fallback in the kill-switch registry for two weeks, then rotates out.
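
Here is that canary check as a sketch, assuming the three watched metrics are compared against the pre-canary baseline; the 2% tolerance and the relative comparison are illustrative, not our thresholds.

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    citation_verification_rate: float  # fraction of spans that verify
    human_edit_frequency: float        # edits per drafted answer
    time_to_ship_hours: float          # median draft-to-submit time

def should_revert(canary: CanaryMetrics, baseline: CanaryMetrics,
                  tolerance: float = 0.02) -> bool:
    # Revert if verification drops, or edits or time-to-ship rise,
    # by more than the tolerance relative to baseline.
    return (
        canary.citation_verification_rate < baseline.citation_verification_rate * (1 - tolerance)
        or canary.human_edit_frequency > baseline.human_edit_frequency * (1 + tolerance)
        or canary.time_to_ship_hours > baseline.time_to_ship_hours * (1 + tolerance)
    )
```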

This is heavy for what is technically a text string. It is not heavy for what the text string does. A prompt ships to thousands of drafts per week; a bad prompt ships a citation error into every one of them.

What we got wrong on governance this year

We shipped a change to draft-with-citations in August without the canary gate because a customer-reported bug felt urgent. The fix fixed the bug. It also regressed table-structured output in a way the eval harness did not cover, because the eval suite at the time was narrative-heavy. Tables started losing column structure in production drafts for about six days before a customer reported it. We flipped the kill switch, diagnosed, added table-structured items to the regression suite, and reshipped through the full four gates.

The lesson we took was not “never skip the canary.” It was “the canary gate is cheap relative to a regression, and the urgency that tells you to skip it is the same urgency that makes you miss the regression.” We have not skipped the canary since.

The part not in this post

The prompts themselves. We do not publish the prompt text. The prompts encode a working theory of the category that is, in aggregate, a non-trivial piece of the product. Individually each prompt is legible — it is a paragraph of instruction — and collectively they compose into something that took a year to get right. Publishing them would not hurt us; it also would not help you, because the prompts only work against the retrieval and verification pipeline they are designed for.

What we do publish is the shape of the system and the failure modes we have seen. A production prompt library is a seven-component thing with a kill-switch registry and a four-gate change process. The specific strings on the page matter; the architecture around them matters more.

Sources

  1. PursuitAgent — Grounded retrieval pillar
  2. PursuitAgent — Prompt versioning in production