Field notes

Prompt versioning in production, the boring way

Git, tags, eval gates. How we roll out a prompt change without breaking drafts in flight, and why the boring version is the one that actually works.

The PursuitAgent engineering team · 6 min read · Engineering

Prompts are code. Every engineering discipline we apply to code, we apply to prompts. That is the position. It is boring, and it is how you avoid the pattern where “we updated the rewrite prompt” breaks drafts that are still in flight from the prior prompt.

This post is the current state of how we version, deploy, and roll back prompts in the PursuitAgent drafting pipeline. There is nothing clever here. The cleverness is in refusing to do anything clever.

The prompt catalog

Every prompt in the product lives in a file in the backend repo. Each file is a TypeScript module that exports a named prompt object:

export const draftAnswerPrompt = {
  id: "draft_answer",
  version: "2025-11-14.1",
  systemPrompt: `...`,
  userTemplate: `...`,
  model: "claude-sonnet-4-6",
  temperature: 0.2,
  maxTokens: 1200,
  evalSuite: "rfp_draft_v3"
};

The id is stable across versions. The version is an ISO date plus an increment; two prompt changes on the same day get .1 and .2. The evalSuite field names the eval set that must pass before this prompt version is allowed to ship.

Every runtime invocation of a prompt logs the prompt id and the full version alongside the input, the output, and the cost. Six months later, when we’re debugging a weird draft, we can reconstruct which exact prompt produced it.
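For concreteness, a sketch of the shape of that record. The field names here are illustrative, not our actual log schema:

interface PromptInvocationRecord {
  promptId: string;       // stable across versions, e.g. "draft_answer"
  promptVersion: string;  // e.g. "2025-11-14.1"
  input: string;          // the rendered user template
  output: string;         // the model completion
  costUsd: number;        // per-call spend
  loggedAt: string;       // ISO-8601 timestamp
}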

The rollout mechanics

A prompt change flows through four stages before it affects production drafts:

Stage 1 — PR. The prompt change is a normal pull request. Diff review runs on the actual prompt text, not on an abstract description of it. Reviewers read the old and new prompts side by side.

Stage 2 — Offline eval. Every prompt change triggers the associated eval suite. For the draft_answer prompt, that’s 1,800 held-out questions with reviewer-approved reference answers. The suite scores four dimensions: factual entailment against retrieved context, citation correctness, style adherence, and robustness on the adversarial set. A change that regresses any dimension by more than 1.5% blocks the PR.
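A minimal sketch of that gate, assuming scores are fractions in [0, 1] and treating the 1.5% threshold as an absolute drop; the names are hypothetical, not our CI’s actual interface:

type EvalScores = {
  entailment: number;
  citation: number;
  style: number;
  adversarial: number;
};

// Returns every dimension the candidate regresses past the threshold.
function regressedDimensions(
  baseline: EvalScores,
  candidate: EvalScores,
  threshold = 0.015, // the 1.5% gate
): (keyof EvalScores)[] {
  return (Object.keys(baseline) as (keyof EvalScores)[]).filter(
    (dim) => baseline[dim] - candidate[dim] > threshold,
  );
}

// Any regression past the gate blocks the PR.
const blockers = regressedDimensions(
  { entailment: 0.94, citation: 0.91, style: 0.88, adversarial: 0.82 },
  { entailment: 0.92, citation: 0.92, style: 0.92, adversarial: 0.82 },
);
// blockers === ["entailment"]: a two-point entailment drop fails
// even though style improved by four points.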

Stage 3 — Shadow deploy. Merged prompts run in shadow mode for 24 hours in production. Every real draft request executes both the old and the new prompt; the old prompt’s output is served to the user; the new prompt’s output is logged and scored offline against the same evaluators. The shadow stage catches regressions the offline eval missed — unusual customer corpora, unusual question shapes, unusual retrieval returns.
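In sketch form, the serve-old, log-new split looks like this; the types and helper names are assumptions, not our service’s API:

interface DraftRequest { proposalId: string; question: string; }
interface Draft { text: string; }

async function handleDraft(
  req: DraftRequest,
  runPrompt: (version: string, req: DraftRequest) => Promise<Draft>,
  logForScoring: (req: DraftRequest, shadow: Draft) => void,
): Promise<Draft> {
  // Old prompt: this is the output the user sees.
  const served = await runPrompt("2025-11-14.1", req);
  // Shadow path: fire and forget, never on the serving path.
  void runPrompt("2025-11-15.1", req)
    .then((shadow) => logForScoring(req, shadow))
    .catch(() => { /* a shadow failure must not affect the served draft */ });
  return served;
}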

Stage 4 — Ramp. If shadow looks clean, the new prompt ramps to 10% of traffic, then 50%, then 100% over 48 hours. Each ramp step is gated on the same metrics, now measured on live traffic.
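One way to make the ramp deterministic, as a sketch; the hash and schedule here are illustrative:

// Illustrative ramp schedule over the 48-hour window.
const rampSteps = [10, 50, 100]; // percent of traffic, gated at each step

// Stable bucketing: the same proposal always lands in the same percentile,
// so raising the percentage only ever adds proposals to the new prompt.
function inNewPromptCohort(proposalId: string, trafficPct: number): boolean {
  let h = 0;
  for (const ch of proposalId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100 < trafficPct;
}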

A prompt change from commit to 100% takes four to six days. That is deliberate. We are not optimizing for prompt-change velocity; we are optimizing for drafts that reviewers trust.

The rollback

Rollbacks are one command and twenty seconds. Every prompt version ships with a deprecated_after field, which our deploy system uses to keep the last five versions live in the catalog. Rolling back means changing the version pointer in a YAML config that the running service watches via a file handle; there is no redeploy. The switch is atomic: the next request uses the old prompt.
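A minimal sketch of the hot-swap, assuming a one-key pointer file so the example stays dependency-free (our real config is richer YAML):

import { readFileSync, watch } from "node:fs";

const POINTER_FILE = "prompt-pointer.yaml"; // e.g. `draft_answer: "2025-11-14.1"`

function readPointer(): string {
  const text = readFileSync(POINTER_FILE, "utf8");
  const match = text.match(/^draft_answer:\s*"?([\w.-]+)"?/m);
  if (!match) throw new Error("no version pointer in config");
  return match[1];
}

let activeVersion = readPointer();

// Swapping a single reference is atomic from the request handler's point of
// view: the next request reads whichever version the pointer names now.
watch(POINTER_FILE, () => {
  activeVersion = readPointer();
});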

Rollback telemetry logs what was rolled back, by whom, and why. The “why” field is a required free-text field, and we read it in the weekly prompt review. Rollbacks without a “why” are treated like deploys without tests — the person who did it gets a friendly note.

Drafts in flight

The tricky case: a draft starts under prompt version A, and the user’s second editing pass pulls a rewrite under prompt version B. The two outputs may have different styles, different citation densities, different sentence shapes.

We solve this with a stickiness rule. Every proposal carries a prompt_version_lock field that is set when the proposal’s first draft runs. Subsequent rewrite calls on the same proposal use the locked version, not the current latest. A proposal started under version A finishes under version A, even if version B shipped mid-cycle.

The lock releases when the proposal closes. The next proposal takes whatever version is current.

The edge case: a rolled-back version that a proposal is locked to. If we roll back A because it has a bug, we do not want in-flight proposals to keep using it. We mark A with a force_unlock flag; the next rewrite call on any locked proposal checks the flag and upgrades to the current version, with a UI notice to the reviewer that the prompt upgraded mid-draft. This happens rarely — two or three times a year — and the notice keeps the behavior honest.
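Put together, version resolution on a rewrite call looks roughly like this; the catalog lookups are stand-ins for our real registry:

interface Proposal {
  id: string;
  promptVersionLock: string | null; // set when the first draft runs
}

// Stand-ins for the real prompt registry.
const forceUnlocked = new Set<string>(); // e.g. "rewrite_answer@2025-11-14.1"
const latest: Record<string, string> = { rewrite_answer: "2025-11-15.1" };

function resolveVersion(proposal: Proposal): { version: string; upgraded: boolean } {
  const locked = proposal.promptVersionLock;
  if (locked && !forceUnlocked.has(`rewrite_answer@${locked}`)) {
    return { version: locked, upgraded: false }; // stickiness: finish under the locked version
  }
  const version = latest["rewrite_answer"];
  const upgraded = locked !== null && locked !== version;
  proposal.promptVersionLock = version; // re-lock to the upgraded version
  return { version, upgraded };         // upgraded=true drives the mid-draft UI notice
}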

Why this matters for grounding

Stanford HAI’s work on commercial legal RAG hallucinations names a pattern that our eval catches repeatedly: a prompt tweak that improves average output quality can quietly degrade factual entailment on a specific question shape. The eval suite we gate on is structured to surface this. The retrieval eval pillar covers the full methodology.

A prompt version that raises style quality by 4% and drops entailment by 2% is a net negative for grounded drafting. Our rollout mechanics refuse to ship it. That refusal is the entire product discipline. A tool that claims to ship grounded drafts cannot tolerate a silent regression on grounding because the prompt felt better in internal testing.

Seven named prompts

The drafting pipeline currently runs seven distinct prompts, each versioned independently:

  1. draft_answer — the first-pass draft from retrieved context.
  2. rewrite_answer — the reviewer-requested rewrite of an existing draft.
  3. compress_answer — shorten an existing answer to a word budget.
  4. explain_citation — generate the in-line rationale for why a cited block supports a claim.
  5. classify_question — route an incoming question to the right KB partition.
  6. score_draft_coverage — scores whether a draft is complete relative to the retrieved context.
  7. generate_objection — the adversarial prompt that generates reviewer objections on a draft.

Each has its own eval suite. Each has its own rollout cadence. Each has a named owner on the engineering team who reviews every change to it. Prompts are infrastructure; they get treated like infrastructure.

The larger post on this — the prompt library behind grounded drafting, including the kill-switch registry and how we handle prompt-model coupling when we swap model providers — is coming on the 22nd. This post is the rollout-and-rollback mechanics. That post is the catalog and the governance.

The one-line takeaway

Prompts are shipped software; ship them like software. Version them, gate them on evals, deploy them in stages, roll them back atomically, and log every invocation. There is no clever alternative. The clever alternative is what causes the outages.

Sources

  1. PursuitAgent retrieval eval pillar
  2. Stanford HAI — Legal RAG hallucinations