Grounded-AI regressions we caught in year one
Four regressions in our grounded-drafting pipeline this year. How we caught each one, how long it took to roll back, and the one we did not catch in time. Engineering notes, not a victory lap.
Every grounded-AI system has regressions. The Stanford HAI paper’s 17–33% hallucination range is not a static number; it is a moving target that any specific system drifts above and below depending on what changed recently. Our year-one system had four distinguishable regressions. This post describes each one, what caught it, and what we changed afterward.
Regression 1 — the narrative smoothing (March)
What broke. The draft prompt, after a version bump, began smoothing source language where the source used qualified or conditional phrasing. “As required by the standard” became “at six-month intervals.” “Where applicable, customer-managed keys” became “with customer-managed keys.” The smoothing made drafts read more fluently and made individual claims more specific than the source supported.
How we caught it. Our verification prompt flagged an anomalously high rate of “partial support” classifications on a section of the regression suite. The rate went from roughly 5% to 11% across two prompt deploys. An engineer investigating the anomaly spot-read ten of the flagged drafts and found the smoothing pattern.
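For concreteness, the check amounts to watching a classification rate across deploys. A minimal sketch, with an illustrative label string and threshold, not our monitoring code:

```python
# Sketch of a deploy-over-deploy check on the verifier's "partial support" rate.
# The label string and the jump threshold are illustrative.
def partial_support_rate(classifications: list[str]) -> float:
    """Share of verifier classifications that came back as partial support."""
    return classifications.count("partial_support") / max(len(classifications), 1)

def rate_jumped(prev_rate: float, new_rate: float, max_jump: float = 0.03) -> bool:
    """Flag a deploy whose partial-support rate rose by more than max_jump (absolute)."""
    return (new_rate - prev_rate) > max_jump

# March: roughly 5% -> 11% across two prompt deploys, well past the threshold.
assert rate_jumped(0.05, 0.11) is True
```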
How long to roll back. Eight hours from first flag to kill-switch. The delay was not technical — the kill switch flipped in seconds — it was confirming that the regression was real and not a test suite artifact. We chose to over-verify before flipping because a false kill-switch affects production drafts too.
What we changed. The draft prompt now has a specific instruction set around preserving qualified phrasing. The regression suite gained a new category — “qualified-phrase preservation” — with 40 questions specifically targeting the failure mode. The category did not exist in March; it exists now.
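A minimal sketch of what a case in that category checks, with an assumed qualifier list and a naive substring match standing in for the suite's real grading:

```python
# Naive sketch of a qualified-phrase preservation case. The qualifier list and the
# substring match are illustrative stand-ins for the real suite's grading.
QUALIFIERS = ["where applicable", "as required by", "subject to", "unless", "should"]

def dropped_qualifiers(source_span: str, draft_claim: str) -> list[str]:
    """Qualifiers present in the cited source span but missing from the draft claim."""
    src, drf = source_span.lower(), draft_claim.lower()
    return [q for q in QUALIFIERS if q in src and q not in drf]

# A case fails when the draft states the claim without the qualifier the source used.
assert dropped_qualifiers(
    "Where applicable, customer-managed keys are used.",
    "Encrypted with customer-managed keys.",
) == ["where applicable"]
```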
Regression 2 — the classifier blind spot on table-structured questions (May)
What broke. A prompt change to the classify-question model improved classification accuracy on narrative questions by 3 percentage points and quietly regressed classification on table-structured requests by 8 percentage points. The net metric the eval harness reported was positive; the sub-category regression was buried in the aggregate.
How we caught it. A customer reported that a section they expected to render as a table came back as a narrative. A support ticket. The customer was polite about it; we were not polite about it internally.
How long to roll back. Thirty-six hours from ticket to kill-switch. The delay was figuring out that this was a classifier bug, not a drafting bug. The support engineer’s first instinct was to look at the drafter, which took time to rule out.
What we changed. The eval harness now reports accuracy per question-type, not just aggregate. A sub-category regression greater than 3 percentage points fails the gate, regardless of aggregate gain. The prompt library governance pattern we published this month locks this rule in.
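The gating rule itself is small. A sketch, assuming the harness reports per-question-type accuracy as a mapping and using illustrative numbers:

```python
# Sketch of the per-category gate. Accuracy is in percentage points per question
# type; the numbers below are illustrative, not the May eval.
SUBCATEGORY_REGRESSION_PP = 3.0

def gate(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """A candidate ships only if no question type regresses past the threshold."""
    for question_type, base_acc in baseline.items():
        if candidate[question_type] - base_acc < -SUBCATEGORY_REGRESSION_PP:
            return False  # sub-category regression fails the gate regardless of aggregate
    return True

# +3pp on narrative, -8pp on table-structured: aggregate up, gate closed.
assert gate({"narrative": 91.0, "table": 88.0},
            {"narrative": 94.0, "table": 80.0}) is False
```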
Regression 3 — the embedding model swap that broke corpus-specific retrieval (August)
What broke. We swapped embedding models for a 12% retrieval-quality gain on the public benchmark suite. The swap regressed retrieval on a specific customer’s KB — a corpus with heavy domain-specific terminology — by roughly 9%. The customer noticed because the drafts suddenly started citing less-relevant chunks.
How we caught it. Quarterly eval against per-customer KB samples, which we had just started running in July. Without that eval, we would have learned about the regression via a customer escalation weeks later.
How long to roll back. 48 hours to confirm customer-specific impact, 12 hours to re-deploy the prior embedding model for that customer’s corpus, 3 weeks to build and ship a per-tenant embedding-model configuration so customers with domain-heavy corpora could stay on a model that better fit their vocabulary.
What we changed. Embedding-model deploys now run a per-tenant eval before global rollout. A global gain with a per-tenant regression over a threshold triggers a phased rollout, not an immediate global ship. We also added the infrastructure to run heterogeneous embedding models in production, which we had resisted for complexity reasons until this regression forced the issue.
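A sketch of the decision the per-tenant eval feeds, with illustrative tenant names, scores, and threshold:

```python
# Sketch of the per-tenant pre-rollout check. Tenant names, retrieval scores, and
# the regression threshold are illustrative assumptions.
PER_TENANT_REGRESSION = 0.05  # relative drop in retrieval quality

def rollout_plan(current: dict[str, float], candidate: dict[str, float]) -> str:
    """Global rollout only if no tenant's retrieval quality regresses past the threshold."""
    regressed = [
        tenant for tenant, score in candidate.items()
        if (current[tenant] - score) / current[tenant] > PER_TENANT_REGRESSION
    ]
    return "phased" if regressed else "global"

# A global gain with one domain-heavy corpus down ~9% triggers a phased rollout.
assert rollout_plan(
    {"tenant_a": 0.72, "tenant_b": 0.68, "domain_heavy": 0.70},
    {"tenant_a": 0.80, "tenant_b": 0.76, "domain_heavy": 0.64},
) == "phased"
```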
Regression 4 — the verifier that stopped verifying (October)
What broke. An infrastructure change — unrelated to prompts — altered the timeout envelope the verification prompt ran under. The verifier began timing out on long chunks and silently falling back to a passthrough mode. Drafts shipped with the citation panel rendering normally but without the underlying verification actually running.
How we caught it. An engineering review of the ungrounded-span flag rate noticed it had dropped unexpectedly — from the typical 3-4% range to below 1%. An improvement that wasn’t an improvement, it turned out; the rate dropped because the verifier was not running at all, not because the drafts were getting cleaner.
How long to roll back. Immediate kill-switch on the infra change, about an hour to confirm that verification was back to real operation, about a day to backfill verification on drafts that had shipped during the silent-fail window.
What we changed. The verifier’s heartbeat is now an alerted metric, not a logged one. A verifier that is not producing classification outputs within its expected volume triggers a page, not a quiet log line. The general principle — positive-silence failures are the hardest to catch — is one we had internalized in other parts of the system and somehow missed here.
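In miniature, the heartbeat rule looks like this; the ratio threshold and the paging stub are assumptions:

```python
# Sketch of the heartbeat-as-alert rule. The ratio threshold and the paging stub are
# assumptions; the point is a page, not a log line, when the verifier goes quiet.
def check_verifier_heartbeat(verified: int, shipped: int, min_ratio: float = 0.95) -> None:
    """Page if the verifier produced classifications for too few of the shipped drafts."""
    if verified / max(shipped, 1) < min_ratio:
        page_oncall(f"verifier heartbeat low: {verified}/{shipped} drafts verified")

def page_oncall(message: str) -> None:
    # Stand-in for the real paging integration.
    print(f"PAGE: {message}")
```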
The one we did not catch in time
What broke. Across a four-day window in late June, the draft prompt began producing exec-summary sections that occasionally referenced claims not present in the body of the response. The exec summary is drafted from the body after the body is complete, but the temperature on that prompt had drifted (we track temperature as a deployable config; it shifted unintentionally in a config migration), and the summary occasionally embellished.
How we caught it, eventually. A customer’s gold-team reviewer caught it in a final-pass read. The reviewer was sharp enough to notice the summary said something the body did not. Most reviewers, we suspect, would have missed it.
Why we did not catch it. Our verification prompt ran against the body, not against summary-body consistency. We did not have a summary-body reconciliation check. The failure mode fell through a gap in the verification surface.
What we changed. A specific reconciliation check now runs on exec-summary output against the body it was drafted from. Any claim in the summary that does not have a body-level support is flagged. The reconciliation check is a separate prompt with its own kill switch.
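A minimal sketch of the reconciliation pass, assuming claim extraction happens upstream and using a naive stand-in for the support check:

```python
# Miniature sketch of the reconciliation pass. Claim extraction is assumed to happen
# upstream; the substring check stands in for the separate verification prompt.
def unsupported_summary_claims(summary_claims: list[str], body: str) -> list[str]:
    """Summary claims with no body-level support; any hit flags the draft."""
    return [claim for claim in summary_claims if not supported_by_body(claim, body)]

def supported_by_body(claim: str, body: str) -> bool:
    # In production this is its own prompt with its own kill switch; here, a naive
    # stand-in so the sketch runs end to end.
    return claim.lower() in body.lower()
```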
The broader pattern
The four regressions, plus the one miss, share a structure. Each one was a change that improved something measurable while quietly regressing something unmeasured. The eval harness was too aggregate, the verifier’s surface was too narrow, the test categories were not yet granular enough. Each regression’s post-mortem produced a new eval category, a new alerted metric, or a new verification surface.
This is the work of running grounded AI in production. The grounded retrieval pillar earlier this year argued that verification is the point of the architecture; the regressions above are the operational proof. A system without a verifier is a system that cannot notice it has regressed. A system with a verifier that stops running is the same system with a false floor.
What year two looks like
Wider verification surfaces, more granular eval categories, per-tenant evals baked in from the start of any prompt change, and a commitment to treat positive-silence metrics (verification rate dropping, flag rate dropping) with the same seriousness as error metrics spiking. The first three regressions above taught us to measure better. The fourth and the one we missed taught us that measuring the things we think to measure is not the same as measuring the things that are actually moving.