A full-year retrospective on shipping grounded AI
Twelve months of evidence on the grounded-AI thesis. The Stanford hallucination number measured against our corpus, four failure modes and which ones we closed, what changed under the hood, and what I would tell Q1 Bo.
A year ago I wrote our founding thesis into a first-week blog post. The claim was simple: large language models are good at drafting from a corpus they can cite, and bad at generating facts they have not been given. Build the product around that asymmetry and you get software that proposal teams can actually trust. Ignore the asymmetry and you get the category we were trying to replace — tools that confidently produce plausible fiction and ship it under the customer’s letterhead.
Twelve months later, I want to write about how that thesis held up. Not the marketing version. The real version. Where the evidence broke the way we hoped, where it broke against us, and what I would tell myself at the start of Q1 last year if I could go back.
This is a retrospective on the thesis, not on the company and not on the industry. A separate post covered the team we hired this year and where we got the hiring right and wrong. A different one covered the year in proposal tooling more broadly. This post is about whether grounded AI, as a technical approach, actually worked. The short answer is yes, but less cleanly than I expected, and the less-clean parts are the interesting ones.
The Stanford 17-33% number, a year later
The foundational external citation for our thesis was the Stanford HAI paper on commercial legal RAG tools — Magesh et al., “Hallucination-free? Assessing the reliability of leading AI legal research tools”. The finding that got quoted into every grounded-AI sales conversation: commercial tools with retrieval and citation hallucinate 17 to 33% of the time. The finding that got quoted less: the authors defined hallucination rigorously, against a human-evaluated benchmark, and the hallucination types they catalogued are specific, not vague.
The question I cared about at the start of the year was whether a grounded-AI system built with the discipline we were trying to build would hit a meaningfully better number on proposal content specifically, and whether we could measure that.
We built the evaluation pipeline for this in Q1 — the piece that became the retrieval eval pillar post and its engineering companion. The methodology we landed on: for every drafted answer the system produces, decompose it into atomic factual claims, check whether each claim is supported by the cited block, and mark each claim as supported / contradicted / unsupported. The ratio of unsupported claims to total claims is the hallucination rate. The definition is stricter than Stanford’s — we count an uncited claim as unsupported even if it happens to be true — because the point of the product is citation, not just factual accuracy.
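To make that concrete, here is a minimal sketch of the per-draft scoring. Two assumptions worth flagging: the names are illustrative rather than our production code, and the sketch counts contradicted claims against the rate alongside unsupported ones. The decomposition and per-claim checks themselves are model calls the sketch does not show.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNSUPPORTED = "unsupported"


@dataclass
class Claim:
    text: str          # one atomic factual assertion from the draft
    cited_block: str   # the KB block the draft cites for that assertion
    verdict: Verdict   # assigned by the verification pass (a model call, not shown)


def unsupported_claim_rate(claims: List[Claim]) -> float:
    """Per-draft hallucination rate: not-supported claims over total claims.

    Assumption: contradicted claims count against the rate alongside
    unsupported ones. An uncited-but-true claim still counts, because the
    point of the product is citation, not just factual accuracy.
    """
    if not claims:
        return 0.0
    not_supported = sum(1 for c in claims if c.verdict is not Verdict.SUPPORTED)
    return not_supported / len(claims)
```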
The directional result, measured across our eight largest tenants through Q4: our unsupported-claim rate on drafts the system actually shipped was a small fraction of Stanford’s reported 17-33% range. Some tenants were meaningfully better than others. The best-performing tenants had strong KB hygiene, narrow proposal domains, and engaged content owners. The worst-performing tenants had sprawling multi-year KBs with thin freshness discipline.
I am not publishing the specific numbers because they are tenant-level and some are under NDA. I am also not publishing them because I do not entirely trust the comparison — Stanford’s corpus was legal case citations, ours is proposal content; the domains are different, the authority signals are different, and a strict apples-to-apples number is not honest. The honest claim is directional: grounded AI with real discipline around retrieval, freshness, and per-claim verification produces a measurably lower unsupported-claim rate than the Stanford benchmark implies is default. The directional claim is the one I can defend. The precise number is the one I cannot, and I would rather publish the honest uncertainty than a false precision.
What this year’s data did change: my confidence that the number is achievable. At the start of the year, I was not sure whether the Stanford pattern was the floor of what was technically possible or just the average of what the market had built. The evidence now is that it is the average of what the market had built. Better numbers are available if the discipline is there. The HN thread on RAG hallucination was divided between “RAG is fundamentally broken” and “RAG is fine if you build it right,” and I now think the second camp is closer to correct. RAG built as a demo is unreliable. RAG built as a product with serious eval infrastructure, freshness discipline, and claim-level verification is reliable enough to ship.
Four failure modes, twelve months of data
Grounded retrieval has specific ways it fails. At the start of the year, I could name a few of them by intuition. By the end of the year, we had instrumented most of them and could say which ones we closed and which ones we did not.
Failure mode 1 — Numeric precision
The model generates a sentence with a number in it. The retrieved block contains the correct number. The generated sentence has a transposed digit, a rounded value, or a unit error. The citation points to the correct block. The claim is technically not supported by the retrieved text, even though a human reader might not notice.
This failure mode is mechanical and we measured it tightly. At the start of the year, it was our single most common hallucination category — numbers the model slightly reshaped while citing a correct source.
What we did. The claim-level verification pass post covers the mechanism. We added a specific numeric-token check: for every number that appears in the generated draft, we verify that the exact token (or a trivial formatting variant — commas, decimals, currency symbols) appears in the cited block. Mismatches are rewritten to match the source or flagged for human review.
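A rough sketch of the shape of that check, assuming a deliberately simple normalization (strip currency symbols and commas, treat decimal formatting variants as the same value). The production variant handles more formats; the regex and helper names here are illustrative.

```python
import re

# Matches plain numbers, decimals, percentages, and simple currency amounts.
NUM_RE = re.compile(r"[$€£]?\d[\d,]*(?:\.\d+)?%?")


def _normalize(token: str) -> str:
    cleaned = token.strip("$€£%").replace(",", "")
    try:
        # Treat "1.50" and "1.5" as the same value; keep non-numeric tokens as text.
        return repr(float(cleaned))
    except ValueError:
        return cleaned


def numeric_tokens(text: str) -> set:
    return {_normalize(m.group()) for m in NUM_RE.finditer(text)}


def unmatched_numbers(draft_sentence: str, cited_block: str) -> set:
    """Numbers in the draft that do not appear, modulo formatting, in the source."""
    return numeric_tokens(draft_sentence) - numeric_tokens(cited_block)


# A transposed digit gets flagged:
# unmatched_numbers("Uptime was 99.59% in 2023.", "We achieved 99.95% uptime in 2023.")
# -> {'99.59'}
```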
Where we ended up. This failure mode is mostly closed on fresh content. It is not closed on legacy KB content that contains stale numbers; in those cases the numeric-check passes because the source is wrong. Stale-source numeric errors are a different failure mode (closer to freshness than to hallucination) and we handle them with freshness discipline rather than with the numeric check.
Honest gap. A number expressed as a range in the source (“between 15 and 20%”) and rephrased by the model as a specific value (“approximately 18%”) still passes most of our checks and should not. We know about this. It is in the build log.
Failure mode 2 — Compound claims across blocks
The model generates a sentence that combines facts from two or more retrieved blocks. Each block individually supports part of the sentence; no single block supports the whole claim. The system cites both blocks, but the compound claim is only supported if the reader mentally joins them — which is itself a load-bearing mental step that the system is not supervising.
This failure mode is the one I underestimated most at the start of the year. I assumed that citation-per-claim was sufficient. It is not, for compound claims.
What we did. We changed the drafting prompt to discourage compound claims — prefer two short sentences citing two blocks over one long sentence citing both. We added a verification check that decomposes compound claims into atomic assertions and checks each against the cited blocks.
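A sketch of the decomposition check, with the caveat that decompose_into_assertions and entails are stand-ins for model calls, and the policy shown (every atomic assertion must be entailed by at least one cited block) is the simplest version of what we run.

```python
from typing import List


def decompose_into_assertions(sentence: str) -> List[str]:
    # Hypothetical stub: in practice a model splits the sentence into atomic
    # assertions, e.g. "we operate globally" and "we comply with DORA".
    raise NotImplementedError


def entails(block: str, assertion: str) -> bool:
    # Hypothetical stub: an entailment check scores whether the block supports the assertion.
    raise NotImplementedError


def compound_claim_supported(sentence: str, cited_blocks: List[str]) -> bool:
    """True only if every atomic assertion is supported by some cited block.

    What this misses: if the combination of assertions implies more than any
    single block states, every assertion can pass while the sentence as a
    whole overreaches. That is the entailment-strength gap discussed below.
    """
    return all(
        any(entails(block, assertion) for block in cited_blocks)
        for assertion in decompose_into_assertions(sentence)
    )
```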
Where we ended up. Partially closed. The prompt change reduced compound-claim frequency noticeably. The decomposition check catches the remaining ones when they are structurally simple. Compound claims with implicit entailment — “we operate globally and comply with DORA” — still slip through some of the time because the check is not semantically strong enough to notice when the combination says more than its parts.
Honest gap. Entailment-strength checking is genuinely hard. The eval harness runs it with a stronger model, which catches more, but we cannot afford to run that stronger model on every draft. This is an open tradeoff.
Failure mode 3 — Synonym and tense drift
The model paraphrases a block in a way that preserves meaning but changes vocabulary or tense. “We support DORA compliance” becomes “our platform is fully DORA-compliant.” The first is a statement about capability; the second is a stronger claim. The retrieval cited the first. The draft says the second.
What we did. Less than the other failure modes, honestly. We have a lightweight check that flags when the generated sentence uses a modifier (“fully,” “completely,” “always”) that does not appear in the source. That catches the most obvious cases. It does not catch subtle tense drift (“we are implementing X” → “we have implemented X”) reliably.
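The modifier check is simple enough to sketch almost in full. The word list below is illustrative rather than the list we ship, and the tokenization is cruder than the real thing.

```python
# Intensifiers that strengthen a claim; illustrative, not the shipped list.
STRENGTHENING_MODIFIERS = {
    "fully", "completely", "always", "never", "all", "every", "guaranteed",
}


def drifted_modifiers(draft_sentence: str, cited_block: str) -> set:
    """Modifiers in the draft that do not appear anywhere in the cited source."""
    draft_words = set(draft_sentence.lower().split())
    source_words = set(cited_block.lower().split())
    return (draft_words & STRENGTHENING_MODIFIERS) - source_words


# drifted_modifiers("Our platform is fully DORA-compliant.",
#                   "We support DORA compliance.")
# -> {'fully'}  (flagged for review)
```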
Where we ended up. Partially closed on the obvious cases. Not closed on the subtle ones. This failure mode requires stronger semantic checking than we currently ship, and every candidate solution we have explored has unacceptable latency or cost at production scale.
Honest gap. This is probably the failure mode I am most uncomfortable with heading into year two. Semantic drift across paraphrase is how a grounded-AI system can lie while being technically grounded. Our current defense is that our KB is mostly written in precise enough voice that paraphrase does not have much room to drift. That is a reliance on good corpus hygiene, not on the system’s own defenses. I do not love it.
Failure mode 4 — Entailment false-negatives
The model generates a sentence that is supported by the retrieved block. The verification check concludes that it is not. The correct draft gets flagged as a hallucination, rewritten, or sent for human review.
This is the false-positive side of the hallucination check (equivalently, a false negative on the entailment check): not invented content shipping, but correct content being treated as invented. A verification system with a high false-negative rate on its grounding check wastes human review time, frustrates drafters, and over time erodes trust in the check itself.
What we did. We calibrated the check. The hallucination-budget-per-claim post gets into the economics. Even a check that is 95% accurate per claim will flag some correct claims on a claim-dense draft; the cost of that is real, and we had to decide what the acceptable false-negative rate was.
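The tension is easiest to see with back-of-envelope numbers. Everything in the sketch below (claims per draft, per-claim sensitivity and specificity, true error rate) is an assumed illustration, not a measured figure from our system.

```python
# Assumed figures for illustration only.
claims_per_draft = 40
true_error_rate = 0.02   # fraction of claims that are actually wrong
sensitivity = 0.90       # wrong claims the verifier catches
specificity = 0.95       # correct claims the verifier passes

bad_claims = claims_per_draft * true_error_rate
good_claims = claims_per_draft - bad_claims

true_flags = bad_claims * sensitivity          # real issues caught
false_flags = good_claims * (1 - specificity)  # correct claims flagged anyway

print(f"flags per draft: {true_flags + false_flags:.1f}")
print(f"share of flags that are real issues: {true_flags / (true_flags + false_flags):.0%}")
# With these assumptions roughly a quarter of flags are real issues, which is
# why per-claim verifier precision matters so much at claim density. The
# shipped calibration described next lands at a better ratio.
```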
Where we ended up. The acceptable false-negative rate is not zero — we accept some correct claims being flagged in exchange for catching more incorrect claims. Currently the check sits in a place where drafters see a “review recommended” badge on a small single-digit percentage of their drafted claims, and roughly 2/3 of the flagged claims are real issues the drafter acts on. The remaining 1/3 are entailment false negatives (grounded claims the check failed to recognize) that the drafter dismisses. This is not a great ratio. It is a better ratio than having no verification check at all, which is where we started the year.
Honest gap. This failure mode is the only one where I am confident the main lever is a better underlying verifier model, not a better system around the model. The verifier we use is fast and cheap; a stronger verifier would improve the ratio. We are watching the model-provider landscape for better fits.
What changed under the hood
Four architectural changes from Q1 to Q4 carried most of the improvement in the numbers above. I will not re-derive each from scratch — the linked posts do that — but I want to trace them together because the interactions between them are where the system ended up.
Claim-level verification pass (Day 49 post). We stopped treating the draft as an atomic unit for verification. Every factual claim in a draft gets decomposed and checked separately against its cited source. This was the single largest reliability improvement of the year. It is also the most expensive change on a per-token basis, and the hallucination budget post covers how we decided what budget to give it.
Reranker economics (Day 91 post). Retrieval quality went up when we added a reranking step over the top-N retrieval results. The reranker is a smaller, domain-tuned model that re-scores the retrieval shortlist against the specific question before drafting. This moved recall@5 by a meaningful amount across tenants. The cost was acceptable once we calibrated the rerank-only-top-N threshold.
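The rerank-only-top-N shape, sketched with stand-ins: retrieve and rerank_score are hypothetical stubs for the first-stage retrieval and the domain-tuned scorer, and the thresholds are illustrative rather than the ones we actually run.

```python
from typing import List, Tuple


def retrieve(question: str, n: int) -> List[str]:
    # Hypothetical stub: first-stage retrieval (e.g. vector search) returns
    # the n highest-similarity KB blocks.
    raise NotImplementedError


def rerank_score(question: str, block: str) -> float:
    # Hypothetical stub: the smaller, domain-tuned model scores the block
    # against the specific question.
    raise NotImplementedError


def retrieve_and_rerank(question: str, first_stage_n: int = 50, final_k: int = 5) -> List[str]:
    """Cheap wide retrieval, expensive narrow rerank, keep the top k."""
    shortlist = retrieve(question, first_stage_n)
    scored: List[Tuple[float, str]] = [(rerank_score(question, b), b) for b in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for _, block in scored[:final_k]]
```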
Chunking ablations (Day 75 post). Chunk size is a tuning knob with counterintuitive effects. Smaller chunks increase retrieval precision but reduce the context each retrieved block provides to the drafter. Larger chunks do the opposite. We ran an ablation across four chunk sizes on representative tenants, and the winning configuration was smaller than our Q1 default. The change was free — it cost nothing in infrastructure — and it improved both retrieval quality and draft quality.
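The ablation itself is almost boring to write down, which is part of why it was free. The helpers and chunk sizes below are illustrative stand-ins; the Day 75 post has the real setup.

```python
from typing import Dict, List


def chunk_corpus(corpus: List[str], max_tokens: int) -> List[str]:
    raise NotImplementedError  # hypothetical stub: split KB docs into blocks


def build_index(chunks: List[str]):
    raise NotImplementedError  # hypothetical stub: embed and index the blocks


def recall_at_k(index, eval_queries, k: int) -> float:
    raise NotImplementedError  # hypothetical stub: fraction of queries whose
                               # gold block lands in the top-k results


def chunk_size_ablation(corpus: List[str], eval_queries) -> Dict[int, float]:
    """Same corpus, same eval set, same pipeline; only the chunk size varies."""
    return {
        size: recall_at_k(build_index(chunk_corpus(corpus, size)), eval_queries, k=5)
        for size in (128, 256, 512, 1024)
    }
```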
Retrieval eval pipeline (Day 63 post). None of the other changes would have been committable without this one. Every system change we shipped this year passed through the eval harness first. The harness caught three changes we were enthusiastic about shipping that turned out to regress retrieval quality — and we killed those changes before they hit customers. The eval infrastructure is the thing that turned the project from “build grounded AI” into “measure whether grounded AI is working, and change it when it is not.”
Those four, together, got us most of the way. A handful of smaller changes — better citation rendering, the per-block freshness scoring, the refusal-rather-than-guess behavior when grounding is thin — added meaningful reliability on top. The grounded-AI pledge in code post walks through the specific refusal logic.
What none of those four is: a magic bullet. The Mayo Clinic “reverse RAG” pattern that HN discussed last year was an early candidate for one — span-level evidence verification as the single defense. We looked at it seriously. It is economically plausible at some scale, structurally difficult at ours. What we ended up with is closer to a defense-in-depth setup: retrieval tuning, then reranking, then claim-level verification, then refusal. Any single layer is not sufficient. The stack is.
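The stack, sketched end to end. Every helper here is a stand-in for a component described elsewhere in this post, and the grounding threshold and refusal message are illustrative, not production values.

```python
GROUNDING_THRESHOLD = 0.6  # illustrative; below this the system refuses rather than guesses


def retrieve_and_rerank(question: str) -> list:
    raise NotImplementedError  # layers 1-2: the retrieval + rerank sketch above


def grounding_score(question: str, blocks: list) -> float:
    raise NotImplementedError  # hypothetical stub: how well the blocks cover the question


def draft_with_citations(question: str, blocks: list) -> str:
    raise NotImplementedError  # hypothetical stub: grounded drafting, a citation per claim


def verify_claims(draft: str, blocks: list) -> dict:
    raise NotImplementedError  # layer 3: the per-claim verification pass


def answer(question: str):
    blocks = retrieve_and_rerank(question)
    # Refusal: thin grounding gets a request for content, not a plausible guess.
    if not blocks or grounding_score(question, blocks) < GROUNDING_THRESHOLD:
        return "Not enough grounded content for this question; please add a source block."
    draft = draft_with_citations(question, blocks)
    # Flagged claims go to human review rather than shipping silently.
    return draft, verify_claims(draft, blocks)
```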
What I would tell Q1 Bo
Three things. These are the honest retrospective answers, not the marketing ones.
First — invest in the eval harness on day zero, not day 60. We shipped the first version of the retrieval eval pipeline in Q1, which felt early at the time. In retrospect it was late. The changes we shipped in January and February without the harness — the ones we could not measure — caused rework later when we could measure them and discovered some of them had regressed quality. If I were starting over, I would not ship a single retrieval change before the harness existed to measure its effect. The cost of building it first looks like slowness on the roadmap; the cost of building it later looks like unmeasurable drift on the production system.
Second — the freshness problem is bigger than the model problem. The biggest quality differences between our best-performing and worst-performing tenants come from their KBs, not from our system. A tenant with a well-maintained KB produces excellent drafts with the same system that produces mediocre drafts for a tenant with a rotting KB. I knew this abstractly at the start of the year. I did not fully act on it. The product changes we shipped in H2 around freshness scoring, ownership, and content-health reviews were the highest-leverage work of the year, and every one of them would have been higher-leverage shipped earlier. The SME collaboration piece eventually made the argument in full. I wish I had let it drive prioritization harder in Q1.
Third — refusal is a feature, and the market underestimates how much. The decision to have the system refuse to draft when grounding is thin — rather than produce a plausible-but-ungrounded answer — is the decision that most clearly separates grounded-AI as a product stance from grounded-AI as a marketing word. Customers were uncomfortable with it at first. The writers on our customer teams were used to AI that always had something to say; a system that sometimes says “I do not have enough grounded content for this — please add a source block” felt slower at first. Six months in, those same writers describe refusal as the feature they trust most. It is the feature that convinces them the system is not making things up. AutogenAI wrote about the proposal-specific flavor of hallucination risk last year; the refusal pattern is the operational answer to that risk, and I wish we had leaned on it earlier in the product narrative.
Closing — what the thesis looks like, 12 months in
The thesis was: LLMs are good at drafting from a corpus they can cite, and bad at generating facts they have not been given. Build around that asymmetry.
Twelve months in, the thesis held. The asymmetry is real. Building around it produces measurably better proposal drafts than systems that ignore the asymmetry. The hallucination rate is not zero — it never will be — but it is meaningfully lower than the industry-default number and the gap is visible to the customers who care about it.
The thesis did not hold as simply as I wanted it to. “Grounded AI” is not a binary — a system is not either grounded or ungrounded. It is a gradient, and the gradient is controlled by many small pieces: retrieval quality, chunk size, reranker presence, claim-level verification, freshness discipline, refusal behavior, KB hygiene. Any one of them missing degrades the whole. All of them present raise the system to the point where a customer can actually trust the draft enough to ship without re-checking every paragraph.
That is the product. That is what we are building. The thesis going into year two is the same as the year-one thesis, held with more evidence. The grounded retrieval pillar is the technical reference, the grounded-AI pledge is the commitment, and the retrieval eval pillar is the measurement. The hallucination budget piece is the economic frame that ties them together.
In June, the engineering team will publish a technical companion to this post — one year of grounded retrieval — that gets into the model-level and retrieval-level specifics in ways this retrospective deliberately avoids. If you want the engineering version of the same story, look for that one.
What I would not do, a year in: declare the problem solved. The category is not done. The regressions happen. The failure modes evolve. The year-two retrospective, assuming we get to write one, will catalogue new failure modes we did not know about this year, and it will claim a different set of closed and open items. That is fine. That is what building a product honestly looks like. The thesis is durable. The work is continuous.
Sources
- 1. Stanford HAI — Hallucination-free? Assessing the reliability of leading AI legal research tools
- 2. Hacker News — The issue of hallucinations won't be solved with the RAG approach
- 3. Hacker News — Mayo Clinic's secret weapon against AI hallucinations: Reverse RAG
- 4. AutogenAI — AI hallucination: how can proposal teams reduce risk
See grounded retrieval in the product.
Start a trial workspace and watch PursuitAgent draft cited answers from the documents you provide.