Field notes

Citation UI: three designs we tried, two we kept

How we render inline citations next to grounded-AI output. Three UX experiments — footnote chips, side-pane evidence cards, and inline hover popovers — and what we learned about which ones reviewers actually use.

The PursuitAgent engineering team

Grounded AI is only grounded if the user actually verifies the grounding. The citation, in other words, is not an architectural feature — it is a UI feature. If the citation does not show up where the user is reading, in a form the user can act on without leaving the document, it does not function as a check.

We have tried three citation UI patterns over the last year. Two are in the product today. One we deprecated. This post is what we learned.

Pattern 1 — footnote chips (kept)

The simplest design. Each generated sentence gets a small numbered chip at its end. Clicking the chip opens a panel with the source passage, the document name, the block version ID, and the date the block was last edited. The chip is colored — green for high-confidence retrieval matches, amber for borderline, no chip at all if the sentence was not generated from a retrieved passage.

This is the default rendering. Reviewers like it because it is unobtrusive when reading and discoverable when checking. The amber-versus-green distinction matters more than we expected — reviewers learn to skim past the green chips and pause on the amber ones, which is exactly the attention budgeting we wanted.

The implementation is straightforward. Each generated paragraph is annotated at generation time with sentence-level provenance:

type CitedSentence = {
  text: string;
  blockVersionId: string | null; // null = no retrieval source
  retrievalScore: number | null; // 0..1, null if not retrieved
  bandColor: "green" | "amber" | "none"; // green = high-confidence match, amber = borderline, none = no retrieved source
};

The renderer walks the array and emits the chip with the appropriate band. Click handlers open the source-passage panel. There is no magic — the magic is that we generate the provenance at generation time, not as a post-hoc lookup, so the chip cannot drift from the sentence it annotates.
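For illustration, here is a minimal sketch of that renderer loop, reusing the CitedSentence type above. The DOM construction, class names, and openSourcePanel handler are ours for illustration, not the product's actual component API:

// Hypothetical hook into the source-passage panel described above.
declare function openSourcePanel(blockVersionId: string): void;

function renderParagraph(sentences: CitedSentence[]): HTMLElement {
  const p = document.createElement("p");
  let chipNumber = 0;
  for (const s of sentences) {
    p.append(s.text, " ");
    const blockId = s.blockVersionId;
    if (s.bandColor === "none" || blockId === null) continue; // no retrieval source, no chip
    chipNumber += 1;
    const chip = document.createElement("button");
    chip.className = `citation-chip citation-chip--${s.bandColor}`; // green or amber band
    chip.textContent = String(chipNumber);
    chip.addEventListener("click", () => openSourcePanel(blockId)); // opens the source-passage panel
    p.append(chip, " ");
  }
  return p;
}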

What we learned: reviewers click chips. We were not sure they would. Telemetry from production deployments shows that on a typical proposal section, 60 to 75 percent of generated sentences get their chip clicked at least once during the review pass. Higher on first reviews; lower on later reviews of the same content. The citation is not theatre; reviewers actually check.

Pattern 2 — side-pane evidence cards (kept, refined)

The second pattern is for deeper review. When a section is selected, a side pane opens with stacked evidence cards — one card per cited block, with the source passage, the highlighted span that the generated text drew from, and metadata (block version, last-edit date, owner, approval status).

The original version of this pane showed every cited block in section order. The refined version, after three months of iteration, shows the cited blocks grouped by KB block — so if a block is cited four times in a section, you see one card with four highlighted spans, not four separate cards.

This grouping change was the single most-requested refinement. Reviewers pointed out that they wanted to see “is this one block being relied on heavily, or is this section pulling from many sources?” — a question that the per-citation rendering hid and the per-block rendering surfaces immediately.
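A rough sketch of that per-block grouping, under the assumption that each citation record carries the cited block's version ID, the source passage, and the highlighted span; the field and function names here are illustrative, not our actual schema:

type Citation = {
  blockVersionId: string;
  sourcePassage: string;
  highlightedSpan: string; // the span the generated text drew from
};

type EvidenceCard = {
  blockVersionId: string;
  sourcePassage: string;
  highlightedSpans: string[]; // every span citing this block in the section
};

function groupByBlock(citations: Citation[]): EvidenceCard[] {
  const cards = new Map<string, EvidenceCard>();
  for (const c of citations) {
    const existing = cards.get(c.blockVersionId);
    if (existing) {
      existing.highlightedSpans.push(c.highlightedSpan); // one card, many spans
    } else {
      cards.set(c.blockVersionId, {
        blockVersionId: c.blockVersionId,
        sourcePassage: c.sourcePassage,
        highlightedSpans: [c.highlightedSpan],
      });
    }
  }
  return [...cards.values()]; // a block cited four times yields one card with four spans
}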

The side pane also surfaces freshness. Each card carries a small banner if the block has not been edited in 18 months, or if the block’s owner has left the company, or if a related block was retired since the citation was made. These banners are the most operationally useful part of the pane. They convert citations from passive checks into active triggers.
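A sketch of those banner conditions as plain predicates; the metadata fields are assumptions about what the card carries, and the 18-month threshold is the one from the prose:

type CardMetadata = {
  lastEditedAt: Date;
  ownerActive: boolean;         // false if the block's owner has left the company
  relatedBlockRetired: boolean; // true if a related block was retired since the citation was made
};

const STALE_AFTER_MONTHS = 18;

function freshnessBanners(meta: CardMetadata, now: Date = new Date()): string[] {
  const banners: string[] = [];
  const monthsSinceEdit =
    (now.getTime() - meta.lastEditedAt.getTime()) / (1000 * 60 * 60 * 24 * 30);
  if (monthsSinceEdit >= STALE_AFTER_MONTHS) banners.push("Not edited in 18+ months");
  if (!meta.ownerActive) banners.push("Owner has left the company");
  if (meta.relatedBlockRetired) banners.push("A related block has been retired");
  return banners;
}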

What we learned: the side pane is heavily used by senior reviewers (proposal managers, gold-team reviewers) and lightly used by drafters. Drafters use the inline chips; reviewers use the side pane. We design for both.

Pattern 3 — inline hover popovers (deprecated)

The third pattern was hover-triggered. Hovering over a generated sentence would surface a small popover with the source passage. No click required.

We deprecated this design after about four months of A/B testing. The reasons:

Hover does not work on touch devices. A meaningful fraction of reviewers — particularly executive sponsors at gold team, who often review on tablets — could not access the citation at all. We could have built a tap-equivalent, but at that point we had reinvented the chip click.

Hover popovers leak attention. Reviewers reported that the popovers triggered while reading, distracting from the flow of the paragraph. Users who liked them reported using them constantly; users who did not like them reported them as a friction point. The split was roughly 30/70, with the majority preferring the popovers disabled.

The hover state did not survive copy-paste. Reviewers who copied a paragraph out of the proposal (to email a SME, for example) lost the citation entirely. The chip-based approach renders citations into the copied text as bracketed numbers; the hover approach left the copied text uncited.
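For contrast, here is a sketch of how the chip approach survives copy-paste, reusing the CitedSentence type from the chip section; the serialization function is ours for illustration, not the product's actual clipboard handler:

function serializeForClipboard(sentences: CitedSentence[]): string {
  let n = 0;
  return sentences
    .map((s) => {
      if (s.blockVersionId === null) return s.text; // uncited sentences copy as plain text
      n += 1;
      return `${s.text} [${n}]`; // the citation survives as a bracketed number
    })
    .join(" ");
}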

We pulled the hover popovers in version 0.7. The chip-and-side-pane combination has been stable since.

The thing we did not anticipate

We expected reviewers to use citations primarily to verify factual claims. They do — but they also use citations as a navigation tool. “Show me everything in this section that cites block X” became a recurring request, because reviewers wanted to see how a single piece of evidence was being used across the response. We added a filter for that on the side pane. It is now the third most-used filter, behind “amber-banded only” and “freshness-flagged.”
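The filter itself is simple; a sketch over the Citation shape from the grouping example, with the function name ours for illustration:

function citationsToBlock(citations: Citation[], blockVersionId: string): Citation[] {
  return citations.filter((c) => c.blockVersionId === blockVersionId); // "everything that cites block X"
}

// Usage: how heavily does this section lean on a single block?
// citationsToBlock(sectionCitations, someBlockId).length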

This reshaped how we think about citations. They are not just trust signals on individual sentences — they are an index into the underlying KB usage pattern of the response as a whole. A reviewer who clicks “show all citations to block X” is asking a different question than “is this sentence true.” They are asking “is this proposal over-reliant on a single source for this set of claims,” which is a question about the structure of the argument rather than the veracity of individual sentences.

We have not yet built the right tools for that question. A graph view of block usage across a section is on the backlog. Reverse-RAG-style verification — where the system independently checks whether each cited claim is actually supported by the cited passage — is a separate workstream. (The Mayo Clinic reverse-RAG discussion on Hacker News is the framing we are starting from; the open question is whether per-claim verification is economically viable at proposal scale.)

What grounded UI is for

Stanford HAI’s legal RAG audit found that even systems that produced citations alongside answers hallucinated 17 to 33 percent of the time — the citations did not guarantee that the cited passage supported the claim. The generation looked grounded; the verification step the user performed was insufficient to catch the gap.

Citation UI does not, by itself, fix that. What it can do is make verification cheap enough that users actually do it. Footnote chips on every sentence with a one-click expansion to the source passage make the verification cost small. Side-pane evidence cards with freshness banners turn citations into operational signals about the KB’s health. Neither pattern fixes hallucination at the model layer — that is a separate set of architectural choices — but both patterns move the verification cost from “follow a footnote in a separate document” down to “click a chip and read a paragraph.”

The thing we are working on next is the third pattern — verification-at-write-time, where the system independently checks whether the generated text is supported by the retrieved passage and flags discrepancies before the user sees them. That changes the citation from a passive trust signal to an active validation step, and it changes the UI we render around it. We will write about that when it ships.

Sources

  1. Stanford HAI — Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
  2. Mayo Clinic Reverse RAG — Hacker News discussion