Streaming drafts over SSE, with citations inline
How we stream draft output to the browser while keeping citation integrity intact. The architecture, the failure modes, and the part we got wrong twice.
Streaming a draft to the browser is easy. Streaming a draft with citations that stay attached to their sentences is the part that took us two rewrites to get right.
This post is the architecture we landed on, the two we abandoned, and the failure modes we still watch for.
What we want the user to see
When a reviewer hits “draft answer” in the proposal builder, three things should happen on screen, in order:
- The first token of the draft appears within ~600ms.
- As each sentence completes, a citation badge attaches to the end of the sentence pointing at the KB block it came from.
- If retrieval can’t ground a sentence, the engine refuses; the badge becomes a “no source” warning the reviewer can click to inspect.
The constraint is that the citation has to attach to the correct sentence — not the previous one, not the next one. Off-by-one citations are worse than no citations because they create false confidence.
The architecture, current state
The path is:
client (browser)
  │  POST /api/v1/drafts (question, sectionId)
  │  Accept: text/event-stream
  ▼
hono server
  │  retrieve top-k blocks (dense + reranker)
  │  open SSE stream
  │
  ├─ event: meta { retrievedBlockIds, retrievalScores }
  │
  └─ start LLM call (Claude Sonnet 4.6, streaming)
       │
       for each token from upstream:
         buffer until sentence boundary
         if boundary reached:
           run lightweight entailment over the sentence
             vs. retrievedBlocks
           attach citation OR mark ungrounded
           emit:
             event: token { text }
             event: sentence { text, citations[], grounded }
       │
       end
       ├─ event: done { stats }
       └─ stream closed
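For concreteness, the event payloads the client handles look roughly like this (the field names come from the diagram above; the shape of a citation entry and the done stats are illustrative guesses):

type MetaEvent = { retrievedBlockIds: string[]; retrievalScores: number[] };
type TokenEvent = { text: string };
type SentenceEvent = {
  text: string;
  citations: { blockId: string; score: number }[]; // empty when ungrounded; entry shape is a guess
  grounded: boolean;
};
type DoneEvent = { stats: Record<string, number> }; // stats fields vary; treated as opaque here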
The two important design choices:
- Sentences, not tokens, are the citation unit. Tokens stream for UX latency; citations attach at sentence boundaries because that’s the smallest meaningful unit of claim. Attaching at the token level produced citation badges that flickered as the model revised mid-sentence.
- The entailment check runs as the sentence completes, not at the end of the response. This means a refusal can happen mid-stream — the engine can stop, emit an “ungrounded” sentence event, and let the reviewer decide whether to continue. The first version we built ran entailment after the whole response, which meant a 30-second draft might be entirely fabricated and we wouldn’t tell the reviewer until the spinner stopped.
The two architectures we abandoned
Architecture A — citations as a second request. First version. Stream the draft as plain tokens, then on stream end fire a separate /api/v1/drafts/:id/cite request that took the finished text and labeled which blocks supported each sentence. This works, but the citation step took 4–8 seconds, during which the reviewer was looking at an uncited draft and reading it. If they accepted the draft before citations resolved, they accepted ungrounded text. We removed it.
Architecture B — interleaved citation tokens. Second version. Have the model emit a special [[cite:block-id]] token after each sentence; the client parses these out of the stream and renders badges. This worked, sometimes. The model hallucinated block IDs. It also occasionally emitted citations for blocks that weren’t in the retrieval set, which is exactly the failure mode citations are supposed to prevent.
The lesson from both: citations cannot come from the model. The model writes prose; the system attributes prose to retrieved blocks via an entailment check the model doesn’t control. Letting the model self-cite is the same category of error as letting the model self-grade on hallucination.
Sentence buffering, in practice
The token-to-sentence buffer is small — under 60 lines of TypeScript:
class SentenceBuffer {
  private buffer = "";

  // Accumulate a token; if the buffer now contains a complete sentence,
  // return it and keep only the remainder buffered.
  push(token: string): { sentence: string; remainder: string } | null {
    this.buffer += token;
    // Lazy match up to the first [.!?] followed by whitespace or end-of-input.
    const match = this.buffer.match(/^(.+?[.!?])(\s|$)/s);
    if (!match) return null;
    const sentence = match[1];
    const remainder = this.buffer.slice(match[0].length);
    this.buffer = remainder;
    return { sentence, remainder };
  }

  // Called at stream end: hand back any trailing text that never hit a terminator.
  flush(): string {
    const out = this.buffer;
    this.buffer = "";
    return out;
  }
}
Sentence-end detection is a regex on [.!?] followed by whitespace or end-of-input. It’s wrong about 2% of the time on edge cases — abbreviations, list numbers (“4.7.1”), embedded URLs. We accept that error rate for now and have a known-cases test that pins the failures we’ve seen. A perfect sentence segmenter is a research problem we don’t need to solve to ship; we need a 98% segmenter that’s debuggable.
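A minimal sketch of how the buffer is driven by the upstream token stream (the generator and its names are ours, for illustration):

async function* toSentences(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  const buf = new SentenceBuffer();
  for await (const token of tokens) {
    const hit = buf.push(token);
    if (hit) yield hit.sentence; // a sentence boundary was crossed on this token
  }
  const tail = buf.flush(); // trailing text that never hit a terminator
  if (tail.trim()) yield tail;
}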
Entailment per sentence
Per-sentence entailment is the load-bearing step. The check is:
Given this sentence and the top-k retrieved blocks (the same ones surfaced in the
meta event at stream start), does at least one block contain a span that supports the sentence's claims?
We run this with a small classifier (DeBERTa-v3-large fine-tuned on NLI data, hosted in-process). It returns a confidence in [0, 1]. Above 0.7 the sentence gets a citation pointing at the highest-supporting block. Between 0.4 and 0.7, the sentence ships with a “review” badge. Below 0.4, the sentence gets a “no source” warning and the engine logs it for the grounded-AI pledge audit log.
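A sketch of that decision with the thresholds above; scoreAgainstBlocks stands in for the in-process classifier call and is a name we made up for illustration:

type Block = { id: string; text: string };

type Grounding =
  | { kind: "cited"; blockId: string; score: number }
  | { kind: "review"; blockId: string; score: number }
  | { kind: "ungrounded"; score: number };

// scoreAgainstBlocks returns one entailment confidence in [0, 1] per retrieved block.
async function groundSentence(
  sentence: string,
  blocks: Block[],
  scoreAgainstBlocks: (s: string, b: Block[]) => Promise<number[]>,
): Promise<Grounding> {
  const scores = await scoreAgainstBlocks(sentence, blocks);
  const bestIdx = scores.indexOf(Math.max(...scores));
  const best = scores[bestIdx];
  if (best >= 0.7) return { kind: "cited", blockId: blocks[bestIdx].id, score: best };
  if (best >= 0.4) return { kind: "review", blockId: blocks[bestIdx].id, score: best };
  return { kind: "ungrounded", score: best };
}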
Latency: ~80ms per sentence on our hardware. That's inside the per-sentence budget because the classifier finishes well before the LLM produces the next sentence. We were nervous about adding another model call to the streaming hot path, and the answer turned out to be: small classifiers are cheap.
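Putting the buffer and the classifier together, the per-sentence step in the streaming hot path looks roughly like this, reusing SentenceBuffer and groundSentence from the sketches above (emit stands in for writing an SSE event; the "review" band is folded into a plain citation here for brevity):

async function streamWithCitations(
  tokens: AsyncIterable<string>,
  blocks: Block[],
  scoreAgainstBlocks: (s: string, b: Block[]) => Promise<number[]>,
  emit: (event: string, data: unknown) => Promise<void>,
): Promise<void> {
  const buf = new SentenceBuffer();
  for await (const token of tokens) {
    await emit("token", { text: token });
    const hit = buf.push(token);
    if (!hit) continue;
    const g = await groundSentence(hit.sentence, blocks, scoreAgainstBlocks);
    if (g.kind === "ungrounded") {
      await emit("sentence", { text: hit.sentence, citations: [], grounded: false });
      return; // stop mid-stream; the reviewer decides whether to continue
    }
    await emit("sentence", {
      text: hit.sentence,
      citations: [{ blockId: g.blockId, score: g.score }],
      grounded: true,
    });
  }
  // Handling of buf.flush() at stream end is omitted here for brevity.
}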
The race condition we hit
A subtle bug we shipped and reverted, twice: the SSE stream’s meta event (with retrieved block IDs) was occasionally racing the first token. On slow connections, the browser received the first token event before meta, which meant the client was rendering text without knowing what blocks were eligible to cite from. The citation badges then attached to whichever block won a fallback heuristic — sometimes the wrong one.
Fix: emit meta synchronously before the first LLM call, with a server-side flush. The server now sends meta before opening the upstream LLM connection, not after. It costs us about 30ms of TTFB and removes the race.
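A minimal sketch of the ordering, using Hono's streamSSE helper; the retrieval and LLM calls are hypothetical stand-ins, and the per-sentence loop is the one sketched earlier:

import { Hono } from "hono";
import { streamSSE } from "hono/streaming";

// Stand-ins for the real retrieval and LLM calls (hypothetical signatures).
declare function retrieveBlocks(question: string): Promise<{ id: string; text: string }[]>;
declare function openLLMStream(question: string, blocks: { id: string; text: string }[]): AsyncIterable<string>;

const app = new Hono();

app.post("/api/v1/drafts", async (c) => {
  const { question } = await c.req.json();
  const blocks = await retrieveBlocks(question);

  return streamSSE(c, async (stream) => {
    // meta is written before the upstream LLM connection is opened, so the client
    // always knows the eligible blocks before the first token arrives.
    await stream.writeSSE({
      event: "meta",
      data: JSON.stringify({ retrievedBlockIds: blocks.map((b) => b.id) }),
    });

    // Only now open the LLM stream and run the per-sentence loop sketched above.
    for await (const token of openLLMStream(question, blocks)) {
      await stream.writeSSE({ event: "token", data: JSON.stringify({ text: token }) });
      // ...sentence buffering, entailment, and sentence events as in streamWithCitations...
    }
    await stream.writeSSE({ event: "done", data: JSON.stringify({ stats: {} }) });
  });
});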
What this looks like to the reviewer
- 600ms to first character on screen.
- Citations attach as each sentence completes (usually 1–3 seconds per sentence at typical generation speed).
- A refused sentence stops the stream and surfaces a clickable “no source — inspect retrieval” affordance. The reviewer can either accept the partial draft, force a re-retrieve, or reject and edit the prompt.
- The draft, once complete, is auditable: every sentence has either a block-id link or a logged refusal. The grounded-AI pledge audit log consumes the same event stream and persists it.
What we still watch for
- Streaming over flaky networks. SSE is robust to most failures, but a client that disconnects mid-sentence loses the entailment result for that sentence. We retry on reconnect by re-running entailment over the buffered text; the cost is one extra classifier call per disconnect.
- Sentence segmentation on lists. Bulleted and numbered lists confuse the regex. We currently treat list items as their own sentences, which is mostly right and occasionally wrong (a list item with two clauses). Open issue.
- Multi-block claims. A sentence with two claims supported by two different blocks gets one citation. The reviewer doesn’t see the second support. We are working on multi-block citation rendering; the citation rendering stack post has the design, not yet shipped.
The shorter version: stream tokens for UX, attach citations at sentence boundaries, never let the model self-cite, and emit retrieval metadata before the first token. Get those four right and the rest is plumbing.