Draft attribution in exports: PDF, DOCX, HTML
Inline citations have to survive the export. How the rendering preserves citation anchors across the three export formats, where each format makes it hard, and the specific decisions we made to keep the attribution auditable.
A drafted response with inline citations is auditable inside the product. The hard question is what happens when the team exports the document to PDF, DOCX, or HTML and sends it to the buyer, the compliance reviewer, or the signing officer. If the citations don’t survive, the audit trail doesn’t survive, and the whole grounded-AI proposition unravels at the file boundary.
This post describes how we preserve inline attribution across the three export formats we support.
The in-product representation
Inside the editor, a citation is a structured annotation on a text span. The stored form looks roughly like:
{
  "type": "citation",
  "span": { "start": 1420, "end": 1487 },
  "source": {
    "block_id": "blk_9k2x...",
    "version": "v14",
    "page_ref": null,
    "match_span": { "start": 124, "end": 189 }
  },
  "verified_at": "2026-03-28T14:02:11Z",
  "verifier": "claude-sonnet-4.8@2026-03"
}
The citation carries the source block ID, the version, the span within the source that matched, the verification timestamp, and the model+version of the verifier. This is the data every export has to convey — not literally the JSON, but the equivalent information in a form the format supports.
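As a rough illustration of how an exporter might load this record before rendering it, here is a minimal Python sketch. The class names, the `parse_citation` helper, the `blk_example` ID, and the choice to treat `page_ref` as an optional string are all assumptions made for the example; only the field names come from the stored form above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Span:
    start: int
    end: int

    def __post_init__(self) -> None:
        # Reject empty or inverted spans before they reach a renderer
        if not 0 <= self.start < self.end:
            raise ValueError(f"invalid span: {self.start}..{self.end}")


@dataclass(frozen=True)
class Citation:
    span: Span                  # offsets into the drafted response
    block_id: str
    version: str
    page_ref: Optional[str]     # assumed to be a string when present
    match_span: Span            # offsets into the source block
    verified_at: datetime
    verifier: str


def parse_citation(raw: dict) -> Citation:
    """Build a Citation from a stored annotation, failing loudly on gaps."""
    if raw.get("type") != "citation":
        raise ValueError("not a citation annotation")
    src = raw["source"]
    return Citation(
        span=Span(**raw["span"]),
        block_id=src["block_id"],
        version=src["version"],
        page_ref=src.get("page_ref"),
        match_span=Span(**src["match_span"]),
        # fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
        # so normalise it to an explicit UTC offset first.
        verified_at=datetime.fromisoformat(raw["verified_at"].replace("Z", "+00:00")),
        verifier=raw["verifier"],
    )


example = parse_citation({
    "type": "citation",
    "span": {"start": 1420, "end": 1487},
    "source": {
        "block_id": "blk_example",  # hypothetical ID for illustration
        "version": "v14",
        "page_ref": None,
        "match_span": {"start": 124, "end": 189},
    },
    "verified_at": "2026-03-28T14:02:11Z",
    "verifier": "claude-sonnet-4.8@2026-03",
})
```

Parsing into a frozen, validated object up front means each format-specific renderer can assume the five pieces of information are present rather than re-checking the raw JSON.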
PDF
PDF is the most demanding format because it’s the one most often sent to buyers, the one most often archived for compliance, and the one with the least native support for structured annotation.
What we render in PDF:
Visible citation markers. A superscript numeric marker next to each cited span — [17], [18], etc. Same pattern every buyer has seen in academic or policy documents. No explanation needed.
An endnote section. A structured endnotes list at the back of the document, numbered to match the inline markers. Each endnote shows: the source block’s human-readable title, the block version, the verification date, and a shortened URL to the block in the product (if the reader has access).
PDF bookmark tree. A PDF bookmark is rendered for each cited source, so a reviewer with Acrobat can jump from citation to source in the bookmarks panel.
Structured metadata in the PDF dictionary. For PDFs consumed by automation (GRC tools, compliance trackers), we embed a parallel structured representation in the PDF’s custom metadata. A consumer that knows to look finds the citation graph as a JSON-LD blob; a consumer that doesn’t look sees a clean-looking document.
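The marker-to-endnote numbering described above can be sketched as follows. This is an assumption-laden illustration, not the production renderer: `CitedSpan`, `build_endnotes`, the `titles` lookup, and the rule of one endnote per citation in reading order are all hypothetical, and the shortened URL is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitedSpan:
    start: int          # offset of the cited span in the response text
    block_id: str
    version: str
    verified_at: str    # ISO 8601 timestamp, as stored


def build_endnotes(citations, titles):
    """Number citations in reading order and format one endnote per marker."""
    ordered = sorted(citations, key=lambda c: c.start)
    notes = []
    for number, c in enumerate(ordered, start=1):
        title = titles.get(c.block_id, "(untitled source)")
        date = c.verified_at[:10]   # keep the date, drop the time
        notes.append(f"[{number}] {title}, {c.version}, verified {date}")
    return notes


notes = build_endnotes(
    [
        CitedSpan(900, "blk_b", "v3", "2026-03-27T09:15:00Z"),
        CitedSpan(120, "blk_a", "v14", "2026-03-28T14:02:11Z"),
    ],
    {"blk_a": "Data retention policy", "blk_b": "Backup schedule"},
)
```

Sorting by span start before numbering keeps the inline markers ascending as the reader moves through the document, which is what the familiar academic pattern leads a reader to expect.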
Where PDF makes this hard: text reflow can separate a citation marker from its anchor text if the export allows a line break between them. We pin marker to text with a no-break space and a specific small-caps superscript style that renders identically in Acrobat, Preview, and Chromium’s PDF viewer. We review this monthly because each of those three renderers has shipped regressions at various points.
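A minimal sketch of the pinning step, assuming markers are spliced into the plain text before layout. The function name and the end-exclusive offsets are assumptions for illustration; superscript styling is a layout concern and is not shown.

```python
NBSP = "\u00a0"  # no-break space: a line may not wrap at this position


def insert_markers(text: str, span_ends: list[int]) -> str:
    """Append "[n]" after each cited span, glued on with a no-break space."""
    numbered = list(enumerate(sorted(span_ends), start=1))
    # Work from the last span backwards so that earlier offsets stay valid
    for number, end in reversed(numbered):
        text = text[:end] + f"{NBSP}[{number}]" + text[end:]
    return text


draft = "We encrypt data at rest. Backups run nightly."
marked = insert_markers(draft, [24, 45])
```

Inserting from the end of the text toward the start avoids recomputing offsets after each splice, since an insertion only shifts the characters that follow it.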
DOCX
DOCX is the format proposal writers edit, not just read. Our DOCX export has to survive not just opening but also being edited by someone in Word, saved back, and returned.
What we render in DOCX:
Word comments. Each citation becomes a Word comment on the cited span. The comment body includes the source block title, the version, and a hyperlink back to the product. Word comments survive edits — even if the reviewer changes the cited sentence, the comment stays anchored.
An endnote section. Same structure as the PDF.
Tracked-change-safe anchoring. We use Word’s bookmark feature for the span, not just character indexing. If the reviewer inserts text before a cited span, the bookmark moves with the text. If the reviewer deletes the cited span entirely, the bookmark is lost, which is the correct behavior — the citation no longer has a valid anchor.
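The anchoring behavior described above can be modeled in a few lines. This is a simplified model of how a span anchor responds to edits, not Word's actual bookmark implementation; the function names and end-exclusive offsets are assumptions.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end), end-exclusive offsets


def after_insert(span: Span, at: int, length: int) -> Span:
    """Shift or grow a span when `length` characters are inserted at `at`."""
    start, end = span
    if at <= start:            # insertion before the span: whole span moves
        return (start + length, end + length)
    if at < end:               # insertion inside the span: span grows
        return (start, end + length)
    return span                # insertion after the span: unchanged


def after_delete(span: Span, del_start: int, del_end: int) -> Optional[Span]:
    """Adjust a span for a deletion; None means the anchor is gone."""
    start, end = span
    if del_start <= start and end <= del_end:
        return None            # cited text deleted entirely
    removed = del_end - del_start
    if del_end <= start:       # deletion before the span
        return (start - removed, end - removed)
    if del_start >= end:       # deletion after the span
        return span
    # Partial overlap: keep whatever part of the span survives
    new_start = min(start, del_start)
    overlap = min(end, del_end) - max(start, del_start)
    return (new_start, new_start + (end - start) - overlap)


moved = after_insert((1420, 1487), at=100, length=12)
gone = after_delete((1420, 1487), del_start=1400, del_end=1500)
```

Here `moved` is the same 67-character span shifted right by 12, and `gone` is `None`, which corresponds to the case where the citation no longer has a valid anchor.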
Where DOCX makes this hard: a Word comment is rendered differently in Word, Google Docs, and LibreOffice. Our spec is that the comment must be legible in Word 365, Word Online, Google Docs, and LibreOffice; we test against all four before releasing changes. Google Docs in particular has had quirks with imported Word comments on tracked-changes-enabled documents.
HTML
HTML is the simplest format to render and the most useful for automated downstream consumers.
What we render in HTML:
Semantic <cite> elements with data-* attributes. Each citation is a <cite data-source-id="..." data-version="v14" data-verified-at="...">[17]</cite>. The superscript is styled with a class; the underlying data is machine-readable without scraping a PDF.
A linked endnote section. <ol class="endnotes"> with anchor links from inline markers to endnote entries.
JSON-LD embedded in the document head. The same JSON-LD blob that lives in the PDF dictionary is also embedded in HTML as a <script type="application/ld+json"> in the document head.
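A sketch of how the inline element and the JSON-LD script might be generated. The attribute names come from the markup above; the function names, the `citation` class, the `endnote-N` anchor IDs, and the sample graph are assumptions, and the JSON-LD vocabulary the product actually uses is not specified here.

```python
import json
from html import escape


def render_cite(number: int, block_id: str, version: str, verified_at: str) -> str:
    """Render one inline marker that links to its endnote entry."""
    attrs = (
        f'data-source-id="{escape(block_id, quote=True)}" '
        f'data-version="{escape(version, quote=True)}" '
        f'data-verified-at="{escape(verified_at, quote=True)}"'
    )
    link = f'<a href="#endnote-{number}">[{number}]</a>'
    return f'<cite class="citation" {attrs}>{link}</cite>'


def render_json_ld(graph: dict) -> str:
    """Serialise the citation graph for the document head."""
    payload = json.dumps(graph, sort_keys=True)
    # "</" inside a script element could end it early, so escape the slash;
    # "<\/" is still valid JSON and decodes back to "</".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


cite_html = render_cite(17, "blk_example", "v14", "2026-03-28T14:02:11Z")
head_html = render_json_ld({
    "@context": "https://schema.org",   # placeholder vocabulary for the sketch
    "@type": "CreativeWork",
    "citation": [{"identifier": "blk_example", "version": "v14"}],
})
```

Escaping attribute values and the script payload matters because block titles and IDs come from customer content, and an export that breaks its own markup is no longer machine-readable.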
HTML is where the downstream-automation use case actually pays off. Buyers who ingest our exports into their own compliance trackers consistently cite “the HTML with structured metadata” as the format they prefer for archival.
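To show what that downstream use might look like, here is a minimal consumer built on Python's standard `html.parser`. The `CiteCollector` class and the sample document are hypothetical; a real compliance tracker would likely also read the JSON-LD blob.

```python
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collect the data-* attributes of every <cite> element."""

    def __init__(self) -> None:
        super().__init__()
        self.citations: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "cite":
            return
        # Keep only data-* attributes, with the "data-" prefix removed
        data = {
            name[len("data-"):]: value
            for name, value in attrs
            if name.startswith("data-") and value is not None
        }
        if data:
            self.citations.append(data)


collector = CiteCollector()
collector.feed(
    '<p>Data is encrypted at rest.'
    '<cite data-source-id="blk_example" data-version="v14" '
    'data-verified-at="2026-03-28T14:02:11Z">[17]</cite></p>'
)
```

After `feed()`, `collector.citations` holds one dictionary per inline marker, with no PDF text extraction involved.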
What we don’t do in exports
We don’t bake the full source text of each cited block into the export. Three reasons.
First, the block might be confidential. Our customers’ KBs contain content not meant for the buyer — internal rationales, pricing guidance, competitive notes — and the citation should link to the block without embedding the block.
Second, blocks change. A cited block version is the immutable reference; an embedded block is a snapshot that will diverge. We store version references and resolve them on click.
Third, readability. A response with every citation expanded inline is twice as long and half as readable. The citation is a pointer; expansion happens on demand.
The compliance angle
Two buyers specifically (both federal sub-primes) required us to demonstrate that an exported PDF’s citations remained resolvable 90 days after submission — they wanted to confirm that a debrief conversation referencing a specific claim could be traced back to the underlying source. We passed the demonstration because the block-version references are immutable; the specific version of each block cited in a response submitted in January 2026 is the version resolvable in April 2026, regardless of whether the block has been updated since.
This is a non-obvious feature that matters enormously in practice. Our block-versioning decision from the v2 schema post was driven partly by this requirement. Export attribution is the visible surface; immutable block versions are the invisible foundation that makes it true.
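The property that makes this work is that publishing an update creates a new version and never rewrites an old one. A toy in-memory sketch of that idea follows; `BlockStore` and its methods are hypothetical names, and the real storage layer is not described in this post.

```python
class BlockStore:
    """Append-only store: publishing never overwrites an existing version."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], str] = {}
        self._latest: dict[str, int] = {}

    def publish(self, block_id: str, text: str) -> str:
        """Store `text` as the next version of the block and return its label."""
        number = self._latest.get(block_id, 0) + 1
        version = f"v{number}"
        self._versions[(block_id, version)] = text
        self._latest[block_id] = number
        return version

    def resolve(self, block_id: str, version: str) -> str:
        """Return exactly the text that was cited, whatever has changed since."""
        return self._versions[(block_id, version)]


store = BlockStore()
cited = store.publish("blk_example", "Backups are retained for 30 days.")
store.publish("blk_example", "Backups are retained for 90 days.")
resolved = store.resolve("blk_example", cited)
```

A citation that recorded `cited` at submission time still resolves to the original sentence after the block has been updated, which is the behavior the 90-day demonstration relied on.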
What’s still rough
Two outstanding issues.
PDF accessibility. A citation marker is a superscript, which most screen readers read awkwardly (“seventeen”) instead of as an annotation. Our PDF export is accessible for body text but imperfect for citation navigation. We’re working on a Tagged PDF implementation for the cited annotations that would make the citations announceable as semantic annotations, not just superscripts.
Google Docs comment styling. Word comments imported into Google Docs sometimes lose the hyperlink formatting while preserving the text. Google Docs is the environment we can’t fully control. We warn users in the UI when a DOCX export is being opened in Google Docs and recommend the HTML format for that case.
The takeaway
Inline attribution in exports is a format-by-format engineering problem, not a single rendering decision. Each format has a different mechanism — PDF endnotes, Word comments, HTML semantic markup — and each one has a different failure mode. The invariant is that block-version references are immutable; that’s the floor the formats build on top of.