Field notes

Retrieval over diagrams, not just text

How we index D2 code and diagram descriptions so an architecture question can ground to a specific figure. The pipeline, the failure modes, and the citation surface for a diagram source.

Most retrieval systems index text and stop. The KB has 50,000 sentences across 4,000 documents, and the vector store has every sentence embedded into a high-dimensional space. Ask “what is your data flow architecture,” and you get back the paragraph that says “our data flow uses Kafka.” The paragraph is correct. The paragraph is also an inadequate answer if the source document had a labeled architecture diagram showing five services, three queues, two databases, and the Kafka topic naming convention.

For proposal questions about architecture, the figure is often the load-bearing artifact. The text describes the figure; the figure shows the system. A retrieval system that can only retrieve text is missing half the source.

This post walks through how we index D2 code and diagram descriptions so that an architecture question can ground to a figure. The mechanism is a sibling to the text-retrieval pipeline, with three differences in how the source is represented and one difference in how the citation surface looks.

The pipeline

When we extract a source document, the diagram-extraction step we shipped earlier in the year produces two artifacts per detected figure:

  1. A diagram description — natural-language text that names the components, their relationships, and the figure’s caption. Generated by a vision-language model (Gemini 2.5 Flash) running over the cropped figure and conditioned on the surrounding document context.
  2. A D2 code block — a representation of the figure’s structure in D2, a declarative diagram language, also generated by the vision-language step. D2 was chosen because it is plain text (so it is searchable), it preserves topology (so layout reconstruction is deterministic), and it round-trips legibly into a renderable figure on the proposal output side. A sketch of both artifacts follows this list.
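
To make the pair concrete, here is a minimal sketch of both artifacts for a simple figure. The class name, field names, and D2 content are illustrative, not our production schema:

    # Illustrative sketch only: the shape of the two artifacts for one
    # detected figure. Names here are assumptions, not the actual schema.
    from dataclasses import dataclass

    @dataclass
    class ExtractedFigure:
        description: str  # natural-language text, embedded for semantic retrieval
        d2_source: str    # D2 code, indexed for keyword match

    figure = ExtractedFigure(
        description=(
            "A data flow diagram showing ServiceX publishing to the Kafka "
            "topic TopicY, which feeds the analytics database."
        ),
        d2_source=(
            "ServiceX -> TopicY: writes\n"
            "TopicY -> AnalyticsDB: consumed by\n"
        ),
    )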

Both artifacts are stored as KB blocks with the same provenance graph as text blocks. Each block has a source-document edge, an author chain (the extraction job is the initial author), an approver edge once a human reviewer signs off, a last-used-in edge as proposals reference it, and a deprecated-by edge if a newer figure supersedes it.

The two artifacts are linked. The block of type diagram_description points to the diagram_d2 block as a structural sibling. Retrieval indexes both.
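
In storage terms the sibling pair might look like the sketch below; the edge and field names follow the prose above, and the exact record shape is an assumption:

    # Hypothetical record shapes for the sibling pair. Edge and field names
    # mirror the provenance graph described above, not the actual KB schema.
    description_block = {
        "id": "blk_desc_001",
        "type": "diagram_description",
        "source_document": "doc_042",           # source-document edge
        "authors": ["extraction-job-2025-q2"],  # extraction job is the initial author
        "approved_by": None,                    # set once a human reviewer signs off
        "last_used_in": [],                     # filled as proposals reference it
        "deprecated_by": None,                  # set if a newer figure supersedes it
        "sibling": "blk_d2_001",                # structural-sibling link
    }
    d2_block = {
        "id": "blk_d2_001",
        "type": "diagram_d2",
        "source_document": "doc_042",
        "sibling": "blk_desc_001",              # back-link; retrieval indexes both
    }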

How retrieval ranks them

Three retrieval signals participate.

Description embedding. The natural-language diagram description is embedded the same way text content is embedded. A query like “what does your data flow look like” semantically matches the description “a data flow diagram showing five services connected via Kafka topics.” This is the primary retrieval signal. It works the way ordinary semantic retrieval works.

D2 keyword match. The D2 code is searched on a keyword index. Service names, topic names, database identifiers — these are usually proper nouns and acronyms that don’t embed well but do match exactly. A query that mentions “ServiceX” returns D2 blocks that name ServiceX in their topology, even if the description text didn’t include the name. This is the secondary signal.
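
The tokenizer behind that keyword index can stay simple because D2 connection lines have a predictable shape. A minimal sketch that ignores richer D2 syntax (containers, shapes, styles):

    def d2_identifiers(d2_source: str) -> set[str]:
        """Extract node identifiers from D2 connection lines for keyword
        indexing. Simplified: assumes one '->' edge per line and ignores
        containers, shapes, and styles."""
        names: set[str] = set()
        for line in d2_source.splitlines():
            head = line.split("#", 1)[0]   # drop D2 comments
            head = head.split(":", 1)[0]   # drop the edge label
            if "->" not in head:
                continue
            for part in head.split("->"):
                if part.strip():
                    names.add(part.strip())
        return names

    # d2_identifiers("ServiceX -> TopicY: writes") == {"ServiceX", "TopicY"},
    # so a query mentioning "ServiceX" matches on the exact identifier even
    # when the description text never names it.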

Co-document boost. When a query already retrieves a text block from document D, and document D contains a diagram block, the diagram block gets a small ranking boost. The intuition is that an architecture answer that draws from text in document D should probably surface the figure from the same document. This signal is small (it doesn’t dominate the retrieval ranking), but it shifts the right figures to the top of candidate lists in mixed text-and-diagram queries.
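
Folded together, the ranking reduces to something like the sketch below; the weights are invented for illustration, not our production values:

    from dataclasses import dataclass

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    @dataclass
    class DiagramCandidate:
        source_document: str
        description_embedding: list[float]
        d2_identifiers: set[str]

    def rank_score(cand: DiagramCandidate,
                   query_embedding: list[float],
                   query_terms: set[str],
                   text_hit_docs: set[str]) -> float:
        """Combine the three signals. All weights are illustrative."""
        s = cosine(query_embedding, cand.description_embedding)  # primary signal
        if query_terms & cand.d2_identifiers:                    # D2 keyword match
            s += 0.3
        if cand.source_document in text_hit_docs:                # co-document boost
            s += 0.1  # small by design: shifts ties, never dominates
        return s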

The retrieval call returns a ranked list that mixes text blocks and diagram blocks. The drafting layer decides whether the answer should include a figure, a paragraph, or both, based on the query type and the candidate list.

Citation for a diagram source

The grounded-retrieval pillar names three invariants: pointer, provenance, entailment. Diagram retrieval has to satisfy all three, with one twist on each.

Pointer. The pointer to a diagram source is to the diagram-block ID, not just to the source document. When the proposal output cites a figure, the citation footer links to the original document at the page where the figure appears, plus the rendered D2 inline so the reviewer can compare the rendered version against the source.

Provenance. Every component named in the drafted text — “the orchestration layer,” “the Kafka topics,” “the analytics database” — has a span pointer into either the diagram description or the D2 source. If the drafted text says “ServiceX writes to TopicY,” the provenance trace goes to the D2 line ServiceX -> TopicY.
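
Resolving a structural claim to its span can be mechanical. A sketch of the lookup, using the same one-edge-per-line simplification as the tokenizer above:

    def find_edge_span(d2_source: str, src: str, dst: str) -> tuple[int, str] | None:
        """Find the D2 line asserting an edge from src to dst, to use as a
        provenance span pointer. Returns (line number, line text) or None."""
        for lineno, line in enumerate(d2_source.splitlines(), start=1):
            head = line.split("#", 1)[0].split(":", 1)[0]
            parts = [p.strip() for p in head.split("->")]
            if parts == [src, dst]:
                return lineno, line.strip()
        return None

    # find_edge_span(figure.d2_source, "ServiceX", "TopicY")
    # -> (1, "ServiceX -> TopicY: writes")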

Entailment. The hardest invariant for diagrams. A drafted sentence says “ServiceX writes to TopicY for downstream analytics.” The D2 has ServiceX -> TopicY (entailment holds for the writing relationship) but says nothing about “for downstream analytics.” The verifier flags the suffix as ungrounded. The drafting layer either trims the suffix or surfaces a refusal back to the reviewer with a “no source for: ‘for downstream analytics’” annotation.

Entailment over diagrams is not magic. The verifier we use is the same claim-decomposition step that runs over text, with one specialization: structural claims (X writes to Y, A is connected to B, M is upstream of N) are checked against the D2 topology, and descriptive claims (the orchestration layer handles failover, the analytics database is HIPAA-compliant) are checked against the diagram description text. We split the verification because mixing topology checks and descriptive entailment in a single pass confused the verifier on dense diagrams.
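
Here is a sketch of the structural half of that split; claims are simplified to directed pairs, and the descriptive half (ordinary text entailment) is not shown:

    def parse_edges(d2_source: str) -> set[tuple[str, str]]:
        """Parse D2 connection lines into a set of directed edges.
        Simplified: one edge per line, no containers or chained arrows."""
        edges: set[tuple[str, str]] = set()
        for line in d2_source.splitlines():
            head = line.split("#", 1)[0].split(":", 1)[0]
            if "->" not in head:
                continue
            src, dst = (p.strip() for p in head.split("->", 1))
            if src and dst:
                edges.add((src, dst))
        return edges

    def verify_structural(src: str, dst: str, d2_source: str) -> bool:
        """Check an 'X writes to Y' style claim against the D2 topology."""
        return (src, dst) in parse_edges(d2_source)

    # verify_structural("ServiceX", "TopicY", figure.d2_source) is True, but
    # nothing in the topology or labels grounds "for downstream analytics",
    # so that suffix is flagged as ungrounded.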

Failure modes we still have

Three honest ones.

Vision-language extraction errors at the source. The vision-language step that produces the description and the D2 from a cropped figure is not perfect. On simple two- or three-component figures, it is essentially correct. On dense five-tier architecture diagrams with crossover lines and color-coded zones, the extracted D2 sometimes mislabels relationships or drops components that the model couldn’t disambiguate visually. We catch these in the human-review step before promotion to approved. Some slip through. We have a quarterly audit pass that re-runs extraction with a newer vision model and flags drift.

Out-of-vocabulary component names. When a customer’s architecture uses internal product code-names, the description text from extraction often falls back to generic descriptors (“the orchestration service” instead of the actual code-name). The D2 preserves the original text from any visible labels, so a query that mentions the code-name lands on the D2 even when the description doesn’t. The mixed retrieval with both signals is what makes this work; a description-only retrieval system would miss this case.

Versioning across diagram redraws. When an architecture changes, the new diagram is a new block that deprecates the old one. Citations from in-flight proposals to the old diagram then surface inheritance signals (the same way text inheritance works in the cross-project inheritance system), so the reviewer sees that the cited diagram is being deprecated by a newer one. Resolution is reviewer-driven; we do not auto-replace diagram citations because the new diagram may support a slightly different claim than the old one.

What this enables

Two product surfaces.

Architecture questions in DDQs. Security questionnaires routinely ask “describe your data flow” or “describe your encryption-at-rest topology.” With diagram retrieval wired in, the drafting layer can compose an answer that includes a rendered figure from the prior approved diagram block, with a citation footer pointing back to the source document and the architecture-team approver. Without diagram retrieval, the answer is paragraph-only and asks the reviewer to manually attach the figure.

Compliance-claim grounding. Compliance questions about isolation, segmentation, or fault-tolerance are typically answered by reference to a topology diagram. The drafted answer can ground each compliance claim to a structural property of the cited diagram — “the analytics environment is isolated from the production environment because the D2 source has no edges between them.” This is a stronger grounding than text-only because the topology is a verifiable structural property, not a claim in prose.
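
That isolation claim reduces to a reachability check over the parsed topology. A sketch, taking edges in the shape parse_edges produces and assuming the node sets for each environment are known:

    from collections import defaultdict, deque

    def is_isolated(edges: set[tuple[str, str]],
                    env_a: set[str], env_b: set[str]) -> bool:
        """True if no path connects env_a to env_b in the diagram topology.
        Edges are treated as undirected: any connection breaks isolation."""
        adj: dict[str, set[str]] = defaultdict(set)
        for src, dst in edges:
            adj[src].add(dst)
            adj[dst].add(src)
        seen, queue = set(env_a), deque(env_a)
        while queue:
            node = queue.popleft()
            if node in env_b:
                return False  # a connection exists between the environments
            for nxt in adj[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        return True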

The next post in the engineering build-log series covers how the gold set for retrieval evaluation is curated — including the diagram-aware triples we added to the held-out set. That’s Tuesday.

Sources

  1. PursuitAgent — Diagram extraction, shipped
  2. PursuitAgent — Grounded retrieval pillar
  3. PursuitAgent — Chunking pipeline end-to-end
  4. D2 — Declarative diagram language
  5. Google — Gemini API changelog