Shipped: diagram-aware extraction via Gemini 2.5 Flash

Diagram-aware extraction shipped to all customer KBs this week. System architecture diagrams, network topology diagrams, and data-flow diagrams in uploaded PDFs and Word documents are now extracted as first-class KB blocks alongside text and tables.

Each diagram block carries two columns: diagram_description (text) and diagram_d2 (D2 code). The first is what the retrieval engine searches against. The second is what we render when a reviewer expands the block in the UI. Both come from a single extraction pass.

What we shipped

A diagram inside an uploaded document hits the parse stage as an image region with its surrounding caption. Previously, the surrounding caption became a text block and the image was stored as an opaque attachment. Now, the image and caption are passed together to Gemini 2.5 Flash with a structured-output prompt that produces:

A natural-language description of what the diagram shows (typically 100–300 tokens). It names the components, the connections, the data flows, and the intent.
A D2 representation of the diagram structure. D2 is a text-first diagram language; the code re-renders to a clean SVG without us having to store the original image.

Both outputs land in the kb_blocks table with block_type = 'diagram'. The text description is embedded for retrieval; the D2 is rendered on demand in the reviewer’s expanded-block view.
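As a rough illustration of the shape described above, here is a minimal sketch of turning one structured-output reply into a diagram block. The JSON keys (description, d2), the page field, and the parse_extraction helper are assumptions made for this example; the post does not show the actual response schema or the full kb_blocks layout.

```python
import json
from dataclasses import dataclass


@dataclass
class DiagramBlock:
    """One row destined for kb_blocks with block_type = 'diagram'."""
    diagram_description: str  # embedded and searched by retrieval
    diagram_d2: str           # rendered to SVG on demand
    page: int                 # hypothetical field: source page of the diagram
    block_type: str = "diagram"


def parse_extraction(raw_json: str, page: int) -> DiagramBlock:
    """Validate a model reply and build a diagram block.

    Assumes the reply is a JSON object with 'description' and 'd2' keys.
    """
    payload = json.loads(raw_json)
    description = (payload.get("description") or "").strip()
    d2 = (payload.get("d2") or "").strip()
    if not description:
        # Without a description the block would not be retrievable.
        raise ValueError("extraction reply has no description")
    return DiagramBlock(diagram_description=description, diagram_d2=d2, page=page)


# Illustrative reply written by hand, not real model output.
sample_reply = json.dumps({
    "description": "The API gateway forwards requests to the case service, "
                   "which reads from Postgres.",
    "d2": "gateway -> case_service: HTTPS\ncase_service -> postgres: SQL",
})
block = parse_extraction(sample_reply, page=4)
```

The point of the sketch is only that one reply yields both fields, so the description and the D2 cannot drift apart between separate calls.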

Why text + D2

We considered three options.

Image-only. Store the image as a binary, do nothing with it. This is what we shipped before. Diagrams were retrievable only via their captions, which usually retrieved poorly because captions are short and the diagram itself is where the information lives.

Text description only. Have Gemini produce a description, throw away the visual structure. Cheap, but it loses the “show me the diagram” affordance in review. Reviewers want to see the picture, not just read about it.

Text + D2. What we shipped. The text gives us retrievable, embeddable content. The D2 gives us a faithful re-render that fits in the reviewer’s flow without us storing or rendering the original image. D2 also re-renders responsively, which the original PDF image does not.

The cost of the third path is one extra Gemini call per diagram compared with the image-only pipeline. Gemini 2.5 Flash is cheap enough that the overhead is negligible: the median document with two or three diagrams adds about two cents of extraction cost. We absorbed that.

How retrieval changed

The diagram description is treated as a regular text block for retrieval purposes. A query about “system architecture for the case management API” can now hit a diagram description directly, where previously it could only hit the caption. This is the largest practical win.
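To make the difference concrete, here is a toy sketch of how a diagram block can expose its description as searchable text. The word-overlap score is a deliberately simple stand-in for embedding similarity, and the block dictionaries, searchable_text, and overlap_score are hypothetical names for this example, not the retrieval engine's API.

```python
import re


def searchable_text(block: dict) -> str:
    """Return the text that retrieval sees for a block."""
    if block["block_type"] == "diagram":
        # Only the description is searched; the D2 code is for rendering.
        return block["diagram_description"]
    return block["text"]


def tokenize(text: str) -> set:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, block: dict) -> int:
    """Count shared tokens (toy stand-in for embedding similarity)."""
    return len(tokenize(query) & tokenize(searchable_text(block)))


blocks = [
    # What the old pipeline indexed: the caption alone.
    {"block_type": "text", "text": "Figure 3: Overview."},
    # What the new pipeline adds: a description of the diagram itself.
    {
        "block_type": "diagram",
        "diagram_description": "System architecture for the case management "
                               "API: gateway, case service, and Postgres.",
        "diagram_d2": "gateway -> case_service\ncase_service -> postgres",
    },
]

query = "system architecture for the case management API"
best = max(blocks, key=lambda b: overlap_score(query, b))
```

In this example the caption shares no tokens with the query, while the diagram description matches it directly.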

For citation rendering, a diagram block displays differently from a text block. The reviewer sees the description (citable) and a rendered SVG below it (illustrative). The citation pointer is to the diagram block as a whole; the page reference is the page in the source document where the diagram appears.

What still doesn’t work

Two known limits.

Hand-drawn or photographed diagrams. Gemini’s structured-output extraction is reliable on diagrams that were authored digitally: Visio exports, Lucidchart exports, draw.io PNGs. Hand-drawn diagrams and photos of a whiteboard still yield usable text descriptions, but their D2 conversion is unreliable. We surface a confidence score on the D2 output and fall back to the original image when confidence is low.

Diagrams that span pages. A network topology spread across two facing pages of a PDF gets extracted as two diagram blocks, neither of which is complete. We have a research notebook open on cross-page diagram stitching. Not in the production path yet.
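The low-confidence fallback for hand-drawn and photographed diagrams can be sketched as a single threshold check. The 0.7 cutoff, the 0-to-1 score range, and the choose_render name are assumptions for illustration; the post does not state the actual threshold or how the score is scaled.

```python
# Hypothetical cutoff; the real value is not given in the post.
D2_CONFIDENCE_THRESHOLD = 0.7


def choose_render(d2_confidence: float,
                  threshold: float = D2_CONFIDENCE_THRESHOLD) -> str:
    """Pick what the expanded-block view should show.

    Returns 'd2_svg' when the D2 output is trusted, otherwise
    'original_image' so the reviewer still sees the source picture.
    """
    if d2_confidence >= threshold:
        return "d2_svg"
    return "original_image"


digital_export = choose_render(0.93)    # e.g. a draw.io PNG
whiteboard_photo = choose_render(0.31)  # e.g. a phone photo of a whiteboard
```

Either way the text description remains the retrievable, citable part of the block; only the illustrative rendering changes.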

Where to find it

Diagram extraction runs automatically on every new document upload. For documents already in the KB, customers can re-extract from the document’s settings page (look for “Re-extract with diagram support”). Re-extraction preserves citation history on text and table blocks and adds new diagram blocks.

See the Knowledge Base platform page for the broader extraction pipeline and what’s covered.

Stack notes

Extraction model: Gemini 2.5 Flash, called via the GOOGLE_AI_API_KEY configured on the company workspace.
Render: D2 v0.6, server-rendered to inline SVG on view.
Storage: same kb_blocks table, two new columns; migration shipped without downtime.
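For readers wondering what a two-column addition like this looks like, here is a minimal sketch using an in-memory SQLite database as a stand-in. The production database engine, the existing kb_blocks columns, and the migration tooling are not described in the post, so the id and text columns below are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the existing table; the real schema is not shown in the post.
conn.execute(
    "CREATE TABLE kb_blocks ("
    "id INTEGER PRIMARY KEY, block_type TEXT NOT NULL, text TEXT)"
)
conn.execute(
    "INSERT INTO kb_blocks (block_type, text) VALUES ('text', 'Existing paragraph.')"
)

# Additive, nullable columns: rows written before the migration stay valid.
conn.execute("ALTER TABLE kb_blocks ADD COLUMN diagram_description TEXT")
conn.execute("ALTER TABLE kb_blocks ADD COLUMN diagram_d2 TEXT")

# A diagram block written after the migration.
conn.execute(
    "INSERT INTO kb_blocks (block_type, diagram_description, diagram_d2) "
    "VALUES (?, ?, ?)",
    ("diagram", "Gateway forwards requests to the case service.",
     "gateway -> case_service"),
)

columns = [row[1] for row in conn.execute("PRAGMA table_info(kb_blocks)")]
old_row = conn.execute(
    "SELECT diagram_description, diagram_d2 FROM kb_blocks WHERE block_type = 'text'"
).fetchone()
```

Existing text and table rows simply carry NULL in the two new columns, which is one common reason an additive change like this can ship without a rewrite of existing data.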

The extraction pipeline post from earlier this week — our chunking pipeline, end to end — covers the broader context. Diagrams now plug into stage 1 (parse) and stage 5 (index) the same way text blocks do.