Field notes

Inside the ingest pipeline: parse, extract, index

How a PDF becomes searchable KB blocks. LlamaParse for parsing, structural-plus-semantic extraction, pgvector indexing with HNSW. Where each stage wins and where it falls over.

Ingest is the part of the product the user sees as “I uploaded a PDF and it became searchable.” Internally it’s three stages, three failure surfaces, and a meaningful share of the engineering team’s ongoing work.

This post walks through the pipeline as it runs in production. Parse, extract, index. We're honest about where each stage wins and where it falls over.

Stage 1 — Parse

Source documents are PDFs (most of them), Word documents, Excel workbooks, and the occasional Google Doc export. The parser’s job is to turn the binary into a structured representation: pages, blocks, tables, headings, and metadata about positional layout.

We use LlamaParse as the primary parser. It’s a hosted service from LlamaIndex that’s purpose-built for the document shapes RAG pipelines actually see — long PDFs with mixed prose, tables, and figures. It returns a structured JSON document with positional information per element, table structure preserved as structured data rather than as flattened text, and a confidence score per page.
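
Downstream stages work against a normalized representation of the parser output. A simplified sketch of that shape follows; the field names are illustrative, not LlamaParse's exact response schema.

type ParsedElement = {
  type: "heading" | "paragraph" | "table" | "figure";
  bbox: { x: number; y: number; width: number; height: number }; // positional layout from the parser
  text: string;                  // prose text, or caption text for figures
  table?: { rows: string[][] };  // table structure preserved, not flattened
};

type ParsedPage = {
  pageNumber: number;
  confidence: number;            // per-page parse/OCR confidence, 0..1
  elements: ParsedElement[];
};

type ParsedDocument = {
  pages: ParsedPage[];
};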

Where LlamaParse wins

Tables. Most PDF parsers flatten tables into runs of whitespace-separated text. The result reads coherently to a human and breaks every downstream extraction step. LlamaParse preserves table structure with row and column information intact. For RFP content — pricing workbooks, scoring rubrics, compliance matrices — this is the difference between “we can index this” and “we have to manually clean every table.”

Multi-column layouts. Government RFPs love a two-column layout with sidebars. LlamaParse reorders multi-column flows correctly more often than the alternatives we’ve tested.

OCR for scanned PDFs. Built-in. The OCR is good enough for clean scans, less good for low-DPI photocopies. We surface the per-page confidence score and flag low-confidence pages for human review.
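
A minimal sketch of that flagging step, assuming the normalized shape above; the 0.85 threshold is illustrative, not our production value.

const OCR_REVIEW_THRESHOLD = 0.85; // illustrative; tuned against real scans in practice

function pagesNeedingReview(doc: ParsedDocument): number[] {
  // Surface pages whose parse/OCR confidence is too low to index without a human look.
  return doc.pages
    .filter((page) => page.confidence < OCR_REVIEW_THRESHOLD)
    .map((page) => page.pageNumber);
}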

Where LlamaParse loses

Diagrams and figures. A page with a process diagram, an architecture flowchart, or a figure that the response references gets returned as a bounding box with extracted caption text — but the diagram itself is not represented semantically. We pipe diagram-flagged pages to a separate diagram extraction step using a vision model, which produces a textual description and (for diagrams that map cleanly) a D2 representation. That’s a separate post.

Highly-formatted Word documents. A Word document that uses styles, embedded objects, and revision history doesn’t always round-trip cleanly through the LlamaParse path. We use a Word-native parser (mammoth + a custom DOCX converter) for .docx files and only fall back to LlamaParse for the PDF render of Word documents.
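
A sketch of that routing, assuming the mammoth npm package; convertDocxHtmlToParsedDocument and parseWithLlamaParse stand in for the custom converter and the LlamaParse client call.

import mammoth from "mammoth";

async function parseUpload(filePath: string): Promise<ParsedDocument> {
  if (filePath.toLowerCase().endsWith(".docx")) {
    // Word-native path: mammoth resolves styles into HTML, and the custom
    // DOCX converter turns that into the normalized parsed representation.
    const { value: html } = await mammoth.convertToHtml({ path: filePath });
    return convertDocxHtmlToParsedDocument(html);
  }
  // Everything else, including PDF renders of Word documents, goes through LlamaParse.
  return parseWithLlamaParse(filePath);
}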

Tables with merged cells. LlamaParse preserves structure better than most, but merged cells in complex tables still produce occasional structural artifacts. The downstream extraction step has tolerance built in, but it’s not zero-cost.

Cost. LlamaParse is a per-page service. At customer scale, parsing cost is meaningful. We cache aggressively keyed on file hash; a re-upload of the same document is free.
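
A minimal sketch of the cache check; parseCache and parseWithLlamaParse are placeholders for the real store and client.

import { createHash } from "node:crypto";
import { readFile } from "node:fs/promises";

async function parseWithCache(filePath: string): Promise<ParsedDocument> {
  // Key on the content hash so a re-upload of identical bytes never hits the per-page billed path.
  const bytes = await readFile(filePath);
  const fileHash = createHash("sha256").update(bytes).digest("hex");

  const cached = await parseCache.get(fileHash);
  if (cached) return cached;

  const parsed = await parseWithLlamaParse(filePath);
  await parseCache.set(fileHash, parsed);
  return parsed;
}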

Stage 2 — Extract

The parser’s output is structured JSON. Extraction turns that into KB blocks: chunk-shaped, semantically meaningful units that the retrieval engine will query against.

Two kinds of extraction run.

Structural extraction

Structural extraction respects the document’s own headings, sections, and table boundaries. A heading-bounded section becomes a block. A table becomes a block (or multiple, if it’s large). A figure with an extracted caption becomes a block. The block carries metadata: page number, heading path, parent section, document version.

type KbBlock = {
  id: string;
  documentId: string;
  documentVersion: string;
  pageRef: { start: number; end: number };
  headingPath: string[]; // ["Section 4", "4.2 Technical Approach", "4.2.1 Architecture"]
  type: "prose" | "table" | "figure" | "list";
  text: string;
  structuredPayload?: TablePayload | FigurePayload;
  embedding: number[];
};

Structural extraction is high-precision and low-recall. It produces blocks the retrieval engine can cite cleanly. It misses content that doesn’t have a clean heading boundary — which, in real-world RFPs, is a non-trivial fraction.
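
A compressed sketch of the structural pass: walk elements in order, close the current block at each heading, and emit tables and figures as blocks of their own. The real pass also tracks heading depth and splits oversized sections; both are omitted here.

function structuralExtract(
  doc: ParsedDocument,
  documentId: string,
  documentVersion: string
): KbBlock[] {
  const blocks: KbBlock[] = [];
  let headingPath: string[] = [];
  let buffer: { text: string[]; start: number; end: number } | null = null;

  const flush = () => {
    if (!buffer) return;
    blocks.push({
      id: crypto.randomUUID(),
      documentId,
      documentVersion,
      pageRef: { start: buffer.start, end: buffer.end },
      headingPath: [...headingPath],
      type: "prose",
      text: buffer.text.join("\n\n"),
      embedding: [], // computed at index time
    });
    buffer = null;
  };

  for (const page of doc.pages) {
    for (const el of page.elements) {
      if (el.type === "heading") {
        flush();
        headingPath = [el.text]; // simplified: real code keeps the full heading hierarchy
      } else if (el.type === "table" || el.type === "figure") {
        flush();
        blocks.push({
          id: crypto.randomUUID(),
          documentId,
          documentVersion,
          pageRef: { start: page.pageNumber, end: page.pageNumber },
          headingPath: [...headingPath],
          type: el.type,
          text: el.text,
          embedding: [],
        });
      } else {
        buffer ??= { text: [], start: page.pageNumber, end: page.pageNumber };
        buffer.text.push(el.text);
        buffer.end = page.pageNumber;
      }
    }
  }
  flush();
  return blocks;
}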

Semantic extraction

Semantic extraction runs over the structural output and produces additional blocks that capture cross-section concepts. A discussion of pricing that spans three sections gets a single semantic block that summarizes the pricing position with pointers back to the source sections. A win theme that recurs across sections gets a semantic block tagged with each occurrence.

Semantic blocks are tagged as derived. The retrieval engine knows the difference: a citation that lands on a structural block points to a specific page; a citation that lands on a semantic block points to the source pages the semantic block was derived from. The Pledge-enforcement layer treats them differently — verifier entailment runs against the underlying structural blocks, not against the semantic summary.
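
A sketch of the derived-block bookkeeping; the sourceBlockIds field and the citationTargets helper are illustrative names, not the shipped API.

type SemanticBlock = KbBlock & {
  derived: true;
  sourceBlockIds: string[]; // structural blocks this summary was derived from
};

// A citation on a derived block resolves to its underlying structural blocks;
// verifier entailment runs against those, never against the summary text.
function citationTargets(
  block: KbBlock | SemanticBlock,
  blocksById: Map<string, KbBlock>
): KbBlock[] {
  if ("derived" in block && block.derived) {
    return block.sourceBlockIds
      .map((id) => blocksById.get(id))
      .filter((b): b is KbBlock => b !== undefined);
  }
  return [block];
}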

Where extraction loses

Implicit structure. A document that's one long flow of paragraphs, with no headings and with topical shifts that a human reader picks up but a parser doesn't, forces the structural extractor to chunk by length rather than by meaning. The retrieval engine still works, but citation precision suffers.

Adversarial layouts. Some RFPs are formatted in ways that look intentional and aren’t. A heading that is a heading in the visual sense but uses a body-text style; a section that’s been pasted in as an image. These cases produce extraction artifacts we surface to the operator at upload.

Cross-document references. "See Attachment B" is a structure we resolve at the bundle level (the multi-doc ingest work shipped this month). But a reference whose target lives in a document we don't have produces a flagged block ("this references content not present in the bundle") that the operator has to triage.
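
A sketch of how a flagged block gets detected; the regex and the flag wording are illustrative.

const ATTACHMENT_REF = /\b(?:see|refer to)\s+(attachment|appendix|exhibit)\s+([A-Z0-9]+)\b/gi;

function flagUnresolvedReferences(block: KbBlock, bundleTargets: Set<string>): string[] {
  // Collect references whose target document is not part of the ingested bundle.
  const flags: string[] = [];
  for (const match of block.text.matchAll(ATTACHMENT_REF)) {
    const target = `${match[1]} ${match[2]}`.toLowerCase();
    if (!bundleTargets.has(target)) {
      flags.push(`references content not present in the bundle: ${target}`);
    }
  }
  return flags;
}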

Stage 3 — Index

KB blocks land in Postgres with pgvector. Each block has an embedding (1536-dim) computed at ingest time. The HNSW index is built per company_id partition.

The schema:

CREATE TABLE kb_blocks (
  id UUID PRIMARY KEY,
  company_id UUID NOT NULL,
  document_id UUID NOT NULL,
  document_version TEXT NOT NULL,
  page_start INT,
  page_end INT,
  heading_path JSONB,
  block_type TEXT,
  text TEXT,
  structured_payload JSONB,
  embedding vector(1536),
  active BOOLEAN DEFAULT TRUE,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX kb_blocks_embedding_hnsw
  ON kb_blocks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

Build parameters: m=16, ef_construction=200. Query parameters: ef_search=80 by default, tuned per call. Build time on a 50,000-block corpus is single-digit minutes. We covered query latency in the retrieval latency budget post.
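
What a retrieval call looks like against that index, sketched with node-postgres; ef_search is a session setting, so it has to go to the same connection as the query.

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment

async function topKBlocks(companyId: string, queryEmbedding: number[], k = 20, efSearch = 80) {
  const client = await pool.connect();
  try {
    // Per-call tuning: higher ef_search trades latency for recall.
    await client.query(`SET hnsw.ef_search = ${Math.trunc(efSearch)}`);
    const { rows } = await client.query(
      `SELECT id, text, page_start, page_end
         FROM kb_blocks
        WHERE company_id = $1 AND active
        ORDER BY embedding <=> $2
        LIMIT $3`,
      [companyId, `[${queryEmbedding.join(",")}]`, k]
    );
    return rows;
  } finally {
    client.release();
  }
}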

Where indexing loses

Bursty rebuilds. A customer who re-uploads their entire KB triggers a full re-embed and re-index. We background the work and don’t block the user; the user just sees their search quality climb as the background job completes.
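
A sketch of the shape of that background job, reusing the pool from the sketch above; the batch size and the embedTexts call are placeholders.

async function reembedDocument(documentId: string, batchSize = 64): Promise<void> {
  // Work in batches so the job is resumable and never blocks an interactive request.
  let offset = 0;
  for (;;) {
    const { rows } = await pool.query(
      `SELECT id, text FROM kb_blocks
        WHERE document_id = $1
        ORDER BY id LIMIT $2 OFFSET $3`,
      [documentId, batchSize, offset]
    );
    if (rows.length === 0) break;

    const embeddings = await embedTexts(rows.map((r) => r.text)); // placeholder embedding call
    for (let i = 0; i < rows.length; i++) {
      await pool.query(`UPDATE kb_blocks SET embedding = $1 WHERE id = $2`, [
        `[${embeddings[i].join(",")}]`,
        rows[i].id,
      ]);
    }
    offset += rows.length;
  }
}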

Embedding model migration. When we move from one embedding model to another, every block in every customer’s corpus needs to be re-embedded. We’ve done this once. We did it on a rolling background path with both indexes live during the transition. It is the kind of work that’s invisible when it’s done well and very visible when it isn’t.

Cold-page-cache after deploy. A fresh container hitting a corpus that hasn’t been queried in an hour pays cold-cache cost on the first query. We pre-warm with the prior week’s top queries on boot.
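
A sketch of the pre-warm step; the logged_queries table of recent query embeddings is an assumption made for this sketch.

async function prewarmOnBoot(): Promise<void> {
  // Replay the prior week's most frequent queries so the first real request hits a warm page cache.
  const { rows } = await pool.query(
    `SELECT company_id, embedding FROM logged_queries
      WHERE created_at > NOW() - INTERVAL '7 days'
      ORDER BY hit_count DESC
      LIMIT 100`
  );
  for (const q of rows) {
    await pool.query(
      `SELECT id FROM kb_blocks
        WHERE company_id = $1 AND active
        ORDER BY embedding <=> $2
        LIMIT 5`,
      [q.company_id, q.embedding]
    );
  }
}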

The honest list of what still needs work

  • Diagram extraction is its own pipeline; we cover it separately.
  • Tables with merged cells in adversarial layouts still produce occasional artifacts; flagged at upload, manual triage available.
  • Cross-document references are resolved within a pursuit; references to documents outside the bundle become flagged blocks.
  • Embedding model is provider-pluggable but the migration tooling is not yet self-serve; we run migrations as a managed operation per customer.

Why all of this is worth writing about

A user uploads a PDF and expects search to work. The work between “uploaded a PDF” and “search works” is the substance of the product. We treat it as the substance — we measure it, we publish where it breaks, and we ship the next round of fixes against the same metrics.

If your KB ingest is slow, your KB ingest is wrong, or your KB ingest just produced a search experience that returns “loosely related” results — the Analyzer is free to try. Upload one document and see what comes out.

Sources

  1. LlamaIndex — LlamaParse documentation
  2. pgvector