Turning a SOC 2 PDF into 140 KB blocks
The ingest, the extraction, the linking. A worked trace of how a SOC 2 Type II report becomes the set of KB blocks that DDQ answers cite — with the real pgvector row shape at the end.
A SOC 2 Type II report is a PDF of roughly 60 to 100 pages. The final form, as far as the DDQ workflow is concerned, is 140-ish KB blocks — one per control description, plus a handful for organizational context, plus evidence-attachment pointers. This post traces a single SOC 2 ingest end to end.
The example is a redacted version of a real customer’s SOC 2 from earlier this year. Numbers are representative of what we see across similar ingests.
Step 1 — Upload and hash
The PDF hits the ingest endpoint. Three things happen synchronously:
- The file is hashed with SHA-256. The hash becomes the artifact’s permanent identity in the evidence vault.
- The file is stored in object storage at a path derived from the hash.
- An evidence_artifact row is inserted with artifact_type = 'soc2', the uploader as owner_user_id, and classification = 'nda_required' by default (the GRC lead can re-classify sections later).
The rest of the ingest runs asynchronously — a Hatchet workflow that polls the parser and processes the result.
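The synchronous part of step 1 is small enough to sketch. This is a minimal illustration, not our production code; the two-level directory fan-out in the storage key is an assumption, though the SHA-256-as-identity part is exactly as described above.

```python
import hashlib
from pathlib import PurePosixPath

def sha256_of(data: bytes) -> str:
    """Hash the uploaded PDF; the digest is the artifact's permanent identity."""
    return hashlib.sha256(data).hexdigest()

def storage_path(digest: str) -> str:
    """Derive the object-storage key from the hash.
    The evidence/<aa>/<bb>/ fan-out scheme is hypothetical."""
    return str(PurePosixPath("evidence") / digest[:2] / digest[2:4] / f"{digest}.pdf")
```

Because the path is a pure function of the content hash, re-uploading the same PDF lands on the same key and deduplicates for free.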
Step 2 — Parse
The PDF goes to LlamaParse. SOC 2 reports parse cleanly because they are typically well-structured: a cover letter, an independent auditor’s report section, a management assertion, a description of the service organization’s system, and the actual control listing with test procedures and results.
LlamaParse returns a structured document — headings, paragraphs, tables, page numbers — in roughly 15 to 30 seconds for a 70-page report. We pay about $0.30 per report in parser cost.
For SOC 2s specifically, we also run a secondary extraction pass against the control-testing tables. Those tables have a predictable shape (control ID, control description, test procedure, test result, exceptions noted) and we want them structured rather than flattened into prose. The secondary pass uses a constrained extraction prompt that expects that specific shape.
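The "predictable shape" the secondary pass expects can be written down as a plain record. This is a sketch of the target type and a trivial row mapper; the field names are ours, and the real pass goes through a constrained LLM prompt rather than positional cell unpacking.

```python
from dataclasses import dataclass, field

@dataclass
class ControlTestRow:
    """The five-column shape of a SOC 2 control-testing table row."""
    control_id: str                 # e.g. "CC6.1"
    description: str
    test_procedure: str
    test_result: str                # typically "No exceptions noted"
    exceptions: list[str] = field(default_factory=list)

def row_from_cells(cells: list[str]) -> ControlTestRow:
    """Map one parsed table row (5 cells) into the structured shape.
    Assumes exceptions are ';'-separated in the last cell."""
    cid, desc, proc, result, exc = (c.strip() for c in cells)
    return ControlTestRow(cid, desc, proc, result,
                          [e.strip() for e in exc.split(";") if e.strip()])
```

Keeping the table structured rather than flattened is what later lets a block body carry "control description + test + result" as distinct pieces.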
Step 3 — Segment into candidate blocks
The parsed document is walked to produce candidate KB blocks. The segmentation rules are specific to SOC 2 structure:
- Each control listed in section 4 (the control description section) becomes one block. CC6.1 is a block; CC6.2 is a block; CC6.3 is a block.
- Each subsection of section 3 (the description of the service organization) becomes one block: company overview, scope of services, subservice organizations, relevant aspects of the control environment, etc.
- The management assertion becomes one block (used when questionnaires ask for it directly).
- The auditor’s opinion letter becomes one block.
- Each noted exception (if any) becomes its own block.
A typical SOC 2 Type II produces 140 candidate blocks. Larger reports — companies with wider scope — produce 200 or more. Smaller companies’ first-year SOC 2s sometimes produce as few as 90.
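The segmentation rules above reduce to a walk over parsed sections with a control-ID check up front. A minimal sketch, assuming the parsed document has already been flattened into (heading, body) pairs — the real walker runs over the LlamaParse tree:

```python
import re

# AICPA TSC control IDs look like CC6.1, A1.2, PI1.1, C1.1, P1.1
CONTROL_ID = re.compile(r"^(CC|PI|A|C|P)\d+\.\d+")

def segment(sections: list[tuple[str, str]]) -> list[dict]:
    """Turn parsed (heading, body) sections into candidate KB blocks."""
    blocks = []
    for heading, body in sections:
        lower = heading.lower()
        if CONTROL_ID.match(heading):
            kind = "control"            # one block per control (CC6.1, CC6.2, ...)
        elif "management assertion" in lower:
            kind = "assertion"
        elif "opinion" in lower:
            kind = "opinion"            # the auditor's opinion letter
        else:
            kind = "context"            # section 3 subsections, scope, etc.
        blocks.append({"title": heading, "body": body, "kind": kind})
    return blocks
```

The block kind drives the tagging behavior in the next step: only "control" blocks get a control-family assignment.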
Step 4 — Classify and tag
Every candidate block goes through a classifier that assigns:
- Control family. CC6.1 maps to “access control”; CC7.2 maps to “system operations”; A1.1 (availability criteria) maps to “availability and BCP.” The mapping is from the AICPA TSC framework.
- Question categories the block is likely to answer. A block describing CC6.1 is tagged with ["access-control", "rbac", "mfa"] because those are the question shapes the retrieval layer will fire at it.
- Owner. Set to the default security-KB owner for the customer, then overridden by the GRC lead if needed.
- Effective date. Pulled from the audit period on the report’s cover page.
- Expiration date. Set to one year past the end of the audit period.
The classifier is a small LLM call — roughly 50 tokens in, 30 tokens out per block. At $0.001-$0.003 per block, the whole SOC 2 classification runs for under 50 cents.
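The deterministic parts of the classifier's output — family lookup and date arithmetic — can be stubbed without the LLM. A sketch of the assumed output shape; the FAMILIES table is abbreviated and the tags are filled by the model call in production:

```python
from datetime import date, timedelta

# Abbreviated TSC prefix → control family mapping (illustrative subset)
FAMILIES = {"CC6": "access control", "CC7": "system operations", "A1": "availability and BCP"}

def classify_stub(control_id: str, audit_period_end: date) -> dict:
    """Assumed output shape of the per-block classifier call."""
    prefix = control_id.split(".")[0]
    return {
        "control_family": FAMILIES.get(prefix, "uncategorized"),
        "tags": [],                                   # filled by the LLM in production
        "effective_date": audit_period_end,           # from the report cover page
        "expires_on": audit_period_end + timedelta(days=365),  # one year past period end
    }
```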
Step 5 — Embed and insert
Each block is embedded (we use a commercial embedding model; details are in the embedding-model-selection post). The embedding goes into pgvector.
The row shape, simplified:
create table kb_block (
id uuid primary key,
customer_id uuid not null,
source_ref text not null, -- 'evidence_artifact:<uuid>'
control_family text,
title text not null, -- 'CC6.1 — Logical access restricted'
body text not null, -- control description + test + result
tags text[] not null,
owner_id uuid not null,
effective_date date not null,
expires_on date,
version int not null default 1,
embedding vector(1536),
freshness real not null default 1.0,
created_at timestamptz default now()
);
create index kb_block_embedding_idx
on kb_block using hnsw (embedding vector_cosine_ops)
with (m = 16, ef_construction = 128);
The HNSW parameters follow the tuning we cover in pgvector HNSW tuning for proposal workloads. The source_ref preserves the link back to the evidence_artifact — critical for the auto-attachment path.
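For the insert itself, a sketch of building the parametrized statement for one block. This passes the embedding as a pgvector text literal cast with ::vector, which works with a plain psycopg connection; adapters like pgvector-python can bind the vector natively instead. The block dict keys are our assumed intermediate shape, not a documented API.

```python
import uuid

def insert_block_sql(block: dict) -> tuple[str, tuple]:
    """Build the parametrized INSERT for one kb_block row."""
    sql = """
        insert into kb_block
          (id, customer_id, source_ref, control_family, title, body,
           tags, owner_id, effective_date, embedding)
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s::vector)
    """
    # pgvector accepts '[x1,x2,...]' as a text literal for the vector type
    emb = "[" + ",".join(f"{x:.6f}" for x in block["embedding"]) + "]"
    params = (str(uuid.uuid4()), block["customer_id"],
              f"evidence_artifact:{block['artifact_id']}",   # preserves the vault link
              block["control_family"], block["title"], block["body"],
              block["tags"], block["owner_id"], block["effective_date"], emb)
    return sql, params
```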
Step 6 — Link to evidence
The evidence_citation join table is populated. For each of the 140 blocks, a row is written linking block_id → artifact_id with the relevant section_ref (“CC6.1”, “section 4”, “management assertion”). This is the mapping that makes a DDQ answer citing “SOC 2 report, section CC6.1” resolve to the exact PDF and page range at attachment time.
The linking is automatic from the segmentation step — each block already knows which section of the SOC 2 it came from, so the join is mechanical.
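Because the section_ref rides along with each block from segmentation, producing the join rows really is mechanical. A sketch, assuming the block dicts carry an id and section_ref:

```python
def citation_rows(blocks: list[dict], artifact_id: str) -> list[tuple[str, str, str]]:
    """One evidence_citation row per block: block_id → artifact_id + section_ref.
    The rows can then be batch-inserted (e.g. executemany)."""
    return [(b["id"], artifact_id, b["section_ref"]) for b in blocks]
```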
What ends up in the KB
At the end of the ingest, for a 72-page SOC 2 report, we typically have:
- 1 evidence_artifact row pointing to the PDF.
- 142 kb_block rows (in this particular example).
- 142 evidence_citation rows linking each block back to its section of the PDF.
- 142 HNSW index entries.
Total ingest time: 4 to 7 minutes end to end. Total cost: roughly $0.80 in parser and model calls.
Where this breaks
Two failure modes.
SOC 2 reports with unusual structure. Some audit firms depart from the standard control-listing format, especially for smaller companies whose reports are one bespoke section rather than a structured control list. Our segmenter falls back to a generic paragraph-level chunking when it cannot find the control sections, which produces blocks but loses the control-family tagging. About 8% of ingested SOC 2s hit this path.
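The generic fallback is ordinary paragraph-level chunking. A minimal sketch, assuming paragraphs are blank-line separated; the max_chars budget is illustrative, not our tuned value:

```python
def fallback_chunks(text: str, max_chars: int = 1200) -> list[str]:
    """Greedy paragraph packing used when control sections can't be found.
    Resulting blocks carry no control-family tag."""
    chunks: list[str] = []
    cur = ""
    for para in text.split("\n\n"):
        if cur and len(cur) + len(para) + 2 > max_chars:
            chunks.append(cur)
            cur = para
        else:
            cur = f"{cur}\n\n{para}" if cur else para
    if cur:
        chunks.append(cur)
    return chunks
```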
Reports with embedded image-only pages. A surprising number of SOC 2 reports include screenshots of dashboards or architecture diagrams as raster images. LlamaParse gets the surrounding text but can miss content inside the images. We run a Gemini 2.5 Flash pass over image pages that flags image-only content; those blocks are marked for manual review rather than auto-ingested.
The 142-block output becomes the retrieval target for every subsequent security-questionnaire answer the customer ships. Every DDQ that asks about access control, encryption, incident response, vendor management, or any of the other 17 control families retrieves one or more of these blocks, runs the verification pass, and either auto-answers or escalates. The ingest is the boring part. It is also the part that determines how much of the auto-answer path is possible at all.