Ingesting a 300-question security questionnaire
A 300-question security questionnaire is a throughput problem, not a writing problem. The ingest pipeline has five stages: extract, classify, dedupe against the last one, retrieve, assemble. Here is what each stage does and what it costs.
A 300-question security questionnaire arrives. Sometimes as an Excel file with three tabs (one for “company info,” one for “controls,” one for “evidence references”). Sometimes as a Word document with checkboxes and a freeform comment column. Sometimes as a portal where the questions are gated behind a login and the team has to copy them into a working document one by one.
The point is not the format. The point is the volume. Safe Security reports that enterprise security teams now process 500 or more such questionnaires per year, with 200 to 400 questions each. Treated as a writing problem, this is unwinnable. Treated as a throughput problem with retrieval, it is a pipeline you can run in an afternoon.
This post walks through that pipeline. Five stages, what each one does, and what the actual cost looks like when you run it on a typical 300-question instrument.
Stage 1 — Extract
The questionnaire arrives. The pipeline’s first job is to turn it into a structured list of question rows.
For Excel, this is roughly mechanical: parse the sheet, identify the question column (heuristic on column header keywords), the answer column (often pre-labeled “Response”), the evidence-reference column, and any “owner” or “section” columns. We use a parsing library and a small classifier to label each column.
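For illustration, the column heuristic can be as simple as keyword matching on headers before the classifier sees anything. This is a minimal sketch, not the production parser; the keyword lists and the labelColumn helper are made up for the example.

type ColumnRole = "question" | "response" | "evidence" | "owner" | "section" | "unknown";

const HEADER_KEYWORDS: Record<Exclude<ColumnRole, "unknown">, string[]> = {
  question: ["question", "requirement", "control description"],
  response: ["response", "answer", "yes/no"],
  evidence: ["evidence", "reference", "artifact"],
  owner: ["owner", "responsible"],
  section: ["section", "domain", "category"],
};

// Label a header by the first role whose keywords it contains; anything that
// misses falls through to the small column classifier mentioned above.
function labelColumn(header: string): ColumnRole {
  const h = header.toLowerCase();
  for (const [role, keywords] of Object.entries(HEADER_KEYWORDS)) {
    if (keywords.some((k) => h.includes(k))) return role as ColumnRole;
  }
  return "unknown";
}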
For Word documents and PDFs, we extract via the document-parsing pipeline (LlamaParse with diagram fallback). The extraction has to identify question boundaries, which is harder than it sounds — questions can be enumerated (4.2.7), un-enumerated (a paragraph that ends in a question mark), or implicit (a checkbox row whose label is the question). The extractor produces a normalized row structure regardless of source.
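The rest of the pipeline only ever sees that normalized row. The shape below is an assumption about what it carries, reverse-engineered from the stages that follow rather than copied from the real schema.

interface QuestionRow {
  id: string;                       // stable row identifier, e.g. "4.2.7" or a generated index
  source: "excel" | "word" | "pdf" | "portal";
  questionText: string;             // the question as extracted, whitespace-normalized
  section?: string;                 // "controls", "company info", etc., when present
  evidenceRef?: string;             // buyer-provided evidence-reference cell, if any
  parseStatus: "clean" | "flagged"; // flagged rows go to the human reviewer first
}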
For portals: the team copies. The pipeline does not screen-scrape login-gated content.
What this stage costs. Roughly free for Excel, modest for Word/PDF (a few cents in extraction calls). Latency is in seconds, not minutes.
Where it breaks. Tables nested inside questions. Questions that span multiple lines with sub-bullets. Multi-instrument questionnaires where two different vendors’ questions are concatenated in the same file (yes, this happens). The extractor flags rows it could not parse cleanly; the human reviewer disambiguates those up front, before the rest of the pipeline runs.
Stage 2 — Classify
The 300 rows are now structured. They are not yet useful, because they are heterogeneous. Some are factual (“Do you encrypt data at rest? Y/N”). Some are descriptive (“Describe your incident response process”). Some are evidentiary (“Provide your most recent SOC 2 report”). Some are policy-text dumps that ask the vendor to attest yes/no to three pages of legal language.
The classifier assigns each row a type. The type drives downstream handling.
type QuestionType =
  | "boolean"              // yes/no
  | "boolean-with-comment"
  | "short-answer"         // a sentence to a paragraph
  | "descriptive"          // multiple paragraphs, may need SME input
  | "evidentiary"          // attach a document
  | "policy-attest"        // attest to a long block of text
  | "out-of-scope";        // not a question, e.g. "Vendor name" cells
The classifier is a small model with a prompt and a few-shot list. It runs in batch on all 300 rows. It is roughly 95% accurate on our internal evaluation set. Wrong classifications are caught in the assemble step when the answer shape does not match the row schema.
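A rough sketch of what that batch call looks like, assuming a generic callModel client and a hand-maintained few-shot list; both helpers are stand-ins, not our actual code.

declare function callModel(prompt: string): Promise<string>;                      // stand-in for the small-model client
declare function parseAndValidate(raw: string, expected: number): QuestionType[]; // rejects labels outside the union

const FEW_SHOT = [
  { q: "Do you encrypt data at rest? Y/N", type: "boolean" },
  { q: "Describe your incident response process.", type: "descriptive" },
  { q: "Provide your most recent SOC 2 report.", type: "evidentiary" },
];

// Classify all rows in one batched call; misfits surface when the parsed
// labels fail validation or, later, when the answer shape does not match.
async function classifyRows(rows: QuestionRow[]): Promise<QuestionType[]> {
  const prompt = [
    "Label each question with exactly one QuestionType.",
    ...FEW_SHOT.map((ex) => `Q: ${ex.q}\nType: ${ex.type}`),
    ...rows.map((r, i) => `Q${i}: ${r.questionText}`),
  ].join("\n\n");
  return parseAndValidate(await callModel(prompt), rows.length);
}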
What this stage costs. A few cents for the batch classification call.
Where it breaks. Boundary cases between “short-answer” and “descriptive.” A question like “Describe your access control approach” can be answered in two sentences (short-answer-style) or in two paragraphs (descriptive-style). The classifier picks one; sometimes the buyer wanted the other. We mitigate with a length-aware second pass at draft time, but the first-pass classification is sometimes overruled.
Stage 3 — Dedupe against the last one
This is the stage that most published “DDQ automation” pipelines do not run, and in our experience it is the highest-payoff stage in the whole pipeline.
The questionnaire you are answering this week is, with very high probability, not the first questionnaire your team has answered. It overlaps materially with the last questionnaire and the one before that. 1up.ai observed that “most questionnaires are quite similar, but just different enough that you can’t copy/paste every answer.” That gap — similar but not identical — is what the dedupe stage closes.
For each question in the new instrument, the pipeline retrieves the most similar question from the last N questionnaires the team has completed (we default to N=10). If the similarity score clears a “near-duplicate” threshold, the prior approved answer is offered as the starting point. If it clears a weaker threshold, the prior answer is offered as a candidate the writer can pull from.
The threshold is conservative. We err toward “candidate, review please” rather than “near-duplicate, ship it.” The failure mode of a too-aggressive dedupe is shipping last quarter’s answer to a question that is asking something subtly different, which is the exact failure 1up.ai called out. The dedupe is a retrieval suggestion, not an autopilot.
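In code, the decision reduces to two cutoffs over one similarity score. The numbers below are placeholders to show the shape of the logic, not tuned thresholds.

type DedupeResult =
  | { kind: "near-duplicate" | "candidate"; priorAnswer: string; priorQuestionnaireId: string }
  | { kind: "no-match" };

const NEAR_DUPLICATE = 0.92; // placeholder: offer the prior approved answer as the starting point
const CANDIDATE = 0.80;      // placeholder: offer the prior answer as something the writer can pull from

// score is the similarity from the embedding lookup against the
// prior-questionnaire index; prior is the best-matching approved answer.
function dedupe(score: number, prior: { answer: string; questionnaireId: string } | null): DedupeResult {
  if (!prior || score < CANDIDATE) return { kind: "no-match" };
  return {
    kind: score >= NEAR_DUPLICATE ? "near-duplicate" : "candidate",
    priorAnswer: prior.answer,
    priorQuestionnaireId: prior.questionnaireId,
  };
}

Either way the prior answer carries its provenance, so the reviewer can see which questionnaire and date it came from before reusing it.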
What this stage costs. One embedding lookup per question against the prior-questionnaire index. Cents per questionnaire.
Where it breaks. Buyer-specific phrasing changes that the dedupe does not catch. “Do you encrypt data at rest using AES-256?” vs “Do you encrypt data at rest?” — same topic, different answer schema. The first asks for a specific algorithm and would fail if your answer says “AES-128.” The dedupe will flag both as near-duplicates because the embedding sees the topic. The writer has to read carefully.
Stage 4 — Retrieve
For questions the dedupe stage did not answer, the standard grounded-drafting loop runs (see the loop walkthrough). Retrieve from the KB, draft under constraint, verify entailment, emit or refuse.
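For orientation, here is roughly what each remaining question goes through; retrieveFromKB, draftUnderConstraint, and verifyEntailment stand in for the components the loop walkthrough describes in detail.

type DraftOutcome =
  | { status: "emitted"; answer: string; sources: string[] }
  | { status: "refused"; reason: "retrieval-floor" | "entailment-failure" | "out-of-scope" };

declare function retrieveFromKB(q: string): Promise<{ passages: string[]; sources: string[] }>;
declare function draftUnderConstraint(q: string, passages: string[]): Promise<string>;
declare function verifyEntailment(answer: string, passages: string[]): Promise<boolean>;

async function answerQuestion(q: string): Promise<DraftOutcome> {
  const { passages, sources } = await retrieveFromKB(q);
  if (passages.length === 0) return { status: "refused", reason: "retrieval-floor" };  // nothing in the KB clears the floor
  const answer = await draftUnderConstraint(q, passages);
  if (!(await verifyEntailment(answer, passages))) {
    return { status: "refused", reason: "entailment-failure" };                        // draft not supported by its sources
  }
  return { status: "emitted", answer, sources };
}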
The questions that reach this stage are a smaller set than the original 300. On a typical 300-question questionnaire we see roughly 40 to 60% of questions resolved by the dedupe step alone. The remainder hit the full retrieval pipeline.
This is the headline efficiency claim. The questionnaire feels like 300 questions to the team. The system processes it as roughly 130 questions, because 170 of them have an approved answer from a prior questionnaire that survives a near-duplicate check. The 130 that go through retrieval are still expensive — a full draft, verify, emit cycle each — but the 170 that did not need it are the entire reason the pipeline pays for itself.
Stage 5 — Assemble
The pipeline now has 300 rows of (question, drafted-or-reused answer, source pointer, status). The assemble stage builds the deliverable.
If the buyer sent an Excel template, the answers go back into the original cells. If they sent a Word document, the answers fill the response column. If they sent a PDF and expect a separate response document, the assembler builds that document with question-and-answer pairs and the buyer’s required formatting.
The assembly is mechanical. The interesting part of this stage is the report the assembler produces alongside the deliverable. The report says: 170 answers reused (here are the prior questionnaires they came from, with dates), 95 answers freshly drafted (here are the source pointers in the KB), 35 questions refused (here is why each was refused — retrieval floor, entailment failure, or out-of-scope). The 35 refusals go to the human as the actual workload. Everything else has been pre-resolved.
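A plausible shape for that report, reusing the names from the earlier sketches rather than the exact schema we ship:

interface AssemblyReport {
  reused: { rowId: string; priorQuestionnaireId: string; priorDate: string }[];                    // 170 in the run above
  drafted: { rowId: string; kbSources: string[] }[];                                               // 95
  refused: { rowId: string; reason: "retrieval-floor" | "entailment-failure" | "out-of-scope" }[]; // 35: the human workload
}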
What this stage costs. Negligible compute. The cost here is human review of the assembled output before submission.
What the whole run costs
A 300-question questionnaire end to end costs us, at current pricing, on the order of 5 to 15 dollars in compute. The variance is mostly in the proportion of questions that reached the full retrieval pipeline vs. the dedupe-only path. We will publish the line-by-line breakdown in Wednesday’s cost-per-response post.
The question that matters more than the dollar cost is the human time. A questionnaire that took the security team 30 to 40 hours historically becomes a 3-to-5-hour review cycle when the pipeline does the throughput work and the human does the judgment work. We have seen that pattern hold across the teams running the pipeline now. We will publish a fuller benchmark when the sample is large enough to be useful.
Where this pipeline does not help
The first questionnaire. If the team has not run a prior questionnaire through the system, the dedupe stage produces zero hits and every question goes through full retrieval. Costs and time are higher. The KB has to build up.
Heavy policy-attest sections. Some financial-services DDQs include 20-page policy attestations that the vendor either accepts or proposes redlines on. The pipeline does not draft redlines. Those sections still go to legal.
Buyer-specific certifications we do not hold. The pipeline cannot manufacture a SOC 2 Type II report we have not produced. It can correctly refuse the question and surface the gap to the team. That is not nothing — knowing on day one that we cannot answer 12 of the 300 questions is structurally useful — but it is also not magic.
The pipeline is throughput. The judgment work, the gap-closing work, and the actual security posture remain human. The point is to spend the human time on the parts that need a human, not on the parts a retriever and a verifier can handle.