Field notes

The evidence vault: where SOC 2 PDFs live and how they cite

How a DDQ answer citing 'SOC 2 report, section CC6.1' actually finds the right PDF, serves it to the right buyer, and keeps the audit trail. The storage, access, and audit layer underneath.

The PursuitAgent engineering team 7 min read Engineering

A DDQ answer that says “AES-256 at rest, evidence: SOC 2 report section CC6.1” is only as good as the path from that citation to the actual PDF. The citation text is five words. The infrastructure underneath is where the trust sits.

This post walks through the evidence vault — the storage, access, and audit layer that makes “SOC 2 report, section CC6.1” resolve to the correct file, served to the correct buyer, logged for the compliance audit trail. It is the backbone the 80%-retrieval pillar assumes but does not explain in depth.

What goes in the vault

The vault stores seven kinds of artifacts:

  1. Attestation reports. SOC 2 Type II, SOC 1, SOC 3, and ISO 27001 certificates. Current and prior cycle.
  2. Penetration test reports. Typically annual. Executive summary versions and full technical reports, separately access-controlled.
  3. Questionnaire submissions. SIG Lite, SIG Core, CAIQ v4, HECVAT where applicable. Kept because buyers sometimes ask us to “send us your completed SIG” instead of sending their own custom questionnaire.
  4. Policy PDFs. Information security policy, BCP/DR plan, data classification policy, acceptable-use policy. The buyer-facing redacted versions.
  5. Insurance certificates. Cyber liability, E&O, general liability.
  6. Subprocessor documentation. DPAs with each subprocessor, notification logs.
  7. Incident-response documentation. The runbook, the tabletop exercise records, the post-incident summaries (redacted for buyer view).

Each artifact has metadata: effective date, expiration date, owner, classification (buyer-facing, internal-only, customer-NDA-required), and a content hash. The hash is what makes the citation resolvable. When a KB block says “SOC 2 report, section CC6.1,” the block references a hash, not a filename. Filenames are free to change. Hashes pin the artifact version the KB block was authored against.
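
A minimal sketch of hash-pinned resolution, to make the "hash, not filename" point concrete. The post does not name the hash algorithm, so SHA-256 is an assumption here, and `HashIndex`, `hashArtifact`, and the in-memory map are hypothetical stand-ins for the real Postgres-backed vault.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory shape; the real vault keeps this in Postgres.
interface ArtifactMeta {
  contentHash: string;   // hex SHA-256 of the file bytes (algorithm is an assumption)
  filename: string;      // free to change; never used for lookup
  effectiveDate: string; // ISO date
}

// Hash the raw bytes of an artifact.
function hashArtifact(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Index artifacts by content hash so a citation resolves regardless of filename.
class HashIndex {
  private readonly byHash = new Map<string, ArtifactMeta>();

  ingest(bytes: Uint8Array, filename: string, effectiveDate: string): ArtifactMeta {
    const contentHash = hashArtifact(bytes);
    const existing = this.byHash.get(contentHash);
    if (existing) return existing; // same bytes, same artifact version
    const meta: ArtifactMeta = { contentHash, filename, effectiveDate };
    this.byHash.set(contentHash, meta);
    return meta;
  }

  rename(contentHash: string, filename: string): void {
    const meta = this.byHash.get(contentHash);
    if (meta) meta.filename = filename;
  }

  resolve(contentHash: string): ArtifactMeta | undefined {
    return this.byHash.get(contentHash);
  }
}

const index = new HashIndex();
const encoder = new TextEncoder();
// Placeholder bytes stand in for a real PDF.
const soc2 = index.ingest(encoder.encode("soc2-2024 pdf bytes"), "SOC2_final_v3.pdf", "2024-01-01");
index.rename(soc2.contentHash, "soc2-type2-2024.pdf");
const resolved = index.resolve(soc2.contentHash);
```

After the rename, `resolved` still points at the same artifact because the lookup key never involved the filename.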

The schema

The core of the vault is two Postgres tables plus object storage; the audit log described later is kept separately. The SQL is the clearest way to describe the shape:

create table evidence_artifact (
  id             uuid primary key,
  customer_id    uuid not null,
  artifact_type  text not null,  -- 'soc2', 'pentest', 'policy', ...
  content_hash   bytea not null unique,
  effective_date date not null,
  expires_on     date,
  owner_user_id  uuid not null,
  classification text not null,  -- 'buyer_facing', 'nda_required', 'internal'
  storage_key    text not null,  -- object-storage path
  created_at     timestamptz default now()
);

create table evidence_citation (
  id              uuid primary key,
  block_id        uuid not null references kb_block(id),
  block_version   int  not null,
  artifact_id     uuid not null references evidence_artifact(id),
  section_ref     text,           -- 'CC6.1', 'section 3.2', 'appendix A'
  created_at      timestamptz default now()
);

evidence_artifact is the file table. evidence_citation is the join between a KB block and the evidence it backs. A single KB block can cite multiple artifacts (a control answer that cites both the SOC 2 and the pentest). A single artifact can be cited from many blocks (the SOC 2 shows up in 40-plus KB blocks).

The content_hash is the primary integrity guarantee. When the block was authored, the hash of the PDF at that moment was recorded. If the PDF is replaced — new cycle, new attestation — the new PDF gets a new row with a new hash, and the old row stays. The block does not silently pick up the new evidence; the block author has to explicitly re-cite. This is the same versioning discipline we apply to KB blocks themselves.
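
Because blocks stay pinned to the artifact row they were authored against, something has to surface the citations that now point at a superseded report. A minimal sketch of that check, assuming "superseded" means a newer `effective_date` exists for the same customer and artifact type. That rule, the function name `findStaleCitations`, and the in-memory arrays are illustrative assumptions, not the product's actual logic.

```typescript
// Hypothetical row shapes mirroring a subset of the two tables above.
interface Artifact {
  id: string;
  customerId: string;
  artifactType: string;
  effectiveDate: string; // ISO date, so string comparison orders correctly
}

interface Citation {
  blockId: string;
  blockVersion: number;
  artifactId: string;
  sectionRef: string | null;
}

// Return citations whose artifact is no longer the newest of its type for that customer.
function findStaleCitations(artifacts: Artifact[], citations: Citation[]): Citation[] {
  const newest = new Map<string, Artifact>();
  for (const a of artifacts) {
    const key = `${a.customerId}:${a.artifactType}`;
    const current = newest.get(key);
    if (!current || a.effectiveDate > current.effectiveDate) newest.set(key, a);
  }
  const byId = new Map(artifacts.map((a) => [a.id, a] as const));
  return citations.filter((c) => {
    const cited = byId.get(c.artifactId);
    if (!cited) return false; // dangling reference; out of scope for this sketch
    const latest = newest.get(`${cited.customerId}:${cited.artifactType}`);
    return latest !== undefined && latest.id !== cited.id;
  });
}

const artifacts: Artifact[] = [
  { id: "a-2023", customerId: "c1", artifactType: "soc2", effectiveDate: "2023-01-01" },
  { id: "a-2024", customerId: "c1", artifactType: "soc2", effectiveDate: "2024-01-01" },
  { id: "p-2024", customerId: "c1", artifactType: "pentest", effectiveDate: "2024-03-01" },
];
const citations: Citation[] = [
  { blockId: "b1", blockVersion: 3, artifactId: "a-2023", sectionRef: "CC6.1" },
  { blockId: "b2", blockVersion: 1, artifactId: "a-2024", sectionRef: "CC6.1" },
  { blockId: "b1", blockVersion: 3, artifactId: "p-2024", sectionRef: null },
];
const stale = findStaleCitations(artifacts, citations);
```

Here only block `b1`'s SOC 2 citation is reported, since it still points at the 2023 report. The sketch only flags; re-citing remains an explicit author action.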

The access layer

Not every buyer gets every PDF. A full SOC 2 Type II report contains enough detail that most vendors only release it under an NDA. The access layer enforces this.

A citation in a DDQ answer resolves to an artifact, but the artifact is not served until the buyer’s context is checked. Three checks run on every evidence fetch:

  1. NDA status. Does the buyer have an active NDA with the vendor? If the artifact is NDA-gated, the fetch is denied until the NDA is on file and logged.
  2. Classification match. A buyer-facing policy PDF serves freely. An NDA-required pentest report serves only after the NDA check passes. An internal-only artifact never serves — it exists in the vault for internal reference but cannot be attached to a buyer-facing response.
  3. Expiration. If the artifact’s expires_on has passed, the fetch is refused and the DDQ workflow routes the question to the owner for evidence refresh.
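
The three checks above can be sketched as one pure function. The type names, the reason codes, and the order in which the checks run are assumptions for illustration; the post does not show the real access-layer code.

```typescript
type Classification = "buyer_facing" | "nda_required" | "internal";

// Hypothetical minimal views of an artifact and a buyer.
interface ArtifactView {
  classification: Classification;
  expiresOn: Date | null;
}

interface BuyerContext {
  buyerId: string;
  hasActiveNda: boolean;
}

type AccessResult =
  | { ok: true }
  | { ok: false; reason: "internal_only" | "nda_missing" | "expired" };

// `now` is passed in so the function stays deterministic and easy to test.
function checkAccess(artifact: ArtifactView, buyer: BuyerContext, now: Date): AccessResult {
  // Internal-only artifacts never serve, regardless of NDA status.
  if (artifact.classification === "internal") {
    return { ok: false, reason: "internal_only" };
  }
  // NDA-gated artifacts need an active NDA on file.
  if (artifact.classification === "nda_required" && !buyer.hasActiveNda) {
    return { ok: false, reason: "nda_missing" };
  }
  // Expired evidence is refused even for an otherwise authorized buyer.
  if (artifact.expiresOn !== null && artifact.expiresOn.getTime() < now.getTime()) {
    return { ok: false, reason: "expired" };
  }
  return { ok: true };
}

const now = new Date("2024-07-01T00:00:00Z");
const pentest: ArtifactView = { classification: "nda_required", expiresOn: new Date("2025-01-01T00:00:00Z") };
const noNda = checkAccess(pentest, { buyerId: "buyer-1", hasActiveNda: false }, now);
const withNda = checkAccess(pentest, { buyerId: "buyer-2", hasActiveNda: true }, now);
```

Returning a reason code rather than a bare boolean lets the calling workflow decide who gets the follow-up: an NDA request or an evidence refresh.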

The classification is set once, on ingest. The owner is set once, on ingest. Changing either requires the owner’s approval, logged to an audit table. This is paranoid; it is paranoid on purpose. The vault is the place in the product where a mistake has the highest blast radius — serving the wrong PDF to the wrong buyer is a legal-review event, not a bug-ticket event.

The audit trail

Every artifact fetch is logged with the requesting user, the buyer context, the DDQ/questionnaire ID, the block ID that triggered the citation, and the timestamp. The log is append-only: no updates, no deletes.

The compliance team runs a monthly sweep against this log. The sweep answers three questions:

  • Were any NDA-gated artifacts served without an active NDA?
  • Were any expired artifacts cited in shipped answers?
  • Did any buyer receive an artifact classification they should not have?

The first two checks are automated. The third is semi-manual; it runs a pattern match on artifacts that crossed classification boundaries and flags anomalies for review.
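
A minimal sketch of the two automated checks, run over log rows held in memory. The `AccessLogRow` shape is hypothetical: it assumes the NDA status and the artifact's classification and expiration date are snapshotted onto each log row at fetch time, which the post does not confirm.

```typescript
type Classification = "buyer_facing" | "nda_required" | "internal";

// Hypothetical denormalized log row; the real log schema is not shown in the post.
interface AccessLogRow {
  artifactId: string;
  buyerId: string;
  classification: Classification;
  ndaActiveAtFetch: boolean;
  artifactExpiresOn: string | null; // ISO date
  fetchedAt: string;                // ISO timestamp
}

interface SweepReport {
  ndaViolations: AccessLogRow[];
  expiredAtFetch: AccessLogRow[];
}

function sweep(rows: readonly AccessLogRow[]): SweepReport {
  // Question 1: NDA-gated artifact served without an active NDA.
  const ndaViolations = rows.filter(
    (r) => r.classification === "nda_required" && !r.ndaActiveAtFetch,
  );
  // Question 2: artifact already past its expiration date when fetched.
  // ISO dates compare correctly as strings; slice(0, 10) keeps the date part.
  const expiredAtFetch = rows.filter(
    (r) => r.artifactExpiresOn !== null && r.artifactExpiresOn < r.fetchedAt.slice(0, 10),
  );
  return { ndaViolations, expiredAtFetch };
}

const report = sweep([
  { artifactId: "policy-infosec", buyerId: "buyer-1", classification: "buyer_facing",
    ndaActiveAtFetch: false, artifactExpiresOn: null, fetchedAt: "2024-07-01T10:00:00Z" },
  { artifactId: "pentest-2024", buyerId: "buyer-1", classification: "nda_required",
    ndaActiveAtFetch: false, artifactExpiresOn: null, fetchedAt: "2024-07-01T10:05:00Z" },
  { artifactId: "soc2-2023", buyerId: "buyer-2", classification: "nda_required",
    ndaActiveAtFetch: true, artifactExpiresOn: "2024-06-30", fetchedAt: "2024-07-02T09:00:00Z" },
]);
```

If the access layer is doing its job, both lists come back empty; the sweep exists to catch the case where it did not.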

During an actual audit — the kind the vendor’s own auditor runs to produce the next SOC 2 cycle — the audit trail is the evidence that the vendor’s representations to buyers were consistent with the vendor’s own control environment. Without the trail, the vendor would have to reconstruct which PDF was attached to which response, which is impossible after the fact if the file names or locations changed. With the trail, the reconstruction is a SQL query.

How it plugs into the DDQ workflow

The DDQ auto-answer path calls into the vault on every answer that has cited evidence. Pseudocode:

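// Resolve each citation to its artifact, enforce the access layer, and log
// the fetch. Throws on the first citation that cannot be served.
async function stageEvidence(
  citations: EvidenceCitation[],
  buyerContext: BuyerContext,
): Promise<StagedArtifact[]> {
  const staged: StagedArtifact[] = [];
  for (const c of citations) {
    const artifact = await vault.fetch(c.artifactId);

    // Access-layer checks for this buyer (NDA status, classification).
    const check = await accessLayer.check(artifact, buyerContext);
    if (!check.ok) {
      throw new EvidenceAccessError(c, check.reason);
    }

    // Expired evidence is never staged, even for an authorized buyer.
    if (artifact.expiresOn && artifact.expiresOn < new Date()) {
      throw new EvidenceExpiredError(c, artifact.expiresOn);
    }

    // Log the fetch and keep the log row ID alongside the staged artifact.
    staged.push({
      artifact,
      sectionRef: c.sectionRef,
      accessLogId: await audit.log(artifact, buyerContext, c),
    });
  }
  return staged;
}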

Two kinds of errors halt the auto-answer: an access-layer refusal (the buyer does not have NDA coverage) and an expiration error (the PDF is too old to ship). Both route to the evidence owner as a ticket. The DDQ question stays unanswered until the ticket resolves. This is intentional: shipping the text of an answer without shippable evidence is worse than not answering, because it creates a visible gap the buyer will notice.
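
One way the two halting errors could be turned into owner tickets, as a hedged sketch. The error classes below are simplified stand-ins that carry a citation ID rather than the full citation object, and `OwnerTicket` and `ticketFor` are hypothetical names; the actual ticketing integration is not described in the post.

```typescript
// Simplified stand-ins for the two errors thrown by the staging step.
class EvidenceAccessError extends Error {
  readonly citationId: string;
  readonly reason: string;
  constructor(citationId: string, reason: string) {
    super(`access refused for citation ${citationId}: ${reason}`);
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on ES5 targets
    this.name = "EvidenceAccessError";
    this.citationId = citationId;
    this.reason = reason;
  }
}

class EvidenceExpiredError extends Error {
  readonly citationId: string;
  readonly expiredOn: string;
  constructor(citationId: string, expiredOn: string) {
    super(`evidence for citation ${citationId} expired on ${expiredOn}`);
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "EvidenceExpiredError";
    this.citationId = citationId;
    this.expiredOn = expiredOn;
  }
}

interface OwnerTicket {
  ownerUserId: string;
  citationId: string;
  kind: "access_refused" | "evidence_refresh";
  detail: string;
}

// Map a staging failure to a ticket. Returns null for any other error so the
// caller can rethrow it instead of hiding an unrelated failure.
function ticketFor(err: unknown, ownerUserId: string): OwnerTicket | null {
  if (err instanceof EvidenceAccessError) {
    return { ownerUserId, citationId: err.citationId, kind: "access_refused", detail: err.reason };
  }
  if (err instanceof EvidenceExpiredError) {
    return { ownerUserId, citationId: err.citationId, kind: "evidence_refresh", detail: err.expiredOn };
  }
  return null;
}

const ticket = ticketFor(new EvidenceExpiredError("cit-42", "2024-06-30"), "owner-7");
const unrelated = ticketFor(new Error("database unreachable"), "owner-7");
```

A caller would wrap the staging call in `try`/`catch`, open the ticket when `ticketFor` returns one, and rethrow otherwise, leaving the question unanswered in both cases.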

Where it breaks

Three failure modes we have hit.

The unversioned shared drive. Customers who have not migrated their evidence from “a Google Drive folder Sarah owns” to the vault cannot use the auto-answer path for evidence-bearing questions. The vault ingest is manual the first time through. We built a wizard to reduce the migration cost, but the cost is not zero.

The NDA that lives in another system. Enterprise NDAs are often tracked in the vendor’s CLM (contract lifecycle management) platform, not in the vault. The access layer has to know about the NDA to enforce it. We integrate with the two most common CLMs; the long tail of smaller CLMs and legacy contract systems is an open gap.

The section reference that does not exist. A KB block cites “SOC 2 report, section CC6.1.” The SOC 2 report is real. The section label is a human-written string; if the auditor’s report uses a slightly different label (“CC6.01” vs. “CC6.1”), the citation looks wrong to a buyer even if the underlying content is right. We have been slow to automate section-label validation and it shows in occasional buyer pushback. This is on the Q1 roadmap.
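
A sketch of what automated section-label validation might look like. It assumes the set of labels printed in the report has already been extracted, and that labels differing only in case, whitespace, or leading zeros in a numeric part ("CC6.01" vs. "CC6.1") refer to the same section. Both are assumptions, and `normalizeSectionRef` and `matchSectionRef` are hypothetical names.

```typescript
// Normalize a label for comparison: trim, uppercase, drop internal whitespace,
// and strip leading zeros from each numeric component.
function normalizeSectionRef(ref: string): string {
  return ref
    .trim()
    .toUpperCase()
    .replace(/\s+/g, "")
    .replace(/(^|\D)0+(?=\d)/g, "$1");
}

interface SectionMatch {
  canonical: string | null; // label as printed in the report, or null if absent
  exact: boolean;           // true when the cited string matches character for character
}

// Look up a cited label among the labels the report actually uses.
function matchSectionRef(cited: string, reportLabels: readonly string[]): SectionMatch {
  if (reportLabels.includes(cited)) return { canonical: cited, exact: true };
  const wanted = normalizeSectionRef(cited);
  const canonical = reportLabels.find((label) => normalizeSectionRef(label) === wanted) ?? null;
  return { canonical, exact: false };
}

const labels = ["CC6.01", "CC6.10", "CC7.2"];
const nearMiss = matchSectionRef("CC6.1", labels);
const missing = matchSectionRef("CC9.9", labels);
```

A near miss like `nearMiss` could be auto-corrected to the report's own label at authoring time; a result like `missing` is the case worth blocking before the answer ships.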

For the adjacent pieces (how a SOC 2 PDF becomes the KB blocks the vault backs, and why 80% of questionnaire work is retrieval), see “turning a SOC 2 PDF into 140 KB blocks” and the security-questionnaire 80% pillar.

Sources

  1. AICPA — SOC 2 trust services criteria
  2. AWS KMS documentation