Field notes

Backup and restore for a KB that contains embeddings

Point-in-time restore, vector consistency, and why we run a full restore drill once a month: engineering notes on backing up a knowledge base that is half relational and half vector.

The PursuitAgent engineering team

A KB that carries embeddings is not a normal relational database. The blocks are small. The vectors are large. The HNSW index is expensive to rebuild. Backups and restores that are correct for one of these three properties are often wrong for the others. This post covers how we actually back up and restore the PursuitAgent knowledge base, what we learned running a restore drill every month for the last year, and where the standard Postgres playbook broke for us.

The state of play

Each customer’s KB lives in a Postgres instance with pgvector installed. A single customer’s data comprises the following (a minimal schema sketch follows the list):

  • A kb_blocks table — the canonical content rows, including the block text, version tag, source pointers, and metadata. Typical row size 4 to 20 kilobytes.
  • A block_embeddings table — one row per block, carrying a vector(1536) column. Row size ~6 kilobytes.
  • A block_usage table — the telemetry we described in the block reuse tracking post. One row per retrieval event, compact.
  • Supporting tables for documents, extraction provenance, reviewer annotations, and chunk-to-block mappings.
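
The sketch, for concreteness. Column names beyond the ones described above are illustrative, not our exact schema, and $KB_DSN stands in for the per-customer connection string:

```bash
# Minimal sketch of the two core tables (illustrative; the real schema
# carries more columns and the supporting tables are omitted).
psql "$KB_DSN" <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE kb_blocks (
    block_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    text        text  NOT NULL,  -- canonical block content, 4-20 KB typical
    version_tag text  NOT NULL,  -- version label used for point-in-time asks
    source_ptr  jsonb,           -- pointers back to the source document
    metadata    jsonb
);

CREATE TABLE block_embeddings (
    block_id  bigint PRIMARY KEY REFERENCES kb_blocks (block_id),
    embedding vector(1536) NOT NULL  -- ~6 KB per row: 1536 floats x 4 bytes
);
SQL
```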

Three indexes matter for recovery: the HNSW index on block_embeddings.embedding, a full-text GIN index on kb_blocks.text, and a handful of btree indexes on foreign keys.
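
In DDL terms, the two expensive ones look roughly like this. The cosine opclass and the build parameters are assumptions; use whatever matches your distance function and production settings:

```bash
# The two recovery-relevant indexes, sketched. Opclass and parameters are
# assumptions, not our production values.
psql "$KB_DSN" <<'SQL'
CREATE INDEX block_embeddings_hnsw
    ON block_embeddings USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);  -- pgvector's documented defaults

CREATE INDEX kb_blocks_text_fts
    ON kb_blocks USING gin (to_tsvector('english', text));
SQL
```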

The backup posture

We run three layers of backup. Each one catches a different failure mode.

Layer 1 — continuous WAL archiving. Every WAL segment ships to object storage as it fills; this is the standard Postgres continuous-archiving pattern. It gives us point-in-time recovery to any moment in the last 30 days.
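
A minimal sketch of the knobs involved; ship-wal is a hypothetical stand-in for whatever uploads a segment to object storage and verifies the write:

```bash
# Continuous-archiving knobs (sketch). ship-wal is a hypothetical uploader,
# not a real command.
psql "$KB_DSN" <<'SQL'
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET archive_mode = on;  -- takes effect after a restart
ALTER SYSTEM SET archive_command = 'ship-wal %p %f';  -- %p: path, %f: file name
SQL
```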

Layer 2 — daily logical dumps of the high-value tables. kb_blocks, block_embeddings, and the tables that reference them get a nightly pg_dump in custom format. These are restorable independently of the rest of the database, which matters when a customer asks us to restore a specific block to a version from Tuesday without rolling back the entire database.
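
The nightly dump is roughly the following; the table list is abridged and the file naming is illustrative:

```bash
# Nightly logical dump (sketch). Custom format is what lets pg_restore pull
# a single table back out later.
DUMP="kb-$(date +%F).dump"
pg_dump "$KB_DSN" --format=custom \
    --table=kb_blocks --table=block_embeddings --table=block_usage \
    --file="$DUMP"

# e.g. restore one table's rows without touching the rest of the database:
#   pg_restore --dbname="$KB_DSN" --data-only --table=kb_blocks "$DUMP"
```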

Layer 3 — weekly full physical backups. A full pg_basebackup once a week, with the WAL segments applied through the end of the backup window. This is the fastest restore path for a full-database recovery event.
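
The weekly job is roughly this; the path is illustrative, and --wal-method=stream is what bundles the backup window's WAL into the backup itself:

```bash
# Weekly full physical backup (sketch). --wal-method=stream includes the WAL
# generated during the backup window, so the copy is consistent on its own.
pg_basebackup --pgdata="/backups/base-$(date +%F)" \
    --format=tar --gzip --checkpoint=fast --wal-method=stream
```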

The retention schedule: WAL segments, 30 days. Logical dumps, 90 days. Full physical backups, 12 months. The 12-month horizon on full backups is driven by customer contractual requirements, not by technical considerations; the further back you go, the less useful the backup is, but the customer’s auditor is not interested in our opinion on that.

Where the standard playbook broke

Three places.

HNSW index rebuild time. A full physical restore of a medium KB (500,000 blocks) completes in 15 to 25 minutes. The HNSW rebuild on the restored vector column takes 45 to 90 minutes. During that window, retrieval queries either return bad results (degraded HNSW) or fail (no index). A naive restore procedure gives the customer an apparently-restored database that doesn’t work.

The fix is that every restore path builds the HNSW index before the database is marked as serviceable. The customer-facing status shows “restoring: content” first, then “restoring: search index,” with separate ETAs. The pg_basebackup carries the physical index files, so a full physical restore doesn’t need to rebuild; the logical dump path does, and we benchmark the rebuild time for every customer corpus so the ETA is not a guess.
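
On the logical-dump path, the rebuild step looks roughly like this. The memory and worker settings are assumptions, and the time wrapper is where the per-corpus benchmark number comes from:

```bash
# Rebuild the HNSW index before marking the database serviceable (sketch).
# Memory and worker settings are illustrative, not our production values.
time psql "$KB_DSN" <<'SQL'
SET maintenance_work_mem = '8GB';          -- HNSW builds are memory-hungry
SET max_parallel_maintenance_workers = 4;  -- recent pgvector builds in parallel
-- The database is not serving queries yet, so a plain (non-concurrent)
-- build is fine here.
CREATE INDEX block_embeddings_hnsw
    ON block_embeddings USING hnsw (embedding vector_cosine_ops);
SQL
```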

Vector-relational consistency. A mid-restore view where kb_blocks is restored but block_embeddings is only half-restored produces bad results: the drafter retrieves, by vector similarity, a block that no longer exists in the content table. Our restore procedure stages the restore into a shadow schema, validates that the row counts match, and only then renames the shadow schema into place. The renames happen inside a single transaction, so the cutover is atomic.
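
A sketch of that cutover, with kb_restore and public_retired as illustrative schema names:

```bash
# Shadow-schema cutover (sketch). The count check and both renames run in one
# transaction: either the validated schema swaps into place, or nothing moves.
psql --single-transaction "$KB_DSN" <<'SQL'
DO $$
BEGIN
    IF (SELECT count(*) FROM kb_restore.kb_blocks)
       <> (SELECT count(*) FROM kb_restore.block_embeddings) THEN
        RAISE EXCEPTION 'block/embedding row counts diverge; aborting cutover';
    END IF;
END $$;

ALTER SCHEMA public     RENAME TO public_retired;
ALTER SCHEMA kb_restore RENAME TO public;
SQL
```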

Point-in-time restore granularity. Standard Postgres PITR restores to a WAL position. Customers usually want to restore to an event — “the version of this block as of 3pm on Tuesday.” Translating the event into a WAL position requires that we log every KB block mutation with a WAL-position marker at write time. We do. The restore tool takes a human timestamp, looks up the WAL position from the audit log, and restores to that position.
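
A sketch of both halves, with kb_audit as a hypothetical name for the audit log and $TARGET_TS as the customer's human timestamp:

```bash
# Write time (sketch): record the WAL position alongside every block mutation.
psql "$KB_DSN" -c "
  INSERT INTO kb_audit (block_id, mutated_at, wal_lsn)  -- hypothetical table
  VALUES (42, now(), pg_current_wal_lsn());"            # 42: illustrative id

# Restore time (sketch): translate the timestamp into the LSN of the last
# mutation at or before it, then point PITR at that LSN.
LSN=$(psql -At "$KB_DSN" -c "
  SELECT wal_lsn FROM kb_audit
  WHERE mutated_at <= '$TARGET_TS'
  ORDER BY mutated_at DESC LIMIT 1")

echo "recovery_target_lsn = '$LSN'" >> "$PGDATA/postgresql.auto.conf"
touch "$PGDATA/recovery.signal"  # restore_command etc. omitted for brevity
```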

The monthly drill

Every month, on the first Monday, we restore a production-shaped KB to a staging environment from a backup selected at random from the previous week’s archive. The drill runs the full sequence: the restore completes, the HNSW index builds, retrieval queries run against a held-out evaluation set, and the eval scores are compared against the production scores from the same corpus.
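
As a runbook the drill is short; everything below except psql is a hypothetical stand-in for internal tooling, not a real command:

```bash
# First-Monday drill (sketch). pick_backup, restore_kb and run_eval are
# hypothetical stand-ins for internal tooling.
set -euo pipefail

BACKUP=$(pick_backup --random --window 7d)  # anything from last week's archive
restore_kb --from "$BACKUP" --into "$STAGING_DSN"
psql "$STAGING_DSN" -c "CREATE INDEX ON block_embeddings
    USING hnsw (embedding vector_cosine_ops);"
run_eval --dsn "$STAGING_DSN" --queries held-out --compare-to production
```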

The drill is deliberately boring. The whole point is that it is boring — that a restore works, every time, without heroics. A month where the drill produces a surprise is a month where we fix the surprise and update the runbook. We have had four surprises in the last year:

  • A WAL segment rotation bug that truncated one of the 30-day archives. Fixed; added a checksum pass at ship time.
  • A customer corpus whose HNSW build exceeded the staging environment’s memory budget. Fixed; the drill environment now autoscales memory for large corpora.
  • A logical dump that silently failed because the backup script didn’t check the exit code of pg_dump when the process was killed by the OOM killer. Fixed; the script now checks and alerts (sketched after this list).
  • A vector index that came up in production after restore but returned zero matches on every query. Root cause was a corrupted hnsw.ef_search session setting in the restored environment. Fixed; the restore script now explicitly resets session-level vector settings (also sketched below).
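
The third and fourth fixes are easy to sketch. alert and the database name kb are stand-ins:

```bash
set -euo pipefail

# Fix for surprise three: never trust pg_dump silently. An OOM-killed dump
# exits 137 here, and a truncated file can otherwise pass for a backup.
pg_dump "$KB_DSN" --format=custom --file="$DUMP" || {
    status=$?
    alert "nightly pg_dump exited $status"  # 'alert' is a stand-in pager hook
    exit "$status"
}

# Fix for surprise four: clear any persisted vector-search overrides after a
# restore, so a stale hnsw.ef_search cannot silently gut recall.
psql "$KB_DSN" -c "ALTER DATABASE kb RESET hnsw.ef_search;"
```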

Three of the four would have been invisible without the drill. The drill is what converts the existence of a backup into the existence of a recovery.

What the customer sees

In the normal case, nothing. Backups are not a customer-visible artifact unless we need to invoke them.

When we do need to invoke them — the rare case where a customer asks us to restore a specific block, or the very rare case of a full-database restore — the customer sees a status page with:

  • The restore target (block-level, table-level, or full-database).
  • The target timestamp.
  • An ETA broken into content-restore and index-rebuild phases.
  • A confirmation step before the final cutover.

We have not had a full-database restore in production in the life of the product. The monthly drill is how we stay ready for the first one.

What we don’t back up

We do not back up cross-company search indexes or shared model weights; those are reproducible artifacts derived from primary data. We do not back up any reranker state, because the reranker is stateless and has none. We do not back up customer-provided raw documents in full fidelity — we back up the extracted-block representation, which is what the product operates on. A customer who deletes their original PDF and wants the PDF back from our systems will not find it; we document this explicitly and ask customers to maintain their own primary-document storage.

The one-line takeaway

Backups are not a capability; restores are. A backup you have not tried to restore is a file. A backup you restore every month is an operational posture. The difference is the one that matters when a real recovery lands.

The wider year-end post on operational drills — the fire drills we run beyond backup and restore — lands in January.

Sources

  1. pgvector README
  2. PostgreSQL — Continuous archiving and point-in-time recovery