Tuning pgvector HNSW for proposal workloads
M, ef_construction, ef_search — the three knobs that decide retrieval latency and recall in a pgvector HNSW index. What we chose for PursuitAgent and why.
PursuitAgent runs retrieval over a per-tenant knowledge base of content blocks. Every draft is grounded in that retrieval. Latency in the retriever shows up as latency in the draft; lost recall in the retriever shows up as citation gaps in the draft. The index underneath that retriever is a pgvector HNSW index.
This post is about the three knobs that matter on an HNSW index, what we chose, and the trade we made. We have not published a public benchmark of proposal corpora — the numbers below are from our internal testing against synthetic proposal workloads and a handful of real customer corpora under NDA. Treat them as priors, not universal truths.
What HNSW is, briefly
HNSW — Hierarchical Navigable Small World — is the current default approximate-nearest-neighbor index for high-dimensional vector search. It builds a multi-layer graph where each node connects to a small set of neighbors, and search walks the graph from a high-level entry point down to the fine-grained bottom layer. The original paper by Malkov and Yashunin is the canonical reference; pgvector’s implementation follows it closely.
The three knobs, each mapped onto pgvector syntax in the sketch after this list:
- M — the number of bidirectional links built per node at index construction.
- ef_construction — the size of the dynamic candidate list during construction. Higher means slower build, higher recall.
- ef_search — the size of the dynamic candidate list at query time. Higher means slower query, higher recall.
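For orientation, here is where each knob lives in pgvector, shown with the stock defaults rather than our settings. The table and column names are illustrative.

```sql
-- Build-time knobs are index storage parameters; the query-time knob is a GUC.
CREATE INDEX ON content_blocks
    USING hnsw (embedding vector_cosine_ops)   -- cosine distance opclass
    WITH (m = 16, ef_construction = 64);       -- pgvector defaults

SET hnsw.ef_search = 40;                       -- pgvector default, session-scoped here
```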
Our workload
A typical PursuitAgent tenant sits in the 20,000 to 200,000 content-block range. Each block is embedded to 1,536 dimensions (OpenAI text-embedding-3-small at the time of writing; swappable per-tenant). Queries are drafting queries — one per proposal section, typically 50 to 300 per response. We care about recall at k=20 (the top 20 candidates go to the reranker) more than recall at k=1.
The p95 retrieval latency target is 120 ms per query. The p50 target is 40 ms. These are query-side only — they do not include the reranker.
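Concretely, the query shape we hold to those targets is a plain top-k scan. A sketch, with an illustrative table and column, the query embedding passed as a parameter, and tenant scoping omitted:

```sql
-- Top-20 candidates by cosine distance; these go to the reranker.
PREPARE drafting_query (vector(1536)) AS
    SELECT block_id
    FROM content_blocks
    ORDER BY embedding <=> $1
    LIMIT 20;
```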
M
pgvector’s default M is 16. The HNSW paper’s guidance is a reasonable range of M from 5 to 48, with higher M helping on higher-dimensional data. Supabase’s benchmark on 1,536-dim embeddings landed at M=16 as a reasonable default.
We chose M=24. The reasoning: 1,536 dimensions is on the high end, and our workload cares about recall at k=20 more than k=1. M=24 adds construction cost and index size (roughly 50% more than M=16) but reliably lifts recall@20 by a meaningful margin — moving from around 0.92 into the mid-to-high 0.9s on our synthetic tests. The index-size cost is tolerable; a 200k-block tenant index fits in a few hundred megabytes either way.
ef_construction
pgvector’s default is 64. The paper’s typical range is 100 to 500 for production indexes.
We chose ef_construction=200. Construction only happens at index build, and we rebuild the index offline when a tenant’s block count crosses a threshold or a tenant re-embeds. A 10-minute build that delivers recall@20 of 0.97 is a better trade than a 2-minute build that delivers 0.94. Customers do not experience build latency directly — they experience query latency and recall.
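Put together, the build looks roughly like this. The names and the maintenance_work_mem value are illustrative; raising it matters because pgvector builds HNSW much faster when the graph fits in memory.

```sql
SET maintenance_work_mem = '2GB';  -- illustrative; speeds up the offline build

CREATE INDEX content_blocks_embedding_hnsw
    ON content_blocks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 24, ef_construction = 200);
```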
ef_search
This is the knob that matters at query time. pgvector’s default is 40.
We run ef_search=100 in production, with a per-request override. The per-request override is the interesting part. For drafting queries, the ones that go to the reranker and then to the LLM, we use 100 and eat the extra milliseconds. For UI typeahead queries (the search bar in the KB browser), we drop ef_search to 30; the user wants a snappy response and the recall target is looser. For offline re-scoring jobs, we push to 300.
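At the SQL level, one way to scope the override to a single request is SET LOCAL inside the retrieval transaction; this is the mechanism, not necessarily our exact plumbing.

```sql
BEGIN;
SET LOCAL hnsw.ef_search = 100;    -- drafting queries: spend latency for recall
-- SET LOCAL hnsw.ef_search = 30;  -- UI typeahead: snappy over complete
-- SET LOCAL hnsw.ef_search = 300; -- offline re-scoring jobs
-- ... top-20 retrieval query runs here ...
COMMIT;
```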
The cost of ef_search=100 vs. 40 on our workload is on the order of 15 ms at p95. The recall gain at k=20 is a few percentage points. We spent the 15 ms.
What we tried and backed off from
Two things.
Per-tenant ef_search auto-tuning. We built a prototype that adjusted ef_search per tenant based on observed latency vs. recall against a held-out eval set. It worked — on tenants where we had an eval set. The number of tenants with an eval set is small. We shelved the auto-tuner and default every tenant to ef_search=100; tenants with enough eval data can override.
IVFFlat instead of HNSW. IVFFlat is cheaper to build and smaller on disk. We benchmarked it at the 200k-block scale and it lost to HNSW on recall by roughly 8 percentage points at our target latency. For smaller tenants (under 20k blocks), IVFFlat is competitive, but we don’t run two index types. The operational complexity isn’t worth the disk savings.
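For reference, an IVFFlat setup at that scale looks roughly like this, sized with pgvector’s usual heuristics (lists around rows/1,000, probes around sqrt(lists)); not necessarily the exact parameters we benchmarked.

```sql
CREATE INDEX content_blocks_embedding_ivfflat
    ON content_blocks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 200);     -- ~200k blocks / 1,000

SET ivfflat.probes = 15;    -- ~sqrt(lists); raise for recall, pay latency
```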
Rebuild cadence
Indexes are rebuilt under three conditions: the tenant’s block count grows by more than 30% since the last build, the tenant changes their embedding model, or an engineer bumps the tuning parameters across the fleet. Rebuild happens on a background worker; queries run against the old index until the new one swaps in atomically.
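One way to get that behaviour with plain Postgres machinery is a concurrent build followed by a drop and rename; the names and orchestration are illustrative, not our worker code, and because CREATE INDEX CONCURRENTLY cannot run inside a transaction the cutover is a quick succession of steps rather than one statement.

```sql
-- Build the replacement while the old index keeps serving queries.
CREATE INDEX CONCURRENTLY content_blocks_embedding_hnsw_new
    ON content_blocks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 24, ef_construction = 200);

-- Once the new index is valid, retire the old one and take over its name.
DROP INDEX CONCURRENTLY content_blocks_embedding_hnsw;
ALTER INDEX content_blocks_embedding_hnsw_new RENAME TO content_blocks_embedding_hnsw;
```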
What we’re still wrong about
Two open questions.
Quantization. pgvector 0.7 added halfvec and binary quantization. We haven’t moved production to quantized vectors yet. The latency win is real; the recall cost on drafting queries is what we haven’t measured rigorously. The citation-density target post covers why recall gaps are expensive for us specifically.
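For the record, the halfvec route would look like an expression index over a half-precision cast, per pgvector 0.7+. A sketch of what we would evaluate, not something we run, with illustrative names:

```sql
CREATE INDEX content_blocks_embedding_halfvec
    ON content_blocks
    USING hnsw ((embedding::halfvec(1536)) halfvec_cosine_ops)
    WITH (m = 24, ef_construction = 200);

-- Queries must use the same cast for the planner to pick up the expression index:
--   ORDER BY embedding::halfvec(1536) <=> $1::halfvec(1536)
```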
Per-block reweighting. The index treats every block as equal. In practice, a freshly-reviewed block from a trusted SOC 2 report should rank higher than a boilerplate block from 2023. We do this reweighting in the reranker, not in the index. Whether that’s the right layer is a question we haven’t answered.
For the retrieval pipeline that sits above this index, see the grounded retrieval pillar.