A pgvector migration postmortem
An index rebuild that cost us 90 minutes of degraded search for one of our larger tenants. What we changed in the runbook, and the piece of the migration we wish we had rehearsed on production-sized data.
This is a postmortem. No customer was materially harmed. Latency on search was degraded for a specific window — call it 90 minutes — for a single tenant. We rolled forward and the degraded state resolved. But the degradation was avoidable, and the specific thing that went wrong is common enough in pgvector operations that it is worth writing up.
The context: a planned HNSW index rebuild on one of our larger tenant databases. We were tuning index parameters — m and ef_construction — to pull latency down at the upper end of our tenant size distribution. The rebuild itself was routine. What went wrong was how the rebuild interacted with production traffic.
What happened
At 14:03 UTC we started CREATE INDEX ... USING hnsw on the new parameter set. We did not use CREATE INDEX CONCURRENTLY. The rationale at the time: the tenant’s indexed table held a large number of blocks, and non-concurrent builds are faster; we expected the build to take roughly 8 minutes based on our staging rehearsal. Eight minutes of blocked writes, plus whatever read degradation came with the lock, on a tenant we knew was not writing heavily in that window felt acceptable.
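For concreteness, the statement looked roughly like the sketch below. Index name, table, operator class, and parameter values here are illustrative, not the production ones:

```sql
-- Illustrative only: names and parameter values are placeholders.
CREATE INDEX embeddings_hnsw_v2
    ON tenant_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 24, ef_construction = 128);
-- Without CONCURRENTLY this holds a table-level lock on tenant_embeddings
-- for the whole build: writes block outright, and reads degrade in practice.
```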
At 14:03 UTC production search queries for that tenant began queueing.
At 14:05 UTC the alert fired. Search P95 latency for the affected tenant had moved from our normal sub-100ms range to multi-second territory because queries were waiting on the table-level lock. This was expected; the alert simply confirmed the build was in progress.
At 14:08 UTC the build was still running. Our staging rehearsal had finished in 7 minutes on equivalent hardware. We started to look for why it was taking longer.
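What we wanted to know at this point was whether the build session was still alive and what was queued behind it. A sketch of the kind of query, not necessarily the exact one we ran:

```sql
-- Is the CREATE INDEX session still running, and what is waiting behind it?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE query ILIKE 'CREATE INDEX%'
   OR wait_event_type = 'Lock'
ORDER BY query_start;
```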
At 14:21 UTC we discovered that a background ingestion job on the tenant had been writing blocks into the table through the morning, contrary to our rehearsal assumption of a write-idle window. The table was larger than our rehearsal data had represented, and the build time was scaling faster than linearly with the number of blocks.
At 14:34 UTC the build completed. The index came online. Queries cleared the backlog. Search latency recovered within 90 seconds.
Total window of degraded search on this tenant: roughly 31 minutes during the build, plus a brief queue-drain after. The full “degraded” window on our internal metric was 90 minutes from the first anomalous latency sample to return-to-baseline.
What we did wrong
We did not use CREATE INDEX CONCURRENTLY. This is the headline error. Postgres’s documentation is explicit that non-concurrent index creation takes a table-level lock that blocks writes and, in practice, also degrades reads when they compete for the same resources. Concurrent index creation is slower — often 2-3x slower — but does not block. For production tables on a customer-facing path, concurrent is the correct default. We knew this. We made an engineering tradeoff toward speed, based on an assumption about write traffic that was wrong.
Our rehearsal was not representative. The staging dataset we used for the rehearsal was derived from a snapshot taken weeks earlier, before the ingestion backlog we were unaware of had grown. A rehearsal on a smaller or older dataset told us the build would be fast; the production dataset was neither. The lesson is not “take a newer snapshot.” The lesson is “rehearse index builds on the current production size, not on a snapshot.” We now pin rehearsal datasets to within a few days of the planned change.
Our alert was watching the wrong signal. The alert that fired at 14:05 told us search was slow. It did not distinguish between “index build in progress, expected degradation” and “something has gone wrong.” Because we were already expecting degradation, the alert did not change our posture. We had no alert on “is the build actually progressing as expected,” which would have surfaced the mismatch with the staging timing earlier.
The change did not have an abort criterion. We had not written down “if the build has not completed by time T, abort and roll back.” We did not have a rollback plan for mid-build abort. If the build had taken 90 minutes instead of 31, we would have been making the abort decision under pressure, without a pre-committed threshold. That is a bad position.
What we changed
Default to concurrent index creation on all production tables. The additional build time is a cost we will pay. It is a fixed, predictable cost. It does not concentrate into a window that affects customer experience. Concurrent index builds are now the runbook default for any table larger than a threshold we set per tenant size class. Non-concurrent is an explicit opt-out with a written rationale.
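In runbook terms the default now looks like the sketch below. Index and table names are placeholders; the point is the shape, not the exact statement:

```sql
-- Build the replacement index without blocking writes.
-- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block.)
CREATE INDEX CONCURRENTLY embeddings_hnsw_v2
    ON tenant_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 24, ef_construction = 128);

-- Only once the new index is valid and plans look right, retire the old one.
DROP INDEX CONCURRENTLY embeddings_hnsw_v1;
```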
Rehearse on current data. Any index operation on a production table requires a rehearsal on a recent snapshot. “Recent” is defined as within a small number of days, not weeks. The rehearsal captures both the expected duration and the expected resource profile. If the rehearsal surprises us, we postpone.
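The pre-change check is simple enough to paste into the runbook: run the same size query against production and the rehearsal snapshot, and refresh the snapshot if the numbers diverge meaningfully. The table name below is a placeholder:

```sql
-- Compare production vs. the rehearsal snapshot before scheduling the change.
SELECT pg_size_pretty(pg_relation_size('tenant_embeddings'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('tenant_embeddings')) AS total_size,
       (SELECT reltuples::bigint FROM pg_class
         WHERE oid = 'tenant_embeddings'::regclass)                AS approx_rows;
```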
Alert on progress, not just degradation. For long-running index operations we now emit a progress signal — rows indexed per minute, estimated time remaining based on table size. The alert fires if progress is meaningfully slower than the rehearsal predicted, not when customer-facing latency crosses a threshold. This gives us time to react before the customer notices.
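On Postgres 12 and later the raw signal comes from pg_stat_progress_create_index. A sketch of what the job behind the alert samples; the table name is a placeholder, and note that blocks here means heap pages, not our application-level blocks:

```sql
-- Sampled once a minute; the alert fires if the observed blocks_done rate
-- falls meaningfully below the rate the rehearsal predicted.
SELECT phase,
       blocks_done,
       blocks_total,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index
WHERE relid = 'tenant_embeddings'::regclass;
```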
Every operation has an abort criterion. Before kicking off, we write “we will abort if X has not happened by time Y” and paste it into the incident channel. During the operation, the person running it is watching for the abort criterion. If it hits, we stop. Rollback is a step in the runbook, not a decision made under pressure.
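The abort itself is close to a one-liner, which is part of why pre-committing to it is cheap. A sketch, with the table name as a placeholder:

```sql
-- Abort the build if the pre-committed deadline passes: cancelling releases
-- the lock and rolls back the half-built index. A cancelled CONCURRENTLY
-- build can leave an INVALID index behind, which then needs to be dropped.
SELECT pg_cancel_backend(pid)
FROM pg_stat_progress_create_index
WHERE relid = 'tenant_embeddings'::regclass;
```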
What we did right
A few things that kept the incident small:
The tenant-level isolation in our database model meant the incident was scoped to the tenant we were working on. Other tenants were unaffected. If we had run this on a shared index across tenants, the blast radius would have been much larger.
Query planning and execution continued to use the existing index during the build; queries were slow because they were waiting on the lock, not because they were returning wrong results. No tenant saw incorrect search results during the window.
The rollback path, though not pre-committed, was available. Cancelling the in-progress build would have released the lock (a cancelled non-concurrent CREATE INDEX simply rolls back), and we could have fallen back to the original index, which had not been dropped yet. We would not have taken a query-correctness hit even if we had aborted mid-build.
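Had we aborted, the check-and-cleanup would have looked roughly like this. Index and table names are placeholders; the DROP only matters for a concurrent build, which can leave an INVALID index behind when cancelled:

```sql
-- Confirm the original index is still present and valid on the table.
SELECT c.relname AS index_name, i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = 'tenant_embeddings'::regclass;

-- Drop whatever a cancelled concurrent build may have left behind as INVALID.
DROP INDEX CONCURRENTLY IF EXISTS embeddings_hnsw_v2;
```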
The broader lesson
The honest version is that everything on the “what we did wrong” list was an optimization we chose because it was fast. Non-concurrent is faster. Snapshots from a week ago are easier than snapshots from today. Loose alerts are easier to tune. Not writing abort criteria is faster than writing them. Every one of these is a tradeoff of “faster today, exposed if something surprises us.”
When we are running infrastructure operations on production, we should be taking the slower, more conservative path by default. The cost of the slow path is small and evenly distributed. The cost of the fast path, when it goes wrong, is concentrated into a specific customer-facing window. We understated that asymmetry on this one.
The tenant whose search degraded during the window got a timely note. The customer support person who handled the conversation was clear about what happened and what we changed. That is how we handle these at this size, and it is the right shape for a company that takes engineering honesty as a feature. We will write the next postmortem when we have the next one; there will be a next one, because that is what running software at scale produces.