Migrating to Gemini Embedding v3, the safe way
A dual-index backfill and a staged cutover across two weeks. How we evaluated retrieval deltas before the switch, what we watched for during the cutover, and the one metric that gated the final flip.
Two weeks ago we switched from our previous embedding model to Gemini Embedding v3 across all tenants. The cutover was boring on purpose. This post is about how we kept it boring.
The problem with embedding migrations is that they are silently destructive if you do them wrong. The existing vectors in your pgvector index were produced by model A. New vectors produced by model B live in a different space. Querying the index with a model-B query vector against model-A document vectors returns nonsense. You do not see errors — you see degraded retrieval, which the user experiences as “the answers got worse” without a clear cause. The fix after the fact is a full re-embed and a full re-evaluation, which is the work you should have done before the cutover.
The dual-index setup
Before the migration, every KB block had one embedding column, indexed with a single HNSW index. For the migration, we added a second embedding column and a second HNSW index. Every new block written during the migration window was embedded twice — once with the legacy model, once with Gemini v3. The retrieval API had a feature flag per tenant: embedding_model = legacy | v3. The flag defaulted to legacy.
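The schema side of this is small. Here is a minimal sketch, assuming psycopg for Postgres access; the table name, column names, index name, and vector dimension are illustrative, not our real schema:

```python
import psycopg

DUAL_INDEX_SETUP = [
    # Second embedding column next to the legacy one (dimension is illustrative).
    "ALTER TABLE kb_blocks ADD COLUMN IF NOT EXISTS embedding_v3 vector(1536);",
    # Second HNSW index so v3 vectors can be queried independently of the legacy ones.
    "CREATE INDEX IF NOT EXISTS kb_blocks_embedding_v3_hnsw "
    "ON kb_blocks USING hnsw (embedding_v3 vector_cosine_ops);",
]

with psycopg.connect("dbname=app") as conn:  # illustrative connection string
    for statement in DUAL_INDEX_SETUP:
        conn.execute(statement)
```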
This is the standard shape for a safe embedding migration. The cost is obvious: you are paying for two embedding calls per write and storing two vectors per block for the duration of the migration. The benefit is that you can compare retrieval quality on real traffic before committing.
The backfill
Every pre-existing block needed a second embedding. At our current KB size across all tenants, that is in the tens of millions of blocks. We ran the backfill as a batched background job over four days. The job was rate-limited on the Gemini API side and rate-limited again on our own side to keep the index build from contending with production traffic. Progress was visible in a per-tenant dashboard: the percentage of blocks with a v3 embedding.
The backfill was idempotent. A block already embedded got skipped. A block updated during the backfill got re-queued. Nothing in the pipeline treated the v3 column as “live” during this period — retrieval still used the legacy column.
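A minimal sketch of that loop, with the I/O abstracted away: fetch_unembedded_batch, embed_v3, and write_v3_embeddings are hypothetical callables standing in for our job framework and the embedding client, and the batch size and pause are illustrative rate limits, not our production values.

```python
import time
from typing import Callable

def backfill_tenant(
    tenant_id: str,
    fetch_unembedded_batch: Callable[[str, int], list[tuple[str, str]]],   # -> [(block_id, text)]
    embed_v3: Callable[[list[str]], list[list[float]]],                    # wraps the embedding client
    write_v3_embeddings: Callable[[list[tuple[str, list[float]]]], None],  # UPDATE ... SET embedding_v3
    batch_size: int = 256,
    pause_seconds: float = 1.0,
) -> None:
    """Idempotent backfill: only touches blocks that still lack a v3 embedding."""
    while True:
        # Roughly: SELECT id, text FROM kb_blocks
        #          WHERE tenant_id = ... AND embedding_v3 IS NULL LIMIT batch_size
        batch = fetch_unembedded_batch(tenant_id, batch_size)
        if not batch:
            break  # nothing left; re-running the job is a no-op

        vectors = embed_v3([text for _, text in batch])
        write_v3_embeddings([(block_id, vec) for (block_id, _), vec in zip(batch, vectors)])

        time.sleep(pause_seconds)  # our own throttle, on top of the API-side rate limit
```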
The eval harness
Our retrieval eval harness — the one covered in the retrieval eval pipeline post — has a fixed set of question/expected-block pairs per tenant. For the migration, we ran the harness twice on every tenant: once against the legacy index, once against the v3 index. The metric we cared about was recall@5 — does the expected block appear in the top 5 retrieved results for the question.
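A minimal sketch of the comparison, assuming a retrieve callable that hits the retrieval API with the flag forced to one model or the other and returns ranked block ids; the harness shape matches the description above, but the names are illustrative:

```python
from typing import Callable

def recall_at_k(
    harness: list[tuple[str, str]],              # (question, expected_block_id) pairs
    retrieve: Callable[[str, str], list[str]],   # (question, model) -> ranked block ids
    model: str,
    k: int = 5,
) -> float:
    """Share of questions whose expected block shows up in the top-k results."""
    hits = sum(
        1 for question, expected_id in harness
        if expected_id in retrieve(question, model)[:k]
    )
    return hits / len(harness)

# Same harness, both indexes:
#   recall_at_k(harness, retrieve, model="legacy")
#   recall_at_k(harness, retrieve, model="v3")
```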
The headline result: v3 improved recall@5 by a modest but consistent margin across every tenant we tested. No tenant regressed. The magnitude varied — tenants with more diverse KBs saw larger gains than tenants with tightly-scoped product-documentation-style KBs. We are not publishing the specific numbers because they are tenant-specific and some of the data is under NDA, but the shape was: every tenant got better, some got meaningfully better, none got worse.
The second metric we watched was latency. Gemini v3 embeddings have a different dimensionality than the legacy model’s, which affected both the index build time and the query-time vector comparison cost. End-to-end retrieval latency increased by a small amount — low single-digit milliseconds at the P95 — which was inside our budget.
The cutover
The cutover was per-tenant, staged over five days. A tenant was flipped only after three checks passed (the gate is sketched after the list):
- Its v3 backfill was 100% complete.
- Its eval harness showed no regression.
- Its latency budget was unchanged.
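A minimal sketch of that gate, with hypothetical names and inputs; the real checks live in our migration tooling, fed by the backfill dashboard and the eval harness.

```python
def ready_to_flip(
    backfill_pct: float,       # share of this tenant's blocks with a v3 embedding, 0 to 100
    recall_legacy: float,      # recall@5 on the legacy index, from the eval harness
    recall_v3: float,          # recall@5 on the v3 index, same harness
    p95_latency_ms: float,     # end-to-end retrieval latency measured against the v3 index
    latency_budget_ms: float,  # the tenant's existing latency budget
) -> bool:
    """All three conditions from the list above must hold before the flag flips."""
    return (
        backfill_pct >= 100.0
        and recall_v3 >= recall_legacy            # no regression on the eval harness
        and p95_latency_ms <= latency_budget_ms   # latency budget unchanged
    )
```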
The flip itself was a single feature-flag change. The retrieval API started reading from the v3 column. No deploys, no downtime, no write path changes. If anything looked wrong in the hour after the flip, the flag flipped back — we kept the legacy column live for another week as a rollback path.
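The flag only changes which column the retrieval query reads. A minimal sketch, reusing the same illustrative table and column names and pgvector's cosine-distance operator:

```python
def retrieval_sql(embedding_model: str) -> str:
    """Build the retrieval query against whichever column the tenant's flag selects."""
    # The flag is the only difference between the legacy and v3 read paths.
    # The query vector bound to %(query_vector)s must come from the matching model.
    column = "embedding_v3" if embedding_model == "v3" else "embedding"
    return f"""
        SELECT id
        FROM kb_blocks
        WHERE tenant_id = %(tenant_id)s
        ORDER BY {column} <=> %(query_vector)s
        LIMIT 5
    """
```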
Three tenants flipped on day one. Seven on day two. The rest on days three and four. Day five was buffer for anything we missed. Nothing got flipped back.
What we watched during the cutover
- P95 retrieval latency per tenant, second by second. No change outside the expected small increase.
- Drafting quality via a lightweight canary: a known set of questions per tenant, auto-answered every hour, scored on a cheap LLM-judged rubric. Drafts produced with v3 retrieval scored slightly better or flat. No tenant scored worse.
- User-visible error rate. Unchanged.
- Citation validity rate — the share of drafts where every cited claim actually appeared in the retrieved block. This is the metric we track for the grounded-AI pledge. Unchanged or slightly better. (The check is sketched below.)
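A minimal sketch of the citation validity check, under assumptions: a draft is a list of (claim text, cited block id) pairs, the retrieved blocks are available by id, and exact substring matching stands in for however the claims are really verified.

```python
def citation_validity_rate(
    drafts: list[list[tuple[str, str]]],  # each draft is a list of (claim_text, cited_block_id)
    blocks: dict[str, str],               # block_id -> retrieved block text
) -> float:
    """Share of drafts in which every cited claim appears in the block it cites."""
    def grounded(draft: list[tuple[str, str]]) -> bool:
        return all(claim in blocks.get(block_id, "") for claim, block_id in draft)

    if not drafts:
        return 1.0
    return sum(1 for draft in drafts if grounded(draft)) / len(drafts)
```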
What we stopped doing
A week after cutover, we deleted the legacy embedding column and dropped the legacy HNSW index. Storage dropped noticeably. The write path was simplified to a single embed call per block. The feature flag was removed from the retrieval API.
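The teardown is as small as the setup. A minimal sketch, reusing the same illustrative names; the legacy column is called embedding and its index kb_blocks_embedding_hnsw here, neither of which is our real name.

```python
import psycopg

TEARDOWN = [
    "DROP INDEX IF EXISTS kb_blocks_embedding_hnsw;",          # legacy HNSW index
    "ALTER TABLE kb_blocks DROP COLUMN IF EXISTS embedding;",  # legacy embedding column
]

# Illustrative connection string; run only after the rollback window has closed.
with psycopg.connect("dbname=app") as conn:
    for statement in TEARDOWN:
        conn.execute(statement)
```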
We kept the eval harness runs in our historical record. Any future embedding migration will compare against the v3 baseline with the same methodology.
What we would tell someone doing this for the first time
The time cost of a safe migration is not in the code. It is in the patience to wait for the backfill to finish, the eval harness to run, and the staged cutover to complete. Teams that skip the wait because the migration is “just an embedding model change” are the teams that discover on a Friday that retrieval is worse. The dual-index cost is a rounding error compared to the cost of a bad cutover in production.
The pgvector ecosystem makes dual-index straightforward — adding a column is cheap, building a second HNSW index is cheap, and dropping either column when you are done is cheap. There is no reason to do the cutover any other way.
We were not the first team to run this playbook. We will not be the last. The value of the writeup is that the next team that does it sees what a boring migration looks like and does not think it is supposed to be exciting.