Embedding model selection: why Gemini Embedding 2 for proposals
A teardown of how we evaluated four embedding models — Gemini Embedding 2, OpenAI text-embedding-3-large, Cohere embed-v4, and Voyage AI's voyage-3 — for a proposal corpus, and the methodology that drove the choice.
We picked an embedding model for proposal retrieval. Before we name the choice, we want to explain how the comparison was structured, because “we benchmarked four models and X won” is a sentence that means very little without knowing what was actually measured.
This post is a teardown of the evaluation methodology. We are not publishing specific precision-at-k numbers in this post for two reasons. First, our held-out evaluation corpus is built from real customer KBs and we cannot publish numbers tied to it without anonymizing it in ways that would damage the read. Second, embedding-model rankings on benchmarks like MTEB shift as new model versions ship, and a number we publish today is stale within a quarter. What we can publish is the methodology, the qualitative differences we saw, and the reasoning that landed the decision.
The four candidates
We compared four embedding models that are credible options for a production proposal retrieval system as of early 2025:
- Gemini Embedding 2 (Google) — multilingual, strong on long contexts, recent release with documented performance on retrieval benchmarks.
- text-embedding-3-large (OpenAI) — widely adopted, strong general-purpose retrieval performance.
- embed-v4 (Cohere) — designed with retrieval as the primary use case, supports multilingual and long contexts.
- voyage-3 (Voyage AI) — specialized embedding provider, marketed for retrieval-first workloads.
All four meet the baseline our system requires: vector dimensionality compatible with our pgvector setup, an inference latency profile that doesn’t bottleneck our drafting flow, and an API that is stable enough to depend on.
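To make the pgvector constraint concrete: the vector column's declared dimension is fixed at table-creation time, so it has to match whatever the chosen model emits, and switching models means re-embedding the corpus and migrating that column. A minimal sketch — the table name, connection string, and dimension value are placeholders, not our actual schema:

```python
import psycopg  # psycopg 3

# Placeholder: the declared column dimension must match whatever the chosen
# embedding model emits, which is why dimensionality is a hard baseline
# requirement rather than a tuning knob.
EMBEDDING_DIM = 1536

statements = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    f"""
    CREATE TABLE IF NOT EXISTS kb_chunks (
        id        bigserial PRIMARY KEY,
        document  text NOT NULL,
        section   text,
        content   text NOT NULL,
        embedding vector({EMBEDDING_DIM})
    )
    """,
    """
    CREATE INDEX IF NOT EXISTS kb_chunks_embedding_idx
        ON kb_chunks USING hnsw (embedding vector_cosine_ops)
    """,
]

with psycopg.connect("postgresql://localhost/proposals") as conn:
    for stmt in statements:
        conn.execute(stmt)
```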
What we evaluated against
The evaluation corpus is a held-out set of question-passage pairs drawn from real customer proposal libraries (with permission) and from public RFP responses. The questions are written by proposal practitioners — not generated synthetically — and each question has a labeled “correct” passage (sometimes more than one) drawn from the same corpus.
We split the questions into four classes, because a proposal retrieval system fails in different ways depending on what kind of question it is asked:
- Factual numeric questions. “What is the SLA for incident response?” Answer is a specific number.
- Compliance questions. “Describe your encryption practices.” Answer is a paragraph that may exist in multiple variants in the KB.
- Past-performance questions. “Have you done similar work for a state agency?” Answer is a reference to a specific past engagement.
- Win-theme questions. “How does the proposed approach reduce time to value?” Answer is a paragraph or section, often spanning multiple chunks.
The four classes are evaluated separately. A model that wins on factual numeric retrieval can lose on win-theme retrieval, and the relative weight a proposal team places on each class matters more than a single aggregate score.
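One way to represent the labeled pairs, sketched as a simple record type. The field names are illustrative rather than our exact schema, but the structure is the point: a question can have more than one labeled-correct passage, and it carries a class label so the four classes can be scored separately.

```python
from dataclasses import dataclass, field
from enum import Enum


class QuestionClass(str, Enum):
    FACTUAL_NUMERIC = "factual_numeric"
    COMPLIANCE = "compliance"
    PAST_PERFORMANCE = "past_performance"
    WIN_THEME = "win_theme"


@dataclass
class EvalPair:
    """One practitioner-written question and its labeled correct passage(s)."""
    question_id: str
    question: str
    question_class: QuestionClass
    # A question can have more than one labeled-correct passage.
    relevant_passage_ids: list[str] = field(default_factory=list)
    # Section ids make the "near miss" metric (same section, different passage) scoreable.
    relevant_section_ids: list[str] = field(default_factory=list)
```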
What we measured, and what we deliberately didn’t
We measured top-k recall at k = 1, 3, and 10. We measured mean reciprocal rank. We measured the rate at which the top-1 result was within the same document section as the labeled passage even when it wasn’t the labeled passage itself (a “near miss” metric — sometimes near misses are functionally correct in proposal use).
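Once each candidate model returns a ranked list of passage ids per question, all three metrics are a few lines each. A minimal sketch, with illustrative function names and a section-lookup dict that is an assumption of this example:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """1.0 if any labeled-correct passage appears in the top k results, else 0.0."""
    return 1.0 if relevant_ids & set(ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first labeled-correct passage; 0.0 if none is retrieved."""
    for rank, passage_id in enumerate(ranked_ids, start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def near_miss_at_1(
    ranked_ids: list[str],
    relevant_ids: set[str],
    section_of: dict[str, str],   # passage id -> section id
    relevant_sections: set[str],
) -> float:
    """Top-1 lands in the same section as a labeled passage without being it."""
    if not ranked_ids:
        return 0.0
    top = ranked_ids[0]
    if top in relevant_ids:
        return 0.0  # an exact hit, not a near miss
    return 1.0 if section_of.get(top) in relevant_sections else 0.0
```

Averaging each of these over the questions in a class gives the per-class recall@k, MRR, and near-miss rate.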
We did not measure cosine-similarity scores in isolation. Raw cosine numbers are not comparable across embedding models — different models produce different score distributions, and “0.74 in model A” can correspond to “0.91 in model B” for the same semantic match. Comparing models on raw similarity without rank-based metrics is a category error.
We also did not measure end-to-end answer quality (the generation step, on top of retrieval) at this stage. The embedding evaluation is upstream of the drafting evaluation, and conflating them produces results that depend on the drafting prompt as much as the embedding model. We evaluated drafting separately, with a fixed embedding choice.
What the four models are good at, qualitatively
Across the four question classes, the rough qualitative pattern we saw:
Factual numeric questions. All four models did reasonably well. Numeric questions tend to retrieve correctly because the question and the passage share specific terms (the metric name, the unit). Differentiation between models was small at the top of the leaderboard. Cohere embed-v4 was particularly strong here.
Compliance questions. This is where models diverge. Compliance questions often have multiple plausible answers in the KB — a generic security overview block, a specific SOC 2 control block, a customer-specific addendum block — and the “best” answer depends on the question’s framing. Gemini Embedding 2 did well at distinguishing between near-synonymous compliance blocks, partly because of its handling of long-context paraphrase. OpenAI text-embedding-3-large was competitive but tended to over-rank generic-overview blocks above specific-control blocks.
Past-performance questions. Past-performance retrieval rewards a model that can match a question about a type of past engagement to a passage describing a specific past engagement. This is harder than it looks — the question might say “state agency in healthcare” and the passage might say “Department of Health Services contract.” Gemini Embedding 2 and voyage-3 performed strongly here. Cohere embed-v4 was competitive.
Win-theme questions. This is the hardest class. Win-theme retrieval often spans multiple chunks, requires the model to recognize a thematic argument distributed across paragraphs, and rewards a model that handles long-context coherence well. Gemini Embedding 2’s long-context handling was the strongest of the four on this class. The differential here was the largest of any class.
Why we picked Gemini Embedding 2
Three reasons, in priority order.
First, the long-context handling. Proposal content is long-form. A KB block can be 800 to 1500 tokens. A question can pull on a thematic thread that spans a multi-paragraph block. The model’s behavior on those longer inputs is the dominant factor in retrieval quality for our workload, more than its handling of short factual queries.
Second, the multilingual case. Several customers responding to multinational tenders operate KBs with content in multiple languages. A model that handles cross-lingual retrieval gracefully reduces the engineering cost of supporting those customers.
Third, the cost-per-call envelope at our query volume. All four models are within a tolerable cost envelope. Gemini Embedding 2’s pricing is competitive at our retrieval volume, and the trajectory of Google’s embedding pricing has been favorable.
The decision is not a permanent choice. Embedding-model selection is a swap-out — our retrieval pipeline is structured so the embedding model is a configuration, not a deep dependency. We expect to re-evaluate at least annually, and we will change the choice when a different model wins decisively on our corpus.
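“The embedding model is a configuration, not a deep dependency” roughly means the retrieval pipeline talks to a thin, provider-agnostic interface and the concrete model is selected by a config key. A sketch of that shape, with illustrative names rather than our actual code; swapping the key still requires re-embedding the corpus, since stored vectors are model-specific:

```python
from typing import Protocol


class Embedder(Protocol):
    """The only embedding surface the retrieval pipeline depends on."""
    dimension: int

    def embed(self, texts: list[str]) -> list[list[float]]:
        ...


class GeminiEmbedder:
    """Wraps the provider SDK behind the shared interface (sketch only)."""
    dimension = 1536  # placeholder; set to the model's actual output size

    def embed(self, texts: list[str]) -> list[list[float]]:
        raise NotImplementedError("call the provider's embedding API here")


# Adding or swapping a model is an entry in this table plus a config change,
# followed by a re-embed of the corpus.
EMBEDDERS: dict[str, type] = {
    "gemini": GeminiEmbedder,
    # "openai": ..., "cohere": ..., "voyage": ...
}


def build_embedder(name: str) -> Embedder:
    return EMBEDDERS[name]()
```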
What this evaluation does not tell you
A few honest limits.
This evaluation is run against our held-out corpus. A different corpus — a different industry’s vocabulary, a different geography’s regulatory language, a different mix of question classes — could produce a different ranking. We are not claiming Gemini Embedding 2 is the best embedding model in absolute terms. We are claiming it is the best fit for our workload, evaluated on our corpus, with the methodology described above.
The Stanford HAI legal-RAG study and the long Hacker News thread on RAG limits both make a related point worth repeating: embedding-model selection is upstream of the harder problem, which is whether the generated answer is actually grounded in the retrieved passage. A better embedding model improves retrieval and modestly reduces downstream hallucination. It does not solve the entailment-gap problem. Our grounded-AI series is about the part of the stack that comes after retrieval.
The methodology you can copy
If you are running this evaluation for your own workload, the structure that worked for us:
- Build a held-out evaluation set with at least 200 question-passage pairs, written by practitioners, drawn from your actual corpus.
- Split the questions into 3 to 5 classes that correspond to the failure modes you actually care about.
- Measure rank-based metrics, not raw cosine similarity. Top-k recall at k = 1, 3, and 10 is a defensible default.
- Evaluate per-class, then weight the per-class results by the question-class mix you actually see in production (a sketch of this weighting step follows the list).
- Re-run the evaluation when you onboard a meaningfully different customer or when a candidate model ships a major version.
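The per-class weighting step is a few lines. The class weights and metric values below are placeholders to show the shape, not real results:

```python
# Per-class recall@3 for one candidate model, averaged over the questions in
# each class (see the metric sketch above). Fill in from your own run.
per_class_recall_at_3 = {
    "factual_numeric":  0.0,
    "compliance":       0.0,
    "past_performance": 0.0,
    "win_theme":        0.0,
}

# The question-class mix you actually see in production (illustrative weights).
production_mix = {
    "factual_numeric":  0.35,
    "compliance":       0.30,
    "past_performance": 0.20,
    "win_theme":        0.15,
}

weighted_score = sum(
    production_mix[cls] * per_class_recall_at_3[cls] for cls in production_mix
)
```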
A retrieval evaluation is a piece of software you maintain. It is not a one-time benchmark report.