Field notes

New models, quarterly eval: Sonnet 4.6, GPT-5.2, Gemini 3.1 Pro

An internal eval across three current-generation models for our four production workloads: drafting, claim verification, compliance extraction, and question rewriting. What moved, where we switched defaults, and why one workload still sits on a year-old model.

Three current-generation models landed across the providers we track between late 2025 and Q1 2026: Anthropic’s Claude Sonnet 4.6 (Feb 2026), OpenAI’s GPT-5.2, and Google’s Gemini 3.1 Pro (Feb 2026). This post covers what the internal eval looked like, which workloads moved, and which one stayed on a year-old model on purpose.

Numbers below are directional. They are produced by our own harness on our own workloads; they are not a general benchmark, and we would not expect them to generalize outside a grounded-RAG proposal context. The purpose of the post is to show how we think about the swap, not to publish a leaderboard.

The harness and the workloads

The eval harness is the same one we’ve run since 2025 (CLI reference), against the same golden set described here. Four workloads:

  1. Draft generation — write a grounded answer from retrieved KB blocks.
  2. Claim verification — entail or reject a specific numeric/named-entity claim against a source block.
  3. Compliance extraction — extract requirement rows from RFP text.
  4. Question rewriting — rewrite a buyer’s question into 3–5 retrieval queries.

For each workload, we measure a task-specific quality score, end-to-end latency, and cost per request.
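As a rough sketch of what one harness run produces per workload (the `RunResult` fields and `summarize` helper are illustrative names, not our actual CLI):

```python
import statistics
from dataclasses import dataclass

@dataclass
class RunResult:
    """One model call on one golden-set item (fields illustrative)."""
    quality: float    # task-specific score, 0-100
    latency_s: float  # end-to-end wall time
    cost_usd: float   # provider-billed cost for the request

def summarize(results: list[RunResult]) -> dict:
    """Collapse per-item results into the three numbers each table reports."""
    latencies = sorted(r.latency_s for r in results)
    p95_idx = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "quality": statistics.mean(r.quality for r in results),
        "p95_latency_s": latencies[p95_idx],
        "mean_cost_usd": statistics.mean(r.cost_usd for r in results),
    }
```

The aggregation is deliberately boring; the work is in the task-specific scorers that fill in `quality`.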

Results

Scores are directional. Each row is one configuration we ran; deltas within a column are more meaningful than absolute values.

Draft generation

| Model | Quality (0–100, internal) | P95 latency | Cost per draft |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 (prior default) | 89.4 | 14.2s | $0.041 |
| Claude Sonnet 4.6 | 91.1 | 13.1s | $0.039 |
| GPT-5.2 | 90.8 | 11.4s | $0.036 |
| Gemini 3.1 Pro | 87.2 | 9.8s | $0.028 |

Decision: switched default to Claude Sonnet 4.6. Quality delta from 4.5 is small but consistent, and citation-anchor adherence (a sub-score that measures whether claims resolve to the cited block) improved materially. GPT-5.2 is a close second and we’re keeping it wired as the fallback; Gemini 3.1 Pro’s lower quality score came almost entirely from citation-anchor drift, which matters more for us than the latency win.
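A minimal sketch of how the primary/fallback wiring can look, assuming a generic `call_model(model, prompt)` provider client; the function name and exception handling are placeholders, not our production router:

```python
def draft_with_fallback(prompt: str, call_model,
                        primary: str = "claude-sonnet-4.6",
                        fallback: str = "gpt-5.2") -> tuple[str, str]:
    """Try the default drafting model; shift to the runner-up on failure.

    Returns (draft_text, model_used) so callers can log which path ran.
    """
    try:
        return call_model(primary, prompt), primary
    except Exception:
        # Any provider-side failure (timeout, 5xx, rate limit) routes the
        # request to the fallback rather than failing the draft outright.
        return call_model(fallback, prompt), fallback
```

Keeping the runner-up wired like this is what makes a later default swap cheap: it is a one-line change in already-exercised code.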

Claim verification

| Model | Precision | Recall | F1 | Cost per claim |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 (prior default) | 0.962 | 0.931 | 0.946 | $0.003 |
| Claude Sonnet 4.6 | 0.974 | 0.928 | 0.950 | $0.003 |
| GPT-5.2 | 0.968 | 0.942 | 0.955 | $0.003 |
| Gemini 3.1 Pro | 0.971 | 0.918 | 0.944 | $0.002 |

Decision: switched default to GPT-5.2 for claim verification. This is the workload where recall matters more than raw precision — a missed unverifiable claim is worse than a verified claim flagged for re-check. GPT-5.2’s recall advantage is the biggest delta in the eval. Claude Sonnet 4.6 runs as the fallback.
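The F1 column is derivable from the precision and recall columns, so it doubles as a consistency check on the table. A quick cross-check of the reported numbers:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rows from the claim-verification table: (precision, recall, reported F1).
rows = [
    ("Claude Sonnet 4.5", 0.962, 0.931, 0.946),
    ("Claude Sonnet 4.6", 0.974, 0.928, 0.950),
    ("GPT-5.2",           0.968, 0.942, 0.955),
    ("Gemini 3.1 Pro",    0.971, 0.918, 0.944),
]
for name, p, r, reported in rows:
    # Each reported F1 matches the computed value to rounding.
    assert abs(f1(p, r) - reported) < 0.001, name
```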

Compliance extraction

| Model | Precision | Recall | Cost per RFP |
| --- | --- | --- | --- |
| Our fine-tuned BERT classifier (prior default) | 0.86 | 0.94 | $0.11 |
| Claude Sonnet 4.6 (zero-shot) | 0.81 | 0.96 | $2.40 |
| GPT-5.2 (zero-shot) | 0.79 | 0.97 | $2.10 |
| Gemini 3.1 Pro (zero-shot) | 0.77 | 0.95 | $1.40 |

Decision: stayed on the fine-tuned BERT. The grammar-based extraction we described in our compliance-extraction revisit still wins on precision and, per the table, is roughly 13–22x cheaper per RFP depending on the model. LLMs have slightly better recall on long-tail phrasing, but they also hallucinate requirements that aren’t in the document at all. For this workload, a small task-specific classifier beats a generalist.

Question rewriting

| Model | Retrieval MRR@10 | Latency P95 | Cost per rewrite |
| --- | --- | --- | --- |
| Claude Haiku 4.5 (prior default) | 0.71 | 780ms | $0.0004 |
| Claude Sonnet 4.6 | 0.74 | 1,420ms | $0.002 |
| GPT-5.2 mini | 0.73 | 890ms | $0.0006 |
| Gemini 3 Flash | 0.72 | 640ms | $0.0003 |

Decision: stayed on Claude Haiku 4.5. The 3-point MRR@10 delta to Sonnet 4.6 doesn’t justify a 5x cost and a 1.8x latency hit on a call that runs every retrieval. Haiku 4.5 is fine for this workload and there’s no reason to move it.
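The tradeoff arithmetic behind that call, taken straight from the table:

```python
# Numbers from the question-rewriting table above.
haiku_cost, sonnet_cost = 0.0004, 0.002
haiku_p95_ms, sonnet_p95_ms = 780, 1_420
haiku_mrr, sonnet_mrr = 0.71, 0.74

cost_ratio = sonnet_cost / haiku_cost         # 5x cost
latency_ratio = sonnet_p95_ms / haiku_p95_ms  # ~1.8x latency
mrr_delta = sonnet_mrr - haiku_mrr            # +0.03 MRR@10 (3 points)
```

Because the rewrite call fires on every retrieval, the 5x cost multiplier compounds across the whole product, while the 3-point quality gain only helps at the margin.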

Year-over-year view

Compared to the May 2025 eval, two things changed structurally:

Quality converged. Twelve months ago the spread between the best and worst model on draft generation was about 8 points. Today it’s 4. Choosing a model is now much more about workload fit than about finding a “better” model — most current-gen models are roughly comparable on the easy workloads and differentiated on the hard ones.

Cost dropped materially. Composite cost across our four workloads is down meaningfully since last May. This is the tailwind that let us hold our Team-tier pricing flat (covered in the year-one pricing post) while the underlying model spend fell.

What we didn’t test

We did not test open-weight models for this cycle. Our production stack requires provider-managed routing, data residency commitments, and the citation-anchor features we built on top of specific provider APIs. An open-weight evaluation is a real piece of work and we plan to run one in Q3 2026 on the verification workload specifically — it’s the one where self-hosted inference could plausibly win on cost.

We did not test reasoning modes on any model. The four workloads are all latency-sensitive; none of them tolerate the extra 30–120 seconds a reasoning pass adds. For one-off analysis tasks (an RFP teardown, say) we do use reasoning modes ad hoc, but they aren’t in the production critical path.

The takeaway

Two default swaps (drafting and verification), one non-change (extraction stays on fine-tuned BERT), one non-upgrade (question rewriting stays on Haiku 4.5). The eval drives the decisions; our job is to keep the harness honest. Numbers are ours, on our workloads — reproducing them requires our corpus, which we do not publish.

Sources

  1. Anthropic — Claude Sonnet 4.6 release
  2. Introducing GPT-5.2 (OpenAI)
  3. Gemini 3.1 Pro — release notes
  4. RAG eval quarterly report
  5. Golden set — how we curate
  6. Eval harness CLI
  7. Hallucination monitoring in production