Blog · Tag: evaluation
14 posts in this archive.
New models, quarterly eval: Sonnet 4.6, GPT-5.2, Gemini 3.1 Pro
An internal eval across three current-generation models for our specific workloads — drafting, claim verification, extraction. What moved, where we switched defaults, and why one workload still sits on a year-old model.
The prompt test suite, an update
300 tests across our drafting and verification prompts. What they cover, what they miss, which ones still flake, and how we keep the flaky ones from becoming the reason we stop running CI.
One year of grounded retrieval: what changed, what didn't
The engineering companion to the founder retrospective. A year of build-log posts, condensed: what the retrieval stack looks like now, how verification evolved, what the gold set became, and what's still unsolved.
Embedding evaluation, revisited
What we measure differently from 12 months ago. How the gold set grew, which metrics earned their spot in CI, and which ones we quietly retired.
Retrieval eval snapshot, December 2025
Quarter four retrieval evaluation numbers against our held-out RFP and DDQ corpus. What moved since September, what's still stuck, and which regressions we're not yet fixing.
Discriminator vs. feature: which one moves the score
Two real RFP sections, same underlying win theme, different framings. Which framing evaluators picked, and the rule I extracted from comparing them.
Confidence-threshold tuning for DDQ auto-answer
Where we set the confidence bar for auto-answering a DDQ question. The precision/recall trade-off, explained with our own data and the number we actually use for security questionnaires.
Hallucination monitoring in production
The metrics we watch weekly: per-claim refusal rate, citation-mismatch rate, and the human-graded sample. What we do when each one moves, and the threshold values that trigger an alert.
In preview: the retrieval-eval dashboard, publicly visible
Our internal retrieval evaluation dashboard is going public in preview. Real gold-set numbers, real regressions, updated nightly. Here is what is on it and what we deliberately left out.
Our retrieval eval, quarterly report
A quarter of running our retrieval evaluation harness against a frozen gold set: the regressions we caught, the two changes that actually moved precision, and the metric we stopped reporting because it lied.
How we curate the retrieval gold set
120 questions, three annotators, a disagreement-resolution protocol. The recipe behind the held-out set we evaluate every retrieval pipeline change against — and the parts we plan to open-source.
Testing retrieval: gold sets, precision@k, and why BLEU lies for proposals
Surface-form metrics like BLEU and ROUGE rate proposal text by token overlap. Token overlap is a poor proxy for whether the answer is actually right. Here's the eval stack we use instead.
Grounded retrieval: what it is, what it isn't, what we measure
The canonical long-read on grounded retrieval: the three invariants, the anti-patterns, the eval harness, the four open failure modes, and the research we're running next.
Embedding model selection: why Gemini Embedding 2 for proposals
A teardown of how we evaluated four embedding models — Gemini Embedding 2, OpenAI text-embedding-3-large, Cohere embed-v4, and Voyage — for a proposal corpus, and the methodology that drove the choice.
See the proposal workflow
Take the 5-minute tour, then start a trial workspace when you're ready to run a real pursuit against your own source material.