evaluation.

14 posts in this archive.

Grounded AI

New models, quarterly eval: Sonnet 4.6, GPT-5.2, Gemini 3.1 Pro

An internal eval across three current-generation models for our specific workloads — drafting, claim verification, extraction. What moved, where we switched defaults, and why one workload still sits on a year-old model.

The PursuitAgent engineering team
Engineering

The prompt test suite, an update

300 tests across our drafting and verification prompts. What they cover, what they miss, which ones still flake, and how we keep the flaky ones from becoming the reason we stop running CI.

The PursuitAgent engineering team
Engineering · Long read

One year of grounded retrieval: what changed, what didn't

The engineering companion to the founder retrospective. A year of build-log posts, condensed: what the retrieval stack looks like now, how verification evolved, what the gold set became, and what's still unsolved.

The PursuitAgent engineering team
Engineering

Embedding evaluation, revisited

What we measure differently from 12 months ago. How the gold set grew, which metrics earned their spot in CI, and which ones we quietly retired.

The PursuitAgent engineering team
Engineering

Retrieval eval snapshot, December 2025

Q4 retrieval evaluation numbers against our held-out RFP and DDQ corpus. What moved since September, what's still stuck, and which regressions we're not yet fixing.

The PursuitAgent engineering team
Craft

Discriminator vs. feature: which one moves the score

Two real RFP sections, same underlying win theme, different framings. Which framing evaluators picked, and the rule I extracted from comparing them.

Sarah Smith
Grounded AI

Confidence-threshold tuning for DDQ auto-answer

Where we set the confidence bar for auto-answering a DDQ question. The precision/recall trade-off, explained with our own data and the number we actually use for security questionnaires.

The PursuitAgent engineering team
Grounded AI

Hallucination monitoring in production

The metrics we watch weekly: per-claim refusal rate, citation-mismatch rate, and the human-graded sample. What we do when each one moves, and the threshold values that trigger an alert.

The PursuitAgent engineering team
Engineering

In preview: the retrieval-eval dashboard, publicly visible

Our internal retrieval evaluation dashboard is going public in preview. Real gold-set numbers, real regressions, updated nightly. Here's what's on it and what we deliberately left out.

The PursuitAgent engineering team
Engineering

Our retrieval eval, quarterly report

A quarter of running our retrieval evaluation harness against a frozen gold set: the regressions we caught, the two changes that actually moved precision, and the metric we stopped reporting because it lied.

The PursuitAgent engineering team
Engineering

How we curate the retrieval gold set

120 questions, three annotators, a disagreement-resolution protocol. The recipe behind the held-out set we evaluate every retrieval pipeline change against — and the parts we plan to open-source.

The PursuitAgent engineering team
Engineering

Testing retrieval: gold sets, precision@k, and why BLEU lies for proposals

Surface-form metrics like BLEU and ROUGE rate proposal text by token overlap. Token overlap is a poor proxy for whether the answer is actually right. Here's the eval stack we use instead.

The PursuitAgent engineering team
Grounded AI · Long read

Grounded retrieval: what it is, what it isn't, what we measure

The canonical long-read on grounded retrieval: the three invariants, the anti-patterns, the eval harness, the four open failure modes, and the research we're running next.

The PursuitAgent engineering team
Engineering

Embedding model selection: why Gemini Embedding 2 for proposals

A teardown of how we evaluated four embedding models — Gemini Embedding 2, OpenAI text-embedding-3-large, Cohere embed-v4, and Voyage — for a proposal corpus, and the methodology that drove the choice.

The PursuitAgent research team

See the proposal workflow

Take the 5-minute tour, then start a trial workspace when you're ready to run a real pursuit against your own source material.