In preview: the retrieval-eval dashboard, publicly visible
Our internal retrieval evaluation dashboard is going public in preview. Real gold-set numbers, real regressions, updated nightly. Here is what is on it and what we deliberately left out.
The dashboard sits behind a trust-page surface we are standing up alongside the Knowledge Base module and updates nightly from the harness described in our quarterly eval report. It is not yet linked from the marketed platform pages; the link from the trust surface lands when the trust page itself ships.
What is on it:
- Top-1, top-5, and top-10 retrieval precision against the frozen gold set, by date.
- Mean reciprocal rank by date (a sketch of both computations follows this list).
- Per-slice numbers: numeric questions, named-entity questions, CAIQ-tagged, SIG-tagged, paraphrased vs. verbatim.
- The current gold-set size (412 pairs, frozen since June 30) and the date of the last frozen snapshot.
- A link to the harness CLI, as documented in the eval harness post.
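For readers who want the metrics pinned down: a minimal sketch of top-k precision and mean reciprocal rank as conventionally computed. The function names and the ranked-ID representation are illustrative, not the harness CLI's actual interface.

```python
from typing import List

def topk_precision(ranked_ids: List[List[str]], gold_ids: List[str], k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids: List[List[str]], gold_ids: List[str]) -> float:
    """Mean of 1/rank of the first gold hit per question; contributes 0 when
    the gold chunk never appears in the ranked list."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)
```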
What is not on it:
- Customer KB numbers. We do not have permission to publish those, and they would be misleading even if we did. Customer KBs vary in size, freshness, and structure; one customer’s number is not predictive of another’s.
- Hallucination rate as a single number. We covered why in the hallucination budget per claim post — the metric is too easy to game and too easy to misread.
- Comparisons to competitors. We do not run their harnesses and they do not publish theirs.
Why we are publishing this
Two reasons. First, the Stanford HAI legal RAG paper showed that commercial vendors claiming “grounded AI” had hallucination rates between 17% and 33%. The category gets to make that claim because nobody publishes their numbers. We are publishing ours.
Second, regressions are easier to catch when they are visible to other people. If our top-1 precision drops 5 points overnight, somebody outside the team will notice. That is good for us.
The data behind the page
The dashboard pulls from the same harness CLI we have been using internally for over a year. Every nightly run produces a JSON artifact with the full numeric output; the dashboard renders the most recent N runs as time series. The artifacts are retained for 12 months. We can answer “what was top-1 precision on this slice on this date” precisely.
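The artifact-to-page path is simple enough to sketch. Assuming one JSON file per nightly run, named by date, something like the following could assemble the time series; the directory layout and schema here are assumptions for illustration, not the actual harness output.

```python
import json
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")  # hypothetical: one JSON artifact per nightly run

def load_series(metric: str, slice_name: str, last_n: int = 90) -> list[tuple[str, float]]:
    """Return (date, value) pairs for one metric on one slice over the most recent N runs."""
    series = []
    for path in sorted(ARTIFACT_DIR.glob("*.json"))[-last_n:]:  # files named by run date
        run = json.loads(path.read_text())
        series.append((path.stem, run["slices"][slice_name][metric]))  # assumed nesting
    return series

# e.g. load_series("top1_precision", "numeric_questions") -> [(date, value), ...]
```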
The data we are exposing per slice: top-1 precision, top-5 recall, mean reciprocal rank, citation-mismatch rate (sampled), and refusal rate. We are not exposing per-question results — that would put labeled gold-set content on a public page and make the gold set easier to game. The slice numbers tell the trend; the per-question numbers stay internal.
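One way to picture that boundary: per-question rows exist inside the pipeline, and only slice-level aggregates cross to the public page. A sketch of the aggregation step, with hypothetical field names:

```python
from collections import defaultdict
from statistics import mean

def publishable_slices(per_question: list[dict]) -> dict[str, dict[str, float]]:
    """Collapse per-question rows into slice aggregates. The per-question rows,
    and with them the labeled gold-set content, never leave this function."""
    by_slice: dict[str, list[dict]] = defaultdict(list)
    for row in per_question:
        # "slice", "top1_hit", and "reciprocal_rank" are assumed field names
        by_slice[row["slice"]].append(row)
    return {
        name: {
            "top1_precision": mean(r["top1_hit"] for r in rows),
            "mrr": mean(r["reciprocal_rank"] for r in rows),
            "n": len(rows),
        }
        for name, rows in by_slice.items()
    }
```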
What we are committing to
A regression on the dashboard means we have the conversation publicly. We will not delete a bad day. We will not pause publishing around a release we expect might cause a regression. The point of the public dashboard is exactly that: when the numbers move, they move where everyone can see them.
The dashboard is not pretty. It is a single page with tables and small charts. We will make it prettier when we have time. For now, the numbers are the point.