Field notes

What we learned analyzing 90 days of search logs

Three patterns in the KB-search query logs we did not expect, and one UX change we made because of the findings. Notes from a quarterly log review, written in the build-log spirit.

The PursuitAgent engineering team · 5 min read · Engineering

Once a quarter the team sits down with the search-query logs from the KB-search surface and reads them. Just reads them. Not as an analytics exercise — a literal reading, a sample of several hundred queries, annotated. This post is the writeup from the Q4 2025 review, which we did in the first week of January.

The methodology is deliberately low-tech. We pull a stratified sample of 400 search queries from the previous 90 days, split by customer segment (enterprise, mid-market, self-serve) and by session context (drafting, KB maintenance, post-review), and two engineers read through them with a lightweight annotation schema. The schema is short: what was the user asking, did the system answer well, what was the failure mode if not.
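
For the curious, here is a minimal sketch of the sampling pull in Python. The table and column names are illustrative, not our real schema; the point is the shape of the stratification, not the plumbing.

```python
import pandas as pd

# ~44 queries per (segment, context) cell; 3 segments x 3 contexts = 9 strata.
PER_STRATUM = 400 // 9

# Hypothetical export of the last 90 days of KB-search queries.
logs = pd.read_parquet("kb_search_queries_last_90d.parquet")

sample = (
    logs.groupby(["customer_segment", "session_context"], group_keys=False)
        .apply(lambda g: g.sample(n=min(PER_STRATUM, len(g)), random_state=7))
)

# The annotation schema is three free-text fields per query.
sample = sample.assign(asking="", answered_well="", failure_mode="")
sample.to_csv("annotation_sheet.csv", index=False)
```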

Three findings stood out.

Finding one — “synonym bleed” across acronyms

A surprising fraction of the queries — roughly 8% of the sample — were for acronyms where the same three letters map to meaningfully different things in different contexts. “SOC” is the clearest example. A security-focused customer types “SOC” expecting SOC 2 Type II content. A government-focused customer types “SOC” expecting Statement of Work templates. A healthcare customer types “SOC” expecting Standard of Care documentation.

The retrieval system handled these by similarity, which meant it picked whichever interpretation was most common in the KB. For a customer whose KB had both SOC 2 and SOC (Statement of Work) content, the wrong one would sometimes come back first.

The fix, which is now live, is an acronym-disambiguation step that runs when a query is mostly acronym and the top-K results span materially different concepts. Instead of guessing, the UI surfaces the detected senses and asks the user to pick, and the choice is remembered for the rest of the session. The fix is closest in spirit to the citation-disagreement logic we wrote about last week: the resolver does not decide for the user when the signal is genuinely ambiguous.
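
For concreteness, a sketch of the trigger condition. Two assumptions here are ours for illustration only: that acronym-ness can be tested with a simple token pattern, and that each retrieved chunk carries a coarse concept label (the `concept_label` field is hypothetical).

```python
import re

ACRONYM_RE = re.compile(r"^[A-Za-z]{2,5}$")

def is_mostly_acronym(query: str) -> bool:
    # "Mostly acronym" here: at least half the tokens are short,
    # letter-only strings. "SOC" and "SOC 2" both qualify.
    tokens = query.split()
    if not tokens:
        return False
    short = sum(1 for t in tokens if ACRONYM_RE.match(t))
    return short / len(tokens) >= 0.5

def detected_senses(top_k) -> set:
    # Assumed: each retrieved chunk carries a coarse concept label,
    # e.g. "soc2_compliance" vs. "statement_of_work".
    return {chunk["concept_label"] for chunk in top_k}

def should_disambiguate(query: str, top_k) -> bool:
    # Fire the sense picker only when the query is acronym-heavy AND
    # the top-K results span more than one concept.
    return is_mostly_acronym(query) and len(detected_senses(top_k)) > 1

# Session-scoped memory for the user's pick: (session_id, acronym) -> sense.
chosen_sense = {}
```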

Finding two — query reformulation is silent

A second pattern: roughly 14% of sessions contained a query that the user reformulated within 30 seconds of the first query. The reformulations were mostly the user adding specificity (“encryption” becomes “data-at-rest encryption AES-256”) or restricting scope (“incident response” becomes “incident response for PHI exposures”).
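
Counting this is straightforward if the log yields per-session (timestamp, query) pairs in time order; here is a sketch under that assumption. Whether the second query actually narrows the first stayed a human judgment during annotation.

```python
from datetime import timedelta

WINDOW = timedelta(seconds=30)

def reformulated_within_window(session_queries) -> bool:
    # session_queries: [(timestamp, query_text), ...], sorted by time.
    # The tuple shape is an assumption for the sketch, not the log schema.
    for (t_prev, q_prev), (t_next, q_next) in zip(session_queries, session_queries[1:]):
        if t_next - t_prev <= WINDOW and q_next != q_prev:
            return True
    return False
```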

This was a surprise because the UI did not treat reformulation as a first-class action. The user was doing the search-refinement work entirely in their head, typing each new query from scratch. That is fine in moderation but painful at volume.

We added a reformulation assist that, when the user clears the search box after a query that returned results, offers to refine the prior query rather than starting fresh. The assist is opt-in by keypress (a tooltip with a hotkey appears). Early data shows about 30% of users accept the refinement path when it is offered, which is a larger fraction than we expected.

Finding three — the “nothing” query

A small but recurring pattern — 2% of queries — was a single-word query that was so generic (e.g., “proposal,” “vendor,” “compliance”) that it matched a very large number of KB chunks and returned a useless result list. The user would scroll briefly and abandon the search.

These queries are not the search system’s fault, but they are the search system’s job to handle gracefully. The old behavior was to return the top-K by similarity, which for a very generic query is a somewhat arbitrary slice. The new behavior, which shipped last week, is to detect when the query is generic (by the entropy of the result distribution) and prompt the user with a disambiguation: “this returned a lot of matches; what are you trying to do?” The prompt suggests three or four structured follow-ups (draft a new response, find a specific document, look up a policy) that narrow the search before re-running.
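
One way to operationalize the entropy check, sketched below: normalize the top-K similarity scores into a distribution and compare its Shannon entropy to the log2(K) maximum. A very generic query tends to produce a flat score profile, which pushes the entropy toward that maximum. The threshold here is illustrative, not the shipped value.

```python
import math

def score_entropy(scores) -> float:
    # Treat the non-negative top-K similarity scores as an unnormalized
    # distribution and compute its Shannon entropy in bits.
    total = sum(scores)
    if total == 0:
        return 0.0
    probs = [s / total for s in scores]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def is_generic(scores, threshold_ratio=0.97) -> bool:
    # Flag the query as generic when entropy is near the log2(K) maximum,
    # i.e. the result list is an essentially flat, arbitrary slice.
    k = len(scores)
    if k < 2:
        return False
    return score_entropy(scores) >= threshold_ratio * math.log2(k)
```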

The metric we are watching is the abandonment rate on low-specificity queries. If that rate drops from the baseline — it was around 60% abandonment on the generic-query subset — we will call the prompt a success.

What we are not doing

Two things we considered and did not ship.

We considered a query-correction step that would autocorrect typos and map common misspellings. We decided not to ship it by default because the domain vocabulary is full of terms that look like typos of other terms (e.g., “DDQ” vs. “DQ,” “SOC” vs. “SoC”), and silent correction would quietly rewrite legitimate queries more often than it would rescue genuine typos. If we ship a correction step later, it will be explicit rather than silent.

We considered a global query-ranking model that would reorder results across customers based on aggregated behavior. We decided not to ship it because the customers we checked with were uncomfortable with cross-customer signal influencing their search. Their KBs are proprietary; their search behavior is the revealed version of their proprietary work. We keep the ranking model customer-scoped.

What the next review will look for

The Q1 2026 review, which will happen at the end of April, will check two things. First, whether the reformulation assist and the nothing-query prompt moved the metrics we expect. Second, what patterns emerge in the queries from customers who joined in December and January. New customers tend to have distinct search patterns (they search more, they search broader, they adjust as they learn the KB scope), and reading their queries is the best early signal on whether the onboarding is landing.

The quarterly log review is one of the cheapest structured-investigation activities we run. An afternoon for two engineers, a sample of 400 queries, a short writeup. Recommended practice for anyone running a retrieval system that users type into.