Field notes

Clustering win themes across 200 past bids

How we cluster win-theme assertions across a corpus of past proposals to surface repeat themes, where the signal is real, and where the clustering is just noise dressed as insight.

The PursuitAgent engineering team · Engineering · 5 min read

A team that’s submitted 200 proposals has roughly 600 win-theme assertions on the record. Some are paraphrases of each other. Some look like paraphrases but aren’t. The win-loss dashboard has to tell a user which themes have actually been repeated and how often each repeated theme has won.

This post is about the clustering. What worked, what we tried that didn’t, where the output is trustworthy, and where it lies.

The input

Every win theme captured under the schema we covered on Tuesday lives in proposal_theme with a free-text assertion. A typical assertion looks like:

“Our deployment plan minimizes risk by phasing the rollout across three regional waves with a 30-day stabilization between each.”

One sentence, twenty-odd words. Not boilerplate, not specific enough to be unique either. Across 200 bids, we’d expect this assertion (or a near-paraphrase) to repeat roughly twenty times — sometimes worded as “phased rollout,” sometimes as “regional deployment,” sometimes as “staged go-live with stabilization windows.”

The clustering question: are these the same theme?
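Everything downstream rests on embedding distance: two assertions are candidates for the same theme when their vectors sit close together. A minimal cosine-similarity sketch (the function name is ours, and the short vectors below are stand-ins for real embedding output):

```typescript
// Cosine similarity between two embedding vectors: 1 means identical
// direction, 0 means unrelated. Real vectors have hundreds of dimensions;
// the math is the same.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```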

The pipeline

raw assertions
   ↓ (text-embedding-3-large, 1536d)
embeddings
   ↓ (HDBSCAN, min_cluster_size=4, min_samples=2)
candidate clusters
   ↓ (LLM consolidation pass)
labeled clusters with member counts
   ↓ (join to proposal.outcome)
clusters with win rate

Five steps. The interesting parts are step three and step four.
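For orientation, the five stages can be sketched as types. The shapes and names here are illustrative, not our production schema:

```typescript
// One interface per pipeline stage; each stage adds fields to the last.
interface Assertion {
  id: string;
  proposalId: string;
  text: string;
}
interface EmbeddedAssertion extends Assertion {
  embedding: number[];
}
interface CandidateCluster {
  clusterId: number;
  members: EmbeddedAssertion[];
}
interface LabeledCluster extends CandidateCluster {
  label: string;
  summary: string;
  split: boolean;
}
interface ScoredCluster extends LabeledCluster {
  memberCount: number;
  winRate: number; // from the join to proposal.outcome
}
```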

Step 3 — HDBSCAN, not k-means

We tried k-means first. It’s the default. It was wrong almost everywhere we tested.

The reason is that win themes don’t cluster around equally-sized centroids. A team has three or four themes that show up in 30+ bids each, a long tail of themes that show up in 4-10 bids, and a fat tail of one-off assertions. K-means forces the long-tail themes into the dominant clusters and creates the illusion that every bid is a variation on the same five themes.

HDBSCAN handles the structure correctly. It finds dense regions and labels everything outside them as noise. With min_cluster_size=4, we get clusters that represent themes asserted at least four times. With min_samples=2, the clusters tolerate a small amount of paraphrase variance without splitting.
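HDBSCAN’s output is just an integer label per point, with -1 marking noise. Grouping those labels back into clusters downstream is a small fold; a sketch, assuming the labels arrive as an array parallel to the assertions (the shapes here are ours, not the production schema):

```typescript
// Group HDBSCAN labels into clusters; label -1 is noise and is kept
// separately rather than forced into a cluster.
function groupByLabel(
  texts: string[],
  labels: number[]
): { clusters: Map<number, string[]>; noise: string[] } {
  const clusters = new Map<number, string[]>();
  const noise: string[] = [];
  labels.forEach((label, i) => {
    if (label === -1) {
      noise.push(texts[i]);
      return;
    }
    const members = clusters.get(label) ?? [];
    members.push(texts[i]);
    clusters.set(label, members);
  });
  return { clusters, noise };
}
```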

A typical run on a 600-assertion corpus produces ~25 clusters and ~200 noise points. The noise points are real — they’re one-off assertions specific to one bid, which is fine.

Step 4 — the LLM consolidation pass

Raw clusters from HDBSCAN are mathematically correct and not human-readable. The cluster centroid is a vector. Members are individual assertions. Without a label, the dashboard would show “Cluster 17 (47 members, 32% win rate)” and the user would have to read the assertions to know what Cluster 17 is.

We pass each cluster through a labeling step:

const labelPrompt = `
The following ${members.length} win-theme assertions cluster together by embedding distance.
Read them and produce:
- A 4-8 word label that describes the shared claim.
- A single sentence summarizing what the cluster asserts.
- A "split" flag, with reasoning, if the cluster contains two distinct claims that should be separated.

Assertions:
${members.map((m, i) => `[${i + 1}] ${m.text}`).join("\n")}
`;

The split flag is the load-bearing part. About 15% of HDBSCAN clusters in our test runs contained two distinct claims — for example, “phased rollout reduces risk” and “phased rollout accelerates time-to-value” have surface text similar enough to cluster together but make different promises to different buyers. The LLM catches the split and we run a second-pass clustering on the flagged clusters.
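The consolidation pass only works if the model’s reply comes back in a shape the pipeline can trust. A sketch of a defensive parse, assuming the model returns JSON with label, summary, and split fields (the field names are ours; the post doesn’t pin down the response schema):

```typescript
interface ClusterLabel {
  label: string;
  summary: string;
  split: boolean;
  splitReasoning?: string;
}

// Defensive parse of the LLM reply; malformed output falls back to a
// split=false placeholder so one bad response never crashes the nightly job.
function parseLabelResponse(raw: string): ClusterLabel {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.label !== "string" || typeof parsed.summary !== "string") {
      throw new Error("missing fields");
    }
    return {
      label: parsed.label,
      summary: parsed.summary,
      split: parsed.split === true,
      splitReasoning:
        typeof parsed.splitReasoning === "string" ? parsed.splitReasoning : undefined,
    };
  } catch {
    return { label: "unlabeled cluster", summary: raw.slice(0, 200), split: false };
  }
}
```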

The labeling step is not the search step. It’s a one-time pass that runs nightly, not on the request path. Cost is in the cents per company per night, which is fine.

Where the clustering is real signal

When a cluster has 20+ members and a win rate that diverges from the company-wide baseline by 10+ points in either direction, that’s a finding. “Phased regional rollout” with 32 members and a 41% win rate when the company average is 28% — that’s an actual repeated theme that’s working.
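The filter described above is simple to state in code. A sketch using the thresholds from this section (20+ members, 10+ points of win-rate divergence); the function shape and field names are ours:

```typescript
interface ClusterStats {
  label: string;
  members: number;
  winRate: number; // 0..1
}

// Keep clusters worth a user's attention: enough members to matter and a
// win rate at least 10 points from the company baseline, either direction.
// Sorted by the size of the divergence, biggest first.
function findSignal(clusters: ClusterStats[], baseline: number): ClusterStats[] {
  return clusters
    .filter((c) => c.members >= 20 && Math.abs(c.winRate - baseline) >= 0.1)
    .sort(
      (a, b) => Math.abs(b.winRate - baseline) - Math.abs(a.winRate - baseline)
    );
}
```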

The dashboard surfaces these clusters at the top, with the win-rate delta, and the user can click through to see the bids and the buyers.

Where the clustering lies

Three failure modes we’ve watched play out.

Buyer-correlation masquerading as theme-strength. A cluster has a 50% win rate. Sounds great. Twelve of the fifteen members are bids to the same buyer who renews 80% of the time. The cluster’s win rate isn’t telling you about the theme — it’s telling you about the buyer. We mitigated this with a buyer-diversity score per cluster, displayed alongside the win rate, but the underlying problem is real and unfixable in the data.
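One plausible form for that buyer-diversity score is the share of distinct buyers among a cluster’s members; the product’s exact formula isn’t specified here, so treat this as a sketch:

```typescript
// Buyer diversity: distinct buyers / members. A cluster where 12 of 15
// bids went to one buyer scores low, flagging its win rate as buyer-driven
// rather than theme-driven.
function buyerDiversity(buyerIds: string[]): number {
  if (buyerIds.length === 0) return 0;
  return new Set(buyerIds).size / buyerIds.length;
}
```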

Themes that became boilerplate. A cluster with 60 members and a win rate close to the average is probably a theme the team has stopped iterating on. It runs because it always runs. The dashboard flags it as “high volume, neutral signal” — translation: this isn’t winning bids for you; it’s just there.
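The “high volume, neutral signal” flag is one comparison. A sketch; the 40-member and 5-point cutoffs are illustrative choices, not the product’s:

```typescript
// Boilerplate detection: a big cluster whose win rate hugs the baseline.
function isBoilerplate(members: number, winRate: number, baseline: number): boolean {
  return members >= 40 && Math.abs(winRate - baseline) < 0.05;
}
```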

Recency bias. Clusters that include older and newer bids weight them equally. A theme that worked in 2024 and stopped working in 2025 averages out to “neutral.” We made the recency-weighted view the default, with the all-time view behind a filter, because the rolling 12-month signal is what most users want.
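One way to build a recency-weighted win rate is exponential decay by bid age. The post doesn’t specify the weighting function, so the half-life form and the 365-day constant below are assumptions:

```typescript
// Recency-weighted win rate: each bid's contribution halves every
// halfLifeDays. A win from two years ago counts, but much less than
// a win from last quarter.
function weightedWinRate(
  bids: { won: boolean; ageDays: number }[],
  halfLifeDays = 365
): number {
  let num = 0;
  let den = 0;
  for (const b of bids) {
    const w = Math.pow(0.5, b.ageDays / halfLifeDays);
    if (b.won) num += w;
    den += w;
  }
  return den === 0 ? 0 : num / den;
}
```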

What we don’t claim

We don’t claim the clusters tell you what to write. They tell you which themes have repeated and how each has performed. The “what to write” question is a craft question, and Sarah’s win-themes field guide is the better starting point.

We also don’t claim the win rate is causal. A theme can be associated with wins because it’s the theme strong bids tend to assert, not because asserting it caused the win. We surface association; we don’t claim causation. The dashboard literally says “associated win rate” not “win rate.”

What’s next

The piece coming next Tuesday — evidence linking — is the engineering pair to this one. Clusters tell you which themes repeat. The evidence-linking pass tells you which KB blocks were used to source them and whether those blocks are still current. Together, they make the dashboard useful for editing the KB rather than just reading reports.

Sources

  1. The win-loss database schema, explained
  2. Embedding model selection
  3. KB deduplication, semantic