Semantic deduplication of KB blocks at ingest
How we merge near-duplicate KB blocks at ingest time using embedding similarity, the thresholds we settled on after testing four threshold pairs, and the trade-off we accept by tuning toward under-merging.
A KB grows by ingest. Customers upload past proposals, DDQ responses, security questionnaires, and marketing collateral. Every ingest produces blocks, and many of those blocks are near-duplicates of blocks already in the KB. Without deduplication, the KB can double in size within months and the retriever returns five copies of the same answer with slightly different wording.
This post describes how our deduplication works. It runs at ingest, before the block is written. The mechanism is straightforward; the threshold tuning was the part that took work.
The pipeline
Every incoming block gets embedded. We use the same embedding model the retriever uses (covered in Embedding model selection). The embedding is queried against the existing KB’s vector index for the same tenant. If any existing block has a cosine similarity at or above the investigate threshold (0.82), the system enters a merge decision.
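```typescript
async function ingestBlock(tenantId: string, candidate: Block) {
  // Embed the candidate with the same model the retriever uses.
  const embedding = await embed(candidate.text);

  // Look up the closest existing blocks for this tenant only.
  const neighbors = await pgvector.query({
    tenantId,
    embedding,
    topK: 5,
    minSimilarity: 0.82, // the "investigate" threshold
  });

  // Nothing close enough: write the candidate as a new block.
  if (neighbors.length === 0) {
    return await writeNewBlock(tenantId, candidate);
  }

  // At least one neighbor is close enough to investigate.
  const decision = await mergeDecision(candidate, neighbors);
  return decision.kind === "merge"
    ? await mergeInto(decision.target, candidate)
    : await writeNewBlock(tenantId, candidate);
}
```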
The merge decision is the interesting part. Cosine similarity of 0.82 or higher is “close enough to investigate.” Above 0.94 is “auto-merge.” Between 0.82 and 0.94 is the gray zone: the system flags the pair for the KB owner to review before the next retrieval cycle.
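The routing described above can be sketched as a small pure function. This is a minimal illustration, not the production `mergeDecision`: the names `routeBySimilarity`, `cosineSimilarity`, and `Route` are hypothetical, and the handling of the exact boundary values (0.82 counts as gray zone, 0.94 does not auto-merge) is an assumption. Only the two threshold values come from this post.

```typescript
// Hypothetical names; only the two threshold values come from the post.
type Route = "auto-merge" | "gray-zone" | "new-block";

const INVESTIGATE_THRESHOLD = 0.82;
const AUTO_MERGE_THRESHOLD = 0.94;

// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: a zero vector is treated as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Map a similarity score to one of the three outcomes.
// Assumption: boundaries are "> 0.94" for auto-merge and ">= 0.82" for gray zone.
function routeBySimilarity(similarity: number): Route {
  if (similarity > AUTO_MERGE_THRESHOLD) return "auto-merge";
  if (similarity >= INVESTIGATE_THRESHOLD) return "gray-zone";
  return "new-block";
}
```

Keeping the routing separate from the vector query makes the thresholds easy to test in isolation from the index.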
The thresholds we tested
We tested four threshold pairs against a labeled set of 600 block pairs from real customer KBs (with permission, anonymized):
| Auto-merge threshold | Investigate threshold | False merges (per 100) | Missed merges (per 100) |
|---|---|---|---|
| 0.90 | 0.78 | 9 | 4 |
| 0.92 | 0.80 | 5 | 6 |
| 0.94 | 0.82 | 2 | 11 |
| 0.96 | 0.85 | 0 | 18 |
A “false merge” is two blocks the deduplicator merged that should have stayed separate. A “missed merge” is two blocks that should have merged but did not.
We settled on 0.94 / 0.82. The 2-per-100 false-merge rate was acceptable because every auto-merge produces a notification and the KB owner can split a bad merge in one click. The 11-per-100 missed-merge rate was acceptable because the gray-zone flag (0.82 to 0.94) catches most of them at review time.
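For readers who want to run a similar comparison, here is one plausible way to score a threshold pair against a labeled set. It is a sketch under assumptions: `LabeledPair`, `ThresholdReport`, and `evaluateThresholds` are hypothetical names, the rates are normalized per 100 labeled pairs, and a "missed merge" is counted as any pair that should have merged but was not auto-merged. The post does not state how its own numbers were normalized.

```typescript
// Hypothetical shapes for a labeled evaluation set.
interface LabeledPair {
  similarity: number;   // cosine similarity of the two blocks
  shouldMerge: boolean; // human label: true if the pair is a real duplicate
}

interface ThresholdReport {
  falseMergesPer100: number;  // auto-merged but labeled "keep separate"
  missedMergesPer100: number; // labeled "merge" but not auto-merged
  missedButFlagged: number;   // missed merges that still land in the gray zone
}

function evaluateThresholds(
  pairs: LabeledPair[],
  autoMerge: number,
  investigate: number,
): ThresholdReport {
  if (pairs.length === 0) throw new Error("need at least one labeled pair");
  let falseMerges = 0;
  let missedMerges = 0;
  let missedButFlagged = 0;
  for (const pair of pairs) {
    const merged = pair.similarity > autoMerge;
    if (merged && !pair.shouldMerge) falseMerges++;
    if (!merged && pair.shouldMerge) {
      missedMerges++;
      // Still surfaced to the KB owner if it clears the investigate threshold.
      if (pair.similarity >= investigate) missedButFlagged++;
    }
  }
  // Assumption: rates are expressed per 100 labeled pairs.
  const per100 = (count: number) => (count / pairs.length) * 100;
  return {
    falseMergesPer100: per100(falseMerges),
    missedMergesPer100: per100(missedMerges),
    missedButFlagged,
  };
}
```

Tracking `missedButFlagged` separately shows how much of the missed-merge rate the gray-zone review is expected to recover.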
The trade-off we accept
The thresholds tune toward conservative merging. We would rather under-merge and let the KB owner see two near-duplicate blocks side by side than over-merge and silently lose content the owner wanted to keep separate.
The reason: a false merge can lose information that the team did not realize was distinct. Two CAIQ answers that look semantically similar might differ in a single numeric anchor: one says “30-day retention” and one says “90-day retention” because they are for different products in the same company. Auto-merging those drops one of the answers. The team finds out three months later when the wrong answer ships to a buyer.
Missed merges, by contrast, are visible. The KB owner sees a near-duplicate flag in the dashboard and can resolve it. Visible problems are easier to fix than silent data loss.
What survives a merge
When two blocks merge, the system has to decide what survives. The rules:
- The newer block’s text wins. Recent edits typically reflect the current truth.
- Both source attributions are preserved. The merged block carries metadata pointing to both original ingest documents.
- The earliest creation date is preserved. The block’s “first authored” date is the older of the two; this matters for freshness scoring (covered in shipped freshness scores).
- Approval status is the more conservative of the two. If one block is approved and the other is draft, the merged block is draft until re-approved.
- The owner is the more recently active of the two. A block whose owner left the company defers to one whose owner is still on the team.
These rules are written down because we got them wrong twice in early implementations. The first version preserved the older text. The second version let the approval status of either side carry over to the merged block. Both produced incidents the KB owners had to clean up.
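The five survivor rules can be sketched as a single pure function. The field names (`updatedAt`, `createdAt`, `sources`, `approval`, `owner.lastActiveAt`) and the two-value approval status are assumptions made for illustration; the real block schema is not shown in this post.

```typescript
// Hypothetical block shape; field names are illustrative only.
type ApprovalStatus = "approved" | "draft";

interface KbBlock {
  text: string;
  updatedAt: Date; // used to decide which text is newer
  createdAt: Date; // "first authored" date
  sources: string[]; // ingest documents this block came from
  approval: ApprovalStatus;
  owner: { id: string; lastActiveAt: Date };
}

function mergeBlocks(a: KbBlock, b: KbBlock): KbBlock {
  const newer = a.updatedAt.getTime() >= b.updatedAt.getTime() ? a : b;
  const older = newer === a ? b : a;
  return {
    // Rule 1: the newer block's text wins.
    text: newer.text,
    updatedAt: newer.updatedAt,
    // Rule 3: keep the earliest creation date.
    createdAt:
      a.createdAt.getTime() <= b.createdAt.getTime() ? a.createdAt : b.createdAt,
    // Rule 2: keep both source attributions, without duplicates.
    sources: Array.from(new Set([...older.sources, ...newer.sources])),
    // Rule 4: the more conservative status wins; approved only if both are.
    approval:
      a.approval === "approved" && b.approval === "approved" ? "approved" : "draft",
    // Rule 5: the more recently active owner wins.
    owner:
      a.owner.lastActiveAt.getTime() >= b.owner.lastActiveAt.getTime()
        ? a.owner
        : b.owner,
  };
}
```

Writing the rules as one function with no side effects keeps each rule visible on its own line, which is the property that the early implementations lacked.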
What this is not
This deduplicator does not handle semantic conflict. Two blocks that say opposite things at high similarity (one says “we support X,” one says “we do not support X”) get flagged for the gray zone, but the system does not pick a winner. The owner has to.
It also does not handle structural duplicates from the same document. If a customer uploads the same PDF twice, the file-level fingerprint catches it before block-level dedup runs. That is a different mechanism, covered in the multi-doc RFP ingest changelog.
Edge cases we hit in production
A few production patterns worth calling out, because they were not obvious from the test set.
Tables and lists. A block that is a table of compliance certifications and a block that is the same table with one row added are usually above 0.94 cosine similarity, and the auto-merger picks the older one. That is wrong; the newer one has the new row. We added a structural-difference check: if both blocks are tables, the merger checks row counts and column structure before auto-merging. A row-count mismatch demotes the merge to gray-zone.
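A structural-difference check along these lines might look like the following sketch. `TableBlock`, `sameTableShape`, and `routeTableMerge` are hypothetical names, and comparing column headers by exact string match is an assumption about what "column structure" means here.

```typescript
// Hypothetical table representation; the real block format is not shown in the post.
interface TableBlock {
  columns: string[]; // header names, in order
  rows: string[][];  // cell values, one array per row
}

type TableMergeRoute = "auto-merge" | "gray-zone";

// True when both tables have the same row count and the same ordered headers.
function sameTableShape(a: TableBlock, b: TableBlock): boolean {
  if (a.rows.length !== b.rows.length) return false;
  if (a.columns.length !== b.columns.length) return false;
  return a.columns.every((name, i) => name === b.columns[i]);
}

// Demote an auto-merge to gray zone when the table shapes differ.
function routeTableMerge(
  a: TableBlock,
  b: TableBlock,
  proposed: TableMergeRoute,
): TableMergeRoute {
  if (proposed === "auto-merge" && !sameTableShape(a, b)) return "gray-zone";
  return proposed;
}
```

The check only ever demotes a decision; it never promotes a gray-zone pair to auto-merge.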
Boilerplate that should not merge. Some sentences appear in multiple distinct blocks legitimately, such as a company-overview paragraph that introduces three different proposal sections. Embedding similarity is high. Auto-merging would lose context. The fix: blocks tagged with structural metadata (block type = “section intro,” block type = “answer body”) only merge with same-type counterparts. Cross-type matches go to gray-zone regardless of similarity.
Translated content. A small number of customers ingest both English and translated versions of the same document. The embeddings of high-quality translations are surprisingly close. We added a language-tag check: blocks with different language tags do not auto-merge, full stop.
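The block-type and language-tag rules can be expressed as guards applied after the similarity routing. In this sketch, `TaggedBlock`, `applyTagGuards`, and the field names are hypothetical. Sending a language mismatch to the gray zone is also an assumption: the post only says such pairs never auto-merge, not where they go instead.

```typescript
// Hypothetical metadata carried by each block.
interface TaggedBlock {
  blockType: string; // e.g. "section intro" or "answer body"
  language: string;  // e.g. "en" or "de"
}

type GuardedRoute = "auto-merge" | "gray-zone";

// Apply the tag guards to a decision already made from similarity alone.
function applyTagGuards(
  candidate: TaggedBlock,
  neighbor: TaggedBlock,
  proposed: GuardedRoute,
): GuardedRoute {
  // Cross-type matches go to gray zone regardless of similarity.
  if (candidate.blockType !== neighbor.blockType) return "gray-zone";
  // Different languages never auto-merge.
  // Assumption: they are flagged for review rather than dropped silently.
  if (candidate.language !== neighbor.language) return "gray-zone";
  return proposed;
}
```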
The current numbers from production: across active tenants, the deduplicator merges or flags about 14% of incoming blocks. Without it, a tenant ingesting two past proposals a week would double their KB in eight months. With it, the KB grows roughly with the rate of genuinely new content, which is what a KB should do.