Confidence-threshold tuning for DDQ auto-answer
Where we set the confidence bar for auto-answering a DDQ question. The precision/recall trade-off, explained with our own data and the number we actually use for security questionnaires.
The auto-answer path fires when two things are true: the retrieval layer finds a KB block that matches the question, and the verifier’s confidence in the match is above a threshold we set. Below the threshold, the question escalates to an SME. The threshold is a single number. That number is where a lot of the system’s real behavior gets decided.
This post is about how we tuned it. The short version: we run two thresholds, one for security questionnaires and one for general RFP questions, and the gap between them reflects the different review discipline behind each.
A note on the numbers. The threshold/auto-answer/precision table below comes from our internal eval harness on our own gold sets. These are directional — they describe where the curve sits on our corpora, not a general benchmark. The shape (higher threshold → fewer auto-answers but higher precision) is durable; the specific percentages move as the gold set grows and as the verifier is retuned. We would not expect these numbers to reproduce on an external setup and will publish the harness and a public gold set when both are stable.
What the confidence number is
The verifier produces a score between 0 and 1 for every retrieval candidate. The score is not a single model output — it is a composite of four signals:
- Embedding similarity. Cosine similarity between the query embedding and the block embedding, normalized.
- Reranker score. Output of a cross-encoder reranker run over the top 20 embedding candidates.
- Entailment probability. A separate model’s probability estimate that the block text entails the question’s likely answer.
- Freshness adjustment. A multiplicative factor based on the block’s freshness score — a stale block gets its confidence discounted.
The four signals are combined with weights we retune quarterly. The current weighting: 0.2 embedding, 0.3 reranker, 0.4 entailment, 0.1 freshness. Entailment does most of the work. Embedding similarity is the cheapest signal but the weakest predictor of actual correctness.
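To make the blend concrete, here is a minimal sketch of how a score like this could be assembled. The weights are the ones quoted above; the signal names and the way the freshness discount is applied are illustrative assumptions, not the verifier's actual implementation.

```python
# A minimal sketch of the composite confidence score described above.
# The weights are the quarterly-retuned values quoted in the text; how the
# freshness factor enters (a weighted term plus a discount on stale blocks)
# is one plausible reading, not the exact formula.

WEIGHTS = {
    "embedding": 0.2,   # normalized cosine similarity, query vs. block
    "reranker": 0.3,    # cross-encoder score over the top-20 candidates
    "entailment": 0.4,  # P(block text entails the question's likely answer)
    "freshness": 0.1,   # block freshness score
}

STALE_CUTOFF = 0.5      # below this freshness, treat the block as stale (assumed)
STALE_DISCOUNT = 0.7    # roughly the 30% discount described later in the post

def composite_confidence(signals: dict) -> float:
    """Weighted blend of the four signals, each already normalized to [0, 1]."""
    score = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    # A stale block additionally discounts the whole composite, so it
    # cannot clear the auto-answer threshold on relevance alone.
    if signals["freshness"] < STALE_CUTOFF:
        score *= STALE_DISCOUNT
    return score
```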
The precision/recall trade-off
A higher threshold means fewer auto-answers and fewer wrong answers. A lower threshold means more auto-answers, more SME time saved, more wrong answers shipped. The shape is familiar.
We measure this against our gold sets. For the security-questionnaire sub-gold-set (1,240 questions across SOC 2, SIG, CAIQ, HIPAA, and custom formats), the curve looks like this:
| Threshold | Auto-answer rate | Precision (of auto-answers) | SME tickets per 100 questions |
|---|---|---|---|
| 0.72 | 91% | 0.89 | 9 |
| 0.80 | 82% | 0.94 | 18 |
| 0.84 | 76% | 0.96 | 24 |
| 0.88 | 68% | 0.98 | 32 |
| 0.92 | 54% | 0.99 | 46 |
| 0.95 | 38% | 0.995 | 62 |
At 0.72, we auto-answer 91% of security questions but accept roughly 11 wrong answers per 100 auto-answered questions. At 0.95, we auto-answer 38% and the wrong-answer rate drops to 0.5 per 100.
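The sweep itself is simple. A sketch of what the harness does, assuming a gold set where each question carries the verifier confidence of its best candidate, the auto-drafted answer, and the gold label (field names are illustrative):

```python
# Sketch of the threshold sweep behind the table above: for each candidate
# threshold, count how many gold questions would auto-answer and how many of
# those auto-answers match the gold label.

def sweep(gold, thresholds=(0.72, 0.80, 0.84, 0.88, 0.92, 0.95)):
    rows = []
    for t in thresholds:
        auto = [q for q in gold if q["confidence"] >= t]
        correct = [q for q in auto if q["auto_answer"] == q["gold_answer"]]
        rows.append({
            "threshold": t,
            "auto_answer_rate": len(auto) / len(gold),
            "precision": len(correct) / len(auto) if auto else None,
            "sme_tickets_per_100": round(100 * (1 - len(auto) / len(gold))),
        })
    return rows
```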
For security questionnaires, the cost of a wrong auto-answer is high. A wrong answer that ships to a buyer is a representation problem (Monday’s post covers why). The cost of an SME ticket is moderate — an SME spends five to fifteen minutes resolving it. The trade-off pushes the threshold up.
For general RFP questions, the cost of a wrong answer is lower (reviewable in the color-team cycle) and the cost of an SME ticket is higher (SMEs are rarer and more expensive per question). The trade-off pushes the threshold down.
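As a back-of-envelope, the direction of each trade-off falls out of a one-line expected-cost calculation. The cost figures below are placeholders chosen only to show the arithmetic, not numbers from our data:

```python
# Expected cost per 100 incoming questions at a given operating point.
# Cost values are illustrative placeholders, not figures from the post.

def expected_cost_per_100(auto_rate, precision, cost_wrong, cost_ticket):
    wrong_shipped = 100 * auto_rate * (1 - precision)   # wrong auto-answers
    sme_tickets = 100 * (1 - auto_rate)                  # escalations
    return wrong_shipped * cost_wrong + sme_tickets * cost_ticket

# Security pillar, with a shipped wrong answer ~50x as costly as a ticket:
# the 0.88 operating point beats the 0.72 one by a wide margin.
print(expected_cost_per_100(0.91, 0.89, cost_wrong=50, cost_ticket=1))  # ~510
print(expected_cost_per_100(0.68, 0.98, cost_wrong=50, cost_ticket=1))  # ~100
```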
The numbers we actually use
Security questionnaires: 0.88. We auto-answer 68% of security questions, take 32 SME escalations per 100 questions, and hold precision on auto-answers at 0.98. The remaining 2% of auto-answers are wrong answers that cleared the threshold; most are caught by the quality-pass step before the response ships, and the ones that slip through become the recycled-answer audit material for January.
General RFP questions: 0.72. Lower threshold, higher auto-answer rate. The trade-off works because RFP responses go through color-team reviews where auto-drafted content gets a fresh read before ship. The review catches what the verifier missed.
The 16-point gap between 0.72 and 0.88 is the operational difference between a questionnaire and an RFP. A questionnaire ships on a tight cycle with limited review; an RFP ships with pink, red, and gold review passes. The threshold follows the review discipline.
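Expressed as configuration, the split is just two numbers keyed by pillar. The keys and structure below are illustrative; only the thresholds come from the post:

```python
# Illustrative per-pillar thresholds; structure assumed, numbers from the post.
AUTO_ANSWER_THRESHOLDS = {
    "security_questionnaire": 0.88,  # tight cycle, limited review, high cost of error
    "general_rfp": 0.72,             # color-team reviews catch what the verifier misses
}

def should_auto_answer(pillar: str, confidence: float) -> bool:
    return confidence >= AUTO_ANSWER_THRESHOLDS[pillar]
```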
What we learned tuning this
Four things that were not obvious when we started.
The curve is not smooth. Between 0.82 and 0.86 there is a soft plateau where precision gains slow down; between 0.86 and 0.90 the gains accelerate again. The shape comes from specific question categories that cluster at those confidence levels: below 0.86, a significant share of the auto-answers are buyer-specific policy questions that the retrieval layer matches poorly; above 0.90, most of what remains is crisply templated SOC 2 questions.
Entailment is the dominant signal in the composite, but noisy. We run the entailment check on the retrieval candidate’s full block text, not just the nearest chunk. Full-block entailment is more reliable but slower. At one point we tried per-chunk entailment to save cost; precision dropped 3 points at the same threshold. Not worth it.
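The choice between the two looks roughly like the sketch below. `entail_prob` stands in for whatever NLI model computes the entailment probability; it is a placeholder, not an API from our stack.

```python
# Sketch of the full-block vs. per-chunk entailment choice.

def block_entailment(entail_prob, block_text, hypothesis, chunks=None):
    if chunks is None:
        # Full-block entailment: one call over the whole block. Slower
        # (longer input), but the score reflects the block's full context.
        return entail_prob(block_text, hypothesis)
    # Per-chunk variant we tried to save cost: score each chunk and take the
    # max. At the same threshold this cost us ~3 points of precision.
    return max(entail_prob(chunk, hypothesis) for chunk in chunks)
```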
The freshness factor does real work. A block that is 11 months past its effective date with an annual attestation cycle gets its confidence discounted by roughly 30%. This means a stale block that would otherwise meet the 0.88 threshold will not. That is intentional — a stale block can be correct but cannot be shipped without review. The freshness discount forces the review.
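A sketch of a freshness factor with that behavior: no discount early in the attestation cycle, ramping to roughly a 30% discount near the end. The linear ramp and the halfway knee are assumptions; the quoted behavior is the 11-months-in example.

```python
# Illustrative freshness factor; the shape is assumed, the endpoint behavior
# (roughly a 30% discount at 11 months of a 12-month cycle) is from the post.

def freshness_factor(months_since_effective, cycle_months=12):
    """Multiplicative discount applied to the composite confidence."""
    age = min(months_since_effective / cycle_months, 1.0)
    if age < 0.5:
        return 1.0
    return 1.0 - 0.7 * (age - 0.5)

print(freshness_factor(11))  # ~0.71, i.e. roughly the 30% discount quoted above
```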
Customers want to tune this per-account. The 0.88 default works for the median customer. Customers in defense and regulated finance push it higher (0.92). Customers in less-regulated B2B SaaS are comfortable with 0.84. We expose the setting to GRC leads with some guard rails against going below 0.80 for security-pillar blocks.
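The guardrail is a floor check, nothing more. Names and the error type below are illustrative:

```python
# Sketch of the per-account override with the security-pillar guardrail.
SECURITY_PILLAR_FLOOR = 0.80   # GRC leads cannot set security-pillar blocks lower

def set_account_threshold(pillar: str, requested: float) -> float:
    if pillar == "security" and requested < SECURITY_PILLAR_FLOOR:
        raise ValueError(f"security-pillar threshold cannot go below {SECURITY_PILLAR_FLOOR}")
    return requested
```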
What the number is not
The confidence threshold is not a prediction of correctness. A question auto-answered at 0.95 confidence is not 95% likely to be correct. It is 99.5% likely to be correct on our test distribution, and that distribution does not perfectly reflect production traffic. The threshold is a control, not a guarantee.
The guarantee comes from the verification discipline that runs regardless of confidence — citation fidelity, claim-level entailment, evidence freshness, access-layer checks. Confidence decides whether to auto-answer; verification decides whether the answer is shippable. For the verification side, the grounded-retrieval pillar is the deeper read. For the SME escalation path that handles the refused answers, the team-structure post covers the operational side.
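Kept separate, the two decisions look like this. The check names mirror the pillars listed above; everything else in the sketch is illustrative.

```python
# Confidence gates whether to auto-answer; verification gates whether the
# result ships. Checks are passed in as callables to keep the sketch
# self-contained.
from typing import Callable

def route(confidence: float, threshold: float,
          verification_checks: list[Callable[[], bool]]) -> str:
    if confidence < threshold:
        return "escalate_to_sme"      # refused by the confidence gate
    if not all(check() for check in verification_checks):
        return "hold_for_review"      # auto-answered, but not shippable
    return "ship"

# e.g. checks = [citation_fidelity, claim_entailment, evidence_freshness, access_check]
```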