Field notes

Compliance-matrix tooling, a quick bake-off

Four tools, one 60-page state RFP, the same compliance-matrix task. Time-to-matrix and accuracy against a hand-built reference. The results were closer than I expected, and the failure modes were not the ones I expected.

PursuitAgent · 3 min read · RFP Mechanics

I ran a small bake-off this week. One 60-page state RFP, four tools, the same compliance-matrix extraction task. Each tool got the PDF and was asked to produce a structured matrix of every shall-, must-, and will-statement, with section pointers. I had a hand-built reference matrix from the proposal team that ran the bid, with 247 verified requirements.
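For concreteness, the row shape I scored against looks roughly like this. A minimal sketch; the field names are mine, not any tool's output schema, and they just mirror what the hand-built reference tracked:

```python
# Hypothetical row shape for the extracted matrix. Field names are mine,
# not any tool's schema; they mirror what the hand-built reference tracked.
from dataclasses import dataclass

@dataclass
class Requirement:
    req_id: str    # e.g. "R-042", assigned at extraction time
    section: str   # section pointer, e.g. "C.3.2" or "H.4"
    page: int      # page number in the source PDF
    text: str      # verbatim requirement sentence
    modal: str     # "shall", "must", or "will"
```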

The four tools: PursuitAgent (us), one of the major incumbent platforms with a recent AI-extraction feature, a general-purpose RAG tool (LlamaIndex with a custom prompt), and ChatGPT with a careful prompt and the PDF as an attachment. I am not naming the incumbent; the point is the pattern, not the comparison.

The numbers

Time-to-matrix:

  • PursuitAgent: 3 minutes 41 seconds.
  • Incumbent: 6 minutes 12 seconds.
  • LlamaIndex custom: 8 minutes (most of which was prompt iteration).
  • ChatGPT: 4 minutes 30 seconds (one shot, no iteration).

Accuracy against the 247-row reference (precision and recall):

  • PursuitAgent: 94% precision, 91% recall.
  • Incumbent: 87% precision, 84% recall.
  • LlamaIndex custom: 81% precision, 79% recall (after a second iteration; the first pass was 70%/70%).
  • ChatGPT: 76% precision, 88% recall.

These are single-RFP numbers. They are directional, not benchmarks. A different RFP would produce different numbers across all four tools.
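For anyone re-running this, the scoring arithmetic is trivial once each extracted row has been matched (or not) to a reference row; the matching is the manual part. A minimal sketch, with illustrative counts that back out to roughly our percentages rather than the exact tally:

```python
# Precision/recall against the hand-built reference, given a completed matching.
def precision_recall(matched: int, extracted: int, reference: int) -> tuple[float, float]:
    precision = matched / extracted   # share of extracted rows that are real requirements
    recall = matched / reference      # share of reference requirements that were found
    return precision, recall

# Illustrative counts only: they back out to roughly 94% / 91%, not the exact tally.
p, r = precision_recall(matched=225, extracted=239, reference=247)
```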

What broke

The failure modes were more interesting than the aggregate numbers.

The incumbent missed Section H (Special Contract Requirements) entirely. It treated H as an administrative section and skipped extraction. Section H had 18 requirements in this RFP. Score: zero of 18 in the incumbent’s output. This is the kind of error that causes a real disqualification.
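It is also the easiest failure mode to catch mechanically: compare the sections in the RFP outline against the sections that actually show up in the matrix. A rough sketch of the check I would want before trusting any extracted matrix; this is my own post-processing idea, not a feature of any of the four tools:

```python
def missing_sections(outline: set[str], extracted_sections: list[str]) -> set[str]:
    """Top-level RFP sections with zero extracted requirements."""
    covered = {s.split(".")[0] for s in extracted_sections}   # "H.4" -> "H"
    return outline - covered

# Section H is in the outline but absent from the output: flag it for manual review.
gaps = missing_sections({"C", "H", "L", "M"}, ["C.3.2", "C.5.1", "L.4", "M.1"])
print(gaps)  # {'H'}
```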

LlamaIndex returned strong matches on shall-statements and weak matches on the buyer’s softer language (“the offeror is encouraged to,” “should provide where applicable”). This is a known retrieval failure for general-purpose tools: the prompt has to be RFP-specific. With a more careful prompt I could have moved the numbers, but the time cost would have erased the speed advantage.
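To make “RFP-specific” concrete, here is the kind of prompt wording I mean, abbreviated and hypothetical rather than the prompt I actually iterated on: enumerate the soft phrasings explicitly instead of hoping the model generalizes from “shall.”

```python
# Hypothetical extraction prompt, to be dropped into whatever query or
# extraction call the pipeline uses. Abbreviated; not the bake-off prompt.
RFP_EXTRACTION_PROMPT = """\
From the RFP text below, list every requirement imposed on the offeror.
Treat all of the following as requirement language, not just "shall":
  - shall / must / will
  - "is required to", "is responsible for"
  - "the offeror is encouraged to", "should provide where applicable"
    (mark these as SOFT)
For each item return: section number, page, the verbatim sentence, and
whether it is HARD or SOFT.

RFP text:
{rfp_chunk}
"""
```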

ChatGPT had high recall and lower precision because it occasionally synthesized a “requirement” from instructional text that was not actually a requirement. Eight rows in its output were RFP-content paraphrases that didn’t correspond to a shall-statement. False positives in a compliance matrix are dangerous; they pull team effort to obligations that don’t exist.
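A cheap filter for this failure mode is to require every extracted row to quote the source near-verbatim and to contain requirement language; a paraphrase fails the substring test. A rough sketch, my own post-processing idea rather than something any of the four tools does:

```python
import re

MODALS = re.compile(r"\b(shall|must|will|is required to)\b", re.IGNORECASE)

def is_grounded(row_text: str, source_text: str) -> bool:
    """True if the row quotes the RFP near-verbatim and contains requirement language."""
    norm = lambda s: " ".join(s.split()).lower()   # collapse whitespace, ignore case
    return bool(MODALS.search(row_text)) and norm(row_text) in norm(source_text)
```

A check like this would likely have caught the eight paraphrased rows, at the cost of also dropping rows an extractor lightly rewords.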

Our system missed three shall-statements that lived in a footnote. Footnote handling in our PDF parse is the work in progress described in the Ingest pipeline with LlamaParse post. We have a fix in the queue.

What I take from this

A few things.

First, the incumbent’s AI extraction is not bad; 87% precision is usable. But the section-skipping pattern (missing Section H) is a category error, not a tuning problem. Tools that don’t understand RFP structure produce systematic blind spots, which are dangerous in ways that random errors are not.

Second, ChatGPT-with-a-prompt is closer than I expected for a one-shot baseline. If a small team has no budget for tooling and is willing to manually verify the false positives, it is not a useless workflow. It is also not a workflow that scales; the verification cost dominates at any volume.

Third, the speed gap among purpose-built tools is small. The accuracy gap is real but narrow. The category is not differentiating itself on extraction accuracy alone; differentiation has to come from what happens after extraction — versioning, review workflows, traceability to the response. We covered this in the compliance matrix in 30 minutes post.

I would run this again with a federal IDIQ — those have the most complex extraction surface. The state RFP I used was structurally clean. A messier document would spread the tools out more.

For now: any tool whose extraction misses an entire section silently is one I would not put on a bid where compliance scoring is binary. The hand-built reference still matters more than the tool.