Field notes

Our infra spend for proposal workloads, year one

Database, inference, storage, and ops. What we spent running proposal workloads over our first year, broken down by category. Where the cost curves went the way we expected, and where they surprised us.

The PursuitAgent research team · 6 min read · Research

We have been running proposal workloads in production for 12 months. This post is an internal-transparency teardown: what we spent, where it went, and how the per-unit economics moved as volume grew.

Numbers are rounded for clarity and cover May 2025 through April 2026. Nothing here is market-sensitive enough to withhold, and several readers have asked us to publish the shape, so here it is. No customer data, no tenant-specific numbers.

Total spend, categorized

Over the first year, approximately:

| Category | Share of infra spend | Trend |
| --- | --- | --- |
| Inference (LLM API calls) | ~62% | Per-unit down 45%, total up 3.1x with volume |
| Vector DB + retrieval | ~11% | Stable per-unit, total up 2.8x |
| Relational DB + storage | ~8% | Per-unit flat, total up 2.4x |
| Object storage (RFPs, artifacts) | ~6% | Per-unit down slightly, total up 3.5x |
| Extraction (LlamaParse/Adobe) | ~7% | Per-unit flat, total up 2.9x |
| Observability + ops | ~4% | Per-unit down, total up 1.8x |
| Egress + misc | ~2% | |

Total infra spend grew 2.7x over the period. Proposal volume (unique responses drafted end-to-end) grew 4.2x. Per-response infra cost therefore dropped to roughly 64% of the starting baseline over the year (2.7 ÷ 4.2 ≈ 0.64).

Where inference cost went

Inference is 62% of infra and the only category worth walking through in detail.

Q1 (May-July 2025). Heavy experimentation. We were trying three different embedding models in parallel and running every draft through two independent generation passes for eval purposes. Per-response cost was roughly 2.2x what it is today. Not sustainable but necessary for calibration.

Q2 (August-October 2025). Settled on primary models. Switched default generation from a mixed Opus/Sonnet routing to Sonnet-first with Opus only on flagged high-stakes sections. Per-response cost dropped 35% in this period.
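
For concreteness, a minimal sketch of that routing split, assuming each section arrives from intake with tags and a reviewer-set high-stakes flag; the function and field names are illustrative, only the Sonnet-first / Opus-on-flag split is from this post.

```python
# Hypothetical sketch of Sonnet-first routing with Opus reserved for
# flagged high-stakes sections. Field names and tags are illustrative.

HIGH_STAKES_TAGS = {"pricing", "legal", "past-performance", "executive-summary"}

def pick_generation_model(section: dict) -> str:
    """Route one proposal section to a generation tier."""
    tags = set(section.get("tags", []))
    if section.get("high_stakes") or tags & HIGH_STAKES_TAGS:
        return "opus"    # expensive tier, only for flagged high-stakes sections
    return "sonnet"      # default tier for everything else
```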

Q3 (November 2025-January 2026). Added the per-claim verification pass. This cost us an incremental 20% per response — see the claim-verification cost profile for the breakdown. It was a deliberate quality-over-cost trade.

Q4 (February-April 2026). Provider price drops plus our own caching and batching optimizations. Per-response cost dropped another 25% in this window. Anthropic and Google both cut input-token pricing materially; OpenAI cut reasoning-token pricing on GPT-5. We rebalanced routing quarterly based on updated pricing.

Net: inference per-response is roughly 45% lower than the early-months baseline for a more thorough pipeline.

What surprised us

Three cost surprises, all in the same direction: costs came in higher than we budgeted.

Vector DB storage grew faster than we expected. We projected vector DB spend to scale linearly with KB content volume. It grew superlinearly because we added multiple embedding versions per block to support A/B evaluation of embedding models. Each new embedding model meant re-embedding the full corpus. Three switches across the year meant the vector DB was holding 4x the data we originally projected.

Fix in progress: we no longer keep legacy embeddings in production. When a new embedding model becomes the default, the old vectors get moved to cold storage after a 30-day overlap. This brought projected vector DB cost back onto the original linear trajectory.
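
A rough sketch of that retirement policy, assuming a vector DB client with per-model export and delete operations; every interface here is a hypothetical stand-in, not our actual tooling.

```python
# Hypothetical sketch of the legacy-embedding retirement policy: once a new
# embedding model has been the default for 30 days, vectors from older models
# are exported to cold object storage and deleted from the live vector DB.

from datetime import datetime, timedelta, timezone

OVERLAP = timedelta(days=30)

def retire_legacy_embeddings(vector_db, cold_store, default_model: str,
                             promoted_at: datetime) -> None:
    if datetime.now(timezone.utc) - promoted_at < OVERLAP:
        return  # still inside the overlap window; keep old and new versions live

    for model in vector_db.list_embedding_models():
        if model == default_model:
            continue
        vectors = vector_db.export_vectors(embedding_model=model)
        cold_store.write(f"embeddings/{model}.parquet", vectors)  # cheap, rarely read
        vector_db.delete_vectors(embedding_model=model)           # frees live storage
```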

Object storage for RFP attachments came in larger than expected. RFPs come with attachments: technical appendices, compliance templates, drawings, scoring worksheets. Across the year, attachment volume grew to ~8x the volume of the RFP documents themselves. Federal RFPs are particularly heavy: a typical DoD solicitation is 40 pages of narrative plus 200-800 pages of attachments.

We looked at tiered storage; for the access patterns we see (attachments read during intake and occasionally during draft generation, rarely after submission), standard storage is cheaper than the more complex tiering we considered. We kept it simple.
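
The comparison behind that call is back-of-envelope arithmetic along these lines; the prices and fee structure below are placeholders for illustration, not our actual rates or volumes.

```python
# Illustrative monthly cost comparison: standard storage vs. a tiered setup
# with cheaper at-rest pricing but retrieval and per-object transition fees.
# All numbers are placeholder assumptions, not our actual rates.

def monthly_cost_standard(gb_stored: float, price_per_gb: float = 0.023) -> float:
    return gb_stored * price_per_gb

def monthly_cost_tiered(gb_stored: float, gb_read: float, objects_transitioned: int,
                        price_per_gb: float = 0.0125, retrieval_per_gb: float = 0.01,
                        transition_per_1k: float = 0.01) -> float:
    return (gb_stored * price_per_gb
            + gb_read * retrieval_per_gb
            + objects_transitioned / 1000 * transition_per_1k)

# With many small attachment files, per-object transition requests and
# minimum-duration charges can eat the per-GB savings; that is the trade
# the access-pattern analysis has to settle.
```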

Extraction cost stayed flat per page. We expected LlamaParse and Adobe pricing to drift down over the year; neither provider meaningfully cut pricing for the document sizes we hit. We tested self-hosted alternatives (Docling, Unstructured) in 2025 and the quality gap — particularly on complex tables and multi-column layouts — was large enough that we did not switch. Extraction is one category where cost has not moved with the industry.

What did not surprise us

Inference cost per response would drop materially. Obvious bet, and it played out. Provider competition is real; we benefited from it.

Observability cost would drop per-unit. We right-sized retention windows, moved hot metrics to cheaper storage, and reduced alert-query frequency on quiet paths.

Relational DB cost would be stable. Postgres is Postgres. Nothing exotic happens; the cost is a function of storage and compute, both of which have been stable.

What we cut

We removed three line items over the year.

A commercial observability product that, at our scale, cost roughly 2x a self-hosted alternative. We use the self-hosted alternative now. Saved ~2% of infra; the feature gap is real, and we track what we lost.

A second LLM provider for generation redundancy. We had a hot failover to a second provider for a 6-month window; we removed it when the primary provider’s reliability proved adequate. Saved ~4% of inference cost. Would add it back in a heartbeat if the primary’s reliability regressed.

Per-request tracing of the full RAG context. We used to log the complete retrieved context for every request. Useful for debugging; expensive at volume. We now sample 1% and keep the rest at aggregate-metric level. Saved ~1% of infra.
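
A minimal sketch of the sampling policy, assuming a generic metrics client; the names are hypothetical, but the shape is the one described above: aggregate counters always, full retrieved-context traces for roughly 1% of requests.

```python
# Illustrative sketch of 1% sampling for full RAG-context traces.
# The metrics client and its methods are hypothetical stand-ins.

import random

TRACE_SAMPLE_RATE = 0.01  # keep full retrieved context for ~1% of requests

def record_rag_request(request_id: str, retrieved_chunks: list[str], metrics) -> None:
    # Always-cheap aggregate metrics: counts and sizes, never the text itself.
    metrics.increment("rag.requests")
    metrics.observe("rag.context_chars", sum(len(c) for c in retrieved_chunks))

    # Expensive full-context trace only for the sampled slice.
    if random.random() < TRACE_SAMPLE_RATE:
        metrics.log_trace(request_id, {"retrieved_context": retrieved_chunks})
```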

Per-response trajectory

If you are planning capacity for a proposal-tool workload at our mix:

| Period | Approx. per-response infra cost | Major drivers |
| --- | --- | --- |
| Q1 | Highest (call it 2.2x) | Early experimentation |
| Q2 | 1.5x | Model routing settled |
| Q3 | 1.8x | Verification pipeline added |
| Q4 | 1.0x (current) | Price drops + caching |

Figure $4-$8 per 40-page response today, all-in, including verification. Larger responses scale approximately linearly; smaller responses (single DDQ, RFI) run $0.50-$2.
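
If it helps for planning, those figures translate into a rough estimator like the sketch below; the function is a hypothetical convenience built only on the numbers in this post, and the linear-scaling assumption is the approximation stated above.

```python
# Rough capacity-planning helper: $4-$8 all-in for a 40-page response,
# scaling roughly linearly with length, and $0.50-$2 for small artifacts
# like a single DDQ or RFI. A convenience sketch, not a pricing model.

def estimate_infra_cost(pages: int) -> tuple[float, float]:
    """Return a (low, high) per-response infra cost estimate in USD."""
    if pages <= 10:                    # single DDQ / RFI territory
        return (0.50, 2.00)
    scale = pages / 40                 # linear scaling from the 40-page anchor
    return (4.0 * scale, 8.0 * scale)

# e.g. estimate_infra_cost(120) -> (12.0, 24.0) for a 120-page response
```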

What we expect next year

Our base case:

  • Inference down another 20-30% per-unit as providers continue competing. We are not counting on provider-side reasoning-cost innovations; we are counting on continued baseline price drops.
  • Vector DB flat per-unit as we stop multiplying embeddings. Total grows with volume; per-unit stable.
  • Extraction stable. We do not expect the extraction providers to cut pricing materially.
  • Observability down modestly as we continue right-sizing.

We don’t project a dramatic shift. Next year’s infra looks like this year’s, just with gentler slopes.

Closing

We publish this because we think proposal-tool infra costs are opaque in the market and because we want to give operators running similar workloads a data point. Our numbers reflect our architecture choices (per-claim verification, structured facet retrieval, daily eval harnesses). Teams making different quality/cost trades will see different curves.

For more on the cost side of specific pipeline stages, see cost per response breakdown and RAG cost control budgets.

Posts bylined to “The PursuitAgent research team” are synthesis notes from our data pulls and internal analysis. Numbers in this post are rounded and reflect internal infrastructure accounting for the period May 2025 through April 2026.

Sources

  1. Cost per response breakdown (PursuitAgent)
  2. RAG cost control budgets (PursuitAgent)
  3. Embedding model selection (PursuitAgent)
  4. The claim-verification cost profile, stage by stage (PursuitAgent)