Field notes

Observability for drafting: traces, logs, and replays

How we debug a bad draft six weeks after the fact. The three-layer observability stack — request traces, retrieval logs, and deterministic replays — that makes post-hoc drafting issues tractable.

The PursuitAgent engineering team · Engineering · 5 min read

A bad draft that ships on Friday does not get investigated until the following Wednesday, when the proposal lead reads the submitted response and notices that one paragraph cites a KB source that does not say what the paragraph says. By Wednesday the session is over, the request has aged out of the hot cache, and whoever debugs it has only the persisted artifacts to work from. The question is whether the persisted artifacts are enough to explain what happened.

This post is about how we answer that question. Three layers: request traces, retrieval logs, and deterministic replays.

Layer one — request traces

Every drafting request opens a trace on our tracing backend. The trace captures the request envelope (user, session, KB scope, model parameters), the steps the agent took (retrieval, grounding check, drafting, citation binding, post-hoc validation), and the timing on each step. Traces are retained for 90 days.
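For concreteness, here is a minimal sketch of what one trace record might carry, written in Python. The field names (kb_scope, model_params, and so on) are illustrative assumptions, not our actual schema; the point is that the request envelope and the per-step spans travel together in one record.

```python
# Illustrative sketch of a trace record; field names are assumptions,
# not the production schema.
from dataclasses import dataclass, field


@dataclass
class StepSpan:
    name: str           # "retrieval", "grounding_check", "drafting", ...
    started_at: float   # epoch seconds
    duration_ms: float
    attributes: dict = field(default_factory=dict)


@dataclass
class DraftTrace:
    request_id: str
    user: str
    session: str
    query: str          # the original drafting request text
    kb_scope: str
    model_params: dict  # model version, temperature, sampling seed, ...
    steps: list[StepSpan] = field(default_factory=list)
```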

The trace is the first thing we look at when a bad draft is reported. The question we are answering is “did the request do what it was supposed to do, or did it take a wrong turn?” Wrong turns show up as missing steps (the grounding check was skipped because a flag was set somewhere), as unexpected branch choices (the retrieval fell back to a wider scope than the user expected), or as unusual timings (the drafting step took four times as long as normal because the agent hit a retry loop).
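A first triage pass can be mechanized against the sketch above: check for missing steps and for steps that ran far outside their baseline. The expected pipeline and the per-step baselines below are illustrative numbers, not measured ones.

```python
# Hypothetical triage helper over the DraftTrace sketch above.
# EXPECTED_STEPS and BASELINE_MS are illustrative assumptions.
EXPECTED_STEPS = ["retrieval", "grounding_check", "drafting",
                  "citation_binding", "post_hoc_validation"]
BASELINE_MS = {"retrieval": 400.0, "grounding_check": 250.0,
               "drafting": 8000.0, "citation_binding": 300.0,
               "post_hoc_validation": 500.0}


def wrong_turns(trace: DraftTrace) -> list[str]:
    seen = {step.name for step in trace.steps}
    findings = [f"missing step: {name}"
                for name in EXPECTED_STEPS if name not in seen]
    for step in trace.steps:
        baseline = BASELINE_MS.get(step.name)
        if baseline and step.duration_ms > 4 * baseline:
            findings.append(f"unusual timing: {step.name} took "
                            f"{step.duration_ms:.0f} ms")
    return findings
```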

Most reported issues resolve at the trace layer. The trace tells you that the request went down a path the user did not expect, and the fix is usually a configuration change. A smaller fraction of issues require going deeper.

Layer two — retrieval logs

When the trace is clean — the request went down the expected path and still produced a bad draft — the next question is about the retrieval. Every retrieval step logs the query, the scope, the top-K chunks that came back, the similarity scores, and which chunks the grounding check accepted.
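As a sketch, one retrieval log entry might look like the following; as before, the names are assumptions rather than the real schema.

```python
# Sketch of one retrieval log record; field names are illustrative.
from dataclasses import dataclass


@dataclass
class RetrievalLogEntry:
    request_id: str
    query: str
    scope: str                       # KB scope the query ran against
    top_k: list[tuple[str, float]]   # (chunk_id, similarity score)
    accepted: list[str]              # chunk ids the grounding check kept
```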

The retrieval log is the diagnostic surface for two common failure modes. The first is a chunk that scored well on similarity but was not what the drafter needed — a KB paragraph that mentioned the right keyword but described the wrong thing. The second is a KB gap: the right chunk was not in the KB, and the grounding check ended up accepting a weaker chunk because nothing better was available.

Both failure modes look the same from the outside (the draft cites a source that does not support the claim) and are fixed differently. The first is a retrieval-quality issue, fixed by improving the embedding model or the re-ranker. The second is a KB-content issue, fixed by adding the missing content. Having the retrieval log tells you which one you are looking at.
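A rough heuristic for a first guess, assuming the RetrievalLogEntry sketch above and an arbitrary similarity threshold (a human still reads the chunks; this only orders the queue):

```python
# Heuristic sketch, not a production classifier. The 0.75 threshold
# is arbitrary; in practice you would calibrate it per embedding model.
def classify_failure(entry: RetrievalLogEntry,
                     threshold: float = 0.75) -> str:
    best_score = max((score for _, score in entry.top_k), default=0.0)
    if best_score >= threshold:
        # A confident match that still misled the drafter points at
        # retrieval quality: right keyword, wrong content.
        return "retrieval-quality"
    # Nothing in scope scored well: the content was likely never there.
    return "kb-gap"
```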

Retrieval logs are retained for 30 days because they are larger than request traces and cheaper to discard. The 30-day window is long enough for most debugging and short enough that the storage cost stays bounded.

Layer three — deterministic replay

The hard cases are the ones where the trace is clean, the retrieval log looks fine, and the draft is still wrong. For those we run a deterministic replay.

A replay re-executes the original request against the preserved inputs: the same query, the same KB state at the time of the original request (we snapshot the KB content store on every change, so we can reconstruct what the KB looked like at any timestamp within the retention window), the same model version, and — this is the part that took the most engineering — the same seeds and temperature on the sampling steps. A well-constructed replay reproduces the original draft byte-for-byte.
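A minimal sketch of the replay entry point, under two assumptions: a hypothetical snapshot_kb_at() that reconstructs the KB content store at a timestamp from those change-level snapshots, and a hypothetical run_pipeline() that accepts a pinned model version and explicit sampling parameters.

```python
# Replay sketch. snapshot_kb_at and run_pipeline are hypothetical
# hooks, injected so the sketch stays self-contained. Determinism
# comes from pinning the query, the KB state, the model version,
# and the sampling seed/temperature.
from typing import Any, Callable


def replay(trace: DraftTrace,
           snapshot_kb_at: Callable[[str, float], Any],
           run_pipeline: Callable[..., str],
           **overrides: Any) -> str:
    # Reconstruct the KB as it looked when the first step ran.
    kb = snapshot_kb_at(trace.kb_scope, trace.steps[0].started_at)
    params = {
        "model_version": trace.model_params["model_version"],
        "seed": trace.model_params["seed"],
        "temperature": trace.model_params["temperature"],
        **overrides,   # perturbations swap exactly one of these
    }
    return run_pipeline(query=trace.query, kb=kb, **params)
```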

Once you have a deterministic replay, you can perturb it. What happens if we swap in a newer model version? What happens if we use a different re-ranker? What happens if we exclude a specific chunk from the retrieval? What happens if we run the grounding check with a stricter threshold? Each perturbation tells you something about the failure.
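Continuing the sketch above, each perturbation is a single-key override on the replay; the override names and the chunk id are hypothetical pipeline parameters, not real ones.

```python
# Perturbation sketch on top of replay(). Each variant changes exactly
# one input; the override names and chunk id are hypothetical.
original = replay(trace, snapshot_kb_at, run_pipeline)
perturbations = {
    "newer model":        {"model_version": "drafter-vNext"},
    "excluded chunk":     {"exclude_chunks": ["kb/chunk-4182"]},
    "stricter grounding": {"grounding_threshold": 0.9},
}
for label, override in perturbations.items():
    variant = replay(trace, snapshot_kb_at, run_pipeline, **override)
    print(label, "->", "diverged" if variant != original else "identical")
```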

Deterministic replays are the most expensive layer — they take a few minutes to set up and a few seconds to run — so we do not use them by default. We use them when the first two layers have not explained the issue and when the stakes warrant the investigation (a bad draft that shipped to a regulated customer, for example, or a pattern of bad drafts that suggests a systemic issue).

What we measure on top of the stack

Two metrics on the dashboard that lean on this observability stack:

Time-to-diagnosis. The median time between a drafting issue being reported and the engineering team identifying the root cause. This was 11 days at the end of Q1 2025 and is now about 2 days. The step change happened when we shipped deterministic replays — before that, the hard cases would sit open until someone had the time to do a manual investigation.

Replay divergence rate. The fraction of replays that do not reproduce the original output byte-for-byte. This should be near zero. When it rises, it tells us that an upstream dependency has changed in a way we were not tracking — a model version rolled, a retrieval parameter was tuned, a chunker was replaced. A nonzero divergence rate is not a bug in the replay system; it is a signal that the observability surface does not cover a new axis of variability.
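The check itself is cheap to express on top of the replay sketch: rerun a sample of requests and count byte-for-byte mismatches against the persisted originals.

```python
# Divergence-rate sketch over the replay() function above.
# original_drafts maps request_id to the persisted output.
def divergence_rate(traces: list[DraftTrace],
                    original_drafts: dict[str, str],
                    snapshot_kb_at, run_pipeline) -> float:
    diverged = sum(
        replay(t, snapshot_kb_at, run_pipeline) != original_drafts[t.request_id]
        for t in traces
    )
    return diverged / len(traces)
```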

What we do not have

We do not have a good answer for observability across a sequence of related drafts — the case where a user generates six drafts in a row, each reacting to the last, and the problem is in the sequence rather than in any single draft. The three-layer stack gives us per-request observability; sequence-level observability is still an open problem.

We have a rough sketch of what that would look like — a session trace that ties together the individual request traces and captures the feedback the user gave between them — and we expect to ship a first version in Q2. Until then, sequence-level debugging is mostly done by eye, which is slow and not great.