Closes #54.
The JSONL trace previously stored only counts on the `complete` event
(gap_count, citation_count, discovery_count). Replay could re-render the
step log but could not recover which gaps fired or which sources were
cited, blocking M3.2/M3.3 stress-testing and calibration work.
Two complementary fixes:
1. (a) TraceLogger.write_result() dumps the pydantic ResearchResult to
`<trace_id>.result.json` next to the JSONL trace. The agent calls it
right before emitting the `complete` step. `cli replay` now loads the
sibling result file when present and renders the structured tables
under the trace step log.
2. (b) The agent emits one `gap_recorded`, `citation_recorded`, or
`discovery_recorded` trace event per item from the final result. This
gives the JSONL stream a queryable timeline of what was kept, with
categories and topics in-band, without needing to load the result
sibling.
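A minimal sketch of how (a) and (b) might fit together. The dataclass
is a stand-in for the pydantic ResearchResult so the example runs on
the standard library alone; its field names, the item keys, and the
finish() helper are assumptions for illustration, not the actual
agent code.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


# Stand-in for the pydantic ResearchResult; field names are assumed.
@dataclass
class ResearchResult:
    answer: str
    gaps: list = field(default_factory=list)
    citations: list = field(default_factory=list)
    discoveries: list = field(default_factory=list)


class TraceLogger:
    def __init__(self, trace_dir: Path, trace_id: str) -> None:
        self.trace_path = trace_dir / f"{trace_id}.jsonl"
        # Sibling file that sits next to the JSONL trace.
        self.result_path = trace_dir / f"{trace_id}.result.json"

    def log_step(self, action: str, **fields) -> None:
        # Append one JSON object per line to the trace.
        with self.trace_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"action": action, **fields}) + "\n")

    def write_result(self, result: ResearchResult) -> None:
        # (a) Dump the full structured result beside the trace.
        payload = json.dumps(asdict(result), indent=2)
        self.result_path.write_text(payload, encoding="utf-8")


def finish(logger: TraceLogger, result: ResearchResult) -> None:
    """Hypothetical end-of-run hook: per-item events, result file, complete."""
    # (b) One trace event per kept item, with its details in-band.
    for gap in result.gaps:
        logger.log_step("gap_recorded", **gap)
    for citation in result.citations:
        logger.log_step("citation_recorded", **citation)
    for discovery in result.discoveries:
        logger.log_step("discovery_recorded", **discovery)
    # (a) Write the sibling file right before the terminal step.
    logger.write_result(result)
    logger.log_step(
        "complete",
        gap_count=len(result.gaps),
        citation_count=len(result.citations),
        discovery_count=len(result.discoveries),
    )
```

With this shape, a consumer that only reads the JSONL can still filter
on the *_recorded actions, while replay can load the sibling file for
the full tables.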
Tests: 4 added (127 total passing). Smoke-tested live with a real ask;
both files written and replay rendering verified.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TraceLogger now tracks monotonic start times for starter actions
(web_search, fetch_url, synthesis_start, start) and attaches a
duration_ms field to the matching completer (web_search_complete,
fetch_url_complete, synthesis_complete, synthesis_error). The
terminal 'complete' step gets total_duration_sec instead.
Pairings are tightly sequential in the agent code (each
_execute_tool call runs start→end before returning), so a simple
dict keyed by starter name suffices — no queueing needed. An
unpaired completer leaves duration unset and does not crash.
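The pairing logic can be sketched roughly as below. StepTimer and
annotate() are hypothetical names, and pairing synthesis_error with
synthesis_start (and complete with start) is an assumption read off
the lists above rather than a copy of the real TraceLogger.

```python
import time

# Completer -> starter pairs; the exact mapping is assumed.
_STARTER_FOR = {
    "web_search_complete": "web_search",
    "fetch_url_complete": "fetch_url",
    "synthesis_complete": "synthesis_start",
    "synthesis_error": "synthesis_start",
}
_STARTERS = set(_STARTER_FOR.values()) | {"start"}


class StepTimer:
    """Tracks monotonic start times in a dict keyed by starter name."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can use a fake clock
        self._started = {}

    def annotate(self, action, fields):
        """Return a copy of fields with timing added where a pair matches."""
        out = dict(fields)
        now = self._clock()
        if action in _STARTERS:
            self._started[action] = now
        elif action in _STARTER_FOR:
            began = self._started.pop(_STARTER_FOR[action], None)
            # An unpaired completer simply gets no duration_ms.
            if began is not None:
                out["duration_ms"] = round((now - began) * 1000)
        elif action == "complete":
            began = self._started.get("start")
            if began is not None:
                out["total_duration_sec"] = round(now - began, 3)
        return out
```

A single slot per starter is enough only because calls do not overlap;
concurrent tool calls would need a per-call key instead.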
Durations flow into both the JSONL trace and the structlog
operational log, so OpenSearch queries can filter / aggregate
by step latency without cross-row joins.
Verified end-to-end on a real shallow query:

  web_search            5,233 ms
  web_search            3,006 ms
  synthesis_complete   27,658 ms
  complete             47.547 s total

Synthesis is by far the slowest step — visible at a glance
for the first time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The trace JSONL captures every step of a research call (search,
fetch, iteration boundaries, synthesis), but the structured
operational log only fired at research_started / research_completed,
giving administrators no real-time visibility into agent progress.
Have TraceLogger.log_step also emit a structlog event using the
same action name, fields, and step counter. trace_id and researcher
are already bound in contextvars by WebResearcher.research, so
every line carries them automatically — no plumbing needed.
Volume control: a curated set of milestone actions logs at INFO
(start, iteration_start, synthesis_start/complete/error,
budget_exhausted, complete). Chatty per-tool actions (web_search,
fetch_url and their *_complete pairs) log at DEBUG. Default
MARCHWARDEN_LOG_LEVEL=INFO shows ~9 lines per call;
MARCHWARDEN_LOG_LEVEL=DEBUG shows everything.
This keeps dev stderr readable while making full step visibility
one env var away — and OpenSearch can ingest at DEBUG always.
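The level routing can be sketched as follows. This uses the standard
library logging module as a stand-in for structlog so it runs without
dependencies; level_for() and configured_level() are hypothetical
helper names, and the fallback to INFO on an unrecognised value is an
assumption.

```python
import logging
import os

# Milestone actions that log at INFO, as listed above.
MILESTONE_ACTIONS = frozenset({
    "start",
    "iteration_start",
    "synthesis_start",
    "synthesis_complete",
    "synthesis_error",
    "budget_exhausted",
    "complete",
})

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def level_for(action: str) -> int:
    """INFO for milestones, DEBUG for chatty per-tool actions."""
    return logging.INFO if action in MILESTONE_ACTIONS else logging.DEBUG


def configured_level(env=os.environ) -> int:
    """Read MARCHWARDEN_LOG_LEVEL, defaulting to INFO."""
    name = env.get("MARCHWARDEN_LOG_LEVEL", "INFO").upper()
    return _LEVELS.get(name, logging.INFO)


def emit(log: logging.Logger, action: str, step: int, **fields) -> None:
    # Same action name, step counter, and fields as the trace row.
    log.log(level_for(action), "%s step=%d %s", action, step, fields)
```

Keeping the milestone set in one place means promoting an action to
INFO is a one-line change.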
Verified end-to-end: the Utah peak query produces 9 milestone log
lines at INFO and 13 at DEBUG.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>