marchwarden

archeious/marchwarden

Fork 0

Commit graph

Author	SHA1	Message	Date
Jeff Smith	1203b07248	fix(observability): persist full ResearchResult and per-item trace events Closes #54. The JSONL trace previously stored only counts on the `complete` event (gap_count, citation_count, discovery_count). Replay could re-render the step log but could not recover which gaps fired or which sources were cited, blocking M3.2/M3.3 stress-testing and calibration work. Two complementary fixes: 1. (a) TraceLogger.write_result() dumps the pydantic ResearchResult to `<trace_id>.result.json` next to the JSONL trace. The agent calls it right before emitting the `complete` step. `cli replay` now loads the sibling result file when present and renders the structured tables under the trace step log. 2. (b) The agent emits one `gap_recorded`, `citation_recorded`, or `discovery_recorded` trace event per item from the final result. This gives the JSONL stream a queryable timeline of what was kept, with categories and topics in-band, without needing to load the result sibling. Tests: 4 added (127 total passing). Smoke-tested live with a real ask; both files written and replay rendering verified. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:27:33 -06:00
Jeff Smith	ddaf7e85c3	Record per-step durations in trace and operational logs (#35 ) TraceLogger now tracks monotonic start times for starter actions (web_search, fetch_url, synthesis_start, start) and attaches a duration_ms field to the matching completer (web_search_complete, fetch_url_complete, synthesis_complete, synthesis_error). The terminal 'complete' step gets total_duration_sec instead. Pairings are tightly sequential in the agent code (each _execute_tool call runs start→end before returning), so a simple dict keyed by starter name suffices — no queueing needed. An unpaired completer leaves duration unset and does not crash. Durations flow into both the JSONL trace and the structlog operational log, so OpenSearch queries can filter / aggregate by step latency without cross-row joins. Verified end-to-end on a real shallow query: web_search 5,233 ms web_search 3,006 ms synthesis_complete 27,658 ms complete 47.547 s total Synthesis is by far the slowest step — visible at a glance for the first time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:49:52 -06:00
Jeff Smith	cef08c8984	M1.2: Trace logger with tests TraceLogger produces JSONL audit logs per research() call: - One file per trace_id at ~/.marchwarden/traces/{trace_id}.jsonl - Each line is a self-contained JSON object (step, action, timestamp, decision) - Supports arbitrary kwargs (url, content_hash, query, etc.) - Lazy file handle, flush after each write, context manager support - read_entries() for replay and testing 15 tests: file creation, step counting, JSONL validity, kwargs, timestamps, flush behavior, multiple independent traces. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:21:10 -06:00

Author

SHA1

Message

Date

Jeff Smith

1203b07248

fix(observability): persist full ResearchResult and per-item trace events

Closes #54.

The JSONL trace previously stored only counts on the `complete` event
(gap_count, citation_count, discovery_count). Replay could re-render the
step log but could not recover which gaps fired or which sources were
cited, blocking M3.2/M3.3 stress-testing and calibration work.

Two complementary fixes:

1. (a) TraceLogger.write_result() dumps the pydantic ResearchResult to
   `<trace_id>.result.json` next to the JSONL trace. The agent calls it
   right before emitting the `complete` step. `cli replay` now loads the
   sibling result file when present and renders the structured tables
   under the trace step log.

2. (b) The agent emits one `gap_recorded`, `citation_recorded`, or
   `discovery_recorded` trace event per item from the final result. This
   gives the JSONL stream a queryable timeline of what was kept, with
   categories and topics in-band, without needing to load the result
   sibling.

Tests: 4 added (127 total passing). Smoke-tested live with a real ask;
both files written and replay rendering verified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 19:27:33 -06:00

Jeff Smith

ddaf7e85c3

Record per-step durations in trace and operational logs (#35 )

TraceLogger now tracks monotonic start times for starter actions
(web_search, fetch_url, synthesis_start, start) and attaches a
duration_ms field to the matching completer (web_search_complete,
fetch_url_complete, synthesis_complete, synthesis_error). The
terminal 'complete' step gets total_duration_sec instead.

Pairings are tightly sequential in the agent code (each
_execute_tool call runs start→end before returning), so a simple
dict keyed by starter name suffices — no queueing needed. An
unpaired completer leaves duration unset and does not crash.

Durations flow into both the JSONL trace and the structlog
operational log, so OpenSearch queries can filter / aggregate
by step latency without cross-row joins.

Verified end-to-end on a real shallow query:
  web_search       5,233 ms
  web_search       3,006 ms
  synthesis_complete 27,658 ms
  complete         47.547 s total

Synthesis is by far the slowest step — visible at a glance
for the first time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 16:49:52 -06:00

Jeff Smith

cef08c8984

M1.2: Trace logger with tests

TraceLogger produces JSONL audit logs per research() call:
- One file per trace_id at ~/.marchwarden/traces/{trace_id}.jsonl
- Each line is a self-contained JSON object (step, action, timestamp, decision)
- Supports arbitrary kwargs (url, content_hash, query, etc.)
- Lazy file handle, flush after each write, context manager support
- read_entries() for replay and testing

15 tests: file creation, step counting, JSONL validity, kwargs,
timestamps, flush behavior, multiple independent traces.

Refs: archeious/marchwarden#1

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2026-04-08 14:21:10 -06:00

3 commits