fix(observability): persist full ResearchResult and per-item trace events #56

Merged
claude-code merged 1 commit from feat/trace-full-result into main 2026-04-09 01:27:48 +00:00

Closes #54.

Two complementary fixes for the trace observability gap blocking Phase 3:

**(a) Persist full ResearchResult.** `TraceLogger.write_result()` dumps the pydantic model to `<trace_id>.result.json` next to the JSONL trace. The agent calls it right before the `complete` step. `cli replay` now loads the sibling and renders the structured tables (answer, citations, gaps, discovery_events, open_questions, confidence_factors, cost) under the step log.
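A minimal sketch of the sibling-file write described in (a), not the merged implementation. Only the names `TraceLogger`, `result_path`, and `write_result` come from this PR; the constructor arguments, the `trace_path` property, treating `result_path` as a property, and the duck-typed `model_dump` check are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any


class TraceLogger:
    """Illustrative stand-in for the real logger; the constructor is hypothetical."""

    def __init__(self, trace_dir: Path, trace_id: str) -> None:
        self.trace_dir = Path(trace_dir)
        self.trace_id = trace_id

    @property
    def trace_path(self) -> Path:
        # Assumed location of the JSONL step log.
        return self.trace_dir / f"{self.trace_id}.jsonl"

    @property
    def result_path(self) -> Path:
        # Sibling file that sits next to the JSONL trace.
        return self.trace_dir / f"{self.trace_id}.result.json"

    def write_result(self, result: Any) -> Path:
        # Pydantic v2 models expose model_dump(); plain mappings are copied as-is.
        if hasattr(result, "model_dump"):
            payload = result.model_dump(mode="json")
        else:
            payload = dict(result)
        self.trace_dir.mkdir(parents=True, exist_ok=True)
        self.result_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return self.result_path
```

Accepting both a model and a dict mirrors the "`write_result` from pydantic + dict" test case listed under Tests.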

**(b) Per-item trace events.** The agent emits one `gap_recorded` / `citation_recorded` / `discovery_recorded` trace event per item from the final result. The JSONL stream gains a queryable timeline of what was kept, with categories and topics in-band, so category-level analysis does not need to load the sibling file.
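The per-item emission in (b) could look roughly like the sketch below. The three event names are from this PR; the result keys are taken from the table list in (a), while the `event` and `index` field names, the flattened payload, and both helper functions are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator

# Maps an assumed result key to the event name given in the PR.
EVENT_FOR_KEY = {
    "gaps": "gap_recorded",
    "citations": "citation_recorded",
    "discovery_events": "discovery_recorded",
}


def per_item_events(result: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one trace event per gap, citation, and discovery in the result."""
    for key, event_name in EVENT_FOR_KEY.items():
        for index, item in enumerate(result.get(key, [])):
            # Item fields (category, topic, ...) travel in-band with the event;
            # "event" and "index" are written last so an item cannot override them.
            yield {**item, "event": event_name, "index": index}


def to_jsonl(events: Iterable[dict[str, Any]]) -> str:
    """Serialize events as one JSON object per line."""
    return "".join(json.dumps(event) + "\n" for event in events)
```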

Tests

  • 4 added (127 total passing): `TraceLogger.result_path`, `write_result` from pydantic + dict, agent persists result + emits per-item events, replay renders persisted result, replay reports absence cleanly.
  • Smoke-tested live (`ask "What is the half-life of caffeine?" --depth shallow`): both files written, replay renders the result.
claude-code added 1 commit 2026-04-09 01:27:44 +00:00
Closes #54.

The JSONL trace previously stored only counts on the `complete` event
(gap_count, citation_count, discovery_count). Replay could re-render the
step log but could not recover which gaps fired or which sources were
cited, blocking M3.2/M3.3 stress-testing and calibration work.

Two complementary fixes:

1. (a) TraceLogger.write_result() dumps the pydantic ResearchResult to
   `<trace_id>.result.json` next to the JSONL trace. The agent calls it
   right before emitting the `complete` step. `cli replay` now loads the
   sibling result file when present and renders the structured tables
   under the trace step log.

2. (b) The agent emits one `gap_recorded`, `citation_recorded`, or
   `discovery_recorded` trace event per item from the final result. This
   gives the JSONL stream a queryable timeline of what was kept, with
   categories and topics in-band, without needing to load the result
   sibling.
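As an illustration of the "queryable timeline" point, the sketch below counts gaps by category straight from the JSONL lines without opening the result sibling. The `event` and `category` field names are assumptions about the trace schema, and `gap_categories` is a hypothetical helper, not part of the repository.

```python
import json
from collections import Counter
from typing import Iterable


def gap_categories(jsonl_lines: Iterable[str]) -> Counter:
    """Count gap_recorded events per category from raw JSONL trace lines."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        if record.get("event") == "gap_recorded":
            counts[record.get("category", "unknown")] += 1
    return counts
```

Called on an open trace file, for example `gap_categories(open(path, encoding="utf-8"))`, it streams the file line by line.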

Tests: 4 added (127 total passing). Smoke-tested live with a real ask;
both files written and replay rendering verified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
claude-code merged commit f68bbb1052 into main 2026-04-09 01:27:48 +00:00
Reference: archeious/marchwarden#56