Persist full ResearchResult alongside trace (observability gap) #54

Closed

opened 2026-04-08 19:21:00 -06:00 by claude-code · 0 comments

claude-code commented

2026-04-08 19:21:00 -06:00

Collaborator

Discovered during M3.1 stress testing (Issue #44). Blocks M3.2 (#45) and M3.3 (#46).

Problem

The JSONL trace at ~/.marchwarden/traces/<trace_id>.jsonl records step events and final counts only:

{"action":"complete","confidence":0.82,"citation_count":9,"gap_count":5,"discovery_count":4}

The actual ResearchResult payload — gap categories, citations, discovery_events, open_questions, confidence_factors, synthesized answer — is never persisted. It exists only in the terminal output of the run.

cli.main replay <trace_id> re-renders trace events but cannot reconstruct the result, because the result was never written.

Why this blocks Phase 3

M3.1 (just done): I had to tee the Q2/Q3/Q4 output to recover gap categories. Q1's gap categories are lost: the trace says "5 gaps fired" but not which ones.
M3.2 (#45) multi-axis stress tests: the analysis is "across N runs, how often did each gap category fire?" Impossible from counts alone.
M3.3 (#46) calibration: needs per-run structured results to correlate confidence with actual gap mix.

Fix

Write the final ResearchResult to a sibling file, ~/.marchwarden/traces/<trace_id>.result.json, serialized with pydantic's .model_dump_json(indent=2). Do this in researchers/web/agent.py at the point where the complete step event is currently emitted.
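A minimal sketch of the write step, under stated assumptions: the helper name `write_result_sidecar`, the `SupportsModelDumpJson` protocol, and the temp-file-then-rename step are illustrative and not taken from the codebase. The only thing it relies on from the issue is that the result object exposes pydantic v2's `.model_dump_json(indent=2)`.

```python
from __future__ import annotations

from pathlib import Path
from typing import Protocol


class SupportsModelDumpJson(Protocol):
    """Anything exposing pydantic v2's model_dump_json, e.g. ResearchResult."""

    def model_dump_json(self, *, indent: int | None = None) -> str: ...


def write_result_sidecar(
    result: SupportsModelDumpJson, trace_id: str, trace_dir: Path
) -> Path:
    """Write <trace_id>.result.json next to <trace_id>.jsonl (hypothetical helper)."""
    trace_dir.mkdir(parents=True, exist_ok=True)
    final_path = trace_dir / f"{trace_id}.result.json"
    # Write to a temp name first so a crash mid-write never leaves a
    # truncated *.result.json for later analysis scripts to trip over.
    tmp_path = trace_dir / f"{trace_id}.result.json.tmp"
    tmp_path.write_text(result.model_dump_json(indent=2), encoding="utf-8")
    tmp_path.replace(final_path)
    return final_path
```

Called right where the `complete` event is emitted, this might look like `write_result_sidecar(result, trace_id, Path.home() / ".marchwarden" / "traces")`. The temp-file rename is an optional design choice; a plain `write_text` to the final path would also satisfy the acceptance criteria.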

Update cli.main replay to load and render the result file when present.
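The "when present" part matters because traces written before this change have no result file. A rough sketch of the lookup, assuming a hypothetical helper `load_result_if_present` (the real replay code and its rendering functions are not shown in this issue); it returns the raw JSON as a dict, which the caller could pass to `ResearchResult.model_validate` before rendering:

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_result_if_present(trace_id: str, trace_dir: Path) -> dict[str, Any] | None:
    """Return the parsed <trace_id>.result.json, or None for older traces."""
    result_path = trace_dir / f"{trace_id}.result.json"
    if not result_path.is_file():
        # Trace predates result persistence: replay falls back to the step log only.
        return None
    return json.loads(result_path.read_text(encoding="utf-8"))
```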

Acceptance

After any ask run, both <trace_id>.jsonl and <trace_id>.result.json exist.
replay <trace_id> shows the rendered result tables (gaps, citations, etc.) in addition to the step log.
A simple analysis script can glob ~/.marchwarden/traces/*.result.json, deserialize each into ResearchResult, and count gap categories across runs.
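The analysis script in the last criterion could be as small as the sketch below. It reads the files with the standard `json` module rather than importing `ResearchResult`, and it assumes a payload shape that this issue does not specify: a top-level `gaps` list whose items each carry a `category` string. Both field names are hypothetical and should be adjusted to the actual model.

```python
import json
from collections import Counter
from pathlib import Path


def count_gap_categories(trace_dir: Path) -> Counter:
    """Tally gap categories across every *.result.json in trace_dir."""
    counts: Counter = Counter()
    for path in sorted(trace_dir.glob("*.result.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        # ASSUMPTION: "gaps" and "category" are placeholder field names.
        for gap in payload.get("gaps", []):
            counts[gap["category"]] += 1
    return counts


if __name__ == "__main__":
    traces = Path.home() / ".marchwarden" / "traces"
    for category, n in count_gap_categories(traces).most_common():
        print(f"{category}\t{n}")
```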

Out of scope

Per-event gap/citation logging during the run (timeline analysis). Can come later if needed.