marchwarden/docs/stress-tests/M3.3-runs/02-factual.log
Jeff Smith 13215d7ddb docs(stress-tests): M3.3 Phase A — calibration data collection
Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.
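
The collection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual scripts/calibration_collect.py: the `NN-category.log` naming pattern is inferred from the log name `02-factual.log`, and the helper names (`category_from_log_name`, `worksheet_row`) and the worksheet column order are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern: run logs are named "<index>-<category>.log",
# e.g. "02-factual.log". The real runner may use a different scheme.
_LOG_NAME = re.compile(r"^(\d+)-(?P<category>[a-z-]+)\.log$")


def category_from_log_name(path: str) -> str:
    """Recover the query category from a run-log filename (hypothetical helper)."""
    match = _LOG_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"unrecognised run-log name: {path}")
    return match.group("category")


def worksheet_row(trace_id: str, category: str, question: str, confidence: float) -> str:
    """Format one markdown table row, leaving actual_rating empty for the human rater."""
    # Escape pipes so question text cannot break the markdown table.
    safe_question = question.replace("|", "\\|")
    return f"| {trace_id} | {category} | {safe_question} | {confidence:.2f} | |"
```

Under these assumptions, `category_from_log_name("02-factual.log")` returns `"factual"`, and hyphenated categories such as `contradiction-prone` are preserved whole because only the leading numeric index is stripped.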

Note: M3.1's 4 runs predate #54 (full result persistence), so they
cannot be added to the worksheet: only post-#54 runs have a result.json
sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:21:47 -06:00


Researching: When did the James Webb Space Telescope launch?
{"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:06.289350Z"}
{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:07.051309Z"}
{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:07.061145Z"}
{"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:07.098980Z"}
{"step": 1, "decision": "Beginning research: depth=balanced", "question": "When did the James Webb Space Telescope launch?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099569Z"}
{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099732Z"}
{"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1050, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:15.512242Z"}
{"step": 8, "decision": "Starting iteration 3/5", "tokens_so_far": 5418, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:18.749199Z"}
{"step": 10, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 6, "iterations_run": 3, "tokens_used": 11453, "event": "synthesis_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:28.069780Z"}
{"step": 11, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24998, "event": "synthesis_complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.942803Z"}
{"step": 20, "decision": "Research complete", "confidence": 0.99, "citation_count": 5, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 47.037, "event": "complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.943609Z"}
{"confidence": 0.99, "citations": 5, "gaps": 1, "discovery_events": 2, "tokens_used": 19708, "iterations_run": 3, "wall_time_sec": 44.843754529953, "budget_exhausted": false, "event": "research_completed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:51.943716Z"}
{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:51.944100Z"}
{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:51.947937Z"}
{"trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "confidence": 0.99, "citations": 5, "tokens_used": 19708, "wall_time_sec": 44.843754529953, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.133972Z"}
╭─────────────────────────────────── Answer ───────────────────────────────────╮
│ The James Webb Space Telescope (JWST) launched on December 25, 2021, at │
│ 12:20 UTC (7:20 AM ET) aboard an Arianespace Ariane 5 ECA+ rocket (Flight │
│ VA256) from the Guiana Space Centre (ELA-3) in Kourou, French Guiana. It │
│ entered service on July 12, 2022. │
╰──────────────────────────────────────────────────────────────────────────────╯
Citations
┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃
┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 1 │ James Webb Space Telescope - │ Launch date: 25 December 2021 │ 0.99 │
│ │ Wikipedia │ (2021-12-25), 12:20 UTC | │ │
│ │ https://en.wikipedia.org/wiki │ Rocket: Ariane 5 ECA+ (S/N │ │
│ │ /James_Webb_Space_Telescope │ 5113, Flight VA256) | Launch │ │
│ │ │ site: Guiana, ELA-3 | │ │
│ │ │ Contractor: Arianespace | │ │
│ │ │ Entered service: 12 July 2022 │ │
├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
│ 2 │ The Launch of the James Webb │ On December 25, 2021, and 7:20 │ 0.98 │
│ │ Space Telescope - YouTube │ AM ET (12:20 UTC), the James │ │
│ │ https://www.youtube.com/watch │ Webb Space Telescope was │ │
│ │ ?v=9tXlqWldVVk │ launched by an ArianeSpace │ │
│ │ │ Ariane 5 rocket from │ │
├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
│ 3 │ James Webb Space Telescope │ The launch date was Saturday, │ 0.97 │
│ │ (JWST) Mission (Ariane 5) - │ December 25, 2021 at 12:20 PM │ │
│ │ RocketLaunch.Live │ (UTC). │ │
│ │ https://www.rocketlaunch.live │ │ │
│ │ /launch/jwst │ │ │
├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
│ 4 │ James Webb Space Telescope │ JWST's launch date was │ 0.95 │
│ │ College of Science │ December 25 from Europe's │ │
│ │ https://science.utah.edu/news │ Spaceport in Kourou, French │ │
│ │ /james-webb-space-telescope/ │ Guiana. Longtime fans of the │ │
│ │ │ telescope are celebrating it │ │
│ │ │ as a Christmas miracle. │ │
├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
│ 5 │ NASA's James Webb Space │ Liftoff is at 7:20 a.m. EST │ 0.90 │
│ │ Telescope officially set to │ (1220 GMT). │ │
│ │ launch Dec. 24 | Space │ │ │
│ │ https://www.space.com/james-w │ │ │
│ │ ebb-space-telescope-launch-da │ │ │
│ │ te-confirmed │ │ │
└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
Gaps
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Category ┃ Topic ┃ Detail ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ contradictory_sources │ Space.com headline │ The Space.com article │
│ │ discrepancy │ headline references Dec. │
│ │ │ 24, which was the │
│ │ │ announced/planned launch │
│ │ │ date at time of │
│ │ │ publication, while the │
│ │ │ actual launch occurred on │
│ │ │ Dec. 25, 2021. This is a │
│ │ │ pre-launch announcement │
│ │ │ artifact, not a true │
│ │ │ contradiction, and all │
│ │ │ other sources confirm │
│ │ │ Dec. 25. │
└───────────────────────┴──────────────────────────┴───────────────────────────┘
Discovery Events
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ ┃ Suggested ┃ ┃ ┃
┃ Type ┃ Researcher ┃ Query ┃ Reason ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ related_research │ null │ James Webb Space │ JWST entered │
│ │ │ Telescope first │ service on July │
│ │ │ science results │ 12, 2022; │
│ │ │ July 2022 │ understanding its │
│ │ │ │ early science │
│ │ │ │ results provides │
│ │ │ │ context for its │
│ │ │ │ operational │
│ │ │ │ impact. │
├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
│ related_research │ null │ JWST launch │ The telescope was │
│ │ │ delays history │ originally │
│ │ │ original 2007 │ planned to launch │
│ │ │ launch plan │ in 2007 but faced │
│ │ │ │ decades of │
│ │ │ │ delays, making │
│ │ │ │ the history of │
│ │ │ │ its development │
│ │ │ │ noteworthy. │
└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
Open Questions
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Priority ┃ Question ┃ Context ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ medium │ What were the key milestones │ Wikipedia notes the telescope │
│ │ after JWST's launch during its │ entered service on July 12, │
│ │ commissioning phase before │ 2022, approximately six months │
│ │ entering service on July 12, │ after its December 25, 2021 │
│ │ 2022? │ launch, suggesting a lengthy │
│ │ │ commissioning process. │
├──────────┼─────────────────────────────────┼─────────────────────────────────┤
│ low │ What caused JWST's launch to │ Space.com's article was titled │
│ │ slip from December 24 to │ with a Dec. 24 launch date, but │
│ │ December 25, 2021? │ the actual launch occurred on │
│ │ │ Dec. 25, suggesting a │
│ │ │ last-minute slip. │
├──────────┼─────────────────────────────────┼─────────────────────────────────┤
│ medium │ How does JWST's actual mission │ Wikipedia lists a 10-year │
│ │ performance compare to its │ planned and 20-year expected │
│ │ planned 10-year operational │ life; precise launch trajectory │
│ │ lifespan given its fuel │ reportedly left more fuel than │
│ │ efficiency during launch? │ expected, potentially extending │
│ │ │ the mission. │
└──────────┴─────────────────────────────────┴─────────────────────────────────┘
╭───────────────────────────────── Confidence ─────────────────────────────────╮
│ Overall: 0.99 │
│ Corroborating sources: 5 │
│ Source authority: high │
│ Contradiction detected: False │
│ Query specificity match: 1.00 │
│ Budget status: under cap │
│ Recency: current │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────── Cost ────────────────────────────────────╮
│ Tokens: 19708 │
│ Iterations: 3 │
│ Wall time: 44.84s │
│ Model: claude-sonnet-4-6 │
╰──────────────────────────────────────────────────────────────────────────────╯
trace_id: 91e87d05-6d23-4377-af13-270a8cf701e2