Issue #46 (Phase A only; Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh: runs 20 fixed balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py: loads every persisted ResearchResult under ~/.marchwarden/traces/*.result.json and emits a markdown rating worksheet with one row per run. Recovers question text from each trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md: 22 runs (20 calibration + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log: runtime logs from the calibration runner, kept as provenance. Gitignore updated with an exception carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are unrecoverable to the worksheet; only post-#54 runs have a result.json sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
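The collection step described in the commit message can be pictured roughly as follows. This is a minimal sketch, not the contents of scripts/calibration_collect.py. Several things in it are assumptions made for illustration: the trace events living in a sibling `<trace_id>.jsonl` file, the `<NN>-<category>.log` run-log naming scheme, a top-level `confidence` key in the persisted result, and every function name. The `start` event carrying a `question` field and the trailing `trace_id:` line in each run log do match the log shown below.

```python
import json
from pathlib import Path


def question_from_events(events):
    """Return the question recorded on the trace's first `start` event, or ""."""
    for event in events:
        if event.get("event") == "start":
            return event.get("question") or ""
    return ""


def category_from_log_name(log_name):
    """Recover the category from a name like `03-factual.log` (assumed scheme)."""
    _, _, category = Path(log_name).stem.partition("-")
    return category or "unknown"


def trace_id_from_log(log_text):
    """Find the trailing `trace_id: <uuid>` line that each run log ends with."""
    for line in reversed(log_text.splitlines()):
        if line.startswith("trace_id:"):
            return line.split(":", 1)[1].strip()
    return None


def categories_from_logs(log_dir):
    """Map trace_id -> category by reading every run log in log_dir."""
    mapping = {}
    for log_path in sorted(Path(log_dir).glob("*.log")):
        trace_id = trace_id_from_log(log_path.read_text(encoding="utf-8"))
        if trace_id:
            mapping[trace_id] = category_from_log_name(log_path.name)
    return mapping


def load_events(trace_path):
    """Read one JSON object per line, skipping blank lines (assumed layout)."""
    with open(trace_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def collect(trace_dir, categories):
    """Build one (trace_id, category, question, confidence) row per result file."""
    rows = []
    suffix = ".result.json"
    for result_path in sorted(Path(trace_dir).glob("*" + suffix)):
        trace_id = result_path.name[: -len(suffix)]
        result = json.loads(result_path.read_text(encoding="utf-8"))
        trace_path = result_path.with_name(trace_id + ".jsonl")  # assumed sibling
        events = load_events(trace_path) if trace_path.exists() else []
        rows.append((
            trace_id,
            categories.get(trace_id, "unknown"),
            question_from_events(events),
            float(result.get("confidence", 0.0)),  # assumed top-level key
        ))
    return rows


def render_worksheet(rows):
    """Emit a markdown table with an empty actual_rating column for the rater."""
    lines = [
        "| trace_id | category | question | confidence | actual_rating |",
        "|---|---|---|---|---|",
    ]
    for trace_id, category, question, confidence in rows:
        safe_question = question.replace("|", "\\|")  # keep table cells intact
        lines.append(
            f"| {trace_id} | {category} | {safe_question} | {confidence:.2f} | |"
        )
    return "\n".join(lines)
```

Under these assumptions, a worksheet would come from `render_worksheet(collect(trace_dir, categories_from_logs(log_dir)))`. Runs without a `*.result.json` file never enter the glob, which is consistent with the note that the pre-#54 runs cannot be recovered.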
48 lines
8.5 KiB
Text
Researching: Does screen time harm child development?
{"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:34.721867Z"}
{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:09:35.602647Z"}
{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:35.613025Z"}
{"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:35.653113Z"}
{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does screen time harm child development?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653592Z"}
{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653723Z"}
{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1126, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:45.628661Z"}
{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 10139, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:51.476900Z"}
{"step": 21, "decision": "Token budget reached before iteration 4: 23391/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056368Z"}
{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23391, "event": "synthesis_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056571Z"}
{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74986, "event": "synthesis_complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.739493Z"}
{"step": 24, "decision": "Failed to build ResearchResult: 1 validation error for DiscoveryEvent\nquery\n Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]\n For further information visit https://errors.pydantic.dev/2.12/v/string_type", "event": "synthesis_build_error", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.753603Z"}
{"step": 26, "decision": "Research complete", "confidence": 0.1, "citation_count": 0, "gap_count": 1, "discovery_count": 0, "total_duration_sec": 98.512, "event": "complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.755661Z"}
{"confidence": 0.1, "citations": 0, "gaps": 1, "discovery_events": 0, "tokens_used": 44375, "iterations_run": 3, "wall_time_sec": 95.08588027954102, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:10.755895Z"}
{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:11:10.757071Z"}
{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:10.770530Z"}
{"trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "confidence": 0.1, "citations": 0, "tokens_used": 44375, "wall_time_sec": 95.08588027954102, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.105698Z"}
╭─────────────────────────────────── Answer ───────────────────────────────────╮
│ Research on 'Does screen time harm child development?' completed but         │
│ synthesis failed. 22 sources were gathered.                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
No citations.
Gaps
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Category         ┃ Topic     ┃ Detail                                        ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ budget_exhausted │ synthesis │ The synthesis step failed to produce          │
│                  │           │ structured output.                            │
└──────────────────┴───────────┴───────────────────────────────────────────────┘
╭───────────────────────────────── Confidence ─────────────────────────────────╮
│ Overall: 0.10                                                                │
│ Corroborating sources: 0                                                     │
│ Source authority: low                                                        │
│ Contradiction detected: False                                                │
│ Query specificity match: 0.00                                                │
│ Budget status: spent                                                         │
│ Recency: unknown                                                             │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────── Cost ────────────────────────────────────╮
│ Tokens: 44375                                                                │
│ Iterations: 3                                                                │
│ Wall time: 95.09s                                                            │
│ Model: claude-sonnet-4-6                                                     │
╰──────────────────────────────────────────────────────────────────────────────╯
trace_id: 9c18d570-73d3-4e8a-98bc-7cb1b66c61d2