marchwarden/docs/stress-tests/M3.3-runs
Jeff Smith 13215d7ddb docs(stress-tests): M3.3 Phase A — calibration data collection
Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.
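
The category recovery and row emission described for calibration_collect.py can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the regex, and the worksheet column layout are assumptions. Only the NN-category.log naming and the empty actual_rating column come from this commit.

```python
import re
from pathlib import Path

# Assumed pattern: two-digit run number, hyphen, category, ".log"
# (matches names such as 01-factual.log or 16-scope.log).
LOG_NAME = re.compile(r"^(\d{2})-([a-z]+)\.log$")


def category_from_log(name: str) -> tuple[int, str]:
    """Return (run number, category) parsed from a run-log filename."""
    match = LOG_NAME.match(Path(name).name)
    if match is None:
        raise ValueError(f"unrecognized run-log name: {name!r}")
    return int(match.group(1)), match.group(2)


def worksheet_row(log_name: str, question: str) -> str:
    """Format one markdown table row with an empty actual_rating cell.

    The column order (run, category, question, actual_rating) is hypothetical.
    """
    run, category = category_from_log(log_name)
    # Escape pipes so question text cannot break the table layout.
    safe_question = question.replace("|", "\\|")
    return f"| {run:02d} | {category} | {safe_question} |  |"
```

Keeping the category in the filename means the collector does not need a separate manifest mapping runs to categories, at the cost of depending on the naming convention staying stable.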

Note: M3.1's 4 runs predate #54 (full result persistence), so they
cannot be recovered into the worksheet; only post-#54 runs have a
result.json sibling. 22 rateable runs is still within the milestone
target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:21:47 -06:00
01-factual.log
02-factual.log
03-factual.log
04-factual.log
05-factual.log
06-comparative.log
07-comparative.log
08-comparative.log
09-comparative.log
10-comparative.log
11-contradiction.log
12-contradiction.log
13-contradiction.log
14-contradiction.log
15-contradiction.log
16-scope.log
17-scope.log
18-scope.log
19-scope.log
20-scope.log