Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).
Adds the data-collection half of the calibration milestone:
- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
across 4 categories (factual, comparative, contradiction-prone,
scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
under ~/.marchwarden/traces/*.result.json and emits a markdown rating
worksheet with one row per run. Recovers question text from each
trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
+ caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
runner, kept as provenance. Gitignore updated with an exception
carving stress-test logs out of the global *.log ignore.
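For reviewers, the collection step can be pictured roughly as below. This is a minimal sketch and not the actual contents of scripts/calibration_collect.py: the helper names `load_results` and `render_worksheet`, the `trace_id` column, and deriving the id from the filename are assumptions made for illustration. Only the `*.result.json` glob and the empty `actual_rating` column come from this change, and the real script also recovers question text and category, which the sketch omits.

```python
import json
from pathlib import Path


def load_results(trace_dir: Path) -> list[dict]:
    """Load every *.result.json file under trace_dir, sorted by filename."""
    rows = []
    for path in sorted(trace_dir.glob("*.result.json")):
        with path.open(encoding="utf-8") as fh:
            result = json.load(fh)
        # Assumption: the trace id is the filename minus the ".result.json" suffix.
        trace_id = path.name[: -len(".result.json")]
        rows.append({"trace_id": trace_id, "result": result})
    return rows


def render_worksheet(rows: list[dict]) -> str:
    """Emit a markdown table with one row per run and a blank rating cell."""
    lines = [
        "| trace_id | actual_rating |",
        "| --- | --- |",
    ]
    for row in rows:
        # actual_rating is left empty for the human rating step.
        lines.append(f"| {row['trace_id']} |  |")
    return "\n".join(lines) + "\n"
```

A call such as `render_worksheet(load_results(Path.home() / ".marchwarden" / "traces"))` would then produce the worksheet text.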
Note: M3.1's 4 runs predate #54 (full result persistence), so they
cannot be added to the worksheet: only post-#54 runs have a result.json
sibling. The 22 rateable runs are still within the milestone target of 20–30.
Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mcp SDK's StdioServerParameters does not pass the parent
process's environment to the spawned server by default, so env
vars set on the CLI process (notably MARCHWARDEN_MODEL) were
silently dropped on the way to the researcher.
Pass env=os.environ.copy() to StdioServerParameters so the server
sees the same environment as the CLI. Also update scripts/docker-test.sh
to forward MARCHWARDEN_MODEL into the container and to detect a
non-TTY parent so non-interactive `ask` invocations don't fail with
"the input device is not a TTY".
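The environment fix can be sketched as follows. This is an illustration under stated assumptions, not the code in this commit: `stdio_server_kwargs` is a hypothetical helper, and it returns plain keyword arguments so the snippet runs without the mcp SDK installed. The `command`, `args`, and `env` names mirror the fields of `StdioServerParameters`.

```python
import os


def stdio_server_kwargs(command: str, args: list[str]) -> dict:
    """Build launch arguments for a stdio server that inherit the parent env."""
    return {
        "command": command,
        "args": args,
        # Snapshot the environment so variables such as MARCHWARDEN_MODEL
        # reach the spawned server; the copy is independent of os.environ.
        "env": os.environ.copy(),
    }
```

With the SDK available, the result would be passed as `StdioServerParameters(**stdio_server_kwargs(command, args))`, where `command` and `args` are whatever the CLI already uses to launch the researcher.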
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a reproducible Python 3.12-slim container that installs the project
in editable mode with dev deps. Adds pytest-asyncio to dev deps so async tests
run cleanly inside the container (host had it installed out-of-band).
scripts/docker-test.sh provides build, test, ask, replay, and shell
subcommands. The ask/replay/shell commands mount ~/secrets read-only
and ~/.marchwarden read-write so end-to-end runs persist traces back
to the host.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>