Issue #46 (Phase A only; Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh: runs 20 fixed balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py: loads every persisted ResearchResult under ~/.marchwarden/traces/*.result.json and emits a markdown rating worksheet with one row per run. Recovers question text from each trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md: 22 runs (20 calibration + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log: runtime logs from the calibration runner, kept as provenance. Gitignore updated with an exception carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are unrecoverable to the worksheet; only post-#54 runs have a result.json sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
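
The collection step described in the commit message could look roughly like the sketch below. It is not the contents of scripts/calibration_collect.py. The function name `build_worksheet`, the `categories` mapping (trace_id to category, which the real script is said to derive from run-log filenames), the `confidence` field, and the column set other than actual_rating are all assumptions made for illustration.

```python
import json
from pathlib import Path


def build_worksheet(trace_dir: Path, categories: dict[str, str]) -> str:
    """Return a markdown table with one row per *.result.json file in trace_dir.

    `categories` maps trace_id -> category; it is a hypothetical stand-in for
    the lookup the real collector performs against the run-log filenames.
    """
    header = "| trace_id | category | confidence | actual_rating |"
    divider = "| --- | --- | --- | --- |"
    rows = []
    for path in sorted(trace_dir.glob("*.result.json")):
        # The trace id is taken to be the filename without the ".result.json" suffix.
        trace_id = path.name.removesuffix(".result.json")
        result = json.loads(path.read_text(encoding="utf-8"))
        # "confidence" is a hypothetical field name, not confirmed by the source.
        confidence = result.get("confidence", "")
        category = categories.get(trace_id, "unknown")
        # actual_rating is left empty for the human rating step.
        rows.append(f"| {trace_id} | {category} | {confidence} | |")
    return "\n".join([header, divider, *rows])
```

Calling `build_worksheet(Path.home() / ".marchwarden" / "traces", {})` would list every persisted result with the category shown as "unknown". Leaving actual_rating blank keeps the human rating separate from anything the tool recorded.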
55 lines
591 B
Text
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/

# IDEs
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db

# Project-specific
~/.marchwarden/
.env
.env.local
*.log
# Exception: stress test run logs are committed as provenance: they map
# trace_id -> category for the calibration collector script.
!docs/stress-tests/**/*.log

# Tests
.pytest_cache/
.coverage
htmlcov/