docs(stress-tests): M3.3 Phase A — calibration data collection #59
Phase A only of #46 (data collection + tooling). Phase B (human rating) and Phase C (analysis + rubric + wiki update) ship later. Issue stays open.
What's in this PR
scripts/calibration_runner.sh: runs 20 fixed balanced-depth queries across 4 categories.
scripts/calibration_collect.py: loads *.result.json files and generates the rating worksheet.
docs/stress-tests/M3.3-rating-worksheet.md: 22 rateable runs (20 calibration + caffeine smoke + M3.2 multi-axis), empty actual_rating column ready for human review.
docs/stress-tests/M3.3-runs/*.log: runtime logs as provenance.
.gitignore: exception for stress-test logs.
Notable preview observations (before rating)
Wide confidence spread (0.10 to 0.99) — exactly what calibration needs.
confidence=0.10, citations=0, gaps=budget_exhausted(1): this is the synthesis fallback path firing. Worth investigating during Phase C analysis.
Note on M3.1 traces
M3.1's 4 runs predate #54 (full result persistence), so they cannot be recovered for the worksheet. That leaves 22 rateable runs instead of 26, still within the milestone target of 20–30, and shows why #54 mattered.
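For reviewers who want a picture of the collection step without opening the script, here is a minimal sketch of how *.result.json files could be loaded and turned into worksheet rows with an empty actual_rating column. It is not the code in scripts/calibration_collect.py. The function names, the run_id derived from the file name, and the assumption that each result file carries top-level confidence and citations keys are all hypothetical.

```python
import json
from pathlib import Path

SUFFIX = ".result.json"


def load_results(run_dir):
    """Load every *.result.json file in run_dir, sorted by file name.

    Assumption: each file is a JSON object. The real schema may differ.
    """
    results = []
    for path in sorted(Path(run_dir).glob("*" + SUFFIX)):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Hypothetical: use the file name (minus suffix) as the run id.
        data["run_id"] = path.name[: -len(SUFFIX)]
        results.append(data)
    return results


def worksheet_rows(results):
    """Render a Markdown table whose actual_rating column is left empty."""
    lines = [
        "| run_id | confidence | citations | actual_rating |",
        "| --- | --- | --- | --- |",
    ]
    for r in results:
        # Assumed keys: "confidence" (float) and "citations" (int).
        lines.append(
            f"| {r['run_id']} | {r.get('confidence', 0.0):.2f} "
            f"| {r.get('citations', 0)} |  |"
        )
    return "\n".join(lines)
```

Runs that predate result persistence have no *.result.json file, so a loader like this simply never sees them, which matches the M3.1 gap noted above.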