docs(stress-tests): M3.3 Phase A — calibration data collection #59

Merged
claude-code merged 1 commit from feat/m3.3-collection into main 2026-04-09 02:22:08 +00:00
Collaborator

Phase A only of #46 (data collection + tooling). Phase B (human rating) and Phase C (analysis + rubric + wiki update) ship later. Issue stays open.

What's in this PR

  • scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries across 4 categories.
  • scripts/calibration_collect.py — loads *.result.json files and generates the rating worksheet.
  • docs/stress-tests/M3.3-rating-worksheet.md — 22 rateable runs (20 calibration + caffeine smoke + M3.2 multi-axis), empty actual_rating column ready for human review.
  • docs/stress-tests/M3.3-runs/*.log — runtime logs as provenance.
  • .gitignore — exception for stress-test logs.
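
As a rough illustration of the collection step, the sketch below loads `*.result.json` files from a directory and renders a markdown table with a blank `actual_rating` column. It is not the code in `scripts/calibration_collect.py`: the JSON keys (`confidence`, `citations`), the `_source` bookkeeping key, the column layout, and both function names are assumptions made for the example.

```python
import json
from pathlib import Path


def load_results(trace_dir: Path) -> list[dict]:
    """Load every *.result.json file under trace_dir, sorted by filename."""
    results = []
    for path in sorted(trace_dir.glob("*.result.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Hypothetical bookkeeping key: remember which file the row came from.
        data["_source"] = path.name
        results.append(data)
    return results


def build_worksheet(results: list[dict]) -> str:
    """Render one markdown table row per run, leaving actual_rating blank."""
    lines = [
        "| run | confidence | citations | actual_rating |",
        "| --- | --- | --- | --- |",
    ]
    for r in results:
        conf = r.get("confidence")  # assumed key name
        conf_cell = f"{conf:.2f}" if isinstance(conf, (int, float)) else "?"
        n_cites = len(r.get("citations", []))  # assumed to be a list
        lines.append(f"| {r.get('_source', '?')} | {conf_cell} | {n_cites} |  |")
    return "\n".join(lines) + "\n"
```

Keeping loading and rendering in separate functions means the table layout can be checked against hand-written dictionaries without touching the trace directory.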

Notable preview observations (before rating)

Wide confidence spread (0.10 to 0.99) — exactly what calibration needs.

  • All 5 factual queries scored 0.97–0.99 (as expected).
  • Scope-edge queries ranged from 0.42 to 0.82; the agent is hedging, but perhaps not as strongly as these topics warrant.
  • Q17 (screen time) returned confidence=0.10, citations=0, gaps=budget_exhausted(1) — this is the synthesis fallback path firing. Worth investigating during Phase C analysis.
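
One way Phase C could look for other runs resembling Q17 is to filter on zero citations combined with very low confidence. The sketch below is only an illustration: the record layout, the function name, and the 0.10 cutoff are assumptions, and the second sample record is invented purely to exercise the filter.

```python
def flag_fallback_candidates(runs, max_confidence=0.10):
    """Return ids of runs with no citations and confidence at or below the cutoff."""
    flagged = []
    for run in runs:
        if run["citations"] == 0 and run["confidence"] <= max_confidence:
            flagged.append(run["id"])
    return flagged


runs = [
    # Values for Q17 as reported in the observations above.
    {"id": "Q17", "confidence": 0.10, "citations": 0},
    # Hypothetical record, not taken from the worksheet.
    {"id": "example-cited", "confidence": 0.97, "citations": 3},
]

print(flag_fallback_candidates(runs))
```

A flagged run is only a candidate for the synthesis fallback; confirming it would still mean checking the run's `gaps` field and its log.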

Note on M3.1 traces

M3.1's 4 runs predate #54 (full result persistence), so they cannot be recovered into the worksheet. That leaves 22 rateable runs instead of 26, still within the milestone target of 20–30, and shows why #54 mattered.

claude-code added 1 commit 2026-04-09 02:22:03 +00:00
Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are
unrecoverable to the worksheet — only post-#54 runs have a result.json
sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
claude-code merged commit 78f08c92cc into main 2026-04-09 02:22:08 +00:00
Reference: archeious/marchwarden#59