marchwarden

archeious/marchwarden

Commit graph

Author	SHA1	Message	Date

Jeff Smith

13215d7ddb

docs(stress-tests): M3.3 Phase A — calibration data collection

Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are
unrecoverable to the worksheet — only post-#54 runs have a result.json
sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 20:21:47 -06:00
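
The collection step described in commit 13215d7ddb (load every `*.result.json` under the traces directory, emit one worksheet row per run with an empty `actual_rating` column) could be sketched roughly as below. This is not the contents of scripts/calibration_collect.py. The function names, the column set, and the `question` key are assumptions made for illustration; per the commit message, the real script recovers the question text from each trace's start event and the category from the run-log filename, and neither step is reproduced here.

```python
import json
from pathlib import Path

RESULT_SUFFIX = ".result.json"


def collect_rows(traces_dir: Path) -> list[dict]:
    """Load every *.result.json in traces_dir, sorted by filename."""
    rows = []
    for path in sorted(traces_dir.glob("*" + RESULT_SUFFIX)):
        result = json.loads(path.read_text(encoding="utf-8"))
        rows.append({
            # Assumption: the trace id is the filename minus the suffix.
            "trace_id": path.name[: -len(RESULT_SUFFIX)],
            # Assumption: "question" is a hypothetical key, used here only
            # to keep the sketch self-contained.
            "question": result.get("question", ""),
        })
    return rows


def render_worksheet(rows: list[dict]) -> str:
    """Render rows as a markdown table with an empty actual_rating column."""
    lines = [
        "| trace_id | question | actual_rating |",
        "| --- | --- | --- |",
    ]
    for row in rows:
        # Escape pipes so question text cannot break the table layout.
        question = row["question"].replace("|", "\\|")
        lines.append(f"| {row['trace_id']} | {question} | |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    traces = Path.home() / ".marchwarden" / "traces"
    print(render_worksheet(collect_rows(traces)), end="")
```

Sorting by filename keeps the worksheet row order stable across reruns, so ratings entered by hand in Phase B line up with the same runs if the worksheet is regenerated.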

Jeff Smith

0ddc1e6e37

docs(stress-tests): archive M3.2 multi-axis results

Single deep query against AWS Lambda vs Azure Functions for HFT
exercised 3 of 4 target axes simultaneously: recency, contradictions,
and budget pressure all fired in the same run. scope_exceeded miss is
soft (1 of 5 gaps was arguably miscategorized as source_not_found).

First in-the-wild observation of the `contradiction` discovery_event
type. Issue #45.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 19:34:27 -06:00

Jeff Smith

a39407f03e

docs(stress-tests): archive M3.1 results

Single-axis stress test results from Issue #44. 1 of 4 query targets
cleanly hit (Q3); Q1/Q2 missed because queries weren't adversarial
enough; Q4 missed due to budget cap lag bug filed as #53. Trace
observability gap blocking M3.2/M3.3 filed as #54.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 19:21:34 -06:00

3 commits