Jeff Smith 13215d7ddb docs(stress-tests): M3.3 Phase A — calibration data collection

Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence), so they cannot
be recovered into the worksheet: only post-#54 runs have a result.json
sibling. At 22 rateable runs, the set is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 20:21:47 -06:00

M3.3 Calibration Rating Worksheet

Issue: #46 (Phase B — human rating)

How to use this worksheet

For each run below, read the answer and citations from the persisted result file. The paths are listed under "Result files" below, in the same order as the runs, and each filename starts with the run's trace_id. Score the answer's actual correctness on a 0.0–1.0 scale, independent of the model's self-reported confidence (model_conf). Fill in the actual_rating column. Add notes in the notes column for anything unusual.

Rating rubric:

1.0 — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
0.8 — Mostly correct; minor inaccuracies or omissions that don't change the substance.
0.6 — Substantively right but with notable errors, missing context, or weak citations.
0.4 — Mixed: some right, some wrong; or right answer for wrong reasons.
0.2 — Mostly wrong, misleading, or hallucinated despite confident framing.
0.0 — Completely wrong, fabricated, or refuses to answer a tractable question.
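
As a sanity check while rating, the rubric levels can be treated as anchor points on the 0.0–1.0 scale. The sketch below is a hypothetical helper, not part of the repository: it rejects out-of-range ratings and reports the nearest rubric anchor without changing the rating itself. The names `RUBRIC` and `nearest_anchor`, and the shortened level descriptions, are assumptions made for illustration.

```python
# Hypothetical helper: rubric anchors from this worksheet, with shortened labels.
RUBRIC = {
    1.0: "fully correct, well-supported",
    0.8: "mostly correct, minor inaccuracies",
    0.6: "substantively right, notable errors",
    0.4: "mixed",
    0.2: "mostly wrong or misleading",
    0.0: "completely wrong or fabricated",
}


def nearest_anchor(rating: float) -> float:
    """Return the rubric anchor closest to a rating on the 0.0-1.0 scale."""
    if not 0.0 <= rating <= 1.0:
        raise ValueError(f"rating must be within 0.0-1.0, got {rating}")
    # Pick the anchor with the smallest absolute distance to the rating.
    return min(RUBRIC, key=lambda anchor: abs(anchor - rating))
```

For example, `RUBRIC[nearest_anchor(0.85)]` gives the description of the 0.8 level, which can help when deciding whether an in-between score is defensible.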

After rating all rows, save this file and run:

.venv/bin/python scripts/calibration_analyze.py
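
The contents of scripts/calibration_analyze.py are not shown here. As a rough idea of the comparison this worksheet is set up for, the sketch below summarizes how far model_conf sits from actual_rating across rated rows. It is a minimal illustration under assumptions: the function name `calibration_summary` and the output keys are hypothetical, and unrated rows are represented as `None`.

```python
from statistics import mean


def calibration_summary(pairs):
    """Summarize (model_conf, actual_rating) pairs; unrated rows (None) are skipped.

    A positive mean_gap means the model's confidence was higher than the
    human rating on average (overconfidence); negative means underconfidence.
    """
    rated = [(conf, rating) for conf, rating in pairs if rating is not None]
    if not rated:
        raise ValueError("no rated rows to analyze")
    gaps = [conf - rating for conf, rating in rated]
    return {
        "n": len(rated),
        "mean_gap": mean(gaps),
        "mean_abs_gap": mean(abs(gap) for gap in gaps),
    }


# Made-up pairs for illustration only; these are not ratings from this worksheet.
summary = calibration_summary([(0.9, 0.8), (0.5, 0.6), (0.7, None)])
```

Reporting the signed and absolute gaps separately is a deliberate choice in this sketch: over- and underconfident runs can cancel out in the signed mean while still showing up in the absolute one.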

Runs (22 total)

#	trace_id	category	question	model_conf	corrob	authority	contradiction	budget	recency	gaps	citations	discoveries	tokens
1	`28f55110`	ad-hoc	What is the half-life of caffeine?	0.95	4	high	no	under	current	scope_exceeded(1)	4	2	11582
2	`74a017bd`	ad-hoc	Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc...	0.78	18	medium	yes	spent	current	source_not_found(5)	18	4	127692
3	`6141a021`	factual	What is the boiling point of liquid nitrogen at standard atmospheric pressure?	0.98	5	high	no	under	current	—	5	2	42473
4	`91e87d05`	factual	When did the James Webb Space Telescope launch?	0.99	5	high	no	under	current	contradictory_sources(1)	5	2	19708
5	`710b0a62`	factual	What programming language is the Linux kernel primarily written in?	0.97	6	high	no	under	current	contradictory_sources(1), source_not_found(1)	6	2	32922
6	`ffc42162`	factual	What is the capital of Mongolia?	0.99	4	high	no	under	current	—	4	1	11009
7	`7561029e`	factual	How many amino acids are encoded by the standard genetic code?	0.98	4	high	no	under	current	scope_exceeded(1)	4	2	48308
8	`aaf3b9ef`	comparative	Compare the energy density of lithium-ion vs sodium-ion batteries.	0.91	8	high	no	spent	current	contradictory_sources(1), scope_exceeded(1), source_not_found(1)	8	3	48087
9	`01881015`	comparative	Compare PostgreSQL and SQLite for embedded analytics workloads.	0.88	10	medium	no	spent	current	source_not_found(3)	10	4	61699
10	`9e436db7`	comparative	Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.	0.82	14	high	no	spent	current	source_not_found(4)	14	4	54153
11	`7c8dd19b`	comparative	Compare React and Vue for large enterprise frontends in 2026.	0.81	12	medium	yes	spent	current	contradictory_sources(1), scope_exceeded(1), source_not_found(2)	12	4	56137
12	`e3fa81c3`	comparative	Compare wind and solar capacity factors in the continental United States.	0.88	10	high	no	spent	current	scope_exceeded(2), source_not_found(2)	10	4	48230
13	`96acce3c`	contradiction	Is red wine good for cardiovascular health?	0.72	7	high	yes	spent	recent	access_denied(1), contradictory_sources(1), source_not_found(1)	9	3	42350
14	`c4942f00`	contradiction	Does intermittent fasting extend lifespan in humans?	0.72	9	high	yes	spent	current	contradictory_sources(2), source_not_found(2)	11	4	62781
15	`2e2b6e88`	contradiction	Are nuclear power plants safe?	0.92	8	high	no	spent	current	contradictory_sources(1), scope_exceeded(1), source_not_found(1)	8	3	63429
16	`27d81891`	contradiction	Is dietary cholesterol harmful?	0.78	13	high	yes	spent	current	contradictory_sources(1), source_not_found(2)	13	4	64718
17	`9c18d570`	contradiction	Does screen time harm child development?	0.10	0	low	no	spent	—	budget_exhausted(1)	0	0	44375
18	`f4c43973`	scope	What proprietary indexing strategies do high-frequency trading firms use for ...	0.72	8	medium	no	spent	current	scope_exceeded(1), source_not_found(3)	8	4	70892
19	`b3d00938`	scope	What is the actual operational doctrine of Chinese DF-41 ICBM brigades?	0.72	12	high	yes	spent	current	access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1)	12	4	62857
20	`716e548a`	scope	What internal compensation bands does Goldman Sachs use for VPs in 2026?	0.62	8	medium	yes	spent	current	contradictory_sources(1), scope_exceeded(1), source_not_found(2)	10	3	51829
21	`b7cd9d50`	scope	How does Renaissance Technologies Medallion Fund actually generate alpha?	0.82	10	medium	no	spent	current	access_denied(1), source_not_found(3)	10	4	43096
22	`a4bb5b7a`	scope	What are the precise materials and tolerances in TSMC's 2nm process?	0.42	9	medium	no	spent	current	source_not_found(5)	9	4	62620
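
The gaps column packs several entries into one cell, for example `contradictory_sources(1), source_not_found(1)`. When reviewing the table programmatically, a cell can be split into a mapping. This is a minimal sketch that assumes the number in parentheses is a per-category count; `parse_gaps` is a hypothetical name, not a function from the repository.

```python
import re

# Matches entries such as "source_not_found(5)": a snake_case name and a count.
GAP_RE = re.compile(r"([a-z_]+)\((\d+)\)")


def parse_gaps(cell: str) -> dict[str, int]:
    """Parse one gaps cell into {category: count}; a cell with no entries gives {}."""
    return {name: int(count) for name, count in GAP_RE.findall(cell)}
```

Applied to run 8's cell, this yields one entry each for `contradictory_sources`, `scope_exceeded`, and `source_not_found`. A cell that holds only the placeholder dash produces an empty dict.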

Result files (full content for review)

/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json
/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json
/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json
/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json
/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json
/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json
/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json
/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json
/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json
/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json
/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json
/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json
/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json
/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json
/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json
/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json
/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json
/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json
/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json
/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json
/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json
/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json
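
Each filename above begins with the 8-character trace_id shown in the runs table, so a short id can be resolved to its result file with a glob. The sketch below is hypothetical (`find_result_file` is not part of the repository) and assumes only the `<uuid>.result.json` naming visible in this list. It makes no assumption about the JSON structure inside the file.

```python
from pathlib import Path


def find_result_file(trace_prefix: str, traces_dir: Path) -> Path:
    """Resolve a short trace_id (e.g. "28f55110") to its single *.result.json file."""
    matches = sorted(traces_dir.glob(f"{trace_prefix}*.result.json"))
    if len(matches) != 1:
        # Zero matches means no persisted result; several means the prefix is ambiguous.
        raise LookupError(
            f"expected exactly one result file for {trace_prefix!r}, found {len(matches)}"
        )
    return matches[0]
```

A call such as `find_result_file("28f55110", Path.home() / ".marchwarden" / "traces")` would return the first path in the list, provided the traces directory exists on the machine doing the rating.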
