marchwarden/docs/stress-tests/M3.3-rating-worksheet.md
Jeff Smith 13215d7ddb docs(stress-tests): M3.3 Phase A — calibration data collection
Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration
  + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are
unrecoverable to the worksheet — only post-#54 runs have a result.json
sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:21:47 -06:00


M3.3 Calibration Rating Worksheet

Issue: #46 (Phase B — human rating)

How to use this worksheet

For each run below, read the answer and citations from the persisted result file (paths are listed under "Result files" at the end of this worksheet). Score the answer's actual correctness on a 0.0–1.0 scale, independent of the model's self-reported confidence. Fill in the actual_rating column. Add notes in the notes column for anything unusual.

Rating rubric:

  • 1.0 — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
  • 0.8 — Mostly correct; minor inaccuracies or omissions that don't change the substance.
  • 0.6 — Substantively right but with notable errors, missing context, or weak citations.
  • 0.4 — Mixed: some right, some wrong; or right answer for wrong reasons.
  • 0.2 — Mostly wrong, misleading, or hallucinated despite confident framing.
  • 0.0 — Completely wrong, fabricated, or refuses to answer a tractable question.
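
A small helper can catch typos in the actual_rating column before analysis. The sketch below is illustrative only: `parse_rating` and `nearest_anchor` are hypothetical names, not part of the Marchwarden scripts, and it assumes that an empty cell means "not yet rated" and that values between the rubric anchors are acceptable.

```python
from typing import Optional

# Anchor points taken from the rubric above.
RUBRIC_ANCHORS = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def parse_rating(cell: str) -> Optional[float]:
    """Return the rating in a worksheet cell, or None if the cell is empty."""
    text = cell.strip()
    if not text:
        return None  # assumption: blank means the run has not been rated yet
    value = float(text)  # raises ValueError for non-numeric input
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"rating {value} is outside the 0.0-1.0 scale")
    return value


def nearest_anchor(value: float) -> float:
    """Return the rubric anchor closest to a rating."""
    return min(RUBRIC_ANCHORS, key=lambda anchor: abs(anchor - value))
```

For example, `parse_rating(" 0.8 ")` yields a float, `parse_rating("")` yields `None`, and `parse_rating("1.5")` raises `ValueError`.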

After rating all rows, save this file and run:

.venv/bin/python scripts/calibration_analyze.py
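
What `calibration_analyze.py` actually computes is not described here. As a rough idea of the kind of comparison the ratings enable, the sketch below averages the gap between model_conf and actual_rating per category. The function name `calibration_gaps` and the row layout are assumptions made for illustration, not the script's real interface.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical row layout: (category, model_conf, actual_rating or None).
Row = Tuple[str, float, Optional[float]]


def calibration_gaps(rows: Iterable[Row]) -> Dict[str, float]:
    """Mean of (model_conf - actual_rating) per category.

    Positive values suggest overconfidence, negative values underconfidence.
    Rows that have not been rated yet (rating is None) are skipped.
    """
    gaps = defaultdict(list)
    for category, conf, rating in rows:
        if rating is None:
            continue
        gaps[category].append(conf - rating)
    return {category: mean(values) for category, values in gaps.items()}
```

A category whose mean gap is well above zero would be one where self-reported confidence runs ahead of rated correctness.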

Runs (22 total)

| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |
|---|----------|----------|----------|------------|--------|-----------|---------------|--------|---------|------|-----------|-------------|--------|---------------|-------|
| 1 | 28f55110 | ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 | | |
| 2 | 74a017bd | ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 | | |
| 3 | 6141a021 | factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | | 5 | 2 | 42473 | | |
| 4 | 91e87d05 | factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 | | |
| 5 | 710b0a62 | factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 | | |
| 6 | ffc42162 | factual | What is the capital of Mongolia? | 0.99 | 4 | high | no | under | current | | 4 | 1 | 11009 | | |
| 7 | 7561029e | factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 | | |
| 8 | aaf3b9ef | comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 | | |
| 9 | 01881015 | comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 | | |
| 10 | 9e436db7 | comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 | | |
| 11 | 7c8dd19b | comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 | | |
| 12 | e3fa81c3 | comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 | | |
| 13 | 96acce3c | contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 | | |
| 14 | c4942f00 | contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 | | |
| 15 | 2e2b6e88 | contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 | | |
| 16 | 27d81891 | contradiction | Is dietary cholesterol harmful? | 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 | | |
| 17 | 9c18d570 | contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | | budget_exhausted(1) | 0 | 0 | 44375 | | |
| 18 | f4c43973 | scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 | | |
| 19 | b3d00938 | scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 | | |
| 20 | 716e548a | scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 | | |
| 21 | b7cd9d50 | scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 | | |
| 22 | a4bb5b7a | scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 | | |
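
The gaps column packs several `name(count)` entries into one cell. If the worksheet is post-processed, a cell can be split into a mapping along the lines below. This is a minimal sketch; `parse_gaps` is a hypothetical helper and the `name(count)` format is inferred from the rows above, not from the project's code.

```python
import re
from typing import Dict

# Matches entries such as "source_not_found(2)".
GAP_PATTERN = re.compile(r"(\w+)\((\d+)\)")


def parse_gaps(cell: str) -> Dict[str, int]:
    """Turn a gaps cell into {gap_category: count}; an empty cell gives {}."""
    return {name: int(count) for name, count in GAP_PATTERN.findall(cell)}
```

For run 16, `parse_gaps("contradictory_sources(1), source_not_found(2)")` gives a two-entry mapping, which makes it straightforward to total gap categories across runs.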

Result files (full content for review)

  1. /home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json
  2. /home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json
  3. /home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json
  4. /home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json
  5. /home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json
  6. /home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json
  7. /home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json
  8. /home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json
  9. /home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json
  10. /home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json
  11. /home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json
  12. /home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json
  13. /home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json
  14. /home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json
  15. /home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json
  16. /home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json
  17. /home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json
  18. /home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json
  19. /home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json
  20. /home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json
  21. /home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json
  22. /home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json
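
The table uses the first eight characters of each trace ID, while the files above carry the full UUID. A short prefix lookup can bridge the two during rating. The sketch below is an assumption-laden convenience, not project code: `find_result` and `top_level_keys` are hypothetical names, and because the schema of a result file is not documented here, the second helper only lists whatever top-level keys the JSON happens to contain.

```python
import json
from pathlib import Path
from typing import List


def find_result(prefix: str, traces_dir: Path) -> Path:
    """Return the single *.result.json file whose name starts with prefix."""
    matches = sorted(traces_dir.glob(f"{prefix}*.result.json"))
    if len(matches) != 1:
        raise LookupError(f"expected 1 match for {prefix!r}, found {len(matches)}")
    return matches[0]


def top_level_keys(path: Path) -> List[str]:
    """List the top-level keys of a result file without assuming its schema."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return sorted(data) if isinstance(data, dict) else []
```

A typical call would be `find_result("28f55110", Path.home() / ".marchwarden" / "traces")`, followed by `top_level_keys(...)` to see which fields hold the answer and citations.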