Issue #46 (Phase A only — Phase B human rating still pending, issue stays open). Adds the data-collection half of the calibration milestone: - scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/. - scripts/calibration_collect.py — loads every persisted ResearchResult under ~/.marchwarden/traces/*.result.json and emits a markdown rating worksheet with one row per run. Recovers question text from each trace's start event and category from the run-log filename. - docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns for the human-in-the-loop scoring step. - docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration runner, kept as provenance. Gitignore updated with an exception carving stress-test logs out of the global *.log ignore. Note: M3.1's 4 runs predate #54 (full result persistence) and so are unrecoverable to the worksheet — only post-#54 runs have a result.json sibling. 22 rateable runs is still within the milestone target of 20–30. Phases B (human rating) and C (analysis + rubric + wiki update) follow in a later session. This issue stays open until both are done. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7.5 KiB
7.5 KiB
M3.3 Calibration Rating Worksheet
Issue: #46 (Phase B — human rating)
How to use this worksheet
For each run below, read the answer + citations from the persisted result file (path in the Result file column). Score the answer's actual correctness on a 0.0–1.0 scale, independent of the model's self-reported confidence. Fill in the actual_rating column. Add notes in the notes column for anything unusual.
Rating rubric:
- 1.0 — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
- 0.8 — Mostly correct; minor inaccuracies or omissions that don't change the substance.
- 0.6 — Substantively right but with notable errors, missing context, or weak citations.
- 0.4 — Mixed: some right, some wrong; or right answer for wrong reasons.
- 0.2 — Mostly wrong, misleading, or hallucinated despite confident framing.
- 0.0 — Completely wrong, fabricated, or refuses to answer a tractable question.
After rating all rows, save this file and run:
.venv/bin/python scripts/calibration_analyze.py
Runs (22 total)
| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 28f55110 |
ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 | ||
| 2 | 74a017bd |
ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 | ||
| 3 | 6141a021 |
factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | — | 5 | 2 | 42473 | ||
| 4 | 91e87d05 |
factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 | ||
| 5 | 710b0a62 |
factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 | ||
| 6 | ffc42162 |
factual | What is the capital of Mongolia? | 0.99 | 4 | high | no | under | current | — | 4 | 1 | 11009 | ||
| 7 | 7561029e |
factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 | ||
| 8 | aaf3b9ef |
comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 | ||
| 9 | 01881015 |
comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 | ||
| 10 | 9e436db7 |
comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 | ||
| 11 | 7c8dd19b |
comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 | ||
| 12 | e3fa81c3 |
comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 | ||
| 13 | 96acce3c |
contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 | ||
| 14 | c4942f00 |
contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 | ||
| 15 | 2e2b6e88 |
contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 | ||
| 16 | 27d81891 |
contradiction | Is dietary cholesterol harmful? | 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 | ||
| 17 | 9c18d570 |
contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | — | budget_exhausted(1) | 0 | 0 | 44375 | ||
| 18 | f4c43973 |
scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 | ||
| 19 | b3d00938 |
scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 | ||
| 20 | 716e548a |
scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 | ||
| 21 | b7cd9d50 |
scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 | ||
| 22 | a4bb5b7a |
scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 |
Result files (full content for review)
/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json