Issue #46 (Phase A only; Phase B human rating is still pending, so the issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh: runs 20 fixed balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py: loads every persisted ResearchResult under ~/.marchwarden/traces/*.result.json and emits a markdown rating worksheet with one row per run. Recovers question text from each trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md: 22 runs (20 calibration + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log: runtime logs from the calibration runner, kept as provenance. Gitignore updated with an exception carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are unrecoverable to the worksheet; only post-#54 runs have a result.json sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
74 lines
7.5 KiB
Markdown
# M3.3 Calibration Rating Worksheet

Issue: #46 (Phase B: human rating)

## How to use this worksheet
For each run below, read the answer + citations from the persisted result file (path in the **Result file** column). Score the answer's *actual* correctness on a 0.0–1.0 scale, **independent** of the model's self-reported confidence. Fill in the **actual_rating** column. Add notes in the **notes** column for anything unusual.

Rating rubric:

- **1.0**: Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
- **0.8**: Mostly correct; minor inaccuracies or omissions that don't change the substance.
- **0.6**: Substantively right but with notable errors, missing context, or weak citations.
- **0.4**: Mixed: some right, some wrong; or right answer for wrong reasons.
- **0.2**: Mostly wrong, misleading, or hallucinated despite confident framing.
- **0.0**: Completely wrong, fabricated, or refuses to answer a tractable question.

After rating all rows, save this file and run:
```
.venv/bin/python scripts/calibration_analyze.py
```
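The contents of `scripts/calibration_analyze.py` are not shown here. As a rough sketch of the kind of comparison the rating step enables, the Python below contrasts self-reported confidence with human ratings on the 0.0 to 1.0 scale described above. The function name `calibration_summary`, its return keys, and the example ratings are hypothetical and chosen only for illustration; they are not taken from the actual script or from real worksheet scores.

```python
from statistics import mean


def calibration_summary(rows):
    """Summarise how self-reported confidence compares with human ratings.

    `rows` is an iterable of (model_conf, actual_rating) pairs. A rating of
    None means the row has not been scored yet and is skipped.
    """
    rated = [(conf, rating) for conf, rating in rows if rating is not None]
    for _, rating in rated:
        if not 0.0 <= rating <= 1.0:
            raise ValueError(f"rating {rating} is outside the 0.0-1.0 scale")
    if not rated:
        return {"rated": 0, "mean_gap": None, "mean_abs_gap": None}
    gaps = [conf - rating for conf, rating in rated]
    return {
        "rated": len(rated),
        # A positive mean_gap means the model is over-confident on average.
        "mean_gap": round(mean(gaps), 3),
        "mean_abs_gap": round(mean(abs(g) for g in gaps), 3),
    }


# Hypothetical (model_conf, actual_rating) pairs, for illustration only.
example = [(0.95, 1.0), (0.78, 0.6), (0.10, None)]
summary = calibration_summary(example)
print(summary)
```

Skipping unrated rows rather than treating them as zero keeps a partially completed worksheet from skewing the averages.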
## Runs (22 total)
| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | `28f55110` | ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 | | |
| 2 | `74a017bd` | ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 | | |
| 3 | `6141a021` | factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | — | 5 | 2 | 42473 | | |
| 4 | `91e87d05` | factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 | | |
| 5 | `710b0a62` | factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 | | |
| 6 | `ffc42162` | factual | What is the capital of Mongolia? | 0.99 | 4 | high | no | under | current | — | 4 | 1 | 11009 | | |
| 7 | `7561029e` | factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 | | |
| 8 | `aaf3b9ef` | comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 | | |
| 9 | `01881015` | comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 | | |
| 10 | `9e436db7` | comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 | | |
| 11 | `7c8dd19b` | comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 | | |
| 12 | `e3fa81c3` | comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 | | |
| 13 | `96acce3c` | contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 | | |
| 14 | `c4942f00` | contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 | | |
| 15 | `2e2b6e88` | contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 | | |
| 16 | `27d81891` | contradiction | Is dietary cholesterol harmful? | 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 | | |
| 17 | `9c18d570` | contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | — | budget_exhausted(1) | 0 | 0 | 44375 | | |
| 18 | `f4c43973` | scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 | | |
| 19 | `b3d00938` | scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 | | |
| 20 | `716e548a` | scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 | | |
| 21 | `b7cd9d50` | scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 | | |
| 22 | `a4bb5b7a` | scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 | | |
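To check rating progress, the table above can be read back into Python. The sketch below is a minimal parser, assuming no cell contains a literal `|` (true of the rows above, but not guaranteed for future questions). The function name `parse_worksheet_table` is hypothetical, and the sample is a trimmed excerpt with fewer columns than the real worksheet.

```python
def parse_worksheet_table(markdown_text):
    """Return the worksheet's table rows as a list of dicts keyed by header."""
    table_lines = [
        line.strip()
        for line in markdown_text.splitlines()
        if line.strip().startswith("|")
    ]
    if len(table_lines) < 2:
        return []

    def cells(line):
        # Drop exactly one outer pipe on each side, then split on inner pipes.
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        # Strip whitespace and the backticks wrapped around trace ids.
        return [cell.strip().strip("`") for cell in line.split("|")]

    header = cells(table_lines[0])
    rows = []
    for line in table_lines[2:]:  # index 1 is the |---| separator row
        values = cells(line)
        if len(values) == len(header):
            rows.append(dict(zip(header, values)))
    return rows


# Trimmed excerpt of the worksheet, for illustration only.
sample = """\
| # | trace_id | category | model_conf | actual_rating | notes |
|---|---|---|---|---|---|
| 1 | `28f55110` | ad-hoc | 0.95 | | |
| 3 | `6141a021` | factual | 0.98 | | |
"""

rows = parse_worksheet_table(sample)
unrated = [row["trace_id"] for row in rows if not row["actual_rating"]]
print(len(rows), unrated)
```

All values come back as strings, so numeric columns such as `model_conf` need an explicit `float()` conversion before any arithmetic.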
## Result files (full content for review)
1. `/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json`
2. `/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json`
3. `/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json`
4. `/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json`
5. `/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json`
6. `/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json`
7. `/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json`
8. `/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json`
9. `/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json`
10. `/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json`
11. `/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json`
12. `/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json`
13. `/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json`
14. `/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json`
15. `/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json`
16. `/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json`
17. `/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json`
18. `/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json`
19. `/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json`
20. `/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json`
21. `/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json`
22. `/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json`
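Before starting a rating session, it may help to confirm that every result file listed above still exists and parses as JSON. The following is a minimal sketch of such a pre-flight check. The function name `check_result_files` is hypothetical, and the code makes no assumptions about the fields inside each result file.

```python
import json
from pathlib import Path


def check_result_files(paths):
    """Map each path to 'ok', 'missing', or 'invalid json'."""
    status = {}
    for raw in paths:
        path = Path(raw).expanduser()
        if not path.is_file():
            status[raw] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[raw] = "ok"
        except ValueError:
            # JSONDecodeError and UnicodeDecodeError both subclass ValueError.
            status[raw] = "invalid json"
    return status


# Check every persisted result under the trace directory named above.
traces = Path("~/.marchwarden/traces").expanduser()
found = sorted(str(p) for p in traces.glob("*.result.json"))
for path, state in check_result_files(found).items():
    print(state, path)
```

On a machine without a trace directory the loop prints nothing, since globbing a missing directory yields no paths.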