marchwarden/docs/stress-tests/M3.3-rating-worksheet.md

75 lines
7.5 KiB
Markdown
Raw Normal View History

2026-04-09 02:21:47 +00:00
# M3.3 Calibration Rating Worksheet
Issue: #46 (Phase B — human rating)
## How to use this worksheet
For each run below, read the answer + citations from the persisted result file (path in the **Result file** column). Score the answer's *actual* correctness on a 0.01.0 scale, **independent** of the model's self-reported confidence. Fill in the **actual_rating** column. Add notes in the **notes** column for anything unusual.
Rating rubric:
- **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
- **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance.
- **0.6** — Substantively right but with notable errors, missing context, or weak citations.
- **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons.
- **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing.
- **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question.
After rating all rows, save this file and run:
```
.venv/bin/python scripts/calibration_analyze.py
```
## Runs (22 total)
| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | `28f55110` | ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 | | |
| 2 | `74a017bd` | ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 | | |
| 3 | `6141a021` | factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | — | 5 | 2 | 42473 | | |
| 4 | `91e87d05` | factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 | | |
| 5 | `710b0a62` | factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 | | |
| 6 | `ffc42162` | factual | What is the capital of Mongolia? | 0.99 | 4 | high | no | under | current | — | 4 | 1 | 11009 | | |
| 7 | `7561029e` | factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 | | |
| 8 | `aaf3b9ef` | comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 | | |
| 9 | `01881015` | comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 | | |
| 10 | `9e436db7` | comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 | | |
| 11 | `7c8dd19b` | comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 | | |
| 12 | `e3fa81c3` | comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 | | |
| 13 | `96acce3c` | contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 | | |
| 14 | `c4942f00` | contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 | | |
| 15 | `2e2b6e88` | contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 | | |
| 16 | `27d81891` | contradiction | Is dietary cholesterol harmful? | 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 | | |
| 17 | `9c18d570` | contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | — | budget_exhausted(1) | 0 | 0 | 44375 | | |
| 18 | `f4c43973` | scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 | | |
| 19 | `b3d00938` | scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 | | |
| 20 | `716e548a` | scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 | | |
| 21 | `b7cd9d50` | scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 | | |
| 22 | `a4bb5b7a` | scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 | | |
## Result files (full content for review)
1. `/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json`
2. `/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json`
3. `/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json`
4. `/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json`
5. `/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json`
6. `/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json`
7. `/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json`
8. `/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json`
9. `/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json`
10. `/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json`
11. `/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json`
12. `/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json`
13. `/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json`
14. `/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json`
15. `/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json`
16. `/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json`
17. `/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json`
18. `/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json`
19. `/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json`
20. `/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json`
21. `/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json`
22. `/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json`