M3.3 Confidence calibration (V1.1) #46

Open
opened 2026-04-08 23:24:11 +00:00 by claude-code · 1 comment
Collaborator

Phase 3 — Stress Testing & Calibration, milestone 3.

Goal

Use the data from M3.1 + M3.2 + ad-hoc runs (target: 20–30 queries total) to build an empirical understanding of when the LLM's confidence scores match reality, and produce a calibration rubric.

Process

  1. Collect 20–30 ResearchResults across diverse question types (factual, comparative, contradiction-prone, scope-edge)
  2. Manually rate each on a 0–1 scale of "how correct was the answer actually" — independent of the model's self-reported confidence
  3. Plot model confidence vs. actual correctness; identify systematic biases
  4. Draft a calibrated rubric: "what does a 0.9 actually mean empirically?"
  5. Update [ResearchContract](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ResearchContract) with the calibrated guidance
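
Steps 3–4 could be sketched as a simple binning pass over (model confidence, actual rating) pairs; the pair values below are placeholders for illustration, not real ledger data:

```python
# Sketch of Process steps 3-4: bucket model confidence against manual
# ratings to see what each confidence band means empirically.
# The demo pairs are placeholders, not real calibration data.
from statistics import mean

def calibration_table(pairs, bins=10):
    """Group (model_confidence, actual_rating) pairs into confidence
    buckets and report (count, mean actual rating) per bucket."""
    buckets = {}
    for conf, rating in pairs:
        lo = min(int(conf * bins), bins - 1) / bins  # e.g. 0.85 -> 0.8
        buckets.setdefault(lo, []).append(rating)
    return {lo: (len(rs), round(mean(rs), 2))
            for lo, rs in sorted(buckets.items())}

demo = [(0.9, 0.7), (0.9, 0.8), (0.95, 0.6), (0.85, 0.9), (0.6, 0.6)]
for lo, (n, avg) in calibration_table(demo).items():
    print(f"confidence {lo:.1f}-{lo + 0.1:.1f}: n={n}, mean actual={avg}")
```

With 20–30 data points the buckets will be sparse, so the rubric in step 4 would likely read off the per-bucket means rather than a fitted curve.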

Deliverable

  • Calibration data table (ledger entries + manual ratings)
  • Updated `confidence_factors` documentation in the wiki
  • Recommended adjustments to the synthesis prompt if patterns emerge

Out of scope

Automated calibration / RLHF — manual review only for V1.1.

archeious added this to the Phase 3: Stress Testing & Calibration milestone 2026-04-08 23:25:12 +00:00
Owner

Splitting this milestone since the rating step requires human-in-the-loop review.

Phase A — Data collection (this session)

  • Run 20 additional balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge) to get a representative spread of confidence values.
  • Combined with M3.1 (4 runs) + M3.2 (1 run) + the caffeine smoke run = 26 total ResearchResults.
  • Build a `scripts/calibration_collect.py` script that loads all `~/.marchwarden/traces/*.result.json` files and emits a markdown rating worksheet to `docs/stress-tests/M3.3-rating-worksheet.md` with one row per run and an empty `actual_rating` column.
  • Branch + PR + merge the script and worksheet. **Issue stays open.**
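
The worksheet-emitting core of `calibration_collect.py` might look like the sketch below. The ResearchResult field names (`query`, `confidence`) are assumptions about the trace schema, not confirmed by this issue:

```python
# Hypothetical sketch of scripts/calibration_collect.py. The result-file
# field names ("query", "confidence") are assumed; adjust to the real
# ResearchResult schema before use.
import json
from pathlib import Path

def worksheet_rows(results):
    """results: iterable of (run_id, result_dict) pairs -> markdown rows
    with an empty actual_rating column for the human rater to fill in."""
    rows = ["| run | query | model confidence | actual_rating |",
            "| --- | --- | --- | --- |"]
    for run_id, result in results:
        rows.append(f"| {run_id} | {result.get('query', '?')} "
                    f"| {result.get('confidence', '?')} |  |")
    return rows

def build_worksheet(trace_dir: Path) -> str:
    """Load every *.result.json under trace_dir into one markdown table."""
    results = ((p.stem, json.loads(p.read_text()))
               for p in sorted(trace_dir.glob("*.result.json")))
    return "\n".join(worksheet_rows(results))
```

Usage would be along the lines of `build_worksheet(Path.home() / ".marchwarden" / "traces")`, with the output written to the worksheet path above.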

Phase B — Human rating (offline, your pace)

  • You open the worksheet, read each answer, fill in `actual_rating` (0.0–1.0) per row.
  • No code changes from me during this phase.

Phase C — Analysis & rubric (next session, after rating is done)

  • Load the rated worksheet, compute the calibration error (mean absolute error between model confidence and actual rating), and look for systematic biases.
  • Draft empirical rubric: "a 0.85 means X."
  • Update the wiki ResearchContract page.
  • Recommend synthesis-prompt adjustments if patterns emerge.
  • Close this issue.
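
The error computation in the first Phase C bullet could be sketched as follows; the input pairs would come from parsing the rated worksheet, and the values shown are placeholders:

```python
# Sketch of the Phase C analysis step: mean absolute calibration error
# plus a signed bias term. Positive bias = the model's confidence tends
# to exceed the actual rating (overconfidence). Demo values are placeholders.
def calibration_stats(pairs):
    """pairs: iterable of (model_confidence, actual_rating)."""
    errors = [conf - rating for conf, rating in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    bias = sum(errors) / len(errors)
    return round(mae, 3), round(bias, 3)

mae, bias = calibration_stats([(0.9, 0.7), (0.8, 0.8), (0.95, 0.6)])
```

The signed bias complements the MAE: a large MAE with near-zero bias suggests noisy but unbiased confidence, while a large positive bias is the systematic overconfidence pattern the rubric would need to correct for.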

Rationale for the split: I can't credibly self-rate the agent's outputs (same biases, marking my own homework). The rating step is fundamentally yours. Splitting unblocks the mechanical work now and lets you batch the cognitive work whenever convenient.

**Phase A starts now in branch `feat/m3.3-collection`.**

Reference: archeious/marchwarden#46