M3.3 Confidence calibration (V1.1) #46
Reference: archeious/marchwarden#46
Phase 3 — Stress Testing & Calibration, milestone 3.
Goal
Use the data from M3.1 + M3.2 + ad-hoc runs (target: 20–30 queries total) to build an empirical understanding of when the LLM's confidence scores match reality, and produce a calibration rubric.
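One way to compare confidence against the human ratings is a simple binned table. The sketch below is illustrative only: it assumes each run yields a `(confidence, actual_rating)` pair with both values on a 0.0–1.0 scale, and the function name `calibration_table` and the bin count are hypothetical choices, not part of the project.

```python
from statistics import mean


def calibration_table(pairs, n_bins=5):
    """Group (confidence, actual_rating) pairs into equal-width confidence bins.

    Assumes both values lie in [0.0, 1.0]; this is an assumption about the data.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, rating in pairs:
        # Clamp the index so that confidence == 1.0 lands in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, rating))

    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than reporting a fake mean
        mean_conf = mean(c for c, _ in members)
        mean_rating = mean(r for _, r in members)
        rows.append({
            "bin": (index / n_bins, (index + 1) / n_bins),
            "n": len(members),
            "mean_confidence": mean_conf,
            "mean_rating": mean_rating,
            # Positive gap: the model was more confident than the rating supports.
            "gap": mean_conf - mean_rating,
        })
    return rows
```

With only 20–30 runs, each bin will hold a handful of points at most, so the gaps are better read as hints for the rubric than as statistics.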
Process
Deliverable
confidence_factors documentation in the wiki
Out of scope
Automated calibration / RLHF — manual review only for V1.1.
Splitting this milestone since the rating step requires human-in-the-loop review.
Phase A — Data collection (this session)
scripts/calibration_collect.py that loads all ~/.marchwarden/traces/*.result.json files and emits a markdown rating worksheet to docs/stress-tests/M3.3-rating-worksheet.md with one row per run and an empty actual_rating column.
Phase B — Human rating (offline, your pace)
Fill in actual_rating (0.0–1.0) per row.
Phase C — Analysis & rubric (next session, after rating is done)
Rationale for the split: I can't credibly self-rate the agent's outputs (same biases, marking my own homework). The rating step is fundamentally yours. Splitting unblocks the mechanical work now and lets you batch the cognitive work whenever convenient.
Phase A starts now in branch feat/m3.3-collection.
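A rough sketch of what the Phase A collection step could look like follows. It is not the actual scripts/calibration_collect.py: the trace schema is not shown in this issue, so the JSON keys `question` and `confidence`, the column set, and the function name `build_worksheet` are all assumptions made for illustration.

```python
import json
from pathlib import Path


def build_worksheet(trace_dir):
    """Return a markdown table with one row per *.result.json file in trace_dir."""
    lines = [
        "| run | question | confidence | actual_rating |",
        "| --- | --- | --- | --- |",
    ]
    for path in sorted(Path(trace_dir).expanduser().glob("*.result.json")):
        result = json.loads(path.read_text(encoding="utf-8"))
        # "question" and "confidence" are assumed keys, not the real trace schema.
        question = str(result.get("question", ""))
        # Escape pipes and flatten newlines so the cell cannot break the table.
        question = question.replace("|", "\\|").replace("\n", " ")
        confidence = result.get("confidence", "")
        run_id = path.name[: -len(".result.json")]
        # The last cell is left empty for the human-supplied actual_rating.
        lines.append(f"| {run_id} | {question} | {confidence} |  |")
    return "\n".join(lines) + "\n"
```

Calling `build_worksheet("~/.marchwarden/traces")` and writing the returned string to docs/stress-tests/M3.3-rating-worksheet.md would produce the worksheet. Sorting the paths keeps the row order stable between runs, which makes it easier to re-generate the file without shuffling rows that have already been rated.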