M3.3 Confidence calibration (V1.1)
This milestone is split into phases because the rating step requires human-in-the-loop review.
Phase A — Data collection (this session)
- Run 20 additional balanced-depth queries across 4 categories…
M3.2 Multi-axis stress test
M3.2 Results
One deep query, one trace, three of four target axes hit. Full writeup archived at docs/stress-tests/M3.2-results.md (PR #57).
Trace: 74a017bd-697b-4439-96b8-fe12057cf2e8
M3.1 Single-axis stress tests
M3.1 Results
Four queries were run on feat/m3.1-stress-tests against current main, with depth=balanced unless noted.
Record per-step durations in trace and operational logs
depth flag now drives constraint defaults
chore: Makefile with venv-based dev workflow