From fb4574d8ce998e3e137aabc8c7228b199cf07e0d Mon Sep 17 00:00:00 2001 From: Jeff Smith Date: Wed, 8 Apr 2026 20:26:09 -0600 Subject: [PATCH] chore: update CLAUDE.md for session 3 --- CLAUDE.md | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index f4e585c..9425502 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -9,12 +9,12 @@ and agent composition. | | | |---|---| -| **Phase** | Phases 0–2.5 complete; V1 shipped. Next: Phase 3 (stress testing) or Phase 5 (arxiv-rag) | +| **Phase** | Phases 0–3 substantially complete (M3.3 awaiting human rating); Phase 5 started (M5.1.1 shipped) | | **Last worked on** | 2026-04-08 | -| **Last commit** | `af79358` — Merge PR #36: per-step durations in trace and operational logs | +| **Last commit** | `78f08c9` — Merge PR #59: M3.3 Phase A calibration data collection | | **Branch** | `main` (clean) | -| **Tests** | 123 passing | -| **Blocking issues** | None | +| **Tests** | 141 passing | +| **Blocking issues** | #53 (budget cap lag — recommended fix before #39); #46 (M3.3 Phase B awaiting human rating) | ## Key Files @@ -52,12 +52,20 @@ and agent composition. |---|---|---| | 1 | 2026-04-08 | Project creation, naming, contract design, Phase 0 + Phase 1 complete (81 tests) | | 2 | 2026-04-08 | Phase 2 (CLI shim) + Phase 2.5 (logging + cost tracking) shipped; V1 ships; depth presets; docker test env; per-step duration tracking; arxiv-rag scoped as M5.1; Phase 3/4/5/6 milestones populated (123 tests) | +| 3 | 2026-04-08 | Phase 3 stress testing: M3.1+M3.2 closed, M3.3 split into Phases A/B/C with A done. Trace observability fix (#54) — full ResearchResult persisted as sibling + per-item events. M5.1.1 arxiv-rag ingest pipeline shipped (researchers/arxiv/, [arxiv] optional extra, lazy CLI imports). Structured-data tool critiqued and deferred until M6 PI consumer exists. Filed #53 (budget cap lag — recommended next session). 141 tests | ## What's Next -**Recommended next session: Phase 3 (Stress Testing & Calibration)** before Phase 5, since stress tests will likely tighten the contract before a second researcher has to implement it. +**Recommended next session: fix #53 (budget cap lag) before continuing Phase 5.** The arxiv researcher's eventual agent loop (#40) will inherit budget semantics from the web researcher — fix the bug before duplicating it. -- **Phase 3:** Issue #44 (M3.1 single-axis stress tests) → #45 (M3.2 multi-axis) → #46 (M3.3 confidence calibration) -- **Phase 5 alternative:** Issue #38 (M5.1.1 arxiv-rag ingest pipeline). New deps: pymupdf, chromadb, sentence-transformers, arxiv. Design lives at [wiki/ArxivRagProposal](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ArxivRagProposal). +Order of next-session candidates: -Open milestones in Forgejo: Phase 3 (3 issues), Phase 4 (3 issues), Phase 5 (8 issues including arxiv-rag tracker), Phase 6 (2 issues). +1. **#53** — budget cap lag bug. Single-file fix in `researchers/web/agent.py` plus a regression test. ~30 min. +2. **Live arxiv smoke** — `marchwarden arxiv add 1706.03762` end-to-end. Validates M5.1.1 against a real PDF. First run downloads ~500MB embedding model. +3. **#39** — M5.1.2 arxiv-rag retrieval primitive. Builds the query API on top of M5.1.1's chromadb collection. +4. **M3.3 Phase C** — once the user brings back `docs/stress-tests/M3.3-rating-worksheet.md` with `actual_rating` columns filled in. Analysis script + rubric + wiki update. +5. **M4.1** (#47) — error handling / hardening. Independent of everything above. + +**Open issues:** #53 (budget cap lag), #46 (M3.3 awaiting rating). + +**Open milestones in Forgejo:** Phase 3 (1 issue: #46), Phase 4 (3 issues), Phase 5 (7 issues remaining), Phase 6 (2 issues).