retro: Session 2 — Phase 2+2.5 shipped, V1 ships, arxiv-rag scoped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jeff Smith 2026-04-08 17:30:26 -06:00
parent 9dcab7627f
commit 5257bb26e1
2 changed files with 130 additions and 0 deletions

129
Session2.md Normal file

@ -0,0 +1,129 @@
# Session 2 Notes — 2026-04-08
## What We Set Out to Do
Ship Phase 2 (CLI shim) — `marchwarden ask`, `marchwarden replay`, and a first end-to-end smoke test against a real question. Roadmap goal: prove the V1 stack works end to end, document the run, and close out the V1 ship target (Issue #1).
## What Actually Happened
The session went much further than planned. Phase 2 shipped, then five real bugs were caught by the smoke test, then Phase 2.5 (logging + cost tracking) was scoped and shipped, then a depth-presets fix, and finally a complete roadmap reorganization with milestones for Phases 3/4/5/6 and a fully designed Phase 5 (arxiv-rag researcher).
In rough order:
1. **Phase 2 — CLI shim shipped** (M2.1, M2.2, M2.3 → PRs #11, #12, smoke run on `chore/smoke-test-1`)
- `marchwarden ask` with rich rendering, MCP stdio plumbing
- `marchwarden replay` with the same rendering style for trace JSONL
- First smoke test against "What are ideal crops for a garden in Utah?"
2. **Smoke test caught five real bugs.** All filed as separate issues, fixed in dependency order, each on its own PR before the smoke-test branch could land:
- Issue #15 / PR #21 — invalid default model id (`claude-sonnet-4-5-20250514` 404s; → `claude-sonnet-4-6`)
- Issue #16 / PR #20 — synthesis JSON parse failure (max_tokens=4096 cut JSON mid-string; bumped to 16384)
- Issue #17 / PR #22 — token_budget enforced post-hoc; moved check to top of iteration loop
- Issue #18 / PR #23 — MCP stdio client doesn't propagate parent env to subprocess; pass `env=os.environ.copy()`
- Issue #19 / PR #20 — trace logger truncated `raw_response` to 1000 chars, hiding the synthesis bug; fix bundled with #16
3. **Docker test environment** (Issue #13 / PR #14) — created mid-stream when host install drift made the smoke test impossible to repeat. Reproducible Python 3.12-slim image, mounts `~/secrets` ro and `~/.marchwarden` rw, used for every subsequent live verification this session.
4. **Phase 2.5 invented mid-session** — user asked for "well-structured logging that records activity and costs," distinct from the JSONL trace. Filed as new milestones #24/#25/#26 → PRs #27/#28/#29:
- **M2.5.1: Structured logger via structlog** — separate from trace JSONL, JSON+console renderers, contextvar binding for `trace_id`, stderr-only so MCP stdio stays clean
- **M2.5.2: Cost ledger** — append-only JSONL at `~/.marchwarden/costs.jsonl`, per-call entries with token split / Tavily count / estimated USD; price table at `~/.marchwarden/prices.toml` auto-seeded
- **M2.5.3: `marchwarden costs` CLI** — rich summary panels (per-day, per-model, highest-cost call), `--since`/`--model`/`--json` filters
5. **Per-step trace logging** — user asked the operational log to "increase resolution"; mirrored every `TraceLogger.log_step` call into structlog with curated INFO/DEBUG levels (PR #32). Then added `duration_ms` per step and `total_duration_sec` on the terminal `complete` step (Issue #35 / PR #36) — first time the synthesis-is-58%-of-wall-time fact was visible at a glance.
6. **Depth presets** (Issue #30 / PR #33) — `--depth shallow|balanced|deep` was previously cosmetic; now drives `max_iterations` / `token_budget` / `max_sources` defaults. `balanced` preserves the historical defaults for backward compatibility.
7. **Local dev quality of life:**
- User Guide written and published (`docs/wiki/UserGuide.md`)
- `Home.md` created as wiki landing page
- `Makefile` (PR #34) — `make install`/`test`/`ask`/`costs`/`clean` plus docker wrappers
- Display fix: "Budget exhausted: True" → "Budget status: spent / under cap" (PR #31), because the original wording read as a failure indicator on every successful run
8. **Phase 2.5 → Phase 3 transition: roadmap restructure** — user wanted RAG over their arXiv reading list. We:
- Discussed RAG vs grep-based retrieval, settled on the user-curated reading list path
- Drafted a full implementation proposal (`docs/wiki/ArxivRagProposal.md`) with locked-in design defaults: pymupdf, chromadb, nomic-embed local embeddings, section-level chunking
- Filed Issue #37 as the umbrella tracker
- Replaced original M5.1 (grep file researcher) with arxiv-rag in the roadmap
- Filed M5.1.1M5.1.6 sub-issues (#38#43)
- Created Forgejo milestones for Phases 3/4/5/6 and filed the remaining Phase 3/4/5/6 issues (#44#52)
## Key Decisions & Reasoning
- **Fix smoke-test bugs in dependency order, not just batch them.** Synthesis-truncation (#19) was hiding the synthesis-parse bug (#16); fixing #19 first made #16 diagnosable from the trace alone. Order mattered.
- **Synthesis tokens are uncapped by design.** When fixing the budget enforcement (Issue #17), capping synthesis would have re-created the synthesis-stub failure mode the project just fixed. Decision: `token_budget` is a soft cap on the *tool-use loop only*; synthesis always completes. Documented as a code comment so future-me doesn't try to "fix" it again.
- **structlog over stdlib logging for the operational logger.** Initial recommendation was stdlib + RichHandler. User asked: "since these logs will eventually be consumed by OpenSearch, would you still recommend stdlib?" — that flipped the decision instantly to structlog. Worth re-emphasizing: the *destination* of logs is a load-bearing design input, not a deployment detail.
- **Ledger separate from contract `cost_metadata`.** Cost ledger supplements (does not replace) the per-call `cost_metadata` field returned in `ResearchResult`. The contract field is for callers; the ledger is for operators. They serve different audiences.
- **Mirror the trace logger into the operational logger instead of adding duplicate log calls everywhere.** One source of truth for "what happened in this research call" — the trace step list. The structlog mirror gets every step for free, with `trace_id`/`researcher` already bound in contextvars. Curated INFO/DEBUG levels keep the noise manageable.
- **arxiv-rag replaces, not supplements, the original grep-based file researcher in M5.1.** The grep researcher was a placeholder ("any second researcher") with no concrete user demand. The arxiv-rag has both: a concrete user (the project owner has an arXiv reading list) and a more interesting contract stress test (RAG vs web search has very different evidence properties). The grep researcher is demoted to "future ideas" in the roadmap, can be revived if a private file corpus shows up.
- **Issue #1 (V1 ship) was held open until the smoke test ran green.** Originally I was going to merge M2.3 with the buggy Utah crops run documented as "first attempt"; instead we held the milestone open, fixed every bug surfaced, and re-ran. The green run (trace `1f37b795...`, confidence 0.91, 10 citations) was what closed Issue #1.
## Surprises & Discoveries
- **The default `max_tokens=4096` for synthesis was way too small.** A real ResearchResult JSON over 28 sources easily exceeds 8k tokens once you include citations, raw_excerpts, gaps, discovery_events, and open_questions. The model was correctly producing the JSON; Anthropic was cutting it off mid-string. Bumped to 16384. This is the kind of bug that's invisible without an end-to-end real-corpus smoke test — unit tests pass perfectly while production silently degrades.
- **The trace logger was truncating its own diagnostic data.** `raw_response=raw_text[:1000]` was cutting off exactly the part of the JSON where the parse error was. The bug was effectively invisible from traces — a logging system that hides the symptom of what it's logging is worse than no logging. Permanent reminder: never truncate diagnostic fields without thinking about who's going to read them.
- **MCP `StdioServerParameters` does NOT propagate parent env by default.** This is non-obvious, undocumented, and silent. `MARCHWARDEN_MODEL` set on the CLI process was being eaten before the server subprocess saw it. Fix is one line (`env=os.environ.copy()`) but discovering it required printing env from inside the container.
- **Synthesis is 58% of wall time.** The duration tracking work surfaced this immediately on the first real run. Worth keeping in mind for any future optimization — the tool-use loop is fast; the synthesis call dominates.
- **`Budget exhausted: True` reads as a failure indicator even when it's expected.** User was alarmed that "every run exhausts the budget" — and that's *correct*, the agent is supposed to spend its budget. The wording was the problem, not the behavior. Renamed to "Budget status: spent / under cap" in the display. Lesson: contract field names ≠ user-facing labels, and the difference matters.
- **Bash CWD persistence got me twice.** Once when bash carried `docs/wiki` as cwd into a project commit (committed on `feat/step-durations` instead of main, recoverable). Once when a stash-pop after branch switching got the uncommitted env-fix on the wrong branch. Need to be more deliberate about absolute paths or `cd` at the start of every chained command, not the middle.
- **`claude-code` Forgejo user couldn't create milestones.** Hit a 403 mid-task; fell back to REST with `FORGEJO_API_TOKEN`. User then promoted `claude-code` to admin so future sessions don't need the fallback. Saved as a `reference_forgejo_admin.md` memory.
## Concerns & Open Threads
- **The cost ledger doesn't include synthesis tokens in its budget enforcement.** The displayed `tokens_used` (e.g., 44k) regularly exceeds the `token_budget` (20k) because synthesis is intentionally uncapped. This is documented and intentional, but it means the *user-facing* concept of "budget" differs from the *mechanical* concept. If we ever want strict total-spend control, we'll need to either reserve a fraction of the budget for synthesis up-front or split into `loop_budget` + `synthesis_budget`. This will be louder once people start sharing cost reports.
- **Phase 3 stress tests are going to surface contract issues that might invalidate the calibration data we collect.** Stress testing assumes the contract is stable; if we discover a missing gap category mid-Phase-3, we'll either need to retrofit older runs or restart calibration. Worth anticipating.
- **arxiv-rag is a much bigger leap than the grep-based file researcher would have been.** Six sub-milestones, four new dependencies (pymupdf, chromadb, sentence-transformers, arxiv), section-detection heuristics, vector store lifecycle management. The "prove the contract works across researcher types" goal would have been satisfied by the simpler grep researcher first. We chose the harder path because the user has concrete demand for it; that's the right call but it means Phase 5 takes longer.
- **The ArxivRagProposal commits to local embeddings (`nomic-embed-text-v1.5`) for v1.** That's the cheapest path but also the one most likely to need a do-over if quality is poor. The chunk-id-includes-embedding-model design means a re-ingest is non-destructive, but the operator-facing pain (re-ingest 50 papers) is real.
- **Two unsolved bugs from the smoke test era live as comments, not issues:** the synthesis "uncapped by design" decision and the env-propagation broad-vs-whitelist tension are both noted in code comments but not tracked. Might bite us in code review later.
- **`docs/` is in the project repo's git status as untracked the entire time.** It IS the wiki clone — gitignore should probably exclude it explicitly so it stops appearing in `git status`.
- **The session burned through ~20 PRs, all merged.** That's a lot of merge churn for one session. The branches were each focused, but the rate is unusually high. Worth retrospecting on whether some of these (e.g., the display-clarity fix, the docker env fix) could have ridden along with adjacent work without losing the "one branch one concern" discipline.
## Raw Thinking
- The pattern of "smoke test catches bugs the unit tests missed" is going to repeat. Stress testing in Phase 3 is going to catch *more*. Worth psyching myself up for it.
- The fact that synthesis takes 27s out of 47s wall time for a *shallow* query is really striking. Most of that is the model generating a long JSON output. There's an obvious optimization: stream the synthesis JSON and parse incrementally, fail fast on schema violations. Not in scope for any current milestone but it's the biggest single performance lever.
- The "two-branch maximum" rule from CLAUDE.md got tested constantly this session. Almost every PR was one of two open. That worked but felt right at the edge — if the user had been slower to review, I would have been blocked a lot. Worth re-reading whether the rule is actually load-bearing or whether it's a soft cap I can stretch in cases of independent fixes.
- The arxiv-rag proposal is the longest design doc I've written this project. Felt good to commit decisions in writing before code. Should probably do this more often, especially for any researcher beyond #2.
- Memory system pulled its weight: the "always qualify Issue #N / PR #N" feedback memory got used multiple times this session without re-explanation. Worth saving more of these as they come up.
- I keep wanting to add tests that verify *production behavior* rather than mocked behavior. The docker test env makes this easier — if Phase 4 hardening adds "live integration smoke tests against a fixture corpus," that would catch a whole class of bugs the current mocked tests miss.
## What's Next
**Recommended pickup order for Session 3:**
1. **Decide whether to start Phase 3 stress testing or Phase 5 arxiv-rag first.** They're independent. Phase 3 is about validating the existing web researcher's contract under stress; Phase 5 is about adding a second researcher. If the goal is "ship V1.1," do Phase 3. If the goal is "make the tool actually useful for my arxiv reading," do Phase 5.
2. **If Phase 5 first** — start with Issue #38 (M5.1.1 ingest pipeline). It's the smallest visible win and unblocks everything else in the milestone. New deps: pymupdf, chromadb, sentence-transformers, arxiv. Branch: `feat/arxiv-rag-ingest`.
3. **If Phase 3 first** — start with Issue #44 (M3.1 single-axis stress tests). Run the four targeted queries, document trace_ids, file any contract gaps as new issues. Lightweight, no code changes (unless bugs surface).
4. **Either way** — Issue #1 is closed, V1 ships, the docker test env is reliable, and the cost ledger is in place. The instrumentation foundation for Phase 3 (logs + ledger + duration tracking) is solid. Phase 5 would benefit from Phase 3 first because stress tests will likely tighten the contract before a second researcher has to implement it. **Recommendation: Phase 3 first.**
5. **Housekeeping debt to fold into the next session:**
- Add `docs/` to `.gitignore` so it stops showing up in `git status`
- Re-test the venv install path on a clean shell now that the stale `~/.local/bin/marchwarden` has been removed
- Consider whether any of the Phase 7 "Advanced" bullets should be promoted to issues
6. **Pickup specifics for next session:** branch `main` is clean at `af79358`; all PRs from this session merged; no work in progress; tests at 123 passing; cost ledger has 4 entries totaling ~$0.30.

@ -5,3 +5,4 @@ Index of all session notes for Marchwarden development.
| Session | Date | Summary | Key Decisions |
|:---|:---|:---|:---|
| [Session 1](Session1) | 2026-04-08 | Project creation through Phase 1 complete | Name: marchwarden; Tavily; contract: raw_excerpt + categorized gaps + discovery_events + open_questions + confidence_factors + model_id; Phase 0-1 shipped (81 tests) |
| [Session 2](Session2) | 2026-04-08 | Phase 2 + Phase 2.5 shipped; V1 ships; arxiv-rag scoped | structlog over stdlib (OpenSearch-bound); synthesis uncapped by design; budget = soft cap on loop only; cost ledger supplements contract; arxiv-rag replaces grep file researcher in M5.1; Phase 3/4/5/6 milestones populated (123 tests) |