Table of Contents

Session 2 Notes — 2026-04-08

What We Set Out to Do
What Actually Happened
Key Decisions & Reasoning
Surprises & Discoveries
Concerns & Open Threads
Raw Thinking
What's Next

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Session 2 Notes — 2026-04-08

What We Set Out to Do

Ship Phase 2 (CLI shim) — marchwarden ask, marchwarden replay, and a first end-to-end smoke test against a real question. Roadmap goal: prove the V1 stack works end to end, document the run, and close out the V1 ship target (Issue #1).

What Actually Happened

The session went much further than planned. Phase 2 shipped, then five real bugs were caught by the smoke test, then Phase 2.5 (logging + cost tracking) was scoped and shipped, then a depth-presets fix, and finally a complete roadmap reorganization with milestones for Phases 3/4/5/6 and a fully designed Phase 5 (arxiv-rag researcher).

In rough order:

Phase 2 — CLI shim shipped (M2.1, M2.2, M2.3 → PRs #11, #12, smoke run on chore/smoke-test-1)
- marchwarden ask with rich rendering, MCP stdio plumbing
- marchwarden replay with the same rendering style for trace JSONL
- First smoke test against "What are ideal crops for a garden in Utah?"
Smoke test caught five real bugs. All filed as separate issues, fixed in dependency order, each on its own PR before the smoke-test branch could land:
- Issue #15 / PR #21 — invalid default model id (claude-sonnet-4-5-20250514 404s; → claude-sonnet-4-6)
- Issue #16 / PR #20 — synthesis JSON parse failure (max_tokens=4096 cut JSON mid-string; bumped to 16384)
- Issue #17 / PR #22 — token_budget enforced post-hoc; moved check to top of iteration loop
- Issue #18 / PR #23 — MCP stdio client doesn't propagate parent env to subprocess; pass env=os.environ.copy()
- Issue #19 / PR #20 — trace logger truncated raw_response to 1000 chars, hiding the synthesis bug; fix bundled with #16
Docker test environment (Issue #13 / PR #14) — created mid-stream when host install drift made the smoke test impossible to repeat. Reproducible Python 3.12-slim image, mounts ~/secrets ro and ~/.marchwarden rw, used for every subsequent live verification this session.
Phase 2.5 invented mid-session — user asked for "well-structured logging that records activity and costs," distinct from the JSONL trace. Filed as new milestones #24/#25/#26 → PRs #27/#28/#29:
- M2.5.1: Structured logger via structlog — separate from trace JSONL, JSON+console renderers, contextvar binding for trace_id, stderr-only so MCP stdio stays clean
- M2.5.2: Cost ledger — append-only JSONL at ~/.marchwarden/costs.jsonl, per-call entries with token split / Tavily count / estimated USD; price table at ~/.marchwarden/prices.toml auto-seeded
- M2.5.3: marchwarden costs CLI — rich summary panels (per-day, per-model, highest-cost call), --since/--model/--json filters
Per-step trace logging — user asked the operational log to "increase resolution"; mirrored every TraceLogger.log_step call into structlog with curated INFO/DEBUG levels (PR #32). Then added duration_ms per step and total_duration_sec on the terminal complete step (Issue #35 / PR #36) — first time the synthesis-is-58%-of-wall-time fact was visible at a glance.
Depth presets (Issue #30 / PR #33) — --depth shallow|balanced|deep was previously cosmetic; now drives max_iterations / token_budget / max_sources defaults. balanced preserves the historical defaults for backward compatibility.
Local dev quality of life:
- User Guide written and published (docs/wiki/UserGuide.md)
- Home.md created as wiki landing page
- Makefile (PR #34) — make install/test/ask/costs/clean plus docker wrappers
- Display fix: "Budget exhausted: True" → "Budget status: spent / under cap" (PR #31), because the original wording read as a failure indicator on every successful run
Phase 2.5 → Phase 3 transition: roadmap restructure — user wanted RAG over their arXiv reading list. We:
- Discussed RAG vs grep-based retrieval, settled on the user-curated reading list path
- Drafted a full implementation proposal (docs/wiki/ArxivRagProposal.md) with locked-in design defaults: pymupdf, chromadb, nomic-embed local embeddings, section-level chunking
- Filed Issue #37 as the umbrella tracker
- Replaced original M5.1 (grep file researcher) with arxiv-rag in the roadmap
- Filed M5.1.1–M5.1.6 sub-issues (#38–#43)
- Created Forgejo milestones for Phases 3/4/5/6 and filed the remaining Phase 3/4/5/6 issues (#44–#52)

Key Decisions & Reasoning

Fix smoke-test bugs in dependency order, not just batch them. Synthesis-truncation (#19) was hiding the synthesis-parse bug (#16); fixing #19 first made #16 diagnosable from the trace alone. Order mattered.
Synthesis tokens are uncapped by design. When fixing the budget enforcement (Issue #17), capping synthesis would have re-created the synthesis-stub failure mode the project just fixed. Decision: token_budget is a soft cap on the tool-use loop only; synthesis always completes. Documented as a code comment so future-me doesn't try to "fix" it again.
structlog over stdlib logging for the operational logger. Initial recommendation was stdlib + RichHandler. User asked: "since these logs will eventually be consumed by OpenSearch, would you still recommend stdlib?" — that flipped the decision instantly to structlog. Worth re-emphasizing: the destination of logs is a load-bearing design input, not a deployment detail.
Ledger separate from contract cost_metadata. Cost ledger supplements (does not replace) the per-call cost_metadata field returned in ResearchResult. The contract field is for callers; the ledger is for operators. They serve different audiences.
Mirror the trace logger into the operational logger instead of adding duplicate log calls everywhere. One source of truth for "what happened in this research call" — the trace step list. The structlog mirror gets every step for free, with trace_id/researcher already bound in contextvars. Curated INFO/DEBUG levels keep the noise manageable.
arxiv-rag replaces, not supplements, the original grep-based file researcher in M5.1. The grep researcher was a placeholder ("any second researcher") with no concrete user demand. The arxiv-rag has both: a concrete user (the project owner has an arXiv reading list) and a more interesting contract stress test (RAG vs web search has very different evidence properties). The grep researcher is demoted to "future ideas" in the roadmap, can be revived if a private file corpus shows up.
Issue #1 (V1 ship) was held open until the smoke test ran green. Originally I was going to merge M2.3 with the buggy Utah crops run documented as "first attempt"; instead we held the milestone open, fixed every bug surfaced, and re-ran. The green run (trace 1f37b795..., confidence 0.91, 10 citations) was what closed Issue #1.

Surprises & Discoveries

The default max_tokens=4096 for synthesis was way too small. A real ResearchResult JSON over 28 sources easily exceeds 8k tokens once you include citations, raw_excerpts, gaps, discovery_events, and open_questions. The model was correctly producing the JSON; Anthropic was cutting it off mid-string. Bumped to 16384. This is the kind of bug that's invisible without an end-to-end real-corpus smoke test — unit tests pass perfectly while production silently degrades.
The trace logger was truncating its own diagnostic data. raw_response=raw_text[:1000] was cutting off exactly the part of the JSON where the parse error was. The bug was effectively invisible from traces — a logging system that hides the symptom of what it's logging is worse than no logging. Permanent reminder: never truncate diagnostic fields without thinking about who's going to read them.
MCP StdioServerParameters does NOT propagate parent env by default. This is non-obvious, undocumented, and silent. MARCHWARDEN_MODEL set on the CLI process was being eaten before the server subprocess saw it. Fix is one line (env=os.environ.copy()) but discovering it required printing env from inside the container.
Synthesis is 58% of wall time. The duration tracking work surfaced this immediately on the first real run. Worth keeping in mind for any future optimization — the tool-use loop is fast; the synthesis call dominates.
Budget exhausted: True reads as a failure indicator even when it's expected. User was alarmed that "every run exhausts the budget" — and that's correct, the agent is supposed to spend its budget. The wording was the problem, not the behavior. Renamed to "Budget status: spent / under cap" in the display. Lesson: contract field names ≠ user-facing labels, and the difference matters.
Bash CWD persistence got me twice. Once when bash carried docs/wiki as cwd into a project commit (committed on feat/step-durations instead of main, recoverable). Once when a stash-pop after branch switching got the uncommitted env-fix on the wrong branch. Need to be more deliberate about absolute paths or cd at the start of every chained command, not the middle.
claude-code Forgejo user couldn't create milestones. Hit a 403 mid-task; fell back to REST with FORGEJO_API_TOKEN. User then promoted claude-code to admin so future sessions don't need the fallback. Saved as a reference_forgejo_admin.md memory.

Concerns & Open Threads

The cost ledger doesn't include synthesis tokens in its budget enforcement. The displayed tokens_used (e.g., 44k) regularly exceeds the token_budget (20k) because synthesis is intentionally uncapped. This is documented and intentional, but it means the user-facing concept of "budget" differs from the mechanical concept. If we ever want strict total-spend control, we'll need to either reserve a fraction of the budget for synthesis up-front or split into loop_budget + synthesis_budget. This will be louder once people start sharing cost reports.
Phase 3 stress tests are going to surface contract issues that might invalidate the calibration data we collect. Stress testing assumes the contract is stable; if we discover a missing gap category mid-Phase-3, we'll either need to retrofit older runs or restart calibration. Worth anticipating.
arxiv-rag is a much bigger leap than the grep-based file researcher would have been. Six sub-milestones, four new dependencies (pymupdf, chromadb, sentence-transformers, arxiv), section-detection heuristics, vector store lifecycle management. The "prove the contract works across researcher types" goal would have been satisfied by the simpler grep researcher first. We chose the harder path because the user has concrete demand for it; that's the right call but it means Phase 5 takes longer.
The ArxivRagProposal commits to local embeddings (nomic-embed-text-v1.5) for v1. That's the cheapest path but also the one most likely to need a do-over if quality is poor. The chunk-id-includes-embedding-model design means a re-ingest is non-destructive, but the operator-facing pain (re-ingest 50 papers) is real.
Two unsolved bugs from the smoke test era live as comments, not issues: the synthesis "uncapped by design" decision and the env-propagation broad-vs-whitelist tension are both noted in code comments but not tracked. Might bite us in code review later.
docs/ is in the project repo's git status as untracked the entire time. It IS the wiki clone — gitignore should probably exclude it explicitly so it stops appearing in git status.
The session burned through ~20 PRs, all merged. That's a lot of merge churn for one session. The branches were each focused, but the rate is unusually high. Worth retrospecting on whether some of these (e.g., the display-clarity fix, the docker env fix) could have ridden along with adjacent work without losing the "one branch one concern" discipline.

Raw Thinking

The pattern of "smoke test catches bugs the unit tests missed" is going to repeat. Stress testing in Phase 3 is going to catch more. Worth psyching myself up for it.
The fact that synthesis takes 27s out of 47s wall time for a shallow query is really striking. Most of that is the model generating a long JSON output. There's an obvious optimization: stream the synthesis JSON and parse incrementally, fail fast on schema violations. Not in scope for any current milestone but it's the biggest single performance lever.
The "two-branch maximum" rule from CLAUDE.md got tested constantly this session. Almost every PR was one of two open. That worked but felt right at the edge — if the user had been slower to review, I would have been blocked a lot. Worth re-reading whether the rule is actually load-bearing or whether it's a soft cap I can stretch in cases of independent fixes.
The arxiv-rag proposal is the longest design doc I've written this project. Felt good to commit decisions in writing before code. Should probably do this more often, especially for any researcher beyond #2.
Memory system pulled its weight: the "always qualify Issue #N / PR #N" feedback memory got used multiple times this session without re-explanation. Worth saving more of these as they come up.
I keep wanting to add tests that verify production behavior rather than mocked behavior. The docker test env makes this easier — if Phase 4 hardening adds "live integration smoke tests against a fixture corpus," that would catch a whole class of bugs the current mocked tests miss.

What's Next

Recommended pickup order for Session 3:

Decide whether to start Phase 3 stress testing or Phase 5 arxiv-rag first. They're independent. Phase 3 is about validating the existing web researcher's contract under stress; Phase 5 is about adding a second researcher. If the goal is "ship V1.1," do Phase 3. If the goal is "make the tool actually useful for my arxiv reading," do Phase 5.
If Phase 5 first — start with Issue #38 (M5.1.1 ingest pipeline). It's the smallest visible win and unblocks everything else in the milestone. New deps: pymupdf, chromadb, sentence-transformers, arxiv. Branch: feat/arxiv-rag-ingest.
If Phase 3 first — start with Issue #44 (M3.1 single-axis stress tests). Run the four targeted queries, document trace_ids, file any contract gaps as new issues. Lightweight, no code changes (unless bugs surface).
Either way — Issue #1 is closed, V1 ships, the docker test env is reliable, and the cost ledger is in place. The instrumentation foundation for Phase 3 (logs + ledger + duration tracking) is solid. Phase 5 would benefit from Phase 3 first because stress tests will likely tighten the contract before a second researcher has to implement it. Recommendation: Phase 3 first.
Housekeeping debt to fold into the next session:
- Add docs/ to .gitignore so it stops showing up in git status
- Re-test the venv install path on a clean shell now that the stale ~/.local/bin/marchwarden has been removed
- Consider whether any of the Phase 7 "Advanced" bullets should be promoted to issues
Pickup specifics for next session: branch main is clean at af79358; all PRs from this session merged; no work in progress; tests at 123 passing; cost ledger has 4 entries totaling ~$0.30.