retro: Session 1 updated — Phase 0 + Phase 1 complete

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-08 14:43:51 -06:00 · 2026-04-08 14:43:51 -06:00 · ded78ff1ce
commit ded78ff1ce
parent 1f81471ce8
2 changed files with 64 additions and 41 deletions
--- a/Session1.md
+++ b/Session1.md
@ -1,95 +1,118 @@
 # Session 1 Notes — 2026-04-08

 ## What We Set Out to Do
-Create the Marchwarden project from scratch: name it, set up the repo, document the architecture and research contract, plan the development roadmap, and start Phase 0 implementation.
+Create the Marchwarden project from scratch: name it, set up the repo, document the architecture and research contract, plan the development roadmap, and implement through Phase 1.

 ## What Actually Happened

 ### Naming (longer than expected, worth it)
 Spent significant time finding the right name. Explored Latin (vestigia, sodalis, auspex), Greek (heuresis, gnomon, scholia, epoche), Arabic (isnad, rihla, ijtihad, tahqiq), and English compounds (marchwarden, lanternwake). Deep-dived gnomon vs rihla before landing on **marchwarden** — a guardian at the frontier of knowledge.

-The naming process revealed the user's style: evocative single-word names (harbormind, luminos), latinate/compound, not literal-descriptive.
-
 ### Repo and scaffolding
 - Created repo at `archeious/marchwarden` (initially created under `claude-code` by mistake, migrated)
 - Set up directory structure: `researchers/web/`, `orchestrator/`, `cli/`, `docs/wiki/`
 - Wiki pages: Architecture, ResearchContract, DevelopmentGuide, Roadmap
 - Issue #1 tracks V1 scope

-### Contract evolution (the real work of this session)
-The contract went through three revisions driven by architectural critique:
-
+### Contract evolution (three revisions)
 1. **Initial:** Simple `answer + citations + gaps + confidence + cost_metadata + trace_id`
 2. **Post-critique:** Added `raw_excerpt` (synthesis paradox fix), `discovery_events` (lateral metadata), categorized `gaps` (GapCategory enum), `confidence_factors` (auditable scoring)
-3. **Final:** Added `content_hash` in traces (pseudo-CAS), `model_id` in CostMetadata (fidelity-to-cost analysis)
+3. **Final:** Added `content_hash` in traces (pseudo-CAS), `model_id` in CostMetadata, `open_questions` (forward-looking follow-ups from the research itself)

-The user brought in external critique (Gemini analysis) which pushed the contract to higher fidelity. Good pattern — external review caught real architectural weaknesses.
+The user brought in external critique (Gemini analysis) which pushed the contract to higher fidelity.

-### Phase 0 implementation
+### Phase 0 — Foundation (complete)
 - M0.1: Tavily key verified (free tier, 1000 searches/month)
 - M0.2: All dependencies install clean
- M0.3: 8 Pydantic models implementing the full contract, 32 tests passing
+- M0.3: 9 Pydantic models implementing the full contract, 35 tests

-### Commits
- `f1e27e3` — Initial commit (auto-init)
- `deb124e` — Project scaffolding (dirs, README, pyproject.toml, CONTRIBUTING)
- `79becb2` — Fix README links (clone URL, issue URL)
+### Phase 1 — Web Researcher Core (complete)
+- M1.1: `tavily_search()` + `fetch_url()` with content hashing, 18 tests
+- M1.2: `TraceLogger` — JSONL audit logs per trace_id, 15 tests
+- M1.3: `WebResearcher` — Claude-driven tool-use loop (plan→search→fetch→iterate→synthesize), budget enforcement, fallback on synthesis failure, 9 tests
+- M1.4: FastMCP server wrapping WebResearcher as single `research` tool, 4 tests
+
+### Contract addition mid-implementation
+User identified a gap in the contract: no field for **forward-looking follow-up questions** that emerged from the research. Added `open_questions` (distinct from gaps=backward, discovery_events=sideways). Updated models, agent, tests, and wiki.
+
+### Commits (main, after merges)
+- `f1e27e3` — Initial commit
+- `deb124e` — Project scaffolding
+- `79becb2` — Fix README links
 - `6a8445e` — Fix wiki links to absolute URLs
- `1b0f863` — Contract models + 32 tests (feat/contract-models branch)
+- `8930f44` — Merge PR #2: M0.3 Contract models
+- `851fed6` — Merge PR #3: M1.1 Search and fetch tools
+- `21c8191` — Merge PR #4: M1.2 Trace logger
+- `ece2455` — Merge PR #5: M1.3 Inner agent loop
+- `f593dd0` — Merge PR #6: OpenQuestion contract addition
+- `7088f45` — Merge PR #7: M1.4 MCP server

 ## Key Decisions & Reasoning

-1. **Name: marchwarden** — Names the role (watcher at the frontier) not the tech. Immediately intuitive. Tolkien association exists but the word predates him. User preferred it over gnomon (vocabulary doesn't compound), rihla (Arabic baggage concern), and heuresis/scholia (less visceral).
+1. **Name: marchwarden** — Names the role (watcher at the frontier) not the tech. Tolkien association exists but the word predates him.

-2. **Tavily over SearXNG** — Initially planned SearXNG (self-hosted, fits homelab), but switched back to Tavily to reduce Phase 0 friction. SearXNG requires deploying a container; Tavily is `pip install + API key`. Can swap later since search is behind an internal abstraction.
+2. **Tavily over SearXNG** — Initially planned SearXNG (self-hosted, fits homelab), switched back to Tavily to reduce Phase 0 friction. Can swap later.

-3. **raw_excerpt on citations** — Prevents "Synthesis Paradox" where the PI synthesizes already-synthesized data, losing nuance. Every citation carries verbatim source text so the PI can verify claims against raw evidence.
+3. **raw_excerpt on citations** — Prevents "Synthesis Paradox" where the PI synthesizes already-summarized data, losing nuance.

-4. **Categorized gaps (GapCategory enum)** — Five categories (SOURCE_NOT_FOUND, ACCESS_DENIED, BUDGET_EXHAUSTED, CONTRADICTORY_SOURCES, SCOPE_EXCEEDED) drive different PI responses. Without categories, the PI can't distinguish "info doesn't exist" from "researcher ran out of budget."
+4. **Categorized gaps (GapCategory enum)** — Five categories drive different PI responses. Without categories, the PI can't distinguish "info doesn't exist" from "researcher ran out of budget."

-5. **discovery_events as lateral metadata** — MCP is request-response (no mid-flight notifications), so lateral findings are logged in the response for V2 orchestrator to process. Builds the nervous system for V2 without the complexity of streaming.
+5. **open_questions (new field)** — Gaps look backward (what failed), discovery events look sideways (what's lateral), open questions look forward (what needs deeper investigation). Added mid-session when user identified the gap.

-6. **Confidence: deferred calibration** — confidence_factors expose scoring inputs now; formal rubric after 20-30 real queries. Premature formalization would be false precision.
+6. **Confidence: deferred calibration** — `confidence_factors` expose scoring inputs; formal rubric after 20-30 real queries.

-7. **model_id in CostMetadata** — Enables comparing research quality across model tiers (Haiku vs Sonnet vs Opus). One string field, high value for calibration.
+7. **model_id in CostMetadata** — Enables comparing research quality across model tiers.

-8. **Secrets in ~/secrets, not .env** — User's established pattern across all projects. Noted in memory for future sessions.
+8. **Two-step agent architecture** — Tool-use loop for search/fetch (Claude decides what to do), then a separate synthesis call that produces structured JSON. Separating them makes the synthesis parseable and the tool loop flexible.
+
+9. **Secrets in ~/secrets, not .env** — User's established pattern. MCP server reads keys from there.

 ## Surprises & Discoveries

- **Repo created under wrong owner.** `mcp__gitea__create_repo` creates under the authenticated user (claude-code), not archeious. Had to create separately via REST API with admin token, then delete the claude-code copy.
+- **Repo created under wrong owner.** `mcp__gitea__create_repo` via MCP creates under the authenticated user (claude-code), not archeious. Need REST API with admin token for archeious repos.

- **Wiki links in README.** Relative `/wiki/Architecture` resolves against the Forgejo root, not the repo. Needed full absolute URLs. Wiki-to-wiki cross-links have the same issue.
+- **MCP merge permissions.** `claude-code` user can create PRs but can't merge. All merges go through REST API with `FORGEJO_API_TOKEN`.

- **The MCP request-response constraint is load-bearing.** It shapes the entire V2 architecture — no streaming progress, no mid-flight dispatch, no cancellation. Discovery events are the workaround. This will be the first thing to revisit if MCP adds streaming support.
+- **Wiki links in README.** Relative `/wiki/X` resolves against Forgejo root, not the repo. Need full absolute URLs.

- **External critique (Gemini) was genuinely useful.** Caught the synthesis paradox, the honesty assumption weakness, and the replay-vs-audit distinction. Cross-model review is a good pattern for architectural work.
+- **Working directory was /tmp.** Moved to `~/marchwarden` mid-session. Could have lost work on reboot.
+
+- **pytest-asyncio needed for async tests.** Not in the original dev deps. Installed separately.
+
+- **ResearchConstraints token_budget minimum is 1000.** Caught by a test that tried budget=800. The Pydantic model enforces `ge=1000`.

 ## Concerns & Open Threads

-1. **Branch not merged yet.** `feat/contract-models` is sitting on Forgejo. Need to create PR and merge before starting Phase 1.
+1. **No real end-to-end test yet.** All tests are mocked. Phase 2 (CLI shim + smoke test) will be the first live run hitting Tavily and Claude.

-2. **pyproject.toml still references `tavily-python`** but we haven't tested all deps together in a real agent loop yet. May discover version conflicts when we add the `anthropic` SDK agent loop.
+2. **The synthesis prompt is fragile.** It asks Claude to produce exact JSON. If the model adds markdown fences or extra text, we strip them. If it produces structurally wrong JSON, we fall back. The fallback is valid but useless. Need to see how this performs with real queries.

-3. **The agent loop design (M1.3) is the hard part.** Models and tools are mechanical; the inner loop that decides when to search again, when to stop, how to populate confidence_factors — that's where the LLM prompt engineering lives. No amount of architecture prevents a bad prompt from producing bad research.
+3. **Token counting is approximate.** We track `input_tokens + output_tokens` from the Anthropic response, but Tavily API calls also cost money (not tokens, but API credits). `tokens_used` in CostMetadata only reflects Claude tokens, not Tavily credits.

-4. **Token counting.** `cost_metadata.tokens_used` needs actual token tracking across the Claude API and Tavily calls. The anthropic SDK provides usage info; Tavily may not. Might need to estimate.
+4. **HTML extraction is naive.** `_extract_text()` uses regex to strip tags. Works for simple pages, will produce garbage on JavaScript-heavy sites. Tavily's `raw_content` is usually better. The fallback `fetch_url` path is where this matters.

-5. **Trace directory permissions.** `~/.marchwarden/traces/` needs to be created on first run. Should handle gracefully.
+5. **No `pytest-asyncio` in pyproject.toml dev deps.** Should add it.

 ## Raw Thinking

- The two-agent nesting (researcher inside MCP, called by PI/CLI) is where the real learning happens. The contract and models are scaffolding; the agent loop is the education.
+- The agent loop (M1.3) is where all the real complexity lives. The MCP server is just plumbing. The CLI will be plumbing too. The quality of Marchwarden lives or dies on the system prompt and the synthesis prompt.

- The user brings in external AI critique naturally (Gemini analysis). This is a productive pattern — use it. Different models catch different things.
+- The `open_questions` addition was the user's idea, not mine. It's the most architecturally significant addition since the original contract. It gives the PI a forward-looking signal that gaps and discoveries don't provide. Good instinct.

- The "build real things for education" philosophy means V1 needs to actually work well enough to be useful, not just pass tests. The smoke test (Utah crops) will be the moment of truth.
+- The user's pattern of bringing in external AI critique (Gemini) and then integrating the feedback is productive. Different models catch different architectural weaknesses.

- Consider: should the trace logger be shared infrastructure (not just in `researchers/web/`) since every future researcher will need it? Might refactor to a top-level `marchwarden/trace.py` module. Not now — wait until the second researcher exists.
+- Working copy should probably be in a more permanent location from the start in future projects. `/tmp` was a mistake.

 ## What's Next

-1. Create PR for `feat/contract-models`, merge to main
-2. Start Phase 1: M1.1 (Tavily search + URL fetch tools)
-3. Then M1.2 (trace logger), M1.3 (agent loop), M1.4 (MCP server)
-4. The agent loop (M1.3) is the critical path — everything else is plumbing
+**Phase 2: CLI Shim** — the path to the first smoke test.
+
+1. **M2.1 — `ask` command** — `marchwarden ask "question"` with pretty-printed output
+2. **M2.2 — `replay` command** — `marchwarden replay <trace_id>`
+3. **M2.3 — First smoke test** — "What are ideal crops for a garden in Utah?" end-to-end
+
+**Recommended first action next session:**
+- Branch: create `feat/cli-shim` from main
+- File: `cli/main.py`
+- First task: implement the `ask` command with Click
+- Current state: main is clean at `7088f45`, 81 tests passing
--- a/SessionRetrospectives.md
+++ b/SessionRetrospectives.md
@ -4,4 +4,4 @@ Index of all session notes for Marchwarden development.

 | Session | Date | Summary | Key Decisions |
 |:---|:---|:---|:---|
-| [Session 1](Session1) | 2026-04-08 | Project creation, naming, contract design, Phase 0 complete | Name: marchwarden; Tavily over SearXNG; raw_excerpt + categorized gaps + discovery_events in contract; confidence calibration deferred |
+| [Session 1](Session1) | 2026-04-08 | Project creation through Phase 1 complete | Name: marchwarden; Tavily; contract: raw_excerpt + categorized gaps + discovery_events + open_questions + confidence_factors + model_id; Phase 0-1 shipped (81 tests) |