2 Session1
Jeff Smith edited this page 2026-04-08 14:43:51 -06:00

Session 1 Notes — 2026-04-08

What We Set Out to Do

Create the Marchwarden project from scratch: name it, set up the repo, document the architecture and research contract, plan the development roadmap, and implement through Phase 1.

What Actually Happened

Naming (longer than expected, worth it)

Spent significant time finding the right name. Explored Latin (vestigia, sodalis, auspex), Greek (heuresis, gnomon, scholia, epoche), Arabic (isnad, rihla, ijtihad, tahqiq), and English compounds (marchwarden, lanternwake). Deep-dived gnomon vs rihla before landing on marchwarden — a guardian at the frontier of knowledge.

Repo and scaffolding

  • Created repo at archeious/marchwarden (initially created under claude-code by mistake, migrated)
  • Set up directory structure: researchers/web/, orchestrator/, cli/, docs/wiki/
  • Wiki pages: Architecture, ResearchContract, DevelopmentGuide, Roadmap
  • Issue #1 tracks V1 scope

Contract evolution (three revisions)

  1. Initial: Simple answer + citations + gaps + confidence + cost_metadata + trace_id
  2. Post-critique: Added raw_excerpt (synthesis paradox fix), discovery_events (lateral metadata), categorized gaps (GapCategory enum), confidence_factors (auditable scoring)
  3. Final: Added content_hash in traces (pseudo-CAS), model_id in CostMetadata, open_questions (forward-looking follow-ups from the research itself)

The user brought in external critique (Gemini analysis) which pushed the contract to higher fidelity.
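
For reference, the field set after the three revisions can be sketched roughly as below. This is a stand-in using stdlib dataclasses rather than the project's Pydantic models. The field names come from the notes above; every type, the class names other than GapCategory and CostMetadata, and the two GapCategory member names are assumptions made for illustration. The content_hash field is left out because it lives in the trace entries, not in the result.

```python
from dataclasses import dataclass, field
from enum import Enum


class GapCategory(Enum):
    # The real enum has five members. Only these two cases are described
    # in the notes, and the member names here are hypothetical.
    SOURCE_NOT_FOUND = "source_not_found"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass
class Citation:
    source: str       # assumed field name
    raw_excerpt: str  # verbatim source text, so the PI is not summarizing a summary


@dataclass
class Gap:
    topic: str        # assumed field name
    category: GapCategory


@dataclass
class CostMetadata:
    tokens_used: int
    model_id: str     # lets results be compared across model tiers


@dataclass
class ResearchResult:  # hypothetical class name
    answer: str
    trace_id: str
    confidence: float
    cost_metadata: CostMetadata
    citations: list[Citation] = field(default_factory=list)
    gaps: list[Gap] = field(default_factory=list)                      # backward
    discovery_events: list[str] = field(default_factory=list)          # sideways
    open_questions: list[str] = field(default_factory=list)            # forward
    confidence_factors: dict[str, float] = field(default_factory=dict)
```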

Phase 0 — Foundation (complete)

  • M0.1: Tavily key verified (free tier, 1000 searches/month)
  • M0.2: All dependencies install clean
  • M0.3: 9 Pydantic models implementing the full contract, 35 tests

Phase 1 — Web Researcher Core (complete)

  • M1.1: tavily_search() + fetch_url() with content hashing, 18 tests
  • M1.2: TraceLogger — JSONL audit logs per trace_id, 15 tests
  • M1.3: WebResearcher — Claude-driven tool-use loop (plan→search→fetch→iterate→synthesize), budget enforcement, fallback on synthesis failure, 9 tests
  • M1.4: FastMCP server wrapping WebResearcher as single research tool, 4 tests
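
A minimal sketch of how M1.1's content hashing and M1.2's per-trace JSONL logging could fit together. It is not the project's actual code: SHA-256 as the hash, the file layout (one `<trace_id>.jsonl` per trace), the method names, and the entry fields are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """Hash fetched content so a trace can show what was actually read."""
    # SHA-256 is an assumption; the notes only say "content hashing".
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TraceLogger:
    """Append-only audit log: one JSON object per line, one file per trace_id."""

    def __init__(self, trace_id: str, log_dir: Path) -> None:
        self.trace_id = trace_id
        self.path = Path(log_dir) / f"{trace_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, action: str, **fields) -> None:
        entry = {"trace_id": self.trace_id, "action": action, **fields}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def read(self) -> list[dict]:
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]
```

A fetch step would then log something like `logger.log("fetch_url", url=url, content_hash=content_hash(body))`, which gives a replay tool a way to detect that a page changed since the original run.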

Contract addition mid-implementation

User identified a gap in the contract: no field for forward-looking follow-up questions that emerged from the research. Added open_questions (distinct from gaps=backward, discovery_events=sideways). Updated models, agent, tests, and wiki.

Commits (main, after merges)

  • f1e27e3 — Initial commit
  • deb124e — Project scaffolding
  • 79becb2 — Fix README links
  • 6a8445e — Fix wiki links to absolute URLs
  • 8930f44 — Merge PR #2: M0.3 Contract models
  • 851fed6 — Merge PR #3: M1.1 Search and fetch tools
  • 21c8191 — Merge PR #4: M1.2 Trace logger
  • ece2455 — Merge PR #5: M1.3 Inner agent loop
  • f593dd0 — Merge PR #6: OpenQuestion contract addition
  • 7088f45 — Merge PR #7: M1.4 MCP server

Key Decisions & Reasoning

  1. Name: marchwarden — Names the role (watcher at the frontier) not the tech. Tolkien association exists but the word predates him.

  2. Tavily over SearXNG — Initially planned SearXNG (self-hosted, fits homelab), switched back to Tavily to reduce Phase 0 friction. Can swap later.

  3. raw_excerpt on citations — Prevents "Synthesis Paradox" where the PI synthesizes already-summarized data, losing nuance.

  4. Categorized gaps (GapCategory enum) — Five categories drive different PI responses. Without categories, the PI can't distinguish "info doesn't exist" from "researcher ran out of budget."

  5. open_questions (new field) — Gaps look backward (what failed), discovery events look sideways (what's lateral), open questions look forward (what needs deeper investigation). Added mid-session when user identified the gap.

  6. Confidence: calibration deferred. The confidence_factors field exposes the scoring inputs; a formal rubric follows after 20-30 real queries.

  7. model_id in CostMetadata — Enables comparing research quality across model tiers.

  8. Two-step agent architecture — Tool-use loop for search/fetch (Claude decides what to do), then a separate synthesis call that produces structured JSON. Separating them makes the synthesis parseable and the tool loop flexible.

  9. Secrets in ~/secrets, not .env — User's established pattern. MCP server reads keys from there.
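
The two-step architecture from decision 8 can be sketched as follows. The model calls are replaced by plain callables so the control flow is visible without an API key. The function name, its parameters, and the shape of the decision tuple are all hypothetical, not the WebResearcher interface.

```python
import json
from typing import Callable, Optional

# (tool name or None to stop, tool argument, tokens spent on this decision)
Decision = tuple[Optional[str], str, int]


def run_research(
    question: str,
    decide: Callable[[str, list[str]], Decision],
    tools: dict[str, Callable[[str], str]],
    synthesize: Callable[[str, list[str]], str],
    token_budget: int,
) -> dict:
    evidence: list[str] = []
    tokens_used = 0

    # Step 1: tool-use loop. The model picks the next tool until it stops
    # or the budget is spent. The check runs before each call, so the last
    # call may overshoot the budget.
    while tokens_used < token_budget:
        tool_name, argument, cost = decide(question, evidence)
        tokens_used += cost
        if tool_name is None:
            break
        evidence.append(tools[tool_name](argument))

    # Step 2: one separate call whose only job is to emit structured JSON.
    result = json.loads(synthesize(question, evidence))
    result["tokens_used"] = tokens_used
    return result
```

Keeping step 2 separate means the loop can take as many free-form turns as it needs, while the parser only ever sees the output of a single call with a single job.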

Surprises & Discoveries

  • Repo created under wrong owner. mcp__gitea__create_repo via MCP creates under the authenticated user (claude-code), not archeious. Need REST API with admin token for archeious repos.

  • MCP merge permissions. claude-code user can create PRs but can't merge. All merges go through REST API with FORGEJO_API_TOKEN.

  • Wiki links in README. Relative /wiki/X resolves against Forgejo root, not the repo. Need full absolute URLs.

  • Working directory was /tmp. Moved to ~/marchwarden mid-session. Could have lost work on reboot.

  • pytest-asyncio needed for async tests. Not in the original dev deps. Installed separately.

  • ResearchConstraints token_budget minimum is 1000. Caught by a test that tried budget=800. The Pydantic model enforces ge=1000.
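
The wiki-link surprise above follows from standard URL reference resolution, which Python's urljoin reproduces. The host name below is a placeholder, and Forgejo's README renderer may differ in detail; this only illustrates why a leading slash escapes the repo path.

```python
from urllib.parse import urljoin

# Hypothetical host; the owner/repo path matches the notes.
REPO_PAGE = "https://forgejo.example/archeious/marchwarden/"

# A leading slash resolves against the server root and drops owner/repo.
root_relative = urljoin(REPO_PAGE, "/wiki/Architecture")

# Without the leading slash the link stays under the repo path.
repo_relative = urljoin(REPO_PAGE, "wiki/Architecture")

print(root_relative)  # https://forgejo.example/wiki/Architecture
print(repo_relative)  # https://forgejo.example/archeious/marchwarden/wiki/Architecture
```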

Concerns & Open Threads

  1. No real end-to-end test yet. All tests are mocked. Phase 2 (CLI shim + smoke test) will be the first live run hitting Tavily and Claude.

  2. The synthesis prompt is fragile. It asks Claude to produce exact JSON. If the model adds markdown fences or extra text, we strip them. If it produces structurally wrong JSON, we fall back. The fallback is valid but useless. Need to see how this performs with real queries.

  3. Token counting is incomplete as a cost measure. We track input_tokens + output_tokens from the Anthropic response, but Tavily API calls also cost money (API credits, not tokens). tokens_used in CostMetadata reflects only Claude tokens, not Tavily credits.

  4. HTML extraction is naive. _extract_text() uses regex to strip tags. Works for simple pages, will produce garbage on JavaScript-heavy sites. Tavily's raw_content is usually better. The fallback fetch_url path is where this matters.

  5. No pytest-asyncio in pyproject.toml dev deps. Should add it.
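
The strip-then-fall-back behavior in concern 2 can be sketched like this. It is an approximation of the idea, not the project's parser: slicing from the first brace to the last is one simple way to drop both markdown fences and surrounding chatter, and the function name, the required-key check, and the fallback shape are assumptions.

```python
import json


def parse_synthesis(raw: str, required: tuple[str, ...] = ("answer",)) -> dict:
    """Parse the synthesis output, tolerating fences and extra text."""
    # Hypothetical fallback shape: structurally valid, but carries no findings.
    fallback = {"answer": "", "fallback": True}

    # Keep only the outermost {...} span, which discards markdown fences
    # and any prose the model wrapped around the JSON.
    text = raw.strip()
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return fallback

    # Structurally wrong JSON (not an object, or missing keys) also falls back.
    if not isinstance(data, dict) or any(key not in data for key in required):
        return fallback
    return data
```

Counting how often the fallback branch fires on real queries would put a number on how fragile the synthesis prompt actually is.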

Raw Thinking

  • The agent loop (M1.3) is where all the real complexity lives. The MCP server is just plumbing. The CLI will be plumbing too. The quality of Marchwarden lives or dies on the system prompt and the synthesis prompt.

  • The open_questions addition was the user's idea, not mine. It's the most architecturally significant addition since the original contract. It gives the PI a forward-looking signal that gaps and discoveries don't provide. Good instinct.

  • The user's pattern of bringing in external AI critique (Gemini) and then integrating the feedback is productive. Different models catch different architectural weaknesses.

  • Working copy should probably be in a more permanent location from the start in future projects. /tmp was a mistake.

What's Next

Phase 2: CLI Shim — the path to the first smoke test.

  1. M2.1 — ask command: marchwarden ask "question" with pretty-printed output
  2. M2.2 — replay command: marchwarden replay <trace_id>
  3. M2.3 — First smoke test — "What are ideal crops for a garden in Utah?" end-to-end

Recommended first action next session:

  • Branch: create feat/cli-shim from main
  • File: cli/main.py
  • First task: implement the ask command with Click
  • Current state: main is clean at 7088f45, 81 tests passing