Session 1 Notes — 2026-04-08

What We Set Out to Do

Create the Marchwarden project from scratch: name it, set up the repo, document the architecture and research contract, plan the development roadmap, and implement through Phase 1.

What Actually Happened

Naming (longer than expected, worth it)

Spent significant time finding the right name. Explored Latin (vestigia, sodalis, auspex), Greek (heuresis, gnomon, scholia, epoche), Arabic (isnad, rihla, ijtihad, tahqiq), and English compounds (marchwarden, lanternwake). Deep-dived gnomon vs rihla before landing on marchwarden — a guardian at the frontier of knowledge.

Repo and scaffolding

Created repo at archeious/marchwarden (initially created under claude-code by mistake, migrated)
Set up directory structure: researchers/web/, orchestrator/, cli/, docs/wiki/
Wiki pages: Architecture, ResearchContract, DevelopmentGuide, Roadmap
Issue #1 tracks V1 scope

Contract evolution (three revisions)

Initial: Simple answer + citations + gaps + confidence + cost_metadata + trace_id
Post-critique: Added raw_excerpt (synthesis paradox fix), discovery_events (lateral metadata), categorized gaps (GapCategory enum), confidence_factors (auditable scoring)
Final: Added content_hash in traces (pseudo-CAS), model_id in CostMetadata, open_questions (forward-looking follow-ups from the research itself)

The user brought in external critique (Gemini analysis) which pushed the contract to higher fidelity.

Phase 0 — Foundation (complete)

M0.1: Tavily key verified (free tier, 1000 searches/month)
M0.2: All dependencies install clean
M0.3: 9 Pydantic models implementing the full contract, 35 tests

Phase 1 — Web Researcher Core (complete)

M1.1: tavily_search() + fetch_url() with content hashing, 18 tests
M1.2: TraceLogger — JSONL audit logs per trace_id, 15 tests
M1.3: WebResearcher — Claude-driven tool-use loop (plan→search→fetch→iterate→synthesize), budget enforcement, fallback on synthesis failure, 9 tests
M1.4: FastMCP server wrapping WebResearcher as single research tool, 4 tests
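The content hashing from M1.1 and the per-trace JSONL logs from M1.2 could look roughly like the sketch below. This is an illustration under assumptions, not the project's code: SHA-256, the "sha256:" prefix, the file layout, the entry fields, and the log_step/read_entries method names are all hypothetical. Only the ideas of hashing fetched content and writing one JSONL log per trace_id come from the notes.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def content_hash(text: str) -> str:
    """Hash fetched content so a trace records exactly what was read.

    SHA-256 is an assumption; the notes only say "content hashing".
    """
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


class TraceLogger:
    """Append-only JSONL log, one file per trace_id (hypothetical layout)."""

    def __init__(self, trace_id: str, trace_dir: Path) -> None:
        self.trace_id = trace_id
        trace_dir.mkdir(parents=True, exist_ok=True)
        self.path = trace_dir / f"{trace_id}.jsonl"
        self._step = 0

    def log_step(self, action: str, **fields: object) -> dict:
        """Append one JSON object as a single line and return it."""
        self._step += 1
        entry = {
            "trace_id": self.trace_id,
            "step": self._step,
            "action": action,
            "timestamp": time.time(),
            **fields,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def read_entries(self) -> list[dict]:
        """Load the trace back, for example to replay it later."""
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        logger = TraceLogger("demo-trace", Path(tmp))
        logger.log_step("search", query="utah garden crops")
        logger.log_step(
            "fetch_url",
            url="https://example.com",
            content_hash=content_hash("<html>example</html>"),
        )
        for entry in logger.read_entries():
            print(entry["step"], entry["action"])
```

Appending one line per step keeps the log readable even if the process dies mid-run, which is one reason JSONL suits audit trails.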

Contract addition mid-implementation

User identified a gap in the contract: no field for forward-looking follow-up questions that emerged from the research. Added open_questions (distinct from gaps=backward, discovery_events=sideways). Updated models, agent, tests, and wiki.

Commits (main, after merges)

f1e27e3 — Initial commit
deb124e — Project scaffolding
79becb2 — Fix README links
6a8445e — Fix wiki links to absolute URLs
8930f44 — Merge PR #2: M0.3 Contract models
851fed6 — Merge PR #3: M1.1 Search and fetch tools
21c8191 — Merge PR #4: M1.2 Trace logger
ece2455 — Merge PR #5: M1.3 Inner agent loop
f593dd0 — Merge PR #6: OpenQuestion contract addition
7088f45 — Merge PR #7: M1.4 MCP server

Key Decisions & Reasoning

Name: marchwarden — Names the role (watcher at the frontier) not the tech. Tolkien association exists but the word predates him.
Tavily over SearXNG — Initially planned SearXNG (self-hosted, fits homelab), switched back to Tavily to reduce Phase 0 friction. Can swap later.
raw_excerpt on citations — Prevents "Synthesis Paradox" where the PI synthesizes already-summarized data, losing nuance.
Categorized gaps (GapCategory enum) — Five categories drive different PI responses. Without categories, the PI can't distinguish "info doesn't exist" from "researcher ran out of budget."
open_questions (new field) — Gaps look backward (what failed), discovery events look sideways (what's lateral), open questions look forward (what needs deeper investigation). Added mid-session when user identified the gap.
Confidence: deferred calibration — confidence_factors expose scoring inputs; formal rubric after 20-30 real queries.
model_id in CostMetadata — Enables comparing research quality across model tiers.
Two-step agent architecture — Tool-use loop for search/fetch (Claude decides what to do), then a separate synthesis call that produces structured JSON. Separating them makes the synthesis parseable and the tool loop flexible.
Secrets in ~/secrets, not .env — User's established pattern. MCP server reads keys from there.
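The two-step agent architecture described above (a budgeted tool-use loop followed by a separate synthesis call) can be sketched with plain callables standing in for the Claude calls. Everything named here is hypothetical: run_research, LoopState, the step and synthesize signatures, and max_steps are illustrative and not taken from the project. The budget check runs before each step, so a single step can overshoot the budget.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    evidence: list[str] = field(default_factory=list)
    tokens_used: int = 0


# A step returns (new_evidence, tokens_spent). Evidence of None means the
# model has decided it is done searching and fetching.
Step = Callable[[str, LoopState], tuple[Optional[str], int]]
# Synthesis is a separate call that turns the evidence into structured output.
Synthesize = Callable[[str, list[str]], dict]


def run_research(
    question: str,
    step: Step,
    synthesize: Synthesize,
    token_budget: int,
    max_steps: int = 10,
) -> dict:
    state = LoopState()

    # Phase 1: tool-use loop. Stops on budget, step limit, or a "done" signal.
    for _ in range(max_steps):
        if state.tokens_used >= token_budget:
            break
        evidence, spent = step(question, state)
        state.tokens_used += spent
        if evidence is None:
            break
        state.evidence.append(evidence)

    # Phase 2: one synthesis call over everything that was gathered.
    result = synthesize(question, state.evidence)
    result["tokens_used"] = state.tokens_used
    return result


if __name__ == "__main__":
    def fake_step(question: str, state: LoopState) -> tuple[Optional[str], int]:
        return f"excerpt {len(state.evidence) + 1}", 400

    def fake_synthesize(question: str, evidence: list[str]) -> dict:
        return {"answer": f"{len(evidence)} excerpts reviewed", "citations": evidence}

    print(run_research("example question", fake_step, fake_synthesize, token_budget=1000))
```

Keeping synthesis out of the loop means the loop can stay free-form while only one call has to produce parseable JSON.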

Surprises & Discoveries

Repo created under wrong owner. mcp__gitea__create_repo via MCP creates under the authenticated user (claude-code), not archeious. Need REST API with admin token for archeious repos.
MCP merge permissions. claude-code user can create PRs but can't merge. All merges go through REST API with FORGEJO_API_TOKEN.
Wiki links in README. Relative /wiki/X resolves against Forgejo root, not the repo. Need full absolute URLs.
Working directory was /tmp. Moved to ~/marchwarden mid-session. Could have lost work on reboot.
pytest-asyncio needed for async tests. Not in the original dev deps. Installed separately.
ResearchConstraints token_budget minimum is 1000. Caught by a test that tried budget=800. The Pydantic model enforces ge=1000.
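The token_budget floor can be reproduced in isolation with a small sketch. Only the field name and the ge=1000 bound come from the notes. The default value and the budget_is_valid helper are assumptions added for illustration, and the real ResearchConstraints model presumably has more fields.

```python
from pydantic import BaseModel, Field, ValidationError


class ResearchConstraints(BaseModel):
    # ge=1000 is the bound described in the notes.
    # The default of 20_000 is an illustrative assumption.
    token_budget: int = Field(default=20_000, ge=1000)


def budget_is_valid(budget: int) -> bool:
    """Return True if Pydantic accepts the budget, False if validation fails."""
    try:
        ResearchConstraints(token_budget=budget)
    except ValidationError:
        return False
    return True


if __name__ == "__main__":
    print(budget_is_valid(1000))  # at the floor
    print(budget_is_valid(800))   # below the floor, as in the failing test
```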

Concerns & Open Threads

No real end-to-end test yet. All tests are mocked. Phase 2 (CLI shim + smoke test) will be the first live run hitting Tavily and Claude.
The synthesis prompt is fragile. It asks Claude to produce exact JSON. If the model adds markdown fences or extra text, we strip them. If it produces structurally wrong JSON, we fall back. The fallback is valid but useless. Need to see how this performs with real queries.
Token counting is approximate. We track input_tokens + output_tokens from the Anthropic response, but Tavily API calls also cost money (API credits, not tokens). tokens_used in CostMetadata reflects only Claude tokens, so it understates the total cost of a query.
HTML extraction is naive. _extract_text() uses regex to strip tags. That works for simple pages but will produce garbage on JavaScript-heavy sites. Tavily's raw_content is usually better, so the problem mainly affects the fallback fetch_url path.
No pytest-asyncio in pyproject.toml dev deps. Should add it.
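The strip-then-fall-back handling of synthesis output mentioned above could be sketched as follows. This is not the project's parser: the function names, the REQUIRED_KEYS subset, and the shape of the fallback result are assumptions. It shows one way to tolerate markdown fences or surrounding chatter and to return a valid but empty result when the JSON is structurally wrong.

```python
import json
import re

# Built at runtime so this example contains no literal fence marker.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)

# Illustrative subset of contract fields; the real contract has more.
REQUIRED_KEYS = {"answer", "citations", "gaps"}


def extract_json_text(raw: str) -> str:
    """Strip a markdown fence or surrounding chatter from a model reply."""
    fenced = _FENCE_RE.search(raw)
    if fenced:
        return fenced.group(1).strip()
    # No fence: keep the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        return raw[start : end + 1]
    return raw.strip()


def parse_synthesis(raw: str) -> tuple[dict, bool]:
    """Return (result, used_fallback)."""
    try:
        data = json.loads(extract_json_text(raw))
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict) and REQUIRED_KEYS.issubset(data):
        return data, False
    # Hypothetical fallback: has the required keys but no real content.
    fallback = {
        "answer": "",
        "citations": [],
        "gaps": ["synthesis output could not be parsed"],
    }
    return fallback, True


if __name__ == "__main__":
    reply = "Here is the result:\n" + FENCE + 'json\n{"answer": "a", "citations": [], "gaps": []}\n' + FENCE
    print(parse_synthesis(reply))
    print(parse_synthesis("sorry, no JSON here"))
```

Returning the used_fallback flag alongside the result makes it easy to count how often real queries hit the fallback path.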

Raw Thinking

The agent loop (M1.3) is where all the real complexity lives. The MCP server is just plumbing. The CLI will be plumbing too. The quality of Marchwarden lives or dies on the system prompt and the synthesis prompt.
The open_questions addition was the user's idea, not mine. It's the most architecturally significant addition since the original contract. It gives the PI a forward-looking signal that gaps and discoveries don't provide. Good instinct.
The user's pattern of bringing in external AI critique (Gemini) and then integrating the feedback is productive. Different models catch different architectural weaknesses.
In future projects, the working copy should live in a permanent location from the start. /tmp was a mistake.

What's Next

Phase 2: CLI Shim — the path to the first smoke test.

M2.1 — ask command — marchwarden ask "question" with pretty-printed output
M2.2 — replay command — marchwarden replay <trace_id>
M2.3 — First smoke test — "What are ideal crops for a garden in Utah?" end-to-end

Recommended first action next session:

Branch: create feat/cli-shim from main
File: cli/main.py
First task: implement the ask command with Click
Current state: main is clean at 7088f45, 81 tests passing