Table of Contents
- Session 1 Notes — 2026-04-08
- What We Set Out to Do
- What Actually Happened
- Naming (longer than expected, worth it)
- Repo and scaffolding
- Contract evolution (three revisions)
- Phase 0 — Foundation (complete)
- Phase 1 — Web Researcher Core (complete)
- Contract addition mid-implementation
- Commits (main, after merges)
- Key Decisions & Reasoning
- Surprises & Discoveries
- Concerns & Open Threads
- Raw Thinking
- What's Next
Session 1 Notes — 2026-04-08
What We Set Out to Do
Create the Marchwarden project from scratch: name it, set up the repo, document the architecture and research contract, plan the development roadmap, and implement through Phase 1.
What Actually Happened
Naming (longer than expected, worth it)
Spent significant time finding the right name. Explored Latin (vestigia, sodalis, auspex), Greek (heuresis, gnomon, scholia, epoche), Arabic (isnad, rihla, ijtihad, tahqiq), and English compounds (marchwarden, lanternwake). Deep-dived gnomon vs rihla before landing on marchwarden — a guardian at the frontier of knowledge.
Repo and scaffolding
- Created repo at archeious/marchwarden (initially created under claude-code by mistake, migrated)
- Set up directory structure: researchers/web/, orchestrator/, cli/, docs/wiki/
- Wiki pages: Architecture, ResearchContract, DevelopmentGuide, Roadmap
- Issue #1 tracks V1 scope
Contract evolution (three revisions)
- Initial: Simple answer + citations + gaps + confidence + cost_metadata + trace_id
- Post-critique: Added raw_excerpt (synthesis paradox fix), discovery_events (lateral metadata), categorized gaps (GapCategory enum), confidence_factors (auditable scoring)
- Final: Added content_hash in traces (pseudo-CAS), model_id in CostMetadata, open_questions (forward-looking follow-ups from the research itself)
The user brought in external critique (Gemini analysis) which pushed the contract to higher fidelity.
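To make the shape of the final contract easier to picture, here is a rough sketch using stdlib dataclasses as a stand-in for the Pydantic models. The field names come from the revisions listed above. The class names other than GapCategory and CostMetadata, the two enum member names, and all field types are assumptions for illustration, and only two of the five gap categories are shown because the notes describe only those two.

```python
from dataclasses import dataclass, field
from enum import Enum


class GapCategory(Enum):
    # Hypothetical member names. The real enum has five categories;
    # only the two described in these notes are sketched here.
    SOURCE_NOT_FOUND = "source_not_found"    # "info doesn't exist"
    BUDGET_EXHAUSTED = "budget_exhausted"    # "researcher ran out of budget"


@dataclass
class Citation:
    source: str        # assumed field name for the URL or locator
    raw_excerpt: str   # verbatim source text, so the PI is not re-summarizing a summary


@dataclass
class Gap:
    topic: str         # assumed field name
    category: GapCategory


@dataclass
class CostMetadata:
    tokens_used: int
    model_id: str


@dataclass
class ResearchResult:
    answer: str
    citations: list[Citation]
    gaps: list[Gap]                      # backward-looking: what failed
    confidence: float
    cost_metadata: CostMetadata
    trace_id: str
    confidence_factors: dict[str, float] = field(default_factory=dict)
    discovery_events: list[str] = field(default_factory=list)   # sideways: lateral finds
    open_questions: list[str] = field(default_factory=list)     # forward: follow-ups


result = ResearchResult(
    answer="Placeholder answer",
    citations=[Citation(source="https://example.com", raw_excerpt="Verbatim text")],
    gaps=[Gap(topic="regional data", category=GapCategory.BUDGET_EXHAUSTED)],
    confidence=0.5,
    cost_metadata=CostMetadata(tokens_used=1200, model_id="example-model"),
    trace_id="trace-001",
    open_questions=["Which follow-up deserves a deeper pass?"],
)
```

Keeping gaps, discovery_events, and open_questions as separate lists lets the PI respond to each one differently.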
Phase 0 — Foundation (complete)
- M0.1: Tavily key verified (free tier, 1000 searches/month)
- M0.2: All dependencies install clean
- M0.3: 9 Pydantic models implementing the full contract, 35 tests
Phase 1 — Web Researcher Core (complete)
- M1.1: tavily_search() + fetch_url() with content hashing, 18 tests
- M1.2: TraceLogger (JSONL audit logs per trace_id), 15 tests
- M1.3: WebResearcher, a Claude-driven tool-use loop (plan→search→fetch→iterate→synthesize) with budget enforcement and fallback on synthesis failure, 9 tests
- M1.4: FastMCP server wrapping WebResearcher as single research tool, 4 tests
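As a rough illustration of how content hashing (M1.1) and per-trace JSONL logging (M1.2) fit together, the sketch below writes one JSON object per line to a file named after the trace_id. It is not the project's TraceLogger. The class name JsonlTraceLogger, its method signatures, the entry keys, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_hash(text: str) -> str:
    """Hex digest of fetched text. SHA-256 is an assumed choice of algorithm."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class JsonlTraceLogger:
    """Hypothetical stand-in: appends one JSON object per line to <log_dir>/<trace_id>.jsonl."""

    def __init__(self, trace_id: str, log_dir: Path) -> None:
        log_dir.mkdir(parents=True, exist_ok=True)
        self.path = log_dir / f"{trace_id}.jsonl"

    def log(self, action: str, **fields) -> None:
        # Append mode keeps earlier entries, so the file is an audit trail.
        entry = {"action": action, **fields}
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def read(self) -> list[dict]:
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


log_dir = Path(tempfile.mkdtemp())
logger = JsonlTraceLogger("trace-001", log_dir)
page = "Example page body"
logger.log("fetch_url", url="https://example.com", content_hash=content_hash(page))
entries = logger.read()
```

Storing the hash next to the URL means a later replay can tell whether the page content changed since the original run.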
Contract addition mid-implementation
User identified a gap in the contract: no field for forward-looking follow-up questions that emerged from the research. Added open_questions (distinct from gaps=backward, discovery_events=sideways). Updated models, agent, tests, and wiki.
Commits (main, after merges)
- f1e27e3 Initial commit
- deb124e Project scaffolding
- 79becb2 Fix README links
- 6a8445e Fix wiki links to absolute URLs
- 8930f44 Merge PR #2: M0.3 Contract models
- 851fed6 Merge PR #3: M1.1 Search and fetch tools
- 21c8191 Merge PR #4: M1.2 Trace logger
- ece2455 Merge PR #5: M1.3 Inner agent loop
- f593dd0 Merge PR #6: OpenQuestion contract addition
- 7088f45 Merge PR #7: M1.4 MCP server
Key Decisions & Reasoning
- Name: marchwarden. Names the role (watcher at the frontier), not the tech. Tolkien association exists but the word predates him.
- Tavily over SearXNG. Initially planned SearXNG (self-hosted, fits homelab), switched back to Tavily to reduce Phase 0 friction. Can swap later.
- raw_excerpt on citations. Prevents "Synthesis Paradox" where the PI synthesizes already-summarized data, losing nuance.
- Categorized gaps (GapCategory enum). Five categories drive different PI responses. Without categories, the PI can't distinguish "info doesn't exist" from "researcher ran out of budget."
- open_questions (new field). Gaps look backward (what failed), discovery events look sideways (what's lateral), open questions look forward (what needs deeper investigation). Added mid-session when user identified the gap.
- Confidence: deferred calibration. confidence_factors expose scoring inputs; formal rubric after 20-30 real queries.
- model_id in CostMetadata. Enables comparing research quality across model tiers.
- Two-step agent architecture. Tool-use loop for search/fetch (Claude decides what to do), then a separate synthesis call that produces structured JSON. Separating them makes the synthesis parseable and the tool loop flexible.
- Secrets in ~/secrets, not .env. User's established pattern. MCP server reads keys from there.
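The two-step architecture can be sketched with the model calls replaced by plain callables. Nothing below is the real WebResearcher. The function run_research, the decide/synthesize callables, the step dictionary shape, and the fixed cost_per_step are simplifying assumptions (real token costs vary per call).

```python
from typing import Callable


def run_research(
    question: str,
    decide: Callable[[str, list[str]], dict],
    tools: dict[str, Callable[[str], str]],
    synthesize: Callable[[str, list[str]], dict],
    token_budget: int,
    cost_per_step: int,
) -> dict:
    evidence: list[str] = []
    tokens_used = 0
    # Step 1: tool-use loop. The model chooses each action until it says
    # "done" or another step would exceed the budget.
    while tokens_used + cost_per_step <= token_budget:
        step = decide(question, evidence)
        tokens_used += cost_per_step
        if step["tool"] == "done":
            break
        evidence.append(tools[step["tool"]](step["arg"]))
    # Step 2: a single separate call whose only job is structured output.
    result = synthesize(question, evidence)
    result["tokens_used"] = tokens_used
    return result


def fake_decide(question: str, evidence: list[str]) -> dict:
    # Scripted stand-in for the model: search, then fetch, then stop.
    if len(evidence) == 0:
        return {"tool": "search", "arg": question}
    if len(evidence) == 1:
        return {"tool": "fetch", "arg": "https://example.com"}
    return {"tool": "done", "arg": ""}


fake_tools = {
    "search": lambda q: f"search results for {q}",
    "fetch": lambda url: f"contents of {url}",
}


def fake_synthesize(question: str, evidence: list[str]) -> dict:
    return {"answer": f"{len(evidence)} sources consulted", "citations": list(evidence)}


full = run_research("q", fake_decide, fake_tools, fake_synthesize,
                    token_budget=1000, cost_per_step=100)
tight = run_research("q", fake_decide, fake_tools, fake_synthesize,
                     token_budget=100, cost_per_step=100)
```

With the tight budget the loop stops after one tool call, and synthesis still runs on whatever evidence was gathered, which is why the two phases are kept apart.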
Surprises & Discoveries
- Repo created under wrong owner. mcp__gitea__create_repo via MCP creates under the authenticated user (claude-code), not archeious. Need REST API with admin token for archeious repos.
- MCP merge permissions. claude-code user can create PRs but can't merge. All merges go through REST API with FORGEJO_API_TOKEN.
- Wiki links in README. Relative /wiki/X resolves against Forgejo root, not the repo. Need full absolute URLs.
- Working directory was /tmp. Moved to ~/marchwarden mid-session. Could have lost work on reboot.
- pytest-asyncio needed for async tests. Not in the original dev deps. Installed separately.
- ResearchConstraints token_budget minimum is 1000. Caught by a test that tried budget=800. The Pydantic model enforces ge=1000.
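The token_budget surprise is easy to reproduce in miniature. The sketch below is a dependency-free stand-in that mimics a ge=1000 constraint with a dataclass check. In the real model the constraint is declared on the Pydantic field, and the class name ConstraintsSketch and the error message here are assumptions.

```python
from dataclasses import dataclass

MIN_TOKEN_BUDGET = 1000  # mirrors the ge=1000 constraint noted above


@dataclass
class ConstraintsSketch:
    """Hypothetical stand-in for ResearchConstraints, stdlib only."""

    token_budget: int

    def __post_init__(self) -> None:
        # "ge" means greater than or equal, so exactly 1000 is accepted.
        if self.token_budget < MIN_TOKEN_BUDGET:
            raise ValueError(
                f"token_budget must be >= {MIN_TOKEN_BUDGET}, got {self.token_budget}"
            )


try:
    ConstraintsSketch(token_budget=800)
    rejected = False
except ValueError:
    rejected = True

at_minimum = ConstraintsSketch(token_budget=1000)
```

A test that passes budget=800 fails at construction time, before any research logic runs, which matches how the limit was discovered.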
Concerns & Open Threads
- No real end-to-end test yet. All tests are mocked. Phase 2 (CLI shim + smoke test) will be the first live run hitting Tavily and Claude.
- The synthesis prompt is fragile. It asks Claude to produce exact JSON. If the model adds markdown fences or extra text, we strip them. If it produces structurally wrong JSON, we fall back. The fallback is valid but useless. Need to see how this performs with real queries.
- Token counting is approximate. We track input_tokens + output_tokens from the Anthropic response, but Tavily API calls also cost money (not tokens, but API credits). tokens_used in CostMetadata only reflects Claude tokens, not Tavily credits.
- HTML extraction is naive. _extract_text() uses regex to strip tags. Works for simple pages, will produce garbage on JavaScript-heavy sites. Tavily's raw_content is usually better. The fallback fetch_url path is where this matters.
- No pytest-asyncio in pyproject.toml dev deps. Should add it.
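The strip-then-fall-back behavior described for the synthesis step might look roughly like this. It is a sketch, not the project's code. The function names, the REQUIRED_KEYS subset, and the fallback flag are assumptions, and slicing from the first "{" to the last "}" is one simple way to drop both fences and surrounding chatter.

```python
import json

REQUIRED_KEYS = {"answer", "citations"}  # assumed minimal subset of the contract
FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def extract_json_object(text: str) -> str:
    """Keep only the span from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return text
    return text[start:end + 1]


def parse_synthesis(raw: str) -> dict:
    try:
        data = json.loads(extract_json_object(raw))
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        # Fallback: structurally valid, but carries no findings.
        return {"answer": "", "citations": [], "fallback": True}
    return data


fenced = FENCE + "json\n" + '{"answer": "42", "citations": []}' + "\n" + FENCE
chatty = 'Here is the result: {"answer": "x", "citations": []} Hope that helps.'

from_fenced = parse_synthesis(fenced)
from_chatty = parse_synthesis(chatty)
from_garbage = parse_synthesis("not json at all")
from_wrong_shape = parse_synthesis('{"foo": 1}')
```

Marking the fallback explicitly makes it possible to count how often real queries hit it.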
Raw Thinking
- The agent loop (M1.3) is where all the real complexity lives. The MCP server is just plumbing. The CLI will be plumbing too. The quality of Marchwarden lives or dies on the system prompt and the synthesis prompt.
- The open_questions addition was the user's idea, not mine. It's the most architecturally significant addition since the original contract. It gives the PI a forward-looking signal that gaps and discoveries don't provide. Good instinct.
- The user's pattern of bringing in external AI critique (Gemini) and then integrating the feedback is productive. Different models catch different architectural weaknesses.
- Working copy should probably be in a more permanent location from the start in future projects. /tmp was a mistake.
What's Next
Phase 2: CLI Shim — the path to the first smoke test.
- M2.1: ask command (marchwarden ask "question") with pretty-printed output
- M2.2: replay command (marchwarden replay <trace_id>)
- M2.3: first smoke test, running "What are ideal crops for a garden in Utah?" end-to-end
Recommended first action next session:
- Branch: create feat/cli-shim from main
- File: cli/main.py
- First task: implement the ask command with Click
- Current state: main is clean at 7088f45, 81 tests passing