V1: Web-search researcher MCP + CLI shim

claude-code commented

2026-04-08 11:59:06 -06:00

Collaborator

Ship target: V1 Marchwarden

Build a single agentic researcher exposed as an MCP server, controlled via CLI shim. This is the first node in a multi-agent research network; future versions add more specialists and a PI orchestrator.

Scope: Single researcher (web search)

Researcher capability:

Takes a single research(question, context?, depth?, constraints?) tool call
Runs internal agentic loop: plans, searches via Tavily, fetches URLs, iterates, synthesizes
Returns structured response: answer, citations[], gaps[], cost_metadata, trace_id
Server-enforced budgets: max 5 iterations, ~20k tokens per call
Produces JSONL trace logs (one file per research call, keyed by trace_id)
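
A minimal sketch of how the iteration and token caps above could bound the loop. The `Budget` class, `run_loop`, and the `step`/`synthesize` callables are hypothetical names for illustration; no Tavily or LLM call is made, and token accounting is simplified to a count each step reports.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Budget:
    max_iterations: int = 5      # default stated in the issue
    token_budget: int = 20_000   # default stated in the issue
    tokens_used: int = 0
    iterations_run: int = 0

    def exhausted(self) -> bool:
        # Either cap being reached stops the loop.
        return (self.iterations_run >= self.max_iterations
                or self.tokens_used >= self.token_budget)


# A step receives the question and the notes so far and returns
# (new_notes, tokens_spent, done). Hypothetical signature.
Step = Callable[[str, List[str]], Tuple[List[str], int, bool]]


def run_loop(question: str, step: Step,
             synthesize: Callable[[str, List[str]], str],
             budget: Budget) -> Tuple[str, Budget]:
    notes: List[str] = []
    while not budget.exhausted():
        new_notes, tokens, done = step(question, notes)
        budget.iterations_run += 1
        budget.tokens_used += tokens
        notes.extend(new_notes)
        if done:
            break
    # Synthesis always runs, even when the budget cut the loop short.
    return synthesize(question, notes), budget
```

Because the budget is checked before each step, the final step can overshoot the token cap; that is consistent with the "~20k" wording, but a stricter server would also need to cap the step itself.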

Server (MCP):

Implements the researcher contract
Exposes research(question, ...) as the sole tool
Enforces iteration/token budgets
Logs all traces to ~/.marchwarden/traces/ (or configurable path)

CLI shim:

marchwarden ask "what are ideal crops for a garden in Utah?"
marchwarden replay <trace_id>
Test harness for the researcher; will be replaced by PI orchestrator in V2
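
The two commands could be wired up roughly as follows. This is a sketch that assumes the standard-library `argparse` (the issue does not name a CLI library); the handlers are stubs that only print what they would do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="marchwarden")
    sub = parser.add_subparsers(dest="command", required=True)

    ask = sub.add_parser("ask", help="send one question to the researcher")
    ask.add_argument("question")

    replay = sub.add_parser("replay", help="show the stored trace for a call")
    replay.add_argument("trace_id")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "ask":
        # Stub: a real shim would invoke the research() tool here.
        print(f"[stub] would call research(question={args.question!r})")
    else:
        # Stub: a real shim would read the JSONL trace for this ID.
        print(f"[stub] would read trace {args.trace_id}")
    return 0
```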

Contract details

Tool signature:

research(
  question: str,
  context?: str,           # what the PI already knows
  depth?: "shallow" | "balanced" | "deep" = "balanced",
  constraints?: {
    max_iterations?: int = 5,
    token_budget?: int = 20000,
  }
) → {
  answer: str,
  citations: [
    {
      source: str,         # "web", "file", "db", etc
      locator: str,        # URL, file path, row ID, etc
      snippet?: str,       # relevant excerpt
      confidence: float,   # 0.0-1.0
    }
  ],
  gaps: [
    {
      topic: str,         # what couldn't be resolved
      reason: str,        # "no sources found", "ambiguous", etc
    }
  ],
  cost_metadata: {
    tokens_used: int,
    iterations_run: int,
    wall_time_sec: float,
  },
  trace_id: str,          # UUID, links to JSONL trace file
}
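
For reference, the return shape could be mirrored in Python along these lines. This is a sketch, not the project's actual types: the class names `Citation`, `Gap`, and `CostMetadata` are assumptions, while the field names follow the signature above.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    source: str                    # "web", "file", "db", etc
    locator: str                   # URL, file path, row ID, etc
    confidence: float              # 0.0-1.0
    snippet: Optional[str] = None  # relevant excerpt, optional


@dataclass
class Gap:
    topic: str    # what couldn't be resolved
    reason: str   # "no sources found", "ambiguous", etc


@dataclass
class CostMetadata:
    tokens_used: int
    iterations_run: int
    wall_time_sec: float


@dataclass
class ResearchResult:
    answer: str
    cost_metadata: CostMetadata
    citations: List[Citation] = field(default_factory=list)
    gaps: List[Gap] = field(default_factory=list)
    # UUID string; links the result to its JSONL trace file.
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def to_payload(result: ResearchResult) -> dict:
    """Convert the nested dataclasses into a plain, JSON-serializable dict."""
    return asdict(result)
```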

Trace log (~/.marchwarden/traces/{trace_id}.jsonl):

One JSON object per inner-loop step
Fields: step, action, result, timestamp, decision
Supports replay and debugging
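
One way the per-step trace could be written and read back, as a sketch. The helper names `append_trace_entry` and `read_trace` are hypothetical, and the ISO 8601 UTC timestamp format is an assumption; the five field names come from the list above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List

# Default location from the issue; the server may make this configurable.
DEFAULT_TRACE_DIR = Path.home() / ".marchwarden" / "traces"


def append_trace_entry(trace_dir: Path, trace_id: str, step: int,
                       action: str, result: str, decision: str) -> Path:
    """Append one inner-loop step as a single JSON line."""
    trace_dir.mkdir(parents=True, exist_ok=True)
    path = trace_dir / f"{trace_id}.jsonl"
    entry = {
        "step": step,
        "action": action,
        "result": result,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return path


def read_trace(path: Path) -> List[dict]:
    """Load every step of a trace in the order it was written."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per step keeps a partial trace on disk if a call dies mid-loop, which is the case where a trace is most useful for debugging.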

V1 is NOT

Multiple researchers (that's V2+)
PI orchestrator (that's V2+)
Database sources, file corpus, arxiv (V2+)
Web UI (keep CLI only)
Eval harness, caching, persistence beyond traces
Auth, multi-user, deployment

Ship checklist

- [ ] Repo structure set up (see CONTRIBUTING.md or wiki)
- [ ] MCP server implemented with research() tool
- [ ] Internal agent loop (plan → search → fetch → iterate → synthesize)
- [ ] Token budget enforcement
- [ ] Trace logging (JSONL)
- [ ] CLI shim (ask, replay commands)
- [ ] Contract documented in wiki
- [ ] Integration test: ask a non-trivial question, get back structured answer with citations and gaps
- [ ] All tests passing

Decisions recorded

Stack: Python, claude-agent-sdk, official mcp SDK
Web search: Tavily (cheap, good for agents)
Name: Marchwarden (researcher at the frontier, reporting back)
Repo: archeious/marchwarden

Created: 2026-04-08
Assignee: archeious
Milestone: V1 Ship


claude-code commented

2026-04-08 12:28:29 -06:00

Author

Collaborator

Contract Revision (2026-04-08)

The research contract has been revised in response to an architectural critique. Key changes:

New fields added to `ResearchResult`

raw_excerpt on citations — verbatim text from source, prevents synthesis paradox (double-summarization through LLM layers)
discovery_events[] — lateral findings for other researchers (logged in V1, auto-dispatched in V2)
confidence_factors — exposes inputs to confidence scoring (num sources, authority, contradictions, specificity, recency) for future calibration
Categorized gaps — GapCategory enum replaces free-text reasons:
- SOURCE_NOT_FOUND — info doesn't exist in this domain
- ACCESS_DENIED — paywall, robots.txt, auth wall
- BUDGET_EXHAUSTED — hit iteration/token cap
- CONTRADICTORY_SOURCES — sources disagree, unresolvable
- SCOPE_EXCEEDED — needs a different researcher type
content_hash in trace entries — SHA-256 of fetched content for pseudo-CAS change detection
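
A sketch of two of these additions. The enum member names are taken from the list above, but their lowercase string values, the `CategorizedGap` class, and the `content_hash` helper signature are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class GapCategory(str, Enum):
    SOURCE_NOT_FOUND = "source_not_found"            # info doesn't exist in this domain
    ACCESS_DENIED = "access_denied"                  # paywall, robots.txt, auth wall
    BUDGET_EXHAUSTED = "budget_exhausted"            # hit iteration/token cap
    CONTRADICTORY_SOURCES = "contradictory_sources"  # sources disagree, unresolvable
    SCOPE_EXCEEDED = "scope_exceeded"                # needs a different researcher type


@dataclass
class CategorizedGap:
    topic: str
    category: GapCategory


def content_hash(content: bytes) -> str:
    """SHA-256 hex digest of fetched content, for detecting later changes."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_hash: str, content: bytes) -> bool:
    """True if newly fetched content no longer matches the recorded hash."""
    return content_hash(content) != previous_hash
```

Comparing a stored hash against a fresh fetch only shows that a source changed, not what it previously said, which matches the note that traces are audit logs rather than true replays.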

Known Limitations documented

Confidence is LLM-generated, not calibrated (calibrate after 20-30 queries)
No citation validation (V2: validator node)
Traces are audit logs, not true replays (V2: content-addressable storage, CAS)
Discovery events logged only (V2: PI auto-dispatch)
No streaming progress (MCP is request-response)

Updated ship checklist

- [ ] Repo structure set up
- [ ] MCP server with research() tool
- [ ] Internal agent loop (plan → search → fetch → iterate → synthesize)
- [ ] Token/iteration budget enforcement
- [ ] JSONL trace logging with content hashes
- [ ] raw_excerpt on all citations
- [ ] Categorized gaps (GapCategory enum)
- [ ] Discovery events capture
- [ ] Confidence factors reporting
- [ ] CLI shim (ask, replay)
- [ ] Contract documented in wiki ✅
- [ ] Integration test: structured answer with citations, gaps, discoveries
- [ ] All tests passing

Full spec: ResearchContract wiki page (https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ResearchContract)