Architecture
Overview
Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses.
┌─────────────┐
│  PI Agent   │  Orchestrates, synthesizes, decides what to research
└──────┬──────┘
       │ dispatch research(question)
       │
  ┌────┴──────────────────────────┐
  │                               │
┌─┴────────────────────┐  ┌───────┴─────────────────┐
│ Web Researcher (MCP) │  │ Future: DB, Arxiv, etc. │
│ - Search (Tavily)    │  │ (V2+)                   │
│ - Fetch URLs         │  │                         │
│ - Internal loop      │  │                         │
│ - Return citations   │  │                         │
│ - Raw evidence       │  │                         │
│ - Discovery events   │  │                         │
└──────────────────────┘  └─────────────────────────┘
Components
Researchers (MCP servers)
Each researcher is a standalone MCP server that:
- Exposes a single tool: research(question, context, depth, constraints)
- Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
- Returns structured data: answer, citations (with raw evidence), gaps (categorized), discovery_events, confidence + confidence_factors, cost_metadata, trace_id
- Enforces budgets: iteration cap and token limit
- Logs all internal steps to JSONL trace files with content hashes
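The structured return value above could be modeled with plain dataclasses. This is a minimal sketch: the field names mirror the list above, but the exact types (and the use of dataclasses at all) are assumptions, not the actual contract.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source: str          # e.g. "web"
    locator: str         # URL or other identifier
    snippet: str         # researcher-selected quote
    raw_excerpt: str     # verbatim text from the source, bypassing synthesis
    confidence: float


@dataclass
class ResearchResult:
    answer: str
    citations: list[Citation] = field(default_factory=list)
    gaps: list[dict] = field(default_factory=list)              # categorized gaps
    discovery_events: list[dict] = field(default_factory=list)  # lateral findings
    confidence: float = 0.0
    confidence_factors: dict = field(default_factory=dict)
    cost_metadata: dict = field(default_factory=dict)
    trace_id: str = ""
```

Keeping the contract as a flat record like this makes it easy to serialize over MCP's JSON-RPC boundary.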
V1 researcher: Web search + fetch
- Uses Tavily for searching
- Fetches full text from URLs
- Iterates up to 5 times or until budget exhausted
Future researchers (V2+): Database, Arxiv, internal documents, etc.
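The budget enforcement described above (iterate up to 5 times or until the token budget runs out) might be sketched like this. The limits and the callback names (`step_fn`, `confident_fn`, `tokens_for`) are illustrative stand-ins for the researcher's actual plan/search/fetch/synthesize machinery.

```python
MAX_ITERATIONS = 5       # V1 iteration cap
TOKEN_BUDGET = 10_000    # per-call token limit (illustrative value)


def run_agent_loop(question, step_fn, confident_fn, tokens_for):
    """Iterate until confident or the budget is exhausted.

    step_fn(question, state) runs one plan/search/fetch/synthesize pass;
    confident_fn(state) decides whether to stop early;
    tokens_for(state) reports tokens consumed by the last pass.
    All three are hypothetical callbacks for the sketch.
    """
    tokens_used = 0
    state = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        state = step_fn(question, state)
        tokens_used += tokens_for(state)
        if confident_fn(state) or tokens_used >= TOKEN_BUDGET:
            break
    return state, {
        "iterations_run": iteration,
        "tokens_used": tokens_used,
        "budget_exhausted": tokens_used >= TOKEN_BUDGET,
    }
```

Enforcing the cap in the loop itself (rather than trusting the agent to stop) is what makes the budget a hard guarantee.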
MCP Protocol
Marchwarden uses the Model Context Protocol (MCP) as the boundary between researchers and their callers. This gives us:
- Language agnostic — researchers can be Python, Node, Go, etc.
- Process isolation — researcher crash doesn't crash the PI
- Clean contract — one tool signature, versioned independently
- Parallel dispatch — PI can await multiple researchers simultaneously
MCP constraint: The protocol is JSON-RPC (request-response). A researcher cannot emit streaming events or notifications mid-loop. All output — including discovery events — is returned in the final response. This is a known V1 limitation; see Known Limitations below.
CLI Shim
For V1, the CLI is the test harness that stands in for the PI:
marchwarden ask "what are ideal crops for Utah?"
marchwarden replay <trace_id>
In V2, the CLI is replaced by a full PI orchestrator agent.
Trace Logging
Every research call produces a JSONL trace log:
~/.marchwarden/traces/{trace_id}.jsonl
Each line is a JSON object:
{
  "step": 1,
  "action": "search",
  "query": "Utah climate gardening",
  "result": {...},
  "timestamp": "2026-04-08T12:00:00Z",
  "decision": "query was relevant, fetching top 3 URLs"
}
For fetch actions, traces include a content_hash (SHA-256):
{
  "step": 2,
  "action": "fetch_url",
  "url": "https://extension.usu.edu/gardening/utah-crops",
  "content_hash": "sha256:a3f2b8c91d...",
  "content_length": 14523,
  "timestamp": "2026-04-08T12:00:05Z",
  "decision": "Relevant to question; extracting crop data"
}
Traces support:
- Auditing — see exactly what the researcher did and decided
- Change detection — content_hash reveals if web sources changed between runs
- Debugging — diagnose why a researcher produced a particular answer
- Future replay — with Content Addressable Storage (V2+), traces become reproducible
V1 note: Traces are audit logs, not deterministic replays. True replay requires storing the full fetched content (CAS), not just its hash. See Known Limitations.
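A fetch-trace entry like the one above could be appended by a small helper. This is a sketch, not the actual implementation; `log_fetch` is a hypothetical name, and the field names simply mirror the examples in this section.

```python
import hashlib
import json
import time
from pathlib import Path


def log_fetch(trace_path: Path, step: int, url: str, content: str, decision: str):
    """Append one fetch entry to a JSONL trace, hashing the fetched content.

    Hypothetical helper: stores only the SHA-256 hash (pseudo-CAS),
    not the content itself, matching the V1 trace design.
    """
    entry = {
        "step": step,
        "action": "fetch_url",
        "url": url,
        "content_hash": "sha256:" + hashlib.sha256(content.encode()).hexdigest(),
        "content_length": len(content),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "decision": decision,
    }
    with trace_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

One JSON object per line keeps the trace streamable and trivially greppable, which is why JSONL fits audit logs better than a single JSON document.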
Data Flow
One research call (simplified)
CLI: ask "What are ideal crops for Utah?"
↓
MCP: research(question="What are ideal crops for Utah?", ...)
↓
Researcher agent loop:
1. Plan: "I need climate data for Utah + crop requirements"
2. Search: Tavily query for "Utah climate zones crops"
3. Fetch: Read top 3 URLs (hash each for pseudo-CAS)
4. Parse: Extract relevant info, preserve raw excerpts
5. Synthesize: "Based on X sources, ideal crops are Y"
6. Check gaps: "Couldn't find pest info" → categorize as SOURCE_NOT_FOUND
7. Check discoveries: "Found reference to USU soil study" → emit DiscoveryEvent
8. Compute confidence + factors
9. Return if confident, else iterate
↓
Response:
{
  "answer": "...",
  "citations": [
    {
      "source": "web",
      "locator": "https://...",
      "snippet": "...",
      "raw_excerpt": "verbatim text from source...",
      "confidence": 0.95
    }
  ],
  "gaps": [
    {
      "topic": "pest resistance",
      "category": "source_not_found",
      "detail": "No pest data found in general gardening sources"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "database",
      "query": "Utah soil salinity crop impact",
      "reason": "Multiple sources reference USU study data not available on web"
    }
  ],
  "confidence": 0.82,
  "confidence_factors": {
    "num_corroborating_sources": 3,
    "source_authority": "high",
    "contradiction_detected": false,
    "query_specificity_match": 0.85,
    "budget_exhausted": false,
    "recency": "current"
  },
  "cost_metadata": {
    "tokens_used": 8452,
    "iterations_run": 3,
    "wall_time_sec": 42.5,
    "budget_exhausted": false
  },
  "trace_id": "uuid-1234"
}
↓
CLI: Print answer + citations + gaps + discoveries, save trace
Design Decisions
Raw Evidence (The Synthesis Paradox)
When the PI synthesizes answers from multiple researchers, it risks "recursive compression loss" — each researcher has already summarized the raw data, and the PI summarizes those summaries. Subtle nuances and contradictions can be smoothed away.
Solution: Every citation includes a raw_excerpt field — verbatim text from the source. The PI can verify claims against raw evidence, detect when researchers interpret the same source differently, and flag high-entropy points for human review.
Categorized Gaps
Gaps are not just "things we didn't find." Different gap categories demand different responses from the PI:
| Category | PI Response |
|---|---|
| SOURCE_NOT_FOUND | Accept the gap or try a different researcher |
| ACCESS_DENIED | Specialized fetcher or human intervention |
| BUDGET_EXHAUSTED | Re-dispatch with larger budget |
| CONTRADICTORY_SOURCES | Examine raw_excerpts, flag for human review |
| SCOPE_EXCEEDED | Dispatch the appropriate specialist |
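The category-to-response mapping could be encoded as an enum plus a policy table. A minimal sketch under assumed names — the real PI policy may be richer than a static dict:

```python
from enum import Enum


class GapCategory(Enum):
    SOURCE_NOT_FOUND = "source_not_found"
    ACCESS_DENIED = "access_denied"
    BUDGET_EXHAUSTED = "budget_exhausted"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"


# Illustrative PI policy table mirroring the rows above; the string
# values are placeholder action names, not real handlers.
PI_RESPONSE = {
    GapCategory.SOURCE_NOT_FOUND: "accept_or_retry_other_researcher",
    GapCategory.ACCESS_DENIED: "specialized_fetcher_or_human",
    GapCategory.BUDGET_EXHAUSTED: "redispatch_with_larger_budget",
    GapCategory.CONTRADICTORY_SOURCES: "examine_raw_excerpts_flag_human",
    GapCategory.SCOPE_EXCEEDED: "dispatch_specialist",
}
```

Using an enum keeps the contract between researcher and PI closed: a researcher cannot invent a new gap category the PI has no policy for.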
Discovery Events (Lateral Metadata)
A researcher often encounters information relevant to other researchers' domains. Rather than ignoring these findings (hub-and-spoke limitation) or acting on them (scope creep), the researcher logs them as DiscoveryEvent objects.
In V1, discovery events are logged for analysis. In V2, the PI orchestrator processes them dynamically, enabling mid-investigation dispatch of additional researchers.
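A DiscoveryEvent could be as simple as the following dataclass; the field names follow the response example earlier in this document, while the types are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DiscoveryEvent:
    """A lateral finding relevant to another researcher's domain.

    In V1 these are only recorded in the response for later analysis;
    a V2 PI would decide per event whether to dispatch a new researcher.
    """
    type: str                  # e.g. "related_research"
    suggested_researcher: str  # e.g. "database"
    query: str                 # follow-up question to dispatch
    reason: str                # why the researcher flagged it
```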
Content Hashing (Pseudo-CAS)
Every fetched URL produces a SHA-256 hash in the trace. This provides change detection (did the source change between runs?) without the storage overhead of full content archiving. It's the foundation for V2's Content Addressable Storage.
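Change detection then reduces to comparing the hashes recorded in two trace runs. `detect_changes` is a hypothetical helper illustrating the idea:

```python
def detect_changes(old_hashes: dict[str, str], new_hashes: dict[str, str]) -> list[str]:
    """Return URLs whose content hash differs between two runs.

    old_hashes / new_hashes map URL -> "sha256:..." values pulled from
    the fetch entries of two trace files.
    """
    return [url for url, h in new_hashes.items()
            if url in old_hashes and old_hashes[url] != h]
```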
Contract Versioning
The research() tool signature is the stable contract. Changes to the contract require explicit versioning so that:
- Multiple researchers with different versions can coexist
- The PI knows what version it's calling
- Backwards compatibility (or breaking changes) is explicit
See ResearchContract for the full spec.
Future: The PI Agent
V2 will introduce the orchestrator:
class PIAgent:
    async def research_topic(self, question: str) -> Answer:
        # Dispatch multiple researchers in parallel
        web_results, arxiv_results = await asyncio.gather(
            self.web_researcher.research(question),
            self.arxiv_researcher.research(question),
        )
        all_results = [web_results, arxiv_results]

        # Process discovery events from both
        for event in web_results.discovery_events + arxiv_results.discovery_events:
            if self.should_dispatch(event):
                additional = await self.dispatch_researcher(event)
                all_results.append(additional)

        # Synthesize using raw_excerpts for ground-truth verification
        return self.synthesize(all_results)
The PI:
- Decides which researchers to dispatch (initially in parallel)
- Processes discovery events and dispatches follow-ups
- Compares raw_excerpts across researchers to detect contradictions
- Uses gap categories to decide whether to re-dispatch or accept
- Synthesizes into a final answer with full provenance
Assumptions & Constraints
- Citation grounding is structural, not assumed — raw_excerpt provides verifiable evidence. Citation validation (programmatic URL ping) is V2 work. V1 relies on the researcher having actually fetched the source.
- Tavily API is available — for V1 web search. Degradation strategy: note in gaps with ACCESS_DENIED category.
- Token budgets are enforced — the MCP server enforces at the process level, not just the agent level.
- Traces are audit logs — stored locally, hashed for integrity, but not full content archives (V2).
- No multi-user — single-user CLI for V1.
- Confidence is directional — LLM-generated with exposed factors; formal calibration after V1 data collection.
Known Limitations (V1)
| Limitation | Rationale | Future Resolution |
|---|---|---|
| No citation validation | Adds latency; document as known risk | V2: Validator node pings URLs/DOIs |
| Traces are audit logs, not replays | True replay requires CAS for fetched content | V2: Content Addressable Storage |
| Discovery events are logged only | MCP is request-response; no mid-flight dispatch | V2: PI processes events dynamically |
| No streaming of progress | MCP tool responses are one-shot | V2+: Streaming MCP or polling pattern |
| Hub-and-spoke only | V1 simplicity; PI is only coordinator | V2: Dynamic priority queue in PI |
| Confidence not calibrated | Need empirical data first | V1.1: Rubric after 20-30 queries |
Terminology
- Researcher: An agentic system specialized in a domain or source type
- Marchwarden: The researcher metaphor — stationed at the frontier, reporting back
- Discovery Event: A lateral finding relevant to another researcher's domain
- Trace: A JSONL audit log of all decisions made during one research call
- Gap: An unresolved aspect of the question, categorized by cause
- Raw Excerpt: Verbatim text from a source, bypassing researcher synthesis
- Content Hash: SHA-256 of fetched content, enabling change detection
See also: ResearchContract, DevelopmentGuide