Architecture
Overview
Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses.
┌─────────────┐
│  PI Agent   │  Orchestrates, synthesizes, decides what to research
└──────┬──────┘
       │ dispatch research(question)
       │
  ┌────┴──────────────────────────┐
  │                               │
┌─┴────────────────────┐  ┌───────┴─────────────────┐
│ Web Researcher (MCP) │  │ Future: DB, Arxiv, etc. │
│ - Search (Tavily)    │  │ (V2+)                   │
│ - Fetch URLs         │  │                         │
│ - Internal loop      │  │                         │
│ - Return citations   │  │                         │
│ - Raw evidence       │  │                         │
│ - Discovery events   │  │                         │
└──────────────────────┘  └─────────────────────────┘
Components
Researchers (MCP servers)
Each researcher is a standalone MCP server that:
- Exposes a single tool: research(question, context, depth, constraints)
- Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
- Returns structured data: answer, citations (with raw evidence), gaps (categorized), discovery_events, confidence + confidence_factors, cost_metadata, trace_id
- Enforces budgets: iteration cap and token limit
- Logs all internal steps to JSONL trace files with content hashes
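The structured return value above could be modeled with plain dataclasses. This is a minimal sketch: the field names mirror the list above, but the exact types (and the use of dataclasses at all) are assumptions, not the actual contract.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    source: str          # e.g. "web"
    locator: str         # URL or other identifier
    snippet: str         # researcher-selected quote
    raw_excerpt: str     # verbatim text from the source, bypassing synthesis
    confidence: float


@dataclass
class ResearchResult:
    answer: str
    citations: list[Citation] = field(default_factory=list)
    gaps: list[dict] = field(default_factory=list)              # categorized gaps
    discovery_events: list[dict] = field(default_factory=list)  # lateral findings
    confidence: float = 0.0
    confidence_factors: dict = field(default_factory=dict)
    cost_metadata: dict = field(default_factory=dict)
    trace_id: str = ""
```

Keeping the contract as a flat record like this makes it easy to serialize over MCP's JSON-RPC boundary.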
V1 researcher: Web search + fetch
- Uses Tavily for searching
- Fetches full text from URLs
- Iterates up to 5 times or until budget exhausted
Future researchers (V2+): Database, Arxiv, internal documents, etc.
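The budget enforcement described above (iterate up to 5 times or until the token budget runs out) might be sketched like this. The limits and the callback names (`step_fn`, `confident_fn`, `tokens_for`) are illustrative stand-ins for the researcher's actual plan/search/fetch/synthesize machinery.

```python
MAX_ITERATIONS = 5       # V1 iteration cap
TOKEN_BUDGET = 10_000    # per-call token limit (illustrative value)


def run_agent_loop(question, step_fn, confident_fn, tokens_for):
    """Iterate until confident or the budget is exhausted.

    step_fn(question, state) runs one plan/search/fetch/synthesize pass;
    confident_fn(state) decides whether to stop early;
    tokens_for(state) reports tokens consumed by the last pass.
    All three are hypothetical callbacks for the sketch.
    """
    tokens_used = 0
    state = None
    for iteration in range(1, MAX_ITERATIONS + 1):
        state = step_fn(question, state)
        tokens_used += tokens_for(state)
        if confident_fn(state) or tokens_used >= TOKEN_BUDGET:
            break
    return state, {
        "iterations_run": iteration,
        "tokens_used": tokens_used,
        "budget_exhausted": tokens_used >= TOKEN_BUDGET,
    }
```

Enforcing the cap in the loop itself (rather than trusting the agent to stop) is what makes the budget a hard guarantee.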
MCP Protocol
Marchwarden uses the Model Context Protocol (MCP) as the boundary between researchers and their callers. This gives us:
- Language agnostic — researchers can be Python, Node, Go, etc.
- Process isolation — researcher crash doesn't crash the PI
- Clean contract — one tool signature, versioned independently
- Parallel dispatch — PI can await multiple researchers simultaneously
MCP constraint: The protocol is JSON-RPC (request-response). A researcher cannot emit streaming events or notifications mid-loop. All output — including discovery events — is returned in the final response. This is a known V1 limitation; see Known Limitations below.
CLI Shim
For V1, the CLI is the test harness that stands in for the PI:
marchwarden ask "what are ideal crops for Utah?"
marchwarden replay <trace_id>
In V2, the CLI is replaced by a full PI orchestrator agent.
Trace Logging
Every research call produces a JSONL trace log:
~/.marchwarden/traces/{trace_id}.jsonl
Each line is a JSON object:
{
  "step": 1,
  "action": "search",
  "query": "Utah climate gardening",
  "result": {...},
  "timestamp": "2026-04-08T12:00:00Z",
  "decision": "query was relevant, fetching top 3 URLs"
}
For fetch actions, traces include a content_hash (SHA-256):
{
  "step": 2,
  "action": "fetch_url",
  "url": "https://extension.usu.edu/gardening/utah-crops",
  "content_hash": "sha256:a3f2b8c91d...",
  "content_length": 14523,
  "timestamp": "2026-04-08T12:00:05Z",
  "decision": "Relevant to question; extracting crop data"
}
Traces support:
- Auditing — see exactly what the researcher did and decided
- Change detection — content_hash reveals if web sources changed between runs
- Debugging — diagnose why a researcher produced a particular answer
- Future replay — with Content Addressable Storage (V2+), traces become reproducible
V1 note: Traces are audit logs, not deterministic replays. True replay requires storing the full fetched content (CAS), not just its hash. See Known Limitations.
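A fetch-trace entry like the one above could be appended by a small helper. This is a sketch, not the actual implementation; `log_fetch` is a hypothetical name, and the field names simply mirror the examples in this section.

```python
import hashlib
import json
import time
from pathlib import Path


def log_fetch(trace_path: Path, step: int, url: str, content: str, decision: str):
    """Append one fetch entry to a JSONL trace, hashing the fetched content.

    Hypothetical helper: stores only the SHA-256 hash (pseudo-CAS),
    not the content itself, matching the V1 trace design.
    """
    entry = {
        "step": step,
        "action": "fetch_url",
        "url": url,
        "content_hash": "sha256:" + hashlib.sha256(content.encode()).hexdigest(),
        "content_length": len(content),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "decision": decision,
    }
    with trace_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

One JSON object per line keeps the trace streamable and trivially greppable, which is why JSONL fits audit logs better than a single JSON document.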
Data Flow
One research call (simplified)
CLI: ask "What are ideal crops for Utah?"
↓
MCP: research(question="What are ideal crops for Utah?", ...)
↓
Researcher agent loop:
1. Plan: "I need climate data for Utah + crop requirements"
2. Search: Tavily query for "Utah climate zones crops"
3. Fetch: Read top 3 URLs (hash each for pseudo-CAS)
4. Parse: Extract relevant info, preserve raw excerpts
5. Synthesize: "Based on X sources, ideal crops are Y"
6. Check gaps: "Couldn't find pest info" → categorize as SOURCE_NOT_FOUND
7. Check discoveries: "Found reference to USU soil study" → emit DiscoveryEvent
8. Compute confidence + factors
9. Return if confident, else iterate
↓
Response:
{
  "answer": "...",
  "citations": [
    {
      "source": "web",
      "locator": "https://...",
      "snippet": "...",
      "raw_excerpt": "verbatim text from source...",
      "confidence": 0.95
    }
  ],
  "gaps": [
    {
      "topic": "pest resistance",
      "category": "source_not_found",
      "detail": "No pest data found in general gardening sources"
    }
  ],
  "discovery_events": [
    {
      "type": "related_research",
      "suggested_researcher": "database",
      "query": "Utah soil salinity crop impact",
      "reason": "Multiple sources reference USU study data not available on web"
    }
  ],
  "confidence": 0.82,
  "confidence_factors": {
    "num_corroborating_sources": 3,
    "source_authority": "high",
    "contradiction_detected": false,
    "query_specificity_match": 0.85,
    "budget_exhausted": false,
    "recency": "current"
  },
  "cost_metadata": {
    "tokens_used": 8452,
    "iterations_run": 3,
    "wall_time_sec": 42.5,
    "budget_exhausted": false
  },
  "trace_id": "uuid-1234"
}
↓
CLI: Print answer + citations + gaps + discoveries, save trace
Design Decisions
Raw Evidence (The Synthesis Paradox)
When the PI synthesizes answers from multiple researchers, it risks "recursive compression loss" — each researcher has already summarized the raw data, and the PI summarizes those summaries. Subtle nuances and contradictions can be smoothed away.
Solution: Every citation includes a raw_excerpt field — verbatim text from the source. The PI can verify claims against raw evidence, detect when researchers interpret the same source differently, and flag high-entropy points for human review.
Categorized Gaps
Gaps are not just "things we didn't find." Different gap categories demand different responses from the PI:
| Category | PI Response |
|---|---|
| SOURCE_NOT_FOUND | Accept the gap or try a different researcher |
| ACCESS_DENIED | Specialized fetcher or human intervention |
| BUDGET_EXHAUSTED | Re-dispatch with larger budget |
| CONTRADICTORY_SOURCES | Examine raw_excerpts, flag for human review |
| SCOPE_EXCEEDED | Dispatch the appropriate specialist |
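The category-to-response mapping could be encoded as an enum plus a policy table. A minimal sketch under assumed names — the real PI policy may be richer than a static dict:

```python
from enum import Enum


class GapCategory(Enum):
    SOURCE_NOT_FOUND = "source_not_found"
    ACCESS_DENIED = "access_denied"
    BUDGET_EXHAUSTED = "budget_exhausted"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"


# Illustrative PI policy table mirroring the rows above; the string
# values are placeholder action names, not real handlers.
PI_RESPONSE = {
    GapCategory.SOURCE_NOT_FOUND: "accept_or_retry_other_researcher",
    GapCategory.ACCESS_DENIED: "specialized_fetcher_or_human",
    GapCategory.BUDGET_EXHAUSTED: "redispatch_with_larger_budget",
    GapCategory.CONTRADICTORY_SOURCES: "examine_raw_excerpts_flag_human",
    GapCategory.SCOPE_EXCEEDED: "dispatch_specialist",
}
```

Using an enum keeps the contract between researcher and PI closed: a researcher cannot invent a new gap category the PI has no policy for.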
Discovery Events (Lateral Metadata)
A researcher often encounters information relevant to other researchers' domains. Rather than ignoring these findings (hub-and-spoke limitation) or acting on them (scope creep), the researcher logs them as DiscoveryEvent objects.
In V1, discovery events are logged for analysis. In V2, the PI orchestrator processes them dynamically, enabling mid-investigation dispatch of additional researchers.
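A DiscoveryEvent could be as simple as the following dataclass; the field names follow the response example earlier in this document, while the types are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DiscoveryEvent:
    """A lateral finding relevant to another researcher's domain.

    In V1 these are only recorded in the response for later analysis;
    a V2 PI would decide per event whether to dispatch a new researcher.
    """
    type: str                  # e.g. "related_research"
    suggested_researcher: str  # e.g. "database"
    query: str                 # follow-up question to dispatch
    reason: str                # why the researcher flagged it
```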
Content Hashing (Pseudo-CAS)
Every fetched URL produces a SHA-256 hash in the trace. This provides change detection (did the source change between runs?) without the storage overhead of full content archiving. It's the foundation for V2's Content Addressable Storage.
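Change detection then reduces to comparing the hashes recorded in two trace runs. `detect_changes` is a hypothetical helper illustrating the idea:

```python
def detect_changes(old_hashes: dict[str, str], new_hashes: dict[str, str]) -> list[str]:
    """Return URLs whose content hash differs between two runs.

    old_hashes / new_hashes map URL -> "sha256:..." values pulled from
    the fetch entries of two trace files.
    """
    return [url for url, h in new_hashes.items()
            if url in old_hashes and old_hashes[url] != h]
```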
Contract Versioning
The research() tool signature is the stable contract. Changes to the contract require explicit versioning so that:
- Multiple researchers with different versions can coexist
- The PI knows what version it's calling
- Backwards compatibility (or breaking changes) is explicit
See ResearchContract for the full spec.
Future: The PI Agent
V2 will introduce the orchestrator:
class PIAgent:
    async def research_topic(self, question: str) -> Answer:
        # Dispatch multiple researchers in parallel
        web_results, arxiv_results = await asyncio.gather(
            self.web_researcher.research(question),
            self.arxiv_researcher.research(question),
        )
        all_results = [web_results, arxiv_results]

        # Process discovery events from both
        for event in web_results.discovery_events + arxiv_results.discovery_events:
            if self.should_dispatch(event):
                additional = await self.dispatch_researcher(event)
                all_results.append(additional)

        # Synthesize using raw_excerpts for ground-truth verification
        return self.synthesize(all_results)
The PI:
- Decides which researchers to dispatch (initially in parallel)
- Processes discovery events and dispatches follow-ups
- Compares raw_excerpts across researchers to detect contradictions
- Uses gap categories to decide whether to re-dispatch or accept
- Synthesizes into a final answer with full provenance
Assumptions & Constraints
- Citation grounding is structural, not assumed — raw_excerpt provides verifiable evidence. Citation validation (programmatic URL ping) is V2 work. V1 relies on the researcher having actually fetched the source.
- Tavily API is available — for V1 web search. Degradation strategy: note in gaps with ACCESS_DENIED category.
- Token budgets are enforced — the MCP server enforces at the process level, not just the agent level.
- Traces are audit logs — stored locally, hashed for integrity, but not full content archives (V2).
- No multi-user — single-user CLI for V1.
- Confidence is directional — LLM-generated with exposed factors; formal calibration after V1 data collection.
Known Limitations (V1)
| Limitation | Rationale | Future Resolution |
|---|---|---|
| No citation validation | Adds latency; document as known risk | V2: Validator node pings URLs/DOIs |
| Traces are audit logs, not replays | True replay requires CAS for fetched content | V2: Content Addressable Storage |
| Discovery events are logged only | MCP is request-response; no mid-flight dispatch | V2: PI processes events dynamically |
| No streaming of progress | MCP tool responses are one-shot | V2+: Streaming MCP or polling pattern |
| Hub-and-spoke only | V1 simplicity; PI is only coordinator | V2: Dynamic priority queue in PI |
| Confidence not calibrated | Need empirical data first | V1.1: Rubric after 20-30 queries |
Terminology
- Researcher: An agentic system specialized in a domain or source type
- Marchwarden: The researcher metaphor — stationed at the frontier, reporting back
- Discovery Event: A lateral finding relevant to another researcher's domain
- Trace: A JSONL audit log of all decisions made during one research call
- Gap: An unresolved aspect of the question, categorized by cause
- Raw Excerpt: Verbatim text from a source, bypassing researcher synthesis
- Content Hash: SHA-256 of fetched content, enabling change detection
See also: ResearchContract, DevelopmentGuide