Architecture
Jeff Smith edited this page 2026-04-08 12:28:10 -06:00


Overview

Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses.

┌─────────────┐
│  PI Agent   │  Orchestrates, synthesizes, decides what to research
└──────┬──────┘
       │ dispatch research(question)
       │
  ┌────┴──────────────────────────┐
  │                               │
┌─┴────────────────────┐  ┌───────┴─────────────────┐
│ Web Researcher (MCP) │  │ Future: DB, Arxiv, etc. │
│  - Search (Tavily)   │  │ (V2+)                   │
│  - Fetch URLs        │  │                         │
│  - Internal loop     │  │                         │
│  - Return citations  │  │                         │
│  - Raw evidence      │  │                         │
│  - Discovery events  │  │                         │
└──────────────────────┘  └─────────────────────────┘

Components

Researchers (MCP servers)

Each researcher is a standalone MCP server that:

  • Exposes a single tool: research(question, context, depth, constraints)
  • Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
  • Returns structured data: answer, citations (with raw evidence), gaps (categorized), discovery_events, confidence + confidence_factors, cost_metadata, trace_id
  • Enforces budgets: iteration cap and token limit
  • Logs all internal steps to JSONL trace files with content hashes

V1 researcher: Web search + fetch

  • Uses Tavily for searching
  • Fetches full text from URLs
  • Iterates up to 5 times or until the budget is exhausted
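
The iteration-cap-plus-token-budget loop above can be sketched roughly as follows. All names here (`run_research_loop`, `step_fn`, `TOKEN_BUDGET`) are illustrative, not the actual implementation:

```python
# Illustrative sketch of the V1 researcher loop: iterate up to 5 times
# or until the token budget is exhausted, whichever comes first.
MAX_ITERATIONS = 5
TOKEN_BUDGET = 10_000

def run_research_loop(question, step_fn):
    """step_fn(question, state) -> (state, tokens_spent, confident)."""
    state, tokens_used = {}, 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        state, spent, confident = step_fn(question, state)
        tokens_used += spent
        if confident or tokens_used >= TOKEN_BUDGET:
            break
    # Budget accounting mirrors the cost_metadata fields in the response.
    return state, {
        "iterations_run": iteration,
        "tokens_used": tokens_used,
        "budget_exhausted": tokens_used >= TOKEN_BUDGET,
    }
```

Whatever the real loop looks like, the key property is that both limits are checked every iteration, so a researcher can never exceed either cap by more than one step.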

Future researchers (V2+): Database, Arxiv, internal documents, etc.

MCP Protocol

Marchwarden uses the Model Context Protocol (MCP) as the boundary between researchers and their callers. This gives us:

  • Language agnostic — researchers can be Python, Node, Go, etc.
  • Process isolation — a crashed researcher can't take down the PI
  • Clean contract — one tool signature, versioned independently
  • Parallel dispatch — PI can await multiple researchers simultaneously

MCP constraint: The protocol is JSON-RPC (request-response). A researcher cannot emit streaming events or notifications mid-loop. All output — including discovery events — is returned in the final response. This is a known V1 limitation; see Known Limitations below.

CLI Shim

For V1, the CLI is the test harness that stands in for the PI:

marchwarden ask "what are ideal crops for Utah?"
marchwarden replay <trace_id>

In V2, the CLI is replaced by a full PI orchestrator agent.

Trace Logging

Every research call produces a JSONL trace log:

~/.marchwarden/traces/{trace_id}.jsonl

Each line is a JSON object:

{
  "step": 1,
  "action": "search",
  "query": "Utah climate gardening",
  "result": {...},
  "timestamp": "2026-04-08T12:00:00Z",
  "decision": "query was relevant, fetching top 3 URLs"
}

For fetch actions, traces include a content_hash (SHA-256):

{
  "step": 2,
  "action": "fetch_url",
  "url": "https://extension.usu.edu/gardening/utah-crops",
  "content_hash": "sha256:a3f2b8c91d...",
  "content_length": 14523,
  "timestamp": "2026-04-08T12:00:05Z",
  "decision": "Relevant to question; extracting crop data"
}
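
Producing such a trace line needs only the standard library. A minimal sketch (the helper name `fetch_trace_entry` is hypothetical):

```python
import hashlib
import json
from datetime import datetime, timezone

def fetch_trace_entry(step, url, content, decision):
    """Build one JSONL trace line for a fetch_url action (illustrative)."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return json.dumps({
        "step": step,
        "action": "fetch_url",
        "url": url,
        "content_hash": f"sha256:{digest}",
        "content_length": len(content),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
    })
```

Each call yields one line to append to `{trace_id}.jsonl`; hashing the decoded text (rather than raw bytes) is one possible convention, not necessarily the project's.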

Traces support:

  • Auditing — see exactly what the researcher did and decided
  • Change detection — content_hash reveals if web sources changed between runs
  • Debugging — diagnose why a researcher produced a particular answer
  • Future replay — with Content Addressable Storage (V2+), traces become reproducible

V1 note: Traces are audit logs, not deterministic replays. True replay requires storing the full fetched content (CAS), not just its hash. See Known Limitations.

Data Flow

One research call (simplified)

CLI: ask "What are ideal crops for Utah?"
  ↓
MCP: research(question="What are ideal crops for Utah?", ...)
  ↓
Researcher agent loop:
  1. Plan: "I need climate data for Utah + crop requirements"
  2. Search: Tavily query for "Utah climate zones crops"
  3. Fetch: Read top 3 URLs (hash each for pseudo-CAS)
  4. Parse: Extract relevant info, preserve raw excerpts
  5. Synthesize: "Based on X sources, ideal crops are Y"
  6. Check gaps: "Couldn't find pest info" → categorize as SOURCE_NOT_FOUND
  7. Check discoveries: "Found reference to USU soil study" → emit DiscoveryEvent
  8. Compute confidence + factors
  9. Return if confident, else iterate
  ↓
Response:
  {
    "answer": "...",
    "citations": [
      {
        "source": "web",
        "locator": "https://...",
        "snippet": "...",
        "raw_excerpt": "verbatim text from source...",
        "confidence": 0.95
      }
    ],
    "gaps": [
      {
        "topic": "pest resistance",
        "category": "source_not_found",
        "detail": "No pest data found in general gardening sources"
      }
    ],
    "discovery_events": [
      {
        "type": "related_research",
        "suggested_researcher": "database",
        "query": "Utah soil salinity crop impact",
        "reason": "Multiple sources reference USU study data not available on web"
      }
    ],
    "confidence": 0.82,
    "confidence_factors": {
      "num_corroborating_sources": 3,
      "source_authority": "high",
      "contradiction_detected": false,
      "query_specificity_match": 0.85,
      "budget_exhausted": false,
      "recency": "current"
    },
    "cost_metadata": {
      "tokens_used": 8452,
      "iterations_run": 3,
      "wall_time_sec": 42.5,
      "budget_exhausted": false
    },
    "trace_id": "uuid-1234"
  }
  ↓
CLI: Print answer + citations + gaps + discoveries, save trace
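
On the caller side, a minimal shape check against this response might look like the following. The field list is the one shown above; the helper itself is illustrative, not part of the contract:

```python
# Top-level fields every research() response carries, per the example above.
REQUIRED_FIELDS = {
    "answer", "citations", "gaps", "discovery_events",
    "confidence", "confidence_factors", "cost_metadata", "trace_id",
}

def missing_fields(resp: dict) -> list:
    """Return the sorted list of required fields absent from a response."""
    return sorted(REQUIRED_FIELDS - resp.keys())
```

A CLI shim (or later the PI) could refuse to synthesize from a response with missing fields rather than silently dropping provenance.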

Design Decisions

Raw Evidence (The Synthesis Paradox)

When the PI synthesizes answers from multiple researchers, it risks "recursive compression loss" — each researcher has already summarized the raw data, and the PI summarizes those summaries. Subtle nuances and contradictions can be smoothed away.

Solution: Every citation includes a raw_excerpt field — verbatim text from the source. The PI can verify claims against raw evidence, detect when researchers interpret the same source differently, and flag high-entropy points for human review.
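
One cheap check raw_excerpt enables: if two citations point at the same locator but their excerpts barely overlap, the researchers read the source differently and the claim deserves review. A sketch, with the similarity measure and threshold invented for illustration:

```python
from difflib import SequenceMatcher

def divergent_citations(citations, threshold=0.5):
    """Flag locators whose raw excerpts diverge across citations."""
    by_locator = {}
    for c in citations:
        by_locator.setdefault(c["locator"], []).append(c["raw_excerpt"])
    flagged = []
    for locator, excerpts in by_locator.items():
        for i in range(len(excerpts)):
            for j in range(i + 1, len(excerpts)):
                # Low textual similarity = possible interpretation conflict.
                if SequenceMatcher(None, excerpts[i], excerpts[j]).ratio() < threshold:
                    flagged.append(locator)
    return flagged
```

A real PI would likely use something smarter than string similarity, but the point stands: the check is only possible because verbatim text survives synthesis.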

Categorized Gaps

Gaps are not just "things we didn't find." Different gap categories demand different responses from the PI:

Category                 PI Response
SOURCE_NOT_FOUND         Accept the gap or try a different researcher
ACCESS_DENIED            Specialized fetcher or human intervention
BUDGET_EXHAUSTED         Re-dispatch with a larger budget
CONTRADICTORY_SOURCES    Examine raw_excerpts, flag for human review
SCOPE_EXCEEDED           Dispatch the appropriate specialist
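
In code, the table above reduces to a dispatch policy lookup on the PI side. The action names here are illustrative shorthand for the responses listed:

```python
# Illustrative mapping of gap category -> PI follow-up action.
GAP_POLICY = {
    "source_not_found": "accept_or_try_other_researcher",
    "access_denied": "specialized_fetcher_or_human",
    "budget_exhausted": "redispatch_with_larger_budget",
    "contradictory_sources": "review_raw_excerpts",
    "scope_exceeded": "dispatch_specialist",
}

def plan_for_gaps(gaps):
    """Map each categorized gap to the PI action it demands."""
    return [(g["topic"], GAP_POLICY[g["category"]]) for g in gaps]
```

The categorization pays off precisely here: a flat list of "things not found" would force the PI to guess, while categories make the follow-up mechanical.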

Discovery Events (Lateral Metadata)

A researcher often encounters information relevant to other researchers' domains. Rather than ignoring these findings (hub-and-spoke limitation) or acting on them (scope creep), the researcher logs them as DiscoveryEvent objects.

In V1, discovery events are logged for analysis. In V2, the PI orchestrator processes them dynamically, enabling mid-investigation dispatch of additional researchers.
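
The event carries the fields shown in the response example; modeled as a dataclass it might look like this (a sketch; the authoritative field set is defined in ResearchContract):

```python
from dataclasses import dataclass, asdict

@dataclass
class DiscoveryEvent:
    """Lateral finding relevant to another researcher's domain (V1: logged only)."""
    type: str                  # e.g. "related_research"
    suggested_researcher: str  # e.g. "database"
    query: str                 # follow-up query the PI could dispatch in V2
    reason: str                # why the researcher flagged it
```

Because the event names a suggested researcher and a ready-to-dispatch query, the V2 PI can act on it without re-deriving context from the answer text.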

Content Hashing (Pseudo-CAS)

Every fetched URL produces a SHA-256 hash in the trace. This provides change detection (did the source change between runs?) without the storage overhead of full content archiving. It's the foundation for V2's Content Addressable Storage.
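
Change detection between two runs then reduces to comparing stored hashes per URL. A sketch (the trace entries are assumed to be parsed JSONL dicts as shown earlier; the helper name is hypothetical):

```python
def changed_sources(old_trace, new_trace):
    """Return URLs whose content_hash differs between two trace runs."""
    def hashes(trace):
        return {e["url"]: e["content_hash"]
                for e in trace if e.get("action") == "fetch_url"}
    old, new = hashes(old_trace), hashes(new_trace)
    # Only URLs fetched in both runs can be compared.
    return sorted(u for u in old.keys() & new.keys() if old[u] != new[u])
```

This answers "did my sources change?" for the cost of one hash per fetch, without archiving any content; full archiving is what V2's CAS adds.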

Contract Versioning

The research() tool signature is the stable contract. Changes to the contract require explicit versioning so that:

  • Multiple researchers with different versions can coexist
  • The PI knows what version it's calling
  • Backwards compatibility (or breaking changes) is explicit

See ResearchContract for the full spec.
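
A minimal compatibility check the PI might run before dispatch could look like this. Semver-like "same major version is compatible" semantics are an assumption for illustration; the real rules belong to ResearchContract:

```python
def compatible(pi_version: str, researcher_version: str) -> bool:
    """Assume semver-style 'MAJOR.MINOR': same major version = compatible."""
    return pi_version.split(".")[0] == researcher_version.split(".")[0]
```

Whatever the scheme, the point of explicit versioning is that this check is decidable before a call is made, rather than failing mid-research.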

Future: The PI Agent

V2 will introduce the orchestrator:

class PIAgent:
    async def research_topic(self, question: str) -> Answer:
        # Dispatch multiple researchers in parallel
        web_results, arxiv_results = await asyncio.gather(
            self.web_researcher.research(question),
            self.arxiv_researcher.research(question),
        )
        all_results = [web_results, arxiv_results]

        # Process discovery events from both
        for event in web_results.discovery_events + arxiv_results.discovery_events:
            if self.should_dispatch(event):
                additional = await self.dispatch_researcher(event)
                all_results.append(additional)

        # Synthesize using raw_excerpts for ground-truth verification
        return self.synthesize(all_results)

The PI:

  • Decides which researchers to dispatch (initially in parallel)
  • Processes discovery events and dispatches follow-ups
  • Compares raw_excerpts across researchers to detect contradictions
  • Uses gap categories to decide whether to re-dispatch or accept
  • Synthesizes into a final answer with full provenance

Assumptions & Constraints

  • Citation grounding is structural, not assumed — raw_excerpt provides verifiable evidence. Citation validation (programmatic URL ping) is V2 work. V1 relies on the researcher having actually fetched the source.
  • Tavily API is available — for V1 web search. Degradation strategy: note in gaps with ACCESS_DENIED category.
  • Token budgets are enforced — the MCP server enforces at the process level, not just the agent level.
  • Traces are audit logs — stored locally, hashed for integrity, but not full content archives (V2).
  • No multi-user — single-user CLI for V1.
  • Confidence is directional — LLM-generated with exposed factors; formal calibration after V1 data collection.

Known Limitations (V1)

Limitation                           Rationale                                         Future Resolution
No citation validation               Adds latency; document as known risk              V2: Validator node pings URLs/DOIs
Traces are audit logs, not replays   True replay requires CAS for fetched content      V2: Content Addressable Storage
Discovery events are logged only     MCP is request-response; no mid-flight dispatch   V2: PI processes events dynamically
No streaming of progress             MCP tool responses are one-shot                   V2+: Streaming MCP or polling pattern
Hub-and-spoke only                   V1 simplicity; the PI is the only coordinator     V2: Dynamic priority queue in PI
Confidence not calibrated            Need empirical data first                         V1.1: Rubric after 20-30 queries

Terminology

  • Researcher: An agentic system specialized in a domain or source type
  • Marchwarden: The researcher metaphor — stationed at the frontier, reporting back
  • Discovery Event: A lateral finding relevant to another researcher's domain
  • Trace: A JSONL audit log of all decisions made during one research call
  • Gap: An unresolved aspect of the question, categorized by cause
  • Raw Excerpt: Verbatim text from a source, bypassing researcher synthesis
  • Content Hash: SHA-256 of fetched content, enabling change detection

See also: ResearchContract, DevelopmentGuide