Initial wiki: Architecture, ResearchContract, DevelopmentGuide

- Architecture: system overview, component design, data flow
- ResearchContract: complete tool specification with examples
- DevelopmentGuide: setup, testing, workflow, debugging

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-08 11:58:26 -06:00 · 2026-04-08 11:58:26 -06:00 · a349d6f970
commit a349d6f970
3 changed files with 786 additions and 0 deletions
--- a/Architecture.md
+++ b/Architecture.md
@@ -0,0 +1,175 @@
 # Architecture
 ## Overview
 Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses.
 ```
 ┌─────────────┐
 │  PI Agent   │  Orchestrates, synthesizes, decides what to research
 └──────┬──────┘
       │ dispatch research(question)
       │
  ┌────┴──────────────────────────┐
  │                               │
 ┌─┴────────────────────┐  ┌───────┴─────────────────┐
 │ Web Researcher (MCP) │  │ Future: DB, Arxiv, etc. │
 │  - Search (Tavily)   │  │ (V2+)                   │
 │  - Fetch URLs        │  │                         │
 │  - Internal loop     │  │                         │
 │  - Return citations  │  │                         │
 └──────────────────────┘  └─────────────────────────┘
 ```
 ## Components
 ### Researchers (MCP servers)
 Each researcher is a **standalone MCP server** that:
 - Exposes a single tool: `research(question, context, depth, constraints)`
 - Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
 - Returns structured data: `answer`, `citations`, `gaps`, `confidence`, `cost_metadata`, `trace_id`
 - Enforces budgets: iteration cap and token limit
 - Logs all internal steps to JSONL trace files
 **V1 researcher**: Web search + fetch
 - Uses Tavily for searching
 - Fetches full text from URLs
 - Iterates up to 5 times or until the budget is exhausted
 **Future researchers** (V2+): Database, Arxiv, internal documents, etc.
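The budgeted loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `run_loop`, `LoopOutcome`, and the `step` callable are hypothetical names, and `step` stands in for one plan → search → fetch → synthesize pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class LoopOutcome:
    findings: List[str]
    iterations_run: int
    tokens_used: int
    budget_exhausted: bool


def run_loop(
    step: Callable[[int], Tuple[str, int, bool]],
    max_iterations: int = 5,
    token_budget: int = 20000,
) -> LoopOutcome:
    """Call `step` until it reports completion or a budget is hit.

    `step(i)` is a hypothetical callable returning
    (finding, tokens_spent, done) for iteration i.
    """
    findings: List[str] = []
    tokens_used = 0
    iterations_run = 0
    done = False
    # Budgets are checked before each iteration, so the final
    # iteration may overshoot the token budget.
    while not done and iterations_run < max_iterations and tokens_used < token_budget:
        finding, spent, done = step(iterations_run)
        findings.append(finding)
        tokens_used += spent
        iterations_run += 1
    return LoopOutcome(findings, iterations_run, tokens_used, budget_exhausted=not done)


def fake_step(i: int) -> Tuple[str, int, bool]:
    # Stand-in for a real iteration; it never declares itself done.
    return (f"finding {i}", 3000, False)


outcome = run_loop(fake_step, max_iterations=5, token_budget=10000)
print(outcome.iterations_run, outcome.tokens_used, outcome.budget_exhausted)
```

Because the check runs before each iteration, a stricter variant would estimate the cost of the next step and stop early instead of overshooting.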
 ### MCP Protocol
 Marchwarden uses the **Model Context Protocol (MCP)** as the boundary between researchers and their callers. This gives us:
 - **Language agnostic** — researchers can be Python, Node, Go, etc.
 - **Process isolation** — researcher crash doesn't crash the PI
 - **Clean contract** — one tool signature, versioned independently
 - **Parallel dispatch** — PI can await multiple researchers simultaneously
 ### CLI Shim
 For V1, the CLI is the test harness that stands in for the PI:
 ```bash
 marchwarden ask "what are ideal crops for Utah?"
 marchwarden replay <trace_id>
 ```
 In V2, the CLI is replaced by a full PI orchestrator agent.
 ### Trace Logging
 Every research call produces a **JSONL trace log**:
 ```
 ~/.marchwarden/traces/{trace_id}.jsonl
 ```
 Each line is a JSON object:
 ```json
 {
  "step": 1,
  "action": "search",
  "query": "Utah climate gardening",
  "result": {...},
  "timestamp": "2026-04-08T12:00:00Z",
  "decision": "query was relevant, fetching top 3 URLs"
 }
 ```
 Traces support:
 - **Debugging** — see exactly what the researcher did
 - **Replay** — walk through a past session exactly as it was recorded
 - **Eval** — audit decision-making
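As an illustration of the one-JSON-object-per-line format, a trace could be written and read back along these lines. The helper names `append_event` and `read_trace` are hypothetical, and the example writes to a temporary directory instead of `~/.marchwarden/traces/`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def append_event(trace_path: Path, step: int, action: str, **fields: Any) -> None:
    """Append one event as a single JSON line."""
    event = {
        "step": step,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    trace_path.parent.mkdir(parents=True, exist_ok=True)
    with trace_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def read_trace(trace_path: Path) -> List[Dict[str, Any]]:
    """Load every event from a trace file, skipping blank lines."""
    with trace_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


trace_file = Path(tempfile.mkdtemp()) / "example-trace.jsonl"
append_event(trace_file, 1, "search", query="Utah climate gardening")
append_event(trace_file, 2, "fetch", decision="query was relevant, fetching top 3 URLs")
events = read_trace(trace_file)
print([e["action"] for e in events])
```

Appending line by line means a crash mid-run still leaves every completed step on disk.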
 ## Data Flow
 ### One research call (simplified)
 ```
 CLI: ask "What are ideal crops for Utah?"
  ↓
 MCP: research(question="What are ideal crops for Utah?", ...)
  ↓
 Researcher agent loop:
  1. Plan: "I need climate data for Utah + crop requirements"
  2. Search: Tavily query for "Utah climate zones crops"
  3. Fetch: Read top 3 URLs
  4. Parse: Extract relevant info
  5. Synthesize: "Based on X sources, ideal crops are Y"
  6. Check gaps: "Couldn't find pest info"
  7. Return if confident, else iterate
  ↓
 Response:
  {
    "answer": "...",
    "citations": [
      {"source": "web", "locator": "https://...", "snippet": "...", "confidence": 0.95},
      ...
    ],
    "gaps": [
      {"topic": "pest resistance", "reason": "no sources found"},
    ],
    "cost_metadata": {
      "tokens_used": 8452,
      "iterations_run": 3,
      "wall_time_sec": 42.5
    },
    "trace_id": "uuid-1234"
  }
  ↓
 CLI: Print answer + citations, save trace
 ```
 ## Contract Versioning
 The `research()` tool signature is the stable contract. Changes to the contract require explicit versioning so that:
 - Multiple researchers with different versions can coexist
 - The PI knows what version it's calling
 - Backwards compatibility (or breaking changes) is explicit
 See [ResearchContract.md](ResearchContract.md) for the full spec.
 ## Future: The PI Agent
 V2 will introduce the orchestrator:
 ```python
 import asyncio
 class PIAgent:
    async def research_topic(self, question: str) -> Answer:
        # Dispatch multiple researchers in parallel and wait for all of them
        web_results, arxiv_results = await asyncio.gather(
            self.web_researcher.research(question),
            self.arxiv_researcher.research(question),
        )
        # Synthesize across the responses
        return self.synthesize([web_results, arxiv_results])
 ```
 The PI:
 - Decides which researchers to dispatch
 - Waits for all responses
 - Checks for conflicts, gaps, consensus
 - Synthesizes into a final answer
 - Can re-dispatch if gaps are critical
 ## Assumptions & Constraints
 - **Researchers are honest** — they don't hallucinate citations. If they cite something, it exists in the source.
 - **Tavily API is available** — for V1 web search. Degradation strategy TBD.
 - **Token budgets are enforced** — the researcher respects its budget; the MCP server enforces it at the process level.
 - **Traces are ephemeral** — stored locally for debugging, not synced to a database yet.
 - **No multi-user** — single-user CLI for V1.
 ## Terminology
 - **Researcher**: An agentic system specialized in a domain or source type
 - **Marchwarden**: The researcher metaphor — stationed at the frontier, reporting back
 - **Rihla**: (V2+) A unit of research work dispatched by the PI; one researcher's journey to answer a question
 - **Trace**: A JSONL log of all decisions made during one research call
 - **Gap**: An unresolved aspect of the question; the researcher couldn't find an answer
 ---
 See also: [ResearchContract.md](ResearchContract.md), [DevelopmentGuide.md](DevelopmentGuide.md)
--- a/DevelopmentGuide.md
+++ b/DevelopmentGuide.md
@@ -0,0 +1,259 @@
 # Development Guide
 ## Setup
 ### Prerequisites
 - Python 3.10+
 - pip (with venv)
 - Tavily API key (free tier available at https://tavily.com)
 ### Installation
 ```bash
 git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
 cd marchwarden
 # Create virtual environment
 python3 -m venv venv
 source venv/bin/activate  # On Windows: venv\Scripts\activate
 # Install in dev mode
 pip install -e ".[dev]"
 ```
 ### Environment Setup
 Create a `.env` file in the project root:
 ```env
 TAVILY_API_KEY=<your-tavily-api-key>
 ANTHROPIC_API_KEY=<your-claude-api-key>
 MARCHWARDEN_TRACE_DIR=~/.marchwarden/traces
 ```
 Check that the dependencies import cleanly:
 ```bash
 python -c "from anthropic import Anthropic; print('OK')"
 python -c "from tavily import TavilyClient; print('OK')"
 ```
 ## Project Structure
 ```
 marchwarden/
 ├── researchers/
 │   ├── __init__.py
 │   └── web/                    # V1: Web search researcher
 │       ├── __init__.py
 │       ├── server.py           # MCP server entry point
 │       ├── agent.py            # Inner research agent
 │       ├── models.py           # Pydantic models (ResearchResult, Citation, etc)
 │       └── tools.py            # Tavily integration, URL fetch
 ├── orchestrator/               # (V2+) PI agent
 │   ├── __init__.py
 │   └── pi.py
 ├── cli/                        # CLI shim (ask, replay)
 │   ├── __init__.py
 │   ├── main.py                 # Entry point (@click decorators)
 │   └── formatter.py            # Pretty-print results
 ├── tests/
 │   ├── __init__.py
 │   ├── test_web_researcher.py
 │   └── fixtures/
 ├── docs/
 │   └── wiki/                   # You are here
 ├── README.md
 ├── CONTRIBUTING.md
 ├── pyproject.toml
 └── .gitignore
 ```
 ## Running Tests
 ```bash
 # Run all tests
 pytest tests/
 # Run with verbose output
 pytest tests/ -v
 # Run a specific test file
 pytest tests/test_web_researcher.py
 # Run with coverage
 pytest --cov=. tests/
 ```
 The suite contains unit and integration tests. We do **not** mock the database or major external services; the only exception is Tavily, which may be mocked to avoid API costs.
 ## Running the CLI
 ```bash
 # Ask a question
 marchwarden ask "What are ideal crops for a garden in Utah?"
 # With options
 marchwarden ask "What is X?" --depth deep --budget 25000
 # Replay a trace
 marchwarden replay <trace_id>
 # Show help
 marchwarden --help
 ```
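 A run takes anywhere from a few seconds to over a minute (agent planning + searches + fetches). The examples in the research contract range from about 3 seconds for a shallow question to about 68 seconds for a deep one.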
 ## Development Workflow
 ### 1. Create a branch
 ```bash
 git checkout -b feat/your-feature-name
 ```
 Branch naming: `feat/`, `fix/`, `refactor/`, `chore/` + short description.
 ### 2. Make changes
 Edit code, add tests:
 ```bash
 # Run tests as you go
 pytest tests/test_your_feature.py -v
 # Check formatting
 black --check .
 ruff check .
 # Type checking (optional, informational)
 mypy . --ignore-missing-imports
 ```
 ### 3. Commit
 ```bash
 git add <files>
 git commit -m "Brief imperative description
 - What changed
 - Why it changed
 "
 ```
 Commits should be atomic (one logical change per commit).
 ### 4. Test before pushing
 ```bash
 pytest tests/
 black .
 ruff check . --fix
 ```
 ### 5. Push and create PR
 ```bash
 git push origin feat/your-feature-name
 ```
 Then on Forgejo: open a PR, request review, wait for CI/tests to pass.
 Once approved:
 - Merge via Forgejo UI (not locally)
 - Delete remote branch via Forgejo
 - Locally: `git checkout main && git pull --ff-only && git branch -d feat/your-feature-name`
 ## Debugging
 ### Viewing trace logs
 ```bash
 # Human-readable trace
 marchwarden replay <trace_id>
 # Raw JSON, pretty-printed one event at a time
 jq . ~/.marchwarden/traces/<trace_id>.jsonl
 # All events slurped into a single JSON array
 jq -s . ~/.marchwarden/traces/<trace_id>.jsonl
 ```
 ### Debug logging
 Set `MARCHWARDEN_DEBUG=1` for verbose logs:
 ```bash
 MARCHWARDEN_DEBUG=1 marchwarden ask "What is X?"
 ```
 ### Interactive testing
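 Use the Python REPL. `research()` is async per the contract, so run it with `asyncio.run` (this assumes `WebResearcher.research` is a coroutine):
 ```bash
 python
 >>> import asyncio
 >>> from researchers.web import WebResearcher
 >>> researcher = WebResearcher()
 >>> result = asyncio.run(researcher.research("What is X?"))
 >>> print(result.answer)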
 ```
 ## Common Tasks
 ### Adding a new tool to the researcher
 1. Define the tool in `researchers/web/tools.py`
 2. Register it in the agent's tool list (`researchers/web/agent.py`)
 3. Add test coverage in `tests/test_web_researcher.py`
 4. Update docs if it changes the contract
 ### Changing the research contract
 If you need to modify the `research()` signature:
 1. Update `researchers/web/models.py` (ResearchResult, Citation, etc)
 2. Update `researchers/web/agent.py` to produce the new fields
 3. Update `docs/wiki/ResearchContract.md`
 4. Add a migration guide if breaking
 5. Tests must pass with new signature
 ### Running cost analysis
 See how much a research call costs:
 ```bash
 marchwarden ask "Q" --verbose
 # Shows: tokens_used, iterations_run, wall_time_sec
 ```
 For batch analysis:
 ```python
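 import glob
 import json
 import os
 # glob does not expand "~", so expand it explicitly
 trace_dir = os.path.expanduser("~/.marchwarden/traces")
 for trace_file in glob.glob(os.path.join(trace_dir, "*.jsonl")):
    with open(trace_file, encoding="utf-8") as fh:
        for line in fh:
            event = json.loads(line)
            # Analyze cost_metadata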
 ```
 ## FAQ
 **Q: How do I add a new researcher?**  
 A: Create `researchers/new_source/` with the same structure as `researchers/web/`. Implement `research()` and expose it as an MCP server. Test with the CLI.
 **Q: Do I need to handle Tavily failures?**  
 A: Yes. Catch `TavilyError` and fall back to what you have. Document in `gaps`.
 **Q: What if Anthropic API goes down?**  
 A: The agent will fail. Retry logic TBD. For now, it's a blocker.
 **Q: How do I deploy this?**  
 A: V1 is CLI-only, local use only. V2 will have a PI orchestrator with real deployment needs.
 ---
 See also: [Architecture.md](Architecture.md), [ResearchContract.md](ResearchContract.md), [../CONTRIBUTING.md](../CONTRIBUTING.md)
--- a/ResearchContract.md
+++ b/ResearchContract.md
@@ -0,0 +1,352 @@
 # Research Contract
 This document defines the `research()` tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI).
 ## Tool Signature
 ```python
 from typing import Literal, Optional
 async def research(
    question: str,
    context: Optional[str] = None,
    depth: Literal["shallow", "balanced", "deep"] = "balanced",
    constraints: Optional[ResearchConstraints] = None,
 ) -> ResearchResult: ...
 ```
 ### Input Parameters
 #### `question` (required, string)
 The question the researcher is asked to investigate. Examples:
 - "What are ideal crops for a garden in Utah?"
 - "Summarize recent developments in transformer architectures"
 - "What is the legal status of AI in France?"
 Constraints: 1–500 characters, UTF-8 encoded.
 #### `context` (optional, string)
 What the PI or caller already knows. The researcher uses this to avoid duplicating effort or to refocus. Examples:
 - "I already know Utah is in USDA zones 3-8. Focus on water requirements."
 - "I've read the 2024 papers on LoRA. What's new in 2025?"
 Constraints: 0–2000 characters.
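The length limits on `question` and `context` could be checked before any research starts. A minimal sketch, assuming the limits count Unicode characters (not bytes) and that a violation raises `ValueError`; the function name `validate_inputs` is hypothetical.

```python
from typing import Optional

MAX_QUESTION_CHARS = 500
MAX_CONTEXT_CHARS = 2000


def validate_inputs(question: str, context: Optional[str] = None) -> None:
    """Raise ValueError if the inputs violate the documented length limits."""
    # len() counts Unicode code points, which matches "characters" here.
    if not 1 <= len(question) <= MAX_QUESTION_CHARS:
        raise ValueError(
            f"question must be 1-{MAX_QUESTION_CHARS} characters, got {len(question)}"
        )
    if context is not None and len(context) > MAX_CONTEXT_CHARS:
        raise ValueError(
            f"context must be at most {MAX_CONTEXT_CHARS} characters, got {len(context)}"
        )


validate_inputs("What are ideal crops for a garden in Utah?")
try:
    validate_inputs("")
except ValueError as exc:
    print(f"rejected: {exc}")
```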
 #### `depth` (optional, enum)
 How thoroughly to research:
 - `"shallow"` — quick scan, 1–2 iterations, ~5k tokens. For "does this exist?" questions.
 - `"balanced"` (default) — moderate depth, 2–4 iterations, ~15k tokens. For typical questions.
 - `"deep"` — thorough investigation, up to 5 iterations, ~25k tokens. For important decisions.
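 The researcher treats this as a *hint*, not a strict constraint. The actual depth depends on how much content is available and how confident the researcher becomes. The token figures are also bounded by `token_budget`: with the default of 20000, a `deep` run needs a larger budget passed in `constraints` to reach ~25k tokens.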
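One way to turn the depth levels into concrete starting values is a lookup table. The sketch below is an assumption about how a researcher might do this: `DepthPreset`, `DEPTH_PRESETS`, and `preset_for` are hypothetical names, and each preset uses the upper end of the iteration range listed above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DepthPreset:
    max_iterations: int   # upper end of the documented iteration range
    approx_tokens: int    # documented approximate token usage


DEPTH_PRESETS: Dict[str, DepthPreset] = {
    "shallow": DepthPreset(max_iterations=2, approx_tokens=5_000),
    "balanced": DepthPreset(max_iterations=4, approx_tokens=15_000),
    "deep": DepthPreset(max_iterations=5, approx_tokens=25_000),
}


def preset_for(depth: str = "balanced") -> DepthPreset:
    """Look up the preset for a depth level, rejecting unknown values."""
    try:
        return DEPTH_PRESETS[depth]
    except KeyError:
        raise ValueError(f"unknown depth: {depth!r}") from None


print(preset_for("deep"))
```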
 #### `constraints` (optional, object)
 Fine-grained control over researcher behavior:
 ```python
 from dataclasses import dataclass
 from typing import Optional
 @dataclass
 class ResearchConstraints:
    max_iterations: int = 5              # Stop after N iterations, regardless of progress
    token_budget: int = 20000            # Soft limit on tokens; the researcher respects it
    max_sources: int = 10                # Max number of sources to fetch
    source_filter: Optional[str] = None  # Only search specific domains (V2)
 ```
 If not provided, defaults are:
 - `max_iterations`: 5
 - `token_budget`: 20000 (Sonnet 3.5 equivalent)
 - `max_sources`: 10
 The MCP server **enforces** these constraints and will stop the researcher if it exceeds them.
 ---
 ### Output: ResearchResult
 ```python
@dataclass
 class ResearchResult:
    answer: str                      # The synthesized answer
    citations: List[Citation]        # Sources used
    gaps: List[Gap]                  # What couldn't be resolved
    confidence: float                # 0.0–1.0 overall confidence
    cost_metadata: CostMetadata      # Resource usage
    trace_id: str                    # UUID linking to JSONL trace log
 ```
 #### `answer` (string)
 The synthesized answer. Should be:
 - **Grounded** — every claim traces back to a citation
 - **Humble** — includes caveats and confidence levels
 - **Actionable** — structured so the caller can use it
 Example:
 ```
 In Utah (USDA zones 3-8), ideal crops depend on elevation and season:
 High elevation (>7k ft): Short-season crops dominate. Cool-season vegetables 
 (peas, lettuce, potatoes) thrive. Fruit: apples, berries. Summer crops 
 (tomatoes, squash) work in south-facing microclimates.
 Lower elevation: Full range possible. Long growing season supports tomatoes, 
 peppers, squash. Perennials (fruit trees, asparagus) are popular.
 Water is critical: Utah averages 10-20" annual precipitation (dry for vegetable 
 gardening). Most gardeners supplement with irrigation.
 Pests: Japanese beetles (south), aphids (statewide). Deer pressure varies by 
 location.
 See sources below for varietal recommendations by specific county.
 ```
 #### `citations` (list of Citation objects)
 ```python
@dataclass
 class Citation:
    source: str              # "web", "file", "database", etc
    locator: str             # URL, file path, row ID, or unique identifier
    title: Optional[str]     # Human-readable title (for web)
    snippet: Optional[str]   # Relevant excerpt (50–200 chars)
    confidence: float        # 0.0–1.0: researcher's confidence in this source's accuracy
 ```
 Example:
 ```python
 Citation(
    source="web",
    locator="https://extension.oregonstate.edu/ask-expert/featured/what-are-ideal-garden-crops-utah-zone",
    title="Oregon State Extension: Ideal Crops for Utah Gardens",
    snippet="Cool-season crops (peas, lettuce, potatoes) thrive above 7,000 feet. Irrigation essential.",
    confidence=0.9
 )
 ```
 Citations must be:
 - **Verifiable** — a human can follow the locator and confirm the claim
 - **Not hallucinated** — the researcher actually read/fetched the source
 - **Attributed** — each claim in `answer` should link to at least one citation
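The "not hallucinated" rule lends itself to a mechanical check: every locator should be among the pages the researcher actually fetched, and every snippet should appear in that page's text. The sketch below is illustrative only. `unverifiable` and the `fetched` mapping are hypothetical, and this `Citation` mirrors the contract's fields but adds defaults for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Citation:
    source: str
    locator: str
    title: Optional[str] = None
    snippet: Optional[str] = None
    confidence: float = 0.0


def _normalize(text: str) -> str:
    # Collapse whitespace and ignore case so formatting differences don't matter.
    return " ".join(text.split()).lower()


def unverifiable(citations: List[Citation], fetched: Dict[str, str]) -> List[Citation]:
    """Return citations whose locator was never fetched or whose snippet is absent."""
    problems: List[Citation] = []
    for citation in citations:
        page = fetched.get(citation.locator)
        if page is None:
            problems.append(citation)  # the source was never read
        elif citation.snippet and _normalize(citation.snippet) not in _normalize(page):
            problems.append(citation)  # the excerpt is not in the source
    return problems


fetched = {
    "https://en.wikipedia.org/wiki/Paris": "Paris is the capital and largest  city of France.",
}
citations = [
    Citation("web", "https://en.wikipedia.org/wiki/Paris",
             snippet="Paris is the capital and largest city of France"),
    Citation("web", "https://example.com/never-fetched", snippet="anything"),
    Citation("web", "https://en.wikipedia.org/wiki/Paris", snippet="Lyon is the capital"),
]
problems = unverifiable(citations, fetched)
print([c.locator for c in problems])
```

A substring match only confirms the excerpt exists in the source; whether the excerpt supports the claim still needs judgment.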
 #### `gaps` (list of Gap objects)
 ```python
@dataclass
 class Gap:
    topic: str      # What aspect wasn't resolved
    reason: str     # Why: "no sources found", "contradictory sources", "outside researcher scope"
 ```
 Example:
 ```python
 [
    Gap(topic="pest management by county", reason="no county-specific sources found"),
    Gap(topic="commercial varietals", reason="limited to hobby gardening sources"),
 ]
 ```
 Gaps are **critical for the PI**. They tell the orchestrator:
 - Whether to dispatch a different researcher
 - Whether to accept partial answers
 - Which questions remain for human input
 A researcher that admits gaps is more trustworthy than one that fabricates answers.
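To make the three decisions above concrete, a caller could sort gaps by their stated reason. The policy below is purely hypothetical (the contract does not prescribe one): out-of-scope gaps go to another researcher, contradictory sources go to a human, and everything else is accepted as a partial answer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gap:
    topic: str
    reason: str


def triage(gaps: List[Gap]) -> Dict[str, List[str]]:
    """Group gap topics by a suggested follow-up action (hypothetical policy)."""
    plan: Dict[str, List[str]] = {
        "redispatch": [],
        "human_review": [],
        "accept_partial": [],
    }
    for gap in gaps:
        reason = gap.reason.lower()
        if "outside" in reason and "scope" in reason:
            plan["redispatch"].append(gap.topic)
        elif "contradictory" in reason:
            plan["human_review"].append(gap.topic)
        else:
            plan["accept_partial"].append(gap.topic)
    return plan


plan = triage([
    Gap("pest management by county", "no county-specific sources found"),
    Gap("clinical trial status",
        "Requires access to clinical trials database (outside web search scope)"),
    Gap("yield estimates", "contradictory sources"),
])
print(plan)
```

Matching on free-text reasons is brittle; a later contract version could make the reason an enum if this kind of routing becomes common.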
 #### `confidence` (float, 0.0–1.0)
 Overall confidence in the answer:
 - `0.9–1.0`: High. All claims grounded in multiple strong sources.
 - `0.7–0.9`: Moderate. Most claims grounded; some inference; minor contradictions resolved.
 - `0.5–0.7`: Low. Few direct sources; lots of synthesis; clear gaps.
 - `< 0.5`: Very low. Mainly inference; major gaps; likely needs human review.
 The PI uses this to decide whether to act on the answer or seek more sources.
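The bands above share their boundary values, so any code that maps a score to a band has to pick a side. This sketch assumes a boundary score belongs to the higher band; `confidence_band` is a hypothetical helper.

```python
def confidence_band(confidence: float) -> str:
    """Map a 0.0-1.0 confidence score to the band names used in the contract."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be within 0.0-1.0, got {confidence}")
    # Assumption: a score exactly on a boundary falls into the higher band.
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "moderate"
    if confidence >= 0.5:
        return "low"
    return "very low"


for score in (0.99, 0.72, 0.55, 0.3):
    print(score, confidence_band(score))
```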
 #### `cost_metadata` (object)
 ```python
@dataclass
 class CostMetadata:
    tokens_used: int          # Total tokens (Claude + Tavily calls)
    iterations_run: int       # Number of inner-loop iterations
    wall_time_sec: float      # Actual elapsed time
    budget_exhausted: bool    # True if researcher hit iteration or token cap
 ```
 Example:
 ```python
 CostMetadata(
    tokens_used=8452,
    iterations_run=3,
    wall_time_sec=42.5,
    budget_exhausted=False
 )
 ```
 The PI uses this to:
 - Track costs (token budgets, actual spend)
 - Detect runaway loops (`budget_exhausted` is `True`)
 - Plan timeouts (`wall_time_sec` shows whether the latency is acceptable)
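A caller tracking these figures across several research calls might aggregate them as below. `summarize_costs` is a hypothetical helper, `CostMetadata` is redeclared so the example runs on its own, and the input values are taken from the three examples later in this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class CostMetadata:
    tokens_used: int
    iterations_run: int
    wall_time_sec: float
    budget_exhausted: bool


def summarize_costs(costs: List[CostMetadata]) -> Dict[str, Union[int, float]]:
    """Aggregate token spend, latency, and budget hits across calls."""
    return {
        "total_tokens": sum(c.tokens_used for c in costs),
        "total_wall_time_sec": sum(c.wall_time_sec for c in costs),
        "slowest_call_sec": max((c.wall_time_sec for c in costs), default=0.0),
        "budget_exhausted_count": sum(1 for c in costs if c.budget_exhausted),
    }


summary = summarize_costs([
    CostMetadata(450, 1, 3.2, False),
    CostMetadata(19240, 4, 67.8, False),
    CostMetadata(4998, 3, 31.2, True),
])
print(summary)
```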
 #### `trace_id` (string, UUID)
 A unique identifier linking to the JSONL trace log:
 ```
 ~/.marchwarden/traces/{trace_id}.jsonl
 ```
 The trace contains every decision, search, fetch, and parse step, for debugging and replay.
 ---
 ## Contract Rules
 ### The Researcher Must
 1. **Never hallucinate citations.** If a claim isn't in a source, don't cite it.
 2. **Admit gaps.** If you can't find something, say so. Don't guess.
 3. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`.
 4. **Ground claims.** Every factual claim in `answer` must link to at least one citation.
 5. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` and continue with what you have.
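Rule 5 can be sketched as a fetch loop that records each failure as a gap and keeps going. This is a sketch under stated assumptions: `fetch_all` and `fake_fetch` are hypothetical, gaps are plain dicts here, and the broad `except Exception` is a simplification that a real researcher would narrow to the errors its HTTP and search clients raise.

```python
from typing import Callable, Dict, List, Tuple


def fetch_all(
    urls: List[str], fetch: Callable[[str], str]
) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Fetch each URL, turning failures into gaps instead of aborting."""
    pages: Dict[str, str] = {}
    gaps: List[Dict[str, str]] = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # deliberately broad for this sketch
            gaps.append({"topic": url, "reason": f"fetch failed: {exc}"})
    return pages, gaps


def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP fetch; URLs containing "broken" fail.
    if "broken" in url:
        raise ConnectionError("HTTP 404")
    return f"contents of {url}"


pages, gaps = fetch_all(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_fetch,
)
print(len(pages), gaps)
```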
 ### The Caller (PI/CLI) Must
 1. **Accept partial answers.** A researcher that hits its budget but admits gaps is better than one that spins endlessly.
 2. **Use confidence and gaps.** Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer.
 3. **Check locators.** For important decisions, verify citations by following the locators.
 ---
 ## Examples
 ### Example 1: High-Confidence Answer
 Request:
 ```json
 {
  "question": "What is the capital of France?",
  "depth": "shallow"
 }
 ```
 Response:
 ```json
 {
  "answer": "Paris is the capital of France. It is the country's largest city and serves as the political, cultural, and economic center.",
  "citations": [
    {
      "source": "web",
      "locator": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "snippet": "Paris is the capital and largest city of France",
      "confidence": 0.99
    }
  ],
  "gaps": [],
  "confidence": 0.99,
  "cost_metadata": {
    "tokens_used": 450,
    "iterations_run": 1,
    "wall_time_sec": 3.2,
    "budget_exhausted": false
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440001"
 }
 ```
 ### Example 2: Partial Answer with Gaps
 Request:
 ```json
 {
  "question": "What emerging startups in biotech are working on CRISPR gene therapy?",
  "depth": "deep"
 }
 ```
 Response:
 ```json
 {
  "answer": "Several emerging startups are advancing CRISPR gene therapy... [detailed answer]",
  "citations": [
    {
      "source": "web",
      "locator": "https://www.crunchbase.com/...",
      "title": "Crunchbase: CRISPR Startups",
      "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics...",
      "confidence": 0.8
    }
  ],
  "gaps": [
    {
      "topic": "funding rounds in 2026",
      "reason": "Web sources only go through Q1 2026; may be stale"
    },
    {
      "topic": "clinical trial status",
      "reason": "Requires access to clinical trials database (outside web search scope)"
    }
  ],
  "confidence": 0.72,
  "cost_metadata": {
    "tokens_used": 19240,
    "iterations_run": 4,
    "wall_time_sec": 67.8,
    "budget_exhausted": false
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440002"
 }
 ```
 ### Example 3: Budget Exhausted
 Request:
 ```json
 {
  "question": "Comprehensive history of AI from 1950s to 2026",
  "depth": "deep",
  "constraints": {
    "max_iterations": 3,
    "token_budget": 5000
  }
 }
 ```
 Response:
 ```json
 {
  "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]",
  "citations": [
    { ... 3-4 citations ... }
  ],
  "gaps": [
    {
      "topic": "detailed timeline 2020-2026",
      "reason": "budget exhausted before deep synthesis"
    },
    {
      "topic": "minor research directions",
      "reason": "out of scope due to token limit"
    }
  ],
  "confidence": 0.55,
  "cost_metadata": {
    "tokens_used": 4998,
    "iterations_run": 3,
    "wall_time_sec": 31.2,
    "budget_exhausted": true
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440003"
 }
 ```
 ---
 ## Versioning
 The contract is versioned as `v1`. If breaking changes are needed (e.g., new required fields), the next version becomes `v2` and both can coexist in the network for a transition period.
 Current version: **v1**
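During a transition period a caller needs some way to tell which versions it can talk to. The contract does not yet say how a researcher reports its version, so the sketch below simply assumes the caller obtains a version string somehow; `SUPPORTED_VERSIONS` and `is_compatible` are hypothetical.

```python
from typing import FrozenSet

# Versions this (hypothetical) caller knows how to consume.
SUPPORTED_VERSIONS: FrozenSet[str] = frozenset({"v1"})


def is_compatible(
    reported_version: str, supported: FrozenSet[str] = SUPPORTED_VERSIONS
) -> bool:
    """Return True if the researcher's reported contract version is supported."""
    return reported_version.strip().lower() in supported


print(is_compatible("v1"), is_compatible("v2"))
```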
 ---
 See also: [Architecture.md](Architecture.md), [DevelopmentGuide.md](DevelopmentGuide.md)