Initial wiki: Architecture, ResearchContract, DevelopmentGuide

- Architecture: system overview, component design, data flow
- ResearchContract: complete tool specification with examples
- DevelopmentGuide: setup, testing, workflow, debugging

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-08 11:58:26 -06:00 · 2026-04-08 11:58:26 -06:00 · a349d6f970
commit a349d6f970
3 changed files with 786 additions and 0 deletions
--- a/Architecture.md
+++ b/Architecture.md
@ -0,0 +1,175 @@
+# Architecture
+
+## Overview
+
+Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses.
+
+```
+┌─────────────┐
+│  PI Agent   │  Orchestrates, synthesizes, decides what to research
+└──────┬──────┘
+       │ dispatch research(question)
+       │
+  ┌────┴──────────────────────────┐
+  │                               │
+┌─┴────────────────────┐  ┌───────┴─────────────────┐
+│ Web Researcher (MCP) │  │ Future: DB, Arxiv, etc. │
+│  - Search (Tavily)   │  │ (V2+)                   │
+│  - Fetch URLs        │  │                         │
+│  - Internal loop     │  │                         │
+│  - Return citations  │  │                         │
+└──────────────────────┘  └─────────────────────────┘
+```
+
+## Components
+
+### Researchers (MCP servers)
+
+Each researcher is a **standalone MCP server** that:
+- Exposes a single tool: `research(question, context, depth, constraints)`
+- Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
+- Returns structured data: `answer`, `citations`, `gaps`, `confidence`, `cost_metadata`, `trace_id`
+- Enforces budgets: iteration cap and token limit
+- Logs all internal steps to JSONL trace files
+
+**V1 researcher**: Web search + fetch
+- Uses Tavily for searching
+- Fetches full text from URLs
+- Iterates up to 5 times or until the budget is exhausted
+
+**Future researchers** (V2+): Database, Arxiv, internal documents, etc.
+
+### MCP Protocol
+
+Marchwarden uses the **Model Context Protocol (MCP)** as the boundary between researchers and their callers. This gives us:
+
+- **Language agnostic** — researchers can be Python, Node, Go, etc.
+- **Process isolation** — researcher crash doesn't crash the PI
+- **Clean contract** — one tool signature, versioned independently
+- **Parallel dispatch** — PI can await multiple researchers simultaneously
+
+### CLI Shim
+
+For V1, the CLI is the test harness that stands in for the PI:
+
+```bash
+marchwarden ask "what are ideal crops for Utah?"
+marchwarden replay <trace_id>
+```
+
+In V2, the CLI is replaced by a full PI orchestrator agent.
+
+### Trace Logging
+
+Every research call produces a **JSONL trace log**:
+
+```
+~/.marchwarden/traces/{trace_id}.jsonl
+```
+
+Each line is a JSON object:
+```json
+{
+  "step": 1,
+  "action": "search",
+  "query": "Utah climate gardening",
+  "result": {...},
+  "timestamp": "2026-04-08T12:00:00Z",
+  "decision": "query was relevant, fetching top 3 URLs"
+}
+```
+
+Traces support:
+- **Debugging** — see exactly what the researcher did
+- **Replay** — re-run a past session, same results
+- **Eval** — audit decision-making
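+
+The one-object-per-line layout above can be exercised with a short sketch. The helpers `append_event` and `read_trace` are hypothetical names, not part of Marchwarden, and the `url` field in the second event is an assumed example; only the JSONL layout itself comes from this page.
+
+```python
+import json
+import tempfile
+from pathlib import Path
+
+
+def append_event(trace_path: Path, event: dict) -> None:
+    """Append one event as a single JSON line (hypothetical helper)."""
+    trace_path.parent.mkdir(parents=True, exist_ok=True)
+    with trace_path.open("a", encoding="utf-8") as f:
+        f.write(json.dumps(event) + "\n")
+
+
+def read_trace(trace_path: Path) -> list:
+    """Load every event from a JSONL trace, skipping blank lines."""
+    with trace_path.open(encoding="utf-8") as f:
+        return [json.loads(line) for line in f if line.strip()]
+
+
+# Demo against a temporary directory rather than ~/.marchwarden/traces
+demo_path = Path(tempfile.mkdtemp()) / "demo-trace.jsonl"
+append_event(demo_path, {"step": 1, "action": "search", "query": "Utah climate gardening"})
+append_event(demo_path, {"step": 2, "action": "fetch", "url": "https://example.com"})
+events = read_trace(demo_path)
+```
+
+Appending a line per step means a crash partway through a run still leaves every earlier step readable.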
+
+## Data Flow
+
+### One research call (simplified)
+
+```
+CLI: ask "What are ideal crops for Utah?"
+  ↓
+MCP: research(question="What are ideal crops for Utah?", ...)
+  ↓
+Researcher agent loop:
+  1. Plan: "I need climate data for Utah + crop requirements"
+  2. Search: Tavily query for "Utah climate zones crops"
+  3. Fetch: Read top 3 URLs
+  4. Parse: Extract relevant info
+  5. Synthesize: "Based on X sources, ideal crops are Y"
+  6. Check gaps: "Couldn't find pest info"
+  7. Return if confident, else iterate
+  ↓
+Response:
+  {
+    "answer": "...",
+    "citations": [
+      {"source": "web", "locator": "https://...", "snippet": "...", "confidence": 0.95},
+      ...
+    ],
+    "gaps": [
+      {"topic": "pest resistance", "reason": "no sources found"}
+    ],
+    "cost_metadata": {
+      "tokens_used": 8452,
+      "iterations_run": 3,
+      "wall_time_sec": 42.5
+    },
+    "trace_id": "uuid-1234"
+  }
+  ↓
+CLI: Print answer + citations, save trace
+```
+
+## Contract Versioning
+
+The `research()` tool signature is the stable contract. Changes to the contract require explicit versioning so that:
+- Multiple researchers with different versions can coexist
+- The PI knows what version it's calling
+- Backwards compatibility (or breaking changes) is explicit
+
+See [ResearchContract.md](ResearchContract.md) for the full spec.
+
+## Future: The PI Agent
+
+V2 will introduce the orchestrator:
+
+```python
+import asyncio
+
+
+class PIAgent:
+    async def research_topic(self, question: str) -> "Answer":
+        # Dispatch the researchers concurrently and wait for both results
+        web_results, arxiv_results = await asyncio.gather(
+            self.web_researcher.research(question),
+            self.arxiv_researcher.research(question),
+        )
+
+        # Synthesize
+        return self.synthesize([web_results, arxiv_results])
+```
+
+The PI:
+- Decides which researchers to dispatch
+- Waits for all responses
+- Checks for conflicts, gaps, consensus
+- Synthesizes into a final answer
+- Can re-dispatch if gaps are critical
+
+## Assumptions & Constraints
+
+- **Researchers are honest** — they don't hallucinate citations. If they cite something, it exists in the source.
+- **Tavily API is available** — for V1 web search. Degradation strategy TBD.
+- **Token budgets are enforced** — the researcher respects its budget; the MCP server enforces it at the process level.
+- **Traces are ephemeral** — stored locally for debugging, not synced to a database yet.
+- **No multi-user** — single-user CLI for V1.
+
+## Terminology
+
+- **Researcher**: An agentic system specialized in a domain or source type
+- **Marchwarden**: The researcher metaphor — stationed at the frontier, reporting back
+- **Rihla**: (V2+) A unit of research work dispatched by the PI; one researcher's journey to answer a question
+- **Trace**: A JSONL log of all decisions made during one research call
+- **Gap**: An unresolved aspect of the question; the researcher couldn't find an answer
+
+---
+
+See also: [ResearchContract.md](ResearchContract.md), [DevelopmentGuide.md](DevelopmentGuide.md)
--- a/DevelopmentGuide.md
+++ b/DevelopmentGuide.md
@ -0,0 +1,259 @@
+# Development Guide
+
+## Setup
+
+### Prerequisites
+
+- Python 3.10+
+- pip (with venv)
+- Tavily API key (free tier available at https://tavily.com)
+
+### Installation
+
+```bash
+git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
+cd marchwarden
+
+# Create virtual environment
+python3 -m venv venv
+source venv/bin/activate  # On Windows: venv\Scripts\activate
+
+# Install in dev mode
+pip install -e ".[dev]"
+```
+
+### Environment Setup
+
+Create a `.env` file in the project root:
+
+```env
+TAVILY_API_KEY=<your-tavily-api-key>
+ANTHROPIC_API_KEY=<your-claude-api-key>
+MARCHWARDEN_TRACE_DIR=~/.marchwarden/traces
+```
+
+Check that both client libraries are installed and importable:
+
+```bash
+python -c "from anthropic import Anthropic; print('OK')"
+python -c "from tavily import TavilyClient; print('OK')"
+```
+
+## Project Structure
+
+```
+marchwarden/
+├── researchers/
+│   ├── __init__.py
+│   └── web/                    # V1: Web search researcher
+│       ├── __init__.py
+│       ├── server.py           # MCP server entry point
+│       ├── agent.py            # Inner research agent
+│       ├── models.py           # Pydantic models (ResearchResult, Citation, etc)
+│       └── tools.py            # Tavily integration, URL fetch
+├── orchestrator/               # (V2+) PI agent
+│   ├── __init__.py
+│   └── pi.py
+├── cli/                        # CLI shim (ask, replay)
+│   ├── __init__.py
+│   ├── main.py                 # Entry point (@click decorators)
+│   └── formatter.py            # Pretty-print results
+├── tests/
+│   ├── __init__.py
+│   ├── test_web_researcher.py
+│   └── fixtures/
+├── docs/
+│   └── wiki/                   # You are here
+├── README.md
+├── CONTRIBUTING.md
+├── pyproject.toml
+└── .gitignore
+```
+
+## Running Tests
+
+```bash
+# Run all tests
+pytest tests/
+
+# Run with verbose output
+pytest tests/ -v
+
+# Run a specific test file
+pytest tests/test_web_researcher.py
+
+# Run with coverage
+pytest --cov=. tests/
+```
+
+The suite combines unit and integration tests. We do **not** mock the database or other major external services; the one exception is Tavily, which may be mocked to avoid API costs.
+
+## Running the CLI
+
+```bash
+# Ask a question
+marchwarden ask "What are ideal crops for a garden in Utah?"
+
+# With options
+marchwarden ask "What is X?" --depth deep --budget 25000
+
+# Replay a trace
+marchwarden replay <trace_id>
+
+# Show help
+marchwarden --help
+```
+
+The first run will take a few seconds (agent planning + searches + fetches).
+
+## Development Workflow
+
+### 1. Create a branch
+
+```bash
+git checkout -b feat/your-feature-name
+```
+
+Branch naming: `feat/`, `fix/`, `refactor/`, `chore/` + short description.
+
+### 2. Make changes
+
+Edit code, add tests:
+
+```bash
+# Run tests as you go
+pytest tests/test_your_feature.py -v
+
+# Check formatting
+black --check .
+ruff check .
+
+# Type checking (optional, informational)
+mypy . --ignore-missing-imports
+```
+
+### 3. Commit
+
+```bash
+git add <files>
+git commit -m "Brief imperative description
+
+- What changed
+- Why it changed
+"
+```
+
+Commits should be atomic (one logical change per commit).
+
+### 4. Test before pushing
+
+```bash
+pytest tests/
+black .
+ruff check . --fix
+```
+
+### 5. Push and create PR
+
+```bash
+git push origin feat/your-feature-name
+```
+
+Then on Forgejo: open a PR, request review, wait for CI/tests to pass.
+
+Once approved:
+- Merge via Forgejo UI (not locally)
+- Delete remote branch via Forgejo
+- Locally: `git checkout main && git pull --ff-only && git branch -d feat/your-feature-name`
+
+## Debugging
+
+### Viewing trace logs
+
+```bash
+# Human-readable trace
+marchwarden replay <trace_id>
+
+# Raw JSON, one pretty-printed object per trace line
+jq . ~/.marchwarden/traces/<trace_id>.jsonl
+
+# Slurp all lines into a single JSON array
+jq -s . ~/.marchwarden/traces/<trace_id>.jsonl
+```
+
+### Debug logging
+
+Set `MARCHWARDEN_DEBUG=1` for verbose logs:
+
+```bash
+MARCHWARDEN_DEBUG=1 marchwarden ask "What is X?"
+```
+
+### Interactive testing
+
+Use Python REPL:
+
+```bash
+python
+>>> from researchers.web import WebResearcher
+>>> researcher = WebResearcher()
+>>> result = researcher.research("What is X?")
+>>> print(result.answer)
+```
+
+## Common Tasks
+
+### Adding a new tool to the researcher
+
+1. Define the tool in `researchers/web/tools.py`
+2. Register it in the agent's tool list (`researchers/web/agent.py`)
+3. Add test coverage in `tests/test_web_researcher.py`
+4. Update docs if it changes the contract
+
+### Changing the research contract
+
+If you need to modify the `research()` signature:
+
+1. Update `researchers/web/models.py` (ResearchResult, Citation, etc)
+2. Update `researchers/web/agent.py` to produce the new fields
+3. Update `docs/wiki/ResearchContract.md`
+4. Add a migration guide if breaking
+5. Tests must pass with new signature
+
+### Running cost analysis
+
+See how much a research call costs:
+
+```bash
+marchwarden ask "Q" --verbose
+# Shows: tokens_used, iterations_run, wall_time_sec
+```
+
+For batch analysis:
+
+```python
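+import glob
+import json
+import os
+
+# glob does not expand "~", so expand it explicitly
+trace_dir = os.path.expanduser("~/.marchwarden/traces")
+for trace_file in glob.glob(os.path.join(trace_dir, "*.jsonl")):
+    with open(trace_file) as f:
+        for line in f:
+            event = json.loads(line)
+            # Analyze cost_metadata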
+```
+
+## FAQ
+
+**Q: How do I add a new researcher?**  
+A: Create `researchers/new_source/` with the same structure as `researchers/web/`. Implement `research()` and expose it as an MCP server, then test it with the CLI.
+
+**Q: Do I need to handle Tavily failures?**  
+A: Yes. Catch `TavilyError` and fall back to what you have. Document in `gaps`.
+
+**Q: What if Anthropic API goes down?**  
+A: The agent will fail. Retry logic TBD. For now, it's a blocker.
+
+**Q: How do I deploy this?**  
+A: V1 is CLI-only, local use only. V2 will have a PI orchestrator with real deployment needs.
+
+---
+
+See also: [Architecture.md](Architecture.md), [ResearchContract.md](ResearchContract.md), [../CONTRIBUTING.md](../CONTRIBUTING.md)
--- a/ResearchContract.md
+++ b/ResearchContract.md
@ -0,0 +1,352 @@
+# Research Contract
+
+This document defines the `research()` tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI).
+
+## Tool Signature
+
+```python
+async def research(
+    question: str,
+    context: Optional[str] = None,
+    depth: Literal["shallow", "balanced", "deep"] = "balanced",
+    constraints: Optional[ResearchConstraints] = None,
+) -> ResearchResult
+```
+
+### Input Parameters
+
+#### `question` (required, string)
+The question the researcher is asked to investigate. Examples:
+- "What are ideal crops for a garden in Utah?"
+- "Summarize recent developments in transformer architectures"
+- "What is the legal status of AI in France?"
+
+Constraints: 1–500 characters, UTF-8 encoded.
+
+#### `context` (optional, string)
+What the PI or caller already knows. The researcher uses this to avoid duplicating effort or to refocus. Examples:
+- "I already know Utah is in USDA zones 3-8. Focus on water requirements."
+- "I've read the 2024 papers on LoRA. What's new in 2025?"
+
+Constraints: 0–2000 characters.
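+
+A caller could check these limits before dispatching. The sketch below is illustrative only: `validate_inputs` is a hypothetical helper, and it assumes "characters" means Python string length (code points) rather than UTF-8 bytes, which the contract does not spell out.
+
+```python
+from typing import Optional
+
+MAX_QUESTION_CHARS = 500   # from the `question` constraint
+MAX_CONTEXT_CHARS = 2000   # from the `context` constraint
+
+
+def validate_inputs(question: str, context: Optional[str] = None) -> None:
+    """Raise ValueError if the inputs fall outside the documented limits."""
+    if not 1 <= len(question) <= MAX_QUESTION_CHARS:
+        raise ValueError(f"question must be 1-{MAX_QUESTION_CHARS} characters")
+    if context is not None and len(context) > MAX_CONTEXT_CHARS:
+        raise ValueError(f"context must be at most {MAX_CONTEXT_CHARS} characters")
+
+
+def is_valid(question: str, context: Optional[str] = None) -> bool:
+    """Convenience wrapper that reports validity instead of raising."""
+    try:
+        validate_inputs(question, context)
+    except ValueError:
+        return False
+    return True
+```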
+
+#### `depth` (optional, enum)
+How thoroughly to research:
+- `"shallow"` — quick scan, 1–2 iterations, ~5k tokens. For "does this exist?" questions.
+- `"balanced"` (default) — moderate depth, 2–4 iterations, ~15k tokens. For typical questions.
+- `"deep"` — thorough investigation, up to 5 iterations, ~25k tokens. For important decisions.
+
+The researcher uses this as a *hint*, not a strict constraint. The actual depth depends on how much content is available and how confident the researcher becomes.
+
+#### `constraints` (optional, object)
+Fine-grained control over researcher behavior:
+
+```python
+@dataclass
+class ResearchConstraints:
+    max_iterations: int = 5              # Stop after N iterations, regardless
+    token_budget: int = 20000            # Soft limit on tokens; researcher respects
+    max_sources: int = 10                # Max number of sources to fetch
+    source_filter: Optional[str] = None  # Only search specific domains (V2)
+```
+
+If not provided, defaults are:
+- `max_iterations`: 5
+- `token_budget`: 20000 (Sonnet 3.5 equivalent)
+- `max_sources`: 10
+
+The MCP server **enforces** these constraints and will stop the researcher if it exceeds them.
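+
+One way to picture the enforcement is a loop that checks both caps before each iteration. This is a minimal sketch, not the server's actual logic: `run_with_budget` is a hypothetical function, `step_costs` stands in for the real agent loop, and it assumes each iteration's token cost is known up front, which a real researcher would only learn after the call.
+
+```python
+from dataclasses import dataclass
+from typing import List
+
+
+@dataclass
+class ResearchConstraints:
+    max_iterations: int = 5      # same defaults as the contract
+    token_budget: int = 20000
+
+
+def run_with_budget(step_costs: List[int], constraints: ResearchConstraints) -> dict:
+    """Consume per-iteration token costs until the work ends or a cap is hit."""
+    tokens_used = 0
+    iterations_run = 0
+    budget_exhausted = False
+    for cost in step_costs:
+        over_iterations = iterations_run >= constraints.max_iterations
+        over_tokens = tokens_used + cost > constraints.token_budget
+        if over_iterations or over_tokens:
+            budget_exhausted = True  # work remained, but a cap stopped it
+            break
+        tokens_used += cost
+        iterations_run += 1
+    return {
+        "tokens_used": tokens_used,
+        "iterations_run": iterations_run,
+        "budget_exhausted": budget_exhausted,
+    }
+
+
+result = run_with_budget(
+    [2000, 2000, 2000, 2000],
+    ResearchConstraints(max_iterations=3, token_budget=5000),
+)
+```
+
+In this sketch `budget_exhausted` is set only when work remained and a cap stopped it; whether a run that finishes exactly at a cap also counts is a design choice the contract leaves open.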
+
+---
+
+### Output: ResearchResult
+
+```python
+@dataclass
+class ResearchResult:
+    answer: str                      # The synthesized answer
+    citations: List[Citation]        # Sources used
+    gaps: List[Gap]                  # What couldn't be resolved
+    confidence: float                # 0.0–1.0 overall confidence
+    cost_metadata: CostMetadata      # Resource usage
+    trace_id: str                    # UUID linking to JSONL trace log
+```
+
+#### `answer` (string)
+The synthesized answer. Should be:
+- **Grounded** — every claim traces back to a citation
+- **Humble** — includes caveats and confidence levels
+- **Actionable** — structured so the caller can use it
+
+Example:
+```
+In Utah (USDA zones 3-8), ideal crops depend on elevation and season:
+
+High elevation (>7k ft): Short-season crops dominate. Cool-season vegetables 
+(peas, lettuce, potatoes) thrive. Fruit: apples, berries. Summer crops 
+(tomatoes, squash) work in south-facing microclimates.
+
+Lower elevation: Full range possible. Long growing season supports tomatoes, 
+peppers, squash. Perennials (fruit trees, asparagus) are popular.
+
+Water is critical: Utah averages 10-20" annual precipitation (dry for vegetable 
+gardening). Most gardeners supplement with irrigation.
+
+Pests: Japanese beetles (south), aphids (statewide). Deer pressure varies by 
+location.
+
+See sources below for varietal recommendations by specific county.
+```
+
+#### `citations` (list of Citation objects)
+
+```python
+@dataclass
+class Citation:
+    source: str              # "web", "file", "database", etc
+    locator: str             # URL, file path, row ID, or unique identifier
+    title: Optional[str]     # Human-readable title (for web)
+    snippet: Optional[str]   # Relevant excerpt (50–200 chars)
+    confidence: float        # 0.0–1.0: researcher's confidence in this source's accuracy
+```
+
+Example:
+```python
+Citation(
+    source="web",
+    locator="https://extension.oregonstate.edu/ask-expert/featured/what-are-ideal-garden-crops-utah-zone",
+    title="Oregon State Extension: Ideal Crops for Utah Gardens",
+    snippet="Cool-season crops (peas, lettuce, potatoes) thrive above 7,000 feet. Irrigation essential.",
+    confidence=0.9
+)
+```
+
+Citations must be:
+- **Verifiable** — a human can follow the locator and confirm the claim
+- **Not hallucinated** — the researcher actually read/fetched the source
+- **Attributed** — each claim in `answer` should link to at least one citation
+
+#### `gaps` (list of Gap objects)
+
+```python
+@dataclass
+class Gap:
+    topic: str      # What aspect wasn't resolved
+    reason: str     # Why: "no sources found", "contradictory sources", "outside researcher scope"
+```
+
+Example:
+```python
+[
+    Gap(topic="pest management by county", reason="no county-specific sources found"),
+    Gap(topic="commercial varietals", reason="limited to hobby gardening sources"),
+]
+```
+
+Gaps are **critical for the PI**. They tell the orchestrator:
+- Whether to dispatch a different researcher
+- Whether to accept partial answers
+- Which questions remain for human input
+
+A researcher that admits gaps is more trustworthy than one that fabricates answers.
+
+#### `confidence` (float, 0.0–1.0)
+
+Overall confidence in the answer:
+- `0.9–1.0`: High. All claims grounded in multiple strong sources.
+- `0.7–0.9`: Moderate. Most claims grounded; some inference; minor contradictions resolved.
+- `0.5–0.7`: Low. Few direct sources; lots of synthesis; clear gaps.
+- `< 0.5`: Very low. Mainly inference; major gaps; likely needs human review.
+
+The PI uses this to decide whether to act on the answer or seek more sources.
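+
+A caller might map the score onto the bands above with something like the sketch below. `confidence_band` is a hypothetical helper; because the documented ranges share their endpoints, it assumes a boundary value such as 0.9 belongs to the higher band.
+
+```python
+def confidence_band(confidence: float) -> str:
+    """Map a 0.0-1.0 confidence score to one of the documented bands."""
+    if not 0.0 <= confidence <= 1.0:
+        raise ValueError("confidence must be between 0.0 and 1.0")
+    if confidence >= 0.9:
+        return "high"
+    if confidence >= 0.7:
+        return "moderate"
+    if confidence >= 0.5:
+        return "low"
+    return "very low"
+
+
+# Scores taken from the examples in this document
+bands = [confidence_band(c) for c in (0.99, 0.72, 0.55)]
+```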
+
+#### `cost_metadata` (object)
+
+```python
+@dataclass
+class CostMetadata:
+    tokens_used: int          # Total tokens (Claude + Tavily calls)
+    iterations_run: int       # Number of inner-loop iterations
+    wall_time_sec: float      # Actual elapsed time
+    budget_exhausted: bool    # True if researcher hit iteration or token cap
+```
+
+Example:
+```python
+CostMetadata(
+    tokens_used=8452,
+    iterations_run=3,
+    wall_time_sec=42.5,
+    budget_exhausted=False
+)
+```
+
+The PI uses this to:
+- Track costs (token budgets, actual spend)
+- Detect runaway loops (budget_exhausted = True)
+- Plan timeouts (wall_time_sec tells you if this is acceptable latency)
+
+#### `trace_id` (string, UUID)
+
+A unique identifier linking to the JSONL trace log:
+
+```
+~/.marchwarden/traces/{trace_id}.jsonl
+```
+
+The trace records every decision, search, fetch, and parse step for debugging and replay.
+
+---
+
+## Contract Rules
+
+### The Researcher Must
+
+1. **Never hallucinate citations.** If a claim isn't in a source, don't cite it.
+2. **Admit gaps.** If you can't find something, say so. Don't guess.
+3. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`.
+4. **Ground claims.** Every factual claim in `answer` must link to at least one citation.
+5. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` and continue with what you have.
+
+### The Caller (PI/CLI) Must
+
+1. **Accept partial answers.** A researcher that hits its budget but admits gaps is better than one that spins endlessly.
+2. **Use confidence and gaps.** Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer.
+3. **Check locators.** For important decisions, verify citations by following the locators.
+
+---
+
+## Examples
+
+### Example 1: High-Confidence Answer
+
+Request:
+```json
+{
+  "question": "What is the capital of France?",
+  "depth": "shallow"
+}
+```
+
+Response:
+```json
+{
+  "answer": "Paris is the capital of France. It is the country's largest city and serves as the political, cultural, and economic center.",
+  "citations": [
+    {
+      "source": "web",
+      "locator": "https://en.wikipedia.org/wiki/Paris",
+      "title": "Paris - Wikipedia",
+      "snippet": "Paris is the capital and largest city of France",
+      "confidence": 0.99
+    }
+  ],
+  "gaps": [],
+  "confidence": 0.99,
+  "cost_metadata": {
+    "tokens_used": 450,
+    "iterations_run": 1,
+    "wall_time_sec": 3.2,
+    "budget_exhausted": false
+  },
+  "trace_id": "550e8400-e29b-41d4-a716-446655440001"
+}
+```
+
+### Example 2: Partial Answer with Gaps
+
+Request:
+```json
+{
+  "question": "What emerging startups in biotech are working on CRISPR gene therapy?",
+  "depth": "deep"
+}
+```
+
+Response:
+```json
+{
+  "answer": "Several emerging startups are advancing CRISPR gene therapy... [detailed answer]",
+  "citations": [
+    {
+      "source": "web",
+      "locator": "https://www.crunchbase.com/...",
+      "title": "Crunchbase: CRISPR Startups",
+      "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics...",
+      "confidence": 0.8
+    }
+  ],
+  "gaps": [
+    {
+      "topic": "funding rounds in 2026",
+      "reason": "Web sources only go through Q1 2026; may be stale"
+    },
+    {
+      "topic": "clinical trial status",
+      "reason": "Requires access to clinical trials database (outside web search scope)"
+    }
+  ],
+  "confidence": 0.72,
+  "cost_metadata": {
+    "tokens_used": 19240,
+    "iterations_run": 4,
+    "wall_time_sec": 67.8,
+    "budget_exhausted": false
+  },
+  "trace_id": "550e8400-e29b-41d4-a716-446655440002"
+}
+```
+
+### Example 3: Budget Exhausted
+
+Request:
+```json
+{
+  "question": "Comprehensive history of AI from 1950s to 2026",
+  "depth": "deep",
+  "constraints": {
+    "max_iterations": 3,
+    "token_budget": 5000
+  }
+}
+```
+
+Response:
+```json
+{
+  "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]",
+  "citations": [
+    { ... 3-4 citations ... }
+  ],
+  "gaps": [
+    {
+      "topic": "detailed timeline 2020-2026",
+      "reason": "budget exhausted before deep synthesis"
+    },
+    {
+      "topic": "minor research directions",
+      "reason": "out of scope due to token limit"
+    }
+  ],
+  "confidence": 0.55,
+  "cost_metadata": {
+    "tokens_used": 4998,
+    "iterations_run": 3,
+    "wall_time_sec": 31.2,
+    "budget_exhausted": true
+  },
+  "trace_id": "550e8400-e29b-41d4-a716-446655440003"
+}
+```
+
+---
+
+## Versioning
+
+The contract is versioned as `v1`. If breaking changes are needed (e.g., new required fields), the next version becomes `v2` and both can coexist in the network for a transition period.
+
+Current version: **v1**
+
+---
+
+See also: [Architecture.md](Architecture.md), [DevelopmentGuide.md](DevelopmentGuide.md)