From a349d6f970640d4fcc7c51790a81ed9969bd52a2 Mon Sep 17 00:00:00 2001 From: Jeff Smith Date: Wed, 8 Apr 2026 11:58:26 -0600 Subject: [PATCH] Initial wiki: Architecture, ResearchContract, DevelopmentGuide - Architecture: system overview, component design, data flow - ResearchContract: complete tool specification with examples - DevelopmentGuide: setup, testing, workflow, debugging Co-Authored-By: Claude Haiku 4.5 --- Architecture.md | 175 ++++++++++++++++++++++ DevelopmentGuide.md | 259 ++++++++++++++++++++++++++++++++ ResearchContract.md | 352 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 786 insertions(+) create mode 100644 Architecture.md create mode 100644 DevelopmentGuide.md create mode 100644 ResearchContract.md diff --git a/Architecture.md b/Architecture.md new file mode 100644 index 0000000..0b7cf87 --- /dev/null +++ b/Architecture.md @@ -0,0 +1,175 @@ +# Architecture + +## Overview + +Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses. + +``` +┌─────────────┐ +│ PI Agent │ Orchestrates, synthesizes, decides what to research +└──────┬──────┘ + │ dispatch research(question) + │ + ┌────┴──────────────────────────┐ + │ │ +┌─┴────────────────────┐ ┌───────┴─────────────────┐ +│ Web Researcher (MCP) │ │ Future: DB, Arxiv, etc. 
│ +│ - Search (Tavily) │ │ (V2+) │ +│ - Fetch URLs │ │ │ +│ - Internal loop │ │ │ +│ - Return citations │ │ │ +└──────────────────────┘ └─────────────────────────┘ +``` + +## Components + +### Researchers (MCP servers) + +Each researcher is a **standalone MCP server** that: +- Exposes a single tool: `research(question, context, depth, constraints)` +- Runs an internal agentic loop (plan → search → fetch → iterate → synthesize) +- Returns structured data: `answer`, `citations`, `gaps`, `cost_metadata`, `trace_id` +- Enforces budgets: iteration cap and token limit +- Logs all internal steps to JSONL trace files + +**V1 researcher**: Web search + fetch +- Uses Tavily for searching +- Fetches full text from URLs +- Iterates up to 5 times or until budget exhausted + +**Future researchers** (V2+): Database, Arxiv, internal documents, etc. + +### MCP Protocol + +Marchwarden uses the **Model Context Protocol (MCP)** as the boundary between researchers and their callers. This gives us: + +- **Language agnostic** — researchers can be Python, Node, Go, etc. +- **Process isolation** — researcher crash doesn't crash the PI +- **Clean contract** — one tool signature, versioned independently +- **Parallel dispatch** — PI can await multiple researchers simultaneously + +### CLI Shim + +For V1, the CLI is the test harness that stands in for the PI: + +```bash +marchwarden ask "what are ideal crops for Utah?" +marchwarden replay +``` + +In V2, the CLI is replaced by a full PI orchestrator agent. 
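The budget rule in the Researchers section (an iteration cap plus a token limit) can be sketched as a small loop. This is a minimal illustration, not the project's implementation: `Budget`, `run_loop`, and the `step` callable are hypothetical names, and the defaults simply mirror the values quoted in ResearchContract.md.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_iterations: int = 5     # iteration cap quoted in the docs
    token_budget: int = 20_000  # default taken from ResearchContract.md


def run_loop(step, budget: Budget) -> dict:
    """Call step() until it reports done or a budget is hit.

    `step` is a hypothetical callable returning (tokens_spent, done).
    """
    tokens_used = 0
    iterations = 0
    exhausted = False
    while True:
        # Check both caps before starting another iteration.
        if iterations >= budget.max_iterations or tokens_used >= budget.token_budget:
            exhausted = True
            break
        spent, done = step()
        iterations += 1
        tokens_used += spent
        if done:
            break
    return {
        "tokens_used": tokens_used,
        "iterations_run": iterations,
        "budget_exhausted": exhausted,
    }
```

Checking the caps before each iteration means a step that finishes on the last allowed iteration is not reported as exhausted, which seems closest to how `budget_exhausted` is described in the contract.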
+ +### Trace Logging + +Every research call produces a **JSONL trace log**: + +``` +~/.marchwarden/traces/{trace_id}.jsonl +``` + +Each line is a JSON object: +```json +{ + "step": 1, + "action": "search", + "query": "Utah climate gardening", + "result": {...}, + "timestamp": "2026-04-08T12:00:00Z", + "decision": "query was relevant, fetching top 3 URLs" +} +``` + +Traces support: +- **Debugging** — see exactly what the researcher did +- **Replay** — re-run a past session, same results +- **Eval** — audit decision-making + +## Data Flow + +### One research call (simplified) + +``` +CLI: ask "What are ideal crops for Utah?" + ↓ +MCP: research(question="What are ideal crops for Utah?", ...) + ↓ +Researcher agent loop: + 1. Plan: "I need climate data for Utah + crop requirements" + 2. Search: Tavily query for "Utah climate zones crops" + 3. Fetch: Read top 3 URLs + 4. Parse: Extract relevant info + 5. Synthesize: "Based on X sources, ideal crops are Y" + 6. Check gaps: "Couldn't find pest info" + 7. Return if confident, else iterate + ↓ +Response: + { + "answer": "...", + "citations": [ + {"source": "web", "locator": "https://...", "snippet": "...", "confidence": 0.95}, + ... + ], + "gaps": [ + {"topic": "pest resistance", "reason": "no sources found"}, + ], + "cost_metadata": { + "tokens_used": 8452, + "iterations_run": 3, + "wall_time_sec": 42.5 + }, + "trace_id": "uuid-1234" + } + ↓ +CLI: Print answer + citations, save trace +``` + +## Contract Versioning + +The `research()` tool signature is the stable contract. Changes to the contract require explicit versioning so that: +- Multiple researchers with different versions can coexist +- The PI knows what version it's calling +- Backwards compatibility (or breaking changes) is explicit + +See [ResearchContract.md](ResearchContract.md) for the full spec. 
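The JSONL trace format from the Trace Logging section (one JSON object per line) can be written and read back with the standard library alone. The helper names `append_step` and `read_steps` are assumptions made for this sketch; the field names follow the example record shown in that section.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_step(trace_path: Path, step: int, action: str, **fields) -> None:
    """Append one trace record as a single JSON line."""
    record = {
        "step": step,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    with trace_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_steps(trace_path: Path) -> list[dict]:
    """Load every non-empty line of a trace file as a dict."""
    with trace_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per step keeps a partially written trace readable if the researcher process dies mid-run, which fits the debugging use case.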
+ +## Future: The PI Agent + +V2 will introduce the orchestrator: + +```python +class PIAgent: + async def research_topic(self, question: str) -> Answer: + # Dispatch researchers (sequential awaits here; V2 intends parallel dispatch) + web_results = await self.web_researcher.research(question) + arxiv_results = await self.arxiv_researcher.research(question) + + # Synthesize + return self.synthesize([web_results, arxiv_results]) +``` + +The PI: +- Decides which researchers to dispatch +- Waits for all responses +- Checks for conflicts, gaps, consensus +- Synthesizes into a final answer +- Can re-dispatch if gaps are critical + +## Assumptions & Constraints + +- **Researchers are honest** — they don't hallucinate citations. If they cite something, it exists in the source. +- **Tavily API is available** — for V1 web search. Degradation strategy TBD. +- **Token budgets are enforced** — the researcher respects its budget; the MCP server enforces it at the process level. +- **Traces are ephemeral** — stored locally for debugging, not synced to a database yet. +- **No multi-user** — single-user CLI for V1. 
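The parallel dispatch described for the PI could be sketched with `asyncio.gather`. `StubResearcher` and `dispatch_all` are hypothetical stand-ins invented for this example; a real PI would presumably call researchers through an MCP client instead.

```python
import asyncio


class StubResearcher:
    """Hypothetical stand-in for an MCP researcher client."""

    def __init__(self, name: str, delay: float) -> None:
        self.name = name
        self.delay = delay

    async def research(self, question: str) -> dict:
        # Simulates network and agent-loop latency.
        await asyncio.sleep(self.delay)
        return {"source": self.name, "answer": f"{self.name} answer to: {question}"}


async def dispatch_all(researchers, question: str) -> list:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(r.research(question) for r in researchers))
```

A call such as `asyncio.run(dispatch_all([StubResearcher("web", 0.1), StubResearcher("arxiv", 0.1)], "q"))` takes roughly as long as the slowest researcher rather than the sum. Passing `return_exceptions=True` to `gather` is one way to keep a single failing researcher from cancelling the whole dispatch.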
+ +## Terminology + +- **Researcher**: An agentic system specialized in a domain or source type +- **Marchwarden**: The researcher metaphor — stationed at the frontier, reporting back +- **Rihla**: (V2+) A unit of research work dispatched by the PI; one researcher's journey to answer a question +- **Trace**: A JSONL log of all decisions made during one research call +- **Gap**: An unresolved aspect of the question; the researcher couldn't find an answer + +--- + +See also: [ResearchContract.md](ResearchContract.md), [DevelopmentGuide.md](DevelopmentGuide.md) diff --git a/DevelopmentGuide.md b/DevelopmentGuide.md new file mode 100644 index 0000000..e94437a --- /dev/null +++ b/DevelopmentGuide.md @@ -0,0 +1,259 @@ +# Development Guide + +## Setup + +### Prerequisites + +- Python 3.10+ +- pip (with venv) +- Tavily API key (free tier available at https://tavily.com) + +### Installation + +```bash +git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git +cd marchwarden + +# Create virtual environment +python3 -m venv venv +source venv/bin/activate # On Windows: venv\Scripts\activate + +# Install in dev mode +pip install -e ".[dev]" +``` + +### Environment Setup + +Create a `.env` file in the project root: + +```env +TAVILY_API_KEY= +ANTHROPIC_API_KEY= +MARCHWARDEN_TRACE_DIR=~/.marchwarden/traces +``` + +Test that everything works: + +```bash +python -c "from anthropic import Anthropic; print('OK')" +python -c "from tavily import TavilyClient; print('OK')" +``` + +## Project Structure + +``` +marchwarden/ +├── researchers/ +│ ├── __init__.py +│ └── web/ # V1: Web search researcher +│ ├── __init__.py +│ ├── server.py # MCP server entry point +│ ├── agent.py # Inner research agent +│ ├── models.py # Pydantic models (ResearchResult, Citation, etc) +│ └── tools.py # Tavily integration, URL fetch +├── orchestrator/ # (V2+) PI agent +│ ├── __init__.py +│ └── pi.py +├── cli/ # CLI shim (ask, replay) +│ ├── __init__.py +│ ├── main.py # Entry point (@click 
decorators) +│ └── formatter.py # Pretty-print results +├── tests/ +│ ├── __init__.py +│ ├── test_web_researcher.py +│ └── fixtures/ +├── docs/ +│ └── wiki/ # You are here +├── README.md +├── CONTRIBUTING.md +├── pyproject.toml +└── .gitignore +``` + +## Running Tests + +```bash +# Run all tests +pytest tests/ + +# Run with verbose output +pytest tests/ -v + +# Run a specific test file +pytest tests/test_web_researcher.py + +# Run with coverage +pytest --cov=. tests/ +``` + +All tests are unit + integration. We do **not** mock the database or major external services (only Tavily if needed to avoid API costs). + +## Running the CLI + +```bash +# Ask a question +marchwarden ask "What are ideal crops for a garden in Utah?" + +# With options +marchwarden ask "What is X?" --depth deep --budget 25000 + +# Replay a trace +marchwarden replay + +# Show help +marchwarden --help +``` + +The first run will take a few seconds (agent planning + searches + fetches). + +## Development Workflow + +### 1. Create a branch + +```bash +git checkout -b feat/your-feature-name +``` + +Branch naming: `feat/`, `fix/`, `refactor/`, `chore/` + short description. + +### 2. Make changes + +Edit code, add tests: + +```bash +# Run tests as you go +pytest tests/test_your_feature.py -v + +# Check formatting +black --check . +ruff check . + +# Type checking (optional, informational) +mypy . --ignore-missing-imports +``` + +### 3. Commit + +```bash +git add +git commit -m "Brief imperative description + +- What changed +- Why it changed +" +``` + +Commits should be atomic (one logical change per commit). + +### 4. Test before pushing + +```bash +pytest tests/ +black . +ruff check . --fix +``` + +### 5. Push and create PR + +```bash +git push origin feat/your-feature-name +``` + +Then on Forgejo: open a PR, request review, wait for CI/tests to pass. 
+ +Once approved: +- Merge via Forgejo UI (not locally) +- Delete remote branch via Forgejo +- Locally: `git checkout main && git pull --ff-only && git branch -d feat/your-feature-name` + +## Debugging + +### Viewing trace logs + +```bash +# Human-readable trace +marchwarden replay + +# Raw JSON +cat ~/.marchwarden/traces/.jsonl | jq . + +# Pretty-print all lines +cat ~/.marchwarden/traces/.jsonl | jq . -s +``` + +### Debug logging + +Set `MARCHWARDEN_DEBUG=1` for verbose logs: + +```bash +MARCHWARDEN_DEBUG=1 marchwarden ask "What is X?" +``` + +### Interactive testing + +Use Python REPL: + +```bash +python +>>> from researchers.web import WebResearcher +>>> researcher = WebResearcher() +>>> result = researcher.research("What is X?") +>>> print(result.answer) +``` + +## Common Tasks + +### Adding a new tool to the researcher + +1. Define the tool in `researchers/web/tools.py` +2. Register it in the agent's tool list (`researchers/web/agent.py`) +3. Add test coverage in `tests/test_web_researcher.py` +4. Update docs if it changes the contract + +### Changing the research contract + +If you need to modify the `research()` signature: + +1. Update `researchers/web/models.py` (ResearchResult, Citation, etc) +2. Update `researchers/web/agent.py` to produce the new fields +3. Update `docs/wiki/ResearchContract.md` +4. Add a migration guide if breaking +5. Tests must pass with new signature + +### Running cost analysis + +See how much a research call costs: + +```bash +marchwarden ask "Q" --verbose +# Shows: tokens_used, iterations_run, wall_time_sec +``` + +For batch analysis: + +```python +import json +from pathlib import Path +for trace_file in (Path.home() / ".marchwarden" / "traces").glob("*.jsonl"): # glob does not expand "~" + for line in open(trace_file): + event = json.loads(line) + # Analyze cost_metadata +``` + +## FAQ + +**Q: How do I add a new researcher?** +A: Create `researchers/new_source/` with the same structure as `researchers/web/`. Implement `research()`, expose it as an MCP server. Test with the CLI. 
+ +**Q: Do I need to handle Tavily failures?** +A: Yes. Catch `TavilyError` and fall back to what you have. Document in `gaps`. + +**Q: What if Anthropic API goes down?** +A: The agent will fail. Retry logic TBD. For now, it's a blocker. + +**Q: How do I deploy this?** +A: V1 is CLI-only, local use only. V2 will have a PI orchestrator with real deployment needs. + +--- + +See also: [Architecture.md](Architecture.md), [ResearchContract.md](ResearchContract.md), [../CONTRIBUTING.md](../CONTRIBUTING.md) diff --git a/ResearchContract.md b/ResearchContract.md new file mode 100644 index 0000000..a4ad5a7 --- /dev/null +++ b/ResearchContract.md @@ -0,0 +1,352 @@ +# Research Contract + +This document defines the `research()` tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI). + +## Tool Signature + +```python +async def research( + question: str, + context: Optional[str] = None, + depth: Literal["shallow", "balanced", "deep"] = "balanced", + constraints: Optional[ResearchConstraints] = None, +) -> ResearchResult +``` + +### Input Parameters + +#### `question` (required, string) +The question the researcher is asked to investigate. Examples: +- "What are ideal crops for a garden in Utah?" +- "Summarize recent developments in transformer architectures" +- "What is the legal status of AI in France?" + +Constraints: 1–500 characters, UTF-8 encoded. + +#### `context` (optional, string) +What the PI or caller already knows. The researcher uses this to avoid duplicating effort or to refocus. Examples: +- "I already know Utah is in USDA zones 3-8. Focus on water requirements." +- "I've read the 2024 papers on LoRA. What's new in 2025?" + +Constraints: 0–2000 characters. + +#### `depth` (optional, enum) +How thoroughly to research: +- `"shallow"` — quick scan, 1–2 iterations, ~5k tokens. For "does this exist?" questions. +- `"balanced"` (default) — moderate depth, 2–4 iterations, ~15k tokens. 
For typical questions. +- `"deep"` — thorough investigation, up to 5 iterations, ~25k tokens. For important decisions. + +The researcher uses this as a *hint*, not a strict constraint. The actual depth depends on how much content is available and how confident the researcher becomes. + +#### `constraints` (optional, object) +Fine-grained control over researcher behavior: + +```python +@dataclass +class ResearchConstraints: + max_iterations: int = 5 # Stop after N iterations, regardless + token_budget: int = 20000 # Soft limit on tokens; researcher respects + max_sources: int = 10 # Max number of sources to fetch + source_filter: Optional[str] = None # Only search specific domains (V2) +``` + +If not provided, defaults are: +- `max_iterations`: 5 +- `token_budget`: 20000 (Sonnet 3.5 equivalent) +- `max_sources`: 10 + +The MCP server **enforces** these constraints and will stop the researcher if they exceed them. + +--- + +### Output: ResearchResult + +```python +@dataclass +class ResearchResult: + answer: str # The synthesized answer + citations: List[Citation] # Sources used + gaps: List[Gap] # What couldn't be resolved + confidence: float # 0.0–1.0 overall confidence + cost_metadata: CostMetadata # Resource usage + trace_id: str # UUID linking to JSONL trace log +``` + +#### `answer` (string) +The synthesized answer. Should be: +- **Grounded** — every claim traces back to a citation +- **Humble** — includes caveats and confidence levels +- **Actionable** — structured so the caller can use it + +Example: +``` +In Utah (USDA zones 3-8), ideal crops depend on elevation and season: + +High elevation (>7k ft): Short-season crops dominate. Cool-season vegetables +(peas, lettuce, potatoes) thrive. Fruit: apples, berries. Summer crops +(tomatoes, squash) work in south-facing microclimates. + +Lower elevation: Full range possible. Long growing season supports tomatoes, +peppers, squash. Perennials (fruit trees, asparagus) are popular. 
+ +Water is critical: Utah averages 10-20" annual precipitation (dry for vegetable +gardening). Most gardeners supplement with irrigation. + +Pests: Japanese beetles (south), aphids (statewide). Deer pressure varies by +location. + +See sources below for varietal recommendations by specific county. +``` + +#### `citations` (list of Citation objects) + +```python +@dataclass +class Citation: + source: str # "web", "file", "database", etc + locator: str # URL, file path, row ID, or unique identifier + title: Optional[str] # Human-readable title (for web) + snippet: Optional[str] # Relevant excerpt (50–200 chars) + confidence: float # 0.0–1.0: researcher's confidence in this source's accuracy +``` + +Example: +```python +Citation( + source="web", + locator="https://extension.oregonstate.edu/ask-expert/featured/what-are-ideal-garden-crops-utah-zone", + title="Oregon State Extension: Ideal Crops for Utah Gardens", + snippet="Cool-season crops (peas, lettuce, potatoes) thrive above 7,000 feet. Irrigation essential.", + confidence=0.9 +) +``` + +Citations must be: +- **Verifiable** — a human can follow the locator and confirm the claim +- **Not hallucinated** — the researcher actually read/fetched the source +- **Attributed** — each claim in `answer` should link to at least one citation + +#### `gaps` (list of Gap objects) + +```python +@dataclass +class Gap: + topic: str # What aspect wasn't resolved + reason: str # Why: "no sources found", "contradictory sources", "outside researcher scope" +``` + +Example: +```python +[ + Gap(topic="pest management by county", reason="no county-specific sources found"), + Gap(topic="commercial varietals", reason="limited to hobby gardening sources"), +] +``` + +Gaps are **critical for the PI**. They tell the orchestrator: +- Whether to dispatch a different researcher +- Whether to accept partial answers +- Which questions remain for human input + +A researcher that admits gaps is more trustworthy than one that fabricates answers. 
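How a caller might act on `gaps` can be sketched as a small triage function. The mapping from reason text to follow-up action is an assumption made for illustration (the contract lists the reason strings but does not prescribe a policy), and `triage` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    topic: str   # What aspect wasn't resolved
    reason: str  # Why it wasn't resolved


def triage(gaps: list[Gap]) -> dict[str, list[str]]:
    """Bucket gap topics by a guessed follow-up action, keyed on the reason text."""
    buckets: dict[str, list[str]] = {"redispatch": [], "human": [], "accept": []}
    for gap in gaps:
        reason = gap.reason.lower()
        if "outside" in reason and "scope" in reason:
            # Another researcher type may be able to cover this topic.
            buckets["redispatch"].append(gap.topic)
        elif "contradictory" in reason:
            # Conflicting sources are flagged for human review in this sketch.
            buckets["human"].append(gap.topic)
        else:
            # Everything else is accepted as a known limitation.
            buckets["accept"].append(gap.topic)
    return buckets
```

Matching on free-text reasons is brittle; if the PI relied on this in practice, an enumerated reason field would likely be a safer design.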
+ +#### `confidence` (float, 0.0–1.0) + +Overall confidence in the answer: +- `0.9–1.0`: High. All claims grounded in multiple strong sources. +- `0.7–0.9`: Moderate. Most claims grounded; some inference; minor contradictions resolved. +- `0.5–0.7`: Low. Few direct sources; lots of synthesis; clear gaps. +- `< 0.5`: Very low. Mainly inference; major gaps; likely needs human review. + +The PI uses this to decide whether to act on the answer or seek more sources. + +#### `cost_metadata` (object) + +```python +@dataclass +class CostMetadata: + tokens_used: int # Total tokens (Claude + Tavily calls) + iterations_run: int # Number of inner-loop iterations + wall_time_sec: float # Actual elapsed time + budget_exhausted: bool # True if researcher hit iteration or token cap +``` + +Example: +```python +CostMetadata( + tokens_used=8452, + iterations_run=3, + wall_time_sec=42.5, + budget_exhausted=False +) +``` + +The PI uses this to: +- Track costs (token budgets, actual spend) +- Detect runaway loops (budget_exhausted = True) +- Plan timeouts (wall_time_sec tells you if this is acceptable latency) + +#### `trace_id` (string, UUID) + +A unique identifier linking to the JSONL trace log: + +``` +~/.marchwarden/traces/{trace_id}.jsonl +``` + +The trace contains every decision, search, fetch, parse step for debugging and replay. + +--- + +## Contract Rules + +### The Researcher Must + +1. **Never hallucinate citations.** If a claim isn't in a source, don't cite it. +2. **Admit gaps.** If you can't find something, say so. Don't guess. +3. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`. +4. **Ground claims.** Every factual claim in `answer` must link to at least one citation. +5. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` and continue with what you have. + +### The Caller (PI/CLI) Must + +1. 
**Accept partial answers.** A researcher that hits its budget but admits gaps is better than one that spins endlessly. +2. **Use confidence and gaps.** Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer. +3. **Check locators.** For important decisions, verify citations by following the locators. + +--- + +## Examples + +### Example 1: High-Confidence Answer + +Request: +```json +{ + "question": "What is the capital of France?", + "depth": "shallow" +} +``` + +Response: +```json +{ + "answer": "Paris is the capital of France. It is the country's largest city and serves as the political, cultural, and economic center.", + "citations": [ + { + "source": "web", + "locator": "https://en.wikipedia.org/wiki/Paris", + "title": "Paris - Wikipedia", + "snippet": "Paris is the capital and largest city of France", + "confidence": 0.99 + } + ], + "gaps": [], + "confidence": 0.99, + "cost_metadata": { + "tokens_used": 450, + "iterations_run": 1, + "wall_time_sec": 3.2, + "budget_exhausted": false + }, + "trace_id": "550e8400-e29b-41d4-a716-446655440001" +} +``` + +### Example 2: Partial Answer with Gaps + +Request: +```json +{ + "question": "What emerging startups in biotech are working on CRISPR gene therapy?", + "depth": "deep" +} +``` + +Response: +```json +{ + "answer": "Several emerging startups are advancing CRISPR gene therapy... 
[detailed answer]", + "citations": [ + { + "source": "web", + "locator": "https://www.crunchbase.com/...", + "title": "Crunchbase: CRISPR Startups", + "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics...", + "confidence": 0.8 + } + ], + "gaps": [ + { + "topic": "funding rounds in 2026", + "reason": "Web sources only go through Q1 2026; may be stale" + }, + { + "topic": "clinical trial status", + "reason": "Requires access to clinical trials database (outside web search scope)" + } + ], + "confidence": 0.72, + "cost_metadata": { + "tokens_used": 19240, + "iterations_run": 4, + "wall_time_sec": 67.8, + "budget_exhausted": false + }, + "trace_id": "550e8400-e29b-41d4-a716-446655440002" +} +``` + +### Example 3: Budget Exhausted + +Request: +```json +{ + "question": "Comprehensive history of AI from 1950s to 2026", + "depth": "deep", + "constraints": { + "max_iterations": 3, + "token_budget": 5000 + } +} +``` + +Response: +```json +{ + "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]", + "citations": [ + { ... 3-4 citations ... } + ], + "gaps": [ + { + "topic": "detailed timeline 2020-2026", + "reason": "budget exhausted before deep synthesis" + }, + { + "topic": "minor research directions", + "reason": "out of scope due to token limit" + } + ], + "confidence": 0.55, + "cost_metadata": { + "tokens_used": 4998, + "iterations_run": 3, + "wall_time_sec": 31.2, + "budget_exhausted": true + }, + "trace_id": "550e8400-e29b-41d4-a716-446655440003" +} +``` + +--- + +## Versioning + +The contract is versioned as `v1`. If breaking changes are needed (e.g., new required fields), the next version becomes `v2` and both can coexist in the network for a transition period. + +Current version: **v1** + +--- + +See also: [Architecture.md](Architecture.md), [DevelopmentGuide.md](DevelopmentGuide.md)