Initial wiki: Architecture, ResearchContract, DevelopmentGuide

- Architecture: system overview, component design, data flow
- ResearchContract: complete tool specification with examples
- DevelopmentGuide: setup, testing, workflow, debugging

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Jeff Smith 2026-04-08 11:58:26 -06:00
commit a349d6f970
3 changed files with 786 additions and 0 deletions

175
Architecture.md Normal file

@ -0,0 +1,175 @@
# Architecture
## Overview
Marchwarden is a network of agentic researchers coordinated by a principal investigator (PI). Each researcher is specialized, autonomous, and fault-tolerant. The PI dispatches researchers to answer questions, waits for results, and synthesizes across responses.
```
┌─────────────┐
│  PI Agent   │  Orchestrates, synthesizes, decides what to research
└──────┬──────┘
       │ dispatch research(question)
  ┌────┴──────────────────────────┐
  │                               │
┌─┴────────────────────┐  ┌───────┴─────────────────┐
│ Web Researcher (MCP) │  │ Future: DB, Arxiv, etc. │
│ - Search (Tavily)    │  │ (V2+)                   │
│ - Fetch URLs         │  │                         │
│ - Internal loop      │  │                         │
│ - Return citations   │  │                         │
└──────────────────────┘  └─────────────────────────┘
```
## Components
### Researchers (MCP servers)
Each researcher is a **standalone MCP server** that:
- Exposes a single tool: `research(question, context, depth, constraints)`
- Runs an internal agentic loop (plan → search → fetch → iterate → synthesize)
- Returns structured data: `answer`, `citations`, `gaps`, `confidence`, `cost_metadata`, `trace_id`
- Enforces budgets: iteration cap and token limit
- Logs all internal steps to JSONL trace files
**V1 researcher**: Web search + fetch
- Uses Tavily for searching
- Fetches full text from URLs
- Iterates up to 5 times or until budget exhausted
**Future researchers** (V2+): Database, Arxiv, internal documents, etc.
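The budget rules above (iteration cap plus token limit) can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Marchwarden's actual agent: `run_loop`, `StepResult`, and the `step` callable are hypothetical names, and each step is assumed to report its own token usage and whether the researcher is confident enough to stop.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    tokens: int      # tokens spent by this iteration (assumed to be reported by the step)
    confident: bool  # True when the researcher thinks it can answer


def run_loop(step: Callable[[int], StepResult],
             max_iterations: int = 5,
             token_budget: int = 20_000) -> dict:
    """Iterate until confident, or until the iteration cap or token budget is hit."""
    tokens_used = 0
    iterations_run = 0
    confident = False
    while iterations_run < max_iterations and tokens_used < token_budget:
        result = step(iterations_run + 1)  # hypothetical plan/search/fetch step
        iterations_run += 1
        tokens_used += result.tokens
        if result.confident:
            confident = True
            break
    return {
        "tokens_used": tokens_used,
        "iterations_run": iterations_run,
        # Leaving the loop without confidence means a cap stopped it.
        "budget_exhausted": not confident,
    }
```

Because the token check happens between iterations, a single step can overshoot the budget; that matches a soft limit rather than a hard one.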
### MCP Protocol
Marchwarden uses the **Model Context Protocol (MCP)** as the boundary between researchers and their callers. This gives us:
- **Language agnostic** — researchers can be Python, Node, Go, etc.
- **Process isolation** — researcher crash doesn't crash the PI
- **Clean contract** — one tool signature, versioned independently
- **Parallel dispatch** — PI can await multiple researchers simultaneously
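On the caller side, parallel dispatch and failure isolation could look roughly like the sketch below. It leaves out the MCP transport entirely and models each researcher as a plain async callable, which is an assumption made for illustration; `dispatch_all` and the `Researcher` alias are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical stand-in for an MCP client call: question in, answer out.
Researcher = Callable[[str], Awaitable[str]]


async def dispatch_all(question: str,
                       researchers: Dict[str, Researcher]) -> Dict[str, object]:
    """Call every researcher concurrently; a failure becomes a value, not a crash."""
    names = list(researchers)
    outcomes = await asyncio.gather(
        *(researchers[name](question) for name in names),
        return_exceptions=True,  # keep going when one researcher raises
    )
    return dict(zip(names, outcomes))
```

With `return_exceptions=True`, a failed researcher shows up as an exception object in the result map, so the caller can still use the answers that did arrive.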
### CLI Shim
For V1, the CLI is the test harness that stands in for the PI:
```bash
marchwarden ask "what are ideal crops for Utah?"
marchwarden replay <trace_id>
```
In V2, the CLI is replaced by a full PI orchestrator agent.
### Trace Logging
Every research call produces a **JSONL trace log**:
```
~/.marchwarden/traces/{trace_id}.jsonl
```
Each line is a JSON object:
```json
{
  "step": 1,
  "action": "search",
  "query": "Utah climate gardening",
  "result": {...},
  "timestamp": "2026-04-08T12:00:00Z",
  "decision": "query was relevant, fetching top 3 URLs"
}
```
Traces support:
- **Debugging** — see exactly what the researcher did
- **Replay** — re-run a past session, same results
- **Eval** — audit decision-making
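A minimal sketch of how such a trace file could be written and read back, using only the standard library. The helper names `append_event` and `read_events` are hypothetical, and the field set simply mirrors the example object above; the real trace writer may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator


def append_event(trace_dir: Path, trace_id: str, step: int, action: str, **fields) -> Path:
    """Append one JSON object as a single line to {trace_dir}/{trace_id}.jsonl."""
    trace_dir.mkdir(parents=True, exist_ok=True)
    path = trace_dir / f"{trace_id}.jsonl"
    event = {
        "step": step,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,  # e.g. query, result, decision
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return path


def read_events(path: Path) -> Iterator[dict]:
    """Yield events in the order they were written, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Appending one line per step keeps the file readable even if the researcher dies mid-run, since every completed step is already on disk.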
## Data Flow
### One research call (simplified)
```
CLI: ask "What are ideal crops for Utah?"
MCP: research(question="What are ideal crops for Utah?", ...)
Researcher agent loop:
  1. Plan: "I need climate data for Utah + crop requirements"
  2. Search: Tavily query for "Utah climate zones crops"
  3. Fetch: Read top 3 URLs
  4. Parse: Extract relevant info
  5. Synthesize: "Based on X sources, ideal crops are Y"
  6. Check gaps: "Couldn't find pest info"
  7. Return if confident, else iterate
Response:
  {
    "answer": "...",
    "citations": [
      {"source": "web", "locator": "https://...", "snippet": "...", "confidence": 0.95},
      ...
    ],
    "gaps": [
      {"topic": "pest resistance", "reason": "no sources found"}
    ],
    "cost_metadata": {
      "tokens_used": 8452,
      "iterations_run": 3,
      "wall_time_sec": 42.5
    },
    "trace_id": "uuid-1234"
  }
CLI: Print answer + citations, save trace
```
## Contract Versioning
The `research()` tool signature is the stable contract. Changes to the contract require explicit versioning so that:
- Multiple researchers with different versions can coexist
- The PI knows what version it's calling
- Backwards compatibility (or breaking changes) is explicit
See [ResearchContract.md](ResearchContract.md) for the full spec.
## Future: The PI Agent
V2 will introduce the orchestrator:
```python
import asyncio


class PIAgent:
    async def research_topic(self, question: str) -> Answer:
        # Dispatch multiple researchers in parallel
        web_results, arxiv_results = await asyncio.gather(
            self.web_researcher.research(question),
            self.arxiv_researcher.research(question),
        )
        # Synthesize
        return self.synthesize([web_results, arxiv_results])
```
The PI:
- Decides which researchers to dispatch
- Waits for all responses
- Checks for conflicts, gaps, consensus
- Synthesizes into a final answer
- Can re-dispatch if gaps are critical
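The "re-dispatch if gaps are critical" step could start from something like the sketch below. It assumes results are dicts shaped like the response in the data flow section and that the PI supplies its own set of critical topics; `critical_gaps` is a hypothetical helper, not part of the planned V2 design.

```python
from typing import Dict, List, Set


def critical_gaps(results: List[dict], critical_topics: Set[str]) -> Dict[str, List[str]]:
    """Collect the reasons given for every critical topic reported as a gap."""
    reported: Dict[str, List[str]] = {}
    for result in results:
        for gap in result.get("gaps", []):
            if gap["topic"] in critical_topics:
                reported.setdefault(gap["topic"], []).append(gap["reason"])
    return reported
```

A non-empty return value is a signal to consider another round; whether a different researcher already covered the topic is a judgment this sketch leaves to the PI.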
## Assumptions & Constraints
- **Researchers are honest** — they don't hallucinate citations. If they cite something, it exists in the source.
- **Tavily API is available** — for V1 web search. Degradation strategy TBD.
- **Token budgets are enforced** — the researcher respects its budget; the MCP server enforces it at the process level.
- **Traces are ephemeral** — stored locally for debugging, not synced to a database yet.
- **No multi-user** — single-user CLI for V1.
## Terminology
- **Researcher**: An agentic system specialized in a domain or source type
- **Marchwarden**: The researcher metaphor — stationed at the frontier, reporting back
- **Rihla**: (V2+) A unit of research work dispatched by the PI; one researcher's journey to answer a question
- **Trace**: A JSONL log of all decisions made during one research call
- **Gap**: An unresolved aspect of the question; the researcher couldn't find an answer
---
See also: [ResearchContract.md](ResearchContract.md), [DevelopmentGuide.md](DevelopmentGuide.md)

259
DevelopmentGuide.md Normal file

@ -0,0 +1,259 @@
# Development Guide
## Setup
### Prerequisites
- Python 3.10+
- pip (with venv)
- Tavily API key (free tier available at https://tavily.com)
### Installation
```bash
git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
cd marchwarden
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in dev mode
pip install -e ".[dev]"
```
### Environment Setup
Create a `.env` file in the project root:
```env
TAVILY_API_KEY=<your-tavily-api-key>
ANTHROPIC_API_KEY=<your-claude-api-key>
MARCHWARDEN_TRACE_DIR=~/.marchwarden/traces
```
Test that everything works:
```bash
python -c "from anthropic import Anthropic; print('OK')"
python -c "from tavily import TavilyClient; print('OK')"
```
## Project Structure
```
marchwarden/
├── researchers/
│   ├── __init__.py
│   └── web/                  # V1: Web search researcher
│       ├── __init__.py
│       ├── server.py         # MCP server entry point
│       ├── agent.py          # Inner research agent
│       ├── models.py         # Pydantic models (ResearchResult, Citation, etc)
│       └── tools.py          # Tavily integration, URL fetch
├── orchestrator/             # (V2+) PI agent
│   ├── __init__.py
│   └── pi.py
├── cli/                      # CLI shim (ask, replay)
│   ├── __init__.py
│   ├── main.py               # Entry point (@click decorators)
│   └── formatter.py          # Pretty-print results
├── tests/
│   ├── __init__.py
│   ├── test_web_researcher.py
│   └── fixtures/
├── docs/
│   └── wiki/                 # You are here
├── README.md
├── CONTRIBUTING.md
├── pyproject.toml
└── .gitignore
```
## Running Tests
```bash
# Run all tests
pytest tests/
# Run with verbose output
pytest tests/ -v
# Run a specific test file
pytest tests/test_web_researcher.py
# Run with coverage
pytest --cov=. tests/
```
All tests are unit + integration. We do **not** mock the database or major external services (only Tavily if needed to avoid API costs).
## Running the CLI
```bash
# Ask a question
marchwarden ask "What are ideal crops for a garden in Utah?"
# With options
marchwarden ask "What is X?" --depth deep --budget 25000
# Replay a trace
marchwarden replay <trace_id>
# Show help
marchwarden --help
```
A run can take anywhere from a few seconds to over a minute (agent planning + searches + fetches), depending on depth.
## Development Workflow
### 1. Create a branch
```bash
git checkout -b feat/your-feature-name
```
Branch naming: `feat/`, `fix/`, `refactor/`, `chore/` + short description.
### 2. Make changes
Edit code, add tests:
```bash
# Run tests as you go
pytest tests/test_your_feature.py -v
# Check formatting
black --check .
ruff check .
# Type checking (optional, informational)
mypy . --ignore-missing-imports
```
### 3. Commit
```bash
git add <files>
git commit -m "Brief imperative description
- What changed
- Why it changed
"
```
Commits should be atomic (one logical change per commit).
### 4. Test before pushing
```bash
pytest tests/
black .
ruff check . --fix
```
### 5. Push and create PR
```bash
git push origin feat/your-feature-name
```
Then on Forgejo: open a PR, request review, wait for CI/tests to pass.
Once approved:
- Merge via Forgejo UI (not locally)
- Delete remote branch via Forgejo
- Locally: `git checkout main && git pull --ff-only && git branch -d feat/your-feature-name`
## Debugging
### Viewing trace logs
```bash
# Human-readable trace
marchwarden replay <trace_id>
# Raw JSON
cat ~/.marchwarden/traces/<trace_id>.jsonl | jq .
# Pretty-print all lines
cat ~/.marchwarden/traces/<trace_id>.jsonl | jq . -s
```
### Debug logging
Set `MARCHWARDEN_DEBUG=1` for verbose logs:
```bash
MARCHWARDEN_DEBUG=1 marchwarden ask "What is X?"
```
### Interactive testing
Use Python REPL:
```bash
python
>>> from researchers.web import WebResearcher
>>> researcher = WebResearcher()
>>> result = researcher.research("What is X?")
>>> print(result.answer)
```
## Common Tasks
### Adding a new tool to the researcher
1. Define the tool in `researchers/web/tools.py`
2. Register it in the agent's tool list (`researchers/web/agent.py`)
3. Add test coverage in `tests/test_web_researcher.py`
4. Update docs if it changes the contract
### Changing the research contract
If you need to modify the `research()` signature:
1. Update `researchers/web/models.py` (ResearchResult, Citation, etc)
2. Update `researchers/web/agent.py` to produce the new fields
3. Update `docs/wiki/ResearchContract.md`
4. Add a migration guide if breaking
5. Tests must pass with new signature
### Running cost analysis
See how much a research call costs:
```bash
marchwarden ask "Q" --verbose
# Shows: tokens_used, iterations_run, wall_time_sec
```
For batch analysis:
```python
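import glob
import json
import os

# glob does not expand "~", so resolve the home directory first
pattern = os.path.expanduser("~/.marchwarden/traces/*.jsonl")
for trace_file in glob.glob(pattern):
    with open(trace_file) as f:
        for line in f:
            event = json.loads(line)
            # Analyze cost_metadata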
```
## FAQ
**Q: How do I add a new researcher?**
A: Create `researchers/new_source/` with the same structure as `researchers/web/`. Implement `research()`, expose it as an MCP server. Test with the CLI.
**Q: Do I need to handle Tavily failures?**
A: Yes. Catch `TavilyError` and fall back to what you have. Document in `gaps`.
**Q: What if Anthropic API goes down?**
A: The agent will fail. Retry logic TBD. For now, it's a blocker.
**Q: How do I deploy this?**
A: V1 is CLI-only, local use only. V2 will have a PI orchestrator with real deployment needs.
---
See also: [Architecture.md](Architecture.md), [ResearchContract.md](ResearchContract.md), [../CONTRIBUTING.md](../CONTRIBUTING.md)

352
ResearchContract.md Normal file

@ -0,0 +1,352 @@
# Research Contract
This document defines the `research()` tool that all Marchwarden researchers implement. It is the stable contract between a researcher MCP server and its caller (the PI or CLI).
## Tool Signature
```python
async def research(
    question: str,
    context: Optional[str] = None,
    depth: Literal["shallow", "balanced", "deep"] = "balanced",
    constraints: Optional[ResearchConstraints] = None,
) -> ResearchResult:
    ...
```
### Input Parameters
#### `question` (required, string)
The question the researcher is asked to investigate. Examples:
- "What are ideal crops for a garden in Utah?"
- "Summarize recent developments in transformer architectures"
- "What is the legal status of AI in France?"
Constraints: 1-500 characters, UTF-8 encoded.
#### `context` (optional, string)
What the PI or caller already knows. The researcher uses this to avoid duplicating effort or to refocus. Examples:
- "I already know Utah is in USDA zones 3-8. Focus on water requirements."
- "I've read the 2024 papers on LoRA. What's new in 2025?"
Constraints: 0-2000 characters.
#### `depth` (optional, enum)
How thoroughly to research:
- `"shallow"` — quick scan, 1-2 iterations, ~5k tokens. For "does this exist?" questions.
- `"balanced"` (default) — moderate depth, 2-4 iterations, ~15k tokens. For typical questions.
- `"deep"` — thorough investigation, up to 5 iterations, ~25k tokens. For important decisions.
The researcher uses this as a *hint*, not a strict constraint. The actual depth depends on how much content is available and how confident the researcher becomes.
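One way to turn the depth hint into starting values is a small lookup table. The sketch below reads the ranges above as 1-2, 2-4, and up to 5 iterations and takes the upper end of each; `DepthPreset`, `DEPTH_PRESETS`, and `preset_for` are hypothetical names, and the real researcher may derive its limits differently.

```python
from typing import Dict, NamedTuple


class DepthPreset(NamedTuple):
    max_iterations: int  # upper end of the iteration range for this depth
    token_target: int    # approximate token spend, a hint rather than a hard cap


DEPTH_PRESETS: Dict[str, DepthPreset] = {
    "shallow": DepthPreset(max_iterations=2, token_target=5_000),
    "balanced": DepthPreset(max_iterations=4, token_target=15_000),
    "deep": DepthPreset(max_iterations=5, token_target=25_000),
}


def preset_for(depth: str = "balanced") -> DepthPreset:
    """Look up the preset for a depth value, rejecting anything outside the enum."""
    try:
        return DEPTH_PRESETS[depth]
    except KeyError:
        raise ValueError(f"unknown depth: {depth!r}") from None
```

Explicit `constraints` would still take precedence over these presets, since the depth is only a hint.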
#### `constraints` (optional, object)
Fine-grained control over researcher behavior:
```python
@dataclass
class ResearchConstraints:
    max_iterations: int = 5              # Stop after N iterations, regardless
    token_budget: int = 20000            # Soft limit on tokens; researcher respects
    max_sources: int = 10                # Max number of sources to fetch
    source_filter: Optional[str] = None  # Only search specific domains (V2)
```
If not provided, defaults are:
- `max_iterations`: 5
- `token_budget`: 20000 (Sonnet 3.5 equivalent)
- `max_sources`: 10
The MCP server **enforces** these constraints and will stop the researcher if it exceeds them.
---
### Output: ResearchResult
```python
@dataclass
class ResearchResult:
    answer: str                  # The synthesized answer
    citations: List[Citation]    # Sources used
    gaps: List[Gap]              # What couldn't be resolved
    confidence: float            # 0.0-1.0 overall confidence
    cost_metadata: CostMetadata  # Resource usage
    trace_id: str                # UUID linking to JSONL trace log
```
#### `answer` (string)
The synthesized answer. Should be:
- **Grounded** — every claim traces back to a citation
- **Humble** — includes caveats and confidence levels
- **Actionable** — structured so the caller can use it
Example:
```
In Utah (USDA zones 3-8), ideal crops depend on elevation and season:
High elevation (>7k ft): Short-season crops dominate. Cool-season vegetables
(peas, lettuce, potatoes) thrive. Fruit: apples, berries. Summer crops
(tomatoes, squash) work in south-facing microclimates.
Lower elevation: Full range possible. Long growing season supports tomatoes,
peppers, squash. Perennials (fruit trees, asparagus) are popular.
Water is critical: Utah averages 10-20" annual precipitation (dry for vegetable
gardening). Most gardeners supplement with irrigation.
Pests: Japanese beetles (south), aphids (statewide). Deer pressure varies by
location.
See sources below for varietal recommendations by specific county.
```
#### `citations` (list of Citation objects)
```python
@dataclass
class Citation:
    source: str             # "web", "file", "database", etc
    locator: str            # URL, file path, row ID, or unique identifier
    title: Optional[str]    # Human-readable title (for web)
    snippet: Optional[str]  # Relevant excerpt (50-200 chars)
    confidence: float       # 0.0-1.0: researcher's confidence in this source's accuracy
```
Example:
```python
Citation(
    source="web",
    locator="https://extension.oregonstate.edu/ask-expert/featured/what-are-ideal-garden-crops-utah-zone",
    title="Oregon State Extension: Ideal Crops for Utah Gardens",
    snippet="Cool-season crops (peas, lettuce, potatoes) thrive above 7,000 feet. Irrigation essential.",
    confidence=0.9,
)
```
Citations must be:
- **Verifiable** — a human can follow the locator and confirm the claim
- **Not hallucinated** — the researcher actually read/fetched the source
- **Attributed** — each claim in `answer` should link to at least one citation
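A cheap first check for the "not hallucinated" rule is to confirm that a citation's snippet really occurs in the text that was fetched. The sketch below is one possible approach, assuming the fetched page text is available as a string; `snippet_in_source` is a hypothetical helper, and it only catches verbatim mismatches, not paraphrased or misattributed claims.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def snippet_in_source(snippet: str, source_text: str) -> bool:
    """True if the snippet appears in the fetched text, ignoring case and whitespace."""
    needle = _normalize(snippet)
    # An empty snippet proves nothing, so treat it as unverified.
    return bool(needle) and needle in _normalize(source_text)
```

A citation that fails this check is a candidate for removal or for a lower `confidence`, rather than proof of fabrication.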
#### `gaps` (list of Gap objects)
```python
@dataclass
class Gap:
    topic: str   # What aspect wasn't resolved
    reason: str  # Why: "no sources found", "contradictory sources", "outside researcher scope"
```
Example:
```python
[
    Gap(topic="pest management by county", reason="no county-specific sources found"),
    Gap(topic="commercial varietals", reason="limited to hobby gardening sources"),
]
```
Gaps are **critical for the PI**. They tell the orchestrator:
- Whether to dispatch a different researcher
- Whether to accept partial answers
- Which questions remain for human input
A researcher that admits gaps is more trustworthy than one that fabricates answers.
#### `confidence` (float, 0.0-1.0)
Overall confidence in the answer:
- `0.9-1.0`: High. All claims grounded in multiple strong sources.
- `0.7-0.9`: Moderate. Most claims grounded; some inference; minor contradictions resolved.
- `0.5-0.7`: Low. Few direct sources; lots of synthesis; clear gaps.
- `< 0.5`: Very low. Mainly inference; major gaps; likely needs human review.
The PI uses this to decide whether to act on the answer or seek more sources.
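A caller could map the score onto these bands with a helper like the one below. It reads the bands as 0.9 and above, 0.7 to 0.9, 0.5 to 0.7, and below 0.5; since adjacent bands share their boundary values, assigning a boundary to the higher band is an assumption of this sketch. `confidence_band` is a hypothetical name.

```python
def confidence_band(confidence: float) -> str:
    """Map an overall confidence score to one of the four band names."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be within [0.0, 1.0], got {confidence}")
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.7:
        return "moderate"
    if confidence >= 0.5:
        return "low"
    return "very low"
```

For example, the 0.72 score in the partial-answer example later in this document would land in the "moderate" band.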
#### `cost_metadata` (object)
```python
@dataclass
class CostMetadata:
    tokens_used: int        # Total tokens (Claude + Tavily calls)
    iterations_run: int     # Number of inner-loop iterations
    wall_time_sec: float    # Actual elapsed time
    budget_exhausted: bool  # True if researcher hit iteration or token cap
```
Example:
```python
CostMetadata(
    tokens_used=8452,
    iterations_run=3,
    wall_time_sec=42.5,
    budget_exhausted=False,
)
```
The PI uses this to:
- Track costs (token budgets, actual spend)
- Detect runaway loops (budget_exhausted = True)
- Plan timeouts (wall_time_sec tells you if this is acceptable latency)
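These three uses can be combined into one small aggregation over several calls. The sketch below assumes each record is a dict with the `CostMetadata` field names; `summarize_costs` is a hypothetical helper, not part of the contract.

```python
from typing import Iterable


def summarize_costs(cost_records: Iterable[dict]) -> dict:
    """Aggregate CostMetadata-shaped dicts from several research calls."""
    records = list(cost_records)
    return {
        "calls": len(records),
        "total_tokens": sum(r["tokens_used"] for r in records),
        # Slowest call, useful when choosing a timeout; 0.0 when there are no records.
        "max_wall_time_sec": max((r["wall_time_sec"] for r in records), default=0.0),
        "budget_exhausted_calls": sum(1 for r in records if r["budget_exhausted"]),
    }
```

A rising `budget_exhausted_calls` count across a batch is a hint that budgets are too tight for the questions being asked, or that the loop is not converging.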
#### `trace_id` (string, UUID)
A unique identifier linking to the JSONL trace log:
```
~/.marchwarden/traces/{trace_id}.jsonl
```
The trace contains every decision, search, fetch, parse step for debugging and replay.
---
## Contract Rules
### The Researcher Must
1. **Never hallucinate citations.** If a claim isn't in a source, don't cite it.
2. **Admit gaps.** If you can't find something, say so. Don't guess.
3. **Respect budgets.** Stop iterating if `max_iterations` or `token_budget` is reached. Reflect in `budget_exhausted`.
4. **Ground claims.** Every factual claim in `answer` must link to at least one citation.
5. **Handle failures gracefully.** If Tavily is down or a URL is broken, note it in `gaps` and continue with what you have.
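Rule 5 can be sketched as a wrapper that turns a failed search into a gap entry instead of an exception. The `search` callable here is a hypothetical stand-in for the real Tavily integration, and `search_with_fallback` is an illustrative name; the broad `except` is a simplification for the sketch.

```python
from typing import Callable, List, Tuple


def search_with_fallback(
    queries: List[str],
    search: Callable[[str], List[str]],
) -> Tuple[List[str], List[dict]]:
    """Run each query; on failure, record a gap and keep the results gathered so far."""
    results: List[str] = []
    gaps: List[dict] = []
    for query in queries:
        try:
            results.extend(search(query))
        except Exception as exc:  # a real researcher would likely catch narrower error types
            gaps.append({"topic": query, "reason": f"search failed: {exc}"})
    return results, gaps
```

The returned gaps have the same `topic` and `reason` keys as `Gap`, so they can be merged straight into the final result.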
### The Caller (PI/CLI) Must
1. **Accept partial answers.** A researcher that hits its budget but admits gaps is better than one that spins endlessly.
2. **Use confidence and gaps.** Don't treat a 0.6 confidence answer the same as a 0.95 confidence answer.
3. **Check locators.** For important decisions, verify citations by following the locators.
---
## Examples
### Example 1: High-Confidence Answer
Request:
```json
{
  "question": "What is the capital of France?",
  "depth": "shallow"
}
```
Response:
```json
{
  "answer": "Paris is the capital of France. It is the country's largest city and serves as the political, cultural, and economic center.",
  "citations": [
    {
      "source": "web",
      "locator": "https://en.wikipedia.org/wiki/Paris",
      "title": "Paris - Wikipedia",
      "snippet": "Paris is the capital and largest city of France",
      "confidence": 0.99
    }
  ],
  "gaps": [],
  "confidence": 0.99,
  "cost_metadata": {
    "tokens_used": 450,
    "iterations_run": 1,
    "wall_time_sec": 3.2,
    "budget_exhausted": false
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440001"
}
```
### Example 2: Partial Answer with Gaps
Request:
```json
{
  "question": "What emerging startups in biotech are working on CRISPR gene therapy?",
  "depth": "deep"
}
```
Response:
```json
{
  "answer": "Several emerging startups are advancing CRISPR gene therapy... [detailed answer]",
  "citations": [
    {
      "source": "web",
      "locator": "https://www.crunchbase.com/...",
      "title": "Crunchbase: CRISPR Startups",
      "snippet": "Editas, Beam Therapeutics, and CRISPR Therapeutics...",
      "confidence": 0.8
    }
  ],
  "gaps": [
    {
      "topic": "funding rounds in 2026",
      "reason": "Web sources only go through Q1 2026; may be stale"
    },
    {
      "topic": "clinical trial status",
      "reason": "Requires access to clinical trials database (outside web search scope)"
    }
  ],
  "confidence": 0.72,
  "cost_metadata": {
    "tokens_used": 19240,
    "iterations_run": 4,
    "wall_time_sec": 67.8,
    "budget_exhausted": false
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440002"
}
```
### Example 3: Budget Exhausted
Request:
```json
{
  "question": "Comprehensive history of AI from 1950s to 2026",
  "depth": "deep",
  "constraints": {
    "max_iterations": 3,
    "token_budget": 5000
  }
}
```
Response:
```json
{
  "answer": "The history of AI spans multiple eras... [partial answer, cut off mid-synthesis]",
  "citations": [
    { ... 3-4 citations ... }
  ],
  "gaps": [
    {
      "topic": "detailed timeline 2020-2026",
      "reason": "budget exhausted before deep synthesis"
    },
    {
      "topic": "minor research directions",
      "reason": "out of scope due to token limit"
    }
  ],
  "confidence": 0.55,
  "cost_metadata": {
    "tokens_used": 4998,
    "iterations_run": 3,
    "wall_time_sec": 31.2,
    "budget_exhausted": true
  },
  "trace_id": "550e8400-e29b-41d4-a716-446655440003"
}
```
---
## Versioning
The contract is versioned as `v1`. If breaking changes are needed (e.g., new required fields), the next version becomes `v2` and both can coexist in the network for a transition period.
Current version: **v1**
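During a transition period, a caller needs some way to pick a version both sides understand. The contract does not say how versions are advertised, so the sketch below is purely illustrative: it assumes each side can list the versions it supports as strings such as `"v1"` and `"v2"`, and `negotiate_version` is a hypothetical helper.

```python
from typing import Iterable, Optional


def negotiate_version(caller_supports: Iterable[str],
                      researcher_offers: Iterable[str]) -> Optional[str]:
    """Return the highest contract version both sides support, or None if there is none."""
    common = set(caller_supports) & set(researcher_offers)
    if not common:
        return None
    # Versions are assumed to look like "v1", "v2", ...; compare by the integer part
    # so that "v10" sorts above "v2".
    return max(common, key=lambda v: int(v.lstrip("v")))
```

A `None` result means the caller and researcher share no contract version, which is the case the coexistence period is meant to avoid.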
---
See also: [Architecture.md](Architecture.md), [DevelopmentGuide.md](DevelopmentGuide.md)