Add development roadmap to wiki

Phases 0-7 from foundation through advanced features.
V1 ship target: Phases 0-2 (web researcher + CLI).
Each milestone has a concrete deliverable.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Jeff Smith 2026-04-08 13:51:22 -06:00
# Development Roadmap
This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).
Search provider: **Tavily** (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.
---
## Phase 0: Foundation
**Goal:** Everything needed before writing agent code.
### M0.1 — Tavily API Key
- Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
- Store key in `.env` as `TAVILY_API_KEY`
- Verify: quick Python test `TavilyClient(api_key=...).search("test")` returns results
- **Deliverable:** Working Tavily access
### M0.2 — Verify Dependencies
- Confirm pyproject.toml dependencies are correct (`tavily-python`, `httpx`, `pydantic>=2.0`, `anthropic`, `mcp`, `click`)
- `pip install -e ".[dev]"` succeeds
- **Deliverable:** Clean install
### M0.3 — Contract Models (Pydantic)
- `researchers/web/models.py` — all contract types as Pydantic models:
- `ResearchConstraints`
- `Citation` (with `raw_excerpt`)
- `GapCategory` enum
- `Gap` (with `category`)
- `DiscoveryEvent`
- `ConfidenceFactors`
- `CostMetadata` (with `model_id`)
- `ResearchResult`
- Unit tests for serialization/deserialization
- **Deliverable:** Models importable, tests green
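
A minimal sketch of what a few of these contract models might look like in Pydantic v2, with a serialization round-trip like the M0.3 unit tests. Field names beyond those listed above (`description`, `answer`) are illustrative guesses, not the final schema.

```python
# Sketch of a subset of the M0.3 contract models (Pydantic v2).
# Gap categories are the ones named elsewhere in this roadmap.
from enum import Enum
from pydantic import BaseModel

class GapCategory(str, Enum):
    SOURCE_NOT_FOUND = "SOURCE_NOT_FOUND"
    CONTRADICTORY_SOURCES = "CONTRADICTORY_SOURCES"
    SCOPE_EXCEEDED = "SCOPE_EXCEEDED"
    BUDGET_EXHAUSTED = "BUDGET_EXHAUSTED"
    ACCESS_DENIED = "ACCESS_DENIED"

class Citation(BaseModel):
    url: str
    raw_excerpt: str

class Gap(BaseModel):
    category: GapCategory
    description: str

class ResearchResult(BaseModel):
    answer: str
    citations: list[Citation] = []
    gaps: list[Gap] = []

# Serialization round-trip, the core of the M0.3 unit tests:
result = ResearchResult(
    answer="example",
    gaps=[Gap(category=GapCategory.SOURCE_NOT_FOUND, description="no 2026 data")],
)
restored = ResearchResult.model_validate_json(result.model_dump_json())
```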
---
## Phase 1: Web Researcher Core
**Goal:** A working research agent that can answer questions via web search.
### M1.1 — Search & Fetch Tools
- `researchers/web/tools.py`:
- `tavily_search(query, max_results)` — calls Tavily API, returns structured results with content
- `fetch_url(url)` — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
- Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
- **Deliverable:** Two tested tool functions
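
A sketch of the `fetch_url` half, assuming the function names and return shape (the real API may differ). The tag stripping is deliberately crude; the point is the SHA-256 content hash that the trace logger records later.

```python
# Sketch of the M1.1 fetch_url tool; names and return shape are assumptions.
import hashlib
import re

def clean_text(html: str) -> str:
    """Crude tag stripper; a real implementation would use a proper extractor."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def content_hash(text: str) -> str:
    """SHA-256 over the cleaned text, hex-encoded for the trace log."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fetch_url(url: str, client) -> dict:
    """client is an httpx.Client; returns clean text plus its content hash."""
    resp = client.get(url, follow_redirects=True, timeout=10.0)
    resp.raise_for_status()
    text = clean_text(resp.text)
    return {"url": url, "text": text, "content_hash": content_hash(text)}
```

Passing the `httpx.Client` in (rather than creating one per call) also makes the CI mocking straightforward.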
### M1.2 — Trace Logger
- `researchers/web/trace.py`:
- `TraceLogger` class — opens JSONL file keyed by trace_id
- `log_step(step, action, result, decision)` — appends one JSON line
- Includes `content_hash` for fetch actions
- Configurable trace directory (`~/.marchwarden/traces/` default)
- Unit tests for file creation, JSON structure, content hashing
- **Deliverable:** Trace logging works, JSONL files verifiable with `jq`
### M1.3 — Inner Agent Loop
- `researchers/web/agent.py`:
- `WebResearcher` class
- `async research(question, context, depth, constraints) -> ResearchResult`
- Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
- Uses Claude API (via `anthropic` SDK) as the reasoning engine
- Enforces `max_iterations` and `token_budget` at the loop level
- Populates all contract fields: `raw_excerpt`, categorized `gaps`, `discovery_events`, `confidence_factors`, `cost_metadata` (with `model_id`)
- Logs every step to TraceLogger
- Integration test: call with a real question, verify all contract fields populated
- **Deliverable:** `WebResearcher.research("What are ideal crops for Utah?")` returns a valid `ResearchResult`
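
The loop's budget-enforcement shape can be sketched with stub tools standing in for Tavily and Claude (the real planning and confidence estimates come from the model; everything below is a stand-in):

```python
# Skeleton of the M1.3 inner loop with stubbed search and confidence.
def research(question, max_iterations=5, token_budget=10_000,
             search=None, estimate_confidence=None):
    notes, tokens_used = [], 0
    for iteration in range(1, max_iterations + 1):
        result, cost = search(question, iteration)   # plan -> search -> fetch -> parse
        notes.append(result)
        tokens_used += cost
        if tokens_used >= token_budget:              # hard stop -> BUDGET_EXHAUSTED gap
            return {"notes": notes, "stopped": "BUDGET_EXHAUSTED", "iterations": iteration}
        if estimate_confidence(notes) >= 0.8:        # confident enough -> stop early
            return {"notes": notes, "stopped": "CONFIDENT", "iterations": iteration}
    return {"notes": notes, "stopped": "MAX_ITERATIONS", "iterations": max_iterations}

# Stubs: each pass "costs" 3,000 tokens and confidence never clears 0.8,
# so a 5,000-token budget exhausts on the second pass.
out = research(
    "What are ideal crops for Utah?",
    token_budget=5_000,
    search=lambda q, i: (f"note {i}", 3_000),
    estimate_confidence=lambda notes: 0.5,
)
```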
### M1.4 — MCP Server
- `researchers/web/server.py`:
- MCP server entry point using `mcp` SDK
- Exposes single tool: `research`
- Delegates to `WebResearcher`
- Server-level budget enforcement (kill agent if it exceeds constraints)
- Test: start server, call tool via MCP client, verify response schema
- **Deliverable:** `python -m researchers.web.server` starts an MCP server with one tool
---
## Phase 2: CLI Shim
**Goal:** Human-usable interface for testing the researcher.
### M2.1 — `ask` Command
- `cli/main.py` — Click CLI
- `marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]`
- Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
- Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
- Saves trace_id for replay
- **Deliverable:** End-to-end question -> formatted answer in terminal
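
The Click surface might look like the following; the options mirror the roadmap, while the `WebResearcher` call is stubbed with an echo since the agent doesn't exist yet.

```python
# Sketch of the M2.1 `ask` command; the researcher call is a placeholder.
import click

@click.command()
@click.argument("question")
@click.option("--depth", type=click.Choice(["shallow", "balanced", "deep"]),
              default="balanced")
@click.option("--budget", type=int, default=10_000, help="Token budget")
@click.option("--max-iterations", type=int, default=5)
def ask(question, depth, budget, max_iterations):
    """Ask the web researcher a question and pretty-print the result."""
    # Placeholder for WebResearcher.research(...); echoes the parsed options.
    click.echo(f"question={question} depth={depth} budget={budget} iters={max_iterations}")

# Quick check without a real terminal:
from click.testing import CliRunner
result = CliRunner().invoke(ask, ["Ideal crops for Utah?", "--depth", "deep"])
```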
### M2.2 — `replay` Command
- `cli/main.py`:
- `marchwarden replay <trace_id>`
- Reads JSONL trace file, pretty-prints each step
- Shows: action taken, decision made, content hashes
- **Deliverable:** `marchwarden replay <id>` shows full audit trail
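
The replay core is just a JSONL reader plus a formatter; the rendering below is an illustrative format, not the final one.

```python
# Sketch of the M2.2 replay core: read a JSONL trace, render each step.
import json
from pathlib import Path

def replay(trace_path: str) -> list[str]:
    """Return one formatted line per trace step (printing is left to the CLI)."""
    rendered = []
    for raw in Path(trace_path).read_text(encoding="utf-8").splitlines():
        step = json.loads(raw)
        line = f"[{step['step']}] {step['action']} -> {step['decision']}"
        if "content_hash" in step:
            line += f" (sha256:{step['content_hash'][:12]})"
        rendered.append(line)
    return rendered

# Demo against a hand-written trace file:
import tempfile
path = Path(tempfile.mkdtemp()) / "demo.jsonl"
path.write_text(
    '{"step": 1, "action": "search", "decision": "fetch top hit"}\n'
    '{"step": 2, "action": "fetch", "decision": "extract", "content_hash": "abc123def456789"}\n'
)
steps = replay(str(path))
```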
### M2.3 — First Smoke Test
- Run the boring test: "What are ideal crops for a garden in Utah?"
- Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
- **Deliverable:** First successful end-to-end run, documented in issue #1
---
## Phase 3: Stress Testing & Calibration
**Goal:** Exercise every contract feature, collect calibration data.
### M3.1 — Single-Axis Stress Tests
Run each, verify the specific contract feature it targets:
1. **Recency:** "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
2. **Contradiction:** "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
3. **Scope:** "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
4. **Budget:** "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED
- **Deliverable:** 4 trace files, documented results, contract gaps identified
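
The four cases above are naturally data for a `pytest.mark.parametrize` harness; the pairs below come straight from the list (the first case may also surface as dated recency rather than SOURCE_NOT_FOUND).

```python
# The M3.1 single-axis stress cases as (query, expected gap category) pairs.
STRESS_CASES = [
    ("What AI models were released in Q1 2026?", "SOURCE_NOT_FOUND"),
    ("Is coffee good or bad for you?", "CONTRADICTORY_SOURCES"),
    ("Compare CRISPR delivery mechanisms in recent clinical trials", "SCOPE_EXCEEDED"),
    ("Comprehensive history of AI 1950-2026", "BUDGET_EXHAUSTED"),
]
```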
### M3.2 — Multi-Axis Stress Test
- Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
- Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
- **Deliverable:** Complex trace file, full contract exercised
### M3.3 — Confidence Calibration (V1.1)
- Analyze confidence_factors across all test runs (20-30 queries)
- Identify patterns: when does the LLM over/under-estimate confidence?
- Draft a calibration rubric (what a 0.9 actually looks like empirically)
- Update ResearchContract.md with calibrated guidance
- **Deliverable:** Data-driven confidence rubric
---
## Phase 4: Hardening
**Goal:** Production-quality V1.
### M4.1 — Error Handling
- Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
- URL fetch failures -> individual citation skipped, noted in trace
- Claude API timeout -> meaningful error, partial results if possible
- Budget overflow protection at MCP server level
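
The degradation pattern for a provider outage might look like this sketch, where a failure becomes an ACCESS_DENIED gap instead of an exception (function names are placeholders):

```python
# Sketch of the M4.1 degradation path: provider failure -> ACCESS_DENIED gap.
def search_with_degradation(query, search_fn):
    """Run a search; on provider failure, return no results plus a gap."""
    try:
        return {"results": search_fn(query), "gaps": []}
    except Exception as exc:  # e.g. Tavily down or rate-limited
        gap = {"category": "ACCESS_DENIED",
               "description": f"search provider unavailable: {exc}"}
        return {"results": [], "gaps": [gap]}

def flaky_search(query):
    raise TimeoutError("rate limited")

out = search_with_degradation("utah crops", flaky_search)
```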
### M4.2 — Test Suite
- Unit tests: models, tools, trace logger
- Integration tests: full research loop with mocked Tavily
- Contract compliance tests: verify every ResearchResult field is populated correctly
- **Deliverable:** `pytest tests/` all green, reasonable coverage
### M4.3 — Documentation Polish
- Update DevelopmentGuide with Tavily setup instructions
- Add troubleshooting section
- Update README quick start
- **Deliverable:** A new developer can clone, set up, and run in 15 minutes
---
## Phase 5: Second Researcher (V2 begins)
**Goal:** Prove the contract works across researcher types.
### M5.1 — File/Document Researcher
- `researchers/docs/` — same contract, different tools
- Searches a local file corpus (glob + grep + read)
- Returns citations with file paths instead of URLs
- Same gaps, discovery_events, confidence_factors structure
- **Deliverable:** Two researchers, same contract, different sources
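
A sketch of the docs researcher's tool layer, assuming a grep-style helper that returns the same citation shape as the web researcher but with file paths (names and corpus layout are illustrative):

```python
# Sketch of an M5.1 tool: glob + grep over a local corpus, path-based citations.
import re
from pathlib import Path

def grep_corpus(root: str, pattern: str, glob: str = "**/*.md") -> list[dict]:
    """Return file-path citations with a raw_excerpt, mirroring web citations."""
    citations = []
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(root).glob(glob)):
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                citations.append({"path": str(path), "raw_excerpt": line.strip()})
    return citations

# Demo against a throwaway corpus:
import tempfile
root = Path(tempfile.mkdtemp())
(root / "notes.md").write_text("Tomatoes thrive in Utah summers.\nUnrelated line.\n")
hits = grep_corpus(str(root), r"utah")
```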
### M5.2 — Contract Validation
- Run the same question through both researchers
- Compare: do the contracts compose cleanly? Can the PI synthesize across them?
- Identify any contract changes needed (backward-compatible additions only)
- **Deliverable:** Validated multi-researcher contract
---
## Phase 6: PI Orchestrator (V2)
**Goal:** An agent that coordinates multiple researchers.
### M6.1 — PI Agent Core
- `orchestrator/pi.py`
- Dispatches researchers in parallel (asyncio.gather)
- Processes discovery_events -> dispatches follow-up researchers
- Compares raw_excerpts across researchers for contradiction detection
- Uses gap categories to decide: retry, re-dispatch, accept, escalate
- Synthesizes into final answer with full provenance
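
The parallel-dispatch core can be sketched with stub coroutines standing in for the MCP calls to the web and docs servers:

```python
# Sketch of M6.1 fan-out with asyncio.gather; researchers are stubs for MCP calls.
import asyncio

async def web_researcher(question):
    await asyncio.sleep(0)  # stands in for real network I/O
    return {"source": "web", "answer": f"web findings for: {question}"}

async def docs_researcher(question):
    await asyncio.sleep(0)
    return {"source": "docs", "answer": f"docs findings for: {question}"}

async def dispatch(question, researchers):
    """Fan out to all researchers concurrently, then collect for synthesis."""
    results = await asyncio.gather(*(r(question) for r in researchers))
    return {r["source"]: r["answer"] for r in results}

synthesis = asyncio.run(dispatch("utah crops", [web_researcher, docs_researcher]))
```

`asyncio.gather` keeps the dispatch logic flat; discovery-event follow-ups would simply append new coroutines to a later gather round.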
### M6.2 — PI CLI or Web UI
- Replace the CLI shim with PI-driven interface
- User asks a question -> PI decides which researchers to dispatch
- Shows intermediate progress (which researchers running, what they found)
---
## Phase 7: Advanced (V2+)
**Goal:** Address known V1 limitations.
- **Citation Validator** — programmatic URL/DOI ping before PI accepts
- **Content Addressable Storage** — store full fetched content, enable true replay
- **Streaming/Polling** — `research_status(job_id)` for long-running queries
- **Inter-Researcher Cross-Talk** — lateral dispatch without PI mediation
- **Utility Curve** — self-terminate when information gain diminishes
- **Vector-Indexed Trace Store** — cross-research learning
---
## Build Order Summary
```
Phase 0: Foundation <- Tavily key, deps, models
Phase 1: Web Researcher <- tools, trace, agent loop, MCP server
Phase 2: CLI Shim <- ask, replay, first smoke test
Phase 3: Stress Testing <- single-axis, multi-axis, calibration
Phase 4: Hardening <- errors, tests, docs
Phase 5: Second Researcher <- prove contract portability
Phase 6: PI Orchestrator <- the real goal
Phase 7: Advanced <- known limitations resolved
```
Each milestone has a clear deliverable and a moment of completion.

| Phases | Version | Ship Target |
|:---|:---|:---|
| 0-2 | V1 | Issue #1 — single web researcher + CLI |
| 3 | V1.1 | Stress testing + confidence calibration |
| 4 | V1.2 | Hardened, tested, documented |
| 5-6 | V2 | Multi-researcher + PI orchestrator |
| 7 | V2+ | Known limitations resolved |
---
See also: [Architecture](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/Architecture), [ResearchContract](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ResearchContract), [DevelopmentGuide](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/DevelopmentGuide)