Add development roadmap to wiki

Phases 0-7 from foundation through advanced features.
V1 ship target: Phases 0-2 (web researcher + CLI).
Each milestone has a concrete deliverable.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Jeff Smith 2026-04-08 13:51:22 -06:00
# Development Roadmap
This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).
Search provider: **Tavily** (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.
---
## Phase 0: Foundation
**Goal:** Everything needed before writing agent code.
### M0.1 — Tavily API Key
- Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
- Store key in `.env` as `TAVILY_API_KEY`
- Verify: quick Python test `TavilyClient(api_key=...).search("test")` returns results
- **Deliverable:** Working Tavily access
### M0.2 — Verify Dependencies
- Confirm pyproject.toml dependencies are correct (`tavily-python`, `httpx`, `pydantic>=2.0`, `anthropic`, `mcp`, `click`)
- `pip install -e ".[dev]"` succeeds
- **Deliverable:** Clean install
### M0.3 — Contract Models (Pydantic)
- `researchers/web/models.py` — all contract types as Pydantic models:
- `ResearchConstraints`
- `Citation` (with `raw_excerpt`)
- `GapCategory` enum
- `Gap` (with `category`)
- `DiscoveryEvent`
- `ConfidenceFactors`
- `CostMetadata` (with `model_id`)
- `ResearchResult`
- Unit tests for serialization/deserialization
- **Deliverable:** Models importable, tests green
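
A minimal sketch of what a few of these contract models might look like in Pydantic v2, with a serialization round-trip like the M0.3 unit tests. Field names beyond those listed above (`description`, `answer`) are illustrative guesses, not the final schema.

```python
# Sketch of a subset of the M0.3 contract models (Pydantic v2).
# Gap categories are the ones named elsewhere in this roadmap.
from enum import Enum
from pydantic import BaseModel

class GapCategory(str, Enum):
    SOURCE_NOT_FOUND = "SOURCE_NOT_FOUND"
    CONTRADICTORY_SOURCES = "CONTRADICTORY_SOURCES"
    SCOPE_EXCEEDED = "SCOPE_EXCEEDED"
    BUDGET_EXHAUSTED = "BUDGET_EXHAUSTED"
    ACCESS_DENIED = "ACCESS_DENIED"

class Citation(BaseModel):
    url: str
    raw_excerpt: str

class Gap(BaseModel):
    category: GapCategory
    description: str

class ResearchResult(BaseModel):
    answer: str
    citations: list[Citation] = []
    gaps: list[Gap] = []

# Serialization round-trip, the core of the M0.3 unit tests:
result = ResearchResult(
    answer="example",
    gaps=[Gap(category=GapCategory.SOURCE_NOT_FOUND, description="no 2026 data")],
)
restored = ResearchResult.model_validate_json(result.model_dump_json())
```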
---
## Phase 1: Web Researcher Core
**Goal:** A working research agent that can answer questions via web search.
### M1.1 — Search & Fetch Tools
- `researchers/web/tools.py`:
- `tavily_search(query, max_results)` — calls Tavily API, returns structured results with content
- `fetch_url(url)` — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
- Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
- **Deliverable:** Two tested tool functions
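
A sketch of the `fetch_url` half, assuming the function names and return shape (the real API may differ). The tag stripping is deliberately crude; the point is the SHA-256 content hash that the trace logger records later.

```python
# Sketch of the M1.1 fetch_url tool; names and return shape are assumptions.
import hashlib
import re

def clean_text(html: str) -> str:
    """Crude tag stripper; a real implementation would use a proper extractor."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def content_hash(text: str) -> str:
    """SHA-256 over the cleaned text, hex-encoded for the trace log."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fetch_url(url: str, client) -> dict:
    """client is an httpx.Client; returns clean text plus its content hash."""
    resp = client.get(url, follow_redirects=True, timeout=10.0)
    resp.raise_for_status()
    text = clean_text(resp.text)
    return {"url": url, "text": text, "content_hash": content_hash(text)}
```

Passing the `httpx.Client` in (rather than creating one per call) also makes the CI mocking straightforward.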
### M1.2 — Trace Logger
- `researchers/web/trace.py`:
- `TraceLogger` class — opens JSONL file keyed by trace_id
- `log_step(step, action, result, decision)` — appends one JSON line
- Includes `content_hash` for fetch actions
- Configurable trace directory (`~/.marchwarden/traces/` default)
- Unit tests for file creation, JSON structure, content hashing
- **Deliverable:** Trace logging works, JSONL files verifiable with `jq`
### M1.3 — Inner Agent Loop
- `researchers/web/agent.py`:
- `WebResearcher` class
- `async research(question, context, depth, constraints) -> ResearchResult`
- Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
- Uses Claude API (via `anthropic` SDK) as the reasoning engine
- Enforces `max_iterations` and `token_budget` at the loop level
- Populates all contract fields: `raw_excerpt`, categorized `gaps`, `discovery_events`, `confidence_factors`, `cost_metadata` (with `model_id`)
- Logs every step to TraceLogger
- Integration test: call with a real question, verify all contract fields populated
- **Deliverable:** `WebResearcher.research("What are ideal crops for Utah?")` returns a valid `ResearchResult`
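
The loop's budget-enforcement shape can be sketched with stub tools standing in for Tavily and Claude (the real planning and confidence estimates come from the model; everything below is a stand-in):

```python
# Skeleton of the M1.3 inner loop with stubbed search and confidence.
def research(question, max_iterations=5, token_budget=10_000,
             search=None, estimate_confidence=None):
    notes, tokens_used = [], 0
    for iteration in range(1, max_iterations + 1):
        result, cost = search(question, iteration)   # plan -> search -> fetch -> parse
        notes.append(result)
        tokens_used += cost
        if tokens_used >= token_budget:              # hard stop -> BUDGET_EXHAUSTED gap
            return {"notes": notes, "stopped": "BUDGET_EXHAUSTED", "iterations": iteration}
        if estimate_confidence(notes) >= 0.8:        # confident enough -> stop early
            return {"notes": notes, "stopped": "CONFIDENT", "iterations": iteration}
    return {"notes": notes, "stopped": "MAX_ITERATIONS", "iterations": max_iterations}

# Stubs: each pass "costs" 3,000 tokens and confidence never clears 0.8,
# so a 5,000-token budget exhausts on the second pass.
out = research(
    "What are ideal crops for Utah?",
    token_budget=5_000,
    search=lambda q, i: (f"note {i}", 3_000),
    estimate_confidence=lambda notes: 0.5,
)
```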
### M1.4 — MCP Server
- `researchers/web/server.py`:
- MCP server entry point using `mcp` SDK
- Exposes single tool: `research`
- Delegates to `WebResearcher`
- Server-level budget enforcement (kill agent if it exceeds constraints)
- Test: start server, call tool via MCP client, verify response schema
- **Deliverable:** `python -m researchers.web.server` starts an MCP server with one tool
---
## Phase 2: CLI Shim
**Goal:** Human-usable interface for testing the researcher.
### M2.1 — `ask` Command
- `cli/main.py` — Click CLI
- `marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]`
- Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
- Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
- Saves trace_id for replay
- **Deliverable:** End-to-end question -> formatted answer in terminal
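
The Click surface might look like the following; the options mirror the roadmap, while the `WebResearcher` call is stubbed with an echo since the agent doesn't exist yet.

```python
# Sketch of the M2.1 `ask` command; the researcher call is a placeholder.
import click

@click.command()
@click.argument("question")
@click.option("--depth", type=click.Choice(["shallow", "balanced", "deep"]),
              default="balanced")
@click.option("--budget", type=int, default=10_000, help="Token budget")
@click.option("--max-iterations", type=int, default=5)
def ask(question, depth, budget, max_iterations):
    """Ask the web researcher a question and pretty-print the result."""
    # Placeholder for WebResearcher.research(...); echoes the parsed options.
    click.echo(f"question={question} depth={depth} budget={budget} iters={max_iterations}")

# Quick check without a real terminal:
from click.testing import CliRunner
result = CliRunner().invoke(ask, ["Ideal crops for Utah?", "--depth", "deep"])
```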
### M2.2 — `replay` Command
- `cli/main.py`:
- `marchwarden replay <trace_id>`
- Reads JSONL trace file, pretty-prints each step
- Shows: action taken, decision made, content hashes
- **Deliverable:** `marchwarden replay <id>` shows full audit trail
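
The replay core is just a JSONL reader plus a formatter; the rendering below is an illustrative format, not the final one.

```python
# Sketch of the M2.2 replay core: read a JSONL trace, render each step.
import json
from pathlib import Path

def replay(trace_path: str) -> list[str]:
    """Return one formatted line per trace step (printing is left to the CLI)."""
    rendered = []
    for raw in Path(trace_path).read_text(encoding="utf-8").splitlines():
        step = json.loads(raw)
        line = f"[{step['step']}] {step['action']} -> {step['decision']}"
        if "content_hash" in step:
            line += f" (sha256:{step['content_hash'][:12]})"
        rendered.append(line)
    return rendered

# Demo against a hand-written trace file:
import tempfile
path = Path(tempfile.mkdtemp()) / "demo.jsonl"
path.write_text(
    '{"step": 1, "action": "search", "decision": "fetch top hit"}\n'
    '{"step": 2, "action": "fetch", "decision": "extract", "content_hash": "abc123def456789"}\n'
)
steps = replay(str(path))
```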
### M2.3 — First Smoke Test
- Run the boring test: "What are ideal crops for a garden in Utah?"
- Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
- **Deliverable:** First successful end-to-end run, documented in issue #1
---
## Phase 3: Stress Testing & Calibration
**Goal:** Exercise every contract feature, collect calibration data.
### M3.1 — Single-Axis Stress Tests
Run each, verify the specific contract feature it targets:
1. **Recency:** "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
2. **Contradiction:** "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
3. **Scope:** "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
4. **Budget:** "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED
- **Deliverable:** 4 trace files, documented results, contract gaps identified
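
The four cases above are naturally data for a `pytest.mark.parametrize` harness; the pairs below come straight from the list (the first case may also surface as dated recency rather than SOURCE_NOT_FOUND).

```python
# The M3.1 single-axis stress cases as (query, expected gap category) pairs.
STRESS_CASES = [
    ("What AI models were released in Q1 2026?", "SOURCE_NOT_FOUND"),
    ("Is coffee good or bad for you?", "CONTRADICTORY_SOURCES"),
    ("Compare CRISPR delivery mechanisms in recent clinical trials", "SCOPE_EXCEEDED"),
    ("Comprehensive history of AI 1950-2026", "BUDGET_EXHAUSTED"),
]
```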
### M3.2 — Multi-Axis Stress Test
- Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
- Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
- **Deliverable:** Complex trace file, full contract exercised
### M3.3 — Confidence Calibration (V1.1)
- Analyze confidence_factors across all test runs (20-30 queries)
- Identify patterns: when does the LLM over/under-estimate confidence?
- Draft a calibration rubric (what a 0.9 actually looks like empirically)
- Update ResearchContract.md with calibrated guidance
- **Deliverable:** Data-driven confidence rubric
---
## Phase 4: Hardening
**Goal:** Production-quality V1.
### M4.1 — Error Handling
- Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
- URL fetch failures -> individual citation skipped, noted in trace
- Claude API timeout -> meaningful error, partial results if possible
- Budget overflow protection at MCP server level
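
The degradation pattern for a provider outage might look like this sketch, where a failure becomes an ACCESS_DENIED gap instead of an exception (function names are placeholders):

```python
# Sketch of the M4.1 degradation path: provider failure -> ACCESS_DENIED gap.
def search_with_degradation(query, search_fn):
    """Run a search; on provider failure, return no results plus a gap."""
    try:
        return {"results": search_fn(query), "gaps": []}
    except Exception as exc:  # e.g. Tavily down or rate-limited
        gap = {"category": "ACCESS_DENIED",
               "description": f"search provider unavailable: {exc}"}
        return {"results": [], "gaps": [gap]}

def flaky_search(query):
    raise TimeoutError("rate limited")

out = search_with_degradation("utah crops", flaky_search)
```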
### M4.2 — Test Suite
- Unit tests: models, tools, trace logger
- Integration tests: full research loop with mocked Tavily
- Contract compliance tests: verify every ResearchResult field is populated correctly
- **Deliverable:** `pytest tests/` all green, reasonable coverage
### M4.3 — Documentation Polish
- Update DevelopmentGuide with Tavily setup instructions
- Add troubleshooting section
- Update README quick start
- **Deliverable:** A new developer can clone, set up, and run in 15 minutes
---
## Phase 5: Second Researcher (V2 begins)
**Goal:** Prove the contract works across researcher types.
### M5.1 — File/Document Researcher
- `researchers/docs/` — same contract, different tools
- Searches a local file corpus (glob + grep + read)
- Returns citations with file paths instead of URLs
- Same gaps, discovery_events, confidence_factors structure
- **Deliverable:** Two researchers, same contract, different sources
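
A sketch of the docs researcher's tool layer, assuming a grep-style helper that returns the same citation shape as the web researcher but with file paths (names and corpus layout are illustrative):

```python
# Sketch of an M5.1 tool: glob + grep over a local corpus, path-based citations.
import re
from pathlib import Path

def grep_corpus(root: str, pattern: str, glob: str = "**/*.md") -> list[dict]:
    """Return file-path citations with a raw_excerpt, mirroring web citations."""
    citations = []
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(root).glob(glob)):
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                citations.append({"path": str(path), "raw_excerpt": line.strip()})
    return citations

# Demo against a throwaway corpus:
import tempfile
root = Path(tempfile.mkdtemp())
(root / "notes.md").write_text("Tomatoes thrive in Utah summers.\nUnrelated line.\n")
hits = grep_corpus(str(root), r"utah")
```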
### M5.2 — Contract Validation
- Run the same question through both researchers
- Compare: do the contracts compose cleanly? Can the PI synthesize across them?
- Identify any contract changes needed (backward-compatible additions only)
- **Deliverable:** Validated multi-researcher contract
---
## Phase 6: PI Orchestrator (V2)
**Goal:** An agent that coordinates multiple researchers.
### M6.1 — PI Agent Core
- `orchestrator/pi.py`
- Dispatches researchers in parallel (asyncio.gather)
- Processes discovery_events -> dispatches follow-up researchers
- Compares raw_excerpts across researchers for contradiction detection
- Uses gap categories to decide: retry, re-dispatch, accept, escalate
- Synthesizes into final answer with full provenance
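
The parallel-dispatch core can be sketched with stub coroutines standing in for the MCP calls to the web and docs servers:

```python
# Sketch of M6.1 fan-out with asyncio.gather; researchers are stubs for MCP calls.
import asyncio

async def web_researcher(question):
    await asyncio.sleep(0)  # stands in for real network I/O
    return {"source": "web", "answer": f"web findings for: {question}"}

async def docs_researcher(question):
    await asyncio.sleep(0)
    return {"source": "docs", "answer": f"docs findings for: {question}"}

async def dispatch(question, researchers):
    """Fan out to all researchers concurrently, then collect for synthesis."""
    results = await asyncio.gather(*(r(question) for r in researchers))
    return {r["source"]: r["answer"] for r in results}

synthesis = asyncio.run(dispatch("utah crops", [web_researcher, docs_researcher]))
```

`asyncio.gather` keeps the dispatch logic flat; discovery-event follow-ups would simply append new coroutines to a later gather round.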
### M6.2 — PI CLI or Web UI
- Replace the CLI shim with PI-driven interface
- User asks a question -> PI decides which researchers to dispatch
- Shows intermediate progress (which researchers running, what they found)
---
## Phase 7: Advanced (V2+)
**Goal:** Address known V1 limitations.
- **Citation Validator** — programmatic URL/DOI ping before PI accepts
- **Content Addressable Storage** — store full fetched content, enable true replay
- **Streaming/Polling** — `research_status(job_id)` for long-running queries
- **Inter-Researcher Cross-Talk** — lateral dispatch without PI mediation
- **Utility Curve** — self-terminate when information gain diminishes
- **Vector-Indexed Trace Store** — cross-research learning
---
## Build Order Summary
```
Phase 0: Foundation <- Tavily key, deps, models
Phase 1: Web Researcher <- tools, trace, agent loop, MCP server
Phase 2: CLI Shim <- ask, replay, first smoke test
Phase 3: Stress Testing <- single-axis, multi-axis, calibration
Phase 4: Hardening <- errors, tests, docs
Phase 5: Second Researcher <- prove contract portability
Phase 6: PI Orchestrator <- the real goal
Phase 7: Advanced <- known limitations resolved
```
Each milestone has a clear deliverable and a moment of completion.

| Phases | Version | Ship Target |
|:---|:---|:---|
| 0-2 | V1 | Issue #1 — single web researcher + CLI |
| 3 | V1.1 | Stress testing + confidence calibration |
| 4 | V1.2 | Hardened, tested, documented |
| 5-6 | V2 | Multi-researcher + PI orchestrator |
| 7 | V2+ | Known limitations resolved |
---
See also: [Architecture](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/Architecture), [ResearchContract](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ResearchContract), [DevelopmentGuide](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/DevelopmentGuide)