From 4f945087d311ca501169e524dedf3fb24835614d Mon Sep 17 00:00:00 2001
From: Jeff Smith
Date: Wed, 8 Apr 2026 13:51:22 -0600
Subject: [PATCH] Add development roadmap to wiki

Phases 0-7 from foundation through advanced features.
V1 ship target: Phases 0-2 (web researcher + CLI).
Each milestone has a concrete deliverable.

Co-Authored-By: Claude Haiku 4.5
---
 Roadmap.md | 226 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)
 create mode 100644 Roadmap.md

diff --git a/Roadmap.md b/Roadmap.md
new file mode 100644
index 0000000..288642b
--- /dev/null
+++ b/Roadmap.md
@@ -0,0 +1,226 @@
+# Development Roadmap
+
+This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).
+
+Search provider: **Tavily** (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.
+
+---
+
+## Phase 0: Foundation
+**Goal:** Everything needed before writing agent code.
+
+### M0.1 — Tavily API Key
+- Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
+- Store key in `.env` as `TAVILY_API_KEY`
+- Verify: quick Python test `TavilyClient(api_key=...).search("test")` returns results
+- **Deliverable:** Working Tavily access
+
+### M0.2 — Verify Dependencies
+- Confirm pyproject.toml dependencies are correct (`tavily-python`, `httpx`, `pydantic>=2.0`, `anthropic`, `mcp`, `click`)
+- `pip install -e ".[dev]"` succeeds
+- **Deliverable:** Clean install
+
+### M0.3 — Contract Models (Pydantic)
+- `researchers/web/models.py` — all contract types as Pydantic models:
+  - `ResearchConstraints`
+  - `Citation` (with `raw_excerpt`)
+  - `GapCategory` enum
+  - `Gap` (with `category`)
+  - `DiscoveryEvent`
+  - `ConfidenceFactors`
+  - `CostMetadata` (with `model_id`)
+  - `ResearchResult`
+- Unit tests for serialization/deserialization
+- **Deliverable:** Models importable, tests green
+
+---
+
+## Phase 1: Web Researcher Core
+**Goal:** A working research agent that can answer questions via web search.
+
+### M1.1 — Search & Fetch Tools
+- `researchers/web/tools.py`:
+  - `tavily_search(query, max_results)` — calls Tavily API, returns structured results with content
+  - `fetch_url(url)` — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
+- Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
+- **Deliverable:** Two tested tool functions
+
+### M1.2 — Trace Logger
+- `researchers/web/trace.py`:
+  - `TraceLogger` class — opens JSONL file keyed by trace_id
+  - `log_step(step, action, result, decision)` — appends one JSON line
+  - Includes `content_hash` for fetch actions
+  - Configurable trace directory (`~/.marchwarden/traces/` default)
+- Unit tests for file creation, JSON structure, content hashing
+- **Deliverable:** Trace logging works, JSONL files verifiable with `jq`
+
+### M1.3 — Inner Agent Loop
+- `researchers/web/agent.py`:
+  - `WebResearcher` class
+  - `async research(question, context, depth, constraints) -> ResearchResult`
+  - Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
+  - Uses Claude API (via `anthropic` SDK) as the reasoning engine
+  - Enforces `max_iterations` and `token_budget` at the loop level
+  - Populates all contract fields: `raw_excerpt`, categorized `gaps`, `discovery_events`, `confidence_factors`, `cost_metadata` (with `model_id`)
+  - Logs every step to TraceLogger
+- Integration test: call with a real question, verify all contract fields populated
+- **Deliverable:** `WebResearcher.research("What are ideal crops for Utah?")` returns a valid `ResearchResult`
+
+### M1.4 — MCP Server
+- `researchers/web/server.py`:
+  - MCP server entry point using `mcp` SDK
+  - Exposes single tool: `research`
+  - Delegates to `WebResearcher`
+  - Server-level budget enforcement (kill agent if it exceeds constraints)
+- Test: start server, call tool via MCP client, verify response schema
+- **Deliverable:** `python -m researchers.web.server` starts an MCP server with one tool
+
+---
+
+## Phase 2: CLI Shim
+**Goal:** Human-usable interface for testing the researcher.
+
+### M2.1 — `ask` Command
+- `cli/main.py` — Click CLI
+  - `marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]`
+  - Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
+  - Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
+  - Saves trace_id for replay
+- **Deliverable:** End-to-end question -> formatted answer in terminal
+
+### M2.2 — `replay` Command
+- `cli/main.py`:
+  - `marchwarden replay `
+  - Reads JSONL trace file, pretty-prints each step
+  - Shows: action taken, decision made, content hashes
+- **Deliverable:** `marchwarden replay ` shows full audit trail
+
+### M2.3 — First Smoke Test
+- Run the boring test: "What are ideal crops for a garden in Utah?"
+- Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
+- **Deliverable:** First successful end-to-end run, documented in issue #1
+
+---
+
+## Phase 3: Stress Testing & Calibration
+**Goal:** Exercise every contract feature, collect calibration data.
+
+### M3.1 — Single-Axis Stress Tests
+Run each, verify the specific contract feature it targets:
+1. **Recency:** "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
+2. **Contradiction:** "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
+3. **Scope:** "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
+4. **Budget:** "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED
+- **Deliverable:** 4 trace files, documented results, contract gaps identified
+
+### M3.2 — Multi-Axis Stress Test
+- Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
+- Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
+- **Deliverable:** Complex trace file, full contract exercised
+
+### M3.3 — Confidence Calibration (V1.1)
+- Analyze confidence_factors across all test runs (20-30 queries)
+- Identify patterns: when does the LLM over/under-estimate confidence?
+- Draft a calibration rubric (what a 0.9 actually looks like empirically)
+- Update ResearchContract.md with calibrated guidance
+- **Deliverable:** Data-driven confidence rubric
+
+---
+
+## Phase 4: Hardening
+**Goal:** Production-quality V1.
+
+### M4.1 — Error Handling
+- Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
+- URL fetch failures -> individual citation skipped, noted in trace
+- Claude API timeout -> meaningful error, partial results if possible
+- Budget overflow protection at MCP server level
+
+### M4.2 — Test Suite
+- Unit tests: models, tools, trace logger
+- Integration tests: full research loop with mocked Tavily
+- Contract compliance tests: verify every ResearchResult field is populated correctly
+- **Deliverable:** `pytest tests/` all green, reasonable coverage
+
+### M4.3 — Documentation Polish
+- Update DevelopmentGuide with Tavily setup instructions
+- Add troubleshooting section
+- Update README quick start
+- **Deliverable:** A new developer can clone, set up, and run in 15 minutes
+
+---
+
+## Phase 5: Second Researcher (V2 begins)
+**Goal:** Prove the contract works across researcher types.
+
+### M5.1 — File/Document Researcher
+- `researchers/docs/` — same contract, different tools
+- Searches a local file corpus (glob + grep + read)
+- Returns citations with file paths instead of URLs
+- Same gaps, discovery_events, confidence_factors structure
+- **Deliverable:** Two researchers, same contract, different sources
+
+### M5.2 — Contract Validation
+- Run the same question through both researchers
+- Compare: do the contracts compose cleanly? Can the PI synthesize across them?
+- Identify any contract changes needed (backward-compatible additions only)
+- **Deliverable:** Validated multi-researcher contract
+
+---
+
+## Phase 6: PI Orchestrator (V2)
+**Goal:** An agent that coordinates multiple researchers.
+
+### M6.1 — PI Agent Core
+- `orchestrator/pi.py`
+- Dispatches researchers in parallel (asyncio.gather)
+- Processes discovery_events -> dispatches follow-up researchers
+- Compares raw_excerpts across researchers for contradiction detection
+- Uses gap categories to decide: retry, re-dispatch, accept, escalate
+- Synthesizes into final answer with full provenance
+
+### M6.2 — PI CLI or Web UI
+- Replace the CLI shim with PI-driven interface
+- User asks a question -> PI decides which researchers to dispatch
+- Shows intermediate progress (which researchers running, what they found)
+
+---
+
+## Phase 7: Advanced (V2+)
+**Goal:** Address known V1 limitations.
+
+- **Citation Validator** — programmatic URL/DOI ping before PI accepts
+- **Content Addressable Storage** — store full fetched content, enable true replay
+- **Streaming/Polling** — `research_status(job_id)` for long-running queries
+- **Inter-Researcher Cross-Talk** — lateral dispatch without PI mediation
+- **Utility Curve** — self-terminate when information gain diminishes
+- **Vector-Indexed Trace Store** — cross-research learning
+
+---
+
+## Build Order Summary
+
+```
+Phase 0: Foundation          <- Tavily key, deps, models
+Phase 1: Web Researcher      <- tools, trace, agent loop, MCP server
+Phase 2: CLI Shim            <- ask, replay, first smoke test
+Phase 3: Stress Testing      <- single-axis, multi-axis, calibration
+Phase 4: Hardening           <- errors, tests, docs
+Phase 5: Second Researcher   <- prove contract portability
+Phase 6: PI Orchestrator     <- the real goal
+Phase 7: Advanced            <- known limitations resolved
+```
+
+Each milestone has a clear deliverable and a moment of completion.
+
+| Phases | Version | Ship Target |
+|:---|:---|:---|
+| 0-2 | V1 | Issue #1 — single web researcher + CLI |
+| 3 | V1.1 | Stress testing + confidence calibration |
+| 4 | V1.2 | Hardened, tested, documented |
+| 5-6 | V2 | Multi-researcher + PI orchestrator |
+| 7 | V2+ | Known limitations resolved |
+
+---
+
+See also: [Architecture](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/Architecture), [ResearchContract](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ResearchContract), [DevelopmentGuide](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/DevelopmentGuide)
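
Reviewer note (not part of the patch): milestone M1.2 describes a `TraceLogger` that appends one JSON line per step to a file keyed by `trace_id`, with a `content_hash` on fetch actions. One way this could look is sketched below, using only the standard library. The record field names, the `timestamp` field, the `<trace_id>.jsonl` file name, and the optional `content` argument to `log_step` are all assumptions for illustration. The roadmap itself specifies only `log_step(step, action, result, decision)`, SHA-256 hashing, and the default directory.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of fetched page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TraceLogger:
    """Append-only JSONL trace, one file per trace_id (hypothetical layout)."""

    def __init__(
        self,
        trace_id: Optional[str] = None,
        trace_dir: Path = Path.home() / ".marchwarden" / "traces",
    ) -> None:
        # Assumption: a random hex id is acceptable when the caller gives none.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.path = Path(trace_dir) / f"{self.trace_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log_step(
        self,
        step: int,
        action: str,
        result: str,
        decision: str,
        content: Optional[str] = None,  # assumption: raw text for fetch actions
    ) -> dict:
        """Append one JSON object as a single line and return it."""
        record = {
            "trace_id": self.trace_id,
            "step": step,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "result": result,
            "decision": decision,
        }
        if content is not None:
            # Store only the hash, not the fetched body itself.
            record["content_hash"] = content_hash(content)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return record
```

Each record occupies exactly one line, so a file written this way should be readable with `jq`, which is the check the M1.2 deliverable names. Because only the hash is stored, a trace can show whether a page changed between runs but cannot reproduce the page, which matches the "Content Addressable Storage" item deferred to Phase 7.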