2 Roadmap
Jeff Smith edited this page 2026-04-08 17:16:51 -06:00

Development Roadmap

This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).

Search provider: Tavily (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.


Phase 0: Foundation

Goal: Everything needed before writing agent code.

M0.1 — Tavily API Key

  • Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
  • Store key in .env as TAVILY_API_KEY
  • Verify: quick Python test TavilyClient(api_key=...).search("test") returns results
  • Deliverable: Working Tavily access

M0.2 — Verify Dependencies

  • Confirm pyproject.toml dependencies are correct (tavily-python, httpx, pydantic>=2.0, anthropic, mcp, click)
  • pip install -e ".[dev]" succeeds
  • Deliverable: Clean install

M0.3 — Contract Models (Pydantic)

  • researchers/web/models.py — all contract types as Pydantic models:
    • ResearchConstraints
    • Citation (with raw_excerpt)
    • GapCategory enum
    • Gap (with category)
    • DiscoveryEvent
    • ConfidenceFactors
    • CostMetadata (with model_id)
    • ResearchResult
  • Unit tests for serialization/deserialization
  • Deliverable: Models importable, tests green
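To make the contract shapes concrete, here is a minimal sketch of a few of the types above. It uses stdlib dataclasses as a stand-in (the real models in researchers/web/models.py are Pydantic v2 classes with validation), and the enum values and field names beyond those listed above are assumptions, not the actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum

# Illustrative subset of the contract types; values are assumed, not the real schema.
class GapCategory(Enum):
    SOURCE_NOT_FOUND = "source_not_found"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"
    BUDGET_EXHAUSTED = "budget_exhausted"
    ACCESS_DENIED = "access_denied"

@dataclass
class Citation:
    url: str
    title: str
    raw_excerpt: str  # verbatim source text supporting the claim

@dataclass
class Gap:
    category: GapCategory
    description: str

@dataclass
class ResearchResult:
    answer: str
    citations: list = field(default_factory=list)
    gaps: list = field(default_factory=list)
    confidence: float = 0.0

    def to_json(self) -> str:
        # Serialize enums by value so the JSON matches what Pydantic would emit.
        def encode(o):
            if isinstance(o, Enum):
                return o.value
            raise TypeError(o)
        return json.dumps(asdict(self), default=encode)

result = ResearchResult(
    answer="Warm-season crops such as tomatoes do well in Utah.",
    citations=[Citation("https://example.com", "Utah gardening", "Tomatoes thrive in hot, dry summers.")],
    gaps=[Gap(GapCategory.SOURCE_NOT_FOUND, "No 2026 extension-service data found")],
    confidence=0.7,
)
round_tripped = json.loads(result.to_json())
```

The serialization round trip here is exactly what the M0.3 unit tests need to cover for every contract type.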

Phase 1: Web Researcher Core

Goal: A working research agent that can answer questions via web search.

M1.1 — Search & Fetch Tools

  • researchers/web/tools.py:
    • tavily_search(query, max_results) — calls Tavily API, returns structured results with content
    • fetch_url(url) — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
  • Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
  • Deliverable: Two tested tool functions
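The content-hash half of fetch_url can be sketched directly; this uses urllib as a stand-in for httpx so the sketch is self-contained, and the returned dict shape is an assumption, not the real tool signature.

```python
import hashlib
from urllib.request import urlopen  # stand-in; the real tool uses httpx

def content_hash(text: str) -> str:
    """SHA-256 over the fetched page text, used to pin content in traces."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fetch_url(url: str, timeout: float = 10.0) -> dict:
    """Fetch a page Tavily didn't fully extract; return text plus its hash.

    Sketch only: the real version uses httpx and an HTML-to-text cleaner.
    """
    with urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return {"url": url, "text": text, "content_hash": content_hash(text)}
```

Hashing the cleaned text (not the raw bytes) means the trace pins exactly what the agent reasoned over.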

M1.2 — Trace Logger

  • researchers/web/trace.py:
    • TraceLogger class — opens JSONL file keyed by trace_id
    • log_step(step, action, result, decision) — appends one JSON line
    • Includes content_hash for fetch actions
    • Configurable trace directory (~/.marchwarden/traces/ default)
  • Unit tests for file creation, JSON structure, content hashing
  • Deliverable: Trace logging works, JSONL files verifiable with jq

M1.3 — Inner Agent Loop

  • researchers/web/agent.py:
    • WebResearcher class
    • async research(question, context, depth, constraints) -> ResearchResult
    • Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
    • Uses Claude API (via anthropic SDK) as the reasoning engine
    • Enforces max_iterations and token_budget at the loop level
    • Populates all contract fields: raw_excerpt, categorized gaps, discovery_events, confidence_factors, cost_metadata (with model_id)
    • Logs every step to TraceLogger
  • Integration test: call with a real question, verify all contract fields populated
  • Deliverable: WebResearcher.research("What are ideal crops for Utah?") returns a valid ResearchResult
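The control flow of the inner loop can be sketched with the planning, tool, and confidence steps injected as callables, so the budget enforcement is testable without Claude or Tavily. All names here are illustrative, not the real WebResearcher API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LoopBudget:
    max_iterations: int = 5
    token_budget: int = 50_000

def research_loop(question: str,
                  plan: Callable,       # question -> next action
                  execute: Callable,    # action -> (evidence, tokens_spent)
                  confident: Callable,  # evidence -> bool
                  budget: LoopBudget) -> dict:
    """Skeleton of plan -> search/fetch -> check confidence -> iterate or stop,
    with max_iterations and token_budget enforced at the loop level."""
    tokens_used = 0
    iteration = -1
    stop_reason = "confident"
    for iteration in range(budget.max_iterations):
        action = plan(question)
        evidence, tokens = execute(action)
        tokens_used += tokens
        if tokens_used >= budget.token_budget:
            stop_reason = "BUDGET_EXHAUSTED"
            break
        if confident(evidence):
            break
    else:
        # Iteration cap hit without reaching confidence.
        stop_reason = "BUDGET_EXHAUSTED"
    return {"stop_reason": stop_reason,
            "iterations": iteration + 1,
            "tokens_used": tokens_used}
```

In the real agent, plan and confident are Claude calls and execute wraps the M1.1 tools; stopping on BUDGET_EXHAUSTED is what populates the corresponding gap category.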

M1.4 — MCP Server

  • researchers/web/server.py:
    • MCP server entry point using mcp SDK
    • Exposes single tool: research
    • Delegates to WebResearcher
    • Server-level budget enforcement (kill agent if it exceeds constraints)
  • Test: start server, call tool via MCP client, verify response schema
  • Deliverable: python -m researchers.web.server starts an MCP server with one tool
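The server-level kill switch is separable from the MCP plumbing: run the researcher coroutine under a hard wall-clock cap and cancel it if it overruns, independent of the agent's own loop-level checks. This is a sketch under assumed names, not the mcp SDK API.

```python
import asyncio

async def run_with_kill(coro_fn, timeout_s: float) -> dict:
    """Cancel the agent outright if it exceeds its wall-clock budget,
    returning a budget-exhausted error instead of hanging the server."""
    try:
        return await asyncio.wait_for(coro_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": "BUDGET_EXHAUSTED", "partial": None}

# Illustrative stand-ins for a well-behaved and a runaway researcher.
async def fast_agent():
    return {"answer": "ok"}

async def slow_agent():
    await asyncio.sleep(10)
    return {"answer": "never reached"}
```

asyncio.wait_for cancels the wrapped task on timeout, so a runaway agent can't keep spending tokens after the server gives up on it.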

Phase 2: CLI Shim

Goal: Human-usable interface for testing the researcher.

M2.1 — ask Command

  • cli/main.py — Click CLI
    • marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]
    • Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
    • Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
    • Saves trace_id for replay
  • Deliverable: End-to-end question -> formatted answer in terminal

M2.2 — replay Command

  • cli/main.py:
    • marchwarden replay <trace_id>
    • Reads JSONL trace file, pretty-prints each step
    • Shows: action taken, decision made, content hashes
  • Deliverable: marchwarden replay <id> shows full audit trail
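The core of replay is just parsing the JSONL and rendering each step; a minimal sketch, assuming the field names TraceLogger writes (step, action, decision, content_hash):

```python
import json

def replay(jsonl_text: str) -> list:
    """Render each trace step as a one-line summary, roughly the way
    `marchwarden replay <trace_id>` might print it."""
    lines = []
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue
        step = json.loads(raw)
        summary = f"[{step['step']}] {step['action']} -> {step['decision']}"
        if "content_hash" in step:
            # Short hash prefix is enough to cross-reference fetched content.
            summary += f" (sha256:{step['content_hash'][:12]})"
        lines.append(summary)
    return lines
```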

M2.3 — First Smoke Test

  • Run the boring test: "What are ideal crops for a garden in Utah?"
  • Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
  • Deliverable: First successful end-to-end run, documented in issue #1

Phase 3: Stress Testing & Calibration

Goal: Exercise every contract feature, collect calibration data.

M3.1 — Single-Axis Stress Tests

Run each, verify the specific contract feature it targets:

  1. Recency: "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
  2. Contradiction: "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
  3. Scope: "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
  4. Budget: "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED
  • Deliverable: 4 trace files, documented results, contract gaps identified

M3.2 — Multi-Axis Stress Test

  • Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
  • Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
  • Deliverable: Complex trace file, full contract exercised

M3.3 — Confidence Calibration (V1.1)

  • Analyze confidence_factors across all test runs (20-30 queries)
  • Identify patterns: when does the LLM over/under-estimate confidence?
  • Draft a calibration rubric (what a 0.9 actually looks like empirically)
  • Update ResearchContract.md with calibrated guidance
  • Deliverable: Data-driven confidence rubric

Phase 4: Hardening

Goal: Production-quality V1.

M4.1 — Error Handling

  • Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
  • URL fetch failures -> individual citation skipped, noted in trace
  • Claude API timeout -> meaningful error, partial results if possible
  • Budget overflow protection at MCP server level
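The degradation pattern for the first bullet can be sketched as converting a provider failure into a contract gap rather than an unhandled exception; the exception class and gap shape here are illustrative, not the real taxonomy.

```python
# Stand-in for the rate-limit error raised by the Tavily client.
class TavilyRateLimited(Exception):
    pass

def search_with_degradation(search_fn, query: str):
    """Return (results, gaps): a provider outage becomes an ACCESS_DENIED
    gap and an empty result set, so the research loop can keep going."""
    try:
        return search_fn(query), []
    except TavilyRateLimited as exc:
        gap = {"category": "ACCESS_DENIED",
               "description": f"Tavily unavailable: {exc}"}
        return [], [gap]
```

The same shape applies to the other bullets: each failure mode maps to either a gap, a skipped citation noted in the trace, or a partial result.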

M4.2 — Test Suite

  • Unit tests: models, tools, trace logger
  • Integration tests: full research loop with mocked Tavily
  • Contract compliance tests: verify every ResearchResult field is populated correctly
  • Deliverable: pytest tests/ all green, reasonable coverage

M4.3 — Documentation Polish

  • Update DevelopmentGuide with Tavily setup instructions
  • Add troubleshooting section
  • Update README quick start
  • Deliverable: A new developer can clone, set up, and run in 15 minutes

Phase 5: Second Researcher (V2 begins)

Goal: Prove the contract works across researcher types.

M5.1 — arxiv-rag Researcher

Tracking issue: #37 · Design: ArxivRagProposal

  • researchers/arxiv/ — RAG-based reader of a user-curated arXiv reading list
  • Same ResearchResult contract, different evidence path (chromadb vector store, not Tavily)
  • Citations point to arxiv abs URLs; raw_excerpt is the chunk text
  • Sub-milestones (A.1–A.6 in the tracking issue): ingest pipeline, retrieval primitive, agent loop, MCP server, CLI integration, cost-ledger fields
  • Deliverable: Two working researchers, same contract, different sources

M5.2 — Contract Validation

  • Run the same question through both researchers (web + arxiv-rag)
  • Compare: do the contracts compose cleanly? Can the PI synthesize across them?
  • Identify any contract changes needed (backward-compatible additions only)
  • Deliverable: Validated multi-researcher contract

Future ideas (post-V2)

  • File/document researcher — grep+read over a local file corpus. This was the original M5.1 placeholder; it was demoted because no concrete user corpus drove its design. Re-prioritize when one shows up.
  • Live arXiv search + cache (option C in the proposal) — extend arxiv-rag from a curated reading list to a growing semantic cache

Phase 6: PI Orchestrator (V2)

Goal: An agent that coordinates multiple researchers.

M6.1 — PI Agent Core

  • orchestrator/pi.py
  • Dispatches researchers in parallel (asyncio.gather)
  • Processes discovery_events -> dispatches follow-up researchers
  • Compares raw_excerpts across researchers for contradiction detection
  • Uses gap categories to decide: retry, re-dispatch, accept, escalate
  • Synthesizes into final answer with full provenance
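The parallel-dispatch step can be sketched with asyncio.gather; the researcher callables and plain-dict results below are illustrative stand-ins for MCP tool calls returning ResearchResult.

```python
import asyncio

async def dispatch_all(question: str, researchers: dict) -> dict:
    """PI fan-out: run every researcher concurrently and collect
    results keyed by researcher name."""
    names = list(researchers)
    results = await asyncio.gather(*(researchers[n](question) for n in names))
    return dict(zip(names, results))

# Stand-ins for the web and arxiv-rag researchers.
async def web(q):
    return {"answer": f"web evidence for {q}", "confidence": 0.8}

async def arxiv(q):
    return {"answer": f"arxiv evidence for {q}", "confidence": 0.6}
```

Because every researcher returns the same contract, the synthesis step that follows can compare raw_excerpts and gap categories uniformly across sources.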

M6.2 — PI CLI or Web UI

  • Replace the CLI shim with PI-driven interface
  • User asks a question -> PI decides which researchers to dispatch
  • Shows intermediate progress (which researchers running, what they found)

Phase 7: Advanced (V2+)

Goal: Address known V1 limitations.

  • Citation Validator — programmatic URL/DOI ping before PI accepts
  • Content Addressable Storage — store full fetched content, enable true replay
  • Streaming/Polling — research_status(job_id) for long-running queries
  • Inter-Researcher Cross-Talk — lateral dispatch without PI mediation
  • Utility Curve — self-terminate when information gain diminishes
  • Vector-Indexed Trace Store — cross-research learning

Build Order Summary

Phase 0: Foundation         <- Tavily key, deps, models
Phase 1: Web Researcher     <- tools, trace, agent loop, MCP server
Phase 2: CLI Shim           <- ask, replay, first smoke test
Phase 3: Stress Testing     <- single-axis, multi-axis, calibration
Phase 4: Hardening          <- errors, tests, docs
Phase 5: Second Researcher  <- prove contract portability
Phase 6: PI Orchestrator    <- the real goal
Phase 7: Advanced           <- known limitations resolved

Each milestone has a clear deliverable and a moment of completion.

Phases   Version   Ship Target
0-2      V1        Issue #1 — single web researcher + CLI
3        V1.1      Stress testing + confidence calibration
4        V1.2      Hardened, tested, documented
5-6      V2        Multi-researcher + PI orchestrator
7        V2+       Known limitations resolved

See also: Architecture, ResearchContract, DevelopmentGuide