Development Roadmap
This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).
Search provider: Tavily (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.
Phase 0: Foundation
Goal: Everything needed before writing agent code.
M0.1 — Tavily API Key
- Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
- Store key in `.env` as `TAVILY_API_KEY`
- Verify with a quick Python test: `TavilyClient(api_key=...).search("test")` returns results
- Deliverable: Working Tavily access
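For the verification step, the key can be loaded from `.env` without extra dependencies; a minimal stdlib sketch (the `load_env` helper is hypothetical, a stand-in for python-dotenv):

```python
# Minimal .env loader sketch (stdlib only, a stand-in for python-dotenv):
# reads simple KEY=VALUE lines so TAVILY_API_KEY is available before
# constructing TavilyClient. No quoting or variable expansion is handled.
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Copy simple KEY=VALUE lines from a file into os.environ."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env()
api_key = os.environ.get("TAVILY_API_KEY")  # None if not configured yet
```

With the key loaded, `TavilyClient(api_key=api_key).search("test")` is the actual verification call.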
M0.2 — Verify Dependencies
- Confirm pyproject.toml dependencies are correct (`tavily-python`, `httpx`, `pydantic>=2.0`, `anthropic`, `mcp`, `click`)
- `pip install -e ".[dev]"` succeeds
- Deliverable: Clean install
M0.3 — Contract Models (Pydantic)
- `researchers/web/models.py` — all contract types as Pydantic models:
  - `ResearchConstraints`
  - `Citation` (with `raw_excerpt`)
  - `GapCategory` enum
  - `Gap` (with `category`)
  - `DiscoveryEvent`
  - `ConfidenceFactors`
  - `CostMetadata` (with `model_id`)
  - `ResearchResult`
- Unit tests for serialization/deserialization
- Deliverable: Models importable, tests green
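As a rough illustration of the contract shapes (stdlib dataclasses stand in for Pydantic here, and any field not named in the milestone list is an assumption):

```python
# Contract-shape sketch using stdlib dataclasses for illustration only;
# the real models.py uses Pydantic. Fields beyond those named in the
# milestone are assumptions.
from dataclasses import dataclass, field, asdict
from enum import Enum

class GapCategory(Enum):
    SOURCE_NOT_FOUND = "SOURCE_NOT_FOUND"
    CONTRADICTORY_SOURCES = "CONTRADICTORY_SOURCES"
    SCOPE_EXCEEDED = "SCOPE_EXCEEDED"
    BUDGET_EXHAUSTED = "BUDGET_EXHAUSTED"
    ACCESS_DENIED = "ACCESS_DENIED"

@dataclass
class Citation:
    url: str
    raw_excerpt: str  # verbatim source text backing the claim

@dataclass
class Gap:
    category: GapCategory
    description: str

@dataclass
class ResearchResult:
    answer: str
    citations: list[Citation] = field(default_factory=list)
    gaps: list[Gap] = field(default_factory=list)

result = ResearchResult(
    answer="Cold-hardy vegetables do well.",
    citations=[Citation("https://example.com", "Utah gardens favor cold-hardy crops.")],
)
payload = asdict(result)  # round-trippable dict, analogous to Pydantic's model_dump()
```

The serialization round-trip (`asdict` here, `model_dump`/`model_validate` in Pydantic) is what the M0.3 unit tests should pin down.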
Phase 1: Web Researcher Core
Goal: A working research agent that can answer questions via web search.
M1.1 — Search & Fetch Tools
- `researchers/web/tools.py`:
  - `tavily_search(query, max_results)` — calls Tavily API, returns structured results with content
  - `fetch_url(url)` — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
- Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
- Deliverable: Two tested tool functions
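The SHA-256 step of `fetch_url` can be isolated as a small helper; a sketch with a canned string standing in for fetched page text (the real tool gets the text via httpx):

```python
# Content-hash sketch for fetch_url: hash the cleaned page text with SHA-256
# so a trace entry can later prove exactly what was fetched.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

page_text = "Utah's semi-arid climate favors drought-tolerant crops."
digest = content_hash(page_text)  # 64 hex chars, stable across runs
```

Storing the digest in both the citation and the trace entry is what makes the later `replay` audit trail verifiable.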
M1.2 — Trace Logger
- `researchers/web/trace.py`:
  - `TraceLogger` class — opens JSONL file keyed by trace_id
  - `log_step(step, action, result, decision)` — appends one JSON line
  - Includes `content_hash` for fetch actions
  - Configurable trace directory (default `~/.marchwarden/traces/`)
- Unit tests for file creation, JSON structure, content hashing
- Deliverable: Trace logging works, JSONL files verifiable with `jq`
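A minimal sketch of the assumed `TraceLogger` surface (constructor arguments and entry fields beyond those in the milestone are guesses):

```python
# TraceLogger sketch: one JSON object per line, file keyed by trace_id,
# trace directory configurable. The timestamp field is an assumption.
import json
import time
from pathlib import Path

class TraceLogger:
    def __init__(self, trace_id: str, trace_dir: str = "~/.marchwarden/traces"):
        self.path = Path(trace_dir).expanduser() / f"{trace_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log_step(self, step: int, action: str, result: dict, decision: str) -> None:
        entry = {"ts": time.time(), "step": step, "action": action,
                 "result": result, "decision": decision}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")
```

Because each line is a standalone JSON object, `jq . < trace.jsonl` validates the whole file.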
M1.3 — Inner Agent Loop
- `researchers/web/agent.py`:
  - `WebResearcher` class
  - `async research(question, context, depth, constraints) -> ResearchResult`
  - Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
- Uses Claude API (via `anthropic` SDK) as the reasoning engine
- Enforces `max_iterations` and `token_budget` at the loop level
- Populates all contract fields: `raw_excerpt`, categorized `gaps`, `discovery_events`, `confidence_factors`, `cost_metadata` (with `model_id`)
- Logs every step to TraceLogger
- Integration test: call with a real question, verify all contract fields populated
- Deliverable: `WebResearcher.research("What are ideal crops for Utah?")` returns a valid `ResearchResult`
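The iterate-or-stop control flow can be sketched with the reasoning step stubbed out (the real step calls Claude via the `anthropic` SDK; the 0.8 confidence threshold is illustrative):

```python
# Inner-loop skeleton: iterate until confident, out of iterations, or out of
# token budget. `step` stands in for one plan/search/fetch/parse round.
def research_loop(question, step, max_iterations=5, token_budget=5000):
    tokens_used = 0
    findings = []
    for _ in range(max_iterations):
        confidence, tokens, finding = step(question, findings)
        tokens_used += tokens
        findings.append(finding)
        if confidence >= 0.8:          # illustrative threshold
            return findings, "confident"
        if tokens_used >= token_budget:
            return findings, "BUDGET_EXHAUSTED"
    return findings, "MAX_ITERATIONS"

# Stub: each round costs 3000 tokens and never reaches confidence,
# so the 5000-token budget is exhausted on the second round.
stub = lambda q, f: (0.5, 3000, f"note {len(f)}")
findings, reason = research_loop("test", stub)  # -> BUDGET_EXHAUSTED
```

Checking the budget after each round (rather than before) is what lets partial findings survive into the `ResearchResult` alongside a `BUDGET_EXHAUSTED` gap.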
M1.4 — MCP Server
- `researchers/web/server.py`:
  - MCP server entry point using `mcp` SDK
  - Exposes single tool: `research`
  - Delegates to `WebResearcher`
  - Server-level budget enforcement (kill agent if it exceeds constraints)
- Test: start server, call tool via MCP client, verify response schema
- Deliverable: `python -m researchers.web.server` starts an MCP server with one tool
Phase 2: CLI Shim
Goal: Human-usable interface for testing the researcher.
M2.1 — ask Command
- `cli/main.py` — Click CLI
- `marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]`
- Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
- Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
- Saves trace_id for replay
- Deliverable: End-to-end question -> formatted answer in terminal
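The flag surface above can be sketched with stdlib argparse (the real CLI uses Click; the defaults shown are assumptions):

```python
# Flag-surface sketch for `marchwarden ask`, using stdlib argparse as a
# stand-in for Click. Default values are assumptions.
import argparse

parser = argparse.ArgumentParser(prog="marchwarden-ask")
parser.add_argument("question")
parser.add_argument("--depth", choices=["shallow", "balanced", "deep"],
                    default="balanced")
parser.add_argument("--budget", type=int, default=5000)
parser.add_argument("--max-iterations", type=int, default=5)

args = parser.parse_args(["What are ideal crops for Utah?", "--depth", "deep"])
# args.question / args.depth / args.budget / args.max_iterations map directly
# onto the research(question, ..., depth, constraints) call.
```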
M2.2 — replay Command
- `cli/main.py`: `marchwarden replay <trace_id>`
- Reads JSONL trace file, pretty-prints each step
- Shows: action taken, decision made, content hashes
- Deliverable: `marchwarden replay <id>` shows full audit trail
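Replay is essentially reading the JSONL back; a sketch, assuming entry fields like `step`, `action`, `decision`, and `content_hash`:

```python
# Replay sketch: read a JSONL trace and print one line per step with the
# action taken, the decision made, and any content hash. Field names match
# whatever log_step wrote and are assumptions here.
import json
from pathlib import Path

def replay(trace_path: str) -> list[dict]:
    steps = []
    for line in Path(trace_path).read_text().splitlines():
        entry = json.loads(line)
        steps.append(entry)
        suffix = (f" hash={entry['content_hash'][:12]}"
                  if "content_hash" in entry else "")
        print(f"step {entry['step']}: {entry['action']} -> {entry['decision']}{suffix}")
    return steps
```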
M2.3 — First Smoke Test
- Run the boring test: "What are ideal crops for a garden in Utah?"
- Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
- Deliverable: First successful end-to-end run, documented in issue #1
Phase 3: Stress Testing & Calibration
Goal: Exercise every contract feature, collect calibration data.
M3.1 — Single-Axis Stress Tests
Run each, verify the specific contract feature it targets:
- Recency: "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
- Contradiction: "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
- Scope: "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
- Budget: "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED
- Deliverable: 4 trace files, documented results, contract gaps identified
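The four cases lend themselves to a table-driven harness; a sketch, assuming results expose a `gaps` list of category-tagged dicts:

```python
# Table-driven stress-test sketch: each query from the list above is paired
# with the gap category it should surface. The result shape (a "gaps" list
# of dicts) is an assumption.
STRESS_CASES = [
    ("What AI models were released in Q1 2026?", "SOURCE_NOT_FOUND"),
    ("Is coffee good or bad for you?", "CONTRADICTORY_SOURCES"),
    ("Compare CRISPR delivery mechanisms in recent clinical trials", "SCOPE_EXCEEDED"),
    ("Comprehensive history of AI 1950-2026", "BUDGET_EXHAUSTED"),
]

def has_gap(result: dict, expected: str) -> bool:
    return any(gap["category"] == expected for gap in result.get("gaps", []))

# Canned result standing in for a real research run:
fake = {"gaps": [{"category": "BUDGET_EXHAUSTED", "description": "hit token budget"}]}
```

Each real run then reduces to `assert has_gap(result, expected)` plus a manual look at the trace file.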
M3.2 — Multi-Axis Stress Test
- Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
- Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
- Deliverable: Complex trace file, full contract exercised
M3.3 — Confidence Calibration (V1.1)
- Analyze confidence_factors across all test runs (20-30 queries)
- Identify patterns: when does the LLM over/under-estimate confidence?
- Draft a calibration rubric (what a 0.9 actually looks like empirically)
- Update ResearchContract.md with calibrated guidance
- Deliverable: Data-driven confidence rubric
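One way to start the rubric: bucket stated confidence against manually judged correctness (the runs below are made-up placeholders, not measured data):

```python
# Calibration sketch: bucket (stated_confidence, was_correct) pairs and
# compare mean stated confidence to observed accuracy per bucket.
# The sample runs are illustrative placeholders, not real measurements.
def calibration_buckets(runs, edges=(0.0, 0.5, 0.7, 0.9, 1.01)):
    table = {}
    for lo, hi in zip(edges, edges[1:]):
        bucket = [(c, ok) for c, ok in runs if lo <= c < hi]
        if bucket:
            stated = sum(c for c, _ in bucket) / len(bucket)
            observed = sum(ok for _, ok in bucket) / len(bucket)
            table[f"{lo:.1f}-{hi:.1f}"] = (round(stated, 2),
                                           round(observed, 2), len(bucket))
    return table

runs = [(0.95, True), (0.92, False), (0.8, True), (0.75, True), (0.6, False)]
table = calibration_buckets(runs)
# Over-confidence shows up as stated > observed within a bucket.
```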
Phase 4: Hardening
Goal: Production-quality V1.
M4.1 — Error Handling
- Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
- URL fetch failures -> individual citation skipped, noted in trace
- Claude API timeout -> meaningful error, partial results if possible
- Budget overflow protection at MCP server level
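The degradation pattern for the first bullet can be sketched as follows (the gap dict shape and the blanket exception handler are assumptions):

```python
# Graceful-degradation sketch: a provider failure becomes an ACCESS_DENIED
# gap plus an empty result set instead of a crash, so the loop can continue
# with whatever evidence it already has.
def safe_search(search_fn, query: str, gaps: list) -> list:
    try:
        return search_fn(query)
    except Exception as exc:  # e.g. Tavily down or rate-limited
        gaps.append({"category": "ACCESS_DENIED",
                     "description": f"search failed for {query!r}: {exc}"})
        return []

def flaky(query):  # stand-in for a rate-limited Tavily call
    raise RuntimeError("429 Too Many Requests")

gaps = []
results = safe_search(flaky, "utah crops", gaps)  # -> [] plus one gap recorded
```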
M4.2 — Test Suite
- Unit tests: models, tools, trace logger
- Integration tests: full research loop with mocked Tavily
- Contract compliance tests: verify every ResearchResult field is populated correctly
- Deliverable: `pytest tests/` all green, reasonable coverage
M4.3 — Documentation Polish
- Update DevelopmentGuide with Tavily setup instructions
- Add troubleshooting section
- Update README quick start
- Deliverable: A new developer can clone, set up, and run in 15 minutes
Phase 5: Second Researcher (V2 begins)
Goal: Prove the contract works across researcher types.
M5.1 — arxiv-rag Researcher
Tracking issue: #37 · Design: ArxivRagProposal
- `researchers/arxiv/` — RAG-based reader of a user-curated arXiv reading list
- Same `ResearchResult` contract, different evidence path (chromadb vector store, not Tavily)
- Citations point to arXiv abs URLs; raw_excerpt is the chunk text
- Sub-milestones (A.1–A.6 in the tracking issue): ingest pipeline, retrieval primitive, agent loop, MCP server, CLI integration, cost-ledger fields
- Deliverable: Two working researchers, same contract, different sources
M5.2 — Contract Validation
- Run the same question through both researchers (web + arxiv-rag)
- Compare: do the contracts compose cleanly? Can the PI synthesize across them?
- Identify any contract changes needed (backward-compatible additions only)
- Deliverable: Validated multi-researcher contract
Future ideas (post-V2)
- File/document researcher — grep+read over a local file corpus. Was the original M5.1 placeholder; demoted because no concrete user corpus drove its design. Re-prioritize when one shows up.
- Live arXiv search + cache (option C in the proposal) — extend arxiv-rag from a curated reading list to a growing semantic cache
Phase 6: PI Orchestrator (V2)
Goal: An agent that coordinates multiple researchers.
M6.1 — PI Agent Core
- `orchestrator/pi.py`
- Dispatches researchers in parallel (`asyncio.gather`)
- Processes discovery_events -> dispatches follow-up researchers
- Compares raw_excerpts across researchers for contradiction detection
- Uses gap categories to decide: retry, re-dispatch, accept, escalate
- Synthesizes into final answer with full provenance
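The parallel-dispatch step can be sketched with stubbed researchers (the researcher functions and result dicts are illustrative; real dispatch goes through the MCP tool calls):

```python
# Parallel-dispatch sketch for the PI: run researchers concurrently with
# asyncio.gather and collect their results for synthesis. Both researchers
# here are stubs standing in for MCP tool calls.
import asyncio

async def web_researcher(question: str) -> dict:
    return {"source": "web", "answer": f"web findings for {question!r}"}

async def arxiv_researcher(question: str) -> dict:
    return {"source": "arxiv", "answer": f"arxiv findings for {question!r}"}

async def dispatch(question: str, researchers) -> list:
    return await asyncio.gather(*(r(question) for r in researchers))

results = asyncio.run(dispatch("test", [web_researcher, arxiv_researcher]))
```

`gather` preserves input order, so the PI can attribute each result to the researcher that produced it before synthesizing.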
M6.2 — PI CLI or Web UI
- Replace the CLI shim with PI-driven interface
- User asks a question -> PI decides which researchers to dispatch
- Shows intermediate progress (which researchers running, what they found)
Phase 7: Advanced (V2+)
Goal: Address known V1 limitations.
- Citation Validator — programmatic URL/DOI ping before PI accepts
- Content Addressable Storage — store full fetched content, enable true replay
- Streaming/Polling — `research_status(job_id)` for long-running queries
- Inter-Researcher Cross-Talk — lateral dispatch without PI mediation
- Utility Curve — self-terminate when information gain diminishes
- Vector-Indexed Trace Store — cross-research learning
Build Order Summary
```
Phase 0: Foundation        <- Tavily key, deps, models
Phase 1: Web Researcher    <- tools, trace, agent loop, MCP server
Phase 2: CLI Shim          <- ask, replay, first smoke test
Phase 3: Stress Testing    <- single-axis, multi-axis, calibration
Phase 4: Hardening         <- errors, tests, docs
Phase 5: Second Researcher <- prove contract portability
Phase 6: PI Orchestrator   <- the real goal
Phase 7: Advanced          <- known limitations resolved
```
Each milestone has a clear deliverable and a moment of completion.
| Phases | Version | Ship Target |
|---|---|---|
| 0-2 | V1 | Issue #1 — single web researcher + CLI |
| 3 | V1.1 | Stress testing + confidence calibration |
| 4 | V1.2 | Hardened, tested, documented |
| 5-6 | V2 | Multi-researcher + PI orchestrator |
| 7 | V2+ | Known limitations resolved |
See also: Architecture, ResearchContract, DevelopmentGuide