2 Roadmap
Jeff Smith edited this page 2026-04-08 17:16:51 -06:00

Development Roadmap

This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).

Search provider: Tavily (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.


Phase 0: Foundation

Goal: Everything needed before writing agent code.

M0.1 — Tavily API Key

  • Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
  • Store key in .env as TAVILY_API_KEY
  • Verify: quick Python test TavilyClient(api_key=...).search("test") returns results
  • Deliverable: Working Tavily access

M0.2 — Verify Dependencies

  • Confirm pyproject.toml dependencies are correct (tavily-python, httpx, pydantic>=2.0, anthropic, mcp, click)
  • pip install -e ".[dev]" succeeds
  • Deliverable: Clean install

M0.3 — Contract Models (Pydantic)

  • researchers/web/models.py — all contract types as Pydantic models:
    • ResearchConstraints
    • Citation (with raw_excerpt)
    • GapCategory enum
    • Gap (with category)
    • DiscoveryEvent
    • ConfidenceFactors
    • CostMetadata (with model_id)
    • ResearchResult
  • Unit tests for serialization/deserialization
  • Deliverable: Models importable, tests green
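To make the contract shapes concrete, here is a minimal sketch of a few of the types above. It uses stdlib dataclasses as a stand-in (the real models in researchers/web/models.py are Pydantic v2 classes with validation), and the enum values and field names beyond those listed above are assumptions, not the actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum

# Illustrative subset of the contract types; values are assumed, not the real schema.
class GapCategory(Enum):
    SOURCE_NOT_FOUND = "source_not_found"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"
    BUDGET_EXHAUSTED = "budget_exhausted"
    ACCESS_DENIED = "access_denied"

@dataclass
class Citation:
    url: str
    title: str
    raw_excerpt: str  # verbatim source text supporting the claim

@dataclass
class Gap:
    category: GapCategory
    description: str

@dataclass
class ResearchResult:
    answer: str
    citations: list = field(default_factory=list)
    gaps: list = field(default_factory=list)
    confidence: float = 0.0

    def to_json(self) -> str:
        # Serialize enums by value so the JSON matches what Pydantic would emit.
        def encode(o):
            if isinstance(o, Enum):
                return o.value
            raise TypeError(o)
        return json.dumps(asdict(self), default=encode)

result = ResearchResult(
    answer="Warm-season crops such as tomatoes do well in Utah.",
    citations=[Citation("https://example.com", "Utah gardening", "Tomatoes thrive in hot, dry summers.")],
    gaps=[Gap(GapCategory.SOURCE_NOT_FOUND, "No 2026 extension-service data found")],
    confidence=0.7,
)
round_tripped = json.loads(result.to_json())
```

The serialization round trip here is exactly what the M0.3 unit tests need to cover for every contract type.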

Phase 1: Web Researcher Core

Goal: A working research agent that can answer questions via web search.

M1.1 — Search & Fetch Tools

  • researchers/web/tools.py:
    • tavily_search(query, max_results) — calls Tavily API, returns structured results with content
    • fetch_url(url) — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
  • Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
  • Deliverable: Two tested tool functions
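The content-hash half of fetch_url can be sketched directly; this uses urllib as a stand-in for httpx so the sketch is self-contained, and the returned dict shape is an assumption, not the real tool signature.

```python
import hashlib
from urllib.request import urlopen  # stand-in; the real tool uses httpx

def content_hash(text: str) -> str:
    """SHA-256 over the fetched page text, used to pin content in traces."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fetch_url(url: str, timeout: float = 10.0) -> dict:
    """Fetch a page Tavily didn't fully extract; return text plus its hash.

    Sketch only: the real version uses httpx and an HTML-to-text cleaner.
    """
    with urlopen(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return {"url": url, "text": text, "content_hash": content_hash(text)}
```

Hashing the cleaned text (not the raw bytes) means the trace pins exactly what the agent reasoned over.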

M1.2 — Trace Logger

  • researchers/web/trace.py:
    • TraceLogger class — opens JSONL file keyed by trace_id
    • log_step(step, action, result, decision) — appends one JSON line
    • Includes content_hash for fetch actions
    • Configurable trace directory (~/.marchwarden/traces/ default)
  • Unit tests for file creation, JSON structure, content hashing
  • Deliverable: Trace logging works, JSONL files verifiable with jq

M1.3 — Inner Agent Loop

  • researchers/web/agent.py:
    • WebResearcher class
    • async research(question, context, depth, constraints) -> ResearchResult
    • Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
    • Uses Claude API (via anthropic SDK) as the reasoning engine
    • Enforces max_iterations and token_budget at the loop level
    • Populates all contract fields: raw_excerpt, categorized gaps, discovery_events, confidence_factors, cost_metadata (with model_id)
    • Logs every step to TraceLogger
  • Integration test: call with a real question, verify all contract fields populated
  • Deliverable: WebResearcher.research("What are ideal crops for Utah?") returns a valid ResearchResult
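The control flow of the inner loop can be sketched with the planning, tool, and confidence steps injected as callables, so the budget enforcement is testable without Claude or Tavily. All names here are illustrative, not the real WebResearcher API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LoopBudget:
    max_iterations: int = 5
    token_budget: int = 50_000

def research_loop(question: str,
                  plan: Callable,       # question -> next action
                  execute: Callable,    # action -> (evidence, tokens_spent)
                  confident: Callable,  # evidence -> bool
                  budget: LoopBudget) -> dict:
    """Skeleton of plan -> search/fetch -> check confidence -> iterate or stop,
    with max_iterations and token_budget enforced at the loop level."""
    tokens_used = 0
    iteration = -1
    stop_reason = "confident"
    for iteration in range(budget.max_iterations):
        action = plan(question)
        evidence, tokens = execute(action)
        tokens_used += tokens
        if tokens_used >= budget.token_budget:
            stop_reason = "BUDGET_EXHAUSTED"
            break
        if confident(evidence):
            break
    else:
        # Iteration cap hit without reaching confidence.
        stop_reason = "BUDGET_EXHAUSTED"
    return {"stop_reason": stop_reason,
            "iterations": iteration + 1,
            "tokens_used": tokens_used}
```

In the real agent, plan and confident are Claude calls and execute wraps the M1.1 tools; stopping on BUDGET_EXHAUSTED is what populates the corresponding gap category.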

M1.4 — MCP Server

  • researchers/web/server.py:
    • MCP server entry point using mcp SDK
    • Exposes single tool: research
    • Delegates to WebResearcher
    • Server-level budget enforcement (kill agent if it exceeds constraints)
  • Test: start server, call tool via MCP client, verify response schema
  • Deliverable: python -m researchers.web.server starts an MCP server with one tool
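The server-level kill switch is separable from the MCP plumbing: run the researcher coroutine under a hard wall-clock cap and cancel it if it overruns, independent of the agent's own loop-level checks. This is a sketch under assumed names, not the mcp SDK API.

```python
import asyncio

async def run_with_kill(coro_fn, timeout_s: float) -> dict:
    """Cancel the agent outright if it exceeds its wall-clock budget,
    returning a budget-exhausted error instead of hanging the server."""
    try:
        return await asyncio.wait_for(coro_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": "BUDGET_EXHAUSTED", "partial": None}

# Illustrative stand-ins for a well-behaved and a runaway researcher.
async def fast_agent():
    return {"answer": "ok"}

async def slow_agent():
    await asyncio.sleep(10)
    return {"answer": "never reached"}
```

asyncio.wait_for cancels the wrapped task on timeout, so a runaway agent can't keep spending tokens after the server gives up on it.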

Phase 2: CLI Shim

Goal: Human-usable interface for testing the researcher.

M2.1 — ask Command

  • cli/main.py — Click CLI
    • marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]
    • Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
    • Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
    • Saves trace_id for replay
  • Deliverable: End-to-end question -> formatted answer in terminal

M2.2 — replay Command

  • cli/main.py:
    • marchwarden replay <trace_id>
    • Reads JSONL trace file, pretty-prints each step
    • Shows: action taken, decision made, content hashes
  • Deliverable: marchwarden replay <id> shows full audit trail
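The core of replay is just parsing the JSONL and rendering each step; a minimal sketch, assuming the field names TraceLogger writes (step, action, decision, content_hash):

```python
import json

def replay(jsonl_text: str) -> list:
    """Render each trace step as a one-line summary, roughly the way
    `marchwarden replay <trace_id>` might print it."""
    lines = []
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue
        step = json.loads(raw)
        summary = f"[{step['step']}] {step['action']} -> {step['decision']}"
        if "content_hash" in step:
            # Short hash prefix is enough to cross-reference fetched content.
            summary += f" (sha256:{step['content_hash'][:12]})"
        lines.append(summary)
    return lines
```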

M2.3 — First Smoke Test

  • Run the boring test: "What are ideal crops for a garden in Utah?"
  • Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
  • Deliverable: First successful end-to-end run, documented in issue #1

Phase 3: Stress Testing & Calibration

Goal: Exercise every contract feature, collect calibration data.

M3.1 — Single-Axis Stress Tests

Run each, verify the specific contract feature it targets:

  1. Recency: "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
  2. Contradiction: "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
  3. Scope: "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
  4. Budget: "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED
  • Deliverable: 4 trace files, documented results, contract gaps identified

M3.2 — Multi-Axis Stress Test

  • Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
  • Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
  • Deliverable: Complex trace file, full contract exercised

M3.3 — Confidence Calibration (V1.1)

  • Analyze confidence_factors across all test runs (20-30 queries)
  • Identify patterns: when does the LLM over/under-estimate confidence?
  • Draft a calibration rubric (what a 0.9 actually looks like empirically)
  • Update ResearchContract.md with calibrated guidance
  • Deliverable: Data-driven confidence rubric

Phase 4: Hardening

Goal: Production-quality V1.

M4.1 — Error Handling

  • Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
  • URL fetch failures -> individual citation skipped, noted in trace
  • Claude API timeout -> meaningful error, partial results if possible
  • Budget overflow protection at MCP server level
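The degradation pattern for the first bullet can be sketched as converting a provider failure into a contract gap rather than an unhandled exception; the exception class and gap shape here are illustrative, not the real taxonomy.

```python
# Stand-in for the rate-limit error raised by the Tavily client.
class TavilyRateLimited(Exception):
    pass

def search_with_degradation(search_fn, query: str):
    """Return (results, gaps): a provider outage becomes an ACCESS_DENIED
    gap and an empty result set, so the research loop can keep going."""
    try:
        return search_fn(query), []
    except TavilyRateLimited as exc:
        gap = {"category": "ACCESS_DENIED",
               "description": f"Tavily unavailable: {exc}"}
        return [], [gap]
```

The same shape applies to the other bullets: each failure mode maps to either a gap, a skipped citation noted in the trace, or a partial result.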

M4.2 — Test Suite

  • Unit tests: models, tools, trace logger
  • Integration tests: full research loop with mocked Tavily
  • Contract compliance tests: verify every ResearchResult field is populated correctly
  • Deliverable: pytest tests/ all green, reasonable coverage

M4.3 — Documentation Polish

  • Update DevelopmentGuide with Tavily setup instructions
  • Add troubleshooting section
  • Update README quick start
  • Deliverable: A new developer can clone, set up, and run in 15 minutes

Phase 5: Second Researcher (V2 begins)

Goal: Prove the contract works across researcher types.

M5.1 — arxiv-rag Researcher

Tracking issue: #37 · Design: ArxivRagProposal

  • researchers/arxiv/ — RAG-based reader of a user-curated arXiv reading list
  • Same ResearchResult contract, different evidence path (chromadb vector store, not Tavily)
  • Citations point to arxiv abs URLs; raw_excerpt is the chunk text
  • Sub-milestones (A.1–A.6 in the tracking issue): ingest pipeline, retrieval primitive, agent loop, MCP server, CLI integration, cost-ledger fields
  • Deliverable: Two working researchers, same contract, different sources

M5.2 — Contract Validation

  • Run the same question through both researchers (web + arxiv-rag)
  • Compare: do the contracts compose cleanly? Can the PI synthesize across them?
  • Identify any contract changes needed (backward-compatible additions only)
  • Deliverable: Validated multi-researcher contract

Future ideas (post-V2)

  • File/document researcher — grep+read over a local file corpus. This was the original M5.1 placeholder; it was demoted because no concrete user corpus drove its design. Re-prioritize when one shows up.
  • Live arXiv search + cache (option C in the proposal) — extend arxiv-rag from a curated reading list to a growing semantic cache

Phase 6: PI Orchestrator (V2)

Goal: An agent that coordinates multiple researchers.

M6.1 — PI Agent Core

  • orchestrator/pi.py
  • Dispatches researchers in parallel (asyncio.gather)
  • Processes discovery_events -> dispatches follow-up researchers
  • Compares raw_excerpts across researchers for contradiction detection
  • Uses gap categories to decide: retry, re-dispatch, accept, escalate
  • Synthesizes into final answer with full provenance
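The parallel-dispatch step can be sketched with asyncio.gather; the researcher callables and plain-dict results below are illustrative stand-ins for MCP tool calls returning ResearchResult.

```python
import asyncio

async def dispatch_all(question: str, researchers: dict) -> dict:
    """PI fan-out: run every researcher concurrently and collect
    results keyed by researcher name."""
    names = list(researchers)
    results = await asyncio.gather(*(researchers[n](question) for n in names))
    return dict(zip(names, results))

# Stand-ins for the web and arxiv-rag researchers.
async def web(q):
    return {"answer": f"web evidence for {q}", "confidence": 0.8}

async def arxiv(q):
    return {"answer": f"arxiv evidence for {q}", "confidence": 0.6}
```

Because every researcher returns the same contract, the synthesis step that follows can compare raw_excerpts and gap categories uniformly across sources.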

M6.2 — PI CLI or Web UI

  • Replace the CLI shim with PI-driven interface
  • User asks a question -> PI decides which researchers to dispatch
  • Shows intermediate progress (which researchers running, what they found)

Phase 7: Advanced (V2+)

Goal: Address known V1 limitations.

  • Citation Validator — programmatic URL/DOI ping before PI accepts
  • Content Addressable Storage — store full fetched content, enable true replay
  • Streaming/Polling — research_status(job_id) for long-running queries
  • Inter-Researcher Cross-Talk — lateral dispatch without PI mediation
  • Utility Curve — self-terminate when information gain diminishes
  • Vector-Indexed Trace Store — cross-research learning

Build Order Summary

Phase 0: Foundation         <- Tavily key, deps, models
Phase 1: Web Researcher     <- tools, trace, agent loop, MCP server
Phase 2: CLI Shim           <- ask, replay, first smoke test
Phase 3: Stress Testing     <- single-axis, multi-axis, calibration
Phase 4: Hardening          <- errors, tests, docs
Phase 5: Second Researcher  <- prove contract portability
Phase 6: PI Orchestrator    <- the real goal
Phase 7: Advanced           <- known limitations resolved

Each milestone has a clear deliverable and a moment of completion.

Phases   Version   Ship Target
0-2      V1        Issue #1 — single web researcher + CLI
3        V1.1      Stress testing + confidence calibration
4        V1.2      Hardened, tested, documented
5-6      V2        Multi-researcher + PI orchestrator
7        V2+       Known limitations resolved

See also: Architecture, ResearchContract, DevelopmentGuide