Add development roadmap to wiki
Phases 0-7 from foundation through advanced features. V1 ship target: Phases 0-2 (web researcher + CLI). Each milestone has a concrete deliverable.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Parent: e861c392e4
Commit: 4f945087d3
1 changed file with 226 additions and 0 deletions
Roadmap.md
# Development Roadmap

This roadmap covers Marchwarden from V1 (single web researcher + CLI) through V2+ (multi-researcher network with PI orchestrator).

Search provider: **Tavily** (agent-native search API, free tier for dev, content extraction built in). SearXNG remains a future option for self-hosting.

---

## Phase 0: Foundation

**Goal:** Everything needed before writing agent code.

### M0.1 — Tavily API Key

- Sign up at tavily.com, get API key (free tier: 1,000 searches/month)
- Store key in `.env` as `TAVILY_API_KEY`
- Verify: quick Python test `TavilyClient(api_key=...).search("test")` returns results
- **Deliverable:** Working Tavily access
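
The verification step above could look roughly like the sketch below. It assumes `TAVILY_API_KEY` has already been exported into the process environment (a `.env` file is not loaded automatically) and that the response is a dict with a `results` list. The helper names `load_tavily_key` and `smoke_test` are hypothetical.

```python
import os


def load_tavily_key(env=None):
    """Return the Tavily key from the environment, failing loudly if it is missing."""
    env = os.environ if env is None else env
    key = env.get("TAVILY_API_KEY", "").strip()
    if not key:
        raise RuntimeError("TAVILY_API_KEY is not set; add it to .env and export it")
    return key


def smoke_test(query="test"):
    """Run one real search and return how many results came back."""
    # Imported lazily so this module loads even before tavily-python is installed.
    from tavily import TavilyClient

    client = TavilyClient(api_key=load_tavily_key())
    response = client.search(query)
    # Assumption: the response is a dict carrying a "results" list.
    return len(response.get("results", []))
```

Calling `smoke_test()` spends one search from the free-tier quota, so it is better suited to a manual check than to CI.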

### M0.2 — Verify Dependencies

- Confirm pyproject.toml dependencies are correct (`tavily-python`, `httpx`, `pydantic>=2.0`, `anthropic`, `mcp`, `click`)
- `pip install -e ".[dev]"` succeeds
- **Deliverable:** Clean install

### M0.3 — Contract Models (Pydantic)

- `researchers/web/models.py` — all contract types as Pydantic models:
  - `ResearchConstraints`
  - `Citation` (with `raw_excerpt`)
  - `GapCategory` enum
  - `Gap` (with `category`)
  - `DiscoveryEvent`
  - `ConfidenceFactors`
  - `CostMetadata` (with `model_id`)
  - `ResearchResult`
- Unit tests for serialization/deserialization
- **Deliverable:** Models importable, tests green
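
A minimal sketch of a few of these models and the round-trip test, assuming Pydantic v2. Only `raw_excerpt`, `category`, and `model_id` come from this roadmap; every other field name, and the lowercase enum values, are placeholders rather than the real contract. Depending on the Pydantic release, a field named `model_id` can trigger a protected-namespace warning, which the `model_config` line silences.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel, ConfigDict, Field


class GapCategory(str, Enum):
    # Names taken from the roadmap's stress tests; the string values are assumed.
    SOURCE_NOT_FOUND = "source_not_found"
    ACCESS_DENIED = "access_denied"
    BUDGET_EXHAUSTED = "budget_exhausted"
    CONTRADICTORY_SOURCES = "contradictory_sources"
    SCOPE_EXCEEDED = "scope_exceeded"


class Citation(BaseModel):
    source: str        # placeholder field name
    raw_excerpt: str


class Gap(BaseModel):
    topic: str         # placeholder field name
    category: GapCategory


class CostMetadata(BaseModel):
    # Avoids the "model_" protected-namespace warning some v2 releases emit.
    model_config = ConfigDict(protected_namespaces=())
    model_id: str
    tokens_used: int = 0   # placeholder field name


class ResearchResult(BaseModel):
    answer: str
    citations: List[Citation] = Field(default_factory=list)
    gaps: List[Gap] = Field(default_factory=list)
    cost_metadata: CostMetadata


# Serialization round trip, the core of the M0.3 unit tests.
original = ResearchResult(
    answer="example",
    citations=[Citation(source="https://example.com", raw_excerpt="quoted text")],
    gaps=[Gap(topic="2026 data", category=GapCategory.BUDGET_EXHAUSTED)],
    cost_metadata=CostMetadata(model_id="example-model", tokens_used=1200),
)
restored = ResearchResult.model_validate_json(original.model_dump_json())
```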

---

## Phase 1: Web Researcher Core

**Goal:** A working research agent that can answer questions via web search.

### M1.1 — Search & Fetch Tools

- `researchers/web/tools.py`:
  - `tavily_search(query, max_results)` — calls Tavily API, returns structured results with content
  - `fetch_url(url)` — httpx GET for pages Tavily didn't fully extract, returns clean text + SHA-256 content hash
- Unit tests with mocked HTTP responses (don't hit real Tavily in CI)
- **Deliverable:** Two tested tool functions
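
One way `fetch_url` might be put together is sketched below. The text cleaning uses the standard library's `HTMLParser` as a stand-in for whatever extraction the project ends up using, and the return shape (a dict with `url`, `text`, `content_hash`) and the helpers `clean_text` and `content_hash` are assumptions.

```python
import hashlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def content_hash(text):
    """SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fetch_url(url, timeout=10.0):
    """GET a page and return its cleaned text plus a hash of that text."""
    import httpx  # imported lazily so the pure helpers above stay testable offline

    response = httpx.get(url, timeout=timeout, follow_redirects=True)
    response.raise_for_status()
    text = clean_text(response.text)
    return {"url": url, "text": text, "content_hash": content_hash(text)}
```

Hashing the cleaned text rather than the raw bytes is a design choice in this sketch; it keeps the hash stable across markup-only changes, but hashing the raw response would be the stricter audit record.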

### M1.2 — Trace Logger

- `researchers/web/trace.py`:
  - `TraceLogger` class — opens JSONL file keyed by trace_id
  - `log_step(step, action, result, decision)` — appends one JSON line
  - Includes `content_hash` for fetch actions
  - Configurable trace directory (`~/.marchwarden/traces/` default)
- Unit tests for file creation, JSON structure, content hashing
- **Deliverable:** Trace logging works, JSONL files verifiable with `jq`
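
A minimal sketch of the logger described above. The `<trace_id>.jsonl` file naming, the `timestamp` field, and passing `content_hash` as an optional keyword argument are assumptions layered on top of the `log_step(step, action, result, decision)` signature.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


class TraceLogger:
    """Appends one JSON object per line to <trace_dir>/<trace_id>.jsonl."""

    def __init__(self, trace_id, trace_dir="~/.marchwarden/traces"):
        self.trace_id = trace_id
        self.path = Path(trace_dir).expanduser() / f"{trace_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log_step(self, step, action, result, decision, content_hash=None):
        record = {
            "trace_id": self.trace_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "action": action,
            "result": result,
            "decision": decision,
        }
        # Only fetch actions carry a hash, so the key is omitted otherwise.
        if content_hash is not None:
            record["content_hash"] = content_hash
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
        return record
```

Because each line is a complete JSON object, a file written this way can be inspected with `jq -c . <file>`.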

### M1.3 — Inner Agent Loop

- `researchers/web/agent.py`:
  - `WebResearcher` class
  - `async research(question, context, depth, constraints) -> ResearchResult`
  - Internal loop: plan -> search -> fetch -> parse -> check confidence -> iterate or stop
  - Uses Claude API (via `anthropic` SDK) as the reasoning engine
  - Enforces `max_iterations` and `token_budget` at the loop level
  - Populates all contract fields: `raw_excerpt`, categorized `gaps`, `discovery_events`, `confidence_factors`, `cost_metadata` (with `model_id`)
  - Logs every step to TraceLogger
- Integration test: call with a real question, verify all contract fields populated
- **Deliverable:** `WebResearcher.research("What are ideal crops for Utah?")` returns a valid `ResearchResult`
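
The loop-level enforcement of `max_iterations` and `token_budget` can be sketched independently of the Claude calls. In the sketch below, `step` is a hypothetical coroutine standing in for one plan/search/fetch/parse pass; `LoopState`, `run_loop`, and the `target_confidence` threshold are likewise assumptions, not the planned `WebResearcher` internals.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class LoopState:
    iterations: int = 0
    tokens_used: int = 0
    confidence: float = 0.0
    stop_reason: str = ""
    notes: list = field(default_factory=list)


async def run_loop(step, max_iterations, token_budget, target_confidence=0.8):
    """Iterate until confident, out of iterations, or out of tokens.

    `step(iteration)` must return (tokens_spent, confidence, note).
    """
    state = LoopState()
    while True:
        # Limits are checked before each pass, so the final pass can
        # overshoot the token budget by at most one step's worth of tokens.
        if state.iterations >= max_iterations:
            state.stop_reason = "max_iterations"
            break
        if state.tokens_used >= token_budget:
            state.stop_reason = "token_budget"
            break
        tokens, confidence, note = await step(state.iterations)
        state.iterations += 1
        state.tokens_used += tokens
        state.confidence = confidence
        state.notes.append(note)
        if confidence >= target_confidence:
            state.stop_reason = "confident"
            break
    return state


async def fake_step(iteration):
    """Stand-in pass: costs 1000 tokens and gains 0.2 confidence each time."""
    return 1000, 0.3 + 0.2 * iteration, f"pass {iteration}"
```

For example, `asyncio.run(run_loop(fake_step, max_iterations=2, token_budget=5000))` stops after two passes with `stop_reason == "max_iterations"`, which is the condition that would be reported as a `BUDGET_EXHAUSTED` gap.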

### M1.4 — MCP Server

- `researchers/web/server.py`:
  - MCP server entry point using `mcp` SDK
  - Exposes single tool: `research`
  - Delegates to `WebResearcher`
  - Server-level budget enforcement (kill agent if it exceeds constraints)
- Test: start server, call tool via MCP client, verify response schema
- **Deliverable:** `python -m researchers.web.server` starts an MCP server with one tool

---

## Phase 2: CLI Shim

**Goal:** Human-usable interface for testing the researcher.

### M2.1 — `ask` Command

- `cli/main.py` — Click CLI
- `marchwarden ask "question" [--depth shallow|balanced|deep] [--budget N] [--max-iterations N]`
- Connects to web researcher MCP server (or calls WebResearcher directly for simplicity)
- Pretty-prints: answer, citations (with raw_excerpts), gaps (with categories), discovery events, confidence + factors, cost metadata
- Saves trace_id for replay
- **Deliverable:** End-to-end question -> formatted answer in terminal
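
The option surface of `ask` could be declared with Click roughly as follows. The default values and the reading of `--budget` as a token budget are assumptions, and the command body only echoes the parsed request instead of calling the researcher.

```python
import click


@click.group()
def cli():
    """Marchwarden command-line interface (sketch)."""


@cli.command()
@click.argument("question")
@click.option(
    "--depth",
    type=click.Choice(["shallow", "balanced", "deep"]),
    default="balanced",  # assumed default
    show_default=True,
)
@click.option("--budget", type=int, default=20000, show_default=True,
              help="Token budget (assumed meaning and default).")
@click.option("--max-iterations", type=int, default=5, show_default=True,
              help="Iteration cap (assumed default).")
def ask(question, depth, budget, max_iterations):
    """Ask the web researcher a question."""
    # Placeholder: a real command would call the researcher and pretty-print
    # the ResearchResult; this sketch only shows the parsed arguments.
    click.echo(
        f"question={question!r} depth={depth} "
        f"budget={budget} max_iterations={max_iterations}"
    )
```

Click rejects values outside the `--depth` choices with a usage error before the command body runs, so the researcher never sees an invalid depth.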

### M2.2 — `replay` Command

- `cli/main.py`:
  - `marchwarden replay <trace_id>`
  - Reads JSONL trace file, pretty-prints each step
  - Shows: action taken, decision made, content hashes
- **Deliverable:** `marchwarden replay <id>` shows full audit trail
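
The formatting half of `replay` might be as small as the sketch below. It assumes each trace line carries the `step`, `action`, `decision`, and optional `content_hash` keys implied by `log_step`; the output layout and the helper names are invented for illustration.

```python
import json


def format_step(record):
    """Render one trace record as a single human-readable line."""
    line = (
        f"[{record.get('step', '?')}] "
        f"{record.get('action', '?')} -> {record.get('decision', '')}"
    )
    if "content_hash" in record:
        # A short prefix is enough to eyeball whether two fetches match.
        line += f" (sha256 {record['content_hash'][:12]})"
    return line


def replay_lines(lines):
    """Format every JSONL line, flagging entries that fail to parse."""
    output = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            output.append(f"[line {number}] unreadable trace entry")
            continue
        output.append(format_step(record))
    return output
```

Reporting a corrupt line instead of aborting keeps the rest of the audit trail readable when a run was interrupted mid-write.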

### M2.3 — First Smoke Test

- Run the boring test: "What are ideal crops for a garden in Utah?"
- Manually verify: answer is reasonable, citations have real URLs and raw_excerpts, gaps are categorized, confidence_factors are populated, trace file exists and is valid JSONL
- **Deliverable:** First successful end-to-end run, documented in issue #1

---

## Phase 3: Stress Testing & Calibration

**Goal:** Exercise every contract feature, collect calibration data.

### M3.1 — Single-Axis Stress Tests

Run each, verify the specific contract feature it targets:

1. **Recency:** "What AI models were released in Q1 2026?" -> tests SOURCE_NOT_FOUND or dated recency
2. **Contradiction:** "Is coffee good or bad for you?" -> tests CONTRADICTORY_SOURCES gap + contradiction_detected factor
3. **Scope:** "Compare CRISPR delivery mechanisms in recent clinical trials" -> tests SCOPE_EXCEEDED gap + discovery_events
4. **Budget:** "Comprehensive history of AI 1950-2026" with tight budget (max_iterations=2, token_budget=5000) -> tests BUDGET_EXHAUSTED

- **Deliverable:** 4 trace files, documented results, contract gaps identified

### M3.2 — Multi-Axis Stress Test

- Run the HFT query: "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
- Verify: exercises recency, contradiction, scope exceeded, discovery events simultaneously
- **Deliverable:** Complex trace file, full contract exercised

### M3.3 — Confidence Calibration (V1.1)

- Analyze confidence_factors across all test runs (20-30 queries)
- Identify patterns: when does the LLM over/under-estimate confidence?
- Draft a calibration rubric (what a 0.9 actually looks like empirically)
- Update ResearchContract.md with calibrated guidance
- **Deliverable:** Data-driven confidence rubric
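
One way to see "what a 0.9 actually looks like" is to bucket runs by stated confidence and compare each bucket with the fraction of answers judged correct on manual review. The sketch below assumes each run has been reduced to a `(stated_confidence, was_correct)` pair; the function name and the sample data are made up for illustration.

```python
from collections import defaultdict


def calibration_table(runs, buckets=10):
    """Group (confidence, was_correct) pairs into equal-width confidence buckets."""
    grouped = defaultdict(list)
    for confidence, was_correct in runs:
        # A confidence of exactly 1.0 is folded into the top bucket.
        index = min(int(confidence * buckets), buckets - 1)
        grouped[index].append(bool(was_correct))
    table = []
    for index in sorted(grouped):
        outcomes = grouped[index]
        table.append(
            {
                "low": index / buckets,
                "high": (index + 1) / buckets,
                "runs": len(outcomes),
                "observed_accuracy": sum(outcomes) / len(outcomes),
            }
        )
    return table


# Hypothetical graded runs, not real results.
sample_runs = [
    (0.95, True), (0.92, False), (0.91, True), (0.97, False),
    (0.55, True), (0.51, True),
]
rubric_rows = calibration_table(sample_runs)
```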

---

## Phase 4: Hardening

**Goal:** Production-quality V1.

### M4.1 — Error Handling

- Tavily down/rate-limited -> gap with ACCESS_DENIED, graceful degradation
- URL fetch failures -> individual citation skipped, noted in trace
- Claude API timeout -> meaningful error, partial results if possible
- Budget overflow protection at MCP server level
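
The first two rules above can be sketched with injected callables, so that no real provider is needed. `search` and `fetch` are hypothetical stand-ins for the tool functions, the gap dict shape is assumed, and a real implementation would catch provider-specific exceptions rather than bare `Exception`.

```python
def search_or_gap(query, search):
    """Return (results, gap); a failed search degrades to an ACCESS_DENIED gap."""
    try:
        return search(query), None
    except Exception as exc:  # sketch only: narrow this to provider errors
        gap = {
            "topic": query,
            "category": "ACCESS_DENIED",
            "detail": f"search unavailable: {exc}",
        }
        return [], gap


def fetch_all(urls, fetch, trace):
    """Fetch each URL, skipping failures and noting them in the trace list."""
    pages = []
    for url in urls:
        try:
            pages.append(fetch(url))
        except Exception as exc:  # sketch only: narrow this to HTTP errors
            trace.append(
                {"action": "fetch_url", "url": url, "result": "skipped", "error": repr(exc)}
            )
    return pages
```

Returning the gap alongside an empty result list lets the caller keep building a partial `ResearchResult` instead of aborting the whole run.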

### M4.2 — Test Suite

- Unit tests: models, tools, trace logger
- Integration tests: full research loop with mocked Tavily
- Contract compliance tests: verify every ResearchResult field is populated correctly
- **Deliverable:** `pytest tests/` all green, reasonable coverage

### M4.3 — Documentation Polish

- Update DevelopmentGuide with Tavily setup instructions
- Add troubleshooting section
- Update README quick start
- **Deliverable:** A new developer can clone, set up, and run in 15 minutes

---

## Phase 5: Second Researcher (V2 begins)

**Goal:** Prove the contract works across researcher types.

### M5.1 — File/Document Researcher

- `researchers/docs/` — same contract, different tools
- Searches a local file corpus (glob + grep + read)
- Returns citations with file paths instead of URLs
- Same gaps, discovery_events, confidence_factors structure
- **Deliverable:** Two researchers, same contract, different sources
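
The glob + grep + read pipeline could start out as simple as the sketch below, which returns citation-like dicts keyed by file path. The function name, the default `**/*.md` pattern, the one-hit-per-file policy, and the `source` key are all assumptions; only `raw_excerpt` is a contract field named in this roadmap.

```python
import re
from pathlib import Path


def search_corpus(root, pattern, glob="**/*.md", context_chars=80):
    """Return one citation per matching file, with an excerpt around the first hit."""
    regex = re.compile(pattern, re.IGNORECASE)
    citations = []
    for path in sorted(Path(root).glob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        match = regex.search(text)
        if match is None:
            continue
        # Keep some surrounding text so the excerpt can be checked in context.
        start = max(match.start() - context_chars, 0)
        end = min(match.end() + context_chars, len(text))
        citations.append(
            {"source": str(path), "raw_excerpt": text[start:end].strip()}
        )
    return citations
```

Sorting the paths makes the citation order deterministic, which helps when comparing traces between runs.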

### M5.2 — Contract Validation

- Run the same question through both researchers
- Compare: do the contracts compose cleanly? Can the PI synthesize across them?
- Identify any contract changes needed (backward-compatible additions only)
- **Deliverable:** Validated multi-researcher contract

---

## Phase 6: PI Orchestrator (V2)

**Goal:** An agent that coordinates multiple researchers.

### M6.1 — PI Agent Core

- `orchestrator/pi.py`
- Dispatches researchers in parallel (asyncio.gather)
- Processes discovery_events -> dispatches follow-up researchers
- Compares raw_excerpts across researchers for contradiction detection
- Uses gap categories to decide: retry, re-dispatch, accept, escalate
- Synthesizes into final answer with full provenance

### M6.2 — PI CLI or Web UI

- Replace the CLI shim with PI-driven interface
- User asks a question -> PI decides which researchers to dispatch
- Shows intermediate progress (which researchers running, what they found)

---

## Phase 7: Advanced (V2+)

**Goal:** Address known V1 limitations.

- **Citation Validator** — programmatic URL/DOI ping before PI accepts
- **Content Addressable Storage** — store full fetched content, enable true replay
- **Streaming/Polling** — `research_status(job_id)` for long-running queries
- **Inter-Researcher Cross-Talk** — lateral dispatch without PI mediation
- **Utility Curve** — self-terminate when information gain diminishes
- **Vector-Indexed Trace Store** — cross-research learning

---

## Build Order Summary

```
Phase 0: Foundation         <- Tavily key, deps, models
Phase 1: Web Researcher     <- tools, trace, agent loop, MCP server
Phase 2: CLI Shim           <- ask, replay, first smoke test
Phase 3: Stress Testing     <- single-axis, multi-axis, calibration
Phase 4: Hardening          <- errors, tests, docs
Phase 5: Second Researcher  <- prove contract portability
Phase 6: PI Orchestrator    <- the real goal
Phase 7: Advanced           <- known limitations resolved
```

Each milestone has a clear deliverable and a moment of completion.

| Phases | Version | Ship Target |
|:---|:---|:---|
| 0-2 | V1 | Issue #1 — single web researcher + CLI |
| 3 | V1.1 | Stress testing + confidence calibration |
| 4 | V1.2 | Hardened, tested, documented |
| 5-6 | V2 | Multi-researcher + PI orchestrator |
| 7 | V2+ | Known limitations resolved |

---

See also: [Architecture](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/Architecture), [ResearchContract](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ResearchContract), [DevelopmentGuide](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/DevelopmentGuide)