Roadmap M5.1 → arxiv-rag; resolve proposal open questions

Per session decision 2026-04-08:

- M5.1 was the placeholder file/document researcher; replaced with the arxiv-rag researcher (Issue #37). The grep-based file researcher is demoted to 'future ideas' since no concrete user corpus drove its design.
- All five open questions in the proposal resolved in favor of the v1-simplest path: local embeddings (nomic), arxiv IDs only, pin paper versions, embed-model in chunk hash, embedding_calls ledger field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 17:16:51 -06:00
commit 9dcab7627f
parent ca4ae053cb
2 changed files with 20 additions and 19 deletions
--- a/ArxivRagProposal.md
+++ b/ArxivRagProposal.md
@@ -1,6 +1,6 @@
 # Implementation Proposal: arxiv-rag Researcher
-**Status:** Draft — awaiting review
+**Status:** Approved 2026-04-08
 **Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
 **Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract
@@ -196,22 +196,17 @@ marchwarden ask "..." --researchers web,arxiv      # fan out, merge in CLI
 ---
-## Open questions
+## Resolved decisions (was: Open questions)
-1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.
+1. **Embeddings: local vs API.** ✅ **Resolved 2026-04-08:** start with `nomic-embed-text-v1.5` (free, local). `voyage-3` upgrade path via `MARCHWARDEN_ARXIV_EMBED_MODEL` env var, deferred until real-world quality review.
-2. **BibTeX import.** Many users keep arxiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arxiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add <id>` simple. BibTeX import is a one-off helper script that can come later.
+2. **BibTeX import.** ✅ **Resolved 2026-04-08:** skip for v1. `arxiv add <id>` only. BibTeX importer is a future helper.
-3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
+3. **Paper versions.** ✅ **Resolved 2026-04-08:** pin to whatever the user supplies. Never auto-update. `marchwarden arxiv update <id>` will exist as an explicit action later.
   - **Pin** — index whatever the user supplies, never auto-update
   - **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replace chunks
   - **Track both** — index every version separately, distinguish in citations
-   **Recommendation: pin for v1.** Simplest. `arxiv update <id>` as an explicit user action later.
+4. **Chunk-id stability.** ✅ **Resolved 2026-04-08:** make embedding model part of the chunk ID hash, store it in `papers.json`. Re-ingest with a different model creates a new collection rather than overwriting old citations.
-4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change. Citations in past traces would become unresolvable. **Recommendation:** make embedding model part of the chunk ID hash, and store it in `papers.json`. A re-ingest creates a new collection rather than overwriting.
+5. **Cost ledger fields.** ✅ **Resolved 2026-04-08:** add an `embedding_calls` field to ledger entries (parallel to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
 5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
 ---
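
Reviewer note on decision 4 (chunk-id stability): the idea of folding the embedding model into the chunk ID hash can be sketched roughly as below. This is a minimal illustration, not the project's implementation. The function name `chunk_id`, the field order, the separator byte, and the 16-character truncation are all assumptions made for the example; only the model names `nomic-embed-text-v1.5` and `voyage-3` come from the proposal.

```python
import hashlib


def chunk_id(arxiv_id: str, chunk_index: int, chunk_text: str, embed_model: str) -> str:
    """Hypothetical helper: derive a chunk ID that depends on the embedding model.

    Because embed_model is part of the hashed payload, re-ingesting the same
    paper with a different model yields different IDs, so IDs already cited
    in past traces are never reused for re-embedded chunks.
    """
    # Join fields with a separator unlikely to occur in the inputs (assumption),
    # so that ("ab", "c") and ("a", "bc") cannot produce the same payload.
    fields = [embed_model, arxiv_id, str(chunk_index), chunk_text]
    payload = "\x1f".join(fields).encode("utf-8")
    # Truncating the digest to 16 hex characters is an arbitrary choice here.
    return hashlib.sha256(payload).hexdigest()[:16]


if __name__ == "__main__":
    text = "We propose a retrieval method ..."
    local_id = chunk_id("2403.12345v1", 0, text, "nomic-embed-text-v1.5")
    api_id = chunk_id("2403.12345v1", 0, text, "voyage-3")
    print(local_id, api_id, local_id != api_id)
```

Under this sketch the ID is deterministic for a fixed (model, paper version, chunk) triple, which is what lets a pinned paper version keep stable citations, while a model change naturally lands in a separate ID space, matching the "new collection rather than overwriting" wording of the decision.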
--- a/Roadmap.md
+++ b/Roadmap.md
@@ -153,19 +153,25 @@ Run each, verify the specific contract feature it targets:
 ## Phase 5: Second Researcher (V2 begins)
 **Goal:** Prove the contract works across researcher types.
-### M5.1 — File/Document Researcher
+### M5.1 — arxiv-rag Researcher
- `researchers/docs/` — same contract, different tools
+*Tracking issue: [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37) · Design: [ArxivRagProposal](ArxivRagProposal)*
- Searches a local file corpus (glob + grep + read)
+
- Returns citations with file paths instead of URLs
+- `researchers/arxiv/` — RAG-based reader of a user-curated arXiv reading list
- Same gaps, discovery_events, confidence_factors structure
+- Same `ResearchResult` contract, different evidence path (chromadb vector store, not Tavily)
- **Deliverable:** Two researchers, same contract, different sources
+- Citations point to arxiv abs URLs; raw_excerpt is the chunk text
 - Sub-milestones (A.1–A.6 in the tracking issue): ingest pipeline, retrieval primitive, agent loop, MCP server, CLI integration, cost-ledger fields
 - **Deliverable:** Two working researchers, same contract, different sources
 ### M5.2 — Contract Validation
- Run the same question through both researchers
+- Run the same question through both researchers (web + arxiv-rag)
 - Compare: do the contracts compose cleanly? Can the PI synthesize across them?
 - Identify any contract changes needed (backward-compatible additions only)
 - **Deliverable:** Validated multi-researcher contract
 ### Future ideas (post-V2)
 - **File/document researcher** — grep+read over a local file corpus. Was the original M5.1 placeholder; demoted because no concrete user corpus drove its design. Re-prioritize when one shows up.
 - **Live arXiv search + cache (option C in the proposal)** — extend arxiv-rag from a curated reading list to a growing semantic cache
 ---
 ## Phase 6: PI Orchestrator (V2)