Roadmap M5.1 → arxiv-rag; resolve proposal open questions

Per session decision 2026-04-08:
- M5.1 was the placeholder file/document researcher; replaced with
  the arxiv-rag researcher (Issue #37). The grep-based file
  researcher is demoted to 'future ideas' since no concrete user
  corpus drove its design.
- All five open questions in the proposal resolved in favor of the
  v1-simplest path: local embeddings (nomic), arxiv IDs only,
  pin paper versions, embed-model in chunk hash, embedding_calls
  ledger field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jeff Smith 2026-04-08 17:16:51 -06:00
parent ca4ae053cb
commit 9dcab7627f
2 changed files with 20 additions and 19 deletions

@@ -1,6 +1,6 @@
 # Implementation Proposal: arxiv-rag Researcher
-**Status:** Draft — awaiting review
+**Status:** Approved 2026-04-08
 **Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
 **Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract
@@ -196,22 +196,17 @@ marchwarden ask "..." --researchers web,arxiv # fan out, merge in CLI
 ---
-## Open questions
+## Resolved decisions (was: Open questions)
-1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.
+1. **Embeddings: local vs API.** **Resolved 2026-04-08:** start with `nomic-embed-text-v1.5` (free, local). `voyage-3` upgrade path via `MARCHWARDEN_ARXIV_EMBED_MODEL` env var, deferred until real-world quality review.
-2. **BibTeX import.** Many users keep arxiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arxiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add <id>` simple. BibTeX import is a one-off helper script that can come later.
+2. **BibTeX import.** **Resolved 2026-04-08:** skip for v1. `arxiv add <id>` only. BibTeX importer is a future helper.
-3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
-- **Pin** — index whatever the user supplies, never auto-update
-- **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replace chunks
-- **Track both** — index every version separately, distinguish in citations
-**Recommendation: pin for v1.** Simplest. `arxiv update <id>` as an explicit user action later.
+3. **Paper versions.** **Resolved 2026-04-08:** pin to whatever the user supplies. Never auto-update. `marchwarden arxiv update <id>` will exist as an explicit action later.
-4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change. Citations in past traces would become unresolvable. **Recommendation:** make embedding model part of the chunk ID hash, and store it in `papers.json`. A re-ingest creates a new collection rather than overwriting.
+4. **Chunk-id stability.** **Resolved 2026-04-08:** make embedding model part of the chunk ID hash, store it in `papers.json`. Re-ingest with a different model creates a new collection rather than overwriting old citations.
-5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
+5. **Cost ledger fields.** **Resolved 2026-04-08:** add an `embedding_calls` field to ledger entries (parallel to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
 ---
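
Resolved decisions 1, 4, and 5 above can be illustrated with a minimal Python sketch. Only `MARCHWARDEN_ARXIV_EMBED_MODEL`, `nomic-embed-text-v1.5`, `voyage-3`, and the `embedding_calls` field come from the proposal. The function names, the hash recipe (SHA-256 truncated to 16 hex characters), and the `embedding_cost_usd` field are assumptions made for illustration and may not match the actual marchwarden code.

```python
import hashlib
import os

# Default named in the proposal for v1 (free, local).
DEFAULT_EMBED_MODEL = "nomic-embed-text-v1.5"


def embed_model_from_env(env=None):
    """Return the embedding model name, honoring MARCHWARDEN_ARXIV_EMBED_MODEL."""
    env = os.environ if env is None else env
    return env.get("MARCHWARDEN_ARXIV_EMBED_MODEL", DEFAULT_EMBED_MODEL)


def chunk_id(arxiv_id, chunk_index, embed_model):
    """Hypothetical chunk ID: hash of pinned paper version, chunk position, and model.

    Including embed_model in the hashed key means a re-ingest with a different
    model produces different IDs instead of overwriting old ones.
    """
    key = "\x1f".join([arxiv_id, str(chunk_index), embed_model])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def ledger_entry(embedding_calls, cost_per_call_usd=0.0):
    """Hypothetical ledger fragment: $0 for local embeddings, real cost for API ones."""
    return {
        "embedding_calls": embedding_calls,
        # Assumed field name; the proposal only specifies embedding_calls.
        "embedding_cost_usd": embedding_calls * cost_per_call_usd,
    }
```

Under these assumptions, `chunk_id("2403.12345v1", 0, "nomic-embed-text-v1.5")` and `chunk_id("2403.12345v1", 0, "voyage-3")` differ, so citations recorded against the first model stay resolvable after a re-ingest with the second. The same holds for `v1` versus `v2` of a paper, which matches the pin-the-version decision.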

@@ -153,19 +153,25 @@ Run each, verify the specific contract feature it targets:
 ## Phase 5: Second Researcher (V2 begins)
 **Goal:** Prove the contract works across researcher types.
-### M5.1 — File/Document Researcher
-- `researchers/docs/` — same contract, different tools
-- Searches a local file corpus (glob + grep + read)
-- Returns citations with file paths instead of URLs
-- Same gaps, discovery_events, confidence_factors structure
-- **Deliverable:** Two researchers, same contract, different sources
+### M5.1 — arxiv-rag Researcher
+*Tracking issue: [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37) · Design: [ArxivRagProposal](ArxivRagProposal)*
+- `researchers/arxiv/` — RAG-based reader of a user-curated arXiv reading list
+- Same `ResearchResult` contract, different evidence path (chromadb vector store, not Tavily)
+- Citations point to arxiv abs URLs; raw_excerpt is the chunk text
+- Sub-milestones (A.1-A.6 in the tracking issue): ingest pipeline, retrieval primitive, agent loop, MCP server, CLI integration, cost-ledger fields
+- **Deliverable:** Two working researchers, same contract, different sources
 ### M5.2 — Contract Validation
-- Run the same question through both researchers
+- Run the same question through both researchers (web + arxiv-rag)
 - Compare: do the contracts compose cleanly? Can the PI synthesize across them?
 - Identify any contract changes needed (backward-compatible additions only)
 - **Deliverable:** Validated multi-researcher contract
+### Future ideas (post-V2)
+- **File/document researcher** — grep+read over a local file corpus. Was the original M5.1 placeholder; demoted because no concrete user corpus drove its design. Re-prioritize when one shows up.
+- **Live arXiv search + cache (option C in the proposal)** — extend arxiv-rag from a curated reading list to a growing semantic cache
 ---
 ## Phase 6: PI Orchestrator (V2)