diff --git a/ArxivRagProposal.md b/ArxivRagProposal.md
index 18d9ddb..a335ef6 100644
--- a/ArxivRagProposal.md
+++ b/ArxivRagProposal.md
@@ -1,6 +1,6 @@
 # Implementation Proposal: arxiv-rag Researcher
 
-**Status:** Draft — awaiting review
+**Status:** Approved 2026-04-08
 **Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
 **Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract
 
@@ -196,22 +196,17 @@ marchwarden ask "..." --researchers web,arxiv # fan out, merge in CLI
 
 ---
 
-## Open questions
+## Resolved decisions (was: Open questions)
 
-1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.
+1. **Embeddings: local vs API.** ✅ **Resolved 2026-04-08:** start with `nomic-embed-text-v1.5` (free, local). `voyage-3` upgrade path via `MARCHWARDEN_ARXIV_EMBED_MODEL` env var, deferred until real-world quality review.
 
-2. **BibTeX import.** Many users keep arxiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arxiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add ` simple. BibTeX import is a one-off helper script that can come later.
+2. **BibTeX import.** ✅ **Resolved 2026-04-08:** skip for v1. `arxiv add ` only. BibTeX importer is a future helper.
 
-3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
- **Pin** — index whatever the user supplies, never auto-update
- **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replace chunks
- **Track both** — index every version separately, distinguish in citations
+3. **Paper versions.** ✅ **Resolved 2026-04-08:** pin to whatever the user supplies. Never auto-update. `marchwarden arxiv update ` will exist as an explicit action later.
 
- **Recommendation: pin for v1.** Simplest. `arxiv update ` as an explicit user action later.
+4. **Chunk-id stability.** ✅ **Resolved 2026-04-08:** make embedding model part of the chunk ID hash, store it in `papers.json`. Re-ingest with a different model creates a new collection rather than overwriting old citations.
 
-4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change. Citations in past traces would become unresolvable. **Recommendation:** make embedding model part of the chunk ID hash, and store it in `papers.json`. A re-ingest creates a new collection rather than overwriting.
-
-5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
+5. **Cost ledger fields.** ✅ **Resolved 2026-04-08:** add an `embedding_calls` field to ledger entries (parallel to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
 
 ---
 
diff --git a/Roadmap.md b/Roadmap.md
index 288642b..589006c 100644
--- a/Roadmap.md
+++ b/Roadmap.md
@@ -153,19 +153,25 @@ Run each, verify the specific contract feature it targets:
 ## Phase 5: Second Researcher (V2 begins)
 **Goal:** Prove the contract works across researcher types.
 
-### M5.1 — File/Document Researcher
-- `researchers/docs/` — same contract, different tools
-- Searches a local file corpus (glob + grep + read)
-- Returns citations with file paths instead of URLs
-- Same gaps, discovery_events, confidence_factors structure
-- **Deliverable:** Two researchers, same contract, different sources
+### M5.1 — arxiv-rag Researcher
+*Tracking issue: [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37) · Design: [ArxivRagProposal](ArxivRagProposal)*
+
+- `researchers/arxiv/` — RAG-based reader of a user-curated arXiv reading list
+- Same `ResearchResult` contract, different evidence path (chromadb vector store, not Tavily)
+- Citations point to arxiv abs URLs; raw_excerpt is the chunk text
+- Sub-milestones (A.1–A.6 in the tracking issue): ingest pipeline, retrieval primitive, agent loop, MCP server, CLI integration, cost-ledger fields
+- **Deliverable:** Two working researchers, same contract, different sources
 
 ### M5.2 — Contract Validation
-- Run the same question through both researchers
+- Run the same question through both researchers (web + arxiv-rag)
 - Compare: do the contracts compose cleanly? Can the PI synthesize across them?
 - Identify any contract changes needed (backward-compatible additions only)
 - **Deliverable:** Validated multi-researcher contract
 
+### Future ideas (post-V2)
+- **File/document researcher** — grep+read over a local file corpus. Was the original M5.1 placeholder; demoted because no concrete user corpus drove its design. Re-prioritize when one shows up.
+- **Live arXiv search + cache (option C in the proposal)** — extend arxiv-rag from a curated reading list to a growing semantic cache
+
 ---
 
 ## Phase 6: PI Orchestrator (V2)