# Implementation Proposal: arxiv-rag Researcher

**Status:** Draft — awaiting review
**Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
**Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract

---

## Motivation

Marchwarden's V1 web researcher is excellent at "what does the public internet say about X?" but blind to the academic literature. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:

- **The corpus is private to the user's interest.** Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
- **Semantic queries are essential.** Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
- **The corpus is stable.** Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here. The store only ever grows.
- **Contract fit is natural.** Each retrieved chunk *is* a verifiable `raw_excerpt` with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.

This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.

---

## Goals and non-goals

### Goals
1. A second working researcher implementing the v1 ResearchContract
2. CLI ergonomics for ingesting arXiv papers (`marchwarden arxiv add 2403.12345`)
3. Same `research(question)` MCP tool surface as the web researcher
4. 
End-to-end question → grounded answer with citations from the local paper store
5. Validate that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend

### Non-goals (V1 of this researcher)
- Live arXiv search / fully automatic corpus growth (a future phase; see Alternatives)
- Cross-researcher orchestration — that's the PI agent (Phase 6)
- Citation graph traversal (references, cited-by)
- Production-grade math notation fidelity beyond what pymupdf provides
- Multi-user / shared corpus — single-user, local store only

---

## High-level architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         marchwarden CLI                         │
│         ask --researcher arxiv "question"  |  arxiv add         │
└────────────────────────┬────────────────────────────────────────┘
                         │ MCP stdio
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/server.py                   │
│                   (FastMCP, exposes research)                   │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/agent.py                    │
│                                                                 │
│     plan → retrieve(query) → re-rank → iterate → synthesize     │
│                  │                                              │
│                  ▼                                              │
│              researchers/arxiv/store.py                         │
│     (chromadb wrapper, returns top-K chunks with metadata)      │
└──────────────┬──────────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────────┐
│ ~/.marchwarden/arxiv-rag/                                       │
│ ├── papers.json    (manifest)                                   │
│ ├── pdfs/<id>.pdf  (cached PDFs)                                │
│ └── chroma/        (vector store + chunks)                      │
└─────────────────────────────────────────────────────────────────┘
```

---

## Design decisions

### 1. 
Corpus scope: user-curated reading list (Option A)

Three options were considered:

| Option | Description | Ship complexity | Long-term ceiling |
|---|---|---|---|
| **A** | User curates a reading list via `arxiv add`; only those papers are indexed | Low | Medium |
| **B** | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| **C** | Hybrid: live search + everything fetched is cached and re-usable | High | High |

**Choice: A first, evolve to C later.** A is the smallest thing that's actually useful (literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once we have A in production.

### 2. PDF extraction: `pymupdf` (fitz)

| Tool | Pros | Cons |
|---|---|---|
| **pymupdf** | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| `marker` | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| `nougat` (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| `science-parse` | Section-aware parser | Java, more complex deployment |

**Choice: `pymupdf` for v1.** Trade math fidelity for ship speed. Swap to `marker` if real-world use shows math-heavy papers are unusable.

### 3. Chunking strategy: section-level

| Strategy | Pros | Cons |
|---|---|---|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" 
|
| **Section-level** | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window 500 tok | Standard RAG default; high recall | Cuts sentences mid-claim, breaks contract's verifiable `raw_excerpt` |

**Choice: section-level.** The contract's `raw_excerpt` field requires verbatim, verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: a heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with a whole-paper fallback if structure isn't detected.

### 4. Vector store: `chromadb`

| Store | Pros | Cons |
|---|---|---|
| **chromadb** | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant local | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |

**Choice: `chromadb`.** Fits the "writes to `~/.marchwarden/` only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).

### 5. Embedding model: `nomic-embed-text-v1.5` (local) with `voyage-3` upgrade path

| Model | Cost | Quality on technical text | Setup |
|---|---|---|---|
| **nomic-embed-text-v1.5** | Free, local, CPU-OK | Good | `pip install` |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in `~/secrets` |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in `~/secrets` |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in `~/secrets` |

**Choice: `nomic-embed-text-v1.5` for v1.** Zero cost, runs offline, "good enough" for arXiv-quality text. Easy swap to `voyage-3` later via `MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3` if quality reviews show problems. 
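That swap can be sketched in a few lines. The env var name is the one proposed above; `resolve_embed_model` and the default constant are illustrative, not existing Marchwarden code:

```python
import os

# Default model name is the v1 choice from this proposal; the env var is the
# proposed override knob. This helper is a hypothetical sketch, not the
# actual Marchwarden implementation.
DEFAULT_EMBED_MODEL = "nomic-embed-text-v1.5"

def resolve_embed_model() -> str:
    """Return the embedding model name, preferring the env override."""
    return os.environ.get("MARCHWARDEN_ARXIV_EMBED_MODEL", DEFAULT_EMBED_MODEL)
```
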
Embedding model choice is config, not architecture.

### 6. Research loop pattern

The arXiv researcher reuses the same plan → tool → iterate → synthesize loop as the web researcher. The differences:

| | Web researcher | arXiv researcher |
|---|---|---|
| Tools the agent calls | `web_search`, `fetch_url` | `retrieve_chunks`, `read_full_section` |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | `https://arxiv.org/abs/<id>` |
| Citation raw_excerpt | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding to corpus" |
| Confidence factors | source_authority based on .gov/.edu/etc. | All sources are scholarly by design (though arXiv hosts preprints, not necessarily peer-reviewed versions); authority based on venue / citation count if available |

The synthesis prompt is adapted for academic tone, but the JSON output schema is identical to the web researcher's (same `ResearchResult` model).

---

## Storage layout

```
~/.marchwarden/arxiv-rag/
├── papers.json        # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│   ├── 2403.12345v1.pdf
│   └── 2401.00001v2.pdf
└── chroma/            # chromadb persistent store
    └── ...            # vectors + chunk metadata
```

`papers.json` schema:
```json
{
  "2403.12345": {
    "version": "v1",
    "title": "Diffusion Models for Protein Folding",
    "authors": ["Alice Smith", "Bob Jones"],
    "year": 2024,
    "added_at": "2026-04-08T22:00:00Z",
    "category": "cs.LG",
    "chunks_indexed": 12,
    "embedding_model": "nomic-embed-text-v1.5"
  }
}
```

---

## CLI surface

```bash
# Ingest
marchwarden arxiv add 2403.12345              # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001   # batch
marchwarden arxiv list                        # show indexed papers
marchwarden arxiv remove 2403.12345           # drop from index (also delete vectors)
marchwarden arxiv info 2403.12345             # show metadata + chunk count

# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" 
--researcher arxiv

# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv   # fan out, merge in CLI
```

---

## Open questions

1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add the `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.

2. **BibTeX import.** Many users keep arXiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arXiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add <id>` simple. BibTeX import is a one-off helper script that can come later.

3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
   - **Pin** — index whatever the user supplies, never auto-update
   - **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replace chunks
   - **Track both** — index every version separately, distinguish in citations

   **Recommendation: pin for v1.** Simplest. `arxiv update <id>` can come later as an explicit user action.

4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change, and citations in past traces would become unresolvable. **Recommendation:** make the embedding model part of the chunk-ID hash, and store it in `papers.json`. A re-ingest then creates a new collection rather than overwriting.

5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.

---

## Success criteria for V1 of this researcher

- [ ] `marchwarden arxiv add 2403.12345` works end-to-end (download → extract → chunk → embed → store)
- [ ] `marchwarden arxiv list` shows the indexed papers with metadata
- [ ] `marchwarden ask "..." 
--researcher arxiv` returns a `ResearchResult` with the same shape as the web researcher's
- [ ] Citations point to the correct arXiv URL with verbatim chunk text in `raw_excerpt`
- [ ] Cost ledger records `embedding_calls` separately
- [ ] Trace JSONL captures every retrieval / re-rank / synthesis step
- [ ] At least one cross-researcher manual smoke test: ask the same question of `--researcher web` and `--researcher arxiv` and confirm by eye that the contracts compose
- [ ] All existing tests still pass

---

## Alternatives considered (and rejected for v1)

- **Live arXiv search instead of a pre-indexed corpus.** Loses the "private curated subset" advantage and forces an embedding pass on every query.
- **Whole-paper embeddings.** Too coarse for "what did the authors say in methods?" queries.
- **Sliding-window chunking.** The standard RAG default, but it breaks the contract's verifiable `raw_excerpt` requirement.
- **`pgvector` vector store.** Adds Postgres as a runtime dependency for a single-user local tool — overkill.
- **`science-parse` PDF extractor.** The Java runtime complicates deployment; the quality gain over pymupdf isn't worth it for v1.
- **Skipping the agent loop and doing one-shot RAG.** Loses Marchwarden's "agent that decides what to retrieve next" advantage and reduces this to a generic RAG library.

---

## Phasing

This proposal is sized for **a single new milestone phase** (parallel to the existing roadmap). Suggested sequencing:

1. **Sign off on this proposal** — confirm decisions, file sub-issues
2. **A.1: Ingest pipeline** (smallest visible win)
3. **A.2: Retrieval primitive**
4. **A.3: ArxivResearcher agent**
5. **A.4: MCP server**
6. **A.5: CLI integration**
7. **A.6: Cost ledger integration**
8. 
**Smoke test**: index ~5 papers from your real reading list, ask 3 questions, document the run

After this lands, the contract is empirically validated across two researcher types — Phase 5 of the original roadmap is partially fulfilled, and the PI orchestrator (Phase 6) becomes a smaller leap.

---

See also: [Architecture](Architecture), [Research Contract](ResearchContract), [Roadmap](Roadmap), [User Guide](UserGuide)
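---

**Appendix: chunk-ID stability sketch.** Open question 4 recommends making the embedding model part of the chunk-ID hash so that a re-ingest lands in a fresh collection instead of silently overwriting old vectors. A minimal illustration of that recommendation — the function and its key format are hypothetical, not existing Marchwarden code:

```python
import hashlib

def chunk_id(arxiv_id: str, version: str, section: str, embed_model: str) -> str:
    """Hash every identity-bearing field, including the embedding model.

    Changing the embedding model changes every ID, so re-ingesting under a
    new model produces a disjoint set of chunks and citations in old traces
    still resolve against the old collection. (Key format is illustrative.)
    """
    key = f"{arxiv_id}{version}/{section}@{embed_model}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Usage: `chunk_id("2403.12345", "v1", "methods", "nomic-embed-text-v1.5")` is deterministic across runs, while swapping in `"voyage-3"` yields a different ID for the same chunk.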