Add arxiv-rag implementation proposal

Detailed design proposal for the second researcher: a RAG over a
user-curated arXiv reading list. Locks in chunking, vector store,
embeddings, PDF extraction, and storage layout decisions; lists
open questions and success criteria.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jeff Smith 2026-04-08 17:08:25 -06:00
parent f28eb02fd7
commit 7e5bed6033

ArxivRagProposal.md (new file, 259 lines)
# Implementation Proposal: arxiv-rag Researcher
**Status:** Draft — awaiting review
**Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
**Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract
---
## Motivation
Marchwarden's V1 web researcher is excellent at "what does the public internet say about X?" but blind to the academic literature. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:
- **The corpus is private to the user's interest.** Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
- **Semantic queries are essential.** Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
- **The corpus is stable.** Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here. The store only ever grows.
- **Contract fit is natural.** Each retrieved chunk *is* a verifiable raw_excerpt with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.
This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.
---
## Goals and non-goals
### Goals
1. A second working researcher implementing the v1 ResearchContract
2. CLI ergonomics for ingesting arXiv papers (`marchwarden arxiv add 2403.12345`)
3. Same `research(question)` MCP tool surface as the web researcher
4. End-to-end question → grounded answer with citations from the local paper store
5. Validate that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend
### Non-goals (V1 of this researcher)
- Live arXiv search / fully automatic corpus growth (a future phase, see Alternatives)
- Cross-researcher orchestration — that's the PI agent (Phase 6)
- Citation graph traversal (references, cited-by)
- Production-grade math notation fidelity beyond what pymupdf provides
- Multi-user / shared corpus — single-user, local store only
---
## High-level architecture
```
┌────────────────────────────────────────────────────────────────┐
│                        marchwarden CLI                         │
│     ask --researcher arxiv "question"   |   arxiv add <id>     │
└───────────────────────────────┬────────────────────────────────┘
                                │ MCP stdio
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                  researchers/arxiv/server.py                   │
│                  (FastMCP, exposes research)                   │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/agent.py                   │
│                                                                │
│    plan → retrieve(query) → re-rank → iterate → synthesize     │
│                 │                                              │
│                 ▼                                              │
│             researchers/arxiv/store.py                         │
│    (chromadb wrapper, returns top-K chunks with metadata)      │
└──────────────┬─────────────────────────────────────────────────┘
               ▼
┌────────────────────────────────────────────────────────────────┐
│                   ~/.marchwarden/arxiv-rag/                    │
│    ├── papers.json      (manifest)                             │
│    ├── pdfs/<id>.pdf    (cached PDFs)                          │
│    └── chroma/          (vector store + chunks)                │
└────────────────────────────────────────────────────────────────┘
```
---
## Design decisions
### 1. Corpus scope: user-curated reading list (Option A)
Three options were considered:
| Option | Description | Ship complexity | Long-term ceiling |
|---|---|---|---|
| **A** | User curates a reading list via `arxiv add`; only those papers are indexed | Low | Medium |
| **B** | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| **C** | Hybrid: live search + everything fetched is cached and re-usable | High | High |
**Choice: A first, evolve to C later.** A is the smallest thing that's actually useful (literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once we have A in production.
### 2. PDF extraction: `pymupdf` (fitz)
| Tool | Pros | Cons |
|---|---|---|
| **pymupdf** | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| `marker` | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| `nougat` (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| `science-parse` | Section-aware parser | Java, more complex deployment |
**Choice: `pymupdf` for v1.** Trade math fidelity for ship speed. Swap to `marker` if real-world use shows math-heavy papers are unusable.
### 3. Chunking strategy: section-level
| Strategy | Pros | Cons |
|---|---|---|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" |
| **Section-level** | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window 500 tok | Standard RAG default; high recall | Cuts sentences mid-claim, breaks contract's verifiable raw_excerpt |
**Choice: section-level.** The contract's `raw_excerpt` field requires verbatim verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with whole-paper fallback if structure isn't detected.
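The heading heuristic plus whole-paper fallback could look like this pure-Python sketch (the regex set and function names are illustrative assumptions, not the shipped detector; front matter before the first detected heading is dropped here):

```python
import re

# Common arXiv section headings, matched as a whole line, optionally
# numbered ("2. Methods", "IV Results", or bare "Introduction").
SECTION_RE = re.compile(
    r"^\s*(?:[\dIVX]+[\.\)]?\s+)?"
    r"(introduction|related work|background|methods?|experiments?|"
    r"results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(full_text: str) -> list[tuple[str, str]]:
    """Split extracted paper text on detected headings.
    Falls back to one whole-paper chunk if nothing matches."""
    matches = list(SECTION_RE.finditer(full_text))
    if not matches:
        return [("full_paper", full_text)]
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        sections.append((m.group(1).lower(), full_text[m.start():end]))
    return sections
```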
### 4. Vector store: `chromadb`
| Store | Pros | Cons |
|---|---|---|
| **chromadb** | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant local | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |
**Choice: `chromadb`.** Fits the "writes to `~/.marchwarden/` only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).
### 5. Embedding model: `nomic-embed-text-v1.5` (local) with `voyage-3` upgrade path
| Model | Cost | Quality on technical text | Setup |
|---|---|---|---|
| **nomic-embed-text-v1.5** | Free, local, CPU-OK | Good | `pip install` |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in `~/secrets` |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in `~/secrets` |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in `~/secrets` |
**Choice: `nomic-embed-text-v1.5` for v1.** Zero cost, runs offline, "good enough" for arxiv-quality text. Easy swap to `voyage-3` later via `MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3` if quality reviews show problems. Embedding model choice is config, not architecture.
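"Config, not architecture" could reduce to a single resolution function like this sketch (the env var name comes from the paragraph above; the allow-list and function name are illustrative):

```python
import os

DEFAULT_MODEL = "nomic-embed-text-v1.5"
KNOWN_MODELS = {"nomic-embed-text-v1.5", "voyage-3",
                "text-embedding-3-small", "text-embedding-3-large"}

def resolve_embed_model() -> str:
    """Pick the embedding model from config:
    MARCHWARDEN_ARXIV_EMBED_MODEL overrides the local default."""
    model = os.environ.get("MARCHWARDEN_ARXIV_EMBED_MODEL", DEFAULT_MODEL)
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown embedding model: {model}")
    return model
```

Failing loudly on an unknown name keeps a typo'd env var from silently indexing with the wrong model.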
### 6. Research loop pattern
The arXiv researcher reuses the same plan → tool → iterate → synthesize loop the web researcher uses. The differences:
| | Web researcher | arXiv researcher |
|---|---|---|
| Tools the agent calls | `web_search`, `fetch_url` | `retrieve_chunks`, `read_full_section` |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | `https://arxiv.org/abs/<id>` |
| Citation raw_excerpt | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding to corpus" |
| Confidence factors | source_authority based on .gov/.edu/etc | Corpus is user-curated arXiv preprints (not necessarily peer-reviewed); authority based on venue / cite count if available |
The synthesis prompt is adapted for academic tone but the JSON output schema is identical to the web researcher (same `ResearchResult` model).
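The citation-locator and raw_excerpt rows in the table above reduce to a small mapping, sketched here (the dict shape mirrors the contract's citation fields; the function name is illustrative):

```python
def chunk_to_citation(arxiv_id: str, section: str, chunk_text: str) -> dict:
    """Map one retrieved chunk onto the contract's citation shape:
    the locator is the stable abs page, raw_excerpt is verbatim chunk text."""
    return {
        "url": f"https://arxiv.org/abs/{arxiv_id}",
        "section": section,
        "raw_excerpt": chunk_text,  # verbatim, so claims stay verifiable
    }
```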
---
## Storage layout
```
~/.marchwarden/arxiv-rag/
├── papers.json # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│ ├── 2403.12345v1.pdf
│ └── 2401.00001v2.pdf
└── chroma/ # chromadb persistent store
└── ... # vectors + chunk metadata
```
`papers.json` schema:
```json
{
  "2403.12345": {
    "version": "v1",
    "title": "Diffusion Models for Protein Folding",
    "authors": ["Alice Smith", "Bob Jones"],
    "year": 2024,
    "added_at": "2026-04-08T22:00:00Z",
    "category": "cs.LG",
    "chunks_indexed": 12,
    "embedding_model": "nomic-embed-text-v1.5"
  }
}
```
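A stdlib sketch of updating that manifest (helper name is illustrative; the atomic write-then-rename is a suggested pattern so a crashed ingest can't leave a half-written `papers.json`):

```python
import json
import os
import tempfile

def register_paper(manifest_path: str, arxiv_id: str, entry: dict) -> dict:
    """Add or overwrite one entry in papers.json, writing atomically."""
    manifest = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    manifest[arxiv_id] = entry
    # Write to a temp file in the same directory, then rename over the
    # original: os.replace is atomic on POSIX.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(manifest_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f, indent=2)
    os.replace(tmp, manifest_path)
    return manifest
```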
---
## CLI surface
```bash
# Ingest
marchwarden arxiv add 2403.12345 # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001 # batch
marchwarden arxiv list # show indexed papers
marchwarden arxiv remove 2403.12345 # drop from index (also delete vectors)
marchwarden arxiv info 2403.12345 # show metadata + chunk count
# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" --researcher arxiv
# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv # fan out, merge in CLI
```
---
## Open questions
1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.
2. **BibTeX import.** Many users keep arXiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arXiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add <id>` simple. BibTeX import is a one-off helper script that can come later.
3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
- **Pin** — index whatever the user supplies, never auto-update
- **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replace chunks
- **Track both** — index every version separately, distinguish in citations
**Recommendation: pin for v1.** Simplest. `arxiv update <id>` as an explicit user action later.
4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change. Citations in past traces would become unresolvable. **Recommendation:** make embedding model part of the chunk ID hash, and store it in `papers.json`. A re-ingest creates a new collection rather than overwriting.
5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
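The chunk-ID recommendation in question 4 can be sketched with a stdlib hash (field order and ID length are illustrative assumptions):

```python
import hashlib

def chunk_id(arxiv_id: str, version: str, section: str,
             chunk_index: int, embedding_model: str) -> str:
    """Deterministic chunk ID that bakes in the embedding model, so a
    re-ingest under a new model yields new IDs and old citations stay
    resolvable against the old collection."""
    key = f"{arxiv_id}|{version}|{section}|{chunk_index}|{embedding_model}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```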
---
## Success criteria for V1 of this researcher
- [ ] `marchwarden arxiv add 2403.12345` works end-to-end (download → extract → chunk → embed → store)
- [ ] `marchwarden arxiv list` shows the indexed papers with metadata
- [ ] `marchwarden ask "..." --researcher arxiv` returns a `ResearchResult` with the same shape as the web researcher
- [ ] Citations point to the correct arXiv URL with verbatim chunk text in `raw_excerpt`
- [ ] Cost ledger records `embedding_calls` separately
- [ ] Trace JSONL captures every retrieval / re-rank / synthesis step
- [ ] At least one cross-researcher manual smoke test: ask the same question of `--researcher web` and `--researcher arxiv` and confirm the contracts compose visually
- [ ] All existing tests still pass
---
## Alternatives considered (and rejected for v1)
- **Live arXiv search instead of pre-indexed corpus.** Loses the "private curated subset" advantage and forces an embedding pass on every query.
- **Whole-paper embeddings.** Too coarse for "what did the authors say in methods?" queries.
- **Sliding-window chunking.** Standard RAG default but breaks the contract's verifiable raw_excerpt requirement.
- **`pgvector` vector store.** Adds Postgres as a runtime dependency for a single-user local tool — overkill.
- **`science-parse` PDF extractor.** Java runtime complicates deployment; quality gain over pymupdf isn't worth it for v1.
- **Skipping the agent loop and doing one-shot RAG.** Loses Marchwarden's "agent that decides what to retrieve next" advantage. Reduces this to a generic RAG library.
---
## Phasing
This proposal is sized for **a single new milestone phase** (parallel to the existing roadmap). Suggested sequencing:
1. **Sign off this proposal** — confirm decisions, file sub-issues
2. **A.1: Ingest pipeline** (smallest visible win)
3. **A.2: Retrieval primitive**
4. **A.3: ArxivResearcher agent**
5. **A.4: MCP server**
6. **A.5: CLI integration**
7. **A.6: Cost ledger integration**
8. **Smoke test**: index ~5 papers from your real reading list, ask 3 questions, document the run
After this lands, the contract is empirically validated across two researcher types — Phase 5 of the original roadmap is partially fulfilled and the PI orchestrator (Phase 6) becomes a smaller leap.
---
See also: [Architecture](Architecture), [Research Contract](ResearchContract), [Roadmap](Roadmap), [User Guide](UserGuide)