Add arxiv-rag implementation proposal

Detailed design proposal for the second researcher: a RAG over a
user-curated arXiv reading list. Locks in chunking, vector store,
embeddings, PDF extraction, and storage layout decisions; lists
open questions and success criteria.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jeff Smith 2026-04-08 17:08:25 -06:00
parent f28eb02fd7
commit 7e5bed6033

ArxivRagProposal.md (new file, 259 lines)
# Implementation Proposal: arxiv-rag Researcher
**Status:** Draft — awaiting review
**Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
**Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract
---
## Motivation
Marchwarden's V1 web researcher is excellent at "what does the public internet say about X?" but blind to the academic literature. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:
- **The corpus is private to the user's interest.** Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
- **Semantic queries are essential.** Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
- **The corpus is stable.** Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here. The store only ever grows.
- **Contract fit is natural.** Each retrieved chunk *is* a verifiable raw_excerpt with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.
This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.
---
## Goals and non-goals
### Goals
1. A second working researcher implementing the v1 ResearchContract
2. CLI ergonomics for ingesting arXiv papers (`marchwarden arxiv add 2403.12345`)
3. Same `research(question)` MCP tool surface as the web researcher
4. End-to-end question → grounded answer with citations from the local paper store
5. Validate that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend
### Non-goals (V1 of this researcher)
- Live arXiv search / fully automatic corpus growth (a future phase, see Alternatives)
- Cross-researcher orchestration — that's the PI agent (Phase 6)
- Citation graph traversal (references, cited-by)
- Production-grade math notation fidelity beyond what pymupdf provides
- Multi-user / shared corpus — single-user, local store only
---
## High-level architecture
```
┌────────────────────────────────────────────────────────────────┐
│                        marchwarden CLI                         │
│     ask --researcher arxiv "question"   |   arxiv add <id>     │
└───────────────────────────────┬────────────────────────────────┘
                                │ MCP stdio
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                  researchers/arxiv/server.py                   │
│                  (FastMCP, exposes research)                   │
└───────────────────────────────┬────────────────────────────────┘
                                ▼
┌────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/agent.py                   │
│                                                                │
│    plan → retrieve(query) → re-rank → iterate → synthesize     │
│                 │                                              │
│                 ▼                                              │
│             researchers/arxiv/store.py                         │
│    (chromadb wrapper, returns top-K chunks with metadata)      │
└──────────────┬─────────────────────────────────────────────────┘
               ▼
┌────────────────────────────────────────────────────────────────┐
│                   ~/.marchwarden/arxiv-rag/                    │
│    ├── papers.json      (manifest)                             │
│    ├── pdfs/<id>.pdf    (cached PDFs)                          │
│    └── chroma/          (vector store + chunks)                │
└────────────────────────────────────────────────────────────────┘
```
---
## Design decisions
### 1. Corpus scope: user-curated reading list (Option A)
Three options were considered:
| Option | Description | Ship complexity | Long-term ceiling |
|---|---|---|---|
| **A** | User curates a reading list via `arxiv add`; only those papers are indexed | Low | Medium |
| **B** | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| **C** | Hybrid: live search + everything fetched is cached and re-usable | High | High |
**Choice: A first, evolve to C later.** A is the smallest thing that's actually useful (literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once we have A in production.
### 2. PDF extraction: `pymupdf` (fitz)
| Tool | Pros | Cons |
|---|---|---|
| **pymupdf** | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| `marker` | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| `nougat` (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| `science-parse` | Section-aware parser | Java, more complex deployment |
**Choice: `pymupdf` for v1.** Trade math fidelity for ship speed. Swap to `marker` if real-world use shows math-heavy papers are unusable.
### 3. Chunking strategy: section-level
| Strategy | Pros | Cons |
|---|---|---|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" |
| **Section-level** | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window 500 tok | Standard RAG default; high recall | Cuts sentences mid-claim, breaks contract's verifiable raw_excerpt |
**Choice: section-level.** The contract's `raw_excerpt` field requires verbatim verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with whole-paper fallback if structure isn't detected.
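The heading heuristic plus whole-paper fallback could look like this pure-Python sketch (the regex set and function names are illustrative assumptions, not the shipped detector; front matter before the first detected heading is dropped here):

```python
import re

# Common arXiv section headings, matched as a whole line, optionally
# numbered ("2. Methods", "IV Results", or bare "Introduction").
SECTION_RE = re.compile(
    r"^\s*(?:[\dIVX]+[\.\)]?\s+)?"
    r"(introduction|related work|background|methods?|experiments?|"
    r"results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(full_text: str) -> list[tuple[str, str]]:
    """Split extracted paper text on detected headings.
    Falls back to one whole-paper chunk if nothing matches."""
    matches = list(SECTION_RE.finditer(full_text))
    if not matches:
        return [("full_paper", full_text)]
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        sections.append((m.group(1).lower(), full_text[m.start():end]))
    return sections
```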
### 4. Vector store: `chromadb`
| Store | Pros | Cons |
|---|---|---|
| **chromadb** | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant local | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |
**Choice: `chromadb`.** Fits the "writes to `~/.marchwarden/` only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).
### 5. Embedding model: `nomic-embed-text-v1.5` (local) with `voyage-3` upgrade path
| Model | Cost | Quality on technical text | Setup |
|---|---|---|---|
| **nomic-embed-text-v1.5** | Free, local, CPU-OK | Good | `pip install` |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in `~/secrets` |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in `~/secrets` |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in `~/secrets` |
**Choice: `nomic-embed-text-v1.5` for v1.** Zero cost, runs offline, "good enough" for arxiv-quality text. Easy swap to `voyage-3` later via `MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3` if quality reviews show problems. Embedding model choice is config, not architecture.
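"Config, not architecture" could reduce to a single resolution function like this sketch (the env var name comes from the paragraph above; the allow-list and function name are illustrative):

```python
import os

DEFAULT_MODEL = "nomic-embed-text-v1.5"
KNOWN_MODELS = {"nomic-embed-text-v1.5", "voyage-3",
                "text-embedding-3-small", "text-embedding-3-large"}

def resolve_embed_model() -> str:
    """Pick the embedding model from config:
    MARCHWARDEN_ARXIV_EMBED_MODEL overrides the local default."""
    model = os.environ.get("MARCHWARDEN_ARXIV_EMBED_MODEL", DEFAULT_MODEL)
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown embedding model: {model}")
    return model
```

Failing loudly on an unknown name keeps a typo'd env var from silently indexing with the wrong model.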
### 6. Research loop pattern
The arXiv researcher reuses the same plan → tool → iterate → synthesize loop the web researcher uses. The differences:
| | Web researcher | arXiv researcher |
|---|---|---|
| Tools the agent calls | `web_search`, `fetch_url` | `retrieve_chunks`, `read_full_section` |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | `https://arxiv.org/abs/<id>` |
| Citation raw_excerpt | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding to corpus" |
| Confidence factors | source_authority based on .gov/.edu/etc | Corpus is user-curated arXiv preprints (not necessarily peer-reviewed); authority based on venue / cite count if available |
The synthesis prompt is adapted for academic tone but the JSON output schema is identical to the web researcher (same `ResearchResult` model).
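The citation-locator and raw_excerpt rows in the table above reduce to a small mapping, sketched here (the dict shape mirrors the contract's citation fields; the function name is illustrative):

```python
def chunk_to_citation(arxiv_id: str, section: str, chunk_text: str) -> dict:
    """Map one retrieved chunk onto the contract's citation shape:
    the locator is the stable abs page, raw_excerpt is verbatim chunk text."""
    return {
        "url": f"https://arxiv.org/abs/{arxiv_id}",
        "section": section,
        "raw_excerpt": chunk_text,  # verbatim, so claims stay verifiable
    }
```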
---
## Storage layout
```
~/.marchwarden/arxiv-rag/
├── papers.json # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│ ├── 2403.12345v1.pdf
│ └── 2401.00001v2.pdf
└── chroma/ # chromadb persistent store
└── ... # vectors + chunk metadata
```
`papers.json` schema:
```json
{
  "2403.12345": {
    "version": "v1",
    "title": "Diffusion Models for Protein Folding",
    "authors": ["Alice Smith", "Bob Jones"],
    "year": 2024,
    "added_at": "2026-04-08T22:00:00Z",
    "category": "cs.LG",
    "chunks_indexed": 12,
    "embedding_model": "nomic-embed-text-v1.5"
  }
}
```
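A stdlib sketch of updating that manifest (helper name is illustrative; the atomic write-then-rename is a suggested pattern so a crashed ingest can't leave a half-written `papers.json`):

```python
import json
import os
import tempfile

def register_paper(manifest_path: str, arxiv_id: str, entry: dict) -> dict:
    """Add or overwrite one entry in papers.json, writing atomically."""
    manifest = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    manifest[arxiv_id] = entry
    # Write to a temp file in the same directory, then rename over the
    # original: os.replace is atomic on POSIX.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(manifest_path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f, indent=2)
    os.replace(tmp, manifest_path)
    return manifest
```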
---
## CLI surface
```bash
# Ingest
marchwarden arxiv add 2403.12345 # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001 # batch
marchwarden arxiv list # show indexed papers
marchwarden arxiv remove 2403.12345 # drop from index (also delete vectors)
marchwarden arxiv info 2403.12345 # show metadata + chunk count
# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" --researcher arxiv
# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv # fan out, merge in CLI
```
---
## Open questions
1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.
2. **BibTeX import.** Many users keep arXiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arXiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add <id>` simple. BibTeX import is a one-off helper script that can come later.
3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
- **Pin** — index whatever the user supplies, never auto-update
- **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replace chunks
- **Track both** — index every version separately, distinguish in citations
**Recommendation: pin for v1.** Simplest. `arxiv update <id>` as an explicit user action later.
4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change. Citations in past traces would become unresolvable. **Recommendation:** make embedding model part of the chunk ID hash, and store it in `papers.json`. A re-ingest creates a new collection rather than overwriting.
5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
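The chunk-ID recommendation in question 4 can be sketched with a stdlib hash (field order and ID length are illustrative assumptions):

```python
import hashlib

def chunk_id(arxiv_id: str, version: str, section: str,
             chunk_index: int, embedding_model: str) -> str:
    """Deterministic chunk ID that bakes in the embedding model, so a
    re-ingest under a new model yields new IDs and old citations stay
    resolvable against the old collection."""
    key = f"{arxiv_id}|{version}|{section}|{chunk_index}|{embedding_model}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```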
---
## Success criteria for V1 of this researcher
- [ ] `marchwarden arxiv add 2403.12345` works end-to-end (download → extract → chunk → embed → store)
- [ ] `marchwarden arxiv list` shows the indexed papers with metadata
- [ ] `marchwarden ask "..." --researcher arxiv` returns a `ResearchResult` with the same shape as the web researcher
- [ ] Citations point to the correct arXiv URL with verbatim chunk text in `raw_excerpt`
- [ ] Cost ledger records `embedding_calls` separately
- [ ] Trace JSONL captures every retrieval / re-rank / synthesis step
- [ ] At least one cross-researcher manual smoke test: ask the same question of `--researcher web` and `--researcher arxiv` and confirm the contracts compose visually
- [ ] All existing tests still pass
---
## Alternatives considered (and rejected for v1)
- **Live arXiv search instead of pre-indexed corpus.** Loses the "private curated subset" advantage and forces an embedding pass on every query.
- **Whole-paper embeddings.** Too coarse for "what did the authors say in methods?" queries.
- **Sliding-window chunking.** Standard RAG default but breaks the contract's verifiable raw_excerpt requirement.
- **`pgvector` vector store.** Adds Postgres as a runtime dependency for a single-user local tool — overkill.
- **`science-parse` PDF extractor.** Java runtime complicates deployment; quality gain over pymupdf isn't worth it for v1.
- **Skipping the agent loop and doing one-shot RAG.** Loses Marchwarden's "agent that decides what to retrieve next" advantage. Reduces this to a generic RAG library.
---
## Phasing
This proposal is sized for **a single new milestone phase** (parallel to the existing roadmap). Suggested sequencing:
1. **Sign off this proposal** — confirm decisions, file sub-issues
2. **A.1: Ingest pipeline** (smallest visible win)
3. **A.2: Retrieval primitive**
4. **A.3: ArxivResearcher agent**
5. **A.4: MCP server**
6. **A.5: CLI integration**
7. **A.6: Cost ledger integration**
8. **Smoke test**: index ~5 papers from your real reading list, ask 3 questions, document the run
After this lands, the contract is empirically validated across two researcher types — Phase 5 of the original roadmap is partially fulfilled and the PI orchestrator (Phase 6) becomes a smaller leap.
---
See also: [Architecture](Architecture), [Research Contract](ResearchContract), [Roadmap](Roadmap), [User Guide](UserGuide)