Researcher #2: arxiv-rag — semantic search over a curated arXiv reading list #37


Open

opened 2026-04-08 17:07:03 -06:00 by claude-code · 0 comments

Goal

Second researcher implementing the v1 contract: a RAG-based reader of arXiv papers. Sister to the planned grep-based file researcher (M5.1). Returns the same ResearchResult shape so the (future) PI orchestrator can blend its findings with the web researcher.

Detailed design lives at wiki/ArxivRagProposal (https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ArxivRagProposal); this issue is the implementation tracker.

Locked-in design defaults

Decision	Choice
Corpus mode	(A) User-curated reading list. Growing cache (option C) is V2+.
Paper ingest	CLI `marchwarden arxiv add <id>` and `marchwarden arxiv list`
Storage root	`~/.marchwarden/arxiv-rag/`
PDF extraction	`pymupdf` (swap to `marker` later if math fidelity suffers)
Chunking	Section-level (intro / methods / results / conclusion etc.)
Vector store	`chromadb` (embedded, file-backed)
Embedding model	`nomic-embed-text-v1.5` local; switch to `voyage-3` if quality is poor
Research interface	New MCP server `researchers/arxiv/server.py` exposing `research()`
Contract	Same `ResearchResult` as the web researcher — `Citation.locator` is the arXiv abs URL, `raw_excerpt` is the chunk text
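
For illustration, the Contract row above could map a retrieved chunk to a citation roughly like this. This is a minimal sketch, not the project's actual model: only `locator` and `raw_excerpt` are named in this issue, the real `Citation` may carry more fields, and `chunk_to_citation` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Only these two fields are named in this issue; the real model
    # shared with the web researcher may define more.
    locator: str      # arXiv abs URL
    raw_excerpt: str  # text of the retrieved chunk


def chunk_to_citation(arxiv_id: str, chunk_text: str) -> Citation:
    """Hypothetical helper: build a citation for one retrieved chunk."""
    return Citation(
        locator=f"https://arxiv.org/abs/{arxiv_id}",
        raw_excerpt=chunk_text,
    )


# Placeholder excerpt; a real call would pass the stored chunk text.
cite = chunk_to_citation("1706.03762", "(chunk text goes here)")
print(cite.locator)
```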

Implementation milestones

To be filed as separate sub-issues once this proposal is signed off:

A.1 Ingest pipeline — marchwarden arxiv add <id>: download PDF, extract via pymupdf, section-chunk, embed, store in chromadb. Records to a sidecar manifest at ~/.marchwarden/arxiv-rag/papers.json.
A.2 Retrieval primitive — query → embed → chromadb top-K → return ranked chunks with paper metadata.
A.3 ArxivResearcher agent — wraps retrieval in the same plan→retrieve→synthesize loop the web researcher uses, but with arXiv chunks instead of web fetches. Reuses the WebResearcher-style synthesis prompt, adapted for academic tone.
A.4 MCP server — researchers/arxiv/server.py, mirrors researchers/web/server.py. Same tool name research(), same contract.
A.5 CLI integration — marchwarden ask "..." --researcher arxiv (default still web). Stretch: --researchers web,arxiv to fan out and merge.
A.6 Cost ledger integration — record per-call cost, including embedding cost (API charges or local compute cycles), not just LLM tokens.
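
To make A.1 and A.2 concrete, here is a rough stdlib-only sketch of section chunking followed by top-K retrieval. Everything in it is an assumption for illustration: the heading regex, the function names, and especially `toy_embed`, which is a word-count stand-in for the real embedding model. The cosine ranking stands in for the chromadb query. The sample paper text is made up.

```python
import math
import re
from collections import Counter

# Hypothetical heading pattern: an optionally numbered section title on its own line.
HEADING = re.compile(
    r"^(?:\d+\.?\s+)?(abstract|introduction|related work|methods?|results|discussion|conclusions?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def section_chunks(text: str) -> list[tuple[str, str]]:
    """Split extracted text into (section_name, body) pairs at heading lines."""
    matches = list(HEADING.finditer(text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        if body:
            chunks.append((m.group(1).lower(), body))
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[tuple[str, str]], k: int = 2) -> list[tuple[float, str]]:
    """Rank chunks by similarity to the query and return the best k as (score, section)."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(body)), name) for name, body in chunks]
    return sorted(scored, reverse=True)[:k]


# Made-up sample text standing in for pymupdf output.
paper = """1 Introduction
Retrieval helps language models ground answers in documents.
2 Methods
We embed each section and store the vectors in an index.
3 Results
This section reports the evaluation.
"""
chunks = section_chunks(paper)
ranked = top_k("embed each section and store vectors", chunks)
print(ranked)
```

In the real pipeline the chunk bodies would be embedded once at ingest time and stored, so a query only embeds the query string; the sketch re-embeds on every call purely to stay short.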

Out of scope for this milestone

Live arXiv search (option B/C in the proposal)
Cross-researcher orchestration (PI agent — that's V2)
Citation graph traversal (cited-by, references)
Math notation fidelity beyond what pymupdf provides

Open questions to resolve before A.1

Local embeddings (nomic) vs. API embeddings (voyage) — start free, upgrade if quality is bad?
Should ingest accept BibTeX too, or arXiv IDs only for v1?
How to handle paper versions (v1, v2, ...) — pin to specific version or always-latest?
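
On the versioning question, a small sketch of how `arxiv add` might tell a pinned ID from an unpinned one. It assumes new-style identifiers only (`YYMM.NNNN` or `YYMM.NNNNN` with an optional `vN` suffix); old-style IDs such as `hep-th/9901001` are not handled, and both function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# New-style arXiv identifier with an optional version suffix.
ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")


def parse_arxiv_id(raw: str) -> Tuple[str, Optional[int]]:
    """Return (base_id, version); version is None when the ID is unpinned."""
    m = ARXIV_ID.match(raw.strip())
    if m is None:
        raise ValueError(f"not a new-style arXiv id: {raw!r}")
    base, version = m.group(1), m.group(2)
    return base, (int(version) if version else None)


def abs_url(base: str, version: Optional[int]) -> str:
    """Build the abs URL; without a version suffix arXiv serves the latest version."""
    suffix = f"v{version}" if version is not None else ""
    return f"https://arxiv.org/abs/{base}{suffix}"


print(parse_arxiv_id("1706.03762v5"))
print(abs_url(*parse_arxiv_id("1706.03762")))
```

Whichever policy wins, recording the resolved version in `papers.json` at ingest time would keep `Citation.locator` reproducible even if a paper is later revised.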

See the proposal page for full design rationale, alternatives considered, and architecture sketch.
