Researcher #2: arxiv-rag — semantic search over a curated arXiv reading list #37

Open
opened 2026-04-08 23:07:03 +00:00 by claude-code · 0 comments
Collaborator

Goal

Second researcher implementing the v1 contract: a RAG-based reader of arXiv papers. Sister to the planned grep-based file researcher (M5.1). Returns the same ResearchResult shape so the (future) PI orchestrator can blend its findings with those of the web researcher.

Detailed design lives at wiki/ArxivRagProposal (https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ArxivRagProposal); this issue is the implementation tracker.

Locked-in design defaults

| Decision | Choice |
|---|---|
| Corpus mode | (A) User-curated reading list. Growing cache (option C) is V2+. |
| Paper ingest | CLI marchwarden arxiv add <id> and marchwarden arxiv list |
| Storage root | ~/.marchwarden/arxiv-rag/ |
| PDF extraction | pymupdf (swap to marker later if math fidelity suffers) |
| Chunking | Section-level (intro / methods / results / conclusion etc.) |
| Vector store | chromadb (embedded, file-backed) |
| Embedding model | nomic-embed-text-v1.5 local; switch to voyage-3 if quality is poor |
| Research interface | New MCP server researchers/arxiv/server.py exposing research() |
| Contract | Same ResearchResult as the web researcher: Citation.locator is the arXiv abs URL, raw_excerpt is the chunk text |
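The contract row can be illustrated with a minimal sketch of how one retrieved chunk might become a citation. Only the field names locator and raw_excerpt come from this issue; the Citation class below is a reduced stand-in (the real one in the v1 contract may carry more fields), citation_for_chunk is a hypothetical helper, and the arXiv ID is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    # Reduced stand-in: only these two field names appear in this issue.
    locator: str      # arXiv abs URL of the paper the chunk came from
    raw_excerpt: str  # the chunk text, verbatim


def citation_for_chunk(arxiv_id: str, chunk_text: str) -> Citation:
    """Build a citation for one retrieved chunk (hypothetical helper)."""
    return Citation(
        locator=f"https://arxiv.org/abs/{arxiv_id}",
        raw_excerpt=chunk_text,
    )


# "0000.00000" is a placeholder ID, not a real paper.
citation = citation_for_chunk("0000.00000", "Placeholder chunk text.")
print(citation.locator)
```

Keeping the excerpt verbatim means a reader can check a synthesized claim against the exact text that was retrieved.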

Implementation milestones

To be filed as separate sub-issues once this proposal is signed off:

  • A.1 Ingest pipeline: marchwarden arxiv add <id> downloads the PDF, extracts text via pymupdf, chunks by section, embeds, and stores in chromadb. Each paper is recorded in a sidecar manifest at ~/.marchwarden/arxiv-rag/papers.json.
  • A.2 Retrieval primitive: query → embed → chromadb top-K → return ranked chunks with paper metadata.
  • A.3 ArxivResearcher agent: wraps retrieval in the same plan→retrieve→synthesize loop the web researcher uses, but with arXiv chunks instead of web fetches. Uses a WebResearcher-style synthesis prompt adapted for academic tone.
  • A.4 MCP server: researchers/arxiv/server.py, mirroring researchers/web/server.py. Same tool name research(), same contract.
  • A.5 CLI integration: marchwarden ask "..." --researcher arxiv (default still web). Stretch: --researchers web,arxiv to fan out and merge.
  • A.6 Cost ledger integration: record per-call cost (embedding API or local cycles, not just LLM tokens).
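The data flow of A.1 and A.2 (section-chunk, embed, top-K) can be sketched in plain Python. This is a toy under stated assumptions, not the planned implementation: toy_embed is a bag-of-words stand-in for nomic-embed-text-v1.5, an in-memory list stands in for chromadb, the heading regex and the Chunk type are hypothetical, and the sample text and arXiv ID are placeholders.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

# Illustrative heading matcher; real papers need a more forgiving one.
SECTION_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(abstract|introduction|related work|methods?|results|discussion|conclusions?)"
    r"[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class Chunk:
    arxiv_id: str
    section: str
    text: str


def section_chunks(arxiv_id: str, full_text: str) -> list[Chunk]:
    """Split extracted text into one chunk per recognized section heading."""
    matches = list(SECTION_HEADING.finditer(full_text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(full_text)
        body = full_text[match.end():end].strip()
        if body:
            chunks.append(Chunk(arxiv_id, match.group(1).lower(), body))
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[Chunk], k: int = 3) -> list[tuple[float, Chunk]]:
    """Rank chunks by similarity to the query and keep the best k."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


SAMPLE = """1 Introduction
Placeholder introduction text about a curated reading list.

2 Methods
Each section is embedded and stored in a vector index.

3 Results
Placeholder results text about answer quality.
"""

chunks = section_chunks("0000.00000", SAMPLE)
hits = top_k("vector index", chunks, k=1)
print(hits[0][1].section)  # prints: methods
```

In the real pipeline the full text would come from pymupdf and the ranking from chromadb; the sketch only shows that each ranked hit keeps its paper ID and section name, which is the metadata A.2 says to return.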

Out of scope for this milestone

  • Live arxiv search (option B/C in the proposal)
  • Cross-researcher orchestration (PI agent — that's V2)
  • Citation graph traversal (cited-by, references)
  • Math notation fidelity beyond what pymupdf provides

Open questions to resolve before A.1

  1. Local embeddings (nomic) vs. API embeddings (voyage) — start free, upgrade if quality is bad?
  2. Should ingest accept BibTeX too, or arxiv IDs only for v1?
  3. How to handle paper versions (v1, v2, ...) — pin to specific version or always-latest?
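For question 3, either policy can share one identifier parser. The sketch below does not settle the question; it only separates a base ID from an optional vN suffix so the ingest command could pin or float as decided. ArxivRef and parse_arxiv_id are hypothetical names, the regex covers the common new-style (YYMM.NNNNN) and old-style (archive/YYMMNNN) forms only, and the IDs shown are placeholders.

```python
import re
from typing import NamedTuple, Optional

# New-style IDs (four digits, a dot, four or five digits) and old-style
# IDs (archive name, slash, seven digits), each with an optional vN suffix.
ARXIV_ID = re.compile(
    r"^(?P<base>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?:v(?P<version>\d+))?$"
)


class ArxivRef(NamedTuple):
    base_id: str
    version: Optional[int]  # None means no version was given (latest)

    @property
    def pinned(self) -> bool:
        return self.version is not None

    def abs_url(self) -> str:
        """Abs-page URL; includes the version suffix only when pinned."""
        suffix = f"v{self.version}" if self.pinned else ""
        return f"https://arxiv.org/abs/{self.base_id}{suffix}"


def parse_arxiv_id(raw: str) -> ArxivRef:
    """Split a user-supplied identifier into base ID and optional version."""
    match = ARXIV_ID.match(raw.strip())
    if match is None:
        raise ValueError(f"not an arXiv identifier: {raw!r}")
    version = match.group("version")
    return ArxivRef(match.group("base"), int(version) if version else None)


# Placeholder IDs, not real papers.
print(parse_arxiv_id("0000.00000v2").abs_url())
print(parse_arxiv_id("0000.00000").pinned)
```

Storing the parsed version (or its absence) in the manifest would keep the pin-or-latest choice visible per paper, whichever default is chosen.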

See the proposal page for full design rationale, alternatives considered, and architecture sketch.

archeious added this to the Phase 5: Second Researcher milestone 2026-04-08 23:23:41 +00:00
Reference: archeious/marchwarden#37