Table of Contents
- Implementation Proposal: arxiv-rag Researcher
- Motivation
- Goals and non-goals
- High-level architecture
- Design decisions
  - 1. Corpus scope: user-curated reading list (Option A)
  - 2. PDF extraction: pymupdf (fitz)
  - 3. Chunking strategy: section-level
  - 4. Vector store: chromadb
  - 5. Embedding model: nomic-embed-text-v1.5 (local) with voyage-3 upgrade path
  - 6. Research loop pattern
- Storage layout
- CLI surface
- Resolved decisions (was: Open questions)
- Success criteria for V1 of this researcher
- Alternatives considered (and rejected for v1)
- Phasing
Implementation Proposal: arxiv-rag Researcher
Status: Approved 2026-04-08 · Tracking issue: #37 · Sister to: Roadmap M5.1 (grep-based file researcher) — different tool, same contract
Motivation
Marchwarden's V1 web researcher is excellent at "what does the public internet say about X?" but blind to the literature in academic papers. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:
- The corpus is private to the user's interest. Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
- Semantic queries are essential. Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
- The corpus is stable. Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here. The store only ever grows.
- Contract fit is natural. Each retrieved chunk is a verifiable raw_excerpt with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.
This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.
Goals and non-goals
Goals
- A second working researcher implementing the v1 ResearchContract
- CLI ergonomics for ingesting arXiv papers (`marchwarden arxiv add 2403.12345`)
- Same `research(question)` MCP tool surface as the web researcher
- End-to-end question → grounded answer with citations from the local paper store
- Validate that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend
Non-goals (V1 of this researcher)
- Live arXiv search / fully automatic corpus growth (a future phase, see Alternatives)
- Cross-researcher orchestration — that's the PI agent (Phase 6)
- Citation graph traversal (references, cited-by)
- Production-grade math notation fidelity beyond what pymupdf provides
- Multi-user / shared corpus — single-user, local store only
High-level architecture
┌─────────────────────────────────────────────────────────────────┐
│ marchwarden CLI │
│ ask --researcher arxiv "question" | arxiv add <id> │
└────────────────────────┬────────────────────────────────────────┘
│ MCP stdio
▼
┌─────────────────────────────────────────────────────────────────┐
│ researchers/arxiv/server.py │
│ (FastMCP, exposes research) │
└────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ researchers/arxiv/agent.py │
│ │
│ plan → retrieve(query) → re-rank → iterate → synthesize │
│ │ │
│ ▼ │
│ researchers/arxiv/store.py │
│ (chromadb wrapper, returns top-K chunks with metadata) │
└──────────────┬──────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ ~/.marchwarden/arxiv-rag/ │
│ ├── papers.json (manifest) │
│ ├── pdfs/<id>.pdf (cached PDFs) │
│ └── chroma/ (vector store + chunks) │
└─────────────────────────────────────────────────────────────────┘
Design decisions
1. Corpus scope: user-curated reading list (Option A)
Three options were considered:
| Option | Description | Ship complexity | Long-term ceiling |
|---|---|---|---|
| A | User curates a reading list via `arxiv add`; only those papers are indexed | Low | Medium |
| B | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| C | Hybrid: live search + everything fetched is cached and reusable | High | High |
Choice: A first, evolve to C later. A is the smallest thing that's actually useful (literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once we have A in production.
2. PDF extraction: pymupdf (fitz)
| Tool | Pros | Cons |
|---|---|---|
| pymupdf | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| marker | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| nougat (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| science-parse | Section-aware parser | Java, more complex deployment |
Choice: pymupdf for v1. Trade math fidelity for ship speed. Swap to marker if real-world use shows math-heavy papers are unusable.
3. Chunking strategy: section-level
| Strategy | Pros | Cons |
|---|---|---|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" |
| Section-level | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window 500 tok | Standard RAG default; high recall | Cuts sentences mid-claim, breaks contract's verifiable raw_excerpt |
Choice: section-level. The contract's raw_excerpt field requires verbatim verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with whole-paper fallback if structure isn't detected.
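The heading pass described above can be sketched in a few lines. This is an illustrative sketch, not code from the repo — the function name, pattern list, and `"preamble"`/`"paper"` labels are assumptions:

```python
import re

# Canonical section headings the heuristic looks for (case-insensitive),
# with optional leading numbering such as "3." or "3".
SECTION_PATTERN = re.compile(
    r"^\s*(?:\d+\.?\s+)?"
    r"(introduction|related work|methods?|experiments?|"
    r"results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE,
)

def split_into_sections(lines):
    """Group extracted PDF lines into (heading, text) sections.

    Falls back to a single whole-paper section when no headings match.
    """
    sections, current_heading, buf = [], None, []
    for line in lines:
        match = SECTION_PATTERN.match(line)
        if match:
            if buf:
                sections.append((current_heading or "preamble", "\n".join(buf)))
            current_heading, buf = match.group(1).lower(), []
        else:
            buf.append(line)
    if buf:
        sections.append((current_heading or "preamble", "\n".join(buf)))
    # Whole-paper fallback: no recognizable structure detected.
    if all(heading == "preamble" for heading, _ in sections):
        return [("paper", "\n".join(lines))]
    return sections
```

A real pass would also need to cope with headings split across PDF lines and with font-size cues pymupdf exposes, but the flush-on-heading shape stays the same.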
4. Vector store: chromadb
| Store | Pros | Cons |
|---|---|---|
| chromadb | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant local | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |
Choice: chromadb. Fits the "writes to ~/.marchwarden/ only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).
5. Embedding model: nomic-embed-text-v1.5 (local) with voyage-3 upgrade path
| Model | Cost | Quality on technical text | Setup |
|---|---|---|---|
| nomic-embed-text-v1.5 | Free, local, CPU-OK | Good | pip install |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in ~/secrets |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in ~/secrets |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in ~/secrets |
Choice: nomic-embed-text-v1.5 for v1. Zero cost, runs offline, "good enough" for arxiv-quality text. Easy swap to voyage-3 later via MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3 if quality reviews show problems. Embedding model choice is config, not architecture.
6. Research loop pattern
The arXiv researcher reuses the same plan → tool → iterate → synthesize loop the web researcher uses. The differences:
| | Web researcher | arXiv researcher |
|---|---|---|
| Tools the agent calls | `web_search`, `fetch_url` | `retrieve_chunks`, `read_full_section` |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | https://arxiv.org/abs/<id> |
| Citation raw_excerpt | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding to corpus" |
| Confidence factors | source_authority based on .gov/.edu/etc | All sources are peer-reviewed by design; authority based on venue / cite count if available |
The synthesis prompt is adapted for academic tone but the JSON output schema is identical to the web researcher (same ResearchResult model).
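The citation mapping in the table reduces to a small adapter. A minimal sketch — the dict keys are assumed from the contract's description (`raw_excerpt`, a stable URL locator), not copied from the real `ResearchResult` model:

```python
def chunk_to_citation(arxiv_id, section, chunk_text):
    """Map one retrieved chunk onto the contract's citation shape.

    The locator is the stable arXiv abs page; raw_excerpt is the verbatim
    chunk text, so the claim stays verifiable against the source.
    """
    return {
        "url": f"https://arxiv.org/abs/{arxiv_id}",
        "raw_excerpt": chunk_text,
        "note": f"section: {section}",
    }
```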
Storage layout
~/.marchwarden/arxiv-rag/
├── papers.json # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│ ├── 2403.12345v1.pdf
│ └── 2401.00001v2.pdf
└── chroma/ # chromadb persistent store
└── ... # vectors + chunk metadata
papers.json schema:
{
"2403.12345": {
"version": "v1",
"title": "Diffusion Models for Protein Folding",
"authors": ["Alice Smith", "Bob Jones"],
"year": 2024,
"added_at": "2026-04-08T22:00:00Z",
"category": "cs.LG",
"chunks_indexed": 12,
"embedding_model": "nomic-embed-text-v1.5"
}
}
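Updating the manifest after ingest is a straightforward read-modify-write of that JSON. A sketch under the schema above — the function name and keyword surface are illustrative:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def add_manifest_entry(manifest_path, arxiv_id, *, version, title, authors,
                       year, category, chunks_indexed, embedding_model):
    """Insert or update one paper entry in papers.json (schema above)."""
    path = Path(manifest_path)
    manifest = json.loads(path.read_text()) if path.exists() else {}
    manifest[arxiv_id] = {
        "version": version,
        "title": title,
        "authors": authors,
        "year": year,
        # ISO-8601 UTC timestamp with a trailing Z, as in the example entry.
        "added_at": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "category": category,
        "chunks_indexed": chunks_indexed,
        "embedding_model": embedding_model,
    }
    path.write_text(json.dumps(manifest, indent=2))
    return manifest[arxiv_id]
```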
CLI surface
# Ingest
marchwarden arxiv add 2403.12345 # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001 # batch
marchwarden arxiv list # show indexed papers
marchwarden arxiv remove 2403.12345 # drop from index (also delete vectors)
marchwarden arxiv info 2403.12345 # show metadata + chunk count
# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" --researcher arxiv
# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv # fan out, merge in CLI
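The `arxiv` subcommand surface above maps cleanly onto argparse subparsers. A sketch only — the real CLI wiring (and whatever framework Marchwarden actually uses) may differ:

```python
import argparse

def build_arxiv_parser():
    """Illustrative parser for the `marchwarden arxiv` subcommands."""
    parser = argparse.ArgumentParser(prog="marchwarden arxiv")
    sub = parser.add_subparsers(dest="command", required=True)

    add = sub.add_parser("add", help="download, parse, embed, index")
    add.add_argument("ids", nargs="+", help="one or more arXiv IDs, e.g. 2403.12345")

    sub.add_parser("list", help="show indexed papers")

    remove = sub.add_parser("remove", help="drop from index and delete vectors")
    remove.add_argument("id")

    info = sub.add_parser("info", help="show metadata + chunk count")
    info.add_argument("id")
    return parser
```

Note that `nargs="+"` on `add` gives the batch form (`arxiv add 2403.12345 2401.00001`) for free.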
Resolved decisions (was: Open questions)
- **Embeddings: local vs API.** ✅ Resolved 2026-04-08: start with `nomic-embed-text-v1.5` (free, local). `voyage-3` upgrade path via the `MARCHWARDEN_ARXIV_EMBED_MODEL` env var, deferred until a real-world quality review.
- **BibTeX import.** ✅ Resolved 2026-04-08: skip for v1. `arxiv add <id>` only. A BibTeX importer is a future helper.
- **Paper versions.** ✅ Resolved 2026-04-08: pin to whatever the user supplies. Never auto-update. `marchwarden arxiv update <id>` will exist as an explicit action later.
- **Chunk-id stability.** ✅ Resolved 2026-04-08: make the embedding model part of the chunk ID hash and store it in `papers.json`. Re-ingesting with a different model creates a new collection rather than overwriting old citations.
- **Cost ledger fields.** ✅ Resolved 2026-04-08: add an `embedding_calls` field to ledger entries (parallel to `tavily_searches`); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
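The chunk-id decision is small enough to show concretely. A sketch with an assumed key layout — the point is only that the embedding model participates in the hash, so a model swap produces new IDs instead of clobbering chunks that old citations reference:

```python
import hashlib

def chunk_id(arxiv_id, version, section, embedding_model, chunk_text):
    """Stable, content-addressed chunk ID.

    Including the embedding model means re-ingesting the same paper with a
    different model yields different IDs (a new collection), never an
    in-place overwrite of chunks that existing citations point at.
    """
    # \x1f (unit separator) keeps field boundaries unambiguous in the key.
    key = "\x1f".join([arxiv_id, version, section, embedding_model, chunk_text])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```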
Success criteria for V1 of this researcher
- `marchwarden arxiv add 2403.12345` works end-to-end (download → extract → chunk → embed → store)
- `marchwarden arxiv list` shows the indexed papers with metadata
- `marchwarden ask "..." --researcher arxiv` returns a `ResearchResult` with the same shape as the web researcher
- Citations point to the correct arXiv URL with verbatim chunk text in `raw_excerpt`
- Cost ledger records `embedding_calls` separately
- Trace JSONL captures every retrieval / re-rank / synthesis step
- At least one cross-researcher manual smoke test: ask the same question of `--researcher web` and `--researcher arxiv` and confirm the contracts compose visually
- All existing tests still pass
Alternatives considered (and rejected for v1)
- Live arXiv search instead of pre-indexed corpus. Loses the "private curated subset" advantage and forces an embedding pass on every query.
- Whole-paper embeddings. Too coarse for "what did the authors say in methods?" queries.
- Sliding-window chunking. Standard RAG default but breaks the contract's verifiable raw_excerpt requirement.
- `pgvector` vector store. Adds Postgres as a runtime dependency for a single-user local tool — overkill.
- `science-parse` PDF extractor. Java runtime complicates deployment; the quality gain over pymupdf isn't worth it for v1.
- Skipping the agent loop and doing one-shot RAG. Loses Marchwarden's "agent that decides what to retrieve next" advantage. Reduces this to a generic RAG library.
Phasing
This proposal is sized for a single new milestone phase (parallel to the existing roadmap). Suggested sequencing:
- Sign off this proposal — confirm decisions, file sub-issues
- A.1: Ingest pipeline (smallest visible win)
- A.2: Retrieval primitive
- A.3: ArxivResearcher agent
- A.4: MCP server
- A.5: CLI integration
- A.6: Cost ledger integration
- Smoke test: index ~5 papers from your real reading list, ask 3 questions, document the run
After this lands, the contract is empirically validated across two researcher types — Phase 5 of the original roadmap is partially fulfilled and the PI orchestrator (Phase 6) becomes a smaller leap.
See also: Architecture, Research Contract, Roadmap, User Guide