ArxivRagProposal
Jeff Smith edited this page 2026-04-08 17:16:51 -06:00

Implementation Proposal: arxiv-rag Researcher

Status: Approved 2026-04-08
Tracking issue: #37
Sister to: Roadmap M5.1 (grep-based file researcher) — different tool, same contract


Motivation

Marchwarden's V1 web researcher is excellent at "what does the public internet say about X?" but blind to the literature in academic papers. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:

  • The corpus is private to the user's interest. Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
  • Semantic queries are essential. Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
  • The corpus is stable. Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here. The store only ever grows.
  • Contract fit is natural. Each retrieved chunk is a verifiable raw_excerpt with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.

This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.


Goals and non-goals

Goals

  1. A second working researcher implementing the v1 ResearchContract
  2. CLI ergonomics for ingesting arXiv papers (marchwarden arxiv add 2403.12345)
  3. Same research(question) MCP tool surface as the web researcher
  4. End-to-end question → grounded answer with citations from the local paper store
  5. Validate that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend

Non-goals (V1 of this researcher)

  • Live arXiv search / fully automatic corpus growth (a future phase, see Alternatives)
  • Cross-researcher orchestration — that's the PI agent (Phase 6)
  • Citation graph traversal (references, cited-by)
  • Production-grade math notation fidelity beyond what pymupdf provides
  • Multi-user / shared corpus — single-user, local store only

High-level architecture

┌─────────────────────────────────────────────────────────────────┐
│                         marchwarden CLI                          │
│   ask --researcher arxiv "question"   |   arxiv add <id>         │
└────────────────────────┬────────────────────────────────────────┘
                         │ MCP stdio
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                 researchers/arxiv/server.py                      │
│                  (FastMCP, exposes research)                     │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/agent.py                     │
│                                                                  │
│   plan → retrieve(query) → re-rank → iterate → synthesize        │
│              │                                                   │
│              ▼                                                   │
│       researchers/arxiv/store.py                                 │
│       (chromadb wrapper, returns top-K chunks with metadata)     │
└──────────────┬──────────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────────┐
│            ~/.marchwarden/arxiv-rag/                             │
│            ├── papers.json     (manifest)                        │
│            ├── pdfs/<id>.pdf   (cached PDFs)                     │
│            └── chroma/         (vector store + chunks)           │
└─────────────────────────────────────────────────────────────────┘

Design decisions

1. Corpus scope: user-curated reading list (Option A)

Three options were considered:

| Option | Description | Ship complexity | Long-term ceiling |
|--------|-------------|-----------------|-------------------|
| A | User curates a reading list via arxiv add; only those papers are indexed | Low | Medium |
| B | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| C | Hybrid: live search + everything fetched is cached and reusable | High | High |

Choice: A first, evolve to C later. A is the smallest thing that's actually useful (literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once we have A in production.

2. PDF extraction: pymupdf (fitz)

| Tool | Pros | Cons |
|------|------|------|
| pymupdf | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| marker | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| nougat (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| science-parse | Section-aware parser | Java, more complex deployment |

Choice: pymupdf for v1. Trade math fidelity for ship speed. Swap to marker if real-world use shows math-heavy papers are unusable.

3. Chunking strategy: section-level

| Strategy | Pros | Cons |
|----------|------|------|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" |
| Section-level | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window (500 tok) | Standard RAG default; high recall | Cuts sentences mid-claim, breaks contract's verifiable raw_excerpt |

Choice: section-level. The contract's raw_excerpt field requires verbatim verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with whole-paper fallback if structure isn't detected.
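The heading heuristic described above can be sketched in a few lines. This is a dependency-free illustration, not the actual ingest code: the function name, the exact heading list, and the `(name, text)` return shape are all assumptions; only the heading vocabulary and the whole-paper fallback come from the proposal.

```python
import re

# Headings the heuristic recognizes (mirrors the list in the proposal).
SECTION_HEADINGS = [
    "abstract", "introduction", "related work", "background", "methods",
    "methodology", "experiments", "results", "discussion", "conclusion",
    "references",
]
# Matches lines like "3 Methods", "3. Methods", or bare "Methods".
_HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(SECTION_HEADINGS) + r")\s*$",
    re.IGNORECASE,
)

def split_into_sections(text: str) -> list[tuple[str, str]]:
    """Split extracted PDF text on recognized headings.

    Returns [(section_name, section_text), ...]. If no headings are
    detected, falls back to a single whole-paper chunk, as specified.
    """
    sections: list = []
    current_name, current_body = "whole-paper", []
    found_any = False
    for line in text.splitlines():
        m = _HEADING_RE.match(line)
        if m:
            found_any = True
            if current_body:
                sections.append((current_name, current_body))
            current_name, current_body = m.group(1).lower(), []
        else:
            current_body.append(line)
    sections.append((current_name, current_body))
    if not found_any:
        return [("whole-paper", text)]
    return [(name, "\n".join(body).strip()) for name, body in sections]
```

Real PDF headings are noisier than this (font-size cues from pymupdf would help), which is why the whole-paper fallback matters.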

4. Vector store: chromadb

| Store | Pros | Cons |
|-------|------|------|
| chromadb | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant local | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |

Choice: chromadb. Fits the "writes to ~/.marchwarden/ only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).
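What store.py asks of chromadb is narrow: top-K similarity search that returns chunks with their metadata. As a dependency-free sketch of that contract, here is an in-memory cosine top-K; the class and method names are hypothetical stand-ins, and chromadb's collection.query() would replace all of this (plus give persistence and metadata filters).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryChunkStore:
    """Illustrative stand-in for the chromadb-backed store.py wrapper."""

    def __init__(self):
        self._rows = []  # (embedding, chunk_text, metadata)

    def add(self, embedding, chunk_text, metadata):
        self._rows.append((embedding, chunk_text, metadata))

    def query(self, query_embedding, k=3):
        """Return the top-K (score, chunk_text, metadata) triples."""
        scored = [
            (cosine(query_embedding, emb), text, meta)
            for emb, text, meta in self._rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

In the real wrapper, metadata would carry the arXiv id, section name, and abs-page URL so every hit can become a contract citation directly.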

5. Embedding model: nomic-embed-text-v1.5 (local) with voyage-3 upgrade path

| Model | Cost | Quality on technical text | Setup |
|-------|------|---------------------------|-------|
| nomic-embed-text-v1.5 | Free, local, CPU-OK | Good | pip install |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in ~/secrets |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in ~/secrets |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in ~/secrets |

Choice: nomic-embed-text-v1.5 for v1. Zero cost, runs offline, "good enough" for arxiv-quality text. Easy swap to voyage-3 later via MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3 if quality reviews show problems. Embedding model choice is config, not architecture.
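"Config, not architecture" could look like the resolver below. The MARCHWARDEN_ARXIV_EMBED_MODEL variable and the model names/prices come from this proposal; the registry shape and function name are illustrative assumptions.

```python
import os

# Known models; entries mirror the comparison table above.
EMBED_MODELS = {
    "nomic-embed-text-v1.5":  {"local": True,  "usd_per_mtok": 0.0},
    "voyage-3":               {"local": False, "usd_per_mtok": 0.06},
    "text-embedding-3-small": {"local": False, "usd_per_mtok": 0.02},
    "text-embedding-3-large": {"local": False, "usd_per_mtok": 0.13},
}
DEFAULT_MODEL = "nomic-embed-text-v1.5"

def resolve_embed_model(env=None) -> str:
    """Pick the embedding model from the environment, defaulting to local.

    Rejecting unknown names early keeps a typo from silently creating
    a mismatched collection (see the chunk-id stability decision below).
    """
    env = os.environ if env is None else env
    name = env.get("MARCHWARDEN_ARXIV_EMBED_MODEL", DEFAULT_MODEL)
    if name not in EMBED_MODELS:
        raise ValueError(f"unknown embedding model: {name}")
    return name
```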

6. Research loop pattern

The arXiv researcher reuses the same plan → tool → iterate → synthesize loop the web researcher uses. The differences:

| | Web researcher | arXiv researcher |
|---|----------------|------------------|
| Tools the agent calls | web_search, fetch_url | retrieve_chunks, read_full_section |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | https://arxiv.org/abs/<id> |
| Citation raw_excerpt | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding to corpus" |
| Confidence factors | source_authority based on .gov/.edu, etc. | All sources are peer-reviewed by design; authority based on venue / cite count if available |

The synthesis prompt is adapted for academic tone but the JSON output schema is identical to the web researcher (same ResearchResult model).
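The citation mapping in the table above can be made concrete with a small sketch. Only the url and raw_excerpt fields are taken from this proposal's contract mapping; the class name, the chunk_id field, and the helper are illustrative assumptions, not the real ResearchResult model.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """Minimal citation shape implied by the contract mapping above."""
    url: str          # stable locator: the arXiv abs page
    raw_excerpt: str  # verbatim chunk text, verifiable against the PDF
    chunk_id: str     # hypothetical field linking back to the store row

def arxiv_citation(arxiv_id: str, chunk_text: str, chunk_id: str) -> Citation:
    """Turn a retrieved chunk into a contract-shaped citation."""
    return Citation(
        url=f"https://arxiv.org/abs/{arxiv_id}",
        raw_excerpt=chunk_text,
        chunk_id=chunk_id,
    )
```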


Storage layout

~/.marchwarden/arxiv-rag/
├── papers.json              # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│   ├── 2403.12345v1.pdf
│   └── 2401.00001v2.pdf
└── chroma/                  # chromadb persistent store
    └── ...                  # vectors + chunk metadata

papers.json schema:

{
  "2403.12345": {
    "version": "v1",
    "title": "Diffusion Models for Protein Folding",
    "authors": ["Alice Smith", "Bob Jones"],
    "year": 2024,
    "added_at": "2026-04-08T22:00:00Z",
    "category": "cs.LG",
    "chunks_indexed": 12,
    "embedding_model": "nomic-embed-text-v1.5"
  }
}
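A manifest helper for this schema might look like the following. The function name and merge behavior are assumptions; only the file name and the entry fields come from the schema above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def add_to_manifest(manifest_path: Path, arxiv_id: str, entry: dict) -> dict:
    """Insert one paper entry into papers.json, creating the file if absent.

    The entry is expected to follow the schema above; this helper only
    fills in added_at. Illustrative, not the real ingest code.
    """
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    entry = {**entry, "added_at": datetime.now(timezone.utc).isoformat()}
    manifest[arxiv_id] = entry
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```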

CLI surface

# Ingest
marchwarden arxiv add 2403.12345              # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001   # batch
marchwarden arxiv list                        # show indexed papers
marchwarden arxiv remove 2403.12345           # drop from index (also delete vectors)
marchwarden arxiv info 2403.12345             # show metadata + chunk count

# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" --researcher arxiv

# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv      # fan out, merge in CLI
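The subcommand surface above can be sketched with stdlib argparse. This is only a shape check of the commands listed; the real CLI may use a different framework and richer help text.

```python
import argparse

def build_arxiv_parser() -> argparse.ArgumentParser:
    """Sketch of the `marchwarden arxiv` subcommands shown above."""
    parser = argparse.ArgumentParser(prog="marchwarden arxiv")
    sub = parser.add_subparsers(dest="command", required=True)

    add = sub.add_parser("add", help="download, parse, embed, index")
    add.add_argument("ids", nargs="+")  # one or more ids: batch ingest

    sub.add_parser("list", help="show indexed papers")

    remove = sub.add_parser("remove", help="drop from index, delete vectors")
    remove.add_argument("id")

    info = sub.add_parser("info", help="show metadata + chunk count")
    info.add_argument("id")
    return parser
```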

Resolved decisions (was: Open questions)

  1. Embeddings: local vs API. Resolved 2026-04-08: start with nomic-embed-text-v1.5 (free, local). voyage-3 upgrade path via MARCHWARDEN_ARXIV_EMBED_MODEL env var, deferred until real-world quality review.

  2. BibTeX import. Resolved 2026-04-08: skip for v1. arxiv add <id> only. BibTeX importer is a future helper.

  3. Paper versions. Resolved 2026-04-08: pin to whatever the user supplies. Never auto-update. marchwarden arxiv update <id> will exist as an explicit action later.

  4. Chunk-id stability. Resolved 2026-04-08: make embedding model part of the chunk ID hash, store it in papers.json. Re-ingest with a different model creates a new collection rather than overwriting old citations.

  5. Cost ledger fields. Resolved 2026-04-08: add an embedding_calls field to ledger entries (parallel to tavily_searches); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
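Resolved decision 4 (chunk-id stability) can be sketched directly: fold the embedding model into the ID hash so a model swap produces new IDs rather than overwriting rows that old citations point at. The function name and key order are illustrative choices.

```python
import hashlib

def chunk_id(arxiv_id: str, version: str, section: str,
             chunk_index: int, embedding_model: str) -> str:
    """Deterministic chunk ID per resolved decision 4.

    Same inputs always hash to the same ID; changing the embedding
    model changes every ID, which effectively creates a new collection.
    """
    key = "|".join([arxiv_id, version, section, str(chunk_index), embedding_model])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```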


Success criteria for V1 of this researcher

  • marchwarden arxiv add 2403.12345 works end-to-end (download → extract → chunk → embed → store)
  • marchwarden arxiv list shows the indexed papers with metadata
  • marchwarden ask "..." --researcher arxiv returns a ResearchResult with the same shape as the web researcher
  • Citations point to the correct arXiv URL with verbatim chunk text in raw_excerpt
  • Cost ledger records embedding_calls separately
  • Trace JSONL captures every retrieval / re-rank / synthesis step
  • At least one cross-researcher manual smoke test: ask the same question of --researcher web and --researcher arxiv and confirm by inspection that the two results share the contract shape
  • All existing tests still pass

Alternatives considered (and rejected for v1)

  • Live arXiv search instead of pre-indexed corpus. Loses the "private curated subset" advantage and forces an embedding pass on every query.
  • Whole-paper embeddings. Too coarse for "what did the authors say in methods?" queries.
  • Sliding-window chunking. Standard RAG default but breaks the contract's verifiable raw_excerpt requirement.
  • pgvector vector store. Adds Postgres as a runtime dependency for a single-user local tool — overkill.
  • science-parse PDF extractor. Java runtime complicates deployment; quality gain over pymupdf isn't worth it for v1.
  • Skipping the agent loop and doing one-shot RAG. Loses Marchwarden's "agent that decides what to retrieve next" advantage. Reduces this to a generic RAG library.

Phasing

This proposal is sized for a single new milestone phase (parallel to the existing roadmap). Suggested sequencing:

  1. Sign off this proposal — confirm decisions, file sub-issues
  2. A.1: Ingest pipeline (smallest visible win)
  3. A.2: Retrieval primitive
  4. A.3: ArxivResearcher agent
  5. A.4: MCP server
  6. A.5: CLI integration
  7. A.6: Cost ledger integration
  8. Smoke test: index ~5 papers from your real reading list, ask 3 questions, document the run

After this lands, the contract is empirically validated across two researcher types — Phase 5 of the original roadmap is partially fulfilled and the PI orchestrator (Phase 6) becomes a smaller leap.


See also: Architecture, Research Contract, Roadmap, User Guide