ArxivRagProposal
Jeff Smith edited this page 2026-04-08 17:16:51 -06:00

Implementation Proposal: arxiv-rag Researcher

Status: Approved 2026-04-08
Tracking issue: #37
Sister to: Roadmap M5.1 (grep-based file researcher) — different tool, same contract


Motivation

Marchwarden's V1 web researcher is excellent at "what does the public internet say about X?" but blind to the literature in academic papers. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:

  • The corpus is private to the user's interest. Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
  • Semantic queries are essential. Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
  • The corpus is stable. Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here. The store only ever grows.
  • Contract fit is natural. Each retrieved chunk is a verifiable raw_excerpt with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.

This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.


Goals and non-goals

Goals

  1. A second working researcher implementing the v1 ResearchContract
  2. CLI ergonomics for ingesting arXiv papers (marchwarden arxiv add 2403.12345)
  3. Same research(question) MCP tool surface as the web researcher
  4. End-to-end question → grounded answer with citations from the local paper store
  5. Validate that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend

Non-goals (V1 of this researcher)

  • Live arXiv search / fully automatic corpus growth (a future phase, see Alternatives)
  • Cross-researcher orchestration — that's the PI agent (Phase 6)
  • Citation graph traversal (references, cited-by)
  • Production-grade math notation fidelity beyond what pymupdf provides
  • Multi-user / shared corpus — single-user, local store only

High-level architecture

┌─────────────────────────────────────────────────────────────────┐
│                         marchwarden CLI                          │
│   ask --researcher arxiv "question"   |   arxiv add <id>         │
└────────────────────────┬────────────────────────────────────────┘
                         │ MCP stdio
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                 researchers/arxiv/server.py                      │
│                  (FastMCP, exposes research)                     │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/agent.py                     │
│                                                                  │
│   plan → retrieve(query) → re-rank → iterate → synthesize        │
│              │                                                   │
│              ▼                                                   │
│       researchers/arxiv/store.py                                 │
│       (chromadb wrapper, returns top-K chunks with metadata)     │
└──────────────┬──────────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────────┐
│            ~/.marchwarden/arxiv-rag/                             │
│            ├── papers.json     (manifest)                        │
│            ├── pdfs/<id>.pdf   (cached PDFs)                     │
│            └── chroma/         (vector store + chunks)           │
└─────────────────────────────────────────────────────────────────┘

Design decisions

1. Corpus scope: user-curated reading list (Option A)

Three options were considered:

| Option | Description | Ship complexity | Long-term ceiling |
|--------|-------------|-----------------|-------------------|
| A | User curates a reading list via arxiv add; only those papers are indexed | Low | Medium |
| B | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| C | Hybrid: live search + everything fetched is cached and reusable | High | High |

Choice: A first, evolve to C later. A is the smallest thing that's actually useful (literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once we have A in production.

2. PDF extraction: pymupdf (fitz)

| Tool | Pros | Cons |
|------|------|------|
| pymupdf | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| marker | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| nougat (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| science-parse | Section-aware parser | Java, more complex deployment |

Choice: pymupdf for v1. Trade math fidelity for ship speed. Swap to marker if real-world use shows math-heavy papers are unusable.

3. Chunking strategy: section-level

| Strategy | Pros | Cons |
|----------|------|------|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" |
| Section-level | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window (500 tok) | Standard RAG default; high recall | Cuts sentences mid-claim, breaks contract's verifiable raw_excerpt |

Choice: section-level. The contract's raw_excerpt field requires verbatim verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with whole-paper fallback if structure isn't detected.
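The heading heuristic described above can be sketched in a few lines. This is a dependency-free illustration, not the actual ingest code: the function name, the exact heading list, and the `(name, text)` return shape are all assumptions; only the heading vocabulary and the whole-paper fallback come from the proposal.

```python
import re

# Headings the heuristic recognizes (mirrors the list in the proposal).
SECTION_HEADINGS = [
    "abstract", "introduction", "related work", "background", "methods",
    "methodology", "experiments", "results", "discussion", "conclusion",
    "references",
]
# Matches lines like "3 Methods", "3. Methods", or bare "Methods".
_HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(SECTION_HEADINGS) + r")\s*$",
    re.IGNORECASE,
)

def split_into_sections(text: str) -> list[tuple[str, str]]:
    """Split extracted PDF text on recognized headings.

    Returns [(section_name, section_text), ...]. If no headings are
    detected, falls back to a single whole-paper chunk, as specified.
    """
    sections: list = []
    current_name, current_body = "whole-paper", []
    found_any = False
    for line in text.splitlines():
        m = _HEADING_RE.match(line)
        if m:
            found_any = True
            if current_body:
                sections.append((current_name, current_body))
            current_name, current_body = m.group(1).lower(), []
        else:
            current_body.append(line)
    sections.append((current_name, current_body))
    if not found_any:
        return [("whole-paper", text)]
    return [(name, "\n".join(body).strip()) for name, body in sections]
```

Real PDF headings are noisier than this (font-size cues from pymupdf would help), which is why the whole-paper fallback matters.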

4. Vector store: chromadb

| Store | Pros | Cons |
|-------|------|------|
| chromadb | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant local | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |

Choice: chromadb. Fits the "writes to ~/.marchwarden/ only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).
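What store.py asks of chromadb is narrow: top-K similarity search that returns chunks with their metadata. As a dependency-free sketch of that contract, here is an in-memory cosine top-K; the class and method names are hypothetical stand-ins, and chromadb's collection.query() would replace all of this (plus give persistence and metadata filters).

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryChunkStore:
    """Illustrative stand-in for the chromadb-backed store.py wrapper."""

    def __init__(self):
        self._rows = []  # (embedding, chunk_text, metadata)

    def add(self, embedding, chunk_text, metadata):
        self._rows.append((embedding, chunk_text, metadata))

    def query(self, query_embedding, k=3):
        """Return the top-K (score, chunk_text, metadata) triples."""
        scored = [
            (cosine(query_embedding, emb), text, meta)
            for emb, text, meta in self._rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

In the real wrapper, metadata would carry the arXiv id, section name, and abs-page URL so every hit can become a contract citation directly.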

5. Embedding model: nomic-embed-text-v1.5 (local) with voyage-3 upgrade path

| Model | Cost | Quality on technical text | Setup |
|-------|------|---------------------------|-------|
| nomic-embed-text-v1.5 | Free, local, CPU-OK | Good | pip install |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in ~/secrets |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in ~/secrets |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in ~/secrets |

Choice: nomic-embed-text-v1.5 for v1. Zero cost, runs offline, "good enough" for arxiv-quality text. Easy swap to voyage-3 later via MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3 if quality reviews show problems. Embedding model choice is config, not architecture.
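"Config, not architecture" could look like the resolver below. The MARCHWARDEN_ARXIV_EMBED_MODEL variable and the model names/prices come from this proposal; the registry shape and function name are illustrative assumptions.

```python
import os

# Known models; entries mirror the comparison table above.
EMBED_MODELS = {
    "nomic-embed-text-v1.5":  {"local": True,  "usd_per_mtok": 0.0},
    "voyage-3":               {"local": False, "usd_per_mtok": 0.06},
    "text-embedding-3-small": {"local": False, "usd_per_mtok": 0.02},
    "text-embedding-3-large": {"local": False, "usd_per_mtok": 0.13},
}
DEFAULT_MODEL = "nomic-embed-text-v1.5"

def resolve_embed_model(env=None) -> str:
    """Pick the embedding model from the environment, defaulting to local.

    Rejecting unknown names early keeps a typo from silently creating
    a mismatched collection (see the chunk-id stability decision below).
    """
    env = os.environ if env is None else env
    name = env.get("MARCHWARDEN_ARXIV_EMBED_MODEL", DEFAULT_MODEL)
    if name not in EMBED_MODELS:
        raise ValueError(f"unknown embedding model: {name}")
    return name
```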

6. Research loop pattern

The arXiv researcher reuses the same plan → tool → iterate → synthesize loop the web researcher uses. The differences:

| | Web researcher | arXiv researcher |
|---|----------------|------------------|
| Tools the agent calls | web_search, fetch_url | retrieve_chunks, read_full_section |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | https://arxiv.org/abs/<id> |
| Citation raw_excerpt | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding to corpus" |
| Confidence factors | source_authority based on .gov/.edu, etc. | All sources are peer-reviewed by design; authority based on venue / cite count if available |

The synthesis prompt is adapted for academic tone but the JSON output schema is identical to the web researcher (same ResearchResult model).
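The citation mapping in the table above can be made concrete with a small sketch. Only the url and raw_excerpt fields are taken from this proposal's contract mapping; the class name, the chunk_id field, and the helper are illustrative assumptions, not the real ResearchResult model.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """Minimal citation shape implied by the contract mapping above."""
    url: str          # stable locator: the arXiv abs page
    raw_excerpt: str  # verbatim chunk text, verifiable against the PDF
    chunk_id: str     # hypothetical field linking back to the store row

def arxiv_citation(arxiv_id: str, chunk_text: str, chunk_id: str) -> Citation:
    """Turn a retrieved chunk into a contract-shaped citation."""
    return Citation(
        url=f"https://arxiv.org/abs/{arxiv_id}",
        raw_excerpt=chunk_text,
        chunk_id=chunk_id,
    )
```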


Storage layout

~/.marchwarden/arxiv-rag/
├── papers.json              # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│   ├── 2403.12345v1.pdf
│   └── 2401.00001v2.pdf
└── chroma/                  # chromadb persistent store
    └── ...                  # vectors + chunk metadata

papers.json schema:

{
  "2403.12345": {
    "version": "v1",
    "title": "Diffusion Models for Protein Folding",
    "authors": ["Alice Smith", "Bob Jones"],
    "year": 2024,
    "added_at": "2026-04-08T22:00:00Z",
    "category": "cs.LG",
    "chunks_indexed": 12,
    "embedding_model": "nomic-embed-text-v1.5"
  }
}
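A manifest helper for this schema might look like the following. The function name and merge behavior are assumptions; only the file name and the entry fields come from the schema above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def add_to_manifest(manifest_path: Path, arxiv_id: str, entry: dict) -> dict:
    """Insert one paper entry into papers.json, creating the file if absent.

    The entry is expected to follow the schema above; this helper only
    fills in added_at. Illustrative, not the real ingest code.
    """
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    entry = {**entry, "added_at": datetime.now(timezone.utc).isoformat()}
    manifest[arxiv_id] = entry
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```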

CLI surface

# Ingest
marchwarden arxiv add 2403.12345              # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001   # batch
marchwarden arxiv list                        # show indexed papers
marchwarden arxiv remove 2403.12345           # drop from index (also delete vectors)
marchwarden arxiv info 2403.12345             # show metadata + chunk count

# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" --researcher arxiv

# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv      # fan out, merge in CLI
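The subcommand surface above can be sketched with stdlib argparse. This is only a shape check of the commands listed; the real CLI may use a different framework and richer help text.

```python
import argparse

def build_arxiv_parser() -> argparse.ArgumentParser:
    """Sketch of the `marchwarden arxiv` subcommands shown above."""
    parser = argparse.ArgumentParser(prog="marchwarden arxiv")
    sub = parser.add_subparsers(dest="command", required=True)

    add = sub.add_parser("add", help="download, parse, embed, index")
    add.add_argument("ids", nargs="+")  # one or more ids: batch ingest

    sub.add_parser("list", help="show indexed papers")

    remove = sub.add_parser("remove", help="drop from index, delete vectors")
    remove.add_argument("id")

    info = sub.add_parser("info", help="show metadata + chunk count")
    info.add_argument("id")
    return parser
```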

Resolved decisions (was: Open questions)

  1. Embeddings: local vs API. Resolved 2026-04-08: start with nomic-embed-text-v1.5 (free, local). voyage-3 upgrade path via MARCHWARDEN_ARXIV_EMBED_MODEL env var, deferred until real-world quality review.

  2. BibTeX import. Resolved 2026-04-08: skip for v1. arxiv add <id> only. BibTeX importer is a future helper.

  3. Paper versions. Resolved 2026-04-08: pin to whatever the user supplies. Never auto-update. marchwarden arxiv update <id> will exist as an explicit action later.

  4. Chunk-id stability. Resolved 2026-04-08: make embedding model part of the chunk ID hash, store it in papers.json. Re-ingest with a different model creates a new collection rather than overwriting old citations.

  5. Cost ledger fields. Resolved 2026-04-08: add an embedding_calls field to ledger entries (parallel to tavily_searches); $0 for local, real cost for API embeddings. The synthesis call still bills via the existing model price table.
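Resolved decision 4 (chunk-id stability) can be sketched directly: fold the embedding model into the ID hash so a model swap produces new IDs rather than overwriting rows that old citations point at. The function name and key order are illustrative choices.

```python
import hashlib

def chunk_id(arxiv_id: str, version: str, section: str,
             chunk_index: int, embedding_model: str) -> str:
    """Deterministic chunk ID per resolved decision 4.

    Same inputs always hash to the same ID; changing the embedding
    model changes every ID, which effectively creates a new collection.
    """
    key = "|".join([arxiv_id, version, section, str(chunk_index), embedding_model])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```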


Success criteria for V1 of this researcher

  • marchwarden arxiv add 2403.12345 works end-to-end (download → extract → chunk → embed → store)
  • marchwarden arxiv list shows the indexed papers with metadata
  • marchwarden ask "..." --researcher arxiv returns a ResearchResult with the same shape as the web researcher
  • Citations point to the correct arXiv URL with verbatim chunk text in raw_excerpt
  • Cost ledger records embedding_calls separately
  • Trace JSONL captures every retrieval / re-rank / synthesis step
  • At least one cross-researcher manual smoke test: ask the same question of --researcher web and --researcher arxiv and confirm by inspection that the two results share the contract shape
  • All existing tests still pass

Alternatives considered (and rejected for v1)

  • Live arXiv search instead of pre-indexed corpus. Loses the "private curated subset" advantage and forces an embedding pass on every query.
  • Whole-paper embeddings. Too coarse for "what did the authors say in methods?" queries.
  • Sliding-window chunking. Standard RAG default but breaks the contract's verifiable raw_excerpt requirement.
  • pgvector vector store. Adds Postgres as a runtime dependency for a single-user local tool — overkill.
  • science-parse PDF extractor. Java runtime complicates deployment; quality gain over pymupdf isn't worth it for v1.
  • Skipping the agent loop and doing one-shot RAG. Loses Marchwarden's "agent that decides what to retrieve next" advantage. Reduces this to a generic RAG library.

Phasing

This proposal is sized for a single new milestone phase (parallel to the existing roadmap). Suggested sequencing:

  1. Sign off this proposal — confirm decisions, file sub-issues
  2. A.1: Ingest pipeline (smallest visible win)
  3. A.2: Retrieval primitive
  4. A.3: ArxivResearcher agent
  5. A.4: MCP server
  6. A.5: CLI integration
  7. A.6: Cost ledger integration
  8. Smoke test: index ~5 papers from your real reading list, ask 3 questions, document the run

After this lands, the contract is empirically validated across two researcher types — Phase 5 of the original roadmap is partially fulfilled and the PI orchestrator (Phase 6) becomes a smaller leap.


See also: Architecture, Research Contract, Roadmap, User Guide