Add arxiv-rag implementation proposal
Detailed design proposal for the second researcher: a RAG over a user-curated arXiv reading list. Locks in chunking, vector store, embeddings, PDF extraction, and storage layout decisions; lists open questions and success criteria. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Implementation Proposal: arxiv-rag Researcher

**Status:** Draft — awaiting review
**Tracking issue:** [#37](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues/37)
**Sister to:** Roadmap M5.1 (grep-based file researcher) — different tool, same contract

---

## Motivation

Marchwarden's V1 web researcher is excellent at answering "what does the public internet say about X?" but blind to the academic literature. arXiv specifically is a great fit for a Retrieval-Augmented Generation (RAG) researcher because:

- **The corpus is private to the user's interests.** Even though arXiv is public, the curated subset of papers a researcher actually cares about is unique to them. The web researcher can't reason over that subset.
- **Semantic queries are essential.** Keyword search misses synonyms ("diffusion model" / "denoising score matching") and conceptual queries ("methods that don't require labeled data" → "self-supervised learning").
- **The corpus is stable.** Published papers don't change. Index staleness — usually the biggest pain point in production RAG — is a non-problem here: the store only ever grows.
- **Contract fit is natural.** Each retrieved chunk *is* a verifiable `raw_excerpt` with a stable URL (the arXiv abs page). The Marchwarden Research Contract's evidence model maps cleanly onto RAG outputs.

This is also the first opportunity to prove the contract works across researcher types — exactly the bar the Roadmap sets for Phase 5.

---

## Goals and non-goals

### Goals

1. A second working researcher implementing the v1 ResearchContract
2. CLI ergonomics for ingesting arXiv papers (`marchwarden arxiv add 2403.12345`)
3. The same `research(question)` MCP tool surface as the web researcher
4. End-to-end question → grounded answer with citations from the local paper store
5. Validation that the contract composes — two researchers, same shape, ready for a future PI orchestrator to blend

### Non-goals (V1 of this researcher)

- Live arXiv search / fully automatic corpus growth (a future phase; see Alternatives)
- Cross-researcher orchestration — that's the PI agent (Phase 6)
- Citation graph traversal (references, cited-by)
- Production-grade math notation fidelity beyond what pymupdf provides
- Multi-user / shared corpus — single-user, local store only

---

## High-level architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         marchwarden CLI                         │
│     ask --researcher arxiv "question"   |   arxiv add <id>      │
└────────────────────────┬────────────────────────────────────────┘
                         │ MCP stdio
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/server.py                   │
│                   (FastMCP, exposes research)                   │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                   researchers/arxiv/agent.py                    │
│                                                                 │
│   plan → retrieve(query) → re-rank → iterate → synthesize       │
│                  │                                              │
│                  ▼                                              │
│          researchers/arxiv/store.py                             │
│   (chromadb wrapper, returns top-K chunks with metadata)        │
└──────────────┬──────────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────────┐
│   ~/.marchwarden/arxiv-rag/                                     │
│   ├── papers.json        (manifest)                             │
│   ├── pdfs/<id>.pdf      (cached PDFs)                          │
│   └── chroma/            (vector store + chunks)                │
└─────────────────────────────────────────────────────────────────┘
```

---

## Design decisions

### 1. Corpus scope: user-curated reading list (Option A)

Three options were considered:

| Option | Description | Ship complexity | Long-term ceiling |
|---|---|---|---|
| **A** | User curates a reading list via `arxiv add`; only those papers are indexed | Low | Medium |
| **B** | Live arXiv search + semantic rerank, no persistent store | Medium | Medium |
| **C** | Hybrid: live search + everything fetched is cached and reusable | High | High |

**Choice: A first, evolve to C later.** A is the smallest thing that's actually useful (a literature review on a defined topic) and ships fast. C is the long-term winner but adds two open design questions (cache eviction policy, cache freshness signaling) that are easier to answer once A is in production.

### 2. PDF extraction: `pymupdf` (fitz)

| Tool | Pros | Cons |
|---|---|---|
| **pymupdf** | Fast, simple API, decent quality, no GPU needed | Math notation is approximate |
| `marker` | Excellent quality, outputs clean markdown, handles math better | Slower, larger dependency |
| `nougat` (Meta) | Best math fidelity | Heavy model, slow, GPU recommended |
| `science-parse` | Section-aware parser | Java, more complex deployment |

**Choice: `pymupdf` for v1.** Trade math fidelity for ship speed. Swap to `marker` if real-world use shows math-heavy papers are unusable.

### 3. Chunking strategy: section-level

| Strategy | Pros | Cons |
|---|---|---|
| Whole-paper | One vector per paper | Recall too coarse for "what did the authors say in methods?" |
| **Section-level** | Preserves logical boundaries, maps to verifiable claims | Requires section detection (heuristic on heading patterns) |
| Sliding window (500 tok) | Standard RAG default; high recall | Cuts sentences mid-claim, breaking the contract's verifiable `raw_excerpt` |

**Choice: section-level.** The contract's `raw_excerpt` field requires verbatim verifiable text — sliding-window chunking risks splitting "the authors found X" from its qualifier. Sections preserve context. Detection: heuristic pass on PDF headings (intro, related work, methods, experiments, results, discussion, conclusion, references), with whole-paper fallback if structure isn't detected.
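
A minimal sketch of the heading heuristic, assuming it runs over already-extracted plain text. The heading list and function name are assumptions, not the final implementation:

```python
import re

# Common arXiv section headings, matched case-insensitively when alone on a
# line, with an optional leading number ("3. Methods", "IV. Results").
_HEADING = re.compile(
    r"^(?:[\dIVX]+[.\s]+)?"
    r"(abstract|introduction|related work|background|methods?|"
    r"experiments?|results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text: str) -> list[tuple[str, str]]:
    """Split extracted paper text into (heading, body) chunks.

    Falls back to a single whole-paper chunk when no headings are
    detected, per the v1 design.
    """
    matches = list(_HEADING.finditer(text))
    if not matches:
        return [("paper", text)]
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).lower(), text[m.end():end].strip()))
    return sections
```

Each `(heading, body)` pair becomes one chunk, so the `raw_excerpt` stays a verbatim contiguous span of the paper.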

### 4. Vector store: `chromadb`

| Store | Pros | Cons |
|---|---|---|
| **chromadb** | Embedded, file-backed, simple Python API, no server | Fewer filter features than alternatives |
| qdrant (local) | Better filter support (author/year/category) | Slightly heavier setup |
| pgvector | Solid if you already run Postgres | New infrastructure dependency |
| FAISS | Fastest, lowest-level | No metadata filtering, no manifest |

**Choice: `chromadb`.** Fits the "writes to `~/.marchwarden/` only" pattern. If filter expressivity becomes a problem, qdrant is a low-cost migration (similar API surface).

### 5. Embedding model: `nomic-embed-text-v1.5` (local) with a `voyage-3` upgrade path

| Model | Cost | Quality on technical text | Setup |
|---|---|---|---|
| **nomic-embed-text-v1.5** | Free, local, CPU-OK | Good | `pip install` |
| voyage-3 | API ($0.06/Mtok) | Excellent | API key in `~/secrets` |
| text-embedding-3-small (OpenAI) | API ($0.02/Mtok) | Decent | API key in `~/secrets` |
| text-embedding-3-large (OpenAI) | API ($0.13/Mtok) | Very good | API key in `~/secrets` |

**Choice: `nomic-embed-text-v1.5` for v1.** Zero cost, runs offline, "good enough" for arxiv-quality text. Easy swap to `voyage-3` later via `MARCHWARDEN_ARXIV_EMBED_MODEL=voyage-3` if quality reviews show problems. Embedding model choice is config, not architecture.
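
The "config, not architecture" point can be made concrete with a small resolver keyed off the environment variable. The registry contents here are illustrative, not a committed price table:

```python
import os

# Registry of known embedding backends. The nomic model runs locally;
# voyage-3 would call an API (client construction is out of scope here).
EMBED_MODELS = {
    "nomic-embed-text-v1.5": {"kind": "local", "cost_per_mtok": 0.0},
    "voyage-3": {"kind": "api", "cost_per_mtok": 0.06},
}

def resolve_embed_model() -> tuple[str, dict]:
    """Pick the embedding backend from config, defaulting to the local model."""
    name = os.environ.get("MARCHWARDEN_ARXIV_EMBED_MODEL", "nomic-embed-text-v1.5")
    if name not in EMBED_MODELS:
        raise ValueError(f"unknown embedding model: {name}")
    return name, EMBED_MODELS[name]
```

Anything downstream (ingest, retrieval, cost ledger) asks the resolver rather than hard-coding a model name.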

### 6. Research loop pattern

The arXiv researcher reuses the same plan → tool → iterate → synthesize loop the web researcher uses. The differences:

| | Web researcher | arXiv researcher |
|---|---|---|
| Tools the agent calls | `web_search`, `fetch_url` | `retrieve_chunks`, `read_full_section` |
| Source of evidence | Tavily + httpx | Local chromadb |
| Citation locator | URL | `https://arxiv.org/abs/<id>` |
| Citation `raw_excerpt` | Page text excerpt | Chunk text |
| Discovery events | "Try a different researcher" | "This paper cites X — consider adding it to the corpus" |
| Confidence factors | `source_authority` based on .gov/.edu/etc. | All sources are peer-reviewed by design; authority based on venue / citation count if available |

The synthesis prompt is adapted for academic tone, but the JSON output schema is identical to the web researcher's (same `ResearchResult` model).

---

## Storage layout

```
~/.marchwarden/arxiv-rag/
├── papers.json        # manifest: id → {title, authors, year, added_at, version}
├── pdfs/
│   ├── 2403.12345v1.pdf
│   └── 2401.00001v2.pdf
└── chroma/            # chromadb persistent store
    └── ...            # vectors + chunk metadata
```

`papers.json` schema:

```json
{
  "2403.12345": {
    "version": "v1",
    "title": "Diffusion Models for Protein Folding",
    "authors": ["Alice Smith", "Bob Jones"],
    "year": 2024,
    "added_at": "2026-04-08T22:00:00Z",
    "category": "cs.LG",
    "chunks_indexed": 12,
    "embedding_model": "nomic-embed-text-v1.5"
  }
}
```
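
A stdlib-only helper to maintain this manifest could look like the following sketch. The function name is an assumption; the field names follow the schema above:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_paper(manifest_path: Path, arxiv_id: str, version: str, title: str,
                 authors: list[str], year: int, category: str,
                 chunks_indexed: int, embedding_model: str) -> None:
    """Insert or update one entry in papers.json (read-modify-write)."""
    manifest = (
        json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    )
    manifest[arxiv_id] = {
        "version": version,
        "title": title,
        "authors": authors,
        "year": year,
        "added_at": datetime.now(timezone.utc).isoformat(),
        "category": category,
        "chunks_indexed": chunks_indexed,
        "embedding_model": embedding_model,
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
```

Recording `embedding_model` per paper is what lets a later re-ingest detect model mismatches (see Open questions, chunk-id stability).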

---

## CLI surface

```bash
# Ingest
marchwarden arxiv add 2403.12345              # download, parse, embed, index
marchwarden arxiv add 2403.12345 2401.00001   # batch
marchwarden arxiv list                        # show indexed papers
marchwarden arxiv remove 2403.12345           # drop from index (also deletes vectors)
marchwarden arxiv info 2403.12345             # show metadata + chunk count

# Research
marchwarden ask "What chunking strategies do RAG papers recommend?" --researcher arxiv

# Stretch (post-V1):
marchwarden ask "..." --researchers web,arxiv  # fan out, merge in CLI
```
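
The v1 surface above maps onto argparse roughly as follows (handler wiring omitted; the stretch `--researchers` fan-out flag is deliberately left out):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="marchwarden")
    sub = parser.add_subparsers(dest="command", required=True)

    # marchwarden arxiv {add,list,remove,info}
    arxiv = sub.add_parser("arxiv", help="manage the local arXiv corpus")
    arxiv_sub = arxiv.add_subparsers(dest="action", required=True)
    arxiv_sub.add_parser("add").add_argument("ids", nargs="+")  # batch-friendly
    arxiv_sub.add_parser("list")
    arxiv_sub.add_parser("remove").add_argument("id")
    arxiv_sub.add_parser("info").add_argument("id")

    # marchwarden ask "..." --researcher arxiv
    ask = sub.add_parser("ask", help="run a research question")
    ask.add_argument("question")
    ask.add_argument("--researcher", default="web", choices=["web", "arxiv"])
    return parser
```

`nargs="+"` on `add` gives the batch form for free; `--researcher` keeps `web` as the default so existing invocations don't change behavior.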

---

## Open questions

1. **Embeddings: local vs API.** Start with `nomic-embed-text-v1.5` (free, local). Add `voyage-3` upgrade path via env var. Defer the decision until real queries are flowing — quality is hard to evaluate in the abstract.

2. **BibTeX import.** Many users keep arXiv references in BibTeX (`.bib`) files from Zotero / LaTeX. Should `arxiv add` accept a `.bib` file and ingest every arXiv ID it finds? **Recommendation: no for v1.** Keep `arxiv add <id>` simple. BibTeX import is a one-off helper script that can come later.

3. **Paper versions.** arXiv papers have versions (`2403.12345v1`, `v2`, …). Three policies:
   - **Pin** — index whatever the user supplies; never auto-update
   - **Always latest** — re-fetch on every `marchwarden arxiv refresh`, replacing chunks
   - **Track both** — index every version separately, distinguishing them in citations

   **Recommendation: pin for v1.** Simplest. `arxiv update <id>` can become an explicit user action later.

4. **Chunk-id stability.** If we re-ingest with a new embedding model, chunk IDs change. Citations in past traces would become unresolvable. **Recommendation:** make embedding model part of the chunk ID hash, and store it in `papers.json`. A re-ingest creates a new collection rather than overwriting.
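
   One way to realize that recommendation is to fold the embedding model into the chunk-ID derivation itself (a sketch; the exact key layout is a design choice):

   ```python
   import hashlib

   def chunk_id(arxiv_id: str, version: str, section: str, embedding_model: str) -> str:
       """Derive a stable, deterministic chunk ID.

       Including the embedding model means a re-ingest under a new model
       produces new IDs (a new collection) instead of silently overwriting
       chunks that citations in old traces point at.
       """
       key = f"{arxiv_id}{version}:{section}:{embedding_model}"
       return hashlib.sha256(key.encode()).hexdigest()[:16]
   ```

   The same inputs always yield the same ID, so re-running ingest with an unchanged model is idempotent.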

5. **Cost ledger fields.** What does "cost" mean for a researcher that uses local embeddings? **Recommendation:** add an `embedding_calls` field to ledger entries (similar to `tavily_searches`); $0 for local embeddings, real cost for API embeddings. The synthesis call still bills via the existing model price table.

---

## Success criteria for V1 of this researcher

- [ ] `marchwarden arxiv add 2403.12345` works end-to-end (download → extract → chunk → embed → store)
- [ ] `marchwarden arxiv list` shows the indexed papers with metadata
- [ ] `marchwarden ask "..." --researcher arxiv` returns a `ResearchResult` with the same shape as the web researcher's
- [ ] Citations point to the correct arXiv URL, with verbatim chunk text in `raw_excerpt`
- [ ] The cost ledger records `embedding_calls` separately
- [ ] The trace JSONL captures every retrieval / re-rank / synthesis step
- [ ] At least one cross-researcher manual smoke test: ask the same question of `--researcher web` and `--researcher arxiv` and confirm visually that the contracts compose
- [ ] All existing tests still pass

---

## Alternatives considered (and rejected for v1)

- **Live arXiv search instead of a pre-indexed corpus.** Loses the "private curated subset" advantage and forces an embedding pass on every query.
- **Whole-paper embeddings.** Too coarse for "what did the authors say in methods?" queries.
- **Sliding-window chunking.** The standard RAG default, but it breaks the contract's verifiable `raw_excerpt` requirement.
- **`pgvector` vector store.** Adds Postgres as a runtime dependency for a single-user local tool — overkill.
- **`science-parse` PDF extractor.** The Java runtime complicates deployment, and the quality gain over pymupdf isn't worth it for v1.
- **Skipping the agent loop and doing one-shot RAG.** Loses Marchwarden's "agent that decides what to retrieve next" advantage and reduces this to a generic RAG library.

---

## Phasing

This proposal is sized for **a single new milestone phase** (parallel to the existing roadmap). Suggested sequencing:

1. **Sign off on this proposal** — confirm decisions, file sub-issues
2. **A.1: Ingest pipeline** (smallest visible win)
3. **A.2: Retrieval primitive**
4. **A.3: ArxivResearcher agent**
5. **A.4: MCP server**
6. **A.5: CLI integration**
7. **A.6: Cost ledger integration**
8. **Smoke test** — index ~5 papers from a real reading list, ask 3 questions, document the run

After this lands, the contract is empirically validated across two researcher types: Phase 5 of the original roadmap is partially fulfilled, and the PI orchestrator (Phase 6) becomes a smaller leap.

---

See also: [Architecture](Architecture), [Research Contract](ResearchContract), [Roadmap](Roadmap), [User Guide](UserGuide)