M5.1.1 arxiv-rag: ingest pipeline (marchwarden arxiv add) #38

Closed
opened 2026-04-08 23:17:12 +00:00 by claude-code · 0 comments
Collaborator

First sub-milestone of Issue #37 (M5.1 arxiv-rag researcher). Design: ArxivRagProposal (https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ArxivRagProposal).

Goal

Standalone CLI command that takes an arXiv paper ID, downloads the PDF, extracts text, section-chunks it, embeds the chunks locally, and stores everything in ~/.marchwarden/arxiv-rag/.

Scope

  • New package researchers/arxiv/ (mirror of researchers/web/ layout)
  • New module researchers/arxiv/store.py — chromadb wrapper rooted at ~/.marchwarden/arxiv-rag/chroma/
  • New module researchers/arxiv/ingest.py:
    • download_pdf(arxiv_id) -> Path — uses arxiv API + httpx, caches to ~/.marchwarden/arxiv-rag/pdfs/
    • extract_sections(pdf_path) -> list[Section] — pymupdf, heuristic heading detection (intro/methods/results/conclusion/etc.), whole-paper fallback if no structure detected
    • embed_and_store(arxiv_id, sections, model_name) — nomic-embed-text-v1.5 via sentence-transformers, writes to chromadb
  • New CLI subgroup marchwarden arxiv with three commands for v1:
    • arxiv add <id> [<id> ...] — ingest
    • arxiv list — show indexed papers from papers.json
    • arxiv info <id> — show metadata + chunk count
  • Manifest at ~/.marchwarden/arxiv-rag/papers.json per the proposal schema
  • Chunk IDs include the embedding model name in the hash so re-ingest with a different model doesn't collide
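The heading heuristic described for extract_sections above could be sketched roughly as follows. This is a minimal illustration, not the implementation from the proposal: it works on text lines that have already been extracted from the PDF (the pymupdf step is left out), and the heading vocabulary, the `split_sections` name, the "preamble" and "full" titles, and the `Section` fields are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical heading vocabulary. The issue only names
# intro/methods/results/conclusion "etc.", so this list is an assumption.
_HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"  # optional numbering such as "2", "2.", "3.1"
    r"(abstract|introduction|related work|background|methods?|methodology|"
    r"experiments?|results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE,
)


@dataclass
class Section:
    # Assumed shape; the real Section type is defined by the proposal.
    title: str
    text: str


def split_sections(lines: list[str]) -> list[Section]:
    """Group lines under the most recent recognised heading.

    Text before the first heading is kept as a "preamble" section.
    If no heading is recognised at all, the whole paper is returned
    as a single section (the whole-paper fallback).
    """
    sections: list[Section] = []
    title, body = "preamble", []
    found_heading = False

    for line in lines:
        match = _HEADING_RE.match(line)
        if match:
            found_heading = True
            text = "\n".join(body).strip()
            if text:  # skip empty sections
                sections.append(Section(title, text))
            title, body = match.group(1).lower(), []
        else:
            body.append(line)

    text = "\n".join(body).strip()
    if not found_heading:
        return [Section("full", text)]
    if text:
        sections.append(Section(title, text))
    return sections
```

Matching only lines that consist entirely of a known heading word (optionally numbered) keeps false positives low, at the cost of missing unusual headings; those papers would fall through to the whole-paper fallback.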

Tests

  • Mock arxiv download, run ingest end-to-end on a small fixture PDF, assert manifest and chromadb contents
  • arxiv list against an empty store renders a friendly message
  • arxiv add is idempotent: re-adding the same id with the same model is a no-op
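The idempotency test ties in with the model-aware chunk IDs from the Scope section: if the ID is a deterministic hash of the paper, the model name, and the chunk, re-adding with the same model produces the same IDs and writes nothing new. The sketch below illustrates that idea under stated assumptions: `chunk_id` and `add_paper` are hypothetical names, the hash inputs are a guess at what the proposal hashes, a plain dict stands in for the chromadb collection, and the arXiv ID is a placeholder.

```python
import hashlib


def chunk_id(arxiv_id: str, model_name: str, index: int, text: str) -> str:
    """Deterministic chunk ID that includes the embedding model name.

    The unit-separator character keeps field boundaries unambiguous,
    so ("ab", "c") and ("a", "bc") never hash to the same payload.
    """
    payload = "\x1f".join([arxiv_id, model_name, str(index), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


def add_paper(
    store: dict[str, str], arxiv_id: str, model_name: str, sections: list[str]
) -> int:
    """Write chunks that are not already present; return how many were written.

    `store` is a dict standing in for the vector store (an assumption
    for this sketch; no embeddings are computed here).
    """
    written = 0
    for index, text in enumerate(sections):
        cid = chunk_id(arxiv_id, model_name, index, text)
        if cid not in store:
            store[cid] = text
            written += 1
    return written


store: dict[str, str] = {}
sections = ["Intro text", "Methods text"]

first = add_paper(store, "2401.00001", "nomic-embed-text-v1.5", sections)   # 2 chunks written
again = add_paper(store, "2401.00001", "nomic-embed-text-v1.5", sections)   # 0, a no-op
other = add_paper(store, "2401.00001", "some-other-model", sections)        # 2, no collision
```

A test along these lines can assert on the returned counts and on the size of the store without loading an embedding model.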

Dependencies (new)

  • pymupdf>=1.24 (PDF extraction)
  • chromadb>=0.5 (vector store)
  • sentence-transformers>=3.0 (embedding model)
  • arxiv>=2.1 (arxiv API client)

Out of scope

  • Retrieval / search (M5.1.2)
  • The agent loop (M5.1.3)
  • The MCP server (M5.1.4)
  • The --researcher arxiv flag in ask (M5.1.5)
  • Cost ledger fields (M5.1.6)

Branch

feat/arxiv-rag-ingest

Blocks: M5.1.2, M5.1.3, M5.1.4, M5.1.5, M5.1.6

claude-code changed title from A.1 arxiv-rag: ingest pipeline (marchwarden arxiv add) to M5.1.1 arxiv-rag: ingest pipeline (marchwarden arxiv add) 2026-04-08 23:23:10 +00:00
archeious added this to the Phase 5: Second Researcher milestone 2026-04-08 23:23:41 +00:00
Reference: archeious/marchwarden#38