feat(arxiv): ingest pipeline (M5.1.1) #58

Merged
claude-code merged 1 commit from feat/arxiv-rag-ingest into main 2026-04-09 02:03:59 +00:00

Closes #38. First sub-milestone of M5.1 (arxiv-rag researcher).

What ships

  • New package researchers/arxiv/ with store.py (chromadb wrapper + papers.json manifest) and ingest.py (download → extract → embed → store).
  • CLI subgroup marchwarden arxiv add|list|info|remove, lazy-imports heavy deps.
  • [arxiv] optional extra in pyproject.toml (pymupdf, chromadb, sentence-transformers, arxiv) — base install stays slim.
  • 14 new tests, 141 total passing.
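The lazy-import behaviour of the CLI subgroup can be illustrated with a minimal sketch. It assumes an argparse-based CLI, which this PR does not confirm; the `version` command, the handler names, and the `csv` stand-in for the heavy chromadb / torch imports are all hypothetical. The point is only that the expensive import sits inside the handler, so other commands never pay for it.

```python
import argparse
import importlib

# Hypothetical stand-in: a cheap stdlib module plays the role of the heavy
# chromadb / sentence-transformers imports so the sketch runs anywhere.
HEAVY_MODULE = "csv"


def cmd_arxiv_list(args):
    # The heavy import happens here, only when `arxiv list` actually runs.
    heavy = importlib.import_module(HEAVY_MODULE)
    return f"loaded {heavy.__name__}"


def cmd_version(args):
    # A light command (hypothetical) that never touches the heavy module.
    return "marchwarden (sketch)"


def build_parser():
    parser = argparse.ArgumentParser(prog="marchwarden")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("version").set_defaults(func=cmd_version)

    arxiv = sub.add_parser("arxiv")
    arxiv_sub = arxiv.add_subparsers(dest="arxiv_command", required=True)
    arxiv_sub.add_parser("list").set_defaults(func=cmd_arxiv_list)
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `main(["version"])` returns without importing the stand-in module, while `main(["arxiv", "list"])` triggers the import on demand.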

What's deferred (later sub-milestones)

  • Retrieval / search API (#39)
  • ArxivResearcher agent loop (#40)
  • MCP server (#41)
  • ask --researcher arxiv flag (#42)
  • Cost ledger embedding_calls field (#43)

Notes

  • Chunk ids are embedding-model-scoped per ArxivRagProposal decision 4 — re-ingest with a different model creates a fresh ID space rather than overwriting prior chunks.
  • Re-ingest with the same model is idempotent: chunks for the paper are dropped before re-adding.
  • A live smoke test against a real arxiv id is deferred so we don't block the M3.3 collection runner currently using the venv. Will validate after the runner finishes.
  • pip install pulled in the CUDA torch wheel (~2 GB of nvidia libs); harmless on CPU-only WSL, but worth pinning torch to the CPU-only wheel index in a follow-up.
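The two chunk-id notes above can be sketched as follows. This is an illustration under stated assumptions, not the code in `store.py`: the hashing scheme, the `InMemoryStore` class, and the choice to scope the pre-ingest drop to the same embedding model are all hypothetical, and a plain dict stands in for the chromadb collection.

```python
import hashlib


def chunk_id(arxiv_id, chunk_index, embed_model):
    # Deterministic id: the same inputs always hash to the same value, and
    # including the model name gives each embedder its own id space.
    key = f"{embed_model}|{arxiv_id}|{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


class InMemoryStore:
    """Hypothetical stand-in for a persistent vector store."""

    def __init__(self):
        self.chunks = {}  # chunk id -> row dict

    def drop_paper(self, arxiv_id, embed_model):
        # Assumption: only chunks from the same paper AND model are dropped.
        stale = [
            cid
            for cid, row in self.chunks.items()
            if row["arxiv_id"] == arxiv_id and row["model"] == embed_model
        ]
        for cid in stale:
            del self.chunks[cid]
        return len(stale)

    def ingest(self, arxiv_id, texts, embed_model):
        # Drop first, then re-add, so a shorter re-ingest leaves no orphans.
        self.drop_paper(arxiv_id, embed_model)
        for index, text in enumerate(texts):
            self.chunks[chunk_id(arxiv_id, index, embed_model)] = {
                "arxiv_id": arxiv_id,
                "model": embed_model,
                "text": text,
            }
        return len(texts)
```

Dropping before re-adding matters when a re-ingest yields fewer chunks than the previous run: a plain upsert would leave the trailing chunks from the old run behind.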
claude-code added 1 commit 2026-04-09 02:03:54 +00:00
Closes #38. First sub-milestone of M5.1 (Researcher #2: arxiv-rag).

New package researchers/arxiv/ with two modules, plus a CLI subgroup:

- store.py — ArxivStore wraps a persistent chromadb collection at
  ~/.marchwarden/arxiv-rag/chroma/ plus a papers.json manifest. Chunk
  ids are deterministic and embedding-model-scoped (per ArxivRagProposal
  decision 4) so re-ingesting with a different embedder doesn't collide
  with prior chunks.
- ingest.py — three-phase pipeline: download_pdf (arxiv API), extract_sections
  (pymupdf with heuristic heading detection + whole-paper fallback), and
  embed_and_store (sentence-transformers, configurable via
  MARCHWARDEN_ARXIV_EMBED_MODEL). Top-level ingest() chains them and
  upserts the manifest entry. Re-ingest is idempotent — chunks for the
  same paper are dropped before re-adding.
- CLI subgroup `marchwarden arxiv add|list|info|remove`. Lazy-imports
  the heavy chromadb / torch deps so non-arxiv commands stay fast.
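
The heading-detection step with a whole-paper fallback might look roughly like the sketch below. It operates on already-extracted plain text rather than on pymupdf output, and the regex, the `split_into_sections` name, and the `full_text` fallback label are assumptions made for illustration; the heuristic actually used in ingest.py is not shown in this PR.

```python
import re

# Assumed heuristic: a heading is either a numbered line such as
# "1 Introduction" or "2.1. Setup", or one of a few well-known bare titles.
HEADING_RE = re.compile(
    r"^(?:\d+(?:\.\d+)*\.?\s+[A-Z][^\n]{0,60}|Abstract|References)$"
)


def split_into_sections(text):
    """Return a list of (title, body) pairs.

    Text before the first detected heading is ignored in this sketch. If no
    heading is found at all, the whole text comes back as a single section.
    """
    sections = []
    title, body = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            if title is not None:
                sections.append((title, "\n".join(body).strip()))
            title, body = stripped, []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body).strip()))
    if not sections:
        # Whole-paper fallback when the heuristic detects nothing.
        return [("full_text", text.strip())]
    return sections
```

A heuristic this simple will misfire on body lines that happen to start with a number and a capitalised word, which is one reason a whole-paper fallback is useful.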

The heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) are
gated behind an optional `[arxiv]` extra so the base install stays slim
for users who only want the web researcher.

Tests: 14 added (141 total passing). Real pymupdf against synthetic PDFs
generated at test time covers extract_sections; chromadb and the
embedder are stubbed via dependency injection so the tests stay fast,
deterministic, and network-free. End-to-end ingest() is exercised with
a mocked arxiv.Search that produces synthetic PDFs.
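
The dependency-injection approach described above can be sketched as follows. Every name here is hypothetical (`FakeCollection`, `fake_embedder`, `embed_and_store_sketch`); the fake collection's `add` keywords are loosely modelled on a vector-store interface, and the real signatures in ingest.py may differ.

```python
class FakeCollection:
    """In-memory stand-in for the vector store, recording what was added."""

    def __init__(self):
        self.rows = {}

    def add(self, ids, embeddings, documents):
        for chunk_id, vector, document in zip(ids, embeddings, documents):
            self.rows[chunk_id] = (vector, document)


def fake_embedder(texts):
    # Deterministic toy vectors: [character count, lowercase vowel count].
    return [
        [float(len(t)), float(sum(ch in "aeiou" for ch in t))] for t in texts
    ]


def embed_and_store_sketch(chunks, embedder, collection, id_prefix):
    # The embedder and collection are passed in, so tests can supply fakes
    # and never load a model or touch the disk or network.
    vectors = embedder(chunks)
    ids = [f"{id_prefix}:{i}" for i in range(len(chunks))]
    collection.add(ids=ids, embeddings=vectors, documents=list(chunks))
    return ids
```

Because the fakes are deterministic, a test can assert on exact stored vectors rather than only on counts.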

Out of scope for #38 (covered by later sub-milestones):
- Retrieval / search API (#39)
- ArxivResearcher agent loop (#40)
- MCP server (#41)
- ask --researcher arxiv flag (#42)
- Cost ledger embedding_calls field (#43)

Notes:
- pip install pulled in CUDA torch wheel (~2GB nvidia libs); harmless on
  CPU-only WSL but a future optimization would pin the CPU torch index.
- Live smoke against a real arxiv id deferred so we don't block the M3.3
  collection runner currently using the venv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
claude-code merged commit 3b57b563ab into main 2026-04-09 02:03:59 +00:00
Reference: archeious/marchwarden#58