M5.1.1 arxiv-rag: ingest pipeline (marchwarden arxiv add) #38
Reference: archeious/marchwarden#38
First sub-milestone of Issue #37 (M5.1 arxiv-rag researcher). Design: ArxivRagProposal.
Goal
Standalone CLI command that takes an arXiv paper ID, downloads the PDF, extracts text, section-chunks it, embeds the chunks locally, and stores everything in ~/.marchwarden/arxiv-rag/.
Scope
- researchers/arxiv/ (mirror of researchers/web/ layout)
- researchers/arxiv/store.py: chromadb wrapper rooted at ~/.marchwarden/arxiv-rag/chroma/
- researchers/arxiv/ingest.py:
  - download_pdf(arxiv_id) -> Path: uses arxiv API + httpx, caches to ~/.marchwarden/arxiv-rag/pdfs/
  - extract_sections(pdf_path) -> list[Section]: pymupdf, heuristic heading detection (intro/methods/results/conclusion/etc.), whole-paper fallback if no structure detected
  - embed_and_store(arxiv_id, sections, model_name): nomic-embed-text-v1.5 via sentence-transformers, writes to chromadb
- marchwarden arxiv with three commands for v1:
  - arxiv add <id> [<id> ...]: ingest
  - arxiv list: show indexed papers from papers.json
  - arxiv info <id>: show metadata + chunk count
- ~/.marchwarden/arxiv-rag/papers.json per the proposal schema
Tests
- arxiv list against an empty store renders friendly message
- arxiv add is idempotent: re-adding the same id with the same model is a no-op
Dependencies (new)
- pymupdf>=1.24 (PDF extraction)
- chromadb>=0.5 (vector store)
- sentence-transformers>=3.0 (embedding model)
- arxiv>=2.1 (arxiv API client)
Out of scope
- --researcher arxiv flag in ask (M5.1.5)
Branch
feat/arxiv-rag-ingest
Blocks: M5.1.2, M5.1.3, M5.1.4, M5.1.5, M5.1.6
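The heuristic heading detection with whole-paper fallback described for extract_sections could look roughly like the sketch below. It works on already-extracted plain text rather than calling pymupdf, so it runs on its own. The heading pattern, the Section fields, the "preamble" and "full_paper" titles, and the function name split_sections are all assumptions made for illustration; the issue does not specify them.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: optional numbering ("1", "2.", "3.1") followed by a
# common section name alone on the line. The real heuristic is not specified.
HEADING_RE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(abstract|introduction|related work|background|methods?|methodology|"
    r"experiments?|results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE,
)


@dataclass
class Section:
    # Hypothetical shape; the proposal's Section type may differ.
    title: str
    text: str


def split_sections(full_text: str) -> list[Section]:
    """Split extracted paper text into sections at recognised headings."""
    sections: list[Section] = []
    title, buf = "preamble", []  # text before the first heading
    found_heading = False

    for line in full_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            found_heading = True
            body = "\n".join(buf).strip()
            if body:  # skip empty sections
                sections.append(Section(title, body))
            title, buf = match.group(1).lower(), []
        else:
            buf.append(line)

    if not found_heading:
        # Whole-paper fallback: no structure detected.
        return [Section("full_paper", full_text.strip())]

    body = "\n".join(buf).strip()
    if body:
        sections.append(Section(title, body))
    return sections
```

Requiring the heading to stand alone on its line keeps ordinary sentences such as "Results show that ..." from being mistaken for a section boundary, at the cost of missing headings that PDF extraction merges with body text.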
Title changed from "A.1 arxiv-rag: ingest pipeline (marchwarden arxiv add)" to "M5.1.1 arxiv-rag: ingest pipeline (marchwarden arxiv add)"
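The idempotency test for arxiv add (re-adding the same id with the same model is a no-op) can be illustrated with a small registry check against papers.json. This is only a sketch: the function name register_paper, the JSON layout keyed by arXiv id, and the embedding_model and chunk_count fields are hypothetical, since the proposal schema is not reproduced in this issue.

```python
import json
from pathlib import Path


def register_paper(
    registry_path: Path, arxiv_id: str, model_name: str, chunk_count: int
) -> bool:
    """Record a paper in the registry.

    Returns True if the entry was written, False if the same id was already
    indexed with the same model (the no-op case).
    """
    if registry_path.exists():
        papers = json.loads(registry_path.read_text())
    else:
        papers = {}

    existing = papers.get(arxiv_id)
    if existing is not None and existing.get("embedding_model") == model_name:
        return False  # same id, same model: nothing to do

    # Hypothetical entry layout, not the proposal schema.
    papers[arxiv_id] = {
        "embedding_model": model_name,
        "chunk_count": chunk_count,
    }
    registry_path.parent.mkdir(parents=True, exist_ok=True)
    registry_path.write_text(json.dumps(papers, indent=2, sort_keys=True))
    return True
```

In a real pipeline the same check would presumably run before the download and embedding steps, so that a repeated arxiv add skips the expensive work instead of only skipping the registry write.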