M5.1.2 arxiv-rag: retrieval primitive #39

Open

opened 2026-04-08 17:17:20 -06:00 by claude-code · 0 comments

Second sub-milestone of Issue #37. Design: ArxivRagProposal (https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ArxivRagProposal).

Goal

A standalone retrieval function that takes a query string, embeds it with the same model used at ingest time, and returns the top-K matching chunks with paper metadata.
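For intuition only, the ranking step behind "top-K matching chunks" can be sketched in plain Python. chromadb performs this search itself using an index, so nothing like this needs to live in retrieve.py. The names `cosine` and `top_k` and the toy vectors are hypothetical. The sketch also shows why the query must be embedded with the ingest-time model: vectors from different models are not comparable, and may not even share a dimension.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        # Different lengths usually mean different embedding models.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to k (chunk_id, score) pairs, highest score first."""
    scored = ((cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data: three hypothetical chunk embeddings in two dimensions.
toy_store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([1.0, 0.0], toy_store, k=2)
```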

Scope

researchers/arxiv/retrieve.py:
- retrieve(query: str, k: int = 10, model_name: str = ...) -> list[RetrievedChunk]
- RetrievedChunk Pydantic model: arxiv_id, paper_title, section, chunk_text, score, chunk_id
Reads from the chromadb store created by M5.1.1
Honors MARCHWARDEN_ARXIV_EMBED_MODEL env var so the embedding model can be swapped without code changes
Optional metadata filters: --arxiv-id, --year, --category (passed through as chromadb's where clause)
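A minimal sketch of the pure-Python pieces around the chromadb call, under several assumptions that this issue does not confirm:

- the metadata keys `arxiv_id`, `year`, `category`, `paper_title`, and `section` match what M5.1.1 writes
- the collection uses cosine distance, so `1 - distance` is a usable score
- an explicit `model_name` argument takes precedence over the env var

A frozen dataclass stands in for the Pydantic model so the sketch has no third-party dependencies. The helper names `resolve_model_name`, `build_where`, and `chunks_from_query_result` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

ENV_VAR = "MARCHWARDEN_ARXIV_EMBED_MODEL"


@dataclass(frozen=True)
class RetrievedChunk:
    """Stand-in for the Pydantic model; same fields as listed in Scope."""
    arxiv_id: str
    paper_title: str
    section: str
    chunk_text: str
    score: float
    chunk_id: str


def resolve_model_name(explicit: Optional[str], default: str) -> str:
    """Assumed precedence: explicit argument, then env var, then default."""
    if explicit:
        return explicit
    return os.environ.get(ENV_VAR) or default


def build_where(
    arxiv_id: Optional[str] = None,
    year: Optional[int] = None,
    category: Optional[str] = None,
) -> Optional[Dict[str, Any]]:
    """Build a chromadb-style where filter, or None when no filter is set."""
    conditions: List[Dict[str, Any]] = []
    if arxiv_id is not None:
        conditions.append({"arxiv_id": arxiv_id})
    if year is not None:
        conditions.append({"year": year})
    if category is not None:
        conditions.append({"category": category})
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    # Several conditions are combined explicitly with $and.
    return {"$and": conditions}


def chunks_from_query_result(result: Dict[str, Any]) -> List[RetrievedChunk]:
    """Flatten a single-query chromadb result (lists of lists) into chunks."""
    rows = zip(
        result["ids"][0],
        result["documents"][0],
        result["metadatas"][0],
        result["distances"][0],
    )
    return [
        RetrievedChunk(
            arxiv_id=meta["arxiv_id"],
            paper_title=meta["paper_title"],
            section=meta["section"],
            chunk_text=doc,
            score=1.0 - dist,  # assumes cosine distance
            chunk_id=chunk_id,
        )
        for chunk_id, doc, meta, dist in rows
    ]
```

In `retrieve`, the output of `build_where` would be passed as the `where` argument of the collection's `query` call together with `n_results=k`, and the returned dict handed to `chunks_from_query_result`.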

Tests

Round-trip: ingest a fixture paper via M5.1.1, query for terms known to be in it, assert top result's chunk_text contains the term
Empty store returns an empty list instead of raising
Filter by arxiv_id returns only chunks from that paper
Querying with a different embedding model than the one used at ingest raises a clear error
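One way the model-mismatch case could be made testable is a guard that compares the requested model against a name recorded in the store. This is a sketch under assumptions: that M5.1.1 records the ingest-time model under an `embed_model` key in the collection's metadata (the key name is hypothetical), and that a store with no recorded model should also be rejected. `EmbeddingModelMismatchError` and `check_embed_model` are illustrative names, not existing code.

```python
from typing import Any, Mapping, Optional


class EmbeddingModelMismatchError(ValueError):
    """Raised when the query-time model differs from the ingest-time model."""


def check_embed_model(
    collection_metadata: Optional[Mapping[str, Any]],
    requested: str,
) -> None:
    """Raise a descriptive error unless the store was built with `requested`."""
    stored = (collection_metadata or {}).get("embed_model")  # hypothetical key
    if stored is None:
        raise EmbeddingModelMismatchError(
            "store does not record its embedding model; "
            f"cannot verify that {requested!r} matches"
        )
    if stored != requested:
        raise EmbeddingModelMismatchError(
            f"store was ingested with {stored!r} but the query uses "
            f"{requested!r}; re-ingest or set the model to {stored!r}"
        )


def test_mismatch_raises_clear_error() -> None:
    """Plain-assert version of the mismatch test listed above."""
    try:
        check_embed_model({"embed_model": "model-a"}, "model-b")
    except EmbeddingModelMismatchError as exc:
        # The message names both models so the fix is obvious.
        assert "model-a" in str(exc) and "model-b" in str(exc)
    else:
        raise AssertionError("expected EmbeddingModelMismatchError")
```

Calling the guard once at the top of `retrieve`, before the query is embedded, would keep the failure cheap and the error message close to its cause.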

Out of scope

Re-ranking beyond chromadb's built-in similarity score (deferred)
Hybrid sparse+dense retrieval (deferred)

Branch

feat/arxiv-rag-retrieve

Blocked by: M5.1.1. Blocks: M5.1.3.
