M5.1.2 arxiv-rag: retrieval primitive #39

Open
opened 2026-04-08 23:17:20 +00:00 by claude-code · 0 comments

Second sub-milestone of Issue #37. Design: ArxivRagProposal (https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/ArxivRagProposal).

Goal

A standalone retrieval function that takes a query string, embeds it with the same model used at ingest time, and returns the top-K matching chunks with paper metadata.
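As a rough illustration of the return shape, the sketch below flattens a single-query chromadb result (the dict of nested lists under `ids`, `documents`, `metadatas`, and `distances`) into `RetrievedChunk` objects. Several things here are assumptions rather than part of this issue: a dataclass stands in for the Pydantic model to keep the sketch dependency-free, the helper name `chunks_from_query_result` is hypothetical, the metadata keys are assumed to match the field names, and `score = 1 - distance` assumes the collection uses cosine distance.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RetrievedChunk:
    # Field names follow the issue; a dataclass stands in for the Pydantic model.
    arxiv_id: str
    paper_title: str
    section: str
    chunk_text: str
    score: float
    chunk_id: str


def chunks_from_query_result(result: dict[str, Any]) -> list[RetrievedChunk]:
    """Flatten a single-query chromadb result into RetrievedChunk objects.

    Hypothetical helper: assumes the ingest step stored arxiv_id,
    paper_title, and section as metadata on every chunk.
    """
    ids = result.get("ids") or [[]]
    if not ids[0]:
        # An empty store yields empty inner lists; return an empty list.
        return []
    docs = result["documents"][0]
    metas = result["metadatas"][0]
    dists = result["distances"][0]
    return [
        RetrievedChunk(
            arxiv_id=meta["arxiv_id"],
            paper_title=meta["paper_title"],
            section=meta["section"],
            chunk_text=doc,
            # Assumption: cosine distance, so similarity = 1 - distance.
            score=1.0 - dist,
            chunk_id=chunk_id,
        )
        for chunk_id, doc, meta, dist in zip(ids[0], docs, metas, dists)
    ]
```

The early return also covers the "empty store returns empty list" case listed under Tests.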

Scope

  • researchers/arxiv/retrieve.py:
    • retrieve(query: str, k: int = 10, model_name: str = ...) -> list[RetrievedChunk]
    • RetrievedChunk Pydantic model: arxiv_id, paper_title, section, chunk_text, score, chunk_id
  • Reads from the chromadb store created by M5.1.1
  • Honors MARCHWARDEN_ARXIV_EMBED_MODEL env var so the embedding model can be swapped without code changes
  • Optional metadata filters: --arxiv-id, --year, --category (passed through as chromadb's where clause)
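One way the optional filters could be turned into a chromadb `where` clause is sketched below. The helper name `build_where` and the metadata keys `arxiv_id`, `year`, and `category` are assumptions for illustration; the actual keys depend on what M5.1.1 writes at ingest time. chromadb expects a single condition as a plain `{field: value}` dict and multiple conditions combined under `$and`.

```python
from typing import Any, Optional


def build_where(
    arxiv_id: Optional[str] = None,
    year: Optional[int] = None,
    category: Optional[str] = None,
) -> Optional[dict[str, Any]]:
    """Build a chromadb `where` filter from optional metadata filters.

    Hypothetical helper; the metadata key names are assumed, not confirmed.
    """
    conditions: list[dict[str, Any]] = []
    if arxiv_id is not None:
        conditions.append({"arxiv_id": arxiv_id})
    if year is not None:
        conditions.append({"year": year})
    if category is not None:
        conditions.append({"category": category})

    if not conditions:
        return None  # no filtering requested
    if len(conditions) == 1:
        return conditions[0]  # single equality filter
    return {"$and": conditions}  # several filters must be combined explicitly
```

The result would then be handed to the query call as `where=build_where(...)`, with `None` meaning no filter.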

Tests

  • Round-trip: ingest a fixture paper via M5.1.1, query for terms known to be in it, assert top result's chunk_text contains the term
  • Empty store returns an empty list and does not crash
  • Filter by arxiv_id returns only chunks from that paper
  • Querying with a different embedding model than the one used at ingest raises a clear error
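The model-mismatch test implies that the store records which embedding model was used at ingest. A minimal sketch of how the check might look follows, under several loudly stated assumptions: that M5.1.1 writes the model name into the collection metadata under a key such as `embed_model`, that an explicit argument takes precedence over the `MARCHWARDEN_ARXIV_EMBED_MODEL` env var, and that `DEFAULT_MODEL` is only a placeholder because the issue leaves the default unspecified. All helper and exception names are hypothetical.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "MARCHWARDEN_ARXIV_EMBED_MODEL"
DEFAULT_MODEL = "placeholder-model"  # placeholder; the issue does not state the default


class EmbeddingModelMismatchError(ValueError):
    """Raised when the query-time model differs from the ingest-time model."""


def resolve_model_name(
    model_name: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    # Assumed precedence: explicit argument, then env var, then default.
    env = os.environ if env is None else env
    return model_name or env.get(ENV_VAR) or DEFAULT_MODEL


def check_model_matches(
    query_model: str, collection_metadata: Optional[Mapping[str, str]]
) -> None:
    # Assumption: ingest stored the model name under the "embed_model" key.
    ingest_model = (collection_metadata or {}).get("embed_model")
    if ingest_model is None:
        raise EmbeddingModelMismatchError(
            "store does not record which embedding model was used at ingest"
        )
    if ingest_model != query_model:
        raise EmbeddingModelMismatchError(
            f"store was ingested with {ingest_model!r} but the query uses "
            f"{query_model!r}; re-ingest or set {ENV_VAR} to match"
        )
```

Naming both models and the env var in the message is one way to satisfy the "clear error" requirement.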

Out of scope

  • Re-ranking beyond chromadb's built-in similarity score (deferred)
  • Hybrid sparse+dense retrieval (deferred)

Branch

feat/arxiv-rag-retrieve

Blocked by: M5.1.1. Blocks: M5.1.3.

claude-code changed title from A.2 arxiv-rag: retrieval primitive to M5.1.2 arxiv-rag: retrieval primitive 2026-04-08 23:23:10 +00:00
archeious added this to the Phase 5: Second Researcher milestone 2026-04-08 23:23:42 +00:00
Reference: archeious/marchwarden#39