A network of agentic research specialists coordinated by a principal investigator agent. V1: web-search researcher MCP server + CLI shim.
Jeff Smith 14cfd53514 feat(arxiv): ingest pipeline (M5.1.1)
Closes #38. First sub-milestone of M5.1 (Researcher #2: arxiv-rag).

New package researchers/arxiv/ and its CLI wiring, in three parts:

- store.py — ArxivStore wraps a persistent chromadb collection at
  ~/.marchwarden/arxiv-rag/chroma/ plus a papers.json manifest. Chunk
  ids are deterministic and embedding-model-scoped (per ArxivRagProposal
  decision 4) so re-ingesting with a different embedder doesn't collide
  with prior chunks.
- ingest.py — three-phase pipeline: download_pdf (arxiv API), extract_sections
  (pymupdf with heuristic heading detection + whole-paper fallback), and
  embed_and_store (sentence-transformers, configurable via
  MARCHWARDEN_ARXIV_EMBED_MODEL). Top-level ingest() chains them and
  upserts the manifest entry. Re-ingest is idempotent — chunks for the
  same paper are dropped before re-adding.
- CLI subgroup `marchwarden arxiv add|list|info|remove`. Lazy-imports
  the heavy chromadb / torch deps so non-arxiv commands stay fast.
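
A minimal sketch of the two store properties described above: chunk ids that are deterministic and scoped to the embedding model, and a re-ingest that drops old chunks before re-adding. The names chunk_id, InMemoryStore, and replace_paper are hypothetical, a plain dict stands in for the chromadb collection, and the exact key layout is an assumption rather than what store.py does. It also assumes the drop is limited to chunks from the same paper and the same embedder.

```python
import hashlib
from typing import Dict, List


def chunk_id(arxiv_id: str, embed_model: str, section: int, chunk: int) -> str:
    """Hash the paper id, embedder name, and chunk position into a stable id."""
    key = f"{arxiv_id}|{embed_model}|{section}|{chunk}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


class InMemoryStore:
    """Hypothetical stand-in for ArxivStore; a dict replaces the chromadb collection."""

    def __init__(self) -> None:
        self.chunks: Dict[str, dict] = {}

    def replace_paper(
        self, arxiv_id: str, embed_model: str, sections: List[List[str]]
    ) -> int:
        # Drop first, so a re-ingest that yields fewer chunks leaves nothing stale.
        # Assumption: only chunks for the same paper AND embedder are dropped.
        stale = [
            cid
            for cid, meta in self.chunks.items()
            if meta["arxiv_id"] == arxiv_id and meta["embed_model"] == embed_model
        ]
        for cid in stale:
            del self.chunks[cid]

        added = 0
        for s, texts in enumerate(sections):
            for c, text in enumerate(texts):
                self.chunks[chunk_id(arxiv_id, embed_model, s, c)] = {
                    "arxiv_id": arxiv_id,
                    "embed_model": embed_model,
                    "text": text,
                }
                added += 1
        return added
```

Because the embedder name is part of the hashed key, ingesting the same paper with a second model produces a disjoint set of ids, while repeating an ingest with the same model reproduces the same ids.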

The heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) are
gated behind an optional `[arxiv]` extra so the base install stays slim
for users who only want the web researcher.
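
One way the lazy import and the optional extra could fit together is sketched below. The helper require_extra is hypothetical and not part of the commit. It imports a heavy dependency only when a command needs it, and it turns a missing module into an install hint that reuses the editable-install form from the README.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str = "arxiv") -> ModuleType:
    """Import a heavy dependency on first use; hint at the extra if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{module_name}' is not installed; try: pip install -e \".[{extra}]\""
        ) from exc


def arxiv_list_command() -> None:
    # Hypothetical command body: chromadb is imported here, not at module top
    # level, so commands that never touch arxiv do not pay the import cost.
    chromadb = require_extra("chromadb")
    print(f"using chromadb from {chromadb.__file__}")
```

Keeping the import inside the command body means a base install without the extra only fails when an arxiv subcommand is actually invoked.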

Tests: 14 added (141 total passing). Real pymupdf against synthetic PDFs
generated at test time covers extract_sections; chromadb and the
embedder are stubbed via dependency injection so the tests stay fast,
deterministic, and network-free. End-to-end ingest() is exercised with
a mocked arxiv.Search that produces synthetic PDFs.
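
The dependency-injection approach can be illustrated with a small sketch. The names here (Embedder, stub_embedder, embed_and_store_sketch) are hypothetical and the signature is simplified; the real embed_and_store may look quite different. The point is only that the storage step receives its embedder as an argument, so a test can pass a deterministic fake in place of a sentence-transformers model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# An embedder maps a batch of texts to one vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def stub_embedder(texts: Sequence[str]) -> List[List[float]]:
    """Deterministic fake: [length, vowel count] per text. No model, no network."""
    return [
        [float(len(t)), float(sum(t.count(v) for v in "aeiou"))] for t in texts
    ]


def embed_and_store_sketch(
    chunks: Sequence[str],
    store: Dict[int, Tuple[str, List[float]]],
    embedder: Embedder,
) -> int:
    """Embed every chunk with the injected embedder and record it in the store."""
    vectors = embedder(chunks)
    for i, (text, vec) in enumerate(zip(chunks, vectors)):
        store[i] = (text, vec)
    return len(vectors)


store: Dict[int, Tuple[str, List[float]]] = {}
count = embed_and_store_sketch(["abc", "hello"], store, stub_embedder)
print(count, store[1])  # 2 ('hello', [5.0, 2.0])
```

Production code would pass a callable backed by the configured model instead, while the function under test stays the same.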

Out of scope for #38 (covered by later sub-milestones):
- Retrieval / search API (#39)
- ArxivResearcher agent loop (#40)
- MCP server (#41)
- ask --researcher arxiv flag (#42)
- Cost ledger embedding_calls field (#43)

Notes:
- pip install pulled in the CUDA torch wheel (~2GB of nvidia libs). This is
  harmless on CPU-only WSL, but a future optimization would pin the CPU
  torch index.
- A live smoke test against a real arxiv id is deferred so we don't block
  the M3.3 collection runner currently using the venv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:03:42 -06:00
Path               Last commit                                          Date
cli                feat(arxiv): ingest pipeline (M5.1.1)                2026-04-08 20:03:42 -06:00
docs/stress-tests  docs(stress-tests): archive M3.2 multi-axis results  2026-04-08 19:34:27 -06:00
obs                M2.5.2: Cost ledger with price table (#25)           2026-04-08 15:52:25 -06:00
orchestrator       Initial project structure and scaffolding            2026-04-08 11:57:15 -06:00
researchers        feat(arxiv): ingest pipeline (M5.1.1)                2026-04-08 20:03:42 -06:00
scripts            Propagate parent env to MCP server subprocess (#18)  2026-04-08 15:31:14 -06:00
tests              feat(arxiv): ingest pipeline (M5.1.1)                2026-04-08 20:03:42 -06:00
.dockerignore      chore: add docker-based test environment (#13)       2026-04-08 15:06:12 -06:00
.gitignore         Initial project structure and scaffolding            2026-04-08 11:57:15 -06:00
CLAUDE.md          chore: update CLAUDE.md for session 2                2026-04-08 17:30:59 -06:00
CONTRIBUTING.md    Initial project structure and scaffolding            2026-04-08 11:57:15 -06:00
Dockerfile         M2.5.3: marchwarden costs CLI command (#26)          2026-04-08 15:57:39 -06:00
Makefile           chore: add Makefile with venv-based dev workflow     2026-04-08 16:31:00 -06:00
pyproject.toml     feat(arxiv): ingest pipeline (M5.1.1)                2026-04-08 20:03:42 -06:00
README.md          chore: add Makefile with venv-based dev workflow     2026-04-08 16:31:00 -06:00

Marchwarden

A network of agentic research specialists coordinated by a principal investigator agent.

Marchwarden researchers are stationed at the frontier of knowledge — they watch, search, synthesize, and report back what they find. Each specialist is self-contained, fault-tolerant, and exposed via MCP. The PI agent orchestrates them to answer complex, multi-domain questions.

V1: Single web-search researcher + CLI shim for development.
V2+: Multiple specialists (arxiv, database, internal docs, etc.) + PI orchestrator.

Quick start

# Clone
git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
cd marchwarden

# Install (Makefile shortcut — creates .venv and installs deps)
make install
# or manually:
python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[dev]"

# Ask a question
marchwarden ask "What are ideal crops for a garden in Utah?"

# Replay a research session
marchwarden replay <trace_id>

Docker test environment

A reproducible container is available for running the test suite and the CLI without depending on the host's Python install:

scripts/docker-test.sh build           # build the image
scripts/docker-test.sh test             # run pytest
scripts/docker-test.sh ask "question"   # run `marchwarden ask` end-to-end
                                        # (mounts ~/secrets ro and ~/.marchwarden rw)
scripts/docker-test.sh replay <id>      # replay a trace from ~/.marchwarden/traces
scripts/docker-test.sh shell            # interactive bash in the container

Documentation

Status

  • V1 scope: Issue #1
  • Branch: main (development)
  • Tests: pytest tests/

Stack

License

(TBD)