Closes #38. First sub-milestone of M5.1 (Researcher #2: arxiv-rag).
New package researchers/arxiv/ with three modules:
- store.py — ArxivStore wraps a persistent chromadb collection at
~/.marchwarden/arxiv-rag/chroma/ plus a papers.json manifest. Chunk
ids are deterministic and embedding-model-scoped (per ArxivRagProposal
decision 4) so re-ingesting with a different embedder doesn't collide
with prior chunks.
- ingest.py — three-phase pipeline: download_pdf (arxiv API), extract_sections
(pymupdf with heuristic heading detection + whole-paper fallback), and
embed_and_store (sentence-transformers, configurable via
MARCHWARDEN_ARXIV_EMBED_MODEL). Top-level ingest() chains them and
upserts the manifest entry. Re-ingest is idempotent — chunks for the
same paper are dropped before re-adding.
- CLI subgroup `marchwarden arxiv add|list|info|remove`. Lazy-imports
the heavy chromadb / torch deps so non-arxiv commands stay fast.
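One way to read "deterministic and embedding-model-scoped" chunk ids is sketched below. The id layout, the `chunk_id` name, and its parameters are assumptions for illustration only; the actual scheme is whatever ArxivRagProposal decision 4 specifies.

```python
import hashlib


def chunk_id(arxiv_id: str, embed_model: str, section_index: int, chunk_index: int) -> str:
    """Return a stable id that changes whenever the embedding model changes.

    Hypothetical layout: <arxiv_id>::<model tag>::<section>::<chunk>.
    """
    # Hash the model name so the tag is short and contains no path separators.
    model_tag = hashlib.sha256(embed_model.encode("utf-8")).hexdigest()[:8]
    return f"{arxiv_id}::{model_tag}::{section_index:04d}::{chunk_index:04d}"
```

Because the id is a pure function of its inputs, re-ingesting the same paper with the same embedder produces the same ids, while switching embedders produces a disjoint id set.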
The heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) are
gated behind an optional `[arxiv]` extra so the base install stays slim
for users who only want the web researcher.
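A minimal sketch of how a lazy import can be combined with the optional extra, assuming a hypothetical helper named `require_extra` and assuming the distribution is installed as `marchwarden` (neither is confirmed by this commit):

```python
import importlib


def require_extra(module_name: str, extra: str = "arxiv"):
    """Import a heavy dependency on first use, with an actionable error if missing."""
    try:
        # Deferred import: nothing heavy is loaded until a command needs it.
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The package name in the hint is an assumption for illustration.
        raise RuntimeError(
            f"'{module_name}' is not installed; install the optional extra with "
            f"pip install 'marchwarden[{extra}]'"
        ) from exc
```

Calling the helper inside a command body rather than at module top level keeps import cost off the path of commands that never touch the arxiv stack.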
Tests: 14 added (141 total passing). Real pymupdf against synthetic PDFs
generated at test time covers extract_sections; chromadb and the
embedder are stubbed via dependency injection so the tests stay fast,
deterministic, and network-free. End-to-end ingest() is exercised with
a mocked arxiv.Search that produces synthetic PDFs.
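The dependency-injection approach described above might look roughly like this. The `embed_and_store` signature, `fake_embedder`, and `InMemoryCollection` are illustrative assumptions, not the code in this commit; the stub only mimics an `upsert`-style call.

```python
from typing import Callable, Sequence

# An embedder is any callable mapping texts to vectors (assumed interface).
Embedder = Callable[[Sequence[str]], list]


def fake_embedder(texts: Sequence[str]) -> list:
    """Deterministic stand-in: vectors derived from length and a checksum."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


class InMemoryCollection:
    """Tiny stand-in for a vector store collection; keeps rows in a dict."""

    def __init__(self) -> None:
        self.rows = {}

    def upsert(self, ids, embeddings, documents) -> None:
        # Same id overwrites the previous row, so repeated calls are idempotent.
        for row_id, vector, text in zip(ids, embeddings, documents):
            self.rows[row_id] = (vector, text)


def embed_and_store(chunks: Sequence[str], embedder: Embedder, collection, id_prefix: str) -> list:
    """Embed chunks with the injected embedder and upsert them into the collection."""
    ids = [f"{id_prefix}::{n:04d}" for n in range(len(chunks))]
    collection.upsert(ids=ids, embeddings=embedder(chunks), documents=list(chunks))
    return ids
```

Since both collaborators are passed in, a test can run without loading a model or touching disk or the network.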
Out of scope for #38 (covered by later sub-milestones):
- Retrieval / search API (#39)
- ArxivResearcher agent loop (#40)
- MCP server (#41)
- ask --researcher arxiv flag (#42)
- Cost ledger embedding_calls field (#43)
Notes:
- pip install pulled in the CUDA torch wheel (~2GB of nvidia libs); harmless
on CPU-only WSL, but a future optimization would be to pin the CPU torch index.
- Live smoke against a real arxiv id deferred so we don't block the M3.3
collection runner currently using the venv.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds an operational logging layer separate from the JSONL trace
audit logs. Operational logs cover system events (startup, errors,
MCP transport, research lifecycle); JSONL traces remain the
researcher provenance audit trail.
Backend: structlog with two renderers selectable via
MARCHWARDEN_LOG_FORMAT (json|console). Defaults to console when
stderr is a TTY, json otherwise — so dev runs are human-readable
and shipped runs (containers, automation) emit OpenSearch-ready
JSON without configuration.
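The renderer selection rule above can be sketched as a small pure function. The name `resolve_log_format` and its parameters are hypothetical, and treating an unrecognized value as "fall back to auto-detection" is an assumption:

```python
import os
import sys


def resolve_log_format(environ=None, stream=None) -> str:
    """Pick 'json' or 'console' from MARCHWARDEN_LOG_FORMAT, else from the stream."""
    environ = os.environ if environ is None else environ
    stream = sys.stderr if stream is None else stream

    explicit = environ.get("MARCHWARDEN_LOG_FORMAT", "").strip().lower()
    if explicit in ("json", "console"):
        return explicit

    # No explicit choice: human-readable on a terminal, JSON everywhere else.
    return "console" if stream.isatty() else "json"
```

Taking the environment and stream as arguments keeps the rule testable without patching global state.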
Key features:
- Named loggers per component: marchwarden.cli,
marchwarden.mcp, marchwarden.researcher.web
- MARCHWARDEN_LOG_LEVEL controls global level (default INFO)
- MARCHWARDEN_LOG_FILE=1 enables a 10MB-rotating file at
~/.marchwarden/logs/marchwarden.log
- structlog contextvars bind trace_id + researcher at the start
of each research() call so every downstream log line carries
them automatically; cleared on completion
- stdlib logging is funneled through the same pipeline so noisy
third-party loggers (httpx, anthropic) get the same formatting
and are quieted to WARN unless DEBUG is requested
- Logs to stderr to keep MCP stdio stdout clean
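To illustrate the bind-then-clear idea behind the trace context, here is a stdlib-only stand-in. It does not use structlog; `ContextFilter`, `JsonFormatter`, `make_logger`, and the `research` signature are assumptions for illustration.

```python
import contextvars
import io
import json
import logging

# Holds the fields bound for the current research() call (empty outside one).
_log_context = contextvars.ContextVar("log_context", default={})


class ContextFilter(logging.Filter):
    """Copy the currently bound context onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.context = dict(_log_context.get())
        return True


class JsonFormatter(logging.Formatter):
    """Render one JSON object per line, merging in the bound context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "logger": record.name}
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(ContextFilter())
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger(name)
    log.handlers = [handler]
    log.setLevel(logging.INFO)
    log.propagate = False
    return log


def research(question: str, trace_id: str, log: logging.Logger) -> None:
    # Bind once at the start; every log call below carries these fields.
    token = _log_context.set({"trace_id": trace_id, "researcher": "web"})
    try:
        log.info("research_started")
        log.info("research_completed")
    finally:
        # Clear on completion so later, unrelated lines are not mislabeled.
        _log_context.reset(token)
```

Resetting in a `finally` block means the context is cleared even when the call raises.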
Wired into:
- cli.main.cli — configures logging on startup, logs ask_started/
ask_completed/ask_failed
- researchers.web.server.main — configures logging on startup,
logs mcp_server_starting
- researchers.web.agent.research — binds trace context, logs
research_started/research_completed
Tests verify JSON and console formats, contextvar propagation,
level filtering, idempotency, and auto-configure-on-first-use.
94/94 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reproducible Python 3.12-slim container that installs the project
editable with dev deps. Adds pytest-asyncio to dev deps so async tests
run cleanly inside the container (host had it installed out-of-band).
scripts/docker-test.sh provides build, test, ask, replay, and shell
subcommands. The ask/replay/shell commands mount ~/secrets read-only
and ~/.marchwarden read-write so end-to-end runs persist traces back
to the host.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Click app with `ask` subcommand that spawns the web researcher MCP
server over stdio, calls the research tool, and pretty-prints the
ResearchResult contract using rich (panels for answer/confidence/cost,
tables for citations, gaps, discovery events, and open questions).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Directory layout: researchers/web/, orchestrator/, cli/, docs/wiki/
- README with quick start and vision
- CONTRIBUTING with workflow and testing guidelines
- pyproject.toml with dependencies and build config
- .gitignore for Python projects
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>