Commit graph

9 commits

Author SHA1 Message Date
Jeff Smith
14cfd53514 feat(arxiv): ingest pipeline (M5.1.1)
Closes #38. First sub-milestone of M5.1 (Researcher #2: arxiv-rag).

New package researchers/arxiv/ with three modules:

- store.py — ArxivStore wraps a persistent chromadb collection at
  ~/.marchwarden/arxiv-rag/chroma/ plus a papers.json manifest. Chunk
  ids are deterministic and embedding-model-scoped (per ArxivRagProposal
  decision 4) so re-ingesting with a different embedder doesn't collide
  with prior chunks.
- ingest.py — three-phase pipeline: download_pdf (arxiv API), extract_sections
  (pymupdf with heuristic heading detection + whole-paper fallback), and
  embed_and_store (sentence-transformers, configurable via
  MARCHWARDEN_ARXIV_EMBED_MODEL). Top-level ingest() chains them and
  upserts the manifest entry. Re-ingest is idempotent — chunks for the
  same paper are dropped before re-adding.
- CLI subgroup `marchwarden arxiv add|list|info|remove`. Lazy-imports
  the heavy chromadb / torch deps so non-arxiv commands stay fast.
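
A minimal sketch of what deterministic, embedding-model-scoped chunk ids
could look like. The helper name `chunk_id` and the id layout are
assumptions for illustration only; the actual ArxivStore scheme is not
shown in this message.

```python
import hashlib


def chunk_id(arxiv_id: str, chunk_index: int, embed_model: str) -> str:
    """Hypothetical helper: the same inputs always produce the same id."""
    # Hash the embedder name into a short tag so chunks produced by
    # different embedding models never share an id.
    model_tag = hashlib.sha1(embed_model.encode("utf-8")).hexdigest()[:8]
    return f"{arxiv_id}::{chunk_index:04d}::{model_tag}"
```

Because the id depends only on the paper, the chunk position, and the
embedder, re-ingesting the same paper with the same model regenerates
the same ids, while switching models yields a disjoint id set.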

The heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) are
gated behind an optional `[arxiv]` extra so the base install stays slim
for users who only want the web researcher.

Tests: 14 added (141 total passing). extract_sections is covered by running
real pymupdf against synthetic PDFs generated at test time; chromadb and the
embedder are stubbed via dependency injection so the tests stay fast,
deterministic, and network-free. End-to-end ingest() is exercised with
a mocked arxiv.Search that produces synthetic PDFs.

Out of scope for #38 (covered by later sub-milestones):
- Retrieval / search API (#39)
- ArxivResearcher agent loop (#40)
- MCP server (#41)
- ask --researcher arxiv flag (#42)
- Cost ledger embedding_calls field (#43)

Notes:
- pip install pulled in the CUDA torch wheel (~2GB of nvidia libs); harmless
  on CPU-only WSL, but a future optimization would be to pin the CPU torch
  index.
- Live smoke against a real arxiv id deferred so we don't block the M3.3
  collection runner currently using the venv.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 20:03:42 -06:00
Jeff Smith
1203b07248 fix(observability): persist full ResearchResult and per-item trace events
Closes #54.

The JSONL trace previously stored only counts on the `complete` event
(gap_count, citation_count, discovery_count). Replay could re-render the
step log but could not recover which gaps fired or which sources were
cited, blocking M3.2/M3.3 stress-testing and calibration work.

Two complementary fixes:

1. TraceLogger.write_result() dumps the pydantic ResearchResult to
   `<trace_id>.result.json` next to the JSONL trace. The agent calls it
   right before emitting the `complete` step. `cli replay` now loads the
   sibling result file when present and renders the structured tables
   under the trace step log.

2. The agent emits one `gap_recorded`, `citation_recorded`, or
   `discovery_recorded` trace event per item from the final result. This
   gives the JSONL stream a queryable timeline of what was kept, with
   categories and topics in-band, without needing to load the result
   sibling.
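
A rough sketch of the per-item event fan-out, working on a plain dict
rather than the real pydantic ResearchResult. The field names (`gaps`,
`citations`, `discovery_events`) and both helper names are assumptions
for illustration; only the three event names come from this change.

```python
import json
from typing import Any, Iterable, Iterator


def item_events(result: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one trace event per gap, citation, and discovery event."""
    kinds = (
        ("gaps", "gap_recorded"),
        ("citations", "citation_recorded"),
        ("discovery_events", "discovery_recorded"),
    )
    for field, action in kinds:
        for item in result.get(field, []):
            # Item fields travel in-band; the action name always wins.
            yield {**item, "action": action}


def to_jsonl(events: Iterable[dict[str, Any]]) -> str:
    """Serialize events as one JSON object per line."""
    return "".join(json.dumps(event) + "\n" for event in events)
```

With events shaped like this, a line-oriented filter on `action` is
enough to list which gaps or citations a run kept.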

Tests: 4 added (127 total passing). Smoke-tested live with a real ask;
both files written and replay rendering verified.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:27:33 -06:00
Jeff Smith
ae48acd421 depth flag now drives constraint defaults (#30)
Previously the depth parameter (shallow/balanced/deep) was passed
only as a text hint inside the agent's user message, with no
mechanical effect on iterations, token budget, or source count.
The flag was effectively cosmetic — the LLM was expected to
"interpret" it.

Add DEPTH_PRESETS table and constraints_for_depth() helper in
researchers.web.models:

  shallow:  2 iters,  5,000 tokens,  5 sources
  balanced: 5 iters, 20,000 tokens, 10 sources  (= historical defaults)
  deep:     8 iters, 60,000 tokens, 20 sources

Wired through the stack:

- WebResearcher.research(): when constraints is None, builds from
  the depth preset instead of bare ResearchConstraints()
- MCP server `research` tool: max_iterations and token_budget now
  default to None; constraints are built via constraints_for_depth
  with explicit values overriding the preset
- CLI `ask` command: --max-iterations and --budget default to None;
  the CLI only forwards them to the MCP tool when set, so unset
  flags fall through to the depth preset

balanced is unchanged from the historical defaults so existing
callers see no behavior difference. Explicit --max-iterations /
--budget always win over the preset.
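
The preset-plus-override behavior can be sketched as below. The preset
values are the ones listed above; the ResearchConstraints field names,
the `max_sources` override parameter, and the fallback of unknown depths
to `balanced` are assumptions made for this sketch, not a copy of
researchers.web.models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResearchConstraints:
    # Stand-in for the real model; field names are assumed.
    max_iterations: int = 5
    token_budget: int = 20_000
    max_sources: int = 10


DEPTH_PRESETS = {
    "shallow": ResearchConstraints(2, 5_000, 5),
    "balanced": ResearchConstraints(5, 20_000, 10),
    "deep": ResearchConstraints(8, 60_000, 20),
}


def constraints_for_depth(
    depth: str,
    max_iterations: Optional[int] = None,
    token_budget: Optional[int] = None,
    max_sources: Optional[int] = None,
) -> ResearchConstraints:
    # Assumption: an unknown depth falls back to the balanced preset.
    preset = DEPTH_PRESETS.get(depth, DEPTH_PRESETS["balanced"])
    # An explicit value always overrides the preset; None means "unset".
    return ResearchConstraints(
        max_iterations=preset.max_iterations if max_iterations is None else max_iterations,
        token_budget=preset.token_budget if token_budget is None else token_budget,
        max_sources=preset.max_sources if max_sources is None else max_sources,
    )
```

Comparing against None rather than truthiness keeps an explicit value
distinct from an unset flag, which is why the CLI and MCP defaults were
changed to None.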

Tests cover each preset's values, balanced backward-compat,
unknown depth fallback, full override, and partial override.
116/116 tests passing. Live-verified: --depth shallow on a simple
question now caps at 2 iterations and stays under budget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:27:38 -06:00
Jeff Smith
c0d4f391b6 Display budget as spend status, not exhaustion alarm
Replace 'Budget exhausted: True/False' with 'Budget status: spent /
under cap' in the Confidence panel. The previous wording read as a
failure indicator when in practice 'exhausted' just means the agent
spent its tool-use cap before voluntarily stopping — the normal,
expected outcome on real questions with the default 20k budget.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:12:39 -06:00
Jeff Smith
6fdf0e338a M2.5.3: marchwarden costs CLI command (#26)
Adds operator-facing `marchwarden costs` subcommand that reads the
JSONL ledger from M2.5.2 and pretty-prints a rich summary:

- Cost Summary panel: total calls, total spend, total tokens (input/
  output split), Tavily search count, warning for any calls with
  unknown model prices
- Per-Day table sorted by date
- Per-Model table sorted by model id
- Highest-Cost Call panel with trace_id and question

Flags:
  --since   ISO date or relative shorthand (7d, 24h, 2w, 1m)
  --until   same
  --model   filter to a specific model_id
  --json    emit raw filtered ledger entries instead of the table
  --ledger  override default path (mostly for tests)
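
One possible way to parse the --since / --until values, shown as a
sketch. The function name `parse_when` is hypothetical, and treating
`m` as 30 days is an assumption; the commit does not state how months
are interpreted.

```python
import re
from datetime import datetime, timedelta

# Hours per unit; "m" as a flat 30 days is an assumption of this sketch.
_UNIT_HOURS = {"h": 1, "d": 24, "w": 24 * 7, "m": 24 * 30}


def parse_when(value: str, now: datetime) -> datetime:
    """Resolve '7d', '24h', '2w', '1m', or an ISO date to a datetime."""
    text = value.strip()
    match = re.fullmatch(r"(\d+)([hdwm])", text)
    if match:
        count, unit = int(match.group(1)), match.group(2)
        return now - timedelta(hours=count * _UNIT_HOURS[unit])
    # Anything else is treated as an ISO date or datetime;
    # datetime.fromisoformat raises ValueError on malformed input.
    return datetime.fromisoformat(text)
```

Passing `now` in explicitly keeps the helper deterministic under test.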

Also fixes a Dockerfile gap: the obs/ package added in M2.5.1 was
not being COPYed into the image, so the installed `marchwarden`
entry point couldn't import it. Tests had been passing because
they mounted /app over the install. Adding `COPY obs ./obs`
restores parity.

Tests cover summary rendering, model filter, since-date filter,
JSON output, and the empty-ledger friendly path. 110/110 passing.
End-to-end verified against the real cost ledger.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:57:39 -06:00
Jeff Smith
8a62f6b014 M2.5.1: Structured application logger via structlog (#24)
Adds an operational logging layer separate from the JSONL trace
audit logs. Operational logs cover system events (startup, errors,
MCP transport, research lifecycle); JSONL traces remain the
researcher provenance audit trail.

Backend: structlog with two renderers selectable via
MARCHWARDEN_LOG_FORMAT (json|console). Defaults to console when
stderr is a TTY, json otherwise — so dev runs are human-readable
and shipped runs (containers, automation) emit OpenSearch-ready
JSON without configuration.
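
The renderer selection rule can be sketched with the standard library
alone. The function name `resolve_log_format` and the choice to ignore
unrecognized values are assumptions; only the environment variable name
and the TTY default come from this change.

```python
import os
import sys


def resolve_log_format(environ=None, stream=None) -> str:
    """Return 'json' or 'console' following the rule described above."""
    environ = os.environ if environ is None else environ
    stream = sys.stderr if stream is None else stream
    explicit = environ.get("MARCHWARDEN_LOG_FORMAT", "").strip().lower()
    if explicit in ("json", "console"):
        return explicit
    # No usable explicit setting: human-readable on a terminal, JSON otherwise.
    return "console" if stream.isatty() else "json"
```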

Key features:
- Named loggers per component: marchwarden.cli,
  marchwarden.mcp, marchwarden.researcher.web
- MARCHWARDEN_LOG_LEVEL controls global level (default INFO)
- MARCHWARDEN_LOG_FILE=1 enables a 10MB-rotating file at
  ~/.marchwarden/logs/marchwarden.log
- structlog contextvars bind trace_id + researcher at the start
  of each research() call so every downstream log line carries
  them automatically; cleared on completion
- stdlib logging is funneled through the same pipeline so noisy
  third-party loggers (httpx, anthropic) get the same formatting
  and are quieted to WARNING unless DEBUG is requested
- Logs to stderr to keep MCP stdio stdout clean

Wired into:
- cli.main.cli — configures logging on startup, logs ask_started/
  ask_completed/ask_failed
- researchers.web.server.main — configures logging on startup,
  logs mcp_server_starting
- researchers.web.agent.research — binds trace context, logs
  research_started/research_completed

Tests verify JSON and console formats, contextvar propagation,
level filtering, idempotency, and auto-configure-on-first-use.
94/94 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:46:51 -06:00
Jeff Smith
d0a732735e Propagate parent env to MCP server subprocess (#18)
The mcp SDK's StdioServerParameters does not pass the parent
process's environment to the spawned server by default, so env
vars set on the CLI process (notably MARCHWARDEN_MODEL) were
silently dropped on the way to the researcher.

Pass env=os.environ.copy() to StdioServerParameters so the server
sees the same environment as the CLI. Also update scripts/docker-test.sh
to forward MARCHWARDEN_MODEL into the container and to detect a
non-TTY parent so non-interactive `ask` invocations don't fail with
"the input device is not a TTY".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:31:14 -06:00
Jeff Smith
273d144381 M2.2: marchwarden replay CLI command (#9)
Adds `marchwarden replay <trace_id>` to pretty-print a prior research
run from its JSONL trace file. Resolves the trace under
~/.marchwarden/traces/ by default; --trace-dir overrides for tests and
custom locations. Renders each step as a row with action, decision,
extra fields, and content_hash. Friendly errors for unknown trace_id
and malformed JSON lines.
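
A sketch of tolerant trace loading along these lines. The helper name
`load_trace`, the `<trace_id>.jsonl` file name, and the error wording
are assumptions for illustration, not the command's actual code.

```python
import json
from pathlib import Path


def load_trace(trace_id: str, trace_dir: Path):
    """Return (steps, problems) for a JSONL trace file."""
    path = Path(trace_dir) / f"{trace_id}.jsonl"
    if not path.exists():
        raise FileNotFoundError(f"No trace found for id {trace_id!r} in {trace_dir}")
    steps, problems = [], []
    lines = path.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            steps.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Collect the problem and keep going so one bad line
            # does not hide the rest of the run.
            problems.append(f"line {lineno}: {exc.msg}")
    return steps, problems
```

Returning the problems alongside the parsed steps lets the caller render
what it can and report malformed lines afterwards.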

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 14:57:37 -06:00
Jeff Smith
87a34c60d1 M2.1: marchwarden ask CLI command (#8)
Click app with `ask` subcommand that spawns the web researcher MCP
server over stdio, calls the research tool, and pretty-prints the
ResearchResult contract using rich (panels for answer/confidence/cost,
tables for citations, gaps, discovery events, and open questions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 14:51:40 -06:00