marchwarden

Author	SHA1	Message	Date
claude-code	78f08c92cc	Merge pull request 'docs(stress-tests): M3.3 Phase A — calibration data collection' (#59 ) from feat/m3.3-collection into main	2026-04-09 02:22:07 +00:00
Jeff Smith	13215d7ddb	docs(stress-tests): M3.3 Phase A — calibration data collection Issue #46 (Phase A only — Phase B human rating still pending, issue stays open). Adds the data-collection half of the calibration milestone: - scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/. - scripts/calibration_collect.py — loads every persisted ResearchResult under ~/.marchwarden/traces/.result.json and emits a markdown rating worksheet with one row per run. Recovers question text from each trace's start event and category from the run-log filename. - docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns for the human-in-the-loop scoring step. - docs/stress-tests/M3.3-runs/.log — runtime logs from the calibration runner, kept as provenance. Gitignore updated with an exception carving stress-test logs out of the global *.log ignore. Note: M3.1's 4 runs predate #54 (full result persistence) and so are unrecoverable to the worksheet — only post-#54 runs have a result.json sibling. 22 rateable runs is still within the milestone target of 20–30. Phases B (human rating) and C (analysis + rubric + wiki update) follow in a later session. This issue stays open until both are done. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 20:21:47 -06:00
claude-code	3b57b563ab	Merge pull request 'feat(arxiv): ingest pipeline (M5.1.1)' (#58 ) from feat/arxiv-rag-ingest into main	2026-04-09 02:03:59 +00:00
Jeff Smith	14cfd53514	feat(arxiv): ingest pipeline (M5.1.1) Closes #38. First sub-milestone of M5.1 (Researcher #2: arxiv-rag). New package researchers/arxiv/ with three modules: - store.py — ArxivStore wraps a persistent chromadb collection at ~/.marchwarden/arxiv-rag/chroma/ plus a papers.json manifest. Chunk ids are deterministic and embedding-model-scoped (per ArxivRagProposal decision 4) so re-ingesting with a different embedder doesn't collide with prior chunks. - ingest.py — three-phase pipeline: download_pdf (arxiv API), extract_sections (pymupdf with heuristic heading detection + whole-paper fallback), and embed_and_store (sentence-transformers, configurable via MARCHWARDEN_ARXIV_EMBED_MODEL). Top-level ingest() chains them and upserts the manifest entry. Re-ingest is idempotent — chunks for the same paper are dropped before re-adding. - CLI subgroup `marchwarden arxiv add|list|info|remove`. Lazy-imports the heavy chromadb / torch deps so non-arxiv commands stay fast. The heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) are gated behind an optional `[arxiv]` extra so the base install stays slim for users who only want the web researcher. Tests: 14 added (141 total passing). Real pymupdf against synthetic PDFs generated at test time covers extract_sections; chromadb and the embedder are stubbed via dependency injection so the tests stay fast, deterministic, and network-free. End-to-end ingest() is exercised with a mocked arxiv.Search that produces synthetic PDFs. Out of scope for #38 (covered by later sub-milestones): - Retrieval / search API (#39) - ArxivResearcher agent loop (#40) - MCP server (#41) - ask --researcher arxiv flag (#42) - Cost ledger embedding_calls field (#43) Notes: - pip install pulled in CUDA torch wheel (~2GB nvidia libs); harmless on CPU-only WSL but a future optimization would pin the CPU torch index. - Live smoke against a real arxiv id deferred so we don't block the M3.3 collection runner currently using the venv. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 20:03:42 -06:00
claude-code	f27ba3cdcf	Merge pull request 'docs(stress-tests): archive M3.2 multi-axis results' (#57 ) from feat/m3.2-multiaxis into main	2026-04-09 01:35:01 +00:00
Jeff Smith	0ddc1e6e37	docs(stress-tests): archive M3.2 multi-axis results Single deep query against AWS Lambda vs Azure Functions for HFT exercised 3 of 4 target axes simultaneously: recency, contradictions, and budget pressure all fired in the same run. scope_exceeded miss is soft (1 of 5 gaps was arguably miscategorized as source_not_found). First in-the-wild observation of the `contradiction` discovery_event type. Issue #45. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:34:27 -06:00
claude-code	f68bbb1052	Merge pull request 'fix(observability): persist full ResearchResult and per-item trace events' (#56 ) from feat/trace-full-result into main	2026-04-09 01:27:47 +00:00
Jeff Smith	1203b07248	fix(observability): persist full ResearchResult and per-item trace events Closes #54. The JSONL trace previously stored only counts on the `complete` event (gap_count, citation_count, discovery_count). Replay could re-render the step log but could not recover which gaps fired or which sources were cited, blocking M3.2/M3.3 stress-testing and calibration work. Two complementary fixes: 1. (a) TraceLogger.write_result() dumps the pydantic ResearchResult to `<trace_id>.result.json` next to the JSONL trace. The agent calls it right before emitting the `complete` step. `cli replay` now loads the sibling result file when present and renders the structured tables under the trace step log. 2. (b) The agent emits one `gap_recorded`, `citation_recorded`, or `discovery_recorded` trace event per item from the final result. This gives the JSONL stream a queryable timeline of what was kept, with categories and topics in-band, without needing to load the result sibling. Tests: 4 added (127 total passing). Smoke-tested live with a real ask; both files written and replay rendering verified. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:27:33 -06:00
claude-code	5ea6879ba0	Merge pull request 'docs(stress-tests): archive M3.1 results' (#55 ) from feat/m3.1-stress-tests into main	2026-04-09 01:21:43 +00:00
Jeff Smith	a39407f03e	docs(stress-tests): archive M3.1 results Single-axis stress test results from Issue #44. 1 of 4 query targets cleanly hit (Q3); Q1/Q2 missed because queries weren't adversarial enough; Q4 missed due to budget cap lag bug filed as #53. Trace observability gap blocking M3.2/M3.3 filed as #54. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:21:34 -06:00
Jeff Smith	d279c4c20e	chore: update CLAUDE.md for session 2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 17:30:59 -06:00
archeious	af7935819f	Merge pull request 'Record per-step durations in trace and operational logs' (#36 ) from feat/step-durations into main Reviewed-on: #36	2026-04-08 22:58:12 +00:00
Jeff Smith	ddaf7e85c3	Record per-step durations in trace and operational logs (#35 ) TraceLogger now tracks monotonic start times for starter actions (web_search, fetch_url, synthesis_start, start) and attaches a duration_ms field to the matching completer (web_search_complete, fetch_url_complete, synthesis_complete, synthesis_error). The terminal 'complete' step gets total_duration_sec instead. Pairings are tightly sequential in the agent code (each _execute_tool call runs start→end before returning), so a simple dict keyed by starter name suffices — no queueing needed. An unpaired completer leaves duration unset and does not crash. Durations flow into both the JSONL trace and the structlog operational log, so OpenSearch queries can filter / aggregate by step latency without cross-row joins. Verified end-to-end on a real shallow query: web_search 5,233 ms web_search 3,006 ms synthesis_complete 27,658 ms complete 47.547 s total Synthesis is by far the slowest step — visible at a glance for the first time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:49:52 -06:00
archeious	226c1c8660	Merge pull request 'depth flag now drives constraint defaults' (#33 ) from feat/depth-presets into main Reviewed-on: #33	2026-04-08 22:33:48 +00:00
archeious	6e7d3bc98a	Merge pull request 'chore: Makefile with venv-based dev workflow' (#34 ) from chore/makefile into main Reviewed-on: #34 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 22:32:28 +00:00
Jeff Smith	9ecc1db38d	chore: add Makefile with venv-based dev workflow Targets: make install create .venv and pip install -e ".[dev]" make test pytest inside the venv make test-cov pytest with coverage make lint ruff + black --check make ask run a sample research call make costs show the cost ledger make clean remove venv and caches make docker-build / docker-test parity wrappers for the docker flow Lets contributors get from clone to running CLI in one command without depending on docker. README points at make install as the recommended path; manual venv steps documented as fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:31:00 -06:00
Jeff Smith	ae48acd421	depth flag now drives constraint defaults (#30 ) Previously the depth parameter (shallow/balanced/deep) was passed only as a text hint inside the agent's user message, with no mechanical effect on iterations, token budget, or source count. The flag was effectively cosmetic — the LLM was expected to "interpret" it. Add DEPTH_PRESETS table and constraints_for_depth() helper in researchers.web.models: shallow: 2 iters, 5,000 tokens, 5 sources balanced: 5 iters, 20,000 tokens, 10 sources (= historical defaults) deep: 8 iters, 60,000 tokens, 20 sources Wired through the stack: - WebResearcher.research(): when constraints is None, builds from the depth preset instead of bare ResearchConstraints() - MCP server `research` tool: max_iterations and token_budget now default to None; constraints are built via constraints_for_depth with explicit values overriding the preset - CLI `ask` command: --max-iterations and --budget default to None; the CLI only forwards them to the MCP tool when set, so unset flags fall through to the depth preset balanced is unchanged from the historical defaults so existing callers see no behavior difference. Explicit --max-iterations / --budget always win over the preset. Tests cover each preset's values, balanced backward-compat, unknown depth fallback, full override, and partial override. 116/116 tests passing. Live-verified: --depth shallow on a simple question now caps at 2 iterations and stays under budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:27:38 -06:00
archeious	d51f16d33e	Merge pull request 'Mirror trace steps to the operational logger' (#32 ) from feat/per-step-logging into main Reviewed-on: #32 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 22:23:01 +00:00
Jeff Smith	b510902af3	Mirror trace steps to operational logger The trace JSONL captures every step of a research call (search, fetch, iteration boundaries, synthesis), but the structured operational log only fired at research_started / research_completed, giving administrators no real-time visibility into agent progress. Have TraceLogger.log_step also emit a structlog event using the same action name, fields, and step counter. trace_id and researcher are already bound in contextvars by WebResearcher.research, so every line carries them automatically — no plumbing needed. Volume control: a curated set of milestone actions logs at INFO (start, iteration_start, synthesis_start/complete/error, budget_exhausted, complete). Chatty per-tool actions (web_search, fetch_url and their *_complete pairs) log at DEBUG. Default MARCHWARDEN_LOG_LEVEL=INFO shows ~9 lines per call; MARCHWARDEN_LOG_LEVEL=DEBUG shows everything. This keeps dev stderr readable while making full step visibility one env var away — and OpenSearch can ingest at DEBUG always. Verified end-to-end: Utah peak query at INFO produces 9 milestone log lines, at DEBUG produces 13. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:22:13 -06:00
archeious	bbb08b7789	Merge pull request 'Display budget as spend status, not exhaustion alarm' (#31 ) from fix/budget-display-clarity into main Reviewed-on: #31	2026-04-08 22:20:01 +00:00
Jeff Smith	c0d4f391b6	Display budget as spend status, not exhaustion alarm Replace 'Budget exhausted: True/False' with 'Budget status: spent / under cap' in the Confidence panel. The previous wording read as a failure indicator when in practice 'exhausted' just means the agent spent its tool-use cap before voluntarily stopping — the normal, expected outcome on real questions with the default 20k budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:12:39 -06:00
archeious	4816b9386e	Merge pull request 'M2.5.3: marchwarden costs CLI command' (#29 ) from feat/costs-command into main Reviewed-on: #29 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:59:07 +00:00
Jeff Smith	6fdf0e338a	M2.5.3: marchwarden costs CLI command (#26 ) Adds operator-facing `marchwarden costs` subcommand that reads the JSONL ledger from M2.5.2 and pretty-prints a rich summary: - Cost Summary panel: total calls, total spend, total tokens (input/output split), Tavily search count, warning for any calls with unknown model prices - Per-Day table sorted by date - Per-Model table sorted by model id - Highest-Cost Call panel with trace_id and question Flags: --since ISO date or relative shorthand (7d, 24h, 2w, 1m) --until same --model filter to a specific model_id --json emit raw filtered ledger entries instead of the table --ledger override default path (mostly for tests) Also fixes a Dockerfile gap: the obs/ package added in M2.5.1 was not being COPYed into the image, so the installed `marchwarden` entry point couldn't import it. Tests had been passing because they mounted /app over the install. Adding `COPY obs ./obs` restores parity. Tests cover summary rendering, model filter, since-date filter, JSON output, and the empty-ledger friendly path. 110/110 passing. End-to-end verified against the real cost ledger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:57:39 -06:00
archeious	5a0ca73e2a	Merge pull request 'M2.5.2: Cost ledger with price table' (#28 ) from feat/cost-ledger into main Reviewed-on: #28	2026-04-08 21:54:23 +00:00
Jeff Smith	0d957336f5	M2.5.2: Cost ledger with price table (#25 ) Adds an append-only JSONL ledger of every research() call at ~/.marchwarden/costs.jsonl, supplementing (not replacing) the per-call cost_metadata field returned to callers. The ledger is the operator-facing source of truth for spend tracking, queryable via the upcoming `marchwarden costs` command (M2.5.3). Fields per entry: timestamp, trace_id, question (truncated 200ch), model_id, tokens_used, tokens_input, tokens_output, iterations_run, wall_time_sec, tavily_searches, estimated_cost_usd, budget_exhausted, confidence. Cost estimation reads ~/.marchwarden/prices.toml, which is auto-created with seed values for current Anthropic + Tavily rates on first run. Operators are expected to update prices.toml manually when upstream rates change — there is no automatic fetching. Existing files are never overwritten. Unknown models log a WARN and record estimated_cost_usd: null instead of crashing. Each ledger write also emits a structured `cost_recorded` log line via the M2.5.1 logger, so cost data ships to OpenSearch alongside the ledger file with no extra plumbing. Tracking changes in agent.py: - Track tokens_input / tokens_output split (not just total) - Count tavily_searches across iterations - _synthesize now returns (result, synth_in, synth_out) so the caller can attribute synthesis tokens to the running counters - Ledger.record() called after research_completed log; failures are caught and warn-logged so a ledger write can never poison a successful research call Tests cover: price table seeding, no-overwrite of existing files, cost estimation for known/unknown models, tavily-only cost, ledger appends, question truncation, env var override. End-to-end verified with a real Anthropic+Tavily call: 9107 input + 1140 output tokens, 1 tavily search, $0.049 estimated. 104/104 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:52:25 -06:00
archeious	d25c8865ea	Merge pull request 'M2.5.1: Structured application logger' (#27 ) from feat/structured-logging into main Reviewed-on: #27 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:48:10 +00:00
Jeff Smith	8a62f6b014	M2.5.1: Structured application logger via structlog (#24 ) Adds an operational logging layer separate from the JSONL trace audit logs. Operational logs cover system events (startup, errors, MCP transport, research lifecycle); JSONL traces remain the researcher provenance audit trail. Backend: structlog with two renderers selectable via MARCHWARDEN_LOG_FORMAT (json|console). Defaults to console when stderr is a TTY, json otherwise — so dev runs are human-readable and shipped runs (containers, automation) emit OpenSearch-ready JSON without configuration. Key features: - Named loggers per component: marchwarden.cli, marchwarden.mcp, marchwarden.researcher.web - MARCHWARDEN_LOG_LEVEL controls global level (default INFO) - MARCHWARDEN_LOG_FILE=1 enables a 10MB-rotating file at ~/.marchwarden/logs/marchwarden.log - structlog contextvars bind trace_id + researcher at the start of each research() call so every downstream log line carries them automatically; cleared on completion - stdlib logging is funneled through the same pipeline so noisy third-party loggers (httpx, anthropic) get the same formatting and quieted to WARN unless DEBUG is requested - Logs to stderr to keep MCP stdio stdout clean Wired into: - cli.main.cli — configures logging on startup, logs ask_started/ask_completed/ask_failed - researchers.web.server.main — configures logging on startup, logs mcp_server_starting - researchers.web.agent.research — binds trace context, logs research_started/research_completed Tests verify JSON and console formats, contextvar propagation, level filtering, idempotency, and auto-configure-on-first-use. 94/94 tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:46:51 -06:00
archeious	8293cbfb68	Merge pull request 'Propagate parent env to MCP server subprocess' (#23 ) from fix/mcp-env-propagation into main Reviewed-on: #23 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:32:10 +00:00
Jeff Smith	d0a732735e	Propagate parent env to MCP server subprocess (#18 ) The mcp SDK's StdioServerParameters does not pass the parent process's environment to the spawned server by default, so env vars set on the CLI process (notably MARCHWARDEN_MODEL) were silently dropped on the way to the researcher. Pass env=os.environ.copy() to StdioServerParameters so the server sees the same environment as the CLI. Also update scripts/docker-test.sh to forward MARCHWARDEN_MODEL into the container and to detect a non-TTY parent so non-interactive `ask` invocations don't fail with "the input device is not a TTY". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:31:14 -06:00
archeious	712638fe8c	Merge pull request 'Enforce token_budget before each iteration' (#22 ) from fix/budget-enforcement into main Reviewed-on: #22 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:30:26 +00:00
Jeff Smith	6ff1a6af3d	Enforce token_budget before each iteration (#17 ) The loop previously checked the token budget at the bottom of each iteration, after the LLM call and tool work had already happened. By the time the cap was caught the budget had been exceeded and the overshoot was unbounded by the iteration's cost. Move the check to the top of the loop so a new iteration is never started past the budget. Document the policy explicitly: token_budget is a soft cap on the tool-use loop only; the synthesis call is always allowed to complete so callers get a structured ResearchResult rather than a fallback stub. Capping synthesis is a separate, larger design question (would require splitting the budget between loop and synthesis up-front). Verified: token_budget=5000, max_iterations=10 now stops after 2 iterations with budget_exhausted=True and a complete answer with 10 citations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:29:22 -06:00
archeious	50d59abf52	Merge pull request 'Fix invalid default model id' (#21 ) from fix/model-default-id into main Reviewed-on: #21 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:26:05 +00:00
Jeff Smith	eb2e71835c	Fix invalid default model id (#15 ) Both the MCP server and WebResearcher defaulted to claude-sonnet-4-5-20250514, which 404s against the Anthropic API. Update both defaults to claude-sonnet-4-6, which is current as of 2026-04. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:25:19 -06:00
archeious	c19a161a62	Merge pull request 'Fix synthesis truncation and trace masking' (#20 ) from fix/synthesis-truncation into main Reviewed-on: #20 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:24:41 +00:00
Jeff Smith	7956bf4873	Fix synthesis truncation and trace masking (#16 , #19 ) The synthesis step was passing max_tokens=4096 to Claude, which was not enough for a full ResearchResult JSON over a real evidence set (28 sources). The model's output got cut mid-string, json.loads failed, and the agent fell back to a stub answer with zero citations. The trace logger then truncated the raw_response to 1000 chars before recording it, hiding the actual reason for the parse failure (the truncated JSON suffix) and making the bug invisible from traces. Fixes: - Bump synthesis max_tokens to 16384 - Capture and log Claude's stop_reason on synthesis_error so future truncation cases are diagnosable from the trace alone - Log the parser exception text alongside the raw_response - Stop slicing raw_response — record the full string Verified end-to-end against the Utah crops question: - Before: 0 citations, confidence 0.10, fallback stub - After: 9 citations, confidence 0.88, real synthesized answer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:23:03 -06:00
archeious	16d88e951b	Merge pull request 'chore: docker-based test environment' (#14 ) from chore/docker-test-env into main Reviewed-on: #14 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 21:08:27 +00:00
Jeff Smith	40d0725497	chore: add docker-based test environment (#13 ) Reproducible Python 3.12-slim container that installs the project editable with dev deps. Adds pytest-asyncio to dev deps so async tests run cleanly inside the container (host had it installed out-of-band). scripts/docker-test.sh provides build, test, ask, replay, and shell subcommands. The ask/replay/shell commands mount ~/secrets read-only and ~/.marchwarden read-write so end-to-end runs persist traces back to the host. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:06:12 -06:00
archeious	bca7294ec8	Merge pull request 'M2.2: marchwarden replay CLI command' (#12 ) from feat/cli-replay into main Reviewed-on: #12 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 20:59:12 +00:00
Jeff Smith	273d144381	M2.2: marchwarden replay CLI command (#9 ) Adds `marchwarden replay <trace_id>` to pretty-print a prior research run from its JSONL trace file. Resolves the trace under ~/.marchwarden/traces/ by default; --trace-dir overrides for tests and custom locations. Renders each step as a row with action, decision, extra fields, and content_hash. Friendly errors for unknown trace_id and malformed JSON lines. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 14:57:37 -06:00
archeious	b2b7026eb2	Merge pull request 'M2.1: marchwarden ask CLI command' (#11 ) from feat/cli-ask into main Reviewed-on: #11 Reviewed-by: archeious <archeious@unbiasedgeek.com>	2026-04-08 20:54:59 +00:00
Jeff Smith	87a34c60d1	M2.1: marchwarden ask CLI command (#8 ) Click app with `ask` subcommand that spawns the web researcher MCP server over stdio, calls the research tool, and pretty-prints the ResearchResult contract using rich (panels for answer/confidence/cost, tables for citations, gaps, discovery events, and open questions). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 14:51:40 -06:00
Jeff Smith	166d86e190	chore: add CLAUDE.md for session 1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:44:16 -06:00
archeious	7088f45f06	Merge pull request 'M1.4: MCP server' (#7 ) from feat/mcp-server into main	2026-04-08 20:41:28 +00:00
Jeff Smith	5d894d9e10	M1.4: MCP server wrapping web researcher FastMCP server exposing a single 'research' tool: - Delegates to WebResearcher with keys from ~/secrets - Accepts question, context, depth, max_iterations, token_budget - Returns full ResearchResult as JSON - Configurable model via MARCHWARDEN_MODEL env var - Runnable as: python -m researchers.web 4 tests: secret reading, JSON response validation, default parameters. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:41:13 -06:00
archeious	f593dd060b	Merge pull request 'Add OpenQuestion to research contract' (#6 ) from feat/open-questions into main	2026-04-08 20:37:54 +00:00
Jeff Smith	ae9c11a79b	Add OpenQuestion to research contract New field on ResearchResult: open_questions — follow-up questions that emerged from the research itself. Distinct from gaps (backward: what failed) and discovery_events (sideways: what's lateral). Open questions look forward: 'based on what I found, this needs deeper investigation.' - OpenQuestion model: question, context, priority (high/medium/low), source_locator - Updated agent synthesis prompt to produce open_questions - Updated agent result builder to parse open_questions from JSON - 3 new tests for OpenQuestion model - Updated existing tests for new field 77 tests passing. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:37:30 -06:00
archeious	ece2455415	Merge pull request 'M1.3: Inner agent loop' (#5 ) from feat/agent-loop into main	2026-04-08 20:29:41 +00:00
Jeff Smith	7cb3fde90e	M1.3: Inner agent loop with tests WebResearcher — the core agentic research loop: - Tool-use loop: Claude decides when to search (Tavily) and fetch (httpx) - Budget enforcement: stops at max_iterations or token_budget - Synthesis step: separate LLM call produces structured ResearchResult JSON - Fallback: valid ResearchResult even when synthesis JSON is unparseable - Full trace logging at every step (start, search, fetch, synthesis, complete) - Populates all contract fields: raw_excerpt, categorized gaps, discovery_events, confidence_factors, cost_metadata with model_id 9 tests: complete research loop, budget exhaustion, synthesis failure fallback, trace file creation, fetch_url tool integration, search result formatting. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:29:27 -06:00
archeious	21c8191b81	Merge pull request 'M1.2: Trace logger' (#4 ) from feat/trace-logger into main	2026-04-08 20:25:58 +00:00
Jeff Smith	cef08c8984	M1.2: Trace logger with tests TraceLogger produces JSONL audit logs per research() call: - One file per trace_id at ~/.marchwarden/traces/{trace_id}.jsonl - Each line is a self-contained JSON object (step, action, timestamp, decision) - Supports arbitrary kwargs (url, content_hash, query, etc.) - Lazy file handle, flush after each write, context manager support - read_entries() for replay and testing 15 tests: file creation, step counting, JSONL validity, kwargs, timestamps, flush behavior, multiple independent traces. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:21:10 -06:00
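
Commit ae48acd421 in the log above describes a DEPTH_PRESETS table and a constraints_for_depth() helper in which explicit values override the preset. What follows is a minimal sketch of that idea, not the repository's code. The preset numbers are taken from the commit message. The dataclass stand-in for ResearchConstraints, the field name max_sources, the keyword-only signature, and the choice to fall back to the balanced preset for an unknown depth are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResearchConstraints:
    """Hypothetical stand-in for the project's constraints model."""
    max_iterations: int = 5
    token_budget: int = 20_000
    max_sources: int = 10  # assumed field name


# Values as listed in the commit message for each depth.
DEPTH_PRESETS = {
    "shallow": ResearchConstraints(max_iterations=2, token_budget=5_000, max_sources=5),
    "balanced": ResearchConstraints(max_iterations=5, token_budget=20_000, max_sources=10),
    "deep": ResearchConstraints(max_iterations=8, token_budget=60_000, max_sources=20),
}


def constraints_for_depth(
    depth: str,
    *,
    max_iterations: Optional[int] = None,
    token_budget: Optional[int] = None,
    max_sources: Optional[int] = None,
) -> ResearchConstraints:
    """Build constraints from a depth preset; explicit values win over the preset."""
    # Assumption: an unknown depth falls back to "balanced".
    preset = DEPTH_PRESETS.get(depth, DEPTH_PRESETS["balanced"])
    requested = {
        "max_iterations": max_iterations,
        "token_budget": token_budget,
        "max_sources": max_sources,
    }
    # Keep only the values the caller actually set, so unset ones
    # fall through to the preset.
    overrides = {name: value for name, value in requested.items() if value is not None}
    return replace(preset, **overrides)


if __name__ == "__main__":
    print(constraints_for_depth("shallow"))
    print(constraints_for_depth("deep", token_budget=30_000))
```

Treating None as "not set" mirrors the commit's description of the CLI and MCP defaults: a flag left unset is never forwarded, so the preset value applies, while any explicit value replaces only its own field.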


58 commits