marchwarden

Author	SHA1	Message	Date
Jeff Smith	6ff1a6af3d	Enforce token_budget before each iteration (#17) The loop previously checked the token budget at the bottom of each iteration, after the LLM call and tool work had already happened. By the time the cap was caught the budget had been exceeded and the overshoot was unbounded by the iteration's cost. Move the check to the top of the loop so a new iteration is never started past the budget. Document the policy explicitly: token_budget is a soft cap on the tool-use loop only; the synthesis call is always allowed to complete so callers get a structured ResearchResult rather than a fallback stub. Capping synthesis is a separate, larger design question (would require splitting the budget between loop and synthesis up-front). Verified: token_budget=5000, max_iterations=10 now stops after 2 iterations with budget_exhausted=True and a complete answer with 10 citations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:29:22 -06:00
Jeff Smith	eb2e71835c	Fix invalid default model id (#15) Both the MCP server and WebResearcher defaulted to claude-sonnet-4-5-20250514, which 404s against the Anthropic API. Update both defaults to claude-sonnet-4-6, which is current as of 2026-04. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:25:19 -06:00
Jeff Smith	7956bf4873	Fix synthesis truncation and trace masking (#16, #19) The synthesis step was passing max_tokens=4096 to Claude, which was not enough for a full ResearchResult JSON over a real evidence set (28 sources). The model's output got cut mid-string, json.loads failed, and the agent fell back to a stub answer with zero citations. The trace logger then truncated the raw_response to 1000 chars before recording it, hiding the actual reason for the parse failure (the truncated JSON suffix) and making the bug invisible from traces. Fixes: - Bump synthesis max_tokens to 16384 - Capture and log Claude's stop_reason on synthesis_error so future truncation cases are diagnosable from the trace alone - Log the parser exception text alongside the raw_response - Stop slicing raw_response; record the full string Verified end-to-end against the Utah crops question: - Before: 0 citations, confidence 0.10, fallback stub - After: 9 citations, confidence 0.88, real synthesized answer Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 15:23:03 -06:00
Jeff Smith	5d894d9e10	M1.4: MCP server wrapping web researcher FastMCP server exposing a single 'research' tool: - Delegates to WebResearcher with keys from ~/secrets - Accepts question, context, depth, max_iterations, token_budget - Returns full ResearchResult as JSON - Configurable model via MARCHWARDEN_MODEL env var - Runnable as: python -m researchers.web 4 tests: secret reading, JSON response validation, default parameters. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:41:13 -06:00
Jeff Smith	ae9c11a79b	Add OpenQuestion to research contract New field on ResearchResult: open_questions — follow-up questions that emerged from the research itself. Distinct from gaps (backward: what failed) and discovery_events (sideways: what's lateral). Open questions look forward: 'based on what I found, this needs deeper investigation.' - OpenQuestion model: question, context, priority (high/medium/low), source_locator - Updated agent synthesis prompt to produce open_questions - Updated agent result builder to parse open_questions from JSON - 3 new tests for OpenQuestion model - Updated existing tests for new field 77 tests passing. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:37:30 -06:00
Jeff Smith	7cb3fde90e	M1.3: Inner agent loop with tests WebResearcher — the core agentic research loop: - Tool-use loop: Claude decides when to search (Tavily) and fetch (httpx) - Budget enforcement: stops at max_iterations or token_budget - Synthesis step: separate LLM call produces structured ResearchResult JSON - Fallback: valid ResearchResult even when synthesis JSON is unparseable - Full trace logging at every step (start, search, fetch, synthesis, complete) - Populates all contract fields: raw_excerpt, categorized gaps, discovery_events, confidence_factors, cost_metadata with model_id 9 tests: complete research loop, budget exhaustion, synthesis failure fallback, trace file creation, fetch_url tool integration, search result formatting. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:29:27 -06:00
Jeff Smith	cef08c8984	M1.2: Trace logger with tests TraceLogger produces JSONL audit logs per research() call: - One file per trace_id at ~/.marchwarden/traces/{trace_id}.jsonl - Each line is a self-contained JSON object (step, action, timestamp, decision) - Supports arbitrary kwargs (url, content_hash, query, etc.) - Lazy file handle, flush after each write, context manager support - read_entries() for replay and testing 15 tests: file creation, step counting, JSONL validity, kwargs, timestamps, flush behavior, multiple independent traces. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:21:10 -06:00
Jeff Smith	a5bc93e275	M1.1: Search and fetch tools with tests - tavily_search(): Tavily API wrapper returning SearchResult dataclasses with content hashing (raw_content preferred, falls back to summary) - fetch_url(): async URL fetch with HTML text extraction, content hashing, and graceful error handling (timeout, HTTP errors, connection errors) - _extract_text(): simple HTML → clean text (strip scripts/styles/tags, decode entities, collapse whitespace) - _sha256(): SHA-256 content hashing with 'sha256:' prefix for traces 18 tests: hashing, HTML extraction, mocked Tavily search, mocked async fetch (success, timeout, HTTP error, hash consistency). Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:17:18 -06:00
Jeff Smith	1b0f86399a	M0.3: Implement contract v1 Pydantic models with tests All Research Contract types as Pydantic models: - ResearchConstraints (input) - Citation with raw_excerpt (output) - GapCategory enum (5 categories) - Gap with structured category (output) - DiscoveryEvent (lateral findings) - ConfidenceFactors (auditable scoring inputs) - CostMetadata with model_id (resource tracking) - ResearchResult (top-level contract) 32 tests: validation, bounds checking, serialization roundtrips, JSON structure verification against contract spec. Refs: archeious/marchwarden#1 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 14:00:45 -06:00
Jeff Smith	deb124ed29	Initial project structure and scaffolding - Directory layout: researchers/web/, orchestrator/, cli/, docs/wiki/ - README with quick start and vision - CONTRIBUTING with workflow and testing guidelines - pyproject.toml with dependencies and build config - .gitignore for Python projects Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-08 11:57:15 -06:00
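
Commit a5bc93e275 describes a _sha256() helper that hashes content with SHA-256 and adds a 'sha256:' prefix for use in traces. The helper's actual source is not shown in this log, so the following is only a minimal sketch of that idea; the function name sha256_prefixed and its signature are assumptions for illustration.

```python
import hashlib


def sha256_prefixed(text: str) -> str:
    """Return the SHA-256 hex digest of text with a 'sha256:' prefix.

    Hypothetical stand-in for the _sha256() helper named in the log;
    the real signature and encoding choice are not shown there.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


if __name__ == "__main__":
    # Identical content always yields the same hash, so a trace entry can
    # record which content was seen without storing the content itself.
    print(sha256_prefixed("example page text"))
```

A labelled prefix keeps the hash self-describing if a different algorithm is ever recorded alongside it.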

10 commits
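
Commit 6ff1a6af3d moves the token_budget check from the bottom of the tool-use loop to the top, so a new iteration is never started once the budget is spent. The sketch below illustrates that ordering with a simulated loop. It is not the WebResearcher code: run_tool_loop, LoopOutcome, and the per-iteration costs are all hypothetical, and each LLM call is replaced by a fixed token cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopOutcome:
    iterations: int
    tokens_used: int
    budget_exhausted: bool


def run_tool_loop(step_costs: List[int], token_budget: int, max_iterations: int) -> LoopOutcome:
    """Simulate a tool-use loop that checks the budget before each iteration.

    step_costs is a hypothetical list of token costs, one per iteration,
    standing in for the real LLM call and tool work.
    """
    tokens_used = 0
    iterations = 0
    budget_exhausted = False
    for cost in step_costs[:max_iterations]:
        # Check at the top: never start a new iteration past the budget.
        if tokens_used >= token_budget:
            budget_exhausted = True
            break
        tokens_used += cost  # the iteration's work happens only after the check
        iterations += 1
    return LoopOutcome(iterations, tokens_used, budget_exhausted)


if __name__ == "__main__":
    # Assumed cost of 3000 tokens per iteration, chosen only for illustration.
    print(run_tool_loop([3000, 3000, 3000, 3000], token_budget=5000, max_iterations=10))
```

With these assumed costs the loop stops after two iterations. Because the check runs before the work, the cap stays soft: the iteration that crosses the budget still completes, but no further iteration begins, so the overshoot is at most one iteration's cost.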