M1.3: Inner agent loop with tests
WebResearcher — the core agentic research loop:
- Tool-use loop: Claude decides when to search (Tavily) and fetch (httpx)
- Budget enforcement: stops at max_iterations or token_budget
- Synthesis step: separate LLM call produces structured ResearchResult JSON
- Fallback: valid ResearchResult even when synthesis JSON is unparseable
- Full trace logging at every step (start, search, fetch, synthesis, complete)
- Populates all contract fields: raw_excerpt, categorized gaps,
discovery_events, confidence_factors, cost_metadata with model_id
9 tests: complete research loop, budget exhaustion, synthesis failure
fallback, trace file creation, fetch_url tool integration, search
result formatting.
Refs: archeious/marchwarden#1
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-08 20:29:27 +00:00
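The fallback bullet above can be pictured with a small sketch. This is not the project's code: `SketchResult` and `parse_synthesis` are hypothetical stand-ins, and the real ResearchResult carries more fields than shown here. It only illustrates the idea of always handing back a valid result object, even when the synthesis call returns text that is not parseable JSON.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchResult:
    """Hypothetical, reduced stand-in for ResearchResult."""
    answer: str
    confidence: float
    gaps: List[str] = field(default_factory=list)


def parse_synthesis(raw: str) -> SketchResult:
    """Parse synthesis output; degrade to a low-confidence result on failure."""
    try:
        data = json.loads(raw)
        return SketchResult(
            answer=data["answer"],
            confidence=float(data.get("confidence", 0.0)),
            gaps=list(data.get("gaps", [])),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
        # Assumption: the fallback keeps the raw text as the answer and
        # records the parse failure as a gap instead of raising.
        return SketchResult(
            answer=raw.strip(),
            confidence=0.0,
            gaps=["synthesis output was not valid JSON"],
        )
```

A caller can then treat both paths the same way, for example `parse_synthesis(text).confidence`, and never has to handle a parsing exception itself.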
"""Web researcher agent — the inner agentic loop.

Takes a question, runs a plan→search→fetch→iterate→synthesize loop
using Claude as the reasoning engine and Tavily/httpx as tools.
Returns a ResearchResult conforming to the v1 contract.
"""

import asyncio
import json
import time
from typing import Optional

M2.5.1: Structured application logger via structlog (#24)
Adds an operational logging layer separate from the JSONL trace
audit logs. Operational logs cover system events (startup, errors,
MCP transport, research lifecycle); JSONL traces remain the
researcher provenance audit trail.
Backend: structlog with two renderers selectable via
MARCHWARDEN_LOG_FORMAT (json|console). Defaults to console when
stderr is a TTY, json otherwise — so dev runs are human-readable
and shipped runs (containers, automation) emit OpenSearch-ready
JSON without configuration.
Key features:
- Named loggers per component: marchwarden.cli,
marchwarden.mcp, marchwarden.researcher.web
- MARCHWARDEN_LOG_LEVEL controls global level (default INFO)
- MARCHWARDEN_LOG_FILE=1 enables a 10MB-rotating file at
~/.marchwarden/logs/marchwarden.log
- structlog contextvars bind trace_id + researcher at the start
of each research() call so every downstream log line carries
them automatically; cleared on completion
- stdlib logging is funneled through the same pipeline so noisy
third-party loggers (httpx, anthropic) get the same formatting
and quieted to WARN unless DEBUG is requested
- Logs to stderr to keep MCP stdio stdout clean
Wired into:
- cli.main.cli — configures logging on startup, logs ask_started/
ask_completed/ask_failed
- researchers.web.server.main — configures logging on startup,
logs mcp_server_starting
- researchers.web.agent.research — binds trace context, logs
research_started/research_completed
Tests verify JSON and console formats, contextvar propagation,
level filtering, idempotency, and auto-configure-on-first-use.
94/94 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:46:51 +00:00
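The contextvar binding described above can be illustrated with the standard library alone. The sketch below deliberately swaps structlog for `logging` plus `contextvars`, so it is not how the project's logger is built, and every name in it (`make_logger`, `run_research`, `sketch.researcher.web`) is hypothetical. It shows only the behavior the commit message describes: fields bound at the start of a call appear on every log line until they are cleared.

```python
import contextvars
import io
import json
import logging

# Holds the fields bound for the current research call (empty when unbound).
_trace_ctx = contextvars.ContextVar("trace_ctx", default=None)


class ContextFilter(logging.Filter):
    """Copy the bound context onto each record before it is formatted."""

    def filter(self, record):
        record.ctx = dict(_trace_ctx.get() or {})
        return True


class JsonFormatter(logging.Formatter):
    """Render one JSON object per line, merging in the bound context."""

    def format(self, record):
        payload = {"event": record.getMessage(), "level": record.levelname.lower()}
        payload.update(getattr(record, "ctx", {}))
        return json.dumps(payload)


def make_logger(stream):
    logger = logging.getLogger("sketch.researcher.web")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(ContextFilter())
    logger.addHandler(handler)
    return logger


def run_research(logger, trace_id):
    # Bind at the start of the call, clear on completion.
    token = _trace_ctx.set({"trace_id": trace_id, "researcher": "web"})
    try:
        logger.info("research_started")
        logger.info("research_completed")
    finally:
        _trace_ctx.reset(token)
```

Writing to an `io.StringIO` and calling `run_research(make_logger(buf), "t-1")` yields two JSON lines that both carry `trace_id`; a log call made afterwards carries none.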
import structlog
from anthropic import Anthropic

from obs import get_logger
M2.5.2: Cost ledger with price table (#25)
Adds an append-only JSONL ledger of every research() call at
~/.marchwarden/costs.jsonl, supplementing (not replacing) the
per-call cost_metadata field returned to callers. The ledger is
the operator-facing source of truth for spend tracking, queryable
via the upcoming `marchwarden costs` command (M2.5.3).
Fields per entry: timestamp, trace_id, question (truncated 200ch),
model_id, tokens_used, tokens_input, tokens_output, iterations_run,
wall_time_sec, tavily_searches, estimated_cost_usd, budget_exhausted,
confidence.
Cost estimation reads ~/.marchwarden/prices.toml, which is
auto-created with seed values for current Anthropic + Tavily rates
on first run. Operators are expected to update prices.toml
manually when upstream rates change — there is no automatic
fetching. Existing files are never overwritten. Unknown models
log a WARN and record estimated_cost_usd: null instead of
crashing.
Each ledger write also emits a structured `cost_recorded` log line
via the M2.5.1 logger, so cost data ships to OpenSearch alongside
the ledger file with no extra plumbing.
Tracking changes in agent.py:
- Track tokens_input / tokens_output split (not just total)
- Count tavily_searches across iterations
- _synthesize now returns (result, synth_in, synth_out) so the
caller can attribute synthesis tokens to the running counters
- Ledger.record() called after research_completed log; failures
are caught and warn-logged so a ledger write can never poison
a successful research call
Tests cover: price table seeding, no-overwrite of existing files,
cost estimation for known/unknown models, tavily-only cost,
ledger appends, question truncation, env var override.
End-to-end verified with a real Anthropic+Tavily call:
9107 input + 1140 output tokens, 1 tavily search, $0.049 estimated.
104/104 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:52:25 +00:00
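As a rough sketch of the cost estimation described above: the function below multiplies token counts by per-million-token rates, adds a per-search charge, and returns `None` for a model missing from the table. The table layout, the key names, the model name `example-model`, and the rates themselves are assumptions made for illustration; they are not taken from prices.toml or from `obs.costs`.

```python
from typing import Optional

# Illustrative rates in USD. These are placeholders, NOT values read
# from ~/.marchwarden/prices.toml.
SKETCH_PRICES = {
    "models": {
        "example-model": {"input_per_mtok": 3.00, "output_per_mtok": 15.00},
    },
    "tavily_per_search": 0.005,
}


def estimate_cost_usd(
    model_id: str,
    tokens_input: int,
    tokens_output: int,
    tavily_searches: int,
    prices: dict = SKETCH_PRICES,
) -> Optional[float]:
    """Return an estimated cost, or None when the model has no known rate."""
    rates = prices["models"].get(model_id)
    if rates is None:
        # Mirrors the described behavior: unknown models yield null
        # rather than raising.
        return None
    llm_cost = (
        tokens_input * rates["input_per_mtok"]
        + tokens_output * rates["output_per_mtok"]
    ) / 1_000_000
    search_cost = tavily_searches * prices["tavily_per_search"]
    return round(llm_cost + search_cost, 6)
```

With these placeholder rates, the token counts from the end-to-end run quoted above (9107 input, 1140 output, 1 search) come to about 0.0494, which is at least in the same range as the reported estimate.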
from obs.costs import CostLedger
from researchers.web.models import (
    Citation,
    ConfidenceFactors,
    CostMetadata,
    DiscoveryEvent,
    Gap,
    GapCategory,
2026-04-08 20:37:30 +00:00
    OpenQuestion,
    ResearchConstraints,
    ResearchResult,
depth flag now drives constraint defaults (#30)
Previously the depth parameter (shallow/balanced/deep) was passed
only as a text hint inside the agent's user message, with no
mechanical effect on iterations, token budget, or source count.
The flag was effectively cosmetic — the LLM was expected to
"interpret" it.
Add DEPTH_PRESETS table and constraints_for_depth() helper in
researchers.web.models:
shallow: 2 iters, 5,000 tokens, 5 sources
balanced: 5 iters, 20,000 tokens, 10 sources (= historical defaults)
deep: 8 iters, 60,000 tokens, 20 sources
Wired through the stack:
- WebResearcher.research(): when constraints is None, builds from
the depth preset instead of bare ResearchConstraints()
- MCP server `research` tool: max_iterations and token_budget now
default to None; constraints are built via constraints_for_depth
with explicit values overriding the preset
- CLI `ask` command: --max-iterations and --budget default to None;
the CLI only forwards them to the MCP tool when set, so unset
flags fall through to the depth preset
balanced is unchanged from the historical defaults so existing
callers see no behavior difference. Explicit --max-iterations /
--budget always win over the preset.
Tests cover each preset's values, balanced backward-compat,
unknown depth fallback, full override, and partial override.
116/116 tests passing. Live-verified: --depth shallow on a simple
question now caps at 2 iterations and stays under budget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 22:27:38 +00:00
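The preset table and override rule above might look roughly like the sketch below. The numbers come from the commit message; everything else is an assumption: `SketchConstraints`, the field name `max_sources`, the function name `sketch_constraints_for_depth`, and the choice to fall back to `balanced` for an unknown depth. The real `DEPTH_PRESETS` and `constraints_for_depth` live in researchers.web.models and may be shaped differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchConstraints:
    """Hypothetical stand-in for ResearchConstraints."""
    max_iterations: int
    token_budget: int
    max_sources: int


SKETCH_DEPTH_PRESETS = {
    "shallow": SketchConstraints(2, 5_000, 5),
    "balanced": SketchConstraints(5, 20_000, 10),
    "deep": SketchConstraints(8, 60_000, 20),
}


def sketch_constraints_for_depth(
    depth: str,
    max_iterations: Optional[int] = None,
    token_budget: Optional[int] = None,
    max_sources: Optional[int] = None,
) -> SketchConstraints:
    """Start from the depth preset; explicit values win over the preset."""
    # Assumption: unknown depths fall back to the balanced preset.
    base = SKETCH_DEPTH_PRESETS.get(depth, SKETCH_DEPTH_PRESETS["balanced"])
    return SketchConstraints(
        max_iterations=base.max_iterations if max_iterations is None else max_iterations,
        token_budget=base.token_budget if token_budget is None else token_budget,
        max_sources=base.max_sources if max_sources is None else max_sources,
    )
```

Comparing against `None` rather than using `or` keeps an explicit `0` from being silently replaced by the preset value.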
    constraints_for_depth,
)
from researchers.web.tools import SearchResult, fetch_url, tavily_search
from researchers.web.trace import TraceLogger

log = get_logger("marchwarden.researcher.web")

SYSTEM_PROMPT = """\
You are a Marchwarden — a research specialist stationed at the frontier of knowledge. \
Your job is to investigate a question thoroughly using web search and URL fetching, \
then produce a grounded, evidence-based answer.

## Your process

1. **Plan**: Decide what to search for. Break complex questions into sub-queries.
2. **Search**: Use the web_search tool to find relevant sources.
3. **Fetch**: Use the fetch_url tool to get full content from promising URLs.
4. **Iterate**: If you don't have enough evidence, search again with refined queries.
5. **Stop**: When you have sufficient evidence OR you've exhausted your budget.

## Rules

- Every claim must be traceable to a source you actually fetched.
- If you can't find information, say so — never fabricate.
- If sources contradict each other, note the contradiction.
- If the question requires expertise outside web search (academic papers, databases, \
legal documents), note it as a discovery for another researcher.
- Be efficient. Don't fetch URLs that are clearly irrelevant from their title/snippet.
- Prefer authoritative sources (.gov, .edu, established organizations) over blogs/forums.
"""

SYNTHESIS_PROMPT = """\
Based on the evidence gathered, produce a structured research result as JSON.

## Evidence gathered
{evidence}

## Original question
{question}

## Context from caller
{context}

## Instructions

Produce a JSON object with these exact fields:

{{
  "answer": "Your synthesized answer. Every claim must trace to a citation.",
  "citations": [
    {{
      "source": "web",
      "locator": "the exact URL",
      "title": "page title",
      "snippet": "your 50-200 char summary of why this source is relevant",
      "raw_excerpt": "verbatim 100-500 char excerpt from the source that supports your claim",
      "confidence": 0.0-1.0
    }}
  ],
  "gaps": [
    {{
      "topic": "what wasn't resolved",
      "category": "source_not_found|access_denied|budget_exhausted|contradictory_sources|scope_exceeded",
      "detail": "human-readable explanation"
    }}
  ],
  "discovery_events": [
    {{
      "type": "related_research|new_source|contradiction",
      "suggested_researcher": "arxiv|database|legal|null",
      "query": "suggested query for that researcher",
      "reason": "why this matters",
      "source_locator": "URL where you found this, or null"
    }}
  ],
  "open_questions": [
    {{
      "question": "A follow-up question that emerged from the research",
      "context": "What evidence prompted this question",
      "priority": "high|medium|low",
      "source_locator": "URL where this question arose, or null"
    }}
  ],
  "confidence": 0.0-1.0,
  "confidence_factors": {{
    "num_corroborating_sources": 0,
    "source_authority": "high|medium|low",
    "contradiction_detected": false,
    "query_specificity_match": 0.0-1.0,
    "budget_exhausted": false,
    "recency": "current|recent|dated|null"
  }}
}}

Respond with ONLY the JSON object, no markdown fences, no explanation.
"""

# Tool definitions for Claude's tool_use API
TOOLS = [
    {
        "name": "web_search",
        "description": (
            "Search the web for information. Returns titles, URLs, snippets, "
            "and sometimes full page content. Use this to find sources."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query.",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Number of results (1-10). Default 5.",
                    "default": 5,
                },
            },
            "required": ["query"],
        },
    },
    {
        "name": "fetch_url",
        "description": (
            "Fetch the full text content of a URL. Use this when a search result "
            "looks promising but the snippet isn't enough. Returns extracted text."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to fetch.",
                },
            },
            "required": ["url"],
        },
    },
]


class WebResearcher:
    """Agentic web researcher that searches, fetches, and synthesizes."""

    def __init__(
        self,
        anthropic_api_key: str,
        tavily_api_key: str,
2026-04-08 21:25:19 +00:00
        model_id: str = "claude-sonnet-4-6",
        trace_dir: Optional[str] = None,
        cost_ledger: Optional[CostLedger] = None,
    ):
        self.client = Anthropic(api_key=anthropic_api_key)
        self.tavily_api_key = tavily_api_key
        self.model_id = model_id
        self.trace_dir = trace_dir
        # Lazy default — only constructed if no override is given. Tests
        # inject a CostLedger pointed at a tmp path to avoid touching
        # the real ledger file.
        self.cost_ledger = cost_ledger

async def research(
|
|
|
|
|
self,
|
|
|
|
|
question: str,
|
|
|
|
|
context: Optional[str] = None,
|
|
|
|
|
depth: str = "balanced",
|
|
|
|
|
constraints: Optional[ResearchConstraints] = None,
|
|
|
|
|
) -> ResearchResult:
|
|
|
|
|
"""Run a full research loop on a question.
|
|
|
|
|
|
|
|
|
|
Args:
|
|
|
|
|
question: The question to investigate.
|
|
|
|
|
context: What the caller already knows (optional).
|
|
|
|
|
depth: "shallow", "balanced", or "deep".
|
|
|
|
|
constraints: Budget and iteration limits.
|
|
|
|
|
|
|
|
|
|
Returns:
|
|
|
|
|
A ResearchResult conforming to the v1 contract.
|
|
|
|
|
"""
|
depth flag now drives constraint defaults (#30)
Previously the depth parameter (shallow/balanced/deep) was passed
only as a text hint inside the agent's user message, with no
mechanical effect on iterations, token budget, or source count.
The flag was effectively cosmetic — the LLM was expected to
"interpret" it.
Add DEPTH_PRESETS table and constraints_for_depth() helper in
researchers.web.models:
shallow: 2 iters, 5,000 tokens, 5 sources
balanced: 5 iters, 20,000 tokens, 10 sources (= historical defaults)
deep: 8 iters, 60,000 tokens, 20 sources
Wired through the stack:
- WebResearcher.research(): when constraints is None, builds from
the depth preset instead of bare ResearchConstraints()
- MCP server `research` tool: max_iterations and token_budget now
default to None; constraints are built via constraints_for_depth
with explicit values overriding the preset
- CLI `ask` command: --max-iterations and --budget default to None;
the CLI only forwards them to the MCP tool when set, so unset
flags fall through to the depth preset
balanced is unchanged from the historical defaults so existing
callers see no behavior difference. Explicit --max-iterations /
--budget always win over the preset.
Tests cover each preset's values, balanced backward-compat,
unknown depth fallback, full override, and partial override.
116/116 tests passing. Live-verified: --depth shallow on a simple
question now caps at 2 iterations and stays under budget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 22:27:38 +00:00
|
|
|
# If the caller didn't supply explicit constraints, build them
|
|
|
|
|
# from the depth preset (Issue #30). Callers that DO pass a
|
|
|
|
|
# ResearchConstraints are taken at their word — explicit wins.
|
|
|
|
|
constraints = constraints or constraints_for_depth(depth)
|
M1.3: Inner agent loop with tests
WebResearcher — the core agentic research loop:
- Tool-use loop: Claude decides when to search (Tavily) and fetch (httpx)
- Budget enforcement: stops at max_iterations or token_budget
- Synthesis step: separate LLM call produces structured ResearchResult JSON
- Fallback: valid ResearchResult even when synthesis JSON is unparseable
- Full trace logging at every step (start, search, fetch, synthesis, complete)
- Populates all contract fields: raw_excerpt, categorized gaps,
discovery_events, confidence_factors, cost_metadata with model_id
9 tests: complete research loop, budget exhaustion, synthesis failure
fallback, trace file creation, fetch_url tool integration, search
result formatting.
Refs: archeious/marchwarden#1
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-08 20:29:27 +00:00
|
|
|
trace = TraceLogger(trace_dir=self.trace_dir)
|
|
|
|
|
start_time = time.time()
|
|
|
|
|
total_tokens = 0
|
M2.5.2: Cost ledger with price table (#25)
Adds an append-only JSONL ledger of every research() call at
~/.marchwarden/costs.jsonl, supplementing (not replacing) the
per-call cost_metadata field returned to callers. The ledger is
the operator-facing source of truth for spend tracking, queryable
via the upcoming `marchwarden costs` command (M2.5.3).
Fields per entry: timestamp, trace_id, question (truncated 200ch),
model_id, tokens_used, tokens_input, tokens_output, iterations_run,
wall_time_sec, tavily_searches, estimated_cost_usd, budget_exhausted,
confidence.
Cost estimation reads ~/.marchwarden/prices.toml, which is
auto-created with seed values for current Anthropic + Tavily rates
on first run. Operators are expected to update prices.toml
manually when upstream rates change — there is no automatic
fetching. Existing files are never overwritten. Unknown models
log a WARN and record estimated_cost_usd: null instead of
crashing.
Each ledger write also emits a structured `cost_recorded` log line
via the M2.5.1 logger, so cost data ships to OpenSearch alongside
the ledger file with no extra plumbing.
Tracking changes in agent.py:
- Track tokens_input / tokens_output split (not just total)
- Count tavily_searches across iterations
- _synthesize now returns (result, synth_in, synth_out) so the
caller can attribute synthesis tokens to the running counters
- Ledger.record() called after research_completed log; failures
are caught and warn-logged so a ledger write can never poison
a successful research call
Tests cover: price table seeding, no-overwrite of existing files,
cost estimation for known/unknown models, tavily-only cost,
ledger appends, question truncation, env var override.
End-to-end verified with a real Anthropic+Tavily call:
9107 input + 1140 output tokens, 1 tavily search, $0.049 estimated.
104/104 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:52:25 +00:00
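The ledger behaviour described in this commit message can be illustrated with a minimal sketch. Everything named here (`PRICES`, `estimate_cost`, `record_entry`, the model name, and the rates) is a hypothetical stand-in chosen for illustration, not the project's actual `CostLedger` API or real vendor pricing. It shows only the two ideas from the message: append one JSON line per call, and record a null cost for an unknown model instead of raising.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Hypothetical price table (USD per million tokens, USD per search).
# The numbers are placeholders, not real Anthropic or Tavily rates.
PRICES = {
    "models": {"example-model": {"input_per_mtok": 3.0, "output_per_mtok": 15.0}},
    "tavily_per_search": 0.005,
}


def estimate_cost(model_id: str, tokens_in: int, tokens_out: int,
                  searches: int) -> Optional[float]:
    """Return an estimated USD cost, or None when the model has no price entry."""
    rates = PRICES["models"].get(model_id)
    if rates is None:
        return None  # unknown model: record null rather than crash
    cost = tokens_in / 1_000_000 * rates["input_per_mtok"]
    cost += tokens_out / 1_000_000 * rates["output_per_mtok"]
    cost += searches * PRICES["tavily_per_search"]
    return round(cost, 6)


def record_entry(path: Path, *, trace_id: str, question: str, model_id: str,
                 tokens_in: int, tokens_out: int, searches: int) -> dict:
    """Append one JSON line to the ledger file and return the entry written."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,
        "question": question[:200],  # truncate long questions to 200 chars
        "model_id": model_id,
        "tokens_used": tokens_in + tokens_out,
        "tokens_input": tokens_in,
        "tokens_output": tokens_out,
        "tavily_searches": searches,
        "estimated_cost_usd": estimate_cost(model_id, tokens_in, tokens_out, searches),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:  # append-only: never rewrite
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Opening the file in append mode keeps earlier entries untouched, which is what makes a JSONL ledger safe to tail or ship line by line.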
        tokens_input = 0
        tokens_output = 0
        iterations = 0
        evidence: list[dict] = []
        budget_exhausted = False
        tavily_searches = 0

        # Bind trace context so every downstream log call automatically
        # carries trace_id and researcher. Cleared in the finally block.
        structlog.contextvars.bind_contextvars(
            trace_id=trace.trace_id,
            researcher="web",
        )
        log.info(
            "research_started",
            question=question,
            depth=depth,
            max_iterations=constraints.max_iterations,
            token_budget=constraints.token_budget,
            model_id=self.model_id,
        )
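The bind-then-clear pattern used above can be illustrated with the standard library alone. This is a rough analogy built on Python's `contextvars` module, not structlog's implementation; `bind`, `emit`, and the `trace-123` id are hypothetical names invented for the sketch. It shows why every record emitted inside the call carries the bound fields, and why resetting in a `finally` block keeps them from leaking into later calls.

```python
import contextvars

# Hypothetical stand-in for a logger's bound context.
_log_context = contextvars.ContextVar("log_context", default={})


def bind(**fields) -> contextvars.Token:
    """Merge fields into the current context; return a token for resetting."""
    return _log_context.set({**_log_context.get(), **fields})


def emit(event: str, **fields) -> dict:
    """Build a log record that automatically carries the bound context."""
    return {**_log_context.get(), "event": event, **fields}


def research(question: str) -> list[dict]:
    """Emit two records that both inherit trace_id and researcher."""
    records = []
    token = bind(trace_id="trace-123", researcher="web")
    try:
        records.append(emit("research_started", question=question))
        records.append(emit("research_completed"))
    finally:
        # Restore the previous context so later calls do not inherit the ids.
        _log_context.reset(token)
    return records
```

Because the context is stored in a `ContextVar`, concurrent asyncio tasks each see their own bound values, which is the property that makes this approach suitable for an async research loop.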

        trace.log_step(
            "start",
            decision=f"Beginning research: depth={depth}",
            question=question,
            context=context or "",
            max_iterations=constraints.max_iterations,
            token_budget=constraints.token_budget,
        )

        # Build initial message
        user_message = f"Research this question: {question}"
        if context:
            user_message += f"\n\nContext from the caller: {context}"
        user_message += f"\n\nResearch depth: {depth}"

        messages = [{"role": "user", "content": user_message}]

        # --- Tool-use loop ---
2026-04-08 21:29:22 +00:00
        # Budget policy: the loop honors token_budget as a soft cap. Before
        # starting a new iteration we check whether we've already hit the
        # budget; if so we stop and let synthesis run on whatever evidence
        # we already have. Synthesis tokens are tracked but not capped here;
        # the synthesis call is always allowed to complete so the caller
        # gets a structured result rather than a stub.
        while iterations < constraints.max_iterations:
2026-04-08 21:29:22 +00:00
            if total_tokens >= constraints.token_budget:
                budget_exhausted = True
                trace.log_step(
                    "budget_exhausted",
                    decision=(
                        f"Token budget reached before iteration "
                        f"{iterations + 1}: {total_tokens}/{constraints.token_budget}"
                    ),
                )
                break

            iterations += 1

            trace.log_step(
                "iteration_start",
                decision=f"Starting iteration {iterations}/{constraints.max_iterations}",
                tokens_so_far=total_tokens,
            )

            response = self.client.messages.create(
                model=self.model_id,
                max_tokens=4096,
                system=SYSTEM_PROMPT,
                messages=messages,
                tools=TOOLS,
            )

            # Track tokens
            tokens_input += response.usage.input_tokens
            tokens_output += response.usage.output_tokens
            total_tokens += response.usage.input_tokens + response.usage.output_tokens

            # Check if the model wants to use tools
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            tavily_searches += sum(
                1 for tc in tool_calls if tc.name == "web_search"
            )

            if not tool_calls:
                # Model is done researching; extract any final text
                text_blocks = [b.text for b in response.content if b.type == "text"]
                if text_blocks:
                    trace.log_step(
                        "agent_message",
                        decision="Agent finished tool use",
                        message=text_blocks[0][:500],
                    )
                break

            # Process each tool call
            tool_results = []
            for tool_call in tool_calls:
                result_content = await self._execute_tool(
                    tool_call.name,
                    tool_call.input,
                    evidence,
                    trace,
                    constraints,
                )
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_call.id,
                        "content": result_content,
                    }
                )

            # Append assistant response + tool results to conversation
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
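The control flow of the loop above (iteration cap, soft token budget checked before each iteration, early exit when the model stops asking for tools) can be isolated in a small sketch. `Limits`, `run_loop`, and the `step` callback are hypothetical names introduced for illustration; the callback stands in for one model call plus its tool executions and is assumed to return the tokens spent and whether more tool use was requested.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Limits:
    max_iterations: int
    token_budget: int


def run_loop(limits: Limits, step: Callable[[int], tuple[int, bool]]) -> dict:
    """Run step(iteration) -> (tokens_spent, wants_more_tools) under both caps."""
    total_tokens = 0
    iterations = 0
    budget_exhausted = False
    while iterations < limits.max_iterations:
        # Soft cap: only checked before an iteration starts, so a single
        # iteration may overshoot the budget but no new one will begin.
        if total_tokens >= limits.token_budget:
            budget_exhausted = True
            break
        iterations += 1
        spent, wants_more = step(iterations)
        total_tokens += spent
        if not wants_more:
            break  # model finished on its own
    return {
        "iterations": iterations,
        "total_tokens": total_tokens,
        "budget_exhausted": budget_exhausted,
    }
```

Under this policy a run can end in three distinct ways: the iteration cap is reached, the budget check trips, or the model stops requesting tools. Only the second sets `budget_exhausted`.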
        # --- Synthesis step ---
        trace.log_step(
            "synthesis_start",
            decision="Beginning synthesis of gathered evidence",
            evidence_count=len(evidence),
            iterations_run=iterations,
            tokens_used=total_tokens,
        )

        result, synth_in, synth_out = await self._synthesize(
            question=question,
            context=context,
            evidence=evidence,
            trace=trace,
            total_tokens=total_tokens,
            iterations=iterations,
            start_time=start_time,
            budget_exhausted=budget_exhausted,
        )
        tokens_input += synth_in
        tokens_output += synth_out
2026-04-09 01:27:33 +00:00
        # Issue #54 (b): emit one trace event per gap/citation/discovery so
        # the JSONL stream contains the actual categories alongside the
        # existing summary counts. Cheap and gives us a queryable timeline.
        for c in result.citations:
            trace.log_step(
                "citation_recorded",
                decision="Citation kept in final result",
                source=c.source,
                locator=c.locator,
                title=c.title,
                confidence=c.confidence,
            )
        for g in result.gaps:
            trace.log_step(
                "gap_recorded",
                decision="Gap surfaced in final result",
                category=g.category.value,
                topic=g.topic,
                detail=g.detail,
            )
        for d in result.discovery_events:
            trace.log_step(
                "discovery_recorded",
                decision="Discovery event surfaced in final result",
                type=d.type,
                suggested_researcher=d.suggested_researcher,
                query=d.query,
                reason=d.reason,
            )
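To make the "queryable timeline" point concrete, here is a minimal sketch of per-item events written as JSON lines and then filtered by action. `MiniTrace` and `gap_categories` are hypothetical stand-ins written for this example; they are not the project's `TraceLogger`, whose on-disk schema is not shown in this section.

```python
import json
from collections import Counter


class MiniTrace:
    """Hypothetical JSONL trace writer (not the project's TraceLogger)."""

    def __init__(self, stream):
        self.stream = stream
        self.step = 0

    def log_step(self, action: str, **fields) -> None:
        """Write one event as a single JSON line with a running step number."""
        self.step += 1
        record = {"step": self.step, "action": action, **fields}
        self.stream.write(json.dumps(record) + "\n")


def gap_categories(jsonl_text: str) -> Counter:
    """Count gap categories from the per-item events in a JSONL trace."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return Counter(e["category"] for e in events if e["action"] == "gap_recorded")
```

With only summary counts in the trace, a question such as "which gap categories occur most often" would require re-running research; with one event per gap it becomes a filter over the existing file.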
        # Issue #54 (a): persist the full ResearchResult next to the trace
        # so replay and downstream analysis can recover the structured
        # contract, not just counts.
        try:
            trace.write_result(result)
        except Exception as write_err:
            log.warning("trace_result_write_failed", error=str(write_err))

        trace.log_step(
            "complete",
            decision="Research complete",
            confidence=result.confidence,
            citation_count=len(result.citations),
            gap_count=len(result.gaps),
            discovery_count=len(result.discovery_events),
        )
        trace.close()

        log.info(
            "research_completed",
            confidence=result.confidence,
            citations=len(result.citations),
            gaps=len(result.gaps),
            discovery_events=len(result.discovery_events),
            tokens_used=result.cost_metadata.tokens_used,
            iterations_run=result.cost_metadata.iterations_run,
            wall_time_sec=result.cost_metadata.wall_time_sec,
            budget_exhausted=result.cost_metadata.budget_exhausted,
        )
M2.5.2: Cost ledger with price table (#25)
Adds an append-only JSONL ledger of every research() call at
~/.marchwarden/costs.jsonl, supplementing (not replacing) the
per-call cost_metadata field returned to callers. The ledger is
the operator-facing source of truth for spend tracking, queryable
via the upcoming `marchwarden costs` command (M2.5.3).
Fields per entry: timestamp, trace_id, question (truncated 200ch),
model_id, tokens_used, tokens_input, tokens_output, iterations_run,
wall_time_sec, tavily_searches, estimated_cost_usd, budget_exhausted,
confidence.
Cost estimation reads ~/.marchwarden/prices.toml, which is
auto-created with seed values for current Anthropic + Tavily rates
on first run. Operators are expected to update prices.toml
manually when upstream rates change — there is no automatic
fetching. Existing files are never overwritten. Unknown models
log a WARN and record estimated_cost_usd: null instead of
crashing.
Each ledger write also emits a structured `cost_recorded` log line
via the M2.5.1 logger, so cost data ships to OpenSearch alongside
the ledger file with no extra plumbing.
Tracking changes in agent.py:
- Track tokens_input / tokens_output split (not just total)
- Count tavily_searches across iterations
- _synthesize now returns (result, synth_in, synth_out) so the
caller can attribute synthesis tokens to the running counters
- Ledger.record() called after research_completed log; failures
are caught and warn-logged so a ledger write can never poison
a successful research call
Tests cover: price table seeding, no-overwrite of existing files,
cost estimation for known/unknown models, tavily-only cost,
ledger appends, question truncation, env var override.
End-to-end verified with a real Anthropic+Tavily call:
9107 input + 1140 output tokens, 1 tavily search, $0.049 estimated.
104/104 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:52:25 +00:00
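The ledger behaviour this commit message describes can be sketched in a few lines. This is an illustration only, not the project's `CostLedger`: `PRICES`, `TAVILY_PER_SEARCH`, `estimate_cost`, and `record_entry` are hypothetical names, and the rates are assumed placeholder values standing in for whatever `prices.toml` holds. With these assumed rates, the token counts quoted above come out at roughly the $0.049 the message reports.

```python
import json
import time
from pathlib import Path
from typing import Optional

# Assumed USD rates per million tokens; the real values live in prices.toml.
PRICES = {"example-model": {"input": 3.00, "output": 15.00}}
TAVILY_PER_SEARCH = 0.005  # assumed USD per search


def estimate_cost(model_id: str, tokens_input: int, tokens_output: int,
                  tavily_searches: int) -> Optional[float]:
    """Return an estimated cost in USD, or None for an unknown model."""
    rates = PRICES.get(model_id)
    if rates is None:
        return None  # recorded as null in the ledger instead of raising
    cost = (
        tokens_input / 1_000_000 * rates["input"]
        + tokens_output / 1_000_000 * rates["output"]
        + tavily_searches * TAVILY_PER_SEARCH
    )
    return round(cost, 6)


def record_entry(path: Path, *, trace_id: str, question: str, model_id: str,
                 tokens_input: int, tokens_output: int,
                 tavily_searches: int) -> dict:
    """Append one JSON object per line; earlier lines are never rewritten."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "trace_id": trace_id,
        "question": question[:200],  # truncated as the commit message states
        "model_id": model_id,
        "tokens_used": tokens_input + tokens_output,
        "tokens_input": tokens_input,
        "tokens_output": tokens_output,
        "tavily_searches": tavily_searches,
        "estimated_cost_usd": estimate_cost(
            model_id, tokens_input, tokens_output, tavily_searches
        ),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Opening the file in append mode keeps each call to a single write at the end of the file, which is what makes the ledger safe to tail or query while research calls are still running.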

        # Append to the operational cost ledger. Construct on first use
        # so test injection (cost_ledger=...) and the env override
        # (MARCHWARDEN_COST_LEDGER) both work without forcing every
        # caller to build a CostLedger explicitly.
        try:
            ledger = self.cost_ledger or CostLedger()
            ledger.record(
                trace_id=result.trace_id,
                question=question,
                model_id=self.model_id,
                tokens_used=result.cost_metadata.tokens_used,
                tokens_input=tokens_input,
                tokens_output=tokens_output,
                iterations_run=result.cost_metadata.iterations_run,
                wall_time_sec=result.cost_metadata.wall_time_sec,
                tavily_searches=tavily_searches,
                budget_exhausted=result.cost_metadata.budget_exhausted,
                confidence=result.confidence,
            )
        except Exception as ledger_err:
            # Never let a ledger failure poison a successful research call.
            log.warning("cost_ledger_write_failed", error=str(ledger_err))

M2.5.1: Structured application logger via structlog (#24)
Adds an operational logging layer separate from the JSONL trace
audit logs. Operational logs cover system events (startup, errors,
MCP transport, research lifecycle); JSONL traces remain the
researcher provenance audit trail.
Backend: structlog with two renderers selectable via
MARCHWARDEN_LOG_FORMAT (json|console). Defaults to console when
stderr is a TTY, json otherwise — so dev runs are human-readable
and shipped runs (containers, automation) emit OpenSearch-ready
JSON without configuration.
Key features:
- Named loggers per component: marchwarden.cli,
marchwarden.mcp, marchwarden.researcher.web
- MARCHWARDEN_LOG_LEVEL controls global level (default INFO)
- MARCHWARDEN_LOG_FILE=1 enables a 10MB-rotating file at
~/.marchwarden/logs/marchwarden.log
- structlog contextvars bind trace_id + researcher at the start
of each research() call so every downstream log line carries
them automatically; cleared on completion
- stdlib logging is funneled through the same pipeline so noisy
third-party loggers (httpx, anthropic) get the same formatting
and quieted to WARN unless DEBUG is requested
- Logs to stderr to keep MCP stdio stdout clean
Wired into:
- cli.main.cli — configures logging on startup, logs ask_started/
ask_completed/ask_failed
- researchers.web.server.main — configures logging on startup,
logs mcp_server_starting
- researchers.web.agent.research — binds trace context, logs
research_started/research_completed
Tests verify JSON and console formats, contextvar propagation,
level filtering, idempotency, and auto-configure-on-first-use.
94/94 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:46:51 +00:00
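The context-binding idea in this commit message (bind `trace_id` and `researcher` once, have every later log line carry them, clear on completion) can be illustrated without structlog. The sketch below deliberately swaps in the standard library's `contextvars` and `logging` modules; in the project this role is played by structlog's `bind_contextvars` and `clear_contextvars`. The names `bind_context`, `clear_context`, `ContextJsonFormatter`, and `make_logger` are hypothetical.

```python
import contextvars
import io
import json
import logging

# Holds the fields bound for the current research() call.
_log_context = contextvars.ContextVar("log_context", default={})


def bind_context(**fields) -> None:
    """Merge fields into the current context without mutating the old dict."""
    _log_context.set({**_log_context.get(), **fields})


def clear_context() -> None:
    """Drop all bound fields, e.g. when a research() call completes."""
    _log_context.set({})


class ContextJsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including bound context."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "logger": record.name,
            **_log_context.get(),
        }
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("sketch.researcher.web")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(ContextJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = make_logger(stream)
bind_context(trace_id="trace-123", researcher="web")
log.info("research_started")   # carries trace_id and researcher
clear_context()
log.info("after_clear")        # carries neither
```

Because the bound fields live in a context variable rather than on the logger object, code several calls deep can log without being handed the trace ID explicitly.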
        structlog.contextvars.clear_contextvars()

M1.3: Inner agent loop with tests
WebResearcher — the core agentic research loop:
- Tool-use loop: Claude decides when to search (Tavily) and fetch (httpx)
- Budget enforcement: stops at max_iterations or token_budget
- Synthesis step: separate LLM call produces structured ResearchResult JSON
- Fallback: valid ResearchResult even when synthesis JSON is unparseable
- Full trace logging at every step (start, search, fetch, synthesis, complete)
- Populates all contract fields: raw_excerpt, categorized gaps,
discovery_events, confidence_factors, cost_metadata with model_id
9 tests: complete research loop, budget exhaustion, synthesis failure
fallback, trace file creation, fetch_url tool integration, search
result formatting.
Refs: archeious/marchwarden#1
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-08 20:29:27 +00:00
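The budget enforcement listed in this commit message (stop at `max_iterations` or `token_budget`) can be sketched as a small loop. This is a simplified stand-in, not the project's `research()` method: `LoopOutcome`, `run_budgeted_loop`, and the `step` callback are hypothetical, and treating a hit iteration cap as "budget exhausted" is an assumption about the intended semantics.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoopOutcome:
    iterations: int
    tokens_used: int
    budget_exhausted: bool


def run_budgeted_loop(
    step: Callable[[int], Tuple[int, bool]],
    max_iterations: int,
    token_budget: int,
) -> LoopOutcome:
    """Call step(i) until it reports done or a budget limit is reached.

    step returns (tokens_spent, done) for iteration i.
    """
    tokens_used = 0
    iterations = 0
    while iterations < max_iterations:
        if tokens_used >= token_budget:
            # Token budget spent before the step reported completion.
            return LoopOutcome(iterations, tokens_used, budget_exhausted=True)
        spent, done = step(iterations)
        iterations += 1
        tokens_used += spent
        if done:
            return LoopOutcome(iterations, tokens_used, budget_exhausted=False)
    # Iteration cap reached without the step reporting completion.
    return LoopOutcome(iterations, tokens_used, budget_exhausted=True)
```

Checking the token budget before each step, rather than after, means the loop can overshoot the budget by at most one step's worth of tokens, which is the usual trade-off when token usage is only known once a call returns.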
        return result

    async def _execute_tool(
        self,
        tool_name: str,
        tool_input: dict,
        evidence: list[dict],
        trace: TraceLogger,
        constraints: ResearchConstraints,
    ) -> str:
        """Execute a tool call and return the result as a string."""

        if tool_name == "web_search":
            query = tool_input.get("query", "")
            max_results = min(
                tool_input.get("max_results", 5),
                constraints.max_sources,
            )

            trace.log_step(
                "web_search",
                decision=f"Searching: {query}",
                query=query,
                max_results=max_results,
            )

            results = tavily_search(
                api_key=self.tavily_api_key,
                query=query,
                max_results=max_results,
            )

            # Store evidence
            for r in results:
                ev = {
                    "type": "search_result",
                    "url": r.url,
                    "title": r.title,
                    "content": r.content,
                    "raw_content": r.raw_content,
                    "content_hash": r.content_hash,
                    "score": r.score,
                }
                evidence.append(ev)

            trace.log_step(
                "web_search_complete",
                decision=f"Got {len(results)} results",
                result_count=len(results),
                urls=[r.url for r in results],
            )

            # Return results as text for the LLM
            return _format_search_results(results)

        elif tool_name == "fetch_url":
            url = tool_input.get("url", "")

            trace.log_step(
                "fetch_url",
                decision=f"Fetching: {url}",
                url=url,
            )

            result = await fetch_url(url)

            trace.log_step(
                "fetch_url_complete",
                decision="Fetch succeeded" if result.success else f"Fetch failed: {result.error}",
                url=url,
                content_hash=result.content_hash,
                content_length=result.content_length,
                success=result.success,
            )

            if result.success:
                # Store evidence
                evidence.append(
                    {
                        "type": "fetched_page",
                        "url": url,
                        "content": result.text[:10000],
                        "content_hash": result.content_hash,
                        "content_length": result.content_length,
                    }
                )
                # Return truncated text for the LLM
                return result.text[:8000]
            else:
                return f"Failed to fetch URL: {result.error}"

        return f"Unknown tool: {tool_name}"

    async def _synthesize(
        self,
        question: str,
        context: Optional[str],
        evidence: list[dict],
        trace: TraceLogger,
        total_tokens: int,
        iterations: int,
        start_time: float,
        budget_exhausted: bool,
    ) -> tuple[ResearchResult, int, int]:
        """Ask the LLM to synthesize evidence into a ResearchResult.

        Returns ``(result, synthesis_input_tokens, synthesis_output_tokens)``
        so the caller can track per-call token splits for cost estimation.
        """

        # Format evidence for the synthesis prompt
        evidence_text = ""
        for i, ev in enumerate(evidence, 1):
            if ev["type"] == "search_result":
                content = ev.get("raw_content") or ev.get("content", "")
                evidence_text += (
                    f"\n--- Source {i} (search result) ---\n"
                    f"URL: {ev['url']}\n"
                    f"Title: {ev['title']}\n"
                    f"Content hash: {ev['content_hash']}\n"
                    f"Content: {content[:3000]}\n"
                )
            elif ev["type"] == "fetched_page":
                evidence_text += (
                    f"\n--- Source {i} (fetched page) ---\n"
                    f"URL: {ev['url']}\n"
                    f"Content hash: {ev['content_hash']}\n"
                    f"Content: {ev['content'][:3000]}\n"
                )

        prompt = SYNTHESIS_PROMPT.format(
            evidence=evidence_text or "(No evidence gathered)",
            question=question,
            context=context or "(No additional context)",
        )

        response = self.client.messages.create(
            model=self.model_id,
2026-04-08 21:23:03 +00:00
            max_tokens=16384,
            messages=[{"role": "user", "content": prompt}],
        )

        synth_in = response.usage.input_tokens
        synth_out = response.usage.output_tokens
        total_tokens += synth_in + synth_out
        wall_time = time.time() - start_time

        # Parse the JSON response
        raw_text = response.content[0].text.strip()
        stop_reason = response.stop_reason
        # Strip markdown fences if the model added them despite instructions
        if raw_text.startswith("```"):
            raw_text = raw_text.split("\n", 1)[1] if "\n" in raw_text else raw_text[3:]
        if raw_text.endswith("```"):
            raw_text = raw_text[:-3].strip()

        try:
            data = json.loads(raw_text)
        except json.JSONDecodeError as parse_err:
            trace.log_step(
                "synthesis_error",
                decision=(
                    f"Failed to parse synthesis JSON ({parse_err}); "
                    f"stop_reason={stop_reason}"
                ),
                stop_reason=stop_reason,
                parse_error=str(parse_err),
                raw_response=raw_text,
            )
            return (
                self._fallback_result(
                    question, evidence, trace, total_tokens, iterations,
                    wall_time, budget_exhausted,
                ),
                synth_in,
                synth_out,
            )

        trace.log_step(
            "synthesis_complete",
            decision="Parsed synthesis JSON successfully",
        )

        # Build the ResearchResult from parsed JSON
        try:
            citations = [
                Citation(
                    source=c.get("source", "web"),
                    locator=c.get("locator", ""),
                    title=c.get("title"),
                    snippet=c.get("snippet"),
                    raw_excerpt=c.get("raw_excerpt", ""),
                    confidence=c.get("confidence", 0.5),
                )
                for c in data.get("citations", [])
            ]

            gaps = [
                Gap(
                    topic=g.get("topic", ""),
                    category=GapCategory(g.get("category", "source_not_found")),
                    detail=g.get("detail", ""),
                )
                for g in data.get("gaps", [])
            ]

            discovery_events = [
                DiscoveryEvent(
                    type=d.get("type", "related_research"),
                    suggested_researcher=d.get("suggested_researcher"),
                    query=d.get("query", ""),
                    reason=d.get("reason", ""),
                    source_locator=d.get("source_locator"),
                )
                for d in data.get("discovery_events", [])
            ]

2026-04-08 20:37:30 +00:00
            open_questions = [
                OpenQuestion(
                    question=q.get("question", ""),
                    context=q.get("context", ""),
                    priority=q.get("priority", "medium"),
                    source_locator=q.get("source_locator"),
                )
                for q in data.get("open_questions", [])
            ]

            cf = data.get("confidence_factors", {})
            confidence_factors = ConfidenceFactors(
                num_corroborating_sources=cf.get("num_corroborating_sources", 0),
                source_authority=cf.get("source_authority", "low"),
                contradiction_detected=cf.get("contradiction_detected", False),
                query_specificity_match=cf.get("query_specificity_match", 0.5),
                budget_exhausted=budget_exhausted or cf.get("budget_exhausted", False),
                recency=cf.get("recency"),
            )

            return (
                ResearchResult(
                    answer=data.get("answer", "No answer could be synthesized."),
                    citations=citations,
                    gaps=gaps,
                    discovery_events=discovery_events,
                    open_questions=open_questions,
                    confidence=data.get("confidence", 0.5),
                    confidence_factors=confidence_factors,
                    cost_metadata=CostMetadata(
                        tokens_used=total_tokens,
                        iterations_run=iterations,
                        wall_time_sec=wall_time,
                        budget_exhausted=budget_exhausted,
                        model_id=self.model_id,
                    ),
                    trace_id=trace.trace_id,
                ),
                synth_in,
                synth_out,
            )
        except Exception as e:
            trace.log_step(
                "synthesis_build_error",
                decision=f"Failed to build ResearchResult: {e}",
            )
M2.5.2: Cost ledger with price table (#25)
Adds an append-only JSONL ledger of every research() call at
~/.marchwarden/costs.jsonl, supplementing (not replacing) the
per-call cost_metadata field returned to callers. The ledger is
the operator-facing source of truth for spend tracking, queryable
via the upcoming `marchwarden costs` command (M2.5.3).
Fields per entry: timestamp, trace_id, question (truncated 200ch),
model_id, tokens_used, tokens_input, tokens_output, iterations_run,
wall_time_sec, tavily_searches, estimated_cost_usd, budget_exhausted,
confidence.
Cost estimation reads ~/.marchwarden/prices.toml, which is
auto-created with seed values for current Anthropic + Tavily rates
on first run. Operators are expected to update prices.toml
manually when upstream rates change — there is no automatic
fetching. Existing files are never overwritten. Unknown models
log a WARN and record estimated_cost_usd: null instead of
crashing.
Each ledger write also emits a structured `cost_recorded` log line
via the M2.5.1 logger, so cost data ships to OpenSearch alongside
the ledger file with no extra plumbing.
Tracking changes in agent.py:
- Track tokens_input / tokens_output split (not just total)
- Count tavily_searches across iterations
- _synthesize now returns (result, synth_in, synth_out) so the
caller can attribute synthesis tokens to the running counters
- Ledger.record() called after research_completed log; failures
are caught and warn-logged so a ledger write can never poison
a successful research call
Tests cover: price table seeding, no-overwrite of existing files,
cost estimation for known/unknown models, tavily-only cost,
ledger appends, question truncation, env var override.
End-to-end verified with a real Anthropic+Tavily call:
9107 input + 1140 output tokens, 1 tavily search, $0.049 estimated.
104/104 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 21:52:25 +00:00
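The cost estimation and append-only ledger described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the project's actual ledger module: the names `PRICES`, `estimate_cost`, and `append_entry` are hypothetical, the price table is an in-memory dict standing in for prices.toml, and the rates are placeholders rather than real Anthropic or Tavily prices.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical price table (assumption): USD per million tokens and per
# Tavily search. Placeholder numbers, not real upstream rates.
PRICES = {
    "models": {"example-model": {"input_per_mtok": 3.0, "output_per_mtok": 15.0}},
    "tavily": {"per_search": 0.005},
}


def estimate_cost(
    model_id: str,
    tokens_input: int,
    tokens_output: int,
    tavily_searches: int,
    prices: dict = PRICES,
) -> Optional[float]:
    """Return an estimated USD cost, or None when the model is not priced."""
    model = prices["models"].get(model_id)
    if model is None:
        # Unknown model: record null in the ledger instead of crashing.
        return None
    cost = (
        tokens_input * model["input_per_mtok"] / 1_000_000
        + tokens_output * model["output_per_mtok"] / 1_000_000
        + tavily_searches * prices["tavily"]["per_search"]
    )
    return round(cost, 6)


def append_entry(path: Path, entry: dict) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

Opening the file in append mode keeps the ledger append-only, and returning `None` for an unpriced model maps naturally onto `estimated_cost_usd: null` in the JSONL output.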
            return (
                self._fallback_result(
                    question, evidence, trace, total_tokens, iterations,
                    wall_time, budget_exhausted,
                ),
                synth_in,
                synth_out,
            )

    def _fallback_result(
        self,
        question: str,
        evidence: list[dict],
        trace: TraceLogger,
        total_tokens: int,
        iterations: int,
        wall_time: float,
        budget_exhausted: bool,
    ) -> ResearchResult:
        """Produce a minimal valid ResearchResult when synthesis fails."""
        return ResearchResult(
            answer=f"Research on '{question}' completed but synthesis failed. {len(evidence)} sources were gathered.",
            citations=[],
            gaps=[
                Gap(
                    topic="synthesis",
                    category=GapCategory.BUDGET_EXHAUSTED
                    if budget_exhausted
                    else GapCategory.SOURCE_NOT_FOUND,
                    detail="The synthesis step failed to produce structured output.",
                )
            ],
            discovery_events=[],
            confidence=0.1,
            confidence_factors=ConfidenceFactors(
                num_corroborating_sources=0,
                source_authority="low",
                contradiction_detected=False,
                query_specificity_match=0.0,
                budget_exhausted=budget_exhausted,
                recency=None,
            ),
            cost_metadata=CostMetadata(
                tokens_used=total_tokens,
                iterations_run=iterations,
                wall_time_sec=wall_time,
                budget_exhausted=budget_exhausted,
                model_id=self.model_id,
            ),
            trace_id=trace.trace_id,
        )


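The parse-or-fall-back pattern behind `_fallback_result` can be illustrated in isolation. The sketch below is an assumption-laden simplification: `MiniResult` and `parse_or_fallback` are hypothetical stand-ins, not the project's `ResearchResult` or `_synthesize`, and the real contract carries many more fields.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MiniResult:
    """Hypothetical, reduced stand-in for ResearchResult."""
    answer: str
    confidence: float
    gaps: list = field(default_factory=list)


def parse_or_fallback(raw: str, question: str, n_sources: int) -> MiniResult:
    """Parse synthesis JSON; return a minimal valid result if that fails."""
    try:
        data = json.loads(raw)
        return MiniResult(
            answer=data["answer"],
            confidence=float(data["confidence"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Unparseable or incomplete output: the caller still receives a
        # well-formed object, with low confidence and an explicit gap.
        return MiniResult(
            answer=(
                f"Research on '{question}' completed but synthesis failed. "
                f"{n_sources} sources were gathered."
            ),
            confidence=0.1,
            gaps=["synthesis"],
        )
```

The point of the design is that callers never need a separate error path: a failed synthesis is reported through the same result type, flagged by low confidence and a recorded gap.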
def _format_search_results(results: list[SearchResult]) -> str:
    """Format search results as readable text for the LLM."""
    parts = []
    for i, r in enumerate(results, 1):
        content = r.raw_content or r.content
        parts.append(
            f"Result {i}:\n"
            f"  Title: {r.title}\n"
            f"  URL: {r.url}\n"
            f"  Relevance: {r.score:.2f}\n"
            f"  Content: {content[:2000]}\n"
        )
    return "\n".join(parts) if parts else "No results found."