Closes#54.
The JSONL trace previously stored only counts on the `complete` event
(gap_count, citation_count, discovery_count). Replay could re-render the
step log but could not recover which gaps fired or which sources were
cited, blocking M3.2/M3.3 stress-testing and calibration work.
Two complementary fixes:
1. (a) TraceLogger.write_result() dumps the pydantic ResearchResult to
`<trace_id>.result.json` next to the JSONL trace. The agent calls it
right before emitting the `complete` step. `cli replay` now loads the
sibling result file when present and renders the structured tables
under the trace step log.
2. (b) The agent emits one `gap_recorded`, `citation_recorded`, or
`discovery_recorded` trace event per item from the final result. This
gives the JSONL stream a queryable timeline of what was kept, with
categories and topics in-band, without needing to load the result
sibling.
Tests: 4 added (127 total passing). Smoke-tested live with a real ask;
both files written and replay rendering verified.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TraceLogger now tracks monotonic start times for starter actions
(web_search, fetch_url, synthesis_start, start) and attaches a
duration_ms field to the matching completer (web_search_complete,
fetch_url_complete, synthesis_complete, synthesis_error). The
terminal 'complete' step gets total_duration_sec instead.
Pairings are tightly sequential in the agent code (each
_execute_tool call runs start→end before returning), so a simple
dict keyed by starter name suffices — no queueing needed. An
unpaired completer leaves duration unset and does not crash.
Durations flow into both the JSONL trace and the structlog
operational log, so OpenSearch queries can filter / aggregate
by step latency without cross-row joins.
Verified end-to-end on a real shallow query:
web_search 5,233 ms
web_search 3,006 ms
synthesis_complete 27,658 ms
complete 47.547 s total
Synthesis is by far the slowest step — visible at a glance
for the first time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously the depth parameter (shallow/balanced/deep) was passed
only as a text hint inside the agent's user message, with no
mechanical effect on iterations, token budget, or source count.
The flag was effectively cosmetic — the LLM was expected to
"interpret" it.
Add DEPTH_PRESETS table and constraints_for_depth() helper in
researchers.web.models:
shallow: 2 iters, 5,000 tokens, 5 sources
balanced: 5 iters, 20,000 tokens, 10 sources (= historical defaults)
deep: 8 iters, 60,000 tokens, 20 sources
Wired through the stack:
- WebResearcher.research(): when constraints is None, builds from
the depth preset instead of bare ResearchConstraints()
- MCP server `research` tool: max_iterations and token_budget now
default to None; constraints are built via constraints_for_depth
with explicit values overriding the preset
- CLI `ask` command: --max-iterations and --budget default to None;
the CLI only forwards them to the MCP tool when set, so unset
flags fall through to the depth preset
balanced is unchanged from the historical defaults so existing
callers see no behavior difference. Explicit --max-iterations /
--budget always win over the preset.
Tests cover each preset's values, balanced backward-compat,
unknown depth fallback, full override, and partial override.
116/116 tests passing. Live-verified: --depth shallow on a simple
question now caps at 2 iterations and stays under budget.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds operator-facing `marchwarden costs` subcommand that reads the
JSONL ledger from M2.5.2 and pretty-prints a rich summary:
- Cost Summary panel: total calls, total spend, total tokens (input/
output split), Tavily search count, warning for any calls with
unknown model prices
- Per-Day table sorted by date
- Per-Model table sorted by model id
- Highest-Cost Call panel with trace_id and question
Flags:
--since ISO date or relative shorthand (7d, 24h, 2w, 1m)
--until same
--model filter to a specific model_id
--json emit raw filtered ledger entries instead of the table
--ledger override default path (mostly for tests)
Also fixes a Dockerfile gap: the obs/ package added in M2.5.1 was
not being COPYed into the image, so the installed `marchwarden`
entry point couldn't import it. Tests had been passing because
they mounted /app over the install. Adding `COPY obs ./obs`
restores parity.
Tests cover summary rendering, model filter, since-date filter,
JSON output, and the empty-ledger friendly path. 110/110 passing.
End-to-end verified against the real cost ledger.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds an append-only JSONL ledger of every research() call at
~/.marchwarden/costs.jsonl, supplementing (not replacing) the
per-call cost_metadata field returned to callers. The ledger is
the operator-facing source of truth for spend tracking, queryable
via the upcoming `marchwarden costs` command (M2.5.3).
Fields per entry: timestamp, trace_id, question (truncated 200ch),
model_id, tokens_used, tokens_input, tokens_output, iterations_run,
wall_time_sec, tavily_searches, estimated_cost_usd, budget_exhausted,
confidence.
Cost estimation reads ~/.marchwarden/prices.toml, which is
auto-created with seed values for current Anthropic + Tavily rates
on first run. Operators are expected to update prices.toml
manually when upstream rates change — there is no automatic
fetching. Existing files are never overwritten. Unknown models
log a WARN and record estimated_cost_usd: null instead of
crashing.
Each ledger write also emits a structured `cost_recorded` log line
via the M2.5.1 logger, so cost data ships to OpenSearch alongside
the ledger file with no extra plumbing.
Tracking changes in agent.py:
- Track tokens_input / tokens_output split (not just total)
- Count tavily_searches across iterations
- _synthesize now returns (result, synth_in, synth_out) so the
caller can attribute synthesis tokens to the running counters
- Ledger.record() called after research_completed log; failures
are caught and warn-logged so a ledger write can never poison
a successful research call
Tests cover: price table seeding, no-overwrite of existing files,
cost estimation for known/unknown models, tavily-only cost,
ledger appends, question truncation, env var override.
End-to-end verified with a real Anthropic+Tavily call:
9107 input + 1140 output tokens, 1 tavily search, $0.049 estimated.
104/104 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds an operational logging layer separate from the JSONL trace
audit logs. Operational logs cover system events (startup, errors,
MCP transport, research lifecycle); JSONL traces remain the
researcher provenance audit trail.
Backend: structlog with two renderers selectable via
MARCHWARDEN_LOG_FORMAT (json|console). Defaults to console when
stderr is a TTY, json otherwise — so dev runs are human-readable
and shipped runs (containers, automation) emit OpenSearch-ready
JSON without configuration.
Key features:
- Named loggers per component: marchwarden.cli,
marchwarden.mcp, marchwarden.researcher.web
- MARCHWARDEN_LOG_LEVEL controls global level (default INFO)
- MARCHWARDEN_LOG_FILE=1 enables a 10MB-rotating file at
~/.marchwarden/logs/marchwarden.log
- structlog contextvars bind trace_id + researcher at the start
of each research() call so every downstream log line carries
them automatically; cleared on completion
- stdlib logging is funneled through the same pipeline so noisy
third-party loggers (httpx, anthropic) get the same formatting
and quieted to WARN unless DEBUG is requested
- Logs to stderr to keep MCP stdio stdout clean
Wired into:
- cli.main.cli — configures logging on startup, logs ask_started/
ask_completed/ask_failed
- researchers.web.server.main — configures logging on startup,
logs mcp_server_starting
- researchers.web.agent.research — binds trace context, logs
research_started/research_completed
Tests verify JSON and console formats, contextvar propagation,
level filtering, idempotency, and auto-configure-on-first-use.
94/94 tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds `marchwarden replay <trace_id>` to pretty-print a prior research
run from its JSONL trace file. Resolves the trace under
~/.marchwarden/traces/ by default; --trace-dir overrides for tests and
custom locations. Renders each step as a row with action, decision,
extra fields, and content_hash. Friendly errors for unknown trace_id
and malformed JSON lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Click app with `ask` subcommand that spawns the web researcher MCP
server over stdio, calls the research tool, and pretty-prints the
ResearchResult contract using rich (panels for answer/confidence/cost,
tables for citations, gaps, discovery events, and open questions).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FastMCP server exposing a single 'research' tool:
- Delegates to WebResearcher with keys from ~/secrets
- Accepts question, context, depth, max_iterations, token_budget
- Returns full ResearchResult as JSON
- Configurable model via MARCHWARDEN_MODEL env var
- Runnable as: python -m researchers.web
4 tests: secret reading, JSON response validation, default parameters.
Refs: archeious/marchwarden#1
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
New field on ResearchResult: open_questions — follow-up questions that
emerged from the research itself. Distinct from gaps (backward: what
failed) and discovery_events (sideways: what's lateral). Open questions
look forward: 'based on what I found, this needs deeper investigation.'
- OpenQuestion model: question, context, priority (high/medium/low),
source_locator
- Updated agent synthesis prompt to produce open_questions
- Updated agent result builder to parse open_questions from JSON
- 3 new tests for OpenQuestion model
- Updated existing tests for new field
77 tests passing.
Refs: archeious/marchwarden#1
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
WebResearcher — the core agentic research loop:
- Tool-use loop: Claude decides when to search (Tavily) and fetch (httpx)
- Budget enforcement: stops at max_iterations or token_budget
- Synthesis step: separate LLM call produces structured ResearchResult JSON
- Fallback: valid ResearchResult even when synthesis JSON is unparseable
- Full trace logging at every step (start, search, fetch, synthesis, complete)
- Populates all contract fields: raw_excerpt, categorized gaps,
discovery_events, confidence_factors, cost_metadata with model_id
9 tests: complete research loop, budget exhaustion, synthesis failure
fallback, trace file creation, fetch_url tool integration, search
result formatting.
Refs: archeious/marchwarden#1
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>