From 4163e67c0bdb2cfd5c98f350ce9d77c8c5572853 Mon Sep 17 00:00:00 2001 From: Jeff Smith Date: Wed, 8 Apr 2026 16:42:11 -0600 Subject: [PATCH] Add User Guide for operators Detailed end-user documentation distinct from the Development Guide. Covers installation (make/venv/docker), configuration, every CLI subcommand (ask/replay/costs), depth presets, output interpretation, operational logging, file layout, troubleshooting, and FAQ. Co-Authored-By: Claude Opus 4.6 (1M context) --- UserGuide.md | 421 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 421 insertions(+) create mode 100644 UserGuide.md diff --git a/UserGuide.md b/UserGuide.md new file mode 100644 index 0000000..1f2a6dd --- /dev/null +++ b/UserGuide.md @@ -0,0 +1,421 @@ +# User Guide + +This guide is for **operators using Marchwarden** to ask research questions, replay traces, and track costs. If you're contributing code, see the [Development Guide](DevelopmentGuide) instead. + +--- + +## Table of contents + +1. [What Marchwarden is](#what-marchwarden-is) +2. [Installation](#installation) +3. [Configuration](#configuration) +4. [Asking a question — `marchwarden ask`](#asking-a-question) +5. [Reading the output](#reading-the-output) +6. [Replaying a trace — `marchwarden replay`](#replaying-a-trace) +7. [Tracking spend — `marchwarden costs`](#tracking-spend) +8. [Depth presets](#depth-presets) +9. [Operational logging](#operational-logging) +10. [File layout under `~/.marchwarden/`](#file-layout) +11. [Running in Docker](#running-in-docker) +12. [Troubleshooting](#troubleshooting) +13. [FAQ](#faq) + +--- + +## What Marchwarden is + +Marchwarden is an agentic web research assistant. You give it a question; it plans search queries, fetches the most promising sources, synthesizes a grounded answer with inline citations, and reports the gaps it could not resolve. 
Each call returns a structured **ResearchResult** containing:
+
+- **answer** — multi-paragraph synthesis with inline source references
+- **citations** — list of sources with raw verbatim excerpts (no rewriting)
+- **gaps** — what the agent could not resolve, categorized
+- **discovery_events** — lateral findings worth investigating with other tools
+- **open_questions** — follow-up questions the agent generated
+- **confidence** + factors — auditable score, not just a number
+- **cost_metadata** — tokens, iterations, wall-clock time, model id
+- **trace_id** — UUID linking to a per-call audit log
+
+Every research call is recorded three ways:
+- a JSONL **trace** (per-step audit log) at `~/.marchwarden/traces/<trace_id>.jsonl`
+- a one-line **cost ledger** entry at `~/.marchwarden/costs.jsonl`
+- structured **operational logs** to stderr (and optionally a rotating file)
+
+---
+
+## Installation
+
+### Option 1 — Make + venv (recommended for local use)
+
+```bash
+git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
+cd marchwarden
+make install
+source .venv/bin/activate
+```
+
+`make install` creates `.venv/`, installs the project in editable mode with dev extras, and wires up the `marchwarden` command. After activation, `which marchwarden` should resolve to `.venv/bin/marchwarden`.
+
+If `marchwarden` reports `ModuleNotFoundError: No module named 'cli'`, you have a stale install on your `$PATH`:
+```bash
+which -a marchwarden     # find the stale copy
+rm <path-to-stale-copy>  # remove it
+hash -r                  # clear bash's command cache
+```
+
+### Option 2 — Manual venv
+
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -e ".[dev]"
+```
+
+### Option 3 — Docker
+
+```bash
+make docker-build
+./scripts/docker-test.sh ask "your question here"
+```
+
+The Docker flow mounts `~/secrets` (read-only) and `~/.marchwarden/` (read-write) into the container, so traces, costs, and logs land in your real home directory just as they do with a venv install.
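Whichever option you used, a quick sanity check confirms the command resolves to the right place. This is a sketch, not part of the tooling; the `.venv/bin` path matches the layout described above.

```bash
# Post-install check: print where `marchwarden` resolves, or a hint if the
# venv isn't active. Works in any POSIX shell.
if command -v marchwarden >/dev/null 2>&1; then
  echo "marchwarden -> $(command -v marchwarden)"   # should be .venv/bin/marchwarden
else
  echo "marchwarden not on PATH; run: source .venv/bin/activate"
fi
```

If it resolves somewhere other than `.venv/bin/`, you likely have the stale-install problem described above.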
+ +--- + +## Configuration + +### API keys (required) + +Marchwarden reads two keys from `~/secrets` (a shell-style `KEY=value` file): + +``` +ANTHROPIC_API_KEY=sk-ant-... +TAVILY_API_KEY=tvly-... +``` + +Get them from: +- Anthropic: https://console.anthropic.com +- Tavily: https://tavily.com (free tier: 1,000 searches/month) + +### Environment variables (optional) + +| Variable | Purpose | Default | +|---|---|---| +| `MARCHWARDEN_MODEL` | Anthropic model id used by the researcher | `claude-sonnet-4-6` | +| `MARCHWARDEN_LOG_LEVEL` | `DEBUG` / `INFO` / `WARNING` / `ERROR` | `INFO` | +| `MARCHWARDEN_LOG_FORMAT` | `json` (OpenSearch-ready) or `console` (colored) | auto: `console` if stderr is a TTY, `json` otherwise | +| `MARCHWARDEN_LOG_FILE` | Set to `1` to also log to `~/.marchwarden/logs/marchwarden.log` (10MB rotation, 5 backups) | unset | +| `MARCHWARDEN_COST_LEDGER` | Override cost ledger path | `~/.marchwarden/costs.jsonl` | + +### Price table + +`~/.marchwarden/prices.toml` is auto-created on first run with current Anthropic + Tavily rates. Edit it manually when upstream prices change — Marchwarden does not auto-fetch. Unknown models record `estimated_cost_usd: null` rather than crash. + +--- + +## Asking a question + +```bash +marchwarden ask "What are ideal crops for a garden in Utah?" +``` + +### Flags + +| Flag | Purpose | +|---|---| +| `--depth shallow\|balanced\|deep` | Pick a research depth preset (default: `balanced`) | +| `--budget INT` | Override the depth's token budget | +| `--max-iterations INT` | Override the depth's iteration cap | + +`--budget` and `--max-iterations` always win over the depth preset. If both are unset, the depth preset chooses. + +### Examples + +```bash +# Quick lookup — shallow depth (2 iterations, 5k tokens, 5 sources) +marchwarden ask "What is the capital of Utah?" 
--depth shallow + +# Default — balanced depth (5 iterations, 20k tokens, 10 sources) +marchwarden ask "Compare cool-season and warm-season crops for Utah" + +# Thorough — deep depth (8 iterations, 60k tokens, 20 sources) +marchwarden ask "Compare AWS Lambda vs Azure Functions for HFT" --depth deep + +# Override the depth preset for one call +marchwarden ask "..." --depth balanced --budget 50000 +``` + +--- + +## Reading the output + +Output is rendered with [rich](https://github.com/Textualize/rich). Each section is a panel or table: + +### Answer panel +The synthesized answer in prose. Source numbers like `[Source 4]` map to entries in the Citations table. + +### Citations table +| Column | Meaning | +|---|---| +| `#` | Source index (matches `[Source N]` in the answer) | +| `Title / Locator` | Page title plus the URL | +| `Excerpt` | **Verbatim** text from the source (up to 500 chars). This bypasses the synthesizer to prevent quiet rewriting | +| `Conf` | Researcher's confidence in this source's accuracy (0.00–1.00) | + +If the answer contains a claim, you can read the matching `Excerpt` to verify the source actually says what the synthesizer claims it says. + +### Gaps table +Categorized reasons the agent couldn't fully resolve the question: +- `source_not_found` — no relevant pages indexed +- `access_denied` — sources existed but couldn't be fetched +- `budget_exhausted` — ran out of iterations / tokens +- `contradictory_sources` — sources disagreed and the disagreement wasn't resolvable +- `scope_exceeded` — the question reaches into a domain web search can't answer (academic papers, internal databases, legal docs) + +### Discovery Events table +Lateral findings: things the agent stumbled across that aren't in the answer but might matter for follow-up. Each suggests a `target researcher` and a query — these are how a future PI orchestrator (V2) will dispatch other specialists. 
+
+### Open Questions table
+Forward-looking questions the agent generated mid-research. Each has a priority (`high`/`medium`/`low`) and the source context that prompted it. These often reveal the *next* useful question to ask.
+
+### Confidence panel
+| Field | Meaning |
+|---|---|
+| `Overall` | 0.00–1.00. Read this in the context of the factors below, not in isolation |
+| `Corroborating sources` | How many sources agree on the core claims |
+| `Source authority` | `high` (.gov/.edu/peer-reviewed), `medium` (established orgs), `low` (blogs/forums) |
+| `Contradiction detected` | Did sources disagree? |
+| `Query specificity match` | How well the results addressed the actual question (0.00–1.00) |
+| `Budget status` | `spent` (the loop hit its cap before voluntarily stopping) or `under cap` |
+| `Recency` | `current` (<1y) / `recent` (1–3y) / `dated` (>3y) / `unknown` |
+
+**`Budget status: spent` is normal, not an error.** It means the agent used the cap you gave it before deciding it was done. Pair this with `Overall: 0.88+` for a confident answer that fully spent its budget.
+
+### Cost panel
+`Tokens`, `Iterations`, `Wall time`, `Model`. The token total includes the synthesis call, which is uncapped by design (see [Depth presets](#depth-presets) below).
+
+### Trace footer
+The `trace_id` is a UUID. Save it if you'll want to replay this run later.
+
+---
+
+## Replaying a trace
+
+Every research call writes a JSONL audit log at `~/.marchwarden/traces/<trace_id>.jsonl`. Replay it with:
+
+```bash
+marchwarden replay <trace_id>
+```
+
+The replay table shows every step the agent took: planning calls, search queries, URL fetches with content hashes, synthesis attempts, and the final outcome.
Use it to:
+
+- **Diagnose unexpected results** — see exactly what queries the agent ran and what it found
+- **Audit citations** — every fetch records a SHA-256 content hash so you can verify the same page hasn't changed since the run
+- **Debug synthesis failures** — `synthesis_error` steps record the LLM's full raw response and parse error
+
+### Flags
+
+| Flag | Purpose |
+|---|---|
+| `--trace-dir PATH` | Override the default trace directory (`~/.marchwarden/traces`) |
+
+---
+
+## Tracking spend
+
+Every research call appends one line to `~/.marchwarden/costs.jsonl` with the model, tokens (input/output split), Tavily search count, and an estimated cost in USD. Inspect it with:
+
+```bash
+marchwarden costs
+```
+
+### Output sections
+
+- **Cost Summary** — total calls, total spend, total tokens (with input/output split), Tavily searches, and a warning if any calls used a model not in your price table
+- **Per Day** — calls / tokens / spend grouped by day
+- **Per Model** — calls / tokens / spend grouped by `model_id`
+- **Highest-Cost Call** — the most expensive single run, with its `trace_id` for follow-up
+
+### Flags
+
+| Flag | Purpose |
+|---|---|
+| `--since DATE` | ISO date (`2026-04-01`) or relative (`7d`, `24h`, `2w`, `1m`) |
+| `--until DATE` | Same formats as `--since` |
+| `--model MODEL_ID` | Filter to a single model |
+| `--json` | Emit raw filtered ledger entries (one JSON object per line) instead of the table |
+| `--ledger PATH` | Override the default ledger location |
+
+### Examples
+
+```bash
+marchwarden costs                          # all-time summary
+marchwarden costs --since 7d               # last 7 days
+marchwarden costs --model claude-opus-4-6
+marchwarden costs --since 2026-04-01 --until 2026-04-08 --json
+```
+
+The `--json` mode is suitable for piping into `jq` or shipping to a billing/analytics tool.
+
+---
+
+## Depth presets
+
+The `--depth` flag picks sensible defaults for the agent loop. Explicit `--budget` and `--max-iterations` always override.
+ +| Depth | max_iterations | token_budget | max_sources | Use for | +|---|---:|---:|---:|---| +| `shallow` | 2 | 5,000 | 5 | quick lookups, factual Q&A | +| `balanced` | 5 | 20,000 | 10 | default, most questions | +| `deep` | 8 | 60,000 | 20 | comparison studies, complex investigations | + +### How the budget is enforced + +The token budget is a **soft cap on the tool-use loop only**: +- Before each new iteration, the agent checks `tokens_used >= token_budget`. If yes, the loop stops and synthesis runs on whatever evidence is gathered. +- The synthesis call itself is **uncapped** — it always completes, so you get a real ResearchResult instead of a parse-failure stub. +- This means total tokens reported in the Cost panel and ledger will normally exceed `token_budget` by the synthesis cost (~10–25k tokens depending on evidence size). + +Practical implications: +- A `balanced` run with `token_budget=20000` typically reports `tokens_used: 30000–50000` total. That's normal. +- If you need *strict* total spend control, use `shallow` and hand-tune `--budget` low. +- If you need *thorough* answers, use `deep` and accept that the call may consume 100k+ tokens. + +--- + +## Operational logging + +Marchwarden logs every research step via `structlog`. Logs go to **stderr** so they don't interfere with the research output on stdout. + +### Log levels + +- **`INFO`** (default) — milestones only (~9 lines per call): research start, each iteration boundary, synthesis start/complete, completion, cost recording +- **`DEBUG`** — every step (~13+ lines per call): adds individual `web_search`, `fetch_url`, and tool-result events + +### Formats + +- **`console`** — colored, human-readable; auto-selected when stderr is a TTY +- **`json`** — newline-delimited JSON, OpenSearch-ready; auto-selected when stderr is not a TTY (e.g., in CI, containers, or piped output) + +Set explicitly with `MARCHWARDEN_LOG_FORMAT=json` or `=console`. 
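In `json` mode each line is a self-contained JSON object, so a single research call can be pulled out of a log stream with ordinary text tools. A minimal sketch; the two log lines below are invented stand-ins, and the exact field set of real log lines may differ:

```bash
# Write two fake JSON log lines, then keep only one research call's lines.
# Real lines arrive on stderr (e.g. `marchwarden ask "..." 2> run.log`).
cat > /tmp/mw-sample.log <<'EOF'
{"event": "research_started", "trace_id": "abc-123", "level": "info"}
{"event": "research_started", "trace_id": "def-456", "level": "info"}
EOF
grep '"trace_id": "abc-123"' /tmp/mw-sample.log   # prints only the first line
```

`jq 'select(.trace_id == "abc-123")'` does the same thing with proper JSON parsing, if `jq` is available.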
+
+### Persistent file logging
+
+```bash
+MARCHWARDEN_LOG_FILE=1 marchwarden ask "..."
+```
+
+Logs are appended to `~/.marchwarden/logs/marchwarden.log` (10MB per file, 5 rotated backups). The format respects `MARCHWARDEN_LOG_FORMAT`.
+
+### Context binding
+
+Every log line emitted during a research call automatically carries:
+- `trace_id` — the same UUID you see in the Trace footer
+- `researcher` — currently always `web` (the researcher type)
+
+This means that in OpenSearch (or any structured log viewer) you can filter to a single research call with one query: `trace_id:"abc-123-..."`.
+
+---
+
+## File layout
+
+Marchwarden writes to `~/.marchwarden/` exclusively. Nothing else on disk is touched.
+
+```
+~/.marchwarden/
+├── prices.toml                 # auto-seeded price table; edit when rates change
+├── costs.jsonl                 # cost ledger, one line per research call
+├── traces/
+│   └── <trace_id>.jsonl        # per-call audit log, one file per call
+└── logs/
+    ├── marchwarden.log         # only if MARCHWARDEN_LOG_FILE=1
+    └── marchwarden.log.{1..5}  # rotated backups
+```
+
+All files are append-only or rewritten safely; you can `tail -f`, `jq`, or back them up freely.
+
+---
+
+## Running in Docker
+
+The same workflows work inside the Docker test image — useful for sandboxed runs or to avoid touching the host's Python:
+
+```bash
+make docker-build                  # one-time
+./scripts/docker-test.sh ask "your question" --depth deep
+./scripts/docker-test.sh replay <trace_id>
+```
+
+The `ask` and `replay` subcommands of `docker-test.sh` mount:
+- `~/secrets:/root/secrets:ro` — your API keys
+- `~/.marchwarden:/root/.marchwarden` — traces, costs, and logs persist back to the host
+
+The script also forwards `MARCHWARDEN_MODEL` from the host environment if set.
+
+---
+
+## Troubleshooting
+
+### `marchwarden: command not found` after `make install`
+
+Either:
+1. The venv isn't activated. Run `source .venv/bin/activate`, or use `make ask`, which calls `.venv/bin/marchwarden` directly.
+2.
A stale install exists at `~/.local/bin/marchwarden`. Run `which -a marchwarden`, delete the stale copy, then `hash -r`.
+
+### `ModuleNotFoundError: No module named 'cli'`
+
+The `marchwarden` script being run is from a stale install (e.g., a previous `pip install --user` or pipx install) that doesn't know about the current source layout. Same fix as above.
+
+### `Error: HTTP 404 Not Found` on the Anthropic API
+
+Your `MARCHWARDEN_MODEL` is set to a model id that doesn't exist. Set it to a valid id such as `claude-sonnet-4-6` or `claude-opus-4-6`; the default is `claude-sonnet-4-6`.
+
+### `Calls with unknown model price: N` warning in `marchwarden costs`
+
+You ran a research call with a `model_id` not present in `~/.marchwarden/prices.toml`. Add a section for it:
+```toml
+[models."your-model-id"]
+input_per_mtok_usd = 3.00
+output_per_mtok_usd = 15.00
+```
+Then re-run `marchwarden costs`. Existing ledger entries with `null` cost won't be retroactively fixed; future calls will pick up the new prices.
+
+### `Budget status: spent` on every run
+
+This is *expected*, not an error. See [Reading the output → Confidence panel](#reading-the-output) and [Depth presets → How the budget is enforced](#depth-presets) for details.
+
+### Synthesis fallback ("Research completed but synthesis failed")
+
+This used to happen when the synthesis JSON exceeded its `max_tokens` cap, but it was fixed in PR #20. If you still see it, file an issue with the `trace_id` — the JSONL trace will contain the exact `synthesis_error` step, including the model's raw response and parse error.
+
+### The `marchwarden ask` output is paginated / cut off
+
+`rich` renders to your terminal's width. If lines wrap awkwardly, widen your terminal or pipe to `less -R` to keep the colors:
+```bash
+marchwarden ask "..." 2>&1 | less -R
+```
+
+---
+
+## FAQ
+
+**How long does a research call take?**
+Typical wall-clock times: shallow ~15s, balanced ~30–60s, deep ~60–120s. Most of that is LLM latency, not network.
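**Can I analyze spend without the `costs` command?**
Yes: the ledger is plain JSONL, so standard text tools work on it directly. A minimal sketch that sums `estimated_cost_usd` (the field name mentioned in the price-table notes above) with `awk`; the two sample lines are invented, and real entries carry more fields. `marchwarden costs` remains the supported route.

```bash
# Sum estimated spend across ledger entries. The sample file stands in for
# ~/.marchwarden/costs.jsonl; the field names are illustrative.
cat > /tmp/mw-costs.jsonl <<'EOF'
{"trace_id": "abc-123", "model_id": "claude-sonnet-4-6", "estimated_cost_usd": 0.07}
{"trace_id": "def-456", "model_id": "claude-sonnet-4-6", "estimated_cost_usd": 0.12}
EOF
awk -F'"estimated_cost_usd": ' '{sub(/}.*/, "", $2); total += $2} END {printf "total: $%.2f\n", total}' /tmp/mw-costs.jsonl
# -> total: $0.19
```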
+ +**How much does a call cost?** +At current Sonnet 4.6 rates: shallow ~$0.02, balanced ~$0.05–$0.15, deep ~$0.20–$0.60. Run `marchwarden costs` after a few calls to see your actual numbers. + +**Can I use a different model?** +Yes. `MARCHWARDEN_MODEL=claude-opus-4-6 marchwarden ask "..."` will use Opus instead of Sonnet. Make sure the model id is in your `prices.toml` so the cost ledger can estimate spend. + +**Can the agent access local files / databases?** +Not yet. V1 is web-search only. V2+ (per the [Roadmap](Roadmap)) will add file/document and database researchers — same contract, different tools. + +**Does the agent learn between calls?** +No. Each `research()` call is stateless. The trace logs and cost ledger accumulate over time, but the agent itself starts fresh every time. Cross-call learning is on the V2+ roadmap. + +**Where do I report bugs?** +Open an issue at the [Forgejo repo](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues). Include the `trace_id` from the Trace footer — it lets us reconstruct exactly what happened. + +--- + +See also: [Architecture](Architecture), [Research Contract](ResearchContract), [Development Guide](DevelopmentGuide), [Roadmap](Roadmap)