Add User Guide for operators

Detailed end-user documentation distinct from the Development Guide.
Covers installation (make/venv/docker), configuration, every CLI
subcommand (ask/replay/costs), depth presets, output interpretation,
operational logging, file layout, troubleshooting, and FAQ.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jeff Smith 2026-04-08 16:42:11 -06:00
parent ded78ff1ce
commit 4163e67c0b

UserGuide.md (new file, 421 lines)

@ -0,0 +1,421 @@
# User Guide
This guide is for **operators using Marchwarden** to ask research questions, replay traces, and track costs. If you're contributing code, see the [Development Guide](DevelopmentGuide) instead.
---
## Table of contents
1. [What Marchwarden is](#what-marchwarden-is)
2. [Installation](#installation)
3. [Configuration](#configuration)
4. [Asking a question — `marchwarden ask`](#asking-a-question)
5. [Reading the output](#reading-the-output)
6. [Replaying a trace — `marchwarden replay`](#replaying-a-trace)
7. [Tracking spend — `marchwarden costs`](#tracking-spend)
8. [Depth presets](#depth-presets)
9. [Operational logging](#operational-logging)
10. [File layout under `~/.marchwarden/`](#file-layout)
11. [Running in Docker](#running-in-docker)
12. [Troubleshooting](#troubleshooting)
13. [FAQ](#faq)
---
## What Marchwarden is
Marchwarden is an agentic web research assistant. You give it a question; it plans search queries, fetches the most promising sources, synthesizes a grounded answer with inline citations, and reports the gaps it could not resolve. Each call returns a structured **ResearchResult** containing:
- **answer** — multi-paragraph synthesis with inline source references
- **citations** — list of sources with raw verbatim excerpts (no rewriting)
- **gaps** — what the agent could not resolve, categorized
- **discovery_events** — lateral findings worth investigating with other tools
- **open_questions** — follow-up questions the agent generated
- **confidence** + factors — auditable score, not just a number
- **cost_metadata** — tokens, iterations, wall-clock time, model id
- **trace_id** — UUID linking to a per-call audit log
Every research call is recorded three ways:
- a JSONL **trace** (per-step audit log) at `~/.marchwarden/traces/<trace_id>.jsonl`
- a one-line **cost ledger** entry at `~/.marchwarden/costs.jsonl`
- structured **operational logs** to stderr (and optionally a rotating file)
---
## Installation
### Option 1 — Make + venv (recommended for local use)
```bash
git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
cd marchwarden
make install
source .venv/bin/activate
```
`make install` creates `.venv/`, installs the project editable with dev extras, and wires the `marchwarden` command. After activation, `which marchwarden` should resolve to `.venv/bin/marchwarden`.
If `marchwarden` reports `ModuleNotFoundError: No module named 'cli'`, you have a stale install on your `$PATH`:
```bash
which -a marchwarden # find the stale copy
rm <path/to/stale> # remove it
hash -r # clear bash's command cache
```
### Option 2 — Manual venv
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```
### Option 3 — Docker
```bash
make docker-build
./scripts/docker-test.sh ask "your question here"
```
The Docker flow mounts `~/secrets` (read-only) and `~/.marchwarden/` (read-write) into the container, so traces, costs, and logs land in your real home directory, the same as with a venv install.
---
## Configuration
### API keys (required)
Marchwarden reads two keys from `~/secrets` (a shell-style `KEY=value` file):
```
ANTHROPIC_API_KEY=sk-ant-...
TAVILY_API_KEY=tvly-...
```
Get them from:
- Anthropic: https://console.anthropic.com
- Tavily: https://tavily.com (free tier: 1,000 searches/month)
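If you script around the same file, the `KEY=value` format described above can be read in a few lines of Python. This is a minimal sketch and not Marchwarden's own loader: `parse_secrets` is a hypothetical helper, and its handling of comments, `export` prefixes, and quotes is an assumption about what a shell-style file may contain.

```python
from typing import Dict


def parse_secrets(text: str) -> Dict[str, str]:
    """Parse shell-style KEY=value lines into a dict (hypothetical helper)."""
    secrets: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (assumed conventions).
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ", as in shell files.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        # Strip one layer of matching quotes around the value.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        secrets[key.strip()] = value
    return secrets


sample = "ANTHROPIC_API_KEY=sk-ant-example\n# comment\nexport TAVILY_API_KEY='tvly-example'\n"
keys = parse_secrets(sample)
missing = [k for k in ("ANTHROPIC_API_KEY", "TAVILY_API_KEY") if k not in keys]
print("missing keys:", missing)
```

A check like this can be handy as a preflight step before a batch of `marchwarden ask` calls, since both keys are required.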
### Environment variables (optional)
| Variable | Purpose | Default |
|---|---|---|
| `MARCHWARDEN_MODEL` | Anthropic model id used by the researcher | `claude-sonnet-4-6` |
| `MARCHWARDEN_LOG_LEVEL` | `DEBUG` / `INFO` / `WARNING` / `ERROR` | `INFO` |
| `MARCHWARDEN_LOG_FORMAT` | `json` (OpenSearch-ready) or `console` (colored) | auto: `console` if stderr is a TTY, `json` otherwise |
| `MARCHWARDEN_LOG_FILE` | Set to `1` to also log to `~/.marchwarden/logs/marchwarden.log` (10MB rotation, 5 backups) | unset |
| `MARCHWARDEN_COST_LEDGER` | Override cost ledger path | `~/.marchwarden/costs.jsonl` |
### Price table
`~/.marchwarden/prices.toml` is auto-created on first run with current Anthropic + Tavily rates. Edit it manually when upstream prices change — Marchwarden does not auto-fetch. Unknown models record `estimated_cost_usd: null` rather than crash.
---
## Asking a question
```bash
marchwarden ask "What are ideal crops for a garden in Utah?"
```
### Flags
| Flag | Purpose |
|---|---|
| `--depth shallow\|balanced\|deep` | Pick a research depth preset (default: `balanced`) |
| `--budget INT` | Override the depth's token budget |
| `--max-iterations INT` | Override the depth's iteration cap |
`--budget` and `--max-iterations` always win over the depth preset. If both are unset, the depth preset chooses.
### Examples
```bash
# Quick lookup — shallow depth (2 iterations, 5k tokens, 5 sources)
marchwarden ask "What is the capital of Utah?" --depth shallow
# Default — balanced depth (5 iterations, 20k tokens, 10 sources)
marchwarden ask "Compare cool-season and warm-season crops for Utah"
# Thorough — deep depth (8 iterations, 60k tokens, 20 sources)
marchwarden ask "Compare AWS Lambda vs Azure Functions for HFT" --depth deep
# Override the depth preset for one call
marchwarden ask "..." --depth balanced --budget 50000
```
---
## Reading the output
Output is rendered with [rich](https://github.com/Textualize/rich). Each section is a panel or table:
### Answer panel
The synthesized answer in prose. Source numbers like `[Source 4]` map to entries in the Citations table.
### Citations table
| Column | Meaning |
|---|---|
| `#` | Source index (matches `[Source N]` in the answer) |
| `Title / Locator` | Page title plus the URL |
| `Excerpt` | **Verbatim** text from the source (up to 500 chars). This bypasses the synthesizer to prevent quiet rewriting |
| `Conf` | Researcher's confidence in this source's accuracy (0.00-1.00) |
If the answer contains a claim, you can read the matching `Excerpt` to verify the source actually says what the synthesizer claims it says.
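When working with saved output programmatically, the `[Source N]` markers can be matched back to excerpts with a regular expression. The sketch below assumes the citations are already available as a mapping from source number to excerpt text; that shape, and the helper names, are hypothetical rather than part of Marchwarden's API.

```python
import re
from typing import Dict, List

# Matches the "[Source 4]" style markers used in the Answer panel.
SOURCE_REF = re.compile(r"\[Source (\d+)\]")


def cited_indices(answer: str) -> List[int]:
    """Return the distinct source numbers referenced in the answer, in order."""
    seen: List[int] = []
    for match in SOURCE_REF.finditer(answer):
        n = int(match.group(1))
        if n not in seen:
            seen.append(n)
    return seen


def excerpts_for(answer: str, citations: Dict[int, str]) -> Dict[int, str]:
    """Map each referenced source number to its excerpt, if one exists."""
    return {n: citations[n] for n in cited_indices(answer) if n in citations}


answer = "Tomatoes do well [Source 4]. Frost dates vary [Source 1], see also [Source 4]."
print(cited_indices(answer))  # prints: [4, 1]
```

Any number that appears in the answer but has no entry in the mapping is worth a closer look, since the excerpt is what lets you verify the claim.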
### Gaps table
Categorized reasons the agent couldn't fully resolve the question:
- `source_not_found` — no relevant pages indexed
- `access_denied` — sources existed but couldn't be fetched
- `budget_exhausted` — ran out of iterations / tokens
- `contradictory_sources` — sources disagreed and the disagreement wasn't resolvable
- `scope_exceeded` — the question reaches into a domain web search can't answer (academic papers, internal databases, legal docs)
### Discovery Events table
Lateral findings: things the agent stumbled across that aren't in the answer but might matter for follow-up. Each suggests a `target researcher` and a query — these are how a future PI orchestrator (V2) will dispatch other specialists.
### Open Questions table
Forward-looking questions the agent generated mid-research. Each has a priority (`high`/`medium`/`low`) and the source context that prompted it. These often reveal the *next* useful question to ask.
### Confidence panel
| Field | Meaning |
|---|---|
| `Overall` | 0.00-1.00. Read this in the context of the factors below, not in isolation |
| `Corroborating sources` | How many sources agree on the core claims |
| `Source authority` | `high` (.gov/.edu/peer-reviewed), `medium` (established orgs), `low` (blogs/forums) |
| `Contradiction detected` | Did sources disagree? |
| `Query specificity match` | How well the results addressed the actual question (0.00-1.00) |
| `Budget status` | `spent` (the loop hit its cap before voluntarily stopping) or `under cap` |
| `Recency` | `current` (<1y) / `recent` (1-3y) / `dated` (>3y) / `unknown` |
**`Budget status: spent` is normal, not an error.** It means the agent used the full cap you gave it before deciding it was done. Together with an `Overall` score of 0.88 or higher, it indicates a confident answer that simply spent its whole budget.
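To make the `Recency` thresholds in the table above concrete, here is a small sketch that buckets a source by age. It illustrates the documented ranges only; `recency_bucket` is a hypothetical function, and both the 365.25-day year and the treatment of a source that is exactly three years old as `recent` are assumptions, not confirmed behavior.

```python
from datetime import date
from typing import Optional


def recency_bucket(published: Optional[date], today: date) -> str:
    """Classify a source by age using the thresholds from the Confidence panel."""
    if published is None:
        return "unknown"  # no publication date available
    age_years = (today - published).days / 365.25
    if age_years < 1:
        return "current"
    if age_years <= 3:  # boundary handling is an assumption
        return "recent"
    return "dated"


today = date(2026, 4, 8)
print(recency_bucket(date(2025, 10, 1), today))  # prints: current
print(recency_bucket(date(2020, 1, 1), today))   # prints: dated
```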
### Cost panel
`Tokens`, `Iterations`, `Wall time`, `Model`. The token total includes the synthesis call, which is uncapped by design (see [Depth presets](#depth-presets) below).
### Trace footer
The `trace_id` is a UUID. Save it if you'll want to replay this run later.
---
## Replaying a trace
Every research call writes a JSONL audit log at `~/.marchwarden/traces/<trace_id>.jsonl`. Replay it with:
```bash
marchwarden replay <trace_id>
```
The replay table shows every step the agent took: planning calls, search queries, URL fetches with content hashes, synthesis attempts, and the final outcome. Use it to:
- **Diagnose unexpected results** — see exactly what queries the agent ran and what it found
- **Audit citations** — every fetch records a SHA-256 content hash so you can verify the same page hasn't changed since
- **Debug synthesis failures**: `synthesis_error` steps record the LLM's full raw response and parse error
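The citation audit described above comes down to recomputing a SHA-256 digest and comparing it with the recorded one. The following sketch assumes the trace stores a hex digest of the raw response bytes; whether Marchwarden normalizes content before hashing is not stated in this guide, so treat `content_hash` and `page_unchanged` as hypothetical helpers.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Hex SHA-256 digest of the raw page bytes."""
    return hashlib.sha256(body).hexdigest()


def page_unchanged(recorded_hash: str, current_body: bytes) -> bool:
    """True if the page fetched now hashes to the value stored in the trace."""
    return content_hash(current_body) == recorded_hash.strip().lower()


recorded = content_hash(b"Utah hardiness zones range from 4 to 9.")
print(page_unchanged(recorded, b"Utah hardiness zones range from 4 to 9."))  # prints: True
print(page_unchanged(recorded, b"Utah hardiness zones range from 5 to 9."))  # prints: False
```

A mismatch only shows that the bytes differ; dynamic pages may change between fetches even when the cited passage itself is intact.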
### Flags
| Flag | Purpose |
|---|---|
| `--trace-dir PATH` | Override default trace directory (`~/.marchwarden/traces`) |
---
## Tracking spend
Every research call appends one line to `~/.marchwarden/costs.jsonl` with model, tokens (input/output split), Tavily search count, and an estimated cost in USD. Inspect it with:
```bash
marchwarden costs
```
### Output sections
- **Cost Summary** — total calls, total spend, total tokens (with input/output split), Tavily searches, and a warning if any calls used a model not in your price table
- **Per Day** — calls / tokens / spend grouped by day
- **Per Model** — calls / tokens / spend grouped by `model_id`
- **Highest-Cost Call** — the most expensive single run, with `trace_id` for follow-up
### Flags
| Flag | Purpose |
|---|---|
| `--since DATE` | ISO date (`2026-04-01`) or relative (`7d`, `24h`, `2w`, `1m`) |
| `--until DATE` | Same |
| `--model MODEL_ID` | Filter to a single model |
| `--json` | Emit raw filtered ledger entries (one JSON per line) instead of the table |
| `--ledger PATH` | Override default ledger location |
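The two accepted shapes for `--since` and `--until` (an ISO date or a relative offset) can be illustrated with a short parser. This is a sketch of the idea, not the CLI's actual code: `parse_since` is a hypothetical name, and reading `m` as a 30-day month is an assumption, since the guide does not define it.

```python
import re
from datetime import datetime, timedelta

# Unit sizes for relative offsets. Treating "m" as 30 days is an assumption.
_UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
    "m": timedelta(days=30),
}


def parse_since(value: str, now: datetime) -> datetime:
    """Turn '7d' / '24h' / '2w' / '1m' or an ISO date into a datetime."""
    match = re.fullmatch(r"(\d+)([hdwm])", value)
    if match:
        return now - int(match.group(1)) * _UNITS[match.group(2)]
    # Fall back to ISO format, e.g. "2026-04-01" (midnight on that day).
    return datetime.fromisoformat(value)


now = datetime(2026, 4, 8, 12, 0)
print(parse_since("7d", now))          # prints: 2026-04-01 12:00:00
print(parse_since("2026-04-01", now))  # prints: 2026-04-01 00:00:00
```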
### Examples
```bash
marchwarden costs # all-time summary
marchwarden costs --since 7d # last 7 days
marchwarden costs --model claude-opus-4-6
marchwarden costs --since 2026-04-01 --until 2026-04-08 --json
```
The `--json` mode is suitable for piping into `jq` or shipping to a billing/analytics tool.
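As an alternative to `jq`, the line-per-entry output can be aggregated in Python. The sketch below relies on the `model_id` and `estimated_cost_usd` fields mentioned elsewhere in this guide; the rest of the ledger schema is not shown here, so `spend_by_model` is a hypothetical helper that ignores every other field.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def spend_by_model(lines: Iterable[str]) -> Dict[str, float]:
    """Sum estimated_cost_usd per model_id from JSON-lines ledger entries."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        cost = entry.get("estimated_cost_usd")
        if cost is None:
            continue  # model was missing from the price table
        totals[entry.get("model_id", "unknown")] += cost
    return dict(totals)


sample = [
    '{"model_id": "claude-sonnet-4-6", "estimated_cost_usd": 0.25}',
    '{"model_id": "claude-sonnet-4-6", "estimated_cost_usd": 0.5}',
    '{"model_id": "some-unpriced-model", "estimated_cost_usd": null}',
]
print(spend_by_model(sample))  # prints: {'claude-sonnet-4-6': 0.75}
```

To use it on real data, pass `sys.stdin` as `lines` and pipe `marchwarden costs --json` into the script.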
---
## Depth presets
The `--depth` flag picks sensible defaults for the agent loop. Explicit `--budget` and `--max-iterations` always override.
| Depth | max_iterations | token_budget | max_sources | Use for |
|---|---:|---:|---:|---|
| `shallow` | 2 | 5,000 | 5 | quick lookups, factual Q&A |
| `balanced` | 5 | 20,000 | 10 | default, most questions |
| `deep` | 8 | 60,000 | 20 | comparison studies, complex investigations |
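The override rule can be read as "start from the preset, then replace any value that was given explicitly". A minimal sketch under that reading follows; the numbers come from the table above, while `PRESETS` and `resolve_limits` are hypothetical names, not Marchwarden internals.

```python
from typing import Dict, Optional

# Values copied from the depth preset table.
PRESETS: Dict[str, Dict[str, int]] = {
    "shallow": {"max_iterations": 2, "token_budget": 5_000, "max_sources": 5},
    "balanced": {"max_iterations": 5, "token_budget": 20_000, "max_sources": 10},
    "deep": {"max_iterations": 8, "token_budget": 60_000, "max_sources": 20},
}


def resolve_limits(depth: str = "balanced",
                   budget: Optional[int] = None,
                   max_iterations: Optional[int] = None) -> Dict[str, int]:
    """Start from the preset, then let explicit flags win."""
    limits = dict(PRESETS[depth])  # copy so the preset table is not mutated
    if budget is not None:
        limits["token_budget"] = budget
    if max_iterations is not None:
        limits["max_iterations"] = max_iterations
    return limits


# Equivalent in spirit to: marchwarden ask "..." --depth balanced --budget 50000
print(resolve_limits("balanced", budget=50_000))
# prints: {'max_iterations': 5, 'token_budget': 50000, 'max_sources': 10}
```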
### How the budget is enforced
The token budget is a **soft cap on the tool-use loop only**:
- Before each new iteration, the agent checks `tokens_used >= token_budget`. If yes, the loop stops and synthesis runs on whatever evidence is gathered.
- The synthesis call itself is **uncapped** — it always completes, so you get a real ResearchResult instead of a parse-failure stub.
- This means total tokens reported in the Cost panel and ledger will normally exceed `token_budget` by the synthesis cost (~10-25k tokens depending on evidence size).
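The three rules above can be simulated with a toy loop. This is a sketch of the control flow only, with made-up per-iteration and synthesis costs; `run_loop` and its parameters are hypothetical and stand in for the real agent.

```python
from typing import Callable, Tuple


def run_loop(token_budget: int, max_iterations: int,
             step_cost: Callable[[int], int],
             synthesis_cost: int) -> Tuple[int, int]:
    """Return (iterations_run, total_tokens) under a soft token cap."""
    tokens_used = 0
    iterations = 0
    while iterations < max_iterations:
        # Soft cap: checked before an iteration starts, never mid-iteration.
        if tokens_used >= token_budget:
            break
        tokens_used += step_cost(iterations)
        iterations += 1
    # Synthesis always runs and is not subject to the budget check.
    tokens_used += synthesis_cost
    return iterations, tokens_used


# Illustrative numbers: 6k tokens per iteration, 15k for synthesis.
iterations, total = run_loop(20_000, 5, lambda i: 6_000, 15_000)
print(iterations, total)  # prints: 4 39000
```

With these example costs the loop stops after the fourth iteration (24,000 tokens, already over the 20,000 cap), and synthesis pushes the reported total to 39,000.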
Practical implications:
- A `balanced` run with `token_budget=20000` typically reports `tokens_used: 30000-50000` total. That's normal.
- If you need *strict* total spend control, use `shallow` and hand-tune `--budget` low.
- If you need *thorough* answers, use `deep` and accept that the call may consume 100k+ tokens.
---
## Operational logging
Marchwarden logs every research step via `structlog`. Logs go to **stderr** so they don't interfere with the research output on stdout.
### Log levels
- **`INFO`** (default) — milestones only (~9 lines per call): research start, each iteration boundary, synthesis start/complete, completion, cost recording
- **`DEBUG`** — every step (~13+ lines per call): adds individual `web_search`, `fetch_url`, and tool-result events
### Formats
- **`console`** — colored, human-readable; auto-selected when stderr is a TTY
- **`json`** — newline-delimited JSON, OpenSearch-ready; auto-selected when stderr is not a TTY (e.g., in CI, containers, or piped output)
Set explicitly with `MARCHWARDEN_LOG_FORMAT=json` or `=console`.
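The selection rule (explicit setting first, TTY detection otherwise) can be sketched as a small pure function. `resolve_log_format` is a hypothetical name, and falling back to auto-detection for unrecognized values is an assumption; the guide does not say how invalid values are handled.

```python
import os
import sys
from typing import Mapping


def resolve_log_format(env: Mapping[str, str], stderr_is_tty: bool) -> str:
    """Pick 'json' or 'console' from the environment, else from the TTY check."""
    explicit = env.get("MARCHWARDEN_LOG_FORMAT", "").strip().lower()
    if explicit in ("json", "console"):
        return explicit
    # Unset (or unrecognized, by assumption): auto-select from stderr.
    return "console" if stderr_is_tty else "json"


print(resolve_log_format(os.environ, sys.stderr.isatty()))
```

Keeping the TTY check as a parameter makes the rule easy to verify for both the interactive and the piped case.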
### Persistent file logging
```bash
MARCHWARDEN_LOG_FILE=1 marchwarden ask "..."
```
Logs are appended to `~/.marchwarden/logs/marchwarden.log` (10MB per file, 5 rotated backups). The format respects `MARCHWARDEN_LOG_FORMAT`.
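For readers unfamiliar with size-based rotation, the standard library's `RotatingFileHandler` expresses the same policy. This sketch shows equivalent parameters only; it is not Marchwarden's actual logging setup, `make_file_handler` is a hypothetical helper, and reading "10MB" as 10 MiB is an assumption.

```python
import logging.handlers
import os
import tempfile


def make_file_handler(log_dir: str) -> logging.handlers.RotatingFileHandler:
    """Build a handler that rotates at ~10MB and keeps 5 backups."""
    os.makedirs(log_dir, exist_ok=True)
    return logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, "marchwarden.log"),
        maxBytes=10 * 1024 * 1024,  # "10MB" read as 10 MiB (assumption)
        backupCount=5,              # marchwarden.log.1 through marchwarden.log.5
        delay=True,                 # do not create the file until the first record
    )


# A temporary directory stands in for ~/.marchwarden/logs in this example.
handler = make_file_handler(os.path.join(tempfile.mkdtemp(), "logs"))
print(handler.maxBytes, handler.backupCount)  # prints: 10485760 5
```

When the active file would exceed the size limit it is renamed to `.1`, older backups shift up by one, and anything beyond the fifth backup is discarded.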
### Context binding
Every log line emitted during a research call automatically carries:
- `trace_id` — the same UUID you see in the Trace footer
- `researcher` — currently always `web` (the researcher type)
This means in OpenSearch (or any structured log viewer) you can filter to a single research call with one query: `trace_id:"abc-123-..."`.
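Without a log viewer, the same filter can be applied to JSON-format log lines directly. The sketch below depends only on the `trace_id` field described above; `lines_for_trace` is a hypothetical helper, and the `event` field in the sample lines is an assumption about what a log entry contains.

```python
import json
from typing import Iterable, Iterator


def lines_for_trace(lines: Iterable[str], trace_id: str) -> Iterator[dict]:
    """Yield parsed log events that belong to one research call."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. console-format output
        if isinstance(event, dict) and event.get("trace_id") == trace_id:
            yield event


sample = [
    '{"trace_id": "abc-123", "researcher": "web", "event": "research_start"}',
    '{"trace_id": "def-456", "researcher": "web", "event": "research_start"}',
    "not json at all",
    '{"trace_id": "abc-123", "researcher": "web", "event": "research_complete"}',
]
print(len(list(lines_for_trace(sample, "abc-123"))))  # prints: 2
```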
---
## File layout
Marchwarden writes to `~/.marchwarden/` exclusively. Nothing else on disk is touched.
```
~/.marchwarden/
├── prices.toml                  # auto-seeded price table; edit when rates change
├── costs.jsonl                  # cost ledger, one line per research call
├── traces/
│   └── <trace_id>.jsonl         # per-call audit log, one file per call
└── logs/
    ├── marchwarden.log          # only if MARCHWARDEN_LOG_FILE=1
    └── marchwarden.log.{1..5}   # rotated backups
```
All files are append-only or rewritten safely; you can `tail -f`, `jq`, or back them up freely.
---
## Running in Docker
The same workflows work inside the Docker test image, which is useful for sandboxed runs or to avoid touching the host's Python:
```bash
make docker-build # one-time
./scripts/docker-test.sh ask "your question" --depth deep
./scripts/docker-test.sh replay <trace_id>
```
The `ask` and `replay` subcommands of `docker-test.sh` mount:
- `~/secrets:/root/secrets:ro` — your API keys
- `~/.marchwarden:/root/.marchwarden` — traces, costs, logs persist back to the host
The script also forwards `MARCHWARDEN_MODEL` from the host environment if set.
---
## Troubleshooting
### `marchwarden: command not found` after `make install`
Either:
1. The venv isn't activated. Run `source .venv/bin/activate`, or use `make ask`, which calls `.venv/bin/marchwarden` directly.
2. A stale install exists at `~/.local/bin/marchwarden`. Run `which -a marchwarden`, delete the stale copy, then `hash -r`.
### `ModuleNotFoundError: No module named 'cli'`
The `marchwarden` script being run is from a stale install (e.g., a previous `pip install --user` or pipx install) that doesn't know about the current source layout. Same fix as above.
### `Error: HTTP 404 Not Found` on the Anthropic API
Your `MARCHWARDEN_MODEL` is set to a model id that doesn't exist. Set it to a valid id such as `claude-sonnet-4-6` or `claude-opus-4-6`, or unset it to fall back to the default, `claude-sonnet-4-6`.
### `Calls with unknown model price: N` warning in `marchwarden costs`
You ran a research call with a `model_id` not present in `~/.marchwarden/prices.toml`. Add a section for it:
```toml
[models."your-model-id"]
input_per_mtok_usd = 3.00
output_per_mtok_usd = 15.00
```
Then re-run `marchwarden costs`. Existing ledger entries with `null` cost won't be retroactively fixed; future calls will pick up the new prices.
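The per-million-token fields in the TOML section above imply a simple cost formula. The sketch below applies it to a dict shaped like the parsed price table; `estimate_cost_usd` is a hypothetical function, and it leaves out Tavily search charges because this guide does not show how those are priced.

```python
from typing import Mapping, Optional


def estimate_cost_usd(prices: Mapping, model_id: str,
                      input_tokens: int, output_tokens: int) -> Optional[float]:
    """Token cost in USD, or None when the model has no price entry."""
    model = prices.get("models", {}).get(model_id)
    if model is None:
        return None  # mirrors the null cost recorded for unknown models
    return (input_tokens / 1_000_000 * model["input_per_mtok_usd"]
            + output_tokens / 1_000_000 * model["output_per_mtok_usd"])


# Same shape and values as the TOML example above.
prices = {
    "models": {
        "your-model-id": {"input_per_mtok_usd": 3.00, "output_per_mtok_usd": 15.00},
    }
}
print(estimate_cost_usd(prices, "your-model-id", 1_000_000, 1_000_000))  # prints: 18.0
print(estimate_cost_usd(prices, "not-in-table", 1_000, 1_000))           # prints: None
```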
### `Budget status: spent` on every run
This is *expected*, not an error. See [Reading the output → Confidence panel](#reading-the-output) and [Depth presets → How the budget is enforced](#depth-presets) for details.
### Synthesis fallback ("Research completed but synthesis failed")
This used to happen when the synthesis JSON exceeded its `max_tokens` cap, but was fixed in PR #20. If you still see it, file an issue with the `trace_id` — the JSONL trace will contain the exact `synthesis_error` step including the model's raw response and parse error.
### The `marchwarden ask` output is paginated / cut off
`rich` defaults to your terminal width. If lines wrap badly, widen your terminal or pipe the output to `less -R` (the `-R` flag passes color codes through):
```bash
marchwarden ask "..." 2>&1 | less -R
```
---
## FAQ
**How long does a research call take?**
Typical wall-clock times: shallow ~15s, balanced ~30-60s, deep ~60-120s. Mostly LLM latency, not network.
**How much does a call cost?**
At current Sonnet 4.6 rates: shallow ~$0.02, balanced ~$0.05-$0.15, deep ~$0.20-$0.60. Run `marchwarden costs` after a few calls to see your actual numbers.
**Can I use a different model?**
Yes. `MARCHWARDEN_MODEL=claude-opus-4-6 marchwarden ask "..."` will use Opus instead of Sonnet. Make sure the model id is in your `prices.toml` so the cost ledger can estimate spend.
**Can the agent access local files / databases?**
Not yet. V1 is web-search only. V2+ (per the [Roadmap](Roadmap)) will add file/document and database researchers — same contract, different tools.
**Does the agent learn between calls?**
No. Each `research()` call is stateless. The trace logs and cost ledger accumulate over time, but the agent itself starts fresh every time. Cross-call learning is on the V2+ roadmap.
**Where do I report bugs?**
Open an issue at the [Forgejo repo](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/issues). Include the `trace_id` from the Trace footer — it lets us reconstruct exactly what happened.
---
See also: [Architecture](Architecture), [Research Contract](ResearchContract), [Development Guide](DevelopmentGuide), [Roadmap](Roadmap)