52 changed files with 43 additions and 9346 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -1,15 +0,0 @@
 .git
 .gitignore
 .venv
 docs/wiki
 **/__pycache__
 **/*.pyc
 **/*.pyo
 *.egg-info
 .pytest_cache
 .mypy_cache
 .ruff_cache
 .coverage
 htmlcov
 Dockerfile
 .dockerignore
--- a/.gitignore
+++ b/.gitignore
@@ -45,9 +45,6 @@ ehthumbs.db
 .env
 .env.local
 *.log
 # Exception: stress test run logs are committed as provenance — they map
 # trace_id -> category for the calibration collector script.
 !docs/stress-tests/**/*.log
 # Tests
 .pytest_cache/
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,71 +0,0 @@
 # Marchwarden — Project Context
 ## What This Is
 A network of agentic research specialists (MCP servers) coordinated by a
 principal investigator (PI) agent. An educational project for learning about
 agents, MCP, and agent composition.
 ## Current Project State
 | | |
 |---|---|
 | **Phase** | Phases 0–3 substantially complete (M3.3 awaiting human rating); Phase 5 started (M5.1.1 shipped) |
 | **Last worked on** | 2026-04-08 |
 | **Last commit** | `78f08c9` — Merge PR #59: M3.3 Phase A calibration data collection |
 | **Branch** | `main` (clean) |
 | **Tests** | 141 passing |
 | **Blocking issues** | #53 (budget cap lag — recommended fix before #39); #46 (M3.3 Phase B awaiting human rating) |
 ## Key Files
 | File | Purpose |
 |---|---|
 | `researchers/web/models.py` | Research Contract v1 Pydantic models + `DEPTH_PRESETS` |
 | `researchers/web/tools.py` | Tavily search + URL fetch with content hashing |
 | `researchers/web/trace.py` | JSONL trace logger + step-duration tracking + structlog mirror |
 | `researchers/web/agent.py` | WebResearcher — inner agentic loop |
 | `researchers/web/server.py` | FastMCP server wrapping the researcher |
 | `cli/main.py` | CLI: `ask` / `replay` / `costs` |
 | `obs/__init__.py` | Structured operational logger (structlog) |
 | `obs/costs.py` | Cost ledger + price table |
 | `Makefile` | `make install` / `test` / `ask` / `costs` / `clean` |
 | `Dockerfile` + `scripts/docker-test.sh` | Reproducible test environment |
 ## Architecture
 - **Researcher** = MCP server exposing `research(question) -> ResearchResult`
 - **ResearchResult** = answer + citations (with raw_excerpt) + categorized gaps +
  discovery_events + open_questions + confidence + confidence_factors + cost_metadata + trace_id
 - **Agent loop** = Claude tool-use loop (plan→search→fetch→iterate) + synthesis step
 - **Trace** = JSONL audit log per research call at `~/.marchwarden/traces/`
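The trace format described above can be illustrated with a minimal sketch. It assumes one JSON object per line carrying the `step`, `action`, `decision`, and `timestamp` keys that the `replay` command treats as reserved. The helper names `append_trace_entry` and `read_trace` are hypothetical, and the real logger in `researchers/web/trace.py` may well differ.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_trace_entry(path: Path, step: int, action: str, decision: str, **extras) -> None:
    """Append one step to a JSONL trace file (hypothetical helper)."""
    entry = {
        "step": step,
        "action": action,
        "decision": decision,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **extras,  # any additional per-step details
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def read_trace(path: Path) -> list[dict]:
    """Read a JSONL trace back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage: write two steps to a throwaway directory and read them back.
trace_path = Path(tempfile.mkdtemp()) / "traces" / "example-trace.jsonl"
append_trace_entry(trace_path, 1, "search", "query the web", query="highest peak in Utah")
append_trace_entry(trace_path, 2, "fetch", "read the top result")
entries = read_trace(trace_path)
```

Appending one line per step keeps the log readable even if a run is interrupted partway, which is one plausible reason to choose JSONL for an audit trail.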
 ## Conventions
 - API keys live in `~/secrets` (not `.env`)
 - Wiki is at `docs/wiki/` (local git clone, not MCP — wiki MCP is buggy)
 - All merges via Forgejo API (claude-code user can't merge via MCP)
 - One branch per concern, merge via PR, delete branch after
 ## Session Log
 | Session | Date | Summary |
 |---|---|---|
 | 1 | 2026-04-08 | Project creation, naming, contract design, Phase 0 + Phase 1 complete (81 tests) |
 | 2 | 2026-04-08 | Phase 2 (CLI shim) + Phase 2.5 (logging + cost tracking) shipped; V1 ships; depth presets; docker test env; per-step duration tracking; arxiv-rag scoped as M5.1; Phase 3/4/5/6 milestones populated (123 tests) |
 | 3 | 2026-04-08 | Phase 3 stress testing: M3.1+M3.2 closed, M3.3 split into Phases A/B/C with A done. Trace observability fix (#54) — full ResearchResult persisted as sibling + per-item events. M5.1.1 arxiv-rag ingest pipeline shipped (researchers/arxiv/, [arxiv] optional extra, lazy CLI imports). Structured-data tool critiqued and deferred until M6 PI consumer exists. Filed #53 (budget cap lag — recommended next session). 141 tests |
 ## What's Next
 **Recommended next session: fix #53 (budget cap lag) before continuing Phase 5.** The arxiv researcher's eventual agent loop (#40) will inherit budget semantics from the web researcher — fix the bug before duplicating it.
 Order of next-session candidates:
 1. **#53** — budget cap lag bug. Single-file fix in `researchers/web/agent.py` plus a regression test. ~30 min.
 2. **Live arxiv smoke** — `marchwarden arxiv add 1706.03762` end-to-end. Validates M5.1.1 against a real PDF. First run downloads ~500MB embedding model.
 3. **#39** — M5.1.2 arxiv-rag retrieval primitive. Builds the query API on top of M5.1.1's chromadb collection.
 4. **M3.3 Phase C** — once the user brings back `docs/stress-tests/M3.3-rating-worksheet.md` with `actual_rating` columns filled in. Analysis script + rubric + wiki update.
 5. **M4.1** (#47) — error handling / hardening. Independent of everything above.
 **Open issues:** #53 (budget cap lag), #46 (M3.3 awaiting rating).
 **Open milestones in Forgejo:** Phase 3 (1 issue: #46), Phase 4 (3 issues), Phase 5 (7 issues remaining), Phase 6 (2 issues).
--- a/25
+++ b/25
@@ -1,25 +0,0 @@
 FROM python:3.12-slim
 ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PIP_NO_CACHE_DIR=1
 WORKDIR /app
 # Install build deps separately so the layer caches when source changes.
 COPY pyproject.toml README.md ./
 RUN pip install --upgrade pip
 # Copy the project and install editable with dev extras.
 COPY cli ./cli
 COPY obs ./obs
 COPY researchers ./researchers
 COPY orchestrator ./orchestrator
 COPY tests ./tests
 RUN pip install -e ".[dev]"
 # Trace files land here; mount a volume to persist across runs.
 RUN mkdir -p /root/.marchwarden/traces
 CMD ["pytest", "-q"]
--- a/55
+++ b/55
@@ -1,55 +0,0 @@
 # Marchwarden development tasks.
 #
 # All venv targets create/use ./.venv. Targets that need to invoke
 # python from inside the venv use $(VENV_PY) so they work whether or
 # not the venv is currently activated in the shell.
 VENV       := .venv
 VENV_PY    := $(VENV)/bin/python
 VENV_PIP   := $(VENV)/bin/pip
 VENV_BIN   := $(VENV)/bin
 PYTHON ?= python3
 .DEFAULT_GOAL := help
 .PHONY: help venv install test test-cov lint clean ask costs docker-build docker-test
 help:  ## Show this help.
 	@awk 'BEGIN {FS = ":.*##"; printf "Marchwarden — make targets\n\n"} /^[a-zA-Z_-]+:.*?##/ { printf "  \033[36m%-15s\033[0m %s\n", $$1, $$2 }' $(MAKEFILE_LIST)
 venv: $(VENV_PY)  ## Create the .venv if it doesn't exist.
 $(VENV_PY):
 	$(PYTHON) -m venv $(VENV)
 	$(VENV_PIP) install --upgrade pip
 install: venv  ## Install the project editable with dev extras into .venv.
 	$(VENV_PIP) install -e ".[dev]"
 test: install  ## Run the test suite inside the venv.
 	$(VENV_BIN)/pytest -q
 test-cov: install  ## Run tests with coverage.
 	$(VENV_BIN)/pytest --cov=cli --cov=obs --cov=researchers --cov-report=term-missing
 lint: install  ## Run ruff and black --check.
 	$(VENV_BIN)/ruff check .
 	$(VENV_BIN)/black --check .
 ask: install  ## Run a sample research call. Override Q="..." to ask something else.
 	$(VENV_BIN)/marchwarden ask "$${Q:-What is the highest peak in Utah?}" --depth shallow
 costs: install  ## Show the cost ledger summary.
 	$(VENV_BIN)/marchwarden costs
 clean:  ## Remove the venv and Python caches.
 	rm -rf $(VENV) .pytest_cache .ruff_cache .mypy_cache .coverage htmlcov
 	find . -type d -name __pycache__ -prune -exec rm -rf {} +
 	find . -type d -name "*.egg-info" -prune -exec rm -rf {} +
 docker-build:  ## Build the docker test image.
 	./scripts/docker-test.sh build
 docker-test:  ## Run the test suite inside docker.
 	./scripts/docker-test.sh test
--- a/README.md
+++ b/README.md
@@ -14,10 +14,8 @@ Marchwarden researchers are stationed at the frontier of knowledge — they watc
 git clone https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden.git
 cd marchwarden
-# Install (Makefile shortcut — creates .venv and installs deps)
+# Install
-make install
+pip install -e .
 # or manually:
 python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[dev]"
 # Ask a question
 marchwarden ask "What are ideal crops for a garden in Utah?"
@@ -26,20 +24,6 @@ marchwarden ask "What are ideal crops for a garden in Utah?"
 marchwarden replay <trace_id>
 ```
 ## Docker test environment
 A reproducible container is available for running the test suite and the
 CLI without depending on the host's Python install:
 ```bash
 scripts/docker-test.sh build           # build the image
 scripts/docker-test.sh test             # run pytest
 scripts/docker-test.sh ask "question"   # run `marchwarden ask` end-to-end
                                        # (mounts ~/secrets ro and ~/.marchwarden rw)
 scripts/docker-test.sh replay <id>      # replay a trace from ~/.marchwarden/traces
 scripts/docker-test.sh shell            # interactive bash in the container
 ```
 ## Documentation
 - **[Architecture](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/wiki/Architecture)** — system design, researcher contract, MCP flow
--- a/cli/main.py
+++ b/cli/main.py
@@ -1,667 +0,0 @@
 """Marchwarden CLI shim.
 Talks to the web researcher MCP server over stdio and pretty-prints
 ResearchResult contracts to the terminal.
 """
 import asyncio
 import json
 import os
 import re
 import sys
 from collections import defaultdict
 from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Optional
 import click
 from mcp import ClientSession, StdioServerParameters
 from mcp.client.stdio import stdio_client
 from rich.console import Console
 from rich.panel import Panel
 from rich.table import Table
 from rich.text import Text
 from obs import configure_logging, get_logger
 from obs.costs import DEFAULT_LEDGER_PATH
 from researchers.web.models import ResearchResult
 DEFAULT_TRACE_DIR = "~/.marchwarden/traces"
 log = get_logger("marchwarden.cli")
 # ---------------------------------------------------------------------------
 # MCP client
 # ---------------------------------------------------------------------------
 async def call_research_tool(
    question: str,
    depth: str,
    max_iterations: Optional[int],
    token_budget: Optional[int],
 ) -> ResearchResult:
    """Spawn the web researcher MCP server and call its `research` tool.
    ``max_iterations`` and ``token_budget`` are optional — when None,
    the MCP server uses the depth preset (Issue #30).
    """
    params = StdioServerParameters(
        command=sys.executable,
        args=["-m", "researchers.web.server"],
        env=os.environ.copy(),
    )
    arguments: dict = {"question": question, "depth": depth}
    if max_iterations is not None:
        arguments["max_iterations"] = max_iterations
    if token_budget is not None:
        arguments["token_budget"] = token_budget
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("research", arguments=arguments)
            # FastMCP returns the tool's string return as a TextContent block.
            payload = result.content[0].text
            return ResearchResult.model_validate_json(payload)
 # ---------------------------------------------------------------------------
 # Pretty printing
 # ---------------------------------------------------------------------------
 def render_result(result: ResearchResult, console: Console) -> None:
    """Render a ResearchResult to the console using rich."""
    # Answer
    console.print(
        Panel(
            result.answer,
            title="[bold cyan]Answer[/bold cyan]",
            border_style="cyan",
        )
    )
    # Citations
    if result.citations:
        table = Table(title="Citations", show_lines=True, expand=True)
        table.add_column("#", style="dim", width=3)
        table.add_column("Title / Locator", overflow="fold")
        table.add_column("Excerpt", overflow="fold")
        table.add_column("Conf", justify="right", width=5)
        for i, c in enumerate(result.citations, 1):
            header = f"[bold]{c.title or c.locator}[/bold]\n[dim]{c.locator}[/dim]"
            table.add_row(str(i), header, c.raw_excerpt, f"{c.confidence:.2f}")
        console.print(table)
    else:
        console.print("[dim]No citations.[/dim]")
    # Gaps grouped by category
    if result.gaps:
        gap_table = Table(title="Gaps", show_lines=True, expand=True)
        gap_table.add_column("Category", style="yellow")
        gap_table.add_column("Topic")
        gap_table.add_column("Detail", overflow="fold")
        for g in result.gaps:
            gap_table.add_row(g.category.value, g.topic, g.detail)
        console.print(gap_table)
    # Discovery events
    if result.discovery_events:
        de_table = Table(title="Discovery Events", show_lines=True, expand=True)
        de_table.add_column("Type", style="magenta")
        de_table.add_column("Suggested Researcher")
        de_table.add_column("Query", overflow="fold")
        de_table.add_column("Reason", overflow="fold")
        for d in result.discovery_events:
            de_table.add_row(
                d.type, d.suggested_researcher or "-", d.query, d.reason
            )
        console.print(de_table)
    # Open questions
    if result.open_questions:
        oq_table = Table(title="Open Questions", show_lines=True, expand=True)
        oq_table.add_column("Priority", style="green")
        oq_table.add_column("Question", overflow="fold")
        oq_table.add_column("Context", overflow="fold")
        for q in result.open_questions:
            oq_table.add_row(q.priority, q.question, q.context)
        console.print(oq_table)
    # Confidence + factors
    cf = result.confidence_factors
    conf_text = Text()
    conf_text.append(f"Overall: {result.confidence:.2f}\n", style="bold")
    conf_text.append(f"Corroborating sources: {cf.num_corroborating_sources}\n")
    conf_text.append(f"Source authority: {cf.source_authority}\n")
    conf_text.append(f"Contradiction detected: {cf.contradiction_detected}\n")
    conf_text.append(f"Query specificity match: {cf.query_specificity_match:.2f}\n")
    budget_status = "spent" if cf.budget_exhausted else "under cap"
    conf_text.append(f"Budget status: {budget_status}\n")
    conf_text.append(f"Recency: {cf.recency or 'unknown'}")
    console.print(Panel(conf_text, title="Confidence", border_style="green"))
    # Cost
    cm = result.cost_metadata
    cost_text = Text()
    cost_text.append(f"Tokens: {cm.tokens_used}\n")
    cost_text.append(f"Iterations: {cm.iterations_run}\n")
    cost_text.append(f"Wall time: {cm.wall_time_sec:.2f}s\n")
    cost_text.append(f"Model: {cm.model_id}")
    console.print(Panel(cost_text, title="Cost", border_style="blue"))
    # Trace footer
    console.print(f"\n[dim]trace_id: {result.trace_id}[/dim]")
 # ---------------------------------------------------------------------------
 # Click app
 # ---------------------------------------------------------------------------
@click.group()
 def cli() -> None:
    """Marchwarden — agentic research CLI."""
    configure_logging()
@cli.command()
@click.argument("question")
@click.option(
    "--depth",
    type=click.Choice(["shallow", "balanced", "deep"]),
    default="balanced",
    show_default=True,
 )
@click.option(
    "--budget",
    "token_budget",
    type=int,
    default=None,
    help="Token budget for the research loop. Overrides the depth preset.",
 )
@click.option(
    "--max-iterations",
    type=int,
    default=None,
    help="Max research loop iterations. Overrides the depth preset.",
 )
 def ask(
    question: str,
    depth: str,
    token_budget: Optional[int],
    max_iterations: Optional[int],
 ) -> None:
    """Ask the web researcher a QUESTION."""
    console = Console()
    console.print(f"[dim]Researching:[/dim] {question}\n")
    log.info(
        "ask_started",
        question=question,
        depth=depth,
        max_iterations=max_iterations,
        token_budget=token_budget,
    )
    try:
        result = asyncio.run(
            call_research_tool(
                question=question,
                depth=depth,
                max_iterations=max_iterations,
                token_budget=token_budget,
            )
        )
    except Exception as e:
        log.error("ask_failed", question=question, error=str(e), exc_info=True)
        console.print(f"[bold red]Error:[/bold red] {e}")
        sys.exit(1)
    log.info(
        "ask_completed",
        trace_id=result.trace_id,
        confidence=result.confidence,
        citations=len(result.citations),
        tokens_used=result.cost_metadata.tokens_used,
        wall_time_sec=result.cost_metadata.wall_time_sec,
    )
    render_result(result, console)
 def _resolve_trace_path(trace_id: str, trace_dir: Optional[str]) -> Path:
    """Resolve the JSONL path for a trace_id."""
    base = Path(os.path.expanduser(trace_dir or DEFAULT_TRACE_DIR))
    return base / f"{trace_id}.jsonl"
 def render_trace(entries: list[dict], trace_id: str, console: Console) -> None:
    """Pretty-print a list of trace entries."""
    console.print(
        Panel(
            f"[bold]trace_id:[/bold] {trace_id}\n[bold]steps:[/bold] {len(entries)}",
            title="[cyan]Replay[/cyan]",
            border_style="cyan",
        )
    )
    if not entries:
        console.print("[dim]Trace file is empty.[/dim]")
        return
    table = Table(show_lines=True, expand=True)
    table.add_column("#", style="dim", width=4)
    table.add_column("Action", style="magenta")
    table.add_column("Decision", overflow="fold")
    table.add_column("Details", overflow="fold")
    table.add_column("Hash", style="dim", overflow="fold")
    reserved = {"step", "action", "decision", "timestamp", "content_hash"}
    for e in entries:
        step = str(e.get("step", "?"))
        action = str(e.get("action", ""))
        decision = str(e.get("decision", ""))
        content_hash = str(e.get("content_hash", "") or "")
        extras = {k: v for k, v in e.items() if k not in reserved}
        details = "\n".join(f"{k}: {v}" for k, v in extras.items())
        table.add_row(step, action, decision, details, content_hash)
    console.print(table)
@cli.command()
@click.argument("trace_id")
@click.option(
    "--trace-dir",
    default=None,
    help=f"Trace directory (default: {DEFAULT_TRACE_DIR}).",
 )
 def replay(trace_id: str, trace_dir: Optional[str]) -> None:
    """Replay a prior research run by TRACE_ID."""
    console = Console()
    path = _resolve_trace_path(trace_id, trace_dir)
    if not path.exists():
        console.print(
            f"[bold red]Error:[/bold red] no trace file found for "
            f"trace_id [bold]{trace_id}[/bold] at {path}"
        )
        sys.exit(1)
    entries: list[dict] = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as e:
                console.print(
                    f"[bold red]Error:[/bold red] invalid JSON on line {lineno}: {e}"
                )
                sys.exit(1)
    render_trace(entries, trace_id, console)
    # Issue #54: if the agent persisted a sibling .result.json, render
    # the full structured ResearchResult underneath the step log so
    # replay can show which gaps fired, which sources were cited, etc.
    result_path = path.parent / f"{trace_id}.result.json"
    if result_path.exists():
        try:
            result = ResearchResult.model_validate_json(
                result_path.read_text(encoding="utf-8")
            )
        except Exception as exc:
            console.print(
                f"[yellow]warning:[/yellow] could not parse {result_path.name}: {exc}"
            )
        else:
            console.print()
            render_result(result, console)
    else:
        console.print(
            "[dim]No persisted result file alongside this trace.[/dim]"
        )
 # ---------------------------------------------------------------------------
 # costs command
 # ---------------------------------------------------------------------------
 _RELATIVE_RE = re.compile(r"^(\d+)([dwhm])$")
 def _parse_when(value: str) -> datetime:
    """Parse an ISO date or a relative shorthand like '7d', '24h'."""
    m = _RELATIVE_RE.match(value)
    if m:
        n = int(m.group(1))
        unit = m.group(2)
        delta = {
            "h": timedelta(hours=n),
            "d": timedelta(days=n),
            "w": timedelta(weeks=n),
            "m": timedelta(days=30 * n),
        }[unit]
        return datetime.now(timezone.utc) - delta
    # Otherwise treat as ISO date / datetime
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt
 def _load_ledger(path: Path) -> list[dict]:
    if not path.exists():
        return []
    entries: list[dict] = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip a corrupt line rather than blow up the whole report
                continue
    return entries
 def _filter_entries(
    entries: list[dict],
    since: Optional[datetime],
    until: Optional[datetime],
    model: Optional[str],
 ) -> list[dict]:
    out = []
    for e in entries:
        ts_str = e.get("timestamp", "")
        try:
            ts = datetime.fromisoformat(ts_str.replace("Z", "+00:00"))
        except ValueError:
            continue
        if since and ts < since:
            continue
        if until and ts > until:
            continue
        if model and e.get("model_id") != model:
            continue
        out.append(e)
    return out
 def render_costs(entries: list[dict], console: Console) -> None:
    """Render a cost summary from filtered ledger entries."""
    if not entries:
        console.print("[dim]No cost data yet.[/dim]")
        return
    total_calls = len(entries)
    total_tokens = sum(e.get("tokens_used", 0) for e in entries)
    total_input = sum(e.get("tokens_input") or 0 for e in entries)
    total_output = sum(e.get("tokens_output") or 0 for e in entries)
    total_tavily = sum(e.get("tavily_searches", 0) for e in entries)
    total_spend = sum(
        e.get("estimated_cost_usd") or 0.0 for e in entries
    )
    unknown_cost_calls = sum(
        1 for e in entries if e.get("estimated_cost_usd") is None
    )
    # Summary panel
    summary = Text()
    summary.append(f"Calls: {total_calls}\n", style="bold")
    summary.append(f"Total spend: ${total_spend:.4f}\n", style="bold green")
    summary.append(f"Total tokens: {total_tokens:,} ")
    summary.append(f"(in {total_input:,} / out {total_output:,})\n", style="dim")
    summary.append(f"Tavily searches: {total_tavily}\n")
    if unknown_cost_calls:
        summary.append(
            f"Calls with unknown model price: {unknown_cost_calls}\n",
            style="yellow",
        )
    console.print(Panel(summary, title="Cost Summary", border_style="green"))
    # Per-day breakdown
    per_day: dict[str, dict] = defaultdict(lambda: {"calls": 0, "tokens": 0, "spend": 0.0})
    for e in entries:
        day = e.get("timestamp", "")[:10]
        per_day[day]["calls"] += 1
        per_day[day]["tokens"] += e.get("tokens_used", 0)
        per_day[day]["spend"] += e.get("estimated_cost_usd") or 0.0
    day_table = Table(title="Per Day", show_lines=False, expand=True)
    day_table.add_column("Date", style="dim")
    day_table.add_column("Calls", justify="right")
    day_table.add_column("Tokens", justify="right")
    day_table.add_column("Spend (USD)", justify="right", style="green")
    for day in sorted(per_day.keys()):
        d = per_day[day]
        day_table.add_row(
            day, str(d["calls"]), f"{d['tokens']:,}", f"${d['spend']:.4f}"
        )
    console.print(day_table)
    # Per-model breakdown
    per_model: dict[str, dict] = defaultdict(
        lambda: {"calls": 0, "tokens": 0, "spend": 0.0}
    )
    for e in entries:
        m = e.get("model_id", "(unknown)")
        per_model[m]["calls"] += 1
        per_model[m]["tokens"] += e.get("tokens_used", 0)
        per_model[m]["spend"] += e.get("estimated_cost_usd") or 0.0
    model_table = Table(title="Per Model", show_lines=False, expand=True)
    model_table.add_column("Model")
    model_table.add_column("Calls", justify="right")
    model_table.add_column("Tokens", justify="right")
    model_table.add_column("Spend (USD)", justify="right", style="green")
    for m in sorted(per_model.keys()):
        d = per_model[m]
        model_table.add_row(
            m, str(d["calls"]), f"{d['tokens']:,}", f"${d['spend']:.4f}"
        )
    console.print(model_table)
    # Highest-cost call
    costed = [e for e in entries if e.get("estimated_cost_usd") is not None]
    if costed:
        top = max(costed, key=lambda e: e["estimated_cost_usd"])
        top_text = Text()
        top_text.append(f"trace_id: {top.get('trace_id', '?')}\n")
        top_text.append(f"question: {top.get('question', '')[:120]}\n")
        top_text.append(f"model: {top.get('model_id', '?')}\n")
        top_text.append(f"tokens: {top.get('tokens_used', 0):,}\n")
        top_text.append(
            f"spend: ${top.get('estimated_cost_usd', 0):.4f}\n",
            style="bold green",
        )
        console.print(
            Panel(top_text, title="Highest-Cost Call", border_style="yellow")
        )
@cli.command()
@click.option(
    "--since",
    default=None,
    help="Filter by start time. ISO date or relative (e.g. 7d, 24h, 2w).",
 )
@click.option(
    "--until",
    default=None,
    help="Filter by end time. ISO date or relative.",
 )
@click.option(
    "--model",
    default=None,
    help="Filter to a specific model_id.",
 )
@click.option(
    "--json",
    "as_json",
    is_flag=True,
    default=False,
    help="Emit raw filtered ledger entries as JSON instead of the table.",
 )
@click.option(
    "--ledger",
    default=None,
    help=f"Override ledger path (default: {DEFAULT_LEDGER_PATH}).",
 )
 def costs(
    since: Optional[str],
    until: Optional[str],
    model: Optional[str],
    as_json: bool,
    ledger: Optional[str],
 ) -> None:
    """Show cost summary from the research ledger."""
    console = Console()
    path = Path(os.path.expanduser(ledger or DEFAULT_LEDGER_PATH))
    entries = _load_ledger(path)
    since_dt = _parse_when(since) if since else None
    until_dt = _parse_when(until) if until else None
    filtered = _filter_entries(entries, since_dt, until_dt, model)
    if as_json:
        for e in filtered:
            click.echo(json.dumps(e))
        return
    render_costs(filtered, console)
 # ---------------------------------------------------------------------------
 # arxiv subgroup (M5.1.1)
 # ---------------------------------------------------------------------------
@cli.group()
 def arxiv() -> None:
    """Manage the local arxiv-rag corpus.
    Sub-commands let you ingest arxiv papers, list what's indexed, and
    inspect individual entries. Retrieval and search ship in #39+.
    """
@arxiv.command("add")
@click.argument("arxiv_ids", nargs=-1, required=True)
@click.option(
    "--embedding-model",
    default=None,
    help=(
        "Override embedding model. Defaults to "
        "$MARCHWARDEN_ARXIV_EMBED_MODEL or nomic-ai/nomic-embed-text-v1.5."
    ),
 )
 def arxiv_add(arxiv_ids: tuple[str, ...], embedding_model: Optional[str]) -> None:
    """Download, extract, embed, and index one or more arxiv papers by ID."""
    # Imported lazily so the CLI doesn't pay the chromadb / torch import
    # cost on every invocation — only when the user actually runs an
    # arxiv command.
    from researchers.arxiv.ingest import DEFAULT_EMBEDDING_MODEL, ingest
    from researchers.arxiv.store import ArxivStore
    console = Console()
    store = ArxivStore()
    model = embedding_model or DEFAULT_EMBEDDING_MODEL
    for arxiv_id in arxiv_ids:
        console.print(f"[dim]Ingesting:[/dim] {arxiv_id} (model={model})")
        try:
            record = ingest(arxiv_id, store=store, model_name=model)
        except Exception as exc:
            console.print(f"[bold red]Failed:[/bold red] {arxiv_id}: {exc}")
            continue
        console.print(
            f"  -> [green]ok[/green] {record.title or '(no title)'} "
            f"({record.chunks_indexed} chunks)"
        )
@arxiv.command("list")
 def arxiv_list() -> None:
    """Show all indexed arxiv papers."""
    from researchers.arxiv.store import ArxivStore
    console = Console()
    store = ArxivStore()
    papers = store.list_papers()
    if not papers:
        console.print(
            "[dim]No papers indexed yet. Use[/dim] "
            "[bold]marchwarden arxiv add <id>[/bold]"
        )
        return
    table = Table(title=f"Indexed papers ({len(papers)})", show_lines=False, expand=True)
    table.add_column("arxiv_id", style="cyan")
    table.add_column("Title", overflow="fold")
    table.add_column("Year", justify="right", width=6)
    table.add_column("Chunks", justify="right", width=6)
    table.add_column("Model", overflow="fold")
    for p in papers:
        table.add_row(
            p.arxiv_id,
            p.title or "(no title)",
            str(p.year) if p.year else "—",
            str(p.chunks_indexed),
            p.embedding_model,
        )
    console.print(table)
@arxiv.command("info")
@click.argument("arxiv_id")
 def arxiv_info(arxiv_id: str) -> None:
    """Show metadata + chunk count for one indexed paper."""
    from researchers.arxiv.store import ArxivStore
    console = Console()
    store = ArxivStore()
    record = store.get_paper(arxiv_id)
    if record is None:
        console.print(
            f"[bold red]Not indexed:[/bold red] {arxiv_id}. "
            f"Use [bold]marchwarden arxiv add {arxiv_id}[/bold]."
        )
        sys.exit(1)
    text = Text()
    text.append(f"arxiv_id: {record.arxiv_id}\n", style="bold")
    text.append(f"title: {record.title or '(none)'}\n")
    text.append(f"authors: {', '.join(record.authors) or '(none)'}\n")
    text.append(f"year: {record.year or '(unknown)'}\n")
    text.append(f"category: {record.category or '(unknown)'}\n")
    text.append(f"chunks: {record.chunks_indexed}\n")
    text.append(f"embedding_model: {record.embedding_model}\n")
    text.append(f"added_at: {record.added_at}\n")
    console.print(Panel(text, title=arxiv_id, border_style="cyan"))
@arxiv.command("remove")
@click.argument("arxiv_id")
 def arxiv_remove(arxiv_id: str) -> None:
    """Drop one paper from the manifest and chromadb collection."""
    from researchers.arxiv.store import ArxivStore
    console = Console()
    store = ArxivStore()
    chunks_removed = store.delete_paper(arxiv_id)
    in_manifest = store.remove_paper(arxiv_id)
    if not in_manifest and chunks_removed == 0:
        console.print(f"[yellow]Not found:[/yellow] {arxiv_id}")
        sys.exit(1)
    console.print(
        f"[green]Removed[/green] {arxiv_id} "
        f"({chunks_removed} chunks dropped)"
    )
 if __name__ == "__main__":
    cli()
--- a/docs/stress-tests/M3.1-results.md
+++ b/docs/stress-tests/M3.1-results.md
@@ -1,60 +0,0 @@
 # M3.1 Stress Test Results
 - Issue: #44 (closed)
 - Date: 2026-04-08
 - Branch: `feat/m3.1-stress-tests`
 ## Summary
 | Q | Targets | Result |
 |---|---|---|
 | 1 | SOURCE_NOT_FOUND, recency | Both miss (query not adversarial enough) |
 | 2 | CONTRADICTORY_SOURCES, contradiction_detected | Both miss (consensus too strong) |
 | 3 | SCOPE_EXCEEDED, discovery_events | Both hit |
 | 4 | BUDGET_EXHAUSTED, budget_exhausted | Both miss (real bug, see #53) |
 Follow-up issues filed: #53 (budget cap lag), #54 (trace observability — full result not persisted).
 ## Q1: "What AI models were released in Q1 2026?"
 Targets: SOURCE_NOT_FOUND gap, recency factor
 - trace_id: 8472f9a2-e712-4b9f-ac9f-5b736c343831
 - confidence: 0.82
 - confidence_factors: corroborating_sources=6, authority=medium, contradiction=False, specificity=0.85, budget=spent, recency=current
 - cost: 53134 tokens, 3 iters, 93s
 - gaps: 5 fired, categories not recoverable (run was not tee'd, and trace persists only counts — see #54)
 - **TARGET MISS:** SOURCE_NOT_FOUND not triggered (found 6 sources). Recency=current, not stale. Q1 2026 is not far enough in the past for source scarcity. Need a future-dated or genuinely obscure topic to trigger this gap.
 ## Q2: "Is coffee good or bad for you?"
 Targets: CONTRADICTORY_SOURCES gap, contradiction_detected factor
 - trace_id: 22597d75-f1b2-44ae-8d7e-f4ea3423f46b
 - confidence: 0.91
 - confidence_factors: corroborating=10, authority=high, contradiction=False, specificity=0.88, budget=spent, recency=current
 - cost: 53567 tokens, 3 iters, 80s
 - gaps: scope_exceeded(1), source_not_found(2) — total 3
 - discovery_events: 4 (arxiv + database refs)
 - **TARGET MISS:** CONTRADICTORY_SOURCES not surfaced; contradiction_detected=False. Agent synthesized a coherent "benefits with caveats" answer rather than recognizing genuine contradictions. The query is too easy: modern consensus wins out.
 ## Q3: "Compare CRISPR delivery mechanisms in recent clinical trials"
 Targets: SCOPE_EXCEEDED gap, discovery_events populated
 - trace_id: 05e54df5-edbd-40ac-b1d0-ae16cebade60
 - confidence: 0.82
 - confidence_factors: corroborating=9, authority=high, contradiction=False, specificity=0.80, budget=spent, recency=current
 - cost: 51710 tokens, 3 iters, 109s
 - gaps: source_not_found(2), scope_exceeded(1+) — multiple
 - discovery_events: 4 (suggesting arxiv researcher for delivery mechanism deep-dives)
 - **HIT BOTH TARGETS:** scope_exceeded gap surfaced, discovery_events populated with arxiv researcher suggestions.
 ## Q4: "Comprehensive history of AI 1950 to 2026" --budget 5000 --max-iterations 2
 Targets: BUDGET_EXHAUSTED gap, budget_exhausted factor
 - trace_id: 38235720-6efc-4d7d-b284-6e21b1c83d46
 - confidence: 0.87
 - confidence_factors: corroborating=8, authority=high, contradiction=False, specificity=0.88, **budget=under cap**, recency=current
 - cost: **29304 tokens (5.8x over 5000 budget)**, 2 iters (cap respected), 78s
 - gaps: scope_exceeded(1), access_denied(2), source_not_found(1) — total 4. **No budget_exhausted gap.**
 - **TARGET MISS:** BUDGET_EXHAUSTED not surfaced. budget_exhausted=False despite 5.8x overrun.
 - **BUG (real):** Budget enforcement lag — see #53. Loop check uses stale `total_tokens` (only updated after a model call). Iter-1 input is tiny so check passes, iter-2's huge input pushes loop total to 10606 (2.1x cap), then loop exits naturally. Synthesis adds ~19k more (uncapped by design).
 - Trace evidence: iter1 tokens_so_far=0 → iter2 tokens_so_far=1145 → synthesis tokens_used=10606 → final 29304.
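
The lag described for Q4 can be reproduced with a toy model. This is a minimal sketch, not the loop in `researchers/web/agent.py` (which is not shown here): `run_loop` and its parameters are hypothetical, and the per-iteration costs are derived from the trace evidence above (1145 tokens for iteration 1, 10606 - 1145 = 9461 for iteration 2).

```python
from typing import List, Tuple


def run_loop(
    per_iteration_cost: List[int],
    token_budget: int,
    max_iterations: int,
) -> Tuple[int, int, bool]:
    """Toy research loop that checks the budget only at the top of each iteration.

    Hypothetical model of the behaviour described in #53: the running total is
    updated after each model call, so the check always sees the total from the
    previous iteration.
    """
    total_tokens = 0
    iterations_run = 0
    budget_exhausted = False

    for cost in per_iteration_cost[:max_iterations]:
        # The check runs before the model call, against the stale total.
        if total_tokens >= token_budget:
            budget_exhausted = True
            break
        # The model call happens here; only now does the total grow.
        total_tokens += cost
        iterations_run += 1

    return total_tokens, iterations_run, budget_exhausted


# Costs taken from the Q4 trace evidence; budget and iteration cap from the Q4 flags.
total, iters, exhausted = run_loop([1145, 9461], token_budget=5000, max_iterations=2)
print(total, iters, exhausted)
```

With a 5000-token budget both checks pass (0 and 1145 are under the cap), so the loop exits at 10606 tokens with `budget_exhausted` still false, which matches the trace.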
--- a/docs/stress-tests/M3.2-results.md
+++ b/docs/stress-tests/M3.2-results.md
@ -1,50 +0,0 @@
 # M3.2 Multi-axis Stress Test Results
 - Issue: #45 (closed)
 - Date: 2026-04-08
 - Branch: `feat/m3.2-multiaxis`
 - Trace: `74a017bd-697b-4439-96b8-fe12057cf2e8`
 ## Query
 > "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
 Run with `--depth deep`. Default budget for deep = 60k tokens, max_iterations = 8.
 ## Scoring
 | Axis | Target | Hit | Evidence |
 |---|---|---|---|
 | Recency | recency factor | ✅ | `recency=current` |
 | Contradictory sources | `contradiction_detected=True` | ✅ | True; bonus discovery_event of type `contradiction` |
 | Scope exceeded | `scope_exceeded` gap | ❌ | 5 gaps, all `source_not_found` |
 | Budget pressure | `budget_exhausted=True` | ✅ | True; 127692 tokens consumed (2.1x deep cap of 60k) |
 3 of 4 axes hit cleanly in a single run. Multi-axis composition works.
 ## Result snapshot
 - confidence: **0.78**
 - citations: 18 (corroborating sources)
 - gaps: 5 (all `source_not_found`)
 - discovery_events: 4 (`related_research` × 2, `new_source` × 1, **`contradiction` × 1** — first in-the-wild observation)
 - open_questions: 5
 - cost: 127692 tokens, 4 iterations, 168s
 - confidence_factors: corroborating=18, authority=medium, contradiction=True, specificity=0.72, budget=spent, recency=current
 ## Findings
 **1. Multi-axis composition validated.** A single deep query exercised recency + contradictions + budget pressure simultaneously. The contract handled all three without losing structure — confidence dropped appropriately (0.78 vs 0.82–0.91 in M3.1) and the right factors fired.
 **2. New discovery event type observed.** This is the first time `contradiction` has fired as a `discovery_event.type` in any test (M3.1 only saw `related_research` and `new_source`). It's a documented type (`researchers/web/models.py:154`), so this is the contract working as designed — but worth noting for calibration in M3.3 that all three documented types are now reachable in practice.
 **3. Scope-exceeded distinction is fuzzy in practice.** The agent surfaced 5 `source_not_found` gaps, not the targeted `scope_exceeded`. Re-reading the gap topics:
 - Outage detail / SLA percentages / post-mortems → reasonable as `source_not_found` (could be on the web, just not gathered).
 - HFT-specific cold-start benchmarks → genuinely `scope_exceeded` (HFT firms don't publish these — wrong researcher entirely).
 So it's 1 of 5, not a clear bug. The agent's prompt could nudge sharper category assignment, but the misclassification is not severe enough to file independently. Worth revisiting in M3.3 calibration if confidence is routinely overestimated because of gap miscategorization.
 **4. Persisted result file made this analysis trivial.** This is the first stress test run after #54 shipped. Recovering all 5 gap categories, all 4 discovery types, and the full confidence_factors took one Python one-liner against `<trace_id>.result.json` instead of grepping rendered terminal output. M3.3 calibration work will be much faster as a result.
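
The one-liner itself is not recorded in this write-up. A sketch of the same idea follows, assuming the result file is a JSON object with a `gaps` list (each entry carrying a `category`), a `discovery_events` list (each entry carrying a `type`), and a `confidence_factors` object. These field names are assumptions and not confirmed against the Research Contract models; `load_result` and `summarize_result` are hypothetical helpers.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict


def load_result(trace_id: str, trace_dir: Path) -> Dict[str, Any]:
    """Read <trace_id>.result.json from trace_dir (file layout assumed)."""
    return json.loads((trace_dir / f"{trace_id}.result.json").read_text())


def summarize_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Tally gap categories and discovery types (field names are assumptions)."""
    gaps = Counter(g.get("category", "unknown") for g in result.get("gaps", []))
    discoveries = Counter(
        e.get("type", "unknown") for e in result.get("discovery_events", [])
    )
    return {
        "gaps": dict(gaps),
        "discovery_events": dict(discoveries),
        "confidence_factors": result.get("confidence_factors", {}),
    }


# Stand-in payload shaped like the M3.2 result snapshot above (counts only).
sample = {
    "gaps": [{"category": "source_not_found"}] * 5,
    "discovery_events": [
        {"type": "related_research"},
        {"type": "related_research"},
        {"type": "new_source"},
        {"type": "contradiction"},
    ],
    "confidence_factors": {"contradiction_detected": True},
}
summary = summarize_result(sample)
print(summary)
```

For a real run, `summarize_result(load_result(trace_id, trace_dir))` would replace the stand-in payload, with `trace_dir` pointing at the directory that holds the persisted result files.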
 ## Follow-ups
 - None blocking. The scope_exceeded vs source_not_found distinction may surface again in M3.3 calibration; if so, file it then.
--- a/docs/stress-tests/M3.3-rating-worksheet.md
+++ b/docs/stress-tests/M3.3-rating-worksheet.md
@ -1,74 +0,0 @@
 # M3.3 Calibration Rating Worksheet
 Issue: #46 (Phase B — human rating)
 ## How to use this worksheet
 For each run below, read the answer + citations from the persisted result file (path in the **Result file** column). Score the answer's *actual* correctness on a 0.0–1.0 scale, **independent** of the model's self-reported confidence. Fill in the **actual_rating** column. Add notes in the **notes** column for anything unusual.
 Rating rubric:
 - **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
 - **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance.
 - **0.6** — Substantively right but with notable errors, missing context, or weak citations.
 - **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons.
 - **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing.
 - **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question.
 After rating all rows, save this file and run:
 ```
 .venv/bin/python scripts/calibration_analyze.py
 ```
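
The contents of `scripts/calibration_analyze.py` are not shown here. As a rough illustration of the kind of comparison the worksheet enables, the sketch below computes the mean signed error and mean absolute error between `model_conf` and `actual_rating`, skipping unrated rows. The function name `calibration_summary` is hypothetical, and the ratings in the example are placeholders chosen from the rubric, not real ratings.

```python
from typing import Dict, List, Optional, Tuple


def calibration_summary(
    rows: List[Tuple[float, Optional[float]]],
) -> Dict[str, float]:
    """Compare self-reported confidence with human ratings.

    rows holds (model_conf, actual_rating) pairs; actual_rating is None for
    rows that have not been rated yet. A positive mean_signed_error means the
    model is overconfident on average.
    """
    rated = [(conf, rating) for conf, rating in rows if rating is not None]
    if not rated:
        raise ValueError("no rated rows to analyze")

    errors = [conf - rating for conf, rating in rated]
    return {
        "n_rated": float(len(rated)),
        "mean_signed_error": sum(errors) / len(errors),
        "mean_abs_error": sum(abs(e) for e in errors) / len(errors),
    }


# model_conf values match rows 1-3 of the worksheet; ratings are placeholders.
example_rows = [(0.95, 1.0), (0.78, 0.6), (0.98, 1.0), (0.99, None)]
stats = calibration_summary(example_rows)
print(stats)
```

Reporting the signed and absolute errors separately keeps systematic overconfidence from being hidden by errors that cancel out.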
 ## Runs (22 total)
 | # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |
 |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | 1 | `28f55110` | ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 |  |  |
 | 2 | `74a017bd` | ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 |  |  |
 | 3 | `6141a021` | factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | — | 5 | 2 | 42473 |  |  |
 | 4 | `91e87d05` | factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 |  |  |
 | 5 | `710b0a62` | factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 |  |  |
 | 6 | `ffc42162` | factual | What is the capital of Mongolia? | 0.99 | 4 | high | no | under | current | — | 4 | 1 | 11009 |  |  |
 | 7 | `7561029e` | factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 |  |  |
 | 8 | `aaf3b9ef` | comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 |  |  |
 | 9 | `01881015` | comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 |  |  |
 | 10 | `9e436db7` | comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 |  |  |
 | 11 | `7c8dd19b` | comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 |  |  |
 | 12 | `e3fa81c3` | comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 |  |  |
 | 13 | `96acce3c` | contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 |  |  |
 | 14 | `c4942f00` | contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 |  |  |
 | 15 | `2e2b6e88` | contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 |  |  |
 | 16 | `27d81891` | contradiction | Is dietary cholesterol harmful? | 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 |  |  |
 | 17 | `9c18d570` | contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | — | budget_exhausted(1) | 0 | 0 | 44375 |  |  |
 | 18 | `f4c43973` | scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 |  |  |
 | 19 | `b3d00938` | scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 |  |  |
 | 20 | `716e548a` | scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 |  |  |
 | 21 | `b7cd9d50` | scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 |  |  |
 | 22 | `a4bb5b7a` | scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 |  |  |
 ## Result files (full content for review)
 1. `/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json`
 2. `/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json`
 3. `/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json`
 4. `/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json`
 5. `/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json`
 6. `/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json`
 7. `/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json`
 8. `/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json`
 9. `/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json`
 10. `/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json`
 11. `/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json`
 12. `/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json`
 13. `/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json`
 14. `/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json`
 15. `/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json`
 16. `/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json`
 17. `/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json`
 18. `/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json`
 19. `/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json`
 20. `/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json`
 21. `/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json`
 22. `/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json`
--- a/docs/stress-tests/M3.3-runs/01-factual.log
+++ b/docs/stress-tests/M3.3-runs/01-factual.log
@ -1,128 +0,0 @@
 Researching: What is the boiling point of liquid nitrogen at standard 
 atmospheric pressure?
 {"question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:49:07.183443Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:49:07.993167Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:49:08.002221Z"}
 {"question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:49:08.036624Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:08.037079Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:08.037172Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1107, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:20.314935Z"}
 {"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 5768, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:25.184914Z"}
 {"step": 15, "decision": "Starting iteration 4/5", "tokens_so_far": 16093, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:27.276067Z"}
 {"step": 17, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 17, "iterations_run": 4, "tokens_used": 29376, "event": "synthesis_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:43.946958Z"}
 {"step": 18, "decision": "Parsed synthesis JSON successfully", "duration_ms": 21492, "event": "synthesis_complete", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:05.440080Z"}
 {"step": 26, "decision": "Research complete", "confidence": 0.98, "citation_count": 5, "gap_count": 0, "discovery_count": 2, "total_duration_sec": 59.528, "event": "complete", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:05.442761Z"}
 {"confidence": 0.98, "citations": 5, "gaps": 0, "discovery_events": 2, "tokens_used": 42473, "iterations_run": 4, "wall_time_sec": 57.403085231781006, "budget_exhausted": false, "event": "research_completed", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:05.442894Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:05.443791Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:05.453034Z"}
 {"trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "confidence": 0.98, "citations": 5, "tokens_used": 42473, "wall_time_sec": 57.403085231781006, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:05.720817Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The boiling point of liquid nitrogen at standard atmospheric pressure (1 atm │
 │ / 14.7 psia / 760 mmHg) is −195.79 °C (77 K; −320 °F). Some sources round    │
 │ this to −195.8 °C or approximately −196 °C. This value represents the        │
 │ temperature at which nitrogen transitions from its liquid phase to a gas     │
 │ phase under normal atmospheric conditions.                                   │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Liquid Nitrogen Temperature   │ The temperature of liquid      │  0.98 │
 │     │ and Facts                     │ nitrogen is −195.79 °C (77 K;  │       │
 │     │ https://sciencenotes.org/liqu │ −320 °F). This is the boiling  │       │
 │     │ id-nitrogen-temperature-and-f │ point of nitrogen. However,    │       │
 │     │ acts/                         │ nitrogen can exist as a liquid │       │
 │     │                               │ between 63 K and 77.2 K        │       │
 │     │                               │ (-346°F and -320.44°F).        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Nitrogen - Thermophysical     │ Boiling Point - at saturation  │  0.97 │
 │     │ Properties                    │ pressure 14.7 psia and 760 mm  │       │
 │     │ https://www.engineeringtoolbo │ Hg - ( o F, o C ) -320.4,      │       │
 │     │ x.com/nitrogen-d_1421.html    │ -195.8                         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ What Is the Temperature of    │ The temperature of liquid      │  0.95 │
 │     │ Liquid Nitrogen? - WestAir    │ nitrogen is -196°C (-321°F) at │       │
 │     │ https://westairgases.com/blog │ its boiling point. The liquid  │       │
 │     │ /liquid-nitrogen-temperature- │ nitrogen temperature range     │       │
 │     │ properties/                   │ spans between -210°C (freezing │       │
 │     │                               │ point) and -196°C (boiling     │       │
 │     │                               │ point).                        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ What is the boiling point of  │ At 1 atmosphere of pressure,   │  0.90 │
 │     │ liquid nitrogen? Does it      │ nitrogen boils at -195.8       │       │
 │     │ change ... - Quora            │ Celsius (-320.4 Fahrenheit).   │       │
 │     │ https://www.quora.com/What-is │ Of course, like any substance, │       │
 │     │ -the-boiling-point-of-liquid- │ boiling point varies directly  │       │
 │     │ nitrogen-Does-it-change-in-a- │ with pressure.                 │       │
 │     │ vacuum-or-at-standard-conditi │                                │       │
 │     │ ons                           │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ The boiling point for liquid  │ The boiling point for liquid   │  0.88 │
 │     │ nitrogen at atmospheric       │ nitrogen at atmospheric        │       │
 │     │ pressure is 77 K.             │ pressure is 77 K. In an open   │       │
 │     │ https://brainly.com/question/ │ container, liquid nitrogen's   │       │
 │     │ 17018364                      │ temperature is generally       │       │
 │     │                               │ around its boiling point of 77 │       │
 │     │                               │ K due to continuous            │       │
 │     │                               │ vaporization.                  │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ database          │ liquid nitrogen   │ The boiling point │
 │                  │                   │ boiling point     │ of nitrogen       │
 │                  │                   │ pressure          │ varies with       │
 │                  │                   │ dependence phase  │ pressure;         │
 │                  │                   │ diagram           │ understanding     │
 │                  │                   │                   │ this relationship │
 │                  │                   │                   │ is useful for     │
 │                  │                   │                   │ industrial and    │
 │                  │                   │                   │ scientific        │
 │                  │                   │                   │ applications.     │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ nitrogen phase    │ Engineering       │
 │                  │                   │ diagram triple    │ ToolBox           │
 │                  │                   │ point critical    │ references a      │
 │                  │                   │ point             │ nitrogen phase    │
 │                  │                   │                   │ diagram showing   │
 │                  │                   │                   │ conditions for    │
 │                  │                   │                   │ solid, liquid,    │
 │                  │                   │                   │ and gas phases.   │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ medium   │ How does the boiling point of   │ Multiple sources note that      │
 │          │ liquid nitrogen change as       │ boiling point varies directly   │
 │          │ pressure decreases toward a     │ with pressure, suggesting       │
 │          │ vacuum?                         │ significant changes under       │
 │          │                                 │ reduced pressure conditions.    │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ What is the exact triple point  │ Sources mention nitrogen exists │
 │          │ temperature and pressure for    │ as a liquid between 63 K and    │
 │          │ nitrogen?                       │ 77.2 K, implying a triple point │
 │          │                                 │ near 63 K, but exact triple     │
 │          │                                 │ point data was not provided in  │
 │          │                                 │ the gathered evidence.          │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.98                                                                │
 │ Corroborating sources: 5                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 1.00                                                │
 │ Budget status: under cap                                                     │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 42473                                                                │
 │ Iterations: 4                                                                │
 │ Wall time: 57.40s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 6141a021-4a47-45df-aa0c-5acd1db78b79
--- a/docs/stress-tests/M3.3-runs/02-factual.log
+++ b/docs/stress-tests/M3.3-runs/02-factual.log
@ -1,145 +0,0 @@
 Researching: When did the James Webb Space Telescope launch?
 {"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:06.289350Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:07.051309Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:07.061145Z"}
 {"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:07.098980Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "When did the James Webb Space Telescope launch?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099569Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099732Z"}
 {"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1050, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:15.512242Z"}
 {"step": 8, "decision": "Starting iteration 3/5", "tokens_so_far": 5418, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:18.749199Z"}
 {"step": 10, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 6, "iterations_run": 3, "tokens_used": 11453, "event": "synthesis_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:28.069780Z"}
 {"step": 11, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24998, "event": "synthesis_complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.942803Z"}
 {"step": 20, "decision": "Research complete", "confidence": 0.99, "citation_count": 5, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 47.037, "event": "complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.943609Z"}
 {"confidence": 0.99, "citations": 5, "gaps": 1, "discovery_events": 2, "tokens_used": 19708, "iterations_run": 3, "wall_time_sec": 44.843754529953, "budget_exhausted": false, "event": "research_completed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:51.943716Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:51.944100Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:51.947937Z"}
 {"trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "confidence": 0.99, "citations": 5, "tokens_used": 19708, "wall_time_sec": 44.843754529953, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.133972Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The James Webb Space Telescope (JWST) launched on December 25, 2021, at      │
 │ 12:20 UTC (7:20 AM ET) aboard an Arianespace Ariane 5 ECA+ rocket (Flight    │
 │ VA256) from the Guiana Space Centre (ELA-3) in Kourou, French Guiana. It     │
 │ entered service on July 12, 2022.                                            │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ James Webb Space Telescope -  │ Launch date: 25 December 2021  │  0.99 │
 │     │ Wikipedia                     │ (2021-12-25), 12:20 UTC |      │       │
 │     │ https://en.wikipedia.org/wiki │ Rocket: Ariane 5 ECA+ (S/N     │       │
 │     │ /James_Webb_Space_Telescope   │ 5113, Flight VA256) | Launch   │       │
 │     │                               │ site: Guiana, ELA-3 |          │       │
 │     │                               │ Contractor: Arianespace |      │       │
 │     │                               │ Entered service: 12 July 2022  │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ The Launch of the James Webb  │ On December 25, 2021, and 7:20 │  0.98 │
 │     │ Space Telescope - YouTube     │ AM ET (12:20 UTC), the James   │       │
 │     │ https://www.youtube.com/watch │ Webb Space Telescope was       │       │
 │     │ ?v=9tXlqWldVVk                │ launched by an ArianeSpace     │       │
 │     │                               │ Ariane 5 rocket from           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ James Webb Space Telescope    │ The launch date was Saturday,  │  0.97 │
 │     │ (JWST) Mission (Ariane 5) -   │ December 25, 2021 at 12:20 PM  │       │
 │     │ RocketLaunch.Live             │ (UTC).                         │       │
 │     │ https://www.rocketlaunch.live │                                │       │
 │     │ /launch/jwst                  │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ James Webb Space Telescope –  │ JWST's launch date was         │  0.95 │
 │     │ College of Science            │ December 25 from Europe's      │       │
 │     │ https://science.utah.edu/news │ Spaceport in Kourou, French    │       │
 │     │ /james-webb-space-telescope/  │ Guiana. Longtime fans of the   │       │
 │     │                               │ telescope are celebrating it   │       │
 │     │                               │ as a Christmas miracle.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ NASA's James Webb Space       │ Liftoff is at 7:20 a.m. EST    │  0.90 │
 │     │ Telescope officially set to   │ (1220 GMT).                    │       │
 │     │ launch Dec. 24 | Space        │                                │       │
 │     │ https://www.space.com/james-w │                                │       │
 │     │ ebb-space-telescope-launch-da │                                │       │
 │     │ te-confirmed                  │                                │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ contradictory_sources │ Space.com headline       │ The Space.com article     │
 │                       │ discrepancy              │ headline references Dec.  │
 │                       │                          │ 24, which was the         │
 │                       │                          │ announced/planned launch  │
 │                       │                          │ date at time of           │
 │                       │                          │ publication, while the    │
 │                       │                          │ actual launch occurred on │
 │                       │                          │ Dec. 25, 2021. This is a  │
 │                       │                          │ pre-launch announcement   │
 │                       │                          │ artifact, not a true      │
 │                       │                          │ contradiction, and all    │
 │                       │                          │ other sources confirm     │
 │                       │                          │ Dec. 25.                  │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ null              │ James Webb Space  │ JWST entered      │
 │                  │                   │ Telescope first   │ service on July   │
 │                  │                   │ science results   │ 12, 2022;         │
 │                  │                   │ July 2022         │ understanding its │
 │                  │                   │                   │ early science     │
 │                  │                   │                   │ results provides  │
 │                  │                   │                   │ context for its   │
 │                  │                   │                   │ operational       │
 │                  │                   │                   │ impact.           │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ null              │ JWST launch       │ The telescope was │
 │                  │                   │ delays history    │ originally        │
 │                  │                   │ original 2007     │ planned to launch │
 │                  │                   │ launch plan       │ in 2007 but faced │
 │                  │                   │                   │ decades of        │
 │                  │                   │                   │ delays, making    │
 │                  │                   │                   │ the history of    │
 │                  │                   │                   │ its development   │
 │                  │                   │                   │ noteworthy.       │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ medium   │ What were the key milestones    │ Wikipedia notes the telescope   │
 │          │ after JWST's launch during its  │ entered service on July 12,     │
 │          │ commissioning phase before      │ 2022, approximately six months  │
 │          │ entering service on July 12,    │ after its December 25, 2021     │
 │          │ 2022?                           │ launch, suggesting a lengthy    │
 │          │                                 │ commissioning process.          │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ What caused JWST's launch to    │ Space.com's article was titled  │
 │          │ slip from December 24 to        │ with a Dec. 24 launch date, but │
 │          │ December 25, 2021?              │ the actual launch occurred on   │
 │          │                                 │ Dec. 25, suggesting a           │
 │          │                                 │ last-minute slip.               │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How does JWST's actual mission  │ Wikipedia lists a 10-year       │
 │          │ performance compare to its      │ planned and 20-year expected    │
 │          │ planned 10-year operational     │ life; precise launch trajectory │
 │          │ lifespan given its fuel         │ reportedly left more fuel than  │
 │          │ efficiency during launch?       │ expected, potentially extending │
 │          │                                 │ the mission.                    │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.99                                                                │
 │ Corroborating sources: 5                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 1.00                                                │
 │ Budget status: under cap                                                     │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 19708                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 44.84s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 91e87d05-6d23-4377-af13-270a8cf701e2
--- a/docs/stress-tests/M3.3-runs/03-factual.log
+++ b/docs/stress-tests/M3.3-runs/03-factual.log
@ -1,179 +0,0 @@
 Researching: What programming language is the Linux kernel primarily written in?
 {"question": "What programming language is the Linux kernel primarily written in?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.691750Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:53.397487Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:53.405825Z"}
 {"question": "What programming language is the Linux kernel primarily written in?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:53.438393Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What programming language is the Linux kernel primarily written in?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:53.438693Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:53.438784Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1096, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:04.950078Z"}
 {"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 7266, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:15.609351Z"}
 {"step": 14, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 16, "iterations_run": 3, "tokens_used": 18342, "event": "synthesis_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:38.886838Z"}
 {"step": 15, "decision": "Parsed synthesis JSON successfully", "duration_ms": 38497, "event": "synthesis_complete", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:16.247727Z"}
 {"step": 26, "decision": "Research complete", "confidence": 0.97, "citation_count": 6, "gap_count": 2, "discovery_count": 2, "total_duration_sec": 85.024, "event": "complete", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:16.248500Z"}
 {"confidence": 0.97, "citations": 6, "gaps": 2, "discovery_events": 2, "tokens_used": 32922, "iterations_run": 3, "wall_time_sec": 82.80920100212097, "budget_exhausted": false, "event": "research_completed", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:16.248601Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:52:16.248962Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:16.252134Z"}
 {"trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "confidence": 0.97, "citations": 6, "tokens_used": 32922, "wall_time_sec": 82.80920100212097, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:16.444923Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The Linux kernel is primarily written in the C programming language,         │
 │ specifically the GNU dialect of ISO C11 (compiled with GCC under -std=gnu11, │
 │ or alternatively with Clang). Assembly language is also used for             │
 │ architecture-specific low-level code. As of late 2022, Rust became an        │
 │ officially supported second language in the kernel, and as of the 2025 Linux │
 │ Kernel Maintainer Summit, Rust was elevated from 'experimental' to a         │
 │ permanent, first-class core language alongside C. According to Open Hub      │
 │ statistics, C accounts for approximately 95.8% of total lines in the kernel  │
 │ codebase, with Assembly at ~0.7% and Rust at ~0.3%. The kernel also uses     │
 │ small amounts of shell script, Python, Make, and Perl for tooling purposes.  │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Programming Language — The    │ The Linux kernel is written in │  1.00 │
 │     │ Linux Kernel documentation    │ the C programming language.    │       │
 │     │ https://docs.kernel.org/proce │ More precisely, it is          │       │
 │     │ ss/programming-language.html  │ typically compiled with gcc    │       │
 │     │                               │ under -std=gnu11: the GNU      │       │
 │     │                               │ dialect of ISO C11. clang is   │       │
 │     │                               │ also supported. The kernel has │       │
 │     │                               │ support for the Rust           │       │
 │     │                               │ programming language under     │       │
 │     │                               │ CONFIG_RUST.                   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ The Linux Kernel Open Source  │ C | 36,226,652 | 5,218,548 |   │  0.97 │
 │     │ Project on Open Hub:          │ 12.6% | 5,867,314 | 47,312,514 │       │
 │     │ Languages Page                │ | 95.8% ... Assembly | 266,797 │       │
 │     │ https://openhub.net/p/linux/a │ | 50,339 | 15.9% | 49,347 |    │       │
 │     │ nalyses/latest/languages_summ │ 366,483 | 0.7% ... Rust |      │       │
 │     │ ary                           │ 90,778 | 35,328 | 28.0% |      │       │
 │     │                               │ 11,361 | 137,467 | 0.3%        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Rust moves from experiment to │ The consensus among the        │  0.95 │
 │     │ a core Linux kernel language  │ assembled developers is that   │       │
 │     │ - Spiceworks                  │ Rust in the kernel is no       │       │
 │     │ https://www.spiceworks.com/so │ longer experimental — it is    │       │
 │     │ ftware/rust-moves-from-experi │ now a core part of the kernel  │       │
 │     │ ment-to-a-core-linux-kernel-l │ and is here to stay. So the    │       │
 │     │ anguage/                      │ 'experimental' tag will be     │       │
 │     │                               │ coming off. This elevates Rust │       │
 │     │                               │ to being the kernel's second   │       │
 │     │                               │ core language alongside C.     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Why Linux Kernel is written   │ Although the current Linux     │  0.92 │
 │     │ in C-language but not in C++? │ Kernel source-code contain     │       │
 │     │ https://thelinuxchannel.org/2 │ certain parts of the code      │       │
 │     │ 024/06/why-linux-kernel-is-wr │ written in assembly code       │       │
 │     │ itten-in-c-language-but-not-i │ (actually native CPU assembly  │       │
 │     │ n-c-thelinuxchannel-kernelpro │ instructions) and recently     │       │
 │     │ gramming/                     │ certain parts of code written  │       │
 │     │                               │ in Rust Language, majority of  │       │
 │     │                               │ the Linux Kernel source-code   │       │
 │     │                               │ is only written in C Language. │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Linux Kernel Contributors And │ The Linux kernel crossed the   │  0.90 │
 │     │ Lines of Code Statistics 2026 │ 40 million line threshold with │       │
 │     │ https://commandlinux.com/stat │ version 6.14 rc1 in January    │       │
 │     │ istics/linux-kernel-contribut │ 2025, containing precisely     │       │
 │     │ ors-lines-of-code-statistics/ │ 40,063,856 lines. This         │       │
 │     │                               │ represents exponential growth  │       │
 │     │                               │ from the original 10,239 lines │       │
 │     │                               │ in version 0.01 released in    │       │
 │     │                               │ 1991.                          │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Rust for Linux - Wikipedia    │ Initial release | October 1,   │  0.93 │
 │     │ https://en.wikipedia.org/wiki │ 2022; 3 years ago (2022-10-01) │       │
 │     │ /Rust_for_Linux               │ | Written in | Rust |          │       │
 │     │                               │ Operating system | Linux |     │       │
 │     │                               │ License | GPL-2.0-only with    │       │
 │     │                               │ Linux-syscall-note.            │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Exact current percentage │ Open Hub statistics may   │
 │                       │ of Rust code in the most │ not reflect the most      │
 │                       │ recent kernel versions   │ recent kernel releases    │
 │                       │ (6.12+)                  │ (6.14+), so the exact     │
 │                       │                          │ current Rust percentage   │
 │                       │                          │ could be slightly higher  │
 │                       │                          │ than 0.3% given active    │
 │                       │                          │ Rust adoption.            │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Whether C++ is           │ Open Hub reports C++ at   │
 │                       │ officially used in any   │ 1.9% of total lines, yet  │
 │                       │ part of the kernel       │ official kernel docs and  │
 │                       │                          │ community sources say C   │
 │                       │                          │ is the language and C++   │
 │                       │                          │ is not used. The C++      │
 │                       │                          │ lines may be in           │
 │                       │                          │ tools/scripts not in the  │
 │                       │                          │ kernel proper.            │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ null              │ Linux kernel Rust │ Rust is growing   │
 │                  │                   │ adoption rate     │ quickly in the    │
 │                  │                   │ 2025 lines of     │ kernel; updated   │
 │                  │                   │ code percentage   │ statistics on its │
 │                  │                   │                   │ share would be    │
 │                  │                   │                   │ valuable          │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ null              │ Linux kernel C++  │ Open Hub shows    │
 │                  │                   │ code usage tools  │ ~1.9% C++ but     │
 │                  │                   │ vs kernel proper  │ official docs do  │
 │                  │                   │                   │ not mention C++;  │
 │                  │                   │                   │ clarifying        │
 │                  │                   │                   │ whether this is   │
 │                  │                   │                   │ tooling code vs   │
 │                  │                   │                   │ kernel code would │
 │                  │                   │                   │ resolve the       │
 │                  │                   │                   │ apparent          │
 │                  │                   │                   │ discrepancy       │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ medium   │ Will Rust eventually surpass    │ Rust is at ~0.3% and Assembly   │
 │          │ Assembly in lines of code       │ at ~0.7% per Open Hub; with     │
 │          │ within the Linux kernel?        │ active Rust driver development, │
 │          │                                 │ Rust may soon exceed Assembly   │
 │          │                                 │ usage.                          │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ What is the roadmap for Rust    │ Rust is now a first-class       │
 │          │ adoption in specific kernel     │ language, but the Spiceworks    │
 │          │ subsystems?                     │ article notes the focus is on   │
 │          │                                 │ 'where, how fast, and under     │
 │          │                                 │ whose terms does Rust spread    │
 │          │                                 │ inside Linux'.                  │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ Why does Open Hub report ~1.9%  │ Open Hub's language breakdown   │
 │          │ C++ in the Linux kernel         │ shows 568,053 code lines of     │
 │          │ codebase when official          │ C++, which may belong to        │
 │          │ documentation does not mention  │ userspace tools or build        │
 │          │ C++ as a supported kernel       │ infrastructure bundled in the   │
 │          │ language?                       │ same repository.                │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.97                                                                │
 │ Corroborating sources: 6                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 1.00                                                │
 │ Budget status: under cap                                                     │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 32922                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 82.81s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 710b0a62-06c8-4f49-83e3-dc651c3702a9
--- a/docs/stress-tests/M3.3-runs/04-factual.log
+++ b/docs/stress-tests/M3.3-runs/04-factual.log
@ -1,115 +0,0 @@
 Researching: What is the capital of Mongolia?
 {"question": "What is the capital of Mongolia?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:16.982178Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:52:17.707574Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:17.715766Z"}
 {"question": "What is the capital of Mongolia?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:17.748116Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the capital of Mongolia?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:17.748504Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:17.748598Z"}
 {"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1043, "event": "iteration_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:25.126703Z"}
 {"step": 7, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 5, "iterations_run": 2, "tokens_used": 5387, "event": "synthesis_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:38.025310Z"}
 {"step": 8, "decision": "Parsed synthesis JSON successfully", "duration_ms": 19958, "event": "synthesis_complete", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:56.937541Z"}
 {"step": 14, "decision": "Research complete", "confidence": 0.99, "citation_count": 4, "gap_count": 0, "discovery_count": 1, "total_duration_sec": 41.287, "event": "complete", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:56.938235Z"}
 {"confidence": 0.99, "citations": 4, "gaps": 0, "discovery_events": 1, "tokens_used": 11009, "iterations_run": 2, "wall_time_sec": 39.189372301101685, "budget_exhausted": false, "event": "research_completed", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:56.938337Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:52:56.938738Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:56.942176Z"}
 {"trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "confidence": 0.99, "citations": 4, "tokens_used": 11009, "wall_time_sec": 39.189372301101685, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:57.144089Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The capital of Mongolia is Ulaanbaatar (also spelled Ulan Bator). It is the  │
 │ largest city in Mongolia, situated at an elevation of 1,350 meters on the    │
 │ Tuul River, and is known as the coldest national capital in the world. The   │
 │ name 'Ulaanbaatar' means 'red hero' in Mongolian. It is home to over half of │
 │ Mongolia's population of approximately 3 million people.                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Ulaanbaatar - Wikipedia       │ Ulaanbaatar is the capital of  │  0.99 │
 │     │ https://en.wikipedia.org/wiki │ Mongolia, and is home to over  │       │
 │     │ /Ulaanbaatar                  │ half the country's population  │       │
 │     │                               │ of about 3 million people.     │       │
 │     │                               │ Human habitation dates back    │       │
 │     │                               │ more than 300,000 years. The   │       │
 │     │                               │ city is located along the Tuul │       │
 │     │                               │ River Valley.                  │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Ulaanbaatar, Mongolia | NASA  │ Ulaanbaatar is the capital of  │  0.99 │
 │     │ Jet Propulsion Laboratory     │ Mongolia, and is home to over  │       │
 │     │ (JPL)                         │ half the country's population  │       │
 │     │ https://www.jpl.nasa.gov/imag │ of about 3 million people. Due │       │
 │     │ es/pia26289-ulaanbaatar-mongo │ to its location deep in the    │       │
 │     │ lia/                          │ interior of Asia, and its high │       │
 │     │                               │ elevation, Ulaanbaatar is the  │       │
 │     │                               │ coldest national capital in    │       │
 │     │                               │ the world.                     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Capital of Mongolia | -       │ Ulaanbaatar (Ulan Bator) is    │  0.95 │
 │     │ Everything You Need to Know   │ capital of Mongolia known as   │       │
 │     │ About Ulaanbaatar             │ the coldest capital on earth.  │       │
 │     │ https://www.travelbuddies.inf │ It is located in central Asia  │       │
 │     │ o/capital-of-mongolia/        │ between China and Russia and   │       │
 │     │                               │ capital and largest city of    │       │
 │     │                               │ Mongolia. Ulaan is red and     │       │
 │     │                               │ Baatar is hero in Mongolian.   │       │
 │     │                               │ In general, Ulaanbaatar means  │       │
 │     │                               │ 'red hero'.                    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Ulan Bator, Mongolia |        │ Ulaanbaatar, also known as     │  0.98 │
 │     │ Geography and Cartography |   │ Ulan Bator, is the capital and │       │
 │     │ Research Starters | EBSCO     │ largest city of Mongolia,      │       │
 │     │ Research                      │ situated at an elevation of    │       │
 │     │ https://www.ebsco.com/researc │ 1,350 meters (4,430 feet) on   │       │
 │     │ h-starters/geography-and-cart │ the Tuul River in the          │       │
 │     │ ography/ulan-bator-mongolia   │ northeast of the Mongolian     │       │
 │     │                               │ plateau.                       │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ null              │ Ulaanbaatar air   │ Multiple sources  │
 │                  │                   │ pollution and     │ mention severe    │
 │                  │                   │ climate           │ air pollution and │
 │                  │                   │ challenges        │ extreme cold as   │
 │                  │                   │                   │ notable           │
 │                  │                   │                   │ characteristics   │
 │                  │                   │                   │ of the capital    │
 │                  │                   │                   │ worth exploring   │
 │                  │                   │                   │ further.          │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ low      │ How has Ulaanbaatar's           │ Sources mention dramatic        │
 │          │ population grown over recent    │ population increases due to     │
 │          │ decades due to rural-to-urban   │ migration from rural areas,     │
 │          │ migration?                      │ with population estimates       │
 │          │                                 │ ranging from 1.4 million to     │
 │          │                                 │ over 1.6 million across         │
 │          │                                 │ sources.                        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What measures is Ulaanbaatar    │ Multiple sources note that coal │
 │          │ taking to address its severe    │ reliance and extreme winters    │
 │          │ air pollution problem?          │ cause significant air pollution │
 │          │                                 │ in the city.                    │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.99                                                                │
 │ Corroborating sources: 4                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 1.00                                                │
 │ Budget status: under cap                                                     │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 11009                                                                │
 │ Iterations: 2                                                                │
 │ Wall time: 39.19s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: ffc42162-5527-4a35-97ad-474aafa47dc1
--- a/docs/stress-tests/M3.3-runs/05-factual.log
+++ b/docs/stress-tests/M3.3-runs/05-factual.log
@ -1,148 +0,0 @@
 Researching: How many amino acids are encoded by the standard genetic code?
 {"question": "How many amino acids are encoded by the standard genetic code?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:57.672745Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:52:58.404691Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:58.415522Z"}
 {"question": "How many amino acids are encoded by the standard genetic code?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:58.449581Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "How many amino acids are encoded by the standard genetic code?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:58.449885Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:58.449974Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1099, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:06.468160Z"}
 {"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 8623, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:11.359260Z"}
 {"step": 17, "decision": "Starting iteration 4/5", "tokens_so_far": 18453, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:15.589960Z"}
 {"step": 19, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 14, "iterations_run": 4, "tokens_used": 34167, "event": "synthesis_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:38.361461Z"}
 {"step": 20, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24174, "event": "synthesis_complete", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:01.506420Z"}
 {"step": 28, "decision": "Research complete", "confidence": 0.98, "citation_count": 4, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 65.235, "event": "complete", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:01.507373Z"}
 {"confidence": 0.98, "citations": 4, "gaps": 1, "discovery_events": 2, "tokens_used": 48308, "iterations_run": 4, "wall_time_sec": 63.05677556991577, "budget_exhausted": false, "event": "research_completed", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:54:01.507469Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:54:01.507940Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:54:01.514127Z"}
 {"trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "confidence": 0.98, "citations": 4, "tokens_used": 48308, "wall_time_sec": 63.05677556991577, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:54:01.785150Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The standard genetic code encodes 20 common amino acids. These are specified │
 │ by 64 possible codons (combinations of three nucleotides from four bases),   │
 │ with most amino acids encoded by more than one codon (a property called      │
 │ degeneracy). Methionine and tryptophan are the only amino acids specified by │
 │ a single codon each. Three codons serve as stop signals rather than encoding │
 │ amino acids. Beyond the standard 20, two additional amino                    │
 │ acids—selenocysteine (the 21st) and pyrrolysine (the 22nd)—are also          │
 │ genetically encoded in certain organisms via reprogramming of stop codons    │
 │ UGA and UAG, respectively, but are not part of the standard set of 20.       │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ The genetic code (article) -  │ Most of the amino acids in the │  0.95 │
 │     │ Khan Academy                  │ genetic code are encoded by at │       │
 │     │ https://www.khanacademy.org/s │ least two codons. In fact,     │       │
 │     │ cience/hs-bio/x230b3ff252126b │ methionine and tryptophan are  │       │
 │     │ b6:gene-expression-and-regula │ the only amino acids specified │       │
 │     │ tion/x230b3ff252126bb6:untitl │ by a single codon.             │       │
 │     │ ed-348/a/the-genetic-code     │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Is there a twenty third amino │ The universal genetic code     │  0.98 │
 │     │ acid in the genetic code? -   │ includes 20 common amino       │       │
 │     │ PubMed                        │ acids. In addition,            │       │
 │     │ https://pubmed.ncbi.nlm.nih.g │ selenocysteine (Sec) and       │       │
 │     │ ov/16713651/                  │ pyrrolysine (Pyl), known as    │       │
 │     │                               │ the twenty first and twenty    │       │
 │     │                               │ second amino acids, are        │       │
 │     │                               │ encoded by UGA and UAG,        │       │
 │     │                               │ respectively, which are the    │       │
 │     │                               │ codons that usually function   │       │
 │     │                               │ as stop signals.               │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Genetic code - Wikipedia      │ The genetic code is highly     │  0.95 │
 │     │ https://en.wikipedia.org/wiki │ similar among all organisms    │       │
 │     │ /Genetic_code                 │ and can be expressed in a      │       │
 │     │                               │ simple table with 64 entries.  │       │
 │     │                               │ The codons specify which amino │       │
 │     │                               │ acid will be added next during │       │
 │     │                               │ protein biosynthesis.          │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Understanding the Genetic     │ The universal                  │  0.97 │
 │     │ Code - PMC                    │ triple-nucleotide genetic      │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ code, allowing DNA-encoded     │       │
 │     │ articles/PMC6620406/          │ mRNA to be translated into the │       │
 │     │                               │ amino acid sequences of        │       │
 │     │                               │ proteins using transfer RNAs   │       │
 │     │                               │ (tRNAs) and many accessory and │       │
 │     │                               │ modification factors, is       │       │
 │     │                               │ essentially common to all      │       │
 │     │                               │ living organisms on Earth.     │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category       ┃ Topic                        ┃ Detail                       ┃
 ┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ scope_exceeded │ Exact codon-to-amino-acid    │ The full detailed codon      │
 │                │ mapping table                │ table listing all 64 codons  │
 │                │                              │ and their corresponding      │
 │                │                              │ amino acids was not          │
 │                │                              │ extracted verbatim from the  │
 │                │                              │ sources, though the total    │
 │                │                              │ count of 20 standard amino   │
 │                │                              │ acids is well established.   │
 └────────────────┴──────────────────────────────┴──────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ database          │ selenocysteine    │ The PubMed source │
 │                  │                   │ pyrrolysine       │ raises the        │
 │                  │                   │ genetic code      │ question of       │
 │                  │                   │ expansion         │ expanded genetic  │
 │                  │                   │ organisms         │ codes beyond 20   │
 │                  │                   │                   │ amino acids,      │
 │                  │                   │                   │ which may be      │
 │                  │                   │                   │ relevant for      │
 │                  │                   │                   │ advanced biology  │
 │                  │                   │                   │ research.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ synthetic biology │ Wikipedia         │
 │                  │                   │ unnatural amino   │ mentions expanded │
 │                  │                   │ acids expanded    │ genetic codes in  │
 │                  │                   │ genetic code      │ synthetic         │
 │                  │                   │                   │ biology,          │
 │                  │                   │                   │ suggesting active │
 │                  │                   │                   │ research into     │
 │                  │                   │                   │ adding more than  │
 │                  │                   │                   │ 22 amino acids.   │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ medium   │ Could a 23rd amino acid ever    │ A PubMed study scanned 16       │
 │          │ become widely distributed and   │ archaeal and 130 bacterial      │
 │          │ genetically encoded in nature?  │ genomes for tRNAs corresponding │
 │          │                                 │ to the three stop codons and    │
 │          │                                 │ concluded that additional       │
 │          │                                 │ widely distributed genetically  │
 │          │                                 │ encoded amino acids are         │
 │          │                                 │ unlikely.                       │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ How many non-standard amino     │ Wikipedia references expanded   │
 │          │ acids have been successfully    │ genetic codes in synthetic      │
 │          │ incorporated into proteins via  │ biology as a distinct topic,    │
 │          │ synthetic biology methods?      │ suggesting                      │
 │          │                                 │ laboratory-engineered codes may │
 │          │                                 │ go beyond the natural 22.       │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.98                                                                │
 │ Corroborating sources: 4                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 1.00                                                │
 │ Budget status: under cap                                                     │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 48308                                                                │
 │ Iterations: 4                                                                │
 │ Wall time: 63.06s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 7561029e-5dcb-4eaa-98e9-7496ed4bf4c2
--- a/docs/stress-tests/M3.3-runs/06-comparative.log
+++ b/docs/stress-tests/M3.3-runs/06-comparative.log
@ -1,226 +0,0 @@
 Researching: Compare the energy density of lithium-ion vs sodium-ion batteries.
 {"question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:54:02.430608Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:54:03.159945Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:54:03.167971Z"}
 {"question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:54:03.200030Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:03.200318Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:03.200405Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1114, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:14.560598Z"}
 {"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 7183, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:18.314755Z"}
 {"step": 19, "decision": "Starting iteration 4/5", "tokens_so_far": 13977, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:28.528912Z"}
 {"step": 24, "decision": "Token budget reached before iteration 5: 28015/20000", "event": "budget_exhausted", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:39.027627Z"}
 {"step": 25, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 24, "iterations_run": 4, "tokens_used": 28015, "event": "synthesis_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:39.028531Z"}
 {"step": 26, "decision": "Parsed synthesis JSON successfully", "duration_ms": 50955, "event": "synthesis_complete", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:27.614289Z"}
 {"step": 41, "decision": "Research complete", "confidence": 0.91, "citation_count": 8, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 87.865, "event": "complete", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:27.616834Z"}
 {"confidence": 0.91, "citations": 8, "gaps": 3, "discovery_events": 3, "tokens_used": 48087, "iterations_run": 4, "wall_time_sec": 84.41376757621765, "budget_exhausted": true, "event": "research_completed", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:55:27.617014Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:55:27.617866Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:55:27.632124Z"}
 {"trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "confidence": 0.91, "citations": 8, "tokens_used": 48087, "wall_time_sec": 84.41376757621765, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:55:27.873634Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Lithium-ion batteries have significantly higher energy density than          │
 │ sodium-ion batteries across all commercial chemistries. Lithium-ion cells    │
 │ achieve 150–300 Wh/kg gravimetrically, depending on chemistry: NMC variants  │
 │ reach 250–300 Wh/kg in premium automotive applications, while LFP cells      │
 │ deliver 150–180 Wh/kg [Source 15]. Volumetrically, lithium-ion batteries     │
 │ reach roughly 250–700 Wh/L [Source 16]. Sodium-ion batteries currently       │
 │ achieve 90–190 Wh/kg gravimetrically; CATL's first-generation commercial     │
 │ cells reached ~160 Wh/kg [Source 15], with newer products like CATL's Naxtra │
 │ reaching ~175 Wh/kg [Source 22], and ScienceDirect prototypes ranging 90–150 │
 │ Wh/kg [Source 7]. The volumetric energy density of sodium-ion is             │
 │ approximately 20–40% lower than lithium-ion equivalents [Source 8]. This gap │
 │ exists fundamentally because sodium ions are heavier and larger than lithium │
 │ ions, reducing the energy stored per unit mass or volume [Source 3, Source   │
 │ 20]. A notable exception is a late-2025 announcement by ZN Energy of an      │
 │ anode-free solid-state sodium-ion pouch cell achieving 348.5 Wh/kg, verified │
 │ by CATARC, using a high-energy layered oxide cathode and anode-free          │
 │ solid-state architecture—though this is a laboratory/prototype result, not   │
 │ yet commercial [Source 10]. In practical terms, sodium-ion batteries are     │
 │ best suited for stationary storage and cost-sensitive low-performance EVs    │
 │ where energy density is less critical, while lithium-ion dominates portable  │
 │ electronics, robotics, and long-range EVs [Source 1, Source 8].              │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Battery Energy Density 2025:  │ Nickel Manganese Cobalt (NMC)  │  0.95 │
 │     │ State of the Art & Next-Gen   │ variants deliver the highest   │       │
 │     │ Tech                          │ energy densities at the cell   │       │
 │     │ https://timharper.net/fieldno │ level, reaching 250-300 Wh/kg  │       │
 │     │ tes/battery-energy-density-20 │ in premium automotive          │       │
 │     │ 25/                           │ applications... Sodium-ion     │       │
 │     │                               │ batteries have emerged from    │       │
 │     │                               │ laboratory curiosity to        │       │
 │     │                               │ commercial reality, with       │       │
 │     │                               │ CATL's first-generation cells  │       │
 │     │                               │ achieving 160 Wh/kg energy     │       │
 │     │                               │ density.                       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Sodium ion batteries: A       │ Current prototypes of SIBs     │  0.95 │
 │     │ sustainable alternative to    │ have energy densities of       │       │
 │     │ lithium-ion ...               │ 90–150 Wh/kg, which remain     │       │
 │     │ https://www.sciencedirect.com │ lower than the 130–285 Wh/kg   │       │
 │     │ /science/article/pii/S2949821 │ typically achieved             │       │
 │     │ X25002418                     │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Sodium-ion batteries: Should  │ Sodium is heavier than         │  0.97 │
 │     │ we believe the hype?          │ lithium, and its ions are      │       │
 │     │ https://cen.acs.org/energy/en │ larger, resulting in a         │       │
 │     │ ergy-storage-/Sodium-ion-batt │ volumetric energy density that │       │
 │     │ eries-Should-believe/103/web/ │ is 20–40% less than that of    │       │
 │     │ 2025/11                       │ lithium ion. Consequently, a   │       │
 │     │                               │ sodium-ion battery is bigger   │       │
 │     │                               │ and heavier than an equivalent │       │
 │     │                               │ one made with lithium.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Energy Density of Lithium-Ion │ Modern lithium-ion batteries   │  0.90 │
 │     │ Batteries Explained: Wh/kg vs │ achieve 150-300 Wh/kg and      │       │
 │     │ Wh/L                          │ 250-700 Wh/L, depending on     │       │
 │     │ https://www.longsingtech.com/ │ chemistry and design.          │       │
 │     │ energy-density-of-lithium-ion │                                │       │
 │     │ -batteries/                   │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Sodium Ion vs Lithium Ion     │ Energy Density (Gravimetric):  │  0.88 │
 │     │ Batteries: 2026 Comparison &  │ Sodium-ion typically ranges    │       │
 │     │ Key Advantages                │ from 100–175 Wh/kg (e.g.,      │       │
 │     │ https://chargeprotexas.com/so │ CATL's Naxtra at ~175 Wh/kg).  │       │
 │     │ dium-ion-vs-lithium-ion-batte │ Lithium-ion hits 150–250+      │       │
 │     │ ries-2026-comparison/         │ Wh/kg (LFP: 150–210; NMC:      │       │
 │     │                               │ 240–350).                      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ ZN Energy Breaks Sodium-Ion   │ Its >25Ah large-format AFSSSIB │  0.78 │
 │     │ Battery Density Record at     │ pouch cell achieved a          │       │
 │     │ 348.5Wh/kg                    │ gravimetric energy density of  │       │
 │     │ https://www.linkedin.com/post │ 348.5Wh/kg, verified by CATARC │       │
 │     │ s/jerry-wan-069b41105_breakin │ (China Automotive Technology & │       │
 │     │ g-the-sodium-ceiling-zhaona-e │ Research Center, Tianjin).     │       │
 │     │ nergy-activity-74134108276403 │ This is not an incremental     │       │
 │     │ 20000-NHd_                    │ improvement—it directly        │       │
 │     │                               │ challenges the long-held       │       │
 │     │                               │ assumption that sodium         │       │
 │     │                               │ chemistry is structurally      │       │
 │     │                               │ capped at 'low energy          │       │
 │     │                               │ density.'                      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Sodium as a Green Substitute  │ But there are also downsides   │  0.93 │
 │     │ for Lithium in Batteries      │ to sodium-ion batteries, the   │       │
 │     │ https://physics.aps.org/artic │ top one being a lower energy   │       │
 │     │ les/v17/73                    │ density than their lithium-ion │       │
 │     │                               │ counterparts. Energy density   │       │
 │     │                               │ has a direct bearing on the    │       │
 │     │                               │ driving range of an electric   │       │
 │     │                               │ vehicle.                       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Sodium-Ion vs Lithium-Ion     │ lithium-ion batteries dominate │  0.85 │
 │     │ Batteries Differences and     │ high-performance applications  │       │
 │     │ Applications in 2025          │ like consumer electronics and  │       │
 │     │ https://www.large-battery.com │ robotics, owing to their       │       │
 │     │ /blog/na-ion-vs-li-ion-batter │ superior energy density of     │       │
 │     │ ies-2025/                     │ 100–270 Wh/kg.                 │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Volumetric energy        │ Most sources provide      │
 │                       │ density figures for      │ gravimetric (Wh/kg) data  │
 │                       │ sodium-ion batteries     │ for sodium-ion; specific  │
 │                       │                          │ Wh/L volumetric figures   │
 │                       │                          │ for sodium-ion cells at   │
 │                       │                          │ the commercial pack level │
 │                       │                          │ were not found in         │
 │                       │                          │ evidence.                 │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Independent verification │ The 348.5 Wh/kg result    │
 │                       │ of ZN Energy 348.5 Wh/kg │ for sodium-ion is from a  │
 │                       │ claim                    │ LinkedIn post summarizing │
 │                       │                          │ a company announcement.   │
 │                       │                          │ No peer-reviewed or       │
 │                       │                          │ independent third-party   │
 │                       │                          │ publication was found to  │
 │                       │                          │ corroborate this figure.  │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ scope_exceeded        │ Cycle life vs energy     │ While cycle life is       │
 │                       │ density trade-offs in    │ mentioned in some         │
 │                       │ sodium-ion               │ sources, a detailed       │
 │                       │                          │ quantitative comparison   │
 │                       │                          │ of how energy density     │
 │                       │                          │ degrades over cycle life  │
 │                       │                          │ compared to lithium-ion   │
 │                       │                          │ was not covered in the    │
 │                       │                          │ evidence.                 │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ new_source       │ arxiv             │ anode-free        │ ZN Energy's 348.5 │
 │                  │                   │ solid-state       │ Wh/kg claim would │
 │                  │                   │ sodium-ion        │ benefit from      │
 │                  │                   │ battery energy    │ peer-reviewed     │
 │                  │                   │ density 2025      │ validation on     │
 │                  │                   │                   │ arXiv or similar  │
 │                  │                   │                   │ preprint server.  │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ sodium-ion        │ Volumetric energy │
 │                  │                   │ battery           │ density for       │
 │                  │                   │ volumetric energy │ sodium-ion at the │
 │                  │                   │ density Wh/L      │ cell and pack     │
 │                  │                   │ commercial cells  │ level is          │
 │                  │                   │ 2025              │ underrepresented  │
 │                  │                   │                   │ in current        │
 │                  │                   │                   │ evidence.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ layered oxide     │ Multiple sources  │
 │                  │                   │ cathode           │ mention cathode   │
 │                  │                   │ sodium-ion        │ engineering as    │
 │                  │                   │ specific capacity │ the key           │
 │                  │                   │ cycle stability   │ bottleneck for    │
 │                  │                   │ 2025              │ sodium-ion energy │
 │                  │                   │                   │ density           │
 │                  │                   │                   │ improvement.      │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Will sodium-ion batteries ever  │ ZN Energy's prototype achieved  │
 │          │ match or exceed LFP lithium-ion │ 348.5 Wh/kg, but commercial     │
 │          │ in gravimetric energy density   │ CATL sodium-ion cells are at    │
 │          │ at the commercial pack level?   │ ~160–175 Wh/kg while LFP cells  │
 │          │                                 │ are 150–180 Wh/kg. The gap is   │
 │          │                                 │ closing in prototypes but not   │
 │          │                                 │ yet in commercial products.     │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How does energy density change  │ Sources mention sodium-ion's    │
 │          │ over the cycle life of          │ lower risk of thermal runaway   │
 │          │ sodium-ion vs lithium-ion       │ and good low-temperature        │
 │          │ batteries under real-world      │ performance, but long-term      │
 │          │ conditions?                     │ energy density retention data   │
 │          │                                 │ was not found.                  │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the volumetric energy   │ C&EN states volumetric density  │
 │          │ density (Wh/L) of current       │ is 20–40% lower than            │
 │          │ commercial sodium-ion battery   │ lithium-ion but provides no     │
 │          │ packs?                          │ absolute Wh/L figures for       │
 │          │                                 │ sodium-ion.                     │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.91                                                                │
 │ Corroborating sources: 8                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.97                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 48087                                                                │
 │ Iterations: 4                                                                │
 │ Wall time: 84.41s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: aaf3b9ef-d91a-4d03-8883-b0a906929cb1
--- a/docs/stress-tests/M3.3-runs/07-comparative.log
+++ b/docs/stress-tests/M3.3-runs/07-comparative.log
@ -1,350 +0,0 @@
 Researching: Compare PostgreSQL and SQLite for embedded analytics workloads.
 {"question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:55:28.499294Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:55:29.256154Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:55:29.264747Z"}
 {"question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:55:29.297908Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:29.298261Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:29.298356Z"}
 {"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1147, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:38.957520Z"}
 {"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 8781, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:45.812510Z"}
 {"step": 23, "decision": "Starting iteration 4/5", "tokens_so_far": 18324, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:00.757335Z"}
 {"step": 28, "decision": "Token budget reached before iteration 5: 34877/20000", "event": "budget_exhausted", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:03.990690Z"}
 {"step": 29, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 35, "iterations_run": 4, "tokens_used": 34877, "event": "synthesis_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:03.990849Z"}
 {"step": 30, "decision": "Parsed synthesis JSON successfully", "duration_ms": 78663, "event": "synthesis_complete", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:20.513065Z"}
 {"step": 48, "decision": "Research complete", "confidence": 0.88, "citation_count": 10, "gap_count": 3, "discovery_count": 4, "total_duration_sec": 114.441, "event": "complete", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:20.536570Z"}
 {"confidence": 0.88, "citations": 10, "gaps": 3, "discovery_events": 4, "tokens_used": 61699, "iterations_run": 4, "wall_time_sec": 111.20896744728088, "budget_exhausted": true, "event": "research_completed", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:57:20.538075Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:57:20.546420Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:57:20.668474Z"}
 {"trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "confidence": 0.88, "citations": 10, "tokens_used": 61699, "wall_time_sec": 111.20896744728088, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:57:21.511598Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ PostgreSQL and SQLite differ substantially for embedded analytics workloads  │
 │ across architecture, concurrency, feature set, and performance               │
 │ characteristics.                                                             │
 │                                                                              │
 │ **Architecture:** SQLite is a serverless, embedded database engine that      │
 │ reads/writes a single file on disk, making it highly portable and            │
 │ zero-configuration. PostgreSQL uses a client-server architecture requiring a │
 │ separate server process, which enables it to scale and handle multiple       │
 │ concurrent clients efficiently via Multi-Version Concurrency Control (MVCC)  │
 │ [Source 5]. For embedded analytics specifically, SQLite's in-process nature  │
 │ eliminates network overhead, which can yield significant read performance    │
 │ advantages in local scenarios [Source 31].                                   │
 │                                                                              │
 │ **Concurrency:** SQLite allows multiple concurrent readers but only one      │
 │ writer at a time, using file-level locking. This single-writer model is a    │
 │ significant bottleneck for write-heavy or high-concurrency analytical        │
 │ ingestion workloads [Source 24, Source 25]. PostgreSQL's MVCC ensures        │
 │ readers and writers do not block each other, making it far superior for      │
 │ multi-user or mixed OLTP/OLAP environments [Source 5]. Turso's work on       │
 │ concurrent writes for SQLite demonstrates the community recognizes this      │
 │ limitation, achieving up to 4x write throughput improvements over vanilla    │
 │ SQLite [Source 24].                                                          │
 │                                                                              │
 │ **OLAP/Analytical Performance:** SQLite is row-oriented and was designed     │
 │ primarily as a world-class OLTP engine. For analytical workloads—complex     │
 │ aggregations, percentile calculations, large scans—SQLite struggles          │
 │ significantly. A cited benchmark shows a single percentile query over 13M    │
 │ rows taking ~4 seconds in SQLite [Source 6]. PostgreSQL, while also          │
 │ row-oriented, supports more advanced SQL features (window functions, complex │
 │ joins, partitioning) and can be tuned for analytics [Source 22]. However,    │
 │ PostgreSQL itself hits a 'Postgres Wall' for heavy analytical workloads when │
 │ row-scanning large datasets exceeds available RAM [Source 13]. Neither       │
 │ SQLite nor PostgreSQL is natively columnar; PostgreSQL can be extended with  │
 │ columnar storage extensions for better OLAP performance [Source 23].         │
 │                                                                              │
 │ **Feature Set:** PostgreSQL offers a richer feature set including more data  │
 │ types, advanced indexing, role-based access control, JSON/array support,     │
 │ geospatial extensions (PostGIS), and time-series extensions. SQLite uses     │
 │ dynamic typing and has a simpler, more limited feature set—easier to use but │
 │ potentially limiting for complex analytical applications [Source 5, Source   │
 │ 1].                                                                          │
 │                                                                              │
 │ **Recommended Alternatives for Embedded Analytics:** DuckDB is widely cited  │
 │ as the superior embedded engine for analytical workloads, outperforming both │
 │ SQLite and PostgreSQL on OLAP queries by a large margin [Source 6, Source    │
 │ 2]. For embedded analytics use cases requiring columnar processing, DuckDB   │
 │ or Stoolap (a Rust-based embedded OLAP engine) are more purpose-built        │
 │ options. Stoolap benchmarks show up to 138x faster analytical query          │
 │ performance versus SQLite [Source 9].                                        │
 │                                                                              │
 │ **Summary:** SQLite wins for lightweight, read-heavy, single-writer,         │
 │ local/embedded OLTP workloads where portability and zero configuration       │
 │ matter. PostgreSQL wins for multi-user, concurrent, complex-query            │
 │ environments. For true embedded analytics workloads (large-scale             │
 │ aggregations, complex OLAP queries), neither is optimal—DuckDB or a hybrid   │
 │ architecture (PostgreSQL as system-of-record + DuckDB as analytical engine)  │
 │ is the modern recommended approach.                                          │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ SQLite vs. PostgreSQL: The    │ PostgreSQL is a client-server  │  0.97 │
 │     │ key differences and           │ database system... This        │       │
 │     │ advantages of each            │ architecture enables           │       │
 │     │ https://databaseschool.com/ar │ PostgreSQL to scale and handle │       │
 │     │ ticles/sqlite-vs-postgresql-t │ multiple concurrent clients    │       │
 │     │ he-key-differences-and-advant │ efficiently... SQLite is a     │       │
 │     │ ages-of-each                  │ serverless database engine. It │       │
 │     │                               │ functions as a lightweight     │       │
 │     │                               │ library embedded directly into │       │
 │     │                               │ applications... SQLite's       │       │
 │     │                               │ concurrency model is more      │       │
 │     │                               │ restrictive: while it allows   │       │
 │     │                               │ multiple readers, only one     │       │
 │     │                               │ process can write to the       │       │
 │     │                               │ database at a time.            │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Making -SQLite- Analytics     │ In some analytical queries     │  0.95 │
 │     │ Great Again! – Oldmoe's blog  │ SQLite will struggle to        │       │
 │     │ https://oldmoe.blog/2025/03/1 │ perform compared to other OLAP │       │
 │     │ 2/making-sqlite-analytics-gre │ oriented engines like DuckDB.  │       │
 │     │ at-again/                     │ Consider the following         │       │
 │     │                               │ scenario: You have a table     │       │
 │     │                               │ with 13M entries of latency    │       │
 │     │                               │ data, and you want to          │       │
 │     │                               │ determine the following        │       │
 │     │                               │ percentiles: p50, p95, p99...  │       │
 │     │                               │ After around 4 seconds you     │       │
 │     │                               │ will see the result.           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ DuckDB vs. Postgres for       │ That 'quick' analytical query  │  0.95 │
 │     │ embedded analytics: How to    │ powering a customer-facing     │       │
 │     │ choose (and when to use a     │ dashboard now takes 5 seconds, │       │
 │     │ hybrid architecture)          │ up from 50 milliseconds. Then  │       │
 │     │ https://motherduck.com/learn- │ thirty seconds. Then it times  │       │
 │     │ more/duckdb-vs-postgres-embed │ out. You've hit the 'Postgres  │       │
 │     │ ded-analytics/                │ Wall.' This isn't a Postgres   │       │
 │     │                               │ failure. It's an architectural │       │
 │     │                               │ mismatch. Postgres processes   │       │
 │     │                               │ analytics using the same       │       │
 │     │                               │ row-oriented logic designed    │       │
 │     │                               │ for transaction safety.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Beyond the Single-Writer      │ SQLite has a single-writer     │  0.93 │
 │     │ Limitation with Turso's       │ transaction model, which means │       │
 │     │ Concurrent Writes             │ whenever a transaction writes  │       │
 │     │ https://turso.tech/blog/beyon │ to the database, no other      │       │
 │     │ d-the-single-writer-limitatio │ write transactions can make    │       │
 │     │ n-with-tursos-concurrent-writ │ progress until that            │       │
 │     │ es                            │ transaction is complete...     │       │
 │     │                               │ When concurrent writes are     │       │
 │     │                               │ used, we achieve up to 4x the  │       │
 │     │                               │ write throughput of SQLite,    │       │
 │     │                               │ while also removing the        │       │
 │     │                               │ dreaded SQLITE_BUSY error.     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Stoolap vs. SQLite: Comparing │ OLAP (Online Analytical        │  0.92 │
 │     │ Rust OLAP and Traditional     │ Processing) systems are        │       │
 │     │ OLTP Databases | Better Stack │ designed for a completely      │       │
 │     │ Community                     │ different purpose. OLAP        │       │
 │     │ https://betterstack.com/commu │ databases are optimized for    │       │
 │     │ nity/guides/ai/stoolap-vs-sql │ complex queries and data       │       │
 │     │ ite/                          │ analysis... Most standard      │       │
 │     │                               │ application databases,         │       │
 │     │                               │ including SQLite, PostgreSQL,  │       │
 │     │                               │ and MySQL, are classified as   │       │
 │     │                               │ OLTP (Online Transaction       │       │
 │     │                               │ Processing) systems.           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Postgres Tuning & Performance │ Analytics or OLAP activity     │  0.91 │
 │     │ for Analytics Data | Crunchy  │ typically involves much        │       │
 │     │ Data Blog                     │ longer, more complex queries   │       │
 │     │ https://www.crunchydata.com/b │ than OLTP activity, joining    │       │
 │     │ log/postgres-tuning-and-perfo │ data from multiple tables, and │       │
 │     │ rmance-for-analytics-data     │ working on large data sets.    │       │
 │     │                               │ This means it's very resource  │       │
 │     │                               │ intensive. Without careful     │       │
 │     │                               │ planning and tuning, you can   │       │
 │     │                               │ find yourself with analytics   │       │
 │     │                               │ queries that not only take far │       │
 │     │                               │ too long to run, but also slow │       │
 │     │                               │ down your existing             │       │
 │     │                               │ application.                   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Postgres Columnar Storage: 4  │ PostgreSQL is a row-oriented   │  0.90 │
 │     │ Popular Extensions and a      │ database by design, meaning it │       │
 │     │ Quick Tutorial                │ stores data tuple-by-tuple...  │       │
 │     │ https://www.epsio.io/blog/pos │ This structure is suitable for │       │
 │     │ tgres-columnar-storage-4-popu │ transactional workloads but    │       │
 │     │ lar-extensions-and-a-quick-tu │ not optimized for analytical   │       │
 │     │ torial                        │ queries that typically scan    │       │
 │     │                               │ large volumes of data across a │       │
 │     │                               │ few columns... While           │       │
 │     │                               │ PostgreSQL does not natively   │       │
 │     │                               │ support columnar storage,      │       │
 │     │                               │ several extensions and         │       │
 │     │                               │ external tools introduce       │       │
 │     │                               │ columnar capabilities.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ SQLite vs PostgreSQL          │ SQLite was faster. Of course   │  0.88 │
 │     │ Performance & Comparison |    │ it was. Writing to a local     │       │
 │     │ Pythonic AF                   │ file inside the same process   │       │
 │     │ https://medium.com/pythonic-a │ will almost always be faster   │       │
 │     │ f/sqlite-vs-postgresql-perfor │ than sending queries to a      │       │
 │     │ mance-comparison-46ba1d39c9c8 │ server.                        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Everyone Is Wrong About       │ why SQLite is often the        │  0.80 │
 │     │ SQLite (Here's When It Beats  │ superior production choice for │       │
 │     │ Postgres)                     │ read-heavy, single-server, and │       │
 │     │ https://www.youtube.com/watch │ edge workloads ... SQLite vs   │       │
 │     │ ?v=t20KyfjtUs4                │ PostgreSQL Performance.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ SQLite SO MUCH FASTER than    │ Of course, with the advent of  │  0.82 │
 │     │ Postgres - Reddit             │ DuckDB, you use DuckDB for     │       │
 │     │ https://www.reddit.com/r/sqli │ data analysis tasks since it   │       │
 │     │ te/comments/1gu219r/sqlite_so │ can be faster than either      │       │
 │     │ _much_faster_than_postgres/   │ SQLite or PostgreSQL in those  │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic                       ┃ Detail                      ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found │ Quantitative head-to-head   │ Most benchmarks found       │
 │                  │ benchmark of SQLite vs      │ compare SQLite vs           │
 │                  │ PostgreSQL specifically on  │ PostgreSQL on OLTP          │
 │                  │ analytical queries (not     │ (reads/writes of individual │
 │                  │ just OLTP)                  │ rows) or compare each       │
 │                  │                             │ individually to             │
 │                  │                             │ DuckDB/Stoolap on OLAP. A   │
 │                  │                             │ direct, rigorous benchmark  │
 │                  │                             │ of SQLite vs PostgreSQL on  │
 │                  │                             │ complex analytical queries  │
 │                  │                             │ (GROUP BY, window           │
 │                  │                             │ functions, aggregations     │
 │                  │                             │ over millions of rows) was  │
 │                  │                             │ not surfaced in the         │
 │                  │                             │ evidence.                   │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ PostgreSQL columnar         │ While columnar extensions   │
 │                  │ extension performance vs    │ for PostgreSQL (e.g., Citus │
 │                  │ SQLite for embedded         │ columnar, hydra) are        │
 │                  │ analytics                   │ mentioned, no direct        │
 │                  │                             │ benchmark comparing         │
 │                  │                             │ PostgreSQL-with-columnar-ex │
 │                  │                             │ tension vs SQLite for       │
 │                  │                             │ embedded analytical         │
 │                  │                             │ workloads was found.        │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ SQLite WAL mode impact on   │ WAL mode is mentioned as    │
 │                  │ analytical query            │ improving concurrent        │
 │                  │ performance                 │ read/write behavior in      │
 │                  │                             │ SQLite, but its specific    │
 │                  │                             │ impact on analytical query  │
 │                  │                             │ throughput in embedded      │
 │                  │                             │ scenarios was not           │
 │                  │                             │ quantified in the evidence. │
 └──────────────────┴─────────────────────────────┴─────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ database          │ DuckDB vs SQLite  │ DuckDB is         │
 │                  │                   │ vs PostgreSQL     │ consistently      │
 │                  │                   │ analytical        │ cited as          │
 │                  │                   │ benchmark OLAP    │ outperforming     │
 │                  │                   │ embedded 2024     │ both for          │
 │                  │                   │ 2025              │ analytics; a      │
 │                  │                   │                   │ rigorous          │
 │                  │                   │                   │ three-way         │
 │                  │                   │                   │ comparison would  │
 │                  │                   │                   │ better answer the │
 │                  │                   │                   │ embedded          │
 │                  │                   │                   │ analytics         │
 │                  │                   │                   │ question.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ SQLite past       │ The VLDB paper on │
 │                  │                   │ present future    │ SQLite's          │
 │                  │                   │ VLDB paper bloom  │ past/present/futu │
 │                  │                   │ filter analytical │ re is cited       │
 │                  │                   │ performance 2022  │ multiple times as │
 │                  │                   │                   │ authoritative on  │
 │                  │                   │                   │ SQLite's          │
 │                  │                   │                   │ analytical        │
 │                  │                   │                   │ limitations;      │
 │                  │                   │                   │ accessing it      │
 │                  │                   │                   │ directly would    │
 │                  │                   │                   │ strengthen        │
 │                  │                   │                   │ claims.           │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ pg_duckdb         │ The motherduck    │
 │                  │                   │ extension         │ article           │
 │                  │                   │ PostgreSQL        │ references        │
 │                  │                   │ embedded          │ pg_duckdb as a    │
 │                  │                   │ analytics         │ key tool for      │
 │                  │                   │ performance       │ hybrid            │
 │                  │                   │ hybrid            │ Postgres+DuckDB   │
 │                  │                   │ architecture      │ analytics;        │
 │                  │                   │                   │ benchmarks for    │
 │                  │                   │                   │ this approach     │
 │                  │                   │                   │ were not found.   │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ new_source       │ null              │ Stoolap embedded  │ Stoolap is an     │
 │                  │                   │ OLAP Rust         │ emerging embedded │
 │                  │                   │ database          │ OLAP engine       │
 │                  │                   │ benchmark SQLite  │ (Rust) claiming   │
 │                  │                   │ PostgreSQL        │ 138x speedup over │
 │                  │                   │                   │ SQLite; it's a    │
 │                  │                   │                   │ relevant new      │
 │                  │                   │                   │ entrant to the    │
 │                  │                   │                   │ embedded          │
 │                  │                   │                   │ analytics space.  │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ At what data volume does        │ The evidence shows SQLite       │
 │          │ SQLite's analytical performance │ struggles at 13M rows for       │
 │          │ become unacceptably slow        │ percentile queries (~4s), but   │
 │          │ compared to PostgreSQL for      │ no clear threshold or scaling   │
 │          │ typical embedded analytics      │ curve vs PostgreSQL was found.  │
 │          │ workloads?                      │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Does enabling WAL mode and      │ Hacker News discussion mentions │
 │          │ tuning SQLite                   │ WAL + synchronous=NORMAL as     │
 │          │ (synchronous=NORMAL, page size, │ approaching 'line speed with IO │
 │          │ etc.) meaningfully close the    │ subsystem' for writes, but      │
 │          │ analytical performance gap with │ analytical query impact is      │
 │          │ PostgreSQL?                     │ unclear.                        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Is a hybrid architecture        │ The Postgres+DuckDB hybrid is   │
 │          │ (SQLite for OLTP + DuckDB for   │ well-documented, but an         │
 │          │ OLAP, sharing the same data)    │ SQLite+DuckDB embedded hybrid   │
 │          │ practical for embedded          │ (for truly serverless apps) is  │
 │          │ applications, and how does it   │ less explored in the evidence.  │
 │          │ compare to using PostgreSQL     │                                 │
 │          │ alone?                          │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How do PostgreSQL columnar      │ PostgreSQL columnar extensions  │
 │          │ storage extensions (e.g.,       │ are mentioned as improving OLAP │
 │          │ Hydra, Citus columnar) perform  │ performance, but no direct      │
 │          │ for embedded analytics compared │ comparison to SQLite in         │
 │          │ to native SQLite?               │ embedded scenarios was found.   │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the operational         │ SQLite's binary is ~500KB vs    │
 │          │ overhead (memory, disk, setup   │ PostgreSQL requiring a server   │
 │          │ complexity) of running          │ process; for edge/IoT embedded  │
 │          │ PostgreSQL vs SQLite in a truly │ analytics, resource constraints │
 │          │ embedded edge or mobile         │ may be the deciding factor.     │
 │          │ environment?                    │                                 │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.88                                                                │
 │ Corroborating sources: 10                                                    │
 │ Source authority: medium                                                     │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.82                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 61699                                                                │
 │ Iterations: 4                                                                │
 │ Wall time: 111.21s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 01881015-61a9-4894-a723-4e1d8b7a7755
--- a/docs/stress-tests/M3.3-runs/08-comparative.log
+++ b/docs/stress-tests/M3.3-runs/08-comparative.log
@ -1,364 +0,0 @@
 Researching: Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.
 {"question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:57:22.951394Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:57:23.942406Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:57:23.953465Z"}
 {"question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:57:24.008304Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:24.008814Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:24.008920Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1180, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:42.087229Z"}
 {"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 12270, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:47.632253Z"}
 {"step": 21, "decision": "Token budget reached before iteration 4: 25966/20000", "event": "budget_exhausted", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:55.072818Z"}
 {"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 24, "iterations_run": 3, "tokens_used": 25966, "event": "synthesis_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:55.072985Z"}
 {"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 89456, "event": "synthesis_complete", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:21.172200Z"}
 {"step": 46, "decision": "Research complete", "confidence": 0.82, "citation_count": 14, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 121.701, "event": "complete", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:21.274347Z"}
 {"confidence": 0.82, "citations": 14, "gaps": 4, "discovery_events": 4, "tokens_used": 54153, "iterations_run": 3, "wall_time_sec": 117.15539288520813, "budget_exhausted": true, "event": "research_completed", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:59:21.275590Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:59:21.286942Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:59:21.531952Z"}
 {"trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "confidence": 0.82, "citations": 14, "tokens_used": 54153, "wall_time_sec": 117.15539288520813, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:59:22.766505Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ CRISPR-Cas9 and CRISPR-Cas12a (formerly Cpf1) are both widely used           │
 │ RNA-guided nucleases adapted for genome editing, including in vivo           │
 │ applications, but they differ meaningfully in mechanism, structure, PAM      │
 │ requirements, cutting pattern, guide RNA architecture, specificity, and      │
 │ practical suitability for in vivo delivery.                                  │
 │                                                                              │
 │ **Mechanism and DNA Cleavage:** Cas9 (most commonly from Streptococcus       │
 │ pyogenes, SpCas9) cleaves both DNA strands at the same position, producing   │
 │ blunt-ended double-strand breaks (DSBs) [Source 7]. Cas12a, by contrast,     │
 │ introduces staggered cuts that leave 4–5 nucleotide 5′ overhangs [Sources 2, │
 │ 7]. These sticky ends generated by Cas12a may enhance homology-directed      │
 │ repair (HDR) efficiency compared to Cas9's blunt ends [Source 2].            │
 │                                                                              │
 │ **PAM Sequence:** Cas9 requires an NGG PAM (protospacer adjacent motif) on   │
 │ the non-template strand downstream of the target; Cas12a recognizes a T-rich │
 │ PAM (typically TTTV) upstream of the target on the non-template strand       │
 │ [Sources 2, 7]. This difference expands the targeting range of Cas12a to     │
 │ AT-rich genomic regions where Cas9 is limited.                               │
 │                                                                              │
 │ **Guide RNA:** Cas9 uses a two-component guide (crRNA + tracrRNA, often      │
 │ fused as sgRNA), while Cas12a requires only a single crRNA with a short      │
 │ direct repeat and processes its own pre-crRNA array, enabling multiplexed    │
 │ editing from a single transcript [Sources 2, 7, 13].                         │
 │                                                                              │
 │ **Specificity and Off-Target Effects:** Kinetic studies show Cas12a exhibits │
 │ greater target specificity than Cas9, attributed to a more stringent DNA     │
 │ unwinding mechanism that requires more extensive complementarity before      │
 │ cleavage [Source 5]. Cas12a tolerates fewer mismatches between the guide RNA │
 │ and target, resulting in fewer off-target cuts [Sources 2, 5].               │
 │                                                                              │
 │ **Editing Efficiency:** In comparative studies using ribonucleoprotein (RNP) │
 │ delivery in rice (OsPDS gene), Cas9 and Cas12a showed different efficiencies │
 │ depending on the target site [Source 1]. In Chlamydomonas reinhardtii, both  │
 │ Cas9 and Cas12a RNPs co-delivered with ssODN repair templates achieved       │
 │ similar total editing levels of 20–30% [Source 4]. Context and target site   │
 │ selection significantly influence which enzyme performs better.              │
 │                                                                              │
 │ **In Vivo Delivery Considerations:** Both enzymes can be delivered via AAV   │
 │ vectors, lipid nanoparticles (LNPs), or as RNPs via electroporation [Sources │
 │ 21, 24]. A critical practical consideration is size: SpCas9 (~4.2 kb coding  │
 │ sequence) is near the AAV packaging limit (~4.7–4.8 kb), leaving little room │
 │ for promoter and regulatory elements [Sources 20, 21]. Cas12a variants       │
 │ (including engineered compact forms such as EbCas12a) can be packaged        │
 │ together with their crRNA within a single AAV vector, which is a significant │
 │ advantage for in vivo delivery [Sources 19, 20, 21]. A miniature Cas12f1     │
 │ variant has also demonstrated efficacy for in vivo retinal gene therapy      │
 │ [Source 12].                                                                 │
 │                                                                              │
 │ **Clinical and Therapeutic Status:** CRISPR-Cas9 is currently the dominant   │
 │ nuclease in clinical trials for both ex vivo and in vivo genome editing      │
 │ [Sources 8, 11]. Cas12a is gaining traction in therapeutic research,         │
 │ particularly where higher specificity or AAV-compatible delivery is required │
 │ [Sources 9, 13, 22].                                                         │
 │                                                                              │
 │ **Summary Table:**                                                           │
 │ - DNA cut type: Cas9 = blunt; Cas12a = staggered (5′ overhang)               │
 │ - PAM: Cas9 = NGG (3′); Cas12a = TTTV (5′)                                   │
 │ - Guide RNA: Cas9 = sgRNA (crRNA+tracrRNA); Cas12a = crRNA only              │
 │ - Multiplexing: Cas9 = limited; Cas12a = inherent crRNA array processing     │
 │ - Specificity: Cas12a generally higher                                       │
 │ - AAV compatibility: Cas12a variants better suited                           │
 │ - Clinical use: Cas9 more established; Cas12a emerging                       │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ What's the Difference Between │ Cas9...cleaves both strands of │  0.95 │
 │     │ Cas9 and Cas12a Nucleases? |  │ DNA at the same point. This    │       │
 │     │ The Scientist                 │ creates a blunt end            │       │
 │     │ https://www.the-scientist.com │ double-stranded break (DSB)... │       │
 │     │ /what-s-the-difference-betwee │ For Cas9 to function, the      │       │
 │     │ n-cas9-and-cas12a-nucleases-7 │ protospacer adjacent motif     │       │
 │     │ 2481                          │ (PAM)—a two to six base pair   │       │
 │     │                               │ sequence—NGG...must sit        │       │
 │     │                               │ immediately downstream of the  │       │
 │     │                               │ target on the opposite strand. │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Cas9 versus Cas12a/Cpf1:      │ Cas9 and Cas12a have distinct  │  0.97 │
 │     │ Structure-function            │ evolutionary origins and       │       │
 │     │ comparisons and implications  │ exhibit different structural   │       │
 │     │ for genome editing - PubMed   │ architectures, resulting in    │       │
 │     │ https://pubmed.ncbi.nlm.nih.g │ distinct molecular             │       │
 │     │ ov/29790280/                  │ mechanisms... We discuss       │       │
 │     │                               │ implications for genome        │       │
 │     │                               │ editing, and how they may      │       │
 │     │                               │ influence the choice of Cas9   │       │
 │     │                               │ or Cas12a for specific         │       │
 │     │                               │ applications.                  │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ CRISPR-Cas12a More Precise    │ Cas12a...is, according to      │  0.90 │
 │     │ Than CRISPR-Cas9              │ scientists at the University   │       │
 │     │ https://www.genengnews.com/to │ of Texas at Austin             │       │
 │     │ pics/genome-editing/crispr-ca │ (UT-Austin), more effective    │       │
 │     │ s12a-more-precise-than-crispr │ and precise... Because Cas     │       │
 │     │ -cas9/                        │ enzymes occasionally fail to   │       │
 │     │                               │ cut DNA in the right places,   │       │
 │     │                               │ or even cut at all, they worry │       │
 │     │                               │ developers, who want to modify │       │
 │     │                               │ genomes with surgical          │       │
 │     │                               │ precision, especially in       │       │
 │     │                               │ therapeutic applications.      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Comparison of CRISPR/Cas9 and │ We found that Cas9 and Cas12a  │  0.92 │
 │     │ Cas12a for gene editing in    │ RNPs- co-delivered with ssODN  │       │
 │     │ Chlamydomonas reinhardtii -   │ repair templates- induced      │       │
 │     │ ScienceDirect                 │ similar levels of total        │       │
 │     │ https://www.sciencedirect.com │ editing, achieving as much as  │       │
 │     │ /science/article/pii/S2211926 │ 20–30 % in all                 │       │
 │     │ 424004089                     │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Comparison of                 │ Comparison of                  │  0.88 │
 │     │ CRISPR-Cas9/Cas12a            │ CRISPR-Cas9/Cas12a             │       │
 │     │ Ribonucleoprotein Complexes   │ Ribonucleoprotein Complexes    │       │
 │     │ for Genome Editing Efficiency │ for Genome Editing Efficiency  │       │
 │     │ in the Rice Phytoene          │ in the Rice Phytoene           │       │
 │     │ Desaturase (OsPDS) Gene - PMC │ Desaturase (OsPDS) Gene        │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │                                │       │
 │     │ articles/PMC6973557/          │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Current and Prospective       │ Current and Prospective        │  0.87 │
 │     │ Applications of CRISPR-Cas12a │ Applications of CRISPR-Cas12a  │       │
 │     │ in Pluricellular Organisms -  │ in Pluricellular Organisms...  │       │
 │     │ PMC                           │ Mol Biotechnol. 2022 Aug       │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ 8;65(2):196–205. doi:          │       │
 │     │ articles/PMC9841005/          │ 10.1007/s12033-022-00538-5     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ When size matters: A novel    │ When size matters: A novel     │  0.90 │
 │     │ compact Cas12a variant for in │ compact Cas12a variant for in  │       │
 │     │ vivo genome editing - PMC     │ vivo genome editing            │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │                                │       │
 │     │ articles/PMC11253977/         │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ When size matters: A novel    │ Altogether, the components of  │  0.91 │
 │     │ compact Cas12a variant for in │ the EbCas12a system are well   │       │
 │     │ vivo genome editing -         │ below the 4.8-kb packaging     │       │
 │     │ ResearchGate                  │ limit of AAVs, enabling        │       │
 │     │ https://www.researchgate.net/ │ successful packaging in the    │       │
 │     │ publication/382328745_When_si │ AAV9                           │       │
 │     │ ze_matters_A_novel_compact_Ca │                                │       │
 │     │ s12a_variant_for_in_vivo_geno │                                │       │
 │     │ me_editing                    │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Therapeutic In Vivo Gene      │ our current results prove that │  0.88 │
 │     │ Editing Achieved by a         │ the miniature Cas12f1 system   │       │
 │     │ Hypercompact CRISPR System -  │ is a promising gene editing    │       │
 │     │ Advanced Science              │ tool for retinal gene therapy  │       │
 │     │ https://advanced.onlinelibrar │                                │       │
 │     │ y.wiley.com/doi/10.1002/advs. │                                │       │
 │     │ 202308095                     │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ Delivery of CRISPR-Cas tools  │ AAV is one of the most         │  0.90 │
 │     │ for in vivo genome editing    │ commonly used vector systems   │       │
 │     │ therapy: Trends and           │ to date, but immunogenicity    │       │
 │     │ challenges - ScienceDirect    │ against capsid, liver toxicity │       │
 │     │ https://www.sciencedirect.com │ at high dose, and potential    │       │
 │     │ /science/article/pii/S0168365 │ genotoxicity caused by         │       │
 │     │ 92200027X                     │ off-target mutagenesis and     │       │
 │     │                               │ genomic integration remain     │       │
 │     │                               │ unsolved.                      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 11  │ CRISPR-Based Therapeutic      │ These Cas proteins are more    │  0.87 │
 │     │ Genome Editing - DSpace@MIT   │ compatible with AAV delivery,  │       │
 │     │ https://dspace.mit.edu/bitstr │ enabling additional vector     │       │
 │     │ eam/handle/1721.1/138388.2/ni │ design options such as         │       │
 │     │ hms-1576523.pdf?sequence=4&is │ expanded promoter choices and  │       │
 │     │ Allowed=y                     │ a streamlined delivery.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 12  │ Revolutionizing in vivo       │ Genome editing using the       │  0.85 │
 │     │ therapy with CRISPR/Cas       │ CRISPR/Cas system has          │       │
 │     │ genome editing:               │ revolutionized the field of    │       │
 │     │ breakthroughs, opportunities  │ genetic engineering, offering  │       │
 │     │ and challenges - Frontiers    │ unprecedented opportunities    │       │
 │     │ https://www.frontiersin.org/j │ for therapeutic applications   │       │
 │     │ ournals/genome-editing/articl │ in vivo.                       │       │
 │     │ es/10.3389/fgeed.2024.1342193 │                                │       │
 │     │ /full                         │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 13  │ CRISPR Clinical Trials: A     │ CRISPR Clinical Trials: A 2024 │  0.80 │
 │     │ 2024 Update - Innovative      │ Update - Innovative Genomics   │       │
 │     │ Genomics Institute            │ Institute (IGI)                │       │
 │     │ https://innovativegenomics.or │                                │       │
 │     │ g/news/crispr-clinical-trials │                                │       │
 │     │ -2024/                        │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 14  │ Alt-R CRISPR-Cas9 vs Cas12a   │ The two most popular enzymes   │  0.83 │
 │     │ systems | IDT                 │ used in CRISPR genome editing  │       │
 │     │ https://www.idtdna.com/pages/ │ are Cas9 and Cas12a (Cpf1).    │       │
 │     │ technology/crispr/crispr-geno │ These enzymes are highly       │       │
 │     │ me-editing/Alt-R-systems      │ functional, do not require     │       │
 │     │                               │ binding to other enzymes as is │       │
 │     │                               │ the case for type I CRISPR     │       │
 │     │                               │ systems, and can be readily    │       │
 │     │                               │ programmed to target the       │       │
 │     │                               │ desired genomic DNA site.      │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic                       ┃ Detail                      ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found │ Head-to-head in vivo        │ Most comparative studies    │
 │                  │ efficacy data in mammals    │ focused on plants (rice) or │
 │                  │ across multiple tissue      │ algae (Chlamydomonas) or    │
 │                  │ types                       │ used in vitro/ex vivo       │
 │                  │                             │ models. Rigorous            │
 │                  │                             │ side-by-side in vivo        │
 │                  │                             │ mammalian comparisons of    │
 │                  │                             │ Cas9 vs. Cas12a across      │
 │                  │                             │ liver, muscle, CNS, and eye │
 │                  │                             │ were not identified in      │
 │                  │                             │ available sources.          │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Immunogenicity comparison   │ While immunogenicity of     │
 │                  │ between Cas9 and Cas12a in  │ Cas9 is well-documented as  │
 │                  │ vivo                        │ a challenge for in vivo     │
 │                  │                             │ delivery, direct            │
 │                  │                             │ comparative immunogenicity  │
 │                  │                             │ data for Cas12a in humans   │
 │                  │                             │ or animal models was not    │
 │                  │                             │ available in the gathered   │
 │                  │                             │ sources.                    │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Cas12a clinical trial data  │ The IGI clinical trials     │
 │                  │                             │ update and other sources    │
 │                  │                             │ confirm Cas9 dominance in   │
 │                  │                             │ trials but do not provide   │
 │                  │                             │ details on approved or      │
 │                  │                             │ ongoing Cas12a-specific     │
 │                  │                             │ clinical trials.            │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Detailed off-target         │ While Cas12a is reported to │
 │                  │ profiling comparison in     │ be more specific than Cas9  │
 │                  │ vivo                        │ based on kinetic studies,   │
 │                  │                             │ comprehensive in vivo       │
 │                  │                             │ off-target profiling        │
 │                  │                             │ comparing both enzymes      │
 │                  │                             │ systematically across the   │
 │                  │                             │ same targets was not        │
 │                  │                             │ available in the sources.   │
 └──────────────────┴─────────────────────────────┴─────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ arxiv             │ Cas12a vs Cas9 in │ Head-to-head in   │
 │                  │                   │ vivo editing      │ vivo mammalian    │
 │                  │                   │ efficiency        │ comparisons are a │
 │                  │                   │ off-target        │ critical gap;     │
 │                  │                   │ mammalian         │ preprint servers  │
 │                  │                   │ therapeutic       │ may have more     │
 │                  │                   │ comparison 2023   │ recent            │
 │                  │                   │ 2024              │ unpublished data  │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ CRISPR Cas12a     │ Clinical adoption │
 │                  │                   │ clinical trials   │ of Cas12a in vivo │
 │                  │                   │ ClinicalTrials.go │ is poorly         │
 │                  │                   │ v 2023 2024       │ characterized; a  │
 │                  │                   │                   │ ClinicalTrials.go │
 │                  │                   │                   │ v database search │
 │                  │                   │                   │ would clarify     │
 │                  │                   │                   │ current status    │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ Cas12a            │ Immunogenicity is │
 │                  │                   │ immunogenicity    │ a key barrier for │
 │                  │                   │ pre-existing      │ in vivo Cas9      │
 │                  │                   │ immunity in vivo  │ delivery; whether │
 │                  │                   │ gene therapy      │ Cas12a poses      │
 │                  │                   │ human             │ fewer immune      │
 │                  │                   │                   │ challenges is     │
 │                  │                   │                   │ clinically        │
 │                  │                   │                   │ important but not │
 │                  │                   │                   │ covered in        │
 │                  │                   │                   │ sources           │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ new_source       │ database          │ compact Cas12a    │ Compact Cas12a    │
 │                  │                   │ EbCas12a AsCas12a │ variants show     │
 │                  │                   │ in vivo liver     │ promise for AAV   │
 │                  │                   │ lung CNS          │ delivery; recent  │
 │                  │                   │ therapeutic       │ therapeutic in    │
 │                  │                   │ editing 2024      │ vivo data would   │
 │                  │                   │                   │ strengthen the    │
 │                  │                   │                   │ comparison        │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Does Cas12a's staggered cutting │ Sources note that staggered     │
 │          │ pattern result in meaningfully  │ cuts may enhance HDR, but       │
 │          │ higher HDR rates than Cas9's    │ comparative in vivo HDR         │
 │          │ blunt cuts in vivo in           │ efficiency data in mammals was  │
 │          │ therapeutically relevant cell   │ not found in the gathered       │
 │          │ types?                          │ evidence.                       │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Are there pre-existing          │ Immunogenicity is a known       │
 │          │ antibodies or T-cell responses  │ challenge for Cas9 in vivo;     │
 │          │ against Cas12a proteins in      │ whether Cas12a, being from      │
 │          │ humans that would limit its     │ different bacterial origins,    │
 │          │ therapeutic use, as has been    │ faces similar or lesser immune  │
 │          │ documented for SpCas9?          │ barriers in human patients is   │
 │          │                                 │ clinically critical.            │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Can compact Cas12a variants     │ Compact variants fit within AAV │
 │          │ (e.g., EbCas12a, Cas12f)        │ packaging limits better than    │
 │          │ consistently match or exceed    │ Cas9, but their in vivo editing │
 │          │ SpCas9 editing efficiency in    │ efficiency relative to SpCas9   │
 │          │ vivo across diverse tissue      │ across tissues such as liver,   │
 │          │ types?                          │ muscle, and CNS needs           │
 │          │                                 │ systematic evaluation.          │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How does Cas12a's inherent      │ Cas12a can process its own      │
 │          │ crRNA array processing and      │ pre-crRNA array, enabling       │
 │          │ multiplexing capability         │ multiplexed targeting from a    │
 │          │ translate to in vivo            │ single transcript, which is     │
 │          │ combinatorial therapeutic       │ noted as an advantage but its   │
 │          │ strategies compared to          │ in vivo therapeutic             │
 │          │ Cas9-based multiplex            │ exploitation is not             │
 │          │ approaches?                     │ well-characterized in available │
 │          │                                 │ sources.                        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the current status of   │ The 2024 CRISPR clinical trials │
 │          │ Cas12a-specific clinical trials │ update from IGI and Frontiers   │
 │          │ for in vivo gene therapy, and   │ review both highlight Cas9      │
 │          │ how do their safety profiles    │ dominance in clinical trials,   │
 │          │ compare to Cas9-based trials?   │ but Cas12a clinical translation │
 │          │                                 │ remains poorly documented.      │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.82                                                                │
 │ Corroborating sources: 14                                                    │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.85                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 54153                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 117.16s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 9e436db7-fcde-4d0f-a568-c468ae4d419c
--- a/docs/stress-tests/M3.3-runs/09-comparative.log
+++ b/docs/stress-tests/M3.3-runs/09-comparative.log
@ -1,378 +0,0 @@
 Researching: Compare React and Vue for large enterprise frontends in 2026.
 {"question": "Compare React and Vue for large enterprise frontends in 2026.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:59:24.701232Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:59:26.384813Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:59:26.398635Z"}
 {"question": "Compare React and Vue for large enterprise frontends in 2026.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:59:26.459271Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare React and Vue for large enterprise frontends in 2026.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:26.459554Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:26.459652Z"}
 {"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1149, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:37.583764Z"}
 {"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 11893, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:45.423050Z"}
 {"step": 23, "decision": "Token budget reached before iteration 4: 27147/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:52.158499Z"}
 {"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 29, "iterations_run": 3, "tokens_used": 27147, "event": "synthesis_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:52.158736Z"}
 {"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 87997, "event": "synthesis_complete", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:16.872069Z"}
 {"step": 46, "decision": "Research complete", "confidence": 0.81, "citation_count": 12, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 114.815, "event": "complete", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:16.883053Z"}
 {"confidence": 0.81, "citations": 12, "gaps": 4, "discovery_events": 4, "tokens_used": 56137, "iterations_run": 3, "wall_time_sec": 110.40975427627563, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:01:16.883613Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:01:16.886961Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:01:16.944624Z"}
 {"trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "confidence": 0.81, "citations": 12, "tokens_used": 56137, "wall_time_sec": 110.40975427627563, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:01:17.535111Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ For large enterprise frontends in 2026, React and Vue each offer distinct    │
 │ advantages, and the best choice depends on organizational priorities.        │
 │                                                                              │
 │ **Market Position & Adoption:** React dominates with ~42% market share among │
 │ professional developers (2025 State of JavaScript survey) and ~68% among     │
 │ enterprise applications globally, while Vue holds ~28% developer share and   │
 │ ~18% enterprise share. React powers Facebook, Netflix, Airbnb, and Uber; Vue │
 │ drives Alibaba, GitLab, and Nintendo. Some 80% of enterprise teams use React │
 │ directly or via Next.js. [Sources 1, 4, 25]                                  │
 │                                                                              │
 │ **Performance:** Both frameworks use a virtual DOM. Vue 4 showed 15% faster  │
 │ initial render times than React 19 in large-scale applications with          │
 │ thousands of components (JavaScript Performance Consortium 2025 benchmarks). │
 │ However, React 19's concurrent rendering features provide superior           │
 │ responsiveness during complex user interactions. In micro-benchmarks, Vue    │
 │ 3.4 creates 1,000 rows in 38ms vs React 19's 42ms, and Vue's bundle size is  │
 │ smaller (33KB vs 44KB min+gzip). The performance gap continues to narrow.    │
 │ [Sources 1, 25]                                                              │
 │                                                                              │
 │ **React 19 Architecture Shifts:** React 19 introduces a built-in compiler    │
 │ that automates memoization (making useMemo/useCallback largely redundant),   │
 │ native Server Components for zero-bundle-size dependencies and direct        │
 │ database access, a new Actions API for simplified async form handling, and   │
 │ the `use` hook for streamlined data fetching. These changes significantly    │
 │ reduce boilerplate and technical debt for enterprise teams. [Sources 18, 19, │
 │ 20]                                                                          │
 │                                                                              │
 │ **Vue's Enterprise Momentum:** Vue 3's Composition API enables better logic  │
 │ reuse across large codebases. Pinia (the official state manager) is          │
 │ TypeScript-first and lightweight. Nuxt 3 handles SSR. Vue's natural          │
 │ TypeScript support and Vite-powered tooling make it increasingly attractive  │
 │ for enterprise adoption. Fortune 500 companies, SaaS platforms, and          │
 │ government tech teams are growing adopters. [Sources 12, 15]                 │
 │                                                                              │
 │ **Learning Curve & Developer Experience:** Vue uses standard HTML/CSS/JS     │
 │ with Single File Components, making it easier to onboard developers with     │
 │ traditional web backgrounds. React uses JSX (combining HTML and JavaScript), │
 │ which has a steeper initial curve but becomes natural quickly. Vue's         │
 │ official routing and state solutions (Vue Router, Pinia) reduce              │
 │ architectural decision-making overhead. React requires selecting from a      │
 │ broader ecosystem (Redux/Zustand, React Router, etc.), offering more         │
 │ flexibility but more upfront choices. [Sources 1, 13, 14]                    │
 │                                                                              │
 │ **Ecosystem & Hiring:** React has ~44M+ weekly npm downloads vs Vue's        │
 │ ~4.5M+. React has 225K+ GitHub stars vs Vue's 207K+. Fortune 500 adoption is │
 │ 47% React vs 12% Vue. React job postings vastly outnumber Vue's (e.g.,       │
 │ 3,200+ vs 680+/month in one market). React's ecosystem is larger and more    │
 │ mature, making hiring and long-term support easier for large enterprises.    │
 │ [Sources 14, 25]                                                             │
 │                                                                              │
 │ **Enterprise Scalability:** React's flexibility and massive ecosystem make   │
 │ it the safer long-term choice for large, complex, cross-team enterprise      │
 │ platforms. Angular is also a strong contender here with built-in DI, forms,  │
 │ and strict structure. Vue excels when time-to-market and developer           │
 │ productivity are top priorities and when teams want a progressive,           │
 │ opinionated setup with lower onboarding cost. [Sources 13, 14, 23]           │
 │                                                                              │
 │ **Recommendation:** For most large enterprise frontends in 2026, React       │
 │ (often via Next.js) remains the dominant and lowest-risk choice due to its   │
 │ ecosystem size, hiring market, enterprise adoption, and React 19's           │
 │ architectural improvements. Vue is a compelling choice for enterprises       │
 │ prioritizing developer velocity, lower onboarding costs, and smaller bundle  │
 │ sizes, particularly in Asia-Pacific markets or mid-size SaaS platforms.      │
 │ Neither choice is technically wrong—both are production-proven at scale.     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ React vs Vue: Frontend        │ According to the 2025 State of │  0.88 │
 │     │ Frameworks Compared in 2025   │ JavaScript survey, React       │       │
 │     │ https://automation-ops.com/bl │ continues to dominate with a   │       │
 │     │ og/react-vs-vue-frontend-fram │ 42% market share among         │       │
 │     │ eworks-compared               │ professional developers, while │       │
 │     │                               │ Vue has grown to capture 28%   │       │
 │     │                               │ of the market. Vue 4 showed a  │       │
 │     │                               │ 15% faster initial render time │       │
 │     │                               │ compared to React 19 in        │       │
 │     │                               │ large-scale applications with  │       │
 │     │                               │ thousands of components.       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Angular vs. React vs. Vue.js: │ The focus in 2025 has shifted  │  0.82 │
 │     │ A performance guide for 2026  │ away from basic component      │       │
 │     │ - LogRocket Blog              │ logic toward reactivity        │       │
 │     │ https://blog.logrocket.com/an │ models, hydration strategies,  │       │
 │     │ gular-vs-react-vs-vue-js-perf │ and compiler-driven            │       │
 │     │ ormance/                      │ performance optimizations.     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ React vs Next.js vs Vue:      │ React remains the foundation   │  0.80 │
 │     │ Which Frontend Framework Wins │ for modern frontend            │       │
 │     │ in 2026? - DEV Community      │ development with 80% of        │       │
 │     │ https://dev.to/ciphernutz/rea │ enterprise teams still using   │       │
 │     │ ct-vs-nextjs-vs-vue-which-fro │ it directly or via Next.js.    │       │
 │     │ ntend-framework-wins-in-2025- │                                │       │
 │     │ 26gj                          │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ The 2025 Tech Stack Dilemma:  │ According to the 2025 State of │  0.78 │
 │     │ React vs Vue vs Angular for   │ JavaScript survey, developers  │       │
 │     │ Enterprise Applications       │ using frameworks report 35-50% │       │
 │     │ https://www.codertrove.com/ar │ faster development cycles      │       │
 │     │ ticles/2025-tech-stack-dilemm │ compared to vanilla            │       │
 │     │ a-react-vs-vue-vs-angular-for │ JavaScript. The 2024 State of  │       │
 │     │ -enterprise-application       │ JavaScript survey reveals that │       │
 │     │                               │ 78% of developers cite 'faster │       │
 │     │                               │ development' as their primary  │       │
 │     │                               │ reason for adoption.           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Web Development with React vs │ React maintains its dominant   │  0.85 │
 │     │ Vue.js: 2025 Comparison |     │ position with approximately    │       │
 │     │ iTechDev Blog                 │ 68% market share among         │       │
 │     │ https://www.itechdev.com.mx/b │ enterprise applications        │       │
 │     │ log/react-vs-vue-comparison-2 │ globally. Vue 3.4 creates      │       │
 │     │ 025                           │ 1,000 rows in 38ms vs React    │       │
 │     │                               │ 19's 42ms. Bundle size         │       │
 │     │                               │ (min+gzip): React 44KB, Vue    │       │
 │     │                               │ 33KB. Fortune 500 adoption:    │       │
 │     │                               │ React 47%, Vue 12%.            │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ React 19 Features & Updates   │ React 19 emerges as a landmark │  0.87 │
 │     │ (2025): What's New & Why It   │ release that brings            │       │
 │     │ Matters - WEQ                 │ significant enhancements to    │       │
 │     │ https://weqtechnologies.com/r │ performance, developer         │       │
 │     │ eact-19-features-updates-2025 │ experience, and scalability.   │       │
 │     │ -whats-new-why-it-matters/    │ This update builds on the      │       │
 │     │                               │ foundations laid by React 18,  │       │
 │     │                               │ introducing powerful new       │       │
 │     │                               │ features like the React        │       │
 │     │                               │ Compiler, Actions API, and     │       │
 │     │                               │ enhanced support for React     │       │
 │     │                               │ Server Components.             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ React 19: Architecture        │ The React Compiler             │  0.83 │
 │     │ Shifts, Performance           │ automatically handles          │       │
 │     │ Optimization, and the Future  │ memoization, rendering hooks   │       │
 │     │ of Enterprise Web Development │ like useMemo and useCallback   │       │
 │     │ https://pblinuxtech.com/react │ largely redundant for          │       │
 │     │ -19-architecture-shifts-perfo │ performance optimization.      │       │
 │     │ rmance-optimization-and-the-f │ Native support for Server      │       │
 │     │ uture-of-enterprise-web-devel │ Components allows for          │       │
 │     │ opment/                       │ zero-bundle-size dependencies  │       │
 │     │                               │ and direct database access,    │       │
 │     │                               │ optimizing the use of          │       │
 │     │                               │ Linux-based edge runtimes.     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Vue.js in the Enterprise: Why │ By 2026, more                  │  0.79 │
 │     │ More Companies Are Choosing   │ organizations—startups,        │       │
 │     │ Vue in 2026 – Manifest        │ Fortune 500 companies, large   │       │
 │     │ https://manifestinfotech.com/ │ SaaS platforms, and government │       │
 │     │ vue-js-in-the-enterprise-why- │ tech teams—are adopting Vue    │       │
 │     │ more-companies-are-choosing-v │ for mission-critical           │       │
 │     │ ue-in-2026/                   │ applications. Pinia, now the   │       │
 │     │                               │ official store for Vue,        │       │
 │     │                               │ delivers TypeScript-first      │       │
 │     │                               │ architecture, lightweight      │       │
 │     │                               │ design, better devtools        │       │
 │     │                               │ integration, faster global     │       │
 │     │                               │ state handling.                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ The State of Vue.js Report    │ This report, created in        │  0.84 │
 │     │ 2025                          │ collaboration with Evan You    │       │
 │     │ https://stateofvue.framer.web │ and the Vue and Nuxt Core      │       │
 │     │ site/                         │ Teams, offers unique insights  │       │
 │     │                               │ across 150 virtual pages.      │       │
 │     │                               │ We've included 16 real-world   │       │
 │     │                               │ case studies from leading      │       │
 │     │                               │ brands, including GitLab, Hack │       │
 │     │                               │ The Box, Storyblok, Booksy,    │       │
 │     │                               │ and DocPlanner.                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ React vs Angular vs Vue:      │ React, maintained by Meta, is  │  0.84 │
 │     │ Choosing the Best for         │ a declarative, component-based │       │
 │     │ Enterprise in 2025            │ library for building user      │       │
 │     │ https://softwarelogic.co/en/b │ interfaces. Its virtual DOM    │       │
 │     │ log/which-javascript-framewor │ and one-way data flow provide  │       │
 │     │ k-is-best-for-enterprise-reac │ outstanding performance and    │       │
 │     │ t-angular-or-vue              │ flexibility. Vue is loved for  │       │
 │     │                               │ its gentle learning curve and  │       │
 │     │                               │ progressive adoption. Angular  │       │
 │     │                               │ is designed for large, complex │       │
 │     │                               │ enterprise applications where  │       │
 │     │                               │ structure and scalability are  │       │
 │     │                               │ paramount.                     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 11  │ React vs Vue: which one       │ React is built for scale. Its  │  0.86 │
 │     │ should you choose in 2025? |  │ flexibility, huge ecosystem,   │       │
 │     │ DECODE                        │ and massive job market make it │       │
 │     │ https://decode.agency/article │ the safest choice for          │       │
 │     │ /react-vs-vue/                │ enterprise-grade apps. Vue is  │       │
 │     │                               │ built for speed. With a gentle │       │
 │     │                               │ learning curve and official    │       │
 │     │                               │ tools baked in, teams can move │       │
 │     │                               │ faster and deliver MVPs or     │       │
 │     │                               │ mid-size apps quickly.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 12  │ What is React.js in 2025 and  │ In React 19, that same Reactjs │  0.82 │
 │     │ why React 19 changed          │ library comes with first-class │       │
 │     │ front-end again | Merge       │ async workflows, server        │       │
 │     │ https://merge.rocks/blog/what │ components, and metadata       │       │
 │     │ -is-react-js-in-2025-and-why- │ management, so teams spend     │       │
 │     │ react-19-changed-front-end-ag │ less time gluing libraries     │       │
 │     │ ain                           │ together and more time on      │       │
 │     │                               │ product work. The React team   │       │
 │     │                               │ also ships React Compiler,     │       │
 │     │                               │ currently in beta, which       │       │
 │     │                               │ automatically optimizes many   │       │
 │     │                               │ components that used to        │       │
 │     │                               │ require manual memoization.    │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Real-world 2026          │ No sources provided       │
 │                       │ enterprise migration     │ firsthand accounts of     │
 │                       │ case studies from React  │ enterprises switching     │
 │                       │ to Vue or vice versa     │ frameworks in 2026 with   │
 │                       │                          │ documented outcomes, only │
 │                       │                          │ general advocacy pieces.  │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ scope_exceeded        │ Angular vs React vs Vue  │ The question focused on   │
 │                       │ head-to-head in 2026     │ React vs Vue, but Angular │
 │                       │ enterprise contexts      │ is a significant          │
 │                       │                          │ competitor in large       │
 │                       │                          │ enterprise contexts. Full │
 │                       │                          │ three-way comparison with │
 │                       │                          │ 2026 data was not         │
 │                       │                          │ available.                │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Vue 4 specific features  │ One source                │
 │                       │ and release status       │ (automation-ops.com)      │
 │                       │                          │ mentions 'Vue 4' with     │
 │                       │                          │ 'enhanced composition API │
 │                       │                          │ features', but most other │
 │                       │                          │ sources discuss Vue 3.x   │
 │                       │                          │ as the current version.   │
 │                       │                          │ Vue 4 release status is   │
 │                       │                          │ unclear.                  │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ source_not_found      │ Verified 2026 salary and │ Salary data found was     │
 │                       │ hiring market data       │ market-specific (Mexico)  │
 │                       │                          │ and from 2025; global     │
 │                       │                          │ 2026 enterprise hiring    │
 │                       │                          │ cost comparison between   │
 │                       │                          │ React and Vue developers  │
 │                       │                          │ was not available.        │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ database          │ Vue 4 release     │ One source        │
 │                  │                   │ date features     │ references Vue 4  │
 │                  │                   │ official          │ with enhanced     │
 │                  │                   │ announcement 2025 │ composition API,  │
 │                  │                   │ 2026              │ but most sources  │
 │                  │                   │                   │ still discuss Vue │
 │                  │                   │                   │ 3.x; clarifying   │
 │                  │                   │                   │ whether Vue 4 has │
 │                  │                   │                   │ been released is  │
 │                  │                   │                   │ important for     │
 │                  │                   │                   │ accurate          │
 │                  │                   │                   │ comparison.       │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ React Server      │ SSR tooling       │
 │                  │                   │ Components vs     │ (Next.js vs Nuxt) │
 │                  │                   │ Nuxt SSR          │ is a key          │
 │                  │                   │ enterprise        │ enterprise        │
 │                  │                   │ performance       │ decision factor   │
 │                  │                   │ comparison 2025   │ mentioned across  │
 │                  │                   │ 2026              │ sources but not   │
 │                  │                   │                   │ deeply            │
 │                  │                   │                   │ benchmarked       │
 │                  │                   │                   │ head-to-head.     │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ State of          │ Multiple sources  │
 │                  │                   │ JavaScript 2025   │ cite the 2025     │
 │                  │                   │ full survey       │ State of          │
 │                  │                   │ results React Vue │ JavaScript survey │
 │                  │                   │ Angular market    │ but only with     │
 │                  │                   │ share             │ partial data; the │
 │                  │                   │                   │ full report would │
 │                  │                   │                   │ provide           │
 │                  │                   │                   │ authoritative     │
 │                  │                   │                   │ market share      │
 │                  │                   │                   │ figures.          │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ contradiction    │ null              │ Vue 4 vs Vue 3    │ Automation-ops    │
 │                  │                   │ current version   │ references 'Vue   │
 │                  │                   │ enterprise 2025   │ 4' with benchmark │
 │                  │                   │ 2026              │ data but other    │
 │                  │                   │                   │ sources           │
 │                  │                   │                   │ consistently      │
 │                  │                   │                   │ reference Vue 3.4 │
 │                  │                   │                   │ as current. This  │
 │                  │                   │                   │ is a factual      │
 │                  │                   │                   │ discrepancy that  │
 │                  │                   │                   │ could affect      │
 │                  │                   │                   │ benchmark         │
 │                  │                   │                   │ interpretation.   │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Has Vue 4 officially been       │ One source claims Vue 4 shows   │
 │          │ released, and what are its      │ 15% faster initial render times │
 │          │ actual performance              │ than React 19, but most sources │
 │          │ characteristics vs React 19 in  │ still discuss Vue 3.4 as        │
 │          │ enterprise applications?        │ current. This discrepancy       │
 │          │                                 │ affects benchmark reliability.  │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ How does React's new React      │ React Compiler automates        │
 │          │ Compiler (in beta) affect the   │ memoization and is described as │
 │          │ performance gap between React   │ a game-changer, but its         │
 │          │ and Vue in production           │ real-world impact on large      │
 │          │ enterprise applications?        │ enterprise codebases has not    │
 │          │                                 │ yet been fully benchmarked      │
 │          │                                 │ against Vue's                   │
 │          │                                 │ compiler-optimized reactivity.  │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ For enterprises currently on    │ The State of Vue.js Report 2025 │
 │          │ Vue 2 or Vue 3, what is the     │ includes a chapter on Vue 3     │
 │          │ actual cost and risk profile of │ Migration, suggesting migration │
 │          │ upgrading to future Vue         │ is still a concern for many     │
 │          │ versions vs migrating to React? │ enterprise teams.               │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How does the developer hiring   │ Sources note strong Vue         │
 │          │ market for Vue vs React differ  │ adoption in Asia-Pacific and    │
 │          │ across regions (Asia-Pacific vs │ Latin America but React         │
 │          │ North America vs Europe) for    │ dominance globally. Regional    │
 │          │ enterprise teams planning 2026  │ hiring market differences could │
 │          │ staffing?                       │ significantly impact enterprise │
 │          │                                 │ framework choices.              │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ What is the total cost of       │ Sources discuss development     │
 │          │ ownership difference between    │ cost at project level but do    │
 │          │ React+Next.js and Vue+Nuxt for  │ not model long-term TCO         │
 │          │ a 50+ person enterprise         │ including training,             │
 │          │ frontend team over a 3-year     │ maintenance, tooling, and       │
 │          │ horizon?                        │ hiring costs for large teams.   │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.81                                                                │
 │ Corroborating sources: 12                                                    │
 │ Source authority: medium                                                     │
 │ Contradiction detected: True                                                 │
 │ Query specificity match: 0.85                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 56137                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 110.41s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 7c8dd19b-174b-4850-a2f5-28917d37c0c0
--- a/docs/stress-tests/M3.3-runs/10-comparative.log
+++ b/docs/stress-tests/M3.3-runs/10-comparative.log
@ -1,310 +0,0 @@
 Researching: Compare wind and solar capacity factors in the continental United 
 States.
 {"question": "Compare wind and solar capacity factors in the continental United States.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:01:18.663955Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:01:19.783461Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:01:19.795497Z"}
 {"question": "Compare wind and solar capacity factors in the continental United States.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:01:19.838791Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare wind and solar capacity factors in the continental United States.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:19.839685Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:19.839976Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1104, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:29.064991Z"}
 {"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 8211, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:38.391464Z"}
 {"step": 19, "decision": "Token budget reached before iteration 4: 23963/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:45.620609Z"}
 {"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23963, "event": "synthesis_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:45.620851Z"}
 {"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 72249, "event": "synthesis_complete", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:55.647112Z"}
 {"step": 40, "decision": "Research complete", "confidence": 0.88, "citation_count": 10, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 99.134, "event": "complete", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:55.648194Z"}
 {"confidence": 0.88, "citations": 10, "gaps": 4, "discovery_events": 4, "tokens_used": 48230, "iterations_run": 3, "wall_time_sec": 95.80813455581665, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:02:55.648284Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:02:55.648701Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:02:55.654584Z"}
 {"trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "confidence": 0.88, "citations": 10, "tokens_used": 48230, "wall_time_sec": 95.80813455581665, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:02:55.883067Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Wind and solar capacity factors in the continental United States differ      │
 │ notably, with wind generally outperforming utility-scale solar on an annual  │
 │ average basis, though both vary significantly by location and season.        │
 │                                                                              │
 │ **Wind Capacity Factors:** In 2023, the U.S. wind turbine fleet had an       │
 │ average capacity factor of 33.5%, which was an eight-year low driven by      │
 │ weaker-than-normal wind speeds (down from the 2022 all-time high of 35.9%).  │
 │ Wind capacity factors are highest in spring (March–April) and lowest in      │
 │ summer. In April 2024, wind generation hit a record 47.7 TWh, exceeding coal │
 │ generation for the second consecutive month. The NREL wind resource          │
 │ assessment identifies areas with capacity factors ≥30% (generally mean       │
 │ annual wind speeds ≥6.4 m/s) as suitable for development, with the           │
 │ highest-potential zones in the central Great Plains. The U.S. total          │
 │ installed wind capacity reached ~150,500 MW by end of 2023.                  │
 │                                                                              │
 │ **Solar (Utility-Scale PV) Capacity Factors:** The weighted average U.S.     │
 │ utility-scale solar capacity factor was 23.5% in 2023, down 0.7 percentage   │
 │ points from 24.2% in 2022. NREL's Annual Technology Baseline categorizes     │
 │ utility-scale PV capacity factors into 10 resource classes based on mean     │
 │ global horizontal irradiance (GHI); the desert Southwest achieves the        │
 │ highest factors, while northern states achieve at least ~70% of the          │
 │ Southwest's value. Solar generation is highest in summer and lowest in       │
 │ winter, opposite to wind seasonality.                                        │
 │                                                                              │
 │ **Comparison Summary:** On an annual fleet-wide average, wind capacity       │
 │ factors (~33–36%) are materially higher than utility-scale solar capacity    │
 │ factors (~23–24%). However, the two resources are complementary seasonally:  │
 │ wind peaks in spring, solar peaks in summer. Both are intermittent           │
 │ resources. In 2025, wind and solar together generated a record 17% of U.S.   │
 │ electricity (wind: 464,000 GWh; utility-scale solar: 296,000 GWh),           │
 │ reflecting wind's larger current installed base despite solar's faster       │
 │ recent capacity growth.                                                      │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Wind generation declined in   │ Last year, the average         │  0.98 │
 │     │ 2023 for the first time since │ utilization rate, or capacity  │       │
 │     │ the 1990s - EIA               │ factor, of the wind turbine    │       │
 │     │ https://www.eia.gov/todayinen │ fleet fell to an eight-year    │       │
 │     │ ergy/detail.php?id=61943      │ low of 33.5% (compared with    │       │
 │     │                               │ 35.9% in 2022, the all-time    │       │
 │     │                               │ high).                         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ US solar capacity factors     │ The weighted average US solar  │  0.95 │
 │     │ retreat in 2023, break        │ capacity factor came in at a   │       │
 │     │ multiyear streak above 24%    │ calculated 23.5% annually in   │       │
 │     │ https://www.spglobal.com/mark │ 2023, down 0.7 percentage      │       │
 │     │ et-intelligence/en/news-insig │ point from 24.2% in 2022.      │       │
 │     │ hts/research/us-solar-capacit │                                │       │
 │     │ y-factors-retreat-in-2023-bre │                                │       │
 │     │ ak-multiyear-streak-above-24p │                                │       │
 │     │ erc                           │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ U.S. wind generation hit      │ Wind generation, meanwhile,    │  0.97 │
 │     │ record in April 2024,         │ increased to a record 47.7     │       │
 │     │ exceeding coal-fired          │ TWh. However, during the first │       │
 │     │ generation - EIA              │ four months of 2024,           │       │
 │     │ https://www.eia.gov/todayinen │ coal-fired generation was 15%  │       │
 │     │ ergy/detail.php?id=62784      │ higher than wind generation in │       │
 │     │                               │ the United States. Installed   │       │
 │     │                               │ wind power generating capacity │       │
 │     │                               │ has increased substantially in │       │
 │     │                               │ the United States over the     │       │
 │     │                               │ last 25 years, growing from    │       │
 │     │                               │ 2.4 gigawatts (GW) in 2000 to  │       │
 │     │                               │ 150.1 GW in April 2024.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Land-Based Wind Market Report │ The U.S. wind industry         │  0.97 │
 │     │ 2024: Edition | Department of │ installed 6,474 megawatts (MW) │       │
 │     │ Energy                        │ of new land-based wind         │       │
 │     │ https://www.energy.gov/cmei/s │ capacity in 2023, bringing the │       │
 │     │ ystems/land-based-wind-market │ cumulative total to nearly     │       │
 │     │ -report-2024-edition          │ 150,500 MW. Also, $10.8        │       │
 │     │                               │ billion was invested in 2023   │       │
 │     │                               │ in land-based wind energy      │       │
 │     │                               │ expansion.                     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Utility-Scale PV |            │ The 2024 ATB provides the      │  0.93 │
 │     │ Electricity | 2024 | ATB |    │ average capacity factor for 10 │       │
 │     │ NREL                          │ resource categories in the     │       │
 │     │ https://atb.nrel.gov/electric │ United States, binned by mean  │       │
 │     │ ity/2024/utility-scale_pv     │ GHI. Average capacity factors  │       │
 │     │                               │ are calculated using           │       │
 │     │                               │ county-level capacity factor   │       │
 │     │                               │ averages from the Renewable    │       │
 │     │                               │ Energy Potential (reV) model   │       │
 │     │                               │ for 1998–2021.                 │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ NREL projects solar           │ In the latest update, zones    │  0.85 │
 │     │ generation and costs for 10   │ 2-8, representing all but the  │       │
 │     │ U.S. zones – pv magazine USA  │ northernmost states in the     │       │
 │     │ https://pv-magazine-usa.com/2 │ continental U.S., solar        │       │
 │     │ 021/07/22/nrel-projects-solar │ installations have a capacity  │       │
 │     │ -generation-and-costs-for-10- │ factor that is at least 70% of │       │
 │     │ u-s-zones/                    │ that in the desert Southwest's │       │
 │     │                               │ zone 1, the data show.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Wind and solar generated a    │ In 2025, wind power generated  │  0.96 │
 │     │ record 17% of U.S.            │ 464,000 GWh of electricity, 3% │       │
 │     │ electricity in 2025 - EIA     │ more than in 2024. In 2025,    │       │
 │     │ https://www.eia.gov/todayinen │ utility-scale solar power      │       │
 │     │ ergy/detail.php?id=67367      │ generation totaled 296,000     │       │
 │     │                               │ GWh, 34% more than in 2024.    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ 80 and 100 Meter Wind Energy  │ Windy land defined as areas    │  0.82 │
 │     │ Resource Potential for the    │ with >= 30% CF*, generally     │       │
 │     │ United States - NREL          │ mean annual wind speeds >= 6.4 │       │
 │     │ https://docs.nrel.gov/docs/fy │ m/s... U.S. wind potential     │       │
 │     │ 10osti/48036.pdf              │ from areas with CF*>=30% is    │       │
 │     │                               │ enormous, with almost 10,500   │       │
 │     │                               │ GW capacity at 80 m and 12,000 │       │
 │     │                               │ GW capacity at 100 m.          │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Wind power in the United      │ In 2025, 464.4 terawatt-hours  │  0.88 │
 │     │ States - Wikipedia            │ were generated by wind power,  │       │
 │     │ https://en.wikipedia.org/wiki │ or 10.48% of electricity in    │       │
 │     │ /Wind_power_in_the_United_Sta │ the United States. In March    │       │
 │     │ tes                           │ and April of 2024, electricity │       │
 │     │                               │ generation from wind exceeded  │       │
 │     │                               │ generation from coal, once the │       │
 │     │                               │ dominant source of U.S.        │       │
 │     │                               │ electricity, for an extended   │       │
 │     │                               │ period for the first time.     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ Utility-scale U.S. solar      │ In August 2024, a total of     │  0.94 │
 │     │ electricity generation        │ 107.4 gigawatts (GW) of solar  │       │
 │     │ continues to grow in 2024 -   │ electricity generating         │       │
 │     │ EIA                           │ capacity was operating in the  │       │
 │     │ https://www.eia.gov/todayinen │ Lower 48 states compared with  │       │
 │     │ ergy/detail.php?id=63324      │ 81.9 GW in August 2023... In   │       │
 │     │                               │ the final five months of 2024, │       │
 │     │                               │ we expect new U.S. solar       │       │
 │     │                               │ electricity generating         │       │
 │     │                               │ capacity will make up 63%, or  │       │
 │     │                               │ nearly two-thirds, of all new  │       │
 │     │                               │ electricity generating         │       │
 │     │                               │ capacity to come online.       │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic                       ┃ Detail                      ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ scope_exceeded   │ Offshore wind capacity      │ The evidence gathered       │
 │                  │ factors                     │ focuses on land-based wind. │
 │                  │                             │ Offshore wind typically has │
 │                  │                             │ higher capacity factors     │
 │                  │                             │ (40–50%+) than land-based   │
 │                  │                             │ wind but was not the        │
 │                  │                             │ primary focus of the        │
 │                  │                             │ sources retrieved.          │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Most recent 2024 annual     │ The 2023 annual wind        │
 │                  │ average wind capacity       │ capacity factor (33.5%) is  │
 │                  │ factor                      │ confirmed, but a final 2024 │
 │                  │                             │ annual figure was not found │
 │                  │                             │ in the sources; only        │
 │                  │                             │ monthly records for April   │
 │                  │                             │ 2024 were available.        │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Regional breakdown of wind  │ State- or region-level      │
 │                  │ vs. solar capacity factors  │ direct comparisons of wind  │
 │                  │ within the continental U.S. │ vs. solar capacity factors  │
 │                  │                             │ within the continental U.S. │
 │                  │                             │ were not available in the   │
 │                  │                             │ retrieved sources.          │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ scope_exceeded   │ Small-scale/rooftop solar   │ The 23.5% solar capacity    │
 │                  │ capacity factors            │ factor applies to           │
 │                  │                             │ utility-scale solar.        │
 │                  │                             │ Distributed/rooftop solar   │
 │                  │                             │ typically has lower         │
 │                  │                             │ capacity factors due to     │
 │                  │                             │ suboptimal orientation;     │
 │                  │                             │ this was not quantified in  │
 │                  │                             │ the retrieved evidence.     │
 └──────────────────┴─────────────────────────────┴─────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ database          │ U.S. offshore     │ Offshore wind has │
 │                  │                   │ wind capacity     │ substantially     │
 │                  │                   │ factors 2023 2024 │ higher capacity   │
 │                  │                   │ compared to       │ factors than      │
 │                  │                   │ land-based wind   │ land-based wind   │
 │                  │                   │ and solar         │ and solar, which  │
 │                  │                   │                   │ would complete    │
 │                  │                   │                   │ the renewable     │
 │                  │                   │                   │ capacity factor   │
 │                  │                   │                   │ comparison        │
 │                  │                   │                   │ picture.          │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ NREL ATB 2024     │ NREL ATB provides │
 │                  │                   │ utility-scale     │ wind capacity     │
 │                  │                   │ wind capacity     │ factors by        │
 │                  │                   │ factor by         │ resource class    │
 │                  │                   │ resource class    │ similar to solar, │
 │                  │                   │ continental US    │ enabling direct   │
 │                  │                   │                   │ apples-to-apples  │
 │                  │                   │                   │ regional          │
 │                  │                   │                   │ comparison with   │
 │                  │                   │                   │ solar CF data.    │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ seasonal wind vs  │ Wind peaks in     │
 │                  │                   │ solar capacity    │ spring, solar in  │
 │                  │                   │ factor            │ summer—understand │
 │                  │                   │ complementarity   │ ing this          │
 │                  │                   │ United States     │ complementarity   │
 │                  │                   │ grid balancing    │ is critical for   │
 │                  │                   │                   │ grid planning and │
 │                  │                   │                   │ storage           │
 │                  │                   │                   │ requirements.     │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ new_source       │ database          │ EIA Electric      │ The 2024          │
 │                  │                   │ Power Monthly     │ full-year wind    │
 │                  │                   │ 2024 annual wind  │ capacity factor   │
 │                  │                   │ capacity factor   │ would allow       │
 │                  │                   │ final             │ updated           │
 │                  │                   │                   │ comparison with   │
 │                  │                   │                   │ the 2023 solar    │
 │                  │                   │                   │ capacity factor   │
 │                  │                   │                   │ of 23.5%.         │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ How do wind and solar capacity  │ Texas led wind capacity         │
 │          │ factors compare on a regional   │ additions in 2023 (1,323 MW)    │
 │          │ basis within the continental    │ and is the second-largest       │
 │          │ U.S., particularly in states    │ utility-scale solar state (18.8 │
 │          │ like Texas and California that  │ GW). California leads solar.    │
 │          │ have significant installations  │ Regional comparisons would      │
 │          │ of both?                        │ clarify where each resource is  │
 │          │                                 │ most competitive.               │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ What is the projected           │ NREL's ATB provides             │
 │          │ trajectory of utility-scale     │ Advanced/Moderate/Conservative  │
 │          │ solar capacity factors as       │ scenarios for solar CF          │
 │          │ technology improves, and will   │ improvements through 2050, and  │
 │          │ solar eventually close the gap  │ solar capacity additions are    │
 │          │ with wind on a fleet-wide       │ now outpacing wind. The         │
 │          │ average basis?                  │ convergence timeline is         │
 │          │                                 │ unclear.                        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How did the 2023 wind           │ Wind generation fell 2.1% in    │
 │          │ generation decline (due to low  │ 2023 to an eight-year-low       │
 │          │ wind speeds) affect investment  │ capacity factor of 33.5%, while │
 │          │ decisions for new wind vs.      │ solar continued growing. This   │
 │          │ solar projects?                 │ may have influenced utility     │
 │          │                                 │ procurement decisions.          │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the capacity factor of  │ The DOE Wind Market Reports     │
 │          │ offshore wind installations in  │ cover offshore wind separately, │
 │          │ the U.S., and how does it       │ and offshore wind typically     │
 │          │ compare to both land-based wind │ achieves materially higher      │
 │          │ and utility-scale solar?        │ capacity factors than           │
 │          │                                 │ land-based wind (~40–50%), but  │
 │          │                                 │ this was not quantified in the  │
 │          │                                 │ retrieved sources.              │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ How does the Inflation          │ The IRA led to significant      │
 │          │ Reduction Act's impact on wind  │ near-term wind deployment       │
 │          │ and solar deployment affect     │ forecast increases and billions │
 │          │ future capacity factor trends,  │ in domestic supply chain        │
 │          │ given that larger, more         │ investment. Average wind        │
 │          │ efficient turbines and          │ turbine capacity grew to 3.4 MW │
 │          │ better-sited projects may       │ in 2023, up 375% since          │
 │          │ improve wind CFs?               │ 1998–1999.                      │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.88                                                                │
 │ Corroborating sources: 10                                                    │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.85                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 48230                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 95.81s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: e3fa81c3-eaff-4f76-9b50-d61e70e54540
--- a/docs/stress-tests/M3.3-runs/11-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/11-contradiction.log
@ -1,236 +0,0 @@
 Researching: Is red wine good for cardiovascular health?
 {"question": "Is red wine good for cardiovascular health?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:02:56.517038Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:02:57.298051Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:02:57.308234Z"}
 {"question": "Is red wine good for cardiovascular health?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:02:57.343434Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Is red wine good for cardiovascular health?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:57.343753Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:57.343847Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1097, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:09.450890Z"}
 {"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8466, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:15.393838Z"}
 {"step": 19, "decision": "Token budget reached before iteration 4: 22139/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:24.405453Z"}
 {"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 19, "iterations_run": 3, "tokens_used": 22139, "event": "synthesis_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:24.405621Z"}
 {"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 50486, "event": "synthesis_complete", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:13.808158Z"}
 {"step": 37, "decision": "Research complete", "confidence": 0.72, "citation_count": 9, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 78.676, "event": "complete", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:13.808851Z"}
 {"confidence": 0.72, "citations": 9, "gaps": 3, "discovery_events": 3, "tokens_used": 42350, "iterations_run": 3, "wall_time_sec": 76.46466898918152, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:04:13.808934Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:04:13.809517Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:04:13.813434Z"}
 {"trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "confidence": 0.72, "citations": 9, "tokens_used": 42350, "wall_time_sec": 76.46466898918152, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:04:14.104351Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The relationship between red wine and cardiovascular health is nuanced and   │
 │ contested. Historically, observational studies found that moderate drinkers  │
 │ (at least one drink per day) were 30–40% less likely to die from             │
 │ cardiovascular disease compared to non-drinkers, a pattern sometimes called  │
 │ the 'J-shaped mortality curve' [NYT/AHA]. Red wine specifically contains     │
 │ polyphenols (including flavonoids and resveratrol) that may inhibit LDL      │
 │ oxidation, prevent endothelial dysfunction, raise HDL cholesterol, and       │
 │ decrease fibrinogen concentrations [Circulation Research; PMC6804046].       │
 │ However, no study has established a direct cause-and-effect link between red │
 │ wine consumption and improved heart health [AHA]. More recent analyses       │
 │ suggest the apparent benefit may reflect confounding factors—moderate        │
 │ drinkers may have healthier lifestyles overall—and methodological flaws such │
 │ as including former drinkers (who quit due to illness) in the abstainer      │
 │ group [NYT; Three Spirit]. The 'French Paradox,' which popularized the red   │
 │ wine-heart health hypothesis, is now being critically re-examined as a       │
 │ public health myth [ResearchGate]. Major health organizations, including the │
 │ American Heart Association, do not recommend starting to drink red wine for  │
 │ heart benefit, and current evidence does not support a causal protective     │
 │ effect of alcohol on the heart.                                              │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ How Red Wine Lost Its Health  │ Researchers found that those   │  0.85 │
 │     │ Halo - The New York Times     │ who reported having at least   │       │
 │     │ https://www.nytimes.com/2024/ │ one alcoholic drink per day    │       │
 │     │ 02/17/well/eat/red-wine-heart │ were 30 to 40 percent less     │       │
 │     │ -health.html                  │ likely to die from             │       │
 │     │                               │ cardiovascular disease.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Drinking red wine for heart   │ No research has established a  │  0.92 │
 │     │ health? Read this before you  │ cause-and-effect link between  │       │
 │     │ toast | American Heart        │ drinking alcohol and better    │       │
 │     │ Association                   │ heart health. Rather, studies  │       │
 │     │ https://www.heart.org/en/news │ have found an association      │       │
 │     │ /2019/05/24/drinking-red-wine │ between wine and such benefits │       │
 │     │ -for-heart-health-read-this-b │ as a lower risk of dying from  │       │
 │     │ efore-you-toast               │ heart disease.                 │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Red Wine and Cardiovascular   │ The alcoholic component is     │  0.90 │
 │     │ Health | Circulation Research │ known to increase high-density │       │
 │     │ https://www.ahajournals.org/d │ lipoprotein cholesterol and to │       │
 │     │ oi/10.1161/CIRCRESAHA.112.278 │ decrease fibrinogen            │       │
 │     │ 705?doi=10.1161/CIRCRESAHA.11 │ concentrations. The            │       │
 │     │ 2.278705                      │ polyphenols present in red     │       │
 │     │                               │ wine                           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Wine and Cardiovascular       │ Flavonoids from red wine have  │  0.88 │
 │     │ Health | Circulation          │ been credited to inhibit       │       │
 │     │ https://www.ahajournals.org/d │ low-density lipoprotein (LDL)  │       │
 │     │ oi/10.1161/circulationaha.117 │ oxidation and prevent          │       │
 │     │ .030387                       │ endothelial dysfunction, which │       │
 │     │                               │ is                             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Red Wine Consumption and      │ Red Wine Consumption and       │  0.85 │
 │     │ Cardiovascular Health - PMC   │ Cardiovascular Health Luigi    │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ Castaldo ... Department of     │       │
 │     │ articles/PMC6804046/          │ Pharmacy, Faculty of Pharmacy, │       │
 │     │                               │ University of Naples "Federico │       │
 │     │                               │ II" ... Molecules. 2019 Oct    │       │
 │     │                               │ 8;24(19):3626. doi:            │       │
 │     │                               │ 10.3390/molecules24193626      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Association between Wine      │ Association between Wine       │  0.87 │
 │     │ Consumption with              │ Consumption with               │       │
 │     │ Cardiovascular Disease and    │ Cardiovascular Disease and     │       │
 │     │ Cardiovascular Mortality: A   │ Cardiovascular Mortality: A    │       │
 │     │ Systematic Review and         │ Systematic Review and          │       │
 │     │ Meta-Analysis - PMC           │ Meta-Analysis ... Nutrients.   │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ 2023 Jun 17;15(12):2785. doi:  │       │
 │     │ articles/PMC10303697/         │ 10.3390/nu15122785             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Red wine and resveratrol:     │ Is red wine heart healthy?     │  0.88 │
 │     │ Good for your heart? - Mayo   │ Antioxidants in red wine       │       │
 │     │ Clinic                        │ called polyphenols may help    │       │
 │     │ https://www.mayoclinic.org/di │ protect the lining of blood    │       │
 │     │ seases-conditions/heart-disea │ vessels in the heart. ·        │       │
 │     │ se/in-depth/red-wine/art-2004 │ Resveratrol in red wine.       │       │
 │     │ 8281                          │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Debunking the 'wine is        │ In the early nineties, a TV    │  0.65 │
 │     │ healthy' myth – Three Spirit  │ show in the US reported lower  │       │
 │     │ US                            │ heart attack rates in          │       │
 │     │ https://us.threespiritdrinks. │ France... The report framed    │       │
 │     │ com/blogs/blog/where-the-wine │ the country's regular          │       │
 │     │ -is-healthy-myth-came-from    │ consumption of alcohol, in     │       │
 │     │                               │ particular red wine, as the    │       │
 │     │                               │ reason behind this, claiming   │       │
 │     │                               │ that it reduced that risk of   │       │
 │     │                               │ heart disease.                 │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Revisiting the French         │ The "French Paradox," the      │  0.78 │
 │     │ Paradox: Deconstructing a     │ hypothesis that moderate red   │       │
 │     │ Public Health Myth and its    │ wine consumption explains      │       │
 │     │ Global Commercial Legacy      │ France's historically low      │       │
 │     │ https://www.researchgate.net/ │ coronary heart disease rates   │       │
 │     │ publication/399257280_Title_R │                                │       │
 │     │ evisiting_the_French_Paradox_ │                                │       │
 │     │ Deconstructing_a_Public_Healt │                                │       │
 │     │ h_Myth_and_its_Global_Commerc │                                │       │
 │     │ ial_Legacy                    │                                │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Randomized controlled    │ Most evidence is          │
 │                       │ trial evidence on red    │ observational. Robust RCT │
 │                       │ wine and cardiovascular  │ data directly testing red │
 │                       │ outcomes                 │ wine's causal             │
 │                       │                          │ cardiovascular effect in  │
 │                       │                          │ humans is lacking and not │
 │                       │                          │ surfaced in available     │
 │                       │                          │ sources.                  │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Differential effect of   │ Some sources attribute    │
 │                       │ red wine vs. other       │ benefits to polyphenols   │
 │                       │ alcohol types on         │ specific to red wine,     │
 │                       │ cardiovascular health    │ while others suggest the  │
 │                       │                          │ effect is due to alcohol  │
 │                       │                          │ in general, making it     │
 │                       │                          │ unclear whether red wine  │
 │                       │                          │ is uniquely beneficial.   │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ access_denied         │ Full text of 2023        │ The PMC10303697           │
 │                       │ meta-analysis findings   │ meta-analysis page header │
 │                       │                          │ was retrieved but full    │
 │                       │                          │ results/conclusions were  │
 │                       │                          │ not available in the      │
 │                       │                          │ scraped content.          │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ contradiction    │ database          │ randomized        │ Observational     │
 │                  │                   │ controlled trial  │ studies suggest   │
 │                  │                   │ red wine          │ benefit, but no   │
 │                  │                   │ polyphenols       │ causal link       │
 │                  │                   │ cardiovascular    │ established; RCT  │
 │                  │                   │ outcomes          │ evidence needed   │
 │                  │                   │                   │ to resolve        │
 │                  │                   │                   │ contradiction.    │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ resveratrol       │ Resveratrol is    │
 │                  │                   │ bioavailability   │ cited as a key    │
 │                  │                   │ cardiovascular    │ mechanism but its │
 │                  │                   │ human clinical    │ bioavailability   │
 │                  │                   │ trials 2022 2023  │ from wine in      │
 │                  │                   │ 2024              │ clinically        │
 │                  │                   │                   │ meaningful doses  │
 │                  │                   │                   │ is debated.       │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ sick quitter bias │ The J-shaped      │
 │                  │                   │ abstainer         │ curve may be an   │
 │                  │                   │ misclassification │ artifact of       │
 │                  │                   │ alcohol           │ methodological    │
 │                  │                   │ cardiovascular    │ flaws (sick       │
 │                  │                   │ epidemiology      │ quitters included │
 │                  │                   │                   │ in abstainer      │
 │                  │                   │                   │ group), which     │
 │                  │                   │                   │ undermines        │
 │                  │                   │                   │ earlier           │
 │                  │                   │                   │ protective        │
 │                  │                   │                   │ findings.         │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Does the apparent               │ Observational J-curve studies   │
 │          │ cardiovascular benefit of       │ may misclassify former drinkers │
 │          │ moderate red wine consumption   │ who quit due to illness as      │
 │          │ disappear when sick quitters    │ non-drinkers, inflating the     │
 │          │ are properly excluded from the  │ apparent benefit of moderate    │
 │          │ abstainer comparison group?     │ drinking.                       │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Is the cardiovascular effect of │ Circulation Research notes both │
 │          │ red wine attributable to        │ the alcohol component and       │
 │          │ polyphenols (resveratrol,       │ polyphenols independently       │
 │          │ flavonoids) or simply to the    │ affect cardiovascular markers,  │
 │          │ alcohol content?                │ but their relative contribution │
 │          │                                 │ is unclear.                     │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What do the most recent         │ The 2023 PMC meta-analysis was  │
 │          │ meta-analyses (2022–2024)       │ identified but its full         │
 │          │ conclude about wine consumption │ conclusions were not accessible │
 │          │ and cardiovascular mortality    │ in the retrieved content.       │
 │          │ after correcting for            │                                 │
 │          │ confounders?                    │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Are there subpopulations (e.g., │ Current guidance is             │
 │          │ by age, sex, genetic profile)   │ population-level; individual    │
 │          │ for whom moderate red wine      │ variation in alcohol metabolism │
 │          │ consumption might confer        │ and cardiovascular risk         │
 │          │ measurable cardiovascular       │ profiles may produce different  │
 │          │ benefit?                        │ outcomes.                       │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.72                                                                │
 │ Corroborating sources: 7                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: True                                                 │
 │ Query specificity match: 0.85                                                │
 │ Budget status: spent                                                         │
 │ Recency: recent                                                              │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 42350                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 76.46s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 96acce3c-853d-40b7-ba02-c721ac59f85d
--- a/docs/stress-tests/M3.3-runs/12-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/12-contradiction.log
@ -1,330 +0,0 @@
 Researching: Does intermittent fasting extend lifespan in humans?
 {"question": "Does intermittent fasting extend lifespan in humans?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:04:14.725578Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:04:15.543876Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:04:15.553451Z"}
 {"question": "Does intermittent fasting extend lifespan in humans?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:04:15.587475Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does intermittent fasting extend lifespan in humans?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:15.587815Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:15.587912Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1148, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:22.802797Z"}
 {"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8443, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:26.505496Z"}
 {"step": 21, "decision": "Starting iteration 4/5", "tokens_so_far": 18167, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:43.089460Z"}
 {"step": 26, "decision": "Token budget reached before iteration 5: 36705/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:47.193645Z"}
 {"step": 27, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 26, "iterations_run": 4, "tokens_used": 36705, "event": "synthesis_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:47.193894Z"}
 {"step": 28, "decision": "Parsed synthesis JSON successfully", "duration_ms": 76890, "event": "synthesis_complete", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:00.759366Z"}
 {"step": 48, "decision": "Research complete", "confidence": 0.72, "citation_count": 11, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 109.604, "event": "complete", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:00.760365Z"}
 {"confidence": 0.72, "citations": 11, "gaps": 4, "discovery_events": 4, "tokens_used": 62781, "iterations_run": 4, "wall_time_sec": 105.17169857025146, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:06:00.760468Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:06:00.760848Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:06:00.765020Z"}
 {"trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "confidence": 0.72, "citations": 11, "tokens_used": 62781, "wall_time_sec": 105.17169857025146, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:06:00.989582Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Current scientific evidence does NOT conclusively demonstrate that           │
 │ intermittent fasting (IF) extends lifespan in humans. While IF has proven    │
 │ lifespan-extending effects in animal models (particularly rodents), and      │
 │ improves multiple healthspan markers in humans—including weight, insulin     │
 │ resistance, inflammation, dyslipidemia, hypertension, oxidative stress, and  │
 │ autophagy—direct evidence of increased human lifespan from IF is lacking.    │
 │ Mechanistically, IF triggers 'adaptive stress' in cells, activating          │
 │ antioxidant production, DNA repair, autophagy (via spermidine-mediated       │
 │ pathways), and reduced inflammation, all of which are theoretically linked   │
 │ to longevity [InsideTracker, FORTH/Nature Cell Biology]. A 2024 review in    │
 │ Ageing Research Reviews concluded IF 'can be considered a                    │
 │ non-pharmacological strategy to extend lifespan' and has been 'proven to     │
 │ extend lifespan in rodent models,' but human translation remains unconfirmed │
 │ [ScienceDirect/PubMed]. A scoping review of RCTs found IF improves           │
 │ aging-related biomarkers in adults but stopped short of claiming lifespan    │
 │ extension [PMC]. A 2024 Nature study on genetically diverse mice showed      │
 │ dietary restriction (including IF) extends healthy lifespan in mice but its  │
 │ human relevance is unclear. Critically, a major 2024 AHA-presented           │
 │ observational study of 20,000+ U.S. adults found that eating within an       │
 │ 8-hour window was associated with a 91% higher risk of cardiovascular death  │
 │ compared to eating across 12–16 hours—though this study has been heavily     │
 │ criticized for methodological limitations including confounding variables    │
 │ (demographics, pre-existing disease) and reliance on only two days of        │
 │ dietary recall data [AHA, WebMD, Forbes]. In summary, IF improves several    │
 │ biomarkers associated with healthy aging in humans, and extends lifespan in  │
 │ animals, but no long-term human RCT has demonstrated actual lifespan         │
 │ extension, and some observational data raise cardiovascular safety concerns. │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Intermittent fasting and      │ IF can be considered as a      │  0.95 │
 │     │ longevity: From animal models │ non-pharmacological strategy   │       │
 │     │ to implication for humans -   │ to extend lifespan. IF         │       │
 │     │ ScienceDirect                 │ improves physiological         │       │
 │     │ https://www.sciencedirect.com │ function, enhances             │       │
 │     │ /science/article/abs/pii/S156 │ performance, and slows aging.  │       │
 │     │ 8163724000928                 │ IF was proven to extend        │       │
 │     │                               │ lifespan in rodent models.     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Intermittent fasting and      │ Findings to date from both     │  0.95 │
 │     │ longevity: From animal models │ human and animal experiments   │       │
 │     │ to implication for humans -   │ indicate that fasting improves │       │
 │     │ PubMed                        │ physiological function,        │       │
 │     │ https://pubmed.ncbi.nlm.nih.g │ enhances performance, and      │       │
 │     │ ov/38499159/                  │ slows aging and disease        │       │
 │     │                               │ processes. Metabolic and       │       │
 │     │                               │ cellular responses triggered   │       │
 │     │                               │ by IF could help to achieve    │       │
 │     │                               │ the aim of preventing disease, │       │
 │     │                               │ and maximizing healthspan and  │       │
 │     │                               │ longevity with minimal side    │       │
 │     │                               │ effects.                       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ How Intermittent Fasting      │ In humans, intermittent        │  0.88 │
 │     │ Impacts Longevity: A Summary  │ fasting improves weight,       │       │
 │     │ of the Research -             │ insulin resistance,            │       │
 │     │ InsideTracker                 │ inflammation, dyslipidemia,    │       │
 │     │ https://www.insidetracker.com │ and hypertension. IF has also  │       │
 │     │ /a/articles/how-intermittent- │ reduced tumor growth, boosted  │       │
 │     │ fasting-impacts-longevity     │ stem cell production, and      │       │
 │     │                               │ increased lifespan in mice.    │       │
 │     │                               │ During fasting, cells undergo  │       │
 │     │                               │ adaptive stress, which         │       │
 │     │                               │ activates different pathways   │       │
 │     │                               │ in the body, resulting in a    │       │
 │     │                               │ range of effects, including    │       │
 │     │                               │ increased production of        │       │
 │     │                               │ antioxidants, DNA repair,      │       │
 │     │                               │ autophagy.                     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Effects of Intermittent       │ In humans,                     │  0.97 │
 │     │ Fasting on Health, Aging, and │ intermittent-fasting           │       │
 │     │ Disease - NEJM                │ interventions ameliorate       │       │
 │     │ https://www.nejm.org/doi/full │ obesity, insulin resistance,   │       │
 │     │ /10.1056/NEJMra1905136        │ dyslipidemia, hypertension,    │       │
 │     │                               │ and inflammation.              │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Impact of Intermittent        │ Impact of Intermittent Fasting │  0.90 │
 │     │ Fasting and/or Caloric        │ and/or Caloric Restriction on  │       │
 │     │ Restriction on Aging-Related  │ Aging-Related Outcomes in      │       │
 │     │ Outcomes in Adults: A Scoping │ Adults: A Scoping Review of    │       │
 │     │ Review of Randomized          │ Randomized Controlled Trials.  │       │
 │     │ Controlled Trials - PMC       │ Nutrients. 2024 Jan            │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ 20;16(2):316. doi:             │       │
 │     │ articles/PMC10820472/         │ 10.3390/nu16020316             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ International scientific      │ intermittent fasting increases │  0.90 │
 │     │ collaboration reveals how     │ the levels of spermidine, a    │       │
 │     │ intermittent fasting          │ chemical compound (natural     │       │
 │     │ regulates ageing through      │ polyamine), that enhances the  │       │
 │     │ autophagy | FORTH             │ resilience and survival of     │       │
 │     │ https://forth.gr/en/news/show │ cells and organisms, through   │       │
 │     │ /&tid=2606                    │ the activation of autophagy.   │       │
 │     │                               │ Autophagy defects have been    │       │
 │     │                               │ linked to ageing, as well as,  │       │
 │     │                               │ with the emergence of          │       │
 │     │                               │ age-related disorders.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Dietary restriction impacts   │ Caloric restriction extends    │  0.92 │
 │     │ health and lifespan of        │ healthy lifespan in multiple   │       │
 │     │ genetically diverse mice |    │ species. Intermittent fasting, │       │
 │     │ Nature                        │ an alternative form of dietary │       │
 │     │ https://www.nature.com/articl │ restriction, is potentially    │       │
 │     │ es/s41586-024-08026-3         │ more sustainable in humans,    │       │
 │     │                               │ but its effectiveness remains  │       │
 │     │                               │ largely unexplored.            │       │
 │     │                               │ Identifying the most           │       │
 │     │                               │ efficacious forms of dietary   │       │
 │     │                               │ restriction is key for         │       │
 │     │                               │ developing interventions to    │       │
 │     │                               │ improve human health and       │       │
 │     │                               │ longevity.                     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Time-restricted eating may    │ A popular weight loss strategy │  0.85 │
 │     │ raise cardiovascular death    │ that limits the hours during   │       │
 │     │ risk in the long term |       │ which calories can be consumed │       │
 │     │ American Heart Association    │ may nearly double a person's   │       │
 │     │ https://www.heart.org/en/news │ long-term risk of dying from   │       │
 │     │ /2024/03/18/time-restricted-e │ cardiovascular disease, new    │       │
 │     │ ating-may-raise-cardiovascula │ research finds, especially     │       │
 │     │ r-death-risk-in-the-long-term │ among people with underlying   │       │
 │     │                               │ cardiovascular disease or      │       │
 │     │                               │ cancer.                        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Fasting Study Under Fire      │ Those conclusions are          │  0.87 │
 │     │ After Heart Conference -      │ premature and misleading, says │       │
 │     │ WebMD                         │ Christopher Gardner, PhD, a    │       │
 │     │ https://www.webmd.com/heart-d │ professor of medicine at       │       │
 │     │ isease/features/is-intermitte │ Stanford University... people  │       │
 │     │ nt-fasting-bad-for-heart-heal │ in the study group who         │       │
 │     │ th                            │ consumed all their food in a   │       │
 │     │                               │ daily window of 8 hours or     │       │
 │     │                               │ fewer had a higher percentage  │       │
 │     │                               │ of men, African Americans, and │       │
 │     │                               │ smoke.                         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ Intermittent Fasting - The    │ intermittent fasting activated │  0.78 │
 │     │ Impact on Autophagy,          │ autophagy, a cellular process  │       │
 │     │ Inflammasome, and Senescence  │ that breaks down components    │       │
 │     │ https://nomix.ai/2024/05/24/f │ within cells. Autophagy has    │       │
 │     │ asting-in-young-males-examini │ been linked to longevity...    │       │
 │     │ ng-the-impact-on-autophagy-in │ p21 levels decreased during    │       │
 │     │ flammasome-and-senescence-bio │ and after fasting. The         │       │
 │     │ markers/                      │ findings suggest that fasting  │       │
 │     │                               │ may contribute to delaying the │       │
 │     │                               │ onset of age-related diseases. │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 11  │ Effect of fasting-mimicking   │ Significant between-group      │  0.82 │
 │     │ diet on markers of autophagy  │ differences were observed in   │       │
 │     │ and metabolic health in human │ changes from baseline to the   │       │
 │     │ subjects | GeroScience        │ end of the 6-day dietary       │       │
 │     │ https://link.springer.com/art │ intervention for body weight,  │       │
 │     │ icle/10.1007/s11357-025-02035 │ fasting glucose, BHB, HOMA-IR, │       │
 │     │ -4                            │ and autophagic flux (p <       │       │
 │     │                               │ 0.05)... These results suggest │       │
 │     │                               │ that FMD may improve           │       │
 │     │                               │ autophagic flux and markers of │       │
 │     │                               │ metabolic health.              │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Long-term human RCT data │ No randomized controlled  │
 │                       │ on IF and all-cause      │ trial has followed human  │
 │                       │ mortality or lifespan    │ participants long enough  │
 │                       │                          │ to measure actual         │
 │                       │                          │ lifespan extension from   │
 │                       │                          │ IF. All human longevity   │
 │                       │                          │ evidence is based on      │
 │                       │                          │ biomarker surrogates or   │
 │                       │                          │ observational data.       │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Optimal IF protocol for  │ Studies test different    │
 │                       │ longevity in humans      │ protocols (TRF, ADF, 5:2, │
 │                       │                          │ FMD) with varying         │
 │                       │                          │ durations and             │
 │                       │                          │ populations, making it    │
 │                       │                          │ impossible to identify a  │
 │                       │                          │ single optimal regimen    │
 │                       │                          │ for human longevity.      │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Cardiovascular safety of │ Short-term studies show   │
 │                       │ long-term IF             │ cardiovascular benefit    │
 │                       │                          │ (improved BP, glucose,    │
 │                       │                          │ cholesterol), but the     │
 │                       │                          │ 2024 AHA observational    │
 │                       │                          │ study suggests possible   │
 │                       │                          │ long-term cardiovascular  │
 │                       │                          │ mortality risk, with      │
 │                       │                          │ experts disputing         │
 │                       │                          │ methodology.              │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ source_not_found      │ IF effects across        │ Most human studies focus  │
 │                       │ diverse demographic      │ on limited populations    │
 │                       │ groups                   │ (e.g., young males,       │
 │                       │                          │ specific ethnic groups),  │
 │                       │                          │ limiting generalizability │
 │                       │                          │ of longevity findings.    │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ contradiction    │ database          │ time-restricted   │ The AHA 2024      │
 │                  │                   │ eating            │ study claiming    │
 │                  │                   │ cardiovascular    │ 91% higher        │
 │                  │                   │ mortality NHANES  │ cardiovascular    │
 │                  │                   │ confounding       │ death risk        │
 │                  │                   │ variables         │ contradicts       │
 │                  │                   │ methodology       │ short-term        │
 │                  │                   │ critique 2024     │ studies showing   │
 │                  │                   │                   │ CV benefit;       │
 │                  │                   │                   │ deeper            │
 │                  │                   │                   │ methodological    │
 │                  │                   │                   │ analysis is       │
 │                  │                   │                   │ warranted.        │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ spermidine        │ The FORTH/Nature  │
 │                  │                   │ autophagy         │ Cell Biology      │
 │                  │                   │ intermittent      │ finding on        │
 │                  │                   │ fasting lifespan  │ spermidine-mediat │
 │                  │                   │ human clinical    │ ed autophagy is a │
 │                  │                   │ trial 2024        │ novel mechanism   │
 │                  │                   │                   │ that may be       │
 │                  │                   │                   │ testable in human │
 │                  │                   │                   │ longevity trials. │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ fasting mimicking │ A large           │
 │                  │                   │ diet longevity    │ registered RCT    │
 │                  │                   │ diet RCT          │ (NCT05698654) on  │
 │                  │                   │ NCT05698654       │ fasting-mimicking │
 │                  │                   │ results           │ and longevity     │
 │                  │                   │                   │ diet is underway; │
 │                  │                   │                   │ results could be  │
 │                  │                   │                   │ transformative    │
 │                  │                   │                   │ for the question  │
 │                  │                   │                   │ of human lifespan │
 │                  │                   │                   │ extension.        │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ telomere length   │ The Frontiers in  │
 │                  │                   │ intermittent      │ Aging study on    │
 │                  │                   │ fasting exercise  │ metabolic         │
 │                  │                   │ metabolomics      │ signatures of     │
 │                  │                   │ aging biomarkers  │ combined exercise │
 │                  │                   │ 2024              │ and fasting links │
 │                  │                   │                   │ to telomere       │
 │                  │                   │                   │ length, a key     │
 │                  │                   │                   │ aging biomarker   │
 │                  │                   │                   │ worth             │
 │                  │                   │                   │ investigating     │
 │                  │                   │                   │ further.          │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Will ongoing large-scale RCTs   │ No current RCT has followed     │
 │          │ (e.g., NCT05698654) provide     │ participants long enough to     │
 │          │ definitive evidence that IF     │ measure actual lifespan; only   │
 │          │ extends human lifespan or       │ biomarker surrogates have been  │
 │          │ healthspan?                     │ studied.                        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Does the cardiovascular         │ Experts including Stanford's    │
 │          │ mortality risk signal from the  │ Christopher Gardner criticized  │
 │          │ 2024 AHA observational study    │ the study for not controlling   │
 │          │ hold up after controlling for   │ for demographics, pre-existing  │
 │          │ confounders like pre-existing   │ disease, and reason for         │
 │          │ illness and dietary quality?    │ adopting IF.                    │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Can spermidine supplementation  │ FORTH research showed IF raises │
 │          │ replicate the                   │ spermidine, which activates     │
 │          │ autophagy-activating,           │ autophagy and promotes cell     │
 │          │ anti-aging effects of IF in     │ survival, suggesting            │
 │          │ humans who cannot sustain       │ supplementation as a potential  │
 │          │ fasting?                        │ proxy.                          │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Which IF protocol (TRF, ADF,    │ Multiple protocols are studied  │
 │          │ 5:2, or FMD) produces the       │ with heterogeneous populations, │
 │          │ greatest longevity-associated   │ making comparative              │
 │          │ biomarker improvements in       │ effectiveness unclear.          │
 │          │ diverse human populations?      │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ Does the 92-year-old case study │ SAGE Journals reported this as  │
 │          │ of repeated 3-week annual       │ the world's longest medically   │
 │          │ fasting over 45 years offer any │ documented repeated fasting     │
 │          │ generalizable insight into      │ history; clinical parameters    │
 │          │ long-term IF and human          │ showed cyclic variation.        │
 │          │ longevity?                      │                                 │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.72                                                                │
 │ Corroborating sources: 9                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: True                                                 │
 │ Query specificity match: 0.85                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 62781                                                                │
 │ Iterations: 4                                                                │
 │ Wall time: 105.17s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3
--- a/docs/stress-tests/M3.3-runs/13-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/13-contradiction.log
@ -1,260 +0,0 @@
 Researching: Are nuclear power plants safe?
 {"question": "Are nuclear power plants safe?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:06:01.606512Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:06:02.435399Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:06:02.443368Z"}
 {"question": "Are nuclear power plants safe?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:06:02.477384Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Are nuclear power plants safe?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:02.477723Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:02.477819Z"}
 {"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1169, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:15.136739Z"}
 {"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 11760, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:25.196255Z"}
 {"step": 23, "decision": "Token budget reached before iteration 4: 29534/20000", "event": "budget_exhausted", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:35.263571Z"}
 {"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 33, "iterations_run": 3, "tokens_used": 29534, "event": "synthesis_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:35.263885Z"}
 {"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 58649, "event": "synthesis_complete", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:31.700545Z"}
 {"step": 40, "decision": "Research complete", "confidence": 0.92, "citation_count": 8, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 92.558, "event": "complete", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:31.701336Z"}
 {"confidence": 0.92, "citations": 8, "gaps": 3, "discovery_events": 3, "tokens_used": 63429, "iterations_run": 3, "wall_time_sec": 89.22308659553528, "budget_exhausted": true, "event": "research_completed", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:07:31.701429Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:07:31.701781Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:07:31.705585Z"}
 {"trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "confidence": 0.92, "citations": 8, "tokens_used": 63429, "wall_time_sec": 89.22308659553528, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:07:32.018740Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Yes, nuclear power plants are among the safest sources of electricity        │
 │ generation when measured by deaths per unit of energy produced. According to │
 │ Statista (sourcing 2018 data), nuclear energy results in approximately 0.03  │
 │ deaths per terawatt-hour (TWh), making it safer than wind (0.04), solar      │
 │ (0.02 is slightly lower), natural gas (2.82), biomass (4.63), hydro (1.3),   │
 │ oil (18.43), coal (24.62), and brown coal (32.72). A separate dataset from   │
 │ ResearchGate reports 0.04 deaths per billion kWh for nuclear, compared to    │
 │ 100 for coal. Despite three major accidents—Three Mile Island (1979),        │
 │ Chernobyl (1986), and Fukushima (2011)—the overall fatality record remains   │
 │ exceptionally low. At Chernobyl, the worst nuclear accident in history, 2    │
 │ workers died in the initial explosion, 28 of 134 acute radiation syndrome    │
 │ patients later died, and roughly 5,000 thyroid cancer cases were             │
 │ attributable to radiation exposure among those under 18 at the time          │
 │ (Canadian Nuclear Safety Commission). Stanford researchers estimated         │
 │ Fukushima may cause approximately 130 deaths and 180 cancer cases globally,  │
 │ in addition to ~600 evacuation-related deaths. Three Mile Island caused no   │
 │ direct radiation deaths. U.S. nuclear plants operate under strict NRC        │
 │ oversight using a 'defense-in-depth' multi-layer safety approach (U.S.       │
 │ Department of Energy). The IAEA also sets international design and safety    │
 │ standards. Public perception of nuclear risk is widely considered            │
 │ disproportionate to the statistical evidence.                                │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Global deaths per energy      │ Brown coal 32.72 | Coal 24.62  │  0.97 │
 │     │ source | Statista             │ | Oil 18.43 | Biomass 4.63 |   │       │
 │     │ https://www.statista.com/stat │ Natural gas 2.82 | Hydro 1.3 | │       │
 │     │ istics/494425/death-rate-worl │ Wind 0.04 | Nuclear 0.03 |     │       │
 │     │ dwide-by-energy-source/       │ Solar 0.02. Death rates are    │       │
 │     │                               │ measured based on deaths from  │       │
 │     │                               │ accidents and air pollution    │       │
 │     │                               │ per terawatt-hour (TWh) of     │       │
 │     │                               │ electricity.                   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ rates for each energy source  │ 100 for coal, 36 for oil, 24   │  0.91 │
 │     │ in deaths per billion kWh     │ for biofuel/biomass, 4 for     │       │
 │     │ produced... | ResearchGate    │ natural gas, 1.4 for hydro,    │       │
 │     │ https://www.researchgate.net/ │ 0.44 for solar, 0.15 for wind  │       │
 │     │ figure/rates-for-each-energy- │ and 0.04 for nuclear.          │       │
 │     │ source-in-deaths-per-billion- │                                │       │
 │     │ kWh-produced-Source-Updated_t │                                │       │
 │     │ bl2_272406182                 │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Health effects of the         │ The initial steam explosion at │  0.97 │
 │     │ Chornobyl accident | Canadian │ the Chornobyl nuclear plant    │       │
 │     │ Nuclear Safety Commission     │ resulted in the deaths of 2    │       │
 │     │ https://www.cnsc-ccsn.gc.ca/e │ workers, and 134 plant staff   │       │
 │     │ ng/resources/health/health-ef │ and emergency workers suffered │       │
 │     │ fects-chornobyl-accident/     │ acute radiation syndrome due   │       │
 │     │                               │ to high doses of radiation. Of │       │
 │     │                               │ these 134 people, 28 later     │       │
 │     │                               │ died. About 5,000 thyroid      │       │
 │     │                               │ cancer cases were due to       │       │
 │     │                               │ radioactive iodine             │       │
 │     │                               │ (iodine-131) exposure to       │       │
 │     │                               │ children or adolescents at the │       │
 │     │                               │ time of the accident.          │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Stanford researchers          │ Radiation from Japan's         │  0.93 │
 │     │ calculate global health       │ Fukushima Daiichi nuclear      │       │
 │     │ impacts of the Fukushima      │ disaster may eventually cause  │       │
 │     │ nuclear disaster | Stanford   │ approximately 130 deaths and   │       │
 │     │ University                    │ 180 cases of cancer, mostly in │       │
 │     │ https://engineering.stanford. │ Japan, Stanford researchers    │       │
 │     │ edu/news/stanford-researchers │ have calculated. The numbers   │       │
 │     │ -calculate-global-health-impa │ are in addition to the roughly │       │
 │     │ cts-fukushima-nuclear-disaste │ 600 deaths caused by the       │       │
 │     │ r                             │ evacuation of the area         │       │
 │     │                               │ surrounding the nuclear plant  │       │
 │     │                               │ directly after the March 2011  │       │
 │     │                               │ earthquake, tsunami and        │       │
 │     │                               │ meltdown.                      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Enhanced Safety of Advanced   │ U.S. nuclear power plants are  │  0.96 │
 │     │ Reactors | U.S. Department of │ already among the safest and   │       │
 │     │ Energy                        │ most secure industrial         │       │
 │     │ https://www.energy.gov/ne/enh │ facilities in the world due to │       │
 │     │ anced-safety-advanced-reactor │ the industry's commitment to   │       │
 │     │ s                             │ comprehensive safety           │       │
 │     │                               │ procedures, robust training    │       │
 │     │                               │ programs and stringent federal │       │
 │     │                               │ regulation that keep nuclear   │       │
 │     │                               │ plants and neighboring         │       │
 │     │                               │ communities safe.              │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Three Mile Island, Chernobyl  │ Estimates on nuclear's overall │  0.88 │
 │     │ and Fukushima accidents haunt │ mortality rate are comparable  │       │
 │     │ nuclear's past | MinnPost     │ to solar or wind power (and    │       │
 │     │ https://www.minnpost.com/othe │ roughly 2.5% that of hydro     │       │
 │     │ r-nonprofit-media/2023/10/thr │ power). Oil and coal,          │       │
 │     │ ee-mile-island-chernobyl-and- │ meanwhile, are as much as 800  │       │
 │     │ fukushima-accidents-haunt-nuc │ times higher.                  │       │
 │     │ lears-past-will-they-dictate- │                                │       │
 │     │ its-future/                   │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Devastating Consequences of   │ The Chernobyl disaster, which  │  0.85 │
 │     │ Nuclear Accidents: Chernobyl, │ occurred on April 26, 1986,    │       │
 │     │ Fukushima and Three Mile      │ was the most significant       │       │
 │     │ Island | SciTechnol           │ nuclear accident in history.   │       │
 │     │ https://www.scitechnol.com/pe │ The explosion and fire at the  │       │
 │     │ er-review/devastating-consequ │ Chernobyl nuclear power plant  │       │
 │     │ ences-of-nuclear-accidents-ch │ in Ukraine resulted in the     │       │
 │     │ ernobyl-fukushima-and-three-m │ release of large amounts of    │       │
 │     │ ile-island-HLGS.php?article_i │ radioactive material into the  │       │
 │     │ d=21379                       │ atmosphere, leading to the     │       │
 │     │                               │ deaths of 31 people, and       │       │
 │     │                               │ causing widespread             │       │
 │     │                               │ contamination of the           │       │
 │     │                               │ surrounding areas.             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Laying the Foundation for New │ Domestic power reactors are    │  0.94 │
 │     │ and Advanced Nuclear Reactors │ tightly regulated by the U.S.  │       │
 │     │ in the United States |        │ Nuclear Regulatory Commission  │       │
 │     │ National Academies            │ (NRC) in all phases of their   │       │
 │     │ https://www.nationalacademies │ life cycle—design,             │       │
 │     │ .org/read/26630/chapter/9     │ construction, operations, and  │       │
 │     │                               │ decommissioning. The NRC is    │       │
 │     │                               │ charged with licensing and     │       │
 │     │                               │ regulation of plants to        │       │
 │     │                               │ provide reasonable assurance   │       │
 │     │                               │ of adequate protection of      │       │
 │     │                               │ public health and safety.      │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ contradictory_sources │ Long-term cancer         │ Estimates of total        │
 │                       │ mortality estimates from │ Chernobyl-attributed      │
 │                       │ Chernobyl                │ cancer deaths vary widely │
 │                       │                          │ across sources, from      │
 │                       │                          │ hundreds (WHO/UNSCEAR     │
 │                       │                          │ conservative estimates)   │
 │                       │                          │ to tens of thousands      │
 │                       │                          │ (Greenpeace/TORCH         │
 │                       │                          │ report), making a         │
 │                       │                          │ definitive number         │
 │                       │                          │ difficult to cite.        │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ scope_exceeded        │ Comparative safety of    │ Evidence gathered focuses │
 │                       │ advanced/next-generation │ on existing reactor fleet │
 │                       │ reactors (Gen IV, SMRs)  │ safety records; safety    │
 │                       │                          │ data specific to small    │
 │                       │                          │ modular reactors (SMRs)   │
 │                       │                          │ or Gen IV designs was not │
 │                       │                          │ retrieved.                │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ source_not_found      │ Nuclear waste long-term  │ While radioactive waste   │
 │                       │ safety statistics        │ management was briefly    │
 │                       │                          │ mentioned, quantitative   │
 │                       │                          │ long-term health risk     │
 │                       │                          │ data from waste storage   │
 │                       │                          │ was not found in the      │
 │                       │                          │ retrieved sources.        │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ arxiv             │ nuclear power     │ A systematic      │
 │                  │                   │ plant safety      │ academic review   │
 │                  │                   │ mortality         │ post-2020 could   │
 │                  │                   │ statistics        │ provide updated   │
 │                  │                   │ systematic review │ mortality         │
 │                  │                   │ 2020-2025         │ statistics        │
 │                  │                   │                   │ incorporating the │
 │                  │                   │                   │ full operational  │
 │                  │                   │                   │ history of        │
 │                  │                   │                   │ Fukushima         │
 │                  │                   │                   │ cleanup.          │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ IAEA PRIS nuclear │ The IAEA Power    │
 │                  │                   │ power plant       │ Reactor           │
 │                  │                   │ operational       │ Information       │
 │                  │                   │ safety incidents  │ System (PRIS)     │
 │                  │                   │ database          │ contains          │
 │                  │                   │                   │ comprehensive     │
 │                  │                   │                   │ incident and      │
 │                  │                   │                   │ safety data for   │
 │                  │                   │                   │ all global        │
 │                  │                   │                   │ nuclear plants.   │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ contradiction    │ database          │ Chernobyl total   │ SciTechnol source │
 │                  │                   │ excess cancer     │ cites 31          │
 │                  │                   │ deaths estimates  │ Chernobyl deaths  │
 │                  │                   │ UNSCEAR vs WHO vs │ while CNSC cites  │
 │                  │                   │ independent       │ 28+2=30, and      │
 │                  │                   │ researchers       │ long-term cancer  │
 │                  │                   │                   │ projections       │
 │                  │                   │                   │ differ vastly     │
 │                  │                   │                   │ between           │
 │                  │                   │                   │ organizations.    │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ How do small modular reactors   │ The DOE page on enhanced safety │
 │          │ (SMRs) compare in safety        │ of advanced reactors mentions   │
 │          │ profile to traditional          │ new designs but no comparative  │
 │          │ large-scale nuclear plants?     │ safety mortality data was       │
 │          │                                 │ available in the evidence.      │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ What is the total projected     │ Sources give conflicting        │
 │          │ cancer death toll from          │ numbers; CNSC cites 28 direct   │
 │          │ Chernobyl according to the most │ deaths but does not give a      │
 │          │ recent UNSCEAR assessment?      │ total long-term cancer          │
 │          │                                 │ projection.                     │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Does nuclear power's safety     │ Chernobyl and Fukushima both    │
 │          │ record hold across all          │ involved regulatory failures;   │
 │          │ countries, including those with │ safety statistics may differ    │
 │          │ less stringent regulatory       │ between high-regulation and     │
 │          │ frameworks?                     │ low-regulation countries.       │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How does nuclear power's safety │ Statista notes deaths are       │
 │          │ compare when including the      │ measured from 'accidents and    │
 │          │ health risks from uranium       │ air pollution' per TWh, which   │
 │          │ mining and fuel processing?     │ may not fully account for       │
 │          │                                 │ upstream fuel cycle risks.      │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.92                                                                │
 │ Corroborating sources: 8                                                     │
 │ Source authority: high                                                       │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.95                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 63429                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 89.22s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 2e2b6e88-c973-4422-919c-3838634336c9
--- a/docs/stress-tests/M3.3-runs/14-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/14-contradiction.log
@ -1,358 +0,0 @@
 Researching: Is dietary cholesterol harmful?
 {"question": "Is dietary cholesterol harmful?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:07:32.656017Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:07:33.414998Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:07:33.424151Z"}
 {"question": "Is dietary cholesterol harmful?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:07:33.456353Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Is dietary cholesterol harmful?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:33.456707Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:33.456830Z"}
 {"step": 11, "decision": "Starting iteration 2/5", "tokens_so_far": 1218, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:51.522768Z"}
 {"step": 18, "decision": "Starting iteration 3/5", "tokens_so_far": 14738, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:58.291229Z"}
 {"step": 27, "decision": "Token budget reached before iteration 4: 31680/20000", "event": "budget_exhausted", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:08:14.672921Z"}
 {"step": 28, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 33, "iterations_run": 3, "tokens_used": 31680, "event": "synthesis_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:08:14.673116Z"}
 {"step": 29, "decision": "Parsed synthesis JSON successfully", "duration_ms": 82227, "event": "synthesis_complete", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:33.529276Z"}
 {"step": 50, "decision": "Research complete", "confidence": 0.78, "citation_count": 13, "gap_count": 3, "discovery_count": 4, "total_duration_sec": 124.559, "event": "complete", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:33.531034Z"}
 {"confidence": 0.78, "citations": 13, "gaps": 3, "discovery_events": 4, "tokens_used": 64718, "iterations_run": 3, "wall_time_sec": 120.07263135910034, "budget_exhausted": true, "event": "research_completed", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:33.531291Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:09:33.531771Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:33.538771Z"}
 {"trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "confidence": 0.78, "citations": 13, "tokens_used": 64718, "wall_time_sec": 120.07263135910034, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:33.877746Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ The question of whether dietary cholesterol is harmful is nuanced and        │
 │ remains an area of active scientific debate. Here is a synthesis of the      │
 │ current evidence:                                                            │
 │                                                                              │
 │ 1. **Dietary cholesterol vs. blood cholesterol**: The body produces most of  │
 │ its own cholesterol, and dietary cholesterol has a relatively modest and     │
 │ variable effect on blood (serum) cholesterol levels. Epidemiological studies │
 │ and clinical interventions have largely shown that dietary cholesterol       │
 │ intake does not significantly impact blood cholesterol in most individuals   │
 │ [PMC6024687; PMC9143438]. A meta-analysis of 224 studies (8,143 subjects)    │
 │ found only modest increases in both LDL and HDL when dietary cholesterol is  │
 │ increased [Consensus Academic Search].                                       │
 │                                                                              │
 │ 2. **CVD risk from observational studies**: A 2020 AHA Science Advisory      │
 │ (Carson et al., Circulation) found a significant positive relationship       │
 │ between dietary cholesterol intake and blood LDL, but evidence from          │
 │ observational studies generally does not indicate a significant association  │
 │ with cardiovascular disease risk [AHA Journals,                              │
 │ doi:10.1161/CIR.0000000000000743]. However, a large pooled cohort study      │
 │ (n=29,615, published in JAMA) found each additional 300 mg/day of dietary    │
 │ cholesterol was associated with higher risk of incident CVD and all-cause    │
 │ mortality [PACE-CME; The Cardiology Advisor].                                │
 │                                                                              │
 │ 3. **Updated dietary guidelines**: The 2015–2020 U.S. Dietary Guidelines     │
 │ removed the previous 300 mg/day dietary cholesterol limit, citing no         │
 │ appreciable relationship between dietary cholesterol and serum cholesterol.  │
 │ However, this decision was contested by scientists who argued the evidence   │
 │ was insufficient rather than exculpatory [Regulations.gov scientists'        │
 │ comment; PMC6024687]. The AHA's 2026 dietary guidance states that dietary    │
 │ cholesterol is 'no longer a primary target for CVD risk reduction for most   │
 │ people,' though it still advises limiting cholesterol-rich foods [AHA        │
 │ Journals, doi:10.1161/CIR.0000000000001435].                                 │
 │                                                                              │
 │ 4. **Individual variability**: People differ substantially in how they       │
 │ respond to dietary cholesterol—'hyper-responders' see more significant LDL   │
 │ increases than 'hypo-responders.' Genetic and hormonal factors play          │
 │ important roles [ScienceDirect hypo/hyperresponders; PubMed 12074253].       │
 │                                                                              │
 │ 5. **Eggs as a cholesterol source**: Eggs are the primary dietary            │
 │ cholesterol source studied. Evidence on egg consumption and CVD is           │
 │ inconsistent. A 2025 umbrella review found 'critically low' quality of       │
 │ evidence and concluded there is no sufficient evidence to discourage egg     │
 │ consumption, though weak associations with higher LDL and heart failure risk │
 │ were noted [ScienceDirect, doi:10.1016/j.numecd.2025.103849]. A BMJ          │
 │ meta-analysis suggested higher egg consumption could be associated with      │
 │ higher CVD risk [BMJ m513].                                                  │
 │                                                                              │
 │ 6. **Saturated fat confounding**: Most foods high in dietary cholesterol are │
 │ also high in saturated fat, which does raise LDL cholesterol and CVD risk.   │
 │ Eggs and shrimp are notable exceptions [PMC6024687].                         │
 │                                                                              │
 │ **Bottom line**: For most people, dietary cholesterol in moderate amounts is │
 │ unlikely to be a primary driver of CVD risk. However, it is not completely   │
 │ benign—particularly for hyper-responders or people with diabetes—and the     │
 │ overall dietary pattern (especially saturated fat intake) matters more than  │
 │ dietary cholesterol in isolation. Caution is still warranted, and individual │
 │ factors should guide dietary choices.                                        │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Dietary Cholesterol and the   │ To date, extensive research    │  0.92 │
 │     │ Lack of Evidence in           │ did not show evidence to       │       │
 │     │ Cardiovascular Disease - PMC  │ support a role of dietary      │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ cholesterol in the development │       │
 │     │ articles/PMC6024687/          │ of CVD. As a result, the       │       │
 │     │                               │ 2015–2020 Dietary Guidelines   │       │
 │     │                               │ for Americans removed the      │       │
 │     │                               │ recommendations of restricting │       │
 │     │                               │ dietary cholesterol to 300     │       │
 │     │                               │ mg/day.                        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Is There a Correlation        │ it was not until the late      │  0.91 │
 │     │ between Dietary and Blood     │ 1990s when they were finally   │       │
 │     │ Cholesterol? Evidence from    │ challenged by the newer        │       │
 │     │ Epidemiological Data and      │ information derived from       │       │
 │     │ Clinical Interventions - PMC  │ epidemiological studies and    │       │
 │     │ https://pmc.ncbi.nlm.nih.gov/ │ meta-analysis, which confirmed │       │
 │     │ articles/PMC9143438/          │ the lack of correlation        │       │
 │     │                               │ between dietary and blood      │       │
 │     │                               │ cholesterol.                   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Dietary Cholesterol and       │ Evidence from observational    │  0.93 │
 │     │ Cardiovascular Risk: A        │ studies conducted in several   │       │
 │     │ Science Advisory from the AHA │ countries generally does not   │       │
 │     │ https://www.ahajournals.org/d │ indicate a significant         │       │
 │     │ oi/full/10.1161/CIR.000000000 │ association with               │       │
 │     │ 0000743                       │ cardiovascular disease risk.   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Dietary Cholesterol and       │ Differences in dietary         │  0.88 │
 │     │ Cardiovascular Risk: A        │ cholesterol ranged from 155 to │       │
 │     │ Science Advisory (full text)  │ 1000 mg/d. A significant       │       │
 │     │ https://www.ahajournals.org/d │ positive relationship was      │       │
 │     │ oi/10.1161/CIR.00000000000007 │ identified between dietary     │       │
 │     │ 43                            │ cholesterol                    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ 2026 Dietary Guidance to      │ Dietary cholesterol is no      │  0.90 │
 │     │ Improve Cardiovascular Health │ longer a primary target for    │       │
 │     │ https://www.ahajournals.org/d │ CVD risk reduction for most    │       │
 │     │ oi/10.1161/CIR.00000000000014 │ people. Nevertheless, heart    │       │
 │     │ 35                            │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Higher consumption of dietary │ Among US adults, higher intake │  0.87 │
 │     │ cholesterol or eggs linked to │ of dietary cholesterol or eggs │       │
 │     │ increased risk of incident    │ was significantly linked to    │       │
 │     │ CVD and mortality - PACE-CME  │ increased risk of incident CVD │       │
 │     │ https://pace-cme.org/news/hig │ and all-cause mortality in a   │       │
 │     │ her-consumption-of-dietary-ch │ dose-response manner, which    │       │
 │     │ olesterol-or-eggs-linked-to-i │ was independent of nutrients   │       │
 │     │ ncreased-risk-of-incident-cvd │ or diets                       │       │
 │     │ -and-mortality/2455413/       │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ After Continued Debate,       │ Each additional 300 mg of      │  0.87 │
 │     │ Dietary Cholesterol Linked to │ dietary cholesterol consumed   │       │
 │     │ Significant Increase in CVD - │ per day was significantly      │       │
 │     │ The Cardiology Advisor        │ associated with a higher risk  │       │
 │     │ https://www.thecardiologyadvi │ for incident CVD and all-cause │       │
 │     │ sor.com/home/topics/metabolic │ mortality, as was each         │       │
 │     │ /dyslipidemia/after-continued │ additional half an egg         │       │
 │     │ -debate-dietary-cholesterol-l │ consumed per day.              │       │
 │     │ inked-to-significant-increase │                                │       │
 │     │ -in-cvd/                      │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Scientists' Comment on        │ dietary cholesterol is very    │  0.82 │
 │     │ Dietary Cholesterol -         │ much a 'nutrient of concern,'  │       │
 │     │ Regulations.gov               │ because it increases LDL       │       │
 │     │ https://downloads.regulations │ cholesterol, a                 │       │
 │     │ .gov/FDA-2018-P-1593-0049/att │ well-established risk factor   │       │
 │     │ achment_2.pdf                 │ for coronary heart disease.    │       │
 │     │                               │ Furthermore, the consumption   │       │
 │     │                               │ of whole eggs is associated    │       │
 │     │                               │ with the risk of type 2        │       │
 │     │                               │ diabetes                       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Dietary Cholesterol And Blood │ A meta-analysis of 224 studies │  0.85 │
 │     │ Cholesterol - Consensus       │ involving 8,143 subjects found │       │
 │     │ Academic Search Engine        │ that dietary cholesterol       │       │
 │     │ https://consensus.app/questio │ intake leads to modest         │       │
 │     │ ns/dietary-cholesterol-and-bl │ increases in both LDL and HDL  │       │
 │     │ ood-cholesterol/              │ cholesterol levels. The study  │       │
 │     │                               │ highlighted that while dietary │       │
 │     │                               │ cholesterol does raise serum   │       │
 │     │                               │ cholesterol levels, the effect │       │
 │     │                               │ is relatively small and varies │       │
 │     │                               │ among individuals.             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ Effect of egg consumption on  │ The overall quality of studies │  0.88 │
 │     │ health outcomes: Updated      │ was critically low. The level  │       │
 │     │ umbrella review -             │ of evidence was very weak for  │       │
 │     │ ScienceDirect                 │ all the significant            │       │
 │     │ https://www.sciencedirect.com │ associations: risk of heart    │       │
 │     │ /science/article/pii/S0939475 │ failure (RR 1.15; 95%CI:       │       │
 │     │ 325000031                     │ 1.02–1.30)... higher levels of │       │
 │     │                               │ LDL cholesterol (WMD 7.39;     │       │
 │     │                               │ 95%CI 5.82–8.95)... No         │       │
 │     │                               │ evidence of association was    │       │
 │     │                               │ found among all cardiovascular │       │
 │     │                               │ outcomes and all-cause         │       │
 │     │                               │ mortality risk                 │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 11  │ Egg consumption and risk of   │ Results from our updated       │  0.84 │
 │     │ cardiovascular disease - The  │ meta-analysis suggest that     │       │
 │     │ BMJ                           │ higher egg consumption could   │       │
 │     │ https://www.bmj.com/content/3 │ be associated with a higher    │       │
 │     │ 68/bmj.m513                   │ risk of cardiovascular disease │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 12  │ Hypo- and hyperresponders to  │ Hypo- and hyperresponders to   │  0.78 │
 │     │ dietary cholesterol -         │ dietary cholesterol            │       │
 │     │ ScienceDirect                 │                                │       │
 │     │ https://www.sciencedirect.com │                                │       │
 │     │ /science/article/abs/pii/S000 │                                │       │
 │     │ 2916523398897                 │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 13  │ Here's the latest on dietary  │ More recently, accumulating    │  0.87 │
 │     │ cholesterol and how it fits   │ data has caused researchers to │       │
 │     │ in with a healthy diet |      │ broaden their thinking about   │       │
 │     │ American Heart Association    │ how dietary cholesterol – and  │       │
 │     │ https://www.heart.org/en/news │ eggs – fit into a healthy      │       │
 │     │ /2023/08/25/heres-the-latest- │ eating pattern. 'We've         │       │
 │     │ on-dietary-cholesterol-and-ho │ advanced considerably,' said   │       │
 │     │ w-it-fits-in-with-a-healthy-d │ professor Linda Van Horn       │       │
 │     │ iet                           │                                │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Long-term RCT data on    │ Most evidence comes from  │
 │                       │ dietary cholesterol and  │ observational studies or  │
 │                       │ hard CVD endpoints       │ short-term interventions. │
 │                       │                          │ There are no large,       │
 │                       │                          │ long-term randomized      │
 │                       │                          │ controlled trials         │
 │                       │                          │ directly testing reduced  │
 │                       │                          │ dietary cholesterol       │
 │                       │                          │ versus hard CVD outcomes  │
 │                       │                          │ like myocardial           │
 │                       │                          │ infarction or             │
 │                       │                          │ cardiovascular death.     │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ source_not_found      │ Dietary cholesterol      │ While some sources        │
 │                       │ effects in specific      │ mention increased CVD     │
 │                       │ high-risk subgroups      │ risk from eggs in people  │
 │                       │ (diabetes, familial      │ with diabetes, the        │
 │                       │ hypercholesterolemia)    │ gathered evidence does    │
 │                       │                          │ not deeply characterize   │
 │                       │                          │ effects in all high-risk  │
 │                       │                          │ subgroups such as         │
 │                       │                          │ familial                  │
 │                       │                          │ hypercholesterolemia      │
 │                       │                          │ patients.                 │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Mechanisms               │ Confounding between       │
 │                       │ distinguishing dietary   │ dietary cholesterol and   │
 │                       │ cholesterol from         │ saturated fat intake      │
 │                       │ saturated fat effects    │ makes it difficult to     │
 │                       │                          │ isolate dietary           │
 │                       │                          │ cholesterol's independent │
 │                       │                          │ effect on CVD; different  │
 │                       │                          │ studies handle this       │
 │                       │                          │ confounder differently,   │
 │                       │                          │ leading to inconsistent   │
 │                       │                          │ conclusions.              │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ contradiction    │ database          │ dietary           │ The evidence is   │
 │                  │                   │ cholesterol CVD   │ contradictory     │
 │                  │                   │ risk randomized   │ between large     │
 │                  │                   │ controlled trial  │ observational     │
 │                  │                   │ meta-analysis     │ pooled cohorts    │
 │                  │                   │ 2020 2024         │ (showing CVD      │
 │                  │                   │                   │ risk) and         │
 │                  │                   │                   │ intervention/epid │
 │                  │                   │                   │ emiological       │
 │                  │                   │                   │ reviews (showing  │
 │                  │                   │                   │ no significant    │
 │                  │                   │                   │ association),     │
 │                  │                   │                   │ warranting deeper │
 │                  │                   │                   │ RCT-level         │
 │                  │                   │                   │ analysis.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ lean mass         │ A distinct        │
 │                  │                   │ hyper-responder   │ phenotype (lean   │
 │                  │                   │ LDL dietary       │ mass              │
 │                  │                   │ cholesterol       │ hyper-responders) │
 │                  │                   │ cardiovascular    │ shows pronounced  │
 │                  │                   │ risk 2023 2024    │ LDL increases on  │
 │                  │                   │                   │ low-carb diets    │
 │                  │                   │                   │ high in dietary   │
 │                  │                   │                   │ fat/cholesterol,  │
 │                  │                   │                   │ with unclear CVD  │
 │                  │                   │                   │ implications.     │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ dietary           │ Multiple sources  │
 │                  │                   │ cholesterol type  │ mention           │
 │                  │                   │ 2 diabetes risk   │ association       │
 │                  │                   │ eggs 2020 2024    │ between           │
 │                  │                   │ meta-analysis     │ egg/cholesterol   │
 │                  │                   │                   │ intake and type 2 │
 │                  │                   │                   │ diabetes risk,    │
 │                  │                   │                   │ which is not      │
 │                  │                   │                   │ fully explored in │
 │                  │                   │                   │ the gathered      │
 │                  │                   │                   │ evidence.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ new_source       │ database          │ ACC AHA 2026      │ New 2026 ACC/AHA  │
 │                  │                   │ dyslipidemia      │ dyslipidemia      │
 │                  │                   │ guidelines        │ guidelines were   │
 │                  │                   │ dietary           │ referenced but    │
 │                  │                   │ cholesterol       │ only partially    │
 │                  │                   │ recommendations   │ retrieved; full   │
 │                  │                   │                   │ dietary           │
 │                  │                   │                   │ cholesterol       │
 │                  │                   │                   │ guidance warrants │
 │                  │                   │                   │ review.           │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Should dietary cholesterol      │ Scientists' comments on the     │
 │          │ recommendations differ for      │ 2015 dietary guidelines and     │
 │          │ people with diabetes or         │ some observational studies      │
 │          │ familial hypercholesterolemia   │ suggest egg/cholesterol intake  │
 │          │ compared to the general         │ may increase CHD risk           │
 │          │ population?                     │ specifically in people with     │
 │          │                                 │ diabetes.                       │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Do LDL cholesterol              │ Research shows wide individual  │
 │          │ hyper-responders to dietary     │ variability in LDL response to  │
 │          │ cholesterol face meaningfully   │ dietary cholesterol; it is      │
 │          │ higher long-term CVD risk, and  │ unclear whether                 │
 │          │ should they restrict dietary    │ hyper-responders have elevated  │
 │          │ cholesterol?                    │ CVD risk and need tailored      │
 │          │                                 │ advice.                         │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ How much of the observed CVD    │ PMC6024687 notes most           │
 │          │ risk associated with dietary    │ high-cholesterol foods are also │
 │          │ cholesterol in observational    │ high in saturated fat;          │
 │          │ studies is attributable to      │ isolating dietary cholesterol's │
 │          │ saturated fat co-ingestion      │ independent effect is           │
 │          │ rather than cholesterol itself? │ methodologically challenging.   │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the effect of dietary   │ PACE-CME study noted that CVD   │
 │          │ cholesterol within the context  │ risk association from dietary   │
 │          │ of a high-quality overall diet  │ cholesterol was independent of  │
 │          │ (e.g., Mediterranean or DASH    │ overall diet quality, but this  │
 │          │ diet)?                          │ needs further investigation.    │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Does the food matrix (e.g.,     │ The 2025 umbrella review of egg │
 │          │ eggs vs. red meat) in which     │ consumption found weak          │
 │          │ dietary cholesterol is consumed │ associations; it is unclear if  │
 │          │ modify its impact on CVD risk?  │ the source of dietary           │
 │          │                                 │ cholesterol modulates risk      │
 │          │                                 │ independently of the            │
 │          │                                 │ cholesterol content.            │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.78                                                                │
 │ Corroborating sources: 13                                                    │
 │ Source authority: high                                                       │
 │ Contradiction detected: True                                                 │
 │ Query specificity match: 0.85                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 64718                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 120.07s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 27d81891-5bf2-4bf4-9744-55f39ffaf696
--- a/docs/stress-tests/M3.3-runs/15-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/15-contradiction.log
@ -1,48 +0,0 @@
 Researching: Does screen time harm child development?
 {"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:34.721867Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:09:35.602647Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:35.613025Z"}
 {"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:35.653113Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does screen time harm child development?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653592Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653723Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1126, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:45.628661Z"}
 {"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 10139, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:51.476900Z"}
 {"step": 21, "decision": "Token budget reached before iteration 4: 23391/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056368Z"}
 {"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23391, "event": "synthesis_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056571Z"}
 {"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74986, "event": "synthesis_complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.739493Z"}
 {"step": 24, "decision": "Failed to build ResearchResult: 1 validation error for DiscoveryEvent\nquery\n  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]\n    For further information visit https://errors.pydantic.dev/2.12/v/string_type", "event": "synthesis_build_error", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.753603Z"}
 {"step": 26, "decision": "Research complete", "confidence": 0.1, "citation_count": 0, "gap_count": 1, "discovery_count": 0, "total_duration_sec": 98.512, "event": "complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.755661Z"}
 {"confidence": 0.1, "citations": 0, "gaps": 1, "discovery_events": 0, "tokens_used": 44375, "iterations_run": 3, "wall_time_sec": 95.08588027954102, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:10.755895Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:11:10.757071Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:10.770530Z"}
 {"trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "confidence": 0.1, "citations": 0, "tokens_used": 44375, "wall_time_sec": 95.08588027954102, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.105698Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Research on 'Does screen time harm child development?' completed but         │
 │ synthesis failed. 22 sources were gathered.                                  │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 No citations.
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic     ┃ Detail                                        ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ budget_exhausted │ synthesis │ The synthesis step failed to produce          │
 │                  │           │ structured output.                            │
 └──────────────────┴───────────┴───────────────────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.10                                                                │
 │ Corroborating sources: 0                                                     │
 │ Source authority: low                                                        │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.00                                                │
 │ Budget status: spent                                                         │
 │ Recency: unknown                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 44375                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 95.09s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 9c18d570-73d3-4e8a-98bc-7cb1b66c61d2
--- a/docs/stress-tests/M3.3-runs/16-scope.log
+++ b/docs/stress-tests/M3.3-runs/16-scope.log
@ -1,321 +0,0 @@
 Researching: What proprietary indexing strategies do high-frequency trading 
 firms use for order book reconstruction?
 {"question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.888630Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:11:12.816801Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:12.829566Z"}
 {"question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:12.871225Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:12.871693Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:12.872051Z"}
 {"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1212, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:27.416025Z"}
 {"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 15135, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:33.632271Z"}
 {"step": 23, "decision": "Token budget reached before iteration 4: 35581/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:40.723229Z"}
 {"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 35, "iterations_run": 3, "tokens_used": 35581, "event": "synthesis_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:40.723491Z"}
 {"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 72229, "event": "synthesis_complete", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:50.638239Z"}
 {"step": 42, "decision": "Research complete", "confidence": 0.72, "citation_count": 8, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 101.111, "event": "complete", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:50.639828Z"}
 {"confidence": 0.72, "citations": 8, "gaps": 4, "discovery_events": 4, "tokens_used": 70892, "iterations_run": 3, "wall_time_sec": 97.76683187484741, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:12:50.639933Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:12:50.640430Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:12:50.648897Z"}
 {"trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "confidence": 0.72, "citations": 8, "tokens_used": 70892, "wall_time_sec": 97.76683187484741, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:12:50.931342Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ High-frequency trading firms use several proprietary and semi-documented     │
 │ indexing strategies for order book reconstruction, though most production    │
 │ details remain trade secrets. Based on available evidence:                   │
 │                                                                              │
 │ 1. **Hash Table + Array Hybrid**: The most commonly cited production         │
 │ approach combines plain arrays (for cache-friendly sequential memory access  │
 │ minimizing cache misses) with hash tables (for O(1) lookup of specific price │
 │ levels). This codesign optimizes both speed and cache locality. [Sources 15, │
 │ 16, 28]                                                                      │
 │                                                                              │
 │ 2. **B-Tree / ISAM Indexing**: The historically significant Island ECN       │
 │ (1996), built by Josh Levine, used in-memory B-tree indexing via an ISAM     │
 │ storage engine with zero disk access during matching, achieving O(log N)     │
 │ access per price level. This is considered the documented proof-of-concept   │
 │ for production-grade LOB indexing. [Source 29]                               │
 │                                                                              │
 │ 3. **Hybrid Binary-Linear Search**: An IEEE-documented approach proposes a   │
 │ simple linear data structure for tracking the order book combined with a     │
 │ hybrid binary-linear search algorithm to maintain top bid/ask with minimal   │
 │ latency. [Source 19]                                                         │
 │                                                                              │
 │ 4. **ROI Vector (Region-of-Interest Vector)**: Used in backtesting           │
 │ frameworks like HftBacktest, this approach restricts the active price range  │
 │ to a bounded region of interest, enabling vector-based O(1) access within    │
 │ the ROI while avoiding full-book scanning. [Source 25, 35]                   │
 │                                                                              │
 │ 5. **Lock-Free Concurrent Data Structures**: To handle concurrent updates    │
 │ without mutex overhead, firms implement lock-free data structures allowing   │
 │ multiple threads to update the LOB simultaneously. [Sources 15, 16]          │
 │                                                                              │
 │ 6. **Event-Driven with Selective Polling Hybrid**: The LOB primarily         │
 │ operates event-driven but incorporates high-frequency polling for the most   │
 │ latency-sensitive execution pathways, ensuring sub-microsecond               │
 │ responsiveness. [Sources 15, 16]                                             │
 │                                                                              │
 │ 7. **Order Record Reuse (Object Pooling)**: Levine's Island engine reused    │
 │ recently freed order records for new orders—described as 'hugely             │
 │ important'—a form of memory pooling that avoids allocation overhead during   │
 │ high-throughput periods. [Source 29]                                         │
 │                                                                              │
 │ 8. **Structural Filtration for Signal Quality**: Recent research (2025)      │
 │ proposes filtering transient LOB events by order lifetime, update count, or  │
 │ inter-update delay before indexing, improving directional signal quality     │
 │ (OBI) extracted from the reconstructed book. [Source 6]                      │
 │                                                                              │
 │ Notably, red-black trees—frequently cited in academic literature—are rarely  │
 │ used in production due to poor cache behavior versus simpler arrays at       │
 │ realistic market depths. The key insight from practitioners is that          │
 │ algorithmic data structure choice (O(log N) vs O(N)) dominates hardware      │
 │ investment: a $2M co-location/FPGA upgrade produced no measurable latency    │
 │ improvement when the underlying order book used a sorted array with O(N)     │
 │ inserts. [Source 23, 29]                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Matching Engine Architecture: │ Josh Levine built the Island   │  0.95 │
 │     │ Why Your Order Book Data      │ matching engine in FoxPro for  │       │
 │     │ Structure Is the Real Latency │ MS-DOS... The order book used  │       │
 │     │ Bottleneck                    │ in-memory B-tree indexing via  │       │
 │     │ https://electronictradinghub. │ an ISAM storage engine. Zero   │       │
 │     │ com/matching-engine-architect │ disk access during matching.   │       │
 │     │ ure-why-your-order-book-data- │ Every price level accessed in  │       │
 │     │ structure-is-the-real-latency │ O(log N) time. Levine's        │       │
 │     │ -bottleneck/                  │ optimization for new-order     │       │
 │     │                               │ entry latency: reuse recently  │       │
 │     │                               │ freed order records for new    │       │
 │     │                               │ orders — a detail he called    │       │
 │     │                               │ 'hugely important'             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Optimizing Limit Order Book   │ I use a combination of plain   │  0.88 │
 │     │ for HFT Systems               │ arrays and hash tables to      │       │
 │     │ https://www.linkedin.com/post │ manage the LOB. Arrays are     │       │
 │     │ s/silahian_hft-hft-trading-ac │ highly effective with CPU      │       │
 │     │ tivity-7351226537301417988-ei │ caches, offering sequential    │       │
 │     │ cX                            │ memory access that minimizes   │       │
 │     │                               │ cache misses. The integration  │       │
 │     │                               │ of hash tables provides quick  │       │
 │     │                               │ access to specific entries,    │       │
 │     │                               │ ensuring that both speed and   │       │
 │     │                               │ cache locality are optimized.  │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Red Black Trees for Limit     │ They're not necessarily ideal. │  0.92 │
 │     │ Order Book - Quantitative     │ In fact, they're rarely used   │       │
 │     │ Finance Stack Exchange        │ in production trading systems  │       │
 │     │ https://quant.stackexchange.c │ with low latency               │       │
 │     │ om/questions/63140/red-black- │ requirements... a simple array │       │
 │     │ trees-for-limit-order-book    │ or vector with linear access   │       │
 │     │                               │ patterns will often outperform │       │
 │     │                               │ any complex data structure     │       │
 │     │                               │ with better asymptotic runtime │       │
 │     │                               │ because a simple array         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Order Book Reconstruction -   │ HashMapMarketDepth...          │  0.85 │
 │     │ HftBacktest                   │ BTreeMarketDepth...            │       │
 │     │ https://mintlify.com/nkaz001/ │ ROIVectorMarketDepth::new(tick │       │
 │     │ hftbacktest/concepts/order-bo │ _size, lot_size, roi_lb,       │       │
 │     │ ok                            │ roi_ub)...                     │       │
 │     │                               │ FusedHashMapMarketDepth        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Order Book Filtration and     │ Three real-time, observable    │  0.82 │
 │     │ Directional Signal Extraction │ filtration schemes: based on   │       │
 │     │ at High Frequency             │ order lifetime, update count,  │       │
 │     │ https://arxiv.org/html/2507.2 │ and inter-update delay. These  │       │
 │     │ 2712v1                        │ are used to recompute OBI on   │       │
 │     │                               │ structurally filtered event    │       │
 │     │                               │ streams... Empirical results   │       │
 │     │                               │ show that structural           │       │
 │     │                               │ filtration improves            │       │
 │     │                               │ directional signal clarity in  │       │
 │     │                               │ correlation and regime-based   │       │
 │     │                               │ metrics                        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Building Low-Latency Order    │ This paper proposes a simple   │  0.80 │
 │     │ Books with Hybrid             │ linear data structure for      │       │
 │     │ Binary-Linear ...             │ tracking the order book and a  │       │
 │     │ https://ieeexplore.ieee.org/d │ hybrid binary-linear search    │       │
 │     │ ocument/10296447/             │ algorithm to maintain the top  │       │
 │     │                               │ bid and ask                    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Order Book Reconstruction -   │ Index reusing... Regional      │  0.75 │
 │     │ dxFeed KB                     │ events... Event flags          │       │
 │     │ https://kb.dxfeed.com/en/data │ applicable to Order event...   │       │
 │     │ -model/dxfeed-order-book/orde │ Snapshots... Transaction       │       │
 │     │ r-book-reconstruction.html    │ model... dxFeed market data    │       │
 │     │                               │ feeds (real-time, delayed or   │       │
 │     │                               │ historical) allow clients to   │       │
 │     │                               │ reconstruct order books, price │       │
 │     │                               │ level aggregations, and        │       │
 │     │                               │ aggregations by Market Maker   │       │
 │     │                               │ or a data provider.            │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ GitHub -                      │ This Limit Order Book is       │  0.70 │
 │     │ brprojects/Limit-Order-Book   │ developed in C++ from scratch  │       │
 │     │ https://github.com/brprojects │ and able to handle over        │       │
 │     │ /Limit-Order-Book             │ 1,400,000 TPS (transactions    │       │
 │     │                               │ per second), including Market, │       │
 │     │                               │ Limit, Stop and Stop Limit     │       │
 │     │                               │ orders.                        │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic                       ┃ Detail                      ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found │ Proprietary FPGA-based      │ Actual FPGA hardware        │
 │                  │ order book indexing schemes │ implementations used by     │
 │                  │                             │ firms like Virtu, Jane      │
 │                  │                             │ Street, or Citadel for      │
 │                  │                             │ on-chip order book indexing │
 │                  │                             │ are not publicly            │
 │                  │                             │ documented. MIT project     │
 │                  │                             │ proposal references FPGA    │
 │                  │                             │ LOB but lacks               │
 │                  │                             │ implementation details.     │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Exact data structures used  │ No public disclosure exists │
 │                  │ by specific named HFT firms │ for the specific indexing   │
 │                  │                             │ implementations of major    │
 │                  │                             │ HFT firms (e.g., Virtu, Two │
 │                  │                             │ Sigma, Jump Trading). All   │
 │                  │                             │ evidence is from            │
 │                  │                             │ practitioners sharing       │
 │                  │                             │ general principles or       │
 │                  │                             │ academic reconstructions.   │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ scope_exceeded   │ Co-location-specific memory │ NUMA-aware memory           │
 │                  │ topology optimization for   │ allocation and CPU affinity │
 │                  │ LOB                         │ strategies for LOB          │
 │                  │                             │ processes in co-located     │
 │                  │                             │ environments are referenced │
 │                  │                             │ but not detailed in         │
 │                  │                             │ available sources.          │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Crypto-specific LOB         │ While one Medium article    │
 │                  │ indexing differences vs     │ covers crypto HFT system    │
 │                  │ equity markets              │ design, it does not detail  │
 │                  │                             │ how LOB indexing strategies │
 │                  │                             │ differ for 24/7 crypto      │
 │                  │                             │ markets with different tick │
 │                  │                             │ structures.                 │
 └──────────────────┴─────────────────────────────┴─────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ arxiv             │ FPGA order book   │ The MIT HFT       │
 │                  │                   │ matching engine   │ Accelerator paper │
 │                  │                   │ hardware          │ and FPGA          │
 │                  │                   │ implementation    │ references        │
 │                  │                   │ nanosecond        │ suggest           │
 │                  │                   │ latency           │ significant       │
 │                  │                   │                   │ unpublished work  │
 │                  │                   │                   │ on                │
 │                  │                   │                   │ hardware-accelera │
 │                  │                   │                   │ ted LOB indexing  │
 │                  │                   │                   │ that would        │
 │                  │                   │                   │ directly answer   │
 │                  │                   │                   │ the proprietary   │
 │                  │                   │                   │ indexing question │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ limit order book  │ Cache-oblivious   │
 │                  │                   │ data structure    │ structures like   │
 │                  │                   │ cache-oblivious   │ van Emde Boas     │
 │                  │                   │ van Emde Boas     │ trees are         │
 │                  │                   │ tree HFT          │ theoretically     │
 │                  │                   │                   │ optimal for LOB   │
 │                  │                   │                   │ operations but    │
 │                  │                   │                   │ not mentioned in  │
 │                  │                   │                   │ sources; academic │
 │                  │                   │                   │ literature may    │
 │                  │                   │                   │ document their    │
 │                  │                   │                   │ use               │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ new_source       │ database          │ Island ECN Levine │ The Island ECN    │
 │                  │                   │ order book ISAM   │ B-tree/ISAM       │
 │                  │                   │ indexing original │ reference is      │
 │                  │                   │ documentation     │ cited secondhand; │
 │                  │                   │ 1996              │ primary           │
 │                  │                   │                   │ documentation     │
 │                  │                   │                   │ would provide     │
 │                  │                   │                   │ authoritative     │
 │                  │                   │                   │ details on the    │
 │                  │                   │                   │ original          │
 │                  │                   │                   │ production        │
 │                  │                   │                   │ indexing strategy │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ order book        │ L3 order-by-order │
 │                  │                   │ reconstruction L3 │ reconstruction    │
 │                  │                   │ tick data index   │ requires          │
 │                  │                   │ compression high  │ per-order         │
 │                  │                   │ frequency         │ indexing by       │
 │                  │                   │                   │ order_id which    │
 │                  │                   │                   │ has different     │
 │                  │                   │                   │ data structure    │
 │                  │                   │                   │ requirements than │
 │                  │                   │                   │ L2 price-level    │
 │                  │                   │                   │ indexing          │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Do modern HFT firms use         │ Sources confirm cache-friendly  │
 │          │ NUMA-aware memory allocation    │ arrays dominate in production,  │
 │          │ strategies specifically tuned   │ but NUMA effects in             │
 │          │ for order book price-level      │ multi-socket co-located servers │
 │          │ index structures, and how does  │ are not addressed               │
 │          │ this interact with CPU cache    │                                 │
 │          │ topology?                       │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ How do HFT firms handle the     │ dxFeed documentation describes  │
 │          │ transition from snapshot-based  │ snapshot and transaction models │
 │          │ full order book state to        │ separately; the handoff between │
 │          │ incremental delta updates in    │ these modes in production       │
 │          │ their indexing layer without    │ indexing is not detailed        │
 │          │ introducing consistency gaps?   │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the practical           │ HftBacktest documents both      │
 │          │ throughput and latency tradeoff │ structures but does not provide │
 │          │ between ROIVectorMarketDepth    │ comparative benchmarks for edge │
 │          │ and FusedHashMapMarketDepth     │ cases like flash crashes where  │
 │          │ implementations under real      │ price moves outside the ROI     │
 │          │ market conditions with large    │                                 │
 │          │ price spikes?                   │                                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Does structural LOB filtration  │ The filtration paper shows      │
 │          │ (by order lifetime or update    │ improved OBI signal quality but │
 │          │ count) as proposed in the 2025  │ acknowledges limited gains in   │
 │          │ arxiv paper degrade order book  │ causal excitation;              │
 │          │ reconstruction accuracy under   │ accuracy-speed tradeoff for     │
 │          │ normal market conditions        │ indexing filtered vs raw        │
 │          │ compared to raw feeds?          │ streams is unresolved           │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ low      │ How do exchanges like LMAX,     │ The electronictradinghub        │
 │          │ Tokyo Stock Exchange, and NSE   │ article cites these exchanges   │
 │          │ India differ in their           │ as modern evidence but does not │
 │          │ recommended order book          │ detail their specific           │
 │          │ reconstruction protocols, and   │ reconstruction protocol         │
 │          │ do these differences force      │ differences                     │
 │          │ different indexing strategies   │                                 │
 │          │ on client-side HFT systems?     │                                 │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.72                                                                │
 │ Corroborating sources: 8                                                     │
 │ Source authority: medium                                                     │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.65                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 70892                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 97.77s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: f4c43973-7cac-4193-a249-cbb1302de4f7
--- a/docs/stress-tests/M3.3-runs/17-scope.log
+++ b/docs/stress-tests/M3.3-runs/17-scope.log
@ -1,344 +0,0 @@
 Researching: What is the actual operational doctrine of Chinese DF-41 ICBM 
 brigades?
 {"question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:12:51.608714Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:12:52.450376Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:12:52.459819Z"}
 {"question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:12:52.495811Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:52.496319Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:52.496431Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1194, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:05.548923Z"}
 {"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 8831, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:18.572224Z"}
 {"step": 23, "decision": "Token budget reached before iteration 4: 31917/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:36.495991Z"}
 {"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 31, "iterations_run": 3, "tokens_used": 31917, "event": "synthesis_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:36.496215Z"}
 {"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 90409, "event": "synthesis_complete", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:04.659059Z"}
 {"step": 46, "decision": "Research complete", "confidence": 0.72, "citation_count": 12, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 136.645, "event": "complete", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:04.687651Z"}
 {"confidence": 0.72, "citations": 12, "gaps": 4, "discovery_events": 4, "tokens_used": 62857, "iterations_run": 3, "wall_time_sec": 132.16255736351013, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:15:04.687981Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:15:04.688728Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:15:04.696829Z"}
 {"trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "confidence": 0.72, "citations": 12, "tokens_used": 62857, "wall_time_sec": 132.16255736351013, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:15:04.924751Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Chinese DF-41 ICBM brigade operational doctrine encompasses several key      │
 │ elements based on open-source intelligence and defense analysis:             │
 │                                                                              │
 │ **Basing and Mobility**: DF-41 brigades operate under a tri-basing doctrine  │
 │ employing road-mobile, rail-mobile, and silo-based launchers. The            │
 │ road-mobile variant uses the Tian HTF5980 16×16 wheeled chassis. Silo        │
 │ construction has accelerated since 2021 with three new solid-fuel ICBM silo  │
 │ fields identified in northern China. [Sources: MDAA, CSIS Missile Threat,    │
 │ FAS]                                                                         │
 │                                                                              │
 │ **Alert Posture and Launch Doctrine**: The PLARF is working to implement a   │
 │ launch-on-warning (LOW) posture. Brigades now strive to keep at least part   │
 │ of their force in a higher state of readiness, representing a significant    │
 │ shift from China's historically relaxed alert posture where warheads were    │
 │ stored separately from missiles. [Sources: Air University/PLARF Nuclear      │
 │ Warhead Management, NDU]                                                     │
 │                                                                              │
 │ **Warhead Management**: Historically, Chinese ICBMs stored warheads          │
 │ separately from missiles ('de-mated'). The shift toward LOW requires         │
 │ warheads to be mated or at least rapidly mateable to delivery systems. As of │
 │ the 2025 FAS Nuclear Notebook, China possesses approximately 600 warheads,   │
 │ with DF-41 launchers armed with either a single ~1 MT warhead or up to 10    │
 │ MIRV warheads (20/90/150 KT yield variants). [Sources: FAS 2025, MDAA]       │
 │                                                                              │
 │ **Force Structure**: As of 2020-2023, two brigades were confirmed operating  │
 │ DF-41 when it appeared in the 2019 parade. The CNS 2023 Order of Battle      │
 │ identifies Base 64 (Lanzhou HQ) Brigade 644 (Hanzhong) as a rumored DF-41    │
 │ integration base. Additional brigades under Base 63 are suspected. [Sources: │
 │ Bulletin PLARF Force Structure Table 2020, CNS OOB 2023]                     │
 │                                                                              │
 │ **Camouflage and Concealment**: Mobile DF-41 units employ camouflage netting │
 │ and disperse into forests and tunnels during exercises, consistent with      │
 │ PLARF general doctrine of 'hiding and waiting.' [Sources: Al                 │
 │ Arabiya/Facebook report]                                                     │
 │                                                                              │
 │ **No-First-Use and Deterrence**: Chinese doctrine officially maintains a     │
 │ no-first-use (NFU) posture, with the DF-41 serving as a second-strike        │
 │ deterrent. However, the silo expansion and LOW posture shift have raised     │
 │ questions among analysts about whether NFU remains operationally intact.     │
 │ [Sources: The Mandarin, FAS 2025]                                            │
 │                                                                              │
 │ **Range and Target Coverage**: With a range of 12,000–15,000 km, DF-41       │
 │ brigades based in central/northern China can target the entire continental   │
 │ United States, making them the primary strategic countervalue and            │
 │ counterforce deterrent against the US. [Sources: MDAA, CSIS Missile Threat]  │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Dong Feng-41(CSS-X-20)        │ The DF-41 has a range of       │  0.90 │
 │     │ https://www.missiledefenseadv │ 12,000-15,000 km (able to      │       │
 │     │ ocacy.org/missile-threat-and- │ target half to all of the      │       │
 │     │ proliferation/todays-missile- │ continental U.S.), can carry   │       │
 │     │ threat/china/df-41/           │ multiple independently         │       │
 │     │                               │ targetable reentry vehicles    │       │
 │     │                               │ (MIRVs), and is rail-or        │       │
 │     │                               │ road-mobile. The DF-41 is      │       │
 │     │                               │ solid propelled and can carry  │       │
 │     │                               │ a payload of up to 2500 kg.    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ DF-41 (Dong Feng-41 /         │ The DF-41 (Dong Feng [East     │  0.92 │
 │     │ CSS-X-20) | Missile Threat    │ Wind]-41, CSS-20) is Chinese   │       │
 │     │ https://missilethreat.csis.or │ road-mobile intercontinental   │       │
 │     │ g/missile/df-41/              │ ballistic missile (ICBM). It   │       │
 │     │                               │ has an operational range of up │       │
 │     │                               │ to 15,000 km, making it        │       │
 │     │                               │ China's longest-range missile, │       │
 │     │                               │ and is reportedly capable of   │       │
 │     │                               │ loading multiple               │       │
 │     │                               │ independently-targeted         │       │
 │     │                               │ warheads (MIRV).               │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ PLA Rocket Force Nuclear      │ PLARF is working to implement  │  0.88 │
 │     │ Warhead Management - Air      │ a launch-on-warning (LOW)      │       │
 │     │ University                    │ posture, and brigades now      │       │
 │     │ https://www.airuniversity.af. │ strive to keep at least part   │       │
 │     │ edu/Portals/10/CASI/documents │ of their force in a state of   │       │
 │     │ /Research/Infrastructure/2026 │                                │       │
 │     │ -03-09%20PLARF%20Nuclear%20Wa │                                │       │
 │     │ rhead%20Management.pdf        │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ IMPLICATIONS OF A PRC SHIFT   │ The PLARF has adjusted its     │  0.87 │
 │     │ TO A LAUNCH-ON-WARNING        │ nuclear warhead storage and    │       │
 │     │ https://inss.ndu.edu/LinkClic │ handling practices and         │       │
 │     │ k.aspx?fileticket=kU27dwWHUvU │ training to support regular    │       │
 │     │ %3D&portalid=82               │ alert status. A LOW posture,   │       │
 │     │                               │ which requires ICBM units      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Chinese nuclear weapons, 2025 │ China has continued to develop │  0.95 │
 │     │ - Federation of American      │ its three new missile silo     │       │
 │     │ Scientists                    │ fields for solid-fuel          │       │
 │     │ https://fas.org/wp-content/up │ intercontinental ballistic     │       │
 │     │ loads/2025/03/Chinese-nuclear │ missiles (ICBMs)...has been    │       │
 │     │ -weapons-2025.pdf             │ developing new variants of     │       │
 │     │                               │ ICBMs and advanced strategic   │       │
 │     │                               │ delivery systems, and has      │       │
 │     │                               │ likely produced excess         │       │
 │     │                               │ warheads for these systems     │       │
 │     │                               │ once they are deployed.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ New Missile Silo And DF-41    │ The photos also show that 18   │  0.90 │
 │     │ Launchers Seen In Chinese     │ road-mobile launchers of the   │       │
 │     │ Nuclear Missile Training Area │ long-awaited DF-41 ICBM were   │       │
 │     │ - FAS                         │ training in the area in        │       │
 │     │ https://fas.org/publication/c │ April-May 2019 together with   │       │
 │     │ hina-silo-df41/               │ launchers for the DF-31AG      │       │
 │     │                               │ ICBM, possibly the DF-5B ICBM, │       │
 │     │                               │ the DF-26 IRBM, and the DF-21  │       │
 │     │                               │ MRBM. Altogether, more than 72 │       │
 │     │                               │ missile launchers can be seen  │       │
 │     │                               │ operating together.            │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Table 2: PLARF Missile Force  │ 644 Brigade Hanzhong (33.1321, │  0.85 │
 │     │ Structure 2020                │ 106.9361) (DF-41) (Yes)        │       │
 │     │ https://thebulletin.org/wp-co │ Rumored DF-41 integration      │       │
 │     │ ntent/uploads/2020/12/Kristen │ base.                          │       │
 │     │ sen-Korda_Nov-Dec-China-Table │                                │       │
 │     │ 2_final.pdf                   │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Understanding the People's    │ The DF-41 will likely replace  │  0.88 │
 │     │ Liberation Army Rocket Force  │ older ICBMs in the Chinese     │       │
 │     │ https://www.armyupress.army.m │ arsenal and will carry either  │       │
 │     │ il/Journals/Military-Review/E │ a single megaton warhead or up │       │
 │     │ nglish-Edition-Archives/China │ to ten MIRV smaller warheads.  │       │
 │     │ -Reader-Special-Edition-Septe │                                │       │
 │     │ mber-2021/Mihal-PLA-Rocket-Fo │                                │       │
 │     │ rce/                          │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ China's new missile silos     │ The discovery by researchers   │  0.82 │
 │     │ (hundreds of them)            │ at the James Martin Center for │       │
 │     │ https://www.themandarin.com.a │ Nonproliferation Studies in    │       │
 │     │ u/166656-china-military-watch │ California that 119 missile    │       │
 │     │ -2/                           │ silos were being built in the  │       │
 │     │                               │ desert near the city of Yumen  │       │
 │     │                               │ in the Gansu region suggested  │       │
 │     │                               │ a rapid expansion of China's   │       │
 │     │                               │ nuclear weapons capabilities.  │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ China is building more        │ The new underground silos are  │  0.84 │
 │     │ underground silos for its     │ located in the centre of the   │       │
 │     │ ballistic missiles | SCMP     │ Jilantai training base, within │       │
 │     │ https://www.scmp.com/news/chi │ a total area of 200 sq km, and │       │
 │     │ na/military/article/3125699/c │ are spaced between 2.2km and   │       │
 │     │ hina-building-more-undergroun │ 4.4km apart so that no two of  │       │
 │     │ d-silos-its-ballistic-missile │ them can be destroyed in a     │       │
 │     │ s                             │ single nuclear attack.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 11  │ China's Mobile ICBM Brigades: │ The PLARF is currently         │  0.75 │
 │     │ The DF-31 and DF-41           │ modernizing its                │       │
 │     │ https://www.aboyandhis.blog/p │ intercontinental ballistic     │       │
 │     │ ost/china-s-mobile-icbm-briga │ missile forces with two new    │       │
 │     │ des-the-df-31-and-df-41       │ mobile systems: the new DF-41  │       │
 │     │                               │ ballistic missile and the new  │       │
 │     │                               │ DF-31AG                        │       │
 │     │                               │ transporter-erector-launcher.. │       │
 │     │                               │ .The DF-41 is thought to be    │       │
 │     │                               │ out of development but has not │       │
 │     │                               │ yet moved into Operational     │       │
 │     │                               │ Testing and Evaluation (OT&E). │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 12  │ The 2024 DOD China Military   │ Other variables are how many   │  0.90 │
 │     │ Power Report - FAS            │ warheads are assigned to the   │       │
 │     │ https://fas.org/publication/t │ DF-26 IRBM launchers (probably │       │
 │     │ he-2024-dod-china-military-po │ not all of them), how many of  │       │
 │     │ wer-report/                   │ the six SSBNs have been        │       │
 │     │                               │ upgraded to the JL-3 SLBM and  │       │
 │     │                               │ whether it is assigned         │       │
 │     │                               │ multiple warheads, and how     │       │
 │     │                               │ many DF-41 ICBM launchers are  │       │
 │     │                               │ operational and how many       │       │
 │     │                               │ warheads each missile is       │       │
 │     │                               │ assigned.                      │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                    ┃ Detail                    ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Exact number of          │ Open sources confirm at   │
 │                       │ operational DF-41        │ least two brigades as of  │
 │                       │ brigades and launchers   │ 2019 parade, with         │
 │                       │ as of 2025               │ additional brigades       │
 │                       │                          │ suspected, but no         │
 │                       │                          │ authoritative public      │
 │                       │                          │ count of currently        │
 │                       │                          │ operational DF-41         │
 │                       │                          │ launchers exists as of    │
 │                       │                          │ 2025.                     │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ scope_exceeded        │ Specific warhead mating  │ Detailed operational      │
 │                       │ protocols and            │ warhead handling          │
 │                       │ pre-delegation authority │ procedures, command       │
 │                       │ for DF-41 brigades       │ authority thresholds, and │
 │                       │                          │ pre-delegation rules for  │
 │                       │                          │ DF-41 brigades are        │
 │                       │                          │ classified and not        │
 │                       │                          │ available in open         │
 │                       │                          │ sources.                  │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ contradictory_sources │ Confirmed rail-mobile    │ Multiple sources indicate │
 │                       │ DF-41 operational        │ rail-mobile DF-41 was     │
 │                       │ deployment               │ tested and considered,    │
 │                       │                          │ but no sources confirm it │
 │                       │                          │ has been operationally    │
 │                       │                          │ deployed in that basing   │
 │                       │                          │ mode as of 2025.          │
 ├───────────────────────┼──────────────────────────┼───────────────────────────┤
 │ access_denied         │ Full CNS 2023 Order of   │ The PDF was identified    │
 │                       │ Battle PDF content on    │ but binary content could  │
 │                       │ DF-41 brigades           │ not be fully parsed to    │
 │                       │                          │ extract specific DF-41    │
 │                       │                          │ brigade details from the  │
 │                       │                          │ 2023 CNS Order of Battle. │
 └───────────────────────┴──────────────────────────┴───────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ new_source       │ database          │ PLARF DF-41       │ The 2023 CNS      │
 │                  │                   │ brigade order of  │ Order of Battle   │
 │                  │                   │ battle 2024 2025  │ is the most       │
 │                  │                   │ silo field        │ recent structured │
 │                  │                   │ deployment        │ OOB but may be    │
 │                  │                   │                   │ outdated given    │
 │                  │                   │                   │ rapid 2024-2025   │
 │                  │                   │                   │ expansion.        │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ China DF-41       │ The LOW posture   │
 │                  │                   │ launch on warning │ shift is          │
 │                  │                   │ posture warhead   │ documented but    │
 │                  │                   │ mating 2024 2025  │ the degree to     │
 │                  │                   │                   │ which DF-41       │
 │                  │                   │                   │ brigades          │
 │                  │                   │                   │ specifically have │
 │                  │                   │                   │ implemented it    │
 │                  │                   │                   │ versus older      │
 │                  │                   │                   │ systems is        │
 │                  │                   │                   │ unclear.          │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ China nuclear no  │ The silo          │
 │                  │                   │ first use         │ expansion and LOW │
 │                  │                   │ doctrine DF-41    │ posture raise     │
 │                  │                   │ silo expansion    │ academic          │
 │                  │                   │ strategic         │ questions about   │
 │                  │                   │ stability         │ NFU credibility   │
 │                  │                   │                   │ that may be       │
 │                  │                   │                   │ addressed in      │
 │                  │                   │                   │ recent strategic  │
 │                  │                   │                   │ studies           │
 │                  │                   │                   │ literature.       │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ contradiction    │ null              │ DF-41 rail-mobile │ MDAA lists        │
 │                  │                   │ deployment status │ rail-mobile as an │
 │                  │                   │ operational vs    │ operational       │
 │                  │                   │ testing           │ basing mode,      │
 │                  │                   │                   │ while FAS and     │
 │                  │                   │                   │ CSIS sources      │
 │                  │                   │                   │ suggest it        │
 │                  │                   │                   │ remains in        │
 │                  │                   │                   │ testing/considera │
 │                  │                   │                   │ tion phase. This  │
 │                  │                   │                   │ contradiction     │
 │                  │                   │                   │ should be         │
 │                  │                   │                   │ investigated.     │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Has China fully transitioned to │ Air University and NDU sources  │
 │          │ a launch-on-warning posture for │ confirm PLARF is 'working to    │
 │          │ DF-41 brigades, or is this      │ implement' LOW, but the degree  │
 │          │ still aspirational?             │ of actual implementation vs.    │
 │          │                                 │ doctrinal aspiration is         │
 │          │                                 │ ambiguous.                      │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ How many DF-41 silos in the     │ Reuters December 2025 report    │
 │          │ three new silo fields           │ indicates 100+ solid-fuel ICBMs │
 │          │ (Yumen/Gansu, Hami/Xinjiang,    │ loaded in silo fields; FAS 2025 │
 │          │ Ordos/Inner Mongolia) are now   │ notes continued silo            │
 │          │ loaded with missiles as of      │ development. The DF-41 vs DF-31 │
 │          │ 2025?                           │ breakdown in these silos is     │
 │          │                                 │ unclear.                        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ What is the command-and-control │ LOW posture implies faster      │
 │          │ structure for DF-41 brigades —  │ decision timelines, raising     │
 │          │ do brigade commanders have any  │ questions about whether China   │
 │          │ pre-delegated launch authority? │ has moved toward any degree of  │
 │          │                                 │ pre-delegation, which would be  │
 │          │                                 │ a major doctrinal shift.        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Has the DF-41 rail-mobile       │ Rail-mobile tests were reported │
 │          │ variant been operationally      │ in December 2015, and the 2019  │
 │          │ deployed with any PLARF         │ Pentagon report noted China     │
 │          │ brigade?                        │ 'appears to be considering'     │
 │          │                                 │ rail-mobile basing, but no      │
 │          │                                 │ confirmed operational           │
 │          │                                 │ deployment has been identified. │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the specific MIRV       │ FAS 2025 notes uncertainty      │
 │          │ loading assignment doctrine for │ about how many warheads each    │
 │          │ operational DF-41 missiles —    │ DF-41 is assigned in practice,  │
 │          │ are they typically deployed     │ which significantly affects     │
 │          │ with maximum warhead loads or   │ strategic stability             │
 │          │ reduced loads?                  │ calculations.                   │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.72                                                                │
 │ Corroborating sources: 12                                                    │
 │ Source authority: high                                                       │
 │ Contradiction detected: True                                                 │
 │ Query specificity match: 0.75                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 62857                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 132.16s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: b3d00938-5309-4faa-a20d-97a8511bb8f9
--- a/docs/stress-tests/M3.3-runs/18-scope.log
+++ b/docs/stress-tests/M3.3-runs/18-scope.log
@ -1,272 +0,0 @@
 Researching: What internal compensation bands does Goldman Sachs use for VPs in 
 2026?
 {"question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:15:05.792037Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:15:06.820624Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:15:06.829930Z"}
 {"question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:15:06.876139Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:06.876453Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:06.876542Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1108, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:14.969587Z"}
 {"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 5772, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:26.767509Z"}
 {"step": 17, "decision": "Starting iteration 4/5", "tokens_so_far": 15029, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:32.149418Z"}
 {"step": 22, "decision": "Token budget reached before iteration 5: 26452/20000", "event": "budget_exhausted", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:41.927200Z"}
 {"step": 23, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 31, "iterations_run": 4, "tokens_used": 26452, "event": "synthesis_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:41.927359Z"}
 {"step": 24, "decision": "Parsed synthesis JSON successfully", "duration_ms": 65550, "event": "synthesis_complete", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:45.258119Z"}
 {"step": 42, "decision": "Research complete", "confidence": 0.62, "citation_count": 10, "gap_count": 4, "discovery_count": 3, "total_duration_sec": 102.914, "event": "complete", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:45.259163Z"}
 {"confidence": 0.62, "citations": 10, "gaps": 4, "discovery_events": 3, "tokens_used": 51829, "iterations_run": 4, "wall_time_sec": 98.38188624382019, "budget_exhausted": true, "event": "research_completed", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:16:45.259280Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:16:45.259714Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:16:45.264223Z"}
 {"trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "confidence": 0.62, "citations": 10, "tokens_used": 51829, "wall_time_sec": 98.38188624382019, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:16:45.493130Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Goldman Sachs does not publicly disclose formal internal compensation bands  │
 │ for VPs. Based on available evidence, the VP title at Goldman Sachs is a     │
 │ single, wide-band level (there are no officially published sub-bands like    │
 │ VP1/VP2/VP3 at Goldman, unlike some other banks). Compensation varies        │
 │ enormously depending on division (front office vs. middle/back office) and   │
 │ seniority within the band. Key data points for 2026: (1) Glassdoor reports a │
 │ typical total pay range of $213,109–$391,379 (25th–75th percentile) across   │
 │ ~4,695 salary submissions, covering all VP roles firm-wide. (2) Levels.fyi   │
 │ reports a median total VP compensation of $144K, which likely skews toward   │
 │ tech/engineering roles. (3) 6figr reports an average of $297K (range         │
 │ $265K–$501K, top 10% up to $514K) based on 67 profiles. (4) For front-office │
 │ Investment Banking VPs specifically, Glassdoor reports a much higher range   │
 │ of $480,547–$888,585 (25th–75th percentile) based on 14 salaries. (5)        │
 │ Industry benchmarks from Mergers & Inquisitions (2026 update) place          │
 │ front-office IB VP base salary at $250–$300K with total compensation of      │
 │ $525–$800K for NY-based roles. (6) Indeed reports an average of ~$145,324,   │
 │ consistent with a broad mix of roles. Community sources (Fishbowl) confirm   │
 │ the VP band is 'very wide' with no official internal sub-levels at Goldman;  │
 │ pay differentiation happens informally by group, skillset, and front vs.     │
 │ back office status.                                                          │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Total salary range for        │ The typical pay range is       │  0.85 │
 │     │ Goldman Sachs Vice President  │ between $213,109 (25th         │       │
 │     │ - Glassdoor                   │ percentile) and $391,379 (75th │       │
 │     │ https://www.glassdoor.com/Sal │ percentile) annually. This is  │       │
 │     │ ary/Goldman-Sachs-Vice-Presid │ based on 4,695 salaries        │       │
 │     │ ent-Salaries-E2800_D_KO14,28. │ submitted by Goldman Sachs     │       │
 │     │ htm                           │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Total salary range for        │ The typical pay range is       │  0.85 │
 │     │ Goldman Sachs Vice President  │ between $220,674 (25th         │       │
 │     │ - Glassdoor                   │ percentile) and $411,924 (75th │       │
 │     │ https://www.glassdoor.com/Sal │ percentile) annually. This is  │       │
 │     │ ary/Goldman-Sachs-V-P-Salarie │ based on 4,695 salaries        │       │
 │     │ s-E2800_D_KO14,17.htm         │ submitted by Goldman Sachs     │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ Goldman Sachs Vice President  │ The median Vice President      │  0.75 │
 │     │ Salary | $110K-$144K+ |       │ compensation in United States  │       │
 │     │ Levels.fyi                    │ package at Goldman Sachs       │       │
 │     │ https://www.levels.fyi/compan │ totals $144K per year. View    │       │
 │     │ ies/goldman-sachs/salaries/vi │ the base salary, stock, and    │       │
 │     │ ce-president                  │ bonus breakdowns for Goldman   │       │
 │     │                               │ Sachs's total compensation     │       │
 │     │                               │ packages. Last updated:        │       │
 │     │                               │ 4/6/2026                       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Goldman Sachs Vice President  │ Employees at Goldman Sachs as  │  0.70 │
 │     │ Vp Salaries 2026 |            │ Vice President Vp earn an      │       │
 │     │ $265k-$514k                   │ average of $297k, mostly       │       │
 │     │ https://6figr.com/us/salary/g │ ranging from $265k per year to │       │
 │     │ oldman-sachs--vice-president- │ $501k per year based on 67     │       │
 │     │ vp                            │ profiles. The top 10%          │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ Goldman Sachs Investment      │ The typical pay range is       │  0.65 │
 │     │ Banking Vice President ...    │ between $480,547 (25th         │       │
 │     │ https://www.glassdoor.com/Sal │ percentile) and $888,585 (75th │       │
 │     │ ary/Goldman-Sachs-Investment- │ percentile) annually. This is  │       │
 │     │ Banking-Vice-President-Salari │ based on 14 salaries submitted │       │
 │     │ es-E2800_D_KO14,47.htm        │ by Goldman Sachs               │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Investment Banker Salary and  │ Vice President (VP) | 28-40 |  │  0.88 │
 │     │ Bonus Report: 2026 Update     │ $250-$300K | $525-$800K | 3-4  │       │
 │     │ https://mergersandinquisition │ years                          │       │
 │     │ s.com/investment-banker-salar │                                │       │
 │     │ y/                            │ NOTE: All numbers are pre-tax  │       │
 │     │                               │ for New York-based             │       │
 │     │                               │ front-office roles and include │       │
 │     │                               │ base salaries and year-end     │       │
 │     │                               │ bonuses but not                │       │
 │     │                               │ signing/relocation bonuses,    │       │
 │     │                               │ stub bonuses, benefits, etc.   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ Vice President yearly         │ Average Goldman Sachs Vice     │  0.70 │
 │     │ salaries in the United States │ President yearly pay in the    │       │
 │     │ at Goldman Sachs              │ United States is approximately │       │
 │     │ https://www.indeed.com/cmp/Go │ $145,324, which is 9% below    │       │
 │     │ ldman-Sachs/salaries/Vice-Pre │ the national average. Salary   │       │
 │     │ sident                        │ estimated from                 │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Are there internal levels/    │ Goldman VP band is very wide.  │  0.72 │
 │     │ bands within the VP tit... |  │ Promoted from associate and    │       │
 │     │ Fishbowl                      │ Next step md is difficult to   │       │
 │     │ https://www.fishbowlapp.com/p │ get.                           │       │
 │     │ ost/are-there-internal-levels │                                │       │
 │     │ -bands-within-the-vp-title-at │ Yes, banks have different      │       │
 │     │ -goldman-sachs-fwiw-this-is-f │ bands depending on skillset,   │       │
 │     │ or-a-nonbusiness-internal-str │ group within the firm, front   │       │
 │     │ ategy-kind                    │ office vs back office, etc     │       │
 │     │                               │                                │       │
 │     │                               │ Not Goldman though. It's just  │       │
 │     │                               │ VP                             │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ VP of FP&A at Goldman Sachs   │ FP&A is middle office at       │  0.65 │
 │     │ salary : r/FPandA - Reddit    │ banks, they won't make         │       │
 │     │ https://www.reddit.com/r/FPan │ anywhere near $400k at VP      │       │
 │     │ dA/comments/1dgguz5/vp_of_fpa │ level. Front office VP         │       │
 │     │ _at_goldman_sachs_salary/     │ positions will all clear over  │       │
 │     │                               │ $400k in a place               │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ Goldman Sachs Vp Salaries     │ 15 to 15 yrs. Base. $179k.     │  0.65 │
 │     │ 2026 | $208k-$586k -          │ Stocks / Yr. $21k. Bonus.      │       │
 │     │ 6figr.com                     │ $120k. Total Salary. $318k.    │       │
 │     │ https://6figr.com/us/salary/g │ Goldman Sachs Vp salary levels │       │
 │     │ oldman-sachs--vp              │ ranges from Vice President     │       │
 │     │                               │ (Accountant) upto              │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category              ┃ Topic                     ┃ Detail                   ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found      │ Official internal Goldman │ Goldman Sachs does not   │
 │                       │ Sachs VP compensation     │ publicly publish its     │
 │                       │ bands                     │ internal compensation    │
 │                       │                           │ bands or grade           │
 │                       │                           │ structures. No           │
 │                       │                           │ authoritative internal   │
 │                       │                           │ HR documentation was     │
 │                       │                           │ found. All data is from  │
 │                       │                           │ third-party crowdsourced │
 │                       │                           │ salary platforms.        │
 ├───────────────────────┼───────────────────────────┼──────────────────────────┤
 │ source_not_found      │ VP sub-band breakdown     │ Community sources        │
 │                       │ (VP1/VP2/VP3 equivalents) │ explicitly state Goldman │
 │                       │                           │ uses a single 'VP' title │
 │                       │                           │ with no formal           │
 │                       │                           │ sub-levels, unlike some  │
 │                       │                           │ peers. No granular       │
 │                       │                           │ sub-band salary data     │
 │                       │                           │ exists in any source     │
 │                       │                           │ reviewed.                │
 ├───────────────────────┼───────────────────────────┼──────────────────────────┤
 │ scope_exceeded        │ Non-US VP compensation    │ Some sources (e.g.,      │
 │                       │ bands                     │ AmbitionBox) reference   │
 │                       │                           │ India-based VP salaries  │
 │                       │                           │ (₹49.4L–₹54.6L), but     │
 │                       │                           │ comprehensive            │
 │                       │                           │ international band data  │
 │                       │                           │ was not gathered. The    │
 │                       │                           │ question context appears │
 │                       │                           │ US-focused.              │
 ├───────────────────────┼───────────────────────────┼──────────────────────────┤
 │ contradictory_sources │ Levels.fyi median         │ Levels.fyi reports a     │
 │                       │ discrepancy               │ median of $144K while    │
 │                       │                           │ Glassdoor and 6figr      │
 │                       │                           │ report $213K–$411K       │
 │                       │                           │ ranges. Levels.fyi       │
 │                       │                           │ likely captures          │
 │                       │                           │ engineering/tech VPs who │
 │                       │                           │ have different           │
 │                       │                           │ compensation structures  │
 │                       │                           │ and lower base pay than  │
 │                       │                           │ finance VPs.             │
 └───────────────────────┴───────────────────────────┴──────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ contradiction    │ database          │ Goldman Sachs VP  │ Large discrepancy │
 │                  │                   │ total             │ between           │
 │                  │                   │ compensation by   │ Levels.fyi ($144K │
 │                  │                   │ division 2025     │ median) and       │
 │                  │                   │ 2026              │ Glassdoor         │
 │                  │                   │                   │ ($213K–$391K      │
 │                  │                   │                   │ range) suggests   │
 │                  │                   │                   │ the VP population │
 │                  │                   │                   │ is heterogeneous  │
 │                  │                   │                   │ across tech and   │
 │                  │                   │                   │ finance           │
 │                  │                   │                   │ functions;        │
 │                  │                   │                   │ further           │
 │                  │                   │                   │ segmentation by   │
 │                  │                   │                   │ division would    │
 │                  │                   │                   │ resolve this.     │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ null              │ Goldman Sachs     │ Understanding how │
 │                  │                   │ internal grade    │ Goldman's VP band │
 │                  │                   │ structure VP      │ maps to peer      │
 │                  │                   │ Director MD 2026  │ banks' grade      │
 │                  │                   │                   │ systems would     │
 │                  │                   │                   │ clarify the wide  │
 │                  │                   │                   │ compensation      │
 │                  │                   │                   │ range observed.   │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ null              │ Goldman Sachs     │ Mergers &         │
 │                  │                   │ 2025 bonus pool   │ Inquisitions      │
 │                  │                   │ VP payout by      │ notes senior      │
 │                  │                   │ division          │ bankers (VPs+)    │
 │                  │                   │                   │ received          │
 │                  │                   │                   │ disproportionate  │
 │                  │                   │                   │ 2025 bonus        │
 │                  │                   │                   │ increases;        │
 │                  │                   │                   │ division-level    │
 │                  │                   │                   │ data would        │
 │                  │                   │                   │ sharpen the band  │
 │                  │                   │                   │ picture.          │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ Does Goldman Sachs use any      │ Fishbowl community posts        │
 │          │ informal internal seniority     │ confirm the VP band is wide and │
 │          │ designations within the VP      │ pay varies significantly, but   │
 │          │ title (e.g., junior VP vs.      │ it is unclear whether informal  │
 │          │ senior VP) that affect          │ tracking of seniority within    │
 │          │ compensation but are not        │ the band drives structured pay  │
 │          │ publicly disclosed?             │ steps.                          │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ How did 2025 year-end bonuses   │ Mergers & Inquisitions notes    │
 │          │ for Goldman Sachs VPs compare   │ that VPs and Directors saw      │
 │          │ to the prior year, and were     │ 10–15% total comp increases in  │
 │          │ front-office VPs                │ 2025, but Goldman-specific      │
 │          │ disproportionate beneficiaries? │ figures were not isolated.      │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Why does Levels.fyi report a    │ The discrepancy likely reflects │
 │          │ $144K median for Goldman Sachs  │ different user populations      │
 │          │ VPs when Glassdoor and 6figr    │ (tech-focused on Levels.fyi vs. │
 │          │ report ranges starting at       │ finance-focused on              │
 │          │ $213K–$265K?                    │ Glassdoor/6figr), but this has  │
 │          │                                 │ not been confirmed.             │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What is the typical             │ Fishbowl notes the VP band is   │
 │          │ time-in-grade for a Goldman     │ wide and the step to MD is      │
 │          │ Sachs VP before promotion to    │ difficult; Mergers &            │
 │          │ Managing Director, and does     │ Inquisitions gives a 3–4 year   │
 │          │ longer tenure correlate with    │ promotion window for VPs across │
 │          │ meaningfully higher within-band │ large banks.                    │
 │          │ pay?                            │                                 │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.62                                                                │
 │ Corroborating sources: 8                                                     │
 │ Source authority: medium                                                     │
 │ Contradiction detected: True                                                 │
 │ Query specificity match: 0.55                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 51829                                                                │
 │ Iterations: 4                                                                │
 │ Wall time: 98.38s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: 716e548a-ceaf-4d18-8b47-ac35e3460b52
--- a/docs/stress-tests/M3.3-runs/19-scope.log
+++ b/docs/stress-tests/M3.3-runs/19-scope.log
@ -1,343 +0,0 @@
 Researching: How does Renaissance Technologies Medallion Fund actually generate 
 alpha?
 {"question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:16:46.074147Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:16:46.829107Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:16:46.837149Z"}
 {"question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:16:46.869281Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:46.869587Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:46.869675Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1104, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:56.914799Z"}
 {"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8370, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:03.842868Z"}
 {"step": 21, "decision": "Token budget reached before iteration 4: 20077/20000", "event": "budget_exhausted", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:13.960507Z"}
 {"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 23, "iterations_run": 3, "tokens_used": 20077, "event": "synthesis_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:13.961508Z"}
 {"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74831, "event": "synthesis_complete", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:25.398868Z"}
 {"step": 42, "decision": "Research complete", "confidence": 0.82, "citation_count": 10, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 101.925, "event": "complete", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:25.400004Z"}
 {"confidence": 0.82, "citations": 10, "gaps": 4, "discovery_events": 4, "tokens_used": 43096, "iterations_run": 3, "wall_time_sec": 98.52941536903381, "budget_exhausted": true, "event": "research_completed", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:18:25.400108Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:18:25.400618Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:18:25.405316Z"}
 {"trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "confidence": 0.82, "citations": 10, "tokens_used": 43096, "wall_time_sec": 98.52941536903381, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:18:25.623416Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ Renaissance Technologies' Medallion Fund generates alpha through several     │
 │ reinforcing mechanisms, all grounded in quantitative and data-driven methods │
 │ rather than traditional financial intuition:                                 │
 │                                                                              │
 │ 1. **Statistical Arbitrage & Pattern Recognition**: The fund identifies      │
 │ subtle, recurring market inefficiencies and pricing anomalies by analyzing   │
 │ vast amounts of historical and real-time data. It profits from small         │
 │ mispricings across many trades rather than large directional bets. [Sources  │
 │ 3, 6, 8]                                                                     │
 │                                                                              │
 │ 2. **Advanced Mathematical & Quantitative Models**: Renaissance employs      │
 │ sophisticated statistical models, hidden Markov models (used as early as     │
 │ 1983), and continuously refined algorithms to predict short-term price       │
 │ movements. The firm hired mathematicians, physicists, and computer           │
 │ scientists—not traditional Wall Street traders—to build these models.        │
 │ [Sources 9, 16, 21, 23]                                                      │
 │                                                                              │
 │ 3. **Machine Learning & AI Integration**: Medallion continuously refines its │
 │ models using machine learning, allowing them to adapt to changing market     │
 │ conditions and discover non-obvious patterns. [Sources 6, 8]                 │
 │                                                                              │
 │ 4. **High-Frequency, Fully Automated Trading**: The fund executes            │
 │ 150,000–300,000 trades daily through fully automated systems, eliminating    │
 │ emotional bias and exploiting fleeting inefficiencies at scale. [Source 8]   │
 │                                                                              │
 │ 5. **Market-Neutral & Diversified Strategies**: By balancing long and short  │
 │ positions across many asset classes (equities, futures, options, currencies) │
 │ and geographies, the fund reduces exposure to broad market moves. This is    │
 │ evidenced by the fund returning +74.6% in 2008 when markets crashed.         │
 │ [Sources 6, 16]                                                              │
 │                                                                              │
 │ 6. **Leverage & Risk Management via Kelly Criterion**: Medallion uses        │
 │ significant leverage combined with disciplined risk management techniques,   │
 │ including the Kelly Criterion, to size positions optimally and control       │
 │ drawdown. [Sources 6, 8]                                                     │
 │                                                                              │
 │ 7. **Extreme Secrecy & Employee-Only Structure**: The fund has been closed   │
 │ to outside investors since 1993, aligning incentives exclusively with        │
 │ employees and partners. This exclusivity prevents strategy dilution and      │
 │ protects proprietary edge. [Sources 5, 6, 12]                                │
 │                                                                              │
 │ 8. **Massive Data Collection & Cleaning**: Renaissance amasses and           │
 │ meticulously cleans enormous datasets of historical price data, economic     │
 │ indicators, and alternative data sources as the raw material for model       │
 │ building. [Sources 15, 21]                                                   │
 │                                                                              │
 │ 9. **Collaborative, Academic Culture**: Simons fostered an open, peer-driven │
 │ environment where ideas were freely shared among top-tier scientists,        │
 │ accelerating model refinement and discovery. [Sources 16, 21]                │
 │                                                                              │
 │ The cumulative result: average annual returns of 66% before fees and 39%     │
 │ after fees from 1988 to 2018—the best sustained track record in investment   │
 │ history. A $100 investment in 1988 would have grown to approximately $398.7  │
 │ million by 2018, versus $1,815 for the S&P 500 over the same period.         │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ Renaissance Technologies: The │ Between 1988 and 2018,         │  0.97 │
 │     │ $100 Billion Built on         │ Renaissance Technologies'      │       │
 │     │ Statistical Arbitrage         │ Medallion Fund generated       │       │
 │     │ https://navnoorbawa.substack. │ average annual returns of 66%  │       │
 │     │ com/p/renaissance-technologie │ before fees and 39% after fees │       │
 │     │ s-the-100                     │ — the most successful track    │       │
 │     │                               │ record in investing history. A │       │
 │     │                               │ $100 investment in 1988 would  │       │
 │     │                               │ have grown to approximately    │       │
 │     │                               │ $398.7 million by 2018.        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ Jim Simons Trading Strategy   │ Fully automated systems        │  0.93 │
 │     │ Explained: Inside Renaissance │ executed 150,000–300,000       │       │
 │     │ Technologies                  │ trades daily, eliminating      │       │
 │     │ https://www.quantvps.com/blog │ emotional biases. Techniques   │       │
 │     │ /jim-simons-trading-strategy  │ like the Kelly Criterion and   │       │
 │     │                               │ balanced portfolios helped     │       │
 │     │                               │ control risk and maintain      │       │
 │     │                               │ consistent returns.            │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ The Curious Case of Medallion │ The fund employs sophisticated │  0.92 │
 │     │ Fund: Renaissance             │ statistical and mathematical   │       │
 │     │ Technologies' Hedge Fund      │ models to identify and         │       │
 │     │ Success                       │ capitalize on market           │       │
 │     │ https://www.schoolofhedge.com │ inefficiencies. Medallion      │       │
 │     │ /pages/the-curious-case-of-me │ integrates machine learning    │       │
 │     │ dallion-fund                  │ and artificial intelligence to │       │
 │     │                               │ refine its models continually, │       │
 │     │                               │ adapting to changing market    │       │
 │     │                               │ conditions.                    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ Decoding the Medallion Fund   │ The Medallion Fund boasts an   │  0.95 │
 │     │ Returns: What We Know About   │ unprecedented average annual   │       │
 │     │ Its Annual Performance        │ return of 66% before fees over │       │
 │     │ https://www.quantifiedstrateg │ 30 years, achieving a net      │       │
 │     │ ies.com/medallion-fund-return │ return of 39% after fees. The  │       │
 │     │ s/                            │ Medallion Fund has been closed │       │
 │     │                               │ to outside investors since     │       │
 │     │                               │ 1993 and is only available to  │       │
 │     │                               │ current and past employees and │       │
 │     │                               │ their families.                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ James Simons (Renaissance     │ In 1983 he was using Hidden    │  0.85 │
 │     │ Technologies Corp.) and his   │ Markov Models. Now he employs  │       │
 │     │ model - Quantitative Finance  │ 100+ PhDs, therefore I expect  │       │
 │     │ Stack Exchange                │ he will have 50+ strategies    │       │
 │     │ https://quant.stackexchange.c │ using 200+ predictors. And set │       │
 │     │ om/questions/30056/james-simo │ up as a production line, from  │       │
 │     │ ns-renaissance-technologies-c │ the teams importing and        │       │
 │     │ orp-and-his-model             │ cleaning data, down to         │       │
 │     │                               │ execution of trades.           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ Simons' Strategies:           │ Market-Neutral Strategies:     │  0.91 │
 │     │ Renaissance Trading Unpacked  │ Balancing long and short       │       │
 │     │ - LuxAlgo                     │ positions reduces risk. Unique │       │
 │     │ https://www.luxalgo.com/blog/ │ Hiring: Scientists and         │       │
 │     │ simons-strategies-renaissance │ mathematicians, not Wall       │       │
 │     │ -trading-unpacked/            │ Street veterans, build their   │       │
 │     │                               │ trading models. Even during    │       │
 │     │                               │ crashes like 2008, Medallion   │       │
 │     │                               │ outperformed with a 74.6%      │       │
 │     │                               │ return.                        │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ The Man Who Solved the Market │ Renaissance's success was      │  0.93 │
 │     │ by Gregory Zuckerman -        │ built on amassing and          │       │
 │     │ Summary & Notes               │ meticulously cleaning vast     │       │
 │     │ https://bagerbach.com/books/t │ amounts of historical price    │       │
 │     │ he-man-who-solved-the-market/ │ data, then using it to model   │       │
 │     │                               │ and predict market behavior.   │       │
 │     │                               │ They treated investing like a  │       │
 │     │                               │ scientific problem, forming    │       │
 │     │                               │ hypotheses, testing them       │       │
 │     │                               │ rigorously, and iterating      │       │
 │     │                               │ constantly.                    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ Cracking the Code: Inside the │ Medallion began as an          │  0.88 │
 │     │ Medallion Fund and Jim        │ experiment in pattern          │       │
 │     │ Simons' Secretive Empire      │ recognition. Over time, it     │       │
 │     │ https://medium.com/@trading.d │ evolved into a fully           │       │
 │     │ ude/cracking-the-code-inside- │ automated, high-frequency,     │       │
 │     │ the-medallion-fund-and-jim-si │ multi-strategy quant           │       │
 │     │ mons-secretive-empire-b9af084 │ powerhouse. It traded          │       │
 │     │ 15b4f                         │ everything from equities to    │       │
 │     │                               │ futures.                       │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Renaissance Technologies and  │ Renaissance Technologies,      │  0.92 │
 │     │ The Medallion Fund            │ often just referred to as      │       │
 │     │ https://quartr.com/insights/e │ RenTec, is reputed as the      │       │
 │     │ dge/renaissance-technologies- │ highest-performing investment  │       │
 │     │ and-the-medallion-fund        │ firms ever, with its Medallion │       │
 │     │                               │ Fund having returned a net     │       │
 │     │                               │ 90,129x to investors between   │       │
 │     │                               │ the years 1988-2022 leveraging │       │
 │     │                               │ a quantitative investment      │       │
 │     │                               │ approach.                      │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 10  │ Jim Simons – The Man Who      │ Simons decided to use a purely │  0.90 │
 │     │ Solved the Market - Build     │ systematic approach to avoid   │       │
 │     │ Alpha                         │ emotional rollercoasters and   │       │
 │     │ https://www.buildalpha.com/ji │ avoid common trading biases    │       │
 │     │ m-simons-the-man-who-solved-t │ that trip up most traders.     │       │
 │     │ he-market/                    │ Simons staffed the new fund,   │       │
 │     │                               │ Renaissance Technologies, with │       │
 │     │                               │ mathematicians, computer       │       │
 │     │                               │ scientists, and physicists to  │       │
 │     │                               │ pioneer.                       │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic                       ┃ Detail                      ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ access_denied    │ Specific algorithmic        │ Renaissance Technologies    │
 │                  │ details and signal types    │ maintains extreme secrecy   │
 │                  │ used by the Medallion Fund  │ around its specific trading │
 │                  │                             │ signals, factor exposures,  │
 │                  │                             │ and model architecture. No  │
 │                  │                             │ public source has ever      │
 │                  │                             │ confirmed the exact         │
 │                  │                             │ mathematical formulas,      │
 │                  │                             │ specific predictors, or     │
 │                  │                             │ strategy details. All       │
 │                  │                             │ evidence is from secondary  │
 │                  │                             │ sources and informed        │
 │                  │                             │ inference.                  │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Post-2018 performance data  │ Most verified return data   │
 │                  │ for the Medallion Fund      │ covers 1988-2018. Some      │
 │                  │                             │ sources reference           │
 │                  │                             │ performance through 2022    │
 │                  │                             │ but with less granular      │
 │                  │                             │ annual data. The fund does  │
 │                  │                             │ not file public performance │
 │                  │                             │ reports.                    │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Specific leverage ratios    │ While sources note that     │
 │                  │ used by the Medallion Fund  │ high leverage is a          │
 │                  │                             │ component of alpha          │
 │                  │                             │ generation, specific        │
 │                  │                             │ leverage multiples are not  │
 │                  │                             │ publicly disclosed and were │
 │                  │                             │ not found in the gathered   │
 │                  │                             │ evidence.                   │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Fee structure and its exact │ Sources confirm the fund    │
 │                  │ impact on net returns over  │ charges approximately 5%    │
 │                  │ time                        │ management and 44%          │
 │                  │                             │ performance fees            │
 │                  │                             │ (historically), but         │
 │                  │                             │ detailed year-by-year       │
 │                  │                             │ impact analysis was not     │
 │                  │                             │ found in the gathered       │
 │                  │                             │ evidence.                   │
 └──────────────────┴─────────────────────────────┴─────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ arxiv             │ statistical       │ Simons used       │
 │                  │                   │ arbitrage hidden  │ Hidden Markov     │
 │                  │                   │ Markov models     │ Models in 1983.   │
 │                  │                   │ financial markets │ Academic papers   │
 │                  │                   │ quantitative      │ on HMMs in        │
 │                  │                   │ trading           │ finance could     │
 │                  │                   │                   │ illuminate the    │
 │                  │                   │                   │ mathematical      │
 │                  │                   │                   │ foundation of     │
 │                  │                   │                   │ early Medallion   │
 │                  │                   │                   │ strategies.       │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ Kelly Criterion   │ The Kelly         │
 │                  │                   │ optimal position  │ Criterion is      │
 │                  │                   │ sizing hedge fund │ cited as a key    │
 │                  │                   │ leverage          │ risk management   │
 │                  │                   │ quantitative      │ tool; academic    │
 │                  │                   │ trading           │ literature could  │
 │                  │                   │                   │ clarify how it    │
 │                  │                   │                   │ specifically      │
 │                  │                   │                   │ contributes to    │
 │                  │                   │                   │ alpha             │
 │                  │                   │                   │ sustainability.   │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ new_source       │ database          │ Renaissance       │ SEC 13F filings   │
 │                  │                   │ Technologies SEC  │ for Renaissance's │
 │                  │                   │ 13F filings RIEF  │ public-facing     │
 │                  │                   │ RIDA              │ funds (RIEF,      │
 │                  │                   │ institutional     │ RIDA) could       │
 │                  │                   │ holdings          │ provide insight   │
 │                  │                   │                   │ into equity       │
 │                  │                   │                   │ selection         │
 │                  │                   │                   │ methodology,      │
 │                  │                   │                   │ though not        │
 │                  │                   │                   │ Medallion         │
 │                  │                   │                   │ directly.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ null              │ Gregory Zuckerman │ The book by       │
 │                  │                   │ The Man Who       │ Zuckerman is      │
 │                  │                   │ Solved the Market │ cited as the most │
 │                  │                   │ primary source    │ authoritative     │
 │                  │                   │ analysis          │ public account of │
 │                  │                   │                   │ Renaissance's     │
 │                  │                   │                   │ methods; a deeper │
 │                  │                   │                   │ review could      │
 │                  │                   │                   │ yield more        │
 │                  │                   │                   │ specific          │
 │                  │                   │                   │ mechanism         │
 │                  │                   │                   │ details.          │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ How has the Medallion Fund      │ Multiple sources confirm the    │
 │          │ maintained its edge as markets  │ strategy has worked for 30+     │
 │          │ have become more efficient and  │ years, but with algorithmic     │
 │          │ other quant funds have adopted  │ trading now comprising 60-73%   │
 │          │ similar approaches?             │ of U.S. equity trades, the      │
 │          │                                 │ persistence of edge is          │
 │          │                                 │ theoretically challenging.      │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ What is the role of capacity    │ The fund is closed to outside   │
 │          │ constraints in limiting         │ investors and capped in size,   │
 │          │ Medallion's AUM, and how does   │ suggesting strategy returns     │
 │          │ the fund's small size (~$10B)   │ diminish at scale. This         │
 │          │ contribute to its returns?      │ capacity question is central to │
 │          │                                 │ understanding whether the alpha │
 │          │                                 │ is truly replicable.            │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ To what extent does Medallion's │ Sources describe both           │
 │          │ alpha come from market          │ high-frequency automated        │
 │          │ microstructure exploitation     │ trading and statistical         │
 │          │ (e.g., short-term mean          │ arbitrage, but the precise time │
 │          │ reversion) vs. longer-horizon   │ horizon distribution of trades  │
 │          │ factor exposures?               │ is unknown publicly.            │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How has Medallion's strategy    │ Jim Simons passed away in May   │
 │          │ evolved since Jim Simons'       │ 2024. The sustainability of the │
 │          │ retirement from day-to-day      │ fund's culture and edge under   │
 │          │ management and his death in May │ new leadership is an open       │
 │          │ 2024?                           │ question.                       │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ What specific alternative data  │ Sources mention 'alternative    │
 │          │ sources (beyond price/volume)   │ data sources' as inputs but     │
 │          │ does Renaissance use as inputs  │ provide no specifics, leaving   │
 │          │ to its models?                  │ this dimension of the alpha     │
 │          │                                 │ generation process unresolved.  │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.82                                                                │
 │ Corroborating sources: 10                                                    │
 │ Source authority: medium                                                     │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.75                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 43096                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 98.53s                                                            │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: b7cd9d50-3eec-4eca-8db0-a580722c2b19
--- a/docs/stress-tests/M3.3-runs/20-scope.log
+++ b/docs/stress-tests/M3.3-runs/20-scope.log
@ -1,325 +0,0 @@
 Researching: What are the precise materials and tolerances in TSMC's 2nm 
 process?
 {"question": "What are the precise materials and tolerances in TSMC's 2nm process?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:18:26.198498Z"}
 {"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:18:26.963097Z"}
 {"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:18:26.972484Z"}
 {"question": "What are the precise materials and tolerances in TSMC's 2nm process?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:18:27.004492Z"}
 {"step": 1, "decision": "Beginning research: depth=balanced", "question": "What are the precise materials and tolerances in TSMC's 2nm process?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:27.004812Z"}
 {"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:27.004904Z"}
 {"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1158, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:40.769568Z"}
 {"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 11802, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:47.013233Z"}
 {"step": 19, "decision": "Token budget reached before iteration 4: 30249/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:57.139804Z"}
 {"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 29, "iterations_run": 3, "tokens_used": 30249, "event": "synthesis_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:57.139984Z"}
 {"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 77777, "event": "synthesis_complete", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:20:12.633197Z"}
 {"step": 40, "decision": "Research complete", "confidence": 0.42, "citation_count": 9, "gap_count": 5, "discovery_count": 4, "total_duration_sec": 109.056, "event": "complete", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:20:12.634189Z"}
 {"confidence": 0.42, "citations": 9, "gaps": 5, "discovery_events": 4, "tokens_used": 62620, "iterations_run": 3, "wall_time_sec": 105.62861347198486, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:20:12.634324Z"}
 {"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:20:12.634698Z"}
 {"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:20:12.639617Z"}
 {"trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "confidence": 0.42, "citations": 9, "tokens_used": 62620, "wall_time_sec": 105.62861347198486, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:20:12.967147Z"}
 ╭─────────────────────────────────── Answer ───────────────────────────────────╮
 │ TSMC's 2nm (N2) process node, which began volume production in Q4 2025,      │
 │ introduces several key technical advances, though precise proprietary        │
 │ materials specifications and sub-angstrom tolerances are not publicly        │
 │ disclosed. What is publicly known:                                           │
 │                                                                              │
 │ **Transistor Architecture:** N2 is TSMC's first node to use Gate-All-Around  │
 │ (GAA) nanosheet transistors, replacing the FinFET architecture used since    │
 │ 2011. The gate surrounds the silicon nanosheet channel on all sides,         │
 │ providing superior electrostatic control and reduced gate leakage compared   │
 │ to 3nm FinFETs [Sources 10, 13, 21].                                         │
 │                                                                              │
 │ **Process Node Dimensions (IEEE IRDS):** The 2nm node class is projected to  │
 │ have a contacted gate pitch of ~45nm and a tightest metal pitch of ~20nm,    │
 │ per IEEE International Roadmap for Devices and Systems (2021 update) [Source │
 │ 16].                                                                         │
 │                                                                              │
 │ **Interconnects:** N2 features copper (Cu)-based redistribution layers       │
 │ (RDLs) with flat passivation and through-silicon vias (TSVs), co-optimized   │
 │ with 3DIC integration. Middle- and back-end-of-line (MEOL/BEOL)              │
 │ interconnects are included, with the densest SRAM macro ever reported at     │
 │ approximately 38 Mb/mm² [Sources 4, 21].                                     │
 │                                                                              │
 │ **Performance Metrics (vs. N3E):** 24–35% power reduction OR 15% performance │
 │ improvement at iso-voltage; >1.15x transistor density improvement over N3    │
 │ [Sources 10, 18, 21].                                                        │
 │                                                                              │
 │ **Yield:** Initial yields reportedly ~70%, with some memory products         │
 │ exceeding 90%. A 6% yield improvement over baseline was reported in late     │
 │ 2024 [Sources 13, 14].                                                       │
 │                                                                              │
 │ **Applications:** Designed for AI, mobile, and HPC applications. Key         │
 │ customers include Apple (A20 chip for iPhone 18 Pro) and NVIDIA [Sources 8,  │
 │ 14].                                                                         │
 │                                                                              │
 │ **Fab Locations:** Primary production in Hsinchu and Kaohsiung, Taiwan; a    │
 │ Kaohsiung 2nm facility expansion ceremony was held March 31, 2025 [Source    │
 │ 6].                                                                          │
 │                                                                              │
 │ **Specific proprietary materials** (e.g., exact dielectric compositions,     │
 │ gate oxide materials, metal liner chemistries, doping concentrations, and    │
 │ nanometer-level tolerances on nanosheet thickness/width) are not publicly    │
 │ disclosed by TSMC and were not found in the available evidence.              │
 ╰──────────────────────────────────────────────────────────────────────────────╯
                                   Citations                                    
 ┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
 ┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
 ┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
 │ 1   │ TSMC shares deep-dive details │ The new production node        │  0.95 │
 │     │ about its cutting edge 2nm    │ promises a 24 to 35% power     │       │
 │     │ process node at IEDM 2024 —   │ reduction or 15% performance   │       │
 │     │ 35 percent less power or 15   │ improvement at the same        │       │
 │     │ percent more performance |    │ voltage, and 1.15X higher      │       │
 │     │ Tom's Hardware                │ transistor density than the    │       │
 │     │ https://www.tomshardware.com/ │ previous 3nm node.             │       │
 │     │ tech-industry/tsmc-shares-dee │                                │       │
 │     │ p-dive-details-about-its-cutt │                                │       │
 │     │ ing-edge-2nm-process-node-at- │                                │       │
 │     │ iedm-2024-35-percent-less-pow │                                │       │
 │     │ er-or-15-percent-more-perform │                                │       │
 │     │ ance                          │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 2   │ IEDM 2024 – TSMC 2nm Process  │ The paper states that the      │  0.95 │
 │     │ Disclosure - TechInsights     │ process delivers a 30% power   │       │
 │     │ https://library.techinsights. │ improvement or 15% performance │       │
 │     │ com/public/hg-asset/f32a0f17- │ gain and >1.15x density versus │       │
 │     │ 5369-4c97-913c-b78d2ddd833b   │ the previous 3nm node.         │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 3   │ The Shape of Tomorrow's       │ The new N2 platform features   │  0.93 │
 │     │ Semiconductor Technology -    │ GAA nanosheet transistors;     │       │
 │     │ Semiconductor Digest          │ middle-/back-end-of-line       │       │
 │     │ https://www.semiconductor-dig │ interconnects with the densest │       │
 │     │ est.com/the-shape-of-tomorrow │ SRAM macro ever reported       │       │
 │     │ s-semiconductor-technology/   │ (~38Mb/mm2); and a holistic,   │       │
 │     │                               │ system-technology co-optimized │       │
 │     │                               │ (STCO) architecture offering   │       │
 │     │                               │ great design flexibility. That │       │
 │     │                               │ architecture includes a        │       │
 │     │                               │ scalable copper-based          │       │
 │     │                               │ redistribution layer and a     │       │
 │     │                               │ flat passivation layer (for    │       │
 │     │                               │ better performance, robust     │       │
 │     │                               │ CPI, and seamless 3D           │       │
 │     │                               │ integration); and              │       │
 │     │                               │ through-silicon vias, or TSVs. │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 4   │ 2 nm process - Wikipedia      │ According to the projections   │  0.90 │
 │     │ https://en.wikipedia.org/wiki │ contained in the 2021 update   │       │
 │     │ /2_nm_process                 │ of the International Roadmap   │       │
 │     │                               │ for Devices and Systems        │       │
 │     │                               │ published by the Institute of  │       │
 │     │                               │ Electrical and Electronics     │       │
 │     │                               │ Engineers (IEEE), a '2.1 nm    │       │
 │     │                               │ node range label' is expected  │       │
 │     │                               │ to have a contacted gate pitch │       │
 │     │                               │ of 45 nanometers and a         │       │
 │     │                               │ tightest metal pitch of 20     │       │
 │     │                               │ nanometers.                    │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 5   │ TSMC Boosts 2 nm Yields by    │ A key innovation in the N2     │  0.88 │
 │     │ 6%, Passing Savings to        │ process is the enhanced design │       │
 │     │ Customers | TechPowerUp       │ of its GAA nanosheet           │       │
 │     │ https://www.techpowerup.com/3 │ transistors, which offers      │       │
 │     │ 29435/tsmc-boosts-2-nm-yields │ improved electrostatic control │       │
 │     │ -by-6-passing-savings-to-cust │ and reduced gate leakage       │       │
 │     │ omers                         │ compared to 3 nm FinFET        │       │
 │     │                               │ transistors, given that the    │       │
 │     │                               │ gate can be controlled from    │       │
 │     │                               │ all sides. This advancement    │       │
 │     │                               │ enables smaller high-density   │       │
 │     │                               │ transistors to maintain        │       │
 │     │                               │ reliable performance through   │       │
 │     │                               │ better threshold voltage       │       │
 │     │                               │ tuning capabilities.           │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 6   │ TSMC 2nm, full details        │ This 2nm platform technology   │  0.82 │
 │     │ revealed-Electronics          │ includes new Cu RDLs with flat │       │
 │     │ Headlines-EEWORLD             │ passivation and TSVs,          │       │
 │     │ https://en.eeworld.com.cn/mp/ │ optimized holistically with    │       │
 │     │ Icbank/a391002.jspx           │ 3DIC to enable system          │       │
 │     │                               │ integration.                   │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 7   │ TSMC begins quietly volume    │ TSMC has quietly revealed that │  0.97 │
 │     │ production of 2nm-class chips │ it had commenced volume        │       │
 │     │ | Tom's Hardware              │ production of chips using its  │       │
 │     │ https://www.tomshardware.com/ │ N2 (2nm-class) fabrication     │       │
 │     │ tech-industry/semiconductors/ │ process... 'TSMC's 2nm (N2)    │       │
 │     │ tsmc-begins-quietly-volume-pr │ technology has started volume  │       │
 │     │ oduction-of-2nm-class-chips-f │ production in 4Q25 as          │       │
 │     │ irst-gaa-transistor-for-tsmc- │ planned.'                      │       │
 │     │ claims-up-to-15-percent-impro │                                │       │
 │     │ vement-at-iso-power           │                                │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 8   │ TSMC's 2nm Yield Rates Surge  │ Initial tsmc 2nm yield rates   │  0.75 │
 │     │ as Mass Production Ramps Up   │ are notably high, reportedly   │       │
 │     │ in 2026                       │ reaching around 70%. Some      │       │
 │     │ https://heqingele.com/blog/ts │ reports even indicate yields   │       │
 │     │ mc-2nm-yield-rates-mass-produ │ surpassing 90% for certain     │       │
 │     │ ction-status-2026/            │ memory products.               │       │
 ├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
 │ 9   │ Unlocking the Future: TSMC's  │ On March 31, 2025, TSMC held   │  0.80 │
 │     │ Bold Strategy for the 2nm     │ an expansion ceremony for its  │       │
 │     │ Revolution!                   │ 2nm production facility in     │       │
 │     │ https://tspasemiconductor.sub │ Kaohsiung, marking a           │       │
 │     │ stack.com/p/unlocking-the-fut │ significant milestone in       │       │
 │     │ ure-tsmcs-bold-strategy-cb2   │ Taiwan's semiconductor         │       │
 │     │                               │ advanced manufacturing         │       │
 │     │                               │ expansion.                     │       │
 └─────┴───────────────────────────────┴────────────────────────────────┴───────┘
                                      Gaps                                      
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Category         ┃ Topic                       ┃ Detail                      ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ source_not_found │ Exact dielectric and gate   │ TSMC does not publicly      │
 │                  │ oxide materials used in N2  │ disclose the specific       │
 │                  │ GAA nanosheet transistors   │ high-k dielectric           │
 │                  │                             │ materials, interfacial      │
 │                  │                             │ layer compositions, or work │
 │                  │                             │ function metal chemistries  │
 │                  │                             │ used in the N2 gate stack.  │
 │                  │                             │ These are considered core   │
 │                  │                             │ IP.                         │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Nanosheet thickness and     │ The precise nanometer-scale │
 │                  │ width tolerances            │ dimensions and process      │
 │                  │                             │ tolerances (e.g., nanosheet │
 │                  │                             │ thickness variation,        │
 │                  │                             │ critical dimension          │
 │                  │                             │ uniformity) for N2 GAA      │
 │                  │                             │ nanosheets are not publicly │
 │                  │                             │ available.                  │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Metal interconnect liner    │ While Cu RDLs are           │
 │                  │ and barrier materials       │ confirmed, the specific     │
 │                  │                             │ barrier/liner materials     │
 │                  │                             │ (e.g., whether ruthenium or │
 │                  │                             │ cobalt liners replace       │
 │                  │                             │ TaN/Ta at this node) are    │
 │                  │                             │ not disclosed in public     │
 │                  │                             │ sources.                    │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ Doping profiles and implant │ Source/drain doping         │
 │                  │ specifications              │ concentrations, implant     │
 │                  │                             │ energies, and anneal        │
 │                  │                             │ conditions are proprietary  │
 │                  │                             │ and not published.          │
 ├──────────────────┼─────────────────────────────┼─────────────────────────────┤
 │ source_not_found │ EUV lithography specifics   │ The number of EUV exposures │
 │                  │ (number of EUV layers,      │ per layer, overlay          │
 │                  │ stochastic defect control   │ tolerances, and specific    │
 │                  │ methods)                    │ stochastic control          │
 │                  │                             │ approaches are not detailed │
 │                  │                             │ in public TSMC disclosures. │
 └──────────────────┴─────────────────────────────┴─────────────────────────────┘
                                Discovery Events                                
 ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
 ┃                  ┃ Suggested         ┃                   ┃                   ┃
 ┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
 ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
 │ related_research │ arxiv             │ TSMC N2 nanosheet │ IEEE IEDM 2024    │
 │                  │                   │ GAA transistor    │ papers from TSMC  │
 │                  │                   │ gate stack        │ may contain more  │
 │                  │                   │ materials high-k  │ specific          │
 │                  │                   │ dielectric IEDM   │ materials details │
 │                  │                   │ 2024              │ in the full       │
 │                  │                   │                   │ published         │
 │                  │                   │                   │ proceedings not   │
 │                  │                   │                   │ summarized in     │
 │                  │                   │                   │ news articles.    │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ TSMC 2nm N2       │ TSMC patent       │
 │                  │                   │ process patent    │ filings related   │
 │                  │                   │ filings nanosheet │ to N2 may reveal  │
 │                  │                   │ gate-all-around   │ specific          │
 │                  │                   │ materials         │ materials         │
 │                  │                   │                   │ choices,          │
 │                  │                   │                   │ tolerances, and   │
 │                  │                   │                   │ process           │
 │                  │                   │                   │ innovations that  │
 │                  │                   │                   │ are not in press  │
 │                  │                   │                   │ releases.         │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ arxiv             │ gate-all-around   │ Academic          │
 │                  │                   │ nanosheet         │ literature on GAA │
 │                  │                   │ transistor        │ nanosheet         │
 │                  │                   │ silicon channel   │ fabrication may   │
 │                  │                   │ thickness         │ reveal typical    │
 │                  │                   │ variation         │ tolerance ranges  │
 │                  │                   │ tolerance 2nm     │ used at the 2nm   │
 │                  │                   │                   │ class node even   │
 │                  │                   │                   │ if not            │
 │                  │                   │                   │ TSMC-specific.    │
 ├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
 │ related_research │ database          │ TechInsights TSMC │ TechInsights      │
 │                  │                   │ N2 teardown       │ performs physical │
 │                  │                   │ materials         │ reverse           │
 │                  │                   │ analysis 2025     │ engineering of    │
 │                  │                   │                   │ chips and may     │
 │                  │                   │                   │ have detailed N2  │
 │                  │                   │                   │ materials         │
 │                  │                   │                   │ analysis          │
 │                  │                   │                   │ available through │
 │                  │                   │                   │ their             │
 │                  │                   │                   │ subscription      │
 │                  │                   │                   │ service.          │
 └──────────────────┴───────────────────┴───────────────────┴───────────────────┘
                                 Open Questions                                 
 ┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
 ┃ Priority ┃ Question                        ┃ Context                         ┃
 ┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
 │ high     │ What specific high-k dielectric │ Public sources confirm GAA      │
 │          │ and metal gate materials does   │ nanosheet architecture but do   │
 │          │ TSMC use in the N2 GAA          │ not specify gate dielectric     │
 │          │ nanosheet gate stack?           │ (e.g., HfO2 variants) or work   │
 │          │                                 │ function metal compositions     │
 │          │                                 │ used to achieve threshold       │
 │          │                                 │ voltage tuning.                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ Has TSMC adopted ruthenium or   │ At 20nm metal pitch,            │
 │          │ other alternative metals for    │ traditional TaN/Ta/Cu stacks    │
 │          │ BEOL interconnect liners in N2  │ face resistance issues; Intel   │
 │          │ to reduce resistance at tight   │ and others have explored Mo and │
 │          │ pitches?                        │ Ru. TSMC's specific choice for  │
 │          │                                 │ N2 BEOL is not disclosed in     │
 │          │                                 │ public sources.                 │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ high     │ What is the actual silicon      │ GAA nanosheet devices typically │
 │          │ nanosheet thickness and stack   │ stack 3-4 nanosheets; TSMC has  │
 │          │ count in TSMC's N2 process?     │ not publicly specified          │
 │          │                                 │ nanosheet dimensions or stack   │
 │          │                                 │ count for N2.                   │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ How does TSMC's N2 defect       │ A LinkedIn post references      │
 │          │ density compare quantitatively  │ Tom's Hardware reporting that   │
 │          │ to N3 at equivalent production  │ TSMC disclosed N2 defect        │
 │          │ maturity?                       │ density is lower than N3 at the │
 │          │                                 │ same stage of development, but  │
 │          │                                 │ specific numbers were not found │
 │          │                                 │ in the gathered sources.        │
 ├──────────┼─────────────────────────────────┼─────────────────────────────────┤
 │ medium   │ Will TSMC's N2P (enhanced N2)   │ Sources mention N2P is a 5%     │
 │          │ node incorporate backside power │ speed-enhanced version of N2    │
 │          │ delivery network (BSPDN), and   │ targeting qualification         │
 │          │ what materials/process changes  │ completion; the SemiAnalysis    │
 │          │ does that entail?               │ report discusses BSPDN as a key │
 │          │                                 │ innovation at 2nm class nodes,  │
 │          │                                 │ and its material implications   │
 │          │                                 │ differ significantly.           │
 └──────────┴─────────────────────────────────┴─────────────────────────────────┘
 ╭───────────────────────────────── Confidence ─────────────────────────────────╮
 │ Overall: 0.42                                                                │
 │ Corroborating sources: 9                                                     │
 │ Source authority: medium                                                     │
 │ Contradiction detected: False                                                │
 │ Query specificity match: 0.30                                                │
 │ Budget status: spent                                                         │
 │ Recency: current                                                             │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 ╭──────────────────────────────────── Cost ────────────────────────────────────╮
 │ Tokens: 62620                                                                │
 │ Iterations: 3                                                                │
 │ Wall time: 105.63s                                                           │
 │ Model: claude-sonnet-4-6                                                     │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 trace_id: a4bb5b7a-61dd-446b-8c06-06c78de5fef7
--- a/obs/__init__.py
+++ b/obs/__init__.py
@ -1,130 +0,0 @@
 """Marchwarden observability — structured application logging.
 Operational logs for administrators. NOT the same as the JSONL traces
 in `~/.marchwarden/traces/` — those are per-research-call audit logs
 for verifying provenance. These logs cover system events: startup,
 shutdown, errors, MCP transport activity, cost ledger writes.
 Format is structured-by-default (structlog) so logs can be ingested
 into OpenSearch or similar without per-call formatter ceremony.
 Usage:
    from obs import configure_logging, get_logger
    configure_logging()  # call once at process startup
    log = get_logger("marchwarden.cli")
    log.info("ask_started", question=question, depth=depth)
    # Bind context that flows to every downstream log call:
    log = log.bind(trace_id="abc-123", researcher="web")
    log.info("research_started")  # automatically includes trace_id + researcher
 """
 import logging
 import os
 import sys
 from logging.handlers import RotatingFileHandler
 from pathlib import Path
 from typing import Optional
 import structlog
 _CONFIGURED = False
 def _resolve_format() -> str:
    """Pick the renderer: explicit env override, else auto-detect TTY."""
    explicit = os.environ.get("MARCHWARDEN_LOG_FORMAT")
    if explicit in {"json", "console"}:
        return explicit
    return "console" if sys.stderr.isatty() else "json"
 def _resolve_level() -> int:
    name = os.environ.get("MARCHWARDEN_LOG_LEVEL", "INFO").upper()
    return getattr(logging, name, logging.INFO)
 def configure_logging(force: bool = False) -> None:
    """Configure structlog + stdlib logging once per process.
    Idempotent — subsequent calls are no-ops unless ``force=True``.
    Honors:
      - MARCHWARDEN_LOG_LEVEL (default INFO)
      - MARCHWARDEN_LOG_FORMAT (json|console; auto-detected from TTY if unset)
      - MARCHWARDEN_LOG_FILE (truthy → also log to ~/.marchwarden/logs/marchwarden.log)
    """
    global _CONFIGURED
    if _CONFIGURED and not force:
        return
    level = _resolve_level()
    fmt = _resolve_format()
    timestamper = structlog.processors.TimeStamper(fmt="iso", utc=True)
    shared_processors = [
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        timestamper,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
    ]
    if fmt == "json":
        renderer: structlog.types.Processor = structlog.processors.JSONRenderer()
    else:
        renderer = structlog.dev.ConsoleRenderer(colors=sys.stderr.isatty())
    structlog.configure(
        processors=shared_processors
        + [structlog.stdlib.ProcessorFormatter.wrap_for_formatter],
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.make_filtering_bound_logger(level),
        cache_logger_on_first_use=True,
    )
    formatter = structlog.stdlib.ProcessorFormatter(
        foreign_pre_chain=shared_processors,
        processors=[
            structlog.stdlib.ProcessorFormatter.remove_processors_meta,
            renderer,
        ],
    )
    # Always log to stderr so MCP stdio stdout stays clean.
    handler: logging.Handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(formatter)
    root = logging.getLogger()
    # Replace handlers so re-configuration in tests works cleanly.
    root.handlers = [handler]
    root.setLevel(level)
    if os.environ.get("MARCHWARDEN_LOG_FILE"):
        log_dir = Path(os.path.expanduser("~/.marchwarden/logs"))
        log_dir.mkdir(parents=True, exist_ok=True)
        file_handler = RotatingFileHandler(
            log_dir / "marchwarden.log",
            maxBytes=10 * 1024 * 1024,
            backupCount=5,
            encoding="utf-8",
        )
        file_handler.setFormatter(formatter)
        root.addHandler(file_handler)
    # Quiet a few noisy third-party loggers unless DEBUG is requested.
    if level > logging.DEBUG:
        for noisy in ("httpx", "httpcore", "anthropic"):
            logging.getLogger(noisy).setLevel(logging.WARNING)
    _CONFIGURED = True
 def get_logger(name: str) -> structlog.stdlib.BoundLogger:
    """Return a bound structlog logger. Configures logging on first call."""
    if not _CONFIGURED:
        configure_logging()
    return structlog.get_logger(name)
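Reviewer note: the environment-driven resolution in `_resolve_format` and `_resolve_level` can be exercised without structlog. The sketch below is a stdlib-only restatement of that logic, not the removed module itself. The `env` and `is_tty` parameters are hypothetical additions so the behaviour can be checked without touching the real process environment or terminal.

```python
import logging
import os
import sys


def resolve_format(env=None, is_tty=None):
    """Explicit override wins; otherwise console on a TTY, JSON elsewhere."""
    env = os.environ if env is None else env  # `env` is an assumption for testability
    explicit = env.get("MARCHWARDEN_LOG_FORMAT")
    if explicit in {"json", "console"}:
        return explicit
    if is_tty is None:  # `is_tty` is likewise a hypothetical injection point
        is_tty = sys.stderr.isatty()
    return "console" if is_tty else "json"


def resolve_level(env=None):
    """Map MARCHWARDEN_LOG_LEVEL to a stdlib level, defaulting to INFO."""
    env = os.environ if env is None else env
    name = env.get("MARCHWARDEN_LOG_LEVEL", "INFO").upper()
    # Unrecognised names fall back to INFO instead of raising.
    return getattr(logging, name, logging.INFO)
```

An unrecognised `MARCHWARDEN_LOG_FORMAT` value (for example `xml`) is ignored and the TTY auto-detection applies, which matches the `in {"json", "console"}` guard in the diff above.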
--- a/obs/costs.py
+++ b/obs/costs.py
@ -1,179 +0,0 @@
 """Cost tracking — price table loader and JSONL ledger writer.
 Supplements (does not replace) the per-call ``cost_metadata`` field
 on ``ResearchResult``. Operators consume this ledger via the
 ``marchwarden costs`` command (M2.5.3) for spend tracking.
 Estimated costs are computed from a TOML price table at
 ``~/.marchwarden/prices.toml``, auto-created with seed values on
 first run. Operators are expected to update prices manually when
 upstream rates change — there is no automatic fetching.
 """
 from __future__ import annotations
 import json
 import os
 import time
 from pathlib import Path
 from typing import Optional
 try:
    import tomllib  # Python 3.11+
 except ModuleNotFoundError:  # pragma: no cover
    import tomli as tomllib  # type: ignore[no-redef]
 from obs import get_logger
 log = get_logger("marchwarden.costs")
 DEFAULT_LEDGER_PATH = "~/.marchwarden/costs.jsonl"
 DEFAULT_PRICES_PATH = "~/.marchwarden/prices.toml"
 # Seed values current as of 2026-04. Operators should update
 # ~/.marchwarden/prices.toml when upstream rates change.
 SEED_PRICES_TOML = """\
 # Marchwarden price table — used for cost ledger estimation only.
 # Update these values when upstream pricing changes. Marchwarden does
 # not fetch prices automatically.
 #
 # input_per_mtok_usd  = USD per 1,000,000 input tokens
 # output_per_mtok_usd = USD per 1,000,000 output tokens
 [models."claude-sonnet-4-6"]
 input_per_mtok_usd = 3.00
 output_per_mtok_usd = 15.00
 [models."claude-opus-4-6"]
 input_per_mtok_usd = 15.00
 output_per_mtok_usd = 75.00
 [models."claude-haiku-4-5-20251001"]
 input_per_mtok_usd = 1.00
 output_per_mtok_usd = 5.00
 [tavily]
 # Estimated post-free-tier per-search rate. Free tier covers the first
 # 1000 searches per month at no cost.
 per_search_usd = 0.005
 """
 class PriceTable:
    """Loads and queries the price table at ~/.marchwarden/prices.toml."""
    def __init__(self, path: Optional[str] = None):
        self.path = Path(os.path.expanduser(path or DEFAULT_PRICES_PATH))
        self._data: dict = {}
        self._ensure_file()
        self._load()
    def _ensure_file(self) -> None:
        if self.path.exists():
            return
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(SEED_PRICES_TOML, encoding="utf-8")
        log.info("price_table_seeded", path=str(self.path))
    def _load(self) -> None:
        with open(self.path, "rb") as f:
            self._data = tomllib.load(f)
    def estimate_call_usd(
        self,
        model_id: str,
        tokens_input: Optional[int],
        tokens_output: Optional[int],
        tavily_searches: int,
    ) -> Optional[float]:
        """Estimate USD cost for a single research call.
        Returns None if the model is unknown — caller should record
        ``estimated_cost_usd: null`` in the ledger and the operator
        is expected to update prices.toml.
        """
        models = self._data.get("models", {})
        model_prices = models.get(model_id)
        if not model_prices:
            log.warning(
                "unknown_model_for_pricing",
                model_id=model_id,
                hint=f"add a [models.\"{model_id}\"] section to {self.path}",
            )
            return None
        in_tok = tokens_input or 0
        out_tok = tokens_output or 0
        input_cost = (in_tok / 1_000_000) * model_prices.get("input_per_mtok_usd", 0.0)
        output_cost = (out_tok / 1_000_000) * model_prices.get("output_per_mtok_usd", 0.0)
        tavily = self._data.get("tavily", {})
        tavily_cost = tavily_searches * tavily.get("per_search_usd", 0.0)
        return round(input_cost + output_cost + tavily_cost, 6)
 class CostLedger:
    """Append-only JSONL ledger of completed research calls."""
    def __init__(
        self,
        ledger_path: Optional[str] = None,
        price_table: Optional[PriceTable] = None,
    ):
        env_path = os.environ.get("MARCHWARDEN_COST_LEDGER")
        self.path = Path(
            os.path.expanduser(ledger_path or env_path or DEFAULT_LEDGER_PATH)
        )
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.price_table = price_table or PriceTable()
    def record(
        self,
        *,
        trace_id: str,
        question: str,
        model_id: str,
        tokens_used: int,
        tokens_input: Optional[int],
        tokens_output: Optional[int],
        iterations_run: int,
        wall_time_sec: float,
        tavily_searches: int,
        budget_exhausted: bool,
        confidence: float,
    ) -> dict:
        """Append one entry to the ledger and emit a structured log line.
        Returns the entry as a dict (useful for tests and the log call).
        """
        estimated_cost_usd = self.price_table.estimate_call_usd(
            model_id=model_id,
            tokens_input=tokens_input,
            tokens_output=tokens_output,
            tavily_searches=tavily_searches,
        )
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "trace_id": trace_id,
            "question": question[:200],
            "model_id": model_id,
            "tokens_used": tokens_used,
            "tokens_input": tokens_input,
            "tokens_output": tokens_output,
            "iterations_run": iterations_run,
            "wall_time_sec": round(wall_time_sec, 3),
            "tavily_searches": tavily_searches,
            "estimated_cost_usd": estimated_cost_usd,
            "budget_exhausted": budget_exhausted,
            "confidence": confidence,
        }
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        log.info("cost_recorded", **entry)
        return entry
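Reviewer note: the arithmetic in `PriceTable.estimate_call_usd` can be checked by hand. The sketch below repeats the same formula over a plain dict instead of the TOML file. The rates are the seed values from `SEED_PRICES_TOML`; the token and search counts are hypothetical, chosen only to make the sums easy to follow, and the `prices` parameter is an assumption that replaces `self._data`.

```python
# Seed rates copied from SEED_PRICES_TOML (USD per 1,000,000 tokens).
PRICES = {
    "models": {
        "claude-sonnet-4-6": {
            "input_per_mtok_usd": 3.00,
            "output_per_mtok_usd": 15.00,
        },
    },
    "tavily": {"per_search_usd": 0.005},
}


def estimate_call_usd(prices, model_id, tokens_input, tokens_output, tavily_searches):
    """Return the estimated USD cost, or None when the model has no price entry."""
    model = prices.get("models", {}).get(model_id)
    if not model:
        return None
    in_tok = tokens_input or 0    # None is treated as zero tokens
    out_tok = tokens_output or 0
    input_cost = (in_tok / 1_000_000) * model.get("input_per_mtok_usd", 0.0)
    output_cost = (out_tok / 1_000_000) * model.get("output_per_mtok_usd", 0.0)
    search_cost = tavily_searches * prices.get("tavily", {}).get("per_search_usd", 0.0)
    return round(input_cost + output_cost + search_cost, 6)


# Hypothetical call: 50,000 input tokens, 2,000 output tokens, 4 searches.
# 0.15 (input) + 0.03 (output) + 0.02 (search) = 0.20 USD
cost = estimate_call_usd(PRICES, "claude-sonnet-4-6", 50_000, 2_000, 4)
```

An unknown model yields `None` instead of a zero cost, so a ledger entry with `estimated_cost_usd: null` stays distinguishable from a call that was genuinely free.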
--- a/pyproject.toml
+++ b/pyproject.toml
@ -19,33 +19,22 @@ dependencies = [
    "tavily-python>=0.3.0",
    "httpx>=0.24.0",
    "click>=8.0",
    "rich>=13.0",
    "structlog>=24.0",
 ]
 [project.optional-dependencies]
 dev = [
    "pytest>=7.0",
    "pytest-cov>=4.0",
    "pytest-asyncio>=0.21",
    "black>=23.0",
    "ruff>=0.1.0",
    "mypy>=1.0",
 ]
 # arxiv-rag researcher (M5.1). Heavy ML deps — optional extra so the base
 # install stays slim for users who only want the web researcher.
 arxiv = [
    "pymupdf>=1.24",
    "chromadb>=0.5",
    "sentence-transformers>=3.0",
    "arxiv>=2.1",
 ]
 [project.scripts]
 marchwarden = "cli.main:cli"
 [tool.setuptools.packages.find]
-include = ["researchers*", "orchestrator*", "cli*", "obs*"]
+include = ["researchers*", "orchestrator*", "cli*"]
 [tool.pytest.ini_options]
 testpaths = ["tests"]
--- a/researchers/arxiv/__init__.py
+++ b/researchers/arxiv/__init__.py
@ -1,7 +0,0 @@
 """arxiv-rag researcher.
 Second researcher implementation in Marchwarden. M5.1.1 is the ingest
 pipeline only — see researchers.arxiv.ingest and researchers.arxiv.store.
 The agent loop, MCP server, and ResearchContract integration ship in
 later sub-milestones (#39–#43).
 """
--- a/researchers/arxiv/ingest.py
+++ b/researchers/arxiv/ingest.py
@ -1,370 +0,0 @@
 """Ingest pipeline for the arxiv-rag researcher.
 Public surface:
    download_pdf(arxiv_id, store) -> tuple[Path, PaperMetadata]
    extract_sections(pdf_path) -> list[Section]
    embed_and_store(arxiv_id, sections, store, model_name, metadata) -> int
    ingest(arxiv_id, store=None, model_name=...) -> PaperRecord  # one-shot
 The split exists so unit tests can mock each phase independently. The
 top-level ``ingest()`` is what the CLI calls.
 Section detection is heuristic: we walk the PDF page by page, look for
 short lines that match a small set of canonical academic headings
 (introduction, methods, results, discussion, conclusion, references,
 etc.), and use those as section boundaries. If nothing matches, we fall
 back to one Section containing the entire paper text — citations to
 that section will still be valid, just less precise.
 """
 from __future__ import annotations
 import os
 import re
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Callable, Optional
 from .store import ArxivStore, PaperRecord, make_chunk_id
 # ---------------------------------------------------------------------------
 # Defaults
 # ---------------------------------------------------------------------------
 DEFAULT_EMBEDDING_MODEL = os.environ.get(
    "MARCHWARDEN_ARXIV_EMBED_MODEL",
    "nomic-ai/nomic-embed-text-v1.5",
 )
 # Headings considered "section starters" for the heuristic. Order
 # matters only for documentation; matching is case-insensitive and
 # whole-line.
 _SECTION_HEADINGS = [
    "abstract",
    "introduction",
    "background",
    "related work",
    "preliminaries",
    "methods",
    "method",
    "methodology",
    "approach",
    "model",
    "experiments",
    "experimental setup",
    "evaluation",
    "results",
    "discussion",
    "analysis",
    "limitations",
    "conclusion",
    "conclusions",
    "future work",
    "references",
    "acknowledgments",
    "appendix",
 ]
 # Compiled match: optional leading number ("3", "3.1", "III") with an
 # optional trailing dot, then the heading word, then end of line.
 _HEADING_RE = re.compile(
    r"^\s*(?:[0-9IVX]+\.?[0-9.]*)?\s*(?P<title>" + "|".join(_SECTION_HEADINGS) + r")\s*$",
    re.IGNORECASE,
 )
@dataclass
 class Section:
    """One section of a paper."""
    index: int
    title: str
    text: str
    page_start: int
    page_end: int
@dataclass
 class PaperMetadata:
    """Lightweight metadata extracted from arxiv at download time."""
    arxiv_id: str
    version: str
    title: str
    authors: list[str] = field(default_factory=list)
    year: Optional[int] = None
    category: Optional[str] = None
 # ---------------------------------------------------------------------------
 # Phase 1 — download
 # ---------------------------------------------------------------------------
 def download_pdf(
    arxiv_id: str,
    store: ArxivStore,
    *,
    arxiv_search: Optional[Callable] = None,
 ) -> tuple[Path, PaperMetadata]:
    """Download a paper PDF and return its cached path + arxiv metadata.
    ``arxiv_search`` is injectable for tests so we can avoid hitting the
    real arxiv API. The default uses the ``arxiv`` package.
    """
    target = store.pdfs_dir / f"{arxiv_id}.pdf"
    if arxiv_search is None:
        import arxiv as arxiv_pkg
        search = arxiv_pkg.Search(id_list=[arxiv_id])
        results = list(search.results())
    else:
        results = list(arxiv_search(arxiv_id))
    if not results:
        raise ValueError(f"arxiv id not found: {arxiv_id}")
    paper = results[0]
    # Download the PDF if we don't already have it cached.
    if not target.exists():
        # Both the real arxiv.Result and our test stub expose
        # download_pdf(dirpath, filename), so try the keyword form first.
        # Test stubs may instead take a destination path directly; fall
        # back to that form on TypeError.
        try:
            paper.download_pdf(
                dirpath=str(store.pdfs_dir),
                filename=f"{arxiv_id}.pdf",
            )
        except TypeError:
            paper.download_pdf(str(target))
    metadata = PaperMetadata(
        arxiv_id=arxiv_id,
        version=getattr(paper, "entry_id", "").rsplit("v", 1)[-1] if "v" in getattr(paper, "entry_id", "") else "",
        title=getattr(paper, "title", "") or "",
        authors=[
            getattr(a, "name", str(a))
            for a in (getattr(paper, "authors", []) or [])
        ],
        year=(
            getattr(paper, "published", None).year
            if getattr(paper, "published", None) is not None
            else None
        ),
        category=getattr(paper, "primary_category", None),
    )
    return target, metadata
 # ---------------------------------------------------------------------------
 # Phase 2 — extract sections
 # ---------------------------------------------------------------------------
 def extract_sections(pdf_path: Path) -> list[Section]:
    """Extract section-level chunks from a PDF.
    Heuristic: walk pages, split on lines that match a known section
    heading. If no headings are detected, return one Section containing
    the whole document.
    """
    import pymupdf
    doc = pymupdf.open(str(pdf_path))
    try:
        # Build a flat list of (page_num, line) tuples for the whole doc.
        lines: list[tuple[int, str]] = []
        for page_num, page in enumerate(doc, start=1):
            text = page.get_text("text") or ""
            for raw_line in text.splitlines():
                stripped = raw_line.strip()
                if stripped:
                    lines.append((page_num, stripped))
    finally:
        doc.close()
    # Find heading boundaries.
    boundaries: list[tuple[int, str, int]] = []  # (line_index, title, page_num)
    for i, (page_num, line) in enumerate(lines):
        if len(line) > 80:
            # Section headings are short. Skip likely body text.
            continue
        m = _HEADING_RE.match(line)
        if m:
            boundaries.append((i, m.group("title").strip().title(), page_num))
    sections: list[Section] = []
    if not boundaries:
        # Fallback: whole paper as one section.
        full_text = "\n".join(line for _, line in lines)
        if not full_text.strip():
            return []
        first_page = lines[0][0] if lines else 1
        last_page = lines[-1][0] if lines else 1
        return [
            Section(
                index=0,
                title="Full Paper",
                text=full_text,
                page_start=first_page,
                page_end=last_page,
            )
        ]
    # Build sections between consecutive boundaries.
    for idx, (start_line, title, page_start) in enumerate(boundaries):
        end_line = (
            boundaries[idx + 1][0] if idx + 1 < len(boundaries) else len(lines)
        )
        body_lines = lines[start_line + 1 : end_line]
        text = "\n".join(line for _, line in body_lines).strip()
        if not text:
            continue
        page_end = body_lines[-1][0] if body_lines else page_start
        sections.append(
            Section(
                index=idx,
                title=title,
                text=text,
                page_start=page_start,
                page_end=page_end,
            )
        )
    if not sections:
        # Headings detected but every section was empty — fall back to
        # whole paper rather than dropping the document.
        full_text = "\n".join(line for _, line in lines)
        return [
            Section(
                index=0,
                title="Full Paper",
                text=full_text,
                page_start=lines[0][0],
                page_end=lines[-1][0],
            )
        ]
    return sections
 # ---------------------------------------------------------------------------
 # Phase 3 — embed and store
 # ---------------------------------------------------------------------------
 def _load_embedder(model_name: str):
    """Load a sentence-transformers embedder. Cached at module level so
    repeated ingests in the same process don't re-download / re-load.
    """
    cache = _load_embedder._cache  # type: ignore[attr-defined]
    if model_name in cache:
        return cache[model_name]
    from sentence_transformers import SentenceTransformer
    embedder = SentenceTransformer(model_name, trust_remote_code=True)
    cache[model_name] = embedder
    return embedder
 _load_embedder._cache = {}  # type: ignore[attr-defined]
 def embed_and_store(
    arxiv_id: str,
    sections: list[Section],
    store: ArxivStore,
    model_name: str,
    metadata: PaperMetadata,
    *,
    embedder: Optional[object] = None,
 ) -> int:
    """Embed each section and write to the chromadb collection.
    ``embedder`` is injectable for tests so we don't have to load
    sentence-transformers. It must expose ``encode(list[str]) -> list[list[float]]``.
    Returns the number of chunks written.
    """
    if not sections:
        return 0
    if embedder is None:
        embedder = _load_embedder(model_name)
    texts = [s.text for s in sections]
    raw_vectors = embedder.encode(texts)
    # sentence-transformers returns a numpy.ndarray; chromadb wants
    # plain lists. Handle both shapes.
    embeddings: list[list[float]] = []
    for vec in raw_vectors:
        if hasattr(vec, "tolist"):
            embeddings.append(vec.tolist())
        else:
            embeddings.append(list(vec))
    ids = [make_chunk_id(arxiv_id, s.index, model_name) for s in sections]
    metadatas = [
        {
            "arxiv_id": arxiv_id,
            "section_index": s.index,
            "section_title": s.title,
            "page_start": s.page_start,
            "page_end": s.page_end,
            "title": metadata.title,
            "embedding_model": model_name,
        }
        for s in sections
    ]
    # Replace any prior chunks for this paper before re-adding.
    # delete_paper() filters on arxiv_id only, so chunks stored under
    # other embedding models are removed as well. Idempotency: re-ingest
    # with the same model is a no-op in observable state.
    store.delete_paper(arxiv_id)
    store.add_chunks(ids=ids, documents=texts, embeddings=embeddings, metadatas=metadatas)
    return len(ids)
 # ---------------------------------------------------------------------------
 # Top-level orchestrator
 # ---------------------------------------------------------------------------
 def ingest(
    arxiv_id: str,
    store: Optional[ArxivStore] = None,
    *,
    model_name: str = DEFAULT_EMBEDDING_MODEL,
    arxiv_search: Optional[Callable] = None,
    embedder: Optional[object] = None,
 ) -> PaperRecord:
    """End-to-end ingest: download → extract → embed → store → manifest."""
    store = store or ArxivStore()
    pdf_path, metadata = download_pdf(arxiv_id, store, arxiv_search=arxiv_search)
    sections = extract_sections(pdf_path)
    chunk_count = embed_and_store(
        arxiv_id=arxiv_id,
        sections=sections,
        store=store,
        model_name=model_name,
        metadata=metadata,
        embedder=embedder,
    )
    record = PaperRecord(
        arxiv_id=arxiv_id,
        version=metadata.version,
        title=metadata.title,
        authors=metadata.authors,
        year=metadata.year,
        category=metadata.category,
        chunks_indexed=chunk_count,
        embedding_model=model_name,
    )
    store.upsert_paper(record)
    return record
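
Reviewer sketch (not part of the diff): the heading heuristic in `extract_sections` can be tried without a PDF by running the same kind of regex over plain lines. This is a simplified illustration under stated assumptions: the heading list is shortened, page tracking is dropped, and `split_sections` is a hypothetical name, not a function in the repository.

```python
import re

# Shortened heading list for illustration; the real list is longer.
HEADINGS = ["abstract", "introduction", "methods", "results", "conclusion"]
HEADING_RE = re.compile(
    r"^\s*(?:[0-9IVX]+\.?[0-9.]*)?\s*(?P<title>" + "|".join(HEADINGS) + r")\s*$",
    re.IGNORECASE,
)


def split_sections(lines):
    """Group lines into (title, body) pairs, splitting on heading lines."""
    boundaries = []
    for i, line in enumerate(lines):
        if len(line) > 80:
            continue  # headings are short; skip likely body text
        m = HEADING_RE.match(line)
        if m:
            boundaries.append((i, m.group("title").title()))

    if not boundaries:
        # Fallback: the whole text becomes one section.
        text = "\n".join(lines)
        return [("Full Paper", text)] if text.strip() else []

    sections = []
    for idx, (start, title) in enumerate(boundaries):
        end = boundaries[idx + 1][0] if idx + 1 < len(boundaries) else len(lines)
        body = "\n".join(lines[start + 1 : end]).strip()
        if body:
            sections.append((title, body))
    return sections


demo_lines = [
    "Some Paper Title",
    "1. Introduction",
    "We study X.",
    "2 Methods",
    "We do Y.",
    "Results",
    "Z holds.",
]
demo_sections = split_sections(demo_lines)
print(demo_sections)
```

As in the code above it, any text before the first detected heading (here the title line) is not assigned to a section.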
--- a/researchers/arxiv/store.py
+++ b/researchers/arxiv/store.py
@ -1,214 +0,0 @@
 """Chromadb wrapper for the arxiv-rag researcher.
 The store lives at ``~/.marchwarden/arxiv-rag/`` and contains:
 - ``papers.json`` — manifest mapping arxiv_id -> metadata
 - ``pdfs/<id>.pdf`` — cached PDFs
 - ``chroma/`` — chromadb persistent collection of embedded chunks
 This module is intentionally narrow: it exposes the persistent state and
 the operations the ingest + retrieval layers need (add chunks, fetch
 manifest, list papers). The retrieval layer (#39) will add a query API
 on top of the same collection.
 Chunk IDs are deterministic and include the embedding model name in
 their hash so re-ingesting a paper with a different embedder creates a
 new ID space rather than silently overwriting old citations
 (ArxivRagProposal §1, decision 4).
 """
 from __future__ import annotations
 import hashlib
 import json
 import os
 from dataclasses import dataclass, field
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Any, Iterable, Optional
 DEFAULT_ROOT = Path(os.path.expanduser("~/.marchwarden/arxiv-rag"))
 DEFAULT_COLLECTION = "arxiv_chunks"
@dataclass
 class PaperRecord:
    """Manifest entry for one indexed paper."""
    arxiv_id: str
    version: str
    title: str
    authors: list[str]
    year: Optional[int]
    category: Optional[str]
    chunks_indexed: int
    embedding_model: str
    added_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc)
        .isoformat(timespec="seconds")
        .replace("+00:00", "Z")
    )
    def to_dict(self) -> dict:
        return {
            "version": self.version,
            "title": self.title,
            "authors": list(self.authors),
            "year": self.year,
            "category": self.category,
            "chunks_indexed": self.chunks_indexed,
            "embedding_model": self.embedding_model,
            "added_at": self.added_at,
        }
    @classmethod
    def from_dict(cls, arxiv_id: str, data: dict) -> "PaperRecord":
        return cls(
            arxiv_id=arxiv_id,
            version=data.get("version", ""),
            title=data.get("title", ""),
            authors=list(data.get("authors", [])),
            year=data.get("year"),
            category=data.get("category"),
            chunks_indexed=int(data.get("chunks_indexed", 0)),
            embedding_model=data.get("embedding_model", ""),
            added_at=data.get("added_at", ""),
        )
 def make_chunk_id(arxiv_id: str, section_index: int, embedding_model: str) -> str:
    """Deterministic chunk id, scoped by embedding model.
    Format: ``<arxiv_id>::<section_index>::<sha1(model)[0:8]>``. The
    model hash slice keeps the id readable while making it unique across
    embedding models. See ArxivRagProposal §1 decision 4 — re-ingesting
    with a different model must not collide with prior chunks.
    """
    model_hash = hashlib.sha1(embedding_model.encode("utf-8")).hexdigest()[:8]
    return f"{arxiv_id}::{section_index:04d}::{model_hash}"
 class ArxivStore:
    """File-backed manifest + chromadb collection for indexed papers."""
    def __init__(
        self,
        root: Optional[Path] = None,
        collection_name: str = DEFAULT_COLLECTION,
    ):
        self.root = Path(root) if root else DEFAULT_ROOT
        self.pdfs_dir = self.root / "pdfs"
        self.chroma_dir = self.root / "chroma"
        self.manifest_path = self.root / "papers.json"
        self.collection_name = collection_name
        self.root.mkdir(parents=True, exist_ok=True)
        self.pdfs_dir.mkdir(parents=True, exist_ok=True)
        self.chroma_dir.mkdir(parents=True, exist_ok=True)
        self._client = None  # lazy
        self._collection = None  # lazy
    # ------------------------------------------------------------------
    # Chroma — lazy because importing chromadb is slow
    # ------------------------------------------------------------------
    @property
    def collection(self):
        """Lazy chromadb collection handle."""
        if self._collection is None:
            import chromadb
            self._client = chromadb.PersistentClient(path=str(self.chroma_dir))
            self._collection = self._client.get_or_create_collection(
                name=self.collection_name,
                # Cosine distance — typical for sentence-transformer
                # embeddings normalized to unit length.
                metadata={"hnsw:space": "cosine"},
            )
        return self._collection
    def add_chunks(
        self,
        ids: list[str],
        documents: list[str],
        embeddings: list[list[float]],
        metadatas: list[dict[str, Any]],
    ) -> None:
        """Add a batch of embedded chunks to the collection."""
        if not ids:
            return
        if not (len(ids) == len(documents) == len(embeddings) == len(metadatas)):
            raise ValueError(
                "ids/documents/embeddings/metadatas must all have the same length"
            )
        self.collection.upsert(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
            metadatas=metadatas,
        )
    def chunk_count_for(self, arxiv_id: str) -> int:
        """Number of chunks currently stored for one paper."""
        # chromadb's get() with a where filter returns the matching docs;
        # we just need the count.
        try:
            res = self.collection.get(where={"arxiv_id": arxiv_id})
        except Exception:
            return 0
        return len(res.get("ids", []))
    def delete_paper(self, arxiv_id: str) -> int:
        """Remove all chunks for one paper. Returns number deleted."""
        before = self.chunk_count_for(arxiv_id)
        if before == 0:
            return 0
        self.collection.delete(where={"arxiv_id": arxiv_id})
        return before
    # ------------------------------------------------------------------
    # Manifest — plain JSON, atomic write
    # ------------------------------------------------------------------
    def load_manifest(self) -> dict[str, PaperRecord]:
        """Read papers.json. Returns {} if missing."""
        if not self.manifest_path.exists():
            return {}
        data = json.loads(self.manifest_path.read_text(encoding="utf-8"))
        return {
            arxiv_id: PaperRecord.from_dict(arxiv_id, entry)
            for arxiv_id, entry in data.items()
        }
    def save_manifest(self, manifest: dict[str, PaperRecord]) -> None:
        """Write papers.json atomically."""
        payload = {arxiv_id: rec.to_dict() for arxiv_id, rec in manifest.items()}
        tmp = self.manifest_path.with_suffix(".json.tmp")
        tmp.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
        tmp.replace(self.manifest_path)
    def upsert_paper(self, record: PaperRecord) -> None:
        """Insert or replace one entry in the manifest."""
        manifest = self.load_manifest()
        manifest[record.arxiv_id] = record
        self.save_manifest(manifest)
    def remove_paper(self, arxiv_id: str) -> bool:
        """Drop one entry from the manifest. Returns True if removed."""
        manifest = self.load_manifest()
        if arxiv_id not in manifest:
            return False
        del manifest[arxiv_id]
        self.save_manifest(manifest)
        return True
    def list_papers(self) -> list[PaperRecord]:
        """All manifest entries, sorted by added_at descending (newest first)."""
        manifest = self.load_manifest()
        return sorted(manifest.values(), key=lambda r: r.added_at, reverse=True)
    def get_paper(self, arxiv_id: str) -> Optional[PaperRecord]:
        """Manifest entry for one paper, or None."""
        return self.load_manifest().get(arxiv_id)
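
Reviewer sketch (not part of the diff): the chunk-ID scheme described in the `make_chunk_id` docstring can be checked in isolation. The function below restates the documented format (`<arxiv_id>::<section_index>::<sha1(model)[0:8]>`) as a standalone helper named `chunk_id`; the arxiv id and model names are made-up values for illustration.

```python
import hashlib


def chunk_id(arxiv_id: str, section_index: int, embedding_model: str) -> str:
    """Deterministic id: paper, zero-padded section index, short model hash."""
    model_hash = hashlib.sha1(embedding_model.encode("utf-8")).hexdigest()[:8]
    return f"{arxiv_id}::{section_index:04d}::{model_hash}"


# Same inputs always give the same id, so re-ingesting is an upsert.
id_a = chunk_id("2401.00001", 3, "model-a")
id_a_again = chunk_id("2401.00001", 3, "model-a")

# A different embedding model lands in a different id space.
id_b = chunk_id("2401.00001", 3, "model-b")

print(id_a)
print(id_b)
```

Distinct ids only keep old chunks alive if nothing deletes them first; a caller that removes chunks by `arxiv_id` before adding will drop every model's chunks for that paper.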
--- a/researchers/web/__main__.py
+++ b/researchers/web/__main__.py
@ -1,8 +0,0 @@
 """Allow running the web researcher MCP server as a module.
 Usage: python -m researchers.web
 """
 from researchers.web.server import main
 main()
--- a/researchers/web/agent.py
+++ b/researchers/web/agent.py
@ -10,11 +10,8 @@ import json
 import time
 from typing import Optional
 import structlog
 from anthropic import Anthropic
 from obs import get_logger
 from obs.costs import CostLedger
 from researchers.web.models import (
    Citation,
    ConfidenceFactors,
@ -25,13 +22,10 @@ from researchers.web.models import (
    OpenQuestion,
    ResearchConstraints,
    ResearchResult,
    constraints_for_depth,
 )
 from researchers.web.tools import SearchResult, fetch_url, tavily_search
 from researchers.web.trace import TraceLogger
 log = get_logger("marchwarden.researcher.web")
 SYSTEM_PROMPT = """\
 You are a Marchwarden — a research specialist stationed at the frontier of knowledge. \
 Your job is to investigate a question thoroughly using web search and URL fetching, \
@ -173,18 +167,13 @@ class WebResearcher:
        self,
        anthropic_api_key: str,
        tavily_api_key: str,
-        model_id: str = "claude-sonnet-4-6",
+        model_id: str = "claude-sonnet-4-5-20250514",
        trace_dir: Optional[str] = None,
        cost_ledger: Optional[CostLedger] = None,
    ):
        self.client = Anthropic(api_key=anthropic_api_key)
        self.tavily_api_key = tavily_api_key
        self.model_id = model_id
        self.trace_dir = trace_dir
        # Lazy default — only constructed if no override is given. Tests
        # inject a CostLedger pointed at a tmp path to avoid touching
        # the real ledger file.
        self.cost_ledger = cost_ledger
    async def research(
        self,
@ -204,34 +193,13 @@ class WebResearcher:
        Returns:
            A ResearchResult conforming to the v1 contract.
        """
-        # If the caller didn't supply explicit constraints, build them
+        constraints = constraints or ResearchConstraints()
        # from the depth preset (Issue #30). Callers that DO pass a
        # ResearchConstraints are taken at their word — explicit wins.
        constraints = constraints or constraints_for_depth(depth)
        trace = TraceLogger(trace_dir=self.trace_dir)
        start_time = time.time()
        total_tokens = 0
        tokens_input = 0
        tokens_output = 0
        iterations = 0
        evidence: list[dict] = []
        budget_exhausted = False
        tavily_searches = 0
        # Bind trace context so every downstream log call automatically
        # carries trace_id and researcher. Cleared in the finally block.
        structlog.contextvars.bind_contextvars(
            trace_id=trace.trace_id,
            researcher="web",
        )
        log.info(
            "research_started",
            question=question,
            depth=depth,
            max_iterations=constraints.max_iterations,
            token_budget=constraints.token_budget,
            model_id=self.model_id,
        )
        trace.log_step(
            "start",
@ -251,24 +219,7 @@ class WebResearcher:
        messages = [{"role": "user", "content": user_message}]
        # --- Tool-use loop ---
        # Budget policy: the loop honors token_budget as a soft cap. Before
        # starting a new iteration we check whether we've already hit the
        # budget; if so we stop and let synthesis run on whatever evidence
        # we already have. Synthesis tokens are tracked but not capped here
        # — the synthesis call is always allowed to complete so the caller
        # gets a structured result rather than a stub.
        while iterations < constraints.max_iterations:
            if total_tokens >= constraints.token_budget:
                budget_exhausted = True
                trace.log_step(
                    "budget_exhausted",
                    decision=(
                        f"Token budget reached before iteration "
                        f"{iterations + 1}: {total_tokens}/{constraints.token_budget}"
                    ),
                )
                break
            iterations += 1
            trace.log_step(
@ -286,15 +237,10 @@ class WebResearcher:
            )
            # Track tokens
            tokens_input += response.usage.input_tokens
            tokens_output += response.usage.output_tokens
            total_tokens += response.usage.input_tokens + response.usage.output_tokens
            # Check if the model wants to use tools
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            tavily_searches += sum(
                1 for tc in tool_calls if tc.name == "web_search"
            )
            if not tool_calls:
                # Model is done researching — extract any final text
@ -329,6 +275,15 @@ class WebResearcher:
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            # Check token budget
            if total_tokens >= constraints.token_budget:
                budget_exhausted = True
                trace.log_step(
                    "budget_exhausted",
                    decision=f"Token budget reached: {total_tokens}/{constraints.token_budget}",
                )
                break
        # --- Synthesis step ---
        trace.log_step(
            "synthesis_start",
@ -338,7 +293,7 @@ class WebResearcher:
            tokens_used=total_tokens,
        )
-        result, synth_in, synth_out = await self._synthesize(
+        result = await self._synthesize(
            question=question,
            context=context,
            evidence=evidence,
@ -348,46 +303,6 @@ class WebResearcher:
            start_time=start_time,
            budget_exhausted=budget_exhausted,
        )
        tokens_input += synth_in
        tokens_output += synth_out
        # Issue #54 (b): emit one trace event per gap/citation/discovery so
        # the JSONL stream contains the actual categories alongside the
        # existing summary counts. Cheap and gives us a queryable timeline.
        for c in result.citations:
            trace.log_step(
                "citation_recorded",
                decision="Citation kept in final result",
                source=c.source,
                locator=c.locator,
                title=c.title,
                confidence=c.confidence,
            )
        for g in result.gaps:
            trace.log_step(
                "gap_recorded",
                decision="Gap surfaced in final result",
                category=g.category.value,
                topic=g.topic,
                detail=g.detail,
            )
        for d in result.discovery_events:
            trace.log_step(
                "discovery_recorded",
                decision="Discovery event surfaced in final result",
                type=d.type,
                suggested_researcher=d.suggested_researcher,
                query=d.query,
                reason=d.reason,
            )
        # Issue #54 (a): persist the full ResearchResult next to the trace
        # so replay and downstream analysis can recover the structured
        # contract, not just counts.
        try:
            trace.write_result(result)
        except Exception as write_err:
            log.warning("trace_result_write_failed", error=str(write_err))
        trace.log_step(
            "complete",
@ -399,43 +314,6 @@ class WebResearcher:
        )
        trace.close()
        log.info(
            "research_completed",
            confidence=result.confidence,
            citations=len(result.citations),
            gaps=len(result.gaps),
            discovery_events=len(result.discovery_events),
            tokens_used=result.cost_metadata.tokens_used,
            iterations_run=result.cost_metadata.iterations_run,
            wall_time_sec=result.cost_metadata.wall_time_sec,
            budget_exhausted=result.cost_metadata.budget_exhausted,
        )
        # Append to the operational cost ledger. Construct on first use
        # so test injection (cost_ledger=...) and the env override
        # (MARCHWARDEN_COST_LEDGER) both work without forcing every
        # caller to build a CostLedger explicitly.
        try:
            ledger = self.cost_ledger or CostLedger()
            ledger.record(
                trace_id=result.trace_id,
                question=question,
                model_id=self.model_id,
                tokens_used=result.cost_metadata.tokens_used,
                tokens_input=tokens_input,
                tokens_output=tokens_output,
                iterations_run=result.cost_metadata.iterations_run,
                wall_time_sec=result.cost_metadata.wall_time_sec,
                tavily_searches=tavily_searches,
                budget_exhausted=result.cost_metadata.budget_exhausted,
                confidence=result.confidence,
            )
        except Exception as ledger_err:
            # Never let a ledger failure poison a successful research call.
            log.warning("cost_ledger_write_failed", error=str(ledger_err))
        structlog.contextvars.clear_contextvars()
        return result
    async def _execute_tool(
@ -539,12 +417,8 @@ class WebResearcher:
        iterations: int,
        start_time: float,
        budget_exhausted: bool,
-    ) -> tuple[ResearchResult, int, int]:
+    ) -> ResearchResult:
-        """Ask the LLM to synthesize evidence into a ResearchResult.
+        """Ask the LLM to synthesize evidence into a ResearchResult."""
        Returns ``(result, synthesis_input_tokens, synthesis_output_tokens)``
        so the caller can track per-call token splits for cost estimation.
        """
        # Format evidence for the synthesis prompt
        evidence_text = ""
@ -574,18 +448,15 @@ class WebResearcher:
        response = self.client.messages.create(
            model=self.model_id,
-            max_tokens=16384,
+            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
-        synth_in = response.usage.input_tokens
+        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        synth_out = response.usage.output_tokens
        total_tokens += synth_in + synth_out
        wall_time = time.time() - start_time
        # Parse the JSON response
        raw_text = response.content[0].text.strip()
        stop_reason = response.stop_reason
        # Strip markdown fences if the model added them despite instructions
        if raw_text.startswith("```"):
            raw_text = raw_text.split("\n", 1)[1] if "\n" in raw_text else raw_text[3:]
@ -594,24 +465,15 @@ class WebResearcher:
        try:
            data = json.loads(raw_text)
-        except json.JSONDecodeError as parse_err:
+        except json.JSONDecodeError:
            trace.log_step(
                "synthesis_error",
-                decision=(
+                decision="Failed to parse synthesis JSON, returning fallback",
-                    f"Failed to parse synthesis JSON ({parse_err}); "
+                raw_response=raw_text[:1000],
                    f"stop_reason={stop_reason}"
                ),
                stop_reason=stop_reason,
                parse_error=str(parse_err),
                raw_response=raw_text,
            )
-            return (
+            return self._fallback_result(
                self._fallback_result(
                question, evidence, trace, total_tokens, iterations,
                wall_time, budget_exhausted,
                ),
                synth_in,
                synth_out,
            )
        trace.log_step(
@ -673,8 +535,7 @@ class WebResearcher:
                recency=cf.get("recency"),
            )
-            return (
+            return ResearchResult(
                ResearchResult(
                answer=data.get("answer", "No answer could be synthesized."),
                citations=citations,
                gaps=gaps,
@ -690,22 +551,15 @@ class WebResearcher:
                    model_id=self.model_id,
                ),
                trace_id=trace.trace_id,
                ),
                synth_in,
                synth_out,
            )
        except Exception as e:
            trace.log_step(
                "synthesis_build_error",
                decision=f"Failed to build ResearchResult: {e}",
            )
-            return (
-                self._fallback_result(
+            return self._fallback_result(
                question, evidence, trace, total_tokens, iterations,
                wall_time, budget_exhausted,
-                ),
-                synth_in,
-                synth_out,
            )
    def _fallback_result(
--- a/researchers/web/models.py
+++ b/researchers/web/models.py
@ -41,43 +41,6 @@ class ResearchConstraints(BaseModel):
    )
 # Depth presets — choosing a depth picks sensible defaults for the
 # constraint fields. Explicit overrides (--max-iterations, --budget,
 # explicit ResearchConstraints) always win over the preset.
 #
 # `balanced` matches the historical defaults so existing callers see
 # no behavior change. `shallow` and `deep` are tuned for "quick lookup"
 # and "thorough investigation" respectively. These are starting points;
 # Phase 3 stress testing will inform calibration.
 DEPTH_PRESETS: dict[str, dict[str, int]] = {
    "shallow": {"max_iterations": 2, "token_budget": 5_000, "max_sources": 5},
    "balanced": {"max_iterations": 5, "token_budget": 20_000, "max_sources": 10},
    "deep": {"max_iterations": 8, "token_budget": 60_000, "max_sources": 20},
 }
 def constraints_for_depth(
    depth: str,
    *,
    max_iterations: Optional[int] = None,
    token_budget: Optional[int] = None,
    max_sources: Optional[int] = None,
 ) -> ResearchConstraints:
    """Build a ResearchConstraints from a depth preset, with optional overrides.
    Any non-None override wins over the preset value. Unknown depths
    fall back to ``balanced``.
    """
    preset = DEPTH_PRESETS.get(depth, DEPTH_PRESETS["balanced"]).copy()
    if max_iterations is not None:
        preset["max_iterations"] = max_iterations
    if token_budget is not None:
        preset["token_budget"] = token_budget
    if max_sources is not None:
        preset["max_sources"] = max_sources
    return ResearchConstraints(**preset)
 # ---------------------------------------------------------------------------
 # Output types — Citation
 # ---------------------------------------------------------------------------
--- a/researchers/web/server.py
+++ b/researchers/web/server.py
@ -1,98 +0,0 @@
 """MCP server for the web researcher.
 Exposes a single tool `research` that delegates to WebResearcher.
 Run with: python -m researchers.web.server
 """
 import asyncio
 import os
 import sys
 from typing import Optional
 from mcp.server.fastmcp import FastMCP
 from obs import configure_logging, get_logger
 from researchers.web.agent import WebResearcher
 from researchers.web.models import constraints_for_depth
 log = get_logger("marchwarden.mcp")
 mcp = FastMCP(
    name="marchwarden-web-researcher",
    instructions=(
        "A Marchwarden web research specialist. "
        "Call the research tool with a question to get a grounded, "
        "evidence-based answer with citations, gaps, open questions, "
        "and confidence scoring."
    ),
 )
 def _read_secret(key: str) -> str:
    """Read a secret from ~/secrets file."""
    secrets_path = os.path.expanduser("~/secrets")
    with open(secrets_path) as f:
        for line in f:
            if line.startswith(f"{key}="):
                return line.split("=", 1)[1].strip()
    raise ValueError(f"Key {key} not found in {secrets_path}")
 def _get_researcher() -> WebResearcher:
    """Create a WebResearcher with keys from ~/secrets."""
    return WebResearcher(
        anthropic_api_key=_read_secret("ANTHROPIC_API_KEY"),
        tavily_api_key=_read_secret("TAVILY_API_KEY"),
        model_id=os.environ.get("MARCHWARDEN_MODEL", "claude-sonnet-4-6"),
    )
@mcp.tool()
 async def research(
    question: str,
    context: Optional[str] = None,
    depth: str = "balanced",
    max_iterations: Optional[int] = None,
    token_budget: Optional[int] = None,
 ) -> str:
    """Research a question using web search and return a structured answer.
    Args:
        question: The question to investigate.
        context: What the caller already knows (optional).
        depth: Research depth — "shallow", "balanced", or "deep". Each
            depth picks default max_iterations / token_budget / max_sources.
        max_iterations: Override the depth preset for iterations (1-20).
        token_budget: Override the depth preset for token budget.
    Returns:
        JSON string containing the full ResearchResult with answer,
        citations, gaps, discovery_events, open_questions, confidence,
        and cost_metadata.
    """
    researcher = _get_researcher()
    constraints = constraints_for_depth(
        depth,
        max_iterations=max_iterations,
        token_budget=token_budget,
    )
    result = await researcher.research(
        question=question,
        context=context,
        depth=depth,
        constraints=constraints,
    )
    return result.model_dump_json(indent=2)
 def main():
    """Run the MCP server on stdio."""
    configure_logging()
    log.info("mcp_server_starting", transport="stdio", server="marchwarden-web-researcher")
    mcp.run(transport="stdio")
 if __name__ == "__main__":
    main()
--- a/researchers/web/trace.py
+++ b/researchers/web/trace.py
@ -14,43 +14,6 @@ import uuid
 from pathlib import Path
 from typing import Any, Optional
 from obs import get_logger
 # Actions that get promoted to INFO in the operational log. Everything
 # else logs at DEBUG so the default INFO level shows ~6-8 milestones per
 # research call instead of 20+ chatty per-step events. Set
 # MARCHWARDEN_LOG_LEVEL=DEBUG to see all steps.
 _INFO_ACTIONS = frozenset(
    {
        "start",
        "iteration_start",
        "synthesis_start",
        "synthesis_complete",
        "synthesis_error",
        "synthesis_build_error",
        "budget_exhausted",
        "complete",
    }
 )
 _log = get_logger("marchwarden.researcher.trace")
 # Action pairings for duration tracking. When a starter action fires
 # we record a monotonic start time keyed by the starter name. When the
 # matching completer fires we compute the elapsed duration and attach
 # it as a field on the completer's entry, then clear the start.
 #
 # Synthesis has two possible completers (success or error), both
 # pointing back to synthesis_start.
 _DURATION_PAIRS: dict[str, str] = {
    "web_search_complete": "web_search",
    "fetch_url_complete": "fetch_url",
    "synthesis_complete": "synthesis_start",
    "synthesis_error": "synthesis_start",
    "complete": "start",
 }
 _STARTER_ACTIONS = frozenset(_DURATION_PAIRS.values())
 class TraceLogger:
    """Logs research steps to a JSONL file.
@ -79,12 +42,8 @@ class TraceLogger:
        )
        self.trace_dir.mkdir(parents=True, exist_ok=True)
        self.file_path = self.trace_dir / f"{self.trace_id}.jsonl"
        self.result_path = self.trace_dir / f"{self.trace_id}.result.json"
        self._step_counter = 0
        self._file = None
        # action_name -> monotonic start time, populated by starter
        # actions and consumed by their matching completer (Issue #35).
        self._pending_starts: dict[str, float] = {}
    @property
    def _writer(self):
@ -113,61 +72,10 @@ class TraceLogger:
        }
        entry.update(kwargs)
        # Duration tracking (Issue #35). Record start times for starter
        # actions; when the matching completer fires, attach elapsed time
        # to both the trace entry and the operational log line.
        now = time.monotonic()
        if action in _STARTER_ACTIONS:
            self._pending_starts[action] = now
        duration_extras: dict[str, Any] = {}
        if action in _DURATION_PAIRS:
            starter = _DURATION_PAIRS[action]
            start = self._pending_starts.pop(starter, None)
            if start is not None:
                elapsed = now - start
                if action == "complete":
                    duration_extras["total_duration_sec"] = round(elapsed, 3)
                else:
                    duration_extras["duration_ms"] = int(elapsed * 1000)
        entry.update(duration_extras)
        self._writer.write(json.dumps(entry, default=str) + "\n")
        self._writer.flush()
        # Mirror the trace step into the operational logger so admins
        # can watch progress in real time. trace_id and researcher are
        # already bound in contextvars by WebResearcher.research, so
        # they automatically appear on every line.
        log_method = _log.info if action in _INFO_ACTIONS else _log.debug
        log_method(
            action,
            step=self._step_counter,
            decision=decision,
            **kwargs,
            **duration_extras,
        )
        return entry
    def write_result(self, result: Any) -> None:
        """Persist the final ResearchResult JSON next to the trace.
        Issue #54: the JSONL trace records step events and final counts
        only. Without the structured result on disk, replay can't show
        which gaps fired or which sources were cited, and downstream
        analysis (M3.2/M3.3) is impossible. We dump the pydantic model
        to ``<trace_id>.result.json`` so the full contract survives the
        process.
        """
        # Accept either a pydantic model or a plain dict to keep the
        # logger decoupled from the models module (avoids a circular
        # import path).
        if hasattr(result, "model_dump_json"):
            payload = result.model_dump_json(indent=2)
        else:
            payload = json.dumps(result, indent=2, default=str)
        self.result_path.write_text(payload, encoding="utf-8")
    def read_entries(self) -> list[dict]:
        """Read all entries from the trace file.
--- a/scripts/calibration_collect.py
+++ b/scripts/calibration_collect.py
@ -1,225 +0,0 @@
 """scripts/calibration_collect.py
 M3.3 Phase A: load every persisted ResearchResult under
 ~/.marchwarden/traces/*.result.json and emit a markdown rating worksheet
 to docs/stress-tests/M3.3-rating-worksheet.md.
 The worksheet has one row per run with the model's self-reported confidence
 and a blank `actual_rating` column for human review (Phase B). After rating
 is complete, scripts/calibration_analyze.py (Phase C) will load the same
 file with the rating column populated and compute calibration error.
 Usage:
    .venv/bin/python scripts/calibration_collect.py
 Optional env:
    TRACE_DIR — override default ~/.marchwarden/traces
    OUT       — override default docs/stress-tests/M3.3-rating-worksheet.md
 """
 from __future__ import annotations
 import json
 import os
 import sys
 from pathlib import Path
 REPO_ROOT = Path(__file__).resolve().parent.parent
 sys.path.insert(0, str(REPO_ROOT))
 from researchers.web.models import ResearchResult  # noqa: E402
 def _load_results(trace_dir: Path) -> list[tuple[Path, ResearchResult]]:
    """Load every <id>.result.json under trace_dir, sorted by mtime."""
    files = sorted(trace_dir.glob("*.result.json"), key=lambda p: p.stat().st_mtime)
    out: list[tuple[Path, ResearchResult]] = []
    for f in files:
        try:
            result = ResearchResult.model_validate_json(f.read_text(encoding="utf-8"))
        except Exception as exc:
            print(f"warning: skipping {f.name}: {exc}", file=sys.stderr)
            continue
        out.append((f, result))
    return out
 def _gap_summary(result: ResearchResult) -> str:
    """Render gap categories with counts, e.g. 'source_not_found(2), scope_exceeded(1)'."""
    if not result.gaps:
        return "—"
    counts: dict[str, int] = {}
    for g in result.gaps:
        cat = g.category.value if hasattr(g.category, "value") else str(g.category)
        counts[cat] = counts.get(cat, 0) + 1
    return ", ".join(f"{k}({v})" for k, v in sorted(counts.items()))
 def _category_map(runs_dir: Path) -> dict[str, str]:
    """Map trace_id -> category by parsing scripts/calibration_runner.sh log files.
    Each log file is named like ``01-factual.log`` and contains a final
    ``trace_id: <uuid>`` line emitted by the CLI.
    """
    out: dict[str, str] = {}
    if not runs_dir.exists():
        return out
    for log in runs_dir.glob("*.log"):
        # filename format: NN-category.log
        stem = log.stem
        parts = stem.split("-", 1)
        if len(parts) != 2:
            continue
        category = parts[1]
        try:
            text = log.read_text(encoding="utf-8")
        except Exception:
            continue
        # Find the last "trace_id: <uuid>" line
        trace_id = None
        for line in text.splitlines():
            if "trace_id:" in line:
                # Take the text after the last "trace_id:" marker on the line
                token = line.split("trace_id:")[-1].strip()
                # Keep only the first whitespace-delimited token (the UUID)
                token = token.split()[0] if token else ""
                # Strip rich [dim] markup tags; ANSI escapes are not handled here
                token = token.replace("[/dim]", "").replace("[dim]", "")
                if token:
                    trace_id = token
        if trace_id:
            out[trace_id] = category
    return out
 def _question_from_trace(trace_dir: Path, trace_id: str) -> str:
    """Recover the original question from the trace JSONL's `start` event."""
    jsonl = trace_dir / f"{trace_id}.jsonl"
    if not jsonl.exists():
        return "(question not recoverable — trace missing)"
    try:
        for line in jsonl.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("action") == "start":
                return entry.get("question", "(no question field)")
    except Exception as exc:
        return f"(parse error: {exc})"
    return "(no start event)"
 def _build_worksheet(
    rows: list[tuple[Path, ResearchResult]],
    trace_dir: Path,
    category_map: dict[str, str],
 ) -> str:
    """Render the markdown worksheet."""
    lines: list[str] = []
    lines.append("# M3.3 Calibration Rating Worksheet")
    lines.append("")
    lines.append("Issue: #46 (Phase B — human rating)")
    lines.append("")
    lines.append(
        "## How to use this worksheet"
    )
    lines.append("")
    lines.append(
        "For each run below, read the answer + citations from the persisted "
        "result file (path in the **Result file** column). Score the answer's "
        "*actual* correctness on a 0.0–1.0 scale, **independent** of the "
        "model's self-reported confidence. Fill in the **actual_rating** "
        "column. Add notes in the **notes** column for anything unusual."
    )
    lines.append("")
    lines.append("Rating rubric:")
    lines.append("")
    lines.append("- **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.")
    lines.append("- **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance.")
    lines.append("- **0.6** — Substantively right but with notable errors, missing context, or weak citations.")
    lines.append("- **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons.")
    lines.append("- **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing.")
    lines.append("- **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question.")
    lines.append("")
    lines.append("After rating all rows, save this file and run:")
    lines.append("")
    lines.append("```")
    lines.append(".venv/bin/python scripts/calibration_analyze.py")
    lines.append("```")
    lines.append("")
    lines.append(f"## Runs ({len(rows)} total)")
    lines.append("")
    lines.append(
        "| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |"
    )
    lines.append(
        "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|"
    )
    for i, (path, result) in enumerate(rows, 1):
        cf = result.confidence_factors
        cm = result.cost_metadata
        question = _question_from_trace(trace_dir, result.trace_id).replace("|", "\\|")
        # Truncate long questions for table readability
        if len(question) > 80:
            question = question[:77] + "..."
        gaps = _gap_summary(result).replace("|", "\\|")
        contradiction = "yes" if cf.contradiction_detected else "no"
        budget = "spent" if cf.budget_exhausted else "under"
        recency = cf.recency or "—"
        category = category_map.get(result.trace_id, "ad-hoc")
        lines.append(
            f"| {i} "
            f"| `{result.trace_id[:8]}` "
            f"| {category} "
            f"| {question} "
            f"| {result.confidence:.2f} "
            f"| {cf.num_corroborating_sources} "
            f"| {cf.source_authority} "
            f"| {contradiction} "
            f"| {budget} "
            f"| {recency} "
            f"| {gaps} "
            f"| {len(result.citations)} "
            f"| {len(result.discovery_events)} "
            f"| {cm.tokens_used} "
            f"|  "
            f"|  |"
        )
    lines.append("")
    lines.append("## Result files (full content for review)")
    lines.append("")
    for i, (path, result) in enumerate(rows, 1):
        lines.append(f"{i}. `{path}`")
    lines.append("")
    return "\n".join(lines)
 def main() -> int:
    trace_dir = Path(
        os.environ.get("TRACE_DIR", os.path.expanduser("~/.marchwarden/traces"))
    )
    out_path = Path(
        os.environ.get("OUT", REPO_ROOT / "docs/stress-tests/M3.3-rating-worksheet.md")
    )
    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows = _load_results(trace_dir)
    if not rows:
        print(f"No result files found under {trace_dir}", file=sys.stderr)
        return 1
    runs_dir = REPO_ROOT / "docs/stress-tests/M3.3-runs"
    category_map = _category_map(runs_dir)
    out_path.write_text(
        _build_worksheet(rows, trace_dir, category_map), encoding="utf-8"
    )
    print(f"Wrote {len(rows)}-row worksheet to {out_path}")
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())
--- a/scripts/calibration_runner.sh
+++ b/scripts/calibration_runner.sh
@ -1,67 +0,0 @@
 #!/usr/bin/env bash
 # scripts/calibration_runner.sh
 #
 # M3.3 Phase A: run a fixed set of 20 balanced-depth calibration queries.
 # Each run writes a trace JSONL and a result.json under ~/.marchwarden/traces/.
 # This script keeps no state and is not idempotent: re-running it will produce
 # 20 NEW traces. Don't re-run unless you want fresh data.
 #
 # Categories (5 each):
 #   - factual: single verifiable answer
 #   - comparative: X vs Y across some dimension
 #   - contradiction-prone: contested topics, sources disagree
 #   - scope-edge: niche, proprietary, or expert-only knowledge
 set -euo pipefail
 cd "$(dirname "$0")/.."
 PY=".venv/bin/python"
 LOG_DIR="docs/stress-tests/M3.3-runs"
 mkdir -p "$LOG_DIR"
 declare -a QUERIES=(
  # factual
  "factual|01|What is the boiling point of liquid nitrogen at standard atmospheric pressure?"
  "factual|02|When did the James Webb Space Telescope launch?"
  "factual|03|What programming language is the Linux kernel primarily written in?"
  "factual|04|What is the capital of Mongolia?"
  "factual|05|How many amino acids are encoded by the standard genetic code?"
  # comparative
  "comparative|06|Compare the energy density of lithium-ion vs sodium-ion batteries."
  "comparative|07|Compare PostgreSQL and SQLite for embedded analytics workloads."
  "comparative|08|Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing."
  "comparative|09|Compare React and Vue for large enterprise frontends in 2026."
  "comparative|10|Compare wind and solar capacity factors in the continental United States."
  # contradiction-prone
  "contradiction|11|Is red wine good for cardiovascular health?"
  "contradiction|12|Does intermittent fasting extend lifespan in humans?"
  "contradiction|13|Are nuclear power plants safe?"
  "contradiction|14|Is dietary cholesterol harmful?"
  "contradiction|15|Does screen time harm child development?"
  # scope-edge
  "scope|16|What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?"
  "scope|17|What is the actual operational doctrine of Chinese DF-41 ICBM brigades?"
  "scope|18|What internal compensation bands does Goldman Sachs use for VPs in 2026?"
  "scope|19|How does Renaissance Technologies Medallion Fund actually generate alpha?"
  "scope|20|What are the precise materials and tolerances in TSMC's 2nm process?"
 )
 echo "Running ${#QUERIES[@]} calibration queries at depth=balanced..."
 echo "Output dir: $LOG_DIR"
 echo
 for entry in "${QUERIES[@]}"; do
  IFS='|' read -r category num question <<<"$entry"
  log_file="$LOG_DIR/${num}-${category}.log"
  echo "[$num/$category] $question"
  if "$PY" -m cli.main ask "$question" --depth balanced >"$log_file" 2>&1; then
    trace_id=$(grep -oE 'trace_id: [a-f0-9-]+' "$log_file" | tail -1 | awk '{print $2}')
    echo "    -> $trace_id"
  else
    echo "    !! FAILED — see $log_file"
  fi
 done
 echo
 echo "Done. Result files at ~/.marchwarden/traces/*.result.json"
--- a/scripts/docker-test.sh
+++ b/scripts/docker-test.sh
@ -1,68 +0,0 @@
 #!/usr/bin/env bash
 # Helper for the dockerized test/run environment.
 #
 # Usage:
 #   scripts/docker-test.sh build         Build the image
 #   scripts/docker-test.sh test          Run pytest in the container
 #   scripts/docker-test.sh ask "..."     Run `marchwarden ask` end-to-end
 #                                        (mounts ~/secrets ro and ~/.marchwarden rw)
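 #   scripts/docker-test.sh replay ...    Run `marchwarden replay` in the container
 #                                        (mounts ~/.marchwarden rw)
 #   scripts/docker-test.sh shell         Drop into a bash shell in the container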
 set -euo pipefail
 IMAGE="marchwarden-test"
 ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 cmd="${1:-test}"
 shift || true
 case "$cmd" in
  build)
    docker build -t "$IMAGE" "$ROOT"
    ;;
  test)
    docker run --rm -v "$ROOT:/app" "$IMAGE" pytest -q "$@"
    ;;
  ask)
    if [ ! -f "$HOME/secrets" ]; then
      echo "error: ~/secrets not found on host" >&2
      exit 1
    fi
    mkdir -p "$HOME/.marchwarden/traces"
    tty_flag=""
    if [ -t 0 ] && [ -t 1 ]; then tty_flag="-it"; fi
    env_flag=""
    if [ -n "${MARCHWARDEN_MODEL:-}" ]; then
      env_flag="-e MARCHWARDEN_MODEL=$MARCHWARDEN_MODEL"
    fi
    docker run --rm $tty_flag $env_flag \
      -v "$ROOT:/app" \
      -v "$HOME/secrets:/root/secrets:ro" \
      -v "$HOME/.marchwarden:/root/.marchwarden" \
      "$IMAGE" marchwarden ask "$@"
    ;;
  replay)
    mkdir -p "$HOME/.marchwarden/traces"
    docker run --rm \
      -v "$ROOT:/app" \
      -v "$HOME/.marchwarden:/root/.marchwarden" \
      "$IMAGE" marchwarden replay "$@"
    ;;
  shell)
    docker run --rm -it \
      -v "$ROOT:/app" \
      -v "$HOME/secrets:/root/secrets:ro" \
      -v "$HOME/.marchwarden:/root/.marchwarden" \
      "$IMAGE" bash
    ;;
  *)
    echo "unknown command: $cmd" >&2
    echo "usage: $0 {build|test|ask|replay|shell}" >&2
    exit 1
    ;;
 esac
--- a/tests/test_agent.py
+++ b/tests/test_agent.py
@ -213,30 +213,6 @@ class TestWebResearcher:
            assert result.cost_metadata.tokens_used > 0
            assert result.trace_id is not None
            # Issue #54 (a): full result is persisted next to the trace
            from pathlib import Path
            result_file = Path(tmp) / f"{result.trace_id}.result.json"
            assert result_file.exists()
            persisted = ResearchResult.model_validate_json(
                result_file.read_text()
            )
            assert persisted.answer == result.answer
            assert len(persisted.gaps) == 1
            assert persisted.gaps[0].topic == "pest management"
            # Issue #54 (b): per-item events are emitted in the trace
            trace_file = Path(tmp) / f"{result.trace_id}.jsonl"
            entries = [
                json.loads(line) for line in trace_file.read_text().splitlines() if line
            ]
            actions = [e["action"] for e in entries]
            assert "gap_recorded" in actions
            assert "citation_recorded" in actions
            assert "discovery_recorded" in actions
            gap_event = next(e for e in entries if e["action"] == "gap_recorded")
            assert gap_event["category"] == "source_not_found"
            assert gap_event["topic"] == "pest management"
    @pytest.mark.asyncio
    async def test_budget_exhaustion(self):
        """Test that the loop stops when token budget is reached."""
--- a/tests/test_arxiv_ingest.py
+++ b/tests/test_arxiv_ingest.py
@ -1,415 +0,0 @@
 """Tests for the arxiv-rag ingest pipeline (M5.1.1).
 Strategy: mock the slow / network bits (arxiv API, embedder, chromadb)
 and exercise the real pipeline against synthetic PDFs generated with
 pymupdf at test time. This keeps the tests deterministic, fast, and
 network-free while still exercising the actual extract_sections logic.
 """
 from __future__ import annotations
 import json
 from datetime import datetime, timezone
 from pathlib import Path
 from types import SimpleNamespace
 from unittest.mock import MagicMock
 import pytest
 from researchers.arxiv import ingest as ingest_mod
 from researchers.arxiv.ingest import (
    PaperMetadata,
    Section,
    embed_and_store,
    extract_sections,
    ingest,
 )
 from researchers.arxiv.store import ArxivStore, PaperRecord, make_chunk_id
 # ---------------------------------------------------------------------------
 # Fixtures
 # ---------------------------------------------------------------------------
 def _make_synthetic_pdf(path: Path, sections: list[tuple[str, str]]) -> None:
    """Build a tiny PDF with one section per (heading, body) tuple.
    pymupdf is already a hard dep of the arxiv extra, so synthesizing a
    fixture PDF inline is cheaper than checking a binary into the repo.
    """
    import pymupdf
    doc = pymupdf.open()
    for heading, body in sections:
        page = doc.new_page()
        page.insert_text((50, 80), heading, fontsize=14)
        # Wrap body across a few lines for realism
        y = 110
        for line in body.split("\n"):
            page.insert_text((50, y), line, fontsize=11)
            y += 16
    doc.save(str(path))
    doc.close()
@pytest.fixture
 def store(tmp_path):
    """ArxivStore rooted in a temp directory."""
    return ArxivStore(root=tmp_path / "arxiv-rag")
 class StubEmbedder:
    """Minimal stand-in for sentence-transformers.SentenceTransformer."""
    def __init__(self, dim: int = 4):
        self.dim = dim
        self.calls: list[list[str]] = []
    def encode(self, texts):
        self.calls.append(list(texts))
        # Return deterministic vectors keyed off text length so two
        # different sections produce two different embeddings.
        return [[float(len(t)), 0.0, 0.0, 0.0] for t in texts]
 class StubChromaCollection:
    """In-memory drop-in for a chromadb collection."""
    def __init__(self):
        self.docs: dict[str, dict] = {}
    def upsert(self, ids, documents, embeddings, metadatas):
        for i, doc, emb, meta in zip(ids, documents, embeddings, metadatas):
            self.docs[i] = {"document": doc, "embedding": emb, "metadata": meta}
    def get(self, where=None):
        if where is None:
            ids = list(self.docs.keys())
        else:
            ids = [
                i
                for i, entry in self.docs.items()
                if all(entry["metadata"].get(k) == v for k, v in where.items())
            ]
        return {"ids": ids}
    def delete(self, where):
        to_drop = [
            i
            for i, entry in self.docs.items()
            if all(entry["metadata"].get(k) == v for k, v in where.items())
        ]
        for i in to_drop:
            del self.docs[i]
@pytest.fixture
 def stub_collection(monkeypatch):
    """Replace ArxivStore.collection with an in-memory stub."""
    stub = StubChromaCollection()
    monkeypatch.setattr(
        ArxivStore, "collection", property(lambda self: stub)
    )
    return stub
 # ---------------------------------------------------------------------------
 # extract_sections — real pymupdf, synthetic PDFs
 # ---------------------------------------------------------------------------
 class TestExtractSections:
    def test_detects_canonical_headings(self, tmp_path):
        pdf = tmp_path / "paper.pdf"
        _make_synthetic_pdf(
            pdf,
            [
                ("Introduction", "We study X. We find Y."),
                ("Methods", "We used Z to evaluate Y."),
                ("Results", "Accuracy was 95%."),
                ("Conclusion", "X works."),
            ],
        )
        sections = extract_sections(pdf)
        titles = [s.title.lower() for s in sections]
        assert "introduction" in titles
        assert "methods" in titles
        assert "results" in titles
        assert "conclusion" in titles
        # Body text from each section should be present
        intro = next(s for s in sections if s.title.lower() == "introduction")
        assert "we study x" in intro.text.lower()
    def test_falls_back_to_whole_paper_when_no_headings(self, tmp_path):
        pdf = tmp_path / "no-headings.pdf"
        _make_synthetic_pdf(
            pdf,
            [
                ("Some random title nobody recognizes", "Body text body text."),
            ],
        )
        sections = extract_sections(pdf)
        assert len(sections) == 1
        assert sections[0].title == "Full Paper"
        assert "body text" in sections[0].text.lower()
 # ---------------------------------------------------------------------------
 # embed_and_store — uses stub collection + stub embedder
 # ---------------------------------------------------------------------------
 class TestEmbedAndStore:
    def test_writes_chunks_and_returns_count(self, store, stub_collection):
        sections = [
            Section(index=0, title="Intro", text="aaa", page_start=1, page_end=1),
            Section(index=1, title="Methods", text="bbbb", page_start=2, page_end=2),
        ]
        meta = PaperMetadata(arxiv_id="2403.12345", version="v1", title="Test")
        n = embed_and_store(
            arxiv_id="2403.12345",
            sections=sections,
            store=store,
            model_name="stub-model",
            metadata=meta,
            embedder=StubEmbedder(),
        )
        assert n == 2
        assert len(stub_collection.docs) == 2
        # Check that chunk ids are model-scoped
        expected_ids = {make_chunk_id("2403.12345", i, "stub-model") for i in (0, 1)}
        assert set(stub_collection.docs.keys()) == expected_ids
        # Metadata round-trips
        first = next(iter(stub_collection.docs.values()))
        assert first["metadata"]["arxiv_id"] == "2403.12345"
        assert first["metadata"]["embedding_model"] == "stub-model"
    def test_re_embed_replaces_existing_chunks(self, store, stub_collection):
        meta = PaperMetadata(arxiv_id="2403.12345", version="v1", title="Test")
        sections_v1 = [
            Section(index=0, title="Intro", text="first", page_start=1, page_end=1),
            Section(index=1, title="Methods", text="second", page_start=2, page_end=2),
        ]
        embed_and_store(
            "2403.12345", sections_v1, store, "stub-model", meta,
            embedder=StubEmbedder(),
        )
        assert len(stub_collection.docs) == 2
        # Re-embed with fewer sections — should drop the second.
        sections_v2 = [
            Section(index=0, title="Intro", text="first", page_start=1, page_end=1),
        ]
        embed_and_store(
            "2403.12345", sections_v2, store, "stub-model", meta,
            embedder=StubEmbedder(),
        )
        assert len(stub_collection.docs) == 1
    def test_empty_sections_is_noop(self, store, stub_collection):
        meta = PaperMetadata(arxiv_id="x", version="", title="")
        n = embed_and_store("x", [], store, "stub-model", meta, embedder=StubEmbedder())
        assert n == 0
        assert stub_collection.docs == {}
 # ---------------------------------------------------------------------------
 # Top-level ingest() — full pipeline with mocked download
 # ---------------------------------------------------------------------------
 def _stub_arxiv_search(arxiv_id: str):
    """Return a fake arxiv.Search result for ``arxiv_id``."""
    def _download_pdf(dirpath=None, filename=None):
        # Generate a synthetic PDF on the fly so the rest of the
        # pipeline has something real to read.
        target = Path(dirpath) / filename
        _make_synthetic_pdf(
            target,
            [
                ("Introduction", "Stub paper introduction."),
                ("Methods", "Stub paper methods."),
                ("Results", "Stub paper results."),
            ],
        )
    paper = SimpleNamespace(
        entry_id=f"http://arxiv.org/abs/{arxiv_id}v1",
        title=f"Test paper {arxiv_id}",
        authors=[SimpleNamespace(name="Alice"), SimpleNamespace(name="Bob")],
        published=datetime(2024, 1, 15, tzinfo=timezone.utc),
        primary_category="cs.LG",
        download_pdf=_download_pdf,
    )
    return [paper]
 class TestIngest:
    def test_end_to_end(self, store, stub_collection):
        record = ingest(
            "2403.12345",
            store=store,
            model_name="stub-model",
            arxiv_search=_stub_arxiv_search,
            embedder=StubEmbedder(),
        )
        # Manifest entry
        assert isinstance(record, PaperRecord)
        assert record.arxiv_id == "2403.12345"
        assert record.title == "Test paper 2403.12345"
        assert record.authors == ["Alice", "Bob"]
        assert record.year == 2024
        assert record.category == "cs.LG"
        assert record.chunks_indexed >= 1
        assert record.embedding_model == "stub-model"
        # Manifest persisted to disk
        loaded = store.load_manifest()
        assert "2403.12345" in loaded
        assert loaded["2403.12345"].chunks_indexed == record.chunks_indexed
        # PDF cached
        assert (store.pdfs_dir / "2403.12345.pdf").exists()
        # Chunks in stub collection
        assert len(stub_collection.docs) == record.chunks_indexed
    def test_idempotent_reingest(self, store, stub_collection):
        first = ingest(
            "2403.12345",
            store=store,
            model_name="stub-model",
            arxiv_search=_stub_arxiv_search,
            embedder=StubEmbedder(),
        )
        chunks_after_first = len(stub_collection.docs)
        second = ingest(
            "2403.12345",
            store=store,
            model_name="stub-model",
            arxiv_search=_stub_arxiv_search,
            embedder=StubEmbedder(),
        )
        # Same number of chunks (replace, not append)
        assert len(stub_collection.docs) == chunks_after_first
        assert second.chunks_indexed == first.chunks_indexed
    def test_unknown_arxiv_id_raises(self, store):
        with pytest.raises(ValueError, match="not found"):
            ingest(
                "9999.99999",
                store=store,
                model_name="stub-model",
                arxiv_search=lambda _id: [],
                embedder=StubEmbedder(),
            )
 # ---------------------------------------------------------------------------
 # Manifest CRUD via ArxivStore
 # ---------------------------------------------------------------------------
 class TestManifest:
    def test_load_returns_empty_dict_when_missing(self, store):
        assert store.load_manifest() == {}
    def test_round_trip(self, store):
        rec = PaperRecord(
            arxiv_id="2401.00001",
            version="v2",
            title="Round trip test",
            authors=["A", "B"],
            year=2024,
            category="cs.AI",
            chunks_indexed=7,
            embedding_model="m",
        )
        store.upsert_paper(rec)
        loaded = store.load_manifest()
        assert "2401.00001" in loaded
        assert loaded["2401.00001"].title == "Round trip test"
        assert loaded["2401.00001"].chunks_indexed == 7
    def test_remove_paper(self, store):
        rec = PaperRecord(
            arxiv_id="2401.00001",
            version="",
            title="t",
            authors=[],
            year=None,
            category=None,
            chunks_indexed=0,
            embedding_model="m",
        )
        store.upsert_paper(rec)
        assert store.remove_paper("2401.00001") is True
        assert store.load_manifest() == {}
        assert store.remove_paper("2401.00001") is False
    def test_list_sorted_newest_first(self, store):
        old = PaperRecord(
            arxiv_id="old",
            version="",
            title="old",
            authors=[],
            year=None,
            category=None,
            chunks_indexed=0,
            embedding_model="m",
            added_at="2020-01-01T00:00:00Z",
        )
        new = PaperRecord(
            arxiv_id="new",
            version="",
            title="new",
            authors=[],
            year=None,
            category=None,
            chunks_indexed=0,
            embedding_model="m",
            added_at="2026-01-01T00:00:00Z",
        )
        store.upsert_paper(old)
        store.upsert_paper(new)
        listed = store.list_papers()
        assert [p.arxiv_id for p in listed] == ["new", "old"]
 # ---------------------------------------------------------------------------
 # CLI smoke (without actually calling chromadb)
 # ---------------------------------------------------------------------------
 class TestArxivCLI:
    def test_list_empty(self, tmp_path, monkeypatch):
        from click.testing import CliRunner
        from cli.main import cli
        monkeypatch.setattr(
            "researchers.arxiv.store.DEFAULT_ROOT",
            tmp_path / "arxiv-rag",
        )
        runner = CliRunner()
        result = runner.invoke(cli, ["arxiv", "list"])
        assert result.exit_code == 0, result.output
        assert "No papers indexed" in result.output
    def test_info_missing(self, tmp_path, monkeypatch):
        from click.testing import CliRunner
        from cli.main import cli
        monkeypatch.setattr(
            "researchers.arxiv.store.DEFAULT_ROOT",
            tmp_path / "arxiv-rag",
        )
        runner = CliRunner()
        result = runner.invoke(cli, ["arxiv", "info", "0000.00000"])
        assert result.exit_code == 1
        assert "Not indexed" in result.output
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@ -1,392 +0,0 @@
 """Tests for the marchwarden CLI."""
 from unittest.mock import patch
 from click.testing import CliRunner
 from cli.main import cli, render_costs, render_result, render_trace
 from researchers.web.models import (
    Citation,
    ConfidenceFactors,
    CostMetadata,
    DiscoveryEvent,
    Gap,
    GapCategory,
    OpenQuestion,
    ResearchResult,
 )
 from rich.console import Console
 def _fixture_result() -> ResearchResult:
    return ResearchResult(
        answer="Tomatoes, peppers, squash, and beans grow well in Utah.",
        citations=[
            Citation(
                source="web",
                locator="https://extension.usu.edu/yard-and-garden",
                title="USU Extension — Yard and Garden",
                snippet="USU recommends warm-season crops for Utah's climate.",
                raw_excerpt="Tomatoes, peppers, and squash thrive in Utah summers.",
                confidence=0.9,
            ),
        ],
        gaps=[
            Gap(
                topic="Microclimate variation",
                category=GapCategory.SCOPE_EXCEEDED,
                detail="Did not investigate elevation-specific recommendations.",
            ),
        ],
        discovery_events=[
            DiscoveryEvent(
                type="related_research",
                suggested_researcher="docs",
                query="Utah USDA hardiness zones",
                reason="Zone-specific guidance would improve answer.",
            ),
        ],
        open_questions=[
            OpenQuestion(
                question="What are the best cool-season crops?",
                context="Answer focused on warm-season crops.",
                priority="medium",
            ),
        ],
        confidence=0.82,
        confidence_factors=ConfidenceFactors(
            num_corroborating_sources=3,
            source_authority="high",
            contradiction_detected=False,
            query_specificity_match=0.85,
            budget_exhausted=False,
            recency="current",
        ),
        cost_metadata=CostMetadata(
            tokens_used=4321,
            iterations_run=3,
            wall_time_sec=12.5,
            budget_exhausted=False,
            model_id="claude-sonnet-4-6",
        ),
        trace_id="trace-abc-123",
    )
 class TestRenderResult:
    def test_renders_all_sections(self):
        console = Console(record=True, width=120)
        render_result(_fixture_result(), console)
        out = console.export_text()
        assert "Tomatoes" in out
        assert "USU Extension" in out
        assert "scope_exceeded" in out
        assert "related_research" in out
        assert "cool-season" in out
        assert "Confidence" in out
        assert "claude-sonnet-4-6" in out
        assert "trace-abc-123" in out
 class TestAskCommand:
    def test_ask_invokes_mcp_and_renders(self):
        runner = CliRunner()
        fixture = _fixture_result()
        async def fake_call(question, depth, max_iterations, token_budget):
            assert question == "What grows in Utah?"
            assert depth == "shallow"
            assert max_iterations == 2
            assert token_budget == 5000
            return fixture
        with patch("cli.main.call_research_tool", side_effect=fake_call):
            result = runner.invoke(
                cli,
                [
                    "ask",
                    "What grows in Utah?",
                    "--depth",
                    "shallow",
                    "--max-iterations",
                    "2",
                    "--budget",
                    "5000",
                ],
            )
        assert result.exit_code == 0, result.output
        assert "Tomatoes" in result.output
        assert "trace-abc-123" in result.output
    def test_ask_handles_error(self):
        runner = CliRunner()
        async def boom(**kwargs):
            raise RuntimeError("mcp went sideways")
        with patch("cli.main.call_research_tool", side_effect=boom):
            result = runner.invoke(cli, ["ask", "anything"])
        assert result.exit_code == 1
        assert "mcp went sideways" in result.output
 class TestReplayCommand:
    def _write_trace(self, tmp_path, trace_id="trace-xyz"):
        path = tmp_path / f"{trace_id}.jsonl"
        path.write_text(
            '{"step": 1, "action": "search", "decision": "initial query", '
            '"timestamp": "2026-04-08T00:00:00Z", "query": "utah crops"}\n'
            '{"step": 2, "action": "fetch_url", "decision": "promising source", '
            '"timestamp": "2026-04-08T00:00:01Z", "url": "https://example.com", '
            '"content_hash": "sha256:deadbeef"}\n'
            '{"step": 3, "action": "synthesize", "decision": "have enough", '
            '"timestamp": "2026-04-08T00:00:02Z"}\n'
        )
        return path
    def test_replay_renders_trace(self, tmp_path):
        runner = CliRunner()
        self._write_trace(tmp_path)
        result = runner.invoke(
            cli,
            ["replay", "trace-xyz", "--trace-dir", str(tmp_path)],
        )
        assert result.exit_code == 0, result.output
        assert "trace-xyz" in result.output
        assert "search" in result.output
        assert "fetch_url" in result.output
        assert "synthesize" in result.output
        assert "sha256:deadbeef" in result.output
        assert "utah crops" in result.output
    def test_replay_unknown_trace_id(self, tmp_path):
        runner = CliRunner()
        result = runner.invoke(
            cli,
            ["replay", "missing-id", "--trace-dir", str(tmp_path)],
        )
        assert result.exit_code == 1
        assert "no trace file found" in result.output
    def test_replay_invalid_json(self, tmp_path):
        runner = CliRunner()
        (tmp_path / "broken.jsonl").write_text("{not json}\n")
        result = runner.invoke(
            cli,
            ["replay", "broken", "--trace-dir", str(tmp_path)],
        )
        assert result.exit_code == 1
        assert "invalid JSON" in result.output
    def test_replay_renders_persisted_result(self, tmp_path):
        """Issue #54: replay loads <id>.result.json sibling and renders it."""
        runner = CliRunner()
        self._write_trace(tmp_path)
        result_payload = {
            "answer": "Test answer about Utah crops.",
            "citations": [
                {
                    "source": "web",
                    "locator": "https://example.com/utah",
                    "title": "Utah Guide",
                    "snippet": None,
                    "raw_excerpt": "raw excerpt content",
                    "confidence": 0.9,
                }
            ],
            "gaps": [
                {
                    "topic": "irrigation",
                    "category": "scope_exceeded",
                    "detail": "out of scope",
                }
            ],
            "discovery_events": [],
            "open_questions": [],
            "confidence": 0.8,
            "confidence_factors": {
                "num_corroborating_sources": 2,
                "source_authority": "high",
                "contradiction_detected": False,
                "query_specificity_match": 0.8,
                "budget_exhausted": False,
                "recency": "current",
            },
            "cost_metadata": {
                "tokens_used": 1000,
                "iterations_run": 2,
                "wall_time_sec": 12.5,
                "budget_exhausted": False,
                "model_id": "claude-test",
            },
            "trace_id": "trace-xyz",
        }
        import json as _j
        (tmp_path / "trace-xyz.result.json").write_text(_j.dumps(result_payload))
        result = runner.invoke(
            cli,
            ["replay", "trace-xyz", "--trace-dir", str(tmp_path)],
        )
        assert result.exit_code == 0, result.output
        # Step log still rendered
        assert "search" in result.output
        # Persisted result also rendered
        assert "Test answer about Utah crops" in result.output
        assert "scope_exceeded" in result.output
        assert "irrigation" in result.output
    def test_replay_without_result_file_notes_absence(self, tmp_path):
        runner = CliRunner()
        self._write_trace(tmp_path)
        result = runner.invoke(
            cli,
            ["replay", "trace-xyz", "--trace-dir", str(tmp_path)],
        )
        assert result.exit_code == 0
        assert "No persisted result file" in result.output
    def test_render_trace_empty(self):
        console = Console(record=True, width=120)
        render_trace([], "empty-trace", console)
        out = console.export_text()
        assert "empty-trace" in out
        assert "empty" in out.lower()
 # ---------------------------------------------------------------------------
 # costs command
 # ---------------------------------------------------------------------------
 import json as _json
 def _write_ledger(path, entries):
    path.write_text("\n".join(_json.dumps(e) for e in entries) + "\n")
 def _ledger_fixture(tmp_path):
    path = tmp_path / "costs.jsonl"
    entries = [
        {
            "timestamp": "2026-04-06T10:00:00Z",
            "trace_id": "t1",
            "question": "What is X?",
            "model_id": "claude-sonnet-4-6",
            "tokens_used": 1000,
            "tokens_input": 800,
            "tokens_output": 200,
            "iterations_run": 1,
            "wall_time_sec": 5.0,
            "tavily_searches": 1,
            "estimated_cost_usd": 0.005,
            "budget_exhausted": False,
            "confidence": 0.9,
        },
        {
            "timestamp": "2026-04-07T11:00:00Z",
            "trace_id": "t2",
            "question": "Bigger query",
            "model_id": "claude-opus-4-6",
            "tokens_used": 50000,
            "tokens_input": 40000,
            "tokens_output": 10000,
            "iterations_run": 5,
            "wall_time_sec": 120.0,
            "tavily_searches": 8,
            "estimated_cost_usd": 1.25,
            "budget_exhausted": True,
            "confidence": 0.7,
        },
        {
            "timestamp": "2026-04-08T12:00:00Z",
            "trace_id": "t3",
            "question": "Unknown model run",
            "model_id": "future-model-7",
            "tokens_used": 500,
            "tokens_input": 400,
            "tokens_output": 100,
            "iterations_run": 1,
            "wall_time_sec": 2.0,
            "tavily_searches": 0,
            "estimated_cost_usd": None,
            "budget_exhausted": False,
            "confidence": 0.5,
        },
    ]
    _write_ledger(path, entries)
    return path
 class TestCostsCommand:
    def test_renders_summary(self, tmp_path):
        path = _ledger_fixture(tmp_path)
        runner = CliRunner()
        result = runner.invoke(cli, ["costs", "--ledger", str(path)])
        assert result.exit_code == 0, result.output
        # Summary
        assert "Calls: 3" in result.output
        assert "$1.2550" in result.output
        # Per-day rows
        assert "2026-04-06" in result.output
        assert "2026-04-07" in result.output
        assert "2026-04-08" in result.output
        # Per-model rows
        assert "claude-sonnet-4-6" in result.output
        assert "claude-opus-4-6" in result.output
        # Highest-cost panel
        assert "t2" in result.output
        # Unknown model warning
        assert "unknown model price" in result.output
    def test_filter_by_model(self, tmp_path):
        path = _ledger_fixture(tmp_path)
        runner = CliRunner()
        result = runner.invoke(
            cli,
            ["costs", "--ledger", str(path), "--model", "claude-opus-4-6"],
        )
        assert result.exit_code == 0
        assert "Calls: 1" in result.output
        assert "claude-sonnet-4-6" not in result.output
    def test_filter_by_since_iso(self, tmp_path):
        path = _ledger_fixture(tmp_path)
        runner = CliRunner()
        result = runner.invoke(
            cli,
            ["costs", "--ledger", str(path), "--since", "2026-04-08"],
        )
        assert result.exit_code == 0
        assert "Calls: 1" in result.output
        assert "future-model-7" in result.output
        assert "claude-sonnet-4-6" not in result.output
    def test_json_output(self, tmp_path):
        path = _ledger_fixture(tmp_path)
        runner = CliRunner()
        result = runner.invoke(
            cli,
            ["costs", "--ledger", str(path), "--json"],
        )
        assert result.exit_code == 0
        lines = [l for l in result.output.strip().splitlines() if l]
        assert len(lines) == 3
        first = _json.loads(lines[0])
        assert first["trace_id"] == "t1"
    def test_empty_ledger(self, tmp_path):
        path = tmp_path / "missing.jsonl"
        runner = CliRunner()
        result = runner.invoke(cli, ["costs", "--ledger", str(path)])
        assert result.exit_code == 0
        assert "No cost data yet" in result.output
    def test_render_costs_handles_empty(self):
        console = Console(record=True, width=120)
        render_costs([], console)
        out = console.export_text()
        assert "No cost data yet" in out
--- a/tests/test_costs.py
+++ b/tests/test_costs.py
@ -1,179 +0,0 @@
 """Tests for the obs.costs cost ledger and price table."""
 import json
 from pathlib import Path
 import pytest
 from obs.costs import (
    DEFAULT_PRICES_PATH,
    SEED_PRICES_TOML,
    CostLedger,
    PriceTable,
 )
 class TestPriceTable:
    def test_seeds_missing_file(self, tmp_path):
        prices_path = tmp_path / "prices.toml"
        assert not prices_path.exists()
        table = PriceTable(path=str(prices_path))
        assert prices_path.exists()
        assert "claude-sonnet-4-6" in prices_path.read_text()
        # Loaded into memory
        assert table._data["models"]["claude-sonnet-4-6"]["input_per_mtok_usd"] == 3.00
    def test_does_not_overwrite_existing_file(self, tmp_path):
        prices_path = tmp_path / "prices.toml"
        prices_path.write_text(
            '[models."custom-model"]\n'
            'input_per_mtok_usd = 1.23\n'
            'output_per_mtok_usd = 4.56\n'
        )
        table = PriceTable(path=str(prices_path))
        assert table._data["models"]["custom-model"]["input_per_mtok_usd"] == 1.23
        assert "claude-sonnet-4-6" not in table._data.get("models", {})
    def test_estimates_known_model(self, tmp_path):
        table = PriceTable(path=str(tmp_path / "prices.toml"))
        # 1M input @ $3 + 1M output @ $15 = $18, no tavily
        cost = table.estimate_call_usd(
            model_id="claude-sonnet-4-6",
            tokens_input=1_000_000,
            tokens_output=1_000_000,
            tavily_searches=0,
        )
        assert cost == 18.00
    def test_estimates_with_tavily(self, tmp_path):
        table = PriceTable(path=str(tmp_path / "prices.toml"))
        cost = table.estimate_call_usd(
            model_id="claude-sonnet-4-6",
            tokens_input=0,
            tokens_output=0,
            tavily_searches=10,
        )
        # 10 * $0.005 = $0.05
        assert cost == 0.05
    def test_unknown_model_returns_none(self, tmp_path):
        table = PriceTable(path=str(tmp_path / "prices.toml"))
        cost = table.estimate_call_usd(
            model_id="some-future-model",
            tokens_input=1000,
            tokens_output=1000,
            tavily_searches=0,
        )
        assert cost is None
 class TestCostLedger:
    def _ledger(self, tmp_path):
        return CostLedger(
            ledger_path=str(tmp_path / "costs.jsonl"),
            price_table=PriceTable(path=str(tmp_path / "prices.toml")),
        )
    def test_record_writes_jsonl(self, tmp_path):
        ledger = self._ledger(tmp_path)
        entry = ledger.record(
            trace_id="abc-123",
            question="What grows in Utah?",
            model_id="claude-sonnet-4-6",
            tokens_used=10_000,
            tokens_input=8_000,
            tokens_output=2_000,
            iterations_run=3,
            wall_time_sec=42.5,
            tavily_searches=4,
            budget_exhausted=False,
            confidence=0.9,
        )
        # File contains one JSON line
        lines = (tmp_path / "costs.jsonl").read_text().strip().splitlines()
        assert len(lines) == 1
        on_disk = json.loads(lines[0])
        assert on_disk == entry
        # All required fields present and shaped correctly
        assert on_disk["trace_id"] == "abc-123"
        assert on_disk["question"] == "What grows in Utah?"
        assert on_disk["model_id"] == "claude-sonnet-4-6"
        assert on_disk["tokens_used"] == 10_000
        assert on_disk["tokens_input"] == 8_000
        assert on_disk["tokens_output"] == 2_000
        assert on_disk["iterations_run"] == 3
        assert on_disk["wall_time_sec"] == 42.5
        assert on_disk["tavily_searches"] == 4
        assert on_disk["budget_exhausted"] is False
        assert on_disk["confidence"] == 0.9
        assert "timestamp" in on_disk
        # 8000 input @ $3/Mtok + 2000 output @ $15/Mtok + 4 * $0.005 = $0.074
        assert on_disk["estimated_cost_usd"] == pytest.approx(0.074, abs=1e-6)
    def test_record_appends(self, tmp_path):
        ledger = self._ledger(tmp_path)
        for i in range(3):
            ledger.record(
                trace_id=f"trace-{i}",
                question=f"q{i}",
                model_id="claude-sonnet-4-6",
                tokens_used=100,
                tokens_input=80,
                tokens_output=20,
                iterations_run=1,
                wall_time_sec=1.0,
                tavily_searches=0,
                budget_exhausted=False,
                confidence=0.5,
            )
        lines = (tmp_path / "costs.jsonl").read_text().strip().splitlines()
        assert len(lines) == 3
        assert json.loads(lines[0])["trace_id"] == "trace-0"
        assert json.loads(lines[2])["trace_id"] == "trace-2"
    def test_unknown_model_records_null_cost(self, tmp_path):
        ledger = self._ledger(tmp_path)
        entry = ledger.record(
            trace_id="abc",
            question="q",
            model_id="some-future-model",
            tokens_used=1000,
            tokens_input=500,
            tokens_output=500,
            iterations_run=1,
            wall_time_sec=1.0,
            tavily_searches=0,
            budget_exhausted=False,
            confidence=0.5,
        )
        assert entry["estimated_cost_usd"] is None
    def test_question_is_truncated(self, tmp_path):
        ledger = self._ledger(tmp_path)
        long_q = "x" * 1000
        entry = ledger.record(
            trace_id="abc",
            question=long_q,
            model_id="claude-sonnet-4-6",
            tokens_used=10,
            tokens_input=5,
            tokens_output=5,
            iterations_run=1,
            wall_time_sec=0.1,
            tavily_searches=0,
            budget_exhausted=False,
            confidence=0.5,
        )
        assert len(entry["question"]) == 200
    def test_env_var_override(self, tmp_path, monkeypatch):
        custom = tmp_path / "custom-ledger.jsonl"
        monkeypatch.setenv("MARCHWARDEN_COST_LEDGER", str(custom))
        ledger = CostLedger(
            price_table=PriceTable(path=str(tmp_path / "prices.toml")),
        )
        assert ledger.path == custom
--- a/tests/test_models.py
+++ b/tests/test_models.py
@ -443,56 +443,3 @@ class TestResearchResult:
            "recency",
        }
        assert set(data["confidence_factors"].keys()) == cf_keys
 # ---------------------------------------------------------------------------
 # Depth presets (Issue #30)
 # ---------------------------------------------------------------------------
 from researchers.web.models import DEPTH_PRESETS, constraints_for_depth
 class TestDepthPresets:
    def test_shallow_preset(self):
        c = constraints_for_depth("shallow")
        assert c.max_iterations == 2
        assert c.token_budget == 5_000
        assert c.max_sources == 5
    def test_balanced_preset_matches_historical_defaults(self):
        # Backward compat: balanced must equal the original ResearchConstraints defaults
        c = constraints_for_depth("balanced")
        default = ResearchConstraints()
        assert c.max_iterations == default.max_iterations == 5
        assert c.token_budget == default.token_budget == 20_000
        assert c.max_sources == default.max_sources == 10
    def test_deep_preset(self):
        c = constraints_for_depth("deep")
        assert c.max_iterations == 8
        assert c.token_budget == 60_000
        assert c.max_sources == 20
    def test_unknown_depth_falls_back_to_balanced(self):
        c = constraints_for_depth("nonsense")
        assert c.max_iterations == DEPTH_PRESETS["balanced"]["max_iterations"]
        assert c.token_budget == DEPTH_PRESETS["balanced"]["token_budget"]
    def test_explicit_overrides_win(self):
        c = constraints_for_depth(
            "shallow",
            max_iterations=10,
            token_budget=42_000,
            max_sources=15,
        )
        assert c.max_iterations == 10
        assert c.token_budget == 42_000
        assert c.max_sources == 15
    def test_partial_override(self):
        # Only one field overridden — others stay at the preset
        c = constraints_for_depth("deep", token_budget=100_000)
        assert c.token_budget == 100_000
        assert c.max_iterations == 8  # deep preset
        assert c.max_sources == 20  # deep preset
--- a/tests/test_obs.py
+++ b/tests/test_obs.py
@ -1,128 +0,0 @@
 """Tests for the obs (structured logging) module."""
 import io
 import json
 import logging
 import os
 from unittest.mock import patch
 import pytest
 import structlog
 from obs import configure_logging, get_logger
 @pytest.fixture(autouse=True)
 def reset_logging():
    """Reset structlog + stdlib state between tests so configure_logging
    is forced to re-run."""
    import obs
    obs._CONFIGURED = False
    structlog.reset_defaults()
    structlog.contextvars.clear_contextvars()
    root = logging.getLogger()
    root.handlers = []
    root.setLevel(logging.WARNING)
    yield
    obs._CONFIGURED = False
    structlog.reset_defaults()
    structlog.contextvars.clear_contextvars()
    root.handlers = []
 def _capture_stderr(monkeypatch):
    buf = io.StringIO()
    monkeypatch.setattr("sys.stderr", buf)
    return buf
 class TestConfigureLogging:
    def test_json_format_emits_json(self, monkeypatch):
        monkeypatch.setenv("MARCHWARDEN_LOG_FORMAT", "json")
        monkeypatch.setenv("MARCHWARDEN_LOG_LEVEL", "INFO")
        buf = _capture_stderr(monkeypatch)
        configure_logging(force=True)
        log = get_logger("marchwarden.test")
        log.info("hello", key="value", count=3)
        line = buf.getvalue().strip()
        assert line, "expected at least one log line"
        parsed = json.loads(line)
        assert parsed["event"] == "hello"
        assert parsed["key"] == "value"
        assert parsed["count"] == 3
        assert parsed["level"] == "info"
        assert parsed["logger"] == "marchwarden.test"
        assert "timestamp" in parsed
    def test_console_format_is_human_readable(self, monkeypatch):
        monkeypatch.setenv("MARCHWARDEN_LOG_FORMAT", "console")
        monkeypatch.setenv("MARCHWARDEN_LOG_LEVEL", "INFO")
        buf = _capture_stderr(monkeypatch)
        configure_logging(force=True)
        log = get_logger("marchwarden.test")
        log.info("greeting", who="world")
        text = buf.getvalue()
        assert "greeting" in text
        assert "world" in text
        # Console renderer is not JSON
        with pytest.raises(json.JSONDecodeError):
            json.loads(text.strip().splitlines()[-1])
    def test_context_binding_propagates(self, monkeypatch):
        monkeypatch.setenv("MARCHWARDEN_LOG_FORMAT", "json")
        buf = _capture_stderr(monkeypatch)
        configure_logging(force=True)
        log = get_logger("marchwarden.test")
        structlog.contextvars.bind_contextvars(trace_id="abc-123", researcher="web")
        try:
            log.info("step")
        finally:
            structlog.contextvars.clear_contextvars()
        parsed = json.loads(buf.getvalue().strip())
        assert parsed["trace_id"] == "abc-123"
        assert parsed["researcher"] == "web"
    def test_log_level_filters(self, monkeypatch):
        monkeypatch.setenv("MARCHWARDEN_LOG_FORMAT", "json")
        monkeypatch.setenv("MARCHWARDEN_LOG_LEVEL", "WARNING")
        buf = _capture_stderr(monkeypatch)
        configure_logging(force=True)
        log = get_logger("marchwarden.test")
        log.info("ignored")
        log.warning("kept")
        lines = [l for l in buf.getvalue().strip().splitlines() if l]
        assert len(lines) == 1
        assert json.loads(lines[0])["event"] == "kept"
    def test_idempotent(self, monkeypatch):
        monkeypatch.setenv("MARCHWARDEN_LOG_FORMAT", "json")
        configure_logging(force=True)
        import obs
        obs._CONFIGURED = True
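        # Second call should be a no-op. Snapshot the handler list: comparing
        # the live list to itself with `is` passes even if it was mutated in place.
        before = list(logging.getLogger().handlers)
        configure_logging()
        after = list(logging.getLogger().handlers)
        assert before == after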
    def test_get_logger_auto_configures(self, monkeypatch):
        monkeypatch.setenv("MARCHWARDEN_LOG_FORMAT", "json")
        buf = _capture_stderr(monkeypatch)
        # No explicit configure_logging() call
        log = get_logger("marchwarden.test")
        log.info("auto")
        parsed = json.loads(buf.getvalue().strip())
        assert parsed["event"] == "auto"
--- a/tests/test_server.py
+++ b/tests/test_server.py
@ -1,152 +0,0 @@
 """Tests for the MCP server."""
 import json
 from unittest.mock import AsyncMock, patch, MagicMock
 import pytest
 from researchers.web.server import _read_secret, research
 # ---------------------------------------------------------------------------
 # _read_secret
 # ---------------------------------------------------------------------------
 class TestReadSecret:
    def test_reads_key(self, tmp_path):
        secrets = tmp_path / "secrets"
        secrets.write_text("FOO=bar\nBAZ=qux\n")
        with patch("researchers.web.server.os.path.expanduser", return_value=str(secrets)):
            assert _read_secret("FOO") == "bar"
            assert _read_secret("BAZ") == "qux"
    def test_missing_key_raises(self, tmp_path):
        secrets = tmp_path / "secrets"
        secrets.write_text("FOO=bar\n")
        with patch("researchers.web.server.os.path.expanduser", return_value=str(secrets)):
            with pytest.raises(ValueError, match="MISSING"):
                _read_secret("MISSING")
 # ---------------------------------------------------------------------------
 # research tool
 # ---------------------------------------------------------------------------
 class TestResearchTool:
    @pytest.mark.asyncio
    async def test_returns_valid_json(self):
        """The research tool should return a JSON string with all contract fields."""
        from researchers.web.models import (
            ResearchResult,
            ConfidenceFactors,
            CostMetadata,
        )
        mock_result = ResearchResult(
            answer="Test answer.",
            citations=[],
            gaps=[],
            discovery_events=[],
            open_questions=[],
            confidence=0.8,
            confidence_factors=ConfidenceFactors(
                num_corroborating_sources=1,
                source_authority="medium",
                contradiction_detected=False,
                query_specificity_match=0.7,
                budget_exhausted=False,
                recency="current",
            ),
            cost_metadata=CostMetadata(
                tokens_used=500,
                iterations_run=1,
                wall_time_sec=5.0,
                budget_exhausted=False,
                model_id="claude-test",
            ),
            trace_id="test-trace-id",
        )
        with patch("researchers.web.server._get_researcher") as mock_get:
            mock_researcher = AsyncMock()
            mock_researcher.research.return_value = mock_result
            mock_get.return_value = mock_researcher
            result_json = await research(
                question="test question",
                context="some context",
                depth="shallow",
                max_iterations=2,
                token_budget=5000,
            )
        data = json.loads(result_json)
        assert data["answer"] == "Test answer."
        assert data["confidence"] == 0.8
        assert data["trace_id"] == "test-trace-id"
        assert "citations" in data
        assert "gaps" in data
        assert "discovery_events" in data
        assert "open_questions" in data
        assert "confidence_factors" in data
        assert "cost_metadata" in data
        # Verify researcher was called with correct args
        mock_researcher.research.assert_called_once()
        call_kwargs = mock_researcher.research.call_args[1]
        assert call_kwargs["question"] == "test question"
        assert call_kwargs["context"] == "some context"
        assert call_kwargs["depth"] == "shallow"
        assert call_kwargs["constraints"].max_iterations == 2
        assert call_kwargs["constraints"].token_budget == 5000
    @pytest.mark.asyncio
    async def test_defaults(self):
        """Test that defaults work when optional args are omitted."""
        from researchers.web.models import (
            ResearchResult,
            ConfidenceFactors,
            CostMetadata,
        )
        mock_result = ResearchResult(
            answer="Default test.",
            citations=[],
            gaps=[],
            discovery_events=[],
            open_questions=[],
            confidence=0.5,
            confidence_factors=ConfidenceFactors(
                num_corroborating_sources=0,
                source_authority="low",
                contradiction_detected=False,
                query_specificity_match=0.5,
                budget_exhausted=False,
            ),
            cost_metadata=CostMetadata(
                tokens_used=100,
                iterations_run=1,
                wall_time_sec=1.0,
                budget_exhausted=False,
                model_id="claude-test",
            ),
            trace_id="test-id",
        )
        with patch("researchers.web.server._get_researcher") as mock_get:
            mock_researcher = AsyncMock()
            mock_researcher.research.return_value = mock_result
            mock_get.return_value = mock_researcher
            result_json = await research(question="just a question")
        data = json.loads(result_json)
        assert data["answer"] == "Default test."
        call_kwargs = mock_researcher.research.call_args[1]
        assert call_kwargs["context"] is None
        assert call_kwargs["depth"] == "balanced"
        assert call_kwargs["constraints"].max_iterations == 5
        assert call_kwargs["constraints"].token_budget == 20000
--- a/tests/test_trace.py
+++ b/tests/test_trace.py
@ -34,27 +34,6 @@ class TestTraceLogger:
        with tempfile.TemporaryDirectory() as tmp:
            logger = self._make_logger(tmp, trace_id="test-123")
            assert str(logger.file_path).endswith("test-123.jsonl")
            assert str(logger.result_path).endswith("test-123.result.json")
    def test_write_result_persists_pydantic_model(self):
        with tempfile.TemporaryDirectory() as tmp:
            logger = self._make_logger(tmp, trace_id="result-test")
            class Stub:
                def model_dump_json(self, indent=None):
                    return '{"answer": "hi", "gaps": []}'
            logger.write_result(Stub())
            assert logger.result_path.exists()
            data = json.loads(logger.result_path.read_text())
            assert data["answer"] == "hi"
            assert data["gaps"] == []
    def test_write_result_accepts_dict(self):
        with tempfile.TemporaryDirectory() as tmp:
            logger = self._make_logger(tmp, trace_id="dict-test")
            logger.write_result({"foo": "bar"})
            assert json.loads(logger.result_path.read_text()) == {"foo": "bar"}
    def test_log_step_creates_file(self):
        with tempfile.TemporaryDirectory() as tmp:
@ -164,74 +143,3 @@ class TestTraceLogger:
            assert a.read_entries()[0]["query"] == "a"
            assert b.read_entries()[0]["query"] == "b"
            assert a.file_path != b.file_path
 # ---------------------------------------------------------------------------
 # Step duration tracking (Issue #35)
 # ---------------------------------------------------------------------------
 import time as _time
 class TestStepDurations:
    def test_web_search_pair_records_duration_ms(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        logger.log_step("web_search", query="utah crops")
        _time.sleep(0.02)
        entry = logger.log_step("web_search_complete", result_count=5)
        logger.close()
        assert "duration_ms" in entry
        assert entry["duration_ms"] >= 15
    def test_fetch_url_pair_records_duration_ms(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        logger.log_step("fetch_url", url="https://example.com")
        _time.sleep(0.01)
        entry = logger.log_step("fetch_url_complete", success=True)
        logger.close()
        assert "duration_ms" in entry
    def test_synthesis_complete_records_duration_ms(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        logger.log_step("synthesis_start")
        _time.sleep(0.01)
        entry = logger.log_step("synthesis_complete")
        logger.close()
        assert "duration_ms" in entry
    def test_synthesis_error_also_records_duration_ms(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        logger.log_step("synthesis_start")
        _time.sleep(0.01)
        entry = logger.log_step("synthesis_error", parse_error="boom")
        logger.close()
        assert "duration_ms" in entry
    def test_complete_records_total_duration_sec(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        logger.log_step("start", question="q")
        _time.sleep(0.02)
        entry = logger.log_step("complete", confidence=0.9)
        logger.close()
        assert "total_duration_sec" in entry
        assert entry["total_duration_sec"] >= 0.015
        # Sec precision, not ms
        assert "duration_ms" not in entry
    def test_unpaired_completer_does_not_crash(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        # No matching web_search starter
        entry = logger.log_step("web_search_complete", result_count=0)
        logger.close()
        assert "duration_ms" not in entry
    def test_existing_fields_preserved(self, tmp_path):
        logger = TraceLogger(trace_dir=str(tmp_path))
        logger.log_step("web_search", query="x")
        entry = logger.log_step("web_search_complete", result_count=3, urls=["u1"])
        logger.close()
        assert entry["result_count"] == 3
        assert entry["urls"] == ["u1"]
        assert "step" in entry
        assert "timestamp" in entry