Merge pull request 'docs(stress-tests): M3.3 Phase A — calibration data collection' (#59) from feat/m3.3-collection into main

docs(stress-tests): M3.3 Phase A — calibration data collection
Issue #46 (Phase A only — Phase B human rating still pending, issue stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries across 4 categories (factual, comparative, contradiction-prone, scope-edge), 5 each, capturing per-run logs to docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult under ~/.marchwarden/traces/*.result.json and emits a markdown rating worksheet with one row per run. Recovers question text from each trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration + caffeine smoke + M3.2 multi-axis), with empty actual_rating columns for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration runner, kept as provenance. Gitignore updated with an exception carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are unrecoverable to the worksheet — only post-#54 runs have a result.json sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow in a later session. This issue stays open until both are done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 02:22:07 +00:00 · 2026-04-08 20:21:47 -06:00
24 changed files with 5549 additions and 0 deletions
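The commit message says the collector recovers each run's category from the run-log filename. `scripts/calibration_collect.py` itself is not shown in this excerpt, so the following is only a minimal sketch of how that mapping could be built. It assumes log names of the form `NN-<category>.log` and a trailing `trace_id: <uuid>` line, as seen in `01-factual.log`. The function name `category_by_trace` and both regexes are hypothetical.

```python
import re
from pathlib import Path

# Assumed conventions (not confirmed by the collector source, which is not shown):
#   - run logs are named "NN-<category>.log", e.g. "01-factual.log"
#   - each log contains a line "trace_id: <uuid>" identifying the run
LOG_NAME = re.compile(r"^\d+-(?P<category>[a-z-]+)\.log$")
TRACE_LINE = re.compile(r"^trace_id:\s*([0-9a-f-]{36})\s*$", re.MULTILINE)


def category_by_trace(runs_dir: Path) -> dict[str, str]:
    """Map full trace_id -> category for every run log in runs_dir."""
    mapping: dict[str, str] = {}
    for log_path in sorted(runs_dir.glob("*.log")):
        name_match = LOG_NAME.match(log_path.name)
        if name_match is None:
            continue  # file name does not follow the assumed pattern
        trace_ids = TRACE_LINE.findall(log_path.read_text(encoding="utf-8"))
        if not trace_ids:
            continue  # run produced no trace_id line (e.g. it crashed early)
        # Use the last match in case the id is printed more than once.
        mapping[trace_ids[-1]] = name_match.group("category")
    return mapping
```

Runs without a matching log, such as the two ad-hoc rows in the worksheet, would simply be absent from the mapping and need a fallback label.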
--- a/.gitignore
+++ b/.gitignore
@@ -45,6 +45,9 @@ ehthumbs.db
 .env
 .env.local
 *.log
+# Exception: stress test run logs are committed as provenance — they map
+# trace_id -> category for the calibration collector script.
+!docs/stress-tests/**/*.log

 # Tests
 .pytest_cache/
--- a/docs/stress-tests/M3.3-rating-worksheet.md
+++ b/docs/stress-tests/M3.3-rating-worksheet.md
@@ -0,0 +1,74 @@
+# M3.3 Calibration Rating Worksheet
+
+Issue: #46 (Phase B — human rating)
+
+## How to use this worksheet
+
+For each run below, read the answer + citations from the persisted result file (path in the **Result file** column). Score the answer's *actual* correctness on a 0.0–1.0 scale, **independent** of the model's self-reported confidence. Fill in the **actual_rating** column. Add notes in the **notes** column for anything unusual.
+
+Rating rubric:
+
+- **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.
+- **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance.
+- **0.6** — Substantively right but with notable errors, missing context, or weak citations.
+- **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons.
+- **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing.
+- **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question.
+
+After rating all rows, save this file and run:
+
+```
+.venv/bin/python scripts/calibration_analyze.py
+```
+
+## Runs (22 total)
+
+| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| 1 | `28f55110` | ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 |  |  |
+| 2 | `74a017bd` | ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 |  |  |
+| 3 | `6141a021` | factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | — | 5 | 2 | 42473 |  |  |
+| 4 | `91e87d05` | factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 |  |  |
+| 5 | `710b0a62` | factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 |  |  |
+| 6 | `ffc42162` | factual | What is the capital of Mongolia? | 0.99 | 4 | high | no | under | current | — | 4 | 1 | 11009 |  |  |
+| 7 | `7561029e` | factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 |  |  |
+| 8 | `aaf3b9ef` | comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 |  |  |
+| 9 | `01881015` | comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 |  |  |
+| 10 | `9e436db7` | comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 |  |  |
+| 11 | `7c8dd19b` | comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 |  |  |
+| 12 | `e3fa81c3` | comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 |  |  |
+| 13 | `96acce3c` | contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 |  |  |
+| 14 | `c4942f00` | contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 |  |  |
+| 15 | `2e2b6e88` | contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 |  |  |
+| 16 | `27d81891` | contradiction | Is dietary cholesterol harmful? | 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 |  |  |
+| 17 | `9c18d570` | contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | — | budget_exhausted(1) | 0 | 0 | 44375 |  |  |
+| 18 | `f4c43973` | scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 |  |  |
+| 19 | `b3d00938` | scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 |  |  |
+| 20 | `716e548a` | scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 |  |  |
+| 21 | `b7cd9d50` | scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 |  |  |
+| 22 | `a4bb5b7a` | scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 |  |  |
+
+## Result files (full content for review)
+
+1. `/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json`
+2. `/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json`
+3. `/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json`
+4. `/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json`
+5. `/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json`
+6. `/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json`
+7. `/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json`
+8. `/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json`
+9. `/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json`
+10. `/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json`
+11. `/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json`
+12. `/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json`
+13. `/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json`
+14. `/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json`
+15. `/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json`
+16. `/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json`
+17. `/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json`
+18. `/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json`
+19. `/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json`
+20. `/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json`
+21. `/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json`
+22. `/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json`
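The worksheet above pairs each run's `model_conf` with a human `actual_rating` on the same 0.0–1.0 scale. `scripts/calibration_analyze.py` is not part of this excerpt, so the sketch below is not its implementation. It only illustrates one simple comparison that Phase C could make once ratings exist: the mean signed and mean absolute gap between the two columns. The function name `calibration_gap` and the `(model_conf, actual_rating)` tuple layout are assumptions.

```python
from statistics import mean
from typing import Iterable, Optional


def calibration_gap(
    rows: Iterable[tuple[float, Optional[float]]],
) -> Optional[dict[str, float]]:
    """Summarize model_conf vs. actual_rating for rated rows.

    Each row is (model_conf, actual_rating); unrated rows carry None
    and are skipped. Returns None when nothing has been rated yet.
    """
    rated = [(conf, rating) for conf, rating in rows if rating is not None]
    if not rated:
        return None
    return {
        "n": float(len(rated)),
        # Positive means the model is overconfident on average.
        "mean_signed_gap": mean(conf - rating for conf, rating in rated),
        # Magnitude of miscalibration regardless of direction.
        "mean_abs_gap": mean(abs(conf - rating) for conf, rating in rated),
    }
```

A positive `mean_signed_gap` would suggest overconfidence. With only 22 runs, per-category figures would rest on about five samples each and should be read with caution.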
--- a/docs/stress-tests/M3.3-runs/01-factual.log
+++ b/docs/stress-tests/M3.3-runs/01-factual.log
@@ -0,0 +1,128 @@
+Researching: What is the boiling point of liquid nitrogen at standard 
+atmospheric pressure?
+
+{"question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:49:07.183443Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:49:07.993167Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:49:08.002221Z"}
+{"question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:49:08.036624Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:08.037079Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:08.037172Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1107, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:20.314935Z"}
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 5768, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:25.184914Z"}
+{"step": 15, "decision": "Starting iteration 4/5", "tokens_so_far": 16093, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:27.276067Z"}
+{"step": 17, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 17, "iterations_run": 4, "tokens_used": 29376, "event": "synthesis_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:43.946958Z"}
+{"step": 18, "decision": "Parsed synthesis JSON successfully", "duration_ms": 21492, "event": "synthesis_complete", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:05.440080Z"}
+{"step": 26, "decision": "Research complete", "confidence": 0.98, "citation_count": 5, "gap_count": 0, "discovery_count": 2, "total_duration_sec": 59.528, "event": "complete", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:05.442761Z"}
+{"confidence": 0.98, "citations": 5, "gaps": 0, "discovery_events": 2, "tokens_used": 42473, "iterations_run": 4, "wall_time_sec": 57.403085231781006, "budget_exhausted": false, "event": "research_completed", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:05.442894Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:05.443791Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:05.453034Z"}
+{"trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "confidence": 0.98, "citations": 5, "tokens_used": 42473, "wall_time_sec": 57.403085231781006, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:05.720817Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The boiling point of liquid nitrogen at standard atmospheric pressure (1 atm │
+│ / 14.7 psia / 760 mmHg) is −195.79 °C (77 K; −320 °F). Some sources round    │
+│ this to −195.8 °C or approximately −196 °C. This value represents the        │
+│ temperature at which nitrogen transitions from its liquid phase to a gas     │
+│ phase under normal atmospheric conditions.                                   │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Liquid Nitrogen Temperature   │ The temperature of liquid      │  0.98 │
+│     │ and Facts                     │ nitrogen is −195.79 °C (77 K;  │       │
+│     │ https://sciencenotes.org/liqu │ −320 °F). This is the boiling  │       │
+│     │ id-nitrogen-temperature-and-f │ point of nitrogen. However,    │       │
+│     │ acts/                         │ nitrogen can exist as a liquid │       │
+│     │                               │ between 63 K and 77.2 K        │       │
+│     │                               │ (-346°F and -320.44°F).        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Nitrogen - Thermophysical     │ Boiling Point - at saturation  │  0.97 │
+│     │ Properties                    │ pressure 14.7 psia and 760 mm  │       │
+│     │ https://www.engineeringtoolbo │ Hg - ( o F, o C ) -320.4,      │       │
+│     │ x.com/nitrogen-d_1421.html    │ -195.8                         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ What Is the Temperature of    │ The temperature of liquid      │  0.95 │
+│     │ Liquid Nitrogen? - WestAir    │ nitrogen is -196°C (-321°F) at │       │
+│     │ https://westairgases.com/blog │ its boiling point. The liquid  │       │
+│     │ /liquid-nitrogen-temperature- │ nitrogen temperature range     │       │
+│     │ properties/                   │ spans between -210°C (freezing │       │
+│     │                               │ point) and -196°C (boiling     │       │
+│     │                               │ point).                        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ What is the boiling point of  │ At 1 atmosphere of pressure,   │  0.90 │
+│     │ liquid nitrogen? Does it      │ nitrogen boils at -195.8       │       │
+│     │ change ... - Quora            │ Celsius (-320.4 Fahrenheit).   │       │
+│     │ https://www.quora.com/What-is │ Of course, like any substance, │       │
+│     │ -the-boiling-point-of-liquid- │ boiling point varies directly  │       │
+│     │ nitrogen-Does-it-change-in-a- │ with pressure.                 │       │
+│     │ vacuum-or-at-standard-conditi │                                │       │
+│     │ ons                           │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ The boiling point for liquid  │ The boiling point for liquid   │  0.88 │
+│     │ nitrogen at atmospheric       │ nitrogen at atmospheric        │       │
+│     │ pressure is 77 K.             │ pressure is 77 K. In an open   │       │
+│     │ https://brainly.com/question/ │ container, liquid nitrogen's   │       │
+│     │ 17018364                      │ temperature is generally       │       │
+│     │                               │ around its boiling point of 77 │       │
+│     │                               │ K due to continuous            │       │
+│     │                               │ vaporization.                  │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ database          │ liquid nitrogen   │ The boiling point │
+│                  │                   │ boiling point     │ of nitrogen       │
+│                  │                   │ pressure          │ varies with       │
+│                  │                   │ dependence phase  │ pressure;         │
+│                  │                   │ diagram           │ understanding     │
+│                  │                   │                   │ this relationship │
+│                  │                   │                   │ is useful for     │
+│                  │                   │                   │ industrial and    │
+│                  │                   │                   │ scientific        │
+│                  │                   │                   │ applications.     │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ nitrogen phase    │ Engineering       │
+│                  │                   │ diagram triple    │ ToolBox           │
+│                  │                   │ point critical    │ references a      │
+│                  │                   │ point             │ nitrogen phase    │
+│                  │                   │                   │ diagram showing   │
+│                  │                   │                   │ conditions for    │
+│                  │                   │                   │ solid, liquid,    │
+│                  │                   │                   │ and gas phases.   │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ medium   │ How does the boiling point of   │ Multiple sources note that      │
+│          │ liquid nitrogen change as       │ boiling point varies directly   │
+│          │ pressure decreases toward a     │ with pressure, suggesting       │
+│          │ vacuum?                         │ significant changes under       │
+│          │                                 │ reduced pressure conditions.    │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ What is the exact triple point  │ Sources mention nitrogen exists │
+│          │ temperature and pressure for    │ as a liquid between 63 K and    │
+│          │ nitrogen?                       │ 77.2 K, implying a triple point │
+│          │                                 │ near 63 K, but exact triple     │
+│          │                                 │ point data was not provided in  │
+│          │                                 │ the gathered evidence.          │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.98                                                                │
+│ Corroborating sources: 5                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 1.00                                                │
+│ Budget status: under cap                                                     │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 42473                                                                │
+│ Iterations: 4                                                                │
+│ Wall time: 57.40s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 6141a021-4a47-45df-aa0c-5acd1db78b79
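Each run log interleaves one-JSON-object-per-line structured events with rendered terminal tables. As a rough sketch of how a reader might pull a single event (for example `research_completed`) back out of such a file, the helper below skips every line that does not parse as a JSON object. The name `find_event` is hypothetical and the repository's scripts may do this differently.

```python
import json
from typing import Any, Optional


def find_event(log_text: str, event_name: str) -> Optional[dict[str, Any]]:
    """Return the first JSON log record whose "event" equals event_name."""
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # table borders, answer text, blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not a complete object
        if isinstance(record, dict) and record.get("event") == event_name:
            return record
    return None
```

For the log above, `find_event(text, "research_completed")` would return the record carrying `tokens_used`, `iterations_run`, and `budget_exhausted`.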
--- a/docs/stress-tests/M3.3-runs/02-factual.log
+++ b/docs/stress-tests/M3.3-runs/02-factual.log
@@ -0,0 +1,145 @@
+Researching: When did the James Webb Space Telescope launch?
+
+{"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:06.289350Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:07.051309Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:07.061145Z"}
+{"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:07.098980Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "When did the James Webb Space Telescope launch?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099569Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099732Z"}
+{"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1050, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:15.512242Z"}
+{"step": 8, "decision": "Starting iteration 3/5", "tokens_so_far": 5418, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:18.749199Z"}
+{"step": 10, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 6, "iterations_run": 3, "tokens_used": 11453, "event": "synthesis_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:28.069780Z"}
+{"step": 11, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24998, "event": "synthesis_complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.942803Z"}
+{"step": 20, "decision": "Research complete", "confidence": 0.99, "citation_count": 5, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 47.037, "event": "complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.943609Z"}
+{"confidence": 0.99, "citations": 5, "gaps": 1, "discovery_events": 2, "tokens_used": 19708, "iterations_run": 3, "wall_time_sec": 44.843754529953, "budget_exhausted": false, "event": "research_completed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:51.943716Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:51.944100Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:51.947937Z"}
+{"trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "confidence": 0.99, "citations": 5, "tokens_used": 19708, "wall_time_sec": 44.843754529953, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.133972Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The James Webb Space Telescope (JWST) launched on December 25, 2021, at      │
+│ 12:20 UTC (7:20 AM ET) aboard an Arianespace Ariane 5 ECA+ rocket (Flight    │
+│ VA256) from the Guiana Space Centre (ELA-3) in Kourou, French Guiana. It     │
+│ entered service on July 12, 2022.                                            │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ James Webb Space Telescope -  │ Launch date: 25 December 2021  │  0.99 │
+│     │ Wikipedia                     │ (2021-12-25), 12:20 UTC |      │       │
+│     │ https://en.wikipedia.org/wiki │ Rocket: Ariane 5 ECA+ (S/N     │       │
+│     │ /James_Webb_Space_Telescope   │ 5113, Flight VA256) | Launch   │       │
+│     │                               │ site: Guiana, ELA-3 |          │       │
+│     │                               │ Contractor: Arianespace |      │       │
+│     │                               │ Entered service: 12 July 2022  │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ The Launch of the James Webb  │ On December 25, 2021, and 7:20 │  0.98 │
+│     │ Space Telescope - YouTube     │ AM ET (12:20 UTC), the James   │       │
+│     │ https://www.youtube.com/watch │ Webb Space Telescope was       │       │
+│     │ ?v=9tXlqWldVVk                │ launched by an ArianeSpace     │       │
+│     │                               │ Ariane 5 rocket from           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ James Webb Space Telescope    │ The launch date was Saturday,  │  0.97 │
+│     │ (JWST) Mission (Ariane 5) -   │ December 25, 2021 at 12:20 PM  │       │
+│     │ RocketLaunch.Live             │ (UTC).                         │       │
+│     │ https://www.rocketlaunch.live │                                │       │
+│     │ /launch/jwst                  │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ James Webb Space Telescope –  │ JWST's launch date was         │  0.95 │
+│     │ College of Science            │ December 25 from Europe's      │       │
+│     │ https://science.utah.edu/news │ Spaceport in Kourou, French    │       │
+│     │ /james-webb-space-telescope/  │ Guiana. Longtime fans of the   │       │
+│     │                               │ telescope are celebrating it   │       │
+│     │                               │ as a Christmas miracle.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ NASA's James Webb Space       │ Liftoff is at 7:20 a.m. EST    │  0.90 │
+│     │ Telescope officially set to   │ (1220 GMT).                    │       │
+│     │ launch Dec. 24 | Space        │                                │       │
+│     │ https://www.space.com/james-w │                                │       │
+│     │ ebb-space-telescope-launch-da │                                │       │
+│     │ te-confirmed                  │                                │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ contradictory_sources │ Space.com headline       │ The Space.com article     │
+│                       │ discrepancy              │ headline references Dec.  │
+│                       │                          │ 24, which was the         │
+│                       │                          │ announced/planned launch  │
+│                       │                          │ date at time of           │
+│                       │                          │ publication, while the    │
+│                       │                          │ actual launch occurred on │
+│                       │                          │ Dec. 25, 2021. This is a  │
+│                       │                          │ pre-launch announcement   │
+│                       │                          │ artifact, not a true      │
+│                       │                          │ contradiction, and all    │
+│                       │                          │ other sources confirm     │
+│                       │                          │ Dec. 25.                  │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ null              │ James Webb Space  │ JWST entered      │
+│                  │                   │ Telescope first   │ service on July   │
+│                  │                   │ science results   │ 12, 2022;         │
+│                  │                   │ July 2022         │ understanding its │
+│                  │                   │                   │ early science     │
+│                  │                   │                   │ results provides  │
+│                  │                   │                   │ context for its   │
+│                  │                   │                   │ operational       │
+│                  │                   │                   │ impact.           │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ null              │ JWST launch       │ The telescope was │
+│                  │                   │ delays history    │ originally        │
+│                  │                   │ original 2007     │ planned to launch │
+│                  │                   │ launch plan       │ in 2007 but faced │
+│                  │                   │                   │ decades of        │
+│                  │                   │                   │ delays, making    │
+│                  │                   │                   │ the history of    │
+│                  │                   │                   │ its development   │
+│                  │                   │                   │ noteworthy.       │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ medium   │ What were the key milestones    │ Wikipedia notes the telescope   │
+│          │ after JWST's launch during its  │ entered service on July 12,     │
+│          │ commissioning phase before      │ 2022, approximately six months  │
+│          │ entering service on July 12,    │ after its December 25, 2021     │
+│          │ 2022?                           │ launch, suggesting a lengthy    │
+│          │                                 │ commissioning process.          │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ What caused JWST's launch to    │ Space.com's article was titled  │
+│          │ slip from December 24 to        │ with a Dec. 24 launch date, but │
+│          │ December 25, 2021?              │ the actual launch occurred on   │
+│          │                                 │ Dec. 25, suggesting a           │
+│          │                                 │ last-minute slip.               │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How does JWST's actual mission  │ Wikipedia lists a 10-year       │
+│          │ performance compare to its      │ planned and 20-year expected    │
+│          │ planned 10-year operational     │ life; precise launch trajectory │
+│          │ lifespan given its fuel         │ reportedly left more fuel than  │
+│          │ efficiency during launch?       │ expected, potentially extending │
+│          │                                 │ the mission.                    │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.99                                                                │
+│ Corroborating sources: 5                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 1.00                                                │
+│ Budget status: under cap                                                     │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 19708                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 44.84s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 91e87d05-6d23-4377-af13-270a8cf701e2
--- a/docs/stress-tests/M3.3-runs/03-factual.log
+++ b/docs/stress-tests/M3.3-runs/03-factual.log
@ -0,0 +1,179 @@
+Researching: What programming language is the Linux kernel primarily written in?
+
+{"question": "What programming language is the Linux kernel primarily written in?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.691750Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:53.397487Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:53.405825Z"}
+{"question": "What programming language is the Linux kernel primarily written in?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:53.438393Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What programming language is the Linux kernel primarily written in?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:53.438693Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:53.438784Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1096, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:04.950078Z"}
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 7266, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:15.609351Z"}
+{"step": 14, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 16, "iterations_run": 3, "tokens_used": 18342, "event": "synthesis_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:38.886838Z"}
+{"step": 15, "decision": "Parsed synthesis JSON successfully", "duration_ms": 38497, "event": "synthesis_complete", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:16.247727Z"}
+{"step": 26, "decision": "Research complete", "confidence": 0.97, "citation_count": 6, "gap_count": 2, "discovery_count": 2, "total_duration_sec": 85.024, "event": "complete", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:16.248500Z"}
+{"confidence": 0.97, "citations": 6, "gaps": 2, "discovery_events": 2, "tokens_used": 32922, "iterations_run": 3, "wall_time_sec": 82.80920100212097, "budget_exhausted": false, "event": "research_completed", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:16.248601Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:52:16.248962Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:16.252134Z"}
+{"trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "confidence": 0.97, "citations": 6, "tokens_used": 32922, "wall_time_sec": 82.80920100212097, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:16.444923Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The Linux kernel is primarily written in the C programming language,         │
+│ specifically the GNU dialect of ISO C11 (compiled with GCC under -std=gnu11, │
+│ or alternatively with Clang). Assembly language is also used for             │
+│ architecture-specific low-level code. As of late 2022, Rust became an        │
+│ officially supported second language in the kernel, and as of the 2025 Linux │
+│ Kernel Maintainer Summit, Rust was elevated from 'experimental' to a         │
+│ permanent, first-class core language alongside C. According to Open Hub      │
+│ statistics, C accounts for approximately 95.8% of total lines in the kernel  │
+│ codebase, with Assembly at ~0.7% and Rust at ~0.3%. The kernel also uses     │
+│ small amounts of shell script, Python, Make, and Perl for tooling purposes.  │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Programming Language — The    │ The Linux kernel is written in │  1.00 │
+│     │ Linux Kernel documentation    │ the C programming language.    │       │
+│     │ https://docs.kernel.org/proce │ More precisely, it is          │       │
+│     │ ss/programming-language.html  │ typically compiled with gcc    │       │
+│     │                               │ under -std=gnu11: the GNU      │       │
+│     │                               │ dialect of ISO C11. clang is   │       │
+│     │                               │ also supported. The kernel has │       │
+│     │                               │ support for the Rust           │       │
+│     │                               │ programming language under     │       │
+│     │                               │ CONFIG_RUST.                   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ The Linux Kernel Open Source  │ C | 36,226,652 | 5,218,548 |   │  0.97 │
+│     │ Project on Open Hub:          │ 12.6% | 5,867,314 | 47,312,514 │       │
+│     │ Languages Page                │ | 95.8% ... Assembly | 266,797 │       │
+│     │ https://openhub.net/p/linux/a │ | 50,339 | 15.9% | 49,347 |    │       │
+│     │ nalyses/latest/languages_summ │ 366,483 | 0.7% ... Rust |      │       │
+│     │ ary                           │ 90,778 | 35,328 | 28.0% |      │       │
+│     │                               │ 11,361 | 137,467 | 0.3%        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Rust moves from experiment to │ The consensus among the        │  0.95 │
+│     │ a core Linux kernel language  │ assembled developers is that   │       │
+│     │ - Spiceworks                  │ Rust in the kernel is no       │       │
+│     │ https://www.spiceworks.com/so │ longer experimental — it is    │       │
+│     │ ftware/rust-moves-from-experi │ now a core part of the kernel  │       │
+│     │ ment-to-a-core-linux-kernel-l │ and is here to stay. So the    │       │
+│     │ anguage/                      │ 'experimental' tag will be     │       │
+│     │                               │ coming off. This elevates Rust │       │
+│     │                               │ to being the kernel's second   │       │
+│     │                               │ core language alongside C.     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Why Linux Kernel is written   │ Although the current Linux     │  0.92 │
+│     │ in C-language but not in C++? │ Kernel source-code contain     │       │
+│     │ https://thelinuxchannel.org/2 │ certain parts of the code      │       │
+│     │ 024/06/why-linux-kernel-is-wr │ written in assembly code       │       │
+│     │ itten-in-c-language-but-not-i │ (actually native CPU assembly  │       │
+│     │ n-c-thelinuxchannel-kernelpro │ instructions) and recently     │       │
+│     │ gramming/                     │ certain parts of code written  │       │
+│     │                               │ in Rust Language, majority of  │       │
+│     │                               │ the Linux Kernel source-code   │       │
+│     │                               │ is only written in C Language. │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Linux Kernel Contributors And │ The Linux kernel crossed the   │  0.90 │
+│     │ Lines of Code Statistics 2026 │ 40 million line threshold with │       │
+│     │ https://commandlinux.com/stat │ version 6.14 rc1 in January    │       │
+│     │ istics/linux-kernel-contribut │ 2025, containing precisely     │       │
+│     │ ors-lines-of-code-statistics/ │ 40,063,856 lines. This         │       │
+│     │                               │ represents exponential growth  │       │
+│     │                               │ from the original 10,239 lines │       │
+│     │                               │ in version 0.01 released in    │       │
+│     │                               │ 1991.                          │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Rust for Linux - Wikipedia    │ Initial release | October 1,   │  0.93 │
+│     │ https://en.wikipedia.org/wiki │ 2022; 3 years ago (2022-10-01) │       │
+│     │ /Rust_for_Linux               │ | Written in | Rust |          │       │
+│     │                               │ Operating system | Linux |     │       │
+│     │                               │ License | GPL-2.0-only with    │       │
+│     │                               │ Linux-syscall-note.            │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Exact current percentage │ Open Hub statistics may   │
+│                       │ of Rust code in the most │ not reflect the most      │
+│                       │ recent kernel versions   │ recent kernel releases    │
+│                       │ (6.12+)                  │ (6.14+), so the exact     │
+│                       │                          │ current Rust percentage   │
+│                       │                          │ could be slightly higher  │
+│                       │                          │ than 0.3% given active    │
+│                       │                          │ Rust adoption.            │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Whether C++ is           │ Open Hub reports C++ at   │
+│                       │ officially used in any   │ 1.9% of total lines, yet  │
+│                       │ part of the kernel       │ official kernel docs and  │
+│                       │                          │ community sources say C   │
+│                       │                          │ is the language and C++   │
+│                       │                          │ is not used. The C++      │
+│                       │                          │ lines may be in           │
+│                       │                          │ tools/scripts not in the  │
+│                       │                          │ kernel proper.            │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ null              │ Linux kernel Rust │ Rust is growing   │
+│                  │                   │ adoption rate     │ quickly in the    │
+│                  │                   │ 2025 lines of     │ kernel; updated   │
+│                  │                   │ code percentage   │ statistics on its │
+│                  │                   │                   │ share would be    │
+│                  │                   │                   │ valuable          │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ null              │ Linux kernel C++  │ Open Hub shows    │
+│                  │                   │ code usage tools  │ ~1.9% C++ but     │
+│                  │                   │ vs kernel proper  │ official docs do  │
+│                  │                   │                   │ not mention C++;  │
+│                  │                   │                   │ clarifying        │
+│                  │                   │                   │ whether this is   │
+│                  │                   │                   │ tooling code vs   │
+│                  │                   │                   │ kernel code would │
+│                  │                   │                   │ resolve the       │
+│                  │                   │                   │ apparent          │
+│                  │                   │                   │ discrepancy       │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ medium   │ Will Rust eventually surpass    │ Rust is at ~0.3% and Assembly   │
+│          │ Assembly in lines of code       │ at ~0.7% per Open Hub; with     │
+│          │ within the Linux kernel?        │ active Rust driver development, │
+│          │                                 │ Rust may soon exceed Assembly   │
+│          │                                 │ usage.                          │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ What is the roadmap for Rust    │ Rust is now a first-class       │
+│          │ adoption in specific kernel     │ language, but the Spiceworks    │
+│          │ subsystems?                     │ article notes the focus is on   │
+│          │                                 │ 'where, how fast, and under     │
+│          │                                 │ whose terms does Rust spread    │
+│          │                                 │ inside Linux'.                  │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ Why does Open Hub report ~1.9%  │ Open Hub's language breakdown   │
+│          │ C++ in the Linux kernel         │ shows 568,053 code lines of     │
+│          │ codebase when official          │ C++, which may belong to        │
+│          │ documentation does not mention  │ userspace tools or build        │
+│          │ C++ as a supported kernel       │ infrastructure bundled in the   │
+│          │ language?                       │ same repository.                │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.97                                                                │
+│ Corroborating sources: 6                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 1.00                                                │
+│ Budget status: under cap                                                     │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 32922                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 82.81s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 710b0a62-06c8-4f49-83e3-dc651c3702a9
--- a/docs/stress-tests/M3.3-runs/04-factual.log
+++ b/docs/stress-tests/M3.3-runs/04-factual.log
@ -0,0 +1,115 @@
+Researching: What is the capital of Mongolia?
+
+{"question": "What is the capital of Mongolia?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:16.982178Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:52:17.707574Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:17.715766Z"}
+{"question": "What is the capital of Mongolia?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:17.748116Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the capital of Mongolia?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:17.748504Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:17.748598Z"}
+{"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1043, "event": "iteration_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:25.126703Z"}
+{"step": 7, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 5, "iterations_run": 2, "tokens_used": 5387, "event": "synthesis_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:38.025310Z"}
+{"step": 8, "decision": "Parsed synthesis JSON successfully", "duration_ms": 19958, "event": "synthesis_complete", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:56.937541Z"}
+{"step": 14, "decision": "Research complete", "confidence": 0.99, "citation_count": 4, "gap_count": 0, "discovery_count": 1, "total_duration_sec": 41.287, "event": "complete", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:56.938235Z"}
+{"confidence": 0.99, "citations": 4, "gaps": 0, "discovery_events": 1, "tokens_used": 11009, "iterations_run": 2, "wall_time_sec": 39.189372301101685, "budget_exhausted": false, "event": "research_completed", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:56.938337Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:52:56.938738Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:56.942176Z"}
+{"trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "confidence": 0.99, "citations": 4, "tokens_used": 11009, "wall_time_sec": 39.189372301101685, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:57.144089Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The capital of Mongolia is Ulaanbaatar (also spelled Ulan Bator). It is the  │
+│ largest city in Mongolia, situated at an elevation of 1,350 meters on the    │
+│ Tuul River, and is known as the coldest national capital in the world. The   │
+│ name 'Ulaanbaatar' means 'red hero' in Mongolian. It is home to over half of │
+│ Mongolia's population of approximately 3 million people.                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Ulaanbaatar - Wikipedia       │ Ulaanbaatar is the capital of  │  0.99 │
+│     │ https://en.wikipedia.org/wiki │ Mongolia, and is home to over  │       │
+│     │ /Ulaanbaatar                  │ half the country's population  │       │
+│     │                               │ of about 3 million people.     │       │
+│     │                               │ Human habitation dates back    │       │
+│     │                               │ more than 300,000 years. The   │       │
+│     │                               │ city is located along the Tuul │       │
+│     │                               │ River Valley.                  │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Ulaanbaatar, Mongolia | NASA  │ Ulaanbaatar is the capital of  │  0.99 │
+│     │ Jet Propulsion Laboratory     │ Mongolia, and is home to over  │       │
+│     │ (JPL)                         │ half the country's population  │       │
+│     │ https://www.jpl.nasa.gov/imag │ of about 3 million people. Due │       │
+│     │ es/pia26289-ulaanbaatar-mongo │ to its location deep in the    │       │
+│     │ lia/                          │ interior of Asia, and its high │       │
+│     │                               │ elevation, Ulaanbaatar is the  │       │
+│     │                               │ coldest national capital in    │       │
+│     │                               │ the world.                     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Capital of Mongolia | -       │ Ulaanbaatar (Ulan Bator) is    │  0.95 │
+│     │ Everything You Need to Know   │ capital of Mongolia known as   │       │
+│     │ About Ulaanbaatar             │ the coldest capital on earth.  │       │
+│     │ https://www.travelbuddies.inf │ It is located in central Asia  │       │
+│     │ o/capital-of-mongolia/        │ between China and Russia and   │       │
+│     │                               │ capital and largest city of    │       │
+│     │                               │ Mongolia. Ulaan is red and     │       │
+│     │                               │ Baatar is hero in Mongolian.   │       │
+│     │                               │ In general, Ulaanbaatar means  │       │
+│     │                               │ 'red hero'.                    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Ulan Bator, Mongolia |        │ Ulaanbaatar, also known as     │  0.98 │
+│     │ Geography and Cartography |   │ Ulan Bator, is the capital and │       │
+│     │ Research Starters | EBSCO     │ largest city of Mongolia,      │       │
+│     │ Research                      │ situated at an elevation of    │       │
+│     │ https://www.ebsco.com/researc │ 1,350 meters (4,430 feet) on   │       │
+│     │ h-starters/geography-and-cart │ the Tuul River in the          │       │
+│     │ ography/ulan-bator-mongolia   │ northeast of the Mongolian     │       │
+│     │                               │ plateau.                       │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ null              │ Ulaanbaatar air   │ Multiple sources  │
+│                  │                   │ pollution and     │ mention severe    │
+│                  │                   │ climate           │ air pollution and │
+│                  │                   │ challenges        │ extreme cold as   │
+│                  │                   │                   │ notable           │
+│                  │                   │                   │ characteristics   │
+│                  │                   │                   │ of the capital    │
+│                  │                   │                   │ worth exploring   │
+│                  │                   │                   │ further.          │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ low      │ How has Ulaanbaatar's           │ Sources mention dramatic        │
+│          │ population grown over recent    │ population increases due to     │
+│          │ decades due to rural-to-urban   │ migration from rural areas,     │
+│          │ migration?                      │ with population estimates       │
+│          │                                 │ ranging from 1.4 million to     │
+│          │                                 │ over 1.6 million across         │
+│          │                                 │ sources.                        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What measures is Ulaanbaatar    │ Multiple sources note that coal │
+│          │ taking to address its severe    │ reliance and extreme winters    │
+│          │ air pollution problem?          │ cause significant air pollution │
+│          │                                 │ in the city.                    │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.99                                                                │
+│ Corroborating sources: 4                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 1.00                                                │
+│ Budget status: under cap                                                     │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 11009                                                                │
+│ Iterations: 2                                                                │
+│ Wall time: 39.19s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: ffc42162-5527-4a35-97ad-474aafa47dc1
--- a/docs/stress-tests/M3.3-runs/05-factual.log
+++ b/docs/stress-tests/M3.3-runs/05-factual.log
@@ -0,0 +1,148 @@
+Researching: How many amino acids are encoded by the standard genetic code?
+
+{"question": "How many amino acids are encoded by the standard genetic code?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:57.672745Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:52:58.404691Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:58.415522Z"}
+{"question": "How many amino acids are encoded by the standard genetic code?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:58.449581Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "How many amino acids are encoded by the standard genetic code?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:58.449885Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:58.449974Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1099, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:06.468160Z"}
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 8623, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:11.359260Z"}
+{"step": 17, "decision": "Starting iteration 4/5", "tokens_so_far": 18453, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:15.589960Z"}
+{"step": 19, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 14, "iterations_run": 4, "tokens_used": 34167, "event": "synthesis_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:38.361461Z"}
+{"step": 20, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24174, "event": "synthesis_complete", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:01.506420Z"}
+{"step": 28, "decision": "Research complete", "confidence": 0.98, "citation_count": 4, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 65.235, "event": "complete", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:01.507373Z"}
+{"confidence": 0.98, "citations": 4, "gaps": 1, "discovery_events": 2, "tokens_used": 48308, "iterations_run": 4, "wall_time_sec": 63.05677556991577, "budget_exhausted": false, "event": "research_completed", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:54:01.507469Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:54:01.507940Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:54:01.514127Z"}
+{"trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "confidence": 0.98, "citations": 4, "tokens_used": 48308, "wall_time_sec": 63.05677556991577, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:54:01.785150Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The standard genetic code encodes 20 common amino acids. These are specified │
+│ by 64 possible codons (combinations of three nucleotides from four bases),   │
+│ with most amino acids encoded by more than one codon (a property called      │
+│ degeneracy). Methionine and tryptophan are the only amino acids specified by │
+│ a single codon each. Three codons serve as stop signals rather than encoding │
+│ amino acids. Beyond the standard 20, two additional amino                    │
+│ acids—selenocysteine (the 21st) and pyrrolysine (the 22nd)—are also          │
+│ genetically encoded in certain organisms via reprogramming of stop codons    │
+│ UGA and UAG, respectively, but are not part of the standard set of 20.       │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ The genetic code (article) -  │ Most of the amino acids in the │  0.95 │
+│     │ Khan Academy                  │ genetic code are encoded by at │       │
+│     │ https://www.khanacademy.org/s │ least two codons. In fact,     │       │
+│     │ cience/hs-bio/x230b3ff252126b │ methionine and tryptophan are  │       │
+│     │ b6:gene-expression-and-regula │ the only amino acids specified │       │
+│     │ tion/x230b3ff252126bb6:untitl │ by a single codon.             │       │
+│     │ ed-348/a/the-genetic-code     │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Is there a twenty third amino │ The universal genetic code     │  0.98 │
+│     │ acid in the genetic code? -   │ includes 20 common amino       │       │
+│     │ PubMed                        │ acids. In addition,            │       │
+│     │ https://pubmed.ncbi.nlm.nih.g │ selenocysteine (Sec) and       │       │
+│     │ ov/16713651/                  │ pyrrolysine (Pyl), known as    │       │
+│     │                               │ the twenty first and twenty    │       │
+│     │                               │ second amino acids, are        │       │
+│     │                               │ encoded by UGA and UAG,        │       │
+│     │                               │ respectively, which are the    │       │
+│     │                               │ codons that usually function   │       │
+│     │                               │ as stop signals.               │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Genetic code - Wikipedia      │ The genetic code is highly     │  0.95 │
+│     │ https://en.wikipedia.org/wiki │ similar among all organisms    │       │
+│     │ /Genetic_code                 │ and can be expressed in a      │       │
+│     │                               │ simple table with 64 entries.  │       │
+│     │                               │ The codons specify which amino │       │
+│     │                               │ acid will be added next during │       │
+│     │                               │ protein biosynthesis.          │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Understanding the Genetic     │ The universal                  │  0.97 │
+│     │ Code - PMC                    │ triple-nucleotide genetic      │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ code, allowing DNA-encoded     │       │
+│     │ articles/PMC6620406/          │ mRNA to be translated into the │       │
+│     │                               │ amino acid sequences of        │       │
+│     │                               │ proteins using transfer RNAs   │       │
+│     │                               │ (tRNAs) and many accessory and │       │
+│     │                               │ modification factors, is       │       │
+│     │                               │ essentially common to all      │       │
+│     │                               │ living organisms on Earth.     │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category       ┃ Topic                        ┃ Detail                       ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ scope_exceeded │ Exact codon-to-amino-acid    │ The full detailed codon      │
+│                │ mapping table                │ table listing all 64 codons  │
+│                │                              │ and their corresponding      │
+│                │                              │ amino acids was not          │
+│                │                              │ extracted verbatim from the  │
+│                │                              │ sources, though the total    │
+│                │                              │ count of 20 standard amino   │
+│                │                              │ acids is well established.   │
+└────────────────┴──────────────────────────────┴──────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ database          │ selenocysteine    │ The PubMed source │
+│                  │                   │ pyrrolysine       │ raises the        │
+│                  │                   │ genetic code      │ question of       │
+│                  │                   │ expansion         │ expanded genetic  │
+│                  │                   │ organisms         │ codes beyond 20   │
+│                  │                   │                   │ amino acids,      │
+│                  │                   │                   │ which may be      │
+│                  │                   │                   │ relevant for      │
+│                  │                   │                   │ advanced biology  │
+│                  │                   │                   │ research.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ synthetic biology │ Wikipedia         │
+│                  │                   │ unnatural amino   │ mentions expanded │
+│                  │                   │ acids expanded    │ genetic codes in  │
+│                  │                   │ genetic code      │ synthetic         │
+│                  │                   │                   │ biology,          │
+│                  │                   │                   │ suggesting active │
+│                  │                   │                   │ research into     │
+│                  │                   │                   │ adding more than  │
+│                  │                   │                   │ 22 amino acids.   │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ medium   │ Could a 23rd amino acid ever    │ A PubMed study scanned 16       │
+│          │ become widely distributed and   │ archaeal and 130 bacterial      │
+│          │ genetically encoded in nature?  │ genomes for tRNAs corresponding │
+│          │                                 │ to the three stop codons and    │
+│          │                                 │ concluded that additional       │
+│          │                                 │ widely distributed genetically  │
+│          │                                 │ encoded amino acids are         │
+│          │                                 │ unlikely.                       │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ How many non-standard amino     │ Wikipedia references expanded   │
+│          │ acids have been successfully    │ genetic codes in synthetic      │
+│          │ incorporated into proteins via  │ biology as a distinct topic,    │
+│          │ synthetic biology methods?      │ suggesting                      │
+│          │                                 │ laboratory-engineered codes may │
+│          │                                 │ go beyond the natural 22.       │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.98                                                                │
+│ Corroborating sources: 4                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 1.00                                                │
+│ Budget status: under cap                                                     │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 48308                                                                │
+│ Iterations: 4                                                                │
+│ Wall time: 63.06s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 7561029e-5dcb-4eaa-98e9-7496ed4bf4c2
--- a/docs/stress-tests/M3.3-runs/06-comparative.log
+++ b/docs/stress-tests/M3.3-runs/06-comparative.log
@@ -0,0 +1,226 @@
+Researching: Compare the energy density of lithium-ion vs sodium-ion batteries.
+
+{"question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:54:02.430608Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:54:03.159945Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:54:03.167971Z"}
+{"question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:54:03.200030Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:03.200318Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:03.200405Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1114, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:14.560598Z"}
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 7183, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:18.314755Z"}
+{"step": 19, "decision": "Starting iteration 4/5", "tokens_so_far": 13977, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:28.528912Z"}
+{"step": 24, "decision": "Token budget reached before iteration 5: 28015/20000", "event": "budget_exhausted", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:39.027627Z"}
+{"step": 25, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 24, "iterations_run": 4, "tokens_used": 28015, "event": "synthesis_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:39.028531Z"}
+{"step": 26, "decision": "Parsed synthesis JSON successfully", "duration_ms": 50955, "event": "synthesis_complete", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:27.614289Z"}
+{"step": 41, "decision": "Research complete", "confidence": 0.91, "citation_count": 8, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 87.865, "event": "complete", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:27.616834Z"}
+{"confidence": 0.91, "citations": 8, "gaps": 3, "discovery_events": 3, "tokens_used": 48087, "iterations_run": 4, "wall_time_sec": 84.41376757621765, "budget_exhausted": true, "event": "research_completed", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:55:27.617014Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:55:27.617866Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:55:27.632124Z"}
+{"trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "confidence": 0.91, "citations": 8, "tokens_used": 48087, "wall_time_sec": 84.41376757621765, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:55:27.873634Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Lithium-ion batteries have significantly higher energy density than          │
+│ sodium-ion batteries across all commercial chemistries. Lithium-ion cells    │
+│ achieve 150–300 Wh/kg gravimetrically, depending on chemistry: NMC variants  │
+│ reach 250–300 Wh/kg in premium automotive applications, while LFP cells      │
+│ deliver 150–180 Wh/kg [Source 15]. Volumetrically, lithium-ion batteries     │
+│ reach roughly 250–700 Wh/L [Source 16]. Sodium-ion batteries currently       │
+│ achieve 90–190 Wh/kg gravimetrically; CATL's first-generation commercial     │
+│ cells reached ~160 Wh/kg [Source 15], with newer products like CATL's Naxtra │
+│ reaching ~175 Wh/kg [Source 22], and ScienceDirect prototypes ranging 90–150 │
+│ Wh/kg [Source 7]. The volumetric energy density of sodium-ion is             │
+│ approximately 20–40% lower than lithium-ion equivalents [Source 8]. This gap │
+│ exists fundamentally because sodium ions are heavier and larger than lithium │
+│ ions, reducing the energy stored per unit mass or volume [Source 3, Source   │
+│ 20]. A notable exception is a late-2025 announcement by ZN Energy of an      │
+│ anode-free solid-state sodium-ion pouch cell achieving 348.5 Wh/kg, verified │
+│ by CATARC, using a high-energy layered oxide cathode and anode-free          │
+│ solid-state architecture—though this is a laboratory/prototype result, not   │
+│ yet commercial [Source 10]. In practical terms, sodium-ion batteries are     │
+│ best suited for stationary storage and cost-sensitive low-performance EVs    │
+│ where energy density is less critical, while lithium-ion dominates portable  │
+│ electronics, robotics, and long-range EVs [Source 1, Source 8].              │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Battery Energy Density 2025:  │ Nickel Manganese Cobalt (NMC)  │  0.95 │
+│     │ State of the Art & Next-Gen   │ variants deliver the highest   │       │
+│     │ Tech                          │ energy densities at the cell   │       │
+│     │ https://timharper.net/fieldno │ level, reaching 250-300 Wh/kg  │       │
+│     │ tes/battery-energy-density-20 │ in premium automotive          │       │
+│     │ 25/                           │ applications... Sodium-ion     │       │
+│     │                               │ batteries have emerged from    │       │
+│     │                               │ laboratory curiosity to        │       │
+│     │                               │ commercial reality, with       │       │
+│     │                               │ CATL's first-generation cells  │       │
+│     │                               │ achieving 160 Wh/kg energy     │       │
+│     │                               │ density.                       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Sodium ion batteries: A       │ Current prototypes of SIBs     │  0.95 │
+│     │ sustainable alternative to    │ have energy densities of       │       │
+│     │ lithium-ion ...               │ 90–150 Wh/kg, which remain     │       │
+│     │ https://www.sciencedirect.com │ lower than the 130–285 Wh/kg   │       │
+│     │ /science/article/pii/S2949821 │ typically achieved             │       │
+│     │ X25002418                     │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Sodium-ion batteries: Should  │ Sodium is heavier than         │  0.97 │
+│     │ we believe the hype?          │ lithium, and its ions are      │       │
+│     │ https://cen.acs.org/energy/en │ larger, resulting in a         │       │
+│     │ ergy-storage-/Sodium-ion-batt │ volumetric energy density that │       │
+│     │ eries-Should-believe/103/web/ │ is 20–40% less than that of    │       │
+│     │ 2025/11                       │ lithium ion. Consequently, a   │       │
+│     │                               │ sodium-ion battery is bigger   │       │
+│     │                               │ and heavier than an equivalent │       │
+│     │                               │ one made with lithium.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Energy Density of Lithium-Ion │ Modern lithium-ion batteries   │  0.90 │
+│     │ Batteries Explained: Wh/kg vs │ achieve 150-300 Wh/kg and      │       │
+│     │ Wh/L                          │ 250-700 Wh/L, depending on     │       │
+│     │ https://www.longsingtech.com/ │ chemistry and design.          │       │
+│     │ energy-density-of-lithium-ion │                                │       │
+│     │ -batteries/                   │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Sodium Ion vs Lithium Ion     │ Energy Density (Gravimetric):  │  0.88 │
+│     │ Batteries: 2026 Comparison &  │ Sodium-ion typically ranges    │       │
+│     │ Key Advantages                │ from 100–175 Wh/kg (e.g.,      │       │
+│     │ https://chargeprotexas.com/so │ CATL's Naxtra at ~175 Wh/kg).  │       │
+│     │ dium-ion-vs-lithium-ion-batte │ Lithium-ion hits 150–250+      │       │
+│     │ ries-2026-comparison/         │ Wh/kg (LFP: 150–210; NMC:      │       │
+│     │                               │ 240–350).                      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ ZN Energy Breaks Sodium-Ion   │ Its >25Ah large-format AFSSSIB │  0.78 │
+│     │ Battery Density Record at     │ pouch cell achieved a          │       │
+│     │ 348.5Wh/kg                    │ gravimetric energy density of  │       │
+│     │ https://www.linkedin.com/post │ 348.5Wh/kg, verified by CATARC │       │
+│     │ s/jerry-wan-069b41105_breakin │ (China Automotive Technology & │       │
+│     │ g-the-sodium-ceiling-zhaona-e │ Research Center, Tianjin).     │       │
+│     │ nergy-activity-74134108276403 │ This is not an incremental     │       │
+│     │ 20000-NHd_                    │ improvement—it directly        │       │
+│     │                               │ challenges the long-held       │       │
+│     │                               │ assumption that sodium         │       │
+│     │                               │ chemistry is structurally      │       │
+│     │                               │ capped at 'low energy          │       │
+│     │                               │ density.'                      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Sodium as a Green Substitute  │ But there are also downsides   │  0.93 │
+│     │ for Lithium in Batteries      │ to sodium-ion batteries, the   │       │
+│     │ https://physics.aps.org/artic │ top one being a lower energy   │       │
+│     │ les/v17/73                    │ density than their lithium-ion │       │
+│     │                               │ counterparts. Energy density   │       │
+│     │                               │ has a direct bearing on the    │       │
+│     │                               │ driving range of an electric   │       │
+│     │                               │ vehicle.                       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Sodium-Ion vs Lithium-Ion     │ lithium-ion batteries dominate │  0.85 │
+│     │ Batteries Differences and     │ high-performance applications  │       │
+│     │ Applications in 2025          │ like consumer electronics and  │       │
+│     │ https://www.large-battery.com │ robotics, owing to their       │       │
+│     │ /blog/na-ion-vs-li-ion-batter │ superior energy density of     │       │
+│     │ ies-2025/                     │ 100–270 Wh/kg.                 │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Volumetric energy        │ Most sources provide      │
+│                       │ density figures for      │ gravimetric (Wh/kg) data  │
+│                       │ sodium-ion batteries     │ for sodium-ion; specific  │
+│                       │                          │ Wh/L volumetric figures   │
+│                       │                          │ for sodium-ion cells at   │
+│                       │                          │ the commercial pack level │
+│                       │                          │ were not found in         │
+│                       │                          │ evidence.                 │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Independent verification │ The 348.5 Wh/kg result    │
+│                       │ of ZN Energy 348.5 Wh/kg │ for sodium-ion is from a  │
+│                       │ claim                    │ LinkedIn post summarizing │
+│                       │                          │ a company announcement.   │
+│                       │                          │ No peer-reviewed or       │
+│                       │                          │ independent third-party   │
+│                       │                          │ publication was found to  │
+│                       │                          │ corroborate this figure.  │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ scope_exceeded        │ Cycle life vs energy     │ While cycle life is       │
+│                       │ density trade-offs in    │ mentioned in some         │
+│                       │ sodium-ion               │ sources, a detailed       │
+│                       │                          │ quantitative comparison   │
+│                       │                          │ of how energy density     │
+│                       │                          │ degrades over cycle life  │
+│                       │                          │ compared to lithium-ion   │
+│                       │                          │ was not covered in the    │
+│                       │                          │ evidence.                 │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ new_source       │ arxiv             │ anode-free        │ ZN Energy's 348.5 │
+│                  │                   │ solid-state       │ Wh/kg claim would │
+│                  │                   │ sodium-ion        │ benefit from      │
+│                  │                   │ battery energy    │ peer-reviewed     │
+│                  │                   │ density 2025      │ validation on     │
+│                  │                   │                   │ arXiv or similar  │
+│                  │                   │                   │ preprint server.  │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ sodium-ion        │ Volumetric energy │
+│                  │                   │ battery           │ density for       │
+│                  │                   │ volumetric energy │ sodium-ion at the │
+│                  │                   │ density Wh/L      │ cell and pack     │
+│                  │                   │ commercial cells  │ level is          │
+│                  │                   │ 2025              │ underrepresented  │
+│                  │                   │                   │ in current        │
+│                  │                   │                   │ evidence.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ layered oxide     │ Multiple sources  │
+│                  │                   │ cathode           │ mention cathode   │
+│                  │                   │ sodium-ion        │ engineering as    │
+│                  │                   │ specific capacity │ the key           │
+│                  │                   │ cycle stability   │ bottleneck for    │
+│                  │                   │ 2025              │ sodium-ion energy │
+│                  │                   │                   │ density           │
+│                  │                   │                   │ improvement.      │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Will sodium-ion batteries ever  │ ZN Energy's prototype achieved  │
+│          │ match or exceed LFP lithium-ion │ 348.5 Wh/kg, but commercial     │
+│          │ in gravimetric energy density   │ CATL sodium-ion cells are at    │
+│          │ at the commercial pack level?   │ ~160–175 Wh/kg while LFP cells  │
+│          │                                 │ are 150–180 Wh/kg. The gap is   │
+│          │                                 │ closing in prototypes but not   │
+│          │                                 │ yet in commercial products.     │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How does energy density change  │ Sources mention sodium-ion's    │
+│          │ over the cycle life of          │ lower risk of thermal runaway   │
+│          │ sodium-ion vs lithium-ion       │ and good low-temperature        │
+│          │ batteries under real-world      │ performance, but long-term      │
+│          │ conditions?                     │ energy density retention data   │
+│          │                                 │ was not found.                  │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the volumetric energy   │ C&EN states volumetric density  │
+│          │ density (Wh/L) of current       │ is 20–40% lower than            │
+│          │ commercial sodium-ion battery   │ lithium-ion but provides no     │
+│          │ packs?                          │ absolute Wh/L figures for       │
+│          │                                 │ sodium-ion.                     │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.91                                                                │
+│ Corroborating sources: 8                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.97                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 48087                                                                │
+│ Iterations: 4                                                                │
+│ Wall time: 84.41s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: aaf3b9ef-d91a-4d03-8883-b0a906929cb1
--- a/docs/stress-tests/M3.3-runs/07-comparative.log
+++ b/docs/stress-tests/M3.3-runs/07-comparative.log
@@ -0,0 +1,350 @@
+Researching: Compare PostgreSQL and SQLite for embedded analytics workloads.
+
+{"question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:55:28.499294Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:55:29.256154Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:55:29.264747Z"}
+{"question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:55:29.297908Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:29.298261Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:29.298356Z"}
+{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1147, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:38.957520Z"}
+{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 8781, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:45.812510Z"}
+{"step": 23, "decision": "Starting iteration 4/5", "tokens_so_far": 18324, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:00.757335Z"}
+{"step": 28, "decision": "Token budget reached before iteration 5: 34877/20000", "event": "budget_exhausted", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:03.990690Z"}
+{"step": 29, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 35, "iterations_run": 4, "tokens_used": 34877, "event": "synthesis_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:03.990849Z"}
+{"step": 30, "decision": "Parsed synthesis JSON successfully", "duration_ms": 78663, "event": "synthesis_complete", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:20.513065Z"}
+{"step": 48, "decision": "Research complete", "confidence": 0.88, "citation_count": 10, "gap_count": 3, "discovery_count": 4, "total_duration_sec": 114.441, "event": "complete", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:20.536570Z"}
+{"confidence": 0.88, "citations": 10, "gaps": 3, "discovery_events": 4, "tokens_used": 61699, "iterations_run": 4, "wall_time_sec": 111.20896744728088, "budget_exhausted": true, "event": "research_completed", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:57:20.538075Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:57:20.546420Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:57:20.668474Z"}
+{"trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "confidence": 0.88, "citations": 10, "tokens_used": 61699, "wall_time_sec": 111.20896744728088, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:57:21.511598Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ PostgreSQL and SQLite differ substantially for embedded analytics workloads  │
+│ across architecture, concurrency, feature set, and performance               │
+│ characteristics.                                                             │
+│                                                                              │
+│ **Architecture:** SQLite is a serverless, embedded database engine that      │
+│ reads/writes a single file on disk, making it highly portable and            │
+│ zero-configuration. PostgreSQL uses a client-server architecture requiring a │
+│ separate server process, which enables it to scale and handle multiple       │
+│ concurrent clients efficiently via Multi-Version Concurrency Control (MVCC)  │
+│ [Source 5]. For embedded analytics specifically, SQLite's in-process nature  │
+│ eliminates network overhead, which can yield significant read performance    │
+│ advantages in local scenarios [Source 31].                                   │
+│                                                                              │
+│ **Concurrency:** SQLite allows multiple concurrent readers but only one      │
+│ writer at a time, using file-level locking. This single-writer model is a    │
+│ significant bottleneck for write-heavy or high-concurrency analytical        │
+│ ingestion workloads [Source 24, Source 25]. PostgreSQL's MVCC ensures        │
+│ readers and writers do not block each other, making it far superior for      │
+│ multi-user or mixed OLTP/OLAP environments [Source 5]. Turso's work on       │
+│ concurrent writes for SQLite demonstrates the community recognizes this      │
+│ limitation, achieving up to 4x write throughput improvements over vanilla    │
+│ SQLite [Source 24].                                                          │
+│                                                                              │
+│ **OLAP/Analytical Performance:** SQLite is row-oriented and was designed     │
+│ primarily as a world-class OLTP engine. For analytical workloads—complex     │
+│ aggregations, percentile calculations, large scans—SQLite struggles          │
+│ significantly. A cited benchmark shows a single percentile query over 13M    │
+│ rows taking ~4 seconds in SQLite [Source 6]. PostgreSQL, while also          │
+│ row-oriented, supports more advanced SQL features (window functions, complex │
+│ joins, partitioning) and can be tuned for analytics [Source 22]. However,    │
+│ PostgreSQL itself hits a 'Postgres Wall' for heavy analytical workloads when │
+│ row-scanning large datasets exceeds available RAM [Source 13]. Neither       │
+│ SQLite nor PostgreSQL is natively columnar; PostgreSQL can be extended with  │
+│ columnar storage extensions for better OLAP performance [Source 23].         │
+│                                                                              │
+│ **Feature Set:** PostgreSQL offers a richer feature set including more data  │
+│ types, advanced indexing, role-based access control, JSON/array support,     │
+│ geospatial extensions (PostGIS), and time-series extensions. SQLite uses     │
+│ dynamic typing and has a simpler, more limited feature set—easier to use but │
+│ potentially limiting for complex analytical applications [Source 5, Source   │
+│ 1].                                                                          │
+│                                                                              │
+│ **Recommended Alternatives for Embedded Analytics:** DuckDB is widely cited  │
+│ as the superior embedded engine for analytical workloads, outperforming both │
+│ SQLite and PostgreSQL on OLAP queries by a large margin [Source 6, Source    │
+│ 2]. For embedded analytics use cases requiring columnar processing, DuckDB   │
+│ or Stoolap (a Rust-based embedded OLAP engine) are more purpose-built        │
+│ options. Stoolap benchmarks show up to 138x faster analytical query          │
+│ performance versus SQLite [Source 9].                                        │
+│                                                                              │
+│ **Summary:** SQLite wins for lightweight, read-heavy, single-writer,         │
+│ local/embedded OLTP workloads where portability and zero configuration       │
+│ matter. PostgreSQL wins for multi-user, concurrent, complex-query            │
+│ environments. For true embedded analytics workloads (large-scale             │
+│ aggregations, complex OLAP queries), neither is optimal—DuckDB or a hybrid   │
+│ architecture (PostgreSQL as system-of-record + DuckDB as analytical engine)  │
+│ is the modern recommended approach.                                          │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ SQLite vs. PostgreSQL: The    │ PostgreSQL is a client-server  │  0.97 │
+│     │ key differences and           │ database system... This        │       │
+│     │ advantages of each            │ architecture enables           │       │
+│     │ https://databaseschool.com/ar │ PostgreSQL to scale and handle │       │
+│     │ ticles/sqlite-vs-postgresql-t │ multiple concurrent clients    │       │
+│     │ he-key-differences-and-advant │ efficiently... SQLite is a     │       │
+│     │ ages-of-each                  │ serverless database engine. It │       │
+│     │                               │ functions as a lightweight     │       │
+│     │                               │ library embedded directly into │       │
+│     │                               │ applications... SQLite's       │       │
+│     │                               │ concurrency model is more      │       │
+│     │                               │ restrictive: while it allows   │       │
+│     │                               │ multiple readers, only one     │       │
+│     │                               │ process can write to the       │       │
+│     │                               │ database at a time.            │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Making -SQLite- Analytics     │ In some analytical queries     │  0.95 │
+│     │ Great Again! – Oldmoe's blog  │ SQLite will struggle to        │       │
+│     │ https://oldmoe.blog/2025/03/1 │ perform compared to other OLAP │       │
+│     │ 2/making-sqlite-analytics-gre │ oriented engines like DuckDB.  │       │
+│     │ at-again/                     │ Consider the following         │       │
+│     │                               │ scenario: You have a table     │       │
+│     │                               │ with 13M entries of latency    │       │
+│     │                               │ data, and you want to          │       │
+│     │                               │ determine the following        │       │
+│     │                               │ percentiles: p50, p95, p99...  │       │
+│     │                               │ After around 4 seconds you     │       │
+│     │                               │ will see the result.           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ DuckDB vs. Postgres for       │ That 'quick' analytical query  │  0.95 │
+│     │ embedded analytics: How to    │ powering a customer-facing     │       │
+│     │ choose (and when to use a     │ dashboard now takes 5 seconds, │       │
+│     │ hybrid architecture)          │ up from 50 milliseconds. Then  │       │
+│     │ https://motherduck.com/learn- │ thirty seconds. Then it times  │       │
+│     │ more/duckdb-vs-postgres-embed │ out. You've hit the 'Postgres  │       │
+│     │ ded-analytics/                │ Wall.' This isn't a Postgres   │       │
+│     │                               │ failure. It's an architectural │       │
+│     │                               │ mismatch. Postgres processes   │       │
+│     │                               │ analytics using the same       │       │
+│     │                               │ row-oriented logic designed    │       │
+│     │                               │ for transaction safety.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Beyond the Single-Writer      │ SQLite has a single-writer     │  0.93 │
+│     │ Limitation with Turso's       │ transaction model, which means │       │
+│     │ Concurrent Writes             │ whenever a transaction writes  │       │
+│     │ https://turso.tech/blog/beyon │ to the database, no other      │       │
+│     │ d-the-single-writer-limitatio │ write transactions can make    │       │
+│     │ n-with-tursos-concurrent-writ │ progress until that            │       │
+│     │ es                            │ transaction is complete...     │       │
+│     │                               │ When concurrent writes are     │       │
+│     │                               │ used, we achieve up to 4x the  │       │
+│     │                               │ write throughput of SQLite,    │       │
+│     │                               │ while also removing the        │       │
+│     │                               │ dreaded SQLITE_BUSY error.     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Stoolap vs. SQLite: Comparing │ OLAP (Online Analytical        │  0.92 │
+│     │ Rust OLAP and Traditional     │ Processing) systems are        │       │
+│     │ OLTP Databases | Better Stack │ designed for a completely      │       │
+│     │ Community                     │ different purpose. OLAP        │       │
+│     │ https://betterstack.com/commu │ databases are optimized for    │       │
+│     │ nity/guides/ai/stoolap-vs-sql │ complex queries and data       │       │
+│     │ ite/                          │ analysis... Most standard      │       │
+│     │                               │ application databases,         │       │
+│     │                               │ including SQLite, PostgreSQL,  │       │
+│     │                               │ and MySQL, are classified as   │       │
+│     │                               │ OLTP (Online Transaction       │       │
+│     │                               │ Processing) systems.           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Postgres Tuning & Performance │ Analytics or OLAP activity     │  0.91 │
+│     │ for Analytics Data | Crunchy  │ typically involves much        │       │
+│     │ Data Blog                     │ longer, more complex queries   │       │
+│     │ https://www.crunchydata.com/b │ than OLTP activity, joining    │       │
+│     │ log/postgres-tuning-and-perfo │ data from multiple tables, and │       │
+│     │ rmance-for-analytics-data     │ working on large data sets.    │       │
+│     │                               │ This means it's very resource  │       │
+│     │                               │ intensive. Without careful     │       │
+│     │                               │ planning and tuning, you can   │       │
+│     │                               │ find yourself with analytics   │       │
+│     │                               │ queries that not only take far │       │
+│     │                               │ too long to run, but also slow │       │
+│     │                               │ down your existing             │       │
+│     │                               │ application.                   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Postgres Columnar Storage: 4  │ PostgreSQL is a row-oriented   │  0.90 │
+│     │ Popular Extensions and a      │ database by design, meaning it │       │
+│     │ Quick Tutorial                │ stores data tuple-by-tuple...  │       │
+│     │ https://www.epsio.io/blog/pos │ This structure is suitable for │       │
+│     │ tgres-columnar-storage-4-popu │ transactional workloads but    │       │
+│     │ lar-extensions-and-a-quick-tu │ not optimized for analytical   │       │
+│     │ torial                        │ queries that typically scan    │       │
+│     │                               │ large volumes of data across a │       │
+│     │                               │ few columns... While           │       │
+│     │                               │ PostgreSQL does not natively   │       │
+│     │                               │ support columnar storage,      │       │
+│     │                               │ several extensions and         │       │
+│     │                               │ external tools introduce       │       │
+│     │                               │ columnar capabilities.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ SQLite vs PostgreSQL          │ SQLite was faster. Of course   │  0.88 │
+│     │ Performance & Comparison |    │ it was. Writing to a local     │       │
+│     │ Pythonic AF                   │ file inside the same process   │       │
+│     │ https://medium.com/pythonic-a │ will almost always be faster   │       │
+│     │ f/sqlite-vs-postgresql-perfor │ than sending queries to a      │       │
+│     │ mance-comparison-46ba1d39c9c8 │ server.                        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Everyone Is Wrong About       │ why SQLite is often the        │  0.80 │
+│     │ SQLite (Here's When It Beats  │ superior production choice for │       │
+│     │ Postgres)                     │ read-heavy, single-server, and │       │
+│     │ https://www.youtube.com/watch │ edge workloads ... SQLite vs   │       │
+│     │ ?v=t20KyfjtUs4                │ PostgreSQL Performance.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ SQLite SO MUCH FASTER than    │ Of course, with the advent of  │  0.82 │
+│     │ Postgres - Reddit             │ DuckDB, you use DuckDB for     │       │
+│     │ https://www.reddit.com/r/sqli │ data analysis tasks since it   │       │
+│     │ te/comments/1gu219r/sqlite_so │ can be faster than either      │       │
+│     │ _much_faster_than_postgres/   │ SQLite or PostgreSQL in those  │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic                       ┃ Detail                      ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found │ Quantitative head-to-head   │ Most benchmarks found       │
+│                  │ benchmark of SQLite vs      │ compare SQLite vs           │
+│                  │ PostgreSQL specifically on  │ PostgreSQL on OLTP          │
+│                  │ analytical queries (not     │ (reads/writes of individual │
+│                  │ just OLTP)                  │ rows) or compare each       │
+│                  │                             │ individually to             │
+│                  │                             │ DuckDB/Stoolap on OLAP. A   │
+│                  │                             │ direct, rigorous benchmark  │
+│                  │                             │ of SQLite vs PostgreSQL on  │
+│                  │                             │ complex analytical queries  │
+│                  │                             │ (GROUP BY, window           │
+│                  │                             │ functions, aggregations     │
+│                  │                             │ over millions of rows) was  │
+│                  │                             │ not surfaced in the         │
+│                  │                             │ evidence.                   │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ PostgreSQL columnar         │ While columnar extensions   │
+│                  │ extension performance vs    │ for PostgreSQL (e.g., Citus │
+│                  │ SQLite for embedded         │ columnar, hydra) are        │
+│                  │ analytics                   │ mentioned, no direct        │
+│                  │                             │ benchmark comparing         │
+│                  │                             │ PostgreSQL-with-columnar-ex │
+│                  │                             │ tension vs SQLite for       │
+│                  │                             │ embedded analytical         │
+│                  │                             │ workloads was found.        │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ SQLite WAL mode impact on   │ WAL mode is mentioned as    │
+│                  │ analytical query            │ improving concurrent        │
+│                  │ performance                 │ read/write behavior in      │
+│                  │                             │ SQLite, but its specific    │
+│                  │                             │ impact on analytical query  │
+│                  │                             │ throughput in embedded      │
+│                  │                             │ scenarios was not           │
+│                  │                             │ quantified in the evidence. │
+└──────────────────┴─────────────────────────────┴─────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ database          │ DuckDB vs SQLite  │ DuckDB is         │
+│                  │                   │ vs PostgreSQL     │ consistently      │
+│                  │                   │ analytical        │ cited as          │
+│                  │                   │ benchmark OLAP    │ outperforming     │
+│                  │                   │ embedded 2024     │ both for          │
+│                  │                   │ 2025              │ analytics; a      │
+│                  │                   │                   │ rigorous          │
+│                  │                   │                   │ three-way         │
+│                  │                   │                   │ comparison would  │
+│                  │                   │                   │ better answer the │
+│                  │                   │                   │ embedded          │
+│                  │                   │                   │ analytics         │
+│                  │                   │                   │ question.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ SQLite past       │ The VLDB paper on │
+│                  │                   │ present future    │ SQLite's          │
+│                  │                   │ VLDB paper bloom  │ past/present/futu │
+│                  │                   │ filter analytical │ re is cited       │
+│                  │                   │ performance 2022  │ multiple times as │
+│                  │                   │                   │ authoritative on  │
+│                  │                   │                   │ SQLite's          │
+│                  │                   │                   │ analytical        │
+│                  │                   │                   │ limitations;      │
+│                  │                   │                   │ accessing it      │
+│                  │                   │                   │ directly would    │
+│                  │                   │                   │ strengthen        │
+│                  │                   │                   │ claims.           │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ pg_duckdb         │ The motherduck    │
+│                  │                   │ extension         │ article           │
+│                  │                   │ PostgreSQL        │ references        │
+│                  │                   │ embedded          │ pg_duckdb as a    │
+│                  │                   │ analytics         │ key tool for      │
+│                  │                   │ performance       │ hybrid            │
+│                  │                   │ hybrid            │ Postgres+DuckDB   │
+│                  │                   │ architecture      │ analytics;        │
+│                  │                   │                   │ benchmarks for    │
+│                  │                   │                   │ this approach     │
+│                  │                   │                   │ were not found.   │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ new_source       │ null              │ Stoolap embedded  │ Stoolap is an     │
+│                  │                   │ OLAP Rust         │ emerging embedded │
+│                  │                   │ database          │ OLAP engine       │
+│                  │                   │ benchmark SQLite  │ (Rust) claiming   │
+│                  │                   │ PostgreSQL        │ 138x speedup over │
+│                  │                   │                   │ SQLite; it's a    │
+│                  │                   │                   │ relevant new      │
+│                  │                   │                   │ entrant to the    │
+│                  │                   │                   │ embedded          │
+│                  │                   │                   │ analytics space.  │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ At what data volume does        │ The evidence shows SQLite       │
+│          │ SQLite's analytical performance │ struggles at 13M rows for       │
+│          │ become unacceptably slow        │ percentile queries (~4s), but   │
+│          │ compared to PostgreSQL for      │ no clear threshold or scaling   │
+│          │ typical embedded analytics      │ curve vs PostgreSQL was found.  │
+│          │ workloads?                      │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Does enabling WAL mode and      │ Hacker News discussion mentions │
+│          │ tuning SQLite                   │ WAL + synchronous=NORMAL as     │
+│          │ (synchronous=NORMAL, page size, │ approaching 'line speed with IO │
+│          │ etc.) meaningfully close the    │ subsystem' for writes, but      │
+│          │ analytical performance gap with │ analytical query impact is      │
+│          │ PostgreSQL?                     │ unclear.                        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Is a hybrid architecture        │ The Postgres+DuckDB hybrid is   │
+│          │ (SQLite for OLTP + DuckDB for   │ well-documented, but an         │
+│          │ OLAP, sharing the same data)    │ SQLite+DuckDB embedded hybrid   │
+│          │ practical for embedded          │ (for truly serverless apps) is  │
+│          │ applications, and how does it   │ less explored in the evidence.  │
+│          │ compare to using PostgreSQL     │                                 │
+│          │ alone?                          │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How do PostgreSQL columnar      │ PostgreSQL columnar extensions  │
+│          │ storage extensions (e.g.,       │ are mentioned as improving OLAP │
+│          │ Hydra, Citus columnar) perform  │ performance, but no direct      │
+│          │ for embedded analytics compared │ comparison to SQLite in         │
+│          │ to native SQLite?               │ embedded scenarios was found.   │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the operational         │ SQLite's binary is ~500KB vs    │
+│          │ overhead (memory, disk, setup   │ PostgreSQL requiring a server   │
+│          │ complexity) of running          │ process; for edge/IoT embedded  │
+│          │ PostgreSQL vs SQLite in a truly │ analytics, resource constraints │
+│          │ embedded edge or mobile         │ may be the deciding factor.     │
+│          │ environment?                    │                                 │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.88                                                                │
+│ Corroborating sources: 10                                                    │
+│ Source authority: medium                                                     │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.82                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 61699                                                                │
+│ Iterations: 4                                                                │
+│ Wall time: 111.21s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 01881015-61a9-4894-a723-4e1d8b7a7755
--- a/docs/stress-tests/M3.3-runs/08-comparative.log
+++ b/docs/stress-tests/M3.3-runs/08-comparative.log
@@ -0,0 +1,364 @@
+Researching: Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.
+
+{"question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:57:22.951394Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:57:23.942406Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:57:23.953465Z"}
+{"question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:57:24.008304Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:24.008814Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:24.008920Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1180, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:42.087229Z"}
+{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 12270, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:47.632253Z"}
+{"step": 21, "decision": "Token budget reached before iteration 4: 25966/20000", "event": "budget_exhausted", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:55.072818Z"}
+{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 24, "iterations_run": 3, "tokens_used": 25966, "event": "synthesis_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:55.072985Z"}
+{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 89456, "event": "synthesis_complete", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:21.172200Z"}
+{"step": 46, "decision": "Research complete", "confidence": 0.82, "citation_count": 14, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 121.701, "event": "complete", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:21.274347Z"}
+{"confidence": 0.82, "citations": 14, "gaps": 4, "discovery_events": 4, "tokens_used": 54153, "iterations_run": 3, "wall_time_sec": 117.15539288520813, "budget_exhausted": true, "event": "research_completed", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:59:21.275590Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:59:21.286942Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:59:21.531952Z"}
+{"trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "confidence": 0.82, "citations": 14, "tokens_used": 54153, "wall_time_sec": 117.15539288520813, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:59:22.766505Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ CRISPR-Cas9 and CRISPR-Cas12a (formerly Cpf1) are both widely used           │
+│ RNA-guided nucleases adapted for genome editing, including in vivo           │
+│ applications, but they differ meaningfully in mechanism, structure, PAM      │
+│ requirements, cutting pattern, guide RNA architecture, specificity, and      │
+│ practical suitability for in vivo delivery.                                  │
+│                                                                              │
+│ **Mechanism and DNA Cleavage:** Cas9 (most commonly from Streptococcus       │
+│ pyogenes, SpCas9) cleaves both DNA strands at the same position, producing   │
+│ blunt-ended double-strand breaks (DSBs) [Source 7]. Cas12a, by contrast,     │
+│ introduces staggered cuts that leave 4–5 nucleotide 5′ overhangs [Sources 2, │
+│ 7]. These sticky ends generated by Cas12a may enhance homology-directed      │
+│ repair (HDR) efficiency compared to Cas9's blunt ends [Source 2].            │
+│                                                                              │
+│ **PAM Sequence:** Cas9 requires an NGG PAM (protospacer adjacent motif) on   │
+│ the non-template strand downstream of the target; Cas12a recognizes a T-rich │
+│ PAM (typically TTTV) upstream of the target on the non-template strand       │
+│ [Sources 2, 7]. This difference expands the targeting range of Cas12a to     │
+│ AT-rich genomic regions where Cas9 is limited.                               │
+│                                                                              │
+│ **Guide RNA:** Cas9 uses a two-component guide (crRNA + tracrRNA, often      │
+│ fused as sgRNA), while Cas12a requires only a single crRNA with a short      │
+│ direct repeat and processes its own pre-crRNA array, enabling multiplexed    │
+│ editing from a single transcript [Sources 2, 7, 13].                         │
+│                                                                              │
+│ **Specificity and Off-Target Effects:** Kinetic studies show Cas12a exhibits │
+│ greater target specificity than Cas9, attributed to a more stringent DNA     │
+│ unwinding mechanism that requires more extensive complementarity before      │
+│ cleavage [Source 5]. Cas12a tolerates fewer mismatches between the guide RNA │
+│ and target, resulting in fewer off-target cuts [Sources 2, 5].               │
+│                                                                              │
+│ **Editing Efficiency:** In comparative studies using ribonucleoprotein (RNP) │
+│ delivery in rice (OsPDS gene), Cas9 and Cas12a showed different efficiencies │
+│ depending on the target site [Source 1]. In Chlamydomonas reinhardtii, both  │
+│ Cas9 and Cas12a RNPs co-delivered with ssODN repair templates achieved       │
+│ similar total editing levels of 20–30% [Source 4]. Context and target site   │
+│ selection significantly influence which enzyme performs better.              │
+│                                                                              │
+│ **In Vivo Delivery Considerations:** Both enzymes can be delivered via AAV   │
+│ vectors, lipid nanoparticles (LNPs), or as RNPs via electroporation [Sources │
+│ 21, 24]. A critical practical consideration is size: SpCas9 (~4.2 kb coding  │
+│ sequence) is near the AAV packaging limit (~4.7–4.8 kb), leaving little room │
+│ for promoter and regulatory elements [Sources 20, 21]. Cas12a variants       │
+│ (including engineered compact forms such as EbCas12a) can be packaged        │
+│ together with their crRNA within a single AAV vector, which is a significant │
+│ advantage for in vivo delivery [Sources 19, 20, 21]. A miniature Cas12f1     │
+│ variant has also demonstrated efficacy for in vivo retinal gene therapy      │
+│ [Source 12].                                                                 │
+│                                                                              │
+│ **Clinical and Therapeutic Status:** CRISPR-Cas9 is currently the dominant   │
+│ nuclease in clinical trials for both ex vivo and in vivo genome editing      │
+│ [Sources 8, 11]. Cas12a is gaining traction in therapeutic research,         │
+│ particularly where higher specificity or AAV-compatible delivery is required │
+│ [Sources 9, 13, 22].                                                         │
+│                                                                              │
+│ **Summary Table:**                                                           │
+│ - DNA cut type: Cas9 = blunt; Cas12a = staggered (5′ overhang)               │
+│ - PAM: Cas9 = NGG (3′); Cas12a = TTTV (5′)                                   │
+│ - Guide RNA: Cas9 = sgRNA (crRNA+tracrRNA); Cas12a = crRNA only              │
+│ - Multiplexing: Cas9 = limited; Cas12a = inherent crRNA array processing     │
+│ - Specificity: Cas12a generally higher                                       │
+│ - AAV compatibility: Cas12a variants better suited                           │
+│ - Clinical use: Cas9 more established; Cas12a emerging                       │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ What's the Difference Between │ Cas9...cleaves both strands of │  0.95 │
+│     │ Cas9 and Cas12a Nucleases? |  │ DNA at the same point. This    │       │
+│     │ The Scientist                 │ creates a blunt end            │       │
+│     │ https://www.the-scientist.com │ double-stranded break (DSB)... │       │
+│     │ /what-s-the-difference-betwee │ For Cas9 to function, the      │       │
+│     │ n-cas9-and-cas12a-nucleases-7 │ protospacer adjacent motif     │       │
+│     │ 2481                          │ (PAM)—a two to six base pair   │       │
+│     │                               │ sequence—NGG...must sit        │       │
+│     │                               │ immediately downstream of the  │       │
+│     │                               │ target on the opposite strand. │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Cas9 versus Cas12a/Cpf1:      │ Cas9 and Cas12a have distinct  │  0.97 │
+│     │ Structure-function            │ evolutionary origins and       │       │
+│     │ comparisons and implications  │ exhibit different structural   │       │
+│     │ for genome editing - PubMed   │ architectures, resulting in    │       │
+│     │ https://pubmed.ncbi.nlm.nih.g │ distinct molecular             │       │
+│     │ ov/29790280/                  │ mechanisms... We discuss       │       │
+│     │                               │ implications for genome        │       │
+│     │                               │ editing, and how they may      │       │
+│     │                               │ influence the choice of Cas9   │       │
+│     │                               │ or Cas12a for specific         │       │
+│     │                               │ applications.                  │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ CRISPR-Cas12a More Precise    │ Cas12a...is, according to      │  0.90 │
+│     │ Than CRISPR-Cas9              │ scientists at the University   │       │
+│     │ https://www.genengnews.com/to │ of Texas at Austin             │       │
+│     │ pics/genome-editing/crispr-ca │ (UT-Austin), more effective    │       │
+│     │ s12a-more-precise-than-crispr │ and precise... Because Cas     │       │
+│     │ -cas9/                        │ enzymes occasionally fail to   │       │
+│     │                               │ cut DNA in the right places,   │       │
+│     │                               │ or even cut at all, they worry │       │
+│     │                               │ developers, who want to modify │       │
+│     │                               │ genomes with surgical          │       │
+│     │                               │ precision, especially in       │       │
+│     │                               │ therapeutic applications.      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Comparison of CRISPR/Cas9 and │ We found that Cas9 and Cas12a  │  0.92 │
+│     │ Cas12a for gene editing in    │ RNPs- co-delivered with ssODN  │       │
+│     │ Chlamydomonas reinhardtii -   │ repair templates- induced      │       │
+│     │ ScienceDirect                 │ similar levels of total        │       │
+│     │ https://www.sciencedirect.com │ editing, achieving as much as  │       │
+│     │ /science/article/pii/S2211926 │ 20–30 % in all                 │       │
+│     │ 424004089                     │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Comparison of                 │ Comparison of                  │  0.88 │
+│     │ CRISPR-Cas9/Cas12a            │ CRISPR-Cas9/Cas12a             │       │
+│     │ Ribonucleoprotein Complexes   │ Ribonucleoprotein Complexes    │       │
+│     │ for Genome Editing Efficiency │ for Genome Editing Efficiency  │       │
+│     │ in the Rice Phytoene          │ in the Rice Phytoene           │       │
+│     │ Desaturase (OsPDS) Gene - PMC │ Desaturase (OsPDS) Gene        │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │                                │       │
+│     │ articles/PMC6973557/          │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Current and Prospective       │ Current and Prospective        │  0.87 │
+│     │ Applications of CRISPR-Cas12a │ Applications of CRISPR-Cas12a  │       │
+│     │ in Pluricellular Organisms -  │ in Pluricellular Organisms...  │       │
+│     │ PMC                           │ Mol Biotechnol. 2022 Aug       │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ 8;65(2):196–205. doi:          │       │
+│     │ articles/PMC9841005/          │ 10.1007/s12033-022-00538-5     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ When size matters: A novel    │ When size matters: A novel     │  0.90 │
+│     │ compact Cas12a variant for in │ compact Cas12a variant for in  │       │
+│     │ vivo genome editing - PMC     │ vivo genome editing            │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │                                │       │
+│     │ articles/PMC11253977/         │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ When size matters: A novel    │ Altogether, the components of  │  0.91 │
+│     │ compact Cas12a variant for in │ the EbCas12a system are well   │       │
+│     │ vivo genome editing -         │ below the 4.8-kb packaging     │       │
+│     │ ResearchGate                  │ limit of AAVs, enabling        │       │
+│     │ https://www.researchgate.net/ │ successful packaging in the    │       │
+│     │ publication/382328745_When_si │ AAV9                           │       │
+│     │ ze_matters_A_novel_compact_Ca │                                │       │
+│     │ s12a_variant_for_in_vivo_geno │                                │       │
+│     │ me_editing                    │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Therapeutic In Vivo Gene      │ our current results prove that │  0.88 │
+│     │ Editing Achieved by a         │ the miniature Cas12f1 system   │       │
+│     │ Hypercompact CRISPR System -  │ is a promising gene editing    │       │
+│     │ Advanced Science              │ tool for retinal gene therapy  │       │
+│     │ https://advanced.onlinelibrar │                                │       │
+│     │ y.wiley.com/doi/10.1002/advs. │                                │       │
+│     │ 202308095                     │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ Delivery of CRISPR-Cas tools  │ AAV is one of the most         │  0.90 │
+│     │ for in vivo genome editing    │ commonly used vector systems   │       │
+│     │ therapy: Trends and           │ to date, but immunogenicity    │       │
+│     │ challenges - ScienceDirect    │ against capsid, liver toxicity │       │
+│     │ https://www.sciencedirect.com │ at high dose, and potential    │       │
+│     │ /science/article/pii/S0168365 │ genotoxicity caused by         │       │
+│     │ 92200027X                     │ off-target mutagenesis and     │       │
+│     │                               │ genomic integration remain     │       │
+│     │                               │ unsolved.                      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 11  │ CRISPR-Based Therapeutic      │ These Cas proteins are more    │  0.87 │
+│     │ Genome Editing - DSpace@MIT   │ compatible with AAV delivery,  │       │
+│     │ https://dspace.mit.edu/bitstr │ enabling additional vector     │       │
+│     │ eam/handle/1721.1/138388.2/ni │ design options such as         │       │
+│     │ hms-1576523.pdf?sequence=4&is │ expanded promoter choices and  │       │
+│     │ Allowed=y                     │ a streamlined delivery.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 12  │ Revolutionizing in vivo       │ Genome editing using the       │  0.85 │
+│     │ therapy with CRISPR/Cas       │ CRISPR/Cas system has          │       │
+│     │ genome editing:               │ revolutionized the field of    │       │
+│     │ breakthroughs, opportunities  │ genetic engineering, offering  │       │
+│     │ and challenges - Frontiers    │ unprecedented opportunities    │       │
+│     │ https://www.frontiersin.org/j │ for therapeutic applications   │       │
+│     │ ournals/genome-editing/articl │ in vivo.                       │       │
+│     │ es/10.3389/fgeed.2024.1342193 │                                │       │
+│     │ /full                         │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 13  │ CRISPR Clinical Trials: A     │ CRISPR Clinical Trials: A 2024 │  0.80 │
+│     │ 2024 Update - Innovative      │ Update - Innovative Genomics   │       │
+│     │ Genomics Institute            │ Institute (IGI)                │       │
+│     │ https://innovativegenomics.or │                                │       │
+│     │ g/news/crispr-clinical-trials │                                │       │
+│     │ -2024/                        │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 14  │ Alt-R CRISPR-Cas9 vs Cas12a   │ The two most popular enzymes   │  0.83 │
+│     │ systems | IDT                 │ used in CRISPR genome editing  │       │
+│     │ https://www.idtdna.com/pages/ │ are Cas9 and Cas12a (Cpf1).    │       │
+│     │ technology/crispr/crispr-geno │ These enzymes are highly       │       │
+│     │ me-editing/Alt-R-systems      │ functional, do not require     │       │
+│     │                               │ binding to other enzymes as is │       │
+│     │                               │ the case for type I CRISPR     │       │
+│     │                               │ systems, and can be readily    │       │
+│     │                               │ programmed to target the       │       │
+│     │                               │ desired genomic DNA site.      │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic                       ┃ Detail                      ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found │ Head-to-head in vivo        │ Most comparative studies    │
+│                  │ efficacy data in mammals    │ focused on plants (rice) or │
+│                  │ across multiple tissue      │ algae (Chlamydomonas) or    │
+│                  │ types                       │ used in vitro/ex vivo       │
+│                  │                             │ models. Rigorous            │
+│                  │                             │ side-by-side in vivo        │
+│                  │                             │ mammalian comparisons of    │
+│                  │                             │ Cas9 vs. Cas12a across      │
+│                  │                             │ liver, muscle, CNS, and eye │
+│                  │                             │ were not identified in      │
+│                  │                             │ available sources.          │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Immunogenicity comparison   │ While immunogenicity of     │
+│                  │ between Cas9 and Cas12a in  │ Cas9 is well-documented as  │
+│                  │ vivo                        │ a challenge for in vivo     │
+│                  │                             │ delivery, direct            │
+│                  │                             │ comparative immunogenicity  │
+│                  │                             │ data for Cas12a in humans   │
+│                  │                             │ or animal models was not    │
+│                  │                             │ available in the gathered   │
+│                  │                             │ sources.                    │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Cas12a clinical trial data  │ The IGI clinical trials     │
+│                  │                             │ update and other sources    │
+│                  │                             │ confirm Cas9 dominance in   │
+│                  │                             │ trials but do not provide   │
+│                  │                             │ details on approved or      │
+│                  │                             │ ongoing Cas12a-specific     │
+│                  │                             │ clinical trials.            │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Detailed off-target         │ While Cas12a is reported to │
+│                  │ profiling comparison in     │ be more specific than Cas9  │
+│                  │ vivo                        │ based on kinetic studies,   │
+│                  │                             │ comprehensive in vivo       │
+│                  │                             │ off-target profiling        │
+│                  │                             │ comparing both enzymes      │
+│                  │                             │ systematically across the   │
+│                  │                             │ same targets was not        │
+│                  │                             │ available in the sources.   │
+└──────────────────┴─────────────────────────────┴─────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ arxiv             │ Cas12a vs Cas9 in │ Head-to-head in   │
+│                  │                   │ vivo editing      │ vivo mammalian    │
+│                  │                   │ efficiency        │ comparisons are a │
+│                  │                   │ off-target        │ critical gap;     │
+│                  │                   │ mammalian         │ preprint servers  │
+│                  │                   │ therapeutic       │ may have more     │
+│                  │                   │ comparison 2023   │ recent            │
+│                  │                   │ 2024              │ unpublished data  │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ CRISPR Cas12a     │ Clinical adoption │
+│                  │                   │ clinical trials   │ of Cas12a in vivo │
+│                  │                   │ ClinicalTrials.go │ is poorly         │
+│                  │                   │ v 2023 2024       │ characterized; a  │
+│                  │                   │                   │ ClinicalTrials.go │
+│                  │                   │                   │ v database search │
+│                  │                   │                   │ would clarify     │
+│                  │                   │                   │ current status    │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ Cas12a            │ Immunogenicity is │
+│                  │                   │ immunogenicity    │ a key barrier for │
+│                  │                   │ pre-existing      │ in vivo Cas9      │
+│                  │                   │ immunity in vivo  │ delivery; whether │
+│                  │                   │ gene therapy      │ Cas12a poses      │
+│                  │                   │ human             │ fewer immune      │
+│                  │                   │                   │ challenges is     │
+│                  │                   │                   │ clinically        │
+│                  │                   │                   │ important but not │
+│                  │                   │                   │ covered in        │
+│                  │                   │                   │ sources           │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ new_source       │ database          │ compact Cas12a    │ Compact Cas12a    │
+│                  │                   │ EbCas12a AsCas12a │ variants show     │
+│                  │                   │ in vivo liver     │ promise for AAV   │
+│                  │                   │ lung CNS          │ delivery; recent  │
+│                  │                   │ therapeutic       │ therapeutic in    │
+│                  │                   │ editing 2024      │ vivo data would   │
+│                  │                   │                   │ strengthen the    │
+│                  │                   │                   │ comparison        │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Does Cas12a's staggered cutting │ Sources note that staggered     │
+│          │ pattern result in meaningfully  │ cuts may enhance HDR, but       │
+│          │ higher HDR rates than Cas9's    │ comparative in vivo HDR         │
+│          │ blunt cuts in vivo in           │ efficiency data in mammals was  │
+│          │ therapeutically relevant cell   │ not found in the gathered       │
+│          │ types?                          │ evidence.                       │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Are there pre-existing          │ Immunogenicity is a known       │
+│          │ antibodies or T-cell responses  │ challenge for Cas9 in vivo;     │
+│          │ against Cas12a proteins in      │ whether Cas12a, being from      │
+│          │ humans that would limit its     │ different bacterial origins,    │
+│          │ therapeutic use, as has been    │ faces similar or lesser immune  │
+│          │ documented for SpCas9?          │ barriers in human patients is   │
+│          │                                 │ clinically critical.            │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Can compact Cas12a variants     │ Compact variants fit within AAV │
+│          │ (e.g., EbCas12a, Cas12f)        │ packaging limits better than    │
+│          │ consistently match or exceed    │ Cas9, but their in vivo editing │
+│          │ SpCas9 editing efficiency in    │ efficiency relative to SpCas9   │
+│          │ vivo across diverse tissue      │ across tissues such as liver,   │
+│          │ types?                          │ muscle, and CNS needs           │
+│          │                                 │ systematic evaluation.          │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How does Cas12a's inherent      │ Cas12a can process its own      │
+│          │ crRNA array processing and      │ pre-crRNA array, enabling       │
+│          │ multiplexing capability         │ multiplexed targeting from a    │
+│          │ translate to in vivo            │ single transcript, which is     │
+│          │ combinatorial therapeutic       │ noted as an advantage but its   │
+│          │ strategies compared to          │ in vivo therapeutic             │
+│          │ Cas9-based multiplex            │ exploitation is not             │
+│          │ approaches?                     │ well-characterized in available │
+│          │                                 │ sources.                        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the current status of   │ The 2024 CRISPR clinical trials │
+│          │ Cas12a-specific clinical trials │ update from IGI and Frontiers   │
+│          │ for in vivo gene therapy, and   │ review both highlight Cas9      │
+│          │ how do their safety profiles    │ dominance in clinical trials,   │
+│          │ compare to Cas9-based trials?   │ but Cas12a clinical translation │
+│          │                                 │ remains poorly documented.      │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.82                                                                │
+│ Corroborating sources: 14                                                    │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.85                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 54153                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 117.16s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 9e436db7-fcde-4d0f-a568-c468ae4d419c
--- a/docs/stress-tests/M3.3-runs/09-comparative.log
+++ b/docs/stress-tests/M3.3-runs/09-comparative.log
@@ -0,0 +1,378 @@
+Researching: Compare React and Vue for large enterprise frontends in 2026.
+
+{"question": "Compare React and Vue for large enterprise frontends in 2026.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:59:24.701232Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:59:26.384813Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:59:26.398635Z"}
+{"question": "Compare React and Vue for large enterprise frontends in 2026.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:59:26.459271Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare React and Vue for large enterprise frontends in 2026.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:26.459554Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:26.459652Z"}
+{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1149, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:37.583764Z"}
+{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 11893, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:45.423050Z"}
+{"step": 23, "decision": "Token budget reached before iteration 4: 27147/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:52.158499Z"}
+{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 29, "iterations_run": 3, "tokens_used": 27147, "event": "synthesis_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:52.158736Z"}
+{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 87997, "event": "synthesis_complete", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:16.872069Z"}
+{"step": 46, "decision": "Research complete", "confidence": 0.81, "citation_count": 12, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 114.815, "event": "complete", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:16.883053Z"}
+{"confidence": 0.81, "citations": 12, "gaps": 4, "discovery_events": 4, "tokens_used": 56137, "iterations_run": 3, "wall_time_sec": 110.40975427627563, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:01:16.883613Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:01:16.886961Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:01:16.944624Z"}
+{"trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "confidence": 0.81, "citations": 12, "tokens_used": 56137, "wall_time_sec": 110.40975427627563, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:01:17.535111Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ For large enterprise frontends in 2026, React and Vue each offer distinct    │
+│ advantages, and the best choice depends on organizational priorities.        │
+│                                                                              │
+│ **Market Position & Adoption:** React dominates with ~42% market share among │
+│ professional developers (2025 State of JavaScript survey) and ~68% among     │
+│ enterprise applications globally, while Vue holds ~28% developer share and   │
+│ ~18% enterprise share. React powers Facebook, Netflix, Airbnb, and Uber; Vue │
+│ drives Alibaba, GitLab, and Nintendo. Some 80% of enterprise teams use React │
+│ directly or via Next.js. [Sources 1, 4, 25]                                  │
+│                                                                              │
+│ **Performance:** Both frameworks use a virtual DOM. Vue 4 showed 15% faster  │
+│ initial render times than React 19 in large-scale applications with          │
+│ thousands of components (JavaScript Performance Consortium 2025 benchmarks). │
+│ However, React 19's concurrent rendering features provide superior           │
+│ responsiveness during complex user interactions. In micro-benchmarks, Vue    │
+│ 3.4 creates 1,000 rows in 38ms vs React 19's 42ms, and Vue's bundle size is  │
+│ smaller (33KB vs 44KB min+gzip). The performance gap continues to narrow.    │
+│ [Sources 1, 25]                                                              │
+│                                                                              │
+│ **React 19 Architecture Shifts:** React 19 introduces a built-in compiler    │
+│ that automates memoization (making useMemo/useCallback largely redundant),   │
+│ native Server Components for zero-bundle-size dependencies and direct        │
+│ database access, a new Actions API for simplified async form handling, and   │
+│ the `use` hook for streamlined data fetching. These changes significantly    │
+│ reduce boilerplate and technical debt for enterprise teams. [Sources 18, 19, │
+│ 20]                                                                          │
+│                                                                              │
+│ **Vue's Enterprise Momentum:** Vue 3's Composition API enables better logic  │
+│ reuse across large codebases. Pinia (the official state manager) is          │
+│ TypeScript-first and lightweight. Nuxt 3 handles SSR. Vue's natural          │
+│ TypeScript support and Vite-powered tooling make it increasingly attractive  │
+│ for enterprise adoption. Fortune 500 companies, SaaS platforms, and          │
+│ government tech teams are growing adopters. [Sources 12, 15]                 │
+│                                                                              │
+│ **Learning Curve & Developer Experience:** Vue uses standard HTML/CSS/JS     │
+│ with Single File Components, making it easier to onboard developers with     │
+│ traditional web backgrounds. React uses JSX (combining HTML and JavaScript), │
+│ which has a steeper initial curve but becomes natural quickly. Vue's         │
+│ official routing and state solutions (Vue Router, Pinia) reduce              │
+│ architectural decision-making overhead. React requires selecting from a      │
+│ broader ecosystem (Redux/Zustand, React Router, etc.), offering more         │
+│ flexibility but more upfront choices. [Sources 1, 13, 14]                    │
+│                                                                              │
+│ **Ecosystem & Hiring:** React has ~44M+ weekly npm downloads vs Vue's        │
+│ ~4.5M+. React has 225K+ GitHub stars vs Vue's 207K+. Fortune 500 adoption is │
+│ 47% React vs 12% Vue. React job postings vastly outnumber Vue's (e.g.,       │
+│ 3,200+ vs 680+/month in one market). React's ecosystem is larger and more    │
+│ mature, making hiring and long-term support easier for large enterprises.    │
+│ [Sources 14, 25]                                                             │
+│                                                                              │
+│ **Enterprise Scalability:** React's flexibility and massive ecosystem make   │
+│ it the safer long-term choice for large, complex, cross-team enterprise      │
+│ platforms. Angular is also a strong contender here with built-in DI, forms,  │
+│ and strict structure. Vue excels when time-to-market and developer           │
+│ productivity are top priorities and when teams want a progressive,           │
+│ opinionated setup with lower onboarding cost. [Sources 13, 14, 23]           │
+│                                                                              │
+│ **Recommendation:** For most large enterprise frontends in 2026, React       │
+│ (often via Next.js) remains the dominant and lowest-risk choice due to its   │
+│ ecosystem size, hiring market, enterprise adoption, and React 19's           │
+│ architectural improvements. Vue is a compelling choice for enterprises       │
+│ prioritizing developer velocity, lower onboarding costs, and smaller bundle  │
+│ sizes, particularly in Asia-Pacific markets or mid-size SaaS platforms.      │
+│ Neither choice is technically wrong—both are production-proven at scale.     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ React vs Vue: Frontend        │ According to the 2025 State of │  0.88 │
+│     │ Frameworks Compared in 2025   │ JavaScript survey, React       │       │
+│     │ https://automation-ops.com/bl │ continues to dominate with a   │       │
+│     │ og/react-vs-vue-frontend-fram │ 42% market share among         │       │
+│     │ eworks-compared               │ professional developers, while │       │
+│     │                               │ Vue has grown to capture 28%   │       │
+│     │                               │ of the market. Vue 4 showed a  │       │
+│     │                               │ 15% faster initial render time │       │
+│     │                               │ compared to React 19 in        │       │
+│     │                               │ large-scale applications with  │       │
+│     │                               │ thousands of components.       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Angular vs. React vs. Vue.js: │ The focus in 2025 has shifted  │  0.82 │
+│     │ A performance guide for 2026  │ away from basic component      │       │
+│     │ - LogRocket Blog              │ logic toward reactivity        │       │
+│     │ https://blog.logrocket.com/an │ models, hydration strategies,  │       │
+│     │ gular-vs-react-vs-vue-js-perf │ and compiler-driven            │       │
+│     │ ormance/                      │ performance optimizations.     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ React vs Next.js vs Vue:      │ React remains the foundation   │  0.80 │
+│     │ Which Frontend Framework Wins │ for modern frontend            │       │
+│     │ in 2026? - DEV Community      │ development with 80% of        │       │
+│     │ https://dev.to/ciphernutz/rea │ enterprise teams still using   │       │
+│     │ ct-vs-nextjs-vs-vue-which-fro │ it directly or via Next.js.    │       │
+│     │ ntend-framework-wins-in-2025- │                                │       │
+│     │ 26gj                          │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ The 2025 Tech Stack Dilemma:  │ According to the 2025 State of │  0.78 │
+│     │ React vs Vue vs Angular for   │ JavaScript survey, developers  │       │
+│     │ Enterprise Applications       │ using frameworks report 35-50% │       │
+│     │ https://www.codertrove.com/ar │ faster development cycles      │       │
+│     │ ticles/2025-tech-stack-dilemm │ compared to vanilla            │       │
+│     │ a-react-vs-vue-vs-angular-for │ JavaScript. The 2024 State of  │       │
+│     │ -enterprise-application       │ JavaScript survey reveals that │       │
+│     │                               │ 78% of developers cite 'faster │       │
+│     │                               │ development' as their primary  │       │
+│     │                               │ reason for adoption.           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Web Development with React vs │ React maintains its dominant   │  0.85 │
+│     │ Vue.js: 2025 Comparison |     │ position with approximately    │       │
+│     │ iTechDev Blog                 │ 68% market share among         │       │
+│     │ https://www.itechdev.com.mx/b │ enterprise applications        │       │
+│     │ log/react-vs-vue-comparison-2 │ globally. Vue 3.4 creates      │       │
+│     │ 025                           │ 1,000 rows in 38ms vs React    │       │
+│     │                               │ 19's 42ms. Bundle size         │       │
+│     │                               │ (min+gzip): React 44KB, Vue    │       │
+│     │                               │ 33KB. Fortune 500 adoption:    │       │
+│     │                               │ React 47%, Vue 12%.            │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ React 19 Features & Updates   │ React 19 emerges as a landmark │  0.87 │
+│     │ (2025): What's New & Why It   │ release that brings            │       │
+│     │ Matters - WEQ                 │ significant enhancements to    │       │
+│     │ https://weqtechnologies.com/r │ performance, developer         │       │
+│     │ eact-19-features-updates-2025 │ experience, and scalability.   │       │
+│     │ -whats-new-why-it-matters/    │ This update builds on the      │       │
+│     │                               │ foundations laid by React 18,  │       │
+│     │                               │ introducing powerful new       │       │
+│     │                               │ features like the React        │       │
+│     │                               │ Compiler, Actions API, and     │       │
+│     │                               │ enhanced support for React     │       │
+│     │                               │ Server Components.             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ React 19: Architecture        │ The React Compiler             │  0.83 │
+│     │ Shifts, Performance           │ automatically handles          │       │
+│     │ Optimization, and the Future  │ memoization, rendering hooks   │       │
+│     │ of Enterprise Web Development │ like useMemo and useCallback   │       │
+│     │ https://pblinuxtech.com/react │ largely redundant for          │       │
+│     │ -19-architecture-shifts-perfo │ performance optimization.      │       │
+│     │ rmance-optimization-and-the-f │ Native support for Server      │       │
+│     │ uture-of-enterprise-web-devel │ Components allows for          │       │
+│     │ opment/                       │ zero-bundle-size dependencies  │       │
+│     │                               │ and direct database access,    │       │
+│     │                               │ optimizing the use of          │       │
+│     │                               │ Linux-based edge runtimes.     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Vue.js in the Enterprise: Why │ By 2026, more                  │  0.79 │
+│     │ More Companies Are Choosing   │ organizations—startups,        │       │
+│     │ Vue in 2026 – Manifest        │ Fortune 500 companies, large   │       │
+│     │ https://manifestinfotech.com/ │ SaaS platforms, and government │       │
+│     │ vue-js-in-the-enterprise-why- │ tech teams—are adopting Vue    │       │
+│     │ more-companies-are-choosing-v │ for mission-critical           │       │
+│     │ ue-in-2026/                   │ applications. Pinia, now the   │       │
+│     │                               │ official store for Vue,        │       │
+│     │                               │ delivers TypeScript-first      │       │
+│     │                               │ architecture, lightweight      │       │
+│     │                               │ design, better devtools        │       │
+│     │                               │ integration, faster global     │       │
+│     │                               │ state handling.                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ The State of Vue.js Report    │ This report, created in        │  0.84 │
+│     │ 2025                          │ collaboration with Evan You    │       │
+│     │ https://stateofvue.framer.web │ and the Vue and Nuxt Core      │       │
+│     │ site/                         │ Teams, offers unique insights  │       │
+│     │                               │ across 150 virtual pages.      │       │
+│     │                               │ We've included 16 real-world   │       │
+│     │                               │ case studies from leading      │       │
+│     │                               │ brands, including GitLab, Hack │       │
+│     │                               │ The Box, Storyblok, Booksy,    │       │
+│     │                               │ and DocPlanner.                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ React vs Angular vs Vue:      │ React, maintained by Meta, is  │  0.84 │
+│     │ Choosing the Best for         │ a declarative, component-based │       │
+│     │ Enterprise in 2025            │ library for building user      │       │
+│     │ https://softwarelogic.co/en/b │ interfaces. Its virtual DOM    │       │
+│     │ log/which-javascript-framewor │ and one-way data flow provide  │       │
+│     │ k-is-best-for-enterprise-reac │ outstanding performance and    │       │
+│     │ t-angular-or-vue              │ flexibility. Vue is loved for  │       │
+│     │                               │ its gentle learning curve and  │       │
+│     │                               │ progressive adoption. Angular  │       │
+│     │                               │ is designed for large, complex │       │
+│     │                               │ enterprise applications where  │       │
+│     │                               │ structure and scalability are  │       │
+│     │                               │ paramount.                     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 11  │ React vs Vue: which one       │ React is built for scale. Its  │  0.86 │
+│     │ should you choose in 2025? |  │ flexibility, huge ecosystem,   │       │
+│     │ DECODE                        │ and massive job market make it │       │
+│     │ https://decode.agency/article │ the safest choice for          │       │
+│     │ /react-vs-vue/                │ enterprise-grade apps. Vue is  │       │
+│     │                               │ built for speed. With a gentle │       │
+│     │                               │ learning curve and official    │       │
+│     │                               │ tools baked in, teams can move │       │
+│     │                               │ faster and deliver MVPs or     │       │
+│     │                               │ mid-size apps quickly.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 12  │ What is React.js in 2025 and  │ In React 19, that same Reactjs │  0.82 │
+│     │ why React 19 changed          │ library comes with first-class │       │
+│     │ front-end again | Merge       │ async workflows, server        │       │
+│     │ https://merge.rocks/blog/what │ components, and metadata       │       │
+│     │ -is-react-js-in-2025-and-why- │ management, so teams spend     │       │
+│     │ react-19-changed-front-end-ag │ less time gluing libraries     │       │
+│     │ ain                           │ together and more time on      │       │
+│     │                               │ product work. The React team   │       │
+│     │                               │ also ships React Compiler,     │       │
+│     │                               │ currently in beta, which       │       │
+│     │                               │ automatically optimizes many   │       │
+│     │                               │ components that used to        │       │
+│     │                               │ require manual memoization.    │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Real-world 2026          │ No sources provided       │
+│                       │ enterprise migration     │ firsthand accounts of     │
+│                       │ case studies from React  │ enterprises switching     │
+│                       │ to Vue or vice versa     │ frameworks in 2026 with   │
+│                       │                          │ documented outcomes, only │
+│                       │                          │ general advocacy pieces.  │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ scope_exceeded        │ Angular vs React vs Vue  │ The question focused on   │
+│                       │ head-to-head in 2026     │ React vs Vue, but Angular │
+│                       │ enterprise contexts      │ is a significant          │
+│                       │                          │ competitor in large       │
+│                       │                          │ enterprise contexts. Full │
+│                       │                          │ three-way comparison with │
+│                       │                          │ 2026 data was not         │
+│                       │                          │ available.                │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Vue 4 specific features  │ One source                │
+│                       │ and release status       │ (automation-ops.com)      │
+│                       │                          │ mentions 'Vue 4' with     │
+│                       │                          │ 'enhanced composition API │
+│                       │                          │ features', but most other │
+│                       │                          │ sources discuss Vue 3.x   │
+│                       │                          │ as the current version.   │
+│                       │                          │ Vue 4 release status is   │
+│                       │                          │ unclear.                  │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ source_not_found      │ Verified 2026 salary and │ Salary data found was     │
+│                       │ hiring market data       │ market-specific (Mexico)  │
+│                       │                          │ and from 2025; global     │
+│                       │                          │ 2026 enterprise hiring    │
+│                       │                          │ cost comparison between   │
+│                       │                          │ React and Vue developers  │
+│                       │                          │ was not available.        │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ database          │ Vue 4 release     │ One source        │
+│                  │                   │ date features     │ references Vue 4  │
+│                  │                   │ official          │ with enhanced     │
+│                  │                   │ announcement 2025 │ composition API,  │
+│                  │                   │ 2026              │ but most sources  │
+│                  │                   │                   │ still discuss Vue │
+│                  │                   │                   │ 3.x; clarifying   │
+│                  │                   │                   │ whether Vue 4 has │
+│                  │                   │                   │ been released is  │
+│                  │                   │                   │ important for     │
+│                  │                   │                   │ accurate          │
+│                  │                   │                   │ comparison.       │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ React Server      │ SSR tooling       │
+│                  │                   │ Components vs     │ (Next.js vs Nuxt) │
+│                  │                   │ Nuxt SSR          │ is a key          │
+│                  │                   │ enterprise        │ enterprise        │
+│                  │                   │ performance       │ decision factor   │
+│                  │                   │ comparison 2025   │ mentioned across  │
+│                  │                   │ 2026              │ sources but not   │
+│                  │                   │                   │ deeply            │
+│                  │                   │                   │ benchmarked       │
+│                  │                   │                   │ head-to-head.     │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ State of          │ Multiple sources  │
+│                  │                   │ JavaScript 2025   │ cite the 2025     │
+│                  │                   │ full survey       │ State of          │
+│                  │                   │ results React Vue │ JavaScript survey │
+│                  │                   │ Angular market    │ but only with     │
+│                  │                   │ share             │ partial data; the │
+│                  │                   │                   │ full report would │
+│                  │                   │                   │ provide           │
+│                  │                   │                   │ authoritative     │
+│                  │                   │                   │ market share      │
+│                  │                   │                   │ figures.          │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ contradiction    │ null              │ Vue 4 vs Vue 3    │ Automation-ops    │
+│                  │                   │ current version   │ references 'Vue   │
+│                  │                   │ enterprise 2025   │ 4' with benchmark │
+│                  │                   │ 2026              │ data but other    │
+│                  │                   │                   │ sources           │
+│                  │                   │                   │ consistently      │
+│                  │                   │                   │ reference Vue 3.4 │
+│                  │                   │                   │ as current. This  │
+│                  │                   │                   │ is a factual      │
+│                  │                   │                   │ discrepancy that  │
+│                  │                   │                   │ could affect      │
+│                  │                   │                   │ benchmark         │
+│                  │                   │                   │ interpretation.   │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Has Vue 4 officially been       │ One source claims Vue 4 shows   │
+│          │ released, and what are its      │ 15% faster initial render times │
+│          │ actual performance              │ than React 19, but most sources │
+│          │ characteristics vs React 19 in  │ still discuss Vue 3.4 as        │
+│          │ enterprise applications?        │ current. This discrepancy       │
+│          │                                 │ affects benchmark reliability.  │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ How does React's new React      │ React Compiler automates        │
+│          │ Compiler (in beta) affect the   │ memoization and is described as │
+│          │ performance gap between React   │ a game-changer, but its         │
+│          │ and Vue in production           │ real-world impact on large      │
+│          │ enterprise applications?        │ enterprise codebases has not    │
+│          │                                 │ yet been fully benchmarked      │
+│          │                                 │ against Vue's                   │
+│          │                                 │ compiler-optimized reactivity.  │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ For enterprises currently on    │ The State of Vue.js Report 2025 │
+│          │ Vue 2 or Vue 3, what is the     │ includes a chapter on Vue 3     │
+│          │ actual cost and risk profile of │ Migration, suggesting migration │
+│          │ upgrading to future Vue         │ is still a concern for many     │
+│          │ versions vs migrating to React? │ enterprise teams.               │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How does the developer hiring   │ Sources note strong Vue         │
+│          │ market for Vue vs React differ  │ adoption in Asia-Pacific and    │
+│          │ across regions (Asia-Pacific vs │ Latin America but React         │
+│          │ North America vs Europe) for    │ dominance globally. Regional    │
+│          │ enterprise teams planning 2026  │ hiring market differences could │
+│          │ staffing?                       │ significantly impact enterprise │
+│          │                                 │ framework choices.              │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ What is the total cost of       │ Sources discuss development     │
+│          │ ownership difference between    │ cost at project level but do    │
+│          │ React+Next.js and Vue+Nuxt for  │ not model long-term TCO         │
+│          │ a 50+ person enterprise         │ including training,             │
+│          │ frontend team over a 3-year     │ maintenance, tooling, and       │
+│          │ horizon?                        │ hiring costs for large teams.   │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.81                                                                │
+│ Corroborating sources: 12                                                    │
+│ Source authority: medium                                                     │
+│ Contradiction detected: True                                                 │
+│ Query specificity match: 0.85                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 56137                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 110.41s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 7c8dd19b-174b-4850-a2f5-28917d37c0c0
--- a/docs/stress-tests/M3.3-runs/10-comparative.log
+++ b/docs/stress-tests/M3.3-runs/10-comparative.log
@ -0,0 +1,310 @@
+Researching: Compare wind and solar capacity factors in the continental United 
+States.
+
+{"question": "Compare wind and solar capacity factors in the continental United States.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:01:18.663955Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:01:19.783461Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:01:19.795497Z"}
+{"question": "Compare wind and solar capacity factors in the continental United States.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:01:19.838791Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare wind and solar capacity factors in the continental United States.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:19.839685Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:19.839976Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1104, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:29.064991Z"}
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 8211, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:38.391464Z"}
+{"step": 19, "decision": "Token budget reached before iteration 4: 23963/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:45.620609Z"}
+{"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23963, "event": "synthesis_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:45.620851Z"}
+{"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 72249, "event": "synthesis_complete", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:55.647112Z"}
+{"step": 40, "decision": "Research complete", "confidence": 0.88, "citation_count": 10, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 99.134, "event": "complete", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:55.648194Z"}
+{"confidence": 0.88, "citations": 10, "gaps": 4, "discovery_events": 4, "tokens_used": 48230, "iterations_run": 3, "wall_time_sec": 95.80813455581665, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:02:55.648284Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:02:55.648701Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:02:55.654584Z"}
+{"trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "confidence": 0.88, "citations": 10, "tokens_used": 48230, "wall_time_sec": 95.80813455581665, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:02:55.883067Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Wind and solar capacity factors in the continental United States differ      │
+│ notably, with wind generally outperforming utility-scale solar on an annual  │
+│ average basis, though both vary significantly by location and season.        │
+│                                                                              │
+│ **Wind Capacity Factors:** In 2023, the U.S. wind turbine fleet had an       │
+│ average capacity factor of 33.5%, which was an eight-year low driven by      │
+│ weaker-than-normal wind speeds (down from the 2022 all-time high of 35.9%).  │
+│ Wind capacity factors are highest in spring (March–April) and lowest in      │
+│ summer. In April 2024, wind generation hit a record 47.7 TWh, exceeding coal │
+│ generation for the second consecutive month. The NREL wind resource          │
+│ assessment identifies areas with capacity factors ≥30% (generally mean       │
+│ annual wind speeds ≥6.4 m/s) as suitable for development, with the           │
+│ highest-potential zones in the central Great Plains. The U.S. total          │
+│ installed wind capacity reached ~150,500 MW by end of 2023.                  │
+│                                                                              │
+│ **Solar (Utility-Scale PV) Capacity Factors:** The weighted average U.S.     │
+│ utility-scale solar capacity factor was 23.5% in 2023, down 0.7 percentage   │
+│ points from 24.2% in 2022. NREL's Annual Technology Baseline categorizes     │
+│ utility-scale PV capacity factors into 10 resource classes based on mean     │
+│ global horizontal irradiance (GHI); the desert Southwest achieves the        │
+│ highest factors, while northern states achieve at least ~70% of the          │
+│ Southwest's value. Solar generation is highest in summer and lowest in       │
+│ winter, opposite to wind seasonality.                                        │
+│                                                                              │
+│ **Comparison Summary:** On an annual fleet-wide average, wind capacity       │
+│ factors (~33–36%) are materially higher than utility-scale solar capacity    │
+│ factors (~23–24%). However, the two resources are complementary seasonally:  │
+│ wind peaks in spring, solar peaks in summer. Both are intermittent           │
+│ resources. In 2025, wind and solar together generated a record 17% of U.S.   │
+│ electricity (wind: 464,000 GWh; utility-scale solar: 296,000 GWh),           │
+│ reflecting wind's larger current installed base despite solar's faster       │
+│ recent capacity growth.                                                      │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Wind generation declined in   │ Last year, the average         │  0.98 │
+│     │ 2023 for the first time since │ utilization rate, or capacity  │       │
+│     │ the 1990s - EIA               │ factor, of the wind turbine    │       │
+│     │ https://www.eia.gov/todayinen │ fleet fell to an eight-year    │       │
+│     │ ergy/detail.php?id=61943      │ low of 33.5% (compared with    │       │
+│     │                               │ 35.9% in 2022, the all-time    │       │
+│     │                               │ high).                         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ US solar capacity factors     │ The weighted average US solar  │  0.95 │
+│     │ retreat in 2023, break        │ capacity factor came in at a   │       │
+│     │ multiyear streak above 24%    │ calculated 23.5% annually in   │       │
+│     │ https://www.spglobal.com/mark │ 2023, down 0.7 percentage      │       │
+│     │ et-intelligence/en/news-insig │ point from 24.2% in 2022.      │       │
+│     │ hts/research/us-solar-capacit │                                │       │
+│     │ y-factors-retreat-in-2023-bre │                                │       │
+│     │ ak-multiyear-streak-above-24p │                                │       │
+│     │ erc                           │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ U.S. wind generation hit      │ Wind generation, meanwhile,    │  0.97 │
+│     │ record in April 2024,         │ increased to a record 47.7     │       │
+│     │ exceeding coal-fired          │ TWh. However, during the first │       │
+│     │ generation - EIA              │ four months of 2024,           │       │
+│     │ https://www.eia.gov/todayinen │ coal-fired generation was 15%  │       │
+│     │ ergy/detail.php?id=62784      │ higher than wind generation in │       │
+│     │                               │ the United States. Installed   │       │
+│     │                               │ wind power generating capacity │       │
+│     │                               │ has increased substantially in │       │
+│     │                               │ the United States over the     │       │
+│     │                               │ last 25 years, growing from    │       │
+│     │                               │ 2.4 gigawatts (GW) in 2000 to  │       │
+│     │                               │ 150.1 GW in April 2024.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Land-Based Wind Market Report │ The U.S. wind industry         │  0.97 │
+│     │ 2024: Edition | Department of │ installed 6,474 megawatts (MW) │       │
+│     │ Energy                        │ of new land-based wind         │       │
+│     │ https://www.energy.gov/cmei/s │ capacity in 2023, bringing the │       │
+│     │ ystems/land-based-wind-market │ cumulative total to nearly     │       │
+│     │ -report-2024-edition          │ 150,500 MW. Also, $10.8        │       │
+│     │                               │ billion was invested in 2023   │       │
+│     │                               │ in land-based wind energy      │       │
+│     │                               │ expansion.                     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Utility-Scale PV |            │ The 2024 ATB provides the      │  0.93 │
+│     │ Electricity | 2024 | ATB |    │ average capacity factor for 10 │       │
+│     │ NREL                          │ resource categories in the     │       │
+│     │ https://atb.nrel.gov/electric │ United States, binned by mean  │       │
+│     │ ity/2024/utility-scale_pv     │ GHI. Average capacity factors  │       │
+│     │                               │ are calculated using           │       │
+│     │                               │ county-level capacity factor   │       │
+│     │                               │ averages from the Renewable    │       │
+│     │                               │ Energy Potential (reV) model   │       │
+│     │                               │ for 1998–2021.                 │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ NREL projects solar           │ In the latest update, zones    │  0.85 │
+│     │ generation and costs for 10   │ 2-8, representing all but the  │       │
+│     │ U.S. zones – pv magazine USA  │ northernmost states in the     │       │
+│     │ https://pv-magazine-usa.com/2 │ continental U.S., solar        │       │
+│     │ 021/07/22/nrel-projects-solar │ installations have a capacity  │       │
+│     │ -generation-and-costs-for-10- │ factor that is at least 70% of │       │
+│     │ u-s-zones/                    │ that in the desert Southwest's │       │
+│     │                               │ zone 1, the data show.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Wind and solar generated a    │ In 2025, wind power generated  │  0.96 │
+│     │ record 17% of U.S.            │ 464,000 GWh of electricity, 3% │       │
+│     │ electricity in 2025 - EIA     │ more than in 2024. In 2025,    │       │
+│     │ https://www.eia.gov/todayinen │ utility-scale solar power      │       │
+│     │ ergy/detail.php?id=67367      │ generation totaled 296,000     │       │
+│     │                               │ GWh, 34% more than in 2024.    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ 80 and 100 Meter Wind Energy  │ Windy land defined as areas    │  0.82 │
+│     │ Resource Potential for the    │ with >= 30% CF*, generally     │       │
+│     │ United States - NREL          │ mean annual wind speeds >= 6.4 │       │
+│     │ https://docs.nrel.gov/docs/fy │ m/s... U.S. wind potential     │       │
+│     │ 10osti/48036.pdf              │ from areas with CF*>=30% is    │       │
+│     │                               │ enormous, with almost 10,500   │       │
+│     │                               │ GW capacity at 80 m and 12,000 │       │
+│     │                               │ GW capacity at 100 m.          │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Wind power in the United      │ In 2025, 464.4 terawatt-hours  │  0.88 │
+│     │ States - Wikipedia            │ were generated by wind power,  │       │
+│     │ https://en.wikipedia.org/wiki │ or 10.48% of electricity in    │       │
+│     │ /Wind_power_in_the_United_Sta │ the United States. In March    │       │
+│     │ tes                           │ and April of 2024, electricity │       │
+│     │                               │ generation from wind exceeded  │       │
+│     │                               │ generation from coal, once the │       │
+│     │                               │ dominant source of U.S.        │       │
+│     │                               │ electricity, for an extended   │       │
+│     │                               │ period for the first time.     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ Utility-scale U.S. solar      │ In August 2024, a total of     │  0.94 │
+│     │ electricity generation        │ 107.4 gigawatts (GW) of solar  │       │
+│     │ continues to grow in 2024 -   │ electricity generating         │       │
+│     │ EIA                           │ capacity was operating in the  │       │
+│     │ https://www.eia.gov/todayinen │ Lower 48 states compared with  │       │
+│     │ ergy/detail.php?id=63324      │ 81.9 GW in August 2023... In   │       │
+│     │                               │ the final five months of 2024, │       │
+│     │                               │ we expect new U.S. solar       │       │
+│     │                               │ electricity generating         │       │
+│     │                               │ capacity will make up 63%, or  │       │
+│     │                               │ nearly two-thirds, of all new  │       │
+│     │                               │ electricity generating         │       │
+│     │                               │ capacity to come online.       │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic                       ┃ Detail                      ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ scope_exceeded   │ Offshore wind capacity      │ The evidence gathered       │
+│                  │ factors                     │ focuses on land-based wind. │
+│                  │                             │ Offshore wind typically has │
+│                  │                             │ higher capacity factors     │
+│                  │                             │ (40–50%+) than land-based   │
+│                  │                             │ wind but was not the        │
+│                  │                             │ primary focus of the        │
+│                  │                             │ sources retrieved.          │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Most recent 2024 annual     │ The 2023 annual wind        │
+│                  │ average wind capacity       │ capacity factor (33.5%) is  │
+│                  │ factor                      │ confirmed, but a final 2024 │
+│                  │                             │ annual figure was not found │
+│                  │                             │ in the sources; only        │
+│                  │                             │ monthly records for April   │
+│                  │                             │ 2024 were available.        │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Regional breakdown of wind  │ State- or region-level      │
+│                  │ vs. solar capacity factors  │ direct comparisons of wind  │
+│                  │ within the continental U.S. │ vs. solar capacity factors  │
+│                  │                             │ within the continental U.S. │
+│                  │                             │ were not available in the   │
+│                  │                             │ retrieved sources.          │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ scope_exceeded   │ Small-scale/rooftop solar   │ The 23.5% solar capacity    │
+│                  │ capacity factors            │ factor applies to           │
+│                  │                             │ utility-scale solar.        │
+│                  │                             │ Distributed/rooftop solar   │
+│                  │                             │ typically has lower         │
+│                  │                             │ capacity factors due to     │
+│                  │                             │ suboptimal orientation;     │
+│                  │                             │ this was not quantified in  │
+│                  │                             │ the retrieved evidence.     │
+└──────────────────┴─────────────────────────────┴─────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ database          │ U.S. offshore     │ Offshore wind has │
+│                  │                   │ wind capacity     │ substantially     │
+│                  │                   │ factors 2023 2024 │ higher capacity   │
+│                  │                   │ compared to       │ factors than      │
+│                  │                   │ land-based wind   │ land-based wind   │
+│                  │                   │ and solar         │ and solar, which  │
+│                  │                   │                   │ would complete    │
+│                  │                   │                   │ the renewable     │
+│                  │                   │                   │ capacity factor   │
+│                  │                   │                   │ comparison        │
+│                  │                   │                   │ picture.          │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ NREL ATB 2024     │ NREL ATB provides │
+│                  │                   │ utility-scale     │ wind capacity     │
+│                  │                   │ wind capacity     │ factors by        │
+│                  │                   │ factor by         │ resource class    │
+│                  │                   │ resource class    │ similar to solar, │
+│                  │                   │ continental US    │ enabling direct   │
+│                  │                   │                   │ apples-to-apples  │
+│                  │                   │                   │ regional          │
+│                  │                   │                   │ comparison with   │
+│                  │                   │                   │ solar CF data.    │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ seasonal wind vs  │ Wind peaks in     │
+│                  │                   │ solar capacity    │ spring, solar in  │
+│                  │                   │ factor            │ summer—understand │
+│                  │                   │ complementarity   │ ing this          │
+│                  │                   │ United States     │ complementarity   │
+│                  │                   │ grid balancing    │ is critical for   │
+│                  │                   │                   │ grid planning and │
+│                  │                   │                   │ storage           │
+│                  │                   │                   │ requirements.     │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ new_source       │ database          │ EIA Electric      │ The 2024          │
+│                  │                   │ Power Monthly     │ full-year wind    │
+│                  │                   │ 2024 annual wind  │ capacity factor   │
+│                  │                   │ capacity factor   │ would allow       │
+│                  │                   │ final             │ updated           │
+│                  │                   │                   │ comparison with   │
+│                  │                   │                   │ the 2023 solar    │
+│                  │                   │                   │ capacity factor   │
+│                  │                   │                   │ of 23.5%.         │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ How do wind and solar capacity  │ Texas led wind capacity         │
+│          │ factors compare on a regional   │ additions in 2023 (1,323 MW)    │
+│          │ basis within the continental    │ and is the second-largest       │
+│          │ U.S., particularly in states    │ utility-scale solar state (18.8 │
+│          │ like Texas and California that  │ GW). California leads solar.    │
+│          │ have significant installations  │ Regional comparisons would      │
+│          │ of both?                        │ clarify where each resource is  │
+│          │                                 │ most competitive.               │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ What is the projected           │ NREL's ATB provides             │
+│          │ trajectory of utility-scale     │ Advanced/Moderate/Conservative  │
+│          │ solar capacity factors as       │ scenarios for solar CF          │
+│          │ technology improves, and will   │ improvements through 2050, and  │
+│          │ solar eventually close the gap  │ solar capacity additions are    │
+│          │ with wind on a fleet-wide       │ now outpacing wind. The         │
+│          │ average basis?                  │ convergence timeline is         │
+│          │                                 │ unclear.                        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How did the 2023 wind           │ Wind generation fell 2.1% in    │
+│          │ generation decline (due to low  │ 2023 to an eight-year-low       │
+│          │ wind speeds) affect investment  │ capacity factor of 33.5%, while │
+│          │ decisions for new wind vs.      │ solar continued growing. This   │
+│          │ solar projects?                 │ may have influenced utility     │
+│          │                                 │ procurement decisions.          │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the capacity factor of  │ The DOE Wind Market Reports     │
+│          │ offshore wind installations in  │ cover offshore wind separately, │
+│          │ the U.S., and how does it       │ and offshore wind typically     │
+│          │ compare to both land-based wind │ achieves materially higher      │
+│          │ and utility-scale solar?        │ capacity factors than           │
+│          │                                 │ land-based wind (~40–50%), but  │
+│          │                                 │ this was not quantified in the  │
+│          │                                 │ retrieved sources.              │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ How does the Inflation          │ The IRA led to significant      │
+│          │ Reduction Act's impact on wind  │ near-term wind deployment       │
+│          │ and solar deployment affect     │ forecast increases and billions │
+│          │ future capacity factor trends,  │ in domestic supply chain        │
+│          │ given that larger, more         │ investment. Average wind        │
+│          │ efficient turbines and          │ turbine capacity grew to 3.4 MW │
+│          │ better-sited projects may       │ in 2023, up 375% since          │
+│          │ improve wind CFs?               │ 1998–1999.                      │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.88                                                                │
+│ Corroborating sources: 10                                                    │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.85                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 48230                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 95.81s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: e3fa81c3-eaff-4f76-9b50-d61e70e54540
--- a/docs/stress-tests/M3.3-runs/11-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/11-contradiction.log
@ -0,0 +1,236 @@
+Researching: Is red wine good for cardiovascular health?
+
+{"question": "Is red wine good for cardiovascular health?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:02:56.517038Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:02:57.298051Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:02:57.308234Z"}
+{"question": "Is red wine good for cardiovascular health?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:02:57.343434Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Is red wine good for cardiovascular health?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:57.343753Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:57.343847Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1097, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:09.450890Z"}
+{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8466, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:15.393838Z"}
+{"step": 19, "decision": "Token budget reached before iteration 4: 22139/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:24.405453Z"}
+{"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 19, "iterations_run": 3, "tokens_used": 22139, "event": "synthesis_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:24.405621Z"}
+{"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 50486, "event": "synthesis_complete", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:13.808158Z"}
+{"step": 37, "decision": "Research complete", "confidence": 0.72, "citation_count": 9, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 78.676, "event": "complete", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:13.808851Z"}
+{"confidence": 0.72, "citations": 9, "gaps": 3, "discovery_events": 3, "tokens_used": 42350, "iterations_run": 3, "wall_time_sec": 76.46466898918152, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:04:13.808934Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:04:13.809517Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:04:13.813434Z"}
+{"trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "confidence": 0.72, "citations": 9, "tokens_used": 42350, "wall_time_sec": 76.46466898918152, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:04:14.104351Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The relationship between red wine and cardiovascular health is nuanced and   │
+│ contested. Historically, observational studies found that moderate drinkers  │
+│ (at least one drink per day) were 30–40% less likely to die from             │
+│ cardiovascular disease compared to non-drinkers, a pattern sometimes called  │
+│ the 'J-shaped mortality curve' [NYT/AHA]. Red wine specifically contains     │
+│ polyphenols (including flavonoids and resveratrol) that may inhibit LDL      │
+│ oxidation, prevent endothelial dysfunction, raise HDL cholesterol, and       │
+│ decrease fibrinogen concentrations [Circulation Research; PMC6804046].       │
+│ However, no study has established a direct cause-and-effect link between red │
+│ wine consumption and improved heart health [AHA]. More recent analyses       │
+│ suggest the apparent benefit may reflect confounding factors—moderate        │
+│ drinkers may have healthier lifestyles overall—and methodological flaws such │
+│ as including former drinkers (who quit due to illness) in the abstainer      │
+│ group [NYT; Three Spirit]. The 'French Paradox,' which popularized the red   │
+│ wine-heart health hypothesis, is now being critically re-examined as a       │
+│ public health myth [ResearchGate]. Major health organizations, including the │
+│ American Heart Association, do not recommend starting to drink red wine for  │
+│ heart benefit, and current evidence does not support a causal protective     │
+│ effect of alcohol on the heart.                                              │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ How Red Wine Lost Its Health  │ Researchers found that those   │  0.85 │
+│     │ Halo - The New York Times     │ who reported having at least   │       │
+│     │ https://www.nytimes.com/2024/ │ one alcoholic drink per day    │       │
+│     │ 02/17/well/eat/red-wine-heart │ were 30 to 40 percent less     │       │
+│     │ -health.html                  │ likely to die from             │       │
+│     │                               │ cardiovascular disease.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Drinking red wine for heart   │ No research has established a  │  0.92 │
+│     │ health? Read this before you  │ cause-and-effect link between  │       │
+│     │ toast | American Heart        │ drinking alcohol and better    │       │
+│     │ Association                   │ heart health. Rather, studies  │       │
+│     │ https://www.heart.org/en/news │ have found an association      │       │
+│     │ /2019/05/24/drinking-red-wine │ between wine and such benefits │       │
+│     │ -for-heart-health-read-this-b │ as a lower risk of dying from  │       │
+│     │ efore-you-toast               │ heart disease.                 │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Red Wine and Cardiovascular   │ The alcoholic component is     │  0.90 │
+│     │ Health | Circulation Research │ known to increase high-density │       │
+│     │ https://www.ahajournals.org/d │ lipoprotein cholesterol and to │       │
+│     │ oi/10.1161/CIRCRESAHA.112.278 │ decrease fibrinogen            │       │
+│     │ 705?doi=10.1161/CIRCRESAHA.11 │ concentrations. The            │       │
+│     │ 2.278705                      │ polyphenols present in red     │       │
+│     │                               │ wine                           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Wine and Cardiovascular       │ Flavonoids from red wine have  │  0.88 │
+│     │ Health | Circulation          │ been credited to inhibit       │       │
+│     │ https://www.ahajournals.org/d │ low-density lipoprotein (LDL)  │       │
+│     │ oi/10.1161/circulationaha.117 │ oxidation and prevent          │       │
+│     │ .030387                       │ endothelial dysfunction, which │       │
+│     │                               │ is                             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Red Wine Consumption and      │ Red Wine Consumption and       │  0.85 │
+│     │ Cardiovascular Health - PMC   │ Cardiovascular Health Luigi    │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ Castaldo ... Department of     │       │
+│     │ articles/PMC6804046/          │ Pharmacy, Faculty of Pharmacy, │       │
+│     │                               │ University of Naples "Federico │       │
+│     │                               │ II" ... Molecules. 2019 Oct    │       │
+│     │                               │ 8;24(19):3626. doi:            │       │
+│     │                               │ 10.3390/molecules24193626      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Association between Wine      │ Association between Wine       │  0.87 │
+│     │ Consumption with              │ Consumption with               │       │
+│     │ Cardiovascular Disease and    │ Cardiovascular Disease and     │       │
+│     │ Cardiovascular Mortality: A   │ Cardiovascular Mortality: A    │       │
+│     │ Systematic Review and         │ Systematic Review and          │       │
+│     │ Meta-Analysis - PMC           │ Meta-Analysis ... Nutrients.   │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ 2023 Jun 17;15(12):2785. doi:  │       │
+│     │ articles/PMC10303697/         │ 10.3390/nu15122785             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Red wine and resveratrol:     │ Is red wine heart healthy?     │  0.88 │
+│     │ Good for your heart? - Mayo   │ Antioxidants in red wine       │       │
+│     │ Clinic                        │ called polyphenols may help    │       │
+│     │ https://www.mayoclinic.org/di │ protect the lining of blood    │       │
+│     │ seases-conditions/heart-disea │ vessels in the heart. ·        │       │
+│     │ se/in-depth/red-wine/art-2004 │ Resveratrol in red wine.       │       │
+│     │ 8281                          │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Debunking the 'wine is        │ In the early nineties, a TV    │  0.65 │
+│     │ healthy' myth – Three Spirit  │ show in the US reported lower  │       │
+│     │ US                            │ heart attack rates in          │       │
+│     │ https://us.threespiritdrinks. │ France... The report framed    │       │
+│     │ com/blogs/blog/where-the-wine │ the country's regular          │       │
+│     │ -is-healthy-myth-came-from    │ consumption of alcohol, in     │       │
+│     │                               │ particular red wine, as the    │       │
+│     │                               │ reason behind this, claiming   │       │
+│     │                               │ that it reduced that risk of   │       │
+│     │                               │ heart disease.                 │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Revisiting the French         │ The "French Paradox," the      │  0.78 │
+│     │ Paradox: Deconstructing a     │ hypothesis that moderate red   │       │
+│     │ Public Health Myth and its    │ wine consumption explains      │       │
+│     │ Global Commercial Legacy      │ France's historically low      │       │
+│     │ https://www.researchgate.net/ │ coronary heart disease rates   │       │
+│     │ publication/399257280_Title_R │                                │       │
+│     │ evisiting_the_French_Paradox_ │                                │       │
+│     │ Deconstructing_a_Public_Healt │                                │       │
+│     │ h_Myth_and_its_Global_Commerc │                                │       │
+│     │ ial_Legacy                    │                                │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Randomized controlled    │ Most evidence is          │
+│                       │ trial evidence on red    │ observational. Robust RCT │
+│                       │ wine and cardiovascular  │ data directly testing red │
+│                       │ outcomes                 │ wine's causal             │
+│                       │                          │ cardiovascular effect in  │
+│                       │                          │ humans is lacking and not │
+│                       │                          │ surfaced in available     │
+│                       │                          │ sources.                  │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Differential effect of   │ Some sources attribute    │
+│                       │ red wine vs. other       │ benefits to polyphenols   │
+│                       │ alcohol types on         │ specific to red wine,     │
+│                       │ cardiovascular health    │ while others suggest the  │
+│                       │                          │ effect is due to alcohol  │
+│                       │                          │ in general, making it     │
+│                       │                          │ unclear whether red wine  │
+│                       │                          │ is uniquely beneficial.   │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ access_denied         │ Full text of 2023        │ The PMC10303697           │
+│                       │ meta-analysis findings   │ meta-analysis page header │
+│                       │                          │ was retrieved but full    │
+│                       │                          │ results/conclusions were  │
+│                       │                          │ not available in the      │
+│                       │                          │ scraped content.          │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ contradiction    │ database          │ randomized        │ Observational     │
+│                  │                   │ controlled trial  │ studies suggest   │
+│                  │                   │ red wine          │ benefit, but no   │
+│                  │                   │ polyphenols       │ causal link       │
+│                  │                   │ cardiovascular    │ established; RCT  │
+│                  │                   │ outcomes          │ evidence needed   │
+│                  │                   │                   │ to resolve        │
+│                  │                   │                   │ contradiction.    │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ resveratrol       │ Resveratrol is    │
+│                  │                   │ bioavailability   │ cited as a key    │
+│                  │                   │ cardiovascular    │ mechanism but its │
+│                  │                   │ human clinical    │ bioavailability   │
+│                  │                   │ trials 2022 2023  │ from wine in      │
+│                  │                   │ 2024              │ clinically        │
+│                  │                   │                   │ meaningful doses  │
+│                  │                   │                   │ is debated.       │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ sick quitter bias │ The J-shaped      │
+│                  │                   │ abstainer         │ curve may be an   │
+│                  │                   │ misclassification │ artifact of       │
+│                  │                   │ alcohol           │ methodological    │
+│                  │                   │ cardiovascular    │ flaws (sick       │
+│                  │                   │ epidemiology      │ quitters included │
+│                  │                   │                   │ in abstainer      │
+│                  │                   │                   │ group), which     │
+│                  │                   │                   │ undermines        │
+│                  │                   │                   │ earlier           │
+│                  │                   │                   │ protective        │
+│                  │                   │                   │ findings.         │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Does the apparent               │ Observational J-curve studies   │
+│          │ cardiovascular benefit of       │ may misclassify former drinkers │
+│          │ moderate red wine consumption   │ who quit due to illness as      │
+│          │ disappear when sick quitters    │ non-drinkers, inflating the     │
+│          │ are properly excluded from the  │ apparent benefit of moderate    │
+│          │ abstainer comparison group?     │ drinking.                       │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Is the cardiovascular effect of │ Circulation Research notes both │
+│          │ red wine attributable to        │ the alcohol component and       │
+│          │ polyphenols (resveratrol,       │ polyphenols independently       │
+│          │ flavonoids) or simply to the    │ affect cardiovascular markers,  │
+│          │ alcohol content?                │ but their relative contribution │
+│          │                                 │ is unclear.                     │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What do the most recent         │ The 2023 PMC meta-analysis was  │
+│          │ meta-analyses (2022–2024)       │ identified but its full         │
+│          │ conclude about wine consumption │ conclusions were not accessible │
+│          │ and cardiovascular mortality    │ in the retrieved content.       │
+│          │ after correcting for            │                                 │
+│          │ confounders?                    │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Are there subpopulations (e.g., │ Current guidance is             │
+│          │ by age, sex, genetic profile)   │ population-level; individual    │
+│          │ for whom moderate red wine      │ variation in alcohol metabolism │
+│          │ consumption might confer        │ and cardiovascular risk         │
+│          │ measurable cardiovascular       │ profiles may produce different  │
+│          │ benefit?                        │ outcomes.                       │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.72                                                                │
+│ Corroborating sources: 7                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: True                                                 │
+│ Query specificity match: 0.85                                                │
+│ Budget status: spent                                                         │
+│ Recency: recent                                                              │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 42350                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 76.46s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 96acce3c-853d-40b7-ba02-c721ac59f85d
--- a/docs/stress-tests/M3.3-runs/12-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/12-contradiction.log
@ -0,0 +1,330 @@
+Researching: Does intermittent fasting extend lifespan in humans?
+
+{"question": "Does intermittent fasting extend lifespan in humans?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:04:14.725578Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:04:15.543876Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:04:15.553451Z"}
+{"question": "Does intermittent fasting extend lifespan in humans?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:04:15.587475Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does intermittent fasting extend lifespan in humans?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:15.587815Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:15.587912Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1148, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:22.802797Z"}
+{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8443, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:26.505496Z"}
+{"step": 21, "decision": "Starting iteration 4/5", "tokens_so_far": 18167, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:43.089460Z"}
+{"step": 26, "decision": "Token budget reached before iteration 5: 36705/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:47.193645Z"}
+{"step": 27, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 26, "iterations_run": 4, "tokens_used": 36705, "event": "synthesis_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:47.193894Z"}
+{"step": 28, "decision": "Parsed synthesis JSON successfully", "duration_ms": 76890, "event": "synthesis_complete", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:00.759366Z"}
+{"step": 48, "decision": "Research complete", "confidence": 0.72, "citation_count": 11, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 109.604, "event": "complete", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:00.760365Z"}
+{"confidence": 0.72, "citations": 11, "gaps": 4, "discovery_events": 4, "tokens_used": 62781, "iterations_run": 4, "wall_time_sec": 105.17169857025146, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:06:00.760468Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:06:00.760848Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:06:00.765020Z"}
+{"trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "confidence": 0.72, "citations": 11, "tokens_used": 62781, "wall_time_sec": 105.17169857025146, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:06:00.989582Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Current scientific evidence does NOT conclusively demonstrate that           │
+│ intermittent fasting (IF) extends lifespan in humans. While IF has proven    │
+│ lifespan-extending effects in animal models (particularly rodents), and      │
+│ improves multiple healthspan markers in humans—including weight, insulin     │
+│ resistance, inflammation, dyslipidemia, hypertension, oxidative stress, and  │
+│ autophagy—direct evidence of increased human lifespan from IF is lacking.    │
+│ Mechanistically, IF triggers 'adaptive stress' in cells, activating          │
+│ antioxidant production, DNA repair, autophagy (via spermidine-mediated       │
+│ pathways), and reduced inflammation, all of which are theoretically linked   │
+│ to longevity [InsideTracker, FORTH/Nature Cell Biology]. A 2024 review in    │
+│ Ageing Research Reviews concluded IF 'can be considered a                    │
+│ non-pharmacological strategy to extend lifespan' and has been 'proven to     │
+│ extend lifespan in rodent models,' but human translation remains unconfirmed │
+│ [ScienceDirect/PubMed]. A scoping review of RCTs found IF improves           │
+│ aging-related biomarkers in adults but stopped short of claiming lifespan    │
+│ extension [PMC]. A 2024 Nature study on genetically diverse mice showed      │
+│ dietary restriction (including IF) extends healthy lifespan in mice but its  │
+│ human relevance is unclear. Critically, a major 2024 AHA-presented           │
+│ observational study of 20,000+ U.S. adults found that eating within an       │
+│ 8-hour window was associated with a 91% higher risk of cardiovascular death  │
+│ compared to eating across 12–16 hours—though this study has been heavily     │
+│ criticized for methodological limitations including confounding variables    │
+│ (demographics, pre-existing disease) and reliance on only two days of        │
+│ dietary recall data [AHA, WebMD, Forbes]. In summary, IF improves several    │
+│ biomarkers associated with healthy aging in humans, and extends lifespan in  │
+│ animals, but no long-term human RCT has demonstrated actual lifespan         │
+│ extension, and some observational data raise cardiovascular safety concerns. │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Intermittent fasting and      │ IF can be considered as a      │  0.95 │
+│     │ longevity: From animal models │ non-pharmacological strategy   │       │
+│     │ to implication for humans -   │ to extend lifespan. IF         │       │
+│     │ ScienceDirect                 │ improves physiological         │       │
+│     │ https://www.sciencedirect.com │ function, enhances             │       │
+│     │ /science/article/abs/pii/S156 │ performance, and slows aging.  │       │
+│     │ 8163724000928                 │ IF was proven to extend        │       │
+│     │                               │ lifespan in rodent models.     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Intermittent fasting and      │ Findings to date from both     │  0.95 │
+│     │ longevity: From animal models │ human and animal experiments   │       │
+│     │ to implication for humans -   │ indicate that fasting improves │       │
+│     │ PubMed                        │ physiological function,        │       │
+│     │ https://pubmed.ncbi.nlm.nih.g │ enhances performance, and      │       │
+│     │ ov/38499159/                  │ slows aging and disease        │       │
+│     │                               │ processes. Metabolic and       │       │
+│     │                               │ cellular responses triggered   │       │
+│     │                               │ by IF could help to achieve    │       │
+│     │                               │ the aim of preventing disease, │       │
+│     │                               │ and maximizing healthspan and  │       │
+│     │                               │ longevity with minimal side    │       │
+│     │                               │ effects.                       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ How Intermittent Fasting      │ In humans, intermittent        │  0.88 │
+│     │ Impacts Longevity: A Summary  │ fasting improves weight,       │       │
+│     │ of the Research -             │ insulin resistance,            │       │
+│     │ InsideTracker                 │ inflammation, dyslipidemia,    │       │
+│     │ https://www.insidetracker.com │ and hypertension. IF has also  │       │
+│     │ /a/articles/how-intermittent- │ reduced tumor growth, boosted  │       │
+│     │ fasting-impacts-longevity     │ stem cell production, and      │       │
+│     │                               │ increased lifespan in mice.    │       │
+│     │                               │ During fasting, cells undergo  │       │
+│     │                               │ adaptive stress, which         │       │
+│     │                               │ activates different pathways   │       │
+│     │                               │ in the body, resulting in a    │       │
+│     │                               │ range of effects, including    │       │
+│     │                               │ increased production of        │       │
+│     │                               │ antioxidants, DNA repair,      │       │
+│     │                               │ autophagy.                     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Effects of Intermittent       │ In humans,                     │  0.97 │
+│     │ Fasting on Health, Aging, and │ intermittent-fasting           │       │
+│     │ Disease - NEJM                │ interventions ameliorate       │       │
+│     │ https://www.nejm.org/doi/full │ obesity, insulin resistance,   │       │
+│     │ /10.1056/NEJMra1905136        │ dyslipidemia, hypertension,    │       │
+│     │                               │ and inflammation.              │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Impact of Intermittent        │ Impact of Intermittent Fasting │  0.90 │
+│     │ Fasting and/or Caloric        │ and/or Caloric Restriction on  │       │
+│     │ Restriction on Aging-Related  │ Aging-Related Outcomes in      │       │
+│     │ Outcomes in Adults: A Scoping │ Adults: A Scoping Review of    │       │
+│     │ Review of Randomized          │ Randomized Controlled Trials.  │       │
+│     │ Controlled Trials - PMC       │ Nutrients. 2024 Jan            │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ 20;16(2):316. doi:             │       │
+│     │ articles/PMC10820472/         │ 10.3390/nu16020316             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ International scientific      │ intermittent fasting increases │  0.90 │
+│     │ collaboration reveals how     │ the levels of spermidine, a    │       │
+│     │ intermittent fasting          │ chemical compound (natural     │       │
+│     │ regulates ageing through      │ polyamine), that enhances the  │       │
+│     │ autophagy | FORTH             │ resilience and survival of     │       │
+│     │ https://forth.gr/en/news/show │ cells and organisms, through   │       │
+│     │ /&tid=2606                    │ the activation of autophagy.   │       │
+│     │                               │ Autophagy defects have been    │       │
+│     │                               │ linked to ageing, as well as,  │       │
+│     │                               │ with the emergence of          │       │
+│     │                               │ age-related disorders.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Dietary restriction impacts   │ Caloric restriction extends    │  0.92 │
+│     │ health and lifespan of        │ healthy lifespan in multiple   │       │
+│     │ genetically diverse mice |    │ species. Intermittent fasting, │       │
+│     │ Nature                        │ an alternative form of dietary │       │
+│     │ https://www.nature.com/articl │ restriction, is potentially    │       │
+│     │ es/s41586-024-08026-3         │ more sustainable in humans,    │       │
+│     │                               │ but its effectiveness remains  │       │
+│     │                               │ largely unexplored.            │       │
+│     │                               │ Identifying the most           │       │
+│     │                               │ efficacious forms of dietary   │       │
+│     │                               │ restriction is key for         │       │
+│     │                               │ developing interventions to    │       │
+│     │                               │ improve human health and       │       │
+│     │                               │ longevity.                     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Time-restricted eating may    │ A popular weight loss strategy │  0.85 │
+│     │ raise cardiovascular death    │ that limits the hours during   │       │
+│     │ risk in the long term |       │ which calories can be consumed │       │
+│     │ American Heart Association    │ may nearly double a person's   │       │
+│     │ https://www.heart.org/en/news │ long-term risk of dying from   │       │
+│     │ /2024/03/18/time-restricted-e │ cardiovascular disease, new    │       │
+│     │ ating-may-raise-cardiovascula │ research finds, especially     │       │
+│     │ r-death-risk-in-the-long-term │ among people with underlying   │       │
+│     │                               │ cardiovascular disease or      │       │
+│     │                               │ cancer.                        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Fasting Study Under Fire      │ Those conclusions are          │  0.87 │
+│     │ After Heart Conference -      │ premature and misleading, says │       │
+│     │ WebMD                         │ Christopher Gardner, PhD, a    │       │
+│     │ https://www.webmd.com/heart-d │ professor of medicine at       │       │
+│     │ isease/features/is-intermitte │ Stanford University... people  │       │
+│     │ nt-fasting-bad-for-heart-heal │ in the study group who         │       │
+│     │ th                            │ consumed all their food in a   │       │
+│     │                               │ daily window of 8 hours or     │       │
+│     │                               │ fewer had a higher percentage  │       │
+│     │                               │ of men, African Americans, and │       │
+│     │                               │ smoke.                         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ Intermittent Fasting - The    │ intermittent fasting activated │  0.78 │
+│     │ Impact on Autophagy,          │ autophagy, a cellular process  │       │
+│     │ Inflammasome, and Senescence  │ that breaks down components    │       │
+│     │ https://nomix.ai/2024/05/24/f │ within cells. Autophagy has    │       │
+│     │ asting-in-young-males-examini │ been linked to longevity...    │       │
+│     │ ng-the-impact-on-autophagy-in │ p21 levels decreased during    │       │
+│     │ flammasome-and-senescence-bio │ and after fasting. The         │       │
+│     │ markers/                      │ findings suggest that fasting  │       │
+│     │                               │ may contribute to delaying the │       │
+│     │                               │ onset of age-related diseases. │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 11  │ Effect of fasting-mimicking   │ Significant between-group      │  0.82 │
+│     │ diet on markers of autophagy  │ differences were observed in   │       │
+│     │ and metabolic health in human │ changes from baseline to the   │       │
+│     │ subjects | GeroScience        │ end of the 6-day dietary       │       │
+│     │ https://link.springer.com/art │ intervention for body weight,  │       │
+│     │ icle/10.1007/s11357-025-02035 │ fasting glucose, BHB, HOMA-IR, │       │
+│     │ -4                            │ and autophagic flux (p <       │       │
+│     │                               │ 0.05)... These results suggest │       │
+│     │                               │ that FMD may improve           │       │
+│     │                               │ autophagic flux and markers of │       │
+│     │                               │ metabolic health.              │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Long-term human RCT data │ No randomized controlled  │
+│                       │ on IF and all-cause      │ trial has followed human  │
+│                       │ mortality or lifespan    │ participants long enough  │
+│                       │                          │ to measure actual         │
+│                       │                          │ lifespan extension from   │
+│                       │                          │ IF. All human longevity   │
+│                       │                          │ evidence is based on      │
+│                       │                          │ biomarker surrogates or   │
+│                       │                          │ observational data.       │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Optimal IF protocol for  │ Studies test different    │
+│                       │ longevity in humans      │ protocols (TRF, ADF, 5:2, │
+│                       │                          │ FMD) with varying         │
+│                       │                          │ durations and             │
+│                       │                          │ populations, making it    │
+│                       │                          │ impossible to identify a  │
+│                       │                          │ single optimal regimen    │
+│                       │                          │ for human longevity.      │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Cardiovascular safety of │ Short-term studies show   │
+│                       │ long-term IF             │ cardiovascular benefit    │
+│                       │                          │ (improved BP, glucose,    │
+│                       │                          │ cholesterol), but the     │
+│                       │                          │ 2024 AHA observational    │
+│                       │                          │ study suggests possible   │
+│                       │                          │ long-term cardiovascular  │
+│                       │                          │ mortality risk, with      │
+│                       │                          │ experts disputing         │
+│                       │                          │ methodology.              │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ source_not_found      │ IF effects across        │ Most human studies focus  │
+│                       │ diverse demographic      │ on limited populations    │
+│                       │ groups                   │ (e.g., young males,       │
+│                       │                          │ specific ethnic groups),  │
+│                       │                          │ limiting generalizability │
+│                       │                          │ of longevity findings.    │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ contradiction    │ database          │ time-restricted   │ The AHA 2024      │
+│                  │                   │ eating            │ study claiming    │
+│                  │                   │ cardiovascular    │ 91% higher        │
+│                  │                   │ mortality NHANES  │ cardiovascular    │
+│                  │                   │ confounding       │ death risk        │
+│                  │                   │ variables         │ contradicts       │
+│                  │                   │ methodology       │ short-term        │
+│                  │                   │ critique 2024     │ studies showing   │
+│                  │                   │                   │ CV benefit;       │
+│                  │                   │                   │ deeper            │
+│                  │                   │                   │ methodological    │
+│                  │                   │                   │ analysis is       │
+│                  │                   │                   │ warranted.        │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ spermidine        │ The FORTH/Nature  │
+│                  │                   │ autophagy         │ Cell Biology      │
+│                  │                   │ intermittent      │ finding on        │
+│                  │                   │ fasting lifespan  │ spermidine-mediat │
+│                  │                   │ human clinical    │ ed autophagy is a │
+│                  │                   │ trial 2024        │ novel mechanism   │
+│                  │                   │                   │ that may be       │
+│                  │                   │                   │ testable in human │
+│                  │                   │                   │ longevity trials. │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ fasting mimicking │ A large           │
+│                  │                   │ diet longevity    │ registered RCT    │
+│                  │                   │ diet RCT          │ (NCT05698654) on  │
+│                  │                   │ NCT05698654       │ fasting-mimicking │
+│                  │                   │ results           │ and longevity     │
+│                  │                   │                   │ diet is underway; │
+│                  │                   │                   │ results could be  │
+│                  │                   │                   │ transformative    │
+│                  │                   │                   │ for the question  │
+│                  │                   │                   │ of human lifespan │
+│                  │                   │                   │ extension.        │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ telomere length   │ The Frontiers in  │
+│                  │                   │ intermittent      │ Aging study on    │
+│                  │                   │ fasting exercise  │ metabolic         │
+│                  │                   │ metabolomics      │ signatures of     │
+│                  │                   │ aging biomarkers  │ combined exercise │
+│                  │                   │ 2024              │ and fasting links │
+│                  │                   │                   │ to telomere       │
+│                  │                   │                   │ length, a key     │
+│                  │                   │                   │ aging biomarker   │
+│                  │                   │                   │ worth             │
+│                  │                   │                   │ investigating     │
+│                  │                   │                   │ further.          │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Will ongoing large-scale RCTs   │ No current RCT has followed     │
+│          │ (e.g., NCT05698654) provide     │ participants long enough to     │
+│          │ definitive evidence that IF     │ measure actual lifespan; only   │
+│          │ extends human lifespan or       │ biomarker surrogates have been  │
+│          │ healthspan?                     │ studied.                        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Does the cardiovascular         │ Experts including Stanford's    │
+│          │ mortality risk signal from the  │ Christopher Gardner criticized  │
+│          │ 2024 AHA observational study    │ the study for not controlling   │
+│          │ hold up after controlling for   │ for demographics, pre-existing  │
+│          │ confounders like pre-existing   │ disease, and reason for         │
+│          │ illness and dietary quality?    │ adopting IF.                    │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Can spermidine supplementation  │ FORTH research showed IF raises │
+│          │ replicate the                   │ spermidine, which activates     │
+│          │ autophagy-activating,           │ autophagy and promotes cell     │
+│          │ anti-aging effects of IF in     │ survival, suggesting            │
+│          │ humans who cannot sustain       │ supplementation as a potential  │
+│          │ fasting?                        │ proxy.                          │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Which IF protocol (TRF, ADF,    │ Multiple protocols are studied  │
+│          │ 5:2, or FMD) produces the       │ with heterogeneous populations, │
+│          │ greatest longevity-associated   │ making comparative              │
+│          │ biomarker improvements in       │ effectiveness unclear.          │
+│          │ diverse human populations?      │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ Does the 92-year-old case study │ SAGE Journals reported this as  │
+│          │ of repeated 3-week annual       │ the world's longest medically   │
+│          │ fasting over 45 years offer any │ documented repeated fasting     │
+│          │ generalizable insight into      │ history; clinical parameters    │
+│          │ long-term IF and human          │ showed cyclic variation.        │
+│          │ longevity?                      │                                 │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.72                                                                │
+│ Corroborating sources: 9                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: True                                                 │
+│ Query specificity match: 0.85                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 62781                                                                │
+│ Iterations: 4                                                                │
+│ Wall time: 105.17s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3
--- a/docs/stress-tests/M3.3-runs/13-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/13-contradiction.log
@ -0,0 +1,260 @@
+Researching: Are nuclear power plants safe?
+
+{"question": "Are nuclear power plants safe?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:06:01.606512Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:06:02.435399Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:06:02.443368Z"}
+{"question": "Are nuclear power plants safe?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:06:02.477384Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Are nuclear power plants safe?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:02.477723Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:02.477819Z"}
+{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1169, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:15.136739Z"}
+{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 11760, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:25.196255Z"}
+{"step": 23, "decision": "Token budget reached before iteration 4: 29534/20000", "event": "budget_exhausted", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:35.263571Z"}
+{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 33, "iterations_run": 3, "tokens_used": 29534, "event": "synthesis_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:35.263885Z"}
+{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 58649, "event": "synthesis_complete", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:31.700545Z"}
+{"step": 40, "decision": "Research complete", "confidence": 0.92, "citation_count": 8, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 92.558, "event": "complete", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:31.701336Z"}
+{"confidence": 0.92, "citations": 8, "gaps": 3, "discovery_events": 3, "tokens_used": 63429, "iterations_run": 3, "wall_time_sec": 89.22308659553528, "budget_exhausted": true, "event": "research_completed", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:07:31.701429Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:07:31.701781Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:07:31.705585Z"}
+{"trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "confidence": 0.92, "citations": 8, "tokens_used": 63429, "wall_time_sec": 89.22308659553528, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:07:32.018740Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Yes, nuclear power plants are among the safest sources of electricity        │
+│ generation when measured by deaths per unit of energy produced. According to │
+│ Statista (sourcing 2018 data), nuclear energy results in approximately 0.03  │
+│ deaths per terawatt-hour (TWh), making it safer than wind (0.04), solar      │
+│ (0.02 is slightly lower), natural gas (2.82), biomass (4.63), hydro (1.3),   │
+│ oil (18.43), coal (24.62), and brown coal (32.72). A separate dataset from   │
+│ ResearchGate reports 0.04 deaths per billion kWh for nuclear, compared to    │
+│ 100 for coal. Despite three major accidents—Three Mile Island (1979),        │
+│ Chernobyl (1986), and Fukushima (2011)—the overall fatality record remains   │
+│ exceptionally low. At Chernobyl, the worst nuclear accident in history, 2    │
+│ workers died in the initial explosion, 28 of 134 acute radiation syndrome    │
+│ patients later died, and roughly 5,000 thyroid cancer cases were             │
+│ attributable to radiation exposure among those under 18 at the time          │
+│ (Canadian Nuclear Safety Commission). Stanford researchers estimated         │
+│ Fukushima may cause approximately 130 deaths and 180 cancer cases globally,  │
+│ in addition to ~600 evacuation-related deaths. Three Mile Island caused no   │
+│ direct radiation deaths. U.S. nuclear plants operate under strict NRC        │
+│ oversight using a 'defense-in-depth' multi-layer safety approach (U.S.       │
+│ Department of Energy). The IAEA also sets international design and safety    │
+│ standards. Public perception of nuclear risk is widely considered            │
+│ disproportionate to the statistical evidence.                                │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Global deaths per energy      │ Brown coal 32.72 | Coal 24.62  │  0.97 │
+│     │ source | Statista             │ | Oil 18.43 | Biomass 4.63 |   │       │
+│     │ https://www.statista.com/stat │ Natural gas 2.82 | Hydro 1.3 | │       │
+│     │ istics/494425/death-rate-worl │ Wind 0.04 | Nuclear 0.03 |     │       │
+│     │ dwide-by-energy-source/       │ Solar 0.02. Death rates are    │       │
+│     │                               │ measured based on deaths from  │       │
+│     │                               │ accidents and air pollution    │       │
+│     │                               │ per terawatt-hour (TWh) of     │       │
+│     │                               │ electricity.                   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ rates for each energy source  │ 100 for coal, 36 for oil, 24   │  0.91 │
+│     │ in deaths per billion kWh     │ for biofuel/biomass, 4 for     │       │
+│     │ produced... | ResearchGate    │ natural gas, 1.4 for hydro,    │       │
+│     │ https://www.researchgate.net/ │ 0.44 for solar, 0.15 for wind  │       │
+│     │ figure/rates-for-each-energy- │ and 0.04 for nuclear.          │       │
+│     │ source-in-deaths-per-billion- │                                │       │
+│     │ kWh-produced-Source-Updated_t │                                │       │
+│     │ bl2_272406182                 │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Health effects of the         │ The initial steam explosion at │  0.97 │
+│     │ Chornobyl accident | Canadian │ the Chornobyl nuclear plant    │       │
+│     │ Nuclear Safety Commission     │ resulted in the deaths of 2    │       │
+│     │ https://www.cnsc-ccsn.gc.ca/e │ workers, and 134 plant staff   │       │
+│     │ ng/resources/health/health-ef │ and emergency workers suffered │       │
+│     │ fects-chornobyl-accident/     │ acute radiation syndrome due   │       │
+│     │                               │ to high doses of radiation. Of │       │
+│     │                               │ these 134 people, 28 later     │       │
+│     │                               │ died. About 5,000 thyroid      │       │
+│     │                               │ cancer cases were due to       │       │
+│     │                               │ radioactive iodine             │       │
+│     │                               │ (iodine-131) exposure to       │       │
+│     │                               │ children or adolescents at the │       │
+│     │                               │ time of the accident.          │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Stanford researchers          │ Radiation from Japan's         │  0.93 │
+│     │ calculate global health       │ Fukushima Daiichi nuclear      │       │
+│     │ impacts of the Fukushima      │ disaster may eventually cause  │       │
+│     │ nuclear disaster | Stanford   │ approximately 130 deaths and   │       │
+│     │ University                    │ 180 cases of cancer, mostly in │       │
+│     │ https://engineering.stanford. │ Japan, Stanford researchers    │       │
+│     │ edu/news/stanford-researchers │ have calculated. The numbers   │       │
+│     │ -calculate-global-health-impa │ are in addition to the roughly │       │
+│     │ cts-fukushima-nuclear-disaste │ 600 deaths caused by the       │       │
+│     │ r                             │ evacuation of the area         │       │
+│     │                               │ surrounding the nuclear plant  │       │
+│     │                               │ directly after the March 2011  │       │
+│     │                               │ earthquake, tsunami and        │       │
+│     │                               │ meltdown.                      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Enhanced Safety of Advanced   │ U.S. nuclear power plants are  │  0.96 │
+│     │ Reactors | U.S. Department of │ already among the safest and   │       │
+│     │ Energy                        │ most secure industrial         │       │
+│     │ https://www.energy.gov/ne/enh │ facilities in the world due to │       │
+│     │ anced-safety-advanced-reactor │ the industry's commitment to   │       │
+│     │ s                             │ comprehensive safety           │       │
+│     │                               │ procedures, robust training    │       │
+│     │                               │ programs and stringent federal │       │
+│     │                               │ regulation that keep nuclear   │       │
+│     │                               │ plants and neighboring         │       │
+│     │                               │ communities safe.              │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Three Mile Island, Chernobyl  │ Estimates on nuclear's overall │  0.88 │
+│     │ and Fukushima accidents haunt │ mortality rate are comparable  │       │
+│     │ nuclear's past | MinnPost     │ to solar or wind power (and    │       │
+│     │ https://www.minnpost.com/othe │ roughly 2.5% that of hydro     │       │
+│     │ r-nonprofit-media/2023/10/thr │ power). Oil and coal,          │       │
+│     │ ee-mile-island-chernobyl-and- │ meanwhile, are as much as 800  │       │
+│     │ fukushima-accidents-haunt-nuc │ times higher.                  │       │
+│     │ lears-past-will-they-dictate- │                                │       │
+│     │ its-future/                   │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Devastating Consequences of   │ The Chernobyl disaster, which  │  0.85 │
+│     │ Nuclear Accidents: Chernobyl, │ occurred on April 26, 1986,    │       │
+│     │ Fukushima and Three Mile      │ was the most significant       │       │
+│     │ Island | SciTechnol           │ nuclear accident in history.   │       │
+│     │ https://www.scitechnol.com/pe │ The explosion and fire at the  │       │
+│     │ er-review/devastating-consequ │ Chernobyl nuclear power plant  │       │
+│     │ ences-of-nuclear-accidents-ch │ in Ukraine resulted in the     │       │
+│     │ ernobyl-fukushima-and-three-m │ release of large amounts of    │       │
+│     │ ile-island-HLGS.php?article_i │ radioactive material into the  │       │
+│     │ d=21379                       │ atmosphere, leading to the     │       │
+│     │                               │ deaths of 31 people, and       │       │
+│     │                               │ causing widespread             │       │
+│     │                               │ contamination of the           │       │
+│     │                               │ surrounding areas.             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Laying the Foundation for New │ Domestic power reactors are    │  0.94 │
+│     │ and Advanced Nuclear Reactors │ tightly regulated by the U.S.  │       │
+│     │ in the United States |        │ Nuclear Regulatory Commission  │       │
+│     │ National Academies            │ (NRC) in all phases of their   │       │
+│     │ https://www.nationalacademies │ life cycle—design,             │       │
+│     │ .org/read/26630/chapter/9     │ construction, operations, and  │       │
+│     │                               │ decommissioning. The NRC is    │       │
+│     │                               │ charged with licensing and     │       │
+│     │                               │ regulation of plants to        │       │
+│     │                               │ provide reasonable assurance   │       │
+│     │                               │ of adequate protection of      │       │
+│     │                               │ public health and safety.      │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ contradictory_sources │ Long-term cancer         │ Estimates of total        │
+│                       │ mortality estimates from │ Chernobyl-attributed      │
+│                       │ Chernobyl                │ cancer deaths vary widely │
+│                       │                          │ across sources, from      │
+│                       │                          │ hundreds (WHO/UNSCEAR     │
+│                       │                          │ conservative estimates)   │
+│                       │                          │ to tens of thousands      │
+│                       │                          │ (Greenpeace/TORCH         │
+│                       │                          │ report), making a         │
+│                       │                          │ definitive number         │
+│                       │                          │ difficult to cite.        │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ scope_exceeded        │ Comparative safety of    │ Evidence gathered focuses │
+│                       │ advanced/next-generation │ on existing reactor fleet │
+│                       │ reactors (Gen IV, SMRs)  │ safety records; safety    │
+│                       │                          │ data specific to small    │
+│                       │                          │ modular reactors (SMRs)   │
+│                       │                          │ or Gen IV designs was not │
+│                       │                          │ retrieved.                │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ source_not_found      │ Nuclear waste long-term  │ While radioactive waste   │
+│                       │ safety statistics        │ management was briefly    │
+│                       │                          │ mentioned, quantitative   │
+│                       │                          │ long-term health risk     │
+│                       │                          │ data from waste storage   │
+│                       │                          │ was not found in the      │
+│                       │                          │ retrieved sources.        │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ arxiv             │ nuclear power     │ A systematic      │
+│                  │                   │ plant safety      │ academic review   │
+│                  │                   │ mortality         │ post-2020 could   │
+│                  │                   │ statistics        │ provide updated   │
+│                  │                   │ systematic review │ mortality         │
+│                  │                   │ 2020-2025         │ statistics        │
+│                  │                   │                   │ incorporating the │
+│                  │                   │                   │ full operational  │
+│                  │                   │                   │ history of        │
+│                  │                   │                   │ Fukushima         │
+│                  │                   │                   │ cleanup.          │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ IAEA PRIS nuclear │ The IAEA Power    │
+│                  │                   │ power plant       │ Reactor           │
+│                  │                   │ operational       │ Information       │
+│                  │                   │ safety incidents  │ System (PRIS)     │
+│                  │                   │ database          │ contains          │
+│                  │                   │                   │ comprehensive     │
+│                  │                   │                   │ incident and      │
+│                  │                   │                   │ safety data for   │
+│                  │                   │                   │ all global        │
+│                  │                   │                   │ nuclear plants.   │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ contradiction    │ database          │ Chernobyl total   │ SciTechnol source │
+│                  │                   │ excess cancer     │ cites 31          │
+│                  │                   │ deaths estimates  │ Chernobyl deaths  │
+│                  │                   │ UNSCEAR vs WHO vs │ while CNSC cites  │
+│                  │                   │ independent       │ 28+2=30, and      │
+│                  │                   │ researchers       │ long-term cancer  │
+│                  │                   │                   │ projections       │
+│                  │                   │                   │ differ vastly     │
+│                  │                   │                   │ between           │
+│                  │                   │                   │ organizations.    │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ How do small modular reactors   │ The DOE page on enhanced safety │
+│          │ (SMRs) compare in safety        │ of advanced reactors mentions   │
+│          │ profile to traditional          │ new designs but no comparative  │
+│          │ large-scale nuclear plants?     │ safety mortality data was       │
+│          │                                 │ available in the evidence.      │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ What is the total projected     │ Sources give conflicting        │
+│          │ cancer death toll from          │ numbers; CNSC cites 28 direct   │
+│          │ Chernobyl according to the most │ deaths but does not give a      │
+│          │ recent UNSCEAR assessment?      │ total long-term cancer          │
+│          │                                 │ projection.                     │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Does nuclear power's safety     │ Chernobyl and Fukushima both    │
+│          │ record hold across all          │ involved regulatory failures;   │
+│          │ countries, including those with │ safety statistics may differ    │
+│          │ less stringent regulatory       │ between high-regulation and     │
+│          │ frameworks?                     │ low-regulation countries.       │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How does nuclear power's safety │ Statista notes deaths are       │
+│          │ compare when including the      │ measured from 'accidents and    │
+│          │ health risks from uranium       │ air pollution' per TWh, which   │
+│          │ mining and fuel processing?     │ may not fully account for       │
+│          │                                 │ upstream fuel cycle risks.      │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.92                                                                │
+│ Corroborating sources: 8                                                     │
+│ Source authority: high                                                       │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.95                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 63429                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 89.22s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 2e2b6e88-c973-4422-919c-3838634336c9
--- a/docs/stress-tests/M3.3-runs/14-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/14-contradiction.log
@@ -0,0 +1,358 @@
+Researching: Is dietary cholesterol harmful?
+
+{"question": "Is dietary cholesterol harmful?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:07:32.656017Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:07:33.414998Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:07:33.424151Z"}
+{"question": "Is dietary cholesterol harmful?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:07:33.456353Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Is dietary cholesterol harmful?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:33.456707Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:33.456830Z"}
+{"step": 11, "decision": "Starting iteration 2/5", "tokens_so_far": 1218, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:51.522768Z"}
+{"step": 18, "decision": "Starting iteration 3/5", "tokens_so_far": 14738, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:58.291229Z"}
+{"step": 27, "decision": "Token budget reached before iteration 4: 31680/20000", "event": "budget_exhausted", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:08:14.672921Z"}
+{"step": 28, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 33, "iterations_run": 3, "tokens_used": 31680, "event": "synthesis_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:08:14.673116Z"}
+{"step": 29, "decision": "Parsed synthesis JSON successfully", "duration_ms": 82227, "event": "synthesis_complete", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:33.529276Z"}
+{"step": 50, "decision": "Research complete", "confidence": 0.78, "citation_count": 13, "gap_count": 3, "discovery_count": 4, "total_duration_sec": 124.559, "event": "complete", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:33.531034Z"}
+{"confidence": 0.78, "citations": 13, "gaps": 3, "discovery_events": 4, "tokens_used": 64718, "iterations_run": 3, "wall_time_sec": 120.07263135910034, "budget_exhausted": true, "event": "research_completed", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:33.531291Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:09:33.531771Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:33.538771Z"}
+{"trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "confidence": 0.78, "citations": 13, "tokens_used": 64718, "wall_time_sec": 120.07263135910034, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:33.877746Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ The question of whether dietary cholesterol is harmful is nuanced and        │
+│ remains an area of active scientific debate. Here is a synthesis of the      │
+│ current evidence:                                                            │
+│                                                                              │
+│ 1. **Dietary cholesterol vs. blood cholesterol**: The body produces most of  │
+│ its own cholesterol, and dietary cholesterol has a relatively modest and     │
+│ variable effect on blood (serum) cholesterol levels. Epidemiological studies │
+│ and clinical interventions have largely shown that dietary cholesterol       │
+│ intake does not significantly impact blood cholesterol in most individuals   │
+│ [PMC6024687; PMC9143438]. A meta-analysis of 224 studies (8,143 subjects)    │
+│ found only modest increases in both LDL and HDL when dietary cholesterol is  │
+│ increased [Consensus Academic Search].                                       │
+│                                                                              │
+│ 2. **CVD risk from observational studies**: A 2020 AHA Science Advisory      │
+│ (Carson et al., Circulation) found a significant positive relationship       │
+│ between dietary cholesterol intake and blood LDL, but evidence from          │
+│ observational studies generally does not indicate a significant association  │
+│ with cardiovascular disease risk [AHA Journals,                              │
+│ doi:10.1161/CIR.0000000000000743]. However, a large pooled cohort study      │
+│ (n=29,615, published in JAMA) found each additional 300 mg/day of dietary    │
+│ cholesterol was associated with higher risk of incident CVD and all-cause    │
+│ mortality [PACE-CME; The Cardiology Advisor].                                │
+│                                                                              │
+│ 3. **Updated dietary guidelines**: The 2015–2020 U.S. Dietary Guidelines     │
+│ removed the previous 300 mg/day dietary cholesterol limit, citing no         │
+│ appreciable relationship between dietary cholesterol and serum cholesterol.  │
+│ However, this decision was contested by scientists who argued the evidence   │
+│ was insufficient rather than exculpatory [Regulations.gov scientists'        │
+│ comment; PMC6024687]. The AHA's 2026 dietary guidance states that dietary    │
+│ cholesterol is 'no longer a primary target for CVD risk reduction for most   │
+│ people,' though it still advises limiting cholesterol-rich foods [AHA        │
+│ Journals, doi:10.1161/CIR.0000000000001435].                                 │
+│                                                                              │
+│ 4. **Individual variability**: People differ substantially in how they       │
+│ respond to dietary cholesterol—'hyper-responders' see more significant LDL   │
+│ increases than 'hypo-responders.' Genetic and hormonal factors play          │
+│ important roles [ScienceDirect hypo/hyperresponders; PubMed 12074253].       │
+│                                                                              │
+│ 5. **Eggs as a cholesterol source**: Eggs are the primary dietary            │
+│ cholesterol source studied. Evidence on egg consumption and CVD is           │
+│ inconsistent. A 2025 umbrella review found 'critically low' quality of       │
+│ evidence and concluded there is no sufficient evidence to discourage egg     │
+│ consumption, though weak associations with higher LDL and heart failure risk │
+│ were noted [ScienceDirect, doi:10.1016/j.numecd.2025.103849]. A BMJ          │
+│ meta-analysis suggested higher egg consumption could be associated with      │
+│ higher CVD risk [BMJ m513].                                                  │
+│                                                                              │
+│ 6. **Saturated fat confounding**: Most foods high in dietary cholesterol are │
+│ also high in saturated fat, which does raise LDL cholesterol and CVD risk.   │
+│ Eggs and shrimp are notable exceptions [PMC6024687].                         │
+│                                                                              │
+│ **Bottom line**: For most people, dietary cholesterol in moderate amounts is │
+│ unlikely to be a primary driver of CVD risk. However, it is not completely   │
+│ benign—particularly for hyper-responders or people with diabetes—and the     │
+│ overall dietary pattern (especially saturated fat intake) matters more than  │
+│ dietary cholesterol in isolation. Caution is still warranted, and individual │
+│ factors should guide dietary choices.                                        │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Dietary Cholesterol and the   │ To date, extensive research    │  0.92 │
+│     │ Lack of Evidence in           │ did not show evidence to       │       │
+│     │ Cardiovascular Disease - PMC  │ support a role of dietary      │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ cholesterol in the development │       │
+│     │ articles/PMC6024687/          │ of CVD. As a result, the       │       │
+│     │                               │ 2015–2020 Dietary Guidelines   │       │
+│     │                               │ for Americans removed the      │       │
+│     │                               │ recommendations of restricting │       │
+│     │                               │ dietary cholesterol to 300     │       │
+│     │                               │ mg/day.                        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Is There a Correlation        │ it was not until the late      │  0.91 │
+│     │ between Dietary and Blood     │ 1990s when they were finally   │       │
+│     │ Cholesterol? Evidence from    │ challenged by the newer        │       │
+│     │ Epidemiological Data and      │ information derived from       │       │
+│     │ Clinical Interventions - PMC  │ epidemiological studies and    │       │
+│     │ https://pmc.ncbi.nlm.nih.gov/ │ meta-analysis, which confirmed │       │
+│     │ articles/PMC9143438/          │ the lack of correlation        │       │
+│     │                               │ between dietary and blood      │       │
+│     │                               │ cholesterol.                   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Dietary Cholesterol and       │ Evidence from observational    │  0.93 │
+│     │ Cardiovascular Risk: A        │ studies conducted in several   │       │
+│     │ Science Advisory from the AHA │ countries generally does not   │       │
+│     │ https://www.ahajournals.org/d │ indicate a significant         │       │
+│     │ oi/full/10.1161/CIR.000000000 │ association with               │       │
+│     │ 0000743                       │ cardiovascular disease risk.   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Dietary Cholesterol and       │ Differences in dietary         │  0.88 │
+│     │ Cardiovascular Risk: A        │ cholesterol ranged from 155 to │       │
+│     │ Science Advisory (full text)  │ 1000 mg/d. A significant       │       │
+│     │ https://www.ahajournals.org/d │ positive relationship was      │       │
+│     │ oi/10.1161/CIR.00000000000007 │ identified between dietary     │       │
+│     │ 43                            │ cholesterol                    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ 2026 Dietary Guidance to      │ Dietary cholesterol is no      │  0.90 │
+│     │ Improve Cardiovascular Health │ longer a primary target for    │       │
+│     │ https://www.ahajournals.org/d │ CVD risk reduction for most    │       │
+│     │ oi/10.1161/CIR.00000000000014 │ people. Nevertheless, heart    │       │
+│     │ 35                            │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Higher consumption of dietary │ Among US adults, higher intake │  0.87 │
+│     │ cholesterol or eggs linked to │ of dietary cholesterol or eggs │       │
+│     │ increased risk of incident    │ was significantly linked to    │       │
+│     │ CVD and mortality - PACE-CME  │ increased risk of incident CVD │       │
+│     │ https://pace-cme.org/news/hig │ and all-cause mortality in a   │       │
+│     │ her-consumption-of-dietary-ch │ dose-response manner, which    │       │
+│     │ olesterol-or-eggs-linked-to-i │ was independent of nutrients   │       │
+│     │ ncreased-risk-of-incident-cvd │ or diets                       │       │
+│     │ -and-mortality/2455413/       │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ After Continued Debate,       │ Each additional 300 mg of      │  0.87 │
+│     │ Dietary Cholesterol Linked to │ dietary cholesterol consumed   │       │
+│     │ Significant Increase in CVD - │ per day was significantly      │       │
+│     │ The Cardiology Advisor        │ associated with a higher risk  │       │
+│     │ https://www.thecardiologyadvi │ for incident CVD and all-cause │       │
+│     │ sor.com/home/topics/metabolic │ mortality, as was each         │       │
+│     │ /dyslipidemia/after-continued │ additional half an egg         │       │
+│     │ -debate-dietary-cholesterol-l │ consumed per day.              │       │
+│     │ inked-to-significant-increase │                                │       │
+│     │ -in-cvd/                      │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Scientists' Comment on        │ dietary cholesterol is very    │  0.82 │
+│     │ Dietary Cholesterol -         │ much a 'nutrient of concern,'  │       │
+│     │ Regulations.gov               │ because it increases LDL       │       │
+│     │ https://downloads.regulations │ cholesterol, a                 │       │
+│     │ .gov/FDA-2018-P-1593-0049/att │ well-established risk factor   │       │
+│     │ achment_2.pdf                 │ for coronary heart disease.    │       │
+│     │                               │ Furthermore, the consumption   │       │
+│     │                               │ of whole eggs is associated    │       │
+│     │                               │ with the risk of type 2        │       │
+│     │                               │ diabetes                       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Dietary Cholesterol And Blood │ A meta-analysis of 224 studies │  0.85 │
+│     │ Cholesterol - Consensus       │ involving 8,143 subjects found │       │
+│     │ Academic Search Engine        │ that dietary cholesterol       │       │
+│     │ https://consensus.app/questio │ intake leads to modest         │       │
+│     │ ns/dietary-cholesterol-and-bl │ increases in both LDL and HDL  │       │
+│     │ ood-cholesterol/              │ cholesterol levels. The study  │       │
+│     │                               │ highlighted that while dietary │       │
+│     │                               │ cholesterol does raise serum   │       │
+│     │                               │ cholesterol levels, the effect │       │
+│     │                               │ is relatively small and varies │       │
+│     │                               │ among individuals.             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ Effect of egg consumption on  │ The overall quality of studies │  0.88 │
+│     │ health outcomes: Updated      │ was critically low. The level  │       │
+│     │ umbrella review -             │ of evidence was very weak for  │       │
+│     │ ScienceDirect                 │ all the significant            │       │
+│     │ https://www.sciencedirect.com │ associations: risk of heart    │       │
+│     │ /science/article/pii/S0939475 │ failure (RR 1.15; 95%CI:       │       │
+│     │ 325000031                     │ 1.02–1.30)... higher levels of │       │
+│     │                               │ LDL cholesterol (WMD 7.39;     │       │
+│     │                               │ 95%CI 5.82–8.95)... No         │       │
+│     │                               │ evidence of association was    │       │
+│     │                               │ found among all cardiovascular │       │
+│     │                               │ outcomes and all-cause         │       │
+│     │                               │ mortality risk                 │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 11  │ Egg consumption and risk of   │ Results from our updated       │  0.84 │
+│     │ cardiovascular disease - The  │ meta-analysis suggest that     │       │
+│     │ BMJ                           │ higher egg consumption could   │       │
+│     │ https://www.bmj.com/content/3 │ be associated with a higher    │       │
+│     │ 68/bmj.m513                   │ risk of cardiovascular disease │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 12  │ Hypo- and hyperresponders to  │ Hypo- and hyperresponders to   │  0.78 │
+│     │ dietary cholesterol -         │ dietary cholesterol            │       │
+│     │ ScienceDirect                 │                                │       │
+│     │ https://www.sciencedirect.com │                                │       │
+│     │ /science/article/abs/pii/S000 │                                │       │
+│     │ 2916523398897                 │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 13  │ Here's the latest on dietary  │ More recently, accumulating    │  0.87 │
+│     │ cholesterol and how it fits   │ data has caused researchers to │       │
+│     │ in with a healthy diet |      │ broaden their thinking about   │       │
+│     │ American Heart Association    │ how dietary cholesterol – and  │       │
+│     │ https://www.heart.org/en/news │ eggs – fit into a healthy      │       │
+│     │ /2023/08/25/heres-the-latest- │ eating pattern. 'We've         │       │
+│     │ on-dietary-cholesterol-and-ho │ advanced considerably,' said   │       │
+│     │ w-it-fits-in-with-a-healthy-d │ professor Linda Van Horn       │       │
+│     │ iet                           │                                │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Long-term RCT data on    │ Most evidence comes from  │
+│                       │ dietary cholesterol and  │ observational studies or  │
+│                       │ hard CVD endpoints       │ short-term interventions. │
+│                       │                          │ There are no large,       │
+│                       │                          │ long-term randomized      │
+│                       │                          │ controlled trials         │
+│                       │                          │ directly testing reduced  │
+│                       │                          │ dietary cholesterol       │
+│                       │                          │ versus hard CVD outcomes  │
+│                       │                          │ like myocardial           │
+│                       │                          │ infarction or             │
+│                       │                          │ cardiovascular death.     │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ source_not_found      │ Dietary cholesterol      │ While some sources        │
+│                       │ effects in specific      │ mention increased CVD     │
+│                       │ high-risk subgroups      │ risk from eggs in people  │
+│                       │ (diabetes, familial      │ with diabetes, the        │
+│                       │ hypercholesterolemia)    │ gathered evidence does    │
+│                       │                          │ not deeply characterize   │
+│                       │                          │ effects in all high-risk  │
+│                       │                          │ subgroups such as         │
+│                       │                          │ familial                  │
+│                       │                          │ hypercholesterolemia      │
+│                       │                          │ patients.                 │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Mechanisms               │ Confounding between       │
+│                       │ distinguishing dietary   │ dietary cholesterol and   │
+│                       │ cholesterol from         │ saturated fat intake      │
+│                       │ saturated fat effects    │ makes it difficult to     │
+│                       │                          │ isolate dietary           │
+│                       │                          │ cholesterol's independent │
+│                       │                          │ effect on CVD; different  │
+│                       │                          │ studies handle this       │
+│                       │                          │ confounder differently,   │
+│                       │                          │ leading to inconsistent   │
+│                       │                          │ conclusions.              │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ contradiction    │ database          │ dietary           │ The evidence is   │
+│                  │                   │ cholesterol CVD   │ contradictory     │
+│                  │                   │ risk randomized   │ between large     │
+│                  │                   │ controlled trial  │ observational     │
+│                  │                   │ meta-analysis     │ pooled cohorts    │
+│                  │                   │ 2020 2024         │ (showing CVD      │
+│                  │                   │                   │ risk) and         │
+│                  │                   │                   │ intervention/epid │
+│                  │                   │                   │ emiological       │
+│                  │                   │                   │ reviews (showing  │
+│                  │                   │                   │ no significant    │
+│                  │                   │                   │ association),     │
+│                  │                   │                   │ warranting deeper │
+│                  │                   │                   │ RCT-level         │
+│                  │                   │                   │ analysis.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ lean mass         │ A distinct        │
+│                  │                   │ hyper-responder   │ phenotype (lean   │
+│                  │                   │ LDL dietary       │ mass              │
+│                  │                   │ cholesterol       │ hyper-responders) │
+│                  │                   │ cardiovascular    │ shows pronounced  │
+│                  │                   │ risk 2023 2024    │ LDL increases on  │
+│                  │                   │                   │ low-carb diets    │
+│                  │                   │                   │ high in dietary   │
+│                  │                   │                   │ fat/cholesterol,  │
+│                  │                   │                   │ with unclear CVD  │
+│                  │                   │                   │ implications.     │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ dietary           │ Multiple sources  │
+│                  │                   │ cholesterol type  │ mention           │
+│                  │                   │ 2 diabetes risk   │ association       │
+│                  │                   │ eggs 2020 2024    │ between           │
+│                  │                   │ meta-analysis     │ egg/cholesterol   │
+│                  │                   │                   │ intake and type 2 │
+│                  │                   │                   │ diabetes risk,    │
+│                  │                   │                   │ which is not      │
+│                  │                   │                   │ fully explored in │
+│                  │                   │                   │ the gathered      │
+│                  │                   │                   │ evidence.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ new_source       │ database          │ ACC AHA 2026      │ New 2026 ACC/AHA  │
+│                  │                   │ dyslipidemia      │ dyslipidemia      │
+│                  │                   │ guidelines        │ guidelines were   │
+│                  │                   │ dietary           │ referenced but    │
+│                  │                   │ cholesterol       │ only partially    │
+│                  │                   │ recommendations   │ retrieved; full   │
+│                  │                   │                   │ dietary           │
+│                  │                   │                   │ cholesterol       │
+│                  │                   │                   │ guidance warrants │
+│                  │                   │                   │ review.           │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Should dietary cholesterol      │ Scientists' comments on the     │
+│          │ recommendations differ for      │ 2015 dietary guidelines and     │
+│          │ people with diabetes or         │ some observational studies      │
+│          │ familial hypercholesterolemia   │ suggest egg/cholesterol intake  │
+│          │ compared to the general         │ may increase CHD risk           │
+│          │ population?                     │ specifically in people with     │
+│          │                                 │ diabetes.                       │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Do LDL cholesterol              │ Research shows wide individual  │
+│          │ hyper-responders to dietary     │ variability in LDL response to  │
+│          │ cholesterol face meaningfully   │ dietary cholesterol; it is      │
+│          │ higher long-term CVD risk, and  │ unclear whether                 │
+│          │ should they restrict dietary    │ hyper-responders have elevated  │
+│          │ cholesterol?                    │ CVD risk and need tailored      │
+│          │                                 │ advice.                         │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ How much of the observed CVD    │ PMC6024687 notes most           │
+│          │ risk associated with dietary    │ high-cholesterol foods are also │
+│          │ cholesterol in observational    │ high in saturated fat;          │
+│          │ studies is attributable to      │ isolating dietary cholesterol's │
+│          │ saturated fat co-ingestion      │ independent effect is           │
+│          │ rather than cholesterol itself? │ methodologically challenging.   │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the effect of dietary   │ PACE-CME study noted that CVD   │
+│          │ cholesterol within the context  │ risk association from dietary   │
+│          │ of a high-quality overall diet  │ cholesterol was independent of  │
+│          │ (e.g., Mediterranean or DASH    │ overall diet quality, but this  │
+│          │ diet)?                          │ needs further investigation.    │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Does the food matrix (e.g.,     │ The 2025 umbrella review of egg │
+│          │ eggs vs. red meat) in which     │ consumption found weak          │
+│          │ dietary cholesterol is consumed │ associations; it is unclear if  │
+│          │ modify its impact on CVD risk?  │ the source of dietary           │
+│          │                                 │ cholesterol modulates risk      │
+│          │                                 │ independently of the            │
+│          │                                 │ cholesterol content.            │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.78                                                                │
+│ Corroborating sources: 13                                                    │
+│ Source authority: high                                                       │
+│ Contradiction detected: True                                                 │
+│ Query specificity match: 0.85                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 64718                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 120.07s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 27d81891-5bf2-4bf4-9744-55f39ffaf696
--- a/docs/stress-tests/M3.3-runs/15-contradiction.log
+++ b/docs/stress-tests/M3.3-runs/15-contradiction.log
@ -0,0 +1,48 @@
+Researching: Does screen time harm child development?
+
+{"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:34.721867Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:09:35.602647Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:35.613025Z"}
+{"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:35.653113Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does screen time harm child development?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653592Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653723Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1126, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:45.628661Z"}
+{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 10139, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:51.476900Z"}
+{"step": 21, "decision": "Token budget reached before iteration 4: 23391/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056368Z"}
+{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23391, "event": "synthesis_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056571Z"}
+{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74986, "event": "synthesis_complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.739493Z"}
+{"step": 24, "decision": "Failed to build ResearchResult: 1 validation error for DiscoveryEvent\nquery\n  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]\n    For further information visit https://errors.pydantic.dev/2.12/v/string_type", "event": "synthesis_build_error", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.753603Z"}
+{"step": 26, "decision": "Research complete", "confidence": 0.1, "citation_count": 0, "gap_count": 1, "discovery_count": 0, "total_duration_sec": 98.512, "event": "complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.755661Z"}
+{"confidence": 0.1, "citations": 0, "gaps": 1, "discovery_events": 0, "tokens_used": 44375, "iterations_run": 3, "wall_time_sec": 95.08588027954102, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:10.755895Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:11:10.757071Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:10.770530Z"}
+{"trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "confidence": 0.1, "citations": 0, "tokens_used": 44375, "wall_time_sec": 95.08588027954102, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.105698Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Research on 'Does screen time harm child development?' completed but         │
+│ synthesis failed. 22 sources were gathered.                                  │
+╰──────────────────────────────────────────────────────────────────────────────╯
+No citations.
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic     ┃ Detail                                        ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ budget_exhausted │ synthesis │ The synthesis step failed to produce          │
+│                  │           │ structured output.                            │
+└──────────────────┴───────────┴───────────────────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.10                                                                │
+│ Corroborating sources: 0                                                     │
+│ Source authority: low                                                        │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.00                                                │
+│ Budget status: spent                                                         │
+│ Recency: unknown                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 44375                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 95.09s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 9c18d570-73d3-4e8a-98bc-7cb1b66c61d2
--- a/docs/stress-tests/M3.3-runs/16-scope.log
+++ b/docs/stress-tests/M3.3-runs/16-scope.log
@ -0,0 +1,321 @@
+Researching: What proprietary indexing strategies do high-frequency trading 
+firms use for order book reconstruction?
+
+{"question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.888630Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:11:12.816801Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:12.829566Z"}
+{"question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:12.871225Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:12.871693Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:12.872051Z"}
+{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1212, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:27.416025Z"}
+{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 15135, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:33.632271Z"}
+{"step": 23, "decision": "Token budget reached before iteration 4: 35581/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:40.723229Z"}
+{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 35, "iterations_run": 3, "tokens_used": 35581, "event": "synthesis_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:40.723491Z"}
+{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 72229, "event": "synthesis_complete", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:50.638239Z"}
+{"step": 42, "decision": "Research complete", "confidence": 0.72, "citation_count": 8, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 101.111, "event": "complete", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:50.639828Z"}
+{"confidence": 0.72, "citations": 8, "gaps": 4, "discovery_events": 4, "tokens_used": 70892, "iterations_run": 3, "wall_time_sec": 97.76683187484741, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:12:50.639933Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:12:50.640430Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:12:50.648897Z"}
+{"trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "confidence": 0.72, "citations": 8, "tokens_used": 70892, "wall_time_sec": 97.76683187484741, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:12:50.931342Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ High-frequency trading firms use several proprietary and semi-documented     │
+│ indexing strategies for order book reconstruction, though most production    │
+│ details remain trade secrets. Based on available evidence:                   │
+│                                                                              │
+│ 1. **Hash Table + Array Hybrid**: The most commonly cited production         │
+│ approach combines plain arrays (for cache-friendly sequential memory access  │
+│ minimizing cache misses) with hash tables (for O(1) lookup of specific price │
+│ levels). This codesign optimizes both speed and cache locality. [Sources 15, │
+│ 16, 28]                                                                      │
+│                                                                              │
+│ 2. **B-Tree / ISAM Indexing**: The historically significant Island ECN       │
+│ (1996), built by Josh Levine, used in-memory B-tree indexing via an ISAM     │
+│ storage engine with zero disk access during matching, achieving O(log N)     │
+│ access per price level. This is considered the documented proof-of-concept   │
+│ for production-grade LOB indexing. [Source 29]                               │
+│                                                                              │
+│ 3. **Hybrid Binary-Linear Search**: An IEEE-documented approach proposes a   │
+│ simple linear data structure for tracking the order book combined with a     │
+│ hybrid binary-linear search algorithm to maintain top bid/ask with minimal   │
+│ latency. [Source 19]                                                         │
+│                                                                              │
+│ 4. **ROI Vector (Region-of-Interest Vector)**: Used in backtesting           │
+│ frameworks like HftBacktest, this approach restricts the active price range  │
+│ to a bounded region of interest, enabling vector-based O(1) access within    │
+│ the ROI while avoiding full-book scanning. [Sources 25, 35]                  │
+│                                                                              │
+│ 5. **Lock-Free Concurrent Data Structures**: To handle concurrent updates    │
+│ without mutex overhead, firms implement lock-free data structures allowing   │
+│ multiple threads to update the LOB simultaneously. [Sources 15, 16]          │
+│                                                                              │
+│ 6. **Event-Driven with Selective Polling Hybrid**: The LOB primarily         │
+│ operates event-driven but incorporates high-frequency polling for the most   │
+│ latency-sensitive execution pathways, ensuring sub-microsecond               │
+│ responsiveness. [Sources 15, 16]                                             │
+│                                                                              │
+│ 7. **Order Record Reuse (Object Pooling)**: Levine's Island engine reused    │
+│ recently freed order records for new orders—described as 'hugely             │
+│ important'—a form of memory pooling that avoids allocation overhead during   │
+│ high-throughput periods. [Source 29]                                         │
+│                                                                              │
+│ 8. **Structural Filtration for Signal Quality**: Recent research (2025)      │
+│ proposes filtering transient LOB events by order lifetime, update count, or  │
+│ inter-update delay before indexing, improving directional signal quality     │
+│ (OBI) extracted from the reconstructed book. [Source 6]                      │
+│                                                                              │
+│ Notably, red-black trees—frequently cited in academic literature—are rarely  │
+│ used in production due to poor cache behavior versus simpler arrays at       │
+│ realistic market depths. The key insight from practitioners is that          │
+│ algorithmic data structure choice (O(log N) vs O(N)) dominates hardware      │
+│ investment: a $2M co-location/FPGA upgrade produced no measurable latency    │
+│ improvement when the underlying order book used a sorted array with O(N)     │
+│ inserts. [Sources 23, 29]                                                    │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Matching Engine Architecture: │ Josh Levine built the Island   │  0.95 │
+│     │ Why Your Order Book Data      │ matching engine in FoxPro for  │       │
+│     │ Structure Is the Real Latency │ MS-DOS... The order book used  │       │
+│     │ Bottleneck                    │ in-memory B-tree indexing via  │       │
+│     │ https://electronictradinghub. │ an ISAM storage engine. Zero   │       │
+│     │ com/matching-engine-architect │ disk access during matching.   │       │
+│     │ ure-why-your-order-book-data- │ Every price level accessed in  │       │
+│     │ structure-is-the-real-latency │ O(log N) time. Levine's        │       │
+│     │ -bottleneck/                  │ optimization for new-order     │       │
+│     │                               │ entry latency: reuse recently  │       │
+│     │                               │ freed order records for new    │       │
+│     │                               │ orders — a detail he called    │       │
+│     │                               │ 'hugely important'             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Optimizing Limit Order Book   │ I use a combination of plain   │  0.88 │
+│     │ for HFT Systems               │ arrays and hash tables to      │       │
+│     │ https://www.linkedin.com/post │ manage the LOB. Arrays are     │       │
+│     │ s/silahian_hft-hft-trading-ac │ highly effective with CPU      │       │
+│     │ tivity-7351226537301417988-ei │ caches, offering sequential    │       │
+│     │ cX                            │ memory access that minimizes   │       │
+│     │                               │ cache misses. The integration  │       │
+│     │                               │ of hash tables provides quick  │       │
+│     │                               │ access to specific entries,    │       │
+│     │                               │ ensuring that both speed and   │       │
+│     │                               │ cache locality are optimized.  │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Red Black Trees for Limit     │ They're not necessarily ideal. │  0.92 │
+│     │ Order Book - Quantitative     │ In fact, they're rarely used   │       │
+│     │ Finance Stack Exchange        │ in production trading systems  │       │
+│     │ https://quant.stackexchange.c │ with low latency               │       │
+│     │ om/questions/63140/red-black- │ requirements... a simple array │       │
+│     │ trees-for-limit-order-book    │ or vector with linear access   │       │
+│     │                               │ patterns will often outperform │       │
+│     │                               │ any complex data structure     │       │
+│     │                               │ with better asymptotic runtime │       │
+│     │                               │ because a simple array         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Order Book Reconstruction -   │ HashMapMarketDepth...          │  0.85 │
+│     │ HftBacktest                   │ BTreeMarketDepth...            │       │
+│     │ https://mintlify.com/nkaz001/ │ ROIVectorMarketDepth::new(tick │       │
+│     │ hftbacktest/concepts/order-bo │ _size, lot_size, roi_lb,       │       │
+│     │ ok                            │ roi_ub)...                     │       │
+│     │                               │ FusedHashMapMarketDepth        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Order Book Filtration and     │ Three real-time, observable    │  0.82 │
+│     │ Directional Signal Extraction │ filtration schemes: based on   │       │
+│     │ at High Frequency             │ order lifetime, update count,  │       │
+│     │ https://arxiv.org/html/2507.2 │ and inter-update delay. These  │       │
+│     │ 2712v1                        │ are used to recompute OBI on   │       │
+│     │                               │ structurally filtered event    │       │
+│     │                               │ streams... Empirical results   │       │
+│     │                               │ show that structural           │       │
+│     │                               │ filtration improves            │       │
+│     │                               │ directional signal clarity in  │       │
+│     │                               │ correlation and regime-based   │       │
+│     │                               │ metrics                        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Building Low-Latency Order    │ This paper proposes a simple   │  0.80 │
+│     │ Books with Hybrid             │ linear data structure for      │       │
+│     │ Binary-Linear ...             │ tracking the order book and a  │       │
+│     │ https://ieeexplore.ieee.org/d │ hybrid binary-linear search    │       │
+│     │ ocument/10296447/             │ algorithm to maintain the top  │       │
+│     │                               │ bid and ask                    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Order Book Reconstruction -   │ Index reusing... Regional      │  0.75 │
+│     │ dxFeed KB                     │ events... Event flags          │       │
+│     │ https://kb.dxfeed.com/en/data │ applicable to Order event...   │       │
+│     │ -model/dxfeed-order-book/orde │ Snapshots... Transaction       │       │
+│     │ r-book-reconstruction.html    │ model... dxFeed market data    │       │
+│     │                               │ feeds (real-time, delayed or   │       │
+│     │                               │ historical) allow clients to   │       │
+│     │                               │ reconstruct order books, price │       │
+│     │                               │ level aggregations, and        │       │
+│     │                               │ aggregations by Market Maker   │       │
+│     │                               │ or a data provider.            │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ GitHub -                      │ This Limit Order Book is       │  0.70 │
+│     │ brprojects/Limit-Order-Book   │ developed in C++ from scratch  │       │
+│     │ https://github.com/brprojects │ and able to handle over        │       │
+│     │ /Limit-Order-Book             │ 1,400,000 TPS (transactions    │       │
+│     │                               │ per second), including Market, │       │
+│     │                               │ Limit, Stop and Stop Limit     │       │
+│     │                               │ orders.                        │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic                       ┃ Detail                      ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found │ Proprietary FPGA-based      │ Actual FPGA hardware        │
+│                  │ order book indexing schemes │ implementations used by     │
+│                  │                             │ firms like Virtu, Jane      │
+│                  │                             │ Street, or Citadel for      │
+│                  │                             │ on-chip order book indexing │
+│                  │                             │ are not publicly            │
+│                  │                             │ documented. MIT project     │
+│                  │                             │ proposal references FPGA    │
+│                  │                             │ LOB but lacks               │
+│                  │                             │ implementation details.     │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Exact data structures used  │ No public disclosure exists │
+│                  │ by specific named HFT firms │ for the specific indexing   │
+│                  │                             │ implementations of major    │
+│                  │                             │ HFT firms (e.g., Virtu, Two │
+│                  │                             │ Sigma, Jump Trading). All   │
+│                  │                             │ evidence is from            │
+│                  │                             │ practitioners sharing       │
+│                  │                             │ general principles or       │
+│                  │                             │ academic reconstructions.   │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ scope_exceeded   │ Co-location-specific memory │ NUMA-aware memory           │
+│                  │ topology optimization for   │ allocation and CPU affinity │
+│                  │ LOB                         │ strategies for LOB          │
+│                  │                             │ processes in co-located     │
+│                  │                             │ environments are referenced │
+│                  │                             │ but not detailed in         │
+│                  │                             │ available sources.          │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Crypto-specific LOB         │ While one Medium article    │
+│                  │ indexing differences vs     │ covers crypto HFT system    │
+│                  │ equity markets              │ design, it does not detail  │
+│                  │                             │ how LOB indexing strategies │
+│                  │                             │ differ for 24/7 crypto      │
+│                  │                             │ markets with different tick │
+│                  │                             │ structures.                 │
+└──────────────────┴─────────────────────────────┴─────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ arxiv             │ FPGA order book   │ The MIT HFT       │
+│                  │                   │ matching engine   │ Accelerator paper │
+│                  │                   │ hardware          │ and FPGA          │
+│                  │                   │ implementation    │ references        │
+│                  │                   │ nanosecond        │ suggest           │
+│                  │                   │ latency           │ significant       │
+│                  │                   │                   │ unpublished work  │
+│                  │                   │                   │ on                │
+│                  │                   │                   │ hardware-accelera │
+│                  │                   │                   │ ted LOB indexing  │
+│                  │                   │                   │ that would        │
+│                  │                   │                   │ directly answer   │
+│                  │                   │                   │ the proprietary   │
+│                  │                   │                   │ indexing question │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ limit order book  │ Cache-oblivious   │
+│                  │                   │ data structure    │ structures like   │
+│                  │                   │ cache-oblivious   │ van Emde Boas     │
+│                  │                   │ van Emde Boas     │ trees are         │
+│                  │                   │ tree HFT          │ theoretically     │
+│                  │                   │                   │ optimal for LOB   │
+│                  │                   │                   │ operations but    │
+│                  │                   │                   │ not mentioned in  │
+│                  │                   │                   │ sources; academic │
+│                  │                   │                   │ literature may    │
+│                  │                   │                   │ document their    │
+│                  │                   │                   │ use               │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ new_source       │ database          │ Island ECN Levine │ The Island ECN    │
+│                  │                   │ order book ISAM   │ B-tree/ISAM       │
+│                  │                   │ indexing original │ reference is      │
+│                  │                   │ documentation     │ cited secondhand; │
+│                  │                   │ 1996              │ primary           │
+│                  │                   │                   │ documentation     │
+│                  │                   │                   │ would provide     │
+│                  │                   │                   │ authoritative     │
+│                  │                   │                   │ details on the    │
+│                  │                   │                   │ original          │
+│                  │                   │                   │ production        │
+│                  │                   │                   │ indexing strategy │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ order book        │ L3 order-by-order │
+│                  │                   │ reconstruction L3 │ reconstruction    │
+│                  │                   │ tick data index   │ requires          │
+│                  │                   │ compression high  │ per-order         │
+│                  │                   │ frequency         │ indexing by       │
+│                  │                   │                   │ order_id which    │
+│                  │                   │                   │ has different     │
+│                  │                   │                   │ data structure    │
+│                  │                   │                   │ requirements than │
+│                  │                   │                   │ L2 price-level    │
+│                  │                   │                   │ indexing          │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Do modern HFT firms use         │ Sources confirm cache-friendly  │
+│          │ NUMA-aware memory allocation    │ arrays dominate in production,  │
+│          │ strategies specifically tuned   │ but NUMA effects in             │
+│          │ for order book price-level      │ multi-socket co-located servers │
+│          │ index structures, and how does  │ are not addressed               │
+│          │ this interact with CPU cache    │                                 │
+│          │ topology?                       │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ How do HFT firms handle the     │ dxFeed documentation describes  │
+│          │ transition from snapshot-based  │ snapshot and transaction models │
+│          │ full order book state to        │ separately; the handoff between │
+│          │ incremental delta updates in    │ these modes in production       │
+│          │ their indexing layer without    │ indexing is not detailed        │
+│          │ introducing consistency gaps?   │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the practical           │ HftBacktest documents both      │
+│          │ throughput and latency tradeoff │ structures but does not provide │
+│          │ between ROIVectorMarketDepth    │ comparative benchmarks for edge │
+│          │ and FusedHashMapMarketDepth     │ cases like flash crashes where  │
+│          │ implementations under real      │ price moves outside the ROI     │
+│          │ market conditions with large    │                                 │
+│          │ price spikes?                   │                                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Does structural LOB filtration  │ The filtration paper shows      │
+│          │ (by order lifetime or update    │ improved OBI signal quality but │
+│          │ count) as proposed in the 2025  │ acknowledges limited gains in   │
+│          │ arxiv paper degrade order book  │ causal excitation;              │
+│          │ reconstruction accuracy under   │ accuracy-speed tradeoff for     │
+│          │ normal market conditions        │ indexing filtered vs raw        │
+│          │ compared to raw feeds?          │ streams is unresolved           │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ low      │ How do exchanges like LMAX,     │ The electronictradinghub        │
+│          │ Tokyo Stock Exchange, and NSE   │ article cites these exchanges   │
+│          │ India differ in their           │ as modern evidence but does not │
+│          │ recommended order book          │ detail their specific           │
+│          │ reconstruction protocols, and   │ reconstruction protocol         │
+│          │ do these differences force      │ differences                     │
+│          │ different indexing strategies   │                                 │
+│          │ on client-side HFT systems?     │                                 │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.72                                                                │
+│ Corroborating sources: 8                                                     │
+│ Source authority: medium                                                     │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.65                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 70892                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 97.77s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: f4c43973-7cac-4193-a249-cbb1302de4f7
--- a/docs/stress-tests/M3.3-runs/17-scope.log
+++ b/docs/stress-tests/M3.3-runs/17-scope.log
@ -0,0 +1,344 @@
+Researching: What is the actual operational doctrine of Chinese DF-41 ICBM 
+brigades?
+
+{"question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:12:51.608714Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:12:52.450376Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:12:52.459819Z"}
+{"question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:12:52.495811Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:52.496319Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:52.496431Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1194, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:05.548923Z"}
+{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 8831, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:18.572224Z"}
+{"step": 23, "decision": "Token budget reached before iteration 4: 31917/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:36.495991Z"}
+{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 31, "iterations_run": 3, "tokens_used": 31917, "event": "synthesis_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:36.496215Z"}
+{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 90409, "event": "synthesis_complete", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:04.659059Z"}
+{"step": 46, "decision": "Research complete", "confidence": 0.72, "citation_count": 12, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 136.645, "event": "complete", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:04.687651Z"}
+{"confidence": 0.72, "citations": 12, "gaps": 4, "discovery_events": 4, "tokens_used": 62857, "iterations_run": 3, "wall_time_sec": 132.16255736351013, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:15:04.687981Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:15:04.688728Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:15:04.696829Z"}
+{"trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "confidence": 0.72, "citations": 12, "tokens_used": 62857, "wall_time_sec": 132.16255736351013, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:15:04.924751Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Chinese DF-41 ICBM brigade operational doctrine encompasses several key      │
+│ elements based on open-source intelligence and defense analysis:             │
+│                                                                              │
+│ **Basing and Mobility**: DF-41 brigades operate under a tri-basing doctrine  │
+│ employing road-mobile, rail-mobile, and silo-based launchers. The            │
+│ road-mobile variant uses the Tian HTF5980 16×16 wheeled chassis. Silo        │
+│ construction has accelerated since 2021 with three new solid-fuel ICBM silo  │
+│ fields identified in northern China. [Sources: MDAA, CSIS Missile Threat,    │
+│ FAS]                                                                         │
+│                                                                              │
+│ **Alert Posture and Launch Doctrine**: The PLARF is working to implement a   │
+│ launch-on-warning (LOW) posture. Brigades now strive to keep at least part   │
+│ of their force in a higher state of readiness, representing a significant    │
+│ shift from China's historically relaxed alert posture where warheads were    │
+│ stored separately from missiles. [Sources: Air University/PLARF Nuclear      │
+│ Warhead Management, NDU]                                                     │
+│                                                                              │
+│ **Warhead Management**: Historically, Chinese ICBMs stored warheads          │
+│ separately from missiles ('de-mated'). The shift toward LOW requires         │
+│ warheads to be mated or at least rapidly mateable to delivery systems. As of │
+│ the 2025 FAS Nuclear Notebook, China possesses approximately 600 warheads,   │
+│ with DF-41 launchers armed with either a single ~1 MT warhead or up to 10    │
+│ MIRV warheads (20/90/150 KT yield variants). [Sources: FAS 2025, MDAA]       │
+│                                                                              │
+│ **Force Structure**: As of 2020-2023, two brigades were confirmed operating  │
+│ DF-41 when it appeared in the 2019 parade. The CNS 2023 Order of Battle      │
+│ identifies Base 64 (Lanzhou HQ) Brigade 644 (Hanzhong) as a rumored DF-41    │
+│ integration base. Additional brigades under Base 63 are suspected. [Sources: │
+│ Bulletin PLARF Force Structure Table 2020, CNS OOB 2023]                     │
+│                                                                              │
+│ **Camouflage and Concealment**: Mobile DF-41 units employ camouflage netting │
+│ and disperse into forests and tunnels during exercises, consistent with      │
+│ PLARF general doctrine of 'hiding and waiting.' [Sources: Al                 │
+│ Arabiya/Facebook report]                                                     │
+│                                                                              │
+│ **No-First-Use and Deterrence**: Chinese doctrine officially maintains a     │
+│ no-first-use (NFU) posture, with the DF-41 serving as a second-strike        │
+│ deterrent. However, the silo expansion and LOW posture shift have raised     │
+│ questions among analysts about whether NFU remains operationally intact.     │
+│ [Sources: The Mandarin, FAS 2025]                                            │
+│                                                                              │
+│ **Range and Target Coverage**: With a range of 12,000–15,000 km, DF-41       │
+│ brigades based in central/northern China can target the entire continental   │
+│ United States, making them the primary strategic countervalue and            │
+│ counterforce deterrent against the US. [Sources: MDAA, CSIS Missile Threat]  │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Dong Feng-41(CSS-X-20)        │ The DF-41 has a range of       │  0.90 │
+│     │ https://www.missiledefenseadv │ 12,000-15,000 km (able to      │       │
+│     │ ocacy.org/missile-threat-and- │ target half to all of the      │       │
+│     │ proliferation/todays-missile- │ continental U.S.), can carry   │       │
+│     │ threat/china/df-41/           │ multiple independently         │       │
+│     │                               │ targetable reentry vehicles    │       │
+│     │                               │ (MIRVs), and is rail-or        │       │
+│     │                               │ road-mobile. The DF-41 is      │       │
+│     │                               │ solid propelled and can carry  │       │
+│     │                               │ a payload of up to 2500 kg.    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ DF-41 (Dong Feng-41 /         │ The DF-41 (Dong Feng [East     │  0.92 │
+│     │ CSS-X-20) | Missile Threat    │ Wind]-41, CSS-20) is Chinese   │       │
+│     │ https://missilethreat.csis.or │ road-mobile intercontinental   │       │
+│     │ g/missile/df-41/              │ ballistic missile (ICBM). It   │       │
+│     │                               │ has an operational range of up │       │
+│     │                               │ to 15,000 km, making it        │       │
+│     │                               │ China's longest-range missile, │       │
+│     │                               │ and is reportedly capable of   │       │
+│     │                               │ loading multiple               │       │
+│     │                               │ independently-targeted         │       │
+│     │                               │ warheads (MIRV).               │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ PLA Rocket Force Nuclear      │ PLARF is working to implement  │  0.88 │
+│     │ Warhead Management - Air      │ a launch-on-warning (LOW)      │       │
+│     │ University                    │ posture, and brigades now      │       │
+│     │ https://www.airuniversity.af. │ strive to keep at least part   │       │
+│     │ edu/Portals/10/CASI/documents │ of their force in a state of   │       │
+│     │ /Research/Infrastructure/2026 │                                │       │
+│     │ -03-09%20PLARF%20Nuclear%20Wa │                                │       │
+│     │ rhead%20Management.pdf        │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ IMPLICATIONS OF A PRC SHIFT   │ The PLARF has adjusted its     │  0.87 │
+│     │ TO A LAUNCH-ON-WARNING        │ nuclear warhead storage and    │       │
+│     │ https://inss.ndu.edu/LinkClic │ handling practices and         │       │
+│     │ k.aspx?fileticket=kU27dwWHUvU │ training to support regular    │       │
+│     │ %3D&portalid=82               │ alert status. A LOW posture,   │       │
+│     │                               │ which requires ICBM units      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Chinese nuclear weapons, 2025 │ China has continued to develop │  0.95 │
+│     │ - Federation of American      │ its three new missile silo     │       │
+│     │ Scientists                    │ fields for solid-fuel          │       │
+│     │ https://fas.org/wp-content/up │ intercontinental ballistic     │       │
+│     │ loads/2025/03/Chinese-nuclear │ missiles (ICBMs)...has been    │       │
+│     │ -weapons-2025.pdf             │ developing new variants of     │       │
+│     │                               │ ICBMs and advanced strategic   │       │
+│     │                               │ delivery systems, and has      │       │
+│     │                               │ likely produced excess         │       │
+│     │                               │ warheads for these systems     │       │
+│     │                               │ once they are deployed.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ New Missile Silo And DF-41    │ The photos also show that 18   │  0.90 │
+│     │ Launchers Seen In Chinese     │ road-mobile launchers of the   │       │
+│     │ Nuclear Missile Training Area │ long-awaited DF-41 ICBM were   │       │
+│     │ - FAS                         │ training in the area in        │       │
+│     │ https://fas.org/publication/c │ April-May 2019 together with   │       │
+│     │ hina-silo-df41/               │ launchers for the DF-31AG      │       │
+│     │                               │ ICBM, possibly the DF-5B ICBM, │       │
+│     │                               │ the DF-26 IRBM, and the DF-21  │       │
+│     │                               │ MRBM. Altogether, more than 72 │       │
+│     │                               │ missile launchers can be seen  │       │
+│     │                               │ operating together.            │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Table 2: PLARF Missile Force  │ 644 Brigade Hanzhong (33.1321, │  0.85 │
+│     │ Structure 2020                │ 106.9361) (DF-41) (Yes)        │       │
+│     │ https://thebulletin.org/wp-co │ Rumored DF-41 integration      │       │
+│     │ ntent/uploads/2020/12/Kristen │ base.                          │       │
+│     │ sen-Korda_Nov-Dec-China-Table │                                │       │
+│     │ 2_final.pdf                   │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Understanding the People's    │ The DF-41 will likely replace  │  0.88 │
+│     │ Liberation Army Rocket Force  │ older ICBMs in the Chinese     │       │
+│     │ https://www.armyupress.army.m │ arsenal and will carry either  │       │
+│     │ il/Journals/Military-Review/E │ a single megaton warhead or up │       │
+│     │ nglish-Edition-Archives/China │ to ten MIRV smaller warheads.  │       │
+│     │ -Reader-Special-Edition-Septe │                                │       │
+│     │ mber-2021/Mihal-PLA-Rocket-Fo │                                │       │
+│     │ rce/                          │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ China's new missile silos     │ The discovery by researchers   │  0.82 │
+│     │ (hundreds of them)            │ at the James Martin Center for │       │
+│     │ https://www.themandarin.com.a │ Nonproliferation Studies in    │       │
+│     │ u/166656-china-military-watch │ California that 119 missile    │       │
+│     │ -2/                           │ silos were being built in the  │       │
+│     │                               │ desert near the city of Yumen  │       │
+│     │                               │ in the Gansu region suggested  │       │
+│     │                               │ a rapid expansion of China's   │       │
+│     │                               │ nuclear weapons capabilities.  │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ China is building more        │ The new underground silos are  │  0.84 │
+│     │ underground silos for its     │ located in the centre of the   │       │
+│     │ ballistic missiles | SCMP     │ Jilantai training base, within │       │
+│     │ https://www.scmp.com/news/chi │ a total area of 200 sq km, and │       │
+│     │ na/military/article/3125699/c │ are spaced between 2.2km and   │       │
+│     │ hina-building-more-undergroun │ 4.4km apart so that no two of  │       │
+│     │ d-silos-its-ballistic-missile │ them can be destroyed in a     │       │
+│     │ s                             │ single nuclear attack.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 11  │ China's Mobile ICBM Brigades: │ The PLARF is currently         │  0.75 │
+│     │ The DF-31 and DF-41           │ modernizing its                │       │
+│     │ https://www.aboyandhis.blog/p │ intercontinental ballistic     │       │
+│     │ ost/china-s-mobile-icbm-briga │ missile forces with two new    │       │
+│     │ des-the-df-31-and-df-41       │ mobile systems: the new DF-41  │       │
+│     │                               │ ballistic missile and the new  │       │
+│     │                               │ DF-31AG                        │       │
+│     │                               │ transporter-erector-launcher.. │       │
+│     │                               │ .The DF-41 is thought to be    │       │
+│     │                               │ out of development but has not │       │
+│     │                               │ yet moved into Operational     │       │
+│     │                               │ Testing and Evaluation (OT&E). │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 12  │ The 2024 DOD China Military   │ Other variables are how many   │  0.90 │
+│     │ Power Report - FAS            │ warheads are assigned to the   │       │
+│     │ https://fas.org/publication/t │ DF-26 IRBM launchers (probably │       │
+│     │ he-2024-dod-china-military-po │ not all of them), how many of  │       │
+│     │ wer-report/                   │ the six SSBNs have been        │       │
+│     │                               │ upgraded to the JL-3 SLBM and  │       │
+│     │                               │ whether it is assigned         │       │
+│     │                               │ multiple warheads, and how     │       │
+│     │                               │ many DF-41 ICBM launchers are  │       │
+│     │                               │ operational and how many       │       │
+│     │                               │ warheads each missile is       │       │
+│     │                               │ assigned.                      │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                    ┃ Detail                    ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Exact number of          │ Open sources confirm at   │
+│                       │ operational DF-41        │ least two brigades as of  │
+│                       │ brigades and launchers   │ 2019 parade, with         │
+│                       │ as of 2025               │ additional brigades       │
+│                       │                          │ suspected, but no         │
+│                       │                          │ authoritative public      │
+│                       │                          │ count of currently        │
+│                       │                          │ operational DF-41         │
+│                       │                          │ launchers exists as of    │
+│                       │                          │ 2025.                     │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ scope_exceeded        │ Specific warhead mating  │ Detailed operational      │
+│                       │ protocols and            │ warhead handling          │
+│                       │ pre-delegation authority │ procedures, command       │
+│                       │ for DF-41 brigades       │ authority thresholds, and │
+│                       │                          │ pre-delegation rules for  │
+│                       │                          │ DF-41 brigades are        │
+│                       │                          │ classified and not        │
+│                       │                          │ available in open         │
+│                       │                          │ sources.                  │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ contradictory_sources │ Confirmed rail-mobile    │ Multiple sources indicate │
+│                       │ DF-41 operational        │ rail-mobile DF-41 was     │
+│                       │ deployment               │ tested and considered,    │
+│                       │                          │ but no sources confirm it │
+│                       │                          │ has been operationally    │
+│                       │                          │ deployed in that basing   │
+│                       │                          │ mode as of 2025.          │
+├───────────────────────┼──────────────────────────┼───────────────────────────┤
+│ access_denied         │ Full CNS 2023 Order of   │ The PDF was identified    │
+│                       │ Battle PDF content on    │ but binary content could  │
+│                       │ DF-41 brigades           │ not be fully parsed to    │
+│                       │                          │ extract specific DF-41    │
+│                       │                          │ brigade details from the  │
+│                       │                          │ 2023 CNS Order of Battle. │
+└───────────────────────┴──────────────────────────┴───────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ new_source       │ database          │ PLARF DF-41       │ The 2023 CNS      │
+│                  │                   │ brigade order of  │ Order of Battle   │
+│                  │                   │ battle 2024 2025  │ is the most       │
+│                  │                   │ silo field        │ recent structured │
+│                  │                   │ deployment        │ OOB but may be    │
+│                  │                   │                   │ outdated given    │
+│                  │                   │                   │ rapid 2024-2025   │
+│                  │                   │                   │ expansion.        │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ China DF-41       │ The LOW posture   │
+│                  │                   │ launch on warning │ shift is          │
+│                  │                   │ posture warhead   │ documented but    │
+│                  │                   │ mating 2024 2025  │ the degree to     │
+│                  │                   │                   │ which DF-41       │
+│                  │                   │                   │ brigades          │
+│                  │                   │                   │ specifically have │
+│                  │                   │                   │ implemented it    │
+│                  │                   │                   │ versus older      │
+│                  │                   │                   │ systems is        │
+│                  │                   │                   │ unclear.          │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ China nuclear no  │ The silo          │
+│                  │                   │ first use         │ expansion and LOW │
+│                  │                   │ doctrine DF-41    │ posture raise     │
+│                  │                   │ silo expansion    │ academic          │
+│                  │                   │ strategic         │ questions about   │
+│                  │                   │ stability         │ NFU credibility   │
+│                  │                   │                   │ that may be       │
+│                  │                   │                   │ addressed in      │
+│                  │                   │                   │ recent strategic  │
+│                  │                   │                   │ studies           │
+│                  │                   │                   │ literature.       │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ contradiction    │ null              │ DF-41 rail-mobile │ MDAA lists        │
+│                  │                   │ deployment status │ rail-mobile as an │
+│                  │                   │ operational vs    │ operational       │
+│                  │                   │ testing           │ basing mode,      │
+│                  │                   │                   │ while FAS and     │
+│                  │                   │                   │ CSIS sources      │
+│                  │                   │                   │ suggest it        │
+│                  │                   │                   │ remains in        │
+│                  │                   │                   │ testing/considera │
+│                  │                   │                   │ tion phase. This  │
+│                  │                   │                   │ contradiction     │
+│                  │                   │                   │ should be         │
+│                  │                   │                   │ investigated.     │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Has China fully transitioned to │ Air University and NDU sources  │
+│          │ a launch-on-warning posture for │ confirm PLARF is 'working to    │
+│          │ DF-41 brigades, or is this      │ implement' LOW, but the degree  │
+│          │ still aspirational?             │ of actual implementation vs.    │
+│          │                                 │ doctrinal aspiration is         │
+│          │                                 │ ambiguous.                      │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ How many DF-41 silos in the     │ Reuters December 2025 report    │
+│          │ three new silo fields           │ indicates 100+ solid-fuel ICBMs │
+│          │ (Yumen/Gansu, Hami/Xinjiang,    │ loaded in silo fields; FAS 2025 │
+│          │ Ordos/Inner Mongolia) are now   │ notes continued silo            │
+│          │ loaded with missiles as of      │ development. The DF-41 vs DF-31 │
+│          │ 2025?                           │ breakdown in these silos is     │
+│          │                                 │ unclear.                        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ What is the command-and-control │ LOW posture implies faster      │
+│          │ structure for DF-41 brigades —  │ decision timelines, raising     │
+│          │ do brigade commanders have any  │ questions about whether China   │
+│          │ pre-delegated launch authority? │ has moved toward any degree of  │
+│          │                                 │ pre-delegation, which would be  │
+│          │                                 │ a major doctrinal shift.        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Has the DF-41 rail-mobile       │ Rail-mobile tests were reported │
+│          │ variant been operationally      │ in December 2015, and the 2019  │
+│          │ deployed with any PLARF         │ Pentagon report noted China     │
+│          │ brigade?                        │ 'appears to be considering'     │
+│          │                                 │ rail-mobile basing, but no      │
+│          │                                 │ confirmed operational           │
+│          │                                 │ deployment has been identified. │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the specific MIRV       │ FAS 2025 notes uncertainty      │
+│          │ loading assignment doctrine for │ about how many warheads each    │
+│          │ operational DF-41 missiles —    │ DF-41 is assigned in practice,  │
+│          │ are they typically deployed     │ which significantly affects     │
+│          │ with maximum warhead loads or   │ strategic stability             │
+│          │ reduced loads?                  │ calculations.                   │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.72                                                                │
+│ Corroborating sources: 12                                                    │
+│ Source authority: high                                                       │
+│ Contradiction detected: True                                                 │
+│ Query specificity match: 0.75                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 62857                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 132.16s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: b3d00938-5309-4faa-a20d-97a8511bb8f9
--- a/docs/stress-tests/M3.3-runs/18-scope.log
+++ b/docs/stress-tests/M3.3-runs/18-scope.log
@ -0,0 +1,272 @@
+Researching: What internal compensation bands does Goldman Sachs use for VPs in 
+2026?
+
+{"question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:15:05.792037Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:15:06.820624Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:15:06.829930Z"}
+{"question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:15:06.876139Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:06.876453Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:06.876542Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1108, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:14.969587Z"}
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 5772, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:26.767509Z"}
+{"step": 17, "decision": "Starting iteration 4/5", "tokens_so_far": 15029, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:32.149418Z"}
+{"step": 22, "decision": "Token budget reached before iteration 5: 26452/20000", "event": "budget_exhausted", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:41.927200Z"}
+{"step": 23, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 31, "iterations_run": 4, "tokens_used": 26452, "event": "synthesis_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:41.927359Z"}
+{"step": 24, "decision": "Parsed synthesis JSON successfully", "duration_ms": 65550, "event": "synthesis_complete", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:45.258119Z"}
+{"step": 42, "decision": "Research complete", "confidence": 0.62, "citation_count": 10, "gap_count": 4, "discovery_count": 3, "total_duration_sec": 102.914, "event": "complete", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:45.259163Z"}
+{"confidence": 0.62, "citations": 10, "gaps": 4, "discovery_events": 3, "tokens_used": 51829, "iterations_run": 4, "wall_time_sec": 98.38188624382019, "budget_exhausted": true, "event": "research_completed", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:16:45.259280Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:16:45.259714Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:16:45.264223Z"}
+{"trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "confidence": 0.62, "citations": 10, "tokens_used": 51829, "wall_time_sec": 98.38188624382019, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:16:45.493130Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Goldman Sachs does not publicly disclose formal internal compensation bands  │
+│ for VPs. Based on available evidence, the VP title at Goldman Sachs is a     │
+│ single, wide-band level (there are no officially published sub-bands like    │
+│ VP1/VP2/VP3 at Goldman, unlike some other banks). Compensation varies        │
+│ enormously depending on division (front office vs. middle/back office) and   │
+│ seniority within the band. Key data points for 2026: (1) Glassdoor reports a │
+│ typical total pay range of $213,109–$391,379 (25th–75th percentile) across   │
+│ ~4,695 salary submissions, covering all VP roles firm-wide. (2) Levels.fyi   │
+│ reports a median total VP compensation of $144K, which likely skews toward   │
+│ tech/engineering roles. (3) 6figr reports an average of $297K (range         │
+│ $265K–$501K, top 10% up to $514K) based on 67 profiles. (4) For front-office │
+│ Investment Banking VPs specifically, Glassdoor reports a much higher range   │
+│ of $480,547–$888,585 (25th–75th percentile) based on 14 salaries. (5)        │
+│ Industry benchmarks from Mergers & Inquisitions (2026 update) place          │
+│ front-office IB VP base salary at $250–$300K with total compensation of      │
+│ $525–$800K for NY-based roles. (6) Indeed reports an average of ~$145,324,   │
+│ consistent with a broad mix of roles. Community sources (Fishbowl) confirm   │
+│ the VP band is 'very wide' with no official internal sub-levels at Goldman;  │
+│ pay differentiation happens informally by group, skillset, and front vs.     │
+│ back office status.                                                          │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Total salary range for        │ The typical pay range is       │  0.85 │
+│     │ Goldman Sachs Vice President  │ between $213,109 (25th         │       │
+│     │ - Glassdoor                   │ percentile) and $391,379 (75th │       │
+│     │ https://www.glassdoor.com/Sal │ percentile) annually. This is  │       │
+│     │ ary/Goldman-Sachs-Vice-Presid │ based on 4,695 salaries        │       │
+│     │ ent-Salaries-E2800_D_KO14,28. │ submitted by Goldman Sachs     │       │
+│     │ htm                           │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Total salary range for        │ The typical pay range is       │  0.85 │
+│     │ Goldman Sachs Vice President  │ between $220,674 (25th         │       │
+│     │ - Glassdoor                   │ percentile) and $411,924 (75th │       │
+│     │ https://www.glassdoor.com/Sal │ percentile) annually. This is  │       │
+│     │ ary/Goldman-Sachs-V-P-Salarie │ based on 4,695 salaries        │       │
+│     │ s-E2800_D_KO14,17.htm         │ submitted by Goldman Sachs     │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ Goldman Sachs Vice President  │ The median Vice President      │  0.75 │
+│     │ Salary | $110K-$144K+ |       │ compensation in United States  │       │
+│     │ Levels.fyi                    │ package at Goldman Sachs       │       │
+│     │ https://www.levels.fyi/compan │ totals $144K per year. View    │       │
+│     │ ies/goldman-sachs/salaries/vi │ the base salary, stock, and    │       │
+│     │ ce-president                  │ bonus breakdowns for Goldman   │       │
+│     │                               │ Sachs's total compensation     │       │
+│     │                               │ packages. Last updated:        │       │
+│     │                               │ 4/6/2026                       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Goldman Sachs Vice President  │ Employees at Goldman Sachs as  │  0.70 │
+│     │ Vp Salaries 2026 |            │ Vice President Vp earn an      │       │
+│     │ $265k-$514k                   │ average of $297k, mostly       │       │
+│     │ https://6figr.com/us/salary/g │ ranging from $265k per year to │       │
+│     │ oldman-sachs--vice-president- │ $501k per year based on 67     │       │
+│     │ vp                            │ profiles. The top 10%          │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ Goldman Sachs Investment      │ The typical pay range is       │  0.65 │
+│     │ Banking Vice President ...    │ between $480,547 (25th         │       │
+│     │ https://www.glassdoor.com/Sal │ percentile) and $888,585 (75th │       │
+│     │ ary/Goldman-Sachs-Investment- │ percentile) annually. This is  │       │
+│     │ Banking-Vice-President-Salari │ based on 14 salaries submitted │       │
+│     │ es-E2800_D_KO14,47.htm        │ by Goldman Sachs               │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Investment Banker Salary and  │ Vice President (VP) | 28-40 |  │  0.88 │
+│     │ Bonus Report: 2026 Update     │ $250-$300K | $525-$800K | 3-4  │       │
+│     │ https://mergersandinquisition │ years                          │       │
+│     │ s.com/investment-banker-salar │                                │       │
+│     │ y/                            │ NOTE: All numbers are pre-tax  │       │
+│     │                               │ for New York-based             │       │
+│     │                               │ front-office roles and include │       │
+│     │                               │ base salaries and year-end     │       │
+│     │                               │ bonuses but not                │       │
+│     │                               │ signing/relocation bonuses,    │       │
+│     │                               │ stub bonuses, benefits, etc.   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ Vice President yearly         │ Average Goldman Sachs Vice     │  0.70 │
+│     │ salaries in the United States │ President yearly pay in the    │       │
+│     │ at Goldman Sachs              │ United States is approximately │       │
+│     │ https://www.indeed.com/cmp/Go │ $145,324, which is 9% below    │       │
+│     │ ldman-Sachs/salaries/Vice-Pre │ the national average. Salary   │       │
+│     │ sident                        │ estimated from                 │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Are there internal levels/    │ Goldman VP band is very wide.  │  0.72 │
+│     │ bands within the VP tit... |  │ Promoted from associate and    │       │
+│     │ Fishbowl                      │ Next step md is difficult to   │       │
+│     │ https://www.fishbowlapp.com/p │ get.                           │       │
+│     │ ost/are-there-internal-levels │                                │       │
+│     │ -bands-within-the-vp-title-at │ Yes, banks have different      │       │
+│     │ -goldman-sachs-fwiw-this-is-f │ bands depending on skillset,   │       │
+│     │ or-a-nonbusiness-internal-str │ group within the firm, front   │       │
+│     │ ategy-kind                    │ office vs back office, etc     │       │
+│     │                               │                                │       │
+│     │                               │ Not Goldman though. It's just  │       │
+│     │                               │ VP                             │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ VP of FP&A at Goldman Sachs   │ FP&A is middle office at       │  0.65 │
+│     │ salary : r/FPandA - Reddit    │ banks, they won't make         │       │
+│     │ https://www.reddit.com/r/FPan │ anywhere near $400k at VP      │       │
+│     │ dA/comments/1dgguz5/vp_of_fpa │ level. Front office VP         │       │
+│     │ _at_goldman_sachs_salary/     │ positions will all clear over  │       │
+│     │                               │ $400k in a place               │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ Goldman Sachs Vp Salaries     │ 15 to 15 yrs. Base. $179k.     │  0.65 │
+│     │ 2026 | $208k-$586k -          │ Stocks / Yr. $21k. Bonus.      │       │
+│     │ 6figr.com                     │ $120k. Total Salary. $318k.    │       │
+│     │ https://6figr.com/us/salary/g │ Goldman Sachs Vp salary levels │       │
+│     │ oldman-sachs--vp              │ ranges from Vice President     │       │
+│     │                               │ (Accountant) upto              │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category              ┃ Topic                     ┃ Detail                   ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found      │ Official internal Goldman │ Goldman Sachs does not   │
+│                       │ Sachs VP compensation     │ publicly publish its     │
+│                       │ bands                     │ internal compensation    │
+│                       │                           │ bands or grade           │
+│                       │                           │ structures. No           │
+│                       │                           │ authoritative internal   │
+│                       │                           │ HR documentation was     │
+│                       │                           │ found. All data is from  │
+│                       │                           │ third-party crowdsourced │
+│                       │                           │ salary platforms.        │
+├───────────────────────┼───────────────────────────┼──────────────────────────┤
+│ source_not_found      │ VP sub-band breakdown     │ Community sources        │
+│                       │ (VP1/VP2/VP3 equivalents) │ explicitly state Goldman │
+│                       │                           │ uses a single 'VP' title │
+│                       │                           │ with no formal           │
+│                       │                           │ sub-levels, unlike some  │
+│                       │                           │ peers. No granular       │
+│                       │                           │ sub-band salary data     │
+│                       │                           │ exists in any source     │
+│                       │                           │ reviewed.                │
+├───────────────────────┼───────────────────────────┼──────────────────────────┤
+│ scope_exceeded        │ Non-US VP compensation    │ Some sources (e.g.,      │
+│                       │ bands                     │ AmbitionBox) reference   │
+│                       │                           │ India-based VP salaries  │
+│                       │                           │ (₹49.4L–₹54.6L), but     │
+│                       │                           │ comprehensive            │
+│                       │                           │ international band data  │
+│                       │                           │ was not gathered. The    │
+│                       │                           │ question context appears │
+│                       │                           │ US-focused.              │
+├───────────────────────┼───────────────────────────┼──────────────────────────┤
+│ contradictory_sources │ Levels.fyi median         │ Levels.fyi reports a     │
+│                       │ discrepancy               │ median of $144K while    │
+│                       │                           │ Glassdoor and 6figr      │
+│                       │                           │ report $213K–$411K       │
+│                       │                           │ ranges. Levels.fyi       │
+│                       │                           │ likely captures          │
+│                       │                           │ engineering/tech VPs who │
+│                       │                           │ have different           │
+│                       │                           │ compensation structures  │
+│                       │                           │ and lower base pay than  │
+│                       │                           │ finance VPs.             │
+└───────────────────────┴───────────────────────────┴──────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ contradiction    │ database          │ Goldman Sachs VP  │ Large discrepancy │
+│                  │                   │ total             │ between           │
+│                  │                   │ compensation by   │ Levels.fyi ($144K │
+│                  │                   │ division 2025     │ median) and       │
+│                  │                   │ 2026              │ Glassdoor         │
+│                  │                   │                   │ ($213K–$391K      │
+│                  │                   │                   │ range) suggests   │
+│                  │                   │                   │ the VP population │
+│                  │                   │                   │ is heterogeneous  │
+│                  │                   │                   │ across tech and   │
+│                  │                   │                   │ finance           │
+│                  │                   │                   │ functions;        │
+│                  │                   │                   │ further           │
+│                  │                   │                   │ segmentation by   │
+│                  │                   │                   │ division would    │
+│                  │                   │                   │ resolve this.     │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ null              │ Goldman Sachs     │ Understanding how │
+│                  │                   │ internal grade    │ Goldman's VP band │
+│                  │                   │ structure VP      │ maps to peer      │
+│                  │                   │ Director MD 2026  │ banks' grade      │
+│                  │                   │                   │ systems would     │
+│                  │                   │                   │ clarify the wide  │
+│                  │                   │                   │ compensation      │
+│                  │                   │                   │ range observed.   │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ null              │ Goldman Sachs     │ Mergers &         │
+│                  │                   │ 2025 bonus pool   │ Inquisitions      │
+│                  │                   │ VP payout by      │ notes senior      │
+│                  │                   │ division          │ bankers (VPs+)    │
+│                  │                   │                   │ received          │
+│                  │                   │                   │ disproportionate  │
+│                  │                   │                   │ 2025 bonus        │
+│                  │                   │                   │ increases;        │
+│                  │                   │                   │ division-level    │
+│                  │                   │                   │ data would        │
+│                  │                   │                   │ sharpen the band  │
+│                  │                   │                   │ picture.          │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ Does Goldman Sachs use any      │ Fishbowl community posts        │
+│          │ informal internal seniority     │ confirm the VP band is wide and │
+│          │ designations within the VP      │ pay varies significantly, but   │
+│          │ title (e.g., junior VP vs.      │ it is unclear whether informal  │
+│          │ senior VP) that affect          │ tracking of seniority within    │
+│          │ compensation but are not        │ the band drives structured pay  │
+│          │ publicly disclosed?             │ steps.                          │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ How did 2025 year-end bonuses   │ Mergers & Inquisitions notes    │
+│          │ for Goldman Sachs VPs compare   │ that VPs and Directors saw      │
+│          │ to the prior year, and were     │ 10–15% total comp increases in  │
+│          │ front-office VPs                │ 2025, but Goldman-specific      │
+│          │ disproportionate beneficiaries? │ figures were not isolated.      │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Why does Levels.fyi report a    │ The discrepancy likely reflects │
+│          │ $144K median for Goldman Sachs  │ different user populations      │
+│          │ VPs when Glassdoor and 6figr    │ (tech-focused on Levels.fyi vs. │
+│          │ report ranges starting at       │ finance-focused on              │
+│          │ $213K–$265K?                    │ Glassdoor/6figr), but this has  │
+│          │                                 │ not been confirmed.             │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What is the typical             │ Fishbowl notes the VP band is   │
+│          │ time-in-grade for a Goldman     │ wide and the step to MD is      │
+│          │ Sachs VP before promotion to    │ difficult; Mergers &            │
+│          │ Managing Director, and does     │ Inquisitions gives a 3–4 year   │
+│          │ longer tenure correlate with    │ promotion window for VPs across │
+│          │ meaningfully higher within-band │ large banks.                    │
+│          │ pay?                            │                                 │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.62                                                                │
+│ Corroborating sources: 8                                                     │
+│ Source authority: medium                                                     │
+│ Contradiction detected: True                                                 │
+│ Query specificity match: 0.55                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 51829                                                                │
+│ Iterations: 4                                                                │
+│ Wall time: 98.38s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: 716e548a-ceaf-4d18-8b47-ac35e3460b52
--- a/docs/stress-tests/M3.3-runs/19-scope.log
+++ b/docs/stress-tests/M3.3-runs/19-scope.log
@@ -0,0 +1,343 @@
+Researching: How does Renaissance Technologies Medallion Fund actually generate 
+alpha?
+
+{"question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:16:46.074147Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:16:46.829107Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:16:46.837149Z"}
+{"question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:16:46.869281Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:46.869587Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:46.869675Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1104, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:56.914799Z"}
+{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8370, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:03.842868Z"}
+{"step": 21, "decision": "Token budget reached before iteration 4: 20077/20000", "event": "budget_exhausted", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:13.960507Z"}
+{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 23, "iterations_run": 3, "tokens_used": 20077, "event": "synthesis_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:13.961508Z"}
+{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74831, "event": "synthesis_complete", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:25.398868Z"}
+{"step": 42, "decision": "Research complete", "confidence": 0.82, "citation_count": 10, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 101.925, "event": "complete", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:25.400004Z"}
+{"confidence": 0.82, "citations": 10, "gaps": 4, "discovery_events": 4, "tokens_used": 43096, "iterations_run": 3, "wall_time_sec": 98.52941536903381, "budget_exhausted": true, "event": "research_completed", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:18:25.400108Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:18:25.400618Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:18:25.405316Z"}
+{"trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "confidence": 0.82, "citations": 10, "tokens_used": 43096, "wall_time_sec": 98.52941536903381, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:18:25.623416Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ Renaissance Technologies' Medallion Fund generates alpha through several     │
+│ reinforcing mechanisms, all grounded in quantitative and data-driven methods │
+│ rather than traditional financial intuition:                                 │
+│                                                                              │
+│ 1. **Statistical Arbitrage & Pattern Recognition**: The fund identifies      │
+│ subtle, recurring market inefficiencies and pricing anomalies by analyzing   │
+│ vast amounts of historical and real-time data. It profits from small         │
+│ mispricings across many trades rather than large directional bets. [Sources  │
+│ 3, 6, 8]                                                                     │
+│                                                                              │
+│ 2. **Advanced Mathematical & Quantitative Models**: Renaissance employs      │
+│ sophisticated statistical models, hidden Markov models (used as early as     │
+│ 1983), and continuously refined algorithms to predict short-term price       │
+│ movements. The firm hired mathematicians, physicists, and computer           │
+│ scientists—not traditional Wall Street traders—to build these models.        │
+│ [Sources 9, 16, 21, 23]                                                      │
+│                                                                              │
+│ 3. **Machine Learning & AI Integration**: Medallion continuously refines its │
+│ models using machine learning, allowing them to adapt to changing market     │
+│ conditions and discover non-obvious patterns. [Sources 6, 8]                 │
+│                                                                              │
+│ 4. **High-Frequency, Fully Automated Trading**: The fund executes            │
+│ 150,000–300,000 trades daily through fully automated systems, eliminating    │
+│ emotional bias and exploiting fleeting inefficiencies at scale. [Source 8]   │
+│                                                                              │
+│ 5. **Market-Neutral & Diversified Strategies**: By balancing long and short  │
+│ positions across many asset classes (equities, futures, options, currencies) │
+│ and geographies, the fund reduces exposure to broad market moves. This is    │
+│ evidenced by the fund returning +74.6% in 2008 when markets crashed.         │
+│ [Sources 6, 16]                                                              │
+│                                                                              │
+│ 6. **Leverage & Risk Management via Kelly Criterion**: Medallion uses        │
+│ significant leverage combined with disciplined risk management techniques,   │
+│ including the Kelly Criterion, to size positions optimally and control       │
+│ drawdown. [Sources 6, 8]                                                     │
+│                                                                              │
+│ 7. **Extreme Secrecy & Employee-Only Structure**: The fund has been closed   │
+│ to outside investors since 1993, aligning incentives exclusively with        │
+│ employees and partners. This exclusivity prevents strategy dilution and      │
+│ protects proprietary edge. [Sources 5, 6, 12]                                │
+│                                                                              │
+│ 8. **Massive Data Collection & Cleaning**: Renaissance amasses and           │
+│ meticulously cleans enormous datasets of historical price data, economic     │
+│ indicators, and alternative data sources as the raw material for model       │
+│ building. [Sources 15, 21]                                                   │
+│                                                                              │
+│ 9. **Collaborative, Academic Culture**: Simons fostered an open, peer-driven │
+│ environment where ideas were freely shared among top-tier scientists,        │
+│ accelerating model refinement and discovery. [Sources 16, 21]                │
+│                                                                              │
+│ The cumulative result: average annual returns of 66% before fees and 39%     │
+│ after fees from 1988 to 2018—the best sustained track record in investment   │
+│ history. A $100 investment in 1988 would have grown to approximately $398.7  │
+│ million by 2018, versus $1,815 for the S&P 500 over the same period.         │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ Renaissance Technologies: The │ Between 1988 and 2018,         │  0.97 │
+│     │ $100 Billion Built on         │ Renaissance Technologies'      │       │
+│     │ Statistical Arbitrage         │ Medallion Fund generated       │       │
+│     │ https://navnoorbawa.substack. │ average annual returns of 66%  │       │
+│     │ com/p/renaissance-technologie │ before fees and 39% after fees │       │
+│     │ s-the-100                     │ — the most successful track    │       │
+│     │                               │ record in investing history. A │       │
+│     │                               │ $100 investment in 1988 would  │       │
+│     │                               │ have grown to approximately    │       │
+│     │                               │ $398.7 million by 2018.        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ Jim Simons Trading Strategy   │ Fully automated systems        │  0.93 │
+│     │ Explained: Inside Renaissance │ executed 150,000–300,000       │       │
+│     │ Technologies                  │ trades daily, eliminating      │       │
+│     │ https://www.quantvps.com/blog │ emotional biases. Techniques   │       │
+│     │ /jim-simons-trading-strategy  │ like the Kelly Criterion and   │       │
+│     │                               │ balanced portfolios helped     │       │
+│     │                               │ control risk and maintain      │       │
+│     │                               │ consistent returns.            │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ The Curious Case of Medallion │ The fund employs sophisticated │  0.92 │
+│     │ Fund: Renaissance             │ statistical and mathematical   │       │
+│     │ Technologies' Hedge Fund      │ models to identify and         │       │
+│     │ Success                       │ capitalize on market           │       │
+│     │ https://www.schoolofhedge.com │ inefficiencies. Medallion      │       │
+│     │ /pages/the-curious-case-of-me │ integrates machine learning    │       │
+│     │ dallion-fund                  │ and artificial intelligence to │       │
+│     │                               │ refine its models continually, │       │
+│     │                               │ adapting to changing market    │       │
+│     │                               │ conditions.                    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ Decoding the Medallion Fund   │ The Medallion Fund boasts an   │  0.95 │
+│     │ Returns: What We Know About   │ unprecedented average annual   │       │
+│     │ Its Annual Performance        │ return of 66% before fees over │       │
+│     │ https://www.quantifiedstrateg │ 30 years, achieving a net      │       │
+│     │ ies.com/medallion-fund-return │ return of 39% after fees. The  │       │
+│     │ s/                            │ Medallion Fund has been closed │       │
+│     │                               │ to outside investors since     │       │
+│     │                               │ 1993 and is only available to  │       │
+│     │                               │ current and past employees and │       │
+│     │                               │ their families.                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ James Simons (Renaissance     │ In 1983 he was using Hidden    │  0.85 │
+│     │ Technologies Corp.) and his   │ Markov Models. Now he employs  │       │
+│     │ model - Quantitative Finance  │ 100+ PhDs, therefore I expect  │       │
+│     │ Stack Exchange                │ he will have 50+ strategies    │       │
+│     │ https://quant.stackexchange.c │ using 200+ predictors. And set │       │
+│     │ om/questions/30056/james-simo │ up as a production line, from  │       │
+│     │ ns-renaissance-technologies-c │ the teams importing and        │       │
+│     │ orp-and-his-model             │ cleaning data, down to         │       │
+│     │                               │ execution of trades.           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ Simons' Strategies:           │ Market-Neutral Strategies:     │  0.91 │
+│     │ Renaissance Trading Unpacked  │ Balancing long and short       │       │
+│     │ - LuxAlgo                     │ positions reduces risk. Unique │       │
+│     │ https://www.luxalgo.com/blog/ │ Hiring: Scientists and         │       │
+│     │ simons-strategies-renaissance │ mathematicians, not Wall       │       │
+│     │ -trading-unpacked/            │ Street veterans, build their   │       │
+│     │                               │ trading models. Even during    │       │
+│     │                               │ crashes like 2008, Medallion   │       │
+│     │                               │ outperformed with a 74.6%      │       │
+│     │                               │ return.                        │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ The Man Who Solved the Market │ Renaissance's success was      │  0.93 │
+│     │ by Gregory Zuckerman -        │ built on amassing and          │       │
+│     │ Summary & Notes               │ meticulously cleaning vast     │       │
+│     │ https://bagerbach.com/books/t │ amounts of historical price    │       │
+│     │ he-man-who-solved-the-market/ │ data, then using it to model   │       │
+│     │                               │ and predict market behavior.   │       │
+│     │                               │ They treated investing like a  │       │
+│     │                               │ scientific problem, forming    │       │
+│     │                               │ hypotheses, testing them       │       │
+│     │                               │ rigorously, and iterating      │       │
+│     │                               │ constantly.                    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ Cracking the Code: Inside the │ Medallion began as an          │  0.88 │
+│     │ Medallion Fund and Jim        │ experiment in pattern          │       │
+│     │ Simons' Secretive Empire      │ recognition. Over time, it     │       │
+│     │ https://medium.com/@trading.d │ evolved into a fully           │       │
+│     │ ude/cracking-the-code-inside- │ automated, high-frequency,     │       │
+│     │ the-medallion-fund-and-jim-si │ multi-strategy quant           │       │
+│     │ mons-secretive-empire-b9af084 │ powerhouse. It traded          │       │
+│     │ 15b4f                         │ everything from equities to    │       │
+│     │                               │ futures.                       │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Renaissance Technologies and  │ Renaissance Technologies,      │  0.92 │
+│     │ The Medallion Fund            │ often just referred to as      │       │
+│     │ https://quartr.com/insights/e │ RenTec, is reputed as the      │       │
+│     │ dge/renaissance-technologies- │ highest-performing investment  │       │
+│     │ and-the-medallion-fund        │ firms ever, with its Medallion │       │
+│     │                               │ Fund having returned a net     │       │
+│     │                               │ 90,129x to investors between   │       │
+│     │                               │ the years 1988-2022 leveraging │       │
+│     │                               │ a quantitative investment      │       │
+│     │                               │ approach.                      │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 10  │ Jim Simons – The Man Who      │ Simons decided to use a purely │  0.90 │
+│     │ Solved the Market - Build     │ systematic approach to avoid   │       │
+│     │ Alpha                         │ emotional rollercoasters and   │       │
+│     │ https://www.buildalpha.com/ji │ avoid common trading biases    │       │
+│     │ m-simons-the-man-who-solved-t │ that trip up most traders.     │       │
+│     │ he-market/                    │ Simons staffed the new fund,   │       │
+│     │                               │ Renaissance Technologies, with │       │
+│     │                               │ mathematicians, computer       │       │
+│     │                               │ scientists, and physicists to  │       │
+│     │                               │ pioneer.                       │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic                       ┃ Detail                      ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ access_denied    │ Specific algorithmic        │ Renaissance Technologies    │
+│                  │ details and signal types    │ maintains extreme secrecy   │
+│                  │ used by the Medallion Fund  │ around its specific trading │
+│                  │                             │ signals, factor exposures,  │
+│                  │                             │ and model architecture. No  │
+│                  │                             │ public source has ever      │
+│                  │                             │ confirmed the exact         │
+│                  │                             │ mathematical formulas,      │
+│                  │                             │ specific predictors, or     │
+│                  │                             │ strategy details. All       │
+│                  │                             │ evidence is from secondary  │
+│                  │                             │ sources and informed        │
+│                  │                             │ inference.                  │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Post-2018 performance data  │ Most verified return data   │
+│                  │ for the Medallion Fund      │ covers 1988-2018. Some      │
+│                  │                             │ sources reference           │
+│                  │                             │ performance through 2022    │
+│                  │                             │ but with less granular      │
+│                  │                             │ annual data. The fund does  │
+│                  │                             │ not file public performance │
+│                  │                             │ reports.                    │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Specific leverage ratios    │ While sources note that     │
+│                  │ used by the Medallion Fund  │ high leverage is a          │
+│                  │                             │ component of alpha          │
+│                  │                             │ generation, specific        │
+│                  │                             │ leverage multiples are not  │
+│                  │                             │ publicly disclosed and were │
+│                  │                             │ not found in the gathered   │
+│                  │                             │ evidence.                   │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Fee structure and its exact │ Sources confirm the fund    │
+│                  │ impact on net returns over  │ charges approximately 5%    │
+│                  │ time                        │ management and 44%          │
+│                  │                             │ performance fees            │
+│                  │                             │ (historically), but         │
+│                  │                             │ detailed year-by-year       │
+│                  │                             │ impact analysis was not     │
+│                  │                             │ found in the gathered       │
+│                  │                             │ evidence.                   │
+└──────────────────┴─────────────────────────────┴─────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ arxiv             │ statistical       │ Simons used       │
+│                  │                   │ arbitrage hidden  │ Hidden Markov     │
+│                  │                   │ Markov models     │ Models in 1983.   │
+│                  │                   │ financial markets │ Academic papers   │
+│                  │                   │ quantitative      │ on HMMs in        │
+│                  │                   │ trading           │ finance could     │
+│                  │                   │                   │ illuminate the    │
+│                  │                   │                   │ mathematical      │
+│                  │                   │                   │ foundation of     │
+│                  │                   │                   │ early Medallion   │
+│                  │                   │                   │ strategies.       │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ Kelly Criterion   │ The Kelly         │
+│                  │                   │ optimal position  │ Criterion is      │
+│                  │                   │ sizing hedge fund │ cited as a key    │
+│                  │                   │ leverage          │ risk management   │
+│                  │                   │ quantitative      │ tool; academic    │
+│                  │                   │ trading           │ literature could  │
+│                  │                   │                   │ clarify how it    │
+│                  │                   │                   │ specifically      │
+│                  │                   │                   │ contributes to    │
+│                  │                   │                   │ alpha             │
+│                  │                   │                   │ sustainability.   │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ new_source       │ database          │ Renaissance       │ SEC 13F filings   │
+│                  │                   │ Technologies SEC  │ for Renaissance's │
+│                  │                   │ 13F filings RIEF  │ public-facing     │
+│                  │                   │ RIDA              │ funds (RIEF,      │
+│                  │                   │ institutional     │ RIDA) could       │
+│                  │                   │ holdings          │ provide insight   │
+│                  │                   │                   │ into equity       │
+│                  │                   │                   │ selection         │
+│                  │                   │                   │ methodology,      │
+│                  │                   │                   │ though not        │
+│                  │                   │                   │ Medallion         │
+│                  │                   │                   │ directly.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ null              │ Gregory Zuckerman │ The book by       │
+│                  │                   │ The Man Who       │ Zuckerman is      │
+│                  │                   │ Solved the Market │ cited as the most │
+│                  │                   │ primary source    │ authoritative     │
+│                  │                   │ analysis          │ public account of │
+│                  │                   │                   │ Renaissance's     │
+│                  │                   │                   │ methods; a deeper │
+│                  │                   │                   │ review could      │
+│                  │                   │                   │ yield more        │
+│                  │                   │                   │ specific          │
+│                  │                   │                   │ mechanism         │
+│                  │                   │                   │ details.          │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ How has the Medallion Fund      │ Multiple sources confirm the    │
+│          │ maintained its edge as markets  │ strategy has worked for 30+     │
+│          │ have become more efficient and  │ years, but with algorithmic     │
+│          │ other quant funds have adopted  │ trading now comprising 60-73%   │
+│          │ similar approaches?             │ of U.S. equity trades, the      │
+│          │                                 │ persistence of edge is          │
+│          │                                 │ theoretically challenging.      │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ What is the role of capacity    │ The fund is closed to outside   │
+│          │ constraints in limiting         │ investors and capped in size,   │
+│          │ Medallion's AUM, and how does   │ suggesting strategy returns     │
+│          │ the fund's small size (~$10B)   │ diminish at scale. This         │
+│          │ contribute to its returns?      │ capacity question is central to │
+│          │                                 │ understanding whether the alpha │
+│          │                                 │ is truly replicable.            │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ To what extent does Medallion's │ Sources describe both           │
+│          │ alpha come from market          │ high-frequency automated        │
+│          │ microstructure exploitation     │ trading and statistical         │
+│          │ (e.g., short-term mean          │ arbitrage, but the precise time │
+│          │ reversion) vs. longer-horizon   │ horizon distribution of trades  │
+│          │ factor exposures?               │ is unknown publicly.            │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How has Medallion's strategy    │ Jim Simons passed away in May   │
+│          │ evolved since Jim Simons'       │ 2024. The sustainability of the │
+│          │ retirement from day-to-day      │ fund's culture and edge under   │
+│          │ management and his death in May │ new leadership is an open       │
+│          │ 2024?                           │ question.                       │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ What specific alternative data  │ Sources mention 'alternative    │
+│          │ sources (beyond price/volume)   │ data sources' as inputs but     │
+│          │ does Renaissance use as inputs  │ provide no specifics, leaving   │
+│          │ to its models?                  │ this dimension of the alpha     │
+│          │                                 │ generation process unresolved.  │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.82                                                                │
+│ Corroborating sources: 10                                                    │
+│ Source authority: medium                                                     │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.75                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 43096                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 98.53s                                                            │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: b7cd9d50-3eec-4eca-8db0-a580722c2b19
--- a/docs/stress-tests/M3.3-runs/20-scope.log
+++ b/docs/stress-tests/M3.3-runs/20-scope.log
@@ -0,0 +1,325 @@
+Researching: What are the precise materials and tolerances in TSMC's 2nm 
+process?
+
+{"question": "What are the precise materials and tolerances in TSMC's 2nm process?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:18:26.198498Z"}
+{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:18:26.963097Z"}
+{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:18:26.972484Z"}
+{"question": "What are the precise materials and tolerances in TSMC's 2nm process?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:18:27.004492Z"}
+{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What are the precise materials and tolerances in TSMC's 2nm process?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:27.004812Z"}
+{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:27.004904Z"}
+{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1158, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:40.769568Z"}
+{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 11802, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:47.013233Z"}
+{"step": 19, "decision": "Token budget reached before iteration 4: 30249/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:57.139804Z"}
+{"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 29, "iterations_run": 3, "tokens_used": 30249, "event": "synthesis_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:57.139984Z"}
+{"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 77777, "event": "synthesis_complete", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:20:12.633197Z"}
+{"step": 40, "decision": "Research complete", "confidence": 0.42, "citation_count": 9, "gap_count": 5, "discovery_count": 4, "total_duration_sec": 109.056, "event": "complete", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:20:12.634189Z"}
+{"confidence": 0.42, "citations": 9, "gaps": 5, "discovery_events": 4, "tokens_used": 62620, "iterations_run": 3, "wall_time_sec": 105.62861347198486, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:20:12.634324Z"}
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:20:12.634698Z"}
+{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:20:12.639617Z"}
+{"trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "confidence": 0.42, "citations": 9, "tokens_used": 62620, "wall_time_sec": 105.62861347198486, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:20:12.967147Z"}
+╭─────────────────────────────────── Answer ───────────────────────────────────╮
+│ TSMC's 2nm (N2) process node, which began volume production in Q4 2025,      │
+│ introduces several key technical advances, though precise proprietary        │
+│ materials specifications and sub-angstrom tolerances are not publicly        │
+│ disclosed. What is publicly known:                                           │
+│                                                                              │
+│ **Transistor Architecture:** N2 is TSMC's first node to use Gate-All-Around  │
+│ (GAA) nanosheet transistors, replacing the FinFET architecture used since    │
+│ 2011. The gate surrounds the silicon nanosheet channel on all sides,         │
+│ providing superior electrostatic control and reduced gate leakage compared   │
+│ to 3nm FinFETs [Sources 10, 13, 21].                                         │
+│                                                                              │
+│ **Process Node Dimensions (IEEE IRDS):** The 2nm node class is projected to  │
+│ have a contacted gate pitch of ~45nm and a tightest metal pitch of ~20nm,    │
+│ per IEEE International Roadmap for Devices and Systems (2021 update) [Source │
+│ 16].                                                                         │
+│                                                                              │
+│ **Interconnects:** N2 features copper (Cu)-based redistribution layers       │
+│ (RDLs) with flat passivation and through-silicon vias (TSVs), co-optimized   │
+│ with 3DIC integration. Middle- and back-end-of-line (MEOL/BEOL)              │
+│ interconnects are included, with the densest SRAM macro ever reported at     │
+│ approximately 38 Mb/mm² [Sources 4, 21].                                     │
+│                                                                              │
+│ **Performance Metrics (vs. N3E):** 24–35% power reduction OR 15% performance │
+│ improvement at iso-voltage; >1.15x transistor density improvement over N3    │
+│ [Sources 10, 18, 21].                                                        │
+│                                                                              │
+│ **Yield:** Initial yields reportedly ~70%, with some memory products         │
+│ exceeding 90%. A 6% yield improvement over baseline was reported in late     │
+│ 2024 [Sources 13, 14].                                                       │
+│                                                                              │
+│ **Applications:** Designed for AI, mobile, and HPC applications. Key         │
+│ customers include Apple (A20 chip for iPhone 18 Pro) and NVIDIA [Sources 8,  │
+│ 14].                                                                         │
+│                                                                              │
+│ **Fab Locations:** Primary production in Hsinchu and Kaohsiung, Taiwan; a    │
+│ Kaohsiung 2nm facility expansion ceremony was held March 31, 2025 [Source    │
+│ 6].                                                                          │
+│                                                                              │
+│ **Specific proprietary materials** (e.g., exact dielectric compositions,     │
+│ gate oxide materials, metal liner chemistries, doping concentrations, and    │
+│ nanometer-level tolerances on nanosheet thickness/width) are not publicly    │
+│ disclosed by TSMC and were not found in the available evidence.              │
+╰──────────────────────────────────────────────────────────────────────────────╯
+                                   Citations                                    
+┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
+┃ #   ┃ Title / Locator               ┃ Excerpt                        ┃  Conf ┃
+┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
+│ 1   │ TSMC shares deep-dive details │ The new production node        │  0.95 │
+│     │ about its cutting edge 2nm    │ promises a 24 to 35% power     │       │
+│     │ process node at IEDM 2024 —   │ reduction or 15% performance   │       │
+│     │ 35 percent less power or 15   │ improvement at the same        │       │
+│     │ percent more performance |    │ voltage, and 1.15X higher      │       │
+│     │ Tom's Hardware                │ transistor density than the    │       │
+│     │ https://www.tomshardware.com/ │ previous 3nm node.             │       │
+│     │ tech-industry/tsmc-shares-dee │                                │       │
+│     │ p-dive-details-about-its-cutt │                                │       │
+│     │ ing-edge-2nm-process-node-at- │                                │       │
+│     │ iedm-2024-35-percent-less-pow │                                │       │
+│     │ er-or-15-percent-more-perform │                                │       │
+│     │ ance                          │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 2   │ IEDM 2024 – TSMC 2nm Process  │ The paper states that the      │  0.95 │
+│     │ Disclosure - TechInsights     │ process delivers a 30% power   │       │
+│     │ https://library.techinsights. │ improvement or 15% performance │       │
+│     │ com/public/hg-asset/f32a0f17- │ gain and >1.15x density versus │       │
+│     │ 5369-4c97-913c-b78d2ddd833b   │ the previous 3nm node.         │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 3   │ The Shape of Tomorrow's       │ The new N2 platform features   │  0.93 │
+│     │ Semiconductor Technology -    │ GAA nanosheet transistors;     │       │
+│     │ Semiconductor Digest          │ middle-/back-end-of-line       │       │
+│     │ https://www.semiconductor-dig │ interconnects with the densest │       │
+│     │ est.com/the-shape-of-tomorrow │ SRAM macro ever reported       │       │
+│     │ s-semiconductor-technology/   │ (~38Mb/mm2); and a holistic,   │       │
+│     │                               │ system-technology co-optimized │       │
+│     │                               │ (STCO) architecture offering   │       │
+│     │                               │ great design flexibility. That │       │
+│     │                               │ architecture includes a        │       │
+│     │                               │ scalable copper-based          │       │
+│     │                               │ redistribution layer and a     │       │
+│     │                               │ flat passivation layer (for    │       │
+│     │                               │ better performance, robust     │       │
+│     │                               │ CPI, and seamless 3D           │       │
+│     │                               │ integration); and              │       │
+│     │                               │ through-silicon vias, or TSVs. │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 4   │ 2 nm process - Wikipedia      │ According to the projections   │  0.90 │
+│     │ https://en.wikipedia.org/wiki │ contained in the 2021 update   │       │
+│     │ /2_nm_process                 │ of the International Roadmap   │       │
+│     │                               │ for Devices and Systems        │       │
+│     │                               │ published by the Institute of  │       │
+│     │                               │ Electrical and Electronics     │       │
+│     │                               │ Engineers (IEEE), a '2.1 nm    │       │
+│     │                               │ node range label' is expected  │       │
+│     │                               │ to have a contacted gate pitch │       │
+│     │                               │ of 45 nanometers and a         │       │
+│     │                               │ tightest metal pitch of 20     │       │
+│     │                               │ nanometers.                    │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 5   │ TSMC Boosts 2 nm Yields by    │ A key innovation in the N2     │  0.88 │
+│     │ 6%, Passing Savings to        │ process is the enhanced design │       │
+│     │ Customers | TechPowerUp       │ of its GAA nanosheet           │       │
+│     │ https://www.techpowerup.com/3 │ transistors, which offers      │       │
+│     │ 29435/tsmc-boosts-2-nm-yields │ improved electrostatic control │       │
+│     │ -by-6-passing-savings-to-cust │ and reduced gate leakage       │       │
+│     │ omers                         │ compared to 3 nm FinFET        │       │
+│     │                               │ transistors, given that the    │       │
+│     │                               │ gate can be controlled from    │       │
+│     │                               │ all sides. This advancement    │       │
+│     │                               │ enables smaller high-density   │       │
+│     │                               │ transistors to maintain        │       │
+│     │                               │ reliable performance through   │       │
+│     │                               │ better threshold voltage       │       │
+│     │                               │ tuning capabilities.           │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 6   │ TSMC 2nm, full details        │ This 2nm platform technology   │  0.82 │
+│     │ revealed-Electronics          │ includes new Cu RDLs with flat │       │
+│     │ Headlines-EEWORLD             │ passivation and TSVs,          │       │
+│     │ https://en.eeworld.com.cn/mp/ │ optimized holistically with    │       │
+│     │ Icbank/a391002.jspx           │ 3DIC to enable system          │       │
+│     │                               │ integration.                   │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 7   │ TSMC begins quietly volume    │ TSMC has quietly revealed that │  0.97 │
+│     │ production of 2nm-class chips │ it had commenced volume        │       │
+│     │ | Tom's Hardware              │ production of chips using its  │       │
+│     │ https://www.tomshardware.com/ │ N2 (2nm-class) fabrication     │       │
+│     │ tech-industry/semiconductors/ │ process... 'TSMC's 2nm (N2)    │       │
+│     │ tsmc-begins-quietly-volume-pr │ technology has started volume  │       │
+│     │ oduction-of-2nm-class-chips-f │ production in 4Q25 as          │       │
+│     │ irst-gaa-transistor-for-tsmc- │ planned.'                      │       │
+│     │ claims-up-to-15-percent-impro │                                │       │
+│     │ vement-at-iso-power           │                                │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 8   │ TSMC's 2nm Yield Rates Surge  │ Initial tsmc 2nm yield rates   │  0.75 │
+│     │ as Mass Production Ramps Up   │ are notably high, reportedly   │       │
+│     │ in 2026                       │ reaching around 70%. Some      │       │
+│     │ https://heqingele.com/blog/ts │ reports even indicate yields   │       │
+│     │ mc-2nm-yield-rates-mass-produ │ surpassing 90% for certain     │       │
+│     │ ction-status-2026/            │ memory products.               │       │
+├─────┼───────────────────────────────┼────────────────────────────────┼───────┤
+│ 9   │ Unlocking the Future: TSMC's  │ On March 31, 2025, TSMC held   │  0.80 │
+│     │ Bold Strategy for the 2nm     │ an expansion ceremony for its  │       │
+│     │ Revolution!                   │ 2nm production facility in     │       │
+│     │ https://tspasemiconductor.sub │ Kaohsiung, marking a           │       │
+│     │ stack.com/p/unlocking-the-fut │ significant milestone in       │       │
+│     │ ure-tsmcs-bold-strategy-cb2   │ Taiwan's semiconductor         │       │
+│     │                               │ advanced manufacturing         │       │
+│     │                               │ expansion.                     │       │
+└─────┴───────────────────────────────┴────────────────────────────────┴───────┘
+                                      Gaps                                      
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Category         ┃ Topic                       ┃ Detail                      ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ source_not_found │ Exact dielectric and gate   │ TSMC does not publicly      │
+│                  │ oxide materials used in N2  │ disclose the specific       │
+│                  │ GAA nanosheet transistors   │ high-k dielectric           │
+│                  │                             │ materials, interfacial      │
+│                  │                             │ layer compositions, or work │
+│                  │                             │ function metal chemistries  │
+│                  │                             │ used in the N2 gate stack.  │
+│                  │                             │ These are considered core   │
+│                  │                             │ IP.                         │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Nanosheet thickness and     │ The precise nanometer-scale │
+│                  │ width tolerances            │ dimensions and process      │
+│                  │                             │ tolerances (e.g., nanosheet │
+│                  │                             │ thickness variation,        │
+│                  │                             │ critical dimension          │
+│                  │                             │ uniformity) for N2 GAA      │
+│                  │                             │ nanosheets are not publicly │
+│                  │                             │ available.                  │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Metal interconnect liner    │ While Cu RDLs are           │
+│                  │ and barrier materials       │ confirmed, the specific     │
+│                  │                             │ barrier/liner materials     │
+│                  │                             │ (e.g., whether ruthenium or │
+│                  │                             │ cobalt liners replace       │
+│                  │                             │ TaN/Ta at this node) are    │
+│                  │                             │ not disclosed in public     │
+│                  │                             │ sources.                    │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ Doping profiles and implant │ Source/drain doping         │
+│                  │ specifications              │ concentrations, implant     │
+│                  │                             │ energies, and anneal        │
+│                  │                             │ conditions are proprietary  │
+│                  │                             │ and not published.          │
+├──────────────────┼─────────────────────────────┼─────────────────────────────┤
+│ source_not_found │ EUV lithography specifics   │ The number of EUV exposures │
+│                  │ (number of EUV layers,      │ per layer, overlay          │
+│                  │ stochastic defect control   │ tolerances, and specific    │
+│                  │ methods)                    │ stochastic control          │
+│                  │                             │ approaches are not detailed │
+│                  │                             │ in public TSMC disclosures. │
+└──────────────────┴─────────────────────────────┴─────────────────────────────┘
+                                Discovery Events                                
+┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃                  ┃ Suggested         ┃                   ┃                   ┃
+┃ Type             ┃ Researcher        ┃ Query             ┃ Reason            ┃
+┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ related_research │ arxiv             │ TSMC N2 nanosheet │ IEEE IEDM 2024    │
+│                  │                   │ GAA transistor    │ papers from TSMC  │
+│                  │                   │ gate stack        │ may contain more  │
+│                  │                   │ materials high-k  │ specific          │
+│                  │                   │ dielectric IEDM   │ materials details │
+│                  │                   │ 2024              │ in the full       │
+│                  │                   │                   │ published         │
+│                  │                   │                   │ proceedings not   │
+│                  │                   │                   │ summarized in     │
+│                  │                   │                   │ news articles.    │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ TSMC 2nm N2       │ TSMC patent       │
+│                  │                   │ process patent    │ filings related   │
+│                  │                   │ filings nanosheet │ to N2 may reveal  │
+│                  │                   │ gate-all-around   │ specific          │
+│                  │                   │ materials         │ materials         │
+│                  │                   │                   │ choices,          │
+│                  │                   │                   │ tolerances, and   │
+│                  │                   │                   │ process           │
+│                  │                   │                   │ innovations that  │
+│                  │                   │                   │ are not in press  │
+│                  │                   │                   │ releases.         │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ arxiv             │ gate-all-around   │ Academic          │
+│                  │                   │ nanosheet         │ literature on GAA │
+│                  │                   │ transistor        │ nanosheet         │
+│                  │                   │ silicon channel   │ fabrication may   │
+│                  │                   │ thickness         │ reveal typical    │
+│                  │                   │ variation         │ tolerance ranges  │
+│                  │                   │ tolerance 2nm     │ used at the 2nm   │
+│                  │                   │                   │ class node even   │
+│                  │                   │                   │ if not            │
+│                  │                   │                   │ TSMC-specific.    │
+├──────────────────┼───────────────────┼───────────────────┼───────────────────┤
+│ related_research │ database          │ TechInsights TSMC │ TechInsights      │
+│                  │                   │ N2 teardown       │ performs physical │
+│                  │                   │ materials         │ reverse           │
+│                  │                   │ analysis 2025     │ engineering of    │
+│                  │                   │                   │ chips and may     │
+│                  │                   │                   │ have detailed N2  │
+│                  │                   │                   │ materials         │
+│                  │                   │                   │ analysis          │
+│                  │                   │                   │ available through │
+│                  │                   │                   │ their             │
+│                  │                   │                   │ subscription      │
+│                  │                   │                   │ service.          │
+└──────────────────┴───────────────────┴───────────────────┴───────────────────┘
+                                 Open Questions                                 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Priority ┃ Question                        ┃ Context                         ┃
+┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ high     │ What specific high-k dielectric │ Public sources confirm GAA      │
+│          │ and metal gate materials does   │ nanosheet architecture but do   │
+│          │ TSMC use in the N2 GAA          │ not specify gate dielectric     │
+│          │ nanosheet gate stack?           │ (e.g., HfO2 variants) or work   │
+│          │                                 │ function metal compositions     │
+│          │                                 │ used to achieve threshold       │
+│          │                                 │ voltage tuning.                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ Has TSMC adopted ruthenium or   │ At 20nm metal pitch,            │
+│          │ other alternative metals for    │ traditional TaN/Ta/Cu stacks    │
+│          │ BEOL interconnect liners in N2  │ face resistance issues; Intel   │
+│          │ to reduce resistance at tight   │ and others have explored Mo and │
+│          │ pitches?                        │ Ru. TSMC's specific choice for  │
+│          │                                 │ N2 BEOL is not disclosed in     │
+│          │                                 │ public sources.                 │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ high     │ What is the actual silicon      │ GAA nanosheet devices typically │
+│          │ nanosheet thickness and stack   │ stack 3-4 nanosheets; TSMC has  │
+│          │ count in TSMC's N2 process?     │ not publicly specified          │
+│          │                                 │ nanosheet dimensions or stack   │
+│          │                                 │ count for N2.                   │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ How does TSMC's N2 defect       │ A LinkedIn post references      │
+│          │ density compare quantitatively  │ Tom's Hardware reporting that   │
+│          │ to N3 at equivalent production  │ TSMC disclosed N2 defect        │
+│          │ maturity?                       │ density is lower than N3 at the │
+│          │                                 │ same stage of development, but  │
+│          │                                 │ specific numbers were not found │
+│          │                                 │ in the gathered sources.        │
+├──────────┼─────────────────────────────────┼─────────────────────────────────┤
+│ medium   │ Will TSMC's N2P (enhanced N2)   │ Sources mention N2P is a 5%     │
+│          │ node incorporate backside power │ speed-enhanced version of N2    │
+│          │ delivery network (BSPDN), and   │ targeting qualification         │
+│          │ what materials/process changes  │ completion; the SemiAnalysis    │
+│          │ does that entail?               │ report discusses BSPDN as a key │
+│          │                                 │ innovation at 2nm class nodes,  │
+│          │                                 │ and its material implications   │
+│          │                                 │ differ significantly.           │
+└──────────┴─────────────────────────────────┴─────────────────────────────────┘
+╭───────────────────────────────── Confidence ─────────────────────────────────╮
+│ Overall: 0.42                                                                │
+│ Corroborating sources: 9                                                     │
+│ Source authority: medium                                                     │
+│ Contradiction detected: False                                                │
+│ Query specificity match: 0.30                                                │
+│ Budget status: spent                                                         │
+│ Recency: current                                                             │
+╰──────────────────────────────────────────────────────────────────────────────╯
+╭──────────────────────────────────── Cost ────────────────────────────────────╮
+│ Tokens: 62620                                                                │
+│ Iterations: 3                                                                │
+│ Wall time: 105.63s                                                           │
+│ Model: claude-sonnet-4-6                                                     │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+trace_id: a4bb5b7a-61dd-446b-8c06-06c78de5fef7
--- a/scripts/calibration_collect.py
+++ b/scripts/calibration_collect.py
@ -0,0 +1,225 @@
+"""scripts/calibration_collect.py
+
+M3.3 Phase A: load every persisted ResearchResult under
+~/.marchwarden/traces/*.result.json and emit a markdown rating worksheet
+to docs/stress-tests/M3.3-rating-worksheet.md.
+
+The worksheet has one row per run with the model's self-reported confidence
+and a blank `actual_rating` column for human review (Phase B). After rating
+is complete, scripts/calibration_analyze.py (Phase C) will load the same
+file with the rating column populated and compute calibration error.
+
+Usage:
+    .venv/bin/python scripts/calibration_collect.py
+
+Optional env:
+    TRACE_DIR — override default ~/.marchwarden/traces
+    OUT       — override default docs/stress-tests/M3.3-rating-worksheet.md
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(REPO_ROOT))
+
+from researchers.web.models import ResearchResult  # noqa: E402
+
+
+def _load_results(trace_dir: Path) -> list[tuple[Path, ResearchResult]]:
+    """Load every <id>.result.json under trace_dir, sorted by mtime."""
+    files = sorted(trace_dir.glob("*.result.json"), key=lambda p: p.stat().st_mtime)
+    out: list[tuple[Path, ResearchResult]] = []
+    for f in files:
+        try:
+            result = ResearchResult.model_validate_json(f.read_text(encoding="utf-8"))
+        except Exception as exc:
+            print(f"warning: skipping {f.name}: {exc}", file=sys.stderr)
+            continue
+        out.append((f, result))
+    return out
+
+
+def _gap_summary(result: ResearchResult) -> str:
+    """Render gap categories with counts, e.g. 'source_not_found(2), scope_exceeded(1)'."""
+    if not result.gaps:
+        return "—"
+    counts: dict[str, int] = {}
+    for g in result.gaps:
+        cat = g.category.value if hasattr(g.category, "value") else str(g.category)
+        counts[cat] = counts.get(cat, 0) + 1
+    return ", ".join(f"{k}({v})" for k, v in sorted(counts.items()))
+
+
+def _category_map(runs_dir: Path) -> dict[str, str]:
+    """Map trace_id -> category by parsing scripts/calibration_runner.sh log files.
+
+    Each log file is named like ``01-factual.log`` and contains a final
+    ``trace_id: <uuid>`` line emitted by the CLI.
+    """
+    out: dict[str, str] = {}
+    if not runs_dir.exists():
+        return out
+    for log in runs_dir.glob("*.log"):
+        # filename format: NN-category.log
+        stem = log.stem
+        parts = stem.split("-", 1)
+        if len(parts) != 2:
+            continue
+        category = parts[1]
+        try:
+            text = log.read_text(encoding="utf-8")
+        except Exception:
+            continue
+        # Find the last "trace_id: <uuid>" line
+        trace_id = None
+        for line in text.splitlines():
+            if "trace_id:" in line:
+                # Keep whatever follows the last "trace_id:" on this line
+                token = line.split("trace_id:")[-1].strip()
+                # Take only the first whitespace-delimited token (the UUID)
+                token = token.split()[0] if token else ""
+                # Strip literal rich markup tags; ANSI escapes are not handled
+                token = token.replace("[/dim]", "").replace("[dim]", "")
+                if token:
+                    trace_id = token
+        if trace_id:
+            out[trace_id] = category
+    return out
+
+
+def _question_from_trace(trace_dir: Path, trace_id: str) -> str:
+    """Recover the original question from the trace JSONL's `start` event."""
+    jsonl = trace_dir / f"{trace_id}.jsonl"
+    if not jsonl.exists():
+        return "(question not recoverable — trace missing)"
+    try:
+        for line in jsonl.read_text(encoding="utf-8").splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            entry = json.loads(line)
+            if entry.get("action") == "start":
+                return entry.get("question", "(no question field)")
+    except Exception as exc:
+        return f"(parse error: {exc})"
+    return "(no start event)"
+
+
+def _build_worksheet(
+    rows: list[tuple[Path, ResearchResult]],
+    trace_dir: Path,
+    category_map: dict[str, str],
+) -> str:
+    """Render the markdown worksheet."""
+    lines: list[str] = []
+    lines.append("# M3.3 Calibration Rating Worksheet")
+    lines.append("")
+    lines.append("Issue: #46 (Phase B — human rating)")
+    lines.append("")
+    lines.append(
+        "## How to use this worksheet"
+    )
+    lines.append("")
+    lines.append(
+        "For each run below, read the answer + citations from the persisted "
+        "result file (paths are listed under **Result files** at the end). Score the answer's "
+        "*actual* correctness on a 0.0–1.0 scale, **independent** of the "
+        "model's self-reported confidence. Fill in the **actual_rating** "
+        "column. Add notes in the **notes** column for anything unusual."
+    )
+    lines.append("")
+    lines.append("Rating rubric:")
+    lines.append("")
+    lines.append("- **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.")
+    lines.append("- **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance.")
+    lines.append("- **0.6** — Substantively right but with notable errors, missing context, or weak citations.")
+    lines.append("- **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons.")
+    lines.append("- **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing.")
+    lines.append("- **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question.")
+    lines.append("")
+    lines.append("After rating all rows, save this file and run:")
+    lines.append("")
+    lines.append("```")
+    lines.append(".venv/bin/python scripts/calibration_analyze.py")
+    lines.append("```")
+    lines.append("")
+    lines.append(f"## Runs ({len(rows)} total)")
+    lines.append("")
+    lines.append(
+        "| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |"
+    )
+    lines.append(
+        "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|"
+    )
+
+    for i, (path, result) in enumerate(rows, 1):
+        cf = result.confidence_factors
+        cm = result.cost_metadata
+        question = _question_from_trace(trace_dir, result.trace_id).replace("|", "\\|")
+        # Truncate long questions for table readability
+        if len(question) > 80:
+            question = question[:77] + "..."
+        gaps = _gap_summary(result).replace("|", "\\|")
+        contradiction = "yes" if cf.contradiction_detected else "no"
+        budget = "spent" if cf.budget_exhausted else "under"
+        recency = cf.recency or "—"
+        category = category_map.get(result.trace_id, "ad-hoc")
+        lines.append(
+            f"| {i} "
+            f"| `{result.trace_id[:8]}` "
+            f"| {category} "
+            f"| {question} "
+            f"| {result.confidence:.2f} "
+            f"| {cf.num_corroborating_sources} "
+            f"| {cf.source_authority} "
+            f"| {contradiction} "
+            f"| {budget} "
+            f"| {recency} "
+            f"| {gaps} "
+            f"| {len(result.citations)} "
+            f"| {len(result.discovery_events)} "
+            f"| {cm.tokens_used} "
+            f"|  "
+            f"|  |"
+        )
+
+    lines.append("")
+    lines.append("## Result files (full content for review)")
+    lines.append("")
+    for i, (path, result) in enumerate(rows, 1):
+        lines.append(f"{i}. `{path}`")
+    lines.append("")
+    return "\n".join(lines)
+
+
+def main() -> int:
+    trace_dir = Path(
+        os.environ.get("TRACE_DIR", os.path.expanduser("~/.marchwarden/traces"))
+    )
+    out_path = Path(
+        os.environ.get("OUT", REPO_ROOT / "docs/stress-tests/M3.3-rating-worksheet.md")
+    )
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+
+    rows = _load_results(trace_dir)
+    if not rows:
+        print(f"No result files found under {trace_dir}", file=sys.stderr)
+        return 1
+
+    runs_dir = REPO_ROOT / "docs/stress-tests/M3.3-runs"
+    category_map = _category_map(runs_dir)
+
+    out_path.write_text(
+        _build_worksheet(rows, trace_dir, category_map), encoding="utf-8"
+    )
+    print(f"Wrote {len(rows)}-row worksheet to {out_path}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
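
Review note on `_category_map`: the substring scan above removes only literal `[dim]` tags, so a log captured with terminal colour codes would yield a token that never matches a result's `trace_id`. A minimal sketch of a stricter parser follows, assuming logs may contain ANSI escape sequences and that trace IDs are canonical lowercase UUIDs (as in the run log above). `last_trace_id` and both patterns are hypothetical names, not part of this PR.

```python
from __future__ import annotations

import re

# Hypothetical patterns for illustration; not part of the PR.
# Matches CSI-style ANSI escapes such as "\x1b[2m" or "\x1b[0m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")
# Matches "trace_id: <uuid>" with a canonical lowercase UUID.
TRACE_RE = re.compile(
    r"trace_id:\s*"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def last_trace_id(text: str) -> str | None:
    """Return the last trace_id found in a run log, or None if absent."""
    clean = ANSI_RE.sub("", text)  # drop colour codes before matching
    matches = TRACE_RE.findall(clean)
    return matches[-1] if matches else None
```

Anchoring on the full UUID shape also rejects stray `trace_id:` mentions inside answer text, which the substring check would accept.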
--- a/scripts/calibration_runner.sh
+++ b/scripts/calibration_runner.sh
@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+# scripts/calibration_runner.sh
+#
+# M3.3 Phase A: run a fixed set of 20 balanced-depth calibration queries.
+# Each run writes a trace JSONL and a result.json under ~/.marchwarden/traces/.
+# This script is NOT idempotent: it tracks no state, so re-running it
+# produces 20 NEW traces. Don't re-run unless you want fresh data.
+#
+# Categories (5 each):
+#   - factual: single verifiable answer
+#   - comparative: X vs Y across some dimension
+#   - contradiction-prone: contested topics, sources disagree
+#   - scope-edge: niche, proprietary, or expert-only knowledge
+
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+PY=".venv/bin/python"
+LOG_DIR="docs/stress-tests/M3.3-runs"
+mkdir -p "$LOG_DIR"
+
+declare -a QUERIES=(
+  # factual
+  "factual|01|What is the boiling point of liquid nitrogen at standard atmospheric pressure?"
+  "factual|02|When did the James Webb Space Telescope launch?"
+  "factual|03|What programming language is the Linux kernel primarily written in?"
+  "factual|04|What is the capital of Mongolia?"
+  "factual|05|How many amino acids are encoded by the standard genetic code?"
+  # comparative
+  "comparative|06|Compare the energy density of lithium-ion vs sodium-ion batteries."
+  "comparative|07|Compare PostgreSQL and SQLite for embedded analytics workloads."
+  "comparative|08|Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing."
+  "comparative|09|Compare React and Vue for large enterprise frontends in 2026."
+  "comparative|10|Compare wind and solar capacity factors in the continental United States."
+  # contradiction-prone
+  "contradiction|11|Is red wine good for cardiovascular health?"
+  "contradiction|12|Does intermittent fasting extend lifespan in humans?"
+  "contradiction|13|Are nuclear power plants safe?"
+  "contradiction|14|Is dietary cholesterol harmful?"
+  "contradiction|15|Does screen time harm child development?"
+  # scope-edge
+  "scope|16|What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?"
+  "scope|17|What is the actual operational doctrine of Chinese DF-41 ICBM brigades?"
+  "scope|18|What internal compensation bands does Goldman Sachs use for VPs in 2026?"
+  "scope|19|How does Renaissance Technologies Medallion Fund actually generate alpha?"
+  "scope|20|What are the precise materials and tolerances in TSMC's 2nm process?"
+)
+
+echo "Running ${#QUERIES[@]} calibration queries at depth=balanced..."
+echo "Output dir: $LOG_DIR"
+echo
+
+for entry in "${QUERIES[@]}"; do
+  IFS='|' read -r category num question <<<"$entry"
+  log_file="$LOG_DIR/${num}-${category}.log"
+  echo "[$num/$category] $question"
+  if "$PY" -m cli.main ask "$question" --depth balanced >"$log_file" 2>&1; then
+    trace_id=$(grep -oE 'trace_id: [a-f0-9-]+' "$log_file" | tail -1 | awk '{print $2}' || true)
+    echo "    -> $trace_id"
+  else
+    echo "    !! FAILED — see $log_file"
+  fi
+done
+
+echo
+echo "Done. Result files at ~/.marchwarden/traces/*.result.json"
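
Review note on Phase C: the collector's docstring says `scripts/calibration_analyze.py` will reload the worksheet once `actual_rating` is filled in and compute calibration error. That script is not in this diff, so the following is only a sketch of what the step might look like. The function names and the two metrics (mean absolute and mean signed gap between `model_conf` and `actual_rating`) are assumptions, not the project's chosen method.

```python
from __future__ import annotations

import re

# Split on "|" but not on the escaped "\|" the collector writes into questions.
CELL_SPLIT = re.compile(r"(?<!\\)\|")


def parse_rated_rows(markdown: str) -> list[tuple[float, float]]:
    """Return (model_conf, actual_rating) pairs for rows that have a rating."""
    conf_idx = rating_idx = None
    pairs: list[tuple[float, float]] = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in CELL_SPLIT.split(line)[1:-1]]
        if "model_conf" in cells and "actual_rating" in cells:
            # Header row: remember where the two columns live.
            conf_idx = cells.index("model_conf")
            rating_idx = cells.index("actual_rating")
            continue
        if conf_idx is None or len(cells) <= max(conf_idx, rating_idx):
            continue
        try:
            pairs.append((float(cells[conf_idx]), float(cells[rating_idx])))
        except ValueError:
            continue  # separator row, or a run that has not been rated yet
    return pairs


def calibration_summary(pairs: list[tuple[float, float]]) -> dict[str, float]:
    """Summarise the confidence-minus-rating gap across rated runs."""
    if not pairs:
        raise ValueError("no rated rows found")
    gaps = [conf - actual for conf, actual in pairs]
    n = len(gaps)
    return {
        "n": float(n),
        "mean_abs_error": sum(abs(g) for g in gaps) / n,
        # Positive means the model is overconfident on average.
        "mean_signed_error": sum(gaps) / n,
    }
```

Unrated rows are skipped rather than treated as zero, so a partially rated worksheet still gives a usable interim figure.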