From 13215d7ddb59e70a59a8b0aad098bb40b4b5bd90 Mon Sep 17 00:00:00 2001
From: Jeff Smith
Date: Wed, 8 Apr 2026 20:21:47 -0600
Subject: [PATCH] =?UTF-8?q?docs(stress-tests):=20M3.3=20Phase=20A=20?=
 =?UTF-8?q?=E2=80=94=20calibration=20data=20collection?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Issue #46 (Phase A only — Phase B human rating still pending, issue
stays open).

Adds the data-collection half of the calibration milestone:

- scripts/calibration_runner.sh — runs 20 fixed balanced-depth queries
  across 4 categories (factual, comparative, contradiction-prone,
  scope-edge), 5 each, capturing per-run logs to
  docs/stress-tests/M3.3-runs/.
- scripts/calibration_collect.py — loads every persisted ResearchResult
  under ~/.marchwarden/traces/*.result.json and emits a markdown rating
  worksheet with one row per run. Recovers question text from each
  trace's start event and category from the run-log filename.
- docs/stress-tests/M3.3-rating-worksheet.md — 22 runs (20 calibration +
  caffeine smoke + M3.2 multi-axis), with empty actual_rating columns
  for the human-in-the-loop scoring step.
- docs/stress-tests/M3.3-runs/*.log — runtime logs from the calibration
  runner, kept as provenance. Gitignore updated with an exception
  carving stress-test logs out of the global *.log ignore.

Note: M3.1's 4 runs predate #54 (full result persistence) and so are
unrecoverable to the worksheet — only post-#54 runs have a result.json
sibling. 22 rateable runs is still within the milestone target of 20–30.

Phases B (human rating) and C (analysis + rubric + wiki update) follow
in a later session. This issue stays open until both are done.
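
For readers skimming the patch, the collection step described above can be
sketched roughly as follows. This is a minimal illustration, not the contents
of scripts/calibration_collect.py: the function names, the top-level
`confidence` key in the result JSON, and the reduced column set are all
assumptions. It relies only on things visible in this patch: result files are
named `<trace_id>.result.json`, run logs are named like `01-factual.log`, each
run log ends with a `trace_id: <uuid>` line, and runs without a matching log
appear in the worksheet as `ad-hoc`.

```python
import json
import re
from pathlib import Path

# Matches the bare "trace_id: <uuid>" line at the end of each run log.
# JSON log lines use a quoted "trace_id" key mid-line, so they do not match.
TRACE_ID_RE = re.compile(r"^trace_id:\s*([0-9a-f-]{36})\s*$", re.MULTILINE)
# Matches run-log filenames such as "01-factual.log".
LOG_NAME_RE = re.compile(r"^\d+-(?P<category>[a-z-]+)\.log$")


def categories_from_logs(runs_dir: Path) -> dict[str, str]:
    """Map trace_id -> category, taking the category from the log filename."""
    mapping: dict[str, str] = {}
    for log_path in sorted(runs_dir.glob("*.log")):
        name_match = LOG_NAME_RE.match(log_path.name)
        if name_match is None:
            continue  # not a calibration run log
        id_match = TRACE_ID_RE.search(log_path.read_text(encoding="utf-8"))
        if id_match is not None:
            mapping[id_match.group(1)] = name_match.group("category")
    return mapping


def worksheet_rows(traces_dir: Path, runs_dir: Path) -> list[str]:
    """Build one markdown table row per persisted result file."""
    categories = categories_from_logs(runs_dir)
    rows: list[str] = []
    result_paths = sorted(traces_dir.glob("*.result.json"))
    for index, result_path in enumerate(result_paths, start=1):
        result = json.loads(result_path.read_text(encoding="utf-8"))
        # The trace id is the filename with the ".result.json" suffix removed.
        trace_id = result_path.name.removesuffix(".result.json")
        # Runs with no matching calibration log are labelled "ad-hoc".
        category = categories.get(trace_id, "ad-hoc")
        # Hypothetical field name; the real ResearchResult schema may differ.
        confidence = result.get("confidence")
        if isinstance(confidence, (int, float)):
            conf_text = f"{confidence:.2f}"
        else:
            conf_text = "?"
        # The last two cells (actual_rating, notes) stay empty for the rater.
        rows.append(
            f"| {index} | `{trace_id[:8]}` | {category} | {conf_text} | | |"
        )
    return rows
```

Calling `worksheet_rows(Path.home() / ".marchwarden" / "traces",
Path("docs/stress-tests/M3.3-runs"))` would then yield rows in filename order.
Keying the category on the committed logs rather than on the result files is
what makes the .gitignore exception in this patch necessary.
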
Co-Authored-By: Claude Opus 4.6 (1M context) --- .gitignore | 3 + docs/stress-tests/M3.3-rating-worksheet.md | 74 ++++ docs/stress-tests/M3.3-runs/01-factual.log | 128 ++++++ docs/stress-tests/M3.3-runs/02-factual.log | 145 +++++++ docs/stress-tests/M3.3-runs/03-factual.log | 179 +++++++++ docs/stress-tests/M3.3-runs/04-factual.log | 115 ++++++ docs/stress-tests/M3.3-runs/05-factual.log | 148 +++++++ .../stress-tests/M3.3-runs/06-comparative.log | 226 +++++++++++ .../stress-tests/M3.3-runs/07-comparative.log | 350 ++++++++++++++++ .../stress-tests/M3.3-runs/08-comparative.log | 364 +++++++++++++++++ .../stress-tests/M3.3-runs/09-comparative.log | 378 ++++++++++++++++++ .../stress-tests/M3.3-runs/10-comparative.log | 310 ++++++++++++++ .../M3.3-runs/11-contradiction.log | 236 +++++++++++ .../M3.3-runs/12-contradiction.log | 330 +++++++++++++++ .../M3.3-runs/13-contradiction.log | 260 ++++++++++++ .../M3.3-runs/14-contradiction.log | 358 +++++++++++++++++ .../M3.3-runs/15-contradiction.log | 48 +++ docs/stress-tests/M3.3-runs/16-scope.log | 321 +++++++++++++++ docs/stress-tests/M3.3-runs/17-scope.log | 344 ++++++++++++++++ docs/stress-tests/M3.3-runs/18-scope.log | 272 +++++++++++++ docs/stress-tests/M3.3-runs/19-scope.log | 343 ++++++++++++++++ docs/stress-tests/M3.3-runs/20-scope.log | 325 +++++++++++++++ scripts/calibration_collect.py | 225 +++++++++++ scripts/calibration_runner.sh | 67 ++++ 24 files changed, 5549 insertions(+) create mode 100644 docs/stress-tests/M3.3-rating-worksheet.md create mode 100644 docs/stress-tests/M3.3-runs/01-factual.log create mode 100644 docs/stress-tests/M3.3-runs/02-factual.log create mode 100644 docs/stress-tests/M3.3-runs/03-factual.log create mode 100644 docs/stress-tests/M3.3-runs/04-factual.log create mode 100644 docs/stress-tests/M3.3-runs/05-factual.log create mode 100644 docs/stress-tests/M3.3-runs/06-comparative.log create mode 100644 docs/stress-tests/M3.3-runs/07-comparative.log create mode 100644 
docs/stress-tests/M3.3-runs/08-comparative.log create mode 100644 docs/stress-tests/M3.3-runs/09-comparative.log create mode 100644 docs/stress-tests/M3.3-runs/10-comparative.log create mode 100644 docs/stress-tests/M3.3-runs/11-contradiction.log create mode 100644 docs/stress-tests/M3.3-runs/12-contradiction.log create mode 100644 docs/stress-tests/M3.3-runs/13-contradiction.log create mode 100644 docs/stress-tests/M3.3-runs/14-contradiction.log create mode 100644 docs/stress-tests/M3.3-runs/15-contradiction.log create mode 100644 docs/stress-tests/M3.3-runs/16-scope.log create mode 100644 docs/stress-tests/M3.3-runs/17-scope.log create mode 100644 docs/stress-tests/M3.3-runs/18-scope.log create mode 100644 docs/stress-tests/M3.3-runs/19-scope.log create mode 100644 docs/stress-tests/M3.3-runs/20-scope.log create mode 100644 scripts/calibration_collect.py create mode 100755 scripts/calibration_runner.sh diff --git a/.gitignore b/.gitignore index ef9ebb7..1e3741f 100644 --- a/.gitignore +++ b/.gitignore @@ -45,6 +45,9 @@ ehthumbs.db .env .env.local *.log +# Exception: stress test run logs are committed as provenance — they map +# trace_id -> category for the calibration collector script. +!docs/stress-tests/**/*.log # Tests .pytest_cache/ diff --git a/docs/stress-tests/M3.3-rating-worksheet.md b/docs/stress-tests/M3.3-rating-worksheet.md new file mode 100644 index 0000000..2f15de6 --- /dev/null +++ b/docs/stress-tests/M3.3-rating-worksheet.md @@ -0,0 +1,74 @@ +# M3.3 Calibration Rating Worksheet + +Issue: #46 (Phase B — human rating) + +## How to use this worksheet + +For each run below, read the answer + citations from the persisted result file (path in the **Result file** column). Score the answer's *actual* correctness on a 0.0–1.0 scale, **independent** of the model's self-reported confidence. Fill in the **actual_rating** column. Add notes in the **notes** column for anything unusual. 
+ +Rating rubric: + +- **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations. +- **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance. +- **0.6** — Substantively right but with notable errors, missing context, or weak citations. +- **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons. +- **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing. +- **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question. + +After rating all rows, save this file and run: + +``` +.venv/bin/python scripts/calibration_analyze.py +``` + +## Runs (22 total) + +| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes | +|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| +| 1 | `28f55110` | ad-hoc | What is the half-life of caffeine? | 0.95 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 11582 | | | +| 2 | `74a017bd` | ad-hoc | Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequenc... | 0.78 | 18 | medium | yes | spent | current | source_not_found(5) | 18 | 4 | 127692 | | | +| 3 | `6141a021` | factual | What is the boiling point of liquid nitrogen at standard atmospheric pressure? | 0.98 | 5 | high | no | under | current | — | 5 | 2 | 42473 | | | +| 4 | `91e87d05` | factual | When did the James Webb Space Telescope launch? | 0.99 | 5 | high | no | under | current | contradictory_sources(1) | 5 | 2 | 19708 | | | +| 5 | `710b0a62` | factual | What programming language is the Linux kernel primarily written in? | 0.97 | 6 | high | no | under | current | contradictory_sources(1), source_not_found(1) | 6 | 2 | 32922 | | | +| 6 | `ffc42162` | factual | What is the capital of Mongolia? 
| 0.99 | 4 | high | no | under | current | — | 4 | 1 | 11009 | | | +| 7 | `7561029e` | factual | How many amino acids are encoded by the standard genetic code? | 0.98 | 4 | high | no | under | current | scope_exceeded(1) | 4 | 2 | 48308 | | | +| 8 | `aaf3b9ef` | comparative | Compare the energy density of lithium-ion vs sodium-ion batteries. | 0.91 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 48087 | | | +| 9 | `01881015` | comparative | Compare PostgreSQL and SQLite for embedded analytics workloads. | 0.88 | 10 | medium | no | spent | current | source_not_found(3) | 10 | 4 | 61699 | | | +| 10 | `9e436db7` | comparative | Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. | 0.82 | 14 | high | no | spent | current | source_not_found(4) | 14 | 4 | 54153 | | | +| 11 | `7c8dd19b` | comparative | Compare React and Vue for large enterprise frontends in 2026. | 0.81 | 12 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 12 | 4 | 56137 | | | +| 12 | `e3fa81c3` | comparative | Compare wind and solar capacity factors in the continental United States. | 0.88 | 10 | high | no | spent | current | scope_exceeded(2), source_not_found(2) | 10 | 4 | 48230 | | | +| 13 | `96acce3c` | contradiction | Is red wine good for cardiovascular health? | 0.72 | 7 | high | yes | spent | recent | access_denied(1), contradictory_sources(1), source_not_found(1) | 9 | 3 | 42350 | | | +| 14 | `c4942f00` | contradiction | Does intermittent fasting extend lifespan in humans? | 0.72 | 9 | high | yes | spent | current | contradictory_sources(2), source_not_found(2) | 11 | 4 | 62781 | | | +| 15 | `2e2b6e88` | contradiction | Are nuclear power plants safe? | 0.92 | 8 | high | no | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 8 | 3 | 63429 | | | +| 16 | `27d81891` | contradiction | Is dietary cholesterol harmful? 
| 0.78 | 13 | high | yes | spent | current | contradictory_sources(1), source_not_found(2) | 13 | 4 | 64718 | | | +| 17 | `9c18d570` | contradiction | Does screen time harm child development? | 0.10 | 0 | low | no | spent | — | budget_exhausted(1) | 0 | 0 | 44375 | | | +| 18 | `f4c43973` | scope | What proprietary indexing strategies do high-frequency trading firms use for ... | 0.72 | 8 | medium | no | spent | current | scope_exceeded(1), source_not_found(3) | 8 | 4 | 70892 | | | +| 19 | `b3d00938` | scope | What is the actual operational doctrine of Chinese DF-41 ICBM brigades? | 0.72 | 12 | high | yes | spent | current | access_denied(1), contradictory_sources(1), scope_exceeded(1), source_not_found(1) | 12 | 4 | 62857 | | | +| 20 | `716e548a` | scope | What internal compensation bands does Goldman Sachs use for VPs in 2026? | 0.62 | 8 | medium | yes | spent | current | contradictory_sources(1), scope_exceeded(1), source_not_found(2) | 10 | 3 | 51829 | | | +| 21 | `b7cd9d50` | scope | How does Renaissance Technologies Medallion Fund actually generate alpha? | 0.82 | 10 | medium | no | spent | current | access_denied(1), source_not_found(3) | 10 | 4 | 43096 | | | +| 22 | `a4bb5b7a` | scope | What are the precise materials and tolerances in TSMC's 2nm process? | 0.42 | 9 | medium | no | spent | current | source_not_found(5) | 9 | 4 | 62620 | | | + +## Result files (full content for review) + +1. `/home/micro/.marchwarden/traces/28f55110-3b34-4661-87c7-e83bcbe9c4c6.result.json` +2. `/home/micro/.marchwarden/traces/74a017bd-697b-4439-96b8-fe12057cf2e8.result.json` +3. `/home/micro/.marchwarden/traces/6141a021-4a47-45df-aa0c-5acd1db78b79.result.json` +4. `/home/micro/.marchwarden/traces/91e87d05-6d23-4377-af13-270a8cf701e2.result.json` +5. `/home/micro/.marchwarden/traces/710b0a62-06c8-4f49-83e3-dc651c3702a9.result.json` +6. `/home/micro/.marchwarden/traces/ffc42162-5527-4a35-97ad-474aafa47dc1.result.json` +7. 
`/home/micro/.marchwarden/traces/7561029e-5dcb-4eaa-98e9-7496ed4bf4c2.result.json` +8. `/home/micro/.marchwarden/traces/aaf3b9ef-d91a-4d03-8883-b0a906929cb1.result.json` +9. `/home/micro/.marchwarden/traces/01881015-61a9-4894-a723-4e1d8b7a7755.result.json` +10. `/home/micro/.marchwarden/traces/9e436db7-fcde-4d0f-a568-c468ae4d419c.result.json` +11. `/home/micro/.marchwarden/traces/7c8dd19b-174b-4850-a2f5-28917d37c0c0.result.json` +12. `/home/micro/.marchwarden/traces/e3fa81c3-eaff-4f76-9b50-d61e70e54540.result.json` +13. `/home/micro/.marchwarden/traces/96acce3c-853d-40b7-ba02-c721ac59f85d.result.json` +14. `/home/micro/.marchwarden/traces/c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3.result.json` +15. `/home/micro/.marchwarden/traces/2e2b6e88-c973-4422-919c-3838634336c9.result.json` +16. `/home/micro/.marchwarden/traces/27d81891-5bf2-4bf4-9744-55f39ffaf696.result.json` +17. `/home/micro/.marchwarden/traces/9c18d570-73d3-4e8a-98bc-7cb1b66c61d2.result.json` +18. `/home/micro/.marchwarden/traces/f4c43973-7cac-4193-a249-cbb1302de4f7.result.json` +19. `/home/micro/.marchwarden/traces/b3d00938-5309-4faa-a20d-97a8511bb8f9.result.json` +20. `/home/micro/.marchwarden/traces/716e548a-ceaf-4d18-8b47-ac35e3460b52.result.json` +21. `/home/micro/.marchwarden/traces/b7cd9d50-3eec-4eca-8db0-a580722c2b19.result.json` +22. `/home/micro/.marchwarden/traces/a4bb5b7a-61dd-446b-8c06-06c78de5fef7.result.json` diff --git a/docs/stress-tests/M3.3-runs/01-factual.log b/docs/stress-tests/M3.3-runs/01-factual.log new file mode 100644 index 0000000..d70ec87 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/01-factual.log @@ -0,0 +1,128 @@ +Researching: What is the boiling point of liquid nitrogen at standard +atmospheric pressure? 
+ +{"question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:49:07.183443Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:49:07.993167Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:49:08.002221Z"} +{"question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:49:08.036624Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the boiling point of liquid nitrogen at standard atmospheric pressure?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:08.037079Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:08.037172Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1107, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:20.314935Z"} +{"step": 12, 
"decision": "Starting iteration 3/5", "tokens_so_far": 5768, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:25.184914Z"} +{"step": 15, "decision": "Starting iteration 4/5", "tokens_so_far": 16093, "event": "iteration_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:27.276067Z"} +{"step": 17, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 17, "iterations_run": 4, "tokens_used": 29376, "event": "synthesis_start", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:49:43.946958Z"} +{"step": 18, "decision": "Parsed synthesis JSON successfully", "duration_ms": 21492, "event": "synthesis_complete", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:05.440080Z"} +{"step": 26, "decision": "Research complete", "confidence": 0.98, "citation_count": 5, "gap_count": 0, "discovery_count": 2, "total_duration_sec": 59.528, "event": "complete", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:05.442761Z"} +{"confidence": 0.98, "citations": 5, "gaps": 0, "discovery_events": 2, "tokens_used": 42473, "iterations_run": 4, "wall_time_sec": 57.403085231781006, "budget_exhausted": false, "event": "research_completed", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:05.442894Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", 
"event": "cost_ledger_write_failed", "trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:05.443791Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:05.453034Z"} +{"trace_id": "6141a021-4a47-45df-aa0c-5acd1db78b79", "confidence": 0.98, "citations": 5, "tokens_used": 42473, "wall_time_sec": 57.403085231781006, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:05.720817Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ The boiling point of liquid nitrogen at standard atmospheric pressure (1 atm │ +│ / 14.7 psia / 760 mmHg) is −195.79 °C (77 K; −320 °F). Some sources round │ +│ this to −195.8 °C or approximately −196 °C. This value represents the │ +│ temperature at which nitrogen transitions from its liquid phase to a gas │ +│ phase under normal atmospheric conditions. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Liquid Nitrogen Temperature │ The temperature of liquid │ 0.98 │ +│ │ and Facts │ nitrogen is −195.79 °C (77 K; │ │ +│ │ https://sciencenotes.org/liqu │ −320 °F). This is the boiling │ │ +│ │ id-nitrogen-temperature-and-f │ point of nitrogen. However, │ │ +│ │ acts/ │ nitrogen can exist as a liquid │ │ +│ │ │ between 63 K and 77.2 K │ │ +│ │ │ (-346°F and -320.44°F). 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Nitrogen - Thermophysical │ Boiling Point - at saturation │ 0.97 │ +│ │ Properties │ pressure 14.7 psia and 760 mm │ │ +│ │ https://www.engineeringtoolbo │ Hg - ( o F, o C ) -320.4, │ │ +│ │ x.com/nitrogen-d_1421.html │ -195.8 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ What Is the Temperature of │ The temperature of liquid │ 0.95 │ +│ │ Liquid Nitrogen? - WestAir │ nitrogen is -196°C (-321°F) at │ │ +│ │ https://westairgases.com/blog │ its boiling point. The liquid │ │ +│ │ /liquid-nitrogen-temperature- │ nitrogen temperature range │ │ +│ │ properties/ │ spans between -210°C (freezing │ │ +│ │ │ point) and -196°C (boiling │ │ +│ │ │ point). │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ What is the boiling point of │ At 1 atmosphere of pressure, │ 0.90 │ +│ │ liquid nitrogen? Does it │ nitrogen boils at -195.8 │ │ +│ │ change ... - Quora │ Celsius (-320.4 Fahrenheit). │ │ +│ │ https://www.quora.com/What-is │ Of course, like any substance, │ │ +│ │ -the-boiling-point-of-liquid- │ boiling point varies directly │ │ +│ │ nitrogen-Does-it-change-in-a- │ with pressure. │ │ +│ │ vacuum-or-at-standard-conditi │ │ │ +│ │ ons │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ The boiling point for liquid │ The boiling point for liquid │ 0.88 │ +│ │ nitrogen at atmospheric │ nitrogen at atmospheric │ │ +│ │ pressure is 77 K. │ pressure is 77 K. In an open │ │ +│ │ https://brainly.com/question/ │ container, liquid nitrogen's │ │ +│ │ 17018364 │ temperature is generally │ │ +│ │ │ around its boiling point of 77 │ │ +│ │ │ K due to continuous │ │ +│ │ │ vaporization. 
│ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ database │ liquid nitrogen │ The boiling point │ +│ │ │ boiling point │ of nitrogen │ +│ │ │ pressure │ varies with │ +│ │ │ dependence phase │ pressure; │ +│ │ │ diagram │ understanding │ +│ │ │ │ this relationship │ +│ │ │ │ is useful for │ +│ │ │ │ industrial and │ +│ │ │ │ scientific │ +│ │ │ │ applications. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ nitrogen phase │ Engineering │ +│ │ │ diagram triple │ ToolBox │ +│ │ │ point critical │ references a │ +│ │ │ point │ nitrogen phase │ +│ │ │ │ diagram showing │ +│ │ │ │ conditions for │ +│ │ │ │ solid, liquid, │ +│ │ │ │ and gas phases. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ medium │ How does the boiling point of │ Multiple sources note that │ +│ │ liquid nitrogen change as │ boiling point varies directly │ +│ │ pressure decreases toward a │ with pressure, suggesting │ +│ │ vacuum? │ significant changes under │ +│ │ │ reduced pressure conditions. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ What is the exact triple point │ Sources mention nitrogen exists │ +│ │ temperature and pressure for │ as a liquid between 63 K and │ +│ │ nitrogen? │ 77.2 K, implying a triple point │ +│ │ │ near 63 K, but exact triple │ +│ │ │ point data was not provided in │ +│ │ │ the gathered evidence. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.98 │ +│ Corroborating sources: 5 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 1.00 │ +│ Budget status: under cap │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 42473 │ +│ Iterations: 4 │ +│ Wall time: 57.40s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 6141a021-4a47-45df-aa0c-5acd1db78b79 diff --git a/docs/stress-tests/M3.3-runs/02-factual.log b/docs/stress-tests/M3.3-runs/02-factual.log new file mode 100644 index 0000000..ae239ef --- /dev/null +++ b/docs/stress-tests/M3.3-runs/02-factual.log @@ -0,0 +1,145 @@ +Researching: When did the James Webb Space Telescope launch? 
+ +{"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:06.289350Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:07.051309Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:07.061145Z"} +{"question": "When did the James Webb Space Telescope launch?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:07.098980Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "When did the James Webb Space Telescope launch?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099569Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:07.099732Z"} +{"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1050, "event": "iteration_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:15.512242Z"} +{"step": 8, "decision": "Starting iteration 3/5", "tokens_so_far": 5418, "event": "iteration_start", 
"trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:18.749199Z"} +{"step": 10, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 6, "iterations_run": 3, "tokens_used": 11453, "event": "synthesis_start", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:28.069780Z"} +{"step": 11, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24998, "event": "synthesis_complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.942803Z"} +{"step": 20, "decision": "Research complete", "confidence": 0.99, "citation_count": 5, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 47.037, "event": "complete", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:51.943609Z"} +{"confidence": 0.99, "citations": 5, "gaps": 1, "discovery_events": 2, "tokens_used": 19708, "iterations_run": 3, "wall_time_sec": 44.843754529953, "budget_exhausted": false, "event": "research_completed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:51.943716Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:50:51.944100Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:51.947937Z"} 
+{"trace_id": "91e87d05-6d23-4377-af13-270a8cf701e2", "confidence": 0.99, "citations": 5, "tokens_used": 19708, "wall_time_sec": 44.843754529953, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.133972Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ The James Webb Space Telescope (JWST) launched on December 25, 2021, at │ +│ 12:20 UTC (7:20 AM ET) aboard an Arianespace Ariane 5 ECA+ rocket (Flight │ +│ VA256) from the Guiana Space Centre (ELA-3) in Kourou, French Guiana. It │ +│ entered service on July 12, 2022. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ James Webb Space Telescope - │ Launch date: 25 December 2021 │ 0.99 │ +│ │ Wikipedia │ (2021-12-25), 12:20 UTC | │ │ +│ │ https://en.wikipedia.org/wiki │ Rocket: Ariane 5 ECA+ (S/N │ │ +│ │ /James_Webb_Space_Telescope │ 5113, Flight VA256) | Launch │ │ +│ │ │ site: Guiana, ELA-3 | │ │ +│ │ │ Contractor: Arianespace | │ │ +│ │ │ Entered service: 12 July 2022 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ The Launch of the James Webb │ On December 25, 2021, and 7:20 │ 0.98 │ +│ │ Space Telescope - YouTube │ AM ET (12:20 UTC), the James │ │ +│ │ https://www.youtube.com/watch │ Webb Space Telescope was │ │ +│ │ ?v=9tXlqWldVVk │ launched by an ArianeSpace │ │ +│ │ │ Ariane 5 rocket from │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ James Webb Space Telescope │ The launch date was Saturday, │ 0.97 │ +│ │ (JWST) Mission (Ariane 5) - │ December 25, 2021 at 12:20 PM │ │ +│ │ RocketLaunch.Live │ (UTC). 
│ │ +│ │ https://www.rocketlaunch.live │ │ │ +│ │ /launch/jwst │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ James Webb Space Telescope – │ JWST's launch date was │ 0.95 │ +│ │ College of Science │ December 25 from Europe's │ │ +│ │ https://science.utah.edu/news │ Spaceport in Kourou, French │ │ +│ │ /james-webb-space-telescope/ │ Guiana. Longtime fans of the │ │ +│ │ │ telescope are celebrating it │ │ +│ │ │ as a Christmas miracle. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ NASA's James Webb Space │ Liftoff is at 7:20 a.m. EST │ 0.90 │ +│ │ Telescope officially set to │ (1220 GMT). │ │ +│ │ launch Dec. 24 | Space │ │ │ +│ │ https://www.space.com/james-w │ │ │ +│ │ ebb-space-telescope-launch-da │ │ │ +│ │ te-confirmed │ │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ contradictory_sources │ Space.com headline │ The Space.com article │ +│ │ discrepancy │ headline references Dec. │ +│ │ │ 24, which was the │ +│ │ │ announced/planned launch │ +│ │ │ date at time of │ +│ │ │ publication, while the │ +│ │ │ actual launch occurred on │ +│ │ │ Dec. 25, 2021. This is a │ +│ │ │ pre-launch announcement │ +│ │ │ artifact, not a true │ +│ │ │ contradiction, and all │ +│ │ │ other sources confirm │ +│ │ │ Dec. 25. 
│ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ null │ James Webb Space │ JWST entered │ +│ │ │ Telescope first │ service on July │ +│ │ │ science results │ 12, 2022; │ +│ │ │ July 2022 │ understanding its │ +│ │ │ │ early science │ +│ │ │ │ results provides │ +│ │ │ │ context for its │ +│ │ │ │ operational │ +│ │ │ │ impact. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ null │ JWST launch │ The telescope was │ +│ │ │ delays history │ originally │ +│ │ │ original 2007 │ planned to launch │ +│ │ │ launch plan │ in 2007 but faced │ +│ │ │ │ decades of │ +│ │ │ │ delays, making │ +│ │ │ │ the history of │ +│ │ │ │ its development │ +│ │ │ │ noteworthy. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ medium │ What were the key milestones │ Wikipedia notes the telescope │ +│ │ after JWST's launch during its │ entered service on July 12, │ +│ │ commissioning phase before │ 2022, approximately six months │ +│ │ entering service on July 12, │ after its December 25, 2021 │ +│ │ 2022? │ launch, suggesting a lengthy │ +│ │ │ commissioning process. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ What caused JWST's launch to │ Space.com's article was titled │ +│ │ slip from December 24 to │ with a Dec. 24 launch date, but │ +│ │ December 25, 2021? │ the actual launch occurred on │ +│ │ │ Dec. 
25, suggesting a │ +│ │ │ last-minute slip. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How does JWST's actual mission │ Wikipedia lists a 10-year │ +│ │ performance compare to its │ planned and 20-year expected │ +│ │ planned 10-year operational │ life; precise launch trajectory │ +│ │ lifespan given its fuel │ reportedly left more fuel than │ +│ │ efficiency during launch? │ expected, potentially extending │ +│ │ │ the mission. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.99 │ +│ Corroborating sources: 5 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 1.00 │ +│ Budget status: under cap │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 19708 │ +│ Iterations: 3 │ +│ Wall time: 44.84s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 91e87d05-6d23-4377-af13-270a8cf701e2 diff --git a/docs/stress-tests/M3.3-runs/03-factual.log b/docs/stress-tests/M3.3-runs/03-factual.log new file mode 100644 index 0000000..9baf104 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/03-factual.log @@ -0,0 +1,179 @@ +Researching: What programming language is the Linux kernel primarily written in? 
+ +{"question": "What programming language is the Linux kernel primarily written in?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:50:52.691750Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:50:53.397487Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:50:53.405825Z"} +{"question": "What programming language is the Linux kernel primarily written in?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:50:53.438393Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What programming language is the Linux kernel primarily written in?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:53.438693Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:50:53.438784Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1096, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:04.950078Z"} +{"step": 12, "decision": "Starting iteration 
3/5", "tokens_so_far": 7266, "event": "iteration_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:15.609351Z"} +{"step": 14, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 16, "iterations_run": 3, "tokens_used": 18342, "event": "synthesis_start", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:51:38.886838Z"} +{"step": 15, "decision": "Parsed synthesis JSON successfully", "duration_ms": 38497, "event": "synthesis_complete", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:16.247727Z"} +{"step": 26, "decision": "Research complete", "confidence": 0.97, "citation_count": 6, "gap_count": 2, "discovery_count": 2, "total_duration_sec": 85.024, "event": "complete", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:16.248500Z"} +{"confidence": 0.97, "citations": 6, "gaps": 2, "discovery_events": 2, "tokens_used": 32922, "iterations_run": 3, "wall_time_sec": 82.80920100212097, "budget_exhausted": false, "event": "research_completed", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:16.248601Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:52:16.248962Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", 
"level": "info", "timestamp": "2026-04-09T01:52:16.252134Z"} +{"trace_id": "710b0a62-06c8-4f49-83e3-dc651c3702a9", "confidence": 0.97, "citations": 6, "tokens_used": 32922, "wall_time_sec": 82.80920100212097, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:16.444923Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ The Linux kernel is primarily written in the C programming language, │ +│ specifically the GNU dialect of ISO C11 (compiled with GCC under -std=gnu11, │ +│ or alternatively with Clang). Assembly language is also used for │ +│ architecture-specific low-level code. As of late 2022, Rust became an │ +│ officially supported second language in the kernel, and as of the 2025 Linux │ +│ Kernel Maintainer Summit, Rust was elevated from 'experimental' to a │ +│ permanent, first-class core language alongside C. According to Open Hub │ +│ statistics, C accounts for approximately 95.8% of total lines in the kernel │ +│ codebase, with Assembly at ~0.7% and Rust at ~0.3%. The kernel also uses │ +│ small amounts of shell script, Python, Make, and Perl for tooling purposes. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Programming Language — The │ The Linux kernel is written in │ 1.00 │ +│ │ Linux Kernel documentation │ the C programming language. │ │ +│ │ https://docs.kernel.org/proce │ More precisely, it is │ │ +│ │ ss/programming-language.html │ typically compiled with gcc │ │ +│ │ │ under -std=gnu11: the GNU │ │ +│ │ │ dialect of ISO C11. clang is │ │ +│ │ │ also supported. The kernel has │ │ +│ │ │ support for the Rust │ │ +│ │ │ programming language under │ │ +│ │ │ CONFIG_RUST. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ The Linux Kernel Open Source │ C | 36,226,652 | 5,218,548 | │ 0.97 │ +│ │ Project on Open Hub: │ 12.6% | 5,867,314 | 47,312,514 │ │ +│ │ Languages Page │ | 95.8% ... Assembly | 266,797 │ │ +│ │ https://openhub.net/p/linux/a │ | 50,339 | 15.9% | 49,347 | │ │ +│ │ nalyses/latest/languages_summ │ 366,483 | 0.7% ... Rust | │ │ +│ │ ary │ 90,778 | 35,328 | 28.0% | │ │ +│ │ │ 11,361 | 137,467 | 0.3% │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Rust moves from experiment to │ The consensus among the │ 0.95 │ +│ │ a core Linux kernel language │ assembled developers is that │ │ +│ │ - Spiceworks │ Rust in the kernel is no │ │ +│ │ https://www.spiceworks.com/so │ longer experimental — it is │ │ +│ │ ftware/rust-moves-from-experi │ now a core part of the kernel │ │ +│ │ ment-to-a-core-linux-kernel-l │ and is here to stay. So the │ │ +│ │ anguage/ │ 'experimental' tag will be │ │ +│ │ │ coming off. This elevates Rust │ │ +│ │ │ to being the kernel's second │ │ +│ │ │ core language alongside C. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Why Linux Kernel is written │ Although the current Linux │ 0.92 │ +│ │ in C-language but not in C++? │ Kernel source-code contain │ │ +│ │ https://thelinuxchannel.org/2 │ certain parts of the code │ │ +│ │ 024/06/why-linux-kernel-is-wr │ written in assembly code │ │ +│ │ itten-in-c-language-but-not-i │ (actually native CPU assembly │ │ +│ │ n-c-thelinuxchannel-kernelpro │ instructions) and recently │ │ +│ │ gramming/ │ certain parts of code written │ │ +│ │ │ in Rust Language, majority of │ │ +│ │ │ the Linux Kernel source-code │ │ +│ │ │ is only written in C Language. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Linux Kernel Contributors And │ The Linux kernel crossed the │ 0.90 │ +│ │ Lines of Code Statistics 2026 │ 40 million line threshold with │ │ +│ │ https://commandlinux.com/stat │ version 6.14 rc1 in January │ │ +│ │ istics/linux-kernel-contribut │ 2025, containing precisely │ │ +│ │ ors-lines-of-code-statistics/ │ 40,063,856 lines. This │ │ +│ │ │ represents exponential growth │ │ +│ │ │ from the original 10,239 lines │ │ +│ │ │ in version 0.01 released in │ │ +│ │ │ 1991. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Rust for Linux - Wikipedia │ Initial release | October 1, │ 0.93 │ +│ │ https://en.wikipedia.org/wiki │ 2022; 3 years ago (2022-10-01) │ │ +│ │ /Rust_for_Linux │ | Written in | Rust | │ │ +│ │ │ Operating system | Linux | │ │ +│ │ │ License | GPL-2.0-only with │ │ +│ │ │ Linux-syscall-note. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Exact current percentage │ Open Hub statistics may │ +│ │ of Rust code in the most │ not reflect the most │ +│ │ recent kernel versions │ recent kernel releases │ +│ │ (6.12+) │ (6.14+), so the exact │ +│ │ │ current Rust percentage │ +│ │ │ could be slightly higher │ +│ │ │ than 0.3% given active │ +│ │ │ Rust adoption. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Whether C++ is │ Open Hub reports C++ at │ +│ │ officially used in any │ 1.9% of total lines, yet │ +│ │ part of the kernel │ official kernel docs and │ +│ │ │ community sources say C │ +│ │ │ is the language and C++ │ +│ │ │ is not used. 
The C++ │ +│ │ │ lines may be in │ +│ │ │ tools/scripts not in the │ +│ │ │ kernel proper. │ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ null │ Linux kernel Rust │ Rust is growing │ +│ │ │ adoption rate │ quickly in the │ +│ │ │ 2025 lines of │ kernel; updated │ +│ │ │ code percentage │ statistics on its │ +│ │ │ │ share would be │ +│ │ │ │ valuable │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ null │ Linux kernel C++ │ Open Hub shows │ +│ │ │ code usage tools │ ~1.9% C++ but │ +│ │ │ vs kernel proper │ official docs do │ +│ │ │ │ not mention C++; │ +│ │ │ │ clarifying │ +│ │ │ │ whether this is │ +│ │ │ │ tooling code vs │ +│ │ │ │ kernel code would │ +│ │ │ │ resolve the │ +│ │ │ │ apparent │ +│ │ │ │ discrepancy │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ medium │ Will Rust eventually surpass │ Rust is at ~0.3% and Assembly │ +│ │ Assembly in lines of code │ at ~0.7% per Open Hub; with │ +│ │ within the Linux kernel? │ active Rust driver development, │ +│ │ │ Rust may soon exceed Assembly │ +│ │ │ usage. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ What is the roadmap for Rust │ Rust is now a first-class │ +│ │ adoption in specific kernel │ language, but the Spiceworks │ +│ │ subsystems? 
│ article notes the focus is on │ +│ │ │ 'where, how fast, and under │ +│ │ │ whose terms does Rust spread │ +│ │ │ inside Linux'. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ Why does Open Hub report ~1.9% │ Open Hub's language breakdown │ +│ │ C++ in the Linux kernel │ shows 568,053 code lines of │ +│ │ codebase when official │ C++, which may belong to │ +│ │ documentation does not mention │ userspace tools or build │ +│ │ C++ as a supported kernel │ infrastructure bundled in the │ +│ │ language? │ same repository. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.97 │ +│ Corroborating sources: 6 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 1.00 │ +│ Budget status: under cap │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 32922 │ +│ Iterations: 3 │ +│ Wall time: 82.81s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 710b0a62-06c8-4f49-83e3-dc651c3702a9 diff --git a/docs/stress-tests/M3.3-runs/04-factual.log b/docs/stress-tests/M3.3-runs/04-factual.log new file mode 100644 index 0000000..d7b559d --- /dev/null +++ b/docs/stress-tests/M3.3-runs/04-factual.log @@ -0,0 +1,115 @@ +Researching: What is the capital of Mongolia? 
+ +{"question": "What is the capital of Mongolia?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:16.982178Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:52:17.707574Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:17.715766Z"} +{"question": "What is the capital of Mongolia?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:17.748116Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the capital of Mongolia?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:17.748504Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:17.748598Z"} +{"step": 5, "decision": "Starting iteration 2/5", "tokens_so_far": 1043, "event": "iteration_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:25.126703Z"} +{"step": 7, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 5, "iterations_run": 2, "tokens_used": 5387, "event": 
"synthesis_start", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:38.025310Z"} +{"step": 8, "decision": "Parsed synthesis JSON successfully", "duration_ms": 19958, "event": "synthesis_complete", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:56.937541Z"} +{"step": 14, "decision": "Research complete", "confidence": 0.99, "citation_count": 4, "gap_count": 0, "discovery_count": 1, "total_duration_sec": 41.287, "event": "complete", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:56.938235Z"} +{"confidence": 0.99, "citations": 4, "gaps": 0, "discovery_events": 1, "tokens_used": 11009, "iterations_run": 2, "wall_time_sec": 39.189372301101685, "budget_exhausted": false, "event": "research_completed", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:56.938337Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:52:56.938738Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:56.942176Z"} +{"trace_id": "ffc42162-5527-4a35-97ad-474aafa47dc1", "confidence": 0.99, "citations": 4, "tokens_used": 11009, "wall_time_sec": 39.189372301101685, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:57.144089Z"} +╭─────────────────────────────────── Answer 
───────────────────────────────────╮ +│ The capital of Mongolia is Ulaanbaatar (also spelled Ulan Bator). It is the │ +│ largest city in Mongolia, situated at an elevation of 1,350 meters on the │ +│ Tuul River, and is known as the coldest national capital in the world. The │ +│ name 'Ulaanbaatar' means 'red hero' in Mongolian. It is home to over half of │ +│ Mongolia's population of approximately 3 million people. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Ulaanbaatar - Wikipedia │ Ulaanbaatar is the capital of │ 0.99 │ +│ │ https://en.wikipedia.org/wiki │ Mongolia, and is home to over │ │ +│ │ /Ulaanbaatar │ half the country's population │ │ +│ │ │ of about 3 million people. │ │ +│ │ │ Human habitation dates back │ │ +│ │ │ more than 300,000 years. The │ │ +│ │ │ city is located along the Tuul │ │ +│ │ │ River Valley. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Ulaanbaatar, Mongolia | NASA │ Ulaanbaatar is the capital of │ 0.99 │ +│ │ Jet Propulsion Laboratory │ Mongolia, and is home to over │ │ +│ │ (JPL) │ half the country's population │ │ +│ │ https://www.jpl.nasa.gov/imag │ of about 3 million people. Due │ │ +│ │ es/pia26289-ulaanbaatar-mongo │ to its location deep in the │ │ +│ │ lia/ │ interior of Asia, and its high │ │ +│ │ │ elevation, Ulaanbaatar is the │ │ +│ │ │ coldest national capital in │ │ +│ │ │ the world. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Capital of Mongolia | - │ Ulaanbaatar (Ulan Bator) is │ 0.95 │ +│ │ Everything You Need to Know │ capital of Mongolia known as │ │ +│ │ About Ulaanbaatar │ the coldest capital on earth. 
│ │ +│ │ https://www.travelbuddies.inf │ It is located in central Asia │ │ +│ │ o/capital-of-mongolia/ │ between China and Russia and │ │ +│ │ │ capital and largest city of │ │ +│ │ │ Mongolia. Ulaan is red and │ │ +│ │ │ Baatar is hero in Mongolian. │ │ +│ │ │ In general, Ulaanbaatar means │ │ +│ │ │ 'red hero'. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Ulan Bator, Mongolia | │ Ulaanbaatar, also known as │ 0.98 │ +│ │ Geography and Cartography | │ Ulan Bator, is the capital and │ │ +│ │ Research Starters | EBSCO │ largest city of Mongolia, │ │ +│ │ Research │ situated at an elevation of │ │ +│ │ https://www.ebsco.com/researc │ 1,350 meters (4,430 feet) on │ │ +│ │ h-starters/geography-and-cart │ the Tuul River in the │ │ +│ │ ography/ulan-bator-mongolia │ northeast of the Mongolian │ │ +│ │ │ plateau. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ null │ Ulaanbaatar air │ Multiple sources │ +│ │ │ pollution and │ mention severe │ +│ │ │ climate │ air pollution and │ +│ │ │ challenges │ extreme cold as │ +│ │ │ │ notable │ +│ │ │ │ characteristics │ +│ │ │ │ of the capital │ +│ │ │ │ worth exploring │ +│ │ │ │ further. 
│ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ low │ How has Ulaanbaatar's │ Sources mention dramatic │ +│ │ population grown over recent │ population increases due to │ +│ │ decades due to rural-to-urban │ migration from rural areas, │ +│ │ migration? │ with population estimates │ +│ │ │ ranging from 1.4 million to │ +│ │ │ over 1.6 million across │ +│ │ │ sources. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What measures is Ulaanbaatar │ Multiple sources note that coal │ +│ │ taking to address its severe │ reliance and extreme winters │ +│ │ air pollution problem? │ cause significant air pollution │ +│ │ │ in the city. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.99 │ +│ Corroborating sources: 4 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 1.00 │ +│ Budget status: under cap │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 11009 │ +│ Iterations: 2 │ +│ Wall time: 39.19s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: ffc42162-5527-4a35-97ad-474aafa47dc1 diff --git a/docs/stress-tests/M3.3-runs/05-factual.log b/docs/stress-tests/M3.3-runs/05-factual.log new file mode 100644 index 0000000..a0156c7 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/05-factual.log @@ -0,0 +1,148 @@ +Researching: How many amino acids are encoded by the standard genetic code? 
+ +{"question": "How many amino acids are encoded by the standard genetic code?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:52:57.672745Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:52:58.404691Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:52:58.415522Z"} +{"question": "How many amino acids are encoded by the standard genetic code?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:52:58.449581Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "How many amino acids are encoded by the standard genetic code?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:58.449885Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:52:58.449974Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1099, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:06.468160Z"} +{"step": 12, "decision": "Starting iteration 3/5", 
"tokens_so_far": 8623, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:11.359260Z"} +{"step": 17, "decision": "Starting iteration 4/5", "tokens_so_far": 18453, "event": "iteration_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:15.589960Z"} +{"step": 19, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 14, "iterations_run": 4, "tokens_used": 34167, "event": "synthesis_start", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:53:38.361461Z"} +{"step": 20, "decision": "Parsed synthesis JSON successfully", "duration_ms": 24174, "event": "synthesis_complete", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:01.506420Z"} +{"step": 28, "decision": "Research complete", "confidence": 0.98, "citation_count": 4, "gap_count": 1, "discovery_count": 2, "total_duration_sec": 65.235, "event": "complete", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:01.507373Z"} +{"confidence": 0.98, "citations": 4, "gaps": 1, "discovery_events": 2, "tokens_used": 48308, "iterations_run": 4, "wall_time_sec": 63.05677556991577, "budget_exhausted": false, "event": "research_completed", "trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:54:01.507469Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", 
"trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:54:01.507940Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:54:01.514127Z"} +{"trace_id": "7561029e-5dcb-4eaa-98e9-7496ed4bf4c2", "confidence": 0.98, "citations": 4, "tokens_used": 48308, "wall_time_sec": 63.05677556991577, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:54:01.785150Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ The standard genetic code encodes 20 common amino acids. These are specified │ +│ by 64 possible codons (combinations of three nucleotides from four bases), │ +│ with most amino acids encoded by more than one codon (a property called │ +│ degeneracy). Methionine and tryptophan are the only amino acids specified by │ +│ a single codon each. Three codons serve as stop signals rather than encoding │ +│ amino acids. Beyond the standard 20, two additional amino │ +│ acids—selenocysteine (the 21st) and pyrrolysine (the 22nd)—are also │ +│ genetically encoded in certain organisms via reprogramming of stop codons │ +│ UGA and UAG, respectively, but are not part of the standard set of 20. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ The genetic code (article) - │ Most of the amino acids in the │ 0.95 │ +│ │ Khan Academy │ genetic code are encoded by at │ │ +│ │ https://www.khanacademy.org/s │ least two codons. 
In fact, │ │ +│ │ cience/hs-bio/x230b3ff252126b │ methionine and tryptophan are │ │ +│ │ b6:gene-expression-and-regula │ the only amino acids specified │ │ +│ │ tion/x230b3ff252126bb6:untitl │ by a single codon. │ │ +│ │ ed-348/a/the-genetic-code │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Is there a twenty third amino │ The universal genetic code │ 0.98 │ +│ │ acid in the genetic code? - │ includes 20 common amino │ │ +│ │ PubMed │ acids. In addition, │ │ +│ │ https://pubmed.ncbi.nlm.nih.g │ selenocysteine (Sec) and │ │ +│ │ ov/16713651/ │ pyrrolysine (Pyl), known as │ │ +│ │ │ the twenty first and twenty │ │ +│ │ │ second amino acids, are │ │ +│ │ │ encoded by UGA and UAG, │ │ +│ │ │ respectively, which are the │ │ +│ │ │ codons that usually function │ │ +│ │ │ as stop signals. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Genetic code - Wikipedia │ The genetic code is highly │ 0.95 │ +│ │ https://en.wikipedia.org/wiki │ similar among all organisms │ │ +│ │ /Genetic_code │ and can be expressed in a │ │ +│ │ │ simple table with 64 entries. │ │ +│ │ │ The codons specify which amino │ │ +│ │ │ acid will be added next during │ │ +│ │ │ protein biosynthesis. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Understanding the Genetic │ The universal │ 0.97 │ +│ │ Code - PMC │ triple-nucleotide genetic │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ code, allowing DNA-encoded │ │ +│ │ articles/PMC6620406/ │ mRNA to be translated into the │ │ +│ │ │ amino acid sequences of │ │ +│ │ │ proteins using transfer RNAs │ │ +│ │ │ (tRNAs) and many accessory and │ │ +│ │ │ modification factors, is │ │ +│ │ │ essentially common to all │ │ +│ │ │ living organisms on Earth. 
│ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ scope_exceeded │ Exact codon-to-amino-acid │ The full detailed codon │ +│ │ mapping table │ table listing all 64 codons │ +│ │ │ and their corresponding │ +│ │ │ amino acids was not │ +│ │ │ extracted verbatim from the │ +│ │ │ sources, though the total │ +│ │ │ count of 20 standard amino │ +│ │ │ acids is well established. │ +└────────────────┴──────────────────────────────┴──────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ database │ selenocysteine │ The PubMed source │ +│ │ │ pyrrolysine │ raises the │ +│ │ │ genetic code │ question of │ +│ │ │ expansion │ expanded genetic │ +│ │ │ organisms │ codes beyond 20 │ +│ │ │ │ amino acids, │ +│ │ │ │ which may be │ +│ │ │ │ relevant for │ +│ │ │ │ advanced biology │ +│ │ │ │ research. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ synthetic biology │ Wikipedia │ +│ │ │ unnatural amino │ mentions expanded │ +│ │ │ acids expanded │ genetic codes in │ +│ │ │ genetic code │ synthetic │ +│ │ │ │ biology, │ +│ │ │ │ suggesting active │ +│ │ │ │ research into │ +│ │ │ │ adding more than │ +│ │ │ │ 22 amino acids. 
│ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ medium │ Could a 23rd amino acid ever │ A PubMed study scanned 16 │ +│ │ become widely distributed and │ archaeal and 130 bacterial │ +│ │ genetically encoded in nature? │ genomes for tRNAs corresponding │ +│ │ │ to the three stop codons and │ +│ │ │ concluded that additional │ +│ │ │ widely distributed genetically │ +│ │ │ encoded amino acids are │ +│ │ │ unlikely. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ How many non-standard amino │ Wikipedia references expanded │ +│ │ acids have been successfully │ genetic codes in synthetic │ +│ │ incorporated into proteins via │ biology as a distinct topic, │ +│ │ synthetic biology methods? │ suggesting │ +│ │ │ laboratory-engineered codes may │ +│ │ │ go beyond the natural 22. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.98 │ +│ Corroborating sources: 4 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 1.00 │ +│ Budget status: under cap │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 48308 │ +│ Iterations: 4 │ +│ Wall time: 63.06s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 7561029e-5dcb-4eaa-98e9-7496ed4bf4c2 diff --git a/docs/stress-tests/M3.3-runs/06-comparative.log b/docs/stress-tests/M3.3-runs/06-comparative.log new file mode 100644 index 0000000..59dfc12 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/06-comparative.log @@ -0,0 +1,226 @@ +Researching: Compare the energy density of lithium-ion vs sodium-ion batteries. 
+ +{"question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:54:02.430608Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:54:03.159945Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:54:03.167971Z"} +{"question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:54:03.200030Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare the energy density of lithium-ion vs sodium-ion batteries.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:03.200318Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:03.200405Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1114, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:14.560598Z"} +{"step": 12, "decision": "Starting iteration 3/5", 
"tokens_so_far": 7183, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:18.314755Z"} +{"step": 19, "decision": "Starting iteration 4/5", "tokens_so_far": 13977, "event": "iteration_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:28.528912Z"} +{"step": 24, "decision": "Token budget reached before iteration 5: 28015/20000", "event": "budget_exhausted", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:39.027627Z"} +{"step": 25, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 24, "iterations_run": 4, "tokens_used": 28015, "event": "synthesis_start", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:54:39.028531Z"} +{"step": 26, "decision": "Parsed synthesis JSON successfully", "duration_ms": 50955, "event": "synthesis_complete", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:27.614289Z"} +{"step": 41, "decision": "Research complete", "confidence": 0.91, "citation_count": 8, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 87.865, "event": "complete", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:27.616834Z"} +{"confidence": 0.91, "citations": 8, "gaps": 3, "discovery_events": 3, "tokens_used": 48087, "iterations_run": 4, "wall_time_sec": 84.41376757621765, "budget_exhausted": true, "event": "research_completed", "trace_id": 
"aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:55:27.617014Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:55:27.617866Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:55:27.632124Z"} +{"trace_id": "aaf3b9ef-d91a-4d03-8883-b0a906929cb1", "confidence": 0.91, "citations": 8, "tokens_used": 48087, "wall_time_sec": 84.41376757621765, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:55:27.873634Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Lithium-ion batteries have significantly higher energy density than │ +│ sodium-ion batteries across all commercial chemistries. Lithium-ion cells │ +│ achieve 150–300 Wh/kg gravimetrically, depending on chemistry: NMC variants │ +│ reach 250–300 Wh/kg in premium automotive applications, while LFP cells │ +│ deliver 150–180 Wh/kg [Source 15]. Volumetrically, lithium-ion batteries │ +│ reach roughly 250–700 Wh/L [Source 16]. Sodium-ion batteries currently │ +│ achieve 90–190 Wh/kg gravimetrically; CATL's first-generation commercial │ +│ cells reached ~160 Wh/kg [Source 15], with newer products like CATL's Naxtra │ +│ reaching ~175 Wh/kg [Source 22], and ScienceDirect prototypes ranging 90–150 │ +│ Wh/kg [Source 7]. The volumetric energy density of sodium-ion is │ +│ approximately 20–40% lower than lithium-ion equivalents [Source 8]. This gap │ +│ exists fundamentally because sodium ions are heavier and larger than lithium │ +│ ions, reducing the energy stored per unit mass or volume [Source 3, Source │ +│ 20]. 
A notable exception is a late-2025 announcement by ZN Energy of an │ +│ anode-free solid-state sodium-ion pouch cell achieving 348.5 Wh/kg, verified │ +│ by CATARC, using a high-energy layered oxide cathode and anode-free │ +│ solid-state architecture—though this is a laboratory/prototype result, not │ +│ yet commercial [Source 10]. In practical terms, sodium-ion batteries are │ +│ best suited for stationary storage and cost-sensitive low-performance EVs │ +│ where energy density is less critical, while lithium-ion dominates portable │ +│ electronics, robotics, and long-range EVs [Source 1, Source 8]. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Battery Energy Density 2025: │ Nickel Manganese Cobalt (NMC) │ 0.95 │ +│ │ State of the Art & Next-Gen │ variants deliver the highest │ │ +│ │ Tech │ energy densities at the cell │ │ +│ │ https://timharper.net/fieldno │ level, reaching 250-300 Wh/kg │ │ +│ │ tes/battery-energy-density-20 │ in premium automotive │ │ +│ │ 25/ │ applications... Sodium-ion │ │ +│ │ │ batteries have emerged from │ │ +│ │ │ laboratory curiosity to │ │ +│ │ │ commercial reality, with │ │ +│ │ │ CATL's first-generation cells │ │ +│ │ │ achieving 160 Wh/kg energy │ │ +│ │ │ density. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Sodium ion batteries: A │ Current prototypes of SIBs │ 0.95 │ +│ │ sustainable alternative to │ have energy densities of │ │ +│ │ lithium-ion ... 
│ 90–150 Wh/kg, which remain │ │ +│ │ https://www.sciencedirect.com │ lower than the 130–285 Wh/kg │ │ +│ │ /science/article/pii/S2949821 │ typically achieved │ │ +│ │ X25002418 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Sodium-ion batteries: Should │ Sodium is heavier than │ 0.97 │ +│ │ we believe the hype? │ lithium, and its ions are │ │ +│ │ https://cen.acs.org/energy/en │ larger, resulting in a │ │ +│ │ ergy-storage-/Sodium-ion-batt │ volumetric energy density that │ │ +│ │ eries-Should-believe/103/web/ │ is 20–40% less than that of │ │ +│ │ 2025/11 │ lithium ion. Consequently, a │ │ +│ │ │ sodium-ion battery is bigger │ │ +│ │ │ and heavier than an equivalent │ │ +│ │ │ one made with lithium. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Energy Density of Lithium-Ion │ Modern lithium-ion batteries │ 0.90 │ +│ │ Batteries Explained: Wh/kg vs │ achieve 150-300 Wh/kg and │ │ +│ │ Wh/L │ 250-700 Wh/L, depending on │ │ +│ │ https://www.longsingtech.com/ │ chemistry and design. │ │ +│ │ energy-density-of-lithium-ion │ │ │ +│ │ -batteries/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Sodium Ion vs Lithium Ion │ Energy Density (Gravimetric): │ 0.88 │ +│ │ Batteries: 2026 Comparison & │ Sodium-ion typically ranges │ │ +│ │ Key Advantages │ from 100–175 Wh/kg (e.g., │ │ +│ │ https://chargeprotexas.com/so │ CATL's Naxtra at ~175 Wh/kg). │ │ +│ │ dium-ion-vs-lithium-ion-batte │ Lithium-ion hits 150–250+ │ │ +│ │ ries-2026-comparison/ │ Wh/kg (LFP: 150–210; NMC: │ │ +│ │ │ 240–350). 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ ZN Energy Breaks Sodium-Ion │ Its >25Ah large-format AFSSSIB │ 0.78 │ +│ │ Battery Density Record at │ pouch cell achieved a │ │ +│ │ 348.5Wh/kg │ gravimetric energy density of │ │ +│ │ https://www.linkedin.com/post │ 348.5Wh/kg, verified by CATARC │ │ +│ │ s/jerry-wan-069b41105_breakin │ (China Automotive Technology & │ │ +│ │ g-the-sodium-ceiling-zhaona-e │ Research Center, Tianjin). │ │ +│ │ nergy-activity-74134108276403 │ This is not an incremental │ │ +│ │ 20000-NHd_ │ improvement—it directly │ │ +│ │ │ challenges the long-held │ │ +│ │ │ assumption that sodium │ │ +│ │ │ chemistry is structurally │ │ +│ │ │ capped at 'low energy │ │ +│ │ │ density.' │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Sodium as a Green Substitute │ But there are also downsides │ 0.93 │ +│ │ for Lithium in Batteries │ to sodium-ion batteries, the │ │ +│ │ https://physics.aps.org/artic │ top one being a lower energy │ │ +│ │ les/v17/73 │ density than their lithium-ion │ │ +│ │ │ counterparts. Energy density │ │ +│ │ │ has a direct bearing on the │ │ +│ │ │ driving range of an electric │ │ +│ │ │ vehicle. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Sodium-Ion vs Lithium-Ion │ lithium-ion batteries dominate │ 0.85 │ +│ │ Batteries Differences and │ high-performance applications │ │ +│ │ Applications in 2025 │ like consumer electronics and │ │ +│ │ https://www.large-battery.com │ robotics, owing to their │ │ +│ │ /blog/na-ion-vs-li-ion-batter │ superior energy density of │ │ +│ │ ies-2025/ │ 100–270 Wh/kg. 
│ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Volumetric energy │ Most sources provide │ +│ │ density figures for │ gravimetric (Wh/kg) data │ +│ │ sodium-ion batteries │ for sodium-ion; specific │ +│ │ │ Wh/L volumetric figures │ +│ │ │ for sodium-ion cells at │ +│ │ │ the commercial pack level │ +│ │ │ were not found in │ +│ │ │ evidence. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Independent verification │ The 348.5 Wh/kg result │ +│ │ of ZN Energy 348.5 Wh/kg │ for sodium-ion is from a │ +│ │ claim │ LinkedIn post summarizing │ +│ │ │ a company announcement. │ +│ │ │ No peer-reviewed or │ +│ │ │ independent third-party │ +│ │ │ publication was found to │ +│ │ │ corroborate this figure. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ scope_exceeded │ Cycle life vs energy │ While cycle life is │ +│ │ density trade-offs in │ mentioned in some │ +│ │ sodium-ion │ sources, a detailed │ +│ │ │ quantitative comparison │ +│ │ │ of how energy density │ +│ │ │ degrades over cycle life │ +│ │ │ compared to lithium-ion │ +│ │ │ was not covered in the │ +│ │ │ evidence. 
│ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ new_source │ arxiv │ anode-free │ ZN Energy's 348.5 │ +│ │ │ solid-state │ Wh/kg claim would │ +│ │ │ sodium-ion │ benefit from │ +│ │ │ battery energy │ peer-reviewed │ +│ │ │ density 2025 │ validation on │ +│ │ │ │ arXiv or similar │ +│ │ │ │ preprint server. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ sodium-ion │ Volumetric energy │ +│ │ │ battery │ density for │ +│ │ │ volumetric energy │ sodium-ion at the │ +│ │ │ density Wh/L │ cell and pack │ +│ │ │ commercial cells │ level is │ +│ │ │ 2025 │ underrepresented │ +│ │ │ │ in current │ +│ │ │ │ evidence. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ layered oxide │ Multiple sources │ +│ │ │ cathode │ mention cathode │ +│ │ │ sodium-ion │ engineering as │ +│ │ │ specific capacity │ the key │ +│ │ │ cycle stability │ bottleneck for │ +│ │ │ 2025 │ sodium-ion energy │ +│ │ │ │ density │ +│ │ │ │ improvement. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Will sodium-ion batteries ever │ ZN Energy's prototype achieved │ +│ │ match or exceed LFP lithium-ion │ 348.5 Wh/kg, but commercial │ +│ │ in gravimetric energy density │ CATL sodium-ion cells are at │ +│ │ at the commercial pack level? │ ~160–175 Wh/kg while LFP cells │ +│ │ │ are 150–180 Wh/kg. 
The gap is │ +│ │ │ closing in prototypes but not │ +│ │ │ yet in commercial products. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How does energy density change │ Sources mention sodium-ion's │ +│ │ over the cycle life of │ lower risk of thermal runaway │ +│ │ sodium-ion vs lithium-ion │ and good low-temperature │ +│ │ batteries under real-world │ performance, but long-term │ +│ │ conditions? │ energy density retention data │ +│ │ │ was not found. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the volumetric energy │ C&EN states volumetric density │ +│ │ density (Wh/L) of current │ is 20–40% lower than │ +│ │ commercial sodium-ion battery │ lithium-ion but provides no │ +│ │ packs? │ absolute Wh/L figures for │ +│ │ │ sodium-ion. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.91 │ +│ Corroborating sources: 8 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 0.97 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 48087 │ +│ Iterations: 4 │ +│ Wall time: 84.41s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: aaf3b9ef-d91a-4d03-8883-b0a906929cb1 diff --git a/docs/stress-tests/M3.3-runs/07-comparative.log b/docs/stress-tests/M3.3-runs/07-comparative.log new file mode 100644 index 0000000..fe09df7 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/07-comparative.log @@ -0,0 +1,350 @@ +Researching: Compare PostgreSQL and SQLite for embedded analytics workloads. 
+ +{"question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:55:28.499294Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:55:29.256154Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:55:29.264747Z"} +{"question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:55:29.297908Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare PostgreSQL and SQLite for embedded analytics workloads.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:29.298261Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:29.298356Z"} +{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1147, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:38.957520Z"} +{"step": 16, "decision": "Starting iteration 3/5", 
"tokens_so_far": 8781, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:55:45.812510Z"} +{"step": 23, "decision": "Starting iteration 4/5", "tokens_so_far": 18324, "event": "iteration_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:00.757335Z"} +{"step": 28, "decision": "Token budget reached before iteration 5: 34877/20000", "event": "budget_exhausted", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:03.990690Z"} +{"step": 29, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 35, "iterations_run": 4, "tokens_used": 34877, "event": "synthesis_start", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:56:03.990849Z"} +{"step": 30, "decision": "Parsed synthesis JSON successfully", "duration_ms": 78663, "event": "synthesis_complete", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:20.513065Z"} +{"step": 48, "decision": "Research complete", "confidence": 0.88, "citation_count": 10, "gap_count": 3, "discovery_count": 4, "total_duration_sec": 114.441, "event": "complete", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:20.536570Z"} +{"confidence": 0.88, "citations": 10, "gaps": 3, "discovery_events": 4, "tokens_used": 61699, "iterations_run": 4, "wall_time_sec": 111.20896744728088, "budget_exhausted": true, "event": "research_completed", 
"trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:57:20.538075Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:57:20.546420Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:57:20.668474Z"} +{"trace_id": "01881015-61a9-4894-a723-4e1d8b7a7755", "confidence": 0.88, "citations": 10, "tokens_used": 61699, "wall_time_sec": 111.20896744728088, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:57:21.511598Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ PostgreSQL and SQLite differ substantially for embedded analytics workloads │ +│ across architecture, concurrency, feature set, and performance │ +│ characteristics. │ +│ │ +│ **Architecture:** SQLite is a serverless, embedded database engine that │ +│ reads/writes a single file on disk, making it highly portable and │ +│ zero-configuration. PostgreSQL uses a client-server architecture requiring a │ +│ separate server process, which enables it to scale and handle multiple │ +│ concurrent clients efficiently via Multi-Version Concurrency Control (MVCC) │ +│ [Source 5]. For embedded analytics specifically, SQLite's in-process nature │ +│ eliminates network overhead, which can yield significant read performance │ +│ advantages in local scenarios [Source 31]. │ +│ │ +│ **Concurrency:** SQLite allows multiple concurrent readers but only one │ +│ writer at a time, using file-level locking. 
This single-writer model is a │ +│ significant bottleneck for write-heavy or high-concurrency analytical │ +│ ingestion workloads [Source 24, Source 25]. PostgreSQL's MVCC ensures │ +│ readers and writers do not block each other, making it far superior for │ +│ multi-user or mixed OLTP/OLAP environments [Source 5]. Turso's work on │ +│ concurrent writes for SQLite demonstrates the community recognizes this │ +│ limitation, achieving up to 4x write throughput improvements over vanilla │ +│ SQLite [Source 24]. │ +│ │ +│ **OLAP/Analytical Performance:** SQLite is row-oriented and was designed │ +│ primarily as a world-class OLTP engine. For analytical workloads—complex │ +│ aggregations, percentile calculations, large scans—SQLite struggles │ +│ significantly. A cited benchmark shows a single percentile query over 13M │ +│ rows taking ~4 seconds in SQLite [Source 6]. PostgreSQL, while also │ +│ row-oriented, supports more advanced SQL features (window functions, complex │ +│ joins, partitioning) and can be tuned for analytics [Source 22]. However, │ +│ PostgreSQL itself hits a 'Postgres Wall' for heavy analytical workloads when │ +│ row-scanning large datasets exceeds available RAM [Source 13]. Neither │ +│ SQLite nor PostgreSQL is natively columnar; PostgreSQL can be extended with │ +│ columnar storage extensions for better OLAP performance [Source 23]. │ +│ │ +│ **Feature Set:** PostgreSQL offers a richer feature set including more data │ +│ types, advanced indexing, role-based access control, JSON/array support, │ +│ geospatial extensions (PostGIS), and time-series extensions. SQLite uses │ +│ dynamic typing and has a simpler, more limited feature set—easier to use but │ +│ potentially limiting for complex analytical applications [Source 5, Source │ +│ 1]. 
│ +│ │ +│ **Recommended Alternatives for Embedded Analytics:** DuckDB is widely cited │ +│ as the superior embedded engine for analytical workloads, outperforming both │ +│ SQLite and PostgreSQL on OLAP queries by a large margin [Source 6, Source │ +│ 2]. For embedded analytics use cases requiring columnar processing, DuckDB │ +│ or Stoolap (a Rust-based embedded OLAP engine) are more purpose-built │ +│ options. Stoolap benchmarks show up to 138x faster analytical query │ +│ performance versus SQLite [Source 9]. │ +│ │ +│ **Summary:** SQLite wins for lightweight, read-heavy, single-writer, │ +│ local/embedded OLTP workloads where portability and zero configuration │ +│ matter. PostgreSQL wins for multi-user, concurrent, complex-query │ +│ environments. For true embedded analytics workloads (large-scale │ +│ aggregations, complex OLAP queries), neither is optimal—DuckDB or a hybrid │ +│ architecture (PostgreSQL as system-of-record + DuckDB as analytical engine) │ +│ is the modern recommended approach. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ SQLite vs. PostgreSQL: The │ PostgreSQL is a client-server │ 0.97 │ +│ │ key differences and │ database system... This │ │ +│ │ advantages of each │ architecture enables │ │ +│ │ https://databaseschool.com/ar │ PostgreSQL to scale and handle │ │ +│ │ ticles/sqlite-vs-postgresql-t │ multiple concurrent clients │ │ +│ │ he-key-differences-and-advant │ efficiently... SQLite is a │ │ +│ │ ages-of-each │ serverless database engine. It │ │ +│ │ │ functions as a lightweight │ │ +│ │ │ library embedded directly into │ │ +│ │ │ applications... 
SQLite's │ │ +│ │ │ concurrency model is more │ │ +│ │ │ restrictive: while it allows │ │ +│ │ │ multiple readers, only one │ │ +│ │ │ process can write to the │ │ +│ │ │ database at a time. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Making -SQLite- Analytics │ In some analytical queries │ 0.95 │ +│ │ Great Again! – Oldmoe's blog │ SQLite will struggle to │ │ +│ │ https://oldmoe.blog/2025/03/1 │ perform compared to other OLAP │ │ +│ │ 2/making-sqlite-analytics-gre │ oriented engines like DuckDB. │ │ +│ │ at-again/ │ Consider the following │ │ +│ │ │ scenario: You have a table │ │ +│ │ │ with 13M entries of latency │ │ +│ │ │ data, and you want to │ │ +│ │ │ determine the following │ │ +│ │ │ percentiles: p50, p95, p99... │ │ +│ │ │ After around 4 seconds you │ │ +│ │ │ will see the result. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ DuckDB vs. Postgres for │ That 'quick' analytical query │ 0.95 │ +│ │ embedded analytics: How to │ powering a customer-facing │ │ +│ │ choose (and when to use a │ dashboard now takes 5 seconds, │ │ +│ │ hybrid architecture) │ up from 50 milliseconds. Then │ │ +│ │ https://motherduck.com/learn- │ thirty seconds. Then it times │ │ +│ │ more/duckdb-vs-postgres-embed │ out. You've hit the 'Postgres │ │ +│ │ ded-analytics/ │ Wall.' This isn't a Postgres │ │ +│ │ │ failure. It's an architectural │ │ +│ │ │ mismatch. Postgres processes │ │ +│ │ │ analytics using the same │ │ +│ │ │ row-oriented logic designed │ │ +│ │ │ for transaction safety. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Beyond the Single-Writer │ SQLite has a single-writer │ 0.93 │ +│ │ Limitation with Turso's │ transaction model, which means │ │ +│ │ Concurrent Writes │ whenever a transaction writes │ │ +│ │ https://turso.tech/blog/beyon │ to the database, no other │ │ +│ │ d-the-single-writer-limitatio │ write transactions can make │ │ +│ │ n-with-tursos-concurrent-writ │ progress until that │ │ +│ │ es │ transaction is complete... │ │ +│ │ │ When concurrent writes are │ │ +│ │ │ used, we achieve up to 4x the │ │ +│ │ │ write throughput of SQLite, │ │ +│ │ │ while also removing the │ │ +│ │ │ dreaded SQLITE_BUSY error. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Stoolap vs. SQLite: Comparing │ OLAP (Online Analytical │ 0.92 │ +│ │ Rust OLAP and Traditional │ Processing) systems are │ │ +│ │ OLTP Databases | Better Stack │ designed for a completely │ │ +│ │ Community │ different purpose. OLAP │ │ +│ │ https://betterstack.com/commu │ databases are optimized for │ │ +│ │ nity/guides/ai/stoolap-vs-sql │ complex queries and data │ │ +│ │ ite/ │ analysis... Most standard │ │ +│ │ │ application databases, │ │ +│ │ │ including SQLite, PostgreSQL, │ │ +│ │ │ and MySQL, are classified as │ │ +│ │ │ OLTP (Online Transaction │ │ +│ │ │ Processing) systems. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Postgres Tuning & Performance │ Analytics or OLAP activity │ 0.91 │ +│ │ for Analytics Data | Crunchy │ typically involves much │ │ +│ │ Data Blog │ longer, more complex queries │ │ +│ │ https://www.crunchydata.com/b │ than OLTP activity, joining │ │ +│ │ log/postgres-tuning-and-perfo │ data from multiple tables, and │ │ +│ │ rmance-for-analytics-data │ working on large data sets. │ │ +│ │ │ This means it's very resource │ │ +│ │ │ intensive. 
Without careful │ │ +│ │ │ planning and tuning, you can │ │ +│ │ │ find yourself with analytics │ │ +│ │ │ queries that not only take far │ │ +│ │ │ too long to run, but also slow │ │ +│ │ │ down your existing │ │ +│ │ │ application. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Postgres Columnar Storage: 4 │ PostgreSQL is a row-oriented │ 0.90 │ +│ │ Popular Extensions and a │ database by design, meaning it │ │ +│ │ Quick Tutorial │ stores data tuple-by-tuple... │ │ +│ │ https://www.epsio.io/blog/pos │ This structure is suitable for │ │ +│ │ tgres-columnar-storage-4-popu │ transactional workloads but │ │ +│ │ lar-extensions-and-a-quick-tu │ not optimized for analytical │ │ +│ │ torial │ queries that typically scan │ │ +│ │ │ large volumes of data across a │ │ +│ │ │ few columns... While │ │ +│ │ │ PostgreSQL does not natively │ │ +│ │ │ support columnar storage, │ │ +│ │ │ several extensions and │ │ +│ │ │ external tools introduce │ │ +│ │ │ columnar capabilities. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ SQLite vs PostgreSQL │ SQLite was faster. Of course │ 0.88 │ +│ │ Performance & Comparison | │ it was. Writing to a local │ │ +│ │ Pythonic AF │ file inside the same process │ │ +│ │ https://medium.com/pythonic-a │ will almost always be faster │ │ +│ │ f/sqlite-vs-postgresql-perfor │ than sending queries to a │ │ +│ │ mance-comparison-46ba1d39c9c8 │ server. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Everyone Is Wrong About │ why SQLite is often the │ 0.80 │ +│ │ SQLite (Here's When It Beats │ superior production choice for │ │ +│ │ Postgres) │ read-heavy, single-server, and │ │ +│ │ https://www.youtube.com/watch │ edge workloads ... SQLite vs │ │ +│ │ ?v=t20KyfjtUs4 │ PostgreSQL Performance. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ SQLite SO MUCH FASTER than │ Of course, with the advent of │ 0.82 │ +│ │ Postgres - Reddit │ DuckDB, you use DuckDB for │ │ +│ │ https://www.reddit.com/r/sqli │ data analysis tasks since it │ │ +│ │ te/comments/1gu219r/sqlite_so │ can be faster than either │ │ +│ │ _much_faster_than_postgres/ │ SQLite or PostgreSQL in those │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Quantitative head-to-head │ Most benchmarks found │ +│ │ benchmark of SQLite vs │ compare SQLite vs │ +│ │ PostgreSQL specifically on │ PostgreSQL on OLTP │ +│ │ analytical queries (not │ (reads/writes of individual │ +│ │ just OLTP) │ rows) or compare each │ +│ │ │ individually to │ +│ │ │ DuckDB/Stoolap on OLAP. A │ +│ │ │ direct, rigorous benchmark │ +│ │ │ of SQLite vs PostgreSQL on │ +│ │ │ complex analytical queries │ +│ │ │ (GROUP BY, window │ +│ │ │ functions, aggregations │ +│ │ │ over millions of rows) was │ +│ │ │ not surfaced in the │ +│ │ │ evidence. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ PostgreSQL columnar │ While columnar extensions │ +│ │ extension performance vs │ for PostgreSQL (e.g., Citus │ +│ │ SQLite for embedded │ columnar, hydra) are │ +│ │ analytics │ mentioned, no direct │ +│ │ │ benchmark comparing │ +│ │ │ PostgreSQL-with-columnar-ex │ +│ │ │ tension vs SQLite for │ +│ │ │ embedded analytical │ +│ │ │ workloads was found. 
│ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ SQLite WAL mode impact on │ WAL mode is mentioned as │ +│ │ analytical query │ improving concurrent │ +│ │ performance │ read/write behavior in │ +│ │ │ SQLite, but its specific │ +│ │ │ impact on analytical query │ +│ │ │ throughput in embedded │ +│ │ │ scenarios was not │ +│ │ │ quantified in the evidence. │ +└──────────────────┴─────────────────────────────┴─────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ database │ DuckDB vs SQLite │ DuckDB is │ +│ │ │ vs PostgreSQL │ consistently │ +│ │ │ analytical │ cited as │ +│ │ │ benchmark OLAP │ outperforming │ +│ │ │ embedded 2024 │ both for │ +│ │ │ 2025 │ analytics; a │ +│ │ │ │ rigorous │ +│ │ │ │ three-way │ +│ │ │ │ comparison would │ +│ │ │ │ better answer the │ +│ │ │ │ embedded │ +│ │ │ │ analytics │ +│ │ │ │ question. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ SQLite past │ The VLDB paper on │ +│ │ │ present future │ SQLite's │ +│ │ │ VLDB paper bloom │ past/present/futu │ +│ │ │ filter analytical │ re is cited │ +│ │ │ performance 2022 │ multiple times as │ +│ │ │ │ authoritative on │ +│ │ │ │ SQLite's │ +│ │ │ │ analytical │ +│ │ │ │ limitations; │ +│ │ │ │ accessing it │ +│ │ │ │ directly would │ +│ │ │ │ strengthen │ +│ │ │ │ claims. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ pg_duckdb │ The motherduck │ +│ │ │ extension │ article │ +│ │ │ PostgreSQL │ references │ +│ │ │ embedded │ pg_duckdb as a │ +│ │ │ analytics │ key tool for │ +│ │ │ performance │ hybrid │ +│ │ │ hybrid │ Postgres+DuckDB │ +│ │ │ architecture │ analytics; │ +│ │ │ │ benchmarks for │ +│ │ │ │ this approach │ +│ │ │ │ were not found. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ new_source │ null │ Stoolap embedded │ Stoolap is an │ +│ │ │ OLAP Rust │ emerging embedded │ +│ │ │ database │ OLAP engine │ +│ │ │ benchmark SQLite │ (Rust) claiming │ +│ │ │ PostgreSQL │ 138x speedup over │ +│ │ │ │ SQLite; it's a │ +│ │ │ │ relevant new │ +│ │ │ │ entrant to the │ +│ │ │ │ embedded │ +│ │ │ │ analytics space. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ At what data volume does │ The evidence shows SQLite │ +│ │ SQLite's analytical performance │ struggles at 13M rows for │ +│ │ become unacceptably slow │ percentile queries (~4s), but │ +│ │ compared to PostgreSQL for │ no clear threshold or scaling │ +│ │ typical embedded analytics │ curve vs PostgreSQL was found. │ +│ │ workloads? │ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Does enabling WAL mode and │ Hacker News discussion mentions │ +│ │ tuning SQLite │ WAL + synchronous=NORMAL as │ +│ │ (synchronous=NORMAL, page size, │ approaching 'line speed with IO │ +│ │ etc.) meaningfully close the │ subsystem' for writes, but │ +│ │ analytical performance gap with │ analytical query impact is │ +│ │ PostgreSQL? │ unclear. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Is a hybrid architecture │ The Postgres+DuckDB hybrid is │ +│ │ (SQLite for OLTP + DuckDB for │ well-documented, but an │ +│ │ OLAP, sharing the same data) │ SQLite+DuckDB embedded hybrid │ +│ │ practical for embedded │ (for truly serverless apps) is │ +│ │ applications, and how does it │ less explored in the evidence. │ +│ │ compare to using PostgreSQL │ │ +│ │ alone? │ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How do PostgreSQL columnar │ PostgreSQL columnar extensions │ +│ │ storage extensions (e.g., │ are mentioned as improving OLAP │ +│ │ Hydra, Citus columnar) perform │ performance, but no direct │ +│ │ for embedded analytics compared │ comparison to SQLite in │ +│ │ to native SQLite? │ embedded scenarios was found. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the operational │ SQLite's binary is ~500KB vs │ +│ │ overhead (memory, disk, setup │ PostgreSQL requiring a server │ +│ │ complexity) of running │ process; for edge/IoT embedded │ +│ │ PostgreSQL vs SQLite in a truly │ analytics, resource constraints │ +│ │ embedded edge or mobile │ may be the deciding factor. │ +│ │ environment? 
│ │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.88 │ +│ Corroborating sources: 10 │ +│ Source authority: medium │ +│ Contradiction detected: False │ +│ Query specificity match: 0.82 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 61699 │ +│ Iterations: 4 │ +│ Wall time: 111.21s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 01881015-61a9-4894-a723-4e1d8b7a7755 diff --git a/docs/stress-tests/M3.3-runs/08-comparative.log b/docs/stress-tests/M3.3-runs/08-comparative.log new file mode 100644 index 0000000..6955cba --- /dev/null +++ b/docs/stress-tests/M3.3-runs/08-comparative.log @@ -0,0 +1,364 @@ +Researching: Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing. 
+ +{"question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:57:22.951394Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:57:23.942406Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:57:23.953465Z"} +{"question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:57:24.008304Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:24.008814Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:24.008920Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1180, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:42.087229Z"} +{"step": 14, "decision": "Starting iteration 3/5", 
"tokens_so_far": 12270, "event": "iteration_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:47.632253Z"} +{"step": 21, "decision": "Token budget reached before iteration 4: 25966/20000", "event": "budget_exhausted", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:55.072818Z"} +{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 24, "iterations_run": 3, "tokens_used": 25966, "event": "synthesis_start", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:57:55.072985Z"} +{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 89456, "event": "synthesis_complete", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:21.172200Z"} +{"step": 46, "decision": "Research complete", "confidence": 0.82, "citation_count": 14, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 121.701, "event": "complete", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:21.274347Z"} +{"confidence": 0.82, "citations": 14, "gaps": 4, "discovery_events": 4, "tokens_used": 54153, "iterations_run": 3, "wall_time_sec": 117.15539288520813, "budget_exhausted": true, "event": "research_completed", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:59:21.275590Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": 
"cost_ledger_write_failed", "trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T01:59:21.286942Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:59:21.531952Z"} +{"trace_id": "9e436db7-fcde-4d0f-a568-c468ae4d419c", "confidence": 0.82, "citations": 14, "tokens_used": 54153, "wall_time_sec": 117.15539288520813, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:59:22.766505Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ CRISPR-Cas9 and CRISPR-Cas12a (formerly Cpf1) are both widely used │ +│ RNA-guided nucleases adapted for genome editing, including in vivo │ +│ applications, but they differ meaningfully in mechanism, structure, PAM │ +│ requirements, cutting pattern, guide RNA architecture, specificity, and │ +│ practical suitability for in vivo delivery. │ +│ │ +│ **Mechanism and DNA Cleavage:** Cas9 (most commonly from Streptococcus │ +│ pyogenes, SpCas9) cleaves both DNA strands at the same position, producing │ +│ blunt-ended double-strand breaks (DSBs) [Source 7]. Cas12a, by contrast, │ +│ introduces staggered cuts that leave 4–5 nucleotide 5′ overhangs [Sources 2, │ +│ 7]. These sticky ends generated by Cas12a may enhance homology-directed │ +│ repair (HDR) efficiency compared to Cas9's blunt ends [Source 2]. │ +│ │ +│ **PAM Sequence:** Cas9 requires an NGG PAM (protospacer adjacent motif) on │ +│ the non-template strand downstream of the target; Cas12a recognizes a T-rich │ +│ PAM (typically TTTV) upstream of the target on the non-template strand │ +│ [Sources 2, 7]. This difference expands the targeting range of Cas12a to │ +│ AT-rich genomic regions where Cas9 is limited. 
│ +│ │ +│ **Guide RNA:** Cas9 uses a two-component guide (crRNA + tracrRNA, often │ +│ fused as sgRNA), while Cas12a requires only a single crRNA with a short │ +│ direct repeat and processes its own pre-crRNA array, enabling multiplexed │ +│ editing from a single transcript [Sources 2, 7, 13]. │ +│ │ +│ **Specificity and Off-Target Effects:** Kinetic studies show Cas12a exhibits │ +│ greater target specificity than Cas9, attributed to a more stringent DNA │ +│ unwinding mechanism that requires more extensive complementarity before │ +│ cleavage [Source 5]. Cas12a tolerates fewer mismatches between the guide RNA │ +│ and target, resulting in fewer off-target cuts [Sources 2, 5]. │ +│ │ +│ **Editing Efficiency:** In comparative studies using ribonucleoprotein (RNP) │ +│ delivery in rice (OsPDS gene), Cas9 and Cas12a showed different efficiencies │ +│ depending on the target site [Source 1]. In Chlamydomonas reinhardtii, both │ +│ Cas9 and Cas12a RNPs co-delivered with ssODN repair templates achieved │ +│ similar total editing levels of 20–30% [Source 4]. Context and target site │ +│ selection significantly influence which enzyme performs better. │ +│ │ +│ **In Vivo Delivery Considerations:** Both enzymes can be delivered via AAV │ +│ vectors, lipid nanoparticles (LNPs), or as RNPs via electroporation [Sources │ +│ 21, 24]. A critical practical consideration is size: SpCas9 (~4.2 kb coding │ +│ sequence) is near the AAV packaging limit (~4.7–4.8 kb), leaving little room │ +│ for promoter and regulatory elements [Sources 20, 21]. Cas12a variants │ +│ (including engineered compact forms such as EbCas12a) can be packaged │ +│ together with their crRNA within a single AAV vector, which is a significant │ +│ advantage for in vivo delivery [Sources 19, 20, 21]. A miniature Cas12f1 │ +│ variant has also demonstrated efficacy for in vivo retinal gene therapy │ +│ [Source 12]. 
│ +│ │ +│ **Clinical and Therapeutic Status:** CRISPR-Cas9 is currently the dominant │ +│ nuclease in clinical trials for both ex vivo and in vivo genome editing │ +│ [Sources 8, 11]. Cas12a is gaining traction in therapeutic research, │ +│ particularly where higher specificity or AAV-compatible delivery is required │ +│ [Sources 9, 13, 22]. │ +│ │ +│ **Summary Table:** │ +│ - DNA cut type: Cas9 = blunt; Cas12a = staggered (5′ overhang) │ +│ - PAM: Cas9 = NGG (3′); Cas12a = TTTV (5′) │ +│ - Guide RNA: Cas9 = sgRNA (crRNA+tracrRNA); Cas12a = crRNA only │ +│ - Multiplexing: Cas9 = limited; Cas12a = inherent crRNA array processing │ +│ - Specificity: Cas12a generally higher │ +│ - AAV compatibility: Cas12a variants better suited │ +│ - Clinical use: Cas9 more established; Cas12a emerging │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ What's the Difference Between │ Cas9...cleaves both strands of │ 0.95 │ +│ │ Cas9 and Cas12a Nucleases? | │ DNA at the same point. This │ │ +│ │ The Scientist │ creates a blunt end │ │ +│ │ https://www.the-scientist.com │ double-stranded break (DSB)... │ │ +│ │ /what-s-the-difference-betwee │ For Cas9 to function, the │ │ +│ │ n-cas9-and-cas12a-nucleases-7 │ protospacer adjacent motif │ │ +│ │ 2481 │ (PAM)—a two to six base pair │ │ +│ │ │ sequence—NGG...must sit │ │ +│ │ │ immediately downstream of the │ │ +│ │ │ target on the opposite strand. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Cas9 versus Cas12a/Cpf1: │ Cas9 and Cas12a have distinct │ 0.97 │ +│ │ Structure-function │ evolutionary origins and │ │ +│ │ comparisons and implications │ exhibit different structural │ │ +│ │ for genome editing - PubMed │ architectures, resulting in │ │ +│ │ https://pubmed.ncbi.nlm.nih.g │ distinct molecular │ │ +│ │ ov/29790280/ │ mechanisms... We discuss │ │ +│ │ │ implications for genome │ │ +│ │ │ editing, and how they may │ │ +│ │ │ influence the choice of Cas9 │ │ +│ │ │ or Cas12a for specific │ │ +│ │ │ applications. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ CRISPR-Cas12a More Precise │ Cas12a...is, according to │ 0.90 │ +│ │ Than CRISPR-Cas9 │ scientists at the University │ │ +│ │ https://www.genengnews.com/to │ of Texas at Austin │ │ +│ │ pics/genome-editing/crispr-ca │ (UT-Austin), more effective │ │ +│ │ s12a-more-precise-than-crispr │ and precise... Because Cas │ │ +│ │ -cas9/ │ enzymes occasionally fail to │ │ +│ │ │ cut DNA in the right places, │ │ +│ │ │ or even cut at all, they worry │ │ +│ │ │ developers, who want to modify │ │ +│ │ │ genomes with surgical │ │ +│ │ │ precision, especially in │ │ +│ │ │ therapeutic applications. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Comparison of CRISPR/Cas9 and │ We found that Cas9 and Cas12a │ 0.92 │ +│ │ Cas12a for gene editing in │ RNPs- co-delivered with ssODN │ │ +│ │ Chlamydomonas reinhardtii - │ repair templates- induced │ │ +│ │ ScienceDirect │ similar levels of total │ │ +│ │ https://www.sciencedirect.com │ editing, achieving as much as │ │ +│ │ /science/article/pii/S2211926 │ 20–30 % in all │ │ +│ │ 424004089 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Comparison of │ Comparison of │ 0.88 │ +│ │ CRISPR-Cas9/Cas12a │ CRISPR-Cas9/Cas12a │ │ +│ │ Ribonucleoprotein Complexes │ Ribonucleoprotein Complexes │ │ +│ │ for Genome Editing Efficiency │ for Genome Editing Efficiency │ │ +│ │ in the Rice Phytoene │ in the Rice Phytoene │ │ +│ │ Desaturase (OsPDS) Gene - PMC │ Desaturase (OsPDS) Gene │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ │ │ +│ │ articles/PMC6973557/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Current and Prospective │ Current and Prospective │ 0.87 │ +│ │ Applications of CRISPR-Cas12a │ Applications of CRISPR-Cas12a │ │ +│ │ in Pluricellular Organisms - │ in Pluricellular Organisms... │ │ +│ │ PMC │ Mol Biotechnol. 2022 Aug │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ 8;65(2):196–205. 
doi: │ │ +│ │ articles/PMC9841005/ │ 10.1007/s12033-022-00538-5 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ When size matters: A novel │ When size matters: A novel │ 0.90 │ +│ │ compact Cas12a variant for in │ compact Cas12a variant for in │ │ +│ │ vivo genome editing - PMC │ vivo genome editing │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ │ │ +│ │ articles/PMC11253977/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ When size matters: A novel │ Altogether, the components of │ 0.91 │ +│ │ compact Cas12a variant for in │ the EbCas12a system are well │ │ +│ │ vivo genome editing - │ below the 4.8-kb packaging │ │ +│ │ ResearchGate │ limit of AAVs, enabling │ │ +│ │ https://www.researchgate.net/ │ successful packaging in the │ │ +│ │ publication/382328745_When_si │ AAV9 │ │ +│ │ ze_matters_A_novel_compact_Ca │ │ │ +│ │ s12a_variant_for_in_vivo_geno │ │ │ +│ │ me_editing │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Therapeutic In Vivo Gene │ our current results prove that │ 0.88 │ +│ │ Editing Achieved by a │ the miniature Cas12f1 system │ │ +│ │ Hypercompact CRISPR System - │ is a promising gene editing │ │ +│ │ Advanced Science │ tool for retinal gene therapy │ │ +│ │ https://advanced.onlinelibrar │ │ │ +│ │ y.wiley.com/doi/10.1002/advs. 
│ │ │ +│ │ 202308095 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ Delivery of CRISPR-Cas tools │ AAV is one of the most │ 0.90 │ +│ │ for in vivo genome editing │ commonly used vector systems │ │ +│ │ therapy: Trends and │ to date, but immunogenicity │ │ +│ │ challenges - ScienceDirect │ against capsid, liver toxicity │ │ +│ │ https://www.sciencedirect.com │ at high dose, and potential │ │ +│ │ /science/article/pii/S0168365 │ genotoxicity caused by │ │ +│ │ 92200027X │ off-target mutagenesis and │ │ +│ │ │ genomic integration remain │ │ +│ │ │ unsolved. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 11 │ CRISPR-Based Therapeutic │ These Cas proteins are more │ 0.87 │ +│ │ Genome Editing - DSpace@MIT │ compatible with AAV delivery, │ │ +│ │ https://dspace.mit.edu/bitstr │ enabling additional vector │ │ +│ │ eam/handle/1721.1/138388.2/ni │ design options such as │ │ +│ │ hms-1576523.pdf?sequence=4&is │ expanded promoter choices and │ │ +│ │ Allowed=y │ a streamlined delivery. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 12 │ Revolutionizing in vivo │ Genome editing using the │ 0.85 │ +│ │ therapy with CRISPR/Cas │ CRISPR/Cas system has │ │ +│ │ genome editing: │ revolutionized the field of │ │ +│ │ breakthroughs, opportunities │ genetic engineering, offering │ │ +│ │ and challenges - Frontiers │ unprecedented opportunities │ │ +│ │ https://www.frontiersin.org/j │ for therapeutic applications │ │ +│ │ ournals/genome-editing/articl │ in vivo. 
│ │ +│ │ es/10.3389/fgeed.2024.1342193 │ │ │ +│ │ /full │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 13 │ CRISPR Clinical Trials: A │ CRISPR Clinical Trials: A 2024 │ 0.80 │ +│ │ 2024 Update - Innovative │ Update - Innovative Genomics │ │ +│ │ Genomics Institute │ Institute (IGI) │ │ +│ │ https://innovativegenomics.or │ │ │ +│ │ g/news/crispr-clinical-trials │ │ │ +│ │ -2024/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 14 │ Alt-R CRISPR-Cas9 vs Cas12a │ The two most popular enzymes │ 0.83 │ +│ │ systems | IDT │ used in CRISPR genome editing │ │ +│ │ https://www.idtdna.com/pages/ │ are Cas9 and Cas12a (Cpf1). │ │ +│ │ technology/crispr/crispr-geno │ These enzymes are highly │ │ +│ │ me-editing/Alt-R-systems │ functional, do not require │ │ +│ │ │ binding to other enzymes as is │ │ +│ │ │ the case for type I CRISPR │ │ +│ │ │ systems, and can be readily │ │ +│ │ │ programmed to target the │ │ +│ │ │ desired genomic DNA site. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Head-to-head in vivo │ Most comparative studies │ +│ │ efficacy data in mammals │ focused on plants (rice) or │ +│ │ across multiple tissue │ algae (Chlamydomonas) or │ +│ │ types │ used in vitro/ex vivo │ +│ │ │ models. Rigorous │ +│ │ │ side-by-side in vivo │ +│ │ │ mammalian comparisons of │ +│ │ │ Cas9 vs. Cas12a across │ +│ │ │ liver, muscle, CNS, and eye │ +│ │ │ were not identified in │ +│ │ │ available sources. 
│ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Immunogenicity comparison │ While immunogenicity of │ +│ │ between Cas9 and Cas12a in │ Cas9 is well-documented as │ +│ │ vivo │ a challenge for in vivo │ +│ │ │ delivery, direct │ +│ │ │ comparative immunogenicity │ +│ │ │ data for Cas12a in humans │ +│ │ │ or animal models was not │ +│ │ │ available in the gathered │ +│ │ │ sources. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Cas12a clinical trial data │ The IGI clinical trials │ +│ │ │ update and other sources │ +│ │ │ confirm Cas9 dominance in │ +│ │ │ trials but do not provide │ +│ │ │ details on approved or │ +│ │ │ ongoing Cas12a-specific │ +│ │ │ clinical trials. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Detailed off-target │ While Cas12a is reported to │ +│ │ profiling comparison in │ be more specific than Cas9 │ +│ │ vivo │ based on kinetic studies, │ +│ │ │ comprehensive in vivo │ +│ │ │ off-target profiling │ +│ │ │ comparing both enzymes │ +│ │ │ systematically across the │ +│ │ │ same targets was not │ +│ │ │ available in the sources. 
│ +└──────────────────┴─────────────────────────────┴─────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ arxiv │ Cas12a vs Cas9 in │ Head-to-head in │ +│ │ │ vivo editing │ vivo mammalian │ +│ │ │ efficiency │ comparisons are a │ +│ │ │ off-target │ critical gap; │ +│ │ │ mammalian │ preprint servers │ +│ │ │ therapeutic │ may have more │ +│ │ │ comparison 2023 │ recent │ +│ │ │ 2024 │ unpublished data │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ CRISPR Cas12a │ Clinical adoption │ +│ │ │ clinical trials │ of Cas12a in vivo │ +│ │ │ ClinicalTrials.go │ is poorly │ +│ │ │ v 2023 2024 │ characterized; a │ +│ │ │ │ ClinicalTrials.go │ +│ │ │ │ v database search │ +│ │ │ │ would clarify │ +│ │ │ │ current status │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ Cas12a │ Immunogenicity is │ +│ │ │ immunogenicity │ a key barrier for │ +│ │ │ pre-existing │ in vivo Cas9 │ +│ │ │ immunity in vivo │ delivery; whether │ +│ │ │ gene therapy │ Cas12a poses │ +│ │ │ human │ fewer immune │ +│ │ │ │ challenges is │ +│ │ │ │ clinically │ +│ │ │ │ important but not │ +│ │ │ │ covered in │ +│ │ │ │ sources │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ new_source │ database │ compact Cas12a │ Compact Cas12a │ +│ │ │ EbCas12a AsCas12a │ variants show │ +│ │ │ in vivo liver │ promise for AAV │ +│ │ │ lung CNS │ delivery; recent │ +│ │ │ therapeutic │ therapeutic in │ +│ │ │ editing 2024 │ vivo data would │ +│ │ │ │ strengthen the │ +│ │ │ │ comparison │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions 
+┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Does Cas12a's staggered cutting │ Sources note that staggered │ +│ │ pattern result in meaningfully │ cuts may enhance HDR, but │ +│ │ higher HDR rates than Cas9's │ comparative in vivo HDR │ +│ │ blunt cuts in vivo in │ efficiency data in mammals was │ +│ │ therapeutically relevant cell │ not found in the gathered │ +│ │ types? │ evidence. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Are there pre-existing │ Immunogenicity is a known │ +│ │ antibodies or T-cell responses │ challenge for Cas9 in vivo; │ +│ │ against Cas12a proteins in │ whether Cas12a, being from │ +│ │ humans that would limit its │ different bacterial origins, │ +│ │ therapeutic use, as has been │ faces similar or lesser immune │ +│ │ documented for SpCas9? │ barriers in human patients is │ +│ │ │ clinically critical. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Can compact Cas12a variants │ Compact variants fit within AAV │ +│ │ (e.g., EbCas12a, Cas12f) │ packaging limits better than │ +│ │ consistently match or exceed │ Cas9, but their in vivo editing │ +│ │ SpCas9 editing efficiency in │ efficiency relative to SpCas9 │ +│ │ vivo across diverse tissue │ across tissues such as liver, │ +│ │ types? │ muscle, and CNS needs │ +│ │ │ systematic evaluation. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How does Cas12a's inherent │ Cas12a can process its own │ +│ │ crRNA array processing and │ pre-crRNA array, enabling │ +│ │ multiplexing capability │ multiplexed targeting from a │ +│ │ translate to in vivo │ single transcript, which is │ +│ │ combinatorial therapeutic │ noted as an advantage but its │ +│ │ strategies compared to │ in vivo therapeutic │ +│ │ Cas9-based multiplex │ exploitation is not │ +│ │ approaches? │ well-characterized in available │ +│ │ │ sources. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the current status of │ The 2024 CRISPR clinical trials │ +│ │ Cas12a-specific clinical trials │ update from IGI and Frontiers │ +│ │ for in vivo gene therapy, and │ review both highlight Cas9 │ +│ │ how do their safety profiles │ dominance in clinical trials, │ +│ │ compare to Cas9-based trials? │ but Cas12a clinical translation │ +│ │ │ remains poorly documented. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.82 │ +│ Corroborating sources: 14 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 0.85 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 54153 │ +│ Iterations: 3 │ +│ Wall time: 117.16s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 9e436db7-fcde-4d0f-a568-c468ae4d419c diff --git a/docs/stress-tests/M3.3-runs/09-comparative.log b/docs/stress-tests/M3.3-runs/09-comparative.log new file mode 100644 index 0000000..496ca0e --- /dev/null +++ b/docs/stress-tests/M3.3-runs/09-comparative.log @@ -0,0 +1,378 @@ +Researching: Compare React and Vue for large enterprise frontends in 2026. 
+ +{"question": "Compare React and Vue for large enterprise frontends in 2026.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T01:59:24.701232Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T01:59:26.384813Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T01:59:26.398635Z"} +{"question": "Compare React and Vue for large enterprise frontends in 2026.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T01:59:26.459271Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare React and Vue for large enterprise frontends in 2026.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:26.459554Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:26.459652Z"} +{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1149, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:37.583764Z"} +{"step": 16, "decision": "Starting iteration 3/5", 
"tokens_so_far": 11893, "event": "iteration_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:45.423050Z"} +{"step": 23, "decision": "Token budget reached before iteration 4: 27147/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:52.158499Z"} +{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 29, "iterations_run": 3, "tokens_used": 27147, "event": "synthesis_start", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T01:59:52.158736Z"} +{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 87997, "event": "synthesis_complete", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:16.872069Z"} +{"step": 46, "decision": "Research complete", "confidence": 0.81, "citation_count": 12, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 114.815, "event": "complete", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:16.883053Z"} +{"confidence": 0.81, "citations": 12, "gaps": 4, "discovery_events": 4, "tokens_used": 56137, "iterations_run": 3, "wall_time_sec": 110.40975427627563, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:01:16.883613Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": 
"cost_ledger_write_failed", "researcher": "web", "trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:01:16.886961Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:01:16.944624Z"} +{"trace_id": "7c8dd19b-174b-4850-a2f5-28917d37c0c0", "confidence": 0.81, "citations": 12, "tokens_used": 56137, "wall_time_sec": 110.40975427627563, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:01:17.535111Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ For large enterprise frontends in 2026, React and Vue each offer distinct │ +│ advantages, and the best choice depends on organizational priorities. │ +│ │ +│ **Market Position & Adoption:** React dominates with ~42% market share among │ +│ professional developers (2025 State of JavaScript survey) and ~68% among │ +│ enterprise applications globally, while Vue holds ~28% developer share and │ +│ ~18% enterprise share. React powers Facebook, Netflix, Airbnb, and Uber; Vue │ +│ drives Alibaba, GitLab, and Nintendo. Some 80% of enterprise teams use React │ +│ directly or via Next.js. [Sources 1, 4, 25] │ +│ │ +│ **Performance:** Both frameworks use a virtual DOM. Vue 4 showed 15% faster │ +│ initial render times than React 19 in large-scale applications with │ +│ thousands of components (JavaScript Performance Consortium 2025 benchmarks). │ +│ However, React 19's concurrent rendering features provide superior │ +│ responsiveness during complex user interactions. In micro-benchmarks, Vue │ +│ 3.4 creates 1,000 rows in 38ms vs React 19's 42ms, and Vue's bundle size is │ +│ smaller (33KB vs 44KB min+gzip). The performance gap continues to narrow. 
│ +│ [Sources 1, 25] │ +│ │ +│ **React 19 Architecture Shifts:** React 19 introduces a built-in compiler │ +│ that automates memoization (making useMemo/useCallback largely redundant), │ +│ native Server Components for zero-bundle-size dependencies and direct │ +│ database access, a new Actions API for simplified async form handling, and │ +│ the `use` hook for streamlined data fetching. These changes significantly │ +│ reduce boilerplate and technical debt for enterprise teams. [Sources 18, 19, │ +│ 20] │ +│ │ +│ **Vue's Enterprise Momentum:** Vue 3's Composition API enables better logic │ +│ reuse across large codebases. Pinia (the official state manager) is │ +│ TypeScript-first and lightweight. Nuxt 3 handles SSR. Vue's natural │ +│ TypeScript support and Vite-powered tooling make it increasingly attractive │ +│ for enterprise adoption. Fortune 500 companies, SaaS platforms, and │ +│ government tech teams are growing adopters. [Sources 12, 15] │ +│ │ +│ **Learning Curve & Developer Experience:** Vue uses standard HTML/CSS/JS │ +│ with Single File Components, making it easier to onboard developers with │ +│ traditional web backgrounds. React uses JSX (combining HTML and JavaScript), │ +│ which has a steeper initial curve but becomes natural quickly. Vue's │ +│ official routing and state solutions (Vue Router, Pinia) reduce │ +│ architectural decision-making overhead. React requires selecting from a │ +│ broader ecosystem (Redux/Zustand, React Router, etc.), offering more │ +│ flexibility but more upfront choices. [Sources 1, 13, 14] │ +│ │ +│ **Ecosystem & Hiring:** React has ~44M+ weekly npm downloads vs Vue's │ +│ ~4.5M+. React has 225K+ GitHub stars vs Vue's 207K+. Fortune 500 adoption is │ +│ 47% React vs 12% Vue. React job postings vastly outnumber Vue's (e.g., │ +│ 3,200+ vs 680+/month in one market). React's ecosystem is larger and more │ +│ mature, making hiring and long-term support easier for large enterprises. 
│ +│ [Sources 14, 25] │ +│ │ +│ **Enterprise Scalability:** React's flexibility and massive ecosystem make │ +│ it the safer long-term choice for large, complex, cross-team enterprise │ +│ platforms. Angular is also a strong contender here with built-in DI, forms, │ +│ and strict structure. Vue excels when time-to-market and developer │ +│ productivity are top priorities and when teams want a progressive, │ +│ opinionated setup with lower onboarding cost. [Sources 13, 14, 23] │ +│ │ +│ **Recommendation:** For most large enterprise frontends in 2026, React │ +│ (often via Next.js) remains the dominant and lowest-risk choice due to its │ +│ ecosystem size, hiring market, enterprise adoption, and React 19's │ +│ architectural improvements. Vue is a compelling choice for enterprises │ +│ prioritizing developer velocity, lower onboarding costs, and smaller bundle │ +│ sizes, particularly in Asia-Pacific markets or mid-size SaaS platforms. │ +│ Neither choice is technically wrong—both are production-proven at scale. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ React vs Vue: Frontend │ According to the 2025 State of │ 0.88 │ +│ │ Frameworks Compared in 2025 │ JavaScript survey, React │ │ +│ │ https://automation-ops.com/bl │ continues to dominate with a │ │ +│ │ og/react-vs-vue-frontend-fram │ 42% market share among │ │ +│ │ eworks-compared │ professional developers, while │ │ +│ │ │ Vue has grown to capture 28% │ │ +│ │ │ of the market. Vue 4 showed a │ │ +│ │ │ 15% faster initial render time │ │ +│ │ │ compared to React 19 in │ │ +│ │ │ large-scale applications with │ │ +│ │ │ thousands of components. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Angular vs. 
React vs. Vue.js: │ The focus in 2025 has shifted │ 0.82 │ +│ │ A performance guide for 2026 │ away from basic component │ │ +│ │ - LogRocket Blog │ logic toward reactivity │ │ +│ │ https://blog.logrocket.com/an │ models, hydration strategies, │ │ +│ │ gular-vs-react-vs-vue-js-perf │ and compiler-driven │ │ +│ │ ormance/ │ performance optimizations. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ React vs Next.js vs Vue: │ React remains the foundation │ 0.80 │ +│ │ Which Frontend Framework Wins │ for modern frontend │ │ +│ │ in 2026? - DEV Community │ development with 80% of │ │ +│ │ https://dev.to/ciphernutz/rea │ enterprise teams still using │ │ +│ │ ct-vs-nextjs-vs-vue-which-fro │ it directly or via Next.js. │ │ +│ │ ntend-framework-wins-in-2025- │ │ │ +│ │ 26gj │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ The 2025 Tech Stack Dilemma: │ According to the 2025 State of │ 0.78 │ +│ │ React vs Vue vs Angular for │ JavaScript survey, developers │ │ +│ │ Enterprise Applications │ using frameworks report 35-50% │ │ +│ │ https://www.codertrove.com/ar │ faster development cycles │ │ +│ │ ticles/2025-tech-stack-dilemm │ compared to vanilla │ │ +│ │ a-react-vs-vue-vs-angular-for │ JavaScript. The 2024 State of │ │ +│ │ -enterprise-application │ JavaScript survey reveals that │ │ +│ │ │ 78% of developers cite 'faster │ │ +│ │ │ development' as their primary │ │ +│ │ │ reason for adoption. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Web Development with React vs │ React maintains its dominant │ 0.85 │ +│ │ Vue.js: 2025 Comparison | │ position with approximately │ │ +│ │ iTechDev Blog │ 68% market share among │ │ +│ │ https://www.itechdev.com.mx/b │ enterprise applications │ │ +│ │ log/react-vs-vue-comparison-2 │ globally. Vue 3.4 creates │ │ +│ │ 025 │ 1,000 rows in 38ms vs React │ │ +│ │ │ 19's 42ms. 
Bundle size │ │ +│ │ │ (min+gzip): React 44KB, Vue │ │ +│ │ │ 33KB. Fortune 500 adoption: │ │ +│ │ │ React 47%, Vue 12%. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ React 19 Features & Updates │ React 19 emerges as a landmark │ 0.87 │ +│ │ (2025): What's New & Why It │ release that brings │ │ +│ │ Matters - WEQ │ significant enhancements to │ │ +│ │ https://weqtechnologies.com/r │ performance, developer │ │ +│ │ eact-19-features-updates-2025 │ experience, and scalability. │ │ +│ │ -whats-new-why-it-matters/ │ This update builds on the │ │ +│ │ │ foundations laid by React 18, │ │ +│ │ │ introducing powerful new │ │ +│ │ │ features like the React │ │ +│ │ │ Compiler, Actions API, and │ │ +│ │ │ enhanced support for React │ │ +│ │ │ Server Components. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ React 19: Architecture │ The React Compiler │ 0.83 │ +│ │ Shifts, Performance │ automatically handles │ │ +│ │ Optimization, and the Future │ memoization, rendering hooks │ │ +│ │ of Enterprise Web Development │ like useMemo and useCallback │ │ +│ │ https://pblinuxtech.com/react │ largely redundant for │ │ +│ │ -19-architecture-shifts-perfo │ performance optimization. │ │ +│ │ rmance-optimization-and-the-f │ Native support for Server │ │ +│ │ uture-of-enterprise-web-devel │ Components allows for │ │ +│ │ opment/ │ zero-bundle-size dependencies │ │ +│ │ │ and direct database access, │ │ +│ │ │ optimizing the use of │ │ +│ │ │ Linux-based edge runtimes. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Vue.js in the Enterprise: Why │ By 2026, more │ 0.79 │ +│ │ More Companies Are Choosing │ organizations—startups, │ │ +│ │ Vue in 2026 – Manifest │ Fortune 500 companies, large │ │ +│ │ https://manifestinfotech.com/ │ SaaS platforms, and government │ │ +│ │ vue-js-in-the-enterprise-why- │ tech teams—are adopting Vue │ │ +│ │ more-companies-are-choosing-v │ for mission-critical │ │ +│ │ ue-in-2026/ │ applications. Pinia, now the │ │ +│ │ │ official store for Vue, │ │ +│ │ │ delivers TypeScript-first │ │ +│ │ │ architecture, lightweight │ │ +│ │ │ design, better devtools │ │ +│ │ │ integration, faster global │ │ +│ │ │ state handling. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ The State of Vue.js Report │ This report, created in │ 0.84 │ +│ │ 2025 │ collaboration with Evan You │ │ +│ │ https://stateofvue.framer.web │ and the Vue and Nuxt Core │ │ +│ │ site/ │ Teams, offers unique insights │ │ +│ │ │ across 150 virtual pages. │ │ +│ │ │ We've included 16 real-world │ │ +│ │ │ case studies from leading │ │ +│ │ │ brands, including GitLab, Hack │ │ +│ │ │ The Box, Storyblok, Booksy, │ │ +│ │ │ and DocPlanner. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ React vs Angular vs Vue: │ React, maintained by Meta, is │ 0.84 │ +│ │ Choosing the Best for │ a declarative, component-based │ │ +│ │ Enterprise in 2025 │ library for building user │ │ +│ │ https://softwarelogic.co/en/b │ interfaces. Its virtual DOM │ │ +│ │ log/which-javascript-framewor │ and one-way data flow provide │ │ +│ │ k-is-best-for-enterprise-reac │ outstanding performance and │ │ +│ │ t-angular-or-vue │ flexibility. Vue is loved for │ │ +│ │ │ its gentle learning curve and │ │ +│ │ │ progressive adoption. 
Angular │ │ +│ │ │ is designed for large, complex │ │ +│ │ │ enterprise applications where │ │ +│ │ │ structure and scalability are │ │ +│ │ │ paramount. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 11 │ React vs Vue: which one │ React is built for scale. Its │ 0.86 │ +│ │ should you choose in 2025? | │ flexibility, huge ecosystem, │ │ +│ │ DECODE │ and massive job market make it │ │ +│ │ https://decode.agency/article │ the safest choice for │ │ +│ │ /react-vs-vue/ │ enterprise-grade apps. Vue is │ │ +│ │ │ built for speed. With a gentle │ │ +│ │ │ learning curve and official │ │ +│ │ │ tools baked in, teams can move │ │ +│ │ │ faster and deliver MVPs or │ │ +│ │ │ mid-size apps quickly. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 12 │ What is React.js in 2025 and │ In React 19, that same Reactjs │ 0.82 │ +│ │ why React 19 changed │ library comes with first-class │ │ +│ │ front-end again | Merge │ async workflows, server │ │ +│ │ https://merge.rocks/blog/what │ components, and metadata │ │ +│ │ -is-react-js-in-2025-and-why- │ management, so teams spend │ │ +│ │ react-19-changed-front-end-ag │ less time gluing libraries │ │ +│ │ ain │ together and more time on │ │ +│ │ │ product work. The React team │ │ +│ │ │ also ships React Compiler, │ │ +│ │ │ currently in beta, which │ │ +│ │ │ automatically optimizes many │ │ +│ │ │ components that used to │ │ +│ │ │ require manual memoization. 
│ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Real-world 2026 │ No sources provided │ +│ │ enterprise migration │ firsthand accounts of │ +│ │ case studies from React │ enterprises switching │ +│ │ to Vue or vice versa │ frameworks in 2026 with │ +│ │ │ documented outcomes, only │ +│ │ │ general advocacy pieces. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ scope_exceeded │ Angular vs React vs Vue │ The question focused on │ +│ │ head-to-head in 2026 │ React vs Vue, but Angular │ +│ │ enterprise contexts │ is a significant │ +│ │ │ competitor in large │ +│ │ │ enterprise contexts. Full │ +│ │ │ three-way comparison with │ +│ │ │ 2026 data was not │ +│ │ │ available. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Vue 4 specific features │ One source │ +│ │ and release status │ (automation-ops.com) │ +│ │ │ mentions 'Vue 4' with │ +│ │ │ 'enhanced composition API │ +│ │ │ features', but most other │ +│ │ │ sources discuss Vue 3.x │ +│ │ │ as the current version. │ +│ │ │ Vue 4 release status is │ +│ │ │ unclear. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ source_not_found │ Verified 2026 salary and │ Salary data found was │ +│ │ hiring market data │ market-specific (Mexico) │ +│ │ │ and from 2025; global │ +│ │ │ 2026 enterprise hiring │ +│ │ │ cost comparison between │ +│ │ │ React and Vue developers │ +│ │ │ was not available. 
│ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ database │ Vue 4 release │ One source │ +│ │ │ date features │ references Vue 4 │ +│ │ │ official │ with enhanced │ +│ │ │ announcement 2025 │ composition API, │ +│ │ │ 2026 │ but most sources │ +│ │ │ │ still discuss Vue │ +│ │ │ │ 3.x; clarifying │ +│ │ │ │ whether Vue 4 has │ +│ │ │ │ been released is │ +│ │ │ │ important for │ +│ │ │ │ accurate │ +│ │ │ │ comparison. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ React Server │ SSR tooling │ +│ │ │ Components vs │ (Next.js vs Nuxt) │ +│ │ │ Nuxt SSR │ is a key │ +│ │ │ enterprise │ enterprise │ +│ │ │ performance │ decision factor │ +│ │ │ comparison 2025 │ mentioned across │ +│ │ │ 2026 │ sources but not │ +│ │ │ │ deeply │ +│ │ │ │ benchmarked │ +│ │ │ │ head-to-head. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ State of │ Multiple sources │ +│ │ │ JavaScript 2025 │ cite the 2025 │ +│ │ │ full survey │ State of │ +│ │ │ results React Vue │ JavaScript survey │ +│ │ │ Angular market │ but only with │ +│ │ │ share │ partial data; the │ +│ │ │ │ full report would │ +│ │ │ │ provide │ +│ │ │ │ authoritative │ +│ │ │ │ market share │ +│ │ │ │ figures. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ contradiction │ null │ Vue 4 vs Vue 3 │ Automation-ops │ +│ │ │ current version │ references 'Vue │ +│ │ │ enterprise 2025 │ 4' with benchmark │ +│ │ │ 2026 │ data but other │ +│ │ │ │ sources │ +│ │ │ │ consistently │ +│ │ │ │ reference Vue 3.4 │ +│ │ │ │ as current. 
This │ +│ │ │ │ is a factual │ +│ │ │ │ discrepancy that │ +│ │ │ │ could affect │ +│ │ │ │ benchmark │ +│ │ │ │ interpretation. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Has Vue 4 officially been │ One source claims Vue 4 shows │ +│ │ released, and what are its │ 15% faster initial render times │ +│ │ actual performance │ than React 19, but most sources │ +│ │ characteristics vs React 19 in │ still discuss Vue 3.4 as │ +│ │ enterprise applications? │ current. This discrepancy │ +│ │ │ affects benchmark reliability. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ How does React's new React │ React Compiler automates │ +│ │ Compiler (in beta) affect the │ memoization and is described as │ +│ │ performance gap between React │ a game-changer, but its │ +│ │ and Vue in production │ real-world impact on large │ +│ │ enterprise applications? │ enterprise codebases has not │ +│ │ │ yet been fully benchmarked │ +│ │ │ against Vue's │ +│ │ │ compiler-optimized reactivity. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ For enterprises currently on │ The State of Vue.js Report 2025 │ +│ │ Vue 2 or Vue 3, what is the │ includes a chapter on Vue 3 │ +│ │ actual cost and risk profile of │ Migration, suggesting migration │ +│ │ upgrading to future Vue │ is still a concern for many │ +│ │ versions vs migrating to React? │ enterprise teams. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How does the developer hiring │ Sources note strong Vue │ +│ │ market for Vue vs React differ │ adoption in Asia-Pacific and │ +│ │ across regions (Asia-Pacific vs │ Latin America but React │ +│ │ North America vs Europe) for │ dominance globally. Regional │ +│ │ enterprise teams planning 2026 │ hiring market differences could │ +│ │ staffing? │ significantly impact enterprise │ +│ │ │ framework choices. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ What is the total cost of │ Sources discuss development │ +│ │ ownership difference between │ cost at project level but do │ +│ │ React+Next.js and Vue+Nuxt for │ not model long-term TCO │ +│ │ a 50+ person enterprise │ including training, │ +│ │ frontend team over a 3-year │ maintenance, tooling, and │ +│ │ horizon? │ hiring costs for large teams. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.81 │ +│ Corroborating sources: 12 │ +│ Source authority: medium │ +│ Contradiction detected: True │ +│ Query specificity match: 0.85 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 56137 │ +│ Iterations: 3 │ +│ Wall time: 110.41s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 7c8dd19b-174b-4850-a2f5-28917d37c0c0 diff --git a/docs/stress-tests/M3.3-runs/10-comparative.log b/docs/stress-tests/M3.3-runs/10-comparative.log new file mode 100644 index 0000000..bdf8e78 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/10-comparative.log @@ -0,0 +1,310 @@ +Researching: Compare wind and solar capacity factors in the 
continental United +States. + +{"question": "Compare wind and solar capacity factors in the continental United States.", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:01:18.663955Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:01:19.783461Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:01:19.795497Z"} +{"question": "Compare wind and solar capacity factors in the continental United States.", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:01:19.838791Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Compare wind and solar capacity factors in the continental United States.", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:19.839685Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:19.839976Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1104, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:29.064991Z"} 
+{"step": 12, "decision": "Starting iteration 3/5", "tokens_so_far": 8211, "event": "iteration_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:38.391464Z"} +{"step": 19, "decision": "Token budget reached before iteration 4: 23963/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:45.620609Z"} +{"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23963, "event": "synthesis_start", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:01:45.620851Z"} +{"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 72249, "event": "synthesis_complete", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:55.647112Z"} +{"step": 40, "decision": "Research complete", "confidence": 0.88, "citation_count": 10, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 99.134, "event": "complete", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:55.648194Z"} +{"confidence": 0.88, "citations": 10, "gaps": 4, "discovery_events": 4, "tokens_used": 48230, "iterations_run": 3, "wall_time_sec": 95.80813455581665, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:02:55.648284Z"} +{"error": "[Errno 13] Permission denied: 
'/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:02:55.648701Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:02:55.654584Z"} +{"trace_id": "e3fa81c3-eaff-4f76-9b50-d61e70e54540", "confidence": 0.88, "citations": 10, "tokens_used": 48230, "wall_time_sec": 95.80813455581665, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:02:55.883067Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Wind and solar capacity factors in the continental United States differ │ +│ notably, with wind generally outperforming utility-scale solar on an annual │ +│ average basis, though both vary significantly by location and season. │ +│ │ +│ **Wind Capacity Factors:** In 2023, the U.S. wind turbine fleet had an │ +│ average capacity factor of 33.5%, which was an eight-year low driven by │ +│ weaker-than-normal wind speeds (down from the 2022 all-time high of 35.9%). │ +│ Wind capacity factors are highest in spring (March–April) and lowest in │ +│ summer. In April 2024, wind generation hit a record 47.7 TWh, exceeding coal │ +│ generation for the second consecutive month. The NREL wind resource │ +│ assessment identifies areas with capacity factors ≥30% (generally mean │ +│ annual wind speeds ≥6.4 m/s) as suitable for development, with the │ +│ highest-potential zones in the central Great Plains. The U.S. total │ +│ installed wind capacity reached ~150,500 MW by end of 2023. │ +│ │ +│ **Solar (Utility-Scale PV) Capacity Factors:** The weighted average U.S. │ +│ utility-scale solar capacity factor was 23.5% in 2023, down 0.7 percentage │ +│ points from 24.2% in 2022. 
NREL's Annual Technology Baseline categorizes │ +│ utility-scale PV capacity factors into 10 resource classes based on mean │ +│ global horizontal irradiance (GHI); the desert Southwest achieves the │ +│ highest factors, while northern states achieve at least ~70% of the │ +│ Southwest's value. Solar generation is highest in summer and lowest in │ +│ winter, opposite to wind seasonality. │ +│ │ +│ **Comparison Summary:** On an annual fleet-wide average, wind capacity │ +│ factors (~33–36%) are materially higher than utility-scale solar capacity │ +│ factors (~23–24%). However, the two resources are complementary seasonally: │ +│ wind peaks in spring, solar peaks in summer. Both are intermittent │ +│ resources. In 2025, wind and solar together generated a record 17% of U.S. │ +│ electricity (wind: 464,000 GWh; utility-scale solar: 296,000 GWh), │ +│ reflecting wind's larger current installed base despite solar's faster │ +│ recent capacity growth. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Wind generation declined in │ Last year, the average │ 0.98 │ +│ │ 2023 for the first time since │ utilization rate, or capacity │ │ +│ │ the 1990s - EIA │ factor, of the wind turbine │ │ +│ │ https://www.eia.gov/todayinen │ fleet fell to an eight-year │ │ +│ │ ergy/detail.php?id=61943 │ low of 33.5% (compared with │ │ +│ │ │ 35.9% in 2022, the all-time │ │ +│ │ │ high). 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ US solar capacity factors │ The weighted average US solar │ 0.95 │ +│ │ retreat in 2023, break │ capacity factor came in at a │ │ +│ │ multiyear streak above 24% │ calculated 23.5% annually in │ │ +│ │ https://www.spglobal.com/mark │ 2023, down 0.7 percentage │ │ +│ │ et-intelligence/en/news-insig │ point from 24.2% in 2022. │ │ +│ │ hts/research/us-solar-capacit │ │ │ +│ │ y-factors-retreat-in-2023-bre │ │ │ +│ │ ak-multiyear-streak-above-24p │ │ │ +│ │ erc │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ U.S. wind generation hit │ Wind generation, meanwhile, │ 0.97 │ +│ │ record in April 2024, │ increased to a record 47.7 │ │ +│ │ exceeding coal-fired │ TWh. However, during the first │ │ +│ │ generation - EIA │ four months of 2024, │ │ +│ │ https://www.eia.gov/todayinen │ coal-fired generation was 15% │ │ +│ │ ergy/detail.php?id=62784 │ higher than wind generation in │ │ +│ │ │ the United States. Installed │ │ +│ │ │ wind power generating capacity │ │ +│ │ │ has increased substantially in │ │ +│ │ │ the United States over the │ │ +│ │ │ last 25 years, growing from │ │ +│ │ │ 2.4 gigawatts (GW) in 2000 to │ │ +│ │ │ 150.1 GW in April 2024. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Land-Based Wind Market Report │ The U.S. wind industry │ 0.97 │ +│ │ 2024: Edition | Department of │ installed 6,474 megawatts (MW) │ │ +│ │ Energy │ of new land-based wind │ │ +│ │ https://www.energy.gov/cmei/s │ capacity in 2023, bringing the │ │ +│ │ ystems/land-based-wind-market │ cumulative total to nearly │ │ +│ │ -report-2024-edition │ 150,500 MW. Also, $10.8 │ │ +│ │ │ billion was invested in 2023 │ │ +│ │ │ in land-based wind energy │ │ +│ │ │ expansion. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Utility-Scale PV | │ The 2024 ATB provides the │ 0.93 │ +│ │ Electricity | 2024 | ATB | │ average capacity factor for 10 │ │ +│ │ NREL │ resource categories in the │ │ +│ │ https://atb.nrel.gov/electric │ United States, binned by mean │ │ +│ │ ity/2024/utility-scale_pv │ GHI. Average capacity factors │ │ +│ │ │ are calculated using │ │ +│ │ │ county-level capacity factor │ │ +│ │ │ averages from the Renewable │ │ +│ │ │ Energy Potential (reV) model │ │ +│ │ │ for 1998–2021. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ NREL projects solar │ In the latest update, zones │ 0.85 │ +│ │ generation and costs for 10 │ 2-8, representing all but the │ │ +│ │ U.S. zones – pv magazine USA │ northernmost states in the │ │ +│ │ https://pv-magazine-usa.com/2 │ continental U.S., solar │ │ +│ │ 021/07/22/nrel-projects-solar │ installations have a capacity │ │ +│ │ -generation-and-costs-for-10- │ factor that is at least 70% of │ │ +│ │ u-s-zones/ │ that in the desert Southwest's │ │ +│ │ │ zone 1, the data show. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Wind and solar generated a │ In 2025, wind power generated │ 0.96 │ +│ │ record 17% of U.S. │ 464,000 GWh of electricity, 3% │ │ +│ │ electricity in 2025 - EIA │ more than in 2024. In 2025, │ │ +│ │ https://www.eia.gov/todayinen │ utility-scale solar power │ │ +│ │ ergy/detail.php?id=67367 │ generation totaled 296,000 │ │ +│ │ │ GWh, 34% more than in 2024. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ 80 and 100 Meter Wind Energy │ Windy land defined as areas │ 0.82 │ +│ │ Resource Potential for the │ with >= 30% CF*, generally │ │ +│ │ United States - NREL │ mean annual wind speeds >= 6.4 │ │ +│ │ https://docs.nrel.gov/docs/fy │ m/s... U.S. 
wind potential │ │ +│ │ 10osti/48036.pdf │ from areas with CF*>=30% is │ │ +│ │ │ enormous, with almost 10,500 │ │ +│ │ │ GW capacity at 80 m and 12,000 │ │ +│ │ │ GW capacity at 100 m. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Wind power in the United │ In 2025, 464.4 terawatt-hours │ 0.88 │ +│ │ States - Wikipedia │ were generated by wind power, │ │ +│ │ https://en.wikipedia.org/wiki │ or 10.48% of electricity in │ │ +│ │ /Wind_power_in_the_United_Sta │ the United States. In March │ │ +│ │ tes │ and April of 2024, electricity │ │ +│ │ │ generation from wind exceeded │ │ +│ │ │ generation from coal, once the │ │ +│ │ │ dominant source of U.S. │ │ +│ │ │ electricity, for an extended │ │ +│ │ │ period for the first time. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ Utility-scale U.S. solar │ In August 2024, a total of │ 0.94 │ +│ │ electricity generation │ 107.4 gigawatts (GW) of solar │ │ +│ │ continues to grow in 2024 - │ electricity generating │ │ +│ │ EIA │ capacity was operating in the │ │ +│ │ https://www.eia.gov/todayinen │ Lower 48 states compared with │ │ +│ │ ergy/detail.php?id=63324 │ 81.9 GW in August 2023... In │ │ +│ │ │ the final five months of 2024, │ │ +│ │ │ we expect new U.S. solar │ │ +│ │ │ electricity generating │ │ +│ │ │ capacity will make up 63%, or │ │ +│ │ │ nearly two-thirds, of all new │ │ +│ │ │ electricity generating │ │ +│ │ │ capacity to come online. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ scope_exceeded │ Offshore wind capacity │ The evidence gathered │ +│ │ factors │ focuses on land-based wind. 
│ +│ │ │ Offshore wind typically has │ +│ │ │ higher capacity factors │ +│ │ │ (40–50%+) than land-based │ +│ │ │ wind but was not the │ +│ │ │ primary focus of the │ +│ │ │ sources retrieved. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Most recent 2024 annual │ The 2023 annual wind │ +│ │ average wind capacity │ capacity factor (33.5%) is │ +│ │ factor │ confirmed, but a final 2024 │ +│ │ │ annual figure was not found │ +│ │ │ in the sources; only │ +│ │ │ monthly records for April │ +│ │ │ 2024 were available. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Regional breakdown of wind │ State- or region-level │ +│ │ vs. solar capacity factors │ direct comparisons of wind │ +│ │ within the continental U.S. │ vs. solar capacity factors │ +│ │ │ within the continental U.S. │ +│ │ │ were not available in the │ +│ │ │ retrieved sources. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ scope_exceeded │ Small-scale/rooftop solar │ The 23.5% solar capacity │ +│ │ capacity factors │ factor applies to │ +│ │ │ utility-scale solar. │ +│ │ │ Distributed/rooftop solar │ +│ │ │ typically has lower │ +│ │ │ capacity factors due to │ +│ │ │ suboptimal orientation; │ +│ │ │ this was not quantified in │ +│ │ │ the retrieved evidence. │ +└──────────────────┴─────────────────────────────┴─────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ database │ U.S. 
offshore │ Offshore wind has │ +│ │ │ wind capacity │ substantially │ +│ │ │ factors 2023 2024 │ higher capacity │ +│ │ │ compared to │ factors than │ +│ │ │ land-based wind │ land-based wind │ +│ │ │ and solar │ and solar, which │ +│ │ │ │ would complete │ +│ │ │ │ the renewable │ +│ │ │ │ capacity factor │ +│ │ │ │ comparison │ +│ │ │ │ picture. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ NREL ATB 2024 │ NREL ATB provides │ +│ │ │ utility-scale │ wind capacity │ +│ │ │ wind capacity │ factors by │ +│ │ │ factor by │ resource class │ +│ │ │ resource class │ similar to solar, │ +│ │ │ continental US │ enabling direct │ +│ │ │ │ apples-to-apples │ +│ │ │ │ regional │ +│ │ │ │ comparison with │ +│ │ │ │ solar CF data. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ seasonal wind vs │ Wind peaks in │ +│ │ │ solar capacity │ spring, solar in │ +│ │ │ factor │ summer—understand │ +│ │ │ complementarity │ ing this │ +│ │ │ United States │ complementarity │ +│ │ │ grid balancing │ is critical for │ +│ │ │ │ grid planning and │ +│ │ │ │ storage │ +│ │ │ │ requirements. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ new_source │ database │ EIA Electric │ The 2024 │ +│ │ │ Power Monthly │ full-year wind │ +│ │ │ 2024 annual wind │ capacity factor │ +│ │ │ capacity factor │ would allow │ +│ │ │ final │ updated │ +│ │ │ │ comparison with │ +│ │ │ │ the 2023 solar │ +│ │ │ │ capacity factor │ +│ │ │ │ of 23.5%. 
│ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ How do wind and solar capacity │ Texas led wind capacity │ +│ │ factors compare on a regional │ additions in 2023 (1,323 MW) │ +│ │ basis within the continental │ and is the second-largest │ +│ │ U.S., particularly in states │ utility-scale solar state (18.8 │ +│ │ like Texas and California that │ GW). California leads solar. │ +│ │ have significant installations │ Regional comparisons would │ +│ │ of both? │ clarify where each resource is │ +│ │ │ most competitive. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ What is the projected │ NREL's ATB provides │ +│ │ trajectory of utility-scale │ Advanced/Moderate/Conservative │ +│ │ solar capacity factors as │ scenarios for solar CF │ +│ │ technology improves, and will │ improvements through 2050, and │ +│ │ solar eventually close the gap │ solar capacity additions are │ +│ │ with wind on a fleet-wide │ now outpacing wind. The │ +│ │ average basis? │ convergence timeline is │ +│ │ │ unclear. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How did the 2023 wind │ Wind generation fell 2.1% in │ +│ │ generation decline (due to low │ 2023 to an eight-year-low │ +│ │ wind speeds) affect investment │ capacity factor of 33.5%, while │ +│ │ decisions for new wind vs. │ solar continued growing. This │ +│ │ solar projects? │ may have influenced utility │ +│ │ │ procurement decisions. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the capacity factor of │ The DOE Wind Market Reports │ +│ │ offshore wind installations in │ cover offshore wind separately, │ +│ │ the U.S., and how does it │ and offshore wind typically │ +│ │ compare to both land-based wind │ achieves materially higher │ +│ │ and utility-scale solar? │ capacity factors than │ +│ │ │ land-based wind (~40–50%), but │ +│ │ │ this was not quantified in the │ +│ │ │ retrieved sources. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ How does the Inflation │ The IRA led to significant │ +│ │ Reduction Act's impact on wind │ near-term wind deployment │ +│ │ and solar deployment affect │ forecast increases and billions │ +│ │ future capacity factor trends, │ in domestic supply chain │ +│ │ given that larger, more │ investment. Average wind │ +│ │ efficient turbines and │ turbine capacity grew to 3.4 MW │ +│ │ better-sited projects may │ in 2023, up 375% since │ +│ │ improve wind CFs? │ 1998–1999. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.88 │ +│ Corroborating sources: 10 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 0.85 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 48230 │ +│ Iterations: 3 │ +│ Wall time: 95.81s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: e3fa81c3-eaff-4f76-9b50-d61e70e54540 diff --git a/docs/stress-tests/M3.3-runs/11-contradiction.log b/docs/stress-tests/M3.3-runs/11-contradiction.log new file mode 100644 index 0000000..ceac8ac --- /dev/null +++ b/docs/stress-tests/M3.3-runs/11-contradiction.log @@ -0,0 +1,236 @@ +Researching: Is red wine good for cardiovascular health? 
+ +{"question": "Is red wine good for cardiovascular health?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:02:56.517038Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:02:57.298051Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:02:57.308234Z"} +{"question": "Is red wine good for cardiovascular health?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:02:57.343434Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Is red wine good for cardiovascular health?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:57.343753Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:02:57.343847Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1097, "event": "iteration_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:09.450890Z"} +{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8466, "event": "iteration_start", "researcher": 
"web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:15.393838Z"} +{"step": 19, "decision": "Token budget reached before iteration 4: 22139/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:24.405453Z"} +{"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 19, "iterations_run": 3, "tokens_used": 22139, "event": "synthesis_start", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:03:24.405621Z"} +{"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 50486, "event": "synthesis_complete", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:13.808158Z"} +{"step": 37, "decision": "Research complete", "confidence": 0.72, "citation_count": 9, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 78.676, "event": "complete", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:13.808851Z"} +{"confidence": 0.72, "citations": 9, "gaps": 3, "discovery_events": 3, "tokens_used": 42350, "iterations_run": 3, "wall_time_sec": 76.46466898918152, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:04:13.808934Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": 
"96acce3c-853d-40b7-ba02-c721ac59f85d", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:04:13.809517Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:04:13.813434Z"} +{"trace_id": "96acce3c-853d-40b7-ba02-c721ac59f85d", "confidence": 0.72, "citations": 9, "tokens_used": 42350, "wall_time_sec": 76.46466898918152, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:04:14.104351Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ The relationship between red wine and cardiovascular health is nuanced and │ +│ contested. Historically, observational studies found that moderate drinkers │ +│ (at least one drink per day) were 30–40% less likely to die from │ +│ cardiovascular disease compared to non-drinkers, a pattern sometimes called │ +│ the 'J-shaped mortality curve' [NYT/AHA]. Red wine specifically contains │ +│ polyphenols (including flavonoids and resveratrol) that may inhibit LDL │ +│ oxidation, prevent endothelial dysfunction, raise HDL cholesterol, and │ +│ decrease fibrinogen concentrations [Circulation Research; PMC6804046]. │ +│ However, no study has established a direct cause-and-effect link between red │ +│ wine consumption and improved heart health [AHA]. More recent analyses │ +│ suggest the apparent benefit may reflect confounding factors—moderate │ +│ drinkers may have healthier lifestyles overall—and methodological flaws such │ +│ as including former drinkers (who quit due to illness) in the abstainer │ +│ group [NYT; Three Spirit]. The 'French Paradox,' which popularized the red │ +│ wine-heart health hypothesis, is now being critically re-examined as a │ +│ public health myth [ResearchGate]. 
Major health organizations, including the │ +│ American Heart Association, do not recommend starting to drink red wine for │ +│ heart benefit, and current evidence does not support a causal protective │ +│ effect of alcohol on the heart. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ How Red Wine Lost Its Health │ Researchers found that those │ 0.85 │ +│ │ Halo - The New York Times │ who reported having at least │ │ +│ │ https://www.nytimes.com/2024/ │ one alcoholic drink per day │ │ +│ │ 02/17/well/eat/red-wine-heart │ were 30 to 40 percent less │ │ +│ │ -health.html │ likely to die from │ │ +│ │ │ cardiovascular disease. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Drinking red wine for heart │ No research has established a │ 0.92 │ +│ │ health? Read this before you │ cause-and-effect link between │ │ +│ │ toast | American Heart │ drinking alcohol and better │ │ +│ │ Association │ heart health. Rather, studies │ │ +│ │ https://www.heart.org/en/news │ have found an association │ │ +│ │ /2019/05/24/drinking-red-wine │ between wine and such benefits │ │ +│ │ -for-heart-health-read-this-b │ as a lower risk of dying from │ │ +│ │ efore-you-toast │ heart disease. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Red Wine and Cardiovascular │ The alcoholic component is │ 0.90 │ +│ │ Health | Circulation Research │ known to increase high-density │ │ +│ │ https://www.ahajournals.org/d │ lipoprotein cholesterol and to │ │ +│ │ oi/10.1161/CIRCRESAHA.112.278 │ decrease fibrinogen │ │ +│ │ 705?doi=10.1161/CIRCRESAHA.11 │ concentrations. 
The │ │ +│ │ 2.278705 │ polyphenols present in red │ │ +│ │ │ wine │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Wine and Cardiovascular │ Flavonoids from red wine have │ 0.88 │ +│ │ Health | Circulation │ been credited to inhibit │ │ +│ │ https://www.ahajournals.org/d │ low-density lipoprotein (LDL) │ │ +│ │ oi/10.1161/circulationaha.117 │ oxidation and prevent │ │ +│ │ .030387 │ endothelial dysfunction, which │ │ +│ │ │ is │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Red Wine Consumption and │ Red Wine Consumption and │ 0.85 │ +│ │ Cardiovascular Health - PMC │ Cardiovascular Health Luigi │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ Castaldo ... Department of │ │ +│ │ articles/PMC6804046/ │ Pharmacy, Faculty of Pharmacy, │ │ +│ │ │ University of Naples "Federico │ │ +│ │ │ II" ... Molecules. 2019 Oct │ │ +│ │ │ 8;24(19):3626. doi: │ │ +│ │ │ 10.3390/molecules24193626 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Association between Wine │ Association between Wine │ 0.87 │ +│ │ Consumption with │ Consumption with │ │ +│ │ Cardiovascular Disease and │ Cardiovascular Disease and │ │ +│ │ Cardiovascular Mortality: A │ Cardiovascular Mortality: A │ │ +│ │ Systematic Review and │ Systematic Review and │ │ +│ │ Meta-Analysis - PMC │ Meta-Analysis ... Nutrients. │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ 2023 Jun 17;15(12):2785. doi: │ │ +│ │ articles/PMC10303697/ │ 10.3390/nu15122785 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Red wine and resveratrol: │ Is red wine heart healthy? │ 0.88 │ +│ │ Good for your heart? - Mayo │ Antioxidants in red wine │ │ +│ │ Clinic │ called polyphenols may help │ │ +│ │ https://www.mayoclinic.org/di │ protect the lining of blood │ │ +│ │ seases-conditions/heart-disea │ vessels in the heart. 
· │ │ +│ │ se/in-depth/red-wine/art-2004 │ Resveratrol in red wine. │ │ +│ │ 8281 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Debunking the 'wine is │ In the early nineties, a TV │ 0.65 │ +│ │ healthy' myth – Three Spirit │ show in the US reported lower │ │ +│ │ US │ heart attack rates in │ │ +│ │ https://us.threespiritdrinks. │ France... The report framed │ │ +│ │ com/blogs/blog/where-the-wine │ the country's regular │ │ +│ │ -is-healthy-myth-came-from │ consumption of alcohol, in │ │ +│ │ │ particular red wine, as the │ │ +│ │ │ reason behind this, claiming │ │ +│ │ │ that it reduced that risk of │ │ +│ │ │ heart disease. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Revisiting the French │ The "French Paradox," the │ 0.78 │ +│ │ Paradox: Deconstructing a │ hypothesis that moderate red │ │ +│ │ Public Health Myth and its │ wine consumption explains │ │ +│ │ Global Commercial Legacy │ France's historically low │ │ +│ │ https://www.researchgate.net/ │ coronary heart disease rates │ │ +│ │ publication/399257280_Title_R │ │ │ +│ │ evisiting_the_French_Paradox_ │ │ │ +│ │ Deconstructing_a_Public_Healt │ │ │ +│ │ h_Myth_and_its_Global_Commerc │ │ │ +│ │ ial_Legacy │ │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Randomized controlled │ Most evidence is │ +│ │ trial evidence on red │ observational. Robust RCT │ +│ │ wine and cardiovascular │ data directly testing red │ +│ │ outcomes │ wine's causal │ +│ │ │ cardiovascular effect in │ +│ │ │ humans is lacking and not │ +│ │ │ surfaced in available │ +│ │ │ sources. 
│ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Differential effect of │ Some sources attribute │ +│ │ red wine vs. other │ benefits to polyphenols │ +│ │ alcohol types on │ specific to red wine, │ +│ │ cardiovascular health │ while others suggest the │ +│ │ │ effect is due to alcohol │ +│ │ │ in general, making it │ +│ │ │ unclear whether red wine │ +│ │ │ is uniquely beneficial. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ access_denied │ Full text of 2023 │ The PMC10303697 │ +│ │ meta-analysis findings │ meta-analysis page header │ +│ │ │ was retrieved but full │ +│ │ │ results/conclusions were │ +│ │ │ not available in the │ +│ │ │ scraped content. │ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ contradiction │ database │ randomized │ Observational │ +│ │ │ controlled trial │ studies suggest │ +│ │ │ red wine │ benefit, but no │ +│ │ │ polyphenols │ causal link │ +│ │ │ cardiovascular │ established; RCT │ +│ │ │ outcomes │ evidence needed │ +│ │ │ │ to resolve │ +│ │ │ │ contradiction. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ resveratrol │ Resveratrol is │ +│ │ │ bioavailability │ cited as a key │ +│ │ │ cardiovascular │ mechanism but its │ +│ │ │ human clinical │ bioavailability │ +│ │ │ trials 2022 2023 │ from wine in │ +│ │ │ 2024 │ clinically │ +│ │ │ │ meaningful doses │ +│ │ │ │ is debated. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ sick quitter bias │ The J-shaped │ +│ │ │ abstainer │ curve may be an │ +│ │ │ misclassification │ artifact of │ +│ │ │ alcohol │ methodological │ +│ │ │ cardiovascular │ flaws (sick │ +│ │ │ epidemiology │ quitters included │ +│ │ │ │ in abstainer │ +│ │ │ │ group), which │ +│ │ │ │ undermines │ +│ │ │ │ earlier │ +│ │ │ │ protective │ +│ │ │ │ findings. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Does the apparent │ Observational J-curve studies │ +│ │ cardiovascular benefit of │ may misclassify former drinkers │ +│ │ moderate red wine consumption │ who quit due to illness as │ +│ │ disappear when sick quitters │ non-drinkers, inflating the │ +│ │ are properly excluded from the │ apparent benefit of moderate │ +│ │ abstainer comparison group? │ drinking. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Is the cardiovascular effect of │ Circulation Research notes both │ +│ │ red wine attributable to │ the alcohol component and │ +│ │ polyphenols (resveratrol, │ polyphenols independently │ +│ │ flavonoids) or simply to the │ affect cardiovascular markers, │ +│ │ alcohol content? │ but their relative contribution │ +│ │ │ is unclear. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What do the most recent │ The 2023 PMC meta-analysis was │ +│ │ meta-analyses (2022–2024) │ identified but its full │ +│ │ conclude about wine consumption │ conclusions were not accessible │ +│ │ and cardiovascular mortality │ in the retrieved content. │ +│ │ after correcting for │ │ +│ │ confounders? 
│ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Are there subpopulations (e.g., │ Current guidance is │ +│ │ by age, sex, genetic profile) │ population-level; individual │ +│ │ for whom moderate red wine │ variation in alcohol metabolism │ +│ │ consumption might confer │ and cardiovascular risk │ +│ │ measurable cardiovascular │ profiles may produce different │ +│ │ benefit? │ outcomes. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.72 │ +│ Corroborating sources: 7 │ +│ Source authority: high │ +│ Contradiction detected: True │ +│ Query specificity match: 0.85 │ +│ Budget status: spent │ +│ Recency: recent │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 42350 │ +│ Iterations: 3 │ +│ Wall time: 76.46s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 96acce3c-853d-40b7-ba02-c721ac59f85d diff --git a/docs/stress-tests/M3.3-runs/12-contradiction.log b/docs/stress-tests/M3.3-runs/12-contradiction.log new file mode 100644 index 0000000..2154e1a --- /dev/null +++ b/docs/stress-tests/M3.3-runs/12-contradiction.log @@ -0,0 +1,330 @@ +Researching: Does intermittent fasting extend lifespan in humans? 
+ +{"question": "Does intermittent fasting extend lifespan in humans?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:04:14.725578Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:04:15.543876Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:04:15.553451Z"} +{"question": "Does intermittent fasting extend lifespan in humans?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:04:15.587475Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does intermittent fasting extend lifespan in humans?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:15.587815Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:15.587912Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1148, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:22.802797Z"} +{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 8443, "event": 
"iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:26.505496Z"} +{"step": 21, "decision": "Starting iteration 4/5", "tokens_so_far": 18167, "event": "iteration_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:43.089460Z"} +{"step": 26, "decision": "Token budget reached before iteration 5: 36705/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:47.193645Z"} +{"step": 27, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 26, "iterations_run": 4, "tokens_used": 36705, "event": "synthesis_start", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:04:47.193894Z"} +{"step": 28, "decision": "Parsed synthesis JSON successfully", "duration_ms": 76890, "event": "synthesis_complete", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:00.759366Z"} +{"step": 48, "decision": "Research complete", "confidence": 0.72, "citation_count": 11, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 109.604, "event": "complete", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:00.760365Z"} +{"confidence": 0.72, "citations": 11, "gaps": 4, "discovery_events": 4, "tokens_used": 62781, "iterations_run": 4, "wall_time_sec": 105.17169857025146, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": 
"c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:06:00.760468Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:06:00.760848Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:06:00.765020Z"} +{"trace_id": "c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3", "confidence": 0.72, "citations": 11, "tokens_used": 62781, "wall_time_sec": 105.17169857025146, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:06:00.989582Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Current scientific evidence does NOT conclusively demonstrate that │ +│ intermittent fasting (IF) extends lifespan in humans. While IF has proven │ +│ lifespan-extending effects in animal models (particularly rodents), and │ +│ improves multiple healthspan markers in humans—including weight, insulin │ +│ resistance, inflammation, dyslipidemia, hypertension, oxidative stress, and │ +│ autophagy—direct evidence of increased human lifespan from IF is lacking. │ +│ Mechanistically, IF triggers 'adaptive stress' in cells, activating │ +│ antioxidant production, DNA repair, autophagy (via spermidine-mediated │ +│ pathways), and reduced inflammation, all of which are theoretically linked │ +│ to longevity [InsideTracker, FORTH/Nature Cell Biology]. A 2024 review in │ +│ Ageing Research Reviews concluded IF 'can be considered a │ +│ non-pharmacological strategy to extend lifespan' and has been 'proven to │ +│ extend lifespan in rodent models,' but human translation remains unconfirmed │ +│ [ScienceDirect/PubMed]. 
A scoping review of RCTs found IF improves │ +│ aging-related biomarkers in adults but stopped short of claiming lifespan │ +│ extension [PMC]. A 2024 Nature study on genetically diverse mice showed │ +│ dietary restriction (including IF) extends healthy lifespan in mice but its │ +│ human relevance is unclear. Critically, a major 2024 AHA-presented │ +│ observational study of 20,000+ U.S. adults found that eating within an │ +│ 8-hour window was associated with a 91% higher risk of cardiovascular death │ +│ compared to eating across 12–16 hours—though this study has been heavily │ +│ criticized for methodological limitations including confounding variables │ +│ (demographics, pre-existing disease) and reliance on only two days of │ +│ dietary recall data [AHA, WebMD, Forbes]. In summary, IF improves several │ +│ biomarkers associated with healthy aging in humans, and extends lifespan in │ +│ animals, but no long-term human RCT has demonstrated actual lifespan │ +│ extension, and some observational data raise cardiovascular safety concerns. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Intermittent fasting and │ IF can be considered as a │ 0.95 │ +│ │ longevity: From animal models │ non-pharmacological strategy │ │ +│ │ to implication for humans - │ to extend lifespan. IF │ │ +│ │ ScienceDirect │ improves physiological │ │ +│ │ https://www.sciencedirect.com │ function, enhances │ │ +│ │ /science/article/abs/pii/S156 │ performance, and slows aging. │ │ +│ │ 8163724000928 │ IF was proven to extend │ │ +│ │ │ lifespan in rodent models. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Intermittent fasting and │ Findings to date from both │ 0.95 │ +│ │ longevity: From animal models │ human and animal experiments │ │ +│ │ to implication for humans - │ indicate that fasting improves │ │ +│ │ PubMed │ physiological function, │ │ +│ │ https://pubmed.ncbi.nlm.nih.g │ enhances performance, and │ │ +│ │ ov/38499159/ │ slows aging and disease │ │ +│ │ │ processes. Metabolic and │ │ +│ │ │ cellular responses triggered │ │ +│ │ │ by IF could help to achieve │ │ +│ │ │ the aim of preventing disease, │ │ +│ │ │ and maximizing healthspan and │ │ +│ │ │ longevity with minimal side │ │ +│ │ │ effects. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ How Intermittent Fasting │ In humans, intermittent │ 0.88 │ +│ │ Impacts Longevity: A Summary │ fasting improves weight, │ │ +│ │ of the Research - │ insulin resistance, │ │ +│ │ InsideTracker │ inflammation, dyslipidemia, │ │ +│ │ https://www.insidetracker.com │ and hypertension. IF has also │ │ +│ │ /a/articles/how-intermittent- │ reduced tumor growth, boosted │ │ +│ │ fasting-impacts-longevity │ stem cell production, and │ │ +│ │ │ increased lifespan in mice. │ │ +│ │ │ During fasting, cells undergo │ │ +│ │ │ adaptive stress, which │ │ +│ │ │ activates different pathways │ │ +│ │ │ in the body, resulting in a │ │ +│ │ │ range of effects, including │ │ +│ │ │ increased production of │ │ +│ │ │ antioxidants, DNA repair, │ │ +│ │ │ autophagy. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Effects of Intermittent │ In humans, │ 0.97 │ +│ │ Fasting on Health, Aging, and │ intermittent-fasting │ │ +│ │ Disease - NEJM │ interventions ameliorate │ │ +│ │ https://www.nejm.org/doi/full │ obesity, insulin resistance, │ │ +│ │ /10.1056/NEJMra1905136 │ dyslipidemia, hypertension, │ │ +│ │ │ and inflammation. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Impact of Intermittent │ Impact of Intermittent Fasting │ 0.90 │ +│ │ Fasting and/or Caloric │ and/or Caloric Restriction on │ │ +│ │ Restriction on Aging-Related │ Aging-Related Outcomes in │ │ +│ │ Outcomes in Adults: A Scoping │ Adults: A Scoping Review of │ │ +│ │ Review of Randomized │ Randomized Controlled Trials. │ │ +│ │ Controlled Trials - PMC │ Nutrients. 2024 Jan │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ 20;16(2):316. doi: │ │ +│ │ articles/PMC10820472/ │ 10.3390/nu16020316 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ International scientific │ intermittent fasting increases │ 0.90 │ +│ │ collaboration reveals how │ the levels of spermidine, a │ │ +│ │ intermittent fasting │ chemical compound (natural │ │ +│ │ regulates ageing through │ polyamine), that enhances the │ │ +│ │ autophagy | FORTH │ resilience and survival of │ │ +│ │ https://forth.gr/en/news/show │ cells and organisms, through │ │ +│ │ /&tid=2606 │ the activation of autophagy. │ │ +│ │ │ Autophagy defects have been │ │ +│ │ │ linked to ageing, as well as, │ │ +│ │ │ with the emergence of │ │ +│ │ │ age-related disorders. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Dietary restriction impacts │ Caloric restriction extends │ 0.92 │ +│ │ health and lifespan of │ healthy lifespan in multiple │ │ +│ │ genetically diverse mice | │ species. Intermittent fasting, │ │ +│ │ Nature │ an alternative form of dietary │ │ +│ │ https://www.nature.com/articl │ restriction, is potentially │ │ +│ │ es/s41586-024-08026-3 │ more sustainable in humans, │ │ +│ │ │ but its effectiveness remains │ │ +│ │ │ largely unexplored. │ │ +│ │ │ Identifying the most │ │ +│ │ │ efficacious forms of dietary │ │ +│ │ │ restriction is key for │ │ +│ │ │ developing interventions to │ │ +│ │ │ improve human health and │ │ +│ │ │ longevity. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Time-restricted eating may │ A popular weight loss strategy │ 0.85 │ +│ │ raise cardiovascular death │ that limits the hours during │ │ +│ │ risk in the long term | │ which calories can be consumed │ │ +│ │ American Heart Association │ may nearly double a person's │ │ +│ │ https://www.heart.org/en/news │ long-term risk of dying from │ │ +│ │ /2024/03/18/time-restricted-e │ cardiovascular disease, new │ │ +│ │ ating-may-raise-cardiovascula │ research finds, especially │ │ +│ │ r-death-risk-in-the-long-term │ among people with underlying │ │ +│ │ │ cardiovascular disease or │ │ +│ │ │ cancer. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Fasting Study Under Fire │ Those conclusions are │ 0.87 │ +│ │ After Heart Conference - │ premature and misleading, says │ │ +│ │ WebMD │ Christopher Gardner, PhD, a │ │ +│ │ https://www.webmd.com/heart-d │ professor of medicine at │ │ +│ │ isease/features/is-intermitte │ Stanford University... people │ │ +│ │ nt-fasting-bad-for-heart-heal │ in the study group who │ │ +│ │ th │ consumed all their food in a │ │ +│ │ │ daily window of 8 hours or │ │ +│ │ │ fewer had a higher percentage │ │ +│ │ │ of men, African Americans, and │ │ +│ │ │ smoke. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ Intermittent Fasting - The │ intermittent fasting activated │ 0.78 │ +│ │ Impact on Autophagy, │ autophagy, a cellular process │ │ +│ │ Inflammasome, and Senescence │ that breaks down components │ │ +│ │ https://nomix.ai/2024/05/24/f │ within cells. Autophagy has │ │ +│ │ asting-in-young-males-examini │ been linked to longevity... │ │ +│ │ ng-the-impact-on-autophagy-in │ p21 levels decreased during │ │ +│ │ flammasome-and-senescence-bio │ and after fasting. 
The │ │ +│ │ markers/ │ findings suggest that fasting │ │ +│ │ │ may contribute to delaying the │ │ +│ │ │ onset of age-related diseases. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 11 │ Effect of fasting-mimicking │ Significant between-group │ 0.82 │ +│ │ diet on markers of autophagy │ differences were observed in │ │ +│ │ and metabolic health in human │ changes from baseline to the │ │ +│ │ subjects | GeroScience │ end of the 6-day dietary │ │ +│ │ https://link.springer.com/art │ intervention for body weight, │ │ +│ │ icle/10.1007/s11357-025-02035 │ fasting glucose, BHB, HOMA-IR, │ │ +│ │ -4 │ and autophagic flux (p < │ │ +│ │ │ 0.05)... These results suggest │ │ +│ │ │ that FMD may improve │ │ +│ │ │ autophagic flux and markers of │ │ +│ │ │ metabolic health. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Long-term human RCT data │ No randomized controlled │ +│ │ on IF and all-cause │ trial has followed human │ +│ │ mortality or lifespan │ participants long enough │ +│ │ │ to measure actual │ +│ │ │ lifespan extension from │ +│ │ │ IF. All human longevity │ +│ │ │ evidence is based on │ +│ │ │ biomarker surrogates or │ +│ │ │ observational data. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Optimal IF protocol for │ Studies test different │ +│ │ longevity in humans │ protocols (TRF, ADF, 5:2, │ +│ │ │ FMD) with varying │ +│ │ │ durations and │ +│ │ │ populations, making it │ +│ │ │ impossible to identify a │ +│ │ │ single optimal regimen │ +│ │ │ for human longevity. 
│ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Cardiovascular safety of │ Short-term studies show │ +│ │ long-term IF │ cardiovascular benefit │ +│ │ │ (improved BP, glucose, │ +│ │ │ cholesterol), but the │ +│ │ │ 2024 AHA observational │ +│ │ │ study suggests possible │ +│ │ │ long-term cardiovascular │ +│ │ │ mortality risk, with │ +│ │ │ experts disputing │ +│ │ │ methodology. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ source_not_found │ IF effects across │ Most human studies focus │ +│ │ diverse demographic │ on limited populations │ +│ │ groups │ (e.g., young males, │ +│ │ │ specific ethnic groups), │ +│ │ │ limiting generalizability │ +│ │ │ of longevity findings. │ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ contradiction │ database │ time-restricted │ The AHA 2024 │ +│ │ │ eating │ study claiming │ +│ │ │ cardiovascular │ 91% higher │ +│ │ │ mortality NHANES │ cardiovascular │ +│ │ │ confounding │ death risk │ +│ │ │ variables │ contradicts │ +│ │ │ methodology │ short-term │ +│ │ │ critique 2024 │ studies showing │ +│ │ │ │ CV benefit; │ +│ │ │ │ deeper │ +│ │ │ │ methodological │ +│ │ │ │ analysis is │ +│ │ │ │ warranted. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ spermidine │ The FORTH/Nature │ +│ │ │ autophagy │ Cell Biology │ +│ │ │ intermittent │ finding on │ +│ │ │ fasting lifespan │ spermidine-mediat │ +│ │ │ human clinical │ ed autophagy is a │ +│ │ │ trial 2024 │ novel mechanism │ +│ │ │ │ that may be │ +│ │ │ │ testable in human │ +│ │ │ │ longevity trials. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ fasting mimicking │ A large │ +│ │ │ diet longevity │ registered RCT │ +│ │ │ diet RCT │ (NCT05698654) on │ +│ │ │ NCT05698654 │ fasting-mimicking │ +│ │ │ results │ and longevity │ +│ │ │ │ diet is underway; │ +│ │ │ │ results could be │ +│ │ │ │ transformative │ +│ │ │ │ for the question │ +│ │ │ │ of human lifespan │ +│ │ │ │ extension. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ telomere length │ The Frontiers in │ +│ │ │ intermittent │ Aging study on │ +│ │ │ fasting exercise │ metabolic │ +│ │ │ metabolomics │ signatures of │ +│ │ │ aging biomarkers │ combined exercise │ +│ │ │ 2024 │ and fasting links │ +│ │ │ │ to telomere │ +│ │ │ │ length, a key │ +│ │ │ │ aging biomarker │ +│ │ │ │ worth │ +│ │ │ │ investigating │ +│ │ │ │ further. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Will ongoing large-scale RCTs │ No current RCT has followed │ +│ │ (e.g., NCT05698654) provide │ participants long enough to │ +│ │ definitive evidence that IF │ measure actual lifespan; only │ +│ │ extends human lifespan or │ biomarker surrogates have been │ +│ │ healthspan? │ studied. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Does the cardiovascular │ Experts including Stanford's │ +│ │ mortality risk signal from the │ Christopher Gardner criticized │ +│ │ 2024 AHA observational study │ the study for not controlling │ +│ │ hold up after controlling for │ for demographics, pre-existing │ +│ │ confounders like pre-existing │ disease, and reason for │ +│ │ illness and dietary quality? 
│ adopting IF. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Can spermidine supplementation │ FORTH research showed IF raises │ +│ │ replicate the │ spermidine, which activates │ +│ │ autophagy-activating, │ autophagy and promotes cell │ +│ │ anti-aging effects of IF in │ survival, suggesting │ +│ │ humans who cannot sustain │ supplementation as a potential │ +│ │ fasting? │ proxy. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Which IF protocol (TRF, ADF, │ Multiple protocols are studied │ +│ │ 5:2, or FMD) produces the │ with heterogeneous populations, │ +│ │ greatest longevity-associated │ making comparative │ +│ │ biomarker improvements in │ effectiveness unclear. │ +│ │ diverse human populations? │ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ Does the 92-year-old case study │ SAGE Journals reported this as │ +│ │ of repeated 3-week annual │ the world's longest medically │ +│ │ fasting over 45 years offer any │ documented repeated fasting │ +│ │ generalizable insight into │ history; clinical parameters │ +│ │ long-term IF and human │ showed cyclic variation. │ +│ │ longevity? 
│ │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.72 │ +│ Corroborating sources: 9 │ +│ Source authority: high │ +│ Contradiction detected: True │ +│ Query specificity match: 0.85 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 62781 │ +│ Iterations: 4 │ +│ Wall time: 105.17s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: c4942f00-1b7a-40ba-a6e1-7eaae57b9ee3 diff --git a/docs/stress-tests/M3.3-runs/13-contradiction.log b/docs/stress-tests/M3.3-runs/13-contradiction.log new file mode 100644 index 0000000..3c7c0ee --- /dev/null +++ b/docs/stress-tests/M3.3-runs/13-contradiction.log @@ -0,0 +1,260 @@ +Researching: Are nuclear power plants safe? 
+ +{"question": "Are nuclear power plants safe?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:06:01.606512Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:06:02.435399Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:06:02.443368Z"} +{"question": "Are nuclear power plants safe?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:06:02.477384Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Are nuclear power plants safe?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:02.477723Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:02.477819Z"} +{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1169, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:15.136739Z"} +{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 11760, "event": "iteration_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", 
"researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:25.196255Z"} +{"step": 23, "decision": "Token budget reached before iteration 4: 29534/20000", "event": "budget_exhausted", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:35.263571Z"} +{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 33, "iterations_run": 3, "tokens_used": 29534, "event": "synthesis_start", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:06:35.263885Z"} +{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 58649, "event": "synthesis_complete", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:31.700545Z"} +{"step": 40, "decision": "Research complete", "confidence": 0.92, "citation_count": 8, "gap_count": 3, "discovery_count": 3, "total_duration_sec": 92.558, "event": "complete", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:31.701336Z"} +{"confidence": 0.92, "citations": 8, "gaps": 3, "discovery_events": 3, "tokens_used": 63429, "iterations_run": 3, "wall_time_sec": 89.22308659553528, "budget_exhausted": true, "event": "research_completed", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:07:31.701429Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "researcher": "web", "logger": 
"marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:07:31.701781Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:07:31.705585Z"} +{"trace_id": "2e2b6e88-c973-4422-919c-3838634336c9", "confidence": 0.92, "citations": 8, "tokens_used": 63429, "wall_time_sec": 89.22308659553528, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:07:32.018740Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Yes, nuclear power plants are among the safest sources of electricity │ +│ generation when measured by deaths per unit of energy produced. According to │ +│ Statista (sourcing 2018 data), nuclear energy results in approximately 0.03 │ +│ deaths per terawatt-hour (TWh), making it safer than wind (0.04), solar │ +│ (0.02 is slightly lower), natural gas (2.82), biomass (4.63), hydro (1.3), │ +│ oil (18.43), coal (24.62), and brown coal (32.72). A separate dataset from │ +│ ResearchGate reports 0.04 deaths per billion kWh for nuclear, compared to │ +│ 100 for coal. Despite three major accidents—Three Mile Island (1979), │ +│ Chernobyl (1986), and Fukushima (2011)—the overall fatality record remains │ +│ exceptionally low. At Chernobyl, the worst nuclear accident in history, 2 │ +│ workers died in the initial explosion, 28 of 134 acute radiation syndrome │ +│ patients later died, and roughly 5,000 thyroid cancer cases were │ +│ attributable to radiation exposure among those under 18 at the time │ +│ (Canadian Nuclear Safety Commission). Stanford researchers estimated │ +│ Fukushima may cause approximately 130 deaths and 180 cancer cases globally, │ +│ in addition to ~600 evacuation-related deaths. Three Mile Island caused no │ +│ direct radiation deaths. U.S. nuclear plants operate under strict NRC │ +│ oversight using a 'defense-in-depth' multi-layer safety approach (U.S. 
│ +│ Department of Energy). The IAEA also sets international design and safety │ +│ standards. Public perception of nuclear risk is widely considered │ +│ disproportionate to the statistical evidence. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Global deaths per energy │ Brown coal 32.72 | Coal 24.62 │ 0.97 │ +│ │ source | Statista │ | Oil 18.43 | Biomass 4.63 | │ │ +│ │ https://www.statista.com/stat │ Natural gas 2.82 | Hydro 1.3 | │ │ +│ │ istics/494425/death-rate-worl │ Wind 0.04 | Nuclear 0.03 | │ │ +│ │ dwide-by-energy-source/ │ Solar 0.02. Death rates are │ │ +│ │ │ measured based on deaths from │ │ +│ │ │ accidents and air pollution │ │ +│ │ │ per terawatt-hour (TWh) of │ │ +│ │ │ electricity. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ rates for each energy source │ 100 for coal, 36 for oil, 24 │ 0.91 │ +│ │ in deaths per billion kWh │ for biofuel/biomass, 4 for │ │ +│ │ produced... | ResearchGate │ natural gas, 1.4 for hydro, │ │ +│ │ https://www.researchgate.net/ │ 0.44 for solar, 0.15 for wind │ │ +│ │ figure/rates-for-each-energy- │ and 0.04 for nuclear. 
│ │ +│ │ source-in-deaths-per-billion- │ │ │ +│ │ kWh-produced-Source-Updated_t │ │ │ +│ │ bl2_272406182 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Health effects of the │ The initial steam explosion at │ 0.97 │ +│ │ Chornobyl accident | Canadian │ the Chornobyl nuclear plant │ │ +│ │ Nuclear Safety Commission │ resulted in the deaths of 2 │ │ +│ │ https://www.cnsc-ccsn.gc.ca/e │ workers, and 134 plant staff │ │ +│ │ ng/resources/health/health-ef │ and emergency workers suffered │ │ +│ │ fects-chornobyl-accident/ │ acute radiation syndrome due │ │ +│ │ │ to high doses of radiation. Of │ │ +│ │ │ these 134 people, 28 later │ │ +│ │ │ died. About 5,000 thyroid │ │ +│ │ │ cancer cases were due to │ │ +│ │ │ radioactive iodine │ │ +│ │ │ (iodine-131) exposure to │ │ +│ │ │ children or adolescents at the │ │ +│ │ │ time of the accident. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Stanford researchers │ Radiation from Japan's │ 0.93 │ +│ │ calculate global health │ Fukushima Daiichi nuclear │ │ +│ │ impacts of the Fukushima │ disaster may eventually cause │ │ +│ │ nuclear disaster | Stanford │ approximately 130 deaths and │ │ +│ │ University │ 180 cases of cancer, mostly in │ │ +│ │ https://engineering.stanford. │ Japan, Stanford researchers │ │ +│ │ edu/news/stanford-researchers │ have calculated. The numbers │ │ +│ │ -calculate-global-health-impa │ are in addition to the roughly │ │ +│ │ cts-fukushima-nuclear-disaste │ 600 deaths caused by the │ │ +│ │ r │ evacuation of the area │ │ +│ │ │ surrounding the nuclear plant │ │ +│ │ │ directly after the March 2011 │ │ +│ │ │ earthquake, tsunami and │ │ +│ │ │ meltdown. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Enhanced Safety of Advanced │ U.S. nuclear power plants are │ 0.96 │ +│ │ Reactors | U.S. 
Department of │ already among the safest and │ │ +│ │ Energy │ most secure industrial │ │ +│ │ https://www.energy.gov/ne/enh │ facilities in the world due to │ │ +│ │ anced-safety-advanced-reactor │ the industry's commitment to │ │ +│ │ s │ comprehensive safety │ │ +│ │ │ procedures, robust training │ │ +│ │ │ programs and stringent federal │ │ +│ │ │ regulation that keep nuclear │ │ +│ │ │ plants and neighboring │ │ +│ │ │ communities safe. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Three Mile Island, Chernobyl │ Estimates on nuclear's overall │ 0.88 │ +│ │ and Fukushima accidents haunt │ mortality rate are comparable │ │ +│ │ nuclear's past | MinnPost │ to solar or wind power (and │ │ +│ │ https://www.minnpost.com/othe │ roughly 2.5% that of hydro │ │ +│ │ r-nonprofit-media/2023/10/thr │ power). Oil and coal, │ │ +│ │ ee-mile-island-chernobyl-and- │ meanwhile, are as much as 800 │ │ +│ │ fukushima-accidents-haunt-nuc │ times higher. │ │ +│ │ lears-past-will-they-dictate- │ │ │ +│ │ its-future/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Devastating Consequences of │ The Chernobyl disaster, which │ 0.85 │ +│ │ Nuclear Accidents: Chernobyl, │ occurred on April 26, 1986, │ │ +│ │ Fukushima and Three Mile │ was the most significant │ │ +│ │ Island | SciTechnol │ nuclear accident in history. │ │ +│ │ https://www.scitechnol.com/pe │ The explosion and fire at the │ │ +│ │ er-review/devastating-consequ │ Chernobyl nuclear power plant │ │ +│ │ ences-of-nuclear-accidents-ch │ in Ukraine resulted in the │ │ +│ │ ernobyl-fukushima-and-three-m │ release of large amounts of │ │ +│ │ ile-island-HLGS.php?article_i │ radioactive material into the │ │ +│ │ d=21379 │ atmosphere, leading to the │ │ +│ │ │ deaths of 31 people, and │ │ +│ │ │ causing widespread │ │ +│ │ │ contamination of the │ │ +│ │ │ surrounding areas. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Laying the Foundation for New │ Domestic power reactors are │ 0.94 │ +│ │ and Advanced Nuclear Reactors │ tightly regulated by the U.S. │ │ +│ │ in the United States | │ Nuclear Regulatory Commission │ │ +│ │ National Academies │ (NRC) in all phases of their │ │ +│ │ https://www.nationalacademies │ life cycle—design, │ │ +│ │ .org/read/26630/chapter/9 │ construction, operations, and │ │ +│ │ │ decommissioning. The NRC is │ │ +│ │ │ charged with licensing and │ │ +│ │ │ regulation of plants to │ │ +│ │ │ provide reasonable assurance │ │ +│ │ │ of adequate protection of │ │ +│ │ │ public health and safety. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ contradictory_sources │ Long-term cancer │ Estimates of total │ +│ │ mortality estimates from │ Chernobyl-attributed │ +│ │ Chernobyl │ cancer deaths vary widely │ +│ │ │ across sources, from │ +│ │ │ hundreds (WHO/UNSCEAR │ +│ │ │ conservative estimates) │ +│ │ │ to tens of thousands │ +│ │ │ (Greenpeace/TORCH │ +│ │ │ report), making a │ +│ │ │ definitive number │ +│ │ │ difficult to cite. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ scope_exceeded │ Comparative safety of │ Evidence gathered focuses │ +│ │ advanced/next-generation │ on existing reactor fleet │ +│ │ reactors (Gen IV, SMRs) │ safety records; safety │ +│ │ │ data specific to small │ +│ │ │ modular reactors (SMRs) │ +│ │ │ or Gen IV designs was not │ +│ │ │ retrieved. 
│ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ source_not_found │ Nuclear waste long-term │ While radioactive waste │ +│ │ safety statistics │ management was briefly │ +│ │ │ mentioned, quantitative │ +│ │ │ long-term health risk │ +│ │ │ data from waste storage │ +│ │ │ was not found in the │ +│ │ │ retrieved sources. │ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ arxiv │ nuclear power │ A systematic │ +│ │ │ plant safety │ academic review │ +│ │ │ mortality │ post-2020 could │ +│ │ │ statistics │ provide updated │ +│ │ │ systematic review │ mortality │ +│ │ │ 2020-2025 │ statistics │ +│ │ │ │ incorporating the │ +│ │ │ │ full operational │ +│ │ │ │ history of │ +│ │ │ │ Fukushima │ +│ │ │ │ cleanup. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ IAEA PRIS nuclear │ The IAEA Power │ +│ │ │ power plant │ Reactor │ +│ │ │ operational │ Information │ +│ │ │ safety incidents │ System (PRIS) │ +│ │ │ database │ contains │ +│ │ │ │ comprehensive │ +│ │ │ │ incident and │ +│ │ │ │ safety data for │ +│ │ │ │ all global │ +│ │ │ │ nuclear plants. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ contradiction │ database │ Chernobyl total │ SciTechnol source │ +│ │ │ excess cancer │ cites 31 │ +│ │ │ deaths estimates │ Chernobyl deaths │ +│ │ │ UNSCEAR vs WHO vs │ while CNSC cites │ +│ │ │ independent │ 28+2=30, and │ +│ │ │ researchers │ long-term cancer │ +│ │ │ │ projections │ +│ │ │ │ differ vastly │ +│ │ │ │ between │ +│ │ │ │ organizations. 
│ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ How do small modular reactors │ The DOE page on enhanced safety │ +│ │ (SMRs) compare in safety │ of advanced reactors mentions │ +│ │ profile to traditional │ new designs but no comparative │ +│ │ large-scale nuclear plants? │ safety mortality data was │ +│ │ │ available in the evidence. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ What is the total projected │ Sources give conflicting │ +│ │ cancer death toll from │ numbers; CNSC cites 28 direct │ +│ │ Chernobyl according to the most │ deaths but does not give a │ +│ │ recent UNSCEAR assessment? │ total long-term cancer │ +│ │ │ projection. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Does nuclear power's safety │ Chernobyl and Fukushima both │ +│ │ record hold across all │ involved regulatory failures; │ +│ │ countries, including those with │ safety statistics may differ │ +│ │ less stringent regulatory │ between high-regulation and │ +│ │ frameworks? │ low-regulation countries. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How does nuclear power's safety │ Statista notes deaths are │ +│ │ compare when including the │ measured from 'accidents and │ +│ │ health risks from uranium │ air pollution' per TWh, which │ +│ │ mining and fuel processing? │ may not fully account for │ +│ │ │ upstream fuel cycle risks. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.92 │ +│ Corroborating sources: 8 │ +│ Source authority: high │ +│ Contradiction detected: False │ +│ Query specificity match: 0.95 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 63429 │ +│ Iterations: 3 │ +│ Wall time: 89.22s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 2e2b6e88-c973-4422-919c-3838634336c9 diff --git a/docs/stress-tests/M3.3-runs/14-contradiction.log b/docs/stress-tests/M3.3-runs/14-contradiction.log new file mode 100644 index 0000000..50a751b --- /dev/null +++ b/docs/stress-tests/M3.3-runs/14-contradiction.log @@ -0,0 +1,358 @@ +Researching: Is dietary cholesterol harmful? 
+ +{"question": "Is dietary cholesterol harmful?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:07:32.656017Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:07:33.414998Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:07:33.424151Z"} +{"question": "Is dietary cholesterol harmful?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:07:33.456353Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Is dietary cholesterol harmful?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:33.456707Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:33.456830Z"} +{"step": 11, "decision": "Starting iteration 2/5", "tokens_so_far": 1218, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:51.522768Z"} +{"step": 18, "decision": "Starting iteration 3/5", "tokens_so_far": 14738, "event": "iteration_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", 
"researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:07:58.291229Z"} +{"step": 27, "decision": "Token budget reached before iteration 4: 31680/20000", "event": "budget_exhausted", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:08:14.672921Z"} +{"step": 28, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 33, "iterations_run": 3, "tokens_used": 31680, "event": "synthesis_start", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:08:14.673116Z"} +{"step": 29, "decision": "Parsed synthesis JSON successfully", "duration_ms": 82227, "event": "synthesis_complete", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:33.529276Z"} +{"step": 50, "decision": "Research complete", "confidence": 0.78, "citation_count": 13, "gap_count": 3, "discovery_count": 4, "total_duration_sec": 124.559, "event": "complete", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:33.531034Z"} +{"confidence": 0.78, "citations": 13, "gaps": 3, "discovery_events": 4, "tokens_used": 64718, "iterations_run": 3, "wall_time_sec": 120.07263135910034, "budget_exhausted": true, "event": "research_completed", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:33.531291Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "researcher": "web", "logger": 
"marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:09:33.531771Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:33.538771Z"} +{"trace_id": "27d81891-5bf2-4bf4-9744-55f39ffaf696", "confidence": 0.78, "citations": 13, "tokens_used": 64718, "wall_time_sec": 120.07263135910034, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:33.877746Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ The question of whether dietary cholesterol is harmful is nuanced and │ +│ remains an area of active scientific debate. Here is a synthesis of the │ +│ current evidence: │ +│ │ +│ 1. **Dietary cholesterol vs. blood cholesterol**: The body produces most of │ +│ its own cholesterol, and dietary cholesterol has a relatively modest and │ +│ variable effect on blood (serum) cholesterol levels. Epidemiological studies │ +│ and clinical interventions have largely shown that dietary cholesterol │ +│ intake does not significantly impact blood cholesterol in most individuals │ +│ [PMC6024687; PMC9143438]. A meta-analysis of 224 studies (8,143 subjects) │ +│ found only modest increases in both LDL and HDL when dietary cholesterol is │ +│ increased [Consensus Academic Search]. │ +│ │ +│ 2. **CVD risk from observational studies**: A 2020 AHA Science Advisory │ +│ (Carson et al., Circulation) found a significant positive relationship │ +│ between dietary cholesterol intake and blood LDL, but evidence from │ +│ observational studies generally does not indicate a significant association │ +│ with cardiovascular disease risk [AHA Journals, │ +│ doi:10.1161/CIR.0000000000000743]. 
However, a large pooled cohort study │ +│ (n=29,615, published in JAMA) found each additional 300 mg/day of dietary │ +│ cholesterol was associated with higher risk of incident CVD and all-cause │ +│ mortality [PACE-CME; The Cardiology Advisor]. │ +│ │ +│ 3. **Updated dietary guidelines**: The 2015–2020 U.S. Dietary Guidelines │ +│ removed the previous 300 mg/day dietary cholesterol limit, citing no │ +│ appreciable relationship between dietary cholesterol and serum cholesterol. │ +│ However, this decision was contested by scientists who argued the evidence │ +│ was insufficient rather than exculpatory [Regulations.gov scientists' │ +│ comment; PMC6024687]. The AHA's 2026 dietary guidance states that dietary │ +│ cholesterol is 'no longer a primary target for CVD risk reduction for most │ +│ people,' though it still advises limiting cholesterol-rich foods [AHA │ +│ Journals, doi:10.1161/CIR.0000000000001435]. │ +│ │ +│ 4. **Individual variability**: People differ substantially in how they │ +│ respond to dietary cholesterol—'hyper-responders' see more significant LDL │ +│ increases than 'hypo-responders.' Genetic and hormonal factors play │ +│ important roles [ScienceDirect hypo/hyperresponders; PubMed 12074253]. │ +│ │ +│ 5. **Eggs as a cholesterol source**: Eggs are the primary dietary │ +│ cholesterol source studied. Evidence on egg consumption and CVD is │ +│ inconsistent. A 2025 umbrella review found 'critically low' quality of │ +│ evidence and concluded there is no sufficient evidence to discourage egg │ +│ consumption, though weak associations with higher LDL and heart failure risk │ +│ were noted [ScienceDirect, doi:10.1016/j.numecd.2025.103849]. A BMJ │ +│ meta-analysis suggested higher egg consumption could be associated with │ +│ higher CVD risk [BMJ m513]. │ +│ │ +│ 6. **Saturated fat confounding**: Most foods high in dietary cholesterol are │ +│ also high in saturated fat, which does raise LDL cholesterol and CVD risk. 
│ +│ Eggs and shrimp are notable exceptions [PMC6024687]. │ +│ │ +│ **Bottom line**: For most people, dietary cholesterol in moderate amounts is │ +│ unlikely to be a primary driver of CVD risk. However, it is not completely │ +│ benign—particularly for hyper-responders or people with diabetes—and the │ +│ overall dietary pattern (especially saturated fat intake) matters more than │ +│ dietary cholesterol in isolation. Caution is still warranted, and individual │ +│ factors should guide dietary choices. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Dietary Cholesterol and the │ To date, extensive research │ 0.92 │ +│ │ Lack of Evidence in │ did not show evidence to │ │ +│ │ Cardiovascular Disease - PMC │ support a role of dietary │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ cholesterol in the development │ │ +│ │ articles/PMC6024687/ │ of CVD. As a result, the │ │ +│ │ │ 2015–2020 Dietary Guidelines │ │ +│ │ │ for Americans removed the │ │ +│ │ │ recommendations of restricting │ │ +│ │ │ dietary cholesterol to 300 │ │ +│ │ │ mg/day. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Is There a Correlation │ it was not until the late │ 0.91 │ +│ │ between Dietary and Blood │ 1990s when they were finally │ │ +│ │ Cholesterol? Evidence from │ challenged by the newer │ │ +│ │ Epidemiological Data and │ information derived from │ │ +│ │ Clinical Interventions - PMC │ epidemiological studies and │ │ +│ │ https://pmc.ncbi.nlm.nih.gov/ │ meta-analysis, which confirmed │ │ +│ │ articles/PMC9143438/ │ the lack of correlation │ │ +│ │ │ between dietary and blood │ │ +│ │ │ cholesterol. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Dietary Cholesterol and │ Evidence from observational │ 0.93 │ +│ │ Cardiovascular Risk: A │ studies conducted in several │ │ +│ │ Science Advisory from the AHA │ countries generally does not │ │ +│ │ https://www.ahajournals.org/d │ indicate a significant │ │ +│ │ oi/full/10.1161/CIR.000000000 │ association with │ │ +│ │ 0000743 │ cardiovascular disease risk. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Dietary Cholesterol and │ Differences in dietary │ 0.88 │ +│ │ Cardiovascular Risk: A │ cholesterol ranged from 155 to │ │ +│ │ Science Advisory (full text) │ 1000 mg/d. A significant │ │ +│ │ https://www.ahajournals.org/d │ positive relationship was │ │ +│ │ oi/10.1161/CIR.00000000000007 │ identified between dietary │ │ +│ │ 43 │ cholesterol │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ 2026 Dietary Guidance to │ Dietary cholesterol is no │ 0.90 │ +│ │ Improve Cardiovascular Health │ longer a primary target for │ │ +│ │ https://www.ahajournals.org/d │ CVD risk reduction for most │ │ +│ │ oi/10.1161/CIR.00000000000014 │ people. 
Nevertheless, heart │ │ +│ │ 35 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Higher consumption of dietary │ Among US adults, higher intake │ 0.87 │ +│ │ cholesterol or eggs linked to │ of dietary cholesterol or eggs │ │ +│ │ increased risk of incident │ was significantly linked to │ │ +│ │ CVD and mortality - PACE-CME │ increased risk of incident CVD │ │ +│ │ https://pace-cme.org/news/hig │ and all-cause mortality in a │ │ +│ │ her-consumption-of-dietary-ch │ dose-response manner, which │ │ +│ │ olesterol-or-eggs-linked-to-i │ was independent of nutrients │ │ +│ │ ncreased-risk-of-incident-cvd │ or diets │ │ +│ │ -and-mortality/2455413/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ After Continued Debate, │ Each additional 300 mg of │ 0.87 │ +│ │ Dietary Cholesterol Linked to │ dietary cholesterol consumed │ │ +│ │ Significant Increase in CVD - │ per day was significantly │ │ +│ │ The Cardiology Advisor │ associated with a higher risk │ │ +│ │ https://www.thecardiologyadvi │ for incident CVD and all-cause │ │ +│ │ sor.com/home/topics/metabolic │ mortality, as was each │ │ +│ │ /dyslipidemia/after-continued │ additional half an egg │ │ +│ │ -debate-dietary-cholesterol-l │ consumed per day. │ │ +│ │ inked-to-significant-increase │ │ │ +│ │ -in-cvd/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Scientists' Comment on │ dietary cholesterol is very │ 0.82 │ +│ │ Dietary Cholesterol - │ much a 'nutrient of concern,' │ │ +│ │ Regulations.gov │ because it increases LDL │ │ +│ │ https://downloads.regulations │ cholesterol, a │ │ +│ │ .gov/FDA-2018-P-1593-0049/att │ well-established risk factor │ │ +│ │ achment_2.pdf │ for coronary heart disease. 
│ │ +│ │ │ Furthermore, the consumption │ │ +│ │ │ of whole eggs is associated │ │ +│ │ │ with the risk of type 2 │ │ +│ │ │ diabetes │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Dietary Cholesterol And Blood │ A meta-analysis of 224 studies │ 0.85 │ +│ │ Cholesterol - Consensus │ involving 8,143 subjects found │ │ +│ │ Academic Search Engine │ that dietary cholesterol │ │ +│ │ https://consensus.app/questio │ intake leads to modest │ │ +│ │ ns/dietary-cholesterol-and-bl │ increases in both LDL and HDL │ │ +│ │ ood-cholesterol/ │ cholesterol levels. The study │ │ +│ │ │ highlighted that while dietary │ │ +│ │ │ cholesterol does raise serum │ │ +│ │ │ cholesterol levels, the effect │ │ +│ │ │ is relatively small and varies │ │ +│ │ │ among individuals. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ Effect of egg consumption on │ The overall quality of studies │ 0.88 │ +│ │ health outcomes: Updated │ was critically low. The level │ │ +│ │ umbrella review - │ of evidence was very weak for │ │ +│ │ ScienceDirect │ all the significant │ │ +│ │ https://www.sciencedirect.com │ associations: risk of heart │ │ +│ │ /science/article/pii/S0939475 │ failure (RR 1.15; 95%CI: │ │ +│ │ 325000031 │ 1.02–1.30)... higher levels of │ │ +│ │ │ LDL cholesterol (WMD 7.39; │ │ +│ │ │ 95%CI 5.82–8.95)... 
No │ │ +│ │ │ evidence of association was │ │ +│ │ │ found among all cardiovascular │ │ +│ │ │ outcomes and all-cause │ │ +│ │ │ mortality risk │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 11 │ Egg consumption and risk of │ Results from our updated │ 0.84 │ +│ │ cardiovascular disease - The │ meta-analysis suggest that │ │ +│ │ BMJ │ higher egg consumption could │ │ +│ │ https://www.bmj.com/content/3 │ be associated with a higher │ │ +│ │ 68/bmj.m513 │ risk of cardiovascular disease │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 12 │ Hypo- and hyperresponders to │ Hypo- and hyperresponders to │ 0.78 │ +│ │ dietary cholesterol - │ dietary cholesterol │ │ +│ │ ScienceDirect │ │ │ +│ │ https://www.sciencedirect.com │ │ │ +│ │ /science/article/abs/pii/S000 │ │ │ +│ │ 2916523398897 │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 13 │ Here's the latest on dietary │ More recently, accumulating │ 0.87 │ +│ │ cholesterol and how it fits │ data has caused researchers to │ │ +│ │ in with a healthy diet | │ broaden their thinking about │ │ +│ │ American Heart Association │ how dietary cholesterol – and │ │ +│ │ https://www.heart.org/en/news │ eggs – fit into a healthy │ │ +│ │ /2023/08/25/heres-the-latest- │ eating pattern. 'We've │ │ +│ │ on-dietary-cholesterol-and-ho │ advanced considerably,' said │ │ +│ │ w-it-fits-in-with-a-healthy-d │ professor Linda Van Horn │ │ +│ │ iet │ │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Long-term RCT data on │ Most evidence comes from │ +│ │ dietary cholesterol and │ observational studies or │ +│ │ hard CVD endpoints │ short-term interventions. 
│ +│ │ │ There are no large, │ +│ │ │ long-term randomized │ +│ │ │ controlled trials │ +│ │ │ directly testing reduced │ +│ │ │ dietary cholesterol │ +│ │ │ versus hard CVD outcomes │ +│ │ │ like myocardial │ +│ │ │ infarction or │ +│ │ │ cardiovascular death. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ source_not_found │ Dietary cholesterol │ While some sources │ +│ │ effects in specific │ mention increased CVD │ +│ │ high-risk subgroups │ risk from eggs in people │ +│ │ (diabetes, familial │ with diabetes, the │ +│ │ hypercholesterolemia) │ gathered evidence does │ +│ │ │ not deeply characterize │ +│ │ │ effects in all high-risk │ +│ │ │ subgroups such as │ +│ │ │ familial │ +│ │ │ hypercholesterolemia │ +│ │ │ patients. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Mechanisms │ Confounding between │ +│ │ distinguishing dietary │ dietary cholesterol and │ +│ │ cholesterol from │ saturated fat intake │ +│ │ saturated fat effects │ makes it difficult to │ +│ │ │ isolate dietary │ +│ │ │ cholesterol's independent │ +│ │ │ effect on CVD; different │ +│ │ │ studies handle this │ +│ │ │ confounder differently, │ +│ │ │ leading to inconsistent │ +│ │ │ conclusions. 
│ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ contradiction │ database │ dietary │ The evidence is │ +│ │ │ cholesterol CVD │ contradictory │ +│ │ │ risk randomized │ between large │ +│ │ │ controlled trial │ observational │ +│ │ │ meta-analysis │ pooled cohorts │ +│ │ │ 2020 2024 │ (showing CVD │ +│ │ │ │ risk) and │ +│ │ │ │ intervention/epid │ +│ │ │ │ emiological │ +│ │ │ │ reviews (showing │ +│ │ │ │ no significant │ +│ │ │ │ association), │ +│ │ │ │ warranting deeper │ +│ │ │ │ RCT-level │ +│ │ │ │ analysis. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ lean mass │ A distinct │ +│ │ │ hyper-responder │ phenotype (lean │ +│ │ │ LDL dietary │ mass │ +│ │ │ cholesterol │ hyper-responders) │ +│ │ │ cardiovascular │ shows pronounced │ +│ │ │ risk 2023 2024 │ LDL increases on │ +│ │ │ │ low-carb diets │ +│ │ │ │ high in dietary │ +│ │ │ │ fat/cholesterol, │ +│ │ │ │ with unclear CVD │ +│ │ │ │ implications. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ dietary │ Multiple sources │ +│ │ │ cholesterol type │ mention │ +│ │ │ 2 diabetes risk │ association │ +│ │ │ eggs 2020 2024 │ between │ +│ │ │ meta-analysis │ egg/cholesterol │ +│ │ │ │ intake and type 2 │ +│ │ │ │ diabetes risk, │ +│ │ │ │ which is not │ +│ │ │ │ fully explored in │ +│ │ │ │ the gathered │ +│ │ │ │ evidence. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ new_source │ database │ ACC AHA 2026 │ New 2026 ACC/AHA │ +│ │ │ dyslipidemia │ dyslipidemia │ +│ │ │ guidelines │ guidelines were │ +│ │ │ dietary │ referenced but │ +│ │ │ cholesterol │ only partially │ +│ │ │ recommendations │ retrieved; full │ +│ │ │ │ dietary │ +│ │ │ │ cholesterol │ +│ │ │ │ guidance warrants │ +│ │ │ │ review. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Should dietary cholesterol │ Scientists' comments on the │ +│ │ recommendations differ for │ 2015 dietary guidelines and │ +│ │ people with diabetes or │ some observational studies │ +│ │ familial hypercholesterolemia │ suggest egg/cholesterol intake │ +│ │ compared to the general │ may increase CHD risk │ +│ │ population? │ specifically in people with │ +│ │ │ diabetes. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Do LDL cholesterol │ Research shows wide individual │ +│ │ hyper-responders to dietary │ variability in LDL response to │ +│ │ cholesterol face meaningfully │ dietary cholesterol; it is │ +│ │ higher long-term CVD risk, and │ unclear whether │ +│ │ should they restrict dietary │ hyper-responders have elevated │ +│ │ cholesterol? │ CVD risk and need tailored │ +│ │ │ advice. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ How much of the observed CVD │ PMC6024687 notes most │ +│ │ risk associated with dietary │ high-cholesterol foods are also │ +│ │ cholesterol in observational │ high in saturated fat; │ +│ │ studies is attributable to │ isolating dietary cholesterol's │ +│ │ saturated fat co-ingestion │ independent effect is │ +│ │ rather than cholesterol itself? │ methodologically challenging. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the effect of dietary │ PACE-CME study noted that CVD │ +│ │ cholesterol within the context │ risk association from dietary │ +│ │ of a high-quality overall diet │ cholesterol was independent of │ +│ │ (e.g., Mediterranean or DASH │ overall diet quality, but this │ +│ │ diet)? │ needs further investigation. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Does the food matrix (e.g., │ The 2025 umbrella review of egg │ +│ │ eggs vs. red meat) in which │ consumption found weak │ +│ │ dietary cholesterol is consumed │ associations; it is unclear if │ +│ │ modify its impact on CVD risk? │ the source of dietary │ +│ │ │ cholesterol modulates risk │ +│ │ │ independently of the │ +│ │ │ cholesterol content. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.78 │ +│ Corroborating sources: 13 │ +│ Source authority: high │ +│ Contradiction detected: True │ +│ Query specificity match: 0.85 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 64718 │ +│ Iterations: 3 │ +│ Wall time: 120.07s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 27d81891-5bf2-4bf4-9744-55f39ffaf696 diff --git a/docs/stress-tests/M3.3-runs/15-contradiction.log b/docs/stress-tests/M3.3-runs/15-contradiction.log new file mode 100644 index 0000000..3752bbc --- /dev/null +++ b/docs/stress-tests/M3.3-runs/15-contradiction.log @@ -0,0 +1,48 @@ +Researching: Does screen time harm child development? 
+ +{"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:09:34.721867Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:09:35.602647Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:09:35.613025Z"} +{"question": "Does screen time harm child development?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:09:35.653113Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "Does screen time harm child development?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653592Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:35.653723Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1126, "event": "iteration_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:45.628661Z"} +{"step": 14, "decision": "Starting iteration 3/5", "tokens_so_far": 10139, "event": "iteration_start", "researcher": "web", 
"trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:51.476900Z"} +{"step": 21, "decision": "Token budget reached before iteration 4: 23391/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056368Z"} +{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 22, "iterations_run": 3, "tokens_used": 23391, "event": "synthesis_start", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:09:58.056571Z"} +{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74986, "event": "synthesis_complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.739493Z"} +{"step": 24, "decision": "Failed to build ResearchResult: 1 validation error for DiscoveryEvent\nquery\n Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]\n For further information visit https://errors.pydantic.dev/2.12/v/string_type", "event": "synthesis_build_error", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.753603Z"} +{"step": 26, "decision": "Research complete", "confidence": 0.1, "citation_count": 0, "gap_count": 1, "discovery_count": 0, "total_duration_sec": 98.512, "event": "complete", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:10.755661Z"} +{"confidence": 0.1, "citations": 0, "gaps": 1, "discovery_events": 0, "tokens_used": 
44375, "iterations_run": 3, "wall_time_sec": 95.08588027954102, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:10.755895Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:11:10.757071Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:10.770530Z"} +{"trace_id": "9c18d570-73d3-4e8a-98bc-7cb1b66c61d2", "confidence": 0.1, "citations": 0, "tokens_used": 44375, "wall_time_sec": 95.08588027954102, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.105698Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Research on 'Does screen time harm child development?' completed but │ +│ synthesis failed. 22 sources were gathered. │ +╰──────────────────────────────────────────────────────────────────────────────╯ +No citations. + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ budget_exhausted │ synthesis │ The synthesis step failed to produce │ +│ │ │ structured output. 
│ +└──────────────────┴───────────┴───────────────────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.10 │ +│ Corroborating sources: 0 │ +│ Source authority: low │ +│ Contradiction detected: False │ +│ Query specificity match: 0.00 │ +│ Budget status: spent │ +│ Recency: unknown │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 44375 │ +│ Iterations: 3 │ +│ Wall time: 95.09s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 9c18d570-73d3-4e8a-98bc-7cb1b66c61d2 diff --git a/docs/stress-tests/M3.3-runs/16-scope.log b/docs/stress-tests/M3.3-runs/16-scope.log new file mode 100644 index 0000000..8c89213 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/16-scope.log @@ -0,0 +1,321 @@ +Researching: What proprietary indexing strategies do high-frequency trading +firms use for order book reconstruction? 
+ +{"question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:11:11.888630Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:11:12.816801Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:11:12.829566Z"} +{"question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:11:12.871225Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:12.871693Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:12.872051Z"} +{"step": 9, "decision": "Starting iteration 2/5", "tokens_so_far": 1212, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", 
"level": "info", "timestamp": "2026-04-09T02:11:27.416025Z"} +{"step": 16, "decision": "Starting iteration 3/5", "tokens_so_far": 15135, "event": "iteration_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:33.632271Z"} +{"step": 23, "decision": "Token budget reached before iteration 4: 35581/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:40.723229Z"} +{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 35, "iterations_run": 3, "tokens_used": 35581, "event": "synthesis_start", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:11:40.723491Z"} +{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 72229, "event": "synthesis_complete", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:50.638239Z"} +{"step": 42, "decision": "Research complete", "confidence": 0.72, "citation_count": 8, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 101.111, "event": "complete", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:50.639828Z"} +{"confidence": 0.72, "citations": 8, "gaps": 4, "discovery_events": 4, "tokens_used": 70892, "iterations_run": 3, "wall_time_sec": 97.76683187484741, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:12:50.639933Z"} 
+{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "researcher": "web", "trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:12:50.640430Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:12:50.648897Z"} +{"trace_id": "f4c43973-7cac-4193-a249-cbb1302de4f7", "confidence": 0.72, "citations": 8, "tokens_used": 70892, "wall_time_sec": 97.76683187484741, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:12:50.931342Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ High-frequency trading firms use several proprietary and semi-documented │ +│ indexing strategies for order book reconstruction, though most production │ +│ details remain trade secrets. Based on available evidence: │ +│ │ +│ 1. **Hash Table + Array Hybrid**: The most commonly cited production │ +│ approach combines plain arrays (for cache-friendly sequential memory access │ +│ minimizing cache misses) with hash tables (for O(1) lookup of specific price │ +│ levels). This codesign optimizes both speed and cache locality. [Sources 15, │ +│ 16, 28] │ +│ │ +│ 2. **B-Tree / ISAM Indexing**: The historically significant Island ECN │ +│ (1996), built by Josh Levine, used in-memory B-tree indexing via an ISAM │ +│ storage engine with zero disk access during matching, achieving O(log N) │ +│ access per price level. This is considered the documented proof-of-concept │ +│ for production-grade LOB indexing. [Source 29] │ +│ │ +│ 3. **Hybrid Binary-Linear Search**: A IEEE-documented approach proposes a │ +│ simple linear data structure for tracking the order book combined with a │ +│ hybrid binary-linear search algorithm to maintain top bid/ask with minimal │ +│ latency. 
[Source 19] │ +│ │ +│ 4. **ROI Vector (Region-of-Interest Vector)**: Used in backtesting │ +│ frameworks like HftBacktest, this approach restricts the active price range │ +│ to a bounded region of interest, enabling vector-based O(1) access within │ +│ the ROI while avoiding full-book scanning. [Source 25, 35] │ +│ │ +│ 5. **Lock-Free Concurrent Data Structures**: To handle concurrent updates │ +│ without mutex overhead, firms implement lock-free data structures allowing │ +│ multiple threads to update the LOB simultaneously. [Sources 15, 16] │ +│ │ +│ 6. **Event-Driven with Selective Polling Hybrid**: The LOB primarily │ +│ operates event-driven but incorporates high-frequency polling for the most │ +│ latency-sensitive execution pathways, ensuring sub-microsecond │ +│ responsiveness. [Sources 15, 16] │ +│ │ +│ 7. **Order Record Reuse (Object Pooling)**: Levine's Island engine reused │ +│ recently freed order records for new orders—described as 'hugely │ +│ important'—a form of memory pooling that avoids allocation overhead during │ +│ high-throughput periods. [Source 29] │ +│ │ +│ 8. **Structural Filtration for Signal Quality**: Recent research (2025) │ +│ proposes filtering transient LOB events by order lifetime, update count, or │ +│ inter-update delay before indexing, improving directional signal quality │ +│ (OBI) extracted from the reconstructed book. [Source 6] │ +│ │ +│ Notably, red-black trees—frequently cited in academic literature—are rarely │ +│ used in production due to poor cache behavior versus simpler arrays at │ +│ realistic market depths. The key insight from practitioners is that │ +│ algorithmic data structure choice (O(log N) vs O(N)) dominates hardware │ +│ investment: a $2M co-location/FPGA upgrade produced no measurable latency │ +│ improvement when the underlying order book used a sorted array with O(N) │ +│ inserts. 
[Source 23, 29] │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Matching Engine Architecture: │ Josh Levine built the Island │ 0.95 │ +│ │ Why Your Order Book Data │ matching engine in FoxPro for │ │ +│ │ Structure Is the Real Latency │ MS-DOS... The order book used │ │ +│ │ Bottleneck │ in-memory B-tree indexing via │ │ +│ │ https://electronictradinghub. │ an ISAM storage engine. Zero │ │ +│ │ com/matching-engine-architect │ disk access during matching. │ │ +│ │ ure-why-your-order-book-data- │ Every price level accessed in │ │ +│ │ structure-is-the-real-latency │ O(log N) time. Levine's │ │ +│ │ -bottleneck/ │ optimization for new-order │ │ +│ │ │ entry latency: reuse recently │ │ +│ │ │ freed order records for new │ │ +│ │ │ orders — a detail he called │ │ +│ │ │ 'hugely important' │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Optimizing Limit Order Book │ I use a combination of plain │ 0.88 │ +│ │ for HFT Systems │ arrays and hash tables to │ │ +│ │ https://www.linkedin.com/post │ manage the LOB. Arrays are │ │ +│ │ s/silahian_hft-hft-trading-ac │ highly effective with CPU │ │ +│ │ tivity-7351226537301417988-ei │ caches, offering sequential │ │ +│ │ cX │ memory access that minimizes │ │ +│ │ │ cache misses. The integration │ │ +│ │ │ of hash tables provides quick │ │ +│ │ │ access to specific entries, │ │ +│ │ │ ensuring that both speed and │ │ +│ │ │ cache locality are optimized. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Red Black Trees for Limit │ They're not necessarily ideal. 
│ 0.92 │ +│ │ Order Book - Quantitative │ In fact, they're rarely used │ │ +│ │ Finance Stack Exchange │ in production trading systems │ │ +│ │ https://quant.stackexchange.c │ with low latency │ │ +│ │ om/questions/63140/red-black- │ requirements... a simple array │ │ +│ │ trees-for-limit-order-book │ or vector with linear access │ │ +│ │ │ patterns will often outperform │ │ +│ │ │ any complex data structure │ │ +│ │ │ with better asymptotic runtime │ │ +│ │ │ because a simple array │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Order Book Reconstruction - │ HashMapMarketDepth... │ 0.85 │ +│ │ HftBacktest │ BTreeMarketDepth... │ │ +│ │ https://mintlify.com/nkaz001/ │ ROIVectorMarketDepth::new(tick │ │ +│ │ hftbacktest/concepts/order-bo │ _size, lot_size, roi_lb, │ │ +│ │ ok │ roi_ub)... │ │ +│ │ │ FusedHashMapMarketDepth │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Order Book Filtration and │ Three real-time, observable │ 0.82 │ +│ │ Directional Signal Extraction │ filtration schemes: based on │ │ +│ │ at High Frequency │ order lifetime, update count, │ │ +│ │ https://arxiv.org/html/2507.2 │ and inter-update delay. These │ │ +│ │ 2712v1 │ are used to recompute OBI on │ │ +│ │ │ structurally filtered event │ │ +│ │ │ streams... Empirical results │ │ +│ │ │ show that structural │ │ +│ │ │ filtration improves │ │ +│ │ │ directional signal clarity in │ │ +│ │ │ correlation and regime-based │ │ +│ │ │ metrics │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Building Low-Latency Order │ This paper proposes a simple │ 0.80 │ +│ │ Books with Hybrid │ linear data structure for │ │ +│ │ Binary-Linear ... 
│ tracking the order book and a │ │ +│ │ https://ieeexplore.ieee.org/d │ hybrid binary-linear search │ │ +│ │ ocument/10296447/ │ algorithm to maintain the top │ │ +│ │ │ bid and ask │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Order Book Reconstruction - │ Index reusing... Regional │ 0.75 │ +│ │ dxFeed KB │ events... Event flags │ │ +│ │ https://kb.dxfeed.com/en/data │ applicable to Order event... │ │ +│ │ -model/dxfeed-order-book/orde │ Snapshots... Transaction │ │ +│ │ r-book-reconstruction.html │ model... dxFeed market data │ │ +│ │ │ feeds (real-time, delayed or │ │ +│ │ │ historical) allow clients to │ │ +│ │ │ reconstruct order books, price │ │ +│ │ │ level aggregations, and │ │ +│ │ │ aggregations by Market Maker │ │ +│ │ │ or a data provider. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ GitHub - │ This Limit Order Book is │ 0.70 │ +│ │ brprojects/Limit-Order-Book │ developed in C++ from scratch │ │ +│ │ https://github.com/brprojects │ and able to handle over │ │ +│ │ /Limit-Order-Book │ 1,400,000 TPS (transactions │ │ +│ │ │ per second), including Market, │ │ +│ │ │ Limit, Stop and Stop Limit │ │ +│ │ │ orders. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Proprietary FPGA-based │ Actual FPGA hardware │ +│ │ order book indexing schemes │ implementations used by │ +│ │ │ firms like Virtu, Jane │ +│ │ │ Street, or Citadel for │ +│ │ │ on-chip order book indexing │ +│ │ │ are not publicly │ +│ │ │ documented. MIT project │ +│ │ │ proposal references FPGA │ +│ │ │ LOB but lacks │ +│ │ │ implementation details. 
│ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Exact data structures used │ No public disclosure exists │ +│ │ by specific named HFT firms │ for the specific indexing │ +│ │ │ implementations of major │ +│ │ │ HFT firms (e.g., Virtu, Two │ +│ │ │ Sigma, Jump Trading). All │ +│ │ │ evidence is from │ +│ │ │ practitioners sharing │ +│ │ │ general principles or │ +│ │ │ academic reconstructions. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ scope_exceeded │ Co-location-specific memory │ NUMA-aware memory │ +│ │ topology optimization for │ allocation and CPU affinity │ +│ │ LOB │ strategies for LOB │ +│ │ │ processes in co-located │ +│ │ │ environments are referenced │ +│ │ │ but not detailed in │ +│ │ │ available sources. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Crypto-specific LOB │ While one Medium article │ +│ │ indexing differences vs │ covers crypto HFT system │ +│ │ equity markets │ design, it does not detail │ +│ │ │ how LOB indexing strategies │ +│ │ │ differ for 24/7 crypto │ +│ │ │ markets with different tick │ +│ │ │ structures. 
│ +└──────────────────┴─────────────────────────────┴─────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ arxiv │ FPGA order book │ The MIT HFT │ +│ │ │ matching engine │ Accelerator paper │ +│ │ │ hardware │ and FPGA │ +│ │ │ implementation │ references │ +│ │ │ nanosecond │ suggest │ +│ │ │ latency │ significant │ +│ │ │ │ unpublished work │ +│ │ │ │ on │ +│ │ │ │ hardware-accelera │ +│ │ │ │ ted LOB indexing │ +│ │ │ │ that would │ +│ │ │ │ directly answer │ +│ │ │ │ the proprietary │ +│ │ │ │ indexing question │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ limit order book │ Cache-oblivious │ +│ │ │ data structure │ structures like │ +│ │ │ cache-oblivious │ van Emde Boas │ +│ │ │ van Emde Boas │ trees are │ +│ │ │ tree HFT │ theoretically │ +│ │ │ │ optimal for LOB │ +│ │ │ │ operations but │ +│ │ │ │ not mentioned in │ +│ │ │ │ sources; academic │ +│ │ │ │ literature may │ +│ │ │ │ document their │ +│ │ │ │ use │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ new_source │ database │ Island ECN Levine │ The Island ECN │ +│ │ │ order book ISAM │ B-tree/ISAM │ +│ │ │ indexing original │ reference is │ +│ │ │ documentation │ cited secondhand; │ +│ │ │ 1996 │ primary │ +│ │ │ │ documentation │ +│ │ │ │ would provide │ +│ │ │ │ authoritative │ +│ │ │ │ details on the │ +│ │ │ │ original │ +│ │ │ │ production │ +│ │ │ │ indexing strategy │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ order book │ L3 order-by-order │ +│ │ │ reconstruction L3 │ reconstruction │ +│ │ │ tick data index │ requires │ +│ │ │ compression high │ per-order │ +│ │ │ frequency │ 
indexing by │ +│ │ │ │ order_id which │ +│ │ │ │ has different │ +│ │ │ │ data structure │ +│ │ │ │ requirements than │ +│ │ │ │ L2 price-level │ +│ │ │ │ indexing │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Do modern HFT firms use │ Sources confirm cache-friendly │ +│ │ NUMA-aware memory allocation │ arrays dominate in production, │ +│ │ strategies specifically tuned │ but NUMA effects in │ +│ │ for order book price-level │ multi-socket co-located servers │ +│ │ index structures, and how does │ are not addressed │ +│ │ this interact with CPU cache │ │ +│ │ topology? │ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ How do HFT firms handle the │ dxFeed documentation describes │ +│ │ transition from snapshot-based │ snapshot and transaction models │ +│ │ full order book state to │ separately; the handoff between │ +│ │ incremental delta updates in │ these modes in production │ +│ │ their indexing layer without │ indexing is not detailed │ +│ │ introducing consistency gaps? │ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the practical │ HftBacktest documents both │ +│ │ throughput and latency tradeoff │ structures but does not provide │ +│ │ between ROIVectorMarketDepth │ comparative benchmarks for edge │ +│ │ and FusedHashMapMarketDepth │ cases like flash crashes where │ +│ │ implementations under real │ price moves outside the ROI │ +│ │ market conditions with large │ │ +│ │ price spikes? 
│ │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Does structural LOB filtration │ The filtration paper shows │ +│ │ (by order lifetime or update │ improved OBI signal quality but │ +│ │ count) as proposed in the 2025 │ acknowledges limited gains in │ +│ │ arxiv paper degrade order book │ causal excitation; │ +│ │ reconstruction accuracy under │ accuracy-speed tradeoff for │ +│ │ normal market conditions │ indexing filtered vs raw │ +│ │ compared to raw feeds? │ streams is unresolved │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ low │ How do exchanges like LMAX, │ The electronictradinghub │ +│ │ Tokyo Stock Exchange, and NSE │ article cites these exchanges │ +│ │ India differ in their │ as modern evidence but does not │ +│ │ recommended order book │ detail their specific │ +│ │ reconstruction protocols, and │ reconstruction protocol │ +│ │ do these differences force │ differences │ +│ │ different indexing strategies │ │ +│ │ on client-side HFT systems? 
│ │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.72 │ +│ Corroborating sources: 8 │ +│ Source authority: medium │ +│ Contradiction detected: False │ +│ Query specificity match: 0.65 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 70892 │ +│ Iterations: 3 │ +│ Wall time: 97.77s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: f4c43973-7cac-4193-a249-cbb1302de4f7 diff --git a/docs/stress-tests/M3.3-runs/17-scope.log b/docs/stress-tests/M3.3-runs/17-scope.log new file mode 100644 index 0000000..33f5543 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/17-scope.log @@ -0,0 +1,344 @@ +Researching: What is the actual operational doctrine of Chinese DF-41 ICBM +brigades? 
+ +{"question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:12:51.608714Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:12:52.450376Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:12:52.459819Z"} +{"question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:12:52.495811Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What is the actual operational doctrine of Chinese DF-41 ICBM brigades?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:52.496319Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:12:52.496431Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1194, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:05.548923Z"} +{"step": 16, "decision": "Starting 
iteration 3/5", "tokens_so_far": 8831, "event": "iteration_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:18.572224Z"} +{"step": 23, "decision": "Token budget reached before iteration 4: 31917/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:36.495991Z"} +{"step": 24, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 31, "iterations_run": 3, "tokens_used": 31917, "event": "synthesis_start", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:13:36.496215Z"} +{"step": 25, "decision": "Parsed synthesis JSON successfully", "duration_ms": 90409, "event": "synthesis_complete", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:04.659059Z"} +{"step": 46, "decision": "Research complete", "confidence": 0.72, "citation_count": 12, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 136.645, "event": "complete", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:04.687651Z"} +{"confidence": 0.72, "citations": 12, "gaps": 4, "discovery_events": 4, "tokens_used": 62857, "iterations_run": 3, "wall_time_sec": 132.16255736351013, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:15:04.687981Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": 
"cost_ledger_write_failed", "researcher": "web", "trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:15:04.688728Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:15:04.696829Z"} +{"trace_id": "b3d00938-5309-4faa-a20d-97a8511bb8f9", "confidence": 0.72, "citations": 12, "tokens_used": 62857, "wall_time_sec": 132.16255736351013, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:15:04.924751Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Chinese DF-41 ICBM brigade operational doctrine encompasses several key │ +│ elements based on open-source intelligence and defense analysis: │ +│ │ +│ **Basing and Mobility**: DF-41 brigades operate under a tri-basing doctrine │ +│ employing road-mobile, rail-mobile, and silo-based launchers. The │ +│ road-mobile variant uses the Tian HTF5980 16×16 wheeled chassis. Silo │ +│ construction has accelerated since 2021 with three new solid-fuel ICBM silo │ +│ fields identified in northern China. [Sources: MDAA, CSIS Missile Threat, │ +│ FAS] │ +│ │ +│ **Alert Posture and Launch Doctrine**: The PLARF is working to implement a │ +│ launch-on-warning (LOW) posture. Brigades now strive to keep at least part │ +│ of their force in a higher state of readiness, representing a significant │ +│ shift from China's historically relaxed alert posture where warheads were │ +│ stored separately from missiles. [Sources: Air University/PLARF Nuclear │ +│ Warhead Management, NDU] │ +│ │ +│ **Warhead Management**: Historically, Chinese ICBMs stored warheads │ +│ separately from missiles ('de-mated'). The shift toward LOW requires │ +│ warheads to be mated or at least rapidly mateable to delivery systems. 
As of │ +│ the 2025 FAS Nuclear Notebook, China possesses approximately 600 warheads, │ +│ with DF-41 launchers armed with either a single ~1 MT warhead or up to 10 │ +│ MIRV warheads (20/90/150 KT yield variants). [Sources: FAS 2025, MDAA] │ +│ │ +│ **Force Structure**: As of 2020-2023, two brigades were confirmed operating │ +│ DF-41 when it appeared in the 2019 parade. The CNS 2023 Order of Battle │ +│ identifies Base 64 (Lanzhou HQ) Brigade 644 (Hanzhong) as a rumored DF-41 │ +│ integration base. Additional brigades under Base 63 are suspected. [Sources: │ +│ Bulletin PLARF Force Structure Table 2020, CNS OOB 2023] │ +│ │ +│ **Camouflage and Concealment**: Mobile DF-41 units employ camouflage netting │ +│ and disperse into forests and tunnels during exercises, consistent with │ +│ PLARF general doctrine of 'hiding and waiting.' [Sources: Al │ +│ Arabiya/Facebook report] │ +│ │ +│ **No-First-Use and Deterrence**: Chinese doctrine officially maintains a │ +│ no-first-use (NFU) posture, with the DF-41 serving as a second-strike │ +│ deterrent. However, the silo expansion and LOW posture shift have raised │ +│ questions among analysts about whether NFU remains operationally intact. │ +│ [Sources: The Mandarin, FAS 2025] │ +│ │ +│ **Range and Target Coverage**: With a range of 12,000–15,000 km, DF-41 │ +│ brigades based in central/northern China can target the entire continental │ +│ United States, making them the primary strategic countervalue and │ +│ counterforce deterrent against the US. 
[Sources: MDAA, CSIS Missile Threat] │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Dong Feng-41(CSS-X-20) │ The DF-41 has a range of │ 0.90 │ +│ │ https://www.missiledefenseadv │ 12,000-15,000 km (able to │ │ +│ │ ocacy.org/missile-threat-and- │ target half to all of the │ │ +│ │ proliferation/todays-missile- │ continental U.S.), can carry │ │ +│ │ threat/china/df-41/ │ multiple independently │ │ +│ │ │ targetable reentry vehicles │ │ +│ │ │ (MIRVs), and is rail-or │ │ +│ │ │ road-mobile. The DF-41 is │ │ +│ │ │ solid propelled and can carry │ │ +│ │ │ a payload of up to 2500 kg. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ DF-41 (Dong Feng-41 / │ The DF-41 (Dong Feng [East │ 0.92 │ +│ │ CSS-X-20) | Missile Threat │ Wind]-41, CSS-20) is Chinese │ │ +│ │ https://missilethreat.csis.or │ road-mobile intercontinental │ │ +│ │ g/missile/df-41/ │ ballistic missile (ICBM). It │ │ +│ │ │ has an operational range of up │ │ +│ │ │ to 15,000 km, making it │ │ +│ │ │ China's longest-range missile, │ │ +│ │ │ and is reportedly capable of │ │ +│ │ │ loading multiple │ │ +│ │ │ independently-targeted │ │ +│ │ │ warheads (MIRV). │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ PLA Rocket Force Nuclear │ PLARF is working to implement │ 0.88 │ +│ │ Warhead Management - Air │ a launch-on-warning (LOW) │ │ +│ │ University │ posture, and brigades now │ │ +│ │ https://www.airuniversity.af. 
│ strive to keep at least part │ │ +│ │ edu/Portals/10/CASI/documents │ of their force in a state of │ │ +│ │ /Research/Infrastructure/2026 │ │ │ +│ │ -03-09%20PLARF%20Nuclear%20Wa │ │ │ +│ │ rhead%20Management.pdf │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ IMPLICATIONS OF A PRC SHIFT │ The PLARF has adjusted its │ 0.87 │ +│ │ TO A LAUNCH-ON-WARNING │ nuclear warhead storage and │ │ +│ │ https://inss.ndu.edu/LinkClic │ handling practices and │ │ +│ │ k.aspx?fileticket=kU27dwWHUvU │ training to support regular │ │ +│ │ %3D&portalid=82 │ alert status. A LOW posture, │ │ +│ │ │ which requires ICBM units │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Chinese nuclear weapons, 2025 │ China has continued to develop │ 0.95 │ +│ │ - Federation of American │ its three new missile silo │ │ +│ │ Scientists │ fields for solid-fuel │ │ +│ │ https://fas.org/wp-content/up │ intercontinental ballistic │ │ +│ │ loads/2025/03/Chinese-nuclear │ missiles (ICBMs)...has been │ │ +│ │ -weapons-2025.pdf │ developing new variants of │ │ +│ │ │ ICBMs and advanced strategic │ │ +│ │ │ delivery systems, and has │ │ +│ │ │ likely produced excess │ │ +│ │ │ warheads for these systems │ │ +│ │ │ once they are deployed. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ New Missile Silo And DF-41 │ The photos also show that 18 │ 0.90 │ +│ │ Launchers Seen In Chinese │ road-mobile launchers of the │ │ +│ │ Nuclear Missile Training Area │ long-awaited DF-41 ICBM were │ │ +│ │ - FAS │ training in the area in │ │ +│ │ https://fas.org/publication/c │ April-May 2019 together with │ │ +│ │ hina-silo-df41/ │ launchers for the DF-31AG │ │ +│ │ │ ICBM, possibly the DF-5B ICBM, │ │ +│ │ │ the DF-26 IRBM, and the DF-21 │ │ +│ │ │ MRBM. Altogether, more than 72 │ │ +│ │ │ missile launchers can be seen │ │ +│ │ │ operating together. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Table 2: PLARF Missile Force │ 644 Brigade Hanzhong (33.1321, │ 0.85 │ +│ │ Structure 2020 │ 106.9361) (DF-41) (Yes) │ │ +│ │ https://thebulletin.org/wp-co │ Rumored DF-41 integration │ │ +│ │ ntent/uploads/2020/12/Kristen │ base. │ │ +│ │ sen-Korda_Nov-Dec-China-Table │ │ │ +│ │ 2_final.pdf │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Understanding the People's │ The DF-41 will likely replace │ 0.88 │ +│ │ Liberation Army Rocket Force │ older ICBMs in the Chinese │ │ +│ │ https://www.armyupress.army.m │ arsenal and will carry either │ │ +│ │ il/Journals/Military-Review/E │ a single megaton warhead or up │ │ +│ │ nglish-Edition-Archives/China │ to ten MIRV smaller warheads. │ │ +│ │ -Reader-Special-Edition-Septe │ │ │ +│ │ mber-2021/Mihal-PLA-Rocket-Fo │ │ │ +│ │ rce/ │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ China's new missile silos │ The discovery by researchers │ 0.82 │ +│ │ (hundreds of them) │ at the James Martin Center for │ │ +│ │ https://www.themandarin.com.a │ Nonproliferation Studies in │ │ +│ │ u/166656-china-military-watch │ California that 119 missile │ │ +│ │ -2/ │ silos were being built in the │ │ +│ │ │ desert near the city of Yumen │ │ +│ │ │ in the Gansu region suggested │ │ +│ │ │ a rapid expansion of China's │ │ +│ │ │ nuclear weapons capabilities. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ China is building more │ The new underground silos are │ 0.84 │ +│ │ underground silos for its │ located in the centre of the │ │ +│ │ ballistic missiles | SCMP │ Jilantai training base, within │ │ +│ │ https://www.scmp.com/news/chi │ a total area of 200 sq km, and │ │ +│ │ na/military/article/3125699/c │ are spaced between 2.2km and │ │ +│ │ hina-building-more-undergroun │ 4.4km apart so that no two of │ │ +│ │ d-silos-its-ballistic-missile │ them can be destroyed in a │ │ +│ │ s │ single nuclear attack. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 11 │ China's Mobile ICBM Brigades: │ The PLARF is currently │ 0.75 │ +│ │ The DF-31 and DF-41 │ modernizing its │ │ +│ │ https://www.aboyandhis.blog/p │ intercontinental ballistic │ │ +│ │ ost/china-s-mobile-icbm-briga │ missile forces with two new │ │ +│ │ des-the-df-31-and-df-41 │ mobile systems: the new DF-41 │ │ +│ │ │ ballistic missile and the new │ │ +│ │ │ DF-31AG │ │ +│ │ │ transporter-erector-launcher.. │ │ +│ │ │ .The DF-41 is thought to be │ │ +│ │ │ out of development but has not │ │ +│ │ │ yet moved into Operational │ │ +│ │ │ Testing and Evaluation (OT&E). │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 12 │ The 2024 DOD China Military │ Other variables are how many │ 0.90 │ +│ │ Power Report - FAS │ warheads are assigned to the │ │ +│ │ https://fas.org/publication/t │ DF-26 IRBM launchers (probably │ │ +│ │ he-2024-dod-china-military-po │ not all of them), how many of │ │ +│ │ wer-report/ │ the six SSBNs have been │ │ +│ │ │ upgraded to the JL-3 SLBM and │ │ +│ │ │ whether it is assigned │ │ +│ │ │ multiple warheads, and how │ │ +│ │ │ many DF-41 ICBM launchers are │ │ +│ │ │ operational and how many │ │ +│ │ │ warheads each missile is │ │ +│ │ │ assigned. 
│ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Exact number of │ Open sources confirm at │ +│ │ operational DF-41 │ least two brigades as of │ +│ │ brigades and launchers │ 2019 parade, with │ +│ │ as of 2025 │ additional brigades │ +│ │ │ suspected, but no │ +│ │ │ authoritative public │ +│ │ │ count of currently │ +│ │ │ operational DF-41 │ +│ │ │ launchers exists as of │ +│ │ │ 2025. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ scope_exceeded │ Specific warhead mating │ Detailed operational │ +│ │ protocols and │ warhead handling │ +│ │ pre-delegation authority │ procedures, command │ +│ │ for DF-41 brigades │ authority thresholds, and │ +│ │ │ pre-delegation rules for │ +│ │ │ DF-41 brigades are │ +│ │ │ classified and not │ +│ │ │ available in open │ +│ │ │ sources. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ contradictory_sources │ Confirmed rail-mobile │ Multiple sources indicate │ +│ │ DF-41 operational │ rail-mobile DF-41 was │ +│ │ deployment │ tested and considered, │ +│ │ │ but no sources confirm it │ +│ │ │ has been operationally │ +│ │ │ deployed in that basing │ +│ │ │ mode as of 2025. │ +├───────────────────────┼──────────────────────────┼───────────────────────────┤ +│ access_denied │ Full CNS 2023 Order of │ The PDF was identified │ +│ │ Battle PDF content on │ but binary content could │ +│ │ DF-41 brigades │ not be fully parsed to │ +│ │ │ extract specific DF-41 │ +│ │ │ brigade details from the │ +│ │ │ 2023 CNS Order of Battle. 
│ +└───────────────────────┴──────────────────────────┴───────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ new_source │ database │ PLARF DF-41 │ The 2023 CNS │ +│ │ │ brigade order of │ Order of Battle │ +│ │ │ battle 2024 2025 │ is the most │ +│ │ │ silo field │ recent structured │ +│ │ │ deployment │ OOB but may be │ +│ │ │ │ outdated given │ +│ │ │ │ rapid 2024-2025 │ +│ │ │ │ expansion. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ China DF-41 │ The LOW posture │ +│ │ │ launch on warning │ shift is │ +│ │ │ posture warhead │ documented but │ +│ │ │ mating 2024 2025 │ the degree to │ +│ │ │ │ which DF-41 │ +│ │ │ │ brigades │ +│ │ │ │ specifically have │ +│ │ │ │ implemented it │ +│ │ │ │ versus older │ +│ │ │ │ systems is │ +│ │ │ │ unclear. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ China nuclear no │ The silo │ +│ │ │ first use │ expansion and LOW │ +│ │ │ doctrine DF-41 │ posture raise │ +│ │ │ silo expansion │ academic │ +│ │ │ strategic │ questions about │ +│ │ │ stability │ NFU credibility │ +│ │ │ │ that may be │ +│ │ │ │ addressed in │ +│ │ │ │ recent strategic │ +│ │ │ │ studies │ +│ │ │ │ literature. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ contradiction │ null │ DF-41 rail-mobile │ MDAA lists │ +│ │ │ deployment status │ rail-mobile as an │ +│ │ │ operational vs │ operational │ +│ │ │ testing │ basing mode, │ +│ │ │ │ while FAS and │ +│ │ │ │ CSIS sources │ +│ │ │ │ suggest it │ +│ │ │ │ remains in │ +│ │ │ │ testing/considera │ +│ │ │ │ tion phase. This │ +│ │ │ │ contradiction │ +│ │ │ │ should be │ +│ │ │ │ investigated. 
│ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Has China fully transitioned to │ Air University and NDU sources │ +│ │ a launch-on-warning posture for │ confirm PLARF is 'working to │ +│ │ DF-41 brigades, or is this │ implement' LOW, but the degree │ +│ │ still aspirational? │ of actual implementation vs. │ +│ │ │ doctrinal aspiration is │ +│ │ │ ambiguous. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ How many DF-41 silos in the │ Reuters December 2025 report │ +│ │ three new silo fields │ indicates 100+ solid-fuel ICBMs │ +│ │ (Yumen/Gansu, Hami/Xinjiang, │ loaded in silo fields; FAS 2025 │ +│ │ Ordos/Inner Mongolia) are now │ notes continued silo │ +│ │ loaded with missiles as of │ development. The DF-41 vs DF-31 │ +│ │ 2025? │ breakdown in these silos is │ +│ │ │ unclear. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ What is the command-and-control │ LOW posture implies faster │ +│ │ structure for DF-41 brigades — │ decision timelines, raising │ +│ │ do brigade commanders have any │ questions about whether China │ +│ │ pre-delegated launch authority? │ has moved toward any degree of │ +│ │ │ pre-delegation, which would be │ +│ │ │ a major doctrinal shift. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Has the DF-41 rail-mobile │ Rail-mobile tests were reported │ +│ │ variant been operationally │ in December 2015, and the 2019 │ +│ │ deployed with any PLARF │ Pentagon report noted China │ +│ │ brigade? │ 'appears to be considering' │ +│ │ │ rail-mobile basing, but no │ +│ │ │ confirmed operational │ +│ │ │ deployment has been identified. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the specific MIRV │ FAS 2025 notes uncertainty │ +│ │ loading assignment doctrine for │ about how many warheads each │ +│ │ operational DF-41 missiles — │ DF-41 is assigned in practice, │ +│ │ are they typically deployed │ which significantly affects │ +│ │ with maximum warhead loads or │ strategic stability │ +│ │ reduced loads? │ calculations. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.72 │ +│ Corroborating sources: 12 │ +│ Source authority: high │ +│ Contradiction detected: True │ +│ Query specificity match: 0.75 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 62857 │ +│ Iterations: 3 │ +│ Wall time: 132.16s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: b3d00938-5309-4faa-a20d-97a8511bb8f9 diff --git a/docs/stress-tests/M3.3-runs/18-scope.log b/docs/stress-tests/M3.3-runs/18-scope.log new file mode 100644 index 0000000..c427373 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/18-scope.log @@ -0,0 +1,272 @@ +Researching: What internal compensation bands does Goldman Sachs use for VPs in +2026? 
+ +{"question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:15:05.792037Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:15:06.820624Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:15:06.829930Z"} +{"question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:15:06.876139Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What internal compensation bands does Goldman Sachs use for VPs in 2026?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:06.876453Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:06.876542Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1108, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:14.969587Z"} +{"step": 12, "decision": 
"Starting iteration 3/5", "tokens_so_far": 5772, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:26.767509Z"} +{"step": 17, "decision": "Starting iteration 4/5", "tokens_so_far": 15029, "event": "iteration_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:32.149418Z"} +{"step": 22, "decision": "Token budget reached before iteration 5: 26452/20000", "event": "budget_exhausted", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:41.927200Z"} +{"step": 23, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 31, "iterations_run": 4, "tokens_used": 26452, "event": "synthesis_start", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:15:41.927359Z"} +{"step": 24, "decision": "Parsed synthesis JSON successfully", "duration_ms": 65550, "event": "synthesis_complete", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:45.258119Z"} +{"step": 42, "decision": "Research complete", "confidence": 0.62, "citation_count": 10, "gap_count": 4, "discovery_count": 3, "total_duration_sec": 102.914, "event": "complete", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:45.259163Z"} +{"confidence": 0.62, "citations": 10, "gaps": 4, "discovery_events": 3, "tokens_used": 51829, "iterations_run": 4, "wall_time_sec": 98.38188624382019, "budget_exhausted": true, "event": 
"research_completed", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:16:45.259280Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": "cost_ledger_write_failed", "trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:16:45.259714Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:16:45.264223Z"} +{"trace_id": "716e548a-ceaf-4d18-8b47-ac35e3460b52", "confidence": 0.62, "citations": 10, "tokens_used": 51829, "wall_time_sec": 98.38188624382019, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:16:45.493130Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Goldman Sachs does not publicly disclose formal internal compensation bands │ +│ for VPs. Based on available evidence, the VP title at Goldman Sachs is a │ +│ single, wide-band level (there are no officially published sub-bands like │ +│ VP1/VP2/VP3 at Goldman, unlike some other banks). Compensation varies │ +│ enormously depending on division (front office vs. middle/back office) and │ +│ seniority within the band. Key data points for 2026: (1) Glassdoor reports a │ +│ typical total pay range of $213,109–$391,379 (25th–75th percentile) across │ +│ ~4,695 salary submissions, covering all VP roles firm-wide. (2) Levels.fyi │ +│ reports a median total VP compensation of $144K, which likely skews toward │ +│ tech/engineering roles. (3) 6figr reports an average of $297K (range │ +│ $265K–$501K, top 10% up to $514K) based on 67 profiles. 
(4) For front-office │ +│ Investment Banking VPs specifically, Glassdoor reports a much higher range │ +│ of $480,547–$888,585 (25th–75th percentile) based on 14 salaries. (5) │ +│ Industry benchmarks from Mergers & Inquisitions (2026 update) place │ +│ front-office IB VP base salary at $250–$300K with total compensation of │ +│ $525–$800K for NY-based roles. (6) Indeed reports an average of ~$145,324, │ +│ consistent with a broad mix of roles. Community sources (Fishbowl) confirm │ +│ the VP band is 'very wide' with no official internal sub-levels at Goldman; │ +│ pay differentiation happens informally by group, skillset, and front vs. │ +│ back office status. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Total salary range for │ The typical pay range is │ 0.85 │ +│ │ Goldman Sachs Vice President │ between $213,109 (25th │ │ +│ │ - Glassdoor │ percentile) and $391,379 (75th │ │ +│ │ https://www.glassdoor.com/Sal │ percentile) annually. This is │ │ +│ │ ary/Goldman-Sachs-Vice-Presid │ based on 4,695 salaries │ │ +│ │ ent-Salaries-E2800_D_KO14,28. │ submitted by Goldman Sachs │ │ +│ │ htm │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Total salary range for │ The typical pay range is │ 0.85 │ +│ │ Goldman Sachs Vice President │ between $220,674 (25th │ │ +│ │ - Glassdoor │ percentile) and $411,924 (75th │ │ +│ │ https://www.glassdoor.com/Sal │ percentile) annually. 
This is │ │ +│ │ ary/Goldman-Sachs-V-P-Salarie │ based on 4,695 salaries │ │ +│ │ s-E2800_D_KO14,17.htm │ submitted by Goldman Sachs │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ Goldman Sachs Vice President │ The median Vice President │ 0.75 │ +│ │ Salary | $110K-$144K+ | │ compensation in United States │ │ +│ │ Levels.fyi │ package at Goldman Sachs │ │ +│ │ https://www.levels.fyi/compan │ totals $144K per year. View │ │ +│ │ ies/goldman-sachs/salaries/vi │ the base salary, stock, and │ │ +│ │ ce-president │ bonus breakdowns for Goldman │ │ +│ │ │ Sachs's total compensation │ │ +│ │ │ packages. Last updated: │ │ +│ │ │ 4/6/2026 │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Goldman Sachs Vice President │ Employees at Goldman Sachs as │ 0.70 │ +│ │ Vp Salaries 2026 | │ Vice President Vp earn an │ │ +│ │ $265k-$514k │ average of $297k, mostly │ │ +│ │ https://6figr.com/us/salary/g │ ranging from $265k per year to │ │ +│ │ oldman-sachs--vice-president- │ $501k per year based on 67 │ │ +│ │ vp │ profiles. The top 10% │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ Goldman Sachs Investment │ The typical pay range is │ 0.65 │ +│ │ Banking Vice President ... │ between $480,547 (25th │ │ +│ │ https://www.glassdoor.com/Sal │ percentile) and $888,585 (75th │ │ +│ │ ary/Goldman-Sachs-Investment- │ percentile) annually. 
This is │ │ +│ │ Banking-Vice-President-Salari │ based on 14 salaries submitted │ │ +│ │ es-E2800_D_KO14,47.htm │ by Goldman Sachs │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Investment Banker Salary and │ Vice President (VP) | 28-40 | │ 0.88 │ +│ │ Bonus Report: 2026 Update │ $250-$300K | $525-$800K | 3-4 │ │ +│ │ https://mergersandinquisition │ years │ │ +│ │ s.com/investment-banker-salar │ │ │ +│ │ y/ │ NOTE: All numbers are pre-tax │ │ +│ │ │ for New York-based │ │ +│ │ │ front-office roles and include │ │ +│ │ │ base salaries and year-end │ │ +│ │ │ bonuses but not │ │ +│ │ │ signing/relocation bonuses, │ │ +│ │ │ stub bonuses, benefits, etc. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ Vice President yearly │ Average Goldman Sachs Vice │ 0.70 │ +│ │ salaries in the United States │ President yearly pay in the │ │ +│ │ at Goldman Sachs │ United States is approximately │ │ +│ │ https://www.indeed.com/cmp/Go │ $145,324, which is 9% below │ │ +│ │ ldman-Sachs/salaries/Vice-Pre │ the national average. Salary │ │ +│ │ sident │ estimated from │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Are there internal levels/ │ Goldman VP band is very wide. │ 0.72 │ +│ │ bands within the VP tit... | │ Promoted from associate and │ │ +│ │ Fishbowl │ Next step md is difficult to │ │ +│ │ https://www.fishbowlapp.com/p │ get. │ │ +│ │ ost/are-there-internal-levels │ │ │ +│ │ -bands-within-the-vp-title-at │ Yes, banks have different │ │ +│ │ -goldman-sachs-fwiw-this-is-f │ bands depending on skillset, │ │ +│ │ or-a-nonbusiness-internal-str │ group within the firm, front │ │ +│ │ ategy-kind │ office vs back office, etc │ │ +│ │ │ │ │ +│ │ │ Not Goldman though. 
It's just │ │ +│ │ │ VP │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ VP of FP&A at Goldman Sachs │ FP&A is middle office at │ 0.65 │ +│ │ salary : r/FPandA - Reddit │ banks, they won't make │ │ +│ │ https://www.reddit.com/r/FPan │ anywhere near $400k at VP │ │ +│ │ dA/comments/1dgguz5/vp_of_fpa │ level. Front office VP │ │ +│ │ _at_goldman_sachs_salary/ │ positions will all clear over │ │ +│ │ │ $400k in a place │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ Goldman Sachs Vp Salaries │ 15 to 15 yrs. Base. $179k. │ 0.65 │ +│ │ 2026 | $208k-$586k - │ Stocks / Yr. $21k. Bonus. │ │ +│ │ 6figr.com │ $120k. Total Salary. $318k. │ │ +│ │ https://6figr.com/us/salary/g │ Goldman Sachs Vp salary levels │ │ +│ │ oldman-sachs--vp │ ranges from Vice President │ │ +│ │ │ (Accountant) upto │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Official internal Goldman │ Goldman Sachs does not │ +│ │ Sachs VP compensation │ publicly publish its │ +│ │ bands │ internal compensation │ +│ │ │ bands or grade │ +│ │ │ structures. No │ +│ │ │ authoritative internal │ +│ │ │ HR documentation was │ +│ │ │ found. All data is from │ +│ │ │ third-party crowdsourced │ +│ │ │ salary platforms. │ +├───────────────────────┼───────────────────────────┼──────────────────────────┤ +│ source_not_found │ VP sub-band breakdown │ Community sources │ +│ │ (VP1/VP2/VP3 equivalents) │ explicitly state Goldman │ +│ │ │ uses a single 'VP' title │ +│ │ │ with no formal │ +│ │ │ sub-levels, unlike some │ +│ │ │ peers. No granular │ +│ │ │ sub-band salary data │ +│ │ │ exists in any source │ +│ │ │ reviewed. 
│ +├───────────────────────┼───────────────────────────┼──────────────────────────┤ +│ scope_exceeded │ Non-US VP compensation │ Some sources (e.g., │ +│ │ bands │ AmbitionBox) reference │ +│ │ │ India-based VP salaries │ +│ │ │ (₹49.4L–₹54.6L), but │ +│ │ │ comprehensive │ +│ │ │ international band data │ +│ │ │ was not gathered. The │ +│ │ │ question context appears │ +│ │ │ US-focused. │ +├───────────────────────┼───────────────────────────┼──────────────────────────┤ +│ contradictory_sources │ Levels.fyi median │ Levels.fyi reports a │ +│ │ discrepancy │ median of $144K while │ +│ │ │ Glassdoor and 6figr │ +│ │ │ report $213K–$411K │ +│ │ │ ranges. Levels.fyi │ +│ │ │ likely captures │ +│ │ │ engineering/tech VPs who │ +│ │ │ have different │ +│ │ │ compensation structures │ +│ │ │ and lower base pay than │ +│ │ │ finance VPs. │ +└───────────────────────┴───────────────────────────┴──────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ contradiction │ database │ Goldman Sachs VP │ Large discrepancy │ +│ │ │ total │ between │ +│ │ │ compensation by │ Levels.fyi ($144K │ +│ │ │ division 2025 │ median) and │ +│ │ │ 2026 │ Glassdoor │ +│ │ │ │ ($213K–$391K │ +│ │ │ │ range) suggests │ +│ │ │ │ the VP population │ +│ │ │ │ is heterogeneous │ +│ │ │ │ across tech and │ +│ │ │ │ finance │ +│ │ │ │ functions; │ +│ │ │ │ further │ +│ │ │ │ segmentation by │ +│ │ │ │ division would │ +│ │ │ │ resolve this. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ null │ Goldman Sachs │ Understanding how │ +│ │ │ internal grade │ Goldman's VP band │ +│ │ │ structure VP │ maps to peer │ +│ │ │ Director MD 2026 │ banks' grade │ +│ │ │ │ systems would │ +│ │ │ │ clarify the wide │ +│ │ │ │ compensation │ +│ │ │ │ range observed. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ null │ Goldman Sachs │ Mergers & │ +│ │ │ 2025 bonus pool │ Inquisitions │ +│ │ │ VP payout by │ notes senior │ +│ │ │ division │ bankers (VPs+) │ +│ │ │ │ received │ +│ │ │ │ disproportionate │ +│ │ │ │ 2025 bonus │ +│ │ │ │ increases; │ +│ │ │ │ division-level │ +│ │ │ │ data would │ +│ │ │ │ sharpen the band │ +│ │ │ │ picture. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ Does Goldman Sachs use any │ Fishbowl community posts │ +│ │ informal internal seniority │ confirm the VP band is wide and │ +│ │ designations within the VP │ pay varies significantly, but │ +│ │ title (e.g., junior VP vs. │ it is unclear whether informal │ +│ │ senior VP) that affect │ tracking of seniority within │ +│ │ compensation but are not │ the band drives structured pay │ +│ │ publicly disclosed? │ steps. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ How did 2025 year-end bonuses │ Mergers & Inquisitions notes │ +│ │ for Goldman Sachs VPs compare │ that VPs and Directors saw │ +│ │ to the prior year, and were │ 10–15% total comp increases in │ +│ │ front-office VPs │ 2025, but Goldman-specific │ +│ │ disproportionate beneficiaries? │ figures were not isolated. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Why does Levels.fyi report a │ The discrepancy likely reflects │ +│ │ $144K median for Goldman Sachs │ different user populations │ +│ │ VPs when Glassdoor and 6figr │ (tech-focused on Levels.fyi vs. │ +│ │ report ranges starting at │ finance-focused on │ +│ │ $213K–$265K? │ Glassdoor/6figr), but this has │ +│ │ │ not been confirmed. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What is the typical │ Fishbowl notes the VP band is │ +│ │ time-in-grade for a Goldman │ wide and the step to MD is │ +│ │ Sachs VP before promotion to │ difficult; Mergers & │ +│ │ Managing Director, and does │ Inquisitions gives a 3–4 year │ +│ │ longer tenure correlate with │ promotion window for VPs across │ +│ │ meaningfully higher within-band │ large banks. │ +│ │ pay? │ │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.62 │ +│ Corroborating sources: 8 │ +│ Source authority: medium │ +│ Contradiction detected: True │ +│ Query specificity match: 0.55 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 51829 │ +│ Iterations: 4 │ +│ Wall time: 98.38s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: 716e548a-ceaf-4d18-8b47-ac35e3460b52 diff --git a/docs/stress-tests/M3.3-runs/19-scope.log b/docs/stress-tests/M3.3-runs/19-scope.log new file mode 100644 index 0000000..4668ada --- /dev/null +++ b/docs/stress-tests/M3.3-runs/19-scope.log @@ -0,0 +1,343 @@ +Researching: How does Renaissance Technologies Medallion Fund actually generate +alpha? 
+ +{"question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:16:46.074147Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:16:46.829107Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:16:46.837149Z"} +{"question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:16:46.869281Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "How does Renaissance Technologies Medallion Fund actually generate alpha?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:46.869587Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:46.869675Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1104, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:16:56.914799Z"} +{"step": 14, "decision": 
"Starting iteration 3/5", "tokens_so_far": 8370, "event": "iteration_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:03.842868Z"} +{"step": 21, "decision": "Token budget reached before iteration 4: 20077/20000", "event": "budget_exhausted", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:13.960507Z"} +{"step": 22, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 23, "iterations_run": 3, "tokens_used": 20077, "event": "synthesis_start", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:17:13.961508Z"} +{"step": 23, "decision": "Parsed synthesis JSON successfully", "duration_ms": 74831, "event": "synthesis_complete", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:25.398868Z"} +{"step": 42, "decision": "Research complete", "confidence": 0.82, "citation_count": 10, "gap_count": 4, "discovery_count": 4, "total_duration_sec": 101.925, "event": "complete", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:25.400004Z"} +{"confidence": 0.82, "citations": 10, "gaps": 4, "discovery_events": 4, "tokens_used": 43096, "iterations_run": 3, "wall_time_sec": 98.52941536903381, "budget_exhausted": true, "event": "research_completed", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:18:25.400108Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", 
"event": "cost_ledger_write_failed", "trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "researcher": "web", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:18:25.400618Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:18:25.405316Z"} +{"trace_id": "b7cd9d50-3eec-4eca-8db0-a580722c2b19", "confidence": 0.82, "citations": 10, "tokens_used": 43096, "wall_time_sec": 98.52941536903381, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:18:25.623416Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ Renaissance Technologies' Medallion Fund generates alpha through several │ +│ reinforcing mechanisms, all grounded in quantitative and data-driven methods │ +│ rather than traditional financial intuition: │ +│ │ +│ 1. **Statistical Arbitrage & Pattern Recognition**: The fund identifies │ +│ subtle, recurring market inefficiencies and pricing anomalies by analyzing │ +│ vast amounts of historical and real-time data. It profits from small │ +│ mispricings across many trades rather than large directional bets. [Sources │ +│ 3, 6, 8] │ +│ │ +│ 2. **Advanced Mathematical & Quantitative Models**: Renaissance employs │ +│ sophisticated statistical models, hidden Markov models (used as early as │ +│ 1983), and continuously refined algorithms to predict short-term price │ +│ movements. The firm hired mathematicians, physicists, and computer │ +│ scientists—not traditional Wall Street traders—to build these models. │ +│ [Sources 9, 16, 21, 23] │ +│ │ +│ 3. **Machine Learning & AI Integration**: Medallion continuously refines its │ +│ models using machine learning, allowing them to adapt to changing market │ +│ conditions and discover non-obvious patterns. [Sources 6, 8] │ +│ │ +│ 4. 
**High-Frequency, Fully Automated Trading**: The fund executes │ +│ 150,000–300,000 trades daily through fully automated systems, eliminating │ +│ emotional bias and exploiting fleeting inefficiencies at scale. [Source 8] │ +│ │ +│ 5. **Market-Neutral & Diversified Strategies**: By balancing long and short │ +│ positions across many asset classes (equities, futures, options, currencies) │ +│ and geographies, the fund reduces exposure to broad market moves. This is │ +│ evidenced by the fund returning +74.6% in 2008 when markets crashed. │ +│ [Sources 6, 16] │ +│ │ +│ 6. **Leverage & Risk Management via Kelly Criterion**: Medallion uses │ +│ significant leverage combined with disciplined risk management techniques, │ +│ including the Kelly Criterion, to size positions optimally and control │ +│ drawdown. [Sources 6, 8] │ +│ │ +│ 7. **Extreme Secrecy & Employee-Only Structure**: The fund has been closed │ +│ to outside investors since 1993, aligning incentives exclusively with │ +│ employees and partners. This exclusivity prevents strategy dilution and │ +│ protects proprietary edge. [Sources 5, 6, 12] │ +│ │ +│ 8. **Massive Data Collection & Cleaning**: Renaissance amasses and │ +│ meticulously cleans enormous datasets of historical price data, economic │ +│ indicators, and alternative data sources as the raw material for model │ +│ building. [Sources 15, 21] │ +│ │ +│ 9. **Collaborative, Academic Culture**: Simons fostered an open, peer-driven │ +│ environment where ideas were freely shared among top-tier scientists, │ +│ accelerating model refinement and discovery. [Sources 16, 21] │ +│ │ +│ The cumulative result: average annual returns of 66% before fees and 39% │ +│ after fees from 1988 to 2018—the best sustained track record in investment │ +│ history. A $100 investment in 1988 would have grown to approximately $398.7 │ +│ million by 2018, versus $1,815 for the S&P 500 over the same period. 
│ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ Renaissance Technologies: The │ Between 1988 and 2018, │ 0.97 │ +│ │ $100 Billion Built on │ Renaissance Technologies' │ │ +│ │ Statistical Arbitrage │ Medallion Fund generated │ │ +│ │ https://navnoorbawa.substack. │ average annual returns of 66% │ │ +│ │ com/p/renaissance-technologie │ before fees and 39% after fees │ │ +│ │ s-the-100 │ — the most successful track │ │ +│ │ │ record in investing history. A │ │ +│ │ │ $100 investment in 1988 would │ │ +│ │ │ have grown to approximately │ │ +│ │ │ $398.7 million by 2018. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ Jim Simons Trading Strategy │ Fully automated systems │ 0.93 │ +│ │ Explained: Inside Renaissance │ executed 150,000–300,000 │ │ +│ │ Technologies │ trades daily, eliminating │ │ +│ │ https://www.quantvps.com/blog │ emotional biases. Techniques │ │ +│ │ /jim-simons-trading-strategy │ like the Kelly Criterion and │ │ +│ │ │ balanced portfolios helped │ │ +│ │ │ control risk and maintain │ │ +│ │ │ consistent returns. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ The Curious Case of Medallion │ The fund employs sophisticated │ 0.92 │ +│ │ Fund: Renaissance │ statistical and mathematical │ │ +│ │ Technologies' Hedge Fund │ models to identify and │ │ +│ │ Success │ capitalize on market │ │ +│ │ https://www.schoolofhedge.com │ inefficiencies. Medallion │ │ +│ │ /pages/the-curious-case-of-me │ integrates machine learning │ │ +│ │ dallion-fund │ and artificial intelligence to │ │ +│ │ │ refine its models continually, │ │ +│ │ │ adapting to changing market │ │ +│ │ │ conditions. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ Decoding the Medallion Fund │ The Medallion Fund boasts an │ 0.95 │ +│ │ Returns: What We Know About │ unprecedented average annual │ │ +│ │ Its Annual Performance │ return of 66% before fees over │ │ +│ │ https://www.quantifiedstrateg │ 30 years, achieving a net │ │ +│ │ ies.com/medallion-fund-return │ return of 39% after fees. The │ │ +│ │ s/ │ Medallion Fund has been closed │ │ +│ │ │ to outside investors since │ │ +│ │ │ 1993 and is only available to │ │ +│ │ │ current and past employees and │ │ +│ │ │ their families. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ James Simons (Renaissance │ In 1983 he was using Hidden │ 0.85 │ +│ │ Technologies Corp.) and his │ Markov Models. Now he employs │ │ +│ │ model - Quantitative Finance │ 100+ PhDs, therefore I expect │ │ +│ │ Stack Exchange │ he will have 50+ strategies │ │ +│ │ https://quant.stackexchange.c │ using 200+ predictors. And set │ │ +│ │ om/questions/30056/james-simo │ up as a production line, from │ │ +│ │ ns-renaissance-technologies-c │ the teams importing and │ │ +│ │ orp-and-his-model │ cleaning data, down to │ │ +│ │ │ execution of trades. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ Simons' Strategies: │ Market-Neutral Strategies: │ 0.91 │ +│ │ Renaissance Trading Unpacked │ Balancing long and short │ │ +│ │ - LuxAlgo │ positions reduces risk. Unique │ │ +│ │ https://www.luxalgo.com/blog/ │ Hiring: Scientists and │ │ +│ │ simons-strategies-renaissance │ mathematicians, not Wall │ │ +│ │ -trading-unpacked/ │ Street veterans, build their │ │ +│ │ │ trading models. Even during │ │ +│ │ │ crashes like 2008, Medallion │ │ +│ │ │ outperformed with a 74.6% │ │ +│ │ │ return. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ The Man Who Solved the Market │ Renaissance's success was │ 0.93 │ +│ │ by Gregory Zuckerman - │ built on amassing and │ │ +│ │ Summary & Notes │ meticulously cleaning vast │ │ +│ │ https://bagerbach.com/books/t │ amounts of historical price │ │ +│ │ he-man-who-solved-the-market/ │ data, then using it to model │ │ +│ │ │ and predict market behavior. │ │ +│ │ │ They treated investing like a │ │ +│ │ │ scientific problem, forming │ │ +│ │ │ hypotheses, testing them │ │ +│ │ │ rigorously, and iterating │ │ +│ │ │ constantly. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ Cracking the Code: Inside the │ Medallion began as an │ 0.88 │ +│ │ Medallion Fund and Jim │ experiment in pattern │ │ +│ │ Simons' Secretive Empire │ recognition. Over time, it │ │ +│ │ https://medium.com/@trading.d │ evolved into a fully │ │ +│ │ ude/cracking-the-code-inside- │ automated, high-frequency, │ │ +│ │ the-medallion-fund-and-jim-si │ multi-strategy quant │ │ +│ │ mons-secretive-empire-b9af084 │ powerhouse. It traded │ │ +│ │ 15b4f │ everything from equities to │ │ +│ │ │ futures. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Renaissance Technologies and │ Renaissance Technologies, │ 0.92 │ +│ │ The Medallion Fund │ often just referred to as │ │ +│ │ https://quartr.com/insights/e │ RenTec, is reputed as the │ │ +│ │ dge/renaissance-technologies- │ highest-performing investment │ │ +│ │ and-the-medallion-fund │ firms ever, with its Medallion │ │ +│ │ │ Fund having returned a net │ │ +│ │ │ 90,129x to investors between │ │ +│ │ │ the years 1988-2022 leveraging │ │ +│ │ │ a quantitative investment │ │ +│ │ │ approach. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 10 │ Jim Simons – The Man Who │ Simons decided to use a purely │ 0.90 │ +│ │ Solved the Market - Build │ systematic approach to avoid │ │ +│ │ Alpha │ emotional rollercoasters and │ │ +│ │ https://www.buildalpha.com/ji │ avoid common trading biases │ │ +│ │ m-simons-the-man-who-solved-t │ that trip up most traders. │ │ +│ │ he-market/ │ Simons staffed the new fund, │ │ +│ │ │ Renaissance Technologies, with │ │ +│ │ │ mathematicians, computer │ │ +│ │ │ scientists, and physicists to │ │ +│ │ │ pioneer. │ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ access_denied │ Specific algorithmic │ Renaissance Technologies │ +│ │ details and signal types │ maintains extreme secrecy │ +│ │ used by the Medallion Fund │ around its specific trading │ +│ │ │ signals, factor exposures, │ +│ │ │ and model architecture. No │ +│ │ │ public source has ever │ +│ │ │ confirmed the exact │ +│ │ │ mathematical formulas, │ +│ │ │ specific predictors, or │ +│ │ │ strategy details. All │ +│ │ │ evidence is from secondary │ +│ │ │ sources and informed │ +│ │ │ inference. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Post-2018 performance data │ Most verified return data │ +│ │ for the Medallion Fund │ covers 1988-2018. Some │ +│ │ │ sources reference │ +│ │ │ performance through 2022 │ +│ │ │ but with less granular │ +│ │ │ annual data. The fund does │ +│ │ │ not file public performance │ +│ │ │ reports. 
│ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Specific leverage ratios │ While sources note that │ +│ │ used by the Medallion Fund │ high leverage is a │ +│ │ │ component of alpha │ +│ │ │ generation, specific │ +│ │ │ leverage multiples are not │ +│ │ │ publicly disclosed and were │ +│ │ │ not found in the gathered │ +│ │ │ evidence. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Fee structure and its exact │ Sources confirm the fund │ +│ │ impact on net returns over │ charges approximately 5% │ +│ │ time │ management and 44% │ +│ │ │ performance fees │ +│ │ │ (historically), but │ +│ │ │ detailed year-by-year │ +│ │ │ impact analysis was not │ +│ │ │ found in the gathered │ +│ │ │ evidence. │ +└──────────────────┴─────────────────────────────┴─────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ arxiv │ statistical │ Simons used │ +│ │ │ arbitrage hidden │ Hidden Markov │ +│ │ │ Markov models │ Models in 1983. │ +│ │ │ financial markets │ Academic papers │ +│ │ │ quantitative │ on HMMs in │ +│ │ │ trading │ finance could │ +│ │ │ │ illuminate the │ +│ │ │ │ mathematical │ +│ │ │ │ foundation of │ +│ │ │ │ early Medallion │ +│ │ │ │ strategies. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ Kelly Criterion │ The Kelly │ +│ │ │ optimal position │ Criterion is │ +│ │ │ sizing hedge fund │ cited as a key │ +│ │ │ leverage │ risk management │ +│ │ │ quantitative │ tool; academic │ +│ │ │ trading │ literature could │ +│ │ │ │ clarify how it │ +│ │ │ │ specifically │ +│ │ │ │ contributes to │ +│ │ │ │ alpha │ +│ │ │ │ sustainability. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ new_source │ database │ Renaissance │ SEC 13F filings │ +│ │ │ Technologies SEC │ for Renaissance's │ +│ │ │ 13F filings RIEF │ public-facing │ +│ │ │ RIDA │ funds (RIEF, │ +│ │ │ institutional │ RIDA) could │ +│ │ │ holdings │ provide insight │ +│ │ │ │ into equity │ +│ │ │ │ selection │ +│ │ │ │ methodology, │ +│ │ │ │ though not │ +│ │ │ │ Medallion │ +│ │ │ │ directly. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ null │ Gregory Zuckerman │ The book by │ +│ │ │ The Man Who │ Zuckerman is │ +│ │ │ Solved the Market │ cited as the most │ +│ │ │ primary source │ authoritative │ +│ │ │ analysis │ public account of │ +│ │ │ │ Renaissance's │ +│ │ │ │ methods; a deeper │ +│ │ │ │ review could │ +│ │ │ │ yield more │ +│ │ │ │ specific │ +│ │ │ │ mechanism │ +│ │ │ │ details. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ How has the Medallion Fund │ Multiple sources confirm the │ +│ │ maintained its edge as markets │ strategy has worked for 30+ │ +│ │ have become more efficient and │ years, but with algorithmic │ +│ │ other quant funds have adopted │ trading now comprising 60-73% │ +│ │ similar approaches? │ of U.S. equity trades, the │ +│ │ │ persistence of edge is │ +│ │ │ theoretically challenging. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ What is the role of capacity │ The fund is closed to outside │ +│ │ constraints in limiting │ investors and capped in size, │ +│ │ Medallion's AUM, and how does │ suggesting strategy returns │ +│ │ the fund's small size (~$10B) │ diminish at scale. 
This │ +│ │ contribute to its returns? │ capacity question is central to │ +│ │ │ understanding whether the alpha │ +│ │ │ is truly replicable. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ To what extent does Medallion's │ Sources describe both │ +│ │ alpha come from market │ high-frequency automated │ +│ │ microstructure exploitation │ trading and statistical │ +│ │ (e.g., short-term mean │ arbitrage, but the precise time │ +│ │ reversion) vs. longer-horizon │ horizon distribution of trades │ +│ │ factor exposures? │ is unknown publicly. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How has Medallion's strategy │ Jim Simons passed away in May │ +│ │ evolved since Jim Simons' │ 2024. The sustainability of the │ +│ │ retirement from day-to-day │ fund's culture and edge under │ +│ │ management and his death in May │ new leadership is an open │ +│ │ 2024? │ question. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ What specific alternative data │ Sources mention 'alternative │ +│ │ sources (beyond price/volume) │ data sources' as inputs but │ +│ │ does Renaissance use as inputs │ provide no specifics, leaving │ +│ │ to its models? │ this dimension of the alpha │ +│ │ │ generation process unresolved. 
│ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.82 │ +│ Corroborating sources: 10 │ +│ Source authority: medium │ +│ Contradiction detected: False │ +│ Query specificity match: 0.75 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 43096 │ +│ Iterations: 3 │ +│ Wall time: 98.53s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: b7cd9d50-3eec-4eca-8db0-a580722c2b19 diff --git a/docs/stress-tests/M3.3-runs/20-scope.log b/docs/stress-tests/M3.3-runs/20-scope.log new file mode 100644 index 0000000..519b212 --- /dev/null +++ b/docs/stress-tests/M3.3-runs/20-scope.log @@ -0,0 +1,325 @@ +Researching: What are the precise materials and tolerances in TSMC's 2nm +process? 
+ +{"question": "What are the precise materials and tolerances in TSMC's 2nm process?", "depth": "balanced", "max_iterations": null, "token_budget": null, "event": "ask_started", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:18:26.198498Z"} +{"transport": "stdio", "server": "marchwarden-web-researcher", "event": "mcp_server_starting", "logger": "marchwarden.mcp", "level": "info", "timestamp": "2026-04-09T02:18:26.963097Z"} +{"event": "Processing request of type CallToolRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:18:26.972484Z"} +{"question": "What are the precise materials and tolerances in TSMC's 2nm process?", "depth": "balanced", "max_iterations": 5, "token_budget": 20000, "model_id": "claude-sonnet-4-6", "event": "research_started", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:18:27.004492Z"} +{"step": 1, "decision": "Beginning research: depth=balanced", "question": "What are the precise materials and tolerances in TSMC's 2nm process?", "context": "", "max_iterations": 5, "token_budget": 20000, "event": "start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:27.004812Z"} +{"step": 2, "decision": "Starting iteration 1/5", "tokens_so_far": 0, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:27.004904Z"} +{"step": 7, "decision": "Starting iteration 2/5", "tokens_so_far": 1158, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:40.769568Z"} +{"step": 14, "decision": "Starting iteration 
3/5", "tokens_so_far": 11802, "event": "iteration_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:47.013233Z"} +{"step": 19, "decision": "Token budget reached before iteration 4: 30249/20000", "event": "budget_exhausted", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:57.139804Z"} +{"step": 20, "decision": "Beginning synthesis of gathered evidence", "evidence_count": 29, "iterations_run": 3, "tokens_used": 30249, "event": "synthesis_start", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:18:57.139984Z"} +{"step": 21, "decision": "Parsed synthesis JSON successfully", "duration_ms": 77777, "event": "synthesis_complete", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:20:12.633197Z"} +{"step": 40, "decision": "Research complete", "confidence": 0.42, "citation_count": 9, "gap_count": 5, "discovery_count": 4, "total_duration_sec": 109.056, "event": "complete", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.trace", "level": "info", "timestamp": "2026-04-09T02:20:12.634189Z"} +{"confidence": 0.42, "citations": 9, "gaps": 5, "discovery_events": 4, "tokens_used": 62620, "iterations_run": 3, "wall_time_sec": 105.62861347198486, "budget_exhausted": true, "event": "research_completed", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "info", "timestamp": "2026-04-09T02:20:12.634324Z"} +{"error": "[Errno 13] Permission denied: '/home/micro/.marchwarden/costs.jsonl'", "event": 
"cost_ledger_write_failed", "researcher": "web", "trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "logger": "marchwarden.researcher.web", "level": "warning", "timestamp": "2026-04-09T02:20:12.634698Z"} +{"event": "Processing request of type ListToolsRequest", "logger": "mcp.server.lowlevel.server", "level": "info", "timestamp": "2026-04-09T02:20:12.639617Z"} +{"trace_id": "a4bb5b7a-61dd-446b-8c06-06c78de5fef7", "confidence": 0.42, "citations": 9, "tokens_used": 62620, "wall_time_sec": 105.62861347198486, "event": "ask_completed", "logger": "marchwarden.cli", "level": "info", "timestamp": "2026-04-09T02:20:12.967147Z"} +╭─────────────────────────────────── Answer ───────────────────────────────────╮ +│ TSMC's 2nm (N2) process node, which began volume production in Q4 2025, │ +│ introduces several key technical advances, though precise proprietary │ +│ materials specifications and sub-angstrom tolerances are not publicly │ +│ disclosed. What is publicly known: │ +│ │ +│ **Transistor Architecture:** N2 is TSMC's first node to use Gate-All-Around │ +│ (GAA) nanosheet transistors, replacing the FinFET architecture used since │ +│ 2011. The gate surrounds the silicon nanosheet channel on all sides, │ +│ providing superior electrostatic control and reduced gate leakage compared │ +│ to 3nm FinFETs [Sources 10, 13, 21]. │ +│ │ +│ **Process Node Dimensions (IEEE IRDS):** The 2nm node class is projected to │ +│ have a contacted gate pitch of ~45nm and a tightest metal pitch of ~20nm, │ +│ per IEEE International Roadmap for Devices and Systems (2021 update) [Source │ +│ 16]. │ +│ │ +│ **Interconnects:** N2 features copper (Cu)-based redistribution layers │ +│ (RDLs) with flat passivation and through-silicon vias (TSVs), co-optimized │ +│ with 3DIC integration. Middle- and back-end-of-line (MEOL/BEOL) │ +│ interconnects are included, with the densest SRAM macro ever reported at │ +│ approximately 38 Mb/mm² [Sources 4, 21]. │ +│ │ +│ **Performance Metrics (vs. 
N3E):** 24–35% power reduction OR 15% performance │ +│ improvement at iso-voltage; >1.15x transistor density improvement over N3 │ +│ [Sources 10, 18, 21]. │ +│ │ +│ **Yield:** Initial yields reportedly ~70%, with some memory products │ +│ exceeding 90%. A 6% yield improvement over baseline was reported in late │ +│ 2024 [Sources 13, 14]. │ +│ │ +│ **Applications:** Designed for AI, mobile, and HPC applications. Key │ +│ customers include Apple (A20 chip for iPhone 18 Pro) and NVIDIA [Sources 8, │ +│ 14]. │ +│ │ +│ **Fab Locations:** Primary production in Hsinchu and Kaohsiung, Taiwan; a │ +│ Kaohsiung 2nm facility expansion ceremony was held March 31, 2025 [Source │ +│ 6]. │ +│ │ +│ **Specific proprietary materials** (e.g., exact dielectric compositions, │ +│ gate oxide materials, metal liner chemistries, doping concentrations, and │ +│ nanometer-level tolerances on nanosheet thickness/width) are not publicly │ +│ disclosed by TSMC and were not found in the available evidence. │ +╰──────────────────────────────────────────────────────────────────────────────╯ + Citations +┏━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ +┃ # ┃ Title / Locator ┃ Excerpt ┃ Conf ┃ +┡━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ +│ 1 │ TSMC shares deep-dive details │ The new production node │ 0.95 │ +│ │ about its cutting edge 2nm │ promises a 24 to 35% power │ │ +│ │ process node at IEDM 2024 — │ reduction or 15% performance │ │ +│ │ 35 percent less power or 15 │ improvement at the same │ │ +│ │ percent more performance | │ voltage, and 1.15X higher │ │ +│ │ Tom's Hardware │ transistor density than the │ │ +│ │ https://www.tomshardware.com/ │ previous 3nm node. 
│ │ +│ │ tech-industry/tsmc-shares-dee │ │ │ +│ │ p-dive-details-about-its-cutt │ │ │ +│ │ ing-edge-2nm-process-node-at- │ │ │ +│ │ iedm-2024-35-percent-less-pow │ │ │ +│ │ er-or-15-percent-more-perform │ │ │ +│ │ ance │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 2 │ IEDM 2024 – TSMC 2nm Process │ The paper states that the │ 0.95 │ +│ │ Disclosure - TechInsights │ process delivers a 30% power │ │ +│ │ https://library.techinsights. │ improvement or 15% performance │ │ +│ │ com/public/hg-asset/f32a0f17- │ gain and >1.15x density versus │ │ +│ │ 5369-4c97-913c-b78d2ddd833b │ the previous 3nm node. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 3 │ The Shape of Tomorrow's │ The new N2 platform features │ 0.93 │ +│ │ Semiconductor Technology - │ GAA nanosheet transistors; │ │ +│ │ Semiconductor Digest │ middle-/back-end-of-line │ │ +│ │ https://www.semiconductor-dig │ interconnects with the densest │ │ +│ │ est.com/the-shape-of-tomorrow │ SRAM macro ever reported │ │ +│ │ s-semiconductor-technology/ │ (~38Mb/mm2); and a holistic, │ │ +│ │ │ system-technology co-optimized │ │ +│ │ │ (STCO) architecture offering │ │ +│ │ │ great design flexibility. That │ │ +│ │ │ architecture includes a │ │ +│ │ │ scalable copper-based │ │ +│ │ │ redistribution layer and a │ │ +│ │ │ flat passivation layer (for │ │ +│ │ │ better performance, robust │ │ +│ │ │ CPI, and seamless 3D │ │ +│ │ │ integration); and │ │ +│ │ │ through-silicon vias, or TSVs. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 4 │ 2 nm process - Wikipedia │ According to the projections │ 0.90 │ +│ │ https://en.wikipedia.org/wiki │ contained in the 2021 update │ │ +│ │ /2_nm_process │ of the International Roadmap │ │ +│ │ │ for Devices and Systems │ │ +│ │ │ published by the Institute of │ │ +│ │ │ Electrical and Electronics │ │ +│ │ │ Engineers (IEEE), a '2.1 nm │ │ +│ │ │ node range label' is expected │ │ +│ │ │ to have a contacted gate pitch │ │ +│ │ │ of 45 nanometers and a │ │ +│ │ │ tightest metal pitch of 20 │ │ +│ │ │ nanometers. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 5 │ TSMC Boosts 2 nm Yields by │ A key innovation in the N2 │ 0.88 │ +│ │ 6%, Passing Savings to │ process is the enhanced design │ │ +│ │ Customers | TechPowerUp │ of its GAA nanosheet │ │ +│ │ https://www.techpowerup.com/3 │ transistors, which offers │ │ +│ │ 29435/tsmc-boosts-2-nm-yields │ improved electrostatic control │ │ +│ │ -by-6-passing-savings-to-cust │ and reduced gate leakage │ │ +│ │ omers │ compared to 3 nm FinFET │ │ +│ │ │ transistors, given that the │ │ +│ │ │ gate can be controlled from │ │ +│ │ │ all sides. This advancement │ │ +│ │ │ enables smaller high-density │ │ +│ │ │ transistors to maintain │ │ +│ │ │ reliable performance through │ │ +│ │ │ better threshold voltage │ │ +│ │ │ tuning capabilities. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 6 │ TSMC 2nm, full details │ This 2nm platform technology │ 0.82 │ +│ │ revealed-Electronics │ includes new Cu RDLs with flat │ │ +│ │ Headlines-EEWORLD │ passivation and TSVs, │ │ +│ │ https://en.eeworld.com.cn/mp/ │ optimized holistically with │ │ +│ │ Icbank/a391002.jspx │ 3DIC to enable system │ │ +│ │ │ integration. 
│ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 7 │ TSMC begins quietly volume │ TSMC has quietly revealed that │ 0.97 │ +│ │ production of 2nm-class chips │ it had commenced volume │ │ +│ │ | Tom's Hardware │ production of chips using its │ │ +│ │ https://www.tomshardware.com/ │ N2 (2nm-class) fabrication │ │ +│ │ tech-industry/semiconductors/ │ process... 'TSMC's 2nm (N2) │ │ +│ │ tsmc-begins-quietly-volume-pr │ technology has started volume │ │ +│ │ oduction-of-2nm-class-chips-f │ production in 4Q25 as │ │ +│ │ irst-gaa-transistor-for-tsmc- │ planned.' │ │ +│ │ claims-up-to-15-percent-impro │ │ │ +│ │ vement-at-iso-power │ │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 8 │ TSMC's 2nm Yield Rates Surge │ Initial tsmc 2nm yield rates │ 0.75 │ +│ │ as Mass Production Ramps Up │ are notably high, reportedly │ │ +│ │ in 2026 │ reaching around 70%. Some │ │ +│ │ https://heqingele.com/blog/ts │ reports even indicate yields │ │ +│ │ mc-2nm-yield-rates-mass-produ │ surpassing 90% for certain │ │ +│ │ ction-status-2026/ │ memory products. │ │ +├─────┼───────────────────────────────┼────────────────────────────────┼───────┤ +│ 9 │ Unlocking the Future: TSMC's │ On March 31, 2025, TSMC held │ 0.80 │ +│ │ Bold Strategy for the 2nm │ an expansion ceremony for its │ │ +│ │ Revolution! │ 2nm production facility in │ │ +│ │ https://tspasemiconductor.sub │ Kaohsiung, marking a │ │ +│ │ stack.com/p/unlocking-the-fut │ significant milestone in │ │ +│ │ ure-tsmcs-bold-strategy-cb2 │ Taiwan's semiconductor │ │ +│ │ │ advanced manufacturing │ │ +│ │ │ expansion. 
│ │ +└─────┴───────────────────────────────┴────────────────────────────────┴───────┘ + Gaps +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Category ┃ Topic ┃ Detail ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ source_not_found │ Exact dielectric and gate │ TSMC does not publicly │ +│ │ oxide materials used in N2 │ disclose the specific │ +│ │ GAA nanosheet transistors │ high-k dielectric │ +│ │ │ materials, interfacial │ +│ │ │ layer compositions, or work │ +│ │ │ function metal chemistries │ +│ │ │ used in the N2 gate stack. │ +│ │ │ These are considered core │ +│ │ │ IP. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Nanosheet thickness and │ The precise nanometer-scale │ +│ │ width tolerances │ dimensions and process │ +│ │ │ tolerances (e.g., nanosheet │ +│ │ │ thickness variation, │ +│ │ │ critical dimension │ +│ │ │ uniformity) for N2 GAA │ +│ │ │ nanosheets are not publicly │ +│ │ │ available. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Metal interconnect liner │ While Cu RDLs are │ +│ │ and barrier materials │ confirmed, the specific │ +│ │ │ barrier/liner materials │ +│ │ │ (e.g., whether ruthenium or │ +│ │ │ cobalt liners replace │ +│ │ │ TaN/Ta at this node) are │ +│ │ │ not disclosed in public │ +│ │ │ sources. │ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ Doping profiles and implant │ Source/drain doping │ +│ │ specifications │ concentrations, implant │ +│ │ │ energies, and anneal │ +│ │ │ conditions are proprietary │ +│ │ │ and not published. 
│ +├──────────────────┼─────────────────────────────┼─────────────────────────────┤ +│ source_not_found │ EUV lithography specifics │ The number of EUV exposures │ +│ │ (number of EUV layers, │ per layer, overlay │ +│ │ stochastic defect control │ tolerances, and specific │ +│ │ methods) │ stochastic control │ +│ │ │ approaches are not detailed │ +│ │ │ in public TSMC disclosures. │ +└──────────────────┴─────────────────────────────┴─────────────────────────────┘ + Discovery Events +┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ +┃ ┃ Suggested ┃ ┃ ┃ +┃ Type ┃ Researcher ┃ Query ┃ Reason ┃ +┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ +│ related_research │ arxiv │ TSMC N2 nanosheet │ IEEE IEDM 2024 │ +│ │ │ GAA transistor │ papers from TSMC │ +│ │ │ gate stack │ may contain more │ +│ │ │ materials high-k │ specific │ +│ │ │ dielectric IEDM │ materials details │ +│ │ │ 2024 │ in the full │ +│ │ │ │ published │ +│ │ │ │ proceedings not │ +│ │ │ │ summarized in │ +│ │ │ │ news articles. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ TSMC 2nm N2 │ TSMC patent │ +│ │ │ process patent │ filings related │ +│ │ │ filings nanosheet │ to N2 may reveal │ +│ │ │ gate-all-around │ specific │ +│ │ │ materials │ materials │ +│ │ │ │ choices, │ +│ │ │ │ tolerances, and │ +│ │ │ │ process │ +│ │ │ │ innovations that │ +│ │ │ │ are not in press │ +│ │ │ │ releases. │ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ arxiv │ gate-all-around │ Academic │ +│ │ │ nanosheet │ literature on GAA │ +│ │ │ transistor │ nanosheet │ +│ │ │ silicon channel │ fabrication may │ +│ │ │ thickness │ reveal typical │ +│ │ │ variation │ tolerance ranges │ +│ │ │ tolerance 2nm │ used at the 2nm │ +│ │ │ │ class node even │ +│ │ │ │ if not │ +│ │ │ │ TSMC-specific. 
│ +├──────────────────┼───────────────────┼───────────────────┼───────────────────┤ +│ related_research │ database │ TechInsights TSMC │ TechInsights │ +│ │ │ N2 teardown │ performs physical │ +│ │ │ materials │ reverse │ +│ │ │ analysis 2025 │ engineering of │ +│ │ │ │ chips and may │ +│ │ │ │ have detailed N2 │ +│ │ │ │ materials │ +│ │ │ │ analysis │ +│ │ │ │ available through │ +│ │ │ │ their │ +│ │ │ │ subscription │ +│ │ │ │ service. │ +└──────────────────┴───────────────────┴───────────────────┴───────────────────┘ + Open Questions +┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Priority ┃ Question ┃ Context ┃ +┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ +│ high │ What specific high-k dielectric │ Public sources confirm GAA │ +│ │ and metal gate materials does │ nanosheet architecture but do │ +│ │ TSMC use in the N2 GAA │ not specify gate dielectric │ +│ │ nanosheet gate stack? │ (e.g., HfO2 variants) or work │ +│ │ │ function metal compositions │ +│ │ │ used to achieve threshold │ +│ │ │ voltage tuning. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ Has TSMC adopted ruthenium or │ At 20nm metal pitch, │ +│ │ other alternative metals for │ traditional TaN/Ta/Cu stacks │ +│ │ BEOL interconnect liners in N2 │ face resistance issues; Intel │ +│ │ to reduce resistance at tight │ and others have explored Mo and │ +│ │ pitches? │ Ru. TSMC's specific choice for │ +│ │ │ N2 BEOL is not disclosed in │ +│ │ │ public sources. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ high │ What is the actual silicon │ GAA nanosheet devices typically │ +│ │ nanosheet thickness and stack │ stack 3-4 nanosheets; TSMC has │ +│ │ count in TSMC's N2 process? │ not publicly specified │ +│ │ │ nanosheet dimensions or stack │ +│ │ │ count for N2. 
│ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ How does TSMC's N2 defect │ A LinkedIn post references │ +│ │ density compare quantitatively │ Tom's Hardware reporting that │ +│ │ to N3 at equivalent production │ TSMC disclosed N2 defect │ +│ │ maturity? │ density is lower than N3 at the │ +│ │ │ same stage of development, but │ +│ │ │ specific numbers were not found │ +│ │ │ in the gathered sources. │ +├──────────┼─────────────────────────────────┼─────────────────────────────────┤ +│ medium │ Will TSMC's N2P (enhanced N2) │ Sources mention N2P is a 5% │ +│ │ node incorporate backside power │ speed-enhanced version of N2 │ +│ │ delivery network (BSPDN), and │ targeting qualification │ +│ │ what materials/process changes │ completion; the SemiAnalysis │ +│ │ does that entail? │ report discusses BSPDN as a key │ +│ │ │ innovation at 2nm class nodes, │ +│ │ │ and its material implications │ +│ │ │ differ significantly. │ +└──────────┴─────────────────────────────────┴─────────────────────────────────┘ +╭───────────────────────────────── Confidence ─────────────────────────────────╮ +│ Overall: 0.42 │ +│ Corroborating sources: 9 │ +│ Source authority: medium │ +│ Contradiction detected: False │ +│ Query specificity match: 0.30 │ +│ Budget status: spent │ +│ Recency: current │ +╰──────────────────────────────────────────────────────────────────────────────╯ +╭──────────────────────────────────── Cost ────────────────────────────────────╮ +│ Tokens: 62620 │ +│ Iterations: 3 │ +│ Wall time: 105.63s │ +│ Model: claude-sonnet-4-6 │ +╰──────────────────────────────────────────────────────────────────────────────╯ + +trace_id: a4bb5b7a-61dd-446b-8c06-06c78de5fef7 diff --git a/scripts/calibration_collect.py b/scripts/calibration_collect.py new file mode 100644 index 0000000..8270669 --- /dev/null +++ b/scripts/calibration_collect.py @@ -0,0 +1,225 @@ +"""scripts/calibration_collect.py + +M3.3 Phase A: load every persisted 
ResearchResult under +~/.marchwarden/traces/*.result.json and emit a markdown rating worksheet +to docs/stress-tests/M3.3-rating-worksheet.md. + +The worksheet has one row per run with the model's self-reported confidence +and a blank `actual_rating` column for human review (Phase B). After rating +is complete, scripts/calibration_analyze.py (Phase C) will load the same +file with the rating column populated and compute calibration error. + +Usage: + .venv/bin/python scripts/calibration_collect.py + +Optional env: + TRACE_DIR — override default ~/.marchwarden/traces + OUT — override default docs/stress-tests/M3.3-rating-worksheet.md +""" + +from __future__ import annotations + +import json +import os +import sys +from pathlib import Path + +REPO_ROOT = Path(__file__).resolve().parent.parent +sys.path.insert(0, str(REPO_ROOT)) + +from researchers.web.models import ResearchResult # noqa: E402 + + +def _load_results(trace_dir: Path) -> list[tuple[Path, ResearchResult]]: + """Load every .result.json under trace_dir, sorted by mtime.""" + files = sorted(trace_dir.glob("*.result.json"), key=lambda p: p.stat().st_mtime) + out: list[tuple[Path, ResearchResult]] = [] + for f in files: + try: + result = ResearchResult.model_validate_json(f.read_text(encoding="utf-8")) + except Exception as exc: + print(f"warning: skipping {f.name}: {exc}", file=sys.stderr) + continue + out.append((f, result)) + return out + + +def _gap_summary(result: ResearchResult) -> str: + """Render gap categories with counts, e.g. 'source_not_found(2), scope_exceeded(1)'.""" + if not result.gaps: + return "—" + counts: dict[str, int] = {} + for g in result.gaps: + cat = g.category.value if hasattr(g.category, "value") else str(g.category) + counts[cat] = counts.get(cat, 0) + 1 + return ", ".join(f"{k}({v})" for k, v in sorted(counts.items())) + + +def _category_map(runs_dir: Path) -> dict[str, str]: + """Map trace_id -> category by parsing scripts/calibration_runner.sh log files. 
+
+    Each log file is named like ``01-factual.log`` and contains a final
+    ``trace_id: `` line emitted by the CLI.
+    """
+    out: dict[str, str] = {}
+    if not runs_dir.exists():
+        return out
+    for log in runs_dir.glob("*.log"):
+        # filename format: NN-category.log
+        stem = log.stem
+        parts = stem.split("-", 1)
+        if len(parts) != 2:
+            continue
+        category = parts[1]
+        try:
+            text = log.read_text(encoding="utf-8")
+        except Exception:
+            continue
+        # Find the last "trace_id: " line
+        trace_id = None
+        for line in text.splitlines():
+            if "trace_id:" in line:
+                # Strip ANSI / rich markup if present
+                token = line.split("trace_id:")[-1].strip()
+                # Take only the UUID portion
+                token = token.split()[0] if token else ""
+                # Strip any surrounding rich markup
+                token = token.replace("[/dim]", "").replace("[dim]", "")
+                if token:
+                    trace_id = token
+        if trace_id:
+            out[trace_id] = category
+    return out
+
+
+def _question_from_trace(trace_dir: Path, trace_id: str) -> str:
+    """Recover the original question from the trace JSONL's `start` event."""
+    jsonl = trace_dir / f"{trace_id}.jsonl"
+    if not jsonl.exists():
+        return "(question not recoverable — trace missing)"
+    try:
+        for line in jsonl.read_text(encoding="utf-8").splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            entry = json.loads(line)
+            if entry.get("action") == "start":
+                return entry.get("question", "(no question field)")
+    except Exception as exc:
+        return f"(parse error: {exc})"
+    return "(no start event)"
+
+
+def _build_worksheet(
+    rows: list[tuple[Path, ResearchResult]],
+    trace_dir: Path,
+    category_map: dict[str, str],
+) -> str:
+    """Render the markdown worksheet."""
+    lines: list[str] = []
+    lines.append("# M3.3 Calibration Rating Worksheet")
+    lines.append("")
+    lines.append("Issue: #46 (Phase B — human rating)")
+    lines.append("")
+    lines.append(
+        "## How to use this worksheet"
+    )
+    lines.append("")
+    lines.append(
+        "For each run below, read the answer + citations from the persisted "
+        "result file (path listed under **Result files** below). Score the answer's "
+        "*actual* correctness on a 0.0–1.0 scale, **independent** of the "
+        "model's self-reported confidence. Fill in the **actual_rating** "
+        "column. Add notes in the **notes** column for anything unusual."
+    )
+    lines.append("")
+    lines.append("Rating rubric:")
+    lines.append("")
+    lines.append("- **1.0** — Answer is fully correct, well-supported by cited sources, no material gaps or hallucinations.")
+    lines.append("- **0.8** — Mostly correct; minor inaccuracies or omissions that don't change the substance.")
+    lines.append("- **0.6** — Substantively right but with notable errors, missing context, or weak citations.")
+    lines.append("- **0.4** — Mixed: some right, some wrong; or right answer for wrong reasons.")
+    lines.append("- **0.2** — Mostly wrong, misleading, or hallucinated despite confident framing.")
+    lines.append("- **0.0** — Completely wrong, fabricated, or refuses to answer a tractable question.")
+    lines.append("")
+    lines.append("After rating all rows, save this file and run:")
+    lines.append("")
+    lines.append("```")
+    lines.append(".venv/bin/python scripts/calibration_analyze.py")
+    lines.append("```")
+    lines.append("")
+    lines.append(f"## Runs ({len(rows)} total)")
+    lines.append("")
+    lines.append(
+        "| # | trace_id | category | question | model_conf | corrob | authority | contradiction | budget | recency | gaps | citations | discoveries | tokens | actual_rating | notes |"
+    )
+    lines.append(
+        "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|"
+    )
+
+    for i, (path, result) in enumerate(rows, 1):
+        cf = result.confidence_factors
+        cm = result.cost_metadata
+        question = _question_from_trace(trace_dir, result.trace_id).replace("|", "\\|")
+        # Truncate long questions for table readability
+        if len(question) > 80:
+            question = question[:77] + "..."
+        gaps = _gap_summary(result).replace("|", "\\|")
+        contradiction = "yes" if cf.contradiction_detected else "no"
+        budget = "spent" if cf.budget_exhausted else "under"
+        recency = cf.recency or "—"
+        category = category_map.get(result.trace_id, "ad-hoc")
+        lines.append(
+            f"| {i} "
+            f"| `{result.trace_id[:8]}` "
+            f"| {category} "
+            f"| {question} "
+            f"| {result.confidence:.2f} "
+            f"| {cf.num_corroborating_sources} "
+            f"| {cf.source_authority} "
+            f"| {contradiction} "
+            f"| {budget} "
+            f"| {recency} "
+            f"| {gaps} "
+            f"| {len(result.citations)} "
+            f"| {len(result.discovery_events)} "
+            f"| {cm.tokens_used} "
+            f"| "
+            f"| |"
+        )
+
+    lines.append("")
+    lines.append("## Result files (full content for review)")
+    lines.append("")
+    for i, (path, result) in enumerate(rows, 1):
+        lines.append(f"{i}. `{path}`")
+    lines.append("")
+    return "\n".join(lines)
+
+
+def main() -> int:
+    trace_dir = Path(
+        os.environ.get("TRACE_DIR", os.path.expanduser("~/.marchwarden/traces"))
+    )
+    out_path = Path(
+        os.environ.get("OUT", REPO_ROOT / "docs/stress-tests/M3.3-rating-worksheet.md")
+    )
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+
+    rows = _load_results(trace_dir)
+    if not rows:
+        print(f"No result files found under {trace_dir}", file=sys.stderr)
+        return 1
+
+    runs_dir = REPO_ROOT / "docs/stress-tests/M3.3-runs"
+    category_map = _category_map(runs_dir)
+
+    out_path.write_text(
+        _build_worksheet(rows, trace_dir, category_map), encoding="utf-8"
+    )
+    print(f"Wrote {len(rows)}-row worksheet to {out_path}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/calibration_runner.sh b/scripts/calibration_runner.sh
new file mode 100755
index 0000000..d936cc9
--- /dev/null
+++ b/scripts/calibration_runner.sh
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+# scripts/calibration_runner.sh
+#
+# M3.3 Phase A: run a fixed set of 20 balanced-depth calibration queries.
+# Each run writes a trace JSONL and a result.json under ~/.marchwarden/traces/.
+# This script is stateless, not idempotent: it tracks no prior runs, so re-running
+# it will produce 20 NEW traces. Don't re-run unless you want fresh data.
+#
+# Categories (5 each):
+# - factual: single verifiable answer
+# - comparative: X vs Y across some dimension
+# - contradiction-prone: contested topics, sources disagree
+# - scope-edge: niche, proprietary, or expert-only knowledge
+
+set -euo pipefail
+
+cd "$(dirname "$0")/.."
+
+PY=".venv/bin/python"
+LOG_DIR="docs/stress-tests/M3.3-runs"
+mkdir -p "$LOG_DIR"
+
+declare -a QUERIES=(
+  # factual
+  "factual|01|What is the boiling point of liquid nitrogen at standard atmospheric pressure?"
+  "factual|02|When did the James Webb Space Telescope launch?"
+  "factual|03|What programming language is the Linux kernel primarily written in?"
+  "factual|04|What is the capital of Mongolia?"
+  "factual|05|How many amino acids are encoded by the standard genetic code?"
+  # comparative
+  "comparative|06|Compare the energy density of lithium-ion vs sodium-ion batteries."
+  "comparative|07|Compare PostgreSQL and SQLite for embedded analytics workloads."
+  "comparative|08|Compare CRISPR-Cas9 and CRISPR-Cas12 for in vivo gene editing."
+  "comparative|09|Compare React and Vue for large enterprise frontends in 2026."
+  "comparative|10|Compare wind and solar capacity factors in the continental United States."
+  # contradiction-prone
+  "contradiction|11|Is red wine good for cardiovascular health?"
+  "contradiction|12|Does intermittent fasting extend lifespan in humans?"
+  "contradiction|13|Are nuclear power plants safe?"
+  "contradiction|14|Is dietary cholesterol harmful?"
+  "contradiction|15|Does screen time harm child development?"
+  # scope-edge
+  "scope|16|What proprietary indexing strategies do high-frequency trading firms use for order book reconstruction?"
+  "scope|17|What is the actual operational doctrine of Chinese DF-41 ICBM brigades?"
+  "scope|18|What internal compensation bands does Goldman Sachs use for VPs in 2026?"
+  "scope|19|How does Renaissance Technologies Medallion Fund actually generate alpha?"
+  "scope|20|What are the precise materials and tolerances in TSMC's 2nm process?"
+)
+
+echo "Running ${#QUERIES[@]} calibration queries at depth=balanced..."
+echo "Output dir: $LOG_DIR"
+echo
+
+for entry in "${QUERIES[@]}"; do
+  IFS='|' read -r category num question <<<"$entry"
+  log_file="$LOG_DIR/${num}-${category}.log"
+  echo "[$num/$category] $question"
+  if "$PY" -m cli.main ask "$question" --depth balanced >"$log_file" 2>&1; then
+    trace_id=$(grep -oE 'trace_id: [a-f0-9-]+' "$log_file" | tail -1 | awk '{print $2}' || true)
+    echo " -> $trace_id"
+  else
+    echo " !! FAILED — see $log_file"
+  fi
+done
+
+echo
+echo "Done. Result files at ~/.marchwarden/traces/*.result.json"
-- 
2.45.2
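
Note on Phase C (not part of this patch): the collector's docstring says scripts/calibration_analyze.py will read the rated worksheet and compute calibration error, but that script is not included here. The following is a minimal sketch of what that step could look like, assuming mean absolute error between model_conf and actual_rating as the metric (the patch does not name one). The function names parse_rated_rows and mean_absolute_calibration_error are hypothetical; only the 16-column row layout and the `\|` escaping are taken from _build_worksheet in this patch.

```python
import re

# Column positions in the worksheet table emitted by _build_worksheet.
MODEL_CONF_COL = 4
ACTUAL_RATING_COL = 14
EXPECTED_COLS = 16


def parse_rated_rows(markdown: str) -> list[tuple[float, float]]:
    """Return (model_conf, actual_rating) pairs for rows a human has rated."""
    pairs: list[tuple[float, float]] = []
    for line in markdown.splitlines():
        # Data rows start with "| <number> |"; header and separator rows do not.
        if not re.match(r"^\|\s*\d+\s*\|", line):
            continue
        # Split on unescaped pipes only: the collector writes "\|" for any
        # literal pipe inside the question or gaps cells.
        cells = [c.strip() for c in re.split(r"(?<!\\)\|", line)[1:-1]]
        if len(cells) != EXPECTED_COLS:
            continue
        actual = cells[ACTUAL_RATING_COL]
        if not actual:
            continue  # row not rated yet
        pairs.append((float(cells[MODEL_CONF_COL]), float(actual)))
    return pairs


def mean_absolute_calibration_error(pairs: list[tuple[float, float]]) -> float:
    """Average |model_conf - actual_rating| over rated rows (assumed metric)."""
    if not pairs:
        raise ValueError("no rated rows found")
    return sum(abs(conf - actual) for conf, actual in pairs) / len(pairs)
```

Rows whose actual_rating cell is still blank are skipped, so the sketch can be run against a partially rated worksheet during Phase B.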