M3.1 Single-axis stress tests #44
Reference: archeious/marchwarden#44
Phase 3 — Stress Testing & Calibration, milestone 1.
Goal
Run four targeted queries that each exercise one specific contract feature in isolation. This verifies that the contract surfaces what it is supposed to surface.
Test cases
SOURCE_NOT_FOUND gap, recency factor
CONTRADICTORY_SOURCES gap, contradiction_detected factor
SCOPE_EXCEEDED gap, populated discovery_events (suggesting an arxiv researcher)
--budget 5000 --max-iterations 2: BUDGET_EXHAUSTED gap, budget_exhausted=True factor
Deliverable
~/.marchwarden/traces/
M3.1 Results
Four queries run on feat/m3.1-stress-tests against current main. Default depth=balanced unless noted.
Trace IDs:
8472f9a2-e712-4b9f-ac9f-5b736c343831
22597d75-f1b2-44ae-8d7e-f4ea3423f46b
05e54df5-edbd-40ac-b1d0-ae16cebade60
38235720-6efc-4d7d-b284-6e21b1c83d46
Q1: "What AI models were released in Q1 2026?"
Q2: "Is coffee good or bad for you?"
Q3: "Compare CRISPR delivery mechanisms in recent clinical trials"
(… arxiv researcher for delivery deep-dives)
Q4: "Comprehensive history of AI 1950 to 2026"
--budget 5000 --max-iterations 2
… total_tokens, so iter 1's tiny input lets iter 2 run, iter 2's huge input pushes the loop total to 10606 (2.1x cap), and the loop exits naturally because iterations==max. Synthesis adds another ~19k uncapped, which is by design.
Findings & follow-ups
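The Q4 overshoot can be illustrated with a small sketch. It assumes the budget is compared against the running total_tokens before each iteration starts and never during one; that ordering is an assumption drawn from the description of Q4, not a copy of the marchwarden loop. The function run_loop, the LoopResult type, and the 1200/9406 per-iteration split are hypothetical; only the 5000 cap, the 2-iteration limit, and the 10606 total come from the run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoopResult:
    iterations: int
    total_tokens: int
    budget_exhausted: bool


def run_loop(iteration_costs: List[int], budget: int, max_iterations: int) -> LoopResult:
    """Hypothetical research loop: the budget is checked only before an iteration."""
    total_tokens = 0
    iterations = 0
    budget_exhausted = False
    for cost in iteration_costs:
        if iterations >= max_iterations:
            break  # natural exit: iteration limit reached
        if total_tokens >= budget:
            budget_exhausted = True  # budget exit: only seen between iterations
            break
        total_tokens += cost  # an iteration can overshoot the cap freely
        iterations += 1
    return LoopResult(iterations, total_tokens, budget_exhausted)


# Illustrative split of the Q4 loop total (assumed values summing to 10606).
q4 = run_loop([1200, 9406], budget=5000, max_iterations=2)

# Contrast case: a large first iteration trips the check before iteration 2.
big_first = run_loop([6000, 100], budget=5000, max_iterations=2)
```

Under these assumptions q4 finishes both iterations at 10606 tokens with budget_exhausted still False, while big_first stops after one iteration with the flag set. A query meant to exercise the budget path would therefore need either a first iteration that already exceeds the cap or more iterations than it takes to cross it.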
Bugs filed:
Trace observability gap (worth tracking, not yet filed): the JSONL trace persists only gap_count/discovery_count, not the actual gap categories or the full ResearchResult. Replaying a trace can't tell you which gaps fired, only how many. For stress testing and calibration we need to either (a) persist the structured result alongside the trace, or (b) emit per-gap log lines as they're produced. Recommend filing for M3.2.
Stress test design lessons (for M3.2):
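As a rough illustration of option (b) from the trace observability gap above, the sketch below writes one JSONL record per gap so that a replay can recover which categories fired. The record fields (event, index, category, detail), the helper names, and the trace id "example-trace" are all hypothetical; the real marchwarden trace schema is not shown in this issue, so treat this as one possible shape, not the existing format.

```python
import io
import json
from typing import Dict, Iterable, List, TextIO


def emit_gap_events(trace_id: str, gaps: Iterable[Dict[str, str]], stream: TextIO) -> None:
    """Write one JSON line per gap (hypothetical record layout)."""
    for index, gap in enumerate(gaps):
        record = {
            "trace_id": trace_id,
            "event": "gap",
            "index": index,
            "category": gap["category"],
            "detail": gap.get("detail", ""),
        }
        stream.write(json.dumps(record) + "\n")


def gap_categories(lines: Iterable[str]) -> List[str]:
    """Replay a JSONL trace and return the gap categories in order."""
    categories = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        if record.get("event") == "gap":
            categories.append(record["category"])
    return categories


# In-memory stand-in for a trace file; a real run would append to the trace on disk.
buffer = io.StringIO()
emit_gap_events(
    "example-trace",
    [
        {"category": "SOURCE_NOT_FOUND", "detail": "no sources for the requested period"},
        {"category": "SCOPE_EXCEEDED"},
    ],
    buffer,
)
replayed = gap_categories(buffer.getvalue().splitlines())
```

Here replayed holds the two category names in emission order, which is the information a bare gap_count cannot provide. Option (a), persisting the full structured result next to the trace, would carry more data but needs a serialization format for ResearchResult that this issue does not describe.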
Status
feat/m3.1-stress-tests has only the scratch file M3.1-scratch.md (uncommitted). No code changes; this was an execution-only milestone.