archeious/marchwarden

M3.1 Single-axis stress tests #44

Closed

opened 2026-04-08 17:23:57 -06:00 by claude-code · 1 comment

claude-code commented

2026-04-08 17:23:57 -06:00

Collaborator

Phase 3 — Stress Testing & Calibration, milestone 1.

Goal

Run four targeted queries, each exercising one specific contract feature in isolation, to verify that the contract surfaces what it is supposed to surface.

Test cases

| # | Query | Targets |
|---|---|---|
| 1 | "What AI models were released in Q1 2026?" | `SOURCE_NOT_FOUND` gap, `recency` factor |
| 2 | "Is coffee good or bad for you?" | `CONTRADICTORY_SOURCES` gap, `contradiction_detected` factor |
| 3 | "Compare CRISPR delivery mechanisms in recent clinical trials" | `SCOPE_EXCEEDED` gap, populated `discovery_events` (suggesting an arxiv researcher) |
| 4 | "Comprehensive history of AI 1950–2026" with `--budget 5000 --max-iterations 2` | `BUDGET_EXHAUSTED` gap, `budget_exhausted=True` factor |
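The matrix above could be encoded as data so that each run is scored mechanically rather than by eye. This is a minimal sketch, not Marchwarden's actual API: it assumes a run's result is available as a plain dict with a `gaps` list of category strings, a `factors` dict, and a `discovery_events` list. The `StressCase` type, the `score` helper, and the "recency is anything other than current" predicate are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Result = Dict[str, Any]  # assumed shape: {"gaps": [...], "factors": {...}, "discovery_events": [...]}


@dataclass(frozen=True)
class StressCase:
    query: str
    target_gap: str                      # gap category the query is meant to trigger
    secondary: Callable[[Result], bool]  # second target from the table, as a predicate
    cli_args: Tuple[str, ...] = ()


def factor(result: Result, name: str) -> Any:
    return result.get("factors", {}).get(name)


CASES = {
    1: StressCase(
        "What AI models were released in Q1 2026?",
        "SOURCE_NOT_FOUND",
        # Assumption: the recency target counts as hit when recency is reported and not "current".
        lambda r: factor(r, "recency") not in (None, "current"),
    ),
    2: StressCase(
        "Is coffee good or bad for you?",
        "CONTRADICTORY_SOURCES",
        lambda r: factor(r, "contradiction_detected") is True,
    ),
    3: StressCase(
        "Compare CRISPR delivery mechanisms in recent clinical trials",
        "SCOPE_EXCEEDED",
        lambda r: len(r.get("discovery_events", [])) > 0,
    ),
    4: StressCase(
        "Comprehensive history of AI 1950–2026",
        "BUDGET_EXHAUSTED",
        lambda r: factor(r, "budget_exhausted") is True,
        cli_args=("--budget", "5000", "--max-iterations", "2"),
    ),
}


def score(case: StressCase, result: Result) -> Dict[str, bool]:
    """Report whether the target gap and the secondary target both surfaced."""
    gaps = {g.upper() for g in result.get("gaps", [])}
    return {
        "gap_hit": case.target_gap in gaps,
        "secondary_hit": bool(case.secondary(result)),
    }
```

Scoring against a dict keeps the check independent of how the result is obtained (CLI output, trace replay, or an in-process call).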

Deliverable

- Four trace files captured at `~/.marchwarden/traces/`
- Documented results in this issue: trace_id + key contract fields per run
- Any contract gaps or surprises filed as new issues

archeious added this to the Phase 3: Stress Testing & Calibration milestone 2026-04-08 17:25:11 -06:00

claude-code referenced this issue

2026-04-08 19:12:51 -06:00

Budget cap lags one iteration behind tool payload growth #53

archeious commented

2026-04-08 19:14:10 -06:00

Owner

M3.1 Results

Four queries run on feat/m3.1-stress-tests against current main. Default depth=balanced unless noted.

| # | Targets | Hit? | trace_id | Conf | Tokens | Iters |
|---|---|---|---|---|---|---|
| 1 | SOURCE_NOT_FOUND, recency | Both miss | `8472f9a2-e712-4b9f-ac9f-5b736c343831` | 0.82 | 53134 | 3 |
| 2 | CONTRADICTORY_SOURCES, contradiction_detected | Both miss | `22597d75-f1b2-44ae-8d7e-f4ea3423f46b` | 0.91 | 53567 | 3 |
| 3 | SCOPE_EXCEEDED, discovery_events | Both hit | `05e54df5-edbd-40ac-b1d0-ae16cebade60` | 0.82 | 51710 | 3 |
| 4 | BUDGET_EXHAUSTED, budget_exhausted | Both miss | `38235720-6efc-4d7d-b284-6e21b1c83d46` | 0.87 | 29304 | 2 |

Q1 — "What AI models were released in Q1 2026?"

- gaps: 5 surfaced (categories not inspected; see trace observability note below)
- factors: corroborating=6, authority=medium, contradiction=False, recency=**current**
- **Miss:** Q1 2026 isn't far enough out to starve sources. Agent found plenty. Recency reads "current," not stale. To trigger SOURCE_NOT_FOUND we need a genuinely obscure or far-future topic.

Q2 — "Is coffee good or bad for you?"

- gaps: scope_exceeded(1), source_not_found(2)
- factors: corroborating=10, authority=high, **contradiction=False**
- **Miss:** Agent synthesized a coherent "benefits with caveats" narrative rather than recognizing genuine contradictions. The query is too easy: modern consensus wins cleanly. To force CONTRADICTORY_SOURCES we likely need a topic where reputable sources disagree on the headline claim itself (e.g., "does intermittent fasting extend lifespan in humans").

Q3 — "Compare CRISPR delivery mechanisms in recent clinical trials"

- gaps: source_not_found(2), scope_exceeded(1+)
- discovery_events: 4 (suggesting `arxiv` researcher for delivery deep-dives)
- Both targets hit. Working as intended.

Q4 — "Comprehensive history of AI 1950 to 2026" `--budget 5000 --max-iterations 2`

- gaps: scope_exceeded(1), access_denied(2), source_not_found(1)
- factors: budget=**under cap** despite 5.8x overrun (29304 total tokens against a 5000 cap)
- **Miss is a real bug**, filed as #53. TL;DR: the budget check at the iteration boundary uses a stale `total_tokens`, so iteration 1's tiny input lets iteration 2 run, iteration 2's huge input pushes the loop total to 10606 (2.1x the cap), and the loop then exits naturally because iterations == max, so the check never runs again. Synthesis adds another ~19k tokens uncapped, which is by design.
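The lag described for Q4 can be reproduced with a toy loop. This is a sketch of the failure mode only, not the actual Marchwarden loop: `run_loop`, its parameters, and the per-iteration split of 600 and 10006 tokens are hypothetical (the issue reports only the loop total of 10606 and the 5000 cap). The `recheck_after` flag illustrates one possible mitigation and is not necessarily the fix chosen in #53.

```python
from typing import List, Tuple


def run_loop(
    iteration_costs: List[int],
    budget: int,
    max_iterations: int,
    recheck_after: bool = False,
) -> Tuple[int, bool]:
    """Return (total_tokens, budget_exhausted) for a simulated research loop."""
    total_tokens = 0
    budget_exhausted = False
    for cost in iteration_costs[:max_iterations]:
        # Boundary check: sees only tokens spent by iterations that already finished.
        if total_tokens >= budget:
            budget_exhausted = True
            break
        total_tokens += cost
        # Optional mitigation: re-check as soon as the iteration's cost is known.
        if recheck_after and total_tokens >= budget:
            budget_exhausted = True
            break
    return total_tokens, budget_exhausted


costs = [600, 10006]  # illustrative split; only the sum (10606) comes from the issue

total, flagged = run_loop(costs, budget=5000, max_iterations=2)
print(total, flagged, round(total / 5000, 1))  # 10606 False 2.1

total, flagged = run_loop(costs, budget=5000, max_iterations=2, recheck_after=True)
print(total, flagged)  # 10606 True
```

With the boundary check alone, the flag can only be set at the start of an iteration that never happens once `max_iterations` is reached, which matches the "under cap" reading in the trace.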

Findings & follow-ups

Bugs filed:

#53 — Budget cap lags one iteration behind tool payload growth.

Trace observability gap (worth tracking, not yet filed): the JSONL trace persists only `gap_count` / `discovery_count`, not the actual gap categories or the full ResearchResult. Replaying a trace can tell you only how many gaps fired, not which ones. For stress testing and calibration we need to either (a) persist the structured result alongside the trace, or (b) emit per-gap log lines as they're produced. Recommend filing for M3.2.
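Option (b) could look roughly like the sketch below, which writes one JSONL record per gap and then recovers per-category counts on replay. The record fields (`event`, `category`, `topic`) and both function names are assumptions for illustration; the existing trace schema is not shown in this issue.

```python
import io
import json
from collections import Counter
from typing import Iterable, TextIO


def emit_gap_events(trace: TextIO, trace_id: str, gaps: Iterable[dict]) -> None:
    """Append one JSONL line per gap so the category survives in the trace."""
    for gap in gaps:
        record = {
            "trace_id": trace_id,
            "event": "gap",  # hypothetical event name
            "category": gap["category"],
            "topic": gap.get("topic"),
        }
        trace.write(json.dumps(record) + "\n")


def replay_gap_categories(lines: Iterable[str]) -> Counter:
    """Count gaps per category from a JSONL trace, ignoring other event types."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("event") == "gap":
            counts[record["category"]] += 1
    return counts


# Usage with the gap mix reported for Q4 (an in-memory buffer stands in for the trace file).
buffer = io.StringIO()
q4_gaps = [
    {"category": "scope_exceeded"},
    {"category": "access_denied"},
    {"category": "access_denied"},
    {"category": "source_not_found"},
]
emit_gap_events(buffer, "38235720-6efc-4d7d-b284-6e21b1c83d46", q4_gaps)
print(replay_gap_categories(buffer.getvalue().splitlines()))
```

Per-gap lines keep the trace append-only, whereas option (a) would make the full structured result available at the cost of a second artifact per run.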

Stress test design lessons (for M3.2):

- Three of four test cases missed their targets. For Q1 and Q2 the query wasn't extreme enough; Q4's miss is the budget bug in #53. M3.2 query design should pick adversarial topics that *force* the target gap, not merely invite it.
- Confidence stays high (0.82–0.91) even when target gaps fail to fire. Calibration work in M3.3 will need to explicitly probe whether high confidence is justified given the gap mix.

Status

- Branch `feat/m3.1-stress-tests` has only the scratch file `M3.1-scratch.md` (uncommitted). No code changes; this was an execution-only milestone.
- Closing this issue.

claude-code closed this issue

2026-04-08 19:14:15 -06:00

claude-code referenced this issue

2026-04-08 19:21:00 -06:00

Persist full ResearchResult alongside trace (observability gap) #54

archeious referenced this issue from a commit

2026-04-08 19:21:36 -06:00

docs(stress-tests): archive M3.1 results

claude-code referenced this issue

2026-04-08 19:21:40 -06:00

docs(stress-tests): archive M3.1 results #55