M3.1 Single-axis stress tests #44

Closed
opened 2026-04-08 23:23:57 +00:00 by claude-code · 1 comment
Collaborator

Phase 3 — Stress Testing & Calibration, milestone 1.

Goal

Run four targeted queries that each exercise one specific contract feature in isolation. This verifies that the contract surfaces what it is supposed to surface.

Test cases

| # | Query | Targets |
|---|---|---|
| 1 | "What AI models were released in Q1 2026?" | `SOURCE_NOT_FOUND` gap, `recency` factor |
| 2 | "Is coffee good or bad for you?" | `CONTRADICTORY_SOURCES` gap, `contradiction_detected` factor |
| 3 | "Compare CRISPR delivery mechanisms in recent clinical trials" | `SCOPE_EXCEEDED` gap, populated `discovery_events` (suggesting an arxiv researcher) |
| 4 | "Comprehensive history of AI 1950–2026" with `--budget 5000 --max-iterations 2` | `BUDGET_EXHAUSTED` gap, `budget_exhausted=True` factor |

Deliverable

  • Four trace files captured at ~/.marchwarden/traces/
  • Documented results in this issue: trace_id + key contract fields per run
  • Any contract gaps or surprises filed as new issues
archeious added this to the Phase 3: Stress Testing & Calibration milestone 2026-04-08 23:25:11 +00:00
Owner

M3.1 Results

Four queries run on feat/m3.1-stress-tests against current main. Default depth=balanced unless noted.

| # | Targets | Hit? | trace_id | Conf | Tokens | Iters |
|---|---|---|---|---|---|---|
| 1 | SOURCE_NOT_FOUND, recency | Both miss | `8472f9a2-e712-4b9f-ac9f-5b736c343831` | 0.82 | 53134 | 3 |
| 2 | CONTRADICTORY_SOURCES, contradiction_detected | Both miss | `22597d75-f1b2-44ae-8d7e-f4ea3423f46b` | 0.91 | 53567 | 3 |
| 3 | SCOPE_EXCEEDED, discovery_events | Both hit | `05e54df5-edbd-40ac-b1d0-ae16cebade60` | 0.82 | 51710 | 3 |
| 4 | BUDGET_EXHAUSTED, budget_exhausted | Both miss | `38235720-6efc-4d7d-b284-6e21b1c83d46` | 0.87 | 29304 | 2 |
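
For repeat runs, the "Hit?" column could be derived mechanically rather than by reading each result. The sketch below is one way to do that, not marchwarden's actual API: the result shape (`gaps`, `factors`, `discovery_events` as plain dict keys), the helper names, and the reading of the `recency` target as "anything other than current" are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical result shape (NOT the real ResearchResult):
#   {"gaps": [{"category": "..."}], "factors": {...}, "discovery_events": [...]}


def has_gap(result: dict, category: str) -> bool:
    """True if any gap in the result carries the given category (case-insensitive)."""
    wanted = category.lower()
    return any(g.get("category", "").lower() == wanted for g in result.get("gaps", []))


def factor(result: dict, name: str):
    """Look up a confidence factor by name, or None if it is absent."""
    return result.get("factors", {}).get(name)


# Per test case: (target gap category, predicate for the second target).
TARGETS: dict[int, tuple[str, Callable[[dict], bool]]] = {
    # Assumption: the recency target counts as hit when recency is not "current".
    1: ("SOURCE_NOT_FOUND", lambda r: factor(r, "recency") not in (None, "current")),
    2: ("CONTRADICTORY_SOURCES", lambda r: factor(r, "contradiction_detected") is True),
    3: ("SCOPE_EXCEEDED", lambda r: len(r.get("discovery_events", [])) > 0),
    4: ("BUDGET_EXHAUSTED", lambda r: factor(r, "budget_exhausted") is True),
}


def score(case_id: int, result: dict) -> str:
    """Summarise a run the same way the Hit? column does."""
    gap_category, secondary = TARGETS[case_id]
    hits = [has_gap(result, gap_category), secondary(result)]
    if all(hits):
        return "Both hit"
    if not any(hits):
        return "Both miss"
    return "Partial"


# Stand-in results modelled on the gap mixes reported for Q3 and Q2.
q3_like = {
    "gaps": [{"category": "source_not_found"}, {"category": "source_not_found"},
             {"category": "scope_exceeded"}],
    "factors": {},
    "discovery_events": [{}, {}, {}, {}],
}
q2_like = {
    "gaps": [{"category": "scope_exceeded"}, {"category": "source_not_found"},
             {"category": "source_not_found"}],
    "factors": {"contradiction_detected": False},
    "discovery_events": [],
}
print(score(3, q3_like))  # Both hit
print(score(2, q2_like))  # Both miss
```

A "Partial" outcome did not occur in this milestone, but keeping it distinct would help if a later run fires the gap without the matching factor.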

Q1 — "What AI models were released in Q1 2026?"

  • gaps: 5 surfaced (categories not inspected — see trace observability note below)
  • factors: corroborating=6, authority=medium, contradiction=False, recency=current
  • Miss: Q1 2026 isn't far enough out to starve sources. Agent found plenty. Recency reads "current," not stale. To trigger SOURCE_NOT_FOUND we need a genuinely obscure or far-future topic.

Q2 — "Is coffee good or bad for you?"

  • gaps: scope_exceeded(1), source_not_found(2)
  • factors: corroborating=10, authority=high, contradiction=False
  • Miss: Agent synthesized a coherent "benefits with caveats" narrative rather than recognizing genuine contradictions. The query is too easy: modern consensus wins cleanly. To force CONTRADICTORY_SOURCES we likely need a topic where reputable sources disagree on the headline claim itself (e.g., "does intermittent fasting extend lifespan in humans").

Q3 — "Compare CRISPR delivery mechanisms in recent clinical trials"

  • gaps: source_not_found(2), scope_exceeded(1+)
  • discovery_events: 4 (suggesting arxiv researcher for delivery deep-dives)
  • Both targets hit. Working as intended.

Q4 — "Comprehensive history of AI 1950 to 2026" --budget 5000 --max-iterations 2

  • gaps: scope_exceeded(1), access_denied(2), source_not_found(1)
  • factors: budget=under cap despite a 5.8x overrun (29304 total tokens against the 5000-token cap)
  • Miss is a real bug, filed as #53. TL;DR: the budget check at the iteration boundary uses a stale total_tokens. Iter 1's tiny input lets iter 2 start, iter 2's huge input pushes the loop total to 10606 (2.1x cap), and the loop then exits naturally because iterations==max, so budget_exhausted is never set. Synthesis adds another ~19k tokens uncapped, which is by design.
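
The failure mode can be reproduced in a few lines. This is a minimal sketch of the behaviour described above, not the actual marchwarden loop: `run_loop`, `LoopOutcome`, and the `check_after_last` switch are hypothetical names, and the per-iteration split (1200 + 9406) is invented for illustration. Only the 5000 cap, the 2-iteration limit, and the 10606 loop total come from the run.

```python
from dataclasses import dataclass


@dataclass
class LoopOutcome:
    total_tokens: int
    iterations: int
    budget_exhausted: bool


def run_loop(iteration_costs, budget, max_iterations, check_after_last=False):
    """Simulate a research loop whose budget check runs only at iteration boundaries."""
    total = 0
    iterations = 0
    exhausted = False
    for cost in iteration_costs[:max_iterations]:
        # Boundary check: sees only the tokens spent by earlier iterations,
        # so the cost of the iteration about to run is invisible here.
        if total >= budget:
            exhausted = True
            break
        total += cost
        iterations += 1
    # One possible remedy (an assumption, not necessarily the fix for #53):
    # re-check the budget once more after the final iteration.
    if check_after_last and total >= budget:
        exhausted = True
    return LoopOutcome(total, iterations, exhausted)


# Hypothetical split: a tiny first iteration, a huge second one.
costs = [1200, 9406]

as_observed = run_loop(costs, budget=5000, max_iterations=2)
print(as_observed)
# LoopOutcome(total_tokens=10606, iterations=2, budget_exhausted=False)

with_recheck = run_loop(costs, budget=5000, max_iterations=2, check_after_last=True)
print(with_recheck.budget_exhausted)  # True
```

With the boundary-only check, the flag can be set only if another iteration is attempted after the overrun, which never happens when the overrun lands on the last allowed iteration.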

Findings & follow-ups

Bugs filed:

  • #53 — Budget cap lags one iteration behind tool payload growth.

Trace observability gap (worth tracking, not yet filed): the JSONL trace persists only gap_count / discovery_count, not actual gap categories or full ResearchResult. Replaying a trace can't tell you which gaps fired — only how many. For stress testing and calibration we either need to (a) persist the structured result alongside the trace, or (b) emit per-gap log lines as they're produced. Recommend filing for M3.2.
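
Option (b) would make replay analysis straightforward. The sketch below assumes a hypothetical record shape, one JSON object per line with `"event": "gap"` and a `"category"` field; the current trace format does not contain these records, and the function name is illustrative. The sample lines mirror the gap mix reported for Q2.

```python
import json
from collections import Counter


def gap_categories(trace_lines):
    """Tally gap categories from JSONL trace lines (hypothetical per-gap records)."""
    counts = Counter()
    for line in trace_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        if record.get("event") == "gap":
            counts[record.get("category", "unknown")] += 1
    return counts


# Hypothetical trace excerpt; real traces currently carry only gap_count.
sample_trace = [
    '{"event": "iteration_start", "iteration": 1}',
    '{"event": "gap", "category": "scope_exceeded"}',
    '{"event": "gap", "category": "source_not_found"}',
    '{"event": "gap", "category": "source_not_found"}',
    '{"event": "result", "gap_count": 3}',
]

counts = gap_categories(sample_trace)
print(dict(counts))  # {'scope_exceeded': 1, 'source_not_found': 2}
```

The same function would work on a real file by passing the open file handle, since iterating a text file yields its lines.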

Stress test design lessons (for M3.2):

  • Three of four queries missed their targets. Two of them (Q1, Q2) missed because the query wasn't extreme enough; the third (Q4) missed because of bug #53. M3.2 query design should pick adversarial topics that force the target gap, not merely invite it.
  • Confidence stays high (0.82–0.91) even when target gaps fail to fire. Calibration work in M3.3 will need to explicitly probe whether high confidence is justified given the gap mix.

Status

  • Branch feat/m3.1-stress-tests has only the scratch file M3.1-scratch.md (uncommitted). No code changes — execution-only milestone.
  • Closing this issue.
Reference: archeious/marchwarden#44