docs(stress-tests): archive M3.1 results
Single-axis stress test results from Issue #44. 1 of 4 queries cleanly hit its targets (Q3); Q1 and Q2 missed because the queries weren't adversarial enough; Q4 missed because of the budget cap lag bug filed as #53. The trace observability gap blocking M3.2/M3.3 is filed as #54.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent d279c4c20e
commit a39407f03e
1 changed file with 60 additions and 0 deletions
docs/stress-tests/M3.1-results.md (normal file)
@@ -0,0 +1,60 @@
# M3.1 Stress Test Results

- Issue: #44 (closed)
- Date: 2026-04-08
- Branch: `feat/m3.1-stress-tests`

## Summary

| Query | Targets | Result |
|---|---|---|
| Q1 | SOURCE_NOT_FOUND, recency | Both missed (query not adversarial enough) |
| Q2 | CONTRADICTORY_SOURCES, contradiction_detected | Both missed (consensus too strong) |
| Q3 | SCOPE_EXCEEDED, discovery_events | Both hit |
| Q4 | BUDGET_EXHAUSTED, budget_exhausted | Both missed (real bug, see #53) |

Follow-up issues filed: #53 (budget cap lag) and #54 (trace observability: the full result is not persisted).

## Q1: "What AI models were released in Q1 2026?"

Targets: SOURCE_NOT_FOUND gap, recency factor

- trace_id: 8472f9a2-e712-4b9f-ac9f-5b736c343831
- confidence: 0.82
- confidence_factors: corroborating_sources=6, authority=medium, contradiction=False, specificity=0.85, budget=spent, recency=current
- cost: 53134 tokens, 3 iters, 93s
- gaps: 5 fired; categories are not recoverable because the run was not tee'd and the trace persists only counts (see #54)
- **TARGET MISS:** SOURCE_NOT_FOUND was not triggered (6 sources found), and recency=current rather than stale. Q1 2026 had already ended by the run date (2026-04-08) and sources were easy to find, so scarcity never arose. Triggering this gap needs a future-dated or genuinely obscure topic.
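
The unrecoverable gap categories above are the kind of loss #54 describes. As a minimal sketch of one way to avoid it, assuming a hypothetical `TraceRecord` type (the names `TraceRecord`, `gap_categories`, and `load_gap_categories` are illustrative and not taken from the project), a trace could persist the per-category breakdown next to the count. The sample values are the Q2 gaps recorded in this file, since Q1's categories were lost.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    """Hypothetical trace shape; the real schema is not shown in this document."""
    trace_id: str
    confidence: float
    gaps: list[str] = field(default_factory=list)  # one entry per gap fired

    def to_json(self) -> str:
        payload = asdict(self)
        # Persist both the count and the category breakdown.
        payload["gap_count"] = len(self.gaps)
        payload["gap_categories"] = dict(Counter(self.gaps))
        return json.dumps(payload, sort_keys=True)


def load_gap_categories(serialized: str) -> dict[str, int]:
    """Recover the category breakdown from a persisted trace line."""
    return json.loads(serialized)["gap_categories"]


# Q2 values from this file: scope_exceeded(1), source_not_found(2).
record = TraceRecord(
    trace_id="22597d75-f1b2-44ae-8d7e-f4ea3423f46b",
    confidence=0.91,
    gaps=["scope_exceeded", "source_not_found", "source_not_found"],
)
restored = load_gap_categories(record.to_json())
```

With a record like this, a later reader could answer "which gaps fired?" from the trace alone, without relying on captured terminal output.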

## Q2: "Is coffee good or bad for you?"

Targets: CONTRADICTORY_SOURCES gap, contradiction_detected factor

- trace_id: 22597d75-f1b2-44ae-8d7e-f4ea3423f46b
- confidence: 0.91
- confidence_factors: corroborating=10, authority=high, contradiction=False, specificity=0.88, budget=spent, recency=current
- cost: 53567 tokens, 3 iters, 80s
- gaps: scope_exceeded(1), source_not_found(2); total 3
- discovery_events: 4 (arxiv + database refs)
- **TARGET MISS:** CONTRADICTORY_SOURCES was not surfaced; contradiction_detected=False. The agent synthesized a coherent "benefits with caveats" answer rather than recognizing genuine contradictions. The query is too easy: modern consensus wins out.

## Q3: "Compare CRISPR delivery mechanisms in recent clinical trials"

Targets: SCOPE_EXCEEDED gap, discovery_events populated

- trace_id: 05e54df5-edbd-40ac-b1d0-ae16cebade60
- confidence: 0.82
- confidence_factors: corroborating=9, authority=high, contradiction=False, specificity=0.80, budget=spent, recency=current
- cost: 51710 tokens, 3 iters, 109s
- gaps: source_not_found(2), scope_exceeded(1+); multiple fired
- discovery_events: 4 (suggesting an arxiv researcher for delivery-mechanism deep dives)
- **HIT BOTH TARGETS:** The scope_exceeded gap surfaced, and discovery_events was populated with arxiv researcher suggestions.
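
The hit and miss verdicts for Q2 and Q3 follow the same rule: the target gap must appear among the gaps fired, and the paired signal must be set. A minimal sketch of that rule, assuming a hypothetical `RunObservation` container and `gap_hit` helper (neither is the project's actual API), with values copied from the Q2 and Q3 entries above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunObservation:
    """Hypothetical summary of one run, filled in from the bullets above."""
    gap_counts: dict
    contradiction_detected: bool
    discovery_events: int


def gap_hit(obs: RunObservation, target_gap: str) -> bool:
    """A gap target counts as hit if that gap category fired at least once."""
    return obs.gap_counts.get(target_gap, 0) > 0


q2 = RunObservation(
    gap_counts={"scope_exceeded": 1, "source_not_found": 2},
    contradiction_detected=False,
    discovery_events=4,
)
q3 = RunObservation(
    # scope_exceeded is recorded as "1+"; 1 is used here as the lower bound.
    gap_counts={"source_not_found": 2, "scope_exceeded": 1},
    contradiction_detected=False,
    discovery_events=4,
)

# Q2 targets: CONTRADICTORY_SOURCES gap and contradiction_detected factor.
q2_hit = gap_hit(q2, "contradictory_sources") and q2.contradiction_detected
# Q3 targets: SCOPE_EXCEEDED gap and populated discovery_events.
q3_hit = gap_hit(q3, "scope_exceeded") and q3.discovery_events > 0
```

Encoding the verdict this way would make the Summary table reproducible from persisted traces once #54 is resolved.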

## Q4: "Comprehensive history of AI 1950 to 2026" (`--budget 5000 --max-iterations 2`)

Targets: BUDGET_EXHAUSTED gap, budget_exhausted factor

- trace_id: 38235720-6efc-4d7d-b284-6e21b1c83d46
- confidence: 0.87
- confidence_factors: corroborating=8, authority=high, contradiction=False, specificity=0.88, **budget=under cap**, recency=current
- cost: **29304 tokens (5.8x over the 5000 budget)**, 2 iters (cap respected), 78s
- gaps: scope_exceeded(1), access_denied(2), source_not_found(1); total 4. **No budget_exhausted gap.**
- **TARGET MISS:** BUDGET_EXHAUSTED was not surfaced. budget_exhausted=False despite the 5.8x overrun.
- **BUG (real):** Budget enforcement lag (see #53). The loop check uses a stale `total_tokens`, which is updated only after a model call. Iter 1's input is tiny, so the check passes; iter 2's huge input then pushes the loop total to 10606 (2.1x the cap), and the loop exits naturally. Synthesis adds ~19k more tokens (uncapped by design).
- Trace evidence: iter1 tokens_so_far=0 → iter2 tokens_so_far=1145 → synthesis tokens_used=10606 → final 29304.
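
The lag described above can be reproduced with the trace numbers alone. The following is a simulation, not the project's code: the function names are hypothetical, the per-iteration costs are derived by subtracting consecutive trace values, and the "projected" variant assumes an estimate of the next call's cost is available before the call, which the real loop may not have.

```python
BUDGET = 5000
# Derived from the trace: 1145 - 0 for iter 1, 10606 - 1145 for iter 2.
ITER_COSTS = [1145, 9461]
SYNTHESIS_COST = 29304 - 10606  # tokens added after the loop


def run_loop_stale_check(costs, budget, max_iterations):
    """Check only tokens already spent before each call (the lagging check)."""
    total = 0
    exhausted = False
    for cost in costs[:max_iterations]:
        if total >= budget:
            exhausted = True
            break
        total += cost  # the total is updated only after the model call
    return total, exhausted


def run_loop_projected_check(costs, budget, max_iterations):
    """Hypothetical alternative: refuse a call whose estimated cost would cross the cap."""
    total = 0
    exhausted = False
    for cost in costs[:max_iterations]:
        if total + cost > budget:
            exhausted = True
            break
        total += cost
    return total, exhausted


stale = run_loop_stale_check(ITER_COSTS, BUDGET, max_iterations=2)
projected = run_loop_projected_check(ITER_COSTS, BUDGET, max_iterations=2)
loop_overrun = round(stale[0] / BUDGET, 1)
```

Under these assumptions the stale check ends the loop at 10606 tokens without ever flagging exhaustion, while the projected check stops before iter 2 at 1145 tokens and raises the flag. Whether #53 is fixed this way or with a post-call check is a design choice for that issue.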