docs(stress-tests): archive M3.2 multi-axis results #57
1 changed file with 50 additions and 0 deletions

docs/stress-tests/M3.2-results.md (new file)

# M3.2 Multi-axis Stress Test Results

- Issue: #45 (closed)
- Date: 2026-04-08
- Branch: `feat/m3.2-multiaxis`
- Trace: `74a017bd-697b-4439-96b8-fe12057cf2e8`

## Query

> "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."

Run with `--depth deep`. Default budget for deep = 60k tokens, max_iterations = 8.

## Scoring

| Axis | Target | Hit | Evidence |
|---|---|---|---|
| Recency | recency factor | ✅ | `recency=current` |
| Contradictory sources | `contradiction_detected=True` | ✅ | True; bonus discovery_event of type `contradiction` |
| Scope exceeded | `scope_exceeded` gap | ❌ | 5 gaps, all `source_not_found` |
| Budget pressure | `budget_exhausted=True` | ✅ | True; 127692 tokens consumed (2.1x deep cap of 60k) |

3 of 4 axes hit cleanly in a single run. Multi-axis composition works.
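
The axis check above can be re-derived mechanically from the persisted result. A minimal sketch, assuming the result JSON exposes the fields named in the Evidence column (`contradiction_detected`, `budget_exhausted`, typed `gaps`, `confidence_factors`); the exact schema is an assumption, not confirmed here:

```python
import json

# Hypothetical payload mirroring the evidence fields in the table above;
# the real layout of <trace_id>.result.json is assumed, not confirmed.
result = json.loads("""
{
  "contradiction_detected": true,
  "budget_exhausted": true,
  "gaps": [{"type": "source_not_found"}, {"type": "source_not_found"},
           {"type": "source_not_found"}, {"type": "source_not_found"},
           {"type": "source_not_found"}],
  "confidence_factors": {"recency": "current", "contradiction": true, "budget": "spent"}
}
""")

axes = {
    "recency": result["confidence_factors"]["recency"] == "current",
    "contradictory_sources": result["contradiction_detected"] is True,
    "scope_exceeded": any(g["type"] == "scope_exceeded" for g in result["gaps"]),
    "budget_pressure": result["budget_exhausted"] is True,
}
print(f"{sum(axes.values())} of {len(axes)} axes hit")  # → 3 of 4 axes hit
```

The `scope_exceeded` axis is the only one that fails, matching the ❌ row.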

## Result snapshot

- confidence: **0.78**
- citations: 18 (corroborating sources)
- gaps: 5 (all `source_not_found`)
- discovery_events: 4 (`related_research` × 2, `new_source` × 1, **`contradiction` × 1** — first in-the-wild observation)
- open_questions: 5
- cost: 127692 tokens, 4 iterations, 168s
- confidence_factors: corroborating=18, authority=medium, contradiction=True, specificity=0.72, budget=spent, recency=current

## Findings

**1. Multi-axis composition validated.** A single deep query exercised recency + contradictions + budget pressure simultaneously. The contract handled all three without losing structure — confidence dropped appropriately (0.78 vs 0.82–0.91 in M3.1) and the right factors fired.

**2. New discovery event type observed.** This is the first time `contradiction` has fired as a `discovery_event.type` in any test (M3.1 saw only `related_research` and `new_source`). It is a documented type (`researchers/web/models.py:154`), so this is the contract working as designed — but worth noting for M3.3 calibration that all three documented types are now reachable in practice.

**3. Scope-exceeded distinction is fuzzy in practice.** The agent surfaced 5 `source_not_found` gaps, not the targeted `scope_exceeded`. Re-reading the gap topics:

- Outage detail / SLA percentages / post-mortems → reasonable as `source_not_found` (could be on the web, just not gathered).
- HFT-specific cold-start benchmarks → genuinely `scope_exceeded` (HFT firms don't publish these — wrong researcher entirely).

So it's 1 of 5, not a clear bug. The agent's prompt could nudge sharper category assignment, but the misclassification is not severe enough to file independently. Worth revisiting in M3.3 calibration if confidence routinely overestimates due to gap miscategorization.

**4. Persisted result file made this analysis trivial.** This is the first stress test run since #54 shipped. Recovering all 5 gaps, all 4 discovery_events, and the full confidence_factors took one Python one-liner against `<trace_id>.result.json` instead of grepping rendered terminal output. M3.3 calibration work will be much faster as a result.
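
A sketch of the kind of query described above, assuming the persisted file is plain JSON with top-level `gaps`, `discovery_events`, and `confidence_factors` keys as in the snapshot (the path convention and exact schema are assumptions):

```python
import json
from collections import Counter

def summarize(path: str) -> dict:
    """Tally gap and discovery-event types and pull confidence_factors
    from a persisted result file (schema assumed from the snapshot above)."""
    with open(path) as f:
        r = json.load(f)
    return {
        "gaps": Counter(g["type"] for g in r.get("gaps", [])),
        "discovery_events": Counter(e["type"] for e in r.get("discovery_events", [])),
        "confidence_factors": r.get("confidence_factors", {}),
    }

# e.g. summarize("74a017bd-697b-4439-96b8-fe12057cf2e8.result.json")
```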

## Follow-ups

- None blocking. The scope_exceeded vs source_not_found distinction may surface again in M3.3 calibration; if so, file it then.