marchwarden/docs/stress-tests/M3.2-results.md
Jeff Smith 0ddc1e6e37 docs(stress-tests): archive M3.2 multi-axis results
Single deep query against AWS Lambda vs Azure Functions for HFT
exercised 3 of 4 target axes simultaneously: recency, contradictions,
and budget pressure all fired in the same run. scope_exceeded miss is
soft (1 of 5 gaps was arguably miscategorized as source_not_found).

First in-the-wild observation of the `contradiction` discovery_event
type. Issue #45.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:34:27 -06:00


# M3.2 Multi-axis Stress Test Results
- Issue: #45 (closed)
- Date: 2026-04-08
- Branch: `feat/m3.2-multiaxis`
- Trace: `74a017bd-697b-4439-96b8-fe12057cf2e8`
## Query
> "Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."
Run with `--depth deep`. Default budget for deep = 60k tokens, max_iterations = 8.
## Scoring
| Axis | Target | Hit | Evidence |
|---|---|---|---|
| Recency | `recency` confidence factor fires | ✅ | `recency=current` |
| Contradictory sources | `contradiction_detected=True` | ✅ | True; bonus discovery_event of type `contradiction` |
| Scope exceeded | `scope_exceeded` gap | ❌ | 5 gaps, all `source_not_found` |
| Budget pressure | `budget_exhausted=True` | ✅ | True; 127692 tokens consumed (2.1x deep cap of 60k) |
3 of 4 axes hit cleanly in a single run. Multi-axis composition works.
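The per-axis checks above can be sketched as a small scoring pass over the persisted result. This is a sketch, not the harness's actual scoring code: field names (`confidence_factors`, `contradiction_detected`, `gaps`, `budget_exhausted`, `tokens_consumed`) are assumed from the evidence column, and the real schema lives in `researchers/web/models.py`.

```python
from collections import Counter

def score_axes(result: dict, deep_cap: int = 60_000) -> dict:
    """Evaluate the four M3.2 target axes against a result dict.

    Field names are assumptions mirroring the evidence column above,
    not a documented contract.
    """
    gap_types = Counter(g["type"] for g in result.get("gaps", []))
    return {
        "recency": result.get("confidence_factors", {}).get("recency") == "current",
        "contradiction": result.get("contradiction_detected") is True,
        "scope_exceeded": gap_types.get("scope_exceeded", 0) > 0,
        "budget_pressure": result.get("budget_exhausted") is True
                           or result.get("tokens_consumed", 0) > deep_cap,
    }

# Snapshot values from this run, reduced to the fields the checks read.
snapshot = {
    "confidence_factors": {"recency": "current"},
    "contradiction_detected": True,
    "gaps": [{"type": "source_not_found"}] * 5,
    "budget_exhausted": True,
    "tokens_consumed": 127_692,
}
axes = score_axes(snapshot)
print(sum(axes.values()), "of", len(axes), "axes hit")  # → 3 of 4 axes hit
```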
## Result snapshot
- confidence: **0.78**
- citations: 18 (corroborating sources)
- gaps: 5 (all `source_not_found`)
- discovery_events: 4 (`related_research` × 2, `new_source` × 1, **`contradiction` × 1** — first in-the-wild observation)
- open_questions: 5
- cost: 127692 tokens, 4 iterations, 168s
- confidence_factors: corroborating=18, authority=medium, contradiction=True, specificity=0.72, budget=spent, recency=current
## Findings
**1. Multi-axis composition validated.** A single deep query exercised recency + contradictions + budget pressure simultaneously. The contract handled all three without losing structure — confidence dropped appropriately (0.78 vs 0.82–0.91 in M3.1) and the right factors fired.
**2. New discovery event type observed.** This is the first time `contradiction` has fired as a `discovery_event.type` in any test (M3.1 only saw `related_research` and `new_source`). It's a documented type (`researchers/web/models.py:154`), so this is the contract working as designed — but worth noting for calibration in M3.3 that all three documented types are now reachable in practice.
**3. Scope-exceeded distinction is fuzzy in practice.** The agent surfaced 5 `source_not_found` gaps, not the targeted `scope_exceeded`. Re-reading the gap topics:
- Outage detail / SLA percentages / post-mortems → reasonable as `source_not_found` (could be on the web, just not gathered).
- HFT-specific cold-start benchmarks → genuinely `scope_exceeded` (HFT firms don't publish these — wrong researcher entirely).
So it's 1 of 5, not a clear bug. The agent's prompt could nudge sharper category assignment, but the misclassification is not severe enough to file independently. Worth revisiting in M3.3 calibration if confidence routinely overestimates due to gap miscategorization.
**4. Persisted result file made this analysis trivial.** This is the first stress test run after #54 shipped. Recovering all 5 gap categories, all 4 discovery types, and the full confidence_factors took one Python one-liner against `<trace_id>.result.json` instead of grepping rendered terminal output. M3.3 calibration work will be much faster as a result.
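That recovery step looks roughly like the following, assuming the persisted file is plain JSON with `gaps`, `discovery_events`, and `confidence_factors` keys (names mirroring the result snapshot above, not a confirmed schema):

```python
import json
from collections import Counter
from pathlib import Path

def summarize(result_path: str) -> dict:
    """Pull gap categories, discovery-event types, and confidence
    factors out of a persisted <trace_id>.result.json.

    The key names here are assumptions based on the M3.2 snapshot;
    the authoritative schema is researchers/web/models.py.
    """
    result = json.loads(Path(result_path).read_text())
    return {
        "gap_types": Counter(g["type"] for g in result["gaps"]),
        "discovery_types": Counter(d["type"] for d in result["discovery_events"]),
        "confidence_factors": result["confidence_factors"],
    }
```

Against this run's file, a summary like this is what surfaced the 5× `source_not_found` skew and the first `contradiction` discovery event without any grepping of rendered terminal output.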
## Follow-ups
- None blocking. The scope_exceeded vs source_not_found distinction may surface again in M3.3 calibration; if so, file it then.