marchwarden/docs/stress-tests/M3.2-results.md
Jeff Smith 0ddc1e6e37 docs(stress-tests): archive M3.2 multi-axis results
Single deep query against AWS Lambda vs Azure Functions for HFT
exercised 3 of 4 target axes simultaneously: recency, contradictions,
and budget pressure all fired in the same run. scope_exceeded miss is
soft (1 of 5 gaps was arguably miscategorized as source_not_found).

First in-the-wild observation of the `contradiction` discovery_event
type. Issue #45.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:34:27 -06:00


M3.2 Multi-axis Stress Test Results

  • Issue: #45 (closed)
  • Date: 2026-04-08
  • Branch: feat/m3.2-multiaxis
  • Trace: 74a017bd-697b-4439-96b8-fe12057cf2e8

Query

"Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."

Run with --depth deep. Default budget for deep = 60k tokens, max_iterations = 8.

Scoring

| Axis | Target | Hit | Evidence |
|---|---|---|---|
| Recency | recency factor | Yes | recency=current |
| Contradictory sources | contradiction_detected=True | Yes | True; bonus discovery_event of type contradiction |
| Scope exceeded | scope_exceeded gap | No | 5 gaps, all source_not_found |
| Budget pressure | budget_exhausted=True | Yes | True; 127692 tokens consumed (2.1x deep cap of 60k) |
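The four axes can be scored mechanically against a parsed result dict. A minimal sketch, assuming field names that mirror the result snapshot in this report (gaps/events as lists of dicts with `category`/`type` keys) rather than the real result schema:

```python
# Sketch: score the four M3.2 stress-test axes against a parsed result dict.
# Field and value names are assumed from this report's snapshot, not taken
# from the actual result schema.

def score_axes(result: dict) -> dict[str, bool]:
    factors = result.get("confidence_factors", {})
    gaps = result.get("gaps", [])
    events = result.get("discovery_events", [])
    return {
        "recency": factors.get("recency") == "current",
        "contradiction": any(e.get("type") == "contradiction" for e in events),
        "scope_exceeded": any(g.get("category") == "scope_exceeded" for g in gaps),
        "budget_pressure": factors.get("budget") == "spent",
    }
```

Against this run's snapshot, scope_exceeded is the only axis that comes back False.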

3 of 4 axes hit cleanly in a single run. Multi-axis composition works.

Result snapshot

  • confidence: 0.78
  • citations: 18 (corroborating sources)
  • gaps: 5 (all source_not_found)
  • discovery_events: 4 (related_research × 2, new_source × 1, contradiction × 1 — first in-the-wild observation)
  • open_questions: 5
  • cost: 127692 tokens, 4 iterations, 168s
  • confidence_factors: corroborating=18, authority=medium, contradiction=True, specificity=0.72, budget=spent, recency=current

Findings

1. Multi-axis composition validated. A single deep query exercised recency + contradictions + budget pressure simultaneously. The contract handled all three without losing structure — confidence dropped appropriately (0.78 vs 0.82–0.91 in M3.1) and the right factors fired.

2. New discovery event type observed. This is the first time contradiction has fired as a discovery_event.type in any test (M3.1 only saw related_research and new_source). It's a documented type (researchers/web/models.py:154), so this is the contract working as designed — but worth noting for calibration in M3.3 that all three documented types are now reachable in practice.
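For calibration tooling, the three documented discovery_event types can be pinned down as an enum. The enum values below are taken from this report's references to researchers/web/models.py:154; the class and tally helper themselves are hypothetical, not the repo's actual model:

```python
# Sketch: enumerate the three discovery_event types this report says are
# documented in researchers/web/models.py:154. The class and helper are
# illustrative, not the repo's real model code.
from enum import Enum

class DiscoveryEventType(str, Enum):
    RELATED_RESEARCH = "related_research"
    NEW_SOURCE = "new_source"
    CONTRADICTION = "contradiction"  # first observed in the wild in M3.2

def tally(events: list[dict]) -> dict[str, int]:
    # Raises ValueError on an undocumented type, which is the point:
    # calibration should notice if a fourth type ever appears.
    counts = {t.value: 0 for t in DiscoveryEventType}
    for e in events:
        counts[DiscoveryEventType(e["type"]).value] += 1
    return counts
```

With all three types now reachable in practice, a strict tally like this is a cheap guardrail for M3.3.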

3. Scope-exceeded distinction is fuzzy in practice. The agent surfaced 5 source_not_found gaps, not the targeted scope_exceeded. Re-reading the gap topics:

  • Outage detail / SLA percentages / post-mortems → reasonable as source_not_found (could be on the web, just not gathered).
  • HFT-specific cold-start benchmarks → genuinely scope_exceeded (HFT firms don't publish these — wrong researcher entirely).

So it's 1 of 5, not a clear bug. The agent's prompt could nudge sharper category assignment, but the misclassification is not severe enough to file independently. Worth revisiting in M3.3 calibration if confidence routinely overestimates due to gap miscategorization.

4. Persisted result file made this analysis trivial. This is the first stress test run after #54 shipped. Recovering all 5 gap categories, all 4 discovery types, and the full confidence_factors took one Python one-liner against <trace_id>.result.json instead of grepping rendered terminal output. M3.3 calibration work will be much faster as a result.
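The recovery pattern described above looks roughly like this. A sketch only: the filename convention and field names are assumed from this report's snapshot, not verified against the persisted schema that #54 shipped:

```python
# Sketch: recover gap categories, discovery types, and confidence_factors
# from a persisted <trace_id>.result.json file. Field names are assumed
# from this report's result snapshot.
import json
from collections import Counter
from pathlib import Path

def summarize(path: Path) -> dict:
    r = json.loads(path.read_text())
    return {
        "gap_categories": Counter(g["category"] for g in r.get("gaps", [])),
        "discovery_types": Counter(e["type"] for e in r.get("discovery_events", [])),
        "confidence_factors": r.get("confidence_factors", {}),
    }
```

One call per trace file replaces grepping rendered terminal output, which is what makes the M3.3 calibration sweep cheap.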

Follow-ups

  • None blocking. The scope_exceeded vs source_not_found distinction may surface again in M3.3 calibration; if so, file it then.