marchwarden/docs/stress-tests/M3.2-results.md
Jeff Smith 0ddc1e6e37 docs(stress-tests): archive M3.2 multi-axis results
Single deep query against AWS Lambda vs Azure Functions for HFT
exercised 3 of 4 target axes simultaneously: recency, contradictions,
and budget pressure all fired in the same run. scope_exceeded miss is
soft (1 of 5 gaps was arguably miscategorized as source_not_found).

First in-the-wild observation of the `contradiction` discovery_event
type. Issue #45.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:34:27 -06:00


M3.2 Multi-axis Stress Test Results

  • Issue: #45 (closed)
  • Date: 2026-04-08
  • Branch: feat/m3.2-multiaxis
  • Trace: 74a017bd-697b-4439-96b8-fe12057cf2e8

Query

"Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."

Run with --depth deep. Default budget for deep = 60k tokens, max_iterations = 8.

Scoring

| Axis | Target | Hit | Evidence |
|---|---|---|---|
| Recency | recency factor | Yes | recency=current |
| Contradictory sources | contradiction_detected=True | Yes | True; bonus discovery_event of type contradiction |
| Scope exceeded | scope_exceeded gap | No | 5 gaps, all source_not_found |
| Budget pressure | budget_exhausted=True | Yes | True; 127692 tokens consumed (2.1x deep cap of 60k) |
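The four axes can be scored mechanically against a parsed result dict. A minimal sketch, assuming field names that mirror the result snapshot in this report (gaps/events as lists of dicts with `category`/`type` keys) rather than the real result schema:

```python
# Sketch: score the four M3.2 stress-test axes against a parsed result dict.
# Field and value names are assumed from this report's snapshot, not taken
# from the actual result schema.

def score_axes(result: dict) -> dict[str, bool]:
    factors = result.get("confidence_factors", {})
    gaps = result.get("gaps", [])
    events = result.get("discovery_events", [])
    return {
        "recency": factors.get("recency") == "current",
        "contradiction": any(e.get("type") == "contradiction" for e in events),
        "scope_exceeded": any(g.get("category") == "scope_exceeded" for g in gaps),
        "budget_pressure": factors.get("budget") == "spent",
    }
```

Against this run's snapshot, scope_exceeded is the only axis that comes back False.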

3 of 4 axes hit cleanly in a single run. Multi-axis composition works.

Result snapshot

  • confidence: 0.78
  • citations: 18 (corroborating sources)
  • gaps: 5 (all source_not_found)
  • discovery_events: 4 (related_research × 2, new_source × 1, contradiction × 1 — first in-the-wild observation)
  • open_questions: 5
  • cost: 127692 tokens, 4 iterations, 168s
  • confidence_factors: corroborating=18, authority=medium, contradiction=True, specificity=0.72, budget=spent, recency=current

Findings

1. Multi-axis composition validated. A single deep query exercised recency + contradictions + budget pressure simultaneously. The contract handled all three without losing structure — confidence dropped appropriately (0.78 vs 0.82–0.91 in M3.1) and the right factors fired.

2. New discovery event type observed. This is the first time contradiction has fired as a discovery_event.type in any test (M3.1 only saw related_research and new_source). It's a documented type (researchers/web/models.py:154), so this is the contract working as designed — but worth noting for calibration in M3.3 that all three documented types are now reachable in practice.
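For calibration tooling, the three documented discovery_event types can be pinned down as an enum. The enum values below are taken from this report's references to researchers/web/models.py:154; the class and tally helper themselves are hypothetical, not the repo's actual model:

```python
# Sketch: enumerate the three discovery_event types this report says are
# documented in researchers/web/models.py:154. The class and helper are
# illustrative, not the repo's real model code.
from enum import Enum

class DiscoveryEventType(str, Enum):
    RELATED_RESEARCH = "related_research"
    NEW_SOURCE = "new_source"
    CONTRADICTION = "contradiction"  # first observed in the wild in M3.2

def tally(events: list[dict]) -> dict[str, int]:
    # Raises ValueError on an undocumented type, which is the point:
    # calibration should notice if a fourth type ever appears.
    counts = {t.value: 0 for t in DiscoveryEventType}
    for e in events:
        counts[DiscoveryEventType(e["type"]).value] += 1
    return counts
```

With all three types now reachable in practice, a strict tally like this is a cheap guardrail for M3.3.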

3. Scope-exceeded distinction is fuzzy in practice. The agent surfaced 5 source_not_found gaps, not the targeted scope_exceeded. Re-reading the gap topics:

  • Outage detail / SLA percentages / post-mortems → reasonable as source_not_found (could be on the web, just not gathered).
  • HFT-specific cold-start benchmarks → genuinely scope_exceeded (HFT firms don't publish these — wrong researcher entirely).

So it's 1 of 5, not a clear bug. The agent's prompt could nudge sharper category assignment, but the misclassification is not severe enough to file independently. Worth revisiting in M3.3 calibration if confidence routinely overestimates due to gap miscategorization.

4. Persisted result file made this analysis trivial. This is the first stress test run after #54 shipped. Recovering all 5 gap categories, all 4 discovery types, and the full confidence_factors took one Python one-liner against <trace_id>.result.json instead of grepping rendered terminal output. M3.3 calibration work will be much faster as a result.
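The recovery pattern described above looks roughly like this. A sketch only: the filename convention and field names are assumed from this report's snapshot, not verified against the persisted schema that #54 shipped:

```python
# Sketch: recover gap categories, discovery types, and confidence_factors
# from a persisted <trace_id>.result.json file. Field names are assumed
# from this report's result snapshot.
import json
from collections import Counter
from pathlib import Path

def summarize(path: Path) -> dict:
    r = json.loads(path.read_text())
    return {
        "gap_categories": Counter(g["category"] for g in r.get("gaps", [])),
        "discovery_types": Counter(e["type"] for e in r.get("discovery_events", [])),
        "confidence_factors": r.get("confidence_factors", {}),
    }
```

One call per trace file replaces grepping rendered terminal output, which is what makes the M3.3 calibration sweep cheap.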

Follow-ups

  • None blocking. The scope_exceeded vs source_not_found distinction may surface again in M3.3 calibration; if so, file it then.