M3.2 Multi-axis stress test #45

Closed
opened 2026-04-08 23:24:03 +00:00 by claude-code · 1 comment
Collaborator

Phase 3 — Stress Testing & Calibration, milestone 2.

Goal

A single complex query that exercises multiple contract features simultaneously, validating that they compose under load.

Query

"Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."

What this should exercise

  • Recency — needs current outage data and current latency benchmarks
  • Contradictory sources — vendor docs vs incident reports often disagree
  • Scope exceeded — HFT-specific guidance is rare on public web; should suggest other researchers via discovery_events
  • Budget pressure — large evidence space, deep exploration needed

Deliverable

  • One trace file with all four axes touched in the same run
  • Documented results: which factors fired, which gaps were detected, what the synthesis got right vs missed
  • Any contract gaps filed as new issues
archeious added this to the Phase 3: Stress Testing & Calibration milestone 2026-04-08 23:25:11 +00:00
Owner

M3.2 Results

One deep query, one trace, three of four target axes hit. Full writeup archived at docs/stress-tests/M3.2-results.md (PR #57).

Trace: 74a017bd-697b-4439-96b8-fe12057cf2e8

| Axis | Target | Hit | Evidence |
|---|---|---|---|
| Recency | recency factor | Yes | recency=current |
| Contradictory sources | contradiction_detected=True | Yes | True; bonus contradiction-type discovery_event |
| Scope exceeded | scope_exceeded gap | No | 5 gaps, all source_not_found |
| Budget pressure | budget_exhausted=True | Yes | True; 127692 tokens, 2.1x deep cap of 60k |
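
The hit/miss column above can be derived mechanically from a result payload. The following is a minimal sketch, not the project's actual checker: the dictionary layout (a `confidence_factors` mapping, a `gaps` list whose items carry a `category`) and the placement of `budget_exhausted` inside `confidence_factors` are assumptions made for illustration. Only the field names and values quoted in the table come from this thread.

```python
from typing import Any


def axes_hit(result: dict[str, Any]) -> dict[str, bool]:
    """Report which of the four M3.2 axes a result touched.

    The key layout is hypothetical; adjust it to the real contract.
    """
    factors = result.get("confidence_factors", {})
    gap_categories = {gap.get("category") for gap in result.get("gaps", [])}
    return {
        # Hit if a recency factor was emitted at all.
        "recency": factors.get("recency") is not None,
        "contradictory_sources": bool(factors.get("contradiction_detected")),
        # Hit only if at least one gap is categorised as scope_exceeded.
        "scope_exceeded": "scope_exceeded" in gap_categories,
        "budget_pressure": bool(factors.get("budget_exhausted")),
    }


# Values mirror the table above; the structure itself is assumed.
m32_result = {
    "confidence_factors": {
        "recency": "current",
        "contradiction_detected": True,
        "budget_exhausted": True,
    },
    "gaps": [{"category": "source_not_found"} for _ in range(5)],
}

hits = axes_hit(m32_result)
hit_count = sum(hits.values())
print(f"{hit_count} of {len(hits)} axes hit")
```

With the values reported for this run, only the scope_exceeded axis comes back False, matching "three of four target axes hit".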

Snapshot

  • confidence: 0.78 (lower than M3.1's 0.82–0.91, appropriately so given gap mix)
  • citations: 18
  • gaps: 5 (all source_not_found)
  • discovery_events: 4 (related_research x2, new_source x1, contradiction x1 — first in-the-wild observation)
  • cost: 127692 tokens, 4 iterations, 168s
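
For reference, the "2.1x deep cap" figure follows directly from the cost line. This is a small arithmetic sketch using only the numbers quoted in this comment; the constant and function names are illustrative and not taken from the codebase.

```python
DEEP_TOKEN_CAP = 60_000   # deep cap as quoted in the results table
TOKENS_USED = 127_692     # cost reported for this trace
ITERATIONS = 4


def overrun_ratio(used: int, cap: int) -> float:
    """Return how many times over the cap the run went."""
    return used / cap


ratio = overrun_ratio(TOKENS_USED, DEEP_TOKEN_CAP)
overshoot = TOKENS_USED - DEEP_TOKEN_CAP
tokens_per_iteration = TOKENS_USED / ITERATIONS

print(f"{ratio:.1f}x cap, {overshoot} tokens over, "
      f"{tokens_per_iteration:.0f} tokens per iteration")
```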

Notes

Multi-axis composition validated. A single deep query exercised three contract features simultaneously without losing structure. Confidence dropped appropriately and the right factors fired.

First contradiction discovery_event seen. The type is documented at researchers/web/models.py:154; it just hadn't fired in M3.1. All three documented discovery types are now reachable in practice.

The scope_exceeded miss is soft; not filing an issue. Re-checked the 5 source_not_found gaps: only one (HFT-specific cold start benchmarks) is genuinely scope_exceeded. HFT firms don't publish those, so it's the wrong-researcher case. The other 4 (outage details, SLA percentages, post-mortems) are reasonable as source_not_found. That makes 1 misclassification out of 5, which is not severe enough to file. It may resurface in M3.3 calibration; will file then if so.

#54 paid off immediately. This is the first stress test after the persisted-result fix shipped. Recovering all 5 gaps with their categories, all 4 discovery events with their types, and the full confidence_factors took one Python one-liner against <trace_id>.result.json instead of grepping rendered terminal output.
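
The one-liner itself isn't recorded in this thread. An expanded sketch of the same idea might look like the following, assuming the persisted JSON exposes `gaps` (each with a `category`), `discovery_events` (each with a `type`), and `confidence_factors` as top-level keys. Those key names are assumptions, as are the helper names `summarize` and `summarize_file`.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any


def summarize(result: dict[str, Any]) -> dict[str, Any]:
    """Tally gap categories and discovery types from a parsed result.

    Key names are hypothetical; they are not confirmed by the contract.
    """
    gaps = result.get("gaps", [])
    events = result.get("discovery_events", [])
    return {
        "gap_categories": dict(Counter(g.get("category") for g in gaps)),
        "discovery_types": dict(Counter(e.get("type") for e in events)),
        "confidence_factors": result.get("confidence_factors", {}),
    }


def summarize_file(path: str) -> dict[str, Any]:
    """Load a persisted <trace_id>.result.json file and summarize it."""
    return summarize(json.loads(Path(path).read_text(encoding="utf-8")))


# Sample shaped like the counts reported in the Snapshot section.
sample = {
    "gaps": [{"category": "source_not_found"} for _ in range(5)],
    "discovery_events": [
        {"type": "related_research"},
        {"type": "related_research"},
        {"type": "new_source"},
        {"type": "contradiction"},
    ],
    "confidence_factors": {"recency": "current"},
}
summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

Against a real trace, `summarize_file` would be pointed at the persisted result file for that trace ID; where that file lives is not stated here.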

Closing this issue.

Reference: archeious/marchwarden#45