archeious/marchwarden

M3.2 Multi-axis stress test #45

Closed

opened 2026-04-08 17:24:03 -06:00 by claude-code · 1 comment

claude-code commented

2026-04-08 17:24:03 -06:00

Collaborator

Phase 3 — Stress Testing & Calibration, milestone 2.

Goal

A single complex query that exercises multiple contract features simultaneously, validating that they compose under load.

Query

"Compare the reliability of AWS Lambda vs. Azure Functions for a high-frequency trading platform in 2026. Identify specific latency benchmarks and any known 2025/2026 outages."

What this should exercise

Recency — needs current outage data and current latency benchmarks
Contradictory sources — vendor docs vs incident reports often disagree
Scope exceeded — HFT-specific guidance is rare on public web; should suggest other researchers via discovery_events
Budget pressure — large evidence space, deep exploration needed

Deliverable

One trace file with all four axes touched in the same run
Documented results: which factors fired, which gaps were detected, what the synthesis got right vs missed
Any contract gaps filed as new issues

archeious added this to the Phase 3: Stress Testing & Calibration milestone 2026-04-08 17:25:11 -06:00

claude-code referenced this issue

2026-04-08 19:21:00 -06:00

Persist full ResearchResult alongside trace (observability gap) #54

archeious referenced this issue from a commit

2026-04-08 19:34:29 -06:00

docs(stress-tests): archive M3.2 multi-axis results

claude-code referenced this issue from a pull request that will close it

2026-04-08 19:34:32 -06:00

docs(stress-tests): archive M3.2 multi-axis results #57

archeious commented

2026-04-08 19:35:00 -06:00

Owner

M3.2 Results

One deep query, one trace, three of four target axes hit. Full writeup archived at docs/stress-tests/M3.2-results.md (PR #57).

Trace: 74a017bd-697b-4439-96b8-fe12057cf2e8

| Axis | Target | Hit | Evidence |
|---|---|---|---|
| Recency | recency factor | Yes | `recency=current` |
| Contradictory sources | `contradiction_detected=True` | Yes | True; bonus `contradiction`-type discovery_event |
| Scope exceeded | `scope_exceeded` gap | No | 5 gaps, all `source_not_found` |
| Budget pressure | `budget_exhausted=True` | Yes | True; 127692 tokens, 2.1x deep cap of 60k |
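The Hit column can be derived mechanically from a result. The sketch below is only illustrative: the flat dict layout and the `category` key on each gap are assumptions, not the actual `ResearchResult` schema, and the sample values are transcribed from the table.

```python
from typing import Any


def axes_hit(result: dict[str, Any]) -> dict[str, bool]:
    """Map each M3.2 target axis to whether this result touched it.

    The flat layout and key names are hypothetical, not the real schema.
    """
    gap_categories = {gap["category"] for gap in result.get("gaps", [])}
    return {
        # Recency counts as hit when a recency factor was reported at all.
        "recency": result.get("recency") is not None,
        "contradictory_sources": result.get("contradiction_detected") is True,
        "scope_exceeded": "scope_exceeded" in gap_categories,
        "budget_pressure": result.get("budget_exhausted") is True,
    }


# Values transcribed from the M3.2 results table.
m32 = {
    "recency": "current",
    "contradiction_detected": True,
    "budget_exhausted": True,
    "gaps": [{"category": "source_not_found"} for _ in range(5)],
}

hits = axes_hit(m32)
hit_count = sum(hits.values())
print(f"{hit_count} of {len(hits)} axes hit")
```

With the M3.2 values this reports three of four axes, with `scope_exceeded` as the miss.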

Snapshot

confidence: 0.78 (lower than M3.1's 0.82–0.91, appropriately so given gap mix)
citations: 18
gaps: 5 (all source_not_found)
discovery_events: 4 (related_research x2, new_source x1, contradiction x1 — first in-the-wild observation)
cost: 127692 tokens, 4 iterations, 168s
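The 2.1x figure quoted for budget pressure follows directly from the reported cost and the 60k deep cap. A quick check, using only numbers from this comment (the variable names are illustrative):

```python
DEEP_TOKEN_CAP = 60_000  # "deep cap of 60k" from the results table
TOKENS_USED = 127_692    # reported cost of the M3.2 run
ITERATIONS = 4
WALL_SECONDS = 168

overrun_ratio = TOKENS_USED / DEEP_TOKEN_CAP
tokens_over_cap = TOKENS_USED - DEEP_TOKEN_CAP
tokens_per_iteration = TOKENS_USED / ITERATIONS
seconds_per_iteration = WALL_SECONDS / ITERATIONS

print(f"{overrun_ratio:.1f}x cap, {tokens_over_cap} tokens over")
# prints: 2.1x cap, 67692 tokens over
print(f"{tokens_per_iteration:.0f} tokens and {seconds_per_iteration:.0f}s per iteration")
# prints: 31923 tokens and 42s per iteration
```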

Notes

Multi-axis composition validated. A single deep query exercised three contract features simultaneously without losing structure. Confidence dropped appropriately and the right factors fired.

First contradiction discovery_event seen. Documented type at researchers/web/models.py:154, just hadn't fired in M3.1. All three documented discovery types are now reachable in practice.

Scope_exceeded miss is soft, not filing. Re-checked the 5 source_not_found gaps: only one (HFT-specific cold start benchmarks) is genuinely scope_exceeded — HFT firms don't publish those, so it's the wrong-researcher case. The other 4 (outage details, SLA percentages, post-mortems) are reasonable as source_not_found. So 1 of 5, not severe enough to file. May resurface in M3.3 calibration; will file then if so.

#54 paid off immediately. This is the first stress test after the persisted-result fix shipped. Recovering all 5 gaps, all 4 discovery events, and the full confidence_factors took one Python one-liner against <trace_id>.result.json instead of grepping rendered terminal output.
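The exact one-liner is not recorded here. A slightly longer sketch of the same idea is below; the top-level keys `gaps` and `discovery_events` and the per-item keys `category` and `type` are assumptions about the persisted JSON, not the confirmed schema, and the sample data only mirrors the counts in the Snapshot.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any


def summarize(result: dict[str, Any]) -> dict[str, Any]:
    """Count gaps by category and discovery events by type (assumed keys)."""
    gaps = Counter(g["category"] for g in result.get("gaps", []))
    events = Counter(e["type"] for e in result.get("discovery_events", []))
    return {
        "gaps": dict(gaps),
        "discovery_events": dict(events),
        "confidence_factors": result.get("confidence_factors", {}),
    }


def summarize_file(path: Path) -> dict[str, Any]:
    """Load a persisted <trace_id>.result.json file and summarize it."""
    return summarize(json.loads(path.read_text(encoding="utf-8")))


# In-memory stand-in for a result file, matching the Snapshot counts.
sample = {
    "gaps": [{"category": "source_not_found"} for _ in range(5)],
    "discovery_events": [
        {"type": "related_research"},
        {"type": "related_research"},
        {"type": "new_source"},
        {"type": "contradiction"},
    ],
    "confidence_factors": {"recency": "current"},
}

summary = summarize(sample)
print(json.dumps(summary, indent=2))
```

For a real run, `summarize_file` would be pointed at the persisted result file for the trace; where that file lives on disk is not stated in this issue.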

Closing this issue.

claude-code closed this issue

2026-04-08 19:35:03 -06:00

Reference: archeious/marchwarden#45