M2.3: First end-to-end smoke test (Utah crops) #10

Closed
opened 2026-04-08 20:48:51 +00:00 by claude-code · 0 comments
Collaborator

Result: green end-to-end run completed 2026-04-08

Initial run (2026-04-08, before fixes)

trace_id: 1a8711c4-a65b-49fd-853e-50fde79c755f — surfaced five real bugs:

  • Issue #15 — invalid default model id (fixed in PR #21)
  • Issue #16 — synthesis JSON parse failure (fixed in PR #20)
  • Issue #17 — token budget enforced post-hoc (fixed in PR #22)
  • Issue #18 — MCP stdio env propagation (fixed in PR #23)
  • Issue #19 — trace logger truncated long fields (fixed in PR #20)

Re-run (2026-04-08, after all fixes)

MARCHWARDEN_MODEL=claude-sonnet-4-6 \
  scripts/docker-test.sh ask "What are ideal crops for a garden in Utah?" --depth balanced

trace_id: 1f37b795-bdca-4ac6-80e8-f3ecd57a19e4

Field Value
answer Multi-paragraph synthesis with cool/warm-season categories, USDA zones, frost dates
citations 10
gaps 3 (categorized)
discovery_events 3 (with suggested_researcher)
open_questions 4 (2 high, 2 medium)
confidence 0.91
confidence_factors.num_corroborating_sources 8
confidence_factors.source_authority high
confidence_factors.query_specificity_match 0.92
confidence_factors.recency current
confidence_factors.budget_exhausted true
cost.tokens_used 44649
cost.iterations_run 3
cost.wall_time_sec 89.21
cost.model_id claude-sonnet-4-6

Every contract field populated. marchwarden replay 1f37b795-bdca-4ac6-80e8-f3ecd57a19e4 renders the full audit trail.

Disposition

M2.3 deliverable met: first successful end-to-end run, fully verified, all surfaced bugs resolved. Phase 2 complete. V1 ship target (Issue #1) is now achievable — the remaining V1 checklist items (the contract documentation, integration tests) are in place from earlier milestones; this run is the integration test.

## Result: green end-to-end run completed 2026-04-08 ### Initial run (2026-04-08, before fixes) **trace_id:** `1a8711c4-a65b-49fd-853e-50fde79c755f` — surfaced five real bugs: - Issue #15 — invalid default model id (fixed in PR #21) - Issue #16 — synthesis JSON parse failure (fixed in PR #20) - Issue #17 — token budget enforced post-hoc (fixed in PR #22) - Issue #18 — MCP stdio env propagation (fixed in PR #23) - Issue #19 — trace logger truncated long fields (fixed in PR #20) ### Re-run (2026-04-08, after all fixes) ``` MARCHWARDEN_MODEL=claude-sonnet-4-6 \ scripts/docker-test.sh ask "What are ideal crops for a garden in Utah?" --depth balanced ``` **trace_id:** `1f37b795-bdca-4ac6-80e8-f3ecd57a19e4` | Field | Value | |---|---| | answer | Multi-paragraph synthesis with cool/warm-season categories, USDA zones, frost dates | | citations | 10 | | gaps | 3 (categorized) | | discovery_events | 3 (with suggested_researcher) | | open_questions | 4 (2 high, 2 medium) | | confidence | **0.91** | | confidence_factors.num_corroborating_sources | 8 | | confidence_factors.source_authority | high | | confidence_factors.query_specificity_match | 0.92 | | confidence_factors.recency | current | | confidence_factors.budget_exhausted | true | | cost.tokens_used | 44649 | | cost.iterations_run | 3 | | cost.wall_time_sec | 89.21 | | cost.model_id | claude-sonnet-4-6 | Every contract field populated. `marchwarden replay 1f37b795-bdca-4ac6-80e8-f3ecd57a19e4` renders the full audit trail. ### Disposition M2.3 deliverable met: first successful end-to-end run, fully verified, all surfaced bugs resolved. **Phase 2 complete. V1 ship target (Issue #1) is now achievable** — the remaining V1 checklist items (the contract documentation, integration tests) are in place from earlier milestones; this run is the integration test.
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: archeious/marchwarden#10
No description provided.