M4.1 Error handling and graceful degradation #47

Open
opened 2026-04-08 23:24:21 +00:00 by claude-code · 0 comments
Collaborator

Phase 4 — Hardening, milestone 1.

Goal

Production-quality failure modes. Today the agent crashes or returns a stub on most error paths; this milestone makes every failure either recoverable or surfaced as a categorized gap.

Scope

  • Tavily down / rate-limited — catch the API error, log a warning, populate a gap with category=ACCESS_DENIED, return a partial result if any evidence was already gathered
  • URL fetch failures — individual citation skipped, noted in trace, doesn't crash the iteration
  • Claude API timeout — meaningful error message, return partial results if synthesis hasn't run yet
  • Claude API rate limit — exponential backoff with jitter (cap at ~60s), then surface as a clear error
  • Budget overflow protection at the MCP server level — defense in depth: if the agent loop somehow blows past the cap, the server kills the call before billing more
  • Malformed tool args from the LLM — recover gracefully, log to trace, ask the model to retry once, then fail soft

Tests

  • Mock each failure mode and assert the correct gap category / behavior
  • Integration test: Tavily mock returns 429 → result has ACCESS_DENIED gap, runs to completion
Phase 4 — Hardening, milestone 1. ## Goal Production-quality failure modes. Today the agent crashes or returns a stub on most error paths; this milestone makes every failure either recoverable or surfaced as a categorized gap. ## Scope - **Tavily down / rate-limited** — catch the API error, log a warning, populate a `gap` with `category=ACCESS_DENIED`, return a partial result if any evidence was already gathered - **URL fetch failures** — individual citation skipped, noted in trace, doesn't crash the iteration - **Claude API timeout** — meaningful error message, return partial results if synthesis hasn't run yet - **Claude API rate limit** — exponential backoff with jitter (cap at ~60s), then surface as a clear error - **Budget overflow protection at the MCP server level** — defense in depth: if the agent loop somehow blows past the cap, the server kills the call before billing more - **Malformed tool args from the LLM** — recover gracefully, log to trace, ask the model to retry once, then fail soft ## Tests - Mock each failure mode and assert the correct gap category / behavior - Integration test: Tavily mock returns 429 → result has `ACCESS_DENIED` gap, runs to completion
archeious added this to the Phase 4: Hardening milestone 2026-04-08 23:25:12 +00:00
Sign in to join this conversation.
No labels
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: archeious/marchwarden#47
No description provided.