retro: Session 3 — Phase 3 stress testing, #54 fix, M5.1.1 arxiv ingest

Jeff Smith 2026-04-08 20:25:31 -06:00
parent 5257bb26e1
commit fe15f0acf8
2 changed files with 98 additions and 0 deletions

97
Session3.md Normal file

@ -0,0 +1,97 @@
# Session 3 Notes — 2026-04-08
## What We Set Out to Do
Open question at session start: Phase 3 stress testing or jump straight to Phase 5 arxiv-rag? User chose Phase 3. Initial scope: M3.1 (single-axis stress tests, Issue #44). The session evolved organically from there as findings opened follow-up work.
## What Actually Happened
Five PRs merged over the session, all on `main`:
1. **PR #55** — M3.1 results archive ([docs/stress-tests/M3.1-results.md](https://forgejo.labbity.unbiasedgeek.com/archeious/marchwarden/src/branch/main/docs/stress-tests/M3.1-results.md)). Ran four targeted stress queries against the v1 contract. 1 of 4 query targets cleanly hit (Q3 — CRISPR). Q1/Q2 missed because the queries weren't adversarial enough; Q4 missed due to a real bug in budget enforcement. Issue #44 closed.
2. **PR #56** — Trace observability fix (Issue #54). The M3.1 work surfaced that the JSONL trace records only summary counts on `complete` (gap_count, citation_count) — the actual gap categories, citations, and structured contract are never persisted. Replay couldn't tell you *which* gaps fired, only how many. This blocked any analysis pass over multiple runs. Fixed two ways: (a) `TraceLogger.write_result()` dumps the full pydantic ResearchResult to `<trace_id>.result.json` next to the JSONL trace; (b) the agent emits `gap_recorded`, `citation_recorded`, `discovery_recorded` events per item with categories in-band. `cli replay` now loads the sibling result and renders it. 4 tests added.
3. **PR #57** — M3.2 multi-axis stress test (Issue #45). Single deep query against AWS Lambda vs Azure Functions for HFT exercised three of four target axes simultaneously: recency, contradictions, budget pressure. Scope_exceeded miss was soft (1 of 5 gaps was arguably miscategorized). First in-the-wild observation of `contradiction` discovery_event type. Made trivial by #54.
4. **PR #58** — M5.1.1 arxiv-rag ingest pipeline (Issue #38). New `researchers/arxiv/` package: `store.py` (chromadb wrapper + papers.json manifest) and `ingest.py` (download → extract → embed → store). New CLI subgroup `marchwarden arxiv add|list|info|remove`. Heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) gated behind `[arxiv]` optional extra so the base install stays slim. 14 new tests covering extract_sections (real pymupdf against synthetic PDFs at test time), embed_and_store (stub embedder + stub chroma), top-level ingest() (mocked arxiv.Search). Done in parallel with the M3.3 collection runner.
5. **PR #59** — M3.3 Phase A: data collection. Split #46 into Phases A/B/C. Phase A (this PR): `scripts/calibration_runner.sh` runs 20 fixed balanced-depth queries across 4 categories; `scripts/calibration_collect.py` loads `*.result.json` files and emits `docs/stress-tests/M3.3-rating-worksheet.md`. 22 rateable runs total (20 + caffeine smoke + M3.2). Issue #46 stays open for Phases B (human rating) and C (analysis + rubric).
Two issues still open from this session: **#53** (budget cap lag bug from M3.1) and **#46** (M3.3 calibration awaiting human rating).
Test count: 123 → 141.
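The sibling-result persistence from PR #56 can be sketched roughly as follows. This is a minimal illustration, not the real `TraceLogger`: the actual class serializes a pydantic `ResearchResult`; here a stdlib dataclass stands in so the sketch is self-contained, and the field names are simplified assumptions.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class ResearchResult:
    # Simplified stand-in for the real pydantic contract model.
    gaps: list = field(default_factory=list)
    citations: list = field(default_factory=list)
    confidence: float = 0.0

class TraceLogger:
    def __init__(self, trace_dir: Path, trace_id: str):
        self.trace_dir = Path(trace_dir)
        self.trace_id = trace_id

    def write_result(self, result: ResearchResult) -> Path:
        # Persist the full structured result next to the JSONL trace, so
        # replay/analysis can recover *which* gaps and citations fired,
        # not just their counts.
        path = self.trace_dir / f"{self.trace_id}.result.json"
        path.write_text(json.dumps(asdict(result), indent=2))
        return path
```

The payoff is that `cli replay` (and later analysis passes like the M3.3 collector) can load `<trace_id>.result.json` as a sibling of the JSONL trace without re-running anything.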
## Key Decisions & Reasoning
- **Split M3.3 into Phases A/B/C.** I can't credibly self-rate the agent's outputs — same biases, marking my own homework. Splitting unblocks the mechanical work (running queries, building the worksheet) and lets the user batch the cognitive work (rating) at their own pace. The alternative (refusing to start until human time was available) would have left the milestone idle.
- **Filed #53 as one bug, not two.** I initially proposed splitting the budget cap lag into "the lag itself" + "missing post-loop check," then walked it back to one issue when challenged. They have a single root cause; a "post-loop check" would just be papering over the lag. Lesson: when breaking up findings, ask whether the proposed second issue is a cause or a symptom.
- **Trace observability fix went in BOTH directions (#54).** Originally I would have shipped just (a) — persist the result file. The user asked for both options. (a) is the heavy lift but (b) is genuinely complementary: per-item events let you analyze a timeline of *what was kept* without loading the sibling, and (a) gives you the structured payload for full analysis. Keeping both costs maybe 40 lines and gives both the timeline view and the structured-payload view.
- **Critique of the proposed structured-data tool: don't build it now.** User asked about adding a second MCP tool that returns structured data (e.g. "top 10 hardest hikes" → JSON array). My honest response: good idea, wrong timing, wrong shape. Three reasons: (1) caller-provided schemas are a hallucination footgun — every row MUST be filled; (2) a separate tool orphans the gap/citation infrastructure; (3) no consumer exists yet — the PI agent in M6 is the natural consumer that will tell you the right shape. The right version is eventually an *optional* `output_schema` parameter to `research()` that adds an *optional* `structured_payload` field on `ResearchResult`, leaving gaps/citations/confidence intact. Wait until M6.1 ships. User accepted; they noted it saved time, effort, and tokens.
- **arxiv-rag deps as an optional extra.** The base web researcher is lightweight (anthropic, mcp, click, etc.). pymupdf + chromadb + sentence-transformers + arxiv pulls in ~2GB including a CUDA torch wheel. Forcing that on every user just to use the web researcher is hostile. `[arxiv]` extra preserves slim base install; arxiv users add `pip install -e ".[arxiv,dev]"`.
- **Lazy-import in the arxiv CLI subgroup.** Even with the `[arxiv]` extra installed, importing chromadb / torch on every CLI invocation would slow `marchwarden ask` for users who just want web research. Each `arxiv add|list|info|remove` command imports chromadb / torch inside the function body, so non-arxiv commands stay fast.
- **`*.log` gitignore exception for stress-test runs.** The collector script depends on parsing log filenames to recover the trace_id → category mapping. Ignoring them globally would break reproducibility. Added `!docs/stress-tests/**/*.log` exception.
- **Did not run a live arxiv smoke test in this session.** The 14 unit tests cover the pipeline; a real-paper smoke would download a 500MB embedding model and could fight the M3.3 collection runner for CPU. Deferred to next session. Acceptable because nothing about the code is speculative — the unit tests exercise the same code paths.
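The lazy-import decision above can be sketched as a plain function (click wiring omitted for brevity; the handler name and error message are hypothetical, not the real `researchers/arxiv/` code). The point is that heavy deps are imported inside the command body, so merely importing the CLI module costs nothing:

```python
import importlib
import sys

def arxiv_list() -> None:
    """Sketch of an arxiv subcommand handler.

    chromadb/torch are imported only when this command actually runs,
    so non-arxiv commands like `marchwarden ask` never pay the
    multi-second import cost of the ML stack.
    """
    try:
        chromadb = importlib.import_module("chromadb")  # deferred heavy import
    except ImportError:
        # Pairs with the [arxiv] optional extra: fail with guidance
        # instead of a bare traceback when the extra isn't installed.
        raise SystemExit("arxiv extras not installed; run: pip install -e '.[arxiv]'")
    _ = chromadb  # ... query the store and print the paper manifest ...
```

Defining (or importing) the handler pulls in nothing heavy; the cost is paid exactly once, on first invocation of an arxiv command.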
## Surprises & Discoveries
- **The budget cap lag bug (#53) is subtler than it looks.** The loop checks `total_tokens >= constraints.token_budget` at the *top* of each iteration, but `total_tokens` is incremented from `response.usage` *after* each model call. Iter 1's input is tiny (just the user question + system prompt) → ~1145 tokens. Iter 2's input is huge (contains all the fetched tool result payloads from iter 1's tool calls) but the budget check at the top of iter 2 still sees the small iter-1 number. With small `max_iterations`, the cap can be blown by 56x without ever tripping. Synthesis being uncapped is *separate* and *intentional*.
- **First in-the-wild `contradiction` discovery_event.** M3.1 only ever produced `related_research` and `new_source` types. M3.2's AWS-vs-Azure query produced the first `contradiction` — the agent recognized that vendor docs and incident reports were saying contradictory things and emitted a `contradiction`-typed discovery event suggesting follow-up. The type was documented in `models.py` from day one but had never fired. All three documented types are now reachable in practice.
- **Q17 (screen time) failed catastrophically.** The calibration runner produced confidence=0.10, citations=0, gaps=budget_exhausted(1). This is the synthesis fallback path firing — meaning the agent couldn't even produce parseable JSON. Worth investigating during Phase C analysis. May be related to #53 or to the synthesis prompt handling extreme budget pressure poorly.
- **`source_not_found` is the agent's default escape hatch.** Across M3.2 and M3.3, the agent tends to label gaps as `source_not_found` even when `scope_exceeded` would be more honest. Re-checking M3.2: 5 gaps, all `source_not_found`, but at least one (HFT cold-start benchmarks — proprietary, not publicly published) was genuinely scope_exceeded. Not severe enough to file as a bug; may inform Phase C synthesis-prompt adjustments.
- **CUDA torch wheel pulled in by sentence-transformers.** `pip install ".[arxiv]"` installed `cuda-toolkit-13.0.2` + ~10 nvidia-* packages totaling ~2GB. Harmless on CPU-only WSL but a real waste of disk. Future cleanup: pin `torch` to the CPU index in the `[arxiv]` extra.
- **Hands-off PR merge worked end-to-end.** Five PRs created, merged, and synced via the gitea MCP + REST fallback combo with no manual intervention. The pattern (`pull_request_write method=create` → `method=merge Do=merge` → `delete_branch` → local `checkout main && pull --ff-only && branch -d`) is solid.
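The #53 lag reduces to a one-line difference in the budget gate. A minimal sketch with hypothetical helper names and budget numbers; the actual fix in `researchers/web/agent.py` may take a different shape:

```python
def over_budget_lagged(total_tokens: int, budget: int) -> bool:
    # Pre-fix check (top of loop): only usage from *completed* calls is
    # visible, so iter 2's huge input -- all of iter 1's tool payloads --
    # has not been counted yet and sails past the cap.
    return total_tokens >= budget

def over_budget_eager(total_tokens: int, pending_input: int, budget: int) -> bool:
    # One plausible fix: also count the request about to be sent, so a
    # payload-heavy iteration trips the cap *before* the call is made.
    return total_tokens + pending_input >= budget
```

With the numbers from the notes (iter 1 costs ~1145 tokens, iter 2's input carries the full tool payloads), the lagged check waves iteration 2 through while the eager check stops it.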
## Concerns & Open Threads
- **#53 (budget cap lag) is still open and IS contract-impacting.** Any second researcher (arxiv-rag retrieval starting in #39) will inherit the same buggy semantics if it follows the web researcher's loop pattern. Recommend fixing before #39 ships, otherwise we'll have two researchers with the same bug.
- **M3.3 Phase B is cognitive load on the user.** Phase A produced a 22-row worksheet with empty `actual_rating` columns. Phase B requires the user to actually read each answer + citations and assign 0.0–1.0 ratings. This is not optional — the rubric in Phase C depends on it. There's no clean way to automate it; if the user doesn't get to it, M3.3 is stuck indefinitely.
- **The synthesis fallback path (Q17 case) may have other failure modes.** I noticed it fired but didn't investigate. Worth a closer look during Phase C — if the agent silently degrades to confidence=0.10 with empty everything when synthesis fails, that's a real failure mode users will hit and it deserves either better recovery or louder failure.
- **Two-branch limit was respected but tightly.** During Phase 5 work I had `feat/m3.3-collection` (waiting for runner) and `feat/arxiv-rag-ingest` open simultaneously. CLAUDE.md says max 2. Worked fine, but if the user had asked for a third concurrent thread I'd have had to push back or merge one early.
- **Pre-#54 traces (M3.1) are unrecoverable.** The fix only persists results going forward. The 4 M3.1 runs have JSONL traces but no `result.json` siblings, so the calibration worksheet has 22 rows instead of 26. Not blocking but worth noting that the observability fix is forward-only.
- **Live arxiv smoke deferred.** Code looks right, unit tests pass, but no real PDF has been ingested end-to-end. Possible failure modes still untested: actual arxiv API quirks, real PDF heading detection accuracy on real-world variation, sentence-transformers download timing, chromadb persistence across processes. Should be the first thing next session.
- **Embedding model download is a first-run UX cliff.** The first `marchwarden arxiv add` will download ~500MB of nomic-embed-text-v1.5. No progress bar, no warning. Should add a friendly message before triggering it.
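A friendly pre-download message could look like the sketch below. Everything here is an assumption: the function name, the heuristic that the model lands in a cache directory whose entries contain the model name (as Hugging Face-style caches do), and the wording of the message.

```python
from pathlib import Path
from typing import Optional

def first_run_warning(cache_dir: Path,
                      model_name: str = "nomic-embed-text-v1.5") -> Optional[str]:
    """Return a friendly warning if the embedding model is not cached yet.

    Heuristic sketch: treats the model as cached if any entry under
    cache_dir contains the model name. The real cache layout would need
    to be verified before shipping this.
    """
    cached = cache_dir.exists() and any(
        model_name in entry.name for entry in cache_dir.rglob("*")
    )
    if not cached:
        return (f"First run: downloading the ~500MB {model_name} embedding "
                f"model; this may take several minutes.")
    return None
```

`marchwarden arxiv add` could print this before triggering the sentence-transformers load, turning the silent 500MB cliff into an expected wait.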
## Raw Thinking
- The trace observability gap (#54) is a great example of how stress testing earns its keep. Without M3.1, no one would have thought to ask "wait, where does the structured result actually live?" until the second researcher tried to consume traces and failed. Catching it during the first stress test, before any analysis tooling depended on the missing data, was much cheaper than catching it later.
- The structured-data-tool critique was the most intellectually substantive moment of the session. The user explicitly asked for blunt and bold, and the right answer wasn't yes-and or no — it was "good idea, wrong moment, wrong shape." The "speculative consumer problem" framing felt important: the difference between building a feature for an imagined user vs. waiting for a real consumer to fail compositionally is the difference between landing on the right shape and landing on the wrong one.
- M3.3's Phase A/B/C split is a useful pattern for any "human-in-the-loop" milestone. The principle: separate the parts that need human judgment from the parts that don't, do the mechanical work eagerly, hand off the cognitive work as a single batched ask. Worth remembering for future evaluation milestones.
- The session was simultaneously running two heavy threads — calibration runner (~30 min wall, lots of API calls) and arxiv-rag implementation (lots of new code, heavy deps). Parallel execution paid off — both finished within minutes of each other and neither blocked the other. The key was identifying that they touched disjoint parts of the venv and disjoint parts of the filesystem.
## What's Next
**Recommended next session: fix #53 (budget cap lag) before continuing Phase 5.**
Reasoning: #39 (M5.1.2 arxiv-rag retrieval primitive) is the natural follow-on to M5.1.1, but the arxiv researcher will eventually have its own agent loop (#40) that inherits the budget semantics from the web researcher. Fixing #53 first means the second researcher is born with correct budget enforcement instead of duplicating the bug.
**Order of next-session candidates:**
1. **#53** — budget cap lag bug. Single-file fix (probably 10 lines in `researchers/web/agent.py`) plus a regression test. ~30 min.
2. **Live arxiv smoke** — `marchwarden arxiv add 1706.03762` (Attention Is All You Need) end-to-end. Validates the M5.1.1 code paths against a real PDF. ~10 min after the embedding model finishes downloading.
3. **#39** — M5.1.2 retrieval primitive. Builds the query API on top of the chromadb collection from M5.1.1. Self-contained, ~1–2 hours.
4. **M3.3 Phase C** — once the user brings back the rated worksheet. Analysis script + rubric + wiki update.
5. **M4.1** (#47) — error handling / hardening. Independent of everything above.
**Open issues entering next session:** #53, #46.

@ -6,3 +6,4 @@ Index of all session notes for Marchwarden development.
|:---|:---|:---|:---|
| [Session 1](Session1) | 2026-04-08 | Project creation through Phase 1 complete | Name: marchwarden; Tavily; contract: raw_excerpt + categorized gaps + discovery_events + open_questions + confidence_factors + model_id; Phase 0-1 shipped (81 tests) |
| [Session 2](Session2) | 2026-04-08 | Phase 2 + Phase 2.5 shipped; V1 ships; arxiv-rag scoped | structlog over stdlib (OpenSearch-bound); synthesis uncapped by design; budget = soft cap on loop only; cost ledger supplements contract; arxiv-rag replaces grep file researcher in M5.1; Phase 3/4/5/6 milestones populated (123 tests) |
| [Session 3](Session3) | 2026-04-08 | Phase 3 stress testing (M3.1+M3.2 done, M3.3 split A/B/C with A done); trace observability fix (#54); M5.1.1 arxiv-rag ingest pipeline shipped; structured-data tool critiqued and deferred | trace persists full ResearchResult sibling + per-item events; budget cap lag is single bug not two; arxiv extras gated behind `[arxiv]`; lazy-import heavy deps in CLI; structured-data tool waits for M6 PI consumer not speculative shape; M3.3 split into Phases A/B/C to unblock human-in-the-loop work (141 tests) |