Session3
Jeff Smith edited this page 2026-04-08 20:25:31 -06:00

Session 3 Notes — 2026-04-08

What We Set Out to Do

Open question at session start: Phase 3 stress testing or jump straight to Phase 5 arxiv-rag? User chose Phase 3. Initial scope: M3.1 (single-axis stress tests, Issue #44). The session evolved organically from there as findings opened follow-up work.

What Actually Happened

Five PRs merged over the session, all on main:

  1. PR #55 — M3.1 results archive (docs/stress-tests/M3.1-results.md). Ran four targeted stress queries against the v1 contract; only 1 of the 4 query targets was cleanly hit (Q3 — CRISPR). Q1/Q2 missed because the queries weren't adversarial enough; Q4 missed due to a real bug in budget enforcement. Issue #44 closed.

  2. PR #56 — Trace observability fix (Issue #54). The M3.1 work surfaced that the JSONL trace records only summary counts on completion (gap_count, citation_count) — the actual gap categories, citations, and structured contract were never persisted. Replay couldn't tell you which gaps fired, only how many, which blocked any analysis pass over multiple runs. Fixed two ways: (a) TraceLogger.write_result() dumps the full pydantic ResearchResult to <trace_id>.result.json next to the JSONL trace; (b) the agent emits gap_recorded, citation_recorded, and discovery_recorded events per item, with categories in-band. cli replay now loads the sibling result and renders it. 4 tests added.

  3. PR #57 — M3.2 multi-axis stress test (Issue #45). A single deep query (AWS Lambda vs Azure Functions for HFT) exercised three of the four target axes simultaneously: recency, contradictions, and budget pressure. The scope_exceeded miss was soft (1 of 5 gaps was arguably miscategorized). First in-the-wild observation of the contradiction discovery_event type. Made trivial by the #54 fix.

  4. PR #58 — M5.1.1 arxiv-rag ingest pipeline (Issue #38). New researchers/arxiv/ package: store.py (chromadb wrapper + papers.json manifest) and ingest.py (download → extract → embed → store). New CLI subgroup: marchwarden arxiv add|list|info|remove. Heavy ML deps (pymupdf, chromadb, sentence-transformers, arxiv) are gated behind the [arxiv] optional extra so the base install stays slim. 14 new tests cover extract_sections (real pymupdf against synthetic PDFs at test time), embed_and_store (stub embedder + stub chroma), and the top-level ingest() (mocked arxiv.Search). Done in parallel with the M3.3 collection runner.

  5. PR #59 — M3.3 Phase A: data collection. Split #46 into Phases A/B/C. Phase A (this PR): scripts/calibration_runner.sh runs 20 fixed balanced-depth queries across 4 categories; scripts/calibration_collect.py loads *.result.json files and emits docs/stress-tests/M3.3-rating-worksheet.md. 22 rateable runs total (20 + caffeine smoke + M3.2). Issue #46 stays open for Phases B (human rating) and C (analysis + rubric).
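The result-persistence half of the #54 fix (PR #56 above) can be sketched roughly. This is a minimal stand-in, not the real code: it uses a dataclass where the actual implementation uses the pydantic ResearchResult, and the field names here are assumptions.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path

@dataclass
class ResearchResult:
    # Stand-in for the real pydantic model; field names are assumptions.
    confidence: float
    gaps: list = field(default_factory=list)
    citations: list = field(default_factory=list)

class TraceLogger:
    def __init__(self, trace_dir: Path, trace_id: str):
        self.trace_dir = trace_dir
        self.trace_id = trace_id

    def write_result(self, result: ResearchResult) -> Path:
        # Dump the full structured result to <trace_id>.result.json, a
        # sibling of the JSONL event trace, so replay and analysis
        # tooling can recover more than summary counts.
        path = self.trace_dir / f"{self.trace_id}.result.json"
        path.write_text(json.dumps(asdict(result), indent=2))
        return path
```

Replay can then load the sibling file with json.loads(path.read_text()) and render gap categories and citations directly, rather than reconstructing them from event counts.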

Two issues still open from this session: #53 (budget cap lag bug from M3.1) and #46 (M3.3 calibration awaiting human rating).

Test count: 123 → 141.

Key Decisions & Reasoning

  • Split M3.3 into Phases A/B/C. I can't credibly self-rate the agent's outputs — same biases, marking my own homework. Splitting unblocks the mechanical work (running queries, building the worksheet) and lets the user batch the cognitive work (rating) at their own pace. The alternative (refusing to start until human time was available) would have left the milestone idle.

  • Filed #53 as one bug, not two. I initially proposed splitting the budget cap lag into "the lag itself" + "missing post-loop check," then walked it back to one issue when challenged. They have a single root cause; a "post-loop check" would just be papering over the lag. Lesson: when breaking up findings, ask whether the proposed second issue is a cause or a symptom.

  • Trace observability fix went in BOTH directions (#54). Originally I would have shipped just (a) — persist the result file. The user asked for both options. (a) is the heavy lift but (b) is genuinely complementary: per-item events let you analyze a timeline of what was kept without loading the sibling, and (a) gives you the structured payload for full analysis. Keeping both costs maybe 40 lines and gives both the timeline view and the structured-payload view.

  • Critique of the proposed structured-data tool: don't build it now. The user asked about adding a second MCP tool that returns structured data (e.g. "top 10 hardest hikes" → JSON array). My honest response: good idea, wrong timing, wrong shape. Three reasons: (1) caller-provided schemas are a hallucination footgun — if every row must be filled, the model will invent data to comply; (2) a separate tool orphans the gap/citation infrastructure; (3) no consumer exists yet — the PI agent in M6 is the natural consumer and will tell us the right shape. The right version is eventually an optional output_schema parameter to research() that adds an optional structured_payload field on ResearchResult, leaving gaps/citations/confidence intact. Wait until M6.1 ships. The user accepted, noting it saved time, effort, and tokens.

  • arxiv-rag deps as an optional extra. The base web researcher is lightweight (anthropic, mcp, click, etc.); pymupdf + chromadb + sentence-transformers + arxiv pull in ~2GB, including a CUDA torch wheel. Forcing that on every user just to use the web researcher is hostile. The [arxiv] extra preserves the slim base install; arxiv users run pip install -e ".[arxiv,dev]".

  • Lazy-import in the arxiv CLI subgroup. Even with the [arxiv] extra installed, importing chromadb / torch on every CLI invocation would slow marchwarden ask for users who just want web research. Each arxiv add|list|info|remove command imports chromadb / torch inside the function body, so non-arxiv commands stay fast.

  • *.log gitignore exception for stress-test runs. The collector script depends on parsing log filenames to recover the trace_id → category mapping. Ignoring them globally would break reproducibility. Added !docs/stress-tests/**/*.log exception.

  • Did not run a live arxiv smoke test in this session. The 14 unit tests cover the pipeline; a real-paper smoke would download a 500MB embedding model and could fight the M3.3 collection runner for CPU. Deferred to next session. Acceptable because nothing about the code is speculative — the unit tests exercise the same code paths.
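The lazy-import decision above can be illustrated with a minimal sketch. The click wiring is omitted, the function name is hypothetical, and stdlib json stands in for the heavy chromadb / torch imports.

```python
# No heavy imports at module level: loading this module costs nothing,
# so `marchwarden ask` and other non-arxiv commands start fast.

def arxiv_add(paper_id: str) -> dict:
    """Hypothetical arxiv subcommand body: the heavy dependency is
    imported inside the function, so its import cost is only paid when
    an arxiv command actually runs ('json' stands in for chromadb /
    torch here)."""
    import json
    record = {"added": paper_id}
    return json.loads(json.dumps(record))
```

The trade-off is that an import error (e.g. the [arxiv] extra not installed) surfaces at command run time instead of CLI startup, which is usually the right place for it anyway.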

Surprises & Discoveries

  • The budget cap lag bug (#53) is subtler than it looks. The loop checks total_tokens >= constraints.token_budget at the top of each iteration, but total_tokens is incremented from response.usage only after each model call. Iteration 1's input is tiny (just the user question + system prompt), ~1145 tokens. Iteration 2's input is huge (it contains all the fetched tool-result payloads from iteration 1's tool calls), but the budget check at the top of iteration 2 still sees the small iteration-1 number. With a small max_iterations, the cap can be blown by 5–6x without ever tripping. Synthesis being uncapped is separate and intentional.

  • First in-the-wild contradiction discovery_event. M3.1 only ever produced related_research and new_source types. M3.2's AWS-vs-Azure query produced the first contradiction — the agent recognized that vendor docs and incident reports were saying contradictory things and emitted a contradiction-typed discovery event suggesting follow-up. The type was documented in models.py from day one but had never fired. All three documented types are now reachable in practice.

  • Q17 (screen time) failed catastrophically. The calibration runner produced confidence=0.10, citations=0, gaps=budget_exhausted(1). This is the synthesis fallback path firing — meaning the agent couldn't even produce parseable JSON. Worth investigating during Phase C analysis. May be related to #53 or to the synthesis prompt handling extreme budget pressure poorly.

  • source_not_found is the agent's default escape hatch. Across M3.2 and M3.3, the agent tends to label gaps as source_not_found even when scope_exceeded would be more honest. Re-checking M3.2: 5 gaps, all source_not_found, but at least one (HFT cold-start benchmarks — proprietary, not publicly published) was genuinely scope_exceeded. Not severe enough to file as a bug; may inform Phase C synthesis-prompt adjustments.

  • CUDA torch wheel pulled in by sentence-transformers. pip install ".[arxiv]" installed cuda-toolkit-13.0.2 + ~10 nvidia-* packages totaling ~2GB. Harmless on CPU-only WSL but a real waste of disk. Future cleanup: pin torch to the CPU index in the [arxiv] extra.

  • Hands-off PR merge worked end-to-end. Five PRs created, merged, and synced via the gitea MCP + REST fallback combo with no manual intervention. The pattern (pull_request_write method=create → method=merge Do=merge → delete_branch → local checkout main && pull --ff-only && branch -d) is solid.
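The #53 lag described above can be reproduced with a toy loop. The real agent.py structure is more involved; this just isolates the check-before-increment ordering.

```python
def run_loop(token_budget: int, max_iterations: int, usage_per_call: list) -> int:
    """Toy reproduction of the #53 lag: the budget is checked at the
    top of each iteration, but total_tokens is incremented only AFTER
    the model call, so the check always lags one call behind."""
    total_tokens = 0
    for i in range(max_iterations):
        if total_tokens >= token_budget:  # sees the PREVIOUS call's total
            break
        total_tokens += usage_per_call[i]  # accounted only after the call
    return total_tokens
```

With a 2,000-token budget, a tiny first call (1,145 tokens) and a tool-payload-heavy second call (say 10,000 tokens), the loop ends at 11,145 tokens: a 5–6x overshoot the check never had a chance to prevent.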

Concerns & Open Threads

  • #53 (budget cap lag) is still open and IS contract-impacting. Any second researcher (arxiv-rag retrieval starting in #39) will inherit the same buggy semantics if it follows the web researcher's loop pattern. Recommend fixing before #39 ships, otherwise we'll have two researchers with the same bug.

  • M3.3 Phase B is cognitive load on the user. Phase A produced a 22-row worksheet with empty actual_rating columns. Phase B requires the user to actually read each answer + citations and assign 0.0–1.0 ratings. This is not optional — the rubric in Phase C depends on it. There's no clean way to automate it; if the user doesn't get to it, M3.3 is stuck indefinitely.

  • The synthesis fallback path (Q17 case) may have other failure modes. I noticed it fired but didn't investigate. Worth a closer look during Phase C — if the agent silently degrades to confidence=0.10 with empty everything when synthesis fails, that's a real failure mode users will hit and it deserves either better recovery or louder failure.

  • Two-branch limit was respected but tightly. During Phase 5 work I had feat/m3.3-collection (waiting for runner) and feat/arxiv-rag-ingest open simultaneously. CLAUDE.md says max 2. Worked fine, but if the user had asked for a third concurrent thread I'd have had to push back or merge one early.

  • Pre-#54 traces (M3.1) are unrecoverable. The fix only persists results going forward. The 4 M3.1 runs have JSONL traces but no result.json siblings, so the calibration worksheet has 22 rows instead of 26. Not blocking but worth noting that the observability fix is forward-only.

  • Live arxiv smoke deferred. Code looks right, unit tests pass, but no real PDF has been ingested end-to-end. Possible failure modes still untested: actual arxiv API quirks, real PDF heading detection accuracy on real-world variation, sentence-transformers download timing, chromadb persistence across processes. Should be the first thing next session.

  • Embedding model download is a first-run UX cliff. The first marchwarden arxiv add will download the ~500MB nomic-embed-text-v1.5 embedding model with no progress bar and no warning. Should add a friendly message before triggering it.
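For the first-run UX cliff above, a pre-download notice could look roughly like this. The cache-path convention, function name, and message wording are all assumptions, not the real marchwarden behavior.

```python
from pathlib import Path

def warn_if_model_missing(cache_dir: Path) -> bool:
    """Print a friendly notice (and return True) when the embedding
    model has not been downloaded yet, so `arxiv add` can warn before
    the ~500MB nomic-embed-text-v1.5 pull instead of silently stalling."""
    if not cache_dir.exists():
        print("First run: downloading the embedding model (~500MB). "
              "This happens once and may take a few minutes.")
        return True
    return False
```

Calling this at the top of the add command is enough to turn the cliff into an expected wait.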

Raw Thinking

  • The trace observability gap (#54) is a great example of how stress testing earns its keep. Without M3.1, no one would have thought to ask "wait, where does the structured result actually live?" until the second researcher tried to consume traces and failed. Catching it during the first stress test, before any analysis tooling depended on the missing data, was much cheaper than catching it later.

  • The structured-data-tool critique was the most intellectually substantive moment of the session. The user explicitly asked for blunt and bold, and the right answer wasn't yes-and or no — it was "good idea, wrong moment, wrong shape." The "speculative consumer problem" framing felt important: the difference between building a feature for an imagined user vs. waiting for a real consumer to fail compositionally is the difference between landing on the right shape and landing on the wrong one.

  • M3.3's Phase A/B/C split is a useful pattern for any "human-in-the-loop" milestone. The principle: separate the parts that need human judgment from the parts that don't, do the mechanical work eagerly, hand off the cognitive work as a single batched ask. Worth remembering for future evaluation milestones.

  • The session was simultaneously running two heavy threads — calibration runner (~30 min wall, lots of API calls) and arxiv-rag implementation (lots of new code, heavy deps). Parallel execution paid off — both finished within minutes of each other and neither blocked the other. The key was identifying that they touched disjoint parts of the venv and disjoint parts of the filesystem.

What's Next

Recommended next session: fix #53 (budget cap lag) before continuing Phase 5.

Reasoning: #39 (M5.1.2 arxiv-rag retrieval primitive) is the natural follow-on to M5.1.1, but the arxiv researcher will eventually have its own agent loop (#40) that inherits the budget semantics from the web researcher. Fixing #53 first means the second researcher is born with correct budget enforcement instead of duplicating the bug.
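One possible shape for the #53 fix, heavily hedged (the real fix may differ): account for each call's usage immediately, then check, so the budget trip is observed even when max_iterations ends the loop.

```python
def run_loop_fixed(token_budget: int, max_iterations: int, usage_per_call: list):
    """Sketch of one candidate #53 fix: increment total_tokens right
    after each call and check the budget in the same iteration, so an
    oversized call trips the cap immediately instead of being masked
    when the loop exits via the iteration count."""
    total_tokens = 0
    tripped = False
    for i in range(max_iterations):
        total_tokens += usage_per_call[i]
        if total_tokens >= token_budget:  # check AFTER accounting this call
            tripped = True
            break
    return total_tokens, tripped
```

With budget 2000, max_iterations=2, and usage [1145, 10000], the buggy ordering exits via the iteration count without the check ever firing; this ordering reports the trip at 11,145 tokens, so a budget_exhausted gap can be recorded honestly.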

Order of next-session candidates:

  1. #53 — budget cap lag bug. Single-file fix (probably 10 lines in researchers/web/agent.py) plus a regression test. ~30 min.
  2. Live arxiv smoke: marchwarden arxiv add 1706.03762 (Attention Is All You Need) end-to-end. Validates the M5.1.1 code paths against a real PDF. ~10 min after the embedding model finishes downloading.
  3. #39 — M5.1.2 retrieval primitive. Builds the query API on top of the chromadb collection from M5.1.1. Self-contained, ~12 hours.
  4. M3.3 Phase C — once the user brings back the rated worksheet. Analysis script + rubric + wiki update.
  5. M4.1 (#47) — error handling / hardening. Independent of everything above.

Open issues entering next session: #53, #46.