retro: Session 4 — Phase 2 + 2.5 complete, classifier rebuild, budget fix

Jeff Smith 2026-04-06 22:52:04 -06:00
parent aabf7807fd
commit bda9958728

# Session 4
**Date:** 2026-04-06
**Focus:** Phase 2 (survey pass) end-to-end + Phase 2.5 (context budget reliability)
---
## What Was Done
### Phase 2: Survey Pass — completed
All four Phase 2 issues shipped, plus the `#42` filetype-classifier rebuild and the `#44` context-budget fix.
- **#4** — `_SURVEY_SYSTEM_PROMPT` added to `prompts.py`. Tri-state tool triage (relevant / skip / unlisted), bias-aware confidence rubric, explicit "do not read files" framing. PR #41.
- **#5** — `_run_survey()` and `submit_survey` tool added to `ai.py`. ≤3-turn agent loop, wired into `_run_investigation()` before the dir loop, log-only at this point. Survey failure non-fatal. Includes a Band-Aid prompt warning about the source-code-biased histogram (later removed in #42). PR #43.
- **#6** — Survey output wired into the dir loop. Two parts: prompt injection via a new `{survey_context}` placeholder in `_DIR_SYSTEM_PROMPT`, AND hard tool-schema filtering via `_filter_dir_tools()` (gated on `confidence >= 0.5`, control-flow tools always preserved). The hard filtering was added after the #5 smoke test showed prompt-only guidance was ignored. PR #45.
- **#7** — Skip survey for tiny targets. AND-semantics gate: `total_files < 5 AND total_dirs < 2`. Synthetic `_default_survey()` with `confidence=0.0` so the tool filter never enforces from a synthetic value. Reordered `_discover_directories()` to run before the survey gate so dir count is free. PR #47.
- **#42** — Replaced the bucketed `file_categories` histogram in the survey input with raw signals (`survey_signals()`): extension histogram, `file --brief` description histogram, evenly-drawn filename samples. Removed the Band-Aid prompt warning. The bucketed view stays in place for the terminal report. PR #50.
- **#44** — Fixed the context budget metric. The original `budget_exceeded()` compared the *cumulative sum* of per-call `input_tokens` against the context window size, which double-counts every turn (each turn's `input_tokens` already includes the full message history). Replaced with `last_input` (the most recent call's input_tokens). Bumped `MAX_CONTEXT` from 180k to 200k (Sonnet 4's real window). PR #52.
### New issues opened
- **#41** — PR for #4
- **#42** — Rebuild filetype classifier to remove source-code bias (shipped in this session)
- **#43** — PR for #5
- **#44** — Dir loop context budget reliability (shipped in this session)
- **#45** — PR for #6
- **#46** — Revisit survey-skip thresholds with empirical data (deferred to end-of-project tuning)
- **#47** — PR for #7
- **#48** — Unit-of-analysis problem ("file" wrong unit for mbox/SQLite/Maildir) — **Phase 4.5**
- **#49** — Terminal report shows biased bucketed view (end-of-project tuning)
- **#50** — PR for #42
- **#51** — Dir loop message history grows monotonically, no eviction (deferred until Phase 3 raises max_turns)
- **#52** — PR for #44
### PLAN.md updated
- **Phase 2** — annotated with `#42` (classifier rebuild) note
- **Phase 2.5 — Context budget reliability** added between Phase 2 and Phase 3 for `#44`
- **Phase 4.5 — Unit of analysis (#48)** added between Phase 4 and Phase 5
- **End-of-project tuning** section added at the bottom referencing `#46` and `#49`
### Wiki updated
- `Architecture.md` — context budget section rewritten to document the per-call (not cumulative) metric, the 200k window, and the rationale for the `max_turns=14` sanity cap
### Test count
136 → 168 across the session (+32). All passing.
---
## Discoveries and Observations
### The classifier bias is structural, not cosmetic
Tracing the failure for a Maildir target (issue #42 root cause analysis): every email file falls through `EXTENSION_MAP` (no extension or weird `:2,S` flag suffix), then matches `FILE_CMD_PATTERNS["text"] = "source"` because `file --brief` returns `"SMTP mail, ASCII text"`. Result: a 10,000-message mailbox is reported as 10,000 source files. The survey would have called it a giant codebase. The honest fix wasn't expanding the taxonomy (treadmill) — it was stopping the bucketing entirely for the survey input and feeding raw signals instead.
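A minimal sketch of what the raw-signals approach could look like. Only the name `survey_signals()` and the three signal kinds (extension histogram, `file --brief` description histogram, evenly-drawn filename samples) come from the retro; the signature and internals here are assumptions:

```python
from collections import Counter
from pathlib import PurePosixPath

def survey_signals(paths, file_cmd_outputs, sample_size=10):
    """Raw, unbucketed signals for the survey prompt (sketch).

    paths: list of file paths in the target.
    file_cmd_outputs: parallel list of `file --brief` descriptions,
    collected elsewhere (running `file` per path is out of scope here).
    """
    # No EXTENSION_MAP, no FILE_CMD_PATTERNS: the histograms are raw counts,
    # so "SMTP mail, ASCII text" stays visible instead of collapsing to "source".
    ext_hist = Counter(PurePosixPath(p).suffix or "<none>" for p in paths)
    desc_hist = Counter(file_cmd_outputs)
    # Evenly-drawn samples: stride through the sorted list so the sample
    # spans the whole target rather than one directory.
    ordered = sorted(paths)
    step = max(1, len(ordered) // sample_size)
    samples = ordered[::step][:sample_size]
    return {
        "extension_histogram": dict(ext_hist),
        "file_brief_histogram": dict(desc_hist),
        "filename_samples": samples,
    }
```

On a Maildir-like input the model now sees `"SMTP mail, ASCII text": 10000` directly instead of a pre-bucketed `source: 10000`.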
### Prompt-only guidance is not enforcement
The #5 smoke test produced an unexpectedly clean empirical observation: the survey returned `skip_tools: ["run_command"]`, the dir-loop prompt mentioned this, and the dir-loop agent **immediately called `run_command` twice anyway**. The model treats prompt instructions as soft preferences and tool availability as hard affordances. This pushed #6's scope from "inject into prompt" to "filter from tool schema." The post-#6 smoke test had zero `run_command` calls and dropped cost from $0.46 to $0.34.
### The context budget metric was a bookkeeping bug, not a real overflow
#44's verification run produced clear numbers: when the loop bailed at "139,427 tokens used," the actual most-recent `input_tokens` was **20,535** — about 10% of Sonnet's real 200k window. The growth pattern was linear (~1.52k per turn) and at the current `max_turns=14` cap nowhere close to overflowing. The fix is one field on the tracker plus a one-line change to `budget_exceeded()`. Verified empirically before coding the fix, which paid off — saved second-guessing whether to also tackle message-history pruning (it's a real but separate problem, filed as #51).
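The before/after can be shown in a few lines. `budget_exceeded()`, `last_input`, and `MAX_CONTEXT = 200_000` are from the retro; the tracker class and field names are assumptions:

```python
MAX_CONTEXT = 200_000  # Sonnet 4's real context window

class UsageTracker:
    """Sketch of the per-loop usage bookkeeping."""
    def __init__(self):
        self.cumulative_input = 0  # sum of per-call input_tokens (the buggy metric)
        self.last_input = 0        # most recent call's input_tokens (the honest metric)

    def record(self, input_tokens):
        self.cumulative_input += input_tokens
        self.last_input = input_tokens

def budget_exceeded(tracker):
    # BEFORE (buggy): tracker.cumulative_input >= MAX_CONTEXT.
    # That double-counts every turn, because each turn's input_tokens
    # already includes the full message history.
    return tracker.last_input >= MAX_CONTEXT
```

Seven turns of ~20k input tokens sum to ~140k and tripped the old check, while the honest per-call number never left the low twenties of thousands.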
### File count alone is not a sufficient survey-skip gate
The deep-narrow case (4 files spread across 50 directories) breaks file-count-only gating. Files are tiny but the dir loop will run 50 times and benefit from survey framing on each. The honest semantics are AND, not OR: skip only when both files AND dirs are tiny, because dir count is what amortizes the survey's cost across the run. Directories don't carry information for the survey itself — they multiply its downstream value.
### Unit of analysis is the next big honesty problem
User correctly pushed back on whether `filetypes.py` should be renamed during #42. Renaming would only move the dishonesty: the real problem is "file" as the unit of analysis. Files are wrong in both directions — Maildir/`.git`/`node_modules` over-count (one logical thing, thousands of files), mbox/SQLite/zip/notebook under-count (one file, many logical things). Captured as #48 (Phase 4.5), explicitly out of scope for #42, with the rename happening when the substantive change happens.
---
## Decisions Made
**Tri-state tool triage in the survey** (relevant / skip / unlisted, not relevant / skip as complements). Rationale: a partition forces every tool into one of two buckets, but most tools should be neither — available if needed, no strong opinion. The unlisted default is the load-bearing choice; the survey only marks `skip` when use would be actively wrong, not merely unnecessary.
**Hard tool filtering, not just prompt injection, for #6.** Empirical evidence from #5 showed prompt-only is insufficient. Filtering removes the tool from the API schema entirely; the agent literally cannot call it. Gated on `confidence >= 0.5` so the survey can't break the dir loop on a low-confidence wrong call.
**`{survey_context}` as a separate placeholder** in `_DIR_SYSTEM_PROMPT`, not concatenated into `{context}`. Rationale: per-investigation vs per-directory have different lifetimes and sources; making the structure visible in the prompt template makes future debugging easier; matches the existing `{child_summaries}` precedent.
**Synthetic `_default_survey()` with `confidence=0.0`** for the tiny-target skip path. The 0.0 confidence is deliberate: it ensures `_filter_dir_tools()` never enforces `skip_tools` from a synthetic value. The dir loop gets a generic "small target, read everything" framing in the prompt and keeps its full toolbox.
**Verify-before-fix for #44.** Added a temporary `print` statement, ran a real smoke test, read the actual numbers, *then* changed the code. Confirmed the bookkeeping diagnosis before touching production code. Cost ~$0.50 in API spend, saved the cost of building the wrong fix and then having to redo it.
**`MAX_CONTEXT = 200_000` (Sonnet 4 real), `max_turns = 14` left as-is.** The budget fix gives the loop honest headroom. `max_turns` stays as a sanity bound — even on small targets the agent should produce `submit_report` long before turn 14. Documented in the wiki so future readers know it's intentional, not a leftover.
**Issue sequencing.** #44 → Phase 2.5 (must fix before Phase 3 adds longer prompts). #42 → bundled into Phase 2 (survey can't be honest without it). #46, #49 → end-of-project tuning (advisory, not blocking). #48 → Phase 4.5 (substantive new capability, after Phase 4 external tools). #51 → post-Phase-3 (only matters if max_turns is raised).
---
## Raw Thinking
### Smoke testing discipline paid off twice
The #5 smoke test discovered the prompt-vs-schema gap that became #6's expanded scope. The #44 verification print discovered that the suspected fix (pruning tool results) wasn't even what was wrong — the metric itself was buggy. In both cases the empirical step was cheap (one API call) and reframed the design before code was written. Worth keeping as a habit. The temptation to "just code the obvious fix" is real, especially on small changes, but the pattern of `add a print → run → read numbers → decide` keeps the diagnosis honest.
### The classifier bias and the unit-of-analysis problem are nested, not the same
#42 fixes "the bucketing inside per-file classification is biased." #48 fixes "the per-file unit is wrong for many real targets." It would have been easy to conflate them — both manifest as "Luminos describes a Maildir as source code" — but they're different fixes at different layers. #42 makes the survey honest about *what files are*. #48 makes the whole pipeline honest about *what the unit of analysis is*. Splitting them was the right call; bundling them would have produced a much bigger PR with weaker rationale and pushed the release further.
### The "Band-Aid then proper fix" pattern worked cleanly
The bias warning in `_SURVEY_SYSTEM_PROMPT` shipped with #5, lived through #6 and #7, and was removed by #42. Three commits with a known-acknowledged compromise in place is fine when the followup is filed and sequenced. The Band-Aid was visible (an explicit `IMPORTANT:` paragraph that no one would mistake for permanent), and the Maildir smoke test in #42 was the "yes the proper fix actually works" gate before pulling it. Worth doing this on purpose more often instead of insisting every change be the final form.
### Cost of #6 was negative
Filtering `run_command` out of the dir-loop tool schema for `luminos_lib` dropped cost from $0.46 to $0.34 (~26%) and partially masked #44 (the loop stopped tripping the buggy budget because it ran fewer turns). This is a useful instance of "fixing the right thing makes other unrelated symptoms go away." The corollary: if I'd jumped to "fix the budget" without first shipping #6, I'd have spent effort on #44 against a baseline that was about to change anyway. Order of operations matters more than I usually credit.
### Tool calls outside dedicated tools keep biting on JSON quoting
Twice this session a `curl` invocation with embedded JSON broke on bash-level quoting (parens and apostrophes inside heredocs). The pattern that works reliably: write a one-shot Python script to `/tmp/`, run it, delete it. Worth defaulting to this for any Forgejo API call with multi-paragraph bodies. Capturing this as a personal habit, not opening an issue.
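The core of why the one-shot-script pattern works: the JSON body lives entirely in Python, so bash quoting never touches it. A minimal sketch; the Forgejo URL shape and token header here are illustrative, not the project's actual setup:

```python
import json
import urllib.request

def build_request(url, token, payload):
    """Build a JSON POST request; urllib.request.urlopen(req) would send it.

    Apostrophes, parens, and multi-paragraph bodies in `payload` are safe:
    json.dumps handles all escaping, and no shell ever sees the string.
    """
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"token {token}",
        },
        method="POST",
    )
```

Dropped into a throwaway `/tmp/post_comment.py` with the payload inlined, run once, deleted.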
### The dir loop is starting to get crowded
Survey context, child summaries, per-dir context, tool schemas, confidence rubrics, step numbering, efficiency rules, cache schemas, and now the survey-injected `relevant_tools` / `skip_tools` lists are all in `_DIR_SYSTEM_PROMPT`. It's still readable but the cohesion is starting to thin. When Phase 3 adds investigation plans, the prompt will need a structural pass. Not now — but worth flagging for the Phase 3 retrospective.
---
## What's Next
Phase 2 + 2.5 are complete. Phase 3 (investigation planning) is the next major milestone:
- **#8** — Add `_PLANNING_SYSTEM_PROMPT` to `prompts.py` (prompt-only, low risk, mirrors how #4 started Phase 2)
- **#9** — Implement planning pass with `submit_plan` tool in `ai.py`
- **#10** — Implement dynamic turn allocation from plan output
- **#11** — Save investigation plan to cache for resumed runs
Start with #8 for the same reason #4 was the right entry to Phase 2: prompt-only changes are the cheapest way to get aligned on an approach before investing in agent loop wiring.
### Sidequests still on the board
- **#38** — Cache invalidation by mtime (small, can drop in any time)
- **#46** — Revisit survey thresholds (deferred to end-of-project)
- **#48** — Phase 4.5 unit-of-analysis (don't start before Phase 4)
- **#49** — Terminal report bias (end-of-project tuning)
- **#51** — Dir loop history eviction (only matters when max_turns is raised in Phase 3+)
### Pickup point for next session
- **Branch:** main
- **File:** `luminos_lib/prompts.py`
- **In progress:** nothing — Phase 2 + 2.5 are clean
- **First action:** start #8 — analyze, plan, explain, then write `_PLANNING_SYSTEM_PROMPT`. Same pattern as #4. The survey output is now a known-quality input that the planning pass can build on.

| Session | Date | Focus |
|---------|------|-------|
| [Session 1](Session1) | 2026-04-06 | Project setup, scan improvements, Forgejo repo, wiki, development practices |
| [Session 2](Session2) | 2026-04-06 | Forgejo milestones, issues, project board (36 issues, 9 milestones), Gitea MCP setup |
| [Session 3](Session3) | 2026-04-06 | Phase 1 complete, MCP backend architecture design, issues #38–#40 opened |
| [Session 4](Session4) | 2026-04-06 | Phase 2 + 2.5 complete (#4–#7, #42, #44), classifier rebuild, context budget fix, 8 PRs merged |
---