Jeff Smith edited this page 2026-04-06 22:52:04 -06:00

Session 4

Date: 2026-04-06
Focus: Phase 2 (survey pass) end-to-end + Phase 2.5 (context budget reliability)


What Was Done

Phase 2: Survey Pass — completed

All four Phase 2 issues shipped, plus the #42 filetype-classifier rebuild and the #44 context-budget fix.

  • #4 — _SURVEY_SYSTEM_PROMPT added to prompts.py. Tri-state tool triage (relevant / skip / unlisted), bias-aware confidence rubric, explicit "do not read files" framing. PR #41.
  • #5 — run_survey() and submit_survey tool added to ai.py. ≤3-turn agent loop, wired into _run_investigation() before the dir loop, log-only at this point. Survey failure is non-fatal. Includes a Band-Aid prompt warning about the source-code-biased histogram (later removed in #42). PR #43.
  • #6 — Survey output wired into the dir loop. Two parts: prompt injection via a new {survey_context} placeholder in _DIR_SYSTEM_PROMPT, AND hard tool-schema filtering via _filter_dir_tools() (gated on confidence >= 0.5, control-flow tools always preserved). The hard filtering was added after the #5 smoke test showed prompt-only guidance was ignored. PR #45.
  • #7 — Skip survey for tiny targets. AND-semantics gate: total_files < 5 AND total_dirs < 2. Synthetic _default_survey() with confidence=0.0 so the tool filter never enforces from a synthetic value. Reordered _discover_directories() to run before the survey gate so dir count is free. PR #47.
  • #42 — Replaced the bucketed file_categories histogram in the survey input with raw signals (survey_signals()): extension histogram, file --brief description histogram, evenly-drawn filename samples. Removed the Band-Aid prompt warning. The bucketed view stays in place for the terminal report. PR #50.
  • #44 — Fixed the context budget metric. The original budget_exceeded() compared the cumulative sum of per-call input_tokens against the context window size, which double-counts every turn (each turn's input_tokens already includes the full message history). Replaced with last_input (the most recent call's input_tokens). Bumped MAX_CONTEXT from 180k to 200k (Sonnet 4's real window). PR #52.
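
The #44 fix can be sketched in a few lines. This is a minimal illustration of the per-call metric described above, not the project's actual code — the TokenTracker class, its field names, and the budget_exceeded() signature are assumptions; only last_input, MAX_CONTEXT, and the 200k value come from the session notes.

```python
MAX_CONTEXT = 200_000  # Sonnet 4's real context window, per #44

class TokenTracker:
    """Tracks input token counts per API call (field names are assumptions)."""
    def __init__(self):
        self.last_input = 0    # most recent call's input_tokens
        self.total_input = 0   # cumulative sum, kept for reporting only

    def record(self, input_tokens: int) -> None:
        self.last_input = input_tokens
        self.total_input += input_tokens

def budget_exceeded(tracker: TokenTracker, limit: int = MAX_CONTEXT) -> bool:
    # The buggy version compared tracker.total_input against the window,
    # double-counting every turn: each call's input_tokens already includes
    # the full message history. The fix compares only the latest call.
    return tracker.last_input >= limit
```

With three turns of growing history (say 18k, 19.5k, 20.5k input tokens), the cumulative sum triples the apparent usage while the honest per-call figure stays near 20k — exactly the 139,427-vs-20,535 gap from the #44 verification run, at larger scale.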

New issues opened

  • #41 — PR for #4
  • #42 — Rebuild filetype classifier to remove source-code bias (shipped in this session)
  • #43 — PR for #5
  • #44 — Dir loop context budget reliability (shipped in this session)
  • #45 — PR for #6
  • #46 — Revisit survey-skip thresholds with empirical data (deferred to end-of-project tuning)
  • #47 — PR for #7
  • #48 — Unit-of-analysis problem ("file" wrong unit for mbox/SQLite/Maildir) — Phase 4.5
  • #49 — Terminal report shows biased bucketed view (end-of-project tuning)
  • #50 — PR for #42
  • #51 — Dir loop message history grows monotonically, no eviction (deferred until Phase 3 raises max_turns)
  • #52 — PR for #44

PLAN.md updated

  • Phase 2 — annotated with #42 (classifier rebuild) note
  • Phase 2.5 — Context budget reliability added between Phase 2 and Phase 3 for #44
  • Phase 4.5 — Unit of analysis (#48) added between Phase 4 and Phase 5
  • End-of-project tuning section added at the bottom referencing #46 and #49

Wiki updated

  • Architecture.md — context budget section rewritten to document the per-call (not cumulative) metric, the 200k window, and the rationale for the max_turns=14 sanity cap

Test count

136 → 168 across the session (+32). All passing.


Discoveries and Observations

The classifier bias is structural, not cosmetic

Tracing the failure for a Maildir target (issue #42 root cause analysis): every email file falls through EXTENSION_MAP (no extension or weird :2,S flag suffix), then matches FILE_CMD_PATTERNS["text"] = "source" because file --brief returns "SMTP mail, ASCII text". Result: a 10,000-message mailbox is reported as 10,000 source files. The survey would have called it a giant codebase. The honest fix wasn't expanding the taxonomy (treadmill) — it was stopping the bucketing entirely for the survey input and feeding raw signals instead.
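
A minimal sketch of the raw-signal idea makes the contrast concrete. This is not the real survey_signals() from #42 — the signature is an assumption, and `describe` stands in for a `file --brief` call so the sketch stays self-contained. The point is structural: no bucketing happens here, so a Maildir's extension-less `:2,S` filenames and "SMTP mail" descriptions reach the model as-is instead of being collapsed into "source".

```python
import collections
import os

def survey_signals(paths, describe):
    """Raw, unbucketed signals for the survey input (a sketch).
    `describe` is a stand-in for `file --brief` on each path."""
    # Extension histogram; extension-less files stay visible as "<none>"
    # instead of falling through a classifier map.
    ext_hist = collections.Counter(
        os.path.splitext(p)[1] or "<none>" for p in paths)
    # Description histogram: the model sees "SMTP mail, ASCII text"
    # directly rather than a pre-bucketed "source" label.
    desc_hist = collections.Counter(describe(p) for p in paths)
    # Evenly drawn filename samples so one large directory can't dominate.
    step = max(1, len(paths) // 20)
    samples = sorted(paths)[::step][:20]
    return {"extensions": dict(ext_hist),
            "descriptions": dict(desc_hist),
            "samples": samples}
```

Fed the Maildir case, this reports 10,000 files with no extension and 10,000 "SMTP mail" descriptions — evidence the model can interpret correctly, where the bucketed histogram reported 10,000 source files.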

Prompt-only guidance is not enforcement

The #5 smoke test produced an unexpectedly clean empirical observation: the survey returned skip_tools: ["run_command"], the dir-loop prompt mentioned this, and the dir-loop agent immediately called run_command twice anyway. The model treats prompt instructions as soft preferences and tool availability as hard affordances. This pushed #6's scope from "inject into prompt" to "filter from tool schema." The post-#6 smoke test had zero run_command calls and dropped cost from $0.46 to $0.34.

The context budget metric was a bookkeeping bug, not a real overflow

#44's verification run produced clear numbers: when the loop bailed at "139,427 tokens used," the actual most-recent input_tokens was 20,535 — about 10% of Sonnet's real 200k window. The growth pattern was linear (~1.52k per turn) and at the current max_turns=14 cap nowhere close to overflowing. The fix is one field on the tracker plus a one-line change to budget_exceeded(). Verified empirically before coding the fix, which paid off — saved second-guessing whether to also tackle message-history pruning (it's a real but separate problem, filed as #51).

File count alone is not a sufficient survey-skip gate

The deep-narrow case (4 files spread across 50 directories) breaks file-count-only gating. Files are tiny but the dir loop will run 50 times and benefit from survey framing on each. The honest semantics are AND, not OR: skip only when both files AND dirs are tiny, because dir count is what amortizes the survey's cost across the run. Directories don't carry information for the survey itself — they multiply its downstream value.
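
The gate itself is one line; what matters is the connective. A minimal sketch using the thresholds from #7 (which #46 tracks revisiting empirically — the function name is an assumption):

```python
def should_skip_survey(total_files: int, total_dirs: int) -> bool:
    """Skip the survey only when the target is tiny on BOTH axes (#7).
    OR-semantics would wrongly skip the deep-narrow case (few files,
    many directories), where every dir-loop iteration benefits from
    survey framing and dir count amortizes the survey's cost."""
    return total_files < 5 and total_dirs < 2
```

Under OR-semantics the 4-files-across-50-dirs case would skip; under AND it correctly runs the survey, because 50 dir-loop iterations repay it many times over.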

Unit of analysis is the next big honesty problem

User correctly pushed back on whether filetypes.py should be renamed during #42. Renaming would only move the dishonesty: the real problem is "file" as the unit of analysis. Files are wrong in both directions — Maildir/.git/node_modules over-count (one logical thing, thousands of files), mbox/SQLite/zip/notebook under-count (one file, many logical things). Captured as #48 (Phase 4.5), explicitly out of scope for #42, with the rename happening when the substantive change happens.


Decisions Made

Tri-state tool triage in the survey (relevant / skip / unlisted, not relevant / skip as complements). Rationale: a partition forces every tool into one of two buckets, but most tools should be neither — available if needed, no strong opinion. The unlisted default is the load-bearing choice; the survey only marks skip when use would be actively wrong, not merely unnecessary.
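
The tri-state shape can be made concrete with a small sketch (the enum and helper are illustrative, not the project's code — the survey field names relevant_tools / skip_tools come from the session notes):

```python
from enum import Enum

class Triage(Enum):
    RELEVANT = "relevant"
    SKIP = "skip"
    UNLISTED = "unlisted"   # the load-bearing default: available, no opinion

def triage_of(tool: str, survey: dict) -> Triage:
    """Classify a tool under the tri-state scheme (a sketch). Anything
    the survey doesn't mention is UNLISTED, not implicitly SKIP —
    skip is reserved for tools whose use would be actively wrong."""
    if tool in survey.get("relevant_tools", []):
        return Triage.RELEVANT
    if tool in survey.get("skip_tools", []):
        return Triage.SKIP
    return Triage.UNLISTED
```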

Hard tool filtering, not just prompt injection, for #6. Empirical evidence from #5 showed prompt-only is insufficient. Filtering removes the tool from the API schema entirely; the agent literally cannot call it. Gated on confidence >= 0.5 so the survey can't break the dir loop on a low-confidence wrong call.

{survey_context} as a separate placeholder in _DIR_SYSTEM_PROMPT, not concatenated into {context}. Rationale: per-investigation vs per-directory have different lifetimes and sources; making the structure visible in the prompt template makes future debugging easier; matches the existing {child_summaries} precedent.

Synthetic _default_survey() with confidence=0.0 for the tiny-target skip path. The 0.0 confidence is deliberate: it ensures _filter_dir_tools() never enforces skip_tools from a synthetic value. The dir loop gets a generic "small target, read everything" framing in the prompt and keeps its full toolbox.

Verify-before-fix for #44. Added a temporary print statement, ran a real smoke test, read the actual numbers, then changed the code. Confirmed the bookkeeping diagnosis before touching production code. Cost ~$0.50 in API spend, saved the cost of building the wrong fix and then having to redo it.

MAX_CONTEXT = 200_000 (Sonnet 4 real), max_turns = 14 left as-is. The budget fix gives the loop honest headroom. max_turns stays as a sanity bound — even on small targets the agent should produce submit_report long before turn 14. Documented in the wiki so future readers know it's intentional, not a leftover.

Issue sequencing. #44 → Phase 2.5 (must fix before Phase 3 adds longer prompts). #42 → bundled into Phase 2 (survey can't be honest without it). #46, #49 → end-of-project tuning (advisory, not blocking). #48 → Phase 4.5 (substantive new capability, after Phase 4 external tools). #51 → post-Phase-3 (only matters if max_turns is raised).


Raw Thinking

Smoke testing discipline paid off twice

The #5 smoke test discovered the prompt-vs-schema gap that became #6's expanded scope. The #44 verification print discovered that the suspected fix (pruning tool results) wasn't even what was wrong — the metric itself was buggy. In both cases the empirical step was cheap (one API call) and reframed the design before code was written. Worth keeping as a habit. The temptation to "just code the obvious fix" is real, especially on small changes, but the "add a print → run → read numbers → decide" pattern keeps the diagnosis honest.

The classifier bias and the unit-of-analysis problem are nested, not the same

#42 fixes "the bucketing inside per-file classification is biased." #48 fixes "the per-file unit is wrong for many real targets." It would have been easy to conflate them — both manifest as "Luminos describes a Maildir as source code" — but they're different fixes at different layers. #42 makes the survey honest about what files are. #48 makes the whole pipeline honest about what the unit of analysis is. Splitting them was the right call; bundling them would have produced a much bigger PR with weaker rationale and pushed the release further.

The "Band-Aid then proper fix" pattern worked cleanly

The bias warning in _SURVEY_SYSTEM_PROMPT shipped with #5, lived through #6 and #7, and was removed by #42. Three commits with a known-acknowledged compromise in place is fine when the followup is filed and sequenced. The Band-Aid was visible (an explicit IMPORTANT: paragraph that no one would mistake for permanent), and the Maildir smoke test in #42 was the "yes the proper fix actually works" gate before pulling it. Worth doing this on purpose more often instead of insisting every change be the final form.

Cost of #6 was negative

Filtering run_command out of the dir-loop tool schema for luminos_lib dropped cost from $0.46 to $0.34 (~26%) and partially masked #44 (the loop stopped tripping the buggy budget because it ran fewer turns). This is a useful instance of "fixing the right thing makes other unrelated symptoms go away." The corollary: if I'd jumped to "fix the budget" without first shipping #6, I'd have spent effort on #44 against a baseline that was about to change anyway. Order of operations matters more than I usually credit.

Tool calls outside dedicated tools keep biting on JSON quoting

Twice this session a curl invocation with embedded JSON broke on bash-level quoting (parens and apostrophes inside heredocs). The pattern that works reliably: write a one-shot Python script to /tmp/, run it, delete it. Worth defaulting to this for any Forgejo API call with multi-paragraph bodies. Capturing this as a personal habit, not opening an issue.
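
The habit can be captured in a few lines: build the JSON in Python with json.dumps and feed it to curl on stdin, so the body never touches bash quoting at all. This is a sketch of the pattern, not an existing helper — the function name and the Forgejo auth-header shape are assumptions.

```python
import json

def forgejo_comment_cmd(url: str, token: str, body: str):
    """Build a curl argv plus its stdin payload for a JSON API call
    (a sketch). json.dumps escapes apostrophes, parens, and blank
    lines that break heredocs; `--data @-` reads the body from stdin,
    keeping it out of the shell entirely."""
    payload = json.dumps({"body": body})
    cmd = ["curl", "-sS", "-X", "POST", url,
           "-H", f"Authorization: token {token}",
           "-H", "Content-Type: application/json",
           "--data", "@-"]
    return cmd, payload

# Usage (via subprocess, never via a shell string):
#   cmd, payload = forgejo_comment_cmd(api_url, token, multi_paragraph_text)
#   subprocess.run(cmd, input=payload, text=True, check=True)
```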

The dir loop is starting to get crowded

Survey context, child summaries, per-dir context, tool schemas, confidence rubrics, step numbering, efficiency rules, cache schemas, and now the survey-injected relevant_tools / skip_tools lists are all in _DIR_SYSTEM_PROMPT. It's still readable but the cohesion is starting to thin. When Phase 3 adds investigation plans, the prompt will need a structural pass. Not now — but worth flagging for the Phase 3 retrospective.


What's Next

Phase 2 + 2.5 are complete. Phase 3 (investigation planning) is the next major milestone:

  • #8 — Add _PLANNING_SYSTEM_PROMPT to prompts.py (prompt-only, low risk, mirrors how #4 started Phase 2)
  • #9 — Implement planning pass with submit_plan tool in ai.py
  • #10 — Implement dynamic turn allocation from plan output
  • #11 — Save investigation plan to cache for resumed runs

Start with #8 for the same reason #4 was the right entry to Phase 2: prompt-only changes are the cheapest way to get aligned on an approach before investing in agent loop wiring.

Sidequests still on the board

  • #38 — Cache invalidation by mtime (small, can drop in any time)
  • #46 — Revisit survey thresholds (deferred to end-of-project)
  • #48 — Phase 4.5 unit-of-analysis (don't start before Phase 4)
  • #49 — Terminal report bias (end-of-project tuning)
  • #51 — Dir loop history eviction (only matters when max_turns is raised in Phase 3+)

Pickup point for next session

  • Branch: main
  • File: luminos_lib/prompts.py
  • In progress: nothing — Phase 2 + 2.5 are clean
  • First action: start #8 — analyze, plan, explain, then write _PLANNING_SYSTEM_PROMPT. Same pattern as #4. The survey output is now a known-quality input that the planning pass can build on.