Table of Contents

Session 9 Notes — 2026-04-11

What We Set Out to Do
What Actually Happened

Scope shift (#64)
#57: dir loop refactor
#56: tool registration consolidation
#55: pure-helper test coverage, wave 1
#70: pure-helper test coverage, wave 2
Phase 3 prep recommendations and #72

Key Decisions & Reasoning
Surprises & Discoveries
Concerns & Open Threads
Raw Thinking
What's Next

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Session 9 Notes — 2026-04-11

What We Set Out to Do

Open question. The session started with a State-of-the-App summary request, which surfaced two threads: (1) a scope shift the user had been mulling, and (2) the existing Phase 3 prerequisites (#55, #56, #57) blocking Phase 3 proper. Neither was named as the goal up front — they emerged from the conversation and stacked.

What we shipped, in execution order:

#64 — AI investigation is the product, drop zero-dep constraint, delete watch mode
#57 — Refactor _run_dir_loop into three focused helpers
#56 — Single-source tool registration via register_tool()
#55 — Unit test coverage for pure helpers in ai.py (wave 1)
#70 — Test coverage wave 2: _TokenTracker, _synthesize_from_cache, _discover_directories
#72 — Document the leaf-first investigation contract in Internals.md

Six issues, six PRs (or wiki commits), 234 tests, four open issues closed plus the new ones, all in one continuous session.

What Actually Happened

Scope shift (#64)

The user opened with "I would like to make a couple scope changes" — drop the zero-dep constraint, make AI investigation the main show. We worked through the framing in conversation: option (a) AI-default with --no-ai escape hatch, or option (b) AI-only with the base scan purely internal. User picked (b). Watch mode was deleted as part of the same change because a non-AI churn monitor conflicts with the new philosophy.

Reading the code first turned up two things that made the change much smaller than expected. First, ai.py and ast_parser.py already did top-level imports of anthropic/magic/tree_sitter — the "lazy deps" pattern lived only in luminos.py's if args.ai: gate. Removing that gate WAS the entire technical change. Second, capabilities.py was almost dead weight: only clear_cache() was load-bearing, and only because it knew about CACHE_ROOT (which already lived in cache.py). One import move and the whole module could be deleted.

PR #65 landed 11 file changes: deleted watch.py, deleted capabilities.py, deleted tests/test_capabilities.py, moved clear_cache() into cache.py, rewrote luminos.py to make AI mandatory, dropped check_ai_dependencies() from ai.py, added requirements.txt, updated setup_env.sh, and rewrote README.md / CLAUDE.md / PLAN.md to match. Wiki updates landed in a separate commit on the wiki repo. PR #66 was the matching session-log bump.

The graceful exit case mattered enough to call out: first draft checked ANTHROPIC_API_KEY after the base scan ran. That makes the user wait through a multi-second scan only to be told they can't use the result. Moved to top of main(), after target validation but before scan(). Verified by running unset ANTHROPIC_API_KEY && python3 luminos.py /tmp and observing a clean exit 0 with the hint.

#57: dir loop refactor

After the scope change shipped, the natural next move was the Phase 3 prerequisites. Picked #57 first because it was the structural one that everything else benefits from.

_run_dir_loop was ~160 lines holding four conceptual layers: pre-loop setup, budget check + partial-flush (~57 lines, the largest single block), API call + response printing, and tool dispatch + done detection. Phase 3 dynamic turn allocation will inject more state into the same code path, so the refactor had to land first.

I read the code carefully before designing the helpers. The cleanest split turned out to be three: _build_dir_loop_context() (pure setup, returns a _DirLoopContext namedtuple), _flush_partial_dir_entry() (idempotent partial-cache writer for the budget-exceeded path), and _handle_turn_response() (per-turn response processing — print, append, dispatch). The new _run_dir_loop body is ~25 lines.

PR #67 shipped clean, 164 tests passed unchanged. Internals.md §4 was updated to reflect the new structure and the file:line refs that drifted.

#56: tool registration consolidation

The user explicitly chose "fix now" over "defer to Phase 3.5 when MCP will replace this anyway." Reasoning: easier to migrate one well-structured registry to MCP than to migrate two parallel structures.

Before designing, I had to recognize a complication: the existing tool tables aren't a single registry. There are three scope lists (_DIR_TOOLS, _SYNTHESIS_TOOLS, _SURVEY_TOOLS) AND one global _TOOL_DISPATCH dict. Some tools (like flag) appear in multiple scopes with the same schema. Some tools (like submit_report) appear in multiple scopes with different schemas. And submit_report / submit_survey have schemas but no dispatch entry because the loop body intercepts them.

Final design: a register_tool(name, description, schema, scopes, handler=None) function. Single source of truth per (tool, scope) pair. Tools in multiple scopes get multiple register_tool() calls to preserve order (otherwise the order in the second scope drifts relative to other tools).

PR #68 was 399 insertions / 344 deletions. Runtime introspection confirmed identical scope contents and identical 10-entry dispatch table. 164 tests still passed unchanged. Internals.md §4.2 and §9.1 shrunk: §9.1 went from a 5-step "don't forget the second half" process to a 4-step process with one obvious place to look.

#55: pure-helper test coverage, wave 1

The user said "No one likes doing tests but we need them." Picked the issue's seven targets and added one bonus from #57 (_flush_partial_dir_entry).

Used the _make_manager() pattern from tests/test_cache.py to construct a _CacheManager rooted in a tempdir, sidestepping CACHE_ROOT entirely. 45 tests across 8 helpers. One test had a typo in an asserted substring on the first run — the actual partial reason string is "context budget reached before files processed", not "before any files" — caught and fixed in 30 seconds. 209 total tests after PR #69.

The two notable behaviors pinned: _filter_dir_tools threshold gate is strict < (the boundary case where confidence equals the threshold passes the gate), and _path_is_safe correctly rejects sibling-with-target-prefix (/tmp/foo vs /tmp/foo_sibling — the easy-to-miss path traversal case).

#70: pure-helper test coverage, wave 2

I noticed the wave 1 picks left out three high-impact helpers: _TokenTracker, _synthesize_from_cache, _discover_directories. Pitched them as "low effort, high impact." User agreed and asked me to file an issue, insert it into the roadmap before Phase 3, and ship.

Reading _TokenTracker corrected my issue draft: I had written reset_loop() "preserves last_input" — actually it zeroes last_input along with the loop counters. The test pins the real behavior. I also discovered the record() method (not record_usage() as I'd written in the issue), and that SimpleNamespace works as a fake usage object because the function uses getattr(usage, "input_tokens", 0).

The load-bearing test in this batch is the budget-exceeded check under cumulative-input pressure: record 10 calls each with input_tokens = CONTEXT_BUDGET // 5, so total cumulative is 2x the budget but last_input stays at 1/5 of budget. Assert that budget_exceeded() returns False. This is exactly the #44 fix condition — if anyone regresses to "exceeded if cumulative > budget," this test screams.

_synthesize_from_cache only reads dir entries (not file entries) — worth pinning explicitly so a future maintainer doesn't add file entries thinking they should appear in the fallback report.

_discover_directories tests now pin: leaves-first ordering, skip list (.git, __pycache__, node_modules, *.egg-info), custom exclude, hidden dirs by default, and the subtle show_hidden=True case where the skip list still applies (.git stays out even with hidden visible).

PR #71 added 25 tests, 234 total. PLAN.md got restructured: new Phase 2.7 (#56 ✅) and Phase 2.8 (#55 ✅, #70) entries, the stale Phase 3.4 (#56) and "Background chore" (#55) sections deleted since they were displaced by the pre-Phase-3 cleanup pattern.

Phase 3 prep recommendations and #72

After the four pre-reqs were done, the user asked what else I'd recommend before starting Phase 3 ("phase 3 is a biggie and I want it to have a solid base"). I came back with three picks: end-to-end smoke test, design sketch for the planning pass, and document the leaf-first contract.

User responded: smoke test already done externally (looks fine); design sketch deferred to Phase 3 task 1 (intent matched, timing disagreement); leaf-first contract — make it so.

The leaf-first contract issue (#72) is wiki-only, no code. Added a new §4.7 to Internals.md explaining that _discover_directories() returns leaves-first as a load-bearing invariant, that _get_child_summaries() silently depends on it, and that the (none — this is a leaf directory) placeholder LIES if the children just haven't been investigated yet — the agent has no way to know. Two safe paths if Phase 3 changes the order: preserve leaf-first within priority bands, or rewrite the placeholder to be honest. First draft accidentally inserted §4.7 before §4.6 in the file; caught on re-read, swapped, committed.

Key Decisions & Reasoning

Scope shift went with option (b), not (a). AI-only with the base scan purely internal. Reasoning: keeping --no-ai would have meant maintaining two CLI surfaces and two documentation paths for what the philosophy says is one product. Cleaner story.
Delete watch mode rather than park it. Parking would have required explaining in docs why one feature ignored the AI-first philosophy. PLAN.md already notes that watch comes back as incremental AI re-investigation if it comes back at all.
Delete --ai cleanly, no deprecation. Per global CLAUDE.md ("no backwards-compat shims when you can just change the code"). Personal project, no external users to deprecate against.
Graceful exit on missing API key, exit 0 not exit 1. Missing key is a user-fixable configuration state, not an error. Exit 0 + hint reads as "here's what you need to do," not "something broke."
Fix #56 now rather than defer to Phase 3.5. User chose this explicitly. The structure introduced (one registry call per (tool, scope) pair) is naturally MCP-shaped, so the eventual MCP migration collapses to "replace register_tool() with a server call."
Test coverage in two waves rather than one batch. Wave 1 (#55) shipped first with the issue's stated targets. Then I noticed three more high-impact helpers were uncovered, pitched them, and the user greenlit a wave 2 (#70). Splitting kept each PR cohesive and reviewable.
Phase 3 design sketch deferred to Phase 3 task 1. I recommended it as Phase 3 prep. User overrode: "agree on intent, disagree on timing." Result: the design sketch is now bookkept as the first thing Phase 3 does, not as a separate prep cycle. Cleaner if Phase 3 has the design fresh in mind when the rest of the work starts.
One commit per PR for the scope change, not split into code + docs commits. The two are tightly coupled — splitting would create a half-broken state in commit 1 where code says one thing and docs say another. Same logical change.

Surprises & Discoveries

The lazy-deps story was thinner than expected. ai.py and ast_parser.py already did top-level imports of the AI packages. The "lazy" pattern lived only in the CLI gate. Removing the gate WAS the technical change for the entire scope shift. Lesson: when a constraint feels heavier in docs than in code, check whether the code is actually enforcing it.
capabilities.py was almost dead weight. Only clear_cache() was load-bearing, and even that only because of the CACHE_ROOT reference. We'd been paying a tax for the lazy-deps story that the code wasn't actually charging.
_TOOL_DISPATCH and _DIR_TOOLS had a name collision case. submit_report appears in both _DIR_TOOLS and _SYNTHESIS_TOOLS with different schemas. The new registry handles this with two register_tool() calls per scope, but the existence of the collision wasn't obvious until I read the code.
_TokenTracker.reset_loop() zeroes last_input. My #70 issue draft assumed it preserved last_input across resets. The actual code doesn't. Reading the code corrected the test plan before any test was wrong. Always read the code before writing the spec.
_synthesize_from_cache reads dir entries only. I had assumed it would also pull file entries in some "even more degraded" case. It doesn't. The fallback is dir-only or nothing.
The graceful exit had to fire before the base scan, not after. First draft put it after. Caught it in the writing stage, not in testing — but worth noting because the same pattern can sneak into other early-exit checks.

Concerns & Open Threads

Phase 3 design sketch is bookkept as Phase 3 task 1, not done yet. This is the highest-priority unresolved thread. The planning pass touches many things (cache schema, dir loop orchestration, max_turns propagation, plan persistence, survey interaction, resume semantics, optional global token budget) and hand-rolling the design while implementing leads to drift. Make sure Phase 3 actually starts with the design sketch.
The leaf-first contract is documented but only loosely enforced. TestDiscoverDirectories pins the ordering, but there's no test that asserts "dirs are processed in the order they come out of _discover_directories" — the orchestrator could re-sort silently and the test wouldn't catch it. Phase 3 will introduce alternative orderings; this gap matters.
Token budget arithmetic for Phase 3 is still a known unknown. PLAN.md flags it: "How does the agent 'request more turns'?" The current _TokenTracker is per-loop with grand totals for cost. There's no concept of "we've spent X out of Y on this whole investigation." If Phase 3 dynamic turn allocation needs that, it has to grow it explicitly.
No live integration smoke test from this session. The user ran one externally and confirmed it works, but the assistant didn't observe it. If a regression slipped through, we'd find out at the start of Phase 3 or later. The unit tests are 234 strong but they don't cover the full pipeline end-to-end.
Six PRs in one session is a lot of merge commits on main. Not a problem per se, but if a regression bisects to "somewhere in Session 9" the bisect surface is wider than usual. Worth noting for the next session retro.
Wiki-only changes (#72) work fine via direct commits to wiki main. The pattern is established; future doc-only work can follow it without ceremony.

Raw Thinking

The pre-Phase-3 cleanup pattern (#54 → #57 → #56 → #55 → #70 → #72) is worth naming as a paradigm: "pay debts in the area before adding new state to that area." Phase 2.6, 2.7, 2.8 in PLAN.md reflect this. Could be applied generally to any large milestone: inventory the helpers it'll touch, refactor + test them first, then add the new work on top of a known-good foundation.
The State-of-the-App summary at session start was useful framing. It surfaced which threads were on the table, which were blocked, and which had decision points pending. Worth doing more often, especially at the start of long sessions or sessions that start with "what's left."
_TokenTracker test count (11) was higher than I initially scoped. Once I started enumerating edge cases (boundary, defaults, multiple loops, reset semantics, the load-bearing #44 case) the count grew naturally. Good unit tests don't shrink. They accrete.
The register_tool() design is naturally MCP-shaped. A registry of (name, schema, scopes, handler) is exactly what an MCP tool list looks like. When Phase 3.5 lands, register_tool() can collapse to a one-line forward to the connected MCP server's tools/list response, and the migration touches almost nothing else. This was unintentional but lucky.
The session was unusually productive — 6 PRs, 5 issues filed, 4 issues closed, 70 net new tests, 4 wiki page updates — because each piece of work unblocked the next and the user kept the decisions decisive. The TaskCreate breakdowns helped, but the real speedup was that nothing was ambiguous when execution started. When the user redirects with a single sentence ("fix now," "delete it," "make it so"), the loop doesn't have to stop to re-confirm.
"Documentation is work" — #72 was a quick experiment in shipping a doc-only issue with the same workflow as code. Worked fine. Pattern is repeatable for other cross-cutting concerns: contracts, invariants, design decisions that aren't enforced anywhere except in human heads.

What's Next

In priority order:

Phase 3 task 1: write the planning pass design sketch. Deferred from this session. ~30-45 minutes, no code. Cover the submit_plan schema, plan storage in cache, max_turns propagation, skip-dir semantics, survey-output integration, resume semantics, and the optional global token budget question. Land in PLAN.md or a new wiki page before any Phase 3 code is cut.
Phase 3 implementation: #19–#29 cluster. Planning pass after survey, before dir loops; submit_plan tool; dynamic turn allocation based on plan output; dir loop orchestrator updated to follow the plan. Multi-PR, probably multi-session.
Phase 3.5: MCP backend abstraction (#39). The pivot point. After Phase 3 is working, before Phase 4. The register_tool() refactor from #56 makes this much easier than it would have been.
Phase 4+: external knowledge tools, scale-tiered synthesis, hypothesis-driven synthesis, refinement, dynamic report structure. The full backlog from PLAN.md.

When Phase 3 starts: re-read PLAN.md Part 4 (Investigation Planning) and Internals.md §4.7 (the leaf-first contract) before designing. The contract WILL be tempting to violate; the design sketch has to address it explicitly.