Second wave of pre-Phase-3 test coverage. The #55 round picked off the
easy decision-logic helpers; this round covers the three highest-impact
helpers that escaped the first sweep.
Three new test classes appended to tests/test_ai_pure.py:
- TestTokenTracker (11 tests)
Pins the load-bearing #44 fix: budget_exceeded() must use last_input
(the most recent call's context size) NOT cumulative input, because
each turn's input_tokens already includes the full message history.
Tests assert: cumulative-input far above budget does NOT trip the
gate when last_input stays small; reset_loop() preserves grand
totals; the boundary is strict > not >=.
- TestSynthesizeFromCache (5 tests)
The synthesis fallback fires only when _run_synthesis exhausts its
max_turns, which almost never happens in normal runs — exactly the
kind of code that silently rots. Tests assert: empty cache returns
the incomplete-message brief and empty detailed; single dir entry
produces a markdown line; multi-entry detailed contains all entries;
empty-summary entries are skipped; file entries alone do not satisfy
(the function reads dir entries only).
- TestDiscoverDirectories (9 tests)
The leaves-first walk drives the entire dir-loop iteration order
and is the foundation of the cache reuse story. Tests assert:
empty target returns target only; nested trees come back
leaves-first; .git / __pycache__ / node_modules / *.egg-info
excluded; custom --exclude honored; hidden dirs excluded by default;
show_hidden=True includes them but does not override the skip list.
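A minimal sketch of a leaves-first walk with the exclusion rules these tests pin. The function name, its signature, and the SKIP_PATTERNS tuple are illustrative assumptions, not the actual _discover_directories code:

```python
import fnmatch
import os

# Assumed skip list, taken from the names mentioned in the test summary.
SKIP_PATTERNS = (".git", "__pycache__", "node_modules", "*.egg-info")


def discover_directories_sketch(target, show_hidden=False, exclude=()):
    """Return target and its subdirectories, deepest directories first."""
    patterns = SKIP_PATTERNS + tuple(exclude)
    found = []

    def visit(path):
        for name in sorted(os.listdir(path)):
            child = os.path.join(path, name)
            if not os.path.isdir(child) or os.path.islink(child):
                continue
            # The skip list always wins, even when show_hidden is True.
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                continue
            if name.startswith(".") and not show_hidden:
                continue
            visit(child)
        # Post-order: a directory is appended only after all of its children.
        found.append(path)

    visit(target)
    return found
```

Because the target is appended last, an empty target yields a one-element list containing only the target itself.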
PLAN.md: added Phase 2.7 (#56✅) and Phase 2.8 (#55✅, #70) entries
to the implementation order, and removed the now-stale Phase 3.4 (#56)
and Background chore (#55) sections that were displaced by the
pre-Phase-3 cleanup pattern.
Verification: 234 tests pass (209 prior + 25 new).
ai.py was documented as fully exempt from unit testing because the dir
loop and synthesis pass require a live Anthropic API. But several
helpers in the module are pure functions with no API dependency, and
they're the kind of thing that breaks silently. The #57 refactor added
two more (_build_dir_loop_context, _flush_partial_dir_entry) that are
also naturally testable.
New tests/test_ai_pure.py — 45 tests across 8 helpers:
- _should_skip_dir: exact-match, *.egg-info glob, no-match cases
- _path_is_safe: inside, nested, equals, outside, traversal,
sibling-with-target-prefix (the easy-to-miss security case)
- _default_survey: shape, zero confidence guarantees no filtering,
passes through _filter_dir_tools unchanged
- _format_survey_block: None, empty, minimal, with relevant_tools,
with skip_tools, with domain_notes, empty-list omission
- _filter_dir_tools: None, empty, low confidence, high confidence
filters, protected tools never removed, unknown skip silently
ignored, garbage/None confidence treated as zero, threshold
boundary inclusive
- _format_survey_signals: None, empty, zero total_files, full,
partial (only extensions)
- _block_to_dict: text, tool_use, unknown type
- _flush_partial_dir_entry (#57): idempotent when entry exists,
no-file-entries stub path, with-file-entries summary synthesis,
notable_files collection
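The sibling-with-target-prefix case listed for _path_is_safe is the one a naive string-prefix check gets wrong. A hedged sketch of a component-wise containment check follows; the function name and argument order are assumptions, and the real helper may be implemented differently:

```python
import os


def path_is_safe_sketch(path, target):
    """Return True if path resolves to target itself or something inside it."""
    real_target = os.path.realpath(target)
    # Relative paths are resolved against the target; absolute paths pass
    # through os.path.join unchanged.
    real_path = os.path.realpath(os.path.join(real_target, path))
    # commonpath compares whole path components, so ".../proj-evil" is not
    # treated as being inside ".../proj" the way a startswith() check would.
    return os.path.commonpath([real_target, real_path]) == real_target
```

With a target of /data/proj, a plain "/data/proj-evil".startswith("/data/proj") test returns True, which is why the sibling case is worth pinning in a test.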
Uses the same _make_manager() pattern as test_cache.py to construct
a _CacheManager rooted in a tempdir, sidestepping CACHE_ROOT entirely.
Doc updates:
- CLAUDE.md, README.md, docs/wiki/DevelopmentGuide.md: ai.py is no
longer fully exempt — only the API-dependent loops are. Pure
helpers are covered by test_ai_pure.py.
Verification: 209 tests pass (164 prior + 45 new).
Adding a tool used to require updating two parallel structures in
ai.py: a name->handler entry in _TOOL_DISPATCH and a schema dict in
_DIR_TOOLS (or _SYNTHESIS_TOOLS or _SURVEY_TOOLS). Forgetting one half
was silent. Internals.md §9.1 documented this as a 5-step process.
Replaced both with a single register_tool() call per (tool, scope):
    register_tool(
        name="read_file",
        description="...",
        schema={...},
        scopes=["dir"],
        handler=_tool_read_file,
    )
The function appends the schema to one or more scope lists
(_DIR_TOOLS / _SYNTHESIS_TOOLS / _SURVEY_TOOLS) and lands the handler
in _TOOL_DISPATCH. Tools intercepted by the loop body (submit_report,
submit_survey) register schema only with handler=None.
Tools whose schema differs by scope (submit_report has different shapes
in dir vs synthesis loops) get one register_tool() call per scope.
The flag tool is also registered twice because it appears in both the
dir and synthesis lists at different positions; two calls preserve the
original order rather than reordering for fewer calls.
Verification:
- _DIR_TOOLS, _SYNTHESIS_TOOLS, _SURVEY_TOOLS contain the same names
in the same order as before.
- _TOOL_DISPATCH contains the same 10 handlers as before.
- 164 tests pass.
No behavior change. Phase 3.5 (#39) MCP backend will eventually replace
this with dynamic discovery from the connected MCP server, at which
point register_tool() collapses to a one-line forward.
_run_dir_loop was ~160 lines holding four conceptual layers in one
function: pre-loop setup, budget check + partial-flush, API call +
response printing, and tool dispatch + done detection. Phase 3 dynamic
turn allocation will inject more state into the same code path, so
this debt is paid before that lands.
Three new helpers above _run_dir_loop:
- _build_dir_loop_context(): pure setup. Builds the dir context, child
summaries, survey block, filtered tool list, system prompt, and seed
user message. Returns a _DirLoopContext namedtuple.
- _flush_partial_dir_entry(): idempotent partial-cache writer for the
budget-exceeded path. Returns the partial summary string. Idempotent
via cache.has_entry() guard, so callers can call it without checking.
- _handle_turn_response(): per-turn response processing. Prints text
blocks and tool decisions, appends the assistant message, dispatches
tools (or nudges the agent to call submit_report), appends
tool_results. Returns (done, summary).
_run_dir_loop is now a ~25-line coordinator: it builds the context,
then loops, calling the budget check, the API, and the turn handler in
sequence.
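The coordinator shape can be sketched with the four layers passed in as callables. Everything here is hypothetical scaffolding (the function name, the parameters, and what happens when max_turns runs out are assumptions); it only illustrates the control flow of budget check, API call, then turn handler:

```python
def run_dir_loop_sketch(build_context, budget_exceeded, flush_partial,
                        call_api, handle_turn, max_turns):
    """Coordinate one directory investigation using injected helpers."""
    ctx = build_context()
    for _ in range(max_turns):
        # Budget check first: on overflow, write a partial entry and stop.
        if budget_exceeded():
            return flush_partial(ctx)
        response = call_api(ctx)
        # The turn handler reports whether the agent submitted its report.
        done, summary = handle_turn(ctx, response)
        if done:
            return summary
    # Assumption: running out of turns yields no summary in this sketch.
    return None
```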
No behavior change. 164 tests pass. Internals.md §4 updated for the
new structure and the file:line refs that drifted.
Two original design constraints are dropped:
1. Zero-dependency Python CLI is no longer a goal. Luminos installs from
requirements.txt like a normal Python project.
2. AI investigation is the headline. The base scan becomes the agent's
first input pass, not a standalone product. There is no --ai flag and
no --no-ai mode. AI runs unconditionally on every invocation.
Watch mode is deleted as part of the same change because a non-AI
filesystem-churn monitor conflicts with the new philosophy. If a live
update mode is wanted later, it gets rebuilt as incremental AI
re-investigation.
Code:
- Delete luminos_lib/watch.py
- Delete luminos_lib/capabilities.py and tests/test_capabilities.py
- Move clear_cache() into luminos_lib/cache.py
- luminos.py: remove --watch, --ai, --install-extras flags. AI runs
unconditionally after the base scan. If ANTHROPIC_API_KEY is unset,
exit 0 with a one-line hint before running the base scan.
- ai.py: drop the check_ai_dependencies() call and import.
- New requirements.txt: anthropic, tree-sitter + grammars, python-magic.
- setup_env.sh installs from requirements.txt.
Docs:
- README.md rewritten to lead with AI investigation, drops the two-modes
framing and the watch feature line.
- CLAUDE.md (project): rewrites Key Constraints, updates module map and
Running Luminos commands.
- PLAN.md: strips zero-dep philosophy from the file map and reframes the
watch+incremental note as a future live-mode feature.
Tests: 164 pass (down from 168 with the 4 removed capabilities tests).
Preparing luminos for a public GitHub mirror of the Forgejo source of
truth. README covers what Luminos is, why, features, installation for
base mode and AI mode, usage examples, how the AI investigation works,
and a link back to the canonical Forgejo repo. LICENSE is the standard
Apache 2.0 text.
The system prompt already instructs the agent to set confidence/
confidence_reason on every write_cache call, but the tool's data
schema description listed only the legacy fields. Add the confidence
fields and a one-line calibration pointer so the model sees them
when binding the tool, not just in the system prompt.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move Development Workflow, Branching Discipline, Documentation Workflow,
ADHD Session Protocols, and Session Protocols out of the project CLAUDE.md
and into the global one so all projects share them. Move docs/externalize.md
and docs/wrap-up.md to ~/.claude/protocols/ (lightly generalized). Project
CLAUDE.md keeps only luminos-specific state, module map, constraints,
naming, test command, and session log.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dir loop was exiting early on small targets (a 13-file Python lib
hit the budget at 92k–139k cumulative tokens) because _TokenTracker
compared the SUM of input_tokens across all turns to the context
window size. input_tokens from each API response is the size of the
full prompt sent on that turn (system + every prior message + new
tool results), so summing across turns multi-counts everything. The
real per-call context size never approached the limit.
Verified empirically: on luminos_lib pre-fix, the loop bailed when
the most recent call's input_tokens was 20,535 (~10% of Sonnet's
200k window) but the cumulative sum was 134,983.
Changes:
- _TokenTracker now tracks last_input (the most recent call's
input_tokens), separate from the cumulative loop_input/total_input
used for cost reporting.
- budget_exceeded() returns last_input > CONTEXT_BUDGET, not the
cumulative sum.
- MAX_CONTEXT bumped from 180_000 to 200_000 (Sonnet 4's real
context window). CONTEXT_BUDGET stays at 70% = 140,000.
- Early-exit message now shows context size, threshold, AND
cumulative spend separately so future debugging is unambiguous.
Smoke test on luminos_lib: investigation completes without early
exit (~$0.37). 6 unit tests added covering the new semantics,
including the key regression: a sequence of small calls whose sum
exceeds the budget must NOT trip the check.
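A minimal sketch of the new semantics, assuming a record() method and a reset_loop() that clears only per-loop state (both assumptions; the attribute names and constants follow the text above):

```python
MAX_CONTEXT = 200_000
CONTEXT_BUDGET = MAX_CONTEXT * 70 // 100  # 70% of the window = 140,000


class TokenTrackerSketch:
    """Tracks the latest call's context size apart from cumulative spend."""

    def __init__(self):
        self.last_input = 0   # input_tokens of the most recent call
        self.loop_input = 0   # cumulative for the current loop (cost only)
        self.total_input = 0  # grand total across loops (cost only)

    def record(self, input_tokens):
        # Each call's input_tokens already includes the full message
        # history, so only the latest value reflects context size.
        self.last_input = input_tokens
        self.loop_input += input_tokens
        self.total_input += input_tokens

    def reset_loop(self):
        # Assumption: per-loop state is cleared, grand totals are kept.
        self.last_input = 0
        self.loop_input = 0

    def budget_exceeded(self):
        # Strict comparison against the latest call, not the running sum.
        return self.last_input > CONTEXT_BUDGET
```

Ten calls of 20,000 input tokens each sum to 200,000, well past the 140,000 budget, yet the check stays False because no single call's context exceeded it.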
Wiki Architecture page updated.
#51 filed for the separate message-history-growth issue.
The survey pass no longer receives the bucketed file_categories
histogram, which was biased toward source-code targets and would
mislabel mail, notebooks, ledgers, and other non-code domains as
"source" via the file --brief "text" pattern fallback.
Adds filetypes.survey_signals(), which assembles raw signals from
the same `classified` data the bucketer already processes — no new
walks, no new dependencies:
    total_files          total count
    extension_histogram  top 20 extensions, raw, no taxonomy
    file_descriptions    top 20 `file --brief` outputs, by count
    filename_samples     20 names, evenly drawn (not first-20)
`file --brief` descriptions are truncated at 80 chars before
counting so prefixes group correctly without exploding key cardinality.
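A rough sketch of how such signals could be assembled in one pass. The input shape (a list of dicts with "name" and "description" keys), the "(none)" label for extensionless files, and the function name are assumptions; the real `classified` structure may differ:

```python
import os
from collections import Counter


def survey_signals_sketch(classified, top_n=20, sample_n=20, max_desc=80):
    """Build raw survey signals from already-classified file records."""
    extensions = Counter()
    descriptions = Counter()
    names = []
    for item in classified:
        name = item["name"]
        names.append(name)
        ext = os.path.splitext(name)[1].lower()
        extensions[ext or "(none)"] += 1
        # Truncate before counting so long descriptions that share a
        # prefix land in the same bucket.
        descriptions[item.get("description", "")[:max_desc]] += 1

    # Even-stride sampling: spread picks across the whole list rather
    # than taking the first sample_n names.
    if len(names) <= sample_n:
        samples = list(names)
    else:
        stride = len(names) / sample_n
        samples = [names[int(i * stride)] for i in range(sample_n)]

    return {
        "total_files": len(names),
        "extension_histogram": dict(extensions.most_common(top_n)),
        "file_descriptions": dict(descriptions.most_common(top_n)),
        "filename_samples": samples,
    }
```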
The Band-Aid in _SURVEY_SYSTEM_PROMPT (warning the LLM that the
histogram was biased toward source code) is removed and replaced
with neutral guidance on how to read the raw signals together.
The {file_type_distribution} placeholder is renamed to
{survey_signals} to reflect the broader content.
luminos.py base scan computes survey_signals once and stores it on
report["survey_signals"]; AI consumers read from there.
summarize_categories() and report["file_categories"] are unchanged:
the terminal report still uses the bucketed view (#49 tracks that
follow-up fix).
Smoke tested on two targets:
- luminos_lib: identical-quality survey ("Python library package",
confidence 0.85), unchanged behavior on code targets.
- A synthetic Maildir of 8 messages with `:2,S` flag suffixes:
survey now correctly identifies it as "A Maildir-format mailbox
containing 8 email messages" with confidence 0.90, names the
Maildir naming convention in domain_notes, and correctly marks
parse_structure as a skip tool. Before #42 this would have been
"8 source files."
Adds 8 unit tests for survey_signals covering empty input, extension
histogram, description aggregation/truncation, top-N cap, and
even-stride filename sampling.
#48 tracks the unit-of-analysis limitation (file is the wrong unit
for mbox, SQLite, archives, notebooks) — explicitly out of scope
for #42 and documented in survey_signals' docstring.
#48 captures the unit-of-analysis problem: "file" is the wrong unit
for containers (mbox, SQLite, zip, notebooks) and dense directories
(Maildir, .git, node_modules). Sequenced after Phase 4 as its own
phase since it requires format detection and container handlers.
#49 captures the smaller follow-up that the terminal report still
shows the biased bucketed view. Deferred to end-of-project tuning.
Adds a gate in _run_investigation that skips the survey API call when
a target has both fewer than _SURVEY_MIN_FILES (5) files AND fewer
than _SURVEY_MIN_DIRS (2) directories. AND semantics handle the
deep-narrow edge case correctly: a target with 4 files spread across
50 directories still gets a survey because dir count amortizes the
cost across 50 dir loops.
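The gate reduces to a single conjunction. A sketch using the thresholds named above (the function name is hypothetical):

```python
_SURVEY_MIN_FILES = 5
_SURVEY_MIN_DIRS = 2


def should_skip_survey(total_files, total_dirs):
    """Skip the survey only when the target is small on BOTH axes."""
    return total_files < _SURVEY_MIN_FILES and total_dirs < _SURVEY_MIN_DIRS
```

With OR semantics the 4-files-across-50-directories target would skip the survey; with AND it does not, since 50 is not below the directory threshold.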
When skipped, _default_survey() supplies a synthetic dict with
confidence=0.0 — chosen specifically so _filter_dir_tools() never
enforces skip_tools from a synthetic value. The dir loop receives
a generic "small target, read everything" framing in its prompt and
keeps its full toolbox.
Reorders _discover_directories() to run before the survey gate so
total_dirs is available without a second walk.
#46 tracks revisiting the threshold values with empirical data after
Phase 2 ships and we've run --ai on a variety of real targets.
Smoke tested on a 2-file target: gate triggers, default survey
substituted, dir loop completes normally. Adds 4 unit tests for
_default_survey() covering schema, confidence guard, filter
interaction, and empty skip_tools.
The survey pass now actually steers dir loop behavior, in two ways:
1. Prompt injection: a new {survey_context} placeholder in
_DIR_SYSTEM_PROMPT receives the survey description, approach,
domain_notes, relevant_tools, and skip_tools so the dir-loop agent
has investigation context before its first turn.
2. Tool schema filtering: _filter_dir_tools() removes any tool listed
in skip_tools from the schema passed to the API, gated on
survey confidence >= 0.5. Control-flow tools (submit_report) are
always preserved. This is hard enforcement — the agent literally
cannot call a filtered tool, which the smoke test for #5 showed
was necessary (prompt-only guidance was ignored).
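A hedged sketch of the schema-filtering step in item 2. The function name, the constant names, and the tool-dict shape are assumptions; the 0.5 threshold and the always-preserved submit_report come from the description above:

```python
_PROTECTED_TOOLS = {"submit_report"}  # control-flow tools are never removed
_MIN_CONFIDENCE = 0.5


def filter_dir_tools_sketch(tools, survey):
    """Drop tools named in survey['skip_tools'] when confidence is high enough."""
    if not survey:
        return list(tools)
    try:
        confidence = float(survey.get("confidence") or 0.0)
    except (TypeError, ValueError):
        # Unparseable confidence is treated as zero, so nothing is filtered.
        confidence = 0.0
    # "not >=" also routes NaN to the no-filtering branch.
    if not confidence >= _MIN_CONFIDENCE:
        return list(tools)
    skip = set(survey.get("skip_tools") or []) - _PROTECTED_TOOLS
    # Unknown names in skip simply match nothing.
    return [tool for tool in tools if tool["name"] not in skip]
```

Because a filtered tool never reaches the API's tool list, the agent has no way to call it, which is the hard-enforcement property described above.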
Smoke test on luminos_lib: zero run_command invocations (vs 2 before),
context budget no longer exhausted (87k vs 133k), cost ~$0.34 (vs
$0.46), investigation completes instead of early-exiting.
Adds tests/test_ai_filter.py with 14 tests covering _filter_dir_tools
and _format_survey_block — both pure helpers, no live API needed.
#5 smoke test showed the dir loop exhausts the 126k context budget on
a 13-file Python lib. Sequencing #44 between Phase 2 and Phase 3 so
the foundation is solid before planning + external tools add more
prompt and tool weight.
Adds the reconnaissance survey pass: a fast, ≤3-turn LLM call that
characterizes the target before any directory investigation begins.
The survey receives the file-type distribution (from the base scan),
a top-2-level tree preview, and the list of available dir-loop tools,
and returns description / approach / relevant_tools / skip_tools /
domain_notes / confidence via a single submit_survey tool call.
Wired into _run_investigation() before the directory loop. Output is
logged but not yet consumed — that wiring is #6. Survey failure is
non-fatal: if the call errors or runs out of turns, the investigation
proceeds without survey context.
Also adds a Band-Aid to _SURVEY_SYSTEM_PROMPT warning the LLM that
the file-type histogram is biased toward source code (the underlying
classifier has no concept of mail, notebooks, ledgers, etc.) and to
trust the tree preview when they conflict. The proper fix is #42.
The filetype classifier is biased toward source code and would mislead
the survey pass on non-code targets (mail, notebooks, ledgers). #5
ships with a prompt-level Band-Aid; #42 captures the real fix and is
sequenced after the survey pass is observable end-to-end and before
Phase 3 depends on survey output.
Adds the system prompt for the survey reconnaissance pass. The survey
agent answers three questions (what is this, what approach, which tools
matter) from cheap signals — file type distribution and a top-2-level
tree — without reading files. Tool triage is tri-state: relevant, skip,
or unlisted (default), so skip is reserved for tools whose use would be
actively wrong rather than merely unnecessary.
Wiring of _run_survey() and the submit_survey tool follows in #5.
Returns all file and dir cache entries with confidence below a given
threshold (default 0.7). Entries missing a confidence field are
included as unrated/untrusted. Results sorted ascending by confidence
so least-confident entries come first.
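A minimal sketch of that query over plain dict entries. The function name and entry shape are hypothetical, and placing unrated entries ahead of all rated ones is an assumption, since the text does not say where they sort:

```python
def low_confidence_entries_sketch(entries, threshold=0.7):
    """Return entries below threshold, least confident first."""
    selected = []
    for entry in entries:
        confidence = entry.get("confidence")
        # Entries with no confidence field count as unrated and are included.
        if confidence is None or confidence < threshold:
            selected.append(entry)
    # Assumption: unrated entries sort ahead of every rated entry.
    return sorted(
        selected,
        key=lambda e: (e.get("confidence") is not None,
                       e.get("confidence") or 0.0),
    )
```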
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>