The survey pass no longer receives the bucketed file_categories
histogram, which was biased toward source-code targets and would
mislabel mail, notebooks, ledgers, and other non-code domains as
"source" via the file --brief "text" pattern fallback.
Adds filetypes.survey_signals(), which assembles raw signals from
the same `classified` data the bucketer already processes — no new
walks, no new dependencies:
total_files — total count
extension_histogram — top 20 extensions, raw, no taxonomy
file_descriptions — top 20 `file --brief` outputs, by count
filename_samples — 20 names, evenly drawn (not first-20)
`survey --brief` descriptions are truncated at 80 chars before
counting so prefixes group correctly without exploding key cardinality.
The Band-Aid in _SURVEY_SYSTEM_PROMPT (warning the LLM that the
histogram was biased toward source code) is removed and replaced
with neutral guidance on how to read the raw signals together.
The {file_type_distribution} placeholder is renamed to
{survey_signals} to reflect the broader content.
luminos.py base scan computes survey_signals once and stores it on
report["survey_signals"]; AI consumers read from there.
summarize_categories() and report["file_categories"] are unchanged
— the terminal report still uses the bucketed view (#49 tracks
fixing that follow-up).
Smoke tested on two targets:
- luminos_lib: identical-quality survey ("Python library package",
confidence 0.85), unchanged behavior on code targets.
- A synthetic Maildir of 8 messages with `:2,S` flag suffixes:
survey now correctly identifies it as "A Maildir-format mailbox
containing 8 email messages" with confidence 0.90, names the
Maildir naming convention in domain_notes, and correctly marks
parse_structure as a skip tool. Before #42 this would have been
"8 source files."
Adds 8 unit tests for survey_signals covering empty input, extension
histogram, description aggregation/truncation, top-N cap, and
even-stride filename sampling.
#48 tracks the unit-of-analysis limitation (file is the wrong unit
for mbox, SQLite, archives, notebooks) — explicitly out of scope
for #42 and documented in survey_signals' docstring.
Adds think, checkpoint, and flag tools for agent reasoning visibility:
- think: records observation/hypothesis/next_action before investigation
- checkpoint: summarizes learned/unknown/next_phase after file clusters
- flag: marks notable findings to flags.jsonl with severity levels
Additional changes:
- Step numbering in investigation system prompt
- Text blocks from agent now printed to stderr (step labels visible)
- flag tool available in both investigation and synthesis passes
- analyze_directory() returns (brief, detailed, flags) three-tuple
- format_flags() in report.py renders flags sorted by severity
- Per-directory max_turns increased from 10 to 14
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- --install-extras: prints status of all optional AI packages
- --clear-cache: wipes /tmp/luminos/ investigation cache
- --fresh: forces a new investigation ID, ignoring cached results
- AI import is now lazy (only when --ai is used) so the base tool
never touches optional dependencies
- target argument is optional when using --install-extras
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds --ai flag that sends the directory tree, file categories, and
sampled file contents to Claude for analysis. Produces a brief
summary at the top of the report and a detailed breakdown at the
end. Requires ANTHROPIC_API_KEY env var; degrades gracefully without it.
Uses only stdlib (urllib) to keep the zero-dependency constraint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>