feat(filetypes): expose raw signals to survey (#42) #50

Closed

archeious wants to merge 0 commits from feat/issue-42-classifier-bias into main

archeious commented

2026-04-06 22:36:24 -06:00

Owner

Closes #42

Replaces the bucketed file_categories histogram in the survey input with raw signals (extensions, file --brief output, filename samples). The Band-Aid prompt warning is removed.

Smoke test: a synthetic Maildir of 8 messages is now characterized as a Maildir mailbox at 0.90 confidence; previously it would have been labeled as 8 source files. Behavior on code targets is unchanged.

#48 tracks the deeper unit-of-analysis problem (out of scope here).


archeious added 1 commit 2026-04-06 22:36:25 -06:00

feat(filetypes): expose raw signals to survey, remove classifier bias (#42) f3abbce7d4

The survey pass no longer receives the bucketed file_categories
histogram, which was biased toward source-code targets and would
mislabel mail, notebooks, ledgers, and other non-code domains as
"source" via the file --brief "text" pattern fallback.

Adds filetypes.survey_signals(), which assembles raw signals from
the same `classified` data the bucketer already processes — no new
walks, no new dependencies:
  total_files       — total count
  extension_histogram — top 20 extensions, raw, no taxonomy
  file_descriptions   — top 20 `file --brief` outputs, by count
  filename_samples    — 20 names, evenly drawn (not first-20)

`file --brief` descriptions are truncated at 80 chars before
counting so prefixes group correctly without exploding key cardinality.
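
A minimal sketch of how the four signals described above could be
assembled. This is not the code from the commit: the function names,
the "(none)" label, and the shape of `classified` (a list of dicts
with "name" and "description" keys) are assumptions made for
illustration only.

```python
import os
from collections import Counter

TOP_N = 20      # cap for both histograms and for the filename sample
DESC_MAX = 80   # file --brief descriptions are cut here before counting


def even_stride_sample(items, k):
    """Pick up to k items spread evenly across the list (not the first k)."""
    n = len(items)
    if n <= k:
        return list(items)
    return [items[i * n // k] for i in range(k)]


def survey_signals_sketch(classified, top_n=TOP_N, desc_max=DESC_MAX):
    """Hypothetical stand-in for filetypes.survey_signals().

    Assumes each entry looks like {"name": str, "description": str};
    the real structure of `classified` is not shown in this PR.
    """
    ext_counts = Counter()
    desc_counts = Counter()
    names = []
    for entry in classified:
        name = entry["name"]
        names.append(name)
        # Raw extension, no taxonomy; "(none)" is an illustrative label.
        ext = os.path.splitext(name)[1].lower() or "(none)"
        ext_counts[ext] += 1
        # Truncate first so long descriptions sharing a prefix group together.
        desc_counts[entry.get("description", "")[:desc_max]] += 1
    return {
        "total_files": len(names),
        "extension_histogram": dict(ext_counts.most_common(top_n)),
        "file_descriptions": dict(desc_counts.most_common(top_n)),
        "filename_samples": even_stride_sample(names, top_n),
    }
```

Truncating before counting means two descriptions that differ only
after the 80th character land in the same bucket, which keeps the
number of distinct keys small.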

The Band-Aid in _SURVEY_SYSTEM_PROMPT (warning the LLM that the
histogram was biased toward source code) is removed and replaced
with neutral guidance on how to read the raw signals together.
The {file_type_distribution} placeholder is renamed to
{survey_signals} to reflect the broader content.

luminos.py base scan computes survey_signals once and stores it on
report["survey_signals"]; AI consumers read from there.
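
As a rough illustration of the flow described above, the sketch below
reads report["survey_signals"] and substitutes a text rendering into a
{survey_signals} placeholder. The template text and both helper names
are hypothetical; the real _SURVEY_SYSTEM_PROMPT and its consumers are
not shown in this PR.

```python
# Hypothetical template; only the {survey_signals} placeholder name
# comes from the commit message.
PROMPT_TEMPLATE = "Raw signals for the target:\n{survey_signals}\n"


def format_survey_signals(signals):
    """Render the signals dict as indented plain text (illustrative layout)."""
    lines = [f"total_files: {signals['total_files']}"]
    for key in ("extension_histogram", "file_descriptions"):
        lines.append(f"{key}:")
        for label, count in signals[key].items():
            lines.append(f"  {count:>5}  {label}")
    lines.append("filename_samples:")
    lines.extend(f"  {name}" for name in signals["filename_samples"])
    return "\n".join(lines)


def build_survey_prompt(report):
    """Read the precomputed signals from the report; nothing is recomputed."""
    rendered = format_survey_signals(report["survey_signals"])
    return PROMPT_TEMPLATE.format(survey_signals=rendered)
```

Because the signals are computed once during the base scan, a consumer
like this only needs the report dict and never walks the target again.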

summarize_categories() and report["file_categories"] are unchanged:
the terminal report still uses the bucketed view (#49 tracks the
follow-up fix).

Smoke tested on two targets:
- luminos_lib: identical-quality survey ("Python library package",
  confidence 0.85), unchanged behavior on code targets.
- A synthetic Maildir of 8 messages with `:2,S` flag suffixes:
  survey now correctly identifies it as "A Maildir-format mailbox
  containing 8 email messages" with confidence 0.90, names the
  Maildir naming convention in domain_notes, and correctly marks
  parse_structure as a skip tool. Before #42 this would have been
  "8 source files."

Adds 8 unit tests for survey_signals covering empty input, extension
histogram, description aggregation/truncation, top-N cap, and
even-stride filename sampling.

#48 tracks the unit-of-analysis limitation (file is the wrong unit
for mbox, SQLite, archives, notebooks) — explicitly out of scope
for #42 and documented in survey_signals' docstring.