Terminal report shows biased bucketed file-type view #49

Open

opened 2026-04-06 22:31:20 -06:00 by archeious · 0 comments

Owner

Problem

The terminal report's "FILE TYPE INTELLIGENCE" section displays report["file_categories"], which comes from the bucketed summarize_categories() classifier. This classifier is biased toward source-code targets (see #42) and produces misleading summaries for non-code targets.

Concrete example from the smoke test on luminos_lib:

>> FILE TYPE INTELLIGENCE
  source         13
  unknown        13
  TOTAL          26

The 13 "source" files are real Python source. The 13 "unknown" files are .pyc files in __pycache__/, which the classifier doesn't recognize. A user reading this can't tell that half the count is bytecode noise.
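The collision can be reproduced with a minimal sketch. Everything here is an assumption made for illustration: `ASSUMED_BUCKETS`, `bucket_counts()`, and `extension_counts()` are hypothetical stand-ins (the real `summarize_categories()` mapping is not shown in this issue), and the file names are made up to match the 13/13 split above.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical stand-in for the bucketed taxonomy. The only property
# assumed is the one described above: .py is known, .pyc is not.
ASSUMED_BUCKETS = {".py": "source"}


def bucket_counts(paths):
    """Count files per bucket, falling back to 'unknown'."""
    return Counter(
        ASSUMED_BUCKETS.get(PurePosixPath(p).suffix.lower(), "unknown")
        for p in paths
    )


def extension_counts(paths):
    """Count files per raw extension, with no taxonomy applied."""
    return Counter(PurePosixPath(p).suffix.lower() or "(none)" for p in paths)


# Made-up file names shaped like the smoke-test target.
paths = [f"luminos_lib/mod{i}.py" for i in range(13)] + [
    f"luminos_lib/__pycache__/mod{i}.cpython-312.pyc" for i in range(13)
]

print(dict(bucket_counts(paths)))     # bucketed view hides what "unknown" is
print(dict(extension_counts(paths)))  # raw view names the .pyc files directly
```

The bucketed view reports only that half the files are unrecognized. The raw extension count shows that those files are all `.pyc`.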

Why this is separate from #42

#42 stops feeding the biased histogram to the survey and instead exposes raw signals (survey_signals()) for the survey to consume. The terminal report continues to use summarize_categories() because a quick "N source / N config / N data" overview is reasonable for a human eyeballing the report. The bucketed view is not wrong as a summary, only as an input to AI characterization.

This issue is the smaller follow-up: should the terminal report also switch to the new signals, or supplement the bucketed view with them (for example, an extension histogram or a summary of file --brief descriptions), so the user sees a more honest view of the target?

Possible directions

1. Add an "Extensions" sub-section under FILE TYPE INTELLIGENCE that lists the top extensions raw, alongside the bucketed view. Cheap, additive, no consumer break.
2. Replace the bucketed view entirely with the extension histogram. Cleaner but loses the at-a-glance taxonomy.
3. Mark .pyc and similar generated files distinctly in the existing taxonomy (a generated bucket) so the source/unknown collision goes away. Overlaps with #42's expand-the-taxonomy option which we already rejected.
4. Wait for the unit-of-analysis fix and let that drive the report redesign.

Direction 1 is the smallest useful change.
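A rough sketch of what the additive "Extensions" sub-section could look like, under stated assumptions: `render_file_type_section()` and its `top_n` parameter are hypothetical names, the report is assumed to have the list of file paths available, and the column widths are only inferred from the sample output above. This is not the existing report code.

```python
from collections import Counter
from pathlib import PurePosixPath


def render_file_type_section(file_categories, paths, top_n=5):
    """Render the bucketed view, then a raw top-extensions sub-section.

    file_categories: mapping of bucket name -> count, in the shape of
    report["file_categories"] (assumed to be a plain dict of ints).
    paths: iterable of file paths for the target (assumed available).
    """
    lines = [">> FILE TYPE INTELLIGENCE"]
    # Existing bucketed view, left unchanged so no consumer breaks.
    for name, count in file_categories.items():
        lines.append(f"  {name:<14} {count}")
    lines.append(f"  {'TOTAL':<14} {sum(file_categories.values())}")

    # New, additive part: raw extension histogram.
    extensions = Counter(
        PurePosixPath(p).suffix.lower() or "(none)" for p in paths
    )
    lines.append("  Extensions")
    for ext, count in extensions.most_common(top_n):
        lines.append(f"    {ext:<12} {count}")
    return "\n".join(lines)


# Made-up paths shaped like the smoke-test target.
example_paths = [f"luminos_lib/mod{i}.py" for i in range(13)] + [
    f"luminos_lib/__pycache__/mod{i}.cpython-312.pyc" for i in range(13)
]
print(render_file_type_section({"source": 13, "unknown": 13}, example_paths))
```

Because the bucketed lines are emitted exactly as before, anything that reads the existing section keeps working, and the extra lines make the `.pyc` share visible.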

Sequencing

Low priority. After #42 and Phase 2.5 (#44). Not blocking anything.
