Terminal report shows biased bucketed file-type view #49
Problem
The terminal report's "FILE TYPE INTELLIGENCE" section displays report["file_categories"], which comes from the bucketed summarize_categories() classifier. This classifier is biased toward source-code targets (see #42) and produces misleading summaries for non-code targets.
Concrete example from the smoke test on luminos_lib: the 13 "source" files are real Python source, but the 13 "unknown" files are .pyc files in __pycache__/, which the classifier doesn't recognize. A user reading this can't tell that half the count is bytecode noise.
Why this is separate from #42
#42 stops feeding the biased histogram to the survey by exposing raw signals (survey_signals()) for the survey's consumption. The terminal report continues to use summarize_categories() because a quick "N source / N config / N data" overview is reasonable for a human eyeballing the report: the bucketed view is not wrong as a summary, only as an input to AI characterization.
This issue is the smaller follow-up: should the terminal report also switch to, or supplement with, the new signals (for example, an extension histogram or a file --brief description summary) so the user sees a more honest view of the target?
Possible directions
Classify .pyc and similar generated files distinctly in the existing taxonomy (a generated bucket) so the source/unknown collision goes away. This overlaps with #42's expand-the-taxonomy option, which we already rejected.
Direction 1 is the smallest useful change.
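As a rough illustration of the kind of supplementary view discussed in this issue, an extension histogram that keeps bytecode in its own tally could look something like the sketch below. This is not code from luminos_lib: the function name extension_histogram, the GENERATED_EXTENSIONS set, and the rule that anything under __pycache__/ counts as generated are all assumptions made for the example.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical set of extensions treated as generated artifacts.
# A real list would need to be decided separately.
GENERATED_EXTENSIONS = {".pyc", ".pyo"}


def extension_histogram(paths):
    """Count files per extension, tallying generated files separately.

    `paths` is an iterable of POSIX-style relative path strings.
    Returns a (regular, generated) pair of Counters keyed by extension.
    """
    regular = Counter()
    generated = Counter()
    for raw in paths:
        path = PurePosixPath(raw)
        # Files with no suffix are grouped under a placeholder key.
        ext = path.suffix.lower() or "(none)"
        # Assumption: anything inside __pycache__/ is generated output.
        in_pycache = "__pycache__" in path.parts
        if ext in GENERATED_EXTENSIONS or in_pycache:
            generated[ext] += 1
        else:
            regular[ext] += 1
    return regular, generated
```

With a split like this, the report could print the regular counts as the main histogram and add a single "generated" line underneath, so bytecode no longer lands in "unknown". Whether that belongs in the report at all is still the open question of this issue.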
Sequencing
Low priority. After #42 and Phase 2.5 (#44). Not blocking anything.