Unit of analysis is hardcoded as "file" — over/under-counts container formats #48
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
Luminos uses "file" as the unit of analysis everywhere — file count, file categories, file summaries. This is accurate at the POSIX level but wrong at the semantic level for any target where the meaningful unit is not a file.
Failure modes
Over-counting (one logical thing, many files):
.git/directory is one logical repo but thousands of pack/object files that are implementation noise.Under-counting (many logical things, one file):
In both directions, the histograms, summaries, and survey signals Luminos produces are systematically wrong about the scale and shape of the target.
Why renaming
filetypes.pyis not the fixA rename to
objects.pyorentries.pywould move the lie from "we count files" to "we count objects, where object happens to mean file." The dishonesty is in the unit of analysis, not the label. The honest fix requires:.git/,node_modules/,__pycache__/should report as one logical thing, not thousandsWhen this work is done, the module rename happens as part of the substantive change. The new name reflects new behavior.
Acceptance
--aion a Maildir produces a description and count consistent with "a mailbox of N messages," not "a directory of N files"--aion an mbox file produces a description consistent with "a mailbox file containing N messages," not "one text file"--aion a target containing.git/andnode_modules/does not let either dominate the file countfiletypes,cache,report, andaiSequencing
After Phase 4 (external knowledge tools). This issue overlaps with format inspection and is substantial enough to be its own phase. It should NOT block Phase 2 / 3 / 3.5 / 4 — those phases improve characterization within the current file-as-unit model, which is fine for the codebase targets that motivated the project.
References
survey_signals()(added in #42) is named for its purpose, not its unit, so the name still fits if the function eventually returns mixed units (files + messages + rows).