Revisit survey-skip thresholds with empirical data #46
Background
#7 introduced a gate that skips the survey pass when a target has both < _SURVEY_MIN_FILES files AND < _SURVEY_MIN_DIRS directories. The thresholds shipped as _SURVEY_MIN_FILES = 5 and _SURVEY_MIN_DIRS = 2, picked from the example in #7's body without any empirical basis. The AND semantics correctly handle the deep-narrow edge case (few files, many dirs; the survey still runs because dir count amortizes the cost across many dir loops).
What this issue is for
After Phase 2 ships and we have run --ai on a variety of real targets, revisit whether the thresholds and gate logic are actually pulling their weight.
Questions to answer with data
Are we skipping surveys we should be running? Look for runs where the survey was skipped and the dir loop produced a vague or wrong description. If those exist, the threshold is too aggressive.
Are we running surveys we should be skipping? Look for runs where the survey was called on a target small enough that the dir loop would have figured it out instantly. If those exist, the threshold is too conservative — survey is wasted spend.
Is file count the right input? File count from report["file_categories"] includes binary, generated, and pycache files. A target with 200 .pyc files and 3 source files looks large by this measure but is small in any meaningful sense. Should we exclude unknown and certain categories from the count? (Note: this overlaps with #42. Fixing the classifier would change what counts.)
Is dir count the right input? Currently uses len(_discover_directories(...)), which is post-exclude. Good. But it counts all descendants equally: a 50-dir-deep linear chain looks the same as 50 sibling dirs even though their dir loops behave very differently. Worth considering.
Should the gate consider total bytes too? A target with 5 files where one is 50 MB is not the same as 5 small files. Probably edge case, but worth checking.
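The three input questions above could be explored with helpers along these lines. This is only a sketch under assumptions: it assumes report["file_categories"] maps a category name to a file count, and the category names in NOISE_CATEGORIES, along with all three function names, are hypothetical and not taken from the codebase.

```python
from pathlib import PurePosixPath

# Hypothetical category labels; the real classifier's names may differ (see #42).
NOISE_CATEGORIES = {"unknown", "binary", "generated", "cache"}


def meaningful_file_count(file_categories, noise=NOISE_CATEGORIES):
    """Count files whose category is not in the noise set.

    Assumes file_categories maps category name -> number of files.
    """
    return sum(n for cat, n in file_categories.items() if cat not in noise)


def dir_shape(rel_dirs):
    """Return (total_dirs, max_depth) for relative, '/'-separated dir paths.

    A linear chain and a flat set of siblings have the same total but
    very different max_depth.
    """
    depths = [len(PurePosixPath(d).parts) for d in rel_dirs]
    return len(rel_dirs), max(depths, default=0)


def byte_profile(sizes):
    """Return (total_bytes, largest_file_bytes) for a list of file sizes."""
    return sum(sizes), max(sizes, default=0)
```

With these, the 200 .pyc plus 3 source files example counts as 3 meaningful files, and a 3-deep chain reports (3, 3) where 3 siblings report (3, 1).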
Should the gate be one constant or scale? Right now _SURVEY_MIN_FILES and _SURVEY_MIN_DIRS are separate constants checked with AND. A single "survey value score" combining files, dirs, and maybe bytes might be more honest than two thresholds.
Acceptance
Sequencing
After Phase 2 ships (#4–#7 plus #42, #44) and after we have run --ai on at least 5 distinct real targets of varying shapes, including at least one that triggers the skip and one deep-narrow case. Probably end of Phase 2 retrospective or start of Phase 3.
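As a reference point for that retrospective, the gate described in Background and one possible shape for the "survey value score" can be sketched as below. The constants match the shipped values. The function names and the scoring formula are hypothetical placeholders, not the existing implementation, and any real weighting would need to come from the collected runs.

```python
_SURVEY_MIN_FILES = 5
_SURVEY_MIN_DIRS = 2


def should_skip_survey(file_count, dir_count):
    """Current gate as described in Background: skip only when the
    target is small on BOTH axes (AND semantics)."""
    return file_count < _SURVEY_MIN_FILES and dir_count < _SURVEY_MIN_DIRS


def survey_value_score(file_count, dir_count):
    """Hypothetical single score: each input normalized by its threshold.

    Purely illustrative. This is NOT equivalent to the AND gate, and the
    equal weighting is an assumption to be replaced with fitted values.
    """
    return file_count / _SURVEY_MIN_FILES + dir_count / _SURVEY_MIN_DIRS


def should_skip_by_score(file_count, dir_count, cutoff=1.0):
    """Skip when the combined score falls below a (hypothetical) cutoff."""
    return survey_value_score(file_count, dir_count) < cutoff
```

The two gates disagree on some small targets. For example, 4 files and 1 dir is skipped by the AND gate but not by this score at cutoff 1.0, which is the kind of case the empirical runs would need to settle.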