Rebuild filetype classifier to remove source-code bias #42
The current `filetypes.py` classifier is biased toward source-code targets and produces misleading signals for non-code domains. This breaks the survey pass (#5) for any target that is not a codebase or document collection.
Problem
`_classify_one()` uses two-stage classification:
1. `EXTENSION_MAP` lookup, covering ~80 extensions across 7 buckets: source, config, data, document, media, archive, unknown
2. `file --brief` output matched against `FILE_CMD_PATTERNS`, where `text` → source, `script` → source, `program` → source
Both stages assume the target is a developer artifact. The taxonomy has no concept of mail, calendar, contacts, notebooks, ledgers, journals, photo libraries, or any other personal-data domain.
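A minimal sketch of the two-stage lookup described above, to make the fall-through concrete. The tables here are hypothetical, heavily trimmed stand-ins (the real `EXTENSION_MAP` and `FILE_CMD_PATTERNS` are larger and may be shaped differently), and `classify_one` takes the `file --brief` output as an argument rather than shelling out:

```python
import os

# Hypothetical stand-ins for the real tables in filetypes.py (assumed shape).
EXTENSION_MAP = {
    ".py": "source",
    ".toml": "config",
    ".csv": "data",
    ".md": "document",
    ".png": "media",
    ".zip": "archive",
}

# Substring patterns checked against `file --brief` output, in order.
FILE_CMD_PATTERNS = [
    ("script", "source"),
    ("program", "source"),
    ("text", "source"),
]


def classify_one(filename: str, file_brief: str) -> str:
    """Stage 1: extension lookup. Stage 2: match on `file --brief` output."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    description = file_brief.lower()
    for needle, bucket in FILE_CMD_PATTERNS:
        if needle in description:
            return bucket
    return "unknown"


# A Maildir-style name has no mapped extension, so stage 2 decides,
# and any description containing "text" lands in the source bucket.
print(classify_one("1700000000.M1P2.host:2,S", "SMTP mail, ASCII text"))
```

Under these assumptions, any plain-text format that stage 1 does not recognize ends up as `source`, whatever the file actually contains.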
Failure trace: Maildir target
Given a Maildir tree (10,000 message files):
1. Message files have no mapped extension (only Maildir `:2,S` flag suffixes) → not in `EXTENSION_MAP`
2. `file --brief` returns `"SMTP mail, ASCII text"` → matches the `text` pattern → classified as `source`
3. `report["file_categories"] == {"source": 10000}`
`.eml` files, mbox files, and most personal-data formats hit the same failure mode.
Why this matters now
#5 is shipping with a Band-Aid: the survey prompt explicitly tells the LLM the histogram is biased toward source code and to trust the tree preview when they conflict. That works for obvious cases like Maildir but is fragile for subtler ones (mixed personal-data dirs, photo libraries with sidecar files, accounting exports). The right fix is to remove the bias at the source.
Proposed approaches
Option A: Expand the taxonomy. Add categories: `mail`, `notebook`, `personal_data`, `ledger`, `archive_metadata`, etc. Add extension and `file` pattern mappings for each. Pros: minimal change to downstream consumers. Cons: every new domain needs hand-curation; the taxonomy will always lag reality.
Option B: Switch to raw-signal feeding for the survey. Stop bucketing for the survey pass entirely. Feed the survey: the raw extension histogram (no bucketing), a sample of `file --brief` outputs, a sample of filenames. Let the LLM categorize from primary signals. Keep the bucketed view only where it is actually used (the base scan report). Pros: no taxonomy maintenance; LLM does the work it is good at. Cons: more tokens per survey call; diverges from the current report shape.
Option C: Hybrid. Expand the taxonomy modestly (Option A for the most common non-code cases) and also expose raw signals to the survey (Option B). Belt and suspenders.
Recommend deciding between these as part of the issue, not in advance.
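To help weigh Option B, here is a rough sketch of what the raw-signal payload could look like. Everything in it is an assumption for illustration: the function name `build_raw_signals`, the output keys, and the input shape (pairs of relative path and `file --brief` output collected by the existing scan) are not part of the current code.

```python
import random
from collections import Counter
from pathlib import PurePosixPath


def build_raw_signals(files, sample_size=20, seed=0):
    """Collect unbucketed signals for the survey prompt.

    files: iterable of (relative_path, file_brief_output) pairs (assumed shape).
    """
    files = list(files)
    # Raw extension histogram, with no mapping into buckets.
    ext_histogram = Counter(
        PurePosixPath(path).suffix.lower() or "(none)" for path, _ in files
    )
    # Seeded sample so repeated surveys of the same target see the same files.
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    return {
        "extension_histogram": dict(ext_histogram),
        "file_brief_samples": sorted({brief for _, brief in sample}),
        "filename_samples": [path for path, _ in sample],
    }


signals = build_raw_signals(
    [
        ("a.eml", "SMTP mail, ASCII text"),
        ("b.eml", "SMTP mail, ASCII text"),
        ("notes", "ASCII text"),
    ]
)
print(signals["extension_histogram"])
```

The sample size bounds the token cost noted under Option B's cons. Deduplicating the `file --brief` strings keeps a homogeneous target such as a Maildir from repeating one line thousands of times.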
Sequencing
This issue should land in Phase 2, after the survey pass is shipped end-to-end with the Band-Aid (#4–#7), and before Phase 3 (planning) starts depending on survey output for real decisions. At that point we will have observed actual survey behavior on real targets and can pick the right option with evidence.
Acceptance
`report["file_categories"]` for the same target reflects the actual content
Shipped in #50, merged to main. Closing manually: Forgejo's `Closes` keyword didn't auto-close this from the PR body.