Rebuild filetype classifier to remove source-code bias

archeious commented

2026-04-06 21:47:36 -06:00

Owner

The current filetypes.py classifier is biased toward source-code targets and produces misleading signals for non-code domains. This breaks the survey pass (#5) for any target that is not a codebase or document collection.

Problem

_classify_one() uses two-stage classification:

1. Hardcoded EXTENSION_MAP covering ~80 extensions across 7 buckets: source, config, data, document, media, archive, unknown
2. Fallback to file --brief matched against FILE_CMD_PATTERNS where text → source, script → source, program → source

Both stages assume the target is a developer artifact. The taxonomy has no concept of mail, calendar, contacts, notebooks, ledgers, journals, photo libraries, or any other personal-data domain.
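For readers who have not opened filetypes.py, the two stages can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the map below is a tiny hypothetical excerpt, the function name classify_one_sketch and the describe parameter are invented for the example, and substring matching is assumed for FILE_CMD_PATTERNS.

```python
import os
import subprocess

# Hypothetical excerpt; the real EXTENSION_MAP covers ~80 extensions.
EXTENSION_MAP = {
    ".py": "source",
    ".toml": "config",
    ".csv": "data",
    ".md": "document",
    ".png": "media",
    ".zip": "archive",
}

# Patterns named in the issue; substring matching is an assumption.
FILE_CMD_PATTERNS = [
    ("text", "source"),
    ("script", "source"),
    ("program", "source"),
]


def file_brief(path):
    """Return the one-line description printed by `file --brief`."""
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def classify_one_sketch(path, describe=file_brief):
    """Stage 1: extension lookup. Stage 2: match the `file` description."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    description = describe(path).lower()
    for needle, category in FILE_CMD_PATTERNS:
        if needle in description:
            return category
    return "unknown"
```

The describe parameter exists only so the sketch can be exercised without shelling out. Note that every stage-2 pattern lands in source, so any text-like file with an unmapped extension ends up there.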

Failure trace: Maildir target

Given a Maildir tree (10,000 message files):

- Files have no extension (or weird :2,S flag suffixes) → not in EXTENSION_MAP
- file --brief returns "SMTP mail, ASCII text" → matches text pattern → classified as source
- Result: report["file_categories"] == {"source": 10000}
- Survey pass sees "10000 source files" and concludes it is a giant codebase. Wrong.

.eml files, mbox files, and most personal-data formats hit the same failure mode.
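The trace above can be reproduced without a real mailbox. The snippet below is a minimal sketch that assumes FILE_CMD_PATTERNS is matched by substring and feeds it the description quoted in this issue 10,000 times; category_from_description is a hypothetical helper, not a function in filetypes.py.

```python
from collections import Counter

# Patterns named in the issue; substring matching is an assumption.
FILE_CMD_PATTERNS = [
    ("text", "source"),
    ("script", "source"),
    ("program", "source"),
]


def category_from_description(description):
    """Hypothetical stand-in for the stage-2 fallback."""
    lowered = description.lower()
    for needle, category in FILE_CMD_PATTERNS:
        if needle in lowered:
            return category
    return "unknown"


# One description per message file, as quoted in the failure trace.
descriptions = ["SMTP mail, ASCII text"] * 10_000

file_categories = dict(
    Counter(category_from_description(d) for d in descriptions)
)
print(file_categories)  # {'source': 10000}
```

The word "mail" is present in every description, but nothing in the pattern list looks for it, so the "text" rule wins.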

Why this matters now

#5 is shipping with a Band-Aid: the survey prompt explicitly tells the LLM the histogram is biased toward source code and to trust the tree preview when they conflict. That works for obvious cases like Maildir but is fragile for subtler ones (mixed personal-data dirs, photo libraries with sidecar files, accounting exports). The right fix is to remove the bias at the source.

Proposed approaches

Option A — Expand the taxonomy. Add categories: mail, notebook, personal_data, ledger, archive_metadata, etc. Add extension and file pattern mappings for each. Pros: minimal change to downstream consumers. Cons: every new domain needs hand-curation; the taxonomy will always lag reality.
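A rough sketch of what Option A might look like for the mail case. Everything added here is an assumption for illustration: the ".eml" and ".mbox" entries, the "mail" pattern, and the classify helper are hypothetical, and only the "SMTP mail, ASCII text" description comes from the trace above. The one design point it tries to show is ordering: a specific pattern has to be checked before the generic "text" rule or it never fires.

```python
import os

# Hypothetical additions to the extension map for Option A.
EXTENSION_MAP = {
    ".py": "source",
    ".eml": "mail",
    ".mbox": "mail",
}

# Ordered from most specific to least specific.
FILE_CMD_PATTERNS = [
    ("mail", "mail"),        # hypothetical new rule; must precede "text"
    ("text", "source"),
    ("script", "source"),
    ("program", "source"),
]


def classify(path, description):
    """Classify by extension first, then by the first matching pattern."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSION_MAP:
        return EXTENSION_MAP[ext]
    lowered = description.lower()
    for needle, category in FILE_CMD_PATTERNS:
        if needle in lowered:
            return category
    return "unknown"


print(classify("cur/1712345678.M1P2.host:2,S", "SMTP mail, ASCII text"))
print(classify("tool", "Python script, ASCII text executable"))
```

The first call should now land in mail and the second should stay in source. The hand-curation cost is visible even in this small example: each new domain needs its own extensions, its own patterns, and a decision about where it sits in the ordering.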

Option B — Switch to raw-signal feeding for the survey. Stop bucketing for the survey pass entirely. Feed the survey: the raw extension histogram (no bucketing), a sample of file --brief outputs, a sample of filenames. Let the LLM categorize from primary signals. Keep the bucketed view only where it is actually used (the base scan report). Pros: no taxonomy maintenance; LLM does the work it is good at. Cons: more tokens per survey call; diverges from the current report shape.
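A minimal sketch of the raw-signal payload Option B describes, assuming the survey can accept a plain dict. The function name collect_raw_signals, its parameters, the key names, and the "(none)" label for extensionless files are all hypothetical; describe stands in for whatever wraps file --brief.

```python
import os
import random
from collections import Counter


def collect_raw_signals(paths, describe, sample_size=20, seed=0):
    """Gather unbucketed signals for the survey prompt.

    `describe` is any callable that maps a path to a one-line
    description, for example a wrapper around `file --brief`.
    """
    paths = list(paths)
    # Seeded so repeated scans of the same tree give the same sample.
    rng = random.Random(seed)
    sample = rng.sample(paths, min(sample_size, len(paths)))

    # Count raw extensions without mapping them to categories.
    histogram = Counter(
        os.path.splitext(p)[1].lower() or "(none)" for p in paths
    )
    return {
        "extension_histogram": dict(histogram),
        "file_descriptions": [describe(p) for p in sample],
        "filenames": [os.path.basename(p) for p in sample],
    }


signals = collect_raw_signals(
    ["inbox/a.eml", "inbox/b.eml", "notes/README"],
    describe=lambda p: "SMTP mail, ASCII text",
    sample_size=2,
)
print(signals["extension_histogram"])
```

One caveat with this sketch: Maildir filenames contain dots, so a naive extension split would yield many one-off "extensions" for such a tree. The filename and description samples are likely to carry more signal there than the histogram.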

Option C — Hybrid. Expand the taxonomy modestly (Option A for the most common non-code cases) and also expose raw signals to the survey (Option B). Belt and suspenders.

Recommend deciding between these as part of the issue, not in advance.

Sequencing

This issue should land in Phase 2, after the survey pass is shipped end-to-end with the Band-Aid (#4–#7), and before Phase 3 (planning) starts depending on survey output for real decisions. At that point we will have observed actual survey behavior on real targets and can pick the right option with evidence.

Acceptance

- A non-code target (Maildir, .eml dir, photo library, or ledger export) produces a survey description that does not call it a codebase
- report["file_categories"] for the same target reflects the actual content
- Existing code-target behavior is unchanged or improved
