Unit of analysis is hardcoded as "file" — over/under-counts container formats #48

Open
opened 2026-04-07 04:31:03 +00:00 by archeious · 0 comments
Owner

Problem

Luminos uses "file" as the unit of analysis everywhere — file count, file categories, file summaries. This is accurate at the POSIX level but wrong at the semantic level for any target where the meaningful unit is not a file.

Failure modes

Over-counting (one logical thing, many files):

  • A Maildir is one logical mailbox but ten thousand files. Luminos reports "10000 files" when the user thinks "1 mailbox."
  • A .git/ directory is one logical repo but thousands of pack/object files that are implementation noise.
  • A node_modules tree is one logical dependency set but tens of thousands of files.
  • A photo library with sidecar metadata files (XMP, AAE) double-counts every photo.
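
The gap between the POSIX count and the logical count can be shown with a small walker. This is a minimal sketch, not Luminos code: `NOISE_DIRS`, `is_maildir`, and `count_units` are hypothetical names, and the Maildir test (a directory with `cur/`, `new/`, and `tmp/` subdirectories) is a heuristic. Photo sidecar pairing is not handled here.

```python
import os
from pathlib import Path

# Hypothetical list: directory names treated as one opaque logical unit.
NOISE_DIRS = {".git", "node_modules", "__pycache__"}


def is_maildir(path: Path) -> bool:
    """Heuristic: a Maildir has cur/, new/ and tmp/ subdirectories."""
    return all((path / sub).is_dir() for sub in ("cur", "new", "tmp"))


def count_units(root: Path) -> tuple[int, int]:
    """Return (raw_file_count, logical_unit_count) for the tree under root."""
    raw = 0
    logical = 0
    collapsed: list[Path] = []  # directories already counted as one unit
    # os.walk is top-down by default, so a parent is seen before its children.
    for dirpath, _dirnames, filenames in os.walk(root):
        here = Path(dirpath)
        raw += len(filenames)  # the POSIX-level count always grows
        if any(c in here.parents for c in collapsed):
            continue  # inside a collapsed container: already counted once
        if here.name in NOISE_DIRS or is_maildir(here):
            collapsed.append(here)
            logical += 1  # the whole directory is one logical thing
            continue
        logical += len(filenames)  # ordinary directory: one unit per file
    return raw, logical
```

For a tree holding one text file, a `.git/` directory with two files, and a Maildir with two messages, this returns five raw files but three logical units.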

Under-counting (many logical things, one file):

  • An mbox file is one file but contains thousands of messages.
  • A SQLite database is one file but contains tables and rows.
  • A zip/tar archive is one file but holds many files.
  • A Jupyter notebook is one file but contains many code/markdown cells.
  • A JSONL log file is one file but contains many records.
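
Each of the containers above can be opened with the Python standard library to count what is inside. The sketch below is illustrative only: the function names are hypothetical, and what counts as "one unit" (tables rather than rows for SQLite, regular members for archives) is an assumption that the real design would need to settle.

```python
import json
import mailbox
import sqlite3
import tarfile
import zipfile
from pathlib import Path


def count_mbox(path: Path) -> int:
    """Number of messages in an mbox file."""
    box = mailbox.mbox(str(path), create=False)
    try:
        return len(box)
    finally:
        box.close()


def count_sqlite_tables(path: Path) -> int:
    """Number of user tables; the database is opened read-only."""
    conn = sqlite3.connect(path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        (n,) = conn.execute(
            "SELECT count(*) FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        ).fetchone()
        return n
    finally:
        conn.close()


def count_zip_members(path: Path) -> int:
    """Number of non-directory entries in a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return sum(1 for info in zf.infolist() if not info.is_dir())


def count_tar_members(path: Path) -> int:
    """Number of regular files in a tar archive (any supported compression)."""
    with tarfile.open(path) as tf:
        return sum(1 for member in tf if member.isfile())


def count_notebook_cells(path: Path) -> int:
    """Number of cells in a Jupyter notebook (a JSON document)."""
    with open(path, encoding="utf-8") as fh:
        return len(json.load(fh).get("cells", []))


def count_jsonl_records(path: Path) -> int:
    """Number of non-blank lines in a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())
```

Every function takes a single path and returns a single integer, so a report could state "1 file, N messages" rather than only "1 file".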

In both directions, the histograms, summaries, and survey signals Luminos produces are systematically wrong about the scale and shape of the target.

Why renaming filetypes.py is not the fix

A rename to objects.py or entries.py would move the lie from "we count files" to "we count objects, where object happens to mean file." The dishonesty is in the unit of analysis, not the label. The honest fix requires:

  1. Format detection — recognize containers (mbox, SQLite, zip, tar, notebook, JSONL) by content, not just extension
  2. Container handlers — per-format code that can crack open a container and enumerate its contents at the semantic level
  3. A unified "logical unit" abstraction — the survey, cache, and report layers stop assuming the unit is always a file
  4. Skip rules for noise containers: .git/, node_modules/, __pycache__/ should report as one logical thing, not thousands
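
Requirements 1 and 3 could be sketched roughly as follows. Everything here is an assumption for illustration: `LogicalUnit`, `detect_container`, and the helper names do not exist in Luminos, the field layout is a guess, and the mbox, JSONL, and notebook checks are heuristics that a real implementation would need to harden.

```python
import json
import tarfile
import zipfile
from dataclasses import dataclass
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


@dataclass(frozen=True)
class LogicalUnit:
    """Hypothetical record the survey, cache and report layers could share."""
    path: Path       # file or directory that holds the unit(s)
    kind: str        # container kind: "file", "mbox", "sqlite", "zip", ...
    unit_name: str   # what is counted: "file", "message", "table", ...
    count: int       # how many units of that name the path contains


def _looks_like_jsonl(head: bytes) -> bool:
    """True if at least two complete lines each parse as a JSON object/array."""
    pieces = head.split(b"\n")
    complete = [ln for ln in pieces[:-1] if ln.strip()]  # last piece may be cut
    if len(complete) < 2:
        return False
    try:
        return all(isinstance(json.loads(ln), (dict, list)) for ln in complete)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False


def _looks_like_notebook(path: Path) -> bool:
    """True if the whole file is a JSON object with notebook-style keys."""
    try:
        with open(path, encoding="utf-8") as fh:
            doc = json.load(fh)
    except (ValueError, OSError):
        return False
    return isinstance(doc, dict) and "cells" in doc and "nbformat" in doc


def detect_container(path: Path) -> str:
    """Guess the container kind from content; return "file" when unsure."""
    with open(path, "rb") as fh:
        head = fh.read(4096)
    if head.startswith(SQLITE_MAGIC):
        return "sqlite"
    if zipfile.is_zipfile(path):
        return "zip"
    if tarfile.is_tarfile(path):
        return "tar"
    if head.startswith(b"From "):  # mbox messages begin with a "From " line
        return "mbox"
    if _looks_like_jsonl(head):
        return "jsonl"
    if head.lstrip().startswith(b"{") and _looks_like_notebook(path):
        return "notebook"
    return "file"
```

A caller might combine the detected kind with a per-format handler (requirement 2) and emit something like `LogicalUnit(path, "mbox", "message", 1200)`, so that downstream layers read `unit_name` and `count` instead of assuming one unit per file.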

When this work is done, the module rename happens as part of the substantive change. The new name reflects new behavior.

Acceptance

  • Running --ai on a Maildir produces a description and count consistent with "a mailbox of N messages," not "a directory of N files"
  • Running --ai on an mbox file produces a description consistent with "a mailbox file containing N messages," not "one text file"
  • Running --ai on a target containing .git/ and node_modules/ does not let either dominate the file count
  • The unit-of-analysis abstraction is documented and used consistently across filetypes, cache, report, and ai

Sequencing

After Phase 4 (external knowledge tools). This issue overlaps with format inspection and is substantial enough to be its own phase. It should NOT block Phase 2 / 3 / 3.5 / 4 — those phases improve characterization within the current file-as-unit model, which is fine for the codebase targets that motivated the project.

References

  • #42 (classifier bias) addresses a related but smaller problem: bias inside the per-file taxonomy. #42 leaves the unit of analysis as "file."
  • survey_signals() (added in #42) is named for its purpose, not its unit, so the name still fits if the function eventually returns mixed units (files + messages + rows).
Reference: archeious/luminos#48