Stale cache entries survive --fresh when path format changed #79

Open
opened 2026-04-12 20:57:34 -06:00 by claude-code · 0 comments
Collaborator

Problem

Cache entries from older runs can coexist with entries from new runs for the same directory. The synthesis pass reads all entries and sees duplicates.

Root cause

Cache entries are stored as dirs/{sha256(path)}.json. The SHA256 is computed from the path argument passed to write_entry(). In older versions of luminos, some entries were written with relative paths (e.g. docs/wiki). Current code writes absolute paths (e.g. /home/micro/luminos/docs/wiki). These hash to different filenames, so both survive in the same cache directory.
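The mismatch can be reproduced in a few lines. This is a minimal sketch, assuming the key is the hex SHA256 of the UTF-8 encoded path string; `cache_filename` is a hypothetical helper written for illustration, not the actual luminos function.

```python
import hashlib


def cache_filename(path: str) -> str:
    """Hypothetical helper, assumed to mirror the dirs/{sha256(path)}.json naming."""
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return f"dirs/{digest}.json"


# The same directory, referenced two ways, lands in two different files.
relative = cache_filename("docs/wiki")
absolute = cache_filename("/home/micro/luminos/docs/wiki")
print(relative == absolute)  # False
```

Because the hash input is the raw string, nothing ties the two files back to the same directory once they are written.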

--fresh creates a new investigation ID, but the investigation ID for a given target is stored in /tmp/luminos/investigations.json keyed by absolute target path. If an old investigation used a different path format, or if the existing mapping was reused instead of replaced, stale entries can leak into the new investigation.

Observed

Running --fresh against /home/micro/luminos produced an investigation with 7 dir cache entries for 5 directories:

docs/wiki   (absolute path, 2026-04-13)  <-- new
docs/wiki   (relative path, 2024-12-28)  <-- stale
docs        (absolute path, 2026-04-13)  <-- new
docs        (relative path, 2024-12-19)  <-- stale
luminos_lib (absolute path, 2026-04-13)  <-- new
luminos_lib (relative path, 2024-12-28)  <-- stale
luminos     (absolute path, 2025-01-12)  <-- old format

The synthesis pass called cache.read_all_entries("dir") and received all 7, producing a report that may have double-counted some directories.

Fix options

  1. Normalize paths before hashing. Always os.path.realpath() the path before computing the SHA256. This ensures the same directory always produces the same cache key regardless of how it was referenced.

  2. --fresh should start from an empty cache directory. If the intent is a clean investigation, delete or ignore the old cache tree entirely rather than appending into it.

  3. Deduplicate on read. read_all_entries() could deduplicate by the relative_path field, keeping the entry with the newest cached_at timestamp. This is a safety net, not a fix for the root cause.
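A possible shape for the deduplication in option 3, as a sketch only. It assumes each entry is a dict carrying the `relative_path` and `cached_at` fields named above, and that `cached_at` values are directly comparable (for example ISO 8601 strings in one consistent format). `dedupe_entries` is a hypothetical name, not an existing luminos function.

```python
from typing import Iterable


def dedupe_entries(entries: Iterable[dict]) -> list[dict]:
    """Keep one entry per relative_path, preferring the newest cached_at.

    Assumption: cached_at values compare correctly with ">" (e.g. ISO 8601
    strings that all use the same format and timezone).
    """
    newest: dict[str, dict] = {}
    for entry in entries:
        key = entry["relative_path"]
        current = newest.get(key)
        if current is None or entry["cached_at"] > current["cached_at"]:
            newest[key] = entry
    return list(newest.values())


entries = [
    {"relative_path": "docs/wiki", "cached_at": "2024-12-28T00:00:00"},
    {"relative_path": "docs/wiki", "cached_at": "2026-04-13T00:00:00"},
    {"relative_path": "docs", "cached_at": "2026-04-13T00:00:00"},
]
deduped = dedupe_entries(entries)
```

This only hides stale files from the synthesis pass; they still remain on disk until something removes them.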

Option 1 is the cleanest. Option 2 is the most robust. Both could be done together.
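Option 1 could look roughly like the sketch below. `normalized_cache_key` and its `base_dir` parameter are hypothetical. The parameter is there because `os.path.realpath()` resolves a relative path against the current working directory, which is not necessarily the investigation target, so a relative path needs an explicit anchor to normalize reliably.

```python
import hashlib
import os
from typing import Optional


def normalized_cache_key(path: str, base_dir: Optional[str] = None) -> str:
    """Hypothetical helper: hash the canonical absolute form of a path.

    base_dir is an assumed parameter: the directory that relative paths are
    interpreted against (for example the investigation target root).
    """
    if base_dir is not None and not os.path.isabs(path):
        path = os.path.join(base_dir, path)
    # realpath makes the path absolute, collapses "..", and resolves symlinks.
    canonical = os.path.realpath(path)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_from_relative = normalized_cache_key("docs/wiki", base_dir="/home/micro/luminos")
key_from_absolute = normalized_cache_key("/home/micro/luminos/docs/wiki")
print(key_from_relative == key_from_absolute)  # True
```

Normalizing only affects entries written from now on. Files already written under the old relative-path keys keep their old filenames, which is one reason to pair this with option 2.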

Reference: archeious/luminos#79