fix(cache): _save_investigations write is not atomic — crash mid-write corrupts shared UUID map #85

Open
opened 2026-04-18 20:24:28 -06:00 by claude-code · 0 comments
Collaborator

Problem

/tmp/luminos/investigations.json is shared state across every invocation on the machine: it's the map from absolute target path → UUID that lets luminos resume prior investigations. The write path is non-atomic:

luminos_lib/cache.py:46-49:

def _save_investigations(data):
    os.makedirs(CACHE_ROOT, exist_ok=True)
    with open(INVESTIGATIONS_PATH, "w") as f:
        json.dump(data, f, indent=2)

If the process crashes, is killed, or runs out of disk between open("w") (which truncates) and json.dump completing, the file is left truncated or empty. _load_investigations then sees a json.JSONDecodeError, returns {}, and every resumable investigation on this machine loses its UUID mapping — they all become unreachable (the cache dirs still exist under their old UUIDs; luminos just can't find them from the target path anymore).

A concurrent run can also clobber: two luminos invocations against different targets, if they race on load/modify/save, lose one of the two updates.

Fix

Standard atomic-replace pattern:

def _save_investigations(data):
    os.makedirs(CACHE_ROOT, exist_ok=True)
    tmp = INVESTIGATIONS_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp, INVESTIGATIONS_PATH)

Addresses crash-safety. Concurrency is not solved by this alone (two writers can still lose one side's update) but is far less likely than crash corruption in practice, and the fix is cheap.

If concurrency matters more later (probably when #39 MCP backend lands), a flock over INVESTIGATIONS_PATH around load + save is the next step.

Acceptance

  • _save_investigations uses temp file + os.replace
  • Test in tests/test_cache.py simulates a mid-write crash and confirms the old file is intact
  • No behavior change on happy path
## Problem `/tmp/luminos/investigations.json` is shared state across every invocation on the machine: it's the map from absolute target path → UUID that lets luminos resume prior investigations. The write path is non-atomic: `luminos_lib/cache.py:46-49`: ```python def _save_investigations(data): os.makedirs(CACHE_ROOT, exist_ok=True) with open(INVESTIGATIONS_PATH, "w") as f: json.dump(data, f, indent=2) ``` If the process crashes, is killed, or runs out of disk between `open("w")` (which truncates) and `json.dump` completing, the file is left truncated or empty. `_load_investigations` then sees a `json.JSONDecodeError`, returns `{}`, and every resumable investigation on this machine loses its UUID mapping — they all become unreachable (the cache dirs still exist under their old UUIDs; luminos just can't find them from the target path anymore). A concurrent run can also clobber: two `luminos` invocations against different targets, if they race on load/modify/save, lose one of the two updates. ## Fix Standard atomic-replace pattern: ```python def _save_investigations(data): os.makedirs(CACHE_ROOT, exist_ok=True) tmp = INVESTIGATIONS_PATH + ".tmp" with open(tmp, "w") as f: json.dump(data, f, indent=2) os.replace(tmp, INVESTIGATIONS_PATH) ``` Addresses crash-safety. Concurrency is not solved by this alone (two writers can still lose one side's update) but is far less likely than crash corruption in practice, and the fix is cheap. If concurrency matters more later (probably when #39 MCP backend lands), a flock over `INVESTIGATIONS_PATH` around load + save is the next step. ## Acceptance - [ ] `_save_investigations` uses temp file + `os.replace` - [ ] Test in `tests/test_cache.py` simulates a mid-write crash and confirms the old file is intact - [ ] No behavior change on happy path
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: archeious/luminos#85
No description provided.