archeious/luminos

Fork 0

Table of Contents

Internals

1. The two layers
2. Base scan walkthrough
3. AI pipeline walkthrough

3.1 The orchestrator
3.2 The survey pass
3.3 How the survey shapes dir loops
3.4 The planning pass (Phase 3)

4. The dir loop in depth

4.1 The message history grows monotonically
4.2 Tool dispatch
4.3 Pre-loaded context
4.4 The token tracker and the budget check
4.5 What the loop returns
4.6 The streaming API caller
4.7 The leaf-first contract (load-bearing for child summaries)

5. The cache model

5.1 Investigation IDs
5.2 What's stored
5.3 Confidence + completeness support
5.4 Why one-file-per-entry instead of JSONL
5.5 The planning files

6. Prompts
7. Synthesis pass
8. Flags
9. Where to make common changes

9.1 Add a new tool the dir agent can call
9.2 Add a whole new pass
9.3 Change a prompt
9.4 Change cache schema
9.5 Add a CLI flag

10. Token budget and cost
11. Glossary

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

Internals

A code tour of how Luminos actually works. Read this after Development Guide and Architecture. The goal is that a developer who knows basic Python but has never built an agent loop can finish this page and start making non-trivial changes.

All file:line references are accurate as of the date this page was last edited — verify with git log or by opening the file before relying on a specific line number. ai.py in particular grows each phase and references drift.

1. The two layers

Luminos still has two internal layers, but they are no longer separated by a CLI flag. AI runs unconditionally on every invocation.

Layer	What it does	Imports
Base scan	Walks the directory, classifies files, counts lines, ranks recency, measures disk usage. Produces the `report` dict that feeds the agent.	stdlib + GNU coreutils via subprocess.
AI pipeline	Runs a multi-pass agent investigation via the Claude API on top of the base scan output.	`anthropic`, `tree-sitter`, `python-magic`.

The base scan modules (tree.py, filetypes.py, code.py, recency.py, disk.py, report.py) still don't depend on ai.py, ast_parser.py, or prompts.py. This separation is convention rather than a hard constraint now: it keeps the base scan fast and easy to test, but it is no longer enforced by lazy imports. luminos.py imports from luminos_lib.ai at the bottom of main(), after the base scan has produced its report.

If ANTHROPIC_API_KEY is unset, luminos.py exits cleanly (exit 0) with a one-line hint before running the base scan, so the user isn't made to wait for a scan they can't use.

2. Base scan walkthrough

Entry: luminos.py:main() parses args, then calls scan(target, ...). scan() is a flat sequence — it builds a report dict by calling helpers from luminos_lib/, one per concern, in order:

scan(target)
  build_tree()           → report["tree"], report["tree_rendered"]
  classify_files()       → report["classified_files"]
  summarize_categories() → report["file_categories"]
  survey_signals()       → report["survey_signals"]    ← input to AI survey
  detect_languages()     → report["languages"], report["lines_of_code"]
  find_large_files()     → report["large_files"]
  find_recent_files()    → report["recent_files"]
  get_disk_usage()       → report["disk_usage"]
  top_directories()      → report["top_directories"]
  return report

Each helper is independent. You could delete find_recent_files() and the report would just be missing that field. The flow is procedural, not event-driven, and there is no shared state object — everything passes through the local report dict.

The progress lines you see on stderr ([scan] Counting lines... foo.py) come from _progress() in luminos.py, which returns an on_file callback that the helpers call as they work. If you add a new helper that walks files, plumb a progress callback through the same way for consistency.

After scan() returns, main() either runs the AI pipeline or jumps straight to format_report() (luminos_lib/report.py) for terminal output, or json.dumps() for JSON. The AI pipeline always runs after the base scan because it needs report["survey_signals"] and report["file_categories"] as inputs.

3. AI pipeline walkthrough

The AI pipeline is what makes Luminos interesting and is also where almost all the complexity lives. Everything below happens inside luminos_lib/ai.py (~2060 lines as of writing), called from luminos.py via analyze_directory().

3.1 The orchestrator

analyze_directory() is a thin wrapper that checks dependencies, gets the API key, builds the Anthropic client, and calls _run_investigation(). If anything fails it prints a warning and returns empty strings — the rest of luminos keeps working.

_run_investigation() is the real entry point. Read this function first if you want to understand the pipeline shape. It does seven things, in order:

Get/create an investigation ID and cache. Investigation IDs let you resume a previous run; see §5 below.
Discover all directories under the target via _discover_directories(). Returns them sorted leaves-first — the deepest paths come first. This matters because each dir loop reads its child directories' summaries from cache, so children must be investigated before parents.
Run the survey pass unless the target is below _SURVEY_MIN_FILES and _SURVEY_MIN_DIRS, in which case _default_survey() returns a synthetic skip.
Filter out cached directories. If you're resuming an investigation, dirs that already have a dir cache entry are skipped — only new ones get a fresh dir loop.
Run the planning pass (Phase 3) unless the target is small, in which case _default_plan() returns an empty plan. On resumed runs the planner is skipped and plan.json is loaded from cache instead. _apply_plan() then sorts dirs into priority/default/shallow bands and builds a {dir_path: max_turns} map. Leaf-first ordering is preserved within each band (see §4.7).
Run a dir loop per remaining directory, iterating the plan-ordered list with the per-directory max_turns from the plan. _write_plan_evaluation() records turn-utilization metrics at the end. This is the heart of the system — see §4.
Run the synthesis pass reading only dir cache entries to produce (brief, detailed).

It also reads flags.jsonl from disk at the end and returns (brief, detailed, flags) to analyze_directory().

3.2 The survey pass

_run_survey() is a short, single-purpose loop. It exists to give the dir loops some shared context about what they're looking at as a whole before any of them start.

Inputs go into the system prompt (_SURVEY_SYSTEM_PROMPT in prompts.py):

survey_signals — extension histogram, file --brief outputs, filename samples (built by filetypes.survey_signals() during the base scan)
A 2-level tree preview from build_tree(target, max_depth=2)
The list of tools the dir loop will have available

The survey is allowed only submit_survey as a tool (_SURVEY_TOOLS). It runs at most 3 turns. The agent must call submit_survey exactly once with six fields:

{
    "description":     "plain language — what is this target",
    "approach":        "how the dir loops should investigate it",
    "relevant_tools":  ["read_file", "parse_structure", ...],
    "skip_tools":      ["parse_structure", ...],   # for non-code targets
    "domain_notes":    "anything unusual the dir loops should know",
    "confidence":      0.0–1.0,
}

The result is a Python dict that gets passed into every dir loop as survey=.... If the survey fails (API error, ran out of turns), the dir loops still run but with survey=None — the system degrades gracefully.

3.3 How the survey shapes dir loops

Two things happen with the survey output before each dir loop runs:

Survey block injection. _format_survey_block() renders the survey dict as a labeled text block, which gets .format()-injected into the dir loop system prompt as {survey_context}. The dir agent sees the description, approach, domain notes, and which tools it should lean on or skip.

Tool filtering. _filter_dir_tools() returns a copy of _DIR_TOOLS with anything in skip_tools removed — but only if the survey's confidence is at or above _SURVEY_CONFIDENCE_THRESHOLD = 0.5. Below that threshold the agent gets the full toolbox. The control-flow tool submit_report is in _PROTECTED_DIR_TOOLS and can never be filtered out — removing it would break loop termination.

This is the only place in the codebase where the agent's available tools change at runtime. If you add a new tool, decide whether it should be protectable.

3.4 The planning pass (Phase 3)

_run_planning() is structured like _run_survey(): a single-purpose loop with one submit tool (submit_plan), low max turns. Its job is to decide where the dir loops should spend turns, not to investigate.

Inputs:

The survey dict (formatted via _format_survey_block())
The full tree at depth 6 (deeper than the survey's 2-level preview)
The base scan's survey_signals (raw file signals)
The list of already-cached directories (so the planner doesn't plan around dirs that will be skipped)

The plan schema, tier allocations (priority 15–20 cap 25, default 10, shallow 5, skip 0), fallback behavior, and resume behavior are covered in full on the Planning Pass page.

_apply_plan() is a pure helper that translates the plan into an ordered list of directories plus a {dir_path: max_turns} map. It sorts dirs into priority/default/shallow bands but preserves leaf-first ordering within each band — so children always run before their parents, even in "priority-first" mode. See §4.7.

_write_plan_evaluation() writes plan_evaluation.json at the end of every run with turns_allocated, turns_used, and completeness per directory. This is the planning pass's report card.

4. The dir loop in depth

_run_dir_loop() is a hand-written agent loop, and you should expect to read it several times before it clicks. As of #57 the loop body itself is a thin coordinator (~25 lines): it calls three helpers that own the layers it used to inline.

Helper	Job
`_build_dir_loop_context()`	Pure setup. Builds dir context, child summaries, survey block, filtered tool list, system prompt, and the seed user message. Returns a `_DirLoopContext` namedtuple.
`_flush_partial_dir_entry()`	Idempotent partial-cache writer for the budget-exceeded path. Synthesizes a summary from already-cached file entries when possible, or writes a "no files processed" stub. Returns the partial summary string.
`_handle_turn_response()`	Per-turn response processing. Prints text blocks and tool decisions to stderr, appends the assistant message, dispatches tools (or nudges the agent to call submit_report), appends tool_results. Returns `(done, summary, completeness)`.

The shape of the loop body is now:

ctx = _build_dir_loop_context(...)
reset per-loop token counter
for turn in range(max_turns):                   # max_turns from plan (5–25)
    if budget exceeded:
        print warning
        partial = _flush_partial_dir_entry(...)
        if partial: summary = partial
        break
    call API (streaming)
    done, turn_summary, turn_completeness = _handle_turn_response(...)
    if turn_summary: summary = turn_summary
    if turn_completeness: completeness = turn_completeness
    if done: break
return (summary, completeness)

A few non-obvious mechanics:

4.1 The message history grows monotonically

Every turn appends an assistant message (the model's response) and a user message (the tool results). Nothing is ever evicted. This means input_tokens on each successive API call grows roughly linearly — the model is re-sent the full conversation every turn. On code targets we see ~1.5–2k tokens added per turn. At max_turns=14 this stays under the budget; raising the cap would expose this. With Phase 3's priority-tier cap of 25, we're still well under budget in practice but closer to the ceiling. See #51.

4.2 Tool dispatch

Tools are plain functions in ai.py. They are wired up via a single register_tool() call that lands the schema in one or more scope lists (_DIR_TOOLS, _SYNTHESIS_TOOLS, _SURVEY_TOOLS, _PLANNING_TOOLS) and the handler in _TOOL_DISPATCH. The registrations live below the tool implementations in ai.py and read top-to-bottom in dir-then-synthesis-then-survey-then-planning order.

_execute_tool() looks up the handler by name in _TOOL_DISPATCH, calls it, logs the turn to investigation.log, and returns the result string. Tools intercepted by the loop body — submit_report, submit_survey, submit_plan — register their schema only and have no handler entry. _handle_turn_response() recognizes submit_report specially: it sets done = True, extracts the summary from the tool input, and also extracts the optional completeness field (Phase 3 instrumentation).

think, checkpoint, and flag are in dispatch, but they have side effects that just print to stderr or append to flags.jsonl — the return value is always "ok".

When you add a tool: write the function, then add one register_tool() call below it. That's it. There is no second place to forget.

4.3 Pre-loaded context

Before the loop starts, _build_dir_loop_context() calls two helpers that prepare static context for the system prompt:

_build_dir_context() — ls-style listing of the dir with sizes and MIME types via python-magic. The agent sees this before it makes any tool calls, so it doesn't waste a turn just listing the directory.
_get_child_summaries() — looks up each subdirectory in the cache and pulls its summary field. This is how leaves-first ordering pays off: by the time the loop runs on src/, all of src/auth/, src/db/, src/middleware/ already have cached summaries that get injected as {child_summaries}.

If _get_child_summaries() returns nothing, the prompt distinguishes leaf directories ("(none: this is a leaf directory)") from parents whose children haven't been investigated yet ("(child directories exist but have not been investigated yet)"). See §4.7.

4.4 The token tracker and the budget check

_TokenTracker is a tiny accumulator with one important subtlety, captured in #44:

Cumulative input tokens are NOT a meaningful proxy for context size: each turn's input_tokens already includes the full message history, so summing across turns double-counts everything. Use last_input for budget decisions, totals for billing.

So budget_exceeded() compares last_input (the most recent call's input_tokens) to CONTEXT_BUDGET, which is 70% of 200k. This is checked at the top of each loop iteration, before the next API call.

When the budget check trips, the loop:

Prints a Context budget reached warning to stderr
Calls _flush_partial_dir_entry(), which writes a partial dir cache entry from any file cache entries the agent already produced, marked with partial: True and partial_reason. The helper is idempotent — if a dir entry already exists, it returns "" without writing.
Breaks out of the loop

This means a budget breach doesn't lose work — anything the agent already cached survives, and the synthesis pass will see a partial dir summary rather than nothing.

4.5 What the loop returns

_run_dir_loop() returns (summary, completeness). The summary is the string from submit_report (or the partial summary returned by _flush_partial_dir_entry() if the budget tripped). The completeness is the agent's self-rated investigation thoroughness (0.0–1.0) — Phase 3 instrumentation used in plan_evaluation.json — or None if the agent didn't report one.

_run_investigation() writes a normal dir cache entry from this summary (with completeness included if non-None), unless the dir loop already wrote one itself via the partial-flush path, in which case the cache.has_entry("dir", dir_path) check skips it.

4.6 The streaming API caller

_call_api_streaming() is a thin wrapper around client.messages.stream(). It currently doesn't print tokens as they arrive — it iterates the stream, drops everything, then pulls the final message via stream.get_final_message(). The streaming API is used for real-time tool decision printing, which today happens only after the full response arrives. There's room here to add live progress printing if you want it.

4.7 The leaf-first contract (load-bearing for child summaries)

_discover_directories() returns directories sorted leaves-first (the deepest paths first, parents last). This is not a stylistic choice. It is a load-bearing invariant. _get_child_summaries() depends on it.

When the dir loop runs on a parent like src/, _get_child_summaries() reads the cache for each subdirectory of src/ (src/auth/, src/db/, src/middleware/) and injects their existing summaries into the parent's system prompt under {child_summaries}. This is how the agent gets context about parts of the project it isn't currently inside without re-reading them, and it is the entire payoff of leaves-first ordering.

The trick: those subdirectory summaries only exist if the children were investigated first. If src/ runs before src/auth/, the cache lookup returns nothing.

Phase 3 addressed this contract in two ways:

Band-sorted ordering preserves leaf-first within priority bands. _apply_plan() groups directories into priority/default/shallow bands but keeps the leaf-first sort within each band. So children always run before their parents, even in "priority-first" mode.
The placeholder was fixed. _get_child_summaries() now distinguishes actual leaf directories ("this is a leaf directory") from parents whose children haven't been investigated yet ("child directories exist but have not been investigated yet"). The old placeholder claimed every empty-cache case was a leaf, which was a lie when children simply hadn't been processed yet.

The test class TestDiscoverDirectories (in tests/test_ai_pure.py) pins the base leaf-first ordering. TestGetChildSummaries pins the updated placeholder behavior. See Planning Pass for the full design.

5. The cache model

Cache lives at /tmp/luminos/{investigation_id}/. Code is luminos_lib/cache.py.

5.1 Investigation IDs

/tmp/luminos/investigations.json maps absolute target paths to UUIDs. _get_investigation_id() looks up the target and either returns the existing UUID (resume) or creates a new one (fresh run). --fresh forces a new UUID even if one exists.

5.2 What's stored

Inside /tmp/luminos/{uuid}/:

meta.json                 investigation metadata (model, start time, dir count)
plan.json                 planning pass output — cached for resumed runs
plan_evaluation.json      post-investigation quality report (Phase 3)
files/<sha256>.json       one file per cached file entry
dirs/<sha256>.json        one file per cached directory entry
flags.jsonl               JSONL — appended on every flag tool call
investigation.log         JSONL — appended on every tool call

File and dir cache entries are NOT in JSONL — they are one sha256-keyed JSON file per entry. The sha256 is over the path string. Only flags.jsonl and investigation.log use JSONL.

Required fields are validated in write_entry():

file: {path, relative_path, size_bytes, category, summary, cached_at}
dir:  {path, relative_path, child_count, dominant_category, summary, cached_at}

The validator also rejects entries containing content, contents, or raw fields — the agent is explicitly forbidden from caching raw file contents, summaries only. If you change the schema, update the required set in write_entry() and update the test in tests/test_cache.py.

5.3 Confidence + completeness support

write_entry() validates optional confidence and confidence_reason fields (Phase 1) and an optional completeness field (Phase 3, 0.0–1.0, the dir agent's self-rated thoroughness). low_confidence_entries(threshold=0.7) returns all entries below a threshold, sorted ascending — future refinement-pass fuel.

5.4 Why one-file-per-entry instead of JSONL

Random access by path. The dir loop calls cache.has_entry("dir", path) once per directory during the _get_child_summaries() lookup; with sha256-keyed files this is an os.path.exists() call. With JSONL it would be a full file scan.

5.5 The planning files

plan.json is written by _run_investigation() after a successful planning pass, so resumed runs can skip the planner. It is loaded before the dir loops run when --fresh is not set and the file exists.

plan_evaluation.json is written by _write_plan_evaluation() after the dir loops finish. Schema: plan_order, total_dirs_investigated, total_turns_allocated, total_turns_used, overall_utilization, per_directory (list of {dir, planned_tier, turns_allocated, turns_used, utilization, completeness, confidence}), evaluated_at. See Planning Pass for how to use it.

6. Prompts

All prompt templates live in luminos_lib/prompts.py. There are four:

Constant	Used by	What it carries
`_SURVEY_SYSTEM_PROMPT`	`_run_survey`	survey_signals, tree_preview, available_tools
`_PLANNING_SYSTEM_PROMPT`	`_run_planning`	survey, tree, file signals, cached_dirs
`_DIR_SYSTEM_PROMPT`	`_run_dir_loop`	dir_path, dir_rel, max_turns, context, child_summaries, survey_context
`_SYNTHESIS_SYSTEM_PROMPT`	`_run_synthesis`	target, summaries_text

Each is a Python f-string-style template with {name} placeholders. The caller assembles values and passes them to .format(...) immediately before the API call. There is no template engine — it's plain string formatting.

When you change a prompt, the only thing you need to keep in sync is the set of placeholders. If you add {foo} to the template, the caller must provide foo=.... If you remove a placeholder from the template but leave the kwarg in the caller, .format() silently ignores it. If you add a placeholder and forget to provide it, .format() raises KeyError at runtime.

prompts.py has no logic and no tests — it's listed in Development Guide as exempt from unit testing for that reason.

7. Synthesis pass

_run_synthesis() is structurally similar to the dir loop but much simpler:

Reads all dir cache entries via cache.read_all_entries("dir")
Renders them into a summaries_text block (one section per dir)
Stuffs that into _SYNTHESIS_SYSTEM_PROMPT
Loops up to max_turns=5 waiting for submit_report with brief and detailed fields

Tools available: read_cache, list_cache, flag, submit_report (_SYNTHESIS_TOOLS). The synthesis agent can pull specific cache entries back if it needs to drill in, but it cannot read files directly — synthesis is meant to operate on summaries, not raw contents.

There's a fallback: if synthesis runs out of turns without calling submit_report, _synthesize_from_cache() builds a mechanical brief+detailed from the cached dir summaries with no AI call. This guarantees you always get something in the report.

8. Flags

The flag tool is the agent's pressure valve for "I noticed something that should not be lost in the summary." _tool_flag() prints to stderr and appends a JSONL line to {cache.root}/flags.jsonl. At the end of _run_investigation(), the orchestrator reads that file back and includes the flags in its return tuple. format_report() then renders them in a dedicated section.

Severity is info | concern | critical. The agent is told to flag immediately on discovery, not save findings for the report — this is in the tool description.

9. Where to make common changes

A cookbook for the kinds of changes that come up most often.

9.1 Add a new tool the dir agent can call

Write the implementation: _tool_<name>(args, target, cache) in the tool implementations section of ai.py. Return a string.
Add a register_tool() call in the registrations block at the bottom of the tool implementations section, with scopes=["dir"] and handler=_tool_<name>. The schema follows Anthropic tool-use shape: name, description, input_schema.
Decide whether the survey should be able to filter it out (default: yes — leave it out of _PROTECTED_DIR_TOOLS) or whether it's control-flow critical (add to _PROTECTED_DIR_TOOLS).
Update _DIR_SYSTEM_PROMPT in prompts.py if the agent needs instructions on when to use the new tool.
There is no unit test for tool registration today (ai.py is exempt). If you want coverage, the test would assert that _TOOL_DISPATCH contains your handler and _DIR_TOOLS contains your schema after importing luminos_lib.ai.

To make a tool available in synthesis, survey, or planning instead of (or in addition to) dir, pass scopes=["synthesis"], scopes=["survey"], scopes=["planning"], or any combination. Tools whose schema differs by scope (like submit_report) get a separate register_tool() call per scope.

9.2 Add a whole new pass

(Phase 3's planning pass is the immediate example.) The pattern:

Define a new system prompt constant in prompts.py
Define a new tool list in ai.py for the pass-specific submit tool
Write _run_<pass>() in ai.py, modeled on _run_survey() — single submit tool, low max_turns, returns a dict or None on failure
Wire it into _run_investigation() between existing passes
Pass its output downstream by adding a kwarg to _run_dir_loop() (or wherever it's needed) and threading it through

The survey pass is the cleanest reference implementation because it's short and self-contained.

9.3 Change a prompt

Edit the constant in prompts.py. If you add a {placeholder}, also update the corresponding .format(...) call in ai.py. Search the codebase for the constant name to find the call site:

grep -n SURVEY_SYSTEM_PROMPT luminos_lib/ai.py

There is no prompt versioning today. Investigation cache entries don't record which prompt version produced them, so re-running with a new prompt against an existing investigation will mix old and new outputs unless you --fresh.

9.4 Change cache schema

Update the required-fields set in cache.py:write_entry()
Update _DIR_TOOLS's write_cache description in ai.py so the agent knows what to write
Update _DIR_SYSTEM_PROMPT in prompts.py if the agent needs to know how to populate the new field
Update tests/test_cache.py — schema validation is the part of the cache that is covered

9.5 Add a CLI flag

Edit luminos.py:main()'s argparse setup to define the flag, then plumb it through whatever functions need it. New AI-related flags typically need to be added to analyze_directory()'s signature and then forwarded to _run_investigation().

10. Token budget and cost

Budget logic is in _TokenTracker.budget_exceeded() and is checked at the top of every dir loop iteration. The budget is per call, not cumulative — see §4.4. The breach handler flushes a partial dir cache entry so work isn't lost.

Cost reporting happens once at the end of _run_investigation(), using the cumulative total_input and total_output counters multiplied by the constants near the top of ai.py. There is no running cost display during the investigation today. If you want one, _TokenTracker.summary() already returns the formatted string — just call it after each dir loop.

11. Glossary

Term	Meaning
base scan	The non-AI phase: tree, classification, languages, recency, disk usage. Stdlib + coreutils only.
dir loop	Per-directory agent loop in `_run_dir_loop`. Turns allocated by the planning pass (5 shallow / 10 default / 15–20 priority, capped at 25). Produces a `dir` cache entry.
survey pass	Single short loop before any dir loops, producing a shared description and tool guidance.
planning pass	Phase 3 pass after the survey, before dir loops. Produces a plan (priority/shallow/skip dirs + turn allocations + order).
synthesis pass	Final loop that reads `dir` cache entries and produces `(brief, detailed)`.
leaves-first	Discovery order in `_discover_directories`: deepest paths first, so child summaries exist when parents are investigated. Preserved within planning bands by `_apply_plan`.
investigation	One end-to-end run, identified by a UUID, persisted under `/tmp/luminos/{uuid}/`.
investigation_id	The UUID. Stored in `/tmp/luminos/investigations.json` keyed by absolute target path.
cache entry	A JSON file under `files/` or `dirs/` named by sha256(path).
flag	An agent finding written to `flags.jsonl` and reported separately. info / concern / critical.
partial entry	A `dir` cache entry written when the budget tripped before `submit_report`. Marked with `partial: True`.
completeness	Phase 3 agent self-rated thoroughness (0.0–1.0) from `submit_report`. Feeds `plan_evaluation.json`.
survey signals	The histogram + samples computed by `filetypes.survey_signals()` during the base scan, fed to the survey prompt.
last_input	The `input_tokens` count from the most recent API call. The basis for budget checks. NOT the cumulative sum.
CONTEXT_BUDGET	70% of 200k = 140k. Trigger threshold for early exit.
`_PROTECTED_DIR_TOOLS`	Tools the survey is forbidden from filtering out of the dir loop's toolbox. Currently `{submit_report}`.
plan.json	Serialized planning output, cached so resumed runs skip the planner.
plan_evaluation.json	Post-investigation quality report comparing plan predictions to outcomes.