docs: add Planning Pass design sketch, update Architecture and Internals for Phase 3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jeff Smith 2026-04-12 20:32:05 -06:00
parent 31a052eca0
commit 3fcf8c221d
4 changed files with 322 additions and 41 deletions

@ -8,8 +8,9 @@
Luminos is an agentic Claude investigation tool. Every invocation runs the
full pipeline: a base scan first to feed the agent its initial picture, then
a survey pass, then per-directory dir loops, then a final synthesis pass.
The base scan is not a standalone product, it is the agent's input.
a survey pass, a planning pass, per-directory dir loops with dynamic turn
allocation, then a final synthesis pass. The base scan is not a standalone
product, it is the agent's input.
**Entry point:** `luminos.py` — argument parsing, scan orchestration, AI
pipeline kickoff, output routing.
@ -69,7 +70,17 @@ analyze_directory(report, target)
├── _filter_dir_tools(survey) remove skip_tools (if confidence ≥ 0.5)
├── per-directory loop (each uncached dir, up to max_turns=14)
├── _run_planning() single loop, max 3 turns
│ inputs: survey output + full tree + file signals
│ Tools: submit_plan
│ output: plan dict (priority/shallow/skip dirs,
│ turn allocations, investigation order)
│ (skipped on tiny targets or loaded from plan.json
│ on resumed runs)
├── _apply_plan() sort dirs into bands, build turn map
├── per-directory loop (ordered by plan, dynamic max_turns)
│ _build_dir_context() list files + sizes + MIME
│ _get_child_summaries() read cached child summaries
│ _format_survey_block() inject survey context into prompt
@ -78,7 +89,9 @@ analyze_directory(report, target)
│ cache entry on budget breach
│ Tools: read_file, list_directory, run_command,
│ parse_structure, write_cache, think, checkpoint,
│ flag, submit_report
│ flag, submit_report (with completeness)
├── _write_plan_evaluation() plan_evaluation.json quality metrics
├── _run_synthesis() single loop, max 5 turns
│ reads all "dir" cache entries
@ -104,6 +117,8 @@ Layout:
```
meta.json investigation metadata
plan.json planning pass output (cached for resumed runs)
plan_evaluation.json quality metrics: plan predictions vs outcomes
files/<sha256>.json one JSON file per cached file entry
dirs/<sha256>.json one JSON file per cached directory entry
flags.jsonl JSONL — appended on every flag tool call
@ -170,19 +185,18 @@ the *latest* per-call `input_tokens` reading (the actual size of the
context window in use), not the cumulative sum across turns. Early
exit flushes partial cache on budget breach. See #44.
**Per-loop turn cap.** Each dir loop runs for at most `max_turns = 14`
turns. This is a sanity bound separate from the context budget — even
on small targets the agent should produce a `submit_report` long
before exhausting 14 turns. The cap exists to prevent runaway loops
when the agent gets stuck (e.g. repeatedly retrying a failing tool
call). If we observe legitimate investigations consistently hitting
14, raise the cap; do not raise it speculatively.
**Per-loop turn cap.** The planning pass assigns each directory a turn
budget: priority dirs get 15-20 (capped at 25), shallow dirs get 5,
default dirs get 10. This replaced the old fixed `max_turns=14`. The
cap exists to prevent runaway loops when the agent gets stuck. The
`plan_evaluation.json` quality report tracks turns used vs allocated
per directory. See [Planning Pass](PlanningPass) for the full design.
**Per-loop message history growth.** Tool results are appended to the
message history and never evicted, so per-turn `input_tokens` grows
roughly linearly across a loop (~1.5-2k per turn observed on
codebase targets). At the current `max_turns=14` cap this stays well
under 200k. Raising `max_turns` significantly (e.g. via Phase 3
dynamic turn allocation) would expose this — see #51.
roughly linearly across a loop (~1.5-2k per turn observed on
codebase targets). At the current caps (max 25 turns for priority
dirs) this stays under 200k. Raising caps significantly would
expose this further. See #51.
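
As a rough check on the claim above, the linear-growth estimate can be written out. This is a back-of-envelope sketch, not code from `ai.py`: the function name is hypothetical and the 20k starting context is an assumed figure chosen for illustration. Only the ~2k-per-turn growth, the 25-turn cap, and the 200k limit come from the text.

```python
def projected_input_tokens(base_tokens: int, turns: int, growth_per_turn: int) -> int:
    """Estimate per-call input_tokens after `turns` turns of linear growth."""
    return base_tokens + turns * growth_per_turn


CONTEXT_LIMIT = 200_000  # context window size referenced in the text

# Assumption: a 20k-token starting context (system prompt + dir context).
worst_case = projected_input_tokens(base_tokens=20_000, turns=25, growth_per_turn=2_000)
headroom = CONTEXT_LIMIT - worst_case
print(worst_case, headroom)
```

Under these assumed numbers a priority dir at the 25-turn cap uses well under half the window, which is consistent with the "stays under 200k" statement.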
Pricing tracked and reported at end of each run.

@ -10,9 +10,9 @@ runs first to feed the agent its initial picture of the target.
## Current State
- **Phase:** Active development — core pipeline stable, scaling and domain intelligence planned
- **Last worked on:** 2026-04-06
- **Last commit:** merge: add -x/--exclude flag for directory exclusion
- **Phase:** Active development — Phases 1-3 complete. Phase 3 added planning pass with dynamic turn allocation and quality instrumentation.
- **Last worked on:** 2026-04-12
- **Last commit:** feat(ai): Phase 3 investigation planning (#75)
- **Blocking:** None
---
@ -23,6 +23,7 @@ runs first to feed the agent its initial picture of the target.
|---|---|
| [Architecture](Architecture) | Module breakdown, data flow, AI pipeline |
| [Internals](Internals) | Code-level tour: dir loop, cache, prompts, where to make changes |
| [Planning Pass](PlanningPass) | Phase 3 design sketch: dynamic turn allocation, quality metrics |
| [Development Guide](DevelopmentGuide) | Setup, git workflow, testing, commands |
| [Roadmap](Roadmap) | Phase status — pointer to PLAN.md and open issues |
| [Session Retrospectives](SessionRetrospectives) | Full session history |

@ -313,32 +313,26 @@ is the entire payoff of leaves-first ordering.
The trick: those subdirectory summaries only exist if the children
were investigated *first*. If `src/` runs before `src/auth/`, the
cache lookup at `ai.py:825` returns nothing. The function falls
through to its default at `ai.py:832` and returns the string
`(none — this is a leaf directory)`. The parent's system prompt
silently loses all of its child context, and the agent has no way to
know — the placeholder claims the dir is a leaf, which is a lie when
the children just haven't been investigated yet. The dir summary
degrades and the synthesis pass inherits the degradation.
cache lookup returns nothing.
**If you change the investigation order**, you have to do one of:
**Phase 3 addressed this contract in two ways:**
1. **Preserve the leaf-first invariant within whatever new order you
introduce.** A "priority-first" order can still process directories
leaves-first within each priority band, so children always run
before parents.
2. **Explicitly handle the missing-child-summaries case in the
prompt.** Replace the lie ("leaf directory") with the truth
("children not yet investigated") so the agent at least knows what
it doesn't have, and accept that some dirs will run with degraded
context.
1. **Band-sorted ordering preserves leaf-first within priority bands.**
`_apply_plan()` groups directories into priority/default/shallow
bands but keeps the leaf-first sort within each band, so children
run before parents that share their band, even in "priority-first"
mode. A parent placed in a higher band than its child can still run
first; item 2 covers that case.
Phase 3's planning pass introduces the temptation to investigate
priority dirs first. Both alternatives above are open. Whichever is
chosen, this contract has to be addressed *explicitly* — the test
class `TestDiscoverDirectories` (in `tests/test_ai_pure.py`) pins the
current ordering, so any change will be loud, but the *reason* the
ordering matters lives here.
2. **The placeholder was fixed.** `_get_child_summaries()` now
distinguishes actual leaf directories ("this is a leaf directory")
from parents whose children haven't been investigated yet ("child
directories exist but have not been investigated yet"). The old
placeholder claimed every empty-cache case was a leaf, which was a
lie when children simply hadn't been processed yet.
The test class `TestDiscoverDirectories` (in `tests/test_ai_pure.py`)
pins the base leaf-first ordering. `TestGetChildSummaries` pins the
updated placeholder behavior. See [Planning Pass](PlanningPass) for
the full design.
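
The placeholder distinction described in item 2 can be sketched as follows. This is an illustrative stand-in, not the real `_get_child_summaries()`: the function name, its arguments, and the exact output formatting are assumptions. Only the two placeholder messages come from the text above.

```python
def child_summaries_block(child_dirs, cached_summaries):
    """Render the child-summary section of a parent directory's prompt.

    child_dirs: immediate subdirectories found on disk (hypothetical input).
    cached_summaries: dict of path -> summary for already-investigated dirs.
    """
    if not child_dirs:
        # A true leaf: there is nothing to wait for.
        return "(none: this is a leaf directory)"
    lines = [f"- {c}: {cached_summaries[c]}" for c in child_dirs if c in cached_summaries]
    if not lines:
        # Children exist on disk but none have a cache entry yet.
        return "(child directories exist but have not been investigated yet)"
    return "\n".join(lines)


leaf = child_summaries_block([], {})
pending = child_summaries_block(["src/auth"], {})
ready = child_summaries_block(["src/auth"], {"src/auth": "JWT login flow"})
```

The key point is that the leaf check looks at the filesystem, not at the cache, so an empty cache no longer masquerades as "no children".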
---

272
PlanningPass.md Normal file

@ -0,0 +1,272 @@
# Planning Pass Design Sketch
The planning pass is Phase 3 of the Luminos investigation pipeline. It
runs after the survey and before the per-directory dir loops, deciding
where to invest investigative depth across the directory tree.
---
## Problem
Before Phase 3, every directory received the same fixed allocation:
`max_turns=14`. A two-file docs directory got the same budget as a
fifty-file core source directory. This wasted turns on trivial dirs and
under-invested in complex ones.
---
## Solution: Plan Before You Investigate
A single short Claude loop of at most 3 turns (the "planning pass")
examines cheap signals (survey output, full directory tree, file
statistics) and produces a structured plan that the orchestrator uses
to allocate resources.
```
survey pass
| survey dict
v
planning pass <-- NEW
| plan dict (priority/shallow/skip dirs, turn allocations)
v
dir loop (per directory, ordered by plan)
| cached dir entries
v
synthesis pass
```
The planning pass does not read files or explore the filesystem. It is
a "strategy from the map" pass: it looks at structure and makes
judgment calls about where depth will pay off.
---
## Plan Schema
The planning agent produces a plan via the `submit_plan` tool:
```python
{
"priority_dirs": [
{"path": str, "reason": str, "suggested_turns": int}
],
"shallow_dirs": [
{"path": str, "reason": str}
],
"skip_dirs": [
{"path": str, "reason": str}
],
"investigation_order": "leaf-first" | "priority-first",
"notes": str,
}
```
Directories not mentioned in any tier receive a default allocation
(currently 10 turns). The planner does not need to list every
directory; it focuses on cases where the default would clearly be
wrong.
---
## Turn Allocation
| Tier | Turns | When to use |
|---|---|---|
| **priority** | 15-20 (capped at 25) | Complex, central, or important dirs: many source files, core logic, schemas, migrations |
| **default** | 10 | Unlisted dirs; reasonable for most directories |
| **shallow** | 5 | Simple, peripheral, or predictable: few files, test fixtures, static assets, docs-only |
| **skip** | 0 (excluded) | Build output, dependency caches, vendored code, generated artifacts |
The global turn budget is `base_turns_per_dir * dir_count` (10 per
dir). The planner's allocations should roughly respect this budget.
Allocations above the ceiling (25 turns) are capped by the
orchestrator.
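
The tier table can be read as a small lookup with a ceiling. The sketch below is one way the turn map might be built; it is not the actual `_apply_plan()` code. The function and constant names are hypothetical, and falling back to the default when a priority entry omits `suggested_turns` is an assumption.

```python
DEFAULT_TURNS = 10       # default tier
SHALLOW_TURNS = 5        # shallow tier
MAX_TURNS_CEILING = 25   # orchestrator-side cap on priority allocations


def build_turn_map(plan, dirs):
    """Map each non-skipped directory to its turn budget."""
    priority = {
        d["path"]: d.get("suggested_turns", DEFAULT_TURNS)
        for d in plan.get("priority_dirs", [])
    }
    shallow = {d["path"] for d in plan.get("shallow_dirs", [])}
    skipped = {d["path"] for d in plan.get("skip_dirs", [])}

    turn_map = {}
    for path in dirs:
        if path in skipped:
            continue  # skip tier: excluded from investigation entirely
        if path in priority:
            turn_map[path] = min(priority[path], MAX_TURNS_CEILING)
        elif path in shallow:
            turn_map[path] = SHALLOW_TURNS
        else:
            turn_map[path] = DEFAULT_TURNS  # unlisted dirs
    return turn_map


plan = {
    "priority_dirs": [{"path": "src/core", "reason": "core logic", "suggested_turns": 40}],
    "shallow_dirs": [{"path": "docs", "reason": "docs only"}],
    "skip_dirs": [{"path": "node_modules", "reason": "dependency cache"}],
}
turn_map = build_turn_map(plan, ["src/core", "docs", "node_modules", "tests"])
```

Here the planner's request for 40 turns on `src/core` is clamped to 25, `docs` gets 5, the unlisted `tests` gets 10, and `node_modules` does not appear in the map.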
### Why no mid-loop borrowing (yet)
PLAN.md envisions a global budget with mid-loop turn borrowing (an
agent that needs more turns can "borrow" from the remaining budget).
This requires inter-loop communication that does not exist today. The
v1 implementation uses simple per-directory allocation with no
borrowing. If the quality instrumentation shows that priority dirs
consistently exhaust their allocation while shallow dirs finish early,
borrowing becomes worth building.
---
## Investigation Order
Two strategies are available:
**leaf-first** (default): the existing order from `_discover_directories()`.
Deepest directories first, parents last. Ensures child summaries are
always cached before parent investigation begins.
**priority-first**: priority directories before shallow/default, but
leaf-first *within each band*. This preserves the child-summaries
invariant inside a band while letting high-value subtrees inform the
rest of the investigation.
Both strategies keep the within-band leaf-first ordering required by
the contract documented in [Internals](Internals) section 4.7. The
`_apply_plan()` function sorts directories into bands without breaking
that ordering. A priority parent can still run before a child that
landed in a lower band; the placeholder fix described under Design
Decisions covers that case.
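
Band sorting falls out naturally from a stable sort. The following is a minimal sketch under stated assumptions, not the real `_apply_plan()`: the function name is hypothetical, and the band order (priority, then default, then shallow) is inferred from the description above.

```python
def order_dirs(leaf_first_dirs, plan):
    """Reorder an already leaf-first list according to the plan."""
    if plan.get("investigation_order") != "priority-first":
        return list(leaf_first_dirs)

    priority = {d["path"] for d in plan.get("priority_dirs", [])}
    shallow = {d["path"] for d in plan.get("shallow_dirs", [])}

    def band(path):
        if path in priority:
            return 0
        if path in shallow:
            return 2
        return 1  # default band for unlisted dirs

    # sorted() is stable, so the incoming leaf-first order is kept
    # among directories that share a band.
    return sorted(leaf_first_dirs, key=band)


dirs = ["docs", "src/core/util", "src/core", "src"]  # leaf-first input
plan = {
    "investigation_order": "priority-first",
    "priority_dirs": [{"path": "src/core/util"}, {"path": "src/core"}],
    "shallow_dirs": [{"path": "docs"}],
}
ordered = order_dirs(dirs, plan)
```

Because the sort key is only the band number, no explicit depth comparison is needed: `src/core/util` still precedes `src/core` inside the priority band.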
---
## Inputs to the Planner
The planning agent receives four signals:
1. **Survey output**: the full survey dict (description, approach,
domain notes, tool recommendations), formatted as a text block.
2. **Full directory tree**: `render_tree()` output at depth 6 (deeper
than the survey's 2-level preview).
3. **File signals**: extension histogram, `file --brief` descriptions,
filename samples (the same raw signals the survey sees).
4. **Cached directories**: which dirs are already cached from a prior
run (so the planner knows what will be skipped).
---
## Fallback Behavior
The planning pass degrades gracefully:
- **Small targets** (below `_SURVEY_MIN_FILES` and `_SURVEY_MIN_DIRS`):
planning is skipped entirely, same threshold as the survey. All dirs
get the default allocation in leaf-first order.
- **Planning fails** (API error, agent doesn't call `submit_plan`):
`_default_plan()` returns an empty plan. All dirs get 10 turns,
leaf-first order. The investigation proceeds as if Phase 3 didn't
exist.
- **Resumed runs**: the plan is cached as `plan.json` in the
investigation cache. On resume (without `--fresh`), the cached plan
is loaded and `_run_planning()` is skipped.
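
The three fallback paths above can be sketched as a single resolution function. This is an illustration only: `resolve_plan`, its parameters, and the choice not to cache a failed plan are assumptions, and the `run_planner` callable stands in for `_run_planning()`.

```python
import json
from pathlib import Path


def default_plan():
    """Empty plan: every dir gets the default allocation, leaf-first."""
    return {
        "priority_dirs": [],
        "shallow_dirs": [],
        "skip_dirs": [],
        "investigation_order": "leaf-first",
        "notes": "",
    }


def resolve_plan(cache_root, run_planner, *, fresh=False, too_small=False):
    """Return the plan to use, honoring the fallbacks described above."""
    if too_small:
        return default_plan()  # small target: planning skipped

    plan_path = Path(cache_root) / "plan.json"
    if not fresh and plan_path.exists():
        return json.loads(plan_path.read_text())  # resumed run

    try:
        plan = run_planner()
    except Exception:
        plan = None  # API error or similar failure
    if not plan:
        return default_plan()  # planner failed or never submitted a plan

    plan_path.write_text(json.dumps(plan, indent=2))  # cache for resume
    return plan
```

Ordering the checks this way means a resumed run never pays for a second planning call, and a planner failure never blocks the investigation.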
---
## Quality Instrumentation
Phase 3 ships with built-in measurement so we can tell whether planning
actually improves investigation quality. Three metrics:
### Turn utilization
Tracked per directory: turns allocated vs turns used. An agent that
finishes in 3 turns on an 18-turn budget suggests over-allocation. An
agent that hits the cap on a 5-turn budget suggests under-allocation.
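
One possible way to turn these two signals into labels is shown below. The function name and the 0.5 low-water threshold are assumptions made for illustration; the source does not define a cutoff.

```python
def classify_allocation(turns_allocated, turns_used, low_water=0.5):
    """Label a directory's turn budget based on how much of it was used."""
    if turns_allocated <= 0:
        raise ValueError("turns_allocated must be positive")
    if turns_used >= turns_allocated:
        return "under-allocated"  # the loop hit its cap
    if turns_used / turns_allocated < low_water:
        return "over-allocated"   # most of the budget went unused
    return "ok"


labels = [
    classify_allocation(18, 3),   # finished early on a large budget
    classify_allocation(5, 5),    # hit the cap on a small budget
    classify_allocation(18, 14),  # used most of the budget
]
```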
### Completeness self-rating
The `submit_report` tool (dir scope) now includes a `completeness`
field (0.0-1.0). The agent rates how thoroughly it investigated the
directory. This is not perfectly reliable (it is a self-assessment),
but it provides signal: a priority dir with completeness 0.3 probably
needed more turns; a shallow dir with completeness 0.95 probably
did not need more than its 5 turns.
### plan_evaluation.json
Written at the end of every investigation, this file is the planning
pass's report card. It compares plan predictions to outcomes:
```json
{
"plan_order": "leaf-first",
"total_dirs_investigated": 12,
"total_turns_allocated": 120,
"total_turns_used": 87,
"overall_utilization": 0.73,
"per_directory": [
{
"dir": "src/core",
"planned_tier": "priority",
"turns_allocated": 18,
"turns_used": 14,
"utilization": 0.78,
"completeness": 0.9,
"confidence": 0.85
}
],
"evaluated_at": "2026-04-12T..."
}
```
Run luminos on the same target before and after a change to compare
these metrics. The golden set for baseline comparison is luminos itself.
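
The aggregate fields in the report can be derived from the per-directory records. The sketch below assumes that derivation and that utilization is rounded to two decimals; it is not the actual `_write_plan_evaluation()` code, and the function name is hypothetical.

```python
def summarize_evaluation(per_directory):
    """Compute aggregate utilization fields from per-directory records."""
    allocated = sum(d["turns_allocated"] for d in per_directory)
    used = sum(d["turns_used"] for d in per_directory)
    return {
        "total_dirs_investigated": len(per_directory),
        "total_turns_allocated": allocated,
        "total_turns_used": used,
        # Guard against an empty run so the division cannot fail.
        "overall_utilization": round(used / allocated, 2) if allocated else 0.0,
    }


records = [
    {"dir": "src/core", "turns_allocated": 18, "turns_used": 14},
    {"dir": "docs", "turns_allocated": 5, "turns_used": 5},
    {"dir": "tests", "turns_allocated": 10, "turns_used": 6},
]
summary = summarize_evaluation(records)
```

Comparing `summary["overall_utilization"]` across two runs on the same target gives a single number to track, with the per-directory records available when the aggregate moves.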
---
## Implementation Map
| Component | Location | Purpose |
|---|---|---|
| `_PLANNING_SYSTEM_PROMPT` | `prompts.py` | System prompt for the planning agent |
| `submit_plan` tool | `ai.py` (planning scope) | Tool schema for plan submission |
| `_run_planning()` | `ai.py` | Runs the planning pass (follows `_run_survey` pattern) |
| `_apply_plan()` | `ai.py` | Pure function: plan + dir list to ordered list + turn map |
| `_default_plan()` | `ai.py` | Fallback empty plan |
| `_write_plan_evaluation()` | `ai.py` | Writes `plan_evaluation.json` after dir loops |
| `_TokenTracker._loop_turns` | `ai.py` | Counts API calls per dir loop for utilization tracking |
| `plan.json` | cache root | Persisted plan for resumed runs |
| `plan_evaluation.json` | cache root | Post-investigation quality report |
---
## Design Decisions
### Why band-sorted order instead of arbitrary reordering
The leaf-first contract (`_get_child_summaries()`) is load-bearing.
Breaking it silently degrades parent summaries because child cache
entries don't exist yet. Band-sorting preserves leaf-first within each
priority band, giving us "priority-first" without losing child context.
### Why per-directory allocation instead of a shared global pool
A shared pool with mid-loop borrowing requires the orchestrator to
communicate with running agents, which doesn't exist in the current
architecture (each `_run_dir_loop` call is independent). Per-directory
allocation is a strict improvement over fixed-14-for-everyone with zero
new machinery. The quality instrumentation will tell us if borrowing is
worth building.
### Why the child-summaries placeholder was fixed
`_get_child_summaries()` previously returned "this is a leaf directory"
for any directory with no cached children, whether it was actually a
leaf or just hadn't been investigated yet. With priority-first ordering,
this lie becomes more likely to trigger. The fix distinguishes the two
cases: actual leaves get "this is a leaf directory", uninvestigated
parents get "child directories exist but have not been investigated
yet".
### Why completeness is a self-rating
An external completeness metric would require knowing "how many files
should have been examined", which depends on the directory contents and
is exactly the kind of judgment the agent makes. Self-rating is
imperfect but cheap, and the correlation between self-rated
completeness and turn utilization gives us a useful signal even if the
absolute values aren't perfectly calibrated.
---
## Future Work
- **Mid-loop turn borrowing**: if utilization data shows priority dirs
consistently hit their cap while others finish early, implement a
shared budget pool.
- **Plan refinement**: after the first few dir loops have run,
re-evaluate the plan based on early findings (some "shallow" dirs
might turn out to be important).
- **Cross-run learning**: use `plan_evaluation.json` from prior runs to
improve planning on similar targets.
---
## References
- Issues: #8, #9, #10, #11, #74
- PR: #75
- PLAN.md Part 4: Investigation Planning
- [Internals](Internals) section 4.7: leaf-first contract