docs: add Planning Pass design sketch, update Architecture and Internals for Phase 3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 20:32:05 -06:00
commit 3fcf8c221d
parent 31a052eca0
4 changed files with 322 additions and 41 deletions
--- a/Architecture.md
+++ b/Architecture.md
@@ -8,8 +8,9 @@
 Luminos is an agentic Claude investigation tool. Every invocation runs the
 full pipeline: a base scan first to feed the agent its initial picture, then
-a survey pass, then per-directory dir loops, then a final synthesis pass.
+a survey pass, a planning pass, per-directory dir loops with dynamic turn
-The base scan is not a standalone product, it is the agent's input.
+allocation, then a final synthesis pass. The base scan is not a standalone
 product, it is the agent's input.
 **Entry point:** `luminos.py` — argument parsing, scan orchestration, AI
 pipeline kickoff, output routing.
@@ -69,7 +70,17 @@ analyze_directory(report, target)
            │
            ├── _filter_dir_tools(survey)  remove skip_tools (if confidence ≥ 0.5)
            │
-            ├── per-directory loop (each uncached dir, up to max_turns=14)
+            ├── _run_planning()            single loop, max 3 turns
            │       inputs: survey output + full tree + file signals
            │       Tools: submit_plan
            │       output: plan dict (priority/shallow/skip dirs,
            │               turn allocations, investigation order)
            │       (skipped on tiny targets or loaded from plan.json
            │        on resumed runs)
            │
            ├── _apply_plan()              sort dirs into bands, build turn map
            │
            ├── per-directory loop (ordered by plan, dynamic max_turns)
            │       _build_dir_context()    list files + sizes + MIME
            │       _get_child_summaries()  read cached child summaries
            │       _format_survey_block()  inject survey context into prompt
@@ -78,7 +89,9 @@ analyze_directory(report, target)
            │                               cache entry on budget breach
            │       Tools: read_file, list_directory, run_command,
            │              parse_structure, write_cache, think, checkpoint,
-            │              flag, submit_report
+            │              flag, submit_report (with completeness)
            │
            ├── _write_plan_evaluation()   plan_evaluation.json quality metrics
            │
            ├── _run_synthesis()            single loop, max 5 turns
            │       reads all "dir" cache entries
@@ -104,6 +117,8 @@ Layout:
 ```
 meta.json              investigation metadata
 plan.json              planning pass output (cached for resumed runs)
 plan_evaluation.json   quality metrics: plan predictions vs outcomes
 files/<sha256>.json    one JSON file per cached file entry
 dirs/<sha256>.json     one JSON file per cached directory entry
 flags.jsonl            JSONL — appended on every flag tool call
@@ -170,19 +185,18 @@ the *latest* per-call `input_tokens` reading (the actual size of the
 context window in use), not the cumulative sum across turns. Early
 exit flushes partial cache on budget breach. See #44.
-**Per-loop turn cap.** Each dir loop runs for at most `max_turns = 14`
+**Per-loop turn cap.** The planning pass assigns each directory a turn
-turns. This is a sanity bound separate from the context budget — even
+budget: priority dirs get 15-20 (capped at 25), shallow dirs get 5,
-on small targets the agent should produce a `submit_report` long
+default dirs get 10. This replaced the old fixed `max_turns=14`. The
-before exhausting 14 turns. The cap exists to prevent runaway loops
+cap exists to prevent runaway loops when the agent gets stuck. The
-when the agent gets stuck (e.g. repeatedly retrying a failing tool
+`plan_evaluation.json` quality report tracks turns used vs allocated
-call). If we observe legitimate investigations consistently hitting
+per directory. See [Planning Pass](PlanningPass) for the full design.
 14, raise the cap; do not raise it speculatively.
 **Per-loop message history growth.** Tool results are appended to the
 message history and never evicted, so per-turn `input_tokens` grows
-roughly linearly across a loop (~1.5–2k per turn observed on
+roughly linearly across a loop (~1.5-2k per turn observed on
-codebase targets). At the current `max_turns=14` cap this stays well
+codebase targets). At the current caps (max 25 turns for priority
-under 200k. Raising `max_turns` significantly (e.g. via Phase 3
+dirs) this stays under 200k. Raising caps significantly would
-dynamic turn allocation) would expose this — see #51.
+expose this further. See #51.
 Pricing tracked and reported at end of each run.
--- a/Home.md
+++ b/Home.md
@@ -10,9 +10,9 @@ runs first to feed the agent its initial picture of the target.
 ## Current State
- **Phase:** Active development — core pipeline stable, scaling and domain intelligence planned
+- **Phase:** Active development — Phases 1-3 complete. Phase 3 added planning pass with dynamic turn allocation and quality instrumentation.
- **Last worked on:** 2026-04-06
+- **Last worked on:** 2026-04-12
- **Last commit:** merge: add -x/--exclude flag for directory exclusion
+- **Last commit:** feat(ai): Phase 3 investigation planning (#75)
 - **Blocking:** None
 ---
@@ -23,6 +23,7 @@ runs first to feed the agent its initial picture of the target.
 |---|---|
 | [Architecture](Architecture) | Module breakdown, data flow, AI pipeline |
 | [Internals](Internals) | Code-level tour: dir loop, cache, prompts, where to make changes |
 | [Planning Pass](PlanningPass) | Phase 3 design sketch: dynamic turn allocation, quality metrics |
 | [Development Guide](DevelopmentGuide) | Setup, git workflow, testing, commands |
 | [Roadmap](Roadmap) | Phase status — pointer to PLAN.md and open issues |
 | [Session Retrospectives](SessionRetrospectives) | Full session history |
--- a/Internals.md
+++ b/Internals.md
@@ -313,32 +313,26 @@ is the entire payoff of leaves-first ordering.
 The trick: those subdirectory summaries only exist if the children
 were investigated *first*. If `src/` runs before `src/auth/`, the
-cache lookup at `ai.py:825` returns nothing. The function falls
+cache lookup returns nothing.
 through to its default at `ai.py:832` and returns the string
 `(none — this is a leaf directory)`. The parent's system prompt
 silently loses all of its child context, and the agent has no way to
 know — the placeholder claims the dir is a leaf, which is a lie when
 the children just haven't been investigated yet. The dir summary
 degrades and the synthesis pass inherits the degradation.
-**If you change the investigation order**, you have to do one of:
+**Phase 3 addressed this contract in two ways:**
-1. **Preserve the leaf-first invariant within whatever new order you
+1. **Band-sorted ordering preserves leaf-first within priority bands.**
-   introduce.** A "priority-first" order can still process directories
+   `_apply_plan()` groups directories into priority/default/shallow
-   leaves-first within each priority band, so children always run
+   bands but keeps the leaf-first sort within each band. So children
-   before parents.
+   always run before their parents, even in "priority-first" mode.
 2. **Explicitly handle the missing-child-summaries case in the
   prompt.** Replace the lie ("leaf directory") with the truth
   ("children not yet investigated") so the agent at least knows what
   it doesn't have, and accept that some dirs will run with degraded
   context.
-Phase 3's planning pass introduces the temptation to investigate
+2. **The placeholder was fixed.** `_get_child_summaries()` now
-priority dirs first. Both alternatives above are open. Whichever is
+   distinguishes actual leaf directories ("this is a leaf directory")
-chosen, this contract has to be addressed *explicitly* — the test
+   from parents whose children haven't been investigated yet ("child
-class `TestDiscoverDirectories` (in `tests/test_ai_pure.py`) pins the
+   directories exist but have not been investigated yet"). The old
-current ordering, so any change will be loud, but the *reason* the
+   placeholder claimed every empty-cache case was a leaf, which was a
-ordering matters lives here.
+   lie when children simply hadn't been processed yet.
 The test class `TestDiscoverDirectories` (in `tests/test_ai_pure.py`)
 pins the base leaf-first ordering. `TestGetChildSummaries` pins the
 updated placeholder behavior. See [Planning Pass](PlanningPass) for
 the full design.
 ---
--- a/PlanningPass.md
+++ b/PlanningPass.md
@@ -0,0 +1,272 @@
 # Planning Pass Design Sketch
 The planning pass is Phase 3 of the Luminos investigation pipeline. It
 runs after the survey and before the per-directory dir loops, deciding
 where to invest investigative depth across the directory tree.
 ---
 ## Problem
 Before Phase 3, every directory received the same fixed allocation:
 `max_turns=14`. A two-file docs directory got the same budget as a
 fifty-file core source directory. This wasted turns on trivial dirs and
 under-invested in complex ones.
 ---
 ## Solution: Plan Before You Investigate
 A short Claude loop of at most 3 turns (the "planning pass") examines cheap
 signals (survey output, full directory tree, file signals, and the list of
 already-cached directories) and produces a structured plan that the
 orchestrator uses to allocate resources.
 ```
 survey pass
    |  survey dict
    v
 planning pass    <-- NEW
    |  plan dict (priority/shallow/skip dirs, turn allocations)
    v
 dir loop (per directory, ordered by plan)
    |  cached dir entries
    v
 synthesis pass
 ```
 The planning pass does not read files or explore the filesystem. It is
 a "strategy from the map" pass: it looks at structure and makes
 judgment calls about where depth will pay off.
 ---
 ## Plan Schema
 The planning agent produces a plan via the `submit_plan` tool:
 ```python
 {
    "priority_dirs": [
        {"path": str, "reason": str, "suggested_turns": int}
    ],
    "shallow_dirs": [
        {"path": str, "reason": str}
    ],
    "skip_dirs": [
        {"path": str, "reason": str}
    ],
    "investigation_order": "leaf-first" | "priority-first",
    "notes": str,
 }
 ```
 Directories not mentioned in any tier receive a default allocation
 (currently 10 turns). The planner does not need to list every
 directory; it focuses on cases where the default would clearly be
 wrong.
 ---
 ## Turn Allocation
 | Tier | Turns | When to use |
 |---|---|---|
 | **priority** | 15-20 (capped at 25) | Complex, central, or important dirs: many source files, core logic, schemas, migrations |
 | **default** | 10 | Unlisted dirs; reasonable for most directories |
 | **shallow** | 5 | Simple, peripheral, or predictable: few files, test fixtures, static assets, docs-only |
 | **skip** | 0 (excluded) | Build output, dependency caches, vendored code, generated artifacts |
 The global turn budget is `base_turns_per_dir * dir_count`, where
 `base_turns_per_dir` is 10. The planner's allocations should roughly
 respect this budget. The orchestrator caps any single allocation above
 the 25-turn ceiling.
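The tier table and the ceiling can be read as a small pure function. The sketch below is illustrative only: `build_turn_map`, the constant names, and the precedence (skip, then priority, then shallow) are assumptions, not taken from `ai.py`.

```python
# Hypothetical constants mirroring the tier table above.
DEFAULT_TURNS = 10       # unlisted dirs
SHALLOW_TURNS = 5        # shallow tier
MAX_TURNS_CEILING = 25   # hard cap applied to priority suggestions


def build_turn_map(plan, dirs):
    """Map each non-skipped directory to its turn allocation.

    Assumed precedence when a path appears in several tiers:
    skip, then priority, then shallow.
    """
    skip = {d["path"] for d in plan.get("skip_dirs", [])}
    shallow = {d["path"] for d in plan.get("shallow_dirs", [])}
    priority = {
        d["path"]: d.get("suggested_turns", DEFAULT_TURNS)
        for d in plan.get("priority_dirs", [])
    }

    turn_map = {}
    for path in dirs:
        if path in skip:
            continue  # skipped dirs get no loop at all
        if path in priority:
            # Suggestions above the ceiling are capped, not rejected.
            turn_map[path] = min(priority[path], MAX_TURNS_CEILING)
        elif path in shallow:
            turn_map[path] = SHALLOW_TURNS
        else:
            turn_map[path] = DEFAULT_TURNS
    return turn_map
```

With this sketch, a planner that suggests 40 turns for a priority directory would see that directory run with 25, while directories the plan never mentions keep the default of 10.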
 ### Why no mid-loop borrowing (yet)
 PLAN.md envisions a global budget with mid-loop turn borrowing (an
 agent that needs more turns can "borrow" from the remaining budget).
 This requires inter-loop communication that does not exist today. The
 v1 implementation uses simple per-directory allocation with no
 borrowing. If the quality instrumentation shows that priority dirs
 consistently exhaust their allocation while shallow dirs finish early,
 borrowing becomes worth building.
 ---
 ## Investigation Order
 Two strategies are available:
 **leaf-first** (default): the existing order from `_discover_directories()`.
 Deepest directories first, parents last. Ensures child summaries are
 always cached before parent investigation begins.
 **priority-first**: priority directories before shallow/default, but
 leaf-first *within each band*. This preserves the child-summaries
 invariant while letting high-value subtrees inform the rest of the
 investigation.
 Both strategies preserve the leaf-first contract documented in
 [Internals](Internals) section 4.7. The `_apply_plan()` function sorts
 directories into bands without breaking the within-band leaf ordering.
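A minimal sketch of band sorting, assuming the input list is already in leaf-first order. `order_dirs`, `BAND_RANK`, and the band order (priority, default, shallow) are hypothetical names and choices for illustration; the real `_apply_plan()` may differ. The sketch relies on Python's `sorted()` being stable, so directories that share a band keep their leaf-first order.

```python
# Assumed band order: priority first, then default, then shallow.
BAND_RANK = {"priority": 0, "default": 1, "shallow": 2}


def order_dirs(leaf_first_dirs, tiers, investigation_order):
    """Return directories in investigation order.

    leaf_first_dirs: paths already sorted deepest-first.
    tiers: dict mapping path -> "priority" | "shallow"; unlisted paths
           are treated as "default".
    """
    if investigation_order != "priority-first":
        # leaf-first: keep the existing order untouched.
        return list(leaf_first_dirs)
    # Stable sort: ties within a band preserve the leaf-first order.
    return sorted(
        leaf_first_dirs,
        key=lambda path: BAND_RANK[tiers.get(path, "default")],
    )
```

In this sketch a parent placed in a higher band than one of its children would still run before that child, which is the situation the updated child-summaries placeholder is meant to report honestly.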
 ---
 ## Inputs to the Planner
 The planning agent receives four signals:
 1. **Survey output**: the full survey dict (description, approach,
   domain notes, tool recommendations), formatted as a text block.
 2. **Full directory tree**: `render_tree()` output at depth 6 (deeper
   than the survey's 2-level preview).
 3. **File signals**: extension histogram, `file --brief` descriptions,
   filename samples (the same raw signals the survey sees).
 4. **Cached directories**: which dirs are already cached from a prior
   run (so the planner knows what will be skipped).
 ---
 ## Fallback Behavior
 The planning pass degrades gracefully:
 - **Small targets** (below `_SURVEY_MIN_FILES` and `_SURVEY_MIN_DIRS`):
  planning is skipped entirely, same threshold as the survey. All dirs
  get the default allocation in leaf-first order.
 - **Planning fails** (API error, agent doesn't call `submit_plan`):
  `_default_plan()` returns an empty plan. All dirs get 10 turns,
  leaf-first order. The investigation proceeds as if Phase 3 didn't
  exist.
 - **Resumed runs**: the plan is cached as `plan.json` in the
  investigation cache. On resume (without `--fresh`), the cached plan
  is loaded and `_run_planning()` is skipped.
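The three fallback paths can be sketched as one resolution function. Everything here is an assumption made for illustration: `resolve_plan`, its parameters, the exact keys of the empty plan, and the choice not to cache a fallback plan are not taken from the source.

```python
import json
from pathlib import Path


def default_plan():
    """Empty plan: every dir gets the default allocation, leaf-first."""
    return {
        "priority_dirs": [],
        "shallow_dirs": [],
        "skip_dirs": [],
        "investigation_order": "leaf-first",
        "notes": "",
    }


def resolve_plan(cache_root, fresh, too_small, run_planning):
    """Pick the plan for this run.

    cache_root: investigation cache directory (holds plan.json).
    fresh: True when the user asked to ignore cached state.
    too_small: True when the target is below the survey thresholds.
    run_planning: zero-argument callable returning a plan dict or None.
    """
    if too_small:
        return default_plan()  # planning skipped entirely

    plan_path = Path(cache_root) / "plan.json"
    if not fresh and plan_path.exists():
        return json.loads(plan_path.read_text())  # resumed run

    try:
        plan = run_planning()
    except Exception:
        plan = None  # API error or similar failure
    if plan is None:
        # Assumption: a fallback plan is not persisted, so a later
        # run gets another chance at real planning.
        return default_plan()

    plan_path.write_text(json.dumps(plan, indent=2))
    return plan
```

Passing the planner in as a callable keeps the sketch testable without an API call.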
 ---
 ## Quality Instrumentation
 Phase 3 ships with built-in measurement so we can tell whether planning
 actually improves investigation quality. It has three parts:
 ### Turn utilization
 Tracked per directory: turns allocated vs turns used. An agent that
 finishes in 3 turns on an 18-turn budget suggests over-allocation. An
 agent that hits the cap on a 5-turn budget suggests under-allocation.
 ### Completeness self-rating
 The `submit_report` tool (dir scope) now includes a `completeness`
 field (0.0-1.0). The agent rates how thoroughly it investigated the
 directory. This is not perfectly reliable (it is a self-assessment),
 but it provides signal: a priority dir with completeness 0.3 probably
 needed more turns; a shallow dir with completeness 0.95 probably
 didn't need its 5 turns.
 ### plan_evaluation.json
 Written at the end of every investigation, this file is the planning
 pass's report card. It compares plan predictions to outcomes:
 ```json
 {
    "plan_order": "leaf-first",
    "total_dirs_investigated": 12,
    "total_turns_allocated": 120,
    "total_turns_used": 87,
    "overall_utilization": 0.73,
    "per_directory": [
        {
            "dir": "src/core",
            "planned_tier": "priority",
            "turns_allocated": 18,
            "turns_used": 14,
            "utilization": 0.78,
            "completeness": 0.9,
            "confidence": 0.85
        }
    ],
    "evaluated_at": "2026-04-12T..."
 }
 ```
 Run `luminos` on the same target before and after a change to compare
 these metrics. The golden set for baseline comparison is Luminos itself.
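The derived fields in the report (per-directory `utilization`, the totals, and `overall_utilization`) follow from the allocated and used turn counts. The sketch below shows one way to compute them; `build_plan_evaluation` and the shape of its `records` argument are hypothetical, and rounding to two decimals is an assumption based on the example above.

```python
from datetime import datetime, timezone


def build_plan_evaluation(plan_order, records):
    """Assemble a plan_evaluation.json-style dict.

    records: one dict per investigated directory with at least
             "turns_allocated" and "turns_used"; other keys (dir,
             planned_tier, completeness, confidence) pass through.
    """
    per_directory = []
    for rec in records:
        allocated = rec["turns_allocated"]
        used = rec["turns_used"]
        # Guard against a zero allocation rather than dividing by zero.
        utilization = round(used / allocated, 2) if allocated else 0.0
        per_directory.append({**rec, "utilization": utilization})

    total_allocated = sum(r["turns_allocated"] for r in records)
    total_used = sum(r["turns_used"] for r in records)
    overall = round(total_used / total_allocated, 2) if total_allocated else 0.0

    return {
        "plan_order": plan_order,
        "total_dirs_investigated": len(records),
        "total_turns_allocated": total_allocated,
        "total_turns_used": total_used,
        "overall_utilization": overall,
        "per_directory": per_directory,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Comparing two such reports for the same target is then a matter of diffing `overall_utilization` and the per-directory `completeness` values.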
 ---
 ## Implementation Map
 | Component | Location | Purpose |
 |---|---|---|
 | `_PLANNING_SYSTEM_PROMPT` | `prompts.py` | System prompt for the planning agent |
 | `submit_plan` tool | `ai.py` (planning scope) | Tool schema for plan submission |
 | `_run_planning()` | `ai.py` | Runs the planning pass (follows `_run_survey` pattern) |
 | `_apply_plan()` | `ai.py` | Pure function: plan + dir list to ordered list + turn map |
 | `_default_plan()` | `ai.py` | Fallback empty plan |
 | `_write_plan_evaluation()` | `ai.py` | Writes `plan_evaluation.json` after dir loops |
 | `_TokenTracker._loop_turns` | `ai.py` | Counts API calls per dir loop for utilization tracking |
 | `plan.json` | cache root | Persisted plan for resumed runs |
 | `plan_evaluation.json` | cache root | Post-investigation quality report |
 ---
 ## Design Decisions
 ### Why band-sorted order instead of arbitrary reordering
 The leaf-first contract (`_get_child_summaries()`) is load-bearing.
 Breaking it silently degrades parent summaries because child cache
 entries don't exist yet. Band-sorting preserves leaf-first within each
 priority band, giving us "priority-first" without losing child context.
 ### Why per-directory allocation instead of a shared global pool
 A shared pool with mid-loop borrowing requires the orchestrator to
 communicate with running agents, which doesn't exist in the current
 architecture (each `_run_dir_loop` call is independent). Per-directory
 allocation is a strict improvement over fixed-14-for-everyone with zero
 new machinery. The quality instrumentation will tell us if borrowing is
 worth building.
 ### Why the child-summaries placeholder was fixed
 `_get_child_summaries()` previously returned "this is a leaf directory"
 for any directory with no cached children, whether it was actually a
 leaf or just hadn't been investigated yet. With priority-first ordering,
 this lie becomes more likely to trigger. The fix distinguishes the two
 cases: actual leaves get "this is a leaf directory", uninvestigated
 parents get "child directories exist but have not been investigated
 yet".
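The distinction comes down to whether child directories exist at all, as opposed to whether any are cached. The sketch below is a simplified stand-in for `_get_child_summaries()`: the function name, its parameters, and the exact placeholder wording are assumptions.

```python
def child_summaries_block(child_dirs, cached_summaries):
    """Render the child-summaries section of a parent's prompt.

    child_dirs: paths of the directory's immediate subdirectories.
    cached_summaries: dict mapping path -> summary for cached children.
    """
    if not child_dirs:
        # No subdirectories at all: a true leaf.
        return "(none: this is a leaf directory)"

    found = [
        (path, cached_summaries[path])
        for path in child_dirs
        if path in cached_summaries
    ]
    if not found:
        # Children exist but none has been investigated yet.
        return "(child directories exist but have not been investigated yet)"

    return "\n".join(f"- {path}: {summary}" for path, summary in found)
```

A fuller version might also name the children that are still missing when only some are cached; this sketch lists only the summaries it has.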
 ### Why completeness is a self-rating
 An external completeness metric would require knowing "how many files
 should have been examined", which depends on the directory contents and
 is exactly the kind of judgment the agent makes. Self-rating is
 imperfect but cheap, and the correlation between self-rated
 completeness and turn utilization gives us a useful signal even if the
 absolute values aren't perfectly calibrated.
 ---
 ## Future Work
 - **Mid-loop turn borrowing**: if utilization data shows priority dirs
  consistently hit their cap while others finish early, implement a
  shared budget pool.
 - **Plan refinement**: after the first dir loop run, re-evaluate the
  plan based on early findings (some "shallow" dirs might turn out to
  be important).
 - **Cross-run learning**: use `plan_evaluation.json` from prior runs to
  improve planning on similar targets.
 ---
 ## References
 - Issues: #8, #9, #10, #11, #74
 - PR: #75
 - PLAN.md Part 4: Investigation Planning
 - [Internals](Internals) section 4.7: leaf-first contract