diff --git a/Architecture.md b/Architecture.md index eb821f9..01be951 100644 --- a/Architecture.md +++ b/Architecture.md @@ -8,8 +8,9 @@ Luminos is an agentic Claude investigation tool. Every invocation runs the full pipeline: a base scan first to feed the agent its initial picture, then -a survey pass, then per-directory dir loops, then a final synthesis pass. -The base scan is not a standalone product, it is the agent's input. +a survey pass, a planning pass, per-directory dir loops with dynamic turn +allocation, then a final synthesis pass. The base scan is not a standalone +product, it is the agent's input. **Entry point:** `luminos.py` — argument parsing, scan orchestration, AI pipeline kickoff, output routing. @@ -69,7 +70,17 @@ analyze_directory(report, target) │ ├── _filter_dir_tools(survey) remove skip_tools (if confidence ≥ 0.5) │ - ├── per-directory loop (each uncached dir, up to max_turns=14) + ├── _run_planning() single loop, max 3 turns + │ inputs: survey output + full tree + file signals + │ Tools: submit_plan + │ output: plan dict (priority/shallow/skip dirs, + │ turn allocations, investigation order) + │ (skipped on tiny targets or loaded from plan.json + │ on resumed runs) + │ + ├── _apply_plan() sort dirs into bands, build turn map + │ + ├── per-directory loop (ordered by plan, dynamic max_turns) │ _build_dir_context() list files + sizes + MIME │ _get_child_summaries() read cached child summaries │ _format_survey_block() inject survey context into prompt @@ -78,7 +89,9 @@ analyze_directory(report, target) │ cache entry on budget breach │ Tools: read_file, list_directory, run_command, │ parse_structure, write_cache, think, checkpoint, - │ flag, submit_report + │ flag, submit_report (with completeness) + │ + ├── _write_plan_evaluation() plan_evaluation.json quality metrics │ ├── _run_synthesis() single loop, max 5 turns │ reads all "dir" cache entries @@ -104,6 +117,8 @@ Layout: ``` meta.json investigation metadata +plan.json planning pass output 
(cached for resumed runs) +plan_evaluation.json quality metrics: plan predictions vs outcomes files/.json one JSON file per cached file entry dirs/.json one JSON file per cached directory entry flags.jsonl JSONL — appended on every flag tool call @@ -170,19 +185,18 @@ the *latest* per-call `input_tokens` reading (the actual size of the context window in use), not the cumulative sum across turns. Early exit flushes partial cache on budget breach. See #44. -**Per-loop turn cap.** Each dir loop runs for at most `max_turns = 14` -turns. This is a sanity bound separate from the context budget — even -on small targets the agent should produce a `submit_report` long -before exhausting 14 turns. The cap exists to prevent runaway loops -when the agent gets stuck (e.g. repeatedly retrying a failing tool -call). If we observe legitimate investigations consistently hitting -14, raise the cap; do not raise it speculatively. +**Per-loop turn cap.** The planning pass assigns each directory a turn +budget: priority dirs get 15-20 (capped at 25), shallow dirs get 5, +default dirs get 10. This replaced the old fixed `max_turns=14`. The +cap exists to prevent runaway loops when the agent gets stuck. The +`plan_evaluation.json` quality report tracks turns used vs allocated +per directory. See [Planning Pass](PlanningPass) for the full design. **Per-loop message history growth.** Tool results are appended to the message history and never evicted, so per-turn `input_tokens` grows -roughly linearly across a loop (~1.5–2k per turn observed on -codebase targets). At the current `max_turns=14` cap this stays well -under 200k. Raising `max_turns` significantly (e.g. via Phase 3 -dynamic turn allocation) would expose this — see #51. +roughly linearly across a loop (~1.5-2k per turn observed on +codebase targets). At the current caps (max 25 turns for priority +dirs) this stays under 200k. Raising caps significantly would +expose this further. See #51. 
Pricing tracked and reported at end of each run. diff --git a/Home.md b/Home.md index 44a09fc..8c1cd3d 100644 --- a/Home.md +++ b/Home.md @@ -10,9 +10,9 @@ runs first to feed the agent its initial picture of the target. ## Current State -- **Phase:** Active development — core pipeline stable, scaling and domain intelligence planned -- **Last worked on:** 2026-04-06 -- **Last commit:** merge: add -x/--exclude flag for directory exclusion +- **Phase:** Active development — Phases 1-3 complete. Phase 3 added planning pass with dynamic turn allocation and quality instrumentation. +- **Last worked on:** 2026-04-12 +- **Last commit:** feat(ai): Phase 3 investigation planning (#75) - **Blocking:** None --- @@ -23,6 +23,7 @@ runs first to feed the agent its initial picture of the target. |---|---| | [Architecture](Architecture) | Module breakdown, data flow, AI pipeline | | [Internals](Internals) | Code-level tour: dir loop, cache, prompts, where to make changes | +| [Planning Pass](PlanningPass) | Phase 3 design sketch: dynamic turn allocation, quality metrics | | [Development Guide](DevelopmentGuide) | Setup, git workflow, testing, commands | | [Roadmap](Roadmap) | Phase status — pointer to PLAN.md and open issues | | [Session Retrospectives](SessionRetrospectives) | Full session history | diff --git a/Internals.md b/Internals.md index 4cb7a46..4fc12ea 100644 --- a/Internals.md +++ b/Internals.md @@ -313,32 +313,26 @@ is the entire payoff of leaves-first ordering. The trick: those subdirectory summaries only exist if the children were investigated *first*. If `src/` runs before `src/auth/`, the -cache lookup at `ai.py:825` returns nothing. The function falls -through to its default at `ai.py:832` and returns the string -`(none — this is a leaf directory)`. The parent's system prompt -silently loses all of its child context, and the agent has no way to -know — the placeholder claims the dir is a leaf, which is a lie when -the children just haven't been investigated yet. 
The dir summary -degrades and the synthesis pass inherits the degradation. +cache lookup returns nothing. -**If you change the investigation order**, you have to do one of: +**Phase 3 addressed this contract in two ways:** -1. **Preserve the leaf-first invariant within whatever new order you - introduce.** A "priority-first" order can still process directories - leaves-first within each priority band, so children always run - before parents. -2. **Explicitly handle the missing-child-summaries case in the - prompt.** Replace the lie ("leaf directory") with the truth - ("children not yet investigated") so the agent at least knows what - it doesn't have, and accept that some dirs will run with degraded - context. +1. **Band-sorted ordering preserves leaf-first within priority bands.** + `_apply_plan()` groups directories into priority/default/shallow + bands but keeps the leaf-first sort within each band. So children + always run before their parents, even in "priority-first" mode. -Phase 3's planning pass introduces the temptation to investigate -priority dirs first. Both alternatives above are open. Whichever is -chosen, this contract has to be addressed *explicitly* — the test -class `TestDiscoverDirectories` (in `tests/test_ai_pure.py`) pins the -current ordering, so any change will be loud, but the *reason* the -ordering matters lives here. +2. **The placeholder was fixed.** `_get_child_summaries()` now + distinguishes actual leaf directories ("this is a leaf directory") + from parents whose children haven't been investigated yet ("child + directories exist but have not been investigated yet"). The old + placeholder claimed every empty-cache case was a leaf, which was a + lie when children simply hadn't been processed yet. + +The test class `TestDiscoverDirectories` (in `tests/test_ai_pure.py`) +pins the base leaf-first ordering. `TestGetChildSummaries` pins the +updated placeholder behavior. See [Planning Pass](PlanningPass) for +the full design. 
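The placeholder distinction described above can be sketched as follows. This is a minimal illustration, not the actual `_get_child_summaries()` code: the function name `child_summaries_block`, its parameters, and the exact placeholder punctuation are assumptions made for the example.

```python
def child_summaries_block(child_dirs, cached):
    """Build the child-summaries text injected into a parent's prompt.

    child_dirs: child directory paths that exist on disk (hypothetical input).
    cached: dict mapping directory path -> cached summary text.
    """
    if not child_dirs:
        # A true leaf: there is nothing below this directory.
        return "(none: this is a leaf directory)"
    found = [f"{d}: {cached[d]}" for d in child_dirs if d in cached]
    if not found:
        # Children exist but none has a cache entry yet; say so honestly.
        return "(child directories exist but have not been investigated yet)"
    return "\n".join(found)
```

With this shape, a parent that runs before its children is told what it is missing instead of being told it has no children.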
--- diff --git a/PlanningPass.md b/PlanningPass.md new file mode 100644 index 0000000..021ac4e --- /dev/null +++ b/PlanningPass.md @@ -0,0 +1,272 @@ +# Planning Pass Design Sketch + +The planning pass is Phase 3 of the Luminos investigation pipeline. It +runs after the survey and before the per-directory dir loops, deciding +where to invest investigative depth across the directory tree. + +--- + +## Problem + +Before Phase 3, every directory received the same fixed allocation: +`max_turns=14`. A two-file docs directory got the same budget as a +fifty-file core source directory. This wasted turns on trivial dirs and +under-invested in complex ones. + +--- + +## Solution: Plan Before You Investigate + +A single short Claude loop (the "planning pass", at most 3 turns) +examines cheap signals (survey output, full directory tree, file +statistics) and produces a structured plan that the orchestrator uses +to allocate resources. + +``` +survey pass + | survey dict + v +planning pass <-- NEW + | plan dict (priority/shallow/skip dirs, turn allocations) + v +dir loop (per directory, ordered by plan) + | cached dir entries + v +synthesis pass +``` + +The planning pass does not read files or explore the filesystem. It is +a "strategy from the map" pass: it looks at structure and makes +judgment calls about where depth will pay off. + +--- + +## Plan Schema + +The planning agent produces a plan via the `submit_plan` tool: + +```python +{ + "priority_dirs": [ + {"path": str, "reason": str, "suggested_turns": int} + ], + "shallow_dirs": [ + {"path": str, "reason": str} + ], + "skip_dirs": [ + {"path": str, "reason": str} + ], + "investigation_order": "leaf-first" | "priority-first", + "notes": str, +} +``` + +Directories not mentioned in any tier receive a default allocation +(currently 10 turns). The planner does not need to list every +directory; it focuses on cases where the default would clearly be +wrong. 
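To make the schema concrete, here is a minimal sketch of how a plan dict could be resolved into per-directory turn budgets, using the tier values from the Turn Allocation table (default 10, shallow 5, priority capped at 25). The name `resolve_turns` and the constants are hypothetical; this is not the actual `_apply_plan()` implementation.

```python
DEFAULT_TURNS = 10   # directories the planner did not mention
SHALLOW_TURNS = 5    # shallow tier
MAX_TURNS = 25       # ceiling applied to priority allocations


def resolve_turns(plan, dirs):
    """Map each directory to a turn budget; skipped dirs are left out."""
    priority = {d["path"]: d.get("suggested_turns", DEFAULT_TURNS)
                for d in plan.get("priority_dirs", [])}
    shallow = {d["path"] for d in plan.get("shallow_dirs", [])}
    skip = {d["path"] for d in plan.get("skip_dirs", [])}

    turn_map = {}
    for path in dirs:
        if path in skip:
            continue  # excluded from investigation entirely
        if path in priority:
            turn_map[path] = min(priority[path], MAX_TURNS)
        elif path in shallow:
            turn_map[path] = SHALLOW_TURNS
        else:
            turn_map[path] = DEFAULT_TURNS
    return turn_map
```

An empty plan falls through to the default branch for every directory, which is the behavior the fallback path describes.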
+ +--- + +## Turn Allocation + +| Tier | Turns | When to use | +|---|---|---| +| **priority** | 15-20 (capped at 25) | Complex, central, or important dirs: many source files, core logic, schemas, migrations | +| **default** | 10 | Unlisted dirs; reasonable for most directories | +| **shallow** | 5 | Simple, peripheral, or predictable: few files, test fixtures, static assets, docs-only | +| **skip** | 0 (excluded) | Build output, dependency caches, vendored code, generated artifacts | + +The global turn budget is `base_turns_per_dir * dir_count` (10 per +dir). The planner's allocations should roughly respect this budget. +Allocations above the ceiling (25 turns) are capped by the +orchestrator. + +### Why no mid-loop borrowing (yet) + +PLAN.md envisions a global budget with mid-loop turn borrowing (an +agent that needs more turns can "borrow" from the remaining budget). +This requires inter-loop communication that does not exist today. The +v1 implementation uses simple per-directory allocation with no +borrowing. If the quality instrumentation shows that priority dirs +consistently exhaust their allocation while shallow dirs finish early, +borrowing becomes worth building. + +--- + +## Investigation Order + +Two strategies are available: + +**leaf-first** (default): the existing order from `_discover_directories()`. +Deepest directories first, parents last. Ensures child summaries are +always cached before parent investigation begins. + +**priority-first**: priority directories before shallow/default, but +leaf-first *within each band*. This preserves the child-summaries +invariant while letting high-value subtrees inform the rest of the +investigation. + +Both strategies preserve the leaf-first contract documented in +[Internals](Internals) section 4.7. The `_apply_plan()` function sorts +directories into bands without breaking the within-band leaf ordering. + +--- + +## Inputs to the Planner + +The planning agent receives four signals: + +1. 
**Survey output**: the full survey dict (description, approach, + domain notes, tool recommendations), formatted as a text block. +2. **Full directory tree**: `render_tree()` output at depth 6 (deeper + than the survey's 2-level preview). +3. **File signals**: extension histogram, `file --brief` descriptions, + filename samples (the same raw signals the survey sees). +4. **Cached directories**: which dirs are already cached from a prior + run (so the planner knows what will be skipped). + +--- + +## Fallback Behavior + +The planning pass degrades gracefully: + +- **Small targets** (below `_SURVEY_MIN_FILES` and `_SURVEY_MIN_DIRS`): + planning is skipped entirely, same threshold as the survey. All dirs + get the default allocation in leaf-first order. +- **Planning fails** (API error, agent doesn't call `submit_plan`): + `_default_plan()` returns an empty plan. All dirs get 10 turns, + leaf-first order. The investigation proceeds as if Phase 3 didn't + exist. +- **Resumed runs**: the plan is cached as `plan.json` in the + investigation cache. On resume (without `--fresh`), the cached plan + is loaded and `_run_planning()` is skipped. + +--- + +## Quality Instrumentation + +Phase 3 ships with built-in measurement so we can tell whether planning +actually improves investigation quality. Three metrics: + +### Turn utilization + +Tracked per directory: turns allocated vs turns used. An agent that +finishes in 3 turns on an 18-turn budget suggests over-allocation. An +agent that hits the cap on a 5-turn budget suggests under-allocation. + +### Completeness self-rating + +The `submit_report` tool (dir scope) now includes a `completeness` +field (0.0-1.0). The agent rates how thoroughly it investigated the +directory. This is not perfectly reliable (it is a self-assessment), +but it provides signal: a priority dir with completeness 0.3 probably +needed more turns; a shallow dir with completeness 0.95 probably +didn't need its 5 turns. 
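The two signals above can be read together. The sketch below is one hedged way to turn them into a rough hint; the function name `allocation_hint` and both thresholds are assumptions chosen for illustration, not values taken from Luminos.

```python
def allocation_hint(turns_allocated, turns_used, completeness,
                    low_util=0.5, low_completeness=0.5):
    """Return 'under', 'over', or 'ok' for one investigated directory.

    turns_allocated must be positive (skipped dirs are never evaluated).
    """
    utilization = turns_used / turns_allocated
    if turns_used >= turns_allocated and completeness < low_completeness:
        return "under"  # hit the cap and still reports thin coverage
    if utilization < low_util and completeness >= low_completeness:
        return "over"   # finished early with good coverage
    return "ok"
```

For example, a directory that used 3 of 18 turns with high completeness reads as over-allocated, while one that used all 5 of 5 turns and rated itself 0.3 reads as under-allocated.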
+ +### plan_evaluation.json + +Written at the end of every investigation, this file is the planning +pass's report card. It compares plan predictions to outcomes: + +```json +{ + "plan_order": "leaf-first", + "total_dirs_investigated": 12, + "total_turns_allocated": 120, + "total_turns_used": 87, + "overall_utilization": 0.73, + "per_directory": [ + { + "dir": "src/core", + "planned_tier": "priority", + "turns_allocated": 18, + "turns_used": 14, + "utilization": 0.78, + "completeness": 0.9, + "confidence": 0.85 + } + ], + "evaluated_at": "2026-04-12T..." +} +``` + +Run luminos on the same target before and after changes to compare +these metrics. The golden set for baseline comparison: luminos itself. + +--- + +## Implementation Map + +| Component | Location | Purpose | +|---|---|---| +| `_PLANNING_SYSTEM_PROMPT` | `prompts.py` | System prompt for the planning agent | +| `submit_plan` tool | `ai.py` (planning scope) | Tool schema for plan submission | +| `_run_planning()` | `ai.py` | Runs the planning pass (follows `_run_survey` pattern) | +| `_apply_plan()` | `ai.py` | Pure function: plan + dir list to ordered list + turn map | +| `_default_plan()` | `ai.py` | Fallback empty plan | +| `_write_plan_evaluation()` | `ai.py` | Writes `plan_evaluation.json` after dir loops | +| `_TokenTracker._loop_turns` | `ai.py` | Counts API calls per dir loop for utilization tracking | +| `plan.json` | cache root | Persisted plan for resumed runs | +| `plan_evaluation.json` | cache root | Post-investigation quality report | + +--- + +## Design Decisions + +### Why band-sorted order instead of arbitrary reordering + +The leaf-first contract (`_get_child_summaries()`) is load-bearing. +Breaking it silently degrades parent summaries because child cache +entries don't exist yet. Band-sorting preserves leaf-first within each +priority band, giving us "priority-first" without losing child context. 
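Band-sorting can be sketched with a stable sort. The example assumes the input list is already leaf-first (as produced by `_discover_directories()`) and assumes the band order priority, default, shallow; `band_sort` and `BAND_RANK` are hypothetical names, not the real `_apply_plan()` code.

```python
BAND_RANK = {"priority": 0, "default": 1, "shallow": 2}  # assumed band order


def band_sort(leaf_first_dirs, tier_of):
    """Reorder dirs by band, keeping the input order inside each band."""
    # sorted() is stable, so dirs with equal band rank keep their
    # original leaf-first relative order.
    return sorted(leaf_first_dirs,
                  key=lambda d: BAND_RANK[tier_of.get(d, "default")])


dirs = ["src/auth", "docs/img", "src", "docs"]  # leaf-first input
tiers = {"src/auth": "priority", "src": "priority", "docs/img": "shallow"}
ordered = band_sort(dirs, tiers)
```

In this example `src/auth` still precedes `src` because both sit in the priority band. Note that `docs` (default) ends up ahead of `docs/img` (shallow): a parent in an earlier band than its child can still run first, which is the situation the corrected placeholder is meant to report honestly.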
+ +### Why per-directory allocation instead of a shared global pool + +A shared pool with mid-loop borrowing requires the orchestrator to +communicate with running agents, which doesn't exist in the current +architecture (each `_run_dir_loop` call is independent). Per-directory +allocation is a strict improvement over fixed-14-for-everyone with zero +new machinery. The quality instrumentation will tell us if borrowing is +worth building. + +### Why the child-summaries placeholder was fixed + +`_get_child_summaries()` previously returned "this is a leaf directory" +for any directory with no cached children, whether it was actually a +leaf or just hadn't been investigated yet. With priority-first ordering, +this lie becomes more likely to trigger. The fix distinguishes the two +cases: actual leaves get "this is a leaf directory", uninvestigated +parents get "child directories exist but have not been investigated +yet". + +### Why completeness is a self-rating + +An external completeness metric would require knowing "how many files +should have been examined", which depends on the directory contents and +is exactly the kind of judgment the agent makes. Self-rating is +imperfect but cheap, and the correlation between self-rated +completeness and turn utilization gives us a useful signal even if the +absolute values aren't perfectly calibrated. + +--- + +## Future Work + +- **Mid-loop turn borrowing**: if utilization data shows priority dirs + consistently hit their cap while others finish early, implement a + shared budget pool. +- **Plan refinement**: after the first dir loop run, re-evaluate the + plan based on early findings (some "shallow" dirs might turn out to + be important). +- **Cross-run learning**: use `plan_evaluation.json` from prior runs to + improve planning on similar targets. + +--- + +## References + +- Issues: #8, #9, #10, #11, #74 +- PR: #75 +- PLAN.md Part 4: Investigation Planning +- [Internals](Internals) section 4.7: leaf-first contract