luminos/PLAN.md
Jeff Smith e942ecc34a docs(plan): add Phase 2.5 context budget reliability (#44)
#5 smoke test showed the dir loop exhausts the 126k context budget on
a 13-file Python lib. Sequencing #44 between Phase 2 and Phase 3 so
the foundation is solid before planning + external tools add more
prompt and tool weight.
2026-04-06 21:59:01 -06:00


Luminos — Evolution Plan

Core Philosophy

The current design is a pipeline with AI steps. Even though it uses an agent loop, the structure is predetermined:

investigate every directory (leaf-first) → synthesize → done

The agent executes a fixed sequence. It cannot decide what matters most, cannot resolve its own uncertainty, cannot go get information it's missing, cannot adapt its strategy based on what it finds.

The target philosophy is investigation driven by curiosity:

survey → form hypotheses → investigate to confirm/refute →
resolve uncertainty → update understanding → repeat until satisfied

This is how a human engineer actually explores an unfamiliar codebase or file collection. The agent should decide what it needs to know and how to find it out — not execute a predetermined checklist.

Every feature in this plan should be evaluated against that principle.


Part 1: Uncertainty as a First-Class Concept

The core loop

The agent explicitly tracks confidence at each step. Low confidence is not merely noted and moved past — it triggers a resolution strategy.

observe something
    → assess confidence
    → if high: cache and continue
    → if low: choose resolution strategy
        → read more local files
        → search externally (web, package registry, docs)
        → ask the user
        → flag as genuinely unknowable with explicit reasoning
    → update understanding
    → continue

This should be reflected in the dir loop prompt, the synthesis prompt, and the refinement prompt. "I don't know" is never an acceptable terminal state if there are resolution strategies available.
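The loop above could be sketched as an ordered fallback over resolution strategies, cheapest first. The 0.7 threshold and the function signatures here are illustrative, not decided:

```python
def resolve(observation, confidence, strategies):
    """Pick a resolution strategy for a low-confidence observation.

    `strategies` is an ordered list of (name, callable) pairs -- cheapest
    first (local reads) through most expensive (asking the user). Each
    callable returns new information or None if it could not help.
    """
    if confidence >= 0.7:                 # high confidence: cache and continue
        return "cache", observation
    for name, attempt in strategies:
        result = attempt(observation)
        if result is not None:            # strategy produced new information
            return name, result
    # Every strategy exhausted: flag explicitly rather than silently
    # accepting "I don't know" as a terminal state.
    return "unknowable", observation
```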

Confidence tracking in cache entries

Add a confidence field (0.0–1.0) to both file and dir cache entries. The agent sets this when writing cache. Low-confidence entries are candidates for refinement-pass investigation.

File cache schema addition:

confidence: float        # 0.0–1.0, agent's confidence in its summary
confidence_reason: str   # why confidence is low, if below ~0.7

The synthesis and refinement passes can use confidence scores to prioritize what to look at again.
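A minimal sketch of the extended cache entry as a dataclass; fields beyond `confidence` and `confidence_reason` are hypothetical stand-ins for the existing schema:

```python
from dataclasses import dataclass

@dataclass
class FileCacheEntry:
    path: str
    summary: str
    confidence: float = 1.0       # 0.0-1.0, agent's confidence in its summary
    confidence_reason: str = ""   # populated when confidence is below ~0.7

    def needs_refinement(self, threshold: float = 0.7) -> bool:
        # Low-confidence entries are candidates for the refinement pass.
        return self.confidence < threshold
```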


Part 2: Dynamic Domain Detection

Why not a hardcoded taxonomy

A fixed domain list (code, documents, data, media, mixed) forces content into predetermined buckets. Edge cases are inevitable: a medical imaging archive, a legal discovery collection, a CAD project, a music production session, a Jupyter notebook repo that's half code and half data. Hardcoded domains require code changes to handle novel content and will always mis-classify ambiguous cases.

More fundamentally: the AI is good at recognizing what something is. Using rule-based file-type ratios to gate its behavior is fighting the tool's strengths.

Survey pass (replaces hardcoded detection)

Before dir loops begin, a lightweight survey pass runs:

Input: file type distribution, tree structure (top 2 levels), total counts

Task: the survey agent answers three questions:

  1. What is this? (plain language description, no forced taxonomy)
  2. What analytical approach would be most useful?
  3. Which available tools are relevant and which can be skipped?

Output (submit_survey tool):

{
    "description": str,          # "a Python web service with Postgres migrations"
    "approach": str,             # how to investigate — what to prioritize, what to skip
    "relevant_tools": [str],     # tools worth using for this content
    "skip_tools": [str],         # tools not useful (e.g. parse_structure for a journal)
    "domain_notes": str,         # anything unusual the dir loops should know
    "confidence": float,         # how clear the signal was
}

Max turns: 3 — this is a lightweight orientation pass, not a deep investigation.

This output is injected into the dir loop system prompt as context. The dir loops know what they're looking at before they start. They can also deviate if they find something the survey missed.
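As a sketch, the submit_survey tool definition in the Anthropic tools format could mirror the output schema above (descriptions are illustrative):

```python
SUBMIT_SURVEY_TOOL = {
    "name": "submit_survey",
    "description": "Submit your orientation survey of the target.",
    "input_schema": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "approach": {"type": "string"},
            "relevant_tools": {"type": "array", "items": {"type": "string"}},
            "skip_tools": {"type": "array", "items": {"type": "string"}},
            "domain_notes": {"type": "string"},
            "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        },
        "required": ["description", "approach", "relevant_tools",
                     "skip_tools", "domain_notes", "confidence"],
    },
}
```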

Tools are always available, AI selects what's relevant

Rather than gating tools by domain, every tool is offered with a clear description of what it's for. The AI simply won't call parse_structure on a .xlsx file because the description says it works on source files.

This also means new tools are automatically available to all future domains without any profile configuration.

What stays rule-based

The file type distribution summary fed into the survey prompt is still computed from filetypes.py — this is cheap and provides useful signal. The difference is that the AI interprets the distribution rather than a lookup table acting on it.


Part 3: External Knowledge Tools

The resolution strategy toolkit

When the agent encounters something it doesn't understand, it has options beyond "read more local files." These are resolution strategies for specific kinds of uncertainty.

web_search(query) → results

Use when: unfamiliar library, file format, API, framework, toolchain, naming convention that doesn't resolve from local files alone.

Query construction should be evidence-based:

  • Seeing import dramatiq → "dramatiq python task queue library"
  • Finding .avro files → "apache avro file format schema"
  • Spotting unfamiliar config key → "<framework> <key> configuration"

Results are summarized before injection into context. Raw search results are not passed directly — a lightweight extraction pulls the relevant 2-3 sentences.

Budget: configurable max searches per session (default: 10). Logged in report.
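Budget enforcement could be as simple as a per-tool counter that refuses calls past the limit, with the refusal surfaced to the agent as a tool result (class shape hypothetical; default limits taken from the text):

```python
class ToolBudget:
    """Per-session call budget for an external tool."""

    def __init__(self, name: str, limit: int):
        self.name = name
        self.limit = limit
        self.used = 0

    def try_spend(self) -> bool:
        # Refuse once the budget is exhausted; the caller reports the
        # refusal back to the agent instead of making the network call.
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```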

fetch_url(url) → content

Use when: a local file references a URL that would explain what the project is (e.g. a README links to documentation, a config references a schema URL, a package.json has a homepage field).

Constrained to read-only fetches. Content truncated to relevant sections. Budget: configurable (default: 5 per session).

package_lookup(name, ecosystem) → metadata

Use when: an import or dependency declaration references an unfamiliar package.

Queries package registries (PyPI, npm, crates.io, pkg.go.dev) for:

  • Package description
  • Version in use vs latest
  • Known security advisories (if available)
  • License

This is more targeted than web search and returns structured data. Particularly useful for security-relevant analysis.

Budget: generous (default: 30) since queries are cheap and targeted.

ask_user(question) → answer (interactive mode only)

Use when: uncertainty cannot be resolved by any other means.

Examples:

  • "I found 40 files with .xyz extension I don't recognize — what format is this?"
  • "There are two entry points (server.py and worker.py) — which is the primary one?"
  • "This directory appears to contain personal data — should I analyze it or skip it?"

Only triggered when other resolution strategies have been tried or are clearly not applicable. Gated behind an --interactive flag since it blocks execution.

All external tools are opt-in

--no-external flag disables all network tools (web_search, fetch_url, package_lookup). Default behavior TBD — arguably external lookups should be opt-in rather than opt-out given privacy considerations (see Concerns).


Part 4: Investigation Planning

Survey → plan → execute

Currently: every directory is processed in leaf-first order with equal resource allocation. A 2-file directory gets the same max_turns as a 50-file one.

Better: after the survey pass, a planning step decides where to invest depth.

Planning pass (submit_plan tool):

Input: survey output + full directory tree

Output:

{
    "priority_dirs": [           # investigate these deeply
        {"path": str, "reason": str, "suggested_turns": int}
    ],
    "shallow_dirs": [            # quick pass only
        {"path": str, "reason": str}
    ],
    "skip_dirs": [               # skip entirely (generated, vendored, etc.)
        {"path": str, "reason": str}
    ],
    "investigation_order": str,  # "leaf-first" | "priority-first" | "breadth-first"
    "notes": str,                # anything else the investigation should know
}

The orchestrator uses this plan to allocate turns per directory and set investigation order. The plan is also saved to cache so resumed investigations can follow the same strategy.

Dynamic turn allocation

Replace fixed max_turns=14 per directory with a global turn budget the agent manages. The planning pass allocates turns to directories based on apparent complexity. The agent can request additional turns mid-investigation if it hits something unexpectedly complex.

A simple model:

  • Global budget = base_turns_per_dir * dir_count (e.g. 10 * 20 = 200)
  • Planning pass distributes: priority dirs get 15-20, shallow dirs get 5, skip dirs get 0
  • Agent can "borrow" turns from its own budget if it needs more
  • If budget runs low, a warning is injected into the prompt
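The budget arithmetic above can be sketched as follows; the tier turn counts come from the bullets, everything else (function shape, proportional scaling) is an assumption:

```python
def allocate_turns(plan: dict, base_turns: int = 10) -> dict[str, int]:
    """Distribute a global turn budget according to the planning pass."""
    alloc: dict[str, int] = {}
    for d in plan.get("priority_dirs", []):
        alloc[d["path"]] = d.get("suggested_turns", 18)   # priority: ~15-20
    for d in plan.get("shallow_dirs", []):
        alloc[d["path"]] = 5                              # quick pass only
    for d in plan.get("skip_dirs", []):
        alloc[d["path"]] = 0                              # skip entirely
    budget = base_turns * len(alloc)                      # global ceiling
    total = sum(alloc.values())
    if total > budget:
        # Scale down proportionally rather than overspend the global budget.
        alloc = {p: int(t * budget / total) for p, t in alloc.items()}
    return alloc
```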

Part 5: Scale-Tiered Synthesis

Why tiers are still needed

Even with better investigation planning and agentic depth control, the synthesis input problem remains: 300 directory summaries cannot be meaningfully synthesized in one shot. The output is either truncated, loses fidelity, or both.

Tier classification based on post-loop measurements:

Tier     dir_count   file_count   Synthesis approach
small    < 5         < 30         Feed per-file cache entries directly
medium   5–30        30–300       Dir summaries (current approach)
large    31–150      301–1500     Multi-level synthesis
xlarge   > 150       > 1500       Multi-level + subsystem grouping

Thresholds configurable via CLI flags or config file.
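Tier classification is a straightforward threshold check. A sketch with the table's defaults; the either-dimension rule is a design assumption, chosen so a flat directory with thousands of files still gets the larger-tier handling:

```python
def classify_tier(dir_count: int, file_count: int) -> str:
    # Bump to the larger tier if EITHER dimension exceeds its threshold.
    if dir_count > 150 or file_count > 1500:
        return "xlarge"
    if dir_count > 30 or file_count > 300:
        return "large"
    if dir_count >= 5 or file_count >= 30:
        return "medium"
    return "small"
```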

Small tier: per-file summaries

File cache entries are the most granular, most grounded signal in the system — written while the AI was actually reading files. For small targets they fit comfortably in the synthesis context window and produce a richer output than dir summaries.

Multi-level synthesis (large/xlarge)

dir summaries
    ↓  (grouping pass: dirs → subsystems, AI-identified)
subsystem summaries (3–10 groups)
    ↓  (final synthesis)
report

The grouping pass is itself agentic: the AI identifies logical subsystems from dir summaries, not from directory structure. An auth/ dir and a middleware/session/ dir might end up in the same "Authentication" subsystem.

For xlarge:

dir summaries
    ↓  (level-1: dirs → subsystems, 10–30 groups)
    ↓  (level-2: subsystems → domains/layers, 3–8 groups)
    ↓  (final synthesis)

Synthesis depth scales with tier

The synthesis prompt receives explicit depth guidance:

  • small: "Be concise but specific. Reference actual filenames. 2–3 paragraphs."
  • medium: "Produce a structured breakdown. Cover purpose, components, concerns."
  • large: "Produce a thorough architectural analysis with section headers. Be specific."
  • xlarge: "Produce a comprehensive report. Cover architecture, subsystems, interfaces, cross-cutting concerns, and notable anomalies. Reference actual paths."

Part 6: Hypothesis-Driven Synthesis

Current approach: aggregation

Synthesis currently aggregates dir summaries into a report. It's descriptive: "here is what I found in each part."

Better approach: conclusion with evidence

The synthesis agent should:

  1. Form an initial hypothesis about the whole from the dir summaries
  2. Look for evidence that confirms or refutes it
  3. Consider alternative interpretations
  4. Produce a conclusion that reflects the reasoning, not just the observations

This produces output like: "This appears to be a multi-tenant SaaS backend (hypothesis) — the presence of tenant_id throughout the schema, separate per-tenant job queues, and the auth middleware's scope validation all support this (evidence). The monolith structure suggests it hasn't been decomposed into services yet (alternative consideration)."

Rather than: "The auth directory handles authentication. The jobs directory handles background jobs. The models directory contains database models."

The think tool already supports this pattern — the synthesis prompt should explicitly instruct hypothesis formation before submit_report.


Part 7: Refinement Pass

Trigger

--refine flag. Off by default.

What it does

After synthesis, the refinement agent receives:

  • Current synthesis output (brief + full analysis)
  • All dir and file cache entries including confidence scores
  • Full investigation toolset including external knowledge tools
  • A list of low-confidence cache entries (confidence < 0.7)

It is instructed to:

  1. Identify gaps (things not determined from summaries)
  2. Identify contradictions (dir summaries that conflict)
  3. Identify cross-cutting concerns (patterns spanning multiple dirs)
  4. Resolve low-confidence entries
  5. Submit an improved report

The refinement agent owns its investigation — it decides what to look at and in what order, using the full resolution strategy toolkit.

Multiple passes

--refine-depth N runs N refinement passes. Natural stopping condition: the agent calls submit_report without making any file reads or external lookups (indicates nothing new was found). This can short-circuit before N passes.
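The short-circuit could be sketched as below; the return shape of `run_refinement_pass` (report plus a did-investigate flag) is hypothetical:

```python
def refine(report, depth, run_refinement_pass):
    """Run up to `depth` refinement passes, stopping early when a pass
    submits its report without doing any new investigation."""
    for _ in range(depth):
        report, did_investigate = run_refinement_pass(report)
        if not did_investigate:   # no file reads or external lookups: done
            break
    return report
```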

Refinement vs re-investigation

Refinement is targeted — it focuses on specific gaps and uncertainties. It is not a re-run of the full dir loops. The prompt makes this explicit: "Focus on resolving uncertainty, not re-summarizing what is already known."


Part 8: Report Structure

Domain-appropriate sections

Instead of fixed brief + detailed fields, the synthesis produces structured fields based on what the survey identified. Fields that are absent or empty are not rendered.

The survey output's description shapes what fields are relevant. This is not a hardcoded domain → schema mapping — the synthesis prompt asks the agent to populate fields that are relevant to this specific content from a superset of available fields:

Available output fields (populate those relevant to this content):
- brief           (always)
- architecture    (software projects)
- components      (software projects, large document collections)
- tech_stack      (software projects)
- entry_points    (software projects, CLI tools)
- datasets        (data collections)
- schema_summary  (data collections, databases)
- period_covered  (financial data, journals, time-series)
- themes          (document collections, journals)
- data_quality    (data collections)
- concerns        (any domain)
- overall_purpose (mixed/composite targets)

The report formatter renders populated fields with appropriate headers and skips unpopulated ones. Small simple targets produce minimal output. Large complex targets produce full structured reports.
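The formatter's skip-empty behavior, sketched; the header mapping is illustrative and covers only a few of the fields listed above:

```python
HEADERS = {
    "brief": "Summary",
    "architecture": "Architecture",
    "tech_stack": "Tech Stack",
    "concerns": "Concerns",
}

def render_report(fields: dict) -> str:
    sections = []
    for key, header in HEADERS.items():
        value = fields.get(key)
        if not value:          # absent or empty fields are not rendered
            continue
        sections.append(f"{header}\n\n{value}")
    return "\n\n".join(sections)
```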

Progressive output (future)

Rather than one report at the end, stream findings as the investigation proceeds. The user sees the agent's understanding build in real time. This converts luminos from a batch tool into an interactive investigation partner.

Requires a streaming-aware output layer — significant architectural change, probably not Phase 1.


Part 9: Parallel Investigation

For large targets

Multiple dir-loop agents investigate different subsystems concurrently, then report to a coordinator. The coordinator synthesizes their findings and identifies cross-cutting concerns.

This requires:

  • A coordinator agent that owns the investigation plan
  • Worker agents scoped to subsystems
  • A shared cache that workers write to concurrently (needs locking or append-only design)
  • A merge step in the coordinator before synthesis

Significant complexity. Probably deferred until single-agent investigation quality is high. The main benefit is speed, not quality — worth revisiting when the investigation quality ceiling has been reached.


Part 10: MCP Backend Abstraction

Why

The investigation loop (survey → plan → investigate → synthesize) is generic. The filesystem-specific parts — how to list a directory, read a file, parse structure — are an implementation detail. Abstracting the backend via MCP decouples the two and makes luminos extensible to any exploration target: websites, wikis, databases, running processes.

This pivot also serves the project's learning goal. Migrating working code into an agentic framework is a common and painful real-world task. Building it clean from the start teaches the pattern; migrating teaches why the pattern exists. The migration pain is intentional.

The model

Each exploration target is an MCP server. Luminos is an MCP client. The investigation loop connects to a server at startup, discovers its tools, passes them to the Anthropic API, and forwards tool calls to the server at runtime.

luminos (MCP client)
    ↓  connects to
filesystem MCP server  |  process MCP server  |  wiki MCP server  |  ...
    ↓  exposes tools
read_file, list_dir, parse_structure, ...
    ↓  passed to
Anthropic API (agent calls them)
    ↓  forwarded back to
MCP server (executes, returns result)

The filesystem MCP server is the default. --mcp <uri> selects an alternative server.

What changes

  • ai.py tool dispatch: instead of calling local Python functions, forward to the connected MCP server
  • Tool definitions: dynamically discovered from the server via tools/list, not hardcoded in ai.py
  • New luminos_lib/mcp_client.py: thin MCP client (stdio transport)
  • New luminos_mcp/filesystem.py: MCP server wrapping existing filesystem tools (read_file, list_dir, parse_structure, run_command, stat_file)
  • --mcp CLI flag for selecting a non-default server
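The dispatch change in ai.py reduces to forwarding by name. A sketch against a minimal client interface — the `McpClient` protocol here is hypothetical, standing in for whatever the stdio transport actually exposes:

```python
from typing import Any, Protocol

class McpClient(Protocol):
    def list_tools(self) -> list[dict]: ...
    def call_tool(self, name: str, arguments: dict) -> Any: ...

def dispatch_tool_call(client: McpClient, name: str, arguments: dict) -> Any:
    # Tool names come from tools/list discovery, so an unknown name is a
    # protocol error rather than a missing local Python function.
    known = {t["name"] for t in client.list_tools()}
    if name not in known:
        raise ValueError(f"tool {name!r} not offered by connected MCP server")
    return client.call_tool(name, arguments)
```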

What does not change

Cache storage, confidence tracking, survey/planning/synthesis passes, token tracking, cost reporting, all prompts. None of these know or care what backend provided the data.

Known tensions

The tree assumption. The investigation loop assumes hierarchical containers. Non-filesystem backends (websites, processes) must present a virtual tree or the traversal model breaks. This is the MCP server's problem to solve, not luminos's — but it is real design work.

Tool count. If multiple MCP servers are connected simultaneously (filesystem + web search + package lookup), tool count grows. More tools degrade agent decision quality. Keep each server focused.

The filesystem backend is a demotion. Currently filesystem investigation is native — zero overhead. Making it an MCP server adds process-launch overhead. Acceptable given API call latency already dominates, but worth knowing.

Phase 4 becomes MCP servers. After the pivot, web_search, fetch_url, and package_lookup are natural candidates to implement as MCP servers rather than hardcoded Python functions. Phase 4 and the MCP pattern reinforce each other.

Timing

After Phase 3, before Phase 4. At that point survey + planning + dir loops + synthesis are all working with filesystem assumptions baked in — enough surface area to make the migration instructive without 9 phases of rework.


Implementation Order

Phase 1 — Confidence tracking

  • Add confidence + confidence_reason to cache schemas
  • Update dir loop prompt to set confidence when writing cache
  • No behavior change yet — just instrumentation

Phase 2 — Survey pass

  • New _run_survey() function in ai.py
  • submit_survey tool definition
  • _SURVEY_SYSTEM_PROMPT in prompts.py
  • Wire into _run_investigation() before dir loops
  • Survey output injected into dir loop system prompt
  • Rebuild filetype classifier (#42) to remove source-code bias — lands after the survey pass is observable end-to-end (#4–#7) and before Phase 3 starts depending on survey output for real decisions. Until then, the survey prompt carries a Band-Aid warning that the histogram is biased toward source code.

Phase 2.5 — Context budget reliability (#44)

  • Dir loop exhausts the 126k context budget on a 13-file Python lib (observed during #5 smoke test). Must be addressed before Phase 3 adds longer prompts and more tools, or every later phase will inherit a broken foundation.
  • Investigate root cause (tool result accumulation, parse_structure output size, redundant reads) before picking a fix.
  • Add token-usage instrumentation so regressions are visible.

Phase 3 — Investigation planning

  • Planning pass after survey, before dir loops
  • submit_plan tool
  • Dynamic turn allocation based on plan
  • Dir loop orchestrator updated to follow plan

Phase 3.5 — MCP backend abstraction (pivot point)

See Part 10 for full design. This phase happens after Phase 3 is working and before Phase 4. The goal is to migrate the filesystem investigation into an MCP server/client model before adding more backends or external tools.

  • Extract filesystem tools (read_file, list_dir, parse_structure, run_command, stat_file) into a standalone MCP server
  • Refactor ai.py into an MCP client: discover tools dynamically, forward tool calls to the server, return results to the agent
  • Replace hardcoded tool list in the dir loop with dynamic tool discovery from the connected MCP server
  • Keep the filesystem MCP server as the default; --mcp flag selects alternative servers
  • No behavior change to the investigation loop — purely structural

Learning goal: experience migrating working code into an MCP architecture. The migration pain is intentional and instructive.

Phase 4 — External knowledge tools

  • web_search tool + implementation (requires optional dep: search API client)
  • package_lookup tool + implementation (HTTP to package registries)
  • fetch_url tool + implementation
  • --no-external flag to disable network tools
  • Budget tracking and logging

Phase 5 — Scale-tiered synthesis

  • Sizing measurement after dir loops
  • Tier classification
  • Small tier: switch synthesis input to file cache entries
  • Depth instructions in synthesis prompt

Phase 6 — Multi-level synthesis

  • Grouping pass + submit_grouping tool
  • Final synthesis receives subsystem summaries at large/xlarge tier
  • Two-level grouping for xlarge

Phase 7 — Hypothesis-driven synthesis

  • Update synthesis prompt to require hypothesis formation before submit_report
  • think tool made available in synthesis (currently restricted)

Phase 8 — Refinement pass

  • --refine flag + _run_refinement()
  • Refinement uses confidence scores to prioritize
  • --refine-depth N

Phase 9 — Dynamic report structure

  • Superset output fields in synthesis submit_report schema
  • Report formatter renders populated fields only
  • Domain-appropriate section headers

File Map

File                      Changes
luminos_lib/domain.py     new — survey pass, plan pass, profile-free detection
luminos_lib/prompts.py    survey prompt, planning prompt, refinement prompt, updated dir/synthesis prompts
luminos_lib/ai.py         survey, planning, external tools, tiered synthesis, multi-level grouping, refinement, confidence-aware cache writes
luminos_lib/cache.py      confidence fields in schemas, low-confidence query
luminos_lib/report.py     dynamic field rendering, domain-appropriate sections
luminos.py                --refine, --no-external, --refine-depth flags; wire survey into scan
luminos_lib/search.py     new — web_search, fetch_url, package_lookup implementations

No changes needed to: tree.py, filetypes.py, code.py, recency.py, disk.py, capabilities.py, watch.py, ast_parser.py


Known Unknowns

Search API choice: Web search requires an API (Brave Search, Serper, SerpAPI, DuckDuckGo, etc.). Each has different pricing, rate limits, result quality, and privacy implications. Which one to use, whether to require an API key, and what the fallback is when no key is configured — all undecided. Could support multiple backends with a configurable preference.

Package registry coverage: package_lookup needs to handle PyPI, npm, crates.io, pkg.go.dev, Maven, RubyGems, NuGet at minimum. Each has a different API shape. Coverage gap for less common ecosystems (Hex for Elixir, Hackage for Haskell, etc.) — the agent will get no lookup result and must fall back to web search.

Search result summarization: Raw search results can't be injected directly into context — they're too long and too noisy. A summarization step is needed. Options: another AI call (adds latency and cost), regex extraction (fragile), a lightweight extraction heuristic. The right approach is unclear.

Turn budget arithmetic: Dynamic turn allocation sounds clean in theory. In practice: how does the agent "request more turns"? The orchestrator has to interrupt the loop, check the global budget, and decide whether to grant more. This requires mid-loop communication that doesn't exist today. Implementation complexity is non-trivial.

Cache invalidation on strategy changes: If a user re-runs with different flags (--refine, --no-external, new --exclude list), the existing cache entries may have been produced under a different investigation strategy. Should they be invalidated? Currently --fresh is the only mechanism. A smarter approach would store the investigation parameters in cache metadata and detect mismatches.
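Mismatch detection could work via a strategy fingerprint stored in cache metadata; the parameter set here is illustrative:

```python
import hashlib
import json

def strategy_fingerprint(params: dict) -> str:
    """Stable hash of the investigation parameters a cache entry was
    produced under; a mismatch on re-run flags the entry as stale."""
    canonical = json.dumps(params, sort_keys=True)  # key-order independent
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]
```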

Confidence calibration: Asking the agent to self-report confidence (0.0–1.0) is only useful if the numbers are meaningful and consistent. LLMs are known to be poorly calibrated on confidence. A 0.6 from one run may not mean the same as 0.6 from another. This may need to be a categorical signal (high/medium/low) rather than numeric to be reliable in practice.

Context window growth with external tools: Each web search result, package lookup, and URL fetch adds to the context window for that dir loop. For a directory with many unknown dependencies, the context could grow large enough to trigger the budget early exit. Need to think about how external tool results are managed in context — perhaps summarized and discarded from messages after being processed.

ask_user blocking behavior: Interactive mode with ask_user would block execution waiting for input. This is fine in a terminal session but incompatible with piped output, scripted use, or running luminos as a subprocess. Needs a clear mode distinction and graceful degradation when input is not a TTY.

Survey pass quality on tiny targets: For a target with 3 files, the survey pass adds an API call that may cost more than it's worth. There should be a minimum size threshold below which the survey is skipped and a generic approach is used.

Parallel investigation complexity: Concurrent dir-loop agents writing to a shared cache introduces race conditions. The current _CacheManager writes files directly with no locking. This would need to be addressed before parallel investigation is viable.


Additional Suggestions

Config file: Many things that are currently hardcoded (turn budget, tier thresholds, search budget, confidence threshold for refinement) should be user-configurable without CLI flags. A luminos.toml in the target directory or ~/.config/luminos/ would allow project-specific and user-specific defaults.

Structured logging: The [AI] stderr output is useful but informal. A structured log (JSONL file alongside the cache) would allow post-hoc analysis of investigation quality: which dirs used the most turns, which triggered web searches, which had low confidence, where budget pressure hit. This also enables future tooling on top of luminos investigations.

Investigation replay: The cache already stores summaries but not the investigation trace (what the agent read, in what order, what it decided to skip). Storing the full message history per directory would allow replaying or auditing an investigation. Cost: storage. Benefit: debuggability, ability to resume investigations more faithfully.

Watch mode + incremental investigation: Watch mode currently re-runs the full base scan on changes. For AI-augmented watch mode: detect which directories changed, re-investigate only those, and patch the cache entries. The synthesis would then re-run from the updated cache without re-investigating unchanged directories.

Optional PDF and Office document readers: The data and documents domains would benefit from native content extraction:

  • pdfminer or pypdf for PDF text extraction
  • openpyxl for Excel schema and sheet enumeration
  • python-docx for Word document text

These would be optional deps like the existing AI deps, gated behind --install-extras. The agent currently can only see filename and size for these formats.

Security-focused analysis mode: A --security flag could tune the investigation toward security-relevant findings: dependency vulnerability scanning, hardcoded secrets detection, permission issues, exposed configuration, insecure patterns. The flag would bias the survey, dir loop prompts, and synthesis toward these concerns and expand the flags output with severity-ranked security findings.

Output formats: The current report is terminal-formatted text or JSON. Additional formats worth considering:

  • Markdown (for saving to wikis, Notion, Obsidian)
  • HTML (self-contained report with collapsible sections)
  • SARIF (for security findings — integrates with GitHub Code Scanning)

Model selection: The model is hardcoded to claude-sonnet-4-20250514. The survey and planning passes are lightweight enough to use a faster/cheaper model (Haiku). The dir loops and synthesis warrant Sonnet or better. The refinement pass might benefit from Opus for difficult cases. A --model flag and per-pass model configuration would allow cost/quality tradeoffs.


Concerns

Cost at scale: Adding a survey pass, planning pass, external tool lookups, and multiple refinement passes significantly increases API call count and token consumption. A large repo run with --refine could easily cost several dollars. The current cost reporting (total tokens at end) may not be sufficient — users need to understand cost before committing to a long run. Consider a --estimate mode that projects cost from the base scan without running AI.

Privacy and external lookups: Web searches and URL fetches send information about the target's contents to external services. For a personal journal or proprietary codebase this could be a significant privacy concern. The --no-external flag addresses this but it should probably be the default for sensitive-looking content (PII detected in filenames, etc.), not something the user has to know to enable.

Prompt injection via file contents: read_file passes raw file contents into the context. A malicious file in the target directory could contain prompt injection attempts. The current system has no sanitization. This is an existing concern that grows as the agent gains more capabilities (web search, URL fetch, package lookup — all of which could theoretically be manipulated by a crafted file).

Reliability of self-reported confidence: The confidence tracking system depends on the agent accurately reporting its own uncertainty. If the agent is systematically over-confident (which LLMs tend to be), the refinement pass will never trigger on cases where it's most needed. The system should have a skeptical prior — low-confidence by default for unfamiliar file types, missing READMEs, ambiguous structures.

Investigation quality regression risk: Each new pass (survey, planning, refinement) adds opportunities for the investigation to go wrong. A bad survey misleads all subsequent dir loops. A bad plan wastes turns on shallow directories and skips critical ones. The system needs quality signals — probably the confidence scores aggregated across the investigation — to detect when something went wrong and potentially retry.

Watch mode compatibility: Several of the planned features (survey pass, planning, external tools) are not designed for incremental re-use in watch mode. Adding AI capability to watch mode is a separate design problem that deserves its own thinking.

Turn budget contention: If the planning pass allocates turns and the agent borrows from its budget when it needs more, there's a risk of runaway investigation on unexpectedly complex directories. Needs a hard ceiling (global max tokens, not just per-dir turns) as a backstop.


Raw Thoughts

The investigation planning idea is conceptually appealing but has a chicken-and-egg problem: you need to know what's in the directories to plan how to investigate them, but you haven't investigated yet. The survey pass helps but it's shallow. Maybe the first pass through each directory should be a cheap orientation (list contents, read one file) that feeds the plan before the full investigation starts. Two-phase dir investigation: orient then investigate.

The hypothesis-driven synthesis is probably the highest leverage change in this whole plan. The current synthesis produces descriptive output. Hypothesis-driven synthesis produces analytical output. The prompt change is small but the output quality difference could be significant.

Web search feels like it should be a last resort, not an early one. The agent should exhaust local investigation before reaching for external sources. The prompt should reflect this: "Only search if you cannot determine this from the files available."

There's a question of whether the survey pass should run before the base scan or after. After makes sense because the base scan's file_categories is useful survey input. But the base scan itself could be informed by the survey (e.g. skip certain directories the survey identified as low-value). Probably the right answer is: survey runs after base scan but before AI dir loops, using base scan output as input.

The ask_user tool is interesting because it inverts the relationship — the agent asks the human rather than the other way around. This is powerful but needs careful constraints. The agent should only ask when it's genuinely stuck, not as a shortcut to avoid investigation. The prompt should require that other resolution strategies have been exhausted before asking.

Multi-level synthesis (grouping pass) might produce better results than expected because the grouping agent has a different task than the dir-loop agents — it's looking for relationships and patterns across summaries rather than summarizing individual directories. It might surface architectural insights that none of the dir loops could see individually.

Package vulnerability lookups are potentially the highest signal-to-noise external tool — structured data, specific to the files present, directly actionable. Worth implementing before general web search.

The confidence calibration problem is real but maybe not critical to solve precisely. Even if 0.6 doesn't mean the same thing every time, entries with confidence below some threshold will still tend to be the more uncertain ones. Categorical (high/medium/low) is probably fine for the first implementation.

Progressive output and interactive mode are probably the features that would most change how luminos feels to use. The current UX is: run it, wait, get a report. Progressive output would make it feel like watching someone explore the codebase in real time. Worth thinking about the UX before the architecture.

There's a version of this tool that goes well beyond file system analysis — a general-purpose investigative agent that can be pointed at anything (a directory, a URL, a database, a running process) and produce an intelligence report. The current architecture is already pointing in that direction. Worth keeping that possibility in mind when making structural decisions so we don't close off that path prematurely.