849 lines
34 KiB
Markdown
849 lines
34 KiB
Markdown
# Luminos — Evolution Plan
|
||
|
||
## Core Philosophy
|
||
|
||
The current design is a **pipeline with AI steps**. Even though it uses an
|
||
agent loop, the structure is predetermined:
|
||
|
||
```
|
||
investigate every directory (leaf-first) → synthesize → done
|
||
```
|
||
|
||
The agent executes a fixed sequence. It cannot decide what matters most, cannot
|
||
resolve its own uncertainty, cannot go get information it's missing, cannot
|
||
adapt its strategy based on what it finds.
|
||
|
||
The target philosophy is **investigation driven by curiosity**:
|
||
|
||
```
|
||
survey → form hypotheses → investigate to confirm/refute →
|
||
resolve uncertainty → update understanding → repeat until satisfied
|
||
```
|
||
|
||
This is how a human engineer actually explores an unfamiliar codebase or file
|
||
collection. The agent should decide *what it needs to know* and *how to find
|
||
it out* — not execute a predetermined checklist.
|
||
|
||
Every feature in this plan should be evaluated against that principle.
|
||
|
||
---
|
||
|
||
## Part 1: Uncertainty as a First-Class Concept
|
||
|
||
### The core loop
|
||
|
||
The agent explicitly tracks confidence at each step. Low confidence is not
|
||
noted and moved past — it triggers a resolution strategy.
|
||
|
||
```
|
||
observe something
|
||
→ assess confidence
|
||
→ if high: cache and continue
|
||
→ if low: choose resolution strategy
|
||
→ read more local files
|
||
→ search externally (web, package registry, docs)
|
||
→ ask the user
|
||
→ flag as genuinely unknowable with explicit reasoning
|
||
→ update understanding
|
||
→ continue
|
||
```
|
||
|
||
This should be reflected in the dir loop prompt, the synthesis prompt, and the
|
||
refinement prompt. "I don't know" is never an acceptable terminal state if
|
||
there are resolution strategies available.
|
||
|
||
### Confidence tracking in cache entries
|
||
|
||
Add a `confidence` field (0.0–1.0) to both file and dir cache entries. The
|
||
agent sets this when writing cache. Low-confidence entries are candidates for
|
||
refinement-pass investigation.
|
||
|
||
File cache schema addition:
|
||
```
|
||
confidence: float # 0.0–1.0, agent's confidence in its summary
|
||
confidence_reason: str # why confidence is low, if below ~0.7
|
||
```
|
||
|
||
The synthesis and refinement passes can use confidence scores to prioritize
|
||
what to look at again.
|
||
|
||
---
|
||
|
||
## Part 2: Dynamic Domain Detection
|
||
|
||
### Why not a hardcoded taxonomy
|
||
|
||
A fixed domain list (code, documents, data, media, mixed) forces content into
|
||
predetermined buckets. Edge cases are inevitable: a medical imaging archive,
|
||
a legal discovery collection, a CAD project, a music production session, a
|
||
Jupyter notebook repo that's half code and half data. Hardcoded domains require
|
||
code changes to handle novel content and will always mis-classify ambiguous cases.
|
||
|
||
More fundamentally: the AI is good at recognizing what something is. Using
|
||
rule-based file-type ratios to gate its behavior is fighting the tool's
|
||
strengths.
|
||
|
||
### Survey pass (replaces hardcoded detection)
|
||
|
||
Before dir loops begin, a lightweight survey pass runs:
|
||
|
||
**Input**: file type distribution, tree structure (top 2 levels), total counts
|
||
|
||
**Task**: the survey agent answers three questions:
|
||
1. What is this? (plain language description, no forced taxonomy)
|
||
2. What analytical approach would be most useful?
|
||
3. Which available tools are relevant and which can be skipped?
|
||
|
||
**Output** (`submit_survey` tool):
|
||
```python
|
||
{
|
||
"description": str, # "a Python web service with Postgres migrations"
|
||
"approach": str, # how to investigate — what to prioritize, what to skip
|
||
"relevant_tools": [str], # tools worth using for this content
|
||
"skip_tools": [str], # tools not useful (e.g. parse_structure for a journal)
|
||
"domain_notes": str, # anything unusual the dir loops should know
|
||
"confidence": float, # how clear the signal was
|
||
}
|
||
```
|
||
|
||
**Max turns**: 3 — this is a lightweight orientation pass, not a deep investigation.
|
||
|
||
This output is injected into the dir loop system prompt as context. The dir
|
||
loops know what they're looking at before they start. They can also deviate if
|
||
they find something the survey missed.
|
||
|
||
### Tools are always available, AI selects what's relevant
|
||
|
||
Rather than gating tools by domain, every tool is offered with a clear
|
||
description of what it's for. The AI simply won't call `parse_structure` on
|
||
a `.xlsx` file because the description says it works on source files.
|
||
|
||
This also means new tools are automatically available to all future domains
|
||
without any profile configuration.
|
||
|
||
### What stays rule-based
|
||
|
||
The file type distribution summary fed into the survey prompt is still computed
|
||
from `filetypes.py` — this is cheap and provides useful signal. The difference
|
||
is that the AI interprets it rather than a lookup table.
|
||
|
||
---
|
||
|
||
## Part 3: External Knowledge Tools
|
||
|
||
### The resolution strategy toolkit
|
||
|
||
When the agent encounters something it doesn't understand, it has options beyond
|
||
"read more local files." These are resolution strategies for specific kinds of
|
||
uncertainty.
|
||
|
||
**`web_search(query) → results`**
|
||
|
||
Use when: unfamiliar library, file format, API, framework, toolchain, naming
|
||
convention that doesn't resolve from local files alone.
|
||
|
||
Query construction should be evidence-based:
|
||
- Seeing `import dramatiq` → `"dramatiq python task queue library"`
|
||
- Finding `.avro` files → `"apache avro file format schema"`
|
||
- Spotting unfamiliar config key → `"<framework> <key> configuration"`
|
||
|
||
Results are summarized before injection into context. Raw search results are
|
||
not passed directly — a lightweight extraction pulls the relevant 2-3 sentences.
|
||
|
||
Budget: configurable max searches per session (default: 10). Logged in report.
|
||
|
||
**`fetch_url(url) → content`**
|
||
|
||
Use when: a local file references a URL that would explain what the project is
|
||
(e.g. a README links to documentation, a config references a schema URL, a
|
||
package.json has a homepage field).
|
||
|
||
Constrained to read-only fetches. Content truncated to relevant sections.
|
||
Budget: configurable (default: 5 per session).
|
||
|
||
**`package_lookup(name, ecosystem) → metadata`**
|
||
|
||
Use when: an import or dependency declaration references an unfamiliar package.
|
||
|
||
Queries package registries (PyPI, npm, crates.io, pkg.go.dev) for:
|
||
- Package description
|
||
- Version in use vs latest
|
||
- Known security advisories (if available)
|
||
- License
|
||
|
||
This is more targeted than web search and returns structured data. Particularly
|
||
useful for security-relevant analysis.
|
||
|
||
Budget: generous (default: 30) since queries are cheap and targeted.
|
||
|
||
**`ask_user(question) → answer`** *(interactive mode only)*
|
||
|
||
Use when: uncertainty cannot be resolved by any other means.
|
||
|
||
Examples:
|
||
- "I found 40 files with `.xyz` extension I don't recognize — what format is this?"
|
||
- "There are two entry points (server.py and worker.py) — which is the primary one?"
|
||
- "This directory appears to contain personal data — should I analyze it or skip it?"
|
||
|
||
Only triggered when other resolution strategies have been tried or are clearly
|
||
not applicable. Gated behind an `--interactive` flag since it blocks execution.
|
||
|
||
### All external tools are opt-in
|
||
|
||
`--no-external` flag disables all network tools (web_search, fetch_url,
|
||
package_lookup). Default behavior TBD — arguably external lookups should be
|
||
opt-in rather than opt-out given privacy considerations (see Concerns).
|
||
|
||
---
|
||
|
||
## Part 4: Investigation Planning
|
||
|
||
### Survey → plan → execute
|
||
|
||
Currently: every directory is processed in leaf-first order with equal
|
||
resource allocation. A 2-file directory gets the same max_turns as a 50-file
|
||
one.
|
||
|
||
Better: after the survey pass, a planning step decides where to invest depth.
|
||
|
||
**Planning pass** (`submit_plan` tool):
|
||
|
||
Input: survey output + full directory tree
|
||
|
||
Output:
|
||
```python
|
||
{
|
||
"priority_dirs": [ # investigate these deeply
|
||
{"path": str, "reason": str, "suggested_turns": int}
|
||
],
|
||
"shallow_dirs": [ # quick pass only
|
||
{"path": str, "reason": str}
|
||
],
|
||
"skip_dirs": [ # skip entirely (generated, vendored, etc.)
|
||
{"path": str, "reason": str}
|
||
],
|
||
"investigation_order": str, # "leaf-first" | "priority-first" | "breadth-first"
|
||
"notes": str, # anything else the investigation should know
|
||
}
|
||
```
|
||
|
||
The orchestrator uses this plan to allocate turns per directory and set
|
||
investigation order. The plan is also saved to cache so resumed investigations
|
||
can follow the same strategy.
|
||
|
||
### Dynamic turn allocation
|
||
|
||
Replace fixed `max_turns=14` per directory with a global turn budget the agent
|
||
manages. The planning pass allocates turns to directories based on apparent
|
||
complexity. The agent can request additional turns mid-investigation if it hits
|
||
something unexpectedly complex.
|
||
|
||
A simple model:
|
||
- Global budget = `base_turns_per_dir * dir_count` (e.g. 10 * 20 = 200)
|
||
- Planning pass distributes: priority dirs get 15-20, shallow dirs get 5, skip dirs get 0
|
||
- Agent can "borrow" turns from its own budget if it needs more
|
||
- If budget runs low, a warning is injected into the prompt
|
||
|
||
---
|
||
|
||
## Part 5: Scale-Tiered Synthesis
|
||
|
||
### Why tiers are still needed
|
||
|
||
Even with better investigation planning and agentic depth control, the synthesis
|
||
input problem remains: 300 directory summaries cannot be meaningfully synthesized
|
||
in one shot. The output is either truncated, loses fidelity, or both.
|
||
|
||
Tier classification based on post-loop measurements:
|
||
|
||
| Tier | dir_count | file_count | Synthesis approach |
|
||
|---|---|---|---|
|
||
| `small` | < 5 | < 30 | Feed per-file cache entries directly |
|
||
| `medium` | 5–30 | 30–300 | Dir summaries (current approach) |
|
||
| `large` | 31–150 | 301–1500 | Multi-level synthesis |
|
||
| `xlarge` | > 150 | > 1500 | Multi-level + subsystem grouping |
|
||
|
||
Thresholds configurable via CLI flags or config file.
|
||
|
||
### Small tier: per-file summaries
|
||
|
||
File cache entries are the most granular, most grounded signal in the system —
|
||
written while the AI was actually reading files. For small targets they fit
|
||
comfortably in the synthesis context window and produce a richer output than
|
||
dir summaries.
|
||
|
||
### Multi-level synthesis (large/xlarge)
|
||
|
||
```
|
||
dir summaries
|
||
↓ (grouping pass: dirs → subsystems, AI-identified)
|
||
subsystem summaries (3–10 groups)
|
||
↓ (final synthesis)
|
||
report
|
||
```
|
||
|
||
The grouping pass is itself agentic: the AI identifies logical subsystems from
|
||
dir summaries, not from directory structure. An `auth/` dir and a
|
||
`middleware/session/` dir might end up in the same "Authentication" subsystem.
|
||
|
||
For xlarge:
|
||
```
|
||
dir summaries
|
||
↓ (level-1: dirs → subsystems, 10–30 groups)
|
||
↓ (level-2: subsystems → domains/layers, 3–8 groups)
|
||
↓ (final synthesis)
|
||
```
|
||
|
||
### Synthesis depth scales with tier
|
||
|
||
The synthesis prompt receives explicit depth guidance:
|
||
|
||
- **small**: "Be concise but specific. Reference actual filenames. 2–3 paragraphs."
|
||
- **medium**: "Produce a structured breakdown. Cover purpose, components, concerns."
|
||
- **large**: "Produce a thorough architectural analysis with section headers. Be specific."
|
||
- **xlarge**: "Produce a comprehensive report. Cover architecture, subsystems, interfaces, cross-cutting concerns, and notable anomalies. Reference actual paths."
|
||
|
||
---
|
||
|
||
## Part 6: Hypothesis-Driven Synthesis
|
||
|
||
### Current approach: aggregation
|
||
|
||
Synthesis currently aggregates dir summaries into a report. It's descriptive:
|
||
"here is what I found in each part."
|
||
|
||
### Better approach: conclusion with evidence
|
||
|
||
The synthesis agent should:
|
||
1. Form an initial hypothesis about the whole from the dir summaries
|
||
2. Look for evidence that confirms or refutes it
|
||
3. Consider alternative interpretations
|
||
4. Produce a conclusion that reflects the reasoning, not just the observations
|
||
|
||
This produces output like: *"This appears to be a multi-tenant SaaS backend
|
||
(hypothesis) — the presence of tenant_id throughout the schema, separate
|
||
per-tenant job queues, and the auth middleware's scope validation all support
|
||
this (evidence). The monolith structure suggests it hasn't been decomposed into
|
||
services yet (alternative consideration)."*
|
||
|
||
Rather than: *"The auth directory handles authentication. The jobs directory
|
||
handles background jobs. The models directory contains database models."*
|
||
|
||
The `think` tool already supports this pattern — the synthesis prompt should
|
||
explicitly instruct hypothesis formation before `submit_report`.
|
||
|
||
---
|
||
|
||
## Part 7: Refinement Pass
|
||
|
||
### Trigger
|
||
|
||
`--refine` flag. Off by default.
|
||
|
||
### What it does
|
||
|
||
After synthesis, the refinement agent receives:
|
||
- Current synthesis output (brief + full analysis)
|
||
- All dir and file cache entries including confidence scores
|
||
- Full investigation toolset including external knowledge tools
|
||
- A list of low-confidence cache entries (confidence < 0.7)
|
||
|
||
It is instructed to:
|
||
1. Identify gaps (things not determined from summaries)
|
||
2. Identify contradictions (dir summaries that conflict)
|
||
3. Identify cross-cutting concerns (patterns spanning multiple dirs)
|
||
4. Resolve low-confidence entries
|
||
5. Submit an improved report
|
||
|
||
The refinement agent owns its investigation — it decides what to look at and
|
||
in what order, using the full resolution strategy toolkit.
|
||
|
||
### Multiple passes
|
||
|
||
`--refine-depth N` runs N refinement passes. Natural stopping condition: the
|
||
agent calls `submit_report` without making any file reads or external lookups
|
||
(indicates nothing new was found). This can short-circuit before N passes.
|
||
|
||
### Refinement vs re-investigation
|
||
|
||
Refinement is targeted — it focuses on specific gaps and uncertainties. It is
|
||
not a re-run of the full dir loops. The prompt makes this explicit:
|
||
*"Focus on resolving uncertainty, not re-summarizing what is already known."*
|
||
|
||
---
|
||
|
||
## Part 8: Report Structure
|
||
|
||
### Domain-appropriate sections
|
||
|
||
Instead of fixed `brief` + `detailed` fields, the synthesis produces structured
|
||
fields based on what the survey identified. Fields that are absent or empty are
|
||
not rendered.
|
||
|
||
The survey output's `description` shapes what fields are relevant. This is not
|
||
a hardcoded domain → schema mapping — the synthesis prompt asks the agent to
|
||
populate fields that are relevant to *this specific content* from a superset
|
||
of available fields:
|
||
|
||
```
|
||
Available output fields (populate those relevant to this content):
|
||
- brief (always)
|
||
- architecture (software projects)
|
||
- components (software projects, large document collections)
|
||
- tech_stack (software projects)
|
||
- entry_points (software projects, CLI tools)
|
||
- datasets (data collections)
|
||
- schema_summary (data collections, databases)
|
||
- period_covered (financial data, journals, time-series)
|
||
- themes (document collections, journals)
|
||
- data_quality (data collections)
|
||
- concerns (any domain)
|
||
- overall_purpose (mixed/composite targets)
|
||
```
|
||
|
||
The report formatter renders populated fields with appropriate headers and
|
||
skips unpopulated ones. Small simple targets produce minimal output. Large
|
||
complex targets produce full structured reports.
|
||
|
||
### Progressive output (future)
|
||
|
||
Rather than one report at the end, stream findings as the investigation
|
||
proceeds. The user sees the agent's understanding build in real time. This
|
||
converts luminos from a batch tool into an interactive investigation partner.
|
||
|
||
Requires a streaming-aware output layer — significant architectural change,
|
||
probably not Phase 1.
|
||
|
||
---
|
||
|
||
## Part 9: Parallel Investigation
|
||
|
||
### For large targets
|
||
|
||
Multiple dir-loop agents investigate different subsystems concurrently, then
|
||
report to a coordinator. The coordinator synthesizes their findings and
|
||
identifies cross-cutting concerns.
|
||
|
||
This requires:
|
||
- A coordinator agent that owns the investigation plan
|
||
- Worker agents scoped to subsystems
|
||
- A shared cache that workers write to concurrently (needs locking or
|
||
append-only design)
|
||
- A merge step in the coordinator before synthesis
|
||
|
||
Significant complexity. Probably deferred until single-agent investigation
|
||
quality is high. The main benefit is speed, not quality — worth revisiting when
|
||
the investigation quality ceiling has been reached.
|
||
|
||
---
|
||
|
||
## Part 10: MCP Backend Abstraction
|
||
|
||
### Why
|
||
|
||
The investigation loop (survey → plan → investigate → synthesize) is
|
||
generic. The filesystem-specific parts — how to list a directory, read
|
||
a file, parse structure — are an implementation detail. Abstracting
|
||
the backend via MCP decouples the two and makes luminos extensible to
|
||
any exploration target: websites, wikis, databases, running processes.
|
||
|
||
This pivot also serves the project's learning goal. Migrating working
|
||
code into an agentic framework is a common and painful real-world task.
|
||
Building it clean from the start teaches the pattern; migrating teaches
|
||
*why* the pattern exists. The migration pain is intentional.
|
||
|
||
### The model
|
||
|
||
Each exploration target is an MCP server. Luminos is an MCP client.
|
||
The investigation loop connects to a server at startup, discovers its
|
||
tools, passes them to the Anthropic API, and forwards tool calls to
|
||
the server at runtime.
|
||
|
||
```
|
||
luminos (MCP client)
|
||
↓ connects to
|
||
filesystem MCP server | process MCP server | wiki MCP server | ...
|
||
↓ exposes tools
|
||
read_file, list_dir, parse_structure, ...
|
||
↓ passed to
|
||
Anthropic API (agent calls them)
|
||
↓ forwarded back to
|
||
MCP server (executes, returns result)
|
||
```
|
||
|
||
The filesystem MCP server is the default. `--mcp <uri>` selects
|
||
an alternative server.
|
||
|
||
### What changes
|
||
|
||
- `ai.py` tool dispatch: instead of calling local Python functions,
|
||
forward to the connected MCP server
|
||
- Tool definitions: dynamically discovered from the server via
|
||
`tools/list`, not hardcoded in `ai.py`
|
||
- New `luminos_lib/mcp_client.py`: thin MCP client (stdio transport)
|
||
- New `luminos_mcp/filesystem.py`: MCP server wrapping existing
|
||
filesystem tools (`read_file`, `list_dir`, `parse_structure`,
|
||
`run_command`, `stat_file`)
|
||
- `--mcp` CLI flag for selecting a non-default server
|
||
|
||
### What does not change
|
||
|
||
Cache storage, confidence tracking, survey/planning/synthesis passes,
|
||
token tracking, cost reporting, all prompts. None of these know or
|
||
care what backend provided the data.
|
||
|
||
### Known tensions
|
||
|
||
**The tree assumption.** The investigation loop assumes hierarchical
|
||
containers. Non-filesystem backends (websites, processes) must present
|
||
a virtual tree or the traversal model breaks. This is the MCP server's
|
||
problem to solve, not luminos's — but it is real design work.
|
||
|
||
**Tool count.** If multiple MCP servers are connected simultaneously
|
||
(filesystem + web search + package lookup), tool count grows. More
|
||
tools degrades agent decision quality. Keep each server focused.
|
||
|
||
**The filesystem backend is a demotion.** Currently filesystem
|
||
investigation is native — zero overhead. Making it an MCP server adds
|
||
process-launch overhead. Acceptable given API call latency already
|
||
dominates, but worth knowing.
|
||
|
||
**Phase 4 becomes MCP servers.** After the pivot, web_search,
|
||
fetch_url, and package_lookup are natural candidates to implement as
|
||
MCP servers rather than hardcoded Python functions. Phase 4 and the
|
||
MCP pattern reinforce each other.
|
||
|
||
### Timing
|
||
|
||
After Phase 3, before Phase 4. At that point survey + planning +
|
||
dir loops + synthesis are all working with filesystem assumptions
|
||
baked in — enough surface area to make the migration instructive
|
||
without 9 phases of rework.
|
||
|
||
---
|
||
|
||
## Implementation Order
|
||
|
||
### Phase 1 — Confidence tracking
|
||
- Add `confidence` + `confidence_reason` to cache schemas
|
||
- Update dir loop prompt to set confidence when writing cache
|
||
- No behavior change yet — just instrumentation
|
||
|
||
### Phase 2 — Survey pass
|
||
- New `_run_survey()` function in `ai.py`
|
||
- `submit_survey` tool definition
|
||
- `_SURVEY_SYSTEM_PROMPT` in `prompts.py`
|
||
- Wire into `_run_investigation()` before dir loops
|
||
- Survey output injected into dir loop system prompt
|
||
|
||
### Phase 3 — Investigation planning
|
||
- Planning pass after survey, before dir loops
|
||
- `submit_plan` tool
|
||
- Dynamic turn allocation based on plan
|
||
- Dir loop orchestrator updated to follow plan
|
||
|
||
### Phase 3.5 — MCP backend abstraction (pivot point)
|
||
See Part 10 for full design. This phase happens *after* Phase 3 is
|
||
working and *before* Phase 4. The goal is to migrate the filesystem
|
||
investigation into an MCP server/client model before adding more
|
||
backends or external tools.
|
||
|
||
- Extract filesystem tools (`read_file`, `list_dir`, `parse_structure`,
|
||
`run_command`, `stat_file`) into a standalone MCP server
|
||
- Refactor `ai.py` into an MCP client: discover tools dynamically,
|
||
forward tool calls to the server, return results to the agent
|
||
- Replace hardcoded tool list in the dir loop with dynamic tool
|
||
discovery from the connected MCP server
|
||
- Keep the filesystem MCP server as the default; `--mcp` flag selects
|
||
alternative servers
|
||
- No behavior change to the investigation loop — purely structural
|
||
|
||
**Learning goal:** experience migrating working code into an MCP
|
||
architecture. The migration pain is intentional and instructive.
|
||
|
||
### Phase 4 — External knowledge tools
|
||
- `web_search` tool + implementation (requires optional dep: search API client)
|
||
- `package_lookup` tool + implementation (HTTP to package registries)
|
||
- `fetch_url` tool + implementation
|
||
- `--no-external` flag to disable network tools
|
||
- Budget tracking and logging
|
||
|
||
### Phase 5 — Scale-tiered synthesis
|
||
- Sizing measurement after dir loops
|
||
- Tier classification
|
||
- Small tier: switch synthesis input to file cache entries
|
||
- Depth instructions in synthesis prompt
|
||
|
||
### Phase 6 — Multi-level synthesis
|
||
- Grouping pass + `submit_grouping` tool
|
||
- Final synthesis receives subsystem summaries at large/xlarge tier
|
||
- Two-level grouping for xlarge
|
||
|
||
### Phase 7 — Hypothesis-driven synthesis
|
||
- Update synthesis prompt to require hypothesis formation before submit_report
|
||
- `think` tool made available in synthesis (currently restricted)
|
||
|
||
### Phase 8 — Refinement pass
|
||
- `--refine` flag + `_run_refinement()`
|
||
- Refinement uses confidence scores to prioritize
|
||
- `--refine-depth N`
|
||
|
||
### Phase 9 — Dynamic report structure
|
||
- Superset output fields in synthesis submit_report schema
|
||
- Report formatter renders populated fields only
|
||
- Domain-appropriate section headers
|
||
|
||
---
|
||
|
||
## File Map
|
||
|
||
| File | Changes |
|
||
|---|---|
|
||
| `luminos_lib/domain.py` | **new** — survey pass, plan pass, profile-free detection |
|
||
| `luminos_lib/prompts.py` | survey prompt, planning prompt, refinement prompt, updated dir/synthesis prompts |
|
||
| `luminos_lib/ai.py` | survey, planning, external tools, tiered synthesis, multi-level grouping, refinement, confidence-aware cache writes |
|
||
| `luminos_lib/cache.py` | confidence fields in schemas, low-confidence query |
|
||
| `luminos_lib/report.py` | dynamic field rendering, domain-appropriate sections |
|
||
| `luminos.py` | --refine, --no-external, --refine-depth flags; wire survey into scan |
|
||
| `luminos_lib/search.py` | **new** — web_search, fetch_url, package_lookup implementations |
|
||
|
||
No changes needed to: `tree.py`, `filetypes.py`, `code.py`, `recency.py`,
|
||
`disk.py`, `capabilities.py`, `watch.py`, `ast_parser.py`
|
||
|
||
---
|
||
|
||
## Known Unknowns
|
||
|
||
**Search API choice**
|
||
Web search requires an API (Brave Search, Serper, SerpAPI, DuckDuckGo, etc.).
|
||
Each has different pricing, rate limits, result quality, and privacy
|
||
implications. Which one to use, whether to require an API key, and what the
|
||
fallback is when no key is configured — all undecided. Could support multiple
|
||
backends with a configurable preference.
|
||
|
||
**Package registry coverage**
|
||
`package_lookup` needs to handle PyPI, npm, crates.io, pkg.go.dev, Maven,
|
||
RubyGems, NuGet at minimum. Each has a different API shape. Coverage gap for
|
||
less common ecosystems (Hex for Elixir, Hackage for Haskell, etc.) — the agent
|
||
will get no lookup result and must fall back to web search.
|
||
|
||
**search result summarization**
|
||
Raw search results can't be injected directly into context — they're too long
|
||
and too noisy. A summarization step is needed. Options: another AI call (adds
|
||
latency and cost), regex extraction (fragile), a lightweight extraction
|
||
heuristic. The right approach is unclear.
|
||
|
||
**Turn budget arithmetic**
|
||
Dynamic turn allocation sounds clean in theory. In practice: how does the
|
||
agent "request more turns"? The orchestrator has to interrupt the loop,
|
||
check the global budget, and decide whether to grant more. This requires
|
||
mid-loop communication that doesn't exist today. Implementation complexity
|
||
is non-trivial.
|
||
|
||
**Cache invalidation on strategy changes**
|
||
If a user re-runs with different flags (--refine, --no-external, new --exclude
|
||
list), the existing cache entries may have been produced under a different
|
||
investigation strategy. Should they be invalidated? Currently --fresh is the
|
||
only mechanism. A smarter approach would store the investigation parameters
|
||
in cache metadata and detect mismatches.
|
||
|
||
**Confidence calibration**
|
||
Asking the agent to self-report confidence (0.0–1.0) is only useful if the
|
||
numbers are meaningful and consistent. LLMs are known to be poorly calibrated
|
||
on confidence. A 0.6 from one run may not mean the same as 0.6 from another.
|
||
This may need to be a categorical signal (high/medium/low) rather than numeric
|
||
to be reliable in practice.
|
||
|
||
**Context window growth with external tools**
|
||
Each web search result, package lookup, and URL fetch adds to the context
|
||
window for that dir loop. For a directory with many unknown dependencies, the
|
||
context could grow large enough to trigger the budget early exit. Need to think
|
||
about how external tool results are managed in context — perhaps summarized and
|
||
discarded from messages after being processed.
|
||
|
||
**`ask_user` blocking behavior**
|
||
Interactive mode with `ask_user` would block execution waiting for input. This
|
||
is fine in a terminal session but incompatible with piped output, scripted use,
|
||
or running luminos as a subprocess. Needs a clear mode distinction and graceful
|
||
degradation when input is not a TTY.
|
||
|
||
**Survey pass quality on tiny targets**
|
||
For a target with 3 files, the survey pass adds an API call that may cost more
|
||
than it's worth. There should be a minimum size threshold below which the
|
||
survey is skipped and a generic approach is used.
|
||
|
||
**Parallel investigation complexity**
|
||
Concurrent dir-loop agents writing to a shared cache introduces race conditions.
|
||
The current `_CacheManager` writes files directly with no locking. This would
|
||
need to be addressed before parallel investigation is viable.
|
||
|
||
---
|
||
|
||
## Additional Suggestions
|
||
|
||
**Config file**
|
||
Many things that are currently hardcoded (turn budget, tier thresholds, search
|
||
budget, confidence threshold for refinement) should be user-configurable without
|
||
CLI flags. A `luminos.toml` in the target directory or `~/.config/luminos/`
|
||
would allow project-specific and user-specific defaults.
|
||
|
||
**Structured logging**
|
||
The `[AI]` stderr output is useful but informal. A structured log (JSONL file
|
||
alongside the cache) would allow post-hoc analysis of investigation quality:
|
||
which dirs used the most turns, which triggered web searches, which had low
|
||
confidence, where budget pressure hit. This also enables future tooling on top
|
||
of luminos investigations.
|
||
|
||
**Investigation replay**
|
||
The cache already stores summaries but not the investigation trace (what the
|
||
agent read, in what order, what it decided to skip). Storing the full message
|
||
history per directory would allow replaying or auditing an investigation. Cost:
|
||
storage. Benefit: debuggability, ability to resume investigations more faithfully.
|
||
|
||
**Watch mode + incremental investigation**
|
||
Watch mode currently re-runs the full base scan on changes. For AI-augmented
|
||
watch mode: detect which directories changed, re-investigate only those, and
|
||
patch the cache entries. The synthesis would then re-run from the updated cache
|
||
without re-investigating unchanged directories.
|
||
|
||
**Optional PDF and Office document readers**
|
||
The data and documents domains would benefit from native content extraction:
|
||
- `pdfminer` or `pypdf` for PDF text extraction
|
||
- `openpyxl` for Excel schema and sheet enumeration
|
||
- `python-docx` for Word document text
|
||
These would be optional deps like the existing AI deps, gated behind
|
||
`--install-extras`. The agent currently can only see filename and size for
|
||
these formats.
|
||
|
||
**Security-focused analysis mode**
|
||
A `--security` flag could tune the investigation toward security-relevant
|
||
findings: dependency vulnerability scanning, hardcoded secrets detection,
|
||
permission issues, exposed configuration, insecure patterns. The flag would
|
||
bias the survey, dir loop prompts, and synthesis toward these concerns and
|
||
expand the flags output with severity-ranked security findings.
|
||
|
||
**Output formats**
|
||
The current report is terminal-formatted text or JSON. Additional formats worth
|
||
considering:
|
||
- Markdown (for saving to wikis, Notion, Obsidian)
|
||
- HTML (self-contained report with collapsible sections)
|
||
- SARIF (for security findings — integrates with GitHub Code Scanning)
|
||
|
||
**Model selection**
|
||
The model is hardcoded to `claude-sonnet-4-20250514`. The survey and planning
|
||
passes are lightweight enough to use a faster/cheaper model (Haiku). The dir
|
||
loops and synthesis warrant Sonnet or better. The refinement pass might benefit
|
||
from Opus for difficult cases. A `--model` flag and per-pass model configuration
|
||
would allow cost/quality tradeoffs.
|
||
|
||
---
|
||
|
||
## Concerns
|
||
|
||
**Cost at scale**
|
||
Adding a survey pass, planning pass, external tool lookups, and multiple
|
||
refinement passes significantly increases API call count and token consumption.
|
||
A large repo run with `--refine` could easily cost several dollars. The current
|
||
cost reporting (total tokens at end) may not be sufficient — users need to
|
||
understand cost before committing to a long run. Consider a `--estimate` mode
|
||
that projects cost from the base scan without running AI.
|
||
|
||
**Privacy and external lookups**
|
||
Web searches and URL fetches send information about the target's contents to
|
||
external services. For a personal journal or proprietary codebase this could
|
||
be a significant privacy concern. The `--no-external` flag addresses this but
|
||
it should probably be the *default* for sensitive-looking content (PII detected
|
||
in filenames, etc.), not something the user has to know to enable.
|
||
|
||
**Prompt injection via file contents**
|
||
`read_file` passes raw file contents into the context. A malicious file in the
|
||
target directory could contain prompt injection attempts. The current system has
|
||
no sanitization. This is an existing concern that grows as the agent gains more
|
||
capabilities (web search, URL fetch, package lookup — all of which could
|
||
theoretically be manipulated by a crafted file).
|
||
|
||
**Reliability of self-reported confidence**
|
||
The confidence tracking system depends on the agent accurately reporting its
|
||
own uncertainty. If the agent is systematically over-confident (which LLMs tend
|
||
to be), the refinement pass will never trigger on cases where it's most needed.
|
||
The system should have a skeptical prior — low-confidence by default for
|
||
unfamiliar file types, missing READMEs, ambiguous structures.
|
||
|
||
**Investigation quality regression risk**
|
||
Each new pass (survey, planning, refinement) adds opportunities for the
|
||
investigation to go wrong. A bad survey misleads all subsequent dir loops. A
|
||
bad plan wastes turns on shallow directories and skips critical ones. The system
|
||
needs quality signals — probably the confidence scores aggregated across the
|
||
investigation — to detect when something went wrong and potentially retry.
|
||
|
||
**Watch mode compatibility**
|
||
Several of the planned features (survey pass, planning, external tools) are not
|
||
designed for incremental re-use in watch mode. Adding AI capability to watch
|
||
mode is a separate design problem that deserves its own thinking.
|
||
|
||
**Turn budget contention**
|
||
If the planning pass allocates turns and the agent borrows from its budget when
|
||
it needs more, there's a risk of runaway investigation on unexpectedly complex
|
||
directories. Needs a hard ceiling (global max tokens, not just per-dir turns)
|
||
as a backstop.
|
||
|
||
---
|
||
|
||
## Raw Thoughts
|
||
|
||
The investigation planning idea is conceptually appealing but has a chicken-and-
|
||
egg problem: you need to know what's in the directories to plan how to
|
||
investigate them, but you haven't investigated yet. The survey pass helps but
|
||
it's shallow. Maybe the first pass through each directory should be a cheap
|
||
orientation (list contents, read one file) that feeds the plan before the full
|
||
investigation starts. Two-phase dir investigation: orient then investigate.
|
||
|
||
The hypothesis-driven synthesis is probably the highest leverage change in this
|
||
whole plan. The current synthesis produces descriptive output. Hypothesis-driven
|
||
synthesis produces analytical output. The prompt change is small but the output
|
||
quality difference could be significant.
|
||
|
||
Web search feels like it should be a last resort, not an early one. The agent
|
||
should exhaust local investigation before reaching for external sources. The
|
||
prompt should reflect this: "Only search if you cannot determine this from the
|
||
files available."
|
||
|
||
There's a question of whether the survey pass should run before the base scan
|
||
or after. After makes sense because the base scan's file_categories is useful
|
||
survey input. But the base scan itself could be informed by the survey (e.g.
|
||
skip certain directories the survey identified as low-value). Probably the right
|
||
answer is: survey runs after base scan but before AI dir loops, using base scan
|
||
output as input.
|
||
|
||
The `ask_user` tool is interesting because it inverts the relationship — the
|
||
agent asks the human rather than the other way around. This is powerful but
|
||
needs careful constraints. The agent should only ask when it's genuinely stuck,
|
||
not as a shortcut to avoid investigation. The prompt should require that other
|
||
resolution strategies have been exhausted before asking.
|
||
|
||
Multi-level synthesis (grouping pass) might produce better results than
|
||
expected because the grouping agent has a different task than the dir-loop
|
||
agents — it's looking for relationships and patterns across summaries rather
|
||
than summarizing individual directories. It might surface architectural insights
|
||
that none of the dir loops could see individually.
|
||
|
||
Package vulnerability lookups are potentially the highest signal-to-noise
|
||
external tool — structured data, specific to the files present, directly
|
||
actionable. Worth implementing before general web search.
|
||
|
||
The confidence calibration problem is real but maybe not critical to solve
|
||
precisely. Even if 0.6 doesn't mean the same thing every time, entries with
|
||
confidence below some threshold will still tend to be the more uncertain ones.
|
||
Categorical (high/medium/low) is probably fine for the first implementation.
|
||
|
||
Progressive output and interactive mode are probably the features that would
|
||
most change how luminos *feels* to use. The current UX is: run it, wait, get a
|
||
report. Progressive output would make it feel like watching someone explore
|
||
the codebase in real time. Worth thinking about the UX before the architecture.
|
||
|
||
There's a version of this tool that goes well beyond file system analysis —
|
||
a general-purpose investigative agent that can be pointed at anything (a
|
||
directory, a URL, a database, a running process) and produce an intelligence
|
||
report. The current architecture is already pointing in that direction. Worth
|
||
keeping that possibility in mind when making structural decisions so we don't
|
||
close off that path prematurely.
|