2026-04-07 03:15:27 +00:00
|
|
|
|
# Luminos — Evolution Plan
|
|
|
|
|
|
|
|
|
|
|
|
## Core Philosophy
|
|
|
|
|
|
|
|
|
|
|
|
The current design is a **pipeline with AI steps**. Even though it uses an
|
|
|
|
|
|
agent loop, the structure is predetermined:
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
investigate every directory (leaf-first) → synthesize → done
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The agent executes a fixed sequence. It cannot decide what matters most, cannot
|
|
|
|
|
|
resolve its own uncertainty, cannot go get information it's missing, cannot
|
|
|
|
|
|
adapt its strategy based on what it finds.
|
|
|
|
|
|
|
|
|
|
|
|
The target philosophy is **investigation driven by curiosity**:
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
survey → form hypotheses → investigate to confirm/refute →
|
|
|
|
|
|
resolve uncertainty → update understanding → repeat until satisfied
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
This is how a human engineer actually explores an unfamiliar codebase or file
|
|
|
|
|
|
collection. The agent should decide *what it needs to know* and *how to find
|
|
|
|
|
|
it out* — not execute a predetermined checklist.
|
|
|
|
|
|
|
|
|
|
|
|
Every feature in this plan should be evaluated against that principle.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 1: Uncertainty as a First-Class Concept
|
|
|
|
|
|
|
|
|
|
|
|
### The core loop
|
|
|
|
|
|
|
|
|
|
|
|
The agent explicitly tracks confidence at each step. Low confidence is not
|
|
|
|
|
|
noted and moved past — it triggers a resolution strategy.
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
observe something
|
|
|
|
|
|
→ assess confidence
|
|
|
|
|
|
→ if high: cache and continue
|
|
|
|
|
|
→ if low: choose resolution strategy
|
|
|
|
|
|
→ read more local files
|
|
|
|
|
|
→ search externally (web, package registry, docs)
|
|
|
|
|
|
→ ask the user
|
|
|
|
|
|
→ flag as genuinely unknowable with explicit reasoning
|
|
|
|
|
|
→ update understanding
|
|
|
|
|
|
→ continue
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
This should be reflected in the dir loop prompt, the synthesis prompt, and the
|
|
|
|
|
|
refinement prompt. "I don't know" is never an acceptable terminal state if
|
|
|
|
|
|
there are resolution strategies available.
|
|
|
|
|
|
|
|
|
|
|
|
### Confidence tracking in cache entries
|
|
|
|
|
|
|
|
|
|
|
|
Add a `confidence` field (0.0–1.0) to both file and dir cache entries. The
|
|
|
|
|
|
agent sets this when writing cache. Low-confidence entries are candidates for
|
|
|
|
|
|
refinement-pass investigation.
|
|
|
|
|
|
|
|
|
|
|
|
File cache schema addition:
|
|
|
|
|
|
```
|
|
|
|
|
|
confidence: float # 0.0–1.0, agent's confidence in its summary
|
|
|
|
|
|
confidence_reason: str # why confidence is low, if below ~0.7
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The synthesis and refinement passes can use confidence scores to prioritize
|
|
|
|
|
|
what to look at again.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 2: Dynamic Domain Detection
|
|
|
|
|
|
|
|
|
|
|
|
### Why not a hardcoded taxonomy
|
|
|
|
|
|
|
|
|
|
|
|
A fixed domain list (code, documents, data, media, mixed) forces content into
|
|
|
|
|
|
predetermined buckets. Edge cases are inevitable: a medical imaging archive,
|
|
|
|
|
|
a legal discovery collection, a CAD project, a music production session, a
|
|
|
|
|
|
Jupyter notebook repo that's half code and half data. Hardcoded domains require
|
|
|
|
|
|
code changes to handle novel content and will always mis-classify ambiguous cases.
|
|
|
|
|
|
|
|
|
|
|
|
More fundamentally: the AI is good at recognizing what something is. Using
|
|
|
|
|
|
rule-based file-type ratios to gate its behavior is fighting the tool's
|
|
|
|
|
|
strengths.
|
|
|
|
|
|
|
|
|
|
|
|
### Survey pass (replaces hardcoded detection)
|
|
|
|
|
|
|
|
|
|
|
|
Before dir loops begin, a lightweight survey pass runs:
|
|
|
|
|
|
|
|
|
|
|
|
**Input**: file type distribution, tree structure (top 2 levels), total counts
|
|
|
|
|
|
|
|
|
|
|
|
**Task**: the survey agent answers three questions:
|
|
|
|
|
|
1. What is this? (plain language description, no forced taxonomy)
|
|
|
|
|
|
2. What analytical approach would be most useful?
|
|
|
|
|
|
3. Which available tools are relevant and which can be skipped?
|
|
|
|
|
|
|
|
|
|
|
|
**Output** (`submit_survey` tool):
|
|
|
|
|
|
```python
|
|
|
|
|
|
{
|
|
|
|
|
|
"description": str, # "a Python web service with Postgres migrations"
|
|
|
|
|
|
"approach": str, # how to investigate — what to prioritize, what to skip
|
|
|
|
|
|
"relevant_tools": [str], # tools worth using for this content
|
|
|
|
|
|
"skip_tools": [str], # tools not useful (e.g. parse_structure for a journal)
|
|
|
|
|
|
"domain_notes": str, # anything unusual the dir loops should know
|
|
|
|
|
|
"confidence": float, # how clear the signal was
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
**Max turns**: 3 — this is a lightweight orientation pass, not a deep investigation.
|
|
|
|
|
|
|
|
|
|
|
|
This output is injected into the dir loop system prompt as context. The dir
|
|
|
|
|
|
loops know what they're looking at before they start. They can also deviate if
|
|
|
|
|
|
they find something the survey missed.
|
|
|
|
|
|
|
|
|
|
|
|
### Tools are always available, AI selects what's relevant
|
|
|
|
|
|
|
|
|
|
|
|
Rather than gating tools by domain, every tool is offered with a clear
|
|
|
|
|
|
description of what it's for. The AI simply won't call `parse_structure` on
|
|
|
|
|
|
a `.xlsx` file because the description says it works on source files.
|
|
|
|
|
|
|
|
|
|
|
|
This also means new tools are automatically available to all future domains
|
|
|
|
|
|
without any profile configuration.
|
|
|
|
|
|
|
|
|
|
|
|
### What stays rule-based
|
|
|
|
|
|
|
|
|
|
|
|
The file type distribution summary fed into the survey prompt is still computed
|
|
|
|
|
|
from `filetypes.py` — this is cheap and provides useful signal. The difference
|
|
|
|
|
|
is that the AI interprets it rather than a lookup table.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 3: External Knowledge Tools
|
|
|
|
|
|
|
|
|
|
|
|
### The resolution strategy toolkit
|
|
|
|
|
|
|
|
|
|
|
|
When the agent encounters something it doesn't understand, it has options beyond
|
|
|
|
|
|
"read more local files." These are resolution strategies for specific kinds of
|
|
|
|
|
|
uncertainty.
|
|
|
|
|
|
|
|
|
|
|
|
**`web_search(query) → results`**
|
|
|
|
|
|
|
|
|
|
|
|
Use when: unfamiliar library, file format, API, framework, toolchain, naming
|
|
|
|
|
|
convention that doesn't resolve from local files alone.
|
|
|
|
|
|
|
|
|
|
|
|
Query construction should be evidence-based:
|
|
|
|
|
|
- Seeing `import dramatiq` → `"dramatiq python task queue library"`
|
|
|
|
|
|
- Finding `.avro` files → `"apache avro file format schema"`
|
|
|
|
|
|
- Spotting unfamiliar config key → `"<framework> <key> configuration"`
|
|
|
|
|
|
|
|
|
|
|
|
Results are summarized before injection into context. Raw search results are
|
|
|
|
|
|
not passed directly — a lightweight extraction pulls the relevant 2-3 sentences.
|
|
|
|
|
|
|
|
|
|
|
|
Budget: configurable max searches per session (default: 10). Logged in report.
|
|
|
|
|
|
|
|
|
|
|
|
**`fetch_url(url) → content`**
|
|
|
|
|
|
|
|
|
|
|
|
Use when: a local file references a URL that would explain what the project is
|
|
|
|
|
|
(e.g. a README links to documentation, a config references a schema URL, a
|
|
|
|
|
|
package.json has a homepage field).
|
|
|
|
|
|
|
|
|
|
|
|
Constrained to read-only fetches. Content truncated to relevant sections.
|
|
|
|
|
|
Budget: configurable (default: 5 per session).
|
|
|
|
|
|
|
|
|
|
|
|
**`package_lookup(name, ecosystem) → metadata`**
|
|
|
|
|
|
|
|
|
|
|
|
Use when: an import or dependency declaration references an unfamiliar package.
|
|
|
|
|
|
|
|
|
|
|
|
Queries package registries (PyPI, npm, crates.io, pkg.go.dev) for:
|
|
|
|
|
|
- Package description
|
|
|
|
|
|
- Version in use vs latest
|
|
|
|
|
|
- Known security advisories (if available)
|
|
|
|
|
|
- License
|
|
|
|
|
|
|
|
|
|
|
|
This is more targeted than web search and returns structured data. Particularly
|
|
|
|
|
|
useful for security-relevant analysis.
|
|
|
|
|
|
|
|
|
|
|
|
Budget: generous (default: 30) since queries are cheap and targeted.
|
|
|
|
|
|
|
|
|
|
|
|
**`ask_user(question) → answer`** *(interactive mode only)*
|
|
|
|
|
|
|
|
|
|
|
|
Use when: uncertainty cannot be resolved by any other means.
|
|
|
|
|
|
|
|
|
|
|
|
Examples:
|
|
|
|
|
|
- "I found 40 files with `.xyz` extension I don't recognize — what format is this?"
|
|
|
|
|
|
- "There are two entry points (server.py and worker.py) — which is the primary one?"
|
|
|
|
|
|
- "This directory appears to contain personal data — should I analyze it or skip it?"
|
|
|
|
|
|
|
|
|
|
|
|
Only triggered when other resolution strategies have been tried or are clearly
|
|
|
|
|
|
not applicable. Gated behind an `--interactive` flag since it blocks execution.
|
|
|
|
|
|
|
|
|
|
|
|
### All external tools are opt-in
|
|
|
|
|
|
|
|
|
|
|
|
`--no-external` flag disables all network tools (web_search, fetch_url,
|
|
|
|
|
|
package_lookup). Default behavior TBD — arguably external lookups should be
|
|
|
|
|
|
opt-in rather than opt-out given privacy considerations (see Concerns).
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 4: Investigation Planning
|
|
|
|
|
|
|
|
|
|
|
|
### Survey → plan → execute
|
|
|
|
|
|
|
|
|
|
|
|
Currently: every directory is processed in leaf-first order with equal
|
|
|
|
|
|
resource allocation. A 2-file directory gets the same max_turns as a 50-file
|
|
|
|
|
|
one.
|
|
|
|
|
|
|
|
|
|
|
|
Better: after the survey pass, a planning step decides where to invest depth.
|
|
|
|
|
|
|
|
|
|
|
|
**Planning pass** (`submit_plan` tool):
|
|
|
|
|
|
|
|
|
|
|
|
Input: survey output + full directory tree
|
|
|
|
|
|
|
|
|
|
|
|
Output:
|
|
|
|
|
|
```python
|
|
|
|
|
|
{
|
|
|
|
|
|
"priority_dirs": [ # investigate these deeply
|
|
|
|
|
|
{"path": str, "reason": str, "suggested_turns": int}
|
|
|
|
|
|
],
|
|
|
|
|
|
"shallow_dirs": [ # quick pass only
|
|
|
|
|
|
{"path": str, "reason": str}
|
|
|
|
|
|
],
|
|
|
|
|
|
"skip_dirs": [ # skip entirely (generated, vendored, etc.)
|
|
|
|
|
|
{"path": str, "reason": str}
|
|
|
|
|
|
],
|
|
|
|
|
|
"investigation_order": str, # "leaf-first" | "priority-first" | "breadth-first"
|
|
|
|
|
|
"notes": str, # anything else the investigation should know
|
|
|
|
|
|
}
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The orchestrator uses this plan to allocate turns per directory and set
|
|
|
|
|
|
investigation order. The plan is also saved to cache so resumed investigations
|
|
|
|
|
|
can follow the same strategy.
|
|
|
|
|
|
|
|
|
|
|
|
### Dynamic turn allocation
|
|
|
|
|
|
|
|
|
|
|
|
Replace fixed `max_turns=14` per directory with a global turn budget the agent
|
|
|
|
|
|
manages. The planning pass allocates turns to directories based on apparent
|
|
|
|
|
|
complexity. The agent can request additional turns mid-investigation if it hits
|
|
|
|
|
|
something unexpectedly complex.
|
|
|
|
|
|
|
|
|
|
|
|
A simple model:
|
|
|
|
|
|
- Global budget = `base_turns_per_dir * dir_count` (e.g. 10 * 20 = 200)
|
|
|
|
|
|
- Planning pass distributes: priority dirs get 15-20, shallow dirs get 5, skip dirs get 0
|
|
|
|
|
|
- Agent can "borrow" turns from its own budget if it needs more
|
|
|
|
|
|
- If budget runs low, a warning is injected into the prompt
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 5: Scale-Tiered Synthesis
|
|
|
|
|
|
|
|
|
|
|
|
### Why tiers are still needed
|
|
|
|
|
|
|
|
|
|
|
|
Even with better investigation planning and agentic depth control, the synthesis
|
|
|
|
|
|
input problem remains: 300 directory summaries cannot be meaningfully synthesized
|
|
|
|
|
|
in one shot. The output is either truncated, loses fidelity, or both.
|
|
|
|
|
|
|
|
|
|
|
|
Tier classification based on post-loop measurements:
|
|
|
|
|
|
|
|
|
|
|
|
| Tier | dir_count | file_count | Synthesis approach |
|
|
|
|
|
|
|---|---|---|---|
|
|
|
|
|
|
| `small` | < 5 | < 30 | Feed per-file cache entries directly |
|
|
|
|
|
|
| `medium` | 5–30 | 30–300 | Dir summaries (current approach) |
|
|
|
|
|
|
| `large` | 31–150 | 301–1500 | Multi-level synthesis |
|
|
|
|
|
|
| `xlarge` | > 150 | > 1500 | Multi-level + subsystem grouping |
|
|
|
|
|
|
|
|
|
|
|
|
Thresholds configurable via CLI flags or config file.
|
|
|
|
|
|
|
|
|
|
|
|
### Small tier: per-file summaries
|
|
|
|
|
|
|
|
|
|
|
|
File cache entries are the most granular, most grounded signal in the system —
|
|
|
|
|
|
written while the AI was actually reading files. For small targets they fit
|
|
|
|
|
|
comfortably in the synthesis context window and produce a richer output than
|
|
|
|
|
|
dir summaries.
|
|
|
|
|
|
|
|
|
|
|
|
### Multi-level synthesis (large/xlarge)
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
dir summaries
|
|
|
|
|
|
↓ (grouping pass: dirs → subsystems, AI-identified)
|
|
|
|
|
|
subsystem summaries (3–10 groups)
|
|
|
|
|
|
↓ (final synthesis)
|
|
|
|
|
|
report
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The grouping pass is itself agentic: the AI identifies logical subsystems from
|
|
|
|
|
|
dir summaries, not from directory structure. An `auth/` dir and a
|
|
|
|
|
|
`middleware/session/` dir might end up in the same "Authentication" subsystem.
|
|
|
|
|
|
|
|
|
|
|
|
For xlarge:
|
|
|
|
|
|
```
|
|
|
|
|
|
dir summaries
|
|
|
|
|
|
↓ (level-1: dirs → subsystems, 10–30 groups)
|
|
|
|
|
|
↓ (level-2: subsystems → domains/layers, 3–8 groups)
|
|
|
|
|
|
↓ (final synthesis)
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Synthesis depth scales with tier
|
|
|
|
|
|
|
|
|
|
|
|
The synthesis prompt receives explicit depth guidance:
|
|
|
|
|
|
|
|
|
|
|
|
- **small**: "Be concise but specific. Reference actual filenames. 2–3 paragraphs."
|
|
|
|
|
|
- **medium**: "Produce a structured breakdown. Cover purpose, components, concerns."
|
|
|
|
|
|
- **large**: "Produce a thorough architectural analysis with section headers. Be specific."
|
|
|
|
|
|
- **xlarge**: "Produce a comprehensive report. Cover architecture, subsystems, interfaces, cross-cutting concerns, and notable anomalies. Reference actual paths."
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 6: Hypothesis-Driven Synthesis
|
|
|
|
|
|
|
|
|
|
|
|
### Current approach: aggregation
|
|
|
|
|
|
|
|
|
|
|
|
Synthesis currently aggregates dir summaries into a report. It's descriptive:
|
|
|
|
|
|
"here is what I found in each part."
|
|
|
|
|
|
|
|
|
|
|
|
### Better approach: conclusion with evidence
|
|
|
|
|
|
|
|
|
|
|
|
The synthesis agent should:
|
|
|
|
|
|
1. Form an initial hypothesis about the whole from the dir summaries
|
|
|
|
|
|
2. Look for evidence that confirms or refutes it
|
|
|
|
|
|
3. Consider alternative interpretations
|
|
|
|
|
|
4. Produce a conclusion that reflects the reasoning, not just the observations
|
|
|
|
|
|
|
|
|
|
|
|
This produces output like: *"This appears to be a multi-tenant SaaS backend
|
|
|
|
|
|
(hypothesis) — the presence of tenant_id throughout the schema, separate
|
|
|
|
|
|
per-tenant job queues, and the auth middleware's scope validation all support
|
|
|
|
|
|
this (evidence). The monolith structure suggests it hasn't been decomposed into
|
|
|
|
|
|
services yet (alternative consideration)."*
|
|
|
|
|
|
|
|
|
|
|
|
Rather than: *"The auth directory handles authentication. The jobs directory
|
|
|
|
|
|
handles background jobs. The models directory contains database models."*
|
|
|
|
|
|
|
|
|
|
|
|
The `think` tool already supports this pattern — the synthesis prompt should
|
|
|
|
|
|
explicitly instruct hypothesis formation before `submit_report`.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 7: Refinement Pass
|
|
|
|
|
|
|
|
|
|
|
|
### Trigger
|
|
|
|
|
|
|
|
|
|
|
|
`--refine` flag. Off by default.
|
|
|
|
|
|
|
|
|
|
|
|
### What it does
|
|
|
|
|
|
|
|
|
|
|
|
After synthesis, the refinement agent receives:
|
|
|
|
|
|
- Current synthesis output (brief + full analysis)
|
|
|
|
|
|
- All dir and file cache entries including confidence scores
|
|
|
|
|
|
- Full investigation toolset including external knowledge tools
|
|
|
|
|
|
- A list of low-confidence cache entries (confidence < 0.7)
|
|
|
|
|
|
|
|
|
|
|
|
It is instructed to:
|
|
|
|
|
|
1. Identify gaps (things not determined from summaries)
|
|
|
|
|
|
2. Identify contradictions (dir summaries that conflict)
|
|
|
|
|
|
3. Identify cross-cutting concerns (patterns spanning multiple dirs)
|
|
|
|
|
|
4. Resolve low-confidence entries
|
|
|
|
|
|
5. Submit an improved report
|
|
|
|
|
|
|
|
|
|
|
|
The refinement agent owns its investigation — it decides what to look at and
|
|
|
|
|
|
in what order, using the full resolution strategy toolkit.
|
|
|
|
|
|
|
|
|
|
|
|
### Multiple passes
|
|
|
|
|
|
|
|
|
|
|
|
`--refine-depth N` runs N refinement passes. Natural stopping condition: the
|
|
|
|
|
|
agent calls `submit_report` without making any file reads or external lookups
|
|
|
|
|
|
(indicates nothing new was found). This can short-circuit before N passes.
|
|
|
|
|
|
|
|
|
|
|
|
### Refinement vs re-investigation
|
|
|
|
|
|
|
|
|
|
|
|
Refinement is targeted — it focuses on specific gaps and uncertainties. It is
|
|
|
|
|
|
not a re-run of the full dir loops. The prompt makes this explicit:
|
|
|
|
|
|
*"Focus on resolving uncertainty, not re-summarizing what is already known."*
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 8: Report Structure
|
|
|
|
|
|
|
|
|
|
|
|
### Domain-appropriate sections
|
|
|
|
|
|
|
|
|
|
|
|
Instead of fixed `brief` + `detailed` fields, the synthesis produces structured
|
|
|
|
|
|
fields based on what the survey identified. Fields that are absent or empty are
|
|
|
|
|
|
not rendered.
|
|
|
|
|
|
|
|
|
|
|
|
The survey output's `description` shapes what fields are relevant. This is not
|
|
|
|
|
|
a hardcoded domain → schema mapping — the synthesis prompt asks the agent to
|
|
|
|
|
|
populate fields that are relevant to *this specific content* from a superset
|
|
|
|
|
|
of available fields:
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
Available output fields (populate those relevant to this content):
|
|
|
|
|
|
- brief (always)
|
|
|
|
|
|
- architecture (software projects)
|
|
|
|
|
|
- components (software projects, large document collections)
|
|
|
|
|
|
- tech_stack (software projects)
|
|
|
|
|
|
- entry_points (software projects, CLI tools)
|
|
|
|
|
|
- datasets (data collections)
|
|
|
|
|
|
- schema_summary (data collections, databases)
|
|
|
|
|
|
- period_covered (financial data, journals, time-series)
|
|
|
|
|
|
- themes (document collections, journals)
|
|
|
|
|
|
- data_quality (data collections)
|
|
|
|
|
|
- concerns (any domain)
|
|
|
|
|
|
- overall_purpose (mixed/composite targets)
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The report formatter renders populated fields with appropriate headers and
|
|
|
|
|
|
skips unpopulated ones. Small simple targets produce minimal output. Large
|
|
|
|
|
|
complex targets produce full structured reports.
|
|
|
|
|
|
|
|
|
|
|
|
### Progressive output (future)
|
|
|
|
|
|
|
|
|
|
|
|
Rather than one report at the end, stream findings as the investigation
|
|
|
|
|
|
proceeds. The user sees the agent's understanding build in real time. This
|
|
|
|
|
|
converts luminos from a batch tool into an interactive investigation partner.
|
|
|
|
|
|
|
|
|
|
|
|
Requires a streaming-aware output layer — significant architectural change,
|
|
|
|
|
|
probably not Phase 1.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 9: Parallel Investigation
|
|
|
|
|
|
|
|
|
|
|
|
### For large targets
|
|
|
|
|
|
|
|
|
|
|
|
Multiple dir-loop agents investigate different subsystems concurrently, then
|
|
|
|
|
|
report to a coordinator. The coordinator synthesizes their findings and
|
|
|
|
|
|
identifies cross-cutting concerns.
|
|
|
|
|
|
|
|
|
|
|
|
This requires:
|
|
|
|
|
|
- A coordinator agent that owns the investigation plan
|
|
|
|
|
|
- Worker agents scoped to subsystems
|
|
|
|
|
|
- A shared cache that workers write to concurrently (needs locking or
|
|
|
|
|
|
append-only design)
|
|
|
|
|
|
- A merge step in the coordinator before synthesis
|
|
|
|
|
|
|
|
|
|
|
|
Significant complexity. Probably deferred until single-agent investigation
|
|
|
|
|
|
quality is high. The main benefit is speed, not quality — worth revisiting when
|
|
|
|
|
|
the investigation quality ceiling has been reached.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Part 10: MCP Backend Abstraction
|
|
|
|
|
|
|
|
|
|
|
|
### Why
|
|
|
|
|
|
|
|
|
|
|
|
The investigation loop (survey → plan → investigate → synthesize) is
|
|
|
|
|
|
generic. The filesystem-specific parts — how to list a directory, read
|
|
|
|
|
|
a file, parse structure — are an implementation detail. Abstracting
|
|
|
|
|
|
the backend via MCP decouples the two and makes luminos extensible to
|
|
|
|
|
|
any exploration target: websites, wikis, databases, running processes.
|
|
|
|
|
|
|
|
|
|
|
|
This pivot also serves the project's learning goal. Migrating working
|
|
|
|
|
|
code into an agentic framework is a common and painful real-world task.
|
|
|
|
|
|
Building it clean from the start teaches the pattern; migrating teaches
|
|
|
|
|
|
*why* the pattern exists. The migration pain is intentional.
|
|
|
|
|
|
|
|
|
|
|
|
### The model
|
|
|
|
|
|
|
|
|
|
|
|
Each exploration target is an MCP server. Luminos is an MCP client.
|
|
|
|
|
|
The investigation loop connects to a server at startup, discovers its
|
|
|
|
|
|
tools, passes them to the Anthropic API, and forwards tool calls to
|
|
|
|
|
|
the server at runtime.
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
|
luminos (MCP client)
|
|
|
|
|
|
↓ connects to
|
|
|
|
|
|
filesystem MCP server | process MCP server | wiki MCP server | ...
|
|
|
|
|
|
↓ exposes tools
|
|
|
|
|
|
read_file, list_dir, parse_structure, ...
|
|
|
|
|
|
↓ passed to
|
|
|
|
|
|
Anthropic API (agent calls them)
|
|
|
|
|
|
↓ forwarded back to
|
|
|
|
|
|
MCP server (executes, returns result)
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The filesystem MCP server is the default. `--mcp <uri>` selects
|
|
|
|
|
|
an alternative server.
|
|
|
|
|
|
|
|
|
|
|
|
### What changes
|
|
|
|
|
|
|
|
|
|
|
|
- `ai.py` tool dispatch: instead of calling local Python functions,
|
|
|
|
|
|
forward to the connected MCP server
|
|
|
|
|
|
- Tool definitions: dynamically discovered from the server via
|
|
|
|
|
|
`tools/list`, not hardcoded in `ai.py`
|
|
|
|
|
|
- New `luminos_lib/mcp_client.py`: thin MCP client (stdio transport)
|
|
|
|
|
|
- New `luminos_mcp/filesystem.py`: MCP server wrapping existing
|
|
|
|
|
|
filesystem tools (`read_file`, `list_dir`, `parse_structure`,
|
|
|
|
|
|
`run_command`, `stat_file`)
|
|
|
|
|
|
- `--mcp` CLI flag for selecting a non-default server
|
|
|
|
|
|
|
|
|
|
|
|
### What does not change
|
|
|
|
|
|
|
|
|
|
|
|
Cache storage, confidence tracking, survey/planning/synthesis passes,
|
|
|
|
|
|
token tracking, cost reporting, all prompts. None of these know or
|
|
|
|
|
|
care what backend provided the data.
|
|
|
|
|
|
|
|
|
|
|
|
### Known tensions
|
|
|
|
|
|
|
|
|
|
|
|
**The tree assumption.** The investigation loop assumes hierarchical
|
|
|
|
|
|
containers. Non-filesystem backends (websites, processes) must present
|
|
|
|
|
|
a virtual tree or the traversal model breaks. This is the MCP server's
|
|
|
|
|
|
problem to solve, not luminos's — but it is real design work.
|
|
|
|
|
|
|
|
|
|
|
|
**Tool count.** If multiple MCP servers are connected simultaneously
|
|
|
|
|
|
(filesystem + web search + package lookup), tool count grows. More
|
|
|
|
|
|
tools degrades agent decision quality. Keep each server focused.
|
|
|
|
|
|
|
|
|
|
|
|
**The filesystem backend is a demotion.** Currently filesystem
|
|
|
|
|
|
investigation is native — zero overhead. Making it an MCP server adds
|
|
|
|
|
|
process-launch overhead. Acceptable given API call latency already
|
|
|
|
|
|
dominates, but worth knowing.
|
|
|
|
|
|
|
|
|
|
|
|
**Phase 4 becomes MCP servers.** After the pivot, web_search,
|
|
|
|
|
|
fetch_url, and package_lookup are natural candidates to implement as
|
|
|
|
|
|
MCP servers rather than hardcoded Python functions. Phase 4 and the
|
|
|
|
|
|
MCP pattern reinforce each other.
|
|
|
|
|
|
|
|
|
|
|
|
### Timing
|
|
|
|
|
|
|
|
|
|
|
|
After Phase 3, before Phase 4. At that point survey + planning +
|
|
|
|
|
|
dir loops + synthesis are all working with filesystem assumptions
|
|
|
|
|
|
baked in — enough surface area to make the migration instructive
|
|
|
|
|
|
without 9 phases of rework.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Implementation Order
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 1 — Confidence tracking
|
|
|
|
|
|
- Add `confidence` + `confidence_reason` to cache schemas
|
|
|
|
|
|
- Update dir loop prompt to set confidence when writing cache
|
|
|
|
|
|
- No behavior change yet — just instrumentation
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 2 — Survey pass
|
|
|
|
|
|
- New `_run_survey()` function in `ai.py`
|
|
|
|
|
|
- `submit_survey` tool definition
|
|
|
|
|
|
- `_SURVEY_SYSTEM_PROMPT` in `prompts.py`
|
|
|
|
|
|
- Wire into `_run_investigation()` before dir loops
|
|
|
|
|
|
- Survey output injected into dir loop system prompt
|
2026-04-07 03:47:49 +00:00
|
|
|
|
- **Rebuild filetype classifier (#42)** to remove source-code bias —
|
|
|
|
|
|
lands after the survey pass is observable end-to-end (#4–#7) and
|
|
|
|
|
|
before Phase 3 starts depending on survey output for real decisions.
|
|
|
|
|
|
Until then, the survey prompt carries a Band-Aid warning that the
|
|
|
|
|
|
histogram is biased toward source code.
|
2026-04-07 03:15:27 +00:00
|
|
|
|
|
2026-04-07 03:59:01 +00:00
|
|
|
|
### Phase 2.5 — Context budget reliability (#44)
|
|
|
|
|
|
- Dir loop exhausts the 126k context budget on a 13-file Python lib
|
|
|
|
|
|
(observed during #5 smoke test). Must be addressed before Phase 3
|
|
|
|
|
|
adds longer prompts and more tools, or every later phase will
|
|
|
|
|
|
inherit a broken foundation.
|
|
|
|
|
|
- Investigate root cause (tool result accumulation, parse_structure
|
|
|
|
|
|
output size, redundant reads) before picking a fix.
|
|
|
|
|
|
- Add token-usage instrumentation so regressions are visible.
|
|
|
|
|
|
|
2026-04-07 03:15:27 +00:00
|
|
|
|
### Phase 3 — Investigation planning
|
|
|
|
|
|
- Planning pass after survey, before dir loops
|
|
|
|
|
|
- `submit_plan` tool
|
|
|
|
|
|
- Dynamic turn allocation based on plan
|
|
|
|
|
|
- Dir loop orchestrator updated to follow plan
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 3.5 — MCP backend abstraction (pivot point)
|
|
|
|
|
|
See Part 10 for full design. This phase happens *after* Phase 3 is
|
|
|
|
|
|
working and *before* Phase 4. The goal is to migrate the filesystem
|
|
|
|
|
|
investigation into an MCP server/client model before adding more
|
|
|
|
|
|
backends or external tools.
|
|
|
|
|
|
|
|
|
|
|
|
- Extract filesystem tools (`read_file`, `list_dir`, `parse_structure`,
|
|
|
|
|
|
`run_command`, `stat_file`) into a standalone MCP server
|
|
|
|
|
|
- Refactor `ai.py` into an MCP client: discover tools dynamically,
|
|
|
|
|
|
forward tool calls to the server, return results to the agent
|
|
|
|
|
|
- Replace hardcoded tool list in the dir loop with dynamic tool
|
|
|
|
|
|
discovery from the connected MCP server
|
|
|
|
|
|
- Keep the filesystem MCP server as the default; `--mcp` flag selects
|
|
|
|
|
|
alternative servers
|
|
|
|
|
|
- No behavior change to the investigation loop — purely structural
|
|
|
|
|
|
|
|
|
|
|
|
**Learning goal:** experience migrating working code into an MCP
|
|
|
|
|
|
architecture. The migration pain is intentional and instructive.
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 4 — External knowledge tools
|
|
|
|
|
|
- `web_search` tool + implementation (requires optional dep: search API client)
|
|
|
|
|
|
- `package_lookup` tool + implementation (HTTP to package registries)
|
|
|
|
|
|
- `fetch_url` tool + implementation
|
|
|
|
|
|
- `--no-external` flag to disable network tools
|
|
|
|
|
|
- Budget tracking and logging
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 5 — Scale-tiered synthesis
|
|
|
|
|
|
- Sizing measurement after dir loops
|
|
|
|
|
|
- Tier classification
|
|
|
|
|
|
- Small tier: switch synthesis input to file cache entries
|
|
|
|
|
|
- Depth instructions in synthesis prompt
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 6 — Multi-level synthesis
|
|
|
|
|
|
- Grouping pass + `submit_grouping` tool
|
|
|
|
|
|
- Final synthesis receives subsystem summaries at large/xlarge tier
|
|
|
|
|
|
- Two-level grouping for xlarge
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 7 — Hypothesis-driven synthesis
|
|
|
|
|
|
- Update synthesis prompt to require hypothesis formation before submit_report
|
|
|
|
|
|
- `think` tool made available in synthesis (currently restricted)
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 8 — Refinement pass
|
|
|
|
|
|
- `--refine` flag + `_run_refinement()`
|
|
|
|
|
|
- Refinement uses confidence scores to prioritize
|
|
|
|
|
|
- `--refine-depth N`
|
|
|
|
|
|
|
|
|
|
|
|
### Phase 9 — Dynamic report structure
|
|
|
|
|
|
- Superset output fields in synthesis submit_report schema
|
|
|
|
|
|
- Report formatter renders populated fields only
|
|
|
|
|
|
- Domain-appropriate section headers
|
|
|
|
|
|
|
2026-04-07 04:20:54 +00:00
|
|
|
|
### End-of-project tuning
|
|
|
|
|
|
- **Revisit survey-skip thresholds (#46)** — `_SURVEY_MIN_FILES` and
|
|
|
|
|
|
`_SURVEY_MIN_DIRS` shipped with values from #7's example, no
|
|
|
|
|
|
empirical basis. Once `--ai` has been run on a variety of real
|
|
|
|
|
|
targets, look at which runs skipped the survey vs ran it and decide
|
|
|
|
|
|
whether the thresholds (or the gate logic itself) need to change.
|
|
|
|
|
|
|
2026-04-07 03:15:27 +00:00
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## File Map
|
|
|
|
|
|
|
|
|
|
|
|
| File | Changes |
|
|
|
|
|
|
|---|---|
|
|
|
|
|
|
| `luminos_lib/domain.py` | **new** — survey pass, plan pass, profile-free detection |
|
|
|
|
|
|
| `luminos_lib/prompts.py` | survey prompt, planning prompt, refinement prompt, updated dir/synthesis prompts |
|
|
|
|
|
|
| `luminos_lib/ai.py` | survey, planning, external tools, tiered synthesis, multi-level grouping, refinement, confidence-aware cache writes |
|
|
|
|
|
|
| `luminos_lib/cache.py` | confidence fields in schemas, low-confidence query |
|
|
|
|
|
|
| `luminos_lib/report.py` | dynamic field rendering, domain-appropriate sections |
|
|
|
|
|
|
| `luminos.py` | --refine, --no-external, --refine-depth flags; wire survey into scan |
|
|
|
|
|
|
| `luminos_lib/search.py` | **new** — web_search, fetch_url, package_lookup implementations |
|
|
|
|
|
|
|
|
|
|
|
|
No changes needed to: `tree.py`, `filetypes.py`, `code.py`, `recency.py`,
|
|
|
|
|
|
`disk.py`, `capabilities.py`, `watch.py`, `ast_parser.py`
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Known Unknowns
|
|
|
|
|
|
|
|
|
|
|
|
**Search API choice**
|
|
|
|
|
|
Web search requires an API (Brave Search, Serper, SerpAPI, DuckDuckGo, etc.).
|
|
|
|
|
|
Each has different pricing, rate limits, result quality, and privacy
|
|
|
|
|
|
implications. Which one to use, whether to require an API key, and what the
|
|
|
|
|
|
fallback is when no key is configured — all undecided. Could support multiple
|
|
|
|
|
|
backends with a configurable preference.
|
|
|
|
|
|
|
|
|
|
|
|
**Package registry coverage**
|
|
|
|
|
|
`package_lookup` needs to handle PyPI, npm, crates.io, pkg.go.dev, Maven,
|
|
|
|
|
|
RubyGems, NuGet at minimum. Each has a different API shape. Coverage gap for
|
|
|
|
|
|
less common ecosystems (Hex for Elixir, Hackage for Haskell, etc.) — the agent
|
|
|
|
|
|
will get no lookup result and must fall back to web search.
|
|
|
|
|
|
|
|
|
|
|
|
**search result summarization**
|
|
|
|
|
|
Raw search results can't be injected directly into context — they're too long
|
|
|
|
|
|
and too noisy. A summarization step is needed. Options: another AI call (adds
|
|
|
|
|
|
latency and cost), regex extraction (fragile), a lightweight extraction
|
|
|
|
|
|
heuristic. The right approach is unclear.
|
|
|
|
|
|
|
|
|
|
|
|
**Turn budget arithmetic**
|
|
|
|
|
|
Dynamic turn allocation sounds clean in theory. In practice: how does the
|
|
|
|
|
|
agent "request more turns"? The orchestrator has to interrupt the loop,
|
|
|
|
|
|
check the global budget, and decide whether to grant more. This requires
|
|
|
|
|
|
mid-loop communication that doesn't exist today. Implementation complexity
|
|
|
|
|
|
is non-trivial.
|
|
|
|
|
|
|
|
|
|
|
|
**Cache invalidation on strategy changes**
|
|
|
|
|
|
If a user re-runs with different flags (--refine, --no-external, new --exclude
|
|
|
|
|
|
list), the existing cache entries may have been produced under a different
|
|
|
|
|
|
investigation strategy. Should they be invalidated? Currently --fresh is the
|
|
|
|
|
|
only mechanism. A smarter approach would store the investigation parameters
|
|
|
|
|
|
in cache metadata and detect mismatches.
|
|
|
|
|
|
|
|
|
|
|
|
**Confidence calibration**
|
|
|
|
|
|
Asking the agent to self-report confidence (0.0–1.0) is only useful if the
|
|
|
|
|
|
numbers are meaningful and consistent. LLMs are known to be poorly calibrated
|
|
|
|
|
|
on confidence. A 0.6 from one run may not mean the same as 0.6 from another.
|
|
|
|
|
|
This may need to be a categorical signal (high/medium/low) rather than numeric
|
|
|
|
|
|
to be reliable in practice.
|
|
|
|
|
|
|
|
|
|
|
|
**Context window growth with external tools**
|
|
|
|
|
|
Each web search result, package lookup, and URL fetch adds to the context
|
|
|
|
|
|
window for that dir loop. For a directory with many unknown dependencies, the
|
|
|
|
|
|
context could grow large enough to trigger the budget early exit. Need to think
|
|
|
|
|
|
about how external tool results are managed in context — perhaps summarized and
|
|
|
|
|
|
discarded from messages after being processed.
|
|
|
|
|
|
|
|
|
|
|
|
**`ask_user` blocking behavior**
|
|
|
|
|
|
Interactive mode with `ask_user` would block execution waiting for input. This
|
|
|
|
|
|
is fine in a terminal session but incompatible with piped output, scripted use,
|
|
|
|
|
|
or running luminos as a subprocess. Needs a clear mode distinction and graceful
|
|
|
|
|
|
degradation when input is not a TTY.
|
|
|
|
|
|
|
|
|
|
|
|
**Survey pass quality on tiny targets**
|
|
|
|
|
|
For a target with 3 files, the survey pass adds an API call that may cost more
|
|
|
|
|
|
than it's worth. There should be a minimum size threshold below which the
|
|
|
|
|
|
survey is skipped and a generic approach is used.
|
|
|
|
|
|
|
|
|
|
|
|
**Parallel investigation complexity**
|
|
|
|
|
|
Concurrent dir-loop agents writing to a shared cache introduces race conditions.
|
|
|
|
|
|
The current `_CacheManager` writes files directly with no locking. This would
|
|
|
|
|
|
need to be addressed before parallel investigation is viable.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Additional Suggestions
|
|
|
|
|
|
|
|
|
|
|
|
**Config file**
|
|
|
|
|
|
Many things that are currently hardcoded (turn budget, tier thresholds, search
|
|
|
|
|
|
budget, confidence threshold for refinement) should be user-configurable without
|
|
|
|
|
|
CLI flags. A `luminos.toml` in the target directory or `~/.config/luminos/`
|
|
|
|
|
|
would allow project-specific and user-specific defaults.
|
|
|
|
|
|
|
|
|
|
|
|
**Structured logging**
|
|
|
|
|
|
The `[AI]` stderr output is useful but informal. A structured log (JSONL file
|
|
|
|
|
|
alongside the cache) would allow post-hoc analysis of investigation quality:
|
|
|
|
|
|
which dirs used the most turns, which triggered web searches, which had low
|
|
|
|
|
|
confidence, where budget pressure hit. This also enables future tooling on top
|
|
|
|
|
|
of luminos investigations.
|
|
|
|
|
|
|
|
|
|
|
|
**Investigation replay**
|
|
|
|
|
|
The cache already stores summaries but not the investigation trace (what the
|
|
|
|
|
|
agent read, in what order, what it decided to skip). Storing the full message
|
|
|
|
|
|
history per directory would allow replaying or auditing an investigation. Cost:
|
|
|
|
|
|
storage. Benefit: debuggability, ability to resume investigations more faithfully.
|
|
|
|
|
|
|
|
|
|
|
|
**Watch mode + incremental investigation**
|
|
|
|
|
|
Watch mode currently re-runs the full base scan on changes. For AI-augmented
|
|
|
|
|
|
watch mode: detect which directories changed, re-investigate only those, and
|
|
|
|
|
|
patch the cache entries. The synthesis would then re-run from the updated cache
|
|
|
|
|
|
without re-investigating unchanged directories.
|
|
|
|
|
|
|
|
|
|
|
|
**Optional PDF and Office document readers**
|
|
|
|
|
|
The data and documents domains would benefit from native content extraction:
|
|
|
|
|
|
- `pdfminer` or `pypdf` for PDF text extraction
|
|
|
|
|
|
- `openpyxl` for Excel schema and sheet enumeration
|
|
|
|
|
|
- `python-docx` for Word document text
|
|
|
|
|
|
These would be optional deps like the existing AI deps, gated behind
|
|
|
|
|
|
`--install-extras`. The agent currently can only see filename and size for
|
|
|
|
|
|
these formats.
|
|
|
|
|
|
|
|
|
|
|
|
**Security-focused analysis mode**
|
|
|
|
|
|
A `--security` flag could tune the investigation toward security-relevant
|
|
|
|
|
|
findings: dependency vulnerability scanning, hardcoded secrets detection,
|
|
|
|
|
|
permission issues, exposed configuration, insecure patterns. The flag would
|
|
|
|
|
|
bias the survey, dir loop prompts, and synthesis toward these concerns and
|
|
|
|
|
|
expand the flags output with severity-ranked security findings.
|
|
|
|
|
|
|
|
|
|
|
|
**Output formats**
|
|
|
|
|
|
The current report is terminal-formatted text or JSON. Additional formats worth
|
|
|
|
|
|
considering:
|
|
|
|
|
|
- Markdown (for saving to wikis, Notion, Obsidian)
|
|
|
|
|
|
- HTML (self-contained report with collapsible sections)
|
|
|
|
|
|
- SARIF (for security findings — integrates with GitHub Code Scanning)
|
|
|
|
|
|
|
|
|
|
|
|
**Model selection**
|
|
|
|
|
|
The model is hardcoded to `claude-sonnet-4-20250514`. The survey and planning
|
|
|
|
|
|
passes are lightweight enough to use a faster/cheaper model (Haiku). The dir
|
|
|
|
|
|
loops and synthesis warrant Sonnet or better. The refinement pass might benefit
|
|
|
|
|
|
from Opus for difficult cases. A `--model` flag and per-pass model configuration
|
|
|
|
|
|
would allow cost/quality tradeoffs.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Concerns
|
|
|
|
|
|
|
|
|
|
|
|
**Cost at scale**
|
|
|
|
|
|
Adding a survey pass, planning pass, external tool lookups, and multiple
|
|
|
|
|
|
refinement passes significantly increases API call count and token consumption.
|
|
|
|
|
|
A large repo run with `--refine` could easily cost several dollars. The current
|
|
|
|
|
|
cost reporting (total tokens at end) may not be sufficient — users need to
|
|
|
|
|
|
understand cost before committing to a long run. Consider a `--estimate` mode
|
|
|
|
|
|
that projects cost from the base scan without running AI.
|
|
|
|
|
|
|
|
|
|
|
|
**Privacy and external lookups**
|
|
|
|
|
|
Web searches and URL fetches send information about the target's contents to
|
|
|
|
|
|
external services. For a personal journal or proprietary codebase this could
|
|
|
|
|
|
be a significant privacy concern. The `--no-external` flag addresses this but
|
|
|
|
|
|
it should probably be the *default* for sensitive-looking content (PII detected
|
|
|
|
|
|
in filenames, etc.), not something the user has to know to enable.
|
|
|
|
|
|
|
|
|
|
|
|
**Prompt injection via file contents**
|
|
|
|
|
|
`read_file` passes raw file contents into the context. A malicious file in the
|
|
|
|
|
|
target directory could contain prompt injection attempts. The current system has
|
|
|
|
|
|
no sanitization. This is an existing concern that grows as the agent gains more
|
|
|
|
|
|
capabilities (web search, URL fetch, package lookup — all of which could
|
|
|
|
|
|
theoretically be manipulated by a crafted file).
|
|
|
|
|
|
|
|
|
|
|
|
**Reliability of self-reported confidence**
|
|
|
|
|
|
The confidence tracking system depends on the agent accurately reporting its
|
|
|
|
|
|
own uncertainty. If the agent is systematically over-confident (which LLMs tend
|
|
|
|
|
|
to be), the refinement pass will never trigger on cases where it's most needed.
|
|
|
|
|
|
The system should have a skeptical prior — low-confidence by default for
|
|
|
|
|
|
unfamiliar file types, missing READMEs, ambiguous structures.
|
|
|
|
|
|
|
|
|
|
|
|
**Investigation quality regression risk**
|
|
|
|
|
|
Each new pass (survey, planning, refinement) adds opportunities for the
|
|
|
|
|
|
investigation to go wrong. A bad survey misleads all subsequent dir loops. A
|
|
|
|
|
|
bad plan wastes turns on shallow directories and skips critical ones. The system
|
|
|
|
|
|
needs quality signals — probably the confidence scores aggregated across the
|
|
|
|
|
|
investigation — to detect when something went wrong and potentially retry.
|
|
|
|
|
|
|
|
|
|
|
|
**Watch mode compatibility**
|
|
|
|
|
|
Several of the planned features (survey pass, planning, external tools) are not
|
|
|
|
|
|
designed for incremental re-use in watch mode. Adding AI capability to watch
|
|
|
|
|
|
mode is a separate design problem that deserves its own thinking.
|
|
|
|
|
|
|
|
|
|
|
|
**Turn budget contention**
|
|
|
|
|
|
If the planning pass allocates turns and the agent borrows from its budget when
|
|
|
|
|
|
it needs more, there's a risk of runaway investigation on unexpectedly complex
|
|
|
|
|
|
directories. Needs a hard ceiling (global max tokens, not just per-dir turns)
|
|
|
|
|
|
as a backstop.
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
|
|
## Raw Thoughts
|
|
|
|
|
|
|
|
|
|
|
|
The investigation planning idea is conceptually appealing but has a chicken-and-
|
|
|
|
|
|
egg problem: you need to know what's in the directories to plan how to
|
|
|
|
|
|
investigate them, but you haven't investigated yet. The survey pass helps but
|
|
|
|
|
|
it's shallow. Maybe the first pass through each directory should be a cheap
|
|
|
|
|
|
orientation (list contents, read one file) that feeds the plan before the full
|
|
|
|
|
|
investigation starts. Two-phase dir investigation: orient then investigate.
|
|
|
|
|
|
|
|
|
|
|
|
The hypothesis-driven synthesis is probably the highest leverage change in this
|
|
|
|
|
|
whole plan. The current synthesis produces descriptive output. Hypothesis-driven
|
|
|
|
|
|
synthesis produces analytical output. The prompt change is small but the output
|
|
|
|
|
|
quality difference could be significant.
|
|
|
|
|
|
|
|
|
|
|
|
Web search feels like it should be a last resort, not an early one. The agent
|
|
|
|
|
|
should exhaust local investigation before reaching for external sources. The
|
|
|
|
|
|
prompt should reflect this: "Only search if you cannot determine this from the
|
|
|
|
|
|
files available."
|
|
|
|
|
|
|
|
|
|
|
|
There's a question of whether the survey pass should run before the base scan
|
|
|
|
|
|
or after. After makes sense because the base scan's file_categories is useful
|
|
|
|
|
|
survey input. But the base scan itself could be informed by the survey (e.g.
|
|
|
|
|
|
skip certain directories the survey identified as low-value). Probably the right
|
|
|
|
|
|
answer is: survey runs after base scan but before AI dir loops, using base scan
|
|
|
|
|
|
output as input.
|
|
|
|
|
|
|
|
|
|
|
|
The `ask_user` tool is interesting because it inverts the relationship — the
|
|
|
|
|
|
agent asks the human rather than the other way around. This is powerful but
|
|
|
|
|
|
needs careful constraints. The agent should only ask when it's genuinely stuck,
|
|
|
|
|
|
not as a shortcut to avoid investigation. The prompt should require that other
|
|
|
|
|
|
resolution strategies have been exhausted before asking.
|
|
|
|
|
|
|
|
|
|
|
|
Multi-level synthesis (grouping pass) might produce better results than
|
|
|
|
|
|
expected because the grouping agent has a different task than the dir-loop
|
|
|
|
|
|
agents — it's looking for relationships and patterns across summaries rather
|
|
|
|
|
|
than summarizing individual directories. It might surface architectural insights
|
|
|
|
|
|
that none of the dir loops could see individually.
|
|
|
|
|
|
|
|
|
|
|
|
Package vulnerability lookups are potentially the highest signal-to-noise
|
|
|
|
|
|
external tool — structured data, specific to the files present, directly
|
|
|
|
|
|
actionable. Worth implementing before general web search.
|
|
|
|
|
|
|
|
|
|
|
|
The confidence calibration problem is real but maybe not critical to solve
|
|
|
|
|
|
precisely. Even if 0.6 doesn't mean the same thing every time, entries with
|
|
|
|
|
|
confidence below some threshold will still tend to be the more uncertain ones.
|
|
|
|
|
|
Categorical (high/medium/low) is probably fine for the first implementation.
|
|
|
|
|
|
|
|
|
|
|
|
Progressive output and interactive mode are probably the features that would
|
|
|
|
|
|
most change how luminos *feels* to use. The current UX is: run it, wait, get a
|
|
|
|
|
|
report. Progressive output would make it feel like watching someone explore
|
|
|
|
|
|
the codebase in real time. Worth thinking about the UX before the architecture.
|
|
|
|
|
|
|
|
|
|
|
|
There's a version of this tool that goes well beyond file system analysis —
|
|
|
|
|
|
a general-purpose investigative agent that can be pointed at anything (a
|
|
|
|
|
|
directory, a URL, a database, a running process) and produce an intelligence
|
|
|
|
|
|
report. The current architecture is already pointing in that direction. Worth
|
|
|
|
|
|
keeping that possibility in mind when making structural decisions so we don't
|
|
|
|
|
|
close off that path prematurely.
|