chore: update CLAUDE.md for session 3

parent 09e5686bea · commit 0a9afc96c9
2 changed files with 852 additions and 2 deletions

**CLAUDE.md**

```diff
@@ -10,9 +10,9 @@
 ## Current Project State

-- **Phase:** Active development — issue tracking and MCP configured, Phase 1 (confidence tracking) ready to start
+- **Phase:** Active development — Phase 1 (confidence tracking) complete, Phase 2 (survey pass) ready to start
 - **Last worked on:** 2026-04-06
-- **Last commit:** chore: update CLAUDE.md for session 2
+- **Last commit:** merge: feat/issue-3-low-confidence-entries (#3)
 - **Blocking:** None

 ---
@@ -185,5 +185,6 @@ python3 luminos.py --install-extras
 |---|---|---|
 | 1 | 2026-04-06 | Project setup, scan progress output, in-place file display, --exclude flag, Forgejo repo, PLAN.md, wiki, development practices |
 | 2 | 2026-04-06 | Forgejo milestones (9), issues (36), project board, Gitea MCP installed and configured globally |
+| 3 | 2026-04-06 | Phase 1 complete (#1–#3), MCP backend architecture design (Part 10, Phase 3.5), issues #38–#40 opened |

 Full log: wiki — [Session Retrospectives](https://forgejo.labbity.unbiasedgeek.com/archeious/luminos/wiki/SessionRetrospectives)
```

**PLAN.md** (new file, 849 lines):

# Luminos — Evolution Plan

## Core Philosophy

The current design is a **pipeline with AI steps**. Even though it uses an
agent loop, the structure is predetermined:

```
investigate every directory (leaf-first) → synthesize → done
```

The agent executes a fixed sequence. It cannot decide what matters most, cannot
resolve its own uncertainty, cannot go get information it's missing, cannot
adapt its strategy based on what it finds.

The target philosophy is **investigation driven by curiosity**:

```
survey → form hypotheses → investigate to confirm/refute →
resolve uncertainty → update understanding → repeat until satisfied
```

This is how a human engineer actually explores an unfamiliar codebase or file
collection. The agent should decide *what it needs to know* and *how to find
it out* — not execute a predetermined checklist.

Every feature in this plan should be evaluated against that principle.

---
## Part 1: Uncertainty as a First-Class Concept

### The core loop

The agent explicitly tracks confidence at each step. Low confidence is not
noted and moved past — it triggers a resolution strategy.

```
observe something
→ assess confidence
→ if high: cache and continue
→ if low: choose resolution strategy
    → read more local files
    → search externally (web, package registry, docs)
    → ask the user
    → flag as genuinely unknowable with explicit reasoning
→ update understanding
→ continue
```

This should be reflected in the dir loop prompt, the synthesis prompt, and the
refinement prompt. "I don't know" is never an acceptable terminal state if
there are resolution strategies available.

### Confidence tracking in cache entries

Add a `confidence` field (0.0–1.0) to both file and dir cache entries. The
agent sets this when writing cache. Low-confidence entries are candidates for
refinement-pass investigation.

File cache schema addition:
```
confidence: float        # 0.0–1.0, agent's confidence in its summary
confidence_reason: str   # why confidence is low, if below ~0.7
```

The synthesis and refinement passes can use confidence scores to prioritize
what to look at again.
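
A minimal sketch of how a later pass might surface low-confidence entries from the cache (the `CacheEntry` shape and the 0.7 default are illustrative, not the actual `cache.py` API):

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    path: str
    summary: str
    confidence: float            # 0.0–1.0, set by the agent at cache-write time
    confidence_reason: str = ""  # populated when confidence is below ~0.7

def low_confidence(entries, threshold=0.7):
    """Return entries below the confidence threshold, least confident first."""
    return sorted(
        (e for e in entries if e.confidence < threshold),
        key=lambda e: e.confidence,
    )

entries = [
    CacheEntry("src/auth.py", "JWT auth middleware", 0.9),
    CacheEntry("data/batch.avro", "unknown binary format", 0.3, "unrecognized extension"),
    CacheEntry("scripts/run.sh", "launch wrapper", 0.6, "purpose unclear"),
]
print([e.path for e in low_confidence(entries)])  # → ['data/batch.avro', 'scripts/run.sh']
```

Sorting least-confident first gives the refinement pass a natural work queue.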

---

## Part 2: Dynamic Domain Detection

### Why not a hardcoded taxonomy

A fixed domain list (code, documents, data, media, mixed) forces content into
predetermined buckets. Edge cases are inevitable: a medical imaging archive,
a legal discovery collection, a CAD project, a music production session, a
Jupyter notebook repo that's half code and half data. Hardcoded domains require
code changes to handle novel content and will always misclassify ambiguous cases.

More fundamentally: the AI is good at recognizing what something is. Using
rule-based file-type ratios to gate its behavior is fighting the tool's
strengths.

### Survey pass (replaces hardcoded detection)

Before dir loops begin, a lightweight survey pass runs:

**Input**: file type distribution, tree structure (top 2 levels), total counts

**Task**: the survey agent answers three questions:
1. What is this? (plain-language description, no forced taxonomy)
2. What analytical approach would be most useful?
3. Which available tools are relevant and which can be skipped?

**Output** (`submit_survey` tool):
```python
{
    "description": str,       # "a Python web service with Postgres migrations"
    "approach": str,          # how to investigate — what to prioritize, what to skip
    "relevant_tools": [str],  # tools worth using for this content
    "skip_tools": [str],      # tools not useful (e.g. parse_structure for a journal)
    "domain_notes": str,      # anything unusual the dir loops should know
    "confidence": float,      # how clear the signal was
}
```

**Max turns**: 3 — this is a lightweight orientation pass, not a deep investigation.

This output is injected into the dir loop system prompt as context. The dir
loops know what they're looking at before they start. They can also deviate if
they find something the survey missed.

### Tools are always available; the AI selects what's relevant

Rather than gating tools by domain, every tool is offered with a clear
description of what it's for. The AI simply won't call `parse_structure` on
a `.xlsx` file because the description says it works on source files.

This also means new tools are automatically available to all future domains
without any profile configuration.

### What stays rule-based

The file type distribution summary fed into the survey prompt is still computed
from `filetypes.py` — this is cheap and provides useful signal. The difference
is that the AI interprets it instead of a lookup table doing the classification.
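
The schema above could be expressed as an Anthropic-style tool definition so the survey agent "submits" by calling it; a sketch (the description strings and `required` choices are assumptions, not the actual `ai.py` definition):

```python
# Anthropic Messages API tool definition: name, description, and a JSON
# Schema under input_schema. Field names match the survey output schema.
SUBMIT_SURVEY_TOOL = {
    "name": "submit_survey",
    "description": "Submit the orientation survey for the scan target.",
    "input_schema": {
        "type": "object",
        "properties": {
            "description": {"type": "string"},
            "approach": {"type": "string"},
            "relevant_tools": {"type": "array", "items": {"type": "string"}},
            "skip_tools": {"type": "array", "items": {"type": "string"}},
            "domain_notes": {"type": "string"},
            "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        },
        "required": ["description", "approach", "confidence"],
    },
}
```

Defining the output as a tool call (rather than free text) keeps the survey result machine-parseable for injection into the dir loop prompt.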

---

## Part 3: External Knowledge Tools

### The resolution strategy toolkit

When the agent encounters something it doesn't understand, it has options beyond
"read more local files." These are resolution strategies for specific kinds of
uncertainty.

**`web_search(query) → results`**

Use when: an unfamiliar library, file format, API, framework, toolchain, or
naming convention doesn't resolve from local files alone.

Query construction should be evidence-based:
- Seeing `import dramatiq` → `"dramatiq python task queue library"`
- Finding `.avro` files → `"apache avro file format schema"`
- Spotting an unfamiliar config key → `"<framework> <key> configuration"`

Results are summarized before injection into context. Raw search results are
not passed directly — a lightweight extraction pulls the relevant 2–3 sentences.

Budget: configurable max searches per session (default: 10). Logged in the report.

**`fetch_url(url) → content`**

Use when: a local file references a URL that would explain what the project is
(e.g. a README links to documentation, a config references a schema URL, a
package.json has a homepage field).

Constrained to read-only fetches. Content truncated to relevant sections.
Budget: configurable (default: 5 per session).

**`package_lookup(name, ecosystem) → metadata`**

Use when: an import or dependency declaration references an unfamiliar package.

Queries package registries (PyPI, npm, crates.io, pkg.go.dev) for:
- Package description
- Version in use vs. latest
- Known security advisories (if available)
- License

This is more targeted than web search and returns structured data. Particularly
useful for security-relevant analysis.

Budget: generous (default: 30) since queries are cheap and targeted.

**`ask_user(question) → answer`** *(interactive mode only)*

Use when: uncertainty cannot be resolved by any other means.

Examples:
- "I found 40 files with an `.xyz` extension I don't recognize — what format is this?"
- "There are two entry points (server.py and worker.py) — which is the primary one?"
- "This directory appears to contain personal data — should I analyze it or skip it?"

Only triggered when other resolution strategies have been tried or are clearly
not applicable. Gated behind an `--interactive` flag since it blocks execution.

### All external tools are opt-in

A `--no-external` flag disables all network tools (web_search, fetch_url,
package_lookup). Default behavior TBD — arguably external lookups should be
opt-in rather than opt-out, given privacy considerations (see Concerns).
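
The per-tool budgets above could be enforced with a small session-scoped tracker; a sketch (class and method names are illustrative, not existing luminos code):

```python
class ToolBudget:
    """Track per-session call budgets for the external knowledge tools."""

    def __init__(self, limits):
        self.limits = dict(limits)                # tool name → max calls this session
        self.used = {name: 0 for name in limits}

    def try_spend(self, tool):
        """Record one call if budget remains; return False once exhausted."""
        if self.used[tool] >= self.limits[tool]:
            return False
        self.used[tool] += 1
        return True

# Defaults from the plan: 10 searches, 5 fetches, 30 package lookups.
budget = ToolBudget({"web_search": 10, "fetch_url": 5, "package_lookup": 30})
```

The `used` counts double as the "logged in the report" data: they can be dumped verbatim into the final report's tool-usage section.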

---

## Part 4: Investigation Planning

### Survey → plan → execute

Currently, every directory is processed in leaf-first order with equal
resource allocation: a 2-file directory gets the same max_turns as a 50-file
one.

Better: after the survey pass, a planning step decides where to invest depth.

**Planning pass** (`submit_plan` tool):

Input: survey output + full directory tree

Output:
```python
{
    "priority_dirs": [           # investigate these deeply
        {"path": str, "reason": str, "suggested_turns": int}
    ],
    "shallow_dirs": [            # quick pass only
        {"path": str, "reason": str}
    ],
    "skip_dirs": [               # skip entirely (generated, vendored, etc.)
        {"path": str, "reason": str}
    ],
    "investigation_order": str,  # "leaf-first" | "priority-first" | "breadth-first"
    "notes": str,                # anything else the investigation should know
}
```

The orchestrator uses this plan to allocate turns per directory and set the
investigation order. The plan is also saved to cache so resumed investigations
can follow the same strategy.

### Dynamic turn allocation

Replace the fixed `max_turns=14` per directory with a global turn budget the
agent manages. The planning pass allocates turns to directories based on
apparent complexity. The agent can request additional turns mid-investigation
if it hits something unexpectedly complex.

A simple model:
- Global budget = `base_turns_per_dir * dir_count` (e.g. 10 * 20 = 200)
- The planning pass distributes: priority dirs get 15–20, shallow dirs get 5, skip dirs get 0
- The agent can "borrow" turns from its own budget if it needs more
- If the budget runs low, a warning is injected into the prompt
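
The simple model above can be sketched as an allocation function over the plan output. The proportional scale-down when the plan over-allocates the global budget is an assumption — the plan does not specify how over-allocation is handled:

```python
def allocate_turns(plan, base_turns_per_dir=10, default_priority_turns=18, shallow_turns=5):
    """Distribute a global turn budget across directories per the plan output."""
    n_dirs = sum(len(plan[k]) for k in ("priority_dirs", "shallow_dirs", "skip_dirs"))
    budget = base_turns_per_dir * n_dirs          # e.g. 10 * 20 = 200
    alloc = {d["path"]: 0 for d in plan["skip_dirs"]}
    alloc.update({d["path"]: shallow_turns for d in plan["shallow_dirs"]})
    alloc.update(
        {d["path"]: d.get("suggested_turns", default_priority_turns)
         for d in plan["priority_dirs"]}
    )
    # Assumption: if the plan asks for more than the budget, scale down
    # proportionally rather than refusing the plan.
    total = sum(alloc.values())
    if total > budget:
        alloc = {path: turns * budget // total for path, turns in alloc.items()}
    return alloc
```

Mid-investigation "borrowing" would then be a second call against whatever remains of `budget`, which is the part the Known Unknowns section flags as non-trivial.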

---

## Part 5: Scale-Tiered Synthesis

### Why tiers are still needed

Even with better investigation planning and agentic depth control, the synthesis
input problem remains: 300 directory summaries cannot be meaningfully synthesized
in one shot. The output is either truncated, loses fidelity, or both.

Tier classification based on post-loop measurements:

| Tier | dir_count | file_count | Synthesis approach |
|---|---|---|---|
| `small` | < 5 | < 30 | Feed per-file cache entries directly |
| `medium` | 5–30 | 30–300 | Dir summaries (current approach) |
| `large` | 31–150 | 301–1500 | Multi-level synthesis |
| `xlarge` | > 150 | > 1500 | Multi-level + subsystem grouping |

Thresholds are configurable via CLI flags or a config file.

### Small tier: per-file summaries

File cache entries are the most granular, most grounded signal in the system —
written while the AI was actually reading files. For small targets they fit
comfortably in the synthesis context window and produce a richer output than
dir summaries.
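
The table maps directly to a classifier; when the two dimensions disagree, the sketch below assumes the larger one wins (the plan does not specify a tie-break rule):

```python
def classify_tier(dir_count, file_count):
    """Map post-loop measurements to a synthesis tier per the table above.

    Assumption: if dir_count and file_count fall in different tiers,
    the larger (more conservative) tier is used.
    """
    for tier, max_dirs, max_files in [
        ("small", 4, 29),
        ("medium", 30, 300),
        ("large", 150, 1500),
    ]:
        if dir_count <= max_dirs and file_count <= max_files:
            return tier
    return "xlarge"
```

Erring toward the larger tier is the safe direction: it only costs extra grouping passes, whereas under-classifying risks a truncated synthesis.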

### Multi-level synthesis (large/xlarge)

```
dir summaries
  ↓ (grouping pass: dirs → subsystems, AI-identified)
subsystem summaries (3–10 groups)
  ↓ (final synthesis)
report
```

The grouping pass is itself agentic: the AI identifies logical subsystems from
dir summaries, not from directory structure. An `auth/` dir and a
`middleware/session/` dir might end up in the same "Authentication" subsystem.

For xlarge:
```
dir summaries
  ↓ (level-1: dirs → subsystems, 10–30 groups)
  ↓ (level-2: subsystems → domains/layers, 3–8 groups)
  ↓ (final synthesis)
```

### Synthesis depth scales with tier

The synthesis prompt receives explicit depth guidance:

- **small**: "Be concise but specific. Reference actual filenames. 2–3 paragraphs."
- **medium**: "Produce a structured breakdown. Cover purpose, components, concerns."
- **large**: "Produce a thorough architectural analysis with section headers. Be specific."
- **xlarge**: "Produce a comprehensive report. Cover architecture, subsystems, interfaces, cross-cutting concerns, and notable anomalies. Reference actual paths."

---

## Part 6: Hypothesis-Driven Synthesis

### Current approach: aggregation

Synthesis currently aggregates dir summaries into a report. It's descriptive:
"here is what I found in each part."

### Better approach: conclusion with evidence

The synthesis agent should:
1. Form an initial hypothesis about the whole from the dir summaries
2. Look for evidence that confirms or refutes it
3. Consider alternative interpretations
4. Produce a conclusion that reflects the reasoning, not just the observations

This produces output like: *"This appears to be a multi-tenant SaaS backend
(hypothesis) — the presence of tenant_id throughout the schema, separate
per-tenant job queues, and the auth middleware's scope validation all support
this (evidence). The monolith structure suggests it hasn't been decomposed into
services yet (alternative consideration)."*

Rather than: *"The auth directory handles authentication. The jobs directory
handles background jobs. The models directory contains database models."*

The `think` tool already supports this pattern — the synthesis prompt should
explicitly instruct hypothesis formation before `submit_report`.
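
One way to encode this in the synthesis prompt is a fixed instruction appended to the system prompt; the wording below is illustrative, not the actual `prompts.py` text:

```python
# Appended to the synthesis system prompt; exact wording is a sketch.
HYPOTHESIS_INSTRUCTION = (
    "Before calling submit_report: use the think tool to state a hypothesis "
    "about what this target is as a whole, list the evidence from the "
    "directory summaries that supports or refutes it, and consider at least "
    "one alternative interpretation. Your report's conclusions must reflect "
    "that reasoning, not just aggregate the summaries."
)
```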

---

## Part 7: Refinement Pass

### Trigger

`--refine` flag. Off by default.

### What it does

After synthesis, the refinement agent receives:
- The current synthesis output (brief + full analysis)
- All dir and file cache entries, including confidence scores
- The full investigation toolset, including external knowledge tools
- A list of low-confidence cache entries (confidence < 0.7)

It is instructed to:
1. Identify gaps (things not determined from summaries)
2. Identify contradictions (dir summaries that conflict)
3. Identify cross-cutting concerns (patterns spanning multiple dirs)
4. Resolve low-confidence entries
5. Submit an improved report

The refinement agent owns its investigation — it decides what to look at and
in what order, using the full resolution strategy toolkit.

### Multiple passes

`--refine-depth N` runs N refinement passes. Natural stopping condition: the
agent calls `submit_report` without making any file reads or external lookups
(indicating nothing new was found). This can short-circuit before N passes.
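
The stopping condition reduces to a small driver loop; a sketch (the `run_pass` callback shape is illustrative, not the actual `ai.py` API):

```python
def run_refinement(refine_depth, run_pass):
    """Run up to refine_depth passes; stop early when a pass does no new work.

    run_pass executes one refinement pass and returns (report, did_investigate),
    where did_investigate is True if the pass made any file reads or external
    lookups before submitting.
    """
    report = None
    for _ in range(refine_depth):
        report, did_investigate = run_pass(report)
        if not did_investigate:
            break  # agent submitted without reading anything new — converged
    return report
```

Feeding the previous report back into `run_pass` is what makes each pass targeted rather than a fresh investigation.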

### Refinement vs re-investigation

Refinement is targeted — it focuses on specific gaps and uncertainties. It is
not a re-run of the full dir loops. The prompt makes this explicit:
*"Focus on resolving uncertainty, not re-summarizing what is already known."*

---

## Part 8: Report Structure

### Domain-appropriate sections

Instead of fixed `brief` + `detailed` fields, the synthesis produces structured
fields based on what the survey identified. Fields that are absent or empty are
not rendered.

The survey output's `description` shapes which fields are relevant. This is not
a hardcoded domain → schema mapping — the synthesis prompt asks the agent to
populate the fields that are relevant to *this specific content* from a superset
of available fields:

```
Available output fields (populate those relevant to this content):
- brief (always)
- architecture (software projects)
- components (software projects, large document collections)
- tech_stack (software projects)
- entry_points (software projects, CLI tools)
- datasets (data collections)
- schema_summary (data collections, databases)
- period_covered (financial data, journals, time-series)
- themes (document collections, journals)
- data_quality (data collections)
- concerns (any domain)
- overall_purpose (mixed/composite targets)
```

The report formatter renders populated fields with appropriate headers and
skips unpopulated ones. Small, simple targets produce minimal output; large,
complex targets produce full structured reports.
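
The formatter side is a simple filter over the superset; a sketch (the header strings and field subset are illustrative, not the actual `report.py` mapping):

```python
FIELD_HEADERS = {            # subset of the superset above; headers are illustrative
    "brief": "Summary",
    "architecture": "Architecture",
    "tech_stack": "Tech Stack",
    "themes": "Themes",
    "concerns": "Concerns",
}

def render_report(fields):
    """Render populated fields with headers, in a stable order; skip empty ones."""
    sections = [
        f"## {header}\n\n{fields[key]}"
        for key, header in FIELD_HEADERS.items()
        if fields.get(key)   # absent or empty fields are not rendered
    ]
    return "\n\n".join(sections)
```

Keeping the order in the mapping (rather than in the agent's output) guarantees a stable section order across runs.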

### Progressive output (future)

Rather than one report at the end, stream findings as the investigation
proceeds. The user sees the agent's understanding build in real time. This
converts luminos from a batch tool into an interactive investigation partner.

Requires a streaming-aware output layer — a significant architectural change,
probably not Phase 1.

---

## Part 9: Parallel Investigation

### For large targets

Multiple dir-loop agents investigate different subsystems concurrently, then
report to a coordinator. The coordinator synthesizes their findings and
identifies cross-cutting concerns.

This requires:
- A coordinator agent that owns the investigation plan
- Worker agents scoped to subsystems
- A shared cache that workers write to concurrently (needs locking or an
  append-only design)
- A merge step in the coordinator before synthesis

Significant complexity. Probably deferred until single-agent investigation
quality is high. The main benefit is speed, not quality — worth revisiting when
the investigation quality ceiling has been reached.

---

## Part 10: MCP Backend Abstraction

### Why

The investigation loop (survey → plan → investigate → synthesize) is
generic. The filesystem-specific parts — how to list a directory, read
a file, parse structure — are an implementation detail. Abstracting
the backend via MCP decouples the two and makes luminos extensible to
any exploration target: websites, wikis, databases, running processes.

This pivot also serves the project's learning goal. Migrating working
code into an agentic framework is a common and painful real-world task.
Building it clean from the start teaches the pattern; migrating teaches
*why* the pattern exists. The migration pain is intentional.

### The model

Each exploration target is an MCP server. Luminos is an MCP client.
The investigation loop connects to a server at startup, discovers its
tools, passes them to the Anthropic API, and forwards tool calls to
the server at runtime.

```
luminos (MCP client)
  ↓ connects to
filesystem MCP server | process MCP server | wiki MCP server | ...
  ↓ exposes tools
read_file, list_dir, parse_structure, ...
  ↓ passed to
Anthropic API (agent calls them)
  ↓ forwarded back to
MCP server (executes, returns result)
```

The filesystem MCP server is the default. `--mcp <uri>` selects
an alternative server.

### What changes

- `ai.py` tool dispatch: instead of calling local Python functions,
  forward to the connected MCP server
- Tool definitions: dynamically discovered from the server via
  `tools/list`, not hardcoded in `ai.py`
- New `luminos_lib/mcp_client.py`: thin MCP client (stdio transport)
- New `luminos_mcp/filesystem.py`: MCP server wrapping existing
  filesystem tools (`read_file`, `list_dir`, `parse_structure`,
  `run_command`, `stat_file`)
- `--mcp` CLI flag for selecting a non-default server
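
The dynamic-discovery step is largely a field-name translation: MCP's `tools/list` reports each tool as `name`/`description`/`inputSchema`, while the Anthropic API expects `name`/`description`/`input_schema`. A sketch of that mapping (the helper name is illustrative):

```python
def mcp_tools_to_anthropic(mcp_tools):
    """Convert MCP tools/list entries into Anthropic API tool definitions.

    Field names follow the two public specs (MCP uses camelCase inputSchema,
    Anthropic uses snake_case input_schema); the defaults are assumptions.
    """
    return [
        {
            "name": t["name"],
            "description": t.get("description", ""),
            "input_schema": t.get("inputSchema", {"type": "object"}),
        }
        for t in mcp_tools
    ]
```

With this in place, the dir loop never needs a hardcoded tool list: whatever the connected server advertises is what the agent sees.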

### What does not change

Cache storage, confidence tracking, survey/planning/synthesis passes,
token tracking, cost reporting, all prompts. None of these know or
care what backend provided the data.

### Known tensions

**The tree assumption.** The investigation loop assumes hierarchical
containers. Non-filesystem backends (websites, processes) must present
a virtual tree or the traversal model breaks. This is the MCP server's
problem to solve, not luminos's — but it is real design work.

**Tool count.** If multiple MCP servers are connected simultaneously
(filesystem + web search + package lookup), the tool count grows. More
tools degrade agent decision quality. Keep each server focused.

**The filesystem backend is a demotion.** Currently filesystem
investigation is native — zero overhead. Making it an MCP server adds
process-launch overhead. Acceptable given that API call latency already
dominates, but worth knowing.

**Phase 4 becomes MCP servers.** After the pivot, web_search,
fetch_url, and package_lookup are natural candidates to implement as
MCP servers rather than hardcoded Python functions. Phase 4 and the
MCP pattern reinforce each other.

### Timing

After Phase 3, before Phase 4. At that point survey + planning +
dir loops + synthesis are all working with filesystem assumptions
baked in — enough surface area to make the migration instructive
without 9 phases of rework.

---

## Implementation Order

### Phase 1 — Confidence tracking
- Add `confidence` + `confidence_reason` to cache schemas
- Update dir loop prompt to set confidence when writing cache
- No behavior change yet — just instrumentation

### Phase 2 — Survey pass
- New `_run_survey()` function in `ai.py`
- `submit_survey` tool definition
- `_SURVEY_SYSTEM_PROMPT` in `prompts.py`
- Wire into `_run_investigation()` before dir loops
- Survey output injected into dir loop system prompt

### Phase 3 — Investigation planning
- Planning pass after survey, before dir loops
- `submit_plan` tool
- Dynamic turn allocation based on plan
- Dir loop orchestrator updated to follow plan

### Phase 3.5 — MCP backend abstraction (pivot point)
See Part 10 for the full design. This phase happens *after* Phase 3 is
working and *before* Phase 4. The goal is to migrate the filesystem
investigation into an MCP server/client model before adding more
backends or external tools.

- Extract filesystem tools (`read_file`, `list_dir`, `parse_structure`,
  `run_command`, `stat_file`) into a standalone MCP server
- Refactor `ai.py` into an MCP client: discover tools dynamically,
  forward tool calls to the server, return results to the agent
- Replace the hardcoded tool list in the dir loop with dynamic tool
  discovery from the connected MCP server
- Keep the filesystem MCP server as the default; the `--mcp` flag selects
  alternative servers
- No behavior change to the investigation loop — purely structural

**Learning goal:** experience migrating working code into an MCP
architecture. The migration pain is intentional and instructive.

### Phase 4 — External knowledge tools
- `web_search` tool + implementation (requires optional dep: search API client)
- `package_lookup` tool + implementation (HTTP to package registries)
- `fetch_url` tool + implementation
- `--no-external` flag to disable network tools
- Budget tracking and logging

### Phase 5 — Scale-tiered synthesis
- Sizing measurement after dir loops
- Tier classification
- Small tier: switch synthesis input to file cache entries
- Depth instructions in synthesis prompt

### Phase 6 — Multi-level synthesis
- Grouping pass + `submit_grouping` tool
- Final synthesis receives subsystem summaries at large/xlarge tier
- Two-level grouping for xlarge

### Phase 7 — Hypothesis-driven synthesis
- Update synthesis prompt to require hypothesis formation before `submit_report`
- `think` tool made available in synthesis (currently restricted)

### Phase 8 — Refinement pass
- `--refine` flag + `_run_refinement()`
- Refinement uses confidence scores to prioritize
- `--refine-depth N`

### Phase 9 — Dynamic report structure
- Superset output fields in the synthesis `submit_report` schema
- Report formatter renders populated fields only
- Domain-appropriate section headers

---

## File Map

| File | Changes |
|---|---|
| `luminos_lib/domain.py` | **new** — survey pass, plan pass, profile-free detection |
| `luminos_lib/prompts.py` | survey prompt, planning prompt, refinement prompt, updated dir/synthesis prompts |
| `luminos_lib/ai.py` | survey, planning, external tools, tiered synthesis, multi-level grouping, refinement, confidence-aware cache writes |
| `luminos_lib/cache.py` | confidence fields in schemas, low-confidence query |
| `luminos_lib/report.py` | dynamic field rendering, domain-appropriate sections |
| `luminos.py` | `--refine`, `--no-external`, `--refine-depth` flags; wire survey into scan |
| `luminos_lib/search.py` | **new** — web_search, fetch_url, package_lookup implementations |

No changes needed to: `tree.py`, `filetypes.py`, `code.py`, `recency.py`,
`disk.py`, `capabilities.py`, `watch.py`, `ast_parser.py`

---

## Known Unknowns

**Search API choice**
Web search requires an API (Brave Search, Serper, SerpAPI, DuckDuckGo, etc.).
Each has different pricing, rate limits, result quality, and privacy
implications. Which one to use, whether to require an API key, and what the
fallback is when no key is configured — all undecided. Could support multiple
backends with a configurable preference.

**Package registry coverage**
`package_lookup` needs to handle PyPI, npm, crates.io, pkg.go.dev, Maven,
RubyGems, and NuGet at minimum. Each has a different API shape. There is a
coverage gap for less common ecosystems (Hex for Elixir, Hackage for Haskell,
etc.) — the agent will get no lookup result and must fall back to web search.

**Search result summarization**
Raw search results can't be injected directly into context — they're too long
and too noisy. A summarization step is needed. Options: another AI call (adds
latency and cost), regex extraction (fragile), or a lightweight extraction
heuristic. The right approach is unclear.

**Turn budget arithmetic**
Dynamic turn allocation sounds clean in theory. In practice: how does the
agent "request more turns"? The orchestrator has to interrupt the loop,
check the global budget, and decide whether to grant more. This requires
mid-loop communication that doesn't exist today. Implementation complexity
is non-trivial.

**Cache invalidation on strategy changes**
If a user re-runs with different flags (`--refine`, `--no-external`, a new
`--exclude` list), the existing cache entries may have been produced under a
different investigation strategy. Should they be invalidated? Currently
`--fresh` is the only mechanism. A smarter approach would store the
investigation parameters in cache metadata and detect mismatches.
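
Mismatch detection could hash the strategy-shaping parameters into the cache metadata; a sketch (which parameters to include is the open design question):

```python
import hashlib
import json

def strategy_fingerprint(params):
    """Stable hash of the investigation parameters that shaped cache entries."""
    canonical = json.dumps(params, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# On scan start: compare against the fingerprint stored in cache metadata;
# a mismatch means the entries were produced under a different strategy.
old = strategy_fingerprint({"refine": False, "exclude": ["node_modules"]})
new = strategy_fingerprint({"refine": True, "exclude": ["node_modules"]})
assert old != new
```

A mismatch need not force full invalidation — it could merely downgrade entry confidence, folding this problem into the existing refinement machinery.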

**Confidence calibration**
Asking the agent to self-report confidence (0.0–1.0) is only useful if the
numbers are meaningful and consistent. LLMs are known to be poorly calibrated
on confidence: a 0.6 from one run may not mean the same as a 0.6 from another.
This may need to be a categorical signal (high/medium/low) rather than numeric
to be reliable in practice.
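
A middle ground is to keep the numeric field in the schema but consume it only through coarse buckets; a sketch (the 0.8/0.5 cut points are assumptions):

```python
def confidence_bucket(score):
    """Collapse a self-reported 0.0–1.0 score into a coarser categorical signal.

    Cut points are illustrative; the point is that downstream logic compares
    buckets, never raw floats, so run-to-run miscalibration matters less.
    """
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"
```

This keeps the cache schema stable while making the refinement trigger (everything below "high", say) robust to small calibration drift.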

**Context window growth with external tools**
Each web search result, package lookup, and URL fetch adds to the context
window for that dir loop. For a directory with many unknown dependencies, the
context could grow large enough to trigger the budget early exit. How external
tool results are managed in context needs thought — perhaps they are summarized
and then discarded from the messages after being processed.

**`ask_user` blocking behavior**
Interactive mode with `ask_user` would block execution waiting for input. This
is fine in a terminal session but incompatible with piped output, scripted use,
or running luminos as a subprocess. Needs a clear mode distinction and graceful
degradation when input is not a TTY.

**Survey pass quality on tiny targets**
For a target with 3 files, the survey pass adds an API call that may cost more
than it's worth. There should be a minimum size threshold below which the
survey is skipped and a generic approach is used.

**Parallel investigation complexity**
Concurrent dir-loop agents writing to a shared cache introduce race conditions.
The current `_CacheManager` writes files directly with no locking. This would
need to be addressed before parallel investigation is viable.

---

## Additional Suggestions
|
||||
|
||||
**Config file**

Many settings that are currently hardcoded (turn budget, tier thresholds,
search budget, confidence threshold for refinement) should be
user-configurable without CLI flags. A `luminos.toml` in the target directory
or `~/.config/luminos/` would allow project-specific and user-specific
defaults.

**Structured logging**

The `[AI]` stderr output is useful but informal. A structured log (JSONL file
alongside the cache) would allow post-hoc analysis of investigation quality:
which dirs used the most turns, which triggered web searches, which had low
confidence, where budget pressure hit. This also enables future tooling on top
of luminos investigations.

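A minimal event writer would be enough to start. A sketch; the event kinds and
field names are illustrative:

```python
import json
import time

class InvestigationLog:
    """Append structured events to a JSONL file alongside the cache."""

    def __init__(self, path: str):
        self.path = path

    def event(self, kind: str, **fields) -> None:
        record = {"ts": time.time(), "event": kind, **fields}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
```

Usage would look like `log.event("dir_done", path="src/", turns_used=7,
confidence="low")`, and post-hoc analysis is a one-liner with `jq` or pandas.
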
**Investigation replay**

The cache already stores summaries but not the investigation trace (what the
agent read, in what order, what it decided to skip). Storing the full message
history per directory would allow replaying or auditing an investigation. Cost:
storage. Benefit: debuggability, ability to resume investigations more
faithfully.

**Watch mode + incremental investigation**

Watch mode currently re-runs the full base scan on changes. For AI-augmented
watch mode: detect which directories changed, re-investigate only those, and
patch the cache entries. The synthesis would then re-run from the updated cache
without re-investigating unchanged directories.

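Change detection could compare per-directory fingerprints between runs. A
sketch based on file names, sizes, and mtimes (content hashing would be more
robust but slower):

```python
import os

def dir_fingerprints(root: str) -> dict:
    """Map each directory under root to a fingerprint of its direct
    children's names, sizes, and modification times."""
    fps = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        entries = []
        for name in sorted(filenames):
            st = os.stat(os.path.join(dirpath, name))
            entries.append((name, st.st_size, st.st_mtime_ns))
        fps[os.path.relpath(dirpath, root)] = hash(tuple(entries))
    return fps

def changed_dirs(old: dict, new: dict) -> set:
    """Directories to re-investigate: added, removed, or fingerprint-changed."""
    return {d for d in old.keys() | new.keys() if old.get(d) != new.get(d)}
```

The watch loop would store the previous run's fingerprints next to the cache
and re-investigate only the directories in `changed_dirs`.
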
**Optional PDF and Office document readers**

The data and documents domains would benefit from native content extraction:

- `pdfminer` or `pypdf` for PDF text extraction
- `openpyxl` for Excel schema and sheet enumeration
- `python-docx` for Word document text

These would be optional deps like the existing AI deps, gated behind
`--install-extras`. The agent currently can only see filename and size for
these formats.

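The usual optional-import pattern would keep these soft dependencies. A sketch
(which PDF library to pick is undecided, and `pdf_text` is a hypothetical
helper, not existing code):

```python
import importlib

def optional_import(name: str):
    """Return the module if installed, else None, so callers can feature-gate.
    The agent falls back to filename/size-only metadata when a reader is absent."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

pypdf = optional_import("pypdf")

def pdf_text(path: str):
    if pypdf is None:
        return None  # reader not installed; caller uses metadata only
    return "\n".join(page.extract_text() or ""
                     for page in pypdf.PdfReader(path).pages)
```
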
**Security-focused analysis mode**

A `--security` flag could tune the investigation toward security-relevant
findings: dependency vulnerability scanning, hardcoded secrets detection,
permission issues, exposed configuration, insecure patterns. The flag would
bias the survey, dir loop prompts, and synthesis toward these concerns and
expand the flags output with severity-ranked security findings.

**Output formats**

The current report is terminal-formatted text or JSON. Additional formats worth
considering:

- Markdown (for saving to wikis, Notion, Obsidian)
- HTML (self-contained report with collapsible sections)
- SARIF (for security findings — integrates with GitHub Code Scanning)

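A SARIF export could be little more than a mapping from luminos flags to SARIF
`results`. A minimal sketch of the SARIF 2.1.0 envelope (the finding fields
`message`, `path`, and `level` are an assumed internal shape):

```python
import json

def to_sarif(findings) -> str:
    """Wrap findings (dicts with `message`, `path`, `level`) in a minimal
    SARIF 2.1.0 envelope. The field mapping here is illustrative."""
    return json.dumps({
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": "luminos"}},
            "results": [
                {
                    "level": f.get("level", "warning"),
                    "message": {"text": f["message"]},
                    "locations": [{
                        "physicalLocation": {
                            "artifactLocation": {"uri": f["path"]}
                        }
                    }],
                }
                for f in findings
            ],
        }],
    }, indent=2)
```
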
**Model selection**

The model is hardcoded to `claude-sonnet-4-20250514`. The survey and planning
passes are lightweight enough to use a faster/cheaper model (Haiku). The dir
loops and synthesis warrant Sonnet or better. The refinement pass might benefit
from Opus for difficult cases. A `--model` flag and per-pass model configuration
would allow cost/quality tradeoffs.

---

## Concerns

**Cost at scale**

Adding a survey pass, planning pass, external tool lookups, and multiple
refinement passes significantly increases API call count and token consumption.
A large repo run with `--refine` could easily cost several dollars. The current
cost reporting (total tokens at the end) may not be sufficient — users need to
understand cost before committing to a long run. Consider an `--estimate` mode
that projects cost from the base scan without running AI.

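The projection itself is simple arithmetic over base-scan counts. A sketch in
which every number is a placeholder assumption, to be replaced with measured
per-run averages and real pricing:

```python
def estimate_cost(dir_count: int, avg_turns: int = 8,
                  tokens_per_turn: int = 3000,
                  usd_per_mtok: float = 3.0):
    """Project token usage and cost before running AI.

    avg_turns, tokens_per_turn, and usd_per_mtok are all placeholder
    assumptions, not measured data or actual API pricing.
    """
    tokens = dir_count * avg_turns * tokens_per_turn
    return tokens, tokens / 1_000_000 * usd_per_mtok
```

Printing this after the base scan and asking for confirmation would let users
bail out before a long run commits them to the spend.
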
**Privacy and external lookups**

Web searches and URL fetches send information about the target's contents to
external services. For a personal journal or proprietary codebase this could
be a significant privacy concern. The `--no-external` flag addresses this, but
it should probably be the *default* for sensitive-looking content (PII detected
in filenames, etc.), not something the user has to know to enable.

**Prompt injection via file contents**

`read_file` passes raw file contents into the context. A malicious file in the
target directory could contain prompt injection attempts. The current system has
no sanitization. This is an existing concern that grows as the agent gains more
capabilities (web search, URL fetch, package lookup — all of which could
theoretically be manipulated by a crafted file).

**Reliability of self-reported confidence**

The confidence tracking system depends on the agent accurately reporting its
own uncertainty. If the agent is systematically over-confident (as LLMs tend
to be), the refinement pass will never trigger on the cases where it's most
needed. The system should have a skeptical prior — defaulting to low confidence
for unfamiliar file types, missing READMEs, and ambiguous structures.

**Investigation quality regression risk**

Each new pass (survey, planning, refinement) adds opportunities for the
investigation to go wrong. A bad survey misleads all subsequent dir loops. A
bad plan wastes turns on shallow directories and skips critical ones. The system
needs quality signals — probably the confidence scores aggregated across the
investigation — to detect when something went wrong and potentially retry.

**Watch mode compatibility**

Several of the planned features (survey pass, planning, external tools) are not
designed for incremental re-use in watch mode. Adding AI capability to watch
mode is a separate design problem that deserves its own thinking.

**Turn budget contention**

If the planning pass allocates turns and the agent borrows from its budget when
it needs more, there's a risk of runaway investigation on unexpectedly complex
directories. Needs a hard ceiling (global max tokens, not just per-dir turns)
as a backstop.

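The backstop could be a single counter checked before every API call. A
sketch:

```python
class TokenCeiling:
    """Hard global cap on total tokens, independent of per-dir turn budgets.
    Borrowing turns is allowed; exceeding the ceiling never is."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record tokens consumed by one API call."""
        self.used += tokens

    @property
    def exhausted(self) -> bool:
        return self.used >= self.max_tokens
```

Before each call the dir loop would check `ceiling.exhausted` and, if true,
stop investigating and synthesize from whatever summaries exist so far.
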
---

## Raw Thoughts

The investigation planning idea is conceptually appealing but has a
chicken-and-egg problem: you need to know what's in the directories to plan how
to investigate them, but you haven't investigated yet. The survey pass helps,
but it's shallow. Maybe the first pass through each directory should be a cheap
orientation (list contents, read one file) that feeds the plan before the full
investigation starts. Two-phase dir investigation: orient, then investigate.

The hypothesis-driven synthesis is probably the highest-leverage change in this
whole plan. The current synthesis produces descriptive output; hypothesis-driven
synthesis produces analytical output. The prompt change is small, but the output
quality difference could be significant.

Web search feels like it should be a last resort, not an early one. The agent
should exhaust local investigation before reaching for external sources. The
prompt should reflect this: "Only search if you cannot determine this from the
files available."

There's a question of whether the survey pass should run before the base scan
or after. After makes sense because the base scan's file_categories is useful
survey input. But the base scan itself could be informed by the survey (e.g.
skipping directories the survey identified as low-value). The right answer is
probably: survey runs after the base scan but before the AI dir loops, using
base scan output as input.

The `ask_user` tool is interesting because it inverts the relationship — the
agent asks the human rather than the other way around. This is powerful but
needs careful constraints. The agent should only ask when it's genuinely stuck,
not as a shortcut to avoid investigation. The prompt should require that other
resolution strategies have been exhausted before asking.

Multi-level synthesis (grouping pass) might produce better results than
expected because the grouping agent has a different task than the dir-loop
agents — it's looking for relationships and patterns across summaries rather
than summarizing individual directories. It might surface architectural insights
that none of the dir loops could see individually.

Package vulnerability lookups are potentially the highest signal-to-noise
external tool — structured data, specific to the files present, directly
actionable. Worth implementing before general web search.

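OSV.dev offers exactly this shape of structured lookup. A sketch of the query
(the endpoint and payload schema follow OSV's public `v1/query` API; wiring it
into the tool loop, and respecting `--no-external`, is left out):

```python
import json
import urllib.request

OSV_QUERY = "https://api.osv.dev/v1/query"

def osv_payload(name: str, version: str, ecosystem: str = "PyPI") -> dict:
    """Build an OSV.dev query for one pinned dependency."""
    return {"version": version,
            "package": {"name": name, "ecosystem": ecosystem}}

def known_vulns(name: str, version: str, ecosystem: str = "PyPI") -> list:
    """Query OSV for known vulnerabilities; returns the `vulns` list
    (possibly empty). Callers must check --no-external before calling."""
    req = urllib.request.Request(
        OSV_QUERY,
        data=json.dumps(osv_payload(name, version, ecosystem)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("vulns", [])
```

Because the input is a pinned name/version pair and the output is structured
vulnerability records, there is far less room for noise than with free-text web
search.
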
The confidence calibration problem is real but maybe not critical to solve
precisely. Even if 0.6 doesn't mean the same thing every time, entries with
confidence below some threshold will still tend to be the more uncertain ones.
Categorical (high/medium/low) is probably fine for the first implementation.

Progressive output and interactive mode are probably the features that would
most change how luminos *feels* to use. The current UX is: run it, wait, get a
report. Progressive output would make it feel like watching someone explore
the codebase in real time. Worth thinking about the UX before the architecture.

There's a version of this tool that goes well beyond file system analysis —
a general-purpose investigative agent that can be pointed at anything (a
directory, a URL, a database, a running process) and produce an intelligence
report. The current architecture is already pointing in that direction. Worth
keeping that possibility in mind when making structural decisions so we don't
close off that path prematurely.