From 0a9afc96c9fffd4deeee2f4a3e6a996cd2be3ab7 Mon Sep 17 00:00:00 2001 From: Jeff Smith Date: Mon, 6 Apr 2026 21:15:27 -0600 Subject: [PATCH] chore: update CLAUDE.md for session 3 --- CLAUDE.md | 5 +- PLAN.md | 849 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 852 insertions(+), 2 deletions(-) create mode 100644 PLAN.md diff --git a/CLAUDE.md b/CLAUDE.md index 74f655c..6d1eb22 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -10,9 +10,9 @@ ## Current Project State -- **Phase:** Active development — issue tracking and MCP configured, Phase 1 (confidence tracking) ready to start +- **Phase:** Active development — Phase 1 (confidence tracking) complete, Phase 2 (survey pass) ready to start - **Last worked on:** 2026-04-06 -- **Last commit:** chore: update CLAUDE.md for session 2 +- **Last commit:** merge: feat/issue-3-low-confidence-entries (#3) - **Blocking:** None --- @@ -185,5 +185,6 @@ python3 luminos.py --install-extras |---|---|---| | 1 | 2026-04-06 | Project setup, scan progress output, in-place file display, --exclude flag, Forgejo repo, PLAN.md, wiki, development practices | | 2 | 2026-04-06 | Forgejo milestones (9), issues (36), project board, Gitea MCP installed and configured globally | +| 3 | 2026-04-06 | Phase 1 complete (#1–#3), MCP backend architecture design (Part 10, Phase 3.5), issues #38–#40 opened | Full log: wiki — [Session Retrospectives](https://forgejo.labbity.unbiasedgeek.com/archeious/luminos/wiki/SessionRetrospectives) diff --git a/PLAN.md b/PLAN.md new file mode 100644 index 0000000..fdab07b --- /dev/null +++ b/PLAN.md @@ -0,0 +1,849 @@ +# Luminos — Evolution Plan + +## Core Philosophy + +The current design is a **pipeline with AI steps**. Even though it uses an +agent loop, the structure is predetermined: + +``` +investigate every directory (leaf-first) → synthesize → done +``` + +The agent executes a fixed sequence. It cannot decide what matters most, cannot +resolve its own uncertainty, cannot go get information it's missing, cannot +adapt its strategy based on what it finds. + +The target philosophy is **investigation driven by curiosity**: + +``` +survey → form hypotheses → investigate to confirm/refute → +resolve uncertainty → update understanding → repeat until satisfied +``` + +This is how a human engineer actually explores an unfamiliar codebase or file +collection. The agent should decide *what it needs to know* and *how to find +it out* — not execute a predetermined checklist. + +Every feature in this plan should be evaluated against that principle. + +--- + +## Part 1: Uncertainty as a First-Class Concept + +### The core loop + +The agent explicitly tracks confidence at each step. Low confidence is not +noted and moved past — it triggers a resolution strategy. + +``` +observe something + → assess confidence + → if high: cache and continue + → if low: choose resolution strategy + → read more local files + → search externally (web, package registry, docs) + → ask the user + → flag as genuinely unknowable with explicit reasoning + → update understanding + → continue +``` + +This should be reflected in the dir loop prompt, the synthesis prompt, and the +refinement prompt. "I don't know" is never an acceptable terminal state if +there are resolution strategies available. + +### Confidence tracking in cache entries + +Add a `confidence` field (0.0–1.0) to both file and dir cache entries. The +agent sets this when writing cache. Low-confidence entries are candidates for +refinement-pass investigation. + +File cache schema addition: +``` +confidence: float # 0.0–1.0, agent's confidence in its summary +confidence_reason: str # why confidence is low, if below ~0.7 +``` + +The synthesis and refinement passes can use confidence scores to prioritize +what to look at again. + +--- + +## Part 2: Dynamic Domain Detection + +### Why not a hardcoded taxonomy + +A fixed domain list (code, documents, data, media, mixed) forces content into +predetermined buckets. Edge cases are inevitable: a medical imaging archive, +a legal discovery collection, a CAD project, a music production session, a +Jupyter notebook repo that's half code and half data. Hardcoded domains require +code changes to handle novel content and will always mis-classify ambiguous cases. + +More fundamentally: the AI is good at recognizing what something is. Using +rule-based file-type ratios to gate its behavior is fighting the tool's +strengths. + +### Survey pass (replaces hardcoded detection) + +Before dir loops begin, a lightweight survey pass runs: + +**Input**: file type distribution, tree structure (top 2 levels), total counts + +**Task**: the survey agent answers three questions: +1. What is this? (plain language description, no forced taxonomy) +2. What analytical approach would be most useful? +3. Which available tools are relevant and which can be skipped? + +**Output** (`submit_survey` tool): +```python +{ + "description": str, # "a Python web service with Postgres migrations" + "approach": str, # how to investigate — what to prioritize, what to skip + "relevant_tools": [str], # tools worth using for this content + "skip_tools": [str], # tools not useful (e.g. parse_structure for a journal) + "domain_notes": str, # anything unusual the dir loops should know + "confidence": float, # how clear the signal was +} +``` + +**Max turns**: 3 — this is a lightweight orientation pass, not a deep investigation. + +This output is injected into the dir loop system prompt as context. The dir +loops know what they're looking at before they start. They can also deviate if +they find something the survey missed. + +### Tools are always available, AI selects what's relevant + +Rather than gating tools by domain, every tool is offered with a clear +description of what it's for. The AI simply won't call `parse_structure` on +a `.xlsx` file because the description says it works on source files. + +This also means new tools are automatically available to all future domains +without any profile configuration. + +### What stays rule-based + +The file type distribution summary fed into the survey prompt is still computed +from `filetypes.py` — this is cheap and provides useful signal. The difference +is that the AI interprets it rather than a lookup table. + +--- + +## Part 3: External Knowledge Tools + +### The resolution strategy toolkit + +When the agent encounters something it doesn't understand, it has options beyond +"read more local files." These are resolution strategies for specific kinds of +uncertainty. + +**`web_search(query) → results`** + +Use when: unfamiliar library, file format, API, framework, toolchain, naming +convention that doesn't resolve from local files alone. + +Query construction should be evidence-based: +- Seeing `import dramatiq` → `"dramatiq python task queue library"` +- Finding `.avro` files → `"apache avro file format schema"` +- Spotting unfamiliar config key → `" configuration"` + +Results are summarized before injection into context. Raw search results are +not passed directly — a lightweight extraction pulls the relevant 2-3 sentences. + +Budget: configurable max searches per session (default: 10). Logged in report. + +**`fetch_url(url) → content`** + +Use when: a local file references a URL that would explain what the project is +(e.g. a README links to documentation, a config references a schema URL, a +package.json has a homepage field). + +Constrained to read-only fetches. Content truncated to relevant sections. +Budget: configurable (default: 5 per session). + +**`package_lookup(name, ecosystem) → metadata`** + +Use when: an import or dependency declaration references an unfamiliar package. + +Queries package registries (PyPI, npm, crates.io, pkg.go.dev) for: +- Package description +- Version in use vs latest +- Known security advisories (if available) +- License + +This is more targeted than web search and returns structured data. Particularly +useful for security-relevant analysis. + +Budget: generous (default: 30) since queries are cheap and targeted. + +**`ask_user(question) → answer`** *(interactive mode only)* + +Use when: uncertainty cannot be resolved by any other means. + +Examples: +- "I found 40 files with `.xyz` extension I don't recognize — what format is this?" +- "There are two entry points (server.py and worker.py) — which is the primary one?" +- "This directory appears to contain personal data — should I analyze it or skip it?" + +Only triggered when other resolution strategies have been tried or are clearly +not applicable. Gated behind an `--interactive` flag since it blocks execution. + +### All external tools are opt-in + +`--no-external` flag disables all network tools (web_search, fetch_url, +package_lookup). Default behavior TBD — arguably external lookups should be +opt-in rather than opt-out given privacy considerations (see Concerns). + +--- + +## Part 4: Investigation Planning + +### Survey → plan → execute + +Currently: every directory is processed in leaf-first order with equal +resource allocation. A 2-file directory gets the same max_turns as a 50-file +one. + +Better: after the survey pass, a planning step decides where to invest depth. + +**Planning pass** (`submit_plan` tool): + +Input: survey output + full directory tree + +Output: +```python +{ + "priority_dirs": [ # investigate these deeply + {"path": str, "reason": str, "suggested_turns": int} + ], + "shallow_dirs": [ # quick pass only + {"path": str, "reason": str} + ], + "skip_dirs": [ # skip entirely (generated, vendored, etc.) + {"path": str, "reason": str} + ], + "investigation_order": str, # "leaf-first" | "priority-first" | "breadth-first" + "notes": str, # anything else the investigation should know +} +``` + +The orchestrator uses this plan to allocate turns per directory and set +investigation order. The plan is also saved to cache so resumed investigations +can follow the same strategy. + +### Dynamic turn allocation + +Replace fixed `max_turns=14` per directory with a global turn budget the agent +manages. The planning pass allocates turns to directories based on apparent +complexity. The agent can request additional turns mid-investigation if it hits +something unexpectedly complex. + +A simple model: +- Global budget = `base_turns_per_dir * dir_count` (e.g. 10 * 20 = 200) +- Planning pass distributes: priority dirs get 15-20, shallow dirs get 5, skip dirs get 0 +- Agent can "borrow" turns from its own budget if it needs more +- If budget runs low, a warning is injected into the prompt + +--- + +## Part 5: Scale-Tiered Synthesis + +### Why tiers are still needed + +Even with better investigation planning and agentic depth control, the synthesis +input problem remains: 300 directory summaries cannot be meaningfully synthesized +in one shot. The output is either truncated, loses fidelity, or both. + +Tier classification based on post-loop measurements: + +| Tier | dir_count | file_count | Synthesis approach | +|---|---|---|---| +| `small` | < 5 | < 30 | Feed per-file cache entries directly | +| `medium` | 5–30 | 30–300 | Dir summaries (current approach) | +| `large` | 31–150 | 301–1500 | Multi-level synthesis | +| `xlarge` | > 150 | > 1500 | Multi-level + subsystem grouping | + +Thresholds configurable via CLI flags or config file. + +### Small tier: per-file summaries + +File cache entries are the most granular, most grounded signal in the system — +written while the AI was actually reading files. For small targets they fit +comfortably in the synthesis context window and produce a richer output than +dir summaries. + +### Multi-level synthesis (large/xlarge) + +``` +dir summaries + ↓ (grouping pass: dirs → subsystems, AI-identified) +subsystem summaries (3–10 groups) + ↓ (final synthesis) +report +``` + +The grouping pass is itself agentic: the AI identifies logical subsystems from +dir summaries, not from directory structure. An `auth/` dir and a +`middleware/session/` dir might end up in the same "Authentication" subsystem. + +For xlarge: +``` +dir summaries + ↓ (level-1: dirs → subsystems, 10–30 groups) + ↓ (level-2: subsystems → domains/layers, 3–8 groups) + ↓ (final synthesis) +``` + +### Synthesis depth scales with tier + +The synthesis prompt receives explicit depth guidance: + +- **small**: "Be concise but specific. Reference actual filenames. 2–3 paragraphs." +- **medium**: "Produce a structured breakdown. Cover purpose, components, concerns." +- **large**: "Produce a thorough architectural analysis with section headers. Be specific." +- **xlarge**: "Produce a comprehensive report. Cover architecture, subsystems, interfaces, cross-cutting concerns, and notable anomalies. Reference actual paths." + +--- + +## Part 6: Hypothesis-Driven Synthesis + +### Current approach: aggregation + +Synthesis currently aggregates dir summaries into a report. It's descriptive: +"here is what I found in each part." + +### Better approach: conclusion with evidence + +The synthesis agent should: +1. Form an initial hypothesis about the whole from the dir summaries +2. Look for evidence that confirms or refutes it +3. Consider alternative interpretations +4. Produce a conclusion that reflects the reasoning, not just the observations + +This produces output like: *"This appears to be a multi-tenant SaaS backend +(hypothesis) — the presence of tenant_id throughout the schema, separate +per-tenant job queues, and the auth middleware's scope validation all support +this (evidence). The monolith structure suggests it hasn't been decomposed into +services yet (alternative consideration)."* + +Rather than: *"The auth directory handles authentication. The jobs directory +handles background jobs. The models directory contains database models."* + +The `think` tool already supports this pattern — the synthesis prompt should +explicitly instruct hypothesis formation before `submit_report`. + +--- + +## Part 7: Refinement Pass + +### Trigger + +`--refine` flag. Off by default. + +### What it does + +After synthesis, the refinement agent receives: +- Current synthesis output (brief + full analysis) +- All dir and file cache entries including confidence scores +- Full investigation toolset including external knowledge tools +- A list of low-confidence cache entries (confidence < 0.7) + +It is instructed to: +1. Identify gaps (things not determined from summaries) +2. Identify contradictions (dir summaries that conflict) +3. Identify cross-cutting concerns (patterns spanning multiple dirs) +4. Resolve low-confidence entries +5. Submit an improved report + +The refinement agent owns its investigation — it decides what to look at and +in what order, using the full resolution strategy toolkit. + +### Multiple passes + +`--refine-depth N` runs N refinement passes. Natural stopping condition: the +agent calls `submit_report` without making any file reads or external lookups +(indicates nothing new was found). This can short-circuit before N passes. + +### Refinement vs re-investigation + +Refinement is targeted — it focuses on specific gaps and uncertainties. It is +not a re-run of the full dir loops. The prompt makes this explicit: +*"Focus on resolving uncertainty, not re-summarizing what is already known."* + +--- + +## Part 8: Report Structure + +### Domain-appropriate sections + +Instead of fixed `brief` + `detailed` fields, the synthesis produces structured +fields based on what the survey identified. Fields that are absent or empty are +not rendered. + +The survey output's `description` shapes what fields are relevant. This is not +a hardcoded domain → schema mapping — the synthesis prompt asks the agent to +populate fields that are relevant to *this specific content* from a superset +of available fields: + +``` +Available output fields (populate those relevant to this content): +- brief (always) +- architecture (software projects) +- components (software projects, large document collections) +- tech_stack (software projects) +- entry_points (software projects, CLI tools) +- datasets (data collections) +- schema_summary (data collections, databases) +- period_covered (financial data, journals, time-series) +- themes (document collections, journals) +- data_quality (data collections) +- concerns (any domain) +- overall_purpose (mixed/composite targets) +``` + +The report formatter renders populated fields with appropriate headers and +skips unpopulated ones. Small simple targets produce minimal output. Large +complex targets produce full structured reports. + +### Progressive output (future) + +Rather than one report at the end, stream findings as the investigation +proceeds. The user sees the agent's understanding build in real time. This +converts luminos from a batch tool into an interactive investigation partner. + +Requires a streaming-aware output layer — significant architectural change, +probably not Phase 1. + +--- + +## Part 9: Parallel Investigation + +### For large targets + +Multiple dir-loop agents investigate different subsystems concurrently, then +report to a coordinator. The coordinator synthesizes their findings and +identifies cross-cutting concerns. + +This requires: +- A coordinator agent that owns the investigation plan +- Worker agents scoped to subsystems +- A shared cache that workers write to concurrently (needs locking or + append-only design) +- A merge step in the coordinator before synthesis + +Significant complexity. Probably deferred until single-agent investigation +quality is high. The main benefit is speed, not quality — worth revisiting when +the investigation quality ceiling has been reached. + +--- + +## Part 10: MCP Backend Abstraction + +### Why + +The investigation loop (survey → plan → investigate → synthesize) is +generic. The filesystem-specific parts — how to list a directory, read +a file, parse structure — are an implementation detail. Abstracting +the backend via MCP decouples the two and makes luminos extensible to +any exploration target: websites, wikis, databases, running processes. + +This pivot also serves the project's learning goal. Migrating working +code into an agentic framework is a common and painful real-world task. +Building it clean from the start teaches the pattern; migrating teaches +*why* the pattern exists. The migration pain is intentional. + +### The model + +Each exploration target is an MCP server. Luminos is an MCP client. +The investigation loop connects to a server at startup, discovers its +tools, passes them to the Anthropic API, and forwards tool calls to +the server at runtime. + +``` +luminos (MCP client) + ↓ connects to +filesystem MCP server | process MCP server | wiki MCP server | ... + ↓ exposes tools +read_file, list_dir, parse_structure, ... + ↓ passed to +Anthropic API (agent calls them) + ↓ forwarded back to +MCP server (executes, returns result) +``` + +The filesystem MCP server is the default. `--mcp ` selects +an alternative server. + +### What changes + +- `ai.py` tool dispatch: instead of calling local Python functions, + forward to the connected MCP server +- Tool definitions: dynamically discovered from the server via + `tools/list`, not hardcoded in `ai.py` +- New `luminos_lib/mcp_client.py`: thin MCP client (stdio transport) +- New `luminos_mcp/filesystem.py`: MCP server wrapping existing + filesystem tools (`read_file`, `list_dir`, `parse_structure`, + `run_command`, `stat_file`) +- `--mcp` CLI flag for selecting a non-default server + +### What does not change + +Cache storage, confidence tracking, survey/planning/synthesis passes, +token tracking, cost reporting, all prompts. None of these know or +care what backend provided the data. + +### Known tensions + +**The tree assumption.** The investigation loop assumes hierarchical +containers. Non-filesystem backends (websites, processes) must present +a virtual tree or the traversal model breaks. This is the MCP server's +problem to solve, not luminos's — but it is real design work. + +**Tool count.** If multiple MCP servers are connected simultaneously +(filesystem + web search + package lookup), tool count grows. More +tools degrades agent decision quality. Keep each server focused. + +**The filesystem backend is a demotion.** Currently filesystem +investigation is native — zero overhead. Making it an MCP server adds +process-launch overhead. Acceptable given API call latency already +dominates, but worth knowing. + +**Phase 4 becomes MCP servers.** After the pivot, web_search, +fetch_url, and package_lookup are natural candidates to implement as +MCP servers rather than hardcoded Python functions. Phase 4 and the +MCP pattern reinforce each other. + +### Timing + +After Phase 3, before Phase 4. At that point survey + planning + +dir loops + synthesis are all working with filesystem assumptions +baked in — enough surface area to make the migration instructive +without 9 phases of rework. + +--- + +## Implementation Order + +### Phase 1 — Confidence tracking +- Add `confidence` + `confidence_reason` to cache schemas +- Update dir loop prompt to set confidence when writing cache +- No behavior change yet — just instrumentation + +### Phase 2 — Survey pass +- New `_run_survey()` function in `ai.py` +- `submit_survey` tool definition +- `_SURVEY_SYSTEM_PROMPT` in `prompts.py` +- Wire into `_run_investigation()` before dir loops +- Survey output injected into dir loop system prompt + +### Phase 3 — Investigation planning +- Planning pass after survey, before dir loops +- `submit_plan` tool +- Dynamic turn allocation based on plan +- Dir loop orchestrator updated to follow plan + +### Phase 3.5 — MCP backend abstraction (pivot point) +See Part 10 for full design. This phase happens *after* Phase 3 is +working and *before* Phase 4. The goal is to migrate the filesystem +investigation into an MCP server/client model before adding more +backends or external tools. + +- Extract filesystem tools (`read_file`, `list_dir`, `parse_structure`, + `run_command`, `stat_file`) into a standalone MCP server +- Refactor `ai.py` into an MCP client: discover tools dynamically, + forward tool calls to the server, return results to the agent +- Replace hardcoded tool list in the dir loop with dynamic tool + discovery from the connected MCP server +- Keep the filesystem MCP server as the default; `--mcp` flag selects + alternative servers +- No behavior change to the investigation loop — purely structural + +**Learning goal:** experience migrating working code into an MCP +architecture. The migration pain is intentional and instructive. + +### Phase 4 — External knowledge tools +- `web_search` tool + implementation (requires optional dep: search API client) +- `package_lookup` tool + implementation (HTTP to package registries) +- `fetch_url` tool + implementation +- `--no-external` flag to disable network tools +- Budget tracking and logging + +### Phase 5 — Scale-tiered synthesis +- Sizing measurement after dir loops +- Tier classification +- Small tier: switch synthesis input to file cache entries +- Depth instructions in synthesis prompt + +### Phase 6 — Multi-level synthesis +- Grouping pass + `submit_grouping` tool +- Final synthesis receives subsystem summaries at large/xlarge tier +- Two-level grouping for xlarge + +### Phase 7 — Hypothesis-driven synthesis +- Update synthesis prompt to require hypothesis formation before submit_report +- `think` tool made available in synthesis (currently restricted) + +### Phase 8 — Refinement pass +- `--refine` flag + `_run_refinement()` +- Refinement uses confidence scores to prioritize +- `--refine-depth N` + +### Phase 9 — Dynamic report structure +- Superset output fields in synthesis submit_report schema +- Report formatter renders populated fields only +- Domain-appropriate section headers + +--- + +## File Map + +| File | Changes | +|---|---| +| `luminos_lib/domain.py` | **new** — survey pass, plan pass, profile-free detection | +| `luminos_lib/prompts.py` | survey prompt, planning prompt, refinement prompt, updated dir/synthesis prompts | +| `luminos_lib/ai.py` | survey, planning, external tools, tiered synthesis, multi-level grouping, refinement, confidence-aware cache writes | +| `luminos_lib/cache.py` | confidence fields in schemas, low-confidence query | +| `luminos_lib/report.py` | dynamic field rendering, domain-appropriate sections | +| `luminos.py` | --refine, --no-external, --refine-depth flags; wire survey into scan | +| `luminos_lib/search.py` | **new** — web_search, fetch_url, package_lookup implementations | + +No changes needed to: `tree.py`, `filetypes.py`, `code.py`, `recency.py`, +`disk.py`, `capabilities.py`, `watch.py`, `ast_parser.py` + +--- + +## Known Unknowns + +**Search API choice** +Web search requires an API (Brave Search, Serper, SerpAPI, DuckDuckGo, etc.). +Each has different pricing, rate limits, result quality, and privacy +implications. Which one to use, whether to require an API key, and what the +fallback is when no key is configured — all undecided. Could support multiple +backends with a configurable preference. + +**Package registry coverage** +`package_lookup` needs to handle PyPI, npm, crates.io, pkg.go.dev, Maven, +RubyGems, NuGet at minimum. Each has a different API shape. Coverage gap for +less common ecosystems (Hex for Elixir, Hackage for Haskell, etc.) — the agent +will get no lookup result and must fall back to web search. + +**search result summarization** +Raw search results can't be injected directly into context — they're too long +and too noisy. A summarization step is needed. Options: another AI call (adds +latency and cost), regex extraction (fragile), a lightweight extraction +heuristic. The right approach is unclear. + +**Turn budget arithmetic** +Dynamic turn allocation sounds clean in theory. In practice: how does the +agent "request more turns"? The orchestrator has to interrupt the loop, +check the global budget, and decide whether to grant more. This requires +mid-loop communication that doesn't exist today. Implementation complexity +is non-trivial. + +**Cache invalidation on strategy changes** +If a user re-runs with different flags (--refine, --no-external, new --exclude +list), the existing cache entries may have been produced under a different +investigation strategy. Should they be invalidated? Currently --fresh is the +only mechanism. A smarter approach would store the investigation parameters +in cache metadata and detect mismatches. + +**Confidence calibration** +Asking the agent to self-report confidence (0.0–1.0) is only useful if the +numbers are meaningful and consistent. LLMs are known to be poorly calibrated +on confidence. A 0.6 from one run may not mean the same as 0.6 from another. +This may need to be a categorical signal (high/medium/low) rather than numeric +to be reliable in practice. + +**Context window growth with external tools** +Each web search result, package lookup, and URL fetch adds to the context +window for that dir loop. For a directory with many unknown dependencies, the +context could grow large enough to trigger the budget early exit. Need to think +about how external tool results are managed in context — perhaps summarized and +discarded from messages after being processed. + +**`ask_user` blocking behavior** +Interactive mode with `ask_user` would block execution waiting for input. This +is fine in a terminal session but incompatible with piped output, scripted use, +or running luminos as a subprocess. Needs a clear mode distinction and graceful +degradation when input is not a TTY. + +**Survey pass quality on tiny targets** +For a target with 3 files, the survey pass adds an API call that may cost more +than it's worth. There should be a minimum size threshold below which the +survey is skipped and a generic approach is used. + +**Parallel investigation complexity** +Concurrent dir-loop agents writing to a shared cache introduces race conditions. +The current `_CacheManager` writes files directly with no locking. This would +need to be addressed before parallel investigation is viable. + +--- + +## Additional Suggestions + +**Config file** +Many things that are currently hardcoded (turn budget, tier thresholds, search +budget, confidence threshold for refinement) should be user-configurable without +CLI flags. A `luminos.toml` in the target directory or `~/.config/luminos/` +would allow project-specific and user-specific defaults. + +**Structured logging** +The `[AI]` stderr output is useful but informal. A structured log (JSONL file +alongside the cache) would allow post-hoc analysis of investigation quality: +which dirs used the most turns, which triggered web searches, which had low +confidence, where budget pressure hit. This also enables future tooling on top +of luminos investigations. + +**Investigation replay** +The cache already stores summaries but not the investigation trace (what the +agent read, in what order, what it decided to skip). Storing the full message +history per directory would allow replaying or auditing an investigation. Cost: +storage. Benefit: debuggability, ability to resume investigations more faithfully. + +**Watch mode + incremental investigation** +Watch mode currently re-runs the full base scan on changes. For AI-augmented +watch mode: detect which directories changed, re-investigate only those, and +patch the cache entries. The synthesis would then re-run from the updated cache +without re-investigating unchanged directories. + +**Optional PDF and Office document readers** +The data and documents domains would benefit from native content extraction: +- `pdfminer` or `pypdf` for PDF text extraction +- `openpyxl` for Excel schema and sheet enumeration +- `python-docx` for Word document text +These would be optional deps like the existing AI deps, gated behind +`--install-extras`. The agent currently can only see filename and size for +these formats. + +**Security-focused analysis mode** +A `--security` flag could tune the investigation toward security-relevant +findings: dependency vulnerability scanning, hardcoded secrets detection, +permission issues, exposed configuration, insecure patterns. The flag would +bias the survey, dir loop prompts, and synthesis toward these concerns and +expand the flags output with severity-ranked security findings. + +**Output formats** +The current report is terminal-formatted text or JSON. Additional formats worth +considering: +- Markdown (for saving to wikis, Notion, Obsidian) +- HTML (self-contained report with collapsible sections) +- SARIF (for security findings — integrates with GitHub Code Scanning) + +**Model selection** +The model is hardcoded to `claude-sonnet-4-20250514`. The survey and planning +passes are lightweight enough to use a faster/cheaper model (Haiku). The dir +loops and synthesis warrant Sonnet or better. The refinement pass might benefit +from Opus for difficult cases. A `--model` flag and per-pass model configuration +would allow cost/quality tradeoffs. + +--- + +## Concerns + +**Cost at scale** +Adding a survey pass, planning pass, external tool lookups, and multiple +refinement passes significantly increases API call count and token consumption. +A large repo run with `--refine` could easily cost several dollars. The current +cost reporting (total tokens at end) may not be sufficient — users need to +understand cost before committing to a long run. Consider a `--estimate` mode +that projects cost from the base scan without running AI. + +**Privacy and external lookups** +Web searches and URL fetches send information about the target's contents to +external services. For a personal journal or proprietary codebase this could +be a significant privacy concern. The `--no-external` flag addresses this but +it should probably be the *default* for sensitive-looking content (PII detected +in filenames, etc.), not something the user has to know to enable. + +**Prompt injection via file contents** +`read_file` passes raw file contents into the context. A malicious file in the +target directory could contain prompt injection attempts. The current system has +no sanitization. This is an existing concern that grows as the agent gains more +capabilities (web search, URL fetch, package lookup — all of which could +theoretically be manipulated by a crafted file). + +**Reliability of self-reported confidence** +The confidence tracking system depends on the agent accurately reporting its +own uncertainty. If the agent is systematically over-confident (which LLMs tend +to be), the refinement pass will never trigger on cases where it's most needed. +The system should have a skeptical prior — low-confidence by default for +unfamiliar file types, missing READMEs, ambiguous structures. + +**Investigation quality regression risk** +Each new pass (survey, planning, refinement) adds opportunities for the +investigation to go wrong. A bad survey misleads all subsequent dir loops. A +bad plan wastes turns on shallow directories and skips critical ones. The system +needs quality signals — probably the confidence scores aggregated across the +investigation — to detect when something went wrong and potentially retry. + +**Watch mode compatibility** +Several of the planned features (survey pass, planning, external tools) are not +designed for incremental re-use in watch mode. Adding AI capability to watch +mode is a separate design problem that deserves its own thinking. + +**Turn budget contention** +If the planning pass allocates turns and the agent borrows from its budget when +it needs more, there's a risk of runaway investigation on unexpectedly complex +directories. Needs a hard ceiling (global max tokens, not just per-dir turns) +as a backstop. + +--- + +## Raw Thoughts + +The investigation planning idea is conceptually appealing but has a chicken-and- +egg problem: you need to know what's in the directories to plan how to +investigate them, but you haven't investigated yet. The survey pass helps but +it's shallow. Maybe the first pass through each directory should be a cheap +orientation (list contents, read one file) that feeds the plan before the full +investigation starts. Two-phase dir investigation: orient then investigate. + +The hypothesis-driven synthesis is probably the highest leverage change in this +whole plan. The current synthesis produces descriptive output. Hypothesis-driven +synthesis produces analytical output. The prompt change is small but the output +quality difference could be significant. + +Web search feels like it should be a last resort, not an early one. The agent +should exhaust local investigation before reaching for external sources. The +prompt should reflect this: "Only search if you cannot determine this from the +files available." + +There's a question of whether the survey pass should run before the base scan +or after. After makes sense because the base scan's file_categories is useful +survey input. But the base scan itself could be informed by the survey (e.g. +skip certain directories the survey identified as low-value). Probably the right +answer is: survey runs after base scan but before AI dir loops, using base scan +output as input. + +The `ask_user` tool is interesting because it inverts the relationship — the +agent asks the human rather than the other way around. This is powerful but +needs careful constraints. The agent should only ask when it's genuinely stuck, +not as a shortcut to avoid investigation. The prompt should require that other +resolution strategies have been exhausted before asking. + +Multi-level synthesis (grouping pass) might produce better results than +expected because the grouping agent has a different task than the dir-loop +agents — it's looking for relationships and patterns across summaries rather +than summarizing individual directories. It might surface architectural insights +that none of the dir loops could see individually. + +Package vulnerability lookups are potentially the highest signal-to-noise +external tool — structured data, specific to the files present, directly +actionable. Worth implementing before general web search. + +The confidence calibration problem is real but maybe not critical to solve +precisely. Even if 0.6 doesn't mean the same thing every time, entries with +confidence below some threshold will still tend to be the more uncertain ones. +Categorical (high/medium/low) is probably fine for the first implementation. + +Progressive output and interactive mode are probably the features that would +most change how luminos *feels* to use. The current UX is: run it, wait, get a +report. Progressive output would make it feel like watching someone explore +the codebase in real time. Worth thinking about the UX before the architecture. + +There's a version of this tool that goes well beyond file system analysis — +a general-purpose investigative agent that can be pointed at anything (a +directory, a URL, a database, a running process) and produce an intelligence +report. The current architecture is already pointing in that direction. Worth +keeping that possibility in mind when making structural decisions so we don't +close off that path prematurely.