Jeff Smith edited this page 2026-04-12 21:15:55 -06:00

Session 10 Notes — 2026-04-12

What We Set Out to Do

Ship Phase 3: Investigation Planning. The design sketch had been deferred from Session 9. The goal was to plan the nuts and bolts, then implement everything: planning pass, dynamic turn allocation, plan caching, and quality instrumentation.

What Actually Happened

Started with a design discussion. Jeff asked the right foundational question before any code was written: "How can we measure the quality of the data being produced?" This reframed the work from "add a planning pass" to "add a planning pass and the instrumentation to know if it's helping."

Implementation went smoothly. All four Phase 3 issues (#8, #9, #10, #11) plus the new quality instrumentation issue (#74) shipped in a single PR (#75): 943 insertions, 26 new tests (234 to 260), all passing.

Smoke test against luminos_lib/ immediately caught a bug: the planning pass correctly identified the target root as priority with 20 turns, but the orchestrator gave it 10 (the default). The root cause was a path-matching mismatch in _apply_plan(): the planner names the target root by basename(target) (taken from the tree output), but the lookup table keyed it as "." (from os.path.relpath), so the plan entry for the root never matched. Filed as #76, fixed in PR #77 (43 insertions, 2 new tests, 262 total).
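A minimal sketch of the mismatch, and of one way to reconcile the two naming schemes. The helper name plan_key and the example path are hypothetical; the actual fix in PR #77 may look different.

```python
import os

def plan_key(path: str, target: str) -> str:
    """Return the name the planner would use for `path` (hypothetical helper).

    os.path.relpath() yields "." for the target root, while the tree output
    shows the root by its basename, so the root needs special handling.
    """
    rel = os.path.relpath(path, target)
    if rel == ".":
        return os.path.basename(os.path.abspath(target))
    return rel

target = "/tmp/project/luminos_lib"       # illustrative path only
root_as_relpath = os.path.relpath(target, target)  # the key the lookup table had
root_as_planned = plan_key(target, target)         # the key the planner emitted
print(root_as_relpath, root_as_planned)
```

With both sides keyed through the same function, the root directory's plan entry can be found by the lookup instead of silently falling back to the default.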

Second smoke test against the full repo root (luminos/) worked correctly. The planner made sensible tier assignments. docs/wiki/ was classified as priority, which Jeff correctly defended: luminos is a general investigation tool, not a code-only tool, so documentation is a legitimate investigation target.

The full-repo run also surfaced two more issues:

  • #78: Synthesis output is ephemeral (printed to terminal, not persisted). The tool's primary deliverable should be a first-class artifact.
  • #79: Stale cache entries from 2024 (with relative paths) coexist with new entries (absolute paths) because they hash differently. The synthesis pass saw 7 entries for 5 directories.

Third smoke test against Jeff's homelab IaC repo (a project luminos had never seen before). The investigation produced an accurate, well-structured report that identified the three-layer architecture, the migration strategy, and the enterprise-grade operational practices. It also raised a critical flag: plaintext credentials in terraform.tfvars.

Jeff shared that luminos is his learning project for agentic AI applications. He learns by doing, not reading, and wanted a truly hard problem that would push the AI.

Key Decisions & Reasoning

Quality instrumentation ships with Phase 3, not after. Jeff's question about measuring quality before we changed the pipeline was the right call. We added three metrics: turn utilization per directory, completeness self-rating on submit_report, and plan_evaluation.json as the planning pass's report card. This paid off immediately when the smoke test revealed the #76 bug through the evaluation data.
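As a rough illustration of the first two metrics, the sketch below builds one evaluation record per directory. The DirResult type and every field name other than turns_allocated are assumptions made for the example, not the actual plan_evaluation.json schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DirResult:
    """Hypothetical per-directory outcome of an investigation loop."""
    path: str
    turns_allocated: int
    turns_used: int
    # Self-rating passed to submit_report; None if it was never called.
    completeness: Optional[str] = None

def evaluation_entry(result: DirResult) -> dict:
    """Build one record of the kind plan_evaluation.json might hold."""
    if result.turns_allocated:
        utilization = result.turns_used / result.turns_allocated
    else:
        utilization = 0.0
    return {
        "path": result.path,
        "turns_allocated": result.turns_allocated,
        "turns_used": result.turns_used,
        "utilization": round(utilization, 2),
        "completeness": result.completeness,
    }

entry = evaluation_entry(DirResult("tests", turns_allocated=5, turns_used=5))
print(entry)
```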

Per-directory allocation, no mid-loop borrowing. PLAN.md envisioned a global budget with mid-loop turn borrowing. We chose the simpler model: fixed per-dir allocations from the plan. The quality data will tell us if borrowing is needed. No speculative complexity.
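The fixed allocation model can be sketched as a plain lookup from tier to turn count. The numbers (20, 10, 5) are the ones observed in this session's smoke tests; the function name and the plan shape (directory to tier) are assumptions.

```python
# Turn counts per tier as observed in this session; treat as illustrative.
TIER_TURNS = {"priority": 20, "default": 10, "shallow": 5}

def allocate_turns(plan: dict[str, str]) -> dict[str, int]:
    """Map each directory's tier to a fixed turn budget (hypothetical helper).

    Unknown tiers fall back to the default budget. There is no shared pool,
    so a directory that finishes early cannot lend turns to another.
    """
    fallback = TIER_TURNS["default"]
    return {d: TIER_TURNS.get(tier, fallback) for d, tier in plan.items()}

budgets = allocate_turns({"luminos_lib": "priority", "docs": "default", "tests": "shallow"})
print(budgets)
```

Keeping the allocation a pure function of the plan makes it easy to compare what was planned against what was actually granted.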

Band-sorted ordering preserves leaf-first. Rather than allowing arbitrary reordering (which would break child summaries), we sort directories into priority/default/shallow bands and preserve leaf-first within each band. This gives us "priority-first" without breaking the invariant.
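One way to get this ordering is a stable sort on the band alone, sketched below. It relies on Python's sorted() being stable, so directories in the same band keep their original leaf-first order. The function and variable names are hypothetical, and the sketch does not address how a parent and child in different bands are handled, which these notes do not specify.

```python
BAND_RANK = {"priority": 0, "default": 1, "shallow": 2}

def band_sorted(leaf_first: list[str], tiers: dict[str, str]) -> list[str]:
    """Reorder a leaf-first directory list into priority/default/shallow bands.

    Directories without an explicit tier are treated as "default".
    """
    return sorted(leaf_first, key=lambda d: BAND_RANK[tiers.get(d, "default")])

leaf_first = ["src/a/b", "src/a", "docs/wiki", "docs", "tests"]
tiers = {"docs/wiki": "priority", "tests": "shallow"}
ordered = band_sorted(leaf_first, tiers)
print(ordered)
# ['docs/wiki', 'src/a/b', 'src/a', 'docs', 'tests']
```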

docs/wiki as priority is correct. The planner classified wiki docs as priority. Initial reaction was to push back (it's not code), but Jeff correctly pointed out that luminos investigates filesystems, not codebases. Documentation is part of what a directory contains.

Surprises & Discoveries

The #76 bug was immediately visible in plan_evaluation.json. The evaluation showed turns_allocated: 10 for a directory the plan said should get 20. Without the quality instrumentation, we would have had no signal that anything was wrong. The investigation would have produced a slightly worse report and we'd never have known why.
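The check that surfaced #76 by eye could also be automated. The sketch below compares planned turns against the turns_allocated values in the evaluation data; apart from turns_allocated, the data shapes and names are assumptions.

```python
def allocation_mismatches(planned: dict[str, int], evaluation: list[dict]) -> list[tuple]:
    """Return (path, planned_turns, allocated_turns) for every disagreement."""
    mismatches = []
    for entry in evaluation:
        path = entry["path"]
        if path in planned and planned[path] != entry["turns_allocated"]:
            mismatches.append((path, planned[path], entry["turns_allocated"]))
    return mismatches

# Values mirror the #76 symptom: the plan said 20, the orchestrator gave 10.
planned = {"luminos_lib": 20}
evaluation = [{"path": "luminos_lib", "turns_allocated": 10}]
found = allocation_mismatches(planned, evaluation)
print(found)
```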

Stale cache contamination (#79). Old cache entries from 2024 used relative paths while current code uses absolute paths. Since entries are keyed by SHA256(path), both versions coexist. --fresh doesn't clean this up because it reuses the investigation ID mapping. This is a pre-Phase 3 bug but we discovered it through Phase 3 testing.
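A short sketch of why the two generations of entries never collide, and of what normalizing before hashing would look like. The function names and the example paths are hypothetical; only the SHA256(path) keying comes from the notes above.

```python
import hashlib
import os

def cache_key(path: str) -> str:
    """Key an entry by SHA256 of the path string exactly as given."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()

def normalized_cache_key(path: str, base: str) -> str:
    """Resolve `path` against `base` to an absolute path before hashing."""
    absolute = os.path.abspath(os.path.join(base, path))
    return cache_key(absolute)

base = "/srv/repo"                      # illustrative investigation root
old_style = "luminos_lib"               # relative, as older entries were stored
new_style = "/srv/repo/luminos_lib"     # absolute, as current code stores them

same_raw = cache_key(old_style) == cache_key(new_style)
same_normalized = normalized_cache_key(old_style, base) == normalized_cache_key(new_style, base)
print(same_raw, same_normalized)
```

Normalizing only helps for entries written after the change; entries already on disk under the old keys would still need to be migrated or removed.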

The wiki colors the investigation. The full-repo synthesis report referenced "9 development phases" and "session logs tracking project evolution," which came from reading the wiki, not the code. The agent synthesizes documentary and code knowledge together. Whether this is good depends on the use case.

Concerns & Open Threads

Synthesis output is ephemeral (#78). The tool's primary deliverable vanishes when you close the terminal. This needs to be fixed before luminos is useful to anyone besides the person running it.

Stale cache problem (#79). The path normalization issue could cause subtle quality degradation on any target that was previously investigated with an older version. Two changes together would fix it cleanly: normalize paths before hashing, and have --fresh wipe the cache dir.

tests directory hit its ceiling. In the full-repo run, the shallow allocation (5 turns) for tests/ was fully consumed (100% utilization) and completeness wasn't reported, suggesting the agent ran out of turns before calling submit_report cleanly. The shallow default of 5 may be too tight for directories with many files.

Completeness field is not always reported. The tests directory entry has no completeness value. This happens when the agent exhausts turns and the partial-flush path kicks in (it doesn't call submit_report, so completeness is never set). The instrumentation has a gap for partial entries.
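Until the gap is closed, entries that likely went through the partial-flush path can at least be identified from the existing data. The sketch below flags entries that used their whole budget and have no completeness value; field names other than turns_allocated are assumptions.

```python
def likely_partial(entries: list[dict]) -> list[str]:
    """Return paths that exhausted their turns without reporting completeness."""
    flagged = []
    for entry in entries:
        exhausted = entry["turns_used"] >= entry["turns_allocated"]
        if exhausted and entry.get("completeness") is None:
            flagged.append(entry["path"])
    return flagged

entries = [
    {"path": "tests", "turns_allocated": 5, "turns_used": 5},
    {"path": "docs", "turns_allocated": 10, "turns_used": 6, "completeness": "high"},
]
flagged = likely_partial(entries)
print(flagged)
```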

Raw Thinking

The quality measurement question Jeff raised is more important than it seemed at first. We added three cheap metrics, but the fundamental challenge remains: luminos produces natural language output, and the only real quality judge is a human reading it. The instrumentation tells us about efficiency (did we waste turns?) and self-assessment (did the agent think it was thorough?), but not accuracy (did the agent get it right?).

The homelab IaC smoke test was the most interesting data point. The agent had never seen the project, had no wiki to lean on, and produced a report that Jeff (who built the project) didn't dispute. That's a real signal. The quality isn't just "it sounds plausible"; it's "the person who wrote the code agrees with the assessment."

Jeff's comment about learning by doing, not reading, maps directly to how luminos itself works: it investigates by reading files, not by being told what they are. The survey and planning passes are the "reading the map" phase; the dir loops are the "walking the terrain" phase. The synthesis is the "here's what I found" debrief. It's the same investigate-then-synthesize loop a human would use.

What's Next

Priority order for next session:

  1. Fix #78 (synthesis persistence) - save synthesis.json to cache, include AI analysis in --json -o output. Small scope, high impact.
  2. Fix #79 (stale cache) - normalize paths before hashing, consider --fresh wiping the cache dir entirely. Correctness fix.
  3. Phase 4 review (#40) - reassess Phase 4+ issues after the MCP pivot and Phase 3 experience. Some issues may be stale or reprioritized.
  4. Phase 4 (External Knowledge Tools) or Phase 5 (Scale-Tiered Synthesis) depending on what #40 reveals.
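For item 1, a minimal sketch of persisting the synthesis to the cache directory. The function name, the synthesis dict shape, and the write-then-rename approach are assumptions; only the synthesis.json filename comes from the plan above.

```python
import json
import os
import tempfile

def save_synthesis(cache_dir: str, synthesis: dict) -> str:
    """Write synthesis.json into cache_dir and return its path.

    The data goes to a temporary file first and is then renamed into place,
    so an interrupted run does not leave a half-written file behind.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, "synthesis.json")
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(synthesis, f, indent=2)
    os.replace(tmp_path, final_path)
    return final_path

with tempfile.TemporaryDirectory() as demo_dir:
    saved = save_synthesis(demo_dir, {"summary": "example", "flags": []})
    with open(saved, encoding="utf-8") as f:
        loaded = json.load(f)
print(loaded)
```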