Session 10 Notes — 2026-04-12
What We Set Out to Do
Ship Phase 3: Investigation Planning. The design sketch had been deferred from Session 9. The goal was to plan the nuts and bolts, then implement everything: planning pass, dynamic turn allocation, plan caching, and quality instrumentation.
What Actually Happened
Started with a design discussion. Jeff asked the right foundational question before any code was written: "How can we measure the quality of the data being produced?" This reframed the work from "add a planning pass" to "add a planning pass and the instrumentation to know if it's helping."
Implementation went smoothly. All four Phase 3 issues (#8, #9, #10, #11) plus the new quality instrumentation issue (#74) shipped in a single PR (#75): 943 insertions, 26 new tests (234 to 260), all passing.
Smoke test against luminos_lib/ immediately caught a bug: the
planning pass correctly identified the target root as priority with 20
turns, but the orchestrator gave it 10 (the default). Root cause
analysis revealed a path-matching mismatch in _apply_plan(): the
planner uses basename(target) (from the tree output) but the lookup
table used "." (from os.path.relpath). Filed as #76, fixed in PR
#77 (43 insertions, 2 new tests, 262 total).
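One way to reconcile the two naming schemes, as a minimal sketch: translate each planner path into the relpath-style key before the lookup. plan_key and turns_for are hypothetical helpers, not the actual _apply_plan() code, and the sketch assumes planner paths are either relative to the target or prefixed with the target's basename.

```python
import os

DEFAULT_TURNS = 10  # the default allocation mentioned in these notes


def plan_key(planner_path: str, target: str) -> str:
    """Convert a path as the planner wrote it into an os.path.relpath-style key.

    Assumption: the planner may prefix paths with basename(target), as seen
    in the tree output, while the lookup table is keyed relative to target.
    """
    root_name = os.path.basename(os.path.normpath(target))
    parts = os.path.normpath(planner_path).split(os.sep)
    if parts[0] == root_name:
        parts = parts[1:]  # drop the leading target name
    return os.path.join(*parts) if parts else "."


def turns_for(rel_dir: str, plan: dict, target: str) -> int:
    """Look up planned turns for rel_dir, falling back to the default."""
    table = {plan_key(path, target): turns for path, turns in plan.items()}
    return table.get(rel_dir, DEFAULT_TURNS)
```

With this shape, a plan entry written as luminos_lib resolves to "." and the root picks up its planned turns. A child directory that shares the target's name would be ambiguous under this scheme, so the real fix may need a stricter rule.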
Second smoke test against the full repo root (luminos/) worked
correctly. The planner made sensible tier assignments. docs/wiki/
was classified as priority, which Jeff correctly defended: luminos is
a general investigation tool, not a code-only tool, so documentation
is a legitimate investigation target.
The full-repo run also surfaced two more issues:
- #78: Synthesis output is ephemeral (printed to terminal, not persisted). The tool's primary deliverable should be a first-class artifact.
- #79: Stale cache entries from 2024 (with relative paths) coexist with new entries (absolute paths) because they hash differently. The synthesis pass saw 7 entries for 5 directories.
The third smoke test ran against Jeff's homelab IaC repo (a project luminos had never seen before). The investigation produced an accurate, well-structured report that identified the three-layer architecture, the migration strategy, and the enterprise-grade operational practices, and raised a critical flag (plaintext credentials in terraform.tfvars).
Jeff shared that luminos is his learning project for agentic AI applications. He learns by doing, not reading, and wanted a truly hard problem that would push the AI.
Key Decisions & Reasoning
Quality instrumentation ships with Phase 3, not after. Jeff's question about measuring quality before we changed the pipeline was the right call. We added three metrics: turn utilization per directory, completeness self-rating on submit_report, and plan_evaluation.json as the planning pass's report card. This paid off immediately when the smoke test revealed the #76 bug through the evaluation data.
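To make the three metrics concrete, here is a rough sketch of what one per-directory record in plan_evaluation.json could look like. Only turns_allocated and completeness are named in these notes; evaluation_entry, turns_used, and utilization are illustrative names, and the real schema may differ.

```python
def evaluation_entry(path, turns_allocated, turns_used, completeness=None):
    """Build one per-directory record for a plan evaluation report.

    completeness is the agent's self-rating from submit_report; it stays
    None when the agent never submitted a report.
    """
    # Guard against a zero allocation rather than dividing by zero.
    utilization = turns_used / turns_allocated if turns_allocated else None
    return {
        "path": path,
        "turns_allocated": turns_allocated,
        "turns_used": turns_used,      # hypothetical field name
        "utilization": utilization,    # hypothetical field name
        "completeness": completeness,
    }
```

A record showing turns_allocated of 10 for a directory the plan assigned 20 is the kind of mismatch that exposed #76.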
Per-directory allocation, no mid-loop borrowing. PLAN.md envisioned a global budget with mid-loop turn borrowing. We chose the simpler model: fixed per-dir allocations from the plan. The quality data will tell us if borrowing is needed. No speculative complexity.
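The fixed model can be sketched in a few lines. The tier values below are the ones observed in this session (20 for the priority root, 10 default, 5 shallow); treating them as constants for every directory in a tier is an assumption, and allocate_turns is a hypothetical name.

```python
# Turn budgets per tier, as observed in this session's runs (assumed constant).
TIER_TURNS = {"priority": 20, "default": 10, "shallow": 5}


def allocate_turns(dirs, tiers):
    """Give each directory a fixed budget from its tier.

    There is no shared pool: turns a directory leaves unused are not
    handed to any other directory.
    """
    return {d: TIER_TURNS[tiers.get(d, "default")] for d in dirs}
```

Because each budget is independent, utilization per directory is directly comparable, which is what lets the quality data show whether borrowing would have helped.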
Band-sorted ordering preserves leaf-first. Rather than allowing arbitrary reordering (which would break child summaries), we sort directories into priority/default/shallow bands and preserve leaf-first within each band. This gives us "priority-first" without breaking the invariant.
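A minimal sketch of the banding idea, assuming the input list is already in leaf-first order: a stable sort keyed only on the band keeps the existing relative order inside each band. band_sort and BAND_ORDER are hypothetical names, not the orchestrator's actual code.

```python
BAND_ORDER = {"priority": 0, "default": 1, "shallow": 2}


def band_sort(leaf_first_dirs, tiers):
    """Group directories into priority/default/shallow bands.

    sorted() is stable, so directories that share a band keep the
    leaf-first order they had in the input list.
    """
    return sorted(
        leaf_first_dirs,
        key=lambda d: BAND_ORDER[tiers.get(d, "default")],
    )


dirs = ["src/core", "docs/wiki", "tests", "src", "docs"]  # leaf-first input
tiers = {"docs/wiki": "priority", "tests": "shallow"}
ordered = band_sort(dirs, tiers)
# ordered == ["docs/wiki", "src/core", "src", "docs", "tests"]
```

This sketch only guarantees leaf-first order within a band; how parent and child directories that land in different bands are handled is not covered in these notes.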
docs/wiki as priority is correct. The planner classified wiki docs as priority. Initial reaction was to push back (it's not code), but Jeff correctly pointed out that luminos investigates filesystems, not codebases. Documentation is part of what a directory contains.
Surprises & Discoveries
The #76 bug was immediately visible in plan_evaluation.json. The
evaluation showed turns_allocated: 10 for a directory the plan said
should get 20. Without the quality instrumentation, we would have had
no signal that anything was wrong. The investigation would have
produced a slightly worse report and we'd never have known why.
Stale cache contamination (#79). Old cache entries from 2024 used
relative paths while current code uses absolute paths. Since entries
are keyed by SHA256(path), both versions coexist. --fresh doesn't
clean this up because it reuses the investigation ID mapping. This is
a pre-Phase 3 bug but we discovered it through Phase 3 testing.
The wiki colors the investigation. The full-repo synthesis report referenced "9 development phases" and "session logs tracking project evolution," which came from reading the wiki, not the code. The agent synthesizes documentary and code knowledge together. Whether this is good depends on the use case.
Concerns & Open Threads
Synthesis output is ephemeral (#78). The tool's primary deliverable vanishes when you close the terminal. This needs to be fixed before luminos is useful to anyone besides the person running it.
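Persisting the result could be as small as the sketch below. The synthesis.json filename comes from the plan for #78; save_synthesis, the cache_dir argument, and the shape of the synthesis dict are assumptions for illustration.

```python
import json
import os


def save_synthesis(cache_dir: str, synthesis: dict) -> str:
    """Write the synthesis result to <cache_dir>/synthesis.json.

    Writes to a temporary file first and then renames it, so a crash
    mid-write does not leave a truncated synthesis.json behind.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "synthesis.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(synthesis, fh, indent=2)
    os.replace(tmp_path, path)
    return path
```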
Stale cache problem (#79). The path normalization issue could cause
subtle quality degradation on any target that was previously
investigated with an older version. Normalizing paths before hashing
(option 1) plus having --fresh wipe the cache dir (option 2) would fix
it cleanly.
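The normalization half of the fix might look like the following sketch. cache_key and its base parameter are hypothetical; the only detail taken from these notes is that entries are keyed by SHA256(path).

```python
import hashlib
import os
from typing import Optional


def cache_key(path: str, base: Optional[str] = None) -> str:
    """Hash a directory path after converting it to a normalized absolute path.

    Relative paths are resolved against base (or the current working
    directory), so the relative and absolute spellings of one directory
    produce the same key. Symlinks are not resolved in this sketch.
    """
    absolute = os.path.abspath(os.path.join(base or os.getcwd(), path))
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()
```

Hashing the raw strings instead yields two different keys for the same directory, which is how the synthesis pass ended up with 7 entries for 5 directories.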
The tests/ directory hit its ceiling. In the full-repo run, the
shallow allocation (5 turns) for tests/ was fully consumed (100%
utilization) and completeness wasn't reported, suggesting the agent
ran out of turns before calling submit_report cleanly. The shallow
default of 5 may be too tight for directories with many files.
Completeness field is not always reported. The tests directory
entry has no completeness value. This happens when the agent exhausts
turns and the partial-flush path kicks in (it doesn't call
submit_report, so completeness is never set). The instrumentation has
a gap for partial entries.
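One possible way to close the gap, sketched under assumptions: have the partial-flush path stamp the entry explicitly, so a missing rating is distinguishable from a forgotten one. finalize_entry, the partial field, and the "unreported" marker are all hypothetical.

```python
def finalize_entry(entry: dict, submitted: bool) -> dict:
    """Return a copy of a cache entry with its completion status made explicit.

    submitted is True when the agent called submit_report itself and False
    when the entry came from the partial-flush path.
    """
    out = dict(entry)
    out["partial"] = not submitted
    if not submitted:
        # No self-rating exists for a partial entry; record that fact
        # instead of leaving the field absent.
        out.setdefault("completeness", "unreported")
    return out
```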
Raw Thinking
The quality measurement question Jeff raised is more important than it seemed at first. We added three cheap metrics, but the fundamental challenge remains: luminos produces natural language output, and the only real quality judge is a human reading it. The instrumentation tells us about efficiency (did we waste turns?) and self-assessment (did the agent think it was thorough?), but not accuracy (did the agent get it right?).
The homelab IaC smoke test was the most interesting data point. The agent had never seen the project, had no wiki to lean on, and produced a report that Jeff (who built the project) didn't dispute. That's a real signal. The quality isn't just "it sounds plausible"; it's "the person who wrote the code agrees with the assessment."
Jeff's comment about learning by doing, not reading, maps directly to how luminos itself works: it investigates by reading files, not by being told what they are. The survey and planning passes are the "reading the map" phase; the dir loops are the "walking the terrain" phase. The synthesis is the "here's what I found" debrief. It's the same investigate-then-synthesize loop a human would use.
What's Next
Priority order for next session:
- Fix #78 (synthesis persistence) - save synthesis.json to cache, include AI analysis in --json -o output. Small scope, high impact.
- Fix #79 (stale cache) - normalize paths before hashing, consider --fresh wiping the cache dir entirely. Correctness fix.
- Phase 4 review (#40) - reassess Phase 4+ issues after the MCP pivot and Phase 3 experience. Some issues may be stale or reprioritized.
- Phase 4 (External Knowledge Tools) or Phase 5 (Scale-Tiered Synthesis) depending on what #40 reveals.