retro: Session 10 — Phase 3 shipped, quality instrumentation, smoke tests

Jeff Smith 2026-04-12 21:15:55 -06:00
parent 3fcf8c221d
commit 36f0834c93
2 changed files with 159 additions and 0 deletions

158
Session10.md Normal file

@ -0,0 +1,158 @@
# Session 10 Notes — 2026-04-12
## What We Set Out to Do
Ship Phase 3: Investigation Planning. The design sketch had been deferred
from Session 9. The goal was to plan the nuts and bolts, then implement
everything: planning pass, dynamic turn allocation, plan caching, and
quality instrumentation.
## What Actually Happened
Started with a design discussion. Jeff asked the right foundational
question before any code was written: "How can we measure the quality
of the data being produced?" This reframed the work from "add a
planning pass" to "add a planning pass and the instrumentation to know
if it's helping."

Implementation went smoothly. All four Phase 3 issues (#8, #9, #10,
#11) plus the new quality instrumentation issue (#74) shipped in a
single PR (#75): 943 insertions, 26 new tests (234 to 260), all
passing.

Smoke test against `luminos_lib/` immediately caught a bug: the
planning pass correctly identified the target root as priority with 20
turns, but the orchestrator gave it 10 (the default). Root cause
analysis revealed a path-matching mismatch in `_apply_plan()`: the
planner names the target root by `basename(target)` (from the tree
output), but the lookup table keyed it as `"."` (from
`os.path.relpath`), so the lookup missed and the default applied.
Filed as #76, fixed in PR #77 (43 insertions, 2 new tests, 262 total).
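
The mismatch is easy to reproduce in isolation. The sketch below is not the actual `_apply_plan()` code or the fix from PR #77; `relpath_key`, `planned_turns`, and the `/repo/luminos_lib` path are hypothetical names chosen for illustration. It only shows why a plan keyed by basename misses when the lookup key comes from `os.path.relpath`, and one possible way to reconcile the two spellings.

```python
import os

DEFAULT_TURNS = 10  # the default budget cited in these notes


def relpath_key(dir_path: str, target: str) -> str:
    """Key a directory relative to the target; the root itself becomes '.'."""
    return os.path.relpath(dir_path, target)


def planned_turns(dir_path: str, target: str, plan: dict) -> int:
    """Look up planned turns, accepting the root's basename as an alias for '.'."""
    key = relpath_key(dir_path, target)
    if key == "." and key not in plan:
        # Hypothetical reconciliation: the planner spells the root by basename.
        key = os.path.basename(os.path.normpath(target))
    return plan.get(key, DEFAULT_TURNS)


target = "/repo/luminos_lib"  # illustrative path, not from the real run
plan = {"luminos_lib": 20}    # planner output keyed by the name in the tree

naive = plan.get(relpath_key(target, target), DEFAULT_TURNS)  # lookup misses
fixed = planned_turns(target, target, plan)                   # lookup hits
print(naive, fixed)  # prints: 10 20
```

The naive lookup silently falls back to the default, which is why nothing failed loudly and the bug only showed up in the evaluation data.
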

Second smoke test against the full repo root (`luminos/`) worked
correctly. The planner made sensible tier assignments. `docs/wiki/`
was classified as priority, which Jeff correctly defended: luminos is
a general investigation tool, not a code-only tool, so documentation
is a legitimate investigation target.

The full-repo run also surfaced two more issues:

- #78: Synthesis output is ephemeral (printed to terminal, not
  persisted). The tool's primary deliverable should be a first-class
  artifact.
- #79: Stale cache entries from 2024 (with relative paths) coexist
  with new entries (absolute paths) because they hash differently.
  The synthesis pass saw 7 entries for 5 directories.

The third smoke test ran against Jeff's homelab IaC repo (a project
luminos had never seen before). The investigation produced an accurate,
well-structured report that identified the three-layer architecture,
the migration strategy, and the enterprise-grade operational practices,
and caught a critical flag (plaintext credentials in
`terraform.tfvars`).

Jeff shared that luminos is his learning project for agentic AI
applications. He learns by doing, not reading, and wanted a truly hard
problem that would push the AI.

## Key Decisions & Reasoning
**Quality instrumentation ships with Phase 3, not after.** Jeff's
question about measuring quality before we changed the pipeline was
the right call. We added three metrics: turn utilization per directory,
completeness self-rating on `submit_report`, and `plan_evaluation.json`
as the planning pass's report card. This paid off immediately when the
smoke test revealed the #76 bug through the evaluation data.

**Per-directory allocation, no mid-loop borrowing.** PLAN.md envisioned
a global budget with mid-loop turn borrowing. We chose the simpler
model: fixed per-dir allocations from the plan. The quality data will
tell us if borrowing is needed. No speculative complexity.

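
A minimal sketch of the fixed per-directory model, assuming the plan maps each directory to a tier name. The turn counts (20, 10, 5) are the ones mentioned in these notes; `TIER_TURNS`, `allocate_turns`, and the example directories are hypothetical and not taken from the luminos source.

```python
# Turn budgets per tier, using the numbers cited in these notes.
TIER_TURNS = {"priority": 20, "default": 10, "shallow": 5}


def allocate_turns(directories, plan_tiers):
    """Give every directory a fixed budget from its tier; no borrowing."""
    allocations = {}
    for directory in directories:
        # Directories the plan does not mention fall back to the default tier.
        tier = plan_tiers.get(directory, "default")
        # Unknown tier names also fall back to the default budget.
        allocations[directory] = TIER_TURNS.get(tier, TIER_TURNS["default"])
    return allocations


allocations = allocate_turns(
    ["luminos_lib", "docs/wiki", "tests"],
    {"docs/wiki": "priority", "tests": "shallow"},
)
print(allocations)
```

Because each budget is decided up front, a directory that runs dry cannot take turns from one that finished early; that is the trade-off the utilization data is meant to expose.
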
**Band-sorted ordering preserves leaf-first.** Rather than allowing
arbitrary reordering (which would break child summaries), we sort
directories into priority/default/shallow bands and preserve leaf-first
within each band. This gives us "priority-first" without breaking the
invariant.
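
The band sort can be sketched as a stable sort over an already leaf-first list. This is an illustration under assumptions, not the shipped implementation: depth is approximated by counting `/` separators, and `leaf_first`, `band_sorted`, and the example tree are hypothetical. It guarantees leaf-first order only within a band; how a parent in a higher band than its child is handled is not covered in these notes.

```python
BAND_RANK = {"priority": 0, "default": 1, "shallow": 2}


def leaf_first(directories):
    """Deepest directories first; ties broken by name for determinism."""
    return sorted(directories, key=lambda d: (-d.count("/"), d))


def band_sorted(directories, tiers):
    """Group into priority/default/shallow bands, keeping leaf-first inside each."""
    ordered = leaf_first(directories)
    # sorted() is stable, so the leaf-first order survives within each band.
    return sorted(
        ordered,
        key=lambda d: BAND_RANK.get(tiers.get(d, "default"), BAND_RANK["default"]),
    )


dirs = ["src", "src/core", "docs", "docs/wiki", "tests", "tests/fixtures"]
tiers = {"src/core": "priority", "tests": "shallow", "tests/fixtures": "shallow"}
order = band_sorted(dirs, tiers)
print(order)
# prints: ['src/core', 'docs/wiki', 'docs', 'src', 'tests/fixtures', 'tests']
```
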

**docs/wiki as priority is correct.** The planner classified wiki docs
as priority. Initial reaction was to push back (it's not code), but
Jeff correctly pointed out that luminos investigates filesystems, not
codebases. Documentation is part of what a directory contains.

## Surprises & Discoveries
**The #76 bug was immediately visible in plan_evaluation.json.** The
evaluation showed `turns_allocated: 10` for a directory the plan said
should get 20. Without the quality instrumentation, we would have had
no signal that anything was wrong. The investigation would have
produced a slightly worse report and we'd never have known why.

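
The same comparison can be automated. The sketch below assumes the evaluation is a list of per-directory records; only the `turns_allocated` field name appears in these notes, while the `directory` field, the function name, and the sample data are hypothetical.

```python
def allocation_mismatches(planned, evaluation):
    """Return (directory, planned, allocated) wherever the two disagree."""
    mismatches = []
    for entry in evaluation:
        want = planned.get(entry["directory"])
        got = entry["turns_allocated"]
        # Directories absent from the plan have nothing to compare against.
        if want is not None and want != got:
            mismatches.append((entry["directory"], want, got))
    return mismatches


# Illustrative data shaped like the #76 symptom, not real run output.
planned = {"luminos_lib": 20, "tests": 5}
evaluation = [
    {"directory": "luminos_lib", "turns_allocated": 10},
    {"directory": "tests", "turns_allocated": 5},
]
print(allocation_mismatches(planned, evaluation))
# prints: [('luminos_lib', 20, 10)]
```

A check like this could run at the end of a smoke test so that a plan the orchestrator ignored fails loudly instead of quietly degrading the report.
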
**Stale cache contamination (#79).** Old cache entries from 2024 used
relative paths while current code uses absolute paths. Since entries
are keyed by SHA256(path), both versions coexist. `--fresh` doesn't
clean this up because it reuses the investigation ID mapping. This is
a pre-Phase 3 bug but we discovered it through Phase 3 testing.
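
A short demonstration of why the two generations of entries never collide, assuming the key is the SHA-256 hex digest of the UTF-8 path string (the notes only say "SHA256(path)"; the encoding, function names, and `/repo` paths are assumptions). `normalized_cache_key` shows the "normalize before hashing" idea with an explicit base directory.

```python
import hashlib
import os


def cache_key(path: str) -> str:
    """Hash the path string exactly as given."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def normalized_cache_key(path: str, base: str) -> str:
    """Resolve the path against base first, so both spellings hash alike."""
    # os.path.join returns `path` unchanged when it is already absolute.
    absolute = os.path.normpath(os.path.join(base, path))
    return cache_key(absolute)


base = "/repo"                       # illustrative investigation root
old_style = "luminos_lib"            # relative path, as in stale entries
new_style = "/repo/luminos_lib"      # absolute path, as in current entries

print(cache_key(old_style) == cache_key(new_style))  # prints: False
print(normalized_cache_key(old_style, base)
      == normalized_cache_key(new_style, base))      # prints: True
```

Normalizing only helps entries written after the change; entries already on disk under the old keys would still need to be removed or migrated.
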

**The wiki colors the investigation.** The full-repo synthesis report
referenced "9 development phases" and "session logs tracking project
evolution," which came from reading the wiki, not the code. The agent
synthesizes documentary and code knowledge together. Whether this is
good depends on the use case.

## Concerns & Open Threads
**Synthesis output is ephemeral (#78).** The tool's primary deliverable
vanishes when you close the terminal. This needs to be fixed before
luminos is useful to anyone besides the person running it.

**Stale cache problem (#79).** The path normalization issue could cause
subtle quality degradation on any target that was previously
investigated with an older version. Applying both candidate fixes,
option 1 (normalize paths before hashing) and option 2 (`--fresh`
wipes the cache dir), would fix it cleanly.

**`tests` directory hit its ceiling.** In the full-repo run, the
shallow allocation (5 turns) for `tests/` was fully consumed (100%
utilization) and completeness wasn't reported, suggesting the agent
ran out of turns before calling `submit_report` cleanly. The shallow
default of 5 may be too tight for directories with many files.

**Completeness field is not always reported.** The `tests` directory
entry has no completeness value. This happens when the agent exhausts
turns and the partial-flush path kicks in (it doesn't call
`submit_report`, so completeness is never set). The instrumentation has
a gap for partial entries.
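
Until the gap is closed, the two signals together can at least flag likely truncated investigations. This is a sketch under assumptions: `turns_used` and `completeness` are hypothetical field names (only `turns_allocated` appears in these notes), and the sample entries are illustrative.

```python
def turn_utilization(entry) -> float:
    """Fraction of the allocated turns that were consumed."""
    allocated = entry["turns_allocated"]
    return entry["turns_used"] / allocated if allocated else 0.0


def likely_truncated(entry) -> bool:
    """Budget fully spent and no self-rating: probably cut off mid-work."""
    return turn_utilization(entry) >= 1.0 and entry.get("completeness") is None


# Illustrative entries shaped like the full-repo run described above.
tests_entry = {"directory": "tests", "turns_allocated": 5, "turns_used": 5}
docs_entry = {
    "directory": "docs/wiki",
    "turns_allocated": 20,
    "turns_used": 12,
    "completeness": "high",
}

print(likely_truncated(tests_entry), likely_truncated(docs_entry))
# prints: True False
```
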
## Raw Thinking
The quality measurement question Jeff raised is more important than it
seemed at first. We added three cheap metrics, but the fundamental
challenge remains: luminos produces natural language output, and the
only real quality judge is a human reading it. The instrumentation tells
us about efficiency (did we waste turns?) and self-assessment (did the
agent think it was thorough?), but not accuracy (did the agent get it
right?).

The homelab IaC smoke test was the most interesting data point. The
agent had never seen the project, had no wiki to lean on, and produced
a report that Jeff (who built the project) didn't dispute. That's a
real signal. The quality isn't just "it sounds plausible"; it's "the
person who wrote the code agrees with the assessment."

Jeff's comment about learning by doing, not reading, maps directly to
how luminos itself works: it investigates by reading files, not by
being told what they are. The survey and planning passes are the
"reading the map" phase; the dir loops are the "walking the terrain"
phase. The synthesis is the "here's what I found" debrief. It's the
same investigate-then-synthesize loop a human would use.

## What's Next
Priority order for next session:
1. **Fix #78 (synthesis persistence)** - save synthesis.json to cache,
include AI analysis in `--json -o` output. Small scope, high impact.
2. **Fix #79 (stale cache)** - normalize paths before hashing, consider
`--fresh` wiping the cache dir entirely. Correctness fix.
3. **Phase 4 review (#40)** - reassess Phase 4+ issues after the MCP
pivot and Phase 3 experience. Some issues may be stale or
reprioritized.
4. **Phase 4 (External Knowledge Tools)** or **Phase 5 (Scale-Tiered
Synthesis)** depending on what #40 reveals.

@ -11,6 +11,7 @@
| [Session 7](Session7) | 2026-04-07 | Phase 1 audit (#1 closed, only #54 remains); gitea MCP credential overhaul — dedicated `claude-code` Forgejo user with admin on luminos, write+delete verified |
| [Session 8](Session8) | 2026-04-07 | Closed #54 — added confidence/confidence_reason to write_cache tool schema description; Phase 1 milestone now 4/4 complete |
| [Session 9](Session9) | 2026-04-11 | Scope shift (#64) + all Phase 3 prereqs: dir loop refactor (#57), tool registry consolidation (#56), pure-helper test coverage waves 1+2 (#55, #70), leaf-first contract docs (#72). 6 PRs, 70 new tests (164→234), Phase 2.6/2.7/2.8 milestones complete |
| [Session 10](Session10) | 2026-04-12 | Phase 3 shipped: planning pass, dynamic turn allocation, quality instrumentation (#8, #9, #10, #11, #74). Found and fixed root-path matching bug (#76). Smoke tests against luminos and homelab IaC. Filed #78 (synthesis persistence), #79 (stale cache). 3 PRs, 28 new tests (234→262) |
---