retro: Session 10 — Phase 3 shipped, quality instrumentation, smoke tests

2026-04-12 21:15:55 -06:00 · 2026-04-12 21:15:55 -06:00 · 36f0834c93
commit 36f0834c93
parent 3fcf8c221d
2 changed files with 159 additions and 0 deletions
--- a/Session10.md
+++ b/Session10.md
@@ -0,0 +1,158 @@
 # Session 10 Notes — 2026-04-12
 ## What We Set Out to Do
 Ship Phase 3: Investigation Planning. The design sketch had been deferred
 from Session 9. The goal was to plan the nuts and bolts, then implement
 everything: planning pass, dynamic turn allocation, plan caching, and
 quality instrumentation.
 ## What Actually Happened
 Started with a design discussion. Jeff asked the right foundational
 question before any code was written: "How can we measure the quality
 of the data being produced?" This reframed the work from "add a
 planning pass" to "add a planning pass and the instrumentation to know
 if it's helping."
 Implementation went smoothly. All four Phase 3 issues (#8, #9, #10,
 #11) plus the new quality instrumentation issue (#74) shipped in a
 single PR (#75): 943 insertions, 26 new tests (234 to 260), all
 passing.
 Smoke test against `luminos_lib/` immediately caught a bug: the
 planning pass correctly identified the target root as priority with 20
 turns, but the orchestrator gave it 10 (the default). Root cause
 analysis revealed a path-matching mismatch in `_apply_plan()`: the
 planner uses `basename(target)` (from the tree output) but the lookup
 table used `"."` (from `os.path.relpath`). Filed as #76, fixed in PR
 #77 (43 insertions, 2 new tests, 262 total).
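The mismatch can be reproduced in a few lines. This is a minimal sketch, not the actual `_apply_plan()` code: `plan_key` and `turns_for` are hypothetical helpers, and the plan is assumed to be a plain dict keyed by the names the planner saw in the tree output.

```python
import os


def plan_key(dir_path: str, target: str) -> str:
    """Return the key used to look a directory up in the plan.

    Hypothetical helper: os.path.relpath yields "." for the target
    root, while the planner (per the notes) names it by basename.
    """
    rel = os.path.relpath(dir_path, target)
    if rel == ".":
        # Translate the root's "." into the basename the planner uses.
        return os.path.basename(os.path.abspath(target))
    return rel


def turns_for(dir_path: str, target: str, plan: dict, default: int = 10) -> int:
    """Look up the planned turns, falling back to the default."""
    return plan.get(plan_key(dir_path, target), default)


plan = {"luminos_lib": 20}  # assumed shape: name -> turns
root_turns = turns_for("/repo/luminos_lib", "/repo/luminos_lib", plan)
child_turns = turns_for("/repo/luminos_lib/sub", "/repo/luminos_lib", plan)
print(root_turns, child_turns)
```

Without the `"."` translation, the root lookup misses and silently falls back to the default, which matches the symptom described above.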
 Second smoke test against the full repo root (`luminos/`) worked
 correctly. The planner made sensible tier assignments. `docs/wiki/`
 was classified as priority, which Jeff correctly defended: luminos is
 a general investigation tool, not a code-only tool, so documentation
 is a legitimate investigation target.
 The full-repo run also surfaced two more issues:
 - #78: Synthesis output is ephemeral (printed to terminal, not
  persisted). The tool's primary deliverable should be a first-class
  artifact.
 - #79: Stale cache entries from 2024 (with relative paths) coexist
  with new entries (absolute paths) because they hash differently.
  The synthesis pass saw 7 entries for 5 directories.
 The third smoke test ran against Jeff's homelab IaC repo (a project
 luminos had never seen before). The investigation produced an accurate,
 well-structured report that identified the three-layer architecture,
 the migration strategy, the enterprise-grade operational practices,
 and caught a critical flag (plaintext credentials in terraform.tfvars).
 Jeff shared that luminos is his learning project for agentic AI
 applications. He learns by doing, not reading, and wanted a truly hard
 problem that would push the AI.
 ## Key Decisions & Reasoning
 **Quality instrumentation ships with Phase 3, not after.** Raising the
 quality-measurement question before we changed the pipeline was
 the right call. We added three metrics: turn utilization per directory,
 completeness self-rating on submit_report, and plan_evaluation.json as
 the planning pass's report card. This paid off immediately when the
 smoke test revealed the #76 bug through the evaluation data.
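As an illustration of how such an entry might surface an allocation bug, here is a sketch of one per-directory record. Only `turns_allocated` appears in these notes; every other field name, and the `evaluation_entry` helper itself, is an assumption made for the example rather than the real `plan_evaluation.json` schema.

```python
import json


def evaluation_entry(path, planned_turns, turns_allocated, turns_used,
                     completeness=None):
    """Build one hypothetical per-directory evaluation record."""
    utilization = turns_used / turns_allocated if turns_allocated else 0.0
    return {
        "path": path,
        "planned_turns": planned_turns,      # what the plan asked for
        "turns_allocated": turns_allocated,  # what the orchestrator gave
        "turns_used": turns_used,
        "utilization": round(utilization, 2),
        "completeness": completeness,        # None when never reported
        "allocation_mismatch": planned_turns != turns_allocated,
    }


# The #76 symptom: plan said 20, orchestrator allocated the default 10.
entry = evaluation_entry("luminos_lib", planned_turns=20,
                         turns_allocated=10, turns_used=10)
print(json.dumps(entry, indent=2))
```

Recording planned and allocated turns side by side is what makes the discrepancy visible at a glance.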
 **Per-directory allocation, no mid-loop borrowing.** PLAN.md envisioned
 a global budget with mid-loop turn borrowing. We chose the simpler
 model: fixed per-dir allocations from the plan. The quality data will
 tell us if borrowing is needed. No speculative complexity.
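A rough sketch of the fixed-allocation model follows. The tier budgets (20, 10, 5) are taken from the numbers mentioned in these notes and may not be the real constants; `allocate_turns` and `simulate` are hypothetical names.

```python
# Assumed tier budgets, based on the figures quoted in these notes.
TIER_TURNS = {"priority": 20, "default": 10, "shallow": 5}


def allocate_turns(dirs, tiers):
    """Give each directory a fixed budget from its tier (default if unplanned)."""
    return {d: TIER_TURNS[tiers.get(d, "default")] for d in dirs}


def simulate(allocation, turns_needed):
    """Run each directory against its own budget; unused turns are not shared."""
    results = {}
    for d, budget in allocation.items():
        used = min(budget, turns_needed[d])
        results[d] = {"used": used, "finished": turns_needed[d] <= budget}
    return results


allocation = allocate_turns(
    ["docs/wiki", "src", "tests"],
    {"docs/wiki": "priority", "tests": "shallow"},
)
results = simulate(allocation, {"docs/wiki": 12, "src": 10, "tests": 8})
print(results)
```

In this toy run `docs/wiki` leaves 8 turns unused while `tests` is cut off at 5, which is exactly the trade-off the utilization data is meant to expose.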
 **Band-sorted ordering preserves leaf-first.** Rather than allowing
 arbitrary reordering (which would break child summaries), we sort
 directories into priority/default/shallow bands and preserve leaf-first
 within each band. This gives us "priority-first" without breaking the
 invariant.
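The ordering can be sketched as a single sort key. This assumes directories are relative POSIX-style paths and that depth (the count of `/`) is a good enough proxy for leaf-first; `band_sorted` and `BAND_ORDER` are illustrative names, not the real implementation.

```python
BAND_ORDER = {"priority": 0, "default": 1, "shallow": 2}


def band_sorted(dirs, tiers):
    """Sort by band first, then deepest-first (leaf-first) within a band."""
    def key(path):
        band = BAND_ORDER[tiers.get(path, "default")]
        depth = path.count("/")
        # Negative depth puts deeper directories earlier; path breaks ties.
        return (band, -depth, path)
    return sorted(dirs, key=key)


dirs = ["src", "src/core", "docs", "docs/wiki", "tests"]
tiers = {"src/core": "priority", "docs/wiki": "priority", "tests": "shallow"}
order = band_sorted(dirs, tiers)
print(order)
# ['docs/wiki', 'src/core', 'docs', 'src', 'tests']
```

The sketch only demonstrates the within-band ordering; how child summaries are handled across bands is outside what it shows.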
 **docs/wiki as priority is correct.** The planner classified wiki docs
 as priority. Initial reaction was to push back (it's not code), but
 Jeff correctly pointed out that luminos investigates filesystems, not
 codebases. Documentation is part of what a directory contains.
 ## Surprises & Discoveries
 **The #76 bug was immediately visible in plan_evaluation.json.** The
 evaluation showed `turns_allocated: 10` for a directory the plan said
 should get 20. Without the quality instrumentation, we would have had
 no signal that anything was wrong. The investigation would have
 produced a slightly worse report and we'd never have known why.
 **Stale cache contamination (#79).** Old cache entries from 2024 used
 relative paths while current code uses absolute paths. Since entries
 are keyed by SHA256(path), both versions coexist. `--fresh` doesn't
 clean this up because it reuses the investigation ID mapping. This is
 a pre-Phase 3 bug but we discovered it through Phase 3 testing.
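A small sketch shows why the two path forms coexist and how normalizing before hashing collapses them. The exact key derivation in luminos is not shown in these notes, so `cache_key` and `normalized_cache_key` are hypothetical, and `/repo` is a made-up base directory.

```python
import hashlib
import os


def cache_key(path: str) -> str:
    """Hash the path string as-is (the behavior described above)."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def normalized_cache_key(path: str, base: str) -> str:
    """Resolve the path against a base directory before hashing."""
    # os.path.join keeps `path` unchanged when it is already absolute.
    absolute = os.path.normpath(os.path.join(base, path))
    return cache_key(absolute)


# Same directory, two spellings: the raw keys differ.
print(cache_key("docs") == cache_key("/repo/docs"))
# After normalization both spellings map to one key.
print(normalized_cache_key("docs", "/repo")
      == normalized_cache_key("/repo/docs", "/repo"))
```

Entries already written under the old keys would still need to be cleaned up separately; normalization only prevents new duplicates.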
 **The wiki colors the investigation.** The full-repo synthesis report
 referenced "9 development phases" and "session logs tracking project
 evolution," which came from reading the wiki, not the code. The agent
 synthesizes documentary and code knowledge together. Whether this is
 good depends on the use case.
 ## Concerns & Open Threads
 **Synthesis output is ephemeral (#78).** The tool's primary deliverable
 vanishes when you close the terminal. This needs to be fixed before
 luminos is useful to anyone besides the person running it.
 **Stale cache problem (#79).** The path normalization issue could cause
 subtle quality degradation on any target that was previously
 investigated with an older version. Normalizing paths before hashing,
 combined with having `--fresh` wipe the cache dir, would fix it
 cleanly.
 **`tests` directory hit its ceiling.** In the full-repo run, the
 shallow allocation (5 turns) for `tests/` was fully consumed (100%
 utilization) and completeness wasn't reported, suggesting the agent
 ran out of turns before calling submit_report cleanly. The shallow
 default of 5 may be too tight for directories with many files.
 **Completeness field is not always reported.** The `tests` directory
 entry has no completeness value. This happens when the agent exhausts
 turns and the partial-flush path kicks in (it doesn't call
 submit_report, so completeness is never set). The instrumentation has
 a gap for partial entries.
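One possible way to close the gap is to mark partial entries explicitly instead of leaving the field absent. The sketch below is an assumption about how that could look; the entry shape and the `partial` and `"not_reported"` markers are invented for illustration.

```python
def missing_completeness(entries):
    """List the paths whose entries never received a completeness rating."""
    return [e["path"] for e in entries if e.get("completeness") is None]


def with_partial_marker(entry):
    """Return a copy that makes a missing rating explicit."""
    marked = dict(entry)
    marked["partial"] = marked.get("completeness") is None
    if marked["partial"]:
        # Explicit sentinel so readers can tell "unrated" from "lost".
        marked["completeness"] = "not_reported"
    return marked


entries = [
    {"path": "tests", "turns_allocated": 5, "turns_used": 5},
    {"path": "docs/wiki", "turns_allocated": 20, "turns_used": 12,
     "completeness": "high"},
]
print(missing_completeness(entries))
print([with_partial_marker(e) for e in entries])
```

An explicit marker would let later analysis separate "agent ran out of turns" from "agent rated itself low".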
 ## Raw Thinking
 The quality measurement question Jeff raised is more important than it
 seemed at first. We added three cheap metrics, but the fundamental
 challenge remains: luminos produces natural language output, and the
 only real quality judge is a human reading it. The instrumentation tells
 us about efficiency (did we waste turns?) and self-assessment (did the
 agent think it was thorough?), but not accuracy (did the agent get it
 right?).
 The homelab IaC smoke test was the most interesting data point. The
 agent had never seen the project, had no wiki to lean on, and produced
 a report that Jeff (who built the project) didn't dispute. That's a
 real signal. The quality isn't just "it sounds plausible"; it's "the
 person who wrote the code agrees with the assessment."
 Jeff's comment about learning by doing, not reading, maps directly to
 how luminos itself works: it investigates by reading files, not by
 being told what they are. The survey and planning passes are the
 "reading the map" phase; the dir loops are the "walking the terrain"
 phase. The synthesis is the "here's what I found" debrief. It's the
 same investigate-then-synthesize loop a human would use.
 ## What's Next
 Priority order for next session:
 1. **Fix #78 (synthesis persistence)** - save synthesis.json to cache,
   include AI analysis in `--json -o` output. Small scope, high impact.
 2. **Fix #79 (stale cache)** - normalize paths before hashing, consider
   `--fresh` wiping the cache dir entirely. Correctness fix.
 3. **Phase 4 review (#40)** - reassess Phase 4+ issues after the MCP
   pivot and Phase 3 experience. Some issues may be stale or
   reprioritized.
 4. **Phase 4 (External Knowledge Tools)** or **Phase 5 (Scale-Tiered
   Synthesis)** depending on what #40 reveals.
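For item 1, persisting the synthesis could be as small as an atomic JSON write. This is a sketch under assumptions: `save_synthesis` is a hypothetical function, the cache directory layout is not taken from the codebase, and the synthesis is assumed to be a JSON-serializable dict.

```python
import json
import os
import tempfile


def save_synthesis(cache_dir: str, synthesis: dict) -> str:
    """Write synthesis.json into cache_dir without leaving a partial file."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, "synthesis.json")
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(synthesis, handle, indent=2)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


with tempfile.TemporaryDirectory() as demo_dir:
    saved = save_synthesis(demo_dir, {"summary": "example"})
    with open(saved, encoding="utf-8") as handle:
        roundtrip = json.load(handle)
    print(roundtrip)
```

Writing to a temp file first means an interrupted run leaves either the old synthesis or the new one, never a truncated file.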
--- a/SessionRetrospectives.md
+++ b/SessionRetrospectives.md
@@ -11,6 +11,7 @@
 | [Session 7](Session7) | 2026-04-07 | Phase 1 audit (#1 closed, only #54 remains); gitea MCP credential overhaul — dedicated `claude-code` Forgejo user with admin on luminos, write+delete verified |
 | [Session 8](Session8) | 2026-04-07 | Closed #54 — added confidence/confidence_reason to write_cache tool schema description; Phase 1 milestone now 4/4 complete |
 | [Session 9](Session9) | 2026-04-11 | Scope shift (#64) + all Phase 3 prereqs: dir loop refactor (#57), tool registry consolidation (#56), pure-helper test coverage waves 1+2 (#55, #70), leaf-first contract docs (#72). 6 PRs, 70 new tests (164→234), Phase 2.6/2.7/2.8 milestones complete |
 | [Session 10](Session10) | 2026-04-12 | Phase 3 shipped: planning pass, dynamic turn allocation, quality instrumentation (#8, #9, #10, #11, #74). Found and fixed root-path matching bug (#76). Smoke tests against luminos and homelab IaC. Filed #78 (synthesis persistence), #79 (stale cache). 3 PRs, 28 new tests (234→262) |
 ---