retro: Session 10 — Phase 3 shipped, quality instrumentation, smoke tests

2026-04-12 21:15:55 -06:00 · 2026-04-12 21:15:55 -06:00 · 36f0834c93
commit 36f0834c93
parent 3fcf8c221d
2 changed files with 159 additions and 0 deletions
--- a/Session10.md
+++ b/Session10.md
@@ -0,0 +1,158 @@
+# Session 10 Notes — 2026-04-12
+
+## What We Set Out to Do
+
+Ship Phase 3: Investigation Planning. The design sketch had been deferred
+from Session 9. The goal was to plan the nuts and bolts, then implement
+everything: planning pass, dynamic turn allocation, plan caching, and
+quality instrumentation.
+
+## What Actually Happened
+
+Started with a design discussion. Jeff asked the right foundational
+question before any code was written: "How can we measure the quality
+of the data being produced?" This reframed the work from "add a
+planning pass" to "add a planning pass and the instrumentation to know
+if it's helping."
+
+Implementation went smoothly. All four Phase 3 issues (#8, #9, #10,
+#11) plus the new quality instrumentation issue (#74) shipped in a
+single PR (#75): 943 insertions, 26 new tests (234 to 260), all
+passing.
+
+Smoke test against `luminos_lib/` immediately caught a bug: the
+planning pass correctly identified the target root as priority with 20
+turns, but the orchestrator gave it 10 (the default). Root cause
+analysis revealed a path-matching mismatch in `_apply_plan()`: the
+planner uses `basename(target)` (from the tree output) but the lookup
+table used `"."` (from `os.path.relpath`). Filed as #76, fixed in PR
+#77 (43 insertions, 2 new tests, 262 total).
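
A minimal sketch of how this kind of key mismatch can arise, assuming the plan is keyed by directory name and the orchestrator looks entries up by relative path. Only the `basename(target)` versus `"."` detail and the 20/10 turn counts come from the notes above; the function names `plan_key_from_tree`, `lookup_key`, and `normalized_key` are hypothetical and are not the actual `_apply_plan()` code.

```python
import os

DEFAULT_TURNS = 10  # default allocation mentioned in the notes


def plan_key_from_tree(target: str) -> str:
    """Key the planner would emit: the basename shown in the tree output."""
    return os.path.basename(os.path.abspath(target))


def lookup_key(target: str, directory: str) -> str:
    """Key the orchestrator used: path relative to the target root."""
    return os.path.relpath(directory, target)


def normalized_key(target: str, directory: str) -> str:
    """One possible fix (an assumption): map the root's "." to its basename."""
    rel = os.path.relpath(directory, target)
    return plan_key_from_tree(target) if rel == "." else rel


target = "/tmp/example/luminos_lib"  # illustrative path
plan = {plan_key_from_tree(target): 20}  # planner: root is priority, 20 turns

# Buggy lookup: "." is not a key in the plan, so the default wins.
buggy_turns = plan.get(lookup_key(target, target), DEFAULT_TURNS)
# Normalized lookup: both sides agree on the root's name.
fixed_turns = plan.get(normalized_key(target, target), DEFAULT_TURNS)
print(buggy_turns, fixed_turns)
```

The lookup fails silently because `dict.get` falls back to the default, which is consistent with the bug only surfacing through the evaluation data rather than through an error.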
+
+Second smoke test against the full repo root (`luminos/`) worked
+correctly. The planner made sensible tier assignments. `docs/wiki/`
+was classified as priority, which Jeff correctly defended: luminos is
+a general investigation tool, not a code-only tool, so documentation
+is a legitimate investigation target.
+
+The full-repo run also surfaced two more issues:
+- #78: Synthesis output is ephemeral (printed to terminal, not
+  persisted). The tool's primary deliverable should be a first-class
+  artifact.
+- #79: Stale cache entries from 2024 (with relative paths) coexist
+  with new entries (absolute paths) because they hash differently.
+  The synthesis pass saw 7 entries for 5 directories.
+
+The third smoke test ran against Jeff's homelab IaC repo (a project
+luminos had never seen before). The investigation produced an accurate,
+well-structured report that identified the three-layer architecture,
+the migration strategy, and the enterprise-grade operational practices,
+and flagged a critical issue (plaintext credentials in `terraform.tfvars`).
+
+Jeff shared that luminos is his learning project for agentic AI
+applications. He learns by doing, not reading, and wanted a truly hard
+problem that would push the AI.
+
+## Key Decisions & Reasoning
+
+**Quality instrumentation ships with Phase 3, not after.** Jeff's
+push to measure quality before we changed the pipeline was the right
+call. We added three metrics: turn utilization per directory, a
+completeness self-rating on `submit_report`, and `plan_evaluation.json`
+as the planning pass's report card. This paid off immediately when the
+smoke test revealed the #76 bug through the evaluation data.
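
A rough sketch of how the three metrics could be read together, assuming one record per directory. Only the `turns_allocated` field name appears in these notes; `turns_planned`, `turns_used`, `completeness`, and the `DirEvaluation` class itself are hypothetical, and the values are illustrative rather than copied from a real `plan_evaluation.json`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DirEvaluation:
    path: str
    turns_planned: int                   # hypothetical: what the plan asked for
    turns_allocated: int                 # field name that appears in the notes
    turns_used: int                      # hypothetical
    completeness: Optional[str] = None   # hypothetical; None when never reported

    @property
    def utilization(self) -> float:
        """Fraction of the allocated turns that were actually consumed."""
        if self.turns_allocated == 0:
            return 0.0
        return self.turns_used / self.turns_allocated


def allocation_mismatches(entries: List[DirEvaluation]) -> List[str]:
    """Directories where the orchestrator did not honor the plan."""
    return [e.path for e in entries if e.turns_allocated != e.turns_planned]


def missing_completeness(entries: List[DirEvaluation]) -> List[str]:
    """Directories whose self-rating was never recorded."""
    return [e.path for e in entries if e.completeness is None]


entries = [
    DirEvaluation("luminos_lib", turns_planned=20, turns_allocated=10,
                  turns_used=10, completeness="high"),
    DirEvaluation("tests", turns_planned=5, turns_allocated=5, turns_used=5),
]
mismatched = allocation_mismatches(entries)
unrated = missing_completeness(entries)
utilizations = {e.path: e.utilization for e in entries}
print(mismatched, unrated, utilizations)
```

A check along these lines would flag both a plan that was not applied and an entry that hit 100% utilization without a self-rating.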
+
+**Per-directory allocation, no mid-loop borrowing.** PLAN.md envisioned
+a global budget with mid-loop turn borrowing. We chose the simpler
+model: fixed per-dir allocations from the plan. The quality data will
+tell us if borrowing is needed. No speculative complexity.
+
+**Band-sorted ordering preserves leaf-first.** Rather than allowing
+arbitrary reordering (which would break child summaries), we sort
+directories into priority/default/shallow bands and preserve leaf-first
+within each band. This gives us "priority-first" without breaking the
+invariant.
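
The band sort can be sketched as a stable sort over a list that is already in leaf-first order. This is an illustration under assumptions, not the luminos implementation: `band_sort`, `BAND_ORDER`, and the example directory names are hypothetical, and unlisted directories are assumed to fall into the default band.

```python
BAND_ORDER = {"priority": 0, "default": 1, "shallow": 2}


def band_sort(leaf_first_dirs, tiers):
    """Group directories by band; Python's sort is stable, so the incoming
    leaf-first order is kept among directories that share a band."""
    return sorted(
        leaf_first_dirs,
        key=lambda d: BAND_ORDER[tiers.get(d, "default")],
    )


# Input is assumed to be leaf-first already (children before parents).
leaf_first = ["src/core/util", "src/core", "tests", "docs/wiki", "docs", "src"]
tiers = {
    "src/core/util": "priority",
    "src/core": "priority",
    "docs/wiki": "priority",
    "tests": "shallow",
}
ordered = band_sort(leaf_first, tiers)
print(ordered)
```

This sketch only demonstrates the within-band guarantee. How a parent and child that land in different bands are handled is not covered by these notes.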
+
+**docs/wiki as priority is correct.** The planner classified wiki docs
+as priority. Initial reaction was to push back (it's not code), but
+Jeff correctly pointed out that luminos investigates filesystems, not
+codebases. Documentation is part of what a directory contains.
+
+## Surprises & Discoveries
+
+**The #76 bug was immediately visible in plan_evaluation.json.** The
+evaluation showed `turns_allocated: 10` for a directory the plan said
+should get 20. Without the quality instrumentation, we would have had
+no signal that anything was wrong. The investigation would have
+produced a slightly worse report and we'd never have known why.
+
+**Stale cache contamination (#79).** Old cache entries from 2024 used
+relative paths while current code uses absolute paths. Since entries
+are keyed by SHA256(path), both versions coexist. `--fresh` doesn't
+clean this up because it reuses the investigation ID mapping. This is
+a pre-Phase 3 bug but we discovered it through Phase 3 testing.
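
A small sketch of why the two path styles never collide, and of the normalize-before-hashing idea. It assumes the key is a SHA-256 digest of the path string encoded as UTF-8; the helper names `cache_key` and `normalized_cache_key` and the example paths are hypothetical.

```python
import hashlib
import os


def cache_key(path: str) -> str:
    """Hash the path string exactly as given (the behavior described above)."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def normalized_cache_key(path: str, base: str) -> str:
    """Assumed fix: resolve the path against a base directory before hashing."""
    absolute = os.path.normpath(os.path.join(base, path))
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()


base = "/home/user/luminos"                  # illustrative target root
old_style = "docs/wiki"                      # relative path, older entries
new_style = "/home/user/luminos/docs/wiki"   # absolute path, current entries

same_raw = cache_key(old_style) == cache_key(new_style)
same_normalized = (
    normalized_cache_key(old_style, base) == normalized_cache_key(new_style, base)
)
print(same_raw, same_normalized)
```

Normalizing only stops new duplicates from being written. Entries already stored under the old keys would still need to be cleaned up or migrated.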
+
+**The wiki colors the investigation.** The full-repo synthesis report
+referenced "9 development phases" and "session logs tracking project
+evolution," which came from reading the wiki, not the code. The agent
+synthesizes documentary and code knowledge together. Whether this is
+good depends on the use case.
+
+## Concerns & Open Threads
+
+**Synthesis output is ephemeral (#78).** The tool's primary deliverable
+vanishes when you close the terminal. This needs to be fixed before
+luminos is useful to anyone besides the person running it.
+
+**Stale cache problem (#79).** The path normalization issue could cause
+subtle quality degradation on any target that was previously
+investigated with an older version. Normalizing paths before hashing,
+combined with having `--fresh` wipe the cache dir, would fix it cleanly.
+
+**`tests` directory hit its ceiling.** In the full-repo run, the
+shallow allocation (5 turns) for `tests/` was fully consumed (100%
+utilization) and completeness wasn't reported, suggesting the agent
+ran out of turns before calling submit_report cleanly. The shallow
+default of 5 may be too tight for directories with many files.
+
+**Completeness field is not always reported.** The `tests` directory
+entry has no completeness value. This happens when the agent exhausts
+turns and the partial-flush path kicks in (it doesn't call
+submit_report, so completeness is never set). The instrumentation has
+a gap for partial entries.
+
+## Raw Thinking
+
+The quality measurement question Jeff raised is more important than it
+seemed at first. We added three cheap metrics, but the fundamental
+challenge remains: luminos produces natural language output, and the
+only real quality judge is a human reading it. The instrumentation tells
+us about efficiency (did we waste turns?) and self-assessment (did the
+agent think it was thorough?), but not accuracy (did the agent get it
+right?).
+
+The homelab IaC smoke test was the most interesting data point. The
+agent had never seen the project, had no wiki to lean on, and produced
+a report that Jeff (who built the project) didn't dispute. That's a
+real signal. The quality isn't just "it sounds plausible"; it's "the
+person who wrote the code agrees with the assessment."
+
+Jeff's comment about learning by doing, not reading, maps directly to
+how luminos itself works: it investigates by reading files, not by
+being told what they are. The survey and planning passes are the
+"reading the map" phase; the dir loops are the "walking the terrain"
+phase. The synthesis is the "here's what I found" debrief. It's the
+same investigate-then-synthesize loop a human would use.
+
+## What's Next
+
+Priority order for next session:
+
+1. **Fix #78 (synthesis persistence)** - save synthesis.json to cache,
+   include AI analysis in `--json -o` output. Small scope, high impact.
+2. **Fix #79 (stale cache)** - normalize paths before hashing, consider
+   `--fresh` wiping the cache dir entirely. Correctness fix.
+3. **Phase 4 review (#40)** - reassess Phase 4+ issues after the MCP
+   pivot and Phase 3 experience. Some issues may be stale or
+   reprioritized.
+4. **Phase 4 (External Knowledge Tools)** or **Phase 5 (Scale-Tiered
+   Synthesis)** depending on what #40 reveals.
--- a/SessionRetrospectives.md
+++ b/SessionRetrospectives.md
@@ -11,6 +11,7 @@
 | [Session 7](Session7) | 2026-04-07 | Phase 1 audit (#1 closed, only #54 remains); gitea MCP credential overhaul — dedicated `claude-code` Forgejo user with admin on luminos, write+delete verified |
 | [Session 8](Session8) | 2026-04-07 | Closed #54 — added confidence/confidence_reason to write_cache tool schema description; Phase 1 milestone now 4/4 complete |
 | [Session 9](Session9) | 2026-04-11 | Scope shift (#64) + all Phase 3 prereqs: dir loop refactor (#57), tool registry consolidation (#56), pure-helper test coverage waves 1+2 (#55, #70), leaf-first contract docs (#72). 6 PRs, 70 new tests (164→234), Phase 2.6/2.7/2.8 milestones complete |
+| [Session 10](Session10) | 2026-04-12 | Phase 3 shipped: planning pass, dynamic turn allocation, quality instrumentation (#8, #9, #10, #11, #74). Found and fixed root-path matching bug (#76). Smoke tests against luminos and homelab IaC. Filed #78 (synthesis persistence), #79 (stale cache). 3 PRs, 28 new tests (234→262) |

 ---