diff --git a/Session10.md b/Session10.md
new file mode 100644
index 0000000..4e6bc65
--- /dev/null
+++ b/Session10.md
@@ -0,0 +1,158 @@
+# Session 10 Notes — 2026-04-12
+
+## What We Set Out to Do
+
+Ship Phase 3: Investigation Planning. The design sketch had been deferred
+from Session 9. The goal was to plan the nuts and bolts, then implement
+everything: planning pass, dynamic turn allocation, plan caching, and
+quality instrumentation.
+
+## What Actually Happened
+
+Started with a design discussion. Jeff asked the right foundational
+question before any code was written: "How can we measure the quality
+of the data being produced?" This reframed the work from "add a
+planning pass" to "add a planning pass and the instrumentation to know
+if it's helping."
+
+Implementation went smoothly. All four Phase 3 issues (#8, #9, #10,
+#11) plus the new quality instrumentation issue (#74) shipped in a
+single PR (#75): 943 insertions, 26 new tests (234 to 260), all
+passing.
+
+Smoke test against `luminos_lib/` immediately caught a bug: the
+planning pass correctly identified the target root as priority with 20
+turns, but the orchestrator gave it 10 (the default). Root cause
+analysis revealed a path-matching mismatch in `_apply_plan()`: the
+planner uses `basename(target)` (from the tree output) but the lookup
+table used `"."` (from `os.path.relpath`). Filed as #76, fixed in PR
+#77 (43 insertions, 2 new tests, 262 total).
+
+Second smoke test against the full repo root (`luminos/`) worked
+correctly. The planner made sensible tier assignments. `docs/wiki/`
+was classified as priority, which Jeff correctly defended: luminos is
+a general investigation tool, not a code-only tool, so documentation
+is a legitimate investigation target.
+
+The full-repo run also surfaced two more issues:
+- #78: Synthesis output is ephemeral (printed to terminal, not
+  persisted). The tool's primary deliverable should be a first-class
+  artifact.
+- #79: Stale cache entries from 2024 (with relative paths) coexist
+  with new entries (absolute paths) because they hash differently.
+  The synthesis pass saw 7 entries for 5 directories.
+
+Third smoke test against Jeff's homelab IaC repo (a project luminos
+had never seen before). The investigation produced an accurate,
+well-structured report that identified the three-layer architecture,
+the migration strategy, the enterprise-grade operational practices,
+and caught a critical flag (plaintext credentials in terraform.tfvars).
+
+Jeff shared that luminos is his learning project for agentic AI
+applications. He learns by doing, not reading, and wanted a truly hard
+problem that would push the AI.
+
+## Key Decisions & Reasoning
+
+**Quality instrumentation ships with Phase 3, not after.** Jeff's
+question about measuring quality before we changed the pipeline was
+the right call. We added three metrics: turn utilization per directory,
+completeness self-rating on submit_report, and plan_evaluation.json as
+the planning pass's report card. This paid off immediately when the
+smoke test revealed the #76 bug through the evaluation data.
+
+**Per-directory allocation, no mid-loop borrowing.** PLAN.md envisioned
+a global budget with mid-loop turn borrowing. We chose the simpler
+model: fixed per-dir allocations from the plan. The quality data will
+tell us if borrowing is needed. No speculative complexity.
+
+**Band-sorted ordering preserves leaf-first.** Rather than allowing
+arbitrary reordering (which would break child summaries), we sort
+directories into priority/default/shallow bands and preserve leaf-first
+within each band. This gives us "priority-first" without breaking the
+invariant.
+
+**docs/wiki as priority is correct.** The planner classified wiki docs
+as priority. Initial reaction was to push back (it's not code), but
+Jeff correctly pointed out that luminos investigates filesystems, not
+codebases.
Documentation is part of what a directory contains.
+
+## Surprises & Discoveries
+
+**The #76 bug was immediately visible in plan_evaluation.json.** The
+evaluation showed `turns_allocated: 10` for a directory the plan said
+should get 20. Without the quality instrumentation, we would have had
+no signal that anything was wrong. The investigation would have
+produced a slightly worse report and we'd never have known why.
+
+**Stale cache contamination (#79).** Old cache entries from 2024 used
+relative paths while current code uses absolute paths. Since entries
+are keyed by SHA256(path), both versions coexist. `--fresh` doesn't
+clean this up because it reuses the investigation ID mapping. This is
+a pre-Phase 3 bug but we discovered it through Phase 3 testing.
+
+**The wiki colors the investigation.** The full-repo synthesis report
+referenced "9 development phases" and "session logs tracking project
+evolution," which came from reading the wiki, not the code. The agent
+synthesizes documentary and code knowledge together. Whether this is
+good depends on the use case.
+
+## Concerns & Open Threads
+
+**Synthesis output is ephemeral (#78).** The tool's primary deliverable
+vanishes when you close the terminal. This needs to be fixed before
+luminos is useful to anyone besides the person running it.
+
+**Stale cache problem (#79).** The path normalization issue could cause
+subtle quality degradation on any target that was previously
+investigated with an older version. Option 1 (normalize before hashing)
+plus option 2 (`--fresh` wipes the cache dir) would fix it cleanly.
+
+**`tests` directory hit its ceiling.** In the full-repo run, the
+shallow allocation (5 turns) for `tests/` was fully consumed (100%
+utilization) and completeness wasn't reported, suggesting the agent
+ran out of turns before calling submit_report cleanly. The shallow
+default of 5 may be too tight for directories with many files.
+
+**Completeness field is not always reported.** The `tests` directory
+entry has no completeness value. This happens when the agent exhausts
+turns and the partial-flush path kicks in (it doesn't call
+submit_report, so completeness is never set). The instrumentation has
+a gap for partial entries.
+
+## Raw Thinking
+
+The quality measurement question Jeff raised is more important than it
+seemed at first. We added three cheap metrics, but the fundamental
+challenge remains: luminos produces natural language output, and the
+only real quality judge is a human reading it. The instrumentation tells
+us about efficiency (did we waste turns?) and self-assessment (did the
+agent think it was thorough?), but not accuracy (did the agent get it
+right?).
+
+The homelab IaC smoke test was the most interesting data point. The
+agent had never seen the project, had no wiki to lean on, and produced
+a report that Jeff (who built the project) didn't dispute. That's a
+real signal. The quality isn't just "it sounds plausible"; it's "the
+person who wrote the code agrees with the assessment."
+
+Jeff's comment about learning by doing, not reading, maps directly to
+how luminos itself works: it investigates by reading files, not by
+being told what they are. The survey and planning passes are the
+"reading the map" phase; the dir loops are the "walking the terrain"
+phase. The synthesis is the "here's what I found" debrief. It's the
+same investigate-then-synthesize loop a human would use.
+
+## What's Next
+
+Priority order for next session:
+
+1. **Fix #78 (synthesis persistence)** - save synthesis.json to cache,
+   include AI analysis in `--json -o` output. Small scope, high impact.
+2. **Fix #79 (stale cache)** - normalize paths before hashing, consider
+   `--fresh` wiping the cache dir entirely. Correctness fix.
+3. **Phase 4 review (#40)** - reassess Phase 4+ issues after the MCP
+   pivot and Phase 3 experience. Some issues may be stale or
+   reprioritized.
+4.
**Phase 4 (External Knowledge Tools)** or **Phase 5 (Scale-Tiered
+   Synthesis)** depending on what #40 reveals.
diff --git a/SessionRetrospectives.md b/SessionRetrospectives.md
index bb28728..0f7e7e1 100644
--- a/SessionRetrospectives.md
+++ b/SessionRetrospectives.md
@@ -11,6 +11,7 @@
 | [Session 7](Session7) | 2026-04-07 | Phase 1 audit (#1 closed, only #54 remains); gitea MCP credential overhaul — dedicated `claude-code` Forgejo user with admin on luminos, write+delete verified |
 | [Session 8](Session8) | 2026-04-07 | Closed #54 — added confidence/confidence_reason to write_cache tool schema description; Phase 1 milestone now 4/4 complete |
 | [Session 9](Session9) | 2026-04-11 | Scope shift (#64) + all Phase 3 prereqs: dir loop refactor (#57), tool registry consolidation (#56), pure-helper test coverage waves 1+2 (#55, #70), leaf-first contract docs (#72). 6 PRs, 70 new tests (164→234), Phase 2.6/2.7/2.8 milestones complete |
+| [Session 10](Session10) | 2026-04-12 | Phase 3 shipped: planning pass, dynamic turn allocation, quality instrumentation (#8, #9, #10, #11, #74). Found and fixed root-path matching bug (#76). Smoke tests against luminos and homelab IaC. Filed #78 (synthesis persistence), #79 (stale cache). 3 PRs, 28 new tests (234→262) |
 ---