retro: Session 10 — Phase 3 shipped, quality instrumentation, smoke tests
parent
3fcf8c221d
commit
36f0834c93
2 changed files with 159 additions and 0 deletions
158
Session10.md
Normal file
@ -0,0 +1,158 @@
# Session 10 Notes — 2026-04-12

## What We Set Out to Do

Ship Phase 3: Investigation Planning. The design sketch had been deferred
from Session 9. The goal was to plan the nuts and bolts, then implement
everything: planning pass, dynamic turn allocation, plan caching, and
quality instrumentation.

## What Actually Happened

Started with a design discussion. Jeff asked the right foundational
question before any code was written: "How can we measure the quality
of the data being produced?" This reframed the work from "add a
planning pass" to "add a planning pass and the instrumentation to know
if it's helping."

Implementation went smoothly. All four Phase 3 issues (#8, #9, #10,
#11) plus the new quality instrumentation issue (#74) shipped in a
single PR (#75): 943 insertions, 26 new tests (234 to 260), all
passing.

Smoke test against `luminos_lib/` immediately caught a bug: the
planning pass correctly identified the target root as priority with 20
turns, but the orchestrator gave it 10 (the default). Root cause
analysis revealed a path-matching mismatch in `_apply_plan()`: the
planner names the root by `basename(target)` (from the tree output), but
the lookup table keyed it as `"."` (from `os.path.relpath`), so the
root's plan entry never matched. Filed as #76, fixed in PR
#77 (43 insertions, 2 new tests, 262 total).

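The mismatch is easy to reproduce in isolation. The following is a minimal sketch, not the actual `_apply_plan()` code: the helper names `relpath_key`, `planner_root_key`, and `normalize_plan_key` are hypothetical, and the target path is made up. It shows why the two sides disagree on the root's key, plus one possible way to reconcile them (the real fix in PR #77 may differ).

```python
import os


def relpath_key(path: str, target: str) -> str:
    """Key as the lookup table built it: the path relative to the target."""
    return os.path.relpath(path, target)


def planner_root_key(target: str) -> str:
    """Key as the planner names the root: the target's basename."""
    return os.path.basename(os.path.normpath(target))


def normalize_plan_key(key: str, target: str) -> str:
    """Hypothetical reconciliation: treat the root's basename as "."."""
    # Caveat: a child directory that shares the root's name would collide.
    return "." if key == planner_root_key(target) else key


target = "/work/luminos_lib"  # made-up location for illustration
table_key = relpath_key(target, target)
plan_key = planner_root_key(target)

print(table_key)              # .
print(plan_key)               # luminos_lib
print(table_key == plan_key)  # False: the root's plan entry is missed
print(normalize_plan_key(plan_key, target) == table_key)  # True
```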
Second smoke test against the full repo root (`luminos/`) worked
correctly. The planner made sensible tier assignments. `docs/wiki/`
was classified as priority, which Jeff correctly defended: luminos is
a general investigation tool, not a code-only tool, so documentation
is a legitimate investigation target.

The full-repo run also surfaced two more issues:
- #78: Synthesis output is ephemeral (printed to terminal, not
  persisted). The tool's primary deliverable should be a first-class
  artifact.
- #79: Stale cache entries from 2024 (with relative paths) coexist
  with new entries (absolute paths) because they hash differently.
  The synthesis pass saw 7 entries for 5 directories.

Third smoke test ran against Jeff's homelab IaC repo (a project luminos
had never seen before). The investigation produced an accurate,
well-structured report that identified the three-layer architecture,
the migration strategy, and the enterprise-grade operational practices,
and it caught a critical flag (plaintext credentials in terraform.tfvars).

Jeff shared that luminos is his learning project for agentic AI
applications. He learns by doing, not reading, and wanted a truly hard
problem that would push the AI.

## Key Decisions & Reasoning

**Quality instrumentation ships with Phase 3, not after.** Jeff's push
to measure quality before we changed the pipeline was the right call.
We added three metrics: turn utilization per directory, a completeness
self-rating on submit_report, and plan_evaluation.json as the planning
pass's report card. This paid off immediately when the smoke test
revealed the #76 bug through the evaluation data.

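To make the three metrics concrete, here is a rough sketch of what one record per directory could look like. Only `turns_allocated` is a field name that appears in these notes; `DirEvaluation`, `turns_planned`, `turns_used`, `turn_utilization`, `allocation_mismatch`, and the 0 to 1 completeness scale are assumptions for illustration, and the values are invented rather than taken from a real run.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirEvaluation:
    path: str
    turns_planned: int             # what the plan asked for (hypothetical field)
    turns_allocated: int           # what the orchestrator actually granted
    turns_used: int                # turns the dir loop consumed (hypothetical field)
    completeness: Optional[float]  # agent self-rating; None if never reported

    def to_record(self) -> dict:
        utilization = (
            self.turns_used / self.turns_allocated if self.turns_allocated else 0.0
        )
        return {
            "path": self.path,
            "turns_planned": self.turns_planned,
            "turns_allocated": self.turns_allocated,
            "turns_used": self.turns_used,
            "turn_utilization": round(utilization, 2),
            "completeness": self.completeness,
            # A planned/allocated mismatch is the kind of signal that exposed #76.
            "allocation_mismatch": self.turns_planned != self.turns_allocated,
        }


# Illustrative values only, not taken from a real run.
entries = [
    DirEvaluation(".", turns_planned=20, turns_allocated=10,
                  turns_used=10, completeness=0.7),
    DirEvaluation("tests", turns_planned=5, turns_allocated=5,
                  turns_used=5, completeness=None),
]
report = {"directories": [e.to_record() for e in entries]}
print(json.dumps(report, indent=2))
```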
**Per-directory allocation, no mid-loop borrowing.** PLAN.md envisioned
a global budget with mid-loop turn borrowing. We chose the simpler
model: fixed per-dir allocations from the plan. The quality data will
tell us if borrowing is needed. No speculative complexity.

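A fixed per-directory model can be as small as a lookup from tier to turn budget. The sketch below assumes each tier maps to a single budget; the numbers 20, 10, and 5 are the priority, default, and shallow figures mentioned in these notes, while `TIER_TURNS` and `allocate_turns` are hypothetical names and the real planner may assign turns per directory instead.

```python
# Assumed tier-to-budget mapping; the figures come from the smoke-test notes.
TIER_TURNS = {"priority": 20, "default": 10, "shallow": 5}


def allocate_turns(plan: dict[str, str], directories: list[str]) -> dict[str, int]:
    """Give each directory a fixed budget; unplanned or unknown tiers get the default."""
    allocations = {}
    for directory in directories:
        tier = plan.get(directory, "default")
        allocations[directory] = TIER_TURNS.get(tier, TIER_TURNS["default"])
    return allocations


plan = {"docs/wiki": "priority", "tests": "shallow"}
allocations = allocate_turns(plan, ["docs/wiki", "tests", "src"])
print(allocations)  # {'docs/wiki': 20, 'tests': 5, 'src': 10}
```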
**Band-sorted ordering preserves leaf-first.** Rather than allowing
arbitrary reordering (which would break child summaries), we sort
directories into priority/default/shallow bands and preserve leaf-first
within each band. This gives us "priority-first" without breaking the
invariant.

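One way to get this behavior is a stable sort keyed only on the band, applied to a list that is already in leaf-first order. This is a sketch under that assumption, with hypothetical names (`BAND_RANK`, `band_sorted`) and made-up paths. Note that it only guarantees leaf-first order within a band; how the real implementation treats a child and parent that land in different bands is not covered here.

```python
BAND_RANK = {"priority": 0, "default": 1, "shallow": 2}


def band_sorted(leaf_first: list[str], tiers: dict[str, str]) -> list[str]:
    """Order directories by band, keeping the incoming order inside each band."""
    # sorted() is stable, so entries with equal rank keep their leaf-first order.
    return sorted(leaf_first, key=lambda d: BAND_RANK[tiers.get(d, "default")])


# Input is already leaf-first: children appear before their parents.
leaf_first = ["a/b/c", "a/b", "x/y", "a", "x", "."]
tiers = {"a/b": "priority", "a": "priority", "x/y": "shallow"}

ordered = band_sorted(leaf_first, tiers)
print(ordered)  # ['a/b', 'a', 'a/b/c', 'x', '.', 'x/y']
```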
**docs/wiki as priority is correct.** The planner classified wiki docs
as priority. Initial reaction was to push back (it's not code), but
Jeff correctly pointed out that luminos investigates filesystems, not
codebases. Documentation is part of what a directory contains.

## Surprises & Discoveries

**The #76 bug was immediately visible in plan_evaluation.json.** The
evaluation showed `turns_allocated: 10` for a directory the plan said
should get 20. Without the quality instrumentation, we would have had
no signal that anything was wrong. The investigation would have
produced a slightly worse report and we'd never have known why.

**Stale cache contamination (#79).** Old cache entries from 2024 used
relative paths while current code uses absolute paths. Since entries
are keyed by SHA256(path), both versions coexist. `--fresh` doesn't
clean this up because it reuses the investigation ID mapping. This is
a pre-Phase 3 bug but we discovered it through Phase 3 testing.

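The coexistence follows directly from hashing the raw path string. Here is a small sketch of the problem and of the "normalize before hashing" idea; the notes only say entries are keyed by SHA256(path), so the UTF-8 encoding, the helper names `cache_key` and `normalized_cache_key`, and the example paths are all assumptions.

```python
import hashlib
import os


def cache_key(path: str) -> str:
    """Hash the path string exactly as given."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def normalized_cache_key(path: str, base: str) -> str:
    """Resolve the path against a base directory before hashing."""
    # os.path.join keeps `path` unchanged when it is already absolute.
    absolute = os.path.normpath(os.path.join(base, path))
    return cache_key(absolute)


base = "/repo"                # made-up investigation root
old_style = "docs/wiki"       # relative path, as older entries stored it
new_style = "/repo/docs/wiki" # absolute path, as current code stores it

# Same directory, two different keys: both entries survive in the cache.
print(cache_key(old_style) == cache_key(new_style))  # False

# After normalization both spellings collapse to one key.
print(normalized_cache_key(old_style, base) == normalized_cache_key(new_style, base))  # True
```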
**The wiki colors the investigation.** The full-repo synthesis report
referenced "9 development phases" and "session logs tracking project
evolution," which came from reading the wiki, not the code. The agent
synthesizes documentary and code knowledge together. Whether this is
good depends on the use case.

## Concerns & Open Threads

**Synthesis output is ephemeral (#78).** The tool's primary deliverable
vanishes when you close the terminal. This needs to be fixed before
luminos is useful to anyone besides the person running it.

**Stale cache problem (#79).** The path normalization issue could cause
subtle quality degradation on any target that was previously
investigated with an older version. Option 1 (normalize before hashing)
plus option 2 (`--fresh` wipes the cache dir) would fix it cleanly.

**`tests` directory hit its ceiling.** In the full-repo run, the
shallow allocation (5 turns) for `tests/` was fully consumed (100%
utilization) and completeness wasn't reported, suggesting the agent
ran out of turns before calling submit_report cleanly. The shallow
default of 5 may be too tight for directories with many files.

**Completeness field is not always reported.** The `tests` directory
entry has no completeness value. This happens when the agent exhausts
turns and the partial-flush path kicks in (it doesn't call
submit_report, so completeness is never set). The instrumentation has
a gap for partial entries.

## Raw Thinking

The quality measurement question Jeff raised is more important than it
seemed at first. We added three cheap metrics, but the fundamental
challenge remains: luminos produces natural language output, and the
only real quality judge is a human reading it. The instrumentation tells
us about efficiency (did we waste turns?) and self-assessment (did the
agent think it was thorough?), but not accuracy (did the agent get it
right?).

The homelab IaC smoke test was the most interesting data point. The
agent had never seen the project, had no wiki to lean on, and produced
a report that Jeff (who built the project) didn't dispute. That's a
real signal. The quality isn't just "it sounds plausible"; it's "the
person who wrote the code agrees with the assessment."

Jeff's comment about learning by doing, not reading, maps directly to
how luminos itself works: it investigates by reading files, not by
being told what they are. The survey and planning passes are the
"reading the map" phase; the dir loops are the "walking the terrain"
phase. The synthesis is the "here's what I found" debrief. It's the
same investigate-then-synthesize loop a human would use.

## What's Next

Priority order for next session:

1. **Fix #78 (synthesis persistence)** - save synthesis.json to cache,
   include AI analysis in `--json -o` output. Small scope, high impact.
2. **Fix #79 (stale cache)** - normalize paths before hashing, consider
   `--fresh` wiping the cache dir entirely. Correctness fix.
3. **Phase 4 review (#40)** - reassess Phase 4+ issues after the MCP
   pivot and Phase 3 experience. Some issues may be stale or
   reprioritized.
4. **Phase 4 (External Knowledge Tools)** or **Phase 5 (Scale-Tiered
   Synthesis)** depending on what #40 reveals.

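For the first item, persisting the synthesis could be as simple as writing one JSON file into the cache directory. This is only a sketch of that idea: the file name `synthesis.json` comes from the plan above, but `save_synthesis`, the cache layout, and the shape of the synthesis dict are assumptions, and the demo uses a throwaway directory with placeholder content.

```python
import json
import tempfile
from pathlib import Path


def save_synthesis(cache_dir: Path, synthesis: dict) -> Path:
    """Write the synthesis result next to the other cache entries."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    out_path = cache_dir / "synthesis.json"
    out_path.write_text(json.dumps(synthesis, indent=2), encoding="utf-8")
    return out_path


# Demo with a temporary directory and placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    written = save_synthesis(Path(tmp) / "cache", {"summary": "placeholder"})
    reloaded = json.loads(written.read_text(encoding="utf-8"))
    print(reloaded["summary"])  # placeholder
```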
@ -11,6 +11,7 @@
| [Session 7](Session7) | 2026-04-07 | Phase 1 audit (#1 closed, only #54 remains); gitea MCP credential overhaul — dedicated `claude-code` Forgejo user with admin on luminos, write+delete verified |
| [Session 8](Session8) | 2026-04-07 | Closed #54 — added confidence/confidence_reason to write_cache tool schema description; Phase 1 milestone now 4/4 complete |
| [Session 9](Session9) | 2026-04-11 | Scope shift (#64) + all Phase 3 prereqs: dir loop refactor (#57), tool registry consolidation (#56), pure-helper test coverage waves 1+2 (#55, #70), leaf-first contract docs (#72). 6 PRs, 70 new tests (164→234), Phase 2.6/2.7/2.8 milestones complete |
| [Session 10](Session10) | 2026-04-12 | Phase 3 shipped: planning pass, dynamic turn allocation, quality instrumentation (#8, #9, #10, #11, #74). Found and fixed root-path matching bug (#76). Smoke tests against luminos and homelab IaC. Filed #78 (synthesis persistence), #79 (stale cache). 3 PRs, 28 new tests (234→262) |

---
