retro: Session 5 — documentation deep dive, Internals.md added, doc reconciliation (#53)

2026-04-06 23:23:09 -06:00 · 2026-04-06 23:23:09 -06:00 · ab10d0cbf9
commit ab10d0cbf9
parent 457e0a7661
2 changed files with 68 additions and 0 deletions
--- a/Session5.md
+++ b/Session5.md
@ -0,0 +1,67 @@
 # Session 5
 **Date:** 2026-04-06
 **Focus:** Documentation deep dive — orient a new developer to the project before Phase 3 begins
 **Duration estimate:** ~1 hour
 ## What was done
 Issue [#53](https://forgejo.labbity.unbiasedgeek.com/archeious/luminos/issues/53) — *Developer onboarding: deep-dive code tour + reconcile architecture/roadmap docs*. Single branch `docs/issue-53-onboarding-internals`, merged in commit `8c0e29b`.
 ### New wiki page
 - **`Internals.md`** (~500 lines) — code-level walkthrough of how Luminos actually works, with file:line references throughout. Sections: the two layers, base scan walkthrough, AI pipeline orchestrator, survey pass, dir loop in depth (message history, tool dispatch, pre-loaded context, token tracker, budget check, partial flush), cache model, prompts, synthesis pass, flags, where-to-make-common-changes cookbook (add a tool / add a pass / change a prompt / change cache schema / add a CLI flag), token budget and cost, glossary. Targeted at a developer who knows basic Python but has never built an agent loop.
 ### Wiki fixes
 - **`Architecture.md`** — added Internals link at top; rewrote the AI pipeline diagram to include `_run_investigation`, the survey pass, `_filter_dir_tools`, and the partial-flush early-exit; **fixed the cache section** which incorrectly described file/dir entries as JSONL (they are sha256-keyed JSON files; only `flags.jsonl` and `investigation.log` are JSONL); added `partial`/`partial_reason` to the dir entry schema; documented `investigations.json`; added a note about `_TokenTracker` and the per-call (not cumulative) budget logic.
 - **`Roadmap.md`** — deleted and replaced. The previous version still listed Phase 1 and Phase 2 as "Not started" months after they shipped, and was missing Phases 2.5, 3.5, and 4.5 entirely. New version is a thin pointer to PLAN.md and open issues, plus a status-at-a-glance table. Three sources of truth (CLAUDE.md / PLAN.md / Roadmap.md) collapsed to two.
 - **`DevelopmentGuide.md`** — added a header note pointing to Internals for code-level questions. Otherwise unchanged.
 - **`Home.md`** — added Internals to the quick-links table.
 ### PLAN.md fixes (main repo)
 - Added a status snapshot at the top with the same phase table as Roadmap.md.
 - Fixed the "File Map" section: it claimed a `luminos_lib/domain.py` would be created for the survey pass, but `domain.py` was never created — survey logic landed inside `ai.py` (`_run_survey`, `_format_survey_block`, `_filter_dir_tools`). Marked the row as struck-through with a note explaining what actually happened, and tagged `prompts.py` and `ai.py` rows with ✅ for survey work that's already done.
 ## Discoveries and observations
 - **Architecture.md cache section was wrong.** I had assumed JSONL based on the wiki, but reading `cache.py:106–141` showed sha256-keyed JSON files in `files/` and `dirs/` subdirs. The reason for the design (O(1) `has_entry()` check) is genuinely good and worth documenting — it's now in Internals §5.4. This is the kind of thing a deep-dive doc pass catches that incremental edits never would.
 - **PLAN.md File Map drift is structural, not editorial.** PLAN.md was written before implementation. The implementation rationally chose a different file layout. The plan was never updated. This will keep happening to PLAN.md as long as it pretends to be both a design doc and a current-state doc. The fix in this session was minimal (a struck-through row + a note at the top), but the broader question — how to keep PLAN.md from rotting — remains.
 - **The survey pass is the cleanest reference implementation in the codebase.** It's short, single-purpose, has a clear submit-tool exit, degrades gracefully on failure. Any new pass (planning, refinement) should be modeled on it. Internals §9.2 says exactly this.
 - **`_call_api_streaming` doesn't actually stream anything to the user.** It uses the streaming endpoint but discards events and pulls the final message at the end. There's a latent capability to print live progress that nobody is using. Not a bug — just a thing I noticed.
 - **Confidence support exists in the cache but no prompt sets it.** `cache.py:191` has `low_confidence_entries(threshold=0.7)` ready to go. The agent never writes confidence values because no prompt instructs it to. Phase 1 was marked "shipped" but the wiring is half-finished — the schema is there, the read side is there, the write side is dormant. Worth a follow-up issue.
 ## Decisions
 - **One bundled PR for all the doc work**, not split per file. The changes are tightly coupled (Architecture cache section ⇄ Internals cache section ⇄ PLAN.md status ⇄ Roadmap status table) and reviewing them together is easier than reviewing them separately.
 - **Replace Roadmap.md, don't fix it.** The user explicitly chose this. Fixing it in place just resets the clock until the next drift; replacing it with a pointer to two living sources (PLAN.md design + issues status) removes the failure mode entirely.
 - **Internals.md targets "basic Python, no agent-loop background"** per the user. This shaped the depth of explanation in §3 (orchestrator) and §4 (dir loop). I expanded the message-history-grows-monotonically point because that's exactly the kind of thing a non-agent-loop person would not anticipate.
 - **Out of scope on purpose:** no code changes, no PLAN.md restructuring beyond status + file-map fix, no fix for the dormant confidence-write path. All confined to issue #53; anything that came up gets a follow-up issue or a Raw Thinking note here.
 ## Raw thinking
 - **Phase 1 completion is technically a lie.** The cache schema accepts confidence; `low_confidence_entries()` exists; the prompt never asks the agent to set it; therefore no entries ever have it; therefore `low_confidence_entries()` returns everything (since `e.get("confidence", 0.0) < 0.7` is always true for missing-confidence entries). The function works as documented but is currently not useful. This isn't a bug worth fixing in isolation — it'll be wired up when refinement (Phase 8) actually consumes the signal — but Phase 1's "shipped" status hides that the read path is starving.
 - **Internals.md is going to drift.** Every file:line reference is a hostage to future commits. I added a disclaimer at the top but disclaimers don't actually keep docs current. A lighter-touch alternative would have been file references without line numbers (e.g. "see `_run_dir_loop` in `ai.py`"). The user wanted concrete refs and they're more useful for a new dev — but we should expect to see "Internals.md is stale" issues in a few sessions.
 - **The MCP pivot (Phase 3.5) makes some of Internals.md temporary.** Once the dir loop's tools are discovered dynamically from an MCP server, §4.2 (Tool dispatch) and §9.1 (Add a new tool) become wrong. Internals will need a major revision after Phase 3.5 lands. Not a reason to delay the doc — better to have it right *now* and revise *then* than to delay until "the architecture stabilizes" (which never happens).
 - **`_run_dir_loop` is 160 lines and four conceptual layers (budget check + early flush, API call + history append, tool dispatch + result append, done check). It's near its readability ceiling.** Phase 3 will add planning-pass-driven turn allocation, which is more state in the same function. Worth a refactor before that lands or the function becomes the kind of thing nobody wants to touch.
 - **There's no test for `ai.py` because it requires a live API.** This is documented as policy. But several pieces of `ai.py` are pure functions (`_filter_dir_tools`, `_format_survey_block`, `_format_survey_signals`, `_default_survey`, `_should_skip_dir`, `_path_is_safe`) that could absolutely be unit-tested. Worth a follow-up issue to extract these into `_test_helpers` or just add `tests/test_ai_pure.py`.
 - **The "tools registered in two places" smell.** A new tool needs an entry in `_TOOL_DISPATCH` *and* an entry in `_DIR_TOOLS`. Easy to add one without the other. A small registration helper (`@dir_tool(name=..., description=..., schema=...)`) would collapse this. Not urgent — there are only ~10 tools — but the size of §9.1 in Internals (5 numbered steps for adding one tool) suggests this is friction.
 - **Roadmap.md is now a 30-line pointer.** It's fine but slightly silly as a standalone wiki page. Could be folded into Home.md. Left it standalone for now because external links to `wiki/Roadmap` would break.
 ## What's next
 User-stated next focus: **Phase 3 — Investigation planning**. The relevant issues are:
 - #8 Add `_PLANNING_SYSTEM_PROMPT` to prompts.py
 - #9 Implement planning pass with `submit_plan` tool in ai.py
 - #10 Add dynamic turn allocation from plan output
 - #11 Save investigation plan to cache for resumed runs
 The natural order is #8 → #9 → #10 → #11. #8 and #9 are sized like the survey pass work was — small, single-pass loops. #10 is the harder one because it changes how `_run_dir_loop` consumes `max_turns` (currently a constant, would become per-dir from the plan). #11 is mostly a cache schema change.
 Follow-ups noticed during this session that should become issues:
 - Phase 1 confidence-write path is dormant (no prompt sets it)
 - Pure helpers in `ai.py` could be unit-tested without a live API
 - Tool registration is duplicated between `_TOOL_DISPATCH` and `_DIR_TOOLS`
 - `_run_dir_loop` is approaching its readability ceiling — refactor before Phase 3 adds more state
--- a/SessionRetrospectives.md
+++ b/SessionRetrospectives.md
@ -6,6 +6,7 @@
 | [Session 2](Session2) | 2026-04-06 | Forgejo milestones, issues, project board (36 issues, 9 milestones), Gitea MCP setup |
 | [Session 3](Session3) | 2026-04-06 | Phase 1 complete, MCP backend architecture design, issues #38–#40 opened |
 | [Session 4](Session4) | 2026-04-06 | Phase 2 + 2.5 complete (#4–#7, #42, #44), classifier rebuild, context budget fix, 8 PRs merged |
 | [Session 5](Session5) | 2026-04-06 | Documentation deep dive: new Internals.md code tour, Architecture cache fix, Roadmap replaced with pointer, PLAN.md status snapshot (#53) |
 ---