retro: Session 2 — platform contract + /healthz + JSON logs
parent
cbd2e11a64
commit
84d4827d60
1 changed files with 298 additions and 0 deletions
298
Session2.md
Normal file
298
Session2.md
Normal file
|
|
@ -0,0 +1,298 @@
|
||||||
|
# Session 2 Notes — 2026-04-19
|
||||||
|
|
||||||
|
## What We Set Out to Do
|
||||||
|
|
||||||
|
Jeff opened the session with a platform-team intake form (#25) sitting
|
||||||
|
in the repo from the homelab-IaC automation. The goal was to establish
|
||||||
|
a real deploy contract for Quartermaster on home-ctr-onyx, then land
|
||||||
|
whatever the contract committed the app side to building. Secondary
|
||||||
|
goal: shake down the subagent-driven-development workflow on a
|
||||||
|
non-trivial feature pair.
|
||||||
|
|
||||||
|
## What Actually Happened
|
||||||
|
|
||||||
|
One arc, two shipped issues (#26, #27) merged to main, three
|
||||||
|
deploy-prep issues queued (#28, #29, #30), and one cleanup umbrella
|
||||||
|
(#31).
|
||||||
|
|
||||||
|
1. **Intake form filled out (#25).** Read the Archon platform contract
|
||||||
|
for shape, drafted answers from the repo's own state (stack, port,
|
||||||
|
data store, resource profile), and surfaced six real questions for
|
||||||
|
Jeff to decide: hostname, exposure, auth mechanism, launch window,
|
||||||
|
memory budget, structured-log ambition. After he answered, the
|
||||||
|
form landed as edit-in-place on the issue body. Key decisions:
|
||||||
|
`quartermaster.unbiasedgeek.com` (product surface, own cert),
|
||||||
|
public exposure, Traefik basic auth, ASAP, 1 GB memory, JSON logs
|
||||||
|
at launch.
|
||||||
|
|
||||||
|
2. **Platform team responded (automated).** The homelab-IaC automation
|
||||||
|
drafted `PlatformContractQuartermaster.md`, provisioned DNS, two
|
||||||
|
Traefik middlewares (basic-auth, rate-limit), the bind mount with
|
||||||
|
nightly restic, and the basic-auth credentials — then closed the
|
||||||
|
issue with a "what you do next on your side" comment enumerating
|
||||||
|
Dockerfile / compose / Actions / Alembic-on-startup. The comment is
|
||||||
|
the authoritative next-steps document and got mapped to three
|
||||||
|
dependent issues after the main work landed.
|
||||||
|
|
||||||
|
3. **Brainstorm → spec → plan for #26 + #27.** Ran the superpowers
|
||||||
|
brainstorming skill. Two meaningful design decisions: how broad
|
||||||
|
the seed-event instrumentation should be (picked "B: seed a
|
||||||
|
handful of events" — 5 events, cheap enough to be useful on day
|
||||||
|
one, narrow enough not to sprawl), and which JSON formatter
|
||||||
|
library to use (picked `python-json-logger`, rejected rolling a
|
||||||
|
stdlib-only `JsonFormatter` subclass). Spec landed at
|
||||||
|
`docs/superpowers/specs/2026-04-19-healthz-and-json-logs-design.md`;
|
||||||
|
plan at `docs/superpowers/plans/2026-04-19-healthz-and-json-logs.md`
|
||||||
|
(851 lines, eight bite-sized TDD tasks).
|
||||||
|
|
||||||
|
4. **Subagent-driven execution.** Eight tasks dispatched sequentially
|
||||||
|
on `feat/platform-deploy-prep`, each with a fresh implementer
|
||||||
|
subagent, spec-compliance review, and code-quality review via the
|
||||||
|
`superpowers:code-reviewer` agent. Model selection: haiku for
|
||||||
|
mechanical tasks (dep bump, test-only adds, README edits, single
|
||||||
|
seed events), sonnet for integration tasks (log config + filter,
|
||||||
|
four coordinated seed events, /healthz endpoint wiring). Fix
|
||||||
|
commits triggered by review: #20bca13 (restore logger state in
|
||||||
|
dictConfig test) and #bfe9ed4 (assert healthz_failed log on 503).
|
||||||
|
Final state: 12 commits on the branch, 148/148 tests passing.
|
||||||
|
|
||||||
|
5. **Merge to main.** User picked "merge locally" over PR. Main had
|
||||||
|
diverged because an unrelated MCP proposal doc (#23 → PR #24)
|
||||||
|
landed on origin while we worked. Option-chose rebase over merge
|
||||||
|
commits: rebased local spec+plan+feature onto `origin/main`, then
|
||||||
|
fast-forwarded. One snag: dirty `CLAUDE.md` (pre-session
|
||||||
|
modifications by Jeff) blocked the rebase silently under `set -e`;
|
||||||
|
stashed, rebased, popped. Linear history preserved.
|
||||||
|
|
||||||
|
6. **Filed follow-up issues.** Three deploy-pipeline issues with
|
||||||
|
dependency chain (#28 Dockerfile → #29 compose → #30 Actions
|
||||||
|
workflow) plus #31 umbrella for the three non-blocking code-review
|
||||||
|
nits.
|
||||||
|
|
||||||
|
## Key Decisions & Reasoning
|
||||||
|
|
||||||
|
* **Intake answers favoured narrow over broad.** Memory budget of
|
||||||
|
1 GB (vs the platform's 4 GB default) because the app is genuinely
|
||||||
|
small. Structured JSON logs as launch work (not deferred) because
|
||||||
|
Promtail is already there, post-launch retrofits are annoying, and
|
||||||
|
the Loki-query story is only useful if the events exist from day
|
||||||
|
one. Rate-limit numbers proposed but deferred to the platform
|
||||||
|
team's tenant-default preferences. Bind mount specifically called
|
||||||
|
out to cover both the SQLite file AND the `backups/` directory, so
|
||||||
|
`scripts/backup-db.sh` keeps working unchanged in production.
|
||||||
|
|
||||||
|
* **Single source of truth for log config (JSON, not dual files).**
|
||||||
|
Spec originally said "Python dict as source with a YAML shim for
|
||||||
|
uvicorn." Plan deviated to JSON-as-source loaded by Python at
|
||||||
|
import, consumed by uvicorn via `--log-config` pointing at the
|
||||||
|
same file. Reasoning: YAML requires `pyyaml` at runtime (uvicorn
|
||||||
|
doesn't bundle it), and a dual-file setup introduces drift risk.
|
||||||
|
The spec was edited mid-execution to reflect the simpler shape.
|
||||||
|
|
||||||
|
* **`pythonjsonlogger.jsonlogger.JsonFormatter` kept despite
|
||||||
|
deprecation.** The plan used the deprecated module path; the Task 1
|
||||||
|
code reviewer flagged it. Kept the plan as-written through Task 2
|
||||||
|
(minimum change to ship), then fixed in a cleanup commit at the end
|
||||||
|
(`pythonjsonlogger.json.JsonFormatter`). This would have been the
|
||||||
|
wrong call in a busier branch; here it was fine because the branch
|
||||||
|
was short and the fix was cheap.
|
||||||
|
|
||||||
|
* **Five seed events, not deeper instrumentation.** `month_created`,
|
||||||
|
`month_closed`, `template_entry_updated`, `posting_added`,
|
||||||
|
`posting_deleted` at the service-layer mutation sites. Placed after
|
||||||
|
`db.commit()` + `db.refresh()` so they only fire on durable success.
|
||||||
|
Payloads are minimal and deliberately exclude anything PII-ish
|
||||||
|
(no notes content, no names). An audit trail belongs in a table, not
|
||||||
|
in logs.
|
||||||
|
|
||||||
|
* **`fmt` over `format` in the JSON config.** Plan specified `format`;
|
||||||
|
implementer hit `TypeError` on direct `factory(**cfg)` because
|
||||||
|
`JsonFormatter.__init__` inherits `fmt` from stdlib's `logging.Formatter`.
|
||||||
|
Implementer changed to `fmt`, documented in report. dictConfig with
|
||||||
|
`()` factory passes kwargs literally, so `fmt` works on both paths.
|
||||||
|
Plan was slightly wrong; fix was necessary.
|
||||||
|
|
||||||
|
* **`/healthz` on its own router — but only as a file boundary.**
|
||||||
|
Spec language ("router separation protects from future middleware")
|
||||||
|
overstated what the separation buys. FastAPI's
|
||||||
|
`app.add_middleware(...)` is global and ignores routers. The
|
||||||
|
separation does help if future auth lands as router-scoped
|
||||||
|
`dependencies=[...]`. For Quartermaster this is moot: auth is
|
||||||
|
Traefik-side per the platform contract. Documented as a minor
|
||||||
|
cleanup note in #31 and in the spec's out-of-scope section.
|
||||||
|
|
||||||
|
* **Rebase not merge on the divergence.** MCP proposal doc was
|
||||||
|
completely independent work; rebasing onto origin/main kept linear
|
||||||
|
history and read as one story. In a PR-workflow branch we'd have
|
||||||
|
used the merge button; for a direct local merge, rebase was the
|
||||||
|
right call.
|
||||||
|
|
||||||
|
## Surprises & Discoveries
|
||||||
|
|
||||||
|
* **The autouse-fixture workaround was wrong.** Task 5's test was
|
||||||
|
flaky because Task 3's `test_log_config_loads_via_dictconfig`
|
||||||
|
mutated the `quartermaster` logger's `propagate` flag globally.
|
||||||
|
Implementer's first fix added a file-scoped autouse fixture. Code
|
||||||
|
reviewer flagged it correctly: the problem is that the contaminating
|
||||||
|
test doesn't clean up after itself, and an autouse fixture only
|
||||||
|
protects tests in the same file — future tests elsewhere that use
|
||||||
|
caplog on `quartermaster.*` would be bitten silently. Right fix:
|
||||||
|
save/restore state in the contaminating test itself. This is a
|
||||||
|
generalisable pattern worth remembering — global-state-leaking
|
||||||
|
tests should always clean up, don't push cleanup onto every other
|
||||||
|
test.
|
||||||
|
|
||||||
|
* **LogQL example had a field-that-doesn't-exist.** Third example in
|
||||||
|
the Logs section was `event="month_closed" | line_format "{{.path}}"`,
|
||||||
|
but `month_closed` records carry `year_month`, not `path` —
|
||||||
|
`{{.path}}` would render empty. Caught in the end-of-branch review,
|
||||||
|
fixed in cleanup commit. Small but exactly the kind of thing a fresh
|
||||||
|
set of eyes catches; easy to miss when writing docs while holding
|
||||||
|
the full model in your head.
|
||||||
|
|
||||||
|
* **The subagent TDD discipline worked.** Every implementer subagent
|
||||||
|
actually wrote the failing test first, verified it fails, then
|
||||||
|
implemented. I briefed this in every prompt and they followed it.
|
||||||
|
The TDD failures were logged in each report ("assertion 0 == 1" /
|
||||||
|
"ModuleNotFoundError"). Didn't need to correct any subagent for
|
||||||
|
skipping that step. This is a win over direct execution, where I'd
|
||||||
|
sometimes rationalise past writing the failing test first.
|
||||||
|
|
||||||
|
* **Task 4's smoke test hit `/` and got a 500.** Manual `docker run`
|
||||||
|
style smoke test in the subagent's CWD where the SQLite file or
|
||||||
|
schema wasn't set up. Didn't invalidate the task (the JSON access
|
||||||
|
log WAS captured and proved the format worked), but a reminder
|
||||||
|
that smoke tests need stable test environments. Worth checking
|
||||||
|
once deploy lands.
|
||||||
|
|
||||||
|
* **Git `set -e` does not reliably abort on `git rebase` failure
|
||||||
|
inside a bash heredoc.** The first merge attempt ran three steps
|
||||||
|
under `set -e`; the rebase failures due to dirty `CLAUDE.md`
|
||||||
|
exited non-zero but the subsequent `git log` + `git merge` ran
|
||||||
|
anyway, landing the branch in "accidentally Option-3 state." Had
|
||||||
|
to untangle with a stash/rebase/pop. Not a subagent issue — bash
|
||||||
|
quirk, worth remembering.
|
||||||
|
|
||||||
|
* **Platform team's `QUARTERMASTER_DB_URL` example had three slashes.**
|
||||||
|
Comment on #25 wrote `sqlite:///data/quartermaster.db` (three =
|
||||||
|
relative). For an absolute path in the container you need four
|
||||||
|
(`sqlite:////data/quartermaster.db`). Flagged in #29's body so the
|
||||||
|
compose writer doesn't copy the wrong spelling.
|
||||||
|
|
||||||
|
* **Commits on a long branch, reviewer's recommendation: don't
|
||||||
|
squash.** The end-of-branch reviewer explicitly argued against
|
||||||
|
squashing the 12 commits — the fix commits (20bca13, bfe9ed4) tell
|
||||||
|
a real story about what came up in review. I'd reflexively have
|
||||||
|
squashed; the case for preserving was more compelling. Left as-is.
|
||||||
|
|
||||||
|
## Concerns & Open Threads
|
||||||
|
|
||||||
|
* **No end-to-end test of the full deploy path.** We have formatter-
|
||||||
|
level tests, filter-level tests, and a manual uvicorn smoke test
|
||||||
|
(verified one JSON access log line). What we don't have is a test
|
||||||
|
that builds the container, runs it, curls `/healthz`, and asserts
|
||||||
|
on the captured stdout. That's #28 + #29 + #30 territory; the Actions
|
||||||
|
workflow's post-deploy `curl` step will be the first real end-to-end
|
||||||
|
assertion.
|
||||||
|
|
||||||
|
* **Deprecation warning is fixed, but we're still using
|
||||||
|
`pythonjsonlogger.json` — a 4.x module path.** If the library ever
|
||||||
|
does another reorg or drops the API we're using, the log format
|
||||||
|
quietly breaks. `python-json-logger` is maintained but small; worth
|
||||||
|
pinning to a range (`>=3.2,<5`) in `pyproject.toml` if we ever hit
|
||||||
|
a 5.x migration.
|
||||||
|
|
||||||
|
* **The intake form's §3.7 still says "Endpoint does not exist yet
|
||||||
|
— will land in the repo before the first deploy cut."** Harmless
|
||||||
|
because the issue is closed, but stale. If we ever revisit the
|
||||||
|
form as a template for another app, clean that up.
|
||||||
|
|
||||||
|
* **Session memory about TDD enforcement.** I've been saving feedback
|
||||||
|
memories about user-specific patterns. This session reinforced that
|
||||||
|
the subagent-driven skill's TDD gate works when prompts explicitly
|
||||||
|
require failing-first verification. Saving the pattern is worth it,
|
||||||
|
but I'm wary of over-memoing the workflow meta; Jeff's CLAUDE.md
|
||||||
|
global already prefers "do the work, don't narrate the workflow."
|
||||||
|
|
||||||
|
* **Rate-limit numbers still open.** Filed as a platform-team ask on
|
||||||
|
#25; the response didn't address numbers specifically but the
|
||||||
|
middleware is deployed at `10/30 per-IP` (matches my starter
|
||||||
|
proposal in the intake). If that's wrong for this tenant we'll hear
|
||||||
|
about it when real traffic hits.
|
||||||
|
|
||||||
|
* **CLAUDE.md Session Log is out of date.** The file has a Session 1
|
||||||
|
entry; Session 2's wiki page exists now but CLAUDE.md still reflects
|
||||||
|
pre-session state. Left unmodified this session because the CLAUDE.md
|
||||||
|
working-tree diff was Jeff's pre-session edit; merging my own
|
||||||
|
Session 2 references on top would have entangled our changes. Next
|
||||||
|
session he'll want to land both (his stale edits + a Session 2 bullet
|
||||||
|
in the log).
|
||||||
|
|
||||||
|
## Raw Thinking
|
||||||
|
|
||||||
|
* **The intake form's "explain what you've already tried" structure
|
||||||
|
was load-bearing.** When Jeff asked me to "fill out what I can and
|
||||||
|
ask questions," the right shape was a draft-with-question-list,
|
||||||
|
not a comment dump. Surfacing six specific binary-ish questions
|
||||||
|
("option A or B?") compressed a long back-and-forth into one
|
||||||
|
round-trip. This pairs with his "minimalist answers" pattern from
|
||||||
|
Session 1: if I'm clear about my defaults, he answers by approving
|
||||||
|
or swapping; if I'm vague, I get silence and ship the wrong thing.
|
||||||
|
|
||||||
|
* **Subagent-driven felt like overkill for Task 1 (dep bump) and
|
||||||
|
Task 3 (test-only additions) — but it caught real issues anyway.**
|
||||||
|
The Task 2 code reviewer flagged the `pythonjsonlogger.jsonlogger`
|
||||||
|
deprecation before we'd written a line of production code that
|
||||||
|
depended on it. The Task 5 reviewer caught the autouse fixture
|
||||||
|
wrong. These catches would have leaked through "just write the
|
||||||
|
commit" execution. The cost was roughly 2-3x the token budget of
|
||||||
|
direct execution; the quality delta was worth it on a branch that
|
||||||
|
needs to work in prod.
|
||||||
|
|
||||||
|
* **The spec-and-plan artifacts had real value beyond process
|
||||||
|
ceremony.** Each implementer subagent got pasted a task body of
|
||||||
|
100-200 lines that was self-contained: exact file paths, exact
|
||||||
|
code, exact commit message, verification commands. No subagent
|
||||||
|
asked "what should the format string be" or "what fields should
|
||||||
|
the filter set." That prep-once-execute-many pattern is what makes
|
||||||
|
subagent-driven work at all. The plan file is almost a shared
|
||||||
|
cache across subagents.
|
||||||
|
|
||||||
|
* **Platform contracts are surprisingly lightweight when one tenant
|
||||||
|
already shipped.** Archon went through this first, so the intake
|
||||||
|
form inherited structure, the wiki template existed, and the
|
||||||
|
platform-team response was mechanical. For the Quartermaster
|
||||||
|
tenant this worked out to maybe an hour of thinking and a morning
|
||||||
|
of execution. Future tenants should benefit more.
|
||||||
|
|
||||||
|
* **The Traefik-not-FastAPI auth choice ripples into the code.**
|
||||||
|
Because auth lives at the edge, `/healthz` stays unauthenticated
|
||||||
|
with zero effort inside the app, and there's no session middleware
|
||||||
|
or login template to build. This is why "infrastructure-level
|
||||||
|
auth" is such a good default for small single-user tools —
|
||||||
|
the app stays boring.
|
||||||
|
|
||||||
|
## What's Next
|
||||||
|
|
||||||
|
Priority order for the next session:
|
||||||
|
|
||||||
|
1. **#28 Dockerfile.** Run as `USER 1000:1000`, Alembic + uvicorn
|
||||||
|
entrypoint, `HEALTHCHECK` hitting /healthz. Local build + run
|
||||||
|
smoke test. Should be small.
|
||||||
|
2. **#29 compose.yml.** Consumes #28. The bulk of the Traefik labels,
|
||||||
|
resource limits, required container labels. Mind the four-slash
|
||||||
|
`QUARTERMASTER_DB_URL`.
|
||||||
|
3. **#30 Forgejo Actions deploy workflow.** Build, tag by git SHA,
|
||||||
|
push to the Forgejo registry, SSH to home-ctr-onyx, `docker compose
|
||||||
|
pull && up -d`, post-deploy `curl /healthz`. First real end-to-end
|
||||||
|
deploy test.
|
||||||
|
4. **First production rollout.** Flip the DNS, hit the live URL
|
||||||
|
through Traefik, verify basic auth works, verify JSON logs flow
|
||||||
|
into Loki with `event`/`level` as labels. Should be uneventful;
|
||||||
|
the interesting part was the work leading up to it.
|
||||||
|
5. **#31 cleanups.** Fold into whichever follow-up PR naturally
|
||||||
|
touches `service.py` or `routes_health.py`.
|
||||||
|
6. **Deferred items from #1 still open.** Constrain posting dates to
|
||||||
|
the month, closed-month archive treatment, cross-month
|
||||||
|
summaries. Nothing about this session disturbed the priority of
|
||||||
|
those.
|
||||||
Loading…
Reference in a new issue