retro: Session 2 — platform contract + /healthz + JSON logs
parent
cbd2e11a64
commit
84d4827d60
1 changed files with 298 additions and 0 deletions
298
Session2.md
Normal file
298
Session2.md
Normal file
|
|
@ -0,0 +1,298 @@
|
|||
# Session 2 Notes — 2026-04-19
|
||||
|
||||
## What We Set Out to Do
|
||||
|
||||
Jeff opened the session with a platform-team intake form (#25) sitting
|
||||
in the repo from the homelab-IaC automation. The goal was to establish
|
||||
a real deploy contract for Quartermaster on home-ctr-onyx, then land
|
||||
whatever the contract committed the app side to building. Secondary
|
||||
goal: shake down the subagent-driven-development workflow on a
|
||||
non-trivial feature pair.
|
||||
|
||||
## What Actually Happened
|
||||
|
||||
One arc, two shipped issues (#26, #27) merged to main, three
|
||||
deploy-prep issues queued (#28, #29, #30), and one cleanup umbrella
|
||||
(#31).
|
||||
|
||||
1. **Intake form filled out (#25).** Read the Archon platform contract
|
||||
for shape, drafted answers from the repo's own state (stack, port,
|
||||
data store, resource profile), and surfaced six real questions for
|
||||
Jeff to decide: hostname, exposure, auth mechanism, launch window,
|
||||
memory budget, structured-log ambition. After he answered, the
|
||||
form landed as edit-in-place on the issue body. Key decisions:
|
||||
`quartermaster.unbiasedgeek.com` (product surface, own cert),
|
||||
public exposure, Traefik basic auth, ASAP, 1 GB memory, JSON logs
|
||||
at launch.
|
||||
|
||||
2. **Platform team responded (automated).** The homelab-IaC automation
|
||||
drafted `PlatformContractQuartermaster.md`, provisioned DNS, two
|
||||
Traefik middlewares (basic-auth, rate-limit), the bind mount with
|
||||
nightly restic, and the basic-auth credentials — then closed the
|
||||
issue with a "what you do next on your side" comment enumerating
|
||||
Dockerfile / compose / Actions / Alembic-on-startup. The comment is
|
||||
the authoritative next-steps document and got mapped to three
|
||||
dependent issues after the main work landed.
|
||||
|
||||
3. **Brainstorm → spec → plan for #26 + #27.** Ran the superpowers
|
||||
brainstorming skill. Two meaningful design decisions: how broad
|
||||
the seed-event instrumentation should be (picked "B: seed a
|
||||
handful of events" — 5 events, cheap enough to be useful on day
|
||||
one, narrow enough not to sprawl), and which JSON formatter
|
||||
library to use (picked `python-json-logger`, rejected rolling a
|
||||
stdlib-only `JsonFormatter` subclass). Spec landed at
|
||||
`docs/superpowers/specs/2026-04-19-healthz-and-json-logs-design.md`;
|
||||
plan at `docs/superpowers/plans/2026-04-19-healthz-and-json-logs.md`
|
||||
(851 lines, eight bite-sized TDD tasks).
|
||||
|
||||
4. **Subagent-driven execution.** Eight tasks dispatched sequentially
|
||||
on `feat/platform-deploy-prep`, each with a fresh implementer
|
||||
subagent, spec-compliance review, and code-quality review via the
|
||||
`superpowers:code-reviewer` agent. Model selection: haiku for
|
||||
mechanical tasks (dep bump, test-only adds, README edits, single
|
||||
seed events), sonnet for integration tasks (log config + filter,
|
||||
four coordinated seed events, /healthz endpoint wiring). Fix
|
||||
commits triggered by review: #20bca13 (restore logger state in
|
||||
dictConfig test) and #bfe9ed4 (assert healthz_failed log on 503).
|
||||
Final state: 12 commits on the branch, 148/148 tests passing.
|
||||
|
||||
5. **Merge to main.** User picked "merge locally" over PR. Main had
|
||||
diverged because an unrelated MCP proposal doc (#23 → PR #24)
|
||||
landed on origin while we worked. Option-chose rebase over merge
|
||||
commits: rebased local spec+plan+feature onto `origin/main`, then
|
||||
fast-forwarded. One snag: dirty `CLAUDE.md` (pre-session
|
||||
modifications by Jeff) blocked the rebase silently under `set -e`;
|
||||
stashed, rebased, popped. Linear history preserved.
|
||||
|
||||
6. **Filed follow-up issues.** Three deploy-pipeline issues with
|
||||
dependency chain (#28 Dockerfile → #29 compose → #30 Actions
|
||||
workflow) plus #31 umbrella for the three non-blocking code-review
|
||||
nits.
|
||||
|
||||
## Key Decisions & Reasoning
|
||||
|
||||
* **Intake answers favoured narrow over broad.** Memory budget of
|
||||
1 GB (vs the platform's 4 GB default) because the app is genuinely
|
||||
small. Structured JSON logs as launch work (not deferred) because
|
||||
Promtail is already there, post-launch retrofits are annoying, and
|
||||
the Loki-query story is only useful if the events exist from day
|
||||
one. Rate-limit numbers proposed but deferred to the platform
|
||||
team's tenant-default preferences. Bind mount specifically called
|
||||
out to cover both the SQLite file AND the `backups/` directory, so
|
||||
`scripts/backup-db.sh` keeps working unchanged in production.
|
||||
|
||||
* **Single source of truth for log config (JSON, not dual files).**
|
||||
Spec originally said "Python dict as source with a YAML shim for
|
||||
uvicorn." Plan deviated to JSON-as-source loaded by Python at
|
||||
import, consumed by uvicorn via `--log-config` pointing at the
|
||||
same file. Reasoning: YAML requires `pyyaml` at runtime (uvicorn
|
||||
doesn't bundle it), and a dual-file setup introduces drift risk.
|
||||
The spec was edited mid-execution to reflect the simpler shape.
|
||||
|
||||
* **`pythonjsonlogger.jsonlogger.JsonFormatter` kept despite
|
||||
deprecation.** The plan used the deprecated module path; the Task 1
|
||||
code reviewer flagged it. Kept the plan as-written through Task 2
|
||||
(minimum change to ship), then fixed in a cleanup commit at the end
|
||||
(`pythonjsonlogger.json.JsonFormatter`). This would have been the
|
||||
wrong call in a busier branch; here it was fine because the branch
|
||||
was short and the fix was cheap.
|
||||
|
||||
* **Five seed events, not deeper instrumentation.** `month_created`,
|
||||
`month_closed`, `template_entry_updated`, `posting_added`,
|
||||
`posting_deleted` at the service-layer mutation sites. Placed after
|
||||
`db.commit()` + `db.refresh()` so they only fire on durable success.
|
||||
Payloads are minimal and deliberately exclude anything PII-ish
|
||||
(no notes content, no names). An audit trail belongs in a table, not
|
||||
in logs.
|
||||
|
||||
* **`fmt` over `format` in the JSON config.** Plan specified `format`;
|
||||
implementer hit `TypeError` on direct `factory(**cfg)` because
|
||||
`JsonFormatter.__init__` inherits `fmt` from stdlib's `logging.Formatter`.
|
||||
Implementer changed to `fmt`, documented in report. dictConfig with
|
||||
`()` factory passes kwargs literally, so `fmt` works on both paths.
|
||||
Plan was slightly wrong; fix was necessary.
|
||||
|
||||
* **`/healthz` on its own router — but only as a file boundary.**
|
||||
Spec language ("router separation protects from future middleware")
|
||||
overstated what the separation buys. FastAPI's
|
||||
`app.add_middleware(...)` is global and ignores routers. The
|
||||
separation does help if future auth lands as router-scoped
|
||||
`dependencies=[...]`. For Quartermaster this is moot: auth is
|
||||
Traefik-side per the platform contract. Documented as a minor
|
||||
cleanup note in #31 and in the spec's out-of-scope section.
|
||||
|
||||
* **Rebase not merge on the divergence.** MCP proposal doc was
|
||||
completely independent work; rebasing onto origin/main kept linear
|
||||
history and read as one story. In a PR-workflow branch we'd have
|
||||
used the merge button; for a direct local merge, rebase was the
|
||||
right call.
|
||||
|
||||
## Surprises & Discoveries
|
||||
|
||||
* **The autouse-fixture workaround was wrong.** Task 5's test was
|
||||
flaky because Task 3's `test_log_config_loads_via_dictconfig`
|
||||
mutated the `quartermaster` logger's `propagate` flag globally.
|
||||
Implementer's first fix added a file-scoped autouse fixture. Code
|
||||
reviewer flagged it correctly: the problem is that the contaminating
|
||||
test doesn't clean up after itself, and an autouse fixture only
|
||||
protects tests in the same file — future tests elsewhere that use
|
||||
caplog on `quartermaster.*` would be bitten silently. Right fix:
|
||||
save/restore state in the contaminating test itself. This is a
|
||||
generalisable pattern worth remembering — global-state-leaking
|
||||
tests should always clean up, don't push cleanup onto every other
|
||||
test.
|
||||
|
||||
* **LogQL example had a field-that-doesn't-exist.** Third example in
|
||||
the Logs section was `event="month_closed" | line_format "{{.path}}"`,
|
||||
but `month_closed` records carry `year_month`, not `path` —
|
||||
`{{.path}}` would render empty. Caught in the end-of-branch review,
|
||||
fixed in cleanup commit. Small but exactly the kind of thing a fresh
|
||||
set of eyes catches; easy to miss when writing docs while holding
|
||||
the full model in your head.
|
||||
|
||||
* **The subagent TDD discipline worked.** Every implementer subagent
|
||||
actually wrote the failing test first, verified it fails, then
|
||||
implemented. I briefed this in every prompt and they followed it.
|
||||
The TDD failures were logged in each report ("assertion 0 == 1" /
|
||||
"ModuleNotFoundError"). Didn't need to correct any subagent for
|
||||
skipping that step. This is a win over direct execution, where I'd
|
||||
sometimes rationalise past writing the failing test first.
|
||||
|
||||
* **Task 4's smoke test hit `/` and got a 500.** Manual `docker run`
|
||||
style smoke test in the subagent's CWD where the SQLite file or
|
||||
schema wasn't set up. Didn't invalidate the task (the JSON access
|
||||
log WAS captured and proved the format worked), but a reminder
|
||||
that smoke tests need stable test environments. Worth checking
|
||||
once deploy lands.
|
||||
|
||||
* **Git `set -e` does not reliably abort on `git rebase` failure
|
||||
inside a bash heredoc.** The first merge attempt ran three steps
|
||||
under `set -e`; the rebase failures due to dirty `CLAUDE.md`
|
||||
exited non-zero but the subsequent `git log` + `git merge` ran
|
||||
anyway, landing the branch in "accidentally Option-3 state." Had
|
||||
to untangle with a stash/rebase/pop. Not a subagent issue — bash
|
||||
quirk, worth remembering.
|
||||
|
||||
* **Platform team's `QUARTERMASTER_DB_URL` example had three slashes.**
|
||||
Comment on #25 wrote `sqlite:///data/quartermaster.db` (three =
|
||||
relative). For an absolute path in the container you need four
|
||||
(`sqlite:////data/quartermaster.db`). Flagged in #29's body so the
|
||||
compose writer doesn't copy the wrong spelling.
|
||||
|
||||
* **Commits on a long branch, reviewer's recommendation: don't
|
||||
squash.** The end-of-branch reviewer explicitly argued against
|
||||
squashing the 12 commits — the fix commits (20bca13, bfe9ed4) tell
|
||||
a real story about what came up in review. I'd reflexively have
|
||||
squashed; the case for preserving was more compelling. Left as-is.
|
||||
|
||||
## Concerns & Open Threads
|
||||
|
||||
* **No end-to-end test of the full deploy path.** We have formatter-
|
||||
level tests, filter-level tests, and a manual uvicorn smoke test
|
||||
(verified one JSON access log line). What we don't have is a test
|
||||
that builds the container, runs it, curls `/healthz`, and asserts
|
||||
on the captured stdout. That's #28 + #29 + #30 territory; the Actions
|
||||
workflow's post-deploy `curl` step will be the first real end-to-end
|
||||
assertion.
|
||||
|
||||
* **Deprecation warning is fixed, but we're still using
|
||||
`pythonjsonlogger.json` — a 4.x module path.** If the library ever
|
||||
does another reorg or drops the API we're using, the log format
|
||||
quietly breaks. `python-json-logger` is maintained but small; worth
|
||||
pinning to a range (`>=3.2,<5`) in `pyproject.toml` if we ever hit
|
||||
a 5.x migration.
|
||||
|
||||
* **The intake form's §3.7 still says "Endpoint does not exist yet
|
||||
— will land in the repo before the first deploy cut."** Harmless
|
||||
because the issue is closed, but stale. If we ever revisit the
|
||||
form as a template for another app, clean that up.
|
||||
|
||||
* **Session memory about TDD enforcement.** I've been saving feedback
|
||||
memories about user-specific patterns. This session reinforced that
|
||||
the subagent-driven skill's TDD gate works when prompts explicitly
|
||||
require failing-first verification. Saving the pattern is worth it,
|
||||
but I'm wary of over-memoing the workflow meta; Jeff's CLAUDE.md
|
||||
global already prefers "do the work, don't narrate the workflow."
|
||||
|
||||
* **Rate-limit numbers still open.** Filed as a platform-team ask on
|
||||
#25; the response didn't address numbers specifically but the
|
||||
middleware is deployed at `10/30 per-IP` (matches my starter
|
||||
proposal in the intake). If that's wrong for this tenant we'll hear
|
||||
about it when real traffic hits.
|
||||
|
||||
* **CLAUDE.md Session Log is out of date.** The file has a Session 1
|
||||
entry; Session 2's wiki page exists now but CLAUDE.md still reflects
|
||||
pre-session state. Left unmodified this session because the CLAUDE.md
|
||||
working-tree diff was Jeff's pre-session edit; merging my own
|
||||
Session 2 references on top would have entangled our changes. Next
|
||||
session he'll want to land both (his stale edits + a Session 2 bullet
|
||||
in the log).
|
||||
|
||||
## Raw Thinking
|
||||
|
||||
* **The intake form's "explain what you've already tried" structure
|
||||
was load-bearing.** When Jeff asked me to "fill out what I can and
|
||||
ask questions," the right shape was a draft-with-question-list,
|
||||
not a comment dump. Surfacing six specific binary-ish questions
|
||||
("option A or B?") compressed a long back-and-forth into one
|
||||
round-trip. This pairs with his "minimalist answers" pattern from
|
||||
Session 1: if I'm clear about my defaults, he answers by approving
|
||||
or swapping; if I'm vague, I get silence and ship the wrong thing.
|
||||
|
||||
* **Subagent-driven felt like overkill for Task 1 (dep bump) and
|
||||
Task 3 (test-only additions) — but it caught real issues anyway.**
|
||||
The Task 2 code reviewer flagged the `pythonjsonlogger.jsonlogger`
|
||||
deprecation before we'd written a line of production code that
|
||||
depended on it. The Task 5 reviewer caught the autouse fixture
|
||||
wrong. These catches would have leaked through "just write the
|
||||
commit" execution. The cost was roughly 2-3x the token budget of
|
||||
direct execution; the quality delta was worth it on a branch that
|
||||
needs to work in prod.
|
||||
|
||||
* **The spec-and-plan artifacts had real value beyond process
|
||||
ceremony.** Each implementer subagent got pasted a task body of
|
||||
100-200 lines that was self-contained: exact file paths, exact
|
||||
code, exact commit message, verification commands. No subagent
|
||||
asked "what should the format string be" or "what fields should
|
||||
the filter set." That prep-once-execute-many pattern is what makes
|
||||
subagent-driven work at all. The plan file is almost a shared
|
||||
cache across subagents.
|
||||
|
||||
* **Platform contracts are surprisingly lightweight when one tenant
|
||||
already shipped.** Archon went through this first, so the intake
|
||||
form inherited structure, the wiki template existed, and the
|
||||
platform-team response was mechanical. For the Quartermaster
|
||||
tenant this worked out to maybe an hour of thinking and a morning
|
||||
of execution. Future tenants should benefit more.
|
||||
|
||||
* **The Traefik-not-FastAPI auth choice ripples into the code.**
|
||||
Because auth lives at the edge, `/healthz` stays unauthenticated
|
||||
with zero effort inside the app, and there's no session middleware
|
||||
or login template to build. This is why "infrastructure-level
|
||||
auth" is such a good default for small single-user tools —
|
||||
the app stays boring.
|
||||
|
||||
## What's Next
|
||||
|
||||
Priority order for the next session:
|
||||
|
||||
1. **#28 Dockerfile.** Run as `USER 1000:1000`, Alembic + uvicorn
|
||||
entrypoint, `HEALTHCHECK` hitting /healthz. Local build + run
|
||||
smoke test. Should be small.
|
||||
2. **#29 compose.yml.** Consumes #28. The bulk of the Traefik labels,
|
||||
resource limits, required container labels. Mind the four-slash
|
||||
`QUARTERMASTER_DB_URL`.
|
||||
3. **#30 Forgejo Actions deploy workflow.** Build, tag by git SHA,
|
||||
push to the Forgejo registry, SSH to home-ctr-onyx, `docker compose
|
||||
pull && up -d`, post-deploy `curl /healthz`. First real end-to-end
|
||||
deploy test.
|
||||
4. **First production rollout.** Flip the DNS, hit the live URL
|
||||
through Traefik, verify basic auth works, verify JSON logs flow
|
||||
into Loki with `event`/`level` as labels. Should be uneventful;
|
||||
the interesting part was the work leading up to it.
|
||||
5. **#31 cleanups.** Fold into whichever follow-up PR naturally
|
||||
touches `service.py` or `routes_health.py`.
|
||||
6. **Deferred items from #1 still open.** Constrain posting dates to
|
||||
the month, closed-month archive treatment, cross-month
|
||||
summaries. Nothing about this session disturbed the priority of
|
||||
those.
|
||||
Loading…
Reference in a new issue