retro: Session 3 — deploy pipeline live on home-ctr-onyx

claude-code 2026-04-19 18:32:10 -06:00
parent 4cd63859d1
commit 5df30ca81d

Session3.md Normal file

@ -0,0 +1,237 @@
# Session 3 Notes — 2026-04-19
## What We Set Out to Do
Finish the deploy pipeline. Session 2 left three dependency-chained
issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo
Actions workflow. Goal was to land all three, merge them to `main`,
and watch the first automated rollout put Quartermaster live at
`https://quartermaster.unbiasedgeek.com/`.
## What Actually Happened
Four PRs merged, all three target issues closed, one post-deploy
bug fix.
1. **#32 Dockerfile (#28).** `python:3.12-slim-bookworm` base, `uv`
pulled from its official image and pinned at `0.5.11`,
`uv sync --no-dev --frozen` in two layers for cache friendliness,
`USER 1000:1000`, `EXPOSE 8000`, `HEALTHCHECK` against `/healthz`
using `python -c` + `urllib` (python-slim has no curl). Entrypoint
runs `alembic upgrade head` (the pre-upgrade backup hook fires
automatically via `alembic/env.py`) then `exec uvicorn`. Local
smoke passed: built, ran with a tempfile DB URL, hit `/healthz`,
JSON logs on stdout, container marked `healthy`, SQLite file
landed in the bind mount owned `1000:1000`.
2. **#33 compose.yml (#29).** Single `quartermaster` service with
the image tag parameterised via `${QUARTERMASTER_TAG:-latest}`
so CI can pin a SHA per deploy via a host-side `.env` rather
than editing the file. `/mnt/quartermaster:/data` bind mount,
`QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db` (four
slashes — the intake comment had three, which would put the DB
off the mount). `proxy-net` external, 1 GB mem+memswap,
`unless-stopped`, `json-file` logging capped at 50 MB × 3, plus
all twelve Traefik + required container labels from the
platform contract. Validated structurally against the wiki's
label list by parsing the file with pyyaml inside the #32 image.
3. **#34 Forgejo Actions workflow (#30).** The one with the
interesting design conversation. First draft had an SSH step
from runner to host, needing `DEPLOY_SSH_KEY` +
`DEPLOY_KNOWN_HOSTS` secrets. Jeff pushed back ("do we need to
store a ssh private key in the repo?"), and the re-read landed
on the right answer: the `homelab` runner *lives on
home-ctr-onyx itself* with the host's Docker socket mounted —
so `docker compose pull && up -d` from the runner already
manages the production container directly, no SSH hop needed.
Dropped two secrets and the private-key risk surface. Final
workflow: checkout → buildx → registry login → build + push
(tagged with `${{ github.sha }}` and `latest`) → write `.env` +
`docker compose pull` + `up -d` → smoke
`curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD
https://quartermaster.unbiasedgeek.com/healthz` with 10 × 3 s
retries. Two repo-scoped secrets:
`REGISTRY_TOKEN` (archeious Forgejo PAT with `read:package` +
`write:package`) and `QUARTERMASTER_SMOKE_PASSWORD` (plaintext
basic-auth).
4. **#35 uvicorn proxy-headers fix.** First merge-to-main fired
the workflow; the image rolled, `/healthz` returned 200, smoke
went green. But the browser page rendered unstyled. Root cause:
Starlette's `url_for()` in templates emitted
`<link rel="stylesheet" href="http://<internal>/static/app.css">`
because uvicorn had been started without `--proxy-headers`, so
`X-Forwarded-Proto` from Traefik was ignored. Browsers blocked
the http CSS as mixed content on the https page. Reproduced
locally by curling the pre-fix image with
`-H 'Host: quartermaster.unbiasedgeek.com'
-H 'X-Forwarded-Proto: https'`; added `--proxy-headers
--forwarded-allow-ips='*'` to `docker/entrypoint.sh`. Safe to
trust all forwarded IPs because `compose.yml` publishes no host
port; only Traefik on `proxy-net` can reach port 8000.
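
The slash count in item 2 is easy to get wrong, so here is a minimal sketch of why it matters. It assumes the DB URL follows the SQLAlchemy convention (which the Alembic usage suggests), where the file path is whatever follows `sqlite:///`. `sqlite_db_path` is a hypothetical helper for illustration, not something in the Quartermaster codebase.

```python
from pathlib import PurePosixPath

# In SQLAlchemy-style URLs the file path starts after "sqlite:///".
PREFIX = "sqlite:///"


def sqlite_db_path(url: str) -> PurePosixPath:
    """Return the file path encoded in a file-backed SQLite URL."""
    if not url.startswith(PREFIX):
        raise ValueError(f"not a file-backed SQLite URL: {url!r}")
    return PurePosixPath(url[len(PREFIX):])


# Three slashes: the path is relative to the process working directory.
three = sqlite_db_path("sqlite:///data/quartermaster.db")
# Four slashes: the fourth slash is the leading "/" of an absolute path.
four = sqlite_db_path("sqlite:////data/quartermaster.db")

print(three, three.is_absolute())
print(four, four.is_absolute())
```

With three slashes the database would be created relative to the container's working directory, which is how it ends up off the `/data` bind mount.
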
## Key Decisions & Reasoning
* **No SSH step in the deploy workflow.** Explicit win for
minimising secret surface. The `homelab` runner + mounted Docker
socket pattern means most single-tenant deploys on home-ctr-onyx
should default to "no SSH" unless there's a real cross-host
hop. Worth remembering for the next service we onboard.
* **Image tag parameterised via `QUARTERMASTER_TAG`, not hard-coded
`latest` or `:${github.sha}` in the compose file.** This keeps
the checked-in compose file generic and pushes the tag decision
to the deploy layer. The Actions workflow writes
`QUARTERMASTER_TAG=<sha>` to `.env` next to the compose file,
and `docker compose` auto-loads `.env`. Rollback is manual but
cheap: set `QUARTERMASTER_TAG` back to a prior SHA,
`docker compose up -d`.
* **`COMPOSE_PROJECT_NAME=quartermaster` pinned in the workflow
env.** Without this, compose derives the project name from the
runner's ephemeral workspace directory, so successive deploys
could fight over whether an existing container belongs to the
project. With it, every deploy identifies the same production
container by project label. `container_name: quartermaster` in
compose.yml already forces a stable container name; pinning
project name is belt-and-suspenders.
* **Smoke test hits the public URL with basic-auth creds, not the
container directly.** Option (a) over option (b) in the pre-#30
question. The whole point is catching regressions in TLS, DNS,
Traefik routing, and the basic-auth middleware — not just "is
the container up" (which the image's `HEALTHCHECK` already
tells us). `QUARTERMASTER_SMOKE_PASSWORD` is plaintext on the
tenant side; the platform team stores the bcrypt hash. Rotation
flows through both in lockstep.
* **Mint `REGISTRY_TOKEN` via the Forgejo API, not the UI.** When
Jeff asked if I could do it as the `archeious` admin, my first
pass said no — the MCP bot user is `claude-code` with
`is_admin: false`. But sourcing `/home/jeff/projects/homelab-IaC/bin/load-ops-secrets`
exposed `FORGEJO_ARCHEIOUS_PASSWORD`, which unlocked basic-auth
against `POST /api/v1/users/archeious/tokens`. Whole flow
(delete-if-exists, create, register via
`PUT /repos/.../actions/secrets/REGISTRY_TOKEN`, verify) ran in
one bash block with the raw token never hitting the transcript.
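
The tag-pinning decision above boils down to writing one line into `.env` beside the compose file. The real workflow step is presumably shell; the Python below is only an illustrative sketch. `write_tag_env`, the SHA pattern, and the `0123abc` tag are assumptions made up for the example.

```python
import re
import tempfile
from pathlib import Path

# Loose pattern for a short or full git SHA (an assumption for this sketch).
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def write_tag_env(compose_dir: Path, tag: str) -> Path:
    """Write the .env file that `docker compose` auto-loads from compose_dir."""
    if not SHA_RE.fullmatch(tag):
        raise ValueError(f"refusing to pin a non-SHA tag: {tag!r}")
    env_path = compose_dir / ".env"
    env_path.write_text(f"QUARTERMASTER_TAG={tag}\n", encoding="utf-8")
    return env_path


# Demonstrate against a throwaway directory; "0123abc" is a placeholder SHA.
with tempfile.TemporaryDirectory() as tmp:
    env_file = write_tag_env(Path(tmp), "0123abc")
    contents = env_file.read_text(encoding="utf-8")

print(contents, end="")
```

Rollback under this scheme is the same call with a prior SHA, followed by `docker compose up -d`.
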
## Surprises & Discoveries
* **The CSS bug was invisible from pytest, invisible from a
`/healthz`-only smoke, invisible until a human eyeball saw the
rendered page.** `url_for()` is only exercised when templates
render; `/healthz` returns raw JSON and doesn't exercise the
template layer at all. No pytest coverage because the tests run
against `TestClient` which doesn't go through Traefik. Future
lesson: any service that renders HTML through `url_for()` needs
at least one smoke test that loads the index page and greps the
output for the expected href scheme.
* **The README "Docker" section I wrote in #32 got trimmed on
review before merge; I then re-added equivalent content across
#29 and #30 without realising.** Looking back, the three sections I
added (Docker in #32, Deploy in #29, CI/CD in #30) together
covered the ground the first one tried to cover alone. Jeff's
edit was the right call — better to let the content emerge
alongside each artefact than to front-load it.
* **Forgejo Actions secret flow via API is pleasant.**
`PUT /repos/{owner}/{repo}/actions/secrets/{name}` with
`{"data": "..."}` is idempotent (201 or 204, both success) and
doesn't need a preceding GET. Easier than the GitHub equivalent
which requires fetching the repo's libsodium public key and
encrypting client-side. For one-off provisioning from an ops
shell this is pleasantly ergonomic.
* **The runner being on the deploy host changes the default
design.** Came up twice this session: once as "no SSH needed",
once as "the build and push both use the same Docker daemon
that will run the container." Small detail with big
consequences.
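
For reference, the secret-registration call described above can be sketched with the standard library alone. The endpoint path, the `{"data": ...}` body, and the 201/204 success codes come from these notes. The base URL, owner, repo, and the `Authorization: token ...` header are assumptions for illustration (the session itself used basic-auth for the token mint). The request is only built here, never sent.

```python
import json
import urllib.request


def build_secret_request(base_url, owner, repo, name, value, api_token):
    """Build (but do not send) the PUT that creates or updates an Actions secret."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}"
        f"/actions/secrets/{name}"
    )
    body = json.dumps({"data": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; substitute whatever the instance accepts.
            "Authorization": f"token {api_token}",
        },
    )


def is_success(status: int) -> bool:
    """201 (created) and 204 (updated) both count as success."""
    return status in (201, 204)


# Placeholder values only; nothing here is a real host or credential.
req = build_secret_request(
    "https://forgejo.example.invalid",
    "example-owner",
    "example-repo",
    "REGISTRY_TOKEN",
    "not-a-real-token",
    "not-a-real-api-token",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` and feeding the response status to `is_success` would complete the flow.
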
## Concerns & Open Threads
* **#26 and #27 landed on main but were never closed on Forgejo.**
Their commits reference the issues in parentheses (`(#26)`) but
don't use `Closes #N`. Not a big deal; a one-line API call
closes them when we want.
* **The smoke test retries for 30 s total.** If a first-ever Let's
Encrypt cert issuance takes longer than that (DNS-01 can be
slow under load), the smoke curl fails with a TLS error and the
workflow goes red — harmless, re-run from the Actions UI once
the cert is up. A richer smoke with a longer cert-acquisition
window would be nice, but it'd complicate the step and isn't
worth it for a homelab app that re-certs once every 90 days.
* **No browser-rendered-page test.** The `/healthz` smoke doesn't
exercise template URL generation. A CI step that loads `/` and
asserts that every `<link>` href and `<img>` src resolves to
https would have caught #35 before it hit the live site. Probably
worth filing.
* **#31 polish still open.** Logger placement in `service.py`,
middleware-vs-router comment in `routes_health.py`, richer
`template_entry_updated` extras. Fold into whichever PR next
touches those files.
* **Rollback is manual.** For v1 this is fine (single-user app,
rollback = `QUARTERMASTER_TAG=<prev-sha> docker compose up -d`),
but a one-line re-deploy job that takes a tag would be worth
~30 minutes of work once we have a reason to roll back under
pressure.
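
A rough sketch of what the missing rendered-page check could look like, using only the standard library. `AssetCollector`, `insecure_assets`, and the sample page (including the `internal.example` host) are hypothetical. It flags only explicit `http://` URLs, on the assumption that relative and protocol-relative URLs inherit https from the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AssetCollector(HTMLParser):
    """Collect URLs from <link href=...> and <img src=...> tags."""

    ATTR_FOR_TAG = {"link": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTR_FOR_TAG.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def insecure_assets(html: str) -> list[str]:
    """Return asset URLs with an explicit http scheme (mixed content on https)."""
    collector = AssetCollector()
    collector.feed(html)
    return [u for u in collector.urls if urlsplit(u).scheme == "http"]


# Made-up page mimicking the #35 failure: absolute http stylesheet URL.
PAGE = (
    '<html><head>'
    '<link rel="stylesheet" href="http://internal.example/static/app.css">'
    '</head><body>'
    '<img src="/static/logo.png"><img src="https://cdn.example/x.png"/>'
    '</body></html>'
)

bad = insecure_assets(PAGE)
print(bad)
```

A CI step would fetch `/` through the public URL, pass the body to `insecure_assets`, and fail if the list is non-empty.
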
## Raw Thinking
* **The "first deploy is the real test" framing was accurate in
both directions.** On one hand, every pre-deploy artefact I
built worked as specified — the image ran, the compose file
parsed, the workflow fired, the secrets authenticated. On the
other hand, a real bug (proxy-headers) only showed up when a
human viewed the live page. The local dev loop is good but
finite; production traffic finds the gap between
"looks-tested" and "tested".
* **Admin-level work via API vs UI.** I initially deflected on
"mint the registry token" with "I'd need archeious's
credentials." Jeff nudged: "try sourcing load-ops-secrets."
That pattern — "check the environment before giving up" — is a
reusable heuristic. Ops tooling that front-loads secrets into
the shell is common; I should poke at it before declining.
* **Dropping the SSH step felt both obvious and non-obvious.**
Obvious in retrospect: the runner is on the deploy host, why
would I SSH? Non-obvious while in the middle of drafting: the
issue description literally says "SSHes to home-ctr-onyx" and
I was pattern-matching on that instead of re-reading the
platform contract's note that the runner has the Docker socket
mounted. Lesson: when a deploy spec says "SSH to host X", ask
whether the runner IS host X.
* **The #35 fix was fast because the reproduction was tight.**
One local `docker run` with two Traefik-style headers, grep
output for the stylesheet href. Seeing the wrong URL in 30 s
meant I could be certain the fix was at the right layer before
committing. The alternative — "rebuild, push, deploy, look at
the live site, iterate" — would have been several minutes per
loop. Worth knowing the no-`--proxy-headers` failure mode is
easy to reproduce with curl.
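
The same reproduction can be scripted instead of typed as a curl command. This is a sketch under assumptions: `build_proxied_request` and `stylesheet_hrefs` are hypothetical helpers, and the regex assumes `rel` precedes `href` inside the tag, as in the markup quoted earlier in these notes.

```python
import re
import urllib.request


def build_proxied_request(base_url: str, public_host: str) -> urllib.request.Request:
    """Build a GET carrying the two headers used in the curl reproduction."""
    return urllib.request.Request(
        base_url,
        headers={"Host": public_host, "X-Forwarded-Proto": "https"},
    )


# Grep-style match for stylesheet links; assumes rel comes before href.
STYLESHEET_RE = re.compile(r'<link[^>]*rel="stylesheet"[^>]*href="([^"]+)"')


def stylesheet_hrefs(html: str) -> list[str]:
    """Return the href of every stylesheet <link> found in the page."""
    return STYLESHEET_RE.findall(html)


# The local address is a placeholder; the request is built but not sent here.
req = build_proxied_request(
    "http://127.0.0.1:8000/", "quartermaster.unbiasedgeek.com"
)
sample = '<link rel="stylesheet" href="http://app.example/static/app.css">'
print(stylesheet_hrefs(sample))
```

Sending `req` with `urllib.request.urlopen` against a locally running container and passing the decoded body to `stylesheet_hrefs` should show `http://` hrefs before the fix and `https://` after it.
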
## What's Next
Nothing blocks launch — launch happened. Remaining queue:
1. **Close #26 and #27 manually.** Housekeeping.
2. **#31 polish.** Pick up opportunistically when touching the
affected files.
3. **Deferred items from earlier sessions.** Constrain posting
dates to the month, closed-month archive treatment, cross-month
summaries, copy-forward months. Nothing about production
deployment changes the priority here.
4. **Operational shakedown.** Watch the first Loki-indexed access
log pattern, see whether the rate-limit numbers (10/30 per-IP)
make sense for real traffic, confirm the restic nightly backup
of `/mnt/quartermaster/` actually includes everything we expect.
5. **Browser-rendered smoke test.** File an issue. Would have
caught #35.