retro: Session 3 — deploy pipeline live on home-ctr-onyx

2026-04-19 18:32:10 -06:00 · 2026-04-19 18:32:10 -06:00 · 5df30ca81d
commit 5df30ca81d
parent 4cd63859d1
1 changed files with 237 additions and 0 deletions
--- a/Session3.md
+++ b/Session3.md
@@ -0,0 +1,237 @@
 # Session 3 Notes — 2026-04-19
 ## What We Set Out to Do
 Finish the deploy pipeline. Session 2 left three dependency-chained
 issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo
 Actions workflow. Goal was to land all three, merge them to `main`,
 and watch the first automated rollout put Quartermaster live at
 `https://quartermaster.unbiasedgeek.com/`.
 ## What Actually Happened
 Four PRs merged, all three target issues closed, one post-deploy
 bug fix.
 1. **#32 Dockerfile (#28).** `python:3.12-slim-bookworm` base, `uv`
   pulled from its official image and pinned at `0.5.11`,
   `uv sync --no-dev --frozen` in two layers for cache friendliness,
   `USER 1000:1000`, `EXPOSE 8000`, `HEALTHCHECK` against `/healthz`
   using `python -c` + `urllib` (python-slim has no curl). Entrypoint
   runs `alembic upgrade head` (the pre-upgrade backup hook fires
   automatically via `alembic/env.py`) then `exec uvicorn`. Local
   smoke passed: built, ran with a tempfile DB URL, hit `/healthz`,
   JSON logs on stdout, container marked `healthy`, SQLite file
   landed in the bind mount owned `1000:1000`.
 2. **#33 compose.yml (#29).** Single `quartermaster` service with
   the image tag parameterised via `${QUARTERMASTER_TAG:-latest}`
   so CI can pin a SHA per deploy via a host-side `.env` rather
   than editing the file. `/mnt/quartermaster:/data` bind mount,
   `QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db` (four
   slashes — the intake comment had three, which would put the DB
   off the mount). `proxy-net` external, 1 GB mem+memswap,
   `unless-stopped`, `json-file` logging capped at 50 MB × 3, plus
   all twelve Traefik + required container labels from the
   platform contract. Validated structurally against the wiki's
   label list by parsing the file with pyyaml inside the #32 image.
 3. **#34 Forgejo Actions workflow (#30).** The one with the
   interesting design conversation. First draft had an SSH step
   from runner to host, needing `DEPLOY_SSH_KEY` +
   `DEPLOY_KNOWN_HOSTS` secrets. Jeff pushed back ("do we need to
   store a ssh private key in the repo?"), and the re-read landed
   on the right answer: the `homelab` runner *lives on
   home-ctr-onyx itself* with the host's Docker socket mounted —
   so `docker compose pull && docker compose up -d` from the runner
   already manages the production container directly, no SSH hop
   needed.
   Dropped two secrets and the private-key risk surface. Final
   workflow: checkout → buildx → registry login → build + push
   (tagged with `${{ github.sha }}` and `latest`) → write `.env` +
   `docker compose pull` + `up -d` → smoke
   `curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD
   https://quartermaster.unbiasedgeek.com/healthz` with 10 × 3 s
   retries. Two repo-scoped secrets:
   `REGISTRY_TOKEN` (archeious Forgejo PAT with `read:package` +
   `write:package`) and `QUARTERMASTER_SMOKE_PASSWORD` (plaintext
   basic-auth).
 4. **#35 uvicorn proxy-headers fix.** First merge-to-main fired
   the workflow; the image rolled, `/healthz` returned 200, smoke
   went green. But the browser page rendered unstyled. Root cause:
   Starlette's `url_for()` in templates emitted
   `<link rel="stylesheet" href="http://<internal>/static/app.css">`
   because uvicorn had been started without `--proxy-headers`, so
   `X-Forwarded-Proto` from Traefik was ignored. Browsers blocked
   the http CSS as mixed content on the https page. Reproduced
   locally by curling the pre-fix image with
   `-H 'Host: quartermaster.unbiasedgeek.com'
   -H 'X-Forwarded-Proto: https'`; added `--proxy-headers
   --forwarded-allow-ips='*'` to `docker/entrypoint.sh`. Safe to
   trust all forwarded IPs because `compose.yml` publishes no host
   port; only Traefik on `proxy-net` can reach port 8000.
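
The four-slash detail from item 2 is easy to get wrong, so here is a minimal sketch of the rule. It assumes `QUARTERMASTER_DB_URL` follows the SQLAlchemy-style SQLite convention, where everything after `sqlite:///` is the file path. The helper names `sqlite_db_path` and `is_on_mount` are illustrative and not part of the Quartermaster codebase.

```python
import posixpath

SQLITE_PREFIX = "sqlite:///"


def sqlite_db_path(url: str) -> str:
    """Return the filesystem path a sqlite:/// URL points at.

    Everything after the third slash is the path, so a fourth slash
    is what makes the path absolute.
    """
    if not url.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a sqlite:/// URL: {url!r}")
    return url[len(SQLITE_PREFIX):]


def is_on_mount(url: str, mount: str = "/data") -> bool:
    """True if the path is absolute and sits under the mount point."""
    path = sqlite_db_path(url)
    if not posixpath.isabs(path):
        # A relative path resolves against the process working
        # directory, which is not the bind mount.
        return False
    return posixpath.normpath(path).startswith(mount.rstrip("/") + "/")


four = "sqlite:////data/quartermaster.db"
three = "sqlite:///data/quartermaster.db"
print(sqlite_db_path(four), is_on_mount(four))    # /data/quartermaster.db True
print(sqlite_db_path(three), is_on_mount(three))  # data/quartermaster.db False
```

With three slashes the path comes out relative (`data/quartermaster.db`), so the file would be created wherever the process happens to be running rather than on `/data`.
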
 ## Key Decisions & Reasoning
 * **No SSH step in the deploy workflow.** Explicit win for
  minimising secret surface. The `homelab` runner + mounted Docker
  socket pattern means most single-tenant deploys on home-ctr-onyx
  should default to "no SSH" unless there's a real cross-host
  hop. Worth remembering for the next service we onboard.
 * **Image tag parameterised via `QUARTERMASTER_TAG`, not hard-coded
  `latest` or `:${github.sha}` in the compose file.** This keeps
  the checked-in compose file generic and pushes the tag decision
  to the deploy layer. The Actions workflow writes
  `QUARTERMASTER_TAG=<sha>` to `.env` next to the compose file,
  and `docker compose` auto-loads `.env`. Rollback is manual but
  cheap: set `QUARTERMASTER_TAG` back to a prior SHA,
  `docker compose up -d`.
 * **`COMPOSE_PROJECT_NAME=quartermaster` pinned in the workflow
  env.** Without this, compose derives the project name from the
  runner's ephemeral workspace directory, so successive deploys
  could fight over whether an existing container belongs to the
  project. With it, every deploy identifies the same production
  container by project label. `container_name: quartermaster` in
  compose.yml already forces a stable container name; pinning
  project name is belt-and-suspenders.
 * **Smoke test hits the public URL with basic-auth creds, not the
  container directly.** Option (a) over option (b) in the pre-#30
  question. The whole point is catching regressions in TLS, DNS,
  Traefik routing, and the basic-auth middleware — not just "is
  the container up" (which the image's `HEALTHCHECK` already
  tells us). `QUARTERMASTER_SMOKE_PASSWORD` is plaintext on the
  tenant side; the platform team stores the bcrypt hash. Rotation
  flows through both in lockstep.
 * **Mint `REGISTRY_TOKEN` via the Forgejo API, not the UI.** When
  Jeff asked if I could do it as the `archeious` admin, my first
  pass said no — the MCP bot user is `claude-code` with
  `is_admin: false`. But sourcing `/home/jeff/projects/homelab-IaC/bin/load-ops-secrets`
  exposed `FORGEJO_ARCHEIOUS_PASSWORD`, which unlocked basic-auth
  against `POST /api/v1/users/archeious/tokens`. Whole flow
  (delete-if-exists, create, register via
  `PUT /repos/.../actions/secrets/REGISTRY_TOKEN`, verify) ran in
  one bash block with the raw token never hitting the transcript.
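
The tag-pinning decision above can be sketched as a pair of small helpers. This is a hedged illustration, not the workflow's actual step: it assumes the `.env` file holds only `QUARTERMASTER_TAG`, and the function names and the example SHAs are hypothetical.

```python
import tempfile
from pathlib import Path


def write_tag_env(compose_dir: Path, tag: str) -> Path:
    """Write QUARTERMASTER_TAG=<tag> to the .env next to compose.yml.

    Overwrites the file, which is only safe under the assumption
    that .env carries nothing but this one variable.
    """
    if not tag or any(ch.isspace() for ch in tag):
        raise ValueError(f"refusing to write suspicious tag: {tag!r}")
    env_path = compose_dir / ".env"
    env_path.write_text(f"QUARTERMASTER_TAG={tag}\n", encoding="utf-8")
    return env_path


def read_tag_env(compose_dir: Path, default: str = "latest") -> str:
    """Read the pinned tag, mirroring ${QUARTERMASTER_TAG:-latest}."""
    env_path = compose_dir / ".env"
    if not env_path.exists():
        return default
    for line in env_path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "QUARTERMASTER_TAG":
            # ":-" falls back when the variable is unset or empty.
            return value.strip() or default
    return default


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        compose_dir = Path(tmp)
        print(read_tag_env(compose_dir))      # no .env yet
        write_tag_env(compose_dir, "abc1234")  # hypothetical deploy SHA
        print(read_tag_env(compose_dir))
        write_tag_env(compose_dir, "9f8e7d6")  # hypothetical rollback SHA
        print(read_tag_env(compose_dir))
```

Rollback in this model is just another call to `write_tag_env` with the earlier SHA, followed by `docker compose up -d`.
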
 ## Surprises & Discoveries
 * **The CSS bug was invisible from pytest, invisible from a
  `/healthz`-only smoke, invisible until a human eyeball saw the
  rendered page.** `url_for()` is only exercised when templates
  render; `/healthz` returns raw JSON and doesn't exercise the
  template layer at all. No pytest coverage because the tests run
  against `TestClient` which doesn't go through Traefik. Future
  lesson: any service that renders HTML through `url_for()` needs
  at least one smoke that loads the index page and greps the output
  for the expected href scheme.
 * **The README "Docker" section I wrote in #32 was trimmed on review
  before merge; I then re-added equivalent content across #29 and
  #30 without realising.** Looking back, the three sections I
  added (Docker in #32, Deploy in #29, CI/CD in #30) together
  covered the ground the first one tried to cover alone. Jeff's
  edit was the right call — better to let the content emerge
  alongside each artefact than to front-load it.
 * **Forgejo Actions secret flow via API is pleasant.**
  `PUT /repos/{owner}/{repo}/actions/secrets/{name}` with
  `{"data": "..."}` is idempotent (201 or 204, both success) and
  doesn't need a preceding GET. Easier than the GitHub equivalent
  which requires fetching the repo's libsodium public key and
  encrypting client-side. For one-off provisioning from an ops
  shell this is pleasantly ergonomic.
 * **The runner being on the deploy host changes the default
  design.** Came up twice this session: once as "no SSH needed",
  once as "the build and push both use the same Docker daemon
  that will run the container." Small detail with big
  consequences.
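
The secret-registration call described above can be sketched with the standard library alone. This only builds the request and does not send it. The base URL, owner, repo, and token values are placeholders, and the `Authorization: token ...` header is an assumption about how the caller authenticates; the session itself does not record which scheme was used for this call.

```python
import json
import urllib.request


def build_secret_request(base_url: str, owner: str, repo: str,
                         name: str, value: str,
                         api_token: str) -> urllib.request.Request:
    """Build the PUT that creates or updates one Actions secret."""
    url = (f"{base_url.rstrip('/')}/api/v1/repos/"
           f"{owner}/{repo}/actions/secrets/{name}")
    body = json.dumps({"data": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; swap for whatever the caller uses.
            "Authorization": f"token {api_token}",
        },
    )


def is_success(status: int) -> bool:
    """201 (created) and 204 (updated) both count as success."""
    return status in (201, 204)


req = build_secret_request(
    "https://forgejo.example",       # hypothetical host
    "example-owner", "example-repo",  # hypothetical repo coordinates
    "REGISTRY_TOKEN", "placeholder-value", "placeholder-api-token",
)
# Print only non-sensitive parts; never echo the body or the header.
print(req.get_method(), req.full_url)
```

Because the same call serves both create and update, a provisioning script can run it unconditionally and check only `is_success` on the response status.
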
 ## Concerns & Open Threads
 * **#26 and #27 landed on main but were never closed on Forgejo.**
  Their commits reference the issues in parentheses (`(#26)`) but
  don't use `Closes #N`. Not a big deal; a one-line API call
  closes them when we want.
 * **The smoke test retries for 30 s total.** If a first-ever Let's
  Encrypt cert issuance takes longer than that (DNS-01 can be
  slow under load), the smoke curl fails with a TLS error and the
  workflow goes red — harmless, re-run from the Actions UI once
  the cert is up. A richer smoke with a longer cert-acquisition
  window would be nice, but it'd complicate the step and isn't
  worth it for a homelab app that re-certs once every 90 days.
 * **No browser-rendered-page test.** The `/healthz` smoke doesn't
  exercise template URL generation. A CI step that loads `/` and
  asserts that every `<link>` href and `<img>` src parses as https
  would have caught #35 before it hit the live site. Probably
  worth filing.
 * **#31 polish still open.** Logger placement in `service.py`,
  middleware-vs-router comment in `routes_health.py`, richer
  `template_entry_updated` extras. Fold into whichever PR next
  touches those files.
 * **Rollback is manual.** For v1 this is fine (single-user app,
  rollback = `QUARTERMASTER_TAG=<prev-sha> docker compose up -d`),
  but a one-line re-deploy job that takes a tag would be worth
  ~30 minutes of work once we have a reason to roll back under
  pressure.
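
The browser-rendered check proposed above could look roughly like the sketch below. It is a starting point under stated assumptions, not a finished CI step: it inspects only `<link href>` and `<img src>`, treats relative and protocol-relative URLs as acceptable, and flags anything with an explicit `http` scheme. The sample page and the names `AssetCollector` and `insecure_assets` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Tag -> attribute that carries the asset URL.
ASSET_ATTRS = {"link": "href", "img": "src"}


class AssetCollector(HTMLParser):
    """Collect asset URLs from <link> and <img> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def insecure_assets(html: str) -> list[str]:
    """Return asset URLs that would load over plain http."""
    collector = AssetCollector()
    collector.feed(html)
    return [u for u in collector.urls if urlsplit(u).scheme == "http"]


page = """
<html><head>
<link rel="stylesheet" href="http://internal/static/app.css">
</head><body>
<img src="https://quartermaster.unbiasedgeek.com/static/logo.png">
<img src="/static/icon.png">
</body></html>
"""
print(insecure_assets(page))  # ['http://internal/static/app.css']
```

A smoke step would fetch `/` with the same basic-auth credentials as the `/healthz` check and fail the job if `insecure_assets` returns anything.
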
 ## Raw Thinking
 * **The "first deploy is the real test" framing was accurate in
  both directions.** On one hand, every pre-deploy artefact I
  built worked as specified — the image ran, the compose file
  parsed, the workflow fired, the secrets authenticated. On the
  other hand, a real bug (proxy-headers) only showed up when a
  human viewed the live page. The local dev loop is good but
  finite; production traffic finds the gap between
  "looks-tested" and "tested".
 * **Admin-level work via API vs UI.** I initially deflected on
  "mint the registry token" with "I'd need archeious's
  credentials." Jeff nudged: "try sourcing load-ops-secrets."
  That pattern — "check the environment before giving up" — is a
  reusable heuristic. Ops tooling that front-loads secrets into
  the shell is common; I should poke at it before declining.
 * **Dropping the SSH step felt both obvious and non-obvious.**
  Obvious in retrospect: the runner is on the deploy host, why
  would I SSH? Non-obvious while in the middle of drafting: the
  issue description literally says "SSHes to home-ctr-onyx" and
  I was pattern-matching on that instead of re-reading the
  platform contract's note that the runner has the Docker socket
  mounted. Lesson: when a deploy spec says "SSH to host X", ask
  whether the runner IS host X.
 * **The #35 fix was fast because the reproduction was tight.**
  One local `docker run` with two Traefik-style headers, grep
  output for the stylesheet href. Seeing the wrong URL in 30 s
  meant I could be certain the fix was at the right layer before
  committing. The alternative — "rebuild, push, deploy, look at
  the live site, iterate" — would have been several minutes per
  loop. Worth knowing the no-`--proxy-headers` failure mode is
  easy to reproduce with curl.
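
The failure mode behind #35 can also be modelled without a container. The sketch below is a deliberately simplified model of how a server behind a TLS-terminating proxy might pick the request scheme; it is not uvicorn's implementation. The function names, the `trusted_ips` parameter, and the client IP are illustrative assumptions.

```python
def effective_scheme(client_ip: str, headers: dict[str, str],
                     proxy_headers: bool, trusted_ips: set[str]) -> str:
    """Pick the scheme the app would see for a proxied request.

    The proxy-to-app hop is plain http, so "http" is the answer
    unless forwarded headers are both enabled and trusted.
    """
    if not proxy_headers:
        return "http"
    if "*" not in trusted_ips and client_ip not in trusted_ips:
        return "http"
    normalised = {k.lower(): v for k, v in headers.items()}
    forwarded = normalised.get("x-forwarded-proto", "").strip().lower()
    return forwarded if forwarded in ("http", "https") else "http"


def stylesheet_url(scheme: str, host: str) -> str:
    """Absolute URL of the kind a template url_for() call would emit."""
    return f"{scheme}://{host}/static/app.css"


headers = {"Host": "quartermaster.unbiasedgeek.com",
           "X-Forwarded-Proto": "https"}
proxy_ip = "172.18.0.2"  # hypothetical address of the proxy container

before = effective_scheme(proxy_ip, headers, False, set())
after = effective_scheme(proxy_ip, headers, True, {"*"})
print(stylesheet_url(before, headers["Host"]))
print(stylesheet_url(after, headers["Host"]))
```

The first URL comes out as `http://...` and the second as `https://...`, which is the same before/after difference the curl reproduction showed in the stylesheet href.
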
 ## What's Next
 Nothing blocks launch — launch happened. Remaining queue:
 1. **Close #26 and #27 manually.** Housekeeping.
 2. **#31 polish.** Pick up opportunistically when touching the
   affected files.
 3. **Deferred items from earlier sessions.** Constrain posting
   dates to the month, closed-month archive treatment, cross-month
   summaries, copy-forward months. Nothing about production
   deployment changes the priority here.
 4. **Operational shakedown.** Watch the first Loki-indexed access
   log pattern, see whether the rate-limit numbers (10/30 per-IP)
   make sense for real traffic, confirm the restic nightly backup
   of `/mnt/quartermaster/` actually includes everything we expect.
 5. **Browser-rendered smoke test.** File an issue. Would have
   caught #35.