From 5df30ca81d7c2c4159d1cf27df75138b4229d465 Mon Sep 17 00:00:00 2001
From: claude-code
Date: Sun, 19 Apr 2026 18:32:10 -0600
Subject: [PATCH] =?UTF-8?q?retro:=20Session=203=20=E2=80=94=20deploy=20pip?=
 =?UTF-8?q?eline=20live=20on=20home-ctr-onyx?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 Session3.md | 237 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)
 create mode 100644 Session3.md

diff --git a/Session3.md b/Session3.md
new file mode 100644
index 0000000..5c4b73c
--- /dev/null
+++ b/Session3.md
+# Session 3 Notes — 2026-04-19
+
+## What We Set Out to Do
+
+Finish the deploy pipeline. Session 2 left three dependency-chained
+issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo
+Actions workflow. Goal was to land all three, merge them to `main`,
+and watch the first automated rollout put Quartermaster live at
+`https://quartermaster.unbiasedgeek.com/`.
+
+## What Actually Happened
+
+Four PRs merged, all three target issues closed, one post-deploy
+bug fix.
+
+1. **#32 Dockerfile (#28).** `python:3.12-slim-bookworm` base, `uv`
+   pulled from its official image and pinned at `0.5.11`,
+   `uv sync --no-dev --frozen` in two layers for cache friendliness,
+   `USER 1000:1000`, `EXPOSE 8000`, `HEALTHCHECK` against `/healthz`
+   using `python -c` + `urllib` (python-slim has no curl). Entrypoint
+   runs `alembic upgrade head` (the pre-upgrade backup hook fires
+   automatically via `alembic/env.py`) then `exec uvicorn`. Local
+   smoke passed: built, ran with a tempfile DB URL, hit `/healthz`,
+   JSON logs on stdout, container marked `healthy`, SQLite file
+   landed in the bind mount owned `1000:1000`.
+
+2. **#33 compose.yml (#29).** Single `quartermaster` service with
+   the image tag parameterised via `${QUARTERMASTER_TAG:-latest}`
+   so CI can pin a SHA per deploy via a host-side `.env` rather
+   than editing the file.
+   `/mnt/quartermaster:/data` bind mount,
+   `QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db` (four
+   slashes — the intake comment had three, which would put the DB
+   off the mount). `proxy-net` external, 1 GB mem+memswap,
+   `unless-stopped`, `json-file` logging capped at 50 MB × 3, plus
+   all twelve Traefik + required container labels from the
+   platform contract. Validated structurally against the wiki's
+   label list by parsing the file with pyyaml inside the #32 image.
+
+3. **#34 Forgejo Actions workflow (#30).** The one with the
+   interesting design conversation. First draft had an SSH step
+   from runner to host, needing `DEPLOY_SSH_KEY` +
+   `DEPLOY_KNOWN_HOSTS` secrets. Jeff pushed back ("do we need to
+   store a ssh private key in the repo?"), and the re-read landed
+   on the right answer: the `homelab` runner *lives on
+   home-ctr-onyx itself* with the host's Docker socket mounted —
+   so `docker compose pull && up -d` from the runner already
+   manages the production container directly, no SSH hop needed.
+   Dropped two secrets and the private-key risk surface. Final
+   workflow: checkout → buildx → registry login → build + push
+   (tagged with `${{ github.sha }}` and `latest`) → write `.env` +
+   `docker compose pull` + `up -d` → smoke
+   `curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD
+   https://quartermaster.unbiasedgeek.com/healthz` with 10 × 3 s
+   retries. Two repo-scoped secrets:
+   `REGISTRY_TOKEN` (archeious Forgejo PAT with `read:package` +
+   `write:package`) and `QUARTERMASTER_SMOKE_PASSWORD` (plaintext
+   basic-auth).
+
+4. **#35 uvicorn proxy-headers fix.** First merge-to-main fired
+   the workflow; the image rolled, `/healthz` returned 200, smoke
+   went green. But the browser page rendered unstyled. Root cause:
+   Starlette's `url_for()` in templates emitted the stylesheet
+   link with an `http://` href
+   because uvicorn had been started without `--proxy-headers`, so
+   `X-Forwarded-Proto` from Traefik was ignored. Browsers blocked
+   the http CSS as mixed content on the https page.
+   Reproduced locally by curling the pre-fix image with
+   `-H 'Host: quartermaster.unbiasedgeek.com'
+   -H 'X-Forwarded-Proto: https'`; added `--proxy-headers
+   --forwarded-allow-ips='*'` to `docker/entrypoint.sh`. Safe to
+   trust all forwarded IPs because `compose.yml` publishes no host
+   port; only Traefik on `proxy-net` can reach port 8000.
+
+## Key Decisions & Reasoning
+
+* **No SSH step in the deploy workflow.** Explicit win for
+  minimising secret surface. The `homelab` runner + mounted Docker
+  socket pattern means most single-tenant deploys on home-ctr-onyx
+  should default to "no SSH" unless there's a real cross-host
+  hop. Worth remembering for the next service we onboard.
+
+* **Image tag parameterised via `QUARTERMASTER_TAG`, not hard-coded
+  `latest` or `:${github.sha}` in the compose file.** This keeps
+  the checked-in compose file generic and pushes the tag decision
+  to the deploy layer. The Actions workflow writes
+  `QUARTERMASTER_TAG=` to `.env` next to the compose file,
+  and `docker compose` auto-loads `.env`. Rollback is manual but
+  cheap: set `QUARTERMASTER_TAG` back to a prior SHA,
+  `docker compose up -d`.
+
+* **`COMPOSE_PROJECT_NAME=quartermaster` pinned in the workflow
+  env.** Without this, compose derives the project name from the
+  runner's ephemeral workspace directory, so successive deploys
+  could fight over whether an existing container belongs to the
+  project. With it, every deploy identifies the same production
+  container by project label. `container_name: quartermaster` in
+  compose.yml already forces a stable container name; pinning
+  project name is belt-and-suspenders.
+
+* **Smoke test hits the public URL with basic-auth creds, not the
+  container directly.** Option (a) over option (b) in the pre-#30
+  question. The whole point is catching regressions in TLS, DNS,
+  Traefik routing, and the basic-auth middleware — not just "is
+  the container up" (which the image's `HEALTHCHECK` already
+  tells us).
+  `QUARTERMASTER_SMOKE_PASSWORD` is plaintext on the
+  tenant side; the platform team stores the bcrypt hash. Rotation
+  flows through both in lockstep.
+
+* **Mint `REGISTRY_TOKEN` via the Forgejo API, not the UI.** When
+  Jeff asked if I could do it as the `archeious` admin, my first
+  pass said no — the MCP bot user is `claude-code` with
+  `is_admin: false`. But sourcing `/home/jeff/projects/homelab-IaC/bin/load-ops-secrets`
+  exposed `FORGEJO_ARCHEIOUS_PASSWORD`, which unlocked basic-auth
+  against `POST /api/v1/users/archeious/tokens`. Whole flow
+  (delete-if-exists, create, register via
+  `PUT /repos/.../actions/secrets/REGISTRY_TOKEN`, verify) ran in
+  one bash block with the raw token never hitting the transcript.
+
+## Surprises & Discoveries
+
+* **The CSS bug was invisible from pytest, invisible from a
+  `/healthz`-only smoke, invisible until a human eyeball saw the
+  rendered page.** `url_for()` is only exercised when templates
+  render; `/healthz` returns raw JSON and doesn't exercise the
+  template layer at all. No pytest coverage because the tests run
+  against `TestClient` which doesn't go through Traefik. Future
+  lesson: any service that renders HTML through `url_for()` needs
+  at least one smoke that loads the index page and greps the output
+  for the expected href scheme.
+
+* **README "Docker" section I wrote in #32 got trimmed on review
+  before merge; I had re-added equivalent content across #29 and
+  #30 without realising.** Looking back, the three sections I
+  added (Docker in #32, Deploy in #29, CI/CD in #30) together
+  covered the ground the first one tried to cover alone. Jeff's
+  edit was the right call — better to let the content emerge
+  alongside each artefact than to front-load it.
+
+* **Forgejo Actions secret flow via API is pleasant.**
+  `PUT /repos/{owner}/{repo}/actions/secrets/{name}` with
+  `{"data": "..."}` is idempotent (201 or 204, both success) and
+  doesn't need a preceding GET.
+  Easier than the GitHub equivalent
+  which requires fetching the repo's libsodium public key and
+  encrypting client-side. For one-off provisioning from an ops
+  shell this is a nice ergonomic.
+
+* **The runner being on the deploy host changes the default
+  design.** Came up twice this session: once as "no SSH needed",
+  once as "the build and push both use the same Docker daemon
+  that will run the container." Small detail with big
+  consequences.
+
+## Concerns & Open Threads
+
+* **#26 and #27 landed on main but were never closed on Forgejo.**
+  Their commits reference the issues in parentheses (`(#26)`) but
+  don't use `Closes #N`. Not a big deal; a one-line API call
+  closes them when we want.
+
+* **The smoke test retries for 30 s total.** If a first-ever Let's
+  Encrypt cert issuance takes longer than that (DNS-01 can be
+  slow under load), the smoke curl fails with a TLS error and the
+  workflow goes red — harmless, re-run from the Actions UI once
+  the cert is up. A richer smoke with a longer cert-acquisition
+  window would be nice, but it'd complicate the step and isn't
+  worth it for a homelab app that re-certs once every 90 days.
+
+* **No browser-rendered-page test.** The `/healthz` smoke doesn't
+  exercise template URL generation. A CI step that loads `/` and
+  asserts that every `` and `` href parses as https
+  would have caught #35 before it hit the live site. Probably
+  worth filing.
+
+* **#31 polish still open.** Logger placement in `service.py`,
+  middleware-vs-router comment in `routes_health.py`, richer
+  `template_entry_updated` extras. Fold into whichever PR next
+  touches those files.
+
+* **Rollback is manual.** For v1 this is fine (single-user app,
+  rollback = `QUARTERMASTER_TAG=; docker compose up -d`),
+  but a one-line re-deploy job that takes a tag would be worth
+  ~30 minutes of work once we have a reason to roll back under
+  pressure.
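The smoke retry budget discussed in this section (10 attempts, 3 s apart) could be sketched as a small polling helper. This is only an illustration of the retry logic, not the workflow's actual step, which uses `curl`. The names `wait_until_healthy`, `probe`, `attempts`, and `delay_s` are hypothetical, and the probe is injected so the timing can be exercised without a network.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay_s` between tries.

    Returns True as soon as one probe succeeds, False if every attempt fails.
    An exception raised by the probe (for example a TLS error while a
    certificate is still being issued) counts as a failed attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat as "not ready yet" and fall through to the retry
        if attempt < attempts:
            sleep(delay_s)  # no sleep after the final attempt
    return False
```

Under this shape, widening the cert-acquisition window would be a matter of raising `attempts` or `delay_s` rather than restructuring the step.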
+
+## Raw Thinking
+
+* **The "first deploy is the real test" framing was accurate in
+  both directions.** On one hand, every pre-deploy artefact I
+  built worked as specified — the image ran, the compose file
+  parsed, the workflow fired, the secrets authenticated. On the
+  other hand, a real bug (proxy-headers) only showed up when a
+  human viewed the live page. The local dev loop is good but
+  finite; production traffic finds the gap between
+  "looks-tested" and "tested".
+
+* **Admin-level work via API vs UI.** I initially deflected on
+  "mint the registry token" with "I'd need archeious's
+  credentials." Jeff nudged: "try sourcing load-ops-secrets."
+  That pattern — "check the environment before giving up" — is a
+  reusable heuristic. Ops tooling that front-loads secrets into
+  the shell is common; I should poke at it before declining.
+
+* **Dropping the SSH step felt both obvious and non-obvious.**
+  Obvious in retrospect: the runner is on the deploy host, why
+  would I SSH? Non-obvious while in the middle of drafting: the
+  issue description literally says "SSHes to home-ctr-onyx" and
+  I was pattern-matching on that instead of re-reading the
+  platform contract's note that the runner has the Docker socket
+  mounted. Lesson: when a deploy spec says "SSH to host X", ask
+  whether the runner IS host X.
+
+* **The #35 fix was fast because the reproduction was tight.**
+  One local `docker run` with two Traefik-style headers, grep
+  output for the stylesheet href. Seeing the wrong URL in 30 s
+  meant I could be certain the fix was at the right layer before
+  committing. The alternative — "rebuild, push, deploy, look at
+  the live site, iterate" — would have been several minutes per
+  loop. Worth knowing the no-`--proxy-headers` failure mode is
+  easy to reproduce with curl.
+
+## What's Next
+
+Nothing blocks launch — launch happened. Remaining queue:
+
+1. **Close #26 and #27 manually.** Housekeeping.
+2. **#31 polish.** Pick up opportunistically when touching the
+   affected files.
+3. **Deferred items from earlier sessions.** Constrain posting
+   dates to the month, closed-month archive treatment, cross-month
+   summaries, copy-forward months. Nothing about production
+   deployment changes the priority here.
+4. **Operational shakedown.** Watch the first Loki-indexed access
+   log pattern, see whether the rate-limit numbers (10/30 per-IP)
+   make sense for real traffic, confirm the restic nightly backup
+   of `/mnt/quartermaster/` actually includes everything we expect.
+5. **Browser-rendered smoke test.** File an issue. Would have
+   caught #35.
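A browser-rendered smoke along the lines of item 5 might start from a sketch like the one below. It assumes the check is: take the index page's HTML, collect `href` and `src` attributes from asset tags, and flag any absolute `http://` URL. The names `AssetURLCollector` and `insecure_urls` are hypothetical, fetching the page is left out, and the set of tags inspected is an assumption rather than the project's decision.

```python
from html.parser import HTMLParser


class AssetURLCollector(HTMLParser):
    """Collect href/src attribute values from link, script, and img tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Only asset-loading tags matter for mixed-content blocking;
        # plain <a> navigation links are deliberately ignored here.
        if tag not in {"link", "script", "img"}:
            return
        for name, value in attrs:
            if name in {"href", "src"} and value:
                self.urls.append(value)


def insecure_urls(html: str) -> list:
    """Return asset URLs in `html` that use an absolute http:// scheme."""
    parser = AssetURLCollector()
    parser.feed(html)
    return [u for u in parser.urls if u.lower().startswith("http://")]
```

In a smoke step, a non-empty result from `insecure_urls(page_html)` would fail the deploy. Relative URLs pass, since the browser resolves them against the page's own https scheme.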