retro: Session 3 — deploy pipeline live on home-ctr-onyx

parent 4cd63859d1, commit 5df30ca81d
1 changed file with 237 additions and 0 deletions
Session3.md (normal file)
@ -0,0 +1,237 @@

# Session 3 Notes — 2026-04-19

## What We Set Out to Do

Finish the deploy pipeline. Session 2 left three dependency-chained
issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo
Actions workflow. Goal was to land all three, merge them to `main`,
and watch the first automated rollout put Quartermaster live at
`https://quartermaster.unbiasedgeek.com/`.

## What Actually Happened

Four PRs merged, all three target issues closed, one post-deploy
bug fix.

1. **#32 Dockerfile (#28).** `python:3.12-slim-bookworm` base, `uv`
   pulled from its official image and pinned at `0.5.11`,
   `uv sync --no-dev --frozen` in two layers for cache friendliness,
   `USER 1000:1000`, `EXPOSE 8000`, `HEALTHCHECK` against `/healthz`
   using `python -c` + `urllib` (python-slim has no curl). Entrypoint
   runs `alembic upgrade head` (the pre-upgrade backup hook fires
   automatically via `alembic/env.py`) then `exec uvicorn`. Local
   smoke passed: built, ran with a tempfile DB URL, hit `/healthz`,
   JSON logs on stdout, container marked `healthy`, SQLite file
   landed in the bind mount owned `1000:1000`.

2. **#33 compose.yml (#29).** Single `quartermaster` service with
   the image tag parameterised via `${QUARTERMASTER_TAG:-latest}`
   so CI can pin a SHA per deploy via a host-side `.env` rather
   than editing the file. `/mnt/quartermaster:/data` bind mount,
   `QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db` (four
   slashes — the intake comment had three, which would put the DB
   off the mount). `proxy-net` external, 1 GB mem+memswap,
   `unless-stopped`, `json-file` logging capped at 50 MB × 3, plus
   all twelve Traefik + required container labels from the
   platform contract. Validated structurally against the wiki's
   label list by parsing the file with pyyaml inside the #32 image.

3. **#34 Forgejo Actions workflow (#30).** The one with the
   interesting design conversation. First draft had an SSH step
   from runner to host, needing `DEPLOY_SSH_KEY` +
   `DEPLOY_KNOWN_HOSTS` secrets. Jeff pushed back ("do we need to
   store a ssh private key in the repo?"), and the re-read landed
   on the right answer: the `homelab` runner *lives on
   home-ctr-onyx itself* with the host's Docker socket mounted —
   so `docker compose pull && up -d` from the runner already
   manages the production container directly, no SSH hop needed.
   Dropped two secrets and the private-key risk surface. Final
   workflow: checkout → buildx → registry login → build + push
   (tagged with `${{ github.sha }}` and `latest`) → write `.env` +
   `docker compose pull` + `up -d` → smoke
   `curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD
   https://quartermaster.unbiasedgeek.com/healthz` with 10 × 3 s
   retries. Two repo-scoped secrets:
   `REGISTRY_TOKEN` (archeious Forgejo PAT with `read:package` +
   `write:package`) and `QUARTERMASTER_SMOKE_PASSWORD` (plaintext
   basic-auth).

4. **#35 uvicorn proxy-headers fix.** First merge-to-main fired
   the workflow; the image rolled, `/healthz` returned 200, smoke
   went green. But the browser page rendered unstyled. Root cause:
   Starlette's `url_for()` in templates emitted
   `<link rel="stylesheet" href="http://<internal>/static/app.css">`
   because uvicorn had been started without `--proxy-headers`, so
   `X-Forwarded-Proto` from Traefik was ignored. Browsers blocked
   the http CSS as mixed content on the https page. Reproduced
   locally by curling the pre-fix image with
   `-H 'Host: quartermaster.unbiasedgeek.com'
   -H 'X-Forwarded-Proto: https'`; added `--proxy-headers
   --forwarded-allow-ips='*'` to `docker/entrypoint.sh`. Safe to
   trust all forwarded IPs because `compose.yml` publishes no host
   port; only Traefik on `proxy-net` can reach port 8000.

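The `HEALTHCHECK` from item 1 can be sketched as a small standard-library probe. This is an illustration rather than the script in the image: the function names, the 2 s timeout, and the injectable `opener` parameter are assumptions added here so the logic can be exercised without a running container.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0, opener=urllib.request.urlopen) -> bool:
    """Return True when `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status
        # (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False


def exit_code(healthy: bool) -> int:
    """Docker's HEALTHCHECK treats exit 0 as healthy and exit 1 as unhealthy."""
    return 0 if healthy else 1
```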
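The four-slash detail from item 2 is easy to check by hand. The sketch below mimics, with the standard library only, how a SQLAlchemy-style SQLite URL maps to a filesystem path: everything after `sqlite:///` is the path, so a fourth slash is what makes it absolute. The helper names are hypothetical, and the check assumes POSIX paths.

```python
import os.path

PREFIX = "sqlite:///"  # scheme, empty host, then the separator slash


def sqlite_db_path(url: str) -> str:
    """Return the filesystem path encoded in a SQLAlchemy-style SQLite URL."""
    if not url.startswith(PREFIX):
        raise ValueError(f"not a SQLite file URL: {url!r}")
    return url[len(PREFIX):]


def lands_on_mount(url: str, mount: str = "/data") -> bool:
    """True only if the path is absolute and sits under `mount`."""
    path = sqlite_db_path(url)
    # A relative path resolves against the process working directory,
    # which is not the bind mount.
    return os.path.isabs(path) and os.path.commonpath([path, mount]) == mount
```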
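The smoke step in item 3 is a `curl` with retries; the same shape in Python looks roughly like this. It is a sketch, not the workflow's code: the function names and the injectable `check` and `sleep` parameters are assumptions, and it accepts only a 200 where `curl -f` would accept any status below 400.

```python
import base64
import time
import urllib.error
import urllib.request
from typing import Callable


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def fetch_status(url: str, user: str, password: str) -> int:
    """GET `url` with basic auth; raises URLError/HTTPError on failure."""
    req = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(user, password)}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def smoke(check: Callable[[], int], attempts: int = 10, delay: float = 3.0,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `check` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            if check() == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # TLS not ready, 401/5xx, connection refused: try again
        if attempt < attempts:
            sleep(delay)
    return False
```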
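The failure in item 4 comes down to which scheme the app believes the request arrived on. The sketch below is a deliberately simplified model of proxy-header handling, not uvicorn's implementation: a forwarded scheme is honoured only when the connecting peer is in an allow-list. All names and the peer address are hypothetical. As far as I know, uvicorn's default allow-list is loopback only, which is why `--forwarded-allow-ips` matters as much as `--proxy-headers` when Traefik connects over a Docker network.

```python
def effective_scheme(wire_scheme: str, headers: dict[str, str],
                     client_ip: str, trusted: set[str]) -> str:
    """Pick the scheme used for URL generation.

    `wire_scheme` is what the app server saw on the socket (plain http
    behind a TLS-terminating proxy). The header is honoured only when the
    peer is trusted; "*" trusts every peer.
    """
    trusted_peer = "*" in trusted or client_ip in trusted
    forwarded = headers.get("x-forwarded-proto")
    if trusted_peer and forwarded in ("http", "https"):
        return forwarded
    return wire_scheme


def static_url(scheme: str, host: str, path: str) -> str:
    """Absolute URL as a template helper would emit it."""
    return f"{scheme}://{host}{path}"
```

With the header ignored, the stylesheet URL comes out as `http://...`, which a browser blocks as mixed content on an https page; with the peer trusted, the same call yields `https://...`.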
## Key Decisions & Reasoning

* **No SSH step in the deploy workflow.** Explicit win for
  minimising secret surface. The `homelab` runner + mounted Docker
  socket pattern means most single-tenant deploys on home-ctr-onyx
  should default to "no SSH" unless there's a real cross-host
  hop. Worth remembering for the next service we onboard.

* **Image tag parameterised via `QUARTERMASTER_TAG`, not hard-coded
  `latest` or `:${github.sha}` in the compose file.** This keeps
  the checked-in compose file generic and pushes the tag decision
  to the deploy layer. The Actions workflow writes
  `QUARTERMASTER_TAG=<sha>` to `.env` next to the compose file,
  and `docker compose` auto-loads `.env`. Rollback is manual but
  cheap: set `QUARTERMASTER_TAG` in `.env` back to a prior SHA,
  then run `docker compose up -d`.

* **`COMPOSE_PROJECT_NAME=quartermaster` pinned in the workflow
  env.** Without this, compose derives the project name from the
  runner's ephemeral workspace directory, so successive deploys
  could fight over whether an existing container belongs to the
  project. With it, every deploy identifies the same production
  container by project label. `container_name: quartermaster` in
  compose.yml already forces a stable container name; pinning
  project name is belt-and-suspenders.

* **Smoke test hits the public URL with basic-auth creds, not the
  container directly.** Option (a) over option (b) in the pre-#30
  question. The whole point is catching regressions in TLS, DNS,
  Traefik routing, and the basic-auth middleware — not just "is
  the container up" (which the image's `HEALTHCHECK` already
  tells us). `QUARTERMASTER_SMOKE_PASSWORD` is plaintext on the
  tenant side; the platform team stores the bcrypt hash. Rotation
  flows through both in lockstep.

* **Mint `REGISTRY_TOKEN` via the Forgejo API, not the UI.** When
  Jeff asked if I could do it as the `archeious` admin, my first
  pass said no — the MCP bot user is `claude-code` with
  `is_admin: false`. But sourcing `/home/jeff/projects/homelab-IaC/bin/load-ops-secrets`
  exposed `FORGEJO_ARCHEIOUS_PASSWORD`, which unlocked basic-auth
  against `POST /api/v1/users/archeious/tokens`. Whole flow
  (delete-if-exists, create, register via
  `PUT /repos/.../actions/secrets/REGISTRY_TOKEN`, verify) ran in
  one bash block with the raw token never hitting the transcript.

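The tag-pinning decision above has two halves: the deploy layer writes `.env`, and compose resolves `${QUARTERMASTER_TAG:-latest}`. A minimal sketch of both, with hypothetical helper names and the assumption that `.env` holds only this one key:

```python
from pathlib import Path


def write_env(compose_dir: Path, sha: str) -> Path:
    """Write the pinned tag next to the compose file (overwrites .env)."""
    env_file = compose_dir / ".env"
    env_file.write_text(f"QUARTERMASTER_TAG={sha}\n", encoding="utf-8")
    return env_file


def read_env(env_file: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env: dict[str, str] = {}
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            env[key.strip()] = value.strip()
    return env


def resolve_tag(env: dict[str, str]) -> str:
    """Mimic `${QUARTERMASTER_TAG:-latest}`: the default applies when the
    variable is unset or empty."""
    return env.get("QUARTERMASTER_TAG") or "latest"
```

Rollback under this model is the same write with an earlier SHA followed by `docker compose up -d`.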
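Why pinning `COMPOSE_PROJECT_NAME` matters can be shown with a rough model of how the project name is chosen. This sketch covers only the environment variable and the directory fallback; it leaves out the `-p` flag, a top-level `name:` in the compose file, and compose's exact normalisation rules. The workspace paths are made up.

```python
import os.path


def project_name(env: dict[str, str], workdir: str) -> str:
    """Explicit COMPOSE_PROJECT_NAME wins; otherwise fall back to the
    (lowercased) basename of the directory compose runs in."""
    explicit = env.get("COMPOSE_PROJECT_NAME")
    if explicit:
        return explicit
    return os.path.basename(os.path.normpath(workdir)).lower()


# Two runs in different ephemeral workspaces (hypothetical paths):
unpinned = {project_name({}, "/tmp/act-7f3a"), project_name({}, "/tmp/act-91bc")}
pinned_env = {"COMPOSE_PROJECT_NAME": "quartermaster"}
pinned = {project_name(pinned_env, "/tmp/act-7f3a"),
          project_name(pinned_env, "/tmp/act-91bc")}
```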
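The token-minting call from the last bullet can be sketched as a request builder. It only constructs the request and never sends it. The base URL, token name, and function name are placeholders; the endpoint and scopes are the ones named in these notes, and the JSON body shape (`name`, `scopes`) is my understanding of the Forgejo token API rather than something recorded in the session.

```python
import base64
import json
import urllib.request


def mint_token_request(base_url: str, user: str, password: str,
                       token_name: str, scopes: list[str]) -> urllib.request.Request:
    """Build POST /api/v1/users/{user}/tokens with basic auth."""
    body = json.dumps({"name": token_name, "scopes": scopes}).encode()
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/v1/users/{user}/tokens",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",
        },
    )
```

Reading the password from the environment and piping the response straight into the next call is what keeps the raw token out of the transcript.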
## Surprises & Discoveries

* **The CSS bug was invisible from pytest, invisible from a
  `/healthz`-only smoke, invisible until a human eyeball saw the
  rendered page.** `url_for()` is only exercised when templates
  render; `/healthz` returns raw JSON and doesn't exercise the
  template layer at all. No pytest coverage because the tests run
  against `TestClient` which doesn't go through Traefik. Future
  lesson: any service that renders HTML through `url_for()` needs
  at least one smoke that loads the index page and greps the output
  for the expected href scheme.

* **The README "Docker" section I wrote in #32 got trimmed on review
  before merge; I then re-added equivalent content across #29 and
  #30 without realising.** Looking back, the three sections I
  added (Docker in #32, Deploy in #29, CI/CD in #30) together
  covered the ground the first one tried to cover alone. Jeff's
  edit was the right call — better to let the content emerge
  alongside each artefact than to front-load it.

* **Forgejo Actions secret flow via API is pleasant.**
  `PUT /repos/{owner}/{repo}/actions/secrets/{name}` with
  `{"data": "..."}` is idempotent (201 or 204, both success) and
  doesn't need a preceding GET. Easier than the GitHub equivalent,
  which requires fetching the repo's libsodium public key and
  encrypting client-side. For one-off provisioning from an ops
  shell this is pleasantly ergonomic.

* **The runner being on the deploy host changes the default
  design.** Came up twice this session: once as "no SSH needed",
  once as "the build and push both use the same Docker daemon
  that will run the container." Small detail with big
  consequences.

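The secret-registration call described above can be sketched the same way. The request is built but not sent. The base URL, owner, repo, and function names are placeholders, and the `token` authorization scheme is an assumption about how the API is authenticated here; the path, the `{"data": ...}` body, and the two success codes come from the notes.

```python
import json
import urllib.request


def put_secret_request(base_url: str, owner: str, repo: str, name: str,
                       value: str, api_token: str) -> urllib.request.Request:
    """Build PUT /api/v1/repos/{owner}/{repo}/actions/secrets/{name}."""
    return urllib.request.Request(
        f"{base_url}/api/v1/repos/{owner}/{repo}/actions/secrets/{name}",
        data=json.dumps({"data": value}).encode(),
        method="PUT",
        headers={
            "Authorization": f"token {api_token}",
            "Content-Type": "application/json",
        },
    )


def secret_stored(status: int) -> bool:
    """201 means created, 204 means updated; both count as success."""
    return status in (201, 204)
```

Because create and update share one verb and both return success, the call can be re-run safely from an ops shell.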
## Concerns & Open Threads

* **#26 and #27 landed on main but were never closed on Forgejo.**
  Their commits reference the issues in parentheses (`(#26)`) but
  don't use `Closes #N`. Not a big deal; a one-line API call
  closes them when we want.

* **The smoke test retries for 30 s total.** If a first-ever Let's
  Encrypt cert issuance takes longer than that (DNS-01 can be
  slow under load), the smoke curl fails with a TLS error and the
  workflow goes red — harmless, re-run from the Actions UI once
  the cert is up. A richer smoke with a longer cert-acquisition
  window would be nice, but it'd complicate the step and isn't
  worth it for a homelab app that re-certs once every 90 days.

* **No browser-rendered-page test.** The `/healthz` smoke doesn't
  exercise template URL generation. A CI step that loads `/` and
  asserts that every `<link>` href and `<img>` src parses as https
  would have caught #35 before it hit the live site. Probably
  worth filing.

* **#31 polish still open.** Logger placement in `service.py`,
  middleware-vs-router comment in `routes_health.py`, richer
  `template_entry_updated` extras. Fold into whichever PR next
  touches those files.

* **Rollback is manual.** For v1 this is fine (single-user app,
  rollback = set `QUARTERMASTER_TAG=<prev-sha>` in `.env`, then
  `docker compose up -d`), but a one-line re-deploy job that takes
  a tag would be worth ~30 minutes of work once we have a reason
  to roll back under pressure.

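The rendered-page check proposed above could look something like the following. It is a sketch under assumptions, not a filed design: the class and function names are hypothetical, it inspects `<link href>`, `<img src>`, and `<script src>`, and it treats relative URLs as safe because they inherit the page's scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AssetCollector(HTMLParser):
    """Collect the URL-bearing attribute of link, img, and script tags."""

    URL_ATTRS = {"link": "href", "img": "src", "script": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if wanted is not None and name == wanted and value:
                self.urls.append(value)


def insecure_assets(html: str) -> list[str]:
    """Return asset URLs that would be blocked as mixed content on https."""
    collector = AssetCollector()
    collector.feed(html)
    return [url for url in collector.urls if urlsplit(url).scheme == "http"]
```

A smoke step would fetch `/` with the same basic-auth credentials as the `/healthz` check and fail the job when `insecure_assets` returns anything.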
## Raw Thinking

* **The "first deploy is the real test" framing was accurate in
  both directions.** On one hand, every pre-deploy artefact I
  built worked as specified — the image ran, the compose file
  parsed, the workflow fired, the secrets authenticated. On the
  other hand, a real bug (proxy-headers) only showed up when a
  human viewed the live page. The local dev loop is good but
  finite; production traffic finds the gap between
  "looks-tested" and "tested".

* **Admin-level work via API vs UI.** I initially deflected on
  "mint the registry token" with "I'd need archeious's
  credentials." Jeff nudged: "try sourcing load-ops-secrets."
  That pattern — "check the environment before giving up" — is a
  reusable heuristic. Ops tooling that front-loads secrets into
  the shell is common; I should poke at it before declining.

* **Dropping the SSH step felt both obvious and non-obvious.**
  Obvious in retrospect: the runner is on the deploy host, why
  would I SSH? Non-obvious while in the middle of drafting: the
  issue description literally says "SSHes to home-ctr-onyx" and
  I was pattern-matching on that instead of re-reading the
  platform contract's note that the runner has the Docker socket
  mounted. Lesson: when a deploy spec says "SSH to host X", ask
  whether the runner IS host X.

* **The #35 fix was fast because the reproduction was tight.**
  One local `docker run` with two Traefik-style headers, grep
  output for the stylesheet href. Seeing the wrong URL in 30 s
  meant I could be certain the fix was at the right layer before
  committing. The alternative — "rebuild, push, deploy, look at
  the live site, iterate" — would have been several minutes per
  loop. Worth knowing the no-`--proxy-headers` failure mode is
  easy to reproduce with curl.

## What's Next

Nothing blocks launch — launch happened. Remaining queue:

1. **Close #26 and #27 manually.** Housekeeping.
2. **#31 polish.** Pick up opportunistically when touching the
   affected files.
3. **Deferred items from earlier sessions.** Constrain posting
   dates to the month, closed-month archive treatment, cross-month
   summaries, copy-forward months. Nothing about production
   deployment changes the priority here.
4. **Operational shakedown.** Watch the first Loki-indexed access
   log pattern, see whether the rate-limit numbers (10/30 per-IP)
   make sense for real traffic, confirm the restic nightly backup
   of `/mnt/quartermaster/` actually includes everything we expect.
5. **Browser-rendered smoke test.** File an issue. Would have
   caught #35.