retro: Session 3 — deploy pipeline live on home-ctr-onyx
parent 4cd63859d1
commit 5df30ca81d
1 changed file with 237 additions and 0 deletions: Session3.md

# Session 3 Notes — 2026-04-19

## What We Set Out to Do

Finish the deploy pipeline. Session 2 left three dependency-chained
issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo
Actions workflow. Goal was to land all three, merge them to `main`,
and watch the first automated rollout put Quartermaster live at
`https://quartermaster.unbiasedgeek.com/`.

## What Actually Happened

Four PRs merged, all three target issues closed, one post-deploy
bug fix.

1. **#32 Dockerfile (#28).** `python:3.12-slim-bookworm` base, `uv`
   pulled from its official image and pinned at `0.5.11`,
   `uv sync --no-dev --frozen` in two layers for cache friendliness,
   `USER 1000:1000`, `EXPOSE 8000`, `HEALTHCHECK` against `/healthz`
   using `python -c` + `urllib` (python-slim has no curl). Entrypoint
   runs `alembic upgrade head` (the pre-upgrade backup hook fires
   automatically via `alembic/env.py`) then `exec uvicorn`. Local
   smoke passed: built, ran with a tempfile DB URL, hit `/healthz`,
   JSON logs on stdout, container marked `healthy`, SQLite file
   landed in the bind mount owned `1000:1000`.

2. **#33 compose.yml (#29).** Single `quartermaster` service with
   the image tag parameterised via `${QUARTERMASTER_TAG:-latest}`
   so CI can pin a SHA per deploy via a host-side `.env` rather
   than editing the file. `/mnt/quartermaster:/data` bind mount,
   `QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db` (four
   slashes — the intake comment had three, which would put the DB
   off the mount). `proxy-net` external, 1 GB mem+memswap,
   `unless-stopped`, `json-file` logging capped at 50 MB × 3, plus
   all twelve Traefik + required container labels from the
   platform contract. Validated structurally against the wiki's
   label list by parsing the file with pyyaml inside the #32 image.

3. **#34 Forgejo Actions workflow (#30).** The one with the
   interesting design conversation. First draft had an SSH step
   from runner to host, needing `DEPLOY_SSH_KEY` +
   `DEPLOY_KNOWN_HOSTS` secrets. Jeff pushed back ("do we need to
   store a ssh private key in the repo?"), and the re-read landed
   on the right answer: the `homelab` runner *lives on
   home-ctr-onyx itself* with the host's Docker socket mounted —
   so `docker compose pull && docker compose up -d` from the runner
   already manages the production container directly, no SSH hop
   needed. Dropped two secrets and the private-key risk surface.
   Final workflow: checkout → buildx → registry login → build + push
   (tagged with `${{ github.sha }}` and `latest`) → write `.env` +
   `docker compose pull` + `up -d` → smoke
   `curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD
   https://quartermaster.unbiasedgeek.com/healthz` with 10 × 3 s
   retries. Two repo-scoped secrets:
   `REGISTRY_TOKEN` (archeious Forgejo PAT with `read:package` +
   `write:package`) and `QUARTERMASTER_SMOKE_PASSWORD` (plaintext
   basic-auth).

4. **#35 uvicorn proxy-headers fix.** First merge-to-main fired
   the workflow; the image rolled, `/healthz` returned 200, smoke
   went green. But the browser page rendered unstyled. Root cause:
   Starlette's `url_for()` in templates emitted
   `<link rel="stylesheet" href="http://<internal>/static/app.css">`
   because uvicorn had been started without `--proxy-headers`, so
   `X-Forwarded-Proto` from Traefik was ignored. Browsers blocked
   the http CSS as mixed content on the https page. Reproduced
   locally by curling the pre-fix image with
   `-H 'Host: quartermaster.unbiasedgeek.com'
   -H 'X-Forwarded-Proto: https'`; added `--proxy-headers
   --forwarded-allow-ips='*'` to `docker/entrypoint.sh`. Safe to
   trust all forwarded IPs because `compose.yml` publishes no host
   port; only Traefik on `proxy-net` can reach port 8000.

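The four-slash detail in item 2 is easy to misread, so here is a minimal sketch of the path arithmetic. It assumes the SQLAlchemy convention that everything after `sqlite:///` is the file path, relative unless it starts with another `/`. The helper `sqlite_file_path` and the `/app` working directory are hypothetical names chosen for illustration; this is not how SQLAlchemy itself parses URLs.

```python
import posixpath

SQLITE_PREFIX = "sqlite:///"


def sqlite_file_path(db_url: str, cwd: str) -> str:
    """Return the file a SQLite URL points at, resolved against cwd."""
    if not db_url.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a file-backed SQLite URL: {db_url!r}")
    raw_path = db_url[len(SQLITE_PREFIX):]
    # posixpath.join discards cwd when raw_path is already absolute.
    return posixpath.normpath(posixpath.join(cwd, raw_path))


# "/app" is an assumed container working directory, not taken from the Dockerfile.
four = sqlite_file_path("sqlite:////data/quartermaster.db", cwd="/app")
three = sqlite_file_path("sqlite:///data/quartermaster.db", cwd="/app")
print(four)   # /data/quartermaster.db
print(three)  # /app/data/quartermaster.db
```

With three slashes the database would land under the working directory, outside the `/data` bind mount, and would not survive a container replacement.
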
## Key Decisions & Reasoning

* **No SSH step in the deploy workflow.** Explicit win for
  minimising secret surface. The `homelab` runner + mounted Docker
  socket pattern means most single-tenant deploys on home-ctr-onyx
  should default to "no SSH" unless there's a real cross-host
  hop. Worth remembering for the next service we onboard.

* **Image tag parameterised via `QUARTERMASTER_TAG`, not hard-coded
  `latest` or `:${github.sha}` in the compose file.** This keeps
  the checked-in compose file generic and pushes the tag decision
  to the deploy layer. The Actions workflow writes
  `QUARTERMASTER_TAG=<sha>` to `.env` next to the compose file,
  and `docker compose` auto-loads `.env`. Rollback is manual but
  cheap: set `QUARTERMASTER_TAG` back to a prior SHA,
  `docker compose up -d`.

* **`COMPOSE_PROJECT_NAME=quartermaster` pinned in the workflow
  env.** Without this, compose derives the project name from the
  runner's ephemeral workspace directory, so successive deploys
  could fight over whether an existing container belongs to the
  project. With it, every deploy identifies the same production
  container by project label. `container_name: quartermaster` in
  compose.yml already forces a stable container name; pinning
  project name is belt-and-suspenders.

* **Smoke test hits the public URL with basic-auth creds, not the
  container directly.** Option (a) over option (b) in the pre-#30
  question. The whole point is catching regressions in TLS, DNS,
  Traefik routing, and the basic-auth middleware — not just "is
  the container up" (which the image's `HEALTHCHECK` already
  tells us). `QUARTERMASTER_SMOKE_PASSWORD` is plaintext on the
  tenant side; the platform team stores the bcrypt hash. Rotation
  flows through both in lockstep.

* **Mint `REGISTRY_TOKEN` via the Forgejo API, not the UI.** When
  Jeff asked if I could do it as the `archeious` admin, my first
  pass said no — the MCP bot user is `claude-code` with
  `is_admin: false`. But sourcing `/home/jeff/projects/homelab-IaC/bin/load-ops-secrets`
  exposed `FORGEJO_ARCHEIOUS_PASSWORD`, which unlocked basic-auth
  against `POST /api/v1/users/archeious/tokens`. Whole flow
  (delete-if-exists, create, register via
  `PUT /repos/.../actions/secrets/REGISTRY_TOKEN`, verify) ran in
  one bash block with the raw token never hitting the transcript.

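The tag-pinning decision above amounts to rewriting one line in a file, and rollback is the same operation with an older SHA. A minimal sketch, assuming the `.env` holds nothing except `QUARTERMASTER_TAG` (the helper `pin_tag` and both placeholder SHAs are hypothetical; the real workflow step may well be a single shell `echo`):

```python
import tempfile
from pathlib import Path


def pin_tag(compose_dir: Path, tag: str) -> Path:
    """Write QUARTERMASTER_TAG=<tag> to the .env next to compose.yml."""
    if not tag or any(ch.isspace() for ch in tag):
        raise ValueError(f"refusing to write suspicious tag: {tag!r}")
    env_file = compose_dir / ".env"
    # Overwrites the whole file: assumes no other variables live in .env.
    env_file.write_text(f"QUARTERMASTER_TAG={tag}\n", encoding="utf-8")
    return env_file


# A temporary directory stands in for the directory holding compose.yml.
with tempfile.TemporaryDirectory() as tmp:
    deploy_dir = Path(tmp)
    pin_tag(deploy_dir, "newsha1")   # deploy: pin the freshly built image
    pin_tag(deploy_dir, "oldsha0")   # rollback: pin a prior image
    pinned = (deploy_dir / ".env").read_text(encoding="utf-8")

print(pinned, end="")  # QUARTERMASTER_TAG=oldsha0
```

After either call, `docker compose up -d` run from that directory resolves `${QUARTERMASTER_TAG:-latest}` to the pinned value.
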
## Surprises & Discoveries

* **The CSS bug was invisible from pytest, invisible from a
  `/healthz`-only smoke, invisible until a human eyeball saw the
  rendered page.** `url_for()` is only exercised when templates
  render; `/healthz` returns raw JSON and doesn't exercise the
  template layer at all. No pytest coverage because the tests run
  against `TestClient` which doesn't go through Traefik. Future
  lesson: any service that renders HTML through `url_for()` needs
  at least one smoke that loads the index page and greps the output
  for the expected href scheme.

* **The README "Docker" section I wrote in #32 got trimmed on review
  before merge; I then re-added equivalent content across #29 and
  #30 without realising.** Looking back, the three sections I
  added (Docker in #32, Deploy in #29, CI/CD in #30) together
  covered the ground the first one tried to cover alone. Jeff's
  edit was the right call — better to let the content emerge
  alongside each artefact than to front-load it.

* **Forgejo Actions secret flow via API is pleasant.**
  `PUT /repos/{owner}/{repo}/actions/secrets/{name}` with
  `{"data": "..."}` is idempotent (201 or 204, both success) and
  doesn't need a preceding GET. Easier than the GitHub equivalent
  which requires fetching the repo's libsodium public key and
  encrypting client-side. For one-off provisioning from an ops
  shell this is a nice ergonomic.

* **The runner being on the deploy host changes the default
  design.** Came up twice this session: once as "no SSH needed",
  once as "the build and push both use the same Docker daemon
  that will run the container." Small detail with big
  consequences.

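The secret-registration call described above can be sketched with the standard library alone. This only builds the request and does not send it. The `/api/v1` prefix matches the token endpoint quoted earlier in these notes; the `Authorization: token ...` header style, the host name, and the helper names are assumptions for illustration, not a record of what the session's bash block did.

```python
import json
import urllib.request


def build_secret_request(base_url, owner, repo, name, value, api_token):
    """Build (but do not send) the PUT that creates or updates a secret."""
    url = f"{base_url}/api/v1/repos/{owner}/{repo}/actions/secrets/{name}"
    body = json.dumps({"data": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            # Assumed auth style; basic-auth would be set differently.
            "Authorization": f"token {api_token}",
            "Content-Type": "application/json",
        },
    )


def is_success(status: int) -> bool:
    """Both status codes named in the notes count as success."""
    return status in (201, 204)


# Placeholder host, owner, repo, and credentials.
req = build_secret_request(
    "https://forgejo.example.invalid", "some-owner", "some-repo",
    "REGISTRY_TOKEN", "registry-token-value", "api-token-value",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; since the call is idempotent, the caller only needs `is_success(response.status)` and no prior existence check.
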
## Concerns & Open Threads

* **#26 and #27 landed on main but were never closed on Forgejo.**
  Their commits reference the issues in parentheses (`(#26)`) but
  don't use `Closes #N`. Not a big deal; a one-line API call
  closes them when we want.

* **The smoke test retries for 30 s total.** If a first-ever Let's
  Encrypt cert issuance takes longer than that (DNS-01 can be
  slow under load), the smoke curl fails with a TLS error and the
  workflow goes red — harmless, re-run from the Actions UI once
  the cert is up. A richer smoke with a longer cert-acquisition
  window would be nice, but it'd complicate the step and isn't
  worth it for a homelab app that re-certs once every 90 days.

* **No browser-rendered-page test.** The `/healthz` smoke doesn't
  exercise template URL generation. A CI step that loads `/` and
  asserts that every `<link>` href and `<img>` src parses as https
  would have caught #35 before it hit the live site. Probably
  worth filing.

* **#31 polish still open.** Logger placement in `service.py`,
  middleware-vs-router comment in `routes_health.py`, richer
  `template_entry_updated` extras. Fold into whichever PR next
  touches those files.

* **Rollback is manual.** For v1 this is fine (single-user app,
  rollback = `QUARTERMASTER_TAG=<prev-sha> docker compose up -d`),
  but a one-line re-deploy job that takes a tag would be worth
  ~30 minutes of work once we have a reason to roll back under
  pressure.

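The missing browser-rendered check could start as small as the sketch below: parse the page and list any `<link>` or `<img>` URL that is explicitly `http://`. The names `AssetCollector` and `insecure_asset_urls` and the sample markup are hypothetical, and fetching `/` with basic-auth is left out so the check stays a pure function over HTML text.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Tag -> attribute that carries the asset URL.
ASSET_ATTRS = {"link": "href", "img": "src"}


class AssetCollector(HTMLParser):
    """Collect asset URLs from <link href=...> and <img src=...>."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def insecure_asset_urls(html: str) -> list[str]:
    """Return asset URLs that would be blocked as mixed content on https."""
    collector = AssetCollector()
    collector.feed(html)
    return [url for url in collector.urls if urlsplit(url).scheme == "http"]


# Sample markup modelled on the #35 symptom; the host name is a placeholder.
broken = (
    '<link rel="stylesheet" href="http://internal.example/static/app.css">'
    '<img src="/static/logo.png">'
)
fixed = '<link rel="stylesheet" href="https://quartermaster.unbiasedgeek.com/static/app.css">'

print(insecure_asset_urls(broken))  # ['http://internal.example/static/app.css']
print(insecure_asset_urls(fixed))   # []
```

Relative URLs pass deliberately, since they inherit the page's scheme. A stricter variant that requires an explicit `https` scheme would match the wording of the bullet more literally.
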
## Raw Thinking

* **The "first deploy is the real test" framing was accurate in
  both directions.** On one hand, every pre-deploy artefact I
  built worked as specified — the image ran, the compose file
  parsed, the workflow fired, the secrets authenticated. On the
  other hand, a real bug (proxy-headers) only showed up when a
  human viewed the live page. The local dev loop is good but
  finite; production traffic finds the gap between
  "looks-tested" and "tested".

* **Admin-level work via API vs UI.** I initially deflected on
  "mint the registry token" with "I'd need archeious's
  credentials." Jeff nudged: "try sourcing load-ops-secrets."
  That pattern — "check the environment before giving up" — is a
  reusable heuristic. Ops tooling that front-loads secrets into
  the shell is common; I should poke at it before declining.

* **Dropping the SSH step felt both obvious and non-obvious.**
  Obvious in retrospect: the runner is on the deploy host, why
  would I SSH? Non-obvious while in the middle of drafting: the
  issue description literally says "SSHes to home-ctr-onyx" and
  I was pattern-matching on that instead of re-reading the
  platform contract's note that the runner has the Docker socket
  mounted. Lesson: when a deploy spec says "SSH to host X", ask
  whether the runner IS host X.

* **The #35 fix was fast because the reproduction was tight.**
  One local `docker run`, one curl with two Traefik-style headers,
  grep the output for the stylesheet href. Seeing the wrong URL in
  30 s meant I could be certain the fix was at the right layer
  before committing. The alternative — "rebuild, push, deploy, look
  at the live site, iterate" — would have been several minutes per
  loop. Worth knowing the no-`--proxy-headers` failure mode is
  easy to reproduce with curl.

## What's Next

Nothing blocks launch — launch happened. Remaining queue:

1. **Close #26 and #27 manually.** Housekeeping.
2. **#31 polish.** Pick up opportunistically when touching the
   affected files.
3. **Deferred items from earlier sessions.** Constrain posting
   dates to the month, closed-month archive treatment, cross-month
   summaries, copy-forward months. Nothing about production
   deployment changes the priority here.
4. **Operational shakedown.** Watch the first Loki-indexed access
   log pattern, see whether the rate-limit numbers (10/30 per-IP)
   make sense for real traffic, confirm the restic nightly backup
   of `/mnt/quartermaster/` actually includes everything we expect.
5. **Browser-rendered smoke test.** File an issue. Would have
   caught #35.