Session 3 Notes — 2026-04-19
What We Set Out to Do
Finish the deploy pipeline. Session 2 left three dependency-chained
issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo
Actions workflow. Goal was to land all three, merge them to main,
and watch the first automated rollout put Quartermaster live at
https://quartermaster.unbiasedgeek.com/.
What Actually Happened
Four PRs merged, all three target issues closed, one post-deploy bug fix.
- #32 Dockerfile (#28). `python:3.12-slim-bookworm` base, `uv` pulled from its official image and pinned at `0.5.11`, `uv sync --no-dev --frozen` in two layers for cache friendliness, `USER 1000:1000`, `EXPOSE 8000`, `HEALTHCHECK` against `/healthz` using `python -c` + `urllib` (python-slim has no curl). Entrypoint runs `alembic upgrade head` (the pre-upgrade backup hook fires automatically via `alembic/env.py`) then `exec uvicorn`. Local smoke passed: built, ran with a tempfile DB URL, hit `/healthz`, JSON logs on stdout, container marked `healthy`, SQLite file landed in the bind mount owned `1000:1000`.
- #33 compose.yml (#29). Single `quartermaster` service with the image tag parameterised via `${QUARTERMASTER_TAG:-latest}` so CI can pin a SHA per deploy via a host-side `.env` rather than editing the file. `/mnt/quartermaster:/data` bind mount, `QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db` (four slashes; the intake comment had three, which would put the DB off the mount). `proxy-net` external, 1 GB mem+memswap, `unless-stopped`, `json-file` logging capped at 50 MB × 3, plus all twelve Traefik + required container labels from the platform contract. Validated structurally against the wiki's label list by parsing the file with pyyaml inside the #32 image.
- #34 Forgejo Actions workflow (#30). The one with the interesting design conversation. First draft had an SSH step from runner to host, needing `DEPLOY_SSH_KEY` + `DEPLOY_KNOWN_HOSTS` secrets. Jeff pushed back ("do we need to store a ssh private key in the repo?"), and the re-read landed on the right answer: the `homelab` runner lives on home-ctr-onyx itself with the host's Docker socket mounted, so `docker compose pull && up -d` from the runner already manages the production container directly, no SSH hop needed. Dropped two secrets and the private-key risk surface. Final workflow: checkout → buildx → registry login → build + push (tagged with `${{ github.sha }}` and `latest`) → write `.env` + `docker compose pull` + `up -d` → smoke `curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD https://quartermaster.unbiasedgeek.com/healthz` with 10 × 3 s retries. Two repo-scoped secrets: `REGISTRY_TOKEN` (`archeious` Forgejo PAT with `read:package` + `write:package`) and `QUARTERMASTER_SMOKE_PASSWORD` (plaintext basic-auth).
- #35 uvicorn proxy-headers fix. First merge-to-main fired the workflow; the image rolled, `/healthz` returned 200, smoke went green. But the browser page rendered unstyled. Root cause: Starlette's `url_for()` in templates emitted `<link rel="stylesheet" href="http://<internal>/static/app.css">` because uvicorn had been started without `--proxy-headers`, so `X-Forwarded-Proto` from Traefik was ignored. Browsers blocked the http CSS as mixed content on the https page. Reproduced locally by curling the pre-fix image with `-H 'Host: quartermaster.unbiasedgeek.com' -H 'X-Forwarded-Proto: https'`; added `--proxy-headers --forwarded-allow-ips='*'` to `docker/entrypoint.sh`. Safe to trust all forwarded IPs because `compose.yml` publishes no host port; only Traefik on `proxy-net` can reach port 8000.
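The four-slash detail in the compose.yml item is easy to get wrong, so here is a minimal sketch of how the two spellings resolve. It assumes `QUARTERMASTER_DB_URL` follows the SQLAlchemy convention, where everything after `sqlite:///` is the file path (the notes mention Alembic but never name the URL parser, so treat that as an assumption). The helper names `sqlite_db_path` and `is_on_mount` are hypothetical.

```python
from pathlib import PurePosixPath

# Scheme plus the empty-host separator; the path starts right after this.
SQLITE_PREFIX = "sqlite:///"


def sqlite_db_path(url: str) -> PurePosixPath:
    """Return the filesystem path a SQLAlchemy-style SQLite URL points at."""
    if not url.startswith(SQLITE_PREFIX):
        raise ValueError(f"not a file-backed SQLite URL: {url!r}")
    # Three slashes leave a relative path; a fourth slash makes it absolute.
    return PurePosixPath(url[len(SQLITE_PREFIX):])


def is_on_mount(url: str, mount: str = "/data") -> bool:
    """True when the database file sits under the container-side mount."""
    path = sqlite_db_path(url)
    return path.is_absolute() and PurePosixPath(mount) in path.parents
```

With three slashes the path comes out relative (`data/quartermaster.db`), so the file would be created under the process working directory instead of the `/data` bind mount.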
Key Decisions & Reasoning
- No SSH step in the deploy workflow. Explicit win for minimising secret surface. The `homelab` runner + mounted Docker socket pattern means most single-tenant deploys on home-ctr-onyx should default to "no SSH" unless there's a real cross-host hop. Worth remembering for the next service we onboard.
- Image tag parameterised via `QUARTERMASTER_TAG`, not hard-coded `latest` or `:${github.sha}` in the compose file. This keeps the checked-in compose file generic and pushes the tag decision to the deploy layer. The Actions workflow writes `QUARTERMASTER_TAG=<sha>` to `.env` next to the compose file, and `docker compose` auto-loads `.env`. Rollback is manual but cheap: set `QUARTERMASTER_TAG` back to a prior SHA, `docker compose up -d`.
- `COMPOSE_PROJECT_NAME=quartermaster` pinned in the workflow env. Without this, compose derives the project name from the runner's ephemeral workspace directory, so successive deploys could fight over whether an existing container belongs to the project. With it, every deploy identifies the same production container by project label. `container_name: quartermaster` in compose.yml already forces a stable container name; pinning project name is belt-and-suspenders.
- Smoke test hits the public URL with basic-auth creds, not the container directly. Option (a) over option (b) in the pre-#30 question. The whole point is catching regressions in TLS, DNS, Traefik routing, and the basic-auth middleware, not just "is the container up" (which the image's `HEALTHCHECK` already tells us). `QUARTERMASTER_SMOKE_PASSWORD` is plaintext on the tenant side; the platform team stores the bcrypt hash. Rotation flows through both in lockstep.
- Mint `REGISTRY_TOKEN` via the Forgejo API, not the UI. When Jeff asked if I could do it as the `archeious` admin, my first pass said no: the MCP bot user is `claude-code` with `is_admin: false`. But sourcing `/home/jeff/projects/homelab-IaC/bin/load-ops-secrets` exposed `FORGEJO_ARCHEIOUS_PASSWORD`, which unlocked basic-auth against `POST /api/v1/users/archeious/tokens`. Whole flow (delete-if-exists, create, register via `PUT /repos/.../actions/secrets/REGISTRY_TOKEN`, verify) ran in one bash block with the raw token never hitting the transcript.
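The tag-parameterisation decision boils down to one line in `.env`. As a rough sketch, assuming `.env` holds nothing but `QUARTERMASTER_TAG` (the notes do not say whether other keys live there) and using hypothetical helper names, a deploy or rollback step could pin the tag like this. Running `docker compose up -d` afterwards is left to the caller.

```python
from pathlib import Path
from typing import Optional

TAG_KEY = "QUARTERMASTER_TAG"


def render_tag_env(tag: str) -> str:
    """Return the single-line .env body that pins the image tag."""
    if not tag or any(ch.isspace() for ch in tag):
        raise ValueError(f"refusing to write suspicious tag: {tag!r}")
    return f"{TAG_KEY}={tag}\n"


def write_tag_env(compose_dir: Path, tag: str) -> Path:
    """Overwrite <compose_dir>/.env so docker compose picks up the tag."""
    env_path = compose_dir / ".env"
    env_path.write_text(render_tag_env(tag), encoding="utf-8")
    return env_path


def current_tag(compose_dir: Path) -> Optional[str]:
    """Read the pinned tag back, or None when .env or the key is missing."""
    env_path = compose_dir / ".env"
    if not env_path.exists():
        return None
    for line in env_path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == TAG_KEY:
            return value.strip()
    return None
```

A rollback is then the same call with an earlier SHA. Because the file is overwritten wholesale, this sketch would need a merge step if `.env` ever grows a second key.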
Surprises & Discoveries
- The CSS bug was invisible from pytest, invisible from a `/healthz`-only smoke, invisible until a human eyeball saw the rendered page. `url_for()` is only exercised when templates render; `/healthz` returns raw JSON and doesn't exercise the template layer at all. No pytest coverage because the tests run against `TestClient`, which doesn't go through Traefik. Future lesson: any service that renders HTML through `url_for()` needs at least one smoke that loads the index page and greps the output for the expected href scheme.
- The README "Docker" section I wrote in #32 got trimmed on review before merge; I then re-added equivalent content across #29 and #30 without realising. Looking back, the three sections I added (Docker in #32, Deploy in #29, CI/CD in #30) together covered the ground the first one tried to cover alone. Jeff's edit was the right call: better to let the content emerge alongside each artefact than to front-load it.
- Forgejo Actions secret flow via API is pleasant. `PUT /repos/{owner}/{repo}/actions/secrets/{name}` with `{"data": "..."}` is idempotent (201 or 204, both success) and doesn't need a preceding GET. Easier than the GitHub equivalent, which requires fetching the repo's libsodium public key and encrypting client-side. For one-off provisioning from an ops shell this is a nice ergonomic.
- The runner being on the deploy host changes the default design. Came up twice this session: once as "no SSH needed", once as "the build and push both use the same Docker daemon that will run the container." Small detail with big consequences.
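For the secret-provisioning call described in the Forgejo item, a minimal sketch in Python follows. The endpoint shape, the `{"data": ...}` body, and the 201/204 success codes come from these notes; the base URL, the owner and repo values, the `Authorization: token ...` header form, and the helper names are assumptions for illustration.

```python
import json
import urllib.request


def build_secret_request(
    base_url: str, owner: str, repo: str, name: str, value: str, api_token: str
) -> urllib.request.Request:
    """Build (but do not send) the PUT that creates or updates a repo secret."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}"
        f"/actions/secrets/{name}"
    )
    body = json.dumps({"data": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; swap in whatever the instance expects.
            "Authorization": f"token {api_token}",
        },
    )


def put_secret(request: urllib.request.Request, timeout: float = 10.0) -> bool:
    """Send the request; 201 (created) and 204 (updated) both count as success."""
    # urlopen raises HTTPError on 4xx/5xx, so failures surface as exceptions.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status in (201, 204)
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without touching the network or printing the secret.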
Concerns & Open Threads
- #26 and #27 landed on main but were never closed on Forgejo. Their commits reference the issues in parentheses (`(#26)`) but don't use `Closes #N`. Not a big deal; a one-line API call closes them when we want.
- The smoke test retries for 30 s total. If a first-ever Let's Encrypt cert issuance takes longer than that (DNS-01 can be slow under load), the smoke curl fails with a TLS error and the workflow goes red. That is harmless: re-run from the Actions UI once the cert is up. A richer smoke with a longer cert-acquisition window would be nice, but it'd complicate the step and isn't worth it for a homelab app that re-certs once every 90 days.
- No browser-rendered-page test. The `/healthz` smoke doesn't exercise template URL generation. A CI step that loads `/` and asserts that every `<link>` href and `<img>` src parses as https would have caught #35 before it hit the live site. Probably worth filing.
- #31 polish still open. Logger placement in `service.py`, middleware-vs-router comment in `routes_health.py`, richer `template_entry_updated` extras. Fold into whichever PR next touches those files.
- Rollback is manual. For v1 this is fine (single-user app, rollback = set `QUARTERMASTER_TAG=<prev-sha>` in `.env`, then `docker compose up -d`), but a one-line re-deploy job that takes a tag would be worth ~30 minutes of work once we have a reason to roll back under pressure.
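The browser-rendered-page check mentioned in the list above could start as something like the sketch below. It uses only the standard library and flags asset URLs whose scheme is explicitly `http`; relative and protocol-relative URLs pass because they inherit the page's scheme. The class and function names are hypothetical, and fetching the page (with the basic-auth credentials) is left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit

# Which attribute carries the URL for each tag we care about.
URL_ATTRS = {"link": "href", "img": "src", "script": "src"}


class AssetURLCollector(HTMLParser):
    """Collect asset URLs from <link>, <img>, and <script> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def insecure_asset_urls(html: str) -> List[str]:
    """Return every collected URL that would be mixed content on https."""
    collector = AssetURLCollector()
    collector.feed(html)
    return [u for u in collector.urls if urlsplit(u).scheme == "http"]
```

A smoke step would fail the job whenever `insecure_asset_urls(page)` returns a non-empty list for the index page.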
Raw Thinking
- The "first deploy is the real test" framing was accurate in both directions. On one hand, every pre-deploy artefact I built worked as specified: the image ran, the compose file parsed, the workflow fired, the secrets authenticated. On the other hand, a real bug (proxy-headers) only showed up when a human viewed the live page. The local dev loop is good but finite; production traffic finds the gap between "looks-tested" and "tested".
- Admin-level work via API vs UI. I initially deflected on "mint the registry token" with "I'd need archeious's credentials." Jeff nudged: "try sourcing load-ops-secrets." That pattern ("check the environment before giving up") is a reusable heuristic. Ops tooling that front-loads secrets into the shell is common; I should poke at it before declining.
- Dropping the SSH step felt both obvious and non-obvious. Obvious in retrospect: the runner is on the deploy host, why would I SSH? Non-obvious while in the middle of drafting: the issue description literally says "SSHes to home-ctr-onyx" and I was pattern-matching on that instead of re-reading the platform contract's note that the runner has the Docker socket mounted. Lesson: when a deploy spec says "SSH to host X", ask whether the runner IS host X.
- The #35 fix was fast because the reproduction was tight. One local `docker run` with two Traefik-style headers, grep output for the stylesheet href. Seeing the wrong URL in 30 s meant I could be certain the fix was at the right layer before committing. The alternative ("rebuild, push, deploy, look at the live site, iterate") would have been several minutes per loop. Worth knowing the no-`--proxy-headers` failure mode is easy to reproduce with curl.
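The tight reproduction from the #35 item translates to Python roughly as follows. This is a sketch under assumptions: the container is reachable on a local URL such as `http://127.0.0.1:8000/` (the port comes from `EXPOSE 8000`; the published address is a guess), the regex only matches the `rel` then `href` attribute order quoted earlier, and the function names are hypothetical.

```python
import re
import urllib.request
from typing import List

# Matches the attribute order seen in the bug: rel first, then href.
STYLESHEET_HREF = re.compile(r'<link\s+rel="stylesheet"\s+href="([^"]+)"')


def forwarded_request(local_url: str, public_host: str) -> urllib.request.Request:
    """Build a request carrying the two headers Traefik would send."""
    return urllib.request.Request(
        local_url,
        headers={"Host": public_host, "X-Forwarded-Proto": "https"},
    )


def stylesheet_hrefs(html: str) -> List[str]:
    """Pull stylesheet hrefs out of rendered HTML."""
    return STYLESHEET_HREF.findall(html)


def fetch_stylesheet_hrefs(
    request: urllib.request.Request, timeout: float = 5.0
) -> List[str]:
    """Fetch the page and return the stylesheet hrefs it emitted."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return stylesheet_hrefs(resp.read().decode("utf-8", errors="replace"))
```

Against the pre-fix image the returned hrefs would start with `http://`; after the entrypoint change they should start with `https://`.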
What's Next
Nothing blocks launch — launch happened. Remaining queue:
- Close #26 and #27 manually. Housekeeping.
- #31 polish. Pick up opportunistically when touching the affected files.
- Deferred items from earlier sessions. Constrain posting dates to the month, closed-month archive treatment, cross-month summaries, copy-forward months. Nothing about production deployment changes the priority here.
- Operational shakedown. Watch the first Loki-indexed access log pattern, see whether the rate-limit numbers (10/30 per-IP) make sense for real traffic, confirm the restic nightly backup of `/mnt/quartermaster/` actually includes everything we expect.
- Browser-rendered smoke test. File an issue. Would have caught #35.