1 Session3
claude-code edited this page 2026-04-19 18:32:10 -06:00

Session 3 Notes — 2026-04-19

What We Set Out to Do

Finish the deploy pipeline. Session 2 left three dependency-chained issues on the table: #28 Dockerfile, #29 compose.yml, #30 Forgejo Actions workflow. Goal was to land all three, merge them to main, and watch the first automated rollout put Quartermaster live at https://quartermaster.unbiasedgeek.com/.

What Actually Happened

Four PRs merged, all three target issues closed, one post-deploy bug fix.

  1. #32 Dockerfile (#28). python:3.12-slim-bookworm base, uv pulled from its official image and pinned at 0.5.11, uv sync --no-dev --frozen in two layers for cache friendliness, USER 1000:1000, EXPOSE 8000, HEALTHCHECK against /healthz using python -c + urllib (python-slim has no curl). Entrypoint runs alembic upgrade head (the pre-upgrade backup hook fires automatically via alembic/env.py) then exec uvicorn. Local smoke passed: built, ran with a tempfile DB URL, hit /healthz, JSON logs on stdout, container marked healthy, SQLite file landed in the bind mount owned 1000:1000.

  2. #33 compose.yml (#29). Single quartermaster service with the image tag parameterised via ${QUARTERMASTER_TAG:-latest} so CI can pin a SHA per deploy via a host-side .env rather than editing the file. /mnt/quartermaster:/data bind mount, QUARTERMASTER_DB_URL=sqlite:////data/quartermaster.db (four slashes — the intake comment had three, which would put the DB off the mount). proxy-net external, 1 GB mem+memswap, unless-stopped, json-file logging capped at 50 MB × 3, plus all twelve Traefik + required container labels from the platform contract. Validated structurally against the wiki's label list by parsing the file with pyyaml inside the #32 image.

  3. #34 Forgejo Actions workflow (#30). The one with the interesting design conversation. First draft had an SSH step from runner to host, needing DEPLOY_SSH_KEY + DEPLOY_KNOWN_HOSTS secrets. Jeff pushed back ("do we need to store a ssh private key in the repo?"), and the re-read landed on the right answer: the homelab runner lives on home-ctr-onyx itself with the host's Docker socket mounted — so docker compose pull && up -d from the runner already manages the production container directly, no SSH hop needed. Dropped two secrets and the private-key risk surface. Final workflow: checkout → buildx → registry login → build + push (tagged with ${{ github.sha }} and latest) → write .env + docker compose pull + up -d → smoke curl -fsS -u admin:$QUARTERMASTER_SMOKE_PASSWORD https://quartermaster.unbiasedgeek.com/healthz with 10 × 3 s retries. Two repo-scoped secrets: REGISTRY_TOKEN (archeious Forgejo PAT with read:package + write:package) and QUARTERMASTER_SMOKE_PASSWORD (plaintext basic-auth).

  4. #35 uvicorn proxy-headers fix. First merge-to-main fired the workflow; the image rolled, /healthz returned 200, smoke went green. But the browser page rendered unstyled. Root cause: Starlette's url_for() in templates emitted <link rel="stylesheet" href="http://<internal>/static/app.css"> because uvicorn had been started without --proxy-headers, so X-Forwarded-Proto from Traefik was ignored. Browsers blocked the http CSS as mixed content on the https page. Reproduced locally by curling the pre-fix image with -H 'Host: quartermaster.unbiasedgeek.com' -H 'X-Forwarded-Proto: https'; added --proxy-headers --forwarded-allow-ips='*' to docker/entrypoint.sh. Safe to trust all forwarded IPs because compose.yml publishes no host port; only Traefik on proxy-net can reach port 8000.
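
The smoke step at the end of the #34 workflow (basic-auth GET against /healthz, 10 attempts 3 s apart) can be sketched in Python as below. This is an illustration only: the real step uses curl, and the names basic_auth_request, probe and smoke are hypothetical. The sleep function is injectable so the retry logic can be exercised without waiting.

```python
import base64
import time
import urllib.request
from typing import Callable


def basic_auth_request(url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request carrying an HTTP basic-auth header."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def probe(request: urllib.request.Request, timeout: float = 5.0) -> bool:
    """Return True for a 2xx response; roughly what curl -f treats as success."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), TLS failures and timeouts all derive
        # from OSError, so any of them counts as a failed attempt.
        return False


def smoke(
    check: Callable[[], bool],
    attempts: int = 10,
    delay_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay_s)
    return False
```

A caller would wire the pieces together with something like smoke(lambda: probe(basic_auth_request(url, "admin", password))), reading the password from the environment rather than hard-coding it.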

Key Decisions & Reasoning

  • No SSH step in the deploy workflow. Explicit win for minimising secret surface. The homelab runner + mounted Docker socket pattern means most single-tenant deploys on home-ctr-onyx should default to "no SSH" unless there's a real cross-host hop. Worth remembering for the next service we onboard.

  • Image tag parameterised via QUARTERMASTER_TAG, not hard-coded latest or :${github.sha} in the compose file. This keeps the checked-in compose file generic and pushes the tag decision to the deploy layer. The Actions workflow writes QUARTERMASTER_TAG=<sha> to .env next to the compose file, and docker compose auto-loads .env. Rollback is manual but cheap: set QUARTERMASTER_TAG back to a prior SHA, docker compose up -d.

  • COMPOSE_PROJECT_NAME=quartermaster pinned in the workflow env. Without this, compose derives the project name from the runner's ephemeral workspace directory, so successive deploys could fight over whether an existing container belongs to the project. With it, every deploy identifies the same production container by project label. container_name: quartermaster in compose.yml already forces a stable container name; pinning project name is belt-and-suspenders.

  • Smoke test hits the public URL with basic-auth creds, not the container directly. Option (a) over option (b) in the pre-#30 question. The whole point is catching regressions in TLS, DNS, Traefik routing, and the basic-auth middleware — not just "is the container up" (which the image's HEALTHCHECK already tells us). QUARTERMASTER_SMOKE_PASSWORD is plaintext on the tenant side; the platform team stores the bcrypt hash. Rotation flows through both in lockstep.

  • Mint REGISTRY_TOKEN via the Forgejo API, not the UI. When Jeff asked if I could do it as the archeious admin, my first pass said no — the MCP bot user is claude-code with is_admin: false. But sourcing /home/jeff/projects/homelab-IaC/bin/load-ops-secrets exposed FORGEJO_ARCHEIOUS_PASSWORD, which unlocked basic-auth against POST /api/v1/users/archeious/tokens. Whole flow (delete-if-exists, create, register via PUT /repos/.../actions/secrets/REGISTRY_TOKEN, verify) ran in one bash block with the raw token never hitting the transcript.
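
The "register" step of that flow could look roughly like this in Python. It is a sketch under stated assumptions, not the bash block that actually ran: the base URL, owner and repo values are placeholders, the function names are hypothetical, and the use of an "Authorization: token ..." header is an assumption (the notes only say the token-minting call used basic auth). Only the endpoint shape and the {"data": ...} payload come from the notes.

```python
import json
import urllib.request


def build_secret_request(
    base_url: str, owner: str, repo: str, name: str, value: str, api_token: str
) -> urllib.request.Request:
    """Build the PUT request that creates or updates one Actions secret."""
    url = f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/actions/secrets/{name}"
    body = json.dumps({"data": value}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            # Assumption: token auth; swap for basic auth if that is what you have.
            "Authorization": f"token {api_token}",
            "Content-Type": "application/json",
        },
    )


def is_success(status: int) -> bool:
    """Both 201 and 204 are reported as success for this PUT."""
    return status in (201, 204)


def register_secret(request: urllib.request.Request) -> bool:
    """Send the request; never print the secret value or the token."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return is_success(resp.status)
```

Building the request separately from sending it keeps the secret out of logs and makes the payload easy to inspect in a dry run.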

Surprises & Discoveries

  • The CSS bug was invisible from pytest, invisible from a /healthz-only smoke, invisible until a human eyeball saw the rendered page. url_for() is only exercised when templates render; /healthz returns raw JSON and never touches the template layer. There was no pytest coverage because the tests run against TestClient, which doesn't go through Traefik. Future lesson: any service that renders HTML through url_for() needs at least one smoke that loads the index page and greps the output for the expected href scheme.

  • The README "Docker" section I wrote in #32 got trimmed on review before merge; I then re-added equivalent content across #29 and #30 without realising. Looking back, the three sections I added (Docker in #32, Deploy in #29, CI/CD in #30) together covered the ground the first one tried to cover alone. Jeff's edit was the right call: better to let the content emerge alongside each artefact than to front-load it.

  • Forgejo Actions secret flow via API is pleasant. PUT /repos/{owner}/{repo}/actions/secrets/{name} with {"data": "..."} is idempotent (201 or 204, both success) and doesn't need a preceding GET. That is easier than the GitHub equivalent, which requires fetching the repo's libsodium public key and encrypting client-side. For one-off provisioning from an ops shell, the ergonomics are nice.

  • The runner being on the deploy host changes the default design. Came up twice this session: once as "no SSH needed", once as "the build and push both use the same Docker daemon that will run the container." Small detail with big consequences.
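
The lesson from the CSS bug (load the index page and check the scheme of every asset URL) can be sketched with the standard library alone. The class and function names here are hypothetical, and the sample HTML in the usage note is made up for illustration; relative URLs are resolved against the page URL, so they inherit its https scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Which attribute carries the URL for each tag we care about.
URL_ATTR = {"link": "href", "img": "src", "script": "src"}
# data: URIs are inline content, so they cannot be mixed content.
SAFE_SCHEMES = {"https", "data"}


class AssetCollector(HTMLParser):
    """Collect the raw asset URLs referenced by <link>, <img> and <script>."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTR.get(tag)
        if wanted is None:
            return
        for key, value in attrs:
            if key == wanted and value:
                self.urls.append(value)


def insecure_assets(page_url: str, html: str) -> list[str]:
    """Return asset URLs that would not load over https from page_url."""
    parser = AssetCollector()
    parser.feed(html)
    return [
        raw
        for raw in parser.urls
        if urlsplit(urljoin(page_url, raw)).scheme not in SAFE_SCHEMES
    ]
```

Fed the rendered index page of the pre-fix image, a check like this would report the http stylesheet link; a smoke step would fail whenever the returned list is non-empty.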

Concerns & Open Threads

  • #26 and #27 landed on main but were never closed on Forgejo. Their commits reference the issues in parentheses, as in (#26), but don't use Closes #N. Not a big deal; a one-line API call closes them when we want.

  • The smoke test retries for 30 s total. If a first-ever Let's Encrypt cert issuance takes longer than that (DNS-01 can be slow under load), the smoke curl fails with a TLS error and the workflow goes red — harmless, re-run from the Actions UI once the cert is up. A richer smoke with a longer cert-acquisition window would be nice, but it'd complicate the step and isn't worth it for a homelab app that re-certs once every 90 days.

  • No browser-rendered-page test. The /healthz smoke doesn't exercise template URL generation. A CI step that loads / and asserts that every <link> href and <img> src parses as https would have caught #35 before it hit the live site. Probably worth filing.

  • #31 polish still open. Logger placement in service.py, middleware-vs-router comment in routes_health.py, richer template_entry_updated extras. Fold into whichever PR next touches those files.

  • Rollback is manual. For v1 this is fine (single-user app; rollback = set QUARTERMASTER_TAG=<prev-sha> in the host-side .env, then docker compose up -d), but a one-line re-deploy job that takes a tag would be worth ~30 minutes of work once we have a reason to roll back under pressure.
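
A re-deploy job that takes a tag might start from something like the sketch below. It assumes the .env file sits next to compose.yml and that tags are hex commit SHAs (7 to 40 characters); the function names are hypothetical, and the actual compose invocation is left to the caller.

```python
import re
from pathlib import Path

TAG_KEY = "QUARTERMASTER_TAG"
# Assumption: deploy tags are git SHAs, full or abbreviated.
SHA_RE = re.compile(r"[0-9a-f]{7,40}")


def with_pinned_tag(env_text: str, tag: str) -> str:
    """Return env_text with QUARTERMASTER_TAG set to tag, other keys untouched."""
    if not SHA_RE.fullmatch(tag):
        raise ValueError(f"not a commit SHA: {tag!r}")
    kept = [
        line
        for line in env_text.splitlines()
        if not line.startswith(f"{TAG_KEY}=")
    ]
    kept.append(f"{TAG_KEY}={tag}")
    return "\n".join(kept) + "\n"


def pin_tag(env_path: Path, tag: str) -> None:
    """Rewrite the .env file in place so compose picks up the new tag."""
    current = env_path.read_text() if env_path.exists() else ""
    env_path.write_text(with_pinned_tag(current, tag))


def redeploy_command() -> list[str]:
    """The command to run next, from the directory holding compose.yml."""
    return ["docker", "compose", "up", "-d"]
```

Validating the tag before writing keeps a typo from silently deploying an image that does not exist, which matters most when rolling back under pressure.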

Raw Thinking

  • The "first deploy is the real test" framing was accurate in both directions. On one hand, every pre-deploy artefact I built worked as specified — the image ran, the compose file parsed, the workflow fired, the secrets authenticated. On the other hand, a real bug (proxy-headers) only showed up when a human viewed the live page. The local dev loop is good but finite; production traffic finds the gap between "looks-tested" and "tested".

  • Admin-level work via API vs UI. I initially deflected on "mint the registry token" with "I'd need archeious's credentials." Jeff nudged: "try sourcing load-ops-secrets." That pattern — "check the environment before giving up" — is a reusable heuristic. Ops tooling that front-loads secrets into the shell is common; I should poke at it before declining.

  • Dropping the SSH step felt both obvious and non-obvious. Obvious in retrospect: the runner is on the deploy host, why would I SSH? Non-obvious while in the middle of drafting: the issue description literally says "SSHes to home-ctr-onyx" and I was pattern-matching on that instead of re-reading the platform contract's note that the runner has the Docker socket mounted. Lesson: when a deploy spec says "SSH to host X", ask whether the runner IS host X.

  • The #35 fix was fast because the reproduction was tight. One local docker run with two Traefik-style headers, grep output for the stylesheet href. Seeing the wrong URL in 30 s meant I could be certain the fix was at the right layer before committing. The alternative ("rebuild, push, deploy, look at the live site, iterate") would have been several minutes per loop. Worth knowing that the missing --proxy-headers failure mode is easy to reproduce with curl.
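
The failure mode itself comes down to which scheme the app believes the request arrived on. The toy model below is a deliberately simplified illustration of that decision, not uvicorn's actual implementation; the function names, the lower-cased header dict and the sample client IP are all assumptions made for the sketch.

```python
def effective_scheme(
    wire_scheme: str,
    client_ip: str,
    headers: dict[str, str],
    proxy_headers: bool,
    trusted_ips: set[str],
) -> str:
    """Pick the scheme used when building absolute URLs.

    headers is assumed to have lower-cased keys.
    """
    if not proxy_headers:
        # Forwarded headers are ignored entirely.
        return wire_scheme
    if "*" not in trusted_ips and client_ip not in trusted_ips:
        # The peer is not a trusted proxy, so its claims are ignored.
        return wire_scheme
    forwarded = headers.get("x-forwarded-proto", "").strip().lower()
    return forwarded if forwarded in ("http", "https") else wire_scheme


def stylesheet_url(scheme: str, host: str, path: str = "/static/app.css") -> str:
    """What a template helper would emit for the stylesheet link."""
    return f"{scheme}://{host}{path}"
```

With the header handling off, the model yields an http stylesheet URL even though the browser is on https; with it on and the proxy trusted, the URL comes out as https. The middle case (handling on, proxy not in the trusted set) still yields http, which is consistent with the fix adding --forwarded-allow-ips alongside --proxy-headers.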

What's Next

Nothing blocks launch — launch happened. Remaining queue:

  1. Close #26 and #27 manually. Housekeeping.
  2. #31 polish. Pick up opportunistically when touching the affected files.
  3. Deferred items from earlier sessions. Constrain posting dates to the month, closed-month archive treatment, cross-month summaries, copy-forward months. Nothing about production deployment changes the priority here.
  4. Operational shakedown. Watch the first Loki-indexed access log pattern, see whether the rate-limit numbers (10/30 per-IP) make sense for real traffic, confirm the restic nightly backup of /mnt/quartermaster/ actually includes everything we expect.
  5. Browser-rendered smoke test. File an issue. Would have caught #35.