From ff15cb645f7e76cc4e9818b46a6ceb2e783d4e78 Mon Sep 17 00:00:00 2001
From: claude-code
Date: Sun, 19 Apr 2026 12:30:10 -0600
Subject: [PATCH] docs(operations): add Logs and Health sections for #26, #27

---
 Operations.md | 111 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 103 insertions(+), 8 deletions(-)

diff --git a/Operations.md b/Operations.md
index 9a40a11..bb1c3b0 100644
--- a/Operations.md
+++ b/Operations.md
@@ -92,10 +92,93 @@ rm -rf /tmp/qm-dev.db /tmp/qm-dev-backups
 
 ## Running in "production"
 
-There is no prod. This is a local single-user app. Run the uvicorn
-dev server and reach it at http://127.0.0.1:8000. For automatic
-restart on crash, wrap the command in a `systemd --user` unit or a
-supervisor of your choice; the app itself does nothing special.
+Production runs containerised on the homelab host home-ctr-onyx. Dev is
+the uvicorn reload server at `http://127.0.0.1:8000`. The platform
+contract ([PlatformContractQuartermaster](https://forgejo.labbity.unbiasedgeek.com/homelab/homelab-IaC/wiki/PlatformContractQuartermaster))
+is the authoritative record of the deploy surface; the sections below
+cover the app-side affordances that feed into it.
+
+## Health
+
+`GET /healthz` is unauthenticated and returns:
+
+* `200 {"status":"ok"}` when a trivial `SELECT 1` through the
+  SQLAlchemy session succeeds.
+* `503 {"status":"error","detail":"<ErrorClassName>"}` on any
+  exception from the DB probe. The error class name is the only
+  detail leaked (no message, no traceback), which is enough for an
+  operator to see what tripped the check from a `curl` without log access.
+
+No auth on purpose: the Docker `HEALTHCHECK` runs inside the container
+and cannot carry credentials, and Traefik's basic-auth middleware is
+not applied to this route. The route is kept on a dedicated router
+(`src/quartermaster/routes_health.py`) so any future router-scoped auth
+on the main routers leaves it alone.
+
+A failed probe also emits a structured warning log (`event=healthz_failed`,
+`error_class=<ErrorClassName>`) for Loki.
+
+## Logs
+
+Logs are JSON on stdout. The config lives at
+`src/quartermaster/logconfig.json` and is consumed both by Python (via
+the `LOG_CONFIG` dict loaded in `src/quartermaster/logging_config.py`)
+and by the uvicorn CLI:
+
+```sh
+uv run uvicorn quartermaster.main:app \
+    --log-config src/quartermaster/logconfig.json \
+    --reload
+```
+
+Each log line has `level` and `event` as top-level JSON fields
+(Promtail on home-ctr-onyx extracts them as queryable Loki labels),
+plus arbitrary extras in the JSON body.
+
+### Access logs
+
+Uvicorn access records are enriched by `AccessLogFilter` into:
+
+```json
+{
+  "timestamp": "...", "level": "INFO", "logger": "uvicorn.access",
+  "event": "http_request", "method": "GET", "path": "/healthz",
+  "status": 200, "client_ip": "10.0.0.42:54321",
+  "message": "... - \"GET /healthz HTTP/1.1\" 200"
+}
+```
+
+### Application events
+
+Six seed events fire at the most operationally interesting points:
+
+| Event | Fires in | Extras |
+|---|---|---|
+| `month_created` | `month_service.create_month` | `year_month` |
+| `month_closed` | `month_service.close_month` | `year_month` |
+| `template_entry_updated` | `service.update_entry` | `entry_id` |
+| `posting_added` | `month_service.add_posting` | `posting_id`, `month_entry_id`, `amount` |
+| `posting_deleted` | `month_service.delete_posting` | `posting_id` |
+| `healthz_failed` | `routes_health.healthz` (WARNING) | `error_class` |
+
+Additional events can be added the same way: call `logger.info(msg,
+extra={"event": "...", ...})` on a logger under `quartermaster.*`.
+
+### Example LogQL queries
+
+Grafana Explore, Loki data source, once the deploy is live:
+
+```
+{container="quartermaster"} | json
+{container="quartermaster", event="http_request"} | json | status=~"5.."
+{container="quartermaster", event="month_closed"} | json | line_format "{{.year_month}} {{.message}}"
+```
+
+### Dev ergonomics
+
+Omit `--log-config src/quartermaster/logconfig.json` during local dev
+if you'd rather read logs in uvicorn's default human-readable format.
+Production must use the config so Promtail can extract `level` and `event`.
 
 ## Troubleshooting
 
@@ -109,8 +192,7 @@ time, which is expected.
 
 You pulled code with a new column but did not apply the migration.
 Run `uv run alembic upgrade head`. The backup hook backs up the live
-DB first, then the migration adds the missing column. Common after
-a pull that includes schema work (notes field, month lifecycle).
+DB first, then the migration adds the missing column.
 
 ### Alembic reports a revision it cannot locate
 
@@ -123,6 +205,18 @@ or downgrade the DB to a known-good revision before continuing.
 Intentional. The script is sqlite-specific. Switch the URL back to
 sqlite:///... or do the backup manually via your Postgres tooling.
 
+### `/healthz` returns 503
+
+Inspect the logged `event=healthz_failed` record in Loki or stdout.
+`error_class` names the exception type; common causes are a
+misconfigured `QUARTERMASTER_DB_URL`, a DB file that was wiped or has
+broken permissions, or an Alembic migration that failed on container start.
+
+### DeprecationWarning about `pythonjsonlogger.jsonlogger`
+
+Fixed on main. The config now references `pythonjsonlogger.json`.
+If you see the warning, pull and re-run `uv sync`.
+
 ## Current schema
 
 Applied migrations at time of writing:
@@ -135,5 +229,6 @@ Applied migrations at time of writing:
 | `a4ec4f8f6e9f` | add month lifecycle columns (`state`, `activated_at`, `closed_at`) |
 | `cc60e7f73a1c` | add `posting` ledger table, seed opening-balance postings, drop `month_entry.applied` |
 
-After pulling new code, `uv run alembic upgrade head` walks the chain
-and the backup hook fires between each hop.
+No schema change between `cc60e7f73a1c` and HEAD. After pulling new
+code, `uv run alembic upgrade head` walks the chain and the backup
+hook fires between each hop.
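
Note on the `/healthz` contract documented in this patch: the sketch below is a minimal, standalone illustration of the described behaviour, not the code in `src/quartermaster/routes_health.py`. It assumes a SQLite file and swaps in the stdlib `sqlite3` module for the SQLAlchemy session. The function name `healthz_probe`, its `db_path` parameter, and the logger name are hypothetical.

```python
import logging
import sqlite3

# Hypothetical logger name; the patch only says loggers live under quartermaster.*
logger = logging.getLogger("quartermaster.health_sketch")


def healthz_probe(db_path: str) -> tuple[int, dict]:
    """Return (status_code, body) following the documented /healthz contract."""
    try:
        # Assumption: stdlib sqlite3 stands in for the SQLAlchemy session.
        # mode=rw makes a missing file fail instead of being created silently.
        conn = sqlite3.connect(f"file:{db_path}?mode=rw", uri=True)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except Exception as exc:
        # Only the exception class name is exposed: no message, no traceback.
        error_class = type(exc).__name__
        logger.warning(
            "health probe failed",
            extra={"event": "healthz_failed", "error_class": error_class},
        )
        return 503, {"status": "error", "detail": error_class}
    return 200, {"status": "ok"}
```

In this sketch a missing database file surfaces as `OperationalError` in the `detail` field. The `extra` keys become top-level JSON fields only when a JSON formatter such as the one referenced in `logconfig.json` is attached to the logger.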