quartermaster/docs/superpowers/specs/2026-04-19-healthz-and-json-logs-design.md
Jeff Smith 88e68ea2f9 docs: design spec for /healthz and structured JSON logs (#26, #27)
Part of the platform-contract intake (#25). Covers both pieces of work
that must land before first deploy to home-ctr-onyx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 12:12:55 -06:00

Healthcheck endpoint and structured JSON logs

Date: 2026-04-19
Issues: #26 (healthz), #27 (JSON logs), part of #25 (platform contract intake)

Background

The homelab platform contract for Quartermaster (#25) requires two things the codebase does not have today:

  1. A Docker HEALTHCHECK so container_health_status is visible to cAdvisor/Prometheus, which in turn drives the container-down alert planned at launch. That requires an in-app endpoint to target.
  2. Structured JSON logs on stdout with level and event fields so Promtail indexes them as Loki labels.

Both block the first deploy to home-ctr-onyx. This spec covers both so the work can land as one coherent change.

/healthz

Endpoint

GET /healthz, unauthenticated.

  • Success: 200 {"status": "ok"}.
  • Failure: 503 {"status": "error", "detail": "<exception class name>"}. The class name goes in so operators can tell from the response body what tripped the check; no traceback or message is leaked.

The check opens a session via the standard SessionLocal factory, runs SELECT 1, and closes. Any exception surfaces as a 503.
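The check logic above can be sketched without any framework, which also makes it easy to unit-test. This is a minimal sketch under stated assumptions: the name run_health_check is hypothetical, and sqlite3 stands in for the session factory so the example runs with the standard library alone. In the app the factory would be SessionLocal, the statement would likely need wrapping in sqlalchemy.text, and the route handler would turn the returned pair into the HTTP response.

```python
import sqlite3
from typing import Any, Callable


def run_health_check(session_factory: Callable[[], Any]) -> tuple[int, dict[str, str]]:
    """Open a session, run SELECT 1, close; map any exception to a 503 body."""
    try:
        session = session_factory()
        try:
            session.execute("SELECT 1")
        finally:
            # Always release the session, even when the query fails.
            session.close()
    except Exception as exc:
        # Only the class name is exposed: no message, no traceback.
        return 503, {"status": "error", "detail": type(exc).__name__}
    return 200, {"status": "ok"}


def broken_factory() -> Any:
    # Hypothetical failure mode, used only to demonstrate the 503 path.
    raise ConnectionError("database unreachable")


print(run_health_check(lambda: sqlite3.connect(":memory:")))
print(run_health_check(broken_factory))
```

Keeping the try/except around both the factory call and the query means a failure to open the session is reported the same way as a failed SELECT 1.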

Placement

New module src/quartermaster/routes_health.py with its own APIRouter, included from main.create_app() alongside the existing routers. Keeping it on a dedicated router means any future middleware (basic-auth, rate-limit bypass) applied to the main routers can leave /healthz alone — the Docker healthcheck runs inside the container and must not need credentials.

Tests

tests/test_health.py:

  • Success: FastAPI TestClient hits /healthz, asserts 200 and {"status": "ok"}.
  • Failure: monkey-patch the session factory to raise on .execute(), assert 503 and {"status": "error", "detail": "<class-name>"}.

Structured JSON logs

Dependency

Add python-json-logger to [project].dependencies in pyproject.toml. One small, single-purpose dep; no transitive surprises. structlog is explicitly out of scope (#27).

Config module

New src/quartermaster/logging_config.py exposing LOG_CONFIG, a logging.config.dictConfig-compatible dict:

  • One formatter using pythonjsonlogger.jsonlogger.JsonFormatter emitting timestamp (ISO-8601 UTC), level, event, logger, message. extra={...} kwargs passed to logger calls flatten into the JSON body.
  • One handler writing to sys.stdout.
  • Loggers: the root app logger and uvicorn.access both route through the JSON handler. uvicorn.error also gets the handler so startup / shutdown lines are captured in the same format.

A Python dict (rather than YAML) is the source of truth because tests can import it and apply dictConfig in-process. The uvicorn CLI consumes it via a small logconfig.yaml shim at repo root that references the dict module.
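The shape of LOG_CONFIG described above might look roughly like the following. This is a sketch, not the final module: StandInJsonFormatter is a hypothetical stdlib-only substitute for python-json-logger's JsonFormatter so the example runs without the dependency, and the handler name, log levels, and propagate settings are assumptions.

```python
import json
import logging
import logging.config
from datetime import datetime, timezone

# Attribute names present on every LogRecord; anything else came from extra={...}.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


class StandInJsonFormatter(logging.Formatter):
    """Stdlib stand-in for python-json-logger's JsonFormatter (assumption for this sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Flatten extra={...} fields (event, method, path, ...) into the JSON body.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)


LOG_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {"json": {"()": StandInJsonFormatter}},
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "json",
        }
    },
    "root": {"level": "INFO", "handlers": ["stdout"]},
    "loggers": {
        # propagate=False avoids emitting each uvicorn line twice via the root logger.
        "uvicorn.access": {"level": "INFO", "handlers": ["stdout"], "propagate": False},
        "uvicorn.error": {"level": "INFO", "handlers": ["stdout"], "propagate": False},
    },
}
```

Applying it with logging.config.dictConfig(LOG_CONFIG) and then calling logger.info("started", extra={"event": "smoke"}) should produce one JSON object per line on stdout, with event alongside the four base fields.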

Access log filter

Uvicorn's access logger emits a record whose message is the raw access line; the fields we care about live on the record's positional args. A small logging.Filter subclass in logging_config.py unpacks those args and sets:

  • event = "http_request"
  • method, path, status, client_ip
  • duration_ms (uvicorn does not expose this natively; if straightforward, it is computed from an extra injected by a small middleware, otherwise it is deferred. The filter already gives Loki status and path, which is the main thing.)

If the duration cannot be obtained cheaply from uvicorn's access record, landing the rest is still a win; the duration_ms field can come in a follow-up without changing the log schema (it's an extra field, not a label).
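A sketch of the filter, leaving duration_ms out. It assumes uvicorn's access record carries a five-element args tuple of (client_addr, method, full_path, http_version, status_code), which matches what uvicorn's own access formatter unpacks at the time of writing but should be verified against the pinned version. The class name AccessLogFilter and the choice to drop the query string from path are assumptions for illustration.

```python
import logging


class AccessLogFilter(logging.Filter):
    """Copy fields from uvicorn's access-record args onto the record itself."""

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Assumed arg layout: (client_addr, method, full_path, http_version, status_code).
        if isinstance(args, tuple) and len(args) == 5:
            client_addr, method, full_path, _http_version, status = args
            record.event = "http_request"
            record.method = method
            record.path = str(full_path).split("?", 1)[0]  # assumption: drop the query string
            record.status = int(status)
            record.client_ip = str(client_addr).rsplit(":", 1)[0]  # strip the ":port" suffix
        return True  # never drop the record, only annotate it


# Synthetic record shaped like a uvicorn access line.
record = logging.LogRecord(
    name="uvicorn.access",
    level=logging.INFO,
    pathname="",
    lineno=0,
    msg='%s - "%s %s HTTP/%s" %d',
    args=("127.0.0.1:51234", "GET", "/healthz?verbose=1", "1.1", 200),
    exc_info=None,
)
AccessLogFilter().filter(record)
print(record.event, record.method, record.path, record.status, record.client_ip)
```

Records whose args do not match the expected shape pass through unannotated rather than raising, so a change in uvicorn's format degrades the fields without breaking logging.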

Seed application events

Five events added as single-line logger.info(..., extra={"event": "..."}) calls at the matching code paths (names aligned with the existing function names):

| Event | Site |
|---|---|
| month_created | month_service.create_month |
| month_closed | month_service.close_month |
| template_entry_updated | service.update_entry |
| posting_added | month_service.add_posting |
| posting_deleted | month_service.delete_posting |

One module-scoped logger at the top of each file that touches these paths. No broader instrumentation in this change.
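As an illustration of what one seed call could look like, here is a sketch for month_created. The create_month signature and body shown are hypothetical stand-ins, not the real month_service code, and the extra month field is an assumption; only the event name and the single-line logger.info pattern come from this spec.

```python
import logging

# One module-scoped logger per file, as described above.
logger = logging.getLogger("quartermaster.month_service")


def create_month(year: int, month: int) -> str:
    """Hypothetical stand-in for month_service.create_month; the real signature may differ."""
    label = f"{year:04d}-{month:02d}"
    # ... persistence would happen here ...
    logger.info("month created", extra={"event": "month_created", "month": label})
    return label
```

Because the event name travels in extra, the JSON formatter emits it as a top-level field without the call site needing to know anything about the log format.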

Tests

tests/test_logging.py:

  • Apply LOG_CONFIG via logging.config.dictConfig, emit a record with extra={"event": "smoke"}, capture stdout via capsys, json.loads the captured line, assert level / event / logger / message / timestamp all present and correct.
  • Feed a synthetic uvicorn access record through the filter, assert resulting fields include event="http_request", method, path, status.

No end-to-end uvicorn-subprocess test. Formatter and filter correctness at the handler level is enough for the launch contract.

Dev flow

    uv run uvicorn quartermaster.main:app --log-config logconfig.yaml --reload

--reload keeps working. README gets a short "Logs" section with two LogQL examples mirroring the Archon contract style.

File additions / changes

New:

  • src/quartermaster/routes_health.py
  • src/quartermaster/logging_config.py
  • logconfig.yaml (YAML shim for uvicorn CLI)
  • tests/test_health.py
  • tests/test_logging.py

Changed:

  • pyproject.toml — add python-json-logger
  • src/quartermaster/main.py — include the health router
  • src/quartermaster/service.py — add one logger.info seed call in update_entry
  • src/quartermaster/month_service.py — add four logger.info seed calls in create_month, close_month, add_posting, delete_posting
  • README.md — add the "Logs" section and mention --log-config in the Run block

Not touched:

  • Dockerfile / Compose: owned by later issues under #25.
  • Alembic / DB layer: the healthcheck uses the existing session factory; no migration.

Order of work

Logging before healthz. Once LOG_CONFIG exists, the healthz handler can emit event="healthz_check" for free; the reverse order gives logging nothing useful. The ordering is a convenience, not a hard dependency.

Out of scope

  • /readyz vs. /livez split: one endpoint covers this single-container app.
  • /metrics or any Prometheus exposition (5.2 in #25 is "not needed").
  • Adding structlog (#27 explicitly excludes it).
  • Log-shipping configuration — Promtail on the host handles it.
  • Broad app instrumentation beyond the five seed events.