claude-gauge/PLAN.md

343 lines
13 KiB
Markdown
Raw Normal View History

# claude-gauge
Hardware instrument cluster plus companion web dashboard for Claude
Code session telemetry. Primary surface is a physical cluster on the
desk (fighter-jet / race-car aesthetic). Secondary surface is a
local web dashboard for the deep stats.
## Problem
There is no official programmatic API for Claude Max plan usage
today. The `/usage` and `/status` slash commands in Claude Code are
interactive-only. Open feature requests (claude-code issues #40395,
#13585, #27217, #33978) track this but nothing has shipped as of
2026-04-17.
Rather than wait, drive everything from local Claude Code state by
watching the session transcript JSONL files in
`~/.claude/projects/`. Swap the data source later when Anthropic
ships a real endpoint; the hardware, firmware, daemon API, and
dashboard all stay the same.
## Instrument cluster (primary surface)
Physical cluster on the desk. Three analog gauges across the top,
annunciator row below. Backlit. Brushed aluminium bezel, black face,
cream needles, burgundy redline zone (optional: match the
quartermaster palette for a house aesthetic).
```
+------------+ +--------------+ +------------+
| 5h FUEL | | TOKENS/MIN | | 7d FUEL |
| 0 - 100% | | 0 - redline | | 0 - 100% |
+------------+ +--------------+ +------------+
[OPUS] [SONNET] [HAIKU] [HOT] [WARN] [STALL] [IDLE]
```
### Primary gauges
| Gauge | Input | Feel |
|---|---|---|
| Center tach | tokens/min, short rolling window | Jumpy, fun, shows when Claude is cooking |
| Left fuel | % of 5h plan window used | Slow, steady, tells you when to worry |
| Right fuel | % of 7d plan window used | Slowest, long-arc view of the week |
### Annunciator row
| Lamp | Condition |
|---|---|
| OPUS / SONNET / HAIKU | lights the model that wrote the most recent tokens |
| HOT | flashes when tach crosses redline |
| WARN | solid when either fuel gauge is above 80% |
| STALL | lights after N minutes of JSONL silence (no activity) |
| IDLE | green "power on, daemon reachable" indicator |
Model lamps are colour-coded: Opus deep red, Sonnet amber, Haiku
green. Visual cue for why the tach just spiked.
### Optional fourth gauge
If physical real estate allows, a "temp" or "boost" sub-gauge:
* **Cache hit rate** as a boost gauge. High cache read = cheap
inference. Bragging-rights needle.
* **Thinking-to-output ratio** as a temp gauge. High = Claude is
grinding; low = cruising. Tells you when a task is hard.
## Companion web dashboard (secondary surface)
Same daemon serves a browser dashboard at `http://<host>:<port>/`
with the deep stats. Claude Code is already a terminal tool, so the
dashboard lives alongside as a geek-out surface when you want to
drill in past the three-needle summary.
### Dashboard sections
1. **Overview**: the three gauges rendered in the browser, plus the
annunciator row. Same data as the cluster, so a quick sanity
check from any device on the LAN.
2. **Rates and windows**: line charts for tokens/min over last 1h,
6h, 24h. 5h and 7d window sums over the past month.
3. **Cost**: dollar estimates derived from token counts and
published per-model pricing. Today, week-to-date, month-to-date,
projected month.
4. **Models**: stacked time series by model (Opus / Sonnet / Haiku).
Token split pie for the current 7d window.
5. **Projects**: tokens per project, time per project, last-active
timestamp. Sortable table.
6. **Tools**: top tools by call count, success rate per tool,
Bash command distribution (top commands by root executable).
7. **Files**: hottest files by Read + Edit count, edit-to-read
ratio per file.
8. **Rhythm**: time-of-day heatmap, day-of-week heatmap, session
duration distribution.
9. **Raw events**: streaming tail of the latest parsed JSONL rows,
for debugging and to watch the data land.
Dashboard is read-only. Nothing the user does here mutates state.
## Architecture
```
~/.claude/projects/**/*.jsonl
|
v (watchdog / inotify, line-append events)
|
[ gauge daemon ] <--- local, Python, long-running
|
+-- parse new line -> structured event
| { ts, session_id, project, model, role,
| input_tokens, output_tokens,
| cache_read, cache_creation,
| thinking_tokens, stop_reason,
| tool_use: [...], is_sidechain, request_id }
|
+-- in-memory ring buffer (last ~60 min of events)
+-- periodic flush of events + rolling aggregates to SQLite
|
+-- computed windows (primary, cluster):
| rate_instant last 30s, tokens/min
| rate_1m rolling 60s, tokens/min
| rate_5m rolling 5m, tokens/min
| window_5h rolling 5h sum
| window_7d rolling 7d sum
|
+-- computed aggregates (secondary, dashboard):
| cache hit rate, thinking ratio, cost estimate,
| per-model token split, per-project totals,
| tool call counts, file touch counts, rhythm grids
|
+-- derived flags:
| hot, warn, stall, idle, last_model, last_project
|
+-- HTTP endpoints:
| GET /usage primary cluster payload
| GET /stats deep aggregates for the dashboard
| GET /stream SSE of parsed events (raw tail)
| GET / dashboard UI
|
v (poll every ~1s)
[ ESP32 / Pi Pico, on LAN ]
|
+-- drives three needles via PWM + low-pass filter
(or servos if full-swing dials are easier)
+-- drives annunciator LEDs over GPIO
+-- optional small OLED behind the cluster for raw numbers
```
## Real-time honesty
Claude Code writes a JSONL line **after** each assistant message
completes, not during streaming. The cluster therefore updates
per-message, not per-token:
* Rapid back-and-forth with small messages: tach ticks steadily,
feels alive.
* Opus cranking a 40-second response: tach reads zero, zero, zero,
then SPIKES at the end with the full message's token count.
Still useful. The spike itself is informative ("that burn just hit
8k tokens in 40 seconds"). STALL lamp covers the "is it still
working or is the session idle" ambiguity.
If true stream-time is wanted later, a Claude Code hook
(`SessionStart` / `Stop` / `PostToolUse`) can push heartbeats into
the daemon. Dramatically more work for modest gain. Park as v2.
## Calibration
The daemon does not know Anthropic's real per-plan caps. User
configures a local ceiling:
```
CLAUDE_GAUGE_5H_CEILING=<tokens>
CLAUDE_GAUGE_7D_CEILING=<tokens>
CLAUDE_GAUGE_TACH_REDLINE=<tokens/min>
```
Gauges read against those ceilings. Estimate from experience: run
for a week at normal usage, note where `/usage` says you are vs
where the dials read, adjust the config. Ship sensible defaults
tunable per user.
## Data source migration path
When Anthropic ships an official usage endpoint, the daemon replaces
its input with that endpoint and drops the JSONL tailing. The HTTP
shape it serves to the cluster stays the same. The dashboard gains
accuracy but does not change structure. Zero hardware change, zero
firmware change.
## Stack (proposed, not decided)
**Daemon**: Python 3.12+, `uv`, `watchdog` for file tail, FastAPI
for HTTP, SQLite for durable state, Jinja2 for dashboard templates,
Alpine.js or HTMX for dashboard interactivity. Matches the house
style (see quartermaster).
**Firmware**: MicroPython on ESP32 or Pi Pico W for the wifi stack.
C/Rust if MicroPython HTTP polling turns out wobbly at the required
update rate.
**Hardware**: three analog voltmeter movements (0-5V or similar)
for the gauges, driven from PWM pins through low-pass filters, or
tiny servos with printed needles. Annunciator row is discrete LEDs
behind a smoked acrylic face.
## Metrics brainstorm (for later inspection)
Everything on this list is parseable from the JSONL today. The MVP
daemon only needs the primary-window stats; everything else lands
under later issues. Capturing them here so they don't get forgotten.
### Cost and tokens
* Total tokens by window (already primary)
* Breakdown: `input` / `output` / `cache_read` / `cache_creation`
* **Cache hit rate** = cache_read / (cache_read + cache_creation + input)
* **Cache savings** in dollars (cache reads are 10% of normal input cost)
* Cost per session at published per-model pricing
* Cost per day / week / month; projected month
* Opus / Sonnet / Haiku token split
* Server tool use: `web_search_requests`, `web_fetch_requests` per day
* Service tier distribution (standard vs priority)
* Ephemeral-1h vs ephemeral-5m cache split
### Time and rhythm
* Sessions per day / per week, rolling average
* Session duration distribution
* Time-of-day heatmap (circadian work pattern)
* Day-of-week heatmap
* Longest continuous session on record
* Your think-time: gap between assistant-end and next user message
* Claude's work-time: user message to assistant complete
* Active vs idle ratio within sessions
* Streak tracking (consecutive days used)
* All-nighter detector (session crossing 2am local)
### Work shape
* **Thinking tokens** counted from content blocks of type `thinking`
* Thinking-to-output ratio ("cogitation index")
* Stop-reason distribution (`end_turn` / `tool_use` / `max_tokens`);
watch for rising `max_tokens` (responses getting cut off)
* Messages per session
* Tool calls per assistant response (parallelism indicator)
* User interrupt rate (sessions ending on cancel)
* Iteration count per task (assistant messages between two
consecutive user prompts as a proxy)
### Tool usage
* Top tools by count (Bash, Edit, Read, Grep, ...)
* Tool success vs failure rate
* Bash command distribution, parsed by root executable (git, python,
uv, ls, ...)
* File reads / edits / writes per session
* Hottest files by touch count across all sessions
* Agent / subagent counts (`isSidechain=true`), subagent depth
* Web search / web fetch counts
### Project and context
* Tokens per project (JSONL path encodes the project)
* Time per project
* Project switching rate within a session
* Dormant project detector (no activity in N days)
* Languages touched, derived from file extensions
* Last file edited per project (resume-where-you-left-off)
### Friction and quality signals
* User message length distribution (one-word directives vs prose)
* Rough correction reflex count: user messages starting with "no",
"wrong", "stop", "actually"
* Permission denial frequency
* Retry / regenerate patterns
* File-history-snapshot count per session (checkpoint density)
### Character and fun
* Em dash violation count in assistant text (per the CLAUDE.md rule,
a needle that reads "rule-break events this week")
* Emoji leakage count
* Most-used phrase by Claude in your transcripts
* Most-used phrase by you (captures your actual directive vocabulary)
* "Dude, chill" detector: count of explicit pushback against Claude
* Thank-you rate per session
* Silent sessions (ended without `/compact` or `/clear`)
### Cross-system correlations (bigger, later)
* Git: commits produced per session, lines changed per token spent
* PRs / issues closed per session
* Quartermaster: correlate long Claude sessions with budget-editing
days (just for fun)
## Phasing
### Phase A — daemon MVP
One issue. Tail JSONL, parse into structured events, maintain the
five primary windows, print them to stdout every second. No HTTP,
no hardware, no dashboard. Prove the numbers land correctly against
Claude Code's own `/usage` output.
### Phase B — HTTP + dashboard overview
Stand up FastAPI, expose `GET /usage` for the cluster and `GET /`
for a dashboard overview page (three gauges rendered in SVG, same
data as the cluster). No deep stats yet. First thing you can load
in a browser.
### Phase C — firmware + single needle
Minimal ESP32 / Pico firmware that polls `/usage` and drives one
needle (the tach). Prove the hardware path end to end.
### Phase D — full cluster
Two more needles + annunciator row. Enclosure prototype.
### Phase E — dashboard deep stats
Cost panel, model panel, project panel, rhythm heatmaps, tool
panel, file panel, raw-event tail. Pulls from the aggregate store
the daemon has been building since Phase A.
### Phase F — character and cross-system
Em-dash detector, phrase extractor, git correlation, quartermaster
correlation. Lowest priority, highest amusement.
## Next steps
1. Forgejo repo at `archeious/claude-gauge` (created alongside
this plan).
2. File Phase A as the first issue.
3. Ship Phase A. See the five numbers tick up in a terminal while
typing into Claude Code. First dopamine hit.
4. File follow-up issues one phase at a time. No scope bleed
between phases.