
# Observability — metrics tier

Two layers of monitoring run today:

1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`). Tells us when the droplet is unhappy at the OS level (CPU, mem, disk, load, down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc). Tells us what shithubd is doing internally: request latency, DB pool saturation, panics, job-queue depth, archive failures.

Layer 1 fires for "is the box alive?" Layer 2 fires for "is the app healthy?" Different signal, different audience, both needed.

## What's wired

The `monitoring-client` ansible role runs **node_exporter** (host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the modern Grafana Agent rebrand). Alloy scrapes:

- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics

Alloy `remote_write`s both into Grafana Cloud's Prometheus endpoint (Mimir under the hood). No inbound port on the droplet — push-only. No Prometheus or Grafana running locally.
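To sanity-check the push-only claim from the droplet itself, you can confirm every metrics-related listener is bound to loopback. A quick sketch; it assumes Alloy's debug UI sits on its default `127.0.0.1:12345`, which the troubleshooting section below also relies on:

```sh
# node_exporter (9100), shithubd metrics (8080) and Alloy's debug UI (12345)
# should all show a 127.0.0.1 local address; nothing here needs to be
# reachable from outside the droplet.
ssh root@shithub.sh "ss -ltnp | grep -E ':(9100|8080|12345)'"
```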

### Notable shithubd metrics

| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |
| `shithub_actions_queue_depth` | gauge | Queued Actions runs/jobs. Sustained job depth means runners cannot keep up. |
| `shithub_actions_active` | gauge | Running Actions runs/jobs. Use with capacity to distinguish slow jobs from lack of runners. |
| `shithub_actions_runner_heartbeat_age_seconds` | gauge | Seconds since each runner heartbeat. >60s sustained means the runner is stale. |
| `shithub_actions_run_duration_seconds` | histogram | Terminal Actions run duration by event and conclusion. |
| `shithub_actions_log_chunk_bytes_total` | counter | Accepted Actions log bytes, used for throughput and scrubber health alerts. |
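Before anything is wired to Grafana Cloud, you can confirm these series exist by hitting shithubd's exporter directly on the droplet. A rough check only; exactly which families appear can vary with traffic:

```sh
# Count the shithub_* series on the internal metrics port, then spot-check a few.
ssh root@shithub.sh "curl -fsS http://127.0.0.1:8080/metrics | grep -cE '^shithub_'"
ssh root@shithub.sh "curl -fsS http://127.0.0.1:8080/metrics | grep -E '^shithub_(http_in_flight|panics_total|actions_queue_depth)'"
```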

## Operator setup (one-time)

Free-tier Grafana Cloud gives you ~10k active series — way more than one shithubd droplet emits. Sign-up flow:

1. Go to https://grafana.com/auth/sign-up/create-user. As of 2026 the default flow drops you into a 14-day "Unlimited usage trial" — that's fine; without a card on file it auto-reverts to the Forever Free tier (~10k active series) when the trial ends.
2. Pick a **Stack name** (e.g. `shithub`). The portal creates the stack at `<name>.grafana.net` and lands you in the in-app "Get started" guide.
3. Get the Prometheus details. The in-app guide has drifted; the reliable path is via the org-level Cloud portal:
   - In the right column of the Get started page, **Organization details** → click **Cloud portal**, OR navigate directly to `https://grafana.com/orgs/<your-org>`.
   - Click **Details** on the **Prometheus** tile.
   - The page shows the **Remote Write Endpoint** (full URL), **Username / Instance ID** (numeric), and a **Generate now** link for the **Password / API Token** (a `glc_…` Access Policy token with `metrics:write`). Copy the token immediately — it's shown once.
4. Add to `inventory/production`. Use the exact Remote Write Endpoint shown on the page — the host (prod-NN, region) differs per tenant. A quick credential smoke test is sketched after this list.

   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi...   # the token from step 3
   ```

5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring` (or just rerun `site.yml` — the role is idempotent).
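Before running the play it can be worth a quick credential smoke test. This is a sketch, not an official check: the push endpoint expects snappy-compressed protobuf, so an empty POST won't ingest anything; the point is only to tell a bad credential apart from a good one.

```sh
# Substitute the three values from step 4. A wrong token or instance ID fails
# basic auth (401/403); a valid pair usually gets some other status (a 4xx
# about the payload, or even 200) because the empty body isn't a real
# remote_write. Mind your shell history here, same as in the manual fallback.
GRAFANA_CLOUD_PROM_URL="https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push"
GRAFANA_CLOUD_PROM_USER="1234567"
GRAFANA_CLOUD_PROM_TOKEN="glc_..."
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -u "${GRAFANA_CLOUD_PROM_USER}:${GRAFANA_CLOUD_PROM_TOKEN}" \
  "$GRAFANA_CLOUD_PROM_URL"
```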

### If you don't have ansible installed (manual fallback)

The `monitoring-client` role's actions are simple enough to apply by hand over SSH. Use this if you're trying to bolt monitoring onto an existing droplet and don't want to risk the full play.

```sh
ssh root@shithub.sh '
  # node_exporter
  apt-get install -y prometheus-node-exporter
  systemctl enable --now prometheus-node-exporter

  # Grafana apt repo + alloy
  mkdir -p /etc/apt/keyrings
  curl -fsS https://apt.grafana.com/gpg.key -o /etc/apt/keyrings/grafana.gpg.key
  chmod 0644 /etc/apt/keyrings/grafana.gpg.key
  echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg.key] https://apt.grafana.com stable main" \
    > /etc/apt/sources.list.d/grafana.list
  apt-get update && apt-get install -y alloy

  # Credentials env (mode 0640, root:alloy)
  mkdir -p /etc/alloy && chown root:alloy /etc/alloy && chmod 0750 /etc/alloy
  install -o root -g alloy -m 0640 /dev/stdin /etc/alloy/credentials.env <<EOF
GRAFANA_CLOUD_PROM_URL=<your URL>
GRAFANA_CLOUD_PROM_USER=<your numeric tenant id>
GRAFANA_CLOUD_PROM_TOKEN=<your glc_… token>
EOF

  # systemd drop-in to source the credentials
  mkdir -p /etc/systemd/system/alloy.service.d
  cat > /etc/systemd/system/alloy.service.d/shithub.conf <<EOF
[Service]
EnvironmentFile=/etc/alloy/credentials.env
EOF
  systemctl daemon-reload
'
```

Then copy the rendered `alloy-config.river.j2` (substitute the `{{ ansible_hostname }}` template var with the real hostname) to `/etc/alloy/config.alloy` (mode 0644 root:alloy) and:

```sh
ssh root@shithub.sh 'systemctl enable --now alloy'
```

The token is the only secret; do NOT let it end up in your shell history. The `install … /dev/stdin … <<EOF` form above keeps it out of `ps`/argv, but you'll still want to clear your shell history afterwards.
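If you don't have the repo handy to render the template, the config's shape is roughly the following. This is a minimal sketch, not the committed template: the component names, the `sys.env()` lookups, and the absence of any relabeling or hostname external label are assumptions here; `alloy-config.river.j2` is authoritative.

```sh
ssh root@shithub.sh 'cat > /etc/alloy/config.alloy <<"EOF"
// Scrape node_exporter and shithubd, push both to Grafana Cloud.
prometheus.scrape "node" {
  job_name   = "node"
  targets    = [{ "__address__" = "127.0.0.1:9100" }]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.scrape "shithubd" {
  job_name           = "shithubd"
  targets            = [{ "__address__" = "127.0.0.1:8080" }]
  // shithubd gzips /metrics; see the troubleshooting note below.
  enable_compression = false
  forward_to         = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = sys.env("GRAFANA_CLOUD_PROM_URL")
    basic_auth {
      username = sys.env("GRAFANA_CLOUD_PROM_USER")
      password = sys.env("GRAFANA_CLOUD_PROM_TOKEN")
    }
  }
}
EOF
chown root:alloy /etc/alloy/config.alloy && chmod 0644 /etc/alloy/config.alloy'
```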

Within a minute, metrics start landing. From the Cloud portal:

- **Explore** tab → datasource = your Prometheus → query `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics
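You can also confirm the scrape side from the droplet itself, without the Cloud UI, via Alloy's local debug API. A sketch: the `prometheus.scrape.node` component name and the exact JSON shape are assumptions, while `prometheus.scrape.shithubd` and the `127.0.0.1:12345` endpoint match the troubleshooting section below.

```sh
ssh root@shithub.sh '
  for c in prometheus.scrape.shithubd prometheus.scrape.node; do
    echo "== $c"
    # Expect an up/healthy state for both components.
    curl -fsS "http://127.0.0.1:12345/api/v0/web/components/$c" | grep -io "health[^,}]*"
  done
'
```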

## Building a starter dashboard

Cloud → Dashboards → New → import. Dashboard JSON is committed under:

```text
deploy/monitoring/grafana/dashboards/shithubd-overview.json
deploy/monitoring/grafana/dashboards/actions.json
```
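If you'd rather script the import than click through the UI, something like the following works against the stack's HTTP API. This is a hypothetical sketch: it assumes the stack is named `shithub` and that `GRAFANA_SA_TOKEN` holds a Grafana service-account token with permission to create dashboards (the `glc_…` metrics:write token from setup will not work here).

```sh
# Wrap the committed dashboard JSON in the shape POST /api/dashboards/db expects,
# clearing any baked-in id so Grafana treats it as a create-or-overwrite.
jq -n --slurpfile d deploy/monitoring/grafana/dashboards/shithubd-overview.json \
  '{dashboard: ($d[0] | .id = null), overwrite: true}' |
  curl -fsS -X POST "https://shithub.grafana.net/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```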

Use the panels below as fallback queries if the dashboard import UI drifts or you need to rebuild by hand.

| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |

## Alerting

Three places to set alert rules. Pick one:

- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope for the free tier without their managed contact points.
- **Skip Grafana alerts entirely.** Use the DO alerts (`alerts.md`) for operational pages and read shithubd metrics manually.

Suggested first three alerts (Cloud UI):

| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query (above) | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | pool utilization query (above) | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | `rate(shithub_panics_total[5m])` | `> 0` for 1m | Worth a bug ticket |

## When metrics stop landing

```sh
ssh root@shithub.sh '
  systemctl status alloy --no-pager
  journalctl -u alloy -n 30 --no-pager | tail -20
  curl -fsS http://127.0.0.1:8080/metrics | head -3
  curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```

Common failures:

- **401 Unauthorized in alloy logs** — token expired or copy-pasted wrong. Mint a fresh one in the Cloud portal, update inventory, re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — ensure `SHITHUB_METRICS__ENABLED=true` in `web.env` (the default), then restart shithubd-web.
- **Free-tier limit hit** — Cloud portal shows "Active series" at cap. Drop high-cardinality labels via `metric_relabel_configs` in `alloy-config.river.j2`.
- **`up{job="shithubd"} = 0` while `up{job="node"} = 1`** — alloy can reach shithubd (`scrape_samples_scraped > 0`) but the parser fails with `expected a valid start token, got "\x1f"`. Root cause: alloy advertised `Accept-Encoding: gzip`, shithubd correctly returned gzipped bytes with `Content-Encoding: gzip`, but alloy's Prometheus scraper parsed the raw bytes pre-gunzip. The fix is already in the template (`enable_compression = false` on the shithubd scrape job). If you see this on a hand-rolled config, add that attribute and `systemctl restart alloy`. Verify via `curl http://127.0.0.1:12345/api/v0/web/components/prometheus.scrape.shithubd` → look for `health: up`.

## Why Grafana Cloud over self-hosted Prometheus

- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM, storage for time-series, an open inbound port (or VPN), TLS for the Grafana UI, an OAuth proxy for auth, and we maintain the upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed backup of metrics + dashboards + alerts, and the data leaves the droplet (so a droplet outage doesn't kill its own observability).
- Trade-off: cardinality cap. Once we approach 10k active series, evaluate either paid Cloud (~$8/mo for 10× the cap) or self-host.