
# Observability — metrics tier

Two layers of monitoring run today:

1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`). Tells us when the droplet is unhappy at the OS level (CPU, mem, disk, load, down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc). Tells us what shithubd is doing internally: request latency, DB pool saturation, panics, job-queue depth, archive failures.

Layer 1 fires for "is the box alive?" Layer 2 fires for "is the app healthy?" Different signal, different audience, both needed.

## What's wired

The `monitoring-client` ansible role runs **node_exporter** (host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the modern Grafana Agent rebrand). Alloy scrapes:

- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics

Alloy `remote_write`s both into Grafana Cloud's Prometheus endpoint (Mimir under the hood). No inbound port on the droplet — push-only. No Prometheus or Grafana running locally.
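
For orientation, here is roughly the shape of the config Alloy ends up with. This is a hand-written sketch, not the rendered template — the component labels, `job_name`s, and `env()` lookups are assumptions that should only approximately match `alloy-config.river.j2`:

```alloy
// Sketch of /etc/alloy/config.alloy — illustrative only; the real file is
// rendered from alloy-config.river.j2 by the monitoring-client role.

prometheus.scrape "node" {
  job_name   = "node"
  targets    = [{ "__address__" = "127.0.0.1:9100" }]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.scrape "shithubd" {
  job_name           = "shithubd"
  targets            = [{ "__address__" = "127.0.0.1:8080" }]
  forward_to         = [prometheus.remote_write.grafana_cloud.receiver]
  enable_compression = false // see the gzip parse failure under "When metrics stop landing"
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_CLOUD_PROM_URL")

    basic_auth {
      username = env("GRAFANA_CLOUD_PROM_USER")
      password = env("GRAFANA_CLOUD_PROM_TOKEN")
    }
  }
}
```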

### Notable shithubd metrics

| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |

## Operator setup (one-time)

Free-tier Grafana Cloud gives you ~10k active series — way more than one shithubd droplet emits. Sign-up flow:

1. Go to https://grafana.com/auth/sign-up/create-user. As of 2026 the default flow drops you into a 14-day "Unlimited usage trial" — that's fine; without a card on file it auto-reverts to the Forever Free tier (~10k active series) when the trial ends.
2. Pick a **Stack name** (e.g. `shithub`). The portal creates the stack at `<name>.grafana.net` and lands you in the in-app "Get started" guide.
3. Get the Prometheus details. The in-app guide has drifted; the reliable path is via the org-level Cloud portal:
   - In the right column of the Get started page, **Organization details** → click **Cloud portal**, OR navigate directly to `https://grafana.com/orgs/<your-org>`.
   - Click **Details** on the **Prometheus** tile.
   - The page shows the **Remote Write Endpoint** (full URL), **Username / Instance ID** (numeric), and a **Generate now** link for the **Password / API Token** (a `glc_…` Access Policy token with `metrics:write`). Copy the token immediately — it's shown once.
4. Add to `inventory/production`. Use the exact Remote Write Endpoint shown on the page — the host (prod-NN, region) differs per tenant:
   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi...   # the token from step 3
   ```
5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring` (or just rerun site.yml — the role is idempotent).
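
If you want to preview what the role would change before applying it, ansible's check mode is a harmless first pass (optional; same tag as step 5):

```sh
# dry run: report what the monitoring role would change without changing anything
ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring --check --diff
```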

### If you don't have ansible installed (manual fallback)

The monitoring-client role's actions are simple enough to apply by hand over SSH. Use this if you're trying to bolt monitoring onto an existing droplet and don't want to risk the full play.

```sh
ssh root@shithub.sh '
  # node_exporter
  apt-get install -y prometheus-node-exporter
  systemctl enable --now prometheus-node-exporter

  # Grafana apt repo + alloy
  mkdir -p /etc/apt/keyrings
  curl -fsS https://apt.grafana.com/gpg.key -o /etc/apt/keyrings/grafana.gpg.key
  chmod 0644 /etc/apt/keyrings/grafana.gpg.key
  echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg.key] https://apt.grafana.com stable main" \
    > /etc/apt/sources.list.d/grafana.list
  apt-get update && apt-get install -y alloy

  # Credentials env (mode 0640, root:alloy)
  mkdir -p /etc/alloy && chown root:alloy /etc/alloy && chmod 0750 /etc/alloy
  install -o root -g alloy -m 0640 /dev/stdin /etc/alloy/credentials.env <<EOF
GRAFANA_CLOUD_PROM_URL=<your URL>
GRAFANA_CLOUD_PROM_USER=<your numeric tenant id>
GRAFANA_CLOUD_PROM_TOKEN=<your glc_… token>
EOF

  # systemd drop-in to source the credentials
  mkdir -p /etc/systemd/system/alloy.service.d
  cat > /etc/systemd/system/alloy.service.d/shithub.conf <<EOF
[Service]
EnvironmentFile=/etc/alloy/credentials.env
EOF
  systemctl daemon-reload
'
```

Then copy the rendered `alloy-config.river.j2` (substitute the `{{ ansible_hostname }}` template var with the real hostname) to `/etc/alloy/config.alloy` (mode 0644 root:alloy) and:

```sh
ssh root@shithub.sh 'systemctl enable --now alloy'
```
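
If you'd rather not hand-edit the template, something like the following does the substitution and copy in one shot. The template path here is a guess at the repo layout — adjust it to wherever the monitoring-client role actually keeps the file:

```sh
# hypothetical template path; substitute the hostname, then install the config remotely
sed "s/{{ ansible_hostname }}/$(ssh root@shithub.sh hostname)/g" \
  deploy/ansible/roles/monitoring-client/templates/alloy-config.river.j2 |
  ssh root@shithub.sh 'install -o root -g alloy -m 0644 /dev/stdin /etc/alloy/config.alloy'
```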

The token is the only secret; don't let it end up in your shell history. The `install … /dev/stdin … <<EOF` form above keeps it out of `ps`/argv, but you'll still want to clear your shell history afterwards.

Within a minute, metrics start landing. From the Cloud portal:

- **Explore** tab → datasource = your Prometheus → query `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics
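
The same checks work from a terminal if you'd rather not click through Explore. This sketch assumes a second Access Policy token with `metrics:read` (the push token from setup is write-only) and the query path Grafana Cloud exposes alongside the push endpoint — verify both against your stack's Prometheus details page:

```sh
# query the hosted Prometheus directly; READ_TOKEN is a metrics:read token, not the push token
curl -s -u "1234567:$READ_TOKEN" \
  --data-urlencode 'query=up{job=~"node|shithubd"}' \
  "https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/api/v1/query"
```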

## Building a starter dashboard

Cloud → Dashboards → New → Import. Use the panels below as a spine. (We're not committing the dashboard JSON to the repo yet because Grafana's dashboard UIDs are stack-specific; document the queries and let the operator build the dashboard once.)

| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |
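
Once the dashboard exists you can still pull its JSON out for safekeeping without committing it. A sketch, assuming a service-account token created in the stack (not the metrics-push token) and the dashboard UID taken from its URL:

```sh
# export the dashboard JSON from the stack; the UID is the short id in the dashboard's URL
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  "https://shithub.grafana.net/api/dashboards/uid/<uid>" | jq .dashboard > shithub-dashboard.json
```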

## Alerting

Three places to set alert rules. Pick one:

- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope for the free tier without their managed contact points.
- **Skip Grafana alerts entirely.** Use the DO alerts (`alerts.md`) for operational pages and read shithubd metrics manually.

Suggested first three alerts (Cloud UI):

| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query above | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | utilization query | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | panics rate | `> 0` for 1m | Worth a bug ticket |
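
Spelled out, the expressions behind those rows (the same queries as above with the thresholds folded in) look roughly like this; keep the `route` label in the first one if you'd rather get one alert per route:

```promql
# shithubd p95 latency high — route label dropped so one rule covers the worst overall p95
histogram_quantile(0.95, sum by (le) (rate(shithub_http_request_duration_seconds_bucket[5m]))) > 1

# DB pool saturation
shithub_db_pool_acquired / shithub_db_pool_total > 0.8

# recovered panic in the last 5m
rate(shithub_panics_total[5m]) > 0
```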

## When metrics stop landing

```sh
ssh root@shithub.sh '
  systemctl status alloy --no-pager
  journalctl -u alloy -n 30 --no-pager | tail -20
  curl -fsS http://127.0.0.1:8080/metrics | head -3
  curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```

Common failures:

- **401 Unauthorized in alloy logs** — token expired or copy-pasted wrong. Mint a fresh one in the Cloud portal, update inventory, re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — make sure `SHITHUB_METRICS__ENABLED=true` is set in web.env (the default) and restart shithubd-web.
- **Free-tier limit hit** — Cloud portal shows "Active series" at cap. Drop high-cardinality labels via `metric_relabel_configs` in alloy-config.river.j2.
- **`up{job="shithubd"} = 0` while `up{job="node"} = 1`** — alloy can reach shithubd (`scrape_samples_scraped > 0`) but the parser fails with `expected a valid start token, got "\x1f"`. Root cause: alloy advertised `Accept-Encoding: gzip`, shithubd correctly returned gzipped bytes with `Content-Encoding: gzip`, but alloy's Prometheus scraper parsed the raw bytes pre-gunzip. The fix is already in the template (`enable_compression = false` on the shithubd scrape job). If you see this on a hand-rolled config, add that attribute and `systemctl restart alloy`. Verify via `curl http://127.0.0.1:12345/api/v0/web/components/prometheus.scrape.shithubd` → look for `health: up`.

## Why Grafana Cloud over self-hosted Prometheus

- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM, storage for time-series, an open inbound port (or VPN), TLS for the Grafana UI, an OAuth proxy for auth, and we maintain the upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed backup of metrics + dashboards + alerts, and the data leaves the droplet (so a droplet outage doesn't kill its own observability).
- Trade-off: cardinality cap. Once we approach 10k active series, evaluate either paid Cloud (~$8/mo for 10× the cap) or self-host.