tenseleyflow/shithub / b86182c

docs: observability runbook — Grafana Cloud signup, queries, dashboard, alerts

Authored by espadonne
SHA: b86182c0f3e151dd2d4d998039c6e47d195c3152
Parents: d27e7a1
Tree: 91c27d7

1 changed file

A  docs/internal/runbooks/observability.md  (+146, −0)

# Observability — metrics tier

Two layers of monitoring run today:

1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`).
   Tells us when the droplet is unhappy at the OS level (CPU, memory,
   disk, load, host down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc).
   Tells us what shithubd is doing internally: request latency,
   DB pool saturation, panics, job-queue depth, archive failures.

Layer 1 fires for "is the box alive?" Layer 2 fires for "is the
app healthy?" Different signal, different audience, both needed.

## What's wired

The `monitoring-client` ansible role runs **node_exporter**
(host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the
successor to Grafana Agent). Alloy scrapes:

- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics

Alloy `remote_write`s both into Grafana Cloud's Prometheus
endpoint (Mimir under the hood). No inbound port on the droplet —
push-only. No Prometheus or Grafana running locally.

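The Alloy side is templated from `alloy-config.river.j2`. As a rough
sketch (component and job names here are illustrative, not a promise
about what the template actually uses), the rendered config looks
something like:

```river
// Rough sketch of what alloy-config.river.j2 renders to; names are illustrative.

// Scrape node_exporter and shithubd on loopback.
prometheus.scrape "node" {
  job_name   = "node"
  targets    = [{ "__address__" = "127.0.0.1:9100" }]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.scrape "shithubd" {
  job_name   = "shithubd"
  targets    = [{ "__address__" = "127.0.0.1:8080" }]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

// Push both streams to Grafana Cloud with the credentials from the inventory.
prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "{{ grafana_cloud_prom_url }}"

    basic_auth {
      username = "{{ grafana_cloud_prom_user }}"
      password = "{{ grafana_cloud_prom_token }}"
    }
  }
}
```
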
### Notable shithubd metrics

| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |

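To see the full exported set with current values, hit the endpoint
directly on the droplet:

```sh
# Dump shithubd's metric families with their current values (run on the droplet).
curl -fsS http://127.0.0.1:8080/metrics | grep -E '^shithub_'
```
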
## Operator setup (one-time)

Free-tier Grafana Cloud gives you ~10k active series — way more
than one shithubd droplet emits. Sign-up flow:

1. Go to https://grafana.com/auth/sign-up/create-user, pick the
   **free** plan ("Forever Free" — no card required).
2. Pick a **Stack name** (e.g. `shithub`). The portal creates
   it and lands you in the stack overview page.
3. Click **Send Metrics → Hosted Prometheus metrics**. The page
   shows the **URL**, **Username** (numeric instance ID), and a
   **Generate now** button for the password (an Access Policy token
   with `metrics:write`). Copy all three immediately — the token is
   shown once.
4. Add to `inventory/production`:
   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-NN-prod-us-central-0.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi...   # the token from step 3
   ```
5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring`
   (or just rerun site.yml — the role is idempotent).

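Before (or right after) the playbook run, the credentials can be
sanity-checked from any shell. An empty POST is not a valid
`remote_write` payload, so a 2xx is not expected; the useful signal is
that a bad token comes back as `401`, while any other status means
auth was accepted:

```sh
# Substitute the three values from step 4; prints only the HTTP status code.
curl -s -o /dev/null -w '%{http_code}\n' \
  -u "1234567:glc_eyJrIjoi..." \
  -X POST "https://prometheus-prod-NN-prod-us-central-0.grafana.net/api/prom/push"
```
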
Within a minute, metrics start landing. From the Cloud portal:

- **Explore** tab → datasource = your Prometheus → query
  `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics

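The same checks can be run from the CLI against the stack's Prometheus
query API. The query base is the push URL minus `/push`; note this
needs a token with `metrics:read`, so the write-only token from step 3
will not work:

```sh
# Instance ID plus a metrics:read token (mint one in the Cloud portal).
curl -sG -u "1234567:<metrics-read token>" \
  "https://prometheus-prod-NN-prod-us-central-0.grafana.net/api/prom/api/v1/query" \
  --data-urlencode 'query=up{job=~"node|shithubd"}'
```
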
## Building a starter dashboard

Cloud → Dashboards → New. Use the panels below as a spine.
(We're not committing the dashboard JSON to the repo yet because
Grafana's datasource UIDs are stack-specific; we document the
queries here and let the operator build the dashboard once.)

| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |

## Alerting

Three places to set alert rules. Pick one:

- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to
  email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope for the free
  tier without their managed contact points.
- **Skip Grafana alerts entirely; use the DO alerts (`alerts.md`)
  for operational pages and read shithubd metrics manually.**

Suggested first three alerts (Cloud UI; ready-to-paste expressions
follow the table):

| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query above | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | utilization query | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | panics rate | `> 0` for 1m | Worth a bug ticket |

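Spelled out as PromQL for pasting into the rule editor (thresholds
folded into the expression; the rule fires while the expression
returns any series):

```promql
# shithubd p95 latency high (for: 5m); the histogram is in seconds
histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m]))) > 1

# DB pool saturation (for: 10m)
shithub_db_pool_acquired / shithub_db_pool_total > 0.8

# Recovered panic in last 5m (for: 1m)
rate(shithub_panics_total[5m]) > 0
```
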
## When metrics stop landing

```sh
ssh root@shithub.sh '
  systemctl status alloy --no-pager
  journalctl -u alloy -n 30 --no-pager
  curl -fsS http://127.0.0.1:8080/metrics | head -3
  curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```

Common failures:

- **401 Unauthorized in alloy logs** — token expired or
  copy-pasted wrong. Mint a fresh one in the Cloud portal,
  update inventory, re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall
  prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — make sure
  `SHITHUB_METRICS__ENABLED=true` is set in web.env (it is the
  default), then restart shithubd-web.
- **Free-tier limit hit** — Cloud portal shows "Active series" at
  cap. Drop high-cardinality labels with a `prometheus.relabel`
  component (Alloy's name for `metric_relabel_configs`) in
  alloy-config.river.j2; see the sketch below.

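A sketch of that relabel step, assuming a made-up label name (point
the scrape components' `forward_to` at it instead of at the
remote_write):

```river
// Illustrative only: drop a high-cardinality label before it reaches Grafana Cloud.
prometheus.relabel "drop_noisy_labels" {
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]

  rule {
    action = "labeldrop"
    regex  = "request_id"   // hypothetical label; substitute the real offender
  }
}
```
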
## Why Grafana Cloud over self-hosted Prometheus

- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM,
  storage for time-series, an open inbound port (or VPN), TLS for
  the Grafana UI, an OAuth proxy for auth, and we maintain the
  upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed
  backup of metrics + dashboards + alerts, and the data leaves
  the droplet (so a droplet outage doesn't kill its own
  observability).
- Trade-off: cardinality cap. Once we approach 10k active series
  (a quick check is sketched below), evaluate either paid Cloud
  (~$8/mo for 10× the cap) or self-host.
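
The portal's "Active series" panel is the authoritative number, but a
rough local proxy is the count of exposition lines each exporter
currently serves:

```sh
# Count non-comment lines (roughly one per series) on each local endpoint.
ssh root@shithub.sh '
  curl -fsS http://127.0.0.1:8080/metrics | grep -vc "^#"
  curl -fsS http://127.0.0.1:9100/metrics | grep -vc "^#"
'
```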