# Observability — metrics tier

Two layers of monitoring run today:

1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`).
   Tells us when the droplet is unhappy at the OS level (CPU, mem,
   disk, load, down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc).
   Tells us what shithubd is doing internally: request latency,
   DB pool saturation, panics, job-queue depth, archive failures.

Layer 1 fires for "is the box alive?" Layer 2 fires for "is the
app healthy?" Different signal, different audience, both needed.

## What's wired

The `monitoring-client` ansible role runs **node_exporter**
(host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the
successor to Grafana Agent). Alloy scrapes:

- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics

Alloy `remote_write`s both into Grafana Cloud's Prometheus
endpoint (Mimir under the hood). No inbound port on the droplet —
push-only. No Prometheus or Grafana running locally.
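
A trimmed sketch of the shape `alloy-config.river.j2` renders to
(component labels here are illustrative; the real template fills in
the inventory vars from the setup section below):

```river
// Scrape both loopback targets and push everything to Grafana Cloud.
prometheus.scrape "node" {
  targets    = [{"__address__" = "127.0.0.1:9100", "job" = "node"}]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.scrape "shithubd" {
  targets    = [{"__address__" = "127.0.0.1:8080", "job" = "shithubd"}]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = "{{ grafana_cloud_prom_url }}"

    basic_auth {
      username = "{{ grafana_cloud_prom_user }}"
      password = "{{ grafana_cloud_prom_token }}"
    }
  }
}
```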

### Notable shithubd metrics

| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |

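To eyeball any of these on the box without Grafana (sample output
is illustrative; real values and label sets come from the running
app):

```sh
curl -s http://127.0.0.1:8080/metrics | grep '^shithub_db_pool'
# shithub_db_pool_acquired 3
# shithub_db_pool_total 10
```
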
## Operator setup (one-time)

Free-tier Grafana Cloud gives you ~10k active series — way more
than one shithubd droplet emits. Sign-up flow:

1. Go to https://grafana.com/auth/sign-up/create-user, pick the
   **free** plan ("Forever Free" — no card required).
2. Pick a **Stack name** (e.g. `shithub`). The portal creates
   it and lands you on the stack overview page.
3. Click **Send Metrics → Hosted Prometheus metrics**. The page
   shows the **URL**, **Username** (numeric instance ID), and a
   **Generate now** button for the password (an Access Policy token
   with `metrics:write`). Copy all three immediately — the token is
   shown only once.
4. Add to `inventory/production`:
   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-NN-prod-us-central-0.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi... # the token from step 3
   ```
5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring`
   (or just rerun site.yml — the role is idempotent).
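
To preview what step 5 would change before applying it, ansible's
check mode works (the diff is indicative rather than exact for
package installs):

```sh
# Dry run: report what the monitoring role would change, change nothing.
ansible-playbook -i inventory/production deploy/ansible/site.yml \
  -t monitoring --check --diff
```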

Within a minute, metrics start landing. From the Cloud portal:

- **Explore** tab → datasource = your Prometheus → query
  `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics
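
The same checks work from a terminal against the hosted query API
(a sketch: it assumes a second Access Policy token with
`metrics:read`, since the push token from step 3 is write-only, and
that the query API sits under the same `/api/prom` prefix as the
push URL, which is how Grafana Cloud's Mimir stacks are laid out):

```sh
# GLC_READ_TOKEN is a placeholder for the metrics:read token.
curl -s -u "1234567:$GLC_READ_TOKEN" \
  --data-urlencode 'query=up{job=~"node|shithubd"}' \
  https://prometheus-prod-NN-prod-us-central-0.grafana.net/api/prom/api/v1/query
```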

## Building a starter dashboard

Cloud → Dashboards → New → import. Use the panels below as a
spine. (We're not committing the dashboard JSON to the repo yet
because Grafana's UUIDs are stack-specific; document the queries
here and let the operator import once. A sketch for pulling the
JSON back out as a backup follows the table.)

| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |
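
Once built, the dashboard JSON can be pulled down for safekeeping
via Grafana's HTTP API (a sketch: the service-account token and the
dashboard UID are placeholders you create and read in the Cloud UI,
and the hostname assumes the stack is named `shithub` as in step 2):

```sh
# Export one dashboard by UID; jq strips the API wrapper object.
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  https://shithub.grafana.net/api/dashboards/uid/DASHBOARD_UID | jq .dashboard
```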

## Alerting

Three places to set alert rules. Pick one:

- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to
  email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope on the free
  tier without their managed contact points.
- **Skip Grafana alerts entirely; use the DO alerts (`alerts.md`)
  for operational pages and read shithubd metrics manually.**

Suggested first three alerts (Cloud UI):

| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query above | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | utilization query | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | panics rate | `> 0` for 1m | Worth a bug ticket |
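
Spelled out, the first rule's expression looks like this (aggregated
across routes so the alert tracks overall user experience; add
`route` to the `by` clause to page per endpoint instead):

```promql
histogram_quantile(
  0.95,
  sum by (le) (rate(shithub_http_request_duration_seconds_bucket[5m]))
) > 1
```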

## When metrics stop landing

```sh
ssh root@shithub.sh '
  systemctl status alloy --no-pager
  journalctl -u alloy -n 30 --no-pager | tail -20
  curl -fsS http://127.0.0.1:8080/metrics | head -3
  curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```

Common failures:

- **401 Unauthorized in alloy logs** — token expired or
  copy-pasted wrong. Mint a fresh one in the Cloud portal,
  update inventory, re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall
  prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — make sure
  `SHITHUB_METRICS__ENABLED=true` is set in web.env (it is the
  default), then restart shithubd-web.
- **Free-tier limit hit** — the Cloud portal shows "Active series"
  at the cap. Drop high-cardinality series with a `prometheus.relabel`
  stage in alloy-config.river.j2 (Alloy's equivalent of Prometheus's
  `metric_relabel_configs`); see the sketch below.
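
A minimal sketch of that relabel stage (the regex is illustrative;
each `prometheus.scrape` component's `forward_to` then points at
`prometheus.relabel.trim.receiver` instead of the remote_write
receiver):

```river
// Drop runtime series we never chart before they count against the cap.
prometheus.relabel "trim" {
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]

  rule {
    source_labels = ["__name__"]
    regex         = "go_(gc|memstats)_.*"
    action        = "drop"
  }
}
```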

## Why Grafana Cloud over self-hosted Prometheus

- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM,
  storage for time-series, an open inbound port (or VPN), TLS for
  the Grafana UI, an OAuth proxy for auth, and we maintain the
  upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed
  backup of metrics + dashboards + alerts, and the data leaves
  the droplet (so a droplet outage doesn't kill its own
  observability).
- Trade-off: cardinality cap. Once we approach 10k active series,
  evaluate either paid Cloud (~$8/mo for 10× the cap) or self-host.