
# Observability — metrics tier

Two layers of monitoring run today:

1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`). Tells us when the droplet is unhappy at the OS level (CPU, mem, disk, load, down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc). Tells us what shithubd is doing internally: request latency, DB pool saturation, panics, job-queue depth, archive failures.

Layer 1 fires for "is the box alive?" Layer 2 fires for "is the app healthy?" Different signal, different audience, both needed.

## What's wired

The `monitoring-client` ansible role runs **node_exporter** (host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the modern Grafana Agent rebrand). Alloy scrapes:

- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics

Alloy `remote_write`s both into Grafana Cloud's Prometheus endpoint (Mimir under the hood). No inbound port on the droplet — push-only. No Prometheus or Grafana running locally.
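
For orientation, here is roughly the shape of the config Alloy ends up with. This is a hand-written sketch, not the rendered template — the component labels, `job_name`s, and `env()` lookups are assumptions that should only approximately match `alloy-config.river.j2`:

```alloy
// Sketch of /etc/alloy/config.alloy — illustrative only; the real file is
// rendered from alloy-config.river.j2 by the monitoring-client role.

prometheus.scrape "node" {
  job_name   = "node"
  targets    = [{ "__address__" = "127.0.0.1:9100" }]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.scrape "shithubd" {
  job_name           = "shithubd"
  targets            = [{ "__address__" = "127.0.0.1:8080" }]
  forward_to         = [prometheus.remote_write.grafana_cloud.receiver]
  enable_compression = false // see the gzip parse failure under "When metrics stop landing"
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_CLOUD_PROM_URL")

    basic_auth {
      username = env("GRAFANA_CLOUD_PROM_USER")
      password = env("GRAFANA_CLOUD_PROM_TOKEN")
    }
  }
}
```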

### Notable shithubd metrics

| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |

## Operator setup (one-time)

Free-tier Grafana Cloud gives you ~10k active series — way more than one shithubd droplet emits. Sign-up flow:

1. Go to https://grafana.com/auth/sign-up/create-user. As of 2026 the default flow drops you into a 14-day "Unlimited usage trial" — that's fine; without a card on file it auto-reverts to the Forever Free tier (~10k active series) when the trial ends.
2. Pick a **Stack name** (e.g. `shithub`). The portal creates the stack at `<name>.grafana.net` and lands you in the in-app "Get started" guide.
3. Get the Prometheus details. The in-app guide has drifted; the reliable path is via the org-level Cloud portal:
   - In the right column of the Get started page, **Organization details** → click **Cloud portal**, OR navigate directly to `https://grafana.com/orgs/<your-org>`.
   - Click **Details** on the **Prometheus** tile.
   - The page shows the **Remote Write Endpoint** (full URL), **Username / Instance ID** (numeric), and a **Generate now** link for the **Password / API Token** (a `glc_…` Access Policy token with `metrics:write`). Copy the token immediately — it's shown once.
4. Add to `inventory/production`. Use the exact Remote Write Endpoint shown on the page — the host (prod-NN, region) differs per tenant:
   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi...   # the token from step 3
   ```
5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring` (or just rerun site.yml — the role is idempotent).
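
If you want to preview what the role would change before applying it, ansible's check mode is a harmless first pass (optional; same tag as step 5):

```sh
# dry run: report what the monitoring role would change without changing anything
ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring --check --diff
```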

### If you don't have ansible installed (manual fallback)

The monitoring-client role's actions are simple enough to apply by hand over SSH. Use this if you're trying to bolt monitoring onto an existing droplet and don't want to risk the full play.

```sh
ssh root@shithub.sh '
  # node_exporter
  apt-get install -y prometheus-node-exporter
  systemctl enable --now prometheus-node-exporter

  # Grafana apt repo + alloy
  mkdir -p /etc/apt/keyrings
  curl -fsS https://apt.grafana.com/gpg.key -o /etc/apt/keyrings/grafana.gpg.key
  chmod 0644 /etc/apt/keyrings/grafana.gpg.key
  echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg.key] https://apt.grafana.com stable main" \
    > /etc/apt/sources.list.d/grafana.list
  apt-get update && apt-get install -y alloy

  # Credentials env (mode 0640, root:alloy)
  mkdir -p /etc/alloy && chown root:alloy /etc/alloy && chmod 0750 /etc/alloy
  install -o root -g alloy -m 0640 /dev/stdin /etc/alloy/credentials.env <<EOF
GRAFANA_CLOUD_PROM_URL=<your URL>
GRAFANA_CLOUD_PROM_USER=<your numeric tenant id>
GRAFANA_CLOUD_PROM_TOKEN=<your glc_… token>
EOF

  # systemd drop-in to source the credentials
  mkdir -p /etc/systemd/system/alloy.service.d
  cat > /etc/systemd/system/alloy.service.d/shithub.conf <<EOF
[Service]
EnvironmentFile=/etc/alloy/credentials.env
EOF
  systemctl daemon-reload
'
```

Then copy the rendered `alloy-config.river.j2` (substitute the `{{ ansible_hostname }}` template var with the real hostname) to `/etc/alloy/config.alloy` (mode 0644 root:alloy) and:

```sh
ssh root@shithub.sh 'systemctl enable --now alloy'
```
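
If you'd rather not hand-edit the template, something like the following does the substitution and copy in one shot. The template path here is a guess at the repo layout — adjust it to wherever the monitoring-client role actually keeps the file:

```sh
# hypothetical template path; substitute the hostname, then install the config remotely
sed "s/{{ ansible_hostname }}/$(ssh root@shithub.sh hostname)/g" \
  deploy/ansible/roles/monitoring-client/templates/alloy-config.river.j2 |
  ssh root@shithub.sh 'install -o root -g alloy -m 0644 /dev/stdin /etc/alloy/config.alloy'
```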

The token is the only secret; don't let it end up in your shell history. The `install … /dev/stdin … <<EOF` form above keeps it out of `ps`/argv, but you'll still want to clear your shell history afterwards.

Within a minute, metrics start landing. From the Cloud portal:

- **Explore** tab → datasource = your Prometheus → query `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics
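
The same checks work from a terminal if you'd rather not click through Explore. This sketch assumes a second Access Policy token with `metrics:read` (the push token from setup is write-only) and the query path Grafana Cloud exposes alongside the push endpoint — verify both against your stack's Prometheus details page:

```sh
# query the hosted Prometheus directly; READ_TOKEN is a metrics:read token, not the push token
curl -s -u "1234567:$READ_TOKEN" \
  --data-urlencode 'query=up{job=~"node|shithubd"}' \
  "https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/api/v1/query"
```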

## Building a starter dashboard

Cloud → Dashboards → New → Import. Use the panels below as a spine. (We're not committing the dashboard JSON to the repo yet because Grafana's dashboard UIDs are stack-specific; document the queries and let the operator build the dashboard once.)

| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |
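
Once the dashboard exists you can still pull its JSON out for safekeeping without committing it. A sketch, assuming a service-account token created in the stack (not the metrics-push token) and the dashboard UID taken from its URL:

```sh
# export the dashboard JSON from the stack; the UID is the short id in the dashboard's URL
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  "https://shithub.grafana.net/api/dashboards/uid/<uid>" | jq .dashboard > shithub-dashboard.json
```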

## Alerting

Three places to set alert rules. Pick one:

- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope for the free tier without their managed contact points.
- **Skip Grafana alerts entirely.** Use the DO alerts (`alerts.md`) for operational pages and read shithubd metrics manually.

Suggested first three alerts (Cloud UI):

| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query above | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | utilization query | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | panics rate | `> 0` for 1m | Worth a bug ticket |
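
Spelled out, the expressions behind those rows (the same queries as above with the thresholds folded in) look roughly like this; keep the `route` label in the first one if you'd rather get one alert per route:

```promql
# shithubd p95 latency high — route label dropped so one rule covers the worst overall p95
histogram_quantile(0.95, sum by (le) (rate(shithub_http_request_duration_seconds_bucket[5m]))) > 1

# DB pool saturation
shithub_db_pool_acquired / shithub_db_pool_total > 0.8

# recovered panic in the last 5m
rate(shithub_panics_total[5m]) > 0
```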

## When metrics stop landing

```sh
ssh root@shithub.sh '
  systemctl status alloy --no-pager
  journalctl -u alloy -n 30 --no-pager | tail -20
  curl -fsS http://127.0.0.1:8080/metrics | head -3
  curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```

Common failures:

- **401 Unauthorized in alloy logs** — token expired or copy-pasted wrong. Mint a fresh one in the Cloud portal, update inventory, re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — make sure `SHITHUB_METRICS__ENABLED=true` is set in web.env (the default) and restart shithubd-web.
- **Free-tier limit hit** — Cloud portal shows "Active series" at cap. Drop high-cardinality labels via `metric_relabel_configs` in alloy-config.river.j2.
- **`up{job="shithubd"} = 0` while `up{job="node"} = 1`** — alloy can reach shithubd (`scrape_samples_scraped > 0`) but the parser fails with `expected a valid start token, got "\x1f"`. Root cause: alloy advertised `Accept-Encoding: gzip`, shithubd correctly returned gzipped bytes with `Content-Encoding: gzip`, but alloy's Prometheus scraper parsed the raw bytes pre-gunzip. The fix is already in the template (`enable_compression = false` on the shithubd scrape job). If you see this on a hand-rolled config, add that attribute and `systemctl restart alloy`. Verify via `curl http://127.0.0.1:12345/api/v0/web/components/prometheus.scrape.shithubd` → look for `health: up`.

## Why Grafana Cloud over self-hosted Prometheus

- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM, storage for time-series, an open inbound port (or VPN), TLS for the Grafana UI, an OAuth proxy for auth, and we maintain the upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed backup of metrics + dashboards + alerts, and the data leaves the droplet (so a droplet outage doesn't kill its own observability).
- Trade-off: cardinality cap. Once we approach 10k active series, evaluate either paid Cloud (~$8/mo for 10× the cap) or self-host.