
# Observability — metrics tier

Two layers of monitoring run today:

1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`). Tells us when the droplet is unhappy at the OS level (CPU, mem, disk, load, down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc). Tells us what shithubd is doing internally: request latency, DB pool saturation, panics, job-queue depth, archive failures.

Layer 1 fires for "is the box alive?" Layer 2 fires for "is the app healthy?" Different signal, different audience, both needed.

## What's wired

The `monitoring-client` ansible role runs **node_exporter** (host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the modern Grafana Agent rebrand). Alloy scrapes:

- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics

Alloy `remote_write`s both into Grafana Cloud's Prometheus endpoint (Mimir under the hood). No inbound port on the droplet — push-only. No Prometheus or Grafana running locally.
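To sanity-check the push-only claim from the droplet itself, you can confirm every metrics-related listener is bound to loopback. A quick sketch; it assumes Alloy's debug UI sits on its default `127.0.0.1:12345`, which the troubleshooting section below also relies on:

```sh
# node_exporter (9100), shithubd metrics (8080) and Alloy's debug UI (12345)
# should all show a 127.0.0.1 local address; nothing here needs to be
# reachable from outside the droplet.
ssh root@shithub.sh "ss -ltnp | grep -E ':(9100|8080|12345)'"
```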

### Notable shithubd metrics

| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |
| `shithub_actions_queue_depth` | gauge | Queued Actions runs/jobs. Sustained job depth means runners cannot keep up. |
| `shithub_actions_active` | gauge | Running Actions runs/jobs. Use with capacity to distinguish slow jobs from lack of runners. |
| `shithub_actions_runner_heartbeat_age_seconds` | gauge | Seconds since each runner heartbeat. >60s sustained means the runner is stale. |
| `shithub_actions_run_duration_seconds` | histogram | Terminal Actions run duration by event and conclusion. |
| `shithub_actions_log_chunk_bytes_total` | counter | Accepted Actions log bytes, used for throughput and scrubber health alerts. |
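Before anything is wired to Grafana Cloud, you can confirm these series exist by hitting shithubd's exporter directly on the droplet. A rough check only; exactly which families appear can vary with traffic:

```sh
# Count the shithub_* series on the internal metrics port, then spot-check a few.
ssh root@shithub.sh "curl -fsS http://127.0.0.1:8080/metrics | grep -cE '^shithub_'"
ssh root@shithub.sh "curl -fsS http://127.0.0.1:8080/metrics | grep -E '^shithub_(http_in_flight|panics_total|actions_queue_depth)'"
```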

## Operator setup (one-time)

Free-tier Grafana Cloud gives you ~10k active series — way more than one shithubd droplet emits. Sign-up flow:

1. Go to https://grafana.com/auth/sign-up/create-user. As of 2026 the default flow drops you into a 14-day "Unlimited usage trial" — that's fine; without a card on file it auto-reverts to the Forever Free tier (~10k active series) when the trial ends.
2. Pick a **Stack name** (e.g. `shithub`). The portal creates the stack at `<name>.grafana.net` and lands you in the in-app "Get started" guide.
3. Get the Prometheus details. The in-app guide has drifted; the reliable path is via the org-level Cloud portal:
   - In the right column of the Get started page, **Organization details** → click **Cloud portal**, OR navigate directly to `https://grafana.com/orgs/<your-org>`.
   - Click **Details** on the **Prometheus** tile.
   - The page shows the **Remote Write Endpoint** (full URL), **Username / Instance ID** (numeric), and a **Generate now** link for the **Password / API Token** (a `glc_…` Access Policy token with `metrics:write`). Copy the token immediately — it's shown once.
4. Add to `inventory/production`. Use the exact Remote Write Endpoint shown on the page — the host (prod-NN, region) differs per tenant. A quick credential smoke test is sketched after this list.

   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi...   # the token from step 3
   ```

5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring` (or just rerun `site.yml` — the role is idempotent).
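Before running the play it can be worth a quick credential smoke test. This is a sketch, not an official check: the push endpoint expects snappy-compressed protobuf, so an empty POST won't ingest anything; the point is only to tell a bad credential apart from a good one.

```sh
# Substitute the three values from step 4. A wrong token or instance ID fails
# basic auth (401/403); a valid pair usually gets some other status (a 4xx
# about the payload, or even 200) because the empty body isn't a real
# remote_write. Mind your shell history here, same as in the manual fallback.
GRAFANA_CLOUD_PROM_URL="https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push"
GRAFANA_CLOUD_PROM_USER="1234567"
GRAFANA_CLOUD_PROM_TOKEN="glc_..."
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -u "${GRAFANA_CLOUD_PROM_USER}:${GRAFANA_CLOUD_PROM_TOKEN}" \
  "$GRAFANA_CLOUD_PROM_URL"
```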

### If you don't have ansible installed (manual fallback)

The `monitoring-client` role's actions are simple enough to apply by hand over SSH. Use this if you're trying to bolt monitoring onto an existing droplet and don't want to risk the full play.

```sh
ssh root@shithub.sh '
  # node_exporter
  apt-get install -y prometheus-node-exporter
  systemctl enable --now prometheus-node-exporter

  # Grafana apt repo + alloy
  mkdir -p /etc/apt/keyrings
  curl -fsS https://apt.grafana.com/gpg.key -o /etc/apt/keyrings/grafana.gpg.key
  chmod 0644 /etc/apt/keyrings/grafana.gpg.key
  echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg.key] https://apt.grafana.com stable main" \
    > /etc/apt/sources.list.d/grafana.list
  apt-get update && apt-get install -y alloy

  # Credentials env (mode 0640, root:alloy)
  mkdir -p /etc/alloy && chown root:alloy /etc/alloy && chmod 0750 /etc/alloy
  install -o root -g alloy -m 0640 /dev/stdin /etc/alloy/credentials.env <<EOF
GRAFANA_CLOUD_PROM_URL=<your URL>
GRAFANA_CLOUD_PROM_USER=<your numeric tenant id>
GRAFANA_CLOUD_PROM_TOKEN=<your glc_… token>
EOF

  # systemd drop-in to source the credentials
  mkdir -p /etc/systemd/system/alloy.service.d
  cat > /etc/systemd/system/alloy.service.d/shithub.conf <<EOF
[Service]
EnvironmentFile=/etc/alloy/credentials.env
EOF
  systemctl daemon-reload
'
```

Then copy the rendered `alloy-config.river.j2` (substitute the `{{ ansible_hostname }}` template var with the real hostname) to `/etc/alloy/config.alloy` (mode 0644 root:alloy) and:

```sh
ssh root@shithub.sh 'systemctl enable --now alloy'
```

The token is the only secret; do NOT let it end up in your shell history. The `install … /dev/stdin … <<EOF` form above keeps it out of `ps`/argv, but you'll still want to clear your shell history afterwards.
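If you don't have the repo handy to render the template, the config's shape is roughly the following. This is a minimal sketch, not the committed template: the component names, the `sys.env()` lookups, and the absence of any relabeling or hostname external label are assumptions here; `alloy-config.river.j2` is authoritative.

```sh
ssh root@shithub.sh 'cat > /etc/alloy/config.alloy <<"EOF"
// Scrape node_exporter and shithubd, push both to Grafana Cloud.
prometheus.scrape "node" {
  job_name   = "node"
  targets    = [{ "__address__" = "127.0.0.1:9100" }]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.scrape "shithubd" {
  job_name           = "shithubd"
  targets            = [{ "__address__" = "127.0.0.1:8080" }]
  // shithubd gzips /metrics; see the troubleshooting note below.
  enable_compression = false
  forward_to         = [prometheus.remote_write.grafana_cloud.receiver]
}

prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = sys.env("GRAFANA_CLOUD_PROM_URL")
    basic_auth {
      username = sys.env("GRAFANA_CLOUD_PROM_USER")
      password = sys.env("GRAFANA_CLOUD_PROM_TOKEN")
    }
  }
}
EOF
chown root:alloy /etc/alloy/config.alloy && chmod 0644 /etc/alloy/config.alloy'
```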

Within a minute, metrics start landing. From the Cloud portal:

- **Explore** tab → datasource = your Prometheus → query `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics
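You can also confirm the scrape side from the droplet itself, without the Cloud UI, via Alloy's local debug API. A sketch: the `prometheus.scrape.node` component name and the exact JSON shape are assumptions, while `prometheus.scrape.shithubd` and the `127.0.0.1:12345` endpoint match the troubleshooting section below.

```sh
ssh root@shithub.sh '
  for c in prometheus.scrape.shithubd prometheus.scrape.node; do
    echo "== $c"
    # Expect an up/healthy state for both components.
    curl -fsS "http://127.0.0.1:12345/api/v0/web/components/$c" | grep -io "health[^,}]*"
  done
'
```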

## Building a starter dashboard

Cloud → Dashboards → New → import. Dashboard JSON is committed under:

```text
deploy/monitoring/grafana/dashboards/shithubd-overview.json
deploy/monitoring/grafana/dashboards/actions.json
```
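If you'd rather script the import than click through the UI, something like the following works against the stack's HTTP API. This is a hypothetical sketch: it assumes the stack is named `shithub` and that `GRAFANA_SA_TOKEN` holds a Grafana service-account token with permission to create dashboards (the `glc_…` metrics:write token from setup will not work here).

```sh
# Wrap the committed dashboard JSON in the shape POST /api/dashboards/db expects,
# clearing any baked-in id so Grafana treats it as a create-or-overwrite.
jq -n --slurpfile d deploy/monitoring/grafana/dashboards/shithubd-overview.json \
  '{dashboard: ($d[0] | .id = null), overwrite: true}' |
  curl -fsS -X POST "https://shithub.grafana.net/api/dashboards/db" \
    -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```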

Use the panels below as fallback queries if the dashboard import UI drifts or you need to rebuild by hand.

| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |

## Alerting

Three places to set alert rules. Pick one:

- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope for the free tier without their managed contact points.
- **Skip Grafana alerts entirely.** Use the DO alerts (`alerts.md`) for operational pages and read shithubd metrics manually.

Suggested first three alerts (Cloud UI):

| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query (above) | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | pool utilization query (above) | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | `rate(shithub_panics_total[5m])` | `> 0` for 1m | Worth a bug ticket |

## When metrics stop landing

```sh
ssh root@shithub.sh '
  systemctl status alloy --no-pager
  journalctl -u alloy -n 30 --no-pager | tail -20
  curl -fsS http://127.0.0.1:8080/metrics | head -3
  curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```

Common failures:

- **401 Unauthorized in alloy logs** — token expired or copy-pasted wrong. Mint a fresh one in the Cloud portal, update inventory, re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — ensure `SHITHUB_METRICS__ENABLED=true` in `web.env` (the default), then restart shithubd-web.
- **Free-tier limit hit** — Cloud portal shows "Active series" at cap. Drop high-cardinality labels via `metric_relabel_configs` in `alloy-config.river.j2`.
- **`up{job="shithubd"} = 0` while `up{job="node"} = 1`** — alloy can reach shithubd (`scrape_samples_scraped > 0`) but the parser fails with `expected a valid start token, got "\x1f"`. Root cause: alloy advertised `Accept-Encoding: gzip`, shithubd correctly returned gzipped bytes with `Content-Encoding: gzip`, but alloy's Prometheus scraper parsed the raw bytes pre-gunzip. The fix is already in the template (`enable_compression = false` on the shithubd scrape job). If you see this on a hand-rolled config, add that attribute and `systemctl restart alloy`. Verify via `curl http://127.0.0.1:12345/api/v0/web/components/prometheus.scrape.shithubd` → look for `health: up`.

## Why Grafana Cloud over self-hosted Prometheus

- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM, storage for time-series, an open inbound port (or VPN), TLS for the Grafana UI, an OAuth proxy for auth, and we maintain the upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed backup of metrics + dashboards + alerts, and the data leaves the droplet (so a droplet outage doesn't kill its own observability).
- Trade-off: cardinality cap. Once we approach 10k active series, evaluate either paid Cloud (~$8/mo for 10× the cap) or self-host.