# Observability — metrics tier
Two layers of monitoring run today:
1. **DigitalOcean droplet alerts + uptime check** (see `alerts.md`).
   Tells us when the droplet is unhappy at the OS level (CPU, mem,
   disk, load, down, SSL expiry, latency from outside).
2. **Grafana Cloud free tier — Prometheus + dashboards** (this doc).
   Tells us what shithubd is doing internally: request latency,
   DB pool saturation, panics, job-queue depth, archive failures.
Layer 1 fires for "is the box alive?" Layer 2 fires for "is the app healthy?" Different signal, different audience, both needed.
## What's wired
The `monitoring-client` ansible role runs **node_exporter**
(host metrics on `127.0.0.1:9100`) and **Grafana Alloy** (the
modern Grafana Agent rebrand). Alloy scrapes:
- `127.0.0.1:9100/metrics` — host metrics from node_exporter
- `127.0.0.1:8080/metrics` — shithubd's internal metrics
Alloy `remote_write`s both into Grafana Cloud's Prometheus
endpoint (Mimir under the hood). No inbound port on the droplet —
push-only. No Prometheus or Grafana running locally.
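
Because the free tier caps active series, it's worth knowing roughly how much
each target contributes. A quick sketch, assuming the default ports above —
it just counts the non-comment lines each endpoint exposes:

```sh
ssh root@shithub.sh '
curl -fsS http://127.0.0.1:9100/metrics | grep -vc "^#"   # samples from node_exporter
curl -fsS http://127.0.0.1:8080/metrics | grep -vc "^#"   # samples from shithubd
'
```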
### Notable shithubd metrics
| Name | Type | Meaning |
|---|---|---|
| `shithub_http_in_flight` | gauge | Requests currently being served. >50 sustained = saturated. |
| `shithub_http_request_duration_seconds` | histogram | Request latency by route + method. Compute p95 from buckets. |
| `shithub_http_requests_total` | counter | All-time request count by route + method + status. Rate it for QPS. |
| `shithub_db_pool_acquired` | gauge | Active Postgres connections. Approaching `shithub_db_pool_total` = saturation. |
| `shithub_db_pool_acquire_wait_seconds_total` | counter | Cumulative wait time. Sudden derivative climb = pool too small. |
| `shithub_panics_total` | counter | Recovered panics. Should be 0 in steady state. |
| `shithub_actions_queue_depth` | gauge | Queued Actions runs/jobs. Sustained job depth means runners cannot keep up. |
| `shithub_actions_active` | gauge | Running Actions runs/jobs. Use with capacity to distinguish slow jobs from lack of runners. |
| `shithub_actions_runner_heartbeat_age_seconds` | gauge | Seconds since each runner heartbeat. >60s sustained means the runner is stale. |
| `shithub_actions_run_duration_seconds` | histogram | Terminal Actions run duration by event and conclusion. |
| `shithub_actions_log_chunk_bytes_total` | counter | Accepted Actions log bytes, used for throughput and scrubber health alerts. |
## Operator setup (one-time)
Free-tier Grafana Cloud gives you ~10k active series — way more than one shithubd droplet emits. Sign-up flow:
1. Go to https://grafana.com/auth/sign-up/create-user. As of 2026
   the default flow drops you into a 14-day "Unlimited usage trial"
   — that's fine; without a card on file it auto-reverts to the
   Forever Free tier (~10k active series) when the trial ends.
2. Pick a **Stack name** (e.g. `shithub`). The portal creates the
   stack at `<name>.grafana.net` and lands you in the in-app
   "Get started" guide.
3. Get the Prometheus details. The in-app guide has drifted; the
   reliable path is via the org-level Cloud portal:
   - In the right column of the Get started page, **Organization
     details** → click **Cloud portal**, OR navigate directly to
     `https://grafana.com/orgs/<your-org>`.
   - Click **Details** on the **Prometheus** tile.
   - The page shows the **Remote Write Endpoint** (full URL),
     **Username / Instance ID** (numeric), and a **Generate now**
     link for the **Password / API Token** (a `glc_…` Access Policy
     token with `metrics:write`). Copy the token immediately — it's
     shown once (a quick way to sanity-check it is sketched after
     this list).
4. Add to `inventory/production`. Use the exact Remote Write Endpoint
   shown on the page — the host (prod-NN, region) differs per tenant:
   ```ini
   grafana_cloud_prom_url=https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push
   grafana_cloud_prom_user=1234567
   grafana_cloud_prom_token=glc_eyJrIjoi...   # the token from step 3
   ```
5. `ansible-playbook -i inventory/production deploy/ansible/site.yml -t monitoring`
   (or just rerun site.yml — the role is idempotent).
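
To sanity-check the instance ID and token before (or after) running the play,
a bare POST to the push endpoint is enough. This is a sketch, not an official
health check — Mimir rejects the empty body either way, but a 401 specifically
means the credentials were not accepted:

```sh
# Hypothetical values — substitute the endpoint, user and token from steps 3–4.
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -u '1234567:glc_eyJrIjoi...' \
  https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/push
# 401 → bad credentials; 400 (or similar) → auth accepted, empty body rejected.
```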
### If you don't have ansible installed (manual fallback)
The monitoring-client role's actions are simple enough to apply
by hand over SSH. Use this if you're trying to bolt monitoring
onto an existing droplet and don't want to risk the full play.
```sh
ssh root@shithub.sh '
# node_exporter
apt-get install -y prometheus-node-exporter
systemctl enable --now prometheus-node-exporter

# Grafana apt repo + alloy
mkdir -p /etc/apt/keyrings
curl -fsS https://apt.grafana.com/gpg.key -o /etc/apt/keyrings/grafana.gpg.key
chmod 0644 /etc/apt/keyrings/grafana.gpg.key
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg.key] https://apt.grafana.com stable main" \
  > /etc/apt/sources.list.d/grafana.list
apt-get update && apt-get install -y alloy

# Credentials env (mode 0640, root:alloy)
mkdir -p /etc/alloy && chown root:alloy /etc/alloy && chmod 0750 /etc/alloy
install -o root -g alloy -m 0640 /dev/stdin /etc/alloy/credentials.env <<EOF
GRAFANA_CLOUD_PROM_URL=<your URL>
GRAFANA_CLOUD_PROM_USER=<your numeric tenant id>
GRAFANA_CLOUD_PROM_TOKEN=<your glc_… token>
EOF

# systemd drop-in to source the credentials
mkdir -p /etc/systemd/system/alloy.service.d
cat > /etc/systemd/system/alloy.service.d/shithub.conf <<EOF
[Service]
EnvironmentFile=/etc/alloy/credentials.env
EOF
systemctl daemon-reload
'
```
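
Alloy is installed but deliberately not started yet — its config file doesn't
exist. Before moving on, it's worth confirming the drop-in registered; this is
plain systemd, nothing shithub-specific:

```sh
ssh root@shithub.sh 'systemctl cat alloy | grep -A2 shithub.conf'
# should print the drop-in path plus the [Service]/EnvironmentFile lines
```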
Then copy the rendered `alloy-config.river.j2` (substitute the
`{{ ansible_hostname }}` template var with the real hostname) to
`/etc/alloy/config.alloy` (mode 0644 root:alloy) and:
```sh
ssh root@shithub.sh 'systemctl enable --now alloy'
```
The token is the only secret; do NOT pipe it through your shell
history. The `install … /dev/stdin … <<EOF` form above keeps it
out of `ps`/argv, but you'll still want to clear shell history
after.
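
How to clear it depends on your shell; for interactive bash on the machine
where you typed the heredoc, something like:

```sh
history -c && history -w   # drop the in-memory history, then overwrite ~/.bash_history
```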
Within a minute, metrics start landing. From the Cloud portal:
- **Explore** tab → datasource = your Prometheus → query
  `up{job=~"node|shithubd"}` — both should show 1.
- Sample queries to confirm shithubd telemetry:
  - `sum(rate(shithub_http_requests_total[5m]))` — overall QPS
  - `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` — p95 latency by route
  - `shithub_db_pool_acquired / shithub_db_pool_total` — pool utilization
  - `rate(shithub_panics_total[5m]) > 0` — recovered panics
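
The same checks work from the CLI: the hosted Prometheus answers standard
query-API requests under the `/api/prom` prefix of the same host as the push
URL. A sketch — it assumes a second Access Policy token with `metrics:read`
scope (the write-only token from setup won't work), exported here as the
hypothetical `GRAFANA_CLOUD_READ_TOKEN`:

```sh
curl -fsS -u "1234567:$GRAFANA_CLOUD_READ_TOKEN" \
  --data-urlencode 'query=up{job=~"node|shithubd"}' \
  https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/api/v1/query
# expect a two-element result vector, both values "1"
```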
## Building a starter dashboard
Cloud → Dashboards → New → import. Dashboard JSON is committed under:
```text
deploy/monitoring/grafana/dashboards/shithubd-overview.json
deploy/monitoring/grafana/dashboards/actions.json
```
Use the panels below as fallback queries if the dashboard import UI drifts or you need to rebuild by hand.
| Panel | Query |
|---|---|
| QPS | `sum(rate(shithub_http_requests_total[5m]))` |
| p95 latency by route | `histogram_quantile(0.95, sum by (le, route) (rate(shithub_http_request_duration_seconds_bucket[5m])))` |
| Error rate | `sum(rate(shithub_http_requests_total{status=~"5.."}[5m])) / sum(rate(shithub_http_requests_total[5m]))` |
| In-flight requests | `shithub_http_in_flight` |
| DB pool utilization | `shithub_db_pool_acquired / shithub_db_pool_total` |
| Pool wait derivative | `rate(shithub_db_pool_acquire_wait_seconds_total[5m])` |
| Panics | `rate(shithub_panics_total[5m])` |
| Host CPU | `100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)` |
| Host memory used | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` |
| Disk free | `node_filesystem_avail_bytes{mountpoint="/"}` |
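
If rebuilding by hand is too tedious, the committed JSON can also be pushed
through Grafana's dashboard HTTP API instead of the import UI. A sketch,
assuming a stack service-account token with dashboard write permission
(hypothetical `GRAFANA_STACK_TOKEN`), the `<name>.grafana.net` stack from
step 2 (shithub.grafana.net in the example), and that the committed files
hold bare dashboard JSON:

```sh
for f in deploy/monitoring/grafana/dashboards/*.json; do
  # POST /api/dashboards/db creates the dashboard, or updates it with overwrite:true
  jq '{dashboard: del(.id), overwrite: true}' "$f" |
    curl -fsS -H "Authorization: Bearer $GRAFANA_STACK_TOKEN" \
         -H 'Content-Type: application/json' \
         -d @- https://shithub.grafana.net/api/dashboards/db
done
```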
## Alerting
Three places to set alert rules. Pick one:
- **Grafana Cloud → Alerts → Alert rules.** UI-first; fires to
  email/Slack/PagerDuty via Cloud's contact points.
- **Alertmanager via remote_write.** Out of scope for the free
  tier without their managed contact points.
- **Skip Grafana alerts entirely**; use the DO alerts (`alerts.md`)
  for operational pages and read shithubd metrics manually.
Suggested first three alerts (Cloud UI):
| Name | Query | Threshold | Reason |
|---|---|---|---|
| shithubd p95 latency high | p95 latency query above | `> 1s` for 5m | UI feels slow before users notice |
| DB pool saturation | utilization query | `> 0.8` for 10m | Pool too small or query regression |
| Recovered panic in last 5m | panics rate | `> 0` for 1m | Worth a bug ticket |
## When metrics stop landing
```sh
ssh root@shithub.sh '
systemctl status alloy --no-pager
journalctl -u alloy -n 30 --no-pager | tail -20
curl -fsS http://127.0.0.1:8080/metrics | head -3
curl -fsS http://127.0.0.1:9100/metrics | head -3
'
```
Common failures:
- **401 Unauthorized in alloy logs** — token expired or copy-pasted
  wrong. Mint a fresh one in the Cloud portal, update inventory,
  re-run the role.
- **node_exporter not listening on 9100** — `apt reinstall
  prometheus-node-exporter`, check the journal.
- **shithubd not exposing /metrics** — `SHITHUB_METRICS__ENABLED=true`
  in web.env (the default), restart shithubd-web.
- **Free-tier limit hit** — Cloud portal shows "Active series" at
  cap. Drop high-cardinality labels via `metric_relabel_configs` in
  alloy-config.river.j2 (a query to find the worst offenders is
  sketched after this list).
- **`up{job="shithubd"} = 0` while `up{job="node"} = 1`** — alloy can
  reach shithubd (`scrape_samples_scraped > 0`) but the parser fails
  with `expected a valid start token, got "\x1f"`. Root cause: alloy
  advertised `Accept-Encoding: gzip`, shithubd correctly returned
  gzipped bytes with `Content-Encoding: gzip`, but alloy's Prometheus
  scraper parsed the raw bytes pre-gunzip. Fix is already in the
  template (`enable_compression = false` on the shithubd scrape job).
  If you see this on a hand-rolled config, add that attribute and
  `systemctl restart alloy`. Verify via
  `curl http://127.0.0.1:12345/api/v0/web/components/prometheus.scrape.shithubd`
  → look for `health: up`.
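
For the free-tier-limit case, the first question is which metric names carry
the most series. The `count by (__name__)` trick is plain PromQL, nothing
shithub-specific; this sketch reuses the read-token assumption from the
verification step above:

```sh
curl -fsS -u "1234567:$GRAFANA_CLOUD_READ_TOKEN" \
  --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))' \
  https://prometheus-prod-XX-prod-REGION.grafana.net/api/prom/api/v1/query
# the top entries are the first candidates for metric_relabel_configs drops
```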
## Why Grafana Cloud over self-hosted Prometheus
- Self-hosted needs Prom + Grafana on the droplet → ~500 MB RAM, storage for time-series, an open inbound port (or VPN), TLS for the Grafana UI, an OAuth proxy for auth, and we maintain the upgrade path.
- Free Grafana Cloud is enough for our scale, gives us managed backup of metrics + dashboards + alerts, and the data leaves the droplet (so a droplet outage doesn't kill its own observability).
- Trade-off: cardinality cap. Once we approach 10k active series, evaluate either paid Cloud (~$8/mo for 10× the cap) or self-host.