tenseleyflow/shithub / 05892ba

docs: alerts runbook — what's wired, triage steps, edit procedure

Authored by espadonne
SHA: 05892baa8cc7eb0ab938e66e68e5dfb94280a7ba
Parents: c824190
Tree: 1037cd5

1 changed file

A docs/internal/runbooks/alerts.md (+118 −0)
# Alerts

DigitalOcean's built-in monitoring is the cheap-and-good alerting
tier. Two layers, both managed via `doctl`:

- **Resource alerts** scoped to the `shithub-app` droplet, fired when
  CPU / memory / disk / load1 cross a threshold for a sustained
  window.
- **Uptime checks** that probe the public URL from two regions and
  fire on outages, SSL-cert decay, or latency regressions.

Everything is reproducible via
`deploy/cutover/provision-do-alerts.sh` — re-running is a no-op.
The script is the source of truth for what alerts SHOULD exist;
the DO account is the source of truth for what currently fires.
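
The no-op property comes from the script, not the API: `doctl` will
happily create a second policy with the same settings. A minimal
sketch of one way to get that behavior, assuming the script keys on
the alert description (the actual script isn't reproduced here):

```sh
# Hypothetical excerpt: create the CPU alert only if a policy with
# this exact description doesn't exist yet. DROPLET_ID is resolved
# from the droplet name (see "Droplet resource alerts" below).
desc='shithub-app CPU > 80% for 10m'
if ! doctl monitoring alert list --format Description --no-header |
    grep -qxF "$desc"; then
  doctl monitoring alert create \
    --type "v1/insights/droplet/cpu" \
    --description "$desc" \
    --compare GreaterThan --value 80 --window 10m \
    --entities "$DROPLET_ID" \
    --emails wolffemf@dukes.jmu.edu
fi
```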

## What's wired

### Droplet resource alerts

Attached to droplet `shithub-app` (the id is resolved at runtime
from the droplet name in `provision-do-alerts.sh`).
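
That resolution plausibly looks like this (a sketch; only the droplet
name is given in this doc, and alert policies address droplets by
numeric id, so the name has to be mapped first):

```sh
# Map the droplet name to the numeric id that --entities expects.
DROPLET_ID=$(doctl compute droplet list --format ID,Name --no-header |
  awk '$2 == "shithub-app" {print $1}')
```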

| Description | Type | Threshold | Window |
|---|---|---:|---:|
| `shithub-app CPU > 80% for 10m` | `cpu` | 80% | 10m |
| `shithub-app memory > 90% for 10m` | `memory_utilization_percent` | 90% | 10m |
| `shithub-app disk > 80% for 10m` | `disk_utilization_percent` | 80% | 10m |
| `shithub-app load1 > 4 for 10m` | `load_1` | 4.0 | 10m |

The window is 10m so a single transient spike (build, AIDE
re-baseline, WAL burst) doesn't page. All four alerts email
`wolffemf@dukes.jmu.edu` (the account-owner email; DO refuses
unverified destinations).
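
To compare the table against what's actually live in the account (the
format column names are assumed from current doctl output):

```sh
doctl monitoring alert list --format UUID,Type,Description,Value,Window
```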

### Uptime check `shithub.sh`

`https://shithub.sh` is probed every minute from `us_east` and
`eu_west`. Three alerts hang off it:

| Name | Type | Threshold | Period |
|---|---|---:|---:|
| `shithub.sh down (all regions)` | `down_global` | all regions down | 5m |
| `shithub.sh SSL cert expiring < 14 days` | `ssl_expiry` | 14 days | 5m |
| `shithub.sh latency p95 > 2s` | `latency` | 2000 ms | 5m |

`down_global` deliberately requires both regions to agree before
paging — single-region noise from DO's own infrastructure is more
common than real outages and was the leading cause of false
positives on a previous setup.

`ssl_expiry < 14 days` leaves a buffer to investigate Caddy's
auto-renew if renewal doesn't happen on its own. Caddy normally
renews about 30 days before expiry, so by the time this alert
fires we're already off the happy path.
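
For reference, a sketch of provisioning the check and two of its
alerts with doctl (the real commands live in the provision script;
treat this as illustrative, not a copy of it):

```sh
# Create the check, then hang alerts off its id. ssl_expiry is
# analogous: --type ssl_expiry --threshold 14.
CHECK_ID=$(doctl monitoring uptime create shithub.sh \
  --target https://shithub.sh --type https \
  --regions us_east,eu_west --format ID --no-header)

doctl monitoring uptime alert create "$CHECK_ID" \
  --name 'shithub.sh down (all regions)' \
  --type down_global --period 5m --emails wolffemf@dukes.jmu.edu

doctl monitoring uptime alert create "$CHECK_ID" \
  --name 'shithub.sh latency p95 > 2s' \
  --type latency --comparison greater_than --threshold 2000 \
  --period 5m --emails wolffemf@dukes.jmu.edu
```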

## When an alert fires

1. **Read the email.** It names the alert and the value that
   crossed the threshold.
2. **Check the droplet first**:
   ```sh
   ssh root@shithub.sh 'top -bn1 | head -20; df -h /; free -h; uptime'
   ```
3. **Common patterns**:
   - **CPU sustained > 80%** — usually `aideinit` or `aide --check`
     mid-run (12 min for the daily check, 7 min for a re-baseline).
     Cross-check against `journalctl -t shithub-aide -n 50`. If
     unrelated: `top -c -o %CPU` and identify the process.
   - **Memory > 90%** — Postgres + shithubd combined exceed budget.
     `systemctl status shithubd-web` shows the resident set;
     `sudo -u postgres psql -c "SELECT * FROM pg_stat_activity"`
     lists running queries (a filtered variant appears after this
     list).
   - **Disk > 80%** — first check `du -sh /var/log/* /data/* /root/*
     /var/lib/postgresql/* | sort -h`. Most likely Postgres logs
     accumulating; check `log_rotation_age`.
   - **Load1 > 4** for a 4-vCPU droplet means we're saturated. If
     it's also a CPU alert, same diagnosis. If load is high but CPU
     isn't, look at I/O wait (`top` → `1` to expand cores → `wa`).
   - **down_global** — `curl -v https://shithub.sh/` from your
     laptop. Caddy log: `ssh root@shithub.sh 'tail -100
     /var/log/caddy/access.log | jq .'`. shithubd journal:
     `journalctl -u shithubd-web -n 100`.
   - **ssl_expiry** — `ssh root@shithub.sh 'systemctl status caddy;
     journalctl -u caddy -n 200 | grep -iE "renew|cert"'`. Most
     likely cause: rate-limiting by Let's Encrypt or DNS
     misconfiguration after a CNAME change.
   - **latency p95 > 2s** — could be slow-query regressions, a
     wedged worker, or Caddy reverse-proxy timeouts. Start with
     `journalctl -u shithubd-web -n 200 | grep duration | sort -k1
     -nr | head`.
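
The bare `pg_stat_activity` dump in the memory bullet is noisy. A
filtered variant that shows only the long-runners (standard Postgres
catalog columns, so this should work as-is):

```sh
# Queries running for more than a minute, longest first.
sudo -u postgres psql -c "
  SELECT pid, now() - query_start AS runtime, state, left(query, 80)
  FROM pg_stat_activity
  WHERE state <> 'idle'
    AND now() - query_start > interval '1 minute'
  ORDER BY runtime DESC;"
```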

## Editing thresholds

Change the threshold in `deploy/cutover/provision-do-alerts.sh`,
then either:

1. Delete the existing alert via `doctl monitoring alert delete
   <uuid>` and re-run the provision script (it'll create the new
   one), OR
2. Edit in place: `doctl monitoring alert update <uuid> --value
   <new>`.

For uptime alerts: `doctl monitoring uptime alert update
<check-id> <alert-id> ...`. (CLI bug: uptime check `update`
strips `--enabled`; if you mutate uptime properties, delete +
recreate via the script.)
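
To find the ids those commands want (format column names assumed from
current doctl output):

```sh
# Resource alert policies: the UUID column.
doctl monitoring alert list --format UUID,Description

# Uptime checks, then the alerts hanging off one check.
# <check-id> is a placeholder, as above.
doctl monitoring uptime list --format ID,Name
doctl monitoring uptime alert list <check-id> --format ID,Name
```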

## What's NOT here

- **shithubd-internal metrics** (request latency p95, DB pool
  saturation, job queue depth, throttle rejections, archive
  failures from Postgres). Those need a Prometheus scrape of
  `127.0.0.1:8080/metrics` plus a Grafana Cloud free tier — see
  task #263 / `runbooks/observability.md` (TBD).
- **Slack delivery.** Today everything emails. The DO API supports
  Slack webhooks; add `--slack-channels` + `--slack-urls` to the
  provision script when there's a destination. A sketch of the
  shape follows this list.
- **PagerDuty.** Same — DO supports webhook destinations.
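
Shape of the Slack wiring, once a webhook exists (the flags are
doctl's; the channel name, `$UUID`, and `$SLACK_WEBHOOK_URL` are
placeholders):

```sh
# Hypothetical: re-point the CPU policy at Slack as well as email.
doctl monitoring alert update "$UUID" \
  --type "v1/insights/droplet/cpu" \
  --description 'shithub-app CPU > 80% for 10m' \
  --compare GreaterThan --value 80 --window 10m \
  --entities "$DROPLET_ID" \
  --emails wolffemf@dukes.jmu.edu \
  --slack-channels '#shithub-alerts' \
  --slack-urls "$SLACK_WEBHOOK_URL"
```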