# Alerts

DigitalOcean's built-in monitoring is the cheap-and-good alerting
tier. Two layers, both managed via `doctl`:

- **Resource alerts** scoped to the `shithub-app` droplet, fired when
  CPU / memory / disk / load1 cross a threshold for a sustained
  window.
- **Uptime checks** that probe the public URL from two regions and
  fire on outages, SSL-cert decay, or latency regressions.

Everything is reproducible via
`deploy/cutover/provision-do-alerts.sh`; re-running is a no-op.
The script is the source of truth for what alerts SHOULD exist;
the DO account is the source of truth for what currently fires.
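
The idempotency is just list-before-create. A minimal sketch of the
pattern, assuming the description strings from the tables below
(`DROPLET_ID` resolution is sketched in the next section; the real
script's flags may differ):

```sh
# Create the CPU policy only if no policy with this description exists yet.
desc='shithub-app CPU > 80% for 10m'
doctl monitoring alert list --format Description --no-header |
  grep -qF "$desc" ||
  doctl monitoring alert create \
    --type "v1/insights/droplet/cpu" \
    --compare GreaterThan --value 80 --window 10m \
    --entities "$DROPLET_ID" \
    --emails wolffemf@dukes.jmu.edu \
    --description "$desc"
```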

## What's wired

### Droplet resource alerts

Tagged on droplet `shithub-app` (the id resolves at runtime from
the droplet name in `provision-do-alerts.sh`; see the sketch after
the table).

| Description | Type | Threshold | Window |
|---|---|---:|---:|
| `shithub-app CPU > 80% for 10m` | `cpu` | 80% | 10m |
| `shithub-app memory > 90% for 10m` | `memory_utilization_percent` | 90% | 10m |
| `shithub-app disk > 80% for 10m` | `disk_utilization_percent` | 80% | 10m |
| `shithub-app load1 > 4 for 10m` | `load_1` | 4.0 | 10m |
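
Resolving the droplet id mentioned above might look like this (a
sketch; the real lookup in `provision-do-alerts.sh` may differ):

```sh
# Map the droplet name to its numeric id; bail out if it doesn't exist.
DROPLET_ID=$(doctl compute droplet list shithub-app --format ID --no-header)
[ -n "$DROPLET_ID" ] || { echo 'droplet shithub-app not found' >&2; exit 1; }
```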

Window is 10m so a single transient spike (build, AIDE re-baseline,
WAL burst) doesn't page. All four alerts email
`wolffemf@dukes.jmu.edu` (the account-owner email; DO refuses
unverified destinations).

### Uptime check `shithub.sh`

`https://shithub.sh` is probed every minute from `us_east` and
`eu_west`. Three alerts hang off it:

| Name | Type | Threshold | Period |
|---|---|---:|---:|
| `shithub.sh down (all regions)` | `down_global` | both regions down | 5m |
| `shithub.sh SSL cert expiring < 14 days` | `ssl_expiry` | 14 days | 5m |
| `shithub.sh latency p95 > 2s` | `latency` | 2000 ms | 5m |
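
For reference, wiring the check and one of its alerts looks roughly
like this (a sketch mirroring the table, not copied from the script;
`--format ID` on create is an assumption):

```sh
# Create the uptime check, keep its id, then attach the down_global alert.
CHECK_ID=$(doctl monitoring uptime create shithub.sh \
  --target https://shithub.sh --type https \
  --regions us_east,eu_west --format ID --no-header)
doctl monitoring uptime alert create "$CHECK_ID" \
  --name 'shithub.sh down (all regions)' \
  --type down_global --period 5m \
  --emails wolffemf@dukes.jmu.edu
```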

`down_global` deliberately requires both regions to agree before
paging; single-region noise from DO's own infrastructure is more
common than real outages and was the leading cause of false
positives in a previous setup.

`ssl_expiry < 14 days` gives a buffer to investigate Caddy's
auto-renew if renewal hasn't happened on its own. Caddy normally
renews 30 days before expiry, so by the time this alert fires
we're already off the happy path.

## When an alert fires

1. **Read the email.** It names the alert and the value that
   crossed the threshold.
2. **Check the droplet first**:
   ```sh
   ssh root@shithub.sh 'top -bn1 | head -20; df -h /; free -h; uptime'
   ```
3. **Common patterns** (a bundled triage sketch follows this list):
   - **CPU sustained > 80%**: usually `aideinit` or `aide --check`
     mid-run (12 min for the daily check, 7 min for a re-baseline).
     Cross-check against `journalctl -t shithub-aide -n 50`. If
     unrelated: `top -c -o %CPU` and identify the process.
   - **Memory > 90%**: Postgres + shithubd combined exceed budget.
     `systemctl status shithubd-web` shows the resident set;
     `sudo -u postgres psql -c "SELECT * FROM pg_stat_activity"` for
     long-running queries.
   - **Disk > 80%**: first check `du -sh /var/log/* /data/* /root/*
     /var/lib/postgresql/* | sort -h`. Most likely Postgres logs
     accumulating; check `log_rotation_age`.
   - **Load1 > 4**: for a 4-vCPU droplet this means we're saturated.
     If CPU is also alerting, same diagnosis. If load is high but CPU
     isn't, look at I/O wait (`top` → `1` to expand cores → `wa`).
   - **down_global**: `curl -v https://shithub.sh/` from your
     laptop. Caddy log: `ssh root@shithub.sh 'tail -100
     /var/log/caddy/access.log | jq .'`. shithubd journal:
     `journalctl -u shithubd-web -n 100`.
   - **ssl_expiry**: `ssh root@shithub.sh 'systemctl status caddy;
     journalctl -u caddy -n 200 | grep -iE "renew|cert"'`. Most
     likely cause: rate-limiting by Let's Encrypt or DNS
     misconfiguration after a CNAME change.
   - **latency p95 > 2s**: could be slow-query regressions, a
     wedged worker, or Caddy reverse-proxy timeouts. Start with
     `journalctl -u shithubd-web -n 200 | grep duration` and sort
     by the duration field (adjust the `sort -k` column to the log
     format).
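
The bundled triage sketch (assumes only the unit names above):

```sh
#!/bin/sh
# One-shot snapshot: resource state plus the three journals named above.
HOST=root@shithub.sh
ssh "$HOST" '
  uptime; free -h; df -h /
  echo "--- aide ---";     journalctl -t shithub-aide -n 20 --no-pager
  echo "--- shithubd ---"; journalctl -u shithubd-web -n 50 --no-pager
  echo "--- caddy ---";    journalctl -u caddy -n 50 --no-pager
'
```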

## Editing thresholds

Change the value in `deploy/cutover/provision-do-alerts.sh`, then
either:

1. Delete the existing alert via `doctl monitoring alert delete
   <uuid>` and re-run the provision script (it'll create the new
   one), OR
2. Edit in place: `doctl monitoring alert update <uuid> --value
   <new>`.
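
Finding the `<uuid>` either way (a sketch; greps on the description
strings from the table above):

```sh
# Look up the policy UUID by description, then bump the threshold in place.
uuid=$(doctl monitoring alert list --format UUID,Description --no-header |
  grep -F 'shithub-app CPU' | awk '{print $1}')
doctl monitoring alert update "$uuid" --value 85
```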

For uptime alerts: `doctl monitoring uptime alert update
<check-id> <alert-id> ...`. (CLI bug: uptime check `update`
strips `--enabled`; if you mutate uptime properties, delete +
recreate via the script.)
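
Finding the two ids (a sketch; the column layout of the `doctl`
output is the assumption):

```sh
# The check id comes from the uptime list; alert ids hang off the check.
CHECK_ID=$(doctl monitoring uptime list --format ID,Name --no-header |
  awk '$2 == "shithub.sh" {print $1}')
doctl monitoring uptime alert list "$CHECK_ID" --format ID,Name,Type
```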

## What's NOT here

- **shithubd-internal metrics** (request latency p95, DB pool
  saturation, job queue depth, throttle rejections, archive
  failures from Postgres). Those need a Prometheus scrape of
  `127.0.0.1:8080/metrics` plus a Grafana Cloud free tier; see
  task #263 / `runbooks/observability.md` (TBD).
- **Slack delivery.** Today everything emails. The DO API supports
  Slack webhooks; add `--slack-channels` + `--slack-urls` to the
  provision script when there's a destination (sketch below).
- **PagerDuty.** Same story; DO supports webhook destinations.
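
When a Slack destination exists, the alert creation in the provision
script would grow roughly like this (the channel and `$SLACK_WEBHOOK_URL`
are placeholders, not anything configured today):

```sh
# Slack delivery rides alongside email on the same alert.
doctl monitoring uptime alert create "$CHECK_ID" \
  --name 'shithub.sh down (all regions)' \
  --type down_global --period 5m \
  --emails wolffemf@dukes.jmu.edu \
  --slack-channels '#alerts' \
  --slack-urls "$SLACK_WEBHOOK_URL"
```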