# Alerts

DigitalOcean's built-in monitoring is the cheap-and-good alerting
tier. Two layers, both managed via `doctl`:

- **Resource alerts** scoped to the `shithub-app` droplet, fired when
  CPU / memory / disk / load1 cross a threshold for a sustained
  window.
- **Uptime checks** that probe the public URL from two regions and
  fire on outages, SSL-cert decay, or latency regressions.
Everything is reproducible via
`deploy/cutover/provision-do-alerts.sh` — re-running is a no-op.
The script is the source of truth for what alerts SHOULD exist;
the DO account is the source of truth for what currently fires.
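
To compare the two, listing what the account currently has is a quick
first step. A minimal sketch, assuming `doctl` is authenticated against
the right team (column names are from current doctl and may drift):

```sh
# Resource alert policies currently on the account.
doctl monitoring alert list --format UUID,Type,Description,Value,Window

# Uptime checks (alerts hang off each check's ID).
doctl monitoring uptime list --format ID,Name,Target,Regions
```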
## What's wired

### Droplet resource alerts

Tagged on droplet `shithub-app` (id resolves at runtime from
the droplet name in `provision-do-alerts.sh`).
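
The name-to-id resolution presumably looks something like this sketch
(the variable name is an assumption; the script's actual code may
differ):

```sh
# Resolve the droplet id from its name, so a rebuilt droplet
# keeps its alert policies without editing the script.
DROPLET_ID=$(doctl compute droplet list --format ID,Name --no-header |
  awk '$2 == "shithub-app" { print $1 }')
```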
| Description | Type | Threshold | Window |
|---|---|---:|---:|
| `shithub-app CPU > 80% for 10m` | `cpu` | 80% | 10m |
| `shithub-app memory > 90% for 10m` | `memory_utilization_percent` | 90% | 10m |
| `shithub-app disk > 80% for 10m` | `disk_utilization_percent` | 80% | 10m |
| `shithub-app load1 > 4 for 10m` | `load_1` | 4.0 | 10m |
Window is 10m so a single transient spike (build, AIDE re-baseline,
WAL burst) doesn't page. All four email
`wolffemf@dukes.jmu.edu` (the account-owner email; DO refuses
unverified destinations).
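
For reference, creating one of these policies by hand would look
roughly like the sketch below (not the provision script's exact
invocation; `$DROPLET_ID` as resolved above):

```sh
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare GreaterThan \
  --value 80 \
  --window 10m \
  --entities "$DROPLET_ID" \
  --emails wolffemf@dukes.jmu.edu \
  --description "shithub-app CPU > 80% for 10m"
```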
### Uptime check `shithub.sh`

`https://shithub.sh` probed every minute from `us_east` and
`eu_west`. Three alerts hang off it:
| Name | Type | Threshold | Period |
|---|---|---:|---:|
| `shithub.sh down (all regions)` | `down_global` | 1 region down | 5m |
| `shithub.sh SSL cert expiring < 14 days` | `ssl_expiry` | 14 days | 5m |
| `shithub.sh latency p95 > 2s` | `latency` | 2000 ms | 5m |
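Hand-provisioning the check and one of its alerts would look roughly
like this sketch (`$CHECK_ID` is the check's UUID, e.g. from
`doctl monitoring uptime list`):

```sh
# One-time: create the check itself.
doctl monitoring uptime create shithub.sh \
  --target https://shithub.sh \
  --type https \
  --regions us_east,eu_west

# Then hang an alert off the check's UUID.
doctl monitoring uptime alert create "$CHECK_ID" \
  --name "shithub.sh down (all regions)" \
  --type down_global \
  --period 5m \
  --emails wolffemf@dukes.jmu.edu
```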
`down_global` deliberately requires both regions to agree before
paging — single-region noise from DO's own infrastructure is more
common than real outages and was the leading cause of false
positives on a previous setup.
`ssl_expiry < 14 days` gives a buffer to investigate Caddy's
auto-renew if it doesn't fire on its own. Caddy's renewal window
is normally 30 days before expiry, so by the time this alert
fires we're already off the happy path.
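
To check the live cert's actual expiry independently of DO's probe
(plain openssl, nothing site-specific assumed):

```sh
# Print the notAfter date of the certificate served on :443.
echo | openssl s_client -connect shithub.sh:443 -servername shithub.sh 2>/dev/null |
  openssl x509 -noout -enddate
```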
## When an alert fires

1. **Read the email.** It names the alert and the value that
   crossed.
2. **Check the droplet first**:
   ```sh
   ssh root@shithub.sh 'top -bn1 | head -20; df -h /; free -h; uptime'
   ```
3. **Common patterns** (a bundled triage sketch follows this list):
   - **CPU sustained > 80%** — usually `aideinit` or `aide --check`
     mid-run (12 min daily, 7 min for re-baseline). Cross-check
     against `journalctl -t shithub-aide -n 50`. If unrelated:
     `top -c -o %CPU` and identify the process.
   - **Memory > 90%** — Postgres + shithubd combined exceed budget.
     `systemctl status shithubd-web` for a dump of resident set;
     `sudo -u postgres psql -c "SELECT * FROM pg_stat_activity"` for
     long-running queries.
   - **Disk > 80%** — first check `du -sh /var/log/* /data/* /root/*
     /var/lib/postgresql/* | sort -h`. Most likely Postgres logs
     accumulating; check `log_rotation_age`.
   - **Load1 > 4** for a 4-vCPU droplet means we're saturated. If
     it's also a CPU alert, same diagnosis. If load is high but CPU
     isn't, look at I/O wait (`top` → `1` to expand cores → `wa`).
   - **down_global** — `curl -v https://shithub.sh/` from your
     laptop. Caddy log: `ssh root@shithub.sh 'tail -100
     /var/log/caddy/access.log | jq .'`. shithubd journal:
     `journalctl -u shithubd-web -n 100`.
   - **ssl_expiry** — `ssh root@shithub.sh 'systemctl status caddy;
     journalctl -u caddy -n 200 | grep -iE "renew|cert"'`. Most
     likely cause: rate-limited by Let's Encrypt or DNS
     misconfiguration after a CNAME change.
   - **latency p95 > 2s** — could be slow-query regressions, a
     wedged worker, or Caddy reverse-proxy timeouts. Start with
     `journalctl -u shithubd-web -n 200 | grep duration | sort -k1
     -nr | head`.
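The first-look commands bundle into a single SSH round-trip; a
hypothetical `triage.sh` sketch (host and unit names come from this
runbook, everything else is an assumption):

```sh
#!/bin/sh
# triage.sh (hypothetical): one-shot health snapshot of the droplet.
# Run from your laptop; assumes root SSH access as used elsewhere here.
ssh root@shithub.sh '
  echo "== load/cpu ==";  top -bn1 | head -20
  echo "== disk ==";      df -h /
  echo "== memory ==";    free -h
  echo "== uptime ==";    uptime
  echo "== shithubd ==";  journalctl -u shithubd-web -n 50 --no-pager
  echo "== caddy ==";     journalctl -u caddy -n 50 --no-pager
'
```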
## Editing thresholds

Change in `deploy/cutover/provision-do-alerts.sh`, then either:

1. Delete the existing alert via `doctl monitoring alert delete
   <uuid>` and re-run the provision script (it'll create the new
   one), OR
2. Edit in place: `doctl monitoring alert update <uuid> --value
   <new>`.

For uptime alerts: `doctl monitoring uptime alert update
<check-id> <alert-id> ...`. (CLI bug: uptime check `update`
strips `--enabled`; if you mutate uptime properties, delete +
recreate via the script.)
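
A worked pass at option 1, assuming alert descriptions stay unique
enough to grep on:

```sh
# Look up the policy UUID by description, delete it, re-provision.
UUID=$(doctl monitoring alert list --format UUID,Description --no-header |
  grep 'CPU > 80%' | awk '{ print $1 }')
doctl monitoring alert delete "$UUID"
deploy/cutover/provision-do-alerts.sh   # idempotent; recreates with the new threshold
```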
## What's NOT here

- **shithubd-internal metrics** (request latency p95, DB pool
  saturation, job queue depth, throttle rejections, archive
  failures from Postgres). Those need a Prometheus scrape of
  `127.0.0.1:8080/metrics` plus a Grafana Cloud free-tier — see
  task #263 / `runbooks/observability.md` (TBD).
- **Slack delivery.** Today everything emails. The DO API supports
  Slack webhooks; add `--slack-channels` + `--slack-urls` to the
  provision script when there's a destination (sketch below).
- **PagerDuty.** Same — DO supports webhook destinations.