
# Alerts

DigitalOcean's built-in monitoring is the cheap-and-good alerting tier. Two layers, both managed via `doctl`:

- **Resource alerts** scoped to the `shithub-app` droplet, fired when CPU / memory / disk / load1 cross a threshold for a sustained window.
- **Uptime checks** that probe the public URL from two regions and fire on outages, SSL-cert decay, or latency regressions.

Everything is reproducible via `deploy/cutover/provision-do-alerts.sh` — re-running is a no-op. The script is the source of truth for what alerts SHOULD exist; the DO account is the source of truth for what currently fires.
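
A plausible shape for that no-op behaviour, assuming the script keys alerts off their `Description` field (the real script is authoritative):

```sh
# Hedged sketch: skip creation when an alert with the same description
# already exists in the account, so re-running changes nothing.
desc='shithub-app CPU > 80% for 10m'
if doctl monitoring alert list --format Description --no-header | grep -qxF "$desc"; then
  echo "exists, skipping: $desc"
else
  echo "would create: $desc"   # the real script runs 'doctl monitoring alert create' here
fi
```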

## What's wired

### Droplet resource alerts

Tagged on droplet `shithub-app` (id resolves at runtime from the droplet name in `provision-do-alerts.sh`).
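
The lookup is presumably a one-liner along these lines (`doctl compute droplet list` accepts a glob argument; the exact form in the script may differ):

```sh
# Resolve the droplet id from its name at runtime.
droplet_id=$(doctl compute droplet list shithub-app --format ID --no-header)
```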

| Description | Type | Threshold | Window |
|---|---|---:|---:|
| `shithub-app CPU > 80% for 10m` | `cpu` | 80% | 10m |
| `shithub-app memory > 90% for 10m` | `memory_utilization_percent` | 90% | 10m |
| `shithub-app disk > 80% for 10m` | `disk_utilization_percent` | 80% | 10m |
| `shithub-app load1 > 4 for 10m` | `load_1` | 4.0 | 10m |

Window is 10m so a single transient spike (build, AIDE re-baseline, WAL burst) doesn't page. All four email `wolffemf@dukes.jmu.edu` (the account-owner email; DO refuses unverified destinations).
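
For orientation, the CPU row probably expands to a `doctl` call like the one below. The table's Type column is assumed to be shorthand for DO's full `v1/insights/droplet/*` metric identifiers (verify against the script), and `$droplet_id` is the lookup shown earlier:

```sh
# First row of the table as an explicit create. The other three differ only
# in --type, --value, and --description.
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare GreaterThan \
  --value 80 \
  --window 10m \
  --entities "$droplet_id" \
  --emails wolffemf@dukes.jmu.edu \
  --description "shithub-app CPU > 80% for 10m"
```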

### Uptime check `shithub.sh`

`https://shithub.sh` probed every minute from `us_east` and `eu_west`. Three alerts hang off it:

| Name | Type | Threshold | Period |
|---|---|---:|---:|
| `shithub.sh down (all regions)` | `down_global` | 1 region down | 5m |
| `shithub.sh SSL cert expiring < 14 days` | `ssl_expiry` | 14 days | 5m |
| `shithub.sh latency p95 > 2s` | `latency` | 2000 ms | 5m |

`down_global` deliberately requires both regions to agree before paging — single-region noise from DO's own infrastructure is more common than real outages and was the leading cause of false positives on a previous setup.

`ssl_expiry < 14 days` gives a buffer to investigate Caddy's auto-renew if it doesn't fire on its own. Caddy's renewal window is normally 30 days before expiry, so by the time this alert fires we're already off the happy path.
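
The uptime side plausibly reduces to one check plus alerts hung off its id; a sketch under the assumption that current `doctl monitoring uptime` flags are what the script uses:

```sh
# Create the https check probing from both regions, then attach the
# down_global alert. The other two alerts differ in --type and threshold.
doctl monitoring uptime create shithub.sh \
  --target https://shithub.sh \
  --type https \
  --regions us_east,eu_west
check_id=$(doctl monitoring uptime list --format ID,Name --no-header |
  awk '$2 == "shithub.sh" {print $1}')
doctl monitoring uptime alert create "$check_id" \
  --name 'shithub.sh down (all regions)' \
  --type down_global \
  --period 5m \
  --emails wolffemf@dukes.jmu.edu
```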

## When an alert fires

1. **Read the email.** It names the alert and the value that crossed.
2. **Check the droplet first** (a one-shot bundled version is sketched after this list):
   ```sh
   ssh root@shithub.sh 'top -bn1 | head -20; df -h /; free -h; uptime'
   ```
3. **Common patterns**:
   - **CPU sustained > 80%** — usually `aideinit` or `aide --check` mid-run (12 min daily, 7 min for re-baseline). Cross-check against `journalctl -t shithub-aide -n 50`. If unrelated: `top -c -o %CPU` and identify the process.
   - **Memory > 90%** — Postgres + shithubd combined exceed budget. `systemctl status shithubd-web` for a dump of resident set; `sudo -u postgres psql -c "SELECT * FROM pg_stat_activity"` for long-running queries.
   - **Disk > 80%** — first check `du -sh /var/log/* /data/* /root/* /var/lib/postgresql/* | sort -h`. Most likely Postgres logs accumulating; check `log_rotation_age`.
   - **Load1 > 4** for a 4-vCPU droplet means we're saturated. If it's also a CPU alert, same diagnosis. If load is high but CPU isn't, look at I/O wait (`top` → `1` to expand cores → `wa`).
   - **down_global** — `curl -v https://shithub.sh/` from your laptop. Caddy log: `ssh root@shithub.sh 'tail -100 /var/log/caddy/access.log | jq .'`. shithubd journal: `journalctl -u shithubd-web -n 100`.
   - **ssl_expiry** — `ssh root@shithub.sh 'systemctl status caddy; journalctl -u caddy -n 200 | grep -iE "renew|cert"'`. Most likely cause: rate-limited by Let's Encrypt or DNS misconfiguration after a CNAME change.
   - **latency p95 > 2s** — could be slow-query regressions, a wedged worker, or Caddy reverse-proxy timeouts. Start with `journalctl -u shithubd-web -n 200 | grep duration | sort -k1 -nr | head`.
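
The step-2 snapshot plus the two journals that explain most pages, bundled into one command (every command here appears in the patterns above):

```sh
# One-shot triage for a resource alert: host snapshot, then the AIDE and
# shithubd journals. Read-only; safe to run while the alert is still firing.
ssh root@shithub.sh '
  top -bn1 | head -20
  df -h /
  free -h
  uptime
  journalctl -t shithub-aide -n 20 --no-pager
  journalctl -u shithubd-web -n 50 --no-pager
'
```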

## Editing thresholds

Change the threshold in `deploy/cutover/provision-do-alerts.sh`, then either:

1. Delete the existing alert via `doctl monitoring alert delete <uuid>` and re-run the provision script (it'll create the new one), OR
2. Edit in place: `doctl monitoring alert update <uuid> --value <new>`.

For uptime alerts: `doctl monitoring uptime alert update <check-id> <alert-id> ...`. (CLI bug: uptime check `update` strips `--enabled`; if you mutate uptime properties, delete + recreate via the script.)
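
Option 1 spelled out for the CPU alert. The uuid-by-description lookup is an assumption about how you'd find it; `--force` skips the confirmation prompt:

```sh
# Delete by uuid, then let the provision script recreate the alert with the
# new threshold already edited into it. The uptime CLI-bug path is the same
# idea: delete the check, re-run the script.
uuid=$(doctl monitoring alert list --format UUID,Description --no-header |
  grep 'shithub-app CPU > 80%' | awk '{print $1}')
doctl monitoring alert delete "$uuid" --force
./deploy/cutover/provision-do-alerts.sh
```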

## What's NOT here

- **shithubd-internal metrics** (request latency p95, DB pool saturation, job queue depth, throttle rejections, archive failures from Postgres). Those need a Prometheus scrape of `127.0.0.1:8080/metrics` plus a Grafana Cloud free tier — see task #263 / `runbooks/observability.md` (TBD).
- **Slack delivery.** Today everything emails. The DO API supports Slack webhooks; add `--slack-channels` + `--slack-urls` to the provision script when there's a destination (sketch below).
- **PagerDuty.** Same — DO supports webhook destinations.
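
When a Slack destination exists, the change is plausibly just two extra flags on each create. The channel name and webhook URL below are hypothetical placeholders, not a configured destination:

```sh
# Same CPU alert as the example above, now delivering to email and Slack.
# '#shithub-alerts' and the hooks.slack.com URL are placeholders; DO needs a
# real incoming-webhook URL from the workspace.
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare GreaterThan --value 80 --window 10m \
  --entities "$droplet_id" \
  --emails wolffemf@dukes.jmu.edu \
  --slack-channels '#shithub-alerts' \
  --slack-urls 'https://hooks.slack.com/services/T000/B000/XXXX' \
  --description "shithub-app CPU > 80% for 10m"
```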