
# Alerts

DigitalOcean's built-in monitoring is the cheap-and-good alerting tier. Two layers, both managed via `doctl`:

- **Resource alerts** scoped to the `shithub-app` droplet, fired when CPU / memory / disk / load1 cross a threshold for a sustained window.
- **Uptime checks** that probe the public URL from two regions and fire on outages, SSL-cert decay, or latency regressions.

Everything is reproducible via `deploy/cutover/provision-do-alerts.sh` — re-running is a no-op. The script is the source of truth for what alerts SHOULD exist; the DO account is the source of truth for what currently fires.
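
A plausible shape for that no-op behaviour, assuming the script keys alerts off their `Description` field (the real script is authoritative):

```sh
# Hedged sketch: skip creation when an alert with the same description
# already exists in the account, so re-running changes nothing.
desc='shithub-app CPU > 80% for 10m'
if doctl monitoring alert list --format Description --no-header | grep -qxF "$desc"; then
  echo "exists, skipping: $desc"
else
  echo "would create: $desc"   # the real script runs 'doctl monitoring alert create' here
fi
```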

## What's wired

### Droplet resource alerts

Tagged on droplet `shithub-app` (id resolves at runtime from the droplet name in `provision-do-alerts.sh`).
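
The lookup is presumably a one-liner along these lines (`doctl compute droplet list` accepts a glob argument; the exact form in the script may differ):

```sh
# Resolve the droplet id from its name at runtime.
droplet_id=$(doctl compute droplet list shithub-app --format ID --no-header)
```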

| Description | Type | Threshold | Window |
|---|---|---:|---:|
| `shithub-app CPU > 80% for 10m` | `cpu` | 80% | 10m |
| `shithub-app memory > 90% for 10m` | `memory_utilization_percent` | 90% | 10m |
| `shithub-app disk > 80% for 10m` | `disk_utilization_percent` | 80% | 10m |
| `shithub-app load1 > 4 for 10m` | `load_1` | 4.0 | 10m |

Window is 10m so a single transient spike (build, AIDE re-baseline, WAL burst) doesn't page. All four email `wolffemf@dukes.jmu.edu` (the account-owner email; DO refuses unverified destinations).
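
For orientation, the CPU row probably expands to a `doctl` call like the one below. The table's Type column is assumed to be shorthand for DO's full `v1/insights/droplet/*` metric identifiers (verify against the script), and `$droplet_id` is the lookup shown earlier:

```sh
# First row of the table as an explicit create. The other three differ only
# in --type, --value, and --description.
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare GreaterThan \
  --value 80 \
  --window 10m \
  --entities "$droplet_id" \
  --emails wolffemf@dukes.jmu.edu \
  --description "shithub-app CPU > 80% for 10m"
```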

### Uptime check `shithub.sh`

`https://shithub.sh` probed every minute from `us_east` and `eu_west`. Three alerts hang off it:

| Name | Type | Threshold | Period |
|---|---|---:|---:|
| `shithub.sh down (all regions)` | `down_global` | 1 region down | 5m |
| `shithub.sh SSL cert expiring < 14 days` | `ssl_expiry` | 14 days | 5m |
| `shithub.sh latency p95 > 2s` | `latency` | 2000 ms | 5m |

`down_global` deliberately requires both regions to agree before paging — single-region noise from DO's own infrastructure is more common than real outages and was the leading cause of false positives on a previous setup.

`ssl_expiry < 14 days` gives a buffer to investigate Caddy's auto-renew if it doesn't fire on its own. Caddy's renewal window is normally 30 days before expiry, so by the time this alert fires we're already off the happy path.
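
The uptime side plausibly reduces to one check plus alerts hung off its id; a sketch under the assumption that current `doctl monitoring uptime` flags are what the script uses:

```sh
# Create the https check probing from both regions, then attach the
# down_global alert. The other two alerts differ in --type and threshold.
doctl monitoring uptime create shithub.sh \
  --target https://shithub.sh \
  --type https \
  --regions us_east,eu_west
check_id=$(doctl monitoring uptime list --format ID,Name --no-header |
  awk '$2 == "shithub.sh" {print $1}')
doctl monitoring uptime alert create "$check_id" \
  --name 'shithub.sh down (all regions)' \
  --type down_global \
  --period 5m \
  --emails wolffemf@dukes.jmu.edu
```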

## When an alert fires

1. **Read the email.** It names the alert and the value that crossed.
2. **Check the droplet first** (a one-shot bundled version is sketched after this list):
   ```sh
   ssh root@shithub.sh 'top -bn1 | head -20; df -h /; free -h; uptime'
   ```
3. **Common patterns**:
   - **CPU sustained > 80%** — usually `aideinit` or `aide --check` mid-run (12 min daily, 7 min for re-baseline). Cross-check against `journalctl -t shithub-aide -n 50`. If unrelated: `top -c -o %CPU` and identify the process.
   - **Memory > 90%** — Postgres + shithubd combined exceed budget. `systemctl status shithubd-web` for a dump of resident set; `sudo -u postgres psql -c "SELECT * FROM pg_stat_activity"` for long-running queries.
   - **Disk > 80%** — first check `du -sh /var/log/* /data/* /root/* /var/lib/postgresql/* | sort -h`. Most likely Postgres logs accumulating; check `log_rotation_age`.
   - **Load1 > 4** for a 4-vCPU droplet means we're saturated. If it's also a CPU alert, same diagnosis. If load is high but CPU isn't, look at I/O wait (`top` → `1` to expand cores → `wa`).
   - **down_global** — `curl -v https://shithub.sh/` from your laptop. Caddy log: `ssh root@shithub.sh 'tail -100 /var/log/caddy/access.log | jq .'`. shithubd journal: `journalctl -u shithubd-web -n 100`.
   - **ssl_expiry** — `ssh root@shithub.sh 'systemctl status caddy; journalctl -u caddy -n 200 | grep -iE "renew|cert"'`. Most likely cause: rate-limited by Let's Encrypt or DNS misconfiguration after a CNAME change.
   - **latency p95 > 2s** — could be slow-query regressions, a wedged worker, or Caddy reverse-proxy timeouts. Start with `journalctl -u shithubd-web -n 200 | grep duration | sort -k1 -nr | head`.
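
The step-2 snapshot plus the two journals that explain most pages, bundled into one command (every command here appears in the patterns above):

```sh
# One-shot triage for a resource alert: host snapshot, then the AIDE and
# shithubd journals. Read-only; safe to run while the alert is still firing.
ssh root@shithub.sh '
  top -bn1 | head -20
  df -h /
  free -h
  uptime
  journalctl -t shithub-aide -n 20 --no-pager
  journalctl -u shithubd-web -n 50 --no-pager
'
```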

## Editing thresholds

Change the threshold in `deploy/cutover/provision-do-alerts.sh`, then either:

1. Delete the existing alert via `doctl monitoring alert delete <uuid>` and re-run the provision script (it'll create the new one), OR
2. Edit in place: `doctl monitoring alert update <uuid> --value <new>`.

For uptime alerts: `doctl monitoring uptime alert update <check-id> <alert-id> ...`. (CLI bug: uptime check `update` strips `--enabled`; if you mutate uptime properties, delete + recreate via the script.)
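
Option 1 spelled out for the CPU alert. The uuid-by-description lookup is an assumption about how you'd find it; `--force` skips the confirmation prompt:

```sh
# Delete by uuid, then let the provision script recreate the alert with the
# new threshold already edited into it. The uptime CLI-bug path is the same
# idea: delete the check, re-run the script.
uuid=$(doctl monitoring alert list --format UUID,Description --no-header |
  grep 'shithub-app CPU > 80%' | awk '{print $1}')
doctl monitoring alert delete "$uuid" --force
./deploy/cutover/provision-do-alerts.sh
```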

## What's NOT here

- **shithubd-internal metrics** (request latency p95, DB pool saturation, job queue depth, throttle rejections, archive failures from Postgres). Those need a Prometheus scrape of `127.0.0.1:8080/metrics` plus a Grafana Cloud free tier — see task #263 / `runbooks/observability.md` (TBD).
- **Slack delivery.** Today everything emails. The DO API supports Slack webhooks; add `--slack-channels` + `--slack-urls` to the provision script when there's a destination (sketch below).
- **PagerDuty.** Same — DO supports webhook destinations.
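
When a Slack destination exists, the change is plausibly just two extra flags on each create. The channel name and webhook URL below are hypothetical placeholders, not a configured destination:

```sh
# Same CPU alert as the example above, now delivering to email and Slack.
# '#shithub-alerts' and the hooks.slack.com URL are placeholders; DO needs a
# real incoming-webhook URL from the workspace.
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare GreaterThan --value 80 --window 10m \
  --entities "$droplet_id" \
  --emails wolffemf@dukes.jmu.edu \
  --slack-channels '#shithub-alerts' \
  --slack-urls 'https://hooks.slack.com/services/T000/B000/XXXX' \
  --description "shithub-app CPU > 80% for 10m"
```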