# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the anchors below. Keep procedures here short and actionable: what to check first, how to mitigate, what to record. Postmortems live elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns 502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the alert fires, the metrics scrape is broken (check wg0; see the sketch after this list).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look for migration failures (ExecStartPre), env file errors, port conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and open an incident.
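If step 2 finds the unit `active (running)`, a quick way to test the scrape-path theory is to probe the metrics endpoint over wg0 from the Prometheus host. This is a hedged sketch: the wg0 address and metrics port below are placeholders, not this host's real values.

```bash
# Run from the Prometheus host. 10.8.0.2:8080 stands in for the web unit's
# wg0 address and metrics port; substitute the real ones.
ip -br addr show wg0                             # is the tunnel interface up?
curl -fsS http://10.8.0.2:8080/metrics | head    # is the target reachable over wg0?
```

If the same curl works on the target host itself but not from the Prometheus host, the problem is the tunnel, not the service.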

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. Symptoms visible to users: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart never loses or double-processes jobs (a sketch of why follows this list).
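The safety claim in step 3 rests on the worker claiming a job row inside a transaction. The sketch below illustrates the `FOR UPDATE SKIP LOCKED` pattern rather than the worker's actual query; any column beyond `kind` and `state` (here, `id`) is an assumption.

```bash
# Illustrative only: claim one queued job under a row lock. If the worker dies
# mid-transaction, the lock is released, the row stays 'queued', and the
# restarted worker (or another one) picks it up; nothing is lost or doubled.
psql -d shithub <<'SQL'
BEGIN;
SELECT id, kind FROM jobs
WHERE state = 'queued'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- ...process the job, then mark it done before committing...
COMMIT;
SQL
```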

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata. Check disk free first (`df -h /data`). If full, the most likely cause is the WAL archiver failing — see `archive-failing` below (a quick disk check is sketched after this list).
4. If the data directory is fine, page the on-call DBA. Do not restore from backup unless directed; a working PITR restore takes ~30 min and we shouldn't gamble.
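A quick way to tell whether retained WAL is what filled the volume (step 3). The pgdata path is an assumption; use this host's real `data_directory`.

```bash
df -h /data                                    # is the volume actually full?
du -sh /data/postgresql/*/pg_wal 2>/dev/null   # how much of it is retained WAL?
                                               # (pgdata path is a guess; adjust)
```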

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore network). Postgres will retry automatically; the sketch after this list shows how to watch it recover.
5. If disk is critically full and Spaces will not be reachable in time: this is an emergency. Page the operator. **Do not** `pg_archivecleanup` manually unless directed — you can lose data.
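To watch the retry recover after the fix (step 4), poll the standard `pg_stat_archiver` view: `failed_count` should stop climbing and `last_archived_wal` should advance.

```bash
sudo -u postgres psql -c \
  "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal, last_failed_time
   FROM pg_stat_archiver;"
```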

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue: `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by running a second worker host (the `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want the audit trail); a hedged example follows this list.
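A minimal sketch for step 4, assuming the `jobs` table has an `id` column and that `failed` is a valid `state` value (both assumptions beyond what's written above); the id is a placeholder.

```bash
# Mark one specific poison job failed; never a blanket UPDATE, and never DELETE.
psql -d shithub -c \
  "UPDATE jobs SET state = 'failed' WHERE id = 12345 AND state = 'queued';"
```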