# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the
anchors below. Keep procedures here short and actionable: what to
check first, how to mitigate, what to record. Postmortems live
elsewhere.
## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns
502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the
   alert fires, the metrics scrape is broken (check wg0).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look
   for migration failures (ExecStartPre), env file errors, port
   conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and
   open an incident.
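Steps 4–5 amount to "restart at most twice, then stop and roll back". A minimal sketch of that decision, assuming a hypothetical `try_restart` helper (the restart command is passed in as arguments, so in production you would call it with `systemctl restart shithubd-web`):

```shell
#!/bin/sh
# Hypothetical sketch of steps 4-5: attempt the restart at most twice,
# then give up and hand off to the rollback procedure. Usage:
#   try_restart systemctl restart shithubd-web
try_restart() {
  for attempt in 1 2; do
    if "$@"; then
      echo "restart succeeded on attempt $attempt"
      return 0
    fi
  done
  echo "two restarts failed: roll back per rollback.md and open an incident"
  return 1
}
```

The point of the cap is in step 5: a third blind restart only delays the rollback.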
## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. Symptoms
visible to users: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart
   never loses or double-processes jobs.
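The safety claim in step 3 rests on each worker claiming a job row inside a transaction. A hedged sketch of a claim query of that shape (the real statement lives in the worker code; the `id` column is an assumption, `jobs` and `state='queued'` mirror the query used in `job-backlog`):

```shell
# Hypothetical: print a claim query of the shape step 3 relies on.
# A crashed worker's transaction rolls back, releasing its row lock, so
# the job becomes claimable again; live workers skip each other's rows.
claim_job_sql() {
  cat <<'SQL'
SELECT id FROM jobs
WHERE state = 'queued'
ORDER BY id
FOR UPDATE SKIP LOCKED
LIMIT 1;
SQL
}
```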
## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every
write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata.
   Check disk free first (`df -h /data`). If full, the most likely
   cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not
   restore from backup unless directed; a working PITR restore
   takes ~30 min and we shouldn't gamble.
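The step-3 triage can be sketched as a tiny classifier; a hypothetical helper, assuming `/data` holds the data directory as the `df` check above implies:

```shell
#!/bin/sh
# Hypothetical helper for step 3: branch on disk usage. Takes the
# percent-used figure and a threshold, so it can be driven by e.g.:
#   check_disk "$(df --output=pcent /data | tail -1 | tr -dc '0-9')" 90
check_disk() {
  if [ "$1" -ge "$2" ]; then
    echo "disk ${1}% full: suspect the WAL archiver (see archive-failing)"
  else
    echo "disk ${1}% full: not disk pressure; page the on-call DBA"
  fi
}
```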
## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command
   failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config
   /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore
   network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in
   time: this is an emergency. Page the operator. **Do not** run
   `pg_archivecleanup` manually unless directed — you can lose
   data.
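For step 1, a small stdin filter keeps the journal scan repeatable; a sketch with a hypothetical `last_archive_errors` name:

```shell
#!/bin/sh
# Hypothetical filter for step 1: keep only archive_command failures and
# show the most recent few. Reads log lines on stdin, so in production:
#   journalctl -u postgresql -n 200 --no-pager | last_archive_errors
last_archive_errors() {
  grep 'archive_command failed' | tail -n 3
}
```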
## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue:
   `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by
   running a second worker host (the `FOR UPDATE SKIP LOCKED`
   pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want
   the audit trail).
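Step 4 can be sketched as a helper that emits the `UPDATE` rather than running it, so the operator can eyeball the SQL first. The `id` column is an assumption; `jobs` and `state` come from the query in step 1:

```shell
#!/bin/sh
# Hypothetical helper for step 4: mark one poison job failed (never
# DELETE; the row is the audit trail). Pipe the output into psql:
#   purge_poison_job 1234 | psql -d shithub
purge_poison_job() {
  printf "UPDATE jobs SET state = 'failed' WHERE id = %s AND state = 'queued';\n" "$1"
}
```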