# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the anchors below. Keep procedures here short and actionable: what to check first, how to mitigate, what to record. Postmortems live elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns 502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the alert still fires, the metrics scrape is broken (check wg0).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look for migration failures (ExecStartPre), env file errors, port conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and open an incident.

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. Visible to users as stalled webhook deliveries and lagging async fan-out.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart never loses or double-processes jobs (a sketch of the claim pattern is at the end of this page).

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every write; reads served from cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata. Check disk free first (`df -h /data`). If the disk is full, the most likely cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not restore from backup unless directed; even a working PITR restore takes ~30 min, and we shouldn't gamble on one unnecessarily.

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling. (A query against the archiver stats view is sketched at the end of this page.)

1. Look at `journalctl -u postgresql -n 200` for `archive_command failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (update the creds in the rclone config, restore the network path). Postgres retries archiving automatically.
5. If the disk is critically full and Spaces will not be reachable in time, this is an emergency: page the operator. **Do not** run `pg_archivecleanup` manually unless directed — you can lose data.

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue: `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in the worker logs.
3. If the spread is even, the worker isn't keeping up — scale out by running a second worker host (the `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want the audit trail); see the sketch at the end of this page.
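## sketches

The procedures above reference these. The restart-safety and multi-worker claims in `worker-down` and `job-backlog` rest on the `FOR UPDATE SKIP LOCKED` claim pattern; here is a minimal sketch of it. Only `kind` and the `queued`/`failed` states appear elsewhere in this runbook, so the `id` column, the ordering, and the `done` state are assumptions about the `jobs` schema.

```sql
-- Claim one queued job and hold the row lock for the duration of the work.
-- If the worker restarts mid-job, the open transaction rolls back, the lock is
-- released, and the job becomes claimable again: nothing lost, nothing doubled.
BEGIN;

SELECT id, kind                 -- id is an assumed primary key
FROM jobs
WHERE state = 'queued'
ORDER BY id                     -- oldest first (assumption)
FOR UPDATE SKIP LOCKED          -- concurrent workers skip rows we hold
LIMIT 1;

-- ... run the job in application code while the transaction stays open ...

UPDATE jobs SET state = 'done'  -- 'done' is an assumed terminal state
WHERE id = :claimed_id;         -- psql variable standing in for the id selected above
COMMIT;
```

Holding the transaction open for the whole job is the simplest form of the pattern and is what makes restarts lossless; a worker that instead commits a `running` claim up front needs a separate reaper for jobs orphaned by a crash.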
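For `archive-failing`, the archiver's own stats confirm the problem without grepping logs. `pg_stat_archiver` is a standard PostgreSQL statistics view (9.4+), so this should run as-is on the db host:

```sql
-- One row: WAL segments archived vs. failed, and the most recent failure.
-- failed_count climbing while last_archived_time stalls confirms the alert.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```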
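For step 4 of `job-backlog`, a sketch of parking a poison job. The `id` column and the literal value are placeholders; only `kind` and `state` are known from the query in step 1.

```sql
-- Mark the poison job failed so workers stop picking it up, but keep the row
-- for the audit trail. Never DELETE.
UPDATE jobs
SET state = 'failed'
WHERE id = 12345                -- replace with the offending job id from the worker logs
  AND state = 'queued';         -- don't touch a job that has already moved on
```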