# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the anchors below. Keep procedures here short and actionable: what to check first, how to mitigate, what to record. Postmortems live elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns 502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the alert fires, the metrics scrape is broken (check wg0; see the sketch after this list).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look for migration failures (ExecStartPre), env file errors, port conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and open an incident.
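If step 2 finds the unit `active (running)`, a quick way to test the scrape-path theory is to probe the metrics endpoint over wg0 from the Prometheus host. This is a hedged sketch: the wg0 address and metrics port below are placeholders, not this host's real values.

```bash
# Run from the Prometheus host. 10.8.0.2:8080 stands in for the web unit's
# wg0 address and metrics port; substitute the real ones.
ip -br addr show wg0                             # is the tunnel interface up?
curl -fsS http://10.8.0.2:8080/metrics | head    # is the target reachable over wg0?
```

If the same curl works on the target host itself but not from the Prometheus host, the problem is the tunnel, not the service.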

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. Symptoms visible to users: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart never loses or double-processes jobs (a sketch of why follows this list).
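The safety claim in step 3 rests on the worker claiming a job row inside a transaction. The sketch below illustrates the `FOR UPDATE SKIP LOCKED` pattern rather than the worker's actual query; any column beyond `kind` and `state` (here, `id`) is an assumption.

```bash
# Illustrative only: claim one queued job under a row lock. If the worker dies
# mid-transaction, the lock is released, the row stays 'queued', and the
# restarted worker (or another one) picks it up; nothing is lost or doubled.
psql -d shithub <<'SQL'
BEGIN;
SELECT id, kind FROM jobs
WHERE state = 'queued'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- ...process the job, then mark it done before committing...
COMMIT;
SQL
```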

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata. Check disk free first (`df -h /data`). If full, the most likely cause is the WAL archiver failing — see `archive-failing` below (a quick disk check is sketched after this list).
4. If the data directory is fine, page the on-call DBA. Do not restore from backup unless directed; a working PITR restore takes ~30 min and we shouldn't gamble.
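A quick way to tell whether retained WAL is what filled the volume (step 3). The pgdata path is an assumption; use this host's real `data_directory`.

```bash
df -h /data                                    # is the volume actually full?
du -sh /data/postgresql/*/pg_wal 2>/dev/null   # how much of it is retained WAL?
                                               # (pgdata path is a guess; adjust)
```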

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore network). Postgres will retry automatically; the sketch after this list shows how to watch it recover.
5. If disk is critically full and Spaces will not be reachable in time: this is an emergency. Page the operator. **Do not** `pg_archivecleanup` manually unless directed — you can lose data.
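To watch the retry recover after the fix (step 4), poll the standard `pg_stat_archiver` view: `failed_count` should stop climbing and `last_archived_wal` should advance.

```bash
sudo -u postgres psql -c \
  "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal, last_failed_time
   FROM pg_stat_archiver;"
```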

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue: `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by running a second worker host (the `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want the audit trail); a hedged example follows this list.
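A minimal sketch for step 4, assuming the `jobs` table has an `id` column and that `failed` is a valid `state` value (both assumptions beyond what's written above); the id is a placeholder.

```bash
# Mark one specific poison job failed; never a blanket UPDATE, and never DELETE.
psql -d shithub -c \
  "UPDATE jobs SET state = 'failed' WHERE id = 12345 AND state = 'queued';"
```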