
# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the anchors below. Keep procedures here short and actionable: what to check first, how to mitigate, what to record. Postmortems live elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns 502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the alert fires, the metrics scrape is broken (check wg0; a quick check is sketched after this list).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look for migration failures (ExecStartPre), env file errors, port conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and open an incident.
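
If step 2 shows the service running but the alert still fires, the scrape path over wg0 is the suspect. A minimal sketch of that check, assuming Prometheus scrapes the host over the WireGuard tunnel on the usual `/metrics` path (take the real address and port from the scrape config, not from here):

```sh
# On the app host: is the tunnel interface up, and does it show a recent handshake?
ip addr show wg0
sudo wg show wg0

# From the Prometheus host: can the exporter be reached over the tunnel?
# <wg0-addr> and <metrics-port> are placeholders; copy them from the scrape config.
curl -sv "http://<wg0-addr>:<metrics-port>/metrics" | head
```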

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. User-visible impact: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart never loses or double-processes jobs.
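
The restart-safety claim in step 3 follows from how jobs are claimed. A rough sketch of the pattern, reusing the `jobs` table and `state` column that the job-backlog procedure below already queries (the real worker selects and updates more fields; the `id` column and ordering here are assumptions):

```sh
psql -d shithub <<'SQL'
BEGIN;
-- SKIP LOCKED: concurrent workers never block on or double-claim a row.
-- If a worker dies mid-job, its transaction rolls back, the row lock is
-- released, and the job is still state='queued' for the next claim.
SELECT id, kind
  FROM jobs
 WHERE state = 'queued'
 ORDER BY id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
-- ...do the work, then mark the job done before COMMIT.
COMMIT;
SQL
```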

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata. Check disk free first (`df -h /data`; a fuller triage is sketched after this list). If the disk is full, the most likely cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not restore from backup unless directed; a working PITR restore takes ~30 min and we shouldn't gamble.
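
A quick disk/WAL triage for step 3, assuming the data directory lives somewhere under `/data` (adjust the `pg_wal` path to the actual PGDATA on this host):

```sh
df -h /data                               # is the volume actually full?
sudo du -sh /data/*/pg_wal 2>/dev/null    # how much of that is unarchived WAL?
journalctl -u postgresql -n 100 --no-pager | grep -iE 'wal|redo|archive'   # what replay/archiving is complaining about
```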

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in time: this is an emergency. Page the operator. Do **not** run `pg_archivecleanup` manually unless directed — you can lose data.
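
To see exactly where the archiver is stuck (step 1 shows the error text; this shows the counters and the WAL file it keeps retrying):

```sh
sudo -u postgres psql -xc \
  "SELECT archived_count, last_archived_wal, last_archived_time,
          failed_count, last_failed_wal, last_failed_time
     FROM pg_stat_archiver;"
```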

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue: `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by running a second worker host (the `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want the audit trail).
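
A sketch of step 4, using only the columns this runbook already references (the `id` filter is a hypothetical placeholder, and the real jobs table may have attempt/error fields worth setting at the same time):

```sh
# Mark a known-bad job failed instead of deleting it; the row stays for the audit trail.
psql -d shithub -c "UPDATE jobs SET state = 'failed' WHERE id = <job-id> AND state = 'queued';"
```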

## actions-runner-heartbeat-stale

**Symptom:** `shithub_actions_runner_heartbeat_age_seconds{status!="offline"} > 60` for 5m. Actions jobs can remain queued even while the runner appears registered.

1. Identify the runner from the alert label.
2. On the runner host: `systemctl status shithubd-runner` and `journalctl -u shithubd-runner -n 200 --no-pager`.
3. On the app host: `shithubd admin actions runner list` and confirm the runner labels still match queued jobs.
4. If the runner is wedged, restart `shithubd-runner`. If it cannot authenticate, rotate the runner token and redeploy the service env.
5. Record whether the stale heartbeat happened during a deploy, network partition, token rotation, or runner engine failure.
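
If the alert label alone doesn't identify the runner, the symptom expression can be run ad hoc against Prometheus to list every stale runner (the Prometheus address is an assumption; use whatever the monitoring deploy actually exposes):

```sh
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_runner_heartbeat_age_seconds{status!="offline"} > 60'
```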

## actions-queue-depth-high

**Symptom:** `shithub_actions_queue_depth{resource="jobs"} > 100` for 10m.

1. Check runner availability: `shithubd admin actions runner list`.
2. Compare queued labels with runner labels. A workflow using an unsupported `runs-on` value will sit queued until a compatible runner exists.
3. Inspect web and worker logs for trigger storms, claim errors, and DB pool saturation.
4. If legitimate load exceeds capacity, add runners or raise capacity on idle runner hosts. If one repository dominates, cancel or throttle that workload.
5. After mitigation, watch `shithub_actions_queue_depth` drain and confirm `shithub_actions_active` does not flatline.
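
A hedged sketch of step 5's check (same Prometheus-address assumption as above):

```sh
# Queue depth should trend down after mitigation...
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_queue_depth{resource="jobs"}'
# ...while jobs are actually being claimed and executed.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_active'
```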

## actions-run-duration-p99-regressed

**Symptom:** Actions p99 duration over 30m is >50% above the same window 24h ago.

1. Split by event in the Actions dashboard. A single event type usually points to one workflow shape rather than runner infrastructure.
2. Compare `shithub_actions_active` and runner capacity. High duration with low active jobs suggests slow jobs; high duration with saturated active jobs suggests insufficient runner capacity.
3. Check runner host CPU, memory, disk, and Docker/engine logs.
4. If the regression started with a deploy, review runner API, log streaming, checkout, and container execution changes first.
5. Capture representative slow run IDs and their step durations before canceling or pruning anything.
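
For reference, the alert's comparison has roughly this shape, assuming run durations are recorded as a Prometheus histogram (the bucket metric name below is an assumption; `rules.yml` has the real expression):

```sh
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=
  histogram_quantile(0.99, sum by (le) (rate(shithub_actions_run_duration_seconds_bucket[30m])))
    > 1.5 *
  histogram_quantile(0.99, sum by (le) (rate(shithub_actions_run_duration_seconds_bucket[30m] offset 24h)))'
```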

## actions-log-scrubber-possibly-missing

**Symptom:** Server-side Actions log bytes are flowing for 30m, but `shithub_actions_log_scrub_replacements_total{location="server"}` remains zero.

This is a warning, not proof of leaked secrets. Some periods legitimately have no secret-bearing logs.

1. Confirm secrets or variables with sensitive values exist for workloads that ran during the window.
2. Trigger a controlled workflow that echoes a known test secret value and verify the rendered logs contain `***`, not plaintext (a quick grep check is sketched after this list).
3. Check that runner claims happened after the secret was created or rotated; mask snapshots are captured at claim time.
4. If the controlled workflow is not masked, stop affected runners, rotate the exposed secret, and open a security incident.
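
If you run step 2's controlled workflow, the blunt check for "masked or not" is to search the stored log output for the raw test value (the log location below is a placeholder; grep wherever server-side Actions logs actually land, or just read the run in the web UI):

```sh
TEST_SECRET_VALUE='the-known-test-value'     # the plaintext you put in the test secret
if grep -Rq -e "$TEST_SECRET_VALUE" /path/to/actions/log/storage; then
  echo "NOT MASKED: treat as exposure and go to step 4"
else
  echo "test value not found in stored logs; masking looks intact"
fi
```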