
# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the anchors below. Keep procedures here short and actionable: what to check first, how to mitigate, what to record. Postmortems live elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns 502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the alert fires, the metrics scrape is broken (check wg0; a quick check is sketched after this list).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look for migration failures (ExecStartPre), env file errors, port conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and open an incident.
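
If step 2 shows the service running but the alert still fires, the scrape path over wg0 is the suspect. A minimal sketch of that check, assuming Prometheus scrapes the host over the WireGuard tunnel on the usual `/metrics` path (take the real address and port from the scrape config, not from here):

```sh
# On the app host: is the tunnel interface up, and does it show a recent handshake?
ip addr show wg0
sudo wg show wg0

# From the Prometheus host: can the exporter be reached over the tunnel?
# <wg0-addr> and <metrics-port> are placeholders; copy them from the scrape config.
curl -sv "http://<wg0-addr>:<metrics-port>/metrics" | head
```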

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. User-visible impact: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart never loses or double-processes jobs.
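
The restart-safety claim in step 3 follows from how jobs are claimed. A rough sketch of the pattern, reusing the `jobs` table and `state` column that the job-backlog procedure below already queries (the real worker selects and updates more fields; the `id` column and ordering here are assumptions):

```sh
psql -d shithub <<'SQL'
BEGIN;
-- SKIP LOCKED: concurrent workers never block on or double-claim a row.
-- If a worker dies mid-job, its transaction rolls back, the row lock is
-- released, and the job is still state='queued' for the next claim.
SELECT id, kind
  FROM jobs
 WHERE state = 'queued'
 ORDER BY id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
-- ...do the work, then mark the job done before COMMIT.
COMMIT;
SQL
```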

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata. Check disk free first (`df -h /data`; a fuller triage is sketched after this list). If the disk is full, the most likely cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not restore from backup unless directed; a working PITR restore takes ~30 min and we shouldn't gamble.
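
A quick disk/WAL triage for step 3, assuming the data directory lives somewhere under `/data` (adjust the `pg_wal` path to the actual PGDATA on this host):

```sh
df -h /data                               # is the volume actually full?
sudo du -sh /data/*/pg_wal 2>/dev/null    # how much of that is unarchived WAL?
journalctl -u postgresql -n 100 --no-pager | grep -iE 'wal|redo|archive'   # what replay/archiving is complaining about
```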

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in time: this is an emergency. Page the operator. Do **not** run `pg_archivecleanup` manually unless directed — you can lose data.
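
To see exactly where the archiver is stuck (step 1 shows the error text; this shows the counters and the WAL file it keeps retrying):

```sh
sudo -u postgres psql -xc \
  "SELECT archived_count, last_archived_wal, last_archived_time,
          failed_count, last_failed_wal, last_failed_time
     FROM pg_stat_archiver;"
```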

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue: `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by running a second worker host (the `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want the audit trail).
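
A sketch of step 4, using only the columns this runbook already references (the `id` filter is a hypothetical placeholder, and the real jobs table may have attempt/error fields worth setting at the same time):

```sh
# Mark a known-bad job failed instead of deleting it; the row stays for the audit trail.
psql -d shithub -c "UPDATE jobs SET state = 'failed' WHERE id = <job-id> AND state = 'queued';"
```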

## actions-runner-heartbeat-stale

**Symptom:** `shithub_actions_runner_heartbeat_age_seconds{status!="offline"} > 60` for 5m. Actions jobs can remain queued even while the runner appears registered.

1. Identify the runner from the alert label.
2. On the runner host: `systemctl status shithubd-runner` and `journalctl -u shithubd-runner -n 200 --no-pager`.
3. On the app host: `shithubd admin actions runner list` and confirm the runner labels still match queued jobs.
4. If the runner is wedged, restart `shithubd-runner`. If it cannot authenticate, rotate the runner token and redeploy the service env.
5. Record whether the stale heartbeat happened during a deploy, network partition, token rotation, or runner engine failure.
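
If the alert label alone doesn't identify the runner, the symptom expression can be run ad hoc against Prometheus to list every stale runner (the Prometheus address is an assumption; use whatever the monitoring deploy actually exposes):

```sh
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_runner_heartbeat_age_seconds{status!="offline"} > 60'
```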

## actions-queue-depth-high

**Symptom:** `shithub_actions_queue_depth{resource="jobs"} > 100` for 10m.

1. Check runner availability: `shithubd admin actions runner list`.
2. Compare queued labels with runner labels. A workflow using an unsupported `runs-on` value will sit queued until a compatible runner exists.
3. Inspect web and worker logs for trigger storms, claim errors, and DB pool saturation.
4. If legitimate load exceeds capacity, add runners or raise capacity on idle runner hosts. If one repository dominates, cancel or throttle that workload.
5. After mitigation, watch `shithub_actions_queue_depth` drain and confirm `shithub_actions_active` does not flatline.
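
A hedged sketch of step 5's check (same Prometheus-address assumption as above):

```sh
# Queue depth should trend down after mitigation...
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_queue_depth{resource="jobs"}'
# ...while jobs are actually being claimed and executed.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_active'
```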

## actions-run-duration-p99-regressed

**Symptom:** Actions p99 duration over 30m is >50% above the same window 24h ago.

1. Split by event in the Actions dashboard. A single event type usually points to one workflow shape rather than runner infrastructure.
2. Compare `shithub_actions_active` and runner capacity. High duration with low active jobs suggests slow jobs; high duration with saturated active jobs suggests insufficient runner capacity.
3. Check runner host CPU, memory, disk, and Docker/engine logs.
4. If the regression started with a deploy, review runner API, log streaming, checkout, and container execution changes first.
5. Capture representative slow run IDs and their step durations before canceling or pruning anything.
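
For reference, the alert's comparison has roughly this shape, assuming run durations are recorded as a Prometheus histogram (the bucket metric name below is an assumption; `rules.yml` has the real expression):

```sh
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=
  histogram_quantile(0.99, sum by (le) (rate(shithub_actions_run_duration_seconds_bucket[30m])))
    > 1.5 *
  histogram_quantile(0.99, sum by (le) (rate(shithub_actions_run_duration_seconds_bucket[30m] offset 24h)))'
```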

## actions-log-scrubber-possibly-missing

**Symptom:** Server-side Actions log bytes are flowing for 30m, but `shithub_actions_log_scrub_replacements_total{location="server"}` remains zero.

This is a warning, not proof of leaked secrets. Some periods legitimately have no secret-bearing logs.

1. Confirm secrets or variables with sensitive values exist for workloads that ran during the window.
2. Trigger a controlled workflow that echoes a known test secret value and verify the rendered logs contain `***`, not plaintext (a quick grep check is sketched after this list).
3. Check that runner claims happened after the secret was created or rotated; mask snapshots are captured at claim time.
4. If the controlled workflow is not masked, stop affected runners, rotate the exposed secret, and open a security incident.
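
If you run step 2's controlled workflow, the blunt check for "masked or not" is to search the stored log output for the raw test value (the log location below is a placeholder; grep wherever server-side Actions logs actually land, or just read the run in the web UI):

```sh
TEST_SECRET_VALUE='the-known-test-value'     # the plaintext you put in the test secret
if grep -Rq -e "$TEST_SECRET_VALUE" /path/to/actions/log/storage; then
  echo "NOT MASKED: treat as exposure and go to step 4"
else
  echo "test value not found in stored logs; masking looks intact"
fi
```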