# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the
anchors below. Keep procedures here short and actionable: what to
check first, how to mitigate, what to record. Postmortems live
elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns
502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if `active (running)` but the
   alert fires, the metrics scrape is broken (check wg0; sketch below).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look
   for migration failures (ExecStartPre), env file errors, port
   conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and
   open an incident.
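
A minimal triage sketch for step 2, run on the affected host. The local
metrics port is a placeholder (use whatever port the `shithubd-web` unit
actually serves `/metrics` on); the goal is to separate "unit is down" from
"unit is up but Prometheus can't reach it over wg0".

```bash
# Is the unit actually running?
systemctl is-active shithubd-web

# If active: does the local metrics endpoint answer? (9100 is a placeholder port)
curl -sf -o /dev/null http://127.0.0.1:9100/metrics \
  && echo "local scrape ok" \
  || echo "local scrape failed"

# If the local scrape works, suspect the WireGuard path Prometheus scrapes over.
ip -br addr show wg0
sudo wg show wg0 latest-handshakes   # a stale handshake points at the tunnel
```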

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. Symptoms
visible to users: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart
   never loses or double-processes jobs (the pattern is sketched below).
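
Step 3's safety claim rests on the `FOR UPDATE SKIP LOCKED` claim pattern.
The query below is an illustration of that pattern, not the worker's actual
SQL, and the `id` column name is assumed: a claim that never commits (because
the worker died or was restarted) simply releases its lock, so the row becomes
claimable again rather than being lost or run twice.

```bash
# Illustration of the claim pattern only; not the worker's real query.
psql -d shithub -c "
  BEGIN;
  -- Claim one queued job; rows locked by another worker are skipped, not waited on.
  SELECT id, kind FROM jobs
  WHERE state = 'queued'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED;
  -- A real worker processes the job and COMMITs a state change here.
  -- If the process dies before COMMIT, the lock vanishes and the job stays queued.
  ROLLBACK;"
```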

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every
write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata.
   Check disk free first (`df -h /data`). If full, the most likely
   cause is the WAL archiver failing — see `archive-failing` below
   and the disk check sketched after this list.
4. If the data directory is fine, page the on-call DBA. Do not
   restore from backup unless directed; a working PITR restore
   takes ~30 min and we shouldn't gamble.
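
A quick disk check for step 3, before paging anyone. The `pg_wal` path below
is an assumption; confirm the cluster's real `data_directory` first.

```bash
# Is /data full, and is WAL what's eating it?
df -h /data
sudo du -sh /data/postgres/pg_wal    # placeholder path; use the real data_directory

# What is startup/replay actually complaining about?
journalctl -u postgresql -n 200 --no-pager | grep -iE 'redo|wal|panic|fatal'
```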

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command
   failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm:
   `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`
   (see the sketch below).
4. Mitigation: fix the underlying issue (rotate creds, restore
   network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in
   time: this is an emergency. Page the operator. **Do not** run
   `pg_archivecleanup` manually unless directed — you can lose
   data.
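
`pg_stat_archiver` tells you which WAL segment is stuck and since when, which
is worth capturing alongside the journal output from step 1. A sketch
combining that with the connectivity check from step 3:

```bash
# Which segment is failing, and how long has this been going on?
sudo -u postgres psql -c "
  SELECT failed_count, last_failed_wal, last_failed_time,
         archived_count, last_archived_time
  FROM pg_stat_archiver;"

# Can the archiver's credentials reach the bucket at all?
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:

# After the fix, failed_count should stop climbing on its own; Postgres retries.
```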

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue:
   `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by
   running a second worker host (the `FOR UPDATE SKIP LOCKED`
   pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want
   the audit trail). See the sketch below.
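
For step 4, a minimal sketch of marking a poison job `failed`. The id and the
exact column set are placeholders; take the real id from the worker logs and
check the table before running anything.

```bash
# Hypothetical id; confirm it from the worker logs first.
psql -d shithub -c "
  UPDATE jobs
  SET state = 'failed'
  WHERE id = 12345
    AND state = 'queued'
  RETURNING id, kind, state;"
```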

## actions-runner-heartbeat-stale

**Symptom:** `shithub_actions_runner_heartbeat_age_seconds{status!="offline"} > 60`
for 5m. Actions jobs can remain queued even while the runner appears
registered.

1. Identify the runner from the alert label (a query listing every runner's
   heartbeat age is sketched below).
2. On the runner host: `systemctl status shithubd-runner` and
   `journalctl -u shithubd-runner -n 200 --no-pager`.
3. On the app host: `shithubd admin actions runner list` and confirm the
   runner labels still match queued jobs.
4. If the runner is wedged, restart `shithubd-runner`. If it cannot
   authenticate, rotate the runner token and redeploy the service env.
5. Record whether the stale heartbeat happened during a deploy, network
   partition, token rotation, or runner engine failure.
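
To see every runner's heartbeat age at once rather than just the one that
paged, you can query Prometheus directly (step 1). The Prometheus address
below is a placeholder and `jq` is assumed to be installed.

```bash
curl -sG 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=shithub_actions_runner_heartbeat_age_seconds{status!="offline"}' \
  | jq '.data.result[] | {runner: .metric, age_seconds: .value[1]}'
```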

## actions-queue-depth-high

**Symptom:** `shithub_actions_queue_depth{resource="jobs"} > 100` for 10m.

1. Check runner availability: `shithubd admin actions runner list`.
2. Compare queued labels with runner labels. A workflow using an unsupported
   `runs-on` value will sit queued until a compatible runner exists.
3. Inspect web and worker logs for trigger storms, claim errors, and DB pool
   saturation.
4. If legitimate load exceeds capacity, add runners or raise capacity on idle
   runner hosts. If one repository dominates, cancel or throttle that workload.
5. After mitigation, watch `shithub_actions_queue_depth` drain and confirm
   `shithub_actions_active` does not flatline (a watch loop is sketched below).
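
A crude way to watch the drain in step 5 from any host that can reach
Prometheus. The address is a placeholder and `jq` is assumed.

```bash
while sleep 30; do
  for q in 'shithub_actions_queue_depth{resource="jobs"}' 'shithub_actions_active'; do
    printf '%s %s = ' "$(date +%T)" "$q"
    curl -sG 'http://prometheus.internal:9090/api/v1/query' \
      --data-urlencode "query=$q" | jq -r '[.data.result[].value[1]] | join(",")'
  done
done
```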

## actions-run-duration-p99-regressed

**Symptom:** Actions p99 duration over 30m is >50% above the same window
24h ago.

1. Split by event in the Actions dashboard. A single event type usually points
   to one workflow shape rather than runner infrastructure.
2. Compare `shithub_actions_active` and runner capacity. High duration with
   low active jobs suggests slow jobs; high duration with saturated active jobs
   suggests insufficient runner capacity.
3. Check runner host CPU, memory, disk, and Docker/engine logs.
4. If the regression started with a deploy, review runner API, log streaming,
   checkout, and container execution changes first.
5. Capture representative slow run IDs and their step durations before
   canceling or pruning anything.
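
The symptom's ratio can be reproduced ad hoc while triaging. The histogram
name below is hypothetical; substitute whatever run-duration metric the
dashboard actually uses, and treat a result above 1.5 as matching the alert
condition. The Prometheus address is a placeholder.

```bash
# Hypothetical metric name; swap in the real run-duration histogram.
QUERY='
histogram_quantile(0.99, sum by (le) (rate(shithub_actions_run_duration_seconds_bucket[30m])))
/
histogram_quantile(0.99, sum by (le) (rate(shithub_actions_run_duration_seconds_bucket[30m] offset 24h)))
'
curl -sG 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode "query=$QUERY" | jq -r '.data.result[].value[1]'
```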

## actions-log-scrubber-possibly-missing

**Symptom:** server-side Actions log bytes are flowing for 30m, but
`shithub_actions_log_scrub_replacements_total{location="server"}` remains zero.

This is a warning, not proof of leaked secrets. Some periods legitimately have
no secret-bearing logs.

1. Confirm secrets or variables with sensitive values exist for workloads that
   ran during the window.
2. Trigger a controlled workflow that echoes a known test secret value and
   verify the rendered logs contain `***`, not plaintext (a check is sketched
   below).
3. Check runner claims happened after the secret was created or rotated; mask
   snapshots are captured at claim time.
4. If the controlled workflow is not masked, stop affected runners, rotate the
   exposed secret, and open a security incident.
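
For step 2, once the controlled run finishes, save the rendered log text and
check it mechanically. The filename and the test value below are placeholders.

```bash
TEST_VALUE='hunter2-test-only'   # the known test secret the workflow echoed
if grep -qF "$TEST_VALUE" run-log.txt; then
  echo "PLAINTEXT LEAK: scrubber is not masking; follow step 4"
elif grep -qF '***' run-log.txt; then
  echo "masking observed"
else
  echo "no mask markers and no plaintext; check the workflow really echoed the value"
fi
```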