# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the anchors below. Keep procedures here short and actionable: what to check first, how to mitigate, what to record. Postmortems live elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns 502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if it is `active (running)` but the alert still fires, the metrics scrape is broken (check wg0).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look for migration failures (ExecStartPre), env file errors, port conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and open an incident.

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. User-visible symptoms: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart the service. The job loop uses `FOR UPDATE SKIP LOCKED`, so a restart never loses or double-processes jobs.

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata. Check disk free first (`df -h /data`). If the disk is full, the most likely cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not restore from backup unless directed; a working PITR restore takes ~30 min and we shouldn't gamble.

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore network). Postgres will retry automatically.
5. If the disk is critically full and Spaces will not be reachable in time, this is an emergency: page the operator. **Do not** run `pg_archivecleanup` manually unless directed — you can lose data.

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue: `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by running a second worker host (the `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely; see the sketch after this list).
4. To purge a poison job: mark it `failed` (don't delete — we want the audit trail).
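The safety claims in `worker-down` and `job-backlog` both rest on the claim query using `FOR UPDATE SKIP LOCKED`: rows another worker has locked are skipped rather than waited on, and a lock held by a crashed worker is released when its transaction aborts. The sketch below is illustrative only, not the production worker code; the `id` and `created_at` columns are assumptions, while `kind` and `state='queued'` come from the backlog query above.

```sh
# Illustrative sketch of the claim pattern (assumed schema: jobs(id, kind,
# state, created_at); only kind and state='queued' are confirmed by the
# backlog query in this runbook).
psql -d shithub <<'SQL'
BEGIN;
-- Claim one queued job. SKIP LOCKED skips rows another worker has already
-- locked instead of blocking, so concurrent workers never fight over a row.
SELECT id, kind
FROM jobs
WHERE state = 'queued'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- A real worker would mark the row running/done and COMMIT. A crash before
-- COMMIT rolls back and releases the lock, so the job is picked up again --
-- which is why a restart does not lose jobs.
ROLLBACK;  -- inspection only: release the lock without changing anything
SQL
```

For step 4, marking a poison job `failed` is an `UPDATE jobs SET state='failed' WHERE id = <job id>` (same assumed schema); the row stays behind for the audit trail.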
## actions-runner-heartbeat-stale

**Symptom:** `shithub_actions_runner_heartbeat_age_seconds{status!="offline"} > 60` for 5m. Actions jobs can remain queued even while the runner appears registered.

1. Identify the runner from the alert label.
2. On the runner host: `systemctl status shithubd-runner` and `journalctl -u shithubd-runner -n 200 --no-pager`.
3. On the app host: `shithubd admin actions runner list` and confirm the runner labels still match queued jobs.
4. If the runner is wedged, restart `shithubd-runner`. If it cannot authenticate, rotate the runner token and redeploy the service env.
5. Record whether the stale heartbeat happened during a deploy, network partition, token rotation, or runner engine failure.

## actions-queue-depth-high

**Symptom:** `shithub_actions_queue_depth{resource="jobs"} > 100` for 10m.

1. Check runner availability: `shithubd admin actions runner list`.
2. Compare queued labels with runner labels. A workflow using an unsupported `runs-on` value will sit queued until a compatible runner exists.
3. Inspect web and worker logs for trigger storms, claim errors, and DB pool saturation.
4. If legitimate load exceeds capacity, add runners or raise capacity on idle runner hosts. If one repository dominates, cancel or throttle that workload.
5. After mitigation, watch `shithub_actions_queue_depth` drain and confirm `shithub_actions_active` does not flatline.

## actions-run-duration-p99-regressed

**Symptom:** Actions p99 duration over 30m is >50% above the same window 24h ago.

1. Split by event in the Actions dashboard. A single event type usually points to one workflow shape rather than runner infrastructure.
2. Compare `shithub_actions_active` against runner capacity. High duration with few active jobs suggests slow jobs; high duration with saturated active jobs suggests insufficient runner capacity.
3. Check runner host CPU, memory, disk, and Docker/engine logs.
4. If the regression started with a deploy, review runner API, log streaming, checkout, and container execution changes first.
5. Capture representative slow run IDs and their step durations before canceling or pruning anything.

## actions-log-scrubber-possibly-missing

**Symptom:** server-side Actions log bytes are flowing for 30m, but `shithub_actions_log_scrub_replacements_total{location="server"}` remains zero. This is a warning, not proof of leaked secrets. Some periods legitimately have no secret-bearing logs.

1. Confirm that secrets or variables with sensitive values exist for workloads that ran during the window.
2. Trigger a controlled workflow that echoes a known test secret value and verify the rendered logs contain `***`, not plaintext (a quick check is sketched after this list).
3. Check that runner claims happened after the secret was created or rotated; mask snapshots are captured at claim time.
4. If the controlled workflow is not masked, stop affected runners, rotate the exposed secret, and open a security incident.
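Step 2's verification can be finished with a plain grep once the rendered log is in hand. The snippet below is a generic sketch: the local file name and the environment variable are illustrative, not part of shithub tooling.

```sh
# Generic check for step 2 above. Assumes the rendered run log was saved to
# ./run.log and the test secret's plaintext value is exported as
# TEST_SECRET_VALUE in this shell (both names are hypothetical).
# -F treats the value as a fixed string; -- guards against leading dashes.
if grep -qF -- "$TEST_SECRET_VALUE" run.log; then
  echo "plaintext secret found in rendered log: treat as exposure (step 4)"
else
  echo "no plaintext hits; occurrences should render as ***"
fi
```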