@@ -55,6 +55,84 @@ shithubd-runner run \ |
| 55 | 55 | --dns-servers 172.30.0.1 |
| 56 | 56 | ``` |
| 57 | 57 | |
| 58 | +## Pool Operations |
| 59 | + |
| 60 | +List pool state: |
| 61 | + |
| 62 | +```sh |
| 63 | +shithubd admin runner list --output json |
| 64 | +shithubd admin runner queue --output json |
| 65 | +``` |
| 66 | + |
| 67 | +`list` includes labels, capacity, active job count, last heartbeat, |
| 68 | +host name, runner version, drain state, and revoke state. `queue` |
| 69 | +groups queued jobs by requested `runs-on` label so unsupported labels |
| 70 | +are visible without querying Postgres. |
| 71 | + |
| 72 | +Drain a runner before host maintenance: |
| 73 | + |
| 74 | +```sh |
| 75 | +shithubd admin runner drain --id 7 --reason 'kernel update' |
| 76 | +``` |
| 77 | + |
| 78 | +The runner keeps heartbeating and can finish already claimed jobs, but |
| 79 | +new heartbeat claims return HTTP 204 until it is undrained: |
| 80 | + |
| 81 | +```sh |
| 82 | +shithubd admin runner undrain --id 7 |
| 83 | +``` |
| 84 | + |
| 85 | +Rotate a registration token after a config-management change: |
| 86 | + |
| 87 | +```sh |
| 88 | +shithubd admin runner rotate-token --id 7 --expires-in 24h --output json |
| 89 | +``` |
| 90 | + |
| 91 | +Update the runner host config with the printed token, restart |
| 92 | +`shithubd-runner`, then verify heartbeats with `shithubd admin runner list`. |
| 93 | + |
| 94 | +Mark stale runners offline: |
| 95 | + |
| 96 | +```sh |
| 97 | +shithubd admin runner cleanup-stale --older-than 2m |
| 98 | +``` |
| 99 | + |
| 100 | +Hard-revoke a compromised runner: |
| 101 | + |
| 102 | +```sh |
| 103 | +shithubd admin runner revoke --id 7 --reason 'host compromise' |
| 104 | +``` |
| 105 | + |
| 106 | +Revocation records `revoked_at`, marks the runner offline, revokes every |
| 107 | +registration token for that runner, and causes existing job API JWTs from |
| 108 | +that runner to fail. Use drain for routine maintenance; use revoke when |
| 109 | +the token or host may be compromised. |
| 110 | + |
| 111 | +Destroy/recreate with DigitalOcean: |
| 112 | + |
| 113 | +```sh |
| 114 | +doctl compute droplet list --tag-name shithub-actions-runner |
| 115 | +doctl compute droplet delete <droplet-id> |
| 116 | +``` |
| 117 | + |
| 118 | +After recreation, register a fresh runner token and let the provisioning |
| 119 | +role write the new config. Do not reuse a token from a revoked runner. |
| 120 | + |
| 121 | +Emergency controls: |
| 122 | + |
| 123 | +- Stop all new claims: disable Actions at site level or drain every |
| 124 | + runner with `shithubd admin runner list --output json` followed by |
| 125 | + `shithubd admin runner drain --id <id>`. |
| 126 | +- Pause one repo/org: set the repo/org Actions policy to disabled so the |
| 127 | + claim query leaves its queued jobs untouched. |
| 128 | +- Cancel active work for a repo: use the repo Actions UI or job cancel |
| 129 | + API for each queued/running job. |
| 130 | +- Revoke all runner tokens: iterate `shithubd admin runner revoke --id |
| 131 | + <id>` over every non-revoked runner. |
| 132 | +- Fence the pool: block or destroy droplets tagged |
| 133 | + `shithub-actions-runner` in DigitalOcean, then rotate or revoke the |
| 134 | + affected runner tokens before allowing replacement hosts to connect. |
| 135 | + |
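The "drain every runner" emergency control above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the `drain_all` helper is hypothetical, and the commented usage line assumes `jq` is installed and that `runner list --output json` emits an array of objects with an `id` field. The diff does not confirm that JSON shape, so check the real output before relying on the filter.

```sh
# drain_all (hypothetical helper) reads runner IDs from stdin, one per
# line, and drains each. A failed drain is reported but does not stop the
# loop, so one bad ID does not leave the rest of the pool claiming jobs.
drain_all() {
  reason=$1
  failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    # </dev/null keeps the command from consuming the remaining ID list.
    if ! shithubd admin runner drain --id "$id" --reason "$reason" </dev/null; then
      echo "drain failed for runner $id" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Assumed JSON shape: [{"id": 7, ...}, ...]. Verify before use.
# shithubd admin runner list --output json | jq -r '.[].id' | drain_all 'emergency stop'
```

The same loop shape would fit the "revoke all runner tokens" control by swapping `drain` for `revoke`, though revocation is not reversible with `undrain`, so it is worth reviewing the ID list first.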
| 58 | 136 | Equivalent config file: |
| 59 | 137 | |
| 60 | 138 | ```toml |
@@ -111,7 +189,7 @@ Claim a job: |
| 111 | 189 | curl -fsS "$BASE/api/v1/runners/heartbeat" \ |
| 112 | 190 | -H "Authorization: Bearer $RUNNER_TOKEN" \ |
| 113 | 191 | -H "Content-Type: application/json" \ |
| 114 | | - -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1}' \ |
| 192 | + -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \ |
| 115 | 193 | | tee /tmp/shithub-claim.json |
| 116 | 194 | ``` |
| 117 | 195 | |
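When smoke-testing a drained runner with the heartbeat call above, it can help to surface the HTTP status, since the Pool Operations text says a drained runner gets 204 instead of a job. The sketch below is illustrative only: `describe_claim` is a hypothetical helper, and the diff does not state that a successful claim returns 200, that 204 also covers an empty queue, or that a revoked token yields 401/403. curl's `-o` and `-w '%{http_code}'` options are used to capture the status separately from the body.

```sh
# describe_claim (hypothetical helper) maps a heartbeat HTTP status to a
# short diagnosis.
describe_claim() {
  case "$1" in
    200) echo "claimed" ;;       # assumption: a successful claim returns 200
    204) echo "no-job" ;;        # drained runner; presumably also an empty queue
    401|403) echo "rejected" ;;  # assumption: revoked or expired token
    *) echo "unexpected:$1" ;;
  esac
}

# Usage against a live server (not run here):
# status=$(curl -sS -o /tmp/shithub-claim.json -w '%{http_code}' \
#   "$BASE/api/v1/runners/heartbeat" \
#   -H "Authorization: Bearer $RUNNER_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}')
# describe_claim "$status"
```

Dropping `-f` in the usage line is deliberate: with `-f`, curl exits non-zero on 4xx/5xx before the status can be classified.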
@@ -209,7 +287,9 @@ Expected results: |
| 209 | 287 | jobs are terminal. |
| 210 | 288 | - The PR Checks tab shows the matching check run as success. |
| 211 | 289 | - `/metrics` includes runner registration, heartbeat, JWT, job |
| 212 | | - cancellation, log-scrub, step-timeout, and retention counters. |
| 290 | + cancellation, log-scrub, step-timeout, retention, queue-depth by label, |
| 291 | + claim-latency, runner-online, runner-stale, runner-draining, and |
| 292 | + runner-revocation metrics. |
| 213 | 293 | |
| 214 | 294 | ## Retention Sweep |
| 215 | 295 | |