tenseleyflow/shithub / ea09767

docs/actions: document runner pool operations (S41j-4)

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: ea09767d0868087944652edb121e94bb528e7508
Parents: 3b6ce7e
Tree: 9bea261

4 changed files

| Status | File | + | - |
|--------|------|---|---|
| M | docs/internal/actions-runner-api.md | 19 | 1 |
| M | docs/internal/actions-schema.md | 8 | 1 |
| M | docs/internal/runbooks/actions-runner.md | 82 | 2 |
| M | docs/public/api/actions-runner.md | 5 | 0 |

docs/internal/actions-runner-api.md (modified)
@@ -24,6 +24,16 @@ The command inserts `workflow_runners`, stores only a SHA-256 hash in
 the runner token before it expires, because the runner uses that same token for
 heartbeat authentication.
 
+Operators can drain, undrain, rotate, and hard-revoke runners with
+`shithubd admin runner drain`, `undrain`, `rotate-token`, and `revoke`.
+Drained runners keep heartbeating and may finish already claimed jobs,
+but heartbeat claims return 204 until the runner is undrained. Hard
+revocation sets the runner offline, records `revoked_at`, revokes all
+registration tokens, and makes job API JWTs minted for that runner
+invalid. This is the token-compromise boundary: a host with an old config
+file cannot claim new jobs or update already claimed jobs after
+revocation lands in Postgres.
+
 `POST /api/v1/runners/heartbeat` accepts:
 
 ```http
@@ -77,7 +87,12 @@ runner API endpoints.
 Request body:
 
 ```json
-{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1}
+{
+  "labels": ["self-hosted", "linux", "ubuntu-latest", "x64"],
+  "capacity": 1,
+  "host_name": "runner-host-1",
+  "version": "v0.1.0"
+}
 ```
 
 Returns 204 when no matching job is claimable. Returns 200 with
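
A client could assemble the heartbeat request from the hunk above roughly as follows. This is a minimal Python sketch, not the runner's actual code: `build_heartbeat_request` is a hypothetical helper, and the base URL and token are placeholders. The endpoint path, the Bearer header, and the field names come from this commit; the request is only built here, never sent.

```python
import json
import urllib.request
from typing import List, Optional


def build_heartbeat_request(
    base_url: str,
    runner_token: str,
    labels: List[str],
    capacity: int,
    host_name: Optional[str] = None,
    version: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/v1/runners/heartbeat."""
    body = {"labels": labels, "capacity": capacity}
    # host_name and version are optional metadata; older runners omit them.
    if host_name is not None:
        body["host_name"] = host_name
    if version is not None:
        body["version"] = version
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/runners/heartbeat",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {runner_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; a 204 response then means there is nothing to claim, and a 200 response carries the job payload.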
@@ -89,6 +104,9 @@ disabled repos, approval-pending runs, per-repo concurrent job caps, and
 per-owner/org concurrent job caps are not dispatchable. Approval simply
 sets `workflow_runs.approved_by_user_id`; the next heartbeat can claim the
 same queued jobs, so no duplicate run is created.
+`host_name` and `version` are optional runner metadata. The server stores
+trimmed values up to 255 bytes for pool diagnostics and preserves the
+previous values when old runners omit them.
 The job payload includes `checkout_url`, `checkout_token`, resolved
 `secrets`, and `mask_values`; repo secrets shadow org secrets with the
 same name. The server also stores an encrypted claim-time copy of the mask
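
The metadata rule added in this hunk (trim, cap at 255 bytes, keep the previous value when an older runner omits the field) can be modelled as below. This is an illustrative Python sketch rather than the server's code: `normalize_metadata` and `MAX_METADATA_BYTES` are hypothetical names. The diff does not say whether over-long values are truncated or rejected, nor how blank values are handled, so the sketch assumes truncation and treats a blank value like an omitted one.

```python
from typing import Optional

MAX_METADATA_BYTES = 255  # limit stated in the docs; the constant name is hypothetical


def normalize_metadata(incoming: Optional[str], previous: str) -> str:
    """Return the value to store for host_name or version."""
    if incoming is None:
        # Field omitted by an older runner: keep what was stored before.
        return previous
    trimmed = incoming.strip()
    if not trimmed:
        # Assumption: a blank value is treated like an omitted one.
        return previous
    # Assumption: over-long values are truncated at the byte limit.
    encoded = trimmed.encode("utf-8")[:MAX_METADATA_BYTES]
    # errors="ignore" drops a multi-byte character that the cut split in half.
    return encoded.decode("utf-8", errors="ignore")
```

Truncating on bytes rather than characters matches the "255 bytes" wording, at the cost of possibly dropping one trailing multi-byte character.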
docs/internal/actions-schema.md (modified)
@@ -12,7 +12,7 @@ without churning under them.
 
 ## SQL schema
 
-Actions migrations currently span 0042–0051, 0053, 0057, 0060, and 0064–0066.
+Actions migrations currently span 0042–0051, 0053, 0057, 0060, and 0064–0067.
 Migration 0052 belongs to the repo source-remotes feature, 0054
 belongs to push event protocol tracking, 0055 belongs to the social
 feed, 0056 belongs to user profile contribution settings, 0058 belongs
@@ -34,6 +34,7 @@ to repo name reuse, and 0059 belongs to GitHub org imports.
 | 0057  | `workflow_job_secret_masks` | Encrypted claim-time log mask snapshots per job               |
 | 0060  | Actions retention indexes   | Narrow cleanup indexes for terminal steps/runs                |
 | 0066  | `actions_*_policies`, `workflow_run_approvals` | Enablement, runner-pool caps, and approval decisions |
+| 0067  | `workflow_runners` ops state | Host/version metadata, drain state, and hard revocation state |
 
 A few load-bearing choices, called out so they're easy to spot in a
 later schema diff:
@@ -395,6 +396,12 @@ Other admin surfaces are scoped to later sub-sprints:
 
 - S41c: `shithubd admin runner register --name <foo>` issues a
   registration token + writes a row to `workflow_runners`.
+- S41j: `shithubd admin runner drain|undrain|rotate-token|revoke|cleanup-stale`
+  gives operators pool controls. Drained runners keep heartbeating and
+  may finish already claimed jobs but receive no new claims. Revoked
+  runners are set offline, all registration tokens are revoked, and job
+  API JWTs from that runner are rejected even if the runner still has an
+  old config file.
 - S41g: `POST /api/v1/jobs/{id}/cancel` and the repository run-detail
   UI request cancellation. Running jobs flip `cancel_requested`; queued
   jobs are made terminal immediately.
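
The drain and revoke semantics described in the S41j bullet, combined with the status codes given in this commit's public API doc (204 for drained runners, 401 for revoked ones), amount to a small decision table. The Python sketch below is a toy model for reasoning about that table, not the server's implementation. `Runner`, `heartbeat_status`, and `job_jwt_accepted` are hypothetical names, and the `drained` boolean is an assumed stand-in for however migration 0067 stores drain state.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Runner:
    drained: bool = False                  # assumed representation of drain state
    revoked_at: Optional[datetime] = None  # set by hard revocation


def heartbeat_status(runner: Runner, job_available: bool) -> int:
    """HTTP status a heartbeat would get under the documented rules."""
    if runner.revoked_at is not None:
        return 401  # revoked runners can no longer authenticate
    if runner.drained or not job_available:
        return 204  # heartbeat accepted, but nothing is handed out
    return 200      # a matching job is claimed


def job_jwt_accepted(runner: Runner) -> bool:
    """Job API JWTs minted for a runner stop working once it is revoked."""
    return runner.revoked_at is None
```

The model makes the difference explicit: drain only withholds new claims, so a drained runner's job JWTs stay valid and claimed jobs can finish, while revocation cuts off both the heartbeat and any in-flight job updates.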
docs/internal/runbooks/actions-runner.md (modified)
@@ -55,6 +55,84 @@ shithubd-runner run \
   --dns-servers 172.30.0.1
 ```
 
+## Pool Operations
+
+List pool state:
+
+```sh
+shithubd admin runner list --output json
+shithubd admin runner queue --output json
+```
+
+`list` includes labels, capacity, active job count, last heartbeat,
+host name, runner version, drain state, and revoke state. `queue`
+groups queued jobs by requested `runs-on` label so unsupported labels
+are visible without querying Postgres.
+
+Drain a runner before host maintenance:
+
+```sh
+shithubd admin runner drain --id 7 --reason 'kernel update'
+```
+
+The runner keeps heartbeating and can finish already claimed jobs, but
+new heartbeat claims return 204 until it is undrained:
+
+```sh
+shithubd admin runner undrain --id 7
+```
+
+Rotate a registration token after a config-management change:
+
+```sh
+shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
+```
+
+Update the runner host config with the printed token, restart
+`shithubd-runner`, then verify heartbeats with `runner list`.
+
+Mark stale runners offline:
+
+```sh
+shithubd admin runner cleanup-stale --older-than 2m
+```
+
+Hard-revoke a compromised runner:
+
+```sh
+shithubd admin runner revoke --id 7 --reason 'host compromise'
+```
+
+Revocation records `revoked_at`, marks the runner offline, revokes every
+registration token for that runner, and causes existing job API JWTs from
+that runner to fail. Use drain for routine maintenance; use revoke when
+the token or host may be compromised.
+
+Destroy/recreate with DigitalOcean:
+
+```sh
+doctl compute droplet list --tag-name shithub-actions-runner
+doctl compute droplet delete <droplet-id>
+```
+
+After recreation, register a fresh runner token and let the provisioning
+role write the new config. Do not reuse a token from a revoked runner.
+
+Emergency controls:
+
+- Stop all new claims: disable Actions at site level or drain every
+  runner with `shithubd admin runner list --output json` followed by
+  `shithubd admin runner drain --id <id>`.
+- Pause one repo/org: set the repo/org Actions policy to disabled so the
+  claim query leaves its queued jobs untouched.
+- Cancel active work for a repo: use the repo Actions UI or job cancel
+  API for each queued/running job.
+- Revoke all runner tokens: iterate `shithubd admin runner revoke --id
+  <id>` over every non-revoked runner.
+- Fence the pool: block or destroy droplets tagged
+  `shithub-actions-runner` in DigitalOcean, then rotate or revoke the
+  affected runner tokens before allowing replacement hosts to connect.
+
 Equivalent config file:
 
 ```toml
@@ -111,7 +189,7 @@ Claim a job:
 curl -fsS "$BASE/api/v1/runners/heartbeat" \
   -H "Authorization: Bearer $RUNNER_TOKEN" \
   -H "Content-Type: application/json" \
-  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1}' \
+  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
   | tee /tmp/shithub-claim.json
 ```
 
@@ -209,7 +287,9 @@ Expected results:
   jobs are terminal.
 - The PR Checks tab shows the matching check run as success.
 - `/metrics` includes runner registration, heartbeat, JWT, job
-  cancellation, log-scrub, step-timeout, and retention counters.
+  cancellation, log-scrub, step-timeout, retention, queue-depth by label,
+  claim-latency, runner-online, runner-stale, runner-draining, and
+  runner-revocation metrics.
 
 ## Retention Sweep
 
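
The "drain every runner" emergency control in the runbook above (list the pool as JSON, then drain each id) could be scripted along the following lines. This is a hedged Python sketch: the runbook does not show the shape of `shithubd admin runner list --output json`, so the sketch assumes a JSON array of objects with `id`, `drained`, and `revoked` fields, and those field names are guesses. `drain_commands` is a hypothetical helper that only builds argument lists and does not run anything; the `--id` and `--reason` flags are the ones the runbook uses.

```python
import json
from typing import List


def drain_commands(list_json: str, reason: str) -> List[List[str]]:
    """Build one `runner drain` argv per runner that still accepts claims."""
    runners = json.loads(list_json)
    commands = []
    for runner in runners:
        # Assumed field names; adjust to the real `list --output json` shape.
        if runner.get("revoked") or runner.get("drained"):
            continue  # already unable to claim new jobs
        commands.append([
            "shithubd", "admin", "runner", "drain",
            "--id", str(runner["id"]),
            "--reason", reason,
        ])
    return commands
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` once the output has been reviewed. Building the commands separately from running them keeps a dry run trivial during an incident.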
docs/public/api/actions-runner.md (modified)
@@ -19,6 +19,11 @@ Claims one queued workflow job for a registered runner when labels and
 capacity match. Returns `204 No Content` when no job is available, or
 `200 OK` with a job payload and job JWT.
 
+The runner may include `host_name` and `version` in the heartbeat body
+for operator diagnostics. Drained runners return `204 No Content` and do
+not claim new jobs. Revoked runners receive `401 Unauthorized`, and job
+JWTs previously minted for that runner are rejected.
+
 ## Job log chunks
 
 ```