
# Actions runner deploy runbook

This runbook owns the S41d/S41j deployment path for `shithubd-runner`: the Nix-built default image, systemd unit, Ansible role, and DigitalOcean runner pool bootstrap. The smoke flow for an already-installed runner lives in [actions-runner.md](./actions-runner.md).

## Prereqs

- The app database is migrated through S41d and the web API has `auth.totp_key_b64` configured so job JWTs can be minted.
- Docker is installed on the runner host and the `docker` group exists. The runner process needs Docker socket access; treat the host itself as trusted even though individual step containers are sandboxed.
- For DigitalOcean runner pools: `doctl` and `jq` installed locally, authenticated with `doctl auth init`, and a named SSH key uploaded to the DigitalOcean account.
- Runner host SSH ingress must be restricted to operator or VPN CIDRs. Do not create runner droplets with SSH open to `0.0.0.0/0`.
- `bin/shithubd-runner` exists locally. `make build` builds both `bin/shithubd` and `bin/shithubd-runner` with the same version ldflags.
- The default image has been loaded or published. Build it with:

```sh
nix build ./deploy/runner-images#runnerImage
docker load < result
```

The committed `deploy/runner-images/flake.lock` pins the nixpkgs input. Update it deliberately when changing the default image toolchain.

Publishing to GHCR is manual through `.github/workflows/runner-image.yml` because forks may not control the upstream `ghcr.io/shithub` namespace. Leave the workflow's `image` input blank to publish under the current repository's package namespace, or set it explicitly for upstream publishing.
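
The manual dispatch can be triggered with the GitHub CLI. The workflow filename and its `image` input come from this runbook; the explicit image path shown is illustrative, not a confirmed upstream location:

```shell
# Publish under the current repository's package namespace
# (leave the `image` input blank, per the runbook):
gh workflow run runner-image.yml

# Or publish explicitly to the upstream namespace
# (hypothetical image path, shown for illustration only):
gh workflow run runner-image.yml -f image=ghcr.io/shithub/runner-nix
```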

## DigitalOcean pool bootstrap

S41j runner hosts are separate droplets tagged `shithub-actions-runner`. The app/database host must not double as the arbitrary-code runner host.

Validate the plan first:

```sh
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh --dry-run
```

Create the first shared Linux runner:

```sh
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh
```

Useful overrides:

```sh
POOL_NAME=shared-linux
PROJECT_NAME=shithub-prod
REGION=sfo3
SIZE=s-2vcpu-4gb
IMAGE=ubuntu-24-04-x64
COUNT=1
VPC_UUID=REPLACE_ME_OPTIONAL
```

The provisioner:

- creates or reuses the DigitalOcean project;
- creates or reuses a tag-targeted cloud firewall;
- refuses public SSH CIDRs;
- creates missing droplets named `shithub-runner-<pool>-N`;
- applies the `shithub-actions-runner` and pool tags;
- installs a no-secret cloud-init baseline that enables Docker;
- prints machine-readable JSON for operator records.

Generate an Ansible inventory from the DigitalOcean tag:

```sh
./deploy/doctl/generate-actions-runner-inventory.sh \
  --output deploy/ansible/inventory/actions-runners
```

The generated file contains per-host `shithub_runner_token` placeholders. Replace them with values from `shithubd admin runner register`, ideally through ansible-vault or host_vars rather than committing plaintext inventory.
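
One way to move tokens into the inventory without plaintext, sketched under the assumption that `--output json` emits a top-level `token` field (the field name is stated below in Register; the surrounding JSON shape here is illustrative):

```shell
# Illustrative output of `shithubd admin runner register --output json`;
# only the "token" field name is taken from the runbook, the rest of the
# shape is an assumption.
cat > register-output.json <<'EOF'
{"name": "shithub-runner-shared-linux-1", "token": "rt_example_only"}
EOF

# Extract the raw token for vaulting; delete the file afterwards so the
# plaintext token never lingers on disk.
TOKEN=$(jq -r '.token' register-output.json)
echo "extracted ${#TOKEN} characters"
rm register-output.json
```

Feed `$TOKEN` into `ansible-vault encrypt_string --stdin-name shithub_runner_token` and paste the result into the host's vars file.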

## Register

Run this once from a host that can reach the production database config:

```sh
shithubd admin runner register \
  --name prod-runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

Store the returned `token` in ansible-vault or the deployment secret store. Only the token hash is stored in Postgres; the raw token cannot be recovered later. Use `--expires-in` only when the deployment rotates the token before expiry, because the runner presents the registration token on every heartbeat.

For a generated DigitalOcean inventory, register one token per runner host and use the host name as the runner name:

```sh
shithubd admin runner register \
  --name shithub-runner-shared-linux-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

## Inventory

Enable the role explicitly. The default is disabled so ordinary app deploys do not start a runner by accident.

```ini
[shithub:vars]
shithub_runner_enabled=true
shithub_runner_token=REPLACE_ME
shithub_runner_labels=self-hosted,linux,ubuntu-latest,x64
shithub_runner_capacity=1
shithub_runner_default_image=ghcr.io/tenseleyflow/shithub/runner-nix:1.0
shithub_runner_seccomp_profile=/etc/shithubd-runner/seccomp.json
shithub_runner_container_user=65534:65534
shithub_runner_pids_limit=512
```

The role writes non-secret config to `/etc/shithubd-runner/config.toml` and the registration token to `/etc/shithubd-runner/runner.env` with mode `0600`. Keep `shithub_runner_workspace_root` under `/var/lib/shithubd-runner`; the systemd unit grants runner writes only to that subtree.
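
A sketch of what the rendered file might look like. Only the `runner.network_allowlist` and `engine.dns_servers` keys are named in this runbook; every other key below is an assumed mirror of the inventory variables above, not the shipped template:

```toml
# /etc/shithubd-runner/config.toml (sketch; most keys are assumptions)
[runner]
labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
capacity = 1
workspace_root = "/var/lib/shithubd-runner/workspaces"
network_allowlist = ["github.com", "codeload.github.com", "registry-1.docker.io"]

[engine]
default_image = "ghcr.io/tenseleyflow/shithub/runner-nix:1.0"
dns_servers = ["172.30.0.1"]
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
container_user = "65534:65534"
pids_limit = 512
```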

`shithub_runner_network_allowlist` defaults to GitHub source/archive hosts plus Docker Hub registry hosts. Override it when a runner must fetch from an internal package registry. The role creates the `shithub-actions` Docker bridge at `172.30.0.1/24`, runs dnsmasq on that bridge, and sets `engine.dns_servers` to the bridge resolver by default.
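
For debugging, the bridge the role creates can be reproduced by hand. The network name and `172.30.0.1/24` addressing come from the paragraph above; the driver and flag choices are assumptions about how the role invokes Docker:

```shell
# Recreate the Actions bridge manually (name and addressing from the
# runbook; flag choices are assumptions):
docker network create \
  --driver bridge \
  --subnet 172.30.0.0/24 \
  --gateway 172.30.0.1 \
  shithub-actions

# Confirm the bridge exists and dnsmasq is listening on its gateway:
docker network inspect shithub-actions \
  --format '{{.Name}} {{range .IPAM.Config}}{{.Subnet}}{{end}}'
ss -lun | grep 172.30.0.1
```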

## Deploy

For the runner role only:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/production site.yml -t shithubd-runner
```

When deploying from a non-Linux operator machine, build the runner binary for the target host architecture before running Ansible:

```sh
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 make build
```

For the generated DigitalOcean runner inventory:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/actions-runners site.yml -t shithubd-runner
```

The role:

- creates the `shithub-runner` system user and joins it to `docker`
- uploads `/usr/local/bin/shithubd-runner`
- renders `/etc/shithubd-runner/config.toml` and `runner.env`
- creates the dedicated Actions Docker network and bridge
- renders `/etc/dnsmasq.d/shithubd-runner.conf` from the network allowlist and starts dnsmasq bound to the Actions bridge
- installs `shithub-runner-firewall.service`, which rejects direct-IP egress from step containers unless dnsmasq populated the destination in the allowlist ipset
- installs the pinned seccomp profile at `/etc/shithubd-runner/seccomp.json`
- installs `deploy/systemd/shithubd-runner.service`
- pulls the configured runner image
- enables and starts `shithubd-runner`

## Verify

On the runner host:

```sh
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
```

Then push a workflow with a simple `run:` step:

```yaml
name: ci
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: bash -c "echo hello && exit 0"
```

Expected state:

- a runner heartbeat claims the queued job within one idle poll interval
- the step emits SQL log chunks during execution
- `workflow:finalize_step` uploads `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`
- the job and check run complete with conclusion `success`

Repeat with `exit 1`; the check should complete with conclusion `failure`.

Before declaring a shared pool available to arbitrary repositories, run the [arbitrary repository smoke](./actions-runner.md#arbitrary-repository-smoke) checklist against scratch and at least one additional repository. The single repo check above proves the runner process works; the arbitrary-repo smoke proves label routing, checkout scoping, archived logs, check runs, and negative queue/secret cases across normal repositories.

Sandbox smoke checks:

```yaml
name: sandbox-smoke
on: push
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - run: id -u
      - run: test "$(id -u)" = "65534"
      - run: if mkdir /etc/shithub-smoke 2>/dev/null; then exit 1; fi
      - run: if mount -t tmpfs tmpfs /mnt 2>/dev/null; then exit 1; fi
```

Expected state:

- the UID check prints `65534`
- a workflow-level request for `permissions: {shithub-runner-root: write}` still runs as `65534`; root opt-in is disabled in the shipped runner config until a trusted-workflow policy exists
- writing under `/etc` fails because the root filesystem is read-only
- `mount` fails because the container does not have `CAP_SYS_ADMIN`
- step logs and systemd journal include the configured image, network, CPU/memory limits, PID limit, container user, and seccomp profile

## Network Allowlist

The runner config carries two separate network controls:

- `runner.network_allowlist`: the host patterns allowed by the operator's DNS allowlist resolver.
- `engine.dns_servers`: DNS servers passed to each step container with Docker `--dns`.

For a single-host deployment, create a dedicated Docker bridge for Actions jobs, run dnsmasq bound to that bridge, and set `shithub_runner_dns_servers` to the bridge address of that resolver. The Ansible role now does this by default. The rendered dnsmasq config has no default upstream resolver; names not matching the allowlist fail DNS resolution.
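
A minimal sketch of what the rendered resolver config looks like. The directive names (`no-resolv`, `server=/domain/ip`, `ipset=/domain/setname`) are standard dnsmasq syntax; the concrete hosts, upstream address, and ipset name below are assumptions:

```
# /etc/dnsmasq.d/shithubd-runner.conf (sketch)
# Bind only to the Actions bridge resolver address.
listen-address=172.30.0.1
bind-interfaces
# No default upstream: names outside the allowlist fail to resolve.
no-resolv
# Allowlisted domains resolve via an explicit upstream (assumed here).
server=/github.com/1.1.1.1
server=/codeload.github.com/1.1.1.1
# Resolved IPs are added to the ipset the firewall service checks
# (set name is an assumption).
ipset=/github.com/shithub-allow
ipset=/codeload.github.com/shithub-allow
```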

The firewall service closes the direct-IP bypass: containers on the Actions subnet may send DNS only to the bridge resolver, and other egress is allowed only when the destination IP is present in the dnsmasq-populated ipset. Keep the runner on a separate host from web and database services.
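
The enforcement can be pictured as three rules. The `DOCKER-USER` chain placement, set name, and exact matches below are assumptions about the shipped unit, not its actual contents:

```shell
# Sketch of the direct-IP egress policy enforced by the firewall service.
# Set name "shithub-allow" and chain placement are assumptions.
ipset create shithub-allow hash:ip timeout 300

# DNS from the Actions subnet may go only to the bridge resolver.
iptables -I DOCKER-USER -s 172.30.0.0/24 -p udp --dport 53 \
  ! -d 172.30.0.1 -j REJECT

# Other egress passes only if dnsmasq resolved the destination into the set.
iptables -A DOCKER-USER -s 172.30.0.0/24 \
  -m set --match-set shithub-allow dst -j RETURN
iptables -A DOCKER-USER -s 172.30.0.0/24 ! -d 172.30.0.0/24 -j REJECT
```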

## Rollback

Stop the runner first so it does not claim new jobs:

```sh
systemctl stop shithubd-runner
systemctl disable shithubd-runner
```

If the binary itself is bad, copy a prior archived binary from `/var/lib/shithubd-runner/binaries/` back to `/usr/local/bin/shithubd-runner` and restart the unit. Jobs already claimed by the stopped runner remain visible in the database; S41g adds operator cancel/re-run controls.
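
A hedged sketch of the restore, assuming the archive directory holds flat timestamped copies where the newest file is the most recent deploy (the directory comes from this runbook; its internal layout is an assumption):

```shell
# Pick the most recently archived binary and reinstall it.
BIN_DIR=/var/lib/shithubd-runner/binaries
LATEST=$(ls -t "$BIN_DIR" | head -n 1)
install -m 0755 "$BIN_DIR/$LATEST" /usr/local/bin/shithubd-runner
systemctl restart shithubd-runner
systemctl status shithubd-runner --no-pager
```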

For a DigitalOcean test runner, drain or revoke the runner in shithub before destroying the droplet. Then list and delete the specific test host:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete shithub-runner-shared-linux-1 --force
```

If the pool firewall was created only for a disposable test and no runner droplets still use it:

```sh
doctl compute firewall list
doctl compute firewall delete <firewall-id> --force
```