
# Actions runner deploy runbook

This runbook owns the S41d/S41j deployment path for `shithubd-runner`: the Nix-built default image, systemd unit, Ansible role, and DigitalOcean runner pool bootstrap. The smoke flow for an already-installed runner lives in [actions-runner.md](./actions-runner.md).

## Prereqs

- The app database is migrated through S41d and the web API has `auth.totp_key_b64` configured so job JWTs can be minted.
- Docker is installed on the runner host and the `docker` group exists. The runner process needs Docker socket access; treat the host itself as trusted even though individual step containers are sandboxed.
- For DigitalOcean runner pools: `doctl` and `jq` installed locally, authenticated with `doctl auth init`, and a named SSH key uploaded to the DigitalOcean account.
- Runner host SSH ingress must be restricted to operator or VPN CIDRs. Do not create runner droplets with SSH open to `0.0.0.0/0`.
- `bin/shithubd-runner` exists locally. `make build` builds both `bin/shithubd` and `bin/shithubd-runner` with the same version ldflags.
- The default image has been loaded or published. Build it with:

```sh
nix build ./deploy/runner-images#runnerImage
docker load < result
```

The committed `deploy/runner-images/flake.lock` pins the nixpkgs input. Update it deliberately when changing the default image toolchain.

Publishing to GHCR is manual through `.github/workflows/runner-image.yml` because forks may not control the upstream `ghcr.io/shithub` namespace. Leave the workflow's `image` input blank to publish under the current repository's package namespace, or set it explicitly for upstream publishing.

## DigitalOcean pool bootstrap

S41j runner hosts are separate droplets tagged `shithub-actions-runner`. The app/database host must not double as the arbitrary-code runner host.

Validate the plan first:

```sh
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh --dry-run
```

Create the first shared Linux runner:

```sh
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh
```

Useful overrides:

```sh
POOL_NAME=shared-linux
PROJECT_NAME=shithub-prod
REGION=sfo3
SIZE=s-2vcpu-4gb
IMAGE=ubuntu-24-04-x64
COUNT=1
VPC_UUID=REPLACE_ME_OPTIONAL
```

The provisioner:

- creates or reuses the DigitalOcean project;
- creates or reuses a tag-targeted cloud firewall;
- refuses public SSH CIDRs;
- creates missing droplets named `shithub-runner-<pool>-N`;
- applies the `shithub-actions-runner` and pool tags;
- installs a no-secret cloud-init baseline that enables Docker;
- prints machine-readable JSON for operator records.

Generate an Ansible inventory from the DigitalOcean tag:

```sh
./deploy/doctl/generate-actions-runner-inventory.sh \
  --output deploy/ansible/inventory/actions-runners
```

The generated file contains per-host `shithub_runner_token` placeholders. Replace them with values from `shithubd admin runner register`, ideally through ansible-vault or host_vars rather than committing plaintext inventory.
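As a sketch, one way to keep the token out of plaintext inventory is a per-host host_vars file carrying a vaulted value. The file path and hostname below follow the generated inventory's naming; the vault payload is a placeholder, not a real ciphertext:

```yaml
# host_vars/shithub-runner-shared-linux-1.yml (illustrative sketch)
shithub_runner_token: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  REPLACE_WITH_OUTPUT_OF_ansible_vault_encrypt_string
```

`ansible-vault encrypt_string --stdin-name shithub_runner_token` emits a block in this shape that can be pasted into the file directly.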

## Register

Run this once from a host that can reach the production database config:

```sh
shithubd admin runner register \
  --name prod-runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

Store the returned `token` in ansible-vault or the deployment secret store. Only the token hash is stored in Postgres; the raw token cannot be recovered later. Use `--expires-in` only when the deployment rotates the token before expiry, because the runner presents the registration token on every heartbeat.

For a generated DigitalOcean inventory, register one token per runner host and use the host name as the runner name:

```sh
shithubd admin runner register \
  --name shithub-runner-shared-linux-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

## Inventory

Enable the role explicitly. The default is disabled so ordinary app deploys do not start a runner by accident.

```ini
[shithub:vars]
shithub_runner_enabled=true
shithub_runner_token=REPLACE_ME
shithub_runner_labels=self-hosted,linux,ubuntu-latest,x64
shithub_runner_capacity=1
shithub_runner_default_image=ghcr.io/shithub/runner-nix:1.0
shithub_runner_seccomp_profile=/etc/shithubd-runner/seccomp.json
shithub_runner_container_user=65534:65534
shithub_runner_pids_limit=512
```

The role writes non-secret config to `/etc/shithubd-runner/config.toml` and the registration token to `/etc/shithubd-runner/runner.env` with mode `0600`. Keep `shithub_runner_workspace_root` under `/var/lib/shithubd-runner`; the systemd unit grants runner writes only to that subtree.
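For orientation, the rendered `config.toml` might look roughly like the sketch below. Only `runner.network_allowlist` and `engine.dns_servers` are keys named elsewhere in this runbook; the section layout and example values are otherwise illustrative assumptions, not the role's actual template:

```toml
# /etc/shithubd-runner/config.toml (illustrative sketch; real keys may differ)
[runner]
network_allowlist = ["github.com", "codeload.github.com", "registry-1.docker.io"]

[engine]
# Bridge resolver address of the shithub-actions Docker network.
dns_servers = ["172.30.0.1"]
```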

`shithub_runner_network_allowlist` defaults to GitHub source/archive hosts plus Docker Hub registry hosts. Override it when a runner must fetch from an internal package registry. The role creates the `shithub-actions` Docker bridge at `172.30.0.1/24`, runs dnsmasq on that bridge, and sets `engine.dns_servers` to the bridge resolver by default.
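For example, a runner that must also reach an internal registry could extend the allowlist in inventory. The internal hostname below is a placeholder, and the comma-separated form mirrors the other list-valued inventory variables above (an assumption about how the role parses this variable):

```ini
[shithub:vars]
shithub_runner_network_allowlist=github.com,codeload.github.com,registry-1.docker.io,registry.internal.example
```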

## Deploy

For the runner role only:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/production site.yml -t shithubd-runner
```

For the generated DigitalOcean runner inventory:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/actions-runners site.yml -t shithubd-runner
```

The role:

- creates the `shithub-runner` system user and joins it to `docker`
- uploads `/usr/local/bin/shithubd-runner`
- renders `/etc/shithubd-runner/config.toml` and `runner.env`
- creates the dedicated Actions Docker network and bridge
- renders `/etc/dnsmasq.d/shithubd-runner.conf` from the network allowlist and starts dnsmasq bound to the Actions bridge
- installs `shithub-runner-firewall.service`, which rejects direct-IP egress from step containers unless dnsmasq populated the destination in the allowlist ipset
- installs the pinned seccomp profile at `/etc/shithubd-runner/seccomp.json`
- installs `deploy/systemd/shithubd-runner.service`
- pulls the configured runner image
- enables and starts `shithubd-runner`

## Verify

On the runner host:

```sh
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
```

Then push a workflow with a simple `run:` step:

```yaml
name: ci
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: bash -c "echo hello && exit 0"
```

Expected state:

- a runner heartbeat claims the queued job within one idle poll interval
- the step emits SQL log chunks during execution
- `workflow:finalize_step` uploads `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`
- the job and check run complete with conclusion `success`

Repeat with `exit 1`; the check should complete with conclusion `failure`.

Sandbox smoke checks:

```yaml
name: sandbox-smoke
on: push
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - run: id -u
      - run: test "$(id -u)" = "65534"
      - run: if mkdir /etc/shithub-smoke 2>/dev/null; then exit 1; fi
      - run: if mount -t tmpfs tmpfs /mnt 2>/dev/null; then exit 1; fi
```

Expected state:

- the UID check prints `65534`
- a workflow-level request for `permissions: {shithub-runner-root: write}` still runs as `65534`; root opt-in is disabled in the shipped runner config until a trusted-workflow policy exists
- writing under `/etc` fails because the root filesystem is read-only
- `mount` fails because the container does not have `CAP_SYS_ADMIN`
- step logs and systemd journal include the configured image, network, CPU/memory limits, PID limit, container user, and seccomp profile

## Network Allowlist

The runner config carries two separate network controls:

- `runner.network_allowlist`: the host patterns allowed by the operator's DNS allowlist resolver.
- `engine.dns_servers`: DNS servers passed to each step container with Docker `--dns`.

For a single-host deployment, create a dedicated Docker bridge for Actions jobs, run dnsmasq bound to that bridge, and set `shithub_runner_dns_servers` to the bridge address of that resolver. The Ansible role now does this by default. The rendered dnsmasq config has no default upstream resolver; names not matching the allowlist fail DNS resolution.

The firewall service closes the direct-IP bypass: containers on the Actions subnet may send DNS only to the bridge resolver, and other egress is allowed only when the destination IP is present in the dnsmasq-populated ipset. Keep the runner on a separate host from web and database services.
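The mechanism behind the rendered `/etc/dnsmasq.d/shithubd-runner.conf` can be sketched with stock dnsmasq directives. The upstream resolver address and the ipset name below are placeholders, and the role's actual rendered file may differ in detail:

```
# Illustrative dnsmasq allowlist sketch, not the role's exact output.
# No default upstream resolver: names without an allowlist entry fail.
no-resolv
# Answer only on the Actions bridge address.
listen-address=172.30.0.1
# Forward allowlisted names to an upstream resolver (placeholder 1.1.1.1).
server=/github.com/1.1.1.1
# Add resolved addresses to the firewall's allowlist ipset (placeholder name).
ipset=/github.com/shithub-allow
```

One `server=`/`ipset=` pair per allowlist entry is the natural rendering; the firewall service then only has to match destination IPs against the ipset that dnsmasq populates.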

## Rollback

Stop the runner first so it does not claim new jobs:

```sh
systemctl stop shithubd-runner
systemctl disable shithubd-runner
```

If the binary itself is bad, copy a prior archived binary from `/var/lib/shithubd-runner/binaries/` back to `/usr/local/bin/shithubd-runner` and restart the unit. Jobs already claimed by the stopped runner remain visible in the database; S41g adds operator cancel/re-run controls.

For a DigitalOcean test runner, drain or revoke the runner in shithub before destroying the droplet. Then list and delete the specific test host:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete shithub-runner-shared-linux-1 --force
```

If the pool firewall was created only for a disposable test and no runner droplets still use it:

```sh
doctl compute firewall list
doctl compute firewall delete <firewall-id> --force
```