
# Actions runner deploy runbook

This runbook owns the S41d deployment path for `shithubd-runner`: the
Nix-built default image, systemd unit, and Ansible role. The smoke flow
for an already-installed runner lives in [actions-runner.md](./actions-runner.md).

## Prereqs

- The app database is migrated through S41d and the web API has
  `auth.totp_key_b64` configured so job JWTs can be minted.
- Docker is installed on the runner host and the `docker` group exists.
  The runner process needs Docker socket access; treat the host itself
  as trusted even though individual step containers are sandboxed.
- `bin/shithubd-runner` exists locally. `make build` builds both
  `bin/shithubd` and `bin/shithubd-runner` with the same version ldflags.
- The default image has been loaded or published. Build it with:

```sh
nix build ./deploy/runner-images#runnerImage
docker load < result
```
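To confirm the load succeeded, list the image by name. The `runner-nix` repository name below is illustrative; match whatever tag `docker load` prints for your build.

```sh
# Show loaded images and filter for the runner image tag.
docker image ls --format '{{.Repository}}:{{.Tag}}' | grep runner
```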

The committed `deploy/runner-images/flake.lock` pins the nixpkgs input.
Update it deliberately when changing the default image toolchain.

Publishing to GHCR is manual through `.github/workflows/runner-image.yml`
because forks may not control the upstream `ghcr.io/shithub` namespace.
Leave the workflow's `image` input blank to publish under the current
repository's package namespace, or set it explicitly for upstream
publishing.
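As a sketch, the manual dispatch can be driven with the GitHub CLI, assuming the workflow exposes a `workflow_dispatch` trigger; the `image` input name comes from the paragraph above, and the value passed to it here is illustrative.

```sh
# Publish under the current repository's package namespace
# (leave the image input blank):
gh workflow run runner-image.yml

# Or target a namespace explicitly (illustrative value):
gh workflow run runner-image.yml -f image=ghcr.io/shithub/runner-nix
```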

## Register

Run this once from a host that can reach the production database config:

```sh
shithubd admin runner register \
  --name prod-runner-1 \
  --labels self-hosted,linux,ubuntu-latest \
  --capacity 1
```

Store the printed token in `ansible-vault` or the deployment secret store.
Only the token hash is stored in Postgres; the raw token cannot be
recovered later.
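One way to keep the token out of plain-text inventory is `ansible-vault encrypt_string`; the variable name matches the `shithub_runner_token` inventory key, and the token value is a placeholder.

```sh
# Encrypts the value and prints a !vault block to paste into
# group_vars; prompts for the vault password.
ansible-vault encrypt_string --ask-vault-pass \
  'REPLACE_ME' --name 'shithub_runner_token'
```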

## Inventory

Enable the role explicitly. The default is disabled so ordinary app deploys do not start a runner by accident.

```ini
[shithub:vars]
shithub_runner_enabled=true
shithub_runner_token=REPLACE_ME
shithub_runner_labels=self-hosted,linux,ubuntu-latest
shithub_runner_capacity=1
shithub_runner_default_image=ghcr.io/shithub/runner-nix:1.0
shithub_runner_seccomp_profile=/etc/shithubd-runner/seccomp.json
shithub_runner_container_user=65534:65534
shithub_runner_pids_limit=512
```

The role writes non-secret config to
`/etc/shithubd-runner/config.toml` and the registration token to
`/etc/shithubd-runner/runner.env` with mode `0600`.
Keep `shithub_runner_workspace_root` under `/var/lib/shithubd-runner`;
the systemd unit grants runner writes only to that subtree.

`shithub_runner_network_allowlist` defaults to GitHub source/archive
hosts plus Docker Hub registry hosts. Override it when a runner must
fetch from an internal package registry. The role creates the
`shithub-actions` Docker bridge at `172.30.0.1/24`, runs dnsmasq on
that bridge, and sets `engine.dns_servers` to the bridge resolver by
default.
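A quick way to check the bridge after a deploy; the network name and subnet come from the role defaults above, and the `dig` results depend on your allowlist contents.

```sh
# Confirm the Actions bridge exists with the expected subnet.
docker network inspect shithub-actions \
  --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'

# An allowlisted name should resolve via the bridge resolver;
# a non-allowlisted one should fail to resolve.
dig +short @172.30.0.1 github.com
dig +short @172.30.0.1 example.invalid
```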

## Deploy

For the runner role only:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/production site.yml -t shithubd-runner
```
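Before a real run, a check-mode pass shows what the role would change without touching the host; these are standard `ansible-playbook` flags, not runner-specific options.

```sh
# Dry run: report pending changes and show file diffs.
cd deploy/ansible
ansible-playbook -i inventory/production site.yml \
  -t shithubd-runner --check --diff
```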

The role:

- creates the `shithub-runner` system user and joins it to `docker`
- uploads `/usr/local/bin/shithubd-runner`
- renders `/etc/shithubd-runner/config.toml` and `runner.env`
- creates the dedicated Actions Docker network and bridge
- renders `/etc/dnsmasq.d/shithubd-runner.conf` from the network
  allowlist and starts dnsmasq bound to the Actions bridge
- installs `shithub-runner-firewall.service`, which rejects direct-IP
  egress from step containers unless dnsmasq populated the destination
  in the allowlist ipset
- installs the pinned seccomp profile at
  `/etc/shithubd-runner/seccomp.json`
- installs `deploy/systemd/shithubd-runner.service`
- pulls the configured runner image
- enables and starts `shithubd-runner`

## Verify

On the runner host:

```sh
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
```

Then push a workflow with a simple `run:` step:

```yaml
name: ci
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: bash -c "echo hello && exit 0"
```

Expected state:

- a runner heartbeat claims the queued job within one idle poll interval
- the step emits SQL log chunks during execution
- `workflow:finalize_step` uploads
  `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`
- the job and check run complete with conclusion `success`

Repeat with `exit 1`; the check should complete with conclusion
`failure`.

Sandbox smoke checks:

```yaml
name: sandbox-smoke
on: push
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - run: id -u
      - run: test "$(id -u)" = "65534"
      - run: if mkdir /etc/shithub-smoke 2>/dev/null; then exit 1; fi
      - run: if mount -t tmpfs tmpfs /mnt 2>/dev/null; then exit 1; fi
```

Expected state:

- the UID check prints `65534`
- a workflow-level request for `permissions: {shithub-runner-root: write}`
  still runs as `65534`; root opt-in is disabled in the shipped runner
  config until a trusted-workflow policy exists
- writing under `/etc` fails because the root filesystem is read-only
- `mount` fails because the container does not have `CAP_SYS_ADMIN`
- step logs and systemd journal include the configured image, network,
  CPU/memory limits, PID limit, container user, and seccomp profile
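The same sandbox properties can be probed directly on the host with `docker run`, outside the runner. The flags mirror the inventory values above; the image tag is whatever you deployed, and this is an illustrative probe, not the runner's exact invocation.

```sh
IMG=ghcr.io/shithub/runner-nix:1.0   # match shithub_runner_default_image

# Read-only rootfs: the mkdir must fail inside the container.
docker run --rm --read-only --user 65534:65534 "$IMG" \
  mkdir /etc/shithub-smoke && echo "FAIL: rootfs writable" || echo "ok"

# Without CAP_SYS_ADMIN, mount must fail.
docker run --rm --cap-drop ALL --user 65534:65534 "$IMG" \
  mount -t tmpfs tmpfs /mnt && echo "FAIL: mount allowed" || echo "ok"
```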

## Network Allowlist

The runner config carries two separate network controls:

- `runner.network_allowlist`: the host patterns allowed by the
  operator's DNS allowlist resolver.
- `engine.dns_servers`: DNS servers passed to each step container with
  Docker `--dns`.

For a single-host deployment, create a dedicated Docker bridge for
Actions jobs, run dnsmasq bound to that bridge, and set
`shithub_runner_dns_servers` to the bridge address of that resolver.
The Ansible role now does this by default. The rendered dnsmasq config
has no default upstream resolver; names not matching the allowlist fail
DNS resolution.

The firewall service closes the direct-IP bypass: containers on the Actions subnet may send DNS only to the bridge resolver, and other egress is allowed only when the destination IP is present in the dnsmasq-populated ipset. Keep the runner on a separate host from web and database services.
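On the runner host you can watch the allowlist ipset fill as dnsmasq answers queries. The set name `shithub-allow` is an assumption here; check `shithub-runner-firewall.service` for the actual name used.

```sh
# List entries dnsmasq has added for allowlisted hosts
# (set name is illustrative).
ipset list shithub-allow

# Corresponding rules installed by the firewall service.
iptables -S | grep -i shithub
```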

## Rollback

Stop the runner first so it does not claim new jobs:

```sh
systemctl stop shithubd-runner
systemctl disable shithubd-runner
```

If the binary itself is bad, copy a prior archived binary from
`/var/lib/shithubd-runner/binaries/` back to
`/usr/local/bin/shithubd-runner` and restart the unit. Jobs already
claimed by the stopped runner remain visible in the database; S41g adds
operator cancel/re-run controls.
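A minimal restore sequence, assuming the archive directory holds one file per deployed version; `PREVIOUS` is a placeholder for the archived name you pick.

```sh
# Find the archived binary to restore.
ls /var/lib/shithubd-runner/binaries/

# Copy it back with executable permissions and restart.
install -m 0755 /var/lib/shithubd-runner/binaries/PREVIOUS \
  /usr/local/bin/shithubd-runner
systemctl restart shithubd-runner
systemctl status shithubd-runner --no-pager
```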
