# Actions runner deploy runbook

This runbook owns the S41d/S41j deployment path for `shithubd-runner`: the
Nix-built default image, systemd unit, Ansible role, and DigitalOcean runner
pool bootstrap. The smoke flow for an already-installed runner lives in
[actions-runner.md](./actions-runner.md).

## Prereqs

- The app database is migrated through S41d and the web API has
  `auth.totp_key_b64` configured so job JWTs can be minted.
- Docker is installed on the runner host and the `docker` group exists.
  The runner process needs Docker socket access; treat the host itself
  as trusted even though individual step containers are sandboxed.
- For DigitalOcean runner pools: `doctl` and `jq` installed locally,
  authenticated with `doctl auth init`, and a named SSH key uploaded to
  the DigitalOcean account.
- Runner host SSH ingress must be restricted to operator or VPN CIDRs.
  Do not create runner droplets with SSH open to `0.0.0.0/0`.
- `bin/shithubd-runner` exists locally. `make build` builds both
  `bin/shithubd` and `bin/shithubd-runner` with the same version ldflags.
- The default image has been loaded or published. Build it with:

```sh
nix build ./deploy/runner-images#runnerImage
docker load < result
```

The committed `deploy/runner-images/flake.lock` pins the nixpkgs input.
Update it deliberately when changing the default image toolchain.
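
A deliberate update, as a sketch, might look like:

```sh
cd deploy/runner-images
nix flake update          # refresh the pinned inputs, including nixpkgs
nix build .#runnerImage   # rebuild the default image against the new pin
docker load < result      # re-test locally before publishing
```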

Publishing to GHCR is manual through `.github/workflows/runner-image.yml`
because forks may not control the upstream `ghcr.io/shithub` namespace.
Leave the workflow's `image` input blank to publish under the current
repository's package namespace, or set it explicitly for upstream
publishing.
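
With the GitHub CLI available, one way to trigger that manual workflow is
`gh workflow run` (the tag shown is illustrative):

```sh
# Publish under the current repository's package namespace (image input blank).
gh workflow run runner-image.yml

# Or publish to an explicit upstream location.
gh workflow run runner-image.yml -f image=ghcr.io/shithub/runner-nix:1.0
```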

## DigitalOcean pool bootstrap

S41j runner hosts are separate droplets tagged `shithub-actions-runner`.
The app/database host must not double as the arbitrary-code runner host.

Validate the plan first:

```sh
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh --dry-run
```

Create the first shared Linux runner:

```sh
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh
```

Useful overrides:

```sh
POOL_NAME=shared-linux
PROJECT_NAME=shithub-prod
REGION=sfo3
SIZE=s-2vcpu-4gb
IMAGE=ubuntu-24-04-x64
COUNT=1
VPC_UUID=REPLACE_ME_OPTIONAL
```
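
Because the provisioner only creates droplets that are missing (see the list
below), re-running it with a larger `COUNT` grows the pool in place. For
example, to add a second shared Linux runner:

```sh
COUNT=2 \
SSH_KEY_NAME=macbook-pro \
SSH_ALLOWED_CIDRS=203.0.113.4/32 \
./deploy/doctl/provision-actions-runner-pool.sh
```
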
The provisioner:
- creates or reuses the DigitalOcean project;
- creates or reuses a tag-targeted cloud firewall;
- refuses public SSH CIDRs;
- creates missing droplets named `shithub-runner-<pool>-N`;
- applies the `shithub-actions-runner` and pool tags;
- installs a no-secret cloud-init baseline that enables Docker;
- prints machine-readable JSON for operator records.

Generate an Ansible inventory from the DigitalOcean tag:

```sh
./deploy/doctl/generate-actions-runner-inventory.sh \
--output deploy/ansible/inventory/actions-runners
```

The generated file contains per-host `shithub_runner_token` placeholders.
Replace them with values from `shithubd admin runner register`, ideally through
ansible-vault or host_vars rather than committing plaintext inventory.
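
A sketch of vaulting one host's token into host_vars (path and token value
illustrative):

```sh
# Produces a `shithub_runner_token: !vault |` block ready to append.
ansible-vault encrypt_string 'TOKEN_FROM_REGISTER' \
  --name shithub_runner_token \
  >> deploy/ansible/inventory/host_vars/shithub-runner-shared-linux-1.yml
```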

## Register

Run this once from a host that can reach the production database config:

```sh
shithubd admin runner register \
--name prod-runner-1 \
--labels self-hosted,linux,ubuntu-latest,x64 \
--capacity 1 \
--output json
```

Store the returned `token` in ansible-vault or the deployment secret store.
Only the token hash is stored in Postgres; the raw token cannot be
recovered later.
Use `--expires-in` only when the deployment rotates the token before expiry,
because the runner presents the registration token on every heartbeat.
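
Since the raw token is shown only once, capture it at registration time.
Assuming the JSON output carries a top-level `token` field:

```sh
shithubd admin runner register --name prod-runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 --capacity 1 \
  --output json | jq -r .token
```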

For a generated DigitalOcean inventory, register one token per runner host and
use the host name as the runner name:

```sh
shithubd admin runner register \
--name shithub-runner-shared-linux-1 \
--labels self-hosted,linux,ubuntu-latest,x64 \
--capacity 1 \
--output json
```

## Inventory

Enable the role explicitly. The default is disabled so ordinary app
deploys do not start a runner by accident.

```ini
[shithub:vars]
shithub_runner_enabled=true
shithub_runner_token=REPLACE_ME
shithub_runner_labels=self-hosted,linux,ubuntu-latest,x64
shithub_runner_capacity=1
shithub_runner_default_image=ghcr.io/shithub/runner-nix:1.0
shithub_runner_seccomp_profile=/etc/shithubd-runner/seccomp.json
shithub_runner_container_user=65534:65534
shithub_runner_pids_limit=512
```

The role writes non-secret config to
`/etc/shithubd-runner/config.toml` and the registration token to
`/etc/shithubd-runner/runner.env` with mode `0600`.
Keep `shithub_runner_workspace_root` under `/var/lib/shithubd-runner`;
the systemd unit grants runner writes only to that subtree.
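
The shipped unit is authoritative, but the write restriction has the usual
systemd shape, sketched here with standard sandboxing directives:

```ini
# Illustrative fragment only; see deploy/systemd/shithubd-runner.service.
[Service]
ProtectSystem=strict
ReadWritePaths=/var/lib/shithubd-runner
```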

`shithub_runner_network_allowlist` defaults to GitHub source/archive
hosts plus Docker Hub registry hosts. Override it when a runner must
fetch from an internal package registry. The role creates the
`shithub-actions` Docker bridge at `172.30.0.1/24`, runs dnsmasq on
that bridge, and sets `engine.dns_servers` to the bridge resolver by
default.
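
For example, to also admit an internal registry, extend the allowlist in
inventory. The default host list lives in the role; the hosts below are
illustrative, assuming the same comma-separated format as
`shithub_runner_labels`:

```ini
shithub_runner_network_allowlist=github.com,codeload.github.com,registry-1.docker.io,auth.docker.io,registry.internal.example.com
```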

## Deploy

For the runner role only:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/production site.yml -t shithubd-runner
```

For the generated DigitalOcean runner inventory:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/actions-runners site.yml -t shithubd-runner
```
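
Either run can be previewed with Ansible's check mode before applying, though
tasks that depend on earlier changes may not simulate cleanly:

```sh
ansible-playbook -i inventory/actions-runners site.yml -t shithubd-runner --check --diff
```
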
The role:
- creates the `shithub-runner` system user and joins it to `docker`
- uploads `/usr/local/bin/shithubd-runner`
- renders `/etc/shithubd-runner/config.toml` and `runner.env`
- creates the dedicated Actions Docker network and bridge
- renders `/etc/dnsmasq.d/shithubd-runner.conf` from the network
  allowlist and starts dnsmasq bound to the Actions bridge
- installs `shithub-runner-firewall.service`, which rejects direct-IP
  egress from step containers unless dnsmasq populated the destination
  in the allowlist ipset
- installs the pinned seccomp profile at
  `/etc/shithubd-runner/seccomp.json`
- installs `deploy/systemd/shithubd-runner.service`
- pulls the configured runner image
- enables and starts `shithubd-runner`

## Verify

On the runner host:

```sh
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
```

Then push a workflow with a simple `run:` step:

```yaml
name: ci
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: bash -c "echo hello && exit 0"
```

Expected state:

- a runner heartbeat claims the queued job within one idle poll interval
- the step emits SQL log chunks during execution
- `workflow:finalize_step` uploads
  `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`
- the job and check run complete with conclusion `success`

Repeat with `exit 1`; the check should complete with conclusion
`failure`.
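
A minimal failing variant (workflow name illustrative):

```yaml
name: ci-fail
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: bash -c "echo hello && exit 1"
```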

Before declaring a shared pool available to arbitrary repositories, run the
[arbitrary repository smoke](./actions-runner.md#arbitrary-repository-smoke)
checklist against scratch and at least one additional repository. The single
repo check above proves the runner process works; the arbitrary-repo smoke
proves label routing, checkout scoping, archived logs, check runs, and negative
queue/secret cases across normal repositories.

Sandbox smoke checks:

```yaml
name: sandbox-smoke
on: push
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - run: id -u
      - run: test "$(id -u)" = "65534"
      - run: if mkdir /etc/shithub-smoke 2>/dev/null; then exit 1; fi
      - run: if mount -t tmpfs tmpfs /mnt 2>/dev/null; then exit 1; fi
```

Expected state:

- the UID check prints `65534`
- a workflow-level request for `permissions: {shithub-runner-root: write}`
  still runs as `65534`; root opt-in is disabled in the shipped runner
  config until a trusted-workflow policy exists
- writing under `/etc` fails because the root filesystem is read-only
- `mount` fails because the container does not have `CAP_SYS_ADMIN`
- step logs and systemd journal include the configured image, network,
  CPU/memory limits, PID limit, container user, and seccomp profile

## Network Allowlist

The runner config carries two separate network controls:

- `runner.network_allowlist`: the host patterns allowed by the
  operator's DNS allowlist resolver.
- `engine.dns_servers`: DNS servers passed to each step container with
  Docker `--dns`.

For a single-host deployment, create a dedicated Docker bridge for
Actions jobs, run dnsmasq bound to that bridge, and set
`shithub_runner_dns_servers` to the bridge address of that resolver.
The Ansible role now does this by default. The rendered dnsmasq config
has no default upstream resolver; names not matching the allowlist fail
DNS resolution.

The firewall service closes the direct-IP bypass: containers on the
Actions subnet may send DNS only to the bridge resolver, and other
egress is allowed only when the destination IP is present in the
dnsmasq-populated ipset. Keep the runner on a separate host from web
and database services.
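
The rendered config and rules belong to the role; as a sketch of the
mechanism, dnsmasq can populate an ipset as it answers allowlisted queries,
and a reject rule covers everything else (interface, set, and upstream names
illustrative):

```sh
# dnsmasq shape: forward only allowlisted names and record the resolved
# IPs in an ipset; with no default upstream, other names fail to resolve.
#   server=/github.com/1.1.1.1
#   ipset=/github.com/shithub-allow

# Firewall shape: reject forwarded egress from the Actions bridge unless
# dnsmasq has already added the destination IP to the set.
iptables -A FORWARD -i shithub-actions \
  -m set ! --match-set shithub-allow dst -j REJECT
```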

## Rollback

Stop the runner first so it does not claim new jobs:

```sh
systemctl stop shithubd-runner
systemctl disable shithubd-runner
```

If the binary itself is bad, copy a prior archived binary from
`/var/lib/shithubd-runner/binaries/` back to
`/usr/local/bin/shithubd-runner` and restart the unit. Jobs already
claimed by the stopped runner remain visible in the database; S41g adds
operator cancel/re-run controls.
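
A sketch, assuming the archive keeps version-named copies:

```sh
ls /var/lib/shithubd-runner/binaries/
cp /var/lib/shithubd-runner/binaries/<previous-binary> /usr/local/bin/shithubd-runner
systemctl restart shithubd-runner
```
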
For a DigitalOcean test runner, drain or revoke the runner in shithub before destroying the droplet. Then list and delete the specific test host:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete shithub-runner-shared-linux-1 --force
```
If the pool firewall was created only for a disposable test and no runner droplets still use it:

```sh
doctl compute firewall list
doctl compute firewall delete <firewall-id> --force
```