
# Actions runner deploy runbook

This runbook owns the S41d deployment path for `shithubd-runner`: the Nix-built default image, systemd unit, and Ansible role. The smoke flow for an already-installed runner lives in [actions-runner.md](./actions-runner.md).

## Prereqs

- The app database is migrated through S41d and the web API has `auth.totp_key_b64` configured so job JWTs can be minted.
- Docker is installed on the runner host and the `docker` group exists. The runner process needs Docker socket access; treat the host itself as trusted even though individual step containers are sandboxed.
- `bin/shithubd-runner` exists locally. `make build` builds both `bin/shithubd` and `bin/shithubd-runner` with the same version ldflags.
- The default image has been loaded or published. Build it with:

```sh
nix build ./deploy/runner-images#runnerImage
docker load < result
```

The committed `deploy/runner-images/flake.lock` pins the nixpkgs input. Update it deliberately when changing the default image toolchain.
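A minimal update flow, assuming current flake-enabled Nix tooling:

```sh
# Refresh the pinned nixpkgs input, rebuild, and review the lock diff
# before committing the bump.
cd deploy/runner-images
nix flake update
nix build .#runnerImage
git diff flake.lock
```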

Publishing to GHCR is manual through `.github/workflows/runner-image.yml` because forks may not control the upstream `ghcr.io/shithub` namespace. Leave the workflow's `image` input blank to publish under the current repository's package namespace, or set it explicitly for upstream publishing.
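One way to trigger the publish, assuming the GitHub CLI is available; the `image` input is the workflow's own, but the value shown for it is a placeholder:

```sh
# Publish under the current repository's package namespace.
gh workflow run runner-image.yml

# Or target a namespace explicitly (placeholder value).
gh workflow run runner-image.yml -f image=ghcr.io/shithub/runner-nix
```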

## Register

Run this once from a host that can reach the production database config:

```sh
shithubd admin runner register \
  --name prod-runner-1 \
  --labels self-hosted,linux,ubuntu-latest \
  --capacity 1
```

Store the printed token in ansible-vault or the deployment secret store. Only the token hash is stored in Postgres; the raw token cannot be recovered later.
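For the ansible-vault route, `encrypt_string` keeps the raw token out of plain files; the variable name matches the Inventory section below:

```sh
# Reads the token from stdin so it never lands in shell history;
# paste the printed block into the vaulted group vars.
ansible-vault encrypt_string --stdin-name shithub_runner_token
```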

## Inventory

Enable the role explicitly. The default is disabled so ordinary app deploys do not start a runner by accident.

```ini
[shithub:vars]
shithub_runner_enabled=true
shithub_runner_token=REPLACE_ME
shithub_runner_labels=self-hosted,linux,ubuntu-latest
shithub_runner_capacity=1
shithub_runner_default_image=ghcr.io/shithub/runner-nix:1.0
shithub_runner_seccomp_profile=/etc/shithubd-runner/seccomp.json
shithub_runner_container_user=65534:65534
shithub_runner_pids_limit=512
shithub_runner_dns_servers=172.30.0.1
```

The role writes non-secret config to `/etc/shithubd-runner/config.toml` and the registration token to `/etc/shithubd-runner/runner.env` with mode `0600`. Keep `shithub_runner_workspace_root` under `/var/lib/shithubd-runner`; the systemd unit grants the runner write access only to that subtree.
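For orientation, the rendered config has roughly the shape below. Only `runner.network_allowlist` and `engine.dns_servers` are named elsewhere in this runbook; the other keys are assumptions that mirror the inventory variables, so treat the rendered file as authoritative:

```toml
# Illustrative sketch, not the template's exact output.
[runner]
labels = ["self-hosted", "linux", "ubuntu-latest"]
capacity = 1
workspace_root = "/var/lib/shithubd-runner/workspaces"     # assumed key
network_allowlist = ["github.com", "codeload.github.com"]  # assumed defaults

[engine]
default_image = "ghcr.io/shithub/runner-nix:1.0"       # assumed key
dns_servers = ["172.30.0.1"]
seccomp_profile = "/etc/shithubd-runner/seccomp.json"  # assumed key
container_user = "65534:65534"                         # assumed key
pids_limit = 512                                       # assumed key
```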

`shithub_runner_network_allowlist` defaults to GitHub source/archive hosts plus Docker Hub registry hosts. Override it when a runner must fetch from an internal package registry. `shithub_runner_dns_servers` is empty by default; set it only after a DNS allowlist resolver exists on the runner network.

## Deploy

For the runner role only:

```sh
make build
cd deploy/ansible
ansible-playbook -i inventory/production site.yml -t shithubd-runner
```

The role:

- creates the `shithub-runner` system user and joins it to `docker`
- uploads `/usr/local/bin/shithubd-runner`
- renders `/etc/shithubd-runner/config.toml` and `runner.env`
- renders `/etc/shithubd-runner/dnsmasq.conf` from the network allowlist for operators who run a local DNS allowlist resolver
- installs the pinned seccomp profile at `/etc/shithubd-runner/seccomp.json`
- installs `deploy/systemd/shithubd-runner.service` (hardening sketched below)
- pulls the configured runner image
- enables and starts `shithubd-runner`
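The committed unit is authoritative; as a sketch of the write restriction described under Inventory, the relevant hardening looks roughly like this (directives beyond `ReadWritePaths` are assumptions):

```ini
# Illustrative excerpt of deploy/systemd/shithubd-runner.service.
[Service]
User=shithub-runner
EnvironmentFile=/etc/shithubd-runner/runner.env
ExecStart=/usr/local/bin/shithubd-runner
# ProtectSystem=strict mounts the OS read-only for the service;
# ReadWritePaths re-grants writes only to the workspace subtree.
ProtectSystem=strict
ReadWritePaths=/var/lib/shithubd-runner
Restart=on-failure
```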

## Verify

On the runner host:

```sh
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
```

Then push a workflow with a simple `run:` step:

```yaml
name: ci
on: push
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: bash -c "echo hello && exit 0"
```

Expected state:

- a runner heartbeat claims the queued job within one idle poll interval
- the step emits SQL log chunks during execution
- `workflow:finalize_step` uploads `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`
- the job and check run complete with conclusion `success`

Repeat with `exit 1`; the check should complete with conclusion `failure`.

Sandbox smoke checks:

```yaml
name: sandbox-smoke
on: push
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - run: id -u
      - run: test "$(id -u)" = "65534"
      - run: if mkdir /etc/shithub-smoke 2>/dev/null; then exit 1; fi
      - run: if mount -t tmpfs tmpfs /mnt 2>/dev/null; then exit 1; fi
```

Expected state:

- the UID check prints `65534`
- writing under `/etc` fails because the root filesystem is read-only
- `mount` fails because the container does not have `CAP_SYS_ADMIN`
- step logs and systemd journal include the configured image, network, CPU/memory limits, PID limit, container user, and seccomp profile
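To reproduce a step's sandbox by hand, the settings above map onto standard Docker flags. This is a hand-written equivalent using the inventory values, not the runner's literal invocation (`--cap-drop ALL` is an assumption; `mount` fails without `CAP_SYS_ADMIN` either way):

```sh
docker run --rm \
  --user 65534:65534 \
  --read-only \
  --cap-drop ALL \
  --pids-limit 512 \
  --security-opt seccomp=/etc/shithubd-runner/seccomp.json \
  --dns 172.30.0.1 \
  ghcr.io/shithub/runner-nix:1.0 \
  id -u   # should print 65534
```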

## Network Allowlist

The runner config carries two separate network controls:

- `runner.network_allowlist`: the host patterns allowed by the operator's DNS allowlist resolver.
- `engine.dns_servers`: DNS servers passed to each step container with Docker `--dns`.

For a single-host deployment, create a dedicated Docker bridge for Actions jobs, run dnsmasq bound to that bridge, render `/etc/shithubd-runner/dnsmasq.conf`, and set `shithub_runner_dns_servers` to the bridge address of that resolver. The rendered dnsmasq config has no default upstream resolver; names not matching the allowlist fail DNS resolution.
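The rendered file is derived from the allowlist and looks roughly like this; the listen address matches the inventory example, and the upstream resolver addresses are placeholders:

```ini
# Illustrative dnsmasq allowlist config.
listen-address=172.30.0.1
bind-interfaces
no-resolv                        # no default upstream: unmatched names fail
server=/github.com/1.1.1.1       # forward allowlisted domains upstream
server=/codeload.github.com/1.1.1.1
server=/registry-1.docker.io/1.1.1.1
```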

DNS filtering is not a complete egress boundary by itself. Block direct-IP egress from the Actions bridge with host firewall rules, and allow only DNS to the resolver plus established outbound connections opened by that resolver. Keep the runner on a separate host from web and database services.
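One concrete pattern, with every name hypothetical (the `br-actions` bridge, the `actions-ok` set, and dnsmasq's `ipset=` lines are not part of the role): let dnsmasq populate an ipset with resolved allowlist IPs and default-deny everything else in `DOCKER-USER`:

```sh
# Hypothetical egress lockdown; adapt names to the host.
ipset create actions-ok hash:ip
# Mirror each allowlist entry in dnsmasq.conf, e.g.
#   ipset=/github.com/actions-ok
# so resolved IPs enter the set before the job connects.
# DOCKER-USER filters forwarded container traffic; -I prepends, so the
# DROP is inserted first and the ACCEPT lands above it.
iptables -I DOCKER-USER -i br-actions -j DROP
iptables -I DOCKER-USER -i br-actions -m set --match-set actions-ok dst -j ACCEPT
# DNS to a resolver bound on the bridge address itself traverses the
# INPUT chain, not DOCKER-USER, and needs its own accept rule there.
```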

## Rollback

Stop the runner first so it does not claim new jobs:

```sh
systemctl stop shithubd-runner
systemctl disable shithubd-runner
```

If the binary itself is bad, copy a prior archived binary from `/var/lib/shithubd-runner/binaries/` back to `/usr/local/bin/shithubd-runner` and restart the unit. Jobs already claimed by the stopped runner remain visible in the database; S41g adds operator cancel/re-run controls.
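A minimal restore, assuming the archive keeps prior versions under `binaries/` (names vary; list the directory and substitute a known-good file):

```sh
ls /var/lib/shithubd-runner/binaries/
# <known-good> is a placeholder for a file from the listing above.
install -m 0755 /var/lib/shithubd-runner/binaries/<known-good> /usr/local/bin/shithubd-runner
systemctl enable --now shithubd-runner
```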
