@@ -0,0 +1,130 @@ |
| 1 | +# Actions runner deploy runbook |
| 2 | + |
| 3 | +This runbook owns the S41d deployment path for `shithubd-runner`: the |
| 4 | +Nix-built default image, systemd unit, and Ansible role. The smoke flow |
| 5 | +for an already-installed runner lives in [actions-runner.md](./actions-runner.md). |
| 6 | + |
| 7 | +## Prereqs |
| 8 | + |
| 9 | +- The app database is migrated through S41d and the web API has |
| 10 | + `auth.totp_key_b64` configured so job JWTs can be minted. |
| 11 | +- Docker is installed on the runner host and the `docker` group exists. |
| 12 | + S41e narrows the sandbox; S41d runner hosts must be treated as trusted. |
| 13 | +- `bin/shithubd-runner` exists locally. `make build` builds both |
| 14 | + `bin/shithubd` and `bin/shithubd-runner` with the same version ldflags. |
| 15 | +- The default image has been loaded or published. Build it with: |
| 16 | + |
| 17 | +```sh |
| 18 | +nix build ./deploy/runner-images#runnerImage |
| 19 | +docker load < result |
| 20 | +``` |
| 21 | + |
| 22 | +The committed `deploy/runner-images/flake.lock` pins the nixpkgs input. |
| 23 | +Update it deliberately when changing the default image toolchain. |
| 24 | + |
| 25 | +Publishing to GHCR is manual through `.github/workflows/runner-image.yml` |
| 26 | +because forks may not control the upstream `ghcr.io/shithub` namespace. |
| 27 | +Leave the workflow's `image` input blank to publish under the current |
| 28 | +repository's package namespace, or set it explicitly for upstream |
| 29 | +publishing. |
| 30 | + |
| 31 | +## Register |
| 32 | + |
| 33 | +Run this once from a host that can reach the production database config: |
| 34 | + |
| 35 | +```sh |
| 36 | +shithubd admin runner register \ |
| 37 | + --name prod-runner-1 \ |
| 38 | + --labels self-hosted,linux,ubuntu-latest \ |
| 39 | + --capacity 1 |
| 40 | +``` |
| 41 | + |
| 42 | +Store the printed token in ansible-vault or the deployment secret store. |
| 43 | +Only the token hash is stored in Postgres; the raw token cannot be |
| 44 | +recovered later. |
| 45 | + |
| 46 | +## Inventory |
| 47 | + |
| 48 | +Enable the role explicitly. It is disabled by default so ordinary app |
| 49 | +deploys do not start a runner by accident. |
| 50 | + |
| 51 | +```ini |
| 52 | +[shithub:vars] |
| 53 | +shithub_runner_enabled=true |
| 54 | +shithub_runner_token=REPLACE_ME |
| 55 | +shithub_runner_labels=self-hosted,linux,ubuntu-latest |
| 56 | +shithub_runner_capacity=1 |
| 57 | +shithub_runner_default_image=ghcr.io/shithub/runner-nix:1.0 |
| 58 | +``` |
| 59 | + |
| 60 | +The role writes non-secret config to |
| 61 | +`/etc/shithubd-runner/config.toml` and the registration token to |
| 62 | +`/etc/shithubd-runner/runner.env` with mode `0600`. |
| 63 | +Keep `shithub_runner_workspace_root` under `/var/lib/shithubd-runner`; |
| 64 | +the systemd unit grants runner writes only to that subtree. |
| 65 | + |
| 66 | +## Deploy |
| 67 | + |
| 68 | +For the runner role only: |
| 69 | + |
| 70 | +```sh |
| 71 | +make build |
| 72 | +cd deploy/ansible |
| 73 | +ansible-playbook -i inventory/production site.yml -t shithubd-runner |
| 74 | +``` |
| 75 | + |
| 76 | +The role: |
| 77 | + |
| 78 | +- creates the `shithub-runner` system user and joins it to `docker` |
| 79 | +- uploads the locally built binary to `/usr/local/bin/shithubd-runner` |
| 80 | +- renders `/etc/shithubd-runner/config.toml` and `runner.env` |
| 81 | +- installs `deploy/systemd/shithubd-runner.service` |
| 82 | +- pulls the configured runner image |
| 83 | +- enables and starts `shithubd-runner` |
| 84 | + |
| 85 | +## Verify |
| 86 | + |
| 87 | +On the runner host: |
| 88 | + |
| 89 | +```sh |
| 90 | +systemctl status shithubd-runner |
| 91 | +journalctl -u shithubd-runner -n 100 --no-pager |
| 92 | +``` |
| 93 | + |
| 94 | +Then push a workflow with a simple `run:` step: |
| 95 | + |
| 96 | +```yaml |
| 97 | +name: ci |
| 98 | +on: push |
| 99 | +jobs: |
| 100 | + build: |
| 101 | + runs-on: ubuntu-latest |
| 102 | + steps: |
| 103 | + - run: bash -c "echo hello && exit 0" |
| 104 | +``` |
| 105 | + |
| 106 | +Expected state: |
| 107 | + |
| 108 | +- a runner heartbeat claims the queued job within one idle poll interval |
| 109 | +- the step writes log chunks to SQL during execution |
| 110 | +- `workflow:finalize_step` uploads |
| 111 | + `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log` |
| 112 | +- the job and check run complete with conclusion `success` |
| 113 | + |
| 114 | +Repeat with `exit 1`; the check should complete with conclusion |
| 115 | +`failure`. |
| 116 | + |
| 117 | +## Rollback |
| 118 | + |
| 119 | +Stop the runner first so it does not claim new jobs: |
| 120 | + |
| 121 | +```sh |
| 122 | +systemctl stop shithubd-runner |
| 123 | +systemctl disable shithubd-runner |
| 124 | +``` |
| 125 | + |
| 126 | +If the binary itself is bad, copy a prior archived binary from |
| 127 | +`/var/lib/shithubd-runner/binaries/` back to |
| 128 | +`/usr/local/bin/shithubd-runner`, then re-enable and start the unit. Jobs already |
| 129 | +claimed by the stopped runner remain visible in the database; S41g adds |
| 130 | +operator cancel/re-run controls. |
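The binary restore described in the Rollback section could be scripted roughly as follows. This is a minimal sketch, not part of the role: the function name `restore_runner_binary`, the temp-file suffix, and the archived file name are all hypothetical, and the runbook does not say how files under `/var/lib/shithubd-runner/binaries/` are named. Only the two paths come from the runbook.

```sh
#!/bin/sh
# Sketch: put an archived runner binary back in place.
# Usage (archive file name is a placeholder, pick a real one from the directory):
#   restore_runner_binary /var/lib/shithubd-runner/binaries/<archived-binary>

restore_runner_binary() {
    src=$1                                     # archived binary to restore
    dst=${2:-/usr/local/bin/shithubd-runner}   # install target from the runbook

    # Refuse to install a missing or empty file.
    if [ ! -s "$src" ]; then
        echo "archived binary not found or empty: $src" >&2
        return 1
    fi

    # Copy next to the target and rename, so the target is never half-written.
    tmp="$dst.rollback.$$"
    if cp "$src" "$tmp" && chmod 0755 "$tmp" && mv -f "$tmp" "$dst"; then
        return 0
    fi

    # Clean up the temp copy if any step failed.
    rm -f "$tmp"
    return 1
}
```

Because the rollback steps disable the unit, a restore would typically be followed by `systemctl enable --now shithubd-runner`, which re-enables and starts it in one command.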