# Actions runner smoke runbook
This runbook validates the runner-facing Actions path. shithubd-runner
now claims jobs, performs scoped actions/checkout@v4, and executes
containerized run: steps through Docker or Podman. The curl flow below
remains useful for token/replay debugging.
For host provisioning and the systemd/Ansible path, see runner-deploy.md.
Prereqs:
- Database migrations are current through 0057_workflow_job_secret_masks.sql.
- SHITHUB_TOTP_KEY or auth.totp_key_b64 is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under .shithub/workflows/*.yml with
  runs-on: ubuntu-latest, and a push/dispatch has enqueued a run.
  Checkout and run: steps are executable; artifact aliases remain
  reserved until artifact transfer lands.
runs-on is a runner-label selector, not a hard-coded image name.
A workflow that says runs-on: ubuntu-latest can be claimed by any
runner advertising the ubuntu-latest label. The container image is
selected by the runner host's engine.default_image setting; the
reproducible Nix-built image is the default, but operators can point it
at another OCI image when they need closer Ubuntu parity.
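
The selector idea can be illustrated with a small shell check. This is only a sketch of the label lookup described above: `runner_can_claim` is a hypothetical helper, and the server's real matching logic is not shown in this runbook.

```sh
# Hypothetical helper: succeeds when the label a job asks for (its
# runs-on value) appears in the runner's comma-separated label list.
runner_can_claim() {
  runner_labels=$1   # e.g. "self-hosted,linux,ubuntu-latest"
  required=$2        # e.g. "ubuntu-latest"
  # Wrap both sides in commas so "linux" cannot match "linux-arm".
  case ",$runner_labels," in
    *",$required,"*) return 0 ;;
    *) return 1 ;;
  esac
}

runner_can_claim "self-hosted,linux,ubuntu-latest" "ubuntu-latest" \
  && echo "claimable"
```

Note that nothing in this check refers to an image; the image comes from engine.default_image on whichever runner wins the claim.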
Register a runner:
shithubd admin runner register \
--name runner-1 \
--labels self-hosted,linux,ubuntu-latest \
--capacity 1
Save the printed token:
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
Run the binary:
shithubd-runner run \
--server-url "$BASE" \
--token "$RUNNER_TOKEN" \
--labels self-hosted,linux,ubuntu-latest \
--workspace-root /var/lib/shithubd-runner/workspaces \
--network shithub-actions \
--dns-servers 172.30.0.1
Equivalent config file:
[server]
base_url = "https://shithub.example"
[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
"api.github.com",
"auth.docker.io",
"codeload.github.com",
"github.com",
"objects.githubusercontent.com",
"production.cloudflare.docker.com",
"registry-1.docker.io",
"*.githubusercontent.com",
]
[engine]
kind = "docker"
default_image = "ghcr.io/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
The config path defaults to /etc/shithubd-runner/config.toml.
Environment variables use the SHITHUB_RUNNER_ prefix, for example
SHITHUB_RUNNER_TOKEN or SHITHUB_RUNNER_SERVER__BASE_URL.
The Ansible runner role creates the shithub-actions bridge, runs the
allowlist resolver at 172.30.0.1, and installs firewall rules that
reject direct-IP egress from step containers. If you run the binary
without the role, provision equivalent network controls before pointing
workflows at the runner.
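
When debugging why a step cannot reach a host, it can help to check the hostname against the network_allowlist entries by hand. The sketch below assumes that an entry such as `*.githubusercontent.com` covers any subdomain but not the bare domain; the resolver's actual matching rules are not documented here, and `host_matches` and `host_allowed` are hypothetical helpers.

```sh
# Hypothetical check of one hostname against one allowlist entry.
# Assumption: "*.example.com" matches subdomains only; other entries
# must match exactly.
host_matches() {
  host=$1
  pattern=$2
  case "$pattern" in
    \*.*)
      suffix=${pattern#\*}   # "*.githubusercontent.com" -> ".githubusercontent.com"
      case "$host" in
        *"$suffix") return 0 ;;
      esac
      return 1 ;;
    *)
      [ "$host" = "$pattern" ] ;;
  esac
}

# Succeeds if the candidate hostname matches any of the given entries.
host_allowed() {
  candidate=$1
  shift
  for entry in "$@"; do
    if host_matches "$candidate" "$entry"; then
      return 0
    fi
  done
  return 1
}

host_allowed raw.githubusercontent.com github.com "*.githubusercontent.com" \
  && echo "allowed"
```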
## Curl token smoke
Claim a job:
curl -fsS "$BASE/api/v1/runners/heartbeat" \
-H "Authorization: Bearer $RUNNER_TOKEN" \
-H "Content-Type: application/json" \
-d '{"labels":["self-hosted","linux","ubuntu-latest"],"capacity":1}' \
| tee /tmp/shithub-claim.json
Extract the job token and id:
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
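
Because `jq -r` prints the literal string `null` for a missing field, a heartbeat that did not hand out a job can leave `JOB_ID=null` and produce confusing 404s later. A small guard like the one below catches that early. It is a sketch: `require_value` is a hypothetical helper, and it assumes that a response without an assigned job lacks the `.job.id` and `.token` fields.

```sh
# Hypothetical guard: fail loudly when an extracted value is empty or
# the string "null" (what jq -r prints for an absent field).
require_value() {
  name=$1
  value=$2
  if [ -z "$value" ] || [ "$value" = "null" ]; then
    echo "claim response has no $name; was a job actually assigned?" >&2
    return 1
  fi
}

# Usage after the exports above:
#   require_value JOB_ID "$JOB_ID" && require_value JOB_TOKEN "$JOB_TOKEN"
```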
Append a log chunk:
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
-H "Authorization: Bearer $JOB_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
| tee /tmp/shithub-log.json
export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
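
One portability note for longer chunks: GNU `base64` wraps its output at 76 columns by default, which would place raw newlines inside the JSON string. The sketch below strips them so the same command works with both GNU and BSD `base64`; `encode_chunk` is a hypothetical helper name, and the `seq` and `chunk` fields are the ones used in the request above.

```sh
# Base64-encode stdin onto a single line, regardless of whether the
# local base64 implementation wraps long output.
encode_chunk() {
  base64 | tr -d '\n'
}

chunk=$(printf 'hello from curl\n' | encode_chunk)
# Build the request body for the logs endpoint.
printf '{"seq":0,"chunk":"%s"}\n' "$chunk"
```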
Complete the job:
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
-H "Authorization: Bearer $JOB_TOKEN" \
-H "Content-Type: application/json" \
-d '{"status":"completed","conclusion":"success"}'
Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
-H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
-X POST
If the job is still queued, it becomes cancelled immediately. If it is
running, the next runner cancel check returns {"cancelled":true}, the
runner kills the active container, and the terminal job status becomes
cancelled.
Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
-H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
-X POST
Expected response includes a new run_id, the new run_index, and
parent_run_id equal to the source run. Confirm the new row has the
same head_sha as the source run.
Timeout smoke: create a workflow with a short deadline and a long-running step:
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
Expected results:
- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes status=completed, conclusion=timed_out.
- The job row becomes status=completed, conclusion=timed_out.
- /metrics increments shithub_actions_step_timeouts_total.
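
To confirm the counter actually moved, read it before and after the probe. The sketch below assumes that `/metrics` is served in the Prometheus text exposition format, without trailing timestamps, and is reachable at `$BASE/metrics`; neither is stated in this runbook. `metric_value` is a hypothetical helper that sums every series of one metric name.

```sh
# Hypothetical helper: sum all samples of one metric from Prometheus
# text format on stdin. Assumes the value is the last field on the line.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comments
    {
      series = $1
      sub(/\{.*/, "", series)          # drop any {label="..."} part
      if (series == name) sum += $NF
    }
    END { print sum + 0 }              # print 0 when nothing matched
  '
}

# Self-contained demo on sample input; prints 3.
printf '%s\n' \
  '# TYPE shithub_actions_step_timeouts_total counter' \
  'shithub_actions_step_timeouts_total 3' \
  | metric_value shithub_actions_step_timeouts_total

# Against a live server (URL is an assumption):
#   curl -fsS "$BASE/metrics" | metric_value shithub_actions_step_timeouts_total
```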
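Replay check: reusing a job token that has already been consumed must
fail with 401 because its jti is already present in runner_jwt_used.
The command below replays the next_token from the log response, which
the completion call above already used.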
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
-H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
-H "Content-Type: application/json" \
-d '{"status":"running"}'
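
For a scripted version of this check, curl's `-w '%{http_code}'` write-out can capture just the status code so it can be compared against 401. This is a sketch around the same endpoint and body as the command above; `status_code_for_token` and `expect_status` are hypothetical helper names.

```sh
# Send the same status update with the given token and print only the
# HTTP status code. Needs BASE and JOB_ID from the earlier steps.
status_code_for_token() {
  curl -s -o /dev/null -w '%{http_code}' \
    "$BASE/api/v1/jobs/$JOB_ID/status" \
    -H "Authorization: Bearer $1" \
    -H "Content-Type: application/json" \
    -d '{"status":"running"}'
}

# Compare an observed status code with the expected one.
expect_status() {
  want=$1
  got=$2
  if [ "$got" != "$want" ]; then
    echo "expected HTTP $want, got $got" >&2
    return 1
  fi
}

# Usage (not run here, since it needs a live server):
#   spent=$(jq -r '.next_token' /tmp/shithub-log.json)
#   expect_status 401 "$(status_code_for_token "$spent")"
```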
Expected results:
- workflow_jobs.status = completed and conclusion success.
- The parent workflow_runs row rolls up to completed/success when all
  jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- /metrics includes runner registration, heartbeat, JWT, job
  cancellation, log-scrub, step-timeout, and retention counters.
## Retention Sweep
The daily housekeeping timer enqueues workflow:cleanup at 03:30 UTC,
after the 03:17 backup window:
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
Manual smoke:
shithubd admin run-job workflow:cleanup
Expected behavior:
- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their actions/runs/...
  objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned;
  pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- /metrics exposes
  shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}.
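
When checking by hand which rows a sweep should have touched, it helps to turn the three retention windows into cutoff timestamps. The sketch below assumes each window is measured in whole 24-hour days back from the moment the sweep runs; the job's exact boundary handling is not described here, and `retention_cutoff` is a hypothetical helper.

```sh
# Hypothetical helper: epoch-seconds cutoff for a retention window.
retention_cutoff() {
  now=$1    # current time, seconds since the epoch
  days=$2   # retention window in days
  echo $(( now - days * 86400 ))
}

now=$(date +%s)
echo "log chunks older than:   $(retention_cutoff "$now" 7)"
echo "JWT rows older than:     $(retention_cutoff "$now" 30)"
echo "unpinned runs older than: $(retention_cutoff "$now" 365)"
```

Rows with a timestamp below the printed cutoff are the candidates for deletion under that assumption.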