# Actions runner smoke runbook

This runbook validates the runner-facing Actions path. `shithubd-runner` now claims jobs, performs scoped `actions/checkout@v4`, and executes containerized `run:` steps through Docker or Podman. The curl flow below remains useful for token/replay debugging.

For host provisioning and the systemd/Ansible path, see [runner-deploy.md](./runner-deploy.md).

Prereqs:

- Database migrations are current through `0057_workflow_job_secret_masks.sql`.
- `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under `.shithub/workflows/*.yml` with `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run. Checkout and `run:` steps are executable; artifact aliases remain reserved until artifact transfer lands.

`runs-on` is a runner-label selector, not a hard-coded image name. A workflow that says `runs-on: ubuntu-latest` can be claimed by any runner advertising the `ubuntu-latest` label. The container image is selected by the runner host's `engine.default_image` setting; the reproducible Nix-built image is the default, but operators can point it at another OCI image when they need closer Ubuntu parity.
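The selector behaviour can be pictured as a simple membership test. The sketch below is an illustration only: `runner_can_claim` is a hypothetical helper, and it assumes a job is claimable when its single `runs-on` label appears in the runner's comma-separated label list. The server's actual matching rules may differ.

```sh
# Hypothetical helper, not part of shithubd: succeeds when the wanted
# runs-on label appears in a comma-separated list of runner labels.
runner_can_claim() {
  wanted=$1
  labels=$2
  # Wrapping both sides in commas prevents partial matches such as
  # "ubuntu" matching "ubuntu-latest".
  case ",$labels," in
    *",$wanted,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# The label only decides who may claim the job; the image still comes
# from engine.default_image on the runner host.
if runner_can_claim ubuntu-latest "self-hosted,linux,ubuntu-latest"; then
  echo "claimable"
fi
```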

Register a runner:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest \
  --capacity 1
```

Save the printed token:

```sh
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
```

Run the binary:

```sh
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
```

Equivalent config file:

```toml
[server]
base_url = "https://shithub.example"

[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]

[engine]
kind = "docker"
default_image = "ghcr.io/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
```

The config path defaults to `/etc/shithubd-runner/config.toml`. Environment variables use the `SHITHUB_RUNNER_` prefix, for example `SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`.
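When a setting does not seem to take effect, it can help to see which `SHITHUB_RUNNER_` variables the runner process would inherit. The sketch below is a debugging aid only; `list_runner_env` is a hypothetical helper, and it masks values so the token is not echoed to the terminal.

```sh
# Hypothetical helper: list SHITHUB_RUNNER_* variables present in the
# environment, replacing each value with a placeholder.
list_runner_env() {
  env | sed -n 's/^\(SHITHUB_RUNNER_[A-Za-z0-9_]*\)=.*/\1=<set>/p' | sort
}

# Usage (run in the same shell or unit environment as the runner):
#   list_runner_env
```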

The Ansible runner role creates the `shithub-actions` bridge, runs the allowlist resolver at `172.30.0.1`, and installs firewall rules that reject direct-IP egress from step containers. If you run the binary without the role, provision equivalent network controls before pointing workflows at the runner.

## Curl token smoke

Claim a job:

```sh
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest"],"capacity":1}' \
  | tee /tmp/shithub-claim.json
```

Extract the job token and id:

```sh
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
```
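With `jq -r`, a missing field is printed as the literal string `null`, which would then be sent as the job id or bearer token. A minimal guard is sketched below, assuming that a response without a claimed job shows up as a missing or null `.job` field; `json_field` is a hypothetical helper name.

```sh
# Hypothetical helper: print a field from a saved JSON response and fail
# when the field is missing or null. $1 = jq path, $2 = file.
json_field() {
  # "// empty" suppresses null output; -e then makes jq exit non-zero
  # because no value was produced.
  jq -er "$1 // empty" "$2"
}

# Usage:
#   JOB_ID=$(json_field '.job.id' /tmp/shithub-claim.json) || echo "no job claimed" >&2
#   JOB_TOKEN=$(json_field '.token' /tmp/shithub-claim.json) || echo "no token" >&2
```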

Append a log chunk:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json

export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
```
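For longer log text, some `base64` implementations wrap their output across lines, which would put raw newlines inside the JSON string. The sketch below builds the request body with the wrapping removed. `log_chunk_body` is a hypothetical helper, and the body shape is taken from the call above.

```sh
# Hypothetical helper: build the JSON body for one log chunk.
# $1 = sequence number, $2 = raw log text.
log_chunk_body() {
  # tr removes any line wrapping added by base64 so the JSON stays valid.
  chunk=$(printf '%s' "$2" | base64 | tr -d '\n')
  printf '{"seq":%d,"chunk":"%s"}' "$1" "$chunk"
}

# Usage:
#   curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
#     -H "Authorization: Bearer $JOB_TOKEN" \
#     -H "Content-Type: application/json" \
#     -d "$(log_chunk_body 0 'hello from curl')"
```

Each call consumes the token it presents, so a follow-up chunk would need the `next_token` from the previous response; that the next chunk uses `seq` 1 is an assumption, since only `seq` 0 is shown here.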

Complete the job:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
```

Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

If the job is still queued, it becomes `cancelled` immediately. If it is running, the next runner cancel check returns `{"cancelled":true}`, the runner kills the active container, and the terminal job status becomes `cancelled`.

Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

The expected response includes a new `run_id`, the new `run_index`, and `parent_run_id` equal to the source run. Confirm the new row has the same `head_sha` as the source run.
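If the rerun response is saved to a file, the parent link can be checked mechanically. This is a sketch under assumptions: `rerun_points_at` is a hypothetical helper, and the response is assumed to be a flat JSON object whose `parent_run_id` may be either a number or a string.

```sh
# Hypothetical helper: succeed when a saved rerun response names the
# source run as its parent. $1 = response file, $2 = source run id.
rerun_points_at() {
  # tostring lets the comparison work for numeric and string ids alike;
  # -e turns the boolean result into the exit status.
  jq -e --arg src "$2" '(.parent_run_id | tostring) == $src' "$1" >/dev/null
}

# Usage, after piping the rerun call through `tee /tmp/shithub-rerun.json`:
#   rerun_points_at /tmp/shithub-rerun.json "$RUN_ID" && echo "parent ok"
```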

Timeout smoke: create a workflow with a short deadline and a long-running step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes `status=completed`, `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.
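One way to confirm the counter moved is to read it before and after the probe run. The sketch below sums every sample of a counter from Prometheus text format; `metric_total` is a hypothetical helper, and it assumes the exposition carries no trailing timestamps, so the last field on each line is the value.

```sh
# Hypothetical helper: sum all samples of one counter read from stdin.
# $1 = metric name without labels.
metric_total() {
  awk -v name="$1" '
    /^#/ { next }                  # skip HELP and TYPE comments
    {
      metric = $1
      sub(/\{.*/, "", metric)      # drop any label set
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Usage:
#   before=$(curl -fsS "$BASE/metrics" | metric_total shithub_actions_step_timeouts_total)
#   ...push the timeout_probe workflow and wait for it to finish...
#   after=$(curl -fsS "$BASE/metrics" | metric_total shithub_actions_step_timeouts_total)
#   [ "$after" -gt "$before" ] && echo "timeout counted"
```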

Replay check: the completion call above consumed the token that the log call returned, so presenting that token again must fail with 401 because its `jti` is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
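When debugging a replay failure, it can be useful to read the `jti` a token carries and compare it with the rows in `runner_jwt_used`. The sketch below decodes the payload segment without verifying the signature. It assumes the job token is a standard three-segment JWT with a base64url payload; `jwt_payload` is a hypothetical helper.

```sh
# Hypothetical helper: print the decoded payload of a JWT. $1 = token.
# This only decodes; it does not verify the signature.
jwt_payload() {
  # Take the middle segment and map base64url characters to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url omits.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Usage:
#   jwt_payload "$(jq -r '.next_token' /tmp/shithub-log.json)" | jq -r '.jti'
```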

Expected results:

- `workflow_jobs.status = completed` and conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job cancellation, log-scrub, step-timeout, and retention counters.

## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC, after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...` objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned; pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.
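To see what a sweep pruned, the per-kind samples can be pulled out of the metrics text. This sketch assumes the counter is exported as one series per `kind` label value; `pruned_by_kind` is a hypothetical helper.

```sh
# Hypothetical helper: print "kind value" for each sample of
# shithub_actions_runs_pruned_total found on stdin.
pruned_by_kind() {
  sed -n 's/^shithub_actions_runs_pruned_total{.*kind="\([^"]*\)".*} \([^ ]*\).*$/\1 \2/p'
}

# Usage:
#   curl -fsS "$BASE/metrics" | pruned_by_kind
```

Comparing the output before and after `shithubd admin run-job workflow:cleanup` shows which kinds the sweep touched.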