# Actions runner smoke runbook

This runbook validates the runner-facing Actions path. `shithubd-runner` now claims jobs and executes containerized `run:` steps through Docker or Podman. The curl flow below remains useful for token/replay debugging.

For host provisioning and the systemd/Ansible path, see [runner-deploy.md](./runner-deploy.md).

Prereqs:

- Database migrations are current through `0057_workflow_job_secret_masks.sql`.
- `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under `.shithub/workflows/*.yml` with `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run. S41d PR1 supports `run:` steps; checkout and artifact aliases land in the following S41d slices.
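
The container-engine prerequisite can be checked before starting the runner. The helper below is a minimal sketch, not part of `shithubd-runner`; the name `pick_engine` is hypothetical, and it only tests whether a command exists on `PATH`, not whether the daemon is reachable.

```sh
# Print the first candidate command that is found on PATH.
# Returns non-zero when none of the candidates is installed.
pick_engine() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example: prefer Docker, fall back to Podman.
# engine="$(pick_engine docker podman)" || echo "no container engine found" >&2
```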

`runs-on` is a runner-label selector, not a hard-coded image name. A workflow that says `runs-on: ubuntu-latest` can be claimed by any runner advertising the `ubuntu-latest` label. The container image is selected by the runner host's `engine.default_image` setting; the reproducible Nix-built image is the default, but operators can point it at another OCI image when they need closer Ubuntu parity.
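
To make the label-selector idea concrete, the sketch below checks whether a comma-separated label list (the format passed to `--labels`) contains a given `runs-on` label. It is purely illustrative: the real matching happens on the server, and the function name `runner_has_label` is hypothetical.

```sh
# Succeed when the comma-separated label list in $1 contains the exact label $2.
# Wrapping both sides in commas prevents partial matches such as "ubuntu".
runner_has_label() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
# runner_has_label "self-hosted,linux,ubuntu-latest" "ubuntu-latest" && echo "eligible"
```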

Register a runner:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest \
  --capacity 1
```

Save the printed token:

```sh
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
```

Run the binary:

```sh
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
```

Equivalent config file:

```toml
[server]
base_url = "https://shithub.example"

[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]

[engine]
kind = "docker"
default_image = "ghcr.io/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
```

The config path defaults to `/etc/shithubd-runner/config.toml`. Environment variables use the `SHITHUB_RUNNER_` prefix, for example `SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`.

The Ansible runner role creates the `shithub-actions` bridge, runs the allowlist resolver at `172.30.0.1`, and installs firewall rules that reject direct-IP egress from step containers. If you run the binary without the role, provision equivalent network controls before pointing workflows at the runner.

## Curl token smoke

Claim a job:

```sh
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest"],"capacity":1}' \
  | tee /tmp/shithub-claim.json
```

Extract the job token and id:

```sh
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
```
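
Because `jq -r` prints the literal string `null` for a missing key, a heartbeat that claimed nothing can leave `JOB_ID` and `JOB_TOKEN` set to `null` and make the later calls fail confusingly. The guard below is a small sketch under that assumption; how the server actually signals "no job" is not stated here, and the name `has_value` is hypothetical.

```sh
# Succeed only when a value extracted with `jq -r` is usable:
# non-empty and not the literal string "null" that jq emits for missing keys.
has_value() {
  [ -n "$1" ] && [ "$1" != "null" ]
}

# Example: stop early when the heartbeat did not hand out a job.
# if ! has_value "$JOB_ID" || ! has_value "$JOB_TOKEN"; then
#   echo "no job claimed; check that a run is queued for these labels" >&2
# fi
```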

Append a log chunk:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json

export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
```
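
When sending more than one chunk by hand, building the request body in a helper avoids quoting mistakes. The sketch below produces the same `seq`/`chunk` body as the command above; the function name `log_payload` is hypothetical, and it strips newlines because some `base64` implementations wrap long output. That later chunks use increasing `seq` values is an assumption, since only `seq` 0 is shown here.

```sh
# Print a JSON body for the logs endpoint.
#   $1: sequence number of the chunk
#   $2: one line of log text (a trailing newline is appended before encoding)
log_payload() {
  # tr removes the line wrapping that some base64 implementations add.
  chunk="$(printf '%s\n' "$2" | base64 | tr -d '\n')"
  printf '{"seq":%d,"chunk":"%s"}\n' "$1" "$chunk"
}

# Example: pass the body to curl with  -d "$(log_payload 0 'hello from curl')"
```

Each log call consumes the current token, so refresh `JOB_TOKEN` from `next_token` after every request, as the command above does.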

Complete the job:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
```

Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

If the job is still queued, it becomes `cancelled` immediately. If it is running, the next runner cancel check returns `{"cancelled":true}`, the runner kills the active container, and the terminal job status becomes `cancelled`.

Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

Expected response includes a new `run_id`, the new `run_index`, and `parent_run_id` equal to the source run. Confirm the new row has the same `head_sha` as the source run.

Timeout smoke: create a workflow with a short deadline and a long-running step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes `status=completed`, `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.
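
To confirm the counter actually moved, read it before and after the timeout run. The helper below is a sketch that pulls one unlabelled sample out of Prometheus text-format output on stdin; the name `metric_value` is hypothetical, and it assumes the metrics endpoint is served at `$BASE/metrics` and that this counter carries no labels.

```sh
# Print the value of the unlabelled metric named $1 from Prometheus text
# exposition read on stdin. Returns non-zero when the metric is absent.
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Example: capture the counter before and after the timeout probe.
# before="$(curl -fsS "$BASE/metrics" | metric_value shithub_actions_step_timeouts_total)"
```

Comment lines such as `# TYPE ...` are skipped because their first field is `#`, not the metric name.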

Replay check: the completion call above consumed the token returned by the log call, so reusing that token must fail with 401 because its `jti` is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
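
When the replay check behaves unexpectedly, it can help to see which `jti` a token carries before looking it up in `runner_jwt_used`. The sketch below decodes the payload segment of a token, assuming the job tokens are standard three-part JWTs with a base64url-encoded payload; the function name `jwt_payload` is hypothetical, and `base64 -d` is the GNU coreutils spelling of the decode flag.

```sh
# Print the decoded payload (second dot-separated segment) of a JWT.
# The signature is not verified; this is for inspection only.
jwt_payload() {
  # Convert the base64url alphabet back to standard base64.
  seg="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # Restore the padding that base64url encoding strips.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Example: jwt_payload "$JOB_TOKEN" | jq -r '.jti'
```

Treat the decoded output like the token itself and avoid pasting it into shared logs.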

Expected results for the curl token smoke:

- `workflow_jobs.status = completed` with conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job cancellation, log-scrub, step-timeout, and retention counters.

## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC, after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...` objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned; pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.
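
After a manual sweep, the per-kind counters can be summarised in one pass. The filter below is a rough sketch that lists every `shithub_actions_runs_pruned_total` series from Prometheus text output on stdin and appends a total; the name `pruned_summary` is hypothetical, and it assumes `kind` is the only label on the series and that the endpoint is `$BASE/metrics`.

```sh
# List each shithub_actions_runs_pruned_total series with its value,
# followed by a "total" line summing all kinds.
pruned_summary() {
  awk '
    index($1, "shithub_actions_runs_pruned_total{") == 1 {
      print $1, $2
      total += $2
    }
    END { print "total", total + 0 }
  '
}

# Example: curl -fsS "$BASE/metrics" | pruned_summary
```

Comparing the summary before and after `shithubd admin run-job workflow:cleanup` shows which kinds the sweep actually pruned.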