
# Actions runner smoke runbook

This runbook validates the runner-facing Actions path. `shithubd-runner` now claims jobs, performs scoped `actions/checkout@v4`, and executes containerized `run:` steps through Docker or Podman. The curl flow below remains useful for token/replay debugging.

For host provisioning and the systemd/Ansible path, see [runner-deploy.md](./runner-deploy.md).

Prereqs:

- Database migrations are current through `0067_runner_pool_ops.sql`.
- `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under `.shithub/workflows/*.yml` with `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run. Checkout and `run:` steps are executable; artifact aliases remain reserved until artifact transfer lands.

`runs-on` is a runner-label selector, not a hard-coded image name. A workflow that says `runs-on: ubuntu-latest` can be claimed by any runner advertising the `ubuntu-latest` label. The container image is selected by the runner host's `engine.default_image` setting; the reproducible Nix-built image is the default, but operators can point it at another OCI image when they need closer Ubuntu parity.
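The selector behaviour can be pictured as a set-inclusion check. The sketch below is an illustration only: it assumes a job is claimable when every requested `runs-on` label appears among the labels a runner advertises, and `runner_satisfies` is a hypothetical helper name, not part of `shithubd`.

```sh
# Hypothetical helper, not part of shithubd: succeed when every requested
# label ($2, comma-separated) is advertised by the runner ($1, comma-separated).
# Assumes exact, case-sensitive label comparison.
runner_satisfies() {
  advertised=",$1,"
  old_ifs=$IFS
  IFS=,
  for want in $2; do
    case "$advertised" in
      *",$want,"*) ;;                      # label is advertised, keep checking
      *) IFS=$old_ifs; return 1 ;;         # one missing label is enough to reject
    esac
  done
  IFS=$old_ifs
  return 0
}
```

With the labels used in this runbook, `runner_satisfies self-hosted,linux,ubuntu-latest,x64 ubuntu-latest` succeeds, while the same call with `windows-latest` fails.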

Register a runner:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

Save the returned token:

```sh
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
```

Run the binary:

```sh
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
```

## Pool Operations

List pool state:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

`list` includes labels, capacity, active job count, last heartbeat, host name, runner version, drain state, and revoke state. `queue` groups queued jobs by requested `runs-on` label so unsupported labels are visible without querying Postgres.

Drain a runner before host maintenance:

```sh
shithubd admin runner drain --id 7 --reason 'kernel update'
```

The runner keeps heartbeating and can finish already claimed jobs, but new heartbeat claims return 204 until it is undrained:

```sh
shithubd admin runner undrain --id 7
```

Rotate a registration token after a config-management change:

```sh
shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
```

Update the runner host config with the printed token, restart `shithubd-runner`, then verify heartbeats with `runner list`.

Mark stale runners offline:

```sh
shithubd admin runner cleanup-stale --older-than 2m
```

Hard-revoke a compromised runner:

```sh
shithubd admin runner revoke --id 7 --reason 'host compromise'
```

Revocation records `revoked_at`, marks the runner offline, revokes every registration token for that runner, and causes existing job API JWTs from that runner to fail. Use drain for routine maintenance; use revoke when the token or host may be compromised.

Destroy/recreate with DigitalOcean:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete <droplet-id>
```

After recreation, register a fresh runner token and let the provisioning role write the new config. Do not reuse a token from a revoked runner.

Emergency controls:

- Stop all new claims: disable Actions at site level or drain every runner with `shithubd admin runner list --output json` followed by `shithubd admin runner drain --id <id>`.
- Pause one repo/org: set the repo/org Actions policy to disabled so the claim query leaves its queued jobs untouched.
- Cancel active work for a repo: use the repo Actions UI or job cancel API for each queued/running job.
- Revoke all runner tokens: iterate `shithubd admin runner revoke --id <id>` over every non-revoked runner.
- Fence the pool: block or destroy droplets tagged `shithub-actions-runner` in DigitalOcean, then rotate or revoke the affected runner tokens before allowing replacement hosts to connect.
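
The "drain every runner" control can be scripted. The sketch below is a hedged example: `drain_ids` is a hypothetical helper that reads one runner id per line on stdin and calls the `drain` subcommand shown earlier. How the ids are extracted from `runner list --output json` depends on the JSON shape, which this runbook does not specify.

```sh
# Hypothetical helper: drain every runner id read from stdin (one per line).
# $1 is the reason recorded with each drain. Returns non-zero if any drain fails.
drain_ids() {
  reason=$1
  failed=0
  while IFS= read -r id; do
    [ -n "$id" ] || continue               # skip blank lines
    # </dev/null keeps the command from consuming the remaining ids on stdin
    shithubd admin runner drain --id "$id" --reason "$reason" </dev/null || failed=1
  done
  return "$failed"
}
```

Assuming the list output is a top-level JSON array of objects with an `id` field (an assumption, not something confirmed here), one possible invocation is `shithubd admin runner list --output json | jq -r '.[].id' | drain_ids 'emergency stop'`.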

The site-level disable path is a hard kill switch and overrides repo/org Actions policy. Until a site-admin UI exists, use:

```sql
UPDATE actions_site_policy
   SET actions_enabled = false,
       updated_at = now()
 WHERE id = true;
```

The public shared-runner rollout criteria live in [`actions-public-runners.md`](../actions-public-runners.md).

Equivalent config file:

```toml
[server]
base_url = "https://shithub.example"

[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]

[engine]
kind = "docker"
default_image = "ghcr.io/tenseleyflow/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
```

The config path defaults to `/etc/shithubd-runner/config.toml`. Environment variables use the `SHITHUB_RUNNER_` prefix, for example `SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`. Use `--expires-in` only for tokens that your automation rotates before expiry: the runner presents its registration token on every heartbeat, so a token that expires without being replaced takes the runner out of service.

## Arbitrary Repository Smoke

Use this checklist after provisioning a shared runner pool or changing runner labels. The purpose is to prove that ordinary repositories can use the pool without repo-specific labels.

Pick at least two repositories:

- `mfwolffe/scratch`, the historical dogfood repo;
- one additional public repository;
- one private repository if one is available for the operator account.

In each repository, commit this file as `.shithub/workflows/smoke.yml` on `trunk`:

```yaml
name: Smoke
on:
  push:
    branches: [trunk]
jobs:
  green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify checkout
        run: test -f README.md || test -f readme.md || pwd
      - name: Smoke
        run: printf 'shithub actions smoke passed\n'
```

Expected results for each repo:

- the repo heading shows a green check after the push;
- the Actions run page shows `Triggered via push` on `refs/heads/trunk`;
- the `green` job is completed with conclusion `success`;
- the checkout step logs the scoped repository URL for that repo;
- the smoke step log contains `shithub actions smoke passed`;
- the check run attached to the commit agrees with the workflow run state;
- the downloaded or raw archived step log matches the in-page step log.

Confirm the pool is shared:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

The same online runner labels should satisfy both repositories. The smoke workflow must use `runs-on: ubuntu-latest`; do not add per-repo labels for this test.

Unsupported-label negative test:

```yaml
name: Unsupported runner label
on:
  workflow_dispatch:
jobs:
  nope:
    runs-on: windows-latest
    steps:
      - run: echo should-not-run
```

Trigger it manually. The run should stay queued and the run page should say `Waiting for runner with labels: windows-latest`. `shithubd admin runner queue --output json` should show one queued `windows-latest` job with zero matching runners. Cancel the run after confirming the diagnostic.

Untrusted-PR secret negative test:

1. Add a repo secret named `S41J_SECRET_SMOKE` with any non-production value.
2. Open an untrusted pull request whose workflow prints whether that secret is present.
3. Before approval, confirm the claimed job contains no injected secrets and logs do not contain the secret value.
4. Only after explicit approval should the run be allowed through the trusted secret path.

Runner-outage negative test:

1. Drain every shared runner with `shithubd admin runner drain --id <id>`.
2. Push the smoke workflow to a test repo.
3. Confirm the run stays queued with the requested `ubuntu-latest` label visible.
4. Undrain one runner and confirm the same queued job is claimed and completes.

This smoke is considered passing only when scratch and the second repository both complete from the same shared label set and the negative cases produce clear queued/secret-denied behavior.

The Ansible runner role creates the `shithub-actions` bridge, runs the allowlist resolver at `172.30.0.1`, and installs firewall rules that reject direct-IP egress from step containers. If you run the binary without the role, provision equivalent network controls before pointing workflows at the runner.

## Curl token smoke

Claim a job:

```sh
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
  | tee /tmp/shithub-claim.json
```

Extract the job token and id:

```sh
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
```
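
If the heartbeat claims nothing, for example because the runner is drained and the server answers 204 with no body, `jq -r` yields an empty string, and a response that lacks the expected key yields the literal `null`. A minimal guard, with `require_value` as a hypothetical helper name, might look like this:

```sh
# Hypothetical helper: fail loudly when a value extracted with `jq -r`
# is empty (no response body) or the literal string "null" (missing key).
require_value() {
  case "$2" in
    ''|null)
      printf 'no usable %s in claim response\n' "$1" >&2
      return 1
      ;;
  esac
}
```

Running `require_value JOB_ID "$JOB_ID" && require_value JOB_TOKEN "$JOB_TOKEN"` before the later calls keeps them from being sent to a path such as `/api/v1/jobs/null/logs`.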

Append a log chunk:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json

export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
```
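
For chunks longer than this short example, note that GNU `base64` wraps its output at 76 columns by default, which would put raw newlines inside the JSON string. The sketch below strips them; `log_payload` is a hypothetical helper, and the `seq` and `chunk` field names are taken from the request above.

```sh
# Hypothetical helper: print the JSON body for one log chunk.
# $1 is the sequence number, $2 the log text (a trailing newline is appended).
log_payload() {
  # tr -d '\n' removes the line wraps some base64 implementations insert,
  # so the encoded chunk stays on a single line inside the JSON string.
  chunk="$(printf '%s\n' "$2" | base64 | tr -d '\n')"
  printf '{"seq":%s,"chunk":"%s"}\n' "$1" "$chunk"
}
```

It can stand in for the hand-built body, for example `-d "$(log_payload 0 'hello from curl')"`.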

Complete the job:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
```

Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

If the job is still queued, it becomes `cancelled` immediately. If it is running, the next runner cancel check returns `{"cancelled":true}`, the runner kills the active container, and the terminal job status becomes `cancelled`.

Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

Expected response includes a new `run_id`, the new `run_index`, and `parent_run_id` equal to the source run. Confirm the new row has the same `head_sha` as the source run.

Timeout smoke: create a workflow with a short deadline and a long-running step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes `status=completed`, `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.
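
To check that a counter such as `shithub_actions_step_timeouts_total` actually moved, it helps to read it before and after the probe. The sketch below assumes `/metrics` serves the Prometheus text exposition format without trailing timestamps (this runbook does not say so explicitly); `metric_sum` is a hypothetical helper that adds up every sample of one metric name across its label sets.

```sh
# Hypothetical helper: sum all samples of metric $1 read from stdin.
# Assumes Prometheus text format lines of the form `name{labels} value`.
# Exits non-zero when the metric does not appear at all.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)             # drop the {label="..."} part
      if (metric == name) { sum += $NF; found = 1 }
    }
    END { if (!found) exit 1; print sum }
  '
}
```

For example, `curl -fsS "$BASE/metrics" | metric_sum shithub_actions_step_timeouts_total` run before and after the timeout probe should report a higher value the second time.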

Replay check: the token returned by the log call was consumed by the completion call, so reusing it must fail with 401 because its `jti` is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
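
When a replay is rejected (or unexpectedly accepted), it can help to look at the claims inside the token. The sketch below assumes the job tokens are standard three-segment JWTs with a base64url-encoded JSON payload, which the `jti` reference suggests but this runbook does not spell out, and that `base64 -d` is available (GNU coreutils; some older macOS versions use `-D`). `jwt_payload` is a hypothetical helper name.

```sh
# Hypothetical helper: print the decoded payload (claims) of a JWT given as $1.
# The signature is NOT verified; this is for inspection only.
jwt_payload() {
  # Take the middle segment and map the base64url alphabet back to base64.
  seg="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # Restore the '=' padding that base64url encoding strips.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 -d
}
```

Under those assumptions, `jwt_payload "$JOB_TOKEN" | jq -r .jti` prints the identifier to compare against `runner_jwt_used`.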

Expected results:

- `workflow_jobs.status = completed` and conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job cancellation, log-scrub, step-timeout, retention, queue-depth by label, claim-latency, runner-online, runner-stale, runner-draining, and runner-revocation metrics.

## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC, after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...` objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned; pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.