Actions runner smoke runbook
This runbook validates the runner-facing Actions path. shithubd-runner
now claims jobs, performs scoped actions/checkout@v4, and executes
containerized run: steps through Docker or Podman. The curl flow below
remains useful for token/replay debugging.
For host provisioning and the systemd/Ansible path, see runner-deploy.md.
Prereqs:
- Database migrations are current through 0067_runner_pool_ops.sql.
- SHITHUB_TOTP_KEY or auth.totp_key_b64 is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under .shithub/workflows/*.yml with runs-on: ubuntu-latest, and a push/dispatch has enqueued a run. Checkout and run: steps are executable; artifact aliases remain reserved until artifact transfer lands.
runs-on is a runner-label selector, not a hard-coded image name.
A workflow that says runs-on: ubuntu-latest can be claimed by any
runner advertising the ubuntu-latest label. The container image is
selected by the runner host's engine.default_image setting; the
reproducible Nix-built image is the default, but operators can point it
at another OCI image when they need closer Ubuntu parity.
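The label-selector rule can be sketched in shell as a simple membership test. This only illustrates the rule described above and is not the server's claim query; the function name runner_can_claim is hypothetical, and the sketch assumes a job requests a single label.

```sh
# Hypothetical helper: succeeds when the requested runs-on label appears in
# the runner's comma-separated label list (the same format as --labels).
runner_can_claim() {
  local runner_labels="$1" requested="$2"
  # Wrap both sides in commas so only whole labels match.
  case ",${runner_labels}," in
    *",${requested},"*) return 0 ;;
    *) return 1 ;;
  esac
}

# The label set used throughout this runbook advertises ubuntu-latest ...
runner_can_claim "self-hosted,linux,ubuntu-latest,x64" "ubuntu-latest" && echo "claimable"
# ... but not windows-latest, so such a job would stay queued.
runner_can_claim "self-hosted,linux,ubuntu-latest,x64" "windows-latest" || echo "stays queued"
```

Note that the image never enters the match: only labels decide which runner may claim the job.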
Register a runner:
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
Save the returned token:
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
Run the binary:
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
Pool Operations
List pool state:
shithubd admin runner list --output json
shithubd admin runner queue --output json
list includes labels, capacity, active job count, last heartbeat,
host name, runner version, drain state, and revoke state. queue
groups queued jobs by requested runs-on label so unsupported labels
are visible without querying Postgres.
Drain a runner before host maintenance:
shithubd admin runner drain --id 7 --reason 'kernel update'
The runner keeps heartbeating and can finish already claimed jobs, but new heartbeat claims return 204 until it is undrained:
shithubd admin runner undrain --id 7
Rotate a registration token after a config-management change:
shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
Update the runner host config with the printed token, restart
shithubd-runner, then verify heartbeats with runner list.
Mark stale runners offline:
shithubd admin runner cleanup-stale --older-than 2m
Hard-revoke a compromised runner:
shithubd admin runner revoke --id 7 --reason 'host compromise'
Revocation records revoked_at, marks the runner offline, revokes every
registration token for that runner, and causes existing job API JWTs from
that runner to fail. Use drain for routine maintenance; use revoke when
the token or host may be compromised.
Destroy/recreate with DigitalOcean:
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete <droplet-id>
After recreation, register a fresh runner token and let the provisioning role write the new config. Do not reuse a token from a revoked runner.
Emergency controls:
- Stop all new claims: disable Actions at site level or drain every runner with shithubd admin runner list --output json followed by shithubd admin runner drain --id <id>.
- Pause one repo/org: set the repo/org Actions policy to disabled so the claim query leaves its queued jobs untouched.
- Cancel active work for a repo: use the repo Actions UI or job cancel API for each queued/running job.
- Revoke all runner tokens: iterate shithubd admin runner revoke --id <id> over every non-revoked runner.
- Fence the pool: block or destroy droplets tagged shithub-actions-runner in DigitalOcean, then rotate or revoke the affected runner tokens before allowing replacement hosts to connect.
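The "drain every runner" control above can be scripted. The sketch below is a dry run: it only prints the documented drain command for each runner id read from stdin, so the plan can be reviewed before anything is executed. The function name and the jq filter mentioned in the comment are assumptions, because this runbook does not document the JSON shape of runner list.

```sh
# Hypothetical helper: read one runner id per line and print (not execute)
# the drain command for each. Non-numeric lines are skipped.
print_drain_commands() {
  local reason="$1" id
  while IFS= read -r id; do
    case "$id" in
      ''|*[!0-9]*) continue ;;  # ignore blanks and anything that is not a number
    esac
    printf "shithubd admin runner drain --id %s --reason '%s'\n" "$id" "$reason"
  done
}

# Example with literal ids. In practice the ids might come from something like
#   shithubd admin runner list --output json | jq -r '.[].id'
# which assumes a top-level array of objects with an id field.
printf '7\n9\n' | print_drain_commands 'emergency stop'
```

The same pattern applies to the revoke control by swapping the subcommand; review the printed lines before piping them to a shell.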
Equivalent config file:
[server]
base_url = "https://shithub.example"
[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]
[engine]
kind = "docker"
default_image = "ghcr.io/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
The config path defaults to /etc/shithubd-runner/config.toml.
Environment variables use the SHITHUB_RUNNER_ prefix, for example
SHITHUB_RUNNER_TOKEN or SHITHUB_RUNNER_SERVER__BASE_URL.
Use --expires-in only for tokens that your automation rotates before expiry;
the runner presents its registration token on every heartbeat.
Arbitrary Repository Smoke
Use this checklist after provisioning a shared runner pool or changing runner labels. The purpose is to prove that ordinary repositories can use the pool without repo-specific labels.
Pick at least two repositories:
- mfwolffe/scratch, the historical dogfood repo;
- one additional public repository;
- one private repository if one is available for the operator account.
In each repository, commit this file as .shithub/workflows/smoke.yml on
trunk:
name: Smoke
on:
  push:
    branches: [trunk]
jobs:
  green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify checkout
        run: test -f README.md || test -f readme.md || pwd
      - name: Smoke
        run: printf 'shithub actions smoke passed\n'
Expected results for each repo:
- the repo heading shows a green check after the push;
- the Actions run page shows "Triggered via push" on refs/heads/trunk;
- the green job is completed with conclusion success;
- the checkout step logs the scoped repository URL for that repo;
- the smoke step log contains "shithub actions smoke passed";
- the check run attached to the commit agrees with the workflow run state;
- the downloaded or raw archived step log matches the in-page step log.
Confirm the pool is shared:
shithubd admin runner list --output json
shithubd admin runner queue --output json
The same online runner labels should satisfy both repositories. The smoke
workflow must use runs-on: ubuntu-latest; do not add per-repo labels for this
test.
Unsupported-label negative test:
name: Unsupported runner label
on:
  workflow_dispatch:
jobs:
  nope:
    runs-on: windows-latest
    steps:
      - run: echo should-not-run
Trigger it manually. The run should stay queued and the run page should say
"Waiting for runner with labels: windows-latest". Running
shithubd admin runner queue --output json should show one queued
windows-latest job with zero matching runners. Cancel the run after
confirming the diagnostic.
Untrusted-PR secret negative test:
1. Add a repo secret named S41J_SECRET_SMOKE with any non-production value.
2. Open an untrusted pull request whose workflow prints whether that secret is present.
3. Before approval, confirm the claimed job contains no injected secrets and logs do not contain the secret value.
4. Only after explicit approval should the run be allowed through the trusted secret path.
Runner-outage negative test:
1. Drain every shared runner with shithubd admin runner drain --id <id>.
2. Push the smoke workflow to a test repo.
3. Confirm the run stays queued with the requested ubuntu-latest label visible.
4. Undrain one runner and confirm the same queued job is claimed and completes.
This smoke is considered passing only when scratch and the second repository both complete from the same shared label set and the negative cases produce clear queued/secret-denied behavior.
The Ansible runner role creates the shithub-actions bridge, runs the
allowlist resolver at 172.30.0.1, and installs firewall rules that
reject direct-IP egress from step containers. If you run the binary
without the role, provision equivalent network controls before pointing
workflows at the runner.
Curl token smoke
Claim a job:
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
  | tee /tmp/shithub-claim.json
Extract the job token and id:
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
Append a log chunk:
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json
export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
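The inline quoting in the log request above is easy to get wrong when the log text changes. As a sketch, a small helper can build the request body instead; the function name log_payload is hypothetical, and the field names seq and chunk are taken from the request above.

```sh
# Hypothetical helper: build the JSON body for a log append call.
# $1 = sequence number, $2 = log text (a trailing newline is added).
log_payload() {
  local seq="$1" text="$2" chunk
  # tr removes the line wrapping that some base64 implementations add
  # to long output, which would otherwise break the JSON string.
  chunk="$(printf '%s\n' "$text" | base64 | tr -d '\n')"
  printf '{"seq":%d,"chunk":"%s"}\n' "$seq" "$chunk"
}

log_payload 0 'hello from curl'
# prints {"seq":0,"chunk":"aGVsbG8gZnJvbSBjdXJsCg=="}
```

The result can be passed to curl as -d "$(log_payload 0 'hello from curl')". Remember that each call consumes the current job token, so JOB_TOKEN still has to be refreshed from next_token after every request.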
Complete the job:
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
If the job is still queued, it becomes cancelled immediately. If it is
running, the next runner cancel check returns {"cancelled":true}, the
runner kills the active container, and the terminal job status becomes
cancelled.
Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
Expected response includes a new run_id, the new run_index, and
parent_run_id equal to the source run. Confirm the new row has the
same head_sha as the source run.
Timeout smoke: create a workflow with a short deadline and a long-running step:
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
Expected results:
- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes status=completed, conclusion=timed_out.
- The job row becomes status=completed, conclusion=timed_out.
- /metrics increments shithub_actions_step_timeouts_total.
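Replay check: reusing an already-consumed job token must fail with
401 because its jti is already present in runner_jwt_used. The command
below replays the next_token from the log response, which the completion
call above has already consumed: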
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
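If a replay does not fail as expected, it can help to look at the jti claim of the token being replayed. The sketch below decodes the payload segment of a JWT without verifying its signature. The function name jwt_payload is hypothetical, and the sketch assumes that job tokens are standard three-part compact JWTs and that base64 accepts -d (as GNU coreutils does). The sample token is made up for illustration.

```sh
# Hypothetical helper: print the decoded payload (claims) of a compact JWT.
# It does NOT verify the signature; use it for debugging only.
jwt_payload() {
  local payload
  # Take the middle segment and map the base64url alphabet back to base64.
  payload="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # JWTs drop the trailing padding; restore it before decoding.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
  printf '\n'
}

# Made-up token whose payload segment encodes {"jti":"abc"}.
jwt_payload 'x.eyJqdGkiOiJhYmMifQ.y'
# prints {"jti":"abc"}
```

Running it against the token from /tmp/shithub-log.json shows the jti value that should already be recorded in runner_jwt_used.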
Expected results:
- workflow_jobs.status = completed and conclusion success.
- The parent workflow_runs row rolls up to completed/success when all jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- /metrics includes runner registration, heartbeat, JWT, job cancellation, log-scrub, step-timeout, retention, queue-depth by label, claim-latency, runner-online, runner-stale, runner-draining, and runner-revocation metrics.
Retention Sweep
The daily housekeeping timer enqueues workflow:cleanup at 03:30 UTC,
after the 03:17 backup window:
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
Manual smoke:
shithubd admin run-job workflow:cleanup
Expected behavior:
- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their actions/runs/... objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned; pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- /metrics exposes shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}.
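When checking the sweep by hand, it helps to know the cutoff timestamps that the 7, 30, and 365 day windows above correspond to. The sketch below prints them in UTC. The function name retention_cutoff is hypothetical, the sketch assumes GNU date, and which row timestamp the sweep actually compares against is not stated in this runbook.

```sh
# Hypothetical helper: print the UTC cutoff for a retention window.
# $1 = "now" as a Unix epoch, $2 = retention in days. Assumes GNU date.
retention_cutoff() {
  local now="$1" days="$2"
  date -u -d "@$(( now - days * 86400 ))" +%Y-%m-%dT%H:%M:%SZ
}

now="$(date -u +%s)"
printf 'log chunks:      older than %s\n' "$(retention_cutoff "$now" 7)"
printf 'runner JWT rows: older than %s\n' "$(retention_cutoff "$now" 30)"
printf 'workflow runs:   older than %s\n' "$(retention_cutoff "$now" 365)"
```

Rows newer than the printed cutoffs should survive a manual run of the cleanup job.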