# Actions runner smoke runbook

This runbook validates the runner-facing Actions path. `shithubd-runner` now claims jobs, performs scoped `actions/checkout@v4`, and executes containerized `run:` steps through Docker or Podman. The curl flow below remains useful for token/replay debugging.

For host provisioning and the systemd/Ansible path, see [runner-deploy.md](./runner-deploy.md).

Prereqs:

- Database migrations are current through `0067_runner_pool_ops.sql`.
- `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under `.shithub/workflows/*.yml` with `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run. Checkout and `run:` steps are executable; artifact aliases remain reserved until artifact transfer lands.

`runs-on` is a runner-label selector, not a hard-coded image name. A workflow that says `runs-on: ubuntu-latest` can be claimed by any runner advertising the `ubuntu-latest` label. The container image is selected by the runner host's `engine.default_image` setting; the reproducible Nix-built image is the default, but operators can point it at another OCI image when they need closer Ubuntu parity.

Register a runner:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

Save the returned token:

```sh
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
```

Run the binary:

```sh
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
```

## Pool Operations

List pool state:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

`list` includes labels, capacity, active job count, last heartbeat, host name, runner version, drain state, and revoke state. `queue` groups queued jobs by requested `runs-on` label so unsupported labels are visible without querying Postgres.

Drain a runner before host maintenance:

```sh
shithubd admin runner drain --id 7 --reason 'kernel update'
```

The runner keeps heartbeating and can finish already claimed jobs, but new heartbeat claims return 204 until it is undrained:

```sh
shithubd admin runner undrain --id 7
```

Rotate a registration token after a config-management change:

```sh
shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
```

Update the runner host config with the printed token, restart `shithubd-runner`, then verify heartbeats with `runner list`.

Mark stale runners offline:

```sh
shithubd admin runner cleanup-stale --older-than 2m
```

Hard-revoke a compromised runner:

```sh
shithubd admin runner revoke --id 7 --reason 'host compromise'
```

Revocation records `revoked_at`, marks the runner offline, revokes every registration token for that runner, and causes existing job API JWTs from that runner to fail. Use drain for routine maintenance; use revoke when the token or host may be compromised.

Destroy/recreate with DigitalOcean:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete <droplet-id>
```

After recreation, register a fresh runner token and let the provisioning role write the new config. Do not reuse a token from a revoked runner.

Emergency controls:

- Stop all new claims: disable Actions at site level or drain every runner with `shithubd admin runner list --output json` followed by `shithubd admin runner drain --id <id>`.
- Pause one repo/org: set the repo/org Actions policy to disabled so the claim query leaves its queued jobs untouched.
- Cancel active work for a repo: use the repo Actions UI or job cancel API for each queued/running job.
- Revoke all runner tokens: iterate `shithubd admin runner revoke --id <id>` over every non-revoked runner.
- Fence the pool: block or destroy droplets tagged `shithub-actions-runner` in DigitalOcean, then rotate or revoke the affected runner tokens before allowing replacement hosts to connect.
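The drain-everything and revoke-everything controls can be scripted as a loop over runner ids. The sketch below is a hypothetical helper, not part of `shithubd`: `each_runner` reads one runner id per line from standard input and applies the named `shithubd admin runner` subcommand to each. How the ids are obtained is deliberately left out, because the JSON shape of `runner list --output json` is not documented in this runbook.

```sh
# Hypothetical helper, not part of shithubd: apply one admin subcommand
# (drain, undrain, or revoke) to every runner id read from stdin.
# Usage: each_runner <subcommand> <reason> < ids.txt
each_runner() {
  action=$1
  reason=$2
  failed=0
  while IFS= read -r id; do
    # Skip blank lines so a trailing newline in the id list is harmless.
    [ -n "$id" ] || continue
    if [ "$action" = "undrain" ]; then
      # This runbook shows undrain without a --reason flag.
      shithubd admin runner undrain --id "$id" </dev/null || failed=1
    else
      shithubd admin runner "$action" --id "$id" --reason "$reason" </dev/null || failed=1
    fi
  done
  # Non-zero when at least one runner could not be updated.
  return "$failed"
}
```

If the list output turns out to be a JSON array of objects with an `id` field (an assumption worth checking first), `shithubd admin runner list --output json | jq -r '.[].id' | each_runner drain 'pool freeze'` would drain the whole pool. The loop keeps going after a failure so one unreachable row does not leave the rest of the pool claiming jobs.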

Equivalent config file:

```toml
[server]
base_url = "https://shithub.example"

[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]

[engine]
kind = "docker"
default_image = "ghcr.io/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
```

The config path defaults to `/etc/shithubd-runner/config.toml`. Environment variables use the `SHITHUB_RUNNER_` prefix, for example `SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`. Use `--expires-in` only for tokens that your automation rotates before expiry; the runner presents its registration token on every heartbeat.

## Arbitrary Repository Smoke

Use this checklist after provisioning a shared runner pool or changing runner labels. The purpose is to prove that ordinary repositories can use the pool without repo-specific labels.

Pick at least two repositories:

- `mfwolffe/scratch`, the historical dogfood repo;
- one additional public repository;
- one private repository if one is available for the operator account.

In each repository, commit this file as `.shithub/workflows/smoke.yml` on `trunk`:

```yaml
name: Smoke
on:
  push:
    branches: [trunk]
jobs:
  green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify checkout
        run: test -f README.md || test -f readme.md || pwd
      - name: Smoke
        run: printf 'shithub actions smoke passed\n'
```

Expected results for each repo:

- the repo heading shows a green check after the push;
- the Actions run page shows `Triggered via push` on `refs/heads/trunk`;
- the `green` job is completed with conclusion `success`;
- the checkout step logs the scoped repository URL for that repo;
- the smoke step log contains `shithub actions smoke passed`;
- the check run attached to the commit agrees with the workflow run state;
- the downloaded or raw archived step log matches the in-page step log.

Confirm the pool is shared:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

The same online runner labels should satisfy both repositories. The smoke workflow must use `runs-on: ubuntu-latest`; do not add per-repo labels for this test.

Unsupported-label negative test:

```yaml
name: Unsupported runner label
on:
  workflow_dispatch:
jobs:
  nope:
    runs-on: windows-latest
    steps:
      - run: echo should-not-run
```

Trigger it manually. The run should stay queued and the run page should say `Waiting for runner with labels: windows-latest`. `shithubd admin runner queue --output json` should show one queued `windows-latest` job with zero matching runners. Cancel the run after confirming the diagnostic.

Untrusted-PR secret negative test:

1. Add a repo secret named `S41J_SECRET_SMOKE` with any non-production value.
2. Open an untrusted pull request whose workflow prints whether that secret is present.
3. Before approval, confirm the claimed job contains no injected secrets and logs do not contain the secret value.
4. Only after explicit approval should the run be allowed through the trusted secret path.

Runner-outage negative test:

1. Drain every shared runner with `shithubd admin runner drain --id <id>`.
2. Push the smoke workflow to a test repo.
3. Confirm the run stays queued with the requested `ubuntu-latest` label visible.
4. Undrain one runner and confirm the same queued job is claimed and completes.

This smoke is considered passing only when scratch and the second repository both complete from the same shared label set and the negative cases produce clear queued/secret-denied behavior.

The Ansible runner role creates the `shithub-actions` bridge, runs the allowlist resolver at `172.30.0.1`, and installs firewall rules that reject direct-IP egress from step containers. If you run the binary without the role, provision equivalent network controls before pointing workflows at the runner.
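For a host provisioned without the role, the bridge itself is the easiest piece to reproduce. The following is a minimal sketch that covers only that piece, not the role's actual implementation: `ensure_actions_network` is a hypothetical helper, and both the `/24` subnet and the use of `172.30.0.1` as the bridge gateway are assumptions (the runbook only states that the resolver listens at that address). It does not start the allowlist resolver or install any egress firewall rules, so it is not a substitute for the role's controls.

```sh
# Sketch only: create the bridge network named by --network / engine.network
# if it does not exist yet. Subnet and gateway are assumptions, see above.
ensure_actions_network() {
  name=${1:-shithub-actions}
  # inspect exits non-zero when the network is missing.
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
    return 0
  fi
  docker network create \
    --driver bridge \
    --subnet 172.30.0.0/24 \
    --gateway 172.30.0.1 \
    "$name" >/dev/null && echo "network $name created"
}
```

Calling `ensure_actions_network` with no argument targets `shithub-actions`. The check-then-create order keeps the function safe to rerun from a provisioning script.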

## Curl token smoke

Claim a job:

```sh
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
  | tee /tmp/shithub-claim.json
```

Extract the job token and id:

```sh
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
```
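A heartbeat that claims nothing returns 204 (as a drained runner's does, and presumably an empty queue as well). `curl -f` treats that as success, the claim file ends up empty, and both `jq` extractions yield empty strings. A small guard avoids sending later requests to a malformed URL. This is a sketch: `require_claim` is a hypothetical helper, and it assumes a 204 always leaves an empty body.

```sh
# Hypothetical guard: succeed only when the claim file is non-empty and the
# extracted values look real (jq -r prints "null" for missing keys).
require_claim() {
  claim_file=$1
  job_id=$2
  job_token=$3
  if [ ! -s "$claim_file" ]; then
    echo "no job claimed: $claim_file is empty (heartbeat likely returned 204)" >&2
    return 1
  fi
  for value in "$job_id" "$job_token"; do
    case "$value" in
      ''|null)
        echo "claim response is missing the job id or token" >&2
        return 1
        ;;
    esac
  done
  echo "claimed job $job_id"
}
```

Run it right after the exports, for example `require_claim /tmp/shithub-claim.json "$JOB_ID" "$JOB_TOKEN" || exit 1`, so the log and status calls only run against a real claim.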

Append a log chunk:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json

export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
```
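The inline `base64` call is fine for a short chunk, but GNU `base64` wraps its output at 76 columns by default, which would put raw newlines inside the JSON string once a chunk grows past a few dozen bytes. The sketch below builds the same payload with the wraps stripped. `log_payload` is a hypothetical helper, and the payload shape is simply the one used in the command above.

```sh
# Hypothetical helper: print the JSON body for a log append.
# Usage: log_payload <seq> <text>
log_payload() {
  seq_no=$1
  text=$2
  # tr removes the line wraps that GNU base64 inserts every 76 characters.
  chunk=$(printf '%s\n' "$text" | base64 | tr -d '\n')
  # The base64 alphabet needs no JSON escaping, so plain printf is enough.
  printf '{"seq":%d,"chunk":"%s"}\n' "$seq_no" "$chunk"
}
```

`log_payload 0 'hello from curl'` prints `{"seq":0,"chunk":"aGVsbG8gZnJvbSBjdXJsCg=="}`, which can be passed to curl as `-d "$(log_payload 0 'hello from curl')"`.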

Complete the job:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
```

Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

If the job is still queued, it becomes `cancelled` immediately. If it is running, the next runner cancel check returns `{"cancelled":true}`, the runner kills the active container, and the terminal job status becomes `cancelled`.

Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

Expected response includes a new `run_id`, the new `run_index`, and `parent_run_id` equal to the source run. Confirm the new row has the same `head_sha` as the source run.

Timeout smoke: create a workflow with a short deadline and a long-running step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes `status=completed`, `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.
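To confirm that a counter actually moved, it helps to total its samples before and after the probe. The helper below is a sketch that reads Prometheus text exposition from standard input and sums every series of one metric name across label sets. `metric_total` is a hypothetical name, and it assumes sample lines carry no trailing timestamp.

```sh
# Hypothetical helper: sum all samples of one metric from Prometheus text
# format on stdin. Usage: metric_total <metric_name> < metrics.txt
metric_total() {
  awk -v name="$1" '
    $0 ~ /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      split($1, parts, "{")            # strip any {label="..."} suffix
      if (parts[1] == name) sum += $NF # last field is the sample value
    }
    END { printf "%d\n", sum }
  '
}
```

Assuming the web process serves the endpoint at `$BASE/metrics`, run `curl -fsS "$BASE/metrics" | metric_total shithub_actions_step_timeouts_total` before and after the timeout probe; the second reading should be higher.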

Replay check: the completion call above consumed the token returned by the log call, so presenting that token again must fail with 401 because its `jti` is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
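When the replay check runs from a script, comparing the status code is more reliable than reading `curl -i` output by eye. This is a sketch under the same assumptions as the command above: `check_replay` is a hypothetical helper that sends the same request and expects `BASE` and `JOB_ID` to be exported already.

```sh
# Hypothetical helper: replay an already-consumed job token against the
# status endpoint and report whether the server answered 401.
check_replay() {
  used_token=$1
  # -o /dev/null drops the body; -w '%{http_code}' prints only the status.
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "$BASE/api/v1/jobs/$JOB_ID/status" \
    -H "Authorization: Bearer $used_token" \
    -H "Content-Type: application/json" \
    -d '{"status":"running"}')
  if [ "$code" = "401" ]; then
    echo "replay rejected (401)"
  else
    echo "unexpected status: $code" >&2
    return 1
  fi
}
```

Call it as `check_replay "$(jq -r '.next_token' /tmp/shithub-log.json)"`. Any status other than 401 returns non-zero, so a replay that is accepted fails the script.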

Expected results:

- `workflow_jobs.status = completed` and conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job cancellation, log-scrub, step-timeout, retention, queue-depth by label, claim-latency, runner-online, runner-stale, runner-draining, and runner-revocation metrics.

## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC, after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...` objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned; pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.

        390
        ```sh
      
        391
        systemctl list-timers shithubd-cron.timer
      
        392
        journalctl -u shithubd-cron.service -n 100
      
        393
        ```
      
        394
        
        395
        Manual smoke:
      
        396
        
        397
        ```sh
      
        398
        shithubd admin run-job workflow:cleanup
      
        399
        ```
      
        400
        
        401
        Expected behavior:
      
        402
        
        403
        - SQL log chunks older than 7 days for terminal steps are deleted.
      
        404
        - Expired artifact rows are deleted only after their `actions/runs/...`
      
        405
          objects are deleted from object storage.
      
        406
        - Unpinned terminal workflow runs older than 365 days are pruned;
      
        407
          pinned runs survive.
      
        408
        - Consumed runner JWT rows older than 30 days are pruned.
      
        409
        - `/metrics` exposes
      
        410
          `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.

# Actions runner smoke runbook

This runbook validates the runner-facing Actions path. `shithubd-runner`
now claims jobs, performs scoped `actions/checkout@v4`, and executes
containerized `run:` steps through Docker or Podman. The curl flow below
remains useful for token/replay debugging.

For host provisioning and the systemd/Ansible path, see
[runner-deploy.md](./runner-deploy.md).

Prereqs:

- Database migrations are current through `0067_runner_pool_ops.sql`.
- `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under `.shithub/workflows/*.yml` with
  `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run.
  Checkout and `run:` steps are executable; artifact aliases remain
  reserved until artifact transfer lands.

`runs-on` is a runner-label selector, not a hard-coded image name.
A workflow that says `runs-on: ubuntu-latest` can be claimed by any
runner advertising the `ubuntu-latest` label. The container image is
selected by the runner host's `engine.default_image` setting; the
reproducible Nix-built image is the default, but operators can point it
at another OCI image when they need closer Ubuntu parity.
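
The selector idea can be sketched in a few lines of shell. This is an
illustration only, not the server's claim query: `labels_match` is a
hypothetical helper, and it assumes that a job requesting several labels
needs all of them to be advertised by the runner.

```sh
# Sketch: succeed when every requested runs-on label appears in the
# runner's advertised, comma-separated label list.
# Assumption: multi-label requests require all labels to be present.
labels_match() {
  requested=$1
  advertised=$2
  old_ifs=$IFS
  IFS=,
  for want in $requested; do
    case ",$advertised," in
      *",$want,"*) ;;                      # label is advertised, keep going
      *) IFS=$old_ifs; return 1 ;;         # one missing label means no match
    esac
  done
  IFS=$old_ifs
  return 0
}
```

With the labels used in this runbook,
`labels_match ubuntu-latest self-hosted,linux,ubuntu-latest,x64` succeeds,
while a request for `windows-latest` against the same list does not.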

Register a runner:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

Save the returned token:

```sh
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
```

Run the binary:

```sh
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
```

## Pool Operations

List pool state:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

`list` includes labels, capacity, active job count, last heartbeat,
host name, runner version, drain state, and revoke state. `queue`
groups queued jobs by requested `runs-on` label so unsupported labels
are visible without querying Postgres.

Drain a runner before host maintenance:

```sh
shithubd admin runner drain --id 7 --reason 'kernel update'
```

The runner keeps heartbeating and can finish already claimed jobs, but
new heartbeat claims return 204 until it is undrained:

```sh
shithubd admin runner undrain --id 7
```

Rotate a registration token after a config-management change:

```sh
shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
```

Update the runner host config with the printed token, restart
`shithubd-runner`, then verify heartbeats with `runner list`.

Mark stale runners offline:

```sh
shithubd admin runner cleanup-stale --older-than 2m
```

Hard-revoke a compromised runner:

```sh
shithubd admin runner revoke --id 7 --reason 'host compromise'
```

Revocation records `revoked_at`, marks the runner offline, revokes every
registration token for that runner, and causes existing job API JWTs from
that runner to fail. Use drain for routine maintenance; use revoke when
the token or host may be compromised.

Destroy/recreate with DigitalOcean:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete <droplet-id>
```

After recreation, register a fresh runner token and let the provisioning
role write the new config. Do not reuse a token from a revoked runner.

Emergency controls:

- Stop all new claims: disable Actions at site level or drain every
  runner with `shithubd admin runner list --output json` followed by
  `shithubd admin runner drain --id <id>`.
- Pause one repo/org: set the repo/org Actions policy to disabled so the
  claim query leaves its queued jobs untouched.
- Cancel active work for a repo: use the repo Actions UI or job cancel
  API for each queued/running job.
- Revoke all runner tokens: iterate
  `shithubd admin runner revoke --id <id>` over every non-revoked runner.
- Fence the pool: block or destroy droplets tagged
  `shithub-actions-runner` in DigitalOcean, then rotate or revoke the
  affected runner tokens before allowing replacement hosts to connect.
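
The "drain every runner" control above can be scripted as a dry run. The
sketch below is one possible shape, not a shipped tool: `drain_commands`
is a hypothetical helper that reads runner ids on stdin and only prints
the `drain` commands, so the list can be reviewed before anything is
executed. It assumes the reason text contains no single quotes.

```sh
# Sketch: read runner ids (one per line) on stdin and print one drain
# command per id. Nothing is executed here; review the output, then pipe
# it to sh to apply it.
drain_commands() {
  reason=$1
  while IFS= read -r id; do
    # Skip blank lines and anything that is not a plain integer id.
    case $id in
      ''|*[!0-9]*) continue ;;
    esac
    printf "shithubd admin runner drain --id %s --reason '%s'\n" "$id" "$reason"
  done
}

# Possible usage. The jq filter is an assumption: it presumes that
# `runner list --output json` prints an array of objects with an `id` field.
#   shithubd admin runner list --output json \
#     | jq -r '.[].id' \
#     | drain_commands 'incident response'
```

The same loop works for the revoke control by swapping the printed
subcommand, since `revoke` takes the same `--id` and `--reason` flags.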

Equivalent config file:

```toml
[server]
base_url = "https://shithub.example"

[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]

[engine]
kind = "docker"
default_image = "ghcr.io/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
```

The config path defaults to `/etc/shithubd-runner/config.toml`.
Environment variables use the `SHITHUB_RUNNER_` prefix, for example
`SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`.
Use `--expires-in` only for tokens that your automation rotates before expiry;
the runner presents its registration token on every heartbeat.

## Arbitrary Repository Smoke

Use this checklist after provisioning a shared runner pool or changing runner
labels. The purpose is to prove that ordinary repositories can use the pool
without repo-specific labels.

Pick at least two repositories:

- `mfwolffe/scratch`, the historical dogfood repo;
- one additional public repository;
- one private repository if one is available for the operator account.

In each repository, commit this file as `.shithub/workflows/smoke.yml` on
`trunk`:

```yaml
name: Smoke
on:
  push:
    branches: [trunk]
jobs:
  green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify checkout
        run: test -f README.md || test -f readme.md || pwd
      - name: Smoke
        run: printf 'shithub actions smoke passed\n'
```

Expected results for each repo:

- the repo heading shows a green check after the push;
- the Actions run page shows `Triggered via push` on `refs/heads/trunk`;
- the `green` job is completed with conclusion `success`;
- the checkout step logs the scoped repository URL for that repo;
- the smoke step log contains `shithub actions smoke passed`;
- the check run attached to the commit agrees with the workflow run state;
- the downloaded or raw archived step log matches the in-page step log.

Confirm the pool is shared:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

The same online runner labels should satisfy both repositories. The smoke
workflow must use `runs-on: ubuntu-latest`; do not add per-repo labels for this
test.

Unsupported-label negative test:

```yaml
name: Unsupported runner label
on:
  workflow_dispatch:
jobs:
  nope:
    runs-on: windows-latest
    steps:
      - run: echo should-not-run
```

Trigger it manually. The run should stay queued and the run page should say
`Waiting for runner with labels: windows-latest`.
`shithubd admin runner queue --output json` should show one queued
`windows-latest` job with zero matching runners. Cancel the run after
confirming the diagnostic.

Untrusted-PR secret negative test:

1. Add a repo secret named `S41J_SECRET_SMOKE` with any non-production value.
2. Open an untrusted pull request whose workflow prints whether that secret is
   present.
3. Before approval, confirm the claimed job contains no injected secrets and
   logs do not contain the secret value.
4. Only after explicit approval should the run be allowed through the trusted
   secret path.
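
For step 2, the body of the workflow's `run:` step could look like the
sketch below. It rests on an assumption this runbook does not state: that
the workflow exposes the secret to the step as an environment variable of
the same name. `secret_state` is a hypothetical helper name. It reports
only presence or absence, so the value never reaches the log.

```sh
# Sketch: report whether the secret is visible to the step without ever
# printing its value.
# Assumption: the workflow maps the secret to an environment variable
# named S41J_SECRET_SMOKE.
secret_state() {
  if [ -n "${S41J_SECRET_SMOKE:-}" ]; then
    echo 'S41J_SECRET_SMOKE: present'
  else
    echo 'S41J_SECRET_SMOKE: absent'
  fi
}
```

Before approval the step log should read `S41J_SECRET_SMOKE: absent`;
after approval the same step should report `present`.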

Runner-outage negative test:

1. Drain every shared runner with `shithubd admin runner drain --id <id>`.
2. Push the smoke workflow to a test repo.
3. Confirm the run stays queued with the requested `ubuntu-latest` label visible.
4. Undrain one runner and confirm the same queued job is claimed and completes.

This smoke is considered passing only when scratch and the second repository
both complete from the same shared label set and the negative cases produce
clear queued/secret-denied behavior.

The Ansible runner role creates the `shithub-actions` bridge, runs the
allowlist resolver at `172.30.0.1`, and installs firewall rules that
reject direct-IP egress from step containers. If you run the binary
without the role, provision equivalent network controls before pointing
workflows at the runner.

## Curl token smoke

Claim a job:

```sh
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
  | tee /tmp/shithub-claim.json
```

Extract the job token and id:

```sh
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
```
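
A heartbeat that claims nothing leaves the claim file empty, and the `jq`
calls above then export empty or `null` values. A small guard can catch
that before the later calls fail in confusing ways. This is a sketch:
`claim_has_job` is a hypothetical helper, and it assumes that a heartbeat
with nothing to claim answers 204 with an empty body, as the drain section
describes for drained runners.

```sh
# Sketch: succeed only when the saved heartbeat response looks like a
# real claim.
claim_has_job() {
  file=$1
  # An empty or missing file means the heartbeat returned no body.
  [ -s "$file" ] || return 1
  # Without a "token" key there is nothing to authorize the job API calls.
  grep -q '"token"' "$file"
}

# Possible usage before extracting JOB_ID and JOB_TOKEN:
#   claim_has_job /tmp/shithub-claim.json || echo 'no job claimed' >&2
```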

Append a log chunk:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json

export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
```
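
The inline `base64` call is fine for a short line, but GNU `base64` wraps
its output at 76 columns, and a raw newline inside the JSON string would
make the body invalid. For longer chunks, a helper along these lines keeps
the payload on one line. It is a sketch: `log_payload` is a hypothetical
name, and the body shape (`seq`, `chunk`) is taken from the request above.

```sh
# Sketch: build the JSON body for the logs endpoint from a sequence number
# and one line of text.
log_payload() {
  seq=$1
  text=$2
  # tr removes the line wraps that some base64 implementations insert.
  chunk=$(printf '%s\n' "$text" | base64 | tr -d '\n')
  printf '{"seq":%d,"chunk":"%s"}\n' "$seq" "$chunk"
}

# Possible usage in place of the inline -d argument:
#   -d "$(log_payload 0 'hello from curl')"
```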

Complete the job:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
```

Cancel smoke: on a separate queued or running job, a repo-write PAT can
request cancellation:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

If the job is still queued, it becomes `cancelled` immediately. If it is
running, the next runner cancel check returns `{"cancelled":true}`, the
runner kills the active container, and the terminal job status becomes
`cancelled`.

Re-run smoke: after a completed or cancelled workflow run, a repo-write
PAT can enqueue a new run from the original workflow file at the
original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

Expected response includes a new `run_id`, the new `run_index`, and
`parent_run_id` equal to the source run. Confirm the new row has the
same `head_sha` as the source run.

Timeout smoke: create a workflow with a short deadline and a long-running
step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute
  deadline.
- The step row becomes `status=completed`,
  `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.

Replay check: the token saved in `/tmp/shithub-log.json` was already spent
on the completion call, so reusing it must fail with 401 because its `jti`
is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
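
To make the 401 check scriptable rather than visual, the status code can
be pulled out of the `curl -i` output. A minimal sketch follows;
`http_status` is a hypothetical helper, not part of any shithub tooling.

```sh
# Sketch: print the status code from the first HTTP status line on stdin.
http_status() {
  # Drop carriage returns first so the code compares cleanly as a string.
  tr -d '\r' | awk '/^HTTP\// { print $2; exit }'
}

# Possible usage with the replay request above (add -s to silence progress):
#   code=$(curl -si "$BASE/api/v1/jobs/$JOB_ID/status" ... | http_status)
#   [ "$code" = 401 ] || echo "replay was not rejected: $code" >&2
```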

Expected results:

- `workflow_jobs.status = completed` and conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all
  jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job
  cancellation, log-scrub, step-timeout, retention, queue-depth by label,
  claim-latency, runner-online, runner-stale, runner-draining, and
  runner-revocation metrics.

## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC,
after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...`
  objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned;
  pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes
  `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.
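
One way to confirm the sweep did something is to compare the pruned
counter before and after the manual run. The sketch below sums every
series of one metric from Prometheus text exposition on stdin.
`metric_total` is a hypothetical helper, and it assumes label values
contain no spaces, which holds for the `kind` values listed above.

```sh
# Sketch: sum all series of one metric from Prometheus text format on stdin.
metric_total() {
  name=$1
  awk -v name="$name" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      series = $1
      brace = index(series, "{")       # strip the {labels} part, if any
      if (brace) series = substr(series, 1, brace - 1)
      if (series == name) total += $2
    }
    END { printf "%d\n", total }
  '
}

# Possible usage around the manual smoke:
#   before=$(curl -fsS "$BASE/metrics" | metric_total shithub_actions_runs_pruned_total)
#   shithubd admin run-job workflow:cleanup
#   after=$(curl -fsS "$BASE/metrics" | metric_total shithub_actions_runs_pruned_total)
```

A larger `after` value shows the job pruned rows; an unchanged value is
also valid when nothing was old enough to prune.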