# Actions runner smoke runbook

This runbook validates the runner-facing Actions path. `shithubd-runner` now claims jobs, performs scoped `actions/checkout@v4`, and executes containerized `run:` steps through Docker or Podman. The curl flow below remains useful for token/replay debugging.

For host provisioning and the systemd/Ansible path, see [runner-deploy.md](./runner-deploy.md).

Prereqs:

- Database migrations are current through `0067_runner_pool_ops.sql`.
- `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
- Object storage is configured if testing artifact upload.
- Docker or Podman is installed on the runner host.
- A repo has a workflow under `.shithub/workflows/*.yml` with `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run. Checkout and `run:` steps are executable; artifact aliases remain reserved until artifact transfer lands.

`runs-on` is a runner-label selector, not a hard-coded image name. A workflow that says `runs-on: ubuntu-latest` can be claimed by any runner advertising the `ubuntu-latest` label. The container image is selected by the runner host's `engine.default_image` setting; the reproducible Nix-built image is the default, but operators can point it at another OCI image when they need closer Ubuntu parity.
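
To make the selector semantics concrete, here is a minimal sketch of label matching in POSIX shell. It assumes a job is claimable when every requested `runs-on` label appears in the runner's advertised label set. The helper name `runner_matches` is hypothetical, and the server's real claim query may apply different rules.

```sh
# Hypothetical helper: succeed when every requested label is advertised.
# Usage: runner_matches "<advertised,comma,list>" "<requested,comma,list>"
runner_matches() (
  set -f                     # no glob expansion while splitting labels
  advertised=",$1,"
  IFS=,                      # split the requested list on commas (subshell only)
  for want in $2; do
    case "$advertised" in
      *",$want,"*) ;;        # label is advertised, keep checking
      *) exit 1 ;;           # missing label: this runner cannot claim the job
    esac
  done
  exit 0
)
```

For example, `runner_matches "self-hosted,linux,ubuntu-latest,x64" "ubuntu-latest"` succeeds, while the same call with `windows-latest` fails.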

Register a runner:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

Save the returned token:

```sh
export RUNNER_TOKEN='<printed-token>'
export BASE='https://shithub.example'
```

Run the binary:

```sh
shithubd-runner run \
  --server-url "$BASE" \
  --token "$RUNNER_TOKEN" \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --workspace-root /var/lib/shithubd-runner/workspaces \
  --network shithub-actions \
  --dns-servers 172.30.0.1
```

## Pool Operations

List pool state:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

`list` includes labels, capacity, active job count, last heartbeat, host name, runner version, drain state, and revoke state. `queue` groups queued jobs by requested `runs-on` label so unsupported labels are visible without querying Postgres.

Drain a runner before host maintenance:

```sh
shithubd admin runner drain --id 7 --reason 'kernel update'
```

The runner keeps heartbeating and can finish already claimed jobs, but new heartbeat claims return 204 until it is undrained:

```sh
shithubd admin runner undrain --id 7
```
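
When probing a drained runner by hand, it can help to classify the heartbeat response by status code. The sketch below is illustrative only. Only the 204 case comes from this runbook; the meanings given to other 2xx codes and to 401 are assumptions, and `describe_heartbeat` is a hypothetical name.

```sh
# Hypothetical helper: describe a heartbeat response by its HTTP status code.
describe_heartbeat() {
  case "$1" in
    204) echo "no-job" ;;      # drained runner (or, by assumption, nothing to claim)
    2??) echo "claimed" ;;     # assumed: a JSON body with the job and token follows
    401) echo "rejected" ;;    # assumed: the token was revoked, expired, or replayed
    *)   echo "unexpected" ;;
  esac
}
```

One way to obtain the code is to add `-o /tmp/shithub-claim.json -w '%{http_code}'` to the heartbeat `curl` call and pass the printed value to `describe_heartbeat`.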

Rotate a registration token after a config-management change:

```sh
shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
```

Update the runner host config with the printed token, restart `shithubd-runner`, then verify heartbeats with `runner list`.

Mark stale runners offline:

```sh
shithubd admin runner cleanup-stale --older-than 2m
```

Hard-revoke a compromised runner:

```sh
shithubd admin runner revoke --id 7 --reason 'host compromise'
```

Revocation records `revoked_at`, marks the runner offline, revokes every registration token for that runner, and causes existing job API JWTs from that runner to fail. Use drain for routine maintenance; use revoke when the token or host may be compromised.

Destroy/recreate with DigitalOcean:

```sh
doctl compute droplet list --tag-name shithub-actions-runner
doctl compute droplet delete <droplet-id>
```

After recreation, register a fresh runner token and let the provisioning role write the new config. Do not reuse a token from a revoked runner.

Emergency controls:

- Stop all new claims: disable Actions at site level or drain every runner with `shithubd admin runner list --output json` followed by `shithubd admin runner drain --id <id>`.
- Pause one repo/org: set the repo/org Actions policy to disabled so the claim query leaves its queued jobs untouched.
- Cancel active work for a repo: use the repo Actions UI or job cancel API for each queued/running job.
- Revoke all runner tokens: iterate `shithubd admin runner revoke --id <id>` over every non-revoked runner.
- Fence the pool: block or destroy droplets tagged `shithub-actions-runner` in DigitalOcean, then rotate or revoke the affected runner tokens before allowing replacement hosts to connect.
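
Draining every runner means one `drain` call per id. A minimal sketch of that loop follows, written as a dry run that only prints the commands so they can be reviewed first. The helper name `print_drain_commands` is hypothetical, and the reason text must not contain a single quote.

```sh
# Hypothetical helper: read runner ids (one per line) on stdin and print
# one drain command per id; nothing is executed here.
print_drain_commands() {
  reason="$1"
  while IFS= read -r id; do
    # Skip blank lines and anything that is not a plain integer id.
    case "$id" in
      ''|*[!0-9]*) continue ;;
    esac
    printf "shithubd admin runner drain --id %s --reason '%s'\n" "$id" "$reason"
  done
}
```

Assuming the `list` JSON is an array of objects with an `id` field (not confirmed here), `shithubd admin runner list --output json | jq -r '.[].id' | print_drain_commands 'site freeze'` prints the commands, which can then be piped to `sh`.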

The site-level disable path is a hard kill switch and overrides repo/org Actions policy. Until a site-admin UI exists, use:

```sql
UPDATE actions_site_policy
   SET actions_enabled = false,
       updated_at = now()
 WHERE id = true;
```

The public shared-runner rollout criteria live in [`actions-public-runners.md`](../actions-public-runners.md).

Equivalent config file:

```toml
[server]
base_url = "https://shithub.example"

[runner]
token = "<printed-token>"
labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
capacity = 1
poll_interval = "5s"
workspace_root = "/var/lib/shithubd-runner/workspaces"
workspace_ttl = "24h"
network_allowlist = [
  "api.github.com",
  "auth.docker.io",
  "codeload.github.com",
  "github.com",
  "objects.githubusercontent.com",
  "production.cloudflare.docker.com",
  "registry-1.docker.io",
  "*.githubusercontent.com",
]

[engine]
kind = "docker"
default_image = "ghcr.io/tenseleyflow/shithub/runner-nix:1.0"
network = "shithub-actions"
memory = "2g"
cpus = "2"
seccomp_profile = "/etc/shithubd-runner/seccomp.json"
user = "65534:65534"
pids_limit = 512
dns_servers = ["172.30.0.1"]
```

The config path defaults to `/etc/shithubd-runner/config.toml`. Environment variables use the `SHITHUB_RUNNER_` prefix, for example `SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`. Use `--expires-in` only for tokens that your automation rotates before expiry; the runner presents its registration token on every heartbeat.

## Arbitrary Repository Smoke

Use this checklist after provisioning a shared runner pool or changing runner labels. The purpose is to prove that ordinary repositories can use the pool without repo-specific labels.

Pick at least two repositories:

- `mfwolffe/scratch`, the historical dogfood repo;
- one additional public repository;
- one private repository if one is available for the operator account.

In each repository, commit this file as `.shithub/workflows/smoke.yml` on `trunk`:

```yaml
name: Smoke
on:
  push:
    branches: [trunk]
jobs:
  green:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Verify checkout
        run: test -f README.md || test -f readme.md || pwd
      - name: Smoke
        run: printf 'shithub actions smoke passed\n'
```

Expected results for each repo:

- the repo heading shows a green check after the push;
- the Actions run page shows `Triggered via push` on `refs/heads/trunk`;
- the `green` job is completed with conclusion `success`;
- the checkout step logs the scoped repository URL for that repo;
- the smoke step log contains `shithub actions smoke passed`;
- the check run attached to the commit agrees with the workflow run state;
- the downloaded or raw archived step log matches the in-page step log.

Confirm the pool is shared:

```sh
shithubd admin runner list --output json
shithubd admin runner queue --output json
```

The same online runner labels should satisfy both repositories. The smoke workflow must use `runs-on: ubuntu-latest`; do not add per-repo labels for this test.

Unsupported-label negative test:

```yaml
name: Unsupported runner label
on:
  workflow_dispatch:
jobs:
  nope:
    runs-on: windows-latest
    steps:
      - run: echo should-not-run
```

Trigger it manually. The run should stay queued and the run page should say `Waiting for runner with labels: windows-latest`. `shithubd admin runner queue --output json` should show one queued `windows-latest` job with zero matching runners. Cancel the run after confirming the diagnostic.

Untrusted-PR secret negative test:

1. Add a repo secret named `S41J_SECRET_SMOKE` with any non-production value.
2. Open an untrusted pull request whose workflow prints whether that secret is present.
3. Before approval, confirm the claimed job contains no injected secrets and logs do not contain the secret value.
4. Only after explicit approval should the run be allowed through the trusted secret path.

Runner-outage negative test:

1. Drain every shared runner with `shithubd admin runner drain --id <id>`.
2. Push the smoke workflow to a test repo.
3. Confirm the run stays queued with the requested `ubuntu-latest` label visible.
4. Undrain one runner and confirm the same queued job is claimed and completes.

This smoke is considered passing only when scratch and the second repository both complete from the same shared label set and the negative cases produce clear queued/secret-denied behavior.

The Ansible runner role creates the `shithub-actions` bridge, runs the allowlist resolver at `172.30.0.1`, and installs firewall rules that reject direct-IP egress from step containers. If you run the binary without the role, provision equivalent network controls before pointing workflows at the runner.

## Curl token smoke

Claim a job:

```sh
curl -fsS "$BASE/api/v1/runners/heartbeat" \
  -H "Authorization: Bearer $RUNNER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
  | tee /tmp/shithub-claim.json
```

Extract the job token and id:

```sh
export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
```
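
If the heartbeat did not claim a job, `jq -r` prints either nothing (empty body) or the literal string `null` (missing field), and the later calls fail in confusing ways. A small guard, sketched here with the hypothetical name `require_value`, catches that early.

```sh
# Hypothetical helper: fail fast when an extracted value is empty or "null".
# Usage: require_value <name> <value>
require_value() {
  name="$1"
  value="$2"
  case "$value" in
    ''|null)
      echo "missing $name: the heartbeat probably did not claim a job" >&2
      return 1
      ;;
  esac
}
```

A typical use is `require_value JOB_ID "$JOB_ID" && require_value JOB_TOKEN "$JOB_TOKEN"` before appending log chunks.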

Append a log chunk:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
  | tee /tmp/shithub-log.json

export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
```
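
For chunks longer than a short test string, some `base64` implementations wrap their output across lines, which would break the inline JSON. The sketch below builds the same `seq`/`chunk` body while stripping those newlines. The helper name `log_payload` is hypothetical.

```sh
# Hypothetical helper: build the JSON body for the logs call.
# Reads the raw chunk on stdin; $1 is the sequence number.
log_payload() {
  # Strip newlines because some base64 implementations wrap long output.
  chunk="$(base64 | tr -d '\n')"
  # The base64 alphabet needs no JSON escaping, so plain printf is enough.
  printf '{"seq":%s,"chunk":"%s"}\n' "$1" "$chunk"
}
```

The result can be passed to the call above with `-d "$(printf 'hello from curl\n' | log_payload 0)"`.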

Complete the job:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $JOB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"status":"completed","conclusion":"success"}'
```

Cancel smoke: on a separate queued or running job, a repo-write PAT can request cancellation:

```sh
curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

If the job is still queued, it becomes `cancelled` immediately. If it is running, the next runner cancel check returns `{"cancelled":true}`, the runner kills the active container, and the terminal job status becomes `cancelled`.

Re-run smoke: after a completed or cancelled workflow run, a repo-write PAT can enqueue a new run from the original workflow file at the original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

Expected response includes a new `run_id`, the new `run_index`, and `parent_run_id` equal to the source run. Confirm the new row has the same `head_sha` as the source run.

Timeout smoke: create a workflow with a short deadline and a long-running step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute deadline.
- The step row becomes `status=completed`, `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.
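
To confirm the counter actually moved, compare its value before and after the timeout run. The sketch below sums one metric from text piped on stdin. It assumes `/metrics` serves the Prometheus text exposition format without trailing timestamps, and `metric_total` is a hypothetical helper name.

```sh
# Hypothetical helper: sum every sample of one metric from Prometheus-style
# text on stdin, ignoring comment lines and label sets.
metric_total() {
  awk -v name="$1" '
    /^#/ { next }                                   # skip HELP and TYPE comments
    {
      i = index($1, "{")                            # labels start at the first brace
      metric = (i > 0) ? substr($1, 1, i - 1) : $1
      if (metric == name) total += $NF              # sample value is the last field
    }
    END { printf "%d\n", total }
  '
}
```

For example, capture `curl -fsS "$BASE/metrics" | metric_total shithub_actions_step_timeouts_total` before and after the run and check that the second value is larger. Serving metrics under `$BASE/metrics` is an assumption; use whatever address your deployment exposes.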

Replay check: reusing the log token after the log call must fail with 401 because its `jti` is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
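
When the replay check runs from a script rather than by eye, the status code can be asserted directly. This is a minimal sketch; `assert_replay_rejected` is a hypothetical name, and the only expectation taken from this runbook is the 401.

```sh
# Hypothetical helper: pass only when a replayed token was rejected with 401.
# Usage: assert_replay_rejected <http-status-code>
assert_replay_rejected() {
  if [ "$1" = "401" ]; then
    echo "ok: replayed token rejected"
  else
    echo "FAIL: expected 401 for a replayed token, got $1" >&2
    return 1
  fi
}
```

To feed it, replace `-i` in the call above with `-s -o /dev/null -w '%{http_code}'` and pass the printed code as the argument.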

Expected results:

- `workflow_jobs.status = completed` and conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job cancellation, log-scrub, step-timeout, retention, queue-depth by label, claim-latency, runner-online, runner-stale, runner-draining, and runner-revocation metrics.

## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC, after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...` objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned; pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.

View source

  
        1
        # Actions runner smoke runbook
      
        2
        
        3
        This runbook validates the runner-facing Actions path. `shithubd-runner`
      
        4
        now claims jobs, performs scoped `actions/checkout@v4`, and executes
      
        5
        containerized `run:` steps through Docker or Podman. The curl flow below
      
        6
        remains useful for token/replay debugging.
      
        7
        
        8
        For host provisioning and the systemd/Ansible path, see
      
        9
        [runner-deploy.md](./runner-deploy.md).
      
        10
        
        11
        Prereqs:
      
        12
        
        13
        - Database migrations are current through `0067_runner_pool_ops.sql`.
      
        14
        - `SHITHUB_TOTP_KEY` or `auth.totp_key_b64` is set on the web process.
      
        15
        - Object storage is configured if testing artifact upload.
      
        16
        - Docker or Podman is installed on the runner host.
      
        17
        - A repo has a workflow under `.shithub/workflows/*.yml` with
      
        18
          `runs-on: ubuntu-latest`, and a push/dispatch has enqueued a run.
      
        19
          Checkout and `run:` steps are executable; artifact aliases remain
      
        20
          reserved until artifact transfer lands.
      
        21
        
        22
        `runs-on` is a runner-label selector, not a hard-coded image name.
      
        23
        A workflow that says `runs-on: ubuntu-latest` can be claimed by any
      
        24
        runner advertising the `ubuntu-latest` label. The container image is
      
        25
        selected by the runner host's `engine.default_image` setting; the
      
        26
        reproducible Nix-built image is the default, but operators can point it
      
        27
        at another OCI image when they need closer Ubuntu parity.
      
        28
        
        29
        Register a runner:
      
        30
        
        31
        ```sh
      
        32
        shithubd admin runner register \
      
        33
          --name runner-1 \
      
        34
          --labels self-hosted,linux,ubuntu-latest,x64 \
      
        35
          --capacity 1 \
      
        36
          --output json
      
        37
        ```
      
        38
        
        39
        Save the returned token:
      
        40
        
        41
        ```sh
      
        42
        export RUNNER_TOKEN='<printed-token>'
      
        43
        export BASE='https://shithub.example'
      
        44
        ```
      
        45
        
        46
        Run the binary:
      
        47
        
        48
        ```sh
      
        49
        shithubd-runner run \
      
        50
          --server-url "$BASE" \
      
        51
          --token "$RUNNER_TOKEN" \
      
        52
          --labels self-hosted,linux,ubuntu-latest,x64 \
      
        53
          --workspace-root /var/lib/shithubd-runner/workspaces \
      
        54
          --network shithub-actions \
      
        55
          --dns-servers 172.30.0.1
      
        56
        ```
      
        57
        
        58
        ## Pool Operations
      
        59
        
        60
        List pool state:
      
        61
        
        62
        ```sh
      
        63
        shithubd admin runner list --output json
      
        64
        shithubd admin runner queue --output json
      
        65
        ```
      
        66
        
        67
        `list` includes labels, capacity, active job count, last heartbeat,
      
        68
        host name, runner version, drain state, and revoke state. `queue`
      
        69
        groups queued jobs by requested `runs-on` label so unsupported labels
      
        70
        are visible without querying Postgres.
      
        71
        
        72
        Drain a runner before host maintenance:
      
        73
        
        74
        ```sh
      
        75
        shithubd admin runner drain --id 7 --reason 'kernel update'
      
        76
        ```
      
        77
        
        78
        The runner keeps heartbeating and can finish already claimed jobs, but
      
        79
        new heartbeat claims return 204 until it is undrained:
      
        80
        
        81
        ```sh
      
        82
        shithubd admin runner undrain --id 7
      
        83
        ```
      
        84
        
        85
        Rotate a registration token after a config-management change:
      
        86
        
        87
        ```sh
      
        88
        shithubd admin runner rotate-token --id 7 --expires-in 24h --output json
      
        89
        ```
      
        90
        
        91
        Update the runner host config with the printed token, restart
      
        92
        `shithubd-runner`, then verify heartbeats with `runner list`.
      
        93
        
        94
        Mark stale runners offline:
      
        95
        
        96
        ```sh
      
        97
        shithubd admin runner cleanup-stale --older-than 2m
      
        98
        ```
      
        99
        
        100
        Hard-revoke a compromised runner:
      
        101
        
        102
        ```sh
      
        103
        shithubd admin runner revoke --id 7 --reason 'host compromise'
      
        104
        ```
      
        105
        
        106
        Revocation records `revoked_at`, marks the runner offline, revokes every
      
        107
        registration token for that runner, and causes existing job API JWTs from
      
        108
        that runner to fail. Use drain for routine maintenance; use revoke when
      
        109
        the token or host may be compromised.
      
        110
        
        111
        Destroy/recreate with DigitalOcean:
      
        112
        
        113
        ```sh
      
        114
        doctl compute droplet list --tag-name shithub-actions-runner
      
        115
        doctl compute droplet delete <droplet-id>
      
        116
        ```
      
        117
        
        118
        After recreation, register a fresh runner token and let the provisioning
      
        119
        role write the new config. Do not reuse a token from a revoked runner.
      
        120
        
        121
        Emergency controls:
      
        122
        
        123
        - Stop all new claims: disable Actions at site level or drain every
      
        124
          runner with `shithubd admin runner list --output json` followed by
      
        125
          `shithubd admin runner drain --id <id>`.
      
        126
        - Pause one repo/org: set the repo/org Actions policy to disabled so the
      
        127
          claim query leaves its queued jobs untouched.
      
        128
        - Cancel active work for a repo: use the repo Actions UI or job cancel
      
        129
          API for each queued/running job.
      
        130
        - Revoke all runner tokens: iterate `shithubd admin runner revoke --id
      
        131
          <id>` over every non-revoked runner.
      
        132
        - Fence the pool: block or destroy droplets tagged
      
        133
          `shithub-actions-runner` in DigitalOcean, then rotate or revoke the
      
        134
          affected runner tokens before allowing replacement hosts to connect.
      
        135
        
        136
        The site-level disable path is a hard kill switch and overrides repo/org
      
        137
        Actions policy. Until a site-admin UI exists, use:
      
        138
        
        139
        ```sql
      
        140
        UPDATE actions_site_policy
      
        141
           SET actions_enabled = false,
      
        142
               updated_at = now()
      
        143
         WHERE id = true;
      
        144
        ```
      
        145
        
        146
        The public shared-runner rollout criteria live in
      
        147
        [`actions-public-runners.md`](../actions-public-runners.md).
      
        148
        
        149
        Equivalent config file:
      
        150
        
        151
        ```toml
      
        152
        [server]
      
        153
        base_url = "https://shithub.example"
      
        154
        
        155
        [runner]
      
        156
        token = "<printed-token>"
      
        157
        labels = ["self-hosted", "linux", "ubuntu-latest", "x64"]
      
        158
        capacity = 1
      
        159
        poll_interval = "5s"
      
        160
        workspace_root = "/var/lib/shithubd-runner/workspaces"
      
        161
        workspace_ttl = "24h"
      
        162
        network_allowlist = [
      
        163
          "api.github.com",
      
        164
          "auth.docker.io",
      
        165
          "codeload.github.com",
      
        166
          "github.com",
      
        167
          "objects.githubusercontent.com",
      
        168
          "production.cloudflare.docker.com",
      
        169
          "registry-1.docker.io",
      
        170
          "*.githubusercontent.com",
      
        171
        ]
      
        172
        
        173
        [engine]
      
        174
        kind = "docker"
      
        175
        default_image = "ghcr.io/tenseleyflow/shithub/runner-nix:1.0"
      
        176
        network = "shithub-actions"
      
        177
        memory = "2g"
      
        178
        cpus = "2"
      
        179
        seccomp_profile = "/etc/shithubd-runner/seccomp.json"
      
        180
        user = "65534:65534"
      
        181
        pids_limit = 512
      
        182
        dns_servers = ["172.30.0.1"]
      
        183
        ```
      
        184
        
        185
        The config path defaults to `/etc/shithubd-runner/config.toml`.
      
        186
        Environment variables use the `SHITHUB_RUNNER_` prefix, for example
      
        187
        `SHITHUB_RUNNER_TOKEN` or `SHITHUB_RUNNER_SERVER__BASE_URL`.
      
        188
        Use `--expires-in` only for tokens that your automation rotates before expiry;
      
        189
        the runner presents its registration token on every heartbeat.
      
        190
        
        191
        ## Arbitrary Repository Smoke
      
        192
        
        193
        Use this checklist after provisioning a shared runner pool or changing runner
      
        194
        labels. The purpose is to prove that ordinary repositories can use the pool
      
        195
        without repo-specific labels.
      
        196
        
        197
        Pick at least two repositories:
      
        198
        
        199
        - `mfwolffe/scratch`, the historical dogfood repo;
      
        200
        - one additional public repository;
      
        201
        - one private repository if one is available for the operator account.
      
        202
        
        203
        In each repository, commit this file as `.shithub/workflows/smoke.yml` on
      
        204
        `trunk`:
      
        205
        
        206
        ```yaml
      
        207
        name: Smoke
      
        208
        on:
      
        209
          push:
      
        210
            branches: [trunk]
      
        211
        jobs:
      
        212
          green:
      
        213
            runs-on: ubuntu-latest
      
        214
            steps:
      
        215
              - uses: actions/checkout@v4
      
        216
              - name: Verify checkout
      
        217
                run: test -f README.md || test -f readme.md || pwd
      
        218
              - name: Smoke
      
        219
                run: printf 'shithub actions smoke passed\n'
      
        220
        ```
      
        221
        
        222
        Expected results for each repo:
      
        223
        
        224
        - the repo heading shows a green check after the push;
      
        225
        - the Actions run page shows `Triggered via push` on `refs/heads/trunk`;
      
        226
        - the `green` job is completed with conclusion `success`;
      
        227
        - the checkout step logs the scoped repository URL for that repo;
      
        228
        - the smoke step log contains `shithub actions smoke passed`;
      
        229
        - the check run attached to the commit agrees with the workflow run state;
      
        230
        - the downloaded or raw archived step log matches the in-page step log.
      
        231
        
        232
        Confirm the pool is shared:
      
        233
        
        234
        ```sh
      
        235
        shithubd admin runner list --output json
      
        236
        shithubd admin runner queue --output json
      
        237
        ```
      
        238
        
        239
        The same online runner labels should satisfy both repositories. The smoke
      
        240
        workflow must use `runs-on: ubuntu-latest`; do not add per-repo labels for this
      
        241
        test.
      
        242
        
        243
        Unsupported-label negative test:
      
        244
        
        245
        ```yaml
      
        246
        name: Unsupported runner label
      
        247
        on:
      
        248
          workflow_dispatch:
      
        249
        jobs:
      
        250
          nope:
      
        251
            runs-on: windows-latest
      
        252
            steps:
      
        253
              - run: echo should-not-run
      
        254
        ```
      
        255
        
        256
        Trigger it manually. The run should stay queued and the run page should say
      
        257
        `Waiting for runner with labels: windows-latest`. `shithubd admin runner queue
      
        258
        --output json` should show one queued `windows-latest` job with zero matching
      
        259
        runners. Cancel the run after confirming the diagnostic.
      
        260
        
        261
        Untrusted-PR secret negative test:
      
        262
        
        263
        1. Add a repo secret named `S41J_SECRET_SMOKE` with any non-production value.
      
        264
        2. Open an untrusted pull request whose workflow prints whether that secret is
      
        265
           present.
      
        266
        3. Before approval, confirm the claimed job contains no injected secrets and
      
        267
           logs do not contain the secret value.
      
        268
        4. Only after explicit approval should the run be allowed through the trusted
      
        269
           secret path.
      
        270
        
        271
        Runner-outage negative test:
      
        272
        
        273
        1. Drain every shared runner with `shithubd admin runner drain --id <id>`.
      
        274
        2. Push the smoke workflow to a test repo.
      
        275
        3. Confirm the run stays queued with the requested `ubuntu-latest` label visible.
      
        276
        4. Undrain one runner and confirm the same queued job is claimed and completes.
      
        277
        
        278
        This smoke is considered passing only when scratch and the second repository
      
        279
        both complete from the same shared label set and the negative cases produce
      
        280
        clear queued/secret-denied behavior.
      
        281
        
        282
        The Ansible runner role creates the `shithub-actions` bridge, runs the
      
        283
        allowlist resolver at `172.30.0.1`, and installs firewall rules that
      
        284
        reject direct-IP egress from step containers. If you run the binary
      
        285
        without the role, provision equivalent network controls before pointing
      
        286
        workflows at the runner.
      
        287
        
        288
        ## Curl token smoke
      
        289
        
        290
        Claim a job:
      
        291
        
        292
        ```sh
      
        293
        curl -fsS "$BASE/api/v1/runners/heartbeat" \
      
        294
          -H "Authorization: Bearer $RUNNER_TOKEN" \
      
        295
          -H "Content-Type: application/json" \
      
        296
          -d '{"labels":["self-hosted","linux","ubuntu-latest","x64"],"capacity":1,"host_name":"curl-smoke","version":"manual"}' \
      
        297
          | tee /tmp/shithub-claim.json
      
        298
        ```
      
        299
        
        300
        Extract the job token and id:
      
        301
        
        302
        ```sh
      
        303
        export JOB_ID="$(jq -r '.job.id' /tmp/shithub-claim.json)"
      
        304
        export JOB_TOKEN="$(jq -r '.token' /tmp/shithub-claim.json)"
      
        305
        ```
      
        306
        
        307
        Append a log chunk:
      
        308
        
        309
        ```sh
      
        310
        curl -fsS "$BASE/api/v1/jobs/$JOB_ID/logs" \
      
        311
          -H "Authorization: Bearer $JOB_TOKEN" \
      
        312
          -H "Content-Type: application/json" \
      
        313
          -d "{\"seq\":0,\"chunk\":\"$(printf 'hello from curl\n' | base64)\"}" \
      
        314
          | tee /tmp/shithub-log.json
      
        315
        
        316
        export JOB_TOKEN="$(jq -r '.next_token' /tmp/shithub-log.json)"
      
        317
        ```
      
        318
        
        319
        Complete the job:
      
        320
        
        321
        ```sh
      
        322
        curl -fsS "$BASE/api/v1/jobs/$JOB_ID/status" \
      
        323
          -H "Authorization: Bearer $JOB_TOKEN" \
      
        324
          -H "Content-Type: application/json" \
      
        325
          -d '{"status":"completed","conclusion":"success"}'
      
        326
        ```
      
        327
        
        328
        Cancel smoke: on a separate queued or running job, a repo-write PAT can
      
        329
        request cancellation:
      
        330
        
        331
        ```sh
      
        332
        curl -fsS "$BASE/api/v1/jobs/$JOB_ID/cancel" \
      
        333
          -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
      
        334
          -X POST
      
        335
        ```
      
        336
        
        337
        If the job is still queued, it becomes `cancelled` immediately. If it is
      
        338
        running, the next runner cancel check returns `{"cancelled":true}`, the
      
        339
        runner kills the active container, and the terminal job status becomes
      
        340
        `cancelled`.
      
Re-run smoke: after a completed or cancelled workflow run, a repo-write
PAT can enqueue a new run from the original workflow file at the
original commit:

```sh
curl -fsS "$BASE/api/v1/runs/$RUN_ID/rerun" \
  -H "Authorization: Bearer $PAT_WITH_REPO_WRITE" \
  -X POST
```

Expected response includes a new `run_id`, the new `run_index`, and
`parent_run_id` equal to the source run. Confirm the new row has the
same `head_sha` as the source run.
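
The response fields can be checked mechanically. This is a minimal sketch, assuming the response is a flat JSON object carrying the `run_id`, `run_index`, and `parent_run_id` keys named above, and that `jq` is installed; `check_rerun` is a hypothetical helper name.

```sh
# check_rerun SOURCE_RUN_ID: read a rerun response on stdin and succeed only if
# it names a new run whose parent is SOURCE_RUN_ID (hypothetical helper).
# Assumes a flat JSON object; ids may be JSON numbers or strings.
check_rerun() {
  jq -e --arg src "$1" '
    .run_id != null
    and .run_index != null
    and (.parent_run_id | tostring) == $src
  ' >/dev/null
}

# Example with a made-up response body:
printf '{"run_id":42,"run_index":3,"parent_run_id":41}' | check_rerun 41 && echo ok
```

Against a live server, the rerun call can be piped straight into `check_rerun "$RUN_ID"`. The `head_sha` comparison still needs a look at the new row.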
      
Timeout smoke: create a workflow with a short deadline and a long-running
step:

```yaml
jobs:
  timeout_probe:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - run: sleep 600
```

Expected results:

- The runner kills the active container shortly after the one-minute
  deadline.
- The step row becomes `status=completed`,
  `conclusion=timed_out`.
- The job row becomes `status=completed`, `conclusion=timed_out`.
- `/metrics` increments `shithub_actions_step_timeouts_total`.
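
One way to confirm the counter moved is to read it before and after the run. This is a minimal sketch, assuming `/metrics` serves the Prometheus text exposition format without per-sample timestamps; `metric_total` is a hypothetical helper name.

```sh
# metric_total NAME: sum every sample of NAME from Prometheus text-format
# input on stdin, across all label sets (hypothetical helper).
metric_total() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comments
    {
      metric = $1
      sub(/[{].*/, "", metric)          # drop the label set, if any
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { printf "%d\n", total }
  '
}

# Example with made-up samples:
printf 'shithub_actions_step_timeouts_total 3\nother_metric 9\n' \
  | metric_total shithub_actions_step_timeouts_total
```

In practice the input would come from `curl -fsS "$BASE/metrics"`, captured once before pushing the workflow and once after the job reaches `timed_out`.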
      
Replay check: the `next_token` returned by the log call was already spent
on the completion call, so reusing it must fail with 401 because its
`jti` is already present in `runner_jwt_used`.

```sh
curl -i "$BASE/api/v1/jobs/$JOB_ID/status" \
  -H "Authorization: Bearer $(jq -r '.next_token' /tmp/shithub-log.json)" \
  -H "Content-Type: application/json" \
  -d '{"status":"running"}'
```
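
If the replay is not rejected as expected, it can help to see which `jti` the token carries before looking it up in `runner_jwt_used`. This is a minimal sketch, assuming job tokens are standard compact JWTs (three base64url segments) and that `base64 -d` is available as in GNU coreutils. `jwt_payload` is a hypothetical helper name, and it does not verify the signature.

```sh
# jwt_payload TOKEN: print the decoded payload segment (hypothetical helper).
jwt_payload() {
  # Take the second dot-separated segment and map base64url to plain base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops the trailing "=" padding; restore it before decoding.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Example with a made-up, unsigned token:
payload=$(printf '%s' '{"jti":"abc123"}' | base64 | tr '/+' '_-' | tr -d '=')
jwt_payload "header.$payload.signature"
```

With a real token, `jwt_payload "$JOB_TOKEN" | jq -r '.jti'` prints the identifier to search for.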
      
Expected results:

- `workflow_jobs.status = completed` and conclusion `success`.
- The parent `workflow_runs` row rolls up to completed/success when all
  jobs are terminal.
- The PR Checks tab shows the matching check run as success.
- `/metrics` includes runner registration, heartbeat, JWT, job
  cancellation, log-scrub, step-timeout, retention, queue-depth by label,
  claim-latency, runner-online, runner-stale, runner-draining, and
  runner-revocation metrics.
      
## Retention Sweep

The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC,
after the 03:17 backup window:

```sh
systemctl list-timers shithubd-cron.timer
journalctl -u shithubd-cron.service -n 100
```

Manual smoke:

```sh
shithubd admin run-job workflow:cleanup
```

Expected behavior:

- SQL log chunks older than 7 days for terminal steps are deleted.
- Expired artifact rows are deleted only after their `actions/runs/...`
  objects are deleted from object storage.
- Unpinned terminal workflow runs older than 365 days are pruned;
  pinned runs survive.
- Consumed runner JWT rows older than 30 days are pruned.
- `/metrics` exposes
  `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.
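
To see which counters a manual sweep moved, the per-kind samples can be listed before and after the job. This is a minimal sketch, assuming Prometheus text format with one sample per `kind` label value and no per-sample timestamps; `pruned_by_kind` is a hypothetical helper name.

```sh
# pruned_by_kind: print "<kind> <value>" for each
# shithub_actions_runs_pruned_total sample on stdin (hypothetical helper).
pruned_by_kind() {
  awk '
    $1 ~ /^shithub_actions_runs_pruned_total[{]/ {
      kind = $1
      sub(/^.*kind="/, "", kind)   # keep what follows kind="
      sub(/".*$/, "", kind)        # cut at the closing quote
      print kind, $NF
    }
  '
}

# Example with made-up samples:
printf 'shithub_actions_runs_pruned_total{kind="chunks"} 12\n' | pruned_by_kind
```

Saving the output of `curl -fsS "$BASE/metrics" | pruned_by_kind` before and after `shithubd admin run-job workflow:cleanup` and comparing the two with `diff` shows which kinds were pruned.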
