Actions runbook
This is the operator runbook for shithub Actions. Host provisioning lives in runner-deploy.md, and runner protocol details live in actions-runner-api.md.
Shape
git push / workflow_dispatch / schedule / pull_request
|
v
workflow:trigger worker job
|
v
workflow_runs + workflow_jobs + workflow_steps + check_runs
|
v
registered runner heartbeat claims a matching queued job
|
v
containerized run: steps -> log chunks -> step/job status -> run rollup
The v1 executor supports containerized run: steps. The parser reserves
actions/checkout@v4, shithub/upload-artifact@v1, and
shithub/download-artifact@v1, but the Docker runner rejects uses: steps
until checkout metadata and artifact transfer are wired end to end. Do not use
actions/checkout@v4 in production smoke workflows yet.
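Because the Docker runner rejects uses: steps, it can help to scan a workflow file for them before pushing. The sketch below is a grep heuristic, not the shithub parser. The function name find_uses_steps is hypothetical, and the pattern can also match a uses: line that happens to sit inside a multi-line run: script.

```sh
# find_uses_steps FILE
# Print "LINE:TEXT" for every line that looks like a uses: step.
# Exit status is 0 when at least one is found and 1 when none are,
# following grep's own convention.
find_uses_steps() {
  grep -nE '^[[:space:]]*(-[[:space:]]+)?uses:' "$1"
}

# Example (hypothetical file name):
#   find_uses_steps .shithub/workflows/smoke.yml \
#     && echo "replace these with run: steps before pushing"
```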
First smoke
- Confirm migrations are applied and the web process can enqueue workers.
- Register one runner with a label that matches the workflow:
shithubd admin runner register \
--name smoke-runner-1 \
--labels self-hosted,linux,ubuntu-latest \
--capacity 1
- Start shithubd-runner with the printed token. For production hosts, use the Ansible/systemd path in runner-deploy.md.
- Push a run:-only workflow:
name: smoke
on: [push, workflow_dispatch]
jobs:
  hello:
    runs-on: ubuntu-latest
    env:
      RUN_ID: ${{ shithub.run_id }}
    steps:
      - run: echo "hello from shithub actions"
      - run: test -n "$RUN_ID"
- Expected result:
  - workflow:trigger enqueues a workflow run.
  - A runner heartbeat claims the queued job within one idle poll interval.
  - The Actions run page streams step logs while the job is running.
  - The matching check run completes with success.
  - /{owner}/{repo}/actions.atom includes the completed run.
Repeat with a step that runs exit 1; the check run should complete with failure.
Live log tail
Step log pages open an SSE stream at:
/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream
The stream sends event: chunk records with the chunk sequence as the SSE
id. Browsers reconnect with Last-Event-ID; the handler also accepts
?after=<seq> for the first connection from a rendered log page. A terminal
step sends event: done and closes the stream.
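To watch the stream from a terminal, something like the following sketch may be enough. It assumes standard SSE framing (id:, event:, and data: fields, with a blank line ending each record). The function name sse_chunk_ids and the host in the usage comment are hypothetical, and the format of the data: payload is not specified here, so the sketch ignores it.

```sh
# sse_chunk_ids
# Read an SSE stream on stdin, print the id of every chunk event, and
# stop at the first done event. The last id printed is the value to send
# back as Last-Event-ID (or ?after=<seq>) when resuming.
sse_chunk_ids() {
  awk '
    /^event:/ { sub(/^event:[ ]?/, ""); event = $0; next }
    /^id:/    { sub(/^id:[ ]?/, "");    id = $0;    next }
    /^$/ {
      # A blank line terminates one SSE record.
      if (event == "done") exit
      if (event == "chunk" && id != "") print id
      event = ""; id = ""
    }
  '
}

# Example (hypothetical host; -N disables curl output buffering):
#   curl -sN -H 'Last-Event-ID: 41' \
#     "https://shithub.example/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream" \
#     | sse_chunk_ids
```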
In shithubd, this route is mounted outside the normal app compression and
30-second timeout middleware. If a future route move puts live logs back under
either middleware, the timeout will cut streams off and force EventSource
clients to reconnect in a loop, and compression can buffer logs despite the
Caddy flush setting.
Log chunk contents are never sent through Postgres NOTIFY. Runner log writes
append to workflow_step_log_chunks, then issue NOTIFY step_log_<step_id> with
only the sequence number as the payload, so the notification is a wake-up
signal rather than a data channel. Step completion notifies done.
Rate limit
Live tails use internal/ratelimit scope actions:logtail with five
concurrent streams per viewer. Authenticated viewers key by user id; anonymous
public-repo viewers key by client IP. The limiter uses a short lease TTL so a
dropped connection cannot hold a slot permanently.
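The lease idea can be illustrated with a small file-based sketch. This is not how internal/ratelimit is implemented; it only shows why a TTL keeps a dropped stream from pinning a slot. The function name lease_acquire, the lease directory layout, and the explicit NOW argument (passed in so the behaviour is easy to test) are all assumptions.

```sh
# lease_acquire DIR KEY LEASE_ID NOW TTL MAX
# Try to take (or renew) one concurrency slot for KEY. Each lease is a
# file holding its expiry time in epoch seconds. Succeeds when fewer than
# MAX other unexpired leases exist for KEY.
# The body runs in a subshell so its variables stay local.
lease_acquire() (
  dir=$1/$2; id=$3; now=$4; ttl=$5; max=$6
  mkdir -p "$dir"
  live=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    expiry=$(cat "$f")
    if [ "$expiry" -le "$now" ]; then
      rm -f "$f"                      # expired: the holder stopped renewing
    elif [ "${f##*/}" != "$id" ]; then
      live=$((live + 1))              # someone else's live lease
    fi
  done
  # Write the lease only when a slot is free; the list's status is returned.
  [ "$live" -lt "$max" ] && echo $((now + ttl)) > "$dir/$id"
)

# Example: a stream for user 7 renews every few seconds while it is open.
#   lease_acquire /tmp/logtail user:7 stream-a "$(date +%s)" 30 5
```

A stream that keeps renewing holds its slot; one that disappears simply stops renewing, and its lease ages out after TTL seconds.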
Caddy
The production Caddy template has a dedicated Actions log-stream route with:
flush_interval -1
The same route is excluded from gzip compression. If logs arrive only after
several kilobytes accumulate, verify the deployed /etc/caddy/Caddyfile
contains that route and reload Caddy:
sudo caddy reload --config /etc/caddy/Caddyfile
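Before reloading, a quick check that the deployed file still carries the flush setting can save a round trip. The sketch below only greps for the directive; it does not confirm that the directive sits inside the Actions log-stream route, so treat a pass as necessary rather than sufficient. The function name caddy_streams_unbuffered is hypothetical.

```sh
# caddy_streams_unbuffered FILE
# Succeed when FILE contains a "flush_interval -1" directive on its own line.
caddy_streams_unbuffered() {
  grep -Eq '^[[:space:]]*flush_interval[[:space:]]+-1([[:space:]]|$)' "$1"
}

# Example:
#   caddy_streams_unbuffered /etc/caddy/Caddyfile \
#     || echo "flush_interval -1 is missing; redeploy the Caddy template"
```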
Runner health
On the runner host:
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
On the app host, inspect runner registration and heartbeat state:
shithubd admin actions runner list
Important metrics:
- shithub_actions_runner_heartbeats_total{result="claimed|no_job"}
- shithub_actions_runner_jwt_total{result="issued|rejected|replay"}
- shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}
- shithub_actions_log_scrub_replacements_total{location="server"}
- shithub_actions_step_timeouts_total
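For a quick read on the heartbeat counter without a dashboard, the sketch below sums it by result from Prometheus text exposition on stdin. Where that text comes from depends on how the deployment exposes metrics, which this runbook does not specify. The function name heartbeat_summary is hypothetical, and the sketch assumes each sample line ends with its value (no trailing timestamp).

```sh
# heartbeat_summary
# Read Prometheus text-format metrics on stdin and print the total
# heartbeat count for each result value.
heartbeat_summary() {
  awk '
    /^shithub_actions_runner_heartbeats_total[{]/ {
      if ($0 ~ /result="claimed"/) claimed += $NF
      if ($0 ~ /result="no_job"/)  idle    += $NF
    }
    END { printf "claimed=%d no_job=%d\n", claimed, idle }
  '
}
```

If no_job keeps climbing while runs sit queued, a label or capacity mismatch is a plausible first suspect.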
Emergency cancel
Start with a dry run:
shithubd admin actions cancel-all --dry-run --limit 100
Scope to one repository when possible:
shithubd admin actions cancel-all --dry-run --repo-id 42
Then confirm:
shithubd admin actions cancel-all --confirm --repo-id 42
Queued jobs are marked cancelled immediately. Running jobs receive
cancel_requested=true; the runner sees that through /cancel-check, kills the
active container, and reports terminal cancelled.
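The dry-run-then-confirm sequence above can be wrapped so the confirm step never runs without a preview. This is a sketch that uses only the flags shown in this section; the function name cancel_repo_runs and the interactive prompt are assumptions, not part of shithubd.

```sh
# cancel_repo_runs REPO_ID
# Show the dry run for one repository, ask for an explicit "yes" on stdin,
# and only then run the confirmed cancel.
cancel_repo_runs() {
  shithubd admin actions cancel-all --dry-run --repo-id "$1" || return 1
  printf 'Cancel these runs for repo %s? Type yes: ' "$1" >&2
  read -r answer
  if [ "$answer" != "yes" ]; then
    echo "aborted" >&2
    return 1
  fi
  shithubd admin actions cancel-all --confirm --repo-id "$1"
}

# Example:
#   cancel_repo_runs 42
```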
Common failures
- Run never appears: confirm the workflow file is under .shithub/workflows/, parse it with shithubd admin actions parse <file>, and verify the trigger event matches on:.
- Run stays queued: confirm a runner is registered with matching labels and capacity, then inspect runner journal output and heartbeat metrics.
- Step logs buffer: verify the Caddy route above and confirm the SSE route is still mounted outside compression and short timeouts.
- uses: step fails: expected for now. Replace with a run: step until checkout/artifact support lands.
- Secrets appear masked inconsistently: check shithub_actions_log_scrub_replacements_total{location="server"} and confirm the job was claimed after the secret was created or rotated. Mask snapshots are captured at claim time.
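For the "run stays queued" case, comparing labels by hand is error-prone, so a small check can help. The sketch assumes a runner matches when it carries every label the job asks for in runs-on. That matching rule is an assumption about the scheduler, and the function name labels_satisfy is hypothetical.

```sh
# labels_satisfy RUNNER_LABELS RUNS_ON
# Both arguments are comma-separated label lists. Succeed when every
# label in RUNS_ON also appears in RUNNER_LABELS.
# The body runs in a subshell so the IFS change does not leak out.
labels_satisfy() (
  IFS=,
  for want in $2; do
    case ",$1," in
      *",$want,"*) ;;      # runner has this label, keep checking
      *) exit 1 ;;         # missing label: this runner cannot claim the job
    esac
  done
)

# Example, using the labels from the first smoke test:
#   labels_satisfy self-hosted,linux,ubuntu-latest ubuntu-latest && echo match
```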
| 79 | `id`. Browsers reconnect with `Last-Event-ID`; the handler also accepts |
| 80 | `?after=<seq>` for the first connection from a rendered log page. A terminal |
| 81 | step sends `event: done` and closes the stream. |
| 82 | |
| 83 | In `shithubd`, this route is mounted outside the normal app compression and |
| 84 | 30-second timeout middleware. If a future route move puts live logs back under |
| 85 | either middleware, EventSource clients will churn and logs can buffer despite |
| 86 | the Caddy flush setting. |
| 87 | |
| 88 | Log chunks are never sent through Postgres `NOTIFY`. Runner log writes append |
| 89 | to `workflow_step_log_chunks`, then `NOTIFY step_log_<step_id>` with only the |
| 90 | sequence number. Step completion notifies `done`. |
| 91 | |
| 92 | ## Rate limit |
| 93 | |
| 94 | Live tails use `internal/ratelimit` scope `actions:logtail` with five |
| 95 | concurrent streams per viewer. Authenticated viewers key by user id; anonymous |
| 96 | public-repo viewers key by client IP. The limiter uses a short lease TTL so a |
| 97 | dropped connection cannot hold a slot permanently. |
| 98 | |
| 99 | ## Caddy |
| 100 | |
| 101 | The production Caddy template has a dedicated Actions log-stream route with: |
| 102 | |
| 103 | ```caddy |
| 104 | flush_interval -1 |
| 105 | ``` |
| 106 | |
| 107 | The same route is excluded from gzip compression. If logs arrive only after |
| 108 | several kilobytes accumulate, verify the deployed `/etc/caddy/Caddyfile` |
| 109 | contains that route and reload Caddy: |
| 110 | |
| 111 | ```sh |
| 112 | sudo caddy reload --config /etc/caddy/Caddyfile |
| 113 | ``` |
| 114 | |
| 115 | ## Runner health |
| 116 | |
| 117 | On the runner host: |
| 118 | |
| 119 | ```sh |
| 120 | systemctl status shithubd-runner |
| 121 | journalctl -u shithubd-runner -n 100 --no-pager |
| 122 | ``` |
| 123 | |
| 124 | On the app host, inspect runner registration and heartbeat state: |
| 125 | |
| 126 | ```sh |
| 127 | shithubd admin actions runner list |
| 128 | ``` |
| 129 | |
| 130 | Important metrics: |
| 131 | |
| 132 | - `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}` |
| 133 | - `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}` |
| 134 | - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}` |
| 135 | - `shithub_actions_log_scrub_replacements_total{location="server"}` |
| 136 | - `shithub_actions_step_timeouts_total` |
| 137 | |
| 138 | ## Emergency cancel |
| 139 | |
| 140 | Start with a dry run: |
| 141 | |
| 142 | ```sh |
| 143 | shithubd admin actions cancel-all --dry-run --limit 100 |
| 144 | ``` |
| 145 | |
| 146 | Scope to one repository when possible: |
| 147 | |
| 148 | ```sh |
| 149 | shithubd admin actions cancel-all --dry-run --repo-id 42 |
| 150 | ``` |
| 151 | |
| 152 | Then confirm: |
| 153 | |
| 154 | ```sh |
| 155 | shithubd admin actions cancel-all --confirm --repo-id 42 |
| 156 | ``` |
| 157 | |
| 158 | Queued jobs are marked cancelled immediately. Running jobs receive |
| 159 | `cancel_requested=true`; the runner sees that through `/cancel-check`, kills the |
| 160 | active container, and reports terminal `cancelled`. |
| 161 | |
| 162 | ## Common failures |
| 163 | |
| 164 | - **Run never appears:** confirm the workflow file is under |
| 165 | `.shithub/workflows/`, parse it with `shithubd admin actions parse <file>`, |
| 166 | and verify the trigger event matches `on:`. |
| 167 | - **Run stays queued:** confirm a runner is registered with matching labels and |
| 168 | capacity, then inspect runner journal output and heartbeat metrics. |
| 169 | - **Step logs buffer:** verify the Caddy route above and confirm the SSE route |
| 170 | is still mounted outside compression and short timeouts. |
| 171 | - **`uses:` step fails:** expected for now. Replace with a `run:` step until |
| 172 | checkout/artifact support lands. |
| 173 | - **Secrets appear masked inconsistently:** check |
| 174 | `shithub_actions_log_scrub_replacements_total{location="server"}` and confirm |
| 175 | the job was claimed after the secret was created or rotated. Mask snapshots |
| 176 | are captured at claim time. |