Actions runbook

This is the operator runbook for shithub Actions. Host provisioning lives in runner-deploy.md, and runner protocol details live in actions-runner-api.md.

Shape

git push / workflow_dispatch / schedule / pull_request
        |
        v
workflow:trigger worker job
        |
        v
workflow_runs + workflow_jobs + workflow_steps + check_runs
        |
        v
registered runner heartbeat claims a matching queued job
        |
        v
actions/checkout@v4 -> containerized run: steps
        |
        v
log chunks -> step/job status -> run rollup

The v1 executor supports host-side actions/checkout@v4 plus containerized run: steps. The checkout token is short-lived, repository-scoped, tied to a running job, and accepted only for read-only smart-HTTP fetches. shithub/upload-artifact@v1 and shithub/download-artifact@v1 are still reserved aliases and fail until artifact transfer is wired end to end.

First smoke

  1. Confirm migrations are applied and the web process can enqueue workers.
  2. Register one runner with a label that matches the workflow:
shithubd admin runner register \
  --name smoke-runner-1 \
  --labels self-hosted,linux,ubuntu-latest \
  --capacity 1
  3. Start shithubd-runner with the printed token. For production hosts, use the Ansible/systemd path in runner-deploy.md.
  4. Push a run:-only workflow:
name: smoke
on: [push, workflow_dispatch]
jobs:
  hello:
    runs-on: ubuntu-latest
    env:
      RUN_ID: ${{ shithub.run_id }}
    steps:
      - run: echo "hello from shithub actions"
      - run: test -n "$RUN_ID"
  5. Expected result:
  • workflow:trigger enqueues a workflow run.
  • A runner heartbeat claims the queued job within one idle poll interval.
  • The Actions run page streams step logs while the job is running.
  • The matching check run completes with success.
  • /{owner}/{repo}/actions.atom includes the completed run.

Repeat with a step that runs exit 1; the check run should complete with failure.

Checkout smoke

After the run-only smoke passes, verify repository checkout with:

name: checkout smoke
on: [push, workflow_dispatch]
jobs:
  checkout:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: test -f README.md
      - run: test "$(git rev-parse HEAD)" = "${{ shithub.sha }}"

Expected result:

  • The claim response contains checkout_url and checkout_token.
  • Git smart HTTP sees the checkout token as Basic auth and permits git-upload-pack for the claimed repo.
  • git-receive-pack rejects the same credential with 403.
  • The job workspace contains the exact shithub.sha commit before the first run: step starts.
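
The token rules checked above can be summarized as one authorization predicate. This is a minimal sketch, not shithub's implementation: the `CheckoutToken` fields, the `authorize` function, and the 401 returned for a stale token are assumptions made for illustration. Only the two service names and the 403 for `git-receive-pack` come from this runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutToken:
    """Hypothetical view of a checkout token; field names are assumptions."""
    repo_id: int        # repository the token is scoped to
    job_status: str     # status of the job the token is tied to
    expires_at: float   # absolute expiry, seconds since the epoch


def authorize(token: CheckoutToken, repo_id: int, service: str, now: float) -> int:
    """Return the HTTP status a smart-HTTP request would receive."""
    # Assumption: an expired token, or one whose job left "running",
    # is treated as unauthenticated.
    if now >= token.expires_at or token.job_status != "running":
        return 401
    # The token is repository-scoped, so any other repository is refused.
    if token.repo_id != repo_id:
        return 403
    # From the runbook: pushes with a checkout token are rejected with 403.
    if service == "git-receive-pack":
        return 403
    # Read-only fetches are the only permitted use.
    if service == "git-upload-pack":
        return 200
    return 403
```

Keeping the decision in one pure function makes the matrix easy to test: same token, four different requests, four predictable statuses.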

Live log tail

Step log pages open an SSE stream at:

/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream

The stream sends event: chunk records with the chunk sequence as the SSE id. Browsers reconnect with Last-Event-ID; the handler also accepts ?after=<seq> for the first connection from a rendered log page. A terminal step sends event: done and closes the stream.
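
To make the record format concrete, here is a small client-side sketch that parses the stream and tracks the last sequence seen, which is the value a reconnect would send as `Last-Event-ID` or `?after=<seq>`. The `data:` payload format and the helper names `parse_sse` and `consume` are assumptions; the runbook only specifies the event names and that the SSE `id` carries the chunk sequence.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class SSEEvent:
    event: str
    data: str
    id: Optional[str]


def parse_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Yield one SSEEvent per blank-line-terminated record.

    Simplification: a named event is yielded even when it has no data line.
    """
    event, data, event_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data or event != "message":
                yield SSEEvent(event, "\n".join(data), event_id)
            event, data = "message", []   # the last id persists across records
            continue
        if line.startswith(":"):
            continue                      # comment or keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
        elif field == "id":
            event_id = value


def consume(lines: Iterable[str], last_seq: int = 0) -> Tuple[List[str], int, bool]:
    """Collect chunk payloads; return (chunks, last_seq, done)."""
    chunks: List[str] = []
    for ev in parse_sse(lines):
        if ev.event == "chunk" and ev.id is not None:
            seq = int(ev.id)
            if seq > last_seq:            # drop replays after a reconnect
                chunks.append(ev.data)
                last_seq = seq
        elif ev.event == "done":
            return chunks, last_seq, True
    return chunks, last_seq, False
```

If `consume` returns with `done` false, the connection dropped mid-step, and the returned `last_seq` is the resume point for the next connection.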

In shithubd, this route is mounted outside the normal app compression and 30-second timeout middleware. If a future route move puts live logs back under either middleware, the timeout will cut streams and force EventSource clients to reconnect repeatedly, and compression can buffer logs despite the Caddy flush setting.

Log chunk payloads are never sent through Postgres NOTIFY. Runner log writes append to workflow_step_log_chunks, then issue NOTIFY step_log_<step_id> with only the sequence number as the payload. Step completion notifies done.
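
The append-then-notify pattern can be illustrated with an in-memory stand-in for the table and the notification channel. This is a sketch under stated assumptions, not the shithubd code: `ChunkStore`, `make_tail`, and the choice to deliver `done` as a payload on the same `step_log_<step_id>` channel are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class ChunkStore:
    """In-memory stand-in for workflow_step_log_chunks plus LISTEN/NOTIFY."""

    def __init__(self) -> None:
        self._chunks: DefaultDict[int, List[Tuple[int, bytes]]] = defaultdict(list)
        self._listeners: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def listen(self, channel: str, callback: Callable[[str], None]) -> None:
        self._listeners[channel].append(callback)

    def append(self, step_id: int, chunk: bytes) -> int:
        seq = len(self._chunks[step_id]) + 1
        self._chunks[step_id].append((seq, chunk))
        # The notification carries only the sequence number, never the bytes.
        self._notify(f"step_log_{step_id}", str(seq))
        return seq

    def complete(self, step_id: int) -> None:
        # Assumption: completion is signalled on the same channel.
        self._notify(f"step_log_{step_id}", "done")

    def chunks_after(self, step_id: int, after: int) -> List[Tuple[int, bytes]]:
        return [(s, c) for s, c in self._chunks[step_id] if s > after]

    def _notify(self, channel: str, payload: str) -> None:
        for callback in self._listeners[channel]:
            callback(payload)


def make_tail(store: ChunkStore, step_id: int, out: list) -> Callable[[str], None]:
    """Build a listener that reads chunks from the store on each notification."""
    last_seq = 0

    def on_notify(payload: str) -> None:
        nonlocal last_seq
        if payload == "done":
            out.append(("done", b""))
            return
        # Read everything after last_seq, not just the notified row, so a
        # missed notification is repaired by the next one.
        for seq, chunk in store.chunks_after(step_id, last_seq):
            out.append(("chunk", chunk))
            last_seq = seq

    return on_notify
```

Keeping payloads out of the notification avoids the size limit on NOTIFY payloads and means the table stays the single source of truth for log content.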

Rate limit

Live tails use internal/ratelimit scope actions:logtail with five concurrent streams per viewer. Authenticated viewers key by user id; anonymous public-repo viewers key by client IP. The limiter uses a short lease TTL so a dropped connection cannot hold a slot permanently.
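
A lease-based concurrency limit of this kind can be sketched as follows. This is illustrative only and does not reflect the `internal/ratelimit` API: the class name, method names, the key format, and the 30-second TTL are assumptions. The limit of five and the keying rule come from the paragraph above.

```python
import time
from typing import Callable, Dict, Optional


class LeaseLimiter:
    """Cap concurrent streams per key; leases expire unless renewed."""

    def __init__(self, limit: int = 5, ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.ttl = ttl            # assumption: the real TTL is only "short"
        self.clock = clock
        self._leases: Dict[str, Dict[int, float]] = {}
        self._next_id = 0

    def acquire(self, key: str) -> Optional[int]:
        """Return a lease id, or None when the key is at its limit."""
        now = self.clock()
        leases = self._leases.setdefault(key, {})
        # Purge expired leases first so dropped connections free their slots.
        for lease_id in [i for i, exp in leases.items() if exp <= now]:
            del leases[lease_id]
        if len(leases) >= self.limit:
            return None
        self._next_id += 1
        leases[self._next_id] = now + self.ttl
        return self._next_id

    def renew(self, key: str, lease_id: int) -> bool:
        """Extend a live lease; a healthy stream calls this periodically."""
        leases = self._leases.get(key, {})
        now = self.clock()
        if leases.get(lease_id, 0.0) <= now:
            leases.pop(lease_id, None)
            return False
        leases[lease_id] = now + self.ttl
        return True

    def release(self, key: str, lease_id: int) -> None:
        self._leases.get(key, {}).pop(lease_id, None)


def viewer_key(user_id: Optional[int], client_ip: str) -> str:
    """Authenticated viewers key by user id; anonymous viewers by client IP."""
    return f"user:{user_id}" if user_id is not None else f"ip:{client_ip}"
```

The TTL is what keeps a crashed browser tab from occupying a slot forever: a stream that stops renewing simply ages out on the next `acquire`.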

Caddy

The production Caddy template has a dedicated Actions log-stream route with:

flush_interval -1

The same route is excluded from gzip compression. If logs arrive only after several kilobytes accumulate, verify the deployed /etc/caddy/Caddyfile contains that route and reload Caddy:

sudo caddy reload --config /etc/caddy/Caddyfile

Runner health

On the runner host:

systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager

On the app host, inspect runner registration and heartbeat state:

shithubd admin actions runner list

Important metrics:

  • shithub_actions_queue_depth{resource="runs|jobs"}
  • shithub_actions_active{resource="runs|jobs"}
  • shithub_actions_runner_heartbeat_age_seconds{runner,status}
  • shithub_actions_runner_capacity{runner,status}
  • shithub_actions_runner_heartbeats_total{result="claimed|no_job"}
  • shithub_actions_runner_jwt_total{result="issued|rejected|replay"}
  • shithub_actions_runs_completed_total{event,conclusion}
  • shithub_actions_run_duration_seconds{event,conclusion}
  • shithub_actions_steps_completed_total{step_type,conclusion}
  • shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}
  • shithub_actions_log_scrub_replacements_total{location="server"}
  • shithub_actions_log_chunks_total{location="server"}
  • shithub_actions_log_chunk_bytes_total{location="server"}
  • shithub_actions_storage_objects{kind="artifacts|step_logs|hot_log_chunks"}
  • shithub_actions_storage_bytes{kind="artifacts|step_logs|hot_log_chunks"}
  • shithub_actions_step_timeouts_total

The committed dashboard JSON lives at:

deploy/monitoring/grafana/dashboards/actions.json

Load harness

bench/k6/actions-load.js exercises the runner HTTP API under concurrent job claims. It does not create workflow runs itself; seed the queue first with pushes or workflow dispatches that produce jobs matching the runner labels.

Required environment:

SHITHUB_BASE_URL=https://shithub.example.test \
SHITHUB_RUNNER_TOKENS=token-a,token-b,token-c \
k6 run bench/k6/actions-load.js

Useful knobs:

  • SHITHUB_ACTIONS_VUS=50 controls concurrent virtual users.
  • SHITHUB_ACTIONS_DURATION=10m controls the steady-state window.
  • SHITHUB_RUNNER_LABELS=self-hosted,linux,ubuntu-latest sets heartbeat labels.
  • SHITHUB_RUNNER_CAPACITY=17 keeps three runners near the 50-concurrent target.
  • SHITHUB_ACTIONS_LOG_BYTES=4096 controls emitted log chunk size.
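
The capacity value of 17 follows from dividing the concurrency target across the runners and rounding up. A short sketch of that arithmetic, assuming (this is not confirmed by the harness source) that one runner is simulated per token in `SHITHUB_RUNNER_TOKENS`; both function names are hypothetical:

```python
def runner_count_from_tokens(tokens_csv: str) -> int:
    """Count non-empty entries in a comma-separated token list."""
    return len([t for t in tokens_csv.split(",") if t.strip()])


def per_runner_capacity(target_concurrency: int, runner_count: int) -> int:
    """Smallest per-runner capacity whose total meets the target."""
    if target_concurrency < 0 or runner_count <= 0:
        raise ValueError("need a non-negative target and at least one runner")
    # Integer ceiling division: ceil(target / runners).
    return -(-target_concurrency // runner_count)
```

With the three example tokens and a target of 50, this gives 17 per runner, for a total capacity of 51.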

Healthy run expectations:

  • queued jobs drain without unbounded shithub_actions_queue_depth growth;
  • runner heartbeats keep advancing and no runner deadlocks;
  • log append p99 stays below five seconds;
  • retention metrics catch up after the retention sweep.

Emergency cancel

Start with a dry run:

shithubd admin actions cancel-all --dry-run --limit 100

Scope to one repository when possible:

shithubd admin actions cancel-all --dry-run --repo-id 42

Then confirm:

shithubd admin actions cancel-all --confirm --repo-id 42

Queued jobs are marked cancelled immediately. Running jobs receive cancel_requested=true; the runner sees that through /cancel-check, kills the active container, and reports terminal cancelled.
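
The two cancellation paths can be modeled as a small state machine. This is a sketch of the behavior described above, not the server or runner code: the `Job` dataclass, `request_cancel`, `runner_poll`, and the `kill_container` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    id: int
    status: str                    # "queued", "running", or a terminal status
    cancel_requested: bool = False


def request_cancel(job: Job) -> None:
    """Server side: what cancel-all does to a single job."""
    if job.status == "queued":
        job.status = "cancelled"   # no runner involved, so cancel immediately
    elif job.status == "running":
        job.cancel_requested = True
    # Jobs already in a terminal status are left untouched.


def runner_poll(job: Job, kill_container: Callable[[int], None]) -> bool:
    """Runner side: one cancel-check poll; True if the job was cancelled."""
    if job.status == "running" and job.cancel_requested:
        kill_container(job.id)
        job.status = "cancelled"   # reported back as the terminal status
        return True
    return False
```

The practical consequence for operators: a running job stays `running` until its runner next polls, so a runner that is down or partitioned will leave the job with `cancel_requested=true` and no terminal status.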

Common failures

  • Run never appears: confirm the workflow file is under .shithub/workflows/, parse it with shithubd admin actions parse <file>, and verify the trigger event matches on:.
  • Run stays queued: confirm a runner is registered with matching labels and capacity, then inspect runner journal output and heartbeat metrics.
  • Step logs buffer: verify the Caddy route above and confirm the SSE route is still mounted outside compression and short timeouts.
  • actions/checkout@v4 fails: confirm the job is still running, the repo URL in the runner claim points at this shithub instance, and the runner host can reach smart HTTP. The checkout token is not valid after the job leaves running.
  • Artifact uses: step fails: expected for now. Replace with a run: step until artifact support lands.
  • Secrets appear masked inconsistently: check shithub_actions_log_scrub_replacements_total{location="server"} and confirm the job was claimed after the secret was created or rotated. Mask snapshots are captured at claim time.
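
The last failure mode is easier to see with a sketch of claim-time mask snapshots. Everything here is illustrative: the `***` mask string and the functions `snapshot_masks` and `scrub` are assumptions, not the scrubber shithub ships.

```python
from typing import Iterable, Tuple

MASK = "***"  # assumption: the real replacement string may differ


def snapshot_masks(secrets: Iterable[str]) -> Tuple[str, ...]:
    """Freeze the secret values known at claim time.

    Longest first, so a secret that contains another is masked whole.
    """
    return tuple(sorted({s for s in secrets if s}, key=len, reverse=True))


def scrub(line: str, masks: Tuple[str, ...]) -> Tuple[str, int]:
    """Replace each snapshotted secret; return (line, replacement count)."""
    replacements = 0
    for secret in masks:
        count = line.count(secret)
        if count:
            line = line.replace(secret, MASK)
            replacements += count
    return line, replacements
```

A job claimed before a rotation holds a snapshot of the old value only, so the new value passes through unmasked until the next job is claimed. That is the pattern to look for when the replacement counter stays flat for a job that clearly printed a secret.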