Actions runbook

This is the operator runbook for shithub Actions. Host provisioning lives in runner-deploy.md, and runner protocol details live in actions-runner-api.md.

Shape

git push / workflow_dispatch / schedule / pull_request
        |
        v
workflow:trigger worker job
        |
        v
workflow_runs + workflow_jobs + workflow_steps + check_runs
        |
        v
registered runner heartbeat claims a matching queued job
        |
        v
containerized run: steps -> log chunks -> step/job status -> run rollup

The v1 executor supports containerized run: steps. The parser reserves actions/checkout@v4, shithub/upload-artifact@v1, and shithub/download-artifact@v1, but the Docker runner rejects uses: steps until checkout metadata and artifact transfer are wired end to end. Do not use actions/checkout@v4 in production smoke workflows yet.
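
Because the runner rejects uses: steps, it can help to scan a workflow file before pushing it. The following is a minimal sketch, assuming a POSIX shell with grep; find_uses_steps is a hypothetical helper and not part of shithubd.

```sh
# Sketch only: find_uses_steps is a hypothetical helper, not a shithubd command.
# It prints every line that declares a `uses:` step, prefixed with its line
# number, and returns 0 when at least one such line exists.
find_uses_steps() {
  # Match `uses:` as a mapping key, optionally preceded by the list dash.
  grep -nE '^[[:space:]]*(-[[:space:]]+)?uses:' "$1"
}

# Example usage (the file name is illustrative):
#   if find_uses_steps .shithub/workflows/smoke.yml; then
#     echo "replace the steps above with run: steps" >&2
#   fi
```

The pattern is line-based, so it does not parse YAML; a uses: key written in flow style on one line would be missed.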

First smoke

  1. Confirm migrations are applied and the web process can enqueue workers.
  2. Register one runner with a label that matches the workflow:

         shithubd admin runner register \
           --name smoke-runner-1 \
           --labels self-hosted,linux,ubuntu-latest \
           --capacity 1

  3. Start shithubd-runner with the printed token. For production hosts, use the Ansible/systemd path in runner-deploy.md.
  4. Push a run:-only workflow:

         name: smoke
         on: [push, workflow_dispatch]
         jobs:
           hello:
             runs-on: ubuntu-latest
             env:
               RUN_ID: ${{ shithub.run_id }}
             steps:
               - run: echo "hello from shithub actions"
               - run: test -n "$RUN_ID"

  5. Expected result:
  • workflow:trigger enqueues a workflow run.
  • A runner heartbeat claims the queued job within one idle poll interval.
  • The Actions run page streams step logs while the job is running.
  • The matching check run completes with success.
  • /{owner}/{repo}/actions.atom includes the completed run.

Repeat with a step that runs exit 1; the matching check run should complete with failure.

Live log tail

Step log pages open an SSE stream at:

/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream

The stream sends event: chunk records with the chunk sequence as the SSE id. Browsers reconnect with Last-Event-ID; the handler also accepts ?after=<seq> for the first connection from a rendered log page. A terminal step sends event: done and closes the stream.
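
To resume a tail by hand, an operator needs the sequence of the last chunk that fully arrived. The sketch below reads SSE text on stdin and prints that sequence. It assumes each chunk event carries its own id: line and is terminated by a blank line, as in standard SSE framing; last_chunk_seq and the host name in the usage comment are hypothetical.

```sh
# Sketch only: last_chunk_seq is a hypothetical helper.
# It prints the id of the last complete `chunk` event read from stdin,
# or nothing if no complete chunk event was seen.
last_chunk_seq() {
  awk '
    { sub(/\r$/, "") }                      # tolerate CRLF line endings
    /^event:/ { sub(/^event:[ ]?/, ""); event = $0; next }
    /^id:/    { sub(/^id:[ ]?/, "");    id = $0;    next }
    # A blank line ends the event that was being assembled.
    /^$/      { if (event == "chunk" && id != "") last = id
                event = ""; id = ""; next }
    END       { if (last != "") print last }
  '
}

# Example usage (shithub.example is a placeholder host):
#   url="https://shithub.example/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream"
#   seq=$(curl -sN --max-time 10 "$url" | last_chunk_seq)
#   curl -sN "$url?after=$seq"
```

An event cut off mid-transfer is ignored on purpose, so the resumed request asks for it again instead of skipping it.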

In shithubd, this route is mounted outside the normal app compression and 30-second timeout middleware. If a future route move puts live logs back under either middleware, the timeout will cut each stream after 30 seconds and force EventSource clients to reconnect repeatedly, and compression can buffer logs despite the Caddy flush setting.

Log chunk contents are never sent through Postgres NOTIFY payloads. Runner log writes append to workflow_step_log_chunks, then NOTIFY step_log_<step_id> with only the sequence number as the payload. Step completion notifies done.

Rate limit

Live tails use internal/ratelimit scope actions:logtail with five concurrent streams per viewer. Authenticated viewers key by user id; anonymous public-repo viewers key by client IP. The limiter uses a short lease TTL so a dropped connection cannot hold a slot permanently.

Caddy

The production Caddy template has a dedicated Actions log-stream route with:

flush_interval -1

The same route is excluded from gzip compression. If logs arrive only after several kilobytes accumulate, verify the deployed /etc/caddy/Caddyfile contains that route and reload Caddy:

sudo caddy reload --config /etc/caddy/Caddyfile
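
The check before reloading can be scripted. This is a minimal sketch, assuming a POSIX shell with grep; has_unbuffered_flush is a hypothetical helper. It only confirms that the directive appears somewhere in the file, not that it sits inside the Actions log-stream route.

```sh
# Sketch only: has_unbuffered_flush is a hypothetical helper.
# It succeeds when the given Caddyfile contains a `flush_interval -1` line.
has_unbuffered_flush() {
  grep -Eq '^[[:space:]]*flush_interval[[:space:]]+-1([[:space:]]|$)' "$1"
}

# Example usage:
#   if has_unbuffered_flush /etc/caddy/Caddyfile; then
#     sudo caddy reload --config /etc/caddy/Caddyfile
#   else
#     echo "flush_interval -1 is missing; redeploy the Caddy template" >&2
#   fi
```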

Runner health

On the runner host:

systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager

On the app host, inspect runner registration and heartbeat state:

shithubd admin actions runner list

Important metrics:

  • shithub_actions_runner_heartbeats_total{result="claimed|no_job"}
  • shithub_actions_runner_jwt_total{result="issued|rejected|replay"}
  • shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}
  • shithub_actions_log_scrub_replacements_total{location="server"}
  • shithub_actions_step_timeouts_total
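
The heartbeat counter can be summarized from a metrics scrape. The sketch below assumes the metrics are exposed in the Prometheus text exposition format; heartbeat_claims and the metrics URL in the usage comment are hypothetical.

```sh
# Sketch only: heartbeat_claims is a hypothetical helper.
# It reads Prometheus text-format metrics on stdin and sums the heartbeat
# counter by result label, across any other labels that may be present.
heartbeat_claims() {
  awk '
    /^shithub_actions_runner_heartbeats_total[{]/ {
      value = $0
      # Drop the metric name and label set, leaving "value [timestamp]".
      sub(/^[^}]*[}][ ]*/, "", value)
      split(value, parts, " ")
      if ($0 ~ /result="claimed"/) claimed += parts[1]
      if ($0 ~ /result="no_job"/)  idle    += parts[1]
    }
    END { printf "claimed=%d no_job=%d\n", claimed, idle }
  '
}

# Example usage (the metrics address is a placeholder):
#   curl -s http://localhost:9090/metrics | heartbeat_claims
```

If no_job keeps growing while claimed stays flat and jobs sit in the queue, the runner is polling but not matching, which likely points at labels or capacity.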

Emergency cancel

Start with a dry run:

shithubd admin actions cancel-all --dry-run --limit 100

Scope to one repository when possible:

shithubd admin actions cancel-all --dry-run --repo-id 42

Then confirm:

shithubd admin actions cancel-all --confirm --repo-id 42

Queued jobs are marked cancelled immediately. Running jobs receive cancel_requested=true; the runner sees that through /cancel-check, kills the active container, and reports terminal cancelled.
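
The dry-run-then-confirm sequence can be wrapped so the confirm step never runs without a preview. This is a minimal sketch that uses only the flags shown above; cancel_repo_jobs and the SHITHUBD override variable are hypothetical.

```sh
# Sketch only: cancel_repo_jobs is a hypothetical wrapper.
# SHITHUBD may name a different binary; it defaults to `shithubd`.
cancel_repo_jobs() {
  repo_id=${1:-}
  bin=${SHITHUBD:-shithubd}
  if [ -z "$repo_id" ]; then
    echo "usage: cancel_repo_jobs <repo-id>" >&2
    return 2
  fi
  # Always show what would be cancelled first.
  "$bin" admin actions cancel-all --dry-run --repo-id "$repo_id" || return 1
  printf 'Cancel these jobs for repo %s? Type yes to confirm: ' "$repo_id" >&2
  read -r answer || return 1
  if [ "$answer" != "yes" ]; then
    echo "aborted; nothing was cancelled" >&2
    return 1
  fi
  "$bin" admin actions cancel-all --confirm --repo-id "$repo_id"
}

# Example usage:
#   cancel_repo_jobs 42
```

Any answer other than yes stops after the dry run, so a mistyped repository id costs nothing.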

Common failures

  • Run never appears: confirm the workflow file is under .shithub/workflows/, parse it with shithubd admin actions parse <file>, and verify the trigger event matches on:.
  • Run stays queued: confirm a runner is registered with matching labels and capacity, then inspect runner journal output and heartbeat metrics.
  • Step logs buffer: verify the Caddy route above and confirm the SSE route is still mounted outside compression and short timeouts.
  • uses: step fails: expected for now. Replace with a run: step until checkout/artifact support lands.
  • Secrets appear masked inconsistently: check shithub_actions_log_scrub_replacements_total{location="server"} and confirm the job was claimed after the secret was created or rotated. Mask snapshots are captured at claim time, so a job claimed before a rotation does not mask the new value.