# Actions runbook

This is the operator runbook for shithub Actions. Host provisioning lives in [runner-deploy.md](./runner-deploy.md), and runner protocol details live in [actions-runner-api.md](../actions-runner-api.md).

## Shape

```text
git push / workflow_dispatch / schedule / pull_request
        |
        v
workflow:trigger worker job
        |
        v
workflow_runs + workflow_jobs + workflow_steps + check_runs
        |
        v
registered runner heartbeat claims a matching queued job
        |
        v
containerized run: steps -> log chunks -> step/job status -> run rollup
```

The v1 executor supports containerized `run:` steps. The parser reserves `actions/checkout@v4`, `shithub/upload-artifact@v1`, and `shithub/download-artifact@v1`, but the Docker runner rejects `uses:` steps until checkout metadata and artifact transfer are wired end to end. Do not use `actions/checkout@v4` in production smoke workflows yet.
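Because `uses:` steps are rejected at run time, it can help to scan a workflow before pushing it. The following is a minimal sketch, assuming the workflow has already been parsed into a plain dictionary; `find_uses_steps` is a hypothetical helper and not part of shithub.

```python
from typing import Any


def find_uses_steps(workflow: dict[str, Any]) -> list[tuple[str, int, str]]:
    """Return (job_id, step_index, action) for every `uses:` step found."""
    found = []
    for job_id, job in workflow.get("jobs", {}).items():
        for index, step in enumerate(job.get("steps", [])):
            if "uses" in step:
                found.append((job_id, index, step["uses"]))
    return found


# Hypothetical parsed workflow that mixes a reserved action with a run: step.
candidate = {
    "name": "smoke",
    "on": ["push", "workflow_dispatch"],
    "jobs": {
        "hello": {
            "runs-on": "ubuntu-latest",
            "steps": [
                {"uses": "actions/checkout@v4"},
                {"run": 'echo "hello from shithub actions"'},
            ],
        }
    },
}

for job_id, index, action in find_uses_steps(candidate):
    print(f"job {job_id!r} step {index}: {action} is not runnable on the v1 executor")
```

An empty result means the workflow is `run:`-only and suitable for a smoke test.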

## First smoke

1. Confirm migrations are applied and the web process can enqueue workers.
2. Register one runner with a label that matches the workflow:

```sh
shithubd admin runner register \
  --name smoke-runner-1 \
  --labels self-hosted,linux,ubuntu-latest \
  --capacity 1
```

3. Start `shithubd-runner` with the printed token. For production hosts, use the Ansible/systemd path in [runner-deploy.md](./runner-deploy.md).
4. Push a `run:`-only workflow:

```yaml
name: smoke
on: [push, workflow_dispatch]
jobs:
  hello:
    runs-on: ubuntu-latest
    env:
      RUN_ID: ${{ shithub.run_id }}
    steps:
      - run: echo "hello from shithub actions"
      - run: test -n "$RUN_ID"
```

5. Expected result:

- `workflow:trigger` enqueues a workflow run.
- A runner heartbeat claims the queued job within one idle poll interval.
- The Actions run page streams step logs while the job is running.
- The matching check run completes with `success`.
- `/{owner}/{repo}/actions.atom` includes the completed run.

Repeat with a step that runs `exit 1`; the check run should complete with `failure`.
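If the smoke run is never claimed, the label registered in step 2 is the first thing to check. The sketch below shows one plausible matching rule, assuming every `runs-on` label must be present on the runner and the runner must have free capacity. The real claim logic is not documented here, and `runner_can_claim` is a hypothetical name.

```python
def runner_can_claim(
    runner_labels: set[str],
    capacity: int,
    active_jobs: int,
    runs_on: list[str],
) -> bool:
    """Assumed rule: free capacity and every requested label is registered."""
    has_capacity = active_jobs < capacity
    labels_match = set(runs_on) <= runner_labels
    return has_capacity and labels_match


# Labels and capacity from the registration command in step 2.
smoke_runner_labels = {"self-hosted", "linux", "ubuntu-latest"}

print(runner_can_claim(smoke_runner_labels, 1, 0, ["ubuntu-latest"]))
print(runner_can_claim(smoke_runner_labels, 1, 0, ["macos-latest"]))
```

Under this assumption the first call matches the smoke workflow and the second does not, which is the situation that leaves a run queued.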

## Live log tail

Step log pages open an SSE stream at:

```text
/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream
```

The stream sends `event: chunk` records with the chunk sequence as the SSE `id`. Browsers reconnect with `Last-Event-ID`; the handler also accepts `?after=<seq>` for the first connection from a rendered log page. A terminal step sends `event: done` and closes the stream.
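When debugging a tail by hand, it helps to see how the events and resume points fit together. This is a minimal sketch of a client-side parser, assuming chunk payloads are plain text in the `data` field (the payload format is not specified in this runbook). `SSEEvent`, `parse_sse`, and `first_connect_path` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class SSEEvent:
    event: str
    id: str
    data: str


def parse_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Yield events from SSE lines; a blank line terminates each event."""
    event, last_id, data = "message", "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Unlike browsers, this sketch also yields named events that
            # carry no data, so a bare `event: done` stays visible.
            if data or event != "message":
                yield SSEEvent(event, last_id, "\n".join(data))
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "id":
            last_id = value  # persists across events, as in the SSE spec
        elif field == "data":
            data.append(value)


def first_connect_path(stream_path: str, after_seq: int) -> str:
    """First connection from a rendered page resumes with ?after=<seq>."""
    return f"{stream_path}?after={after_seq}"


sample_stream = [
    "id: 7", "event: chunk", "data: step output line", "",
    "id: 8", "event: chunk", "data: more output", "",
    "event: done", "data: ", "",
]
events = list(parse_sse(sample_stream))
last_seq = max(int(e.id) for e in events if e.event == "chunk")
reconnect_headers = {"Last-Event-ID": str(last_seq)}
```

A reconnecting client sends `reconnect_headers`, while a freshly rendered log page would use `first_connect_path` with the last sequence it already displays.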

In `shithubd`, this route is mounted outside the normal app compression and 30-second timeout middleware. If a future route move puts live logs back under either middleware, `EventSource` clients will reconnect repeatedly and logs can buffer despite the Caddy flush setting.

Log chunk bodies are never sent through Postgres `NOTIFY`. Runner log writes append to `workflow_step_log_chunks`, then issue `NOTIFY step_log_<step_id>` with only the sequence number as the payload. Step completion notifies `done`.
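The reason for notifying with only a sequence number is that the table stays the source of truth, so a missed or coalesced notification loses nothing. The sketch below illustrates that pattern with an in-memory SQLite table standing in for Postgres. The column names (`step_id`, `seq`, `body`) and the helper functions are assumptions for illustration.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the chunk table; the columns are assumed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflow_step_log_chunks ("
        " step_id INTEGER, seq INTEGER, body TEXT,"
        " PRIMARY KEY (step_id, seq))"
    )
    return conn


def append_chunk(conn: sqlite3.Connection, step_id: int, seq: int, body: str) -> str:
    """Store the chunk, then return the notification payload: the sequence only."""
    conn.execute(
        "INSERT INTO workflow_step_log_chunks VALUES (?, ?, ?)",
        (step_id, seq, body),
    )
    return str(seq)


def chunks_after(conn: sqlite3.Connection, step_id: int, after_seq: int):
    """Listener side: on any notification, read everything newer than after_seq."""
    rows = conn.execute(
        "SELECT seq, body FROM workflow_step_log_chunks"
        " WHERE step_id = ? AND seq > ? ORDER BY seq",
        (step_id, after_seq),
    )
    return rows.fetchall()


conn = open_store()
payloads = [append_chunk(conn, 9, seq, f"line {seq}\n") for seq in (1, 2, 3)]

# Suppose the listener only observed the final notification.
last_seen = 0
delivered = chunks_after(conn, 9, last_seen)
last_seen = delivered[-1][0]
```

Even though two notifications were never observed, `delivered` holds all three chunks in order.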

## Rate limit

Live tails use `internal/ratelimit` scope `actions:logtail` with five concurrent streams per viewer. Authenticated viewers key by user id; anonymous public-repo viewers key by client IP. The limiter uses a short lease TTL so a dropped connection cannot hold a slot permanently.
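A lease-based concurrency limit of this kind can be sketched as follows. This is an illustration of the technique, not the `internal/ratelimit` implementation: the class, the key format, and the 30-second TTL are all assumptions.

```python
import itertools
from typing import Optional


class LeaseLimiter:
    """Caps concurrent leases per key; unrenewed leases expire after ttl seconds."""

    def __init__(self, limit: int = 5, ttl: float = 30.0):
        self.limit = limit
        self.ttl = ttl
        self._leases: dict[str, dict[int, float]] = {}
        self._ids = itertools.count(1)

    def _live(self, key: str, now: float) -> dict[int, float]:
        leases = self._leases.setdefault(key, {})
        for lease_id in [i for i, expiry in leases.items() if expiry <= now]:
            del leases[lease_id]  # a dropped connection frees its slot here
        return leases

    def acquire(self, key: str, now: float) -> Optional[int]:
        leases = self._live(key, now)
        if len(leases) >= self.limit:
            return None
        lease_id = next(self._ids)
        leases[lease_id] = now + self.ttl
        return lease_id

    def renew(self, key: str, lease_id: int, now: float) -> bool:
        leases = self._live(key, now)
        if lease_id not in leases:
            return False
        leases[lease_id] = now + self.ttl
        return True

    def release(self, key: str, lease_id: int) -> None:
        self._leases.get(key, {}).pop(lease_id, None)


def viewer_key(user_id: Optional[int], client_ip: str) -> str:
    """Authenticated viewers key by user id, anonymous viewers by client IP."""
    return f"user:{user_id}" if user_id is not None else f"ip:{client_ip}"


limiter = LeaseLimiter(limit=5, ttl=30.0)
key = viewer_key(None, "203.0.113.7")
held = [limiter.acquire(key, now=0.0) for _ in range(5)]
sixth = limiter.acquire(key, now=1.0)         # rejected, all five slots are held
after_expiry = limiter.acquire(key, now=31.0) # succeeds once stale leases lapse
```

A live stream would call `renew` periodically and `release` on a clean close, so only abandoned connections rely on the TTL.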

## Caddy

The production Caddy template has a dedicated Actions log-stream route with:

```caddy
flush_interval -1
```

The same route is excluded from gzip compression. If logs arrive only after several kilobytes accumulate, verify the deployed `/etc/caddy/Caddyfile` contains that route and reload Caddy:

```sh
sudo caddy reload --config /etc/caddy/Caddyfile
```

## Runner health

On the runner host:

```sh
systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager
```

On the app host, inspect runner registration and heartbeat state:

```sh
shithubd admin actions runner list
```

Important metrics:

- `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}`
- `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}`
- `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}`
- `shithub_actions_log_scrub_replacements_total{location="server"}`
- `shithub_actions_step_timeouts_total`
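One quick health signal is the share of heartbeats that actually claimed a job. The sketch below assumes the metrics are exposed in the Prometheus text exposition format and that the `|` in the list above separates possible label values. The sample numbers are made up, and `parse_metrics` and `heartbeat_claim_ratio` are hypothetical helpers.

```python
import re

SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, dict(LABEL.findall(labels or "")), float(value)))
    return samples


def heartbeat_claim_ratio(text: str) -> float:
    """Fraction of heartbeats that claimed a job, over the counter lifetime."""
    claimed = no_job = 0.0
    for name, labels, value in parse_metrics(text):
        if name != "shithub_actions_runner_heartbeats_total":
            continue
        if labels.get("result") == "claimed":
            claimed += value
        elif labels.get("result") == "no_job":
            no_job += value
    total = claimed + no_job
    return claimed / total if total else 0.0


# Made-up scrape output, for illustration only.
scrape = """
# TYPE shithub_actions_runner_heartbeats_total counter
shithub_actions_runner_heartbeats_total{result="claimed"} 12
shithub_actions_runner_heartbeats_total{result="no_job"} 348
shithub_actions_step_timeouts_total 3
"""
ratio = heartbeat_claim_ratio(scrape)
```

A ratio stuck at zero while jobs are queued points at label or capacity mismatches rather than a dead runner, since heartbeats are still arriving.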

## Emergency cancel

Start with a dry run:

```sh
shithubd admin actions cancel-all --dry-run --limit 100
```

Scope to one repository when possible:

```sh
shithubd admin actions cancel-all --dry-run --repo-id 42
```

Then confirm:

```sh
shithubd admin actions cancel-all --confirm --repo-id 42
```

Queued jobs are marked cancelled immediately. Running jobs receive `cancel_requested=true`; the runner sees that through `/cancel-check`, kills the active container, and reports terminal `cancelled`.
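The two cancellation paths can be modelled as a small state machine. This is a sketch of the behaviour described above, assuming simple in-memory job records. `Job`, `cancel_all`, and `runner_cancel_check` are hypothetical names, not the server or runner code.

```python
from dataclasses import dataclass


@dataclass
class Job:
    id: int
    status: str  # "queued", "running", "cancelled", or another terminal state
    cancel_requested: bool = False


def cancel_all(jobs: list[Job], dry_run: bool = True) -> list[int]:
    """Return affected job ids; change state only when dry_run is False."""
    affected = [job.id for job in jobs if job.status in ("queued", "running")]
    if dry_run:
        return affected
    for job in jobs:
        if job.status == "queued":
            job.status = "cancelled"      # queued jobs are cancelled immediately
        elif job.status == "running":
            job.cancel_requested = True   # running jobs wait for the runner
    return affected


def runner_cancel_check(job: Job) -> bool:
    """Runner side: on a positive check, stop the work and report cancelled."""
    if job.status == "running" and job.cancel_requested:
        job.status = "cancelled"  # stands in for killing the container
        return True
    return False


jobs = [Job(1, "queued"), Job(2, "running"), Job(3, "success")]
preview = cancel_all(jobs, dry_run=True)
confirmed = cancel_all(jobs, dry_run=False)
state_before_poll = [(job.status, job.cancel_requested) for job in jobs]
runner_cancel_check(jobs[1])
state_after_poll = [job.status for job in jobs]
```

The gap between `state_before_poll` and `state_after_poll` is why a running job can still show as running briefly after a confirmed cancel.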

## Common failures

- **Run never appears:** confirm the workflow file is under `.shithub/workflows/`, parse it with `shithubd admin actions parse <file>`, and verify the trigger event matches `on:`.
- **Run stays queued:** confirm a runner is registered with matching labels and capacity, then inspect runner journal output and heartbeat metrics.
- **Step logs buffer:** verify the Caddy route above and confirm the SSE route is still mounted outside compression and short timeouts.
- **`uses:` step fails:** expected for now. Replace with a `run:` step until checkout/artifact support lands.
- **Secrets appear masked inconsistently:** check `shithub_actions_log_scrub_replacements_total{location="server"}` and confirm the job was claimed after the secret was created or rotated. Mask snapshots are captured at claim time.
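The claim-time snapshot explains most inconsistent masking reports. The sketch below shows the effect, assuming the scrubber replaces each snapshotted secret value with a fixed `***` marker. The marker, the `scrub` helper, and the token values are all hypothetical.

```python
def scrub(line: str, mask_snapshot: list[str]) -> tuple[str, int]:
    """Mask every snapshotted secret in a log line; return (line, replacements)."""
    replacements = 0
    # Longest first, so a secret containing another secret is masked whole.
    for secret in sorted(mask_snapshot, key=len, reverse=True):
        if secret and secret in line:
            replacements += line.count(secret)
            line = line.replace(secret, "***")
    return line, replacements


# Snapshot taken when the runner claimed the job.
snapshot_at_claim = ["old-token-123"]

# The secret is rotated after the claim; the running job still uses the snapshot.
masked_line, masked_count = scrub("auth with old-token-123", snapshot_at_claim)
leaked_line, leaked_count = scrub("auth with new-token-456", snapshot_at_claim)
```

Here `leaked_line` still contains the rotated value, because the running job only knows the secrets that existed when it was claimed. Re-running the job after the rotation takes a fresh snapshot.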
