Actions runbook

This is the operator runbook for shithub Actions. Host provisioning lives in runner-deploy.md, and runner protocol details live in actions-runner-api.md.

Shape

git push / workflow_dispatch / schedule / pull_request
        |
        v
workflow:trigger worker job
        |
        v
workflow_runs + workflow_jobs + workflow_steps + check_runs
        |
        v
registered runner heartbeat claims a matching queued job
        |
        v
actions/checkout@v4 -> containerized run: steps
        |
        v
log chunks -> step/job status -> run rollup

The v1 executor supports host-side actions/checkout@v4 plus containerized run: steps. The checkout token is short-lived, repository-scoped, tied to a running job, and accepted only for read-only smart-HTTP fetches. shithub/upload-artifact@v1 and shithub/download-artifact@v1 are still reserved aliases and fail until artifact transfer is wired end to end.

First smoke

1. Confirm migrations are applied and the web process can enqueue workers.
2. Register one runner with a label that matches the workflow:

shithubd admin runner register \
  --name smoke-runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json

3. Start shithubd-runner with the returned token. For production hosts, use one token per host and store it in ansible-vault, host vars, or the deployment secret store. The role writes /etc/shithubd-runner/config.toml with restrictive permissions. Use --expires-in only when the automation rotates the runner token before that deadline; the current runner uses the registration token for every heartbeat.
4. Push a run:-only workflow:

name: smoke
on: [push, workflow_dispatch]
jobs:
  hello:
    runs-on: ubuntu-latest
    env:
      RUN_ID: ${{ shithub.run_id }}
    steps:
      - run: echo "hello from shithub actions"
      - run: test -n "$RUN_ID"

5. Expected result:

workflow:trigger enqueues a workflow run.
A runner heartbeat claims the queued job within one idle poll interval.
The Actions run page streams step logs while the job is running.
The matching check run completes with success.
/{owner}/{repo}/actions.atom includes the completed run.

Repeat with exit 1; the check should complete with failure.
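
The claim in the expected result depends on the runner's labels covering the job's runs-on labels and on the runner having a free slot. The following is a minimal sketch of that check, not the shithubd implementation. It assumes every requested label must be present on the runner, which this runbook does not spell out. `Runner`, `can_claim`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    """Hypothetical view of a registered runner; field names are illustrative."""
    name: str
    labels: frozenset
    capacity: int
    active_jobs: int = 0


def can_claim(runner: Runner, runs_on: list) -> bool:
    """Return True when the runner could claim a job requesting `runs_on`.

    Assumption: all requested labels must appear on the runner, and the
    runner must have at least one free capacity slot.
    """
    has_labels = set(runs_on) <= runner.labels
    has_slot = runner.active_jobs < runner.capacity
    return has_labels and has_slot


smoke = Runner(
    name="smoke-runner-1",
    labels=frozenset({"self-hosted", "linux", "ubuntu-latest", "x64"}),
    capacity=1,
)
print(can_claim(smoke, ["ubuntu-latest"]))   # labels covered, slot free
print(can_claim(smoke, ["windows-latest"]))  # no matching label, job stays queued
```

Under this assumption, a job requesting ubuntu-latest is claimable by the smoke runner registered above, while a windows-latest job waits until a runner with that label exists.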

Checkout smoke

After the run-only smoke passes, verify repository checkout with:

name: checkout smoke
on: [push, workflow_dispatch]
jobs:
  checkout:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: test -f README.md
      - run: test "$(git rev-parse HEAD)" = "${{ shithub.sha }}"

Expected result:

The claim response contains checkout_url and checkout_token.
Git smart HTTP sees the checkout token as Basic auth and permits git-upload-pack for the claimed repo.
git-receive-pack rejects the same credential with 403.
The job workspace contains the exact shithub.sha commit before the first run: step starts.
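
The token rules above can be summarized as a single decision function. This is a sketch under stated assumptions, not the server's code: `CheckoutToken`, `authorize_git`, and the field names are hypothetical, and only the 403 for git-receive-pack comes from this runbook. The other rejection codes are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutToken:
    """Hypothetical token record; the real schema is not shown in this runbook."""
    repo_id: int
    job_status: str     # status of the job the token is tied to
    expires_at: float   # Unix timestamp


def authorize_git(token: CheckoutToken, repo_id: int, service: str, now: float) -> int:
    """Return an HTTP status for a smart-HTTP request made with a checkout token."""
    if service == "git-receive-pack":
        return 403  # from the runbook: checkout tokens are read-only
    if service != "git-upload-pack":
        return 403  # assumption: any other service is refused
    if now >= token.expires_at or token.job_status != "running":
        return 401  # assumption: a stale token is treated as unauthenticated
    if token.repo_id != repo_id:
        return 403  # assumption: token is scoped to a different repository
    return 200


tok = CheckoutToken(repo_id=42, job_status="running", expires_at=1000.0)
print(authorize_git(tok, 42, "git-upload-pack", now=10.0))   # fetch allowed
print(authorize_git(tok, 42, "git-receive-pack", now=10.0))  # push refused
```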

Live log tail

Step log pages open an SSE stream at:

/{owner}/{repo}/actions/runs/{run}/jobs/{job}/steps/{step}/log/stream

The stream sends event: chunk records with the chunk sequence as the SSE id. Browsers reconnect with Last-Event-ID; the handler also accepts ?after=<seq> for the first connection from a rendered log page. A terminal step sends event: done and closes the stream.
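
To make the resume behavior concrete, here is a small client-side sketch that parses SSE records and tracks the last chunk sequence. It works on an iterable of lines so it runs without a server. The assumptions are that the chunk id is an integer and that the `data` payloads look like the sample, since this runbook does not define the payload format. `parse_sse` and `tail` are hypothetical names.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[str], str]]:
    """Yield (event, id, data) for each SSE record; a blank line ends a record.

    Unlike a browser EventSource, this also yields records without data,
    so a bare `event: done` would still be visible to the caller.
    """
    event, event_id, data = "message", None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data or event != "message":
                yield event, event_id, "\n".join(data)
            event, event_id, data = "message", None, []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "id":
            event_id = value
        elif field == "data":
            data.append(value)


def tail(lines: Iterable[str]):
    """Collect chunk payloads, the last sequence seen, and whether `done` arrived."""
    chunks, last_seq, finished = [], None, False
    for event, event_id, data in parse_sse(lines):
        if event == "chunk":
            chunks.append(data)
            if event_id is not None:
                last_seq = int(event_id)
        elif event == "done":
            finished = True
            break
    return chunks, last_seq, finished


sample = [
    "event: chunk\n", "id: 1\n", "data: hello\n", "\n",
    "event: chunk\n", "id: 2\n", "data: world\n", "\n",
    "event: done\n", "data: {}\n", "\n",
]
chunks, last_seq, finished = tail(sample)
print(chunks, last_seq, finished)
```

If the connection dropped after the second record, a client would resume with Last-Event-ID: 2, or with ?after=2 on a first connection.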

In shithubd, this route is mounted outside the normal app compression and 30-second timeout middleware. If a future route move puts live logs back under either middleware, EventSource clients will reconnect repeatedly when the timeout cuts the stream, and logs can buffer behind compression, regardless of the Caddy flush setting.

Log chunk bodies are never sent through Postgres NOTIFY. A runner log write appends to workflow_step_log_chunks, then issues NOTIFY step_log_<step_id> with only the sequence number as the payload. Step completion notifies done.
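
The split between notification and data can be sketched with an in-memory stand-in for the table. Keeping the payload to a sequence number also sidesteps the NOTIFY payload size limit (under 8000 bytes in a default PostgreSQL configuration). The store layout, the function names, and sequences starting at 1 are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Stand-in for the workflow_step_log_chunks table: step_id -> [(seq, body)].
ChunkStore = Dict[int, List[Tuple[int, bytes]]]


def append_chunk(store: ChunkStore, step_id: int, body: bytes) -> Tuple[str, str]:
    """Append a chunk and return the (channel, payload) that would be NOTIFYed."""
    rows = store.setdefault(step_id, [])
    seq = rows[-1][0] + 1 if rows else 1
    rows.append((seq, body))
    return f"step_log_{step_id}", str(seq)


def chunks_after(store: ChunkStore, step_id: int, after_seq: int) -> List[Tuple[int, bytes]]:
    """What a stream handler would read once a notification wakes it up."""
    return [(seq, body) for seq, body in store.get(step_id, []) if seq > after_seq]


store: ChunkStore = {}
append_chunk(store, 7, b"line one\n")
channel, payload = append_chunk(store, 7, b"x" * 20000)  # large body, tiny payload
print(channel, payload)  # step_log_7 2
new_rows = chunks_after(store, 7, after_seq=1)
print([(seq, len(body)) for seq, body in new_rows])
```

The listener never needs the chunk body in the notification: it reads everything past its last delivered sequence from the table.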

Rate limit

Live tails use internal/ratelimit scope actions:logtail, which allows five concurrent streams per viewer. Authenticated viewers are keyed by user ID; anonymous public-repo viewers are keyed by client IP. The limiter uses a short lease TTL so a dropped connection cannot hold a slot permanently.
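
A lease-based concurrency limit of this kind can be sketched as follows. This is not the internal/ratelimit code: `LeaseLimiter`, the 30-second TTL, and the `user:17` key format are hypothetical, chosen only to show how an expiring lease frees a slot without an explicit release.

```python
import itertools
from typing import Dict, Optional


class LeaseLimiter:
    """Concurrency limiter where each slot is a lease that expires after `ttl` seconds."""

    def __init__(self, limit: int, ttl: float) -> None:
        self.limit = limit
        self.ttl = ttl
        self._leases: Dict[str, Dict[int, float]] = {}  # key -> {lease_id: expires_at}
        self._ids = itertools.count(1)

    def _prune(self, key: str, now: float) -> Dict[int, float]:
        """Drop expired leases for `key` and return the live ones."""
        live = {i: exp for i, exp in self._leases.get(key, {}).items() if exp > now}
        self._leases[key] = live
        return live

    def acquire(self, key: str, now: float) -> Optional[int]:
        """Return a lease id, or None when the key already holds `limit` live leases."""
        live = self._prune(key, now)
        if len(live) >= self.limit:
            return None
        lease_id = next(self._ids)
        live[lease_id] = now + self.ttl
        return lease_id

    def renew(self, key: str, lease_id: int, now: float) -> bool:
        """Extend a live lease; an open stream would call this periodically."""
        live = self._prune(key, now)
        if lease_id not in live:
            return False
        live[lease_id] = now + self.ttl
        return True

    def release(self, key: str, lease_id: int) -> None:
        """Free a slot on clean disconnect."""
        self._leases.get(key, {}).pop(lease_id, None)


limiter = LeaseLimiter(limit=5, ttl=30.0)
held = [limiter.acquire("user:17", now=0.0) for _ in range(5)]
sixth = limiter.acquire("user:17", now=1.0)       # rejected: five streams open
after_ttl = limiter.acquire("user:17", now=31.0)  # accepted: stale leases expired
print(sixth, after_ttl is not None)
```

The design choice worth noting is that correctness does not depend on `release` being called: a dropped connection simply stops renewing and its slot returns after the TTL.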

Caddy

The production Caddy template has a dedicated Actions log-stream route with:

flush_interval -1

The same route is excluded from gzip compression. If logs arrive only after several kilobytes accumulate, verify the deployed /etc/caddy/Caddyfile contains that route and reload Caddy:

sudo caddy reload --config /etc/caddy/Caddyfile

Runner health

On the runner host:

systemctl status shithubd-runner
journalctl -u shithubd-runner -n 100 --no-pager

On the app host, inspect runner registration and heartbeat state:

shithubd admin actions runner list

Inspect queued jobs by requested runs-on label:

shithubd admin runner queue
shithubd admin runner queue --output json

Important metrics:

shithub_actions_queue_depth{resource="runs|jobs"}
shithub_actions_active{resource="runs|jobs"}
shithub_actions_runner_heartbeat_age_seconds{runner,status}
shithub_actions_runner_capacity{runner,status}
shithub_actions_runner_heartbeats_total{result="claimed|no_job"}
shithub_actions_runner_jwt_total{result="issued|rejected|replay"}
shithub_actions_runs_completed_total{event,conclusion}
shithub_actions_run_duration_seconds{event,conclusion}
shithub_actions_steps_completed_total{step_type,conclusion}
shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}
shithub_actions_log_scrub_replacements_total{location="server"}
shithub_actions_log_chunks_total{location="server"}
shithub_actions_log_chunk_bytes_total{location="server"}
shithub_actions_storage_objects{kind="artifacts|step_logs|hot_log_chunks"}
shithub_actions_storage_bytes{kind="artifacts|step_logs|hot_log_chunks"}
shithub_actions_step_timeouts_total

The committed dashboard JSON lives at:

deploy/monitoring/grafana/dashboards/actions.json

Load harness

bench/k6/actions-load.js exercises the runner HTTP API under concurrent job claims. It does not create workflow runs itself; seed the queue first with pushes or workflow dispatches that produce jobs matching the runner labels.

Required environment:

SHITHUB_BASE_URL=https://shithub.example.test \
SHITHUB_RUNNER_TOKENS=token-a,token-b,token-c \
k6 run bench/k6/actions-load.js

Useful knobs:

SHITHUB_ACTIONS_VUS=50 controls concurrent virtual users.
SHITHUB_ACTIONS_DURATION=10m controls the steady-state window.
SHITHUB_RUNNER_LABELS=self-hosted,linux,ubuntu-latest,x64 sets heartbeat labels.
SHITHUB_RUNNER_CAPACITY=17 keeps three runners near the 50-concurrent target.
SHITHUB_ACTIONS_LOG_BYTES=4096 controls emitted log chunk size.
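
The capacity value of 17 follows from dividing the concurrency target across the runners and rounding up. A short sketch of that arithmetic, assuming one runner per token in SHITHUB_RUNNER_TOKENS; `per_runner_capacity` is a hypothetical helper, not part of the harness:

```python
import math


def per_runner_capacity(target_concurrency: int, runner_count: int) -> int:
    """Smallest per-runner capacity whose total reaches the target."""
    return math.ceil(target_concurrency / runner_count)


tokens = "token-a,token-b,token-c".split(",")
capacity = per_runner_capacity(50, len(tokens))
print(capacity, capacity * len(tokens))  # 17 51
```

With a different number of tokens, recompute the capacity so the total stays at or just above SHITHUB_ACTIONS_VUS.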

Healthy run expectations:

queued jobs drain without unbounded shithub_actions_queue_depth growth;
runner heartbeats keep advancing and no runner deadlocks;
log append p99 stays below five seconds;
retention metrics catch up after the retention sweep.

Emergency cancel

Start with a dry run:

shithubd admin actions cancel-all --dry-run --limit 100

Scope to one repository when possible:

shithubd admin actions cancel-all --dry-run --repo-id 42

Then confirm:

shithubd admin actions cancel-all --confirm --repo-id 42

Queued jobs are marked cancelled immediately. Running jobs receive cancel_requested=true; the runner sees that flag through /cancel-check, kills the active container, and reports the terminal status cancelled.
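
The two cancel paths can be sketched as a small state machine. `Job`, `request_cancel`, and `runner_cancel_check` are hypothetical names, and the container kill is represented only by the status change; this illustrates the flow described above rather than the actual server or runner code.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record with only the fields this sketch needs."""
    status: str                    # "queued", "running", or a terminal status
    cancel_requested: bool = False


def request_cancel(job: Job) -> None:
    """Server side of a cancel for a single job."""
    if job.status == "queued":
        job.status = "cancelled"       # no runner involved, terminal at once
    elif job.status == "running":
        job.cancel_requested = True    # the runner has to finish the cancel


def runner_cancel_check(job: Job) -> bool:
    """Runner side of a /cancel-check poll; True means the job was stopped."""
    if job.status == "running" and job.cancel_requested:
        job.status = "cancelled"       # stands in for container kill + terminal report
        return True
    return False


queued, running = Job(status="queued"), Job(status="running")
request_cancel(queued)
request_cancel(running)
print(queued.status, running.status, running.cancel_requested)
runner_cancel_check(running)
print(running.status)
```

A running job therefore stays in running until the runner's next poll, which is why a stalled runner can delay the terminal status.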

Common failures

Run never appears: confirm the workflow file is under .shithub/workflows/, parse it with shithubd admin actions parse <file>, and verify the trigger event matches on:.
Run stays queued: open the run page to see the requested runner labels, then run shithubd admin runner queue and confirm a live runner is registered with matching labels and capacity. Unsupported hosted labels such as windows-latest and macos-latest intentionally remain queued until an operator registers matching runners.
Step logs buffer: verify the Caddy route above and confirm the SSE route is still mounted outside compression and short timeouts.
actions/checkout@v4 fails: confirm the job is still running, the repo URL in the runner claim points at this shithub instance, and the runner host can reach smart HTTP. The checkout token is not valid after the job leaves running.
Artifact uses: step fails: expected for now. Replace with a run: step until artifact support lands.
Secrets appear masked inconsistently: check shithub_actions_log_scrub_replacements_total{location="server"} and confirm the job was claimed after the secret was created or rotated. Mask snapshots are captured at claim time.
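
For the masking case in the list above, the effect of a claim-time snapshot can be shown with a toy scrubber. The `***` replacement string, the `scrub` helper, and the sample values are assumptions; the point is only that a value absent from the snapshot passes through unmasked.

```python
from typing import Iterable


def scrub(line: str, mask_snapshot: Iterable[str]) -> str:
    """Replace each snapshotted secret value found in `line` with ***."""
    # Longest first, so a secret that contains another one is masked whole.
    for secret in sorted(set(mask_snapshot), key=len, reverse=True):
        if secret:
            line = line.replace(secret, "***")
    return line


snapshot_at_claim = ["old-token-value"]   # captured when the job was claimed
rotated_after_claim = "new-token-value"   # created later, not in the snapshot

print(scrub("using old-token-value", snapshot_at_claim))
print(scrub(f"using {rotated_after_claim}", snapshot_at_claim))
```

The second line prints the new value in clear text, which matches the symptom of a job claimed before the rotation.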
