@@ -1,5 +1,72 @@ |
| 1 | # Actions runbook | 1 | # Actions runbook |
| 2 | | 2 | |
| | 3 | +This is the operator runbook for shithub Actions. Host provisioning lives in |
| | 4 | +[runner-deploy.md](./runner-deploy.md), and runner protocol details live in |
| | 5 | +[actions-runner-api.md](../actions-runner-api.md). |
| | 6 | + |
| | 7 | +## Shape |
| | 8 | + |
| | 9 | +```text |
| | 10 | +git push / workflow_dispatch / schedule / pull_request |
| | 11 | + | |
| | 12 | + v |
| | 13 | +workflow:trigger worker job |
| | 14 | + | |
| | 15 | + v |
| | 16 | +workflow_runs + workflow_jobs + workflow_steps + check_runs |
| | 17 | + | |
| | 18 | + v |
| | 19 | +registered runner heartbeat claims a matching queued job |
| | 20 | + | |
| | 21 | + v |
| | 22 | +containerized run: steps -> log chunks -> step/job status -> run rollup |
| | 23 | +``` |
| | 24 | + |
| | 25 | +The v1 executor supports containerized `run:` steps. The parser reserves |
| | 26 | +`actions/checkout@v4`, `shithub/upload-artifact@v1`, and |
| | 27 | +`shithub/download-artifact@v1`, but the Docker runner rejects `uses:` steps |
| | 28 | +until checkout metadata and artifact transfer are wired end to end. Do not use |
| | 29 | +`actions/checkout@v4` in production smoke workflows yet. |
| | 30 | + |
| | 31 | +## First smoke |
| | 32 | + |
| | 33 | +1. Confirm migrations are applied and the web process can enqueue workers. |
| | 34 | +2. Register one runner with a label that matches the workflow: |
| | 35 | + |
| | 36 | +```sh |
| | 37 | +shithubd admin runner register \ |
| | 38 | + --name smoke-runner-1 \ |
| | 39 | + --labels self-hosted,linux,ubuntu-latest \ |
| | 40 | + --capacity 1 |
| | 41 | +``` |
| | 42 | + |
| | 43 | +3. Start `shithubd-runner` with the printed token. For production hosts, use |
| | 44 | + the Ansible/systemd path in [runner-deploy.md](./runner-deploy.md). |
| | 45 | +4. Push a `run:`-only workflow: |
| | 46 | + |
| | 47 | +```yaml |
| | 48 | +name: smoke |
| | 49 | +on: [push, workflow_dispatch] |
| | 50 | +jobs: |
| | 51 | + hello: |
| | 52 | + runs-on: ubuntu-latest |
| | 53 | + env: |
| | 54 | + RUN_ID: ${{ shithub.run_id }} |
| | 55 | + steps: |
| | 56 | + - run: echo "hello from shithub actions" |
| | 57 | + - run: test -n "$RUN_ID" |
| | 58 | +``` |
| | 59 | + |
| | 60 | +5. Expected result: |
| | 61 | + |
| | 62 | +- `workflow:trigger` enqueues a workflow run. |
| | 63 | +- A runner heartbeat claims the queued job within one idle poll interval. |
| | 64 | +- The Actions run page streams step logs while the job is running. |
| | 65 | +- The matching check run completes with `success`. |
| | 66 | +- `/{owner}/{repo}/actions.atom` includes the completed run. |
| | 67 | + |
| | 68 | +Repeat with a step that runs `exit 1`; the matching check run should complete with `failure`.
| | 69 | + |
| 3 | ## Live log tail | 70 | ## Live log tail |
| 4 | | 71 | |
| 5 | Step log pages open an SSE stream at: | 72 | Step log pages open an SSE stream at: |
@@ -44,3 +111,66 @@ contains that route and reload Caddy: |
| 44 | ```sh | 111 | ```sh |
| 45 | sudo caddy reload --config /etc/caddy/Caddyfile | 112 | sudo caddy reload --config /etc/caddy/Caddyfile |
| 46 | ``` | 113 | ``` |
| | 114 | + |
| | 115 | +## Runner health |
| | 116 | + |
| | 117 | +On the runner host: |
| | 118 | + |
| | 119 | +```sh |
| | 120 | +systemctl status shithubd-runner |
| | 121 | +journalctl -u shithubd-runner -n 100 --no-pager |
| | 122 | +``` |
| | 123 | + |
| | 124 | +On the app host, inspect runner registration and heartbeat state: |
| | 125 | + |
| | 126 | +```sh |
| | 127 | +shithubd admin actions runner list |
| | 128 | +``` |
| | 129 | + |
| | 130 | +Important metrics (values separated by `|` are the possible label values):
| | 131 | + |
| | 132 | +- `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}` |
| | 133 | +- `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}` |
| | 134 | +- `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}` |
| | 135 | +- `shithub_actions_log_scrub_replacements_total{location="server"}` |
| | 136 | +- `shithub_actions_step_timeouts_total` |
| | 137 | + |
| | 138 | +## Emergency cancel |
| | 139 | + |
| | 140 | +Start with a dry run: |
| | 141 | + |
| | 142 | +```sh |
| | 143 | +shithubd admin actions cancel-all --dry-run --limit 100 |
| | 144 | +``` |
| | 145 | + |
| | 146 | +Scope to one repository when possible: |
| | 147 | + |
| | 148 | +```sh |
| | 149 | +shithubd admin actions cancel-all --dry-run --repo-id 42 |
| | 150 | +``` |
| | 151 | + |
| | 152 | +Then rerun with `--confirm` in place of `--dry-run` to apply the cancellation:
| | 153 | + |
| | 154 | +```sh |
| | 155 | +shithubd admin actions cancel-all --confirm --repo-id 42 |
| | 156 | +``` |
| | 157 | + |
| | 158 | +Queued jobs are marked cancelled immediately. Running jobs receive |
| | 159 | +`cancel_requested=true`; the runner sees that through `/cancel-check`, kills the |
| | 160 | +active container, and reports the terminal status `cancelled`.
| | 161 | + |
| | 162 | +## Common failures |
| | 163 | + |
| | 164 | +- **Run never appears:** confirm the workflow file is under |
| | 165 | + `.shithub/workflows/`, parse it with `shithubd admin actions parse <file>`, |
| | 166 | + and verify the trigger event matches `on:`. |
| | 167 | +- **Run stays queued:** confirm a runner is registered with matching labels and |
| | 168 | + capacity, then inspect runner journal output and heartbeat metrics. |
| | 169 | +- **Step logs buffer:** verify the Caddy route above and confirm the SSE route
| | 170 | +  is still excluded from compression and short timeouts.
| | 171 | +- **`uses:` step fails:** expected for now. Replace with a `run:` step until |
| | 172 | + checkout/artifact support lands. |
| | 173 | +- **Secrets appear masked inconsistently:** check |
| | 174 | + `shithub_actions_log_scrub_replacements_total{location="server"}` and confirm |
| | 175 | + the job was claimed after the secret was created or rotated. Mask snapshots |
| | 176 | + are captured at claim time. |