@@ -1,5 +1,72 @@ |
| 1 | 1 | # Actions runbook |
| 2 | 2 | |
| 3 | +This is the operator runbook for shithub Actions. Host provisioning lives in |
| 4 | +[runner-deploy.md](./runner-deploy.md), and runner protocol details live in |
| 5 | +[actions-runner-api.md](../actions-runner-api.md). |
| 6 | + |
| 7 | +## Shape |
| 8 | + |
| 9 | +```text |
| 10 | +git push / workflow_dispatch / schedule / pull_request |
| 11 | + | |
| 12 | + v |
| 13 | +workflow:trigger worker job |
| 14 | + | |
| 15 | + v |
| 16 | +workflow_runs + workflow_jobs + workflow_steps + check_runs |
| 17 | + | |
| 18 | + v |
| 19 | +registered runner heartbeat claims a matching queued job |
| 20 | + | |
| 21 | + v |
| 22 | +containerized run: steps -> log chunks -> step/job status -> run rollup |
| 23 | +``` |
| 24 | + |
| 25 | +The v1 executor supports only containerized `run:` steps. The parser reserves
| 26 | +`actions/checkout@v4`, `shithub/upload-artifact@v1`, and |
| 27 | +`shithub/download-artifact@v1`, but the Docker runner rejects `uses:` steps |
| 28 | +until checkout metadata and artifact transfer are wired end to end. Do not use |
| 29 | +`actions/checkout@v4` in production smoke workflows yet. |
| 30 | + |
| 31 | +## First smoke |
| 32 | + |
| 33 | +1. Confirm migrations are applied and the web process can enqueue workers. |
| 34 | +2. Register one runner with a label that matches the workflow: |
| 35 | + |
| 36 | +```sh |
| 37 | +shithubd admin runner register \ |
| 38 | + --name smoke-runner-1 \ |
| 39 | + --labels self-hosted,linux,ubuntu-latest \ |
| 40 | + --capacity 1 |
| 41 | +``` |
| 42 | + |
| 43 | +3. Start `shithubd-runner` with the printed token. For production hosts, use |
| 44 | + the Ansible/systemd path in [runner-deploy.md](./runner-deploy.md). |
| 45 | +4. Push a `run:`-only workflow: |
| 46 | + |
| 47 | +```yaml |
| 48 | +name: smoke |
| 49 | +on: [push, workflow_dispatch] |
| 50 | +jobs: |
| 51 | + hello: |
| 52 | + runs-on: ubuntu-latest |
| 53 | + env: |
| 54 | + RUN_ID: ${{ shithub.run_id }} |
| 55 | + steps: |
| 56 | + - run: echo "hello from shithub actions" |
| 57 | + - run: test -n "$RUN_ID" |
| 58 | +``` |
| 59 | + |
| 60 | +5. Expected result: |
| 61 | + |
| 62 | +- `workflow:trigger` enqueues a workflow run. |
| 63 | +- A runner heartbeat claims the queued job within one idle poll interval. |
| 64 | +- The Actions run page streams step logs while the job is running. |
| 65 | +- The matching check run completes with `success`. |
| 66 | +- `/{owner}/{repo}/actions.atom` includes the completed run. |
| 67 | + |
| 68 | +Repeat with a step that runs `exit 1`; the check run should complete with `failure`.
| 69 | + |
| 3 | 70 | ## Live log tail |
| 4 | 71 | |
| 5 | 72 | Step log pages open an SSE stream at: |
@@ -44,3 +111,66 @@ contains that route and reload Caddy: |
| 44 | 111 | ```sh |
| 45 | 112 | sudo caddy reload --config /etc/caddy/Caddyfile |
| 46 | 113 | ``` |
| 114 | + |
| 115 | +## Runner health |
| 116 | + |
| 117 | +On the runner host: |
| 118 | + |
| 119 | +```sh |
| 120 | +systemctl status shithubd-runner |
| 121 | +journalctl -u shithubd-runner -n 100 --no-pager |
| 122 | +``` |
| 123 | + |
| 124 | +On the app host, inspect runner registration and heartbeat state: |
| 125 | + |
| 126 | +```sh |
| 127 | +shithubd admin actions runner list |
| 128 | +``` |
| 129 | + |
| 130 | +Important metrics: |
| 131 | + |
| 132 | +- `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}` |
| 133 | +- `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}` |
| 134 | +- `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}` |
| 135 | +- `shithub_actions_log_scrub_replacements_total{location="server"}` |
| 136 | +- `shithub_actions_step_timeouts_total` |
| 137 | + |
| 138 | +## Emergency cancel |
| 139 | + |
| 140 | +Start with a dry run: |
| 141 | + |
| 142 | +```sh |
| 143 | +shithubd admin actions cancel-all --dry-run --limit 100 |
| 144 | +``` |
| 145 | + |
| 146 | +Scope to one repository when possible: |
| 147 | + |
| 148 | +```sh |
| 149 | +shithubd admin actions cancel-all --dry-run --repo-id 42 |
| 150 | +``` |
| 151 | + |
| 152 | +Then rerun with `--confirm` in place of `--dry-run` to apply the cancellation:
| 153 | + |
| 154 | +```sh |
| 155 | +shithubd admin actions cancel-all --confirm --repo-id 42 |
| 156 | +``` |
| 157 | + |
| 158 | +Queued jobs are marked cancelled immediately. Running jobs receive
| 159 | +`cancel_requested=true`; the runner observes the flag through `/cancel-check`,
| 160 | +kills the active container, and reports the terminal status `cancelled`.
| 161 | + |
| 162 | +## Common failures |
| 163 | + |
| 164 | +- **Run never appears:** confirm the workflow file is under |
| 165 | + `.shithub/workflows/`, parse it with `shithubd admin actions parse <file>`, |
| 166 | + and verify the trigger event matches `on:`. |
| 167 | +- **Run stays queued:** confirm a runner is registered with matching labels and |
| 168 | + capacity, then inspect runner journal output and heartbeat metrics. |
| 169 | +- **Step logs buffer instead of streaming:** verify the Caddy route above and
| 170 | +  confirm the SSE route is still exempt from compression and short timeouts.
| 171 | +- **`uses:` step fails:** expected for now. Replace with a `run:` step until |
| 172 | + checkout/artifact support lands. |
| 173 | +- **Secrets appear masked inconsistently:** check |
| 174 | + `shithub_actions_log_scrub_replacements_total{location="server"}` and confirm |
| 175 | + the job was claimed after the secret was created or rotated. Mask snapshots |
| 176 | + are captured at claim time. |
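
The emergency-cancel sequence in the second hunk (dry run first, then `--confirm`) could be wrapped so the preview can never be skipped. What follows is a minimal sketch, not part of shithub: the `cancel_repo` function, the `SHITHUBD` override variable, and the typed `yes` prompt are all assumptions made for illustration. Only the `cancel-all` subcommand and its `--dry-run`, `--confirm`, and `--repo-id` flags come from the runbook.

```sh
#!/bin/sh
# Guarded emergency cancel: always preview with --dry-run before --confirm.
# SHITHUBD is a hypothetical override so the sketch can be exercised with a stub.
SHITHUBD="${SHITHUBD:-shithubd}"

# cancel_repo REPO_ID [ANSWER]
# Runs the dry run for one repository, then applies the cancel only when
# ANSWER (or the operator's typed reply) is exactly "yes".
cancel_repo() {
  repo_id="$1"
  answer="${2:-}"

  # Preview which jobs would be cancelled; stop if the preview itself fails.
  "$SHITHUBD" admin actions cancel-all --dry-run --repo-id "$repo_id" || return 1

  # Ask interactively only when no answer was passed as an argument.
  if [ -z "$answer" ]; then
    printf 'Cancel these jobs for repo %s? Type "yes": ' "$repo_id"
    read -r answer
  fi

  if [ "$answer" != "yes" ]; then
    echo "aborted: nothing cancelled"
    return 2
  fi

  # Apply the same scope that was just previewed.
  "$SHITHUBD" admin actions cancel-all --confirm --repo-id "$repo_id"
}
```

Sourcing the file and running `cancel_repo 42` would print the dry-run output and wait for a typed `yes`. Keeping the repository ID identical across both calls is the point of the wrapper: the confirmed cancel cannot be scoped more widely than the preview.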