docs: record actions timeout behavior
Authored by mfwolffe <wolffemf@dukes.jmu.edu>

- SHA: 59d7d7aae491f8625ee6a76429cdea8e1521f6ca
- Parents: f23d1f1
- Tree: 8faf4a2

| Status | File | + | - |
|---|---|---|---|
| M | docs/internal/actions-runner-api.md | 7 | 0 |
| M | docs/internal/actions-schema.md | 18 | 0 |
| M | docs/internal/runbooks/actions-runner.md | 22 | 1 |

docs/internal/actions-runner-api.md (modified)
@@ -131,6 +131,12 @@ Completed jobs require a valid check conclusion. The handler updates
| 131 | 131 | `workflow_jobs`, rolls up `workflow_runs`, and best-effort updates the |
| 132 | 132 | matching `check_runs` row created by the trigger pipeline. |
| 133 | 133 | |
| 134 | +`timeout-minutes` is enforced by `shithubd-runner` as a whole-job | |
| 135 | +deadline. When it expires, the runner kills the active container, | |
| 136 | +reports the current step as `completed/timed_out`, and reports the job | |
| 137 | +as `completed/timed_out`. The server treats that conclusion as terminal | |
| 138 | +failure for the workflow run rollup. | |
| 139 | + | |
| 134 | 140 | When a runner reports `status:"cancelled"`, any still-open steps in the |
| 135 | 141 | job are marked cancelled too. This keeps a killed job from leaving queued |
| 136 | 142 | step rows that the UI would otherwise treat as live. |
@@ -209,3 +215,4 @@ runner posts terminal job status `cancelled`.
| 209 | 215 | - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}` |
| 210 | 216 | - `shithub_actions_log_scrub_replacements_total{location="server"}` |
| 211 | 217 | - `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}` |
| 218 | +- `shithub_actions_step_timeouts_total` | |
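
The schema notes in this commit say the step-timeout counter increments only the first time a step reaches `conclusion=timed_out`, so retried terminal updates do not count twice. A rough Go sketch of that once-per-step behavior follows. The `timeoutCounter` type and its in-memory set of step IDs are assumptions for illustration; the real server more plausibly keys this off the step row's state transition.

```go
package main

import (
	"fmt"
	"sync"
)

// timeoutCounter is a hypothetical stand-in for the
// shithub_actions_step_timeouts_total metric plus the dedupe state.
type timeoutCounter struct {
	mu    sync.Mutex
	seen  map[int64]bool // step IDs that have already been counted
	total int
}

func newTimeoutCounter() *timeoutCounter {
	return &timeoutCounter{seen: make(map[int64]bool)}
}

// Observe handles one terminal step update and reports whether the
// counter was incremented by it.
func (c *timeoutCounter) Observe(stepID int64, conclusion string) bool {
	if conclusion != "timed_out" {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[stepID] {
		return false // duplicate retry of the same terminal update
	}
	c.seen[stepID] = true
	c.total++
	return true
}

// Total returns the current counter value.
func (c *timeoutCounter) Total() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total
}

func main() {
	c := newTimeoutCounter()
	c.Observe(7, "timed_out")
	c.Observe(7, "timed_out") // retry of the same update
	c.Observe(8, "success")
	fmt.Println("step timeouts:", c.Total())
}
```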
docs/internal/actions-schema.md (modified)
@@ -382,6 +382,24 @@ Other admin surfaces are scoped to later sub-sprints:
| 382 | 382 | `shithubd-cron.service`. Operators can run it manually with |
| 383 | 383 | `shithubd admin run-job workflow:cleanup`. |
| 384 | 384 | |
| 385 | +## Runner timeouts (S41g) | |
| 386 | + | |
| 387 | +`jobs.<key>.timeout-minutes` is enforced by `shithubd-runner` as a | |
| 388 | +whole-job deadline. The parser stores the value in | |
| 389 | +`workflow_jobs.timeout_minutes` with the GitHub-compatible default of | |
| 390 | +360 minutes and a 1..4320 cap. | |
| 391 | + | |
| 392 | +When the deadline expires, the Docker engine explicitly kills the | |
| 393 | +active step container, emits a terminal step update with | |
| 394 | +`status=completed` and `conclusion=timed_out`, and the runner reports | |
| 395 | +the job itself as `completed/timed_out`. The server rolls the parent | |
| 396 | +workflow run up to `timed_out` when all jobs are terminal. A timed-out | |
| 397 | +step is not masked by `continue-on-error`; the job deadline always wins. | |
| 398 | + | |
| 399 | +The runner API increments `shithub_actions_step_timeouts_total` the | |
| 400 | +first time a step reaches `conclusion=timed_out`. Duplicate terminal | |
| 401 | +step-status retries do not increment the counter again. | |
| 402 | + | |
| 385 | 403 | ## Retention cleanup (S41g) |
| 386 | 404 | |
| 387 | 405 | `workflow:cleanup` applies the durable Actions retention contract in |
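
The default and cap described in the "Runner timeouts" section of this hunk can be sketched as a small Go helper. This is an assumption-laden illustration: `effectiveTimeoutMinutes` is a hypothetical name, and the text does not say whether out-of-range values are clamped or rejected, so clamping here is a guess rather than the parser's confirmed behavior.

```go
package main

import "fmt"

const (
	defaultTimeoutMinutes = 360  // GitHub-compatible default named in the docs
	minTimeoutMinutes     = 1    // lower bound of the documented 1..4320 range
	maxTimeoutMinutes     = 4320 // upper bound of the documented 1..4320 range
)

// effectiveTimeoutMinutes returns the value that would be stored in
// workflow_jobs.timeout_minutes. A nil input means the workflow omitted
// timeout-minutes. Out-of-range values are clamped (an assumption).
func effectiveTimeoutMinutes(raw *int) int {
	if raw == nil {
		return defaultTimeoutMinutes
	}
	switch {
	case *raw < minTimeoutMinutes:
		return minTimeoutMinutes
	case *raw > maxTimeoutMinutes:
		return maxTimeoutMinutes
	}
	return *raw
}

func main() {
	one, huge := 1, 100000
	fmt.Println(effectiveTimeoutMinutes(nil))
	fmt.Println(effectiveTimeoutMinutes(&one))
	fmt.Println(effectiveTimeoutMinutes(&huge))
}
```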
docs/internal/runbooks/actions-runner.md (modified)
@@ -167,6 +167,27 @@ Expected response includes a new `run_id`, the new `run_index`, and
| 167 | 167 | `parent_run_id` equal to the source run. Confirm the new row has the |
| 168 | 168 | same `head_sha` as the source run. |
| 169 | 169 | |
| 170 | +Timeout smoke: create a workflow with a short deadline and a long-running | |
| 171 | +step: | |
| 172 | + | |
| 173 | +```yaml | |
| 174 | +jobs: | |
| 175 | + timeout_probe: | |
| 176 | + runs-on: ubuntu-latest | |
| 177 | + timeout-minutes: 1 | |
| 178 | + steps: | |
| 179 | + - run: sleep 600 | |
| 180 | +``` | |
| 181 | + | |
| 182 | +Expected results: | |
| 183 | + | |
| 184 | +- The runner kills the active container shortly after the one-minute | |
| 185 | + deadline. | |
| 186 | +- The step row becomes `status=completed`, | |
| 187 | + `conclusion=timed_out`. | |
| 188 | +- The job row becomes `status=completed`, `conclusion=timed_out`. | |
| 189 | +- `/metrics` increments `shithub_actions_step_timeouts_total`. | |
| 190 | + | |
| 170 | 191 | Replay check: reusing the log token after the log call must fail with |
| 171 | 192 | 401 because its `jti` is already present in `runner_jwt_used`. |
| 172 | 193 | |
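
For the timeout smoke in this hunk, the `/metrics` expectation can be checked by comparing the counter before and after the run. The Go sketch below assumes `/metrics` serves the Prometheus text exposition format; `counterValue` is a hypothetical helper, and fetching the endpoint is left out so the example stays self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// counterValue sums every sample of the named metric in a Prometheus
// text-format body. The bool is false when the metric is absent.
func counterValue(body, name string) (float64, bool) {
	var sum float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the label block or at whitespace.
		end := strings.IndexAny(line, "{ \t")
		if end < 0 || line[:end] != name {
			continue
		}
		rest := line[end:]
		if strings.HasPrefix(rest, "{") {
			labelEnd := strings.LastIndex(rest, "}")
			if labelEnd < 0 {
				continue
			}
			rest = rest[labelEnd+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		sum += v
		found = true
	}
	return sum, found
}

func main() {
	const metric = "shithub_actions_step_timeouts_total"
	// Sample bodies; in a real smoke run these would be two /metrics scrapes.
	before := "# TYPE " + metric + " counter\n" + metric + " 3\n"
	after := "# TYPE " + metric + " counter\n" + metric + " 4\n"

	b, _ := counterValue(before, metric)
	a, _ := counterValue(after, metric)
	fmt.Println("delta:", a-b)
}
```

Comparing a delta rather than an absolute value keeps the check valid on a server that has already recorded earlier timeouts.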
@@ -184,7 +205,7 @@ Expected results:
| 184 | 205 | jobs are terminal. |
| 185 | 206 | - The PR Checks tab shows the matching check run as success. |
| 186 | 207 | - `/metrics` includes runner registration, heartbeat, JWT, job |
| 187 | - cancellation, log-scrub, and retention counters. | |
| 208 | + cancellation, log-scrub, step-timeout, and retention counters. | |
| 188 | 209 | |
| 189 | 210 | ## Retention Sweep |
| 190 | 211 | |