docs: record actions timeout behavior
Authored by mfwolffe <wolffemf@dukes.jmu.edu>
- SHA: 59d7d7aae491f8625ee6a76429cdea8e1521f6ca
- Parents: f23d1f1
- Tree: 8faf4a2

| Status | File | + | - |
|---|---|---|---|
| M | docs/internal/actions-runner-api.md | 7 | 0 |
| M | docs/internal/actions-schema.md | 18 | 0 |
| M | docs/internal/runbooks/actions-runner.md | 22 | 1 |

docs/internal/actions-runner-api.md (modified)
@@ -131,6 +131,12 @@ Completed jobs require a valid check conclusion. The handler updates | |||
| 131 | `workflow_jobs`, rolls up `workflow_runs`, and best-effort updates the | 131 | `workflow_jobs`, rolls up `workflow_runs`, and best-effort updates the |
| 132 | matching `check_runs` row created by the trigger pipeline. | 132 | matching `check_runs` row created by the trigger pipeline. |
| 133 | 133 | ||
| 134 | +`timeout-minutes` is enforced by `shithubd-runner` as a whole-job | ||
| 135 | +deadline. When it expires, the runner kills the active container, | ||
| 136 | +reports the current step as `completed/timed_out`, and reports the job | ||
| 137 | +as `completed/timed_out`. The server treats that conclusion as terminal | ||
| 138 | +failure for the workflow run rollup. | ||
| 139 | + | ||
| 134 | When a runner reports `status:"cancelled"`, any still-open steps in the | 140 | When a runner reports `status:"cancelled"`, any still-open steps in the |
| 135 | job are marked cancelled too. This keeps a killed job from leaving queued | 141 | job are marked cancelled too. This keeps a killed job from leaving queued |
| 136 | step rows that the UI would otherwise treat as live. | 142 | step rows that the UI would otherwise treat as live. |
@@ -209,3 +215,4 @@ runner posts terminal job status `cancelled`. | |||
| 209 | - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}` | 215 | - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}` |
| 210 | - `shithub_actions_log_scrub_replacements_total{location="server"}` | 216 | - `shithub_actions_log_scrub_replacements_total{location="server"}` |
| 211 | - `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}` | 217 | - `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}` |
| 218 | +- `shithub_actions_step_timeouts_total` | ||
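
Reviewer sketch for the whole-job deadline described in the `actions-runner-api.md` hunk above. This is a minimal illustration, not the `shithubd-runner` implementation: it uses a local child process in place of a container, and the names `python_step`, `run_job_with_deadline`, and `rollup_is_failure` are hypothetical. It also assumes that a deadline expiring between steps marks only the job, which the diff does not state.

```python
import subprocess
import sys
import time


def python_step(code):
    """Hypothetical helper: build a step command that runs a Python snippet."""
    return [sys.executable, "-c", code]


def run_job_with_deadline(steps, timeout_seconds):
    """Run step commands in order under one shared, whole-job deadline.

    Returns (job_conclusion, step_results). step_results holds one
    (status, conclusion) tuple per step that was started.
    """
    deadline = time.monotonic() + timeout_seconds
    results = []
    for argv in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Assumption: no step is active, so only the job is marked.
            return "timed_out", results
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # once the remaining job budget is used up.
            proc = subprocess.run(argv, timeout=remaining)
        except subprocess.TimeoutExpired:
            # The active step and the job both end as completed/timed_out.
            results.append(("completed", "timed_out"))
            return "timed_out", results
        conclusion = "success" if proc.returncode == 0 else "failure"
        results.append(("completed", conclusion))
        if conclusion == "failure":
            return "failure", results
    return "success", results


def rollup_is_failure(job_conclusion):
    """Treat timed_out as a terminal failure, as the added paragraph says."""
    return job_conclusion in {"failure", "timed_out"}
```

The deadline is computed once and each step receives only the time that is left, which is what makes it a whole-job limit rather than a per-step one.
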
docs/internal/actions-schema.md (modified)
@@ -382,6 +382,24 @@ Other admin surfaces are scoped to later sub-sprints: | |||
| 382 | `shithubd-cron.service`. Operators can run it manually with | 382 | `shithubd-cron.service`. Operators can run it manually with |
| 383 | `shithubd admin run-job workflow:cleanup`. | 383 | `shithubd admin run-job workflow:cleanup`. |
| 384 | 384 | ||
| 385 | +## Runner timeouts (S41g) | ||
| 386 | + | ||
| 387 | +`jobs.<key>.timeout-minutes` is enforced by `shithubd-runner` as a | ||
| 388 | +whole-job deadline. The parser stores the value in | ||
| 389 | +`workflow_jobs.timeout_minutes` with the GitHub-compatible default of | ||
| 390 | +360 minutes and a 1..4320 cap. | ||
| 391 | + | ||
| 392 | +When the deadline expires, the Docker engine explicitly kills the | ||
| 393 | +active step container, emits a terminal step update with | ||
| 394 | +`status=completed` and `conclusion=timed_out`, and the runner reports | ||
| 395 | +the job itself as `completed/timed_out`. The server rolls the parent | ||
| 396 | +workflow run up to `timed_out` when all jobs are terminal. A timed-out | ||
| 397 | +step is not masked by `continue-on-error`; the job deadline always wins. | ||
| 398 | + | ||
| 399 | +The runner API increments `shithub_actions_step_timeouts_total` the | ||
| 400 | +first time a step reaches `conclusion=timed_out`. Duplicate terminal | ||
| 401 | +step-status retries do not increment the counter again. | ||
| 402 | + | ||
| 385 | ## Retention cleanup (S41g) | 403 | ## Retention cleanup (S41g) |
| 386 | 404 | ||
| 387 | `workflow:cleanup` applies the durable Actions retention contract in | 405 | `workflow:cleanup` applies the durable Actions retention contract in |
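
Reviewer sketch for the "Runner timeouts (S41g)" section added in this hunk. It illustrates two stated rules: the 360-minute default with a 1..4320 cap, and the counter that increments only the first time a step reaches `conclusion=timed_out`. The names `normalize_timeout_minutes` and `StepTimeoutCounter` are hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption, since the diff only says "cap".

```python
DEFAULT_TIMEOUT_MINUTES = 360  # GitHub-compatible default named in the doc
MIN_TIMEOUT_MINUTES = 1
MAX_TIMEOUT_MINUTES = 4320


def normalize_timeout_minutes(raw):
    """Return the value to store for jobs.<key>.timeout-minutes.

    None means the key was omitted. Assumption: out-of-range values
    are clamped into 1..4320 instead of being rejected.
    """
    if raw is None:
        return DEFAULT_TIMEOUT_MINUTES
    return max(MIN_TIMEOUT_MINUTES, min(MAX_TIMEOUT_MINUTES, int(raw)))


class StepTimeoutCounter:
    """In-memory stand-in for shithub_actions_step_timeouts_total."""

    def __init__(self):
        self.value = 0
        self._counted = set()  # step ids that already contributed

    def record_terminal_update(self, step_id, conclusion):
        """Count a step once, even if its terminal update is retried."""
        if conclusion != "timed_out" or step_id in self._counted:
            return
        self._counted.add(step_id)
        self.value += 1
```

Keying the increment on the step id is one way to make duplicate terminal step-status retries harmless; a real server would more likely derive "first time" from the stored step row.
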
docs/internal/runbooks/actions-runner.md (modified)
@@ -167,6 +167,27 @@ Expected response includes a new `run_id`, the new `run_index`, and | |||
| 167 | `parent_run_id` equal to the source run. Confirm the new row has the | 167 | `parent_run_id` equal to the source run. Confirm the new row has the |
| 168 | same `head_sha` as the source run. | 168 | same `head_sha` as the source run. |
| 169 | 169 | ||
| 170 | +Timeout smoke: create a workflow with a short deadline and a long-running | ||
| 171 | +step: | ||
| 172 | + | ||
| 173 | +```yaml | ||
| 174 | +jobs: | ||
| 175 | + timeout_probe: | ||
| 176 | + runs-on: ubuntu-latest | ||
| 177 | + timeout-minutes: 1 | ||
| 178 | + steps: | ||
| 179 | + - run: sleep 600 | ||
| 180 | +``` | ||
| 181 | + | ||
| 182 | +Expected results: | ||
| 183 | + | ||
| 184 | +- The runner kills the active container shortly after the one-minute | ||
| 185 | + deadline. | ||
| 186 | +- The step row becomes `status=completed`, | ||
| 187 | + `conclusion=timed_out`. | ||
| 188 | +- The job row becomes `status=completed`, `conclusion=timed_out`. | ||
| 189 | +- `/metrics` increments `shithub_actions_step_timeouts_total`. | ||
| 190 | + | ||
| 170 | Replay check: reusing the log token after the log call must fail with | 191 | Replay check: reusing the log token after the log call must fail with |
| 171 | 401 because its `jti` is already present in `runner_jwt_used`. | 192 | 401 because its `jti` is already present in `runner_jwt_used`. |
| 172 | 193 | ||
@@ -184,7 +205,7 @@ Expected results: | |||
| 184 | jobs are terminal. | 205 | jobs are terminal. |
| 185 | - The PR Checks tab shows the matching check run as success. | 206 | - The PR Checks tab shows the matching check run as success. |
| 186 | - `/metrics` includes runner registration, heartbeat, JWT, job | 207 | - `/metrics` includes runner registration, heartbeat, JWT, job |
| 187 | - cancellation, log-scrub, and retention counters. | 208 | + cancellation, log-scrub, step-timeout, and retention counters. |
| 188 | 209 | ||
| 189 | ## Retention Sweep | 210 | ## Retention Sweep |
| 190 | 211 | ||
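
Reviewer sketch for the timeout smoke check added to the runbook in this commit: one way to confirm that `/metrics` incremented `shithub_actions_step_timeouts_total` is to capture the metrics text before and after the probe workflow and compare the counter. The helpers `read_counter` and `counter_increased` are hypothetical, and the parsing assumes the standard Prometheus text exposition format; fetching the endpoint is left out because its URL and authentication are not given in the diff.

```python
def read_counter(metrics_text, name):
    """Sum every sample of metric `name` in Prometheus text output.

    Returns 0.0 when the metric is absent.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # "<name>{labels} <value> [timestamp]"
            metric, _, tail = line.partition("{")
            _, _, rest = tail.rpartition("}")
        else:
            # "<name> <value> [timestamp]"
            metric, _, rest = line.partition(" ")
        fields = rest.split()
        if metric.strip() == name and fields:
            total += float(fields[0])
    return total


def counter_increased(before_text, after_text, name):
    """True when the counter grew between two /metrics snapshots."""
    return read_counter(after_text, name) > read_counter(before_text, name)
```

Comparing two snapshots avoids depending on the counter's absolute value, which on a shared host also reflects earlier timeouts.
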