tenseleyflow/shithub / 59d7d7a

docs: record actions timeout behavior

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: 59d7d7aae491f8625ee6a76429cdea8e1521f6ca
Parents: f23d1f1
Tree: 8faf4a2

3 changed files

Status  File                                      +   -
M       docs/internal/actions-runner-api.md       7   0
M       docs/internal/actions-schema.md           18  0
M       docs/internal/runbooks/actions-runner.md  22  1
docs/internal/actions-runner-api.md (modified)
@@ -131,6 +131,12 @@ Completed jobs require a valid check conclusion. The handler updates
 `workflow_jobs`, rolls up `workflow_runs`, and best-effort updates the
 matching `check_runs` row created by the trigger pipeline.
 
+`timeout-minutes` is enforced by `shithubd-runner` as a whole-job
+deadline. When it expires, the runner kills the active container,
+reports the current step as `completed/timed_out`, and reports the job
+as `completed/timed_out`. The server treats that conclusion as terminal
+failure for the workflow run rollup.
+
 When a runner reports `status:"cancelled"`, any still-open steps in the
 job are marked cancelled too. This keeps a killed job from leaving queued
 step rows that the UI would otherwise treat as live.
@@ -209,3 +215,4 @@ runner posts terminal job status `cancelled`.
 - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}`
 - `shithub_actions_log_scrub_replacements_total{location="server"}`
 - `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`
+- `shithub_actions_step_timeouts_total`
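
Note on the whole-job deadline documented in the first hunk above: the behavior can be illustrated with a minimal Python sketch. This is not the `shithubd-runner` implementation. It assumes plain subprocesses stand in for step containers, and `run_job` with its result shape is a hypothetical helper invented for illustration. The point it shows is that every step shares one deadline, and the step running when the deadline expires is killed and reported as `completed/timed_out` together with the job.

```python
import subprocess
import sys
import time


def run_job(steps, timeout_seconds):
    """Run steps in order under one shared deadline (hypothetical helper)."""
    deadline = time.monotonic() + timeout_seconds
    step_results = []
    for argv in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The deadline passed between steps: no step is active,
            # so only the job itself is reported as timed out.
            return {"steps": step_results, "job": ("completed", "timed_out")}
        try:
            # subprocess.run kills the child and waits for it when the
            # timeout expires; this stands in for killing the container.
            proc = subprocess.run(argv, timeout=remaining)
        except subprocess.TimeoutExpired:
            step_results.append(("completed", "timed_out"))
            return {"steps": step_results, "job": ("completed", "timed_out")}
        conclusion = "success" if proc.returncode == 0 else "failure"
        step_results.append(("completed", conclusion))
        if conclusion == "failure":
            # Assumption: a failing step ends the job in this sketch.
            return {"steps": step_results, "job": ("completed", "failure")}
    return {"steps": step_results, "job": ("completed", "success")}
```

Calling `run_job([[sys.executable, "-c", "pass"], [sys.executable, "-c", "import time; time.sleep(30)"]], timeout_seconds=3)` exercises the timeout path: the second step is killed once the shared budget runs out, rather than receiving a fresh per-step budget.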
docs/internal/actions-schema.md (modified)
@@ -382,6 +382,24 @@ Other admin surfaces are scoped to later sub-sprints:
   `shithubd-cron.service`. Operators can run it manually with
   `shithubd admin run-job workflow:cleanup`.
 
+## Runner timeouts (S41g)
+
+`jobs.<key>.timeout-minutes` is enforced by `shithubd-runner` as a
+whole-job deadline. The parser stores the value in
+`workflow_jobs.timeout_minutes` with the GitHub-compatible default of
+360 minutes and a 1..4320 cap.
+
+When the deadline expires, the Docker engine explicitly kills the
+active step container, emits a terminal step update with
+`status=completed` and `conclusion=timed_out`, and the runner reports
+the job itself as `completed/timed_out`. The server rolls the parent
+workflow run up to `timed_out` when all jobs are terminal. A timed-out
+step is not masked by `continue-on-error`; the job deadline always wins.
+
+The runner API increments `shithub_actions_step_timeouts_total` the
+first time a step reaches `conclusion=timed_out`. Duplicate terminal
+step-status retries do not increment the counter again.
+
 ## Retention cleanup (S41g)
 
 `workflow:cleanup` applies the durable Actions retention contract in
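
Note on the "Runner timeouts" section added in this hunk: two of its rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code. `normalize_timeout_minutes` and `StepTimeoutCounter` are hypothetical names. The diff states the 360-minute default and the 1..4320 range, but not whether out-of-range values are clamped or rejected, so the clamping here is an assumption. The counter shows one way duplicate terminal updates can be made idempotent, by remembering the last conclusion seen per step.

```python
DEFAULT_TIMEOUT_MINUTES = 360  # default stated in the schema notes
MIN_TIMEOUT_MINUTES = 1        # lower bound stated in the schema notes
MAX_TIMEOUT_MINUTES = 4320     # upper bound stated in the schema notes


def normalize_timeout_minutes(value=None):
    """Return the value to store; clamping (not rejecting) is an assumption."""
    if value is None:
        return DEFAULT_TIMEOUT_MINUTES
    return max(MIN_TIMEOUT_MINUTES, min(int(value), MAX_TIMEOUT_MINUTES))


class StepTimeoutCounter:
    """In-memory stand-in for shithub_actions_step_timeouts_total."""

    def __init__(self):
        self.total = 0
        self._last_conclusion = {}  # step id -> last reported conclusion

    def record_step_update(self, step_id, status, conclusion):
        previous = self._last_conclusion.get(step_id)
        self._last_conclusion[step_id] = conclusion
        first_timeout = (
            status == "completed"
            and conclusion == "timed_out"
            and previous != "timed_out"
        )
        if first_timeout:
            # Only the first transition into timed_out counts;
            # a retried terminal update leaves the total unchanged.
            self.total += 1
        return first_timeout
```

A retried `record_step_update(7, "completed", "timed_out")` call returns `False` the second time, which mirrors the documented rule that duplicate terminal step-status retries do not increment the counter.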
docs/internal/runbooks/actions-runner.md (modified)
@@ -167,6 +167,27 @@ Expected response includes a new `run_id`, the new `run_index`, and
 `parent_run_id` equal to the source run. Confirm the new row has the
 same `head_sha` as the source run.
 
+Timeout smoke: create a workflow with a short deadline and a long-running
+step:
+
+```yaml
+jobs:
+  timeout_probe:
+    runs-on: ubuntu-latest
+    timeout-minutes: 1
+    steps:
+      - run: sleep 600
+```
+
+Expected results:
+
+- The runner kills the active container shortly after the one-minute
+  deadline.
+- The step row becomes `status=completed`,
+  `conclusion=timed_out`.
+- The job row becomes `status=completed`, `conclusion=timed_out`.
+- `/metrics` increments `shithub_actions_step_timeouts_total`.
+
 Replay check: reusing the log token after the log call must fail with
 401 because its `jti` is already present in `runner_jwt_used`.
 
@@ -184,7 +205,7 @@ Expected results:
   jobs are terminal.
 - The PR Checks tab shows the matching check run as success.
 - `/metrics` includes runner registration, heartbeat, JWT, job
-  cancellation, log-scrub, and retention counters.
+  cancellation, log-scrub, step-timeout, and retention counters.
 
 ## Retention Sweep
 