tenseleyflow/shithub / 686e419

docs/actions: document concurrency group runtime

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: 686e41954a5313928d18d3b8a56433d961e0b8f8
Parents: 8523373
Tree: a6badb5

2 changed files

Status  File                                  +   -
M       docs/internal/actions-runner-api.md   1   0
M       docs/internal/actions-schema.md       33  1
docs/internal/actions-runner-api.md (modified)
@@ -213,6 +213,7 @@ runner posts terminal job status `cancelled`.
 - `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}`
 - `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}`
 - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}`
+- `shithub_actions_concurrency_queued_total`
 - `shithub_actions_log_scrub_replacements_total{location="server"}`
 - `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`
 - `shithub_actions_step_timeouts_total`
docs/internal/actions-schema.md (modified)
@@ -46,7 +46,9 @@ later schema diff:
   S41g's race between a cancel request and a state transition.
 - **`workflow_runs.concurrency_group`** — the concurrency-slot key,
   resolved at trigger time from the workflow's `concurrency.group:`
-  expression. S41g's slot manager keys off this column.
+  expression. S41g's slot manager keys off this column and runner
+  claim blocks younger runs while an older same-group run still has a
+  queued/running job without `cancel_requested=true`.
 - **`workflow_runs.parent_run_id`** — for re-runs. The new run
   references the original; the UI shows a "re-ran from #N" link.
 - **`workflow_jobs.runner_id`** — FK added in 0046 (after the
@@ -378,10 +380,40 @@ Other admin surfaces are scoped to later sub-sprints:
   UI re-run completed/cancelled runs. Re-runs read the workflow YAML
   from the original run's `head_sha`, create a fresh queued
   `workflow_runs` row, and set `parent_run_id` to the source run.
+- S41g: workflow-level `concurrency.group` is resolved at enqueue time
+  against the trigger context (`shithub.ref`, `shithub.sha`, and
+  `shithub.event.*`). With `cancel-in-progress: true`, enqueue requests
+  cancellation for older active runs in the same group. Without it,
+  runner claim leaves the younger run queued until the older run no
+  longer has uncancelled queued/running jobs.
 - S41g: `workflow:cleanup` is a daily retention worker enqueued by
   `shithubd-cron.service`. Operators can run it manually with
   `shithubd admin run-job workflow:cleanup`.
 
+## Workflow concurrency (S41g)
+
+`concurrency.group` is a workflow-level slot key. The parser stores the
+raw value, and `internal/actions/concurrency` evaluates `${{ ... }}`
+fragments when the run is enqueued. The trigger-time context deliberately
+does not include secrets; event-derived values may be tainted but are
+safe here because the value is only used as a database key.
+
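Reviewer's sketch (not part of the commit): the enqueue-time resolution described in the added paragraph could look roughly like the following. This is not the `internal/actions/concurrency` implementation. The helper name `resolve_group`, the flat dotted-key context, and the choice to resolve unknown names to an empty string are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for `${{ name }}` fragments; the real evaluator may
# accept a richer expression grammar.
_FRAGMENT = re.compile(r"\$\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def resolve_group(raw: str, context: dict[str, str]) -> str:
    """Replace each ${{ name }} fragment with its value from the trigger context."""

    def lookup(match: re.Match) -> str:
        # Assumption: names missing from the context (for example secrets,
        # which are deliberately excluded) resolve to an empty string.
        return context.get(match.group(1), "")

    return _FRAGMENT.sub(lookup, raw)


# Trigger context limited to the names the documentation mentions.
trigger_context = {
    "shithub.ref": "refs/heads/main",
    "shithub.sha": "686e419",
}

group = resolve_group("deploy-${{ shithub.ref }}", trigger_context)
```

Because the resolved string is only used as a lookup key, the sketch does no escaping of event-derived values; a real implementation would still pass it to the database as a bound parameter.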
+When a run enters a non-empty group:
+
+- `cancel-in-progress: false` leaves the new run queued behind older
+  same-repo, same-group runs while those older runs still have
+  queued/running jobs with `cancel_requested=false`.
+- `cancel-in-progress: true` requests cancellation on those older jobs.
+  Queued jobs become terminal immediately; running jobs keep running
+  with `cancel_requested=true` so the runner can kill the active
+  container. Once every active older job is cancel-requested, the group
+  is released for the newer run.
+
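Reviewer's sketch (not part of the commit): the two `cancel-in-progress` behaviours listed above can be modelled in memory as follows. The `Job` and `Run` shapes, the `enqueue` function, and the `"cancelled"` status string are assumptions for illustration; the real state lives in `workflow_runs` and `workflow_jobs` rows.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    status: str                    # e.g. "queued", "running", "cancelled"
    cancel_requested: bool = False


@dataclass
class Run:
    run_id: int
    repo_id: int
    group: str                     # "" means the run has no concurrency group
    jobs: list[Job] = field(default_factory=list)


def enqueue(runs: list[Run], new_run: Run, cancel_in_progress: bool) -> None:
    """Append new_run; with cancel-in-progress, cancel older same-group work."""
    if new_run.group and cancel_in_progress:
        for run in runs:
            same_slot = (run.repo_id == new_run.repo_id
                         and run.group == new_run.group)
            if not same_slot:
                continue
            for job in run.jobs:
                if job.status == "queued":
                    # Queued jobs become terminal immediately.
                    job.status = "cancelled"
                elif job.status == "running":
                    # Running jobs keep running until the runner sees the flag.
                    job.cancel_requested = True
    # Without cancel-in-progress nothing changes here; the new run simply
    # waits, and the claim path decides when it may start.
    runs.append(new_run)


old = Run(1, repo_id=7, group="deploy", jobs=[Job("queued"), Job("running")])
runs = [old]
enqueue(runs, Run(2, repo_id=7, group="deploy", jobs=[Job("queued")]),
        cancel_in_progress=True)
```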
+The runner claim query enforces the queueing rule, not the web handler
+or UI. This keeps heartbeat races honest: multiple runners can poll at
+the same time, but only jobs whose dependency and concurrency blockers
+are clear can be claimed.
+
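Reviewer's sketch (not part of the commit): the concurrency half of the claim-time check can be expressed as a predicate over runs. In the real system this is a condition inside the runner claim query; the function names, the in-memory types, and the assumption that a lower `run_id` means an older run are illustrative only. Dependency blockers are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    status: str
    cancel_requested: bool = False


@dataclass
class Run:
    run_id: int                    # assumption: lower id means older run
    repo_id: int
    group: str                     # "" means no concurrency group
    jobs: list[Job] = field(default_factory=list)


def holds_group(run: Run) -> bool:
    """True while the run has a queued/running job that is not cancel-requested."""
    return any(
        job.status in ("queued", "running") and not job.cancel_requested
        for job in run.jobs
    )


def concurrency_clear(candidate: Run, runs: list[Run]) -> bool:
    """True if no older same-repo, same-group run still holds the group."""
    if not candidate.group:
        return True
    return not any(
        other.run_id < candidate.run_id
        and other.repo_id == candidate.repo_id
        and other.group == candidate.group
        and holds_group(other)
        for other in runs
    )


older = Run(1, repo_id=7, group="deploy", jobs=[Job("running")])
newer = Run(2, repo_id=7, group="deploy", jobs=[Job("queued")])

blocked = not concurrency_clear(newer, [older, newer])   # older run holds the slot
older.jobs[0].cancel_requested = True                    # cancellation requested
released = concurrency_clear(newer, [older, newer])      # slot is free again
```

Evaluating the rule at claim time, inside the same query that hands out the job, is what lets several runners poll concurrently without two of them starting jobs from the same group.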
 ## Runner timeouts (S41g)
 
 `jobs.<key>.timeout-minutes` is enforced by `shithubd-runner` as a