
# Actions runner API

The runner-facing HTTP surface lives in `internal/web/handlers/api/runners.go`. It is mounted under `/api/v1` in the CSRF-exempt API group, but it does not use PAT auth. Runners authenticate first with a long-lived registration token and then with short-lived per-job JWTs.

## Auth model

Operators register a runner with:

```sh
shithubd admin runner register --name runner-1 --labels self-hosted,linux,ubuntu-latest
```

The command inserts `workflow_runners`, stores only a SHA-256 hash in `runner_tokens`, and prints the 32-byte hex token once.
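The register step above could be sketched as follows. `newRunnerToken` is a hypothetical helper, not the actual implementation; the point is that only the SHA-256 hash would be persisted, while the plaintext is printed once and never stored.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newRunnerToken returns the plaintext token (printed once to the
// operator) and the SHA-256 hash that would be persisted in
// runner_tokens. The plaintext never touches the database.
func newRunnerToken() (plaintext, hash string, err error) {
	raw := make([]byte, 32) // 32 random bytes -> 64 hex chars
	if _, err := rand.Read(raw); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(raw)
	sum := sha256.Sum256([]byte(plaintext))
	return plaintext, hex.EncodeToString(sum[:]), nil
}

func main() {
	tok, hash, err := newRunnerToken()
	if err != nil {
		panic(err)
	}
	fmt.Printf("token (shown once): %s\nstored hash: %s\n", tok, hash)
}
```

Later heartbeats would hash the presented bearer token the same way and compare against the stored hash, so a database leak does not expose usable registration tokens.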

`POST /api/v1/runners/heartbeat` accepts:

```http
Authorization: Bearer <registration-token>
```

When a queued job matches the runner labels and capacity is available, the response includes a job payload and a 15-minute job JWT. That JWT has claims:

```json
{"sub":"runner:<id>","job_id":1,"run_id":1,"repo_id":1,"exp":0,"jti":"..."}
```

The signing key is derived from `auth.totp_key_b64` with HKDF label `actions-runner-jwt-v1`; the raw TOTP/secretbox key is not used directly for JWT signing.
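A minimal sketch of that derivation, implementing HKDF-SHA256 (RFC 5869) extract-then-expand with only the standard library. The config key and label mirror the text; the empty salt, output length, and `jwtSigningKey` helper are assumptions for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// hkdfSHA256 derives length bytes from the input key material using
// the RFC 5869 extract-then-expand construction with a default
// (all-zero) salt.
func hkdfSHA256(ikm, info []byte, length int) []byte {
	// Extract: PRK = HMAC(salt, IKM); absent salt is HashLen zeros.
	ext := hmac.New(sha256.New, make([]byte, sha256.Size))
	ext.Write(ikm)
	prk := ext.Sum(nil)

	// Expand: T(n) = HMAC(PRK, T(n-1) || info || n).
	var out, t []byte
	for counter := byte(1); len(out) < length; counter++ {
		h := hmac.New(sha256.New, prk)
		h.Write(t)
		h.Write(info)
		h.Write([]byte{counter})
		t = h.Sum(nil)
		out = append(out, t...)
	}
	return out[:length]
}

// jwtSigningKey turns the configured auth.totp_key_b64 value into a
// JWT signing key. The HKDF label keeps this key domain-separated
// from the raw TOTP/secretbox key.
func jwtSigningKey(totpKeyB64 string) ([]byte, error) {
	ikm, err := base64.StdEncoding.DecodeString(totpKeyB64)
	if err != nil {
		return nil, err
	}
	return hkdfSHA256(ikm, []byte("actions-runner-jwt-v1"), 32), nil
}

func main() {
	key, err := jwtSigningKey(base64.StdEncoding.EncodeToString(make([]byte, 32)))
	if err != nil {
		panic(err)
	}
	fmt.Printf("derived %d-byte signing key\n", len(key))
}
```

Because the label is mixed into the expand step, rotating the label (e.g. to `actions-runner-jwt-v2`) would invalidate all outstanding job JWTs without touching the underlying TOTP key.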

Job JWTs are single-use. Every job endpoint verifies the signature and expiry, checks that the path job belongs to the claimed runner/run, and then inserts `jti` into `runner_jwt_used`. A replay returns 401. To support multi-step runner flows, successful in-flight job endpoints return `next_token` and `next_token_expires_at`.

Consumed JWT rows are retained for 30 days after token expiry, then pruned by the daily `workflow:cleanup` worker. This keeps the replay-gate audit trail available for recent jobs without letting the table grow unbounded.

`shithubd-runner` consumes the same token chain: it claims with the registration token, marks the job `running` with the first job JWT, then uses each returned `next_token` serially for log chunks, step-status updates, cancel checks, artifact upload requests, and finally the terminal job-status update. Reusing any consumed job JWT is a replay and must fail with 401.

## Endpoints

`POST /api/v1/runners/heartbeat`

Request body:

```json
{"labels":["ubuntu-latest","linux"],"capacity":1}
```

Returns 204 when no matching job is claimable. Returns 200 with `token`, `expires_at`, and `job` when a job is claimed. Capacity is enforced server-side by counting current `workflow_jobs.status = 'running'` rows for the runner while holding a row lock on the runner. The job payload includes resolved `secrets` and `mask_values`; repo secrets shadow org secrets with the same name. The server also stores an encrypted claim-time copy of the mask values on `workflow_job_secret_masks` so later log uploads are scrubbed against the secrets that were actually handed to the runner, even if an operator rotates or deletes a secret mid-job.
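The shadowing rule for resolved secrets amounts to a last-write-wins merge. A sketch, with hypothetical map shapes (the real payload resolution also handles encryption and masking):

```go
package main

import "fmt"

// resolveSecrets merges org- and repo-level secrets into the payload
// handed to the runner; a repo secret shadows an org secret with the
// same name.
func resolveSecrets(org, repo map[string]string) map[string]string {
	out := make(map[string]string, len(org)+len(repo))
	for k, v := range org {
		out[k] = v
	}
	for k, v := range repo { // repo wins on name collisions
		out[k] = v
	}
	return out
}

func main() {
	got := resolveSecrets(
		map[string]string{"TOKEN": "org-value", "ORG_ONLY": "x"},
		map[string]string{"TOKEN": "repo-value"},
	)
	fmt.Println(got["TOKEN"], got["ORG_ONLY"]) // repo-value x
}
```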

`POST /api/v1/jobs/{id}/logs`

Auth: job JWT. Body:

```json
{"seq":0,"chunk":"aGVsbG8K","step_id":123}
```

`step_id` is optional for the S41c curl smoke path; when omitted, the first step in the job receives the chunk. Chunks are base64-decoded, capped at 512 KiB raw, and appended to `workflow_step_log_chunks`. Duplicate `(step_id, seq)` inserts are accepted as idempotent retries. Before append, the API re-scrubs exact secret values from the job's claim-time mask snapshot. It also reprocesses any possible secret prefix carried at the end of the prior chunk, so a runner cannot leak a secret by splitting it across two log calls.
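The cross-chunk scrubbing can be sketched with a carry buffer: hold back the last `maxSecretLen-1` bytes of each scrubbed chunk, because they might be the start of a secret whose remainder arrives in the next call. The `scrubber` type and `***` replacement are illustrative assumptions, not the server's actual scrub implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// scrubber replaces exact secret values and holds back a tail that
// could be the prefix of a secret split across two chunks.
type scrubber struct {
	secrets []string
	carry   string // unemitted tail of the previous chunk
	maxLen  int    // length of the longest secret
}

func newScrubber(secrets []string) *scrubber {
	s := &scrubber{secrets: secrets}
	for _, sec := range secrets {
		if len(sec) > s.maxLen {
			s.maxLen = len(sec)
		}
	}
	return s
}

// write scrubs one chunk and returns the bytes now safe to append.
func (s *scrubber) write(chunk string) string {
	buf := s.carry + chunk
	for _, sec := range s.secrets {
		buf = strings.ReplaceAll(buf, sec, "***")
	}
	// Hold back up to maxLen-1 trailing bytes: they may be the start
	// of a secret completed by the next chunk.
	hold := s.maxLen - 1
	if hold < 0 {
		hold = 0
	}
	if hold > len(buf) {
		hold = len(buf)
	}
	s.carry = buf[len(buf)-hold:]
	return buf[:len(buf)-hold]
}

// flush emits whatever is still held back at end of stream.
func (s *scrubber) flush() string {
	out := s.carry
	s.carry = ""
	return out
}

func main() {
	sc := newScrubber([]string{"hunter2"})
	// The secret is deliberately split across two log calls.
	out := sc.write("token=hun") + sc.write("ter2 done") + sc.flush()
	fmt.Println(out) // token=*** done
}
```

A production scrubber would also handle overlapping secrets and re-matching inside the held-back region, but the carry idea is the core of why splitting a secret across two `POST /logs` calls does not leak it.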

`POST /api/v1/jobs/{id}/steps/{step_id}/status`

Auth: job JWT. Body:

```json
{"status":"completed","conclusion":"success"}
```

Valid transitions are `queued|running -> running|completed|cancelled|skipped` with idempotent repeats of the target terminal state. Completed and skipped steps require a valid check conclusion; cancelled defaults to `cancelled` when omitted. The endpoint always returns a `next_token` because a completed step is not the end of the job.
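One plausible reading of that transition table as code; the function name and string states are illustrative:

```go
package main

import "fmt"

// validStepTransition mirrors queued|running ->
// running|completed|cancelled|skipped, plus idempotent repeats of the
// target terminal state.
func validStepTransition(from, to string) bool {
	switch from {
	case "queued", "running":
		switch to {
		case "running", "completed", "cancelled", "skipped":
			return true
		}
		return false
	case "completed", "cancelled", "skipped":
		return from == to // idempotent retry of the same terminal state
	}
	return false
}

func main() {
	fmt.Println(validStepTransition("running", "completed"))   // true
	fmt.Println(validStepTransition("completed", "running"))   // false: terminal never reopens
	fmt.Println(validStepTransition("completed", "completed")) // true: idempotent retry
}
```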

When object storage is configured, terminal step updates enqueue `workflow:finalize_step`. The worker concatenates `workflow_step_log_chunks` in sequence order, uploads the log to `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`, stores that key and byte count on `workflow_steps`, then deletes the SQL chunks.

The repository Actions UI reads logs from the same two-stage storage model. While chunks remain in SQL, a step log page concatenates them in sequence order and renders a static snapshot. After finalization, the page fetches the log from object storage via `workflow_steps.log_object_key` and offers a short-lived signed download URL. Live tailing is intentionally separate and lands in the S41f SSE slice.
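The finalization step reduces to "sort by seq, concatenate, build the object key". A sketch with an illustrative `chunk` row shape:

```go
package main

import (
	"fmt"
	"sort"
)

// chunk mirrors one workflow_step_log_chunks row (field names illustrative).
type chunk struct {
	Seq  int
	Data []byte
}

// finalizeLog concatenates a step's chunks in sequence order and
// returns the object key and body that would be uploaded before the
// SQL rows are deleted.
func finalizeLog(runID, jobID, stepID int64, chunks []chunk) (key string, body []byte) {
	sort.Slice(chunks, func(i, j int) bool { return chunks[i].Seq < chunks[j].Seq })
	for _, c := range chunks {
		body = append(body, c.Data...)
	}
	key = fmt.Sprintf("actions/runs/%d/jobs/%d/steps/%d.log", runID, jobID, stepID)
	return key, body
}

func main() {
	key, body := finalizeLog(7, 9, 3, []chunk{
		{Seq: 1, Data: []byte("world\n")},
		{Seq: 0, Data: []byte("hello ")},
	})
	fmt.Printf("%s (%d bytes)\n%s", key, len(body), body)
}
```

Sorting by `seq` rather than insertion order matters because idempotent retries can land chunks out of order.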

`POST /api/v1/jobs/{id}/status`

Auth: job JWT. Body:

```json
{"status":"completed","conclusion":"success"}
```

Valid transitions are `queued|running -> running|completed|cancelled`. Completed jobs require a valid check conclusion. The handler updates `workflow_jobs`, rolls up `workflow_runs`, and best-effort updates the matching `check_runs` row created by the trigger pipeline.

`timeout-minutes` is enforced by `shithubd-runner` as a whole-job deadline. When it expires, the runner kills the active container, reports the current step as `completed/timed_out`, and reports the job as `completed/timed_out`. The server treats that conclusion as terminal failure for the workflow run rollup.

When a runner reports `status:"cancelled"`, any still-open steps in the job are marked cancelled too. This keeps a killed job from leaving queued step rows that the UI would otherwise treat as live.

S41d PR2 runner execution supports containerized `run:` steps with per-step log streaming and server-side log finalization. `uses:` aliases such as `actions/checkout@v4` and artifact upload/download remain reserved for the later S41d slices that add checkout metadata and artifact transfer.

`POST /api/v1/jobs/{id}/artifacts/upload`

Auth: job JWT. Body:

```json
{"name":"test-results.tgz","size_bytes":12345}
```

Creates a `workflow_artifacts` row and returns a pre-signed S3 PUT URL. The object key is `actions/runs/<run_id>/artifacts/<name>`.

`POST /api/v1/jobs/{id}/cancel`

Auth: PAT with `repo:write`, and the actor must have write permission on the repository that owns the job's workflow run. Browser UI forms use CSRF-protected repo routes that call the same lifecycle orchestrator.

Queued jobs are made terminal immediately:

- `workflow_jobs.status = cancelled`
- `workflow_jobs.conclusion = cancelled`
- `workflow_jobs.cancel_requested = true`
- open steps for that job are marked cancelled

Running jobs keep `status = running` and get `cancel_requested = true`. The runner sees this through `cancel-check`, kills the active container, then reports terminal `cancelled`.
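The queued-versus-running split can be sketched as below; the `job` struct fields are illustrative stand-ins for the `workflow_jobs` columns.

```go
package main

import "fmt"

// job mirrors the workflow_jobs fields touched by cancellation
// (field names illustrative).
type job struct {
	Status          string
	Conclusion      string
	CancelRequested bool
}

// cancel applies the documented semantics: queued jobs become
// terminal immediately; running jobs only get the flag and stay
// running until the runner observes it via cancel-check.
func cancel(j *job) {
	j.CancelRequested = true
	if j.Status == "queued" {
		j.Status = "cancelled"
		j.Conclusion = "cancelled"
		// open steps for the job would be marked cancelled here too
	}
}

func main() {
	q := &job{Status: "queued"}
	r := &job{Status: "running"}
	cancel(q)
	cancel(r)
	fmt.Printf("queued  -> %s/%s flag=%v\n", q.Status, q.Conclusion, q.CancelRequested)
	fmt.Printf("running -> %s flag=%v\n", r.Status, r.CancelRequested)
}
```

Leaving running jobs in `running` keeps the state machine honest: only the runner, after killing the container, may report the terminal `cancelled` transition.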

`POST /api/v1/runs/{id}/rerun`

Auth: PAT with `repo:write`, and the actor must have write permission on the repository that owns the workflow run. Browser UI forms use CSRF-protected repo routes for the same operation.

Only terminal workflow runs are rerunnable. A re-run reads the original workflow file from the source run's `head_sha`, not from the current branch tip, then enqueues a new `workflow_runs` row with:

- the same `repo_id`, `workflow_file`, `head_sha`, `head_ref`, event, and event payload
- `actor_user_id` set to the user requesting the re-run
- `parent_run_id` set to the source run
- a fresh `trigger_event_id` in the `rerun:<source_run_id>:<random>` namespace

`POST /api/v1/jobs/{id}/cancel-check`

Auth: job JWT. Returns:

```json
{"cancelled":false,"next_token":"..."}
```

The boolean mirrors `workflow_jobs.cancel_requested`. `shithubd-runner` polls this endpoint during job execution, serializing it through the same single-use JWT chain as logs and status updates. On `cancelled: true`, the runner runs `docker kill <active-container>` against the Docker engine and posts terminal job status `cancelled`.
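The polling loop can be sketched with the HTTP call and the container kill injected as functions; `pollCancel`, its parameters, and the fake server below are illustrative, not the real runner code.

```go
package main

import (
	"fmt"
	"time"
)

// cancelCheckResponse mirrors the endpoint's JSON body.
type cancelCheckResponse struct {
	Cancelled bool   `json:"cancelled"`
	NextToken string `json:"next_token"`
}

// pollCancel drives the cancel-check loop: each call consumes the
// current job JWT and re-arms with next_token, exactly like the other
// job endpoints. check stands in for the HTTP POST; kill stands in
// for `docker kill <active-container>`.
func pollCancel(token string, interval time.Duration, rounds int,
	check func(token string) cancelCheckResponse, kill func()) {
	for i := 0; i < rounds; i++ {
		resp := check(token)
		token = resp.NextToken // single-use chain: never reuse the old JWT
		if resp.Cancelled {
			kill()
			return
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	pollCancel("jwt-0", 0, 10, func(tok string) cancelCheckResponse {
		calls++
		return cancelCheckResponse{
			Cancelled: calls == 3, // pretend the server flags cancel on poll 3
			NextToken: fmt.Sprintf("jwt-%d", calls),
		}
	}, func() { fmt.Println("docker kill <active-container>") })
	fmt.Println("polled", calls, "times")
}
```

Threading the token through the loop is the important part: a poll that reused its previous JWT would trip the replay gate and get a 401 instead of a cancel answer.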

## Metrics

- `shithub_actions_runner_registrations_total`
- `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}`
- `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}`
- `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}`
- `shithub_actions_concurrency_queued_total`
- `shithub_actions_log_scrub_replacements_total{location="server"}`
- `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`
- `shithub_actions_step_timeouts_total`