
# Actions runner API

The runner-facing HTTP surface lives in internal/web/handlers/api/runners.go. It is mounted under /api/v1 in the CSRF-exempt API group, but it does not use PAT auth. Runners authenticate first with a long-lived registration token and then with short-lived per-job JWTs.

## Auth model

Operators register a runner with:

```sh
shithubd admin runner register \
  --name runner-1 \
  --labels self-hosted,linux,ubuntu-latest,x64 \
  --capacity 1 \
  --output json
```

The command inserts `workflow_runners`, stores only a SHA-256 hash in `runner_tokens`, and returns the raw 32-byte hex token once. `--expires-in` is optional and should only be used when the deployment rotates the runner token before it expires, because the runner uses that same token for heartbeat authentication.
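
Hash-only token storage can be sketched as follows; the function names are illustrative, not the actual shithubd code:

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newRunnerToken returns the raw hex token (shown once to the operator)
// and the SHA-256 digest, which is all the server persists.
func newRunnerToken() (raw, storedHash string) {
	buf := make([]byte, 32)
	rand.Read(buf)
	raw = hex.EncodeToString(buf)
	sum := sha256.Sum256([]byte(raw))
	return raw, hex.EncodeToString(sum[:])
}

// verifyRunnerToken re-hashes the presented token and compares it to the
// stored digest in constant time.
func verifyRunnerToken(presented, storedHash string) bool {
	sum := sha256.Sum256([]byte(presented))
	got := []byte(hex.EncodeToString(sum[:]))
	return subtle.ConstantTimeCompare(got, []byte(storedHash)) == 1
}

func main() {
	raw, stored := newRunnerToken()
	fmt.Println(verifyRunnerToken(raw, stored))     // true
	fmt.Println(verifyRunnerToken("wrong", stored)) // false
}
```

A leaked database row therefore yields only the digest; the runner's config file is the only place the raw token exists.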

Operators can drain, undrain, rotate, and hard-revoke runners with `shithubd admin runner drain`, `undrain`, `rotate-token`, and `revoke`. Drained runners keep heartbeating and may finish already claimed jobs, but heartbeat claims return 204 until the runner is undrained. Hard revocation sets the runner offline, records `revoked_at`, revokes all registration tokens, and makes job API JWTs minted for that runner invalid. This is the token-compromise boundary: a host with an old config file cannot claim new jobs or update already claimed jobs after revocation lands in Postgres.

`POST /api/v1/runners/heartbeat` accepts:

```http
Authorization: Bearer <registration-token>
```

When a queued job matches the runner labels and capacity is available, the response includes a job payload and a 15-minute job JWT. That JWT has claims:

```json
{"sub":"runner:<id>","purpose":"api","job_id":1,"run_id":1,"repo_id":1,"exp":0,"jti":"..."}
```

The signing key is derived from `auth.totp_key_b64` with HKDF label `actions-runner-jwt-v1`; the raw TOTP/secretbox key is not used directly for JWT signing.

API-purpose job JWTs are single-use. Every job endpoint verifies the signature and expiry, checks that the path job belongs to the claimed runner/run, and then inserts `jti` into `runner_jwt_used`. A replay returns 401. To support multi-step runner flows, successful in-flight job endpoints return `next_token` and `next_token_expires_at`.
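
In Postgres the replay gate is an insert-once write keyed on `jti`; the sketch below models it with an in-memory set, which is illustrative rather than the real storage layer:

```go
package main

import (
	"fmt"
	"sync"
)

// jwtReplayGate mimics the runner_jwt_used table: the first consumer of a
// jti wins, and every later attempt is a replay.
type jwtReplayGate struct {
	mu   sync.Mutex
	used map[string]bool
}

func newJWTReplayGate() *jwtReplayGate {
	return &jwtReplayGate{used: make(map[string]bool)}
}

// consume returns true exactly once per jti, the in-memory analogue of an
// INSERT that fails on a unique-key conflict.
func (g *jwtReplayGate) consume(jti string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.used[jti] {
		return false // replay: the endpoint answers 401
	}
	g.used[jti] = true
	return true
}

func main() {
	gate := newJWTReplayGate()
	fmt.Println(gate.consume("abc")) // true: first use
	fmt.Println(gate.consume("abc")) // false: replay
}
```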

Consumed JWT rows are retained for 30 days after token expiry, then pruned by the daily `workflow:cleanup` worker. This keeps the replay gate audit trail available for recent jobs without letting the table grow unbounded.

`shithubd-runner` consumes the same token chain: it claims with the registration token, marks the job `running` with the first API-purpose job JWT, then uses each returned `next_token` serially for log chunks, step-status updates, cancel checks, artifact upload requests, and finally the terminal job-status update. Reusing any consumed API-purpose job JWT is a replay and must fail with 401.
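
Client-side, the chain is a simple fold: each call spends the current token and yields the next. A sketch with a stubbed transport; the endpoint names and the `callJobAPI` helper are hypothetical:

```go
package main

import "fmt"

// response mirrors the next_token contract of in-flight job endpoints.
type response struct {
	nextToken string
}

// callJobAPI stands in for one authenticated job endpoint call; a real
// client would POST with "Authorization: Bearer <token>".
func callJobAPI(endpoint, token string, seq int) response {
	return response{nextToken: fmt.Sprintf("jwt-%d", seq+1)}
}

// runChain spends the initial claim JWT, then threads each returned
// next_token through the remaining calls in order: no token is used twice.
func runChain(first string, endpoints []string) string {
	token := first
	for i, ep := range endpoints {
		token = callJobAPI(ep, token, i).nextToken
	}
	return token // would authenticate the terminal status update
}

func main() {
	final := runChain("jwt-0", []string{"status:running", "logs", "steps/1/status", "status:completed"})
	fmt.Println(final) // jwt-4
}
```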

The heartbeat claim also returns `job.checkout_url` and `job.checkout_token` for `actions/checkout@v4`. The checkout token is a separate JWT with `purpose:"checkout"` and the same runner/job/run/repo scope. It is intentionally reusable while the job is `running`, because Git smart HTTP performs multiple Basic-authenticated requests during one checkout. The git HTTP handler accepts it only for `git-upload-pack`, only for the claimed repository, and only while the database still shows that the claimed runner is running the job. It is never accepted for pushes or runner API endpoints.

## Endpoints

`POST /api/v1/runners/heartbeat`

Request body:

```json
{
  "labels": ["self-hosted", "linux", "ubuntu-latest", "x64"],
  "capacity": 1,
  "host_name": "runner-host-1",
  "version": "v0.1.0"
}
```

Returns 204 when no matching job is claimable. Returns 200 with `token`, `expires_at`, and `job` when a job is claimed. Capacity is enforced server-side by counting current `workflow_jobs.status = 'running'` rows for the runner while holding a row lock on the runner. Claiming also enforces the effective Actions policy for the repository: disabled repos, approval-pending runs, per-repo concurrent job caps, and per-owner/org concurrent job caps are not dispatchable. Approval simply sets `workflow_runs.approved_by_user_id`; the next heartbeat can claim the same queued jobs, so no duplicate run is created.

`host_name` and `version` are optional runner metadata. The server stores trimmed values up to 255 bytes for pool diagnostics and preserves the previous values when old runners omit them.

The job payload includes `checkout_url`, `checkout_token`, resolved `secrets`, and `mask_values`; repo secrets shadow org secrets with the same name. The server also stores an encrypted claim-time copy of the mask values on `workflow_job_secret_masks` so later log uploads are scrubbed against the secrets that were actually handed to the runner, even if an operator rotates or deletes a secret mid-job.
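
The shadowing rule for resolved secrets is a two-layer merge in which the repo layer wins; a minimal sketch, with the map shapes assumed:

```go
package main

import "fmt"

// resolveSecrets merges org and repo secrets; a repo secret shadows an
// org secret of the same name, matching the heartbeat payload semantics.
func resolveSecrets(org, repo map[string]string) map[string]string {
	out := make(map[string]string, len(org)+len(repo))
	for k, v := range org {
		out[k] = v
	}
	for k, v := range repo { // repo layer is applied last, so it wins
		out[k] = v
	}
	return out
}

func main() {
	org := map[string]string{"TOKEN": "org-value", "REGION": "eu"}
	repo := map[string]string{"TOKEN": "repo-value"}
	fmt.Println(resolveSecrets(org, repo)["TOKEN"]) // repo-value
}
```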

Pull request runs receive no org or repo secrets in v1, even after a maintainer approves dispatch. This is intentionally stricter than the approval gate until environments/protected deployment secrets exist.

`POST /api/v1/jobs/{id}/logs`

Auth: job JWT. Body:

```json
{"seq":0,"chunk":"aGVsbG8K","step_id":123}
```

`step_id` is optional for the S41c curl smoke path; when omitted the first step in the job receives the chunk. Chunks are base64-decoded, capped at 512 KiB raw, and appended to `workflow_step_log_chunks`. Duplicate `(step_id, seq)` inserts are accepted as idempotent retries. Before append, the API re-scrubs exact secret values from the job's claim-time mask snapshot. It also reprocesses any possible secret prefix carried at the end of the prior chunk, so a runner cannot leak a secret by splitting it across two log calls.

`POST /api/v1/jobs/{id}/steps/{step_id}/status`

Auth: job JWT. Body:

```json
{"status":"completed","conclusion":"success"}
```

Valid transitions are `queued|running -> running|completed|cancelled|skipped` with idempotent repeats of the target terminal state. Completed and skipped steps require a valid check conclusion; cancelled defaults to `cancelled` when omitted. The endpoint always returns a `next_token` because a completed step is not the end of the job.
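
The transition rule reduces to a small validator; this sketch covers the states only, not the conclusion checks:

```go
package main

import "fmt"

var terminal = map[string]bool{"completed": true, "cancelled": true, "skipped": true}

// stepTransitionOK implements queued|running -> running|completed|cancelled|skipped,
// plus idempotent repeats of the same terminal state.
func stepTransitionOK(from, to string) bool {
	if terminal[from] {
		return from == to // idempotent retry of the terminal report
	}
	switch from {
	case "queued", "running":
		return to == "running" || terminal[to]
	}
	return false
}

func main() {
	fmt.Println(stepTransitionOK("queued", "running"))      // true
	fmt.Println(stepTransitionOK("completed", "completed")) // true: idempotent repeat
	fmt.Println(stepTransitionOK("completed", "running"))   // false: terminal is final
}
```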

When object storage is configured, terminal step updates enqueue `workflow:finalize_step`. The worker concatenates `workflow_step_log_chunks` in sequence order, uploads the log to `actions/runs/<run_id>/jobs/<job_id>/steps/<step_id>.log`, stores that key and byte count on `workflow_steps`, then deletes the SQL chunks.

The repository Actions UI reads logs from the same two-stage storage model. While chunks remain in SQL, a step log page concatenates them in sequence order and renders a static snapshot. After finalization, the page fetches the object named by `workflow_steps.log_object_key` from object storage and offers a short-lived signed download URL. Live tailing is intentionally separate and lands in the S41f SSE slice.

`POST /api/v1/jobs/{id}/status`

Auth: job JWT. Body:

```json
{"status":"completed","conclusion":"success"}
```

Valid transitions are `queued|running -> running|completed|cancelled`. Completed jobs require a valid check conclusion. The handler updates `workflow_jobs`, rolls up `workflow_runs`, and best-effort updates the matching `check_runs` row created by the trigger pipeline.

`timeout-minutes` is enforced by `shithubd-runner` as a whole-job deadline. When it expires, the runner kills the active container, reports the current step as `completed/timed_out`, and reports the job as `completed/timed_out`. The server treats that conclusion as terminal failure for the workflow run rollup.

When a runner reports `status:"cancelled"`, any still-open steps in the job are marked cancelled too. This keeps a killed job from leaving queued step rows that the UI would otherwise treat as live.

Runner execution supports host-side `actions/checkout@v4` followed by containerized `run:` steps with per-step log streaming and server-side log finalization. Artifact upload/download aliases remain reserved until the artifact transfer path lands.

`POST /api/v1/jobs/{id}/artifacts/upload`

Auth: job JWT. Body:

```json
{"name":"test-results.tgz","size_bytes":12345}
```

Creates a `workflow_artifacts` row and returns a pre-signed S3 PUT URL. The object key is `actions/runs/<run_id>/artifacts/<name>`.

`POST /api/v1/jobs/{id}/cancel`

Auth: PAT with `repo:write`, and the actor must have write permission on the repository that owns the job's workflow run. Browser UI forms use CSRF-protected repo routes that call the same lifecycle orchestrator.

Queued jobs are made terminal immediately:

- `workflow_jobs.status = cancelled`
- `workflow_jobs.conclusion = cancelled`
- `workflow_jobs.cancel_requested = true`
- open steps for that job are marked cancelled

Running jobs keep `status = running` and get `cancel_requested = true`. The runner sees this through `cancel-check`, kills the active container, then reports terminal `cancelled`.

`POST /api/v1/runs/{id}/rerun`

Auth: PAT with `repo:write`, and the actor must have write permission on the repository that owns the workflow run. Browser UI forms use CSRF-protected repo routes for the same operation.

Only terminal workflow runs are rerunnable. A re-run reads the original workflow file from the source run's `head_sha`, not from the current branch tip, then enqueues a new `workflow_runs` row with:

- the same `repo_id`, `workflow_file`, `head_sha`, `head_ref`, event, and event payload
- `actor_user_id` set to the user requesting the re-run
- `parent_run_id` set to the source run
- a fresh `trigger_event_id` in the `rerun:<source_run_id>:<random>` namespace

`POST /api/v1/jobs/{id}/cancel-check`

Auth: job JWT. Returns:

```json
{"cancelled":false,"next_token":"..."}
```

The boolean mirrors `workflow_jobs.cancel_requested`. `shithubd-runner` polls this endpoint during job execution, serializing it through the same single-use JWT chain as logs and status updates. On `cancelled: true`, the Docker engine runs `docker kill <active-container>` and the runner posts terminal job status `cancelled`.

## Metrics

- `shithub_actions_runner_registrations_total`
- `shithub_actions_runner_heartbeats_total{result="claimed|no_job"}`
- `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}`
- `shithub_actions_queue_depth{resource="runs|jobs"}`
- `shithub_actions_active{resource="runs|jobs"}`
- `shithub_actions_runner_heartbeat_age_seconds{runner,status}`
- `shithub_actions_runner_capacity{runner,status}`
- `shithub_actions_runs_completed_total{event,conclusion}`
- `shithub_actions_run_duration_seconds{event,conclusion}`
- `shithub_actions_steps_completed_total{step_type,conclusion}`
- `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}`
- `shithub_actions_concurrency_queued_total`
- `shithub_actions_log_scrub_replacements_total{location="server"}`
- `shithub_actions_log_chunks_total{location="server"}`
- `shithub_actions_log_chunk_bytes_total{location="server"}`
- `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`
- `shithub_actions_step_timeouts_total`
- `shithub_actions_storage_objects{kind="artifacts|step_logs|hot_log_chunks"}`
- `shithub_actions_storage_bytes{kind="artifacts|step_logs|hot_log_chunks"}`