# Worker

The worker pool drains the Postgres-backed job queue introduced in S14.
One binary serves the web API, the SSH dispatcher, the hooks, and the
worker — `shithubd worker` boots a long-running process whose only job
is to dequeue and run.

## Architecture

```
post-receive ──INSERT push_event + INSERT job + NOTIFY──▶ Postgres jobs
                                                                │
                                                                ▼
                                                        shithubd worker
                                                    ┌──────────────────────┐
                                                    │ N goroutines         │
                                                    │ ClaimJob (FOR UPDATE │
                                                    │  SKIP LOCKED)        │
                                                    │  → handler           │
                                                    │  → MarkCompleted /   │
                                                    │     Reschedule /     │
                                                    │     MarkFailed       │
                                                    └──────────────────────┘
```

The pool also runs a dedicated LISTEN goroutine that holds one Postgres
connection on `LISTEN shithub_jobs`. A NOTIFY from any process (typically
the post-receive hook) wakes idle workers in <1s without polling. The
backstop poll (every 5s by default) covers dropped notifications.
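
A minimal sketch of that wiring, assuming `jackc/pgx/v5` and a `wake` channel
that idle workers select on alongside the backstop ticker; the names and loop
structure here are illustrative, not the real implementation:

```go
package worker

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

// listenLoop holds the dedicated LISTEN connection and nudges idle workers
// whenever a NOTIFY on shithub_jobs arrives.
func listenLoop(ctx context.Context, connString string, wake chan<- struct{}) {
	conn, err := pgx.Connect(ctx, connString)
	if err != nil {
		log.Printf("listen: connect: %v", err)
		return
	}
	defer conn.Close(ctx)

	if _, err := conn.Exec(ctx, "LISTEN shithub_jobs"); err != nil {
		log.Printf("listen: %v", err)
		return
	}

	for {
		// Blocks until a notification arrives or ctx is cancelled.
		if _, err := conn.WaitForNotification(ctx); err != nil {
			return // shutdown or dropped connection; the backstop poll covers the gap
		}
		select {
		case wake <- struct{}{}: // wake one idle worker
		default: // all workers are already busy
		}
	}
}

// workerLoop waits for either a wake-up or the 5s backstop tick, then claims
// and runs jobs until the queue is drained.
func workerLoop(ctx context.Context, wake <-chan struct{}) {
	tick := time.NewTicker(5 * time.Second) // backstop poll
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-wake:
		case <-tick.C:
		}
		// ... ClaimJob / run handler / mark outcome ...
	}
}
```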

## Job kinds shipped in S14

| Kind                         | Trigger                           | Idempotent on                    |
| ---------------------------- | --------------------------------- | -------------------------------- |
| `push:process`               | post-receive hook per ref         | `push_events.processed_at`       |
| `repo:size_recalc`           | enqueued by `push:process`        | overwrite-last-wins              |
| `org:github_import_discover` | org import request                | `org_github_imports.status`      |
| `org:github_import_repo`     | import discovery per GitHub repo  | `org_github_import_repos.status` |
| `jobs:purge_completed`       | future cron / manual ad-hoc       | always safe to re-run            |
| `trending:compute`           | recurring self-scheduled S42 job  | append-only snapshots            |

Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`,
add the `Kind` constant to `internal/worker/types.go`, register it in
`cmd/shithubd/worker.go`. The dispatch loop is generic — no other code
needs to change.
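
A hypothetical new kind might look roughly like this; the handler signature,
the `Register` call, and the `repo:archive` kind itself are assumptions for
illustration, since the real shapes live in `internal/worker/types.go` and
`cmd/shithubd/worker.go`:

```go
// internal/worker/jobs/repo_archive.go (illustrative sketch only)
package jobs

import (
	"context"
	"encoding/json"
)

// KindRepoArchive is a made-up kind used here as an example.
const KindRepoArchive = "repo:archive"

// HandleRepoArchive assumes the dispatch loop hands each handler its raw JSON
// payload and interprets the returned error: nil = done, plain error = retry,
// an error wrapping worker.ErrPoison = fail immediately.
func HandleRepoArchive(ctx context.Context, payload json.RawMessage) error {
	var p struct {
		RepoID int64 `json:"repo_id"`
	}
	if err := json.Unmarshal(payload, &p); err != nil {
		return err // malformed payload: wrap worker.ErrPoison here instead
	}
	// ... do the work, guarded so a re-run is a no-op ...
	return nil
}
```

Registration in `cmd/shithubd/worker.go` is then a one-liner along the lines
of `pool.Register(jobs.KindRepoArchive, jobs.HandleRepoArchive)` (an assumed
API; the real call is whatever the existing kinds use).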

## Failure handling

* **Transient errors** (handler returns a non-nil non-poison error):
  reschedule with `Backoff(attempts) ± 20% jitter`, where `Backoff` is
  `30s * 2^(attempts-1)` capped at 1h (sketched after this list).
* **Poison errors** (handler returns an error wrapping `worker.ErrPoison`):
  jump to `MarkJobFailed` immediately. Use this for malformed payloads or
  references to vanished rows.
* **Panics**: caught by `safeRun` and treated as a transient error so a
  buggy handler can't take down a worker goroutine.
* **Stuck locks**: `ClaimJob` ignores rows whose `locked_at` is older
  than 5 minutes — a worker that died mid-job releases its rows on the
  next claim cycle.
* **Max attempts**: `jobs.max_attempts` (default 5). Past the limit the
  row is stamped with `failed_at`. Failed rows surface in the S34 admin
  panel for poison-job triage.
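
The retry schedule from the first bullet could be implemented roughly as
follows; only the formula and the name `Backoff` come from this page, the
loop and jitter arithmetic are a sketch:

```go
package worker

import (
	"math/rand"
	"time"
)

// Backoff returns 30s * 2^(attempts-1), capped at 1h, with ±20% jitter so a
// burst of failures does not retry in lockstep.
func Backoff(attempts int) time.Duration {
	if attempts < 1 {
		attempts = 1
	}
	d := 30 * time.Second
	for i := 1; i < attempts && d < time.Hour; i++ {
		d *= 2
	}
	if d > time.Hour {
		d = time.Hour
	}
	jitter := time.Duration(rand.Int63n(2*int64(d)/5+1)) - d/5 // in [-20%, +20%]
	return d + jitter
}
```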

## Concurrency contract

* `FOR UPDATE SKIP LOCKED` on the inner SELECT (sketched after this list)
  means N concurrent workers on the same kind can claim N distinct rows in
  one round. Each row is claimed by at most one worker at a time across
  the cluster.
* Workers hold a row's lock only while their handler runs. The lock is
  released atomically by `MarkJobCompleted` / `RescheduleJob` /
  `MarkJobFailed`.
* Job handlers MUST be idempotent. A worker process killed mid-handler
  leaves the row locked until the 5-minute stale-lock window passes,
  at which point another worker may pick it up and re-run. Guard the
  side-effects (e.g. `processed_at IS NULL` checks).
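
For reference, the claim statement could look roughly like the following;
only `FOR UPDATE SKIP LOCKED`, `locked_at`, the 5-minute stale window, and
`push_events.processed_at` come from this page, while the rest of the schema
(`run_at`, `completed_at`, `attempts`, `payload`) is assumed:

```go
package worker

// claimJobSQL is a sketch of ClaimJob's statement: pick one runnable row,
// skipping rows other workers currently hold, and lock it by stamping locked_at.
const claimJobSQL = `
WITH next AS (
    SELECT id
      FROM jobs
     WHERE kind = ANY($1)
       AND completed_at IS NULL
       AND failed_at IS NULL
       AND run_at <= now()
       AND (locked_at IS NULL OR locked_at < now() - interval '5 minutes')
     ORDER BY run_at
     LIMIT 1
     FOR UPDATE SKIP LOCKED
)
UPDATE jobs
   SET locked_at = now(), attempts = attempts + 1
  FROM next
 WHERE jobs.id = next.id
RETURNING jobs.id, jobs.kind, jobs.payload, jobs.attempts`

// markProcessedSQL shows the kind of idempotency guard handlers rely on:
// the UPDATE is a no-op if an earlier run already set processed_at.
const markProcessedSQL = `
UPDATE push_events
   SET processed_at = now()
 WHERE id = $1
   AND processed_at IS NULL`
```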

## Metrics

All exported through the standard `/metrics` registry:

* `shithub_worker_jobs_processed_total{kind,outcome}` — outcome ∈
  `ok | retry | failed | poison`.
* `shithub_worker_job_duration_seconds{kind}` — handler latency histogram.
* `shithub_worker_in_flight{kind}` — gauge of currently-running handlers.
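
Declared with the usual `prometheus/client_golang` pattern, roughly as below;
the variable names and bucket choice are assumptions, the metric names and
labels are the ones listed above:

```go
package worker

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter per kind and outcome (ok | retry | failed | poison).
	jobsProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "shithub_worker_jobs_processed_total",
		Help: "Jobs processed by the worker pool.",
	}, []string{"kind", "outcome"})

	// Handler latency histogram per kind.
	jobDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "shithub_worker_job_duration_seconds",
		Help:    "Job handler latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"kind"})

	// Gauge of handlers currently running, per kind.
	inFlight = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "shithub_worker_in_flight",
		Help: "Handlers currently running.",
	}, []string{"kind"})
)
```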

## Operational notes

* Default pool size is 4. Override via `--workers <n>` on the CLI or
  `SHITHUB_WORKERS=<n>` in the environment.
* Graceful shutdown: SIGINT/SIGTERM cancels the root context. Workers
  stop pulling new jobs and let in-flight handlers finish (bounded by
  `JobTimeout`, default 5 minutes). The LISTEN goroutine drops its
  conn cleanly.
* The Postgres connection pool is sized to `Workers + 2` (one for
  LISTEN, one slack for enqueues during shutdown).
* `LISTEN/NOTIFY` semantics: when the hook calls `pg_notify` inside a
  transaction, the notification is *only* delivered after commit. So
  failed inserts never wake workers for nothing (see the enqueue sketch
  below).
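
A sketch of the hook-side enqueue, assuming `jackc/pgx/v5`; the table and
column names are placeholders, and the point is only the ordering: `pg_notify`
runs inside the same transaction as the INSERTs, so the notification reaches
`LISTEN shithub_jobs` strictly after commit:

```go
package hook

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

func enqueuePush(ctx context.Context, db *pgxpool.Pool, payload []byte) error {
	tx, err := db.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op after a successful Commit

	// INSERT into push_events omitted for brevity; it belongs in the same tx.
	if _, err := tx.Exec(ctx,
		`INSERT INTO jobs (kind, payload) VALUES ('push:process', $1)`, payload); err != nil {
		return err
	}
	// Queued with the transaction; listeners only see it if Commit succeeds,
	// so a failed insert never wakes the pool.
	if _, err := tx.Exec(ctx, `SELECT pg_notify('shithub_jobs', '')`); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```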