
# Worker

The worker pool drains the Postgres-backed job queue introduced in S14. One binary serves the web API, the SSH dispatcher, the hooks, and the worker — `shithubd worker` boots a long-running process whose only job is to dequeue and run.

## Architecture

```
post-receive  ──INSERT push_event + INSERT job + NOTIFY──▶  Postgres jobs
                                                                 │
                                                                 ▼
                                                          shithubd worker
                                                   ┌───────────────────────┐
                                                   │ N goroutines          │
                                                   │ ClaimJob (FOR UPDATE  │
                                                   │   SKIP LOCKED)        │
                                                   │ → handler             │
                                                   │ → MarkCompleted /     │
                                                   │   Reschedule /        │
                                                   │   MarkFailed          │
                                                   └───────────────────────┘
```

The pool also runs a dedicated LISTEN goroutine that holds one Postgres connection on `LISTEN shithub_jobs`. A NOTIFY from any process (typically the post-receive hook) wakes idle workers in <1s without polling. The backstop poll (every 5s by default) covers dropped notifications.
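A minimal sketch of that wake-up path, assuming each worker goroutine blocks on a channel fed by the LISTEN goroutine plus the backstop ticker; the names and plumbing are illustrative, not the actual worker code:

```go
package worker

import (
	"context"
	"time"
)

// waitForWork blocks until there is a reason to try ClaimJob again.
// `notify` is assumed to receive a value whenever the LISTEN goroutine
// sees a NOTIFY on shithub_jobs; the ticker is the backstop poll.
func waitForWork(ctx context.Context, notify <-chan struct{}, pollEvery time.Duration) bool {
	tick := time.NewTicker(pollEvery) // 5s by default
	defer tick.Stop()
	select {
	case <-ctx.Done():
		return false // shutting down: stop claiming
	case <-notify: // NOTIFY wakes idle workers in <1s
		return true
	case <-tick.C: // covers dropped notifications
		return true
	}
}
```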

## Job kinds shipped in S14

| Kind                   | Trigger                     | Idempotent on               |
| ---------------------- | --------------------------- | --------------------------- |
| `push:process`         | post-receive hook per ref   | `push_events.processed_at`  |
| `repo:size_recalc`     | enqueued by `push:process`  | overwrite-last-wins         |
| `jobs:purge_completed` | future cron / manual ad-hoc | always safe to re-run       |

Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`, add the `Kind` constant to `internal/worker/types.go`, register it in `cmd/shithubd/worker.go`. The dispatch loop is generic — no other code needs to change.
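A minimal sketch of those three steps for a hypothetical `repo:gc` kind. Only the file layout and the error conventions come from this page; the handler signature, payload shape, `Kind` type, and `Register` call below are assumptions:

```go
// internal/worker/jobs/repo_gc.go: hypothetical new kind.
package jobs

import (
	"context"
	"encoding/json"
	"fmt"

	"shithub/internal/worker" // assumed module path
)

// repoGCPayload is an illustrative payload for the hypothetical repo:gc kind.
type repoGCPayload struct {
	RepoID int64 `json:"repo_id"`
}

// RepoGC decodes its payload and does the work. Return values follow the
// failure-handling rules below: nil = completed, plain error = retry with
// backoff, error wrapping worker.ErrPoison = fail immediately.
func RepoGC(ctx context.Context, payload json.RawMessage) error {
	var p repoGCPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("repo:gc: %w: bad payload: %v", worker.ErrPoison, err)
	}
	// ... do the idempotent work for p.RepoID ...
	return nil
}

// In internal/worker/types.go (assumed):   const KindRepoGC Kind = "repo:gc"
// In cmd/shithubd/worker.go (assumed):     pool.Register(worker.KindRepoGC, jobs.RepoGC)
```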

## Failure handling

* **Transient errors** (handler returns a non-nil, non-poison error):
  reschedule with `Backoff(attempts) ± 20% jitter`, where `Backoff` is
  `30s * 2^(attempts-1)` capped at 1h (see the sketch after this list).
* **Poison errors** (handler returns an error wrapping `worker.ErrPoison`):
  jump to `MarkJobFailed` immediately. Use this for malformed payloads or
  references to vanished rows.
* **Panics**: caught by `safeRun` and treated as a transient error so a
  buggy handler can't take down a worker goroutine.
* **Stuck locks**: `ClaimJob` ignores a `locked_at` older than 5 minutes,
  so rows left locked by a worker that died mid-job become claimable again
  on the next claim cycle.
* **Max attempts**: `jobs.max_attempts` (default 5). Past the limit the
  row is moved to `failed_at`. Failed rows surface in the S34 admin panel
  for poison-job triage.
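A minimal sketch of that retry schedule; `Backoff` is named on this page, but the implementation below is an illustration, not the actual internal/worker code:

```go
package worker

import (
	"math/rand"
	"time"
)

// Backoff returns the delay before the next attempt. attempts is 1-based:
// attempt 1 → ~30s, attempt 2 → ~1m, attempt 3 → ~2m, ... capped at 1h.
func Backoff(attempts int) time.Duration {
	d := 30 * time.Second << (attempts - 1) // 30s * 2^(attempts-1)
	if d > time.Hour || d <= 0 {            // cap at 1h, guard against overflow
		d = time.Hour
	}
	jitter := 0.8 + 0.4*rand.Float64() // uniform in [0.8, 1.2] → ±20%
	return time.Duration(float64(d) * jitter)
}
```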

## Concurrency contract

* `FOR UPDATE SKIP LOCKED` on the inner SELECT means N concurrent workers
  on the same kind can claim N distinct rows in one round; each row is
  claimed by at most one worker at a time across the cluster (see the query
  sketch after this list).
* Workers hold a row's lock only while their handler runs. The lock is
  released atomically by `MarkJobCompleted` / `RescheduleJob` /
  `MarkJobFailed`.
* Job handlers MUST be idempotent. A worker process killed mid-handler
  leaves the row locked until the 5-minute stale-lock window passes, at
  which point another worker may pick it up and re-run. Guard the
  side effects (e.g. `processed_at IS NULL` checks).
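A sketch of the claim query shape these bullets imply. `locked_at`, `failed_at`, and `max_attempts` are named on this page; the rest of the table and column names are assumptions, not the real schema:

```go
package worker

// claimJobSQL is an assumed shape for ClaimJob: one UPDATE that locks and
// returns a single runnable row, skips rows other workers hold, and treats
// locks older than 5 minutes as stale.
const claimJobSQL = `
UPDATE jobs
   SET locked_at = now(), attempts = attempts + 1
 WHERE id = (
        SELECT id
          FROM jobs
         WHERE kind = $1
           AND completed_at IS NULL
           AND failed_at IS NULL
           AND run_at <= now()
           AND (locked_at IS NULL OR locked_at < now() - interval '5 minutes')
         ORDER BY run_at
         LIMIT 1
         FOR UPDATE SKIP LOCKED
       )
RETURNING id, kind, payload, attempts`
```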

## Metrics

All exported through the standard `/metrics` registry:

* `shithub_worker_jobs_processed_total{kind,outcome}` — outcome ∈
  `ok | retry | failed | poison`.
* `shithub_worker_job_duration_seconds{kind}` — handler latency histogram.
* `shithub_worker_in_flight{kind}` — gauge of currently-running handlers.
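The page does not say which metrics library backs `/metrics`; assuming prometheus/client_golang, the declarations might look like this (names and labels from the list above, everything else illustrative):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter by kind and outcome (ok, retry, failed, poison).
	JobsProcessed = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "shithub_worker_jobs_processed_total",
		Help: "Jobs processed, by kind and outcome.",
	}, []string{"kind", "outcome"})

	// Handler latency histogram by kind.
	JobDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "shithub_worker_job_duration_seconds",
		Help: "Handler latency by kind.",
	}, []string{"kind"})

	// Gauge of currently running handlers by kind.
	InFlight = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "shithub_worker_in_flight",
		Help: "Currently running handlers by kind.",
	}, []string{"kind"})
)

func init() {
	prometheus.MustRegister(JobsProcessed, JobDuration, InFlight)
}
```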

## Operational notes

* Default pool size is 4. Override via `--workers <n>` on the CLI or
  `SHITHUB_WORKERS=<n>` in the environment.
* Graceful shutdown: SIGINT/SIGTERM cancels the root context. Workers stop
  pulling new jobs and let in-flight handlers finish (bounded by
  `JobTimeout`, default 5 minutes); a wiring sketch follows this list. The
  LISTEN goroutine drops its conn cleanly.
* The Postgres connection pool is sized to `Workers + 2` (one connection
  for LISTEN, one spare for enqueues during shutdown).
* `LISTEN`/`NOTIFY` semantics: when the hook calls `pg_notify` inside a
  transaction, the notification is *only* delivered after commit, so
  rolled-back inserts never wake workers for nothing.
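A minimal sketch of that shutdown wiring, assuming the pool exposes a run loop that returns once its context is cancelled and its in-flight handlers have finished; the function and parameter names are illustrative:

```go
package worker

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runUntilSignal runs the pool until SIGINT/SIGTERM, then gives in-flight
// handlers a bounded window (JobTimeout, default 5m) to finish.
func runUntilSignal(run func(context.Context) error, jobTimeout time.Duration) error {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	done := make(chan error, 1)
	go func() { done <- run(ctx) }()

	select {
	case err := <-done: // pool exited on its own
		return err
	case <-ctx.Done(): // signal received: workers stop claiming new jobs
		select {
		case err := <-done: // in-flight handlers finished cleanly
			return err
		case <-time.After(jobTimeout):
			return context.DeadlineExceeded
		}
	}
}
```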