
# Worker

The worker pool drains the Postgres-backed job queue introduced in S14. One binary serves the web API, the SSH dispatcher, the hooks, and the worker — `shithubd worker` boots a long-running process whose only job is to dequeue and run.

## Architecture

```
post-receive  ──INSERT push_event + INSERT job + NOTIFY──▶  Postgres jobs
                                                                 │
                                                                 ▼
                                                          shithubd worker
                                                   ┌───────────────────────┐
                                                   │ N goroutines          │
                                                   │ ClaimJob (FOR UPDATE  │
                                                   │   SKIP LOCKED)        │
                                                   │ → handler             │
                                                   │ → MarkCompleted /     │
                                                   │   Reschedule /        │
                                                   │   MarkFailed          │
                                                   └───────────────────────┘
```

The pool also runs a dedicated LISTEN goroutine that holds one Postgres connection on `LISTEN shithub_jobs`. A NOTIFY from any process (typically the post-receive hook) wakes idle workers in <1s without polling. The backstop poll (every 5s by default) covers dropped notifications.
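A minimal sketch of that wake-up path, assuming each worker goroutine blocks on a channel fed by the LISTEN goroutine plus the backstop ticker; the names and plumbing are illustrative, not the actual worker code:

```go
package worker

import (
	"context"
	"time"
)

// waitForWork blocks until there is a reason to try ClaimJob again.
// `notify` is assumed to receive a value whenever the LISTEN goroutine
// sees a NOTIFY on shithub_jobs; the ticker is the backstop poll.
func waitForWork(ctx context.Context, notify <-chan struct{}, pollEvery time.Duration) bool {
	tick := time.NewTicker(pollEvery) // 5s by default
	defer tick.Stop()
	select {
	case <-ctx.Done():
		return false // shutting down: stop claiming
	case <-notify: // NOTIFY wakes idle workers in <1s
		return true
	case <-tick.C: // covers dropped notifications
		return true
	}
}
```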

## Job kinds shipped in S14

| Kind                   | Trigger                     | Idempotent on               |
| ---------------------- | --------------------------- | --------------------------- |
| `push:process`         | post-receive hook per ref   | `push_events.processed_at`  |
| `repo:size_recalc`     | enqueued by `push:process`  | overwrite-last-wins         |
| `jobs:purge_completed` | future cron / manual ad-hoc | always safe to re-run       |

Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`, add the `Kind` constant to `internal/worker/types.go`, register it in `cmd/shithubd/worker.go`. The dispatch loop is generic — no other code needs to change.
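A minimal sketch of those three steps for a hypothetical `repo:gc` kind. Only the file layout and the error conventions come from this page; the handler signature, payload shape, `Kind` type, and `Register` call below are assumptions:

```go
// internal/worker/jobs/repo_gc.go: hypothetical new kind.
package jobs

import (
	"context"
	"encoding/json"
	"fmt"

	"shithub/internal/worker" // assumed module path
)

// repoGCPayload is an illustrative payload for the hypothetical repo:gc kind.
type repoGCPayload struct {
	RepoID int64 `json:"repo_id"`
}

// RepoGC decodes its payload and does the work. Return values follow the
// failure-handling rules below: nil = completed, plain error = retry with
// backoff, error wrapping worker.ErrPoison = fail immediately.
func RepoGC(ctx context.Context, payload json.RawMessage) error {
	var p repoGCPayload
	if err := json.Unmarshal(payload, &p); err != nil {
		return fmt.Errorf("repo:gc: %w: bad payload: %v", worker.ErrPoison, err)
	}
	// ... do the idempotent work for p.RepoID ...
	return nil
}

// In internal/worker/types.go (assumed):   const KindRepoGC Kind = "repo:gc"
// In cmd/shithubd/worker.go (assumed):     pool.Register(worker.KindRepoGC, jobs.RepoGC)
```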

## Failure handling

* **Transient errors** (handler returns a non-nil, non-poison error):
  reschedule with `Backoff(attempts) ± 20% jitter`, where `Backoff` is
  `30s * 2^(attempts-1)` capped at 1h (see the sketch after this list).
* **Poison errors** (handler returns an error wrapping `worker.ErrPoison`):
  jump to `MarkJobFailed` immediately. Use this for malformed payloads or
  references to vanished rows.
* **Panics**: caught by `safeRun` and treated as a transient error so a
  buggy handler can't take down a worker goroutine.
* **Stuck locks**: `ClaimJob` ignores a `locked_at` older than 5 minutes,
  so rows left locked by a worker that died mid-job become claimable again
  on the next claim cycle.
* **Max attempts**: `jobs.max_attempts` (default 5). Past the limit the
  row is moved to `failed_at`. Failed rows surface in the S34 admin panel
  for poison-job triage.
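A minimal sketch of that retry schedule; `Backoff` is named on this page, but the implementation below is an illustration, not the actual internal/worker code:

```go
package worker

import (
	"math/rand"
	"time"
)

// Backoff returns the delay before the next attempt. attempts is 1-based:
// attempt 1 → ~30s, attempt 2 → ~1m, attempt 3 → ~2m, ... capped at 1h.
func Backoff(attempts int) time.Duration {
	d := 30 * time.Second << (attempts - 1) // 30s * 2^(attempts-1)
	if d > time.Hour || d <= 0 {            // cap at 1h, guard against overflow
		d = time.Hour
	}
	jitter := 0.8 + 0.4*rand.Float64() // uniform in [0.8, 1.2] → ±20%
	return time.Duration(float64(d) * jitter)
}
```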

## Concurrency contract

* `FOR UPDATE SKIP LOCKED` on the inner SELECT means N concurrent workers
  on the same kind can claim N distinct rows in one round; each row is
  claimed by at most one worker at a time across the cluster (see the query
  sketch after this list).
* Workers hold a row's lock only while their handler runs. The lock is
  released atomically by `MarkJobCompleted` / `RescheduleJob` /
  `MarkJobFailed`.
* Job handlers MUST be idempotent. A worker process killed mid-handler
  leaves the row locked until the 5-minute stale-lock window passes, at
  which point another worker may pick it up and re-run. Guard the
  side effects (e.g. `processed_at IS NULL` checks).
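A sketch of the claim query shape these bullets imply. `locked_at`, `failed_at`, and `max_attempts` are named on this page; the rest of the table and column names are assumptions, not the real schema:

```go
package worker

// claimJobSQL is an assumed shape for ClaimJob: one UPDATE that locks and
// returns a single runnable row, skips rows other workers hold, and treats
// locks older than 5 minutes as stale.
const claimJobSQL = `
UPDATE jobs
   SET locked_at = now(), attempts = attempts + 1
 WHERE id = (
        SELECT id
          FROM jobs
         WHERE kind = $1
           AND completed_at IS NULL
           AND failed_at IS NULL
           AND run_at <= now()
           AND (locked_at IS NULL OR locked_at < now() - interval '5 minutes')
         ORDER BY run_at
         LIMIT 1
         FOR UPDATE SKIP LOCKED
       )
RETURNING id, kind, payload, attempts`
```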

## Metrics

All exported through the standard `/metrics` registry:

* `shithub_worker_jobs_processed_total{kind,outcome}` — outcome ∈
  `ok | retry | failed | poison`.
* `shithub_worker_job_duration_seconds{kind}` — handler latency histogram.
* `shithub_worker_in_flight{kind}` — gauge of currently-running handlers.
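The page does not say which metrics library backs `/metrics`; assuming prometheus/client_golang, the declarations might look like this (names and labels from the list above, everything else illustrative):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter by kind and outcome (ok, retry, failed, poison).
	JobsProcessed = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "shithub_worker_jobs_processed_total",
		Help: "Jobs processed, by kind and outcome.",
	}, []string{"kind", "outcome"})

	// Handler latency histogram by kind.
	JobDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "shithub_worker_job_duration_seconds",
		Help: "Handler latency by kind.",
	}, []string{"kind"})

	// Gauge of currently running handlers by kind.
	InFlight = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "shithub_worker_in_flight",
		Help: "Currently running handlers by kind.",
	}, []string{"kind"})
)

func init() {
	prometheus.MustRegister(JobsProcessed, JobDuration, InFlight)
}
```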

## Operational notes

* Default pool size is 4. Override via `--workers <n>` on the CLI or
  `SHITHUB_WORKERS=<n>` in the environment.
* Graceful shutdown: SIGINT/SIGTERM cancels the root context. Workers stop
  pulling new jobs and let in-flight handlers finish (bounded by
  `JobTimeout`, default 5 minutes); a wiring sketch follows this list. The
  LISTEN goroutine drops its conn cleanly.
* The Postgres connection pool is sized to `Workers + 2` (one connection
  for LISTEN, one spare for enqueues during shutdown).
* `LISTEN`/`NOTIFY` semantics: when the hook calls `pg_notify` inside a
  transaction, the notification is *only* delivered after commit, so
  rolled-back inserts never wake workers for nothing.
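A minimal sketch of that shutdown wiring, assuming the pool exposes a run loop that returns once its context is cancelled and its in-flight handlers have finished; the function and parameter names are illustrative:

```go
package worker

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runUntilSignal runs the pool until SIGINT/SIGTERM, then gives in-flight
// handlers a bounded window (JobTimeout, default 5m) to finish.
func runUntilSignal(run func(context.Context) error, jobTimeout time.Duration) error {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	done := make(chan error, 1)
	go func() { done <- run(ctx) }()

	select {
	case err := <-done: // pool exited on its own
		return err
	case <-ctx.Done(): // signal received: workers stop claiming new jobs
		select {
		case err := <-done: // in-flight handlers finished cleanly
			return err
		case <-time.After(jobTimeout):
			return context.DeadlineExceeded
		}
	}
}
```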