# Worker

The worker pool drains the Postgres-backed job queue introduced in S14.
One binary serves the web API, the SSH dispatcher, the hooks, and the
worker — `shithubd worker` boots a long-running process whose only job
is to dequeue and run.

## Architecture

```
post-receive ──INSERT push_event + INSERT job + NOTIFY──▶ Postgres jobs
                                                                │
                                                                ▼
                                                        shithubd worker
                                                    ┌──────────────────────┐
                                                    │ N goroutines         │
                                                    │ ClaimJob (FOR UPDATE │
                                                    │  SKIP LOCKED)        │
                                                    │  → handler           │
                                                    │  → MarkCompleted /   │
                                                    │     Reschedule /     │
                                                    │     MarkFailed       │
                                                    └──────────────────────┘
```

The pool also runs a dedicated LISTEN goroutine that holds one Postgres
connection on `LISTEN shithub_jobs`. A NOTIFY from any process (typically
the post-receive hook) wakes idle workers in <1s without polling. The
backstop poll (every 5s by default) covers dropped notifications.
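
A minimal sketch of that wiring, assuming `jackc/pgx/v5` and a `wake` channel
that idle workers select on alongside the backstop ticker; the names and loop
structure here are illustrative, not the real implementation:

```go
package worker

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

// listenLoop holds the dedicated LISTEN connection and nudges idle workers
// whenever a NOTIFY on shithub_jobs arrives.
func listenLoop(ctx context.Context, connString string, wake chan<- struct{}) {
	conn, err := pgx.Connect(ctx, connString)
	if err != nil {
		log.Printf("listen: connect: %v", err)
		return
	}
	defer conn.Close(ctx)

	if _, err := conn.Exec(ctx, "LISTEN shithub_jobs"); err != nil {
		log.Printf("listen: %v", err)
		return
	}

	for {
		// Blocks until a notification arrives or ctx is cancelled.
		if _, err := conn.WaitForNotification(ctx); err != nil {
			return // shutdown or dropped connection; the backstop poll covers the gap
		}
		select {
		case wake <- struct{}{}: // wake one idle worker
		default: // all workers are already busy
		}
	}
}

// workerLoop waits for either a wake-up or the 5s backstop tick, then claims
// and runs jobs until the queue is drained.
func workerLoop(ctx context.Context, wake <-chan struct{}) {
	tick := time.NewTicker(5 * time.Second) // backstop poll
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-wake:
		case <-tick.C:
		}
		// ... ClaimJob / run handler / mark outcome ...
	}
}
```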

## Job kinds shipped in S14

| Kind                         | Trigger                           | Idempotent on                    |
| ---------------------------- | --------------------------------- | -------------------------------- |
| `push:process`               | post-receive hook per ref         | `push_events.processed_at`       |
| `repo:size_recalc`           | enqueued by `push:process`        | overwrite-last-wins              |
| `org:github_import_discover` | org import request                | `org_github_imports.status`      |
| `org:github_import_repo`     | import discovery per GitHub repo  | `org_github_import_repos.status` |
| `jobs:purge_completed`       | future cron / manual ad-hoc       | always safe to re-run            |
| `trending:compute`           | recurring self-scheduled S42 job  | append-only snapshots            |

Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`,
add the `Kind` constant to `internal/worker/types.go`, register it in
`cmd/shithubd/worker.go`. The dispatch loop is generic — no other code
needs to change.
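
A hypothetical new kind might look roughly like this; the handler signature,
the `Register` call, and the `repo:archive` kind itself are assumptions for
illustration, since the real shapes live in `internal/worker/types.go` and
`cmd/shithubd/worker.go`:

```go
// internal/worker/jobs/repo_archive.go (illustrative sketch only)
package jobs

import (
	"context"
	"encoding/json"
)

// KindRepoArchive is a made-up kind used here as an example.
const KindRepoArchive = "repo:archive"

// HandleRepoArchive assumes the dispatch loop hands each handler its raw JSON
// payload and interprets the returned error: nil = done, plain error = retry,
// an error wrapping worker.ErrPoison = fail immediately.
func HandleRepoArchive(ctx context.Context, payload json.RawMessage) error {
	var p struct {
		RepoID int64 `json:"repo_id"`
	}
	if err := json.Unmarshal(payload, &p); err != nil {
		return err // malformed payload: wrap worker.ErrPoison here instead
	}
	// ... do the work, guarded so a re-run is a no-op ...
	return nil
}
```

Registration in `cmd/shithubd/worker.go` is then a one-liner along the lines
of `pool.Register(jobs.KindRepoArchive, jobs.HandleRepoArchive)` (an assumed
API; the real call is whatever the existing kinds use).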

## Failure handling

* **Transient errors** (handler returns a non-nil non-poison error):
  reschedule with `Backoff(attempts) ± 20% jitter`, where `Backoff` is
  `30s * 2^(attempts-1)` capped at 1h (sketched after this list).
* **Poison errors** (handler returns an error wrapping `worker.ErrPoison`):
  jump to `MarkJobFailed` immediately. Use this for malformed payloads or
  references to vanished rows.
* **Panics**: caught by `safeRun` and treated as a transient error so a
  buggy handler can't take down a worker goroutine.
* **Stuck locks**: `ClaimJob` ignores rows whose `locked_at` is older
  than 5 minutes — a worker that died mid-job releases its rows on the
  next claim cycle.
* **Max attempts**: `jobs.max_attempts` (default 5). Past the limit the
  row is stamped with `failed_at`. Failed rows surface in the S34 admin
  panel for poison-job triage.
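
The retry schedule from the first bullet could be implemented roughly as
follows; only the formula and the name `Backoff` come from this page, the
loop and jitter arithmetic are a sketch:

```go
package worker

import (
	"math/rand"
	"time"
)

// Backoff returns 30s * 2^(attempts-1), capped at 1h, with ±20% jitter so a
// burst of failures does not retry in lockstep.
func Backoff(attempts int) time.Duration {
	if attempts < 1 {
		attempts = 1
	}
	d := 30 * time.Second
	for i := 1; i < attempts && d < time.Hour; i++ {
		d *= 2
	}
	if d > time.Hour {
		d = time.Hour
	}
	jitter := time.Duration(rand.Int63n(2*int64(d)/5+1)) - d/5 // in [-20%, +20%]
	return d + jitter
}
```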

## Concurrency contract

* `FOR UPDATE SKIP LOCKED` on the inner SELECT (sketched after this list)
  means N concurrent workers on the same kind can claim N distinct rows in
  one round. Each row is claimed by at most one worker at a time across
  the cluster.
* Workers hold a row's lock only while their handler runs. The lock is
  released atomically by `MarkJobCompleted` / `RescheduleJob` /
  `MarkJobFailed`.
* Job handlers MUST be idempotent. A worker process killed mid-handler
  leaves the row locked until the 5-minute stale-lock window passes,
  at which point another worker may pick it up and re-run. Guard the
  side-effects (e.g. `processed_at IS NULL` checks).
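
For reference, the claim statement could look roughly like the following;
only `FOR UPDATE SKIP LOCKED`, `locked_at`, the 5-minute stale window, and
`push_events.processed_at` come from this page, while the rest of the schema
(`run_at`, `completed_at`, `attempts`, `payload`) is assumed:

```go
package worker

// claimJobSQL is a sketch of ClaimJob's statement: pick one runnable row,
// skipping rows other workers currently hold, and lock it by stamping locked_at.
const claimJobSQL = `
WITH next AS (
    SELECT id
      FROM jobs
     WHERE kind = ANY($1)
       AND completed_at IS NULL
       AND failed_at IS NULL
       AND run_at <= now()
       AND (locked_at IS NULL OR locked_at < now() - interval '5 minutes')
     ORDER BY run_at
     LIMIT 1
     FOR UPDATE SKIP LOCKED
)
UPDATE jobs
   SET locked_at = now(), attempts = attempts + 1
  FROM next
 WHERE jobs.id = next.id
RETURNING jobs.id, jobs.kind, jobs.payload, jobs.attempts`

// markProcessedSQL shows the kind of idempotency guard handlers rely on:
// the UPDATE is a no-op if an earlier run already set processed_at.
const markProcessedSQL = `
UPDATE push_events
   SET processed_at = now()
 WHERE id = $1
   AND processed_at IS NULL`
```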

## Metrics

All exported through the standard `/metrics` registry:

* `shithub_worker_jobs_processed_total{kind,outcome}` — outcome ∈
  `ok | retry | failed | poison`.
* `shithub_worker_job_duration_seconds{kind}` — handler latency histogram.
* `shithub_worker_in_flight{kind}` — gauge of currently-running handlers.
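
Declared with the usual `prometheus/client_golang` pattern, roughly as below;
the variable names and bucket choice are assumptions, the metric names and
labels are the ones listed above:

```go
package worker

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter per kind and outcome (ok | retry | failed | poison).
	jobsProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "shithub_worker_jobs_processed_total",
		Help: "Jobs processed by the worker pool.",
	}, []string{"kind", "outcome"})

	// Handler latency histogram per kind.
	jobDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "shithub_worker_job_duration_seconds",
		Help:    "Job handler latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"kind"})

	// Gauge of handlers currently running, per kind.
	inFlight = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "shithub_worker_in_flight",
		Help: "Handlers currently running.",
	}, []string{"kind"})
)
```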

## Operational notes

* Default pool size is 4. Override via `--workers <n>` on the CLI or
  `SHITHUB_WORKERS=<n>` in the environment.
* Graceful shutdown: SIGINT/SIGTERM cancels the root context. Workers
  stop pulling new jobs and let in-flight handlers finish (bounded by
  `JobTimeout`, default 5 minutes). The LISTEN goroutine drops its
  conn cleanly.
* The Postgres connection pool is sized to `Workers + 2` (one for
  LISTEN, one slack for enqueues during shutdown).
* `LISTEN/NOTIFY` semantics: when the hook calls `pg_notify` inside a
  transaction, the notification is *only* delivered after commit. So
  failed inserts never wake workers for nothing (see the enqueue sketch
  below).
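
A sketch of the hook-side enqueue, assuming `jackc/pgx/v5`; the table and
column names are placeholders, and the point is only the ordering: `pg_notify`
runs inside the same transaction as the INSERTs, so the notification reaches
`LISTEN shithub_jobs` strictly after commit:

```go
package hook

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

func enqueuePush(ctx context.Context, db *pgxpool.Pool, payload []byte) error {
	tx, err := db.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op after a successful Commit

	// INSERT into push_events omitted for brevity; it belongs in the same tx.
	if _, err := tx.Exec(ctx,
		`INSERT INTO jobs (kind, payload) VALUES ('push:process', $1)`, payload); err != nil {
		return err
	}
	// Queued with the transaction; listeners only see it if Commit succeeds,
	// so a failed insert never wakes the pool.
	if _, err := tx.Exec(ctx, `SELECT pg_notify('shithub_jobs', '')`); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```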