# Drain workers for maintenance

The worker shuts down gracefully: on SIGTERM it finishes the in-flight job and exits; queued jobs stay queued and a future worker picks them up. This makes "drain for maintenance" nearly a no-op, but here is the procedure when you want to be precise.

## Stop accepting new work but finish what's running

```sh
ssh worker-host
sudo systemctl stop shithubd-worker
```

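The unit uses `Type=notify` with a 30-second `TimeoutStopSec`, so systemd waits up to 30 seconds for the steps below before forcibly terminating the process (its default behavior once the stop timeout expires). On SIGTERM the worker:

  1. Stops issuing new `FOR UPDATE SKIP LOCKED` leasing queries.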
  2. Finishes the currently-leased job (if any).
  3. Exits 0.
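
The three steps above amount to a loop that checks a stop flag between jobs. The following is a minimal shell sketch of that pattern, not the worker's actual code: `lease_next_job` and `run_job` are hypothetical stand-ins for the leasing query and the job handler.

```sh
# Sketch only: lease_next_job and run_job are hypothetical stand-ins.
stop_requested=0
trap 'stop_requested=1' TERM   # SIGTERM only sets a flag; the current job is not interrupted

run_worker_loop() {
  while [ "$stop_requested" -eq 0 ]; do
    job="$(lease_next_job)"    # step 1: never reached again once the flag is set
    if [ -n "$job" ]; then
      run_job "$job"           # step 2: the leased job runs to completion
    else
      sleep 1                  # queue empty; poll again shortly
    fi
  done
  return 0                     # step 3: clean exit
}
```

Because the flag is only read at the top of the loop, a SIGTERM that arrives mid-job delays the exit until that job returns, which is why a job running past the stop timeout is at risk.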

Queued jobs stay in the `jobs` table with `state='queued'`. A restarted worker (or a different worker on a different host) will pick them up.

## Confirm drain

```sh
psql -d shithub -c "
  SELECT state, count(*) FROM jobs GROUP BY state;"
```

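You're looking for no `processing` row in the output: `GROUP BY` omits states that have zero jobs, so a drained queue shows no such row rather than a count of 0. If a `processing` row still appears a few seconds after `systemctl stop`, find the rogue worker:

```sh
psql -d shithub -c "
  SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
```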

`leased_by` is a `<host>:<pid>` string; track down the host and process and stop it.
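
If you would rather poll than re-run the query by hand, the check can be scripted. This is a sketch under assumptions: it presumes `psql` can reach the `shithub` database from where you run it, and the function names (`count_processing`, `wait_for_drain`, `lease_host`, `lease_pid`) are made up for illustration.

```sh
# Sketch only: assumes psql can reach the shithub database from this host.

# Print the number of jobs currently in state 'processing'.
count_processing() {
  psql -d shithub -At -c "SELECT count(*) FROM jobs WHERE state='processing';"
}

# Poll until nothing is processing or the timeout (seconds, default 60) elapses.
# Returns 0 when drained, 1 on timeout, 2 if the query itself fails.
wait_for_drain() {
  timeout_s="${1:-60}"
  elapsed=0
  while :; do
    n="$(count_processing)" || return 2
    [ "$n" -eq 0 ] && return 0
    [ "$elapsed" -ge "$timeout_s" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Split a <host>:<pid> lease string into its two parts.
lease_host() { printf '%s\n' "${1%:*}"; }
lease_pid()  { printf '%s\n' "${1##*:}"; }
```

For example, `wait_for_drain 30 || echo "still processing"` waits up to 30 seconds. For a leftover lease, run `ps -p "$(lease_pid "$lease")"` on the host printed by `lease_host "$lease"` to see whether the process is still alive.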

## Drain only one job kind

If you need to do schema work that affects only one job kind (for example, the search-reindex job), you can pause that kind without stopping the worker:

```sql
UPDATE jobs
   SET state='paused'
 WHERE state='queued'
   AND kind='search.reindex';
```

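The worker's leasing query filters out `paused`. The `UPDATE` only touches rows that are already queued, so a job of that kind enqueued afterwards is not paused unless you run it again. Resume by setting the paused rows back to `queued`.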

This is **not** the standard maintenance path; reach for a full drain unless you have a reason to keep other kinds running.
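
If you do take this path, pausing and resuming can be wrapped as a pair of helpers. This is a sketch, assuming `psql` can reach the `shithub` database; `set_kind_state`, `pause_kind`, and `resume_kind` are hypothetical names, not part of the project.

```sh
# Sketch only: assumes psql can reach the shithub database from this host.
# Values are passed as psql variables so they are quoted safely. psql only
# interpolates variables in SQL read from stdin or a file, not in -c strings.
set_kind_state() {
  from_state="$1"; to_state="$2"; kind="$3"
  psql -d shithub -v from_state="$from_state" -v to_state="$to_state" -v kind="$kind" <<'SQL'
UPDATE jobs
   SET state = :'to_state'
 WHERE state = :'from_state'
   AND kind  = :'kind';
SQL
}

pause_kind()  { set_kind_state queued paused "$1"; }
resume_kind() { set_kind_state paused queued "$1"; }
```

`pause_kind search.reindex` before the schema work and `resume_kind search.reindex` afterwards mirrors the statement above. Note that `resume_kind` flips every paused row of that kind, including any that were paused for a different reason.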

## Hostile case: a stuck job

If a worker holds a lease on a job and the worker process is gone (for example, the host crashed mid-execution), the lease times out and another worker re-leases the same job. The lease timeout is set per kind in the job handler registry; the default is 5 minutes.
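
To see which leases are already past their timeout, a query along these lines can help. It is a sketch that assumes `leased_at` is a timestamp column and that the kind in question uses the default 5-minute timeout; `stale_leases` is a hypothetical helper name.

```sh
# Sketch only: assumes leased_at is a timestamp column. Pass the lease timeout
# in minutes for kinds that override the 5-minute default.
stale_leases() {
  minutes="${1:-5}"
  psql -d shithub -v minutes="$minutes" <<'SQL'
SELECT id, kind, leased_by, leased_at, now() - leased_at AS lease_age
  FROM jobs
 WHERE state = 'processing'
   AND leased_at < now() - (:'minutes'::int * interval '1 minute')
 ORDER BY leased_at;
SQL
}
```

Rows returned here should be re-leased on their own; a row that lingers is a hint that no worker is polling.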

To force re-lease sooner (e.g., the job is small and you want it picked up now):

```sql
UPDATE jobs
   SET leased_at = NULL, leased_by = NULL, state = 'queued'
 WHERE id = <job-id>;
```
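
A slightly more defensive variant only resets the row if it is still `processing` and prints what it changed, so a job that finished in the meantime is not put back on the queue. This is a sketch, with `release_job` as a hypothetical helper name. Confirm the original worker process is really gone first; otherwise the job may run twice.

```sh
# Sketch only: assumes psql can reach the shithub database from this host.
# An empty result means the job was no longer in state 'processing'.
release_job() {
  job_id="$1"
  psql -d shithub -v job_id="$job_id" <<'SQL'
UPDATE jobs
   SET leased_at = NULL, leased_by = NULL, state = 'queued'
 WHERE id = :'job_id'
   AND state = 'processing'
RETURNING id, kind, state;
SQL
}
```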

Record your manual intervention in the incident channel so there is an audit trail.

## Resume

```sh
sudo systemctl start shithubd-worker
```

The worker starts polling immediately; queued jobs begin running within seconds.
