# Drain workers for maintenance
The worker is graceful-shutdown-clean: SIGTERM finishes the in-flight job and exits; queued jobs stay queued and a future worker picks them up. This makes "drain for maintenance" a near-no-op — but here's the procedure when you want to be precise.
## Stop accepting new work but finish what's running

```sh
ssh worker-host
sudo systemctl stop shithubd-worker
```
The unit honors `Type=notify` + a 30-second `TimeoutStopSec`. The
worker:

1. Stops the next `FOR UPDATE SKIP LOCKED` query.
2. Finishes the currently-leased job (if any).
3. Exits 0.

Queued jobs stay in the `jobs` table with `state='queued'`. A
restarted worker (or a different worker on a different host)
will pick them up.
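The leasing step itself uses the `FOR UPDATE SKIP LOCKED` query mentioned above. The real query lives in the worker; a minimal sketch of its likely shape, assuming Postgres and the columns this doc already references (`state`, `leased_by`, `leased_at`):

```sql
-- Sketch only: the worker's actual leasing query may differ.
-- 'worker-host:12345' stands in for the real <host>:<pid>.
UPDATE jobs
SET state = 'processing',
    leased_by = 'worker-host:12345',
    leased_at = now()
WHERE id = (
  SELECT id FROM jobs
  WHERE state = 'queued'
  ORDER BY id
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING id, kind;
```

`SKIP LOCKED` is what lets several workers poll the same table without blocking on each other's in-flight leases.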
## Confirm drain

```sh
psql -d shithub -c "
SELECT state, count(*) FROM jobs GROUP BY state;"
```
You're looking for `processing=0`. If it's > 0 a few seconds
after `systemctl stop`, find the rogue worker:

```sh
psql -d shithub -c "
SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
```
`leased_by` is a `<host>:<pid>` string; track down the host and
process and stop it.
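Splitting `leased_by` is plain shell string surgery; a sketch, with `worker-02:4711` as an illustrative value (the kill command is echoed rather than run):

```sh
# leased_by is '<host>:<pid>'. SIGTERM gives the rogue worker the
# same graceful shutdown as systemctl stop: finish the job, exit.
# 'worker-02:4711' is an illustrative value, not a real host/pid.
leased_by='worker-02:4711'
host="${leased_by%%:*}"   # drop everything from the first ':'
pid="${leased_by##*:}"    # drop everything up to the last ':'
echo "ssh $host sudo kill -TERM $pid"
```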
## Drain only one job kind

If you need to do schema work that affects (e.g.) only the
search-reindex job, you can pause that kind without stopping the
worker:

```sql
UPDATE jobs
SET state='paused'
WHERE state='queued'
  AND kind='search.reindex';
```
The worker's leasing query filters out `paused`. Resume by
flipping back to `queued`.
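"Flipping back" concretely is the inverse of the pause:

```sql
UPDATE jobs
SET state='queued'
WHERE state='paused'
  AND kind='search.reindex';
```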
This is **not** the standard maintenance path; reach for a full
drain unless you have a reason to keep other kinds running.
## Hostile case: a stuck job
If a worker is leased on a job and the worker process is gone (host crashed mid-execution), the lease times out and a new worker re-leases the same job. The lease timeout is set per-kind in the job handler registry; default 5 minutes.
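The re-lease check presumably compares `leased_at` against the per-kind timeout; a sketch of the eligibility condition under the 5-minute default (the real timeout comes from the handler registry, not a literal interval):

```sql
-- Rows a fresh worker could re-lease under the default timeout.
SELECT id, kind, leased_by, leased_at
FROM jobs
WHERE state = 'processing'
  AND leased_at < now() - interval '5 minutes';
```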
To force re-lease sooner (e.g., the job is small and you want it picked up now):
```sql
UPDATE jobs
SET leased_at = NULL, leased_by = NULL, state = 'queued'
WHERE id = <job-id>;
```
Audit-trail your manual intervention in the incident channel.
## Resume

```sh
sudo systemctl start shithubd-worker
```
The worker starts polling immediately; queued jobs begin running within seconds.