# Drain workers for maintenance

The worker shuts down gracefully: on SIGTERM it finishes the in-flight job and exits, while queued jobs stay queued until a future worker picks them up. This makes "drain for maintenance" nearly a no-op, but here is the procedure when you want to be precise.

## Stop accepting new work but finish what's running

```sh
ssh worker-host
sudo systemctl stop shithubd-worker
```

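The unit uses `Type=notify` with a 30-second `TimeoutStopSec`. On SIGTERM, the worker:

1. Stops issuing new `FOR UPDATE SKIP LOCKED` leasing queries.
2. Finishes the currently leased job (if any).
3. Exits 0.

If the in-flight job outlasts the 30-second stop timeout, systemd by default sends SIGKILL; the job's lease then has to time out before another worker picks it up (see the stuck-job case below).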

Queued jobs stay in the `jobs` table with `state='queued'`. A restarted worker (or a different worker on a different host) will pick them up.
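
Before stopping the unit, it can be useful to confirm the stop behavior systemd has actually loaded, since the 30-second window is what bounds the in-flight job. The following is a minimal sketch: `show_stop_settings` is a hypothetical helper name, while `systemctl show -p` and the `Type`, `TimeoutStopUSec`, and `KillSignal` properties are standard systemd.

```sh
# Print the stop-related settings systemd has loaded for the worker unit.
# show_stop_settings is a hypothetical helper name; the unit name defaults
# to the one used in this runbook.
show_stop_settings() {
  unit="${1:-shithubd-worker}"
  systemctl show -p Type -p TimeoutStopUSec -p KillSignal "$unit"
}
```

Run `show_stop_settings` on the worker host; if the unit matches the description above, the `TimeoutStopUSec` line should report 30 seconds.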

## Confirm drain

```sh
psql -d shithub -c "
  SELECT state, count(*) FROM jobs GROUP BY state;"
```

You're looking for no `processing` row at all (the `GROUP BY` omits states that have zero jobs). If `processing` still appears a few seconds after `systemctl stop`, find the rogue worker:

```sh
psql -d shithub -c "
  SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
```

`leased_by` is a `<host>:<pid>` string; log in to that host, find the process with that PID, and stop it.
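
When several hosts run workers, re-running the count by hand gets tedious. The sketch below polls until nothing is in `processing` and lists the remaining leases if it gives up. The helper names (`processing_count`, `wait_for_drain`, `split_lease`) are hypothetical; only the `jobs` table, its columns, and the `shithub` database name come from this runbook.

```sh
# Hypothetical helpers; only the jobs table, its columns, and the database
# name come from this runbook.

# Print the number of jobs currently in state 'processing'.
processing_count() {
  psql -X -At -d shithub -c "SELECT count(*) FROM jobs WHERE state='processing';"
}

# Poll once per second until nothing is processing, or give up after
# $1 attempts (default 60). Returns 0 when drained, 1 on timeout
# (after listing the leases that are still held).
wait_for_drain() {
  remaining="${1:-60}"
  while [ "$remaining" -gt 0 ]; do
    [ "$(processing_count)" = "0" ] && return 0
    sleep 1
    remaining=$((remaining - 1))
  done
  psql -X -d shithub -c "SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
  return 1
}

# Split a '<host>:<pid>' lease string at its last colon;
# prints the host and the pid separated by a space.
split_lease() {
  printf '%s %s\n' "${1%:*}" "${1##*:}"
}
```

For example, `wait_for_drain 120 && echo drained` waits up to roughly two minutes, and `split_lease worker-3:4242` prints `worker-3 4242`.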

## Drain only one job kind

If you need to do schema work that affects only one job kind (for example, search-reindex), you can pause that kind without stopping the worker:

```sql
UPDATE jobs
   SET state='paused'
 WHERE state='queued'
   AND kind='search.reindex';
```

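The worker's leasing query skips `paused` jobs. The `UPDATE` touches only rows that are queued when it runs, so jobs of that kind enqueued later are not paused. Resume by setting the state back to `queued`.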
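
Pausing and resuming are the same statement with the two states swapped, so one small wrapper can cover both. This is a sketch under assumptions: `set_kind_state` is a hypothetical helper, and the SQL is fed on stdin because psql does not interpolate `:'var'` variables inside `-c` strings.

```sh
# set_kind_state KIND FROM TO: move every job of KIND from state FROM to TO.
# Hypothetical helper; values are passed as psql variables so psql quotes
# them instead of their being spliced into the SQL text.
set_kind_state() {
  psql -X -d shithub -v ON_ERROR_STOP=1 \
       -v kind="$1" -v from_state="$2" -v to_state="$3" <<'SQL'
UPDATE jobs
   SET state = :'to_state'
 WHERE state = :'from_state'
   AND kind = :'kind';
SQL
}
```

`set_kind_state search.reindex queued paused` pauses the kind and `set_kind_state search.reindex paused queued` resumes it. Resuming flips every paused job of that kind, including any that were paused for an unrelated reason.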

This is **not** the standard maintenance path; reach for a full drain unless you have a reason to keep other kinds running.

## Hostile case: a stuck job

If a worker holds a lease on a job and the worker process is gone (for example, the host crashed mid-execution), the lease times out and a new worker re-leases the same job. The lease timeout is set per kind in the job handler registry; the default is 5 minutes.
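
To see which leases have already outlived the timeout, a query along these lines can help. It is a sketch only: `stale_leases` is a hypothetical helper, it assumes `leased_at` is a timestamp column, and it applies one threshold to every kind because the per-kind timeouts in the handler registry are not visible to SQL.

```sh
# stale_leases [MINUTES]: list jobs still marked processing whose lease is
# older than MINUTES (default 5, the default lease timeout mentioned above).
# Hypothetical helper; assumes leased_at is a timestamp column.
stale_leases() {
  psql -X -d shithub -v ON_ERROR_STOP=1 -v mins="${1:-5}" <<'SQL'
SELECT id, kind, leased_by, leased_at, now() - leased_at AS lease_age
  FROM jobs
 WHERE state = 'processing'
   AND leased_at < now() - :'mins'::int * interval '1 minute'
 ORDER BY leased_at;
SQL
}
```

`stale_leases` uses the 5-minute default; `stale_leases 15` widens the window for kinds with a longer timeout.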

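To force a re-lease sooner (e.g., the job is small and you want it picked up now), first confirm that the process named in `leased_by` is really gone; otherwise two workers may run the job at once. Then:

```sql
-- The state guard avoids touching a job that has already finished.
UPDATE jobs
   SET leased_at = NULL, leased_by = NULL, state = 'queued'
 WHERE id = <job-id>
   AND state = 'processing';
```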

Record your manual intervention in the incident channel so there is an audit trail.

## Resume

```sh
sudo systemctl start shithubd-worker
```

The worker starts polling immediately; queued jobs begin running within seconds.
