markdown · 7766 bytes Raw Blame History

Move Postgres data from root disk to block volume

One-time migration. After this lands, the droplet's root disk holds only the OS + binaries; all stateful data (pgdata, repos, tmp) lives on the attached block volume mounted at /data. The goal is twofold: (1) the root disk can never fill up from runaway DB growth and OOM the system, and (2) the volume can be detached and reattached to a replacement droplet without losing state if the host ever needs replacing.

Preconditions (verify before scheduling)

  • Block volume mounted at /data with enough free space for pgdata + headroom for several months of growth. Check:
    df -h /data
    
  • /data/pgdata does not contain a live cluster. It may contain a stale initdb from when the volume was first provisioned — that's fine and we'll move it aside.
    ls -la /data/pgdata
    # Expect a PG_VERSION file dated to volume-attach time, NOT
    # to "minutes ago". If "minutes ago", STOP — something is
    # already running there.
    
  • Recent dump landed in Spaces (not just locally). Worst-case rollback is restoring this dump:
    rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
           lsl spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ | tail -3
    
    If today's directory is empty, run a fresh dump first:
    /usr/local/bin/shithub-backup-daily
    
  • WAL archiver is healthy — gives PITR coverage for changes between the last dump and migration time:
    sudo -u postgres psql -tAc "SELECT last_archived_time, last_failed_time FROM pg_stat_archiver;"
    # last_archived_time should be < 5 min ago, last_failed_time blank or older
    
  • DigitalOcean snapshot taken via the DO dashboard (Droplets → shithub-prod → Snapshots → Take Snapshot). This is the panic-button rollback if everything else fails. Snapshots take a few minutes; DON'T start the migration until the snapshot completes.
  • Notice users — site will be down for ~3 minutes for a ~115 MB pgdata. Scale the window with current du -sh /var/lib/postgresql/16/main.

Migration

Total downtime: ~3 minutes for a 100 MB DB. Most of that is postgres clean-shutdown + start; the rsync itself is seconds.

ssh root@shithub.sh

1. Drain writes

systemctl stop shithubd-web
systemctl stop shithubd-cron 2>/dev/null   # if running

Verify nothing else has DB sessions open before stopping postgres (background workers, manual psql sessions):

sudo -u postgres psql -tAc "SELECT pid, application_name, client_addr, state FROM pg_stat_activity WHERE datname = 'shithub';"

If this returns rows besides your own psql, kill those processes first.

2. Stop postgres

systemctl stop postgresql@16-main
systemctl status postgresql@16-main --no-pager | head -5  # should be "inactive (dead)"

3. Rename the stale pre-init aside (don't delete)

Keeping it lets us undo step 4 instantly if something looks off:

mv /data/pgdata /data/pgdata.preinit-$(date -u +%Y%m%d)

4. Copy live data to the volume

rsync -aHX --info=progress2 preserves perms, owners, hard links, and xattrs. --info=progress2 shows a single overall progress line:

rsync -aHX --info=progress2 \
  /var/lib/postgresql/16/main/ \
  /data/pgdata/

Verify the copy looks right:

ls -la /data/pgdata/ | head
diff <(cd /var/lib/postgresql/16/main && find . -printf '%p %s %m %u:%g\n' | sort) \
     <(cd /data/pgdata               && find . -printf '%p %s %m %u:%g\n' | sort) | head
# Empty diff = byte-identical layout.

5. Repoint the cluster

Edit the active config:

sed -i.bak "s|^data_directory = .*|data_directory = '/data/pgdata'|" \
  /etc/postgresql/16/main/postgresql.conf
grep ^data_directory /etc/postgresql/16/main/postgresql.conf
# Expect: data_directory = '/data/pgdata'

(.bak lets you cp postgresql.conf.bak postgresql.conf to revert in 1 second if needed.)

6. Start postgres on the new path

systemctl start postgresql@16-main
sleep 2
systemctl is-active postgresql@16-main      # active
sudo -u postgres pg_isready -h /var/run/postgresql
sudo -u postgres psql -tAc "SHOW data_directory;"   # should print /data/pgdata
sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM repos;"
sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM users;"

If any of those fail, jump to Rollback below.

7. Bring the app back up

systemctl start shithubd-web
systemctl start shithubd-cron 2>/dev/null
systemctl is-active shithubd-web
curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz   # 200

8. Smoke test

From your laptop (not the droplet):

curl -fsS -o /dev/null -w '%{http_code} %{time_total}s\n' https://shithub.sh/
# Walk the site briefly: load a repo, view an issue, log in.

Confirm the WAL archiver picked up where it left off (next archive timestamp should be > migration time within a couple of minutes):

ssh root@shithub.sh 'sudo -u postgres psql -tAc "SELECT last_archived_time FROM pg_stat_archiver;"'

Run a fresh backup to confirm the new pgdata is durable end-to-end:

ssh root@shithub.sh /usr/local/bin/shithub-backup-daily

9. Cleanup (after a few days of healthy operation)

Don't do this until you've slept on at least one full daily backup cycle from the new location and confirmed it landed in Spaces. Then:

# Reclaim root-disk space.
rm -rf /var/lib/postgresql/16/main.preinit-*
# (After confirming /var/lib/postgresql/16/main is empty)
rmdir /var/lib/postgresql/16/main 2>/dev/null
rm -rf /data/pgdata.preinit-*

The systemd unit's RequiresMountsFor=/var/lib/postgresql/%I is a static path — if you remove the empty dir, recreate it as mkdir -p /var/lib/postgresql/16/main && chown postgres:postgres … before the next reboot, otherwise pg_ctlcluster will refuse to start. Easier: leave the empty dir alone.

Rollback

Mid-migration (steps 4–6 failed)

systemctl stop postgresql@16-main
cp /etc/postgresql/16/main/postgresql.conf.bak /etc/postgresql/16/main/postgresql.conf
mv /data/pgdata /data/pgdata.failed-$(date -u +%Y%m%d_%H%M)
mv /data/pgdata.preinit-* /data/pgdata 2>/dev/null   # restore the stale init in case something checks for it
systemctl start postgresql@16-main
systemctl start shithubd-web

The original /var/lib/postgresql/16/main was never modified, so postgres comes back up against unchanged data. Total recovery time: ~30 seconds.

Worst case (data corruption observed after step 6)

# 1. Stop everything that talks to the DB.
systemctl stop shithubd-web shithubd-cron postgresql@16-main

# 2. Find the latest dump in Spaces.
rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
       lsl spaces-prod:shithub-backups/daily/ | sort | tail -1

# 3. Drop and recreate the cluster from the dump. See restore.md
#    for the full pg_restore procedure.

Or, faster:

Nuclear option

Restore the droplet from the DO snapshot taken in preconditions. Loses any user activity since the snapshot — usually that's "the last few minutes" since the snapshot is taken right before migration. Coordinate via status page if you do this.

Why this layout

  • Root disk is 77 GB; pgdata growth is unbounded. A runaway query log or WAL spike can fill it and freeze the whole droplet, including sshd.
  • Block volume is 100 GB, separately managed, snapshotable in DO independently of the droplet, and detachable. If the droplet is unrecoverable, attaching the volume to a fresh droplet recovers state in minutes.
  • Repos and tmp already live on /data. Moving pgdata finishes the layout the volume was provisioned for.
View source
1 # Move Postgres data from root disk to block volume
2
3 One-time migration. After this lands, the droplet's root disk
4 holds only the OS + binaries; all stateful data (pgdata, repos,
5 tmp) lives on the attached block volume mounted at `/data`. The
6 goal is twofold: (1) the root disk can never fill up from
7 runaway DB growth and OOM the system, and (2) the volume can be
8 detached and reattached to a replacement droplet without losing
9 state if the host ever needs replacing.
10
11 ## Preconditions (verify before scheduling)
12
13 - **Block volume mounted at `/data`** with enough free space for
14 pgdata + headroom for several months of growth. Check:
15 ```sh
16 df -h /data
17 ```
18 - **`/data/pgdata` does not contain a live cluster.** It may
19 contain a stale `initdb` from when the volume was first
20 provisioned — that's fine and we'll move it aside.
21 ```sh
22 ls -la /data/pgdata
23 # Expect a PG_VERSION file dated to volume-attach time, NOT
24 # to "minutes ago". If "minutes ago", STOP — something is
25 # already running there.
26 ```
27 - **Recent dump landed in Spaces** (not just locally). Worst-case
28 rollback is restoring this dump:
29 ```sh
30 rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
31 lsl spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ | tail -3
32 ```
33 If today's directory is empty, run a fresh dump first:
34 ```sh
35 /usr/local/bin/shithub-backup-daily
36 ```
37 - **WAL archiver is healthy** — gives PITR coverage for changes
38 between the last dump and migration time:
39 ```sh
40 sudo -u postgres psql -tAc "SELECT last_archived_time, last_failed_time FROM pg_stat_archiver;"
41 # last_archived_time should be < 5 min ago, last_failed_time blank or older
42 ```
43 - **DigitalOcean snapshot taken** via the DO dashboard
44 (Droplets → shithub-prod → Snapshots → Take Snapshot). This
45 is the panic-button rollback if everything else fails.
46 Snapshots take a few minutes; DON'T start the migration
47 until the snapshot completes.
48 - **Notice users** — site will be down for ~3 minutes for a
49 ~115 MB pgdata. Scale the window with current `du -sh
50 /var/lib/postgresql/16/main`.
51
52 ## Migration
53
54 Total downtime: ~3 minutes for a 100 MB DB. Most of that is
55 postgres clean-shutdown + start; the rsync itself is seconds.
56
57 ```sh
58 ssh root@shithub.sh
59 ```
60
61 ### 1. Drain writes
62
63 ```sh
64 systemctl stop shithubd-web
65 systemctl stop shithubd-cron 2>/dev/null # if running
66 ```
67
68 Verify nothing else has DB sessions open before stopping
69 postgres (background workers, manual psql sessions):
70
71 ```sh
72 sudo -u postgres psql -tAc "SELECT pid, application_name, client_addr, state FROM pg_stat_activity WHERE datname = 'shithub';"
73 ```
74
75 If this returns rows besides your own psql, kill those processes
76 first.
77
78 ### 2. Stop postgres
79
80 ```sh
81 systemctl stop postgresql@16-main
82 systemctl status postgresql@16-main --no-pager | head -5 # should be "inactive (dead)"
83 ```
84
85 ### 3. Rename the stale pre-init aside (don't delete)
86
87 Keeping it lets us undo step 4 instantly if something looks off:
88
89 ```sh
90 mv /data/pgdata /data/pgdata.preinit-$(date -u +%Y%m%d)
91 ```
92
93 ### 4. Copy live data to the volume
94
95 `rsync -aHX --info=progress2` preserves perms, owners, hard
96 links, and xattrs. `--info=progress2` shows a single overall
97 progress line:
98
99 ```sh
100 rsync -aHX --info=progress2 \
101 /var/lib/postgresql/16/main/ \
102 /data/pgdata/
103 ```
104
105 Verify the copy looks right:
106
107 ```sh
108 ls -la /data/pgdata/ | head
109 diff <(cd /var/lib/postgresql/16/main && find . -printf '%p %s %m %u:%g\n' | sort) \
110 <(cd /data/pgdata && find . -printf '%p %s %m %u:%g\n' | sort) | head
111 # Empty diff = byte-identical layout.
112 ```
113
114 ### 5. Repoint the cluster
115
116 Edit the active config:
117
118 ```sh
119 sed -i.bak "s|^data_directory = .*|data_directory = '/data/pgdata'|" \
120 /etc/postgresql/16/main/postgresql.conf
121 grep ^data_directory /etc/postgresql/16/main/postgresql.conf
122 # Expect: data_directory = '/data/pgdata'
123 ```
124
125 (`.bak` lets you `cp postgresql.conf.bak postgresql.conf` to
126 revert in 1 second if needed.)
127
128 ### 6. Start postgres on the new path
129
130 ```sh
131 systemctl start postgresql@16-main
132 sleep 2
133 systemctl is-active postgresql@16-main # active
134 sudo -u postgres pg_isready -h /var/run/postgresql
135 sudo -u postgres psql -tAc "SHOW data_directory;" # should print /data/pgdata
136 sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM repos;"
137 sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM users;"
138 ```
139
140 If any of those fail, jump to **Rollback** below.
141
142 ### 7. Bring the app back up
143
144 ```sh
145 systemctl start shithubd-web
146 systemctl start shithubd-cron 2>/dev/null
147 systemctl is-active shithubd-web
148 curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz # 200
149 ```
150
151 ### 8. Smoke test
152
153 From your laptop (not the droplet):
154
155 ```sh
156 curl -fsS -o /dev/null -w '%{http_code} %{time_total}s\n' https://shithub.sh/
157 # Walk the site briefly: load a repo, view an issue, log in.
158 ```
159
160 Confirm the WAL archiver picked up where it left off (next
161 archive timestamp should be > migration time within a couple of
162 minutes):
163
164 ```sh
165 ssh root@shithub.sh 'sudo -u postgres psql -tAc "SELECT last_archived_time FROM pg_stat_archiver;"'
166 ```
167
168 Run a fresh backup to confirm the new pgdata is durable end-to-end:
169
170 ```sh
171 ssh root@shithub.sh /usr/local/bin/shithub-backup-daily
172 ```
173
174 ### 9. Cleanup (after a few days of healthy operation)
175
176 Don't do this until you've slept on at least one full daily
177 backup cycle from the new location and confirmed it landed in
178 Spaces. Then:
179
180 ```sh
181 # Reclaim root-disk space.
182 rm -rf /var/lib/postgresql/16/main.preinit-*
183 # (After confirming /var/lib/postgresql/16/main is empty)
184 rmdir /var/lib/postgresql/16/main 2>/dev/null
185 rm -rf /data/pgdata.preinit-*
186 ```
187
188 The systemd unit's `RequiresMountsFor=/var/lib/postgresql/%I`
189 is a static path — if you remove the empty dir, recreate it as
190 `mkdir -p /var/lib/postgresql/16/main && chown postgres:postgres
191 ` before the next reboot, otherwise `pg_ctlcluster` will refuse
192 to start. Easier: leave the empty dir alone.
193
194 ## Rollback
195
196 ### Mid-migration (steps 4–6 failed)
197
198 ```sh
199 systemctl stop postgresql@16-main
200 cp /etc/postgresql/16/main/postgresql.conf.bak /etc/postgresql/16/main/postgresql.conf
201 mv /data/pgdata /data/pgdata.failed-$(date -u +%Y%m%d_%H%M)
202 mv /data/pgdata.preinit-* /data/pgdata 2>/dev/null # restore the stale init in case something checks for it
203 systemctl start postgresql@16-main
204 systemctl start shithubd-web
205 ```
206
207 The original `/var/lib/postgresql/16/main` was never modified, so
208 postgres comes back up against unchanged data. Total recovery
209 time: ~30 seconds.
210
211 ### Worst case (data corruption observed after step 6)
212
213 ```sh
214 # 1. Stop everything that talks to the DB.
215 systemctl stop shithubd-web shithubd-cron postgresql@16-main
216
217 # 2. Find the latest dump in Spaces.
218 rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
219 lsl spaces-prod:shithub-backups/daily/ | sort | tail -1
220
221 # 3. Drop and recreate the cluster from the dump. See restore.md
222 # for the full pg_restore procedure.
223 ```
224
225 Or, faster:
226
227 ### Nuclear option
228
229 Restore the droplet from the DO snapshot taken in preconditions.
230 Loses any user activity since the snapshot — usually that's
231 "the last few minutes" since the snapshot is taken right before
232 migration. Coordinate via status page if you do this.
233
234 ## Why this layout
235
236 - **Root disk is 77 GB**; pgdata growth is unbounded. A
237 runaway query log or WAL spike can fill it and freeze the
238 whole droplet, including sshd.
239 - **Block volume is 100 GB**, separately managed, snapshotable
240 in DO independently of the droplet, and detachable. If the
241 droplet is unrecoverable, attaching the volume to a fresh
242 droplet recovers state in minutes.
243 - **Repos and tmp already live on `/data`**. Moving pgdata
244 finishes the layout the volume was provisioned for.