# Move Postgres data from root disk to block volume
One-time migration. After this lands, the droplet's root disk
holds only the OS + binaries; all stateful data (pgdata, repos,
tmp) lives on the attached block volume mounted at /data. The
goal is twofold: (1) the root disk can never fill up from
runaway DB growth and wedge the system, and (2) the volume can be
detached and reattached to a replacement droplet without losing
state if the host ever needs replacing.
## Preconditions (verify before scheduling)
- **Block volume mounted at `/data`** with enough free space for
  pgdata + headroom for several months of growth. Check:
  ```sh
  df -h /data
  ```
- **`/data/pgdata` does not contain a live cluster.** It may
  contain a stale `initdb` from when the volume was first
  provisioned — that's fine and we'll move it aside.
  ```sh
  ls -la /data/pgdata
  # Expect a PG_VERSION file dated to volume-attach time, NOT
  # to "minutes ago". If "minutes ago", STOP — something is
  # already running there.
  ```
- **Recent dump landed in Spaces** (not just locally). Worst-case
  rollback is restoring this dump:
  ```sh
  rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
    lsl spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ | tail -3
  ```
  If today's directory is empty, run a fresh dump first:
  ```sh
  /usr/local/bin/shithub-backup-daily
  ```
- **WAL archiver is healthy** — gives PITR coverage for changes
  between the last dump and migration time:
  ```sh
  sudo -u postgres psql -tAc "SELECT last_archived_time, last_failed_time FROM pg_stat_archiver;"
  # last_archived_time should be < 5 min ago, last_failed_time blank or older
  ```
- **DigitalOcean snapshot taken** via the DO dashboard
  (Droplets → shithub-prod → Snapshots → Take Snapshot). This
  is the panic-button rollback if everything else fails.
  Snapshots take a few minutes; DON'T start the migration
  until the snapshot completes.
- **Notice users** — site will be down for ~3 minutes for a
  ~115 MB pgdata. Scale the window with current
  `du -sh /var/lib/postgresql/16/main`.
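The free-space precondition can be checked mechanically rather than by eyeballing `df`. A sketch — `needs_space` is a hypothetical helper (not part of this runbook) that keeps the threshold arithmetic pure so it can be sanity-checked off-box:

```sh
# Hypothetical pre-flight helper: does the volume have at least
# MULT times the current pgdata size available?
needs_space() {
    avail_kb=$1 pgdata_kb=$2 mult=$3
    [ "$avail_kb" -ge $((pgdata_kb * mult)) ]
}

# On the droplet, feed it live numbers (paths from this runbook):
#   avail=$(df -k --output=avail /data | tail -1)
#   used=$(du -sk /var/lib/postgresql/16/main | cut -f1)
#   needs_space "$avail" "$used" 10 || echo "not enough headroom"

needs_space 102400000 117760 10 && echo ok   # ~100 GB vs ~115 MB x 10
```

The 10x multiplier is an arbitrary stand-in for "several months of growth"; pick whatever headroom policy fits.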
## Migration
Total downtime: ~3 minutes for the current ~115 MB pgdata. Most of that is postgres clean-shutdown + start; the rsync itself is seconds.
```sh
ssh root@shithub.sh
```
### 1. Drain writes
```sh
systemctl stop shithubd-web
systemctl stop shithubd-cron 2>/dev/null   # if running
```
Verify nothing else has DB sessions open before stopping postgres (background workers, manual psql sessions):
```sh
sudo -u postgres psql -tAc "SELECT pid, application_name, client_addr, state FROM pg_stat_activity WHERE datname = 'shithub';"
```
If this returns rows besides your own psql, kill those processes first.
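"Kill those processes" doesn't require hunting OS pids: postgres's standard `pg_terminate_backend()` can disconnect every other session from inside the DB. A sketch — the `terminate_sql` wrapper is illustrative, only the SQL matters:

```sh
# Emit SQL that disconnects every shithub session except our own.
# pg_terminate_backend() is the standard postgres call for this.
terminate_sql() {
    cat <<'SQL'
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'shithub' AND pid <> pg_backend_pid();
SQL
}

# On the droplet: terminate_sql | sudo -u postgres psql -d shithub -tA
terminate_sql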
### 2. Stop postgres
```sh
systemctl stop postgresql@16-main
systemctl status postgresql@16-main --no-pager | head -5   # should be "inactive (dead)"
```
### 3. Rename the stale pre-init aside (don't delete)
Keeping it lets us undo step 4 instantly if something looks off:
```sh
mv /data/pgdata /data/pgdata.preinit-$(date -u +%Y%m%d)
```
### 4. Copy live data to the volume
`rsync -aHX` preserves perms, owners, hard links, and xattrs;
`--info=progress2` shows a single overall progress line:

```sh
rsync -aHX --info=progress2 \
  /var/lib/postgresql/16/main/ \
  /data/pgdata/
```
Verify the copy looks right:
```sh
ls -la /data/pgdata/ | head
diff <(cd /var/lib/postgresql/16/main && find . -printf '%p %s %m %u:%g\n' | sort) \
     <(cd /data/pgdata && find . -printf '%p %s %m %u:%g\n' | sort) | head
# Empty diff = identical layout (paths, sizes, perms, owners).
```
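The `find` listing compares metadata only, not contents. At this size a content-level check costs only seconds; a sketch, with `compare_trees` as a hypothetical helper:

```sh
# Print "match" when every regular file's sha256 agrees across two trees.
compare_trees() {
    a=$1 b=$2
    la=$(mktemp) lb=$(mktemp)
    (cd "$a" && find . -type f -exec sha256sum {} + | sort) >"$la"
    (cd "$b" && find . -type f -exec sha256sum {} + | sort) >"$lb"
    if cmp -s "$la" "$lb"; then echo match; else echo differ; fi
    rm -f "$la" "$lb"
}

# On the droplet (postgres must still be stopped, per step 2):
#   compare_trees /var/lib/postgresql/16/main /data/pgdata
```

Only meaningful while postgres is down — checksums of a live cluster will never agree.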
### 5. Repoint the cluster
Edit the active config:
```sh
sed -i.bak "s|^data_directory = .*|data_directory = '/data/pgdata'|" \
  /etc/postgresql/16/main/postgresql.conf
grep ^data_directory /etc/postgresql/16/main/postgresql.conf
# Expect: data_directory = '/data/pgdata'
```
(`.bak` lets you `cp postgresql.conf.bak postgresql.conf` to
revert in 1 second if needed.)
### 6. Start postgres on the new path
```sh
systemctl start postgresql@16-main
sleep 2
systemctl is-active postgresql@16-main   # active
sudo -u postgres pg_isready -h /var/run/postgresql
sudo -u postgres psql -tAc "SHOW data_directory;"   # should print /data/pgdata
sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM repos;"
sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM users;"
```
If any of those fail, jump to **Rollback** below.
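The fixed `sleep 2` can race a slow start. A sturdier variant polls until ready; `wait_for` is a generic retry helper (hypothetical name, not part of the runbook):

```sh
# Retry a command once per second until it succeeds or TIMEOUT elapses.
wait_for() {
    cmd=$1 timeout=$2 i=0
    while [ "$i" -lt "$timeout" ]; do
        sh -c "$cmd" && return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# On the droplet, replace the bare sleep with:
#   wait_for 'sudo -u postgres pg_isready -q -h /var/run/postgresql' 30
```

This also bounds the failure case: if postgres isn't up in 30 s, you fall through to Rollback instead of debugging against a half-started cluster.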
### 7. Bring the app back up
```sh
systemctl start shithubd-web
systemctl start shithubd-cron 2>/dev/null
systemctl is-active shithubd-web
curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz   # 200
```
### 8. Smoke test
From your laptop (not the droplet):
```sh
curl -fsS -o /dev/null -w '%{http_code} %{time_total}s\n' https://shithub.sh/
# Walk the site briefly: load a repo, view an issue, log in.
```
Confirm the WAL archiver picked up where it left off (next archive timestamp should be > migration time within a couple of minutes):
```sh
ssh root@shithub.sh 'sudo -u postgres psql -tAc "SELECT last_archived_time FROM pg_stat_archiver;"'
```
Run a fresh backup to confirm the new pgdata is durable end-to-end:
```sh
ssh root@shithub.sh /usr/local/bin/shithub-backup-daily
```
### 9. Cleanup (after a few days of healthy operation)
Don't do this until you've slept on at least one full daily backup cycle from the new location and confirmed it landed in Spaces. Then:
```sh
# Reclaim root-disk space: the old pgdata copy still lives here.
# Empty the dir but keep it (see the note below).
find /var/lib/postgresql/16/main -mindepth 1 -delete
# Remove the stale init we set aside in step 3.
rm -rf /data/pgdata.preinit-*
```
The systemd unit's `RequiresMountsFor=/var/lib/postgresql/%I` is
a static path — if you remove the empty dir, recreate it with
`mkdir -p /var/lib/postgresql/16/main && chown postgres:postgres …`
before the next reboot, otherwise `pg_ctlcluster` will refuse to
start. Easier: leave the empty dir alone.
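Related reboot-safety note: after this migration, postgres must not start before `/data` is mounted. A systemd drop-in like the following makes that ordering explicit (`RequiresMountsFor=` is standard systemd; the drop-in filename is my suggestion):

```ini
# /etc/systemd/system/postgresql@16-main.service.d/data-volume.conf
[Unit]
RequiresMountsFor=/data
```

Then `systemctl daemon-reload`. Without it, a boot where the volume attaches late could try to start postgres against a missing data directory.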
## Rollback
### Mid-migration (steps 4–6 failed)
```sh
systemctl stop postgresql@16-main
cp /etc/postgresql/16/main/postgresql.conf.bak /etc/postgresql/16/main/postgresql.conf
mv /data/pgdata /data/pgdata.failed-$(date -u +%Y%m%d_%H%M)
mv /data/pgdata.preinit-* /data/pgdata 2>/dev/null   # restore the stale init in case something checks for it
systemctl start postgresql@16-main
systemctl start shithubd-web
```
The original /var/lib/postgresql/16/main was never modified, so
postgres comes back up against unchanged data. Total recovery
time: ~30 seconds.
### Worst case (data corruption observed after step 6)
```sh
# 1. Stop everything that talks to the DB.
systemctl stop shithubd-web shithubd-cron postgresql@16-main

# 2. Find the latest dump in Spaces.
rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
  lsl spaces-prod:shithub-backups/daily/ | sort | tail -1

# 3. Drop and recreate the cluster from the dump. See restore.md
#    for the full pg_restore procedure.
```
Or, faster:
### Nuclear option
Restore the droplet from the DO snapshot taken in preconditions. Loses any user activity since the snapshot — usually that's "the last few minutes" since the snapshot is taken right before migration. Coordinate via status page if you do this.
## Why this layout
- **Root disk is 77 GB**; pgdata growth is unbounded. A runaway
  query log or WAL spike can fill it and freeze the whole droplet,
  including sshd.
- **Block volume is 100 GB**, separately managed, snapshotable in
  DO independently of the droplet, and detachable. If the droplet
  is unrecoverable, attaching the volume to a fresh droplet
  recovers state in minutes.
- **Repos and tmp already live on `/data`**. Moving pgdata
  finishes the layout the volume was provisioned for.