# Move Postgres data from root disk to block volume

One-time migration. After this lands, the droplet's root disk holds only the OS + binaries; all stateful data (pgdata, repos, tmp) lives on the attached block volume mounted at `/data`.

The goal is twofold: (1) runaway DB growth can never fill the root disk and freeze the system, and (2) the volume can be detached and reattached to a replacement droplet without losing state if the host ever needs replacing.

## Preconditions (verify before scheduling)

- **Block volume mounted at `/data`** with enough free space for pgdata plus headroom for several months of growth. Check:

  ```sh
  df -h /data
  ```

- **`/data/pgdata` does not contain a live cluster.** It may contain a stale `initdb` from when the volume was first provisioned; that's fine, and we'll move it aside.

  ```sh
  ls -la /data/pgdata
  # Expect a PG_VERSION file dated to volume-attach time, NOT
  # to "minutes ago". If "minutes ago", STOP: something is
  # already running there.
  ```

- **Recent dump landed in Spaces** (not just locally). Worst-case rollback is restoring this dump:

  ```sh
  rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
    lsl spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ | tail -3
  ```

  If today's directory is empty, run a fresh dump first:

  ```sh
  /usr/local/bin/shithub-backup-daily
  ```

- **WAL archiver is healthy.** This gives PITR coverage for changes between the last dump and migration time:

  ```sh
  sudo -u postgres psql -tAc "SELECT last_archived_time, last_failed_time FROM pg_stat_archiver;"
  # last_archived_time should be < 5 min ago; last_failed_time should be
  # blank or older than last_archived_time
  ```

- **DigitalOcean snapshot taken** via the DO dashboard (Droplets → shithub-prod → Snapshots → Take Snapshot). This is the panic-button rollback if everything else fails. Snapshots take a few minutes; DON'T start the migration until the snapshot completes.
- **Notify users**: the site will be down for ~3 minutes for a ~115 MB pgdata.
  Scale the window using the current size from `du -sh /var/lib/postgresql/16/main`.

## Migration

Total downtime: ~3 minutes for a ~115 MB pgdata. Most of that is the postgres clean shutdown and start; the rsync itself takes seconds.

```sh
ssh root@shithub.sh
```

### 1. Drain writes

```sh
systemctl stop shithubd-web
systemctl stop shithubd-cron 2>/dev/null   # if running
```

Verify nothing else has DB sessions open before stopping postgres (background workers, manual psql sessions):

```sh
sudo -u postgres psql -tAc "SELECT pid, application_name, client_addr, state
  FROM pg_stat_activity WHERE datname = 'shithub';"
```

If this returns rows other than your own psql session, stop those processes first.

### 2. Stop postgres

```sh
systemctl stop postgresql@16-main
systemctl status postgresql@16-main --no-pager | head -5
# should be "inactive (dead)"
```

### 3. Move the stale pre-init aside (don't delete)

Keeping it lets us undo step 4 instantly if something looks off:

```sh
mv /data/pgdata /data/pgdata.preinit-$(date -u +%Y%m%d)
```

### 4. Copy live data to the volume

`-aHX` preserves permissions, ownership, timestamps, hard links, and xattrs; `--info=progress2` shows a single overall progress line:

```sh
rsync -aHX --info=progress2 \
  /var/lib/postgresql/16/main/ \
  /data/pgdata/
```

Verify the copy looks right:

```sh
ls -la /data/pgdata/ | head
diff <(cd /var/lib/postgresql/16/main && find . -printf '%p %s %m %u:%g\n' | sort) \
     <(cd /data/pgdata && find . -printf '%p %s %m %u:%g\n' | sort) | head
# Empty diff = same paths, sizes, modes, and owners. File contents are
# not compared, and directory sizes can differ harmlessly between filesystems.
```

### 5. Repoint the cluster

Edit the active config:

```sh
sed -i.bak "s|^data_directory = .*|data_directory = '/data/pgdata'|" \
  /etc/postgresql/16/main/postgresql.conf
grep ^data_directory /etc/postgresql/16/main/postgresql.conf
# Expect: data_directory = '/data/pgdata'
```

(The `.bak` copy lets you revert in seconds with `cp postgresql.conf.bak postgresql.conf`.)

### 6. Start postgres on the new path

```sh
systemctl start postgresql@16-main
sleep 2
systemctl is-active postgresql@16-main   # active
sudo -u postgres pg_isready -h /var/run/postgresql
sudo -u postgres psql -tAc "SHOW data_directory;"   # should print /data/pgdata
sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM repos;"
sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM users;"
```

If any of those fail, jump to **Rollback** below.

### 7. Bring the app back up

```sh
systemctl start shithubd-web
systemctl start shithubd-cron 2>/dev/null
systemctl is-active shithubd-web
curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz   # 200
```

### 8. Smoke test

From your laptop (not the droplet):

```sh
curl -fsS -o /dev/null -w '%{http_code} %{time_total}s\n' https://shithub.sh/
# Walk the site briefly: load a repo, view an issue, log in.
```

Confirm the WAL archiver picked up where it left off (within a couple of minutes, `last_archived_time` should be later than the migration time):

```sh
ssh root@shithub.sh 'sudo -u postgres psql -tAc "SELECT last_archived_time FROM pg_stat_archiver;"'
```

Run a fresh backup to confirm the new pgdata is durable end-to-end:

```sh
ssh root@shithub.sh /usr/local/bin/shithub-backup-daily
```

### 9. Cleanup (after a few days of healthy operation)

Don't do this until at least one full daily backup cycle has run from the new location and you've confirmed it landed in Spaces. Then:

```sh
# Reclaim root-disk space: empty the old data directory, keeping the
# directory itself.
find /var/lib/postgresql/16/main -mindepth 1 -delete
# Optional, after confirming /var/lib/postgresql/16/main is empty
# (read the note below first):
# rmdir /var/lib/postgresql/16/main
# Remove the stale pre-init copy set aside in step 3.
rm -rf /data/pgdata.preinit-*
```

The systemd unit's `RequiresMountsFor=/var/lib/postgresql/%I` is a static path: if you remove the empty dir, recreate it with `mkdir -p /var/lib/postgresql/16/main && chown postgres:postgres …` before the next reboot; otherwise `pg_ctlcluster` will refuse to start. Easier: leave the empty dir alone.
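
Deleting the old data directory is the one step in this runbook that cannot be undone, so it may be worth gating the cleanup on what postgres itself reports. A minimal sketch, assuming a hypothetical helper named `confirm_cleanup_safe` (not part of the existing tooling) that compares the reported data directory with the expected one:

```sh
# Hypothetical guard (not part of the existing tooling): succeed only when
# the data directory postgres reports matches the one we expect.
confirm_cleanup_safe() {
  active="$1"    # e.g. output of: psql -tAc "SHOW data_directory;"
  expected="$2"  # e.g. /data/pgdata
  if [ -n "$active" ] && [ "$active" = "$expected" ]; then
    echo "ok: postgres is serving from $expected"
    return 0
  fi
  echo "refusing: postgres reports '${active:-<nothing>}', expected '$expected'" >&2
  return 1
}

# Usage on the droplet (left commented so that pasting this block only
# defines the function):
#   confirm_cleanup_safe \
#     "$(sudo -u postgres psql -tAc 'SHOW data_directory;')" /data/pgdata \
#     && echo "cleanup may proceed"
```

An empty first argument (for example, because psql could not connect) is treated as a failure, so the guard refuses rather than guesses.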
## Rollback

### Mid-migration (steps 4–6 failed)

```sh
systemctl stop postgresql@16-main
# The .bak file exists only if step 5 ran; skip the cp otherwise.
cp /etc/postgresql/16/main/postgresql.conf.bak /etc/postgresql/16/main/postgresql.conf
mv /data/pgdata /data/pgdata.failed-$(date -u +%Y%m%d_%H%M)
mv /data/pgdata.preinit-* /data/pgdata 2>/dev/null   # restore the stale init in case something checks for it
systemctl start postgresql@16-main
systemctl start shithubd-web
systemctl start shithubd-cron 2>/dev/null
```

The original `/var/lib/postgresql/16/main` was never modified, so postgres comes back up against unchanged data. Total recovery time: ~30 seconds.

### Worst case (data corruption observed after step 6)

```sh
# 1. Stop everything that talks to the DB.
systemctl stop shithubd-web shithubd-cron postgresql@16-main
# 2. Find the latest dump in Spaces. lsl prints size, date, time, path,
#    so sort on the date and time columns rather than the leading size.
rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
  lsl spaces-prod:shithub-backups/daily/ | sort -k2,3 | tail -1
# 3. Drop and recreate the cluster from the dump. See restore.md
#    for the full pg_restore procedure.
```

### Nuclear option

Faster than a dump restore: restore the droplet from the DO snapshot taken in the preconditions. This loses any user activity since the snapshot, which is usually only the last few minutes because the snapshot is taken right before migration. Coordinate via the status page if you do this.

## Why this layout

- **Root disk is 77 GB**; pgdata growth is unbounded. A runaway query log or WAL spike can fill it and freeze the whole droplet, including sshd.
- **Block volume is 100 GB**, separately managed, snapshotable in DO independently of the droplet, and detachable. If the droplet is unrecoverable, attaching the volume to a fresh droplet recovers state in minutes.
- **Repos and tmp already live on `/data`**. Moving pgdata finishes the layout the volume was provisioned for.
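
The free-space precondition and the sizes above can also be checked mechanically rather than by eye. A rough sketch, assuming a hypothetical helper `has_headroom` and an arbitrary default multiplier of 10 (neither comes from the existing tooling):

```sh
# Hypothetical helper (not part of the existing tooling): succeed when the
# free space is at least FACTOR times what pgdata currently needs.
# Sizes are in KB; FACTOR is a whole number.
has_headroom() {
  need_kb="$1"
  free_kb="$2"
  factor="${3:-10}"   # assumed default multiplier; tune to taste
  [ "$free_kb" -ge $((need_kb * factor)) ]
}

# Usage on the droplet (left commented so that pasting this block only
# defines the function):
#   need=$(du -sk /var/lib/postgresql/16/main | cut -f1)
#   free=$(df -Pk /data | awk 'NR==2 {print $4}')
#   has_headroom "$need" "$free" 10 && echo "enough room on /data"
```

Keeping the comparison separate from the `du` and `df` calls makes the threshold easy to adjust and to try out with made-up numbers.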