@@ -0,0 +1,244 @@ |
| 1 | +# Move Postgres data from root disk to block volume |
| 2 | + |
| 3 | +One-time migration. After this lands, the droplet's root disk |
| 4 | +holds only the OS + binaries; all stateful data (pgdata, repos, |
| 5 | +tmp) lives on the attached block volume mounted at `/data`. The |
| 6 | +goal is twofold: (1) the root disk can never fill up from |
| 7 | +runaway DB growth and wedge the system, and (2) the volume can be
| 8 | +detached and reattached to a replacement droplet without losing |
| 9 | +state if the host ever needs replacing. |
| 10 | + |
| 11 | +## Preconditions (verify before scheduling) |
| 12 | + |
| 13 | +- **Block volume mounted at `/data`** with enough free space for |
| 14 | + pgdata + headroom for several months of growth. Check: |
| 15 | + ```sh |
| 16 | + df -h /data |
| 17 | + ``` |
| 18 | +- **`/data/pgdata` does not contain a live cluster.** It may |
| 19 | + contain a stale `initdb` from when the volume was first |
| 20 | + provisioned — that's fine and we'll move it aside. |
| 21 | + ```sh |
| 22 | + ls -la /data/pgdata |
| 23 | + # Expect a PG_VERSION file dated to volume-attach time, NOT |
| 24 | + # to "minutes ago". If "minutes ago", STOP — something is |
| 25 | + # already running there. |
| 26 | + ``` |
| 27 | +- **Recent dump landed in Spaces** (not just locally). Worst-case |
| 28 | + rollback is restoring this dump: |
| 29 | + ```sh |
| 30 | + rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \ |
| 31 | + lsl spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ | tail -3 |
| 32 | + ``` |
| 33 | + If today's directory is empty, run a fresh dump first: |
| 34 | + ```sh |
| 35 | + /usr/local/bin/shithub-backup-daily |
| 36 | + ``` |
| 37 | +- **WAL archiver is healthy** — gives PITR coverage for changes |
| 38 | + between the last dump and migration time: |
| 39 | + ```sh |
| 40 | + sudo -u postgres psql -tAc "SELECT last_archived_time, last_failed_time FROM pg_stat_archiver;" |
| 41 | +# last_archived_time should be < 5 min ago; last_failed_time blank or older than last_archived_time
| 42 | + ``` |
| 43 | +- **DigitalOcean snapshot taken** via the DO dashboard |
| 44 | + (Droplets → shithub-prod → Snapshots → Take Snapshot). This |
| 45 | + is the panic-button rollback if everything else fails. |
| 46 | + Snapshots take a few minutes; DON'T start the migration |
| 47 | + until the snapshot completes. |
| 48 | +- **Notify users**: site will be down for ~3 minutes for a
| 49 | + ~115 MB pgdata. Scale the window with current `du -sh |
| 50 | + /var/lib/postgresql/16/main`. |
| 51 | + |
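The free-space precondition above can be turned into a pass/fail check instead of an eyeball of `df -h`. A minimal sketch, assuming a hypothetical helper `has_headroom` (not part of this runbook) and an arbitrary 3x headroom factor; pick whatever factor matches your growth expectations:

```shell
# Hypothetical helper: succeed only if DEST has at least FACTOR times
# the on-disk size of SRC available. FACTOR defaults to 3 (an assumption).
has_headroom() {
    src=$1 dest=$2 factor=${3:-3}
    # Size of the source tree in KiB.
    need_kb=$(du -sk "$src" | awk '{print $1}')
    # Available KiB on the filesystem holding DEST (POSIX output format).
    free_kb=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
    [ "$free_kb" -ge $((need_kb * factor)) ]
}

# Example (run on the droplet):
# has_headroom /var/lib/postgresql/16/main /data 3 && echo "enough space"
```

A non-zero exit means the volume has less than the requested multiple free, which is a reason to resize the volume before scheduling the migration.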
| 52 | +## Migration |
| 53 | + |
| 54 | +Total downtime: ~3 minutes for a ~115 MB pgdata. Most of that is
| 55 | +postgres clean-shutdown + start; the rsync itself is seconds. |
| 56 | + |
| 57 | +```sh |
| 58 | +ssh root@shithub.sh |
| 59 | +``` |
| 60 | + |
| 61 | +### 1. Drain writes |
| 62 | + |
| 63 | +```sh |
| 64 | +systemctl stop shithubd-web |
| 65 | +systemctl stop shithubd-cron 2>/dev/null # if running |
| 66 | +``` |
| 67 | + |
| 68 | +Verify nothing else has DB sessions open before stopping |
| 69 | +postgres (background workers, manual psql sessions): |
| 70 | + |
| 71 | +```sh |
| 72 | +sudo -u postgres psql -tAc "SELECT pid, application_name, client_addr, state FROM pg_stat_activity WHERE datname = 'shithub';" |
| 73 | +``` |
| 74 | + |
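| 75 | +If this returns any rows, terminate those sessions first, e.g.
| 76 | +with `SELECT pg_terminate_backend(<pid>);` (your own `psql -c` connects to the `postgres` database, so it is not listed).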
| 77 | + |
| 78 | +### 2. Stop postgres |
| 79 | + |
| 80 | +```sh |
| 81 | +systemctl stop postgresql@16-main |
| 82 | +systemctl status postgresql@16-main --no-pager | head -5 # should be "inactive (dead)" |
| 83 | +``` |
| 84 | + |
| 85 | +### 3. Move the stale pre-init aside (don't delete)
| 86 | + |
| 87 | +Keeping it lets us undo step 4 instantly if something looks off: |
| 88 | + |
| 89 | +```sh |
| 90 | +mv /data/pgdata /data/pgdata.preinit-$(date -u +%Y%m%d) |
| 91 | +``` |
| 92 | + |
| 93 | +### 4. Copy live data to the volume |
| 94 | + |
| 95 | +`rsync -aHX` preserves perms, owners, timestamps, hard links,
| 96 | +and xattrs. `--info=progress2` shows a single overall
| 97 | +progress line:
| 98 | + |
| 99 | +```sh |
| 100 | +rsync -aHX --info=progress2 \ |
| 101 | + /var/lib/postgresql/16/main/ \ |
| 102 | + /data/pgdata/ |
| 103 | +``` |
| 104 | + |
| 105 | +Verify the copy looks right: |
| 106 | + |
| 107 | +```sh |
| 108 | +ls -la /data/pgdata/ | head |
| 109 | +diff <(cd /var/lib/postgresql/16/main && find . -printf '%p %s %m %u:%g\n' | sort) \ |
| 110 | + <(cd /data/pgdata && find . -printf '%p %s %m %u:%g\n' | sort) | head |
| 111 | +# Empty diff = same paths, sizes, modes, and owners (file contents are not compared).
| 112 | +``` |
| 113 | + |
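The metadata diff in step 4 compares paths, sizes, modes, and owners only. If you also want to compare file contents, one option is a checksum pass while postgres is still stopped. A sketch, assuming a hypothetical helper `same_contents` and GNU `find`, `sort`, `xargs`, and `sha256sum`:

```shell
# Hypothetical helper: return 0 only if both trees contain the same
# regular files with the same SHA-256 digests. Returns 2 if either
# argument is not a directory.
same_contents() {
    [ -d "$1" ] && [ -d "$2" ] || return 2
    # Hash every regular file, in a stable (sorted) order, relative to
    # each tree's root so the path columns are comparable.
    a=$(cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum)
    b=$(cd "$2" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum)
    [ "$a" = "$b" ]
}

# Example (run on the droplet, postgres stopped):
# same_contents /var/lib/postgresql/16/main /data/pgdata && echo "contents match"
```

This reads every file twice, so it costs more than the metadata diff; for a pgdata of this size that should still be a matter of seconds.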
| 114 | +### 5. Repoint the cluster |
| 115 | + |
| 116 | +Edit the active config: |
| 117 | + |
| 118 | +```sh |
| 119 | +sed -i.bak "s|^data_directory = .*|data_directory = '/data/pgdata'|" \ |
| 120 | + /etc/postgresql/16/main/postgresql.conf |
| 121 | +grep ^data_directory /etc/postgresql/16/main/postgresql.conf |
| 122 | +# Expect: data_directory = '/data/pgdata' |
| 123 | +``` |
| 124 | + |
| 125 | +(`.bak` lets you `cp postgresql.conf.bak postgresql.conf` to |
| 126 | +revert in 1 second if needed.) |
| 127 | + |
| 128 | +### 6. Start postgres on the new path |
| 129 | + |
| 130 | +```sh |
| 131 | +systemctl start postgresql@16-main |
| 132 | +sleep 2 |
| 133 | +systemctl is-active postgresql@16-main # active |
| 134 | +sudo -u postgres pg_isready -h /var/run/postgresql |
| 135 | +sudo -u postgres psql -tAc "SHOW data_directory;" # should print /data/pgdata |
| 136 | +sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM repos;" |
| 137 | +sudo -u postgres psql -d shithub -tAc "SELECT count(*) FROM users;" |
| 138 | +``` |
| 139 | + |
| 140 | +If any of those fail, jump to **Rollback** below. |
| 141 | + |
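The fixed `sleep 2` in step 6 may not be long enough on a slower start. One alternative is to poll `pg_isready` until it succeeds. A sketch, assuming a hypothetical `retry` helper and an arbitrary 30-attempt limit:

```shell
# Hypothetical helper: run a command up to N times, one second apart,
# and return 0 as soon as it succeeds.
retry() {
    tries=$1; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example (run on the droplet after `systemctl start postgresql@16-main`):
# retry 30 sudo -u postgres pg_isready -h /var/run/postgresql
```

`pg_isready` exits 0 only once the server accepts connections, so a failed `retry` here is a signal to go to the rollback section rather than to keep waiting.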
| 142 | +### 7. Bring the app back up |
| 143 | + |
| 144 | +```sh |
| 145 | +systemctl start shithubd-web |
| 146 | +systemctl start shithubd-cron 2>/dev/null |
| 147 | +systemctl is-active shithubd-web |
| 148 | +curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/healthz # 200 |
| 149 | +``` |
| 150 | + |
| 151 | +### 8. Smoke test |
| 152 | + |
| 153 | +From your laptop (not the droplet): |
| 154 | + |
| 155 | +```sh |
| 156 | +curl -fsS -o /dev/null -w '%{http_code} %{time_total}s\n' https://shithub.sh/ |
| 157 | +# Walk the site briefly: load a repo, view an issue, log in. |
| 158 | +``` |
| 159 | + |
| 160 | +Confirm the WAL archiver picked up where it left off (next |
| 161 | +archive timestamp should be > migration time within a couple of |
| 162 | +minutes): |
| 163 | + |
| 164 | +```sh |
| 165 | +ssh root@shithub.sh 'sudo -u postgres psql -tAc "SELECT last_archived_time FROM pg_stat_archiver;"' |
| 166 | +``` |
| 167 | + |
| 168 | +Run a fresh backup to confirm the new pgdata is durable end-to-end: |
| 169 | + |
| 170 | +```sh |
| 171 | +ssh root@shithub.sh /usr/local/bin/shithub-backup-daily |
| 172 | +``` |
| 173 | + |
| 174 | +### 9. Cleanup (after a few days of healthy operation) |
| 175 | + |
| 176 | +Don't do this until you've slept on at least one full daily |
| 177 | +backup cycle from the new location and confirmed it landed in |
| 178 | +Spaces. Then: |
| 179 | + |
| 180 | +```sh |
| 181 | +# Reclaim root-disk space: delete the old cluster's files (confirm SHOW data_directory prints /data/pgdata first).
| 182 | +find /var/lib/postgresql/16/main -mindepth 1 -delete
| 183 | +# (Optional, only after confirming /var/lib/postgresql/16/main is empty; see the note below)
| 184 | +rmdir /var/lib/postgresql/16/main 2>/dev/null
| 185 | +rm -rf /data/pgdata.preinit-*
| 186 | +``` |
| 187 | + |
| 188 | +The systemd unit's `RequiresMountsFor=/var/lib/postgresql/%I` |
| 189 | +is a static path — if you remove the empty dir, recreate it as |
| 190 | +`mkdir -p /var/lib/postgresql/16/main && chown postgres:postgres |
| 191 | +…` before the next reboot, otherwise `pg_ctlcluster` will refuse |
| 192 | +to start. Easier: leave the empty dir alone. |
| 193 | + |
| 194 | +## Rollback |
| 195 | + |
| 196 | +### Mid-migration (steps 4–6 failed) |
| 197 | + |
| 198 | +```sh |
| 199 | +systemctl stop postgresql@16-main |
| 200 | +cp /etc/postgresql/16/main/postgresql.conf.bak /etc/postgresql/16/main/postgresql.conf |
| 201 | +mv /data/pgdata /data/pgdata.failed-$(date -u +%Y%m%d_%H%M) |
| 202 | +mv /data/pgdata.preinit-* /data/pgdata 2>/dev/null # restore the stale init in case something checks for it |
| 203 | +systemctl start postgresql@16-main |
| 204 | +systemctl start shithubd-web |
| 205 | +``` |
| 206 | + |
| 207 | +The original `/var/lib/postgresql/16/main` was never modified, so |
| 208 | +postgres comes back up against unchanged data. Total recovery |
| 209 | +time: ~30 seconds. |
| 210 | + |
| 211 | +### Worst case (data corruption observed after step 6) |
| 212 | + |
| 213 | +```sh |
| 214 | +# 1. Stop everything that talks to the DB. |
| 215 | +systemctl stop shithubd-web shithubd-cron postgresql@16-main |
| 216 | + |
| 217 | +# 2. Find the latest dump in Spaces. |
| 218 | +rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \ |
| 219 | + lsl spaces-prod:shithub-backups/daily/ | sort -k2,3 | tail -1  # sort by modtime; a plain sort orders by size
| 220 | + |
| 221 | +# 3. Drop and recreate the cluster from the dump. See restore.md |
| 222 | +# for the full pg_restore procedure. |
| 223 | +``` |
| 224 | + |
| 225 | +If restoring from the dump would take too long, use the snapshot instead:
| 226 | + |
| 227 | +### Nuclear option |
| 228 | + |
| 229 | +Restore the droplet from the DO snapshot taken in preconditions. |
| 230 | +Loses any user activity since the snapshot — usually that's |
| 231 | +"the last few minutes" since the snapshot is taken right before |
| 232 | +migration. Coordinate via status page if you do this. |
| 233 | + |
| 234 | +## Why this layout |
| 235 | + |
| 236 | +- **Root disk is 77 GB**; pgdata growth is unbounded. A |
| 237 | + runaway query log or WAL spike can fill it and freeze the |
| 238 | + whole droplet, including sshd. |
| 239 | +- **Block volume is 100 GB**, separately managed, snapshotable |
| 240 | + in DO independently of the droplet, and detachable. If the |
| 241 | + droplet is unrecoverable, attaching the volume to a fresh |
| 242 | + droplet recovers state in minutes. |
| 243 | +- **Repos and tmp already live on `/data`**. Moving pgdata |
| 244 | + finishes the layout the volume was provisioned for. |