# Backups
Two-layer scheme, both mandatory:
- **Continuous WAL archive** to `spaces-prod:shithub-wal` —
  `deploy/postgres/archive_command.sh` ships every WAL segment as
  Postgres rolls it (sketched below). RPO is approximately one segment
  (~16MB or one `archive_timeout`).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...`
  via `deploy/postgres/backup-daily.sh`. Keeps the most recent 7
  on the db host for fast recovery; bucket lifecycle keeps 90 days.
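For orientation, a minimal sketch of the shape of that archive command, assuming it is the script installed as `/usr/local/bin/shithub-pg-archive` (referenced in the setup steps below); the real body lives in `deploy/postgres/archive_command.sh`:

```sh
#!/bin/sh
# Hypothetical sketch, not the repo script. Postgres would invoke it as:
#   archive_command = '/usr/local/bin/shithub-pg-archive %p %f'
set -eu
wal_path="$1"   # %p: path to the segment, relative to the data directory
wal_name="$2"   # %f: bare segment file name
exec rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
    copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
```

Postgres retries the same segment until the command exits 0, so the script must not report success before the object is durably stored.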
Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs
hourly from the backup host into a second-region DR bucket.
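The mirror is plausibly little more than one `rclone sync` per bucket pair. A sketch, assuming it covers both the WAL and dump buckets (only the WAL pair is confirmed by the setup steps below):

```sh
# Hypothetical core of deploy/spaces/sync-cross-region.sh. Remote and
# bucket names come from this runbook; the exact pairs are an assumption.
rclone --config /etc/rclone-shithub.conf sync \
    spaces-prod:shithub-wal spaces-dr:shithub-wal-dr
rclone --config /etc/rclone-shithub.conf sync \
    spaces-prod:shithub-backups spaces-dr:shithub-backups-dr
```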
## WAL archiving — first-time setup
The original provision script created `shithub-backups` and
`shithub-backups-dr` but NOT the WAL buckets, so a fresh install
ships zero WAL segments until the operator runs through this once:
1. **Create the WAL buckets** (DO Spaces dashboard — `doctl` doesn't
   manage buckets):
   - `shithub-wal` in the primary region
   - `shithub-wal-dr` in the DR region
2. **Extend the prod RW Spaces key** (Settings → API → Spaces Keys →
   Edit) to grant `readwrite` on `shithub-wal`. The `dr` key needs
   `readwrite` on `shithub-wal-dr` so `sync-cross-region.sh` can push.
3. **Confirm the rclone config on the app droplet** has both keys
   (`/etc/rclone-shithub.conf` — `spaces-prod` and
   `spaces-dr` remotes).
4. **Re-run ansible** (or drop the conf.d file by hand at
   `/etc/postgresql/16/main/conf.d/99_shithub_archive.conf` — see the
   sketch after this list), then `systemctl restart postgresql@16-main`
   — `archive_mode` change needs a restart, not a reload.
5. **Verify within ~60 s**:
   ```sh
   sudo -u postgres psql -xc 'SELECT * FROM pg_stat_archiver;'
   # last_archived_wal: 000000010000000000000003 (or similar)
   # last_archived_time: <recent timestamp>
   # failed_count: 0
   rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
       lsf spaces-prod:shithub-wal/ --recursive | head
   ```
6. **If `failed_count > 0`** before any successful archive:
   `journalctl -u postgresql@16-main -n 100 | grep archive` shows
   the rclone error. Most common: bucket name typo, key grant
   missing, or the `--s3-no-check-bucket` flag is missing from the
   archive script (re-check `/usr/local/bin/shithub-pg-archive`).
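For steps 3 and 4, the checks and the by-hand file drop might look like this. The `archive_timeout` value is an assumption (use whatever ansible manages); ansible remains the source of truth for the conf.d file:

```sh
# Step 3 sanity check: both remotes should be listed.
rclone --config /etc/rclone-shithub.conf listremotes   # expect: spaces-prod: and spaces-dr:

# Step 4 by hand (sketch only; archive_timeout here is an assumption):
cat > /etc/postgresql/16/main/conf.d/99_shithub_archive.conf <<'EOF'
archive_mode = on
archive_command = '/usr/local/bin/shithub-pg-archive %p %f'
archive_timeout = 60s
EOF
systemctl restart postgresql@16-main   # archive_mode change needs a restart
```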
## Verifying that backups are healthy
The monitoring stack does this for you:
- `BackupOverdue` alert fires if `time() -
  shithubd_backup_last_success_seconds > 30h`. The backup script
  pushes the timestamp to the metrics endpoint on success (push
  sketched below).
- `pg_stat_archiver.failed_count > 0` is paged via the
  `archive-failing` runbook.
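The push half of `BackupOverdue` could be as small as the following; the mechanism (a Prometheus pushgateway), its URL, and the job name are all assumptions here, only the metric name comes from this runbook:

```sh
# Hypothetical success push at the end of backup-daily.sh. The metric is a
# unix timestamp, which is what the time() - ... > 30h alert expects.
printf 'shithubd_backup_last_success_seconds %s\n' "$(date +%s)" |
    curl -fsS --data-binary @- \
        http://pushgateway.internal:9091/metrics/job/shithub-backup-daily
```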
If you want to confirm by hand:
```sh
ssh db
sudo -u postgres rclone --config /etc/rclone-shithub.conf \
    lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
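The same hand check works for the WAL side; the `lsl | sort | tail` pipeline is a suggestion (not an existing helper) for eyeballing whether the newest segments are recent:

```sh
ssh db
sudo -u postgres rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
    lsl spaces-prod:shithub-wal/ | sort -k2,3 | tail -3   # newest segments last
```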
## Quarterly restore drill
We verify the restore path **every quarter**. The calendar entry
lives in the team calendar; the procedure is in `restore.md`.
Required outputs of a successful drill:
1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass.
3. The drill log has a "restore drill OK" line and is archived to
   `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.
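A run that captures all three outputs might look like the sketch below; the log path and quarter computation are illustrative, and `restore.md` remains the procedure of record:

```sh
deploy/restore-drill/run.sh >/tmp/restore-drill.log 2>&1        # output 1: must exit 0
echo "run.sh exit: $?"
grep 'restore drill OK' /tmp/restore-drill.log                  # output 3: OK line present
quarter="$(date -u +%Y)-Q$(( ($(date -u +%-m) - 1) / 3 + 1 ))"  # e.g. 2025-Q1
rclone --config /etc/rclone-shithub.conf copyto /tmp/restore-drill.log \
    "spaces-prod:shithub-backups/drills/${quarter}/restore-drill.log"
```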
If the drill fails: open an incident immediately. A failing drill means our backups can't actually restore; we treat that as P0.
## Missed backup
**Symptom:** `BackupOverdue` alert.
1. SSH to db host. `systemctl status shithub-backup-daily.timer`
   and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely: the script ran but `rclone copyto` failed (creds,
   network). Re-run by hand:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an
   incident — every additional day extends the RPO of an actual
   recovery.
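The steps above as one paste-able sequence (every command is from this runbook; the final `lsf` reuses the hand check from the verification section):

```sh
systemctl status shithub-backup-daily.timer
journalctl -u shithub-backup-daily.service -n 200 --no-pager
sudo -u postgres /usr/local/bin/shithub-backup-daily             # re-run by hand
sudo -u postgres rclone --config /etc/rclone-shithub.conf \
    lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/  # confirm the dump landed
```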