Backups
Two-layer scheme, both mandatory:
- Continuous WAL archive to
spaces-prod:shithub-wal—deploy/postgres/archive_command.shships every WAL segment as Postgres rolls it. RPO is approximately one segment (~16MB or onearchive_timeout). - Daily logical dump to
spaces-prod:shithub-backups/daily/...viadeploy/postgres/backup-daily.sh. Keeps the most recent 7 on the db host for fast recovery; bucket lifecycle keeps 90 days.
Cross-region mirror (deploy/spaces/sync-cross-region.sh) runs
hourly from the backup host into a second-region DR bucket.
Verifying that backups are healthy
The monitoring stack does this for you:
BackupOverduealert fires iftime() - shithubd_backup_last_success_seconds > 30h. The backup script pushes the timestamp to the metrics endpoint on success.pg_stat_archiver.failed_count > 0is paged via thearchive-failingrunbook.
If you want to confirm by hand:
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
Quarterly restore drill
We verify the restore path every quarter. The calendar entry
lives in the team calendar; the procedure is in restore.md.
Required outputs of a successful drill:
deploy/restore-drill/run.shexits 0.- The smoke queries pass.
- The drill log has a "restore drill OK" line and is archived to
spaces-prod:shithub-backups/drills/<YYYY-QQ>/.
If the drill fails: open an incident immediately. A failing drill means our backups can't actually restore; we treat that as P0.
Missed backup
Symptom: BackupOverdue alert.
- SSH to db host.
systemctl status shithub-backup-daily.timerandjournalctl -u shithub-backup-daily.service -n 200. - Most likely: the script ran but
rclone copytofailed (creds, network). Re-run by hand:sudo -u postgres /usr/local/bin/shithub-backup-daily. - If the script has been failing silently for >24h, file an incident — every additional day extends the RPO of an actual recovery.
View source
| 1 | # Backups |
| 2 | |
| 3 | Two-layer scheme, both mandatory: |
| 4 | |
| 5 | - **Continuous WAL archive** to `spaces-prod:shithub-wal` — |
| 6 | `deploy/postgres/archive_command.sh` ships every WAL segment as |
| 7 | Postgres rolls it. RPO is approximately one segment (~16MB or one |
| 8 | `archive_timeout`). |
| 9 | - **Daily logical dump** to `spaces-prod:shithub-backups/daily/...` |
| 10 | via `deploy/postgres/backup-daily.sh`. Keeps the most recent 7 |
| 11 | on the db host for fast recovery; bucket lifecycle keeps 90 days. |
| 12 | |
| 13 | Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs |
| 14 | hourly from the backup host into a second-region DR bucket. |
| 15 | |
| 16 | ## Verifying that backups are healthy |
| 17 | |
| 18 | The monitoring stack does this for you: |
| 19 | |
| 20 | - `BackupOverdue` alert fires if `time() - |
| 21 | shithubd_backup_last_success_seconds > 30h`. The backup script |
| 22 | pushes the timestamp to the metrics endpoint on success. |
| 23 | - `pg_stat_archiver.failed_count > 0` is paged via the |
| 24 | `archive-failing` runbook. |
| 25 | |
| 26 | If you want to confirm by hand: |
| 27 | |
| 28 | ```sh |
| 29 | ssh db |
| 30 | sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \ |
| 31 | lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ |
| 32 | ``` |
| 33 | |
| 34 | ## Quarterly restore drill |
| 35 | |
| 36 | We verify the restore path **every quarter**. The calendar entry |
| 37 | lives in the team calendar; the procedure is in `restore.md`. |
| 38 | |
| 39 | Required outputs of a successful drill: |
| 40 | |
| 41 | 1. `deploy/restore-drill/run.sh` exits 0. |
| 42 | 2. The smoke queries pass. |
| 43 | 3. The drill log has a "restore drill OK" line and is archived to |
| 44 | `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`. |
| 45 | |
| 46 | If the drill fails: open an incident immediately. A failing drill |
| 47 | means our backups can't actually restore; we treat that as P0. |
| 48 | |
| 49 | ## Missed backup |
| 50 | |
| 51 | **Symptom:** `BackupOverdue` alert. |
| 52 | |
| 53 | 1. SSH to db host. `systemctl status shithub-backup-daily.timer` |
| 54 | and `journalctl -u shithub-backup-daily.service -n 200`. |
| 55 | 2. Most likely: the script ran but `rclone copyto` failed (creds, |
| 56 | network). Re-run by hand: |
| 57 | `sudo -u postgres /usr/local/bin/shithub-backup-daily`. |
| 58 | 3. If the script has been failing silently for >24h, file an |
| 59 | incident — every additional day extends the RPO of an actual |
| 60 | recovery. |