# Backup & restore

Two layers, both mandatory:

- **Continuous WAL archive** — `deploy/postgres/archive_command.sh`
  ships every WAL segment to `spaces-prod:shithub-wal` in real
  time. Recovery point objective (RPO) ≈ one WAL segment (~16 MB
  or one `archive_timeout`); see the sketch after this list.
- **Daily logical dump** — `deploy/postgres/backup-daily.sh`
  takes a `pg_dump --format=custom` once per day and ships it to
  `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. Keeps the most
  recent 7 on the db host.
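
For orientation, the WAL layer amounts to roughly the wiring
below. This is a sketch of the shape, not the contents of the
real `archive_command.sh`, and the installed path is an
assumption:

```sh
#!/bin/sh
# archive_command.sh, approximately. Wired up in postgresql.conf as:
#   archive_mode    = on
#   archive_command = '/usr/local/bin/archive_command.sh %p %f'
# Postgres substitutes %p (segment path) and %f (file name); a
# non-zero exit makes Postgres keep the segment and retry later.
exec rclone --config /etc/rclone-shithub.conf \
    copyto "$1" "spaces-prod:shithub-wal/$2"
```
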
Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
both buckets to a second region for DR. Lifecycle in
`deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps
after 90.
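
The cross-region mirror is, in effect, two bucket syncs. A
sketch, assuming the DR remote is named `spaces-dr` (check the
real script for the actual remote name):

```sh
# Roughly what sync-cross-region.sh amounts to.
rclone --config /etc/rclone-shithub.conf sync \
    spaces-prod:shithub-wal spaces-dr:shithub-wal
rclone --config /etc/rclone-shithub.conf sync \
    spaces-prod:shithub-backups spaces-dr:shithub-backups
```
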

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if no successful backup in 30h.
- `pg_stat_archiver.failed_count > 0` is paged via the
  archive-failing alert.

By hand:

```sh
ssh db
sudo -u postgres rclone --config /etc/rclone-shithub.conf \
    lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
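
The listing above only proves a dump landed today. To eyeball the
WAL-archive side as well, `pg_stat_archiver` is standard Postgres:

```sh
sudo -u postgres psql -x -c \
    "SELECT archived_count, last_archived_wal, last_archived_time,
            failed_count, last_failed_wal
       FROM pg_stat_archiver;"
```
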

## Restore drills

Run **quarterly**. The restore drill restores the latest dump
into a temp Postgres instance, runs smoke queries, tears down.
Production is untouched.

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection. Default is
clean-up.
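
For context on what the drill exercises, its shape is roughly the
following. A sketch only, not the real `shithub-restore-drill`;
the port and the smoke-query table are assumptions:

```sh
# Throwaway instance, restore, smoke-check, tear down.
# Run as an unprivileged user; initdb refuses to run as root.
DUMP=/path/to/latest.dump
tmp=$(mktemp -d)
initdb -D "$tmp/pgdata" >/dev/null
pg_ctl -D "$tmp/pgdata" -o '-p 5499' -w start
createdb -p 5499 drill
pg_restore --port=5499 --dbname=drill --no-owner --no-privileges "$DUMP"
psql -p 5499 -d drill -c 'SELECT count(*) FROM users;'  # smoke query; table assumed
pg_ctl -D "$tmp/pgdata" -w stop
rm -rf "$tmp"
```
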
A failed drill is a P0: it means our backups can't actually restore. File an incident immediately.

## Real restore — full DB lost

Use this when:
- The DB host is destroyed.
- Postgres can't start and there's no working WAL.
- 24 hours of data loss is acceptable.

1. Spin up a new db host
   (`make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=db-new
   ANSIBLE_TAGS=db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto \
       spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump \
       /tmp/restore.dump
   ```
3. Restore:
   ```sh
   pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump
   ```
4. Confirm schema: `shithubd migrate status`.
5. Bring the app up against the new DB (update `web.env` and
   `worker.env`; see the sketch after this list).
6. **Notify users** — there will be a visible activity gap.
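
For step 5, the edit is usually a one-line connection-string
change plus a restart. The variable name and file paths below are
assumptions; match whatever the real `web.env` uses:

```sh
# Hypothetical: DATABASE_URL and /etc/shithub/ are assumed names.
sed -i 's|^DATABASE_URL=.*|DATABASE_URL=postgres://shithub@db-new:5432/shithub|' \
    /etc/shithub/web.env /etc/shithub/worker.env
systemctl restart shithubd-web shithubd-worker
```
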

## Real restore — point-in-time

Use this when data is intact in WAL but a destructive change
happened at a known time (`DROP TABLE`, mass UPDATE, runaway
worker).
This is harder. The procedure:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker
   shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident
   into a fresh data directory.
3. Configure recovery (`recovery.signal` + GUCs in PG16) pointing
   at the WAL archive in Spaces with
   `recovery_target_time = '<UTC timestamp just before incident>'`;
   see the sketch after this list.
4. Start Postgres in recovery mode; let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.
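
A minimal shape for step 3, assuming PG16 with its config under
`$PGDATA` (Debian packaging keeps it in `/etc/postgresql`
instead) and the same rclone remote used for archiving; verify
against the real `deploy/postgres/` templates before running:

```sh
# Point recovery at the WAL archive and set the stop target.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command        = 'rclone --config /etc/rclone-shithub.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time   = '<UTC timestamp just before incident>'
recovery_target_action = 'promote'  # or 'pause' to inspect before promoting
EOF
touch "$PGDATA/recovery.signal"     # tells Postgres to enter targeted recovery
systemctl start postgresql
```
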

If you have not done a PITR before, **do not** free-style this.
Mistakes here destroy the data you were trying to save. Find
someone who has done it and pair with them.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log against the restored data: entries may
  reference IDs from after the dump that no longer exist (see the
  sketch after this list).
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
  if the dump was compromised, the secrets in it may be too.
- Force-rotate session epochs for any user whose session crossed
  the recovery window.
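
A starting point for the audit-log reconciliation. Every table
and column name here is an assumption about the schema; treat it
as a pattern, not a paste-ready query:

```sh
# Hypothetical schema: audit_log(actor_id, ...) and users(id).
# List audit entries whose actor no longer exists after the restore.
sudo -u postgres psql -d shithub -c "
    SELECT a.id, a.actor_id, a.action, a.created_at
      FROM audit_log a
      LEFT JOIN users u ON u.id = a.actor_id
     WHERE u.id IS NULL
     ORDER BY a.created_at;"
```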