
Backups

Two-layer scheme, both mandatory:

  • Continuous WAL archive to spaces-prod:shithub-wal. deploy/postgres/archive_command.sh ships every WAL segment as Postgres rolls it. RPO is approximately one segment: about 16MB of WAL or one archive_timeout interval, whichever comes first.
  • Daily logical dump to spaces-prod:shithub-backups/daily/... via deploy/postgres/backup-daily.sh. The most recent 7 dumps are kept on the db host for fast recovery; the bucket lifecycle policy retains 90 days.
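
The local retention rule above (keep the newest 7 dumps on the db host) could look roughly like the sketch below. This is an illustration, not the contents of backup-daily.sh: the function name, the `.dump` suffix, and the assumption that file names embed a sortable UTC date are all hypothetical.

```sh
#!/bin/sh
# Sketch: keep only the newest N dump files in a local directory.
# Assumptions (not from backup-daily.sh): dumps end in .dump and their
# names contain a sortable date, e.g. shithub-2024-01-09.dump.

prune_local_dumps() {
    dir=$1
    keep=${2:-7}   # default matches the "most recent 7" rule
    # Reverse lexical sort puts the newest name first; everything after
    # the first $keep entries is deleted.
    ls -1 "$dir" | grep '\.dump$' | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r name; do
        rm -f -- "$dir/$name"
    done
}
```

Calling `prune_local_dumps /path/to/dumps 7` after a successful dump would leave the seven newest files in place. Sorting by name rather than mtime is a deliberate choice here: it stays correct even if files are touched or copied later.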

Cross-region mirror (deploy/spaces/sync-cross-region.sh) runs hourly from the backup host into a second-region DR bucket.

Verifying that backups are healthy

The monitoring stack does this for you:

  • BackupOverdue alert fires if time() - shithubd_backup_last_success_seconds > 30h. The backup script pushes the timestamp to the metrics endpoint on success.
  • pg_stat_archiver.failed_count > 0 pages on-call; follow the archive-failing runbook.
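
For intuition, the BackupOverdue condition can be reproduced by hand as plain arithmetic. The sketch below is not the alert rule itself; `backup_overdue` is a hypothetical helper that takes the last-success timestamp (Unix seconds) and applies the same 30h threshold.

```sh
#!/bin/sh
# Sketch: mirror the alert condition "now - last_success > 30h".
# backup_overdue is a hypothetical helper, not part of the monitoring stack.

backup_overdue() {
    last_success=$1            # Unix seconds of the last successful backup
    now=${2:-$(date +%s)}      # Unix seconds; defaults to the current time
    threshold=$((30 * 3600))   # 30h expressed in seconds
    # Exit status 0 means "overdue", non-zero means "within threshold".
    [ $((now - last_success)) -gt "$threshold" ]
}
```

For example, `backup_overdue "$ts" && echo overdue` prints only when the timestamp is more than 30 hours old. For the second check, `sudo -u postgres psql -Atc 'SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver'` shows the archiver counters directly, assuming local psql access as the postgres user.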

If you want to confirm by hand:

ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
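
If the hand check above is scripted, an empty listing should count as a failure rather than a quiet success. A minimal sketch, assuming the listing arrives on stdin one object per line (as `rclone lsf` prints it); `listing_has_objects` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: succeed only if stdin contains at least one non-blank line.
# listing_has_objects is a hypothetical helper, not an rclone feature.

listing_has_objects() {
    grep -q '[^[:space:]]'
}
```

Piping the `rclone ... lsf ...` command above into `listing_has_objects && echo "dump present"` would then print only when today's prefix contains at least one object.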

Quarterly restore drill

We verify the restore path every quarter. The calendar entry lives in the team calendar; the procedure is in restore.md.

Required outputs of a successful drill:

  1. deploy/restore-drill/run.sh exits 0.
  2. The smoke queries pass.
  3. The drill log has a "restore drill OK" line and is archived to spaces-prod:shithub-backups/drills/<YYYY-QQ>/.
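
Outputs 1 and 3 above are mechanical enough to check in a script. The sketch below assumes the caller has captured the exit code of run.sh and knows the path of the drill log; `drill_passed` is a hypothetical helper, and it does not cover the smoke queries or the upload to the drills/ prefix.

```sh
#!/bin/sh
# Sketch: check two of the required drill outputs.
# drill_passed is a hypothetical helper; the log path is supplied by the caller.

drill_passed() {
    exit_code=$1   # exit status captured from deploy/restore-drill/run.sh
    log_file=$2    # path to the drill log
    # Both conditions must hold: clean exit and the marker line in the log.
    [ "$exit_code" -eq 0 ] && grep -q 'restore drill OK' "$log_file"
}
```

A wrapper could run the drill, save `$?`, and call `drill_passed "$rc" "$log"` before archiving the log, so that a drill that exits 0 without writing the marker line is still flagged.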

If the drill fails: open an incident immediately. A failing drill means our backups can't actually restore; we treat that as P0.

Missed backup

Symptom: BackupOverdue alert.

  1. SSH to the db host and check systemctl status shithub-backup-daily.timer and journalctl -u shithub-backup-daily.service -n 200.
  2. Most likely cause: the script ran but rclone copyto failed (credentials or network). Re-run by hand: sudo -u postgres /usr/local/bin/shithub-backup-daily.
  3. If the script has been failing silently for more than 24h, file an incident: every additional day extends the RPO of an actual recovery.