
# Backups

Two-layer scheme, both mandatory:

- **Continuous WAL archive** to `spaces-prod:shithub-wal`. `deploy/postgres/archive_command.sh` ships every WAL segment as Postgres rolls it. RPO is approximately one segment (~16MB or one `archive_timeout`).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...` via `deploy/postgres/backup-daily.sh`. Keeps the most recent 7 on the db host for fast recovery; bucket lifecycle keeps 90 days.
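In essence the archive command is a thin rclone wrapper. A minimal sketch, written here as a shell function for illustration — the deployed script may add locking, logging, or retries, so treat everything below as an assumption and check `deploy/postgres/archive_command.sh` for the real behavior:

```sh
# Sketch of the archive command Postgres invokes as `<script> %p %f`.
# Assumed: remote name and flags match the rest of this doc.
shithub_pg_archive() {
    wal_path=$1   # %p: path to the finished WAL segment
    wal_name=$2   # %f: the segment's file name
    rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
           copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
}
```

Propagating rclone's exit status matters: Postgres only recycles a segment once the archive command exits 0, so a swallowed failure silently loses WAL.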

Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs hourly from the backup host into a second-region DR bucket.
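The sync script itself presumably boils down to a pair of `rclone sync` calls; a sketch as a shell function, with bucket names taken from this doc and everything else assumed:

```sh
# Sketch of deploy/spaces/sync-cross-region.sh (assumed shape; the real
# script may add bandwidth limits, logging, or a lock file).
sync_cross_region() {
    rclone --config /etc/rclone-shithub.conf sync \
           spaces-prod:shithub-backups spaces-dr:shithub-backups-dr
    rclone --config /etc/rclone-shithub.conf sync \
           spaces-prod:shithub-wal spaces-dr:shithub-wal-dr
}
```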

## WAL archiving — first-time setup

The original provision script created `shithub-backups` and `shithub-backups-dr` but NOT the WAL buckets, so a fresh install ships zero WAL segments until the operator runs through this once:

1. **Create the WAL buckets** (DO Spaces dashboard — `doctl` doesn't manage buckets):
   - `shithub-wal` in the primary region
   - `shithub-wal-dr` in the DR region
2. **Extend the prod RW Spaces key** (Settings → API → Spaces Keys → Edit) to grant `readwrite` on `shithub-wal`. The `dr` key needs `readwrite` on `shithub-wal-dr` so `sync-cross-region.sh` can push.
3. **Confirm the rclone config on the app droplet** has both keys (`/etc/rclone-shithub.conf`: the `spaces-prod` and `spaces-dr` remotes).
4. **Re-run ansible** (or drop the conf.d file by hand at `/etc/postgresql/16/main/conf.d/99_shithub_archive.conf`), then `systemctl restart postgresql@16-main`; the `archive_mode` change needs a restart, not a reload.
5. **Verify within ~60 s**:

   ```sh
   sudo -u postgres psql -xc 'SELECT * FROM pg_stat_archiver;'
   # last_archived_wal: 000000010000000000000003 (or similar)
   # last_archived_time: <recent timestamp>
   # failed_count: 0
   rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
          lsf spaces-prod:shithub-wal/ --recursive | head
   ```
6. **If `failed_count > 0`** before any successful archive: `journalctl -u postgresql@16-main -n 100 | grep archive` shows the rclone error. Most common: bucket name typo, key grant missing, or the `--s3-no-check-bucket` flag is missing from the archive script (re-check `/usr/local/bin/shithub-pg-archive`).
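For orientation, the conf.d file from step 4 presumably contains something like the following — a sketch only, and the `archive_timeout` value here is an assumption; the ansible template is authoritative:

```ini
# /etc/postgresql/16/main/conf.d/99_shithub_archive.conf (assumed contents)
archive_mode = on
archive_command = '/usr/local/bin/shithub-pg-archive %p %f'
archive_timeout = '300s'   # assumed value; bounds RPO when write traffic is low
```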

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if `time() - shithubd_backup_last_success_seconds > 30h`. The backup script pushes the timestamp to the metrics endpoint on success.
- `pg_stat_archiver.failed_count > 0` is paged via the `archive-failing` runbook.
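The 30 h threshold is just "one daily backup plus slack". The same condition, mirrored locally in shell for illustration (the real alert is evaluated by the monitoring stack, not on the host):

```sh
# Local mirror of the BackupOverdue condition, using an example timestamp 31 h old.
now=$(date -u +%s)
last_success=$(( now - 31 * 3600 ))   # pretend the last success was 31 h ago
age=$(( now - last_success ))
if [ "$age" -gt $(( 30 * 3600 )) ]; then status=overdue; else status=ok; fi
echo "$status"   # overdue
```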

If you want to confirm by hand:

```sh
ssh db
sudo -u postgres rclone --config /etc/rclone-shithub.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```

## Quarterly restore drill

We verify the restore path **every quarter**. The calendar entry lives in the team calendar; the procedure is in `restore.md`.

Required outputs of a successful drill:

1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass.
3. The drill log has a "restore drill OK" line and is archived to `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.

If the drill fails: open an incident immediately. A failing drill means our backups can't actually restore; we treat that as P0.
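One portable way to compute the `<YYYY-QQ>` prefix for the drill path, assuming it means a label like `2025-Q1` — a sketch only; `run.sh` may format it differently:

```sh
# Current calendar quarter as a "YYYY-Qn" label.
m=$(date -u +%m); m=${m#0}               # month 1..12, leading zero stripped
quarter=$(( (m - 1) / 3 + 1 ))           # 1..4
drill_prefix="$(date -u +%Y)-Q${quarter}"
echo "$drill_prefix"
```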

## Missed backup

**Symptom:** `BackupOverdue` alert.

1. SSH to db host. `systemctl status shithub-backup-daily.timer` and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely: the script ran but `rclone copyto` failed (creds, network). Re-run by hand: `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an incident — every additional day extends the RPO of an actual recovery.
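For step 1 triage it helps to know roughly what the timer looks like. A hypothetical sketch — the unit names come from this doc, but the schedule and everything else here are assumptions; the ansible-managed unit is authoritative:

```ini
# shithub-backup-daily.timer (assumed shape)
[Unit]
Description=Daily logical dump of the shithub database

[Timer]
OnCalendar=*-*-* 03:00:00 UTC   # assumed time of day
Persistent=true                 # catch up after a reboot instead of skipping a day

[Install]
WantedBy=timers.target
```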