
# Backup & restore

Two layers, both mandatory:

- **Continuous WAL archive** — `deploy/postgres/archive_command.sh` ships every WAL segment to `spaces-prod:shithub-wal` in real time. Recovery point objective (RPO) ≈ one WAL segment (~16 MB or one `archive_timeout`). A sketch of such a script follows this list.
- **Daily logical dump** — `deploy/postgres/backup-daily.sh` takes a `pg_dump --format=custom` once per day and ships it to `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. Keeps the most recent 7 on the db host.
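The archive script can be as small as a single `rclone copyto`. A minimal sketch, reusing the rclone config referenced later on this page; the real `deploy/postgres/archive_command.sh` likely adds retries and metrics:

```sh
#!/bin/sh
# Sketch of a WAL archive_command script (not the real one). Postgres
# calls it via: archive_command = '.../archive_command.sh %p %f'
# $1 = path to the WAL segment, $2 = its file name.
# Exit non-zero on failure so Postgres keeps the segment and retries.
set -eu
exec rclone --config /root/.config/rclone/rclone.conf \
    copyto "$1" "spaces-prod:shithub-wal/$2"
```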

Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors both buckets to a second region for DR. Lifecycle in `deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps after 90.
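Conceptually the cross-region job is just two bucket syncs. A sketch, assuming a second rclone remote named `spaces-dr`; the remote name and any flags in the real script are unknowns here:

```sh
#!/bin/sh
# Sketch: mirror both production buckets to the DR region.
# "spaces-dr" is an assumed remote name. Note that `rclone sync`
# deletes destination objects that no longer exist at the source.
set -eu
rclone sync spaces-prod:shithub-wal     spaces-dr:shithub-wal
rclone sync spaces-prod:shithub-backups spaces-dr:shithub-backups
```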

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` fires if there has been no successful backup in 30h.
- A nonzero `pg_stat_archiver.failed_count` pages via the archive-failing alert.

By hand:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
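You can also ask Postgres directly whether archiving is keeping up; `pg_stat_archiver` is a stock system view, so nothing here is site-specific:

```sh
ssh db
sudo -u postgres psql -xc \
    "SELECT archived_count, last_archived_wal, last_archived_time,
            failed_count, last_failed_wal
       FROM pg_stat_archiver;"
```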

## Restore drills

Run **quarterly**. The restore drill restores the latest dump into a temp Postgres instance, runs smoke queries, tears down. Production is untouched.

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection; the default is to clean up.
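For context, the manual equivalent of a drill fits in a few commands. This is a rough sketch only: the real `shithub-restore-drill` surely differs, and the `users` table in the smoke query is an assumption:

```sh
#!/bin/sh
# Hand-rolled drill: scratch instance, restore, smoke query, teardown.
# Run as an unprivileged user (initdb refuses to run as root).
set -eu
TMP=$(mktemp -d)
initdb -D "$TMP/pgdata" >/dev/null
pg_ctl -D "$TMP/pgdata" -l "$TMP/log" \
    -o "-p 5499 -k $TMP -c listen_addresses=''" start
createdb -h "$TMP" -p 5499 drill
pg_restore -h "$TMP" -p 5499 --dbname=drill --no-owner --no-privileges \
    /tmp/restore.dump
psql -h "$TMP" -p 5499 -d drill -c 'SELECT count(*) FROM users;'  # assumed table
pg_ctl -D "$TMP/pgdata" stop
rm -rf "$TMP"
```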

A failed drill is a P0: it means our backups can't actually restore. File an incident immediately.

## Real restore — full DB lost

Use this when:

- The DB host is destroyed.
- Postgres can't start and there's no working WAL.
- 24 hours of data loss is acceptable.

1. Spin up a new db host
   (`make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=db-new ANSIBLE_TAGS=db`).
2. Pull the most recent daily dump (the listing sketch after this list shows
   one way to find the newest object):
   ```sh
   rclone copyto \
     spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump \
     /tmp/restore.dump
   ```
3. Restore:
   ```sh
   pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump
   ```
4. Confirm schema: `shithubd migrate status`.
5. Bring the app up against the new DB (update `web.env` and `worker.env`).
6. **Notify users** — there will be a visible activity gap.
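To fill in `<latest>`, list the day's prefix and take the last entry (aim the date at yesterday if today's backup hasn't run yet):

```sh
rclone lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/ | sort | tail -n1
```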

## Real restore — point-in-time

Use this when data is intact in WAL but a destructive change happened at a known time (`DROP TABLE`, mass `UPDATE`, runaway worker).

This is harder. The procedure:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident into a fresh data directory.
3. Configure recovery (`recovery.signal` + GUCs in PG16) pointing at the WAL archive in Spaces with `recovery_target_time = '<UTC timestamp just before incident>'`; see the sketch after this list.
4. Start Postgres in recovery mode; let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.
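To make step 3 concrete, here is a minimal sketch, not a tested procedure: the data directory path and the rclone invocation are assumptions, while the recovery GUCs shown (`restore_command`, `recovery_target_time`, `recovery_target_action`) are the stock PG16 ones.

```sh
# Sketch only: PG16 PITR configuration. The PGDATA path is an assumption.
PGDATA=/var/lib/postgresql/16/main
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time = '<UTC timestamp just before incident>'
recovery_target_action = 'pause'
EOF
touch "$PGDATA/recovery.signal"
# Start Postgres (step 4); replay pauses at the target. After verifying
# the data, promote (step 5):
#   sudo -u postgres pg_ctl -D "$PGDATA" promote
```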

If you have not done a PITR before, **do not** free-style this. Mistakes here destroy the data you were trying to save. Find someone who has done it and pair with them.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log against the restored DB: flag entries that reference IDs that no longer exist.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) — if the dump was compromised, the secrets in it may be too.
- Force-rotate session epochs for any user whose session crossed the recovery window.