# Backup & restore

Two layers, both mandatory:

- **Continuous WAL archive** — `deploy/postgres/archive_command.sh`
  ships every WAL segment to `spaces-prod:shithub-wal` in real
  time. Recovery point objective (RPO) ≈ one WAL segment (~16 MB
  or one `archive_timeout`); see the sketch after this list.
- **Daily logical dump** — `deploy/postgres/backup-daily.sh`
  takes a `pg_dump --format=custom` once per day and ships it to
  `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. Keeps the most
  recent 7 on the db host.
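
For orientation, the WAL layer amounts to roughly the wiring
below. This is a sketch of the shape, not the contents of the
real `archive_command.sh`, and the installed path is an
assumption:

```sh
#!/bin/sh
# archive_command.sh, approximately. Wired up in postgresql.conf as:
#   archive_mode    = on
#   archive_command = '/usr/local/bin/archive_command.sh %p %f'
# Postgres substitutes %p (segment path) and %f (file name); a
# non-zero exit makes Postgres keep the segment and retry later.
exec rclone --config /etc/rclone-shithub.conf \
    copyto "$1" "spaces-prod:shithub-wal/$2"
```
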
Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
both buckets to a second region for DR. Lifecycle in
`deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps
after 90.
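
The cross-region mirror is, in effect, two bucket syncs. A
sketch, assuming the DR remote is named `spaces-dr` (check the
real script for the actual remote name):

```sh
# Roughly what sync-cross-region.sh amounts to.
rclone --config /etc/rclone-shithub.conf sync \
    spaces-prod:shithub-wal spaces-dr:shithub-wal
rclone --config /etc/rclone-shithub.conf sync \
    spaces-prod:shithub-backups spaces-dr:shithub-backups
```
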

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if no successful backup in 30h.
- `pg_stat_archiver.failed_count > 0` is paged via the
  archive-failing alert.

By hand:

```sh
ssh db
sudo -u postgres rclone --config /etc/rclone-shithub.conf \
    lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
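
The listing above only proves a dump landed today. To eyeball the
WAL-archive side as well, `pg_stat_archiver` is standard Postgres:

```sh
sudo -u postgres psql -x -c \
    "SELECT archived_count, last_archived_wal, last_archived_time,
            failed_count, last_failed_wal
       FROM pg_stat_archiver;"
```
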

## Restore drills

Run **quarterly**. The restore drill restores the latest dump
into a temp Postgres instance, runs smoke queries, tears down.
Production is untouched.

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection. Default is
clean-up.
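
For context on what the drill exercises, its shape is roughly the
following. A sketch only, not the real `shithub-restore-drill`;
the port and the smoke-query table are assumptions:

```sh
# Throwaway instance, restore, smoke-check, tear down.
# Run as an unprivileged user; initdb refuses to run as root.
DUMP=/path/to/latest.dump
tmp=$(mktemp -d)
initdb -D "$tmp/pgdata" >/dev/null
pg_ctl -D "$tmp/pgdata" -o '-p 5499' -w start
createdb -p 5499 drill
pg_restore --port=5499 --dbname=drill --no-owner --no-privileges "$DUMP"
psql -p 5499 -d drill -c 'SELECT count(*) FROM users;'  # smoke query; table assumed
pg_ctl -D "$tmp/pgdata" -w stop
rm -rf "$tmp"
```
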
A failed drill is a P0: it means our backups can't actually restore. File an incident immediately.

## Real restore — full DB lost

Use this when:
- The DB host is destroyed.
- Postgres can't start and there's no working WAL.
- 24 hours of data loss is acceptable.

1. Spin up a new db host
   (`make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=db-new
   ANSIBLE_TAGS=db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto \
       spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump \
       /tmp/restore.dump
   ```
3. Restore:
   ```sh
   pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump
   ```
4. Confirm schema: `shithubd migrate status`.
5. Bring the app up against the new DB (update `web.env` and
   `worker.env`; see the sketch after this list).
6. **Notify users** — there will be a visible activity gap.
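
For step 5, the edit is usually a one-line connection-string
change plus a restart. The variable name and file paths below are
assumptions; match whatever the real `web.env` uses:

```sh
# Hypothetical: DATABASE_URL and /etc/shithub/ are assumed names.
sed -i 's|^DATABASE_URL=.*|DATABASE_URL=postgres://shithub@db-new:5432/shithub|' \
    /etc/shithub/web.env /etc/shithub/worker.env
systemctl restart shithubd-web shithubd-worker
```
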

## Real restore — point-in-time

Use this when data is intact in WAL but a destructive change
happened at a known time (`DROP TABLE`, mass UPDATE, runaway
worker).
This is harder. The procedure:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker
   shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident
   into a fresh data directory.
3. Configure recovery (`recovery.signal` + GUCs in PG16) pointing
   at the WAL archive in Spaces with
   `recovery_target_time = '<UTC timestamp just before incident>'`;
   see the sketch after this list.
4. Start Postgres in recovery mode; let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.
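
A minimal shape for step 3, assuming PG16 with its config under
`$PGDATA` (Debian packaging keeps it in `/etc/postgresql`
instead) and the same rclone remote used for archiving; verify
against the real `deploy/postgres/` templates before running:

```sh
# Point recovery at the WAL archive and set the stop target.
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command        = 'rclone --config /etc/rclone-shithub.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time   = '<UTC timestamp just before incident>'
recovery_target_action = 'promote'  # or 'pause' to inspect before promoting
EOF
touch "$PGDATA/recovery.signal"     # tells Postgres to enter targeted recovery
systemctl start postgresql
```
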

If you have not done a PITR before, **do not** free-style this.
Mistakes here destroy the data you were trying to save. Find
someone who has done it and pair with them.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log against the restored data: entries may
  reference IDs from after the dump that no longer exist (see the
  sketch after this list).
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
  if the dump was compromised, the secrets in it may be too.
- Force-rotate session epochs for any user whose session crossed
  the recovery window.
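
A starting point for the audit-log reconciliation. Every table
and column name here is an assumption about the schema; treat it
as a pattern, not a paste-ready query:

```sh
# Hypothetical schema: audit_log(actor_id, ...) and users(id).
# List audit entries whose actor no longer exists after the restore.
sudo -u postgres psql -d shithub -c "
    SELECT a.id, a.actor_id, a.action, a.created_at
      FROM audit_log a
      LEFT JOIN users u ON u.id = a.actor_id
     WHERE u.id IS NULL
     ORDER BY a.created_at;"
```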