# Backups

Two-layer scheme, both mandatory:

- **Continuous WAL archive** to `spaces-prod:shithub-wal`: `deploy/postgres/archive_command.sh` ships every WAL segment as Postgres rolls it. RPO is approximately one segment (~16MB of WAL or one `archive_timeout`, whichever comes first).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...` via `deploy/postgres/backup-daily.sh`. The 7 most recent dumps stay on the db host for fast recovery; the bucket lifecycle keeps 90 days.
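
To make the WAL layer concrete, here is a minimal sketch of what an `archive_command` wrapper can look like. It is not the contents of `deploy/postgres/archive_command.sh`; the function name `archive_wal`, the `RCLONE` and `WAL_REMOTE` variables, and the flat object layout in the bucket are all assumptions made for illustration.

```sh
#!/bin/sh
# Hypothetical sketch, NOT the real deploy/postgres/archive_command.sh.
# Postgres would invoke a wrapper like this with %p (path to the
# segment) and %f (segment file name) substituted in archive_command.

RCLONE=${RCLONE:-rclone}                            # assumed override hook for dry runs
WAL_REMOTE=${WAL_REMOTE:-spaces-prod:shithub-wal}   # bucket named in this runbook

archive_wal() {
    wal_path=$1
    wal_name=$2
    # Never report success for a segment that is missing on disk.
    [ -f "$wal_path" ] || return 1
    # The upload's exit status becomes the function's status. Postgres
    # treats non-zero as "not archived" and retries the segment later.
    "$RCLONE" copyto "$wal_path" "$WAL_REMOTE/$wal_name"
}
```

A wrapper script would end with `archive_wal "$1" "$2"`. The important property is that the exit status is 0 only when the segment is durably in the bucket, because Postgres recycles a segment once the command reports success.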

Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs hourly from the backup host into a second-region DR bucket.

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if `time() - shithubd_backup_last_success_seconds > 30h`. The backup script pushes the timestamp to the metrics endpoint on success.
- `pg_stat_archiver.failed_count > 0` triggers a page that points to the `archive-failing` runbook.
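
The `BackupOverdue` condition is simple arithmetic on two Unix timestamps, and it can be reproduced in shell when reasoning about whether an alert was justified. This is a sketch only: the function name `backup_overdue` is hypothetical, and how you obtain the last-success timestamp (it is the value of `shithubd_backup_last_success_seconds`) is left to the caller.

```sh
# Sketch of the BackupOverdue condition; backup_overdue is a
# hypothetical helper, not part of the monitoring stack.
# $1: last successful backup, Unix seconds
# $2: "now" in Unix seconds (defaults to the current time)
backup_overdue() {
    last_success=$1
    now=${2:-$(date +%s)}
    threshold=$((30 * 3600))    # 30h, as in the alert expression
    # Succeeds (exit 0) only when the age is strictly over the threshold.
    [ $((now - last_success)) -gt "$threshold" ]
}
```

For example, `backup_overdue "$last" && echo "backup is overdue"`. With a daily schedule, a 30h threshold leaves roughly six hours of slack before the alert fires.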

If you want to confirm by hand:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```

## Quarterly restore drill

We verify the restore path **every quarter**. The calendar entry lives in the team calendar; the procedure is in `restore.md`.

Required outputs of a successful drill:

1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass.
3. The drill log has a "restore drill OK" line and is archived to `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.
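
Two of these outputs are easy to check mechanically. The sketch below uses hypothetical helper names (`quarter_label`, `drill_log_ok`) and assumes that `<YYYY-QQ>` expands to a label such as `2025-Q2`; confirm the actual naming against existing objects under `drills/` before relying on it.

```sh
# Hypothetical helpers for checking drill outputs.

# Print the quarter label for a date given as YYYY-MM-DD.
# ASSUMPTION: <YYYY-QQ> means a year, a dash, then Q1..Q4.
quarter_label() {
    year=${1%%-*}
    rest=${1#*-}
    month=${rest%%-*}
    month=${month#0}    # drop a leading zero so 08 and 09 are not read as octal
    echo "${year}-Q$(( (month - 1) / 3 + 1 ))"
}

# Succeed only if the drill log contains the literal success line.
drill_log_ok() {
    grep -q 'restore drill OK' "$1"
}
```

For example, `drill_log_ok drill.log && echo "archive under drills/$(quarter_label "$(date -u +%Y-%m-%d)")/"`, where `drill.log` stands in for wherever the drill writes its log.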

If the drill fails: open an incident immediately. A failing drill means our backups can't actually restore; we treat that as P0.

## Missed backup

**Symptom:** `BackupOverdue` alert.

1. SSH to the db host. Check `systemctl status shithub-backup-daily.timer` and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely the script ran but `rclone copyto` failed (creds, network). Re-run by hand: `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an incident: every additional day extends the RPO of an actual recovery.
