
# Backups

Two-layer scheme, both mandatory:

- **Continuous WAL archive** to `spaces-prod:shithub-wal`. `deploy/postgres/archive_command.sh` ships every WAL segment as Postgres rolls it. RPO is approximately one segment (~16MB or one `archive_timeout`).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...` via `deploy/postgres/backup-daily.sh`. Keeps the most recent 7 on the db host for fast recovery; bucket lifecycle keeps 90 days.
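In essence the archive command is a thin rclone wrapper. A minimal sketch, written here as a shell function for illustration — the deployed script may add locking, logging, or retries, so treat everything below as an assumption and check `deploy/postgres/archive_command.sh` for the real behavior:

```sh
# Sketch of the archive command Postgres invokes as `<script> %p %f`.
# Assumed: remote name and flags match the rest of this doc.
shithub_pg_archive() {
    wal_path=$1   # %p: path to the finished WAL segment
    wal_name=$2   # %f: the segment's file name
    rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
           copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
}
```

Propagating rclone's exit status matters: Postgres only recycles a segment once the archive command exits 0, so a swallowed failure silently loses WAL.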

Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs hourly from the backup host into a second-region DR bucket.
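The sync script itself presumably boils down to a pair of `rclone sync` calls; a sketch as a shell function, with bucket names taken from this doc and everything else assumed:

```sh
# Sketch of deploy/spaces/sync-cross-region.sh (assumed shape; the real
# script may add bandwidth limits, logging, or a lock file).
sync_cross_region() {
    rclone --config /etc/rclone-shithub.conf sync \
           spaces-prod:shithub-backups spaces-dr:shithub-backups-dr
    rclone --config /etc/rclone-shithub.conf sync \
           spaces-prod:shithub-wal spaces-dr:shithub-wal-dr
}
```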

## WAL archiving — first-time setup

The original provision script created `shithub-backups` and `shithub-backups-dr` but NOT the WAL buckets, so a fresh install ships zero WAL segments until the operator runs through this once:

1. **Create the WAL buckets** (DO Spaces dashboard — `doctl` doesn't manage buckets):
   - `shithub-wal` in the primary region
   - `shithub-wal-dr` in the DR region
2. **Extend the prod RW Spaces key** (Settings → API → Spaces Keys → Edit) to grant `readwrite` on `shithub-wal`. The `dr` key needs `readwrite` on `shithub-wal-dr` so `sync-cross-region.sh` can push.
3. **Confirm the rclone config on the app droplet** has both keys (`/etc/rclone-shithub.conf`: the `spaces-prod` and `spaces-dr` remotes).
4. **Re-run ansible** (or drop the conf.d file by hand at `/etc/postgresql/16/main/conf.d/99_shithub_archive.conf`), then `systemctl restart postgresql@16-main`; the `archive_mode` change needs a restart, not a reload.
5. **Verify within ~60 s**:

   ```sh
   sudo -u postgres psql -xc 'SELECT * FROM pg_stat_archiver;'
   # last_archived_wal: 000000010000000000000003 (or similar)
   # last_archived_time: <recent timestamp>
   # failed_count: 0
   rclone --config /etc/rclone-shithub.conf --s3-no-check-bucket \
          lsf spaces-prod:shithub-wal/ --recursive | head
   ```
6. **If `failed_count > 0`** before any successful archive: `journalctl -u postgresql@16-main -n 100 | grep archive` shows the rclone error. Most common: bucket name typo, key grant missing, or the `--s3-no-check-bucket` flag is missing from the archive script (re-check `/usr/local/bin/shithub-pg-archive`).
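For orientation, the conf.d file from step 4 presumably contains something like the following — a sketch only, and the `archive_timeout` value here is an assumption; the ansible template is authoritative:

```ini
# /etc/postgresql/16/main/conf.d/99_shithub_archive.conf (assumed contents)
archive_mode = on
archive_command = '/usr/local/bin/shithub-pg-archive %p %f'
archive_timeout = '300s'   # assumed value; bounds RPO when write traffic is low
```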

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if `time() - shithubd_backup_last_success_seconds > 30h`. The backup script pushes the timestamp to the metrics endpoint on success.
- `pg_stat_archiver.failed_count > 0` is paged via the `archive-failing` runbook.
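The 30 h threshold is just "one daily backup plus slack". The same condition, mirrored locally in shell for illustration (the real alert is evaluated by the monitoring stack, not on the host):

```sh
# Local mirror of the BackupOverdue condition, using an example timestamp 31 h old.
now=$(date -u +%s)
last_success=$(( now - 31 * 3600 ))   # pretend the last success was 31 h ago
age=$(( now - last_success ))
if [ "$age" -gt $(( 30 * 3600 )) ]; then status=overdue; else status=ok; fi
echo "$status"   # overdue
```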

If you want to confirm by hand:

```sh
ssh db
sudo -u postgres rclone --config /etc/rclone-shithub.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```

## Quarterly restore drill

We verify the restore path **every quarter**. The calendar entry lives in the team calendar; the procedure is in `restore.md`.

Required outputs of a successful drill:

1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass.
3. The drill log has a "restore drill OK" line and is archived to `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.

If the drill fails: open an incident immediately. A failing drill means our backups can't actually restore; we treat that as P0.
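One portable way to compute the `<YYYY-QQ>` prefix for the drill path, assuming it means a label like `2025-Q1` — a sketch only; `run.sh` may format it differently:

```sh
# Current calendar quarter as a "YYYY-Qn" label.
m=$(date -u +%m); m=${m#0}               # month 1..12, leading zero stripped
quarter=$(( (m - 1) / 3 + 1 ))           # 1..4
drill_prefix="$(date -u +%Y)-Q${quarter}"
echo "$drill_prefix"
```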

## Missed backup

**Symptom:** `BackupOverdue` alert.

1. SSH to db host. `systemctl status shithub-backup-daily.timer` and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely: the script ran but `rclone copyto` failed (creds, network). Re-run by hand: `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an incident — every additional day extends the RPO of an actual recovery.
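For step 1 triage it helps to know roughly what the timer looks like. A hypothetical sketch — the unit names come from this doc, but the schedule and everything else here are assumptions; the ansible-managed unit is authoritative:

```ini
# shithub-backup-daily.timer (assumed shape)
[Unit]
Description=Daily logical dump of the shithub database

[Timer]
OnCalendar=*-*-* 03:00:00 UTC   # assumed time of day
Persistent=true                 # catch up after a reboot instead of skipping a day

[Install]
WantedBy=timers.target
```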