# Restore

Two restore modes:

- **Drill restore**: done quarterly into a temp PG instance to verify the backup chain works. Never touches the production DB. Procedure: run the drill script; see "Drill" below.
- **Real restore**: used after a data-loss incident. The procedure is in "Real restore" below. Read it once before you ever need it.

## Drill

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection. By default, the script cleans it up.

The script writes to `/var/log/shithub/restore-drill.log`. After a successful drill, archive that log:

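```sh
# %q prints the quarter (1-4) and needs GNU date; a bare "%Y-Q" would yield e.g. "2024-Q"
rclone copyto /var/log/shithub/restore-drill.log \
       "spaces-prod:shithub-backups/drills/$(date -u +%Y-Q%q)/restore-$(date -u +%Y%m%dT%H%M%SZ).log"
```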
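The `drills/` prefix is keyed by quarter. Where `date` has no quarter specifier (BSD or older GNU builds), the label can be derived from the month with shell arithmetic. This is a minimal sketch; `quarter_label` is a hypothetical helper, not part of the drill script.

```sh
# Hypothetical helper: print "YYYY-Qn" for a given year and month (01-12).
quarter_label() {
    year=$1
    month=${2#0}    # strip a leading zero so 08 and 09 are not read as octal
    echo "${year}-Q$(( (month - 1) / 3 + 1 ))"
}

# Example wiring (commented out; needs rclone configured for spaces-prod):
#   q=$(quarter_label "$(date -u +%Y)" "$(date -u +%m)")
#   rclone copyto /var/log/shithub/restore-drill.log \
#       "spaces-prod:shithub-backups/drills/${q}/restore-$(date -u +%Y%m%dT%H%M%SZ).log"
```

Months 1 to 3 map to Q1, 4 to 6 to Q2, and so on, so the integer division needs no lookup table.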

## Real restore

Two scenarios. **Read the right one**: they have different recovery points and different blast radii.

### A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

- The DB host is destroyed.
- Postgres can't start and we have no working WAL.
- We accept losing up to a day of writes (RPO = 24h).

Procedure:

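1. Spin up a new db host (Ansible: `ANSIBLE_LIMIT=db-new make deploy --tags db`).
2. Pull the most recent daily dump, replacing `<latest>` with the newest file name. Shortly after midnight UTC the newest dump may still sit under the previous day's directory. The remote path is quoted so the shell does not treat `<` and `>` as redirections:
   ```sh
   rclone copyto "spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump" /tmp/restore.dump
   ```
3. Restore it: `pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump`
4. Run `shithubd migrate status` to confirm the schema version.
5. Bring the app up against the new DB (update `web.env` and `worker.env`). Verify with smoke queries from the drill suite.
6. **Notify users**: there will be a visible gap in their activity. Don't make them figure it out.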
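Steps 2 and 3 can be sketched as two small helpers: one picks the newest dump from a listing, the other checks that the file is usable before restoring. Both function names are hypothetical, and the sketch assumes object paths sort chronologically (for example `YYYY/MM/DD/<name>.dump`), which the source does not confirm.

```sh
# Sketch only: helper names are hypothetical, not part of the shithub tooling.

# Reads object paths on stdin and prints the last *.dump in byte order.
# Assumes paths sort chronologically (e.g. YYYY/MM/DD/<name>.dump).
pick_latest_dump() {
    grep '\.dump$' | LC_ALL=C sort | tail -n 1
}

# Refuses to restore a missing, empty, or unreadable dump.
restore_dump() {
    dump=$1
    if [ ! -s "$dump" ]; then
        echo "dump missing or empty: $dump" >&2
        return 1
    fi
    # --list only reads the archive's table of contents; it changes nothing.
    pg_restore --list "$dump" >/dev/null || return 1
    pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges "$dump"
}

# Example wiring (commented out; needs rclone configured for spaces-prod):
#   latest=$(rclone lsf -R --files-only spaces-prod:shithub-backups/daily/ | pick_latest_dump)
#   rclone copyto "spaces-prod:shithub-backups/daily/${latest}" /tmp/restore.dump
#   restore_dump /tmp/restore.dump
```

Listing the whole `daily/` prefix rather than today's directory avoids coming up empty when the restore runs before the day's dump has been written.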

### B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

- The data is intact in WAL but a destructive change happened at a known time (`DROP TABLE`, mass `UPDATE`, runaway worker).
- We need to roll the DB back to immediately before that moment.

This is harder; the procedure is:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident into a fresh data directory.
3. Configure recovery to point at the WAL archive in Spaces with `recovery_target_time = '<UTC timestamp just before incident>'`. On PG 12 and later (including PG16), create `recovery.signal` in the data directory and set `restore_command` and the target as GUCs; `recovery.conf` applies only to PG 11 and earlier.
4. Start Postgres in recovery mode and let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.

If you have not done a PITR before, page someone who has; do not freestyle this. Mistakes here lose the data you were trying to save.
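For orientation, the recovery configuration in step 3 might look like the sketch below on PG 12 or later. It is an illustration under assumptions, not the team's procedure: `write_recovery_settings` is a hypothetical helper, the WAL archive location is passed in because the source does not name it, and the `restore_command` assumes WAL segments are stored uncompressed under their original file names.

```sh
# Hypothetical helper: write PITR settings into a freshly restored data directory.
#   $1 = data directory, $2 = recovery target time, $3 = rclone remote of the WAL archive
write_recovery_settings() {
    pgdata=$1
    target=$2
    wal_remote=$3
    # postgresql.auto.conf lives in the data directory and is read last;
    # pg_basebackup -R appends its settings to the same file.
    cat >> "${pgdata}/postgresql.auto.conf" <<EOF
restore_command = 'rclone copyto ${wal_remote}/%f %p'
recovery_target_time = '${target}'
recovery_target_action = 'pause'
EOF
    # An empty recovery.signal tells Postgres to start in targeted recovery mode.
    touch "${pgdata}/recovery.signal"
}

# Example (commented out; the remote is a placeholder to fill in):
#   write_recovery_settings /path/to/fresh/pgdata '<UTC timestamp just before incident>' '<WAL archive remote>'
```

With `recovery_target_action = 'pause'`, replay stops at the target so the data can be inspected before promoting, for example with `pg_ctl promote`.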

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: any rows referencing IDs that no longer exist need a note in the incident postmortem.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`): if an attacker triggered the recovery, the secrets in the dump may be compromised.
- Force-rotate session-epochs for any user whose session crossed the recovery window.
