
Restore

Two restore modes:

  • Drill restore — done quarterly into a temp PG instance to verify the backup chain works. Never touches the production DB. Procedure: the drill script. See "Drill" below.
  • Real restore — used after a data-loss incident. This is the procedure in "Real restore" below. Read it once before you ever need it.

Drill

ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/

To drill an older dump:

sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump

--keep preserves the temp pgdata for inspection. The default is clean-up.

The script writes to /var/log/shithub/restore-drill.log. After a successful drill, archive that log:

# %q (GNU date) expands to the quarter number, giving e.g. 2024-Q3
rclone copyto /var/log/shithub/restore-drill.log \
       spaces-prod:shithub-backups/drills/$(date -u +%Y-Q%q)/restore-$(date -u +%Y%m%dT%H%M%SZ).log
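
The archive path groups drill logs by year and quarter. GNU `date` can print the quarter with `%q`, but BSD/macOS `date` cannot. A minimal portable sketch follows; `quarter_label` is a hypothetical helper for illustration, not part of the drill script:

```shell
# Hypothetical helper: print YYYY-Qn for a given year and month (01-12).
quarter_label() {
    year=$1
    month=${2#0}    # strip one leading zero so 08 and 09 are not parsed as octal
    printf '%s-Q%s\n' "$year" $(( (month + 2) / 3 ))
}

# Example: label for the current UTC date.
quarter_label "$(date -u +%Y)" "$(date -u +%m)"
```

The integer division maps months 1-3 to Q1, 4-6 to Q2, and so on, so the label can be substituted into the `drills/` prefix on hosts without GNU `date`.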

Real restore

Two scenarios. Read the right one — they have different recovery points and different blast radius.

A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

  • The DB host is destroyed.
  • Postgres can't start and we have no working WAL.
  • We accept losing up to a day of writes (RPO = 24h).

Procedure:

  1. Spin up a new db host (Ansible: ANSIBLE_LIMIT=db-new make deploy --tags db).
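  2. Pull the most recent daily dump. The path below assumes a dump already exists for today (UTC); if it does not, use the most recent dated directory that has one: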
    rclone copyto spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump /tmp/restore.dump
    
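  3. Restore the dump into the shithub database (it must already exist, because --create is not passed): pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump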
  4. Run shithubd migrate status to confirm schema version.
  5. Bring the app up against the new DB (update web.env and worker.env). Verify with smoke queries from the drill suite.
  6. Notify users — there will be a visible gap in their activity. Don't make them figure it out.
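
Step 2 leaves `<latest>` to be filled in by hand. One way to find it is sketched below, assuming the `YYYY/MM/DD/` layout shown in the path above and file names that sort chronologically within a day. `latest_dump` is a hypothetical helper, not an existing script:

```shell
# Hypothetical helper: read object paths (one per line) on stdin and print the
# newest .dump. Relies on the YYYY/MM/DD/ layout, where lexical order is date order.
latest_dump() {
    grep '\.dump$' | LC_ALL=C sort | tail -n 1
}

# Usage against the bucket (shown as comments; needs rclone and credentials):
#   latest=$(rclone lsf --files-only -R spaces-prod:shithub-backups/daily/ | latest_dump)
#   rclone copyto "spaces-prod:shithub-backups/daily/$latest" /tmp/restore.dump
#   pg_restore --list /tmp/restore.dump >/dev/null   # confirm the archive is readable
```

Listing the archive's table of contents with `pg_restore --list` before step 3 is a cheap check that the download is not truncated.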

B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

  • The data is intact in WAL but a destructive change happened at a known time (DROP TABLE, mass UPDATE, runaway worker).
  • We need to roll the DB back to immediately before that moment.

This is harder; the procedure is:

  1. Stop the app (systemctl stop shithubd-web shithubd-worker shithubd-cron.timer). Do not let writes continue.
  2. Restore the most recent base backup from before the incident into a fresh data directory.
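  3. Configure recovery to read from the WAL archive in Spaces with recovery_target_time = '<UTC timestamp just before incident>'. On PG12 and later (including PG16), create recovery.signal in the data directory and set restore_command and the target in postgresql.conf; recovery.conf applies only to PG11 and earlier.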
  4. Start Postgres in recovery mode, let it replay WAL.
  5. When replay reaches the target, Postgres pauses by default. Check that the data looks right, then promote (pg_ctl promote).
  6. Restart the app.

If you have not done a PITR before, page someone who has. Do not freestyle this: mistakes here lose the data you were trying to save.
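
For orientation only, the recovery configuration in step 3 might look like the sketch below on PostgreSQL 12 or later. `write_pitr_config` is a hypothetical helper. The WAL archive location and the use of `rclone copyto` as the `restore_command` are assumptions, since this document does not say how WAL is archived; check the real archive setup before using any of it.

```shell
# Hypothetical helper: append PITR settings to a config file and create
# recovery.signal in the data directory (PostgreSQL 12+ layout).
#   $1 = path to postgresql.conf     $2 = data directory
#   $3 = recovery target time (UTC)  $4 = WAL archive location (assumed rclone path)
write_pitr_config() {
    conf=$1
    datadir=$2
    target=$3
    wal_remote=$4
    cat >>"$conf" <<EOF
# --- PITR settings; remove after promotion ---
restore_command = 'rclone copyto ${wal_remote}/%f %p'
recovery_target_time = '${target}'
recovery_target_action = 'pause'
EOF
    # An empty recovery.signal tells Postgres to start in targeted recovery mode.
    touch "${datadir}/recovery.signal"
}
```

It would be called with the real config path, the fresh data directory from step 2, the incident timestamp, and the actual WAL archive location. Setting `recovery_target_action = 'pause'` explicitly keeps the server paused at the target so the data can be inspected before promotion.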

After any real restore

  • Run the restore-drill smoke queries against the restored DB.
  • Reconcile the audit log: any rows referencing IDs that no longer exist need a note in the incident postmortem.
  • Force-rotate webhook secrets (shithubd webhook rotate-all) — if an attacker triggered the recovery, the secrets in the dump may be compromised.
  • Force-rotate session-epochs for any user whose session crossed the recovery window.