# Restore

Two restore modes:

- **Drill restore** — done quarterly into a temp PG instance to verify the backup chain works. Never touches the production DB. Procedure: the drill script. See "Drill" below.
- **Real restore** — used after a data-loss incident. This is the procedure in "Real restore" below. Read it once before you ever need it.
## Drill

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection. The default is clean-up.
The script writes to `/var/log/shithub/restore-drill.log`. After a
successful drill, archive that log:

```sh
rclone copyto /var/log/shithub/restore-drill.log \
  spaces-prod:shithub-backups/drills/$(date -u +%Y-Q%q)/restore-$(date -u +%Y%m%dT%H%M%SZ).log
# %q (quarter, 1-4) is GNU date; plain %Y-Q would yield e.g. "2025-Q" with no number.
```
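The drills prefix wants a quarter label such as `2025-Q3`. GNU `date` (coreutils 8.26+) can emit the quarter directly with `%q`; a portable sketch that derives it from the month, for hosts without GNU date, looks like:

```sh
# Quarter label for the archive path, e.g. "2025-Q3".
month=$(date -u +%m)        # "01".."12"
month=${month#0}            # strip a leading zero so it isn't read as octal
quarter=$(( (month + 2) / 3 ))
echo "$(date -u +%Y)-Q$quarter"
```

The same expression can be inlined into the `rclone copyto` destination.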
## Real restore

Two scenarios. **Read the right one** — they have different recovery
points and different blast radius.

### A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

- The DB host is destroyed.
- Postgres can't start and we have no working WAL.
- We accept losing up to a day of writes (RPO = 24h).
Procedure:

1. Spin up a new db host (Ansible: `ANSIBLE_LIMIT=db-new make deploy --tags db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump /tmp/restore.dump
   ```
3. `pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump`
4. Run `shithubd migrate status` to confirm the schema version.
5. Bring the app up against the new DB (update `web.env` and `worker.env`). Verify with smoke queries from the drill suite.
6. **Notify users** — there will be a visible gap in their activity. Don't make them figure it out.
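The `<latest>` placeholder has to be resolved by hand. If the dump filenames embed a UTC timestamp (an assumption about the bucket naming, not confirmed above), they sort lexically and the selection can be scripted roughly like this:

```sh
# Sketch: pick the newest dump from the day's listing. Assumes names like
# 20250601T0300Z.dump that sort lexically by time.
day=$(date -u +%Y/%m/%d)
# Live version (needs bucket access):
#   latest=$(rclone lsf "spaces-prod:shithub-backups/daily/$day/" | sort | tail -n 1)
# Demonstration with canned names instead of a live bucket:
latest=$(printf '%s\n' 20250601T0300Z.dump 20250602T0300Z.dump 20250531T0300Z.dump | sort | tail -n 1)
echo "$latest"   # -> 20250602T0300Z.dump
```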
### B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

- The data is intact in WAL but a destructive change happened at a known time (`DROP TABLE`, mass `UPDATE`, runaway worker).
- We need to roll the DB back to immediately before that moment.
This is harder; the procedure is:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident into a fresh data directory.
3. Configure recovery pointing at the WAL archive in Spaces with `recovery_target_time = '<UTC timestamp just before incident>'`. On PG 12 and later (including PG 16) that means recovery GUCs in `postgresql.conf` plus an empty `recovery.signal` file; `recovery.conf` exists only on PG 11 and earlier.
4. Start Postgres in recovery mode and let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.
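The recovery setup above (`recovery.signal` plus GUCs on PG 12 and later) can be sketched as follows; the data directory, the WAL bucket path, and the target time are all placeholders, not values confirmed by this runbook:

```sh
# Sketch only -- PGDATA, the wal/ prefix, and the timestamp are placeholders.
PGDATA=$(mktemp -d)   # stands in for the fresh data directory from step 2
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'rclone copyto spaces-prod:shithub-backups/wal/%f "%p"'
recovery_target_time = '2025-06-01 12:34:00+00'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"   # presence of this file puts Postgres in recovery mode
```

`%f`/`%p` are expanded by Postgres (WAL filename and destination path), and `recovery_target_action = 'promote'` makes the promote step automatic once the target is reached.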
If you have not done a PITR before: page someone who has, do not free-style this. Mistakes here lose the data you were trying to save.
## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: any rows referencing IDs that no longer exist need a note in the incident postmortem.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) — if an attacker triggered the recovery, the secrets in the dump may be compromised.
- Force-rotate session-epochs for any user whose session crossed the recovery window.
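The audit-log reconciliation is an anti-join: audit rows whose referenced IDs did not survive the restore. In practice that is a SQL `LEFT JOIN ... WHERE ... IS NULL` against the real schema; a toy shell sketch of the idea, with canned ID lists standing in for query results:

```sh
# Toy sketch: IDs referenced by audit rows vs. IDs present after the restore.
# The lists are illustrative; real data comes from SQL against the schema.
tmp=$(mktemp -d)
printf '%s\n' 1 2 5 9 | sort > "$tmp/audit_ids"
printf '%s\n' 1 2 3 4 5 | sort > "$tmp/live_ids"
comm -23 "$tmp/audit_ids" "$tmp/live_ids"   # lines only in audit_ids -> 9
rm -rf "$tmp"
```

Each ID that falls out of the comparison is a dangling reference to note in the postmortem.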