# Restore

Two restore modes:

- **Drill restore** — run quarterly into a temporary Postgres instance to verify the backup chain works. Never touches the production DB. Procedure: the drill script; see "Drill" below.
- **Real restore** — used after a data-loss incident. This is the procedure in "Real restore" below. Read it once before you ever need it.

## Drill

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection; the default is to clean up.

The script writes to `/var/log/shithub/restore-drill.log`. After a successful drill, archive that log:

```sh
rclone copyto /var/log/shithub/restore-drill.log \
  spaces-prod:shithub-backups/drills/$(date -u +%Y-Q%q)/restore-$(date -u +%Y%m%dT%H%M%SZ).log
```

(GNU date's `%q` is the quarter, so the prefix comes out as e.g. `2025-Q3`.)

## Real restore

Two scenarios. **Read the right one** — they have different recovery points and different blast radii.

### A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

- The DB host is destroyed.
- Postgres can't start and we have no working WAL.
- We accept losing up to a day of writes (RPO = 24h).

Procedure:

1. Spin up a new DB host (Ansible: `ANSIBLE_LIMIT=db-new make deploy --tags db`).
2. Pull the most recent daily dump:

   ```sh
   rclone copyto spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/.dump /tmp/restore.dump
   ```

3. Restore it: `pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump`
4. Run `shithubd migrate status` to confirm the schema version.
5. Bring the app up against the new DB (update `web.env` and `worker.env`). Verify with smoke queries from the drill suite.
6. **Notify users** — there will be a visible gap in their activity. Don't make them figure it out.
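One gotcha with the daily-dump path in step 2: the prefix is keyed by UTC date, so a restore started shortly after UTC midnight may need the previous day's prefix if that night's dump has not landed yet. A small sketch, assuming GNU date and the `daily/YYYY/MM/DD` layout used above:

```sh
# Compute today's and yesterday's daily-dump prefixes (UTC).
# Bucket layout is assumed to match the rclone path in step 2.
today="$(date -u +%Y/%m/%d)"
yesterday="$(date -u -d 'yesterday' +%Y/%m/%d)"   # -d requires GNU date
echo "primary:  spaces-prod:shithub-backups/daily/${today}/"
echo "fallback: spaces-prod:shithub-backups/daily/${yesterday}/"
```

If the primary prefix turns out to be empty, check the fallback before assuming the backup pipeline is broken.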
### B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

- The data is intact in WAL, but a destructive change happened at a known time (`DROP TABLE`, mass `UPDATE`, runaway worker).
- We need to roll the DB back to immediately before that moment.

This is harder. The procedure:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident into a fresh data directory.
3. Configure recovery — on PG 12 and later that means `recovery.signal` plus recovery GUCs in `postgresql.conf` (`recovery.conf` exists only on PG 11 and older) — pointing at the WAL archive in Spaces, with `recovery_target_time = ''` (fill in the instant just before the incident).
4. Start Postgres in recovery mode and let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.

If you have not done a PITR before: page someone who has; do not free-style this. Mistakes here lose the data you were trying to save.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: any rows referencing IDs that no longer exist need a note in the incident postmortem.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) — if an attacker triggered the recovery, the secrets in the dump may be compromised.
- Force-rotate session epochs for any user whose session crossed the recovery window.
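For reference, step 3 of the PITR procedure above might look like this on PG 12 and later. This is a sketch only: the WAL-archive path, timestamp, and target action are placeholders and assumptions (`restore_command` must mirror whatever `archive_command` actually writes to Spaces):

```
# postgresql.conf additions for PITR (placeholder values, not real ones)
restore_command = 'rclone copyto spaces-prod:shithub-backups/wal/%f %p'
recovery_target_time = '2024-06-01 12:34:00+00'   # instant just before the incident
recovery_target_action = 'pause'                  # inspect, then promote (step 5)
```

Then create an empty `recovery.signal` file in the data directory and start Postgres; it replays WAL up to the target and pauses until promoted.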