S37: docs — runbooks (incidents/backups/restore/upgrade/rollback)
- SHA: df5d7fb0e1b5f70ce8d9395fa0e6d9a6d1d26bf5
- Parents: 2aa998f
- Tree: 8eb8f53

| Status | File | + | - |
|---|---|---|---|
| A | docs/internal/runbooks/backups.md | 60 | 0 |
| A | docs/internal/runbooks/incidents.md | 74 | 0 |
| A | docs/internal/runbooks/restore.md | 99 | 0 |
| A | docs/internal/runbooks/rollback.md | 68 | 0 |
| A | docs/internal/runbooks/upgrade.md | 57 | 0 |
docs/internal/runbooks/backups.md (added) @@ -0,0 +1,60 @@

# Backups

Two-layer scheme, both mandatory:

- **Continuous WAL archive** to `spaces-prod:shithub-wal` —
  `deploy/postgres/archive_command.sh` ships every WAL segment as
  Postgres rolls it. RPO is approximately one segment (~16 MB of
  WAL or one `archive_timeout`, whichever comes first).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...`
  via `deploy/postgres/backup-daily.sh`. The db host keeps the most
  recent 7 dumps for fast recovery; a bucket lifecycle rule keeps
  90 days.

A cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs
hourly from the backup host into a second-region DR bucket.

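The WAL side is nothing more than an `archive_command` that copies
each finished segment into the bucket. A minimal sketch of what
`deploy/postgres/archive_command.sh` plausibly looks like (the
rclone remote and config path are borrowed from the manual checks
below; the real script may differ):

```sh
#!/bin/sh
# Sketch only; the real script is deploy/postgres/archive_command.sh.
# Postgres invokes it as: archive_command = '.../archive_command.sh %p %f'
# $1 = path to the finished WAL segment, $2 = segment file name.
set -eu
exec rclone --config /root/.config/rclone/rclone.conf \
    copyto "$1" "spaces-prod:shithub-wal/$2"
```
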
## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if
  `time() - shithubd_backup_last_success_seconds > 30h`. The backup
  script pushes the timestamp to the metrics endpoint on success.
- `pg_stat_archiver.failed_count > 0` is paged via the
  `archive-failing` runbook.

To confirm the daily dump by hand:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
  lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
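
To check the WAL-archiving side by hand, `pg_stat_archiver` carries
the counters the alert is built on:

```sh
ssh db
sudo -u postgres psql -c \
  "SELECT archived_count, failed_count, last_archived_time, last_failed_time
     FROM pg_stat_archiver;"
```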

## Quarterly restore drill

We verify the restore path **every quarter**. The calendar entry
lives in the team calendar; the procedure is in `restore.md`.

Required outputs of a successful drill:

1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass (representative examples are sketched
   below).
3. The drill log has a "restore drill OK" line and is archived to
   `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.

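What a smoke query looks like, for orientation; the table names are
illustrative assumptions, and the real queries live in the drill
suite:

```sh
# Illustrative only -- table names are assumptions; the real smoke
# queries live in deploy/restore-drill/.
psql -d shithub -c "SELECT count(*) FROM users;"
psql -d shithub -c "SELECT max(created_at) FROM repos;"
```
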
If the drill fails: open an incident immediately. A failing drill
means our backups can't actually restore; we treat that as P0.

## Missed backup

**Symptom:** `BackupOverdue` alert.

1. SSH to the db host. `systemctl status shithub-backup-daily.timer`
   and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely: the script ran but `rclone copyto` failed (creds,
   network). Re-run by hand:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an
   incident — every additional day extends the RPO of an actual
   recovery.
docs/internal/runbooks/incidents.md (added) @@ -0,0 +1,74 @@

# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the
anchors below. Keep procedures here short and actionable: what to
check first, how to mitigate, what to record. Postmortems live
elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns
502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if the unit is `active (running)`
   but the alert still fires, the metrics scrape is broken (check wg0).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look
   for migration failures (ExecStartPre), env file errors, port
   conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and
   open an incident.

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. User-visible
symptoms: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop claims work with `FOR UPDATE SKIP LOCKED`,
   so a restart never loses or double-processes jobs (see the sketch
   below).

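For orientation, the shape of a claim query built on `FOR UPDATE
SKIP LOCKED`; the table and column names are assumptions, and the
real query lives in the worker code:

```sh
# Illustrative shape of the worker's claim step -- column names are
# assumptions. SKIP LOCKED keeps concurrent workers off each other's
# rows; a worker that dies mid-claim simply rolls back, leaving the
# row queued for someone else.
psql -d shithub -c "
  UPDATE jobs
     SET state = 'running', started_at = now()
   WHERE id = (SELECT id FROM jobs
                WHERE state = 'queued'
                ORDER BY created_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED)
  RETURNING id, kind;"
```
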
## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every
write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata.
   Check disk free first (`df -h /data`). If full, the most likely
   cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not
   restore from backup unless directed; a working PITR restore
   takes ~30 min and we shouldn't gamble.

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command
   failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm connectivity:
   `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore
   network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in
   time: this is an emergency. Page the operator. **Do not** run
   `pg_archivecleanup` manually unless directed — you can lose
   data.

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue:
   `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`.
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale by
   running a second worker host (the `FOR UPDATE SKIP LOCKED`
   pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` (don't delete — we want
   the audit trail); see the sketch below.
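
A minimal way to do that by hand, assuming the `jobs` columns from
the query above and a made-up job id:

```sh
# Hypothetical example -- replace 12345 with the poison job's id.
psql -d shithub -c "UPDATE jobs SET state = 'failed' WHERE id = 12345;"
```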
docs/internal/runbooks/restore.md (added) @@ -0,0 +1,99 @@

# Restore

Two restore modes:

- **Drill restore** — done quarterly into a temp PG instance to
  verify the backup chain works. Never touches the production DB.
  Procedure: the drill script. See "Drill" below.
- **Real restore** — used after a data-loss incident. This is the
  procedure in "Real restore" below. Read it once before you ever
  need it.

## Drill

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection; by default it is
cleaned up.

The script writes to `/var/log/shithub/restore-drill.log`. After a
successful drill, archive that log:

```sh
# Compute the <YYYY-QQ> drill folder (date(1) has no quarter specifier).
quarter="$(date -u +%Y)-Q$((($(date -u +%-m) - 1) / 3 + 1))"
rclone copyto /var/log/shithub/restore-drill.log \
  "spaces-prod:shithub-backups/drills/${quarter}/restore-$(date -u +%Y%m%dT%H%M%SZ).log"
```

## Real restore

Two scenarios. **Read the right one** — they have different recovery
points and a different blast radius.

### A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

- The DB host is destroyed.
- Postgres can't start and we have no working WAL.
- We accept losing up to a day of writes (RPO = 24h).

Procedure:

1. Spin up a new db host (Ansible:
   `ANSIBLE_LIMIT=db-new make deploy --tags db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump /tmp/restore.dump
   ```
3. `pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump`
4. Run `shithubd migrate status` to confirm the schema version.
5. Bring the app up against the new DB (update `web.env` and
   `worker.env`). Verify with smoke queries from the drill suite.
6. **Notify users** — there will be a visible gap in their
   activity. Don't make them figure it out.

### B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

- The data is intact in WAL but a destructive change happened at a
  known time (`DROP TABLE`, mass `UPDATE`, runaway worker).
- We need to roll the DB back to immediately before that moment.

This is harder; the procedure is:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker
   shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident
   into a fresh data directory.
3. Configure recovery: on PG 12+ that means `recovery.signal` plus
   the recovery GUCs in `postgresql.conf` (`recovery.conf` is gone),
   pointing at the WAL archive in Spaces with
   `recovery_target_time = '<UTC timestamp just before incident>'`.
   A sketch of the settings follows this list.
4. Start Postgres in recovery mode, let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.

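A minimal sketch of step 3, assuming the data directory sits under
`/data` and that WAL is fetched back with the same rclone remote
used to ship it; the real paths and `restore_command` may differ:

```sh
# Sketch only -- the pgdata path, rclone config path, and restore_command
# are assumptions; adjust to the real layout before running anything.
cat >> /data/pgdata/postgresql.conf <<'EOF'
restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time = '2024-06-01 11:55:00+00'
EOF
touch /data/pgdata/recovery.signal
systemctl start postgresql
# Replay pauses at the target by default; inspect, then promote:
#   sudo -u postgres psql -c "SELECT pg_promote();"
```
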
If you have not done a PITR before: page someone who has; do not
free-style this. Mistakes here lose the data you were trying to
save.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: any rows referencing IDs that no longer
  exist need a note in the incident postmortem.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
  if an attacker triggered the recovery, the secrets in the dump
  may be compromised.
- Force-rotate session epochs for any user whose session crossed
  the recovery window.
docs/internal/runbooks/rollback.md (added) @@ -0,0 +1,68 @@

# Rollback

When to roll back vs. roll forward:

- **Roll back** when the new release is broken in a way that can't
  be hot-fixed in <30 min: a panic loop, an auth regression, data
  corruption.
- **Roll forward** when you can ship a hotfix faster than the
  rollback ceremony. A small bug behind a known feature flag is a
  roll-forward case.

If unsure, roll back. Bad releases compound.

## Code rollback (no migration involved)

```sh
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

The systemd unit's `Restart=on-failure` plus the binary swap mean
the process flips to the previous binary on the next ExecStart.
Connections in flight finish; new connections hit the rolled-back
binary.

## Code rollback when the new release added migrations

This is the dangerous case. There are three options, in order of
preference:

### 1. Schema-compatible rollback (best)

If the new migration only *added* columns/tables that the old code
ignores, the old code runs fine against the new schema. Just roll
the code back; leave the schema alone.

Most of our migrations are deliberately additive for this reason.

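For orientation, the kind of change that qualifies; the table and
column names here are made up:

```sh
# Illustrative only -- table/column names are made up.
psql -d shithub -c \
  "ALTER TABLE repos ADD COLUMN IF NOT EXISTS archived_at timestamptz;"
```
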
### 2. Roll forward to a hotfix

If the migration changed semantics that the old code can't tolerate,
ship a hotfix on top of the new release rather than reversing the
migration.

### 3. Migration `down` + code rollback (last resort)

Only if (1) and (2) won't work and the data loss from `down` is
acceptable.

```sh
ssh web-01
sudo -u shithub /usr/local/bin/shithubd migrate down    # ONE step
sudo -u shithub /usr/local/bin/shithubd migrate status  # verify it stepped back one version
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

`migrate down` rolls back exactly one step. **Never** chain `down`s
without checking each migration's down logic; some of them drop
columns and *will* lose data.

## After any rollback

- Note the rollback in the incident channel with the from-tag and
  to-tag.
- File a follow-up issue describing the failure mode.
- Disable any feature flags the bad release turned on.
- Confirm the rolled-back release passed CI (if not, you're now
  running untested code — that's a separate incident).
docs/internal/runbooks/upgrade.md (added) @@ -0,0 +1,57 @@

# Upgrade

Routine release deploys. The deploy is one binary swap + a systemd
restart; the only place upgrades get exciting is around DB migrations
and the occasional config schema change.

## Standard release

```sh
# from a clean checkout of the release tag
git fetch --tags
git checkout v<version>
make deploy-check ANSIBLE_INVENTORY=staging
make deploy ANSIBLE_INVENTORY=staging
# ... canary period ...
make deploy ANSIBLE_INVENTORY=production
```

`shithubd migrate up` runs as the web service's ExecStartPre, so
the binary that needs the new schema is also the one that applies
it. Order on each host: ExecStartPre runs migrations → web starts
on the new schema.

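To see that ordering on a host, `systemctl cat` shows the rendered
unit. The exact lines below are assumptions about how the Ansible
template is written:

```sh
systemctl cat shithubd-web | grep -E 'ExecStartPre|ExecStart'
# Expected shape (illustrative; the real unit is templated under
# deploy/ansible/roles/shithubd/):
#   ExecStartPre=/usr/local/bin/shithubd migrate up
#   ExecStart=/usr/local/bin/shithubd web
```
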
If a migration is long (>30s), call it out in the release notes
and time the deploy outside peak hours. The web service hangs in
"activating" until ExecStartPre finishes.

## Canary

We deploy to staging first and watch Grafana for 30 min. Things to
look at:

- p95 latency on the top routes (`shithubd-overview` dashboard).
- DB call rate — a 10× jump usually means a regressed N+1.
- Job queue depth — a stuck migration shows up here.
- Error logs in Loki: `{service="shithubd"} |~ "panic|ERROR"`.

If anything looks off, **do not** promote to production. Rollback
on staging is cheap; rollback on production is loud.

## Major version (database)

If the release notes flag a major schema change:

1. Take a manual `pg_dump` immediately before the deploy:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
2. Confirm it landed in Spaces (same check as in `backups.md`;
   repeated below).
3. Deploy to staging, then run `make restore-drill` against the
   *post-deploy* dump to confirm the new schema restores cleanly.
4. Then production.

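The step-2 check, copied from `backups.md` for convenience:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
  lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
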
## Config schema changes

When a release adds a required env var, the binary refuses to start
and complains in the journal. Update
`deploy/ansible/roles/shithubd/templates/web.env.j2` (and
`worker.env.j2`), bump the inventory vars, redeploy. There's no
separate migration step for env files.
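
A hypothetical example of the whole loop; `NEW_SETTING` and
`shithubd_new_setting` are placeholders, not real names:

```sh
# Placeholders throughout -- check the release notes for the actual
# variable name.
echo 'NEW_SETTING={{ shithubd_new_setting }}' \
  >> deploy/ansible/roles/shithubd/templates/web.env.j2
# set shithubd_new_setting in the inventory vars, then:
make deploy ANSIBLE_INVENTORY=staging
```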