tenseleyflow/shithub / df5d7fb

S37: docs — runbooks (incidents/backups/restore/upgrade/rollback)

Authored by espadonne
SHA      df5d7fb0e1b5f70ce8d9395fa0e6d9a6d1d26bf5
Parents  2aa998f
Tree     8eb8f53

5 changed files

Status  File                                  +   -
A       docs/internal/runbooks/backups.md    60   0
A       docs/internal/runbooks/incidents.md  74   0
A       docs/internal/runbooks/restore.md    99   0
A       docs/internal/runbooks/rollback.md   68   0
A       docs/internal/runbooks/upgrade.md    57   0

docs/internal/runbooks/backups.md (added)

# Backups

Two-layer scheme, both mandatory:

- **Continuous WAL archive** to `spaces-prod:shithub-wal` —
  `deploy/postgres/archive_command.sh` ships every WAL segment as
  Postgres rolls it (a sketch of such a wrapper follows this list).
  RPO is approximately one segment (~16MB or one `archive_timeout`).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...`
  via `deploy/postgres/backup-daily.sh`. Keeps the most recent 7
  dumps on the db host for fast recovery; the bucket lifecycle
  keeps 90 days.
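
For reference, a minimal sketch of what such a wrapper could look
like, assuming rclone and the bucket layout above; the real
`deploy/postgres/archive_command.sh` may differ:

```sh
#!/bin/sh
# Hypothetical archive_command wrapper, not the real script.
# Postgres would invoke it as:
#   archive_command = '/usr/local/bin/archive_command.sh %p %f'
# It must exit non-zero on failure so Postgres retries the segment.
set -eu
wal_path="$1"   # %p: path to the WAL segment, relative to PGDATA
wal_name="$2"   # %f: file name of the segment

exec rclone --config /root/.config/rclone/rclone.conf \
     copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
```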
12

Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs
hourly from the backup host into a second-region DR bucket.

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if `time() -
  shithubd_backup_last_success_seconds > 30h`. The backup script
  pushes the timestamp to the metrics endpoint on success (sketched
  below).
- `pg_stat_archiver.failed_count > 0` is paged via the
  `archive-failing` runbook.
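
A minimal sketch of that success push, assuming a Pushgateway-style
endpoint; the host, port, and job label are assumptions, not taken
from the real script:

```sh
# Hypothetical: publish the success timestamp the BackupOverdue rule
# compares against. Endpoint and job label are assumed.
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/shithub-backup
shithubd_backup_last_success_seconds $(date -u +%s)
EOF
```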
25

If you want to confirm by hand:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```

## Quarterly restore drill

We verify the restore path **every quarter**. The calendar entry
lives in the team calendar; the procedure is in `restore.md`.

Required outputs of a successful drill:

1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass.
3. The drill log has a "restore drill OK" line and is archived to
   `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.

If the drill fails: open an incident immediately. A failing drill
means our backups can't actually restore; we treat that as P0.

## Missed backup

**Symptom:** `BackupOverdue` alert.

1. SSH to the db host. `systemctl status shithub-backup-daily.timer`
   and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely: the script ran but `rclone copyto` failed (creds,
   network). Re-run by hand:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an
   incident — every additional day widens the window of writes an
   actual recovery would lose.

docs/internal/runbooks/incidents.md (added)

1
# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the
anchors below. Keep procedures here short and actionable: what to
check first, how to mitigate, what to record. Postmortems live
elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns
502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if it is `active (running)`
   but the alert still fires, the metrics scrape is broken (check
   wg0).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look
   for migration failures (ExecStartPre), env file errors, port
   conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and
   open an incident.

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. User-visible
symptoms: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED` (sketched
   below), so a restart never loses or double-processes jobs.
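
For context, the canonical shape of a `SKIP LOCKED` claim loop. This
is a hypothetical sketch, not the worker's actual code; only the
`jobs(id, kind, state)` columns already used in this runbook are
assumed:

```sh
# Hypothetical sketch of the claim pattern, not the worker's real code.
# If claim + work + completion share one transaction, a crash rolls the
# claim back and the job returns to 'queued' for another worker.
psql -d shithub <<'SQL'
BEGIN;
SELECT id, kind FROM jobs
WHERE state = 'queued'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;   -- concurrent workers skip the locked row
-- ... process the job, then: UPDATE jobs SET state = 'done' WHERE id = ...;
COMMIT;
SQL
```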
32

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every
write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata.
   Check disk free first (`df -h /data`). If full, the most likely
   cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not
   restore from backup unless directed; a working PITR restore
   takes ~30 min and we shouldn't gamble.

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command
   failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore
   network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in
   time: this is an emergency. Page the operator. **Do not** run
   `pg_archivecleanup` manually unless directed — you can lose
   data.

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue:
   `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale out by
   running a second worker host (the `FOR UPDATE SKIP LOCKED`
   pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` rather than deleting it —
   we want the audit trail. See the sketch below.
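
A hypothetical sketch of that quarantine step; the job id is made up
and the guard on `state` is a suggestion:

```sh
# Hypothetical: quarantine a poison job by id. The id is illustrative;
# the state values match those used above.
psql -d shithub -c \
  "UPDATE jobs SET state = 'failed' WHERE id = 12345 AND state = 'queued';"
```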

docs/internal/runbooks/restore.md (added)

1
# Restore

Two restore modes:

- **Drill restore** — done quarterly into a temp PG instance to
  verify the backup chain works. Never touches the production DB.
  Procedure: the drill script. See "Drill" below.
- **Real restore** — used after a data-loss incident. This is the
  procedure in "Real restore" below. Read it once before you ever
  need it.

## Drill

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection; by default the
script cleans up after itself.

The script writes to `/var/log/shithub/restore-drill.log`. After a
successful drill, archive that log:

```sh
# note: date(1) has no quarter specifier, so compute <YYYY-QQ> by hand
q="$(date -u +%Y)-Q$(( ($(date -u +%-m) - 1) / 3 + 1 ))"
rclone copyto /var/log/shithub/restore-drill.log \
       spaces-prod:shithub-backups/drills/$q/restore-$(date -u +%Y%m%dT%H%M%SZ).log
```
36

## Real restore

Two scenarios. **Read the right one** — they have different recovery
points and different blast radii.

### A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

- The DB host is destroyed.
- Postgres can't start and we have no working WAL.
- We accept losing up to a day of writes (RPO = 24h).

Procedure:

1. Spin up a new db host (Ansible: `ANSIBLE_LIMIT=db-new make deploy
   --tags db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump /tmp/restore.dump
   ```
3. `pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump`
4. Run `shithubd migrate status` to confirm the schema version.
5. Bring the app up against the new DB (update `web.env` and
   `worker.env`). Verify with smoke queries from the drill suite
   (sketched after this list).
6. **Notify users** — there will be a visible gap in their
   activity. Don't make them figure it out.
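
For orientation, a hypothetical sketch of what drill-style smoke
queries look like; the table names are illustrative, not taken from
the schema:

```sh
# Hypothetical smoke queries: cheap existence/recency checks, not a
# full consistency audit. Table names are illustrative.
psql -d shithub <<'SQL'
SELECT count(*) > 0 AS has_users FROM users;
SELECT count(*) > 0 AS has_repos FROM repos;
SELECT now() - max(created_at) AS staleness FROM audit_log;
SQL
```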
64

### B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

- The data is intact in WAL but a destructive change happened at a
  known time (`DROP TABLE`, mass `UPDATE`, runaway worker).
- We need to roll the DB back to immediately before that moment.

This is harder; the procedure is:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker
   shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident
   into a fresh data directory.
3. Configure recovery (`recovery.signal` plus recovery GUCs on PG 12
   and later; `recovery.conf` exists only on older versions),
   pointing at the WAL archive in Spaces with
   `recovery_target_time = '<UTC timestamp just before incident>'`.
   A sketch follows this list.
4. Start Postgres in recovery mode, let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.
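
A hypothetical sketch of steps 3 to 5 on PG 12+, reusing the rclone
remote from the backup scripts. The timestamp is a placeholder, and
the promotion step is worth double-checking against the docs for our
exact PG version:

```sh
# Hypothetical sketch (PG 12+), not a tested procedure. Assumes the
# fresh data directory is $PGDATA and rclone creds are in place.
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time = '2025-06-01 12:34:00+00'   # placeholder: just before the incident
recovery_target_action = 'pause'                  # hold at the target for inspection
EOF
touch "$PGDATA/recovery.signal"
systemctl start postgresql
# once replay pauses at the target: sanity-check, then end recovery
sudo -u postgres psql -c "SELECT pg_promote();"
```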
85

If you have not done a PITR before: page someone who has; do not
freestyle this. Mistakes here lose the data you were trying to
save.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: any rows referencing IDs that no longer
  exist need a note in the incident postmortem.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
  if an attacker triggered the recovery, the secrets in the dump
  may be compromised.
- Force-rotate session epochs for any user whose session crossed
  the recovery window.

docs/internal/runbooks/rollback.md (added)

1
# Rollback

When to roll back vs. roll forward:

- **Roll back** when the new release is broken in a way that can't
  be hot-fixed in <30 min: a panic loop, an auth regression, data
  corruption.
- **Roll forward** when you can ship a hotfix faster than the
  rollback ceremony. A small bug behind a known feature flag is a
  roll-forward case.

If unsure, roll back. Bad releases compound.

## Code rollback (no migration involved)

```sh
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

The systemd unit's `Restart=on-failure` plus the binary swap means
the process flips on the next ExecStart. Connections in flight
finish; new connections hit the rolled-back binary.

## Code rollback when the new release added migrations

This is the dangerous case. There are three options, in order of
preference:

### 1. Schema-compatible rollback (best)

If the new migration only *added* columns/tables that the old code
ignores, the old code runs fine against the new schema. Just roll
the code back; leave the schema alone.

Most of our migrations are deliberately additive for this reason
(example below).
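
A hypothetical example of what "additive" means here; the table and
column are made up for illustration:

```sh
# Hypothetical additive migration: old code never references the new
# column, so it keeps working against the new schema unchanged.
psql -d shithub -c \
  "ALTER TABLE repos ADD COLUMN IF NOT EXISTS topic text;"
```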
37

### 2. Roll forward to a hotfix

If the migration changed semantics that the old code can't tolerate,
ship a hotfix on top of the new release rather than reversing the
migration.

### 3. Migration `down` + code rollback (last resort)

Only if (1) and (2) won't work and the data loss from `down` is
acceptable.

```sh
ssh web-01
sudo -u shithub /usr/local/bin/shithubd migrate down  # ONE step
# verify
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

`migrate down` rolls back exactly one step. **Never** chain `down`s
without checking each migration's down logic; some of them drop
columns and *will* lose data.

## After any rollback

- Note the rollback in the incident channel with the from-tag and
  to-tag.
- File a follow-up issue describing the failure mode.
- Disable any feature flags the bad release turned on.
- Confirm the rolled-back release passed CI (if not, you're now
  running untested code — that's a separate incident).

docs/internal/runbooks/upgrade.md (added)

1
# Upgrade

Routine release deploys. The deploy is one binary swap plus a
systemd restart; the only place upgrades get exciting is around DB
migrations and the occasional config schema change.

## Standard release

```sh
# from a clean checkout of the release tag
git fetch --tags
git checkout v<version>
make deploy-check ANSIBLE_INVENTORY=staging
make deploy ANSIBLE_INVENTORY=staging
# ... canary period ...
make deploy ANSIBLE_INVENTORY=production
```

`shithubd migrate up` runs as the web service's ExecStartPre, so
the binary that needs the new schema is also the one that applies
it. Order on each host: ExecStartPre runs migrations → web starts
on the new schema (sketched below).
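
For orientation, roughly what that ordering looks like in the unit.
This is a sketch, not the real unit file; `shithubd serve` as the
ExecStart command is an assumption:

```sh
# Inspect the deployed unit; the stanzas below sketch the relevant
# ordering ("shithubd serve" is an assumed ExecStart).
systemctl cat shithubd-web
# [Service]
# ExecStartPre=/usr/local/bin/shithubd migrate up
# ExecStart=/usr/local/bin/shithubd serve
# Restart=on-failure
```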
23

If a migration is long (>30s), call it out in the release notes
and time the deploy outside peak hours. The web service hangs in
"activating" until ExecStartPre finishes.

## Canary

We deploy to staging first and watch Grafana for 30 min. Things to
look at:

- p95 latency on the top routes (`shithubd-overview` dashboard).
- DB call rate — a 10× jump usually means a regressed N+1.
- Job queue depth — a stuck migration shows up here.
- Error logs in Loki: `{service="shithubd"} |~ "panic|ERROR"`.

If anything looks off, **do not** promote to production. Rollback
on staging is cheap; rollback on production is loud.

## Major schema changes (database)

If the release notes flag a major schema change:

1. Take a manual logical dump immediately before the deploy:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
2. Confirm it landed in Spaces.
3. Deploy to staging, run `make restore-drill` against the
   *post-deploy* dump to confirm the new schema restores cleanly.
4. Then production.

## Config schema changes

When a release adds a required env var, the binary refuses to start
and complains in the journal. Update
`deploy/ansible/roles/shithubd/templates/web.env.j2` (and
`worker.env.j2`), bump the inventory vars, and redeploy. There's no
separate migration step for env files.
+vars, redeploy. There's no separate migration step for env files.