tenseleyflow/shithub / df5d7fb

S37: docs — runbooks (incidents/backups/restore/upgrade/rollback)

Authored by espadonne
SHA      df5d7fb0e1b5f70ce8d9395fa0e6d9a6d1d26bf5
Parents  2aa998f
Tree     8eb8f53

5 changed files

Status  File                                  +   -
A       docs/internal/runbooks/backups.md    60   0
A       docs/internal/runbooks/incidents.md  74   0
A       docs/internal/runbooks/restore.md    99   0
A       docs/internal/runbooks/rollback.md   68   0
A       docs/internal/runbooks/upgrade.md    57   0

docs/internal/runbooks/backups.md (added)

# Backups

Two-layer scheme, both mandatory:

- **Continuous WAL archive** to `spaces-prod:shithub-wal` —
  `deploy/postgres/archive_command.sh` ships every WAL segment as
  Postgres rolls it (a sketch of such a wrapper follows this list).
  RPO is approximately one segment (~16MB or one `archive_timeout`).
- **Daily logical dump** to `spaces-prod:shithub-backups/daily/...`
  via `deploy/postgres/backup-daily.sh`. Keeps the most recent 7
  dumps on the db host for fast recovery; the bucket lifecycle
  keeps 90 days.
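
For reference, a minimal sketch of what such a wrapper could look
like, assuming rclone and the bucket layout above; the real
`deploy/postgres/archive_command.sh` may differ:

```sh
#!/bin/sh
# Hypothetical archive_command wrapper, not the real script.
# Postgres would invoke it as:
#   archive_command = '/usr/local/bin/archive_command.sh %p %f'
# It must exit non-zero on failure so Postgres retries the segment.
set -eu
wal_path="$1"   # %p: path to the WAL segment, relative to PGDATA
wal_name="$2"   # %f: file name of the segment

exec rclone --config /root/.config/rclone/rclone.conf \
     copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
```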
12

Cross-region mirror (`deploy/spaces/sync-cross-region.sh`) runs
hourly from the backup host into a second-region DR bucket.

## Verifying that backups are healthy

The monitoring stack does this for you:

- `BackupOverdue` alert fires if `time() -
  shithubd_backup_last_success_seconds > 30h`. The backup script
  pushes the timestamp to the metrics endpoint on success (sketched
  below).
- `pg_stat_archiver.failed_count > 0` is paged via the
  `archive-failing` runbook.
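
A minimal sketch of that success push, assuming a Pushgateway-style
endpoint; the host, port, and job label are assumptions, not taken
from the real script:

```sh
# Hypothetical: publish the success timestamp the BackupOverdue rule
# compares against. Endpoint and job label are assumed.
cat <<EOF | curl --data-binary @- http://pushgateway:9091/metrics/job/shithub-backup
shithubd_backup_last_success_seconds $(date -u +%s)
EOF
```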
25

If you want to confirm by hand:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```

## Quarterly restore drill

We verify the restore path **every quarter**. The calendar entry
lives in the team calendar; the procedure is in `restore.md`.

Required outputs of a successful drill:

1. `deploy/restore-drill/run.sh` exits 0.
2. The smoke queries pass.
3. The drill log has a "restore drill OK" line and is archived to
   `spaces-prod:shithub-backups/drills/<YYYY-QQ>/`.

If the drill fails: open an incident immediately. A failing drill
means our backups can't actually restore; we treat that as P0.

## Missed backup

**Symptom:** `BackupOverdue` alert.

1. SSH to the db host. `systemctl status shithub-backup-daily.timer`
   and `journalctl -u shithub-backup-daily.service -n 200`.
2. Most likely: the script ran but `rclone copyto` failed (creds,
   network). Re-run by hand:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
3. If the script has been failing silently for >24h, file an
   incident — every additional day widens the window of writes an
   actual recovery would lose.

docs/internal/runbooks/incidents.md (added)

1
# Incidents

Alert names in `deploy/monitoring/prometheus/rules.yml` link to the
anchors below. Keep procedures here short and actionable: what to
check first, how to mitigate, what to record. Postmortems live
elsewhere.

## shithubd-down

**Symptom:** `up{job="shithubd-web"} == 0` for >2m. Site returns
502/connection refused.

1. SSH to the affected host.
2. `systemctl status shithubd-web` — if it is `active (running)`
   but the alert still fires, the metrics scrape is broken (check
   wg0).
3. If failed: `journalctl -u shithubd-web -n 200 --no-pager`. Look
   for migration failures (ExecStartPre), env file errors, port
   conflicts.
4. Mitigate: `systemctl restart shithubd-web`.
5. If two restarts in a row fail, roll back per `rollback.md` and
   open an incident.

## worker-down

**Symptom:** `up{job="shithubd-worker"} == 0` for >5m. User-visible
symptoms: webhook deliveries stall, async fan-out lags.

1. `systemctl status shithubd-worker`.
2. `journalctl -u shithubd-worker -n 200`.
3. Restart. The job loop uses `FOR UPDATE SKIP LOCKED` (sketched
   below), so a restart never loses or double-processes jobs.
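
For context, the canonical shape of a `SKIP LOCKED` claim loop. This
is a hypothetical sketch, not the worker's actual code; only the
`jobs(id, kind, state)` columns already used in this runbook are
assumed:

```sh
# Hypothetical sketch of the claim pattern, not the worker's real code.
# If claim + work + completion share one transaction, a crash rolls the
# claim back and the job returns to 'queued' for another worker.
psql -d shithub <<'SQL'
BEGIN;
SELECT id, kind FROM jobs
WHERE state = 'queued'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;   -- concurrent workers skip the locked row
-- ... process the job, then: UPDATE jobs SET state = 'done' WHERE id = ...;
COMMIT;
SQL
```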
32

## postgres-down

**Symptom:** `up{job="postgres"} == 0`. Site returns 500 on every
write; reads through cache may still appear to work briefly.

1. SSH to the db host.
2. `systemctl status postgresql`.
3. If startup is failing on WAL replay: do **not** delete pgdata.
   Check disk free first (`df -h /data`). If full, the most likely
   cause is the WAL archiver failing — see `archive-failing` below.
4. If the data directory is fine, page the on-call DBA. Do not
   restore from backup unless directed; a working PITR restore
   takes ~30 min and we shouldn't gamble.

## archive-failing

**Symptom:** `pg_stat_archiver.failed_count` increasing, disk filling.

1. Look at `journalctl -u postgresql -n 200` for `archive_command
   failed` lines. The last few should print the rclone error.
2. Common causes: bucket creds rotated, network partition to Spaces.
3. Confirm: `sudo -u postgres rclone --config /root/.config/rclone/rclone.conf lsd spaces-prod:`.
4. Mitigation: fix the underlying issue (rotate creds, restore
   network). Postgres will retry automatically.
5. If disk is critically full and Spaces will not be reachable in
   time: this is an emergency. Page the operator. **Do not** run
   `pg_archivecleanup` manually unless directed — you can lose
   data.

## job-backlog

**Symptom:** `shithubd_job_queue_depth > 5000` for 15m.

1. Check what's in the queue:
   `psql -d shithub -c "SELECT kind, count(*) FROM jobs WHERE state='queued' GROUP BY kind ORDER BY 2 DESC;"`
2. If one kind dominates, look for a poison job in worker logs.
3. If the spread is even, the worker isn't keeping up — scale out by
   running a second worker host (the `FOR UPDATE SKIP LOCKED`
   pattern lets multiple workers coexist safely).
4. To purge a poison job: mark it `failed` rather than deleting it —
   we want the audit trail. See the sketch below.
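
A hypothetical sketch of that quarantine step; the job id is made up
and the guard on `state` is a suggestion:

```sh
# Hypothetical: quarantine a poison job by id. The id is illustrative;
# the state values match those used above.
psql -d shithub -c \
  "UPDATE jobs SET state = 'failed' WHERE id = 12345 AND state = 'queued';"
```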

docs/internal/runbooks/restore.md (added)

1
# Restore

Two restore modes:

- **Drill restore** — done quarterly into a temp PG instance to
  verify the backup chain works. Never touches the production DB.
  Procedure: the drill script. See "Drill" below.
- **Real restore** — used after a data-loss incident. This is the
  procedure in "Real restore" below. Read it once before you ever
  need it.

## Drill

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection; by default the
script cleans up after itself.

The script writes to `/var/log/shithub/restore-drill.log`. After a
successful drill, archive that log:

```sh
# note: date(1) has no quarter specifier, so compute <YYYY-QQ> by hand
q="$(date -u +%Y)-Q$(( ($(date -u +%-m) - 1) / 3 + 1 ))"
rclone copyto /var/log/shithub/restore-drill.log \
       spaces-prod:shithub-backups/drills/$q/restore-$(date -u +%Y%m%dT%H%M%SZ).log
```
36

## Real restore

Two scenarios. **Read the right one** — they have different recovery
points and different blast radii.

### A. Restore-from-daily (full DB lost, OK with up-to-24h data loss)

Use this when:

- The DB host is destroyed.
- Postgres can't start and we have no working WAL.
- We accept losing up to a day of writes (RPO = 24h).

Procedure:

1. Spin up a new db host (Ansible: `ANSIBLE_LIMIT=db-new make deploy
   --tags db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump /tmp/restore.dump
   ```
3. `pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump`
4. Run `shithubd migrate status` to confirm the schema version.
5. Bring the app up against the new DB (update `web.env` and
   `worker.env`). Verify with smoke queries from the drill suite
   (sketched after this list).
6. **Notify users** — there will be a visible gap in their
   activity. Don't make them figure it out.
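
For orientation, a hypothetical sketch of what drill-style smoke
queries look like; the table names are illustrative, not taken from
the schema:

```sh
# Hypothetical smoke queries: cheap existence/recency checks, not a
# full consistency audit. Table names are illustrative.
psql -d shithub <<'SQL'
SELECT count(*) > 0 AS has_users FROM users;
SELECT count(*) > 0 AS has_repos FROM repos;
SELECT now() - max(created_at) AS staleness FROM audit_log;
SQL
```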
64

### B. PITR (recover to a specific moment, e.g., before a bad migration)

Use this when:

- The data is intact in WAL but a destructive change happened at a
  known time (`DROP TABLE`, mass `UPDATE`, runaway worker).
- We need to roll the DB back to immediately before that moment.

This is harder; the procedure is:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker
   shithubd-cron.timer`). Do not let writes continue.
2. Restore the most recent base backup from before the incident
   into a fresh data directory.
3. Configure recovery (`recovery.signal` plus recovery GUCs on PG 12
   and later; `recovery.conf` exists only on older versions),
   pointing at the WAL archive in Spaces with
   `recovery_target_time = '<UTC timestamp just before incident>'`.
   A sketch follows this list.
4. Start Postgres in recovery mode, let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.
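
A hypothetical sketch of steps 3 to 5 on PG 12+, reusing the rclone
remote from the backup scripts. The timestamp is a placeholder, and
the promotion step is worth double-checking against the docs for our
exact PG version:

```sh
# Hypothetical sketch (PG 12+), not a tested procedure. Assumes the
# fresh data directory is $PGDATA and rclone creds are in place.
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time = '2025-06-01 12:34:00+00'   # placeholder: just before the incident
recovery_target_action = 'pause'                  # hold at the target for inspection
EOF
touch "$PGDATA/recovery.signal"
systemctl start postgresql
# once replay pauses at the target: sanity-check, then end recovery
sudo -u postgres psql -c "SELECT pg_promote();"
```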
85

If you have not done a PITR before: page someone who has; do not
freestyle this. Mistakes here lose the data you were trying to
save.

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: any rows referencing IDs that no longer
  exist need a note in the incident postmortem.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
  if an attacker triggered the recovery, the secrets in the dump
  may be compromised.
- Force-rotate session epochs for any user whose session crossed
  the recovery window.

docs/internal/runbooks/rollback.md (added)

1
# Rollback

When to roll back vs. roll forward:

- **Roll back** when the new release is broken in a way that can't
  be hot-fixed in <30 min: a panic loop, an auth regression, data
  corruption.
- **Roll forward** when you can ship a hotfix faster than the
  rollback ceremony. A small bug behind a known feature flag is a
  roll-forward case.

If unsure, roll back. Bad releases compound.

## Code rollback (no migration involved)

```sh
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

The systemd unit's `Restart=on-failure` plus the binary swap means
the process flips on the next ExecStart. Connections in flight
finish; new connections hit the rolled-back binary.

## Code rollback when the new release added migrations

This is the dangerous case. There are three options, in order of
preference:

### 1. Schema-compatible rollback (best)

If the new migration only *added* columns/tables that the old code
ignores, the old code runs fine against the new schema. Just roll
the code back; leave the schema alone.

Most of our migrations are deliberately additive for this reason
(example below).
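
A hypothetical example of what "additive" means here; the table and
column are made up for illustration:

```sh
# Hypothetical additive migration: old code never references the new
# column, so it keeps working against the new schema unchanged.
psql -d shithub -c \
  "ALTER TABLE repos ADD COLUMN IF NOT EXISTS topic text;"
```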
37

### 2. Roll forward to a hotfix

If the migration changed semantics that the old code can't tolerate,
ship a hotfix on top of the new release rather than reversing the
migration.

### 3. Migration `down` + code rollback (last resort)

Only if (1) and (2) won't work and the data loss from `down` is
acceptable.

```sh
ssh web-01
sudo -u shithub /usr/local/bin/shithubd migrate down  # ONE step
# verify
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

`migrate down` rolls back exactly one step. **Never** chain `down`s
without checking each migration's down logic; some of them drop
columns and *will* lose data.

## After any rollback

- Note the rollback in the incident channel with the from-tag and
  to-tag.
- File a follow-up issue describing the failure mode.
- Disable any feature flags the bad release turned on.
- Confirm the rolled-back release passed CI (if not, you're now
  running untested code — that's a separate incident).

docs/internal/runbooks/upgrade.md (added)

1
# Upgrade

Routine release deploys. The deploy is one binary swap plus a
systemd restart; the only place upgrades get exciting is around DB
migrations and the occasional config schema change.

## Standard release

```sh
# from a clean checkout of the release tag
git fetch --tags
git checkout v<version>
make deploy-check ANSIBLE_INVENTORY=staging
make deploy ANSIBLE_INVENTORY=staging
# ... canary period ...
make deploy ANSIBLE_INVENTORY=production
```

`shithubd migrate up` runs as the web service's ExecStartPre, so
the binary that needs the new schema is also the one that applies
it. Order on each host: ExecStartPre runs migrations → web starts
on the new schema (sketched below).
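
For orientation, roughly what that ordering looks like in the unit.
This is a sketch, not the real unit file; `shithubd serve` as the
ExecStart command is an assumption:

```sh
# Inspect the deployed unit; the stanzas below sketch the relevant
# ordering ("shithubd serve" is an assumed ExecStart).
systemctl cat shithubd-web
# [Service]
# ExecStartPre=/usr/local/bin/shithubd migrate up
# ExecStart=/usr/local/bin/shithubd serve
# Restart=on-failure
```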
23

If a migration is long (>30s), call it out in the release notes
and time the deploy outside peak hours. The web service hangs in
"activating" until ExecStartPre finishes.

## Canary

We deploy to staging first and watch Grafana for 30 min. Things to
look at:

- p95 latency on the top routes (`shithubd-overview` dashboard).
- DB call rate — a 10× jump usually means a regressed N+1.
- Job queue depth — a stuck migration shows up here.
- Error logs in Loki: `{service="shithubd"} |~ "panic|ERROR"`.

If anything looks off, **do not** promote to production. Rollback
on staging is cheap; rollback on production is loud.

## Major schema changes (database)

If the release notes flag a major schema change:

1. Take a manual logical dump immediately before the deploy:
   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
2. Confirm it landed in Spaces.
3. Deploy to staging, run `make restore-drill` against the
   *post-deploy* dump to confirm the new schema restores cleanly.
4. Then production.

## Config schema changes

When a release adds a required env var, the binary refuses to start
and complains in the journal. Update
`deploy/ansible/roles/shithubd/templates/web.env.j2` (and
`worker.env.j2`), bump the inventory vars, and redeploy. There's no
separate migration step for env files.
+vars, redeploy. There's no separate migration step for env files.