tenseleyflow/shithub / 76eb78b

S38: docs — runbooks (rotate-secrets/keys, regenerate-akc, drain-workers, read-only-mode)

Authored by espadonne
SHA: 76eb78b7ac16e918f879225ba9fa3409c0fcbe0e
Parents: ca28f68
Tree: bdfdb02

5 changed files

| Status | File                                     | +   | -   |
|--------|------------------------------------------|-----|-----|
| A      | docs/internal/runbooks/drain-workers.md  | 89  | 0   |
| A      | docs/internal/runbooks/read-only-mode.md | 81  | 0   |
| A      | docs/internal/runbooks/regenerate-akc.md | 82  | 0   |
| A      | docs/internal/runbooks/rotate-keys.md    | 82  | 0   |
| A      | docs/internal/runbooks/rotate-secrets.md | 123 | 0   |

docs/internal/runbooks/drain-workers.md (added)

# Drain workers for maintenance

The worker is graceful-shutdown-clean: SIGTERM finishes the
in-flight job and exits; queued jobs stay queued and a future
worker picks them up. This makes "drain for maintenance" a
near-no-op — but here's the procedure when you want to be
precise.

## Stop accepting new work but finish what's running

```sh
ssh worker-host
sudo systemctl stop shithubd-worker
```

The unit honors `Type=notify` + a 30-second `TimeoutStopSec`. The
worker:

1. Stops polling with its `FOR UPDATE SKIP LOCKED` leasing query
   (sketched below).
2. Finishes the currently-leased job (if any).
3. Exits 0.

Queued jobs stay in the `jobs` table with `state='queued'`. A
restarted worker (or a different worker on a different host)
will pick them up.
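
For orientation, the lease step is shaped roughly like the
following. This is a sketch, not the worker's actual query: the
columns (`state`, `leased_by`, `leased_at`, `kind`) match the
queries later in this runbook, while the ordering column, the
`RETURNING` list, and the example `leased_by` value are assumed.

```sql
-- Sketch: atomically claim one queued job, skipping rows other
-- workers have already locked.
UPDATE jobs
   SET state = 'processing',
       leased_by = 'worker-2:41893',   -- <host>:<pid>, hypothetical
       leased_at = now()
 WHERE id = (
   SELECT id FROM jobs
    WHERE state = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
 )
RETURNING id, kind;
```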

## Confirm drain

```sh
psql -d shithub -c "
  SELECT state, count(*) FROM jobs GROUP BY state;"
```

You're looking for a `processing` count of 0. If it's still > 0
a few seconds after `systemctl stop`, find the rogue worker:

```sh
psql -d shithub -c "
  SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
```

`leased_by` is a `<host>:<pid>` string; track down the host and
process and stop it.
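
For example, if `leased_by` reads `worker-2:41893` (values
hypothetical):

```sh
ssh worker-2
sudo kill -TERM 41893   # same graceful path as systemctl stop:
                        # finish the in-flight job, then exit
```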

## Drain only one job kind

If you need to do schema work that affects (e.g.) only the
search-reindex job, you can pause that kind without stopping the
worker:

```sql
UPDATE jobs
   SET state='paused'
 WHERE state='queued'
   AND kind='search.reindex';
```

The worker's leasing query filters out `paused`. Resume by
flipping back to `queued`.
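
The inverse of the pause statement above, spelled out:

```sql
-- Resume: return the paused kind to the queue.
UPDATE jobs
   SET state='queued'
 WHERE state='paused'
   AND kind='search.reindex';
```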

This is **not** the standard maintenance path; reach for a full
drain unless you have a reason to keep other kinds running.

## Hostile case: a stuck job

If a worker is leased on a job and the worker process is gone
(host crashed mid-execution), the lease times out and a new
worker re-leases the same job. The lease timeout is set per-kind
in the job handler registry; default 5 minutes.

To force re-lease sooner (e.g., the job is small and you want
it picked up now):

```sql
UPDATE jobs
   SET leased_at = NULL, leased_by = NULL, state = 'queued'
 WHERE id = <job-id>;
```

Leave an audit trail: note any manual intervention like this in
the incident channel.

## Resume

```sh
sudo systemctl start shithubd-worker
```

The worker starts polling immediately; queued jobs begin running
within seconds.

docs/internal/runbooks/read-only-mode.md (added)

# Read-only mode

shithub does not have a one-flag "read-only" toggle. When you
need one — disaster mid-recovery, surprise database failover,
controlled maintenance — these are the levers:

## The fastest "stop writes" lever

```sh
# At the edge: serve every write method as 503 with a banner.
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile.read-only
```

The repo ships a `Caddyfile.read-only` snippet that responds 503
to `POST`/`PUT`/`PATCH`/`DELETE`/`OPTIONS` while letting `GET`
and `HEAD` through. `git-receive-pack` is also blocked; clones
still work.
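
A quick outside-in check that the snippet took effect (hostname
and path hypothetical):

```sh
curl -s -o /dev/null -w '%{http_code}\n' https://shithub.example/
# expect 200
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://shithub.example/login
# expect 503
```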

Reverse with `caddy reload --config /etc/caddy/Caddyfile`.

User experience:

- Reads work normally.
- Writes return a 503 with an HTML page explaining "shithub is
  in read-only maintenance — try again later."
- Pushes fail with a git-side error.

This is the easiest reversal; nothing in the app changes.

## Stop the worker

If you're worried about *background* writes (job processing
mutating state), stop the worker too — see
[drain-workers.md](./drain-workers.md).

## Set the DB to read-only

Heavy hammer. `ALTER SYSTEM SET default_transaction_read_only =
on; SELECT pg_reload_conf();` makes every new connection
read-only. The web service will blow up on its first write
attempt — usually `ERROR: cannot execute INSERT in a read-only
transaction`.

This is only useful if you suspect a bug in the read-only mode
above is letting writes leak through. Otherwise skip it.

To revert: `ALTER SYSTEM SET default_transaction_read_only = off;
SELECT pg_reload_conf();`.
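
To confirm which mode new connections get at any point:

```sh
psql -d shithub -c "SHOW default_transaction_read_only;"
# "on" while the hammer is down, "off" after the revert
```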

## During a real DB failover

If you've failed over to a standby that's read-write and need
the app to talk to it:

1. Update the `db.url` inventory variable to point at the new
   primary.
2. `ANSIBLE_TAGS=app make deploy ANSIBLE_INVENTORY=production` —
   web/worker pick up the new connection string.

There's no replication tooling shipped today; if you have a
standby, you set it up out of band.

## What read-only mode does NOT do

- Stop the auth-audit log from writing — login attempts still
  hit the DB.
- Block the `shithubd hook` invocations from writing
  `push_events` (because the AKC + git transport still let the
  TCP connection through). To stop that, also stop the web
  service or block port 22.
- Warn logged-in users before they hit a failed write: sign-ups
  and other form posts are blocked at the edge, so the in-app
  banner (below) is the only signal a user sees before the 503.

## Communicating it

An in-app banner on every page during read-only mode is the
right default. Today this is via a feature flag
(`web.banner.text` config key); set it before flipping the
Caddyfile so users see the banner on the last successful render.
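
A sketch of that ordering. The config file path and web unit
name are assumptions (mirroring `ssh.env` and the worker unit
seen elsewhere in these runbooks); put `web.banner.text`
wherever it actually lives in your deploy:

```sh
# 1. Set the banner (path and unit name hypothetical).
echo 'web.banner.text=shithub is read-only for maintenance; writes will fail' \
  | sudo tee -a /etc/shithub/web.env
sudo systemctl restart shithubd-web

# 2. Then flip the edge to read-only.
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile.read-only
```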

docs/internal/runbooks/regenerate-akc.md (added)

# Regenerate AKC cache

The `AuthorizedKeysCommand` (`shithubd ssh-authkeys %f`) resolves
an offered SSH key fingerprint to a user via the
`user_ssh_keys` table. It does not maintain a write-through
cache today (every call hits the DB), so there's nothing to
"regenerate" in the sense of `cache.invalidate(...)`.

What this runbook actually covers: the **postgres-side**
caching that the AKC subprocess depends on, and the operational
state you might need to reset around it.

## When this matters

You may need to do something here if:

- A user's just-added SSH key is being rejected on push despite
  showing in their Settings.
- A removed SSH key is still being accepted (this would be a
  bug; don't shrug it off).
- The AKC subprocess is timing out under load and you need to
  understand what's slow.

## Diagnose first

```sh
# What's the AKC subcommand seeing?
sudo -u shithub-ssh /usr/local/bin/shithubd ssh-authkeys SHA256:abc...
# This prints what sshd would have read; empty output = no match.
```

Compare against the DB:

```sh
psql -d shithub -c "
  SELECT k.fingerprint, u.username
    FROM user_ssh_keys k
    JOIN users u ON u.id = k.user_id
   WHERE k.fingerprint = 'SHA256:abc...';"
```

Four possibilities:

| psql says | AKC says | Diagnosis                                      |
|-----------|----------|------------------------------------------------|
| match     | match    | sshd is broken, not us — check sshd logs.      |
| match     | empty    | The AKC subcommand isn't reading what we think — check `EnvironmentFile`, the binary path, and that the `shithub-ssh` user can read `/etc/shithub/ssh.env`. |
| empty     | empty    | Key isn't in the DB — user needs to re-add.    |
| empty     | match    | **Stale cache or replication lag.** See below. |

## "Empty in psql, match in AKC" — actually impossible today

We don't run a read replica for the AKC. If you ever see this,
the AKC subcommand is reading from a different DB than psql is
— check `db.url` in `ssh.env` vs the operator's psql connection.

## The remove-key-but-still-accepted case

A removed SSH key being accepted is **not** caused by a stale
cache (we don't have one); it's caused by one of:

- The key wasn't actually removed (check the DB, not the UI).
- sshd is using its own `authorized_keys` file in addition to
  AKC. Check `/etc/ssh/sshd_config` and per-user
  `~/.ssh/authorized_keys`. The shipped sshd config disables
  per-user files for the `git` user; if someone's customized,
  that's the leak.
- The session was already authenticated and is being held open.
  Kill it: `sudo systemctl status sshd`, find the per-connection
  process, `kill <pid>`.
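
To find those held-open sessions in one pass (assuming pushes
authenticate as the `git` user, per the sshd-config note above):

```sh
# Each live connection is a per-connection sshd process titled
# "sshd: git@...". The bracket trick keeps grep out of its own
# output.
ps aux | grep '[s]shd: git'
sudo kill <pid>   # pid from the output above
```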

## Future: add a cache here

If we ever add an in-process cache to AKC (we'd want to, to
shave the per-push DB call), invalidation becomes load-bearing:

- Cache key: SHA-256 fingerprint.
- Invalidate on: key add (insert) and key remove (delete).
- TTL: 60 seconds is the longest acceptable window between
  removing a key and the AKC ceasing to honor it.

When that lands, this runbook gets a real "regenerate" section.

docs/internal/runbooks/rotate-keys.md (added)

# Rotate signing keys

Distinct from `rotate-secrets.md` because signing keys have
external verifiers (subscribers, browsers, recipients) that need
to keep validating old artifacts during the rollover window.

## Webhook HMAC secrets

Per-webhook, not global. Each webhook has its own secret. The
owner rotates by editing the webhook in Settings → Webhooks →
the hook → "Rotate secret" → set new value.

- Subscriber MUST be updated to the new secret first (or accept
  both during the cutover).
- The previous secret is **not** kept; immediate rotation cuts
  off old-signature validation.
- For ops-driven mass rotation (suspected key store compromise):
  `shithubd webhook rotate-all` regenerates every secret. All
  subscribers will start getting `signature mismatch` until the
  owners update their endpoints.
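
For subscribers checking their side of a rotation, verification
generally looks like the sketch below. The hex encoding and
HMAC-SHA256 are assumptions here, not a statement of shithub's
actual signature scheme; `$WEBHOOK_SECRET`, `body.json`, and
`$received_signature` are stand-ins:

```sh
# Sketch: recompute the HMAC over the raw request body and
# compare with what the delivery carried.
expected=$(openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -r < body.json | cut -d' ' -f1)
if [ "$expected" = "$received_signature" ]; then
  echo "signature ok"
else
  echo "signature mismatch"   # what subscribers see mid-rotation
fi
```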

## Notification unsubscribe HMAC

The one-click unsubscribe links in emails are HMAC-signed. The
key lives in `notif.unsubscribe_key`. Rotation invalidates every
**old email's** unsubscribe link (a recipient with an unread
email containing a link from before rotation will get "invalid
link" if they click it after rotation).

Procedure:

1. Generate a new key: `openssl rand -base64 32`.
2. Replace `notif.unsubscribe_key` in `worker.env`.
3. Redeploy worker (`ANSIBLE_TAGS=app`).
4. Optional: add a banner to the in-app inbox letting users
   know unsubscribe links from old emails will not work.

## Cursor signing key

`internal/pagination/keyset` HMACs every cursor we hand out.
Rotation invalidates every cursor in flight. The user-visible
effect is: any open browser tab with a "Next page" link from
before the rotation will return "invalid cursor" → take them
back to page 1.

The procedure mirrors the unsubscribe key. Acceptable to do
without warning; the worst case is a user clicks "Next" and
gets sent back to the top of the list.

## TLS certificates (Caddy-managed)

Caddy obtains and renews certs from Let's Encrypt automatically.
Manual rotation is rarely needed; if it is:

```sh
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile
```

If Let's Encrypt is rate-limiting you, switch the
`caddy_use_acme_staging` inventory flag to `true`, redeploy,
verify, then flip back. Production cert reissue is constrained
by LE's per-week limits; mass rotation in a single hour will
fail.

## SSH host keys

The host keys (`/etc/ssh/ssh_host_*`) are sshd's identity to
clients. Rotating them prompts every git client with "WARNING:
REMOTE HOST IDENTIFICATION HAS CHANGED!" — disruptive.

Only rotate if the host has been compromised. Procedure:

1. Remove the old keys (`sudo rm /etc/ssh/ssh_host_*`), then
   regenerate with `ssh-keygen -A`. (`-A` only creates key types
   that are missing; it won't overwrite existing files.)
2. `systemctl restart sshd`.
3. Publish the new host-key fingerprints somewhere users can
   verify (status page, security advisory); collect them as
   shown below.
4. Users prune the old line from `~/.ssh/known_hosts`
   (`ssh-keygen -R <host>`) and re-accept on first reconnect.
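
To collect the fingerprints for step 3:

```sh
# Print the fingerprint of every regenerated host key.
for f in /etc/ssh/ssh_host_*_key.pub; do
  ssh-keygen -lf "$f"
done
```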

User-facing communication is critical here. Without it, you'll
look like a MITM attacker to every developer's SSH client.

docs/internal/runbooks/rotate-secrets.md (added)

# Rotate secrets

Quarterly cadence; sooner if compromise is suspected. The secret
classes:

| Secret                       | Where it lives                               | Rotation procedure                        |
|------------------------------|----------------------------------------------|-------------------------------------------|
| `session.key_b64`            | `web.env`                                    | See "Session signing key" below.          |
| `auth.totp_key_b64`          | `web.env`                                    | See "TOTP AEAD key" below.                |
| Postgres `shithub` password  | `web.env` + `worker.env` + Postgres role     | See "DB password" below.                  |
| Postgres `shithub_hook` pwd  | sshd env + `hook-role-grants.sql` apply env  | See "DB password" below.                  |
| S3 access keys               | `web.env` + `worker.env` + Spaces dashboard  | See "Object store credentials" below.     |
| Postmark / SMTP creds        | `web.env`                                    | One-step: replace, redeploy.              |
| Webhook AEAD key             | per-row encrypted; key in `worker.env`       | Two-step migration, see below.            |
| Operator SSH keys            | `~operator/.ssh/authorized_keys` per host    | Add new key, verify, remove old.          |

## Session signing key

The session key signs the cookie that authenticates a logged-in
session. Rotating it logs **every user out** because every
existing cookie's MAC stops verifying.

1. Generate a new key:
   ```sh
   openssl rand -base64 32
   ```
2. Update the inventory variable `session_key`. Keep the old key
   in a comment for one rotation cycle so you can revert.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
4. Verify: sign in to your own account with a fresh browser; the
   cookie set after sign-in is signed by the new key.

User-visible impact: every user is signed out. Notify in-band
before doing this if avoidable; do it without notice if the old
key may be compromised.

## TOTP AEAD key

The TOTP AEAD key encrypts every user's TOTP shared secret at
rest in the database. **Rotating this key requires a
re-encryption migration** — without it, every 2FA enrollment
becomes unreadable.

The procedure is:

1. Add the new key to `web.env` as `auth.totp_key_b64_next`
   alongside the existing `auth.totp_key_b64`.
2. Restart web (the package supports a "current + next" pair: it
   reads with current, falls back to next, writes with current).
3. Run the re-encryption job: `shithubd admin re-encrypt-totp
   --to-key=auth.totp_key_b64_next` (operator-only). This
   decrypts each row with the old key and re-encrypts with the
   new.
4. Promote `auth.totp_key_b64_next` to `auth.totp_key_b64` (drop
   the suffix), remove the old key.
5. Restart web.

Note that writes use the *current* key throughout, so any
enrollment created between steps 3 and 4 is still encrypted with
the old key; re-run the re-encryption job immediately before
promoting so nothing is stranded.

Do not skip step 3. Failing to re-encrypt before retiring the old
key locks every 2FA-enabled user out of their account; recovery
codes are the only path back in, and not everyone has them
saved.

## DB password

Rotate by adding a new password and removing the old, **without
downtime**.

1. As `postgres`:
   ```sql
   ALTER ROLE shithub WITH PASSWORD '<new>';
   ```
2. Update `db_password` in `web.env` and `worker.env`.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
   The web/worker units will restart and reconnect with the new
   password.
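
To spot-check the new password directly before calling it done
(the host placeholder is whatever your inventory points at):

```sh
# ALTER ROLE does not kill existing sessions; this confirms that
# *new* connections authenticate with the new password.
PGPASSWORD='<new>' psql -h <db-host> -U shithub -d shithub -c 'SELECT 1;'
```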

If you suspect the old password was leaked, do steps 1–3 in
sequence within minutes. Between (1) and (3) the running web
process keeps its already-open connections (which authenticated
under the old password), but any *new* connection it opens will
fail until the redeploy delivers the new password.

## Object store credentials

1. In the Spaces dashboard, generate a new access key with the
   same scope as the old.
2. Update inventory `s3_access_key_id` and `s3_secret_access_key`.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
4. Verify: trigger a webhook delivery (which writes a body
   snapshot) and confirm it lands in the bucket.
5. Once confirmed, revoke the old key in the Spaces dashboard.

Do not revoke the old key first; the running process will lose
access mid-flight.
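
A direct spot-check of the new key, independent of the app (any
S3-compatible client works against Spaces; the aws CLI, bucket,
and region here are stand-ins):

```sh
AWS_ACCESS_KEY_ID='<new id>' AWS_SECRET_ACCESS_KEY='<new secret>' \
  aws s3 ls s3://<bucket> --endpoint-url https://<region>.digitaloceanspaces.com
```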

## Webhook AEAD key

The webhook secret AEAD key encrypts every webhook's secret at
rest. Rotation is two-step like TOTP:

1. Add `webhook.aead_key_next` alongside `webhook.aead_key`.
2. Run `shithubd admin re-encrypt-webhooks
   --to-key=webhook.aead_key_next`.
3. Promote and restart.

Failing to re-encrypt before retiring the old key disables every
webhook (the auto-disable logic kicks in on first decrypt
failure).

## Operator SSH keys

Standard procedure: add the new key to every host's
`~operator/.ssh/authorized_keys`, log in with the new key to
confirm, remove the old. Ansible's `authorized_key` module makes
this idempotent; the `base` role will pick up changes if the
inventory's `operator_ssh_keys` list is the source of truth.

## Audit

Every rotation is logged in the host's journal (the deploy run's
output) and, for DB rotations, in whatever sampling of
`pg_stat_activity` your monitoring retains. There's no
centralized rotation log; if you want one, capture each rotation
in your team's incident channel with date + class + reason.