S38: docs — runbooks (rotate-secrets/keys, regenerate-akc, drain-workers, read-only-mode)

SHA: 76eb78b7ac16e918f879225ba9fa3409c0fcbe0e
Parent: ca28f68
Tree: bdfdb02

| Status | File | + | - |
|---|---|---|---|
| A | docs/internal/runbooks/drain-workers.md | 89 | 0 |
| A | docs/internal/runbooks/read-only-mode.md | 81 | 0 |
| A | docs/internal/runbooks/regenerate-akc.md | 82 | 0 |
| A | docs/internal/runbooks/rotate-keys.md | 82 | 0 |
| A | docs/internal/runbooks/rotate-secrets.md | 123 | 0 |

---

**docs/internal/runbooks/drain-workers.md** (added)

# Drain workers for maintenance

The worker is graceful-shutdown-clean: SIGTERM finishes the
in-flight job and exits; queued jobs stay queued and a future
worker picks them up. This makes "drain for maintenance" a
near-no-op — but here's the procedure when you want to be
precise.

## Stop accepting new work but finish what's running

```sh
ssh worker-host
sudo systemctl stop shithubd-worker
```

The unit honors `Type=notify` + a 30-second `TimeoutStopSec`. The
worker:

1. Stops issuing its `FOR UPDATE SKIP LOCKED` leasing query.
2. Finishes the currently-leased job (if any).
3. Exits 0.

Queued jobs stay in the `jobs` table with `state='queued'`. A
restarted worker (or a different worker on a different host)
will pick them up.

## Confirm drain

```sh
psql -d shithub -c "
  SELECT state, count(*) FROM jobs GROUP BY state;"
```

You're looking for `processing=0`. If it's still > 0 a few
seconds after `systemctl stop`, find the rogue worker:

```sh
psql -d shithub -c "
  SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
```

`leased_by` is a `<host>:<pid>` string; track down the host and
process and stop it.
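
For scripting, the `<host>:<pid>` string splits cleanly with
shell parameter expansion (the value below is a made-up
example, not real output):

```sh
# Hypothetical leased_by value — substitute the one from the query above.
leased_by="worker-02:41523"
host="${leased_by%%:*}"   # everything before the first colon
pid="${leased_by##*:}"    # everything after the last colon
echo "host=$host pid=$pid"
```

From there it's `ssh "$host"` and a `kill` of that PID.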

## Drain only one job kind

If you need to do schema work that affects (e.g.) only the
search-reindex job, you can pause that kind without stopping the
worker:

```sql
UPDATE jobs
   SET state='paused'
 WHERE state='queued'
   AND kind='search.reindex';
```

The worker's leasing query filters out `paused`. Resume by
flipping back to `queued`.
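
The resume is the exact inverse (sketch; same table and kind
assumptions as the pause above):

```sql
UPDATE jobs
   SET state='queued'
 WHERE state='paused'
   AND kind='search.reindex';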

This is **not** the standard maintenance path; reach for a full
drain unless you have a reason to keep other kinds running.

## Hostile case: a stuck job

If a worker is leased on a job and the worker process is gone
(host crashed mid-execution), the lease times out and a new
worker re-leases the same job. The lease timeout is set per-kind
in the job handler registry; default 5 minutes.
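
To see which leases have already outlived the default timeout (a
sketch; widen the interval for kinds with a longer per-kind
timeout):

```sql
SELECT id, kind, leased_by, now() - leased_at AS held_for
  FROM jobs
 WHERE state = 'processing'
   AND leased_at < now() - interval '5 minutes';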

To force re-lease sooner (e.g., the job is small and you want
it picked up now):

```sql
UPDATE jobs
   SET leased_at = NULL, leased_by = NULL, state = 'queued'
 WHERE id = <job-id>;
```

Leave an audit trail: note the manual intervention in the
incident channel.

## Resume

```sh
sudo systemctl start shithubd-worker
```

The worker starts polling immediately; queued jobs begin running
within seconds.

---

**docs/internal/runbooks/read-only-mode.md** (added)

# Read-only mode

shithub does not have a one-flag "read-only" toggle. When you
need one — disaster mid-recovery, surprise database failover,
controlled maintenance — these are the levers:

## The fastest "stop writes" lever

```sh
# At the edge: answer every write method with a 503 and a banner.
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile.read-only
```

The repo ships a `Caddyfile.read-only` snippet that responds 503
to `POST`/`PUT`/`PATCH`/`DELETE`/`OPTIONS` while letting `GET`
and `HEAD` through. `git-receive-pack` is also blocked; clones
still work.
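
For orientation, the method-matching part of such a snippet
could look like this (a sketch inside the site block, not the
shipped file; the real `Caddyfile.read-only` is authoritative):

```
@writes method POST PUT PATCH DELETE OPTIONS
handle @writes {
    # Answer every write as a 503 with a maintenance message.
    respond "shithub is in read-only maintenance; try again later." 503
}
```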

Reverse with `caddy reload --config /etc/caddy/Caddyfile`.

User experience:

- Reads work normally.
- Writes return a 503 with an HTML page explaining "shithub is
  in read-only maintenance — try again later."
- Pushes fail with a git-side error.

This is the easiest reversal; nothing in the app changes.

## Stop the worker

If you're worried about *background* writes (job processing
mutating state), stop the worker too — see
[drain-workers.md](./drain-workers.md).

## Set the DB to read-only

Heavy hammer. `ALTER SYSTEM SET default_transaction_read_only =
on; SELECT pg_reload_conf();` makes every new connection
read-only. The web service will blow up on its first write
attempt — usually `ERROR: cannot execute INSERT in a read-only
transaction`.

This is only useful if you suspect a bug in the read-only mode
above is letting writes leak through. Otherwise skip it.

To revert: `ALTER SYSTEM SET default_transaction_read_only = off;
SELECT pg_reload_conf();`.

## During a real DB failover

If you've failed over to a standby that's read-write and need
the app to talk to it:

1. Update the `db.url` inventory variable to point at the new
   primary.
2. `ANSIBLE_TAGS=app make deploy ANSIBLE_INVENTORY=production` —
   web/worker pick up the new connection string.

There's no replication tooling shipped today; if you have a
standby, you set it up out of band.

## What read-only mode does NOT do

- Stop the auth-audit log from writing — login attempts still
  hit the DB.
- Block the `shithubd hook` invocations from writing
  `push_events` (because the AKC + git transport still let the
  TCP connection through). To stop that, also stop the web
  service or block port 22.
- Explain itself beyond the 503. Sign-ups and other POSTs are
  rejected at the edge like any other write; the in-app banner
  is the only signal a logged-in user gets.

## Communicating it

An in-app banner on every page during read-only mode is the
right default. Today this is via a feature flag (`web.banner.text`
config key); set it before flipping the Caddyfile so users see
the banner on the last successful render.

---

**docs/internal/runbooks/regenerate-akc.md** (added)

# Regenerate AKC cache

The `AuthorizedKeysCommand` (`shithubd ssh-authkeys %f`) resolves
an offered SSH key fingerprint to a user via the
`user_ssh_keys` table. It does not maintain a write-through
cache today (every call hits the DB), so there's nothing to
"regenerate" in the sense of `cache.invalidate(...)`.

What this runbook actually covers: the **postgres-side**
caching that the AKC subprocess depends on, and the operational
state you might need to reset around it.

## When this matters

You may need to do something here if:

- A user's just-added SSH key is being rejected on push despite
  showing in their Settings.
- A removed SSH key is still being accepted (this would be a
  bug; don't shrug it off).
- The AKC subprocess is timing out under load and you need to
  understand what's slow.

## Diagnose first

```sh
# What's the AKC subcommand seeing?
sudo -u shithub-ssh /usr/local/bin/shithubd ssh-authkeys SHA256:abc...
# This prints what sshd would have read; empty output = no match.
```

Compare against the DB:

```sh
psql -d shithub -c "
  SELECT k.fingerprint, u.username
  FROM user_ssh_keys k
  JOIN users u ON u.id = k.user_id
  WHERE k.fingerprint = 'SHA256:abc...';"
```

Four possibilities:

| psql says | AKC says | Diagnosis |
|-----------|----------|------------------------------------------------|
| match | match | sshd is broken, not us — check sshd logs. |
| match | empty | The AKC subcommand isn't reading what we think — check `EnvironmentFile`, the binary path, and that the `shithub-ssh` user can read `/etc/shithub/ssh.env`. |
| empty | empty | Key isn't in the DB — user needs to re-add. |
| empty | match | **Stale cache or replication lag.** See below. |

## "Empty in psql, match in AKC" — actually impossible today

We don't run a read replica for the AKC. If you ever see this,
the AKC subcommand is reading from a different DB than psql is
— check `db.url` in `ssh.env` vs the operator's psql connection.

## The remove-key-but-still-accepted case

A removed SSH key being accepted is **not** caused by a stale
cache (we don't have one); it's caused by one of:

- The key wasn't actually removed (check the DB, not the UI).
- sshd is using its own `authorized_keys` file in addition to
  AKC. Check `/etc/ssh/sshd_config` and per-user
  `~/.ssh/authorized_keys`. The shipped sshd config disables
  per-user files for the `git` user; if someone's customized,
  that's the leak.
- The session was already authenticated and is being held open.
  Kill it: `sudo systemctl status sshd`, find the per-conn
  process, `kill <pid>`.

## Future: add a cache here

If we ever add an in-process cache to AKC (we'd want to, to
shave the per-push DB call), invalidation becomes load-bearing:

- Cache key: SHA-256 fingerprint.
- Invalidate on: key add (insert) and key remove (delete).
- TTL: 60 seconds is the longest acceptable window between
  removing a key and the AKC no longer honoring it.

When that lands, this runbook gets a real "regenerate" section.

---

**docs/internal/runbooks/rotate-keys.md** (added)

# Rotate signing keys

Distinct from `rotate-secrets.md` because signing keys have
external verifiers (subscribers, browsers, recipients) that need
to keep validating old artifacts during the rollover window.

## Webhook HMAC secrets

Per-webhook, not global. Each webhook has its own secret. The
owner rotates by editing the webhook in Settings → Webhooks →
the hook → "Rotate secret" → set new value.

- The subscriber MUST be updated to the new secret first (or
  accept both during the cutover).
- The previous secret is **not** kept; rotation immediately cuts
  off old-signature validation.
- For ops-driven mass rotation (suspected key store compromise):
  `shithubd webhook rotate-all` regenerates every secret. All
  subscribers will start getting `signature mismatch` until the
  owners update their endpoints.
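
On the subscriber side, recomputing the signature is a
one-liner. A sketch, assuming the scheme is hex HMAC-SHA256 over
the raw request body (the secret, body, and scheme here are
illustrative assumptions; check the webhook docs for the actual
header name and format):

```sh
# Assumed values for illustration only.
secret='whsec_example'
body='{"event":"push"}'

# Hex HMAC-SHA256 of the body; compare against the delivery's signature header.
sig=$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
echo "$sig"
```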

## Notification unsubscribe HMAC

The one-click unsubscribe links in emails are HMAC-signed. The
key lives in `notif.unsubscribe_key`. Rotation invalidates every
**old email's** unsubscribe link (a recipient with an unread
email containing a link from before rotation will get "invalid
link" if they click it after rotation).

Procedure:

1. Generate a new key: `openssl rand -base64 32`.
2. Replace `notif.unsubscribe_key` in `worker.env`.
3. Redeploy the worker (`ANSIBLE_TAGS=app`).
4. Optional: add a banner to the in-app inbox letting users
   know unsubscribe links from old emails will not work.

## Cursor signing key

`internal/pagination/keyset` HMACs every cursor we hand out.
Rotation invalidates every cursor in flight. The user-visible
effect: any open browser tab with a "Next page" link from
before the rotation will return "invalid cursor" → take them
back to page 1.

Procedure mirrors the unsubscribe key. Acceptable to do without
warning; the worst case is a user clicks "Next" and gets sent
back to the top of the list.

## TLS certificates (Caddy-managed)

Caddy obtains and renews certs from Let's Encrypt automatically.
Manual rotation is rarely needed; if it is:

```sh
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile
```

If Let's Encrypt is rate-limiting you, switch the
`caddy_use_acme_staging` inventory flag to `true`, redeploy,
verify, then flip back. Production cert reissue is constrained
by LE's per-week limits; mass rotation in a single hour will
fail.

## SSH host keys

The host keys (`/etc/ssh/ssh_host_*`) are sshd's identity to
clients. Rotating them prompts every git client with "WARNING:
REMOTE HOST IDENTIFICATION HAS CHANGED!" — disruptive.

Only rotate if the host has been compromised. Procedure:

1. `ssh-keygen -A` on the host (regenerates all host keys).
2. `systemctl restart sshd`.
3. Publish the new host-key fingerprints somewhere users can
   verify (status page, security advisory).
4. Users prune the old line from `~/.ssh/known_hosts` and
   re-accept on first reconnect.
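
Collecting the fingerprints to publish (step 3) can be
scripted; a sketch using sshd's default key directory:

```sh
# Print the fingerprint of every public host key, for the status page.
keydir=/etc/ssh
for pub in "$keydir"/ssh_host_*_key.pub; do
  ssh-keygen -lf "$pub"
done
```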

User-facing communication is critical here. Without it, you'll
look like a MITM attacker to every developer's SSH client.

---

**docs/internal/runbooks/rotate-secrets.md** (added)

# Rotate secrets

Quarterly cadence; sooner if compromise is suspected. The secret
classes:

| Secret | Where it lives | Rotation procedure |
|---|---|---|
| `session.key_b64` | `web.env` | See "Session signing key" below. |
| `auth.totp_key_b64` | `web.env` | See "TOTP AEAD key" below. |
| Postgres `shithub` password | `web.env` + `worker.env` + Postgres role | See "DB password" below. |
| Postgres `shithub_hook` password | sshd env + `hook-role-grants.sql` apply env | See "DB password" below. |
| S3 access keys | `web.env` + `worker.env` + Spaces dashboard | See "Object store credentials" below. |
| Postmark / SMTP creds | `web.env` | One-step: replace, redeploy. |
| Webhook AEAD key | per-row encrypted; key in `worker.env` | Two-step migration, see below. |
| Operator SSH keys | `~operator/.ssh/authorized_keys` per host | Add new key, verify, remove old. |

## Session signing key

The session key signs the cookie that authenticates a logged-in
session. Rotating it logs **every user out** because every
existing cookie's MAC stops verifying.

1. Generate a new key:
   ```sh
   openssl rand -base64 32
   ```
2. Update the inventory variable `session_key`. Keep the old key
   in a comment for one rotation cycle so you can revert.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
4. Verify: sign in to your own account with a fresh browser; the
   cookie set after sign-in is signed by the new key.
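
Before updating the inventory, it's worth sanity-checking that
the generated value decodes to the expected 32 bytes (a sketch;
the 32-byte requirement is inferred from the `rand -base64 32`
step above):

```sh
key=$(openssl rand -base64 32)
# base64 -d round-trips the key; wc -c counts the decoded bytes.
decoded_len=$(printf '%s' "$key" | base64 -d | wc -c)
[ "$decoded_len" -eq 32 ] && echo "key ok: 32 bytes"
```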

User-visible impact: every user is signed out. Notify in-band
before doing this if avoidable; do it without notice if the old
key may be compromised.

## TOTP AEAD key

The TOTP AEAD key encrypts every user's TOTP shared secret at
rest in the database. **Rotating this key requires a
re-encryption migration** — without it, every 2FA enrollment
becomes unreadable.

The procedure is:

1. Add the new key to `web.env` as `auth.totp_key_b64_next`
   alongside the existing `auth.totp_key_b64`.
2. Restart web (the package supports a "current + next" pair: it
   reads with current, falls back to next, writes with current).
3. Run the re-encryption job: `shithubd admin re-encrypt-totp
   --to-key=auth.totp_key_b64_next` (operator-only). This
   decrypts each row with the old key and re-encrypts with the
   new.
4. Promote `auth.totp_key_b64_next` to `auth.totp_key_b64` (drop
   the suffix), remove the old key.
5. Restart web.

Do not skip step 3. Failing to re-encrypt before retiring the old
key locks every 2FA-enabled user out of their account; recovery
codes are the only path back in, and not everyone has them
saved.

## DB password

Rotate by adding a new password and removing the old, **without
downtime**.

1. As `postgres`:
   ```sql
   ALTER ROLE shithub WITH PASSWORD '<new>';
   ```
2. Update `db_password` in `web.env` and `worker.env`.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
   The web/worker units will restart and reconnect with the new
   password.

If you suspect the old password was leaked, do steps 1–3 in
sequence within minutes — between (1) and (3) the running web
process still holds connections it opened under the old
password, but any new connection will use the new one.
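
After the redeploy, you can confirm the app reconnected by
looking at the live backends for the role (a sketch;
`pg_stat_activity` is standard postgres, the role name comes
from the table above):

```sql
SELECT usename, state, count(*)
  FROM pg_stat_activity
 WHERE usename = 'shithub'
 GROUP BY usename, state;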

## Object store credentials

1. In the Spaces dashboard, generate a new access key with the
   same scope as the old.
2. Update inventory `s3_access_key_id` and `s3_secret_access_key`.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
4. Verify: trigger a webhook delivery (which writes a body
   snapshot) and confirm it lands in the bucket.
5. Once confirmed, revoke the old key in the Spaces dashboard.

Do not revoke the old key first; the running process will lose
access mid-flight.

## Webhook AEAD key

The webhook secret AEAD key encrypts every webhook's secret at
rest. Rotation is two-step like TOTP:

1. Add `webhook.aead_key_next` alongside `webhook.aead_key`.
2. Run `shithubd admin re-encrypt-webhooks
   --to-key=webhook.aead_key_next`.
3. Promote and restart.

Failing to re-encrypt before retiring the old key disables every
webhook (the auto-disable logic kicks in on first decrypt
failure).

## Operator SSH keys

Standard procedure: add the new key to every host's
`~operator/.ssh/authorized_keys`, log in with the new key to
confirm, remove the old. Ansible's `authorized_key` module makes
this idempotent; the `base` role will pick up changes if the
inventory's `operator_ssh_keys` list is the source of truth.

## Audit

Every rotation is logged in the host's journal (the deploy run's
output) and, for DB rotations, in `pg_stat_activity` history if
your retention allows. There's no centralized rotation log; if
you want one, capture each rotation in your team's incident
channel with date + class + reason.