tenseleyflow/shithub / 76eb78b

S38: docs — runbooks (rotate-secrets/keys, regenerate-akc, drain-workers, read-only-mode)

Authored by espadonne
SHA: 76eb78b7ac16e918f879225ba9fa3409c0fcbe0e
Parents: ca28f68
Tree: bdfdb02

5 changed files

| Status | File                                     | +   | -   |
|--------|------------------------------------------|-----|-----|
| A      | docs/internal/runbooks/drain-workers.md  | 89  | 0   |
| A      | docs/internal/runbooks/read-only-mode.md | 81  | 0   |
| A      | docs/internal/runbooks/regenerate-akc.md | 82  | 0   |
| A      | docs/internal/runbooks/rotate-keys.md    | 82  | 0   |
| A      | docs/internal/runbooks/rotate-secrets.md | 123 | 0   |

docs/internal/runbooks/drain-workers.md (added)

# Drain workers for maintenance

The worker is graceful-shutdown-clean: SIGTERM finishes the
in-flight job and exits; queued jobs stay queued and a future
worker picks them up. This makes "drain for maintenance" a
near-no-op — but here's the procedure when you want to be
precise.

## Stop accepting new work but finish what's running

```sh
ssh worker-host
sudo systemctl stop shithubd-worker
```

The unit honors `Type=notify` + a 30-second `TimeoutStopSec`. The
worker:

1. Stops polling with its `FOR UPDATE SKIP LOCKED` leasing query
   (sketched below).
2. Finishes the currently-leased job (if any).
3. Exits 0.

Queued jobs stay in the `jobs` table with `state='queued'`. A
restarted worker (or a different worker on a different host)
will pick them up.
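
For orientation, the lease step is shaped roughly like the
following. This is a sketch, not the worker's actual query: the
columns (`state`, `leased_by`, `leased_at`, `kind`) match the
queries later in this runbook, while the ordering column, the
`RETURNING` list, and the example `leased_by` value are assumed.

```sql
-- Sketch: atomically claim one queued job, skipping rows other
-- workers have already locked.
UPDATE jobs
   SET state = 'processing',
       leased_by = 'worker-2:41893',   -- <host>:<pid>, hypothetical
       leased_at = now()
 WHERE id = (
   SELECT id FROM jobs
    WHERE state = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
 )
RETURNING id, kind;
```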

## Confirm drain

```sh
psql -d shithub -c "
  SELECT state, count(*) FROM jobs GROUP BY state;"
```

You're looking for a `processing` count of 0. If it's still > 0
a few seconds after `systemctl stop`, find the rogue worker:

```sh
psql -d shithub -c "
  SELECT id, kind, leased_by, leased_at FROM jobs WHERE state='processing';"
```

`leased_by` is a `<host>:<pid>` string; track down the host and
process and stop it.
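
For example, if `leased_by` reads `worker-2:41893` (values
hypothetical):

```sh
ssh worker-2
sudo kill -TERM 41893   # same graceful path as systemctl stop:
                        # finish the in-flight job, then exit
```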

## Drain only one job kind

If you need to do schema work that affects (e.g.) only the
search-reindex job, you can pause that kind without stopping the
worker:

```sql
UPDATE jobs
   SET state='paused'
 WHERE state='queued'
   AND kind='search.reindex';
```

The worker's leasing query filters out `paused`. Resume by
flipping back to `queued`.
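
The inverse of the pause statement above, spelled out:

```sql
-- Resume: return the paused kind to the queue.
UPDATE jobs
   SET state='queued'
 WHERE state='paused'
   AND kind='search.reindex';
```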

This is **not** the standard maintenance path; reach for a full
drain unless you have a reason to keep other kinds running.

## Hostile case: a stuck job

If a worker is leased on a job and the worker process is gone
(host crashed mid-execution), the lease times out and a new
worker re-leases the same job. The lease timeout is set per-kind
in the job handler registry; default 5 minutes.

To force re-lease sooner (e.g., the job is small and you want
it picked up now):

```sql
UPDATE jobs
   SET leased_at = NULL, leased_by = NULL, state = 'queued'
 WHERE id = <job-id>;
```

Leave an audit trail: note any manual intervention like this in
the incident channel.

## Resume

```sh
sudo systemctl start shithubd-worker
```

The worker starts polling immediately; queued jobs begin running
within seconds.

docs/internal/runbooks/read-only-mode.md (added)

# Read-only mode

shithub does not have a one-flag "read-only" toggle. When you
need one — disaster mid-recovery, surprise database failover,
controlled maintenance — these are the levers:

## The fastest "stop writes" lever

```sh
# At the edge: serve every write method as 503 with a banner.
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile.read-only
```

The repo ships a `Caddyfile.read-only` snippet that responds 503
to `POST`/`PUT`/`PATCH`/`DELETE`/`OPTIONS` while letting `GET`
and `HEAD` through. `git-receive-pack` is also blocked; clones
still work.
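
A quick outside-in check that the snippet took effect (hostname
and path hypothetical):

```sh
curl -s -o /dev/null -w '%{http_code}\n' https://shithub.example/
# expect 200
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://shithub.example/login
# expect 503
```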

Reverse with `caddy reload --config /etc/caddy/Caddyfile`.

User experience:

- Reads work normally.
- Writes return a 503 with an HTML page explaining "shithub is
  in read-only maintenance — try again later."
- Pushes fail with a git-side error.

This is the easiest reversal; nothing in the app changes.

## Stop the worker

If you're worried about *background* writes (job processing
mutating state), stop the worker too — see
[drain-workers.md](./drain-workers.md).

## Set the DB to read-only

Heavy hammer. `ALTER SYSTEM SET default_transaction_read_only =
on; SELECT pg_reload_conf();` makes every new connection
read-only. The web service will blow up on its first write
attempt — usually `ERROR: cannot execute INSERT in a read-only
transaction`.

This is only useful if you suspect a bug in the read-only mode
above is letting writes leak through. Otherwise skip it.

To revert: `ALTER SYSTEM SET default_transaction_read_only = off;
SELECT pg_reload_conf();`.
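
To confirm which mode new connections get at any point:

```sh
psql -d shithub -c "SHOW default_transaction_read_only;"
# "on" while the hammer is down, "off" after the revert
```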

## During a real DB failover

If you've failed over to a standby that's read-write and need
the app to talk to it:

1. Update the `db.url` inventory variable to point at the new
   primary.
2. `ANSIBLE_TAGS=app make deploy ANSIBLE_INVENTORY=production` —
   web/worker pick up the new connection string.

There's no replication tooling shipped today; if you have a
standby, you set it up out of band.

## What read-only mode does NOT do

- Stop the auth-audit log from writing — login attempts still
  hit the DB.
- Block the `shithubd hook` invocations from writing
  `push_events` (because the AKC + git transport still let the
  TCP connection through). To stop that, also stop the web
  service or block port 22.
- Warn logged-in users before they hit a failed write: sign-ups
  and other form posts are blocked at the edge, so the in-app
  banner (below) is the only signal a user sees before the 503.

## Communicating it

An in-app banner on every page during read-only mode is the
right default. Today this is via a feature flag
(`web.banner.text` config key); set it before flipping the
Caddyfile so users see the banner on the last successful render.
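
A sketch of that ordering. The config file path and web unit
name are assumptions (mirroring `ssh.env` and the worker unit
seen elsewhere in these runbooks); put `web.banner.text`
wherever it actually lives in your deploy:

```sh
# 1. Set the banner (path and unit name hypothetical).
echo 'web.banner.text=shithub is read-only for maintenance; writes will fail' \
  | sudo tee -a /etc/shithub/web.env
sudo systemctl restart shithubd-web

# 2. Then flip the edge to read-only.
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile.read-only
```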

docs/internal/runbooks/regenerate-akc.md (added)

# Regenerate AKC cache

The `AuthorizedKeysCommand` (`shithubd ssh-authkeys %f`) resolves
an offered SSH key fingerprint to a user via the
`user_ssh_keys` table. It does not maintain a write-through
cache today (every call hits the DB), so there's nothing to
"regenerate" in the sense of `cache.invalidate(...)`.

What this runbook actually covers: the **postgres-side**
caching that the AKC subprocess depends on, and the operational
state you might need to reset around it.

## When this matters

You may need to do something here if:

- A user's just-added SSH key is being rejected on push despite
  showing in their Settings.
- A removed SSH key is still being accepted (this would be a
  bug; don't shrug it off).
- The AKC subprocess is timing out under load and you need to
  understand what's slow.

## Diagnose first

```sh
# What's the AKC subcommand seeing?
sudo -u shithub-ssh /usr/local/bin/shithubd ssh-authkeys SHA256:abc...
# This prints what sshd would have read; empty output = no match.
```

Compare against the DB:

```sh
psql -d shithub -c "
  SELECT k.fingerprint, u.username
    FROM user_ssh_keys k
    JOIN users u ON u.id = k.user_id
   WHERE k.fingerprint = 'SHA256:abc...';"
```

Four possibilities:

| psql says | AKC says | Diagnosis                                      |
|-----------|----------|------------------------------------------------|
| match     | match    | sshd is broken, not us — check sshd logs.      |
| match     | empty    | The AKC subcommand isn't reading what we think — check `EnvironmentFile`, the binary path, and that the `shithub-ssh` user can read `/etc/shithub/ssh.env`. |
| empty     | empty    | Key isn't in the DB — user needs to re-add.    |
| empty     | match    | **Stale cache or replication lag.** See below. |

## "Empty in psql, match in AKC" — actually impossible today

We don't run a read replica for the AKC. If you ever see this,
the AKC subcommand is reading from a different DB than psql is
— check `db.url` in `ssh.env` vs the operator's psql connection.

## The remove-key-but-still-accepted case

A removed SSH key being accepted is **not** caused by a stale
cache (we don't have one); it's caused by one of:

- The key wasn't actually removed (check the DB, not the UI).
- sshd is using its own `authorized_keys` file in addition to
  AKC. Check `/etc/ssh/sshd_config` and per-user
  `~/.ssh/authorized_keys`. The shipped sshd config disables
  per-user files for the `git` user; if someone's customized,
  that's the leak.
- The session was already authenticated and is being held open.
  Kill it: `sudo systemctl status sshd`, find the per-connection
  process, `kill <pid>`.
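
To find those held-open sessions in one pass (assuming pushes
authenticate as the `git` user, per the sshd-config note above):

```sh
# Each live connection is a per-connection sshd process titled
# "sshd: git@...". The bracket trick keeps grep out of its own
# output.
ps aux | grep '[s]shd: git'
sudo kill <pid>   # pid from the output above
```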

## Future: add a cache here

If we ever add an in-process cache to AKC (we'd want to, to
shave the per-push DB call), invalidation becomes load-bearing:

- Cache key: SHA-256 fingerprint.
- Invalidate on: key add (insert) and key remove (delete).
- TTL: 60 seconds is the longest acceptable window between
  removing a key and the AKC ceasing to honor it.

When that lands, this runbook gets a real "regenerate" section.

docs/internal/runbooks/rotate-keys.md (added)

# Rotate signing keys

Distinct from `rotate-secrets.md` because signing keys have
external verifiers (subscribers, browsers, recipients) that need
to keep validating old artifacts during the rollover window.

## Webhook HMAC secrets

Per-webhook, not global. Each webhook has its own secret. The
owner rotates by editing the webhook in Settings → Webhooks →
the hook → "Rotate secret" → set new value.

- Subscriber MUST be updated to the new secret first (or accept
  both during the cutover).
- The previous secret is **not** kept; immediate rotation cuts
  off old-signature validation.
- For ops-driven mass rotation (suspected key store compromise):
  `shithubd webhook rotate-all` regenerates every secret. All
  subscribers will start getting `signature mismatch` until the
  owners update their endpoints.
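
For subscribers checking their side of a rotation, verification
generally looks like the sketch below. The hex encoding and
HMAC-SHA256 are assumptions here, not a statement of shithub's
actual signature scheme; `$WEBHOOK_SECRET`, `body.json`, and
`$received_signature` are stand-ins:

```sh
# Sketch: recompute the HMAC over the raw request body and
# compare with what the delivery carried.
expected=$(openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -r < body.json | cut -d' ' -f1)
if [ "$expected" = "$received_signature" ]; then
  echo "signature ok"
else
  echo "signature mismatch"   # what subscribers see mid-rotation
fi
```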

## Notification unsubscribe HMAC

The one-click unsubscribe links in emails are HMAC-signed. The
key lives in `notif.unsubscribe_key`. Rotation invalidates every
**old email's** unsubscribe link (a recipient with an unread
email containing a link from before rotation will get "invalid
link" if they click it after rotation).

Procedure:

1. Generate a new key: `openssl rand -base64 32`.
2. Replace `notif.unsubscribe_key` in `worker.env`.
3. Redeploy worker (`ANSIBLE_TAGS=app`).
4. Optional: add a banner to the in-app inbox letting users
   know unsubscribe links from old emails will not work.

## Cursor signing key

`internal/pagination/keyset` HMACs every cursor we hand out.
Rotation invalidates every cursor in flight. The user-visible
effect is: any open browser tab with a "Next page" link from
before the rotation will return "invalid cursor" → take them
back to page 1.

The procedure mirrors the unsubscribe key. Acceptable to do
without warning; the worst case is a user clicks "Next" and
gets sent back to the top of the list.

## TLS certificates (Caddy-managed)

Caddy obtains and renews certs from Let's Encrypt automatically.
Manual rotation is rarely needed; if it is:

```sh
ssh edge
sudo caddy reload --config /etc/caddy/Caddyfile
```

If Let's Encrypt is rate-limiting you, switch the
`caddy_use_acme_staging` inventory flag to `true`, redeploy,
verify, then flip back. Production cert reissue is constrained
by LE's per-week limits; mass rotation in a single hour will
fail.

## SSH host keys

The host keys (`/etc/ssh/ssh_host_*`) are sshd's identity to
clients. Rotating them prompts every git client with "WARNING:
REMOTE HOST IDENTIFICATION HAS CHANGED!" — disruptive.

Only rotate if the host has been compromised. Procedure:

1. Remove the old keys (`sudo rm /etc/ssh/ssh_host_*`), then
   regenerate with `ssh-keygen -A`. (`-A` only creates key types
   that are missing; it won't overwrite existing files.)
2. `systemctl restart sshd`.
3. Publish the new host-key fingerprints somewhere users can
   verify (status page, security advisory); collect them as
   shown below.
4. Users prune the old line from `~/.ssh/known_hosts`
   (`ssh-keygen -R <host>`) and re-accept on first reconnect.
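
To collect the fingerprints for step 3:

```sh
# Print the fingerprint of every regenerated host key.
for f in /etc/ssh/ssh_host_*_key.pub; do
  ssh-keygen -lf "$f"
done
```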

User-facing communication is critical here. Without it, you'll
look like a MITM attacker to every developer's SSH client.

docs/internal/runbooks/rotate-secrets.md (added)

# Rotate secrets

Quarterly cadence; sooner if compromise is suspected. The secret
classes:

| Secret                       | Where it lives                               | Rotation procedure                        |
|------------------------------|----------------------------------------------|-------------------------------------------|
| `session.key_b64`            | `web.env`                                    | See "Session signing key" below.          |
| `auth.totp_key_b64`          | `web.env`                                    | See "TOTP AEAD key" below.                |
| Postgres `shithub` password  | `web.env` + `worker.env` + Postgres role     | See "DB password" below.                  |
| Postgres `shithub_hook` pwd  | sshd env + `hook-role-grants.sql` apply env  | See "DB password" below.                  |
| S3 access keys               | `web.env` + `worker.env` + Spaces dashboard  | See "Object store credentials" below.     |
| Postmark / SMTP creds        | `web.env`                                    | One-step: replace, redeploy.              |
| Webhook AEAD key             | per-row encrypted; key in `worker.env`       | Two-step migration, see below.            |
| Operator SSH keys            | `~operator/.ssh/authorized_keys` per host    | Add new key, verify, remove old.          |

## Session signing key

The session key signs the cookie that authenticates a logged-in
session. Rotating it logs **every user out** because every
existing cookie's MAC stops verifying.

1. Generate a new key:
   ```sh
   openssl rand -base64 32
   ```
2. Update the inventory variable `session_key`. Keep the old key
   in a comment for one rotation cycle so you can revert.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
4. Verify: sign in to your own account with a fresh browser; the
   cookie set after sign-in is signed by the new key.

User-visible impact: every user is signed out. Notify in-band
before doing this if avoidable; do it without notice if the old
key may be compromised.

## TOTP AEAD key

The TOTP AEAD key encrypts every user's TOTP shared secret at
rest in the database. **Rotating this key requires a
re-encryption migration** — without it, every 2FA enrollment
becomes unreadable.

The procedure is:

1. Add the new key to `web.env` as `auth.totp_key_b64_next`
   alongside the existing `auth.totp_key_b64`.
2. Restart web (the package supports a "current + next" pair: it
   reads with current, falls back to next, writes with current).
3. Run the re-encryption job: `shithubd admin re-encrypt-totp
   --to-key=auth.totp_key_b64_next` (operator-only). This
   decrypts each row with the old key and re-encrypts with the
   new.
4. Promote `auth.totp_key_b64_next` to `auth.totp_key_b64` (drop
   the suffix), remove the old key.
5. Restart web.

Note that writes use the *current* key throughout, so any
enrollment created between steps 3 and 4 is still encrypted with
the old key; re-run the re-encryption job immediately before
promoting so nothing is stranded.

Do not skip step 3. Failing to re-encrypt before retiring the old
key locks every 2FA-enabled user out of their account; recovery
codes are the only path back in, and not everyone has them
saved.

## DB password

Rotate by adding a new password and removing the old, **without
downtime**.

1. As `postgres`:
   ```sql
   ALTER ROLE shithub WITH PASSWORD '<new>';
   ```
2. Update `db_password` in `web.env` and `worker.env`.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
   The web/worker units will restart and reconnect with the new
   password.
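
To spot-check the new password directly before calling it done
(the host placeholder is whatever your inventory points at):

```sh
# ALTER ROLE does not kill existing sessions; this confirms that
# *new* connections authenticate with the new password.
PGPASSWORD='<new>' psql -h <db-host> -U shithub -d shithub -c 'SELECT 1;'
```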

If you suspect the old password was leaked, do steps 1–3 in
sequence within minutes. Between (1) and (3) the running web
process keeps its already-open connections (which authenticated
under the old password), but any *new* connection it opens will
fail until the redeploy delivers the new password.

## Object store credentials

1. In the Spaces dashboard, generate a new access key with the
   same scope as the old.
2. Update inventory `s3_access_key_id` and `s3_secret_access_key`.
3. `make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app`.
4. Verify: trigger a webhook delivery (which writes a body
   snapshot) and confirm it lands in the bucket.
5. Once confirmed, revoke the old key in the Spaces dashboard.

Do not revoke the old key first; the running process will lose
access mid-flight.
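
A direct spot-check of the new key, independent of the app (any
S3-compatible client works against Spaces; the aws CLI, bucket,
and region here are stand-ins):

```sh
AWS_ACCESS_KEY_ID='<new id>' AWS_SECRET_ACCESS_KEY='<new secret>' \
  aws s3 ls s3://<bucket> --endpoint-url https://<region>.digitaloceanspaces.com
```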

## Webhook AEAD key

The webhook secret AEAD key encrypts every webhook's secret at
rest. Rotation is two-step like TOTP:

1. Add `webhook.aead_key_next` alongside `webhook.aead_key`.
2. Run `shithubd admin re-encrypt-webhooks
   --to-key=webhook.aead_key_next`.
3. Promote and restart.

Failing to re-encrypt before retiring the old key disables every
webhook (the auto-disable logic kicks in on first decrypt
failure).

## Operator SSH keys

Standard procedure: add the new key to every host's
`~operator/.ssh/authorized_keys`, log in with the new key to
confirm, remove the old. Ansible's `authorized_key` module makes
this idempotent; the `base` role will pick up changes if the
inventory's `operator_ssh_keys` list is the source of truth.

## Audit

Every rotation is logged in the host's journal (the deploy run's
output) and, for DB rotations, in whatever sampling of
`pg_stat_activity` your monitoring retains. There's no
centralized rotation log; if you want one, capture each rotation
in your team's incident channel with date + class + reason.