tenseleyflow/shithub

S40: day-one operator runbook

Authored by espadonne
SHA: 2d058c9feb8c764d1939edcdcd708961b8fba472
Parent: 7b94dc9
Tree: 0c15f21

1 changed file

A  docs/internal/runbooks/day-one.md  (+163, -0)

docs/internal/runbooks/day-one.md (added)
@@ -0,0 +1,163 @@
# Day-one operator runbook

This is what the on-call (the project author for v1) reads on
launch day and the day after. Pair with the cutover checklist
(`deploy/cutover/checklist.md`) for the actual cutover steps.

## "It's been an hour"

Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (S36
     numbers; double baseline is the soft ceiling, see
     `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1
     regression slipped through; check `journalctl -u shithubd-web
     --grep 'high-query-count'` (the QueryCounter middleware
     emits warning lines).

2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party
     subscribers are normal (we auto-disable after 50 failures;
     this is fine).
   - A flood of ERROR lines from the same handler is a bug:
     bisect. A quick grouping trick is sketched after this list.

3. **Status page at docs.shithub.example/status.html.**
   - Confirm "All systems normal." with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under
     "Recent incidents" with what happened. Trust comes from
     transparency, not from the page being empty.
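
Before bisecting a flood, it helps to know whether it is one
handler or many. A minimal journal grouping sketch; the `cut`
offset assumes the default journalctl line layout, so adjust it
to our actual log format:

```sh
# Count ERROR lines from the last hour with the timestamp, host,
# and unit fields stripped; a repeating handler jumps out fast.
journalctl -u shithubd-web --since "1 hour ago" --no-pager \
  | grep ERROR | cut -d' ' -f6- | sort | uniq -c | sort -rn | head
```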

If all three look clean, take a breath. Then post the announcement
if you haven't.

## "It's been a day"

24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.
   ```sh
   ssh db
   sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
        lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```
   It should list one `.dump` file. If it's empty, see
   `docs/internal/runbooks/backups.md#missed-backup`.

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0
   and `last_archived_time` should be within minutes of now.
   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:
   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.example/shithub/shithub.git trunk
   ```
   Both should report the same SHA (within an hour of the latest
   push to shithub). A scriptable check is sketched after this
   list.

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern,
     username pattern)? The per-/24 throttle should have caught
     bursts; a long tail of single-account spam needs manual
     review.
   - Suspended accounts: log who and why for the retro.

5. **Issue tracker.** Open
   `https://shithub.example/shithub/shithub/issues`. Triage every
   new issue:
   - **P0**: site down / data loss / security. Fix immediately.
   - **P1**: broken core flow (signup, push, merge). Fix this
     week.
   - **P2**: minor bug / UX. Fix in the next sprint.
   - **P3**: feature ask / WONTFIX / accept-with-rationale.
   - Reply to every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update `docs/internal/retro/v1.0.0.md`: fill the
   "Numbers" table, the "What surprised us" section, and the
   top-3 issues table.

## "First incident?"

The general procedure is in
`docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident,
   update `docs/public/status.md` to "Investigating: <one line>".
   Push, then sync to the docs bucket (the flow is sketched after
   this list). The fact that you're aware is half the battle.
2. **Identify the right runbook.** Most v1 incidents will be one
   of:
   - `incidents.md#shithubd-down`: process crashed.
   - `incidents.md#postgres-down`: DB unreachable.
   - `incidents.md#archive-failing`: WAL archiver broken.
   - `incidents.md#job-backlog`: worker queue runaway.
   - `incidents.md#webhook-delivery-failing`: outbound delivery
     issues.
3. **Mitigate before fix.** If it's the announcement spike
   tipping the rate-limit floor, you can briefly raise the floor
   (config edit + restart) rather than untangling the abuse
   signal; the steps are under "Rate-limit tuning" below.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One
   sentence on cause.> <One sentence on prevention if relevant.>"
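
The step-1 flow end to end, as a sketch; the final sync command is
an assumption (substitute whatever the docs deploy actually uses):

```sh
$EDITOR docs/public/status.md    # set "Investigating: <one line>"
git add docs/public/status.md
git commit -m "status: investigating"
git push
# Hypothetical sync step -- the bucket name is illustrative only.
rclone sync docs/public/ spaces-prod:shithub-docs/
```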

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating"
  is a fine status. Speculation in public is a debt you pay
  later.
- **No timelines under stress.** "We'll have an update by 18:00
  UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug,
  thank them on the issue. Even if the fix is days out, the
  acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (users behind shared
/24 networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
#   signup_per_ip_per_hour: 5  → 10
#   signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The app-only deploy restarts the web service, which picks up the
new env. Document the change in the retro; tighten back to the
defaults once the spike passes.
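
To confirm the restarted unit actually saw the new values
(assuming the limits reach the service as environment variables
named after the inventory keys above, which is unverified):

```sh
# Dump the unit's environment and look for the signup limits.
systemctl show shithubd-web --property=Environment \
  | tr ' ' '\n' | grep -i signup
```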

## "Drop the GitHub mirror?" decision

Default plan: **keep the mirror for 90 days post-launch**, then
re-evaluate. The mirror exists as a recovery surface; if the
self-hosted instance has had no extended outage in 90 days,
drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day
mark.

## When to escalate (to whom?)

Solo on-call for v1. There is no second tier. Escalation paths:

- **DB destroyed beyond restore.** Restore from the cross-region
  DR bucket per `runbooks/restore.md`.
- **Account compromise on the operator's email.** Invalidate the
  compromised account's sessions by bumping its session epoch via
  SQL (sketched below); then trigger a "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the
  substance the same day.
- **Mass-abuse incident.** Read-only mode, then triage. The
  read-only-mode runbook is the lever.
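
The session-epoch bump, as a sketch; it assumes sessions are
validated against a `users.session_epoch` column, which is an
unverified guess at the schema (check before running):

```sh
ssh db
# Bumping the epoch logs the account out of every live session on
# its next request. Table and column names are assumptions.
sudo -u postgres psql -d shithub -c \
  "UPDATE users SET session_epoch = session_epoch + 1
   WHERE email = 'compromised@example.com';"
```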