# Day-one operator runbook

This is what the on-call (the project author for v1) reads on
launch day and the day after. Pair with the cutover checklist
(`deploy/cutover/checklist.md`) for the actual cutover steps.

## "It's been an hour"

Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (S36
     numbers, doubled is the soft ceiling — see
     `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1
     regression slipped through; check `journalctl -u shithubd-web
     --grep 'high-query-count'` (the QueryCounter middleware
     emits warning lines).

2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party
     subscribers is normal (we auto-disable after 50; this is
     fine).
   - A flood of ERROR lines from the same handler is a bug —
     bisect.

3. **Status page on docs.shithub.example/status.html.**
   - Confirm "All systems normal." with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under
     "Recent incidents" with what happened. Trust comes from
     transparency, not from the page being empty.

If all three look clean, take a breath. Then post the announcement
if you haven't.
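
If Grafana or Loki are themselves part of the blip, the same panic/ERROR
filter works as a plain pipeline over journald output. A minimal sketch
(the `count_errors` helper and the field-stripping `sed` are mine, not
existing tooling):

```sh
# count_errors: rank repeated panic/ERROR lines so "a flood from the same
# handler" stands out. Reads log lines on stdin; strips the leading
# timestamp/host/unit fields so identical messages collapse together.
count_errors() {
  grep -E 'panic|ERROR' | sed -E 's/^([^ ]+ +){5}//' | sort | uniq -c | sort -rn
}

# Last hour of web logs:
#   journalctl -u shithubd-web --since "1 hour ago" --no-pager | count_errors
```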

## "It's been a day"

24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.
   ```sh
   ssh db
   sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```
   Should list one `.dump` file. If empty, see
   `docs/internal/runbooks/backups.md#missed-backup`.

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0
   and `last_archived_time` should be within minutes of now.
   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:
   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.example/shithub/shithub.git trunk
   ```
   Both should report the same SHA (within an hour of the latest
   push to shithub).

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern,
     username pattern)? The per-/24 throttle should have caught
     bursts; a long tail of single-account spam needs manual
     review.
   - Suspended accounts: log who and why for the retro.

5. **Issue tracker.** Open
   `https://shithub.example/shithub/shithub/issues`. Triage every
   new issue:
   - **P0** — site down / data loss / security. Fix immediately.
   - **P1** — broken core flow (signup, push, merge). Fix this
     week.
   - **P2** — minor bug / UX. Fix in next sprint.
   - **P3** — feature ask / WONTFIX / accept-with-rationale.
   - Reply on every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update
   `docs/internal/retro/v1.0.0.md` — fill the "Numbers" table,
   the "What surprised us" section, the top-3 issues table.
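
A small wrapper makes the comparison in step 3 scriptable. The
`mirror_in_sync` helper is my sketch; the `ls-remote` commands and the
`trunk` branch are the ones above:

```sh
# mirror_in_sync: compare the two trunk SHAs from step 3 and fail loudly
# on drift. Helper name is mine; plug in the ls-remote output as shown.
mirror_in_sync() {  # usage: mirror_in_sync <origin-sha> <mirror-sha>
  if [ "$1" = "$2" ]; then
    echo "mirror in sync ($1)"
  else
    echo "MIRROR DRIFT: origin=$1 mirror=$2" >&2
    return 1
  fi
}

# mirror_in_sync \
#   "$(git ls-remote https://shithub.example/shithub/shithub.git trunk | cut -f1)" \
#   "$(git ls-remote https://github.com/tenseleyFlow/shithub.git trunk | cut -f1)"
```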

## "First incident?"

The general procedure is in
`docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident,
   update `docs/public/status.md` to "Investigating: <one line>".
   Push, sync to the docs bucket. The fact that you're aware is
   half the battle.
2. **Identify the right runbook.** Most v1 incidents will be one
   of:
   - `incidents.md#shithubd-down` — process crashed.
   - `incidents.md#postgres-down` — DB unreachable.
   - `incidents.md#archive-failing` — WAL archiver broken.
   - `incidents.md#job-backlog` — worker queue runaway.
   - `incidents.md#webhook-delivery-failing` — outbound delivery
     issues.
3. **Mitigate before fix.** If the announcement spike is tipping
   legitimate users into the rate-limit floor, briefly raise the
   floor (config edit + restart) rather than untangling the abuse
   signal mid-incident; the incident runbook and the rate-limit
   tuning section of this runbook cover both.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One
   sentence on cause.> <One sentence on prevention if relevant.>"
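
The resolution note in step 4 has a fixed shape; a tiny formatter keeps
it consistent under stress. The `resolve_line` helper is my sketch, the
wording template is the runbook's:

```sh
# resolve_line: emit the resolution note in the exact shape step 4 asks
# for. The second argument (prevention) is optional.
resolve_line() {  # usage: resolve_line "<cause sentence>" ["<prevention sentence>"]
  printf 'Resolved at %s UTC. %s%s\n' "$(date -u +%H:%M)" "$1" "${2:+ $2}"
}

# resolve_line "Worker OOM under the announcement spike." "Raised MemoryMax on the unit."
```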

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating"
  is a fine status. Speculation in public is a debt you pay
  later.
- **No timelines under stress.** "We'll have an update by 18:00
  UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug,
  thank them on the issue. Even if the fix is days out, the
  acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (CIDR-shared
networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
# signup_per_ip_per_hour: 5 → 10
# signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The web service restart picks up the new env. Document the change in
the retro; tighten back to defaults once the spike passes.
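
For reference, the inventory edit from step 1 might end up looking like
this. The file format and the `[app:vars]` group name are assumptions
about the inventory layout; the variable names and values come from the
comments above:

```
# deploy/ansible/inventory/production (spike values)
[app:vars]
signup_per_ip_per_hour=10    # default 5
signup_per_24_per_hour=50    # default 20
```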

## "Drop the GitHub mirror?" decision

Default plan: **keep the mirror for 90 days post-launch**, then
re-evaluate. The mirror exists as a recovery surface; if the
self-hosted instance has had no extended outage in 90 days,
drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day
mark.

## When to escalate (and to whom)

Solo on-call for v1. There is no second tier. Escalation paths:

- **DB destroyed beyond restore.** Restore from the cross-region
  DR bucket per `runbooks/restore.md`.
- **Account compromise of the operator's email.** Bump the
  compromised account's session epoch via SQL to revoke its
  sessions; trigger a "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the
  substance the same day.
- **Mass-abuse incident.** Switch to read-only mode, then triage.
  The read-only-mode runbook is the lever.