# Day-one operator runbook

This is what the on-call (the project author for v1) reads on
launch day and the day after. Pair with the cutover checklist
(`deploy/cutover/checklist.md`) for the actual cutover steps.

## "It's been an hour"

Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (the S36
     numbers; double that is the soft ceiling — see
     `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1
     regression slipped through; check `journalctl -u shithubd-web
     --grep 'high-query-count'` (the QueryCounter middleware
     emits warning lines).

2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party
     subscribers is normal (we auto-disable after 50; this is
     fine).
   - A flood of ERROR lines from the same handler is a bug —
     bisect it.

3. **Status page on docs.shithub.example/status.html.**
   - Confirm "All systems normal" with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under
     "Recent incidents" with what happened. Trust comes from
     transparency, not from the page being empty.
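
The Loki bar above can also be checked from the host. A minimal
sketch, assuming the `shithubd-web` unit name from item 1; the
`error_count` helper name is hypothetical:

```sh
# Local equivalent of the Loki query: count panic/ERROR lines
# on stdin.
error_count() { grep -cE 'panic|ERROR'; }

# On the host (unit name assumed):
#   journalctl -u shithubd-web --since -1h | error_count
```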

If all three look clean, take a breath. Then post the announcement
if you haven't.

## "It's been a day"

24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.
   ```sh
   ssh db
   sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```
   It should list exactly one `.dump` file. If empty, see
   `docs/internal/runbooks/backups.md#missed-backup`.
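
   To make the "exactly one `.dump`" check scriptable, a minimal
   sketch; `backup_ok` is a hypothetical helper that takes the
   `rclone lsf` output from above:

   ```sh
   # backup_ok: succeed iff the listing ($1, one filename per line)
   # contains exactly one .dump file. Hypothetical helper; feed it
   # the output of the rclone command above.
   backup_ok() {
     count=$(printf '%s\n' "$1" | grep -c '\.dump$') || true
     [ "$count" -eq 1 ]
   }
   ```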

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0
   and `last_archived_time` should be within minutes of now.
   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```
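
   The two conditions can be encoded as a tiny predicate. A sketch
   only: `archiver_ok` is a hypothetical helper, and the 600-second
   bound is a stand-in for "within minutes of now":

   ```sh
   # archiver_ok: pass iff failed_count ($1) is 0 and the archive lag
   # ($2, seconds since last_archived_time) is small. The 600 s
   # threshold is an assumption, not a tuned value.
   archiver_ok() {
     [ "$1" -eq 0 ] && [ "$2" -lt 600 ]
   }
   ```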

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:
   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.example/shithub/shithub.git trunk
   ```
   Both should report the same SHA (within an hour of the latest
   push to shithub).
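
   To script the comparison, a sketch that checks just the SHA
   column of the two `git ls-remote` lines (`same_sha` is a
   hypothetical name):

   ```sh
   # same_sha: given two `git ls-remote` output lines ("SHA<TAB>ref"),
   # succeed iff the SHA columns match.
   same_sha() {
     [ "$(printf '%s' "$1" | cut -f1)" = "$(printf '%s' "$2" | cut -f1)" ]
   }

   # Usage:
   #   same_sha "$(git ls-remote https://github.com/tenseleyFlow/shithub.git trunk)" \
   #            "$(git ls-remote https://shithub.example/shithub/shithub.git trunk)"
   ```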

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern,
     username pattern)? The per-/24 (source CIDR) throttle
     should have caught bursts; a long tail of single-account
     spam needs manual review.
   - Suspended accounts: log who and why for the retro.

5. **Issue tracker.** Open
   `https://shithub.example/shithub/shithub/issues`. Triage every
   new issue:
   - **P0** — site down / data loss / security. Fix immediately.
   - **P1** — broken core flow (signup, push, merge). Fix this
     week.
   - **P2** — minor bug / UX. Fix in next sprint.
   - **P3** — feature ask / WONTFIX / accept-with-rationale.
   - Reply on every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update
   `docs/internal/retro/v1.0.0.md`: fill in the "Numbers" table,
   the "What surprised us" section, and the top-3 issues table.

## "First incident?"

The general procedure is in
`docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident,
   update `docs/public/status.md` to "Investigating: <one line>",
   push, and sync to the docs bucket. Letting users see that
   you're aware is half the battle.
2. **Identify the right runbook.** Most v1 incidents will be one
   of:
   - `incidents.md#shithubd-down` — process crashed.
   - `incidents.md#postgres-down` — DB unreachable.
   - `incidents.md#archive-failing` — WAL archiver broken.
   - `incidents.md#job-backlog` — worker queue runaway.
   - `incidents.md#webhook-delivery-failing` — outbound delivery
     issues.
3. **Mitigate before fix.** If it's the announcement spike
   tipping the rate-limit floor, you can briefly raise the floor
   (config edit + restart) rather than untangling the abuse
   signal; both the incident runbook and the rate-limit tuning
   notes below are at hand.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One
   sentence on cause.> <One sentence on prevention if relevant.>"
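
The status-page step in item 1 can be half-scripted. A sketch;
`status_line` is a hypothetical helper, and the push and
docs-bucket sync stay manual:

```sh
# status_line: format the "Investigating" line with a UTC timestamp.
status_line() {
  printf 'Investigating (%s UTC): %s\n' "$(date -u +%H:%M)" "$1"
}

# Usage: paste the output into docs/public/status.md, then push
# and sync as in item 1:
#   status_line "elevated error rates on push"
```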

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating"
  is a fine status. Speculation in public is a debt you pay
  later.
- **No timelines under stress.** "We'll have an update by 18:00
  UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug,
  thank them on the issue. Even if the fix is days out, the
  acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (e.g. CIDR-shared
networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
# signup_per_ip_per_hour: 5 → 10
# signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The web service restart picks up the new environment. Document
the change in the retro; tighten back to the defaults once the
spike passes.

## "Drop the GitHub mirror?" decision

Default plan: **keep the mirror for 90 days post-launch**, then
re-evaluate. The mirror exists as a recovery surface; if the
self-hosted instance has had no extended outage in 90 days,
drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day
mark.

## When to escalate (to whom?)

You are solo on-call for v1; there is no second tier. Escalation
paths:

- **DB destroyed beyond restore.** Restore from the cross-region
  DR bucket per `runbooks/restore.md`.
- **Account compromise on the operator's email.** Revoke the
  compromised email's session epoch via SQL; trigger a
  "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the
  substance the same day.
- **Mass-abuse incident.** Switch to read-only mode, then triage.
  The read-only-mode runbook is the lever.