# Day-one operator runbook

This is what the on-call (the project author for v1) reads on
launch day and the day after. Pair with the cutover checklist
(`deploy/cutover/checklist.md`) for the actual cutover steps.

## "It's been an hour"

Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (S36
     numbers, doubled is the soft ceiling — see
     `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1
     regression slipped through; check `journalctl -u shithubd-web
     --grep 'high-query-count'` (the QueryCounter middleware
     emits warning lines).

2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party
     subscribers is normal (we auto-disable after 50; this is
     fine).
   - A flood of ERROR lines from the same handler is a bug —
     bisect.

3. **Status page on docs.shithub.example/status.html.**
   - Confirm "All systems normal." with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under
     "Recent incidents" with what happened. Trust comes from
     transparency, not from the page being empty.

If all three look clean, take a breath. Then post the announcement
if you haven't.
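
If Grafana or Loki are themselves part of the blip, the same panic/ERROR
filter works as a plain pipeline over journald output. A minimal sketch
(the `count_errors` helper and the field-stripping `sed` are mine, not
existing tooling):

```sh
# count_errors: rank repeated panic/ERROR lines so "a flood from the same
# handler" stands out. Reads log lines on stdin; strips the leading
# timestamp/host/unit fields so identical messages collapse together.
count_errors() {
  grep -E 'panic|ERROR' | sed -E 's/^([^ ]+ +){5}//' | sort | uniq -c | sort -rn
}

# Last hour of web logs:
#   journalctl -u shithubd-web --since "1 hour ago" --no-pager | count_errors
```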

## "It's been a day"

24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.
   ```sh
   ssh db
   sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```
   Should list one `.dump` file. If empty, see
   `docs/internal/runbooks/backups.md#missed-backup`.

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0
   and `last_archived_time` should be within minutes of now.
   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:
   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.example/shithub/shithub.git trunk
   ```
   Both should report the same SHA (within an hour of the latest
   push to shithub).

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern,
     username pattern)? The per-/24 throttle should have caught
     bursts; a long tail of single-account spam needs manual
     review.
   - Suspended accounts: log who and why for the retro.

5. **Issue tracker.** Open
   `https://shithub.example/shithub/shithub/issues`. Triage every
   new issue:
   - **P0** — site down / data loss / security. Fix immediately.
   - **P1** — broken core flow (signup, push, merge). Fix this
     week.
   - **P2** — minor bug / UX. Fix in next sprint.
   - **P3** — feature ask / WONTFIX / accept-with-rationale.
   - Reply on every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update
   `docs/internal/retro/v1.0.0.md` — fill the "Numbers" table,
   the "What surprised us" section, the top-3 issues table.
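
A small wrapper makes the comparison in step 3 scriptable. The
`mirror_in_sync` helper is my sketch; the `ls-remote` commands and the
`trunk` branch are the ones above:

```sh
# mirror_in_sync: compare the two trunk SHAs from step 3 and fail loudly
# on drift. Helper name is mine; plug in the ls-remote output as shown.
mirror_in_sync() {  # usage: mirror_in_sync <origin-sha> <mirror-sha>
  if [ "$1" = "$2" ]; then
    echo "mirror in sync ($1)"
  else
    echo "MIRROR DRIFT: origin=$1 mirror=$2" >&2
    return 1
  fi
}

# mirror_in_sync \
#   "$(git ls-remote https://shithub.example/shithub/shithub.git trunk | cut -f1)" \
#   "$(git ls-remote https://github.com/tenseleyFlow/shithub.git trunk | cut -f1)"
```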

## "First incident?"

The general procedure is in
`docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident,
   update `docs/public/status.md` to "Investigating: <one line>".
   Push, sync to the docs bucket. The fact that you're aware is
   half the battle.
2. **Identify the right runbook.** Most v1 incidents will be one
   of:
   - `incidents.md#shithubd-down` — process crashed.
   - `incidents.md#postgres-down` — DB unreachable.
   - `incidents.md#archive-failing` — WAL archiver broken.
   - `incidents.md#job-backlog` — worker queue runaway.
   - `incidents.md#webhook-delivery-failing` — outbound delivery
     issues.
3. **Mitigate before fix.** If the announcement spike is tipping
   legitimate users into the rate-limit floor, briefly raise the
   floor (config edit + restart) rather than untangling the abuse
   signal mid-incident; the incident runbook and the rate-limit
   tuning section of this runbook cover both.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One
   sentence on cause.> <One sentence on prevention if relevant.>"
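
The resolution note in step 4 has a fixed shape; a tiny formatter keeps
it consistent under stress. The `resolve_line` helper is my sketch, the
wording template is the runbook's:

```sh
# resolve_line: emit the resolution note in the exact shape step 4 asks
# for. The second argument (prevention) is optional.
resolve_line() {  # usage: resolve_line "<cause sentence>" ["<prevention sentence>"]
  printf 'Resolved at %s UTC. %s%s\n' "$(date -u +%H:%M)" "$1" "${2:+ $2}"
}

# resolve_line "Worker OOM under the announcement spike." "Raised MemoryMax on the unit."
```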

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating"
  is a fine status. Speculation in public is a debt you pay
  later.
- **No timelines under stress.** "We'll have an update by 18:00
  UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug,
  thank them on the issue. Even if the fix is days out, the
  acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (CIDR-shared
networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
# signup_per_ip_per_hour: 5 → 10
# signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The web service restart picks up the new env. Document the change in
the retro; tighten back to defaults once the spike passes.
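
For reference, the inventory edit from step 1 might end up looking like
this. The file format and the `[app:vars]` group name are assumptions
about the inventory layout; the variable names and values come from the
comments above:

```
# deploy/ansible/inventory/production (spike values)
[app:vars]
signup_per_ip_per_hour=10    # default 5
signup_per_24_per_hour=50    # default 20
```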

## "Drop the GitHub mirror?" decision

Default plan: **keep the mirror for 90 days post-launch**, then
re-evaluate. The mirror exists as a recovery surface; if the
self-hosted instance has had no extended outage in 90 days,
drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day
mark.

## When to escalate (and to whom)

Solo on-call for v1. There is no second tier. Escalation paths:

- **DB destroyed beyond restore.** Restore from the cross-region
  DR bucket per `runbooks/restore.md`.
- **Account compromise of the operator's email.** Bump the
  compromised account's session epoch via SQL to revoke its
  sessions; trigger a "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the
  substance the same day.
- **Mass-abuse incident.** Switch to read-only mode, then triage.
  The read-only-mode runbook is the lever.