
# Day-one operator runbook

This is what the on-call (the project author for v1) reads on launch day and the day after. Pair with the cutover checklist (`deploy/cutover/checklist.md`) for the actual cutover steps.

"It's been an hour"

Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (S36 numbers; double that is the soft ceiling — see `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1 regression slipped through; check `journalctl -u shithubd-web --grep 'high-query-count'` (the QueryCounter middleware emits warning lines). A shell version of the log checks follows this list.
2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party subscribers is normal (we auto-disable after 50 failures; this is fine).
   - A flood of ERROR lines from the same handler is a bug — bisect.
3. **Status page on docs.shithub.sh/status.html.**
   - Confirm "All systems normal." with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under "Recent incidents" with what happened. Trust comes from transparency, not from the page being empty.
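If you want the log half of this from a shell rather than Loki, a minimal sketch, assuming the `shithubd-web` unit name used above and a systemd build that supports `journalctl --grep`; the one-hour window is arbitrary:

```sh
# Hour-one log check from the web host. Unit name and warning text come
# from this runbook; requires journalctl built with pattern matching.
journalctl -u shithubd-web --since "-1h" --grep 'high-query-count' --no-pager | wc -l
# ^ a climbing count means the QueryCounter N+1 warning is firing: investigate.
journalctl -u shithubd-web --since "-1h" --grep 'panic' --no-pager
# ^ should print nothing; zero panics is the bar.
```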

If all three look clean, take a breath. Then post the announcement if you haven't.

"It's been a day"

24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.

   ```sh
   ssh db
   sudo -u postgres rclone --config /etc/rclone-shithub.conf \
        lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```

   Should list one `.dump` file. If empty, see `docs/internal/runbooks/backups.md#missed-backup`.

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0 and `last_archived_time` should be within minutes of now.

   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:

   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.sh/shithub/shithub.git trunk
   ```

   Both should report the same SHA (within an hour of the latest push to shithub). Checks 1–3 are bundled into a single script in the sketch after this list.

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern, username pattern)? The per-/24 throttle should have caught bursts; a long tail of single-account spam needs manual review.
   - Suspended accounts: log who and why for the retro.

5. **Issue tracker.** Open `https://shithub.sh/shithub/shithub/issues`. Triage every new issue:
   - **P0** — site down / data loss / security. Fix immediately.
   - **P1** — broken core flow (signup, push, merge). Fix this week.
   - **P2** — minor bug / UX. Fix in next sprint.
   - **P3** — feature ask / WONTFIX / accept-with-rationale.
   - Reply on every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update `docs/internal/retro/v0.1.0.md`: fill in the "Numbers" table, the "What surprised us" section, and the top-3 issues table.
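A consolidated sketch of checks 1–3, built from the commands above. Running it from the workstation and the exit-on-error plumbing are assumptions; `ssh db`, the rclone remote, and the repo URLs are the ones this runbook already names:

```sh
#!/bin/sh
# Day-after spot checks, bundling items 1-3 above. Sketch only: assumes
# your workstation can `ssh db` and has git installed.
set -eu

echo "== 1. daily logical backup (expect one .dump) =="
ssh db "sudo -u postgres rclone --config /etc/rclone-shithub.conf \
    lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/"

echo "== 2. WAL archiver (failed_count 0, recent last_archived_time) =="
ssh db "sudo -u postgres psql -d shithub -c \
    'SELECT failed_count, last_archived_time FROM pg_stat_archiver;'"

echo "== 3. mirror sync =="
mirror=$(git ls-remote https://github.com/tenseleyFlow/shithub.git trunk | cut -f1)
origin=$(git ls-remote https://shithub.sh/shithub/shithub.git trunk | cut -f1)
if [ "$mirror" = "$origin" ]; then
    echo "mirror in sync: $origin"
else
    echo "mirror differs (ok if a push landed <1h ago): $mirror vs $origin"
fi
```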

"First incident?"

The general procedure is in `docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident, update `docs/public/status.md` to "Investigating: <one line>". Push, sync to the docs bucket (the sketch after this list shows the mechanical part). The fact that you're aware is half the battle.
2. **Identify the right runbook.** Most v1 incidents will be one of:
   - `incidents.md#shithubd-down` — process crashed.
   - `incidents.md#postgres-down` — DB unreachable.
   - `incidents.md#archive-failing` — WAL archiver broken.
   - `incidents.md#job-backlog` — worker queue runaway.
   - `incidents.md#webhook-delivery-failing` — outbound delivery issues.
3. **Mitigate before fix.** If it's the announcement spike tipping the rate-limit floor, you can briefly raise the floor (config edit + restart) rather than untangling the abuse signal. The runbook and the rate-limit-tuning notes below are both at hand.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One sentence on cause.> <One sentence on prevention if relevant.>"
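The mechanical part of step 1 might look like the following. A sketch only: the commit-and-push flow is implied by "push, sync to docs bucket", but the `spaces-prod:shithub-docs` remote is hypothetical; substitute whatever the docs sync actually targets.

```sh
# Status-page update (step 1). The destination below is HYPOTHETICAL --
# substitute the real docs-bucket remote.
$EDITOR docs/public/status.md        # set "Investigating: <one line>"
git add docs/public/status.md
git commit -m "status: investigating"
git push
rclone copy docs/public/ spaces-prod:shithub-docs/   # HYPOTHETICAL remote
```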

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating" is a fine status. Speculation in public is a debt you pay later.
- **No timelines under stress.** "We'll have an update by 18:00 UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug, thank them on the issue. Even if the fix is days out, the acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (CIDR-shared networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
#   signup_per_ip_per_hour: 5  → 10
#   signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The web service restart picks up new env. Document the change in the retro; tighten back to defaults once the spike passes.
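To confirm the restart actually picked up the new limits, one option is to read the unit's environment back. A sketch, assuming the app host answers to `web` the way the DB host answers to `db`, and that the inventory keys surface as env vars on the unit; both are assumptions, so verify against the app role:

```sh
# Both the 'web' ssh alias and the env-var naming are ASSUMPTIONS here;
# check the app role for how the inventory vars actually land.
ssh web "systemctl show shithubd-web -p Environment" | tr ' ' '\n' | grep -i signup
```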

"Drop the GitHub mirror?" decision

Default plan: **keep the mirror for 90 days post-launch**, then re-evaluate. The mirror exists as a recovery surface; if the self-hosted instance has had no extended outage in 90 days, drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day mark.

## When to escalate (to whom?)

Solo on-call for v1; there is no second tier. Escalation paths:

- **DB destroyed beyond restore.** Restore from the cross-region DR bucket per `runbooks/restore.md`.
- **Account compromise on the operator's email.** Bump the compromised account's session epoch via SQL to revoke its sessions (see the sketch below); trigger a "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the substance same-day.
- **Mass-abuse incident.** Read-only mode, then triage. The read-only-mode runbook is the lever.
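For the account-compromise path, the session-epoch revoke might look like the following. A sketch only: `users.session_epoch` is a hypothetical name for whatever the schema actually calls it; the mechanism (bump the epoch so sessions minted under the old value stop validating) is the standard pattern.

```sh
# HYPOTHETICAL schema names; substitute the real table/column. Bumping
# the epoch invalidates every session issued under the old value,
# assuming session tokens embed the epoch and handlers compare it.
ssh db
sudo -u postgres psql -d shithub -c \
  "UPDATE users SET session_epoch = session_epoch + 1 WHERE email = '<operator email>';"
```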