# Day-one operator runbook

This is what the on-call (the project author for v1) reads on
launch day and the day after. Pair with the cutover checklist
(`deploy/cutover/checklist.md`) for the actual cutover steps.
"It's been an hour"
Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (S36
     numbers, doubled is the soft ceiling — see
     `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1
     regression slipped through; check `journalctl -u shithubd-web
     --grep 'high-query-count'` (the QueryCounter middleware
     emits warning lines).

2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party
     subscribers is normal (we auto-disable after 50; this is
     fine).
   - A flood of ERROR lines from the same handler is a bug —
     bisect.

3. **Status page on docs.shithub.sh/status.html.**
   - Confirm "All systems normal." with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under
     "Recent incidents" with what happened. Trust comes from
     transparency, not from the page being empty.

If all three look clean, take a breath. Then post the announcement if you haven't.
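
If you'd rather run the hour-one checks from a terminal than keep
three tabs open, something like the following works. The Prometheus
and Loki endpoints and the `job` label values are assumptions here;
read the real ones off the shithubd-overview dashboard panels.

```sh
# Endpoint and label values are assumptions — copy them from the dashboard.
PROM=http://prometheus.internal:9090

# Web up / Worker up: expect a value of 1 for each job.
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=up{job=~"shithubd-(web|worker)"}' \
  | jq -r '.data.result[] | "\(.metric.job): \(.value[1])"'

# Same Loki filter as the dashboard tab, over the last hour
# (LOKI_ADDR must point at the Loki instance).
logcli query --since=1h '{service="shithubd"} |~ "panic|ERROR"'
```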
"It's been a day"
24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.
   ```sh
   ssh db
   sudo -u postgres rclone --config /etc/rclone-shithub.conf \
     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```
   Should list one `.dump` file. If empty, see
   `docs/internal/runbooks/backups.md#missed-backup`.
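
   A near-empty dump is as bad as a missing one, so glance at the size
   too. `rclone lsl` adds size and modtime to the listing ("a few MB,
   growing with the data" is the expectation here, not a measured
   baseline):
   ```sh
   sudo -u postgres rclone --config /etc/rclone-shithub.conf \
     lsl spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```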

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0
   and `last_archived_time` should be within minutes of now.
   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```
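
   To turn that into a single pass/fail line instead of reading the
   whole row (the 15-minute window is an arbitrary but reasonable
   threshold, not a documented one):
   ```sh
   sudo -u postgres psql -d shithub -tAc \
     "SELECT failed_count = 0
        AND last_archived_time > now() - interval '15 minutes'
      FROM pg_stat_archiver;"
   # prints "t" when healthy
   ```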

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:
   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.sh/shithub/shithub.git trunk
   ```
   Both should report the same SHA (within an hour of the latest
   push to shithub).
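
   The same check as a one-shot comparison, if you'd rather not
   eyeball two 40-character SHAs:
   ```sh
   a=$(git ls-remote https://github.com/tenseleyFlow/shithub.git trunk | cut -f1)
   b=$(git ls-remote https://shithub.sh/shithub/shithub.git trunk | cut -f1)
   [ "$a" = "$b" ] && echo "mirror in sync: $a" || echo "MISMATCH github=$a shithub=$b"
   ```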

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern,
     username pattern)? The per-/24 throttle should have caught
     bursts; a long tail of single-account spam needs manual
     review.
   - Suspended accounts: log who and why for the retro.
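
   If the admin view is too coarse, a direct query groups the last 24h
   of signups by /24. The table and column names below are assumptions
   (adjust to the real schema):
   ```sh
   sudo -u postgres psql -d shithub -c "
     SELECT network(set_masklen(signup_ip, 24)) AS subnet,
            count(*) AS signups
     FROM users
     WHERE created_at > now() - interval '24 hours'
     GROUP BY 1 ORDER BY 2 DESC LIMIT 20;"
   ```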

5. **Issue tracker.** Open
   `https://shithub.sh/shithub/shithub/issues`. Triage every
   new issue:
   - **P0** — site down / data loss / security. Fix immediately.
   - **P1** — broken core flow (signup, push, merge). Fix this
     week.
   - **P2** — minor bug / UX. Fix in next sprint.
   - **P3** — feature ask / WONTFIX / accept-with-rationale.
   - Reply on every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update
   `docs/internal/retro/v0.1.0.md` — fill the "Numbers" table,
   the "What surprised us" section, and the top-3 issues table.
"First incident?"
The general procedure is in
`docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident,
   update `docs/public/status.md` to "Investigating: <one line>".
   Push, sync to docs bucket. The fact that you're aware is
   half the battle. (A sketch of this step follows the list.)
2. **Identify the right runbook.** Most v1 incidents will be one
   of:
   - `incidents.md#shithubd-down` — process crashed.
   - `incidents.md#postgres-down` — DB unreachable.
   - `incidents.md#archive-failing` — WAL archiver broken.
   - `incidents.md#job-backlog` — worker queue runaway.
   - `incidents.md#webhook-delivery-failing` — outbound delivery
     issues.
3. **Mitigate before fix.** If it's the announcement spike
   tipping the rate-limit floor, you can briefly raise the floor
   (config edit + restart) rather than untangling the abuse
   signal. The runbook + the rate-limit-tuning notes are both at
   hand.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One
   sentence on cause.> <One sentence on prevention if relevant.>"
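
For step 1, a minimal sketch of the status-page update. How the file
reaches the docs bucket depends on the deploy tooling, so the last
step is a placeholder rather than a real command:

```sh
$EDITOR docs/public/status.md   # add "Investigating: <one line>" at the top
git add docs/public/status.md
git commit -m "status: investigating"
git push
# then run whatever job syncs docs/ to the docs bucket (see the deploy docs)
```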

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating"
  is a fine status. Speculation in public is a debt you pay
  later.
- **No timelines under stress.** "We'll have an update by 18:00
  UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug,
  thank them on the issue. Even if the fix is days out, the
  acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (CIDR-shared networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
# signup_per_ip_per_hour: 5 → 10
# signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The web service restart picks up new env. Document the change in the retro; tighten back to defaults once the spike passes.
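
Assuming the inventory edit stays uncommitted while the spike lasts,
plain git is enough to capture the delta for the retro and to revert
later:

```sh
# Record the exact change for the retro entry.
git diff deploy/ansible/inventory/production
# When the spike passes: drop the bump and re-apply the app role.
git checkout -- deploy/ansible/inventory/production
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```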
"Drop the GitHub mirror?" decision
Default plan: **keep the mirror for 90 days post-launch**, then
re-evaluate. The mirror exists as a recovery surface; if the
self-hosted instance has had no extended outage in 90 days,
drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day mark.
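
When that day comes, the mechanics are small. The timer name below is
hypothetical (use whatever unit actually runs the hourly mirror push);
`gh repo archive` is the real GitHub CLI command and prompts for
confirmation:

```sh
# Stop the hourly mirror push (unit name is a guess).
sudo systemctl disable --now shithub-mirror.timer
# Mark the GitHub copy as read-only history.
gh repo archive tenseleyFlow/shithub
```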

## When to escalate (to whom?)

Solo on-call for v1. There is no second tier. Escalation paths:

- **DB destroyed beyond restore.** Restore from cross-region DR
  bucket per `runbooks/restore.md`.
- **Account compromise on the operator's email.** Revoke the
  compromised email's session epoch via SQL (see the sketch at
  the end of this section); trigger a "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the
  substance same-day.
- **Mass-abuse incident.** Read-only mode, then triage. The
  read-only-mode runbook is the lever.
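
For the account-compromise path, the epoch bump is a one-liner. The
table and column names are assumptions; the real ones live in the
app's schema:

```sh
# Invalidate every existing session for the compromised account.
sudo -u postgres psql -d shithub -c \
  "UPDATE users SET session_epoch = session_epoch + 1
   WHERE email = '<compromised address>';"
```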