
# Day-one operator runbook

This is what the on-call (the project author for v1) reads on launch day and the day after. Pair with the cutover checklist (`deploy/cutover/checklist.md`) for the actual cutover steps.

"It's been an hour"

Open three tabs and walk this list:

1. **Grafana → shithubd-overview dashboard.**
   - `Web up` and `Worker up` should both be 1.
   - p95 latency on top routes should sit near baseline (S36 numbers; double that is the soft ceiling — see `docs/internal/capacity.md`).
   - DB calls/sec should NOT be 10× baseline. If it is, an N+1 regression slipped through; check `journalctl -u shithubd-web --grep 'high-query-count'` (the QueryCounter middleware emits warning lines). A shell version of the log checks follows this list.
2. **Loki → `{service="shithubd"} |~ "panic|ERROR"`.**
   - Zero panics is the bar.
   - One or two ERROR lines from genuinely broken third-party subscribers is normal (we auto-disable after 50 failures; this is fine).
   - A flood of ERROR lines from the same handler is a bug — bisect.
3. **Status page on docs.shithub.sh/status.html.**
   - Confirm "All systems normal." with a current timestamp.
   - If you've had any blip, even a 30-second one, log it under "Recent incidents" with what happened. Trust comes from transparency, not from the page being empty.
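If you want the log half of this from a shell rather than Loki, a minimal sketch, assuming the `shithubd-web` unit name used above and a systemd build that supports `journalctl --grep`; the one-hour window is arbitrary:

```sh
# Hour-one log check from the web host. Unit name and warning text come
# from this runbook; requires journalctl built with pattern matching.
journalctl -u shithubd-web --since "-1h" --grep 'high-query-count' --no-pager | wc -l
# ^ a climbing count means the QueryCounter N+1 warning is firing: investigate.
journalctl -u shithubd-web --since "-1h" --grep 'panic' --no-pager
# ^ should print nothing; zero panics is the bar.
```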

If all three look clean, take a breath. Then post the announcement if you haven't.

"It's been a day"

24h post-cutover checklist:

1. **Backups.** The first daily logical backup should have run.

   ```sh
   ssh db
   sudo -u postgres rclone --config /etc/rclone-shithub.conf \
        lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
   ```

   Should list one `.dump` file. If empty, see `docs/internal/runbooks/backups.md#missed-backup`.

2. **WAL archive.** `pg_stat_archiver.failed_count` should be 0 and `last_archived_time` should be within minutes of now.

   ```sh
   ssh db
   sudo -u postgres psql -d shithub -c "SELECT * FROM pg_stat_archiver;"
   ```

3. **Mirror push to GitHub.** The mirror job runs hourly. Confirm:

   ```sh
   git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
   git ls-remote https://shithub.sh/shithub/shithub.git trunk
   ```

   Both should report the same SHA (within an hour of the latest push to shithub). Checks 1–3 are bundled into a single script in the sketch after this list.

4. **Signups + abuse.** Check `/admin/users` for the past 24h.
   - Anything obviously bot-shaped (timing, email pattern, username pattern)? The per-/24 throttle should have caught bursts; a long tail of single-account spam needs manual review.
   - Suspended accounts: log who and why for the retro.

5. **Issue tracker.** Open `https://shithub.sh/shithub/shithub/issues`. Triage every new issue:
   - **P0** — site down / data loss / security. Fix immediately.
   - **P1** — broken core flow (signup, push, merge). Fix this week.
   - **P2** — minor bug / UX. Fix in next sprint.
   - **P3** — feature ask / WONTFIX / accept-with-rationale.
   - Reply on every issue within 24h, even if just "tracked, P2."

6. **Retro.** Update `docs/internal/retro/v0.1.0.md`: fill in the "Numbers" table, the "What surprised us" section, and the top-3 issues table.
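A consolidated sketch of checks 1–3, built from the commands above. Running it from the workstation and the exit-on-error plumbing are assumptions; `ssh db`, the rclone remote, and the repo URLs are the ones this runbook already names:

```sh
#!/bin/sh
# Day-after spot checks, bundling items 1-3 above. Sketch only: assumes
# your workstation can `ssh db` and has git installed.
set -eu

echo "== 1. daily logical backup (expect one .dump) =="
ssh db "sudo -u postgres rclone --config /etc/rclone-shithub.conf \
    lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/"

echo "== 2. WAL archiver (failed_count 0, recent last_archived_time) =="
ssh db "sudo -u postgres psql -d shithub -c \
    'SELECT failed_count, last_archived_time FROM pg_stat_archiver;'"

echo "== 3. mirror sync =="
mirror=$(git ls-remote https://github.com/tenseleyFlow/shithub.git trunk | cut -f1)
origin=$(git ls-remote https://shithub.sh/shithub/shithub.git trunk | cut -f1)
if [ "$mirror" = "$origin" ]; then
    echo "mirror in sync: $origin"
else
    echo "mirror differs (ok if a push landed <1h ago): $mirror vs $origin"
fi
```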

"First incident?"

The general procedure is in `docs/internal/runbooks/incidents.md`. Day-one specifics:

1. **Status page first.** Within 5 min of detecting an incident, update `docs/public/status.md` to "Investigating: <one line>". Push, sync to the docs bucket (the sketch after this list shows the mechanical part). The fact that you're aware is half the battle.
2. **Identify the right runbook.** Most v1 incidents will be one of:
   - `incidents.md#shithubd-down` — process crashed.
   - `incidents.md#postgres-down` — DB unreachable.
   - `incidents.md#archive-failing` — WAL archiver broken.
   - `incidents.md#job-backlog` — worker queue runaway.
   - `incidents.md#webhook-delivery-failing` — outbound delivery issues.
3. **Mitigate before fix.** If it's the announcement spike tipping the rate-limit floor, you can briefly raise the floor (config edit + restart) rather than untangling the abuse signal. The runbook and the rate-limit-tuning notes below are both at hand.
4. **Status page on resolution.** "Resolved at HH:MM UTC. <One sentence on cause.> <One sentence on prevention if relevant.>"
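The mechanical part of step 1 might look like the following. A sketch only: the commit-and-push flow is implied by "push, sync to docs bucket", but the `spaces-prod:shithub-docs` remote is hypothetical; substitute whatever the docs sync actually targets.

```sh
# Status-page update (step 1). The destination below is HYPOTHETICAL --
# substitute the real docs-bucket remote.
$EDITOR docs/public/status.md        # set "Investigating: <one line>"
git add docs/public/status.md
git commit -m "status: investigating"
git push
rclone copy docs/public/ spaces-prod:shithub-docs/   # HYPOTHETICAL remote
```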

## Communication posture

- **Honest > fast.** If you don't know the cause, "investigating" is a fine status. Speculation in public is a debt you pay later.
- **No timelines under stress.** "We'll have an update by 18:00 UTC" is fine. "We'll have a fix by 18:00 UTC" is a trap.
- **Engage the reporter.** When a user files a serious bug, thank them on the issue. Even if the fix is days out, the acknowledgement is the part they remember.

## Rate-limit tuning

If legitimate signups are getting throttled (CIDR-shared networks during the announcement spike), the steps are:

```sh
# 1. Edit the production inventory's rate-limit vars.
$EDITOR deploy/ansible/inventory/production
#   signup_per_ip_per_hour: 5  → 10
#   signup_per_24_per_hour: 20 → 50

# 2. Apply only the app role (no DB / edge churn).
make deploy ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app
```

The web service restart picks up new env. Document the change in the retro; tighten back to defaults once the spike passes.
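To confirm the restart actually picked up the new limits, one option is to read the unit's environment back. A sketch, assuming the app host answers to `web` the way the DB host answers to `db`, and that the inventory keys surface as env vars on the unit; both are assumptions, so verify against the app role:

```sh
# Both the 'web' ssh alias and the env-var naming are ASSUMPTIONS here;
# check the app role for how the inventory vars actually land.
ssh web "systemctl show shithubd-web -p Environment" | tr ' ' '\n' | grep -i signup
```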

"Drop the GitHub mirror?" decision

Default plan: **keep the mirror for 90 days post-launch**, then re-evaluate. The mirror exists as a recovery surface; if the self-hosted instance has had no extended outage in 90 days, drop the mirror push job and mark the GitHub repo as archived.

The decision and date land in the next retro after the 90-day mark.

## When to escalate (to whom?)

Solo on-call for v1; there is no second tier. Escalation paths:

- **DB destroyed beyond restore.** Restore from the cross-region DR bucket per `runbooks/restore.md`.
- **Account compromise on the operator's email.** Bump the compromised account's session epoch via SQL to revoke its sessions (see the sketch below); trigger a "rotate-secrets" pass.
- **Legal request.** Acknowledge receipt; do not respond to the substance same-day.
- **Mass-abuse incident.** Read-only mode, then triage. The read-only-mode runbook is the lever.
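For the account-compromise path, the session-epoch revoke might look like the following. A sketch only: `users.session_epoch` is a hypothetical name for whatever the schema actually calls it; the mechanism (bump the epoch so sessions minted under the old value stop validating) is the standard pattern.

```sh
# HYPOTHETICAL schema names; substitute the real table/column. Bumping
# the epoch invalidates every session issued under the old value,
# assuming session tokens embed the epoch and handlers compare it.
ssh db
sudo -u postgres psql -d shithub -c \
  "UPDATE users SET session_epoch = session_epoch + 1 WHERE email = '<operator email>';"
```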