tenseleyflow/shithub / 824c62d

S40: cutover artifacts (checklist + smoke + rollback)

Authored by espadonne
SHA: 824c62d9e6968842e7fb6c2b0336dfd1f9e81bca
Parents: 902991c
Tree: 6fbc533

3 changed files

Status File + -
A deploy/cutover/checklist.md 189 0
A deploy/cutover/rollback.sh 101 0
A deploy/cutover/smoke.sh 100 0
deploy/cutover/checklist.md (added)
@@ -0,0 +1,189 @@
1
+# Cutover checklist
2
+
3
+The S40 launch checklist. Walk it top-to-bottom on cutover day;
4
+do not skip steps. Each box has a verification command or a
5
+visual check.
6
+
7
+> **Time-box.** A clean run is ~45 min from "ssh in" to "signup
8
+> open." Budget 90 min; stop and back out if you hit ~2 hours.
9
+
10
+## T-7 days
11
+
12
+- [ ] DNS A/AAAA for `shithub.example` published with low TTL
13
+      (300s) so cutover-day changes propagate fast. Verify:
14
+      `dig +short A shithub.example`.
15
+- [ ] DNS CNAME for `docs.shithub.example` published.
16
+- [ ] Postmark domain verified; SPF/DKIM/DMARC aligned. Verify:
17
+      Postmark dashboard → Domains → green.
18
+- [ ] Signup-throttle config reviewed; per-IP and per-/24
19
+      ceilings tuned for the announcement bump.
20
+- [ ] Monitoring alerts wired to the on-call's Telegram + SMS.
21
+      Test by triggering a synthetic `BackupOverdue` alert via
22
+      Alertmanager API and confirming it pages.
23
+- [ ] Rollback rehearsed on staging:
24
+      `git checkout v0.999 && make deploy ANSIBLE_INVENTORY=staging`.
25
+
26
+## T-48 hours
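The low-TTL item above can be checked mechanically instead of by eye. A minimal sketch, assuming the answer-section format printed by `dig +noall +answer`; the helper name `ttl_within_limit` is hypothetical and not part of this commit. A recursive resolver reports a TTL that counts down from the published value, so the check is "at or below the ceiling" rather than "equal to it".

```sh
# Sketch: verify every record in `dig +noall +answer` output has a TTL at or
# below a ceiling. ttl_within_limit is an illustrative name, not project code.

# Reads answer lines on stdin: "<name> <ttl> <class> <type> <rdata>".
# Returns 0 only if at least one record was seen and no TTL exceeds the max.
ttl_within_limit() {
  local max="${1:-300}"
  awk -v max="$max" '
    NF >= 5 { seen = 1; if ($2 + 0 > max + 0) bad = 1 }
    END     { if (seen && !bad) exit 0; exit 1 }
  '
}
```

On cutover prep this might be run as `dig +noall +answer A shithub.example | ttl_within_limit 300`.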
27
+
28
+- [ ] Last DNS change committed. Waiting 48h before cutover
29
+      lets stale cached records expire.
30
+- [ ] S37 backup-restore drill green within last 7 days.
31
+- [ ] S38 docs deploy verified; `https://docs.shithub.example/`
32
+      returns 200.
33
+- [ ] S39 P0/P1 bugs closed.
34
+- [ ] Tag the release commit:
35
+      ```sh
36
+      git tag -a v1.0.0 -m "v1.0.0 — launch"
37
+      git push origin v1.0.0
38
+      ```
39
+
40
+## T-1 hour
41
+
42
+- [ ] On-call has phone + laptop reachable.
43
+- [ ] Status page updated to "Cutover in progress" (manual edit
44
+      to `docs/public/status.md`, push, sync to docs bucket).
45
+- [ ] `caddy_use_acme_staging=false` in production inventory
46
+      (so the cutover doesn't accidentally fall back to LE
47
+      staging).
48
+
49
+## T-0: cutover
50
+
51
+```sh
52
+# 1. Pull the v1.0.0 tag.
53
+git fetch --tags
54
+git checkout v1.0.0
55
+
56
+# 2. Dry-run to confirm exactly what will change.
57
+make deploy-check ANSIBLE_INVENTORY=production
58
+
59
+# 3. Apply. Expect ~10s downtime as the web service restarts.
60
+make deploy ANSIBLE_INVENTORY=production
61
+```
62
+
63
+The Ansible run includes `shithubd migrate up` as the web
64
+service's `ExecStartPre`. New migrations run as part of the
65
+restart; the unit stays in `activating` until they complete.
66
+
67
+Watch:
68
+
69
+```sh
70
+ssh web-01
71
+journalctl -fu shithubd-web
72
+```
73
+
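Because the unit stays in `activating` while migrations run, running smoke too early would report false failures. A minimal sketch of a wait loop, assuming the state strings printed by `systemctl is-active`; `wait_for_active` is a hypothetical helper, and the probe command is passed in so the loop can be tried without systemd.

```sh
# Sketch: poll a state probe until it prints "active"; give up on "failed"
# or once the timeout is reached. wait_for_active is an illustrative name.

# Usage: wait_for_active <timeout-seconds> <interval-seconds> <probe command...>
# Both durations must be whole seconds.
wait_for_active() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0 state
  while :; do
    # The probe may exit non-zero while the unit is not yet active.
    state="$("$@" 2>/dev/null || true)"
    case "$state" in
      active) return 0 ;;
      failed) echo "unit failed" >&2; return 1 ;;
    esac
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s (last state: ${state:-unknown})" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

On `web-01` this could be called as `wait_for_active 120 2 systemctl is-active shithubd-web` before starting the smoke script; the 120s and 2s values are placeholders, not tuned numbers.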
74
+## Smoke
75
+
76
+Run the smoke script as soon as the deploy reports `ok=N
77
+changed=N failed=0`:
78
+
79
+```sh
80
+deploy/cutover/smoke.sh https://shithub.example
81
+```
82
+
83
+The script checks the home page, signup and login forms, health
84
+endpoints, security headers, the docs subdomain, and (only when
85
+`SHITHUB_SMOKE_PAT` is set) one API call. Exits non-zero if any check fails.
86
+
87
+## Bootstrap-admin
88
+
89
+```sh
90
+ssh web-01
91
+sudo -u shithub /usr/local/bin/shithubd admin bootstrap-admin \
92
+     --email you@yourdomain
93
+```
94
+
95
+The CLI prints a one-time password-reset link. Open in a browser,
96
+set a password, **immediately enable 2FA** (Settings → Account
97
+security).
98
+
99
+## Open signup
100
+
101
+If signup was gated behind a feature flag during the pre-launch
102
+build:
103
+
104
+```sh
105
+ssh web-01
106
+sudo systemctl edit shithubd-web --full
107
+# remove SHITHUB_AUTH__SIGNUP_DISABLED=true (or set to false)
108
+sudo systemctl restart shithubd-web
109
+```
110
+
111
+Otherwise signup is already on; verify via the signup form
112
+returning 200 + a valid CSRF token.
113
+
114
+## Mirror to GitHub
115
+
116
+Set up the one-way mirror so the GitHub copy keeps receiving
117
+pushes:
118
+
119
+```sh
120
+# On the web host, as the shithub user:
121
+cd /data/repos/shithub/shithub.git
122
+git remote add github https://github.com/tenseleyFlow/shithub.git
123
+# Add the mirror push to the periodic worker job (covered by
124
+# the worker config; the mirror job kind = "git.mirror_push").
125
+```
126
+
127
+Confirm a test push lands on both:
128
+
129
+```sh
130
+git clone https://shithub.example/shithub/shithub.git /tmp/test-clone
131
+cd /tmp/test-clone
132
+echo "launch test" >> .launch-test
133
+git add .launch-test
134
+git commit -m "launch smoke push"
135
+git push origin trunk
136
+# Wait ~60s for the mirror job to run, then confirm on GitHub:
137
+git ls-remote https://github.com/tenseleyFlow/shithub.git trunk
138
+```
139
+
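The "lands on both" check above compares two `git ls-remote` results by eye. A minimal sketch of doing that comparison in a script; `tip_sha` and `mirror_in_sync` are hypothetical helper names. The ref is matched exactly because `git ls-remote <url> trunk` treats `trunk` as a trailing-path pattern and can also return refs such as `refs/heads/feature/trunk`.

```sh
# Sketch: compare the commit that a branch points at on two remotes.
# tip_sha and mirror_in_sync are illustrative names, not project code.

# Reads `git ls-remote` output on stdin ("<sha>\t<ref>") and prints the SHA
# for exactly refs/heads/<branch>, or nothing if that ref is absent.
tip_sha() {
  local branch="$1"
  awk -v ref="refs/heads/$branch" '$2 == ref { print $1 }'
}

# Returns 0 only when both SHAs are non-empty and identical.
mirror_in_sync() {
  local a="$1" b="$2"
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

After the ~60s wait, the two SHAs could be gathered with `git ls-remote https://shithub.example/shithub/shithub.git trunk | tip_sha trunk` and the same command against the GitHub URL, then passed to `mirror_in_sync`.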
140
+## Status page
141
+
142
+Update `docs/public/status.md` to "All systems normal." with the
143
+current timestamp; push, sync to docs bucket.
144
+
145
+## Announcement
146
+
147
+Schedule the announcement post for **Tuesday 09:00 ET** (or your
148
+chosen window). Submit to:
149
+
150
+- [ ] Hacker News: title + URL only; first comment is the
151
+      "What is shithub?" intro.
152
+- [ ] /r/programming, /r/selfhosted: link + summary, follow
153
+      subreddit rules.
154
+- [ ] lobste.rs: title + URL.
155
+- [ ] Mastodon: short post + link.
156
+
157
+Have the FAQ tab open; expect "is this Forgejo?" / "why not
158
+Codeberg?" / "where's CI?" within the first hour.
159
+
160
+## Day-zero monitoring
161
+
162
+For the first 24h:
163
+
164
+- Refresh Grafana every 30 min.
165
+- Triage every alert immediately; nothing false-positive should
166
+  page in week 1 (we tuned for it).
167
+- Bug reports go to `https://shithub.example/shithub/shithub/issues`
168
+  (the project's own self-hosted issues — drink your own
169
+  champagne).
170
+
171
+## Backout
172
+
173
+If cutover goes sideways within the first hour:
174
+
175
+1. **Stop the bleed.** Put the site in read-only mode
176
+   (`docs/internal/runbooks/read-only-mode.md`).
177
+2. **Decide:** roll back code, restore data, or wait?
178
+3. If rolling back code: `deploy/cutover/rollback.sh v0.999`.
179
+4. Status page → "Investigating" with what we know.
180
+5. Page the operator (yourself, by definition).
181
+
182
+The 24h SLO is "report what we know, not promises about when it's
183
+fixed." Honesty wins trust; deadlines under stress lose it.
184
+
185
+## Day-one retro
186
+
187
+After the first 24h, fill in `docs/internal/retro/v1.0.0.md`
188
+with: what worked, what surprised us, top 3 user-reported
189
+issues, and the next sprint's focus.
deploy/cutover/rollback.sh (added)
@@ -0,0 +1,101 @@
1
+#!/usr/bin/env bash
2
+# SPDX-License-Identifier: AGPL-3.0-or-later
3
+#
4
+# S40 cutover rollback. Re-deploys a previous tag to production.
5
+# Walks the operator through the data-safe path; do NOT use this
6
+# blindly when migrations changed schema in non-additive ways
7
+# (read docs/internal/runbooks/rollback.md before running).
8
+#
9
+# Usage:
10
+#   deploy/cutover/rollback.sh v0.999
11
+#
12
+# What it does:
13
+#   1. Verifies the tag exists locally, fetching tags if needed.
14
+#   2. Warns if migrations differ between the tag and HEAD.
15
+#   3. Checks out the tag.
16
+#   4. Runs make deploy-check (DRY-RUN) and prints the diff.
17
+#   5. Asks for explicit confirmation, then runs make deploy.
18
+#   6. Runs the smoke script post-deploy.
19
+#
20
+# Exit status:
21
+#   0 — rollback completed and smoked
22
+#   1 — operator aborted, deploy failed, or smoke failed
23
+#   2 — usage error
24
+
25
+set -euo pipefail
26
+
27
+if [[ $# -lt 1 ]]; then
28
+  echo "usage: $0 <previous-tag>" >&2
29
+  exit 2
30
+fi
31
+
32
+TAG="$1"
33
+ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
34
+cd "$ROOT"
35
+
36
+confirm() {
37
+  local prompt="$1"
38
+  read -r -p "$prompt [yes/NO] " resp
39
+  if [[ "$resp" != "yes" ]]; then
40
+    echo "aborted." >&2
41
+    exit 1
42
+  fi
43
+}
44
+
45
+echo "rollback target: $TAG"
46
+
47
+# 1. Verify the tag exists locally; fetch if needed.
48
+if ! git rev-parse --verify "refs/tags/$TAG" >/dev/null 2>&1; then
49
+  echo "tag $TAG not found locally; fetching..."
50
+  git fetch --tags
51
+  if ! git rev-parse --verify "refs/tags/$TAG" >/dev/null 2>&1; then
52
+    echo "FAIL: tag $TAG does not exist on origin" >&2
53
+    exit 1
54
+  fi
55
+fi
56
+
57
+# 2. Check whether the rollback crosses any new migration files.
58
+# Forward-only migrations mean the schema is ahead of the rolled-
59
+# back code. The operator must read the matching `down` migrations
60
+# before continuing; we don't auto-rollback schema here.
61
+NEW_MIGRATIONS=$(git diff --name-only "$TAG"..HEAD -- 'internal/migrationsfs/migrations/*.sql' || true)
62
+if [[ -n "$NEW_MIGRATIONS" ]]; then
63
+  echo ""
64
+  echo "WARNING: migrations exist between $TAG and HEAD:"
65
+  echo "$NEW_MIGRATIONS" | sed 's/^/  /'
66
+  echo ""
67
+  echo "Rolling back code without rolling back schema is fine ONLY if"
68
+  echo "every migration above is purely additive (new columns/tables"
69
+  echo "the old code ignores). Read docs/internal/runbooks/rollback.md"
70
+  echo "before continuing."
71
+  echo ""
72
+  confirm "All migrations above are additive (the old code handles them)?"
73
+fi
74
+
75
+# 3. Check out the tag.
76
+echo ""
77
+echo "checking out $TAG..."
78
+git checkout "$TAG"
79
+
80
+# 4. Dry-run.
81
+echo ""
82
+echo "running ANSIBLE deploy-check (DRY-RUN)..."
83
+make deploy-check ANSIBLE_INVENTORY=production
84
+
85
+# 5. Confirm + apply.
86
+echo ""
87
+confirm "Apply the rollback to production?"
88
+make deploy ANSIBLE_INVENTORY=production
89
+
90
+# 6. Smoke. Reads the production base URL from SHITHUB_PROD_URL;
91
+# falls back to asking.
92
+BASE="${SHITHUB_PROD_URL:-}"
93
+if [[ -z "$BASE" ]]; then
94
+  read -r -p "Smoke base URL (e.g. https://shithub.example): " BASE
95
+fi
96
+echo ""
97
+echo "running smoke against $BASE..."
98
+deploy/cutover/smoke.sh "$BASE"
99
+
100
+echo ""
101
+echo "rollback to $TAG complete and smoked. Update the status page."
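The migration prompt in step 2 relies on the operator reading each file. A heuristic pre-scan could narrow down which files deserve that attention. This is a sketch only: the keyword list is an assumption about what counts as non-additive, `flag_non_additive` is a hypothetical helper, and a hit means "read this file", not "rollback is unsafe". If a migration file carries its own `down` section, that section will likely trigger a hit as well.

```sh
# Heuristic sketch: list migration files containing statements that are
# usually not additive. The keyword list is an assumption, not project policy.

# Prints the path of every given file with a likely destructive statement.
# Returns 0 if at least one file was flagged, non-zero otherwise.
flag_non_additive() {
  # Without file arguments grep would read stdin, so bail out early.
  [ "$#" -eq 0 ] && return 1
  local pattern='(DROP[[:space:]]+(TABLE|COLUMN|INDEX|CONSTRAINT)|RENAME[[:space:]]+(TO|COLUMN)|ALTER[[:space:]]+COLUMN|TRUNCATE|DELETE[[:space:]]+FROM)'
  grep -Eil -- "$pattern" "$@"
}
```

It could be fed the paths collected in `NEW_MIGRATIONS`, with the output printed just before the confirmation prompt.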
deploy/cutover/smoke.sh (added)
@@ -0,0 +1,100 @@
1
+#!/usr/bin/env bash
2
+# SPDX-License-Identifier: AGPL-3.0-or-later
3
+#
4
+# S40 cutover smoke test. Exercises the public-facing routes that
5
+# matter at launch: landing page, signup/login forms render with
6
+# a fresh CSRF token, health endpoints respond, the docs subdomain
7
+# is reachable, the API authenticates a known PAT.
8
+#
9
+# Usage:
10
+#   deploy/cutover/smoke.sh https://shithub.example
11
+#
12
+# Optional env (when set, the script also exercises the API):
13
+#   SHITHUB_SMOKE_PAT     — a valid shp_ token for `user:read`
14
+#   SHITHUB_SMOKE_DOCS    — docs subdomain URL (default: docs.<base>)
15
+#
16
+# Exit status:
17
+#   0 — all green
18
+#   1 — at least one check failed
19
+#   2 — usage error
20
+
21
+set -euo pipefail
22
+
23
+if [[ $# -lt 1 ]]; then
24
+  echo "usage: $0 <base-url>" >&2
25
+  exit 2
26
+fi
27
+
28
+BASE="$1"
29
+DOCS="${SHITHUB_SMOKE_DOCS:-${BASE/shithub./docs.shithub.}}"
30
+fail=0
31
+
32
+say() { printf '\n=== %s ===\n' "$*"; }
33
+ok()  { printf '  PASS: %s\n' "$*"; }
34
+bad() { printf '  FAIL: %s\n' "$*"; fail=$((fail + 1)); }
35
+
36
+# 1. Landing.
37
+say "GET $BASE/"
38
+body=$(curl -fsS -o - -w "\n%{http_code}\n" "$BASE/" 2>&1) || { bad "landing fetch"; body=""; }
39
+if [[ "$body" == *"shithub"* ]]; then ok "body contains shithub"; else bad "body missing shithub"; fi
40
+
41
+# 2. Health endpoints. /readyz proves DB + storage are reachable.
42
+say "GET $BASE/-/health"
43
+curl -fsS "$BASE/-/health" >/dev/null && ok "/-/health 200" || bad "/-/health"
44
+say "GET $BASE/healthz"
45
+curl -fsS "$BASE/healthz" >/dev/null && ok "/healthz 200" || bad "/healthz"
46
+say "GET $BASE/readyz"
47
+curl -fsS "$BASE/readyz" >/dev/null && ok "/readyz 200" || bad "/readyz"
48
+
49
+# 3. Signup form renders.
50
+say "GET $BASE/signup"
51
+body=$(curl -fsS "$BASE/signup") || { bad "signup fetch"; body=""; }
52
+if [[ "$body" == *"csrf_token"* ]]; then ok "CSRF token present"; else bad "no csrf_token in signup form"; fi
53
+
54
+# 4. Login form renders.
55
+say "GET $BASE/login"
56
+body=$(curl -fsS "$BASE/login") || { bad "login fetch"; body=""; }
57
+if [[ "$body" == *"username"* ]] && [[ "$body" == *"password"* ]]; then
58
+  ok "login form fields present"
59
+else
60
+  bad "login form missing username/password"
61
+fi
62
+
63
+# 5. Security headers. HSTS and X-Content-Type-Options must be set.
64
+say "TLS / HSTS"
65
+hdrs=$(curl -fsS -I "$BASE/" 2>&1) || { bad "headers fetch"; hdrs=""; }
66
+if grep -qi "strict-transport-security" <<<"$hdrs"; then
67
+  ok "HSTS header set"
68
+else
69
+  bad "HSTS header missing"
70
+fi
71
+if grep -qi "x-content-type-options" <<<"$hdrs"; then
72
+  ok "X-Content-Type-Options set"
73
+else
74
+  bad "X-Content-Type-Options missing"
75
+fi
76
+
77
+# 6. Docs subdomain.
78
+say "GET $DOCS/"
79
+curl -fsS -o /dev/null "$DOCS/" && ok "docs site 200" || bad "docs site"
80
+
81
+# 7. API (only if a PAT is provided).
82
+if [[ -n "${SHITHUB_SMOKE_PAT:-}" ]]; then
83
+  say "GET $BASE/api/v1/user (with PAT)"
84
+  body=$(curl -fsS -H "Authorization: Bearer $SHITHUB_SMOKE_PAT" "$BASE/api/v1/user") || { bad "api fetch"; body=""; }
85
+  if [[ "$body" == *'"username"'* ]]; then
86
+    ok "API returned a user object"
87
+  else
88
+    bad "API response unexpected: $body"
89
+  fi
90
+else
91
+  printf '  SKIP: API check (set SHITHUB_SMOKE_PAT to run)\n'
92
+fi
93
+
94
+printf '\n'
95
+if [[ "$fail" -eq 0 ]]; then
96
+  echo "smoke: all checks passed"
97
+  exit 0
98
+fi
99
+echo "smoke: $fail check(s) FAILED"
100
+exit 1
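Several checks in the script use the `cmd && ok ... || bad ...` form, which would also run `bad` if `ok` itself ever failed. A minimal sketch of a single helper that avoids this with an explicit `if`; `check` is a hypothetical name, while `ok`, `bad`, and `fail` are copied from the script so the snippet stands alone.

```sh
# Sketch: one helper in place of the repeated `cmd && ok ... || bad ...`
# lines. check is an illustrative name; ok/bad/fail mirror smoke.sh.

fail=0
ok()  { printf '  PASS: %s\n' "$*"; }
bad() { printf '  FAIL: %s\n' "$*"; fail=$((fail + 1)); }

# Usage: check <label> <command...>
# Runs the command with its output discarded and records PASS or FAIL.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    ok "$label"
  else
    bad "$label"
  fi
}
```

A health check would then read `check "/healthz 200" curl -fsS "$BASE/healthz"`.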