tenseleyflow/shithub / 2aa998f


S37: docs — deploy.md (topology, bootstrap, deploy/rollback)

Authored by espadonne
SHA: 2aa998f2d149e7dbb68fd8cd6bd4ed15616c27fc
Parents: 5721d2f
Tree: e663931

1 changed file

A `docs/internal/deploy.md` (added): +165 −0
# Deployment

This is the operator's guide to taking a fresh box from "Ubuntu 24.04
with sshd" to "running shithubd in production." It is opinionated:
DigitalOcean for compute, DigitalOcean Spaces for object storage,
Postgres on a dedicated droplet, Caddy as the edge, WireGuard for the
monitoring mesh. If you're running on something else, the Ansible
roles are the source of truth — read them.

## Topology

```
                +----------------------------+
   public --->  |  Caddy (TLS, rate limits)  |  :443
                +----------------------------+
                       |  127.0.0.1:8080
                +----------------------------+
                |  shithubd web (systemd)    |
                |  shithubd worker (systemd) |
                |  shithubd cron  (timer)    |
                +----------------------------+
                       |
                +-----------+         +-----------------+
                | Postgres  |         | Spaces (S3)     |
                +-----------+         | - WAL archive   |
                                      | - daily dumps   |
                                      | - LFS / blobs   |
                                      +-----------------+
                       \                    /
                        \      WireGuard mesh (10.50.0.0/24)
                         \________ ________/
                                  |
                          +----------------+
                          |  Monitoring    |
                          |  Prom/Loki/AM  |
                          |  Grafana       |
                          +----------------+
```

The monitoring host is *not* on the public internet. App processes
listen on `127.0.0.1` and on the wg0 mesh interface only; the
metrics port is not reachable from outside the mesh at all.
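
Concretely, that split is a bind-address question. A sketch of the app
env file, with illustrative variable names and mesh addresses (the real
keys live in the shithubd role defaults):

```ini
# /etc/shithub/web.env (sketch; variable names are illustrative)
# Public traffic arrives only via Caddy, so the HTTP listener stays
# on loopback.
SHITHUB_HTTP_LISTEN=127.0.0.1:8080
# Metrics bind to this host's wg0 address, so only peers on
# 10.50.0.0/24 (i.e. the monitoring host) can scrape them.
SHITHUB_METRICS_LISTEN=10.50.0.2:9090
```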

## One-time bootstrap

1. **Provision the droplets.** Three for staging is enough (web,
   db, monitoring). Production starts at five (2× web, db, backup,
   monitoring) and grows the web tier first.
2. **Get sshd public-key login working** for the operator user. The
   Ansible base role narrows it from there.
3. **Populate `deploy/ansible/inventory/<env>`** by copying
   `inventory/staging.example`. The variables marked with `# REQUIRED`
   come from the operator's secret store (Bitwarden, 1Password, etc.).
   Do **not** commit a real inventory file.
4. **Bootstrap the admin user.** After the first deploy, ssh to the web
   host and run `shithubd admin bootstrap-admin --email you@…`. That
   gives you the site-admin bit; subsequent admin grants happen
   through `/admin/users/{id}`.
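
A pre-flight check for step 3 can be sketched in a few lines of shell;
the helper name and the reliance on the `# REQUIRED` markers surviving
verbatim are this doc's assumptions, not a shipped tool:

```shell
# check_inventory: fail if an inventory file still contains unfilled
# "# REQUIRED" placeholders copied from inventory/staging.example.
check_inventory() {
    inv="$1"
    # Any line still carrying the marker means the operator forgot to
    # substitute a real value from the secret store.
    if grep -q '# REQUIRED' "$inv"; then
        echo "error: $inv has unfilled REQUIRED variables" >&2
        return 1
    fi
    echo "ok: $inv"
}
```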

## Deploying

The Makefile wraps the playbook so the human commands are short:

```sh
# dry-run against staging (default inventory)
make deploy-check

# apply against staging
make deploy

# apply against production
ANSIBLE_INVENTORY=production make deploy

# only the app, not the edge or db
ANSIBLE_TAGS=app make deploy

# only one host (e.g. canary)
ANSIBLE_LIMIT=web-02 make deploy
```

The playbook is idempotent — run it twice in a row and the second
run should report `ok=N changed=0`. If the second run reports any
changes, that's a config drift bug; investigate before continuing.
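
That invariant is easy to script. A sketch that inspects a saved
`PLAY RECAP` (the function name is illustrative; the recap format is
Ansible's standard output):

```shell
# recap_is_clean: return 0 iff an ansible-playbook PLAY RECAP reports
# changed=0 for every host. Recap lines look like:
#   web-01 : ok=42 changed=0 unreachable=0 failed=0 ...
recap_is_clean() {
    # Fail if any host line reports a nonzero changed count.
    ! grep -qE 'changed=[1-9][0-9]*' "$1"
}
```

Something like `ansible-playbook ... | tee recap.log` followed by
`recap_is_clean recap.log` would turn drift into a hard failure.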

## What the playbook does

In rough order:

- **base** — apt baseline, ufw default-deny, fail2ban, system users
  (`shithub`, `shithub-ssh`), data root at `/data`.
- **postgres** (`tags: [db]`) — installs PG16, initdb on `/data/pgdata`,
  applies our `postgresql.conf`/`pg_hba.conf`, wires the WAL archive
  command, creates the `shithub` and `shithub_hook` roles with
  exact-grant permissions.
- **shithubd** (`tags: [app]`) — copies the binary into
  `/usr/local/bin`, drops env files into `/etc/shithub/`, installs
  the three systemd units, restarts on change. The `web.service`
  ExecStartPre runs `shithubd migrate up`, so a deploy with new
  migrations is one command.
- **caddy** (`tags: [edge]`) — installs Caddy + the templated
  `Caddyfile`. Auto-TLS via Let's Encrypt staging until the operator
  flips a vars flag; production after that.
- **wireguard** (`tags: [net]`) — peers each host into the mesh.
- **backup** (`tags: [backup]`) — installs the daily backup timer
  on the db host and the cross-region sync timer on the backup host.
- **monitoring-client** (`tags: [monitoring]`) — node-exporter +
  promtail on every host, pointing at the monitoring host.
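
The ExecStartPre trick in the **shithubd** role looks roughly like
this. The unit below is an illustrative sketch only; the shipped units
in `deploy/systemd/` are authoritative:

```ini
# shithub-web.service (sketch; real unit lives in deploy/systemd/)
[Unit]
Description=shithubd web
After=network-online.target

[Service]
User=shithub
EnvironmentFile=/etc/shithub/web.env
# Run pending migrations before the server starts, so a deploy that
# ships new migrations is still a single "make deploy".
ExecStartPre=/usr/local/bin/shithubd migrate up
ExecStart=/usr/local/bin/shithubd web
Restart=on-failure

[Install]
WantedBy=multi-user.target
```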

The monitoring host itself is provisioned by a separate Ansible play
that lives outside this repo (it depends on operator-specific TLS
material). The configs in `deploy/monitoring/` are the source of
truth for *what* runs there.
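
For example, the scrape side of those configs could look like this
Prometheus fragment; host addresses and ports here are illustrative,
not the shipped values:

```yaml
# Sketch of a deploy/monitoring/ scrape config over the WireGuard mesh.
scrape_configs:
  - job_name: shithubd
    static_configs:
      - targets:
          - 10.50.0.2:9090   # web-01, wg0 address
          - 10.50.0.3:9090   # web-02
  - job_name: node
    static_configs:
      - targets:
          - 10.50.0.2:9100   # node-exporter runs on every peer
          - 10.50.0.10:9100  # db
```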

## Backups

Two layers, both mandatory:

1. **WAL archiving** (`deploy/postgres/archive_command.sh`) ships
   every WAL segment to `spaces-prod:shithub-wal` in real time.
   Postgres won't recycle a segment until the script reports success,
   so a failing archiver fills the disk — alert on
   `pg_stat_archiver.failed_count > 0`.
2. **Daily logical** (`deploy/postgres/backup-daily.sh`) takes a
   `pg_dump --format=custom` once per day and ships it to
   `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. Keeps the last
   7 locally for fast recovery.
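
A minimal sketch of the archiving step, assuming `spaces-prod:` is an
rclone-style remote (the real `deploy/postgres/archive_command.sh` is
authoritative; `RCLONE` is overridable only so the sketch can be
exercised without a real remote):

```shell
# archive_wal SEGMENT_PATH SEGMENT_NAME — sketch of the WAL archive step.
# Postgres invokes archive_command with %p (path) and %f (file name);
# a nonzero exit makes Postgres retry and keep the segment on disk,
# which is exactly the disk-filling failure mode described above.
archive_wal() {
    seg_path="$1"
    seg_name="$2"
    "${RCLONE:-rclone}" copyto "$seg_path" "spaces-prod:shithub-wal/$seg_name"
}
```

In `postgresql.conf` this would be wired as something like
`archive_command = '/usr/local/bin/archive_command.sh %p %f'`.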

Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
both buckets to a second region for DR. Lifecycle in
`deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps
after 90.
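
In the S3 lifecycle schema that Spaces implements, those pruning rules
could look roughly like this. Rule IDs and prefixes are illustrative,
and since lifecycle configs apply per bucket, the real `lifecycle.json`
presumably splits the WAL and dump rules across the two buckets:

```json
{
  "Rules": [
    {
      "ID": "expire-wal",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 30 }
    },
    {
      "ID": "expire-daily-dumps",
      "Status": "Enabled",
      "Filter": { "Prefix": "daily/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```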

The recovery target is **PITR within 30 days, full restore within
1 hour**. We verify this every quarter with the restore drill —
see `runbooks/restore.md`.

## Rollback

The deploy is a single binary + an env file + a systemd unit; rolling
back is "redeploy the previous binary." Two paths:

- **Tag rollback (preferred)** — `git checkout v<previous>` and
  `make deploy`. Migrations are forward-only by design; if the new
  release added a migration, you need to either accept that the
  rollback leaves the schema ahead, or apply the migration's matching
  `down` first (`shithubd migrate down`). Check `runbooks/rollback.md`
  before touching migrations.
- **Hotfix on a branch** — branch from the rolled-back tag, fix,
  cut a new release. Don't force-push tags.
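
Finding `v<previous>` by hand invites typos. One way to resolve it from
git, assuming plain v-prefixed release tags (the helper itself is this
doc's invention, not a shipped command):

```shell
# previous_tag: print the release tag immediately preceding the newest
# one, using git's version-aware sort. Adjust the pattern if your
# release tags aren't plain v-prefixed versions.
previous_tag() {
    git tag --list 'v*' --sort=-v:refname | sed -n '2p'
}

# Typical tag rollback, per the steps above:
#   git checkout "$(previous_tag)" && make deploy
```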

## What goes where

| Concern                                    | File                             |
|--------------------------------------------|----------------------------------|
| Provisioning entrypoint                    | `deploy/ansible/site.yml`        |
| Per-environment vars                       | `deploy/ansible/inventory/<env>` |
| App systemd units                          | `deploy/systemd/`                |
| Edge config                                | `deploy/Caddyfile.j2`            |
| sshd (incl. AuthorizedKeysCommand for git) | `deploy/sshd_config.j2`          |
| Postgres scripts                           | `deploy/postgres/`               |
| Spaces lifecycle + DR                      | `deploy/spaces/`                 |
| WireGuard mesh                             | `deploy/wireguard/wg0.conf.j2`   |
| Monitoring configs                         | `deploy/monitoring/`             |
| Restore drill                              | `deploy/restore-drill/`          |
| Operator runbooks                          | `docs/internal/runbooks/`        |