
# Deployment

This is the operator's guide to taking a fresh box from "Ubuntu 24.04 with sshd" to "running shithubd in production." It is opinionated: DigitalOcean for compute, DigitalOcean Spaces for object storage, Postgres on a dedicated droplet, Caddy as the edge, WireGuard for the monitoring mesh. If you're running on something else, the Ansible roles are the source of truth — read them.

## Topology

```
                +----------------------------+
   public --->  |  Caddy (TLS, rate limits)  |  :443
                +----------------------------+
                       |  127.0.0.1:8080
                +----------------------------+
                |  shithubd web (systemd)    |
                |  shithubd worker (systemd) |
                |  shithubd cron  (timer)    |
                +----------------------------+
                       |
                +-----------+         +-----------------+
                | Postgres  |         | Spaces (S3)     |
                +-----------+         | - WAL archive   |
                                      | - daily dumps   |
                                      | - LFS / blobs   |
                                      +-----------------+
                       \                    /
                        \      WireGuard mesh (10.50.0.0/24)
                         \________ ________/
                                  |
                          +----------------+
                          |  Monitoring    |
                          |  Prom/Loki/AM  |
                          |  Grafana       |
                          +----------------+
```

The monitoring host is *not* on the public internet. App processes listen on `127.0.0.1` and on the wg0 mesh interface only; the metrics port is not reachable from outside the mesh.
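A quick way to verify that after a deploy — a sketch, using the process name and mesh subnet from this doc:

```sh
# List listening sockets owned by shithubd. Every local address should be
# 127.0.0.1 or an address in 10.50.0.0/24 (the wg0 mesh); anything bound
# to 0.0.0.0, [::], or a public IP is a config bug.
sudo ss -tlnp | grep shithubd
```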

## One-time bootstrap

1. **Provision the droplets.** Three for staging is enough (web, db, monitoring). Production starts at five (2× web, db, backup, monitoring) and grows the web tier first.
2. **Get sshd public-key login working** for the operator user. The Ansible base role narrows it from there.
3. **Populate `deploy/ansible/inventory/<env>`** by copying `inventory/staging.example`. The variables marked with `# REQUIRED` come from the operator's secret store (Bitwarden, 1Password, etc.). Do **not** commit a real inventory file. A sketch follows this list.
4. **Bootstrap the admin user.** After the first deploy, ssh to the web host and run `shithubd admin bootstrap-admin --email you@…`. That gives you the site-admin bit; subsequent admin grants happen through `/admin/users/{id}`.
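A minimal sketch of step 3 for a staging environment (`cp -r` covers the example being either a file or a directory):

```sh
cd deploy/ansible
cp -r inventory/staging.example inventory/staging
# Everything that must come from the secret store is marked in-line:
grep -Rn '# REQUIRED' inventory/staging
```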

## Deploying

The Makefile wraps the playbook so the human commands are short:

```sh
# dry-run against staging (default inventory)
make deploy-check

# apply against staging
make deploy

# apply against production
ANSIBLE_INVENTORY=production make deploy

# only the app, not the edge or db
ANSIBLE_TAGS=app make deploy

# only one host (e.g. canary)
ANSIBLE_LIMIT=web-02 make deploy
```

The playbook is idempotent — run it twice in a row and the second run should report `ok=N changed=0`. If the second run reports any changes, that's a config drift bug; investigate before continuing.
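A hedged way to script that check (the grep pattern assumes Ansible's standard `PLAY RECAP` output):

```sh
# Deploy twice; the second pass must be a no-op.
make deploy
make deploy | tee /tmp/deploy-second.log
if grep -E 'changed=[1-9]' /tmp/deploy-second.log; then
  echo "config drift detected: investigate before continuing" >&2
  exit 1
fi
```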

## What the playbook does

In rough order:

- **base** — apt baseline, ufw default-deny, fail2ban, system users (`shithub`, `shithub-ssh`), data root at `/data`.
- **postgres** (`tags: [db]`) — installs PG16, runs initdb on `/data/pgdata`, applies our `postgresql.conf`/`pg_hba.conf`, wires the WAL archive command, creates the `shithub` and `shithub_hook` roles with exact-grant permissions.
- **shithubd** (`tags: [app]`) — copies the binary into `/usr/local/bin`, drops env files into `/etc/shithub/`, installs the three systemd units, restarts on change. The `web.service` ExecStartPre runs `shithubd migrate up`, so a deploy with new migrations is one command. (A post-deploy check is sketched after this list.)
- **caddy** (`tags: [edge]`) — installs Caddy plus the templated `Caddyfile`. Auto-TLS runs against the Let's Encrypt staging CA until the operator flips a vars flag; production issuance after that.
- **wireguard** (`tags: [net]`) — peers each host into the mesh.
- **backup** (`tags: [backup]`) — installs the daily backup timer on the db host and the cross-region sync timer on the backup host.
- **monitoring-client** (`tags: [monitoring]`) — node-exporter + promtail on every host, pointing at the monitoring host.
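A post-deploy sanity pass might look like the following sketch. The unit names are hypothetical; this doc only says the app role installs three units and that `web.service` runs migrations in ExecStartPre.

```sh
# Confirm the three app units came back after the restart-on-change handler
# (unit names hypothetical):
systemctl list-units 'shithubd-*' --no-pager

# Confirm the ExecStartPre migration step ran cleanly on the web unit:
journalctl -u shithubd-web.service -n 100 --no-pager | grep -i migrate
```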

The monitoring host itself is provisioned by a separate Ansible play that lives outside this repo (it depends on operator-specific TLS material). The configs in `deploy/monitoring/` are the source of truth for *what* runs there.

## Backups

Two layers, both mandatory:

1. **WAL archiving** (`deploy/postgres/archive_command.sh`) ships every WAL segment to `spaces-prod:shithub-wal` in real time. Postgres won't recycle a segment until the script reports success, so a failing archiver fills the disk — alert on `pg_stat_archiver.failed_count > 0`. A minimal sketch of the archive script follows this list.
2. **Daily logical dumps** (`deploy/postgres/backup-daily.sh`) takes a `pg_dump --format=custom` once per day and ships it to `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. The last seven dumps are kept locally for fast recovery.
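For orientation, a minimal sketch of what an archive command like `archive_command.sh` has to do. The `spaces-prod:` remote syntax in this doc looks like an rclone remote, but that is an assumption; the real script is the source of truth.

```sh
#!/bin/sh
# Called by Postgres as: archive_command = 'archive_command.sh %p %f'
# %p = path to the WAL segment (relative to PGDATA), %f = file name only.
set -eu
wal_path="$1"
wal_name="$2"
# Must exit non-zero on any failure: Postgres then keeps the segment and
# retries, and pg_stat_archiver.failed_count increments (which we alert on).
exec rclone copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
```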

Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors both buckets to a second region for DR. Lifecycle rules in `deploy/spaces/lifecycle.json` prune WAL after 30 days and dumps after 90. Actions log/artifact objects use the primary object bucket's `actions/runs/` prefix; apply `deploy/spaces/actions-lifecycle.json` with `deploy/cutover/apply-actions-lifecycle.sh` so provider-side blob retention matches the `workflow:cleanup` database sweep.

The recovery target is **PITR within 30 days, full restore within 1 hour**. We verify this every quarter with the restore drill — see `runbooks/restore.md`.
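As a rough shape of the logical-restore half of the drill (bucket layout per this doc; the remote name, dump file name, and scratch database are assumptions — the real procedure lives in `runbooks/restore.md` and `deploy/restore-drill/`):

```sh
# Pull today's dump from the daily prefix and restore it into a scratch DB.
day=$(date -u +%Y/%m/%d)
rclone copy "spaces-prod:shithub-backups/daily/${day}/" /tmp/restore/
createdb shithub_restore_test
pg_restore --no-owner --dbname=shithub_restore_test /tmp/restore/*.dump
```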

## Rollback

The deploy is a single binary + an env file + a systemd unit; rolling back is "redeploy the previous binary." Two paths:

- **Tag rollback (preferred)** — `git checkout v<previous>` and `make deploy` (full commands sketched after this list). Migrations are forward-only by design; if the new release added a migration, either accept that the rollback leaves the schema ahead, or apply the migration's matching `down` first (`shithubd migrate down`). Check `runbooks/rollback.md` before touching migrations.
- **Hotfix on a branch** — branch from the rolled-back tag, fix, cut a new release. Don't force-push tags.
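End to end, the preferred path is just the following (tag name hypothetical):

```sh
git checkout v1.4.1   # the previous release tag
ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app make deploy
# Only if the bad release shipped a migration you must revert, and only
# after reading runbooks/rollback.md:
#   shithubd migrate down
```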

## What goes where

| Concern                   | File                             |
|---------------------------|----------------------------------|
| Provisioning entrypoint   | `deploy/ansible/site.yml`        |
| Per-environment vars      | `deploy/ansible/inventory/<env>` |
| App systemd units         | `deploy/systemd/`                |
| Edge config               | `deploy/Caddyfile.j2`            |
| sshd (incl. AKC for git)  | `deploy/sshd_config.j2`          |
| Postgres scripts          | `deploy/postgres/`               |
| Spaces lifecycle + DR     | `deploy/spaces/`                 |
| WireGuard mesh            | `deploy/wireguard/wg0.conf.j2`   |
| Monitoring configs        | `deploy/monitoring/`             |
| Restore drill             | `deploy/restore-drill/`          |
| Operator runbooks         | `docs/internal/runbooks/`        |