# Deployment

This is the operator's guide to taking a fresh box from "Ubuntu 24.04
with sshd" to "running shithubd in production." It is opinionated:
DigitalOcean for compute, DigitalOcean Spaces for object storage,
Postgres on a dedicated droplet, Caddy as the edge, WireGuard for the
monitoring mesh. If you're running on something else, the Ansible
roles are the source of truth — read them.

## Topology

```
                 +----------------------------+
 public --->     | Caddy (TLS, rate limits)   |  :443
                 +----------------------------+
                        | 127.0.0.1:8080
                 +----------------------------+
                 | shithubd web    (systemd)  |
                 | shithubd worker (systemd)  |
                 | shithubd cron   (timer)    |
                 +----------------------------+
                     |                |
          +-----------+        +-----------------+
          | Postgres  |        | Spaces (S3)     |
          +-----------+        | - WAL archive   |
                               | - daily dumps   |
                               | - LFS / blobs   |
                               +-----------------+
               \                        /
                \   WireGuard mesh (10.50.0.0/24)
                 \________    ________/
                           |
                 +----------------+
                 |   Monitoring   |
                 |  Prom/Loki/AM  |
                 |    Grafana     |
                 +----------------+
```

The monitoring host is *not* on the public internet. App processes
listen on `127.0.0.1` and on the wg0 mesh interface only; the metrics
port is not reachable from anywhere outside the mesh.
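
As a rough illustration of that rule, the firewall posture can be
expressed with `ufw` along these lines. This is a sketch, not the
base role itself: port 9100 is the node-exporter default, and the
exact rule set is an assumption; the real rules live in the Ansible
role.

```sh
# default-deny, then open only what the edge, ssh, and the mesh need
ufw default deny incoming
ufw allow 22/tcp                       # operator ssh
ufw allow 443/tcp                      # Caddy, public HTTPS
# exporters are reachable only over the WireGuard mesh
ufw allow in on wg0 from 10.50.0.0/24 to any port 9100 proto tcp
```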

## One-time bootstrap

1. **Provision the droplets.** Three for staging is enough (web,
   db, monitoring). Production starts at five (2× web, db, backup,
   monitoring) and grows the web tier first.
2. **Get sshd public-key login working** for the operator user. The
   Ansible base role narrows it down from there.
3. **Populate `deploy/ansible/inventory/<env>`** by copying
   `inventory/staging.example`. The variables marked with `# REQUIRED`
   come from the operator's secret store (Bitwarden, 1Password, etc.).
   Do **not** commit a real inventory file.
4. **Bootstrap the admin user.** After the first deploy, ssh to the web
   host and run `shithubd admin bootstrap-admin --email you@…`. That
   gives you the site-admin bit; subsequent admin grants happen
   through `/admin/users/{id}`. A sketch of steps 3 and 4 follows
   this list.
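
A minimal pass over steps 3 and 4 might look like the following.
The inventory name, web host, operator user, and email address are
placeholders; the real values come from your secret store and your
environment.

```sh
# step 3: start from the example inventory (never commit the real one)
cp -r deploy/ansible/inventory/staging.example deploy/ansible/inventory/staging
"$EDITOR" deploy/ansible/inventory/staging   # fill in every "# REQUIRED" variable

# step 4: after the first deploy, grant yourself the site-admin bit
ssh operator@web-01 shithubd admin bootstrap-admin --email you@example.org
```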

## Deploying

The Makefile wraps the playbook so the human commands are short:

```sh
# dry-run against staging (default inventory)
make deploy-check

# apply against staging
make deploy

# apply against production
ANSIBLE_INVENTORY=production make deploy

# only the app, not the edge or db
ANSIBLE_TAGS=app make deploy

# only one host (e.g. canary)
ANSIBLE_LIMIT=web-02 make deploy
```

The playbook is idempotent — run it twice in a row and the second
run should report `ok=N changed=0`. If the second run reports any
changes, that's a config drift bug; investigate before continuing.
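
One way to make that drift check mechanical (the log path and grep
pattern below are illustrative, not something the Makefile provides):

```sh
# a second consecutive run should be a no-op; any "changed" task is drift
make deploy | tee /tmp/deploy-second-run.log
grep -E 'changed=[1-9]' /tmp/deploy-second-run.log \
  && echo "config drift detected, investigate before continuing"
```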

## What the playbook does

In rough order (the tags can be combined at deploy time; an example
follows the list):

- **base** — apt baseline, ufw default-deny, fail2ban, system users
  (`shithub`, `shithub-ssh`), data root at `/data`.
- **postgres** (`tags: [db]`) — installs PG16, runs initdb on
  `/data/pgdata`, applies our `postgresql.conf`/`pg_hba.conf`, wires up
  the WAL archive command, creates the `shithub` and `shithub_hook`
  roles with exact-grant permissions.
- **shithubd** (`tags: [app]`) — copies the binary into
  `/usr/local/bin`, drops env files into `/etc/shithub/`, installs
  the three systemd units, restarts on change. The `web.service`
  ExecStartPre runs `shithubd migrate up`, so a deploy with new
  migrations is one command.
- **caddy** (`tags: [edge]`) — installs Caddy plus the templated
  `Caddyfile`. Auto-TLS uses the Let's Encrypt staging CA until the
  operator flips a vars flag to switch it to production.
- **wireguard** (`tags: [net]`) — peers each host into the mesh.
- **backup** (`tags: [backup]`) — installs the daily backup timer
  on the db host and the cross-region sync timer on the backup host.
- **monitoring-client** (`tags: [monitoring]`) — node-exporter +
  promtail on every host, pointing at the monitoring host.
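
Because each role maps to a tag, partial runs compose with the
Makefile variables from the Deploying section. The tag names are the
ones listed above; the db host name is a placeholder.

```sh
# re-apply only the database and backup roles on the db host
ANSIBLE_TAGS=db,backup ANSIBLE_LIMIT=db-01 make deploy

# roll the app out to the canary first, then the rest of the web tier
ANSIBLE_TAGS=app ANSIBLE_LIMIT=web-02 make deploy
ANSIBLE_TAGS=app make deploy
```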

The monitoring host itself is provisioned by a separate Ansible play
that lives outside this repo (it depends on operator-specific TLS
material). The configs in `deploy/monitoring/` are the source of
truth for *what* runs there.

## Backups

Two layers, both mandatory:

1. **WAL archiving** (`deploy/postgres/archive_command.sh`) ships
   every WAL segment to `spaces-prod:shithub-wal` in real time.
   Postgres won't recycle a segment until the script reports success,
   so a failing archiver fills the disk — alert on
   `pg_stat_archiver.failed_count > 0`. A manual check is sketched
   after this list.
2. **Daily logical dumps** (`deploy/postgres/backup-daily.sh`) take a
   `pg_dump --format=custom` once per day and ship it to
   `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`, keeping the last
   7 dumps locally for fast recovery.
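
The archiver check from item 1 can also be run by hand. A minimal
probe, assuming `psql` access as the `postgres` superuser on the db
host:

```sh
# non-zero failed_count means WAL segments are piling up on the db host
psql -U postgres -c \
  "SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;"
```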

Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
both buckets to a second region for DR. Lifecycle rules in
`deploy/spaces/lifecycle.json` prune WAL after 30 days and dumps
after 90.
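
A quick way to spot-check that dumps are landing uses the daily
`YYYY/MM/DD` layout described above; whether the `spaces-prod:`
remote used by the scripts is an rclone remote is an assumption:

```sh
# list today's dump(s); an empty listing is worth investigating
rclone ls "spaces-prod:shithub-backups/daily/$(date +%Y/%m/%d)/"
```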

The recovery target is **PITR within 30 days, full restore within
1 hour**. We verify this every quarter with the restore drill —
see `runbooks/restore.md`.
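
The full-restore half of that target reduces to fetching the newest
dump and feeding it to `pg_restore`. A rough sketch only: the local
dump path and scratch database name are assumptions, and
`runbooks/restore.md` is the authoritative procedure.

```sh
# restore the most recent local daily dump into a scratch database
createdb -U postgres shithub_restore_test
pg_restore -U postgres -d shithub_restore_test --no-owner \
  /data/backups/daily/latest.dump
```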

## Rollback

The deploy is a single binary + an env file + a systemd unit; rolling
back is "redeploy the previous binary." Two paths:

- **Tag rollback (preferred)** — `git checkout v<previous>` and
  `make deploy`. Migrations are forward-only by design; if the new
  release added a migration, either accept that the rollback leaves
  the schema ahead, or apply the migration's matching `down` first
  (`shithubd migrate down`). Check `runbooks/rollback.md` before
  touching migrations. A minimal sketch follows this list.
- **Hotfix on a branch** — branch from the rolled-back tag, fix,
  cut a new release. Don't force-push tags.
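
A tag rollback of just the app tier, assuming the bad release shipped
no new migrations (the tag name here is illustrative):

```sh
# redeploy the previous known-good release to the web tier only
git checkout v0.8.2
ANSIBLE_TAGS=app make deploy
git switch -          # return to the branch you were on
```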

## What goes where

| Concern                                    | File                             |
|--------------------------------------------|----------------------------------|
| Provisioning entrypoint                    | `deploy/ansible/site.yml`        |
| Per-environment vars                       | `deploy/ansible/inventory/<env>` |
| App systemd units                          | `deploy/systemd/`                |
| Edge config                                | `deploy/Caddyfile.j2`            |
| sshd (incl. AuthorizedKeysCommand for git) | `deploy/sshd_config.j2`          |
| Postgres scripts                           | `deploy/postgres/`               |
| Spaces lifecycle + DR                      | `deploy/spaces/`                 |
| WireGuard mesh                             | `deploy/wireguard/wg0.conf.j2`   |
| Monitoring configs                         | `deploy/monitoring/`             |
| Restore drill                              | `deploy/restore-drill/`          |
| Operator runbooks                          | `docs/internal/runbooks/`        |