# Deployment
This is the operator's guide to taking a fresh box from "Ubuntu 24.04 with sshd" to "running shithubd in production." It is opinionated: DigitalOcean for compute, DigitalOcean Spaces for object storage, Postgres on a dedicated droplet, Caddy as the edge, WireGuard for the monitoring mesh. If you're running on something else, the Ansible roles are the source of truth — read them.
## Topology

```
            +----------------------------+
public ---> | Caddy (TLS, rate limits)   | :443
            +----------------------------+
                   | 127.0.0.1:8080
            +----------------------------+
            | shithubd web (systemd)     |
            | shithubd worker (systemd)  |
            | shithubd cron (timer)      |
            +----------------------------+
                   |
   +-----------+        +-----------------+
   | Postgres  |        | Spaces (S3)     |
   +-----------+        | - WAL archive   |
                        | - daily dumps   |
                        | - LFS / blobs   |
                        +-----------------+
        \                    /
         \   WireGuard mesh (10.50.0.0/24)
          \________    ________/
                   |
          +----------------+
          |  Monitoring    |
          |  Prom/Loki/AM  |
          |  Grafana       |
          +----------------+
```
The monitoring host is *not* on the public internet. App processes
listen on `127.0.0.1` and on the wg0 mesh interface only; the metrics
port is not reachable from outside the mesh.
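A quick way to sanity-check that exposure on a web host (a sketch; the
firewall rules in the base and wireguard roles are authoritative, and the
grep pattern assumes the process is named `shithubd`):

```sh
# Confirm shithubd binds only loopback and the wg0 address (10.50.0.x).
sudo ss -ltnp | grep shithubd

# Confirm the firewall is default-deny with only the public ports
# (443 for Caddy, ssh, WireGuard) allowed in.
sudo ufw status verbose
```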
## One-time bootstrap
1. **Provision the droplets.** Three for staging is enough (web, db,
   monitoring). Production starts at five (2× web, db, backup,
   monitoring) and grows the web tier first.
2. **Get sshd public-key login working** for the operator user. The
   Ansible base role narrows it from there.
3. **Populate `deploy/ansible/inventory/<env>`** by copying
   `inventory/staging.example`. The variables marked with `# REQUIRED`
   come from the operator's secret store (Bitwarden, 1Password, etc.).
   Do **not** commit a real inventory file.
4. **Bootstrap-admin user.** After the first deploy, ssh to the web
   host and run `shithubd admin bootstrap-admin --email you@…` (see
   the sketch after this list). That gives you the site-admin bit;
   subsequent admin grants happen through `/admin/users/{id}`.
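A minimal walk-through of steps 3 and 4. The example inventory path and the
`bootstrap-admin` subcommand come from this repo; the hostname and
`$ADMIN_EMAIL` are placeholders:

```sh
# Step 3: create a real inventory from the example and fill in the
# "# REQUIRED" values from your secret store; never commit the result.
cp -r deploy/ansible/inventory/staging.example deploy/ansible/inventory/staging
"$EDITOR" deploy/ansible/inventory/staging

# Step 4: after the first deploy, grant yourself the site-admin bit.
ssh operator@web-01 "shithubd admin bootstrap-admin --email $ADMIN_EMAIL"
```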
## Deploying
The Makefile wraps the playbook so the human commands are short:
```sh
# dry-run against staging (default inventory)
make deploy-check

# apply against staging
make deploy

# apply against production
ANSIBLE_INVENTORY=production make deploy

# only the app, not the edge or db
ANSIBLE_TAGS=app make deploy

# only one host (e.g. canary)
ANSIBLE_LIMIT=web-02 make deploy
```
The playbook is idempotent — run it twice in a row and the second
run should report `ok=N changed=0`. If the second run reports any
changes, that's a config drift bug; investigate before continuing.
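Roughly what those targets expand to (a sketch, not a copy of the Makefile;
`deploy/ansible/site.yml` and the inventory layout come from this repo, the
flags are standard `ansible-playbook` options):

```sh
# make deploy-check: dry run with a diff of what would change
ansible-playbook -i deploy/ansible/inventory/staging deploy/ansible/site.yml --check --diff

# ANSIBLE_INVENTORY / ANSIBLE_TAGS / ANSIBLE_LIMIT map onto the usual flags
ansible-playbook -i deploy/ansible/inventory/production deploy/ansible/site.yml \
  --tags app --limit web-02
```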
## What the playbook does
In rough order:
- **base** — apt baseline, ufw default-deny, fail2ban, system users
  (`shithub`, `shithub-ssh`), data root at `/data`.
- **postgres** (`tags: [db]`) — installs PG16, initdb on `/data/pgdata`,
  applies our `postgresql.conf`/`pg_hba.conf`, wires the WAL archive
  command, creates the `shithub` and `shithub_hook` roles with
  exact-grant permissions.
- **shithubd** (`tags: [app]`) — copies the binary into
  `/usr/local/bin`, drops env files into `/etc/shithub/`, installs
  the three systemd units, restarts on change. The `web.service`
  ExecStartPre runs `shithubd migrate up` so a deploy with new
  migrations is one command (see the spot check after this list).
- **caddy** (`tags: [edge]`) — installs Caddy + the templated
  `Caddyfile`. Auto-TLS via Let's Encrypt staging until the operator
  flips a vars flag; production after that.
- **wireguard** (`tags: [net]`) — peers each host into the mesh.
- **backup** (`tags: [backup]`) — installs the daily backup timer
  on the db host and the cross-region sync timer on the backup host.
- **monitoring-client** (`tags: [monitoring]`) — node-exporter +
  promtail on every host pointing at the monitoring host.
The monitoring host itself is provisioned by a separate Ansible play
that lives outside this repo (it depends on operator-specific TLS
material). The configs in `deploy/monitoring/` are the source of
truth for *what* runs there.
## Backups
Two layers, both mandatory:
1. **WAL archiving** (`deploy/postgres/archive_command.sh`) ships
   every WAL segment to `spaces-prod:shithub-wal` in real time.
   Postgres won't recycle a segment until the script reports success,
   so a failing archiver fills the disk — alert on
   `pg_stat_archiver.failed_count > 0` (see the check after this list).
2. **Daily logical** (`deploy/postgres/backup-daily.sh`) takes a
   `pg_dump --format=custom` once per day and ships it to
   `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. Keeps the last
   7 locally for fast recovery.
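To check the archiver by hand on the db host (the view and its columns are
standard Postgres; running it as the `postgres` OS user is an assumption):

```sh
# failed_count should be 0 and last_archived_time should be recent.
sudo -u postgres psql -c \
  "SELECT archived_count, last_archived_time, failed_count, last_failed_wal
     FROM pg_stat_archiver;"
```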
Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
both buckets to a second region for DR. Lifecycle in
`deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps
after 90. Actions log/artifact objects use the primary object bucket's
`actions/runs/` prefix; apply `deploy/spaces/actions-lifecycle.json`
with `deploy/cutover/apply-actions-lifecycle.sh` so provider-side blob
retention matches the `workflow:cleanup` database sweep.
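One way to confirm the lifecycle rules actually landed on the provider side
(a sketch; assumes the AWS CLI is configured for Spaces, and the region in
the endpoint URL is illustrative):

```sh
aws s3api get-bucket-lifecycle-configuration \
  --endpoint-url https://nyc3.digitaloceanspaces.com \
  --bucket shithub-backups
```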
The recovery target is **PITR within 30 days, full restore within
1 hour**. We verify this every quarter with the restore drill —
see `runbooks/restore.md`.
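A cheap between-drill sanity check is to confirm the latest dump is
fetchable and readable (a sketch; it assumes `spaces-prod` is an rclone
remote and that there is a single custom-format dump per day's prefix; the
full drill lives in `runbooks/restore.md`):

```sh
# Pull today's dump prefix into a scratch directory.
rclone copy "spaces-prod:shithub-backups/daily/$(date +%Y/%m/%d)/" /tmp/restore-check/

# --list only reads the TOC, which proves the custom-format dump is intact
# without loading any data.
pg_restore --list /tmp/restore-check/* >/dev/null && echo "dump readable"
```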
## Rollback
The deploy is a single binary + an env file + a systemd unit; rolling back is "redeploy the previous binary." Two paths:
- **Tag rollback (preferred)** — `git checkout v<previous>` and
  `make deploy` (see the sketch after this list). Migrations are
  forward-only by design; if the new release added a migration, you
  need to either accept that the rollback leaves the schema ahead, or
  apply the migration's matching `down` first (`shithubd migrate down`).
  Check `runbooks/rollback.md` before touching migrations.
- **Hotfix on a branch** — branch from the rolled-back tag, fix,
  cut a new release. Don't force-push tags.
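The tag-rollback path end to end, with a hypothetical tag name
(`shithubd migrate down` is only for the schema case described above; read
`runbooks/rollback.md` first):

```sh
# Check out the previous release tag (tag name here is illustrative).
git checkout v1.4.2

# Redeploy just the app; edge and db roles are untouched.
ANSIBLE_TAGS=app make deploy

# Only if you decided the schema must also go back (read the runbook first):
# shithubd migrate down
```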
## What goes where
| Concern                   | File                             |
|---------------------------|----------------------------------|
| Provisioning entrypoint   | `deploy/ansible/site.yml`        |
| Per-environment vars      | `deploy/ansible/inventory/<env>` |
| App systemd units         | `deploy/systemd/`                |
| Edge config               | `deploy/Caddyfile.j2`            |
| sshd (incl. AKC for git)  | `deploy/sshd_config.j2`          |
| Postgres scripts          | `deploy/postgres/`               |
| Spaces lifecycle + DR     | `deploy/spaces/`                 |
| WireGuard mesh            | `deploy/wireguard/wg0.conf.j2`   |
| Monitoring configs        | `deploy/monitoring/`             |
| Restore drill             | `deploy/restore-drill/`          |
| Operator runbooks         | `docs/internal/runbooks/`        |