
# Deployment

This is the operator's guide to taking a fresh box from "Ubuntu 24.04 with sshd" to "running shithubd in production." It is opinionated: DigitalOcean for compute, DigitalOcean Spaces for object storage, Postgres on a dedicated droplet, Caddy as the edge, WireGuard for the monitoring mesh. If you're running on something else, the Ansible roles are the source of truth — read them.

## Topology

```
                +----------------------------+
   public --->  |  Caddy (TLS, rate limits)  |  :443
                +----------------------------+
                       |  127.0.0.1:8080
                +----------------------------+
                |  shithubd web (systemd)    |
                |  shithubd worker (systemd) |
                |  shithubd cron  (timer)    |
                +----------------------------+
                       |
                +-----------+         +-----------------+
                | Postgres  |         | Spaces (S3)     |
                +-----------+         | - WAL archive   |
                                      | - daily dumps   |
                                      | - LFS / blobs   |
                                      +-----------------+
                       \                    /
                        \      WireGuard mesh (10.50.0.0/24)
                         \________ ________/
                                  |
                          +----------------+
                          |  Monitoring    |
                          |  Prom/Loki/AM  |
                          |  Grafana       |
                          +----------------+
```

The monitoring host is *not* on the public internet. App processes listen on `127.0.0.1` and on the wg0 mesh interface only; the metrics port is not reachable from outside the mesh.
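A quick way to verify that after a deploy — a sketch, using the process name and mesh subnet from this doc:

```sh
# List listening sockets owned by shithubd. Every local address should be
# 127.0.0.1 or an address in 10.50.0.0/24 (the wg0 mesh); anything bound
# to 0.0.0.0, [::], or a public IP is a config bug.
sudo ss -tlnp | grep shithubd
```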

## One-time bootstrap

1. **Provision the droplets.** Three for staging is enough (web, db, monitoring). Production starts at five (2× web, db, backup, monitoring) and grows the web tier first.
2. **Get sshd public-key login working** for the operator user. The Ansible base role narrows it from there.
3. **Populate `deploy/ansible/inventory/<env>`** by copying `inventory/staging.example`. The variables marked with `# REQUIRED` come from the operator's secret store (Bitwarden, 1Password, etc.). Do **not** commit a real inventory file. A sketch follows this list.
4. **Bootstrap the admin user.** After the first deploy, ssh to the web host and run `shithubd admin bootstrap-admin --email you@…`. That gives you the site-admin bit; subsequent admin grants happen through `/admin/users/{id}`.
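A minimal sketch of step 3 for a staging environment (`cp -r` covers the example being either a file or a directory):

```sh
cd deploy/ansible
cp -r inventory/staging.example inventory/staging
# Everything that must come from the secret store is marked in-line:
grep -Rn '# REQUIRED' inventory/staging
```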

## Deploying

The Makefile wraps the playbook so the human commands are short:

```sh
# dry-run against staging (default inventory)
make deploy-check

# apply against staging
make deploy

# apply against production
ANSIBLE_INVENTORY=production make deploy

# only the app, not the edge or db
ANSIBLE_TAGS=app make deploy

# only one host (e.g. canary)
ANSIBLE_LIMIT=web-02 make deploy
```

The playbook is idempotent — run it twice in a row and the second run should report `ok=N changed=0`. If the second run reports any changes, that's a config drift bug; investigate before continuing.
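A hedged way to script that check (the grep pattern assumes Ansible's standard `PLAY RECAP` output):

```sh
# Deploy twice; the second pass must be a no-op.
make deploy
make deploy | tee /tmp/deploy-second.log
if grep -E 'changed=[1-9]' /tmp/deploy-second.log; then
  echo "config drift detected: investigate before continuing" >&2
  exit 1
fi
```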

## What the playbook does

In rough order:

- **base** — apt baseline, ufw default-deny, fail2ban, system users (`shithub`, `shithub-ssh`), data root at `/data`.
- **postgres** (`tags: [db]`) — installs PG16, runs initdb on `/data/pgdata`, applies our `postgresql.conf`/`pg_hba.conf`, wires the WAL archive command, creates the `shithub` and `shithub_hook` roles with exact-grant permissions.
- **shithubd** (`tags: [app]`) — copies the binary into `/usr/local/bin`, drops env files into `/etc/shithub/`, installs the three systemd units, restarts on change. The `web.service` ExecStartPre runs `shithubd migrate up`, so a deploy with new migrations is one command. (A post-deploy check is sketched after this list.)
- **caddy** (`tags: [edge]`) — installs Caddy plus the templated `Caddyfile`. Auto-TLS runs against the Let's Encrypt staging CA until the operator flips a vars flag; production issuance after that.
- **wireguard** (`tags: [net]`) — peers each host into the mesh.
- **backup** (`tags: [backup]`) — installs the daily backup timer on the db host and the cross-region sync timer on the backup host.
- **monitoring-client** (`tags: [monitoring]`) — node-exporter + promtail on every host, pointing at the monitoring host.
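A post-deploy sanity pass might look like the following sketch. The unit names are hypothetical; this doc only says the app role installs three units and that `web.service` runs migrations in ExecStartPre.

```sh
# Confirm the three app units came back after the restart-on-change handler
# (unit names hypothetical):
systemctl list-units 'shithubd-*' --no-pager

# Confirm the ExecStartPre migration step ran cleanly on the web unit:
journalctl -u shithubd-web.service -n 100 --no-pager | grep -i migrate
```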

The monitoring host itself is provisioned by a separate Ansible play that lives outside this repo (it depends on operator-specific TLS material). The configs in `deploy/monitoring/` are the source of truth for *what* runs there.

## Backups

Two layers, both mandatory:

1. **WAL archiving** (`deploy/postgres/archive_command.sh`) ships every WAL segment to `spaces-prod:shithub-wal` in real time. Postgres won't recycle a segment until the script reports success, so a failing archiver fills the disk — alert on `pg_stat_archiver.failed_count > 0`. A minimal sketch of the archive script follows this list.
2. **Daily logical dumps** (`deploy/postgres/backup-daily.sh`) takes a `pg_dump --format=custom` once per day and ships it to `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. The last seven dumps are kept locally for fast recovery.
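For orientation, a minimal sketch of what an archive command like `archive_command.sh` has to do. The `spaces-prod:` remote syntax in this doc looks like an rclone remote, but that is an assumption; the real script is the source of truth.

```sh
#!/bin/sh
# Called by Postgres as: archive_command = 'archive_command.sh %p %f'
# %p = path to the WAL segment (relative to PGDATA), %f = file name only.
set -eu
wal_path="$1"
wal_name="$2"
# Must exit non-zero on any failure: Postgres then keeps the segment and
# retries, and pg_stat_archiver.failed_count increments (which we alert on).
exec rclone copyto "$wal_path" "spaces-prod:shithub-wal/$wal_name"
```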

Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors both buckets to a second region for DR. Lifecycle rules in `deploy/spaces/lifecycle.json` prune WAL after 30 days and dumps after 90. Actions log/artifact objects use the primary object bucket's `actions/runs/` prefix; apply `deploy/spaces/actions-lifecycle.json` with `deploy/cutover/apply-actions-lifecycle.sh` so provider-side blob retention matches the `workflow:cleanup` database sweep.

The recovery target is **PITR within 30 days, full restore within 1 hour**. We verify this every quarter with the restore drill — see `runbooks/restore.md`.
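As a rough shape of the logical-restore half of the drill (bucket layout per this doc; the remote name, dump file name, and scratch database are assumptions — the real procedure lives in `runbooks/restore.md` and `deploy/restore-drill/`):

```sh
# Pull today's dump from the daily prefix and restore it into a scratch DB.
day=$(date -u +%Y/%m/%d)
rclone copy "spaces-prod:shithub-backups/daily/${day}/" /tmp/restore/
createdb shithub_restore_test
pg_restore --no-owner --dbname=shithub_restore_test /tmp/restore/*.dump
```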

## Rollback

The deploy is a single binary + an env file + a systemd unit; rolling back is "redeploy the previous binary." Two paths:

- **Tag rollback (preferred)** — `git checkout v<previous>` and `make deploy` (full commands sketched after this list). Migrations are forward-only by design; if the new release added a migration, either accept that the rollback leaves the schema ahead, or apply the migration's matching `down` first (`shithubd migrate down`). Check `runbooks/rollback.md` before touching migrations.
- **Hotfix on a branch** — branch from the rolled-back tag, fix, cut a new release. Don't force-push tags.
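End to end, the preferred path is just the following (tag name hypothetical):

```sh
git checkout v1.4.1   # the previous release tag
ANSIBLE_INVENTORY=production ANSIBLE_TAGS=app make deploy
# Only if the bad release shipped a migration you must revert, and only
# after reading runbooks/rollback.md:
#   shithubd migrate down
```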

## What goes where

| Concern                   | File                             |
|---------------------------|----------------------------------|
| Provisioning entrypoint   | `deploy/ansible/site.yml`        |
| Per-environment vars      | `deploy/ansible/inventory/<env>` |
| App systemd units         | `deploy/systemd/`                |
| Edge config               | `deploy/Caddyfile.j2`            |
| sshd (incl. AKC for git)  | `deploy/sshd_config.j2`          |
| Postgres scripts          | `deploy/postgres/`               |
| Spaces lifecycle + DR     | `deploy/spaces/`                 |
| WireGuard mesh            | `deploy/wireguard/wg0.conf.j2`   |
| Monitoring configs        | `deploy/monitoring/`             |
| Restore drill             | `deploy/restore-drill/`          |
| Operator runbooks         | `docs/internal/runbooks/`        |