tenseleyflow/shithub / f1b5339

S38: docs — self-host (prereqs/deploy/configuration/backup-restore/upgrades/troubleshooting/capacity)

Authored by espadonne
SHA: f1b533933e5384f08f9dfd6830629022a65cbf56
Parents: 1938837
Tree: 19cecb6

7 changed files

| Status | File | + | - |
|---|---|---:|---:|
| A | docs/public/self-host/backup-restore.md | 115 | 0 |
| A | docs/public/self-host/capacity.md | 87 | 0 |
| A | docs/public/self-host/configuration.md | 126 | 0 |
| A | docs/public/self-host/deploy.md | 153 | 0 |
| A | docs/public/self-host/prerequisites.md | 98 | 0 |
| A | docs/public/self-host/troubleshooting.md | 153 | 0 |
| A | docs/public/self-host/upgrades.md | 82 | 0 |
docs/public/self-host/backup-restore.md (added)
@@ -0,0 +1,115 @@
+# Backup & restore
+
+Two layers, both mandatory:
+
+- **Continuous WAL archive** — `deploy/postgres/archive_command.sh`
+  ships every WAL segment to `spaces-prod:shithub-wal` in real
+  time (see the sketch below). Recovery point objective (RPO) ≈
+  one WAL segment (~16 MB or one `archive_timeout`, whichever
+  comes first).
+- **Daily logical dump** — `deploy/postgres/backup-daily.sh`
+  takes a `pg_dump --format=custom` once per day and ships it to
+  `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. The db host
+  keeps the seven most recent dumps locally.
+
+Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
+both buckets to a second region for DR. Lifecycle rules in
+`deploy/spaces/lifecycle.json` prune WAL after 30 days and dumps
+after 90.
+
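+For orientation, a minimal sketch of the shape of the archive
+script — the real `deploy/postgres/archive_command.sh` is the
+source of truth, and the exact flags here are illustrative:
+
+```sh
+#!/bin/sh
+# Postgres calls this as: archive_command = '.../archive_command.sh %p %f'
+# $1 (%p) = path to the WAL segment; $2 (%f) = its file name.
+set -eu
+exec rclone --config /root/.config/rclone/rclone.conf \
+     copyto "$1" "spaces-prod:shithub-wal/$2"
+# Postgres recycles the segment only if this exits 0.
+```
+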
+## Verifying that backups are healthy
+
+The monitoring stack does this for you:
+
+- `BackupOverdue` alert fires if no successful backup in 30h.
+- `pg_stat_archiver.failed_count > 0` fires the archive-failure
+  alert, which pages.
+
+By hand:
+
+```sh
+ssh db
+sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
+     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
+```
+
+## Restore drills
+
+Run **quarterly**. The drill restores the latest dump into a
+temporary Postgres instance, runs smoke queries, and tears it
+down. Production is untouched.
+
+```sh
+ssh backup-host
+sudo /usr/local/bin/shithub-restore-drill
+# uses the latest dump in spaces-prod:shithub-backups/daily/
+```
+
+To drill an older dump:
+
+```sh
+sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
+```
+
+`--keep` preserves the temp pgdata for inspection; the default
+is to clean up.
+
+A failed drill is a P0: it means our backups can't actually be
+restored. File an incident immediately.
+
+## Real restore — full DB lost
+
+Use this when:
+
+- The DB host is destroyed.
+- Postgres can't start and there's no working WAL.
+- Losing up to 24 hours of data is acceptable.
+
+1. Spin up a new db host
+   (`make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=db-new
+   ANSIBLE_TAGS=db`).
+2. Pull the most recent daily dump:
+   ```sh
+   rclone copyto \
+     spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump \
+     /tmp/restore.dump
+   ```
+3. Restore:
+   ```sh
+   pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump
+   ```
+4. Confirm schema: `shithubd migrate status`.
+5. Bring the app up against the new DB (update `web.env` and
+   `worker.env`).
+6. **Notify users** — there will be a visible activity gap.
+
+## Real restore — point-in-time
+
+Use this when data is intact in WAL but a destructive change
+happened at a known time (`DROP TABLE`, mass UPDATE, runaway
+worker).
+
+This is harder. The procedure:
+
+1. Stop the app (`systemctl stop shithubd-web shithubd-worker
+   shithubd-cron.timer`). Do not let writes continue.
+2. Restore the most recent base backup from before the incident
+   into a fresh data directory.
+3. Configure recovery (`recovery.signal` + GUCs in PG16) pointing
+   at the WAL archive in Spaces with
+   `recovery_target_time = '<UTC timestamp just before incident>'`
+   (see the sketch after this list).
+4. Start Postgres in recovery mode; let it replay WAL.
+5. When recovery completes, promote.
+6. Restart the app.
+
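+A minimal sketch of step 3, assuming rclone as the restore
+transport and `/data/pgdata` as the fresh data directory; the
+timestamp is illustrative:
+
+```sh
+cat >> /data/pgdata/postgresql.auto.conf <<'EOF'
+restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
+recovery_target_time = '2025-06-01 12:34:00+00'
+recovery_target_action = 'pause'
+EOF
+touch /data/pgdata/recovery.signal
+# 'pause' holds at the target so you can inspect before promoting.
+```
+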
+If you have not done a PITR before, **do not** free-style this.
+Mistakes here destroy the data you were trying to save. Find
+someone who has done it and pair with them.
+
+## After any real restore
+
+- Run the restore-drill smoke queries against the restored DB.
+- Reconcile the audit log: entries may reference IDs created
+  after the recovery point that no longer exist.
+- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
+  if the dump was compromised, the secrets in it may be too.
+- Force-rotate session epochs for any user whose session crossed
+  the recovery window.
docs/public/self-host/capacity.md (added)
@@ -0,0 +1,87 @@
+# Capacity planning
+
+Rule-of-thumb numbers from the S36 bench harness on the reference
+hardware (2-vCPU droplets, Postgres 16 on a dedicated host).
+Treat these as starting points; your traffic patterns will
+differ.
+
+## Single web host (2 vCPU / 4 GB)
+
+| Workload                              | Sustainable rate     |
+|---------------------------------------|----------------------|
+| Anonymous repo home page              | ~600 req/s, p95 < 80 ms |
+| Authenticated dashboard render        | ~250 req/s, p95 < 200 ms |
+| Diff render (PR with ~30 files)       | ~40 req/s, p95 < 600 ms |
+| Issue list + filters                  | ~200 req/s, p95 < 250 ms |
+| Code search (small corpus)            | ~30 req/s, p95 < 800 ms |
+
+The first ceiling you hit is **CPU on diff/highlight rendering**.
+A second web host doubles the budget linearly; the DB is rarely
+the bottleneck for read-heavy traffic on this hardware.
+
+## Postgres host (2 vCPU / 8 GB / 100 GB SSD)
+
+| Workload         | Sustainable rate                     |
+|------------------|--------------------------------------|
+| Read queries     | ~5,000 calls/sec sustained           |
+| Write queries    | ~500 calls/sec sustained             |
+| Connections      | 100 concurrent (`pgxpool max=10` × 10 web procs) |
+
+If the read rate goes much beyond ~5k/s, suspect an N+1 — see
+the troubleshooting page.
+
+## Worker host (2 vCPU / 4 GB)
+
+The worker is bursty: idle most of the time, then chews through
+the queue when something happens. A single worker handles:
+
+- ~150 webhook deliveries/sec (most of the time is spent in the
+  outbound HTTP client).
+- ~40 fan-out events/sec (notifications + activity recompute).
+
+A second worker scales nearly linearly thanks to `FOR UPDATE
+SKIP LOCKED` (sketched below).
+
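+For illustration, the claim pattern looks roughly like this —
+the real query lives in the worker, and the `id` column name is
+assumed:
+
+```sh
+psql -d shithub -c "
+  SELECT id FROM jobs
+  WHERE state = 'queued'
+  ORDER BY id
+  LIMIT 1
+  FOR UPDATE SKIP LOCKED;"
+# Each worker locks different rows, so adding a host adds throughput.
+```
+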
+## Object store
+
+Sized by your push volume, not request rate. Spaces handles the
+PUT rate of every WAL segment + nightly dump comfortably; we've
+never seen the bucket be the bottleneck.
+
+Daily cost order: **WAL > daily dumps > LFS/attachments**. Most
+WAL cost is reclaimed cheaply by the lifecycle rules (30-day
+retention). If your WAL costs run away, your `archive_timeout`
+is probably set too low.
+
+## Repo storage on disk
+
+Plan for ~3× your raw data size on the bare-repo filesystem:
+
+- 1× the actual git data
+- 1× for the `git gc` working set on large repos
+- 1× headroom
+
+Repos that hit ~5 GiB get noticeably slow on clone; consider
+splitting or archiving.
+
+## When to scale up
+
+| Trigger | Action |
+|---|---|
+| p95 latency on top routes climbing | Add a second web host first. |
+| DB call rate consistently > 5k/s | Look for an N+1 *before* upgrading the DB. |
+| Job queue depth > 1k for hours | Add a second worker host. |
+| Disk on db host > 70% | Plan growth (usage doesn't shrink). Grow the disk before considering longer WAL retention. |
+| Disk on web host > 70% | Audit the largest repos; clean up archived repos if possible. Then grow. |
+| Argon2 CPU pegged on signup spikes | The per-IP and per-/24 throttles are doing their job; argon2 is slow on purpose. Tune `auth.argon2.time` only if every login is slow, not just signups. |
+
+## What we haven't measured
+
+The S36 bench harness covers the read-heavy paths. Numbers we
+don't yet have:
+
+- Push throughput at scale (large concurrent pushes).
+- Search corpus performance beyond ~10 GB of code.
+- Webhook fan-out at high event-per-second rates.
+
+If your deployment is going to push these limits, run the bench
+against your own staging before launching.
docs/public/self-host/configuration.md (added)
@@ -0,0 +1,126 @@
+# Configuration reference
+
+shithubd reads configuration from four sources, in increasing
+precedence:
+
+1. **Built-in defaults** (`internal/infra/config/Defaults()`).
+2. **TOML file** at `$SHITHUB_CONFIG` (default
+   `/etc/shithub/config.toml`). An absent file is fine; bad
+   syntax is fatal.
+3. **Environment variables.** `SHITHUB_<area>__<key>` — note the
+   double underscore between nesting levels.
+4. **CLI flags** passed by the caller.
+
+After merging, a small alias table maps backward-compat names
+(e.g., `SHITHUB_DATABASE_URL` → `db.url`); then the resolved
+config is validated. Any failure exits non-zero with a one-line
+error pointing at the offending key.
+
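+A quick way to see the precedence in action — this assumes
+`web.addr` is spelled `[web]`/`addr` in TOML; check your file's
+actual shape:
+
+```sh
+cat >/tmp/config.toml <<'EOF'
+[web]
+addr = ":8080"
+EOF
+# Env beats the file: the resolved config shows :9090.
+SHITHUB_CONFIG=/tmp/config.toml SHITHUB_WEB__ADDR=:9090 \
+  shithubd config print | grep -A1 '\[web\]'
+```
+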
+## Inspecting the active configuration
+
+```sh
+shithubd config print     # writes resolved config as TOML, secrets redacted
+shithubd config validate  # exits non-zero if invalid
+shithubd version          # one-line summary of which sinks are configured
+```
+
+`config print` redacts any field whose name contains `password`,
+`pass`, `secret`, `key`, `token`, `dsn`, or `url` — URL fields
+are redacted because they often carry credentials in the
+userinfo component.
+
+## Reference
+
+| Key | Type | Default | Notes |
+|---|---|---|---|
+| `env` | string | `dev` | One of `dev`, `staging`, `prod`. |
+| `web.addr` | string | `:8080` | Listen address. |
+| `web.read_timeout` | duration | `30s` | Per-request read timeout. |
+| `web.write_timeout` | duration | `30s` | Per-request write timeout. |
+| `web.shutdown_timeout` | duration | `10s` | Graceful drain on SIGTERM. |
+| `db.url` | string | `""` | Postgres DSN. Aliased by `SHITHUB_DATABASE_URL`. |
+| `db.max_conns` | int | `10` | pgxpool max conns. |
+| `db.min_conns` | int | `0` | pgxpool min conns. |
+| `db.connect_timeout` | duration | `5s` | |
+| `log.level` | string | `info` | One of `debug`, `info`, `warn`, `error`. |
+| `log.format` | string | `text` | One of `text`, `json`. Use `json` in prod. |
+| `metrics.enabled` | bool | `true` | Mounts `/metrics`. |
+| `metrics.basic_auth_user` | string | `""` | Together with `pass`, gates `/metrics`. |
+| `metrics.basic_auth_pass` | string | `""` | Redacted on print. |
+| `tracing.enabled` | bool | `false` | When true, `tracing.endpoint` is required. |
+| `tracing.endpoint` | string | `""` | OTLP HTTP endpoint. |
+| `tracing.sample_rate` | float | `0.05` | Parent-based ratio sampler in [0, 1]. |
+| `tracing.service_name` | string | `shithubd` | OTel resource attribute. |
+| `error_reporting.dsn` | string | `""` | Sentry-protocol DSN. Empty disables. |
+| `session.key_b64` | string | `""` | Base64 32-byte AEAD key. **Required in prod.** |
+| `session.max_age` | duration | `720h` | Cookie session lifetime (30 days). |
+| `session.secure` | bool | `false` | Set `Secure` cookie attribute. **True in prod.** |
+| `storage.repos_root` | string | `/data/repos` | Filesystem root for bare repos. |
+| `storage.s3.endpoint` | string | `""` | S3-compatible endpoint host[:port]. Empty disables S3. |
+| `storage.s3.region` | string | `us-east-1` | Region for SigV4 signing. |
+| `storage.s3.access_key_id` | string | `""` | |
+| `storage.s3.secret_access_key` | string | `""` | Redacted. |
+| `storage.s3.bucket` | string | `""` | One bucket per environment. |
+| `storage.s3.use_ssl` | bool | `false` | True for Spaces, false for MinIO. |
+| `storage.s3.force_path_style` | bool | `true` | True for MinIO, false for Spaces. |
+| `auth.require_email_verification` | bool | `true` | Login rejected until primary email verified. |
+| `auth.base_url` | string | `http://127.0.0.1:8080` | Used in transactional email links. **Set to your public URL in prod.** |
+| `auth.site_name` | string | `shithub` | Branding token in email subjects. |
+| `auth.email_from` | string | `shithub <noreply@shithub.local>` | Envelope From. |
+| `auth.email_backend` | string | `stdout` | One of `stdout`, `smtp`, `postmark`. |
+| `auth.smtp.addr` | string | `127.0.0.1:1025` | Required when backend=`smtp`. |
+| `auth.smtp.username` | string | `""` | Optional. |
+| `auth.smtp.password` | string | `""` | Optional. Redacted. |
+| `auth.postmark.server_token` | string | `""` | Required when backend=`postmark`. Redacted. |
+| `auth.argon2.memory_kib` | uint32 | `65536` | argon2id memory cost (KiB). |
+| `auth.argon2.time` | uint32 | `3` | argon2id iterations. |
+| `auth.argon2.threads` | uint8 | `2` | argon2id parallelism. |
+| `auth.totp_key_b64` | string | `""` | Base64 32-byte AEAD for TOTP secrets. **Required if 2FA is in use.** |
+
+## Examples
+
+```sh
+# Listen elsewhere
+export SHITHUB_WEB__ADDR=:9090
+
+# Postgres
+export SHITHUB_DATABASE_URL=postgres://shithub:****@127.0.0.1:5432/shithub?sslmode=require
+
+# JSON logs for prod
+export SHITHUB_LOG__FORMAT=json
+export SHITHUB_LOG__LEVEL=info
+
+# Tracing
+export SHITHUB_TRACING__ENABLED=true
+export SHITHUB_TRACING__ENDPOINT=http://otel-collector:4318
+export SHITHUB_TRACING__SAMPLE_RATE=0.05
+
+# Session AEAD key (session.key_b64)
+export SHITHUB_SESSION__KEY_B64=$(openssl rand -base64 32)
+
+# Metrics behind Basic auth
+export SHITHUB_METRICS__BASIC_AUTH_USER=prom
+export SHITHUB_METRICS__BASIC_AUTH_PASS=<long-random>
+```
+
+## Secrets handling
+
+- Never commit secrets. `.env` is gitignored; production secrets
+  live in a systemd `EnvironmentFile=` with mode `0600` (see the
+  sketch below).
+- The Ansible roles read sensitive values from the inventory
+  (which you keep encrypted with `ansible-vault` or your secret
+  store) and template them into `/etc/shithub/web.env` and
+  `worker.env`.
+- Log-line redaction is independent of `config print` redaction;
+  both layers exist on purpose.
+
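+A hand-rolled sketch of such an env file — the Ansible role is
+the normal path, and the values here are placeholders:
+
+```sh
+install -m 0600 -o root -g root /dev/null /etc/shithub/web.env
+cat >>/etc/shithub/web.env <<'EOF'
+SHITHUB_DATABASE_URL=postgres://shithub:<password>@127.0.0.1:5432/shithub?sslmode=require
+SHITHUB_SESSION__KEY_B64=<base64 32-byte key>
+SHITHUB_LOG__FORMAT=json
+EOF
+```
+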
+## Adding a key
+
+If you fork shithub and add a config key:
+
+1. Add the field to the appropriate struct in
+   `internal/infra/config/config.go` with a `toml:` tag.
+2. Set its default in `Defaults()`.
+3. Add validation in `Validate()` if it has invariants.
+4. If it carries a secret, name it so the redactor matches.
+5. Document it here.
docs/public/self-host/deploy.md (added)
@@ -0,0 +1,153 @@
+# Deploying with Ansible
+
+The reference install path is the Ansible playbook in
+`deploy/ansible/`. From a fresh Ubuntu 24.04 box the workflow is:
+inventory → `make deploy` → bootstrap an admin → done.
+
+## 1. Inventory
+
+Copy the example inventory:
+
+```sh
+cp deploy/ansible/inventory/staging.example deploy/ansible/inventory/staging
+$EDITOR deploy/ansible/inventory/staging
+```
+
+Variables marked `# REQUIRED` in the example come from your
+secret store (Bitwarden, 1Password, ansible-vault). **Do not**
+commit a real inventory file. The repo's `.gitignore` excludes
+the bare names `staging` and `production` to make this obvious.
+
+Critical variables:
+
+- `db_password` — Postgres password for the `shithub` role.
+- `hook_password` — Postgres password for the `shithub_hook`
+  role.
+- `session_key` — base64 32-byte AEAD key. Generate with
+  `openssl rand -base64 32`.
+- `totp_key` — base64 32-byte AEAD key for at-rest TOTP secrets.
+- `s3_*` — bucket name, region, credentials.
+- `email_*` — Postmark token or SMTP creds.
+- `wireguard_*` — peer keys for the monitoring mesh.
+
+## 2. Dry-run
+
+```sh
+make deploy-check ANSIBLE_INVENTORY=staging
+```
+
+Reports every change Ansible would make without making it. Read
+the diff before continuing.
+
+## 3. Apply
+
+```sh
+make deploy ANSIBLE_INVENTORY=staging
+```
+
+The playbook is idempotent; a second run should report
+`changed=0`. If it doesn't, that's a config drift bug — open an
+issue.
+
+The roles, in order:
+
+- **base** — apt baseline, ufw default-deny, system users (`shithub`,
+  `shithub-ssh`), data root.
+- **postgres** — installs PG16, initdb's `/data/pgdata`, applies
+  our `postgresql.conf` and `pg_hba.conf`, wires the WAL archive
+  command, creates the `shithub` and `shithub_hook` roles with
+  exact-grant permissions.
+- **shithubd** — copies the binary into `/usr/local/bin`, drops
+  env files into `/etc/shithub/`, installs the three systemd
+  units. The web service's `ExecStartPre=` runs `shithubd migrate
+  up`, so deploys that carry new schema are one command.
+- **caddy** — installs Caddy + the templated `Caddyfile`.
+  Automatic TLS via Let's Encrypt staging until you flip
+  `caddy_use_acme_staging=false`.
+- **wireguard** — peers each host into the 10.50.0.0/24 mesh.
+- **backup** — installs the daily backup timer on the db host
+  and the cross-region sync timer on the backup host.
+- **monitoring-client** — node-exporter + promtail on every
+  host, pointing at the monitoring host.
+
+## 4. Bootstrap the first admin
+
+The first deploy creates no users. SSH to the web host and run:
+
+```sh
+sudo -u shithub /usr/local/bin/shithubd admin bootstrap-admin --email you@example.com
+```
+
+The CLI creates a user (or grants site-admin to an existing one)
+and prints a one-time password-reset link. Open it in a browser,
+set a password, enable 2FA. Subsequent admin grants happen
+through `/admin/users/{id}`.
+
+## 5. Smoke
+
+- `https://shithub.example/` — Caddy serves the home page.
+- `https://shithub.example/-/health` — returns `200 OK` with the
+  build version (see below).
+- Sign in as the bootstrap admin. Create a test repo. Push to it.
+- `https://shithub.example/admin/` — admin dashboard renders.
+
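+The first two checks, scriptable (substitute your domain):
+
+```sh
+curl -fsS -o /dev/null -w '%{http_code}\n' https://shithub.example/
+curl -fsS https://shithub.example/-/health   # prints the build version
+```
+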
+## 6. Production
+
+Same procedure with the production inventory:
+
+```sh
+make deploy ANSIBLE_INVENTORY=production
+```
+
+For larger deploys, target a single host first:
+
+```sh
+make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=web-02
+```
+
+## Postgres
+
+The `postgres` role does the heavy lifting; nothing
+operator-visible should be needed. Specific things to be aware
+of:
+
+- **Archive command.** `deploy/postgres/archive_command.sh` ships
+  every WAL segment to `spaces-prod:shithub-wal`. Postgres
+  refuses to recycle a segment until the script reports success;
+  a failing archiver therefore fills the disk. Alert on
+  `pg_stat_archiver.failed_count > 0` (the default rule does).
+- **Hook role.** A separate Postgres role `shithub_hook` is used
+  by the `shithubd hook …` subcommands. It has minimal grants —
+  see `deploy/postgres/hook-role-grants.sql`. If you add a new
+  hook subcommand that touches a new table, update those grants.
+
+## Caddy
+
+The Caddyfile (`deploy/Caddyfile.j2`) reverse-proxies to
+`127.0.0.1:8080`. Two special path patterns:
+
+- `/_/<owner>/<repo>.git/(info/refs|git-(upload|receive)-pack)`
+  uses a 30-minute timeout to accommodate large pushes and
+  clones.
+- Static assets under `/static/` get aggressive cache headers.
+
+Access logs are JSON to `/var/log/caddy/access.log`; promtail
+ships them to Loki.
+
+## sshd
+
+The provided `sshd_config.j2` is conservative: PasswordAuthentication
+off, PubkeyAuthentication on, X11/agent/TCP forwarding off. The
+`Match User git` block uses the `AuthorizedKeysCommand` (AKC)
+contract — see
+[Authentication / SSH on the user side](../user/ssh.md).
+
+## Rolling forward
+
+```sh
+git fetch --tags
+git checkout v<new-version>
+make deploy ANSIBLE_INVENTORY=production
+```
+
+That's it. The `migrate up` is part of the unit's startup
+preflight; if a migration fails, the unit stays in `activating`
+and the journal has the error. See
+[Upgrades](./upgrades.md) for the major-version checklist.
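+
+If a deploy does wedge there, the usual inspection is:
+
+```sh
+systemctl status shithubd-web      # shows "activating" while ExecStartPre runs
+journalctl -u shithubd-web -n 100  # the migration error lands here
+```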
docs/public/self-host/prerequisites.md (added)
@@ -0,0 +1,98 @@
+# Prerequisites
+
+What you need before you start. The reference deployment targets
+DigitalOcean + DigitalOcean Spaces; if you're on a different
+provider, the shape is the same — substitute equivalents.
+
+## Compute
+
+Three machines for staging is comfortable; production starts at
+five. The minimum useful sizes:
+
+| Role         | vCPU | RAM  | Disk | Notes                                    |
+|--------------|-----:|-----:|-----:|------------------------------------------|
+| web          | 2    | 4 GB | 40 GB | Run two for HA.                         |
+| worker       | 2    | 4 GB | 40 GB | Can colocate with web for small instances. |
+| postgres     | 2    | 8 GB | 100 GB | Start at 100 GB; grow with disk usage.  |
+| backup       | 1    | 2 GB | 80 GB | Idle most of the time; runs cron only.  |
+| monitoring   | 2    | 4 GB | 60 GB | Prom + Loki + Alertmanager + Grafana.   |
+
+Ubuntu 24.04 LTS is the supported OS. Other Debian derivatives
+work but are not regularly tested.
+
+## Storage
+
+shithub stores three categories:
+
+- **Bare repos** — on the web/worker hosts at `/data/repos`. Local
+  filesystem; **not** in object storage. Sized to total push
+  volume.
+- **Webhook delivery bodies, LFS, attachments** — in an
+  S3-compatible object store. The reference setup uses
+  DigitalOcean Spaces; MinIO works for self-hosted equivalents.
+- **Backups** — a separate bucket. A cross-region mirror is
+  strongly recommended; lifecycle policies prune old data.
+
+## Database
+
+Postgres 16 on a dedicated host. Earlier versions work in
+development; production requires 16+ for the `pg_stat_statements`
+features and JSON functions we rely on.
+
+The `archive_command` ships WAL to object storage in real time
+([deploy.md](./deploy.md#postgres)); a daily logical dump is taken
+on top of that.
+
+## Domain + TLS
+
+You need:
+
+- A domain you control (e.g. `shithub.example`).
+- DNS records for the app (`shithub.example`) and the docs
+  subdomain (`docs.shithub.example`).
+- A TLS certificate. Caddy obtains and renews one via Let's
+  Encrypt automatically — no manual cert management — but the
+  DNS records must point at your public IP first.
+
+## Email
+
+shithub sends transactional email (signup verification, password
+reset, notifications). Three backends:
+
+- `stdout` — prints to the journal. **Dev only.**
+- `smtp` — talks to your own MTA or a SaaS that exposes SMTP.
+- `postmark` — uses Postmark's HTTP API (the recommended
+  production path; Postmark's deliverability is excellent for
+  transactional mail).
+
+Warm up your sending domain before going live; a cold IP's first
+thousand emails tend to be treated as spam.
+
+## SSH
+
+The git server expects port 22 reachable by your users. The
+sshd config (`deploy/sshd_config.j2`) restricts the `git` user to
+the AKC-driven flow; the rest of sshd is operator-only.
+
+## Object store credentials
+
+Provision two credential sets: one for the runtime (read+write
+on the runtime bucket) and one for backups and the restore drill
+(read-only on the runtime bucket; full access on the backup
+bucket). Rotate both quarterly.
+
+## Time
+
+NTP must be running; the rate-limit windows, TOTP, and audit-log
+timestamps all depend on accurate time. Ubuntu's
+`systemd-timesyncd` is enough.
+
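+Quick check that time sync is actually on:
+
+```sh
+timedatectl show -p NTPSynchronized
+# expect: NTPSynchronized=yes
+```
+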
+## Skills you'll want
+
+- Comfort with Ansible — the playbook is the install path.
+- Postgres operational basics: pg_dump/pg_restore, monitoring,
+  WAL.
+- Reading systemd journal output (`journalctl -u …`).
+
+If any of those are unfamiliar, allow extra time on your first
+deploy and follow the runbooks closely.
docs/public/self-host/troubleshooting.md (added)
@@ -0,0 +1,153 @@
+# Troubleshooting
+
+Common operator-visible failure modes and how to diagnose them.
+
+## Caddy: certificate acquisition fails
+
+**Symptom:** Caddy logs show `failed to obtain certificate` for
+your domain; the site serves the staging cert (or no cert).
+
+Most often:
+
+- DNS doesn't yet point at the host. Verify with `dig +short
+  shithub.example`.
+- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs
+  port 80 reachable from the public internet (Caddy redirects
+  to 443 *after* obtaining the cert). UFW must allow 80.
+- Rate-limited by Let's Encrypt. If you've been retrying with
+  `caddy_use_acme_staging=false` and failing, you can be locked
+  out for a few hours. Switch to `caddy_use_acme_staging=true`,
+  reload, then flip back once DNS/firewall are correct.
+
+## sshd: AKC slow / git ssh hangs at "Authenticating"
+
+**Symptom:** `git clone git@…` sits at the auth phase for many
+seconds before completing or timing out.
+
+The `AuthorizedKeysCommand` invokes `/usr/local/bin/shithubd
+ssh-authkeys %f`, which queries Postgres. Latency comes from one
+of:
+
+- DB connection saturation. The web process holds the `pgxpool`;
+  the AKC subcommand opens its own connection. Check
+  `pg_stat_activity` for stuck connections.
+- DNS for the AKC subprocess (it should not need DNS, but a
+  misconfigured `/etc/nsswitch.conf` can stall lookups).
+- The user has a *very* large number of keys; AKC walks all of
+  them. The practical ceiling is a few hundred keys before you
+  notice.
+
+Mitigation: investigate, don't bypass (see the timing check
+below). Disabling the AKC opens a hole — every key on disk would
+have shell access.
+
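+To see where the time goes, run the AKC by hand — the
+`shithub-ssh` user and the fingerprint argument here are
+illustrative:
+
+```sh
+time sudo -u shithub-ssh /usr/local/bin/shithubd ssh-authkeys 'SHA256:<fingerprint>'
+```
+
+If this is slow in isolation, the problem is in the DB/DNS path
+above, not in sshd itself.
+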
+## Job queue: backlog growing
+
+**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.
+
+```sh
+psql -d shithub -c "
+  SELECT kind, count(*) FROM jobs WHERE state='queued'
+  GROUP BY kind ORDER BY 2 DESC;"
+```
+
+If one kind dominates, look for a poison job in worker logs:
+
+```sh
+journalctl -u shithubd-worker -n 500 --grep ERROR
+```
+
+If the spread is even, the worker isn't keeping up — scale by
+running a second worker host. The `FOR UPDATE SKIP LOCKED`
+pattern lets multiple workers coexist safely.
+
+To purge a poison job: mark it `failed` in SQL (don't delete —
+you want the audit trail).
+
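+For example — column names follow the query above; adjust to the
+actual schema:
+
+```sh
+psql -d shithub -c "UPDATE jobs SET state = 'failed' WHERE id = <job-id>;"
+```
+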
+## Webhook: deliveries failing 50%+
+
+**Symptom:** `WebhookDeliveryFailing` alert.
+
+- A specific subscriber URL is broken. Check the webhook's
+  delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for
+  status codes and bodies.
+- Outbound network issue. From the host:
+  `curl -v https://example.com/hook`. If outbound HTTPS itself
+  is broken, that's an infrastructure problem.
+- Auto-disable kicks in after 50 consecutive failures per
+  webhook; the UI shows a banner and sets `active=false`. The
+  owner re-enables it once the endpoint is fixed.
+
+## Push: "pre-receive hook declined"
+
+**Symptom:** A user's `git push` fails with a pre-receive
+rejection.
+
+Common reasons:
+
+- **Branch protection.** Pushing directly to a protected branch
+  when the rule requires a PR. The user should push to a feature
+  branch instead.
+- **Repo over quota.** The bare-repo size cap was hit; raise it
+  per-repo or globally in config.
+- **Pack contains a too-large object.** Large blobs exceed the
+  per-object cap (defaults documented in `internal/repos`). Use
+  LFS instead for big binaries.
+
+## Postgres: archive_command failing
+
+**Symptom:** `pg_stat_archiver.failed_count` increasing; disk
+filling on the db host.
+
+`journalctl -u postgresql -n 200 --grep "archive_command failed"`
+will print the rclone error. Common causes:
+
+- Rotated bucket credentials.
+- Network partition to Spaces.
+- `rclone.conf` missing or unreadable.
+
+Confirm by hand:
+
+```sh
+sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
+     lsd spaces-prod:
+```
+
+Fix the underlying issue; Postgres retries automatically. If disk
+is critically full and Spaces won't be reachable in time,
+escalate — this is an incident. **Do not** run
+`pg_archivecleanup` manually unless explicitly directed — you can
+lose data.
+
+## Email: signups never arrive
+
+- `auth.email_backend=stdout` in your config — emails went to the
+  journal. Check `journalctl -u shithubd-web --grep '\[email\]'`.
+- Postmark token wrong / sender domain not verified. Postmark's
+  dashboard logs every send.
+- DKIM/SPF not set up; the provider may accept and silently drop.
+
+To test, send a verification email to an address you control on
+a domain that bounces with helpful errors (e.g., your own MTA
+configured to log rejections).
+
+## Sign-in fails immediately with no error in the journal
+
+CSRF token expired (the cookie didn't match the form's hidden
+token). Force a hard reload of the sign-in page; cookies older
+than the session lifetime are dropped.
+
+If that's not it: check `auth.require_email_verification=true` —
+the user may need to click a verification link.
+
+## "Site appears slow" with no specific symptom
+
+Top sources of latency, in order of frequency:
+
+1. **N+1 query regression.** DB calls/sec on the Grafana
+   dashboard will be much higher than baseline.
+2. **GC pressure.** The web process is allocating more than usual
+   — check Grafana's `go_gc_pause_seconds` panel.
+3. **Cache stampede.** If a hot cache key was just invalidated,
+   first-request-wins via singleflight should prevent this; if
+   it doesn't, check `cache.singleflight_in_flight` for an
+   excess of concurrent loads.
+4. **Disk-bound git ops.** Concurrent large clones pin the
+   web host's disk. Move git transports to a separate process
+   or scale the web tier.
docs/public/self-host/upgrades.md (added)
@@ -0,0 +1,82 @@
+# Upgrades & migrations
+
+Routine releases are ordinary deploys. Migrations apply
+automatically; the only place upgrades get exciting is around
+long migrations and the occasional config schema change.
+
+## Standard release
+
+```sh
+git fetch --tags
+git checkout v<version>
+make deploy-check ANSIBLE_INVENTORY=staging
+make deploy ANSIBLE_INVENTORY=staging
+# ... canary period ...
+make deploy ANSIBLE_INVENTORY=production
+```
+
+`shithubd migrate up` runs as the web service's
+`ExecStartPre=`, so the binary that needs the new schema is also
+the one that applies it. Order on each host: ExecStartPre runs
+migrations → web starts on the new schema.
+
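+To confirm the preflight is wired up on a host:
+
+```sh
+systemctl cat shithubd-web | grep ExecStartPre
+# expect a line invoking: shithubd migrate up
+```
+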
+If a migration is long (>30s), the release notes call it out.
+Schedule the deploy outside peak hours; the web service sits in
+`activating` until ExecStartPre finishes.
+
+## Canary
+
+Deploy to staging first and watch Grafana for 30 minutes. Things
+to look at:
+
+- p95 latency on top routes.
+- DB call rate — a 10× jump usually means a regressed N+1.
+- Job queue depth — a stuck migration shows up here.
+- Error logs in Loki: `{service="shithubd"} |~ "panic|ERROR"`.
+
+If anything looks off, **do not** promote. Rollback on staging
+is cheap; rollback on production is loud.
+
+## Major release (database)
+
+If the release notes flag a major schema change:
+
+1. Take a manual `pg_dump` immediately before the deploy:
+   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
+2. Confirm it landed in Spaces.
+3. Deploy to staging, then run `make restore-drill` against the
+   *post-deploy* dump to confirm the new schema restores cleanly.
+4. Then production.
+
+## Config schema changes
+
+When a release adds a required config key, the binary refuses to
+start and complains in the journal. Update
+`deploy/ansible/roles/shithubd/templates/web.env.j2` (and
+`worker.env.j2`), bump the inventory vars, and redeploy. There's
+no separate migration step for env files.
+
+## Rolling back
+
+See [Rollback (in-repo runbook)](https://github.com/tenseleyFlow/shithub/blob/main/docs/internal/runbooks/rollback.md).
+
+Three rollback shapes, in preference order:
+
+1. **Schema-compatible rollback (best).** If the migration only
+   *added* columns/tables that the old code ignores, the old code
+   runs fine against the new schema. Roll the code back; leave
+   the schema alone. Most of our migrations are deliberately
+   additive for this reason.
+2. **Roll forward to a hotfix.** If the migration changed
+   semantics that the old code can't tolerate, ship a hotfix on
+   top of the new release rather than reversing the migration.
+3. **Migration `down` + code rollback.** Last resort; some `down`s
+   drop columns and *will* lose data.
+
+```sh
+# (3) only when (1) and (2) won't work
+ssh web-01
+sudo -u shithub /usr/local/bin/shithubd migrate down  # ONE step
+git checkout v<previous>
+make deploy ANSIBLE_INVENTORY=production
+```