S38: docs — self-host (prereqs/deploy/configuration/backup-restore/upgrades/troubleshooting/capacity)

- SHA: f1b5339 (f1b533933e5384f08f9dfd6830629022a65cbf56)
- Parent: 1938837
- Tree: 19cecb6

| Status | File | + | - |
|---|---|---|---|
| A | docs/public/self-host/backup-restore.md | 115 | 0 |
| A | docs/public/self-host/capacity.md | 87 | 0 |
| A | docs/public/self-host/configuration.md | 126 | 0 |
| A | docs/public/self-host/deploy.md | 153 | 0 |
| A | docs/public/self-host/prerequisites.md | 98 | 0 |
| A | docs/public/self-host/troubleshooting.md | 153 | 0 |
| A | docs/public/self-host/upgrades.md | 82 | 0 |

docs/public/self-host/backup-restore.md (added) @@ -0,0 +1,115 @@

# Backup & restore

Two layers, both mandatory:

- **Continuous WAL archive** — `deploy/postgres/archive_command.sh`
  ships every WAL segment to `spaces-prod:shithub-wal` in real
  time. Recovery point objective (RPO) ≈ one WAL segment (~16 MB
  or one `archive_timeout`).
- **Daily logical dump** — `deploy/postgres/backup-daily.sh`
  takes a `pg_dump --format=custom` once per day and ships it to
  `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. Keeps the most
  recent 7 on the db host.

A cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
both buckets to a second region for DR. The lifecycle policy in
`deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps
after 90.

## Verifying that backups are healthy

The monitoring stack does this for you:

- The `BackupOverdue` alert fires if there has been no successful
  backup in 30 h.
- `pg_stat_archiver.failed_count > 0` pages via the
  archive-failing alert.

By hand:

```sh
ssh db
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
  lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
```
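
The archiver side can also be checked from SQL; `pg_stat_archiver` is a standard Postgres view, so this works on any installation:

```sql
-- Archiver health at a glance: failed_count should be 0 and
-- last_archived_wal should be recent.
SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
FROM pg_stat_archiver;
```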

## Restore drills

Run **quarterly**. The restore drill restores the latest dump
into a temp Postgres instance, runs smoke queries, and tears
down. Production is untouched.

```sh
ssh backup-host
sudo /usr/local/bin/shithub-restore-drill
# uses the latest dump in spaces-prod:shithub-backups/daily/
```

To drill an older dump:

```sh
sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
```

`--keep` preserves the temp pgdata for inspection; the default is
to clean up.

A failed drill is a P0: it means our backups can't actually
restore. File an incident immediately.

## Real restore — full DB lost

Use this when:

- The DB host is destroyed.
- Postgres can't start and there's no working WAL.
- 24 hours of data loss is acceptable.

1. Spin up a new db host
   (`make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=db-new ANSIBLE_TAGS=db`).
2. Pull the most recent daily dump:
   ```sh
   rclone copyto \
     spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump \
     /tmp/restore.dump
   ```
3. Restore:
   ```sh
   pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump
   ```
4. Confirm the schema: `shithubd migrate status`.
5. Bring the app up against the new DB (update `web.env` and
   `worker.env`).
6. **Notify users** — there will be a visible activity gap.

## Real restore — point-in-time

Use this when the data is intact in WAL but a destructive change
happened at a known time (`DROP TABLE`, mass `UPDATE`, runaway
worker).

This is harder. The procedure:

1. Stop the app (`systemctl stop shithubd-web shithubd-worker shithubd-cron.timer`).
   Do not let writes continue.
2. Restore the most recent base backup from before the incident
   into a fresh data directory.
3. Configure recovery (`recovery.signal` + GUCs in PG16) pointing
   at the WAL archive in Spaces with
   `recovery_target_time = '<UTC timestamp just before incident>'`.
4. Start Postgres in recovery mode; let it replay WAL.
5. When recovery completes, promote.
6. Restart the app.

If you have not done a PITR before, **do not** free-style this.
Mistakes here destroy the data you were trying to save. Find
someone who has done it and pair with them.
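
For orientation, the recovery configuration in step 3 has roughly this shape. The `restore_command` below is an illustrative guess based on this deployment's rclone setup, not the tested procedure; the timestamp is a placeholder:

```
# In the fresh data directory's postgresql.conf (or postgresql.auto.conf):
restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
recovery_target_time = '2025-01-15 09:41:00+00'   # UTC, just before the incident
recovery_target_action = 'promote'
```

Then create an empty `recovery.signal` file in the data directory before starting Postgres; its presence is what puts PG16 into targeted recovery.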

## After any real restore

- Run the restore-drill smoke queries against the restored DB.
- Reconcile the audit log: investigate any entries referencing
  IDs that no longer exist after the restore.
- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
  if the dump was compromised, the secrets in it may be too.
- Force-rotate session epochs for any user whose session crossed
  the recovery window.

docs/public/self-host/capacity.md (added) @@ -0,0 +1,87 @@

# Capacity planning

Rule-of-thumb numbers from the S36 bench harness on the reference
hardware (2-vCPU droplets, Postgres 16 on a dedicated host).
Treat these as starting points; your traffic patterns will
differ.

## Single web host (2 vCPU / 4 GB)

| Workload | Sustainable rate |
|---|---|
| Anonymous repo home page | ~600 req/s, p95 < 80 ms |
| Authenticated dashboard render | ~250 req/s, p95 < 200 ms |
| Diff render (PR with ~30 files) | ~40 req/s, p95 < 600 ms |
| Issue list + filters | ~200 req/s, p95 < 250 ms |
| Code search (small corpus) | ~30 req/s, p95 < 800 ms |

The first ceiling you hit is **CPU on diff/highlight rendering**.
A second web host doubles the budget linearly; the DB is rarely
the bottleneck for read-heavy traffic on this hardware.

## Postgres host (2 vCPU / 8 GB / 100 GB SSD)

| Workload | Sustainable rate |
|---|---|
| Read queries | ~5,000 calls/sec sustained |
| Write queries | ~500 calls/sec sustained |
| Connections | 100 concurrent (`pgxpool` max=10 × 10 web procs) |

If the read rate goes much beyond ~5k/s, suspect an N+1 — see
the troubleshooting page.

## Worker host (2 vCPU / 4 GB)

The worker is bursty: idle most of the time, then chews through
the queue when something happens. A single worker handles:

- ~150 webhook deliveries/sec (most of the time is spent in the
  outbound HTTP client).
- ~40 fan-out events/sec (notifications + activity recompute).

A second worker scales nearly linearly thanks to `FOR UPDATE
SKIP LOCKED`.
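
That scaling property comes from the standard skip-locked dequeue pattern. A minimal sketch of the shape — the actual table and column names in the `jobs` schema are assumptions here:

```sql
-- Each worker claims one queued job. Rows locked by another
-- worker are skipped rather than waited on, so concurrent
-- workers never block each other or double-claim a job.
SELECT id, kind, payload
FROM jobs
WHERE state = 'queued'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
```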

## Object store

Sized by your push volume, not request rate. Spaces handles the
PUT rate of every WAL segment + nightly dump comfortably; we've
never seen the bucket be the bottleneck.

Daily cost order: **WAL > daily dumps > LFS/attachments**. Most
WAL cost is reclaimed by the lifecycle policy (30-day retention).
If your WAL costs run away, your `archive_timeout` is probably
set too low.

## Repo storage on disk

Plan for ~3× your raw data size on the bare-repo filesystem:

- 1× the actual git data
- 1× for the `git gc` working set on large repos
- 1× headroom

Repos that hit ~5 GiB get noticeably slow to clone; consider
splitting or archiving.

## When to scale up

| Trigger | Action |
|---|---|
| p95 latency on top routes climbing | Add a second web host first. |
| DB call rate consistently > 5k/s | Look for an N+1 *before* upgrading the DB. |
| Job queue depth > 1k for hours | Add a second worker host. |
| Disk on db host > 70% | Plan growth (it doesn't shrink). Increase WAL retention only after. |
| Disk on web host > 70% | Audit the largest repos; clean up archived ones if possible. Then grow. |
| Argon2 CPU pegged on signup spikes | The per-IP and per-/24 throttles are doing their job; argon2 is slow on purpose. Tune `argon2.time` only if every login is slow, not just signups. |

## What we haven't measured

The S36 bench harness covers the read-heavy paths. Numbers we
don't yet have:

- Push throughput at scale (large concurrent pushes).
- Search corpus performance beyond ~10 GB of code.
- Webhook fan-out at high events-per-second rates.

If your deployment is going to push these limits, run the bench
against your own staging before launching.

docs/public/self-host/configuration.md (added) @@ -0,0 +1,126 @@

# Configuration reference

shithubd reads configuration from four sources, in increasing
precedence:

1. **Built-in defaults** (`internal/infra/config/Defaults()`).
2. **TOML file** at `$SHITHUB_CONFIG` (default
   `/etc/shithub/config.toml`). An absent file is fine; bad
   syntax is fatal.
3. **Environment variables.** `SHITHUB_<area>__<key>` — note the
   double underscore between nesting levels.
4. **CLI flags** passed by the caller.

After merging, a small alias table maps backward-compat names
(e.g., `SHITHUB_DATABASE_URL` → `db.url`); then the resolved
config is validated. Any failure exits non-zero with a one-line
error pointing at the offending key.

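A minimal TOML file covering the common production knobs might look like this (values are illustrative placeholders, not recommendations; the keys are from the reference table below):

```toml
# /etc/shithub/config.toml
env = "prod"

[web]
addr = ":8080"

[db]
url = "postgres://shithub:CHANGEME@127.0.0.1:5432/shithub?sslmode=require"
max_conns = 10

[log]
level = "info"
format = "json"
```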
## Inspecting the active configuration

```sh
shithubd config print     # writes resolved config as TOML, secrets redacted
shithubd config validate  # exits non-zero if invalid
shithubd version          # one-line summary of which sinks are configured
```

`config print` redacts any field whose name contains `password`,
`pass`, `secret`, `key`, `token`, `dsn`, or `url` — URL fields
are redacted because they often carry credentials in the
userinfo component.

## Reference

| Key | Type | Default | Notes |
|---|---|---|---|
| `env` | string | `dev` | One of `dev`, `staging`, `prod`. |
| `web.addr` | string | `:8080` | Listen address. |
| `web.read_timeout` | duration | `30s` | Per-request read timeout. |
| `web.write_timeout` | duration | `30s` | Per-request write timeout. |
| `web.shutdown_timeout` | duration | `10s` | Graceful drain on SIGTERM. |
| `db.url` | string | `""` | Postgres DSN. Aliased by `SHITHUB_DATABASE_URL`. |
| `db.max_conns` | int | `10` | pgxpool max conns. |
| `db.min_conns` | int | `0` | pgxpool min conns. |
| `db.connect_timeout` | duration | `5s` | |
| `log.level` | string | `info` | One of `debug`, `info`, `warn`, `error`. |
| `log.format` | string | `text` | One of `text`, `json`. Use `json` in prod. |
| `metrics.enabled` | bool | `true` | Mounts `/metrics`. |
| `metrics.basic_auth_user` | string | `""` | Together with `pass`, gates `/metrics`. |
| `metrics.basic_auth_pass` | string | `""` | Redacted on print. |
| `tracing.enabled` | bool | `false` | When true, `tracing.endpoint` is required. |
| `tracing.endpoint` | string | `""` | OTLP HTTP endpoint. |
| `tracing.sample_rate` | float | `0.05` | Parent-based ratio sampler in [0, 1]. |
| `tracing.service_name` | string | `shithubd` | OTel resource attribute. |
| `error_reporting.dsn` | string | `""` | Sentry-protocol DSN. Empty disables. |
| `session.key_b64` | string | `""` | Base64 32-byte AEAD key. **Required in prod.** |
| `session.max_age` | duration | `720h` | Cookie session lifetime (30 days). |
| `session.secure` | bool | `false` | Set `Secure` cookie attribute. **True in prod.** |
| `storage.repos_root` | string | `/data/repos` | Filesystem root for bare repos. |
| `storage.s3.endpoint` | string | `""` | S3-compatible endpoint host[:port]. Empty disables S3. |
| `storage.s3.region` | string | `us-east-1` | Region for SigV4 signing. |
| `storage.s3.access_key_id` | string | `""` | |
| `storage.s3.secret_access_key` | string | `""` | Redacted. |
| `storage.s3.bucket` | string | `""` | One bucket per environment. |
| `storage.s3.use_ssl` | bool | `false` | True for Spaces, false for MinIO. |
| `storage.s3.force_path_style` | bool | `true` | True for MinIO, false for Spaces. |
| `auth.require_email_verification` | bool | `true` | Login rejected until primary email verified. |
| `auth.base_url` | string | `http://127.0.0.1:8080` | Used in transactional email links. **Set to your public URL in prod.** |
| `auth.site_name` | string | `shithub` | Branding token in email subjects. |
| `auth.email_from` | string | `shithub <noreply@shithub.local>` | Envelope From. |
| `auth.email_backend` | string | `stdout` | One of `stdout`, `smtp`, `postmark`. |
| `auth.smtp.addr` | string | `127.0.0.1:1025` | Required when backend=`smtp`. |
| `auth.smtp.username` | string | `""` | Optional. |
| `auth.smtp.password` | string | `""` | Optional. Redacted. |
| `auth.postmark.server_token` | string | `""` | Required when backend=`postmark`. Redacted. |
| `auth.argon2.memory_kib` | uint32 | `65536` | argon2id memory cost (KiB). |
| `auth.argon2.time` | uint32 | `3` | argon2id iterations. |
| `auth.argon2.threads` | uint8 | `2` | argon2id parallelism. |
| `auth.totp_key_b64` | string | `""` | Base64 32-byte AEAD for TOTP secrets. **Required if 2FA is in use.** |

## Examples

```sh
# Listen elsewhere
export SHITHUB_WEB__ADDR=:9090

# Postgres
export SHITHUB_DATABASE_URL=postgres://shithub:****@127.0.0.1:5432/shithub?sslmode=require

# JSON logs for prod
export SHITHUB_LOG__FORMAT=json
export SHITHUB_LOG__LEVEL=info

# Tracing
export SHITHUB_TRACING__ENABLED=true
export SHITHUB_TRACING__ENDPOINT=http://otel-collector:4318
export SHITHUB_TRACING__SAMPLE_RATE=0.05

# Session signing key
export SHITHUB_SESSION_KEY=$(openssl rand -base64 32)

# Metrics behind Basic auth
export SHITHUB_METRICS__BASIC_AUTH_USER=prom
export SHITHUB_METRICS__BASIC_AUTH_PASS=<long-random>
```

## Secrets handling

- Never commit secrets. `.env` is gitignored; production secrets
  live in a systemd `EnvironmentFile=` with mode `0600`.
- The Ansible roles read sensitive values from the inventory
  (which you keep encrypted with `ansible-vault` or your secret
  store) and template them into `/etc/shithub/web.env` and
  `worker.env`.
- Log-line redaction is independent of `config print` redaction;
  both layers exist on purpose.

## Adding a key

If you fork shithub and add a config key:

1. Add the field to the appropriate struct in
   `internal/infra/config/config.go` with a `toml:` tag.
2. Set its default in `Defaults()`.
3. Add validation in `Validate()` if it has invariants.
4. If it carries a secret, name it so the redactor matches.
5. Document it here.

docs/public/self-host/deploy.md (added) @@ -0,0 +1,153 @@

# Deploying with Ansible

The reference install path is the Ansible playbook in
`deploy/ansible/`. From a fresh Ubuntu 24.04 box the workflow is:
inventory → `make deploy` → bootstrap an admin → done.

## 1. Inventory

Copy the example inventory:

```sh
cp deploy/ansible/inventory/staging.example deploy/ansible/inventory/staging
$EDITOR deploy/ansible/inventory/staging
```

Variables marked `# REQUIRED` in the example come from your
secret store (Bitwarden, 1Password, ansible-vault). **Do not**
commit a real inventory file. The repo's `.gitignore` excludes
the bare names `staging` and `production` to make this obvious.

Critical variables:

- `db_password` — Postgres password for the `shithub` role.
- `hook_password` — Postgres password for the `shithub_hook`
  role.
- `session_key` — base64 32-byte AEAD key. Generate with
  `openssl rand -base64 32`.
- `totp_key` — base64 32-byte AEAD key for at-rest TOTP secrets.
- `s3_*` — bucket name, region, credentials.
- `email_*` — Postmark token or SMTP creds.
- `wireguard_*` — peer keys for the monitoring mesh.
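
For orientation, a filled-in inventory has roughly this shape. The hostnames, IPs, and grouping are invented for illustration — `staging.example` is the authoritative layout:

```ini
# deploy/ansible/inventory/staging (illustrative)
[web]
web-01 ansible_host=203.0.113.10

[db]
db-01 ansible_host=203.0.113.20

[all:vars]
db_password=CHANGEME      # REQUIRED — from your secret store
hook_password=CHANGEME    # REQUIRED
session_key=CHANGEME      # REQUIRED — openssl rand -base64 32
```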

## 2. Dry-run

```sh
make deploy-check ANSIBLE_INVENTORY=staging
```

This reports every change Ansible would make without making it.
Read the diff before continuing.

## 3. Apply

```sh
make deploy ANSIBLE_INVENTORY=staging
```

The playbook is idempotent; a second run should report
`changed=0`. If it doesn't, that's a config-drift bug — open an
issue.

The roles, in order:

- **base** — apt baseline, ufw default-deny, system users (`shithub`,
  `shithub-ssh`), data root.
- **postgres** — installs PG16, initdb's `/data/pgdata`, applies
  our `postgresql.conf` and `pg_hba.conf`, wires the WAL archive
  command, and creates the `shithub` and `shithub_hook` roles with
  exact-grant permissions.
- **shithubd** — copies the binary into `/usr/local/bin`, drops
  env files into `/etc/shithub/`, and installs the three systemd
  units. The web service's `ExecStartPre=` runs
  `shithubd migrate up`, so deploys with new schema are one
  command.
- **caddy** — installs Caddy + the templated `Caddyfile`. Auto-TLS
  via Let's Encrypt staging until you flip
  `caddy_use_acme_staging=false`.
- **wireguard** — peers each host into the 10.50.0.0/24 mesh.
- **backup** — installs the daily backup timer on the db host
  and the cross-region sync timer on the backup host.
- **monitoring-client** — node-exporter + promtail on every
  host, pointing at the monitoring host.

## 4. Bootstrap the first admin

The first deploy creates no users. SSH to the web host and run:

```sh
sudo -u shithub /usr/local/bin/shithubd admin bootstrap-admin --email you@example.com
```

The CLI creates a user (or grants site-admin to an existing one)
and prints a one-time password-reset link. Open it in a browser,
set a password, and enable 2FA. Subsequent admin grants happen
through `/admin/users/{id}`.

## 5. Smoke

- `https://shithub.example/` — Caddy serves the home page.
- `https://shithub.example/-/health` — returns `200 OK` with the
  build version.
- Sign in as the bootstrap admin. Create a test repo. Push to it.
- `https://shithub.example/admin/` — admin dashboard renders.

## 6. Production

Same procedure with the production inventory:

```sh
make deploy ANSIBLE_INVENTORY=production
```

For larger deploys, target a single host first:

```sh
make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=web-02
```

## Postgres

The `postgres` role does the heavy lifting; nothing
operator-visible should be needed. Specific things to be aware of:

- **Archive command.** `deploy/postgres/archive_command.sh` ships
  every WAL segment to `spaces-prod:shithub-wal`. Postgres
  refuses to recycle a segment until the script reports success;
  a failing archiver therefore fills the disk. Alert on
  `pg_stat_archiver.failed_count > 0` (the default rule does).
- **Hook role.** A separate Postgres role, `shithub_hook`, is used
  by the `shithubd hook …` subcommands. It has minimal grants —
  see `deploy/postgres/hook-role-grants.sql`. If you add a new
  hook subcommand that touches a new table, update those grants.

## Caddy

The Caddyfile (`deploy/Caddyfile.j2`) reverse-proxies
`127.0.0.1:8080`. Two special path patterns:

- `/_/<owner>/<repo>.git/(info/refs|git-(upload|receive)-pack)`
  uses a 30-minute timeout to accommodate large pushes and clones.
- Static assets under `/static/` get aggressive cache headers.

Access logs are JSON to `/var/log/caddy/access.log`; promtail
ships them to Loki.
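
The long-timeout git matcher can be expressed like this in Caddyfile syntax. This is a sketch of the shape, not the actual template — consult `deploy/Caddyfile.j2` for the real directives:

```
shithub.example {
    # Smart-HTTP git endpoints: allow slow, long-running transfers.
    @git path_regexp ^/_/[^/]+/[^/]+\.git/(info/refs|git-upload-pack|git-receive-pack)
    reverse_proxy @git 127.0.0.1:8080 {
        transport http {
            response_header_timeout 30m
        }
    }

    # Everything else gets default timeouts.
    reverse_proxy 127.0.0.1:8080
}
```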

## sshd

The provided `sshd_config.j2` is conservative: PasswordAuthentication
off, PubkeyAuthentication on, X11/agent/TCP forwarding off. The
`Match User git` block uses the AKC contract — see
[Authentication / SSH on the user side](../user/ssh.md).

## Rolling forward

```sh
git fetch --tags
git checkout v<new-version>
make deploy ANSIBLE_INVENTORY=production
```

That's it. The `migrate up` is part of the unit's startup
preflight; if a migration fails, the unit stays in `activating`
and the journal has the error. See
[Upgrades](./upgrades.md) for the major-version checklist.

docs/public/self-host/prerequisites.md (added) @@ -0,0 +1,98 @@

# Prerequisites

What you need before you start. The reference deployment targets
DigitalOcean + DigitalOcean Spaces; if you're on a different
provider, the shape is the same — substitute equivalents.

## Compute

Three machines is comfortable for staging; production starts at
five. The minimum useful sizes:

| Role | vCPU | RAM | Disk | Notes |
|---|---:|---:|---:|---|
| web | 2 | 4 GB | 40 GB | Run two for HA. |
| worker | 2 | 4 GB | 40 GB | Can colocate with web for small instances. |
| postgres | 2 | 8 GB | 100 GB | Start at 100 GB; grow with disk usage. |
| backup | 1 | 2 GB | 80 GB | Idle most of the time; runs cron only. |
| monitoring | 2 | 4 GB | 60 GB | Prom + Loki + Alertmanager + Grafana. |

Ubuntu 24.04 LTS is the supported OS. Other Debian derivatives
work but are not regularly tested.

## Storage

shithub stores three categories of data:

- **Bare repos** — on the web/worker hosts at `/data/repos`. Local
  filesystem; **not** in object storage. Sized to total push
  volume.
- **Webhook delivery bodies, LFS, attachments** — in an
  S3-compatible object store. The reference setup uses
  DigitalOcean Spaces; MinIO works for self-hosted equivalents.
- **Backups** — a separate bucket. A cross-region mirror is
  strongly recommended; lifecycle policies prune old data.

## Database

Postgres 16 on a dedicated host. Earlier versions work in
development; production requires 16+ for the `pg_stat_statements`
features and JSON functions we rely on.

The `archive_command` ships WAL to object storage in real time
([deploy.md](./deploy.md#postgres)); a daily logical dump is taken
on top of that.

## Domain + TLS

You need:

- A domain you control (e.g. `shithub.example`).
- DNS records for the app (`shithub.example`) and the docs
  subdomain (`docs.shithub.example`).
- A TLS certificate. Caddy obtains and renews one via Let's
  Encrypt automatically — no manual cert management — but the DNS
  records must point at your public IP first.

## Email

shithub sends transactional email (signup verification, password
reset, notifications). Three backends:

- `stdout` — prints to the journal. **Dev only.**
- `smtp` — talks to your own MTA or a SaaS that exposes SMTP.
- `postmark` — uses Postmark's HTTP API (the recommended
  production path; Postmark's deliverability is excellent for
  transactional mail).

Warm up your sending domain before going live; cold IPs have
their first thousand emails treated as spam.

## SSH

The git server expects port 22 reachable from your users. The
sshd config (`deploy/sshd_config.j2`) restricts the `git` user to
the AKC-driven flow; the rest of sshd is operator-only.

## Object store credentials

Store both an access key for the runtime (read+write on the
runtime bucket) and a separate set for backups + restore drills
(read-only on the runtime bucket; full access on the backup
bucket). Rotate quarterly.

## Time

NTP must be running; the rate-limit windows, TOTP, and audit-log
timestamps depend on monotonic, accurate time. Ubuntu's
`systemd-timesyncd` is enough.
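
`systemd-timesyncd` reads `/etc/systemd/timesyncd.conf`; pinning an explicit server list is optional (the Ubuntu defaults are usually fine) and looks like:

```ini
# /etc/systemd/timesyncd.conf
[Time]
NTP=0.ubuntu.pool.ntp.org 1.ubuntu.pool.ntp.org
```

`timedatectl status` then shows whether the system clock is synchronized.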

## Skills you'll want

- Comfort with Ansible — the playbook is the install path.
- Postgres operational basics: `pg_dump`/`pg_restore`, monitoring,
  WAL.
- Reading systemd journal output (`journalctl -u …`).

If any of those are unfamiliar, allow extra time on your first
deploy and follow the runbooks closely.

docs/public/self-host/troubleshooting.md (added) @@ -0,0 +1,153 @@

# Troubleshooting

Common operator-visible failure modes and how to diagnose them.

## Caddy: certificate acquisition fails

**Symptom:** Caddy logs show `failed to obtain certificate` for
your domain; the site serves the staging cert (or no cert).

Most often:

- DNS doesn't yet point at the host. Verify with
  `dig +short shithub.example`.
- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs
  port 80 reachable from the public internet (Caddy redirects
  to 443 *after* obtaining the cert). UFW must allow 80.
- Rate-limited by Let's Encrypt. If you've been retrying with
  `caddy_use_acme_staging=false` and failing, you can be locked
  out for a few hours. Switch to `caddy_use_acme_staging=true`,
  reload, then flip back once DNS/firewall is correct.

## sshd: AKC slow / git ssh hangs at "Authenticating"

**Symptom:** `git clone git@…` sits at the auth phase for many
seconds before completing or timing out.

The `AuthorizedKeysCommand` invokes
`/usr/local/bin/shithubd ssh-authkeys %f`, which queries
Postgres. Latency comes from one of:

- DB connection saturation. `pgxpool` is shared across the web
  process; the AKC subcommand opens its own connection. Check
  `pg_stat_activity` for stuck connections.
- DNS for the AKC subprocess (it should not need DNS, but a
  misconfigured `/etc/nsswitch.conf` can stall name-service
  lookups).
- The user has a *very* large number of keys; AKC walks all of
  them. The practical maximum is in the hundreds before you
  notice.

Mitigation: investigate, don't bypass. Disabling the AKC opens a
hole — every key on disk would have shell access.

## Job queue: backlog growing

**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.

```sh
psql -d shithub -c "
  SELECT kind, count(*) FROM jobs WHERE state='queued'
  GROUP BY kind ORDER BY 2 DESC;"
```

If one kind dominates, look for a poison job in the worker logs:

```sh
journalctl -u shithubd-worker -n 500 --grep ERROR
```

If the spread is even, the worker isn't keeping up — scale by
running a second worker host. The `FOR UPDATE SKIP LOCKED`
pattern lets multiple workers coexist safely.

To purge a poison job: mark it `failed` in SQL (don't delete —
you want the audit trail).
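
A hedged sketch of that purge — the `jobs` column names here are assumptions, and `12345` is a placeholder; check the schema before running anything against production:

```sql
-- Take the poison job out of the queue but keep the row for audit.
UPDATE jobs
SET state = 'failed'
WHERE id = 12345          -- the offending job's id
  AND state = 'queued';
```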
| 64 | + | |
| 65 | +## Webhook: deliveries failing 50%+ | |
| 66 | + | |
| 67 | +**Symptom:** `WebhookDeliveryFailing` alert. | |
| 68 | + | |
| 69 | +- A specific subscriber URL is broken. Check the webhook's | |
| 70 | + delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for | |
| 71 | + status codes and bodies. | |
| 72 | +- Outbound network issue. From the host: | |
| 73 | + `curl -v https://example.com/hook`. If outbound HTTPS itself | |
| 74 | + is broken, that's an infrastructure problem. | |
| 75 | +- Auto-disable kicks in after 50 consecutive failures per | |
| 76 | +  webhook; the UI shows a banner and sets `active=false`. The | |
| 77 | +  owner re-enables it once the endpoint is fixed. | |
| 78 | + | |
| 79 | +## Push: "pre-receive hook declined" | |
| 80 | + | |
| 81 | +**Symptom:** User's `git push` fails with a pre-receive | |
| 82 | +rejection. | |
| 83 | + | |
| 84 | +Common reasons: | |
| 85 | + | |
| 86 | +- **Branch protection.** Pushing directly to a protected branch | |
| 87 | +  when the rule requires a PR. The fix is to push to a feature | |
| 88 | +  branch and open a PR instead. | |
| 89 | +- **Repo over quota.** Bare-repo size cap hit; raise per-repo or | |
| 90 | + globally in config. | |
| 91 | +- **Pack contains too-large object.** Large blobs blow the | |
| 92 | + per-object cap (defaults documented in `internal/repos`). Use | |
| 93 | + LFS instead for big binaries. | |
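
To find which object tripped the cap, list the biggest blobs in the offending repo with stock git plumbing (run inside the repository):

```sh
# largest_blobs: print the N largest blobs (bytes, then path)
# reachable from any ref in the current repository.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $3, $4 }' |
    sort -rn |
    head -n "${1:-10}"
}
```

Running `largest_blobs 5` inside a bare repo under the storage root works the same way.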
| 94 | + | |
| 95 | +## Postgres: archive_command failing | |
| 96 | + | |
| 97 | +**Symptom:** `pg_stat_archiver.failed_count` increasing; disk | |
| 98 | +filling on the db host. | |
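
The backlog shows up as `.ready` status files piling up under `pg_wal`; a quick counter (the default path assumes a Debian layout and Postgres 16; adjust for your install):

```sh
# wal_backlog: count WAL segments still waiting on archive_command.
# Default path is an assumption (Debian layout, Postgres 16).
wal_backlog() {
  ls "${1:-/var/lib/postgresql/16/main/pg_wal/archive_status}" 2>/dev/null |
    grep -c '\.ready$'
}

# Run as postgres on the db host; a steadily rising count means stuck.
```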
| 99 | + | |
| 100 | +`journalctl -u postgresql -n 200 --grep "archive_command failed"` | |
| 101 | +will print the rclone error. Common causes: | |
| 102 | + | |
| 103 | +- Rotated bucket credentials. | |
| 104 | +- Network partition to Spaces. | |
| 105 | +- `rclone.conf` missing or unreadable. | |
| 106 | + | |
| 107 | +Confirm by hand: | |
| 108 | + | |
| 109 | +```sh | |
| 110 | +sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \ | |
| 111 | + lsd spaces-prod: | |
| 112 | +``` | |
| 113 | + | |
| 114 | +Fix the underlying issue; Postgres retries automatically. If disk | |
| 115 | +is critically full and Spaces won't be reachable in time, page | |
| 116 | +the operator. **Do not** `pg_archivecleanup` manually unless | |
| 117 | +explicitly directed — you can lose data. | |
| 118 | + | |
| 119 | +## Email: signups never arrive | |
| 120 | + | |
| 121 | +- `auth.email_backend=stdout` in your config — emails went to the | |
| 122 | + journal. Check `journalctl -u shithubd-web --grep '\[email\]'`. | |
| 123 | +- Postmark token wrong / sender domain not verified. Postmark's | |
| 124 | + dashboard logs every send. | |
| 125 | +- DKIM/SPF not set up; the provider may accept and silently drop. | |
| 126 | + | |
| 127 | +To test, send a verification email to an address you control on | |
| 128 | +a domain that bounces with helpful errors (e.g., your own MTA | |
| 129 | +configured to log rejections). | |
| 130 | + | |
| 131 | +## Sign-in fails immediately with no error in journal | |
| 132 | + | |
| 133 | +The usual cause is an expired CSRF token: the cookie no longer | |
| 134 | +matches the form's hidden token. Have the user hard-reload the | |
| 135 | +sign-in page; cookies older than the session lifetime are dropped. | |
| 136 | + | |
| 137 | +If that's not it: check `auth.require_email_verification=true` — | |
| 138 | +the user may need to click a verification link. | |
| 139 | + | |
| 140 | +## "Site appears slow" with no specific symptom | |
| 141 | + | |
| 142 | +Top sources of latency, in order of frequency: | |
| 143 | + | |
| 144 | +1. **N+1 query regression.** DB calls/sec on the Grafana | |
| 145 | + dashboard will be much higher than baseline. | |
| 146 | +2. **GC pressure.** The web process is allocating more than usual | |
| 147 | + — check Grafana's `go_gc_pause_seconds` panel. | |
| 148 | +3. **Cache stampede.** If a hot cache key was just invalidated, | |
| 149 | + first-request-wins via singleflight should prevent this; if | |
| 150 | + it doesn't, check `cache.singleflight_in_flight` for excess. | |
| 151 | +4. **Disk-bound on git ops.** Concurrent large clones pin the | |
| 152 | + web host's disk. Move git transports to a separate process | |
| 153 | + or scale the web tier. | |
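
Without opening Grafana, the same signals are visible on the web process's Prometheus endpoint (the port here is an assumption; use whatever your scrape config targets):

```sh
curl -s localhost:9090/metrics | grep -E 'go_gc_pause_seconds|singleflight'
```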
docs/public/self-host/upgrades.mdadded@@ -0,0 +1,82 @@ | ||
| 1 | +# Upgrades & migrations | |
| 2 | + | |
| 3 | +Routine releases deploy like any other change. Migrations apply | |
| 4 | +automatically; the only place upgrades get exciting is around | |
| 5 | +long migrations and the occasional config schema change. | |
| 6 | + | |
| 7 | +## Standard release | |
| 8 | + | |
| 9 | +```sh | |
| 10 | +git fetch --tags | |
| 11 | +git checkout v<version> | |
| 12 | +make deploy-check ANSIBLE_INVENTORY=staging | |
| 13 | +make deploy ANSIBLE_INVENTORY=staging | |
| 14 | +# ... canary period ... | |
| 15 | +make deploy ANSIBLE_INVENTORY=production | |
| 16 | +``` | |
| 17 | + | |
| 18 | +`shithubd migrate up` runs as the web service's | |
| 19 | +`ExecStartPre=`, so the binary that needs the new schema is also | |
| 20 | +the one that applies it. Order on each host: ExecStartPre runs | |
| 21 | +migrations → web starts on the new schema. | |
| 22 | + | |
| 23 | +If a migration is long (>30s), the release notes call it out. | |
| 24 | +Schedule the deploy outside peak hours; the web service hangs in | |
| 25 | +"activating" until ExecStartPre finishes. | |
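
While the unit sits in "activating", watch the migration from a second terminal (unit name as used elsewhere in these docs):

```sh
systemctl status shithubd-web    # shows "activating" and the ExecStartPre PID
journalctl -u shithubd-web -f    # follow the migration's output live
```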
| 26 | + | |
| 27 | +## Canary | |
| 28 | + | |
| 29 | +Deploy to staging first. Watch for 30 min in Grafana. Things to | |
| 30 | +look at: | |
| 31 | + | |
| 32 | +- p95 latency on top routes. | |
| 33 | +- DB call rate — a 10× jump usually means a regressed N+1. | |
| 34 | +- Job queue depth — a stuck migration reflects here. | |
| 35 | +- Error logs in Loki: `{service="shithubd"} |~ "panic|ERROR"`. | |
| 36 | + | |
| 37 | +If anything looks off, **do not** promote. Rollback on staging | |
| 38 | +is cheap; rollback on production is loud. | |
| 39 | + | |
| 40 | +## Major release (database) | |
| 41 | + | |
| 42 | +If the release notes flag a major schema change: | |
| 43 | + | |
| 44 | +1. Take a manual `pg_dump` immediately before the deploy: | |
| 45 | + `sudo -u postgres /usr/local/bin/shithub-backup-daily`. | |
| 46 | +2. Confirm it landed in Spaces. | |
| 47 | +3. Deploy to staging, run `make restore-drill` against the | |
| 48 | + *post-deploy* dump to confirm the new schema restores cleanly. | |
| 49 | +4. Then production. | |
| 50 | + | |
| 51 | +## Config schema changes | |
| 52 | + | |
| 53 | +When a release adds a required config key, the binary refuses to | |
| 54 | +start and complains in the journal. Update | |
| 55 | +`deploy/ansible/roles/shithubd/templates/web.env.j2` (and | |
| 56 | +`worker.env.j2`), bump the inventory vars, redeploy. There's no | |
| 57 | +separate migration step for env files. | |
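
For example, if a release added a (hypothetical) required key, the template change is one more env line:

```sh
# web.env.j2: hypothetical key; take the real name from the release notes.
SHITHUB_EVENTS_BUS_URL={{ shithub_events_bus_url }}
```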
| 58 | + | |
| 59 | +## Rolling back | |
| 60 | + | |
| 61 | +See [Rollback (in-repo runbook)](https://github.com/tenseleyFlow/shithub/blob/main/docs/internal/runbooks/rollback.md). | |
| 62 | + | |
| 63 | +Three rollback shapes, in preference order: | |
| 64 | + | |
| 65 | +1. **Schema-compatible rollback (best).** If the migration only | |
| 66 | + *added* columns/tables that the old code ignores, the old code | |
| 67 | + runs against the new schema fine. Roll the code back; leave | |
| 68 | + schema alone. Most of our migrations are deliberately additive | |
| 69 | + for this reason. | |
| 70 | +2. **Roll forward to a hotfix.** If the migration changed | |
| 71 | + semantics that the old code can't tolerate, ship a hotfix on | |
| 72 | + top of the new release rather than reversing the migration. | |
| 73 | +3. **Migration `down` + code rollback.** Last resort; some `down`s | |
| 74 | + drop columns and *will* lose data. | |
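
What "deliberately additive" means in practice, as an illustrative (not real) migration:

```sh
psql -d shithub -c "
  -- Additive: old binaries never reference the new column, so rolling
  -- the code back needs no schema change. Table name is illustrative.
  ALTER TABLE repos ADD COLUMN IF NOT EXISTS archived_at timestamptz;"
```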
| 75 | + | |
| 76 | +```sh | |
| 77 | +# (3) only when (1) and (2) won't work | |
| 78 | +ssh web-01 | |
| 79 | +sudo -u shithub /usr/local/bin/shithubd migrate down # ONE step | |
| 80 | +git checkout v<previous> | |
| 81 | +make deploy ANSIBLE_INVENTORY=production | |
| 82 | +``` | |