tenseleyflow/shithub / f1b5339

S38: docs — self-host (prereqs/deploy/configuration/backup-restore/upgrades/troubleshooting/capacity)

Authored by espadonne
SHA: f1b533933e5384f08f9dfd6830629022a65cbf56
Parents: 1938837
Tree: 19cecb6

7 changed files

| Status | File | + | - |
|---|---|---:|---:|
| A | docs/public/self-host/backup-restore.md | 115 | 0 |
| A | docs/public/self-host/capacity.md | 87 | 0 |
| A | docs/public/self-host/configuration.md | 126 | 0 |
| A | docs/public/self-host/deploy.md | 153 | 0 |
| A | docs/public/self-host/prerequisites.md | 98 | 0 |
| A | docs/public/self-host/troubleshooting.md | 153 | 0 |
| A | docs/public/self-host/upgrades.md | 82 | 0 |
docs/public/self-host/backup-restore.md (added)
@@ -0,0 +1,115 @@
+# Backup & restore
+
+Two layers, both mandatory:
+
+- **Continuous WAL archive** — `deploy/postgres/archive_command.sh`
+  ships every WAL segment to `spaces-prod:shithub-wal` in real
+  time (see the sketch below). Recovery point objective (RPO) ≈
+  one WAL segment (~16 MB or one `archive_timeout`, whichever
+  comes first).
+- **Daily logical dump** — `deploy/postgres/backup-daily.sh`
+  takes a `pg_dump --format=custom` once per day and ships it to
+  `spaces-prod:shithub-backups/daily/YYYY/MM/DD/`. The db host
+  keeps the seven most recent dumps locally.
+
+Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
+both buckets to a second region for DR. Lifecycle rules in
+`deploy/spaces/lifecycle.json` prune WAL after 30 days and dumps
+after 90.
+
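+For orientation, a minimal sketch of the shape of the archive
+script — the real `deploy/postgres/archive_command.sh` is the
+source of truth, and the exact flags here are illustrative:
+
+```sh
+#!/bin/sh
+# Postgres calls this as: archive_command = '.../archive_command.sh %p %f'
+# $1 (%p) = path to the WAL segment; $2 (%f) = its file name.
+set -eu
+exec rclone --config /root/.config/rclone/rclone.conf \
+     copyto "$1" "spaces-prod:shithub-wal/$2"
+# Postgres recycles the segment only if this exits 0.
+```
+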
+## Verifying that backups are healthy
+
+The monitoring stack does this for you:
+
+- `BackupOverdue` alert fires if no successful backup in 30h.
+- `pg_stat_archiver.failed_count > 0` fires the archive-failure
+  alert, which pages.
+
+By hand:
+
+```sh
+ssh db
+sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
+     lsf spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/
+```
+
+## Restore drills
+
+Run **quarterly**. The drill restores the latest dump into a
+temporary Postgres instance, runs smoke queries, and tears it
+down. Production is untouched.
+
+```sh
+ssh backup-host
+sudo /usr/local/bin/shithub-restore-drill
+# uses the latest dump in spaces-prod:shithub-backups/daily/
+```
+
+To drill an older dump:
+
+```sh
+sudo /usr/local/bin/shithub-restore-drill --dump /path/to/file.dump
+```
+
+`--keep` preserves the temp pgdata for inspection; the default
+is to clean up.
+
+A failed drill is a P0: it means our backups can't actually be
+restored. File an incident immediately.
+
+## Real restore — full DB lost
+
+Use this when:
+
+- The DB host is destroyed.
+- Postgres can't start and there's no working WAL.
+- Losing up to 24 hours of data is acceptable.
+
+1. Spin up a new db host
+   (`make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=db-new
+   ANSIBLE_TAGS=db`).
+2. Pull the most recent daily dump:
+   ```sh
+   rclone copyto \
+     spaces-prod:shithub-backups/daily/$(date -u +%Y/%m/%d)/<latest>.dump \
+     /tmp/restore.dump
+   ```
+3. Restore:
+   ```sh
+   pg_restore --dbname=shithub --jobs=4 --no-owner --no-privileges /tmp/restore.dump
+   ```
+4. Confirm schema: `shithubd migrate status`.
+5. Bring the app up against the new DB (update `web.env` and
+   `worker.env`).
+6. **Notify users** — there will be a visible activity gap.
+
+## Real restore — point-in-time
+
+Use this when data is intact in WAL but a destructive change
+happened at a known time (`DROP TABLE`, mass UPDATE, runaway
+worker).
+
+This is harder. The procedure:
+
+1. Stop the app (`systemctl stop shithubd-web shithubd-worker
+   shithubd-cron.timer`). Do not let writes continue.
+2. Restore the most recent base backup from before the incident
+   into a fresh data directory.
+3. Configure recovery (`recovery.signal` + GUCs in PG16) pointing
+   at the WAL archive in Spaces with
+   `recovery_target_time = '<UTC timestamp just before incident>'`
+   (see the sketch after this list).
+4. Start Postgres in recovery mode; let it replay WAL.
+5. When recovery completes, promote.
+6. Restart the app.
+
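+A minimal sketch of step 3, assuming rclone as the restore
+transport and `/data/pgdata` as the fresh data directory; the
+timestamp is illustrative:
+
+```sh
+cat >> /data/pgdata/postgresql.auto.conf <<'EOF'
+restore_command = 'rclone --config /root/.config/rclone/rclone.conf copyto spaces-prod:shithub-wal/%f %p'
+recovery_target_time = '2025-06-01 12:34:00+00'
+recovery_target_action = 'pause'
+EOF
+touch /data/pgdata/recovery.signal
+# 'pause' holds at the target so you can inspect before promoting.
+```
+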
+If you have not done a PITR before, **do not** free-style this.
+Mistakes here destroy the data you were trying to save. Find
+someone who has done it and pair with them.
+
+## After any real restore
+
+- Run the restore-drill smoke queries against the restored DB.
+- Reconcile the audit log: entries may reference IDs created
+  after the recovery point that no longer exist.
+- Force-rotate webhook secrets (`shithubd webhook rotate-all`) —
+  if the dump was compromised, the secrets in it may be too.
+- Force-rotate session epochs for any user whose session crossed
+  the recovery window.
docs/public/self-host/capacity.md (added)
@@ -0,0 +1,87 @@
+# Capacity planning
+
+Rule-of-thumb numbers from the S36 bench harness on the reference
+hardware (2-vCPU droplets, Postgres 16 on a dedicated host).
+Treat these as starting points; your traffic patterns will
+differ.
+
+## Single web host (2 vCPU / 4 GB)
+
+| Workload                              | Sustainable rate     |
+|---------------------------------------|----------------------|
+| Anonymous repo home page              | ~600 req/s, p95 < 80 ms |
+| Authenticated dashboard render        | ~250 req/s, p95 < 200 ms |
+| Diff render (PR with ~30 files)       | ~40 req/s, p95 < 600 ms |
+| Issue list + filters                  | ~200 req/s, p95 < 250 ms |
+| Code search (small corpus)            | ~30 req/s, p95 < 800 ms |
+
+The first ceiling you hit is **CPU on diff/highlight rendering**.
+A second web host doubles the budget linearly; the DB is rarely
+the bottleneck for read-heavy traffic on this hardware.
+
+## Postgres host (2 vCPU / 8 GB / 100 GB SSD)
+
+| Workload         | Sustainable rate                     |
+|------------------|--------------------------------------|
+| Read queries     | ~5,000 calls/sec sustained           |
+| Write queries    | ~500 calls/sec sustained             |
+| Connections      | 100 concurrent (`pgxpool max=10` × 10 web procs) |
+
+If the read rate goes much beyond ~5k/s, suspect an N+1 — see
+the troubleshooting page.
+
+## Worker host (2 vCPU / 4 GB)
+
+The worker is bursty: idle most of the time, then chews through
+the queue when something happens. A single worker handles:
+
+- ~150 webhook deliveries/sec (most of the time is spent in the
+  outbound HTTP client).
+- ~40 fan-out events/sec (notifications + activity recompute).
+
+A second worker scales nearly linearly thanks to `FOR UPDATE
+SKIP LOCKED` (sketched below).
+
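+For illustration, the claim pattern looks roughly like this —
+the real query lives in the worker, and the `id` column name is
+assumed:
+
+```sh
+psql -d shithub -c "
+  SELECT id FROM jobs
+  WHERE state = 'queued'
+  ORDER BY id
+  LIMIT 1
+  FOR UPDATE SKIP LOCKED;"
+# Each worker locks different rows, so adding a host adds throughput.
+```
+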
+## Object store
+
+Sized by your push volume, not request rate. Spaces handles the
+PUT rate of every WAL segment + nightly dump comfortably; we've
+never seen the bucket be the bottleneck.
+
+Daily cost order: **WAL > daily dumps > LFS/attachments**. Most
+WAL cost is reclaimed cheaply by the lifecycle rules (30-day
+retention). If your WAL costs run away, your `archive_timeout`
+is probably set too low.
+
+## Repo storage on disk
+
+Plan for ~3× your raw data size on the bare-repo filesystem:
+
+- 1× the actual git data
+- 1× for the `git gc` working set on large repos
+- 1× headroom
+
+Repos that hit ~5 GiB get noticeably slow on clone; consider
+splitting or archiving.
+
+## When to scale up
+
+| Trigger | Action |
+|---|---|
+| p95 latency on top routes climbing | Add a second web host first. |
+| DB call rate consistently > 5k/s | Look for an N+1 *before* upgrading the DB. |
+| Job queue depth > 1k for hours | Add a second worker host. |
+| Disk on db host > 70% | Plan growth (usage doesn't shrink). Grow the disk before considering longer WAL retention. |
+| Disk on web host > 70% | Audit the largest repos; clean up archived repos if possible. Then grow. |
+| Argon2 CPU pegged on signup spikes | The per-IP and per-/24 throttles are doing their job; argon2 is slow on purpose. Tune `auth.argon2.time` only if every login is slow, not just signups. |
+
+## What we haven't measured
+
+The S36 bench harness covers the read-heavy paths. Numbers we
+don't yet have:
+
+- Push throughput at scale (large concurrent pushes).
+- Search corpus performance beyond ~10 GB of code.
+- Webhook fan-out at high event-per-second rates.
+
+If your deployment is going to push these limits, run the bench
+against your own staging before launching.
docs/public/self-host/configuration.md (added)
@@ -0,0 +1,126 @@
+# Configuration reference
+
+shithubd reads configuration from four sources, in increasing
+precedence:
+
+1. **Built-in defaults** (`internal/infra/config/Defaults()`).
+2. **TOML file** at `$SHITHUB_CONFIG` (default
+   `/etc/shithub/config.toml`). An absent file is fine; bad
+   syntax is fatal.
+3. **Environment variables.** `SHITHUB_<area>__<key>` — note the
+   double underscore between nesting levels.
+4. **CLI flags** passed by the caller.
+
+After merging, a small alias table maps backward-compat names
+(e.g., `SHITHUB_DATABASE_URL` → `db.url`); then the resolved
+config is validated. Any failure exits non-zero with a one-line
+error pointing at the offending key.
+
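+A quick way to see the precedence in action — this assumes
+`web.addr` is spelled `[web]`/`addr` in TOML; check your file's
+actual shape:
+
+```sh
+cat >/tmp/config.toml <<'EOF'
+[web]
+addr = ":8080"
+EOF
+# Env beats the file: the resolved config shows :9090.
+SHITHUB_CONFIG=/tmp/config.toml SHITHUB_WEB__ADDR=:9090 \
+  shithubd config print | grep -A1 '\[web\]'
+```
+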
+## Inspecting the active configuration
+
+```sh
+shithubd config print     # writes resolved config as TOML, secrets redacted
+shithubd config validate  # exits non-zero if invalid
+shithubd version          # one-line summary of which sinks are configured
+```
+
+`config print` redacts any field whose name contains `password`,
+`pass`, `secret`, `key`, `token`, `dsn`, or `url` — URL fields
+are redacted because they often carry credentials in the
+userinfo component.
+
+## Reference
+
+| Key | Type | Default | Notes |
+|---|---|---|---|
+| `env` | string | `dev` | One of `dev`, `staging`, `prod`. |
+| `web.addr` | string | `:8080` | Listen address. |
+| `web.read_timeout` | duration | `30s` | Per-request read timeout. |
+| `web.write_timeout` | duration | `30s` | Per-request write timeout. |
+| `web.shutdown_timeout` | duration | `10s` | Graceful drain on SIGTERM. |
+| `db.url` | string | `""` | Postgres DSN. Aliased by `SHITHUB_DATABASE_URL`. |
+| `db.max_conns` | int | `10` | pgxpool max conns. |
+| `db.min_conns` | int | `0` | pgxpool min conns. |
+| `db.connect_timeout` | duration | `5s` | |
+| `log.level` | string | `info` | One of `debug`, `info`, `warn`, `error`. |
+| `log.format` | string | `text` | One of `text`, `json`. Use `json` in prod. |
+| `metrics.enabled` | bool | `true` | Mounts `/metrics`. |
+| `metrics.basic_auth_user` | string | `""` | Together with `pass`, gates `/metrics`. |
+| `metrics.basic_auth_pass` | string | `""` | Redacted on print. |
+| `tracing.enabled` | bool | `false` | When true, `tracing.endpoint` is required. |
+| `tracing.endpoint` | string | `""` | OTLP HTTP endpoint. |
+| `tracing.sample_rate` | float | `0.05` | Parent-based ratio sampler in [0, 1]. |
+| `tracing.service_name` | string | `shithubd` | OTel resource attribute. |
+| `error_reporting.dsn` | string | `""` | Sentry-protocol DSN. Empty disables. |
+| `session.key_b64` | string | `""` | Base64 32-byte AEAD key. **Required in prod.** |
+| `session.max_age` | duration | `720h` | Cookie session lifetime (30 days). |
+| `session.secure` | bool | `false` | Set `Secure` cookie attribute. **True in prod.** |
+| `storage.repos_root` | string | `/data/repos` | Filesystem root for bare repos. |
+| `storage.s3.endpoint` | string | `""` | S3-compatible endpoint host[:port]. Empty disables S3. |
+| `storage.s3.region` | string | `us-east-1` | Region for SigV4 signing. |
+| `storage.s3.access_key_id` | string | `""` | |
+| `storage.s3.secret_access_key` | string | `""` | Redacted. |
+| `storage.s3.bucket` | string | `""` | One bucket per environment. |
+| `storage.s3.use_ssl` | bool | `false` | True for Spaces, false for MinIO. |
+| `storage.s3.force_path_style` | bool | `true` | True for MinIO, false for Spaces. |
+| `auth.require_email_verification` | bool | `true` | Login rejected until primary email verified. |
+| `auth.base_url` | string | `http://127.0.0.1:8080` | Used in transactional email links. **Set to your public URL in prod.** |
+| `auth.site_name` | string | `shithub` | Branding token in email subjects. |
+| `auth.email_from` | string | `shithub <noreply@shithub.local>` | Envelope From. |
+| `auth.email_backend` | string | `stdout` | One of `stdout`, `smtp`, `postmark`. |
+| `auth.smtp.addr` | string | `127.0.0.1:1025` | Required when backend=`smtp`. |
+| `auth.smtp.username` | string | `""` | Optional. |
+| `auth.smtp.password` | string | `""` | Optional. Redacted. |
+| `auth.postmark.server_token` | string | `""` | Required when backend=`postmark`. Redacted. |
+| `auth.argon2.memory_kib` | uint32 | `65536` | argon2id memory cost (KiB). |
+| `auth.argon2.time` | uint32 | `3` | argon2id iterations. |
+| `auth.argon2.threads` | uint8 | `2` | argon2id parallelism. |
+| `auth.totp_key_b64` | string | `""` | Base64 32-byte AEAD for TOTP secrets. **Required if 2FA is in use.** |
+
+## Examples
+
+```sh
+# Listen elsewhere
+export SHITHUB_WEB__ADDR=:9090
+
+# Postgres
+export SHITHUB_DATABASE_URL=postgres://shithub:****@127.0.0.1:5432/shithub?sslmode=require
+
+# JSON logs for prod
+export SHITHUB_LOG__FORMAT=json
+export SHITHUB_LOG__LEVEL=info
+
+# Tracing
+export SHITHUB_TRACING__ENABLED=true
+export SHITHUB_TRACING__ENDPOINT=http://otel-collector:4318
+export SHITHUB_TRACING__SAMPLE_RATE=0.05
+
+# Session AEAD key (session.key_b64)
+export SHITHUB_SESSION__KEY_B64=$(openssl rand -base64 32)
+
+# Metrics behind Basic auth
+export SHITHUB_METRICS__BASIC_AUTH_USER=prom
+export SHITHUB_METRICS__BASIC_AUTH_PASS=<long-random>
+```
+
+## Secrets handling
+
+- Never commit secrets. `.env` is gitignored; production secrets
+  live in a systemd `EnvironmentFile=` with mode `0600` (see the
+  sketch below).
+- The Ansible roles read sensitive values from the inventory
+  (which you keep encrypted with `ansible-vault` or your secret
+  store) and template them into `/etc/shithub/web.env` and
+  `worker.env`.
+- Log-line redaction is independent of `config print` redaction;
+  both layers exist on purpose.
+
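+A hand-rolled sketch of such an env file — the Ansible role is
+the normal path, and the values here are placeholders:
+
+```sh
+install -m 0600 -o root -g root /dev/null /etc/shithub/web.env
+cat >>/etc/shithub/web.env <<'EOF'
+SHITHUB_DATABASE_URL=postgres://shithub:<password>@127.0.0.1:5432/shithub?sslmode=require
+SHITHUB_SESSION__KEY_B64=<base64 32-byte key>
+SHITHUB_LOG__FORMAT=json
+EOF
+```
+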
+## Adding a key
+
+If you fork shithub and add a config key:
+
+1. Add the field to the appropriate struct in
+   `internal/infra/config/config.go` with a `toml:` tag.
+2. Set its default in `Defaults()`.
+3. Add validation in `Validate()` if it has invariants.
+4. If it carries a secret, name it so the redactor matches.
+5. Document it here.
docs/public/self-host/deploy.md (added)
@@ -0,0 +1,153 @@
+# Deploying with Ansible
+
+The reference install path is the Ansible playbook in
+`deploy/ansible/`. From a fresh Ubuntu 24.04 box the workflow is:
+inventory → `make deploy` → bootstrap an admin → done.
+
+## 1. Inventory
+
+Copy the example inventory:
+
+```sh
+cp deploy/ansible/inventory/staging.example deploy/ansible/inventory/staging
+$EDITOR deploy/ansible/inventory/staging
+```
+
+Variables marked `# REQUIRED` in the example come from your
+secret store (Bitwarden, 1Password, ansible-vault). **Do not**
+commit a real inventory file. The repo's `.gitignore` excludes
+the bare names `staging` and `production` to make this obvious.
+
+Critical variables:
+
+- `db_password` — Postgres password for the `shithub` role.
+- `hook_password` — Postgres password for the `shithub_hook`
+  role.
+- `session_key` — base64 32-byte AEAD key. Generate with
+  `openssl rand -base64 32`.
+- `totp_key` — base64 32-byte AEAD key for at-rest TOTP secrets.
+- `s3_*` — bucket name, region, credentials.
+- `email_*` — Postmark token or SMTP creds.
+- `wireguard_*` — peer keys for the monitoring mesh.
+
+## 2. Dry-run
+
+```sh
+make deploy-check ANSIBLE_INVENTORY=staging
+```
+
+Reports every change Ansible would make without making it. Read
+the diff before continuing.
+
+## 3. Apply
+
+```sh
+make deploy ANSIBLE_INVENTORY=staging
+```
+
+The playbook is idempotent; a second run should report
+`changed=0`. If it doesn't, that's a config drift bug — open an
+issue.
+
+The roles, in order:
+
+- **base** — apt baseline, ufw default-deny, system users (`shithub`,
+  `shithub-ssh`), data root.
+- **postgres** — installs PG16, initdb's `/data/pgdata`, applies
+  our `postgresql.conf` and `pg_hba.conf`, wires the WAL archive
+  command, creates the `shithub` and `shithub_hook` roles with
+  exact-grant permissions.
+- **shithubd** — copies the binary into `/usr/local/bin`, drops
+  env files into `/etc/shithub/`, installs the three systemd
+  units. The web service's `ExecStartPre=` runs `shithubd migrate
+  up`, so deploys that carry new schema are one command.
+- **caddy** — installs Caddy + the templated `Caddyfile`.
+  Automatic TLS via Let's Encrypt staging until you flip
+  `caddy_use_acme_staging=false`.
+- **wireguard** — peers each host into the 10.50.0.0/24 mesh.
+- **backup** — installs the daily backup timer on the db host
+  and the cross-region sync timer on the backup host.
+- **monitoring-client** — node-exporter + promtail on every
+  host, pointing at the monitoring host.
+
+## 4. Bootstrap the first admin
+
+The first deploy creates no users. SSH to the web host and run:
+
+```sh
+sudo -u shithub /usr/local/bin/shithubd admin bootstrap-admin --email you@example.com
+```
+
+The CLI creates a user (or grants site-admin to an existing one)
+and prints a one-time password-reset link. Open it in a browser,
+set a password, enable 2FA. Subsequent admin grants happen
+through `/admin/users/{id}`.
+
+## 5. Smoke
+
+- `https://shithub.example/` — Caddy serves the home page.
+- `https://shithub.example/-/health` — returns `200 OK` with the
+  build version (see below).
+- Sign in as the bootstrap admin. Create a test repo. Push to it.
+- `https://shithub.example/admin/` — admin dashboard renders.
+
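+The first two checks, scriptable (substitute your domain):
+
+```sh
+curl -fsS -o /dev/null -w '%{http_code}\n' https://shithub.example/
+curl -fsS https://shithub.example/-/health   # prints the build version
+```
+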
+## 6. Production
+
+Same procedure with the production inventory:
+
+```sh
+make deploy ANSIBLE_INVENTORY=production
+```
+
+For larger deploys, target a single host first:
+
+```sh
+make deploy ANSIBLE_INVENTORY=production ANSIBLE_LIMIT=web-02
+```
+
+## Postgres
+
+The `postgres` role does the heavy lifting; nothing
+operator-visible should be needed. Specific things to be aware
+of:
+
+- **Archive command.** `deploy/postgres/archive_command.sh` ships
+  every WAL segment to `spaces-prod:shithub-wal`. Postgres
+  refuses to recycle a segment until the script reports success;
+  a failing archiver therefore fills the disk. Alert on
+  `pg_stat_archiver.failed_count > 0` (the default rule does).
+- **Hook role.** A separate Postgres role `shithub_hook` is used
+  by the `shithubd hook …` subcommands. It has minimal grants —
+  see `deploy/postgres/hook-role-grants.sql`. If you add a new
+  hook subcommand that touches a new table, update those grants.
+
+## Caddy
+
+The Caddyfile (`deploy/Caddyfile.j2`) reverse-proxies to
+`127.0.0.1:8080`. Two special path patterns:
+
+- `/_/<owner>/<repo>.git/(info/refs|git-(upload|receive)-pack)`
+  uses a 30-minute timeout to accommodate large pushes and
+  clones.
+- Static assets under `/static/` get aggressive cache headers.
+
+Access logs are JSON to `/var/log/caddy/access.log`; promtail
+ships them to Loki.
+
+## sshd
+
+The provided `sshd_config.j2` is conservative: PasswordAuthentication
+off, PubkeyAuthentication on, X11/agent/TCP forwarding off. The
+`Match User git` block uses the `AuthorizedKeysCommand` (AKC)
+contract — see
+[Authentication / SSH on the user side](../user/ssh.md).
+
+## Rolling forward
+
+```sh
+git fetch --tags
+git checkout v<new-version>
+make deploy ANSIBLE_INVENTORY=production
+```
+
+That's it. The `migrate up` is part of the unit's startup
+preflight; if a migration fails, the unit stays in `activating`
+and the journal has the error. See
+[Upgrades](./upgrades.md) for the major-version checklist.
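+
+If a deploy does wedge there, the usual inspection is:
+
+```sh
+systemctl status shithubd-web      # shows "activating" while ExecStartPre runs
+journalctl -u shithubd-web -n 100  # the migration error lands here
+```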
docs/public/self-host/prerequisites.md (added)
@@ -0,0 +1,98 @@
+# Prerequisites
+
+What you need before you start. The reference deployment targets
+DigitalOcean + DigitalOcean Spaces; if you're on a different
+provider, the shape is the same — substitute equivalents.
+
+## Compute
+
+Three machines for staging is comfortable; production starts at
+five. The minimum useful sizes:
+
+| Role         | vCPU | RAM  | Disk | Notes                                    |
+|--------------|-----:|-----:|-----:|------------------------------------------|
+| web          | 2    | 4 GB | 40 GB | Run two for HA.                         |
+| worker       | 2    | 4 GB | 40 GB | Can colocate with web for small instances. |
+| postgres     | 2    | 8 GB | 100 GB | Start at 100 GB; grow with disk usage.  |
+| backup       | 1    | 2 GB | 80 GB | Idle most of the time; runs cron only.  |
+| monitoring   | 2    | 4 GB | 60 GB | Prom + Loki + Alertmanager + Grafana.   |
+
+Ubuntu 24.04 LTS is the supported OS. Other Debian derivatives
+work but are not regularly tested.
+
+## Storage
+
+shithub stores three categories:
+
+- **Bare repos** — on the web/worker hosts at `/data/repos`. Local
+  filesystem; **not** in object storage. Sized to total push
+  volume.
+- **Webhook delivery bodies, LFS, attachments** — in an
+  S3-compatible object store. The reference setup uses
+  DigitalOcean Spaces; MinIO works for self-hosted equivalents.
+- **Backups** — a separate bucket. A cross-region mirror is
+  strongly recommended; lifecycle policies prune old data.
+
+## Database
+
+Postgres 16 on a dedicated host. Earlier versions work in
+development; production requires 16+ for the `pg_stat_statements`
+features and JSON functions we rely on.
+
+The `archive_command` ships WAL to object storage in real time
+([deploy.md](./deploy.md#postgres)); a daily logical dump is taken
+on top of that.
+
+## Domain + TLS
+
+You need:
+
+- A domain you control (e.g. `shithub.example`).
+- DNS records for the app (`shithub.example`) and the docs
+  subdomain (`docs.shithub.example`).
+- A TLS certificate. Caddy obtains and renews one via Let's
+  Encrypt automatically — no manual cert management — but the
+  DNS records must point at your public IP first.
+
+## Email
+
+shithub sends transactional email (signup verification, password
+reset, notifications). Three backends:
+
+- `stdout` — prints to the journal. **Dev only.**
+- `smtp` — talks to your own MTA or a SaaS that exposes SMTP.
+- `postmark` — uses Postmark's HTTP API (the recommended
+  production path; Postmark's deliverability is excellent for
+  transactional mail).
+
+Warm up your sending domain before going live; a cold IP's first
+thousand emails tend to be treated as spam.
+
+## SSH
+
+The git server expects port 22 reachable by your users. The
+sshd config (`deploy/sshd_config.j2`) restricts the `git` user to
+the AKC-driven flow; the rest of sshd is operator-only.
+
+## Object store credentials
+
+Provision two credential sets: one for the runtime (read+write
+on the runtime bucket) and one for backups and the restore drill
+(read-only on the runtime bucket; full access on the backup
+bucket). Rotate both quarterly.
+
+## Time
+
+NTP must be running; the rate-limit windows, TOTP, and audit-log
+timestamps all depend on accurate time. Ubuntu's
+`systemd-timesyncd` is enough.
+
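+Quick check that time sync is actually on:
+
+```sh
+timedatectl show -p NTPSynchronized
+# expect: NTPSynchronized=yes
+```
+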
+## Skills you'll want
+
+- Comfort with Ansible — the playbook is the install path.
+- Postgres operational basics: pg_dump/pg_restore, monitoring,
+  WAL.
+- Reading systemd journal output (`journalctl -u …`).
+
+If any of those are unfamiliar, allow extra time on your first
+deploy and follow the runbooks closely.
docs/public/self-host/troubleshooting.md (added)
@@ -0,0 +1,153 @@
+# Troubleshooting
+
+Common operator-visible failure modes and how to diagnose them.
+
+## Caddy: certificate acquisition fails
+
+**Symptom:** Caddy logs show `failed to obtain certificate` for
+your domain; the site serves the staging cert (or no cert).
+
+Most often:
+
+- DNS doesn't yet point at the host. Verify with `dig +short
+  shithub.example`.
+- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs
+  port 80 reachable from the public internet (Caddy redirects
+  to 443 *after* obtaining the cert). UFW must allow 80.
+- Rate-limited by Let's Encrypt. If you've been retrying with
+  `caddy_use_acme_staging=false` and failing, you can be locked
+  out for a few hours. Switch to `caddy_use_acme_staging=true`,
+  reload, then flip back once DNS/firewall are correct.
+
+## sshd: AKC slow / git ssh hangs at "Authenticating"
+
+**Symptom:** `git clone git@…` sits at the auth phase for many
+seconds before completing or timing out.
+
+The `AuthorizedKeysCommand` invokes `/usr/local/bin/shithubd
+ssh-authkeys %f`, which queries Postgres. Latency comes from one
+of:
+
+- DB connection saturation. The web process holds the `pgxpool`;
+  the AKC subcommand opens its own connection. Check
+  `pg_stat_activity` for stuck connections.
+- DNS for the AKC subprocess (it should not need DNS, but a
+  misconfigured `/etc/nsswitch.conf` can stall lookups).
+- The user has a *very* large number of keys; AKC walks all of
+  them. The practical ceiling is a few hundred keys before you
+  notice.
+
+Mitigation: investigate, don't bypass (see the timing check
+below). Disabling the AKC opens a hole — every key on disk would
+have shell access.
+
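+To see where the time goes, run the AKC by hand — the
+`shithub-ssh` user and the fingerprint argument here are
+illustrative:
+
+```sh
+time sudo -u shithub-ssh /usr/local/bin/shithubd ssh-authkeys 'SHA256:<fingerprint>'
+```
+
+If this is slow in isolation, the problem is in the DB/DNS path
+above, not in sshd itself.
+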
+## Job queue: backlog growing
+
+**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.
+
+```sh
+psql -d shithub -c "
+  SELECT kind, count(*) FROM jobs WHERE state='queued'
+  GROUP BY kind ORDER BY 2 DESC;"
+```
+
+If one kind dominates, look for a poison job in worker logs:
+
+```sh
+journalctl -u shithubd-worker -n 500 --grep ERROR
+```
+
+If the spread is even, the worker isn't keeping up — scale by
+running a second worker host. The `FOR UPDATE SKIP LOCKED`
+pattern lets multiple workers coexist safely.
+
+To purge a poison job: mark it `failed` in SQL (don't delete —
+you want the audit trail).
+
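+For example — column names follow the query above; adjust to the
+actual schema:
+
+```sh
+psql -d shithub -c "UPDATE jobs SET state = 'failed' WHERE id = <job-id>;"
+```
+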
+## Webhook: deliveries failing 50%+
+
+**Symptom:** `WebhookDeliveryFailing` alert.
+
+- A specific subscriber URL is broken. Check the webhook's
+  delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for
+  status codes and bodies.
+- Outbound network issue. From the host:
+  `curl -v https://example.com/hook`. If outbound HTTPS itself
+  is broken, that's an infrastructure problem.
+- Auto-disable kicks in after 50 consecutive failures per
+  webhook; the UI shows a banner and sets `active=false`. The
+  owner re-enables it once the endpoint is fixed.
+
+## Push: "pre-receive hook declined"
+
+**Symptom:** A user's `git push` fails with a pre-receive
+rejection.
+
+Common reasons:
+
+- **Branch protection.** Pushing directly to a protected branch
+  when the rule requires a PR. The user should push to a feature
+  branch instead.
+- **Repo over quota.** The bare-repo size cap was hit; raise it
+  per-repo or globally in config.
+- **Pack contains a too-large object.** Large blobs exceed the
+  per-object cap (defaults documented in `internal/repos`). Use
+  LFS instead for big binaries.
+
+## Postgres: archive_command failing
+
+**Symptom:** `pg_stat_archiver.failed_count` increasing; disk
+filling on the db host.
+
+`journalctl -u postgresql -n 200 --grep "archive_command failed"`
+will print the rclone error. Common causes:
+
+- Rotated bucket credentials.
+- Network partition to Spaces.
+- `rclone.conf` missing or unreadable.
+
+Confirm by hand:
+
+```sh
+sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
+     lsd spaces-prod:
+```
+
+Fix the underlying issue; Postgres retries automatically. If disk
+is critically full and Spaces won't be reachable in time,
+escalate — this is an incident. **Do not** run
+`pg_archivecleanup` manually unless explicitly directed — you can
+lose data.
+
+## Email: signups never arrive
+
+- `auth.email_backend=stdout` in your config — emails went to the
+  journal. Check `journalctl -u shithubd-web --grep '\[email\]'`.
+- Postmark token wrong / sender domain not verified. Postmark's
+  dashboard logs every send.
+- DKIM/SPF not set up; the provider may accept and silently drop.
+
+To test, send a verification email to an address you control on
+a domain that bounces with helpful errors (e.g., your own MTA
+configured to log rejections).
+
+## Sign-in fails immediately with no error in the journal
+
+CSRF token expired (the cookie didn't match the form's hidden
+token). Force a hard reload of the sign-in page; cookies older
+than the session lifetime are dropped.
+
+If that's not it: check `auth.require_email_verification=true` —
+the user may need to click a verification link.
+
+## "Site appears slow" with no specific symptom
+
+Top sources of latency, in order of frequency:
+
+1. **N+1 query regression.** DB calls/sec on the Grafana
+   dashboard will be much higher than baseline.
+2. **GC pressure.** The web process is allocating more than usual
+   — check Grafana's `go_gc_pause_seconds` panel.
+3. **Cache stampede.** If a hot cache key was just invalidated,
+   first-request-wins via singleflight should prevent this; if
+   it doesn't, check `cache.singleflight_in_flight` for an
+   excess of concurrent loads.
+4. **Disk-bound git ops.** Concurrent large clones pin the
+   web host's disk. Move git transports to a separate process
+   or scale the web tier.
docs/public/self-host/upgrades.md (added)
@@ -0,0 +1,82 @@
+# Upgrades & migrations
+
+Routine releases are ordinary deploys. Migrations apply
+automatically; the only place upgrades get exciting is around
+long migrations and the occasional config schema change.
+
+## Standard release
+
+```sh
+git fetch --tags
+git checkout v<version>
+make deploy-check ANSIBLE_INVENTORY=staging
+make deploy ANSIBLE_INVENTORY=staging
+# ... canary period ...
+make deploy ANSIBLE_INVENTORY=production
+```
+
+`shithubd migrate up` runs as the web service's
+`ExecStartPre=`, so the binary that needs the new schema is also
+the one that applies it. Order on each host: ExecStartPre runs
+migrations → web starts on the new schema.
+
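+To confirm the preflight is wired up on a host:
+
+```sh
+systemctl cat shithubd-web | grep ExecStartPre
+# expect a line invoking: shithubd migrate up
+```
+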
+If a migration is long (>30s), the release notes call it out.
+Schedule the deploy outside peak hours; the web service sits in
+`activating` until ExecStartPre finishes.
+
+## Canary
+
+Deploy to staging first and watch Grafana for 30 minutes. Things
+to look at:
+
+- p95 latency on top routes.
+- DB call rate — a 10× jump usually means a regressed N+1.
+- Job queue depth — a stuck migration shows up here.
+- Error logs in Loki: `{service="shithubd"} |~ "panic|ERROR"`.
+
+If anything looks off, **do not** promote. Rollback on staging
+is cheap; rollback on production is loud.
+
+## Major release (database)
+
+If the release notes flag a major schema change:
+
+1. Take a manual `pg_dump` immediately before the deploy:
+   `sudo -u postgres /usr/local/bin/shithub-backup-daily`.
+2. Confirm it landed in Spaces.
+3. Deploy to staging, then run `make restore-drill` against the
+   *post-deploy* dump to confirm the new schema restores cleanly.
+4. Then production.
+
+## Config schema changes
+
+When a release adds a required config key, the binary refuses to
+start and complains in the journal. Update
+`deploy/ansible/roles/shithubd/templates/web.env.j2` (and
+`worker.env.j2`), bump the inventory vars, and redeploy. There's
+no separate migration step for env files.
+
+## Rolling back
+
+See [Rollback (in-repo runbook)](https://github.com/tenseleyFlow/shithub/blob/main/docs/internal/runbooks/rollback.md).
+
+Three rollback shapes, in preference order:
+
+1. **Schema-compatible rollback (best).** If the migration only
+   *added* columns/tables that the old code ignores, the old code
+   runs fine against the new schema. Roll the code back; leave
+   the schema alone. Most of our migrations are deliberately
+   additive for this reason.
+2. **Roll forward to a hotfix.** If the migration changed
+   semantics that the old code can't tolerate, ship a hotfix on
+   top of the new release rather than reversing the migration.
+3. **Migration `down` + code rollback.** Last resort; some `down`s
+   drop columns and *will* lose data.
+
+```sh
+# (3) only when (1) and (2) won't work
+ssh web-01
+sudo -u shithub /usr/local/bin/shithubd migrate down  # ONE step
+git checkout v<previous>
+make deploy ANSIBLE_INVENTORY=production
+```