# Troubleshooting

Common operator-visible failure modes and how to diagnose them.

## Caddy: certificate acquisition fails

**Symptom:** Caddy logs show `failed to obtain certificate` for your domain; the site serves the staging cert (or no cert at all).

Most often one of:

- DNS doesn't yet point at the host. Verify with `dig +short shithub.sh`.
- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs port 80 reachable from the public internet (Caddy redirects to 443 *after* obtaining the cert), so UFW must allow 80.
- Rate-limited by Let's Encrypt. If you've been retrying with `caddy_use_acme_staging=false` and failing, you can be locked out for a few hours. Switch to `caddy_use_acme_staging=true`, reload, then flip back once DNS and the firewall are correct.

## sshd: AKC slow / git ssh hangs at "Authenticating"

**Symptom:** `git clone git@…` sits at the auth phase for many seconds before completing or timing out.

The `AuthorizedKeysCommand` (AKC) invokes `/usr/local/bin/shithubd ssh-authkeys %f`, which queries Postgres. Latency comes from one of:

- DB connection saturation. The `pgxpool` is shared across the web process; the AKC subcommand opens its own connection. Check `pg_stat_activity` for stuck connections.
- DNS for the AKC subprocess. It should not need DNS, but a misconfigured `/etc/nsswitch.conf` can stall its name-service lookups.
- The user has a *very* large number of keys; the AKC walks all of them. The practical maximum is in the hundreds before you notice.

Mitigation: investigate, don't bypass. Disabling the AKC opens a hole — every key on disk would have shell access.

## Job queue: backlog growing

**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.

```sh
psql -d shithub -c "
  SELECT kind, count(*) FROM jobs
  WHERE state='queued'
  GROUP BY kind ORDER BY 2 DESC;"
```

If one kind dominates, look for a poison job in the worker logs:

```sh
journalctl -u shithubd-worker -n 500 --grep ERROR
```

If the spread is even, the worker isn't keeping up — scale out by running a second worker host. The `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely.

To purge a poison job: mark it `failed` in SQL (don't delete — you want the audit trail); see the sketch below.
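A minimal sketch of that purge, assuming the `jobs` table has an `id` primary key (the `12345` below is a placeholder; look up the real id in the worker's error log first):

```sh
# Mark a single poison job failed instead of deleting it.
# The `id` column is an assumption -- confirm against your actual schema.
psql -d shithub -c "
  UPDATE jobs SET state='failed'
  WHERE id = 12345 AND state='queued';"
```

The `AND state='queued'` guard keeps the update from clobbering a job a worker has since picked up.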
## Webhook: deliveries failing 50%+

**Symptom:** `WebhookDeliveryFailing` alert.

- A specific subscriber URL is broken. Check the webhook's delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for status codes and response bodies.
- An outbound network issue. From the host: `curl -v https://example.com/hook`. If outbound HTTPS itself is broken, that's an infrastructure problem.
- Auto-disable kicked in: after 50 consecutive failures per webhook, the UI shows a banner and sets `active=false`. The owner re-enables it once the endpoint is fixed.

## Push: "pre-receive hook declined"

**Symptom:** A user's `git push` fails with a pre-receive rejection. Common reasons:

- **Branch protection.** Pushing directly to a protected branch when the rule requires a PR. The user should push to a feature branch instead.
- **Repo over quota.** The bare-repo size cap was hit; raise it per-repo or globally in config.
- **Pack contains a too-large object.** Large blobs blow the per-object cap (defaults documented in `internal/repos`). Use LFS for big binaries instead.

## Postgres: archive_command failing

**Symptom:** `pg_stat_archiver.failed_count` increasing; disk filling on the db host.

`journalctl -u postgresql -n 200 --grep "archive_command failed"` will print the rclone error. Common causes:

- Rotated bucket credentials.
- Network partition to Spaces.
- `rclone.conf` missing or unreadable.

Confirm by hand:

```sh
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
  lsd spaces-prod:
```

Fix the underlying issue; Postgres retries automatically. If the disk is critically full and Spaces won't be reachable in time, page the operator. **Do not** run `pg_archivecleanup` manually unless explicitly directed — you can lose data.

## Email: signups never arrive

- `auth.email_backend=stdout` in your config — the emails went to the journal. Check `journalctl -u shithubd-web --grep '\[email\]'`.
- Postmark token wrong, or sender domain not verified. Postmark's dashboard logs every send.
- DKIM/SPF not set up; the provider may accept the message and silently drop it. To test, send a verification email to an address you control on a domain that bounces with helpful errors (e.g., your own MTA configured to log rejections).

## Sign-in fails immediately with no error in the journal

The usual cause is an expired CSRF token (the cookie didn't match the form's hidden token). Force a hard reload of the sign-in page; cookies older than the session lifetime are dropped.

If that's not it: check `auth.require_email_verification=true` — the user may need to click a verification link.

## "Site appears slow" with no specific symptom

Top sources of latency, in order of frequency:

1. **N+1 query regression.** DB calls/sec on the Grafana dashboard will be much higher than baseline.
2. **GC pressure.** The web process is allocating more than usual — check Grafana's `go_gc_pause_seconds` panel.
3. **Cache stampede.** If a hot cache key was just invalidated, first-request-wins via singleflight should prevent this; if it doesn't, check `cache.singleflight_in_flight` for excess.
4. **Disk-bound git ops.** Concurrent large clones pin the web host's disk; see the sketch after this list. Move git transports to a separate process or scale the web tier.
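For item 4, a quick way to confirm the disk really is the bottleneck, using the standard `sysstat` tools (nothing here is shithub-specific; install `sysstat` if the commands are missing):

```sh
# Extended per-device stats: 5-second interval, 3 reports. %util pinned near
# 100 with high await on the git data volume means the disk is saturated.
iostat -x 5 3

# Per-process disk IO over the same window; large git readers during a
# clone storm confirm the diagnosis.
pidstat -d 5 3
```

If saturation appears only while big clones run, the fix is capacity (move git transports off the web host), not application tuning.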