# Troubleshooting
Common operator-visible failure modes and how to diagnose them.
## Caddy: certificate acquisition fails

**Symptom:** Caddy logs show `failed to obtain certificate` for
your domain; the site serves the staging cert (or no cert).

Most often:

- DNS doesn't yet point at the host. Verify with
  `dig +short shithub.sh`.
- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs
  port 80 reachable from the public internet (Caddy redirects
  to 443 *after* obtaining the cert). UFW must allow 80.
- Rate-limited by Let's Encrypt. If you've been retrying with
  `caddy_use_acme_staging=false` and failing, you can be locked
  out for a few hours. Switch to `caddy_use_acme_staging=true`,
  reload, then flip back when DNS/firewall is correct (see the
  pre-flight check below).
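Before switching `caddy_use_acme_staging` back to `false`, run a
quick pre-flight from a machine outside your network. A minimal
sketch; substitute your own domain:

```sh
# DNS must resolve to this host's public IP
dig +short shithub.sh

# Port 80 must be reachable from the public internet
# (any HTTP status code proves the port is open)
curl -sI -o /dev/null -w '%{http_code}\n' http://shithub.sh/

# UFW must allow 80 (and 443)
sudo ufw status | grep -wE '80|443'
```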
## sshd: AKC slow / git ssh hangs at "Authenticating"

**Symptom:** `git clone git@…` sits at the auth phase for many
seconds before completing or timing out.

The `AuthorizedKeysCommand` invokes `/usr/local/bin/shithubd
ssh-authkeys %f`, which queries Postgres. Latency comes from one
of:

- DB connection saturation. `pgxpool` is shared across the web
  process; the AKC subcommand opens its own connection. Check
  `pg_stat_activity` for stuck connections.
- DNS for the AKC subprocess (it should not need DNS, but a
  misconfigured `/etc/nsswitch.conf` can stall stat calls).
- The user has a *very* large number of keys; AKC walks all of
  them. Practical maximum is in the hundreds before you notice.

Mitigation: investigate, don't bypass. Disabling the AKC opens a
hole — every key on disk would have shell access.
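To localize the latency, time one AKC invocation by hand and look
for connection pile-ups in Postgres. A sketch: the fingerprint is
a placeholder, and you may need to run it as whatever user
`AuthorizedKeysCommandUser` names in your `sshd_config`:

```sh
# Time a single lookup (substitute a real key fingerprint for %f)
time /usr/local/bin/shithubd ssh-authkeys 'SHA256:REPLACE_ME'

# Where are Postgres connections stuck?
psql -d shithub -c "
  SELECT state, wait_event_type, count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2 ORDER BY 3 DESC;"
```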
## Job queue: backlog growing

**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.

```sh
psql -d shithub -c "
SELECT kind, count(*) FROM jobs WHERE state='queued'
GROUP BY kind ORDER BY 2 DESC;"
```

If one kind dominates, look for a poison job in worker logs:

```sh
journalctl -u shithubd-worker -n 500 --grep ERROR
```

If the spread is even, the worker isn't keeping up — scale by
running a second worker host. The `FOR UPDATE SKIP LOCKED`
pattern lets multiple workers coexist safely (see the sketch
below).
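For reference, the shape of a dequeue under `FOR UPDATE SKIP
LOCKED`, as a sketch with column names assumed from the query
above. Each worker's inner `SELECT` skips rows another
transaction has locked, so two workers never claim the same job:

```sh
psql -d shithub -c "
  UPDATE jobs SET state='running'
  WHERE id = (
    SELECT id FROM jobs
    WHERE state='queued'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1)
  RETURNING id, kind;"
```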
To purge a poison job: mark it `failed` in SQL (don't delete —
you want the audit trail).
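For example, assuming the `jobs` columns shown earlier (the `id`
is hypothetical):

```sh
# The state guard avoids clobbering a job a worker already picked up
psql -d shithub -c "
  UPDATE jobs SET state='failed'
  WHERE id = 12345 AND state='queued';"
```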
## Webhook: deliveries failing 50%+

**Symptom:** `WebhookDeliveryFailing` alert.

- A specific subscriber URL is broken. Check the webhook's
  delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for
  status codes and bodies.
- Outbound network issue. From the host:
  `curl -v https://example.com/hook`. If outbound HTTPS itself
  is broken, that's an infrastructure problem.
- Auto-disable kicks in after 50 consecutive failures per
  webhook; the UI shows a banner and sets `active=false`. Owner
  re-enables once the endpoint is fixed (see the query sketch
  below).
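To find every auto-disabled hook instance-wide instead of
checking repos one by one, a hypothetical query — the `webhooks`
table and column names are assumptions inferred from the
`active=false` flag above:

```sh
psql -d shithub -c "
  SELECT id, url FROM webhooks WHERE active = false;"
```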
Push: "pre-receive hook declined"
Symptom: User's git push fails with a pre-receive
rejection.
Common reasons:
- Branch protection. Pushing directly to a protected branch when the rule requires a PR. User pushes to a feature branch instead.
- Repo over quota. Bare-repo size cap hit; raise per-repo or globally in config.
- Pack contains too-large object. Large blobs blow the
per-object cap (defaults documented in
internal/repos). Use LFS instead for big binaries.
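On the user's side, the standard Git LFS workflow; the file
pattern is illustrative:

```sh
git lfs install                      # once per machine
git lfs track '*.iso'                # pattern is an example
git add .gitattributes big-image.iso
git commit -m "track large binaries via LFS"
git push
```

Note that tracking only affects new commits; if the oversized
blob is already in history, `git lfs migrate import
--include='*.iso'` rewrites it into LFS.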
## Postgres: archive_command failing

**Symptom:** `pg_stat_archiver.failed_count` increasing; disk
filling on the db host.

`journalctl -u postgresql -n 200 --grep "archive_command failed"`
will print the rclone error. Common causes:

- Rotated bucket credentials.
- Network partition to Spaces.
- `rclone.conf` missing or unreadable.

Confirm by hand:

```sh
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
  lsd spaces-prod:
```

Fix the underlying issue; Postgres retries automatically. If disk
is critically full and Spaces won't be reachable in time, page
the operator. **Do not** run `pg_archivecleanup` manually unless
explicitly directed — you can lose data.
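Two quick gauges of remaining runway, using a standard Postgres
view and coreutils; the data-directory path is the Debian/Ubuntu
default, adjust if yours differs:

```sh
# What failed last, and how often?
sudo -u postgres psql -c "
  SELECT failed_count, last_failed_wal, last_failed_time
  FROM pg_stat_archiver;"

# How full is the volume holding pg_wal?
df -h /var/lib/postgresql
```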
## Email: signups never arrive

- `auth.email_backend=stdout` in your config — emails went to the
  journal. Check `journalctl -u shithubd-web --grep '\[email\]'`.
- Postmark token wrong / sender domain not verified. Postmark's
  dashboard logs every send.
- DKIM/SPF not set up; the provider may accept and silently drop.

To test, send a verification email to an address you control on
a domain that bounces with helpful errors (e.g., your own MTA
configured to log rejections).
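You can check SPF/DKIM from the shell with plain DNS lookups;
substitute your sending domain, and note the DKIM selector here
is an example (Postmark's dashboard shows the real one):

```sh
# SPF record on the sending domain
dig +short TXT shithub.sh | grep -i spf

# DKIM record at the provider's selector (selector is an example)
dig +short TXT 20240101pm._domainkey.shithub.sh
```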
## Sign-in fails immediately with no error in journal

CSRF token expired (cookie didn't match the form's hidden token).
Force a hard reload of the sign-in page; cookies older than the
session lifetime are dropped.

If that's not it: check `auth.require_email_verification=true` —
the user may need to click a verification link.
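If you suspect the verification gate, a hypothetical lookup — the
`users` table and `email_verified` column are assumptions, so
match them to the actual schema:

```sh
psql -d shithub -c "
  SELECT email, email_verified FROM users
  WHERE email = 'someone@example.com';"
```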
"Site appears slow" with no specific symptom
Top sources of latency, in order of frequency:
- N+1 query regression. DB calls/sec on the Grafana dashboard will be much higher than baseline.
- GC pressure. The web process is allocating more than usual
— check Grafana's
go_gc_pause_secondspanel. - Cache stampede. If a hot cache key was just invalidated,
first-request-wins via singleflight should prevent this; if
it doesn't, check
cache.singleflight_in_flightfor excess. - Disk-bound on git ops. Concurrent large clones pin the web host's disk. Move git transports to a separate process or scale the web tier.
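To confirm the disk-bound case (item 4), watch device utilization
while a large clone is in flight. `iostat` and `pidstat` come
from the sysstat package; the unit name follows this doc's
conventions:

```sh
# Device view: %util pinned near 100 on the git volume means disk-bound
iostat -x 5 3

# Per-process I/O for the web process
pidstat -d -p "$(pgrep -f shithubd-web | head -1)" 5 3
```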