
# Troubleshooting

Common operator-visible failure modes and how to diagnose them.

## Caddy: certificate acquisition fails

**Symptom:** Caddy logs show `failed to obtain certificate` for your domain; the site serves the staging cert (or no cert).

Most often:

- DNS doesn't yet point at the host. Verify with `dig +short shithub.sh`.
- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs port 80 reachable from the public internet (Caddy redirects to 443 *after* obtaining the cert). UFW must allow 80.
- Rate-limited by Let's Encrypt. If you've been retrying with `caddy_use_acme_staging=false` and failing, you can be locked out for a few hours. Switch to `caddy_use_acme_staging=true`, reload, then flip back when DNS/firewall is correct.
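While DNS or the firewall is being fixed, staging mode amounts to pointing Caddy at the staging CA. A minimal Caddyfile sketch — the global `acme_ca` option is standard Caddy, but the upstream address is an assumption, not the deployed config:

```
{
	# Staging CA issues untrusted certs but has generous rate limits;
	# drop this block to go back to production issuance.
	acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

shithub.sh {
	reverse_proxy localhost:8080  # assumed upstream, check the real Caddyfile
}
```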

## sshd: AKC slow / git ssh hangs at "Authenticating"

**Symptom:** `git clone git@…` sits at the auth phase for many seconds before completing or timing out.

The `AuthorizedKeysCommand` invokes `/usr/local/bin/shithubd ssh-authkeys %f`, which queries Postgres. Latency comes from one of:

- DB connection saturation. `pgxpool` is shared across the web process; the AKC subcommand opens its own connection. Check `pg_stat_activity` for stuck connections.
- DNS for the AKC subprocess (it should not need DNS, but a misconfigured `/etc/nsswitch.conf` can stall NSS lookups).
- The user has a *very* large number of keys; AKC walks all of them. In practice, latency becomes noticeable once a user has hundreds of keys.

Mitigation: investigate, don't bypass. Disabling the AKC opens a hole — every key on disk would have shell access.
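The saturation check needs only standard `pg_stat_activity` columns; a sketch, run via `psql -d shithub`:

```sql
-- Non-idle connections, oldest query first. A pile-up of old 'active' or
-- 'idle in transaction' rows points at saturation or a stuck transaction.
SELECT pid, usename, state,
       now() - query_start AS age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY age DESC NULLS LAST
LIMIT 10;
```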

## Job queue: backlog growing

**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.

```sh
psql -d shithub -c "
  SELECT kind, count(*) FROM jobs WHERE state='queued'
  GROUP BY kind ORDER BY 2 DESC;"
```

If one kind dominates, look for a poison job in worker logs:

```sh
journalctl -u shithubd-worker -n 500 --grep ERROR
```

If the spread is even, the worker isn't keeping up — scale by running a second worker host. The `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely.

To purge a poison job: mark it `failed` in SQL (don't delete — you want the audit trail).
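For reference, that safety claim rests on the standard Postgres dequeue idiom. A sketch with assumed column names (`id`, `kind`, `state` appear in the query above; the `running` state literal is an assumption):

```sql
-- Each worker claims one queued job per transaction. SKIP LOCKED makes a
-- worker silently pass over rows another transaction has already locked,
-- so two workers never grab the same job and never block each other.
UPDATE jobs
SET state = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE state = 'queued'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, kind;
```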
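A sketch of the purge, assuming the `jobs` schema implied above (the `failed` state literal comes from this doc; the job id is a placeholder for the one found in the worker logs):

```sql
-- Mark, don't delete: the row stays behind as the audit trail.
UPDATE jobs
SET state = 'failed'
WHERE id = 12345;   -- hypothetical id of the poison job
```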

## Webhook: deliveries failing 50%+

**Symptom:** `WebhookDeliveryFailing` alert.

- A specific subscriber URL is broken. Check the webhook's delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for status codes and bodies.
- Outbound network issue. From the host: `curl -v https://example.com/hook`. If outbound HTTPS itself is broken, that's an infrastructure problem.
- Auto-disable kicks in after 50 consecutive failures per webhook; the UI shows a banner and sets `active=false`. Owner re-enables once the endpoint is fixed.

## Push: "pre-receive hook declined"

**Symptom:** A user's `git push` fails with a pre-receive rejection.

Common reasons:

- **Branch protection.** The user pushed directly to a protected branch whose rule requires a PR. Have them push to a feature branch and open a PR instead.
- **Repo over quota.** The bare-repo size cap was hit; raise it per-repo or globally in config.
- **Pack contains a too-large object.** Large blobs blow the per-object cap (defaults documented in `internal/repos`). Use LFS for big binaries instead.
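To find which object tripped the cap, list the repo's largest blobs with plain git plumbing. The final pipeline is the reusable part — run it inside the real repo; the throwaway repo here only makes the example self-contained:

```shell
set -e
# Build a demo repo with one oversized blob (stand-in for the real repo).
repo=$(mktemp -d)
git -C "$repo" init -q
head -c 1048576 /dev/zero > "$repo/big.bin"   # 1 MiB blob, the "too-large" object
echo small > "$repo/small.txt"
git -C "$repo" add .
git -C "$repo" -c user.email=demo@example.com -c user.name=demo commit -qm demo

# The reusable pipeline: every reachable blob, largest first.
git -C "$repo" rev-list --objects --all |
  git -C "$repo" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn | head -5
```

The top line names the offending path, which tells the user what to move to LFS.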

## Postgres: archive_command failing

**Symptom:** `pg_stat_archiver.failed_count` increasing; disk filling on the db host.

`journalctl -u postgresql -n 200 --grep "archive_command failed"` will print the rclone error. Common causes:

- Rotated bucket credentials.
- Network partition to Spaces.
- `rclone.conf` missing or unreadable.

Confirm by hand:

```sh
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsd spaces-prod:
```

Fix the underlying issue; Postgres retries automatically. If disk is critically full and Spaces won't be reachable in time, page the operator. **Do not** run `pg_archivecleanup` manually unless explicitly directed — you can lose data.
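For orientation, the failing command is whatever `archive_command` names in `postgresql.conf`. Given the rclone remote above, it plausibly looks like this sketch — the bucket path and exact flags are assumptions, so check the real config line (`%p` and `%f` are Postgres's standard placeholders for the WAL file's path and name):

```
# postgresql.conf (sketch, not the deployed value)
archive_mode = on
archive_command = 'rclone --config /root/.config/rclone/rclone.conf copyto %p spaces-prod:wal-archive/%f'
```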

## Email: signups never arrive

- `auth.email_backend=stdout` in your config — emails went to the journal. Check `journalctl -u shithubd-web --grep '\[email\]'`.
- Postmark token wrong / sender domain not verified. Postmark's dashboard logs every send.
- DKIM/SPF not set up; the provider may accept and silently drop.

To test, send a verification email to an address you control on a domain that bounces with helpful errors (e.g., your own MTA configured to log rejections).

## Sign-in fails immediately with no error in journal

Most likely the CSRF token expired (the cookie no longer matched the form's hidden token). Force a hard reload of the sign-in page; cookies older than the session lifetime are dropped.

If that's not it: check `auth.require_email_verification=true` — the user may need to click a verification link.

"Site appears slow" with no specific symptom

Top sources of latency, in order of frequency:

  1. N+1 query regression. DB calls/sec on the Grafana dashboard will be much higher than baseline.
  2. GC pressure. The web process is allocating more than usual — check Grafana's go_gc_pause_seconds panel.
  3. Cache stampede. If a hot cache key was just invalidated, first-request-wins via singleflight should prevent this; if it doesn't, check cache.singleflight_in_flight for excess.
  4. Disk-bound on git ops. Concurrent large clones pin the web host's disk. Move git transports to a separate process or scale the web tier.