
# Troubleshooting

Common operator-visible failure modes and how to diagnose them.

## Caddy: certificate acquisition fails

**Symptom:** Caddy logs show `failed to obtain certificate` for your domain; the site serves the staging cert (or no cert).

Most often:

- DNS doesn't yet point at the host. Verify with `dig +short shithub.sh`.
- Port 80 is blocked. Let's Encrypt's HTTP-01 challenge needs port 80 reachable from the public internet (Caddy redirects to 443 *after* obtaining the cert). UFW must allow 80.
- Rate-limited by Let's Encrypt. If you've been retrying with `caddy_use_acme_staging=false` and failing, you can be locked out for a few hours. Switch to `caddy_use_acme_staging=true`, reload, then flip back when DNS/firewall is correct.
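While DNS or the firewall is being fixed, staging mode amounts to pointing Caddy at the staging CA. A minimal Caddyfile sketch — the global `acme_ca` option is standard Caddy, but the upstream address is an assumption, not the deployed config:

```
{
	# Staging CA issues untrusted certs but has generous rate limits;
	# drop this block to go back to production issuance.
	acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

shithub.sh {
	reverse_proxy localhost:8080  # assumed upstream, check the real Caddyfile
}
```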

## sshd: AKC slow / git ssh hangs at "Authenticating"

**Symptom:** `git clone git@…` sits at the auth phase for many seconds before completing or timing out.

The `AuthorizedKeysCommand` invokes `/usr/local/bin/shithubd ssh-authkeys %f`, which queries Postgres. Latency comes from one of:

- DB connection saturation. `pgxpool` is shared across the web process; the AKC subcommand opens its own connection. Check `pg_stat_activity` for stuck connections.
- DNS for the AKC subprocess (it should not need DNS, but a misconfigured `/etc/nsswitch.conf` can stall NSS lookups).
- The user has a *very* large number of keys; AKC walks all of them. In practice, latency becomes noticeable once a user has hundreds of keys.

Mitigation: investigate, don't bypass. Disabling the AKC opens a hole — every key on disk would have shell access.
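The saturation check needs only standard `pg_stat_activity` columns; a sketch, run via `psql -d shithub`:

```sql
-- Non-idle connections, oldest query first. A pile-up of old 'active' or
-- 'idle in transaction' rows points at saturation or a stuck transaction.
SELECT pid, usename, state,
       now() - query_start AS age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY age DESC NULLS LAST
LIMIT 10;
```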

## Job queue: backlog growing

**Symptom:** `shithubd_job_queue_depth > 5000` for 15+ minutes.

```sh
psql -d shithub -c "
  SELECT kind, count(*) FROM jobs WHERE state='queued'
  GROUP BY kind ORDER BY 2 DESC;"
```

If one kind dominates, look for a poison job in worker logs:

```sh
journalctl -u shithubd-worker -n 500 --grep ERROR
```

If the spread is even, the worker isn't keeping up — scale by running a second worker host. The `FOR UPDATE SKIP LOCKED` pattern lets multiple workers coexist safely.

To purge a poison job: mark it `failed` in SQL (don't delete — you want the audit trail).
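For reference, that safety claim rests on the standard Postgres dequeue idiom. A sketch with assumed column names (`id`, `kind`, `state` appear in the query above; the `running` state literal is an assumption):

```sql
-- Each worker claims one queued job per transaction. SKIP LOCKED makes a
-- worker silently pass over rows another transaction has already locked,
-- so two workers never grab the same job and never block each other.
UPDATE jobs
SET state = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE state = 'queued'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, kind;
```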
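A sketch of the purge, assuming the `jobs` schema implied above (the `failed` state literal comes from this doc; the job id is a placeholder for the one found in the worker logs):

```sql
-- Mark, don't delete: the row stays behind as the audit trail.
UPDATE jobs
SET state = 'failed'
WHERE id = 12345;   -- hypothetical id of the poison job
```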

## Webhook: deliveries failing 50%+

**Symptom:** `WebhookDeliveryFailing` alert.

- A specific subscriber URL is broken. Check the webhook's delivery log in the UI (`/owner/repo/settings/hooks/{id}`) for status codes and bodies.
- Outbound network issue. From the host: `curl -v https://example.com/hook`. If outbound HTTPS itself is broken, that's an infrastructure problem.
- Auto-disable kicks in after 50 consecutive failures per webhook; the UI shows a banner and sets `active=false`. Owner re-enables once the endpoint is fixed.

## Push: "pre-receive hook declined"

**Symptom:** A user's `git push` fails with a pre-receive rejection.

Common reasons:

- **Branch protection.** The user pushed directly to a protected branch whose rule requires a PR. Have them push to a feature branch and open a PR instead.
- **Repo over quota.** The bare-repo size cap was hit; raise it per-repo or globally in config.
- **Pack contains a too-large object.** Large blobs blow the per-object cap (defaults documented in `internal/repos`). Use LFS for big binaries instead.
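To find which object tripped the cap, list the repo's largest blobs with plain git plumbing. The final pipeline is the reusable part — run it inside the real repo; the throwaway repo here only makes the example self-contained:

```shell
set -e
# Build a demo repo with one oversized blob (stand-in for the real repo).
repo=$(mktemp -d)
git -C "$repo" init -q
head -c 1048576 /dev/zero > "$repo/big.bin"   # 1 MiB blob, the "too-large" object
echo small > "$repo/small.txt"
git -C "$repo" add .
git -C "$repo" -c user.email=demo@example.com -c user.name=demo commit -qm demo

# The reusable pipeline: every reachable blob, largest first.
git -C "$repo" rev-list --objects --all |
  git -C "$repo" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn | head -5
```

The top line names the offending path, which tells the user what to move to LFS.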

## Postgres: archive_command failing

**Symptom:** `pg_stat_archiver.failed_count` increasing; disk filling on the db host.

`journalctl -u postgresql -n 200 --grep "archive_command failed"` will print the rclone error. Common causes:

- Rotated bucket credentials.
- Network partition to Spaces.
- `rclone.conf` missing or unreadable.

Confirm by hand:

```sh
sudo -u postgres rclone --config /root/.config/rclone/rclone.conf \
     lsd spaces-prod:
```

Fix the underlying issue; Postgres retries automatically. If disk is critically full and Spaces won't be reachable in time, page the operator. **Do not** run `pg_archivecleanup` manually unless explicitly directed — you can lose data.
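For orientation, the failing command is whatever `archive_command` names in `postgresql.conf`. Given the rclone remote above, it plausibly looks like this sketch — the bucket path and exact flags are assumptions, so check the real config line (`%p` and `%f` are Postgres's standard placeholders for the WAL file's path and name):

```
# postgresql.conf (sketch, not the deployed value)
archive_mode = on
archive_command = 'rclone --config /root/.config/rclone/rclone.conf copyto %p spaces-prod:wal-archive/%f'
```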

## Email: signups never arrive

- `auth.email_backend=stdout` in your config — emails went to the journal. Check `journalctl -u shithubd-web --grep '\[email\]'`.
- Postmark token wrong / sender domain not verified. Postmark's dashboard logs every send.
- DKIM/SPF not set up; the provider may accept and silently drop.

To test, send a verification email to an address you control on a domain that bounces with helpful errors (e.g., your own MTA configured to log rejections).

## Sign-in fails immediately with no error in journal

Most likely the CSRF token expired (the cookie no longer matched the form's hidden token). Force a hard reload of the sign-in page; cookies older than the session lifetime are dropped.

If that's not it: check `auth.require_email_verification=true` — the user may need to click a verification link.

"Site appears slow" with no specific symptom

Top sources of latency, in order of frequency:

  1. N+1 query regression. DB calls/sec on the Grafana dashboard will be much higher than baseline.
  2. GC pressure. The web process is allocating more than usual — check Grafana's go_gc_pause_seconds panel.
  3. Cache stampede. If a hot cache key was just invalidated, first-request-wins via singleflight should prevent this; if it doesn't, check cache.singleflight_in_flight for excess.
  4. Disk-bound on git ops. Concurrent large clones pin the web host's disk. Move git transports to a separate process or scale the web tier.