S14: docs/internal/{hooks,worker}.md
Authored by mfwolffe <wolffemf@dukes.jmu.edu>

- SHA: 11f0fdaf85bfc8192a65ff31a0305145d37476f8
- Parents: aa9ab45
- Tree: e17b934

| Status | File | + | - |
|---|---|---|---|
| A | docs/internal/hooks.md | 105 | 0 |
| A | docs/internal/worker.md | 96 | 0 |

docs/internal/hooks.md (added) @@ -0,0 +1,105 @@
| 1 | +# Git hooks | |
| 2 | + | |
| 3 | +Bare-repo git hooks are how the push pipeline (S14) gets called. | |
| 4 | +shithub installs **pre-receive** (gates) and **post-receive** (enqueue) | |
| 5 | +on every repo. Both are tiny shell shims that exec into the same | |
| 6 | +`shithubd` binary the web/SSH/worker layers run. | |
| 7 | + | |
| 8 | +## Why shell shims and not direct binary symlinks | |
| 9 | + | |
| 10 | +git invokes hooks with stdin piped, a particular cwd, and a controlled | |
| 11 | +env. The shim normalizes the entry point: `exec /path/to/shithubd hook | |
| 12 | +<name>`. Hooks aren't symlinked because (a) macOS git treats some | |
| 13 | +symlinks oddly under hardened runtime, and (b) the shim explicitly | |
| 14 | +re-execs so signals and exit codes propagate cleanly. | |
| 15 | + | |
| 16 | +The shim is regenerated on every `hooks.Install` call — there's no | |
| 17 | +manual editing path, by design. If the binary moves (deploy upgrade, | |
| 18 | +versioned install path), run `shithubd hooks reinstall --all` to point | |
| 19 | +every repo at the new location. | |
| 20 | + | |
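To make the shim idea concrete, here is a minimal sketch of how such a script could be rendered. The names `renderShim` and `shellQuote`, the `#!/bin/sh` interpreter line, and the quoting strategy are assumptions for illustration; the text that `hooks.Install` actually writes may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes so that spaces or shell
// metacharacters in the binary path survive /bin/sh parsing.
// (Hypothetical helper, not taken from the shithub source.)
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// renderShim returns the text of a hook script that replaces itself with
// `<binPath> hook <hookName>`. Because the script uses exec, no shell
// process stays between git and the binary, so stdin, signals, and the
// exit code pass straight through.
func renderShim(binPath, hookName string) string {
	return fmt.Sprintf("#!/bin/sh\nexec %s hook %s\n", shellQuote(binPath), hookName)
}

func main() {
	// The install path below is an example value only.
	fmt.Print(renderShim("/opt/shithub/bin/shithubd", "pre-receive"))
}
```

Since the absolute path is baked into the script text, moving the binary invalidates every shim, which is why a reinstall step is needed after such a move.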
| 21 | +## Hook execution flow | |
| 22 | + | |
| 23 | +``` | |
| 24 | +git push ──▶ receive-pack | |
| 25 | + │ | |
| 26 | + ▼ | |
| 27 | + pre-receive (one process per push) | |
| 28 | + │ stdin: "<old> <new> <ref>" lines | |
| 29 | + │ env: SHITHUB_USER_ID / _USERNAME / _REPO_ID / _REPO_FULL_NAME | |
| 30 | + │ SHITHUB_PROTOCOL / _REMOTE_IP / _REQUEST_ID | |
| 31 | + ▼ | |
| 32 | + exec shithubd hook pre-receive | |
| 33 | + │ exit 0 → accept | |
| 34 | + │ exit ≠ 0 → reject; stderr is shown to the pusher | |
| 35 | + ▼ | |
| 36 | + (push proceeds) | |
| 37 | + │ | |
| 38 | + ▼ | |
| 39 | + post-receive (one process per push) | |
| 40 | + │ stdin: same shape | |
| 41 | + ▼ | |
| 42 | + exec shithubd hook post-receive | |
| 43 | + │ INSERT push_events row per ref | |
| 44 | + │ INSERT job (push:process) | |
| 45 | + │ NOTIFY shithub_jobs | |
| 46 | + │ exit 0 | |
| 47 | + ▼ | |
| 48 | + worker picks it up via LISTEN | |
| 49 | +``` | |
| 50 | + | |
| 51 | +## pre-receive contract | |
| 52 | + | |
| 53 | +Implements the **minimum gates** described in S14. The full branch | |
| 54 | +protection engine is S20. | |
| 55 | + | |
| 56 | +Re-checks the following from the DB even though the env carries them | |
| 57 | +(env can be stale on a long-lived SSH session): | |
| 58 | + | |
| 59 | +* User not suspended (`users.suspended_at IS NULL`). | |
| 60 | +* Repo not archived (`repos.is_archived = false`). | |
| 61 | +* Repo not soft-deleted (`repos.deleted_at IS NULL`). | |
| 62 | + | |
| 63 | +Failures emit a `shithub: ...` line on stderr that git surfaces directly | |
| 64 | +to the pusher's terminal. Latency budget: <100ms p99. | |
| 65 | + | |
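The three gates can be sketched as a pure function over state already loaded from the DB. The `pushContext` struct, the function names, and the error wording are hypothetical; only the three conditions and the `shithub: ` stderr prefix come from the contract above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// pushContext holds the three facts pre-receive re-reads from the DB.
// (Hypothetical type; each field summarizes one column check.)
type pushContext struct {
	UserSuspended bool // users.suspended_at IS NOT NULL
	RepoArchived  bool // repos.is_archived = true
	RepoDeleted   bool // repos.deleted_at IS NOT NULL
}

// checkGates returns nil when the push may proceed, or an error whose
// text is intended for the pusher's terminal.
func checkGates(pc pushContext) error {
	switch {
	case pc.UserSuspended:
		return errors.New("account is suspended")
	case pc.RepoDeleted:
		return errors.New("repository not found")
	case pc.RepoArchived:
		return errors.New("repository is archived")
	}
	return nil
}

// verdict converts the gate result into the exit code and stderr line a
// hook process would produce.
func verdict(pc pushContext) (exitCode int, stderrLine string) {
	if err := checkGates(pc); err != nil {
		return 1, "shithub: " + err.Error()
	}
	return 0, ""
}

func main() {
	code, line := verdict(pushContext{RepoArchived: true})
	if line != "" {
		fmt.Fprintln(os.Stderr, line) // git relays hook stderr to the pusher
	}
	fmt.Println("exit code:", code)
}
```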
| 66 | +## post-receive contract | |
| 67 | + | |
| 68 | +* Reads stdin, ignores empty/malformed lines. | |
| 69 | +* For each `<old> <new> <ref>` line: INSERT a `push_events` row, then | |
| 70 | + enqueue a `push:process` job carrying the new event id. | |
| 71 | +* Issues a single `NOTIFY shithub_jobs` per push (workers are woken | |
| 72 | + once the enclosing transaction commits). | |
| 73 | +* Exits 0 even on internal errors. The push has already landed; the | |
| 74 | + pipeline is async — partial enqueue failures surface in the worker | |
| 75 | + logs and a backstop reconciler (post-MVP) can re-process orphaned | |
| 76 | + events. | |
| 77 | + | |
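Reading the hook's stdin can be sketched as follows. The type and function names are hypothetical, and the rule "exactly three whitespace-separated fields" is one plausible reading of "malformed"; the real parser may validate more (for example, object id length).

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// refUpdate is one "<old> <new> <ref>" line written by receive-pack.
type refUpdate struct {
	Old, New, Ref string
}

// parseRefLine splits a single stdin line. It reports false for empty
// lines and for lines that do not have exactly three fields.
func parseRefLine(line string) (refUpdate, bool) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return refUpdate{}, false
	}
	return refUpdate{Old: fields[0], New: fields[1], Ref: fields[2]}, true
}

// readRefUpdates collects every well-formed update and silently skips
// the rest.
func readRefUpdates(r io.Reader) ([]refUpdate, error) {
	var updates []refUpdate
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if u, ok := parseRefLine(sc.Text()); ok {
			updates = append(updates, u)
		}
	}
	return updates, sc.Err()
}

func main() {
	// Shortened placeholder object ids, for readability only.
	stdin := "aaa bbb refs/heads/main\n\nnot-a-ref-line\nccc ddd refs/tags/v1\n"
	updates, err := readRefUpdates(strings.NewReader(stdin))
	if err != nil {
		fmt.Println("read error:", err)
	}
	for _, u := range updates {
		fmt.Printf("%s: %s -> %s\n", u.Ref, u.Old, u.New)
	}
}
```

Each returned update would then map to one `push_events` row and one `push:process` job.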
| 78 | +## Installation | |
| 79 | + | |
| 80 | +* On repo create (`internal/repos/create.go::Create`): runs | |
| 81 | + `hooks.Install` after `RepoFS.InitBare` succeeds. The S11 plumbing | |
| 82 | + initial commit does *not* fire hooks. That is intentional: the | |
| 83 | + contract is that hooks fire on user-driven pushes only. | |
| 84 | +* On deploy: `shithubd hooks reinstall --all` walks every active repo | |
| 85 | + via the DB and re-installs. Single repos: `--repo owner/name`. | |
| 86 | +* The binary path baked into the shim is `os.Executable()` of the | |
| 87 | + running shithubd at install time, resolved through `filepath.Abs`. | |
| 88 | + Test fixtures that don't exercise hooks pass `ShithubdPath: ""` to | |
| 89 | + `repos.Create.Deps` to skip installation. | |
| 90 | + | |
| 91 | +## Failure modes worth knowing | |
| 92 | + | |
| 93 | +* **`SHITHUB_*` env not set** (e.g. someone manually triggered a push | |
| 94 | + via a path that bypassed S12/S13): pre-receive returns the "missing | |
| 95 | + context" error. post-receive logs a warning and exits 0 — the push | |
| 96 | + still lands. | |
| 97 | +* **DB unreachable from hook**: pre-receive returns "server error"; | |
| 98 | + the user sees a generic message, the push aborts. post-receive logs | |
| 99 | + and exits 0; backstop reconciler will pick up the unprocessed push | |
| 100 | + on next opportunity. | |
| 101 | +* **Binary path drift between install and execution**: the shim's hard- | |
| 102 | + coded path no longer exists. git will report "hook execution failed" | |
| 103 | + and abort the push. Operators recover with `hooks reinstall`. | |
| 104 | +* **Stale shim from previous shithubd version**: `Install` is | |
| 105 | + idempotent and overwrites; re-running on a deploy is the right move. | |
docs/internal/worker.md (added) @@ -0,0 +1,96 @@
| 1 | +# Worker | |
| 2 | + | |
| 3 | +The worker pool drains the Postgres-backed job queue introduced in S14. | |
| 4 | +One binary serves the web API, the SSH dispatcher, the hooks, and the | |
| 5 | +worker — `shithubd worker` boots a long-running process whose only job | |
| 6 | +is to dequeue and run. | |
| 7 | + | |
| 8 | +## Architecture | |
| 9 | + | |
| 10 | +``` | |
| 11 | +post-receive ──INSERT push_event + INSERT job + NOTIFY──▶ Postgres jobs | |
| 12 | + │ | |
| 13 | + ▼ | |
| 14 | + shithubd worker | |
| 15 | + ┌──────────────────────┐ | |
| 16 | + │ N goroutines │ | |
| 17 | + │ ClaimJob (FOR UPDATE │ | |
| 18 | + │ SKIP LOCKED) │ | |
| 19 | + │ → handler │ | |
| 20 | + │ → MarkCompleted / │ | |
| 21 | + │ Reschedule / │ | |
| 22 | + │ MarkFailed │ | |
| 23 | + └──────────────────────┘ | |
| 24 | +``` | |
| 25 | + | |
| 26 | +The pool also runs a dedicated LISTEN goroutine that holds one Postgres | |
| 27 | +connection on `LISTEN shithub_jobs`. A NOTIFY from any process (typically | |
| 28 | +the post-receive hook) wakes idle workers in <1s without polling. The | |
| 29 | +backstop poll (every 5s by default) covers dropped notifications. | |
| 30 | + | |
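The interplay between NOTIFY wake-ups and the backstop poll can be sketched with plain channels. This is not the pool's actual code: `waitForWork` is a hypothetical name, `notify` stands in for whatever the LISTEN goroutine feeds, and `done` stands in for a context's `Done()` channel.

```go
package main

import (
	"fmt"
	"time"
)

// waitForWork blocks until one of three things happens and reports which:
// a notification arrived, the backstop poll interval elapsed, or shutdown
// began. A worker would attempt a claim after "notify" or "poll".
func waitForWork(done <-chan struct{}, notify <-chan struct{}, poll time.Duration) string {
	timer := time.NewTimer(poll)
	defer timer.Stop()
	select {
	case <-done:
		return "shutdown"
	case <-notify:
		return "notify"
	case <-timer.C:
		return "poll"
	}
}

func main() {
	done := make(chan struct{})
	notify := make(chan struct{}, 1)

	notify <- struct{}{} // simulate NOTIFY shithub_jobs
	fmt.Println(waitForWork(done, notify, 5*time.Second))

	// Nothing pending: the backstop poll fires instead.
	fmt.Println(waitForWork(done, notify, 10*time.Millisecond))

	close(done) // simulate SIGTERM cancelling the root context
	fmt.Println(waitForWork(done, notify, 5*time.Second))
}
```

The poll branch is what covers a dropped notification: a worker that misses a NOTIFY still attempts a claim within one poll interval.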
| 31 | +## Job kinds shipped in S14 | |
| 32 | + | |
| 33 | +| Kind | Trigger | Idempotent on | | |
| 34 | +| ---------------------- | ------------------------------------ | ------------------------ | | |
| 35 | +| `push:process` | post-receive hook per ref | `push_events.processed_at` | | |
| 36 | +| `repo:size_recalc` | enqueued by `push:process` | overwrite-last-wins | | |
| 37 | +| `jobs:purge_completed` | future cron / manual ad-hoc | always safe to re-run | | |
| 38 | + | |
| 39 | +Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`, | |
| 40 | +add the `Kind` constant to `internal/worker/types.go`, register it in | |
| 41 | +`cmd/shithubd/worker.go`. The dispatch loop is generic — no other code | |
| 42 | +needs to change. | |
| 43 | + | |
| 44 | +## Failure handling | |
| 45 | + | |
| 46 | +* **Transient errors** (handler returns a non-nil non-poison error): | |
| 47 | + reschedule with `Backoff(attempts) ± 20% jitter`, where `Backoff` is | |
| 48 | + `30s * 2^(attempts-1)` capped at 1h. | |
| 49 | +* **Poison errors** (handler returns an error wrapping `worker.ErrPoison`): | |
| 50 | + jump to `MarkJobFailed` immediately. Use this for malformed payloads or | |
| 51 | + references to vanished rows. | |
| 52 | +* **Panics**: caught by `safeRun` and treated as a transient error so a | |
| 53 | + buggy handler can't take down a worker goroutine. | |
| 54 | +* **Stuck locks**: `ClaimJob` treats a lock whose `locked_at` is older | |
| 55 | + than 5 minutes as expired, so rows held by a worker that died mid-job | |
| 56 | + become claimable again on the next claim cycle after that window. | |
| 57 | +* **Max attempts**: `jobs.max_attempts` (default 5). Past the limit the | |
| 58 | + row is marked failed (`failed_at` is set). Failed rows surface in the | |
| 59 | + S34 admin panel for poison-job triage. | |
| 60 | + | |
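The retry rules above can be worked through in a short sketch. `ErrPoison` here is a local stand-in for `worker.ErrPoison`, and `withJitter` and `classify` are hypothetical names. The outcome labels match the metrics labels documented for the worker, but the exact comparison against the attempt limit is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// Local stand-ins for illustration; not the real package-level values.
var (
	ErrPoison    = errors.New("poison job")
	errTransient = errors.New("temporary failure")
)

const (
	baseDelay = 30 * time.Second
	maxDelay  = time.Hour
)

// Backoff returns 30s * 2^(attempts-1), capped at 1h. attempts starts at 1.
func Backoff(attempts int) time.Duration {
	d := baseDelay
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay // cap reached; also avoids overflow for large inputs
		}
	}
	return d
}

// withJitter scales d by a factor in [0.8, 1.2); r must be in [0, 1).
func withJitter(d time.Duration, r float64) time.Duration {
	return time.Duration(float64(d) * (0.8 + 0.4*r))
}

// classify maps a handler result to an outcome label.
func classify(err error, attempts, maxAttempts int) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrPoison):
		return "poison" // fail immediately, no retry
	case attempts >= maxAttempts:
		return "failed" // assumed rule: the last allowed attempt has been used
	default:
		return "retry"
	}
}

func main() {
	for attempts := 1; attempts <= 9; attempts++ {
		base := Backoff(attempts)
		fmt.Printf("attempt %d: base %v, jittered %v\n", attempts, base, withJitter(base, rand.Float64()))
	}
	wrapped := fmt.Errorf("decode payload: %w", ErrPoison)
	fmt.Println(classify(wrapped, 1, 5), classify(errTransient, 1, 5), classify(errTransient, 5, 5))
}
```

With these constants the un-jittered delay reaches the 1h cap on the eighth attempt (30s × 2^7 = 3840s), so with the default `max_attempts` of 5 the cap is never hit.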
| 61 | +## Concurrency contract | |
| 62 | + | |
| 63 | +* `FOR UPDATE SKIP LOCKED` on the inner SELECT means N concurrent workers | |
| 64 | + on the same kind can claim N distinct rows in one round. No two | |
| 65 | + workers claim the same row in the same round. | |
| 66 | +* Workers hold a row's lock only while their handler runs. The lock is | |
| 67 | + released atomically by `MarkJobCompleted` / `RescheduleJob` / | |
| 68 | + `MarkJobFailed`. | |
| 69 | +* Job handlers MUST be idempotent. A worker process killed mid-handler | |
| 70 | + leaves the row locked until the 5-minute stale-lock window passes, | |
| 71 | + at which point another worker may pick it up and re-run. Guard the | |
| 72 | + side-effects (e.g. `processed_at IS NULL` checks). | |
| 73 | + | |
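The idempotency requirement can be illustrated with an in-memory stand-in for a `push_events` row. The `pushEvent` type and `handlePush` function are hypothetical. A real handler would presumably perform the guard in the database (for example, an update conditioned on `processed_at IS NULL`) so that the check and the write are atomic.

```go
package main

import (
	"fmt"
	"time"
)

// pushEvent is an in-memory stand-in for one push_events row.
type pushEvent struct {
	ID          int64
	ProcessedAt *time.Time // nil until the event has been handled
}

// handlePush applies sideEffect at most once per event. A re-run after a
// stale-lock takeover sees ProcessedAt set and returns without doing work.
func handlePush(ev *pushEvent, sideEffect func()) bool {
	if ev.ProcessedAt != nil {
		return false // an earlier attempt already completed
	}
	sideEffect()
	now := time.Now()
	ev.ProcessedAt = &now
	return true
}

func main() {
	ev := &pushEvent{ID: 1}
	runs := 0
	for attempt := 1; attempt <= 2; attempt++ {
		did := handlePush(ev, func() { runs++ })
		fmt.Printf("attempt %d: did work = %v\n", attempt, did)
	}
	fmt.Println("side effects applied:", runs)
}
```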
| 74 | +## Metrics | |
| 75 | + | |
| 76 | +All exported through the standard `/metrics` registry: | |
| 77 | + | |
| 78 | +* `shithub_worker_jobs_processed_total{kind,outcome}` — outcome ∈ | |
| 79 | + `ok | retry | failed | poison`. | |
| 80 | +* `shithub_worker_job_duration_seconds{kind}` — handler latency | |
| 81 | + histogram. | |
| 82 | +* `shithub_worker_in_flight{kind}` — gauge of currently-running handlers. | |
| 83 | + | |
| 84 | +## Operational notes | |
| 85 | + | |
| 86 | +* Default pool size is 4. Override via `--workers <n>` on the CLI or | |
| 87 | + `SHITHUB_WORKERS=<n>` in the environment. | |
| 88 | +* Graceful shutdown: SIGINT/SIGTERM cancels the root context. Workers | |
| 89 | + stop pulling new jobs and let in-flight handlers finish (bounded by | |
| 90 | + `JobTimeout`, default 5 minutes). The LISTEN goroutine drops its | |
| 91 | + conn cleanly. | |
| 92 | +* The Postgres connection pool is sized to `Workers + 2` | |
| 93 | + (one for LISTEN, one slack for enqueues during shutdown). | |
| 94 | +* `LISTEN/NOTIFY` semantics: when the hook calls `pg_notify` inside a | |
| 95 | + transaction, the notification is *only* delivered after commit. So | |
| 96 | + failed inserts never wake workers for nothing. | |