S14: docs/internal/{hooks,worker}.md
Authored by mfwolffe <wolffemf@dukes.jmu.edu>
- SHA: 11f0fdaf85bfc8192a65ff31a0305145d37476f8
- Parents: aa9ab45
- Tree: e17b934

| Status | File | + | - |
|---|---|---|---|
| A | docs/internal/hooks.md | 105 | 0 |
| A | docs/internal/worker.md | 96 | 0 |

docs/internal/hooks.md added @@ -0,0 +1,105 @@
| 1 | +# Git hooks | ||
| 2 | + | ||
| 3 | +Bare-repo git hooks are how the push pipeline (S14) gets called. | ||
| 4 | +shithub installs **pre-receive** (gates) and **post-receive** (enqueue) | ||
| 5 | +on every repo. Both are tiny shell shims that exec into the same | ||
| 6 | +`shithubd` binary the web/SSH/worker layers run. | ||
| 7 | + | ||
| 8 | +## Why shell shims and not direct binary symlinks | ||
| 9 | + | ||
| 10 | +git invokes hooks with stdin piped, a particular cwd, and a controlled | ||
| 11 | +env. The shim normalizes the entry point: `exec /path/to/shithubd hook | ||
| 12 | +<name>`. Hooks aren't symlinked because (a) macOS git treats some | ||
| 13 | +symlinks oddly under hardened runtime, and (b) the shim explicitly | ||
| 14 | +re-execs, so signals and exit codes propagate cleanly. | ||
| 15 | + | ||
| 16 | +The shim is regenerated on every `hooks.Install` call — there's no | ||
| 17 | +manual editing path, by design. If the binary moves (deploy upgrade, | ||
| 18 | +versioned install path), run `shithubd hooks reinstall --all` to point | ||
| 19 | +every repo at the new location. | ||
| 20 | + | ||
| 21 | +## Hook execution flow | ||
| 22 | + | ||
| 23 | +``` | ||
| 24 | +git push ──▶ receive-pack | ||
| 25 | + │ | ||
| 26 | + ▼ | ||
| 27 | + pre-receive (one process per push) | ||
| 28 | + │ stdin: "<old> <new> <ref>" lines | ||
| 29 | + │ env: SHITHUB_USER_ID / _USERNAME / _REPO_ID / _REPO_FULL_NAME | ||
| 30 | + │ SHITHUB_PROTOCOL / _REMOTE_IP / _REQUEST_ID | ||
| 31 | + ▼ | ||
| 32 | + exec shithubd hook pre-receive | ||
| 33 | + │ exit 0 → accept | ||
| 34 | + │ exit ≠ 0 → reject; stderr is shown to the pusher | ||
| 35 | + ▼ | ||
| 36 | + (push proceeds) | ||
| 37 | + │ | ||
| 38 | + ▼ | ||
| 39 | + post-receive (one process per push) | ||
| 40 | + │ stdin: same shape | ||
| 41 | + ▼ | ||
| 42 | + exec shithubd hook post-receive | ||
| 43 | + │ INSERT push_events row per ref | ||
| 44 | + │ INSERT job (push:process) | ||
| 45 | + │ NOTIFY shithub_jobs | ||
| 46 | + │ exit 0 | ||
| 47 | + ▼ | ||
| 48 | + worker picks it up via LISTEN | ||
| 49 | +``` | ||
| 50 | + | ||
| 51 | +## pre-receive contract | ||
| 52 | + | ||
| 53 | +Implements the **minimum gates** described in S14. The full branch | ||
| 54 | +protection engine is S20. | ||
| 55 | + | ||
| 56 | +Re-checks the following from the DB even though the env carries them | ||
| 57 | +(env can be stale on a long-lived SSH session): | ||
| 58 | + | ||
| 59 | +* User not suspended (`users.suspended_at IS NULL`). | ||
| 60 | +* Repo not archived (`repos.is_archived = false`). | ||
| 61 | +* Repo not soft-deleted (`repos.deleted_at IS NULL`). | ||
| 62 | + | ||
| 63 | +Failures emit a `shithub: ...` line on stderr that git surfaces directly | ||
| 64 | +to the pusher's terminal. Latency budget: <100ms p99. | ||
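
The three re-checks above can be sketched as a pure function over state already loaded from the DB. This is a minimal illustration, not shithub's implementation: `GateState`, `CheckGates`, and the message wording are all hypothetical; only the `shithub: ...` stderr prefix and the three conditions come from the contract.

```go
package main

import (
	"errors"
	"fmt"
)

// GateState holds the three facts pre-receive re-reads from the DB.
// Field names are illustrative; the documented columns are
// users.suspended_at, repos.is_archived and repos.deleted_at.
type GateState struct {
	UserSuspended bool
	RepoArchived  bool
	RepoDeleted   bool
}

// CheckGates returns the first failing gate, or nil when the push may
// proceed. The message text is an assumption; only the "shithub: "
// prefix is taken from the contract.
func CheckGates(s GateState) error {
	switch {
	case s.UserSuspended:
		return errors.New("shithub: account is suspended")
	case s.RepoArchived:
		return errors.New("shithub: repository is archived")
	case s.RepoDeleted:
		return errors.New("shithub: repository not found")
	}
	return nil
}

func main() {
	for _, s := range []GateState{{}, {RepoArchived: true}} {
		if err := CheckGates(s); err != nil {
			// A real hook would write this to stderr and exit non-zero.
			fmt.Println("reject:", err)
			continue
		}
		fmt.Println("accept")
	}
}
```

Keeping the decision separate from the DB lookup makes the gate logic easy to unit-test without a database.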
| 65 | + | ||
| 66 | +## post-receive contract | ||
| 67 | + | ||
| 68 | +* Reads stdin, ignores empty/malformed lines. | ||
| 69 | +* For each `<old> <new> <ref>` line: INSERT a `push_events` row, then | ||
| 70 | + enqueue a `push:process` job carrying the new event id. | ||
| 71 | +* Issues a single `NOTIFY shithub_jobs` per push (workers wake once | ||
| 72 | + the enclosing transaction commits). | ||
| 73 | +* Exits 0 even on internal errors. The push has already landed; the | ||
| 74 | + pipeline is async — partial enqueue failures surface in the worker | ||
| 75 | + logs and a backstop reconciler (post-MVP) can re-process orphaned | ||
| 76 | + events. | ||
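
As a rough sketch of the stdin handling described in this contract, the following parses `<old> <new> <ref>` lines and skips anything empty or malformed. The names `RefUpdate`, `ParseRefLine`, and `ParseRefUpdates` are hypothetical and not taken from the shithub codebase; the DB inserts and the NOTIFY are left out.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// RefUpdate is one "<old> <new> <ref>" line from the hook's stdin.
type RefUpdate struct {
	Old, New, Ref string
}

// ParseRefLine splits one stdin line. It reports false for empty or
// malformed lines, which the caller skips.
func ParseRefLine(line string) (RefUpdate, bool) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return RefUpdate{}, false
	}
	return RefUpdate{Old: fields[0], New: fields[1], Ref: fields[2]}, true
}

// ParseRefUpdates reads every line from r and keeps the well-formed ones.
func ParseRefUpdates(r io.Reader) ([]RefUpdate, error) {
	var updates []RefUpdate
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if u, ok := ParseRefLine(sc.Text()); ok {
			updates = append(updates, u)
		}
	}
	return updates, sc.Err()
}

func main() {
	in := "aaa bbb refs/heads/main\n\nnot-a-ref-line\nccc ddd refs/tags/v1\n"
	updates, err := ParseRefUpdates(strings.NewReader(in))
	if err != nil {
		panic(err)
	}
	for _, u := range updates {
		// The real hook would INSERT a push_events row and enqueue a job here.
		fmt.Printf("%s: %s -> %s\n", u.Ref, u.Old, u.New)
	}
}
```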
| 77 | + | ||
| 78 | +## Installation | ||
| 79 | + | ||
| 80 | +* On repo create (`internal/repos/create.go::Create`): runs | ||
| 81 | + `hooks.Install` after `RepoFS.InitBare` succeeds. The S11 plumbing | ||
| 82 | + initial commit does *not* fire hooks. That is intended: the contract | ||
| 83 | + is that hooks fire on user-driven pushes only. | ||
| 84 | +* On deploy: `shithubd hooks reinstall --all` walks every active repo | ||
| 85 | + via the DB and re-installs. Single repos: `--repo owner/name`. | ||
| 86 | +* The binary path baked into the shim is `os.Executable()` of the | ||
| 87 | + running shithubd at install time, resolved through `filepath.Abs`. | ||
| 88 | + Test fixtures that don't exercise hooks pass `ShithubdPath: ""` to | ||
| 89 | + `repos.Create.Deps` to skip installation. | ||
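
For orientation, a shim of the shape described above (`exec /path/to/shithubd hook <name>`) could be rendered like this. The exact script that `hooks.Install` writes is not shown in this document, so `RenderShim`, the `#!/bin/sh` shebang, and the quoting strategy are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// RenderShim returns a shell script that execs the given shithubd binary
// with "hook <name>". The shebang and quoting are assumed, not confirmed.
func RenderShim(shithubdPath, hook string) string {
	// Single-quote the path so spaces survive; escape embedded single quotes.
	quoted := "'" + strings.ReplaceAll(shithubdPath, "'", `'\''`) + "'"
	return fmt.Sprintf("#!/bin/sh\nexec %s hook %s\n", quoted, hook)
}

func main() {
	// In a real install the path would come from os.Executable(),
	// resolved through filepath.Abs, as described above.
	for _, hook := range []string{"pre-receive", "post-receive"} {
		fmt.Print(RenderShim("/usr/local/bin/shithubd", hook))
	}
}
```

Because the script is generated from scratch each time, writing it again on every install is naturally idempotent.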
| 90 | + | ||
| 91 | +## Failure modes worth knowing | ||
| 92 | + | ||
| 93 | +* **`SHITHUB_*` env not set** (e.g. someone manually triggered a push | ||
| 94 | + via a path that bypassed S12/S13): pre-receive returns the "missing | ||
| 95 | + context" error. post-receive logs a warning and exits 0 — the push | ||
| 96 | + still lands. | ||
| 97 | +* **DB unreachable from hook**: pre-receive returns "server error"; | ||
| 98 | + the user sees a generic message, the push aborts. post-receive logs | ||
| 99 | + and exits 0; backstop reconciler will pick up the unprocessed push | ||
| 100 | + on next opportunity. | ||
| 101 | +* **Binary path drift between install and execution**: the shim's | ||
| 102 | + hard-coded path no longer exists. git will report "hook execution failed" | ||
| 103 | + and abort the push. Operators recover with `hooks reinstall`. | ||
| 104 | +* **Stale shim from previous shithubd version**: `Install` is | ||
| 105 | + idempotent and overwrites; re-running on a deploy is the right move. | ||
docs/internal/worker.md added @@ -0,0 +1,96 @@
| 1 | +# Worker | ||
| 2 | + | ||
| 3 | +The worker pool drains the Postgres-backed job queue introduced in S14. | ||
| 4 | +One binary serves the web API, the SSH dispatcher, the hooks, and the | ||
| 5 | +worker — `shithubd worker` boots a long-running process whose only job | ||
| 6 | +is to dequeue and run. | ||
| 7 | + | ||
| 8 | +## Architecture | ||
| 9 | + | ||
| 10 | +``` | ||
| 11 | +post-receive ──INSERT push_event + INSERT job + NOTIFY──▶ Postgres jobs | ||
| 12 | + │ | ||
| 13 | + ▼ | ||
| 14 | + shithubd worker | ||
| 15 | + ┌──────────────────────┐ | ||
| 16 | + │ N goroutines │ | ||
| 17 | + │ ClaimJob (FOR UPDATE │ | ||
| 18 | + │ SKIP LOCKED) │ | ||
| 19 | + │ → handler │ | ||
| 20 | + │ → MarkCompleted / │ | ||
| 21 | + │ Reschedule / │ | ||
| 22 | + │ MarkFailed │ | ||
| 23 | + └──────────────────────┘ | ||
| 24 | +``` | ||
| 25 | + | ||
| 26 | +The pool also runs a dedicated LISTEN goroutine that holds one Postgres | ||
| 27 | +connection on `LISTEN shithub_jobs`. A NOTIFY from any process (typically | ||
| 28 | +the post-receive hook) wakes idle workers in <1s without polling. The | ||
| 29 | +backstop poll (every 5s by default) covers dropped notifications. | ||
| 30 | + | ||
| 31 | +## Job kinds shipped in S14 | ||
| 32 | + | ||
| 33 | +| Kind | Trigger | Idempotent on | | ||
| 34 | +| ---------------------- | ------------------------------------ | ------------------------ | | ||
| 35 | +| `push:process` | post-receive hook per ref | `push_events.processed_at` | | ||
| 36 | +| `repo:size_recalc` | enqueued by `push:process` | overwrite-last-wins | | ||
| 37 | +| `jobs:purge_completed` | future cron / manual ad-hoc | always safe to re-run | | ||
| 38 | + | ||
| 39 | +Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`, | ||
| 40 | +add the `Kind` constant to `internal/worker/types.go`, register it in | ||
| 41 | +`cmd/shithubd/worker.go`. The dispatch loop is generic — no other code | ||
| 42 | +needs to change. | ||
| 43 | + | ||
| 44 | +## Failure handling | ||
| 45 | + | ||
| 46 | +* **Transient errors** (handler returns a non-nil non-poison error): | ||
| 47 | + reschedule with `Backoff(attempts) ± 20% jitter`, where `Backoff` is | ||
| 48 | + `30s * 2^(attempts-1)` capped at 1h. | ||
| 49 | +* **Poison errors** (handler returns an error wrapping `worker.ErrPoison`): | ||
| 50 | + jump to `MarkJobFailed` immediately. Use this for malformed payloads or | ||
| 51 | + references to vanished rows. | ||
| 52 | +* **Panics**: caught by `safeRun` and treated as a transient error so a | ||
| 53 | + buggy handler can't take down a worker goroutine. | ||
| 54 | +* **Stuck locks**: `ClaimJob` treats a lock as stale once `locked_at` is | ||
| 55 | + older than 5 minutes, so rows held by a worker that died mid-job | ||
| 56 | + become claimable again on the next claim cycle. | ||
| 57 | +* **Max attempts**: `jobs.max_attempts` (default 5). Past the limit the | ||
| 58 | + row is marked failed (`failed_at` is set). Failed rows surface in the | ||
| 59 | + S34 admin panel for poison-job triage. | ||
| 60 | + | ||
| 61 | +## Concurrency contract | ||
| 62 | + | ||
| 63 | +* `FOR UPDATE SKIP LOCKED` on the inner SELECT means N concurrent workers | ||
| 64 | + on the same kind can claim N distinct rows in one round. Each row is | ||
| 65 | + held by at most one worker at a time. | ||
| 66 | +* Workers hold a row's lock only while their handler runs. The lock is | ||
| 67 | + released atomically by `MarkJobCompleted` / `RescheduleJob` / | ||
| 68 | + `MarkJobFailed`. | ||
| 69 | +* Job handlers MUST be idempotent. A worker process killed mid-handler | ||
| 70 | + leaves the row locked until the 5-minute stale-lock window passes, | ||
| 71 | + at which point another worker may pick it up and re-run. Guard the | ||
| 72 | + side-effects (e.g. `processed_at IS NULL` checks). | ||
| 73 | + | ||
| 74 | +## Metrics | ||
| 75 | + | ||
| 76 | +All exported through the standard `/metrics` registry: | ||
| 77 | + | ||
| 78 | +* `shithub_worker_jobs_processed_total{kind,outcome}` — outcome ∈ | ||
| 79 | + `ok | retry | failed | poison`. | ||
| 80 | +* `shithub_worker_job_duration_seconds{kind}` — handler latency | ||
| 81 | + histogram. | ||
| 82 | +* `shithub_worker_in_flight{kind}` — gauge of currently-running handlers. | ||
| 83 | + | ||
| 84 | +## Operational notes | ||
| 85 | + | ||
| 86 | +* Default pool size is 4. Override via `--workers <n>` on the CLI or | ||
| 87 | + `SHITHUB_WORKERS=<n>` in the environment. | ||
| 88 | +* Graceful shutdown: SIGINT/SIGTERM cancels the root context. Workers | ||
| 89 | + stop pulling new jobs and let in-flight handlers finish (bounded by | ||
| 90 | + `JobTimeout`, default 5 minutes). The LISTEN goroutine drops its | ||
| 91 | + conn cleanly. | ||
| 92 | +* The Postgres connection pool is sized to `Workers + 2` | ||
| 93 | + (one for LISTEN, one slack for enqueues during shutdown). | ||
| 94 | +* `LISTEN/NOTIFY` semantics: when the hook calls `pg_notify` inside a | ||
| 95 | + transaction, the notification is *only* delivered after commit. So | ||
| 96 | + failed inserts never wake workers for nothing. | ||