
Search (S28)

S28 ships a working search experience for repos, issues/PRs, users, and code (paths + small-file content). It is backed entirely by Postgres FTS (tsvector), plus pg_trgm for code-identifier substring matches. Visibility scoping flows through one composer (policy.VisibilityPredicate), so every search query is gated by the same rule the rest of the runtime uses.

Architecture

internal/search/
    search.go        — Deps + Result types + page-size constants
    query_parse.go   — operator parser (repo:, is:, state:, author:)
    repos.go         — SearchRepos
    issues.go        — SearchIssues
    users.go         — SearchUsers
    code.go          — SearchCode (paths + content union)

internal/auth/policy/
    visibility_predicate.go  — composes the WHERE-clause fragment

internal/web/handlers/search/
    search.go        — /search and /search/quick handlers

internal/worker/jobs/
    repo_index_code.go        — re-indexes a repo's default branch
    repo_index_reconcile.go   — heals drift between
                                default_branch_oid and
                                last_indexed_oid

Migrations

  • 0030 — extensions: pg_trgm, unaccent. Both ship with PostgreSQL contrib.
  • 0031 — search indexes: repos_search, issues_search, users_search tables, each keyed 1:1 with the source row and maintained by AFTER triggers. A custom text-search config shithub_search chains unaccent + english_stem so "café" matches "cafe" on both index and query sides. Backfill runs at migration time so existing rows are immediately searchable.
  • 0032 — code search: code_search_paths (paths only — cheap to populate; size cap doesn't apply) and code_search_content (paths + content for files ≤ 256 KiB AND text). Both have GIN indexes on tsv; code_search_content additionally has a GIN trigram index on the raw content for camelCase / snake_case substring matches the FTS tokenizer mangles. repos.last_indexed_oid added so the reconciler can detect drift.
  • 0041 — repo owner terms: repo search documents include the owning user/org handle plus display name, and result queries resolve owners through either users or orgs. This keeps public org repositories searchable by both owner/repo and owner-only text queries.

Visibility predicate

policy.VisibilityPredicate(actor, alias, startPlaceholder) returns a SQL fragment + bind args that filters a repos table reference to rows visible to the actor. Rules:

  • Soft-deleted repos: always excluded.
  • Site admin: only the soft-delete filter applies.
  • Anonymous: visibility = 'public'.
  • Logged-in: public OR owner OR collaborator (any role).

The composer is the single source of truth for "what repos can this viewer see in a list query." The S28 search functions thread it through every per-type query; future listing endpoints (trending, activity feed) reuse it.
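
The rules above can be sketched as a small composer. This is a minimal illustration, not the code in visibility_predicate.go: the Actor type, the function name, and the column and table names (deleted_at, owner_user_id, repo_collaborators) are assumptions made for the example.

```go
package main

import "fmt"

// Actor is a hypothetical stand-in for the real actor type.
// ID == 0 is treated as anonymous in this sketch.
type Actor struct {
	ID          int64
	IsSiteAdmin bool
}

// visibilityPredicate returns a WHERE-clause fragment plus bind args.
// alias is the repos table alias; start is the first free $N placeholder.
// All column and table names are assumed, not taken from the real schema.
func visibilityPredicate(a Actor, alias string, start int) (string, []any) {
	// Soft-deleted repos are excluded for every kind of viewer.
	notDeleted := alias + ".deleted_at IS NULL"
	switch {
	case a.IsSiteAdmin:
		// Site admin: only the soft-delete filter applies.
		return notDeleted, nil
	case a.ID == 0:
		// Anonymous: public repos only.
		return notDeleted + " AND " + alias + ".visibility = 'public'", nil
	}
	// Logged-in: public OR owner OR collaborator (any role).
	// The same placeholder is referenced twice, which Postgres allows.
	p := fmt.Sprintf("$%d", start)
	frag := notDeleted + " AND (" +
		alias + ".visibility = 'public'" +
		" OR " + alias + ".owner_user_id = " + p +
		" OR EXISTS (SELECT 1 FROM repo_collaborators c" +
		" WHERE c.repo_id = " + alias + ".id AND c.user_id = " + p + "))"
	return frag, []any{a.ID}
}

func main() {
	frag, args := visibilityPredicate(Actor{ID: 42}, "r", 3)
	fmt.Println(frag)
	fmt.Println(args)
}
```

Passing startPlaceholder lets a caller append the fragment to a query that already binds $1 and $2 (for example the tsquery text and a page limit) without renumbering its own arguments.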

The boundary is exercised by the search test suite at internal/search/search_test.go:

  • TestSearchRepos_AnonymousSeesOnlyPublic — anon never sees private rows for any free-text query.
  • TestSearchRepos_NonCollabOnPrivate — non-collab logged-in user gets zero hits on a private-repo's content.
  • TestSearchRepos_OwnerSeesPrivate — owner branch.
  • TestSearchRepos_CollabSeesPrivate — collab-row branch.
  • TestSearchIssues_AnonymousSeesOnlyPublic — issue-side mirror of the repo test (issues inherit visibility from their repo).

Query parsing

search.ParseQuery(raw) splits the user query into:

  • Text: the free-text portion (drives plainto_tsquery).
  • Phrase: set when a quoted span is present (drives phraseto_tsquery).
  • RepoFilter: repo:owner/name becomes {Owner, Name}.
  • StateFilter: is:open / is:closed / state:open / state:closed. The is: and state: forms are aliases.
  • AuthorFilter: author:username.

Unknown operator-shape tokens (e.g. language:Go) fall through as free text. This keeps future operator additions backwards-compatible and lets users naturally type ":"-containing strings without surprises.

The parser caps input at MaxQueryBytes (256) to defend against pathological-length queries; longer inputs are silently truncated.
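
A minimal sketch of the parsing behaviour described above, assuming a flat result struct and a single quoted span. The names Parsed and parseQuery, the field layout, and the choice to remove the quoted phrase from the free text are illustrative assumptions; query_parse.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// maxQueryBytes mirrors the documented MaxQueryBytes cap.
const maxQueryBytes = 256

// Parsed is a hypothetical, flattened version of the parser output.
type Parsed struct {
	Text   string // free text, joined with single spaces
	Phrase string // contents of the first quoted span, if any
	Owner  string // from repo:owner/name
	Name   string // from repo:owner/name
	State  string // "open" or "closed"
	Author string // from author:username
}

func parseQuery(raw string) Parsed {
	// Silently truncate over-long input. This is a byte-level cut, so it
	// can split a multi-byte rune; a real parser may want to trim to a
	// rune boundary.
	if len(raw) > maxQueryBytes {
		raw = raw[:maxQueryBytes]
	}
	var p Parsed
	// Pull out the first "quoted span" as the phrase.
	if i := strings.IndexByte(raw, '"'); i >= 0 {
		if j := strings.IndexByte(raw[i+1:], '"'); j >= 0 {
			p.Phrase = raw[i+1 : i+1+j]
			raw = raw[:i] + " " + raw[i+j+2:]
		}
	}
	var text []string
	for _, tok := range strings.Fields(raw) {
		key, val, _ := strings.Cut(tok, ":")
		owner, name, hasSlash := strings.Cut(val, "/")
		switch {
		case key == "repo" && hasSlash && owner != "" && name != "":
			p.Owner, p.Name = owner, name
		case (key == "is" || key == "state") && (val == "open" || val == "closed"):
			p.State = val
		case key == "author" && val != "":
			p.Author = val
		default:
			// Unknown or malformed operators fall through as free text.
			text = append(text, tok)
		}
	}
	p.Text = strings.Join(text, " ")
	return p
}

func main() {
	p := parseQuery(`repo:alice/widgets is:open "null pointer" crash language:Go`)
	fmt.Printf("%+v\n", p)
}
```

Because the default branch keeps the whole token, a future operator such as language: can be added as one more case without changing how today's queries parse.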

Ranking

  • Repos: ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay where recency_decay = 1 / (1 + days_since_update / 30) (the spec's day-1 lean). The whole expression lives in SQL, so Postgres evaluates it only for rows the GIN index lookup has already matched.
  • Issues: ts_rank_cd * state_weight with open = 1.5x over closed. The spec doesn't pin the multiplier; 1.5 surfaces actionable issues first without burying closed history.
  • Users: ts_rank_cd only. Suspended/deleted users are excluded at the WHERE clause so they never taint results.
  • Code: path hits rank +1.0 over content hits at the same ts_rank_cd. Within content hits, ts_rank_cd dominates trigram similarity.
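
To make the repo and issue formulas concrete, here is the same arithmetic in Go. In the real system the expression lives in SQL; the function names below are hypothetical and exist only to show how the terms combine.

```go
package main

import (
	"fmt"
	"math"
)

// repoScore mirrors the documented repo ranking:
// ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay.
func repoScore(tsRank float64, stars int, daysSinceUpdate float64) float64 {
	popularity := 1 + math.Log(1+float64(stars))
	recencyDecay := 1 / (1 + daysSinceUpdate/30)
	return tsRank * popularity * recencyDecay
}

// issueScore mirrors ts_rank_cd * state_weight, with open issues
// weighted 1.5x over closed ones.
func issueScore(tsRank float64, open bool) float64 {
	if open {
		return tsRank * 1.5
	}
	return tsRank
}

func main() {
	// Same text relevance, different popularity and freshness.
	fmt.Println(repoScore(0.5, 0, 0))   // no stars, updated today
	fmt.Println(repoScore(0.5, 99, 0))  // 99 stars, updated today
	fmt.Println(repoScore(0.5, 99, 30)) // 99 stars, updated 30 days ago
	fmt.Println(issueScore(0.5, true), issueScore(0.5, false))
}
```

The logarithm keeps very popular repos from drowning out relevance, and the decay term halves a score at 30 days since the last update, then falls off more slowly after that.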

Code-search index lifecycle

  • Push trigger: push:process enqueues repo:index_code when a push advances the repo's default branch. The job is idempotent and atomic-swap, so concurrent pushes that land while the previous index run is still in progress re-trigger indexing on the next push tick.
  • Atomic swap: the worker runs DELETE … + INSERT … for the repo in one tx. Readers never see a partial index.
  • Size + textness gates:
    • Files > 256 KiB → path indexed, content skipped.
    • Files with NUL bytes in the first 8 KiB → treated as binary; path indexed, content skipped.
    • Indexed content truncated to 64 KiB so the trigram column doesn't bloat for huge text files.
  • Path skiplist: vendor/, node_modules/, dist/, anything under .git* is skipped by default. The path: operator is post-MVP — when it ships it will let users opt into these directories.
  • Reconciler: repo:index_reconcile enqueues a repo:index_code job for each repo where default_branch_oid <> last_indexed_oid. Self-throttling (100 repos per tick). Designed to run from cron every 5 minutes once the cron framework lands; for now it's invocable as a job.
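
The path skiplist and the size and textness gates can be sketched as two small predicates. This is an illustration under assumptions: the function names are hypothetical, the skiplist is applied to directory segments at any depth, and ".git*" is read as a prefix match on a directory name. The real worker may decide these differently.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

const (
	maxFileBytes    = 256 << 10 // files larger than this get path-only indexing
	sniffBytes      = 8 << 10   // window scanned for NUL bytes
	maxIndexedBytes = 64 << 10  // stored content is truncated to this size
)

// skipPath reports whether a file sits under a skiplisted directory.
// Only directory segments are checked, so a file named vendor.go is kept.
func skipPath(path string) bool {
	segs := strings.Split(path, "/")
	for _, seg := range segs[:len(segs)-1] {
		if seg == "vendor" || seg == "node_modules" || seg == "dist" ||
			strings.HasPrefix(seg, ".git") {
			return true
		}
	}
	return false
}

// contentToIndex returns the bytes to store in the content index and
// whether content should be indexed at all. When it returns false the
// caller would still index the path.
func contentToIndex(data []byte) ([]byte, bool) {
	if len(data) > maxFileBytes {
		return nil, false // too large: path indexed, content skipped
	}
	head := data
	if len(head) > sniffBytes {
		head = head[:sniffBytes]
	}
	if bytes.IndexByte(head, 0) >= 0 {
		return nil, false // NUL in the first 8 KiB: treated as binary
	}
	if len(data) > maxIndexedBytes {
		data = data[:maxIndexedBytes]
	}
	return data, true
}

func main() {
	fmt.Println(skipPath("web/node_modules/x/index.js"))
	fmt.Println(skipPath("internal/search/code.go"))
	content, ok := contentToIndex([]byte("package search\n"))
	fmt.Println(len(content), ok)
}
```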

Routes

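| Method | Path          | Notes                                       |
|--------|---------------|---------------------------------------------|
| GET    | /search       | Full results page with GitHub-style filters |
| GET    | /search/quick | HTML fragment endpoint for top-bar dropdown |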

The top-bar nav embeds a search form pointing at /search; the same input now calls /search/quick as the user types and renders the returned fragment under the nav search box. Full-page type URLs emit GitHub-style type=repositories and type=pullrequests while still accepting the legacy type=repos and type=pulls aliases.
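
The alias handling for the type parameter amounts to a small normalization step. A sketch, assuming a hypothetical helper named canonicalType that passes any other value through unchanged:

```go
package main

import "fmt"

// canonicalType maps the legacy type= values onto the GitHub-style ones
// the full-page URLs emit. Values it does not recognize are returned
// as-is; how the real handler treats them is not covered here.
func canonicalType(t string) string {
	switch t {
	case "repos", "repositories":
		return "repositories"
	case "pulls", "pullrequests":
		return "pullrequests"
	default:
		return t
	}
}

func main() {
	for _, t := range []string{"repos", "pulls", "repositories"} {
		fmt.Printf("type=%s -> type=%s\n", t, canonicalType(t))
	}
}
```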

What we deferred from the spec

  • Result-HTML caching with viewer-fingerprint key: the spec's 30-second cache + (actor_id, repo_count, last_collab_change_at) fingerprint scheme. The cache key correctness is fiddly enough that we want measurements before we ship it. Without the cache the per-query cost is dominated by GIN-index lookups, which are fast on the synthetic fixture and within budget. Forward-deferred to S36 (perf pass).
  • API endpoint GET /api/v1/search?q=…&type=…: the orchestrator is API-shaped; the handler wrap is a tiny add. Parking until the S33 webhooks sprint pulls in the rest of the API surface so we do them together (consistency on auth + body cap + scope shapes). Forward-deferred to S33 / S34 API consolidation.
  • path: operator: parser falls through; querying path:foo treats it as free text today. Documented above.

These are all noted in the S28 status block as well.

Pitfalls noted in code

  • Visibility leak is the highest-stakes risk. The composer is the security boundary; the test suite asserts empty results for anon + non-collab against private fixtures.
  • tsvector size limits: the per-document content cap (64 KiB) keeps each indexed document well under PostgreSQL's 1 MB tsvector limit.
  • Locale / accent: unaccent is in the FTS config chain, and the tests cover accented input.
  • Tokenizer breakdown on code: trigram fallback exists in the schema (content_trgm + gin_trgm_ops); the SQL composes both the FTS hit and a future trigram-similarity hit (post-MVP — the v1 SearchCode runs FTS only, not trigram, because untyped trgm similarity needs a per-query threshold and we haven't chosen one yet).
  • Index drift: triggers on issues/repos/users are reliable; code index relies on the worker. The reconciler is the safety net.
  • Suspended user content: user-search excludes them via the WHERE clause. Issue/PR-search inherits their content via the underlying repo's visibility — that matches the spec's "visible inside their repo to collaborators" semantics.