# Search (S28)

S28 ships a working search experience for repos, issues/PRs, users,
and code (paths + small-file content). Backed entirely by Postgres
FTS (`tsvector`) plus `pg_trgm` for code-identifier substring matches.
Visibility scoping flows through one composer
(`policy.VisibilityPredicate`) so every search query is gated by the
same rule the rest of the runtime uses.

## Architecture

    internal/search/
      search.go                - Deps + Result types + page-size constants
      query_parse.go           - operator parser (repo:, is:, state:, author:)
      repos.go                 - SearchRepos
      issues.go                - SearchIssues
      users.go                 - SearchUsers
      code.go                  - SearchCode (paths + content union)

    internal/auth/policy/
      visibility_predicate.go  - composes the WHERE-clause fragment

    internal/web/handlers/search/
      search.go                - /search and /search/quick handlers

    internal/worker/jobs/
      repo_index_code.go       - re-indexes a repo's default branch
      repo_index_reconcile.go  - heals drift between
                                 default_branch_oid and
                                 last_indexed_oid

## Migrations

* **0030: extensions**: `pg_trgm`, `unaccent`. Both ship with
  PostgreSQL contrib.
* **0031: search indexes**: `repos_search`, `issues_search`,
  `users_search` tables, each keyed 1:1 with the source row and
  maintained by AFTER triggers. A custom text-search config
  `shithub_search` chains `unaccent` + `english_stem` so "café"
  matches "cafe" on both index and query sides. Backfill runs at
  migration time so existing rows are immediately searchable.
* **0032: code search**: `code_search_paths` (paths only; cheap
  to populate, and the size cap doesn't apply) and `code_search_content`
  (paths + content for files ≤ 256 KiB AND text). Both have GIN
  indexes on `tsv`; `code_search_content` additionally has a GIN
  trigram index on the raw content for camelCase / snake_case
  substring matches the FTS tokenizer mangles. `repos.last_indexed_oid`
  is added so the reconciler can detect drift.
* **0041: repo owner terms**: repo search documents include the
  owning user/org handle plus display name, and result queries
  resolve owners through either `users` or `orgs`. This keeps public
  org repositories searchable by both `owner/repo` and owner-only
  text queries.

## Visibility predicate

`policy.VisibilityPredicate(actor, alias, startPlaceholder)`
returns a SQL fragment + bind args that filters a `repos` table
reference to rows visible to the actor. Rules:

* Soft-deleted repos: always excluded.
* Site admin: only the soft-delete filter applies.
* Anonymous: `visibility = 'public'`.
* Logged-in: public OR owner OR collaborator (any role).

The composer is the **single source of truth** for "what repos can
this viewer see in a list query." The S28 search functions thread it
through every per-type query; future listing endpoints (trending,
activity feed) reuse it.
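
The four rules above can be sketched as a small composer. This is a minimal illustration, not the code in `visibility_predicate.go`: the `Actor` type, the lowercase function name, and the `deleted_at`, `owner_user_id`, and `repo_collaborators` names are all assumptions made for the example.

```go
package main

import "fmt"

// Actor is a hypothetical viewer description (assumption, not the real policy type).
type Actor struct {
	UserID      int64 // 0 means anonymous
	IsSiteAdmin bool
}

// visibilityPredicate returns a WHERE fragment for the repos table referenced
// by alias, plus bind args numbered from start. Column and table names are
// assumed for illustration.
func visibilityPredicate(a Actor, alias string, start int) (string, []any) {
	// Soft-deleted repos are excluded for every viewer.
	notDeleted := fmt.Sprintf("%s.deleted_at IS NULL", alias)
	switch {
	case a.IsSiteAdmin:
		// Site admin: only the soft-delete filter applies.
		return notDeleted, nil
	case a.UserID == 0:
		// Anonymous: public repos only.
		return fmt.Sprintf("%s AND %s.visibility = 'public'", notDeleted, alias), nil
	default:
		// Logged-in: public OR owner OR collaborator (any role).
		p := fmt.Sprintf("$%d", start)
		frag := fmt.Sprintf(
			"%[1]s AND (%[2]s.visibility = 'public' OR %[2]s.owner_user_id = %[3]s "+
				"OR EXISTS (SELECT 1 FROM repo_collaborators c "+
				"WHERE c.repo_id = %[2]s.id AND c.user_id = %[3]s))",
			notDeleted, alias, p)
		return frag, []any{a.UserID}
	}
}

func main() {
	frag, args := visibilityPredicate(Actor{UserID: 42}, "r", 2)
	fmt.Println(frag)
	fmt.Println(args)
}
```

A caller would append the fragment to its own WHERE clause and pass the returned args after its own. The start placeholder lets the fragment slot in after bind parameters the query already uses.
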

The boundary is exercised by the search test suite at
`internal/search/search_test.go`:

* `TestSearchRepos_AnonymousSeesOnlyPublic`: anon never sees
  private rows for any free-text query.
* `TestSearchRepos_NonCollabOnPrivate`: non-collab logged-in
  user gets zero hits on a private repo's content.
* `TestSearchRepos_OwnerSeesPrivate`: owner branch.
* `TestSearchRepos_CollabSeesPrivate`: collab-row branch.
* `TestSearchIssues_AnonymousSeesOnlyPublic`: issue-side mirror
  of the repo test (issues inherit visibility from their repo).

## Query parsing

`search.ParseQuery(raw)` splits the user query into:

* `Text`: free-text portion (drives `plainto_tsquery`).
* `Phrase`: set when a quoted span is present (drives `phraseto_tsquery`).
* `RepoFilter`: `repo:owner/name` becomes `{Owner, Name}`.
* `StateFilter`: `is:open` / `is:closed` / `state:open` /
  `state:closed`. The `is:` and `state:` forms are aliases.
* `AuthorFilter`: `author:username`.

Unknown operator-shape tokens (e.g. `language:Go`) fall through as
free text. This keeps future operator additions backwards-compatible
and lets users naturally type ":"-containing strings without
surprises.

The parser caps input at `MaxQueryBytes` (256) to defend against
pathological-length queries; longer inputs are silently truncated.

## Ranking

* **Repos**: `ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay`
  where `recency_decay = 1 / (1 + days_since_update / 30)` (the
  spec's day-1 lean). Whole expression lives in SQL so Postgres
  short-circuits on the GIN index.
* **Issues**: `ts_rank_cd * state_weight` with `open = 1.5x` over
  `closed`. The spec doesn't pin the multiplier; 1.5 surfaces
  actionable issues first without burying closed history.
* **Users**: `ts_rank_cd` only. Suspended/deleted users are
  excluded at the WHERE clause so they never taint results.
* **Code**: path hits rank `+1.0` over content hits at the same
  `ts_rank_cd`. Within content hits, `ts_rank_cd` dominates
  trigram similarity.

## Code-search index lifecycle

* **Push trigger**: `push:process` enqueues `repo:index_code` when
  a push advances the repo's default branch. The job is idempotent
  and atomic-swap, so concurrent pushes that land while the previous
  index is running re-trigger on the next push tick.
* **Atomic swap**: the worker runs `DELETE … + INSERT …` for the
  repo in one tx. Readers never see a partial index.
* **Size + textness gates**:
  * Files > 256 KiB → path indexed, content skipped.
  * Files with NUL bytes in the first 8 KiB → treated as binary;
    path indexed, content skipped.
  * Indexed content truncated to 64 KiB so the trigram column
    doesn't bloat for huge text files.
* **Path skiplist**: `vendor/`, `node_modules/`, `dist/`, and anything
  under `.git*` are skipped by default. The `path:` operator is
  post-MVP; when it ships it will let users opt into these
  directories.
* **Reconciler**: `repo:index_reconcile` enqueues a `repo:index_code`
  job for each repo where `default_branch_oid <> last_indexed_oid`.
  Self-throttling (100 repos per tick). Designed to run from cron
  every 5 minutes once the cron framework lands; for now it's
  invocable as a job.

## Routes

| Method | Path            | Notes                                        |
|--------|-----------------|----------------------------------------------|
| GET    | `/search`       | Full results page with GitHub-style filters  |
| GET    | `/search/quick` | HTML fragment endpoint for top-bar dropdown  |

The top-bar nav embeds a search form pointing at `/search`; the
same input now calls `/search/quick` as the user types and renders
the returned fragment under the nav search box. Full-page type URLs
emit GitHub-style `type=repositories` and `type=pullrequests` while
still accepting the legacy `type=repos` and `type=pulls` aliases.
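
Accepting the legacy aliases while emitting the GitHub-style names can be as small as one normalization step on the incoming `type` parameter. The helper below is a hypothetical sketch, not the handler's actual code. Only the two alias pairs named above are mapped, and any other value passes through unchanged.

```go
package main

import "fmt"

// canonicalType maps legacy type= values to the GitHub-style names.
// Values other than the two documented aliases are returned unchanged.
func canonicalType(t string) string {
	switch t {
	case "repos":
		return "repositories"
	case "pulls":
		return "pullrequests"
	default:
		return t
	}
}

func main() {
	fmt.Println(canonicalType("repos"))        // repositories
	fmt.Println(canonicalType("pullrequests")) // pullrequests
}
```
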

## What we deferred from the spec

* **Result-HTML caching with viewer-fingerprint key**: the spec's
  30-second cache + `(actor_id, repo_count, last_collab_change_at)`
  fingerprint scheme. The cache key correctness is fiddly enough
  that we want measurements before we ship it. Without the cache
  the per-query cost is dominated by GIN-index lookups, which are
  fast on the synthetic fixture and within budget. **Forward-deferred
  to S36 (perf pass)**.
* **API endpoint** `GET /api/v1/search?q=…&type=…`: the orchestrator
  is API-shaped; the handler wrap is a tiny add. Parking until the
  S33 webhooks sprint pulls in the rest of the API surface so we
  do them together (consistency on auth + body cap + scope shapes).
  **Forward-deferred to S33 / S34 API consolidation.**
* **`path:` operator**: parser falls through; querying `path:foo`
  treats it as free text today. Documented above.

These are all noted in the S28 status block as well.

## Pitfalls noted in code

* **Visibility leak** is the highest-stakes risk. The composer is
  the security boundary; the test suite asserts empty results for
  anon + non-collab against private fixtures.
* **`tsvector` size limits**: the per-document content cap (64 KiB)
  defends against these.
* **Locale / accent**: `unaccent` is in the FTS config chain; tests
  cover this.
* **Tokenizer breakdown on code**: trigram fallback exists in the
  schema (`content_trgm` + `gin_trgm_ops`); the SQL composes both
  the FTS hit and a future trigram-similarity hit (post-MVP: the
  v1 `SearchCode` runs FTS only, not trigram, because untyped trgm
  similarity needs a per-query threshold and we haven't chosen one
  yet).
* **Index drift**: triggers on issues/repos/users are reliable;
  code index relies on the worker. The reconciler is the safety
  net.
* **Suspended user content**: user-search excludes them via the
  WHERE clause. Issue/PR-search inherits their content via the
  underlying repo's visibility, which matches the spec's "visible
  inside their repo to collaborators" semantics.