@@ -0,0 +1,193 @@ |
| 1 | +# Search (S28) |
| 2 | + |
| 3 | +S28 ships a working search experience for repos, issues/PRs, users, |
| 4 | +and code (paths + small-file content). Backed entirely by Postgres |
| 5 | +FTS (`tsvector`) plus `pg_trgm` for code-identifier substring matches. |
| 6 | +Visibility scoping flows through one composer |
| 7 | +(`policy.VisibilityPredicate`) so every search query is gated by the |
| 8 | +same rule the rest of the runtime uses. |
| 9 | + |
| 10 | +## Architecture |
| 11 | + |
| 12 | +``` |
| 13 | +internal/search/ |
| 14 | + search.go — Deps + Result types + page-size constants |
| 15 | + query_parse.go — operator parser (repo:, is:, state:, author:) |
| 16 | + repos.go — SearchRepos |
| 17 | + issues.go — SearchIssues |
| 18 | + users.go — SearchUsers |
| 19 | + code.go — SearchCode (paths + content union) |
| 20 | + |
| 21 | +internal/auth/policy/ |
| 22 | + visibility_predicate.go — composes the WHERE-clause fragment |
| 23 | + |
| 24 | +internal/web/handlers/search/ |
| 25 | + search.go — /search and /search/quick handlers |
| 26 | + |
| 27 | +internal/worker/jobs/ |
| 28 | + repo_index_code.go — re-indexes a repo's default branch |
| 29 | + repo_index_reconcile.go — heals drift between |
| 30 | + default_branch_oid and |
| 31 | + last_indexed_oid |
| 32 | +``` |
| 33 | + |
| 34 | +## Migrations |
| 35 | + |
| 36 | +* **0030 — extensions**: `pg_trgm`, `unaccent`. Both ship with |
| 37 | + PostgreSQL contrib. |
| 38 | +* **0031 — search indexes**: `repos_search`, `issues_search`, |
| 39 | + `users_search` tables, each keyed 1:1 with the source row and |
| 40 | + maintained by AFTER triggers. A custom text-search config |
| 41 | + `shithub_search` chains `unaccent` + `english_stem` so "café" |
| 42 | + matches "cafe" on both index and query sides. Backfill runs at |
| 43 | + migration time so existing rows are immediately searchable. |
| 44 | +* **0032 — code search**: `code_search_paths` (paths only — cheap |
| 45 | + to populate; size cap doesn't apply) and `code_search_content` |
| 46 | + (paths + content for files ≤ 256 KiB AND text). Both have GIN |
| 47 | + indexes on `tsv`; `code_search_content` additionally has a GIN |
| 48 | + trigram index on the raw content for camelCase / snake_case |
| 49 | + substring matches the FTS tokenizer mangles. `repos.last_indexed_oid` |
| 50 | + added so the reconciler can detect drift. |
| 51 | + |
| 52 | +## Visibility predicate |
| 53 | + |
| 54 | +`policy.VisibilityPredicate(actor, alias, startPlaceholder)` |
| 55 | +returns a SQL fragment + bind args that filters a `repos` table |
| 56 | +reference to rows visible to the actor. Rules: |
| 57 | + |
| 58 | +* Soft-deleted repos: always excluded. |
| 59 | +* Site admin: only the soft-delete filter applies. |
| 60 | +* Anonymous: `visibility = 'public'`. |
| 61 | +* Logged-in: public OR owner OR collaborator (any role). |
| 62 | + |
| 63 | +The composer is the **single source of truth** for "what repos can |
| 64 | +this viewer see in a list query." The S28 search functions thread it |
| 65 | +through every per-type query; future listing endpoints (trending, |
| 66 | +activity feed) reuse it. |
| 67 | + |
| 68 | +The boundary is exercised by the search test suite at |
| 69 | +`internal/search/search_test.go`: |
| 70 | + |
| 71 | +* `TestSearchRepos_AnonymousSeesOnlyPublic` — anon never sees |
| 72 | + private rows for any free-text query. |
| 73 | +* `TestSearchRepos_NonCollabOnPrivate` — non-collab logged-in |
| 74 | + user gets zero hits on a private-repo's content. |
| 75 | +* `TestSearchRepos_OwnerSeesPrivate` — owner branch. |
| 76 | +* `TestSearchRepos_CollabSeesPrivate` — collab-row branch. |
| 77 | +* `TestSearchIssues_AnonymousSeesOnlyPublic` — issue-side mirror |
| 78 | + of the repo test (issues inherit visibility from their repo). |
| 79 | + |
| 80 | +## Query parsing |
| 81 | + |
| 82 | +`search.ParseQuery(raw)` splits the user query into: |
| 83 | + |
| 84 | +* `Text` — free-text portion (drives `plainto_tsquery`). |
| 85 | +* `Phrase` — when a quoted span is present (drives `phraseto_tsquery`). |
| 86 | +* `RepoFilter` — `repo:owner/name` becomes `{Owner, Name}`. |
| 87 | +* `StateFilter` — `is:open` / `is:closed` / `state:open` / |
| 88 | + `state:closed`. Aliases. |
| 89 | +* `AuthorFilter` — `author:username`. |
| 90 | + |
| 91 | +Unknown operator-shape tokens (e.g. `language:Go`) fall through as |
| 92 | +free text. This keeps future operator additions backwards- |
| 93 | +compatible and lets users naturally type ":"-containing strings |
| 94 | +without surprises. |
| 95 | + |
| 96 | +The parser caps input at `MaxQueryBytes` (256) to defend against |
| 97 | +pathological-length queries; longer inputs are silently truncated. |
| 98 | + |
| 99 | +## Ranking |
| 100 | + |
| 101 | +* **Repos**: `ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay` |
| 102 | + where `recency_decay = 1 / (1 + days_since_update / 30)` (the |
| 103 | + spec's day-1 lean). Whole expression lives in SQL so Postgres |
| 104 | + short-circuits on the GIN index. |
| 105 | +* **Issues**: `ts_rank_cd * state_weight` with `open = 1.5x` over |
| 106 | + `closed`. The spec doesn't pin the multiplier; 1.5 surfaces |
| 107 | + actionable issues first without burying closed history. |
| 108 | +* **Users**: `ts_rank_cd` only. Suspended/deleted users are |
| 109 | + excluded at the WHERE clause so they never taint results. |
| 110 | +* **Code**: path hits rank `+1.0` over content hits at the same |
| 111 | + `ts_rank_cd`. Within content hits, `ts_rank_cd` dominates |
| 112 | + trigram similarity. |
| 113 | + |
| 114 | +## Code-search index lifecycle |
| 115 | + |
| 116 | +* **Push trigger**: `push:process` enqueues `repo:index_code` when |
| 117 | + a push advances the repo's default branch. The job is idempotent |
| 118 | + + atomic-swap, so concurrent pushes that land while the previous |
| 119 | + index is running re-trigger on the next push tick. |
| 120 | +* **Atomic swap**: the worker runs `DELETE … + INSERT …` for the |
| 121 | + repo in one tx. Readers never see a partial index. |
| 122 | +* **Size + textness gates**: |
| 123 | + * Files > 256 KiB → path indexed, content skipped. |
| 124 | + * Files with NUL bytes in the first 8 KiB → treated as binary; |
| 125 | + path indexed, content skipped. |
| 126 | + * Indexed content truncated to 64 KiB so the trigram column |
| 127 | + doesn't bloat for huge text files. |
| 128 | +* **Path skiplist**: `vendor/`, `node_modules/`, `dist/`, anything |
| 129 | + under `.git*` is skipped by default. The `path:` operator is |
| 130 | + post-MVP — when it ships it will let users opt into these |
| 131 | + directories. |
| 132 | +* **Reconciler**: `repo:index_reconcile` enqueues a `repo:index_code` |
| 133 | + job for each repo where `default_branch_oid <> last_indexed_oid`. |
| 134 | + Self-throttling (100 repos per tick). Designed to run from cron |
| 135 | + every 5 minutes once the cron framework lands; for now it's |
| 136 | + invocable as a job. |
| 137 | + |
| 138 | +## Routes |
| 139 | + |
| 140 | +| Method | Path | Notes | |
| 141 | +|--------|------------------|------------------------------------------| |
| 142 | +| GET | `/search` | Full results page with type tabs | |
| 143 | +| GET | `/search/quick` | htmx fragment endpoint for top-bar drop | |
| 144 | + |
| 145 | +The top-bar nav embeds a search form pointing at `/search`; the |
| 146 | +htmx-driven dropdown wiring is intentionally deferred (the |
| 147 | +endpoint exists; the JS to invoke it on keystroke comes when we |
| 148 | +add htmx-the-library to the static asset bundle). |
| 149 | + |
| 150 | +## What we deferred from the spec |
| 151 | + |
| 152 | +* **Result-HTML caching with viewer-fingerprint key**: the spec's |
| 153 | + 30-second cache + `(actor_id, repo_count, last_collab_change_at)` |
| 154 | + fingerprint scheme. The cache key correctness is fiddly enough |
| 155 | + that we want measurements before we ship it. Without the cache |
| 156 | + the per-query cost is dominated by GIN-index lookups, which are |
| 157 | + fast on the synthetic fixture and within budget. **Forward-deferred |
| 158 | + to S36 (perf pass)**. |
| 159 | +* **API endpoint** `GET /api/v1/search?q=…&type=…`: the orchestrator |
| 160 | + is API-shaped; the handler wrap is a tiny add. Parking until the |
| 161 | + S33 webhooks sprint pulls in the rest of the API surface so we |
| 162 | + do them together (consistency on auth + body cap + scope shapes). |
| 163 | + **Forward-deferred to S33 / S34 API consolidation.** |
| 164 | +* **Quick-dropdown htmx wiring**: the endpoint returns the right |
| 165 | + HTML; the static HTML form posts to `/search` directly. The |
| 166 | + dropdown lights up when we land htmx in the static asset bundle. |
| 167 | +* **`path:` operator**: parser falls through; querying `path:foo` |
| 168 | + treats it as free text today. Documented above. |
| 169 | + |
| 170 | +These are all noted in the S28 status block as well. |
| 171 | + |
| 172 | +## Pitfalls noted in code |
| 173 | + |
| 174 | +* **Visibility leak** is the highest-stakes risk. The composer is |
| 175 | + the security boundary; the test suite asserts empty results for |
| 176 | + anon + non-collab against private fixtures. |
| 177 | +* **`tsvector` size limits**: per-document content cap (64 KiB) |
| 178 | + defends. |
| 179 | +* **Locale / accent**: `unaccent` is in the FTS config chain; tests |
| 180 | + cover. |
| 181 | +* **Tokenizer breakdown on code**: trigram fallback exists in the |
| 182 | + schema (`content_trgm` + `gin_trgm_ops`); the SQL composes both |
| 183 | + the FTS hit and a future trigram-similarity hit (post-MVP — the |
| 184 | + v1 SearchCode runs FTS only, not trigram, because untyped trgm |
| 185 | + similarity needs a per-query threshold and we haven't chosen one |
| 186 | + yet). |
| 187 | +* **Index drift**: triggers on issues/repos/users are reliable; |
| 188 | + code index relies on the worker. The reconciler is the safety |
| 189 | + net. |
| 190 | +* **Suspended user content**: user-search excludes them via the |
| 191 | + WHERE clause. Issue/PR-search inherits their content via the |
| 192 | + underlying repo's visibility — that matches the spec's "visible |
| 193 | + inside their repo to collaborators" semantics. |