@@ -0,0 +1,193 @@ |
| | 1 | +# Search (S28) |
| | 2 | + |
| | 3 | +S28 ships a working search experience for repos, issues/PRs, users, |
| | 4 | +and code (paths + small-file content). Backed entirely by Postgres |
| | 5 | +FTS (`tsvector`) plus `pg_trgm` for code-identifier substring matches. |
| | 6 | +Visibility scoping flows through one composer |
| | 7 | +(`policy.VisibilityPredicate`) so every search query is gated by the |
| | 8 | +same rule the rest of the runtime uses. |
| | 9 | + |
| | 10 | +## Architecture |
| | 11 | + |
| | 12 | +``` |
| | 13 | +internal/search/ |
| | 14 | + search.go — Deps + Result types + page-size constants |
| | 15 | + query_parse.go — operator parser (repo:, is:, state:, author:) |
| | 16 | + repos.go — SearchRepos |
| | 17 | + issues.go — SearchIssues |
| | 18 | + users.go — SearchUsers |
| | 19 | + code.go — SearchCode (paths + content union) |
| | 20 | + |
| | 21 | +internal/auth/policy/ |
| | 22 | + visibility_predicate.go — composes the WHERE-clause fragment |
| | 23 | + |
| | 24 | +internal/web/handlers/search/ |
| | 25 | + search.go — /search and /search/quick handlers |
| | 26 | + |
| | 27 | +internal/worker/jobs/ |
| | 28 | + repo_index_code.go — re-indexes a repo's default branch |
| | 29 | + repo_index_reconcile.go — heals drift between |
| | 30 | + default_branch_oid and |
| | 31 | + last_indexed_oid |
| | 32 | +``` |
| | 33 | + |
| | 34 | +## Migrations |
| | 35 | + |
| | 36 | +* **0030 — extensions**: `pg_trgm`, `unaccent`. Both ship with |
| | 37 | + PostgreSQL contrib. |
| | 38 | +* **0031 — search indexes**: `repos_search`, `issues_search`, |
| | 39 | + `users_search` tables, each keyed 1:1 with the source row and |
| | 40 | + maintained by AFTER triggers. A custom text-search config |
| | 41 | + `shithub_search` chains `unaccent` + `english_stem` so "café" |
| | 42 | + matches "cafe" on both index and query sides. Backfill runs at |
| | 43 | + migration time so existing rows are immediately searchable. |
| | 44 | +* **0032 — code search**: `code_search_paths` (paths only — cheap |
| | 45 | + to populate; size cap doesn't apply) and `code_search_content` |
| | 46 | + (paths + content for files ≤ 256 KiB AND text). Both have GIN |
| | 47 | + indexes on `tsv`; `code_search_content` additionally has a GIN |
| | 48 | + trigram index on the raw content for camelCase / snake_case |
| | 49 | + substring matches the FTS tokenizer mangles. `repos.last_indexed_oid` |
| | 50 | + added so the reconciler can detect drift. |
| | 51 | + |
| | 52 | +## Visibility predicate |
| | 53 | + |
| | 54 | +`policy.VisibilityPredicate(actor, alias, startPlaceholder)` |
| | 55 | +returns a SQL fragment + bind args that filters a `repos` table |
| | 56 | +reference to rows visible to the actor. Rules: |
| | 57 | + |
| | 58 | +* Soft-deleted repos: always excluded. |
| | 59 | +* Site admin: only the soft-delete filter applies. |
| | 60 | +* Anonymous: `visibility = 'public'`. |
| | 61 | +* Logged-in: public OR owner OR collaborator (any role). |
| | 62 | + |
| | 63 | +The composer is the **single source of truth** for "what repos can |
| | 64 | +this viewer see in a list query." The S28 search functions thread it |
| | 65 | +through every per-type query; future listing endpoints (trending, |
| | 66 | +activity feed) reuse it. |
| | 67 | + |
| | 68 | +The boundary is exercised by the search test suite at |
| | 69 | +`internal/search/search_test.go`: |
| | 70 | + |
| | 71 | +* `TestSearchRepos_AnonymousSeesOnlyPublic` — anon never sees |
| | 72 | + private rows for any free-text query. |
| | 73 | +* `TestSearchRepos_NonCollabOnPrivate` — non-collab logged-in |
| | 74 | + user gets zero hits on a private-repo's content. |
| | 75 | +* `TestSearchRepos_OwnerSeesPrivate` — owner branch. |
| | 76 | +* `TestSearchRepos_CollabSeesPrivate` — collab-row branch. |
| | 77 | +* `TestSearchIssues_AnonymousSeesOnlyPublic` — issue-side mirror |
| | 78 | + of the repo test (issues inherit visibility from their repo). |
| | 79 | + |
| | 80 | +## Query parsing |
| | 81 | + |
| | 82 | +`search.ParseQuery(raw)` splits the user query into: |
| | 83 | + |
| | 84 | +* `Text` — free-text portion (drives `plainto_tsquery`). |
| | 85 | +* `Phrase` — when a quoted span is present (drives `phraseto_tsquery`). |
| | 86 | +* `RepoFilter` — `repo:owner/name` becomes `{Owner, Name}`. |
| | 87 | +* `StateFilter` — `is:open` / `is:closed` / `state:open` / |
| | 88 | + `state:closed`. Aliases. |
| | 89 | +* `AuthorFilter` — `author:username`. |
| | 90 | + |
| | 91 | +Unknown operator-shape tokens (e.g. `language:Go`) fall through as |
| | 92 | +free text. This keeps future operator additions backwards- |
| | 93 | +compatible and lets users naturally type ":"-containing strings |
| | 94 | +without surprises. |
| | 95 | + |
| | 96 | +The parser caps input at `MaxQueryBytes` (256) to defend against |
| | 97 | +pathological-length queries; longer inputs are silently truncated. |
| | 98 | + |
| | 99 | +## Ranking |
| | 100 | + |
| | 101 | +* **Repos**: `ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay` |
| | 102 | + where `recency_decay = 1 / (1 + days_since_update / 30)` (the |
| | 103 | + spec's day-1 lean). Whole expression lives in SQL so Postgres |
| | 104 | + short-circuits on the GIN index. |
| | 105 | +* **Issues**: `ts_rank_cd * state_weight` with `open = 1.5x` over |
| | 106 | + `closed`. The spec doesn't pin the multiplier; 1.5 surfaces |
| | 107 | + actionable issues first without burying closed history. |
| | 108 | +* **Users**: `ts_rank_cd` only. Suspended/deleted users are |
| | 109 | + excluded at the WHERE clause so they never taint results. |
| | 110 | +* **Code**: path hits rank `+1.0` over content hits at the same |
| | 111 | + `ts_rank_cd`. Within content hits, `ts_rank_cd` dominates |
| | 112 | + trigram similarity. |
| | 113 | + |
| | 114 | +## Code-search index lifecycle |
| | 115 | + |
| | 116 | +* **Push trigger**: `push:process` enqueues `repo:index_code` when |
| | 117 | + a push advances the repo's default branch. The job is idempotent |
| | 118 | + + atomic-swap, so concurrent pushes that land while the previous |
| | 119 | + index is running re-trigger on the next push tick. |
| | 120 | +* **Atomic swap**: the worker runs `DELETE … + INSERT …` for the |
| | 121 | + repo in one tx. Readers never see a partial index. |
| | 122 | +* **Size + textness gates**: |
| | 123 | + * Files > 256 KiB → path indexed, content skipped. |
| | 124 | + * Files with NUL bytes in the first 8 KiB → treated as binary; |
| | 125 | + path indexed, content skipped. |
| | 126 | + * Indexed content truncated to 64 KiB so the trigram column |
| | 127 | + doesn't bloat for huge text files. |
| | 128 | +* **Path skiplist**: `vendor/`, `node_modules/`, `dist/`, anything |
| | 129 | + under `.git*` is skipped by default. The `path:` operator is |
| | 130 | + post-MVP — when it ships it will let users opt into these |
| | 131 | + directories. |
| | 132 | +* **Reconciler**: `repo:index_reconcile` enqueues a `repo:index_code` |
| | 133 | + job for each repo where `default_branch_oid <> last_indexed_oid`. |
| | 134 | + Self-throttling (100 repos per tick). Designed to run from cron |
| | 135 | + every 5 minutes once the cron framework lands; for now it's |
| | 136 | + invocable as a job. |
| | 137 | + |
| | 138 | +## Routes |
| | 139 | + |
| | 140 | +| Method | Path | Notes | |
| | 141 | +|--------|------------------|------------------------------------------| |
| | 142 | +| GET | `/search` | Full results page with type tabs | |
| | 143 | +| GET | `/search/quick` | htmx fragment endpoint for top-bar drop | |
| | 144 | + |
| | 145 | +The top-bar nav embeds a search form pointing at `/search`; the |
| | 146 | +htmx-driven dropdown wiring is intentionally deferred (the |
| | 147 | +endpoint exists; the JS to invoke it on keystroke comes when we |
| | 148 | +add htmx-the-library to the static asset bundle). |
| | 149 | + |
| | 150 | +## What we deferred from the spec |
| | 151 | + |
| | 152 | +* **Result-HTML caching with viewer-fingerprint key**: the spec's |
| | 153 | + 30-second cache + `(actor_id, repo_count, last_collab_change_at)` |
| | 154 | + fingerprint scheme. The cache key correctness is fiddly enough |
| | 155 | + that we want measurements before we ship it. Without the cache |
| | 156 | + the per-query cost is dominated by GIN-index lookups, which are |
| | 157 | + fast on the synthetic fixture and within budget. **Forward-deferred |
| | 158 | + to S36 (perf pass)**. |
| | 159 | +* **API endpoint** `GET /api/v1/search?q=…&type=…`: the orchestrator |
| | 160 | + is API-shaped; the handler wrap is a tiny add. Parking until the |
| | 161 | + S33 webhooks sprint pulls in the rest of the API surface so we |
| | 162 | + do them together (consistency on auth + body cap + scope shapes). |
| | 163 | + **Forward-deferred to S33 / S34 API consolidation.** |
| | 164 | +* **Quick-dropdown htmx wiring**: the endpoint returns the right |
| | 165 | + HTML; the static HTML form posts to `/search` directly. The |
| | 166 | + dropdown lights up when we land htmx in the static asset bundle. |
| | 167 | +* **`path:` operator**: parser falls through; querying `path:foo` |
| | 168 | + treats it as free text today. Documented above. |
| | 169 | + |
| | 170 | +These are all noted in the S28 status block as well. |
| | 171 | + |
| | 172 | +## Pitfalls noted in code |
| | 173 | + |
| | 174 | +* **Visibility leak** is the highest-stakes risk. The composer is |
| | 175 | + the security boundary; the test suite asserts empty results for |
| | 176 | + anon + non-collab against private fixtures. |
| | 177 | +* **`tsvector` size limits**: per-document content cap (64 KiB) |
| | 178 | + defends. |
| | 179 | +* **Locale / accent**: `unaccent` is in the FTS config chain; tests |
| | 180 | + cover. |
| | 181 | +* **Tokenizer breakdown on code**: trigram fallback exists in the |
| | 182 | + schema (`content_trgm` + `gin_trgm_ops`); the SQL composes both |
| | 183 | + the FTS hit and a future trigram-similarity hit (post-MVP — the |
| | 184 | + v1 SearchCode runs FTS only, not trigram, because untyped trgm |
| | 185 | + similarity needs a per-query threshold and we haven't chosen one |
| | 186 | + yet). |
| | 187 | +* **Index drift**: triggers on issues/repos/users are reliable; |
| | 188 | + code index relies on the worker. The reconciler is the safety |
| | 189 | + net. |
| | 190 | +* **Suspended user content**: user-search excludes them via the |
| | 191 | + WHERE clause. Issue/PR-search inherits their content via the |
| | 192 | + underlying repo's visibility — that matches the spec's "visible |
| | 193 | + inside their repo to collaborators" semantics. |