tenseleyflow/shithub / 1238ab9

S28: docs/internal/search.md

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA
1238ab9d3e7f05cd133a6f48ce22c4a6a33a06f2
Parents
6dd134b
Tree
962267a

1 changed file

Status  File                     +    -
A       docs/internal/search.md  193  0

docs/internal/search.md (added)
@@ -0,0 +1,193 @@
1
+# Search (S28)
2
+
3
+S28 ships a working search experience for repos, issues/PRs, users,
4
+and code (paths + small-file content). Backed entirely by Postgres
5
+FTS (`tsvector`) plus `pg_trgm` for code-identifier substring matches.
6
+Visibility scoping flows through one composer
7
+(`policy.VisibilityPredicate`) so every search query is gated by the
8
+same rule the rest of the runtime uses.
9
+
10
+## Architecture
11
+
12
+```
13
+internal/search/
14
+    search.go        — Deps + Result types + page-size constants
15
+    query_parse.go   — operator parser (repo:, is:, state:, author:)
16
+    repos.go         — SearchRepos
17
+    issues.go        — SearchIssues
18
+    users.go         — SearchUsers
19
+    code.go          — SearchCode (paths + content union)
20
+
21
+internal/auth/policy/
22
+    visibility_predicate.go  — composes the WHERE-clause fragment
23
+
24
+internal/web/handlers/search/
25
+    search.go        — /search and /search/quick handlers
26
+
27
+internal/worker/jobs/
28
+    repo_index_code.go        — re-indexes a repo's default branch
29
+    repo_index_reconcile.go   — heals drift between
30
+                                default_branch_oid and
31
+                                last_indexed_oid
32
+```
33
+
34
+## Migrations
35
+
36
+* **0030 — extensions**: `pg_trgm`, `unaccent`. Both ship with
37
+  PostgreSQL contrib.
38
+* **0031 — search indexes**: `repos_search`, `issues_search`,
39
+  `users_search` tables, each keyed 1:1 with the source row and
40
+  maintained by AFTER triggers. A custom text-search config
41
+  `shithub_search` chains `unaccent` + `english_stem` so "café"
42
+  matches "cafe" on both index and query sides. Backfill runs at
43
+  migration time so existing rows are immediately searchable.
44
+* **0032 — code search**: `code_search_paths` (paths only — cheap
45
+  to populate; size cap doesn't apply) and `code_search_content`
46
+  (paths + content for files ≤ 256 KiB AND text). Both have GIN
47
+  indexes on `tsv`; `code_search_content` additionally has a GIN
48
+  trigram index on the raw content for camelCase / snake_case
49
+  substring matches that the FTS tokenizer mangles. `repos.last_indexed_oid`
50
+  is added so the reconciler can detect drift.
51
+
52
+## Visibility predicate
53
+
54
+`policy.VisibilityPredicate(actor, alias, startPlaceholder)`
55
+returns a SQL fragment + bind args that filters a `repos` table
56
+reference to rows visible to the actor. Rules:
57
+
58
+* Soft-deleted repos: always excluded.
59
+* Site admin: only the soft-delete filter applies.
60
+* Anonymous: `visibility = 'public'`.
61
+* Logged-in: public OR owner OR collaborator (any role).
62
+
63
+The composer is the **single source of truth** for "what repos can
64
+this viewer see in a list query." The S28 search functions thread it
65
+through every per-type query; future listing endpoints (trending,
66
+activity feed) reuse it.
67
+
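A minimal sketch of how a composer following the rules above could look. The `Actor` type, the column names (`deleted_at`, `visibility`, `owner_user_id`), and the `repo_collaborators` table are assumptions for illustration; the real `policy.VisibilityPredicate` may differ in signature and schema.

```go
package main

import (
	"fmt"
	"strings"
)

// Actor is a hypothetical viewer type; the field names are assumptions.
type Actor struct {
	UserID      int64 // 0 means anonymous in this sketch
	IsSiteAdmin bool
}

// visibilityPredicate returns a WHERE-clause fragment for the repos table
// referenced as alias, plus its bind args. Placeholders are numbered from
// startPlaceholder so the fragment can be spliced into a larger query.
// Column and table names are illustrative, not the real schema.
func visibilityPredicate(a Actor, alias string, startPlaceholder int) (string, []any) {
	notDeleted := alias + ".deleted_at IS NULL"
	if a.IsSiteAdmin {
		// Site admin: only the soft-delete filter applies.
		return notDeleted, nil
	}
	public := alias + ".visibility = 'public'"
	if a.UserID == 0 {
		// Anonymous: public repos only.
		return notDeleted + " AND " + public, nil
	}
	// Logged-in: public OR owner OR collaborator (any role).
	p := fmt.Sprintf("$%d", startPlaceholder)
	clauses := []string{
		public,
		alias + ".owner_user_id = " + p,
		"EXISTS (SELECT 1 FROM repo_collaborators c WHERE c.repo_id = " +
			alias + ".id AND c.user_id = " + p + ")",
	}
	return notDeleted + " AND (" + strings.Join(clauses, " OR ") + ")", []any{a.UserID}
}

func main() {
	for _, a := range []Actor{{}, {UserID: 7}, {UserID: 1, IsSiteAdmin: true}} {
		frag, args := visibilityPredicate(a, "r", 2)
		fmt.Println(frag, args)
	}
}
```

Returning the fragment and args together, with a caller-supplied starting placeholder, is what lets each per-type query append the predicate after its own bind parameters without renumbering.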
68
+The boundary is exercised by the search test suite at
69
+`internal/search/search_test.go`:
70
+
71
+* `TestSearchRepos_AnonymousSeesOnlyPublic` — anon never sees
72
+  private rows for any free-text query.
73
+* `TestSearchRepos_NonCollabOnPrivate` — non-collab logged-in
74
+  user gets zero hits on a private repo's content.
75
+* `TestSearchRepos_OwnerSeesPrivate` — owner branch.
76
+* `TestSearchRepos_CollabSeesPrivate` — collab-row branch.
77
+* `TestSearchIssues_AnonymousSeesOnlyPublic` — issue-side mirror
78
+  of the repo test (issues inherit visibility from their repo).
79
+
80
+## Query parsing
81
+
82
+`search.ParseQuery(raw)` splits the user query into:
83
+
84
+* `Text` — free-text portion (drives `plainto_tsquery`).
85
+* `Phrase` — when a quoted span is present (drives `phraseto_tsquery`).
86
+* `RepoFilter` — `repo:owner/name` becomes `{Owner, Name}`.
87
+* `StateFilter` — `is:open` / `is:closed` / `state:open` /
88
+  `state:closed`; `is:` and `state:` are aliases.
89
+* `AuthorFilter` — `author:username`.
90
+
91
+Unknown operator-shape tokens (e.g. `language:Go`) fall through as
92
+free text. This keeps future operator additions
93
+backwards-compatible and lets users naturally type ":"-containing strings
94
+without surprises.
95
+
96
+The parser caps input at `MaxQueryBytes` (256) to defend against
97
+pathological-length queries; longer inputs are silently truncated.
98
+
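The parsing rules above can be sketched roughly as follows. This is an illustration, not the code in `query_parse.go`: the `ParsedQuery` and `RepoRef` type names, the lower-case function name, and the choice to honour only the first quoted span are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

const maxQueryBytes = 256 // mirrors MaxQueryBytes from the doc

// RepoRef and ParsedQuery are hypothetical shapes for the parser output.
type RepoRef struct{ Owner, Name string }

type ParsedQuery struct {
	Text         string
	Phrase       string
	RepoFilter   *RepoRef
	StateFilter  string
	AuthorFilter string
}

func parseQuery(raw string) ParsedQuery {
	if len(raw) > maxQueryBytes {
		// Silent byte-level truncation; a real parser may want to avoid
		// splitting a multi-byte rune here.
		raw = raw[:maxQueryBytes]
	}
	var q ParsedQuery
	// Assumption: only the first complete quoted span becomes the phrase.
	if i := strings.IndexByte(raw, '"'); i >= 0 {
		if j := strings.IndexByte(raw[i+1:], '"'); j >= 0 {
			q.Phrase = raw[i+1 : i+1+j]
			raw = raw[:i] + " " + raw[i+j+2:]
		}
	}
	var text []string
	for _, tok := range strings.Fields(raw) {
		if key, val, ok := strings.Cut(tok, ":"); ok {
			switch key {
			case "repo":
				owner, name, found := strings.Cut(val, "/")
				if found && owner != "" && name != "" {
					q.RepoFilter = &RepoRef{Owner: owner, Name: name}
					continue
				}
			case "is", "state":
				if val == "open" || val == "closed" {
					q.StateFilter = val
					continue
				}
			case "author":
				if val != "" {
					q.AuthorFilter = val
					continue
				}
			}
		}
		// Unknown or malformed operators fall through as free text.
		text = append(text, tok)
	}
	q.Text = strings.Join(text, " ")
	return q
}

func main() {
	q := parseQuery(`repo:alice/app is:open author:bob "null pointer" crash language:Go`)
	fmt.Printf("text=%q phrase=%q repo=%+v state=%q author=%q\n",
		q.Text, q.Phrase, *q.RepoFilter, q.StateFilter, q.AuthorFilter)
}
```

In this sketch `language:Go` ends up in `Text` untouched, which is the fall-through behaviour described above.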
99
+## Ranking
100
+
101
+* **Repos**: `ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay`
102
+  where `recency_decay = 1 / (1 + days_since_update / 30)` (the
103
+  spec's day-1 lean). Whole expression lives in SQL so Postgres
104
+  short-circuits on the GIN index.
105
+* **Issues**: `ts_rank_cd * state_weight` with `open = 1.5x` over
106
+  `closed`. The spec doesn't pin the multiplier; 1.5 surfaces
107
+  actionable issues first without burying closed history.
108
+* **Users**: `ts_rank_cd` only. Suspended/deleted users are
109
+  excluded at the WHERE clause so they never taint results.
110
+* **Code**: path hits rank `+1.0` over content hits at the same
111
+  `ts_rank_cd`. Within content hits, `ts_rank_cd` dominates
112
+  trigram similarity.
113
+
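To make the repo and issue formulas concrete, here is the same arithmetic sketched in Go. The real ranking lives in SQL; the function names and parameters below are hypothetical and exist only to show how the factors combine.

```go
package main

import (
	"fmt"
	"math"
)

// repoScore mirrors ts_rank_cd * (1 + ln(1 + star_count)) * recency_decay,
// with recency_decay = 1 / (1 + days_since_update / 30).
func repoScore(tsRank float64, starCount int, daysSinceUpdate float64) float64 {
	starBoost := 1 + math.Log(1+float64(starCount))
	recencyDecay := 1 / (1 + daysSinceUpdate/30)
	return tsRank * starBoost * recencyDecay
}

// issueScore applies the 1.5x weight for open issues; closed issues keep
// their raw ts_rank_cd.
func issueScore(tsRank float64, open bool) float64 {
	if open {
		return tsRank * 1.5
	}
	return tsRank
}

func main() {
	// Same text rank, different popularity and freshness.
	fmt.Println(repoScore(0.5, 0, 0))    // no stars, updated today
	fmt.Println(repoScore(0.5, 0, 30))   // no stars, updated 30 days ago
	fmt.Println(repoScore(0.5, 1000, 0)) // popular, updated today
	fmt.Println(issueScore(0.5, true), issueScore(0.5, false))
}
```

With zero stars and a same-day update both multipliers are 1, so the score equals the raw text rank; a 30-day-old update halves it. The logarithm keeps very popular repos from swamping text relevance.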
114
+## Code-search index lifecycle
115
+
116
+* **Push trigger**: `push:process` enqueues `repo:index_code` when
117
+  a push advances the repo's default branch. The job is idempotent
118
+  and swaps atomically, so pushes that land while a previous index
119
+  run is still in progress are re-triggered on the next push tick.
120
+* **Atomic swap**: the worker runs `DELETE … + INSERT …` for the
121
+  repo in one tx. Readers never see a partial index.
122
+* **Size + textness gates**:
123
+  * Files > 256 KiB → path indexed, content skipped.
124
+  * Files with NUL bytes in the first 8 KiB → treated as binary;
125
+    path indexed, content skipped.
126
+  * Indexed content truncated to 64 KiB so the trigram column
127
+    doesn't bloat for huge text files.
128
+* **Path skiplist**: `vendor/`, `node_modules/`, `dist/`, and anything
129
+  under `.git*` are skipped by default. The `path:` operator is
130
+  post-MVP — when it ships it will let users opt into these
131
+  directories.
132
+* **Reconciler**: `repo:index_reconcile` enqueues a `repo:index_code`
133
+  job for each repo where `default_branch_oid <> last_indexed_oid`.
134
+  Self-throttling (100 repos per tick). Designed to run from cron
135
+  every 5 minutes once the cron framework lands; for now it's
136
+  invocable as a job.
137
+
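The size and textness gates from the list above could be expressed as a single decision function. This is a sketch under the stated thresholds; the constant and function names are hypothetical, and the path skiplist is deliberately left out.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	maxContentFileBytes = 256 << 10 // files larger than this: path only
	binarySniffBytes    = 8 << 10   // window scanned for NUL bytes
	maxIndexedBytes     = 64 << 10  // stored content is cut to this length
)

// indexableContent reports whether a file's content should be indexed and,
// if so, returns the (possibly truncated) bytes to store. The path itself
// is indexed in every case; only the content is gated here.
func indexableContent(data []byte) ([]byte, bool) {
	if len(data) > maxContentFileBytes {
		return nil, false // too large: content skipped
	}
	head := data
	if len(head) > binarySniffBytes {
		head = head[:binarySniffBytes]
	}
	if bytes.IndexByte(head, 0) >= 0 {
		return nil, false // NUL in the first 8 KiB: treated as binary
	}
	if len(data) > maxIndexedBytes {
		data = data[:maxIndexedBytes]
	}
	return data, true
}

func main() {
	text := []byte("package main\n")
	binary := []byte{0x7f, 'E', 'L', 'F', 0x00}
	for _, f := range [][]byte{text, binary} {
		content, ok := indexableContent(f)
		fmt.Println(ok, len(content))
	}
}
```

Checking the size gate first means oversized files are rejected without scanning them, and the NUL check only ever touches the first 8 KiB.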
138
+## Routes
139
+
140
+| Method | Path             | Notes                                       |
141
+|--------|------------------|---------------------------------------------|
142
+| GET    | `/search`        | Full results page with type tabs            |
143
+| GET    | `/search/quick`  | htmx fragment endpoint for top-bar dropdown |
144
+
145
+The top-bar nav embeds a search form pointing at `/search`; the
146
+htmx-driven dropdown wiring is intentionally deferred (the
147
+endpoint exists; the JS to invoke it on keystroke comes when we
148
+add htmx-the-library to the static asset bundle).
149
+
150
+## What we deferred from the spec
151
+
152
+* **Result-HTML caching with viewer-fingerprint key**: the spec's
153
+  30-second cache + `(actor_id, repo_count, last_collab_change_at)`
154
+  fingerprint scheme. The cache key correctness is fiddly enough
155
+  that we want measurements before we ship it. Without the cache
156
+  the per-query cost is dominated by GIN-index lookups, which are
157
+  fast on the synthetic fixture and within budget. **Forward-deferred
158
+  to S36 (perf pass)**.
159
+* **API endpoint** `GET /api/v1/search?q=…&type=…`: the orchestrator
160
+  is API-shaped; the handler wrap is a tiny add. Parking until the
161
+  S33 webhooks sprint pulls in the rest of the API surface so we
162
+  do them together (consistency on auth + body cap + scope shapes).
163
+  **Forward-deferred to S33 / S34 API consolidation.**
164
+* **Quick-dropdown htmx wiring**: the endpoint returns the right
165
+  HTML; the static HTML form submits to `/search` directly. The
166
+  dropdown lights up when we land htmx in the static asset bundle.
167
+* **`path:` operator**: parser falls through; querying `path:foo`
168
+  treats it as free text today. Documented above.
169
+
170
+These are all noted in the S28 status block as well.
171
+
172
+## Pitfalls noted in code
173
+
174
+* **Visibility leak** is the highest-stakes risk. The composer is
175
+  the security boundary; the test suite asserts empty results for
176
+  anon + non-collab against private fixtures.
177
+* **`tsvector` size limits**: the per-document content cap (64 KiB)
178
+  defends against this.
179
+* **Locale / accent**: `unaccent` is in the FTS config chain; tests
180
+  cover this.
181
+* **Tokenizer breakdown on code**: trigram fallback exists in the
182
+  schema (`content_trgm` + `gin_trgm_ops`); the SQL composes both
183
+  the FTS hit and a future trigram-similarity hit (post-MVP — the
184
+  v1 SearchCode runs FTS only, not trigram, because untyped trgm
185
+  similarity needs a per-query threshold and we haven't chosen one
186
+  yet).
187
+* **Index drift**: triggers on issues/repos/users are reliable;
188
+  code index relies on the worker. The reconciler is the safety
189
+  net.
190
+* **Suspended user content**: user-search excludes them via the
191
+  WHERE clause. Issue/PR-search inherits their content via the
192
+  underlying repo's visibility — that matches the spec's "visible
193
+  inside their repo to collaborators" semantics.