tenseleyflow/shithub / 9898ce8

S17: docs/internal/code-tab.md

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: 9898ce8e7cfc7733490af4ab829d1651af61ba34
Parents: e97d889
Tree: 17f10dd

1 changed file

Status  File                       +    -
A       docs/internal/code-tab.md  146  0

docs/internal/code-tab.md (added)
@@ -0,0 +1,146 @@
+# Code tab
+
+The code tab is the GitHub-style repo browser: tree listing, blob view
+with syntax highlighting, raw view, "Go to file" finder, and the
+branch/tag switcher. After a successful push, hitting `/{owner}/{repo}`
+sends the user to `/tree/{default_branch}`.
+
+## Routes
+
+| Route                                            | Handler                          |
+| ------------------------------------------------ | -------------------------------- |
+| `GET /{owner}/{repo}`                            | redirects to `/tree/{default}`   |
+| `GET /{owner}/{repo}/tree/{ref}/{path...}`       | `codeTree`                       |
+| `GET /{owner}/{repo}/blob/{ref}/{path...}`       | `codeBlob`                       |
+| `GET /{owner}/{repo}/raw/{ref}/{path...}`        | `codeRaw`                        |
+| `GET /{owner}/{repo}/find/{ref}?q=...`           | `codeFinder`                     |
+| `GET /static/css/chroma.css`                     | runtime-generated Chroma theme   |
+
+Every code-tab handler runs through `policy.Can(... ActionRepoRead)`, so
+private repos are hidden from anonymous viewers and unrelated users by
+the existence-leak 404 guard from S15.
+
+## Ref + path disambiguation
+
+The ref and path are captured together by the chi `*` wildcard, so for
+the URL `/tree/feature/x/sub/file.go` the handler receives the single
+string `feature/x/sub/file.go`. Resolution:
+
+1. If the first segment is exactly 40 hex chars → treat as a SHA, the
+   rest is the path.
+2. Otherwise, longest-prefix match against the cached ref list
+   (branches first, then tags, sorted longest-first). The remainder
+   after the matched ref is the in-tree path.
+
+This resolves `release/v1.0/beta/CHANGELOG.md` unambiguously.
+Resolution lives in `internal/repos/git/treeops.go::ResolveRef`.
+
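The two resolution steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `ResolveRef`: the names `splitRefPath` and `isFullSHA` are hypothetical, the ref list is passed in as plain slices rather than read from a cache, and the sketch follows the numbered order (SHA check first) without modelling any collision handling.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isFullSHA reports whether s is exactly 40 hexadecimal characters.
func isFullSHA(s string) bool {
	if len(s) != 40 {
		return false
	}
	for _, c := range s {
		isHex := (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
		if !isHex {
			return false
		}
	}
	return true
}

// splitRefPath splits the wildcard remainder into (ref, path).
// branches and tags stand in for the cached ref list (hypothetical inputs).
func splitRefPath(rest string, branches, tags []string) (ref, path string, ok bool) {
	// Step 1: a 40-hex first segment is treated as a commit SHA.
	first, tail, _ := strings.Cut(rest, "/")
	if isFullSHA(first) {
		return first, tail, true
	}
	// Step 2: branches before tags; within each group, longest name first.
	byLen := func(in []string) []string {
		out := append([]string(nil), in...)
		sort.SliceStable(out, func(i, j int) bool { return len(out[i]) > len(out[j]) })
		return out
	}
	for _, name := range append(byLen(branches), byLen(tags)...) {
		if rest == name {
			return name, "", true
		}
		if strings.HasPrefix(rest, name+"/") {
			return name, rest[len(name)+1:], true
		}
	}
	return "", "", false
}

func main() {
	branches := []string{"main", "release", "release/v1.0"}
	tags := []string{"v1.0"}
	ref, path, _ := splitRefPath("release/v1.0/beta/CHANGELOG.md", branches, tags)
	fmt.Println(ref, path) // prints "release/v1.0 beta/CHANGELOG.md"
}
```

Matching on `name + "/"` rather than a bare prefix keeps a branch called `release` from swallowing a request for `releases/notes.md`.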
+Path validation rejects `..`, control characters, leading slashes, and
+backslashes, as defense in depth on top of git's own validation.
+
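A validator along these lines might look like the sketch below. `checkSubpath` and `errBadPath` are hypothetical names, and the sketch assumes `..` is rejected per path segment (so `a..b` passes); the real check may be stricter.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

var errBadPath = errors.New("invalid path")

// checkSubpath rejects the four classes of input named above.
// An empty path is accepted and stands for the repository root.
func checkSubpath(p string) error {
	if strings.HasPrefix(p, "/") || strings.Contains(p, `\`) {
		return errBadPath
	}
	for _, r := range p {
		if unicode.IsControl(r) {
			return errBadPath
		}
	}
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return errBadPath
		}
	}
	return nil
}

func main() {
	for _, p := range []string{"docs/internal/code-tab.md", "../etc/passwd", "/abs"} {
		fmt.Printf("%q -> %v\n", p, checkSubpath(p))
	}
}
```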
+## Tree listing
+
+`git ls-tree --long --full-tree <ref>:<path>` is parsed into typed
+`TreeEntry` values (`tree | blob | commit | symlink`). Entries are
+sorted directories first, then files alphabetically.
+
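As a rough sketch of the parse-then-sort step: the `TreeEntry` fields here are hypothetical (the real type likely carries mode, OID, and size), the parser assumes unquoted paths (no `-z` handling), and directories are assumed to be ordered alphabetically among themselves.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TreeEntry is a hypothetical shape used only for this illustration.
type TreeEntry struct {
	Name string
	Kind string // "tree", "blob", "commit", or "symlink"
}

// parseLongLine parses one line of `git ls-tree --long` output:
//
//	<mode> SP <type> SP <object> SP+ <size> TAB <path>
func parseLongLine(line string) (TreeEntry, error) {
	meta, name, found := strings.Cut(line, "\t")
	fields := strings.Fields(meta)
	if !found || len(fields) != 4 {
		return TreeEntry{}, fmt.Errorf("malformed ls-tree line: %q", line)
	}
	kind := fields[1]
	if fields[0] == "120000" {
		kind = "symlink" // git reports symlinks as blobs with mode 120000
	}
	return TreeEntry{Name: name, Kind: kind}, nil
}

// sortEntries puts directories first, then everything else,
// each group in byte-wise name order.
func sortEntries(entries []TreeEntry) {
	sort.SliceStable(entries, func(i, j int) bool {
		di, dj := entries[i].Kind == "tree", entries[j].Kind == "tree"
		if di != dj {
			return di
		}
		return entries[i].Name < entries[j].Name
	})
}

func main() {
	lines := []string{
		"100644 blob 1111111111111111111111111111111111111111     120\tREADME.md",
		"040000 tree 2222222222222222222222222222222222222222       -\tinternal",
		"120000 blob 3333333333333333333333333333333333333333      11\tLICENSE",
		"040000 tree 4444444444444444444444444444444444444444       -\tcmd",
	}
	var entries []TreeEntry
	for _, l := range lines {
		e, err := parseLongLine(l)
		if err != nil {
			panic(err)
		}
		entries = append(entries, e)
	}
	sortEntries(entries)
	for _, e := range entries {
		fmt.Println(e.Kind, e.Name)
	}
}
```

Splitting on the tab first keeps file names that contain spaces intact, since only the metadata half is passed to `strings.Fields`.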
+S17 ships without the htmx-driven "last commit per entry" column that
+the spec describes. The column costs an extra round-trip and can be
+added later without a schema change; for now the page renders the
+listing immediately. **Deferred to S18 (commits-per-entry)**: the spec
+calls out this deferral path, and the tree template has the column
+slot ready.
+
+## File view
+
+`codeBlob` handles four cases:
+
+* **Large** (>1 MiB): placeholder + raw download link, no body read.
+* **Binary** (NUL byte in first 8 KiB): placeholder. Image extensions
+  (png/jpg/jpeg/gif/webp) ≤5 MiB get an `<img>` preview pointing at
+  `/raw/...`.
+* **Markdown** (`.md`/`.markdown`): Goldmark + bluemonday rendered HTML
+  plus a `<details>` source-toggle with the highlighted source.
+* **Default text**: Chroma highlight by filename extension, with
+  content sniffing as the fallback.
+
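The case split above could be expressed as a small classifier. This is only a sketch: `classify` and the `viewKind` constants are hypothetical names, and the text does not say how the 1 MiB cap interacts with the 5 MiB image limit, so the sketch assumes the image-extension check runs first (it needs no body read because the preview points at the raw route).

```go
package main

import (
	"bytes"
	"fmt"
	"path"
	"strings"
)

const (
	maxInlineSize = 1 << 20 // 1 MiB: larger blobs get a placeholder
	maxImageSize  = 5 << 20 // 5 MiB: larger images get no preview
	sniffLen      = 8 << 10 // 8 KiB: window scanned for a NUL byte
)

type viewKind string

const (
	viewImage    viewKind = "image"
	viewLarge    viewKind = "large"
	viewBinary   viewKind = "binary"
	viewMarkdown viewKind = "markdown"
	viewText     viewKind = "text"
)

// classify picks a rendering mode from the file name, the blob size,
// and the first bytes of the body (head may be nil when nothing was read).
func classify(name string, size int64, head []byte) viewKind {
	ext := strings.ToLower(path.Ext(name))
	switch ext {
	case ".png", ".jpg", ".jpeg", ".gif", ".webp":
		if size <= maxImageSize {
			return viewImage // assumed ordering: checked before the size cap
		}
	}
	if size > maxInlineSize {
		return viewLarge
	}
	if len(head) > sniffLen {
		head = head[:sniffLen]
	}
	if bytes.IndexByte(head, 0) >= 0 {
		return viewBinary
	}
	if ext == ".md" || ext == ".markdown" {
		return viewMarkdown
	}
	return viewText
}

func main() {
	fmt.Println(classify("logo.png", 3<<20, nil))
	fmt.Println(classify("dump.sql", 2<<20, nil))
	fmt.Println(classify("README.md", 64, []byte("# Code tab")))
}
```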
+Chroma uses the `github` style baked at process start; the CSS is
+served from `/static/css/chroma.css` via a tiny in-process generator.
+
+## Raw view
+
+* Content-Type derived from the extension whitelist
+  (`code.go::rawContentType`).
+* `X-Content-Type-Options: nosniff` always.
+* `Content-Security-Policy: default-src 'none'; sandbox` at the
+  handler level. The global SecureHeaders middleware may overlay a
+  broader CSP; if both headers reach the client, user agents enforce
+  every policy at once, so a load must be allowed by both (the
+  effective policy is the intersection, not the union).
+* **`Content-Disposition: attachment`** is forced for HTML, SVG, JS,
+  WASM, and anything that could execute on shithub's domain. We don't
+  have a separate `raw.shithub.tld` host yet (post-MVP); attachment is
+  the safety belt.
+* Streamed via `git cat-file -p`; never buffered. Large blobs don't
+  blow up the worker's memory.
+
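The header rules in this list can be sketched as below. `rawHeaders`, `writeRawHeaders`, and the `inlineTypes` allowlist are hypothetical and are not the contents of `rawContentType`; the sketch assumes anything outside the allowlist (which deliberately omits HTML, SVG, JS, and WASM) is served as a generic download.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"path"
	"strings"
)

// inlineTypes is an illustrative allowlist of extensions that are safe
// to display inline. Executable formats are intentionally absent.
var inlineTypes = map[string]string{
	".txt": "text/plain; charset=utf-8",
	".md":  "text/plain; charset=utf-8",
	".go":  "text/plain; charset=utf-8",
	".png": "image/png",
	".jpg": "image/jpeg",
}

// rawHeaders returns the response headers for a raw blob named name.
func rawHeaders(name string) map[string]string {
	h := map[string]string{
		"X-Content-Type-Options":  "nosniff",
		"Content-Security-Policy": "default-src 'none'; sandbox",
	}
	ctype, known := inlineTypes[strings.ToLower(path.Ext(name))]
	if !known {
		// Unknown or risky type: generic bytes, forced download.
		ctype = "application/octet-stream"
		h["Content-Disposition"] = "attachment"
	}
	h["Content-Type"] = ctype
	return h
}

// writeRawHeaders applies rawHeaders to a response before the body is streamed.
func writeRawHeaders(w http.ResponseWriter, name string) {
	for k, v := range rawHeaders(name) {
		w.Header().Set(k, v)
	}
}

func main() {
	rec := httptest.NewRecorder()
	writeRawHeaders(rec, "page.html")
	fmt.Println(rec.Header().Get("Content-Type"))
	fmt.Println(rec.Header().Get("Content-Disposition"))
}
```

Defaulting to a download for anything not explicitly allowed means a newly risky extension fails closed rather than open.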
+## Finder ("Go to file")
+
+`/find/{ref}` lists every blob path on the ref via
+`git ls-tree -r --name-only`, then filters with
+`internal/repos/finder/finder.go::Filter`. The matcher is a
+subsequence-with-bonus scorer (boundary, consecutive run, basename
+hit). It is not as fancy as VS Code's Quick Open but good enough for
+tens of thousands of paths.
+
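To make the scoring idea concrete, here is a minimal subsequence scorer with the three bonuses named above. It is a sketch, not the code in `finder.go`: the names `score` and `filter`, the bonus weights, and the greedy leftmost matching are all assumptions, and the byte-wise comparison treats paths as ASCII.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// score reports whether query is a case-insensitive subsequence of path
// and, if so, how many points the match earns.
func score(query, path string) (int, bool) {
	q, p := strings.ToLower(query), strings.ToLower(path)
	base := strings.LastIndex(p, "/") + 1 // index where the basename starts
	points, qi, prev := 0, 0, -2
	for i := 0; i < len(p) && qi < len(q); i++ {
		if p[i] != q[qi] {
			continue
		}
		points++ // one point per matched character
		if i == 0 || strings.ContainsRune("/_-.", rune(p[i-1])) {
			points += 3 // boundary: match starts a segment or word
		}
		if i == prev+1 {
			points += 2 // consecutive run: adjacent to the previous match
		}
		if i >= base {
			points++ // basename hit
		}
		prev = i
		qi++
	}
	return points, qi == len(q)
}

// filter keeps the paths that match query, best score first.
func filter(query string, paths []string) []string {
	type hit struct {
		path string
		pts  int
	}
	var hits []hit
	for _, p := range paths {
		if pts, ok := score(query, p); ok {
			hits = append(hits, hit{p, pts})
		}
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].pts > hits[j].pts })
	out := make([]string, len(hits))
	for i, h := range hits {
		out[i] = h.path
	}
	return out
}

func main() {
	paths := []string{"docs/storage.md", "internal/repos/git/treeops.go", "README.md"}
	fmt.Println(filter("tree", paths))
}
```

A greedy leftmost match is cheap but can miss a higher-scoring alignment later in the path; a scorer that wants the best alignment would need dynamic programming over match positions.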
+The keyboard shortcut and htmx live filtering are spec deliverables
+that we defer for now. The form-submission flow works without JS, and
+that's the floor S17 commits to.
+
+## Caching
+
+Currently **no caching layer**. Every request runs `git for-each-ref`,
+`git ls-tree`, etc. That's fine for small-to-medium repos; the cost
+shows up on big repos with deep trees. The S17 spec proposes a cache
+keyed on `(repo_id, ref_oid, dir_path)` invalidated on push (S14's
+`push:process` job is the right invalidation hook).
+
+**Deferred**: the cache is purely performance polish. When we hit a
+real-world repo where it matters, wire it in under `internal/cache/`
+plus a callback in `worker/jobs/push_process.go`. The handlers already
+take a per-request `policy.Cache`, so adding a per-process git cache is
+mechanically straightforward.
+
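For orientation, a per-process cache with the proposed key could be as small as the sketch below. Nothing here exists in the codebase yet: `TreeCache`, `treeKey`, and `InvalidateRepo` are hypothetical names, the cached value is simplified to a slice of entry names, and there is no size bound or eviction beyond the push-time invalidation.

```go
package main

import (
	"fmt"
	"sync"
)

// treeKey mirrors the proposed (repo_id, ref_oid, dir_path) cache key.
type treeKey struct {
	RepoID  int64
	RefOID  string
	DirPath string
}

// TreeCache is an unbounded in-memory map guarded by a RWMutex.
type TreeCache struct {
	mu      sync.RWMutex
	entries map[treeKey][]string
}

func NewTreeCache() *TreeCache {
	return &TreeCache{entries: make(map[treeKey][]string)}
}

// Get returns the cached listing for k, if any.
func (c *TreeCache) Get(k treeKey) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.entries[k]
	return v, ok
}

// Put stores a listing under k.
func (c *TreeCache) Put(k treeKey, listing []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = listing
}

// InvalidateRepo drops every entry for one repository. A push-processing
// job would call this after it has handled a push to that repository.
func (c *TreeCache) InvalidateRepo(repoID int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if k.RepoID == repoID {
			delete(c.entries, k)
		}
	}
}

func main() {
	cache := NewTreeCache()
	key := treeKey{RepoID: 1, RefOID: "abc123", DirPath: "docs/internal"}
	cache.Put(key, []string{"code-tab.md"})
	listing, ok := cache.Get(key)
	fmt.Println(listing, ok)
	cache.InvalidateRepo(1)
	_, ok = cache.Get(key)
	fmt.Println(ok)
}
```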
+## Pitfalls + protections
+
+* **XSS via raw HTML/SVG**: blocked by `Content-Disposition: attachment`
+  for those extensions.
+* **XSS via markdown**: Goldmark configured without HTML passthrough +
+  bluemonday's UGC policy on top. Tests in `internal/repos/markdown/`
+  (TODO: minimal coverage today).
+* **Path traversal**: `validateSubpath` in `code.go` rejects `..`,
+  control characters, leading slashes, and backslashes.
+* **Hex collision with SHA**: ref-list lookup wins over the SHA
+  shortcut when the same string is both a ref name and a 40-hex SHA.
+* **Encoding (GBK / Shift-JIS)**: TODO. Text files outside UTF-8 may
+  render garbled. The body is rendered as-is; a future commit can
+  add `golang.org/x/text/encoding` autodetection.
+
+## Dependencies
+
+* `github.com/alecthomas/chroma/v2`: syntax highlighting
+* `github.com/yuin/goldmark`: CommonMark + GFM
+* `github.com/microcosm-cc/bluemonday`: HTML sanitizer
+
+## Deferred polish (tracked, not blocking)
+
+These items are spec deliverables we will ship in a later pass:
+
+* **Last-commit-per-entry column** with htmx lazy load and pre-walked
+  `git log --name-status` cache → wire into S18 (commit history) where
+  the same walk powers the per-file history page.
+* **Tree caching keyed on (repo_id, ref_oid, dir_path)**, push-event
+  invalidation → wire into S36 (performance pass) once we have a real
+  workload to measure.
+* **Pagination at 1000 entries per directory** → cosmetic for huge
+  trees; add when someone hits `node_modules`-grade inflation.
+* **Encoding detection for non-UTF-8 source files** → file reads are
+  defensive (`io.LimitReader` + size cap); render quality is the only
+  loss until this lands.