tenseleyflow/shithub / c21250d

S25: docs/markdown.md (user) + docs/internal/markdown.md (developer)

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: c21250d1da7fb475eb51517ab5c5a4242cb04753
Parents: a13f95d
Tree: 023390d

2 changed files

| Status | File                      | +   | - |
| ------ | ------------------------- | --- | - |
| A      | docs/internal/markdown.md | 183 | 0 |
| A      | docs/markdown.md          | 112 | 0 |
docs/internal/markdown.md (added)
@@ -0,0 +1,183 @@
+# Markdown pipeline (internal)
+
+S25 ships shithub's canonical markdown renderer. One package owns
+goldmark + bluemonday; every other package routes through
+`markdown.Render`. The boundary is enforced by
+`scripts/lint-markdown-boundary.sh`.
+
+## Architecture
+
+```
+internal/markdown/
+    markdown.go          — package doc + Ref/Mention public types
+    version.go           — Version int32 = 1 (pipeline stamp)
+    opts.go              — Options + Resolvers structs
+    render.go            — Render() entry point + RenderHTML shim
+    sanitize.go          — bluemonday policy
+    markdown_test.go     — XSS fixture suite + golden render tests
+    extensions/
+        extensions.go    — single ASTTransformer for refs/mentions/commits/emoji
+        emoji.go         — curated shortcode → unicode map
+```
+
+## Render pipeline
+
+```
+source bytes
+    │
+    ▼
+[goldmark.Convert]  CommonMark + GFM (tables, strikethrough,
+    │                autolinks, task lists), html.WithUnsafe
+    │                so raw HTML reaches the sanitizer
+    │
+    ▼
+[ASTTransformer]    walks Document, skips code/codespan/link/
+    │                image/HTML subtrees, runs reCombined regex
+    │                on each Text segment, replaces matches with
+    │                Link nodes (mentions/refs/commits) or String
+    │                nodes (emoji + plain-text fallbacks).
+    │
+    ▼
+[bluemonday.SanitizeBytes] strict UGC policy: scheme allowlist
+    │                       (http/https/mailto), <details>/<summary>/
+    │                       <kbd>/<sup>/<sub>, language-* class on
+    │                       code, no <script>/<style>/<iframe>/data:
+    │
+    ▼
+clean HTML bytes
+```
+
+Each Render call builds a fresh `goldmark.Markdown` so the
+extension's per-call Options can plug in without races. The build
+takes on the order of microseconds and removes any cross-render
+contamination of the transformer state.
+
+## Reference resolution
+
+`Options.Resolvers` is the seam between the renderer and the rest
+of the runtime. Each resolver is optional; a nil resolver means
+"render that flavor as plain text" — no link, no error, no
+existence leak.
+
+Visibility gating happens *inside* the resolver. The transformer
+trusts the resolver's `ok` return — if the resolver returns
+`ok=false`, the displayed text passes through as plain text.
+
+The handler/orchestrator that wires the resolver MUST honor:
+
+- `User`: rejects suspended users; respects org-team visibility
+  once S30/S31 land.
+- `Issue`: rejects when the repo isn't visible to the viewer (the
+  S15 `policy.Can(ActionIssueRead)` gate).
+- `Commit`: rejects when the SHA prefix doesn't match a real
+  commit; doesn't care about visibility (commit URLs respect repo
+  visibility at request time).
+
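The nil-resolver and `ok=false` fallbacks described above can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: the `UserResolver` signature, the `renderMention` helper, and the sample usernames are hypothetical, and the real `Options.Resolvers` types may look different.

```go
package main

import (
	"fmt"
	"html"
)

// UserResolver is a hypothetical signature: it maps a username to a URL
// and reports ok=false when the user must not be linked.
type UserResolver func(name string) (url string, ok bool)

// renderMention returns a link when the resolver vouches for the user and
// escaped plain text otherwise. A nil resolver and ok=false produce the
// same output, so the rendered HTML does not reveal whether a user is
// hidden or simply does not exist.
func renderMention(resolve UserResolver, name string) string {
	plain := "@" + html.EscapeString(name)
	if resolve == nil {
		return plain
	}
	url, ok := resolve(name)
	if !ok {
		return plain
	}
	return fmt.Sprintf(`<a href="%s">%s</a>`, html.EscapeString(url), plain)
}

func main() {
	// Visibility gating lives inside the resolver, not in the renderer.
	resolve := func(name string) (string, bool) {
		if name == "alice" {
			return "/alice", true
		}
		return "", false // suspended, hidden, or nonexistent
	}
	fmt.Println(renderMention(resolve, "alice"))
	fmt.Println(renderMention(resolve, "mallory"))
	fmt.Println(renderMention(nil, "alice"))
}
```

The point of the shape is that the renderer never branches on why a lookup failed; it only sees `ok`.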
+## Sanitizer policy decisions
+
+The base is `bluemonday.UGCPolicy()`. We diverge by:
+
+- Allowing `<details>`, `<summary>`, `<kbd>`, `<sup>`, `<sub>`.
+  Common in READMEs; not security-relevant.
+- Allowing `id` on h1-h6 so anchor links work
+  (Goldmark's `WithAutoHeadingID`).
+- Allowing `class` on `<code>`, `<pre>`, `<span>` matching
+  `^(?:language-[A-Za-z0-9_+-]+|chroma|chroma-[a-zA-Z]+|nl|ln|line|hl)$`
+  so Chroma's syntax-highlighted output passes through.
+- Allowing `<input type="checkbox" disabled checked>` — Goldmark's
+  task-list output. The `disabled` and `checked` attributes are
+  HTML boolean attrs; we don't constrain their values.
+- Restricting URL schemes to `http`, `https`, `mailto` (no `data:`,
+  no `javascript:`, no `vbscript:`, no `ftp:`). This is stricter
+  than UGCPolicy's default and is the documented divergence from
+  GitHub.
+
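To see what the `class` pattern in the list above accepts, the small check below compiles the same expression with Go's standard `regexp` package and runs a few sample values through it. The variable name `classAllowed` and the sample class names are illustrative assumptions; how the pattern is attached to the bluemonday policy is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// classAllowed compiles the class pattern quoted in the policy list.
// The variable name is hypothetical.
var classAllowed = regexp.MustCompile(
	`^(?:language-[A-Za-z0-9_+-]+|chroma|chroma-[a-zA-Z]+|nl|ln|line|hl)$`)

func main() {
	samples := []string{
		"language-go",  // fenced-code language tag
		"language-c++", // '+' is inside the character class
		"chroma",       // Chroma wrapper class
		"hl",           // highlighted line
		"language-",    // empty language name
		"btn primary",  // arbitrary styling classes
	}
	for _, c := range samples {
		fmt.Printf("%-16q allowed=%v\n", c, classAllowed.MatchString(c))
	}
}
```

Because the pattern is anchored at both ends, a value that merely contains an allowed class next to something else does not match.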
+## Pipeline-version stamp
+
+`Version` (in `version.go`) is bumped when:
+
+- The sanitizer policy changes (added/removed tag, attribute,
+  scheme).
+- A new AST extension is added or the rendering output changes.
+- Goldmark / bluemonday major-version upgrade with output drift.
+
+Callers store the rendered version alongside `body_html_cached`
+columns. On read, callers compare the stored version to
+`Version`; if they differ, the cache is stale and the caller
+re-renders. We never run a one-shot "re-render every comment" job.
+
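The compare-on-read rule above might look roughly like the following. This is a sketch under assumptions: the `cachedBody` struct, its field names, and the injected `render` function are hypothetical stand-ins for a real row and for `markdown.Render`; only the `Version` constant and the compare-then-re-render behavior come from the text.

```go
package main

import (
	"fmt"
	"strings"
)

// Version stands in for the pipeline stamp in version.go.
const Version int32 = 1

// cachedBody is a hypothetical row shape: the source text, the cached
// HTML, and the pipeline version that produced that HTML.
type cachedBody struct {
	Source      string
	HTML        string
	HTMLVersion int32 // 0 means "never rendered"
}

// bodyHTML returns the cached HTML when its version matches the current
// pipeline version and re-renders otherwise. The second result reports
// whether a re-render happened so the caller knows to persist the row.
func bodyHTML(row *cachedBody, render func(string) string) (string, bool) {
	if row.HTMLVersion == Version {
		return row.HTML, false
	}
	row.HTML = render(row.Source)
	row.HTMLVersion = Version
	return row.HTML, true
}

func main() {
	// Toy renderer used only for the demonstration.
	render := func(src string) string { return "<p>" + strings.TrimSpace(src) + "</p>" }

	row := &cachedBody{Source: "hello", HTML: "<p>stale</p>", HTMLVersion: 0}
	html, rerendered := bodyHTML(row, render)
	fmt.Println(html, rerendered)

	html, rerendered = bodyHTML(row, render)
	fmt.Println(html, rerendered)
}
```

Stale rows are repaired one at a time as they are read, which is why no batch job is needed after a version bump.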
+## Why ASTTransformer instead of inline parsers
+
+Goldmark inline parsers run during the main parse pass and need a
+`Trigger() []byte` plus careful interaction with Goldmark's
+existing inline disambiguation (link/codespan/emphasis priorities).
+
+A single `parser.ASTTransformer` walks the document after parsing,
+visits text nodes whose ancestors aren't code/codespan/link/image/
+HTML, and runs one combined regex per node. The replacement
+inserts `*ast.Link` (mentions/refs/commits) or `*ast.String`
+(emoji + plain fallbacks) before the original text node, then
+removes the original.
+
+Tradeoffs:
+
+- ✅ Simpler than custom inline parsers.
+- ✅ Trivial to add new patterns (extend `reCombined`).
+- ✅ Single-pass — runs once per text node.
+- ❌ Can't span across other inline nodes (e.g. a mention split
+  by `*emphasis*`). That's fine — we don't want that anyway.
+
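The "one combined regex per text node" idea can be illustrated without goldmark by splitting a plain string into typed pieces. Everything named here is an assumption made for the sketch: the pattern assigned to `reCombined` is a simplified stand-in rather than the real expression, and `node` replaces goldmark's AST types. Skipping code, link, and HTML subtrees is not modeled.

```go
package main

import (
	"fmt"
	"regexp"
)

// reCombined is a simplified, hypothetical combined pattern:
// group 1 = mention, group 2 = issue ref, group 3 = commit SHA prefix.
var reCombined = regexp.MustCompile(`@([A-Za-z0-9-]+)|#([0-9]+)|\b([0-9a-f]{7,40})\b`)

// node is a stand-in for the AST nodes the transformer would insert.
type node struct {
	Kind string // "text", "mention", "issue", or "commit"
	Text string
}

// splitSegment walks every match once, emitting plain-text nodes for the
// gaps between matches and a typed node for each match.
func splitSegment(s string) []node {
	var out []node
	last := 0
	for _, m := range reCombined.FindAllStringSubmatchIndex(s, -1) {
		if m[0] > last {
			out = append(out, node{"text", s[last:m[0]]})
		}
		kind := "mention" // group 1 matched unless a later group did
		switch {
		case m[4] >= 0:
			kind = "issue"
		case m[6] >= 0:
			kind = "commit"
		}
		out = append(out, node{kind, s[m[0]:m[1]]})
		last = m[1]
	}
	if last < len(s) {
		out = append(out, node{"text", s[last:]})
	}
	return out
}

func main() {
	for _, n := range splitSegment("ping @alice about #42 in abc1234 now") {
		fmt.Printf("%-8s %q\n", n.Kind, n.Text)
	}
}
```

Adding a new pattern in this shape means adding one alternative to the expression and one case to the switch, which is the "trivial to extend" property listed in the tradeoffs.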
+## Hostile-input fixture suite
+
+`TestRender_HostileInputs` is the test of record. Every fixture
+is an XSS vector; the pass condition is "rendered HTML contains
+no executable JS surface" — no `<script>`, no `href="javascript:"`,
+no `on*` handlers, no `<iframe>` / `<object>` / `<embed>` /
+`<base>` / `<meta>` / `<style>`, no `expression()`, no
+`<annotation-xml>`.
+
+Adding a vector when a CVE / advisory lands is cheap; the test
+file lists ~40 vectors today.
+
+## Performance budget
+
+- 50 KiB body, full extensions: <30 ms p99 on MVP hardware.
+- Inputs above `MaxRenderInputBytes` (1 MiB) are rejected by the
+  defensive check in `Render`. The API layer enforces tighter
+  per-surface caps (64 KiB or 256 KiB depending on context); this
+  is the renderer's last-resort guardrail.
+
+## Refactor pass (S25)
+
+S17/S18/S21/S22/S23/S24 used a per-sprint
+`internal/repos/markdown.RenderHTML` wrapper. S25 deleted that
+package and updated every importer to
+`github.com/tenseleyFlow/shithub/internal/markdown`. The shim
+`markdown.RenderHTML(src) (string, error)` is preserved so the
+swap is a one-line import edit per file.
+
+Callers that want richer behavior — resolved refs, mentions,
+viewer-aware visibility — should call `markdown.Render` directly.
+The interim `RenderHTML` keeps a sensible default (SoftBreakAsBR
+on, no resolvers).
+
+## Lint guard
+
+`scripts/lint-markdown-boundary.sh` fails CI when goldmark or
+bluemonday is imported outside `internal/markdown/`. Test files
+are exempt. Anything else triggers the alarm; the fix is to
+swap the import to the canonical package.
+
+Wired into `make ci` via the `lint-markdown` target.
+
+## Deferred
+
+| Feature                            | Destination |
+| ---------------------------------- | ----------- |
+| KaTeX math rendering               | post-MVP    |
+| Mermaid diagrams                   | post-MVP    |
+| GFM Footnotes                      | post-MVP    |
+| Single-flight cache wrap           | S36 (perf pass) — only if cache-stampede on version bump bites |
+| Per-line Chroma in diff hunks      | S36 (already deferred) |
+| `data:image/...` allowlist (sized) | post-MVP — if README writers complain enough |
+| `@org/team` mentions               | S31 (teams) |
docs/markdown.md (added)
@@ -0,0 +1,112 @@
+# Markdown on shithub
+
+shithub renders user-authored markdown — issue bodies, PR
+descriptions, comments, READMEs — through one canonical pipeline.
+This page documents what's supported and what's deliberately not.
+
+## Supported
+
+### CommonMark + GFM
+
+The full CommonMark spec plus the curated GFM additions:
+
+- Headings (`# Title` through `###### h6`) with auto-generated
+  anchor IDs (`<h1 id="title">`).
+- Paragraphs, soft line breaks (rendered as `<br>` in
+  comment-style contexts; preserved as whitespace in READMEs).
+- Bullet, numbered, and **task lists** (`- [x]` / `- [ ]`).
+- Block quotes, fenced + indented code blocks.
+- Inline `code`, **bold**, *italic*, ~~strikethrough~~.
+- Tables (GFM pipe syntax).
+- Autolinks (`https://example.com` becomes a link automatically).
+
+### Code blocks with syntax highlighting
+
+Fenced code with a language tag turns on Chroma highlighting:
+
+````
+```go
+fmt.Println("hello")
+```
+````
+
+Languages we recognize: every language Chroma supports (~250).
+Unknown languages render as plain `<pre><code>` with no
+highlighting.
+
+### shithub-specific inline patterns
+
+| You write           | We render                                            |
+| ------------------- | ---------------------------------------------------- |
+| `@alice`            | Link to `/alice` if the user exists                  |
+| `#42`               | Link to issue/PR #42 in the current repo, if visible |
+| `alice/proj#42`     | Cross-repo issue/PR link, if visible to you          |
+| `abc1234`           | Commit link in the current repo (7+ hex chars)       |
+| `:rocket:` / `:+1:` | Emoji from a curated set (~150 shortcodes)           |
+
+These patterns *do not match inside code blocks or inline code* —
+`` `#42` `` stays literal.
+
+If a reference can't be resolved (the issue doesn't exist, the
+user doesn't exist, the cross-repo target isn't visible to you),
+we render the text as-is. No broken links, no "deleted" labels,
+no existence leaks.
+
+### Safe HTML (allowlisted)
+
+These tags pass through unchanged:
+
+- `<details>` / `<summary>` (collapsible sections)
+- `<kbd>` (keyboard markers)
+- `<sup>`, `<sub>` (superscript / subscript)
+- Standard text formatting tags Goldmark emits (em, strong, code,
+  pre, blockquote, ul, ol, li, table family).
+
+## Not supported
+
+We deliberately do **not** match GitHub's looser markdown surface:
+
+| Feature                   | Why                                                    |
+| ------------------------- | ------------------------------------------------------ |
+| Raw HTML beyond allowlist | XSS prevention. Anything outside the list is stripped. |
+| `data:` URIs              | Avoids tracking pixels and decompression bombs.        |
+| `javascript:` URLs        | Always XSS.                                            |
+| `<script>`, `<style>`, `<iframe>`, `<object>`, `<embed>`, `<base>`, `<meta>` | XSS / unwanted side effects. |
+| Inline event handlers (`onclick`, `onerror`, etc.) | XSS.                          |
+| Math (KaTeX)              | Post-MVP.                                              |
+| Mermaid diagrams          | Post-MVP.                                              |
+| GFM Footnotes             | Deferred — file an issue if you want them.             |
+
+For inline images, repo-relative paths work via the `/raw/` route:
+`![diagram](docs/img/diagram.png)`. External-host images are also
+allowed; remote tracking pixels are inherent to that — we don't
+proxy.
+
+## Newline handling
+
+There are two render modes, picked per surface:
+
+- **Comment / issue / PR body**: newlines render as `<br>`
+  (matches GitHub's UI). Start a new paragraph by leaving a
+  blank line.
+- **README and other structured docs**: standard CommonMark
+  newline rules (paragraphs separated by blank lines, soft
+  newlines join words).
+
+## Cache + version
+
+Rendered HTML is cached on the source row alongside a pipeline
+version. Bumping the renderer (a sanitizer-policy change, a new
+extension, a major Goldmark/bluemonday upgrade with output drift)
+re-renders comments lazily on next read — we never run a "re-render
+every comment" batch.
+
+## Contributing
+
+Markdown changes go through `internal/markdown/`. The boundary is
+enforced: importing `goldmark` or `bluemonday` outside that
+package fails CI (`scripts/lint-markdown-boundary.sh`).
+
+If a new XSS vector lands in the wild, add a fixture to
+`internal/markdown/markdown_test.go::TestRender_HostileInputs` and
+fix the policy.