@@ -0,0 +1,183 @@ |
| 1 | +# Markdown pipeline (internal) |
| 2 | + |
| 3 | +S25 ships shithub's canonical markdown renderer. One package owns |
| 4 | +goldmark + bluemonday; every other package routes through |
| 5 | +`markdown.Render`. The boundary is enforced by |
| 6 | +`scripts/lint-markdown-boundary.sh`. |
| 7 | + |
| 8 | +## Architecture |
| 9 | + |
| 10 | +``` |
| 11 | +internal/markdown/ |
| 12 | + markdown.go — package doc + Ref/Mention public types |
| 13 | + version.go — Version int32 = 1 (pipeline stamp) |
| 14 | + opts.go — Options + Resolvers structs |
| 15 | + render.go — Render() entry point + RenderHTML shim |
| 16 | + sanitize.go — bluemonday policy |
| 17 | + markdown_test.go — XSS fixture suite + golden render tests |
| 18 | + extensions/ |
| 19 | + extensions.go — single ASTTransformer for refs/mentions/commits/emoji |
| 20 | + emoji.go — curated shortcode → unicode map |
| 21 | +``` |
| 22 | + |
| 23 | +## Render pipeline |
| 24 | + |
| 25 | +``` |
| 26 | +source bytes |
| 27 | + │ |
| 28 | + ▼ |
| 29 | +[goldmark.Convert] CommonMark + GFM (tables, strikethrough, |
| 30 | + │ autolinks, task lists), html.WithUnsafe |
| 31 | + │ so raw HTML reaches the sanitizer |
| 32 | + │ |
| 33 | + ▼ |
| 34 | +[ASTTransformer] walks Document, skips code/codespan/link/ |
| 35 | + │ image/HTML subtrees, runs reCombined regex |
| 36 | + │ on each Text segment, replaces matches with |
| 37 | + │ Link nodes (mentions/refs/commits) or String |
| 38 | + │ nodes (emoji + plain-text fallbacks). |
| 39 | + │ |
| 40 | + ▼ |
| 41 | +[bluemonday.SanitizeBytes] strict UGC policy: scheme allowlist |
| 42 | + │ (http/https/mailto), <details>/<summary>/ |
| 43 | + │ <kbd>/<sup>/<sub>, language-* class on |
| 44 | + │ code, no <script>/<style>/<iframe>/data: |
| 45 | + │ |
| 46 | + ▼ |
| 47 | +clean HTML bytes |
| 48 | +``` |
| 49 | + |
| 50 | +Each Render call builds a fresh `goldmark.Markdown` so the |
| 51 | +extension's per-call Options can plug in without races. The build |
| 52 | +is ~µs and removes any cross-render contamination of the |
| 53 | +transformer state. |
| 54 | + |
| 55 | +## Reference resolution |
| 56 | + |
| 57 | +`Options.Resolvers` is the seam between the renderer and the rest |
| 58 | +of the runtime. Each resolver is optional; a nil resolver means |
| 59 | +"render that flavor as plain text" — no link, no error, no |
| 60 | +existence leak. |
| 61 | + |
| 62 | +Visibility gating happens *inside* the resolver. The transformer |
| 63 | +trusts the resolver's `ok` return — if the resolver returns |
| 64 | +`ok=false`, the displayed text passes through as plain text. |
| 65 | + |
| 66 | +The handler/orchestrator that wires the resolver MUST honor: |
| 67 | + |
| 68 | +- `User`: rejects suspended users; respects org-team visibility |
| 69 | + once S30/S31 land. |
| 70 | +- `Issue`: rejects when the repo isn't visible to the viewer (the |
| 71 | + S15 `policy.Can(ActionIssueRead)` gate). |
| 72 | +- `Commit`: rejects when the SHA prefix doesn't match a real |
| 73 | + commit; doesn't care about visibility (commit URLs respect repo |
| 74 | + visibility at request time). |
| 75 | + |
| 76 | +## Sanitizer policy decisions |
| 77 | + |
| 78 | +The base is `bluemonday.UGCPolicy()`. We diverge by: |
| 79 | + |
| 80 | +- Allowing `<details>`, `<summary>`, `<kbd>`, `<sup>`, `<sub>`. |
| 81 | + Common in READMEs; not security-relevant. |
| 82 | +- Allowing `id` on h1-h6 so anchor links work |
| 83 | + (Goldmark's `WithAutoHeadingID`). |
| 84 | +- Allowing `class` on `<code>`, `<pre>`, `<span>` matching |
| 85 | + `^(?:language-[A-Za-z0-9_+-]+|chroma|chroma-[a-zA-Z]+|nl|ln|line|hl)$` |
| 86 | + so Chroma's syntax-highlighted output passes through. |
| 87 | +- Allowing `<input type="checkbox" disabled checked>` — Goldmark's |
| 88 | + task-list output. The `disabled` and `checked` attributes are |
| 89 | + HTML boolean attrs; we don't constrain their values. |
| 90 | +- Restricting URL schemes to `http`, `https`, `mailto` (no `data:`, |
| 91 | + no `javascript:`, no `vbscript:`, no `ftp:`). This is stricter |
| 92 | + than UGCPolicy's default and is the documented divergence from |
| 93 | + GitHub. |
| 94 | + |
| 95 | +## Pipeline-version stamp |
| 96 | + |
| 97 | +`Version` (in `version.go`) is bumped when: |
| 98 | + |
| 99 | +- The sanitizer policy changes (added/removed tag, attribute, |
| 100 | + scheme). |
| 101 | +- A new AST extension or rendering output change. |
| 102 | +- Goldmark / bluemonday major-version upgrade with output drift. |
| 103 | + |
| 104 | +Callers store the rendered version alongside `body_html_cached` |
| 105 | +columns. On read, callers compare the stored version to |
| 106 | +`Version`; if they differ, the cache is stale and the caller |
| 107 | +re-renders. We never run a one-shot "re-render every comment" job. |
| 108 | + |
| 109 | +## Why ASTTransformer instead of inline parsers |
| 110 | + |
| 111 | +Goldmark inline parsers run during the main parse pass and need a |
| 112 | +`Trigger() []byte` plus careful interaction with Goldmark's |
| 113 | +existing inline disambiguation (link/codespan/emphasis priorities). |
| 114 | + |
| 115 | +A single `parser.ASTTransformer` walks the document after parsing, |
| 116 | +visits text nodes whose ancestors aren't code/codespan/link/image/ |
| 117 | +HTML, and runs one combined regex per node. The replacement |
| 118 | +inserts `*ast.Link` (mentions/refs/commits) or `*ast.String` |
| 119 | +(emoji + plain fallbacks) before the original text node, then |
| 120 | +removes the original. |
| 121 | + |
| 122 | +Tradeoffs: |
| 123 | + |
| 124 | +- ✅ Simpler than custom inline parsers. |
| 125 | +- ✅ Trivial to add new patterns (extend `reCombined`). |
| 126 | +- ✅ Single-pass — runs once per text node. |
| 127 | +- ❌ Can't span across other inline nodes (e.g. a mention split |
| 128 | + by `*emphasis*`). That's fine — we don't want that anyway. |
| 129 | + |
| 130 | +## Hostile-input fixture suite |
| 131 | + |
| 132 | +`TestRender_HostileInputs` is the test of record. Every fixture |
| 133 | +is an XSS vector; the pass condition is "rendered HTML contains |
| 134 | +no executable JS surface" — no `<script>`, no `href="javascript:"`, |
| 135 | +no `on*` handlers, no `<iframe>` / `<object>` / `<embed>` / |
| 136 | +`<base>` / `<meta>` / `<style>`, no `expression()`, no |
| 137 | +`<annotation-xml>`. |
| 138 | + |
| 139 | +Adding a vector when a CVE / advisory lands is cheap; the test |
| 140 | +file lists ~40 vectors today. |
| 141 | + |
| 142 | +## Performance budget |
| 143 | + |
| 144 | +- 50 KiB body, full extensions: <30 ms p99 on MVP hardware. |
| 145 | +- Inputs above `MaxRenderInputBytes` (1 MiB) are rejected by the |
| 146 | + defensive check in `Render`. The API layer enforces tighter |
| 147 | + per-surface caps (64 KiB or 256 KiB depending on context); this |
| 148 | + is the renderer's last-resort guardrail. |
| 149 | + |
| 150 | +## Refactor pass (S25) |
| 151 | + |
| 152 | +S17/S18/S21/S22/S23/S24 used a per-sprint |
| 153 | +`internal/repos/markdown.RenderHTML` wrapper. S25 deleted that |
| 154 | +package and updated every importer to |
| 155 | +`github.com/tenseleyFlow/shithub/internal/markdown`. The shim |
| 156 | +`markdown.RenderHTML(src) (string, error)` is preserved so the |
| 157 | +swap is a one-line import edit per file. |
| 158 | + |
| 159 | +Callers that want richer behavior — resolved refs, mentions, |
| 160 | +viewer-aware visibility — should call `markdown.Render` directly. |
| 161 | +The interim `RenderHTML` keeps a sensible default (SoftBreakAsBR |
| 162 | +on, no resolvers). |
| 163 | + |
| 164 | +## Lint guard |
| 165 | + |
| 166 | +`scripts/lint-markdown-boundary.sh` fails CI when goldmark or |
| 167 | +bluemonday is imported outside `internal/markdown/`. Test files |
| 168 | +are exempt. Anything else triggers the alarm; the fix is to |
| 169 | +swap the import to the canonical package. |
| 170 | + |
| 171 | +Wired into `make ci` via the `lint-markdown` target. |
| 172 | + |
| 173 | +## Deferred |
| 174 | + |
| 175 | +| Feature | Destination | |
| 176 | +| ------------------------ | ----------- | |
| 177 | +| KaTeX math rendering | post-MVP | |
| 178 | +| Mermaid diagrams | post-MVP | |
| 179 | +| GFM Footnotes | post-MVP | |
| 180 | +| Single-flight cache wrap | S36 (perf pass) — only if cache-stampede on version bump bites | |
| 181 | +| Per-line Chroma in diff hunks | S36 (already deferred) | |
| 182 | +| `data:image/...` allowlist (sized) | post-MVP — if README writers complain enough | |
| 183 | +| `@org/team` mentions | S31 (teams) | |