@@ -0,0 +1,183 @@ |
| | 1 | +# Markdown pipeline (internal) |
| | 2 | + |
| | 3 | +S25 ships shithub's canonical markdown renderer. One package owns |
| | 4 | +goldmark + bluemonday; every other package routes through |
| | 5 | +`markdown.Render`. The boundary is enforced by |
| | 6 | +`scripts/lint-markdown-boundary.sh`. |
| | 7 | + |
| | 8 | +## Architecture |
| | 9 | + |
| | 10 | +``` |
| | 11 | +internal/markdown/ |
| | 12 | + markdown.go — package doc + Ref/Mention public types |
| | 13 | + version.go — Version int32 = 1 (pipeline stamp) |
| | 14 | + opts.go — Options + Resolvers structs |
| | 15 | + render.go — Render() entry point + RenderHTML shim |
| | 16 | + sanitize.go — bluemonday policy |
| | 17 | + markdown_test.go — XSS fixture suite + golden render tests |
| | 18 | + extensions/ |
| | 19 | + extensions.go — single ASTTransformer for refs/mentions/commits/emoji |
| | 20 | + emoji.go — curated shortcode → unicode map |
| | 21 | +``` |
| | 22 | + |
| | 23 | +## Render pipeline |
| | 24 | + |
| | 25 | +``` |
| | 26 | +source bytes |
| | 27 | + │ |
| | 28 | + ▼ |
| | 29 | +[goldmark.Convert] CommonMark + GFM (tables, strikethrough, |
| | 30 | + │ autolinks, task lists), html.WithUnsafe |
| | 31 | + │ so raw HTML reaches the sanitizer |
| | 32 | + │ |
| | 33 | + ▼ |
| | 34 | +[ASTTransformer] walks Document, skips code/codespan/link/ |
| | 35 | + │ image/HTML subtrees, runs reCombined regex |
| | 36 | + │ on each Text segment, replaces matches with |
| | 37 | + │ Link nodes (mentions/refs/commits) or String |
| | 38 | + │ nodes (emoji + plain-text fallbacks). |
| | 39 | + │ |
| | 40 | + ▼ |
| | 41 | +[bluemonday.SanitizeBytes] strict UGC policy: scheme allowlist |
| | 42 | + │ (http/https/mailto), <details>/<summary>/ |
| | 43 | + │ <kbd>/<sup>/<sub>, language-* class on |
| | 44 | + │ code, no <script>/<style>/<iframe>/data: |
| | 45 | + │ |
| | 46 | + ▼ |
| | 47 | +clean HTML bytes |
| | 48 | +``` |
| | 49 | + |
| | 50 | +Each Render call builds a fresh `goldmark.Markdown` so the |
| | 51 | +extension's per-call Options can plug in without races. The build |
| | 52 | +is ~µs and removes any cross-render contamination of the |
| | 53 | +transformer state. |
| | 54 | + |
| | 55 | +## Reference resolution |
| | 56 | + |
| | 57 | +`Options.Resolvers` is the seam between the renderer and the rest |
| | 58 | +of the runtime. Each resolver is optional; a nil resolver means |
| | 59 | +"render that flavor as plain text" — no link, no error, no |
| | 60 | +existence leak. |
| | 61 | + |
| | 62 | +Visibility gating happens *inside* the resolver. The transformer |
| | 63 | +trusts the resolver's `ok` return — if the resolver returns |
| | 64 | +`ok=false`, the displayed text passes through as plain text. |
| | 65 | + |
| | 66 | +The handler/orchestrator that wires the resolver MUST honor: |
| | 67 | + |
| | 68 | +- `User`: rejects suspended users; respects org-team visibility |
| | 69 | + once S30/S31 land. |
| | 70 | +- `Issue`: rejects when the repo isn't visible to the viewer (the |
| | 71 | + S15 `policy.Can(ActionIssueRead)` gate). |
| | 72 | +- `Commit`: rejects when the SHA prefix doesn't match a real |
| | 73 | + commit; doesn't care about visibility (commit URLs respect repo |
| | 74 | + visibility at request time). |
| | 75 | + |
| | 76 | +## Sanitizer policy decisions |
| | 77 | + |
| | 78 | +The base is `bluemonday.UGCPolicy()`. We diverge by: |
| | 79 | + |
| | 80 | +- Allowing `<details>`, `<summary>`, `<kbd>`, `<sup>`, `<sub>`. |
| | 81 | + Common in READMEs; not security-relevant. |
| | 82 | +- Allowing `id` on h1-h6 so anchor links work |
| | 83 | + (Goldmark's `WithAutoHeadingID`). |
| | 84 | +- Allowing `class` on `<code>`, `<pre>`, `<span>` matching |
| | 85 | + `^(?:language-[A-Za-z0-9_+-]+|chroma|chroma-[a-zA-Z]+|nl|ln|line|hl)$` |
| | 86 | + so Chroma's syntax-highlighted output passes through. |
| | 87 | +- Allowing `<input type="checkbox" disabled checked>` — Goldmark's |
| | 88 | + task-list output. The `disabled` and `checked` attributes are |
| | 89 | + HTML boolean attrs; we don't constrain their values. |
| | 90 | +- Restricting URL schemes to `http`, `https`, `mailto` (no `data:`, |
| | 91 | + no `javascript:`, no `vbscript:`, no `ftp:`). This is stricter |
| | 92 | + than UGCPolicy's default and is the documented divergence from |
| | 93 | + GitHub. |
| | 94 | + |
| | 95 | +## Pipeline-version stamp |
| | 96 | + |
| | 97 | +`Version` (in `version.go`) is bumped when: |
| | 98 | + |
| | 99 | +- The sanitizer policy changes (added/removed tag, attribute, |
| | 100 | + scheme). |
| | 101 | +- A new AST extension or rendering output change. |
| | 102 | +- Goldmark / bluemonday major-version upgrade with output drift. |
| | 103 | + |
| | 104 | +Callers store the rendered version alongside `body_html_cached` |
| | 105 | +columns. On read, callers compare the stored version to |
| | 106 | +`Version`; if they differ, the cache is stale and the caller |
| | 107 | +re-renders. We never run a one-shot "re-render every comment" job. |
| | 108 | + |
| | 109 | +## Why ASTTransformer instead of inline parsers |
| | 110 | + |
| | 111 | +Goldmark inline parsers run during the main parse pass and need a |
| | 112 | +`Trigger() []byte` plus careful interaction with Goldmark's |
| | 113 | +existing inline disambiguation (link/codespan/emphasis priorities). |
| | 114 | + |
| | 115 | +A single `parser.ASTTransformer` walks the document after parsing, |
| | 116 | +visits text nodes whose ancestors aren't code/codespan/link/image/ |
| | 117 | +HTML, and runs one combined regex per node. The replacement |
| | 118 | +inserts `*ast.Link` (mentions/refs/commits) or `*ast.String` |
| | 119 | +(emoji + plain fallbacks) before the original text node, then |
| | 120 | +removes the original. |
| | 121 | + |
| | 122 | +Tradeoffs: |
| | 123 | + |
| | 124 | +- ✅ Simpler than custom inline parsers. |
| | 125 | +- ✅ Trivial to add new patterns (extend `reCombined`). |
| | 126 | +- ✅ Single-pass — runs once per text node. |
| | 127 | +- ❌ Can't span across other inline nodes (e.g. a mention split |
| | 128 | + by `*emphasis*`). That's fine — we don't want that anyway. |
| | 129 | + |
| | 130 | +## Hostile-input fixture suite |
| | 131 | + |
| | 132 | +`TestRender_HostileInputs` is the test of record. Every fixture |
| | 133 | +is an XSS vector; the pass condition is "rendered HTML contains |
| | 134 | +no executable JS surface" — no `<script>`, no `href="javascript:"`, |
| | 135 | +no `on*` handlers, no `<iframe>` / `<object>` / `<embed>` / |
| | 136 | +`<base>` / `<meta>` / `<style>`, no `expression()`, no |
| | 137 | +`<annotation-xml>`. |
| | 138 | + |
| | 139 | +Adding a vector when a CVE / advisory lands is cheap; the test |
| | 140 | +file lists ~40 vectors today. |
| | 141 | + |
| | 142 | +## Performance budget |
| | 143 | + |
| | 144 | +- 50 KiB body, full extensions: <30 ms p99 on MVP hardware. |
| | 145 | +- Inputs above `MaxRenderInputBytes` (1 MiB) are rejected by the |
| | 146 | + defensive check in `Render`. The API layer enforces tighter |
| | 147 | + per-surface caps (64 KiB or 256 KiB depending on context); this |
| | 148 | + is the renderer's last-resort guardrail. |
| | 149 | + |
| | 150 | +## Refactor pass (S25) |
| | 151 | + |
| | 152 | +S17/S18/S21/S22/S23/S24 used a per-sprint |
| | 153 | +`internal/repos/markdown.RenderHTML` wrapper. S25 deleted that |
| | 154 | +package and updated every importer to |
| | 155 | +`github.com/tenseleyFlow/shithub/internal/markdown`. The shim |
| | 156 | +`markdown.RenderHTML(src) (string, error)` is preserved so the |
| | 157 | +swap is a one-line import edit per file. |
| | 158 | + |
| | 159 | +Callers that want richer behavior — resolved refs, mentions, |
| | 160 | +viewer-aware visibility — should call `markdown.Render` directly. |
| | 161 | +The interim `RenderHTML` keeps a sensible default (SoftBreakAsBR |
| | 162 | +on, no resolvers). |
| | 163 | + |
| | 164 | +## Lint guard |
| | 165 | + |
| | 166 | +`scripts/lint-markdown-boundary.sh` fails CI when goldmark or |
| | 167 | +bluemonday is imported outside `internal/markdown/`. Test files |
| | 168 | +are exempt. Anything else triggers the alarm; the fix is to |
| | 169 | +swap the import to the canonical package. |
| | 170 | + |
| | 171 | +Wired into `make ci` via the `lint-markdown` target. |
| | 172 | + |
| | 173 | +## Deferred |
| | 174 | + |
| | 175 | +| Feature | Destination | |
| | 176 | +| ------------------------ | ----------- | |
| | 177 | +| KaTeX math rendering | post-MVP | |
| | 178 | +| Mermaid diagrams | post-MVP | |
| | 179 | +| GFM Footnotes | post-MVP | |
| | 180 | +| Single-flight cache wrap | S36 (perf pass) — only if cache-stampede on version bump bites | |
| | 181 | +| Per-line Chroma in diff hunks | S36 (already deferred) | |
| | 182 | +| `data:image/...` allowlist (sized) | post-MVP — if README writers complain enough | |
| | 183 | +| `@org/team` mentions | S31 (teams) | |