Markdown pipeline (internal)
S25 ships shithub's canonical markdown renderer. One package owns
goldmark + bluemonday; every other package routes through
markdown.Render. The boundary is enforced by
scripts/lint-markdown-boundary.sh.
Architecture
internal/markdown/
markdown.go — package doc + Ref/Mention public types
version.go — Version int32 = 1 (pipeline stamp)
opts.go — Options + Resolvers structs
render.go — Render() entry point + RenderHTML shim
sanitize.go — bluemonday policy
markdown_test.go — XSS fixture suite + golden render tests
extensions/
extensions.go — single ASTTransformer for refs/mentions/commits/emoji
emoji.go — curated shortcode → unicode map
Render pipeline
source bytes
│
▼
[goldmark.Convert] CommonMark + GFM (tables, strikethrough,
│ autolinks, task lists), html.WithUnsafe
│ so raw HTML reaches the sanitizer
│
▼
[ASTTransformer] walks Document, skips code/codespan/link/
│ image/HTML subtrees, runs reCombined regex
│ on each Text segment, replaces matches with
│ Link nodes (mentions/refs/commits) or String
│ nodes (emoji + plain-text fallbacks).
│
▼
[bluemonday.SanitizeBytes] strict UGC policy: scheme allowlist
│ (http/https/mailto), <details>/<summary>/
│ <kbd>/<sup>/<sub>, language-* class on
│ code, no <script>/<style>/<iframe>/data:
│
▼
clean HTML bytes
Each Render call builds a fresh goldmark.Markdown so the
extension's per-call Options can plug in without races. The build
is ~µs and removes any cross-render contamination of the
transformer state.
Reference resolution
Options.Resolvers is the seam between the renderer and the rest
of the runtime. Each resolver is optional; a nil resolver means
"render that flavor as plain text" — no link, no error, no
existence leak.
Visibility gating happens inside the resolver. The transformer
trusts the resolver's ok return — if the resolver returns
ok=false, the displayed text passes through as plain text.
The handler/orchestrator that wires the resolver MUST honor:
User: rejects suspended users; respects org-team visibility once S30/S31 land.Issue: rejects when the repo isn't visible to the viewer (the S15policy.Can(ActionIssueRead)gate).Commit: rejects when the SHA prefix doesn't match a real commit; doesn't care about visibility (commit URLs respect repo visibility at request time).
Sanitizer policy decisions
The base is bluemonday.UGCPolicy(). We diverge by:
- Allowing
<details>,<summary>,<kbd>,<sup>,<sub>. Common in READMEs; not security-relevant. - Allowing
idon h1-h6 so anchor links work (Goldmark'sWithAutoHeadingID). - Allowing
classon<code>,<pre>,<span>matching^(?:language-[A-Za-z0-9_+-]+|chroma|chroma-[a-zA-Z]+|nl|ln|line|hl)$so Chroma's syntax-highlighted output passes through. - Allowing
<input type="checkbox" disabled checked>— Goldmark's task-list output. Thedisabledandcheckedattributes are HTML boolean attrs; we don't constrain their values. - Restricting URL schemes to
http,https,mailto(nodata:, nojavascript:, novbscript:, noftp:). This is stricter than UGCPolicy's default and is the documented divergence from GitHub.
Pipeline-version stamp
Version (in version.go) is bumped when:
- The sanitizer policy changes (added/removed tag, attribute, scheme).
- A new AST extension or rendering output change.
- Goldmark / bluemonday major-version upgrade with output drift.
Callers store the rendered version alongside body_html_cached
columns. On read, callers compare the stored version to
Version; if they differ, the cache is stale and the caller
re-renders. We never run a one-shot "re-render every comment" job.
Why ASTTransformer instead of inline parsers
Goldmark inline parsers run during the main parse pass and need a
Trigger() []byte plus careful interaction with Goldmark's
existing inline disambiguation (link/codespan/emphasis priorities).
A single parser.ASTTransformer walks the document after parsing,
visits text nodes whose ancestors aren't code/codespan/link/image/
HTML, and runs one combined regex per node. The replacement
inserts *ast.Link (mentions/refs/commits) or *ast.String
(emoji + plain fallbacks) before the original text node, then
removes the original.
Tradeoffs:
- ✅ Simpler than custom inline parsers.
- ✅ Trivial to add new patterns (extend
reCombined). - ✅ Single-pass — runs once per text node.
- ❌ Can't span across other inline nodes (e.g. a mention split
by
*emphasis*). That's fine — we don't want that anyway.
Hostile-input fixture suite
TestRender_HostileInputs is the test of record. Every fixture
is an XSS vector; the pass condition is "rendered HTML contains
no executable JS surface" — no <script>, no href="javascript:",
no on* handlers, no <iframe> / <object> / <embed> /
<base> / <meta> / <style>, no expression(), no
<annotation-xml>.
Adding a vector when a CVE / advisory lands is cheap; the test file lists ~40 vectors today.
Performance budget
- 50 KiB body, full extensions: <30 ms p99 on MVP hardware.
- Inputs above
MaxRenderInputBytes(1 MiB) are rejected by the defensive check inRender. The API layer enforces tighter per-surface caps (64 KiB or 256 KiB depending on context); this is the renderer's last-resort guardrail.
Refactor pass (S25)
S17/S18/S21/S22/S23/S24 used a per-sprint
internal/repos/markdown.RenderHTML wrapper. S25 deleted that
package and updated every importer to
github.com/tenseleyFlow/shithub/internal/markdown. The shim
markdown.RenderHTML(src) (string, error) is preserved so the
swap is a one-line import edit per file.
Callers that want richer behavior — resolved refs, mentions,
viewer-aware visibility — should call markdown.Render directly.
The interim RenderHTML keeps a sensible default (SoftBreakAsBR
on, no resolvers).
README presentation chrome
Repository READMEs use the same sanitized HTML pipeline as comments and issues. The repo page adds GitHub-parity presentation chrome in the web layer only:
<pre>blocks inside.markdown-bodyare wrapped client-side with a stable code-block container and a copy button. The copied text comes from the original<code>or<pre>textContent, not from injected controls.- The README outline menu is populated client-side from rendered
h1-h6elements. Goldmark-generated IDs are reused. Raw HTML headings without IDs get deterministic page-local slugs so the outline can link to them. - The JavaScript only reads sanitized text/IDs and mutates local chrome;
it is not part of the trust boundary. Sanitization remains entirely in
internal/markdown.
Lint guard
scripts/lint-markdown-boundary.sh fails CI when goldmark or
bluemonday is imported outside internal/markdown/. Test files
are exempt. Anything else triggers the alarm; the fix is to
swap the import to the canonical package.
Wired into make ci via the lint-markdown target.
Deferred
| Feature | Destination |
|---|---|
| KaTeX math rendering | post-MVP |
| Mermaid diagrams | post-MVP |
| GFM Footnotes | post-MVP |
| Single-flight cache wrap | S36 (perf pass) — only if cache-stampede on version bump bites |
| Per-line Chroma in diff hunks | S36 (already deferred) |
data:image/... allowlist (sized) |
post-MVP — if README writers complain enough |
@org/team mentions |
S31 (teams) |
View source
| 1 | # Markdown pipeline (internal) |
| 2 | |
| 3 | S25 ships shithub's canonical markdown renderer. One package owns |
| 4 | goldmark + bluemonday; every other package routes through |
| 5 | `markdown.Render`. The boundary is enforced by |
| 6 | `scripts/lint-markdown-boundary.sh`. |
| 7 | |
| 8 | ## Architecture |
| 9 | |
| 10 | ``` |
| 11 | internal/markdown/ |
| 12 | markdown.go — package doc + Ref/Mention public types |
| 13 | version.go — Version int32 = 1 (pipeline stamp) |
| 14 | opts.go — Options + Resolvers structs |
| 15 | render.go — Render() entry point + RenderHTML shim |
| 16 | sanitize.go — bluemonday policy |
| 17 | markdown_test.go — XSS fixture suite + golden render tests |
| 18 | extensions/ |
| 19 | extensions.go — single ASTTransformer for refs/mentions/commits/emoji |
| 20 | emoji.go — curated shortcode → unicode map |
| 21 | ``` |
| 22 | |
| 23 | ## Render pipeline |
| 24 | |
| 25 | ``` |
| 26 | source bytes |
| 27 | │ |
| 28 | ▼ |
| 29 | [goldmark.Convert] CommonMark + GFM (tables, strikethrough, |
| 30 | │ autolinks, task lists), html.WithUnsafe |
| 31 | │ so raw HTML reaches the sanitizer |
| 32 | │ |
| 33 | ▼ |
| 34 | [ASTTransformer] walks Document, skips code/codespan/link/ |
| 35 | │ image/HTML subtrees, runs reCombined regex |
| 36 | │ on each Text segment, replaces matches with |
| 37 | │ Link nodes (mentions/refs/commits) or String |
| 38 | │ nodes (emoji + plain-text fallbacks). |
| 39 | │ |
| 40 | ▼ |
| 41 | [bluemonday.SanitizeBytes] strict UGC policy: scheme allowlist |
| 42 | │ (http/https/mailto), <details>/<summary>/ |
| 43 | │ <kbd>/<sup>/<sub>, language-* class on |
| 44 | │ code, no <script>/<style>/<iframe>/data: |
| 45 | │ |
| 46 | ▼ |
| 47 | clean HTML bytes |
| 48 | ``` |
| 49 | |
| 50 | Each Render call builds a fresh `goldmark.Markdown` so the |
| 51 | extension's per-call Options can plug in without races. The build |
| 52 | is ~µs and removes any cross-render contamination of the |
| 53 | transformer state. |
| 54 | |
| 55 | ## Reference resolution |
| 56 | |
| 57 | `Options.Resolvers` is the seam between the renderer and the rest |
| 58 | of the runtime. Each resolver is optional; a nil resolver means |
| 59 | "render that flavor as plain text" — no link, no error, no |
| 60 | existence leak. |
| 61 | |
| 62 | Visibility gating happens *inside* the resolver. The transformer |
| 63 | trusts the resolver's `ok` return — if the resolver returns |
| 64 | `ok=false`, the displayed text passes through as plain text. |
| 65 | |
| 66 | The handler/orchestrator that wires the resolver MUST honor: |
| 67 | |
| 68 | - `User`: rejects suspended users; respects org-team visibility |
| 69 | once S30/S31 land. |
| 70 | - `Issue`: rejects when the repo isn't visible to the viewer (the |
| 71 | S15 `policy.Can(ActionIssueRead)` gate). |
| 72 | - `Commit`: rejects when the SHA prefix doesn't match a real |
| 73 | commit; doesn't care about visibility (commit URLs respect repo |
| 74 | visibility at request time). |
| 75 | |
| 76 | ## Sanitizer policy decisions |
| 77 | |
| 78 | The base is `bluemonday.UGCPolicy()`. We diverge by: |
| 79 | |
| 80 | - Allowing `<details>`, `<summary>`, `<kbd>`, `<sup>`, `<sub>`. |
| 81 | Common in READMEs; not security-relevant. |
| 82 | - Allowing `id` on h1-h6 so anchor links work |
| 83 | (Goldmark's `WithAutoHeadingID`). |
| 84 | - Allowing `class` on `<code>`, `<pre>`, `<span>` matching |
| 85 | `^(?:language-[A-Za-z0-9_+-]+|chroma|chroma-[a-zA-Z]+|nl|ln|line|hl)$` |
| 86 | so Chroma's syntax-highlighted output passes through. |
| 87 | - Allowing `<input type="checkbox" disabled checked>` — Goldmark's |
| 88 | task-list output. The `disabled` and `checked` attributes are |
| 89 | HTML boolean attrs; we don't constrain their values. |
| 90 | - Restricting URL schemes to `http`, `https`, `mailto` (no `data:`, |
| 91 | no `javascript:`, no `vbscript:`, no `ftp:`). This is stricter |
| 92 | than UGCPolicy's default and is the documented divergence from |
| 93 | GitHub. |
| 94 | |
| 95 | ## Pipeline-version stamp |
| 96 | |
| 97 | `Version` (in `version.go`) is bumped when: |
| 98 | |
| 99 | - The sanitizer policy changes (added/removed tag, attribute, |
| 100 | scheme). |
| 101 | - A new AST extension or rendering output change. |
| 102 | - Goldmark / bluemonday major-version upgrade with output drift. |
| 103 | |
| 104 | Callers store the rendered version alongside `body_html_cached` |
| 105 | columns. On read, callers compare the stored version to |
| 106 | `Version`; if they differ, the cache is stale and the caller |
| 107 | re-renders. We never run a one-shot "re-render every comment" job. |
| 108 | |
| 109 | ## Why ASTTransformer instead of inline parsers |
| 110 | |
| 111 | Goldmark inline parsers run during the main parse pass and need a |
| 112 | `Trigger() []byte` plus careful interaction with Goldmark's |
| 113 | existing inline disambiguation (link/codespan/emphasis priorities). |
| 114 | |
| 115 | A single `parser.ASTTransformer` walks the document after parsing, |
| 116 | visits text nodes whose ancestors aren't code/codespan/link/image/ |
| 117 | HTML, and runs one combined regex per node. The replacement |
| 118 | inserts `*ast.Link` (mentions/refs/commits) or `*ast.String` |
| 119 | (emoji + plain fallbacks) before the original text node, then |
| 120 | removes the original. |
| 121 | |
| 122 | Tradeoffs: |
| 123 | |
| 124 | - ✅ Simpler than custom inline parsers. |
| 125 | - ✅ Trivial to add new patterns (extend `reCombined`). |
| 126 | - ✅ Single-pass — runs once per text node. |
| 127 | - ❌ Can't span across other inline nodes (e.g. a mention split |
| 128 | by `*emphasis*`). That's fine — we don't want that anyway. |
| 129 | |
| 130 | ## Hostile-input fixture suite |
| 131 | |
| 132 | `TestRender_HostileInputs` is the test of record. Every fixture |
| 133 | is an XSS vector; the pass condition is "rendered HTML contains |
| 134 | no executable JS surface" — no `<script>`, no `href="javascript:"`, |
| 135 | no `on*` handlers, no `<iframe>` / `<object>` / `<embed>` / |
| 136 | `<base>` / `<meta>` / `<style>`, no `expression()`, no |
| 137 | `<annotation-xml>`. |
| 138 | |
| 139 | Adding a vector when a CVE / advisory lands is cheap; the test |
| 140 | file lists ~40 vectors today. |
| 141 | |
| 142 | ## Performance budget |
| 143 | |
| 144 | - 50 KiB body, full extensions: <30 ms p99 on MVP hardware. |
| 145 | - Inputs above `MaxRenderInputBytes` (1 MiB) are rejected by the |
| 146 | defensive check in `Render`. The API layer enforces tighter |
| 147 | per-surface caps (64 KiB or 256 KiB depending on context); this |
| 148 | is the renderer's last-resort guardrail. |
| 149 | |
| 150 | ## Refactor pass (S25) |
| 151 | |
| 152 | S17/S18/S21/S22/S23/S24 used a per-sprint |
| 153 | `internal/repos/markdown.RenderHTML` wrapper. S25 deleted that |
| 154 | package and updated every importer to |
| 155 | `github.com/tenseleyFlow/shithub/internal/markdown`. The shim |
| 156 | `markdown.RenderHTML(src) (string, error)` is preserved so the |
| 157 | swap is a one-line import edit per file. |
| 158 | |
| 159 | Callers that want richer behavior — resolved refs, mentions, |
| 160 | viewer-aware visibility — should call `markdown.Render` directly. |
| 161 | The interim `RenderHTML` keeps a sensible default (SoftBreakAsBR |
| 162 | on, no resolvers). |
| 163 | |
| 164 | ## README presentation chrome |
| 165 | |
| 166 | Repository READMEs use the same sanitized HTML pipeline as comments and |
| 167 | issues. The repo page adds GitHub-parity presentation chrome in the web |
| 168 | layer only: |
| 169 | |
| 170 | - `<pre>` blocks inside `.markdown-body` are wrapped client-side with a |
| 171 | stable code-block container and a copy button. The copied text comes |
| 172 | from the original `<code>` or `<pre>` textContent, not from injected |
| 173 | controls. |
| 174 | - The README outline menu is populated client-side from rendered `h1`- |
| 175 | `h6` elements. Goldmark-generated IDs are reused. Raw HTML headings |
| 176 | without IDs get deterministic page-local slugs so the outline can link |
| 177 | to them. |
| 178 | - The JavaScript only reads sanitized text/IDs and mutates local chrome; |
| 179 | it is not part of the trust boundary. Sanitization remains entirely in |
| 180 | `internal/markdown`. |
| 181 | |
| 182 | ## Lint guard |
| 183 | |
| 184 | `scripts/lint-markdown-boundary.sh` fails CI when goldmark or |
| 185 | bluemonday is imported outside `internal/markdown/`. Test files |
| 186 | are exempt. Anything else triggers the alarm; the fix is to |
| 187 | swap the import to the canonical package. |
| 188 | |
| 189 | Wired into `make ci` via the `lint-markdown` target. |
| 190 | |
| 191 | ## Deferred |
| 192 | |
| 193 | | Feature | Destination | |
| 194 | | ------------------------ | ----------- | |
| 195 | | KaTeX math rendering | post-MVP | |
| 196 | | Mermaid diagrams | post-MVP | |
| 197 | | GFM Footnotes | post-MVP | |
| 198 | | Single-flight cache wrap | S36 (perf pass) — only if cache-stampede on version bump bites | |
| 199 | | Per-line Chroma in diff hunks | S36 (already deferred) | |
| 200 | | `data:image/...` allowlist (sized) | post-MVP — if README writers complain enough | |
| 201 | | `@org/team` mentions | S31 (teams) | |