# Section grammar

Everything after the closing `---` of the frontmatter is the document
body. DLM's body parser splits it into typed **sections** using fence
markers of the form `::<type>::` on a line by themselves.
## Section types

### Prose (default)

Any body text that isn't inside an explicit fence is a prose section.
Prose trains via **continued pretraining** — the model learns the
writing style + vocabulary but doesn't get "question → answer" pressure.

```dlm
# Heading

Prose paragraphs, markdown code blocks, whatever you'd normally write.

Another paragraph after a blank line stays in the same prose section.
```

Code fences (` ``` `) inside prose are preserved; the parser doesn't
interpret `::type::` lines that appear inside a code block.
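That splitting behavior can be sketched in a few lines. This is a simplified illustration, not DLM's actual parser: `split_sections` and the fence regex are assumptions, and attribute-bearing fences (like `::image ...::`) are handled only loosely.

```python
import re

# Matches ::type:: optionally followed by attributes, e.g. ::image path="..."::
FENCE = re.compile(r'^::([a-z]+)(?:\s+[^:]*)?::\s*$')
CODE_FENCE = "`" * 3  # literal triple backtick, spelled out to keep this doc renderable

def split_sections(body: str):
    """Split a document body into (type, lines) sections.

    A ::type:: fence opens a new section; lines inside triple-backtick
    code blocks are never treated as fences.
    """
    sections = []
    current = ("prose", [])
    in_code_block = False
    for line in body.splitlines():
        if line.strip().startswith(CODE_FENCE):
            in_code_block = not in_code_block
        m = None if in_code_block else FENCE.match(line.strip())
        if m:
            if current[1]:                 # flush any accumulated section
                sections.append(current)
            current = (m.group(1), [])     # open the new typed section
        else:
            current[1].append(line)
    if current[1]:
        sections.append(current)
    return sections
```

The key design point mirrors the fence rules below: the code-fence toggle is checked before fence matching, so a `::instruction::` line inside a markdown code block stays inert prose.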
### Instruction (`::instruction::`)

Open with `::instruction::` on its own line. Each Q&A pair uses
`### Q` and `### A` as grammar markers.

```dlm
::instruction::
### Q
What is a decorator?

### A
A function that takes a function and returns a new function.

### Q
When should I use functools.wraps?

### A
Always, inside decorators.
```
Trains via **supervised fine-tuning (SFT)**: the model sees `Q` text
as the prompt, `A` text as the target. This is the pattern that
produces "helpful assistant" behavior.
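A rough sketch of how the `### Q` / `### A` markers could be grouped into (prompt, target) rows. `qa_pairs` is illustrative, not a DLM API:

```python
def qa_pairs(section_lines):
    """Group an instruction section's lines into (prompt, target) pairs,
    using '### Q' / '### A' as the grammar markers."""
    pairs, q, a, mode = [], [], [], None
    for line in section_lines:
        marker = line.strip()
        if marker == "### Q":
            if q and a:                    # flush the previous completed pair
                pairs.append(("\n".join(q).strip(), "\n".join(a).strip()))
            q, a, mode = [], [], "q"
        elif marker == "### A":
            mode = "a"
        elif mode == "q":
            q.append(line)
        elif mode == "a":
            a.append(line)
    if q and a:
        pairs.append(("\n".join(q).strip(), "\n".join(a).strip()))
    return pairs
```

Each resulting tuple maps directly onto an SFT row: the first element is the prompt, the second the training target.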
`dlm synth instructions` can also write synthesized instruction
sections back into the document. Those keep the same basic body grammar
but add an HTML provenance marker immediately after the fence. See the
[instruction section reference](instruction-section.md) for the full
marker shape and validation rules.
### Preference (`::preference::`)

Open with `::preference::`. Each record has three blocks:

```dlm
::preference::
### Prompt
Explain recursion to a beginner.

### Chosen
Recursion is when a function calls itself on a smaller piece of the
problem. Imagine matryoshka dolls.

### Rejected
A recursive function is any function that refers to itself in its own
definition using the stack frame protocol.
```
Trains via **DPO** (direct preference optimization) or **ORPO** — the
model learns to prefer the `Chosen` phrasing. The DPO / ORPO trainer
lands in Sprint 17/18.
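The three-block grammar could be collected into preference records like this (a sketch under the same caveat as above: `preference_records` is illustrative, not a DLM function):

```python
def preference_records(section_lines):
    """Split a preference section into {prompt, chosen, rejected} dicts."""
    FIELDS = {"### Prompt": "prompt", "### Chosen": "chosen", "### Rejected": "rejected"}
    records, current, field = [], {}, None
    for line in section_lines:
        key = FIELDS.get(line.strip())
        if key == "prompt" and current:    # a new Prompt block starts a new record
            records.append(current)
            current = {}
        if key:
            field = key
            current[field] = []
        elif field:
            current[field].append(line)
    if current:
        records.append(current)
    return [{k: "\n".join(v).strip() for k, v in r.items()} for r in records]
```

Each dict is the shape a DPO/ORPO trainer typically consumes: one prompt plus a chosen/rejected completion pair.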
### Image (`::image path="..." alt="..."::`)

Schema v10 adds image sections for vision-language bases. The initial
launch covered PaliGemma; later follow-ups added Qwen2-VL,
InternVL2, and Mistral Small 3.1 registry rows. The fence uses
attribute syntax instead of the bare `::type::` form:

```dlm
::image path="figures/architecture.png" alt="training pipeline diagram"::
Caption text describing the figure. The caption body becomes the "text"
part of the training row; the placeholder expands to the base's image
tokens at collate time.
```

Required attributes: `path` (the image file, resolved relative to the
`.dlm`'s parent dir). Optional: `alt` (short description; defaults to
the filename stem on directive-ingested images).
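The attribute form can be pulled apart with a small regex. This is an illustrative sketch of the grammar, not DLM's parser:

```python
import re

ATTR = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs
IMAGE_FENCE = re.compile(r'^::image\s+(.*?)::\s*$')

def parse_image_fence(line: str):
    """Parse an ::image path="..." alt="..."::  fence into an attribute dict.

    Returns None for non-image lines; raises if the required path is missing.
    """
    m = IMAGE_FENCE.match(line.strip())
    if m is None:
        return None
    attrs = dict(ATTR.findall(m.group(1)))
    if "path" not in attrs:
        raise ValueError('::image:: fence requires a path attribute')
    return attrs
```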
**Supported extensions.** `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`,
`.bmp`, `.tiff`. Other binary types (PDF, archives) stay out of the
training corpus by default.

**Content hash.** Image sections hash on `(type, path, blob_sha)`
rather than the body text. Two identical-bytes images at different
paths produce different `section_id`s — paths carry meaning. Changing
the blob bytes flips the ID even if the path didn't move.
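A minimal model of that hashing rule. The `\x00` separator and the exact serialization are assumptions; only the documented inputs and the 16-hex-char truncation come from this page:

```python
import hashlib

def image_section_id(path: str, blob_sha: str) -> str:
    """Image sections hash on (type, path, blob_sha), not the caption text."""
    canonical = f"image\x00{path}\x00{blob_sha}"   # separator is an assumption
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

The two documented properties fall out directly: identical bytes at two paths get two IDs, and new bytes at the same path flip the ID.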
**Directive ingest.** `training.sources` directives with image
extensions in their `include` globs ingest automatically:

```yaml
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
```

Each discovered image becomes an `::image::` section with
`alt=<filename-stem>` and flows through the same row-emission path.
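The discovery step amounts to a recursive glob walk over the source path. A sketch assuming pathlib-style glob semantics; `discover_images` is illustrative, not a DLM function:

```python
from pathlib import Path

def discover_images(source_path: str, include: list[str]):
    """Yield ::image:: section stubs for files matching the include globs,
    defaulting alt to the filename stem."""
    root = Path(source_path)
    seen = set()
    for pattern in include:
        for f in sorted(root.glob(pattern)):
            if f.is_file() and f not in seen:
                seen.add(f)                 # de-duplicate across overlapping globs
                yield {"type": "image", "path": str(f), "alt": f.stem}
```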
**Current InternVL caveat.** InternVL-family rows stay visible in the
registry for planning and future work, but the current runtime still
needs a custom processor/collator path for their `<image>` expansion
and `image_flags` contract. See the [multi-modal training
cookbook](../cookbook/multimodal-training.md) and [VL memory
guide](../hardware/vl-memory.md) before picking `internvl2-2b`.

**Base-model requirements.** Only vision-language bases accept image
sections at training time. `dlm init --multimodal` scaffolds a VL
doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
refuse image sections at train start with a pointer to `--multimodal`.
## Fence rules

- A fence must be the full line — `::instruction::` with no leading/
  trailing content other than whitespace.
- Fences inside triple-backtick code blocks are **not** active — the
  parser is aware of the code-fence context.
- An unfenced heading (`# ...`, `## ...`) inside an open instruction or
  preference section does **not** close the section. Close with the
  next section fence or end-of-file.
- Section type is case-sensitive; `::Instruction::` is rejected.
- Sprint 20 introduces a `::type#adapter-name::` suffix for
  multi-adapter routing; the v1 parser accepts the suffix but ignores
  the `#...` tail.
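Taken together, the rules for bare fences fit in one regex. This sketch deliberately leaves attribute-bearing fences like `::image ...::` out of scope:

```python
import re

# Full line, lowercase type, optional #adapter-name suffix that is ignored.
FENCE = re.compile(r'^::([a-z]+)(#[A-Za-z0-9_-]+)?::$')

def parse_fence(line: str):
    """Return the section type for a valid bare fence line, else None.

    Surrounding whitespace is allowed; the type is case-sensitive
    (lowercase only), and a #adapter-name suffix is accepted but ignored.
    """
    m = FENCE.match(line.strip())
    return m.group(1) if m else None
```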
## Section IDs

Every section gets a content-addressed ID — the first 16 hex chars of
the SHA-256 of the section's canonical text. The manifest's
`content_hashes` records these IDs and their types so the next `dlm train`
can compute what's new, unchanged, or removed (Sprint 08's delta system).

You don't write these IDs in the document — they're derived and live
only in the manifest. But if you're debugging "why isn't this section
being picked up as new?", the ID in `dlm show --json` is the answer.
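A minimal model of the ID derivation and the delta computation it enables. `section_id` and `delta` are illustrative helpers, not DLM APIs:

```python
import hashlib

def section_id(canonical_text: str) -> str:
    """First 16 hex chars of the SHA-256 of the section's canonical text."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]

def delta(old_ids: set, new_ids: set) -> dict:
    """What the next train run treats as new, unchanged, or removed."""
    return {
        "new": new_ids - old_ids,
        "unchanged": new_ids & old_ids,
        "removed": old_ids - new_ids,
    }
```

Because the ID is derived from content, editing a section's text changes its ID, so the old ID shows up as "removed" and the edited text as "new" — which is exactly the behavior to check when a section isn't being picked up.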
## What NOT to put in sections

- API keys, personal data, anything you wouldn't want baked into a
  model you'll share. The adapter learns from everything in the file.
- JSON / YAML config that the model should emit literally — use
  instruction Q&A pairs instead. Training on raw config produces
  noisy generation.
- Massive code dumps (>200 KB). The replay corpus retains everything,
  and `sequence_len` is bounded at 32 KB; a single enormous section
  trains one step and wastes the remaining token budget.
## See also

- [Instruction section reference](instruction-section.md)
- [Preference section reference](preference-section.md)
- [First train walkthrough](../getting-started/first-train.md)
- [Cookbook: coding tutor](../cookbook/coding-tutor.md) — full
  example of instruction-heavy authoring