
# Section grammar

Everything after the closing `---` of the frontmatter is the document body. DLM's body parser splits it into typed **sections** using fence markers of the form `::<type>::`, each on a line by itself.

## Section types

### Prose (default)

Any body text that isn't inside an explicit fence is a prose section. Prose trains via **continued pretraining** — the model learns the writing style + vocabulary but doesn't get "question → answer" pressure.

```dlm
# Heading

Prose paragraphs, markdown code blocks, whatever you'd normally write.

Another paragraph after a blank line stays in the same prose section.
```

Code fences (` ``` `) inside prose are preserved; the parser doesn't interpret `::type::` lines that appear inside a code block.

### Instruction (`::instruction::`)

Open with `::instruction::` on its own line. Each Q&A pair uses `### Q` and `### A` as grammar markers.

```dlm
::instruction::
### Q
What is a decorator?

### A
A function that takes a function and returns a new function.

### Q
When should I use functools.wraps?

### A
Always, inside decorators.
```

Trains via **supervised fine-tuning (SFT)**: the model sees `Q` text as the prompt, `A` text as the target. This is the pattern that produces "helpful assistant" behavior.
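
A minimal sketch of what that extraction implies: the Q/A grammar splits into (prompt, target) pairs. The helper below is illustrative only, not DLM's internal parser:

```python
import re

def split_qa_pairs(body: str):
    """Split an instruction-section body into (prompt, target) pairs.

    An illustrative sketch of the ### Q / ### A grammar, not DLM's
    internal parser.
    """
    # re.split with a capturing group yields
    # [preamble, marker, text, marker, text, ...]
    blocks = re.split(r"^### (Q|A)\s*$", body, flags=re.M)
    pairs, question = [], None
    for marker, text in zip(blocks[1::2], blocks[2::2]):
        if marker == "Q":
            question = text.strip()
        elif question is not None:
            pairs.append((question, text.strip()))
            question = None
    return pairs

body = """### Q
What is a decorator?

### A
A function that takes a function and returns a new function.
"""
pairs = split_qa_pairs(body)
```

Each tuple then becomes one SFT row: the first element masked as prompt, the second as the supervised target.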

`dlm synth instructions` can also write synthesized instruction sections back into the document. Those keep the same basic body grammar but add an HTML provenance marker immediately after the fence. See the [instruction section reference](instruction-section.md) for the full marker shape and validation rules.

### Preference (`::preference::`)

Open with `::preference::`. Each record has three blocks:

```dlm
::preference::
### Prompt
Explain recursion to a beginner.

### Chosen
Recursion is when a function calls itself on a smaller piece of the
problem. Imagine matryoshka dolls.

### Rejected
A recursive function is any function that refers to itself in its own
definition using the stack frame protocol.
```

Trains via **DPO** (direct preference optimization) or **ORPO** — the model learns to prefer the `Chosen` phrasing. The DPO / ORPO trainer lands in Sprint 17/18.
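
The three-block grammar maps naturally onto preference rows. A hedged sketch of the grouping (the row shape a DPO/ORPO trainer expects is an assumption, not DLM's documented format):

```python
import re

def parse_preference_records(body: str):
    """Group ### Prompt / ### Chosen / ### Rejected blocks into rows.

    Illustrative only; not DLM's internal parser.
    """
    parts = re.split(r"^### (Prompt|Chosen|Rejected)\s*$", body, flags=re.M)
    records, current = [], {}
    for marker, text in zip(parts[1::2], parts[2::2]):
        if marker == "Prompt" and current:
            records.append(current)  # a new Prompt starts the next record
            current = {}
        current[marker.lower()] = text.strip()
    if current:
        records.append(current)
    return records

body = """### Prompt
Explain recursion to a beginner.

### Chosen
Imagine matryoshka dolls.

### Rejected
See the stack frame protocol.
"""
records = parse_preference_records(body)
```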

### Image (`::image path="..." alt="..."::`)

Schema v10 adds image sections for vision-language bases. The initial launch covered PaliGemma; later follow-ups added Qwen2-VL, InternVL2, and Mistral Small 3.1 registry rows. The fence uses attribute syntax instead of the bare `::type::` form:

```dlm
::image path="figures/architecture.png" alt="training pipeline diagram"::
Caption text describing the figure. The caption body becomes the "text"
part of the training row; the placeholder expands to the base's image
tokens at collate time.
```

Required attributes: `path` (the image file, resolved relative to the `.dlm`'s parent dir). Optional: `alt` (short description; defaults to the filename stem on directive-ingested images).
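
A hedged sketch of how the attribute fence could be matched, including the alt-defaults-to-stem rule. The regex and helper name are illustrative (and the second example path is made up), not the real parser's grammar:

```python
import re
from pathlib import PurePosixPath

# Illustrative pattern for the attribute fence form; the actual
# parser's grammar may be more permissive (attribute order, quoting).
IMAGE_FENCE = re.compile(
    r'^::image\s+path="(?P<path>[^"]+)"(?:\s+alt="(?P<alt>[^"]*)")?::\s*$'
)

def parse_image_fence(line: str):
    """Return {"path", "alt"} for an image fence line, else None."""
    m = IMAGE_FENCE.match(line)
    if not m:
        return None
    path = m.group("path")
    # alt falls back to the filename stem, mirroring directive ingest.
    alt = m.group("alt") or PurePosixPath(path).stem
    return {"path": path, "alt": alt}

fig = parse_image_fence(
    '::image path="figures/architecture.png" alt="training pipeline diagram"::'
)
no_alt = parse_image_fence('::image path="figures/loss-curve.png"::')
```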

**Supported extensions.** `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`, `.bmp`, `.tiff`. Other binary types (PDF, archives) stay out of the training corpus by default.

**Content hash.** Image sections hash on `(type, path, blob_sha)` rather than the body text. Two identical-bytes images at different paths produce different `section_id`s — paths carry meaning. Changing the blob bytes flips the ID even if the path didn't move.
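
That rule can be sketched as follows; the exact canonical serialization (separator, field order) is an assumption, since only the inputs to the hash are documented:

```python
import hashlib

def image_section_id(section_type: str, path: str, blob_sha: str) -> str:
    """Hash on (type, path, blob_sha) instead of body text.

    The NUL-joined serialization here is an assumption.
    """
    canonical = "\x00".join((section_type, path, blob_sha))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Identical bytes at two paths: different IDs, because paths carry meaning.
same_blob = "9f2c1a..."  # placeholder blob SHA, not a real digest
a = image_section_id("image", "figures/a.png", same_blob)
b = image_section_id("image", "figures/b.png", same_blob)
```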

**Directive ingest.** `training.sources` directives with image extensions in their `include` globs ingest automatically:

```yaml
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
```

Each discovered image becomes an `::image::` section with `alt=<filename-stem>` and flows through the same row-emission path.
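
A sketch of the ingest rule under stated assumptions (function and file names are illustrative): filter by extension, match the include globs, and default `alt` to the filename stem.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}

def ingest_candidates(paths, include_globs):
    """Yield the files that would become ::image:: sections."""
    for p in paths:
        pp = PurePosixPath(p)
        if pp.suffix.lower() not in IMAGE_EXTS:
            continue  # PDFs, archives, etc. stay out of the corpus
        if any(fnmatchcase(p, g) for g in include_globs):
            yield {"path": p, "alt": pp.stem}  # alt defaults to the stem

files = ["paper-figures/arch.png", "paper-figures/notes.pdf",
         "paper-figures/loss.jpg"]
rows = list(ingest_candidates(files, ["**/*.png", "**/*.jpg"]))
```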

**Current InternVL caveat.** InternVL-family rows stay visible in the registry for planning and future work, but the current runtime still needs a custom processor/collator path for their `<image>` expansion and `image_flags` contract. See the [multi-modal training cookbook](../cookbook/multimodal-training.md) and [VL memory guide](../hardware/vl-memory.md) before picking `internvl2-2b`.

**Base-model requirements.** Only vision-language bases accept image sections at training time. `dlm init --multimodal` scaffolds a VL doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi) refuse image sections at train start with a pointer to `--multimodal`.

## Fence rules

- A fence must be the full line — `::instruction::` with no leading/trailing content other than whitespace.
- Fences inside triple-backtick code blocks are **not** active — the parser is aware of the code-fence context.
- An unfenced heading (`# ...`, `## ...`) inside an open instruction or preference section does **not** close the section. Close with the next section fence or end-of-file.
- Section type is case-sensitive; `::Instruction::` is rejected.
- Sprint 20 introduces a `::type#adapter-name::` suffix for multi-adapter routing; the v1 parser accepts the suffix but ignores the `#...` tail.
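
The rules above can be sketched as a single fence matcher. This regex is illustrative, covers only the bare `::type::` form (the `::image ...::` attribute fence needs its own pattern), and is not the real parser:

```python
import re

# Full line only (whitespace allowed around it), case-sensitive type,
# optional #adapter-name tail that v1 accepts but discards.
FENCE = re.compile(r"^\s*::(instruction|preference)(#[\w-]+)?::\s*$")

def fence_type(line: str, in_code_block: bool):
    """Return the section type a line opens, or None if it isn't a fence."""
    if in_code_block:
        return None  # fences inside triple-backtick blocks are not active
    m = FENCE.match(line)
    return m.group(1) if m else None
```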

## Section IDs

Every section gets a content-addressed ID — the first 16 hex chars of the SHA-256 of the section's canonical text. The manifest's `content_hashes` records these IDs and their types so the next `dlm train` can compute what's new, unchanged, or removed (Sprint 08's delta system).

You don't write these IDs in the document — they're derived and live only in the manifest. But if you're debugging "why isn't this section being picked up as new?", the ID in `dlm show --json` is the answer.
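
Deriving an ID is just a hash prefix. A minimal sketch, assuming raw UTF-8 bytes (the exact canonicalization of section text is an internal detail):

```python
import hashlib

def section_id(canonical_text: str) -> str:
    """First 16 hex chars of SHA-256 over a section's canonical text."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]

# Any byte-level change flips the ID; that is what the delta system keys on.
sid = section_id("### Q\nWhat is a decorator?\n")
```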

## What NOT to put in sections

- API keys, personal data, anything you wouldn't want baked into a model you'll share. The adapter learns from everything in the file.
- JSON / YAML config that the model should emit literally — use instruction Q&A pairs instead. Training on raw config produces noisy generation.
- Massive code dumps (>200 KB). The replay corpus retains everything, and `sequence_len` is bounded at 32 KB; a single enormous section trains one step and wastes the remaining token budget.

## See also

- [Instruction section reference](instruction-section.md)
- [Preference section reference](preference-section.md)
- [First train walkthrough](../getting-started/first-train.md)
- [Cookbook: coding tutor](../cookbook/coding-tutor.md) — full example of instruction-heavy authoring