# Section grammar

Everything after the closing `---` of the frontmatter is the document
body. DLM's body parser splits it into typed **sections** using fence
markers of the form `::<type>::` on a line by themselves.
## Section types

### Prose (default)

Any body text that isn't inside an explicit fence is a prose section.
Prose trains via **continued pretraining** — the model learns the
writing style + vocabulary but doesn't get "question → answer" pressure.

```dlm
# Heading

Prose paragraphs, markdown code blocks, whatever you'd normally write.

Another paragraph after a blank line stays in the same prose section.
```

Code fences (` ``` `) inside prose are preserved; the parser doesn't
interpret `::type::` lines that appear inside a code block.
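That splitting behavior can be sketched in a few lines. This is a simplified illustration, not DLM's actual parser: `split_sections` and the fence regex are assumptions, and attribute-bearing fences (like `::image ...::`) are handled only loosely.

```python
import re

# Matches ::type:: optionally followed by attributes, e.g. ::image path="..."::
FENCE = re.compile(r'^::([a-z]+)(?:\s+[^:]*)?::\s*$')
CODE_FENCE = "`" * 3  # literal triple backtick, spelled out to keep this doc renderable

def split_sections(body: str):
    """Split a document body into (type, lines) sections.

    A ::type:: fence opens a new section; lines inside triple-backtick
    code blocks are never treated as fences.
    """
    sections = []
    current = ("prose", [])
    in_code_block = False
    for line in body.splitlines():
        if line.strip().startswith(CODE_FENCE):
            in_code_block = not in_code_block
        m = None if in_code_block else FENCE.match(line.strip())
        if m:
            if current[1]:                 # flush any accumulated section
                sections.append(current)
            current = (m.group(1), [])     # open the new typed section
        else:
            current[1].append(line)
    if current[1]:
        sections.append(current)
    return sections
```

The key design point mirrors the fence rules below: the code-fence toggle is checked before fence matching, so a `::instruction::` line inside a markdown code block stays inert prose.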
### Instruction (`::instruction::`)

Open with `::instruction::` on its own line. Each Q&A pair uses
`### Q` and `### A` as grammar markers.

```dlm
::instruction::
### Q
What is a decorator?

### A
A function that takes a function and returns a new function.

### Q
When should I use functools.wraps?

### A
Always, inside decorators.
```
Trains via **supervised fine-tuning (SFT)**: the model sees `Q` text
as the prompt, `A` text as the target. This is the pattern that
produces "helpful assistant" behavior.
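A rough sketch of how the `### Q` / `### A` markers could be grouped into (prompt, target) rows. `qa_pairs` is illustrative, not a DLM API:

```python
def qa_pairs(section_lines):
    """Group an instruction section's lines into (prompt, target) pairs,
    using '### Q' / '### A' as the grammar markers."""
    pairs, q, a, mode = [], [], [], None
    for line in section_lines:
        marker = line.strip()
        if marker == "### Q":
            if q and a:                    # flush the previous completed pair
                pairs.append(("\n".join(q).strip(), "\n".join(a).strip()))
            q, a, mode = [], [], "q"
        elif marker == "### A":
            mode = "a"
        elif mode == "q":
            q.append(line)
        elif mode == "a":
            a.append(line)
    if q and a:
        pairs.append(("\n".join(q).strip(), "\n".join(a).strip()))
    return pairs
```

Each resulting tuple maps directly onto an SFT row: the first element is the prompt, the second the training target.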
`dlm synth instructions` can also write synthesized instruction
sections back into the document. Those keep the same basic body grammar
but add an HTML provenance marker immediately after the fence. See the
[instruction section reference](instruction-section.md) for the full
marker shape and validation rules.
### Preference (`::preference::`)

Open with `::preference::`. Each record has three blocks:

```dlm
::preference::
### Prompt
Explain recursion to a beginner.

### Chosen
Recursion is when a function calls itself on a smaller piece of the
problem. Imagine matryoshka dolls.

### Rejected
A recursive function is any function that refers to itself in its own
definition using the stack frame protocol.
```
Trains via **DPO** (direct preference optimization) or **ORPO** — the
model learns to prefer the `Chosen` phrasing. The DPO / ORPO trainer
lands in Sprint 17/18.
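The three-block grammar could be collected into preference records like this (a sketch under the same caveat as above: `preference_records` is illustrative, not a DLM function):

```python
def preference_records(section_lines):
    """Split a preference section into {prompt, chosen, rejected} dicts."""
    FIELDS = {"### Prompt": "prompt", "### Chosen": "chosen", "### Rejected": "rejected"}
    records, current, field = [], {}, None
    for line in section_lines:
        key = FIELDS.get(line.strip())
        if key == "prompt" and current:    # a new Prompt block starts a new record
            records.append(current)
            current = {}
        if key:
            field = key
            current[field] = []
        elif field:
            current[field].append(line)
    if current:
        records.append(current)
    return [{k: "\n".join(v).strip() for k, v in r.items()} for r in records]
```

Each dict is the shape a DPO/ORPO trainer typically consumes: one prompt plus a chosen/rejected completion pair.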
### Image (`::image path="..." alt="..."::`)

Schema v10 adds image sections for vision-language bases. The initial
launch covered PaliGemma; later follow-ups added Qwen2-VL,
InternVL2, and Mistral Small 3.1 registry rows. The fence uses
attribute syntax instead of the bare `::type::` form:

```dlm
::image path="figures/architecture.png" alt="training pipeline diagram"::
Caption text describing the figure. The caption body becomes the "text"
part of the training row; the placeholder expands to the base's image
tokens at collate time.
```

Required attributes: `path` (the image file, resolved relative to the
`.dlm`'s parent dir). Optional: `alt` (short description; defaults to
the filename stem on directive-ingested images).
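The attribute form can be pulled apart with a small regex. This is an illustrative sketch of the grammar, not DLM's parser:

```python
import re

ATTR = re.compile(r'(\w+)="([^"]*)"')          # key="value" pairs
IMAGE_FENCE = re.compile(r'^::image\s+(.*?)::\s*$')

def parse_image_fence(line: str):
    """Parse an ::image path="..." alt="..."::  fence into an attribute dict.

    Returns None for non-image lines; raises if the required path is missing.
    """
    m = IMAGE_FENCE.match(line.strip())
    if m is None:
        return None
    attrs = dict(ATTR.findall(m.group(1)))
    if "path" not in attrs:
        raise ValueError('::image:: fence requires a path attribute')
    return attrs
```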
**Supported extensions.** `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`,
`.bmp`, `.tiff`. Other binary types (PDF, archives) stay out of the
training corpus by default.

**Content hash.** Image sections hash on `(type, path, blob_sha)`
rather than the body text. Two identical-bytes images at different
paths produce different `section_id`s — paths carry meaning. Changing
the blob bytes flips the ID even if the path didn't move.
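A minimal model of that hashing rule. The `\x00` separator and the exact serialization are assumptions; only the documented inputs and the 16-hex-char truncation come from this page:

```python
import hashlib

def image_section_id(path: str, blob_sha: str) -> str:
    """Image sections hash on (type, path, blob_sha), not the caption text."""
    canonical = f"image\x00{path}\x00{blob_sha}"   # separator is an assumption
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

The two documented properties fall out directly: identical bytes at two paths get two IDs, and new bytes at the same path flip the ID.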
**Directive ingest.** `training.sources` directives with image
extensions in their `include` globs ingest automatically:

```yaml
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
```

Each discovered image becomes an `::image::` section with
`alt=<filename-stem>` and flows through the same row-emission path.
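The discovery step amounts to a recursive glob walk over the source path. A sketch assuming pathlib-style glob semantics; `discover_images` is illustrative, not a DLM function:

```python
from pathlib import Path

def discover_images(source_path: str, include: list[str]):
    """Yield ::image:: section stubs for files matching the include globs,
    defaulting alt to the filename stem."""
    root = Path(source_path)
    seen = set()
    for pattern in include:
        for f in sorted(root.glob(pattern)):
            if f.is_file() and f not in seen:
                seen.add(f)                 # de-duplicate across overlapping globs
                yield {"type": "image", "path": str(f), "alt": f.stem}
```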
**Current InternVL caveat.** InternVL-family rows stay visible in the
registry for planning and future work, but the current runtime still
needs a custom processor/collator path for their `<image>` expansion
and `image_flags` contract. See the [multi-modal training
cookbook](../cookbook/multimodal-training.md) and [VL memory
guide](../hardware/vl-memory.md) before picking `internvl2-2b`.

**Base-model requirements.** Only vision-language bases accept image
sections at training time. `dlm init --multimodal` scaffolds a VL
doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
refuse image sections at train start with a pointer to `--multimodal`.
## Fence rules

- A fence must be the full line — `::instruction::` with no leading/
  trailing content other than whitespace.
- Fences inside triple-backtick code blocks are **not** active — the
  parser is aware of the code-fence context.
- An unfenced heading (`# ...`, `## ...`) inside an open instruction or
  preference section does **not** close the section. Close with the
  next section fence or end-of-file.
- Section type is case-sensitive; `::Instruction::` is rejected.
- Sprint 20 introduces a `::type#adapter-name::` suffix for
  multi-adapter routing; the v1 parser accepts the suffix but ignores
  the `#...` tail.
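Taken together, the rules for bare fences fit in one regex. This sketch deliberately leaves attribute-bearing fences like `::image ...::` out of scope:

```python
import re

# Full line, lowercase type, optional #adapter-name suffix that is ignored.
FENCE = re.compile(r'^::([a-z]+)(#[A-Za-z0-9_-]+)?::$')

def parse_fence(line: str):
    """Return the section type for a valid bare fence line, else None.

    Surrounding whitespace is allowed; the type is case-sensitive
    (lowercase only), and a #adapter-name suffix is accepted but ignored.
    """
    m = FENCE.match(line.strip())
    return m.group(1) if m else None
```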
## Section IDs

Every section gets a content-addressed ID — the first 16 hex chars of
the SHA-256 of the section's canonical text. The manifest's
`content_hashes` records these IDs and their types so the next `dlm train`
can compute what's new, unchanged, or removed (Sprint 08's delta system).

You don't write these IDs in the document — they're derived and live
only in the manifest. But if you're debugging "why isn't this section
being picked up as new?", the ID in `dlm show --json` is the answer.
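A minimal model of the ID derivation and the delta computation it enables. `section_id` and `delta` are illustrative helpers, not DLM APIs:

```python
import hashlib

def section_id(canonical_text: str) -> str:
    """First 16 hex chars of the SHA-256 of the section's canonical text."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]

def delta(old_ids: set, new_ids: set) -> dict:
    """What the next train run treats as new, unchanged, or removed."""
    return {
        "new": new_ids - old_ids,
        "unchanged": new_ids & old_ids,
        "removed": old_ids - new_ids,
    }
```

Because the ID is derived from content, editing a section's text changes its ID, so the old ID shows up as "removed" and the edited text as "new" — which is exactly the behavior to check when a section isn't being picked up.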
## What NOT to put in sections

- API keys, personal data, anything you wouldn't want baked into a
  model you'll share. The adapter learns from everything in the file.
- JSON / YAML config that the model should emit literally — use
  instruction Q&A pairs instead. Training on raw config produces
  noisy generation.
- Massive code dumps (>200 KB). The replay corpus retains everything,
  and `sequence_len` is bounded at 32 KB; a single enormous section
  trains one step and wastes the remaining token budget.
## See also

- [Instruction section reference](instruction-section.md)
- [Preference section reference](preference-section.md)
- [First train walkthrough](../getting-started/first-train.md)
- [Cookbook: coding tutor](../cookbook/coding-tutor.md) — full
  example of instruction-heavy authoring