
# Synthesize training data

`dlm synth instructions` turns prose-heavy `.dlm` files into usable `::instruction::` sections.

This is the shortest path from "I have notes" to "I have supervised training pairs" when the document already contains domain prose but not enough authored Q/A.

## What it does

The synth loop:

1. Finds non-empty prose sections in the document.
2. Prompts a teacher model to generate question/answer pairs about that prose.
3. Deduplicates the generated pairs.
4. Optionally filters them through the `sway` judge.
5. Either stages the accepted `auto_synth` sections for inspection or writes them straight back into the `.dlm`.
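The loop above can be sketched in a few lines of Python. This is an illustrative sketch only, not DLM's actual internals: the `teacher` and `judge` callables, their signatures, and the dedup key are all assumptions.

```python
# Hypothetical sketch of the synth loop. Function names, signatures, and
# the dedup heuristic are illustrative, not DLM's real implementation.

def synth_instructions(prose_sections, teacher, judge=None,
                       threshold=0.2, per_section=1):
    candidates = []
    for section in prose_sections:
        if not section.strip():
            continue                                # step 1: skip empty prose
        candidates += teacher(section, n=per_section)  # step 2: generate Q/A pairs

    seen, deduped = set(), []
    for q, a in candidates:                         # step 3: drop near-duplicates
        key = " ".join(q.lower().split())           # naive normalization
        if key not in seen:
            seen.add(key)
            deduped.append((q, a))

    if judge is None:                               # step 4 is optional
        return deduped
    return [(q, a) for q, a in deduped if judge(q, a) >= threshold]
```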

The generated sections are still normal `::instruction::` sections. They just carry provenance metadata so DLM can tell synthesized pairs from hand-authored ones.
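For illustration, an accepted synthesized section might look something like this. The `::instruction::` marker and the `auto_synth: true` flag are documented; the question/answer layout shown here is an assumption, not the canonical format:

```dlm
::instruction::
auto_synth: true
question: What does DGEMM compute?
answer: DGEMM multiplies two dense matrices and can optionally accumulate
  the result into an existing output matrix.
```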

## Choose a teacher

The teacher decides who writes the candidate Q/A pairs:

- `self`: use the current local adapter for this document
- `hf:<model>`: use a HuggingFace text model
- `openai:<model>`: use the OpenAI API
- `anthropic:<model>`: use the Anthropic API
- `vllm-server:<url>`: use an OpenAI-compatible local server

The current default is `self`, but that only makes sense once the document already has a trained adapter. For a cold start, either:

- train once first, then synth with `self`, or
- use `hf:` / `openai:` / `anthropic:` / `vllm-server:` as the teacher

## Minimal example

Start with a prose-heavy document:

```dlm
---
dlm_id: 01K...
dlm_version: 15
base_model: smollm2-135m
---

DGEMM multiplies two dense matrices and can optionally accumulate the
result into an existing output matrix.
```

Generate one extraction-style pair per prose section with an HF teacher:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 1 \
  --strategy extraction
```

That prints two summaries:

- the raw synth plan
- the filter report (`generated`, `dedup`, `judge passed`, `threshold`)

By default, accepted sections are staged under the store so you can inspect them:

```sh
uv run dlm synth list notes.dlm
```

If you want the accepted pairs written straight back into the document, use `--apply`:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 1 \
  --strategy extraction \
  --apply
```

## Strategy choices

The `--strategy` flag controls what kind of questions the teacher is asked to produce:

- `extraction`: questions answered directly by the prose
- `expansion`: questions a curious reader might ask beyond the exact wording of the prose
- `both`: split the per-section budget across both prompt styles

Start with `extraction` when you care about faithfulness. Reach for `expansion` once the document already has a stable domain voice and you want broader instructional coverage.

## Filter choices

The `--filter` flag controls post-generation cleanup:

- `sway`: dedup plus judge filtering against an empty baseline
- `dedup-only`: keep only near-duplicate suppression
- `none`: accept everything that parses as a valid pair

`sway` is the safest default and is what most users should keep. It is especially helpful when using creative teachers or `--strategy both`.

If you are debugging prompt quality, use `--filter none` once and look at the raw plan before deciding whether the issue is generation or filtering.
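The "empty baseline" idea behind the `sway` judge can be illustrated with a small sketch: a pair passes only if the source prose actually improves the judge's estimate relative to no context at all. The `score` callable here is a hypothetical stand-in for a judge model, not sway's real scoring.

```python
# Illustrative sketch of empty-baseline judging, not sway's actual scoring.
# `score` is a hypothetical judge that rates a Q/A pair given some context.

def judge_margin(score, question, answer, prose):
    with_context = score(question, answer, context=prose)
    baseline = score(question, answer, context="")  # empty baseline
    return with_context - baseline

def passes(score, question, answer, prose, threshold=0.2):
    # Accept only pairs that the prose genuinely helps answer.
    return judge_margin(score, question, answer, prose) > threshold
```

Under this framing, "everything gets filtered out" means the margin over the empty baseline rarely clears the threshold.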

## Useful knobs

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 3 \
  --strategy both \
  --filter sway \
  --threshold 0.2 \
  --max-pairs 8 \
  --max-new-tokens 512 \
  --temp 0.2 \
  --top-p 0.95 \
  --seed 7
```

The most useful flags in practice are:

- `--per-section`: generate more than one candidate pair per prose block
- `--max-pairs`: cap document churn on large files
- `--threshold`: tighten or loosen `sway` acceptance
- `--temp` and `--top-p`: increase diversity when the teacher is too repetitive

## Training after synth

Once the document has accepted `auto_synth` instruction sections, the next normal train run consumes them like any other instruction pair:

```sh
uv run dlm train notes.dlm
```

No special train flag is needed. Synthesized instruction sections flow through the same SFT path as hand-authored sections.

## Revert and inspection

List applied auto-synth sections:

```sh
uv run dlm synth list notes.dlm
```

Strip every synthesized instruction section from the document:

```sh
uv run dlm synth revert notes.dlm
```

This only removes `auto_synth: true` instruction sections. Hand-authored instruction blocks stay untouched.
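The revert rule amounts to a simple filter. The sketch below assumes sections can be represented as dicts with a `kind` field; that representation is an assumption for illustration, not DLM's data model:

```python
# Sketch of the revert rule. Representing sections as dicts with "kind"
# and "auto_synth" keys is an assumption, not DLM's actual data model.

def revert_synth(sections):
    # Keep everything except instruction sections marked auto_synth: true.
    return [s for s in sections
            if not (s.get("kind") == "instruction"
                    and s.get("auto_synth") is True)]
```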

## Common failure modes

### The self teacher is weak

If `--teacher self` produces junk, the adapter probably is not ready yet. Train once more first, or use a stronger external teacher for the first synth pass.

### Everything gets filtered out

That usually means one of three things:

- the teacher produced near-duplicates
- the generated answers were worse than the empty-baseline comparison in `sway`
- the threshold is too strict

Lower `--threshold`, or temporarily switch to `--filter dedup-only` to see whether the judge is the main bottleneck.

### The document churns too much

Use `--max-pairs` aggressively at first. A small accepted batch is much easier to reason about than dumping dozens of synthetic sections into a single file.

## See also

- [Instruction section reference](../format/instruction-section.md)
- [Bootstrap self-improving](bootstrap-self-improving.md)
- [Self-improving loop](self-improving-loop.md)
- [CLI reference](../cli/reference.md)