
# Synthesize training data

`dlm synth instructions` turns prose-heavy `.dlm` files into usable `::instruction::` sections.

This is the shortest path from "I have notes" to "I have supervised training pairs" when the document already contains domain prose but not enough authored Q/A.

## What it does

The synth loop:

1. Finds non-empty prose sections in the document.
2. Prompts a teacher model to generate question/answer pairs about that prose.
3. Deduplicates the generated pairs.
4. Optionally filters them through the `sway` judge.
5. Either stages the accepted `auto_synth` sections for inspection or writes them straight back into the `.dlm`.
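The loop above can be sketched in a few lines of Python. This is an illustrative sketch only, not DLM's actual internals: the `teacher` and `judge` callables, their signatures, and the dedup key are all assumptions.

```python
# Hypothetical sketch of the synth loop. Function names, signatures, and
# the dedup heuristic are illustrative, not DLM's real implementation.

def synth_instructions(prose_sections, teacher, judge=None,
                       threshold=0.2, per_section=1):
    candidates = []
    for section in prose_sections:
        if not section.strip():
            continue                                # step 1: skip empty prose
        candidates += teacher(section, n=per_section)  # step 2: generate Q/A pairs

    seen, deduped = set(), []
    for q, a in candidates:                         # step 3: drop near-duplicates
        key = " ".join(q.lower().split())           # naive normalization
        if key not in seen:
            seen.add(key)
            deduped.append((q, a))

    if judge is None:                               # step 4 is optional
        return deduped
    return [(q, a) for q, a in deduped if judge(q, a) >= threshold]
```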

The generated sections are still normal `::instruction::` sections. They just carry provenance metadata so DLM can tell synthesized pairs from hand-authored ones.
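For illustration, an accepted synthesized section might look something like this. The `::instruction::` marker and the `auto_synth: true` flag are documented; the question/answer layout shown here is an assumption, not the canonical format:

```dlm
::instruction::
auto_synth: true
question: What does DGEMM compute?
answer: DGEMM multiplies two dense matrices and can optionally accumulate
  the result into an existing output matrix.
```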

## Choose a teacher

The teacher decides who writes the candidate Q/A pairs:

- `self`: use the current local adapter for this document
- `hf:<model>`: use a HuggingFace text model
- `openai:<model>`: use the OpenAI API
- `anthropic:<model>`: use the Anthropic API
- `vllm-server:<url>`: use an OpenAI-compatible local server

The current default is `self`, but that only makes sense once the document already has a trained adapter. For a cold start, either:

- train once first, then synth with `self`, or
- use `hf:` / `openai:` / `anthropic:` / `vllm-server:` as the teacher

## Minimal example

Start with a prose-heavy document:

```dlm
---
dlm_id: 01K...
dlm_version: 15
base_model: smollm2-135m
---

DGEMM multiplies two dense matrices and can optionally accumulate the
result into an existing output matrix.
```

Generate one extraction-style pair per prose section with an HF teacher:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 1 \
  --strategy extraction
```

That prints two summaries:

- the raw synth plan
- the filter report (`generated`, `dedup`, `judge passed`, `threshold`)

By default, accepted sections are staged under the store so you can inspect them:

```sh
uv run dlm synth list notes.dlm
```

If you want the accepted pairs written straight back into the document, use `--apply`:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 1 \
  --strategy extraction \
  --apply
```

## Strategy choices

The `--strategy` flag controls what kind of questions the teacher is asked to produce:

- `extraction`: questions answered directly by the prose
- `expansion`: questions a curious reader might ask beyond the exact wording of the prose
- `both`: split the per-section budget across both prompt styles

Start with `extraction` when you care about faithfulness. Reach for `expansion` once the document already has a stable domain voice and you want broader instructional coverage.

## Filter choices

The `--filter` flag controls post-generation cleanup:

- `sway`: dedup plus judge filtering against an empty baseline
- `dedup-only`: keep only near-duplicate suppression
- `none`: accept everything that parses as a valid pair

`sway` is the safest default and is what most users should keep. It is especially helpful when using creative teachers or `--strategy both`.

If you are debugging prompt quality, use `--filter none` once and look at the raw plan before deciding whether the issue is generation or filtering.
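The "empty baseline" idea behind the `sway` judge can be illustrated with a small sketch: a pair passes only if the source prose actually improves the judge's estimate relative to no context at all. The `score` callable here is a hypothetical stand-in for a judge model, not sway's real scoring.

```python
# Illustrative sketch of empty-baseline judging, not sway's actual scoring.
# `score` is a hypothetical judge that rates a Q/A pair given some context.

def judge_margin(score, question, answer, prose):
    with_context = score(question, answer, context=prose)
    baseline = score(question, answer, context="")  # empty baseline
    return with_context - baseline

def passes(score, question, answer, prose, threshold=0.2):
    # Accept only pairs that the prose genuinely helps answer.
    return judge_margin(score, question, answer, prose) > threshold
```

Under this framing, "everything gets filtered out" means the margin over the empty baseline rarely clears the threshold.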

## Useful knobs

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 3 \
  --strategy both \
  --filter sway \
  --threshold 0.2 \
  --max-pairs 8 \
  --max-new-tokens 512 \
  --temp 0.2 \
  --top-p 0.95 \
  --seed 7
```

The most useful flags in practice are:

- `--per-section`: generate more than one candidate pair per prose block
- `--max-pairs`: cap document churn on large files
- `--threshold`: tighten or loosen `sway` acceptance
- `--temp` and `--top-p`: increase diversity when the teacher is too repetitive

## Training after synth

Once the document has accepted `auto_synth` instruction sections, the next normal train run consumes them like any other instruction pair:

```sh
uv run dlm train notes.dlm
```

No special train flag is needed. Synthesized instruction sections flow through the same SFT path as hand-authored sections.

## Revert and inspection

List applied auto-synth sections:

```sh
uv run dlm synth list notes.dlm
```

Strip every synthesized instruction section from the document:

```sh
uv run dlm synth revert notes.dlm
```

This only removes `auto_synth: true` instruction sections. Hand-authored instruction blocks stay untouched.
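The revert rule amounts to a simple filter. The sketch below assumes sections can be represented as dicts with a `kind` field; that representation is an assumption for illustration, not DLM's data model:

```python
# Sketch of the revert rule. Representing sections as dicts with "kind"
# and "auto_synth" keys is an assumption, not DLM's actual data model.

def revert_synth(sections):
    # Keep everything except instruction sections marked auto_synth: true.
    return [s for s in sections
            if not (s.get("kind") == "instruction"
                    and s.get("auto_synth") is True)]
```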

## Common failure modes

### The self teacher is weak

If `--teacher self` produces junk, the adapter probably is not ready yet. Train once more first, or use a stronger external teacher for the first synth pass.

### Everything gets filtered out

That usually means one of three things:

- the teacher produced near-duplicates
- the generated answers were worse than the empty-baseline comparison in `sway`
- the threshold is too strict

Lower `--threshold`, or temporarily switch to `--filter dedup-only` to see whether the judge is the main bottleneck.

### The document churns too much

Use `--max-pairs` aggressively at first. A small accepted batch is much easier to reason about than dumping dozens of synthetic sections into a single file.

## See also

- [Instruction section reference](../format/instruction-section.md)
- [Bootstrap self-improving](bootstrap-self-improving.md)
- [Self-improving loop](self-improving-loop.md)
- [CLI reference](../cli/reference.md)