
# Bootstrap self-improving

The self-teacher loop is the most interesting version of Sprint 43: your current adapter writes new `::instruction::` sections for its own document, then the next train run folds them back in.

This is not magic. It works because DLM already has:

- replay-backed retraining
- synthesized instruction provenance (`auto_synth`)
- a local `sway` judge for filtering weak candidates

Used carefully, it turns one trained document into a steadily better instruction corpus.

## The honest starting point

`--teacher self` uses the current adapter for that `.dlm`. That means the loop starts **after** there is already a trainable local adapter.

A good bootstrap pattern is:

1. Start with prose plus at least some useful seed supervision, or do an initial train from prose and existing sections.
2. Run `dlm synth instructions --teacher self`.
3. Retrain on the accepted synth sections.
4. Repeat in small batches.

If the adapter still cannot answer basic questions about the document, synthetic instruction generation will mostly amplify noise.

## Minimal loop

Train once:

```sh
uv run dlm train notes.dlm
```

Generate a small accepted batch from the current adapter and write it back immediately:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher self \
  --per-section 1 \
  --strategy extraction \
  --max-pairs 4 \
  --apply
```

Retrain on the expanded instruction set:

```sh
uv run dlm train notes.dlm
```

Then inspect real output quality:

```sh
uv run dlm prompt notes.dlm "What does DGEMM do?"
```

That is the basic self-improving loop.

## Safer staged version

If you want to inspect before writing:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher self \
  --per-section 1 \
  --strategy extraction

uv run dlm synth list notes.dlm
```

The current implementation stages accepted synth sections for inspection, but it does not yet have a separate `dlm synth apply` subcommand. Use `--apply` on the synth run when you want the sections written straight into the document.

## Why `sway` stays the default

The self-teacher path is the place where the default `--filter sway` matters most.

Without filtering, a weak adapter can happily generate:

- duplicates
- overly generic answers
- plausible but wrong extrapolations

The current synth filter stack is:

  1. dedup
  2. optional judge pass
  3. optional threshold cut

The CLI prints those counts so you can tell whether the loop is getting better or just louder.
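That three-stage stack can be sketched in miniature. This is an illustrative pipeline, not DLM's actual code, and `judge_score` is a stand-in for whatever the `sway` judge returns:

```python
def filter_candidates(candidates, judge_score=None, threshold=None):
    """Illustrative synth filter stack: dedup, optional judge, optional cut.

    candidates: list of (instruction, answer) pairs.
    judge_score: optional callable scoring a pair in [0, 1].
    threshold: optional minimum judge score for a pair to survive.
    """
    # 1. dedup: drop exact repeats, keeping the first occurrence
    seen, unique = set(), []
    for pair in candidates:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)

    # 2. optional judge pass: score every surviving pair
    if judge_score is None:
        return unique
    scored = [(pair, judge_score(pair)) for pair in unique]

    # 3. optional threshold cut: keep only pairs above the bar
    if threshold is not None:
        scored = [(p, s) for p, s in scored if s >= threshold]
    return [p for p, _ in scored]
```

The shape matters more than the details: dedup always runs, so a weak adapter's repeats never reach training, while the judge and threshold stages only bite when you ask for them.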

## A conservative rhythm

This is a healthy local rhythm for a real project:

```sh
uv run dlm train notes.dlm
uv run dlm synth instructions notes.dlm \
  --teacher self \
  --per-section 1 \
  --max-pairs 4 \
  --apply
uv run dlm train notes.dlm
uv run dlm prompt notes.dlm "Explain the core idea."
```

Keep the accepted batch small at first. The point is to improve the document's instruction surface, not flood it with speculative rows.
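To make the small-batch discipline concrete, here is a hypothetical cap in the spirit of `--max-pairs 4`: even after filtering, only the top-scored handful of pairs goes back into the document each pass. This is illustrative, not DLM's implementation:

```python
def cap_accepted(scored_pairs, max_pairs=4):
    """Keep at most max_pairs accepted pairs per synth pass.

    scored_pairs: list of (pair, judge_score) tuples.
    Returns the pairs with the highest judge scores, best first,
    so a noisy pass can never flood the document.
    """
    ranked = sorted(scored_pairs, key=lambda ps: ps[1], reverse=True)
    return [pair for pair, _ in ranked[:max_pairs]]
```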

## When to switch away from `self`

The self-teacher is convenient, but not always the right teacher.

Prefer an external teacher when:

- the local adapter is still very early and weak
- you need broader general knowledge than the current adapter can supply
- you want to compare local-vs-external synth quality on the same prose

That usually looks like:

```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 1 \
  --apply
```

and then later moving back to `--teacher self` once the adapter has real domain traction.

## Pairing Sprint 43 with Sprint 42

Instruction synthesis and preference mining are complementary:

- `dlm synth instructions` grows the SFT side of the document
- `dlm synth preferences` / `dlm preference mine` sharpens ranking and behavior once the adapter can already produce multiple plausible answers

A practical sequence is:

  1. train
  2. synth instructions
  3. train
  4. mine preferences
  5. train preference phase

That is the closest current DLM path to a fully local self-improving document loop.
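That five-step sequence can be laid out as a script. The sketch below only assembles the command strings; the document path passed to `dlm preference mine` and the final plain retrain standing in for the preference-phase flags are assumptions, so treat it as a template rather than a recipe:

```python
def sequence_commands(doc="notes.dlm"):
    """Build the Sprint 42 + 43 pairing as a list of shell commands.

    Illustrative only: the preference-phase train (step 5) is shown as
    a plain retrain, since its exact flags are not covered here.
    """
    synth = (f"uv run dlm synth instructions {doc} "
             "--teacher self --per-section 1 --max-pairs 4 --apply")
    return [
        f"uv run dlm train {doc}",            # 1. train
        synth,                                # 2. synth instructions
        f"uv run dlm train {doc}",            # 3. train
        f"uv run dlm preference mine {doc}",  # 4. mine preferences
        f"uv run dlm train {doc}",            # 5. train (preference phase)
    ]
```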

## Failure modes to watch

### The second pass is not better

That usually means one of:

- the first synth batch was too weak
- the document still lacks enough domain prose
- the adapter is too small for the domain

Do not assume "more synthetic rows" automatically means "better model."
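One cheap guard is to score a small held-out probe set after each pass, for example by grading a few `dlm prompt` answers by hand, and stop when the gain flattens. A hypothetical helper for that decision:

```python
def keep_looping(pass_scores, min_gain=0.01):
    """Decide whether another synth+train pass is worth running.

    pass_scores: probe-set scores after each pass, oldest first.
    Returns False once the latest pass gained less than min_gain,
    i.e. more synthetic rows stopped meaning a better model.
    """
    if len(pass_scores) < 2:
        return True  # not enough history to judge yet
    return (pass_scores[-1] - pass_scores[-2]) >= min_gain
```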

### Expansion mode gets weird

`--strategy expansion` is useful, but it is also the fastest route to polished nonsense. Prefer `extraction` for early loops and only widen to `both` or `expansion` once the adapter is already grounded.

### Prompt quality improves but factuality does not

That is a signal to go back to better prose or hand-authored instructional supervision. Self-improvement cannot invent missing source knowledge.

## See also

- [Synthesize training data](synthesize-training-data.md)
- [Instruction section reference](../format/instruction-section.md)
- [Self-improving loop](self-improving-loop.md)
- [Reward-model integration](reward-model-integration.md)
