# Bootstrap self-improving
The self-teacher loop is the most interesting version of Sprint 43:
your current adapter writes new `::instruction::` sections for its own
document, then the next train run folds them back in.
This is not magic. It works because DLM already has:
- replay-backed retraining
- synthesized instruction provenance (`auto_synth`)
- a local `sway` judge for filtering weak candidates
Used carefully, it turns one trained document into a steadily better instruction corpus.
## The honest starting point
`--teacher self` uses the current adapter for that `.dlm`. That means
the loop starts **after** there is already a trainable local adapter.
A good bootstrap pattern is:
1. Start with prose plus at least some useful seed supervision, or do an
   initial train from prose and existing sections.
2. Run `dlm synth instructions --teacher self`.
3. Retrain on the accepted synth sections.
4. Repeat in small batches.
If the adapter still cannot answer basic questions about the document, synthetic instruction generation will mostly amplify noise.
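The "can it answer basic questions" check can be made mechanical. Below is a minimal sketch of such a readiness gate, assuming a caller-supplied `ask` function that wraps `uv run dlm prompt`; the probe format, keyword matching, and `min_hits` threshold are all illustrative assumptions, not part of DLM:

```python
# Hypothetical readiness gate for the self-teacher loop: probe the
# adapter with a few questions about its own document and only proceed
# to `--teacher self` synthesis if enough answers contain an expected
# keyword. `ask`, the probes, and `min_hits` are assumptions here.
def ready_for_self_synth(ask, probes, min_hits=2):
    hits = 0
    for question, expected_keyword in probes:
        answer = ask(question)
        # crude containment check; a real gate could use the sway judge
        if expected_keyword.lower() in answer.lower():
            hits += 1
    return hits >= min_hits
```

If the gate fails, add more prose or seed supervision before synthesizing.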
## Minimal loop
Train once:
```sh
uv run dlm train notes.dlm
```
Generate a small accepted batch from the current adapter and write it back immediately:
```sh
uv run dlm synth instructions notes.dlm \
  --teacher self \
  --per-section 1 \
  --strategy extraction \
  --max-pairs 4 \
  --apply
```
Retrain on the expanded instruction set:
```sh
uv run dlm train notes.dlm
```
Then inspect real output quality:
```sh
uv run dlm prompt notes.dlm "What does DGEMM do?"
```
That is the basic self-improving loop.
## Safer staged version
If you want to inspect before writing:
```sh
uv run dlm synth instructions notes.dlm \
  --teacher self \
  --per-section 1 \
  --strategy extraction

uv run dlm synth list notes.dlm
```
The current implementation stages accepted synth sections for
inspection, but it does not yet have a separate `dlm synth apply`
subcommand. Use `--apply` on the synth run when you want the sections
written straight into the document.
## Why `sway` stays the default
The self-teacher path is the place where the default `--filter sway`
matters most.
Without filtering, a weak adapter can happily generate:
- duplicates
- overly generic answers
- plausible but wrong extrapolations
The current synth filter stack is:

1. dedup
2. optional judge pass
3. optional threshold cut
The CLI prints those counts so you can tell whether the loop is getting better or just louder.
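The three-stage stack can be sketched as a plain function. This is an illustrative model of the behavior described above, not DLM's actual implementation; the candidate dict shape and the judge signature are assumptions:

```python
# Illustrative sketch of the synth filter stack: dedup, then an
# optional judge pass, then an optional threshold cut, with counts
# returned so the caller can report them like the CLI does.
def filter_candidates(candidates, judge=None, threshold=None):
    counts = {"input": len(candidates)}

    # 1. dedup: drop exact duplicate instruction/answer pairs
    seen, unique = set(), []
    for c in candidates:
        key = (c["instruction"].strip().lower(), c["answer"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(c)
    counts["after_dedup"] = len(unique)

    # 2. optional judge pass: score each surviving candidate
    if judge is not None:
        for c in unique:
            c["score"] = judge(c)

    # 3. optional threshold cut: keep candidates at or above the cut
    if threshold is not None:
        unique = [c for c in unique if c.get("score", 0.0) >= threshold]
    counts["accepted"] = len(unique)
    return unique, counts
```

Watching `after_dedup` and `accepted` shrink or grow round over round is exactly the "better or just louder" signal the CLI counts give you.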
## A conservative rhythm
This is a healthy local rhythm for a real project:
```sh
uv run dlm train notes.dlm
uv run dlm synth instructions notes.dlm \
  --teacher self \
  --per-section 1 \
  --max-pairs 4 \
  --apply
uv run dlm train notes.dlm
uv run dlm prompt notes.dlm "Explain the core idea."
```
Keep the accepted batch small at first. The point is to improve the document's instruction surface, not flood it with speculative rows.
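That rhythm is easy to wrap in a small driver. Here is a hedged sketch that shells out to the same commands shown above; the `run` parameter is injectable so the sequence can be inspected or dry-run, and nothing about this helper is part of the DLM API:

```python
import subprocess

# Hypothetical driver for one conservative round: train, synth a small
# accepted batch with the self teacher, retrain. Returns the command
# list so callers can log or verify the sequence.
def run_round(doc="notes.dlm", run=None):
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)
    commands = [
        ["uv", "run", "dlm", "train", doc],
        ["uv", "run", "dlm", "synth", "instructions", doc,
         "--teacher", "self", "--per-section", "1",
         "--max-pairs", "4", "--apply"],
        ["uv", "run", "dlm", "train", doc],
    ]
    for cmd in commands:
        run(cmd)
    return commands
```

Passing a fake `run` (for example `seen.append`) lets you check the sequence without touching a real document.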
## When to switch away from `self`
The self-teacher is convenient, but not always the right teacher.
Prefer an external teacher when:
- the local adapter is still very early and weak
- you need broader general knowledge than the current adapter can supply
- you want to compare local-vs-external synth quality on the same prose
That usually looks like:
```sh
uv run dlm synth instructions notes.dlm \
  --teacher hf:Qwen/Qwen2.5-1.5B-Instruct \
  --per-section 1 \
  --apply
```
and then later moving back to `--teacher self` once the adapter has real
domain traction.
## Pairing Sprint 43 with Sprint 42
Instruction synthesis and preference mining are complementary:
- `dlm synth instructions` grows the SFT side of the document
- `dlm synth preferences` / `dlm preference mine` sharpens ranking and
  behavior once the adapter can already produce multiple plausible
  answers
A practical sequence is:
1. train
2. synth instructions
3. train
4. mine preferences
5. train preference phase
That is the closest current DLM path to a fully local self-improving document loop.
## Failure modes to watch
### The second pass is not better
That usually means one of:
- the first synth batch was too weak
- the document still lacks enough domain prose
- the adapter is too small for the domain
Do not assume "more synthetic rows" automatically means "better model."
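One way to make "is the second pass better" concrete is to score a fixed probe set after each retrain and stop synthesizing when the average stops improving. A minimal sketch; the 0-to-1 scoring scale and the `min_gain` margin are assumptions for illustration:

```python
# Hypothetical stopping rule for the loop: given per-probe scores from
# the previous and current rounds, continue only if the mean score
# improved by at least `min_gain`.
def should_continue(prev_scores, curr_scores, min_gain=0.01):
    prev_mean = sum(prev_scores) / len(prev_scores)
    curr_mean = sum(curr_scores) / len(curr_scores)
    return (curr_mean - prev_mean) >= min_gain
```

A flat or falling mean is the cue to stop adding rows and fix the inputs instead.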
### Expansion mode gets weird
`--strategy expansion` is useful, but it is also the fastest route to
polished nonsense. Prefer `extraction` for early loops and only widen to
`both` or `expansion` once the adapter is already grounded.
### Prompt quality improves but factuality does not
That is a signal to go back to better prose or hand-authored instructional supervision. Self-improvement cannot invent missing source knowledge.
## See also

- [Synthesize training data](synthesize-training-data.md)
- [Instruction section reference](../format/instruction-section.md)
- [Self-improving loop](self-improving-loop.md)
- [Reward-model integration](reward-model-integration.md)