# Self-improving loop

`dlm preference mine` closes the gap between "I have an adapter" and "I have new preference pairs to train on."

The loop is simple:

1. Train an initial adapter from prose and `::instruction::` sections.
2. Mine auto-ranked `::preference::` sections from that adapter.
3. Apply the mined sections back into the document.
4. Train the preference phase again.

This is the shortest honest path to "train once, judge outputs, train again" without leaving the `.dlm`.

## When this works well

- You already have useful `::instruction::` prompts in the document.
- The adapter is good enough to generate multiple distinct answers.
- You want to sharpen style, brevity, refusal behavior, or task preference, not inject brand-new knowledge.

If the model is still too weak to produce meaningful alternatives, do another SFT pass first. Preference mining is an alignment loop, not a replacement for basic competence.

## Minimal loop

Start with a normal document that has at least one instruction section:

```dlm
::instruction::
### Q
How should release notes read?
### A
Short, factual, and low-drama.
```

Train once:

```sh
uv run dlm train release-notes.dlm
```

Mine a small batch of candidate pairs and write them straight into the document:

```sh
uv run dlm preference mine release-notes.dlm \
  --samples 4 \
  --max-pairs 8 \
  --apply
```

Then train just the preference phase:

```sh
uv run dlm train release-notes.dlm --phase preference
```

That writes the next adapter version using the newly mined `::preference::` sections.

## Safer first pass

If you want a review step before touching the document, omit `--apply`:

```sh
uv run dlm preference mine release-notes.dlm --samples 4 --max-pairs 8
uv run dlm preference list release-notes.dlm
uv run dlm preference apply release-notes.dlm
```

This stages the mined plan under the store, lets you inspect it, and only then writes the sections into the `.dlm`.

## What gets written

Auto-mined sections are still normal `::preference::` sections, but they carry provenance fields:

- `auto_mined: true`
- `judge_name`
- `judge_score_chosen`
- `judge_score_rejected`
- `mined_at`
- `mined_run_id`

That means the next `dlm train` consumes them through the same preference data path as hand-authored pairs.
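
For orientation, a mined section might look roughly like the sketch below. The provenance field names come from the list above; the `### Prompt` / `### Chosen` / `### Rejected` headers and all values are illustrative assumptions, not a verbatim dump of what `dlm` writes:

```dlm
::preference::
auto_mined: true
judge_name: sway
judge_score_chosen: 0.81
judge_score_rejected: 0.34
mined_at: 2025-01-05T12:00:00Z
mined_run_id: 7
### Prompt
How should release notes read?
### Chosen
Short, factual, and low-drama.
### Rejected
A thrilling, emoji-heavy celebration of every commit.
```

Because these are ordinary sections, you can edit or delete them by hand before the next training run.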

## Using `--no-mined`

For A/B checks, keep the mined sections in the document but exclude them from the preference phase:

```sh
uv run dlm train release-notes.dlm --phase preference --no-mined
```

This is useful when you want to compare:

- hand-authored preferences only
- mined + hand-authored preferences together

without deleting anything from the file.
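
A minimal way to run that comparison, assuming you eyeball the results with `dlm prompt`; the prompt text here is just an example:

```sh
# A: hand-authored preferences only
uv run dlm train release-notes.dlm --phase preference --no-mined
uv run dlm prompt release-notes.dlm "Draft the release notes intro for v1.2."

# B: mined + hand-authored preferences together
uv run dlm train release-notes.dlm --phase preference
uv run dlm prompt release-notes.dlm "Draft the release notes intro for v1.2."
```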

## Observability

Use these two commands to see what happened:

```sh
uv run dlm metrics release-notes.dlm --run-id 7 --json
uv run dlm show release-notes.dlm --json
```

`dlm metrics` surfaces per-run preference-mining events, including mined pair counts and skipped prompts. `dlm show --json` adds the latest preference-mining summary to the store snapshot.

## Picking a judge

The default judge is `sway`, which bootstraps from the current adapter. That is convenient, but not always the best production choice.

- Use `sway` for quick local iteration and loop-shaping.
- Use `hf:<model>` when you already trust a reward model for the task.
- Use `cli:<cmd>` when your org has an external scorer or policy checker.

For the judge contract and thresholds, see [Reward-model integration](reward-model-integration.md).
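
Purely as a shape sketch: this page never shows how a judge is selected, so the `--judge` flag below is an assumption, not a documented interface; confirm the real one on the Reward-model integration page before relying on it.

```sh
# Hypothetical invocations; the --judge flag name is assumed, not confirmed here.
uv run dlm preference mine release-notes.dlm --judge hf:OpenAssistant/reward-model-deberta-v3-large-v2 --apply
uv run dlm preference mine release-notes.dlm --judge "cli:./scripts/policy_check.sh" --apply
```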

## Failure modes to watch

- Near-identical generations: raise `--temp`, or loosen the `--top-p` constraint so the sampler can explore (see the sketch after this list).
- Weak base adapter: mine after another SFT pass, not before.
- Reward hacking: track held-out eval behavior, not just judge scores.
- Low-quality bootstrap self-judging: use an HF reward model on smaller bases instead of trusting `sway` alone.
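
A minimal sketch for the first failure mode, assuming `--temp` and `--top-p` are flags on `dlm preference mine` as the list implies; the values are illustrative, not recommendations:

```sh
# Sample hotter so candidates actually differ; tune the values for your model.
uv run dlm preference mine release-notes.dlm \
  --samples 6 \
  --temp 1.0 \
  --top-p 0.95 \
  --max-pairs 8
```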

## A concrete rhythm

This is a sane lightweight loop for a personal project:

```sh
uv run dlm train notes.dlm
uv run dlm preference mine notes.dlm --samples 4 --max-pairs 6 --apply
uv run dlm train notes.dlm --phase preference
uv run dlm prompt notes.dlm "Write this week's changelog intro."
```

Run that loop when the adapter's behavior is close but still annoying. Do not run it just to accumulate pairs for their own sake.

## See also

- [Preference tuning: DPO vs ORPO](preference-dpo-vs-orpo.md)
- [Reward-model integration](reward-model-integration.md)
- [Metrics & observability](metrics.md)
