# Self-improving loop

`dlm preference mine` closes the gap between "I have an adapter" and "I have new preference pairs to train on."

The loop is simple:

1. Train an initial adapter from prose and `::instruction::` sections.
2. Mine auto-ranked `::preference::` sections from that adapter.
3. Apply the mined sections back into the document.
4. Train the preference phase again.

This is the shortest honest path to "train once, judge outputs, train again" without leaving the `.dlm`.

## When this works well

- You already have useful `::instruction::` prompts in the document.
- The adapter is good enough to generate multiple distinct answers.
- You want to sharpen style, brevity, refusal behavior, or task preference, not inject brand-new knowledge.

If the model is still too weak to produce meaningful alternatives, do another SFT pass first. Preference mining is an alignment loop, not a replacement for basic competence.

## Minimal loop

Start with a normal document that has at least one instruction section:

```dlm
::instruction::
### Q
How should release notes read?
### A
Short, factual, and low-drama.
```

Train once:

```sh
uv run dlm train release-notes.dlm
```

Mine a small batch of candidate pairs and write them straight into the document:

```sh
uv run dlm preference mine release-notes.dlm \
  --samples 4 \
  --max-pairs 8 \
  --apply
```

Then train just the preference phase:

```sh
uv run dlm train release-notes.dlm --phase preference
```

That writes the next adapter version using the newly mined `::preference::` sections.

## Safer first pass

If you want a review step before touching the document, omit `--apply`:

```sh
uv run dlm preference mine release-notes.dlm --samples 4 --max-pairs 8
uv run dlm preference list release-notes.dlm
uv run dlm preference apply release-notes.dlm
```

This stages the mined plan under the store, lets you inspect it, and only then writes the sections into the `.dlm`.

## What gets written

Auto-mined sections are still normal `::preference::` sections, but they carry provenance fields:

- `auto_mined: true`
- `judge_name`
- `judge_score_chosen`
- `judge_score_rejected`
- `mined_at`
- `mined_run_id`

That means the next `dlm train` consumes them through the same preference data path as hand-authored pairs.
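
For orientation, a mined section might look roughly like the sketch below. The provenance field names come from the list above; the `### Prompt` / `### Chosen` / `### Rejected` headers and all values are illustrative assumptions, not a verbatim dump of what `dlm` writes:

```dlm
::preference::
auto_mined: true
judge_name: sway
judge_score_chosen: 0.81
judge_score_rejected: 0.34
mined_at: 2025-01-05T12:00:00Z
mined_run_id: 7
### Prompt
How should release notes read?
### Chosen
Short, factual, and low-drama.
### Rejected
A thrilling, emoji-heavy celebration of every commit.
```

Because these are ordinary sections, you can edit or delete them by hand before the next training run.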

## Using `--no-mined`

For A/B checks, keep the mined sections in the document but exclude them from the preference phase:

```sh
uv run dlm train release-notes.dlm --phase preference --no-mined
```

This is useful when you want to compare:

- hand-authored preferences only
- mined + hand-authored preferences together

without deleting anything from the file.
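
A minimal way to run that comparison, assuming you eyeball the results with `dlm prompt`; the prompt text here is just an example:

```sh
# A: hand-authored preferences only
uv run dlm train release-notes.dlm --phase preference --no-mined
uv run dlm prompt release-notes.dlm "Draft the release notes intro for v1.2."

# B: mined + hand-authored preferences together
uv run dlm train release-notes.dlm --phase preference
uv run dlm prompt release-notes.dlm "Draft the release notes intro for v1.2."
```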

## Observability

Use these two commands to see what happened:

```sh
uv run dlm metrics release-notes.dlm --run-id 7 --json
uv run dlm show release-notes.dlm --json
```

`dlm metrics` surfaces per-run preference-mining events, including mined pair counts and skipped prompts. `dlm show --json` adds the latest preference-mining summary to the store snapshot.

## Picking a judge

The default judge is `sway`, which bootstraps from the current adapter. That is convenient, but not always the best production choice.

- Use `sway` for quick local iteration and loop-shaping.
- Use `hf:<model>` when you already trust a reward model for the task.
- Use `cli:<cmd>` when your org has an external scorer or policy checker.

For the judge contract and thresholds, see [Reward-model integration](reward-model-integration.md).
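
Purely as a shape sketch: this page never shows how a judge is selected, so the `--judge` flag below is an assumption, not a documented interface; confirm the real one on the Reward-model integration page before relying on it.

```sh
# Hypothetical invocations; the --judge flag name is assumed, not confirmed here.
uv run dlm preference mine release-notes.dlm --judge hf:OpenAssistant/reward-model-deberta-v3-large-v2 --apply
uv run dlm preference mine release-notes.dlm --judge "cli:./scripts/policy_check.sh" --apply
```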

## Failure modes to watch

- Near-identical generations: raise `--temp`, or loosen the `--top-p` constraint so the sampler can explore (see the sketch after this list).
- Weak base adapter: mine after another SFT pass, not before.
- Reward hacking: track held-out eval behavior, not just judge scores.
- Low-quality bootstrap self-judging: use an HF reward model on smaller bases instead of trusting `sway` alone.
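
A minimal sketch for the first failure mode, assuming `--temp` and `--top-p` are flags on `dlm preference mine` as the list implies; the values are illustrative, not recommendations:

```sh
# Sample hotter so candidates actually differ; tune the values for your model.
uv run dlm preference mine release-notes.dlm \
  --samples 6 \
  --temp 1.0 \
  --top-p 0.95 \
  --max-pairs 8
```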

## A concrete rhythm

This is a sane lightweight loop for a personal project:

```sh
uv run dlm train notes.dlm
uv run dlm preference mine notes.dlm --samples 4 --max-pairs 6 --apply
uv run dlm train notes.dlm --phase preference
uv run dlm prompt notes.dlm "Write this week's changelog intro."
```

Run that loop when the adapter's behavior is close but still annoying. Do not run it just to accumulate pairs for their own sake.

## See also

- [Preference tuning: DPO vs ORPO](preference-dpo-vs-orpo.md)
- [Reward-model integration](reward-model-integration.md)
- [Metrics & observability](metrics.md)
