# Self-improving loop
`dlm preference mine` closes the gap between "I have an adapter" and
"I have new preference pairs to train on."
The loop is simple:
1. Train an initial adapter from prose and `::instruction::` sections.
2. Mine auto-ranked `::preference::` sections from that adapter.
3. Apply the mined sections back into the document.
4. Train the preference phase again.
This is the shortest honest path to "train once, judge outputs, train
again" without leaving the `.dlm`.
## When this works well
- You already have useful `::instruction::` prompts in the document.
- The adapter is good enough to generate multiple distinct answers.
- You want to sharpen style, brevity, refusal behavior, or task
  preference, not inject brand-new knowledge.
If the model is still too weak to produce meaningful alternatives, do another SFT pass first. Preference mining is an alignment loop, not a replacement for basic competence.
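
In practice, "another SFT pass" here just means rerunning the base
training before you mine. A minimal sketch, assuming a plain `dlm train`
rerun covers the SFT phase:

```sh
# Strengthen the adapter first, then mine from the better model.
uv run dlm train release-notes.dlm
uv run dlm preference mine release-notes.dlm --samples 4 --max-pairs 8
```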
## Minimal loop
Start with a normal document that has at least one instruction section:
```dlm
::instruction::
### Q
How should release notes read?
### A
Short, factual, and low-drama.
```
Train once:
```sh
uv run dlm train release-notes.dlm
```
Mine a small batch of candidate pairs and write them straight into the document:
```sh
uv run dlm preference mine release-notes.dlm \
  --samples 4 \
  --max-pairs 8 \
  --apply
```
Then train just the preference phase:
```sh
uv run dlm train release-notes.dlm --phase preference
```
That writes the next adapter version using the newly mined
`::preference::` sections.
## Safer first pass
If you want a review step before touching the document, omit `--apply`:
```sh
uv run dlm preference mine release-notes.dlm --samples 4 --max-pairs 8
uv run dlm preference list release-notes.dlm
uv run dlm preference apply release-notes.dlm
```
This stages the mined plan under the store, lets you inspect it, and
only then writes the sections into the `.dlm`.
## What gets written
Auto-mined sections are still normal `::preference::` sections, but they
carry provenance fields:

- `auto_mined: true`
- `judge_name`
- `judge_score_chosen`
- `judge_score_rejected`
- `mined_at`
- `mined_run_id`
That means the next `dlm train` consumes them through the same
preference data path as hand-authored pairs.
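
To make the shape concrete, here is a sketch of what a mined section
might look like. The `### Chosen` / `### Rejected` headings, the field
placement, and every value are assumptions for illustration; only the
provenance field names come from the list above.

```dlm
::preference::
auto_mined: true
judge_name: sway
judge_score_chosen: 0.82
judge_score_rejected: 0.31
mined_at: 2024-05-01T12:00:00Z
mined_run_id: 7
### Q
How should release notes read?
### Chosen
Short, factual, and low-drama.
### Rejected
An exciting, emoji-laden celebration of every single commit.
```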
## Using `--no-mined`
For A/B checks, keep the mined sections in the document but exclude them from the preference phase:
```sh
uv run dlm train release-notes.dlm --phase preference --no-mined
```
This is useful when you want to compare:
- hand-authored preferences only
- mined + hand-authored preferences together
without deleting anything from the file.
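
One way to run that comparison, assuming you prompt after each train so
you are judging outputs rather than metrics (the prompt text is
illustrative):

```sh
# A: hand-authored preferences only
uv run dlm train release-notes.dlm --phase preference --no-mined
uv run dlm prompt release-notes.dlm "Write release notes for v1.2."

# B: mined + hand-authored preferences together
uv run dlm train release-notes.dlm --phase preference
uv run dlm prompt release-notes.dlm "Write release notes for v1.2."
```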
## Observability
Use these two commands to see what happened:
```sh
uv run dlm metrics release-notes.dlm --run-id 7 --json
uv run dlm show release-notes.dlm --json
```
`dlm metrics` surfaces per-run preference-mining events, including mined
pair counts and skipped prompts. `dlm show --json` adds the latest
preference-mining summary to the store snapshot.
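
If you want to pull just the mining events out of that JSON, something
like the following works; note that the `.events[]` path and the
`"preference_mining"` type are hypothetical field names, so check your
actual output first:

```sh
uv run dlm metrics release-notes.dlm --run-id 7 --json \
  | jq '.events[] | select(.type == "preference_mining")'
```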
## Picking a judge
The default judge is `sway`, which bootstraps from the current adapter.
That is convenient, but not always the best production choice.
- Use `sway` for quick local iteration and loop-shaping.
- Use `hf:<model>` when you already trust a reward model for the task.
- Use `cli:<cmd>` when your org has an external scorer or policy
  checker.
For the judge contract and thresholds, see
[Reward-model integration](reward-model-integration.md).
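
As a shape-of-the-thing sketch only: a `cli:` judge could be as small as
a shell script. This assumes the judge receives candidate text on stdin
and prints a single numeric score, which is an assumption here; verify
the real contract in Reward-model integration before relying on it.

```sh
#!/bin/sh
# Hypothetical cli: judge that prefers brevity: score 1.0 for
# candidates under ~120 words, 0.0 otherwise.
words=$(wc -w | tr -d ' ')
if [ "$words" -le 120 ]; then
  echo 1.0
else
  echo 0.0
fi
```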
## Failure modes to watch
- Near-identical generations: raise `--temp` or loosen `--top-p` so the
  sampler can explore (see the sketch after this list).
- Weak base adapter: mine after another SFT pass, not before.
- Reward hacking: track held-out eval behavior, not just judge scores.
- Low-quality bootstrap self-judging: use an HF reward model on smaller
  bases instead of trusting `sway` alone.
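
For the first failure mode, a retry might look like this. The values
are illustrative, and passing `--temp` / `--top-p` to `preference mine`
is an assumption about where those flags live:

```sh
# Sample hotter and more freely so the candidates actually differ.
uv run dlm preference mine release-notes.dlm \
  --samples 6 \
  --temp 1.0 \
  --top-p 0.95 \
  --max-pairs 8
```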
## A concrete rhythm
This is a sane lightweight loop for a personal project:
```sh
uv run dlm train notes.dlm
uv run dlm preference mine notes.dlm --samples 4 --max-pairs 6 --apply
uv run dlm train notes.dlm --phase preference
uv run dlm prompt notes.dlm "Write this week's changelog intro."
```
Run that loop when the adapter's behavior is close but still annoying. Do not run it just to accumulate pairs for their own sake.
## See also

- [Preference tuning: DPO vs ORPO](preference-dpo-vs-orpo.md)
- [Reward-model integration](reward-model-integration.md)
- [Metrics & observability](metrics.md)