# Reward-model integration

`dlm preference mine` can score candidate answers with something other
than the adapter itself.

That is the point of the judge selector:

```sh
uv run dlm preference mine mydoc.dlm --judge sway
uv run dlm preference mine mydoc.dlm --judge hf:YourOrg/reward-model
uv run dlm preference mine mydoc.dlm --judge 'cli:/path/to/judge-bin'
```

This page is the practical guide for the two non-default paths:
HuggingFace reward models and external CLI judges.

## Why use a reward model at all

The default `sway` judge is a bootstrap convenience. It is fast to reach
for, but it is still the adapter judging its own candidates.

Use an external judge when:
- the adapter is still small or early in training
- you care about policy or style adherence more than raw task accuracy
- you already have a reward model or scoring binary your team trusts

## HuggingFace reward models

Point `--judge` at a sequence-classification model:

```sh
uv run dlm preference mine mydoc.dlm \
  --judge hf:OpenAssistant/reward-model-deberta-v3-large-v2 \
  --threshold 1.0 \
  --samples 4 \
  --max-pairs 10
```

DLM loads the model lazily, scores each candidate pair, and keeps only those whose chosen-vs-rejected margin clears the threshold.
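
For intuition, this is roughly the scoring step an `hf:` judge performs. It is a sketch of the idea using the `transformers` library, not DLM's actual implementation; the sample strings are invented.

```python
# Sketch of hf: judge scoring: rate a chosen and a rejected candidate
# with a sequence-classification reward model, keep the pair only if
# the chosen-vs-rejected margin clears the threshold.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def reward(prompt: str, candidate: str) -> float:
    # The reward model emits a single logit; higher means "better answer".
    inputs = tokenizer(prompt, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

prompt = "What is DGEMM?"
chosen = "The double-precision general matrix-matrix multiply in BLAS."
rejected = "A kind of file format."

margin = reward(prompt, chosen) - reward(prompt, rejected)
keep = margin >= 1.0  # the hf: default threshold
print(f"margin={margin:.2f} keep={keep}")
```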

### Thresholds

The default threshold depends on the judge implementation:

- `sway`: `0.1`
- `hf:<model>`: `1.0`

Raise the threshold when you want fewer, higher-confidence pairs. Lower it when the judge is too conservative and you are getting almost no output.

## External CLI judges

The `cli:` path is for custom scorers, policy engines, or internal
reward-model wrappers.

Example:

```sh
uv run dlm preference mine mydoc.dlm \
  --judge 'cli:/usr/local/bin/rank-answer-pair' \
  --samples 4
```

The judge process is invoked once per candidate. It receives JSON on stdin and must answer with JSON on stdout.

Input shape:

```json
{
  "prompt": "What is DGEMM?",
  "candidate": "A matrix multiply."
}
```

Output shape:

```json
{
  "score": 0.9,
  "reasoning": "Specific, correct, and terse."
}
```

If the command cannot be invoked or emits malformed JSON, the mine run fails fast instead of silently accepting garbage scores.
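
To make the contract concrete, here is a minimal judge that a `cli:` path could point at. The length heuristic is a placeholder of our own, purely for illustration; a real judge would wrap a reward model or policy check.

```python
#!/usr/bin/env python3
# Minimal cli: judge sketch: read {"prompt", "candidate"} JSON from
# stdin, write {"score", "reasoning"} JSON to stdout. The scoring rule
# below is a toy placeholder, not a recommended heuristic.
import json
import sys

request = json.load(sys.stdin)
candidate = request["candidate"]

# Toy rule: favor answers that exist and stay terse.
words = len(candidate.split())
score = 1.0 if 0 < words <= 40 else 0.0

json.dump({"score": score, "reasoning": f"{words} words"}, sys.stdout)
```

Make the file executable and pass its absolute path after `cli:`, as in the example above.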

## A good reward-model workflow

Start small and observable:

```sh
uv run dlm train mydoc.dlm
uv run dlm preference mine mydoc.dlm \
  --judge hf:YourOrg/reward-model \
  --samples 4 \
  --max-pairs 6
uv run dlm preference list mydoc.dlm
uv run dlm preference apply mydoc.dlm
uv run dlm train mydoc.dlm --phase preference
```

Then inspect:

```sh
uv run dlm metrics mydoc.dlm --run-id 7 --json
uv run dlm prompt mydoc.dlm "..."
```

Judge-score improvement is not enough on its own. Always check held-out behavior from the adapter you just trained.
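
A cheap way to do that is a fixed held-out prompt set you rerun after every preference phase. A minimal sketch, assuming a couple of placeholder questions; it just shells out to the `dlm prompt` command shown above.

```python
# Rerun a fixed held-out prompt set through the freshly trained adapter
# and eyeball the answers. The questions here are placeholders.
import subprocess

held_out = [
    "What is DGEMM?",
    "When does preference tuning help over plain SFT?",
]

for prompt in held_out:
    result = subprocess.run(
        ["uv", "run", "dlm", "prompt", "mydoc.dlm", prompt],
        capture_output=True, text=True, check=True,
    )
    print(f"Q: {prompt}\nA: {result.stdout.strip()}\n")
```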

## Common mistakes

### Using reward mining for missing knowledge

Reward models pick between candidate answers. They do not invent facts the base adapter never learned. If the model is simply wrong, go back to SFT data first.

### Mining too many pairs too early

If the reward model is stronger than the adapter, it can still rank a
batch of uniformly weak answers. Cap with `--max-pairs` and inspect the
result before turning it into a habit.

### Trusting only the reward score

Repeated reward-driven loops can drift into reward hacking. Watch actual task outputs, not just margins.

## When `sway` is still enough

Stay with the default judge when:
- you are iterating locally on tone or terseness
- the document is small and you want the lowest-friction loop
- you mainly need a filter for "better vs worse" candidates, not a strong external policy model

Move to `hf:` or `cli:` when the loop starts to matter to other people.

## See also

- [Self-improving loop](self-improving-loop.md)
- [Preference tuning: DPO vs ORPO](preference-dpo-vs-orpo.md)