
# Reward-model integration

`dlm preference mine` can score candidate answers with something other than the adapter itself.

That is the point of the judge selector:

```sh
uv run dlm preference mine mydoc.dlm --judge sway
uv run dlm preference mine mydoc.dlm --judge hf:YourOrg/reward-model
uv run dlm preference mine mydoc.dlm --judge 'cli:/path/to/judge-bin'
```

This page is the practical guide for the two non-default paths: HuggingFace reward models and external CLI judges.

## Why use a reward model at all

The default `sway` judge is a bootstrap convenience. It is fast to reach for, but it is still the adapter judging its own candidates.

Use an external judge when:

- the adapter is still small or early in training
- you care about policy or style adherence more than raw task accuracy
- you already have a reward model or scoring binary your team trusts

## HuggingFace reward models

Point `--judge` at a sequence-classification model:

```sh
uv run dlm preference mine mydoc.dlm \
  --judge hf:OpenAssistant/reward-model-deberta-v3-large-v2 \
  --threshold 1.0 \
  --samples 4 \
  --max-pairs 10
```

DLM loads the model lazily, scores each candidate pair, and keeps only those whose chosen-vs-rejected margin clears the threshold.
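
For intuition, this is roughly what scoring a single candidate with a sequence-classification reward model looks like if you call it directly with `transformers`. It is a standalone sketch of the scoring step, not DLM's internal code:

```python
# Sketch: score one (prompt, candidate) pair with a sequence-classification
# reward model, outside of DLM. The margin DLM thresholds is the difference
# between two such scores (chosen minus rejected).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def reward_score(prompt: str, candidate: str) -> float:
    # The model reads the question/answer text pair and emits a single logit.
    inputs = tokenizer(prompt, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

print(reward_score("What is DGEMM?", "Double-precision general matrix multiply."))
print(reward_score("What is DGEMM?", "A matrix multiply."))
```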

### Thresholds

The default threshold depends on the judge implementation:

- `sway`: `0.1`
- `hf:<model>`: `1.0`

Raise the threshold when you want fewer, higher-confidence pairs. Lower it when the judge is too conservative and you are getting almost no output.
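
Concretely, the filter is just a margin check against these defaults. A minimal sketch, not DLM's actual code:

```python
# Sketch of the margin filter. A pair survives only when the judge prefers
# the chosen answer by at least `threshold` (0.1 for sway, 1.0 for hf:).
def keep_pair(chosen_score: float, rejected_score: float, threshold: float) -> bool:
    return (chosen_score - rejected_score) >= threshold

keep_pair(2.3, 0.9, threshold=1.0)  # True: margin 1.4 clears the hf: default
keep_pair(1.4, 0.9, threshold=1.0)  # False: margin 0.5 is dropped
```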

## External CLI judges

The `cli:` path is for custom scorers, policy engines, or internal reward-model wrappers.

Example:

```sh
uv run dlm preference mine mydoc.dlm \
  --judge 'cli:/usr/local/bin/rank-answer-pair' \
  --samples 4
```

The judge process is invoked once per candidate. It receives JSON on stdin and must answer with JSON on stdout.

Input shape:

```json
{
  "prompt": "What is DGEMM?",
  "candidate": "A matrix multiply."
}
```

Output shape:

```json
{
  "score": 0.9,
  "reasoning": "Specific, correct, and terse."
}
```

If the command cannot be invoked or emits malformed JSON, the mine run fails fast instead of silently accepting garbage scores.
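
A working judge can be a few lines. The script below is a hypothetical example whose scoring rule is made up purely for illustration; the only hard requirements are the stdin/stdout JSON contract above and a clean exit:

```python
#!/usr/bin/env python3
# Hypothetical CLI judge: reads {"prompt", "candidate"} JSON on stdin and
# writes {"score", "reasoning"} JSON on stdout. The scoring rule is a toy
# heuristic; swap in your own reward model or policy check here.
import json
import sys

payload = json.load(sys.stdin)
prompt = payload["prompt"]        # available if your scorer needs context
candidate = payload["candidate"]

# Toy heuristic: empty answers score zero, long-winded answers lose points.
words = len(candidate.split())
score = 0.0 if words == 0 else max(0.0, 1.0 - words / 200.0)

json.dump({"score": score, "reasoning": f"{words} words"}, sys.stdout)
```

Make the script executable and smoke-test it by piping a sample input JSON into it by hand before wiring it into a mine run.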

## A good reward-model workflow

Start small and observable:

```sh
uv run dlm train mydoc.dlm
uv run dlm preference mine mydoc.dlm \
  --judge hf:YourOrg/reward-model \
  --samples 4 \
  --max-pairs 6
uv run dlm preference list mydoc.dlm
uv run dlm preference apply mydoc.dlm
uv run dlm train mydoc.dlm --phase preference
```

Then inspect:

```sh
uv run dlm metrics mydoc.dlm --run-id 7 --json
uv run dlm prompt mydoc.dlm "..."
```

Judge-score improvement is not enough on its own. Always check held-out behavior from the adapter you just trained.

## Common mistakes

### Using reward mining for missing knowledge

Reward models pick between candidate answers. They do not invent facts the base adapter never learned. If the model is simply wrong, go back to SFT data first.

### Mining too many pairs too early

If the reward model is stronger than the adapter, it can still rank a batch of uniformly weak answers. Cap with `--max-pairs` and inspect the result before turning it into a habit.

### Trusting only the reward score

Repeated reward-driven loops can drift into reward hacking. Watch actual task outputs, not just margins.

## When `sway` is still enough

Stay with the default judge when:

- you are iterating locally on tone or terseness
- the document is small and you want the lowest-friction loop
- you mainly need a filter for "better vs worse" candidates, not a strong external policy model

Move to `hf:` or `cli:` when the loop starts to matter to other people.

## See also

- [Self-improving loop](self-improving-loop.md)
- [Preference tuning: DPO vs ORPO](preference-dpo-vs-orpo.md)