
# Reward-model integration

`dlm preference mine` can score candidate answers with something other than the adapter itself.

That is the point of the judge selector:

```sh
uv run dlm preference mine mydoc.dlm --judge sway
uv run dlm preference mine mydoc.dlm --judge hf:YourOrg/reward-model
uv run dlm preference mine mydoc.dlm --judge 'cli:/path/to/judge-bin'
```

This page is the practical guide for the two non-default paths: HuggingFace reward models and external CLI judges.

## Why use a reward model at all

The default `sway` judge is a bootstrap convenience. It is fast to reach for, but it is still the adapter judging its own candidates.

Use an external judge when:

- the adapter is still small or early in training
- you care about policy or style adherence more than raw task accuracy
- you already have a reward model or scoring binary your team trusts

## HuggingFace reward models

Point `--judge` at a sequence-classification model:

```sh
uv run dlm preference mine mydoc.dlm \
  --judge hf:OpenAssistant/reward-model-deberta-v3-large-v2 \
  --threshold 1.0 \
  --samples 4 \
  --max-pairs 10
```

DLM loads the model lazily, scores each candidate pair, and keeps only those whose chosen-vs-rejected margin clears the threshold.
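
For intuition, this is roughly what scoring a single candidate with a sequence-classification reward model looks like if you call it directly with `transformers`. It is a standalone sketch of the scoring step, not DLM's internal code:

```python
# Sketch: score one (prompt, candidate) pair with a sequence-classification
# reward model, outside of DLM. The margin DLM thresholds is the difference
# between two such scores (chosen minus rejected).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def reward_score(prompt: str, candidate: str) -> float:
    # The model reads the question/answer text pair and emits a single logit.
    inputs = tokenizer(prompt, candidate, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

print(reward_score("What is DGEMM?", "Double-precision general matrix multiply."))
print(reward_score("What is DGEMM?", "A matrix multiply."))
```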

### Thresholds

The default threshold depends on the judge implementation:

- `sway`: `0.1`
- `hf:<model>`: `1.0`

Raise the threshold when you want fewer, higher-confidence pairs. Lower it when the judge is too conservative and you are getting almost no output.
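
Concretely, the filter is just a margin check against these defaults. A minimal sketch, not DLM's actual code:

```python
# Sketch of the margin filter. A pair survives only when the judge prefers
# the chosen answer by at least `threshold` (0.1 for sway, 1.0 for hf:).
def keep_pair(chosen_score: float, rejected_score: float, threshold: float) -> bool:
    return (chosen_score - rejected_score) >= threshold

keep_pair(2.3, 0.9, threshold=1.0)  # True: margin 1.4 clears the hf: default
keep_pair(1.4, 0.9, threshold=1.0)  # False: margin 0.5 is dropped
```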

## External CLI judges

The `cli:` path is for custom scorers, policy engines, or internal reward-model wrappers.

Example:

```sh
uv run dlm preference mine mydoc.dlm \
  --judge 'cli:/usr/local/bin/rank-answer-pair' \
  --samples 4
```

The judge process is invoked once per candidate. It receives JSON on stdin and must answer with JSON on stdout.

Input shape:

```json
{
  "prompt": "What is DGEMM?",
  "candidate": "A matrix multiply."
}
```

Output shape:

```json
{
  "score": 0.9,
  "reasoning": "Specific, correct, and terse."
}
```

If the command cannot be invoked or emits malformed JSON, the mine run fails fast instead of silently accepting garbage scores.
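
A working judge can be a few lines. The script below is a hypothetical example whose scoring rule is made up purely for illustration; the only hard requirements are the stdin/stdout JSON contract above and a clean exit:

```python
#!/usr/bin/env python3
# Hypothetical CLI judge: reads {"prompt", "candidate"} JSON on stdin and
# writes {"score", "reasoning"} JSON on stdout. The scoring rule is a toy
# heuristic; swap in your own reward model or policy check here.
import json
import sys

payload = json.load(sys.stdin)
prompt = payload["prompt"]        # available if your scorer needs context
candidate = payload["candidate"]

# Toy heuristic: empty answers score zero, long-winded answers lose points.
words = len(candidate.split())
score = 0.0 if words == 0 else max(0.0, 1.0 - words / 200.0)

json.dump({"score": score, "reasoning": f"{words} words"}, sys.stdout)
```

Make the script executable and smoke-test it by piping a sample input JSON into it by hand before wiring it into a mine run.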

## A good reward-model workflow

Start small and observable:

```sh
uv run dlm train mydoc.dlm
uv run dlm preference mine mydoc.dlm \
  --judge hf:YourOrg/reward-model \
  --samples 4 \
  --max-pairs 6
uv run dlm preference list mydoc.dlm
uv run dlm preference apply mydoc.dlm
uv run dlm train mydoc.dlm --phase preference
```

Then inspect:

```sh
uv run dlm metrics mydoc.dlm --run-id 7 --json
uv run dlm prompt mydoc.dlm "..."
```

Judge-score improvement is not enough on its own. Always check held-out behavior from the adapter you just trained.

## Common mistakes

### Using reward mining for missing knowledge

Reward models pick between candidate answers. They do not invent facts the base adapter never learned. If the model is simply wrong, go back to SFT data first.

### Mining too many pairs too early

If the reward model is stronger than the adapter, it can still rank a batch of uniformly weak answers. Cap with `--max-pairs` and inspect the result before turning it into a habit.

### Trusting only the reward score

Repeated reward-driven loops can drift into reward hacking. Watch actual task outputs, not just margins.

## When `sway` is still enough

Stay with the default judge when:

- you are iterating locally on tone or terseness
- the document is small and you want the lowest-friction loop
- you mainly need a filter for "better vs worse" candidates, not a strong external policy model

Move to `hf:` or `cli:` when the loop starts to matter to other people.

## See also

- [Self-improving loop](self-improving-loop.md)
- [Preference tuning: DPO vs ORPO](preference-dpo-vs-orpo.md)