
# Control vectors

A **control vector** is a one-shot steering direction extracted from `::preference::` sections. Unlike a LoRA adapter — which takes hours of training to learn a preference — a control vector is computed gradient-free in seconds, stored as a single small tensor, and applied at inference via a forward-time hook on the residual stream.

Use it when you want to steer *style* rather than *capability*: formality vs. casualness, verbosity vs. concision, cautious vs. direct. Capability work (teaching new facts, fixing bugs in code) still wants a LoRA. Control vectors are orthogonal — you can stack them over an already-trained adapter at inference time.

## The shape

Extraction reads `N` preference pairs. For each pair, the base model is run on the `chosen` and `rejected` completions and hidden states are captured at a residual-stream layer. The difference `chosen_i - rejected_i` is a "pull toward chosen" vector for that example. The first right-singular vector of the stack of differences is the direction these pulls agree on — that's the steering vector.
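
A minimal numpy sketch of that math, assuming `hidden_chosen` and `hidden_rejected` are the `(N, hidden_dim)` activation arrays collected in step 2 below (illustrative only; the shipped entry point is `extract_control_vector`):

```python
import numpy as np

# Stack one "pull toward chosen" per pair: shape (N, hidden_dim).
diffs = hidden_chosen - hidden_rejected

# The first right-singular vector is the direction the pulls agree on.
_, singular_values, vt = np.linalg.svd(diffs, full_matrices=False)
direction = vt[0]

# Deterministic orientation: align with the mean pull so that positive
# strength always means "toward chosen".
if direction @ diffs.mean(axis=0) < 0:
    direction = -direction

# Share of the total signal energy captured by the top component;
# this is the explained_variance reported after extraction.
explained_variance = singular_values[0] ** 2 / (singular_values**2).sum()
```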

Applied at inference with strength `s`, the vector is added to every token's hidden state at that layer during the forward pass:

```
hidden_state[t] += s * control_vector
```

Positive `s` pushes toward the `chosen` distribution; negative pushes away. Typical range: `[-2, 2]`. Beyond `±3` the model collapses into repetition.

## Workflow

### 1. Write a `::preference::` section

Pairs should isolate the *single dimension* you want to steer. For formality, vary formality; keep topic and length constant.

```markdown
---
dlm_id: 01KP...
base_model: smollm2-135m
---

::preference#formal::
### Prompt
Explain what a mutex is.

### Chosen
A mutex (mutual exclusion lock) is a synchronization primitive
that ensures only one thread can access a shared resource at a
time. Threads that attempt to acquire a held mutex block until it
is released.

### Rejected
so basically a mutex is like a lock that makes sure two threads
don't trip over each other when they need the same thing. you grab
it, do your thing, let it go.
```

Add ~10-30 pairs for a usable direction. Fewer than 5 and the signal is too noisy; more than 50 and you're past diminishing returns.

### 2. Extract

With hidden states collected from the base model:

```python
import numpy as np
from dlm.control import extract_control_vector, refuse_if_policy_safety

# Validate that no preference section is tagged `policy: safety`.
refuse_if_policy_safety([section.tags for section in preference_sections])

# hidden_chosen, hidden_rejected: each (N, hidden_dim) arrays of
# residual-stream activations at the chosen layer.
vec = extract_control_vector(hidden_chosen, hidden_rejected)

print(f"n_pairs={vec.n_pairs}, explained_variance={vec.explained_variance:.2f}")
# n_pairs=20, explained_variance=0.73
#
# 0.73 = the principal component captures 73% of the total signal
# energy. Above ~0.5 is a coherent direction. Below ~0.3, the
# pairs are probably too noisy or contradictory — add more, or
# tighten the prompt template.
```

### 3. Persist

The per-store layout at `~/.dlm/store/<dlm_id>/controls/`:

```
controls/
    formal.safetensors     # the direction tensor
    formal.meta.json       # {layer_index, source_section_ids, n_pairs, extractor_version}
```

The meta JSON is how `dlm show` audits what produced a given vector — source sections, layer, pair count, extractor version (so future API changes can invalidate stale vectors deterministically).
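
Writing the pair of files could look roughly like this, assuming numpy-backed tensors (a sketch only: the actual safetensors I/O is deferred, per "What's deferred" below, and `extractor_version` here is a hypothetical variable):

```python
import json
from pathlib import Path

from safetensors.numpy import save_file

dlm_id = "01KP..."  # the store id from the document frontmatter
controls = Path("~/.dlm/store").expanduser() / dlm_id / "controls"
controls.mkdir(parents=True, exist_ok=True)

# Direction tensor and its audit metadata land as paired files.
save_file({"direction": vec.direction}, controls / "formal.safetensors")
meta = {
    "layer_index": 12,
    "source_section_ids": ["formal"],
    "n_pairs": vec.n_pairs,
    "extractor_version": extractor_version,  # hypothetical; see lead-in
}
(controls / "formal.meta.json").write_text(json.dumps(meta, indent=2))
```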

### 4. Apply at inference

```python
from dlm.control import apply_control

with apply_control(model, vec.direction, layer_index=12, strength=1.5):
    out = model.generate(input_ids, max_new_tokens=128)
```

The hook attaches on `__enter__`, removes on `__exit__` — even if the wrapped block raises. Leaving a hook active would silently steer unrelated generations, so the context manager pattern is load-bearing.
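
In PyTorch terms the lifecycle is roughly the following (a sketch of the pattern, not the shipped `apply_control`; it assumes `layer` is the transformer block whose input is the residual stream to perturb):

```python
import contextlib

import torch

@contextlib.contextmanager
def steer(layer: torch.nn.Module, direction: torch.Tensor, strength: float):
    vec = strength * direction

    def pre_hook(module, args):
        hidden_states = args[0]
        # Add the vector to every token position's hidden state.
        shifted = hidden_states + vec.to(hidden_states.device, hidden_states.dtype)
        return (shifted,) + args[1:]

    handle = layer.register_forward_pre_hook(pre_hook)
    try:
        yield
    finally:
        handle.remove()  # runs even when the wrapped block raises
```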

## Layer choice

`layer_index` picks which residual stream gets the perturbation. Rules of thumb (Panickssery et al., 2024):

- **Middle layers** (40–60% depth) are the sweet spot for most style dimensions — formality, tone, caution.
- **Early layers** (0–20% depth) steer vocabulary and syntax but don't propagate cleanly through downstream composition.
- **Late layers** (80–100% depth) can change a few output tokens but leave the underlying reasoning unchanged.

For a 32-layer model, start at `layer_index=16`. Sweep `[8, 16, 24]` on a held-out prompt if the initial result is weak.
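
A hypothetical sweep loop, assuming a `tokenizer` in scope and the `apply_control` usage from step 4:

```python
prompt_ids = tokenizer("Explain what a mutex is.", return_tensors="pt").input_ids

for layer_index in (8, 16, 24):
    with apply_control(model, vec.direction, layer_index=layer_index, strength=1.0):
        out = model.generate(prompt_ids, max_new_tokens=64)
    print(layer_index, tokenizer.decode(out[0], skip_special_tokens=True))
```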

## Safety refusal

Preference sections tagged `policy: safety` are **refused at extraction time**:

```markdown
::preference#safe-refuse::
tags:
  policy: safety
### Prompt
...
### Chosen
<safe refusal>
### Rejected
<unsafe compliance>
```

Extracting a vector from those pairs would produce a "more safety vs less safety" direction — applied at negative strength, it erodes the safety training the document is trying to preserve. `refuse_if_policy_safety` surfaces the refusal before any artifact reaches disk:

```
ControlPolicyRefusal: refusing to extract a control vector from
preference sections tagged `policy: safety` — the resulting
steering direction could be used at negative strength to undo
the safety training the document is trying to preserve.
```

The refusal is at extract time, not apply time, so the vector never exists. Re-tagging to bypass the check is not supported; the footgun is the shape of the math, not the tag.
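
The gate itself is deliberately simple in shape. An illustrative sketch, not the shipped implementation:

```python
class ControlPolicyRefusal(Exception):
    """Raised before any extraction work happens, so the vector never exists."""

def refuse_if_policy_safety(section_tags: list[dict]) -> None:
    for tags in section_tags:
        if tags.get("policy") == "safety":
            raise ControlPolicyRefusal(
                "refusing to extract a control vector from preference "
                "sections tagged `policy: safety`"
            )
```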

## Validation

End-to-end sanity check for a newly-extracted vector:

1. Load the base model.
2. Generate without the vector on 20 held-out prompts.
3. Generate with `strength=1.0` on the same prompts.
4. Judge (LLM-as-judge or manual) whether the axis moved.

For a formality vector, you expect the judge's formality rating to correlate positively with `strength`. A failed extraction looks like: outputs identical at ±1, nonsense at ±2. That's the signal to add more pairs, pick a different layer, or accept that the dimension you're trying to steer doesn't live in a linear subspace.
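
A hypothetical A/B harness for that checklist; `held_out_prompts` and the judging step are placeholders you would supply:

```python
baseline, steered = [], []
for prompt in held_out_prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    baseline.append(model.generate(ids, max_new_tokens=128))
    with apply_control(model, vec.direction, layer_index=16, strength=1.0):
        steered.append(model.generate(ids, max_new_tokens=128))

# The axis moved if the judge rates `steered` above `baseline` on
# formality, and the gap should grow with strength.
```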

## What ships today

- `extract_control_vector` — raw-SVD over chosen/rejected differences, deterministic orientation (aligned with mean pull).
- `apply_control` — context-managed `forward_pre_hook` with shape + layer validation.
- `refuse_if_policy_safety` — pre-extraction safety gate.
- `ControlVector` dataclass with `n_pairs` + `explained_variance` for audit output.
- Per-store layout: `controls/<name>.safetensors` + `<name>.meta.json`.

## What's deferred

- **CLI surface** (`dlm control extract | apply | list`) — needs a real HF base model to drive the forward-pass residual collection. Land as a follow-up when the `dlm.inference.loader` integration is wired.
- **Multi-control composition** — additive for compatible layers, warn on conflicts. Single-control is the v1 shape.
- **Serialization format** — today the spec says safetensors. Landing safetensors I/O alongside the CLI keeps the two commits paired.
- **Integration with `dlm prompt`** — `--control name:strength[,...]` flag for the existing prompt path.

These are all layer-cake work on top of the extraction + apply primitives shipped here; the math path is the hard-to-get-right piece and it's done.

## Risks

- **Small bases are unstable.** Base models below ~500M parameters tend to collapse into repetition past `|strength| > 1`. `dlm doctor` will warn on bases below that threshold when the CLI lands.
- **Layer choice matters more than strength.** A wrong layer at strength 1 is worse than any strength at the right layer.
- **Control vectors are not a safety mechanism.** They're a steering *tool*. The `policy: safety` refusal is a footgun guard, not a security boundary — anyone who can train LoRAs on the same base can produce the same direction by other means. The safety concern is specifically about documents undoing their own safety training, not about external attackers.