
# Control vectors

A **control vector** is a one-shot steering direction extracted from `::preference::` sections. Unlike a LoRA adapter — which takes hours of training to learn a preference — a control vector is computed gradient-free in seconds, stored as a single small tensor, and applied at inference via a forward-time hook on the residual stream.

Use it when you want to steer *style* rather than *capability*: formality vs. casualness, verbosity vs. concision, cautious vs. direct. Capability work (teaching new facts, fixing bugs in code) still wants a LoRA. Control vectors are orthogonal — you can stack them over an already-trained adapter at inference time.

## The shape

Extraction reads `N` preference pairs. For each pair, the base model is run on the `chosen` and `rejected` completions and hidden states are captured at a residual-stream layer. The difference `chosen_i - rejected_i` is a "pull toward chosen" vector for that example. The first right-singular vector of the stack of differences is the direction these pulls agree on — that's the steering vector.
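
A minimal numpy sketch of that math, assuming `hidden_chosen` and `hidden_rejected` are the `(N, hidden_dim)` activation arrays collected in step 2 below (illustrative only; the shipped entry point is `extract_control_vector`):

```python
import numpy as np

# Stack one "pull toward chosen" per pair: shape (N, hidden_dim).
diffs = hidden_chosen - hidden_rejected

# The first right-singular vector is the direction the pulls agree on.
_, singular_values, vt = np.linalg.svd(diffs, full_matrices=False)
direction = vt[0]

# Deterministic orientation: align with the mean pull so that positive
# strength always means "toward chosen".
if direction @ diffs.mean(axis=0) < 0:
    direction = -direction

# Share of the total signal energy captured by the top component;
# this is the explained_variance reported after extraction.
explained_variance = singular_values[0] ** 2 / (singular_values**2).sum()
```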

Applied at inference with strength `s`, the vector is added to every token's hidden state at that layer during the forward pass:

```
hidden_state[t] += s * control_vector
```

Positive `s` pushes toward the `chosen` distribution; negative pushes away. Typical range: `[-2, 2]`. Beyond `±3` the model collapses into repetition.

## Workflow

### 1. Write a `::preference::` section

Pairs should isolate the *single dimension* you want to steer. For formality, vary formality; keep topic and length constant.

```markdown
---
dlm_id: 01KP...
base_model: smollm2-135m
---

::preference#formal::
### Prompt
Explain what a mutex is.

### Chosen
A mutex (mutual exclusion lock) is a synchronization primitive
that ensures only one thread can access a shared resource at a
time. Threads that attempt to acquire a held mutex block until it
is released.

### Rejected
so basically a mutex is like a lock that makes sure two threads
don't trip over each other when they need the same thing. you grab
it, do your thing, let it go.
```

Add ~10-30 pairs for a usable direction. Fewer than 5 and the signal is too noisy; more than 50 and you're past diminishing returns.

### 2. Extract

With hidden states collected from the base model:

```python
import numpy as np
from dlm.control import extract_control_vector, refuse_if_policy_safety

# Validate that no preference section is tagged `policy: safety`.
refuse_if_policy_safety([section.tags for section in preference_sections])

# hidden_chosen, hidden_rejected: each (N, hidden_dim) arrays of
# residual-stream activations at the chosen layer.
vec = extract_control_vector(hidden_chosen, hidden_rejected)

print(f"n_pairs={vec.n_pairs}, explained_variance={vec.explained_variance:.2f}")
# n_pairs=20, explained_variance=0.73
#
# 0.73 = the principal component captures 73% of the total signal
# energy. Above ~0.5 is a coherent direction. Below ~0.3, the
# pairs are probably too noisy or contradictory — add more, or
# tighten the prompt template.
```

### 3. Persist

The per-store layout at `~/.dlm/store/<dlm_id>/controls/`:

```
controls/
    formal.safetensors     # the direction tensor
    formal.meta.json       # {layer_index, source_section_ids, n_pairs, extractor_version}
```

The meta JSON is how `dlm show` audits what produced a given vector — source sections, layer, pair count, extractor version (so future API changes can invalidate stale vectors deterministically).
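
Writing the pair of files could look roughly like this, assuming numpy-backed tensors (a sketch only: the actual safetensors I/O is deferred, per "What's deferred" below, and `extractor_version` here is a hypothetical variable):

```python
import json
from pathlib import Path

from safetensors.numpy import save_file

dlm_id = "01KP..."  # the store id from the document frontmatter
controls = Path("~/.dlm/store").expanduser() / dlm_id / "controls"
controls.mkdir(parents=True, exist_ok=True)

# Direction tensor and its audit metadata land as paired files.
save_file({"direction": vec.direction}, controls / "formal.safetensors")
meta = {
    "layer_index": 12,
    "source_section_ids": ["formal"],
    "n_pairs": vec.n_pairs,
    "extractor_version": extractor_version,  # hypothetical; see lead-in
}
(controls / "formal.meta.json").write_text(json.dumps(meta, indent=2))
```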

### 4. Apply at inference

```python
from dlm.control import apply_control

with apply_control(model, vec.direction, layer_index=12, strength=1.5):
    out = model.generate(input_ids, max_new_tokens=128)
```

The hook attaches on `__enter__`, removes on `__exit__` — even if the wrapped block raises. Leaving a hook active would silently steer unrelated generations, so the context manager pattern is load-bearing.
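
In PyTorch terms the lifecycle is roughly the following (a sketch of the pattern, not the shipped `apply_control`; it assumes `layer` is the transformer block whose input is the residual stream to perturb):

```python
import contextlib

import torch

@contextlib.contextmanager
def steer(layer: torch.nn.Module, direction: torch.Tensor, strength: float):
    vec = strength * direction

    def pre_hook(module, args):
        hidden_states = args[0]
        # Add the vector to every token position's hidden state.
        shifted = hidden_states + vec.to(hidden_states.device, hidden_states.dtype)
        return (shifted,) + args[1:]

    handle = layer.register_forward_pre_hook(pre_hook)
    try:
        yield
    finally:
        handle.remove()  # runs even when the wrapped block raises
```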

## Layer choice

`layer_index` picks which residual stream gets the perturbation. Rules of thumb (Panickssery et al., 2024):

- **Middle layers** (40–60% depth) are the sweet spot for most style dimensions — formality, tone, caution.
- **Early layers** (0–20% depth) steer vocabulary and syntax but don't propagate cleanly through downstream composition.
- **Late layers** (80–100% depth) can change a few output tokens but leave the underlying reasoning unchanged.

For a 32-layer model, start at `layer_index=16`. Sweep `[8, 16, 24]` on a held-out prompt if the initial result is weak.
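
A hypothetical sweep loop, assuming a `tokenizer` in scope and the `apply_control` usage from step 4:

```python
prompt_ids = tokenizer("Explain what a mutex is.", return_tensors="pt").input_ids

for layer_index in (8, 16, 24):
    with apply_control(model, vec.direction, layer_index=layer_index, strength=1.0):
        out = model.generate(prompt_ids, max_new_tokens=64)
    print(layer_index, tokenizer.decode(out[0], skip_special_tokens=True))
```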

## Safety refusal

Preference sections tagged `policy: safety` are **refused at extraction time**:

```markdown
::preference#safe-refuse::
tags:
  policy: safety
### Prompt
...
### Chosen
<safe refusal>
### Rejected
<unsafe compliance>
```

Extracting a vector from those pairs would produce a "more safety vs less safety" direction — applied at negative strength, it erodes the safety training the document is trying to preserve. `refuse_if_policy_safety` surfaces the refusal before any artifact reaches disk:

```
ControlPolicyRefusal: refusing to extract a control vector from
preference sections tagged `policy: safety` — the resulting
steering direction could be used at negative strength to undo
the safety training the document is trying to preserve.
```

The refusal is at extract time, not apply time, so the vector never exists. Re-tagging to bypass the check is not supported; the footgun is the shape of the math, not the tag.
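
The gate itself is deliberately simple in shape. An illustrative sketch, not the shipped implementation:

```python
class ControlPolicyRefusal(Exception):
    """Raised before any extraction work happens, so the vector never exists."""

def refuse_if_policy_safety(section_tags: list[dict]) -> None:
    for tags in section_tags:
        if tags.get("policy") == "safety":
            raise ControlPolicyRefusal(
                "refusing to extract a control vector from preference "
                "sections tagged `policy: safety`"
            )
```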

## Validation

End-to-end sanity check for a newly-extracted vector:

1. Load the base model.
2. Generate without the vector on 20 held-out prompts.
3. Generate with `strength=1.0` on the same prompts.
4. Judge (LLM-as-judge or manual) whether the axis moved.

For a formality vector, you expect the judge's formality rating to correlate positively with `strength`. A failed extraction looks like: outputs identical at ±1, nonsense at ±2. That's the signal to add more pairs, pick a different layer, or accept that the dimension you're trying to steer doesn't live in a linear subspace.
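
A hypothetical A/B harness for that checklist; `held_out_prompts` and the judging step are placeholders you would supply:

```python
baseline, steered = [], []
for prompt in held_out_prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    baseline.append(model.generate(ids, max_new_tokens=128))
    with apply_control(model, vec.direction, layer_index=16, strength=1.0):
        steered.append(model.generate(ids, max_new_tokens=128))

# The axis moved if the judge rates `steered` above `baseline` on
# formality, and the gap should grow with strength.
```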

## What ships today

- `extract_control_vector` — raw-SVD over chosen/rejected differences, deterministic orientation (aligned with mean pull).
- `apply_control` — context-managed `forward_pre_hook` with shape + layer validation.
- `refuse_if_policy_safety` — pre-extraction safety gate.
- `ControlVector` dataclass with `n_pairs` + `explained_variance` for audit output.
- Per-store layout: `controls/<name>.safetensors` + `<name>.meta.json`.

## What's deferred

- **CLI surface** (`dlm control extract | apply | list`) — needs a real HF base model to drive the forward-pass residual collection. Land as a follow-up when the `dlm.inference.loader` integration is wired.
- **Multi-control composition** — additive for compatible layers, warn on conflicts. Single-control is the v1 shape.
- **Serialization format** — today the spec says safetensors. Landing safetensors I/O alongside the CLI keeps the two commits paired.
- **Integration with `dlm prompt`** — `--control name:strength[,...]` flag for the existing prompt path.

These are all layer-cake work on top of the extraction + apply primitives shipped here; the math path is the hard-to-get-right piece and it's done.

## Risks

- **Small bases are unstable.** Base models below ~500M parameters tend to collapse into repetition past `|strength| > 1`. `dlm doctor` will warn on bases below that threshold when the CLI lands.
- **Layer choice matters more than strength.** A wrong layer at strength 1 is worse than any strength at the right layer.
- **Control vectors are not a safety mechanism.** They're a steering *tool*. The `policy: safety` refusal is a footgun guard, not a security boundary — anyone who can train LoRAs on the same base can produce the same direction by other means. The safety concern is specifically about documents undoing their own safety training, not about external attackers.