# Control vectors
A **control vector** is a one-shot steering direction extracted from
`::preference::` sections. Unlike a LoRA adapter — which takes hours
of training to learn a preference — a control vector is computed
gradient-free in seconds, stored as a single small tensor, and
applied at inference via a forward-time hook on the residual
stream.
Use it when you want to steer *style* rather than *capability*:
formality vs. casualness, verbosity vs. concision, cautious vs.
direct. Capability work (teaching new facts, fixing bugs in code)
still wants a LoRA. Control vectors are orthogonal — you can stack
them over an already-trained adapter at inference time.
## The shape
Extraction reads `N` preference pairs. For each pair, the base
model is run on the `chosen` and `rejected` completions and
hidden states are captured at a residual-stream layer. The
difference `chosen_i - rejected_i` is a "pull toward chosen"
vector for that example. The first right-singular vector of the
stack of differences is the direction these pulls agree on —
that's the steering vector.
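The description above fits in a few lines of numpy. This is a sketch under the assumption that the extractor works this way internally (the function name here is illustrative, not the shipped API): the direction is the first right-singular vector of the difference matrix, sign-aligned with the mean pull so orientation is deterministic, and the explained variance is the share of squared singular values landing on that first component.

```python
import numpy as np

def extract_direction(hidden_chosen, hidden_rejected):
    """Sketch of SVD-based extraction. Inputs: (N, hidden_dim) arrays
    of residual-stream activations for chosen/rejected completions."""
    diffs = hidden_chosen - hidden_rejected           # per-pair "pull toward chosen"
    _, sv, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]                                 # first right-singular vector, unit norm
    # SVD sign is arbitrary; align with the mean pull so the
    # orientation is deterministic across runs.
    if direction @ diffs.mean(axis=0) < 0:
        direction = -direction
    # Share of total signal energy captured by the first component.
    explained = float(sv[0] ** 2 / np.sum(sv ** 2))
    return direction, explained
```

On synthetic pairs that differ by a fixed offset plus noise, the recovered direction lines up with that offset and the explained variance is high; contradictory pairs drive it toward zero.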
Applied at inference with strength s, the vector is added to
every token's hidden state at that layer during the forward pass:
```
hidden_state[t] += s * control_vector
```
Positive `s` pushes toward the `chosen` distribution; negative
pushes away. Typical range: `[-2, 2]`. Beyond `±3` the model
collapses into repetition.
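Because the update is additive, its effect on the direction itself is exactly linear. A quick numeric check (purely illustrative, with random stand-in vectors): projecting the steered state onto a unit-norm direction moves by exactly one per unit of strength.

```python
import numpy as np

rng = np.random.default_rng(0)
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)   # unit-norm steering direction
hidden = rng.normal(size=8)              # one token's hidden state

def projection(s):
    # hidden_state[t] += s * control_vector, then measure alignment
    return float((hidden + s * direction) @ direction)

# Positive s pushes the state along the direction; negative pushes away.
# Each unit of strength moves the projection by exactly 1.0.
```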
## Workflow
### 1. Write a `::preference::` section
Pairs should isolate the *single dimension* you want to steer.
For formality, vary formality; keep topic and length constant.
```markdown
---
dlm_id: 01KP...
base_model: smollm2-135m
---

::preference#formal::
### Prompt
Explain what a mutex is.

### Chosen
A mutex (mutual exclusion lock) is a synchronization primitive
that ensures only one thread can access a shared resource at a
time. Threads that attempt to acquire a held mutex block until it
is released.

### Rejected
so basically a mutex is like a lock that makes sure two threads
don't trip over each other when they need the same thing. you grab
it, do your thing, let it go.
```
Add ~10-30 pairs for a usable direction. Fewer than 5 and the signal is too noisy; more than 50 and you're past diminishing returns.
### 2. Extract
With hidden states collected from the base model:
```python
import numpy as np
from dlm.control import extract_control_vector, refuse_if_policy_safety

# Validate that no preference section is tagged `policy: safety`.
refuse_if_policy_safety([section.tags for section in preference_sections])

# hidden_chosen, hidden_rejected: each (N, hidden_dim) arrays of
# residual-stream activations at the chosen layer.
vec = extract_control_vector(hidden_chosen, hidden_rejected)

print(f"n_pairs={vec.n_pairs}, explained_variance={vec.explained_variance:.2f}")
# n_pairs=20, explained_variance=0.73
#
# 0.73 = the principal component captures 73% of the total signal
# energy. Above ~0.5 is a coherent direction. Below ~0.3, the
# pairs are probably too noisy or contradictory — add more, or
# tighten the prompt template.
```
### 3. Persist
The per-store layout at `~/.dlm/store/<dlm_id>/controls/`:
```
controls/
  formal.safetensors   # the direction tensor
  formal.meta.json     # {layer_index, source_section_ids, n_pairs, extractor_version}
```
The meta JSON is how `dlm show` audits what produced a given
vector — source sections, layer, pair count, extractor version
(so future API changes can invalidate stale vectors deterministically).
### 4. Apply at inference
```python
from dlm.control import apply_control

with apply_control(model, vec.direction, layer_index=12, strength=1.5):
    out = model.generate(input_ids, max_new_tokens=128)
```
The hook attaches on `__enter__`, removes on `__exit__` — even if
the wrapped block raises. Leaving a hook active would silently
steer unrelated generations, so the context manager pattern is
load-bearing.
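That lifecycle is the load-bearing part, and it can be sketched without any framework. The real `apply_control` wraps a PyTorch hook; everything below is illustrative, using a list of plain layer functions as a stand-in model. The wrapper is installed on entry and restored in a `finally`, so an exception inside the block cannot leave the model steered.

```python
import numpy as np
from contextlib import contextmanager

@contextmanager
def steering_hook(layer_fns, direction, layer_index, strength):
    """Temporarily wrap one layer so its input hidden state gets
    strength * direction added; always restore the original on exit."""
    original = layer_fns[layer_index]
    def steered(hidden):
        return original(hidden + strength * direction)
    layer_fns[layer_index] = steered
    try:
        yield
    finally:
        layer_fns[layer_index] = original   # removed even if the body raised

# Toy "model": a single layer that doubles its input.
layers = [lambda h: h * 2.0]
v = np.array([1.0, 0.0])

with steering_hook(layers, v, layer_index=0, strength=1.5):
    inside = layers[0](np.zeros(2))    # (0 + 1.5 * v) * 2 = [3.0, 0.0]
outside = layers[0](np.zeros(2))       # hook removed: [0.0, 0.0]
```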
## Layer choice
`layer_index` picks which residual stream gets the perturbation.
Rules of thumb (Panickssery et al., 2024):
- **Middle layers** (40–60% depth) are the sweet spot for most
  style dimensions — formality, tone, caution.
- **Early layers** (0–20% depth) steer vocabulary and syntax but
  don't propagate cleanly through downstream composition.
- **Late layers** (80–100% depth) can change a few output tokens
  but leave the underlying reasoning unchanged.
For a 32-layer model, start at `layer_index=16`. Sweep `[8, 16, 24]`
on a held-out prompt if the initial result is weak.
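The same sweep generalizes to any model depth. A hypothetical helper (not part of the shipped API) that probes 25%, 50%, and 75% of the layer stack reproduces those numbers:

```python
def sweep_candidates(n_layers):
    """Early, middle, and late probe layers at 25% / 50% / 75% depth."""
    return [round(n_layers * f) for f in (0.25, 0.50, 0.75)]

print(sweep_candidates(32))   # [8, 16, 24]
```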
## Safety refusal
Preference sections tagged `policy: safety` are **refused at
extraction time**:
```markdown
::preference#safe-refuse::
tags:
  policy: safety
### Prompt
...
### Chosen
<safe refusal>
### Rejected
<unsafe compliance>
```
Extracting a vector from those pairs would produce a "more safety
vs less safety" direction — applied at negative strength, it
erodes the safety training the document is trying to preserve.
`refuse_if_policy_safety` surfaces the refusal before any
artifact reaches disk:
```
ControlPolicyRefusal: refusing to extract a control vector from
preference sections tagged `policy: safety` — the resulting
steering direction could be used at negative strength to undo
the safety training the document is trying to preserve.
```
The refusal is at extract time, not apply time, so the vector never exists. Re-tagging to bypass the check is not supported; the footgun is the shape of the math, not the tag.
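The gate's logic is simple enough to sketch. This assumes section tags arrive as dicts; the shipped `refuse_if_policy_safety` may differ in detail, and the exception is modeled as a plain Python exception here:

```python
class ControlPolicyRefusal(Exception):
    """Raised before extraction when a pair is tagged policy: safety."""

def refuse_if_policy_safety_sketch(tag_dicts):
    """Refuse extraction if any preference section carries policy: safety.
    Runs before any SVD work, so no vector artifact is ever produced."""
    for tags in tag_dicts:
        if tags.get("policy") == "safety":
            raise ControlPolicyRefusal(
                "refusing to extract a control vector from preference "
                "sections tagged `policy: safety`"
            )

refuse_if_policy_safety_sketch([{"policy": "style"}, {}])   # passes silently
```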
## Validation
End-to-end sanity check for a newly-extracted vector:
1. Load the base model.
2. Generate without the vector on 20 held-out prompts.
3. Generate with `strength=1.0` on the same prompts.
4. Judge (LLM-as-judge or manual) whether the axis moved.
For a formality vector, you expect judge-formality-rating to
correlate positively with `strength`. A failed extraction looks
like: outputs identical at ±1, nonsense at ±2. That's the signal
to add more pairs, pick a different layer, or accept that the
dimension you're trying to steer isn't a linear subspace.
## What ships today
- `extract_control_vector` — raw-SVD over chosen/rejected
  differences, deterministic orientation (aligned with mean pull).
- `apply_control` — context-managed `forward_pre_hook` with
  shape + layer validation.
- `refuse_if_policy_safety` — pre-extraction safety gate.
- `ControlVector` dataclass with `n_pairs` + `explained_variance`
  for audit output.
- Per-store layout: `controls/<name>.safetensors` +
  `<name>.meta.json`.
## What's deferred
- **CLI surface** (`dlm control extract | apply | list`) — needs
  a real HF base model to drive the forward-pass residual
  collection. Land as a follow-up when the `dlm.inference.loader`
  integration is wired.
- **Multi-control composition** — additive for compatible layers,
  warn on conflicts. Single-control is the v1 shape.
- **Serialization format** — today the spec says safetensors.
  Landing safetensors I/O alongside the CLI keeps the two
  commits paired.
- **Integration with `dlm prompt`** — `--control name:strength[,...]`
  flag for the existing prompt path.
These are all layer-cake work on top of the extraction + apply primitives shipped here; the math path is the hard-to-get-right piece and it's done.
## Risks
- **Small bases are unstable.** Control vectors on bases below
  ~500M parameters tend to collapse into repetition past
  `|strength| > 1`. `dlm doctor` will warn on bases below that
  threshold when the CLI lands.
- **Layer choice matters more than strength.** A wrong layer at
  strength 1 is worse than any strength at the right layer.
- **Control vectors are not a safety mechanism.** They're a
  steering *tool*. The `policy: safety` refusal is a footgun
  guard, not a security boundary — anyone who can train LoRAs on
  the same base can produce the same direction by other means.
  The safety concern is specifically about documents undoing
  their own safety training, not about external attackers.