# Learned adapter gate
When a `.dlm` declares multiple named adapters, the user traditionally
picks weights by hand: `dlm prompt --adapter tone`, or the
`--adapter-mix tone:0.7,knowledge:0.3` form for weighted merging. The
learned adapter gate (Sprint 34) automates this — a tiny MLP trained
post-SFT routes each prompt to a weighted combination of declared
adapters based on the prompt's content.
Conceptually, this is MoE routing applied to LoRA adapters instead of FFNs.
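In shape, the gate is nothing exotic. A minimal sketch in PyTorch (dlm's actual module is internal; the `AdapterGate` name is illustrative, and the widths follow the frontmatter defaults below):

```python
import torch.nn as nn

class AdapterGate(nn.Module):
    """Illustrative gate: pooled prompt embedding -> per-adapter logits.
    Widths follow the frontmatter defaults (hidden_proj_dim: 64)."""
    def __init__(self, embed_dim: int, n_adapters: int, hidden_proj_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_proj_dim),
            nn.ReLU(),
            nn.Linear(hidden_proj_dim, n_adapters),
        )

    def forward(self, pooled):       # (batch, embed_dim)
        return self.net(pooled)      # logits; softmax -> adapter weights
```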
## When to use it
Enable the gate when:

- You have **≥2 named adapters** in `training.adapters` (the gate has
  nothing to route between with fewer).
- You have **≥4 supervising sections per adapter** (below this the
  gate overfits — the `cold_start_floor` default).
- Different prompts should preferentially touch different adapters
  (a `tone` adapter for casual chat + a `knowledge` adapter for
  factual lookups, etc.).
The gate is **opt-in** — `training.gate.enabled: false` is the default
so existing multi-adapter documents keep working with static
`--adapter-mix` unchanged.
## Frontmatter
```yaml
---
dlm_id: 01K...
dlm_version: 8
base_model: smollm2-135m
training:
  adapters:
    tone: {}
    knowledge: {}
    style: {}
  gate:
    enabled: true
    hidden_proj_dim: 64   # gate MLP internal width
    steps: 200            # training iterations
    lr: 3e-4              # AdamW learning rate
    cold_start_floor: 4   # per-adapter min sections
    entropy_lambda: 0.01  # mode-collapse regularizer
---
```
`entropy_lambda` adds a Shannon-entropy term to the loss so the gate
is penalized for putting all weight on one adapter. Higher values
discourage mode collapse; lower values let the gate commit harder
when the data justifies it.
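Concretely, the objective looks something like the sketch below (illustrative only; `gate_loss` is not a dlm API): routing cross-entropy, minus the entropy of the gate's distribution scaled by `entropy_lambda`.

```python
import torch.nn.functional as F

def gate_loss(logits, labels, entropy_lambda=0.01):
    """Illustrative gate objective: cross-entropy against the routing
    label, minus an entropy bonus that resists mode collapse."""
    ce = F.cross_entropy(logits, labels)
    probs = logits.softmax(dim=-1)
    # Shannon entropy of the routing distribution, averaged over the batch.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    # Low entropy (all weight on one adapter) raises the loss;
    # a larger entropy_lambda pushes harder toward spread-out routing.
    return ce - entropy_lambda * entropy
```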
## Training
The gate trains automatically post-SFT when `enabled: true`. Each
fence-tagged section becomes one supervising sample — its adapter tag
is the routing label:

```
::instruction#tone::     → label = "tone"
::preference#knowledge:: → label = "knowledge"
```
Sections without an adapter tag are dropped from the gate training set — they still train into the SFT adapter but carry no routing signal.
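As a rough illustration of that mapping (the tag syntax is taken from the examples above; this parser is hypothetical, not dlm's):

```python
import re

# Matches fence tags like ::instruction#tone:: and captures the adapter name.
TAG_RE = re.compile(r"::\w+#(\w+)::")

def routing_examples(sections):
    """Yield (section_text, adapter_label) pairs for gate training;
    untagged sections are skipped -- they carry no routing signal."""
    for text in sections:
        m = TAG_RE.search(text)
        if m:
            yield text, m.group(1)
```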
If any adapter has fewer than `cold_start_floor` supervising sections,
the gate trainer logs a warning and writes a **uniform-mode**
`gate_config.json`. Inference defaults to `1/N` weights across all
declared adapters in this case — strictly better than a
poorly-trained gate would be on a tiny corpus.
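A plausible shape for that fallback file, assuming the three adapters declared above (only `mode: "uniform"` is documented; every other field here is a guess):

```json
{
  "mode": "uniform",
  "adapters": ["tone", "knowledge", "style"],
  "weights": [0.3333, 0.3333, 0.3333]
}
```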
## Inference
```bash
# Auto (default): use the gate if one exists
dlm prompt mydoc.dlm "what does DGEMM compute?"

# Bypass the gate — uniform weights
dlm prompt mydoc.dlm "hello" --gate off

# Explicit single-adapter pin — --gate is ignored
dlm prompt mydoc.dlm "hello" --adapter tone
```
The gate forward is ~1ms on MPS for the default shape. Each request:

1. Tokenizes the prompt.
2. Runs the base model with all adapters **disabled**, mean-pools the
   last hidden state.
3. Feeds the embedding through the gate → per-adapter weights.
4. Sets the PEFT weights via `set_adapter_weights`.
5. Generates as usual.
Step 2's extra forward pass is the only overhead vs a hand-picked
`--adapter-mix`; the embedding is computed once per request, not per
token.
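Put together, the per-request flow looks roughly like this (a sketch, not dlm's code: `set_adapter_weights` is the hook named in step 4, while `disable_adapter` and `generate` assume a PEFT-wrapped Hugging Face model):

```python
import torch

@torch.no_grad()
def route_and_generate(model, tokenizer, gate, prompt, adapter_names):
    """Illustrative version of the five steps above."""
    inputs = tokenizer(prompt, return_tensors="pt")        # 1. tokenize
    with model.disable_adapter():                          # 2. base forward, adapters off
        out = model(**inputs, output_hidden_states=True)
    pooled = out.hidden_states[-1].mean(dim=1)             #    mean-pool last hidden state
    weights = gate(pooled).softmax(dim=-1).squeeze(0)      # 3. per-adapter weights
    model.set_adapter_weights(                             # 4. dlm's PEFT hook
        dict(zip(adapter_names, weights.tolist())))
    return model.generate(**inputs)                        # 5. generate as usual
```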
## Export / Ollama
Ollama's Go runtime can't evaluate a torch MLP at inference time. When
you `dlm export` a document with `gate.enabled: true`, dlm falls back
to the **training-set mean gate output** as static `--adapter-mix`
coefficients:

1. Compute the gate's softmax output on every training prompt.
2. Average those distributions → one fixed weight per adapter.
3. Emit the averaged weights in the generated Modelfile.
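A sketch of that averaging step (illustrative; assumes the training prompts are already embedded the same way as at inference):

```python
import torch

@torch.no_grad()
def static_mean_weights(gate, prompt_embeddings, adapter_names):
    """Average the gate's softmax output over all training prompts
    to get one fixed --adapter-mix coefficient per adapter."""
    dists = torch.stack([gate(e).softmax(dim=-1) for e in prompt_embeddings])
    return dict(zip(adapter_names, dists.mean(dim=0).tolist()))
```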
The exported manifest records `gate_mode: "static_mean"` so downstream
tooling can tell a mean-gate export apart from a hand-picked mix.
Dynamic per-prompt routing is available only via `dlm prompt` /
`dlm repl`; the exported GGUF behaves like a statically-merged adapter.

This is lossless vs today's shipped behavior — exported models never
had dynamic routing to begin with. The dynamic benefit remains opt-in
via the PyTorch inference path.
## Observability
Gate routing stats live in the per-store metrics SQLite under the
`gate_events` table:

```sql
SELECT adapter_name, mean_weight, sample_count, mode
FROM gate_events
WHERE run_id = (SELECT MAX(run_id) FROM runs);
```
`dlm show --json` surfaces the same data under `gate.per_adapter` for
scripted workflows.
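For ad-hoc inspection outside the CLI, the table is also queryable with the stdlib `sqlite3` module (the database path is store-specific; the one below is a placeholder):

```python
import sqlite3

con = sqlite3.connect("metrics.sqlite")  # placeholder path; use your store's DB
rows = con.execute(
    """SELECT adapter_name, mean_weight, sample_count, mode
       FROM gate_events
       WHERE run_id = (SELECT MAX(run_id) FROM runs)"""
).fetchall()
for name, weight, n, mode in rows:
    print(f"{name}: mean_weight={weight:.3f} over {n} samples ({mode})")
```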
## Failure modes and mitigations
| Failure | Signal | Mitigation |
|---|---|---|
| Gate trains but collapses to one adapter | Final entropy < floor; one adapter's `mean_weight` ≈ 1.0 | Raise `entropy_lambda`; add more balanced supervising data |
| Cold-start fallback fires | WARN in logs; `gate_config.json` has `mode: "uniform"` | Add more sections per adapter, or accept the uniform default |
| Ollama-exported model diverges from `dlm prompt` | Expected: export uses mean-gate static weights | Document to users; banner on export surfaces `gate_mode` |
| Gate training crashes | `GateTrainingError` logged; SFT adapter is still committed | Non-fatal — subsequent runs retry from the adapter that did commit |
## Related
- [`multi-adapter`](multi-adapter.md) — declaring named adapters
- [`retrain-and-forget`](retrain-and-forget.md) — retention semantics
- CLI reference — `dlm prompt --gate`, `dlm export`