# Learned adapter gate
When a `.dlm` declares multiple named adapters, the user traditionally
picks weights by hand: `dlm prompt --adapter tone`, or the
`--adapter-mix tone:0.7,knowledge:0.3` form for weighted merging. The
learned adapter gate (Sprint 34) automates this — a tiny MLP trained
post-SFT routes each prompt to a weighted combination of declared
adapters based on the prompt's content.
Conceptually, this is MoE routing applied to LoRA adapters instead of FFNs.
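In shape, the gate is nothing exotic. A minimal sketch in PyTorch (dlm's actual module is internal; the `AdapterGate` name is illustrative, and the widths follow the frontmatter defaults below):

```python
import torch.nn as nn

class AdapterGate(nn.Module):
    """Illustrative gate: pooled prompt embedding -> per-adapter logits.
    Widths follow the frontmatter defaults (hidden_proj_dim: 64)."""
    def __init__(self, embed_dim: int, n_adapters: int, hidden_proj_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_proj_dim),
            nn.ReLU(),
            nn.Linear(hidden_proj_dim, n_adapters),
        )

    def forward(self, pooled):       # (batch, embed_dim)
        return self.net(pooled)      # logits; softmax -> adapter weights
```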
## When to use it
Enable the gate when:

- You have **≥2 named adapters** in `training.adapters` (the gate has
  nothing to route between with fewer).
- You have **≥4 supervising sections per adapter** (below this the
  gate overfits — the `cold_start_floor` default).
- Different prompts should preferentially touch different adapters
  (a `tone` adapter for casual chat + a `knowledge` adapter for
  factual lookups, etc.).
The gate is **opt-in** — `training.gate.enabled: false` is the default
so existing multi-adapter documents keep working with static
`--adapter-mix` unchanged.
## Frontmatter
```yaml
---
dlm_id: 01K...
dlm_version: 8
base_model: smollm2-135m
training:
  adapters:
    tone: {}
    knowledge: {}
    style: {}
  gate:
    enabled: true
    hidden_proj_dim: 64   # gate MLP internal width
    steps: 200            # training iterations
    lr: 3e-4              # AdamW learning rate
    cold_start_floor: 4   # per-adapter min sections
    entropy_lambda: 0.01  # mode-collapse regularizer
---
```
`entropy_lambda` adds a Shannon-entropy term to the loss so the gate
is penalized for putting all weight on one adapter. Higher values
discourage mode collapse; lower values let the gate commit harder
when the data justifies it.
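Concretely, the objective looks something like the sketch below (illustrative only; `gate_loss` is not a dlm API): routing cross-entropy, minus the entropy of the gate's distribution scaled by `entropy_lambda`.

```python
import torch.nn.functional as F

def gate_loss(logits, labels, entropy_lambda=0.01):
    """Illustrative gate objective: cross-entropy against the routing
    label, minus an entropy bonus that resists mode collapse."""
    ce = F.cross_entropy(logits, labels)
    probs = logits.softmax(dim=-1)
    # Shannon entropy of the routing distribution, averaged over the batch.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    # Low entropy (all weight on one adapter) raises the loss;
    # a larger entropy_lambda pushes harder toward spread-out routing.
    return ce - entropy_lambda * entropy
```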
## Training
The gate trains automatically post-SFT when `enabled: true`. Each
fence-tagged section becomes one supervising sample — its adapter tag
is the routing label:

```
::instruction#tone::     → label = "tone"
::preference#knowledge:: → label = "knowledge"
```
Sections without an adapter tag are dropped from the gate training set — they still train into the SFT adapter but carry no routing signal.
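As a rough illustration of that mapping (the tag syntax is taken from the examples above; this parser is hypothetical, not dlm's):

```python
import re

# Matches fence tags like ::instruction#tone:: and captures the adapter name.
TAG_RE = re.compile(r"::\w+#(\w+)::")

def routing_examples(sections):
    """Yield (section_text, adapter_label) pairs for gate training;
    untagged sections are skipped -- they carry no routing signal."""
    for text in sections:
        m = TAG_RE.search(text)
        if m:
            yield text, m.group(1)
```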
If any adapter has fewer than `cold_start_floor` supervising sections,
the gate trainer logs a warning and writes a **uniform-mode**
`gate_config.json`. Inference defaults to `1/N` weights across all
declared adapters in this case — strictly better than a
poorly-trained gate would be on a tiny corpus.
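A plausible shape for that fallback file, assuming the three adapters declared above (only `mode: "uniform"` is documented; every other field here is a guess):

```json
{
  "mode": "uniform",
  "adapters": ["tone", "knowledge", "style"],
  "weights": [0.3333, 0.3333, 0.3333]
}
```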
## Inference
```bash
# Auto (default): use the gate if one exists
dlm prompt mydoc.dlm "what does DGEMM compute?"

# Bypass the gate — uniform weights
dlm prompt mydoc.dlm "hello" --gate off

# Explicit single-adapter pin — --gate is ignored
dlm prompt mydoc.dlm "hello" --adapter tone
```
The gate forward is ~1ms on MPS for the default shape. Each request:

1. Tokenizes the prompt.
2. Runs the base model with all adapters **disabled**, mean-pools the
   last hidden state.
3. Feeds the embedding through the gate → per-adapter weights.
4. Sets the PEFT weights via `set_adapter_weights`.
5. Generates as usual.
Step 2's extra forward pass is the only overhead vs a hand-picked
`--adapter-mix`; the embedding is computed once per request, not per
token.
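Put together, the per-request flow looks roughly like this (a sketch, not dlm's code: `set_adapter_weights` is the hook named in step 4, while `disable_adapter` and `generate` assume a PEFT-wrapped Hugging Face model):

```python
import torch

@torch.no_grad()
def route_and_generate(model, tokenizer, gate, prompt, adapter_names):
    """Illustrative version of the five steps above."""
    inputs = tokenizer(prompt, return_tensors="pt")        # 1. tokenize
    with model.disable_adapter():                          # 2. base forward, adapters off
        out = model(**inputs, output_hidden_states=True)
    pooled = out.hidden_states[-1].mean(dim=1)             #    mean-pool last hidden state
    weights = gate(pooled).softmax(dim=-1).squeeze(0)      # 3. per-adapter weights
    model.set_adapter_weights(                             # 4. dlm's PEFT hook
        dict(zip(adapter_names, weights.tolist())))
    return model.generate(**inputs)                        # 5. generate as usual
```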
## Export / Ollama
Ollama's Go runtime can't evaluate a torch MLP at inference time. When
you `dlm export` a document with `gate.enabled: true`, dlm falls back
to the **training-set mean gate output** as static `--adapter-mix`
coefficients:

1. Compute the gate's softmax output on every training prompt.
2. Average those distributions → one fixed weight per adapter.
3. Emit the averaged weights in the generated Modelfile.
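A sketch of that averaging step (illustrative; assumes the training prompts are already embedded the same way as at inference):

```python
import torch

@torch.no_grad()
def static_mean_weights(gate, prompt_embeddings, adapter_names):
    """Average the gate's softmax output over all training prompts
    to get one fixed --adapter-mix coefficient per adapter."""
    dists = torch.stack([gate(e).softmax(dim=-1) for e in prompt_embeddings])
    return dict(zip(adapter_names, dists.mean(dim=0).tolist()))
```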
The exported manifest records `gate_mode: "static_mean"` so downstream
tooling can tell a mean-gate export apart from a hand-picked mix.
Dynamic per-prompt routing is available only via `dlm prompt` /
`dlm repl`; the exported GGUF behaves like a statically-merged adapter.

This is lossless vs today's shipped behavior — exported models never
had dynamic routing to begin with. The dynamic benefit remains opt-in
via the PyTorch inference path.
## Observability
Gate routing stats live in the per-store metrics SQLite under the
`gate_events` table:

```sql
SELECT adapter_name, mean_weight, sample_count, mode
FROM gate_events
WHERE run_id = (SELECT MAX(run_id) FROM runs);
```
`dlm show --json` surfaces the same data under `gate.per_adapter` for
scripted workflows.
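For ad-hoc inspection outside the CLI, the table is also queryable with the stdlib `sqlite3` module (the database path is store-specific; the one below is a placeholder):

```python
import sqlite3

con = sqlite3.connect("metrics.sqlite")  # placeholder path; use your store's DB
rows = con.execute(
    """SELECT adapter_name, mean_weight, sample_count, mode
       FROM gate_events
       WHERE run_id = (SELECT MAX(run_id) FROM runs)"""
).fetchall()
for name, weight, n, mode in rows:
    print(f"{name}: mean_weight={weight:.3f} over {n} samples ({mode})")
```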
## Failure modes and mitigations
| Failure | Signal | Mitigation |
|---|---|---|
| Gate trains but collapses to one adapter | Final entropy < floor; one adapter's `mean_weight` ≈ 1.0 | Raise `entropy_lambda`; add more balanced supervising data |
| Cold-start fallback fires | WARN in logs; `gate_config.json` has `mode: "uniform"` | Add more sections per adapter, or accept the uniform default |
| Ollama-exported model diverges from `dlm prompt` | Expected: export uses mean-gate static weights | Document to users; banner on export surfaces `gate_mode` |
| Gate training crashes | `GateTrainingError` logged; SFT adapter is still committed | Non-fatal — subsequent runs retry from the adapter that did commit |
## Related
- [`multi-adapter`](multi-adapter.md) — declaring named adapters
- [`retrain-and-forget`](retrain-and-forget.md) — retention semantics
- CLI reference — `dlm prompt --gate`, `dlm export`