# DoRA vs LoRA — when to pick which
DoRA (Weight-Decomposed Low-Rank Adaptation) factors each adapted weight into a **magnitude vector** and a **direction component**; the direction is updated with the standard LoRA pair while the magnitudes are trained directly. Papers report a 2-4% quality uplift over vanilla LoRA at matched rank, for a ~10% wall-clock tax. The uplift is most visible on multi-task fine-tunes and less so on narrow-domain SFT.
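A minimal sketch of that decomposition, following the DoRA paper's formulation (conceptual only, not the peft internals):

```python
import torch

# W' = m * (W0 + B @ A) / ||W0 + B @ A||_col
# m is a learnable per-column magnitude vector; B @ A is the usual
# low-rank LoRA update, which steers the direction of each column.
def dora_linear(x: torch.Tensor, W0: torch.Tensor,
                A: torch.Tensor, B: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    V = W0 + B @ A                        # direction with the LoRA update applied
    V = V / V.norm(dim=0, keepdim=True)   # normalize each column to unit length
    return x @ (m * V).T                  # rescale columns by the learned magnitudes
```

At matched rank, the only extra trainable parameters are the magnitude entries, which is where the "more capacity per rank" argument below comes from.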
Flip from LoRA to DoRA by a single frontmatter field:

```yaml
training:
  adapter: dora  # was: lora
```
Every other LoRA knob — `lora_r`, `lora_alpha`, `lora_dropout`, `target_modules` — applies unchanged. The trainer sets `peft.LoraConfig(use_dora=True)` under the hood; requires `peft >= 0.8`.
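For reference, roughly the peft config the trainer ends up building when `adapter: dora` is set (the knob values here are illustrative; the exact wiring lives in `build_lora_config`):

```python
from peft import LoraConfig

# Same config as plain LoRA, plus the DoRA switch.
config = LoraConfig(
    r=16,                                 # lora_r from the frontmatter
    lora_alpha=32,                        # lora_alpha
    lora_dropout=0.05,                    # lora_dropout
    target_modules=["q_proj", "v_proj"],  # target_modules
    use_dora=True,                        # the only change vs. adapter: lora
    task_type="CAUSAL_LM",
)
```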
## When DoRA is worth the 10% tax

- **Multi-task adapters.** DoRA's magnitude component gives more capacity per rank, which helps when one adapter has to juggle unrelated tasks.
- **Small rank budgets.** At `lora_r=4` or `lora_r=8`, DoRA reliably beats LoRA because every parameter counts. At `lora_r=64`+, the gap closes.
- **Long fine-tunes.** The per-step tax compounds, but so does the per-step learning advantage. Over 5k+ steps, DoRA pulls ahead.
## When plain LoRA is the right call

- **Short SFT runs (< 500 steps).** The 10% tax isn't amortized.
- **Narrow-domain Q/A.** A single topic doesn't need DoRA's extra degrees of freedom.
- **Memory-constrained hosts.** DoRA's magnitude vector is tiny but non-zero. On a 13B model at `lora_r=64`, it's a few extra MB that can tip you into OOM (a rough back-of-envelope follows this list).
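A rough back-of-envelope for that magnitude overhead, using assumed 13B-class shapes (illustrative, not measured):

```python
# Assumed 13B-class dimensions; DoRA adds one magnitude entry per weight
# column in each targeted module, so the overhead scales with model width,
# not with lora_r.
hidden = 5120            # model width (assumption)
layers = 40              # decoder layers (assumption)
targets = 7              # q/k/v/o + MLP projections per layer (assumption)
bytes_per_param = 4      # fp32 trainable magnitudes

magnitude_params = hidden * layers * targets
mib = magnitude_params * bytes_per_param / 2**20
print(f"~{mib:.1f} MiB of magnitudes")   # roughly 5-6 MiB, before optimizer state
```

AdamW's two moments on those magnitudes roughly triple that figure, which is what can push a host that is already at the edge over it.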
## Comparing empirically

A twin-`.dlm` methodology works well:

```bash
cp corpus.dlm corpus-lora.dlm
cp corpus.dlm corpus-dora.dlm
# edit corpus-dora.dlm to set adapter: dora

dlm train corpus-lora.dlm
dlm train corpus-dora.dlm

# sway reads the adapter diffs from both stores and compares
sway gate ~/.dlm/store/<dlm-lora>/adapter --against ~/.dlm/store/<dlm-dora>/adapter
```
If DoRA's delta_kl on your held-out prompts doesn't beat LoRA's
by ≥5%, keep LoRA — the 10% wall-clock tax isn't paying off on
your specific domain.
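A tiny helper for applying that bar, with hypothetical numbers (the field name and the direction of `delta_kl` come from your `sway gate` output, not from this sketch):

```python
# Assumes "beats" means a larger delta_kl improvement on the held-out
# prompts; flip the comparison if your gate reports it the other way.
def keep_dora(delta_kl_lora: float, delta_kl_dora: float, bar: float = 0.05) -> bool:
    relative_gain = (delta_kl_dora - delta_kl_lora) / abs(delta_kl_lora)
    return relative_gain >= bar

print(keep_dora(0.118, 0.126))  # hypothetical values -> True (+6.8%)
```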
---

# GaLore — gradient-projected optimizer

GaLore (Gradient Low-Rank Projection) cuts AdamW's optimizer memory by ~40% by maintaining the first and second moments in a rank-`r` subspace. You set it as a drop-in AdamW replacement:

```yaml
training:
  optimizer: galore_adamw  # or galore_adamw_8bit
```
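Downstream, that knob maps onto the Hugging Face GaLore path. Roughly the shape of the resulting training arguments (a sketch under assumptions; the real wiring is in the dlm trainer, and transformers' GaLore optimizers additionally need the `galore-torch` package plus target-module patterns):

```python
from trl import SFTConfig

# Sketch of the downstream arguments; values are illustrative.
args = SFTConfig(
    output_dir="out",
    optim="galore_adamw",                  # or "galore_adamw_8bit"
    optim_target_modules=["attn", "mlp"],  # which layers get the rank-r projection
    bf16=True,
)
```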
## When GaLore is worth picking

- **Memory-constrained training on 7B+ bases.** That's where the paper's ~60% optimizer memory reduction materially helps.
- **Full-parameter fine-tuning.** GaLore shines when AdamW's state is the memory bottleneck. On LoRA-only training the AdamW state is already tiny — GaLore's savings are measured in MB, not GB. The arithmetic sketch below shows why.
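The distinction in round numbers (assumed sizes; fp32 AdamW moments):

```python
# Dense AdamW keeps two fp32 moments per trainable parameter.
full_params = 7e9                    # full-parameter fine-tune of a 7B base
lora_params = 40e6                   # a typical LoRA adapter size (assumption)

adamw_state = lambda p: p * 2 * 4    # bytes: first + second moment, fp32
print(f"full fine-tune: ~{adamw_state(full_params) / 2**30:.0f} GiB")  # ~52 GiB
print(f"LoRA-only:      ~{adamw_state(lora_params) / 2**20:.0f} MiB")  # ~305 MiB
```

Shrinking the first number matters on a memory-constrained host; shrinking the second rarely does.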
## The sub-1B warning

The GaLore paper reports uplift at **≥ 7B base parameters**. Below ~1B the rank-`r` projection can **hurt** optimization quality without giving you the memory win (because AdamW state was already small). The plan reason surfaces this visibly:

```
$ dlm doctor
...
reason: precision=bf16, attn=sdpa, qlora=off, optim=galore_adamw, warn=galore-small-base(135M<1B)
```
The warning is advisory — you can still train. If the memory number matters on your host, GaLore may still be worth it; but if you're picking it for quality, use `adamw_torch` instead.
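A hypothetical sketch of the check behind that warning (the real logic lives in dlm's plan builder; only the <1B threshold comes from this doc):

```python
# Hypothetical helper; mirrors the warn=galore-small-base(135M<1B) string.
def galore_small_base_warning(base_params: int, optimizer: str) -> str | None:
    if optimizer.startswith("galore") and base_params < 1_000_000_000:
        return f"galore-small-base({base_params // 1_000_000}M<1B)"
    return None

print(galore_small_base_warning(135_000_000, "galore_adamw"))  # galore-small-base(135M<1B)
```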
## What ships today

- Schema: `adapter: dora` on both flat and per-adapter `TrainingConfig`.
- Schema: `optimizer: galore_adamw` / `galore_adamw_8bit`.
- `peft.LoraConfig(use_dora=True)` wired through `build_lora_config`.
- `SFTConfig.optim` honors `training.optimizer` (previously ignored the frontmatter field — silent default to `adamw_torch`).
- Plan-reason surfaces `adapter=dora` / `optim=galore_adamw` / `warn=galore-small-base(<1B)` so `dlm doctor` auditing makes the knob choices visible.
## Deferred

- **DoRA + QLoRA combination.** The `adapter` field is a single-value enum (`lora`/`qlora`/`dora`), so the combination is schema-unreachable today — no runtime refusal is needed because Pydantic rejects any attempt before it reaches the doctor. Allowing the combination requires splitting DoRA into a separate `use_dora: bool` field; the bnb≥0.42 compatibility check lands with that change, not before.
- **GaLore rank + update_proj_gap knobs.** The `SFTConfig.optim` path uses transformers' defaults. Surfacing `galore_rank` and `galore_update_proj_gap` as frontmatter fields is a follow-up when someone wants to tune them.
- **Empirical comparison fixture.** A slow-marked twin-train test on the tiny SmolLM2-135M showing DoRA/LoRA parity at that size (where neither technique is expected to differ) lands with the next slow-CI pass.